Apple Patent Reveals a Visual System to Assist Siri in Future Devices to Understand Gestures, Activities & More
In December 2015 Patently Apple posted a report titled "Apple Continues to Build a New Team that is focused on 'Deep Learning' Technology for Autonomous Vehicles." That was the first time we heard of Apple having a 'deep learning team.' While one of the team's projects related to autonomous vehicles, the team is clearly involved in a wide range of work. Back in June 2016, Apple's Craig Federighi, Senior VP of Software Engineering for iOS and macOS, went out of his way to demonstrate advances in Siri's ability to understand deep questions thanks to Siri's new "deep learning" capabilities. Then in October 2016, Google delivered their manifesto on the arrival of the AI revolution.
So it is clear that the shift to deep learning systems using next-generation user interfaces for the Home and Vehicle Automation markets is where the next revolution rests.
While Google tried to plant a mental flag on deep learning being something that they were first to bring to market, I'm not quite sure that will stick. A Forbes article published last year attempted to frame this next revolution as one being led by Google, Facebook, Microsoft and Amazon, leaving Apple out of the equation, which couldn't have been further from the truth. Between Siri, their acquisition of Vocal IQ and their deep learning team in general, Apple is most definitely in the game, if not poised to lead it in the future.
This week a master patent came to light regarding deep learning. A master patent usually covers a new area of technology with broad strokes to provide a grand overview of the subject matter. In the future, new patent filings will break out varying elements of the technology and elaborate on them in more detail, perhaps for specific products Apple has in mind.
Today's report is no doubt long, but that's because it's an overview of a new 'Intelligent System' that could be part of a next-generation HDTV system, an advanced 'Apple TV' box, a Home-specific device or a Mac down the road.
While the system will integrate Siri for voice, this week's patent is really about the brains and eyes of a future AI system. You may say 'Hey Siri' to get the system's attention, or you'll be able to use a gesture to summon Siri or directly instruct the system to do something like answer your phone or start your car. The system will also perform advanced functions like adjusting the audio of a movie or a new tune so that when you walk around a room the audio is adjusted to your new position: lowering or raising the volume, bass or treble as required to ensure the sound is optimal for the user.
Though in the end, the patent is really about giving eyesight to the new intelligent system via 3D cameras from Apple-acquired LinX that could fine-tune the system's understanding of a gesture system tied to the instructions or intent of the user.
Patent Background
Traditional user interfaces for computers and multi-media systems are not ideal for a number of applications and are not sufficiently intuitive for many others. In a professional context, providing stand-up presentations or other types of visual presentations to large audiences is one example where controls are less than ideal and, in the opinion of many users, insufficiently intuitive. In a personal context, gaming control and content viewing/listening are but two of many examples. In the context of an audio/visual presentation, the manipulation of the presentation is generally upon the direction of a presenter that controls an intelligent device (e.g. a computer) through use of remote control devices. Similarly, gaming and content viewing/listening also generally rely upon remote control devices. These devices often suffer from inconsistent and imprecise operation or require the cooperation of another individual, as in the case of a common presentation. Some devices, for example in gaming control, use a fixed-location tracking device (e.g., a trackball or joystick), a hand cover (aka glove), or body-worn/held devices having incorporated motion sensors such as accelerometers. Traditional user interfaces including multiple devices such as keyboards, touch pads/screens, and pointing devices (e.g. mice, joysticks, and rollers) require both logistical allocation and a degree of skill and precision, but can often more accurately reflect a user's expressed or implied desires. The equivalent ability to reflect user desires is more difficult to implement with a remote control system.
When a system has an understanding of its users and the physical environment surrounding the user, the system can better approximate and fulfill user desires, whether expressed literally or impliedly. For example, a system that approximates the scene of the user and monitors the user activity can better infer the user's desires for particular system activities. In addition, a system that understands context can better interpret express communication from the user such as communication conveyed through gestures. As an example, gestures have the potential to overcome the aforementioned drawbacks regarding user interface through conventional remote controls. Gestures have been studied as a promising technology for man-machine communication. Various methods have been proposed to locate and track body parts (e.g., hands and arms) including markers, colors, and gloves. Current gesture recognition systems often fail to distinguish between various portions of the human hand and its fingers. Many easy-to-learn gestures for controlling various systems can be distinguished and utilized based on specific arrangements of fingers. However, current techniques fail to consistently detect the portions of fingers that can be used to differentiate gestures, such as their presence, location and/or orientation by digit.
Computer or Entertainment System for Responding to User Presence and Activity
Apple's invention illustrates the use of an intelligent system that responds to user intent and desires based upon activity that may or may not be expressly directed at the system.
The intelligent system acquires a depth image of a scene surrounding the system. A scene geometry may be extracted from the depth image and elements of the scene, such as walls, furniture, and humans may be evaluated and monitored.
Apple's filing notes that "In certain embodiments, user activity in the scene is monitored and analyzed to infer user desires or intent with respect to the system. For example, if the user is observed leaving the room, the output of the intelligent system may be paused." In context, Apple later identifies some of the hardware that this 'intelligent system' is designed to work with: a television, a set-top box, a multi-media entertainment system and a general-purpose computer system.
In addition, the scene geometry may be a factor in the system's response. As an example, if a user enters a portion of the scene with low acoustic reflectance, the audio volume of the system output may be increased to compensate for the relatively decreased acoustic reflection being experienced by the user.
In other embodiments, the intelligent system may determine that the user is attempting to engage the system to provide express instructions. If express instructions are interpreted, some embodiments contemplate slavishly following the express instruction, while other embodiments contemplate generally following the instructions while compensating based upon scene geometry.
For example, if a user is detected as older in age and that user expressly requests a higher volume, the system may decide that the user actually requires better differentiation of the voice dialog in the system output. Therefore, the system may change the relative spectral distribution of the output (to relatively amplify voice) rather than increase the average volume.
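To make the idea concrete, here's a minimal Swift sketch of that trade-off between raising overall volume and reshaping the spectrum for dialog. The profile flag, the gain figures and the speech-band note are our own illustrative assumptions, not details from Apple's filing.

```swift
// Hypothetical sketch: favor dialog clarity over raw loudness by boosting only
// the speech band instead of the master gain. All values are illustrative.
struct ListenerProfile {
    var prefersDialogClarity: Bool   // e.g. inferred for an older listener
}

struct AudioAdjustment {
    var masterGainDB: Double         // applied across the whole spectrum
    var speechBandGainDB: Double     // applied roughly to the 300 Hz - 3 kHz band
}

func respondToVolumeUpRequest(profile: ListenerProfile) -> AudioAdjustment {
    if profile.prefersDialogClarity {
        // Reshape the spectral distribution: lift dialog, leave the average level alone.
        return AudioAdjustment(masterGainDB: 0.0, speechBandGainDB: 4.0)
    } else {
        // Plain volume increase for everyone else.
        return AudioAdjustment(masterGainDB: 3.0, speechBandGainDB: 0.0)
    }
}

let adjustment = respondToVolumeUpRequest(profile: ListenerProfile(prefersDialogClarity: true))
print(adjustment)  // AudioAdjustment(masterGainDB: 0.0, speechBandGainDB: 4.0)
```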
Apple's invention covers concepts that provide a method to identify fine hand gestures based on real-time three-dimensional (3D) sensor data. The method includes receiving a first depth map of a region of space, the first depth map having a first plurality of values, each value indicative of a distance.
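For readers who want to picture what such a depth map is, a minimal sketch follows: simply a grid of values, each indicative of a distance from the sensor. The type name and the row-major layout are our own assumptions for illustration.

```swift
// A minimal representation of the "first depth map" described in the claim:
// a grid of values, each indicative of a distance. Layout is an assumption.
struct DepthMap {
    let width: Int
    let height: Int
    var distances: [Float]           // row-major, in metres; width * height values

    func distance(x: Int, y: Int) -> Float {
        return distances[y * width + x]
    }
}
```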
Overall, the invention pertains to systems, methods, and computer readable media to improve the operation of user interfaces including scene interpretation, user activity, and gesture recognition. In general, techniques are disclosed for interpreting the intent or desire of one or more users and responding to the perceived user desires, whether express or implied. Many embodiments of the invention employ one or more sensors used to interpret the scene and user activity. Some example sensors may be a depth sensor, an RGB sensor and even ordinary microphones or a camera with accompanying light sensors.
Varying embodiments of the invention may use one or more sensors to detect the user's scene. For example, if the system serves as a living room entertainment system, the scene may be the user's living room as well as adjacent areas that are visible to the sensors.
The scene may also be as small as the space in front of a user's workstation or the interior of a car.
The scene may even be a small area adjacent to a user's smartphone or other portable device (to interpret user desires with respect to that device).
On the flip side, the scene may also be large, for example an auditorium, an outdoor area, a playing field, or even a stadium. In sum, the scene may be any area where there is value in having intelligent systems such as computers or entertainment systems interpret user intent or desire for system activity.
Many embodiments of the invention allow for direct user manipulation of the system either to operate system settings or to control an application of the system such as games, volume, tuning, composing, or any manipulations that a user might expressly desire from the system in use.
The intelligent system may also have device sensors which may include one or more of: depth sensors (such as a depth camera); 3D depth sensor(s); imaging devices (such as a fixed and/or video-capable image capture unit); RGB sensors; proximity sensors; ambient light sensors; accelerometers; gyroscopes; any type of still or video camera; LIDAR devices; SONAR devices; microphones; CCDs (or other image sensors); infrared sensors; thermometers; etc. These and other sensors may work in combination with one or more GPUs, DSPs or conventional microprocessors along with appropriate programming so that the sensor outputs may be properly interpreted and/or combined to contribute to the interpretation of a scene and user activity within the scene.
Scene Geometry
In Apple's patent FIGS. 4a and 4b illustrated below, we're able to see that embodiments contemplate detection of the scene geometry by the system #405 or by devices and equipment in cooperation with the system. One or more sensors may be employed to detect the scene geometry, which generally refers to the structure of the room in two or three dimensions.
For example, one type of scene geometry contemplated for embodiments of the invention involves determining or estimating the location and/or nature of each element in a space visually or acoustically exposed to the system (e.g. television, set-top box, a multi-media entertainment center or computer system). Thus, varying embodiments of the invention may determine or estimate the two- or three-dimensional position of vertical surfaces such as walls; horizontal surfaces #435, such as floors; furniture #410 or other chattel #440; fixtures #445; as well as living things, such as pets or humans #430. By understanding scene geometry, the system may provide a better experience for the user.
Apple notes that "Some embodiments employ one or more sensors to recover scene geometry. One type of sensor that may be used is a depth camera, which may serve as a type of depth sensor (e.g. cameras provided by LinX Imaging)." For the record, Apple acquired LinX two years ago.
At present, some depth cameras provide multiple sensing modalities such as depth detection and infrared reflectance. Depth detection recovers the existence of objects or portions of objects as well as the distance of the objects from the sensor. Thus, for example, referring to FIG. 4a, a depth detection of the scene taken from a sensor in the system should recover data representing the approximate dimension and location of: the three vertical surfaces #401 (walls) in front of the system; the horizontal surface(s) #435 (floor); the person #430; and separate parts of the furniture #410 (because varying parts of the furniture have different depths).
In addition to depth detection, some contemporary depth cameras also detect infrared reflectance. Information regarding infrared reflectance of an object reveals properties of the object including information about color. Information detected regarding infrared reflectance can be used with the depth information to aid in identifying objects in the room or potentially determining how those objects affect the use of the system in the scene (e.g., the effects on a user's perception of light and sound).
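As a rough illustration of how depth data could be turned into scene geometry, the sketch below histograms depth values to estimate a dominant background surface and then flags everything meaningfully closer as a foreground element. The histogram approach, bin size and 0.3 m tolerance are our own assumptions, not taken from the patent.

```swift
// A rough sketch of splitting a depth image into a dominant background surface
// and foreground elements (people, furniture). All thresholds are assumptions.
func estimateBackWallDistance(depths: [Float], binSize: Float = 0.1) -> Float {
    // Histogram the depth values; the most heavily populated bin is a crude
    // proxy for a dominant surface such as the back wall.
    var bins: [Int: Int] = [:]
    for d in depths where d > 0 {
        bins[Int(d / binSize), default: 0] += 1
    }
    let dominantBin = bins.max { $0.value < $1.value }?.key ?? 0
    return Float(dominantBin) * binSize
}

func foregroundMask(depths: [Float], wallDistance: Float, tolerance: Float = 0.3) -> [Bool] {
    // Anything meaningfully closer than the wall is treated as a scene element
    // worth tracking (person, furniture, etc.).
    return depths.map { $0 > 0 && $0 < wallDistance - tolerance }
}
```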
System Response
Under the section identified as 'System Response,' Apple notes that varying embodiments of their invention employ detecting user activity and/or the indicators of that activity in order to shape or alter a system service.
Audio service is one example of shaping or altering a system service in response to user activity. A user's perception of audio varies based upon the user's physical position with respect to the audio source(s). Changes in a user's location within a scene may cause variations in perceived volume or in the perceived balance of spectral components that arrive at the user's ears.
In the case of volume, there is a 6 dB drop in sound intensity (e.g., volume) for every doubling of distance from an audio source. Therefore, a user's distance from the audio source affects the volume of the sound that the user perceives.
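That 6 dB figure follows from the inverse-square law for a point source, where the level change is 20 · log10 of the distance ratio. A tiny Swift helper makes the arithmetic explicit:

```swift
import Foundation

// Level change in dB when a listener moves from oldDistance to newDistance.
// Doubling the distance gives 20 * log10(2) ≈ 6.02 dB of attenuation.
func levelChangeDB(from oldDistance: Double, to newDistance: Double) -> Double {
    return -20.0 * log10(newDistance / oldDistance)
}

print(levelChangeDB(from: 1.0, to: 2.0))   // ≈ -6.02 dB (user moved twice as far away)
print(levelChangeDB(from: 4.0, to: 2.0))   // ≈ +6.02 dB (user moved twice as close)
```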
In addition, even in an open room, only a portion of the sound perceived by a user travels directly from the source to the user's ears. Referring to FIGS. 9a and 9b below, simple examples of these concepts are shown. System #405 may radiate sound that travels both directly #950 and indirectly #925 to the user. Thus, by understanding scene geometry, the system may account for user location in adjusting both the spectral power distribution and the overall intensity of the audio.
In some embodiments, the intelligent system's understanding of scene geometry extends to the acoustic reflectiveness of the various surfaces, and this capability enhances the system's potential to refine a responsive adjustment with respect to both volume and spectral balance.
In some embodiments, the system may automatically adjust volume or sound intensity according to a user's position in the room. If the user moves closer to an audio source, the intensity may be proportionally decreased. If the user moves further away from a source, the intensity may be proportionally increased. The concept similarly applies to situations with multiple audio sources, where each source may be altered accordingly or an overall average intensity (or other aggregating function) may be altered in response to user movements. In some embodiments with only one user in the scene the former arrangement may be preferred (i.e., volume of a source proportional to user distance), while in some embodiments with multiple users, the latter arrangement may be more desirable (i.e., average or aggregated volume proportional to average or aggregated distance).
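A simple sketch of those two strategies might look like this, with one listener's distance driving the gain directly and several listeners driving it through their average distance. The 2 m reference distance is our own illustrative assumption.

```swift
import Foundation

// Distance-compensating gain: one listener drives the gain directly; several
// listeners drive it through their mean distance. Reference distance is assumed.
func compensatingGainDB(listenerDistances: [Double], referenceDistance: Double = 2.0) -> Double {
    guard !listenerDistances.isEmpty else { return 0.0 }
    let effectiveDistance: Double
    if listenerDistances.count == 1 {
        effectiveDistance = listenerDistances[0]
    } else {
        effectiveDistance = listenerDistances.reduce(0, +) / Double(listenerDistances.count)
    }
    // Raise the gain as listeners move away, lower it as they approach.
    return 20.0 * log10(effectiveDistance / referenceDistance)
}

print(compensatingGainDB(listenerDistances: [4.0]))        // ≈ +6.02 dB, one distant listener
print(compensatingGainDB(listenerDistances: [1.0, 3.0]))   // 0 dB, average distance equals reference
```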
Engagement Analysis
Under the section identified as 'Engagement Analysis' Apple turns to patent FIG. 11 noted above. There we're able to see a process for determining whether a user in the scene is engaged with the system (i.e., trying to communicate with the system).
The process of FIG. 11 illustrates the use of an RGB sensor and a depth sensor which are respectively employed for face detection #1120 and head orientation detection #1140. A registration is performed at #1150 to align the face detection results with the head orientation results. The registration provides information so that a more precise facial feature may be accurately paired with a more precise head orientation. This added information may be used during engagement analysis #1160 to more accurately determine if the user is engaging the system.
For example, if the user is speaking as the user's head orientation is moving toward the system, then the intent to engage is more likely. If the user is speaking as the head orientation leans toward another user, the user's intent is more likely conversation. If the engagement analysis determines that the user is not engaged #1170, then the system need not do anything #1180. If the engagement analysis determines that the user is engaged, then the system should respond. For example, if the user is asking to increase the volume, then the system should increase the volume.
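A toy version of that decision logic, fusing face detection, head orientation and speech, could look like the sketch below. The 20-degree yaw threshold and the field names are our own assumptions, not values from the filing.

```swift
// Toy engagement decision in the spirit of FIG. 11: speech only counts as a
// command when a detected face is oriented toward the system.
struct EngagementObservation {
    var faceDetected: Bool
    var headYawTowardSystemDegrees: Double   // 0 = looking straight at the system
    var isSpeaking: Bool
}

func isUserEngaged(_ obs: EngagementObservation, yawThreshold: Double = 20.0) -> Bool {
    guard obs.faceDetected else { return false }
    let facingSystem = abs(obs.headYawTowardSystemDegrees) <= yawThreshold
    // Speaking while turned toward another person is treated as conversation, not a command.
    return facingSystem && obs.isSpeaking
}
```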
Some embodiments of the invention contemplate the use of learning loops #1165, which augment the engagement analysis with information learned from successes and failures in prior activity when responding or not responding. In some embodiments that employ learning loops, the system keeps a record of its engagement analysis and responsive action (or not).
Fine Gesture Detection
Under the section identified as 'Fine Gesture Detection,' Apple turns to patent FIG. 14 where, in one embodiment, 3D region #1400 is sensed by 3D image sensor system #1405; the region includes individual #1410 making a hand gesture and a depth of field #1515.
In Apple's patent FIG. 15 we're able to see that the sensor data of region #1400 captured by the 3D image sensor system may be analyzed slice-by-slice.
As shown, in some embodiments, each slice #1520 has overlap #1525 with an immediately prior slice and a thickness #1530. In general, the slice thickness should be "thick" enough to engulf an individual's hand or other target object but not so thick as to include an excess amount of other structure so as to make hand detection more difficult.
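The slicing itself is easy to picture in code. The sketch below walks a depth range in overlapping slabs and collects the pixels that fall inside each slab; the 0.15 m thickness and 0.05 m overlap defaults are our own illustrative choices.

```swift
// Slice a depth range into overlapping slabs, each thick enough to contain a
// hand but thin enough to exclude most other structure. Defaults are assumed.
struct DepthSlice {
    let near: Float
    let far: Float
}

func makeSlices(minDepth: Float, maxDepth: Float,
                thickness: Float = 0.15, overlap: Float = 0.05) -> [DepthSlice] {
    var slices: [DepthSlice] = []
    var near = minDepth
    while near < maxDepth {
        slices.append(DepthSlice(near: near, far: min(near + thickness, maxDepth)))
        near += thickness - overlap    // each slice overlaps the previous one
    }
    return slices
}

// Pixels whose depth falls inside a slice are kept for hand detection in that slice.
func pixels(in slice: DepthSlice, depths: [Float]) -> [Int] {
    return depths.indices.filter { depths[$0] >= slice.near && depths[$0] < slice.far }
}
```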
Three-dimensional sensor data acquisition in accordance with this disclosure may use any of a variety of optical and/or non-optical sensors. Optical sensors use light to carry depth information and include, for example, laser triangulation sensors and stereo vision systems. Non-optical sensors include acoustic (e.g., ultrasonic and seismic) sensors and electromagnetic (e.g., infrared, ultraviolet, microwave) sensors. These techniques typically measure distances to objects by determining the time required for a pulse of sound or electromagnetic energy to bounce back from an object.
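The round-trip arithmetic is simple enough to show in a couple of lines; the ultrasonic example below is our own illustration rather than anything specific from the filing.

```swift
// Time-of-flight in one line: the pulse travels to the object and back, so the
// one-way distance is half the round-trip time multiplied by the propagation speed.
func timeOfFlightDistance(roundTripSeconds: Double, propagationSpeed: Double) -> Double {
    return propagationSpeed * roundTripSeconds / 2.0
}

// An ultrasonic pulse (≈343 m/s in air) returning after 10 ms is about 1.7 m away.
print(timeOfFlightDistance(roundTripSeconds: 0.010, propagationSpeed: 343.0))  // ≈ 1.715
```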
Apple's patent FIG. 22 noted above illustrates one approach to representing a hand's volume (a depth map); FIG. 24 (right) shows, in block diagram form, a two-stage gesture classifier.
User Communicating with the System
And lastly, as we noted at the top of our report, Apple's Craig Federighi, Senior VP of Software Engineering for iOS and macOS, spoke about Siri's new deep learning capabilities last year. Apple's patent filing published this week also acknowledges Siri as part of this next-generation 'intelligent system.'
Under the section identified as 'User Communication with the System,' Apple notes that a microphone or other audio sensor may be used to determine if the user is attempting to make voice contact with the system. The system may use voice analysis similar to that used in Apple's Siri intelligent assistant in order to understand user utterances and decipher those that are directed at the system.
In some embodiments, the use of voice analysis may be constant so that all user voice is analyzed to determine context and/or to identify attempts to communicate with the system.
The use of voice analysis may be augmented by the use of other sensors. A depth camera, a LIDAR, an RGB sensor, or ordinary camera images may be used to analyze user body movements in order to determine when a user's utterances are most likely directed at the system.
Rather than constantly performing voice analysis on user speech, some embodiments may only analyze a user's speech when a sensor detects that the user appears to be addressing the system (e.g., that the user is engaged).
In yet other embodiments, a particular pre-set physical user action may tell the system that the user is addressing the system, so that voice analysis may be used. For example, the user may: employ a hand gesture like a peace sign; raise a hand as in a grade school class; hold one eye shut for more than a few seconds; or perform any reasonably distinct physical action.
In a similar fashion, the voice analysis may be constantly monitoring the scene but looking for a key word or words to trigger a communication session with a user. For example, the voice analysis may scan for the name Siri, or "hey Siri." Of course, the latter aspect of using "Hey Siri" has already been implemented.
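Pulling those options together, a hedged sketch of how full voice analysis might be gated could look like this. The function and parameter names are our own and are not Apple's APIs.

```swift
import Foundation

// Run the heavy speech pipeline only when the user appears engaged, when a
// trigger gesture was seen, or when a wake phrase is heard. Names are illustrative.
func shouldRunFullVoiceAnalysis(transcriptSoFar: String,
                                userAppearsEngaged: Bool,
                                triggerGestureSeen: Bool,
                                wakePhrases: [String] = ["hey siri", "siri"]) -> Bool {
    if userAppearsEngaged || triggerGestureSeen { return true }
    let lowered = transcriptSoFar.lowercased()
    return wakePhrases.contains { lowered.contains($0) }
}

print(shouldRunFullVoiceAnalysis(transcriptSoFar: "Hey Siri, raise the volume",
                                 userAppearsEngaged: false,
                                 triggerGestureSeen: false))   // true
```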
In any event, once the system recognizes that the user desires to communicate, the communication can take place in any known manner. Differing embodiments of the invention may employ voice recognition or gesture recognition or a combination of both techniques.
Apple's patent application was filed back in Q3 2015. Considering that this is a patent application, the timing of such a product to market is unknown at this time.
Patently Apple presents a detailed summary of patent applications with associated graphics for journalistic news purposes as each such patent application is revealed by the U.S. Patent & Trademark Office. Readers are cautioned that the full text of any patent application should be read in its entirety for full and accurate details.