Apple invents a Proactive & Reactive assisted Transcription app for FaceTime Calls on Apple Hardware including an MR Headset
In Apple's patent background they note that conventional systems do not effectively provide proactive and reactive assistance based on these transcriptions, nor do such systems effectively generate transcriptions based on conversational context or environmental factors. For example, traditional systems do not offer users an efficient means by which to quickly review portions of a transcription based on specific parameters, such as conversational topics, environmental conditions, and the like. Such systems also do not offer user assistance based on a user’s attentive state, such as when the user becomes distracted from the conversation. Thus, an improved system for transcriptions and transcription assistance is desired.
Apple's patent describes techniques for generating transcriptions and providing proactive and reactive assistance with transcriptions.
In general, transcriptions can be helpful to review and summarize information related to conversations or other interactions between parties. Given the increase in conversational communication between devices, and the technological advances on such devices, conversational transcription can now be utilized effectively.
In addition, various technologies may lend themselves to effective transcriptions with respect to an environment, such as an environment associated with extended reality or similar technologies.
Overall, Apple's patent covers systems and processes for transcriptions and transcription assistance. For example, a textual representation of a conversation between a user and at least one conversation participant is obtained. Based on the textual representation, content associated with the conversation is identified, wherein the content includes at least one of a first input from the user and a second input from the at least one conversation participant. In response to a determination that the content is associated with predefined content, a portion of the textual representation is identified based on the content. Based on the identified portion, an output responsive to the at least one of the first input and the second input is provided.
Apple's patent FIGS. 8A/C/E below illustrate a process for transcriptions and transcription assistance.
More specifically, Apple's patent FIG. 8A sets up a conversation between a user and one or more other users. The conversation may correspond to a voice communication (e.g., a telephone call), a FaceTime conference call, a conversation through a social media platform, and most interesting, a conversation in a virtual and/or augmented reality setting.
For example, a user of an iPhone (electronic device #800) may be engaged in a telephone conversation with other users. While the conversation takes place, a textual representation (e.g., a transcription) of the conversation may be obtained.
Another feature of this app may include a prompt that includes various options related to the transcription of the conversation. For instance, the prompt may further provide the participant with the option to anonymize or otherwise modify or eliminate identifying information from the respective participant’s inputs, such that the obtained textual representation includes modified input from the respective participant.
A modified textual representation of the conversation may include various modifications, such as anonymized user names (e.g., “User A: Hello”). The modified textual representation may also omit various items of information, such as personal information (e.g., addresses, phone numbers, account numbers, and the like).
A response to a provided prompt is then received from devices associated with the various participants, including responses that may approve transcription, deny it, or approve a modified version of the transcription for the respective participant.
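The "modified textual representation" described above — anonymized speaker names plus redacted personal details — could look something like the following sketch. The regex patterns, the `[redacted]` label, and the speaker-map interface are assumptions for demonstration, not patterns from the patent.

```python
import re

# Illustrative redaction patterns (assumed): US-style phone numbers and
# long digit runs that might be account numbers.
PHONE_RE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")
ACCOUNT_RE = re.compile(r"\b\d{8,}\b")

def anonymize(lines, speaker_map):
    """Build a modified textual representation: rename speakers, redact details.

    lines: list of (speaker, text) tuples.
    speaker_map: real name -> anonymized label, e.g. {"Alice": "User A"}.
    """
    out = []
    for speaker, text in lines:
        text = PHONE_RE.sub("[redacted]", text)
        text = ACCOUNT_RE.sub("[redacted]", text)
        out.append(f"{speaker_map.get(speaker, speaker)}: {text}")
    return out
```

A production system would need far broader coverage (addresses, names, locale-specific number formats), but the sketch shows the shape of the transformation.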
Initiation of the transcription may occur in various ways. For example, the user may indicate a desire to transcribe the conversation through various configurations or settings prior to the conversation being initiated and the transcription approval prompts being sent to the various users.
The user may also provide an input during an already-established conversation, for example, by activating an affordance (icon) #802 depicted on an active call screen in FIG. 8A. In some examples, the icon may be used to toggle between the active call screen and the textual representation of the conversation (discussed in part via FIG. 8B), for example, when transcription has already been initiated.
In some examples, initiation of the transcription may occur based on various context information. For instance, a transcription of a conversation may be initiated in response to a respective threshold being exceeded, such as a noise threshold (e.g., the user is engaged in a video call within a crowded supermarket).
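A context-triggered start along these lines might be sketched as a simple check against recent ambient-noise samples. The 70 dB threshold, the averaging strategy, and the sampling interface are all assumptions for illustration; the patent does not specify values.

```python
# Hedged sketch: initiate transcription when ambient noise exceeds a
# threshold, as in the patent's crowded-supermarket example.

NOISE_THRESHOLD_DB = 70.0  # assumed value, not from the patent

def should_start_transcription(recent_levels_db):
    """Initiate when the average of recent noise samples exceeds the threshold."""
    if not recent_levels_db:
        return False
    return sum(recent_levels_db) / len(recent_levels_db) > NOISE_THRESHOLD_DB
```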
As another example, the transcription may be initiated in response to the detection of various trigger words or phrases. Specifically, one or more users participating in the conversation may utter a phrase such as “can you repeat that,” “say that again,” “what was that?” and the like. In some examples, the trigger word or phrase may correspond to an explicit request from the user of the electronic device to begin a transcription, such as “Start the transcription now.”
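The trigger phrases quoted above split naturally into implicit cues ("can you repeat that") and an explicit request ("Start the transcription now"). A minimal detector, with the normalization step as an illustrative assumption, could look like:

```python
# Trigger phrases quoted in the patent; real matching would need to be
# fuzzier (ASR variation, phrasing differences) than exact-string lookup.
IMPLICIT_TRIGGERS = {"can you repeat that", "say that again", "what was that"}
EXPLICIT_TRIGGERS = {"start the transcription now"}

def detect_trigger(utterance):
    """Classify an utterance as an explicit or implicit transcription trigger."""
    normalized = utterance.lower().strip(" ?!.,")
    if normalized in EXPLICIT_TRIGGERS:
        return "explicit"
    if normalized in IMPLICIT_TRIGGERS:
        return "implicit"
    return None
```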
Generally, proactive and reactive assistance using the textual representation may be provided to users and may be based on various factors. Referring to FIG. 8B above, in some examples, content associated with the conversation is identified based on the textual representation, wherein the content includes one or more inputs from the iPhone user and/or the other participants of the conversation. Such input may generally trigger reactive assistance from the iPhone (and/or other devices associated with the conversation).
In particular, the input may correspond to speech input, text input, input from activating various affordances (icons), controlling one or more secondary devices, and the like. For example, the user may activate a mute button, share various media items within the conversation, control virtual objects in a virtual setting, etc.
Apple's patent FIGS. 9A-9B illustrate a process for transcriptions and transcription assistance. In a co-presence session (e.g., within an AR/VR environment), various objects or user avatars may move about the user’s viewing perspective, or enter or exit the environment.
The transcription app in a FaceTime call within a Mixed Reality Headset could include added features. For instance, in patent FIG. 9A above, image 900 may correspond to the user’s living room, which is physically located in the city of Atlanta, GA. Weather information corresponding to the current location may also be obtained, such as “sunny + 70 degrees.” This could be illustrated in the Headset imagery of the person you're speaking with.
In Apple's patent FIG. 9B above, an event associated with representation #900 may be detected, such as a third user entering the environment. Accordingly, an updated set of identifiers may be retrieved based on the detected event. For example, a physical user may arrive at the location represented by representation #900, such as by walking through door #902.
Alternatively, a user may enter the virtual session (e.g., using call-in or log-in information), such that an avatar associated with the user is displayed within representation #900.
For more details, review Apple's patent application number WO2022266209.
Among the listed inventors:
- Shiraz Akmal: Apple AI/ML Future Experiences. Mr. Akmal was CEO and Co-Founder of "Spaces Inc," which Apple acquired in the summer of 2020. Spaces was a pioneer in VR videoconferencing. Image below from "Spaces Inc."
- Brad Herman: AI/ML Future Experiences. Herman was CTO and Co-Founder of Spaces Inc.
- Aaron Burns: Software Engineering Manager