AR Live Captions on the HoloLens

Project Overview

Over the course of two design iterations, I created a live captioning tool for Deaf and hard of hearing (DHH) and English as a second language (ESL) users that makes following conversations and identifying context clues easier.

My Role

Sole contributor: Product Designer, UX Designer, UX Researcher, Interaction Designer, Visual Designer, Back-End Developer, Information Architect

Objective

The goal was to create a low-fidelity prototype that takes in audio, converts it to text, and displays the text as an AR object in a HoloLens app. This MVP serves as a proof of concept for future iterations on more advanced devices.

My Tools

Microsoft HoloLens, Visual Studio Code, Unity

Defining the problem

Deaf and hard of hearing (DHH) and English as a second language (ESL) people face difficulties when engaging with hearing people. As technology has progressed, we have found progressively better ways to treat medical conditions, but hearing loss and deafness have lagged behind: hearing aids remain the standard, and while progress has been made on cochlear implants, modern technology has not yet been leveraged to address these issues.

Research

I performed research in the form of interviews, personas, and competitor analysis to better understand the user base and identify user expectations. Through these methods I developed a touchstone that I used to periodically re-center myself and the project on the user. The touchstone is metaphorical: it represents an iteration of the project that addresses only the core issues identified by the users, and it can be updated as new information is gathered.

[Persona cards: Craig Henderson, a 22-year-old Deaf man who uses the product for daily conversations with roommates, professors, and other students, and in the classroom for notes and in-class questions. Jaquline Santos, a 19-year-old ESL woman who sometimes misses context clues when people speak quickly and likes not having to ask people to repeat themselves.]

Personas

I created four personas based on the four major user groups: Deaf, hard of hearing, ESL, and neurotypical hearing. Quotes are paraphrased from interviews with the respective stakeholders.

Interviews

  • Interview participants: Deaf, hard of hearing, ESL, and neurotypical users
  • Questions: project vision, MVP expectations, preferences, and advice
  • Insight: visibility and speed are the primary concerns; accuracy is secondary.

Competitor analysis

  • Competitors: augmented reality captions, wearable captions, auxiliary captions, text bubbles
  • Key features: the medium used, success rate, points of failure, similarity
  • Insights: Captions have been attempted in most mediums. Widely available mediums have significant drawbacks. Most issues result from a lack of sensitivity/directionality of the onboard microphones.

Synthesizing

Why I made this choice

Instead of sending the audio out to a cloud service like Google's, it is processed locally to increase speed and reduce dependency on internet connections. While the onboard mic isn't the best, this might work in my favor: its limited range filters out conversations that don't happen within the immediate vicinity.
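
The transcription code itself isn't shown in this case study, but a minimal sketch of the local-processing approach in Unity C# might look like the following. It assumes the DictationRecognizer from UnityEngine.Windows.Speech (the speech API exposed to HoloLens 1 apps); the class and handler names are illustrative, not the project's actual code.

    using UnityEngine;
    using UnityEngine.Windows.Speech;

    // Sketch: feed onboard-mic audio through the Windows speech stack and
    // surface both interim guesses and finalized phrases as caption text.
    public class CaptionListener : MonoBehaviour
    {
        private DictationRecognizer recognizer;

        void Start()
        {
            recognizer = new DictationRecognizer();

            // Interim hypotheses arrive quickly, keeping captions feeling live.
            recognizer.DictationHypothesis += text =>
                Debug.Log($"hypothesis: {text}");

            // Finalized phrases replace the hypothesis once recognition settles.
            recognizer.DictationResult += (text, confidence) =>
                Debug.Log($"result: {text} ({confidence})");

            recognizer.Start();
        }

        void OnDestroy()
        {
            recognizer?.Dispose();
        }
    }

In a real build, the Debug.Log calls would be replaced by updates to the caption text box.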

The impact of my choice

These decisions resulted in a lower-fidelity product than first planned. They also meant no auxiliary devices had to be incorporated.

Research Insights

  • Speed, not accuracy, is the most important factor in how users rate their willingness to use the product.
  • The product shouldn't be affected by background conversations.
  • Simplify the design; don't use auxiliary devices.

Moving Forward

I focused on processing the audio locally and making use of the onboard mic. The main concern was building up the base product before the interface.

Ideation

Process

Identify themes

  • Identify core features and key performance indicators (KPI)
    - 0.5 second transcription speed
    - Captions must be visible
    - Accuracy
  • Identify user specified features
    - Useful despite background conversation
    - Add to conversational experience
    - Better microphone
    - Translation
    - Scroll through history
  • Identify where others have made mistakes implementing the above
    - A better microphone has been implemented by using auxiliary devices (outsourcing to a phone or an external microphone array), resulting in unwieldy, non-user-friendly devices.
    - Translation requires internet access, which goes against the self-contained requirement.

Discard ideas and features that have failed for others

  • Keep new approaches to failed ideas and features, but set them at lower priority unless heavily requested
    - Cull the better-microphone and translation features

Order in terms of needs vs wants

  • Reference research notes, KPIs, and requirements
    - Needs: speed, visibility, accuracy, add to the experience
    - Wants: useful despite background conversation, scroll through history

Prioritize the most pressing issue

  • What has the greatest effect or needs to be done first?
    - Speed: users specified that caption speed was critical to their willingness to use the product (a rough way to measure this is sketched below).
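
The case study doesn't document how the 0.5 second KPI was verified; one rough, hypothetical way to probe it in Unity C# would be to time how long each phrase takes to settle from its first interim hypothesis to the finalized result. The class and field names below are illustrative assumptions.

    using System.Diagnostics;
    using UnityEngine;
    using UnityEngine.Windows.Speech;

    // Hypothetical probe: measures how long a phrase takes to settle from the
    // first interim hypothesis to the finalized result, as a rough proxy for
    // the 0.5 second transcription-speed KPI.
    public class CaptionLatencyProbe : MonoBehaviour
    {
        private DictationRecognizer recognizer;
        private readonly Stopwatch phraseTimer = new Stopwatch();

        void Start()
        {
            recognizer = new DictationRecognizer();

            recognizer.DictationHypothesis += _ =>
            {
                // The first interim guess for a phrase starts the clock.
                if (!phraseTimer.IsRunning) phraseTimer.Restart();
            };

            recognizer.DictationResult += (text, confidence) =>
            {
                phraseTimer.Stop();
                UnityEngine.Debug.Log($"\"{text}\" settled in {phraseTimer.ElapsedMilliseconds} ms");
                phraseTimer.Reset();
            };

            recognizer.Start();
        }

        void OnDestroy() => recognizer?.Dispose();
    }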

Prototype

The prototyping phase took place over the course of both iterations. During the first iteration I focused on sketches, while the second iteration primarily focused on a low-fidelity concept created in Unity and Visual Studio Code. This process took the majority of the second iteration, nearly three months. The issues encountered included dealing with a deprecated API for the HoloLens 1, an incorrectly functioning HoloLens 1 emulator, computer issues, losing access to the HoloLens 1, and the time needed to learn C#. The issues and delays experienced during this phase affected the rest of the phases significantly.

Testing

  • 3 participants
    - aged 26 - 59
    - two males, one female
  • Participants either gave short answers or indicated their level of agreement on a 1-5 disagree-agree scale.
  • Participants were sent an email with a link to a YouTube video introducing them to the HoloLens 1.

The issues experienced during the prototyping phase restricted the time and resources available for testing. This, along with restrictions imposed by COVID-19, made testing a wearable headset incredibly difficult and further limited testing opportunities. The only option remaining was to use the HoloLens 1 emulator. Unfortunately, this option was also not viable due to software restrictions on the test subjects' machines. The primary issue was that the emulator requires Windows 10 Pro (a costly upgrade from Windows 10 Home) for access to Hyper-V, which is required for rendering virtual environments. The lack of screenshots of the application in use restricted testing further. These restrictions meant that the best testing I could do was a survey about DHH people's experiences with captions and AR, and their preferences and opinions on the concept.

Results

  • Participants indicated a moderate to high level of experience with captions, using them for movies, television, YouTube, and as a foreign-language learning aid.
  • On a 1-5 scale, participants rated the captions they currently use between 3 and 4.
  • Two of the participants were interested in captions for everyday life, saying they would be most useful in meetings, classrooms, and technical and translation settings.
  • Participants verified the need for speedy captions, stating that the speed should keep pace with speech and that discrepancies would be distracting.
  • Participants also verified that perfect accuracy is unnecessary, noting that pursuing it might decrease speed and that users only need enough text to fill in missing information from context.
  • When asked about conversation history, all participants agreed that it was an important feature, especially in learning environments.
  • When asked if they had anything to add, participants mentioned that indicating the direction of unseen speakers would be helpful and that combining holographic captions with translation services would be a great boon.

Final Product

The final product is a low-fidelity prototype. It listens to audio in the immediate vicinity via the onboard mic and transcribes what is said into text. The text is printed at the bottom of a text box, with the history accessible by scrolling upwards. The prototype works on the HoloLens and the HoloLens emulator. The application's scripts were written in Visual Studio Code and imported into Unity, where the visual assets were created and integrated with the scripts. The finished application was built in Unity and then run via Visual Studio. When using the emulator, an auxiliary microphone is needed.
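
The prototype's source isn't reproduced here, but the scroll-back behavior described above could be sketched roughly as follows; the CaptionHistory class, its field names, and the line counts are assumptions for illustration rather than the shipped code.

    using System.Collections.Generic;
    using UnityEngine;

    // Sketch of the caption box behavior: new lines are appended at the
    // bottom, and older lines remain reachable by scrolling upwards.
    public class CaptionHistory : MonoBehaviour
    {
        public TextMesh captionBox;   // text asset assigned in the Unity inspector
        public int visibleLines = 4;  // how many lines are shown at once

        private readonly List<string> history = new List<string>();
        private int scrollOffset;     // 0 = pinned to the newest line

        public void Append(string line)
        {
            history.Add(line);
            scrollOffset = 0;         // snap back to the live captions
            Redraw();
        }

        public void ScrollUp()
        {
            int maxOffset = Mathf.Max(0, history.Count - visibleLines);
            scrollOffset = Mathf.Min(scrollOffset + 1, maxOffset);
            Redraw();
        }

        public void ScrollDown()
        {
            scrollOffset = Mathf.Max(scrollOffset - 1, 0);
            Redraw();
        }

        private void Redraw()
        {
            int end = history.Count - scrollOffset;
            int start = Mathf.Max(0, end - visibleLines);
            captionBox.text = string.Join("\n", history.GetRange(start, end - start));
        }
    }

Wiring the recognizer's DictationResult callback to Append would close the loop between transcription and display.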

A mockup of the proposed UI is shown below.

Reflection

It took 7-8 months to complete both iterations of the project. During this time I learned just how difficult a deprecated API can be to work with and how important it is to maintain documentation of a project.

One challenge was taking on all of the roles myself. The project was time consuming from start to finish, and doing it alone was demanding. However, through perseverance I gained several valuable skills: creating AR apps, using Unity, prepping machines for virtualization, programming in C#, creating virtual assets, and adapting to challenging research situations.

If I could do anything differently, I would have opted for the HoloLens 2 and made do with its emulator. Its documentation is far more complete, its hardware is more advanced, and its online communities are more active.