Google DeepMind User Testing

Google DeepMind partnered with Aira to test its AI models as visual interpreters for blind and low-vision users.

User Research

Project Overview

Client: Google DeepMind
Industry: AI Software
Timeline: 15 weeks (2025)
My Role: User Researcher

As artificial intelligence models become ubiquitous, ensuring they are accessible and reliable for users with disabilities is critical. Google DeepMind partnered with Aira to test the efficacy of its AI models when acting as a "visual interpreter" for blind and low-vision users.

The objective was to move beyond theoretical data and understand how these models perform in real-world scenarios, specifically focusing on the intersection of technical accuracy and human trust.

Skills

Core Methodologies
  • Usability Testing & Validation

  • Ethnographic Observation

  • Contextual Inquiry

  • Qualitative Interviewing

Analysis & Strategy
  • Sentiment Analysis

  • Mental Model Mapping

  • Data Synthesis & Reporting

  • Error Taxonomy (Accuracy/Hallucination tracking)

Domain Knowledge
  • Human-AI Interaction (HAI)

  • Inclusive Design / Accessibility (a11y)

  • Assistive Technology (Screen Readers/Visual Interpreters)

Results

The research I assisted with influenced the development cycle of Google’s AI models. The specific failure points and user concerns we identified enabled the engineering teams to reshape the models to better support a historically marginalized user base.

Key Outcomes:

  • Model Refinement: Detailed reports on model inaccuracies drove specific technical improvements.

  • Trust Calibration: We established a clearer understanding of user and visual-interpreter comfort, helping Google define the boundaries of what the AI should and should not attempt to interpret based on each group's assessment of safety and trust.

Process & Methodology

Over the course of several months, I conducted and observed dozens of intensive user testing sessions to gather both performance metrics and qualitative sentiment data.

  • Performance Tracking (Accuracy vs. Hallucination):

    • I acted as the bridge between user experience and technical performance, observing firsthand the model's reliability and its failures.

    • I meticulously logged instances of accuracy and inaccuracy, building an error taxonomy that helped engineers understand where and why the model failed in real-time contexts (see the sketch after this list).

  • Sentiment & Mental Model Research:

    • Beyond "does it work?", I investigated "how does it feel?" I analyzed user sentiment regarding local vs. cloud-based models, specifically the trade-offs among speed, privacy, and capability.

    • I researched user understanding of the technology, identifying gaps between the model's actual capabilities and what users believe it can do.

  • Strategic Reporting:

    • I synthesized session data into actionable reports for Google DeepMind, highlighting not just technical bugs but also the emotional and practical requirements of blind users relying on AI for independence.
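To make the error-logging workflow above concrete, here is a minimal sketch of how such an accuracy/hallucination taxonomy might be captured in code. Every name in it (ErrorCategory, ErrorEvent, summarize, the category labels) is hypothetical, invented to illustrate the kind of structure the engineering reports relied on, not a tool actually used in the study.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class ErrorCategory(Enum):
    """Hypothetical taxonomy for scoring each model response in a session."""
    ACCURATE = "accurate"                        # description matched the scene
    MINOR_HALLUCINATION = "minor_hallucination"  # plausible but unverified detail
    MAJOR_HALLUCINATION = "major_hallucination"  # confidently wrong description
    OMISSION = "omission"                        # missed a salient object


@dataclass
class ErrorEvent:
    session_id: str
    timestamp: datetime
    category: ErrorCategory
    context: str  # e.g. "low light", "cluttered shelf", "moving traffic"
    notes: str = ""


def summarize(events: list[ErrorEvent]) -> dict[str, int]:
    """Aggregate event counts per category for an engineering report."""
    return dict(Counter(e.category.value for e in events))


# Example: two logged observations from one hypothetical session.
log = [
    ErrorEvent("s01", datetime.now(), ErrorCategory.ACCURATE, "reading a menu"),
    ErrorEvent("s01", datetime.now(), ErrorCategory.MINOR_HALLUCINATION,
               "cluttered shelf", notes="invented a brand name on a label"),
]
print(summarize(log))  # {'accurate': 1, 'minor_hallucination': 1}
```

Keeping the context field free-text while constraining the category to a fixed taxonomy is what lets engineers aggregate failures by type while still reading the situational detail behind each one.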

Opportunities for Improvement and Expansion

The primary goal for improvement is enhancing the safety, generalizability, and quantified real-world performance of the AI model.

  • Develop a Trust Calibration Scale. The current study gathered qualitative sentiment, but a future phase would introduce a quantifiable AI Trust Scale (a minimal analysis sketch follows this list). This scale would allow researchers to track how different model behaviors (e.g., successful identification vs. minor hallucination vs. major error) correlate with a user's willingness to rely on the service for high-stakes tasks (e.g., medical reading vs. reading a menu).

  • Expand Edge Case Scenario Testing. The current study identified accuracies and inaccuracies. A future phase would actively target known failure modes, such as low-light conditions, rapidly moving objects, and highly cluttered visual environments, using predefined test scripts to build a more robust, generalizable performance benchmark.

  • Comparative Analysis of Local vs. Cloud Models. Fully analyze the trade-off between privacy (local processing) and accuracy/latency (cloud processing) through quantitative testing. This would provide Google with hard data on user preference when faced with tangible trade-offs (e.g., "Would you accept 5% lower accuracy for 100% data privacy?").
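To illustrate what analysis against such a Trust Scale could look like, the sketch below pairs each observed model behavior with a hypothetical 1–7 Likert trust rating and computes mean trust per behavior, the raw material of a trust-calibration curve. The behavior labels, severity ordering, and ratings are all invented for illustration; a validated instrument and a much larger sample would be needed in practice.

```python
from statistics import mean

# Hypothetical severity ordering for the behaviors a participant just
# experienced, each paired with their 1-7 trust rating afterward.
SEVERITY = {"success": 0, "minor_hallucination": 1, "major_error": 2}

observations = [
    ("success", 6), ("success", 7), ("minor_hallucination", 5),
    ("minor_hallucination", 4), ("major_error", 2), ("major_error", 1),
]

# Group ratings by behavior, then report mean trust in severity order.
by_behavior: dict[str, list[int]] = {}
for behavior, rating in observations:
    by_behavior.setdefault(behavior, []).append(rating)

for behavior, severity in sorted(SEVERITY.items(), key=lambda kv: kv[1]):
    ratings = by_behavior.get(behavior, [])
    if ratings:
        print(f"{behavior:>20}: mean trust {mean(ratings):.1f} (n={len(ratings)})")
```

Extending each observation with a per-task stakes label (e.g., medical label vs. menu) would let researchers test whether trust degrades faster after errors on high-stakes tasks, which is exactly the correlation the proposed scale is meant to capture.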

Ready to better understand your users?

I'd love to connect with you!


Nina Mindel © 2025