Beyond words: AI goes multimodal to meet you where you are

by Susanna Ray

It’s been raining for days when you’re scrolling the web and come across a picture of a beautiful beach set against turquoise water that sparkles in the sunshine. Where is that, you ask aloud, and how can I get there? 

The answer is immediate. Your AI assistant not only identifies the beach but puts together a whole vacation plan for you. You talk through the details to refine your itinerary, get some tips on coping with the dreary weather in the meantime and start playing a suggested soundtrack to help lift your mood. 

AI experiences are increasingly becoming multimodal, which means they can go beyond simple text prompts — you type a question; the tool answers — by using images, audio and video to see what you see online and hear what you hear. Those capabilities are helping the latest AI tools get a fuller picture of what you’re looking to do, all while giving you more intuitive ways to interact with the technology and get information even more quickly and easily.

Just as human brains absorb information from text, images and audio simultaneously, researchers building multimodal AI have worked to “collapse all these capabilities into one universal model,” says Ryan Volum, who’s guiding the development of AI products at Microsoft. “We’re giving it more and more of the world we see as humans.”

While multimodal AI models are not entirely new, they’re starting to have real-world impact, with tools that help doctors diagnose and treat patients with more precision and help weather agencies predict severe storms more accurately.

Multimodal tools are helping people simplify more mundane matters as well — such as when Volum was recently trying to choose among different health insurance options. 

Instead of having to pore over the dense language of each plan, Volum turned to Copilot Vision, a Microsoft feature that provides real-time assistance to make navigating the web less overwhelming. With his permission, Copilot Vision was able to see everything on the site he was perusing — not just text, but charts and images as well — and summarize it all for him in less time than it would have taken him to wade through the first line.  

It then answered his questions in a natural conversation, bringing in information from other sources to provide context that helped him decide. 

“It was able to meet me in my world” and offer better assistance, Volum says. He likens it to how two people often work together to fly a plane. 

“If your copilot in a plane could only hear what you’re saying but couldn’t see what you’re seeing, they’d be much less helpful,” he says. “But because they’re able to see the clouds in front of you, the dashboard indicators, the telemetry from the plane, that copilot is able to be that much more helpful, and there’s much less work necessary for the user to communicate what they need.”

With multimodal AI, developers have built on the foundation of recent breakthroughs with natural language and extended those capabilities to different inputs. Just as traditional large language models (LLMs) perform text-based tasks by extracting concepts encoded in human language and thought to make logical inferences, solve problems and generate content, multimodal models do the same with other modes of communication such as voice and visuals. 

Models are trained on vast datasets to identify key features in different types of data, such as words and phrases in text, shapes and colors in images, or tones and pitches in audio. They sort these inputs and connect them in a unified way — linking an image of a cat to the typed and spoken word, for example — and then recognize patterns to make connections across modalities.  

Once trained, a model can translate between modes to understand and create content. It can generate an image from someone’s spoken directions, for example, or create audio from a typed request. 
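
To make that idea more concrete, here is a minimal, illustrative sketch in PyTorch of the kind of cross-modal alignment described above. It is not any particular product’s architecture: two small projection layers map image and text features into one shared embedding space, and a contrastive loss pulls matching image-caption pairs together so the model can relate one modality to the other. The feature dimensions and the random inputs standing in for real encoder outputs are placeholders.

```python
# Illustrative sketch of cross-modal alignment, not a production architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMultimodalAligner(nn.Module):
    """Projects image and text features into one shared embedding space."""
    def __init__(self, image_dim=2048, text_dim=768, shared_dim=256):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, shared_dim)  # image features -> shared space
        self.text_proj = nn.Linear(text_dim, shared_dim)    # text features  -> shared space

    def forward(self, image_feats, text_feats):
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        return img @ txt.T  # similarity of every image to every caption

def contrastive_loss(similarity, temperature=0.07):
    # Matching image/caption pairs sit on the diagonal, so alignment can be
    # framed as a classification problem in both directions.
    targets = torch.arange(similarity.size(0))
    logits = similarity / temperature
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy usage: random tensors stand in for the outputs of real vision and text encoders.
image_feats = torch.randn(8, 2048)
text_feats = torch.randn(8, 768)
model = TinyMultimodalAligner()
loss = contrastive_loss(model(image_feats, text_feats))
```

Training on many such pairs is what lets a model link an image of a cat to the typed and spoken word, and then move between modalities at inference time.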

These expanded capabilities are helping clinicians and scientists, in particular, make great strides, says Jonathan Carlson, who leads health and life sciences research at Microsoft Health Futures. 

LLMs are being used during medical appointments to record and sort through conversations with patients — even if the discussion bounces among symptoms and questions — for various follow-up tasks that otherwise take a lot of a physician’s time and attention, such as drafting an after-visit summary and a referral to a specialist that the doctor just has to proof and sign.

And multimodal models are going a step further by applying that reasoning ability to analyze pixels in medical imaging, identifying possible tumors or other abnormalities that might be difficult to find. The AI can be used to support and validate a pathologist’s work and even catch things a human eye might miss, Carlson says, or extrapolate to help diagnose rare diseases that have limited training data. 

“We now have models that understand concepts encoded in images and in language,” Carlson says. “So you can say, ‘Hey, I have a pathology image, show me all of the immune cells, identify any suspicious cancerous cells and let me know if there are any likely biomarkers that can help me choose the appropriate treatment.’ Once you have models that have these rich concepts, it’s actually very simple to align those concepts and basically snap those together and end up with this rich experience where you can now essentially talk to an image.” 

That capability helps guide medical practitioners toward more targeted tests and precise treatments, improving outcomes through earlier diagnoses and saving patients time, discomfort and money by reducing unnecessary procedures. 

Many people will be able to use multimodal capabilities in Edge browsers with Copilot Vision, now available to all Copilot Pro and free Copilot users in the U.S. Each person is in control when it comes to using the new tool: You must click the Copilot Vision icon to start a session, and once you end it, data is deleted.

Businesses and developers can pick from a whole catalog of multimodal models — or get help mixing and matching from the 1,800 options in the Azure AI Foundry — to create more intelligent and interactive commercial tools. 
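
As a rough illustration of what that can look like for a developer, the sketch below sends a question and an image together to a multimodal chat model deployed through Azure AI Foundry, using the azure-ai-inference Python package. The endpoint, key, model name and image URL are placeholders, and the exact setup will depend on the deployment.

```python
# Illustrative only: endpoint, key, model name and image URL are placeholders.
import os
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import (
    SystemMessage, UserMessage, TextContentItem, ImageContentItem, ImageUrl
)
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_AI_ENDPOINT"],  # your deployed model's endpoint
    credential=AzureKeyCredential(os.environ["AZURE_AI_KEY"]),
)

response = client.complete(
    model="gpt-4o",  # placeholder: any multimodal chat model from the catalog
    messages=[
        SystemMessage(content="You are a helpful travel assistant."),
        UserMessage(content=[
            TextContentItem(text="Where is this beach, and how could I get there?"),
            ImageContentItem(image_url=ImageUrl(url="https://example.com/beach.jpg")),
        ]),
    ],
)
print(response.choices[0].message.content)
```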

Mercedes-Benz, for example, created a tool that uses Azure AI Vision and GPT-4 Turbo to see a car’s surroundings and verbally answer questions from the driver, like whether they’re allowed to park on a certain street or what the building is that they’re approaching.

Microsoft’s recently introduced Magma model integrates visual perception with language comprehension to help AI-powered assistants or robots understand surroundings they haven’t been trained on and suggest appropriate actions for new tasks — like grabbing a tool or navigating a website and clicking a button to execute a command. It’s a significant step toward AI agents that can serve as versatile, general-purpose assistants.

And the new Phi-4 multimodal model can process speech, vision and text directly on devices, using less computing power than its predecessors. This smaller, more accessible model allows developers to create efficient applications that excel in mathematical and logical tasks.

Multimodal capabilities in services like Azure AI Content Understanding can help surface meaningful insights from loads of unstructured data such as call center recordings, scanned documents or social media posts.

All that capability comes with new risks and a broader need for education about AI and collaboration in safeguarding it, says Sarah Bird, Microsoft’s chief product officer of Responsible AI.

How people are represented — or misrepresented — is a risk unique to multimodal AI, Bird says, since the way someone looks or sounds can be impersonated with the generative technology. 

And people’s reactions change with the modalities used, she says. For example, violent images are perceived as more severe than violent text; a video is seen as more trustworthy than a written story; and when an AI assistant such as Copilot speaks with an audible voice, errors feel more intentional than when they appear on-screen. 

So safety researchers and engineers at Microsoft have been building on top of the guardrails already in place for generative AI, Bird says.  

As more modalities introduce more risk, inputs like text, images or audio that might be benign on their own can be combined to create harmful content, such as a photo of a famous person paired with text describing them as an animal. That’s why Microsoft is upgrading its safety models to review the sum of the output, rather than just the individual parts, Bird says.

Broad awareness about the risks and how to recognize AI-generated content is also key. Microsoft cryptographically signs all AI-generated content made with its technology so anyone can identify it. Education and training are crucial so that people know to expect these signatures and know what they mean — as is collaboration among technology organizations, such as the C2PA coalition founded by Microsoft and other industry leaders to develop standards for certifying sources.

“There’s a lot we can do technologically and within the platform” to reduce risk, Bird says. “But also, there is new content in the world, and the world needs to adjust their approach to that. Every single person has a role to play in how we assess and defend against multimodal risks.”

Research is moving forward rapidly as developments build upon each other. 

For the first time, in just the last couple of years, Carlson says, researchers have the machinery and multimodal AI assistance allowing them to build a holistic picture of a cell.  

“The next set of things is, how does a model learn how to understand proteins?” he says. “We’ve been working on that a lot, and you can take the same ideas from language modeling and apply it to hundreds, thousands, millions of protein sequences” to engineer antigens for vaccines, for example. 

“It’s about learning the language of nature,” he says. “In the same way that we learn the language of how humans talk, can we learn the language of how the cell expresses itself, or how protein sequences actually work?” 

Being able to use text, speech, images, audio and video to solve all sorts of problems at once opens up a world of new opportunities, Volum says.  

“Increasingly, artificial intelligence will meet us where we are,” he says, “so that it can better understand our needs and more proactively fulfill them.” 

Illustrations by Michał Bednarski / Makeshift Studios. Story published on March 18, 2025