Video fatigue and a late-night host with no audience inspire a new way to help people feel together, remotely

When the global pandemic hit and everyone turned to video calls for work, school and happy hour, Jeremy Bailenson thought he was prepared.

Video conferencing had been around for years, after all, and the Stanford University professor had spent two decades studying and writing about digital communication and behavior. But video calls had always been more of an option than the rule, and Bailenson – along with the rest of the world – quickly found himself shocked by the impact of a complete shift to remote communication.

“After a week of shelter-in-place, I was just flabbergasted by how intense and exhausting it was,” says Bailenson, who lives in California, the first U.S. state that required residents to stay home to reduce the spread of the COVID-19 virus. “Most video conference studies are about how to improve productivity and collaboration, but the notion of it being draining hasn’t been studied.”

While Bailenson began re-reading “everything there was to read about video conferencing,” his friend at Microsoft, Jaron Lanier, was pondering a different angle to the problem. A late-night talk-show host in New York whose band Lanier occasionally played in was struggling to perform his monologue to a camera in his living room, without a live audience to react to his jokes. Lanier cast a net into Microsoft’s sea of researchers, psychologists and programmers, and within weeks he had pulled together what he calls a “magical” new feature to help the TV host and his viewers feel connected. His idea evolved into a Teams feature, Together mode, that potentially could reduce the fatigue of video calls for everyone.

Jeremy Bailenson, a Stanford University professor, spent two decades researching digital communication and behavior, but he was still surprised by how fatiguing it was to shift completely to remote work and video calls when the global pandemic hit this year. (Photo provided by Bailenson.)

“It was a fortuitous coincidence of needs” that led to a dramatic leap in improving remote meetings, says Lanier, a computer scientist, musician, artist and author who coined the term “virtual reality” and is considered a pioneer in the field.

Together mode, now rolling out in Microsoft Teams, combines decades of research and product development to place all the participants on a video call in a single virtual space, such as an auditorium, meeting room or coffee bar, so they look like they’re in the same place. The new feature ditches the traditional grid of boxes, creating an environment that users say has a profound impact on the feel of the video conference and provides more cohesion to the group.

Together mode is built to give people the impression that everyone is looking at the entire group in a big virtual mirror, which Lanier says was the unique yet simple solution that changes the whole experience. People’s brains are used to being aware of others based on their locations, and the mirror effect makes it harder for the brain to notice eye contact irregularities. Those are some of the qualities that make it easier for everyone to tell how they are responding to each other.

“We’re social creatures, and the social and spatial awareness systems in the brain can finally function more naturally” within Together mode, Lanier says.

Scientists began studying problems with eye contact – or gaze misalignment – in earnest in the 1960s, and Lanier has been working to improve that element of video conferencing since the analog days of the 1970s. Yet while the technology has grown more robust and stable over the decades, there had been no real improvements to the human experience that were viable for widespread use. Together mode uses cloud computing instead of the specialized cameras and screens that used to be needed to make video calls better.

To understand video-call fatigue, Bailenson, the founding director of Stanford’s Virtual Human Interaction Lab, combed through decades of studies on communication and found a few key causes.

For example, he says, if someone’s face looms large in your visual sphere in real life, it generally means you’re either about to fight or mate. So you’re alert and hyper-aware – reactions that are automatic and subconscious – and your heart rate goes up. And in video calls, there’s often a grid with multiple people’s faces filling the boxes. It’s a lot for your body’s nervous system to handle, he says.

In addition, people are constantly interpreting others’ eye movements, posture, how their heads are tilted and more, and attributing meaning to those non-verbal cues. Researchers in the 1960s watched videotapes of groups frame by frame, Bailenson says, and discovered a complex, intricate dance: One person would turn their head and the other would lean back a little, for example.

When Microsoft software engineer Henrik Turbell heard of Jaron Lanier’s challenge, he turned for inspiration to a “just for fun” prototype he developed three years ago, when he put multiple versions of his six-year-old daughter into a single-background video stream. (Video provided by Turbell.)

But on a video call, those movements aren’t diagnostic, he says, meaning they’re not accurate information about what’s going on. One person might look at another for a response, but since everyone is organized differently on each participant’s screen in a grid view, it’s not clear to anyone else whom they’re actually looking at.

“It’s a Catch-22 where you’re getting smothered with non-verbal data, but none of that data is diagnostic,” Bailenson says. “Together mode puts the truth back in the gesture. When head movements have actual meaning, aligned with the intention of the people, things become less confusing, and that reduces fatigue because you’re no longer bewildered by what’s going on.”

Mary Czerwinski, a cognitive psychologist at Microsoft, says non-verbal social cues are so automatic that audience members can even synchronize their breathing to the speaker’s.

“There are all kinds of subtle cues – head nods, facial cues, body language – that we use to show that we have an issue, or we want to speak, or we agree or don’t agree,” Czerwinski says.

Using Together mode, she says, “I’ve seen people lean over and tap each other. I’ve seen people make eye contact with each other who weren’t sitting near each other. So people can now practice some of the social signaling they would do in real life.”

The Together mode view is the same for everyone in the meeting and doesn’t change, unlike grid views that show participants’ videos in different locations on each person’s screen and that move the boxes around during the call based on who’s speaking. Since a whole area of the brain is devoted to spatial memory, Together mode’s consistency is a “huge” way to reduce the cognitive load of a video call, Czerwinski says.
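The contrast Czerwinski describes can be illustrated with a small sketch. This is not how Teams is implemented; the function names, the rule that a viewer's own tile goes last, and the rule that the active speaker is promoted to the first tile are all assumptions made for illustration. The point is only that a per-viewer grid depends on who is watching and who is talking, while a shared seat map depends on neither.

```python
from typing import Dict, List, Optional, Tuple


def grid_layout(viewer: str, join_order: List[str],
                active_speaker: Optional[str]) -> List[str]:
    """Hypothetical per-viewer grid: tile order changes with viewer and speaker."""
    # Assumption: each viewer sees their own tile last.
    others = [p for p in join_order if p != viewer]
    # Assumption: the active speaker is promoted to the first tile.
    if active_speaker in others:
        others.remove(active_speaker)
        others.insert(0, active_speaker)
    return others + [viewer]


def shared_seating(join_order: List[str],
                   seats_per_row: int) -> Dict[str, Tuple[int, int]]:
    """Hypothetical shared scene: one (row, column) seat per person, same for all."""
    # The seat depends only on join order, never on the viewer or the speaker.
    return {name: divmod(i, seats_per_row) for i, name in enumerate(join_order)}


people = ["ana", "ben", "cy", "dee"]
print(grid_layout("ana", people, "dee"))  # what ana sees while dee speaks
print(grid_layout("ben", people, "cy"))   # what ben sees while cy speaks
print(shared_seating(people, 2))          # what everyone sees, all the time
```

In the grid version, every change of speaker reshuffles the tiles and no two people see the same arrangement. In the shared version, a person's seat stays put for the whole call, which is the kind of consistency spatial memory can rely on.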

Kori Inkpen has worked on how technology can support collaboration – by providing a feeling of being together – since the early 1990s, when she spent a summer as a graduate student watching kids play video games at a science museum. She researches AI-human collaboration for Microsoft now but returned to her first passion of video conferencing when Lanier asked for help.

“We are always trying to envision the future and work on things long before people think they might need them, and often down the road there will be a need for it in our products and we can pull it off the shelf and say, ‘Hey, we did this five years ago, is it helpful now?’” Inkpen says. “There was always pushback over the years for doing anything virtually, and we got criticized by people who said, ‘Why would you want your kids to play virtually?’ But the idea was that we could build tools so kids could play together in a natural way even when they couldn’t be together. No one ever envisioned a pandemic that would force everyone to isolate from each other.”

Gathering in person is undeniably more enjoyable than a video environment, Inkpen says, but Together mode creates the perception of shared space to offer “a feeling of togetherness that’s really compelling.” The new feature reminds Inkpen of a study she did a decade ago, where kids could see themselves with friends on video and told her they felt like they were all playing together in the TV. It helped them behave more naturally, she recalls, because their brains didn’t have to map where things were or reverse the images to hold toys in the right place for the camera, for example.

“When you work on collaborative technology, it’s easy to think that if we just build a really cool tool, people will work together like a hyper-efficient factory,” says Jeff Teper, the visionary behind Microsoft Teams, SharePoint and OneDrive. “But humans are social beings who connect emotionally using body language and verbal cues to build feelings of trust, and part of what makes a team is a shared purpose and sense of trust. Together mode is rooted in human psychology and sociology.”

The push in recent years by Microsoft Chief Executive Officer Satya Nadella to foster collaboration and brainstorming among different groups was key for the new feature, Teper says, allowing the loosely formed team of experts with vastly different backgrounds to go into overdrive in response to the urgency of the new need.

“We have so much cognitive technology for vision and speech, and the hardest part is how to leverage it to solve human problems and bring human value, beyond just being cool,” says Lan Ye, who leads the Teams calling, meeting and devices group. “But here we had these human connection problems created with this new mode of working, so we saw that and went 120 miles an hour on this one to build it up.”

The new feature and the speed at which it came together are examples of how research can pay dividends down the road.

Software engineers David Zhao, Henrik Turbell and Walid Boumerdassi built a prototype of Together mode in just one weekend, relying heavily on the work they’d done two years ago for a Microsoft Hackathon project. That design originated with Inkpen’s team and essentially removed a person from their video surroundings and superimposed them into another’s background. Boumerdassi, who is from France but lives in Seattle, remembers the fun of video calling his family back home and seeing everyone together on one screen – no squares – with the Eiffel Tower in the background.

Together mode builds on work that began on Turbell’s first day with Microsoft in Stockholm seven years ago, when he flew to London to meet with researcher Jamie Shotton’s team at Microsoft’s Cambridge, UK, lab about the future of video segmentation. That’s a method of dividing up pieces of video – such as the foreground and the background – that can be used to create a more shared experience than putting people in a grid.
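To make the idea of segmentation concrete, here is a minimal sketch of the compositing step that follows it, under simplifying assumptions: frames are tiny grayscale grids, and the mask (1.0 for the person, 0.0 for their surroundings) is simply given rather than computed by a model. The function name `composite_onto` and the data layout are hypothetical and do not come from Microsoft's code.

```python
from typing import List

Frame = List[List[int]]    # grayscale pixels, 0-255
Mask = List[List[float]]   # 1.0 = person (foreground), 0.0 = surroundings


def composite_onto(scene: Frame, frame: Frame, mask: Mask,
                   top: int, left: int) -> Frame:
    """Blend the masked person from `frame` into a copy of `scene` at (top, left)."""
    result = [row[:] for row in scene]  # leave the original scene untouched
    for y, (frame_row, mask_row) in enumerate(zip(frame, mask)):
        for x, (pixel, alpha) in enumerate(zip(frame_row, mask_row)):
            sy, sx = top + y, left + x
            # Skip anything that would land outside the shared scene.
            if 0 <= sy < len(result) and 0 <= sx < len(result[sy]):
                # Alpha blend: foreground where the mask is 1, scene where it is 0.
                result[sy][sx] = round(alpha * pixel + (1.0 - alpha) * result[sy][sx])
    return result


# Two participants cut out of their own rooms and placed in one shared scene.
scene = [[0, 0, 0, 0], [0, 0, 0, 0]]
person_a, mask_a = [[100, 100], [100, 100]], [[1.0, 0.0], [1.0, 1.0]]
person_b, mask_b = [[200, 200], [200, 200]], [[0.0, 1.0], [1.0, 1.0]]

scene = composite_onto(scene, person_a, mask_a, top=0, left=0)
scene = composite_onto(scene, person_b, mask_b, top=0, left=2)
print(scene)
```

Because each person is blended into the same scene rather than kept in a separate box, the cutouts can sit side by side or even overlap, which a grid of rectangles does not allow.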

In Together mode, participants can find themselves in overlapping spaces and even “touch” the people around them. The absence of barriers creates greater social awareness and a sense of a shared journey.

That unique change promptly won over the new feature’s developers as they tested it from home.

Video fatigue had set in quickly for Boumerdassi when he started working from his apartment in Seattle, rather than at Microsoft headquarters in nearby Redmond, Washington. He began having audio-only meetings instead, but he didn’t like how much that limited communication.

Microsoft software engineer David Zhao enlisted his daughter for help testing a 2018 Hackathon project that used segmentation to separate participants from their surroundings in a video call and place them together – in this case, putting her into his home office with him. (Screenshot provided by Zhao.)

When he started testing Together mode, though, he noticed an immediate shift in conversations. They flowed more naturally. People didn’t hog the time, because they began picking up on body language and could tell when others wanted to speak. Boumerdassi found he no longer automatically watched himself in the video as he often did with the grid view, nervously wondering who else might be looking at him. Instead, he forgot he was even in the video and focused on the people surrounding him, which meant he was less distracted and picked up on more in the meetings.

“As engineers, we had it working, but we didn’t know what the impact was,” Boumerdassi says. “But Jaron understood the potential, he was the first one to put it into words, and his view of this convinced everyone to pursue this as a feature. It’s pretty magical, and that’s why we’re all excited about it.”

Zhao, who started his career in 2007 as Skype’s second video developer and built the group call feature for the company, called the Together mode experience a “breakthrough” for video conferencing.

“This is really just the beginning,” agrees Ye. “We have a lot of ideas that we want to build on this scaffolding that will enable us to really change how meetings are today.”

Together mode isn’t for every situation. It’s so natural and creates such a shared presence that if people are multitasking and looking down at their desks, others might think they’re looking at the person below them, Bailenson jokes.

But bringing social awareness to remote gatherings the way Together mode does “will have a dramatic effect in terms of increasing social cohesion, respect and trust,” Czerwinski says. “The better we do this, the more we’ll understand and appreciate each other.

“So this is a huge thing for society. And who knows how long we’ll be in this pandemic situation.”

Top image: Together mode, shown here with an auditorium background, drew from work by a loosely formed team of Microsoft experts with vastly different backgrounds, including (left to right, top to bottom) Kori Inkpen, Henrik Turbell, Walid Boumerdassi, Jeff Teper, Mary Czerwinski, David Zhao, Jaron Lanier, Lan Ye. Photo illustration by Microsoft.