Microsoft
Source
    Category: AI
    October 14, 2020

    What’s that? Microsoft’s latest breakthrough, now in Azure AI, describes images as well as people do

    By John Roach

    Microsoft researchers have built an artificial intelligence system that can generate captions for images that are, in many cases, more accurate than the descriptions people write. The breakthrough in a benchmark challenge is a milestone in Microsoft’s push to make its products and services inclusive and accessible to all users.

    “Image captioning is one of the core computer vision capabilities that can enable a broad range of services,” said Xuedong Huang, a Microsoft technical fellow and the chief technology officer of Azure AI Cognitive Services in Redmond, Washington.

    The new model is now available to customers via the Azure Cognitive Services Computer Vision offering, which is part of Azure AI, enabling developers to use this capability to improve accessibility in their own services. It also is being incorporated into Seeing AI and will start rolling out later this year in Microsoft Word and Outlook, for Windows and Mac, and PowerPoint for Windows, Mac and web.
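For developers, a call to the Computer Vision service might look roughly like the sketch below. It assumes the REST "describe" operation, a `v3.1` version segment in the path, and a response shaped like `{"description": {"captions": [{"text": ..., "confidence": ...}]}}`; the endpoint, key, and image URL are placeholders, and the helper names are illustrative rather than part of any Microsoft SDK.

```python
import json
import urllib.request

# Assumed REST path for the Computer Vision "describe" operation; the
# version segment depends on your resource and may differ.
DESCRIBE_PATH = "/vision/v3.1/describe"


def build_describe_request(endpoint, key, image_url, max_candidates=1):
    """Build (but do not send) a POST request asking for image captions."""
    url = f"{endpoint.rstrip('/')}{DESCRIBE_PATH}?maxCandidates={max_candidates}"
    body = json.dumps({"url": image_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def best_caption(response_json):
    """Return (text, confidence) of the highest-confidence caption, or None."""
    captions = response_json.get("description", {}).get("captions", [])
    if not captions:
        return None
    top = max(captions, key=lambda c: c.get("confidence", 0.0))
    return top["text"], top.get("confidence")


def describe_image(endpoint, key, image_url):
    """Send the request and return the best caption (needs network access)."""
    request = build_describe_request(endpoint, key, image_url)
    with urllib.request.urlopen(request) as response:
        return best_caption(json.load(response))
```

Keeping request construction and response parsing separate from the network call makes the sketch easy to check offline; only `describe_image` needs a live resource and a valid key.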

    Automatic image captioning helps all users access the important content in any image, from a photo returned as a search result to an image included in a presentation. A research breakthrough like this one can improve those results, although it doesn’t mean the system will return perfect results each time.

    The use of image captioning to generate a photo description, known as alt text, in a web page or document is especially important for people who are blind or have low vision, noted Saqib Shaikh, a software engineering manager with Microsoft’s AI platform group in Redmond.

    For example, his team is using the improved image captioning capability in the Seeing AI talking camera app for people who are blind or have low vision. The app uses image captioning to describe photos, including those from social media apps.

    “Ideally, everyone would include alt text for all images in documents, on the web, in social media – as this enables people who are blind to access the content and participate in the conversation. But, alas, people don’t,” Shaikh said. “So, there are several apps that use image captioning as a way to fill in alt text when it’s missing.”
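The fill-in idea Shaikh describes can be sketched in a few lines. The example below is only an illustration, not how Seeing AI or any Microsoft product works: it scans an HTML fragment for `<img>` tags with no `alt` attribute and asks a caller-supplied `captioner` function (a hypothetical stand-in for any captioning service) to propose text for each one.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the src of every <img> that has no alt attribute at all."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # alt="" is left alone: an empty alt conventionally marks a
        # decorative image that screen readers should skip.
        if "alt" not in attributes and attributes.get("src"):
            self.missing.append(attributes["src"])


def suggest_alt_text(html, captioner):
    """Map each image src lacking alt text to a machine-generated caption."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return {src: captioner(src) for src in finder.missing}
```

The function returns suggestions instead of rewriting the page, so a person can review a generated caption before it is published as alt text.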

    Lijuan Wang, a principal research manager in Microsoft’s research lab in Redmond, Washington, led the research team that achieved – and beat – human parity on the novel object captioning at scale, or nocaps, benchmark. Photo by Dan DeLong.
    Xuedong Huang, a Microsoft technical fellow and chief technology officer of Azure AI Cognitive Services, said that reaching human parity on image captioning continues a theme of human parity achievement across cognitive AI systems at Microsoft. Photo by Scott Eklund/Red Box Pictures.
    Saqib Shaikh, a software engineering manager with Microsoft’s AI platform group, says the use of image captioning to generate a photo description, known as alt text, in a web page or document is especially important for people who are blind or have low vision. Photo by John Brecher.

    Novel object captioning

    Image captioning is a core challenge in the discipline of computer vision, one that requires an AI system to understand and describe the salient content, or action, in an image, explained Lijuan Wang, a principal research manager in Microsoft’s research lab in Redmond.

    “You really need to understand what is going on, you need to know the relationship between objects and actions and you need to summarize and describe it in a natural language sentence,” she said.

    Wang led the research team that achieved – and beat – human parity on the novel object captioning at scale, or nocaps, benchmark. The benchmark evaluates AI systems on how well they generate captions for objects in images that are not in the dataset used to train them.

    Image captioning systems are typically trained with datasets that contain images paired with sentences that describe the images, essentially a dataset of captioned images.

    “The nocaps challenge is really how are you able to describe those novel objects that you haven’t seen in your training data?” Wang said.
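One simplified way to make “novel” concrete is to compare the object classes in a test image with the classes that appear in the captioned training data. The sketch below does that with plain sets. The in-domain, near-domain and out-of-domain labels follow terminology used with nocaps, but the exact rule and the class names here are assumptions for illustration, not the benchmark’s official tooling.

```python
def caption_domain(image_classes, training_classes):
    """Label an image by how many of its object classes were seen in training.

    image_classes: set of object class names present in the image.
    training_classes: set of class names covered by the captioned training data.
    """
    if not image_classes:
        raise ValueError("image_classes must contain at least one class")
    seen = image_classes & training_classes
    novel = image_classes - training_classes
    if not novel:
        return "in-domain"       # every object was seen during training
    if not seen:
        return "out-of-domain"   # every object is novel
    return "near-domain"         # a mix of seen and novel objects


# Hypothetical class lists, invented for this example.
training = {"person", "dog", "skateboard"}
print(caption_domain({"person", "dog"}, training))
print(caption_domain({"person", "microscope"}, training))
print(caption_domain({"microscope"}, training))
```

Images in the last group are the hardest case Wang describes: the model has to name objects it never saw paired with a caption.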

    To meet the challenge, the Microsoft team pre-trained a large AI model with a rich dataset of images paired with word tags, with each tag mapped to a specific object in an image.

    Datasets of images with word tags instead of full captions are more efficient to create, which allowed Wang’s team to feed lots of data into their model. The approach imbued the model with what the team calls a visual vocabulary.

    The visual vocabulary pre-training approach, Huang explained, is similar to prepping children to read by first using a picture book that associates individual words with images, such as a picture of an apple with the word “apple” beneath it and a picture of a cat with the word “cat” beneath it.

    “This visual vocabulary pre-training essentially is the education needed to train the system; we are trying to educate this motor memory,” Huang said.

    The pre-trained model is then fine-tuned for captioning on the dataset of captioned images. In this stage of training, the model learns how to compose a sentence. When presented with an image containing novel objects, the AI system leverages the visual vocabulary to generate an accurate caption.

    “It combines what is learned in both the pre-training and the fine-tuning to handle novel objects in the testing,” Wang said.

    When evaluated on nocaps, the AI system created captions that were more descriptive and accurate than the captions for the same images that were written by people, according to results presented in a research paper.

    Speedy ship to production

    The new image captioning system is also two times better than the image captioning model that’s been used in Microsoft products and services since 2015, according to a comparison on another industry benchmark.

    Given the benefit of improved image captioning to all users of Microsoft products and services, Huang accelerated the integration of the new model into production on Azure.

    “We’re taking this AI breakthrough to Azure as a platform to serve a broader set of customers,” he said. “It is not just a breakthrough on the research; the time it took to turn that breakthrough into production on Azure is also a breakthrough.”

    Reaching human parity on image captioning, he added, continues a theme of human parity achievement across cognitive AI systems at Microsoft.

    “In the last five years,” Huang said, “we have achieved five major human parities: in speech recognition, in machine translation, in conversational question answering, in machine reading comprehension, and in 2020, in spite of COVID-19, we got the image captioning human parity.”

    Top image: Legacy: A man riding a skateboard up the side of a building. New: A baseball player catching a ball. Photo courtesy of Getty Images. 

    Related:

    • Visit Azure Cognitive Services to learn more about the Computer Vision offering
    • Read: Novel object captioning surpasses human performance on benchmarks
    • Read: Apps can now narrate what they see in the world as well as people do
    • Read: Barriers fall as Microsoft’s speech and language technologies exit the lab
    • Read: Microsoft reaches a historic milestone, using AI to match human performance in translating news from Chinese to English
    • Read: Microsoft researchers achieve new conversational speech recognition milestone
    • Read: Microsoft creates AI that can read a document and answer questions about it as well as a person

    John Roach writes about Microsoft research and innovation. Follow him on Twitter.

    Check out these additional images and captions comparing results from the legacy and new AI system.

    1 of 7. Legacy: A person sitting at a table using a laptop. New: A person using a microscope. Photo courtesy of Getty Images.
    2 of 7. Legacy: A close up of a person cooking hot dogs on a cutting board. New: A person making bread. Photo courtesy of Getty Images.
    3 of 7. Legacy: A person sitting at sunset. New: A campfire on a beach. Photo courtesy of Getty Images.
    4 of 7. Legacy: A man in a blue shirt. New: A few people wearing surgical masks. Photo courtesy of Getty Images.
    5 of 7. Legacy: A man riding a skateboard up the side of a building. New: A baseball player catching a ball. Photo courtesy of Getty Images.
    6 of 7. Legacy: A close up of a plant. New: A close-up of wheat in a field. Photo courtesy of Getty Images.
    7 of 7. Legacy: A man standing on top of a mountain. New: A man carrying a surfboard. Photo courtesy of Getty Images.

    Tags:

    • Accessibility
    • AI
    • Inclusion