Pushing the Boundaries of Search

REDMOND, Wash., May 31, 2006 – According to the “Live Labs Manifesto” of Gary William Flake, Ph.D., the Internet operates in a manner fundamentally unlike anything that has ever preceded it. It is a world where “something small and intangible – a better algorithm – can massively increase global utility and welfare.”

Microsoft Live Labs is a partnership between MSN and Microsoft Research that takes a holistic approach toward applied research for Internet-enabled products and services. The partnership brings together people with a variety of skills and perspectives to foster research programs, incubate entirely new inventions, and improve and accelerate new Web-based technologies.

This week Flake – head of Live Labs and a Microsoft technical fellow – and his team are announcing 12 winners of a new Live Labs request for proposals (RFP) entitled “Accelerating Search in Academic Research.” The RFP aims to identify bold, innovative approaches to information retrieval, data mining, machine learning and human-computer interaction, with the ultimate goal of creating new technologies that can drastically change the way we interact with the Web and its vast array of resources.

Recipients of the Live Labs grants are posing some of the most compelling questions in search technology today. Even if the user gets relevant results, can he or she trust that information? How can a search tool get the best data from the Web? What’s happening on that part of the Web that’s “below the surface,” that’s not being crawled today? How can user behavior help predict economic or social changes?

Although winners will receive cash grants of between US$35,000 and $50,000, according to Evelyne Viegas, Ph.D., program manager for External Research & Programs at Microsoft Research, most applicants say what drew them to the RFP was not the cash, but the promise of a wealth of real-world user query and click-through data from MSN. Awardees will gain access to extensive data logs from MSN to aid in their research, as well as an increased quota of queries to the MSN software development kit that enables programmatic access to real-world search results. (The data being used by the awardees contains no personally identifiable user information.)

“We’ve been hearing a lot from academia that they need this kind of data,” Viegas says. “That’s what’s really unique about this RFP. We’re giving them access to more than 15 million real-user queries, with click-through information. The academic community is hungry for this kind of resource, and we’re excited to see what they can do with it.”

Real-World Data

Researcher Amélie Marian, a grant recipient from Rutgers University, explains the value of MSN’s data for the academic community.

“One of the main difficulties in data management research, and specifically in Web data management, is to find large volumes of real data that can be used in experiments and evaluations,” she says. “So I was very interested in the opportunity to access the MSN query logs and excerpts.”

Marian will use the data provided by Live Labs to help understand how users search and access information, in an effort to create more reliable, trustworthy search results. Her project may be the first to consider corroborating evidence, as exhibited by the presence of the same information in multiple Web pages, as a factor in ranking search results.

“Web resources often have unreliable data because of erroneous, biased, misleading or outdated information,” Marian says. “My project aims at providing an interface that does the extra work of checking different Web sources to corroborate query answers, in order to save the users from the hassle of having to go through multiple Web sites to compare information.”
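The corroboration idea Marian describes can be illustrated with a minimal sketch. It assumes that candidate answers have already been extracted from each Web source as plain strings; the function name, the input format and the optional per-source weights are all hypothetical and are not taken from her project.

```python
from collections import defaultdict

def rank_by_corroboration(reports, source_weight=None):
    """Rank answers by how many distinct sources report them.

    reports: iterable of (source, answer) pairs (hypothetical input format).
    source_weight: optional dict mapping a source to an importance weight;
                   sources not listed count as 1.0 (an assumption).
    """
    source_weight = source_weight or {}
    seen = set()
    scores = defaultdict(float)
    for source, answer in reports:
        normalized = answer.strip().lower()
        # Count each source at most once per answer, so one site
        # repeating itself does not look like corroboration.
        if (source, normalized) in seen:
            continue
        seen.add((source, normalized))
        scores[normalized] += source_weight.get(source, 1.0)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

reports = [
    ("a.example", "1889"),
    ("b.example", "1889"),
    ("c.example", "1887"),
    ("a.example", "1889"),  # duplicate report from the same source
]
ranking = rank_by_corroboration(reports)
weighted = rank_by_corroboration(reports, {"c.example": 3.0})
```

With equal weights, the answer reported by two independent sources ranks first. Giving one source a higher weight can reverse the order, which is the role source importance would play in such a scheme.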

Search as a Social Phenomenon

Beyond ranking the relevance and veracity of results, the depth and breadth of those MSN data sets can provide researchers with a range of new insights into the way people use the Web, and into the Web itself, says Viegas. Not surprisingly, researchers around the world jumped at the opportunity. Live Labs received 182 proposals from institutions in 36 countries, spanning disciplines from computer science to sociology.

“We had economists, educators and sociologists responding to our search RFP, so that was really exciting to see,” she says. “It’s not just search as a technology, but also as a social phenomenon. With this RFP, we’ve been able to gather experts in academia who can help us examine search beyond relevance, and look into this Internet cultural shift where people are the makers of information as well as consumers of information. The hope is that, by providing researchers with large-scale data sets, they can start getting some answers and direction on solving many of the questions we have about search.”

According to grant recipient Zoubin Ghahramani, who conducts research through the University of Cambridge, Carnegie Mellon University and University College London, many of those answers may come from an approach that combines the rigors of technology and science with the inherent unpredictability of human behavior.

Ghahramani’s research is focused on using Bayesian statistical techniques to create more intelligent machines that can “learn” to deal with the uncertainties of search posed by users. His group is focused on several problems, including identifying users who anticipate what other people will search for, personalizing search by combining information across similar users, and predicting what people will be searching for.

“Search is an inference problem,” he says. “What is the probability that the user is interested in this Web page given that he or she typed this query? Once we start exploring this database of search query logs using machine learning methods, I imagine we will find many interesting and practical challenges.”
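The question Ghahramani poses, the probability that a user is interested in a page given a query, can be made concrete with a small sketch. It assumes a click-through log of (query, clicked page) pairs and uses simple additive smoothing as a stand-in for a Bayesian prior; the function name, the log format and the smoothing constant are hypothetical and do not describe his group's actual methods.

```python
from collections import Counter, defaultdict

def click_probabilities(log, candidate_pages, alpha=1.0):
    """Build an estimator of P(page | query) from click-through pairs.

    log: iterable of (query, clicked_page) pairs (hypothetical format).
    candidate_pages: the pages among which probability is shared.
    alpha: pseudo-count added to every page, so pages with no clicks
           still receive a small non-zero probability (an assumption).
    """
    counts = defaultdict(Counter)
    for query, page in log:
        counts[query][page] += 1

    def prob(page, query):
        clicks = counts[query]
        total = sum(clicks[p] for p in candidate_pages)
        return (clicks[page] + alpha) / (total + alpha * len(candidate_pages))

    return prob

pages = ["cars.example", "zoo.example", "other.example"]
log = [("jaguar", "cars.example")] * 3 + [("jaguar", "zoo.example")]
prob = click_probabilities(log, pages)
p_cars = prob("cars.example", "jaguar")   # (3 + 1) / (4 + 3)
p_unseen = prob("zoo.example", "puma")    # no clicks logged for this query
```

For a query with no logged clicks, the estimate falls back to a uniform distribution over the candidate pages, which is how the pseudo-count expresses uncertainty before any evidence arrives.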

Live Labs Grant Recipients

The following is a list of grant awardees, along with the title of their proposal and a brief abstract:

Eytan Adar, Brian Bershad, Steven Gribble and Daniel Weld, University of Washington

“Vinegar: Leading Indicators in Query Logs.”

“The flood of queries coming into a search engine represents a slice of the collective consciousness of Internet users. Events in this stream, when properly detected and aggregated, can be used to explain current happenings and generate leading indicators to predict future events. The name Vinegar comes from the observation that months before SARS hit the world newspapers, and even before the disease was acknowledged by the larger Chinese medical community, the affected population of the Guangdong province in China began buying out supplies of white vinegar, a local folk remedy.”

Lada Adamic and Suresh Bhavnani, University of Michigan

“VISP: Visualizing Information Search Processes.”

“Our main goal is to develop a visualization tool that will show the distribution of information among the search results, the links between the results and the user click-throughs. The visualization tool will both contribute to our understanding of information seeking behavior and enable search engine developers and Website designers to pinpoint the difficulty users have in finding comprehensive information.”

Soumen Chakrabarti, IIT Bombay

“Entity and Relation Types in Web Search: Annotation Indexing and Scoring Techniques.”

“The goal of our proposed project is to dramatically improve the quality of complex search and aggregation tasks over text and semi-structured data by annotating and exploiting entities and relations.”

Kevin Chang, University of Illinois, Urbana-Champaign

“Deepening Search: From the Surface to the Deep Web.”

“While the ‘surface Web’ has linked billions of static HTML pages, a far more significant amount of information is hidden in the ‘deep Web,’ behind the query forms of searchable databases. As the deep Web is largely invisible to current search engines, users’ search requests do not reach this uncharted territory. This proposal aims at opening up that deep Web.”

Bruce Croft, University of Massachusetts, Amherst

“Discovering and Using Meta-Terms.”

“Many queries, particularly ‘content-based’ Web queries, contain terms that are difficult to match directly with documents. We believe that many of these important terms are in fact instances, examples, or more specific forms of query terms which we call ‘meta-terms.’ Transforming queries using replacements or expansions for these terms can make a substantial difference to performance.”

Brian Davison, Lehigh University

“Incorporating Trust Into Web Authority.”

“Search providers continually work to improve the quality of their product, while marketers strive for ever increasing visibility. Web link analysis is now well-targeted by search engine marketers, and so ‘Web spam’ has become increasingly visible in Web search. In this project, we incorporate a number of measures of trust and distrust to improve estimates of Web page and site authority, reducing or eliminating the effect of Web spam in the process.”

Zoubin Ghahramani, University of Cambridge, Carnegie Mellon University, University College London

“Statistical Machine Learning for User Modeling.”

“Some of our specific aims include: identifying users whose queries anticipate those of others, leveraging other users to help personalize search, predicting the next query and clicked page, and identifying clusters of users, of queries, and their network structure.”

Panagiotis Ipeirotis and Anindya Ghose, New York University

“Combining Econometric and Text Mining Approaches for Measuring the Effect of Online Information Exchange.”

“This research studies the ‘economic value of text’ in online settings, focusing on three important and varied categories of information exchanges: reputation systems in electronic markets, product recommendations in online communities, and the impact of social media (search engines, wikis, and blogs) on sales. This research program combines established techniques from economics with text mining algorithms from computer science, to measure the economic value of each text snippet, and understand how textual content in these systems influences economic exchanges between various agents in electronic markets.”

Amélie Marian, Rutgers University

“Aggregating Answers From Multiple Web Sources.”

“The goal of this project is to provide an interface that aggregates query results from different sources in order to save users the hassle of individually checking query-related Web sites to corroborate answers. In addition to listing the possible query answers from different Web sites, the interface ranks the results based on the number, and importance, of the Web sources reporting them. The existence of several sources providing the same information is then viewed as corroborating evidence, increasing the quality of the corresponding information.”

Alistair Moffat, University of Melbourne

“Predictive Exploitation of Click-Through Knowledge.”

“Web retrieval systems will be more effective if they dynamically adapt to the user’s information need according to how other users have responded to those same documents when they were returned in response to the same or similar previous queries. Access to the Microsoft query log data and click-through data will allow us to explore this conjecture.”

Gerd Stumme, University of Kassel

“Social Search: Bringing the Social Component to the Web.”

“Social bookmark tools like del.icio.us are rapidly emerging on the Web. Unlike link-based search approaches à la PageRank, these systems provide personal recommendations based on input from similar users. We will extend link-based search with social search, in order to provide enhanced functionality and multiple search paradigms for the Web.”

Cheng Xiang Zhai, University of Illinois, Urbana-Champaign

“Mining Query/Click Logs for Collaborative Internet Search.”

“We propose to leverage the large amount of query/click data that Microsoft Live Labs will release to study collaborative information retrieval, i.e., to exploit information from other users to improve the search accuracy for a current user.”