REDMOND, Wash., Oct. 18, 2000 — Information overload: It may be the first constant headache of the 21st century.
Turn on the television, and it takes almost as long to click through the dozens of channels as it does to watch your favorite show. Answer the messages on your mobile phone, and there’s a whole new batch before you can hit the Off button. Search the Internet, and you lose interest in the arcane fact you are after before you finish browsing the first few Web sites — not to mention the fiftieth.
Information overload presents even more vexing challenges in the corporate world, where businesses increasingly rely on detailed information about their operations and customers to maintain their competitive advantage. Computer databases allow these businesses to stockpile billions of bytes of information about sales patterns and traffic on their Web sites and in their stores. But businesses have complained for years that the tools needed to pull useful data from this sea of information are too complicated and time-consuming to operate, creating what many sarcastically refer to as “write-only” databases — databases that take in information but can’t offer any in return.
Several years ago, researchers at Microsoft set out to make it easier to realize the potential of databases. The results appear for the first time in SQL Server 2000. The latest version of Microsoft’s database management software, which was released last month, is the first of its kind to offer built-in tools to extract complex patterns from deep within massive stockpiles of information.
SQL Server 2000 offers other advances as well: self-tuning technology, a big step toward the development of a database that manages itself without the help of an administrator; the ability to analyze an almost limitless amount of information, and to do it in a fraction of the time similar analytical tools once required; the use of Bayesian reasoning to improve analysis of data; and dependency networks for the visual exploration of patterns in data.
“Before SQL Server 2000, you often would have to hire a large team of data analysts to extract useful information from your data — and often it wouldn’t be worth the investment,” said David Heckerman, senior researcher and manager of Microsoft Research’s Machine Learning and Applied Statistics Group. “Now, you can discover useful information using a significantly smaller team for a fraction of the cost.”
MSR: Spurring Innovation
The development of SQL Server 2000 is a prime example of how Microsoft Research (MSR) spurs innovation within the company. More than 500 MSR researchers work independently and in tandem with product teams at four locations around the globe — Redmond; San Francisco; Cambridge, England; Beijing, China — to create solutions to new and lingering problems, such as database management.
The increasing challenges of data analysis in a world with ever-expanding databases prompted researchers in Heckerman’s group and the Data Management, Exploration, and Mining Group led by Surajit Chaudhuri to develop the advances in SQL Server 2000. Heckerman and other Microsoft researchers developed and tested many of the advances in SQL Server 2000 while helping Microsoft.com and MSN, the world’s most frequented Web site, analyze their massive databases for ways to better serve their users and attract more. This meant analyzing the preferences and habits, and accurately predicting the future predilections, of the tens of millions of people who visit these Web sites every day.
“While we were doing this,” Heckerman said, “we were asking ourselves, ‘What parts of this analysis are tough for people who don’t know statistics?’ and, ‘How can we make it easier through our knowledge of database management and statistics?’”
Another important challenge was to get data-mining tools closer to the data. Before SQL Server 2000, data-analysis tools within databases allowed users to track trends and compile statistics. A shoe retailer’s regional manager could, for example, query sales data to determine which store sold the most loafers, which salesperson sold the most high heels, or what brand was selling best throughout the chain.
But businesses and other users couldn’t search for deeper trends, such as the demographic and buying traits shared by the shoe retailer’s customers. For example, how likely are customers who buy loafers to buy athletic shoes? Do they tend to buy shoes only when they are on sale? Do they like to use coupons?
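Questions like these reduce to conditional probabilities over transaction data. A minimal sketch, with all product names and purchase records invented for illustration:

```python
# Hypothetical transaction log: each entry lists the shoe categories
# one customer has bought. Names and data are illustrative only.
baskets = [
    {"loafers", "athletic"},
    {"loafers"},
    {"loafers", "athletic", "high heels"},
    {"high heels"},
    {"athletic"},
    {"loafers", "athletic"},
]

def confidence(baskets, antecedent, consequent):
    """P(consequent | antecedent): how likely customers who bought
    the antecedent item also bought the consequent item."""
    has_a = [b for b in baskets if antecedent in b]
    if not has_a:
        return 0.0
    return sum(consequent in b for b in has_a) / len(has_a)

print(confidence(baskets, "loafers", "athletic"))  # 3 of 4 loafer buyers -> 0.75
```

A mining engine answers thousands of such questions at once over millions of rows; the point here is only the shape of the question.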
Businesses and other users need data-mining tools to retrieve this type of information. And these tools haven’t been able to communicate directly with database software in the past, forcing database managers to spend hours downloading data from a server across a network into a computer equipped with a mining engine. Then, they had to set up and run a mining query.
The results of these queries, however, didn’t include data gathered in the database during the information transfer and mining process. Nor could new data be easily compiled with old. A new query had to be done for each batch of data, Chaudhuri said.
“These types of extract-and-mine techniques are expensive and difficult, and compromise the consistency of the data,” Chaudhuri said. “The relationships between data get lost, producing flawed results.”
Integrating Databases and Data Mining
The solution created by a team of MSR researchers, in consultation with SQL Server’s developers, is called OLE DB for Data Mining. Put simply, it allows the data-mining engine to trade information with SQL Server automatically. And it does so in a way that’s familiar to most database administrators, using the same programming concepts, such as inserting and joining tables, as in previous versions of SQL Server, Chaudhuri explained.
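As a rough conceptual sketch, not SQL Server’s actual interface, a mining model that behaves like a table, one you insert training cases into and then join new rows against for predictions, might look like this; the class name, columns, and majority-vote logic are all invented:

```python
from collections import Counter, defaultdict

# Toy illustration of the "model as a table" idea behind OLE DB for
# Data Mining: training rows are inserted, new rows are scored via a
# join-like call. This is not the real engine; it predicts by simple
# majority vote within each value of one input column.
class MiningModel:
    def __init__(self, input_col, predict_col):
        self.input_col = input_col
        self.predict_col = predict_col
        self.cases = defaultdict(Counter)

    def insert(self, rows):
        # Analogous to inserting training cases into the model.
        for row in rows:
            self.cases[row[self.input_col]][row[self.predict_col]] += 1

    def prediction_join(self, rows):
        # Analogous to joining new rows against the trained model.
        for row in rows:
            counts = self.cases.get(row[self.input_col])
            guess = counts.most_common(1)[0][0] if counts else None
            yield {**row, self.predict_col: guess}

model = MiningModel("segment", "buys_athletic")
model.insert([
    {"segment": "urban", "buys_athletic": True},
    {"segment": "urban", "buys_athletic": True},
    {"segment": "rural", "buys_athletic": False},
])
scored = list(model.prediction_join([{"segment": "urban"}]))
print(scored)  # [{'segment': 'urban', 'buys_athletic': True}]
```

The familiarity Chaudhuri describes comes from exactly this shape: administrators manipulate the model with the same insert-and-join concepts they already use on ordinary tables.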
“Databases were created to simplify data management, but previous data-mining technology never took this into account,” he said. “Consistency, maintenance and integration of the data are of paramount importance. The only way to achieve these goals is for the data mining to live within the same architecture as the database. With SQL Server 2000, we’ve shown it can live there in a very natural way.”
So natural that it has changed the way Inmar Enterprises Inc. runs its operations. The Winston-Salem, N.C.-based company processes and tracks coupons redeemed by customers for more than 300 of North America’s top consumer-packaged-goods companies.
“We no longer require a team of data analysts or a large development effort,” said Doug Bordner, Inmar’s vice president of technology. The company, which began using a beta version of SQL Server 2000 earlier this year, now needs only one data architect, where previously four or five might have been required. Bordner and other employees with limited statistical knowledge can even help with mining projects.
SQL Server’s ability to partially self-tune database operations has further reduced costs for Inmar by eliminating many of the basic management tasks the company’s database administrators used to perform.
“They can do more than just focus on managing one system,” Bordner said. “They can be more like enterprise managers who focus on macro operations of our systems instead of simply the everyday tasks usually required to maintain databases.”
Microsoft researchers found ways to hide much of the complexity of operating the data-mining engine, similar to how car manufacturers gradually made it possible for almost anyone to operate an internal combustion engine.
Chaudhuri and others made it possible for SQL Server 2000 to automatically identify suitable data indexes and indexed views. The indexes work much like those in a book, providing shortcuts to information. Indexed views combine and store information from several data tables, even if the tables come from disparate parts of the database.
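The core idea of automatic index selection can be illustrated with a deliberately crude sketch: tally which columns a sample query workload filters on, and propose indexes for the most-used ones. SQL Server’s actual tuning uses far richer cost estimation; the queries, table, and threshold below are invented:

```python
from collections import Counter
import re

# Hypothetical query workload; in practice this would be captured
# from the running server.
workload = [
    "SELECT * FROM sales WHERE store_id = 7",
    "SELECT * FROM sales WHERE store_id = 9 AND brand = 'Acme'",
    "SELECT * FROM sales WHERE brand = 'Zed'",
    "SELECT * FROM sales WHERE store_id = 7",
]

def suggest_indexes(queries, top_n=1):
    usage = Counter()
    for q in queries:
        # Crude parse: columns compared with '=' in the query text.
        usage.update(re.findall(r"(\w+)\s*=", q))
    return [col for col, _ in usage.most_common(top_n)]

print(suggest_indexes(workload))  # ['store_id'] - filtered on most often
```

A real tuner weighs update costs, storage, and the optimizer’s own cost model against this kind of frequency signal, but the input (the workload) and the output (candidate indexes) are the same.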
Making SQL Server 2000’s data-mining engine useful to businesses presented another major challenge: it needed to rapidly process the ever-larger amounts of information stored in databases. To put it in technospeak, it needed to be massively scalable, or it wouldn’t be of much use to businesses such as Inmar.
Almost daily, the company must process client data containing many millions of rows of information. Since Inmar began using SQL Server 2000, it has been able to process that data in one-tenth the time it took before, Bordner said.
The scalability of SQL Server 2000’s data mining engine is the result of techniques developed by Microsoft researchers that allow the engine to minimize memory use, reduce the number of times it must scan data tables to find information, and sort through and eliminate portions of data that contain little or no information. SQL Server 2000 is so scalable, Heckerman said, that he has been able to analyze gigabytes — or billions of bytes — of data at a time.
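One of those techniques, eliminating portions of data that carry little or no information, can be sketched as a single scan that drops nearly constant columns. The threshold, column names, and data here are invented for the example:

```python
import math
from collections import Counter

# Illustrative data: 'country' never varies, so it carries no
# information worth mining; 'coupon' does vary.
rows = [
    {"country": "US", "coupon": "yes"},
    {"country": "US", "coupon": "no"},
    {"country": "US", "coupon": "yes"},
    {"country": "US", "coupon": "no"},
]

def entropy(values):
    # Shannon entropy in bits: 0 for a constant column.
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def informative_columns(rows, min_bits=0.1):
    # One pass over the data per column; drop low-information columns
    # so later mining passes touch less data.
    return [c for c in rows[0] if entropy([r[c] for r in rows]) >= min_bits]

print(informative_columns(rows))  # ['coupon']: 'country' is constant
```

Discarding such columns up front shrinks every subsequent scan, which is one reason fewer passes over the tables are needed.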
“The first reaction from many of our clients was, ‘Wow,’” Bordner said. “They are pretty amazed by the performance and the depth of the information provided.”
“When you are trying to make business decisions, you need answers in short order,” he said. “If it takes you a long time to retrieve those answers, the value of that information becomes less.”
Breakthroughs in Statistics
The actual analysis, or mining, done by data-mining engines can be done in many ways. The two most common are segmentation, to group similar data into clusters, and decision trees, to make predictions. SQL Server 2000 includes algorithms that perform both of these tasks. It also is capable of running other types of data mining algorithms created by other developers.
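Segmentation, the first of the two tasks, can be illustrated with a bare-bones k-means clustering pass in one dimension; the spending figures and starting centers below are made up:

```python
# Minimal 1-D k-means sketch: repeatedly assign each point to its
# nearest center, then move each center to the mean of its cluster.
def kmeans_1d(points, centers, iterations=10):
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

spend = [10, 12, 11, 95, 102, 99]  # invented monthly spend per customer
centers, clusters = kmeans_1d(spend, centers=[0.0, 50.0])
print(clusters)  # [[10, 12, 11], [95, 102, 99]]: low vs. high spenders
```

The engine’s clustering algorithm works on many columns at once rather than a single number line, but the group-then-recenter loop is the same idea.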
But SQL Server 2000 has statistical innovations not found in other mass-market mining engines. These include the use of Bayesian statistical methods, named after the Rev. Thomas Bayes, the 18th-century Presbyterian minister who established the mathematical basis for probabilistic inference. These methods, championed by Heckerman and his colleagues, make it possible to accurately distinguish true patterns from imaginary ones created by statistical noise — one of the most difficult tasks in data mining.
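The underlying Bayesian idea can be sketched as a model comparison: score “Y depends on X” against “Y is independent of X” for two yes/no columns and see which explains the data better. This toy version uses a uniform Beta(1,1) prior over coin biases; it illustrates the principle, not SQL Server’s actual algorithm:

```python
import math

def log_ml(heads, tails):
    # Log marginal likelihood of yes/no data under a Beta(1,1) prior.
    return (math.lgamma(heads + 1) + math.lgamma(tails + 1)
            - math.lgamma(heads + tails + 2))

def log_bayes_factor(pairs):
    # pairs: (x, y) booleans, e.g. (bought loafers, used a coupon).
    # Positive result favors "y depends on x"; near zero or negative
    # says the apparent link may just be noise.
    y_all = [y for _, y in pairs]
    y_x0 = [y for x, y in pairs if not x]
    y_x1 = [y for x, y in pairs if x]
    indep = log_ml(sum(y_all), len(y_all) - sum(y_all))
    dep = (log_ml(sum(y_x0), len(y_x0) - sum(y_x0))
           + log_ml(sum(y_x1), len(y_x1) - sum(y_x1)))
    return dep - indep

strong = [(True, True)] * 20 + [(False, False)] * 20   # real pattern
noise = [(True, True), (True, False),
         (False, True), (False, False)] * 5            # no pattern
print(log_bayes_factor(strong) > 0)  # True: dependence wins
print(log_bayes_factor(noise) < 0)   # True: independence wins
```

Because the marginal likelihood automatically penalizes the more flexible “dependent” model, a weak coincidence in the data is not mistaken for a real link, which is exactly the noise-versus-pattern distinction the text describes.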
In addition, SQL Server 2000 is the first product of its kind to offer dependency networks. These networks graphically depict patterns in data, using bubbles to depict columns of data and lines to connect the bubbles with statistical links. They resemble a map of airline routes with multiple flights connecting a network of airports.
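A dependency network can be sketched as a graph built from pairwise statistical links. The real product derives its links from decision trees; a toy version on yes/no columns might simply threshold how often two columns agree (all data, names, and the threshold below are invented):

```python
from itertools import combinations

# Invented yes/no purchase data: one row per customer.
rows = [
    {"loafers": 1, "athletic": 1, "coupon": 0},
    {"loafers": 1, "athletic": 1, "coupon": 1},
    {"loafers": 0, "athletic": 0, "coupon": 1},
    {"loafers": 0, "athletic": 0, "coupon": 0},
]

def dependency_edges(rows, threshold=0.9):
    # Nodes are columns; an edge links two columns that almost always
    # agree (or almost always disagree) across the rows.
    cols = list(rows[0])
    edges = []
    for a, b in combinations(cols, 2):
        agree = sum(r[a] == r[b] for r in rows) / len(rows)
        if max(agree, 1 - agree) >= threshold:
            edges.append((a, b))
    return edges

print(dependency_edges(rows))  # [('loafers', 'athletic')]
```

Drawn out, the result is the airline-route picture the article describes: columns as bubbles, with lines only where a statistical link survives the threshold.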
In the past, data analysts could only look at predictions in one decision tree at a time, Heckerman explained.
“The dependency networks in SQL Server 2000 allow you to build and explore hundreds of decision trees at once,” Heckerman said. “As a result, you can discover trends that you would otherwise miss.”
Chaudhuri and Heckerman believe the innovations added to SQL Server 2000 put it well ahead of the competition. But they and others at MSR haven’t moved on to other areas of research. They are working on ways to make future versions of SQL Server and other Microsoft data-analysis products more powerful by adding new analysis algorithms and visualization tools. In addition, they are enhancing the self-tuning technology they developed, focusing much of their efforts on improved physical design of databases.
“What we are trying to do is make data analysis as effortless as possible,” Heckerman said. “We hope the advances we are working on will soon make the ‘write-only’ database a thing of the past.”