“A taxonomy does not require technology at all,” notes Daniel Rasmus, vice president and research leader at Giga Information Group. “A taxonomy is simply a way of classifying things.”
Still, there is a rapidly growing list of vendors offering taxonomy software and related applications. They promise many benefits, especially to enterprise customers: Content management will be more efficient. Corporate portals will be enhanced by easily created Yahoo!-like directories of internal information. And the end-user experience will be dramatically improved by more successful content retrieval and more effective knowledge discovery.
But today’s taxonomy products represent emerging technologies. They are not out-of-the-box solutions. And even the most automated systems require some manual assistance from people who know how to classify content.
What’s A Taxonomy?
“Taxonomy is a term borrowed from biology,” says Charles Weinstein, director of solution development at the content categorization company Sopheon. “We can contrast a taxonomy with a thesaurus, which tries to connect naturally related terms, but the two are complementary. And many of the systems we build try to reflect the way users would group things and the words they would use to find things because at the end of the day we’re trying to build a scheme that makes content more useful for people—both for the people who are retrieving it and the people who are contributing it. If we think about a hierarchical classification scheme, that’s a decent initial definition of a taxonomy.”
When the definition is applied to digital content, it usually includes software that uses auto-categorization algorithms to find, screen, and classify information. This approach often involves sample documents categorized manually and then used to train a taxonomy system to classify other information automatically.
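The train-by-example idea can be sketched in a few lines. The Python below is a minimal illustration, not any vendor's algorithm: it assumes a simple bag-of-words profile per category and cosine similarity for matching, and the category names and sample texts are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_samples):
    """Build one word-count profile per category from hand-labeled samples."""
    profiles = defaultdict(Counter)
    for category, text in labeled_samples:
        profiles[category].update(tokenize(text))
    return dict(profiles)


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(profiles, text):
    """Return (best_category, similarity) for a new document."""
    words = Counter(tokenize(text))
    scores = {cat: cosine(words, profile) for cat, profile in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical training set: (category, sample text) pairs labeled by a person.
SAMPLES = [
    ("legal", "contract liability clause agreement"),
    ("legal", "court ruling on contract dispute"),
    ("engineering", "bridge load design specification"),
    ("engineering", "road surface design and materials specification"),
]

profiles = train(SAMPLES)
category, score = classify(profiles, "new contract agreement for review")
print(category, round(score, 2))
```

Commercial products use far richer statistical or linguistic models, but the workflow is the same: people label a small sample, and the software generalizes from it.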
Rasmus at Giga points out that such a classification system can include not only documents, but a variety of information resources, including people, which aids in “expertise discovery.”
“You can classify information based on what people know,” he says. “So instead of saying a document is about something, you say, ‘Because this person was the author of the document, he or she knows about this concept.’ Then you can go to the extreme of saying, ‘Because this other person was referenced by somebody, he or she must be an even bigger expert because somebody referred to his or her work.’”
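One way to picture that reasoning is as a simple scoring scheme. The sketch below is an assumption-laden illustration rather than a description of any product: the record layout, the people's names, and the weights (one point for authorship, two for being referenced) are all hypothetical.

```python
from collections import defaultdict


def expertise_scores(documents, author_weight=1.0, citation_weight=2.0):
    """Credit people with expertise on a topic for writing or being cited.

    Each document is a dict with "author", "topics", and an optional "cites"
    list of people it references. The weights are arbitrary assumptions.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for doc in documents:
        for topic in doc["topics"]:
            # The author is assumed to know about the document's topics.
            scores[doc["author"]][topic] += author_weight
            # Anyone the document references gets larger credit on those topics.
            for cited in doc.get("cites", []):
                scores[cited][topic] += citation_weight
    return {person: dict(topics) for person, topics in scores.items()}


# Hypothetical document records.
DOCUMENTS = [
    {"author": "alice", "topics": ["pavement"], "cites": ["bob"]},
    {"author": "carol", "topics": ["pavement"], "cites": ["bob"]},
    {"author": "bob", "topics": ["pavement"]},
]

scores = expertise_scores(DOCUMENTS)
print(scores)
```

Here the person who is referenced twice ends up ranked above the two authors who cited him, which mirrors the "even bigger expert" inference in the quote.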
What Makes a Good Taxonomy?
A good taxonomy, according to Weinstein, is one in which content is distributed evenly across the classification scheme. “The depth of the taxonomy should be relatively uniform,” he says. When some categories have too much or too little information, “it usually means that the people didn’t understand the nature of the content they were classifying, or they believe that they had more or less than they actually did.”
A good taxonomy also is one in which “everything has a place and only one place,” Weinstein says. “The sum total of the taxonomy is mutually exclusive of all of the content, and it’s collectively exhaustive as well.” Also, “the terms used in the taxonomy should be native terms to the user community. They have to be terms that the users will understand instantly, intuitively, and clearly.”
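Those criteria lend themselves to a quick audit. The following Python sketch is one hypothetical way to check them, assuming the classification is available as simple (document, category) pairs; the document IDs and category names are invented.

```python
from collections import Counter


def audit_taxonomy(assignments, all_documents):
    """Report documents with no category, documents with several, and category sizes.

    assignments: iterable of (document_id, category) pairs.
    all_documents: iterable of every document_id in the collection.
    """
    assignments = list(assignments)
    categories_per_doc = {}
    for doc_id, category in assignments:
        categories_per_doc.setdefault(doc_id, set()).add(category)

    # Collectively exhaustive: every document should appear at least once.
    unclassified = sorted(set(all_documents) - set(categories_per_doc))
    # Mutually exclusive: no document should sit in more than one category.
    multiply_classified = sorted(
        doc for doc, cats in categories_per_doc.items() if len(cats) > 1
    )
    # Even distribution: very large or very small counts deserve a second look.
    sizes = Counter(category for _, category in assignments)
    return unclassified, multiply_classified, sizes


# Hypothetical collection and assignments.
DOCS = ["d1", "d2", "d3", "d4"]
ASSIGNMENTS = [
    ("d1", "roads"),
    ("d2", "roads"),
    ("d2", "bridges"),
    ("d3", "bridges"),
]

unclassified, multiple, sizes = audit_taxonomy(ASSIGNMENTS, DOCS)
print(unclassified, multiple, dict(sizes))
```

A check like this covers only the structural criteria. Whether the category names are "native terms" to the user community still takes human judgment.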
The benefits of a good taxonomy, he says, are that users can “navigate from need to resource consistently and quickly.” A good taxonomy “allows an organization to inventory and monitor knowledge resources based on a structured understanding of user and community needs.”
Who’s Providing Services?
Rasmus notes that a typical taxonomy implementation usually costs about $100,000. According to Merrill Lynch, the market for search and categorization products, now at about $600 million, will more than double by 2005.
The vendors serving this market offer a range of approaches and solutions. Some provide hybrid automated/manual systems designed to offer the benefits of both methods. Some offer taxonomy products only as part of overall portal and content-management technologies. A few offer add-on products, such as visualization tools that can enhance the value of taxonomies. Some are big names; others, mom-and-pop software shops.
One of the big names is Microsoft. It offers the SharePoint Portal Server, which includes content auto-categorization features. Another big name is Lotus, which offers the Discovery Server, an application that extracts, analyzes, and categorizes structured and unstructured content to reveal the relationships between the information as well as the people, topics, and user activity in an organization.
A well-known name within the content categorization market is Autonomy. It develops infrastructure technology that provides a platform for the automatic categorization, hyperlinking, retrieval, and profiling of unstructured information.
The approach at Sopheon is to provide applications pre-loaded with industry- and process-specific content and supported by expert services. “For instance, in our Accolade solution, the software is preloaded with best practices content related to product development,” says Weinstein. “It comes through a collaboration we have with leading experts in the product-development field, and we provide a proprietary expert database.” Sopheon’s products also can be used to enhance information-intensive processes, such as scientific research, quality management, and CRM.
Quiver is a fairly new company offering a taxonomy platform called QKS Classifier. This company’s approach is to combine software using classification algorithms with workflow and directory-management tools that allow human input. “We fundamentally don’t believe you get good enough quality out of entirely automated solutions,” says Andrew Feit, Quiver’s executive vice president of sales & marketing.
“We have a categorization engine, and if you give it ten or fifteen examples of your topic, it will go off and find all the other content that appears to belong to it,” he says. “Autonomy and Inxight have similar solutions, but where we differentiate ourselves is instead of just being a categorization engine, we actually have a workflow environment, an end-to-end solution for managing the taxonomy.
“Instead of just putting documents into topics, our system can send them to human beings for review, when necessary. It can follow a rule that says a document with 90 percent or greater confidence that it belongs in a topic can be automatically published to that topic. Nobody has to look at it. If our system thinks there’s a 60 to 90 percent confidence, the document can be sent to information managers to review the decision.” Feit says the system then will learn from the decisions the managers make and use that information when it classifies subsequent documents.
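The rule Feit describes amounts to routing on a confidence score. The sketch below is a hypothetical illustration in Python, not Quiver's code: the function and action names are invented, and the handling of scores below 60 percent is an assumption, since the article does not say what happens to those documents.

```python
AUTO_PUBLISH_THRESHOLD = 0.90  # "90 percent or greater" in Feit's description
REVIEW_THRESHOLD = 0.60        # lower bound of the "60 to 90 percent" band


def route(confidence):
    """Map a classifier confidence score (0.0 to 1.0) to a workflow action."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0.0 and 1.0")
    if confidence >= AUTO_PUBLISH_THRESHOLD:
        return "publish"  # no human needs to look at it
    if confidence >= REVIEW_THRESHOLD:
        return "review"   # send to an information manager
    # The article does not say what happens below 60 percent; "unassigned"
    # is an assumption made for this sketch.
    return "unassigned"


for score in (0.95, 0.75, 0.30):
    print(score, route(score))
```

In a system that learns from reviewers, the decisions made on the "review" documents would be fed back as additional training examples.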
Instead of a system that learns from sample documents, Semio Corporation offers a “semantic processing” approach. The SemioTagger technology analyzes and categorizes large volumes of unstructured information automatically. The technology recognizes over 200 different file types and extracts key concepts and phrases from the documents. It assigns unique metatags to main concepts throughout a collection so they can be reused over and over and called up through any database. “We have a library of taxonomies,” notes Roger Phillip, Semio’s vice president of marketing and business development. “In a client engagement, we will interview the end-user organization, and if we need to customize that taxonomy, we will customize it based on the interviews.”
He adds that Semio’s approach offers such advantages as enhanced accuracy of data sets and support for more categories of information. “We have a site that has 50,000 categories,” he says. “In a learning-based approach, you can only go to a couple of hundred.”
Instead of a standalone product, Semio positions itself as “a component of software infrastructure,” Phillip says, “meaning a component of a portal, of a content or document management system, or of search and retrieval.”
The Value of Visualization
“Once a taxonomy is generated, you can use visualization tools to see the relationships between content in a different way,” says Rasmus at Giga. He believes more and more companies will consider using such systems for navigation because of “the overwhelming amount of content and the lackluster performance” of search engines, directories, and other traditional retrieval tools.
Visualization tools make it easy to understand the relationships between pieces of content, he says, “and you can’t do that without metadata. It’s just too much number crunching to do that kind of relationship mapping on-the-fly. It has to be done in the background. It has to be built on something, and metadata, which is what the categorization tools generate, is what does that.”
One visualization vendor is TheBrain Technologies Corporation, which offers “knowledge architecture” designed to model the way information is created and used, forming a single graphical map. The technology is used by such organizations as the Ford Motor Company, the United Nations, and the FAA.
Inxight Software offers technology for organizing information as well as a visualization tool called the Inxight Star Tree, which could be used for “something as simple as a site map,” says Ian Hersey, Inxight’s vice president of linguistic products, “or it could be a hierarchical representation of a taxonomy that’s a map into various content collections,” such as product catalogs and document archives.
“What makes us unique are not only the properties of the visualization itself, but also, on the back end, we have natural-language technology that figures out how to classify documents, given an example set.”
Inxight doesn’t provide taxonomy development services, but it has several content publisher and aggregator clients, and it’s able to resell Factiva’s taxonomy as a starter set to enterprise clients. “That’s particularly useful when the enterprise is also a customer of Factiva,” says Hersey. “It lets them link their internal and external data in the same taxonomy.”
Another supplier of visualization technologies is Antarcti.ca. Its flagship product, Visual Net, is a tool based on visual mapping techniques that enable users to navigate and browse information across multiple databases in multiple formats. The technology is applicable to research libraries, intranets, ecommerce, and cataloging.
“Assuming you have a taxonomy, we can draw you a compelling visual map of it to make it more useful,” says Tim Bray, Antarcti.ca’s founder. “Most taxonomies are fairly large, and your ordinary day-to-day users aren’t going to be carrying it around in their heads. Our tool lets users poke their way through the taxonomy without having to internalize it.”
Many visualization and categorization products are complementary, and the companies behind them have formed some partnerships. For example, Semio has developed systems with both Antarcti.ca and TheBrain.
“The combination of auto-categorization and visual access is a one plus one equals three kind of thing,” Bray says. “You get a higher value with both systems than either one offers on its own. There’s a problem with general intranet messiness, and it’s not going to go away, and that’s why there are big opportunities for the taxonomy technologies to provide automatic help solving these problems, and for our software to give people a coherent visual picture of what they’ve done.”
Overcoming Difficulties with Taxonomy Technologies
“No matter what taxonomy product you look at, it’s not going to be a turnkey solution,” cautions Rasmus at Giga. “Most of the systems, when you do automatic taxonomy generation, there still is quite a bit of manual effort involved to go back and change the names. The systems just come up with what they think a concept should be called. It’s a machine name. It may be just a string of characters that are put together. So you have to go back and give it a real name that means something in the context of your business.”
Exacerbating that problem is the fact that “a lot of organizations have dropped off the corporate librarians and other people who have the skills for organizing content. I recommend to companies that they keep their librarians, and they may want to hire knowledge engineers even if they’re using automated tools because the tools are really black boxes you throw content in. You read a document and put it into a training algorithm and say, ‘Now every time I throw content at you, classify any documents that are like this one in this category.’
“Well, sometimes I don’t know why the black box classified a document the way it did. I’ve heard stories from our customers that they’ll start classifying things and then they’ll say, ‘Why is this document sitting here?’ They have no idea because the system doesn’t tell them. Then they have to retrain it.”
Rasmus says another reason companies encounter difficulties with taxonomy products is “they just throw everything at it. It’s much better when there’s a domain of knowledge the system is related to. So if I buy an Autonomy product, it’s going to be much more successful if I link it into my CRM system and it’s indexing content related to customer stuff like legal documents and correspondence.
“But, again, no matter what product you look at, it’s going to involve people building the taxonomy, deciding on common terms, and figuring out what documents train the thing the right way. That all takes work. The end result is very valuable if you’ve got a lot of content and you’re trying to find better ways of navigating it, but it’s not going to be an out-of-the-box solution, and you really have to think about what you’re trying to do with it.” So get ready to roll up those sleeves.
SIDEBAR: Taxonomy on the Highway
The U.K. Highways Agency, which is responsible for managing England’s network of trunk roads and motorways, used Semio software to help develop its “Anywhere Office” intranet, designed to give employees single-point access to a wide range of business applications and documents.
SemioTagger technology categorized documents from the Highways Agency’s existing Web sites, the site of the Department of Transport, and other roads-related resources. Semio’s product sorts the documents and concepts pulled from them into navigable topic hierarchies, which enables Highways Agency employees to quickly find the specific information they need and discover new associations between pieces of information.
The documents automatically feed into the Plumtree Corporate Portal, used as the framework for the Highways Agency’s 3Net portal initiative, which mandates the integration of the agency’s intranet, extranet, and Internet sites as a single, consistent, and up-to-date source for employees and authorized users.
“The Semio/Plumtree joint solution provides an ideal balance between the automation and customization needed for creating and maintaining large directories,” says John Walford, IT infrastructure manager for the Highways Agency. “During our pilot phase, Semio was able to categorize and provide a taxonomic structure for the data we provided within just two days.
“At that point, we knew we could quickly organize data for easier access with minimal human intervention and we had greater confidence that a larger roll-out would be successful within our projected timeframe.”
The agency said it chose Semio to provide the software because it offers a more granular and more accurate system than traditional document-based categorization products and because it allows all materials to be classified at one time, which eliminates the need to re-classify content separately for different communities.
SIDEBAR: Taxonomy at The Hartford
Hartford Technology Services Company, a subsidiary of The Hartford Financial Services Group Inc., is an in-house technology-consulting firm. To help with content management, the company has been implementing Sopheon technology.
“We had a need for expertise location and understanding who’s who within our organization,” says Jeff McCartney, knowledge coordinator for Hartford Technology Services Company. “As we started to go down that path, we knew we also had some longer term needs with regards to document management, content management, and having a portal, so we decided to go through the exercise of doing a knowledge mapping process to really understand what we knew.
“As a consulting firm it was very important for us to be able to provide to our staff the knowledge of who knows what, what information we have, and making it as accessible as possible so when they’re on assignment, they can draw upon that to help bring value to the clients they serve. We have implemented a taxonomy within the Sopheon tool so that it provides an additional way of getting to who knows what about a particular topic.
“You can still go to a search engine, but one of the things we were concerned about is making sure there are alternate ways for people to find information. Sometimes people discover things just by browsing so we wanted to provide that capability. By having that taxonomy and refining it through the implementation of our expertise location tool, it’s providing the baseline we will be using when we actually select and implement the document-management tool.”
What advice would McCartney offer other organizations considering using taxonomy products? “First, understand your specific needs before you start down the path of selecting a tool or a vendor. We initially planned to start implementing the technology about two years ago, and I’m glad we didn’t. I’m glad we stepped back and said, ‘No, let’s take a page from our own methodology and do a needs analysis first.’ Based on the analysis, we came up with a completely different direction in terms of our actual knowledge needs.”