Meaning = Data + Structure: Inferring Structure from domain knowledge
October 29, 2007
Posted by jeremyliew in Consumer internet, data, domain knowledge, meaning, metadata, semantic web, structure, user generated content.
As more user generated content floods the web, I’ve been thinking about how to draw more meaning from the content, and the idea that Meaning = Data + Structure. A number of readers commented on my previous post, about user generated structure. They point out that one of the challenges of relying on this approach is finding the right incentives to get users to do the work. I’m inclined to agree. I think user generated structure will be part of the solution, but it probably won’t be the whole solution; it won’t be complete enough.
If people won’t do the work, perhaps you can get computers to do it. Is there a way to teach a computer to algorithmically “read” natural language documents (i.e. web pages), understand them, and apply metadata and structure to those documents? Trying to do this on a web wide basis rapidly gets circular: since this ability is exactly what we need for a computer to comprehend meaning, if it existed you wouldn’t need the structure in the first place. The structure is our hack to get there!
All is not lost though. In the grand tradition of mathematics and computer science, when faced with a difficult problem, you can usually solve an easier problem, declare victory and go home. In this case, the easier problem is to try to infer structure from unstructured data confined to a single domain. This substantially constrains the complexity of the problem. Alex Iskold has been advocating this approach to the semantic web.
Books, people, recipes, movies are all examples of nouns. The things that we do on the web around these nouns, such as looking up similar books, finding more people who work for the same company, getting more recipes from the same chef and looking up pictures of movie stars, are similar to verbs in everyday language. These are contextual actions that are based on the understanding of the noun.
What if semantic applications hard-wired understanding and recognition of the nouns and then also hard-wired the verbs that make sense? We are actually well on our way doing just that. Vertical search engines like Spock, Retrevo, ZoomInfo, the page annotating technology from Clear Forrest, Dapper, and the Map+ extension for Firefox are just a few examples of top-down semantic web services.
Take people search as an example. By only worrying about information about people on the internet, people search engines can look for specific attributes of people (e.g. age, gender, location, occupation, schools, etc) and parse semi-structured web pages about people (e.g. social network profiles, people directories, company “about us” pages, press releases, news articles etc) to create structured information about those people. Perhaps more importantly though, they do NOT have to look for attributes that do not apply to people (e.g. capital city, manufacturer, terroir, ingredients, melting point, prime factors etc). By ignoring these attributes and concentrating on a much smaller set, the computational problem is made substantially simpler.
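To make the idea concrete, here is a minimal sketch of domain-constrained extraction. It is not how Spock, ZoomInfo or any other engine actually works; the attribute list, the "label: value" page format and the function name `extract_person` are all assumptions made up for illustration. The point is only that the extractor looks for a handful of person attributes and ignores everything else.

```python
import re

# Hypothetical whitelist of attributes for the "person" domain, each with
# the shape of value we expect. Anything not listed here is never looked for.
PERSON_ATTRIBUTES = {
    "age": r"\d{1,3}",
    "location": r"[^;,\n]+",
    "occupation": r"[^;,\n]+",
}

# Assumes pages expose attributes in a semi-structured "label: value" form.
PATTERNS = {
    name: re.compile(rf"\b{name}\s*:\s*({value})", re.IGNORECASE)
    for name, value in PERSON_ATTRIBUTES.items()
}


def extract_person(text):
    """Return a structured record containing only person attributes."""
    record = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            record[name] = match.group(1).strip()
    return record


# Made-up profile text; "Melting point" is silently ignored.
profile = (
    "Name: Jane Doe; Age: 34; Location: San Francisco; "
    "Occupation: Software engineer; Melting point: n/a"
)
person = extract_person(profile)
```

Real pages are far messier than this, but the shape of the simplification is the same: a small, fixed set of attributes to search for rather than every attribute of every kind of thing.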
As an example, look at the fairly detailed data (not all of it correct!) available about me on Spock, Rapleaf, Wink and ZoomInfo. ZoomInfo in particular has done a great job on this, pulling data from 150 different web references to compile an impressively complete summary.
Companies with a lot of user generated content can benefit from inferring structure from the unstructured data supplied by their users. In fact, since they don’t need to build a crawler to index the web, they have a much simpler technical problem to solve than do vertical search engines. They only need to focus on the problems of inferring structure.
Many social media sites focus on a single topic (e.g. Flixster [a Lightspeed portfolio company] on movies, TV.com on TV, iLike on music, Yelp on local businesses, etc) and they can either build or borrow an ontology into which they can map their UGC.
Take the example of movies. A lot of structured data for movies already exists (e.g. actors, directors, plot summaries etc) but even more can be inferred. By knowing something about movies, you could infer (from textual analysis of reviews) additional elements of structured data such as set location (London, San Francisco, Rwanda) or characteristics (quirky, independent, sad).
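As a rough sketch of what such textual analysis could look like, assume a small movie vocabulary (the set locations and characteristics named above) and only accept a tag when several independent reviews mention it. The vocabularies, the `min_reviews` threshold and the sample reviews are all invented for illustration; a real system would need something far more robust than keyword matching.

```python
import re
from collections import Counter

# Hypothetical domain vocabularies for movies.
SET_LOCATIONS = {"london", "san francisco", "rwanda"}
CHARACTERISTICS = {"quirky", "independent", "sad"}


def infer_tags(reviews, vocabulary, min_reviews=2):
    """Return vocabulary terms mentioned in at least `min_reviews` reviews."""
    counts = Counter()
    for review in reviews:
        text = review.lower()
        for term in vocabulary:
            # Whole-word match so "sad" does not fire on "saddle".
            if re.search(rf"\b{re.escape(term)}\b", text):
                counts[term] += 1
    return sorted(term for term, n in counts.items() if n >= min_reviews)


# Made-up reviews of a single film.
reviews = [
    "A quirky little film shot on the streets of London.",
    "Sad at times, but the London setting is gorgeous.",
    "Quirky, funny, and a bit sad.",
]
locations = infer_tags(reviews, SET_LOCATIONS)
characteristics = infer_tags(reviews, CHARACTERISTICS)
```

Requiring agreement across reviews is one simple way to trade coverage for accuracy, since any single review may mention London without the film being set there.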
In addition to search, structure inferred from data can also be used for discovery. Monitor110 and Clearforest are two companies that are adding structure to news data (specifically, business news data) to unearth competitive business intelligence and investment ideas. By knowing some of the relationships between companies (supplier, competitor etc) and their products, and by analyzing news and blogs, Monitor110 and Clearforest can highlight events that may have a potential impact on a particular company or stock.
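A toy version of this kind of discovery might look like the sketch below. It says nothing about how Monitor110 or Clearforest actually work; the company names, the relationship triples and the function `affected_companies` are hypothetical. Given a headline, it flags the companies mentioned directly plus any company known to depend on or compete with them.

```python
# Hypothetical relationship triples: (company, relation, related company).
RELATIONSHIPS = [
    ("Acme Widgets", "supplier of", "Globex Motors"),
    ("Initech", "competitor of", "Globex Motors"),
]


def affected_companies(headline, relationships):
    """Map each potentially affected company to the reason it was flagged."""
    text = headline.lower()
    affected = {}
    for source, relation, target in relationships:
        if source.lower() in text:
            affected[source] = "mentioned directly"
            # News about the source may ripple out to the related company.
            affected.setdefault(target, f"{source} is a {relation} {target}")
    return affected


headline = "Factory fire halts production at Acme Widgets"
alerts = affected_companies(headline, RELATIONSHIPS)
```

The headline never mentions Globex Motors, yet it is flagged because the structure records that Acme Widgets supplies it. That second-order link is what the structured relationships add over plain keyword alerts.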
The common criticism leveled against this approach is that it is insufficient to handle the complexity of an interrelated world. Arnold Schwarzenegger for example is a person, a politician, an actor, a producer, an athlete and a sports award winner as the excerpt from Freebase below shows:
Confining an ontology to a single domain, such as movies in the example above, would mean that you are unable to answer questions such as “What Oscar nominated films have starred Governors of California?”.
Whether this is a problem depends on whether you believe search is orienteering or teleporting:
Teleporting means trying to get to the desired item in a single jump. In this study it almost always involves a keyword search. Orienteering means taking many small steps–and making local, situated decisions–to reach the desired item.
Teleporting requires a universal ontology. With orienteering, local ontologies with some loose level of cross linking are enough. I suspect that we’re in an orienteering driven search world for the foreseeable future, and that local solutions for specific domains will provide sufficient benefit to flourish. Adaptive Blue and Radar’s Twine are two early examples of products that take this approach. Radar’s CEO, Nova Spivack, talked to Venturebeat recently in some depth on this topic.
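Here is a small sketch of what loose cross linking between two local ontologies could look like, using the Governors of California question from above. The two dictionaries, the `person:` identifiers and the helper functions are assumptions for illustration, and the "Hypothetical" entries are placeholders rather than real data. The only thing the two ontologies share is a person identifier, and the question is answered in two small steps rather than one jump.

```python
# Local movie ontology: knows about films, nothing about politics.
MOVIE_ONTOLOGY = {
    "Terminator 2: Judgment Day": {
        "oscar_nominated": True,
        "stars": ["person:arnold_schwarzenegger"],
    },
    "Hypothetical Indie Film": {  # placeholder entry
        "oscar_nominated": True,
        "stars": ["person:example_actor"],
    },
}

# Local politics ontology: knows about offices, nothing about films.
POLITICS_ONTOLOGY = {
    "person:arnold_schwarzenegger": {"offices": ["Governor of California"]},
    "person:example_senator": {"offices": ["Senator"]},  # placeholder entry
}


def holders_of(office, politics):
    """Step 1: look up people in the politics ontology."""
    return {pid for pid, facts in politics.items() if office in facts["offices"]}


def nominated_films_starring(people, movies):
    """Step 2: follow the shared person identifiers into the movie ontology."""
    return sorted(
        title
        for title, facts in movies.items()
        if facts["oscar_nominated"] and people & set(facts["stars"])
    )


governors = holders_of("Governor of California", POLITICS_ONTOLOGY)
films = nominated_films_starring(governors, MOVIE_ONTOLOGY)
```

Neither ontology needs to understand the other's domain. Agreeing on identifiers for the entities they have in common is enough to orienteer from one to the other.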
Once again, would love to hear more from readers.