Big Noise always accompanies Big Data, especially when extracting entities from the tangle of duplicate, partial, fragmented, and heterogeneous information we call the Internet. The roughly 17 million physical businesses in the US, for example, appear on over 1 billion webpages and endpoints across 5 million domains and applications. Organizing such a disparate collection of pages into a canonical set of entities requires a combination of distributed data processing and human domain knowledge. This presentation stresses the importance of entity resolution within a business context and provides real-world examples and pragmatic insight into the process of canonicalization.
Info on Strata Conference website: http://strataconf.com/stratany2011/public/schedule/detail/21389
Slides on Slideshare: http://www.slideshare.net/TylerBell/dedupe-merge-and-purge-the-art-of-normali...