Text By the Bay 2015: Stephen Merity, A Web Worth of Data: Common Crawl for NLP

Published on Jun 10, 2015

The Common Crawl corpus contains petabytes of web crawl data and is a treasure trove of potential experiments. To introduce you to the possibilities that web crawl data has for NLP, we will take a detailed look at how the data has been used by various experiments and how to get started with the data yourself.
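One common way to get started is Common Crawl's public CDX index API at index.commoncrawl.org, which, given a URL pattern, returns the WARC file, byte offset, and record length for each captured page. The sketch below only builds such a query URL; the crawl ID is an example (real IDs follow the `CC-MAIN-YYYY-WW` pattern) and the helper name is mine, not part of any official client.

```python
from urllib.parse import urlencode

# Example crawl ID -- available crawls are listed at index.commoncrawl.org
# (this particular ID is an illustrative assumption).
CRAWL_ID = "CC-MAIN-2015-18"

def index_query_url(url_pattern, crawl_id=CRAWL_ID):
    """Build a CDX index API query for the given URL pattern.

    The JSON response lists, per capture, the WARC filename, offset,
    and length, which can then be fetched as an HTTP byte range.
    """
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"

print(index_query_url("commoncrawl.org/*"))
```

Fetching only the byte range named in an index result, rather than whole WARC files, is what makes small-scale experiments on the corpus practical.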

Stephen Merity is responsible for crawling billions of pages a month at Common Crawl, a non-profit that provides petabytes of web data free of charge. Prior to joining Common Crawl, Stephen worked with Freelancer.com and Grok Learning in Australia. He holds a Master's of CSE from Harvard University and a Bachelor's (Honours) in NLP from the University of Sydney.

The Scalæ By the Bay 2016 conference is held on November 11-13, 2016 at Twitter, San Francisco, to share the best practices in building data pipelines, with three tracks:

* Functional and Type-safe Programming
* Reactive Microservices and Streaming Architectures
* Data Pipelines for Machine Learning and AI
