Brian Carter: Lifecycle of Web Text Mining: Scrape to Sense





Published on May 30, 2015

Pillreports.net is an online database of reviews of Ecstasy pills. In consumer theory, illicit drugs are experience goods, in that the contents are not known until the time of consumption. Websites like Pillreports.net may be viewed as an attempt to bridge that gap, as well as to highlight instances where a particular pill is producing undesirable effects. This talk will present the experiences and insights from a text mining project using data scraped from the Pillreports.net site. The setup, benefits, and ease of using the BeautifulSoup package to scrape the data and pymongo to store it in MongoDB will be outlined. A brief overview of some interesting parts of the data cleansing will be given. Insights and understanding of the data gained from applying classification and clustering techniques will be outlined, in particular visualizations of decision boundaries in classification using the "most important variables". Similarly, visualizations of PCA projections will be detailed to illustrate cluster separation. The talk will be presented in an IPython notebook, and all relevant datasets and code will be supplied. Python packages used: bs4, matplotlib, nltk, numpy, pandas, re, seaborn, sklearn, scipy, urllib2.
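The scrape-and-store step mentioned in the abstract could look roughly like the sketch below. It is not the speaker's code: it uses Python's standard-library `html.parser` as a stand-in for BeautifulSoup so that it runs without extra dependencies, and the HTML snippet, the field names, and the `ReportParser` and `parse_report` names are all hypothetical rather than the real Pillreports.net markup. The MongoDB write appears only as a comment, since it needs a running server.

```python
from html.parser import HTMLParser

# Hypothetical markup: one report as a table of field-name / value rows.
# The real site's structure is not given in the abstract.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><td>Example Pill</td></tr>
  <tr><th>Colour</th><td>Blue</td></tr>
  <tr><th>Description</th><td>Round pill,  scored on back </td></tr>
</table>
"""


class ReportParser(HTMLParser):
    """Collect <th>/<td> pairs from a table into a dict (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._tag = None   # the <th> or <td> currently open, if any
        self._key = None   # text of the most recent <th>
        self._buf = []     # text fragments seen inside the open cell

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._tag = tag
            self._buf = []

    def handle_data(self, data):
        if self._tag:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        # Basic cleansing: collapse runs of whitespace and trim the ends.
        text = " ".join("".join(self._buf).split())
        if tag == "th":
            self._key = text.lower()
        elif self._key is not None:
            self.record[self._key] = text
            self._key = None
        self._tag = None


def parse_report(html):
    """Return one report as a plain dict, ready to be stored as a document."""
    parser = ReportParser()
    parser.feed(html)
    parser.close()
    return parser.record


record = parse_report(SAMPLE_HTML)
print(record)

# With pymongo and a running MongoDB server, the dict could be stored as a
# document, for example (database and collection names are assumptions):
#   from pymongo import MongoClient
#   MongoClient()["pillreports"]["reports"].insert_one(record)
```

Returning a plain dict per report keeps the scraping step independent of the storage step, so the same records could equally be loaded into a pandas DataFrame for the classification and clustering work.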

Brian Carter
