Miroslav Batchkarov - Gold standard data: lessons from the trenches





Published on Jul 26, 2017

The first stage in a data science project is often to collect training data. However, getting a good data set is surprisingly tricky and takes longer than one expects. This talk describes our experiences in labelling gold-standard data and the lessons we learnt the hard way. We will present three case studies from natural language processing and discuss the challenges we encountered.

It is often said that rather than spending a month figuring out how to apply unsupervised learning to a problem domain, a data scientist should spend a week labelling data. However, the difficulty of annotating data is often underestimated. Gathering a sufficiently large collection of good-quality labelled data requires careful problem definition and multiple iterations. In this talk, I will describe three case studies and lessons learnt from them. Each case shows several aspects of the process that should be considered in advance to ensure the project is successful.


PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
