This tutorial will guide you through the process of analysing your textual data with topic modelling: finding and cleaning your data, pre-processing it with spaCy, and applying topic modelling algorithms with gensim, before moving on to more advanced textual analysis techniques.
Abstract
Topic modelling is a great way to analyse completely unstructured textual data, and the Python NLP framework gensim makes it easy to do. The purpose of this tutorial is to guide participants through the whole process of topic modelling: pre-processing raw textual data, creating topic models, evaluating them, and visualising them. The tutorial will also cover advanced topic modelling techniques, such as Dynamic Topic Modelling, Topic Coherence, Document Word Coloring, and LSI/HDP.
The Python packages used during the tutorial will be spaCy (for pre-processing), gensim (for topic modelling), and pyLDAvis (for visualisation). The interface for the tutorial will be a Jupyter notebook.
The takeaway from the tutorial is the participants' ability to get their hands dirty analysing their own textual data, through the entire lifecycle from cleaning raw data to visualising topics.
PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.
PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.