David Soares Batista - Semi-Supervised Bootstrapping of Relationship Extractors





Published on Jul 26, 2017

Semi-Supervised Bootstrapping of Relationship Extractors with Distributional Semantics by
David Soares Batista

Semi-supervised bootstrapping techniques for relationship extraction from text iteratively expand a set of initial seed relationships while limiting semantic drift. This talk presents an approach to bootstrapping relationship instances that uses word embeddings to find similar relationships. Results show that relying on word embeddings achieves better performance than using TF-IDF weighted vectors.

Relationship Extraction (RE) transforms unstructured text into relational triples, each representing a relationship between two named entities. These relationships can then be used to populate knowledge bases or build knowledge graphs, which can support several tasks, such as Question Answering.

A bootstrapping system for RE starts with a collection of documents and a few seed instances. The system scans the document collection, collecting the contexts in which the seed instances occur. Then, based on these contexts, the system generates extraction patterns. The documents are scanned again using the patterns to match new relationship instances. These newly extracted instances are then added to the seed set, and the process is repeated until a stopping criterion is met.
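
The loop described above can be sketched in a few lines of Python. This is only a toy illustration, not the system presented in the talk: the function name bootstrap, the exact-match patterns, the capitalized-word stand-in for named entities, and the example sentences are all assumptions made for the sketch.

```python
import re


def bootstrap(sentences, seeds, max_iterations=5):
    """Toy bootstrapping loop: seeds -> contexts -> patterns -> new instances."""
    known = set(seeds)
    for _ in range(max_iterations):
        # 1. Collect the text found between the two entities of each known instance.
        patterns = set()
        for e1, e2 in known:
            context = re.compile(re.escape(e1) + r"\s+(.+?)\s+" + re.escape(e2))
            for sentence in sentences:
                match = context.search(sentence)
                if match:
                    patterns.add(match.group(1))

        # 2. Scan again, using each context as an exact pattern between two
        #    capitalized words (a crude stand-in for named-entity recognition).
        found = set()
        for pattern in patterns:
            regex = re.compile(
                r"\b([A-Z]\w+)\s+" + re.escape(pattern) + r"\s+([A-Z]\w+)"
            )
            for sentence in sentences:
                for match in regex.finditer(sentence):
                    found.add((match.group(1), match.group(2)))

        # 3. Stop when an iteration yields nothing new (one possible stopping
        #    criterion); otherwise grow the seed set and repeat.
        new_instances = found - known
        if not new_instances:
            break
        known |= new_instances
    return known


# Hypothetical mini-corpus, invented for this sketch.
sentences = [
    "Google was founded by Page.",
    "Amazon was founded by Bezos.",
    "Amazon was started by Bezos.",
    "Microsoft was started by Gates.",
    "Lisbon is the capital of Portugal.",
]
result = bootstrap(sentences, {("Google", "Page")})
print(sorted(result))
```

Starting from a single seed, the first pass learns the context "was founded by" and picks up a second instance, which in turn exposes the context "was started by" on the next pass. A real system would score patterns and instances before accepting them, which is where the control of semantic drift comes in.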

Bootstrapping approaches relying on TF-IDF weighted vectors have limitations when trying to find similar instances, since the similarity between any two relationship instance vectors is only positive when the instances share at least one term. For instance, the phrases "was founded by" and "is the co-founder of" have no words in common, yet they have the same semantics. Stemming techniques can help in these cases, but only for variations of the same root word. By relying on word embeddings, the similarity of two phrases can be captured even if no common words exist. For instance, the word embeddings for "co-founder", "founded" and "creator" should be similar, since these words tend to occur in the same contexts.
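
A small numeric sketch makes the contrast concrete. The three-dimensional embedding values below are invented for illustration and do not come from any trained model; representing a phrase by the average of its word vectors is likewise an assumption of this sketch, one simple option among several. Raw term counts are used for the sparse vectors, since TF-IDF weighting only rescales the nonzero entries and would give the same zero similarity here.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def count_vector(tokens, vocabulary):
    """Sparse term-count vector over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def mean_embedding(tokens, embeddings):
    """Average the embeddings of the tokens that have one."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


phrase_a = "was founded by".split()
phrase_b = "is the co-founder of".split()

# Term-based similarity: the phrases share no token, so the dot product is zero.
vocabulary = sorted(set(phrase_a) | set(phrase_b))
term_sim = cosine(count_vector(phrase_a, vocabulary),
                  count_vector(phrase_b, vocabulary))

# Made-up embeddings in which related words point in similar directions.
toy_embeddings = {
    "founded":    [0.9, 0.1, 0.0],
    "co-founder": [0.8, 0.2, 0.1],
    "was":        [0.0, 0.1, 0.9],
    "is":         [0.1, 0.0, 0.9],
    "by":         [0.1, 0.8, 0.2],
    "of":         [0.0, 0.9, 0.1],
    "the":        [0.1, 0.5, 0.5],
}
embedding_sim = cosine(mean_embedding(phrase_a, toy_embeddings),
                       mean_embedding(phrase_b, toy_embeddings))

print("term-based similarity:", term_sim)
print("embedding-based similarity:", round(embedding_sim, 3))
```

The term-based similarity is exactly zero because the two vectors have no overlapping nonzero dimension, while the embedding-based similarity is high because "founded" and "co-founder" were given nearby vectors.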

I propose to present a system which extracts relationship instances by bootstrapping and by relying on word embeddings. It was evaluated against a popular system which relies on TF-IDF weighted vectors. The paper describing the system was presented at EMNLP'15 and won an honorable mention for the best short paper award.


PyData is an educational program of NumFOCUS, a 501(c)3 non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.
