My talk today is about a pretty new piece of work that I've been doing with Ines and the other members of Explosion AI, on connecting spaCy to the latest NLP models, which are these transformer models. This is the first time I've given this talk, so there may be bits that are a little rough, and as I said, the work is very much in progress, so some of the things I'll be describing are basically the intended functionality, or how things will work in the future, rather than exactly what you can use if you install the package right away.

Just as a bit of background: I'm the developer of the open source library spaCy. I've been working on natural language processing for pretty much my whole career, and I was lucky enough to get into the field when it was still quite small. I finished my PhD in 2009, and then around 2014 I decided to leave academia, basically at the point where I would have had to start writing grant proposals, and started working on this open source library, spaCy. My main colleague on this is my co-founder, Ines, who'll be giving tomorrow's keynote and who has also been working on spaCy pretty much since its first release. You can find spaCy at spacy.io. It has quite good documentation, and as I said, it's quite popular these days. It's always hard to estimate usage for an open source library, but we figure we've probably got at least 100,000 users who are using the library quite actively, and we have many stars on GitHub, around 15,000.

The other main project we do at Explosion AI, the company that Ines and I co-founded, is the annotation tool Prodigy. This is especially useful in conjunction with spaCy because it lets you train your own models for spaCy. If you want to train your own named entity recognition models or your own text classification models, Prodigy is a very easy way to do that. It's also an easy way to have a little bit of interactivity in your local data science workflow, so that you can do error analysis and have a deeper connection to the data, because when you're doing things like natural language processing, being able to look at the data and think about what to do next is often much more important than deciding which model architecture to try. And it's actually Prodigy that was a special motivation for the transformers work I'll be talking about, because these developments in transfer learning, which the spacy-transformers package will help you use, are especially useful in conjunction with an annotation tool: they let you take advantage of the fact that you now need so much less training data. As with spaCy, Prodigy is really quite popular. We've got a lot of users, including a lot of companies, and this is actually how we support the development of spaCy: it's sales of Prodigy that fund Explosion AI.

Okay, so on to what I'll be talking about here. I don't know how many of you have heard of transformer models. I won't ask people to raise their hands because generally nobody does. But there have been a lot of headlines around this and a lot of excitement. The idea here is that it has long been a goal of natural language processing to get past what's called the knowledge acquisition bottleneck.
And this is really the problem that in any language processing application, there's a lot of knowledge about word usage and about the world that generalizes across tasks and is really hard to encode for the specific task you want to do. Say you want to solve a particular problem, like sorting your support tickets into different categories. If you have to learn all of the information from the tickets you've already classified, the problem is really hard. And the obvious observation is that most of the information in those tickets is stuff anybody knows from a whole host of other background tasks; it's not information specific to the problem you're trying to solve. So what we want to be able to do is gain that knowledge from somewhere else, some general knowledge of the language, and then reuse that knowledge across different applications. In just the same way, if you were teaching somebody to do this task, you would expect them to already know the language and be able to read the tickets, and then build the specific knowledge about your classification scheme on top of that.

Although this had been a goal of natural language processing basically since the field began, efforts to use raw text in this way were not really working that well before neural networks. And even after neural networks arrived, the initial ways we could reuse raw text resources were limited to basically the dictionary level, the meaning of individual words. What's happened over the last couple of years is that we've gotten much better at importing knowledge from raw text into our applications, including knowledge about the context of words, these contextual representations. That's really what these transformer models do, and that's the sort of knowledge we want to connect here. This was all nicely summarized in a blog post by Sebastian Ruder; if you want to look it up, it's called "NLP's ImageNet moment has arrived". That's an analogy with computer vision, where people import knowledge from large computer vision tasks and reuse it in other, smaller, task-specific models.

And then of course this was covered in an article in the New York Times. It did blow my mind a little that this field I had started off in in 2009, where this was actually kind of an incremental update, ends up being news in a major publication. So you can see how the field has evolved and how this has made a mainstream splash.

In practice, the way a lot of people are using these transformer models in their applications is via a package developed by our friends at Hugging Face. This started off being called, I think, pytorch-pretrained-bert; then it became pytorch-transformers; and now it's just transformers, because it supports TensorFlow as well. It's a really quite popular library, and it gives you a pretty easy way to use pre-trained weights. The models take a long time to train on the raw text and cost quite a lot of money, several thousand dollars in many cases, but this gives you an easy way to use the model artifacts that people have developed. And even if a model was trained with TensorFlow, the library translates it into PyTorch so that you can use it in PyTorch.
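As a rough illustration (a minimal sketch using the transformers API of that era, not the exact code from the slide), loading one of those pre-trained artifacts and running it over a sentence looks something like this:

```python
import torch
from transformers import BertModel, BertTokenizer

# Download (or load from cache) the pre-trained weights and vocabulary.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Encode a sentence into word-piece IDs and run the model over it.
input_ids = torch.tensor([tokenizer.encode("Hello, transformers!")])
with torch.no_grad():
    last_hidden_state = model(input_ids)[0]  # shape: (batch, seq_len, hidden)
```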
That's a quick view of what the API looks like; basically you get an easy way to load the models up. Once this was developed, we said: all right, now that we've got this nice way to use the models in PyTorch, we want a connection to spaCy. So that's the spacy-transformers library, and this is what the usage of it looks like, and also what the pipeline looks like. As with other spaCy models, you call spacy.load to load up a pre-trained model. The models are distributed as pip packages, so an individual package can have dependencies and declare itself that way, and you can serve it out to your applications in exactly the same way as you serve all your other Python dependencies. It uses standard tools, and it's easy to work with that way. And then this nlp object can be used as a function, so you can call it with a single text. You can also call it with batches of text via nlp.pipe, and it will do the minibatching internally for efficient processing.

So, what can we do with this? Well, out of the box, before we've added a model trained for your specific task, there's actually not that much we can do. But we can already get a similarity judgment. It may not match the similarity you want for a particular application, but you can at least look at how similar the different vectors are. And you can also access the representations the transformer model has assigned to the tokens.

Then, importantly, one of the quite neat usability things we were pleased to develop concerns tokenization. The transformer models tend to use a non-linguistic tokenization scheme in order to limit the number of different vocabulary words: they tend to divide a single word into pieces, as long as the word is rare. So something like "Chennai" might end up as two tokens, "Chen" and "nai". The model does this so that it doesn't have an unboundedly large vocabulary of words to deal with. Now, of course, when you actually go to use the models, the fact that you don't have a vector representing an actual word is quite a pain for your application. So what we do in the spacy-transformers library is basically an alignment process, so that even if you want to work at the word level later on, you're able to get a word representation out of these BERT models or XLNet models, using the alignment to give you representations for the words you actually want to work with. Here you can see how the alignment works: if we've got something like "laced", which has been split into two pieces by the word-piece tokenization, we align those pieces back up to "laced". So when we're calculating the vectors, we can ask what the vector representation for "laced" is, rather than just the representations of its individual word pieces.

Then, after you add something like a text classifier model onto the end, you can use spaCy's API for text classification within your NLP pipeline. You can train this text classifier and have it backpropagate into the BERT model or other transformer model, so you're learning a very accurate text classification model on top of these transformers. And soon we'll have other pipeline components that work in the same way.
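Roughly, that out-of-the-box usage looks like this (a minimal sketch based on the spacy-transformers 0.x release; the model package name and the doc._.trf_* extension attributes follow that version's README and may have changed since):

```python
import spacy

# Load a transformer model distributed as a pip-installed package.
nlp = spacy.load("en_trf_bertbaseuncased_lg")

doc = nlp("Apple shares rose on the news.")
other = nlp("Apple sold fewer iPhones this quarter.")

# Out of the box: similarity judgments from the transformer's output.
print(doc.similarity(other))

# The model's own word-piece tokenization and raw output, exposed as
# extension attributes and aligned back to spaCy's linguistic tokens.
print(doc._.trf_word_pieces_)             # word pieces as strings
print(doc._.trf_last_hidden_state.shape)  # (n_word_pieces, hidden_width)
print(doc[0].vector[:5])                  # word-level vector via alignment

# Batches of text, with minibatching handled internally.
for d in nlp.pipe(["First text.", "Second text."]):
    print(d.vector.shape)
```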
Now, as soon as you have multiple pipeline components, there's a bit of a design challenge. That's what I'm going to talk about in this next section: the different trade-offs here and the different ways we might want this to work. So first, a little background on the way spaCy does components and the kind of component system we have. What you're able to do, at least in spacy-transformers, and this is coming soon to the spaCy library as well, is decorate a function to register it as a model architecture. Then, in your components, you can swap out model architectures in the config files, and you can basically bring your own architectures for these things.

Another nice detail here is that spaCy has its own machine learning library, Thinc, and we have a wrapper for PyTorch that you can put around PyTorch models to use them in spaCy with no copies involved. So you can use a custom PyTorch layer to power a spaCy component, even if it's not a transformer model; you can do the same thing if you just want, say, a BiLSTM tagger or some other custom PyTorch model. And as soon as TensorFlow 2 supports the DLPack format, you'll be able to do the same with TensorFlow.

What you would do if you wanted to define a new component for, say, a new NLP task you wanted to solve, is subclass spaCy's Pipe class. You define a function that returns a model; you don't really have to do much here, you can just use the registry system and pass the config through, and that's enough to instantiate the model architecture you defined earlier. Then there are the other parts of the lifecycle you'll want to define. There's a predict method, where you extract some features from the docs, like their word IDs or any other types of features you want your model to have access to, pass that representation forward through your model, and return the scores. Then you have the opportunity to set annotations based on those scores. Let's say you want custom attributes for some task spaCy doesn't support: you can define extension attributes and calculate everything needed to update the Doc objects you want to work with. And finally, you can have an update method, which calculates a weight update for your model based on some gold-standard information and a batch of documents.

So this is all very nice; this is basically how a pipeline component works in spaCy. Once you've defined it, you can add it to your spaCy pipeline. You can register it as an entry point, so you can publish a package that people can just pip install, and your component will be available and ready to use. That's how you can extend spaCy in these ways. If we imagine working through this with multiple pipeline components: let's say we've got some text. spaCy will tokenize that into a Doc object, and then inside the nlp call, or the nlp.pipe method, it will call something like the named entity recognizer and then the text classifier.
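Here's a rough sketch of that lifecycle (hedged: the Pipe base class and its method signatures follow the spaCy v2 API as I understand it, while the registry decorator is a hand-rolled stand-in for the config-driven architecture system described above, not a finalized spaCy API):

```python
from spacy.pipeline import Pipe

# Stand-in registry: a decorator that registers model architectures by
# name, so components can swap architectures via their config.
ARCHITECTURES = {}

def register_architecture(name):
    def decorator(func):
        ARCHITECTURES[name] = func
        return func
    return decorator

@register_architecture("my_tagger_v1")
def build_my_tagger(cfg):
    ...  # build and return the model, e.g. a Thinc-wrapped PyTorch module

class MyTaskComponent(Pipe):
    name = "my_task"

    @classmethod
    def Model(cls, **cfg):
        # Instantiate whichever architecture the config names.
        return ARCHITECTURES[cfg.get("architecture", "my_tagger_v1")](cfg)

    def predict(self, docs):
        # Extract features (word IDs etc.), run the model forward,
        # and return the scores.
        return self.model(docs)

    def set_annotations(self, docs, scores):
        # Write predictions back onto the docs, e.g. via custom
        # extension attributes like doc._.my_task_scores.
        ...

    def update(self, docs, golds, drop=0.0, sgd=None, losses=None):
        # Compute a weight update from a batch of docs and their
        # gold-standard annotations.
        scores, backprop = self.model.begin_update(docs, drop=drop)
        ...  # compute the gradient of the loss and call backprop(...)
```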
And as you go through, you update annotations on the Doc, and then you get a Doc object out. Cool. OK. So now, one question here is: how does this all work when we're also updating the models? Let's imagine we have our transformer models here, and we've got, say, a named entity recognition head on top of a transformer, and also a text classification head on top of a transformer. How should this work? Should we have two copies of the transformer weights, or only one? The transformer models are quite slow to run, and they require pretty expensive hardware. So do we really want to be running the model twice, especially if the two uses are quite similar, or do we want to run it once and reuse that result?

Here's a sketch of the architecture if we want to run it once and update. We'd have this component, a token vector encoder, which runs the transformer model, puts all the vectors on the Doc object, sets all of those extension attributes we had, and then passes the Doc forward. The named entity recognizer makes use of the features the token vector encoder extracted, and it's able to pass gradients back into that model so we can update it. And then similarly, we move forward to the text classifier, which can also make use of the information calculated earlier and also update that single shared representation. In other words, this can be described as multi-task learning. But it's really just a question of whether we want to run early and set state, or whether we want all of these things to be independent.

There's actually quite a general trade-off encoded here. The question is whether we want more modular components that are independent and don't depend on much external state, or whether we want a more complicated setup with more assumptions, and therefore better efficiency. And that's definitely not a question that only comes up in machine learning. This is something we have to figure out in code all the time: these trade-offs around making something more modular versus making it run more efficiently, or in some cases actually perform better in terms of accuracy.

Here's what it would look like in the alternate architecture. We'd have an individual named entity recognizer component, and it would own its own transformer weights. So we don't have any multi-task learning here, and we get some nice architectural simplicity, because we don't depend as much on previous computations. And similarly, we pass forward into a text classifier, and then we run the transformer again, a completely separate set of transformer weights. Again, this has some problems, not least of which is that if you're running this on a machine with only one GPU, you're really going to struggle with memory, because you'll have one model set up in memory and then a whole separate model loaded alongside it. Even if your card has, say, 12 gigabytes of GPU memory, the transformer models are so large that you may still run out.
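To make the contrast concrete, here's a purely conceptual PyTorch sketch of the shared option (nothing spaCy-specific; the LSTM is just a cheap stand-in for the transformer): the encoder runs once per batch, both heads consume its output, and both losses backpropagate into the single set of shared weights.

```python
import torch
import torch.nn as nn

hidden = 128
encoder = nn.LSTM(100, hidden, batch_first=True)  # stand-in for a transformer
ner_head = nn.Linear(hidden, 5)                   # e.g. 5 entity labels
textcat_head = nn.Linear(hidden, 3)               # e.g. 3 categories

tokens = torch.randn(2, 10, 100)                  # fake batch of embeddings
states, _ = encoder(tokens)                       # run the shared encoder ONCE

ner_scores = ner_head(states)                     # per-token predictions
cat_scores = textcat_head(states.mean(dim=1))     # per-document predictions

# Both heads' losses flow back into the single shared encoder.
loss = ner_scores.sum() + cat_scores.sum()        # stand-in losses
loss.backward()
```

In the separate-weights alternative, each head would own its own copy of `encoder`, so the forward pass runs twice and both copies sit in GPU memory at once.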
So you end up having to provision machines with multiple GPUs, and as your pipeline grows, as you add more of these components, you have to ask what the nature of the software you're running is and what machines you have to provision for it. The deployment and infrastructure questions start to get quite tricky, where even one little change in your config file, saying to load a different component, means a whole different set of hardware has to be provisioned. And if you over-provision the hardware, you'll find it costs thousands of extra dollars a month, because GPUs, especially in the cloud, are quite expensive.

So we have basically a dilemma. To summarize it: there are really strong code motivations for wanting a modular architecture. We want functions to be small and self-contained, and our systems are much easier to reason about if we avoid state and side effects. And of course we can compose lots of different systems from a small number of parts, so we get lots of code reuse and lots of combinations, and we can get lots of complexity resulting from a small amount of code, which is great. But on the other hand: performance. If we divide our work up into lots of small functions, we have to repeat a lot of work a lot of the time. We can't make as many assumptions about the total computation that's running, so we can't optimize as efficiently. Another thing is that without shared state, models can lose information, and that can limit the features we can use or how we can define our models, which can also mean less accuracy. Finally, I think something unique to machine learning in this sort of trade-off is that models are often not that interchangeable. The behavior of one model, even if it has the same signature, will be different if it's trained differently. And that means it actually becomes quite difficult to compose pipelines together and treat these different building blocks as interchangeable units. I think that's a good motivation to take a different approach here, and instead of making things as modular as possible, in some ways we may want to change this.

In spaCy, actually, we initially had a very non-modular architecture, because the tagger was used as a feature in the parser, and then the parser's features were used in the named entity recognizer. This defined the way the spaCy API worked initially, and it's informed the way spaCy's API has worked since. That's why you load up this one model and it gives you a whole set of components defined together: you load a whole configured pipeline, rather than an API where you might say, OK, I'll initialize a pipeline, then add the components I want to it, and then run, which in some ways would seem like a more modular design. The decision was made to give it this style because if there's only one valid combination of the components and only one valid ordering, it doesn't actually make sense to make people chant that incantation every time. If there's only one answer, you shouldn't make people code it up; you should just do it. So that's why it's worked this way, and the same trade-off is now showing up with the transformer models.
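To see what that API style looks like in practice (standard spaCy v2 usage; en_core_web_sm is the usual small English pipeline):

```python
import spacy

# spacy.load gives you a whole configured pipeline in one call, rather
# than making you assemble the components in their one valid order.
nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # e.g. ['tagger', 'parser', 'ner']

doc = nlp("The components ran in their one valid order.")
print([(ent.text, ent.label_) for ent in doc.ents])
```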
So for v2, at least, we managed to break these dependencies and make things a bit more modular. The tagger doesn't actually depend on previous state in the v2, v2.1, and v2.2 models, because, as I said, it's nice to have things modular. But now that we're moving to transformers, we actually want a slightly different approach.

So, moving towards the conclusion of the talk: the pros and cons of transformers. The transformer architecture really makes it easy to define networks for new tasks. There's a clear right answer, a clear template to follow, for how to add new heads and how to do things. You get great accuracy; the accuracy improvements have been really quite remarkable and they keep coming, so we're moving up in accuracy quite quickly, which makes the models much easier to use in applications, because you can expect decent performance, and that takes a lot of the guesswork out. A big improvement as well is that you need fewer annotated examples, which really helps you move quickly: you can iterate much faster on tasks and reason about things much more easily. But there are significant downsides too: the models are slow and expensive. They take expensive GPUs to run, and they particularly need large batches of text to operate efficiently, which means that if you're using them in a streaming environment or something, they're very hard to deploy. And the fact that the models are quite bleeding-edge, with things changing quickly in this space, is also a significant downside for using them in your applications.

For spacy-transformers, here's what the code example looks like (a reconstruction follows at the end of this section). You can plug different architectures in, and what you'll be able to do is add on a text classifier and an entity recognizer and have them use the same shared state, so that you only calculate the tensors once and then reuse them. That's a key way we want to make this work.

At the moment, you can already pip install spacy-transformers. We support textcat, so text classification models. We have the aligned tokenization working, and there's already a pretty nice system for defining custom models and custom architectures. We're most of the way through designing the named entity recognition component; the tagging component will be pretty easy to do as well, and I think the dependency parsing model should follow afterwards. Another thing we really want to get working is remote procedure calls to the transformer components, so that you can host just the transformer on a GPU server, have it do the batching for you, and have the rest of your pipeline on CPU, calling into it. I think that'll be a really effective way to ease some of these efficiency requirements, and some of the deployment requirements as well. Finally, we're very excited about having support for the transformer models in Prodigy, our annotation tool, because in many cases you only need a few hundred examples to get decent performance with these transformer models, thanks to the power of transfer learning. That means an annotation tool that's scriptable, that you can run locally and very quickly click through, saving the results out to a database running on your local machine, will be very useful, and we're very excited to have that working.
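Here's a hedged reconstruction of the kind of code on that slide, following the spacy-transformers 0.x README as I recall it: the trf_tok2vec component runs the shared transformer once per batch, and the trf_textcat head trains on top of its output, backpropagating into the shared weights.

```python
import spacy

nlp = spacy.load("en_trf_bertbaseuncased_lg")
print(nlp.pipe_names)  # includes the shared encoder, e.g. 'trf_tok2vec'

# Add a text classification head that reuses the shared transformer state.
textcat = nlp.create_pipe("trf_textcat", config={"exclusive_classes": True})
textcat.add_label("POSITIVE")
textcat.add_label("NEGATIVE")
nlp.add_pipe(textcat)

# Updates backpropagate through the head into the shared transformer.
optimizer = nlp.resume_training()
train_data = [
    ("It was great!", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
]
for text, annotations in train_data:
    nlp.update([text], [annotations], sgd=optimizer)
```

An entity recognizer added the same way would consume the same shared tensors, so the expensive transformer forward pass happens only once per batch.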
So yeah, thanks. You can follow me on Twitter at @honnibal, and Explosion at @explosion_ai.

Thanks, Matthew, for the talk. We have enough time for Q&A, so does anyone have questions? Yeah, over here, first row.

Hello, am I audible? Yeah, you're audible. OK, so my question: spaCy already had other algorithms, like trigram and CNN-based algorithms and things like that. So for a small dataset of, say, 500 to 600 thousand records, is it worth taking all the pain of transformers when things are already pretty much working?

OK, so I'm not entirely sure I got the question. Is it: if you have a small or moderately sized dataset and you're getting decent performance with spaCy, is it likely to be worth the trouble of switching over to the transformer models? Yeah. I would say probably not, especially since the fact that you have to run them on GPU really makes life difficult for many applications. On the other hand, there are lots of situations where you load up your dataset and you don't immediately get good performance, and this happens in machine learning all the time. That can actually be quite tricky to solve, because you don't know what to try next, and you don't even know whether the task can be solved. So I would say one of the great things about having the transformer models is that it's like getting a peek at the answer: you can ask whether there's an existence proof for the sort of accuracy that could be achieved. Let's say you try out a simpler model, even a bag-of-words model with scikit-learn, or a spaCy model, or some other thing, and you get, say, 70% accuracy. Then you load up the transformer model and you get 92% accuracy. It makes you think: well, OK, if my model is only getting a little better than a bag of words, what could I actually do? Maybe I should change my features a little, or just try a bit harder to train the simpler model. Another thing you can do is use the transformer model to label text for the simpler model. Even if you don't want to deploy the transformer model, you might be able to train from the 92%-accurate labels and recover some of the accuracy, getting up to, say, 84%. But if the model is already getting decent performance, there's no end of other things you can do. Typically there's a lot of other work in getting the application working correctly and trying out different ideas. So once you've hit acceptable accuracy and you're at the top end of the curve, where performance has this S shape, if you think you're not getting that much utility from extra accuracy, there's definitely no point pushing on with that, and you probably want to prioritize different work.

Hello. So Hugging Face recently released their DistilBERT model. Could you repeat that slightly? Yeah, so Hugging Face recently released their DistilBERT model and the distilled GPT-2 model, which are performance-wise much faster than the original BERT. So does spaCy already have support for that, or do we have to port it to spaCy? OK, so is the question which models we support, or whether we support DistilBERT specifically? Yes. OK. Yes, we have a package for DistilBERT, and in fact the process for creating the spaCy packages is very quick.
So there's a single script that runs, downloads the pre-trained model, and packages it up for spaCy, and we've got this added to our model automation now. So we added support for DistilBERT within about a day of it being announced, and actually most of the wait was waiting for the pytorch-transformers library to release a new version that included support for it. So I think you can expect pretty quick support for new architectures as they come out, and we already do support DistilBERT, which I'm excited to experiment with. Thank you.

Hi. Hi, this side. Hi. Here, clearly. So this is a generic question about the packaging of spaCy's pre-trained models. Often when we try to use spaCy inside Databricks or any proprietary corporate platform, since the spaCy pre-trained models don't come packaged with spaCy itself, we face problems installing them separately. So is there a plan to get rid of that?

I'm sorry, the speakers are a little bit distorted and I really didn't catch that, and I think we're actually at the end of the questions.

So when we use the spaCy pre-trained models, we have to install each language model separately after we install spaCy. Yes. So that creates a lot of problems when we try to use it in Databricks or in any proprietary system.

OK, well, the models are just tar files served out of GitHub, and you can point spaCy to a directory instead of a package. So if your deployment needs to refer to them as directories instead of as pip packages, you can do that; but you can also just download the archives, and however you're installing spaCy, you should also be able to install the models. So there should be a range of ways to solve this sort of problem.

The challenge we've seen in Databricks is that the moment we do that, every time we recreate the cluster, we have to install them again.

I'm sorry, but I think we're out of time; we can talk about this offline. So, thanks.