Okay, so this is our first time in the Bay Area, so it's nice to meet you all, and thanks for coming. I'll start by giving a quick introduction of us and some of the things we're doing, before I get to the main content of the talk, which is about spaCy, the open source library for natural language processing that we developed. The other things we develop at Explosion AI include Thinc, the machine learning library behind spaCy, which allows us to avoid depending on other libraries, keep control of everything, and make sure everything's easy to install. We also have an annotation tool that we develop alongside spaCy, Prodigy, which is what Ines will be talking about. And we're preparing a data store of pre-trained models for more specific languages and use cases, which will basically extend the capabilities of the software for more specific use cases. So, to give you a quick introduction to Ines and me, which is basically all of Explosion AI: I've been working on natural language processing for pretty much my whole career. I got into this through a PhD in computer science; I started off in linguistics and then moved across to computational linguistics. Around 2014, these technologies were getting increasingly viable, and I was also at the point in my career where I was supposed to start writing grant proposals, which didn't really agree with me. So I decided to leave. There was a gap in the capabilities available: nothing really translated the research systems into something more practically focused. Soon after I moved to Berlin to do this, I met Ines, and we've been working together on these things since. I think we have a nice complementarity of skills.
She is the lead developer of our annotation tool Prodigy, and has also been working on spaCy pretty much since the first release. Okay, so I included this slide, which we normally give when we talk to companies specifically, but I think it's a good thing to include here: this is what we tell people about what we do, how we make money, and how the company works. And I think this is a very valid question people have about an open source library: why are you doing this, and how does it fit into the rest of your projects and plans? So the explain-it-like-I'm-five version, which I guess is also the explain-it-like-I'm-senior-management version, is an analogy: it's kind of like a boutique kitchen. The free recipes we publish online are the open source software, so that's spaCy, Thinc, etc. At the start of the company especially, we were doing consulting, which I'm happy to say we've been able to wind down over the last six months to focus on our products. Then we also have a line of kitchen gadgets, which is things like Prodigy: downloadable tools to use alongside the open source software. And soon we'll have the premium ingredients, which are the pre-trained models. The thing we don't do is enterprise support, which is probably the most common way people fund open source software, or imagine they'll fund it, as a business model. We really don't like this, because we want our software to be as easy to use and as transparent as possible, and the documentation to be good. So I think it's kind of weird to have a plan where you explicitly make your free stuff as good as possible,
and then have this paid service that you hope people pay you lots of money for, but that you hope nobody actually needs. That's kind of weird, right? It's weird to have a company where you hope your paid offering is really poor value to people. So we don't think that's a good way to do it. Instead, we have the downloadable tools: something that works alongside spaCy and is useful to the people who use spaCy as well. Okay, so on to the main content of the talk, the bit that I'll be talking about. I'm going to talk to you about the syntactic parser within spaCy, our natural language processing library. So this is what its output looks like when visualized: a tree-based structure that gives you the syntactic relationships between words. The way you should read this is that an arrow pointing from one word to another means the word at the arrowhead is a child of the other word in the tree, and it's a child with the given relationship label. In other words, "Apple" is the subject of "looking", "is" is an auxiliary verb attached to "looking", and "at" heads a prepositional phrase attached to "looking". These relationships tell you about the syntactic structure of the sentence, and basically help you get at the who-did-what-to-whom relationships in the sentence, and also extract phrases. For instance, here, to make the output easier to read, we've merged "UK startup", a base noun phrase, into one unit. You can find these sorts of phrases much more easily given the syntactic structure. And just above, we've got an example of what the code looks like to actually get the syntactic structure and navigate the tree. In spaCy, you get an nlp object after loading a model.
You use that as a function that you feed text to, or pipe texts through if you've got a sequence of them. That gives you a Doc object, which you can use as an iterable. And from the tokens, you get attributes that you can use to navigate the tree. For instance, the dependency relationship is just .dep. By default that's an integer ID, because everything's coded to an integer for efficient processing, but you can get the text value with a trailing underscore, .dep_, as well. You can navigate up the tree with .head, and you can look at the left and right children too. So we try to have a rich API that makes it easy to use these dependency relationships, because just getting dependency parses is obviously only the first step: you want to actually use them in some way, and that's why we have this API. So the question that always comes up, and I think this is a very interesting question for the field in general, is: what's the point of parsing? What is this actually good for in terms of applications? Yoav Goldberg is a very prominent parsing researcher; this is the stuff he's studied for most of his career, and he's one of the better known parsing people. So it's interesting to see him and others reflect on this and say he finds it fascinating that even though parsing wins so many best-paper awards in NLP, and is a high-prestige thing to study, syntax seems to be hardly used in practice in most applications. So why is this? Is it just that, because parsing is about trees and structured prediction, it's fun to study, with all these deep algorithmic questions? Is it just catnip to researchers, and does it have an over-prominence in the field?
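Here's a small sketch of that tree-navigation API, assuming spaCy v3 is installed. Normally you'd load a pre-trained model with spacy.load("en_core_web_sm") and let the parser produce the tree; to keep this self-contained, we construct a Doc with hand-written annotations instead, using the labels from the talk's example.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")                 # blank pipeline: no model download needed
words = ["Apple", "is", "looking"]
heads = [2, 2, 2]                       # absolute index of each token's head
deps = ["nsubj", "aux", "ROOT"]         # dependency labels (root points to itself)
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

for token in doc:
    # .dep is the integer ID; .dep_ is the string label; .head walks up the tree
    print(token.text, token.dep_, token.head.text)

left_children = [t.text for t in doc[2].lefts]   # children to the left of "looking"
```

With a real loaded model, the only difference is that the parser fills in heads and deps for you: doc = nlp("Apple is looking at buying a U.K. startup").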
Or is there something deeper about it, so that we should really keep studying it? Well, I can argue both ways on this. This slide shows the case for parsing, and then I'll give a counterpoint in a second. I think the most important case for parsing is that there's a deep truth to the fact that sentences are tree-structured: they just are. The syntactic structure of sentences is recursive, and that means you can have arbitrarily long gaps between two words which are related. For instance, take the relationship between a subject and a verb, as in "syntax is": whether the subject of that verb is plural or singular changes the form of the verb, and the dependency between them can be arbitrarily long in the linear order, because you can have nested structure in between. But it can't be arbitrarily long in tree space: the subject and the verb will always be right next to each other in the tree. So you can see how, for some of these things, it should be more efficient to think about the sentence, or model it, as a tree. The tree should tell you things that you'd otherwise have to infer from an enormous amount of data; it should be more efficient in this way. So we can say: in theory, this should be important, and something we study, based on this knowledge of how sentences are structured. The counterpoint is: all right, sentences are tree-structured, and that's a truth about sentences. But it's also true that they're written and read in order. If you read a sentence, you do read it from left to right, in English anyway, or basically from start to finish, and you hear a sentence from start to finish.
And this really puts a bound on the linear complexity that you'll empirically see. Because when somebody wrote the sentence, yes, they could have used an arbitrarily long dependency, but that would mean their audience has to wait arbitrarily long between some word and the thing it attaches to. And that's not very nice, right? So empirically, it's not very surprising that most dependencies are in fact short. There are a lot of arguments that the options grammars provide are arranged so that you're able to keep your dependencies short: that's part of why you have options for moving things around in sentences to make nice reading orders, because you want short dependencies. So if most dependencies are short, then processing text as, say, chunks of one or two words at a time gives you a pretty similar view. Most of the time you don't get something dramatically different by looking at a tree instead of chunks of three or four words. So this is the counterpoint: maybe, even though sentences are in fact tree-structured, the trees aren't that crucially useful. Now, I think the part that makes it particularly rewarding to look at syntax, or particularly useful to provide syntactic structures in a library like spaCy, is that they're application-independent. The syntactic structure of a sentence doesn't depend on what you hope to do with the sentence or how you hope to process it. And that's quite different from other labels or other information we could attach to the sentence. If you're doing something like sentiment analysis, there's no truth about the sentiment of a sentence that's independent of what you're hoping to do with it.
That's not a thing that's in the text itself; it's a lens that you take on it based on how you want to process it. Whether you consider some review to be positive or negative depends on your application. It's not necessarily in the text itself, because what counts as positive or negative? What's the labeling scheme? What's the rating scheme? Exactly what are they talking about? The taxonomy you use will depend on what you're processing for. Those things aren't in the language, but details of the syntactic structure are in the language; they're just part of its structure. And that means we can provide these things once and give them to many people. I think that's very valuable and useful, and different from other types of annotation we could compute and attach. That's why spaCy provides pre-trained models for syntax, but doesn't provide pre-trained models for something like sentiment. We know how to give you a syntactic analysis that's, well, as useful as it may or may not be, depending on whether it actually solves your problems, but at least it's true and generalizable. Whereas we don't know what categorization scheme you want to classify your text into, so we can't give you a pre-trained model for that; that's your own problem. So we try to give you annotation layers which do generalize in this way, and that means there has to be a sort of linguistic truth to them. That means things like semantic roles, sentence structure, or sentence divisions are things we can do, and that's why we're interested in this.
The other thing about syntactic structures, and whether they're useful or not: in English, not using syntax is pretty powerful, because English orthography happens to cut things up into pretty convenient units. They're not optimal units, but they're still pretty nice, in a way that doesn't hold true across a lot of other languages. In the bottom right here we have Japanese, which usually isn't segmented into words. You can't just cut it up trivially on whitespace and get something you can feed into a search engine or a topic model; you have to do some extra work. And the extra work really should consider syntactic structure. You can use a technology that only makes linear decisions, but the truth about what counts as a word or not is very entangled with the syntactic structure, so there's real value in doing it jointly with syntactic parsing. For other languages you have the opposite problem. Here we have a German word, the German word for income tax return. Whether or not you want that to be one unit depends on what you're looking for. For many applications, the English phrase is actually too short: the domain object, the thing you want a single node in your knowledge graph for, really is "income tax return". But in other applications, maybe you just want to look for "tax return", and in those cases the German word will be too large and your data will be too sparse. So there are different aspects to this. In the bottom left we have an example of Hebrew; in Hebrew, Arabic, and a couple of other languages like this, there are no vowels in the text, and the words tend to have all sorts of attachments that are difficult to segment off.
So there again you have difficult segmentation problems that are all tangled up with the syntactic processing. Okay, so moving on to an example of what we can do if we recognize units longer than whitespace-delimited words and feed them into the other processing tools we have. This is a demo we prepared a couple of years ago for an approach termed sense2vec. All it is, basically, is processing text with natural language processing tools, in this case specifically spaCy, to recognize concepts that are longer than one word. Specifically, we look for base noun phrases and named entities, and we merge those into one token before feeding the text forward into a word2vec implementation, which gives you these semantic similarity relationships. This lets you search for and find similarities between phrases much longer than one word. And as soon as you do this, you find that the things you're searching for are much more specific in meaning. You're not looking for one meaning of "learning" or one meaning of "processing", which doesn't tend to be so useful or interesting; instead, you can find things related to "natural language processing", and you see: machine learning, computer vision, et cetera. These are real results that came out of the system, as soon as you did this division. We can do this for other languages as well. If we were hoping to use word2vec on a language like Chinese, you'd really want to segment it into words first. Or for a language like Finnish, you'd really want to cut off the morphological suffixes first. Okay. Incidentally, Ines has cleaned up sense2vec recently, so you can actually use it as a handy component within spaCy.
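The preprocessing idea behind sense2vec can be sketched in a few lines of plain Python. This is an illustrative sketch, not sense2vec's actual code: given phrase spans found by the parser and entity recognizer, merge each span into a single token (here tagged with its sense) before handing the text to word2vec.

```python
def merge_phrases(words, spans):
    """Merge each (start, end, tag) phrase span into one token, so word2vec
    learns a vector for the whole phrase. Illustrative sketch only."""
    starts = {start: (end, tag) for start, end, tag in spans}
    out, i = [], 0
    while i < len(words):
        if i in starts:
            end, tag = starts[i]
            # join the span into one token, annotated with its sense tag
            out.append("_".join(words[i:end]) + "|" + tag)
            i = end
        else:
            out.append(words[i])
            i += 1
    return out

tokens = ["I", "love", "natural", "language", "processing"]
spans = [(2, 5, "NOUN")]            # a base noun phrase found by the parser
merged = merge_phrases(tokens, spans)
# → ["I", "love", "natural_language_processing|NOUN"]
```

The same merging step is what makes the approach work for Chinese word segmentation or Finnish morphology: whatever unit the linguistic processing recognizes becomes one distributional token.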
You can load up a standard model and then add a component that gives you these sense2vec senses. So the token at index three would be "natural language processing", because the component does the merging for you, and then you can also look up the similarity. It's now much easier to actually use the pre-trained model and that approach within spaCy. Incidentally, we have this concept of an extension attribute in spaCy, so that you can attach your own things to the tokens: your own little markups or processing results. The underscore object is a kind of free space that you can attach attributes to, which ends up being quite convenient; a lot more convenient than trying to subclass something. Okay. For the rest of the talk, I'll give you a pretty brief overview of the parsing algorithm, and then explain how we're modifying it to work with languages other than English, so that we can broaden spaCy's support to these other languages. What we see here is a completed parse, and I'm going to talk you through the steps, the decision points, that the parser makes to derive this structure. The key aspect of the solution to keep in mind is that it's going to read the sentence from left to right and maintain some state, and it's going to have a fixed inventory of actions to choose between to manipulate the current parse state and build up the arcs.
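The extension-attribute mechanism looks roughly like this, assuming spaCy v2 or later is installed. The attribute name "sense" here is made up for illustration; you register whatever names you need.

```python
import spacy
from spacy.tokens import Token

# Register a custom attribute once; it then exists on every token
# under the free `._` namespace.
Token.set_extension("sense", default=None)

nlp = spacy.blank("en")             # just the tokenizer; no model needed
doc = nlp("natural language processing")
doc[0]._.sense = "NOUN"             # attach your own markup to a token
print(doc[0]._.sense)
```

Because the `._` space is reserved for user data, custom components and application code can share annotations without subclassing Token or colliding with spaCy's built-in attributes.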
This type of approach, which is called transition-based parsing, I find deeply satisfying, because it's linear in time: you only make so many decisions per word. And I do think it makes a lot of sense to use algorithms which process language incrementally; that's deeply satisfying, and sort of correct in a way that a lot of other approaches to parsing aren't. It's also a very flexible approach: we can do joint modeling and have it output all sorts of other structures alongside the parse tree, and that's actually what we're going to do. Already in spaCy we have joint prediction of the sentence boundaries with the parse tree, and what we're going to do is extend this so it does joint prediction of word boundaries as well. Okay, so here's how the decision process of building the tree works. We start off in an initial state. For ease of readability, we're notating the first word of the buffer, the word that's currently being focused on, with this beam of highlighting. The other element of the state is a stack. As the first action, we have an action that advances the buffer by one and puts the word that was previously at the start of the buffer onto the stack. Here's what that shift move looks like: we have "Google" on the stack, which we write up here, and the first word of the buffer is "Reader". Then another action we can take is to form a dependency arc between the word on top of the stack and the first word of the buffer. In this case, we want to attach "Google" as a child of "Reader", so we have an action that does that.
And because we're building a tree, when we make an arc to "Google", we know we can pop it from the stack: because it's a tree, a word can only have one head, one attachment point; it's not some other type of graph. So we can pop it and keep moving forward. Here's what that looks like: we add an arc and pop "Google" from the stack. Now we make the next move. Clearly we've got no words on the stack, so we should put "Reader" on the stack so we can continue. Now we're at "was", and we want to decide whether to make an arc directly between "was" and "Reader". In this case, no: we want to attach "was" to "cancelled". So we move "was" onto the stack and advance to "cancelled". Here we do want the arc between "cancelled" and "was", so we do another left-arc, and we basically continue like this. So, stepping back a bit: we've got a fixed inventory of actions, and as long as we can predict the right sequence of those actions, we can derive the correct parse. That's how the machine learning model works here: it's a classifier that predicts, given some state, what to do next. And you can imagine having other actions if we wanted to predict other aspects of the structure. In spaCy's case, we have an action that inserts a sentence boundary. It says: given the words currently on the stack, you have to take actions that clear the stack, and you're not allowed to push the next token until your stack is clear. And that means there's going to be a sentence boundary there.
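The walkthrough above can be sketched as a tiny stack-and-buffer machine in plain Python. This is a simplified illustration of the transition-based idea, not spaCy's implementation: in the real parser a classifier chooses each action from the state, whereas here we just replay the action sequence from the "Google Reader was cancelled" example.

```python
def parse(words, actions):
    """Replay a sequence of transition actions over a stack and a buffer,
    collecting (head, label, child) arcs. Simplified illustrative sketch."""
    stack, buffer, arcs = [], list(range(len(words))), []
    for action, label in actions:
        if action == "SHIFT":
            # advance the buffer, pushing its first word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":
            # attach the stack top as a child of the first buffer word;
            # pop it, since in a tree each word has exactly one head
            child = stack.pop()
            arcs.append((buffer[0], label, child))
    return arcs

words = ["Google", "Reader", "was", "cancelled"]
actions = [
    ("SHIFT", None),            # push "Google"
    ("LEFT-ARC", "compound"),   # Google <- Reader, pop "Google"
    ("SHIFT", None),            # push "Reader"
    ("SHIFT", None),            # push "was"
    ("LEFT-ARC", "auxpass"),    # was <- cancelled, pop "was"
    ("LEFT-ARC", "nsubjpass"),  # Reader <- cancelled, pop "Reader"
    ("SHIFT", None),            # push "cancelled" (the root)
]
arcs = parse(words, actions)
# → [(1, 'compound', 0), (3, 'auxpass', 2), (3, 'nsubjpass', 1)]
```

Because every word triggers at most a constant number of actions, the whole parse is linear in sentence length, which is the property the talk keeps coming back to.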
We could have other actions as well: there's work on jointly predicting part-of-speech tags at the same time as you parse, or doing semantics at the same time as syntax. You can encode all sorts of structures into this: you read the sentence left to right and output some meaning structure attached to it. And as I said, I find this a satisfying way to do natural language understanding, because it does involve reading the sentence and adding an interpretation incrementally. Okay, so that's what this looks like as we proceed through. All right, so how are we going to do the splitting up or merging of tokens? It's actually not that complicated given this transition-based framework. You can already see that in order to merge tokens, all we really have to do is introduce a special dependency label into the tree. If we want "Google Reader" to be one token, we just attach the parts with the label "subtoken", and then all we have to do is say: at the end of parsing, we consider anything connected that way to be one token. So the step from this to handling a language like Chinese is actually super simple: we just prepare the training data so that the tokens are individual characters, and things which should be one word get this structure with this label. Then, if the parser decides those things attach together, at the end you just merge them up. Splitting tokens is more complicated, because you have to have some inventory of actions that manipulates the strings.
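The post-parse merge step described here can be sketched as follows. This is an illustrative sketch under the assumption, as in the talk, that the special "subtoken" label links adjacent characters belonging to the same word:

```python
def merge_subtokens(words, arcs):
    """Given (head, label, child) arcs, merge runs of adjacent tokens
    joined by the special 'subtoken' label into single words. Sketch only."""
    # positions whose right-hand neighbour belongs to the same word
    join = {min(h, c) for h, label, c in arcs
            if label == "subtoken" and abs(h - c) == 1}
    out, buf = [], words[0]
    for i in range(1, len(words)):
        if i - 1 in join:
            buf += words[i]       # continue the current word
        else:
            out.append(buf)       # word boundary: flush
            buf = words[i]
    out.append(buf)
    return out

# Chinese input tokenized as single characters; the parser has attached
# each in-word character to its neighbour with the "subtoken" label.
chars = ["自", "然", "语", "言", "处", "理"]
arcs = [(1, "subtoken", 0), (3, "subtoken", 2), (5, "subtoken", 4)]
words = merge_subtokens(chars, arcs)
# → ["自然", "语言", "处理"]  ("natural", "language", "processing")
```

The training data for a language like Chinese then just consists of character-level tokens with these subtoken arcs, and segmentation falls out of parsing.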
I'm still working on an implementation of that which is clean and tidy, but I actually think it will be useful for a lot of English text as well, because if you have English text that's misspelled, a lot of the time things which should be two tokens get merged into one. "it's" is a particularly common and frustrating case: the verb "is" should be its own token, but "its" without the apostrophe is also a common word in English, so you need to figure out that you need two tokens, two parser states, for it. In general, you could have a statistical model that reads the sentence beforehand and makes these decisions, but that model is going to end up taking on work, doing the job of figuring out the syntactic structure of the sentence in order to make those decisions. That's why I think doing these things jointly is satisfying: instead of learning that information in one level of representation and throwing it away, only to build up the same information in the next part of the pipeline, you can do it all at once. So the joint incremental approaches, I think, are very satisfying and good. Okay, so where are we at the moment? The merge side of things works, which involved figuring out better alignments between the gold-standard tokenization and the output of the tokenizer. That's allowed me to complete the experiments for Chinese, Vietnamese, and Japanese on the CoNLL 2017 shared task benchmark, a sort of bake-off of these parsing models conducted last year. Now, in that shared task, the team from Stanford did extremely well compared to everybody else in the field.
They were some two or three percentage points better. At the moment, we're ranking at the top of what was the second-place pack: in most of the languages we come in underneath the Stanford system, but with significantly better efficiency and with this end-to-end process. In particular, we're doing better than Stanford on languages like Chinese and Vietnamese, where they had the disadvantage of using pre-processed text. They didn't do the whole task; they chose to use the provided pre-processed text so they could focus on the parsing algorithm. And that meant they had an error-propagation problem: if the inputs are incorrect because the pre-processing segmenter is incorrect, they take a big disadvantage on these languages. So, satisfyingly, doing it all at once and entangling all of these representations does have this advantage, and we're seeing that in the results we have for those languages. The other thing that's satisfying is that this joint modeling approach, deciding the segmentation at the same time as deciding the parse structure, is consistently better than the pipeline approach in our experiments. We're getting about a one to three percent improvement from this, which is about the same size as the improvement from using the neural network model instead of the linear model. So I've found it quite satisfying that the conceptually neat solution is also working well in practice. Okay. So where does this go, and what do we hope to deliver from it? How are we for time? Ah, okay. Well, this is the last slide, so, wrapping up.
Okay, so what we want to do is deliver a workflow, a user experience, where it's very easy to start with the pre-trained models for the different languages and broad application areas. We want to make sure they have the same representation across languages, so you get the same parse scheme, which folks have been working hard on and which now has a pretty satisfying solution in the Universal Dependencies. So if you're processing texts from different languages, it should be easy to find, say, subject-verb relationships or direct-object relationships, and that should work across basically any language. That way you can use these parse trees and have a level of abstraction from which language the text is in. Then, given this, you should be able to do powerful rule-based matching over the parse tree and the other annotations provided. It should be pretty easy to find information even without knowing much about the language, and to reuse rules across languages. Then, where the provided models, such as the tagger and entity recognizer, aren't accurate enough, the library should support easy updating of them, including learning new vocabulary items, without particular effort from you. And overall, we want to emphasize a workflow of rapid iteration and data annotation. The concept is that we should be able to provide things which give a broad, basic understanding of language, but there's still a need for the knowledge specific to your domain, and the training data and evaluation data specific to your problems. We want to make sure it's easy to connect the two: to start from a basic understanding of language and move forward to building the specific applications.
And that's the aspect of the intended package that Ines will be talking about now. [Audience question.] Right, yes, certainly. So the question was about the overall, most important difference between spaCy's parsing algorithm and Stanford's parsing algorithm. Among other things, the most fundamental difference is that Stanford's system is a graph-based parser. That's O(n squared), or maybe O(n cubed), in the length of the sentence. So you're unable to use that type of parsing algorithm for joint segmentation and parsing; you have to have pre-segmented text, which is why it has this disadvantage on languages, or text, which are more difficult to segment. In spaCy, we want to make sure that we only use linear-time algorithms, and that's why we only take the transition-based approach. As for other reasons why they get such a good result: other people have done graph-based models, and they're not nearly as accurate. So I hope to meet the Stanford team in the next couple of days and shake out the details of why their system is so accurate, because it actually is quite surprising. I've read their papers several times, and I can't find the one key insight that means their system performs so well. It's interesting. [Another question.] Right, yes, certainly. So the question, which is a very good one that many people have been thinking about, is: to what extent can end-to-end systems, which maybe learn things about syntax, but learn them latently and don't have an explicit syntactic representation internally, replace the need for this type of syntactic processing?
So, I would say that for any application where there's sufficient text, the current best system doesn't use a parser. That actually includes translation and other things where you'd kind of expect an explicit syntactic layer to help. If there's enough text, going straight to the end-to-end representation tends to be better. However, that does involve having a lot of text, and for most applications, creating that much training data, especially initially when you're prototyping, tends not to be such a viable solution. So the way I see it, parsing is great scaffolding. It's a very practical thing to have in your toolbox, especially when you're trying to figure out how to model the problem. Otherwise you end up in a chicken-and-egg situation: we need lots of data to make our model work well, and otherwise it just doesn't get off the ground, but how do we even know we're collecting the right data for the right model until we have that data collected and can see the accuracy? If you can take smaller steps, using these rule-based scaffolding and bootstrapping approaches, you have a much more powerful and practical set of tools. And then finally, once you have a system and you want to eke out every percent, maybe you end up collecting enough data that you don't need a parser in your solution explicitly. The question pointed to a paper that recently showed that biLSTM models don't necessarily learn long-range dependencies. I think that's probably true, but as somebody who's worked on parsing for a lot of my career, I try to remind myself not to cherry-pick results.
And even if I do find a paper that shows parsing helps on something, the overall trend is that biLSTM models which don't use parsing work well. And the fact is that long-range dependencies are kind of rare. That's basically why it's important to ask what these things are good for, rather than saying everything should use parsing, because it's true that not everything should. So, the next question: if we look at other aspects of language variation, instead of just segmentation, how does the incremental model perform? Specifically, how does it perform on free-word-order languages, perhaps ones with longer-range or crossing dependencies? Stanford's paper actually had excellent analysis of a lot of these questions. They showed that their model, which is much less sensitive to whether the trees are projective, does relatively well on those languages. As for our preliminary results: we do fine on German, and pretty well on Russian. We still suck at Finnish, and I think there's a bug in Korean; it's at like 50%. So it's a mixed bag, but I would say there are some problems to solve around projectivity. The way I'm handling it is a little bit crude at the moment. In general, there is a disadvantage we take from the incremental approach here, and there are a lot of clever solutions I'm looking into. Any more questions? So, there's a pretty good extension package for coreference resolution that has taken some of the pressure off us to support it internally.
We do think that coreference resolution is something that belongs in the library, because it has that property of being a language-internal thing. There's a truth about whether that "he" or "she" refers to that noun which doesn't depend on the application; it's just a true fact about the sentence. So we're very interested in being able to give you that piece of annotation. I wouldn't quite say the same thing about sentiment. I haven't been convinced by any sentiment scheme that's sufficiently independent of what you're trying to do that we could provide it. Instead, what we do provide is a text classification library. The text classification model we have is only one of many you might build, and it's not best for every application, but it does do pretty well on short text, and on many sentiment benchmarks it performs quite well. It's a lot slower than some other ways you could do sentiment, so it depends on what type of text you're trying to process. Oh, yes, right: specifically, the coreference resolution package you should use is called neuralcoref. It's built on PyTorch, it's overall pretty good, and you can train it yourself. And yes, it's built on spaCy. Next question: for German, I think it's pretty easy. I've been using the word vectors trained by fastText, and you can basically just plug those in. There's one command to convert them into a spaCy vocab object and load it up.
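For reference, the conversion Matt mentions is a one-liner on the command line. This is a sketch based on spaCy v2's `init-model` command; the exact command and flags have changed between versions, and the output directory and vectors filename here are placeholders.

```shell
# Download the German fastText vectors, then build a spaCy model directory
# whose vocab contains them (spaCy v2 syntax; v3 uses `spacy init vectors`).
python -m spacy init-model de /tmp/de_vectors_model \
    --vectors-loc cc.de.300.vec.gz
```

The resulting model directory can then be loaded with `spacy.load` or used as the starting point for training your own pipeline.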
We're trying to provide pretrained models which don't depend on pretrained word vectors, so that you can bring your own. Otherwise there's a conflict: the model's been trained to expect particular word vectors, and if you substitute your own, it's going to get different input representations. So training or bringing your own vectors is designed to be pretty easy, and if it's not, I apologize for the bugs and we'll try to fix them. Next question: after parsing and interpreting, do we have an interlingual representation that can then be used to generate another language? The answer is probably not. We don't have generation capabilities in spaCy. People have worked on this sort of thing, but in general, having an explicit interlingua tends to perform less well than more brute-force statistical approaches. I think the reason makes sense: languages are pretty different in how they phrase things and how they model the world, so getting a translation that's remotely idiomatic out of that sort of interlingual representation is pretty tough. And then there's another argument that you're solving a subproblem that's harder than the direct translation approach, which I'm not sure whether I buy, but it's a common one that people use. Okay, should we move on to the next talk? So yeah, we started out hearing a lot about the more theoretical side of things, and I'm going to talk about how we collect and build training data for all these great models we can now build. The nice thing about machine learning is that we can now train a system by just showing it examples of what we want, and that's great.
But the problem is, of course, we need those examples. And even if you're like, oh, I've got this all figured out, I'm using this amazing unsupervised method where my system just infers the categories from the data and I never need to label any data: that's pretty nice, but you still need some way of evaluating your system. So we pretty much always need some form of annotation. And now the question is, why do we even care about whether this is efficient, whether it works or not? The big problem is that with many things in data science and machine learning, we need to try things out before we know whether they work. So we need to expect to do annotation lots of times, and to start over from scratch if we messed up our label scheme and want to try something else. We need to do this lots of times, so it needs to work. Similarly, especially if you're working in a company, in a team, where you really want to use your model to find something out, ideally the person building the model should be involved in that process. And we always say good annotation teams are small. A lot of people don't understand this; there's a big movement towards "let's crowdsource this, get hundreds of volunteers". We always have to remind companies especially: look at the big corpora we use to train models. The good ones were produced by very few people, and there's a reason for that. More people doesn't always mean better results; quite the opposite, actually. So how great would it be if the developer of the model could actually be involved in labeling the data?
And of course, we also have the problem of specialist knowledge. Especially in industries where this matters, you might want a medical professional to give feedback on the labels or actually label your data, or maybe a finance expert. Those people usually have limited time: if you get an hour of their time, you want to use it as efficiently as possible, and you don't want to bore them to death, or end up with the one person who has nothing else to do, because their knowledge is probably not as valuable as other experts'. And another big problem, since you need humans, is that humans kind of suck: we're really not that efficient at a lot of things. For example, we have real problems performing boring, unstructured tasks, especially things that require multiple steps and multiple things we need to get right. We can't remember stuff. We're bad at consistency and getting things right. Fortunately, computers are really good at that stuff; in fact, that's probably the main reason we built computers. So there's really no need to waste the human's time by making them do things they're going to do badly anyway. Instead, we want our annotation tooling to automate as much as possible, and have the human focus on the things the human is good at and where we really need their input. That's usually context and ambiguity: most of us can look at a sentence and understand a figure of speech immediately, without thinking twice about it. That's the stuff that's really, really hard for a computer. Put differently, humans are good at precision; computers are good at recall.
The thing is, what I'm saying here sounds a bit like "floss and eat your veggies". We've probably all had some experience with labeling data. We normally give this talk to a crowd of more data-science-focused industry professionals, and you'd be surprised how many companies we talk to, including very large, technologically sophisticated companies, that mostly use Excel spreadsheets for everything. That's not inherently bad, but there are very obvious problems with Excel spreadsheets, and definitely a lot of room for improvement. Once people figure this out and realize they could do something better, or just decide it's terrible and they don't want to do it anymore, the next move is normally: let's move this all out to Mechanical Turk, or some other crowdsourcing platform. Mechanical Turk, the Amazon cloud of human labor. So people do that, and often they're surprised that their results are not very good. And the problem is: you have some guy do it for $5 an hour, you get the data back, you train your model, it doesn't work. And it's very difficult to retroactively find out what the problem was. Maybe your label scheme was bad, maybe your idea was bad, maybe the data was bad, maybe you didn't write your annotation manual properly. Maybe, and this is another nice one, you paid too much, because if you pay too much on Mechanical Turk, you attract all the bad actors, so you kind of have to stick to around half of minimum wage. Maybe your model was bad, or your training code was bad. It's very, very difficult to find out. And you also realize that it's not really just cheap click work; you need to do more than that.
So then, what most people conclude from this is: forget labeling in general, I don't want to do this anymore, let's just find some unsupervised method and not bother with it. That's actually a conversation I had recently: we talked to a larger media company, and they'd done exactly that, and now they have a few hundred clusters. And it's really great, they have really great clusters, but now their problem is that they have no idea what these clusters are. So now they need to label their clusters, and they're kind of back at the beginning. I think what we see from this is that the need for labeled data is an opportunity, not the problem. The problem is how we do it. We've been thinking about this a lot, and from our point of view there are a lot of things we could do better. One of them, to work against the problems caused by us being human, is that we need to break down the very complex things we're asking humans to do into smaller, simpler questions, ideally binary decisions. That gives much better annotation speed, because we can move through the examples faster, and we can also measure reliability much more easily than with open questions, because we can actually say: do our annotators agree or not? That's very important in the end for finding out whether we've collected the data the right way. The binary thing itself sounds a bit radical, but if you think about it, most or pretty much any task can be broken down into a sequence of binary, yes-or-no decisions. It might mean accepting that, if we're annotating a sentence for entities, we won't actually end up with gold-standard data for that sentence.
We might actually end up with only partially annotated data and have to deal with that, but we're using our humans' time more efficiently, which is often much more important. A lot of the examples I'm going to show you now are from our annotation tool, Prodigy, which we started building as an internal tool. We very quickly realized this was something that kept coming up with pretty much every company and most users we talked to. So we thought: what if we combine all the ideas we already have about how to train a model, actually use the technology we're working with within the tool, and also use our insights from user experience, how to get humans to do things most efficiently, how to get humans excited, even the whole idea of gamification, how to get humans to really stick with something, and put it all into one tool? That's Prodigy. So here we see some examples of those tasks and how we can present things in a more binary way. In the top left we have an entity task. This comes from Reddit, and we're labeling whether something is a product or not. What we did here is load a spaCy model, ask the model to label the products, and then look at them and say yes or no. There's also a mode where we can click on a span, remove it, or label something else, but either way, we don't have to do this in an Excel spreadsheet. We get one question, we look at it, and pretty much immediately we can say yes or no. The same on the right: I think this is actually a real example using the YOLOv2 model with the default categories, and we have an image of a skateboard. We can say: is this a skateboard? Yes or no.
We immediately have our annotations. And even the one in the corner, where we're not able to reduce it to a truly binary task, we can still make more efficient and easier for a human to answer, because with keyboard shortcuts you can still do maybe two or three seconds per annotation and have an answer. Or we say, hey, it's actually so fast, if we can get to one second per decision, we might as well label our entire corpus twice over, positive, negative, whatever other labels we want, and just move through it quicker. To give you some background on why we did this, and what we think Prodigy should achieve: we want to make annotation so efficient that data scientists can do it themselves. And "data scientists" here can also mean researchers, people working with the data, people training the models. Reading it like that, it still doesn't sound like fun, but the idea is to make the process efficient enough that you actually want to do it, because you don't have to depend on anyone else. You can just get the job done and see whether your idea works or not. This also means you can iterate faster. We're very used to iterating on our code, but you can actually iterate on your code and your data. You try something out, it doesn't work, you try something else. Maybe you check: is it going to work if I collect more annotations? You can try all of this out. We also want to waste as little time as possible: use what the model already knows and have the human correct its predictions, instead of having the human do everything from scratch. As a library, we really want Prodigy to fit into the Python ecosystem. We want it to be customizable and extensible in Python; you can write scripts for it. And it was a very conscious decision not to make it a SaaS tool, because we think data privacy is important.
You shouldn't have to send your text to our servers for no reason. We also think you shouldn't be locked in: you should get JSON out that you can use to train your models however you like, not some random format you then have to download from our servers. So that's where we're going with Prodigy, and here's a very simple illustration of how the app works. At the center are recipes, which are very simple Python scripts that orchestrate the whole thing. There's a REST API that communicates with the web app, so you can see things on the screen. You have your data coming in, which is text or images, and an optional model in the loop if you want that. The model communicates with the recipe: as the user annotates, it's updated in a loop and can suggest more annotations that are more compatible with the annotator's recent decisions. There's also a database and a command-line interface, so you can use it efficiently and don't have to worry about those aspects. In the corner we have a simple example of a recipe function, which really is just a Python function: you load your data in and return a dictionary of components. For example, the ID of the dataset where the annotations are stored, a stream of examples, callbacks to update your model, things to execute before the server starts. The idea is really: if you can write it in Python, you can do it in Prodigy. We also provide a bunch of built-in recipes for different tasks, with some ideas of how we think it could work, like named entity recognition. For example, you can use the model and correct its predictions, or use the model and say yes or no to its suggestions. You can use it for dependency parsing: look at an arc and annotate it.
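The recipe idea can be sketched as follows. The decorator, CLI wiring and real streams are left out so the structure of the returned dictionary is clear; the component keys follow Prodigy's documented recipe API, but the function name, example text and dataset names here are made-up placeholders, not a drop-in recipe.

```python
# Sketch of a Prodigy-style recipe: a plain function that returns a dict of
# components. In real Prodigy you would decorate it with @prodigy.recipe
# and start it from the command line.

def ner_yes_no(dataset, source):
    """Stream examples and collect binary accept/reject decisions."""

    def stream():
        # A real recipe would load `source` as JSONL and add a model's
        # suggested spans; here it's a stub with one highlighted span.
        yield {
            "text": "I just bought the new AirPods.",
            "spans": [{"start": 22, "end": 29, "label": "PRODUCT"}],
        }

    def update(answers):
        # Called in batches with the annotator's decisions, so a model
        # in the loop can learn from them as you annotate.
        print(f"received {len(answers)} answers")

    return {
        "dataset": dataset,   # where annotations are stored
        "stream": stream(),   # examples shown in the web app
        "update": update,     # callback to update a model in the loop
        "view_id": "ner",     # which annotation interface to render
    }

components = ner_yes_no("products_db", "reddit.jsonl")
print(sorted(components))  # ['dataset', 'stream', 'update', 'view_id']
```

Because the stream is just a Python generator, anything you can load or filter in Python can feed the annotation interface.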
We have recipes that use word vectors to build terminology lists, and recipes for text classification. So there's a lot you can mix and match creatively. For example, the multiple-choice interface isn't really tied to any machine learning task, but it fits into pretty much any workflow you might be doing. And evaluation is also something we think is very, very important and is often neglected, especially in industry use cases. A/B evaluation is actually a very powerful way of testing whether your output is really what you want it to be. So here we see an example of how you can chain different workflows together, all using models, word vectors, things you already have, in order to get where you want to go faster. A simple example: we want to label fruit. It's kind of a silly example, because I can't think of many use cases where we'd actually want to do that, but it makes a great illustration. So we start off saying, okay, we want fruit. What are fruit? We have some examples we can think of: apple, pear, banana. We also have word vectors, which will easily give us more terms similar to these three fruit terms. We build up this terminology list by just saying yes or no to what comes out of the word vectors. Then we look for those terms in our data and say whether "apple" in this context is a fruit or not. We're not just labeling every fruit term as a fruit entity, because it could be Apple, the company, but we get to look at each one, and it's much more efficient than asking a human to sit there and highlight every instance of fruit nouns in your text.
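The fruit workflow chains a couple of the built-in recipes on the command line. These invocations follow the general shape of Prodigy's CLI, but recipe arguments vary between versions, so treat the dataset names, vector package and file paths as illustrative.

```shell
# 1) Bootstrap a terminology list from word vectors by accepting or
#    rejecting terms similar to the seeds "apple, pear, banana".
prodigy terms.teach fruit_terms en_vectors_web_lg \
    --seeds "apple, pear, banana"

# 2) Convert the accepted terms into match patterns, then confirm each
#    candidate FRUIT entity in context (Apple the company gets rejected).
prodigy terms.to-patterns fruit_terms fruit_patterns.jsonl --label FRUIT
prodigy ner.teach fruit_db en_core_web_sm my_text.jsonl \
    --label FRUIT --patterns fruit_patterns.jsonl
```

Each step only asks yes-or-no questions, which is what keeps the whole chain fast.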
And this leads to one of the main aspects of the tool, the workflows we're especially proud of and that we think can really make a difference: we can start by telling the computer more abstract rules about what we're looking for and then annotate the exceptions, instead of starting from scratch. We can even use the technology we're working with to build these rules semi-automatically, using word vectors and other cool things we can now do. And then, of course, we can specifically look at the examples the statistical model we want to train is most uncertain about. We try to skip the predictions where we can be pretty sure they're correct, and ask the human first about the stuff that's 50-50, where human feedback makes the most difference. So here's a quick example. Say we want to label locations. We start off with one city, San Francisco, and then look at what else is similar to that term. These are actually real suggestions from the sense2vec model Matt showed earlier. And the nice thing here is that we're using word vectors, not a dictionary. So we'll get California, and maybe University of San Francisco, but we're not going to annotate California rolls, because we're already in a vector space and we know that what we're actually looking for is at least similar to the real meaning of the word. A lot of these are super trivial to answer. We can accept them, reject them, or ignore them if a term is too ambiguous and we don't want it in our list because it can mean too many things. And then from here we can create a pattern that uses spaCy's token attributes, in this case the lowercase form of the token, and the label GPE, which stands for geopolitical entity, so anything with a government.
And that's what we're trying to label, so we can easily build up these rules very quickly, semi-automatically, and then we have a bunch of locations we can match in our text. Here it found a mention of Virginia, which we can then accept. That's a very, very simple example, but this also works for slightly more complex constructs where we can really take advantage of the syntactic structure. This one was a finance example: we want to extract information about executive compensation. Some executive receives some amount of money in stock, for example, like this one. This is a pretty difficult task, but the idea is: we have this theory that if we could train a text classification model to predict whether a sentence is about executive compensation or not, we could then very easily use what we already know about the text to extract, say, the first person entity and the amount of money, put that in our database, and we've actually found a good solution for an otherwise very complex task. For this, and this is just an idea, we haven't tried it in detail, one possible pattern using the token attributes we have available would be: look for an entity of type PERSON, followed by a token with the lemma "receive", so "received", "receives", "receiving", followed by a token with the entity type MONEY, and just look at what this pulls up. There are plenty of other possible patterns you could come up with, and the nice thing is that we're going to be looking at what comes back in context, so the patterns don't have to be perfect. In fact, it's fine even if a pattern pulls up random stuff that you realize is totally not what you want.
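The compensation pattern described above can be written down as data. The list-of-dicts syntax below is spaCy's Matcher pattern format; the toy single-token matcher and the hand-tagged tokens are stand-ins so the example runs self-contained, whereas real spaCy tokens carry these attributes via `token.ent_type_` and `token.lemma_`.

```python
# A PERSON entity, then a token whose lemma is "receive", then a MONEY
# entity, in spaCy Matcher pattern syntax.
pattern = [
    {"ENT_TYPE": "PERSON"},
    {"LEMMA": "receive"},
    {"ENT_TYPE": "MONEY"},
]

def match(pattern, tokens):
    """Toy matcher: slide the pattern over the tokens; a token matches a
    pattern entry if every attribute in the entry agrees."""
    hits = []
    n = len(pattern)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(all(tok.get(key) == value for key, value in spec.items())
               for spec, tok in zip(pattern, window)):
            hits.append((i, i + n))
    return hits

# Hand-tagged sentence: "Smith received $5m in stock."
tokens = [
    {"text": "Smith",    "LEMMA": "Smith",   "ENT_TYPE": "PERSON"},
    {"text": "received", "LEMMA": "receive", "ENT_TYPE": ""},
    {"text": "$5m",      "LEMMA": "$5m",     "ENT_TYPE": "MONEY"},
    {"text": "in",       "LEMMA": "in",      "ENT_TYPE": ""},
    {"text": "stock",    "LEMMA": "stock",   "ENT_TYPE": ""},
]

print(match(pattern, tokens))  # [(0, 3)]
```

Matching on the lemma is what makes one pattern cover "received", "receives" and "receiving" at once.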
This is also very important because you won't only be collecting annotations for the things you know are definitely right; you're also collecting annotations for things that look very, very similar to what you're looking for but actually aren't. And that's probably just as important as the positive examples. So the moral of the story is: as programmers we're very used to iterating on our code, but you should really be doing both, because the data is just as important. As we see here, in the normal type of programming you work on the source code, you compile it, you get your runtime program. If you don't like something about your program, you go back, change the source code, compile it again, and so on. That's a pretty standard workflow. In machine learning we don't have a runtime program in that sense; we have a runtime model. So the part we should really be thinking about and working on is the training data. Instead, most focus is currently on the training algorithm, and in this analogy, that's very much like going and tweaking your compiler because you're not happy with your runtime program. You can do that, but of course you'd normally go back and edit your source code. And I think this analogy is actually pretty accurate: there are only so many training algorithms, and what really makes the difference is your data. If you have a good, fast way of iterating on that data, and you're able to really master that part of the problem, you also get to try more things quickly. Because as we know, most ideas don't actually work.
It's one of those things that's kind of misrepresented: a lot of people have this idea of, oh, you're doing all these amazing AI things and everything just works. It kind of doesn't; almost nothing works. Sometimes things do work, and you really want to find the things that actually work, and for that you need to try them. It also means that if you can figure out what works before you invest in it, you can be more successful overall, because you're not going to waste your time scaling up things that were never going to work in the first place. One thing that's also very important to us is that you can build custom solutions, solutions that fit your use case exactly. And if you collect your own data, you keep it forever and nobody can lock you in. You're not just consuming some API where, if the API shuts down, you have to start again from scratch. You have your data, and no matter what other cool things become possible in the future, you can always go back to your labeled data and build your own systems. We believe this is really important for the future of the technology, and it's also a reason why we think AI development in companies should generally be done in-house. And we're hoping we can keep providing useful tools that make this easier. Okay, so the question is: Jeremy thinks we write very good software even though we're only two people, and how are we doing that? That's a very good question; we do get this a lot. I don't even know where the idea comes from that you can just scale things up.
I don't think scaling things up necessarily makes things better; getting more people involved can actually have a very negative impact on the quality of the software you're producing. In our case, it just works. I also don't like the idea that everyone can do exactly the same thing if they just work hard, even though people like thinking of it that way. In our case, we have a good combination of things we like to do and things we happen to be good at, and it works together. So I guess we're lucky in that way, but we also cut out a lot of nonsense: the amount of meetings we don't take, the amount of events we don't go to. It's kind of ironic saying that while speaking at an event, but I really don't normally go to many events. We don't take coffee dates with random people we barely know. We mostly just like to write software, and we've had some good ideas in the past. So the next question is whether we've done any experiments comparing the binary decisions, and whether they influence the annotators, versus doing everything from scratch. We haven't done experiments specifically focused on the bias; that's difficult in some sense, because we're mostly looking at the output, at whether it improves accuracy. We've done experiments on manual annotation versus binary annotation, but mostly focused on our own tooling, because we think it's kind of useless to present a study saying: we did stuff in an Excel spreadsheet, then we did stuff in Prodigy, and it was much better. So it's mostly focused around our own tooling, and what we did find is that it depends on the task you're doing. That's the other thing.
I feel like giving these answers sometimes sounds unsatisfying, because I'm always saying, well, it depends on your data, but that's also the whole point: we're doing this because your data is different and there's no one-size-fits-all solution. Essentially, we found that binary annotation works especially well if you already have a pre-trained model that predicts something, ideally something that's not completely terrible. Otherwise, the pattern approach works very well on limited, very specific domains. One example: we labeled drug names on Reddit, on r/opiates, which was a pretty good data source because it's a very specific topic, and it's a subreddit that stays very on topic, because people who go on Reddit to discuss opiate use are usually very dedicated to talking about that one topic. So we labeled drug names, drugs and pharmaceuticals, in order to have a better tool set to really analyze the content of the subreddit and see how it develops over time. There the pattern-based approach worked very, very well, because we have very specific terms and we can use word vectors to bootstrap them, including spelling mistakes, which was very interesting. We can really build up good word lists, find the terms in text, confirm them, and get to pretty decent accuracy.
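The bootstrapping step described here, growing a term list from a few seed drug names via word-vector similarity, can be sketched roughly like this. This is a minimal illustration of the idea, not Prodigy's actual implementation; the `expand_terms` helper and the toy vectors are made up for the example:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two dense word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def expand_terms(seeds, vectors, threshold=0.9):
    """Grow a seed term list with vocabulary items whose vectors are close."""
    out = set(seeds)
    for word, vec in vectors.items():
        if word not in out and any(
            seed in vectors and cosine(vectors[seed], vec) >= threshold
            for seed in seeds
        ):
            out.add(word)
    return out

# Toy three-dimensional vectors: the misspelling sits near the real name.
vectors = {
    "oxycodone": (1.0, 0.1, 0.0),
    "oxycodon":  (0.95, 0.15, 0.05),  # common misspelling
    "kratom":    (0.1, 1.0, 0.2),
    "reddit":    (0.0, 0.1, 1.0),
}
terms = expand_terms({"oxycodone"}, vectors)
print(sorted(terms))  # ['oxycodon', 'oxycodone']
```

The confirmed terms would then feed a pattern matcher that highlights candidate mentions in new text for the annotator to accept or reject.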
I would expect this to work a little less well, the cold start problem, on a much more ambiguous domain, and there you're probably better off saying, okay, we're labeling by hand. But even there, and this is something I haven't really shown in detail here, we have a manual interface where you highlight spans, and what we do there is use the tokenizer to pre-segment the text, so you don't have to sit there and highlight pixel-perfect and then, ah, shit, now I got the whitespace in, let's start again. So that's another thing we're doing: you can be much lazier in highlighting, get more efficiency out of it, and still use a simpler interface. Okay, so the question refers to the earlier example of annotating patient data, which is obviously problematic because doctors are not always very specific in what they fill in, and asks how you would enrich that, whether we have experience in the medical field. The answer is, we haven't personally done this, but we do have quite a few companies in that domain, partly because the tool itself is quite appealing there: you can run it in your own compliant environment, which covers the data privacy aspect. But it would definitely be interesting to explore. That's maybe also where getting the medical professionals more involved might make sense, which is normally very difficult. You don't want a doctor to do all the work themselves, but if you can find some way to distill it and then ask the doctor, okay, you wrote X here, does that mean Y? And the doctor says yep, or the doctor says nah. If you can try that out and extract some information, that could be one idea to solve it. I can definitely see that.
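The tokenizer pre-segmentation trick can be illustrated with a small sketch: snap a sloppy character-level highlight to token boundaries, so a selection that starts on whitespace or cuts a word short still produces a clean span. A naive whitespace tokenizer stands in for spaCy's here, and the function name is made up:

```python
def snap_to_tokens(text, start, end):
    """Snap a sloppy character selection to token boundaries."""
    # Compute (start, end) character offsets for each whitespace token.
    spans, pos = [], 0
    for tok in text.split():
        s = text.index(tok, pos)
        spans.append((s, s + len(tok)))
        pos = s + len(tok)
    # Keep every token the selection overlaps at all.
    hit = [(s, e) for s, e in spans if s < end and e > start]
    return (hit[0][0], hit[-1][1]) if hit else (start, end)

text = "took 30mg of oxycodone yesterday"
# The highlight starts on a space and cuts the last word short...
print(snap_to_tokens(text, 4, 20))  # (5, 22) -> "30mg of oxycodone"
```

The annotator's rough swipe lands on the full tokens, which is why the lazier highlighting still yields exact spans.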
Right now we don't have built-in logic for that, although we are working on it. Oh, sorry, I forgot to repeat the question: inter-annotator agreement, whether you can calculate that and incorporate it into your model. So we're actually working on an extension for Prodigy that's much more specifically for managing multiple annotators, because we really designed the tool as a developer tool first and scaling it up second. But since you have the binary feedback, if you have an algorithm you want to use and you know what you want, you can already do this fairly easily, because you can download all the data as JSON. You have a key, "answer", which is either accept, reject, or ignore. You can attach your own arbitrary data, like a user ID, and then it's fairly trivial to write your own function that takes all of this, reads it in, computes something, and uses it later on. So that's definitely possible, and it's also something we're really interested in exploring and working on. And the binary interface is great for this, though of course we're the ones telling you it's great. That's a big advantage of the binary interface: there are pretty much only two options, you filter out the ignored ones, and then you can really answer that question. Yeah, so the question points out that one interface I showed, the sentiment one with the multiple selections, is not binary. That's true. And actually, it's something we usually tell our users: avoid this as much as possible if you can. In some cases you might still want it, but we say, look, a lot of people still think of surveys when they think of annotating data.
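As a sketch of what that do-it-yourself function could look like, here's a pairwise percent-agreement calculation over exported annotations. The "answer" values accept/reject/ignore match what's described above; the example-id and annotator-id field names are hypothetical, standing in for whatever arbitrary data you attached yourself:

```python
import json
from collections import defaultdict
from itertools import combinations

def pairwise_agreement(records):
    """Raw percent agreement between annotators on shared examples.

    Each record needs an example id, an annotator id, and an 'answer' of
    'accept' / 'reject' / 'ignore'; ignored answers are dropped first.
    """
    by_example = defaultdict(dict)
    for rec in records:
        if rec["answer"] != "ignore":
            by_example[rec["example_id"]][rec["annotator_id"]] = rec["answer"]
    agree = total = 0
    for answers in by_example.values():
        for a, b in combinations(sorted(answers), 2):
            total += 1
            agree += answers[a] == answers[b]
    return agree / total if total else None

# Exported annotations, one JSON object per line (JSONL).
lines = [
    '{"example_id": 1, "annotator_id": "a", "answer": "accept"}',
    '{"example_id": 1, "annotator_id": "b", "answer": "accept"}',
    '{"example_id": 2, "annotator_id": "a", "answer": "accept"}',
    '{"example_id": 2, "annotator_id": "b", "answer": "reject"}',
]
print(pairwise_agreement(json.loads(line) for line in lines))  # 0.5
```

A chance-corrected statistic like Cohen's kappa would follow the same shape; the binary answers make either computation straightforward.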
And I really get where that's coming from, but I think if you can leave that mindset, open up a bit, and think of other creative ways, you can get more out of this. If you want to re-engineer a survey, maybe you want to use a survey tool. So for example, if I were doing this with those four options, I would say, okay, we have all the texts, the annotator sees every text four times, and each time answers one question: is this happy or is this not happy? Because you can get down to one second per annotation, that's very fast. Even if you have thousands of examples, you can do this in a day yourself. That's how we would probably solve it. It also means you see every example four times, and for each text you know: is it sad, is it happy, is it neutral, is it something else? You have much more data. But not everyone wants this. Some people really want to build that survey, and we let them. So the question is, if you're doing the same example multiple times, whether that slows down the annotation. It's difficult to say, because it depends, but I've actually found that even if you do the bare maths, it can easily be much faster. Say you have 1,000 examples. If you really have to think about five different concepts that are maybe not even fully related, every tiny bit of friction you put between the human and the interface, or the decision, can very significantly slow down the process. You think, oh, is this happy, or is this sad, or is this about sports, or is this about horses? Just that can easily add ten seconds to each question. So if you do the whole thing four times at one second each, you're still faster than you would have been with that friction added. And the other part is just human error.
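The bare maths here is simple enough to write down. The numbers are the illustrative ones from the talk: roughly one second per binary decision, versus roughly ten extra seconds of friction per example when the annotator has to weigh all four labels at once:

```python
n_examples = 1000
n_labels = 4                 # happy / sad / neutral / other
binary_decision = 1.0        # seconds per single yes/no question
friction = 10.0              # extra seconds when juggling all labels at once

# Four fast binary passes vs. one slow multiple-choice pass.
four_binary_passes = n_examples * n_labels * binary_decision
one_complex_pass = n_examples * (binary_decision + friction)
print(four_binary_passes, one_complex_pass)  # 4000.0 11000.0
```

So even seeing each example four times, the binary passes come out well ahead, before accounting for the lower error rate of simpler decisions.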
If you have to think too much, you're much more likely to fuck it up and do it badly, and that's also something you want to avoid. Yeah, so the comment was that the active learning helps a lot here, and that when the model is already pretty confident about a label, you'd otherwise just not capitalize on that. To repeat it: the active learning also makes a difference here, because you can pre-select the examples that really make a difference to annotate, and you don't have to go through every single one that isn't as important as the ones you really care about. So, the next question is about tasks that need a lot of context, like a whole medical history or a whole document, and whether we have experience with that. In general, we say that if your task requires so much context that you can't fit it into the Prodigy interface, that doesn't mean you can't train a model on it. But for the tasks users most commonly want to do, needing that much context is often an indicator that the task will be very, very difficult to teach a model at all. If you're doing named entity recognition, or even text classification, and you need a lot of context, and all the context is equally important, that's often an indicator that it might not work so well. So for text classification, for example, we start by selecting one sentence from the whole document, and then instead of annotating the whole document, you say: this is the most important sentence, does this label apply or not?
So there are some tricks we use to get around this problem, because we also think it's important to get this across and frame it that way: if you need two pages on your screen, it's not efficient at all, and even if you can do all that work, your model won't learn from it, because the model needs local context as well, at least for the tasks we're presenting. I don't know if you had anything to add to that. Okay, right. So the suggestion was to have some tools, some process that goes along with the software and helps people break this down. We've actually been thinking about this a lot, because we do realize the tool is quite new and we're introducing a lot of new concepts at once, plus some best practices where we think, ah, that's how you should do it, or you could try this. And we're also realizing that there's no real satisfying one-size-fits-all answer; everyone's use case is different. So right now what we have is a support forum for Prodigy where we answer people's questions, and actually a lot of users share what they're working on and ask for tips. We talk about it, other users come in and say, oh, I actually tried this type of medical or this type of legal annotation and here's what worked for me, and there's an exchange around it to figure out what works. I think in machine learning and deep learning a lot of the best practices are still evolving, and they're very use-case-specific. So we're open to suggestions there as well, but we're still in the process of coming up with a good set of best practices and ideas. The next question is whether we have any plans to sell models, like medical models. Yes, as part of what Matt mentioned in the introduction, we're definitely planning more of a model store, sort of an online store for very, very specific models.
So medical is a very interesting domain, and we really want to get specific, like medical texts in French or Chinese, and go in that direction, because we believe pre-trained models are very valuable. Even for medical texts, you can start off with a pre-trained model and then use a tool like Prodigy, or something else, to fine-tune it on your very specific context, have word vectors in it that already fit your domain, and maybe update those as well. We think this is a very future-proof way of working with these technologies. So the next question is about the text classification model we're using in Prodigy, and for more details on that: what we're using is spaCy's text classification model, that's what's built in. But this question is a good one, because what's important to note is that Prodigy itself comes with a few built-in recipes that are basically ideas for how you could train a text classifier. You could use spaCy, but it's definitely not tied to it; the tool itself is really the scaffolding around the model. So if you say, hey, I wrote my own model using PyTorch and I'd like to train it, all you need is one function that takes examples and updates your model, and one function that takes raw texts and outputs a score for each text. You provide those to Prodigy, and then you can use the same active learning mechanism as you would with a built-in model. So the models we ship are really just a suggestion, an idea you can use to try it out, and ultimately we hope that people will transition to plugging in their own models and just using the scaffolding around them. We definitely don't want to lock anyone in and say, oh, you have to use spaCy, especially for NER and other things.
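A toy version of that two-function contract might look like this. The class and method names are illustrative, not the real Prodigy API; the "model" is just a keyword counter so the sketch stays self-contained, but a PyTorch or scikit-learn model would expose the same two functions:

```python
class KeywordModel:
    """Toy stand-in for a real model: scores texts by known keywords."""

    def __init__(self):
        self.keywords = set()

    def predict(self, texts):
        """Function 1: take raw texts, output one score per text."""
        return [min(1.0, 0.2 * sum(w in self.keywords for w in t.lower().split()))
                for t in texts]

    def update(self, examples):
        """Function 2: take annotated (text, answer) examples, update the model."""
        for text, answer in examples:
            if answer == "accept":
                self.keywords.update(text.lower().split())

model = KeywordModel()
model.update([("cheap watches deal", "accept")])
print(model.predict(["watches deal today", "hello friend"]))  # [0.4, 0.0]
```

The annotation loop only ever calls these two functions, which is what makes the model swappable: the tool sorts the stream by the scores, collects binary answers, and feeds them back through the update function.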
We think Spacy is pretty good, but if you don't want to do that or for other use cases, especially text classification, we think a lot of cases where you might want to use scikit-learn or bubblewabbit, yeah, or what a great name. Yeah, basically something completely custom. Yeah. So the question is active learning part, whether this is built on the underlying model or not, so a question is active learning versus no active learning how well this works. First, also maybe as a general introduction, so what we're doing for most of these samples is we use a basic uncertainty sampling. That's what we found works best, but we also know there are lots of other ways you could be solving that. So in the end, how we implement this is we have a simple function that takes a stream and outputs a sorted stream based on the DSI scores and the model in the loop. So how you wire this up again is also up to you and to answer the part about what works best, in general, in our kind of framework where you really see one sentence at a time and often you start off with a model not knowing very much, the active learning component basically resorting the stream is actually very crucial because otherwise if you start from scratch, have very few examples, you'll be annotating for a very, very long time and all kinds of random predictions, you annotate your stream and order, there's very little, you need some kind of guidance that tells you okay, what to work on next, especially if you feed in millions of texts, you need to sort them, you need to pre-select them based on something and this could be the model's predictions, this could be something else, this could be the keywords or the patterns, but without that, yeah, it's very, very difficult and that's also, yeah, that's kind of what we're trying to solve with it all. Okay, thank you so much, Innes and Matthew. 
I gotta say, anybody who's using fastai: any time you've used fastai's NLP tools, fastai.text, you've called the spaCy tokenize function; you're using spaCy behind the scenes. And the reason you're using spaCy is because I tried every damn tokenizer I could find, and spaCy was so much better than everything else. The story of fastai's development is that over time I get sick of all the shitty parts of every third-party library I find and gradually rewrite them myself, and the fact that I haven't rewritten spaCy, or attempted to, is because I actually think it's one of those rare pieces of software that doesn't suck at all; it's actually really good. It's got good documentation, it's got a good install story and so forth. And I haven't used Prodigy, but just the fact that these guys are working on and recognizing the importance of active learning, and of combining human plus machine, puts them in that rare category of people who, in my opinion, are actually working on one of the most important problems today. So thank you both so much for coming and for this fantastic talk, and I look forward to seeing what you do next. Thank you. Thank you.