OK, so first of all, am I mic'd correctly? OK, great. So yeah, this is a talk about two technologies that we've built, spaCy and Prodigy, but it's not primarily a technology talk. The genre is not really a library talk. It's actually about much more general considerations that arise whenever you try to apply technologies like natural language processing, but also computer vision or other machine learning things, to new problems. And there's an under-discussed design consideration that arises here that I think is actually very important and that I'd like to highlight. So I guess it's kind of more like an opinion talk that I hope will be useful to people regardless of which toolchain they happen to be using. So, a quick orientation around the activities of the company, to basically help you get your bearings. I'm a co-founder of a very small digital studio company called Explosion AI. We make spaCy, this open source library for natural language processing, and we have a few other projects associated with it. So spaCy has its own machine learning library that powers it, called Thinc. And in particular, we have an annotation tool called Prodigy, which people use alongside spaCy and also other tools, to basically help them create new annotations and adapt models to their requirements. And then finally, we're going to be releasing a store of pre-trained models that people will be able to use alongside spaCy as well. So these technologies are used in a variety of companies, and they're primarily designed to make it easier to adopt technologies that are at the cutting edge of our understanding and put them into production quicker. So the work that I'll be talking about today is joint work with Ines Montani, my co-founder, and you can see a little background about us.
I've been working on spaCy since 2014 and have been working on natural language processing for most of my career. OK, so the analogy that we give to people about how spaCy works and how the company works, which you can hear more about in Ines's keynote after this, is that the open source software is kind of like free recipes that are published online. We initially did some consulting alongside this, which you can see as kind of like catering. And then this tool Prodigy is kind of like a kitchen gadget that you can use alongside the recipes. So that's how the things fit together as a set of offerings. OK, so as I guess befits an opinion piece, this is the motivating statement of the talk, and it does take a little bit of explanation. The concept here is that projects that use natural language processing are like startups: they fail a lot. What I mean by this is that there'll be a few projects that use NLP or other machine learning that are wildly successful, and the rest will basically struggle. And if you think about why this might be true, you can imagine that the world would look very different if it were the case that natural language processing projects usually worked. Imagine if it were just easy to take any process that involves natural language in an office or business situation and automate it. Well, natural language is really the underpinning of the human information system, right? So there's enormous potential in these technologies, and indeed they often do succeed wildly. But it must be difficult to do this, otherwise the world would look very different from what it does.
And so the question is: if natural language processing projects do fail a lot, then what's the cause of that failure? What's the thing that makes this so hard? Okay, so we can turn this around a little and ask, slightly flippantly: what would it look like if we were trying to maximize an NLP project's risk of failure? Well, we would start off by just deciding what the application ought to do, and we'd really want to be ambitious, because nobody ever changed the world by doubting whether something would work. So let's just start with the vision, right, and leave the technology for later. Then the next step would be to forecast. Okay, we've got a vision of what we want our app to do, so what accuracy do we think we'll need from the technology to drive this? What do we need to make this work? And if we don't know, let's just say 90%. I mean, sounds about right. Then there are some details we don't care so much about. So next we'll just outsource the data collection. It's just click work; we'll pay somebody else to get the data. We'll think carefully about the requirements that we've stated and decide that we need 10,000 rows, for some reason. And then, having got that, we can start the real work, the part that everybody talks about when they talk about machine learning projects. This is the beautiful iterative sequence of tinkering where we implement a network and, you know, tensor all our flows and descend every gradient, optimize everything, tweak our hyperparameters, and come up with something that fits our 10,000 rows beautifully well. And then, you know, let's hope that it works. And if it doesn't, well, I hope that we have somebody to blame.
So if this is what it looks like when we fail, I hope you can already see a few reasons to be suspicious about why this might not work so well, and I'll flesh that out a little. But first I want to say that we shouldn't accept this risk of failure even if we acknowledge that it's real. We can accept that, empirically, this is a high-risk activity, but I still want us to keep our eyes clear that failure sucks, right? We still want to minimize it. We don't want to embrace methodologies that make it more likely that we fail. Even if many projects fail and that's a reality of the situation, it doesn't mean we just say, oh well, embrace this and move on. No, failure sucks. We want to fail less. So how do we do this? Okay, we can start thinking about this as a kind of hierarchy of needs. And I think at the base of this pyramid, the sort of core food-and-shelter level of the hierarchy, is understanding how the model will work in the larger application or business process. So: having clarity about what we're trying to do and where the value is going to come from. What are we trying to ship? How's it going to work in the rest of our application? Why do we need machine learning at all? What can we do without it? That sort of clarity of purpose. Then translating that clarity of purpose into an annotation scheme, and using it to guide what data we need to collect, is the next stage. So, translating the requirements into a set of models: I think that's really a key step, and it's the step that I'll be talking about the most. Then, after we've decided what models we ought to have, translating that annotation scheme into an annotation process, so that we can actually get the data cleanly.
So this is the project management stage: having attentive annotators who know what we're trying to do, a good quality control process, a good process for resolving conflicting annotations, et cetera. And then finally, at the top of this pyramid, are the parts that matter less but are discussed much more: the questions of model architecture, making smart modeling decisions so that the model's more likely to be accurate, using the wisdom that's in the literature to pick the right technologies, and also optimization tricks so that we actually end up with good weights. And so you can see that the parts I've identified as the tippy-top, the self-actualization part, the less necessary part, are the parts which are vastly more discussed than these other issues. And it does make sense that these are vastly more discussed in the literature, because globally, if you think of the field as a whole, that is the bottleneck, right? If we have better model architectures and better optimization techniques, that generalizes across all projects. But the same consideration doesn't apply if you're considering your specific project. For your specific project, the considerations are different, and the part you should spend most of your time thinking about is why you need machine learning at all, how you're going to map that need into a set of specific models, and how you're going to get data to meet that need. And if we're going to solve this, a difficult chicken-and-egg problem arises. The circular dependency works like this.
If the most important thing is having a clear vision of the product and what we're trying to do, then we want to know how accurate the model might be, so that we can come up with realistic plans. So we need an accuracy estimate. But to get an accuracy estimate, we need training and evaluation data: we need to train and evaluate a model. And to do that, we need labeled data. And to get labeled data, we need an annotation scheme. But to decide what to annotate, we need to know how this is going to work in the product. So there's a feedback loop here, a cycle. So what can we do? Well, as with anything else where we have this sort of cycle, the solution is iterative. What we need to do is have an iterative process where we progressively refine these estimates. And the iteration has to happen not just on the code, but also on the data that we're collecting and on the vision of the product that we're trying to build. So don't take a waterfall approach where you start off making assumptions, feed them forward, and hope that they're correct. We need to accept that the initial estimates are going to be slightly wrong, and start traveling around the circle, refining our estimates, so that we can collect evidence to base them on. So we're asking: what models should we train to meet the business needs? Does the annotation scheme make sense? And finally, does the problem look easy or hard? As soon as we start doing it, we can start getting evidence about that. And then we can also try to figure out what we can do to improve fault tolerance, once we start to see what sorts of mistakes the model makes and how serious those are.
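To make the "does the problem look easy or hard" question concrete, here's a minimal sketch, not from the talk itself, of gathering evidence quickly: hand-label a small sample yourself and see how far a trivial keyword baseline gets before investing in any modeling. The example texts, labels, and keyword list are all made up for illustration.

```python
# Quick evidence-gathering before any real modeling: how far does a
# trivial keyword baseline get on a tiny hand-labelled sample?
# (Data and keywords are invented for illustration.)

labelled = [
    ("Man stabbed near Kings Cross, police appeal for witnesses", True),
    ("Local bakery wins award for best French tarts", False),
    ("Robbery at jewellery store leaves two injured", True),
    ("Council approves new cycle lanes in East London", False),
    ("Fatal shooting reported in city centre", True),  # baseline misses this
]

CRIME_WORDS = {"stabbed", "robbery", "assault", "murder"}

def keyword_baseline(text):
    """Predict 'crime' if any keyword appears in the lowercased text."""
    return any(word in text.lower() for word in CRIME_WORDS)

correct = sum(keyword_baseline(text) == label for text, label in labelled)
accuracy = correct / len(labelled)
print(f"baseline accuracy: {accuracy:.2f}")  # 4/5 = 0.80 on this toy set
```

If a dumb baseline already scores high, the problem may be easier than you feared; if it scores near chance, that's evidence too, and you got it in minutes rather than after weeks of model-building.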
So if we don't take an iterative approach and just blindly go with these things, then, especially in natural language processing, it's very easy to make modeling decisions that are simple, obvious, and wrong. So as an example of this, imagine we had the following requirements. We want to build a crime database based on news reports, and we want to extract information from text, a very common type of need, and one where the technology currently performs quite well. So we want to extract the victim name, perpetrator name, crime location, offense date, and arrest date. Here's an example of what that sort of annotation might look like; this is the sort of output that we might want. We want something like "24-year-old Alex Smith" labelled as the victim, and then "was fatally stabbed in East London" labelled as a crime. So, all right, how should we do this? How should we map this requirement into a set of modeling decisions? Well, the simple way to do it, which a lot of the current fashion is actually guiding people towards, is to take an end-to-end approach and basically map this labeling scheme directly onto the model. We say, all right, we're just going to have a sequence labeling scheme where we extract that information directly. Now, I suggest that this is quite likely to be a less than ideal way to approach the problem. Instead, I suspect it's actually going to be better to do this: apply a label of crime to the whole text, and then apply more generic labels to the individual entities. So apply the label person to the entity Alex Smith, the label location to the entity East London, and also the label location to Kings Cross. What we're doing here is factoring the information better, so we need much less annotation data, because we're only adding one bit of information, crime, and we're deciding that once over the whole text.
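As a rough sketch of that contrast, here are the two annotation schemes side by side. The label names and data structures are hypothetical illustrations, not spaCy's or Prodigy's actual formats.

```python
# Two ways to annotate the same sentence for the crime database.
# All label names and structures here are hypothetical illustrations.

text = "24-year-old Alex Smith was fatally stabbed in East London, near Kings Cross."

# End-to-end scheme: every span label couples the entity type,
# the event type, and the semantic role into one decision.
end_to_end = [
    ("Alex Smith", "CRIME_VICTIM"),
    ("East London", "CRIME_LOCATION"),
    ("Kings Cross", None),  # a location, but not *the* crime location
]

# Factored scheme: one document-level bit plus generic entity labels.
factored = {
    "doc_labels": {"CRIME": True},  # decided once for the whole text
    "entities": [
        ("Alex Smith", "PERSON"),
        ("East London", "LOC"),
        ("Kings Cross", "LOC"),
    ],
}

# The factored scheme reuses generic labels that pre-trained models
# already know, so each annotation decision is cheaper and less coupled.
end_to_end_labels = {lab for _, lab in end_to_end if lab}
factored_labels = {lab for _, lab in factored["entities"]}
print(sorted(end_to_end_labels))  # task-specific, must be learned from scratch
print(sorted(factored_labels))    # generic and reusable
```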
As opposed to the end-to-end example, where the bit of information crime is coupled to the first person entity, and the semantic role, victim, is coupled into it as well. So you have to decide all of that at once. And then for the next entity, East London, you have to decide all at once that this is a crime and that this is the location of the crime; and then again for Kings Cross, you have to decide that this is a location but not the crime location, and therefore the label is null. This makes the modeling much harder, and in many cases you need many more examples to estimate the model this way. So it's quite likely to be a less than ideal way to do this, and you should at least explore composing the models differently: deciding once that it's a crime, composing these things, and having a bit of rule-based logic to match it up afterwards. In terms of what that rule-based logic might look like, this is an example of a kind of generic annotation that can be applied to text by spaCy, and also by other technologies: a syntactic dependency parse. Here we can see that the phrase Alex Smith has the syntactic relationship of passive subject to stabbed, fatally is a modifier, and East London is attached as a prepositional phrase. And so we can use this kind of generic annotation to start building rules to hang our logic on. Now, it may not be the case that this is actually the optimal way to do it, but there are at least these choices, and I want to raise awareness of the fact that there are many decisions to be made in how you decompose a set of needs into a set of models, so that you can at least try different options. Because that's the kind of decision that's going to determine whether the problem is easy or hard.
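As a sketch of what hanging rules on a dependency parse might look like, here's a toy version in plain Python. The parse triples are hard-coded; in practice they would come from a parser such as spaCy's (via `token.dep_` and `token.head`), and the dependency labels (nsubjpass, advmod, prep, pobj) follow the usual conventions. The crime-verb list and the role rules are illustrative assumptions, not anything from the talk.

```python
# A minimal sketch of hanging rule-based logic on a dependency parse.
# (token, dep_label, head) triples for:
# "Alex Smith was fatally stabbed in East London."
parse = [
    ("Alex Smith",  "nsubjpass", "stabbed"),  # passive subject
    ("fatally",     "advmod",    "stabbed"),  # adverbial modifier
    ("stabbed",     "ROOT",      "stabbed"),
    ("in",          "prep",      "stabbed"),  # prepositional attachment
    ("East London", "pobj",      "in"),
]

CRIME_VERBS = {"stabbed", "shot", "robbed"}  # illustrative list

def extract_crime_roles(parse):
    """If the root verb is a known crime verb, read the roles off the
    parse: the passive subject is the victim, the location hangs off a
    preposition attached to the verb."""
    root = next(tok for tok, dep, _ in parse if dep == "ROOT")
    if root not in CRIME_VERBS:
        return None
    victim = next((tok for tok, dep, head in parse
                   if dep == "nsubjpass" and head == root), None)
    prep = next((tok for tok, dep, head in parse
                 if dep == "prep" and head == root), None)
    location = next((tok for tok, dep, head in parse
                     if dep == "pobj" and head == prep), None)
    return {"victim": victim, "location": location}

print(extract_crime_roles(parse))
# → {'victim': 'Alex Smith', 'location': 'East London'}
```

The point is that the machine learning only has to supply generic annotations (entities, the parse, one document-level label), and cheap deterministic rules recover the task-specific roles.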
And making the problem easy in the first place is a much higher-leverage way to get the problem solved than using machine learning techniques to solve a hard problem slightly better. So the general approach here is that we can compose generic models into novel solutions. If we have generic categories like location and person, we can use pre-trained models and just improve them on our data. And I would normally recommend annotating events and topics at the sentence or paragraph level, so that you don't have to decide the exact boundaries of something like "a crime occurred"; instead, you can just apply the label at the sentence level, rather than coming up with boundary policies that you'll struggle to enforce. And for semantic roles, you can annotate these at the word or entity level and use the dependency parse to find the boundaries. So this is a suggestion of a solution, and this is what the workflow looks like; the specific tooling that we've built for it is basically Prodigy. Prodigy lets you quickly spin up an annotation task, so that you can start trying out whether it's easy to label sentences as crime or not. You run a command, you get a little web server, you make some annotations, they're stored in a database, and then you can train a model from them. And the integration with spaCy is also quite nice: you can read the result out as a spaCy pipeline and start using it directly, so you get the capability of saying doc.cats["CRIME"] and seeing the probability of that. So the other consideration that I want to raise is something that often stops people from taking an iterative approach.
I think it's worth being aware that if you focus mostly on big annotation projects, it becomes very difficult and very expensive to collect evidence, because there's a high startup cost for each new project. So rather than viewing annotation as something that has to happen at scale with lots of people, where the biggest consideration is driving down the marginal cost of each additional annotation, it's much better to drive down the overhead of starting an annotation project, so that you can try out more things. That's the thing that's actually going to take more projects from failure to success, if you can get it right. If you're able to run specific annotation projects quickly, and in a few hours decide whether something is going to work or not, then you can try things out and explore the space of different modeling options. And yeah, this is the solution that we have for this. Even as a data scientist yourself, you should have a methodology or workflow that lets you have an idea, just label some data, and try it out. If you have an idea for something you want to try, you shouldn't have to convene a meeting, convince your boss that your idea is good, get the annotators assigned some time, then get the data back, only to decide, oh, it didn't work. Instead, labeling a few hundred examples yourself gives you a much better perspective on whether the thing is likely to work, and then you'll be able to try more things and have more successes. Then additionally, A/B evaluation is a particularly good methodology for this, especially since it lets you work on generative tasks.
I don't have time to explain this in detail, but basically, even if you have a task where you're trying to output text, for instance captioning images, you can't compare the output statically to one reference annotation, because you don't know what's a good caption and what's not. But if you use a randomized A/B evaluation, which Prodigy supports, you're still able to evaluate these tasks rigorously, and I think that's a very good tool to have in your toolbox. And then finally, another detail about annotation projects that people often get wrong: if you think of annotation primarily as boring work that doesn't matter to your project, then it shouldn't be surprising that you end up with data that's less than ideal, and you also shouldn't be surprised that there ends up being a terrible overhead in your project of maintaining quality and making the data good. Instead, you can just not do this. It's actually not that expensive to hire people who don't have computer science degrees to do this work, and you can hire them consistently and, like, talk to them and stuff. So rather than trying to outsource everything anonymously, you can have a few people who work 30 or 35 hours a week on your task, and you can talk to them, and they will work like humans and understand what they're doing. This is generally a better way to get work done, and so I would actually recommend it, rather than trying to make people as dehumanized and disconnected from their tasks as possible. And again, if annotation teams are smaller rather than larger, that's also quite good, because it lets you iterate: if you need 100 hours of annotation, it's much better to have three people working for a period of time than to have 100 people do one hour each, because otherwise you don't get any time to iterate. Okay, and so this is really the solution here.
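To make the randomized A/B idea concrete, here's a toy reconstruction in plain Python. This is not Prodigy's actual implementation; the idea is just that the judge sees the two candidate outputs in a random order, so they can't tell which system produced which, and you tally preferences.

```python
import random

def ab_evaluate(inputs, system_a, system_b, judge, seed=0):
    """Randomized A/B evaluation: for each input, shuffle the two
    candidate outputs before showing them to the judge, then record
    which underlying system produced the preferred one."""
    rng = random.Random(seed)
    wins = {"A": 0, "B": 0}
    for x in inputs:
        pair = [("A", system_a(x)), ("B", system_b(x))]
        rng.shuffle(pair)  # hide which system is which
        # judge sees only the input and the two texts, picks 0 or 1
        choice = judge(x, pair[0][1], pair[1][1])
        wins[pair[choice][0]] += 1
    return wins

# Toy "captioning" systems and a judge that prefers longer captions.
def caption_a(img):
    return f"a photo of {img}"

def caption_b(img):
    return f"{img}"

def prefers_longer(x, first, second):
    return 0 if len(first) >= len(second) else 1

print(ab_evaluate(["a dog", "a cat"], caption_a, caption_b, prefers_longer))
# → {'A': 2, 'B': 0}
```

Because the ordering is randomized, a lazy or biased judge who always picks the first option contributes only noise rather than a systematic advantage to either system.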
As with any other problem with cyclic dependencies, we can't solve this analytically; we have to solve it iteratively. So we have to start moving through the cycle as quickly as possible and say: all right, what would it look like if I could make this work? Here's how we can export the model and plug it into the rest of the product. And then you see, oh, okay, it doesn't quite work this way, let's try it this other way. Moving around that cycle quicker is going to lead to better results, rather than having a very siloed perspective and, say, getting lost in TensorFlow for weeks, improving the accuracy on some dataset that might not even be the right dataset for what you want to do. It's much better to keep moving over the whole pipeline. Thanks. So thank you again. There's time for questions now. So in terms of the funniest thing that I've seen go wrong, there have definitely been some misunderstandings about what the technology is able to do, and about what are reasonable product plans and what are not. I would say that the most common mistake that I find sort of puzzling is the general chatbot enthusiasm, which I think is driven by a quite deep misunderstanding of what the technology is actually doing. In particular, people act as though the primary task is understanding the message, when you also still have to have your application actually do the thing that the message encodes. For instance, people imagine that if you can just understand what people are searching for in, say, a menu system or something, like, "find me a place that sells French tarts at 2 AM", then you can just look it up. But you still have to have your database indexed by whether the place sells French tarts, right?
And so the scope of capabilities is so much narrower than people imagine, and that's fundamental, because you're not just going to, like, generate code. And so people are like, oh, why is it so narrow? Why does it feel so stiff? And I'm like: because it's still a program that you've just wired a user interface to. So I think that's definitely something that I've seen go wrong at a large scale across the industry as people try to apply these technologies. About information extraction: you know, there's the rule-based and the model-based approach. Do you think the rule-based method is still alive in the future or not? So what I would normally recommend is actually using machine learning to add annotations to text that you can hang rules on. If you think about what you're actually doing with the machine learning, there's always some point at which you want the output of your machine learning system to feed back into some other system. At some point you need to translate from the continuous space that you're probably in if you're doing machine learning, to some sort of Boolean logic that the rest of your program is going to interact with. And so the question is: what's the minimum that I can learn about this text that gives me consistent attributes that I can then get by with using rule-based approaches? I would never want a rule-based approach that decides whether some sentence is about a crime. That's silly; it's so much easier to do that with a machine learning approach. But it might be the case that once I know it's a crime, I can just say: all right, the first person mentioned in the crime report has this sort of role, because of the nature of my data. Or: once I know it's a crime and I've got this list of verbs, I know which one is the crime that occurred.
And that might be a much easier way to do it than trying to learn all of those bits of information coupled together. So that's what I would say is the hybrid of those approaches in practical terms. Hi, great talk. My question is on spaCy. It looks like a really useful tool. How much work is it to add an additional language model to spaCy? So it really depends on what capabilities you're interested in adding. At this point, the process of just adding a new tokenizer is pretty easy. Similarly, if you have a large sample of unlabeled text, the process of training basic word vectors is pretty easy. But then most languages need, say, lemmatization. So you normally want "jumped" to map to "jump". In English that's a very simple process, but in languages like Finnish or Arabic it's actually quite involved, and so that ends up being difficult. Then for most of the other things, you really need to have data. There are some datasets which you can license or use, depending on the licensing terms you need, but in many situations you don't have a suitable corpus, and then you have to create one. We're interested in doing annotation for this using Prodigy, but because we'd basically have to pay for that, these are likely to be more commercial models. But there are definitely some datasets out there which are available, and we do want to provide models on that basis, basically free like the current English ones. Great talk. I really liked your illustrations. Will they be available online, so I don't have to copy them? So do you mean the specific commands on the slide, or? No, like the graph, the cyclic graph. I mean, the slides will be available and the talk is recorded, but we don't have it written up as a blog post yet. Maybe we can do that. Maybe. Thanks.
So the question is about the considerations around how well the technologies will work on different languages. In general, the less like English a language is, the worse everything works. So English, being the language most like English, works pretty well. Dutch is also quite like English, so things work fairly well. Chinese, even though there's more text for it than for Dutch, doesn't work as well, because it's less like English. These methods have been really quite well tuned to the characteristics of English as a language, and there are a couple of attributes of English that are slightly convenient and basically mean that some of the associated problems are easier. I would say that that's the biggest consideration. Even though there's plenty of text for, say, Arabic, Arabic language processing is quite difficult because Arabic is quite unlike English. Okay, so let's thank Matthew again.