Thanks, everyone. It's really nice to be back, and yeah, I'll be talking about large language models. It's a topic on everyone's mind, and it's also really been on our minds for a long time, as Alex mentioned.

You might know me from spaCy, which is an open-source library for natural language processing in Python, one of the most popular ones. People really use spaCy to build pipelines to extract information from text. The library has been around for quite a while, and we've also always put a lot of work into stability and keeping the API clean, so much so that ChatGPT is now pretty good at writing spaCy code. The other thing we're doing is Prodigy. Prodigy is an annotation tool for machine learning developers to create training data for machine learning models. It's fully scriptable in Python, so it really easily integrates with everything that's out there in the ecosystem, and also with a lot of new developments in the field.

So before we dig really deep into natural language processing and large language models, let's first have a look at NLP more generally and the type of tasks that people are working on. Roughly, you can group them into two categories. We have generative tasks, which are things like summarization, reasoning, problem-solving, paraphrasing, style transfer and question answering. And on the other side we have predictive tasks. Those are classic NLP tasks like text classification, entity recognition, relation extraction, coreference resolution, grammar, morphology, semantic parsing and discourse structure. The generative tasks really take any free-form input and produce free-form output. So you could think of it as generative models producing human-readable output, whereas the predictive models
produce structured, machine-readable output. Large language models have really transformed especially the generative space. There are so many capabilities and things we really couldn't do easily before that are now pretty straightforward and really easy to integrate into our systems. And they also show a lot of promise for the predictive tasks. So where are we going from here? How do these models fit in, and how should we imagine the future of NLP and of our work?

One thing I always like to do when thinking about how big changes in technology have impacted things is to look into the past, at significant moments when new technology was introduced, and at how people imagined it would change the world and what actually happened. One of my absolute favorites here is this series of postcards from roughly around the year 1900, and it shows how people back then imagined the year 2000. It really goes from everyday tasks like cleaning, firefighting, knowledge acquisition and hairdressing to an actually not so far off depiction of video calls. One thing we can really see here is that people imagined things in very human-shaped forms, because that's what we see, those are the tasks, and that's what we interact with. So the solutions here are very close to the actual tasks that the humans performed.

Of course, the year 2000 has passed. We know the answer, and we know what technology we've built. In a way, yes, all of these things were solved with technology, and some of them actually in a quite similar way: we've built a machine, a robot vacuum cleaner, that does the cleaning. But some of them have also been solved in slightly different ways.
Yes, we've changed the technology for firefighting, but we've also built completely different things that solve the same problem, like smoke alarms, and we've even changed the way we build buildings to prevent fires from breaking out in the first place.

The same can be said when we look at how technology has changed jobs and the future of work. There are some tasks that are very straightforwardly translated into machines, like manual calculation. We were able to replace whole rooms of people calculating things with a simple calculator. Then there are also jobs that do not exist anymore and have been completely eliminated, like my other favorite, the job of the knocker-upper. Those were actual professions: up until the 1950s, people were hired to walk around with a large stick and knock on people's windows, before alarm clocks were reliable enough. Again, this job doesn't exist anymore, but as you can see here, we replaced it in a way that delivers the same value. We did not go and build a window-knocking machine. We built alarm clocks. And that's something that's very important to keep in mind when imagining the future. We don't want to be imagining and building window-knocking machines. We want to be thinking about what the next alarm clock can be and what can deliver that same value.

Another, more recent example is trying to replace human personal assistants.
That's something where we've seen a lot of attempts, both replicating the human-shaped interaction as well as more clever solutions. There are definitely use cases where doing a chatbot works out quite well, but we've also seen that there are other applications that really deliver the same value as a human assistant, like sending someone a calendar link and really solving the problem of scheduling a meeting without going through all the steps that a human would normally go through to get to the same result.

So with that in mind, what is next, and how can we imagine large language models having an impact? What will be the equivalent of an alarm clock, as opposed to maybe a window-knocking machine?

When we think about NLP in the age of LLMs, there are two questions, or two dimensions, we can look at. One is the question of how we will interact with these models. Will we even need structured data and databases at all, or will we mostly be querying a model in dialogue form? Will everything basically become a dialogue, and will we be chatting with our data? The other question is whether we will even need training data anymore. How are we getting things out of models? Will we still be using humans to label data and training models from it, or will we mostly be focusing on prompting, with everything moving to coming up with the right prompt?

To make this more concrete, here's a pretty classic NLP problem, information extraction, which you've probably encountered in some way or another if you're working in the field. What we want to do here is analyze fundraising announcements and populate a database in a structured way, so that we can do something with that information later. So we want to be recognizing entities, like company names.
We want to disambiguate them, so really map each mention to that one particular company. We want to look them up in a database that is custom and maybe has other information that we want to use. We also want to take the entities that we've extracted, take the currency ones and normalize them, so we end up with integers that we can do maths with, like calculating the overall number of investments that were made and how much money was invested. And then, given the entities we've extracted, we want to relate them to each other. So these are all things that we want to do, and of course this is not the end of it. Usually any model like this will feed into some other system, like an application where you can look things up to really get to the information that you're interested in.

There are different visions for how this can be achieved in the future. One is "dialogue is all you need": the large language model will really be the system and in charge of managing the whole interaction. As a user, you will query the system with natural language input and it will respond with an action or with information. That means the model really has to own the whole information flow. To go back to our example of the fundraising announcements: in this case, you wouldn't even need any of the structured data or the database at all. You would have a model that has access to the fundraising announcements, you can ask it questions and follow-up questions, and it will reply, for example, with a number, and you have to trust that it's correct.

Another vision basically thinks of the large language model as replacing what we currently use as a machine learning system. You would still have structured data, and a user would query a system, but in order to get to that structured data, the text would be translated into a prompt and the model would be in charge of outputting that structured data.
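To make the normalization step from the fundraising example more tangible, here is a minimal sketch of turning extracted money mentions into integers you can do maths with. It is not code from the talk: the function name `normalize_amount`, the multiplier table and the regular expression are all hypothetical, and the sketch assumes that commas are thousands separators and ignores which currency was mentioned.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical multiplier table; extend it for your own data.
MULTIPLIERS = {
    "k": 1_000, "thousand": 1_000,
    "m": 1_000_000, "million": 1_000_000,
    "b": 1_000_000_000, "bn": 1_000_000_000, "billion": 1_000_000_000,
}

# A number (optionally with separators), then an optional unit word.
AMOUNT_RE = re.compile(r"(?P<number>\d+(?:[.,]\d+)*)\s*(?P<unit>[a-zA-Z]+)?")


def normalize_amount(mention: str) -> Optional[int]:
    """Turn a money mention like "$5 million" into an integer.

    Returns None if the mention contains no number. Assumes commas are
    thousands separators; fractions of a currency unit are truncated.
    """
    match = AMOUNT_RE.search(mention)
    if match is None:
        return None
    number = Decimal(match.group("number").replace(",", ""))
    unit = (match.group("unit") or "").lower()
    # Unknown unit words (or no unit at all) leave the number unscaled.
    return int(number * MULTIPLIERS.get(unit, 1))


# Mentions as an entity recognizer might have extracted them.
mentions = ["$5 million", "$2.5M", "$1,200,000"]
total_invested = sum(normalize_amount(m) or 0 for m in mentions)
```

Once the amounts are plain integers, questions like "how much money was invested overall" become an ordinary sum over the database rather than something you have to ask a model and trust.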
So the model would really be implemented and functioning at runtime here.

And then finally, in a third version, or a third vision, we can think of the model as more of a compiler instead of the runtime. It helps with building the pipeline, and we would still be building machine learning systems that produce structured data, training a model and developing the code for it, and a large language model would really take the role of helping us get to that goal quicker. Of course, if it's needed, it can also take part in producing the structured data at runtime. There can be multiple models, there can be multiple machine learning systems, and the LLM could be working on the structured data or it could be helping out with generative capabilities, which is where it often really makes sense to have a model like this in the loop.

So if we go back to our NLP tasks, we've seen that these models have really transformed the generative space, and at this point they're not quite yet a drop-in replacement for a lot of the predictive tasks, which are a lot more specific and really rely on producing this structured data. So the question is: can we achieve better accuracy and better results by training task-specific models for these predictive tasks? And if so, where can large language models help?

Here are some recent experiments that we've run and are about to release. We've taken a couple of data sets for text classification, so assigning categories to text, and we looked at how well GPT-3 performs on them out of the box in a zero-shot or few-shot way. The first data set we looked at is on sentiment analysis, sentiment classification, and here we can see that out of the box, the model baseline is pretty high.
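For readers who haven't seen the zero-shot and few-shot setup mentioned here, the following is a rough sketch of what it amounts to, under assumptions of my own: the prompt wording and the helper names `build_prompt` and `accuracy` are hypothetical and are not the prompts used in the experiments. Zero-shot means the prompt contains only the label set and the text; few-shot means a handful of labeled examples are included as well.

```python
from typing import Sequence, Tuple


def build_prompt(
    text: str,
    labels: Sequence[str],
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Build a classification prompt.

    With no examples this is a zero-shot prompt; with a few
    (text, label) pairs it becomes a few-shot prompt.
    """
    parts = [
        "Classify the text into exactly one of these categories: "
        + ", ".join(labels) + "."
    ]
    for example_text, example_label in examples:
        parts.append(f"Text: {example_text}\nCategory: {example_label}")
    # The final block is left open for the model to complete.
    parts.append(f"Text: {text}\nCategory:")
    return "\n\n".join(parts)


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


zero_shot = build_prompt("Great service!", ["positive", "negative"])
few_shot = build_prompt(
    "Great service!",
    ["positive", "negative"],
    examples=[("I loved it.", "positive"), ("Never again.", "negative")],
)
```

The same `accuracy` function can score both the prompted model and a task-specific model trained on a fraction of the data, as long as both are evaluated on the same held-out examples.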
So we have 94 percent accuracy without the model having seen any examples, and we need at least about 50 percent of the training data if we're just training a model ourselves with spaCy to get to the same results. Still pretty good. But sentiment analysis is also something that's relatively general-purpose and relatively easy, so we also tried the same thing with a slightly more challenging data set on news classification. Here the baseline is lower, and we can see that even with 1% of the data, we can train a model that significantly tops that accuracy. 1% of the data here is about 1000 examples. That's something that one person can annotate individually in one or two hours. So given the improvement we see here with very little data, that's definitely something that's already worth the investment and pretty promising.

But news is still a relatively easy domain, so we also tried a much harder data set, and this one is about banking. It has 77 categories, so it's quite challenging, and all of these express very specific intents. Here the baseline is pretty low, and even with 5% of the data we can get a significantly more accurate model. We can also see that it doesn't even top out at 100 percent of the data. The accuracy is still going up, so it's fair to assume that if we just annotate some more examples, we can get even better results on these very specific tasks.

In addition to text classification, we also looked at named entity recognition, which is a lot more intricate in terms of information extraction, because we are not only predicting labels over a text: we also have to extract spans and then predict labels over those spans. Here are some results on a standard evaluation set. The GPT numbers are from a paper that was actually just released the other week, and this is the current state of the art on few-shot prompting. What we can see here is that out of the box, even without having seen any examples, the accuracy is pretty good, but it still
doesn't get anywhere close to the state of the art at the moment, in 2023, or even the state of the art in 2003, when this data set was first released. So what we can see here is that even though we can get significantly better accuracy by training task-specific models, these models here, out of the box with few-shot prompting, do make pretty good prototypes. That is definitely something we shouldn't underestimate, because the time it takes to go from basically nothing to a working version that's pretty good and can be improved is actually pretty significant.

So, to look at a more practical workflow: on the one hand, we have these large language models. They know a lot about what the text means and about the world, and they're also very large. But they don't necessarily know what you want them to do, especially if it's not generic. On the other hand, we have the task-specific models, and by that I mean fine-tuning BERT or something similar. They know less about the text, they're smaller, and they have less knowledge about the world, but they can encode exactly what you want them to do, very specifically. If you know what you mean and you know what you want, you can encode it in a model. So as developers, what we can do is take large language models and use them to produce better task-specific models, and basically get the best of both worlds. And that includes prompt engineering.
So we need to make sure that the prompt we design is good and produces the right output. We need to define the problem: take the bigger business problem and break it down into components, into things we can actually express and train as a model. That feeds into the data annotation: we need to create the examples for our model with the help of the large language model. We need to train a model, and of course we also need to evaluate it, because we need a stable and robust way to find out whether our model does what we want it to do, and also whether anything we do and change actually improves the model. The task-specific model will be more efficient, smaller, more to the point and more predictable, and that's also the model we can then more easily ship to production. And it's ours.

So when we go back to NLP in the age of LLMs and the different visions, we basically see that we have to align ourselves somewhere in the middle here. We definitely need structured data. There's no way around that. It's incredibly useful for a lot of applications, especially if we're looking at predictive tasks. We also really need humans, and we need the humans to be in the loop. We need humans to help evaluate the models, and we need humans to help create the training data and check that it's correct if we're training task-specific models. So at least at some point, we need humans in the loop. And we can use large language models for faster prototyping. We don't need to wait until we have a small data set that we can train from. We can build a prototype and have a working system that we can improve, and that will likely lead to a lot more projects going ahead and being successful, which is great.

We've also seen that we really need to work with the models, we need to work with the code, and we need to really get in there and iterate. So open source is very important here, because it lets us work with the code. This can't happen in an entirely closed-off environment where we
don't have any control over the code, over the data or over the models. And finally, while there are tons of use cases where conversational interfaces add a lot and are great, there are also a lot of cases where other interfaces are much better at providing the same type of value, and where a conversational interface is much more the equivalent of a window-knocking machine than of the alarm clock. So both will still be relevant.

So how does this really work in practice if you're a machine learning developer? We really envision the solution, LLM-powered NLP, as a collaborative development environment for your data. It's powered by large language models that help you annotate data and create data sets for specific tasks, and it also allows humans to get in there, review the labeling decisions and correct the errors. At the same time, you need to be able to tune the prompts, make sure that you're using the models optimally, compare them to task-specific models, and finally build data sets, train, evaluate and build the right pipelines that really solve the problems efficiently.

Here's an example of how this data development process could look. This is an example from our annotation tool Prodigy, but you can also do something similar in a lot of other tools, or even just in plain text if that's what you prefer. Here we are creating data for a named entity recognition model that annotates dishes, ingredients and equipment in posts from the cooking subreddit. In this case, we're sending that text to the OpenAI API and asking it to respond with structured information about those entities, and we then map them back into the original text and display them. Of course, inevitably there are going to be mistakes, and in the UI you'll be able to correct them. You're also able to add correct or especially important answers, and examples that you've corrected, to the prompt in order to improve the future
predictions. But of course we have another problem here, which is that language models out of the box will respond with unstructured text: unstructured text goes in and unstructured text comes out. So the challenge here is: given a prompt and a natural language response, how do we map it back into the original text, into a structure that we can actually work with? For that we have an extension to spaCy called spacy-llm, which basically does this. It takes care of the prompting, making the query, then parsing out the response and setting these annotations back onto spaCy's data structures, so you can access them in the original input. This means that given some unstructured text input, you can get a structured Doc object out of it that has all the structured information you need.

You can do various different things along the way, for example lemmatization, resolving words to their base form, which can be very important, especially in languages other than English. If you're analyzing text, you can recognize entities, classify the text and extract relations, and for each of these components you can mix and match the different approaches and techniques that you want to use. So you can have a large language model in your prototype, or maybe even in your final system if it performs the best. You can replace it with a task-specific supervised model if you want, and augment it with rules if there are things where you definitely know the answer and can get better accuracy.

As you can see, there's really a lot to do and a lot to think about if you are working with NLP. One thing that's usually at the center of the discussion when we're talking about large language models is how easy it is: how easy it is to get started, and how easy it is to build systems.
And that's definitely true and a really important aspect. But in a lot of cases, easier just isn't enough. We also want things to be good, and we shouldn't be settling for systems that are worse than what we're currently building, because currently companies and developers are building a lot, and a lot of it is incredibly valuable. So we want to match that. We don't just want things to be easy. We also want them to be good.

If we know the answers and we know exactly what we want, we can build models that are specific and really solve our problem. We don't need to settle for something general-purpose if we can define a model that does the thing we want, and if we can use large language models, for instance, to help us create data for these models and train systems that we control and that really do exactly what we want. We also don't have to settle for incredibly long API latency, or models that are incredibly expensive, annoying or difficult to run and serve because they're so large. If we have a specific thing in mind, we can actually train models that are good, or even significantly better, that are smaller and faster, that we own, and that we can deploy efficiently. And we also don't have to settle for APIs and third-party services where we don't know what our data is going to be used for. A lot of high-value use cases work with data that's private, and that's fine. You should be able to train models that are yours, that you own and control, and that you can run yourself in a fully private environment.

And finally, we shouldn't settle for anything that's worse than what it should be. The new possibilities and capabilities of the new models we have introduce so much new potential. We should be able to build things that are better, and we should be able to build things that exceed what we can do at the moment. Don't let anyone tell you that you should settle for anything less. We can build things that are better. Thanks.

Oh, thank you very much.
We have, you know, plenty of time for questions. So if you have a question in the room, there are mics up there and there. Please just stand up and ask your question. And if you cannot come up with a question, you can use ChatGPT to come up with a question for you. You can prompt it: what should I ask Ines Montani of spaCy about large language models?

I would actually be curious. Yeah, let's do that.

Okay. But let me ask you one. I think when large language models were released, I said: oh, spaCy, what will this mean for spaCy? Will this be a replacement, because we no longer need all these tools? But basically you pointed out: hey, in our space it's also the best assistant ever, because it relieves us from a lot of human labor, like creating domain-specific models, right?

Yeah, and in a way I think it's also really showing how important it is to have pipelines of different components that do different, specific things. For a while there was kind of this idea that all you need is one model and that's it. And I think especially LLMs have shown that you want to do different things at different points and have different prompts, so the pipeline design of spaCy really helps there.

Yeah. And I think many people forget that your business secrets, or your special business knowledge, can never be in a large language model. I mean, they shouldn't be, unless you leaked all your internal data. That's another point. So, we have questions. I think the gentleman in the green shirt was first, right?
Thanks for the talk, it has been really interesting. My question is on what you proposed, generating input and training data for specific models. When training a model, you generally always have to deal with biases. When you're selecting training data yourself, you at least know where the training data is from and what bias it may have. But when you now take data generated by a large language model, how do you even keep track of that, or how would you assess that?

Yeah, so I think in short the answer is: well, you definitely still need humans in the loop. Also, a lot of the suggestions I presented were mostly around using the model to put structure into the data. Often the main problem is that you have a bunch of raw text lying around and you want to train a model to analyze it. So you can use a large language model to help you put structure into the text you already have, and of course you still want to review that, but you're not necessarily generating new examples. You can do that, though. I think for paraphrasing there are a lot of interesting ideas, and data augmentation could be super useful. But yeah, you don't get around having humans in the loop, the same way you would in other scenarios.

Thanks.

I think the next one.

Thank you for your talk. I was curious: in the example you demonstrated of converting unstructured data to structured data, how robust is it? Is it always guaranteed that you will get something that can be parsed as an output, or not?

No, I mean, it's genuinely difficult, and there are different approaches. At the moment, what we would do is this: if it doesn't match, because the model might respond with something completely arbitrary and you can't necessarily control that, then this would be the equivalent of a model you've trained not predicting something even though there's something there.
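The idea of treating unparseable output as "no prediction" can be sketched roughly as follows. This is an illustration under my own assumptions, not spacy-llm's actual implementation: it assumes the model was asked to answer with one `Label: phrase, phrase` line per entity type, and the function name `parse_response` is hypothetical. Anything that doesn't fit the format, uses an unknown label, or can't be found in the original text is silently skipped.

```python
import re
from typing import List, Sequence, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def parse_response(text: str, response: str, labels: Sequence[str]) -> List[Span]:
    """Map a free-form model response back onto character offsets in `text`.

    Assumed response format (hypothetical): one line per label,
    e.g. "Dish: carbonara, lasagna". Lines or phrases that don't match
    are dropped, which is equivalent to the model predicting nothing.
    """
    allowed = {label.lower(): label for label in labels}
    spans: List[Span] = []
    for line in response.splitlines():
        if ":" not in line:
            continue  # not in the expected format, e.g. chatty filler
        raw_label, _, raw_phrases = line.partition(":")
        label = allowed.get(raw_label.strip().lower())
        if label is None:
            continue  # the model invented a label we didn't ask for
        for phrase in raw_phrases.split(","):
            phrase = phrase.strip()
            if not phrase:
                continue
            # Only keep phrases that literally occur in the input text.
            for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
                spans.append((match.start(), match.end(), label))
    return sorted(spans)


text = "I made carbonara with guanciale in a cast iron pan."
response = (
    "Dish: carbonara\n"
    "Ingredient: guanciale, truffle\n"
    "Equipment: cast iron pan\n"
    "Hope that helps!"
)
spans = parse_response(text, response, ["Dish", "Ingredient", "Equipment"])
```

Here "truffle" is dropped because it never appears in the text, and the closing pleasantry is ignored because it isn't a `Label: ...` line. A response that stops halfway through simply yields fewer spans, which a human reviewer can then correct.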
So that's kind of how we treat it. But yes, that's definitely a challenge if you get free-form output back. It might or might not match, or sometimes it just stops at some point. That's also really common, depending on the model: it just sort of gives up halfway through, and then we have to deal with that. But I hope that's something that can get better, and we're also super keen to look at fine-tuning models to be better at these kinds of structured prediction tasks, because there's a lot of low-hanging fruit that I think hasn't been fully explored.

Okay, thank you.

I have a very naive question. Do you think one can do the thing the other way around? Now you are proposing to use LLMs to improve natural language processing. Can one use the data or the code of the structured data from NLP to improve LLMs, to go in the other direction?

Yeah, actually, this question ties in slightly with what I just mentioned. Given all of this experience and all of the outputs and everything we created here, we could consider fine-tuning an LLM to be better at producing the structured data, so going the other way around. I think that's definitely super relevant. And it feels like we sometimes end up at this NLP-ception, where we have a model and suddenly it outputs natural language again, and then we're like: ah, damn, now we need to use NLP again to take the output of that model and turn it into something useful. So there's definitely a cycle there and a lot of interesting things to explore.

Next question, from the back.

Yes. Thank you very much for the talk, and thank you very much for spaCy, which is a fantastic library, well documented. I'm using it.
I'm loving it. I'd like to have your take on something that has happened, which is that OpenAI refused to publish the details about GPT-4. Right now I'm wondering, and I would like to hear your take because you're much deeper than me in all this world: are we running towards a model where all the major players are going to publish closed-source LLMs, more and more powerful, or will open source come back? What's your take on this?

I think we'll have both. And yes, sure, with OpenAI, that's also why I think a lot of practitioners in the field are quite skeptical, and it's not really something you can rely on for these types of robust tasks, because it can also change at any time. There was also, I think, recently a case where the model changed and OpenAI said: no, we didn't change the model. And then you have to read between the lines. They say they didn't change the model, but maybe they changed something else around it. It's completely non-transparent. But it doesn't have to be that way. There are other providers that can also train models and release them in more developer-friendly ways. There are open-source models. A lot of these things are not really kept private for long. There's research happening in parallel. So yes, that's something that's happening at the moment, but I don't think that will be... yeah. I think we're seeing the same pattern repeat that people had way back in the day, where everyone had these closed-off APIs, and what won in the end is open source. People need to program, people need to really work with the models, and that's not going to change.

Thank you.

Basically, OpenAI is a black box to us. Yeah, right. And that's different with Hugging Face and its many models, so it's more transparent there. Okay, next question. Jakob.
I think we've all experienced ChatGPT spewing nonsense. I'm wondering, would it be possible to build some sort of measurement of certainty in the answer into the models, or is this something that's going to plague these models forever?

I think there is. I'm not super deep in that part of the research, but there have definitely been movements towards building in more explainability, or at least having the model try to cite sources or examples, or try to get a bit closer to "what is this based on?". But of course even there, we've already seen, I think with these search engine bots, that if you post a lot of nonsense on the internet, the bot will eventually cite your nonsense link as the source. I think especially in that area, we'll always see a race of spammers against developers. But for the use cases that I'm thinking about, which are more developer-focused, I do think there's a lot of interesting stuff happening in the explainability area: getting results with different temperatures and different certainties, looking at them, deciding which one to use and which one is most likely correct. I think that's definitely possible on a development level, but on the more general internet level, we'll always see this race of people trying to produce nonsense and people trying to prevent nonsense, and I think that's just going to keep following us around forever.

We have one remote question from Yoni, and Yoni actually asked ChatGPT.

Oh, cool. So what does ChatGPT say?

I'm just reading it to you. It's pretty cute. "Dear Ines Montani, I hope this message finds you well. As a part of the development team behind spaCy, one of the leading libraries for natural language processing, your insights would be invaluable." Thank you. You see, ChatGPT thinks you're incredible. spaCy is great. That's good. "Given your expertise in NLP and the recent advancements in large language models,"
"I'm curious about your perspective on the intersection of the two. Specifically, how do you see large language models like GPT-4 affecting the future of libraries like spaCy? Can these models be integrated into spaCy's pipeline to improve certain tasks, such as named entity recognition, part-of-speech tagging or dependency parsing? Or conversely, do you see these models potentially replacing traditional models altogether? Thank you for your time and insights."

I mean, disclaimer: ChatGPT did not see your keynote yet.

Yeah, I was just going to comment on that. It's a bit long-winded.

Yes, long-winded, but it's interesting.

Yeah. I hope I answered all of ChatGPT's questions in my talk today.

No, you have to write back. I'm just reading the question. Okay, next question.

Okay, thank you. So, sticking with OpenAI for a bit longer: they recently released this feature called function calling, I think. I'm just wondering if you had a chance to try it out, and what you think about it, because in a sense it does what you described during your keynote, which is basically using a generative model to produce a structured output.
So I'm just curious what you think about it.

Yeah, so I've definitely seen it. I haven't really tried it much, but it definitely shows that this is something other people are thinking about as well. I would say that OpenAI generally targets a lot more use cases that are consumer-focused, and where the API is really mostly part of a system. So I still don't see OpenAI really going into the developer space or offering things that are really useful to work with. And I think even a lot of the basic suggestions for structured tasks don't yet work very well. For example, we've tried out what OpenAI recommends for entity recognition, and that's pretty meh. Then there are those results I showed and the PromptNER paper, and that works much, much better. So we can see there's still a lot of active research happening and a lot more that can be done. But yeah, I haven't tried it out in detail yet.

Okay, thank you. We have time for two more questions.

Perfect.

I have a question. What do you think about support for natural languages other than English? Do you think that the features we saw today can be available for other languages?
I don't know, like Arabic, Chinese?

Yeah, I think that's actually super important and something that isn't talked about enough. Of course, a lot of the research is happening in English. The general rule of thumb is: the closer a language is to English, the better it works for NLP. It's also something we'd be very excited about working on, because there are a lot of the more classic NLP tasks that people often ignore, like lemmatization, that are not so interesting for English, but if you look at other languages, they become much, much more relevant. You want the base forms of words if you want to compute something over a corpus of text that you have. Or even tokenization: what's a word? That's something a lot of English speakers don't really think about, but if you look at Chinese, you can't just split on whitespace. So even those boring old-school NLP things are much more relevant in other languages, and I think that is an exciting area. We also definitely want to try out more multilingual models, but I think the bottleneck is data, because there's just such a huge amount of English internet to train these models on, and that would be significantly harder for, say, Czech or other lower-resource languages.

Yeah, I think sometimes even if you have ChatGPT write something, you get a certain zeitgeist in it.

Yeah, too many cooking blog posts or something like that.

So, last question, please.

Don't you believe powerful open-source large... LLMs might bring Skynet closer, sooner?

Sorry, what? Open-source LLMs?

Open-source powerful LLMs.

Yeah, I think there's a lot happening. There are a lot of models you can already try out, also in spacy-llm. There are a lot of open-source models we already support.
We already support And I think it's great to have these models and especially if we work a bit more on distillation It I think it will become really viable to run these internally. Yeah, but I meant Then you think then you think it's gonna bring Skynet closer. Oh sky that yeah, I don't think that's something we That's something that makes sense to think about Have you read sorry only one question. Have you read live 3.0? Anybody read it from Mark's their tag mark? No, sorry. Okay. It's the tale of the Omega team. It's a basically a Good ending. Ah, okay. Yeah. No, sorry Thank you. All right. Thank you very much in us for coming by for the keynote