OK, I'm going to switch mics here. Is this one on? Great. I have a hard time standing still, so great to see you. So if I mess up or it makes a lot of noise, just let me know. So my name is Craig Martell. I am the chief digital and AI officer at the Department of Defense. It's a pretty cool role. I'm responsible for a lot: I am the chief data officer, the chief analytics officer, and the chief AI officer for all of DOD. Which means I go to a shitload of meetings. Fuck yeah, exactly. Please don't put that on YouTube. My boss is the deputy secretary of defense, and I don't want her to see that. So we will edit that out.

But the meetings are really cool because we realized very quickly that we had to build a hierarchy of needs, where we had to say, look, AI is going to come. So this was last year, here's a precursor for my talk, before all of the hype about large language models. I said it like that on purpose, before all of the crazy, silly, absurd hype about large language models. We built this hierarchy of needs. And first it was: get the data right. And that's really hard at someplace as big as the Department of Defense, because it's the world's largest organization. And it's distributed, I don't know, everywhere, including space. So getting the data right is really hard. But if you get the data right, then AI is going to come for free. That was the story I used to tell. Now the problem is, people want AI for free, and they shouldn't. So we'll get to that part in a second.

The second layer was analytics, because the term AI is hyper-overused. Hyper-overused. And in my job, AI often means "I want magic that will magically do my job and allow me to get rid of people who are too expensive." That's what the phrase AI means at the DoD often. But really what they're asking for is just to see where their crap is. They just want an analytics dashboard that will let them know: here's my stuff, here's my troops, did that move, when is that stuff showing up? And I can easily see that. So a lot of the calls for AI, at least in the Department of Defense, aren't really calls for AI after all. They're simply calls for high quality data and a good dashboard. So we said quality data, and then we said an analytics layer.

And then finally, I put AI at the top of that hierarchy. And what I want to provide for AI for the whole Department of Defense is what I call AI scaffolding. And so on the left, so the model, the building of a model is now a commodity. 10 years ago, 85% of the problem was the model. Today, 5% of the problem is the model. And the other 95% is the pipeline in which that model sits. And the generation of that model is a commodity. You send your data to one of the big players and they'll give you back a model. And that model will be good enough. I feel like a wedding singer. And that model, I can't sing. And that model will be good enough. So really, we've been thinking very hard about what goes around that model. On the left-hand side of that model, so before you build the model, is high quality labeled data. Nobody wants to label data. Labeling data is the hardest thing in the entire pipeline. But it's the single most important thing in the pipeline. High quality labeled data matters way more than the algorithm. The algorithm is mostly irrelevant at this point. Choose one of 10. Choose one of five vendors. And you're gonna get a good enough model that solves your problem. Maybe not perfect, but good enough often.
Good enough to sell shirts and make billions of dollars for some companies, right? Good enough to return documents and make billions of dollars for other companies. Good enough to recommend jobs, and I used to work at LinkedIn. Good enough to recommend jobs and make billions of dollars. So the model itself has become less relevant. Getting high quality labeled data is really hard. So we wanted to provide that as a service. And then on the other side of the model is monitoring.

So what is AI? Somebody tell me what AI is. What? It is not a panacea. That comes later in the talk. Cheap parlor tricks to fool muggles into thinking there's intelligence in a box. Didn't I teach you that? Didn't you come to my class and I taught you that? Well done. Ooh, ooh. We'll get to that. He said plausible sentence generators. AI is statistics at scale. That's it. We measure the past. We track the past. We use data from the past and we use it to predict the future. It's no different than when you're driving the car trying to figure out whether that bicyclist is gonna turn in front of you or not. You've used your past experience. You look at the features based upon the way that bicyclist is riding and you make a decision about whether you think you're safe to pass them or not. We use the past to predict the future.

What's the problem? So that works really well. That works extremely well in lots of contexts. But it doesn't work very well when the world changes. Because we've predicted the past based upon a particular set of features in the world. And those features change. So we wanna think about model monitoring on the right side of the model to ask, is this model still delivering value? Now industry really has some great wins here. Why? Because users of a shopping site or users of a search engine label the data at scale, massively, millions of times per minute. You go in there and you buy the shirt or you don't. You click on the document or you don't. And they can track, we did this at Lyft, we did this at Dropbox, we did this at LinkedIn, they can track over time that the model's becoming less and less effective, so it's time to retrain. It's time to take samples from the world and train again. They get that for free. They get that from us just interacting with their products.

But in the DOD that's extremely hard. So we're working very hard to build model monitoring into the specification of models that we should buy, so that we can at least tell, has there been a distributional change in the labels that the model is generating? Plus let's be clear, in warfare the world changes pretty quickly sometimes. The world doesn't remain stable. And not even just in warfare: what if you're on a ship doing search and rescue and suddenly clouds roll in and the waves get bigger? Well, your model is gonna be very different in that case than it would be in a case where the seas are calm. So we, the DOD, more than I think most folks, live in a world that changes very quickly. And so we really wanna build models that have high quality labeled data and are monitored on a regular basis to make sure they're still bringing value, because we usually pay millions and millions of dollars for these things.

All right, that was the plan and that's still the plan. That's what we're doing. We're building on a hierarchy of needs. You can go back and look at any of my public talks. I talked a lot about the hierarchy of needs.
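To make that model monitoring idea concrete, here's a minimal sketch of the kind of check I mean: watch for a distributional change in the labels a fielded model is emitting. The window sizes, the threshold, and the total-variation score are all illustrative assumptions on my part, not anything we've specified or bought.

```python
from collections import Counter

def label_distribution(labels):
    """Normalize a list of predicted labels into a probability distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def drift_score(baseline_labels, recent_labels):
    """Total variation distance between two label distributions (0 = identical, 1 = disjoint)."""
    p = label_distribution(baseline_labels)
    q = label_distribution(recent_labels)
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in support)

# Toy example: what the model predicted when it was accepted vs. what it predicts now.
baseline = ["ship"] * 700 + ["small_boat"] * 250 + ["debris"] * 50
recent   = ["ship"] * 400 + ["small_boat"] * 450 + ["debris"] * 150

THRESHOLD = 0.15  # illustrative trigger for "time to re-sample the world and retrain"
score = drift_score(baseline, recent)
print(f"drift score = {score:.2f}")
if score > THRESHOLD:
    print("Label distribution has shifted -- flag the model for review and retraining.")
```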
But then large language models came along and my life got really tough. How many people, how many people use large language models on a regular basis in the course of their job? How many people don't use large language models on a regular basis in the course of their job? Wow, that's fascinating. For those of you who don't, just yell out, how come? The government? I'm here trying to build bridges. Here I am, I'm trying to build bridges. I showed up. What do you mean by the government? She can't tell me. So not just the government, but everybody. All kinds of bad players, the government and others. All right, okay. Why else do other folks not use large language models? They're full of shit. That's not what that person said, that person said it lies, but I elaborated a little bit on that. I'm good at making up my own sentences. I like this audience.

Okay, let's flip the script. Those who like large language models, why? They make your language, that is a great use case. Where are you from originally, sir? Germany, and so can you type in Deutsch and get good English out? Okay, so what he said was he's German originally, and oh, he's probably still a Deutscher in his own mind, right? So okay, so he's a Deutscher living in the United States, and he writes it in English that's stilted, like German-inflected English, I'm assuming, and it comes out much more fluent. That is an absolutely fantastic use for large language models, and I support that one completely. Okay, other folks who use them. Wait, hold on, right here. No ads. No ads, so use it for, like, information retrieval, okay? It makes old coders relevant again? Did you say old coders? Oh, because you can learn Go really fast or learn Rust really fast? Okay, so I also think that's a really good use case for large language models, coding.

So let me throw something out to you. You can tell that I used to be a professor. This is what I would do all the time. Why is coding a good use case, but the free generation of text less of a good use case? Well, structure, what else? Let someone else have a chance. You're embarrassed, which one, the language? You can verify that it works, how? Yeah, run it. So the beauty of, both of those are the same answer, and I think they're right. The beauty of text from a large language model, sorry, whoa, code from a large language model, I'm tired, is that code is verifiable. It's a formal language. There are external tools that you can use to verify that code. I saw you shaking your head, not completely, not completely, but it has to compile. If the code doesn't compile, it's not gonna run; that's the definition of not running. If you're a good software engineer, then you'll write a unit test, and those unit tests are part of your larger system. Now, you might be a crappy software engineer, in which case you wouldn't have written a unit test anyway, so I'm not that bothered by that. And if you work at a company that has good infrastructure, there are integration tests. So your code actually gets put back into the system as a whole. And I'm not a security expert, but however you scan code for security flaws, you can scan this code for security flaws as well. I'm happy to be wrong about that. You can come yell at me later if I got that part wrong. I really am fascinated by the pluses and minuses of scanning for security flaws and whether large language models are better or worse. I think that's a fascinating question. I don't know the answer to it.
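Since I'm arguing that code is the friendlier output precisely because it's externally verifiable, here's a tiny illustration of what that verification loop looks like. The "generated" function and the unit tests are made up for the example; the point is just that compile-and-test is a cheap, mechanical filter that free-form prose has no analogue of.

```python
# Pretend this string came back from a code-generation model (it's hand-written here).
generated_source = '''
def knots_to_kmh(knots):
    """Convert speed in knots to kilometers per hour."""
    return knots * 1.852
'''

# Filter 1: does it even compile? (Free text has no analogue of this check.)
compiled = compile(generated_source, "<generated>", "exec")

# Filter 2: does it pass the unit tests we wrote for the behavior we actually want?
namespace = {}
exec(compiled, namespace)
knots_to_kmh = namespace["knots_to_kmh"]

assert abs(knots_to_kmh(0) - 0.0) < 1e-9
assert abs(knots_to_kmh(10) - 18.52) < 1e-9
assert abs(knots_to_kmh(27) - 50.004) < 1e-9
print("Generated code compiled and passed its unit tests.")
```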
But I'm saying in principle, I'm in support of code generation as part of large language models. Why am I not as much in support, or I'm not quite as sanguine, quite as optimistic about large language models generating big chunks of text? People don't proofread. Okay. The slide says, well, everybody understands that, shall we play a game? Oh, hey Ray, where's Ray? Somebody made me a model of the WOPR from WarGames. One of my favorite movies as a kid.

Okay, so here's what I find fascinating about large language models and human psychology. We evolved to treat things that speak fluently as reasoning beings. Now, we also have other ways to judge whether someone's rational. If we watch them try to get out of a, if they go into the closet five times and can't find the door out of their hotel room, maybe they're not rational. So we have other triggers about rationality, behavioral triggers about rationality, but one of those, oh, yeah, Ray, can I see that? Thanks, Ray. One of those behavioral triggers for rationality is fluency of language. Thank you. So, shall we play a game? I wanted to show it. Thank you, Roro. Roro built this for me. And if you press it, you can clap. I think you should. It's good. I can't say, what did that say, Roro? Oh, yeah, do you wanna play thermonuclear war? Okay, well, it's a prototype. It's a good prototype. Okay, so we evolved to treat beings, entities, that speak fluently as if they reason.

Okay, so we have this training data. Actually, let me take a step back. A large language model is, in fact, a language model. What does a language model do? Say it again? It simply, actually you predict sub-words, but we'll say words. It predicts the next word given the prior words. That's all it does. Language models have been around for decades. My whole academic career, industrial career. What's the good use of a language model? Auto-complete. Auto-complete is a great use of a language model. Data, I don't understand that one. She said data classification. No, no, just a language model. Forget the large part just for a second, just a language model. Okay, so here's a good use for a language model. If I am a translation system and I translate German into English, there's maybe five words which are semantically similar that I could choose from. I choose the one that's most likely given the prior sentences. So the language model becomes part of the translation system. Here's another one, speech recognition. In a noisy channel, which you have all the time, one of the words might be garbled. So you can use a language model to predict what that word might have been given the prior context. So all language models do is predict the next word given the prior words. That's it, that's all they do. It's really important.

What does a large language model do? That on steroids. That's it, that's all it does. The loss function for a large language model is: did I predict the words that I dropped out of a sentence? Did I predict them correctly? So the only difference, ooh, people are gonna disagree with the "only" part there. A lot, people in industry and in academia will disagree a lot with the "only" part there. But the major difference between an old fashioned Bayesian language model, which I taught you, and a massive new large language model is the size of the context, which we call prompts now, the size of that window, right? And the computational power behind it so it can ingest massive amounts of data.
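Here's roughly what I mean by an old-fashioned language model, a minimal sketch of a bigram model on a toy corpus: count which word follows which, auto-complete by picking the most likely next word, and generate longer text by just applying that prediction over and over. The corpus, the examples, and the greedy decoding are all invented for illustration; notice that nothing in the generation loop knows or cares whether the sentence it produces is true.

```python
from collections import defaultdict, Counter

# Toy corpus, invented for illustration.
corpus = (
    "the ship is at sea . the ship is in port . "
    "the sea is calm . the sea is rough ."
).split()

# A bigram language model: count how often each word follows each other word.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Auto-complete: the most likely next word given the previous word."""
    candidates = following[word]
    return candidates.most_common(1)[0][0] if candidates else None

def generate(start_word, length=8):
    """Greedy generation: repeatedly append the predicted next word.
    Fluency is the only objective; truth never enters the loop."""
    words = [start_word]
    for _ in range(length):
        nxt = predict_next(words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)

print(predict_next("the"))   # the most common continuation of "the" in the toy corpus
print(generate("the"))       # locally plausible word soup, with no factual constraint
```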
So you can think about a large language model, it's not 100% mathematically correct, but you can think about a large language model as just one big statistical engine that says: given all of the past context, which includes the prompts, prior answers, and what I've said up until now, what's the best next word? And now you know why large language models do two things. One, they seem really fluent. Because you can predict a whole sequence of next words based upon a massive context, that makes it come out sounding really complex. Also, transformers had a big, we can talk about transformers in a second, but also transformers had a big part in allowing not just the prior few words, but words a long time ago, to have an impact. So the math and the engineering has just really been optimized to produce fluent text.

But now you also know why these things hallucinate. And you also know why I think you're wrong if you think they reason. Because the loss function, the way they were trained, was only for fluency. There was nothing about these things whatsoever, nothing about these things whatsoever, is this translating me? That's a great use for a language model. If it doesn't understand what I said, because I said, ah, ah, ah, a lot, let's see, I want to see what, ah, perfect. It can use a language model to help guess what the word that got dropped out was. Okay, great, now I'm gonna stare at that thing the whole time, I wish I didn't know that was there.

Okay, so these things do not reason, is my conjecture. The reason I say they don't reason is there's nothing about the way they were trained to constrain them to reason correctly. The only constraint on training for large language models is fluency. And then we as humans, I believe, are duped by the fluency. Now, plenty of people in the audience are gonna tell me all kinds of really valid use cases and I'm excited about that. So I'm not saying duped in the sense that we're not gonna get great use cases out of this and great technology that we can use in all kinds of places. I'm gonna tell you about a task force that we just launched that's gonna figure out exactly that: where will these things be valuable? But in a sense, we're lucky if they land on a fact, not unlucky that they hallucinate. I'm surprised they're truthful at all, as opposed to being surprised that they hallucinate. Hallucination should be the norm, because all they were trained to do is generate coherent text.

Okay, so here's what I really think happens. Because we evolved to think that those who speak fluently have reasoning underneath them, we believe that all of the training data that was used to train a large language model has captured within it the patterns of reasoning. We tacitly believe that there's so much training data of fluent speech that it captures within it the patterns of reasoning, so that the large language model itself has somehow encoded the patterns of reasoning. That's what I think we tacitly believe. And to me, it's an empirical question whether these things are going to be able to reason or not, and so far they have not borne that out.

Okay, how many people tried to hack LLMs earlier today? Okay, how many people succeeded in getting it to do something bad? What happened? You switched to German and it told you secret things.
Why do we, why is there a large, first of all, let's ask the administrators, why is there a large language model being used that has US secrets in it? Let's not show that to the world either. Did it really show, I mean, it wasn't actually secret, right? I see, I see. Anybody else, what did you get it to do? Write malicious scripts, interesting. Did anybody get it to tell a lie? What'd you get, what'd you do? You asked the first what? First computer game for girls, what's the answer actually?

So here's what I got today, I think that's a great example. Anybody see the movie The Usual Suspects? It's a great movie, right? It's a great movie. Who did Stephen Baldwin play? Michael McManus, right, okay. So I asked it, who was Craig Martell? Of course, I'm a little self-centered enough to do that. And it told me that Craig Martell was the character that Stephen Baldwin played in The Usual Suspects. I didn't know that was the case. I went back and checked, I wasn't in the movie. I wasn't portrayed in the movie, it's too bad. So you could say, I don't know why that happened, why did it reason like that? That's the wrong question. Just for some reason, the prompts generated those strings. I don't know why it was in the state that it was in, but you really gotta see that these things are just string to string. And even when you ask it, what are you thinking? Or why did you, these chain-of-reasoning questions, why did you believe, why did you come up with that answer? It cannot introspect, folks. That's also just a unicode string to a unicode string function. "What was your chain of reasoning?" is a unicode string to a unicode string function. And if its output had anything to do with its internal state, it's purely by chance. So it's really important that we do not impart upon these things sentience, that's kinda nuts, reasoning ability, or factual constraints, because none of those things are true. Now, does that mean these things aren't useful? Not at all. Not at all.

I don't know what I'm saying next. Oh, yeah, so let's talk about this. Another big frustration I've been feeling is that AI does not equal LLMs. I have a T-shirt that my team just made for me that says that, but I'm waiting for my belly to go down before I wear it. And we have a bunch of stickers that say AI does not equal LLMs. What do I mean by that? Yeah, the hype cycle has just decided that large language models are the only thing, right? But that's unfair because large language models are unproven and we wanna help prove them out. I actually think, as a scientist who studies natural language processing, I think LLMs are the most exciting thing that's happened scientifically in my entire career. I just think there's a really large gap between LLMs as a scientific enterprise and the products. That's really my frustration. My frustration isn't in the science.

But part of the underlying technology in large language models is transformers, right? The T in GPT is transformer. I think it's generative pre-trained transformer, but I know for sure the T is transformer. And so transformers have made the world radically better. Transformers are why you can talk to your phone as well as you can. Transformers are why image recognition works as well as it does. Image recognition works really well. Like, fascinatingly well, right? Transformers, deep learning and transformers. Transformers are why you can go to a country that doesn't use Roman script at all. My wife is Thai so I'll use Thailand as the example.
You can go to Thailand and you can hold up your camera and suddenly the script turns into English that I can read. Transformers, right? I can go to Deutschland and hold it up and get it mostly right, right? So transformers have been really amazing. And large language models have benefited from transformers, and they are also really amazing at generating fluent text.

Here's my frustration though. If I show you a picture of a dog and tell you it's a cat, what's the cognitive load that you need to decide it's not? Do you wanna just change? I tell you it's a cat, you can decide extremely quickly it's not a cat, it's a dog, right? So you can evaluate the output of that system. So take an image recognition algorithm and embed it into a pipeline with human beings, and let's say it gets it right 70% of the time. That might sound scary depending upon the use case, but it's really not, because for most image recognition tasks you can say right away yes, no, yes, no, yes, no, and a human being can be in the pipeline to say oops, false positive, false positive, false positive, and just take it out of the queue. So there's been really massive growth because of transformers in things like image recognition, and it's allowed for really cool human machine interaction, and it's really changing the world. I'm super excited about it.

However, if I generate 30 paragraphs of text, how easy is it for me to decide what's a hallucination or not? She's telling me to change the slides. Thank you, Kathleen. How easy is it for me to decide what's a hallucination or not? It takes time. You also often want to use large language models in a context where you're not an expert. Like, that's one of the real values of a large language model, asking it questions where you don't have expertise, right? And so how likely are you, a non-expert, to read through that whole document? First of all, I fear a lot of uses, a lot of desired uses of large language models are to create massive amounts of text that I don't have to read so my job can be easier. Okay, that's a fear. Witness the lawyer who submitted that brief with 10 phantom cases, right? So setting aside natural laziness, none of you guys are lazy, right? So even if you wanted to go through and proofread all of the text generated, how likely are you to be able to determine what's a hallucination or not? Highly unlikely. So my fear, my concern is a better word because I think these things are great, my concern is that the thing that the model gets wrong has a high cognitive load to be able to determine whether it's right or whether it's wrong, okay? So I'm a big fan of the science of large language models. As a natural language processing researcher, I'll tell you it has pushed the field decades forward. I'm concerned with the products being sold, and that gap in between, okay?
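Here's that cognitive-load asymmetry as back-of-the-envelope arithmetic. Every number below is invented for illustration, a couple of seconds to confirm a detection versus minutes to fact-check a generated claim, but it's why a 70%-right image model with a human in the loop can be workable while proofreading 30 generated paragraphs for hallucinations usually isn't.

```python
# Rough, invented numbers -- the point is the asymmetry, not the specific values.

# Image pipeline: a human glances at each detection and drops false positives.
detections = 200          # model outputs to review
seconds_per_check = 2     # "yes that's a ship / no it isn't" is nearly instant
image_review_minutes = detections * seconds_per_check / 60

# Text pipeline: a human fact-checks every claim in a generated document.
paragraphs = 30
claims_per_paragraph = 4
minutes_per_claim = 3     # look up a source, confirm or reject the claim
text_review_minutes = paragraphs * claims_per_paragraph * minutes_per_claim

print(f"Reviewing {detections} image detections: ~{image_review_minutes:.0f} minutes")
print(f"Fact-checking {paragraphs} generated paragraphs: ~{text_review_minutes:.0f} minutes")
```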
Any thoughts or questions? I'm really curious what this audience thinks. Yeah, so the question is, why should a human validate the output? Why not have other AI models validate it? You can't have turtles all the way down. You can't. If you show me that some other model, which probably had to be validated by humans, is that good and it can validate this other model, here's the way I think about it. Here's a big black box. You could have five language models inside of it or one. A question comes in and the answer comes out. I just want the answer to be true. So if you can build me a validatable system where I can measure the correctness of the output text to test whether that box is good, and you use other models to validate it, I am all for it. But at some point a human being has to judge the box. It can't be turtles all the way down. If it's turtles all the way down, we're just giving up our scientific responsibilities, and we can't give up our scientific responsibilities.

Yeah, and he's way smarter than I am. Yeah, so I have the intuition, but let's start with this: I am an empirical scientist and I do not mind being proven wrong at all. It won't bother me at all. I would love for us to solve the hallucination problem. Let me jump to the end of the talk. Here's what I really want: a set of use cases, classes of use cases, kind of like level one through level five autonomy in cars, where we have increasingly difficult acceptability conditions and metrics to go with those acceptability conditions. And then you can just show me. I have proven to you there are no hallucinations and it works for this use case. Or I've proven to you that there's 80% hallucinations and that's fine for now. Or I've proven to you that it works really well for first draft generation and our acceptability conditions for first draft generation are lower. I'm on board with that. So what I really want to push the community towards, and I'll talk about Task Force Lima in just a second, which is doing exactly that, what I really want to push the community towards is a set of classes of use cases that are increasingly difficult, that have acceptability conditions and metrics to measure them against.

And what's really interesting, in all my conversations with the big companies I've had fairly similar conversations. I think everybody agrees we need those evaluation metrics. Right now we evaluate it by going, ooh, isn't that cool? Or we read some of it. We read a paragraph or two and go, oh, sounds good. That's just not acceptable to me. I am totally okay if these things work. In fact, I really want them to. The scientist in me really wants them to. The person who runs AI for the Department of Defense wants to make sure that if you're a soldier in the field asking a large language model a question about a new technology that you don't know how to set up, I need five nines of correctness. I cannot have a hallucination that says, oh yeah, connect widget A to widget B, and it blows up, right? So there's this part of me over here that's a scientist who is freaking out happy. And then there's part of me over here who's responsible for determining when we accept things in the Department of Defense. And right now the community is going, ooh, isn't it cool? So over here, I'm saying ooh, isn't it cool? And over here, I'm saying ooh, isn't it cool, isn't enough. Tell me what the acceptability conditions are and tell me the metrics we're using to measure it. Don't care if you use another model to check another model, awesome. Don't care if you use humans in the loop. Don't care, show me the input, output pairs and show that we get there, right?
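What I mean by acceptability conditions and metrics is nothing fancier than this kind of harness: a labeled evaluation set per use case, a measured error rate, and a pass/fail threshold matched to the risk of that use case. The use cases, the thresholds, and the exact-match scoring stub below are placeholders I made up; the real work is building the evaluation sets and agreeing on the numbers.

```python
# Illustrative acceptability conditions: maximum tolerated error (hallucination) rate per use case.
ACCEPTABILITY = {
    "first_draft_generation": 0.20,      # humans will heavily edit it anyway
    "summarizing_doctrine": 0.05,
    "field_technical_answers": 0.00001,  # a "five nines" style requirement
}

def hallucination_rate(model_answers, ground_truth, is_supported):
    """Fraction of answers not supported by ground truth.
    `is_supported` is a stand-in for whatever checker (human or automated) you trust."""
    errors = sum(0 if is_supported(a, t) else 1 for a, t in zip(model_answers, ground_truth))
    return errors / len(model_answers)

def acceptable(use_case, rate):
    return rate <= ACCEPTABILITY[use_case]

# Toy evaluation: exact match stands in for a real support/entailment check.
answers = ["yes"] * 9 + ["no"]
truth   = ["yes"] * 10
rate = hallucination_rate(answers, truth, lambda a, t: a == t)
print(f"measured error rate: {rate:.2f}")
for use_case in ACCEPTABILITY:
    verdict = "ACCEPT" if acceptable(use_case, rate) else "REJECT"
    print(f"{use_case}: {verdict}")
```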
The only reason I play the Chicken Little role, because I think what I'm interpreting your comment as is that I'm being a little bit Chicken Little, well, the reason I play that role is because I really wanna de-hype this. I really wanna turn it from "here is this monolithic technology designed to solve all of our needs" and bring it down to "here's a set of use cases and evaluation metrics for those use cases, and some technologies will work for some use cases, other technologies will work for other use cases." I wanna make it boring and measurable. That's really my goal.

Yeah, I didn't quite hear, here's what I heard, tell me if I got it right. In the absence of good data, can you still do things? No, sorry? Can you stand up and yell? Sure, it's just that to do that, you gotta take a baby, we gotta invent a baby, and here's the fundamental flaw with the way people think about this. Babies come into the world with all kinds of cognitive apparatus, all kinds. They can recognize faces, they know hunger, they know causation, they have numeracy, they can tell numerical things. I have 10 minutes left, okay. They come into the world with all kinds of apparatus, so we can't just assume a randomly initialized network. And by the way, an artificial neural network isn't the brain, that's a whole different conversation, but it's simply a set of logistic regressions, or pick your activation function, connected together. It's got nothing to do with the brain, right? It looks similar in the way you draw it on the board, but you could turn those circles and arrows into equations that look nothing like circles and arrows. Circles and arrows are imagery; the calculations are all happening in a computer. So you can't take a randomly initialized network to be anything like a child's brain. I think that's the fallacious piece there. Does that make sense?

Way over there, yep. Let me, okay, that's awesome. Let me echo his comment so that people hear it. His comment was, what about those damn kids in the next generation? That's basically what it was. We're trying hard to find hallucinations ourselves, and we already have a hard enough time doing that. What about the next generations who get used to using these machines and the machines tell them that they're telling the truth? Okay, two-fold. One, I think you just nailed something that's extremely important, and what you nailed that's extremely important is people have a proclivity to believe authoritative, fluent speech, and that's one of the problems with these models, because they're trained to speak fluently and we defer to them, so that's a real problem. Second one is, how about we draw a line in the sand and we don't ship these damn things until they stop hallucinating, or until we have ways to mark their hallucinations? Or, industry, how about we have ways that facts have weights on them, where we think it's 70% correct? There's all kinds of things that we could push the community towards. My objection is this free-flowing text and telling everybody that AI's arrived. It hasn't.

Oh, sorry, this woman in the front. She's in the Screen Actors Guild, we should all give her a round of applause for being on strike. Yeah, so she's saying that the studios want the first draft to be written off of your intellectual property but written by our AI, and then we'll hire you to edit it. Keep fighting, sisters. The only thing I gotta say there, I think, first of all, I think if the studios win this battle, we're all gonna see really crappy movies. Just really, really, really crappy movies because, what's that, too late? There's some good movies.
I just saw Big George Foreman, I loved it. So if these models regress to the mean, and they regress to the mean very quickly, I really am not excited about a world where the mean movie is constantly generated over and over and over again. Ask it to generate a bedtime story for your kid and tell me if your kid likes it. It's very unlikely to be the case. I did that for this audience only. I would have said text to text anywhere else.

Sure, yeah. So what he said was, I'm in the DOD, he reminded me that I was in the DOD, and then he told me that Palantir, I got five minutes, he told me that Palantir, which is a really great contractor and partner of ours, is developing a system where he was showing how the large language models were developing plans and all sorts of things. Okay, the short answer, and then I want to talk about Task Force Lima. The short answer is, I talk with the CEO of Palantir all the time. He completely understands my concerns and he's on board with use cases and metrics and evaluation, I mean acceptance criteria. So I'm totally cool with companies doing all of this research. It's incumbent upon us, and that includes all of y'all, it's incumbent upon us to say, we will accept it if and only if. And that's Task Force Lima.

So Task Force Lima was just announced yesterday and it's a DOD-wide effort, but it's being run out of my office at the CDAO. It's incumbent upon us to say, what's that, a UL? We need a UL, that's great. It needs a better language model. Task Force Lima's job is two-fold. First, to understand the demand within the Department of Defense for generative AI and large language models in particular. To think hard with those who are making that demand, whether we could actually field one of these systems to solve it, and what are the risks and how do we mitigate them? So for example, first draft generation, no problem; connected to something kinetic, absolutely not. Imagine a war gaming situation, maybe, because human beings would read it and validate it before sending it up. So we're thinking really hard, sort of, what is this ladder of demand, and what do we need the system to be able to provide? And can we build a human machine teaming that could mitigate against the risks? So that's one thing that we're doing. And the second thing that we're doing is working with industry, as I said a bunch of times here, to build out these use cases and acceptability conditions so that industry as a whole, academia, industry, and you guys as a whole, can start pushing us towards meeting those acceptability conditions.

So now, I've been completely negative the whole time. Now let me be clear. I would love for the promise of generative AI and large language models to come to pass. I would love it. But we have to be non-hype, systematic, and empirical about it, and measure that they're actually delivering against the acceptability conditions. Let's be scientists. Now what do I need from y'all? Hack the hell out of these things. Hack the hell out of them. Tell us where they're wrong. Tell us how they break. Tell us the dangers. I really need to know, because I just have strong intuitions, but you guys can give me some facts. And on that note, if you go to dds.mil, you have to have the www, sorry, if you go to www.dds.mil slash Task Force Lima, you can click on the "work with us" link and you'll send us an email, and there's a template in that email about the kinds of things you might wanna do to help us understand where these things are broken. Why am I here today?
I'm not here today to be Professor Craig and teach you about large language models. That was just a positive side effect for me, because I like to hear myself talk. I'm here today because I need hackers everywhere to tell us how this stuff breaks, because if we don't know how it breaks, we can't get clear on the acceptability conditions. And if we can't get clear on the acceptability conditions, we can't push industry towards building the right thing so that we can deploy it and use it. I am out of time. I have one minute. Any other questions, comments? Anybody wanna yell at me? You're welcome.

Ah, that's a great question. Oh God, that's a good one, I only have a minute. So she asked, how do you measure something like Google search? That's great. We already have cognitive apparatuses that allow us to do that. We go in, we read it, we look at the URL, and we check it against the other knowledge in our head, or we go to another link and see if they are coherent. We've built up, Google search was '98, I think, we've built up since 1998 these tools that allow us to do this. They don't speak fluently and authoritatively. You go in, you read it, and you extract facts, like you have been trained your whole life to extract facts from documents, and you use your reasoning and your judgment to decide whether it's right or whether it's wrong. Sometimes you're right, sometimes you're wrong. But we know how to do that. We know how to take responsibility for doing that. But these things that speak fluently, I am done, these things that speak fluently and authoritatively are creating that next generation that the gentleman over here is afraid of, where they just go, aha, the machine told me. Let's work hard to not ship the system until the machine tells us the right things. We can do that, we're scientists. We don't have to give in to the hype. I'm done.