But I'm very pleased to introduce Sanjay Sarma. He's a professor of mechanical engineering here at MIT, and he has a very storied career. He's also the co-founder of the AutoID Lab at MIT, he has founded a company, and he's done a bunch of great stuff; you can see his bio. The AutoID Lab has had a long partnership with GS1: its researchers and students have been looking at the future of automation technologies for over 20 years, and we've been working with MIT on projects related to automatic identification, data capture, data sharing, and supply chain automation. So it's my pleasure to introduce Professor Sanjay Sarma.

Thank you, Melanie, such a pleasure. So our conversation today is about data, and GS1 US, where Melanie works, is actually one of the great sources of data. But let me just start. By the way, I'm a professor, as you know, so I'm gonna profess and ask you questions, just to keep things going. Where does the word data come from? Anyone? It comes from datum, as in a reference point, and it also has roots in the Latin for give, as in donate. Do you know where the word statistics comes from? Anyone? What? Liars? Oh yes, not just liars: lies, damned lies, and statistics. It comes from the fact that the state needed data. The state collected data and analyzed it; that's where the term statistics comes from. So data is very central to modern civilization, right? And it goes back to the beginning of counting. The first counting that we know of was probably on fingers. Then came the sexagesimal, base-60, system from Mesopotamia, built on the numbers 12 and 5, which give us 60; it survives today in the form of minutes, seconds, and degrees. Then the decimal system, which came from India; before that you had Roman numerals. These were all systems used to count and keep statistics.
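The base-60 survival in minutes, seconds, and degrees is easy to make concrete: converting a decimal angle to degrees-minutes-seconds is just repeated base-60 arithmetic. A minimal Python sketch (the function name is my own):

```python
def to_dms(angle):
    """Split a decimal angle into base-60 parts: degrees, minutes, seconds."""
    degrees = int(angle)
    remainder = (angle - degrees) * 60      # first base-60 subdivision
    minutes = int(remainder)
    seconds = (remainder - minutes) * 60    # second base-60 subdivision
    return degrees, minutes, round(seconds, 2)

print(to_dms(12.5125))  # (12, 30, 45.0): 12 degrees, 30 minutes, 45 seconds
```

The same two divisions by 60 are exactly the Mesopotamian subdivisions we still carry around on our clocks and compasses.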
So data is a very central part of the transition from agrarian economies to trade economies, to towns, to the management of cities, states, countries, and international relations. And then data took a real jump with the invention of the telegraph. Because suddenly, with the telegraph, it wasn't word of mouth anymore. You couldn't just catch up with someone and informally share information with some data thrown in; you had to be very specific about what you sent. The telegraph was used to transmit news of wars in Europe, and it was used to transmit commodity prices, to make bets. Suddenly you had to be extremely specific about your data. Data also, of course, came from trading, from the Silk Route and then from the great colonial and international movements. That goes back a thousand-odd years. It was in Indochina that it really started to grow, when the Indian and Chinese civilizations occupied Southeast Asia as colonial empires. There was a lot of trade, and in fact a lot of the numerals we use today, the decimal system, really took off there. Then, of course, with the European colonial movement, keeping logs became essential. So data is central to everything we do.

And now, of course, we live in an analog world. In the last 50 years or so, though the roots go back about 100, we've gone digital because of the growth of computers. Even when you sent stuff over the wire you had to digitize data in some respects, but truly digital data is only about 50 years old. The strange thing is, we act like the world is digital, but we live in an analog world. Sort of weird, right? Everything we do is analog. This coffee that I'm drinking is analog: it would be impossible to capture the amount of coffee exactly as a digital number.
It's an analog quantity, because the number of bits I'd need would be almost infinite; I'd have to count down to the atomic level to get it exactly right. And in fact, we're struggling with that right now. Look at the autonomous vehicle industry: a Tesla's self-driving system uses digital data to navigate an analog world, and often it gets it wrong. With a Tesla, or any self-driving car, this is how it tries to tell that it's come to a traffic light: it looks for colored circles and says, I think that circle is red, I should stop. In fact, there was an event recently where Elon Musk was live streaming from a Tesla, and the car misread the light and he had to take over. A little embarrassing on a live stream, right? That system was trying to get analog data from the traffic light, turn it into digital data, and make a digital decision about whether to go straight or turn, and it got it wrong. So this is a constant theme here: analog world, we act like we're digital, we're not there yet. Now, actually the right thing to do is to put an RF transmitter in the traffic light and have the transmitter tell the car, I'm green, you can proceed. But that's not that easy, because how do you know the signal is directional? One of the things about sight is that you know where you're looking; an RF signal can be hacked, et cetera. There's a lot of work there; in fact, we're working on it. So this whole connection between data, analog, and digital: I think we're in a very special time, and we're groping our way through a very uncertain space right now. So that's analog versus digital. Now, we can also classify data in terms of its sources. Obviously, human-created data is a huge source of data, right? Every time we tweet, and this is relatively new, you know?
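The "almost infinite bits" point can be illustrated with a toy analog-to-digital converter: quantizing a continuous value onto a finite grid always leaves a residual error, no matter how many bits you spend. A hedged Python sketch (the parameters are invented):

```python
def quantize(value, bits=8, full_scale=1.0):
    """Map a continuous value in [0, full_scale) to one of 2**bits levels."""
    levels = 2 ** bits
    step = full_scale / levels
    return round(value / step) * step  # snap to the nearest grid point

x = 0.123456789             # an "analog" value
x8 = quantize(x, bits=8)    # its 8-bit digital version
print(x8, abs(x - x8))      # 0.125, plus a residual error that never fully vanishes
```

More bits shrink the error but never eliminate it, which is exactly the coffee-cup problem: the analog quantity is only ever approximated.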
Every time we create something on TikTok, every time you watch a movie on Netflix, every time you shoot a video, take a picture of your family, take a picture of your dog, it's data. That's data too; it's human-created. By the way, human-created data is turning out to be really interesting because it's crowdsourced, to use the term. You may not know this, but in 2011, when there was an earthquake near DC, the Washington Monument developed a crack. Remember that earthquake? 2011. People in Boston got wind of it because the tweets arrived before the earthquake did, right? So human-created data, the crowdsourcing of data, is a very interesting space; we were in it right at the beginning. The problem was that it was unstructured data. And LLMs have this magical power of taking unstructured, human-created text data and making it structured. I think that's one of the hidden, extraordinary, magical powers LLMs have given us. In some ways, I think we're distracted: we look at LLMs and say, oh wow, look, it just responded to me. But actually, the real power is that they take unstructured human data and structure it. How many of you have used the automatic summary on a Zoom call? Pretty good, right? What we used to do was called sentiment analysis, and it was so primitive: look for words that are negative. But now with LLMs, you can handle the double negatives, you can look for the subtleties, and really pull some amazing stuff out of it. So that's unstructured versus structured data. Most of the data we deal with is structured; we live in a world of structured data. But unstructured data is a whole window that we've just opened our eyes to, and it's gonna be game-changing. I was talking to a friend of mine the other day. He has a startup that's doing really well; they used to give customers forms, and the customers wouldn't respond.
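The contrast with old-school sentiment analysis is easy to demonstrate: a keyword counter has no notion of negation, so a double negative fools it completely. A toy Python illustration (the word list is invented):

```python
NEGATIVE_WORDS = {"bad", "terrible", "broken", "awful"}

def naive_sentiment(text):
    """Old-school approach: count negative keywords, ignore all context."""
    words = text.lower().replace(".", "").split()
    hits = sum(1 for w in words if w in NEGATIVE_WORDS)
    return "negative" if hits > 0 else "positive"

# A double negative defeats the keyword counter:
print(naive_sentiment("The product is not bad at all."))  # "negative" (wrong!)
```

An LLM, by contrast, reads "not bad at all" as praise, which is the subtlety being described here.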
Instead, they have the customer respond informally in a chat, fill out the form from the chat, and they see a three-to-four-x improvement in response rate, because now you're being more human. So LLMs are very interesting. We'll come back to LLMs, but I think that's a very fundamental way to look at their impact on data. Another way to classify data, beyond structured and unstructured, is to go a level higher than the data itself. So you have data, and then you have metadata, right? That's data too. For example, if I take a picture and there's a GPS tag that says I took it while I was on vacation with my family in Portugal, then when that picture shows up in the photo feed on my Mac, I can click on it and go, oh yeah, that was Portugal. That was wonderful. I remember we went on that vacation, we went to Sintra, we had this wonderful meal. I wonder if I can find a picture of the time we went up to the palace in Sintra. And you can search for Sintra, because the metadata stored in your pictures captures the GPS location. So that's metadata: it's data, but one level up. Now there's a whole new shebang, one level even higher than metadata. There's no word for it. Some people call it weights, as in machine learning and deep learning, but for lack of a better term I call it hyper data or supra data. The idea is that it's not data and it's not metadata; it's a machine-learned, abstract distillation of that data. It's fascinating. That's the new gold; that's the world we're in now. For example, one of my former students is currently, at my request, creating a fake version of me. Okay, I myself am fake, so this is gonna be even more fake: my face, my voice, my accent. She just sent me the generated version of me, completely synthetic; I'll play it for some of you afterwards if you want.
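The Sintra search works because the query runs over metadata rather than over pixels. A minimal Python sketch of that idea, with invented file names and fields:

```python
# Each photo carries metadata alongside its pixels; location comes from GPS tags.
photos = [
    {"file": "IMG_001.jpg", "taken": "2022-07-12", "place": "Sintra, Portugal"},
    {"file": "IMG_002.jpg", "taken": "2022-07-13", "place": "Lisbon, Portugal"},
    {"file": "IMG_003.jpg", "taken": "2023-01-02", "place": "Boston, USA"},
]

def search_by_place(library, query):
    """Search over metadata, not pixels: roughly what a photo app's search does."""
    return [p["file"] for p in library if query.lower() in p["place"].lower()]

print(search_by_place(photos, "Sintra"))  # ['IMG_001.jpg']
```

Real photo apps index the same kind of fields (EXIF GPS tags resolved to place names), which is why typing "Sintra" finds the vacation pictures.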
There are numbers behind that which distill some aspects of me. It's trained on various videos of me: my voice, my face, the way I speak, my diction. And it's all reduced to a bunch of numbers. I am reduced to a bunch of numbers. That's data too. It's not regular data, it's not metadata; it's one level higher: weights, supra data, hyper data, something. This is part of why Elon Musk keeps tightening Twitter, which we should call X Twitter now, right? Isn't that a better term? Why do we say X, formerly called Twitter? Why don't we just call it X Twitter? It's one word anyway. One of the reasons access to Twitter is being controlled, in addition to monetization, is that the eye of Sauron, these AI systems, is looking for anything it can feed on to get data. That's also why Reddit, for example, is shutting off access: any corpus of data is fair game for generating the hyper data, the gist, the essence with which machine learning systems are driven. It's a fascinating time. In fact, it's reached an interesting point. Of course I'm here right now, but I'm also the CEO of a university in Malaysia. The Malaysian language, Bahasa Melayu, has a very small corpus; there are only about 33 million people. Indonesian is similar but certainly a different language. It's like New York English versus Boston English, and I'm just kidding, the difference is much bigger. Indonesia has 270 million people, so a much larger corpus. As a result, generative AI works with Bahasa Indonesia but not with Bahasa Melayu; there just isn't a big enough corpus. So data is valuable now even for that, for the hyper data. It's a very interesting time we're in, and I think we're gonna fight over it. So that's hyper data, and then there's human data, right?
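The idea of a model distilling a corpus into "a bunch of numbers" can be shown at toy scale: an ordinary least-squares fit reduces any number of observations to just two weights, after which the raw data can be thrown away. A minimal sketch (the data are made up):

```python
def fit_line(points):
    """Distill many (x, y) observations into two numbers: slope and intercept."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

# Ten observations collapse into two "weights"; the originals are no longer needed.
data = [(x, 2 * x + 1) for x in range(10)]
print(fit_line(data))  # (2.0, 1.0)
```

A deep network's weights are the same idea at vastly larger scale: billions of numbers standing in for the corpus, which is why those weights, not the raw data, are the contested asset.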
I don't mean just the human data generated through our social media, tweets, posts, but actually something weird, which is brain-computer interfaces, right? What happens there, and in fact I'm working on this, is that today you can literally read the human brain and figure out what a person is seeing, what they're thinking. It sounds like science fiction, but it's not. There's Neuralink, you know, the Musk startup. We do it with EEGs, because most of our volunteers weren't too happy about drilling holes through their skulls, so we decided to just use EEGs. But with EEGs and with MRIs, you can actually do some pretty spectacular things. There's a professor at UC Berkeley who, through an MRI, can show an individual a number and guess which number they're looking at. It's quite fascinating, right? Because we're figuring out how the brain works. The brain is a circuit. It uses these things called neural networks, which we mimic in computers, though with very different architectures. There's the occipital cortex, and with an EEG over your occipital cortex, in my research group we can tell where your attention is. So you can use brain interfaces to generate data, and by the way, that data can be used to train machine learning algorithms. And then, finally, there is all the machine-generated data, right? Barcodes: we just talked about GS1 US barcodes, you've seen them everywhere, along with QR codes. Where was the barcode scanned? RFID, supply chain data, IoT data; your HVAC system has temperatures; every camera is capturing pictures, and you should be able to distill that data. In fact, we use it today, sometimes using Google cameras to figure out who did something on the street, a criminal activity, right? That's all data. And with machine learning, with LLMs, we can take all this data and pull value out of it. Now, right now, there's an argument about autonomous driving.
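The barcode data mentioned here has real structure worth a quick look: every GS1 GTIN (the number under a UPC or EAN barcode) ends in a check digit computed with alternating weights of 3 and 1 from the right. A short Python sketch of that standard algorithm:

```python
def gs1_check_digit(payload):
    """GS1 check digit: weight digits 3, 1, 3, 1, ... starting from the right."""
    total = sum(
        int(d) * (3 if i % 2 == 0 else 1)
        for i, d in enumerate(reversed(payload))
    )
    return (10 - total % 10) % 10

# UPC-A example: the first 11 digits determine the 12th.
print(gs1_check_digit("03600029145"))  # 2
```

The same formula covers GTIN-8 through GTIN-14, which is why a scanner can instantly reject a misread digit before any database lookup.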
You know, a Tesla car, and I work in this space, so I can say this with some confidence, is just a skateboard of batteries, two motors, four wheels, and an iPad. That's really all it is; everything on top is just stuff you sit in. So what makes Tesla so valuable? The data. Because they rolled the cars out so fast, every Tesla has been capturing video, and radar data (well, they've stopped using radar, but they'll come back to it), and that's what you train autonomous vehicles on. It's all the data, right? There's one other company that has incredible data, and that is Mobileye. Mobileye was acquired by Intel, which then spun it out; I think it has a valuation of around $42 billion. It was founded by Amnon Shashua and is based in Israel. And Mobileye has data: self-driving, autonomous-car data. It all comes down to data, right? And now we come to AI, neural networks. What do neural networks do? And by the way, anyone know how old they are? I'm not talking about natural neural networks; I'm talking about artificial neural networks. How old? Anyone? Yeah, about 70, 80 years, right? We've known about artificial neural networks for roughly 80 years. The field took off, and then it came to a crunching halt in about the 70s. Two professors here, Marvin Minsky and Seymour Papert, who were at the Media Lab next door, not this building, showed that simple perceptrons really couldn't do it. And then there was something called an AI winter. That was the first wave. In about the 2006-to-2012 timeframe, brute force won the game: GPUs and lots of data from the internet. This is my dog, this is my cat; labeled data sets, right? And using GPUs and something called backpropagation, people like Geoffrey Hinton created the second wave of AI, which is all about recognition and prediction: face recognition, Siri, Shazam.
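The Minsky-Papert limitation can be reproduced in a few lines: a single-layer perceptron learns a linearly separable function like AND, but no setting of its weights can represent XOR. A toy Python sketch (the learning rate and epoch count are arbitrary choices):

```python
def train_perceptron(samples, epochs=20, lr=0.1):
    """Classic perceptron rule: nudge weights toward each misclassified sample."""
    w0, w1, b = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for (x0, x1), target in samples:
            out = 1 if w0 * x0 + w1 * x1 + b > 0 else 0
            err = target - out
            w0 += lr * err * x0
            w1 += lr * err * x1
            b += lr * err
    return lambda x0, x1: 1 if w0 * x0 + w1 * x1 + b > 0 else 0

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

f = train_perceptron(AND)
print([f(a, b) for (a, b), _ in AND])  # [0, 0, 0, 1] -- AND is learnable
g = train_perceptron(XOR)
print([g(a, b) for (a, b), _ in XOR])  # never matches [0, 1, 1, 0]
```

Stacking a second layer fixes this, which is exactly what backpropagation plus GPUs later made practical in the second wave.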
All of these are based on pattern matching. That's the second wave: deep neural networks, convolutional neural networks, et cetera. The third wave we're in does not signal the end of the second wave; the second wave continues, healthy as ever, and in fact it may have the bigger impact. The third wave is LLMs, large language models. They started in about the 2017 timeframe, and in November or December 2022, depending on where you are, you probably remember the first time you logged into ChatGPT and it shook your world, right? That's LLMs. And with LLMs, as I said, you can take unstructured data and put some structure on it. That is game-changing. Now you could record this speech, all your comments, and turn it into a report; just imagine how incredible that is. I'll make one more comment about LLMs, something I think is spooky and beautiful at the same time, before we end. Some of you may have heard of Meta AI, which of course is Facebook's AI group. They had a model called LLaMA: Large Language Model Meta AI. What happened was they open sourced the software but decided to keep the weights, the supra data, to themselves. And the weights got leaked. Suddenly LLMs were released into the wild; they're now in the public domain, and you can download one and create your own digital twins, which is exactly what one of my friends did to me. And I think that is our future. So, ladies and gentlemen, I think we're in a very special time. This is the world of data. You know how they say data is the new oil? It's unfortunate that it's such a cliché, because it's true. We'll talk more about it today, but thank you very much.