Good afternoon, everyone, please be seated. We'll be starting shortly. Test, test. Maybe a little lower; I tend to speak loudly. All right. Hello, everyone, welcome back. I hope you had a nice lunch; it was good, at least for me, so please do not fall asleep. We have another riveting second half coming up. We have with us Professor Jan Vitek. He is a professor at Northeastern University and at the Czech Technical University. He works on topics related to designing and implementing programming languages, and has worked on the design and implementation of several languages, including Java, Julia, and R. If any one of you is a data scientist, you'd be very familiar with Julia and R. He was one of the founding members of H2O.ai; again, if anyone here is a data scientist, I am sure you will have heard the name. He is also a founding member of Fiji Systems, a provider of embedded solutions. Today we are very lucky to have him join us to talk about data scientists. I'm sure a lot of you are thinking about it as a career, since you are primarily bachelor's students here. For the longest time, data scientists have been seen as domain experts with limited proficiency in programming. But he is here to break that myth and say that data scientists are in themselves also programmers. Over to you, Professor Jan.

Thank you, thank you. Yes, don't fall asleep. It's a great pleasure to be here, and an honor to share a few thoughts and a few moments with you. I'm sorry, I'm a bit sick: I went two days ago with my wife to watch Dune 2, and the air conditioning was so, so cold. My apologies. So today I have prepared this talk, which I've entitled "Data Scientists Are Programmers Too". The general point I want to make is that my area of research, programming languages, is part of data science, and that data science is a great source of inspiration for programming language research. I've structured the talk in three parts. First, I will tell you a little bit about my path towards data science. Then I will give you a definition of the term and some examples of what it means in practice. And the remainder of the talk will be some open research problems: some I work on, some I would like somebody to solve for me. And interrupt if you want, just yell, ask questions; it's always appreciated.

OK, so let's get started. I thought for this first part to have a framing question, something that maybe you have asked yourself, and if you stay in science I can guarantee you will ask yourself: how do I pick what I want to work on? How do I pick a research topic? It's an important question. You're going to spend several years working on whatever you choose, so you'd better make the right decision. How do you do that? Maybe one example can provide at least some food for thought. By training I'm a programming language researcher, so for the first 15 to 18 years of my research career I was working on programming language semantics, on type systems, on program analysis, on garbage collection, all sorts of very traditional computer science programming language topics.
And then around 2010, when I was a professor at Purdue University, I met a statistician named Luke Tierney at a conference. He was really excited about what he was doing, and he was doing some computing using this strange language called R. He showed me his code, and I looked at it and thought: shall I be polite, or shall I tell him this is the most god-awful language I have ever seen in my life? Usually honesty is the best path forward, so I said, you know, as a programming language person, we can do better than this. He smiled and said, OK, show me. It took a little while, but I said, all right, maybe I should take this seriously. So I started a side project, besides the other things I was doing, to understand what R is and maybe to find tiny improvements. And that started a path. About a year later I was invited to give a talk at an embedded systems conference, and for that talk I decided to write a paper on how to do more rigorous experimentation in systems research, which means repeatable research; part of the paper was about how we were using R for repeatable analysis of the experimental data. That led me to artifact evaluation. At the time I was also chair of the ACM Special Interest Group on Programming Languages, and in that position I was overseeing several conferences in the field. I managed to convince all of them that it would be a good idea to adopt artifact evaluation. What is artifact evaluation? It's the idea that when you submit a paper, your research work, to a conference, you should also submit the code, the data, and the analysis code that produces the graphs of the paper, the whole thing, as an artifact. Then a special committee evaluates that artifact, and if they find it correct you get a badge that says not only is your research good, but the methodology you used is sound and we believe it is reusable. That was successful, and as I was giving talks about artifact evaluation and better experimentation, I met the founders of the start-up H2O, which was still being planned; they hadn't opened yet. They said, we're going to start this, wouldn't you want to spend a year with us, because we really need programming language people. I don't think they needed programming language people; they just needed people to look like a company. But it sounded like an interesting challenge, so I said, all right, fine, and I spent the first year of that company with them. The one thing I learned out of this was the business side: they were doing essentially distributed machine learning at that time, or distributed statistics, and what I found out was how few programmers out there could understand both statistics and computing. That was certainly a place where there was a need in the market for people with the right skill set.
After a year I left H2O and moved universities, to Northeastern in Boston. Since I had started being curious about this data science business, I thought: what is the best way to learn something? Well, it's to teach it. So I volunteered to teach their master's program in data science. The next year I wrote a grant proposal which was awarded, a very large grant in Europe, in Prague, which is my home country, and I also got a grant in the US, so I had enough money to build a research lab on two continents. We started really doing research in this area, and we picked two target languages: R, because I had some background, and Julia, because I had randomly met the creators of Julia in Boston and they looked like nice people. It started a long line of really exciting work, and I'm very proud of the research we did in this space; I think it was somewhat influential.

All right, so let's step back a little. I said we would start with a question: how do you pick a research topic? If you follow this story, there is a number of completely random events: meeting people at conferences, getting invited to give a talk, meeting the founder of a company. That's always the case. But underlying all of it is your curiosity, and you looking for what is a real problem: a problem that hasn't been solved, a pain point, something people feel unhappy about. That is the source of your ideas. As a student you often do not have that source; you use last year's papers and you say, sure, I can add 10%, and that'll be next year's paper. But eventually you realize it's all fairly self-contained and you're not making much of a dent. So go out and talk to people, or think in your head: what is the real problem? Is there something that I, as a computer scientist, can contribute to this situation? For me, the statisticians and the scientists working in data science clearly needed help on the programming language side, and that was my path into this area.

Now that we've got that first part out of the way, let me try to give you a definition for the term and some context. What do I mean by data science, and why is it a programming language problem? First, where does data science come from? If you try to look back, one origin story comes from where a lot of computer science started: AT&T Bell Labs, the home of UNIX, the home of C, and also the place where a statistician by the name of John Tukey was working in the sixties. Tukey wrote a book called Exploratory Data Analysis, and his vision, and remember this was the sixties, was that we need statistics and computing to merge, so you can seamlessly interact with data. We have to make it very easy for you to explore the data, to peek and poke at it, to see what story it is trying to tell you. That was his vision, and part of his group created a language for statistics they called S; I guess the name makes sense. Then, in the early nineties, an open-source version of that language was released, and that was called R, the letter before S. So where does the term data science come from? As best as I can tell, it comes from a paper one of my colleagues at Purdue, Bill Cleveland, wrote in 2001. Cleveland had been at AT&T Bell Labs, in the same team that was started by Tukey, and he had this idea that there was a need for people trained in this strange mix of statistics and computing. In order to do that, it couldn't be
done in a statistics department, because they really, fundamentally, didn't care about computing, and it couldn't be done in a computing department, because most were scared of statistics. There was this lack of a space where it could happen. So the thought he had was: let's invent a new science, let's call it data science. And that's how it was born. In his view, data science is a combination of many things we know and some things that may be less familiar to us. The basics are statistics and mathematics for the foundations; you use a lot of machine learning, for sure; but you also need to understand how to deal with data, which means everything from storing it to manipulating and representing it; you need visualization techniques to present your results; you need a bit of communication ability, some soft skills; and at the center, everything is tied together with software, so you have to understand software and programming. This is the reason I made the claim that in data science there is a core element of programming, and more often than not the things data scientists do end up represented, embodied, in a piece of software. From a programming language person's point of view, the questions are: is this piece of software good? Is it fast? Is it correct? Is it maintainable? All the questions you would ask about any other piece of software apply in this context too. But the interesting thing is that data science comes from that Tukey legacy of interaction: the software is typically not designed for being efficient, correct, or maintainable, and that's where the interesting tensions happen. So what I want to do next is make this more concrete, and what do I know about concreteness? Well, I show you code. I pre-recorded the video, because typing is hard when people are watching, of a little bit of data analysis. We are going to analyze the survival rate of passengers on the Titanic, the boat with Leonardo DiCaprio. The question is: does he survive? Can we model his likelihood? We are going to do this in R, and this is a screenshot of the RStudio environment. I hope it's legible, but I'll talk you through it; even if everything is a bit small, it should be fine. So, very briefly: what is the data?
The first thing we do is grab data: we read a CSV file with the passengers, and there's a whole slew of columns. Every time you see the data, you're manipulating it. The first thing I do is get rid of the columns that are not interesting, and you see the data simplified. Then there is this Survived attribute, which starts out recorded as a Boolean, and maybe I want to recode it as a yes/no variable, because that will be more legible. Then you see we have names, ages, the price people paid for their tickets; every time I make a change, the data is re-displayed. The next thing we do: embedded in the name of each passenger is a title, and maybe there's information in that title, so let's extract it. We write a little regular expression that grabs the title and puts it in its own column, and then we turn that title into a factor, an enumeration. For various reasons there are misspellings and various equivalent titles, so we collapse them into fewer categories, and that's our list of titles. We're reasonably happy, so now we can start doing some analysis. The first thing we do is plot the missing observations. Here we see the various columns, and the red marks missing observations: there are a lot of ages we don't know; people came on the Titanic and we don't know their age. Missing values are a pain, so we are just going to guess the ages, taking the median of the same title group, so all the men with the same title get that group's median age. Now we can plot title against survival rate. This tells us, for instance, that men have a low survival rate, because they gave their seats to Kate Winslet, and that royalty has a higher survival rate, because they were not as generous. The next step is to plot the price of your ticket against your survival rate; the peak here says cheap tickets have bad outcomes. And finally we do prediction: we use a random forest, predicting survival as a function of age, cost of your ticket, and title. Eventually this runs, and, ah, I had a typo up there, I had deleted the column; and then we see the accuracy is 85%. So that was a run through data science, and it was a very typical approach: you have data, you look at it, you change things, you play around. And what do you end up with? Fifteen lines of code that do what we wanted to do.
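For reference, here is a minimal sketch of roughly what those fifteen lines might look like. This is a reconstruction, not the demo's actual code: the file name, the Kaggle-style column names (Survived, Name, Age, Fare), and the simplified title regex are all assumptions.

    library(dplyr)
    library(randomForest)

    # read the passenger list ("titanic.csv" is a stand-in file name)
    titanic <- read.csv("titanic.csv") |>
      select(Survived, Name, Age, Fare) |>          # drop the uninteresting columns
      mutate(Survived = factor(ifelse(Survived == 1, "yes", "no")),
             # pull the title out of names like "Braund, Mr. Owen Harris"
             Title = factor(sub(".*, (\\w+)\\..*", "\\1", Name)))

    # impute missing ages with the median age of the same title group
    titanic <- titanic |>
      group_by(Title) |>
      mutate(Age = ifelse(is.na(Age), median(Age, na.rm = TRUE), Age)) |>
      ungroup()

    # predict survival from age, fare, and title (assumes Fare has no NAs)
    model <- randomForest(Survived ~ Age + Fare + Title, data = titanic)
    model   # prints the confusion matrix and out-of-bag error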
Is this code right? The demo went a bit fast, but I made a mistake at some point: on line 3 I flipped the values of no and yes, and then I caught myself and flipped them back. Who says this is the right interpretation? Who says there is not another mistake? Generally speaking, how do we know that the code is correct? So we did a study: we asked about 85 practitioners in industry and academia about errors in data science code. Here is some data from the study. Those people mostly use R; some are unlucky and use Python; others are doomed to SQL. None of these choices is great. I said what I said about R and I stand by it, but all of the others are strictly worse, so there is no salvation here. We split the process into six steps, formulation of the problem, data acquisition, data cleaning, exploration, modeling, and reporting, and for each we asked what kinds of errors people encounter. People said things like: in data cleaning, sometimes we have outliers, values that are completely out of range, and we don't handle them well; or we have missing observations and we don't check for them; or we have type conversions that are wrong, so you expect a number but you get a string, and that turns into 0 or something. Stuff like that, all sorts of errors. Now, the interesting question, and I will not go into the details of the errors, is how often they occur. We asked everybody how often errors occurred in their case, and basically every step of the path is error-prone. The green is academia, the blue is industry, and the percentage is how many people experienced errors in that step. Basically, there are errors everywhere. It's interesting that academia and industry have different kinds of errors. For instance, the formulation phase is a big problem for industry, and my hypothesis is that when you're formulating the problem you have to talk to your customers, who are outside your group, and miscommunications happen, whereas in academia you are your own customer. For statistical modeling it's the other way around: the industry folks are much better at it and the academics are bad, which is odd too. The reason is that for statistical modeling in industry they use four different techniques, and they stick to those four techniques, which have been known for 40 years, so you don't make that many mistakes, whereas the academics keep trying new algorithms and new methods, and of course that leads to mistakes. All right, so everything can be wrong, which is a bit scary, but it gets worse. Let me illustrate why. This is another example in R. I have a file with names, dates of birth, weights, and heights, and for some reason assume I want to plot, for each age, say people who are 55 or 56 like me, their weight per foot of height. I am 6.5 feet, so my weight per foot is quite a lot. Why would you want this? I don't know, but let's just say. And I want a linear regression; that's my expected output. This is the code that does it: it turns the date of birth into an age, it computes the weight per foot, it groups all the people that have the same age, it computes the mean within each age group, and then it plots. All good. This is cool and it works, and most people will check that it works by playing around with the data, making sure the numbers are good; then they are happy and they release the code on the world. Then new data comes, and things go wrong. Here's an example: imagine that when your data source doesn't know somebody's height, it encodes that by putting 0 as the height.
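To make the failure mode concrete, here is a minimal sketch of such a pipeline, with invented column names and numbers rather than the code on the slide:

    library(dplyr)

    # invented data: a "missing" height and a "missing" weight both encoded as 0
    people <- data.frame(
      weight = c(180, 150, 0),
      height = c(6.0, 0, 5.5),
      age    = c(52, 52, 57)
    )

    people |>
      mutate(wpf = weight / height) |>    # 150 / 0 silently becomes Inf
      group_by(age) |>
      summarise(mean_wpf = mean(wpf))
    # runs without a single error or warning: the mean for the age-52 group
    # is Inf, and the bogus 0 at age 57 would quietly drag a regression down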
Reasonable? Maybe. Perhaps I would say not, but whatever. The code runs fine and you get a graph; life is good. The graph is funny, though: your regression line is a bit shorter than it was before, and your last column, the one for 52 years, is a bit high. What happened? Well, it's not actually a bit high, it's infinity: we had one observation at 52 whose height was 0, so the weight per foot goes up to infinity and the regression breaks down. But no errors, right? You get a result, and if you're not paying careful attention, and I can assure you every company I've talked to is not paying careful attention, that will just go through, you move on, and life is good. Then another data point comes in, with a weight of 0, and now I have another funny graph: there's nothing at 57, and there should have been. Well, there is something, there is a 0 observation, but it just doesn't display, and it drags down my linear regression. Again, no mistakes reported. And what's the problem here? The language is not to blame; the programmer, maybe, I don't know. The problem is we don't know what is a good result or a bad result. We don't have a specification, we don't have types, we don't have anything to tell us yea or nay. What this shows is that a lot of the code that analyzes data is likely wrong, and you don't know it. So maybe we could just use programming languages, right? That's my field. What can we do? Well, what about Rust? There was a speaker earlier who mentioned Rust. Rust is a cool language, so it should be the solution; let's solve this with Rust. This is the code in R, which is fine, fairly readable code, you would perhaps agree: it groups people, it computes an average, it joins two tables, it computes some other thing. I don't do Python, never have, never will, but ChatGPT does, so I asked ChatGPT to translate this into Python, and you get this slightly less pleasant version. Why is it less pleasant? You'll notice there are many more strings, and whenever you are programming, a string is an opportunity for typos: you will never get a complaint about a string. So that's strictly worse than the previous solution. But Rust is a typed language, so, Rust to the rescue: again ChatGPT writes this for me. Is this readable? No. And it has strings everywhere, so it's no better than any of the others; it's just as bad. The problem here is that our current language technology doesn't know how to deal with data science. Data science has programming patterns that are very different from what we're used to, and we need to adjust our technology to answer those problems. That's why we're doing research. Now I will zip through five open problems that are worth thinking about. When I talk about programming languages, the way to think about it is that a programming language is not just a set of instructions; it's an entire ecosystem. If you take Python or R, the value of Python or R is not in the interpreter, it's in everything around it. When I say ecosystem, I mean the packages people have contributed, the user code that uses those packages, and the piles of documentation, the questions on Stack Overflow, and so on. If we want to improve the fate of data scientists from a programming language standpoint, we have to think about the entire ecosystem. We can't just think about the implementation, which affects maybe a hundred people; the user code is where things are happening, and R has 5 million users and counting. OK, so what kind of research can we do?
The first thing I'm interested in lives all the way at the top. I wrote this slide two or three years ago, before the AI boom, and the argument I was making then was: if you want to find an answer, you go to Stack Overflow, Stack Overflow knows the answer, you cut and paste the code, and you're good. Today I would say you go to ChatGPT, but fundamentally it's the same thing: ChatGPT uses everything we wrote three years ago to generate an answer. The problem we're facing is that these answers are stale. The libraries evolve, and you get code that doesn't match the current version of the library you're using, and therefore you're going to do something wrong. For instance, above is one version of one of those Titanic queries, but it's an old version; the current idiom is written like this. There are differences and there are commonalities, and I believe we can use machine learning combined with static analysis to figure out how to translate one into the other. We haven't made much progress on the machine learning yet, because that's still something I find a bit of a black art, but what we did build is a fairly large infrastructure for querying GitHub and searching for information on GitHub. That's been a couple of years of work, but now we as researchers have a much higher-level query interface to GitHub, and the hope is that it will let us get the data we need for the next steps. The next thing I wanted to talk about is the question of specification inference. I told you R doesn't have a type system, and it doesn't have any support for specifying what you want. In our example, we received a column called Survived that was populated with floating-point numbers, we received titles embedded in strings, and there were ages that sometimes had NAs. When we write code we have certain expectations: we may expect a certain format for these strings, certain properties for these ages, and if the data we get doesn't meet those expectations, wrong results will be silently returned. So how can we get specifications if programmers don't write them? The thought we had was to leverage two things. One is the fact that when you do data science you always have code and data, so you always have a runnable thing, and most of the time people have checked carefully that, for that one piece of data, the code does the right thing. So you can say: let's assume my program is correct modulo everything that could happen to the data; if you send me different data, it will behave differently. The project we're working on uses dynamic analysis: we run the program to extract a set of invariants, invariants that describe things that were true in that one good run of the program, and then we try to generalize those invariants to allow more inputs. It turns out that what we can output at the end is formulated in something called incorrectness logic; if you want to know more, ask me after the talk. The next bit I want to mention is fail-silent code: as I told you, code that fails without complaint. The reasons in R for things going wrong are three. One, R is a lazy language. What does that mean? It means that if I call a function, an argument is not evaluated until it is needed, and that causes a lot of confusion, because you don't know when things are happening. Two, it's an untyped language. And three, it's dynamic. All three together make errors hard to catch.
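To make the laziness point concrete, here is a small sketch, my example rather than the talk's, that you can paste into an R session:

    # the argument's side effect happens only when, and if, it is used
    f <- function(x) {
      cat("entered f\n")
      x                              # forcing x evaluates the caller's expression only now
      cat("leaving f\n")
    }
    f(cat("argument evaluated\n"))   # prints: entered f / argument evaluated / leaving f

    g <- function(x) cat("argument never used\n")
    g(stop("boom"))                  # no error: the faulty argument is never forced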
So here is what we started doing: can we reverse engineer the design of R, and redesign it, while keeping as many programs running as possible? The first question we asked ourselves is: can we make R not lazy? Why is R lazy? It is lazy because the designers of R used laziness to extend the language; laziness lets them build domain-specific languages in R. All of the syntax I've shown you before is really a DSL written on top of R. But the majority of code never needs the laziness. So in our research we analyzed 250,000 R programs, counted how many times the laziness was actually needed, and then tried to figure out whether we could tell ahead of time which is which. The answer is that most of the time we can. The outcome was that if we make one change, adding a declaration for which functions need lazy arguments, we can dispense with 90% of the lazy cases. This makes R easier to analyze, and makes us happier campers. Then, I told you that R is untyped. There is a concept called gradual typing in the research community, which I worked on for a few years; the idea of gradual typing is that you take a language that is untyped and you add types piecemeal. You have all used a gradually typed language if you have used TypeScript: TypeScript is JavaScript, untyped, with types added. So the question is: can we do the same for R? The challenge is, yes, but there is so much code written in R; how do we come up with the initial types for all the libraries? What we did is build an infrastructure that can take all of the packages in R, and there were 15,000 packages, 15,000 libraries, when we wrote this, run them, extract guesses for the types, insert runtime checks to make sure the guesses are right, and then run them again with different inputs. I'll jump to the conclusion: we found that 98% of the time we are right just by guessing, and 2% of the time we need help from the programmer. That's not a bad outcome. The next level of the language stack is the core libraries. One of the big problems you have in R and Python is the interoperation between two worlds: on top you have your high-level language, and below there is C, and any bug in C will break everything. Here we worked on implementing checkers, static analysis tooling that works at the LLVM level, to analyze the C and C++ code and try to find bugs automatically, and you can do some amount of that. The last bit I'll mention is at the core language level: we have worked on performance. This is a performance graph of 12 programs. Imagine a line at one everywhere; that's the performance of C. Blue is Python, and Python is about 50 times slower than C. Red is R, and it's hundreds of times slower than C. Not a great story. So why is it so slow? Well, R has been designed to be the worst case; R has been designed as an adversarial language for compilers: everything a language can do to break a compiler, it does. For example, imagine I have a function f that I want to compile, and f is not that hard: it takes its argument and adds it to a local variable b. How hard can it be to compile this function? Imagine that I have compiled it, and then I find code that calls f and passes an argument called bad.
What is bad? Well, it's another function. Do we care? Yes, because I told you R is a lazy language: when I see f(bad), it really means call f, and then, when you need to evaluate bad, evaluate it. And this is what bad does: it says, when you're called, there's a call stack, right, you're a function and you've been called by someone, so go to the caller, look for the variable b, and delete it. Now imagine I wanted to compile f: I have to assume somebody may delete my local variable b, and there may be another b in the program that I then have to bind to correctly. So it's an adversarial language, and we've been working on just-in-time compilers for R. We've written three, so that means we've failed at least twice. It's just hard. So that's what I've been doing. Data scientists are programmers; they need us; we enjoy them; so it's a nice union, and I've been having a lot of fun in that space. Thank you for your attention.

From the audience: you mentioned a CRAN mirror; what is that? Ah, CRAN, previous slide, sorry, previous, previous; apologies, that was too fast. CRAN is the GitHub of R. Statisticians are much better software engineers than we are: they have a GitHub where you cannot submit unless you submit code with tests, your code gets kicked out if the tests stop working, and they test every change to the system constantly. So, 15,000 packages, with test data, with proper tests that work: it's a treasure trove for people working in languages. You can push a button and run 15,000 programs. OK, well, it's been great; I'll be here for a bit, so come ask me questions. Thanks a lot.

OK, everyone here has enjoyed the talk. Professor Rohit will be felicitating Professor Jan with a token of appreciation. Thank you very much. Oh, and I have the microphone, so I should not leave with it. Thank you so much. So, while they plug things in: we had a very wonderful session from Professor Jan on data science, programming, and a lot of discussion about the Python language; sadly, I like it a lot in my work, it's fine I guess. Next, Professor Rohit Gurjar, an upcoming rock star in our department, will be joining us for a talk. He is an assistant professor in our department, and his research focuses on computational complexity, algebraic algorithms, and randomness. Today he'll be talking about QR codes and blockchain verification. Personally, I'm very interested in one example in his talk, about error-correcting codes in QR codes. You might have used QR codes tens of thousands of times nowadays, I think mostly because of UPI; so what happens when there are errors in those QR codes? He'll be talking about that as an example. His talk will mostly focus on the role of algebra in computer science. Just a bit of a delay for some technical setup; sorry for the delay.

OK, so this title looks different from what is in the schedule and the abstract. What I wanted to talk about is showing some usefulness of algebra in computer science; it is much more useful than we tend to think. I wanted to show two examples: one is QR codes, and the other is fast verification of computations, which, among other things, is used in blockchains. But then I realized that in just half an hour I cannot really do both examples, and I always like to show some mathematics, so I will stick to one of them: I want to talk about how algebra is used in QR codes. First, what do I mean by algebra? Algebra is just the basic algebra we learnt in school: addition,
multiplication, division, polynomials, their roots, evaluating a polynomial at points, and all those things. Another thing that is quite useful is modular arithmetic, which we don't really learn in school, but it's not that hard. For example, if it's 3 pm right now, then after 10 hours it will be 1 am. How do I find this? I just take the result modulo 12; that's modular arithmetic. Similarly you can do it with multiplication. The amazing thing is that everything you know about the usual addition and multiplication also works with modular addition and modular multiplication. Why is it useful, why do we need it? You will realize by the end of this talk. So, algebra has a wide range of applications in computer science. For example, it's used in data compression. In any kind of communication happening from one place to another, you can lose data; there can be errors, so you need some mechanism to correct errors, and the fundamental tool we have for handling errors comes from algebra, from polynomials. Also cryptography, that is, reliable and secure communication: in cryptography too, a fundamental tool is algebra. Then there is the efficient verification of computation I mentioned, one of the fundamental things in blockchains: again, it's based on polynomials. Some people are now also using algebra for software verification. So algebra has lots of uses, and today I am going to show you one. Before I go to QR codes, let's see some quick, very basic facts which I am sure you have seen in school. Here is a basic fact from algebra: any polynomial of degree d has at most d roots. Does everyone know this? Everyone has seen it. It looks like an innocent statement, but as you will see, it has many, many applications in algorithm design, in complexity theory, in computer science in general. Equivalently, you can restate it as: if I give you d + 1 points in the plane, there is a unique degree-d curve passing through them. It's like saying that if I take 2 points in the plane, there is a unique line passing through them, and if I take 3 points, there is a unique degree-2 curve passing through them. Why is it equivalent? You can think about it; it's not very hard.
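As a quick sanity check of that fact, here is a two-line illustration in R with made-up points: solving the Vandermonde system yields the coefficients of the unique quadratic through three given points.

    x <- c(0, 1, 2)
    y <- c(3, 5, 9)
    solve(outer(x, 0:2, "^"), y)   # coefficients 3, 1, 1: the unique curve 3 + x + x^2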
OK, so now let me come to QR codes, and let's see why we need algebra at all. You use QR codes every day; you know what their applications are. So what is non-trivial about them? Clearly black and white represent 0 and 1; we know every piece of information can be represented as bits, so of course you can convert any piece of information into 0s and 1s and then into black and white pixels. The non-trivial part is this: you might have observed that many times when there are obstacles, when some part of the code is hidden, or even erased, you scan it and it works. You must have experienced this: QR codes can be read even when they are partially occluded or erased. Sometimes part of the code is torn off; and sometimes it's torn off and behind it there is an older QR code. There is a crucial difference between the two cases. In the first example I showed, some part of the data is deleted, erased. In the second, it is corrupted: the data that should have been there has been replaced with something else, which is wrong. That is a different challenge, a harder challenge than deletion; it's worse than deletion. And even this one, if you scan it, will work. This is maybe not a very common example, but there are very common scenarios where errors appear: maybe there are obstacles due to which you read zeros as ones and ones as zeros; in general, in any communication channel, your messages pick up errors. So how do you recover your original message when there are errors? That's the fundamental question. We want to send some data, and whatever medium we send it through, some part of the message will eventually get corrupted: bits will get replaced with wrong bits. How do you get the original message back, and how efficiently can you do it? That is the main challenge. At first you might guess, since I showed you that part was hidden yet we could still read it, that the same information is simply copied multiple times. For example, say we make four copies of the same information and place them in four blocks; then if any one block is hidden, you can still recover the information. That would be a reasonable guess, and we could use that mechanism, but there is a problem with it. Do you see the problem? If you just keep four copies, it is possible that the same bit, the same part of the information, gets deleted from each of the four blocks. Then how do you recover it? So we need much stronger guarantees, and I'll show you what guarantees QR codes give. For example, if I delete these four spots from the four blocks, the scanner will still be able to read it; or if I hide these parts of the QR code, you will still be able to read it. There are different kinds of QR codes, with different levels of error correction; what I'm showing you is level H. What does level H mean? It means you can correct even if, from any 30% portion of the code, you delete or modify the content: you can replace black and white pixels with some other black and white picture, and the scanner will still be able to read it. You can do this to as much as 30% of the code. Here is another example where it works; some other part, less than 30% of the code, is hidden here. Does someone want to try scanning it and see what you get? Yes, you get the website, so it works; I'm not just claiming all these things. Now try this one, where I have just enlarged the hidden part: it doesn't work, because now it has gone beyond 30%; more than 30% of the code is hidden, and that's why it fails. So what is the magic here? Let me first go over some basic facts about QR codes.
Let's look at the pixel pattern. QR codes come in many variations, as you must have seen. This one is a grid of 33 by 33 pixels; the smallest is, I think, 21 by 21, and then the size increases in steps of 4, so 21, 25, 29, 33, and so on, and there are 40 versions. This is version 4 that I'm showing you. At 33 by 33 you have 1089 pixels in total, and each pixel is one bit, so you have that many bits, which is roughly 136 bytes. Now, not all of this is information. The red circles and the yellow part you see are fixed patterns that are always there, because they are used for finding the right alignment: sometimes you scan the code at an angle, there are lots of variations, so we need some landmarks, or whatever you want to call them, so the scanner knows where the code is and at what orientation. So there are some fixed pixel patterns, and there are also some pixels, shown here in green, I don't know if you can see them, a small part of the QR code that encodes format information. There are different formats of QR code; for example, the error correction level is one of the things stored in the format information. After accounting for all of this, out of the 136 bytes you have roughly 100 bytes left where you can actually store what you want; the other 36 or so bytes go into the fixed patterns and format information. So we can store 100 bytes of data. Now, the website you just scanned has 29 characters, which means 29 bytes, so we are storing 29 bytes into 100 bytes of space. Using the naive idea, you could have created 3 copies of those 29 bytes within the 100 bytes, but as I told you, that will not work, because what if the same information is deleted from each of the 3 copies? So what is the guarantee for a level-H QR code? The guarantee is that it will work even if you delete any 32 bytes of those 100 bytes, or even modify them, not just delete: replace them with wrong bits. Even then you can get back the original information. This is important: any 32 bytes, from any part of the code, can be deleted and it will still work. Doesn't that sound like a strong guarantee? Do you see a solution? How can you build this? [From the audience: store different information, at different intensities, in the three major squares; those squares tell you the crux of where the QR code is.] Those three big squares only give you the position and orientation; the rest of the code is just 0s and 1s. [But what if those three squares carry more, at weighted intensities?] No, no: those squares are just for position, they don't carry any data. [I would store something on them.] Oh, you cannot cheat; you are cheating.
OK, so here is the challenge again. In fact you can store not just 29 but 36 bytes in this version of the QR code. So you have 36 bytes of information, and you are allowed to store it using up to 100 bytes: you can expand it, you can make copies of some parts, whatever you want to do, as long as you stay within 100 bytes. The guarantee I need is that if I delete or modify any 32 bytes out of those 100, I should still be able to recover my original message. Is the challenge clear? At least be clear on the challenge, because at the end you have to verify that we have solved it. As I told you, simple duplication of the data will not work. What works is coding theory, which involves algebra and geometry, and I'll tell you the basic idea of how it works. Any questions so far? [From the audience: are there different challenges here compared to regular coding in networks?] No, it's standard coding theory; I'm talking about the 1960s. [Another question: isn't there an image processing step first?] Of course, that's true: you have an image, and you first have to get the 0/1 pixel pattern out of it. That is the first step; I'm only telling you the later part of the story. OK, so what is the main idea? The main idea again comes from our basic mathematics. Here is a toy case, a simple example. You take whatever message you have, this website URL, and split it into two parts. These two parts are 0/1 bit strings, and every 0/1 bit string can be thought of as a number, an integer. So any message can be split into two parts, giving two integers; let's say the two integers are y0 and y1. To be clear, I'm just looking at the bit representation: I started with the website URL, split it into two parts, and read each part as a number; I have to tell you the forward direction first. Now let's plot these two numbers as two points in the 2D plane, (0, y0) and (1, y1). I just put these two points on the plane, and then I pass a line through them; through every two points there is a unique line. Remember, my message is y0, y1; that is the website URL. Now what I'm going to do is take a third point on this line, say (2, y2): I look at the value the line takes at x = 2 and call it y2. Then I append y2, so the encoding is y0, y1, y2. The original message was y0, y1, and the encoding appends the third point. And that already has the magic. Do you see the magic? The magical property is this: if, out of these three values y0, y1, y2, I delete any one, say y1 gets
deleted, can you still get the line? You can, because out of the three points one is deleted, you still have two points, and two points give you the line. And if you can get the line, you get the message, because the line is the message; we are representing the message as the line. That's the main idea. But this is not the whole story, because, as I told you, it's not just deletion; we can also get corruption. What if y1 is not deleted but modified? Say y1 moves up or down; say it moves down. Now it's corrupted, and I don't know which two points are correct. Suppose someone guarantees me that at most one point is incorrect, so at least two points are correct. [From the audience: would the same kind of technique work if we introduced colors, for more security?] It could probably be generalized; if you can read colors and extract information from them, then perhaaps, Sharath can answer that, I don't know. OK, so now you have these three points; you don't know which one is incorrect; you know two are correct, but not which two. Can you get the line back? The answer is no: these two points make a line, these two make another line, and these two make yet another line, and I don't know which line is correct. So deletion works, but corruption doesn't. What is the next obvious idea? One more point. Actually, I'm going to add two more points, so now my message y0, y1 is encoded as five points. Now, if you remember, my challenge was that roughly 30% of the data may be corrupted; out of five points, say that's two points. So say two points get corrupted: y1 moves, y3 moves. Suppose someone guarantees that only two are corrupted, so I know three points are correct, but I don't know which ones. Can you recover the line? Definitely not. Why? Because there can be another line that passes through three of the points, and that could have been the correct line; we can't tell. So, the next obvious thing: add one more point, six points. Again, my goal is 30% error; out of six points, say two get corrupted, say y1 and y3: y1 moves down, y3 moves up. Someone guarantees that only two are corrupted, so at least four points are correct; I don't know which four. Can you recover the line now? Yes. Why so confident? Can anyone say why we can recover the line? The point is that y0, y2, y4, y5 lie on one line, and there is no other set of four points among the six that lies on a single line: any other line can pass through at most one of the four correct points, because two distinct lines meet in at most one point, plus the two corrupted ones, which makes at most three points. So I can uniquely determine the line, and that's why six points are enough. I converted my message into two points, the two points into six points, and now I have the 30% guarantee: any 30% of the data can be corrupted and I can still recover my message.
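Here is a minimal sketch of this toy scheme, my illustration rather than the talk's code: encode two numbers as six points on a line, corrupt any two, and recover the message by brute-force search for four collinear points, which is exactly the uniqueness argument above.

    encode <- function(y0, y1) {
      x <- 0:5
      y0 + (y1 - y0) * x            # the unique line through (0, y0) and (1, y1)
    }

    decode <- function(y) {         # brute force: find four collinear points
      x <- 0:5
      for (s in combn(6, 4, simplify = FALSE)) {
        fit <- lm(y[s] ~ x[s])
        if (max(abs(resid(fit))) < 1e-8)   # all four points lie on one line
          return(unname(coef(fit)[1] + coef(fit)[2] * c(0, 1)))   # (y0, y1)
      }
      stop("too many corrupted points")
    }

    msg <- encode(42, 7)            # message (42, 7) becomes 6 points
    msg[c(2, 4)] <- c(99, -3)       # corrupt any two of them
    decode(msg)                     # recovers 42 and 7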
OK, that's the main idea. So now my claim is that the challenge is solved: we split the message into two blocks, convert them into six blocks, these six points, so the two blocks expand three times, and the guarantee is that if any two blocks are deleted or modified, I can recover. That's roughly the 30% challenge. So is the story over, or am I somehow fooling you here? Say my only goal was 30% error: is the 30% challenge solved? Exactly: these blocks are big blocks, so it could happen that one bit from each block gets corrupted. Then I have not really solved the challenge; I am just fooling you. What I am guaranteeing is that I can handle any two corrupted blocks, but what if one bit in each of the six blocks gets corrupted? Then I have no way of recovering. What we need to do is use small blocks: if you have a big message and you split it into only two blocks, you have really big blocks, and then it is quite possible that something in each block gets corrupted, and I can do nothing. [On time: I started early, so I think I will take until 3:30.] OK, so is the challenge solved or not? Not solved; we still need to do some work. The point is that you cannot have these large blocks; you should work with small blocks, ideally just one byte, so each block is one byte. So what do we do? Originally my message has 36 bytes; I was working with that example. Let's split the message into 36 blocks, so each byte is a block. Now I can visualize them as 36 points in the plane; here are, let's say, 36 points. What was my trick? To pass a line through them. Can I pass a line through 36 points? What can I pass through 36 points? A curve of degree 35. So that's what I'll do: I pass the unique degree-35 curve through them. Then, as before, I take other points on this curve; I take 64 other points, because I want to get to 100 in total: remember, the challenge was to go from 36 to 100. [From the audience: isn't this curve not unique? Doesn't it depend on the order?] The degree-35 curve is unique: when I said I had 36 blocks, I had 36 numbers y0, y1, y2, and so on, and I fixed the points as (0, y0), (1, y1), (2, y2), and so on; that is just a canonical choice, so we don't worry about it. So we have taken 64 other points, the red ones, and now here is what you need to show; you can go home and prove it: if you delete, or not just delete but modify, any 32 of these 100 points, there is only one set of 68 points that has a degree-35 curve passing through them. (You can see why: two distinct degree-35 curves agree on at most 35 points, while any set of 68 of the 100 points must contain at least 36 uncorrupted ones.) It's just like the statement I made before: I had 6 points, you modify 2 of them, and there is a unique set of 4 collinear points. It's the same statement, just with larger numbers, and the same kind of proof, well, maybe not exactly the same kind of proof, but
you can try. OK, so now the challenge is solved: each block is one byte, I had 36 bytes, I converted them into 100 bytes, and if you delete or modify any 32 of those bytes, I'll be able to recover. As I was saying, this is from 1960; it is called the Reed-Solomon code, and it is one of the most widely used codes, used everywhere in all kinds of communication. One thing I should have told you: why do we need modular arithmetic here? All these polynomials, roots, and evaluations have to be done modulo some number; we can't just work over the integers. Do you see the problem with the integers? Too large, right: if you have a degree-35 polynomial and you plug in some integer, you get really large values, so it will overflow one byte, it will not fit. If you want everything to fit into one byte, you work with modular arithmetic: you take everything modulo some prime, and the magic is that it still works. Even modulo a prime, a degree-d polynomial has at most d roots, and there is a unique degree-d curve passing through d + 1 points; all of this still works with modular arithmetic.
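As a tiny illustration of that point, assuming the prime 251 so that every value fits in a byte (my example, not the talk's): Horner evaluation of a polynomial with a modular reduction at each step never overflows.

    horner_mod <- function(coefs, x, p = 251) {
      acc <- 0
      for (c in rev(coefs)) acc <- (acc * x + c) %% p   # reduce at every step
      acc
    }
    horner_mod(c(3, 1, 1), 10)    # 3 + x + x^2 at x = 10 is 113, already below 251
    horner_mod(c(3, 1, 1), 200)   # 43; without the modulus this would be 40203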
As I was telling you, Reed-Solomon codes and other kinds of codes are used in all kinds of communication. When the Mars rover wants to send data to NASA, it uses some error-correcting code, maybe this very one, Reed-Solomon. When you want to store huge amounts of data across many, many hard drives, hard drive failure is common, so some part of the data will be lost and you want to recover it; there, just making multiple copies of the data is not efficient, as I showed you, so you need error-correcting codes. What other questions do people study? This was from 1960, but afterwards people studied many kinds of questions. For example, you want to do all of this very fast; that's the obvious question in computer science, whatever you are computing you want to do it fast, and a lot of research went into that. Another challenge is local error correction: in what I showed you, you could recover the correct points, but you had to use all the points. If some part of the message is corrupted, can we recover it using only local information? That is what local error correction is. So there are all kinds of nice questions here, and people claim that these error-correcting codes save millions of dollars; maybe it's true, I'm just careful with such statements. In data storage, when people want to store huge amounts of data, they need error-correcting codes, and it supposedly saves them a lot of money. Then there is the other part, which I am not going to talk about: efficient verification of computation. I told you it's one of the crucial components of blockchains: someone does some computation, and other parties need to verify that they have done what they claim to have done. The challenge is that you want to verify it much faster; you don't want to redo the same computation yourself. One way to verify is to do the same computation yourself and see if the answer matches, but you want to be much, much faster, exponentially faster than the computation time. Verification should be exponentially faster than computation, and that challenge is solved by polynomials and some tools from cryptography. OK, so there are lots of applications of algebra, and some years ago I did not know this, but it seems there are many, many jobs: if you specialize in algebraic things, if you do a PhD, you will have lots of options in industry. So, that's all; thanks.

If anyone has any questions, please raise your hands. [From the audience: you talked about storing bytes of data; the message is split into two parts, both parts are encoded as numbers, and you said the line between those numbers is the real message. Can you explain?] You can imagine it that way, but the real information is the two numbers y0 and y1; I was just saying that you could think of the line as the data, because once you get the line you can get back y0 and y1. [In the original example with six points, if any two are corrupted, how do we know which points are the original ones and which were added to create the line?] My claim was that there is only one set of four points lying on one line, so I can recover the line; once you know the line, you look at its value at x = 0, which is y0, and its value at x = 1, which is y1, and y0, y1 is the message. [You displayed a website of only 29 bytes; what if we have data of over 100 bytes?] Then you use some other version of the QR code, some larger version, or you compromise on the error correction level: instead of 30%, maybe you are OK with 15% error correction. So if you have a lot of data, either you enlarge the QR code or you compromise on error correction; there is a tradeoff. [Why is the limit 30% for corruption; is there a reason?] The limit is actually 50%, I think; I don't know why they stopped at 30%. [What if errors are random rather than adversarial?] That is another model of error corruption: I was assuming the real worst case, that any 32 bytes can go corrupt, but you can instead assume corruption happens randomly with some probability; random error correction is another area in itself. [What about security: if there is more than 30% error, could it be interpreted as a wrong message?] Yes, it could; the security aspect we have not looked at here. [You said 30% of the data can be deleted or modified and still recovered; is there a mechanism to detect that only 30% was modified? If 70% were modified, could the decoder consider that 70% correct?] There is no mechanism like that; for example, everything could get corrupted and there would be no way to find out. But again, you are cheating; I am joking. OK, I think we are done with the questions. To thank Professor Rohit, we will be gifting him another token of appreciation; can we have a round of applause, please? Thank you. Because you said geometry: Varsha,
would you call it geometry or not? Yes, and I think everyone agrees with me: that was quite an interesting talk, and we got to see some visuals too. When everyone thinks about linear algebra it's usually non-visual, but I think we got a few geometric interpretations as well, which was quite nice, and I got my curiosity about QR codes satisfied. You scan QR codes and you see that you can read one with only part of it, maybe three quarters, visible; I was curious why and how. That's a rough number, but still. So, on the agenda next we have a tea break; you should all stretch your legs and get some fresh air, and then we'll reconvene for the poster session and also a panel talk on whether you should proceed towards a PhD or not. For the tea break, please go to the new CSA building; the poster session is there as well. We'll reconvene here for the panel talk at 5 pm.