So I've been working in academia for a very long time, and this talk was inspired by someone I mentored this year. It's basically this: science is built on the shoulders of giants, right? And those giants look scary, like, how are we ever going to do it? But sometimes giants are just windmills. Granted, windmills who speak English and publish way too many papers, so it's basically impossible to keep up. I hope this story sounds familiar, because for the next half hour, even a little bit less, I'm going to be your squire, and I'm going to try to give you the tools to read papers properly.

There are two parts to this talk. On one hand, I will try to give you tools for reading papers, and every time I bring this up, a co-worker says "but I know how to read papers", so let me show you how I do it. The second part is the more computer-science part, where I will give you some tools to implement a paper. In this case we're going to look at the very, very famous "Attention Is All You Need", because, as a co-worker told me, we all use transformers, so yeah, let's do it.

So, how to read academic papers. There are some tools that I think everyone should know about for reading academic papers. And as I said, it's not "yeah, I read papers: I print them and I read them". No, don't do that. First of all, and I think the most important tool, are repositories, because we've all been there: we have hundreds of tabs open with different papers, blog posts, YouTube videos, podcasts. These days science is distributed across multiple mediums, and we have everything open and never have time to read it. So we need something to properly collect and categorize everything. The main thing these tools need is to be multi-platform: I need to be able to be on my phone, see a paper, save it, and maybe label it. There are the old-school tools, but I really, really like Paperpile. I don't know, it feels like a pile of papers I'm never going to read, but I'm trying. Then, note-taking. Note-taking can be pen and paper, but if you like digital tools, and being able to make nice summaries and share them across the internet, I find that GoodNotes and Notability are very good tools. And finally, organizing: sometimes you read a paper because you think it's interesting, but it might not be interesting at that moment, and you need to be able to come back and remember that paper. Notion, Paperpile, and Obsidian are very good tools for doing that.

And that's basically it, the tools you need for reading papers. Almost. I have a couple of bonus tracks. The first bonus track is tools for discovering new papers. Granted, Twitter is my main way to discover papers; academic Twitter is there, and you get new papers every day, but that might skew you towards the big labs and big corporations, so Research Rabbit and Litmaps will find you related papers and how they're linked to each other. And then, for my neurodivergent family, there's Bionic Reading, which is something very, very cool and will help us read better. Okay, so we have the tools, but how do we read? Hopefully the repository has helped massively, so you don't have all those tabs open and you should be able to read.
Cool, so now what? Now we sit at a desk, have like 200 cups of coffee, and read through them cover to cover? Well, no, because that's infeasible. Please be kind to yourself: no knight is able to read everything. So I do this thing, the three-pass approach.

The first pass is me trying to figure out: is this paper relevant? I try to be brutal; I'm not going to spend more than fifteen minutes doing this. I read the title, I read the abstract, skim a little bit through the introduction, maybe read the discussion, and that's it. Nothing else.

Then comes the moment where I know the paper might be interesting. I might start brewing a cup of coffee, because I'm going to need it. Again, no cover to cover: just the introduction, the contributions, and the limitations. My favorite authors always have this last paragraph that says "these are the contributions", and they itemize their contributions and the limitations. That is fantastic. Authors in the room, please do that; I will be very grateful. Then I read the figures and results sections. Depending on how expert you are on the topic, this might be more or less useful. And yeah, skim through the rest of the paper, get more or less the idea of what the paper is about, and write a summary. Granted, it's not going to be the best summary; it's just "these topics are discussed in this paper". Cool, we're good.

And then the next pass is when we properly need multiple cups of coffee and sit and read it cover to cover. There's no shortcut here; we need to read it properly. That said, you don't need to read it alone: find help, find colleagues. Asking for help is a sign of strength, not weakness. Something that I also do when I'm reading the paper cover to cover is add new papers to the repository, so I know where to follow the lead next time. As for the summary, at this point you have a way better idea of what you're going to talk about.

A brief note on how to highlight papers. I do these three passes in stages, which I think is pretty common. Maybe not. Sometimes I read a paper but I'm not going to implement it, or maybe it's going to be almost a year before I read it again. So I have this semaphore thing, like a traffic light: I highlight in red the problem the authors are trying to solve, in yellow the hypothesis, the method or methodology the authors are proposing, and finally in green the evidence that backs up the hypothesis.

Okay, and how does this look in practice? This is my first pass: very few things; basically the only things highlighted are the things I read. This takes fifteen minutes, for real. This is the second pass, where I highlight more things and go to the figures; here I'm paying more attention, and I have an idea of what the transformer looks like. And this is the third pass. Note this is not "Attention Is All You Need"; I wanted to show you a paper that was published very recently, "The Values Encoded in Machine Learning Research", because we think machine learning is neutral, and hopefully from yesterday's keynotes you know that it's not. So here you can see, oh, there's a pointer, okay, here you see that I'm doing annotations, things happening where I'm reading. And when I finish, I go back to Paperpile and I write the summary. That's it. Now I have a very good idea of what the transformer looks like.
What problem they tackle, what methodology they follow, what proofs they have. Okay. So if I need to implement it, I know how to do it. Great. And now, let's implement an academic paper. But before I jump into that, let's have a quick sync on what transformers are. I feel like everybody knows what a transformer is, but indulge me and let's go through it together.

So a transformer is actually a family of neural networks. It looks more or less like this diagram, the original diagram: it has an encoder branch and a decoder branch. And it's very, very popular because it allows parallelization of some operations. Recurrent neural networks were very slow because you need to go through the whole recursion; here we have some parallelism, which allows us to train faster. Then we have the new magical block, attention, which allows us to capture relationships, long-distance relationships, in a sequence. And finally we have positional encodings. Positional encodings are very important because they let us know the position of a token in the sequence. Because if you think about it, it's not the same to say "no" at the beginning of a sentence or at the end. It might change the meaning.

Okay, so, positional encodings. As I was just saying, sequence models need to understand the order of the sequence. The authors use a sinusoid to encode the position, so every token at every time step has a deterministic vector, and they basically sum it: the word embeddings are summed with the positional embeddings. And that's it, that's positional encodings. You're more than welcome to try other positional encodings; there's no rule, but, I don't know why, well, you know, inertia: we all use sinusoids.

And what's the attention block? The attention block is where you're trying to capture the similarities between two words in the sequence. This is very easy to understand when you're talking about translation. For example, the word "windmills" is translated into Spanish as "molinos", and when you're doing translation you want to be able to know that the words might not be aligned, might not be in the same place, but the word "molinos" is very tied to "windmills". You need to have that relationship, and that's what attention is trying to capture. So we have the embeddings plus the positional encodings. We project that into a smaller vector space. We do the dot product between the queries and the keys; if the queries and the keys belong to the same sequence, that's self-attention, and if the queries and the keys belong to an input and an output sequence, like in the translation case, that's cross-attention. Okay. Then we apply a non-linear projection, the softmax, and we compute another dot product, with the values, and we get the attention outputs. And that's all the magic, all the sugar, spice and everything nice that makes transformers.

So, basically, in summary: we have an encoder branch that has this multi-headed attention, because the mechanism I just showed you is repeated multiple times. Then we have the decoder branch, which has multi-headed attention and cross-attention, feed-forwards, and layer-normalization layers. The normalization layers are there because we are also adding in the residuals; those are the little lines that connect in between. And that's it.
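To pin that recipe down, here is a minimal sketch in jax.numpy (the library we are about to meet) of the two pieces just described: sinusoidal positional encodings and scaled dot-product attention. This is my reconstruction, not the talk's code: the names `sinusoidal_positions`, `attention`, and the projection matrices `w_q`, `w_k`, `w_v` are mine, and it concatenates the sines and cosines instead of interleaving them per dimension as the original paper does, which is a common simplification.

```python
import jax.numpy as jnp
from jax import nn

def sinusoidal_positions(seq_len: int, d_model: int) -> jnp.ndarray:
    """One deterministic vector per time step, summed onto the word embeddings."""
    pos = jnp.arange(seq_len)[:, None]             # (seq_len, 1)
    i = jnp.arange(d_model // 2)[None, :]          # (1, d_model // 2)
    angles = pos / (10000.0 ** (2 * i / d_model))  # one frequency per dimension pair
    # The paper interleaves sin/cos; concatenating keeps the same information.
    return jnp.concatenate([jnp.sin(angles), jnp.cos(angles)], axis=-1)

def attention(x_q, x_kv, w_q, w_k, w_v):
    """Scaled dot-product attention.

    Self-attention when x_q and x_kv are the same sequence; cross-attention
    when the queries come from one sequence and the keys/values from another.
    """
    q = x_q @ w_q                             # project into a smaller vector space
    k = x_kv @ w_k
    v = x_kv @ w_v
    scores = q @ k.T / jnp.sqrt(q.shape[-1])  # dot product of queries and keys
    coeffs = nn.softmax(scores, axis=-1)      # the non-linear step
    return coeffs @ v                         # second dot product, with the values

# Usage: sum word embeddings with positional encodings, then attend.
# embeddings: (seq_len, d_model); each w_* matrix: (d_model, d_head).
# x = embeddings + sinusoidal_positions(*embeddings.shape)
```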
And now, let's do the quickest introduction to JAX. This is not a prescription: there's a myriad of tools out there that you can use, and by all means, pick the best one for your needs. Having said that, we actually love JAX. So, why do we love JAX? JAX is a NumPy-like library that runs on accelerators. That means that if you know NumPy, you kind of know JAX as well. Well, kind of; I've been thinking about that for a very long time, and it's not completely true. And the good thing about JAX is that it has these transformations, which I'm going to explain in a minute.

This is the promised line: you have the predict function, which takes the inputs, computes the dot product, adds the bias, applies a nonlinear function, and then you compute the mean squared error. And basically, I switch NumPy for jax.numpy, and yeah, that's it. That's brilliant. So, if it's exactly the same, why make the change? We make the change because we have transformations. grad and jit are going to be the most common transformations. grad basically takes a function and returns the gradient of the original function. If you want to get the value and the gradient, because you might want the gradient but also to compute the loss, you have value_and_grad. And then you have jit, which is just-in-time compilation: it traces your program and writes an intermediate representation, a jaxpr. The tradeoff between flexibility and speed is the level of the tracer, the ShapedArray: we keep the shape of the object, but we don't keep the values. So you can operate on different batches, but all the batches need to have the same shape. Cool. Oh, I forgot to show you, it's here. Look how easy: grad and jit, and you have the gradient function, traced. Brilliant. And then you have vectorization and parallelization: vmap and pmap. You can see that vmap and pmap are quite similar; vmap works across batches and pmap works across devices, which allows us to do things like parallel gradients. We could not train the big, big neural networks we are training at the moment without them.

Okay, so let's implement a transformer. We have been working with JAX for a long time. JAX is function-oriented and has tons of boilerplate, and we don't like writing the same thing over and over again. So what we do is use this very nice library, Haiku, which allows us to write object-oriented models: you have the Haiku module, which builds the module and holds some parameters and the function to apply to the inputs. And these models need to be initialized, because we need to somehow go from regular functions to pure functions. So we need hk.transform, which gives us init and apply, the pure versions of the function.

Okay, now we are here for real. Brilliant. So now we have the embedding block, and the embedding block has the positional encoding and the word embeddings. It's not really word embeddings, because we all use SentencePiece, but yeah. And you can see that the most common modules are already there: you have hk.Embed, and since we have the parameters, we can say, hey, get the positional embedding, and then we sum them, and that's it. We have both of them and we're good to go.
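Here is a compact sketch of both halves of that story: the NumPy-to-jax.numpy swap with grad and jit, and the Haiku module-plus-transform pattern, including an embedding block like the one just described. Hedged: this is my reconstruction of the slide's description, not its code; `predict`, `loss_fn`, and `EmbeddingBlock` are my names, the vocabulary size and dimensions are arbitrary, and I'm assuming a learned positional embedding fetched with hk.get_parameter, which is how the talk describes it (the sinusoidal version from earlier works too).

```python
import jax
import jax.numpy as jnp
import haiku as hk

# Plain NumPy-style code with np swapped for jnp: dot product, bias,
# non-linearity, then the mean squared error.
def predict(params, inputs):
    w, b = params
    return jnp.tanh(inputs @ w + b)

def loss_fn(params, inputs, targets):
    return jnp.mean((predict(params, inputs) - targets) ** 2)

grad_fn = jax.grad(loss_fn)                           # gradient of the loss
loss_and_grad = jax.jit(jax.value_and_grad(loss_fn))  # loss + gradient, compiled

# The Haiku side: an object-oriented module holding parameters...
class EmbeddingBlock(hk.Module):
    def __init__(self, vocab_size, d_model, max_len, name=None):
        super().__init__(name=name)
        self._embed = hk.Embed(vocab_size, d_model)   # (sub)word embeddings
        self._max_len, self._d_model = max_len, d_model

    def __call__(self, tokens):
        # Ask Haiku for the positional embedding parameters, then sum.
        pos = hk.get_parameter(
            "positional_embeddings", [self._max_len, self._d_model],
            init=hk.initializers.TruncatedNormal(0.02))
        return self._embed(tokens) + pos[: tokens.shape[-1]]

# ...made pure with hk.transform, which hands back init and apply.
forward = hk.transform(lambda tokens: EmbeddingBlock(32_000, 128, 512)(tokens))
rng = jax.random.PRNGKey(0)
tokens = jnp.zeros((4, 16), dtype=jnp.int32)
params = forward.init(rng, tokens)        # init builds the parameters
out = forward.apply(params, rng, tokens)  # apply is a pure function of them
```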
The attention block we were just talking about is this thing. And again, we do some housekeeping to know if it's self-attention or cross-attention, and then we call the built-in module, because multi-headed attention is such a common model that it's already implemented. Yeah. And also, very important: please remember that you need to add causal masks in the decoder, because you don't want to learn from something that you should not have seen at the current time step. It's obvious, but it has led to a lot of bugs. The feed-forward layer basically initializes the variables, computes the linear layers, and returns the result. Very, very common things, so it's really simple to implement; we don't need to do a lot.

And here's the whole transformer. Maybe this is too small. Oh, you're seeing it? I thought you were seeing the slideshow. Brilliant. So yeah, this might be too small for you to read, but what I want you to see is that it's very, very similar to all the other toolkits. Basically, you have the attention block, a dropout, and you can see here the layer normalization: it takes the attention output and the residual, something that has not passed through the attention, so if something was meaningful, we keep it. And then the feed-forward block: a dense layer, dropout, and layer norm. And we repeat this multiple times. Cool.

And then we have to build the forward function. This is slightly different from other toolkits, but it's not something completely insane: it's basically get the tokens, get the embedding block, get the transformer block, and apply the transformer. And that's it. The loss function is very, very similar to the first thing we saw at the beginning: you get the one-hot embedding of the target. See, I'm completely sure that even though this might be the first time you see JAX, you're perfectly capable of reading this code, because you probably already know NumPy. Okay, it might be a little tricky. Cool.

And we have a lovely colab, which I'm going to share online, but it is basically everything we just did. You need to install a couple of libraries. You need to build the SentencePiece tokenizer, but everything these days is so easily accessible; TensorFlow Hub has a SentencePiece tokenizer that you can basically import. And these are the model parameters, which you are more than welcome to tune, even though some of them, like the dropout rate, are pretty much standard. And then you load the dataset. This dataset is both in TensorFlow Datasets and in Hugging Face, so you can decide how you want to mix these things. The embedding block that we just saw, but now with a little bit more boilerplate. The attention block, again, with a little bit more boilerplate. I'm going to skip everything we just saw. And here you define the update function, which is basically: get the key, split the key in order to make it reproducible, apply the optimizer, and return the new state. That's absolutely it. And when you train the model, and I trained for a very little time, you can see the losses getting lower, so let's take that as a good sign.
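Since the update function is the piece that glues everything together, here is a hedged sketch of that pattern. The talk doesn't name the optimizer library, so I'm assuming optax, the usual companion to Haiku; `loss_fn` here is a toy stand-in for the real loss (which would wrap the transformed forward pass and compute cross-entropy against the one-hot targets), and the optimizer and learning rate are arbitrary.

```python
import jax
import jax.numpy as jnp
import optax

def loss_fn(params, rng, tokens, targets):
    # Toy stand-in: in the colab this calls the transformed transformer's
    # apply and computes cross-entropy against the one-hot targets.
    del rng, tokens  # unused in this stand-in
    return jnp.mean((params["w"] - targets) ** 2)

optimizer = optax.adam(1e-4)  # assumed optimizer and learning rate

@jax.jit
def update(params, opt_state, rng, tokens, targets):
    rng, step_rng = jax.random.split(rng)  # split the key: reproducible, fresh per step
    loss, grads = jax.value_and_grad(loss_fn)(params, step_rng, tokens, targets)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)  # apply the optimizer
    return params, opt_state, rng, loss            # return the new state

# Usage: initialize once, then loop over batches.
params = {"w": jnp.zeros(3)}
opt_state = optimizer.init(params)
rng = jax.random.PRNGKey(0)
params, opt_state, rng, loss = update(params, opt_state, rng, None, jnp.ones(3))
```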
Okay, let's go back to the slides. So the main takeaways that I hope you get from this talk: first of all, find the right system that allows you to keep up with your literature. There are no universally good tools; I hope some of the tools I presented are useful, but by all means, find the one that works for you. Be smart about how you read papers: you don't need to read absolutely everything. And if a paper is relevant to you, summarize it and store it somewhere safe, so you can go back and remember it. A colleague told me the other day that they keep all the papers in their brain, and I was like, you know, one: don't do that.

Okay, so on transformers: remember that the key thing about a transformer is that it allows parallelization, and therefore faster training times. A lot of new flavors of transformer have improved the long-range modeling, but they have harder training times, so that's a caveat. Attention allows us to capture information across long-range distances, and the longer the context, the better for the prediction. That's why a lot of new variations, like S4, try to improve the context, but then if you want to put them in production and run experiments with them, it might be too slow. And finally, positional encodings capture the absolute position of the tokens in a sequence.

And finally, on implementing papers: we will all need to implement papers, either for our academic career or for our business career. Find the right tool to implement the paper; there's always a trade-off between flexibility and ease of modification, so there's no right answer, find the best one for you. We really like JAX because it's very easy to jump in and it allows us to do a lot of things. And on top of that, when we don't want to write raw JAX, we have Haiku, a JAX library that allows us to write normal, object-oriented Python code.

And that'll be me. And please, please, please: if you implement new systems, new machine learning models, be conscious about your users, be conscious about the repercussions. This is not just a black box; we are getting to understand what's happening, and there's massive new research on interpretability, trying to understand the depths of the transformer. So I hope yesterday's talks gave you an idea of what's happening in the field. But also, be happy: I'm very cheerful about the future. I think it's bright, because we have all these new tools and all this new blooming research. And yeah, that'll be me. If we have some time for questions, you're more than welcome. Yeah, sorry, I rushed through.

So, hey, JAX looks pretty cool; I haven't seen it before. What would be reasons to move over from something like PyTorch, either in a research setting or, more particularly, in a production setting?

It allows way more flexibility than PyTorch. And then there are business reasons: JAX is implemented within the company, so we have the original developers that we can ask, and it's very much set up for our environment; it works very well with our TPUs, and people just use it. But again, try to find the right tool. From my experience, and I'm a machine learning practitioner, I used TensorFlow in the past and I never used PyTorch. I feel like PyTorch allows more high-level development, but I'm not sure, if I want to touch something, like getting a gradient with respect to different variables, how that is going to be done. So it probably is: find a tool that is right for you.
Maybe you just want to import a transformer where you don't care how many layers it has; you just want to say, these are the main things I want to modify, and you don't need anything fine-grained. Yeah.

And is there much of an ecosystem? Sorry, yeah, go right ahead. Is there much of an ecosystem, you know, on Papers with Code? Are there a lot of JAX models up there, or is it still being developed?

Yeah. So at Google, most of the research is in either JAX or TensorFlow, and the research we open-source is mostly JAX these days. So there's a lot of things out there. Obviously not as much as others; we don't open-source that much, sadly, but for good reasons too. But yeah, there's a good ecosystem out there.

Good, thank you. Thank you so much for your time and your talk. Yeah.