Well, thanks for coming to this talk on Transformers for time series. My name is Ezequiel Lanza, and I am an AI open source evangelist working for Intel. What I would like to share in this talk is my thesis work: I have been doing some research on Transformers and how you can use them for time series. The main idea is to share my experience, my frustrations, and what I think can and cannot be useful.

First, the agenda. I'll start with a light Transformers 101, in case you are not familiar with the architecture. I won't go into all the details, but I will explain the parts of the Transformer that matter most for time series. After that, I'll talk about how we can use it for time series, and then about two main architectures, Informer and Spacetimeformer. There are tons of architectures out there; I decided to focus on these two for a reason I'll share later. Then comes the use case, how we tested these implementations in a real scenario, and some conclusions. I hope you enjoy it. It will be personal, so any questions you have, please ask at the microphone.

So, Transformers. As you probably know, everything started in 2017 with the paper "Attention Is All You Need". The vanilla Transformer was basically an architecture to translate from English to French: you have a phrase in one language and you get the translation in the other. But that was 2017. What we have today are adaptations of that original architecture, which turned out to be useful far beyond that task. To give an example, GPT, as you have probably heard, is OpenAI's adaptation: they made some modifications to the Transformer and built GPT-1, GPT-2, GPT-3, 3.5, and 4. The same goes for DALL-E, or for Stable Diffusion here at the bottom: if you want to write a text and get an image, you use a part of the Transformer, adapted and combined with other components, neural networks, CNNs and so on, to get the result. But we have nothing settled for time series yet. It works for language, it works for images or images plus language, but for time series we are at a point where we don't know how useful it really is. It may be useful, it may not; the state of the art is not clear. You can find a lot of research, but nothing is settled today. Just to give an example, this is how Stable Diffusion is built. You have a text and you want an image at the end of inference. The Transformer is the first part: it converts the text into text embeddings, a representation. That representation then goes to another network, the U-Net, which is basically a CNN architecture. What I wanted to showcase here, without going into detail, is that they took the Transformer architecture and built something for images out of it. The same idea applies to other scopes.
It could be time series, computer vision, or anything else. The performance Transformers delivered was so impressive that most people want to use them — and this is at least what I've seen — even when it's not the best option. It might be the best option, but there are a lot of constraints you need to keep in mind when you use this kind of model. The architecture has many parts, and I won't go through all of them, but for time series we should be aware of three main things: how you represent your input, that is, how you build your embeddings, and how you adapt the encoder and decoder multi-head self-attention. I will explain this later; we basically focus on those parts. Of course there are other pieces of the architecture, the feed-forward layers and so on, but they are not the relevant ones for time series.

Let me start with how the vanilla Transformer works for text, so we can then see how to adapt it for time series. Suppose we have the phrase "I love dogs". We have three words, and we need to represent them differently, because if you feed raw letters into the model, the model cannot understand letters. You need to convert the words into numbers, vectors, a representation a machine can understand. How can you do that? For text we have word2vec, which works pretty well. We could simply map each word to an index between, say, 1 and 5,000, or 1 and 20,000, covering all the words you can find in English, but that wouldn't be enough, because we need to represent the text with some meaning. We need to give the model meaning so it can understand. This is why we use word2vec: it gives you a representation where, for instance, the distance between the word "king" and the word "queen" is roughly the same as the distance between "man" and "woman". That sounds obvious, but the point is that the word is no longer just an index between 1 and 20,000; the vector brings information to the model, and this is what word2vec does. It's a model that is already trained; you can download it and use it in your own case. So they took these word embeddings and used them to represent the input. They picked a dimension, say 512, and for each word you get a vector of dimension 512: one for "I", one for "love", one for "dogs", or whatever the phrase is (the numbers on the slide are random, not the real ones). But there is an additional problem: the order each word has in the phrase is also pretty important. So beyond these embeddings, which represent the words themselves, we also need to provide more context, more information.
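To make that embedding step concrete, here is a minimal sketch in PyTorch (my own illustration, not the speaker's code) of turning the three-word phrase into 512-dimensional vectors with a lookup table; the tiny vocabulary and the random weights stand in for a full pretrained model such as word2vec.

```python
import torch
import torch.nn as nn

# Toy vocabulary standing in for a full English vocabulary (e.g. 20,000 words).
vocab = {"i": 0, "love": 1, "dogs": 2}
d_model = 512  # embedding dimension used in the original Transformer

# In practice these weights would come from a pretrained model such as word2vec;
# here they are random, which is enough to show the shapes involved.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=d_model)

tokens = torch.tensor([vocab[w] for w in "i love dogs".split()])  # shape: (3,)
vectors = embedding(tokens)                                       # shape: (3, 512)
print(vectors.shape)  # torch.Size([3, 512]) -- one 512-dimensional vector per word
```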
With word2vec we are only providing the meaning of each word — "king" is similar to "queen" and so on — but we also need to be aware of the position of each word in the phrase. So they chose a function; the details don't matter here, it's a mathematical function, and what it basically does is add information about the position to the same embedding, the same representation. Using the word "dog" in one part of the phrase is then not the same as using it in another part, and the representation captures that. This is the first part. You can probably imagine that for time series we should do something similar. Of course it won't be word2vec, because we are dealing with numbers, but this will probably matter later when I get to time series. And that's it: for each word in the phrase we have a vector carrying information.

Now we have the encoder and the decoder. Without going into the mathematical details, what self-attention does when you train the model is detect the relationship between one word and the other words. For instance, starting from this embedding, I would like to know how "children" relates to "playing", or how "children" relates to "park". The representation I get is what they call attention, because what they want with this layer is to let the model focus only on the parts of the phrase that are relevant — the phrase could be 200 or 2,000 words long — and to tell the model: pay attention to this particular part, not to that one. How do they do that? With a mathematical operation, of course, but the part that matters for this explanation is that there are three weight matrices: the queries, the keys, and the values. Starting from the initial embedding we calculated, which carries the information of the word and its position, you multiply it by these matrices, and the most important point is that each word is compared against all the others. So if you have a four-word phrase, when you calculate the attention for the first word you multiply the first word by the second, by the third, by the fourth, and so on, and you do the same with the second word against all the others. You get a score, that score is a weight, and it's combined back into one embedding representation. That is how self-attention works, at a very high level. But it's important to note that if you have a very long phrase, you can imagine that the time it takes to calculate the attention can be pretty high; these matrix calculations can take a lot of time. And if we do this just once, with just one head — what they call a head in multi-head attention — that single head will only let the model focus on one part of the phrase.
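Here is a minimal sketch of a single attention head as just described (my own illustration, not code from the talk): every position is projected into a query, a key, and a value, every query is scored against every key, and the scores weight the values. The quadratic cost mentioned later is already visible in the L-by-L score matrix.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
L, d_model, d_k = 4, 512, 64          # 4 tokens, model dimension 512, head dimension 64
x = torch.randn(L, d_model)           # token embeddings (word + position information)

# The three learned weight matrices: queries, keys, values.
W_q, W_k, W_v = (nn.Linear(d_model, d_k, bias=False) for _ in range(3))
Q, K, V = W_q(x), W_k(x), W_v(x)      # each: (L, d_k)

# Each token's query is compared against every token's key -> (L, L) scores.
scores = Q @ K.T / math.sqrt(d_k)
weights = torch.softmax(scores, dim=-1)   # attention weights, each row sums to 1
out = weights @ V                         # weighted mix of the values: (L, d_k)

print(weights.shape, out.shape)  # torch.Size([4, 4]) torch.Size([4, 64])
```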
But take a phrase like "The animal didn't cross the street because it was too tired". If I want to calculate the attention for "it", it's probably related not only to "tired" but also to "animal". We need to be able to capture both relationships, between "it" and "animal" and between "it" and "tired", because the model will need that information to make decisions later. So instead of doing what I showed before with one head, you use multi-head attention — six, seven, eight heads; it's a parameter they decide when they build the architecture. The main idea is to capture short-term dependencies with some heads and long-term dependencies with others, within the same architecture.

Once you have the encoder, the decoder works the same way. Once you have all the vector representations, the embeddings for each word after attention, position, and so on, you put all the words into a matrix, and that is the matrix for the encoder. Then you do the same with the decoder. The difference is that the encoder was trained on the input language, English in this case, and the decoder on French. It does basically the same thing, but instead of paying attention to one language it pays attention to the other, and it uses the encoder output — already processed through attention and so on — as an input to the decoder. This part can be a bit complicated to understand, but that's how it works: you run the same process, the multi-head attention layers, in both the encoder and the decoder. How does it work in the end? Every time it makes a prediction, the input is, in this case, a phrase in French — I have no idea about French, but that's the phrase. Once it's encoded, with the embeddings and so on, this goes to the decoder, and the decoder tries to predict word by word in English, the target language. It predicts, for instance, "I". In the second step, the input to the decoder is the word "I" plus all the embeddings coming from the encoder, and it predicts the second word. It keeps doing this until it produces the end-of-sequence token; when it gets the end of sequence, the translation is done. So that's a really high-level idea of how it works; explaining every detail would take a lot of time, but those are the important parts of the initial Transformer.

Right. How can this be used for time series? First we need to define what a time series is, just in case you're not aware. In a time series, the point I have at time zero has a dependency on the previous points — it could be the previous five points, six points, ten points, whatever. Capturing this relationship is really important, and capturing the dependency between the previous points and the point you want to predict, in the present or in the future, can be really complicated. This is the main difference between time series and other kinds of problems, although we can think of a time series as something like a language.
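As a tiny illustration of that dependency (my own toy example, not data from the talk), in the series below each new point is generated directly from the two previous points, so any model that wants to predict the next value has to capture that relationship — and has to respect the order.

```python
import numpy as np

rng = np.random.default_rng(0)
series = [0.0, 0.1]
for _ in range(200):
    # The next point depends explicitly on the two previous points (plus noise).
    series.append(0.6 * series[-1] + 0.3 * series[-2] + rng.normal(scale=0.05))
series = np.array(series)

# The dependency shows up as strong correlation between the series and its lags.
lag1 = np.corrcoef(series[1:], series[:-1])[0, 1]
lag2 = np.corrcoef(series[2:], series[:-2])[0, 1]
print(f"correlation with previous point: {lag1:.2f}, with two points back: {lag2:.2f}")
```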
So you could say that, if you have a phrase, the end of the phrase has a dependency on the earlier words. Take "I was, I don't know", for instance: the word "know" has a dependency on "don't", on "I", or even on "was". But in language — and this is partly why it's tricky to carry the idea straight over to time series — it doesn't matter that much if the order is strictly preserved: if one word ends up in a slightly different position, the model should still be able to get the idea of what you are writing, and this is something the Transformer does pretty well. With time series, I need to be strict about the order. Still, we have a model, the Transformer, that works pretty well with sequences, with phrases, so we can imagine it could be good for time series too, thinking of them as something similar.

How are we solving these problems today? We have the classic, analytic approaches: ARIMA and other auto-regressive models. I'm not saying they are not useful, but you need a deep understanding of the time series: you need to understand the trend, the seasonality, the residuals, and based on that analysis of your series you build the ARIMA model. Even then — and of course it depends on whether your series is seasonal or whether you can detect the season — writing an ARIMA model can be pretty complicated. And the main problem is that even if you can do it, it only captures linear dependencies. In a multivariate, multi-feature problem, the relationships between the features are often not linear. If your problem does have linear dependencies between the features, great, you can use it, but most of the time the relationships are not linear.

So 15 or 20 years ago, with the explosion of neural networks, people said: let's try feed-forward networks. You build a model from scratch and decide, for instance, that the series has a strong dependency on the six previous data points, so every time you need to predict the next data point you look at exactly those six. A model built that way can capture non-linear dependencies, but it still has the problem of being focused on that fixed window: I am hand-crafting the solution for this particular problem, and if the dependency window grows, feed-forward layers start to become a problem because of the way they are trained. I don't want to go into the details.
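Here is a minimal sketch of that hand-crafted, fixed-window approach (my own toy example): a small feed-forward network that always looks at exactly the six previous points to predict the next one. It can fit non-linear patterns, but it is locked to that window.

```python
import torch
import torch.nn as nn

WINDOW = 6  # the hand-picked number of previous points the model is allowed to see

# A tiny feed-forward forecaster: 6 past values in, 1 predicted value out.
model = nn.Sequential(
    nn.Linear(WINDOW, 32),
    nn.ReLU(),               # the non-linearity is what goes beyond ARIMA's linear fits
    nn.Linear(32, 1),
)

# Build (window, next value) training pairs by sliding over one long toy series.
series = torch.sin(torch.linspace(0, 20, 500))
X = torch.stack([series[i:i + WINDOW] for i in range(len(series) - WINDOW)])
y = series[WINDOW:].unsqueeze(1)

opt, loss_fn = torch.optim.Adam(model.parameters(), lr=1e-2), nn.MSELoss()
for _ in range(200):                                 # short training loop
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

next_point = model(series[-WINDOW:].unsqueeze(0))    # predict the next value
print(float(next_point))
```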
The next step was RNNs, recurrent neural networks: you feed the network the time series — 10 points, 7 points, however many you think are useful — and the RNN is able to memorize the importance of the previous data points. That is quite different from the feed-forward case, where you are just fitting fixed weights; here we have cells that are able to memorize which parts of the series are important. But we still have a problem: if the series is pretty long, the contribution of the older points keeps shrinking. If we have a series where we need to pay attention to the previous 20 data points, the model probably won't capture those older points very well, because the gradient vanishes. The model cannot detect the longer dependencies well. It's fine if your series only depends on the last couple of data points, but if we want to capture long dependencies, this is not a good option. Then there is the LSTM, which is still widely used today; in some use cases I found it really practical, it's pretty easy to build and to use. It's the same concept as an RNN, but the cells have memory and gates you can open and shut, so the network can remember long sequences, because you are not pushing the same gradient through every step the way an RNN does. For those cases LSTMs can be very useful, and even today an LSTM can be a very good competitor to Transformers — but I don't want to jump to the end of the talk. Another option is to take the same LSTM and use it in a sequence-to-sequence setup: I feed my data into an encoder, I extract the features of the input, those features become the input for a decoder, as in the Transformer explanation I gave before, and once the decoder is trained you get the output. It's a way to try to improve LSTM performance; sometimes it works, sometimes it doesn't, depending on the use case.

Now we can imagine that Transformers could be useful, because, as I said, with multi-head attention one head can focus on one part of the sequence and capture short-term dependencies while another captures long-term dependencies. Just to give you an idea of how well that works with long dependencies: you have probably used ChatGPT, which uses GPT underneath, and the texts it produces are long — not five-word phrases but hundreds of words that hold together and carry meaning. That's why we can imagine we could predict, say, 1,000 data points into the future. Maybe yes, maybe no, but we can imagine it could be useful. There is one big issue, though, which I mentioned at the beginning: the attention layer compares each timestamp — each input token, or each word if we go back to the NLP case — against all the other tokens in the sequence. As long as we keep the input size the same, the time is fine, but if I keep increasing the input size, the time it takes to compute the attention grows quadratically with the length. So we can have a problem here: a computational problem, or simply that it takes a long time to produce an answer. It's not a blocker, because it works, but that quadratic cost grows fast.
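To make that scaling issue concrete, here is a quick sketch (my own illustration): the attention score matrix has one entry per pair of tokens, so doubling the input length roughly quadruples the work.

```python
import time
import torch

def attention_scores(x: torch.Tensor) -> torch.Tensor:
    # One score per pair of tokens: an (L, L) matrix.
    return torch.softmax(x @ x.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)

for length in (256, 512, 1024, 2048):
    x = torch.randn(length, 64)
    start = time.perf_counter()
    for _ in range(20):
        attention_scores(x)
    elapsed = time.perf_counter() - start
    print(f"L={length:5d}  score matrix {length}x{length} = {length**2:>9,} entries  "
          f"time {elapsed:.3f}s")
```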
So, what work has been done with Transformers, then? There is a survey paper from 2023 that looks at Transformers for time series, and it says you basically need to be aware of two main things when you modify the network: the positional encoding, meaning how you represent the data at the input, and the attention module, which is basically about reducing that time complexity to make it workable. Of course it also depends on the application — forecasting, anomaly detection, or classification — but the main modifications you will find in time series Transformers are in the positional encoding and in the attention module. You can try the vanilla encoding, the one I explained at the beginning — I did some experiments with it — and something similar to word2vec for time, a kind of time2vec, but the problem is that it's not able to fully exploit the importance of the features, because, as I said, the original architecture is designed to find relationships without being strictly attached to the order, so you need something that helps you with that part. Around 2021 a lot of research showed that making this embedding, this representation, learnable over time can help. And then people went further and said: let's embed the timestamp itself, let's put the timestamp inside the embedding. This is how you get Informer, Autoformer, FEDformer — lots of "formers" that basically do similar things, but they start playing with how the embeddings at the input are built. And that changed a lot, to be honest. There is more: besides the input, they also do some pruning on the attention layer. There is something called ProbSparse attention, a probabilistic attention: instead of doing all the calculations, it uses only the most likely, most important query-key interactions. It's a very good way to reduce the time it takes to compute the attention layers, and Informer does exactly that.

I decided to use Informer and Spacetimeformer for two main reasons. The first one is that they are the architectures that are truly open and usable: you have a GitHub repository, you can go there, you have examples. Otherwise you need to go into the details and understand a lot about how the attention works before you can use these implementations. Informer has a pretty good GitHub repo with samples showing how to feed it your own data, and Spacetimeformer does the same; it comes from a university group, so it's pretty friendly to use.
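Since this attention pruning is a big part of what makes these models practical, here is a rough, simplified sketch of the ProbSparse idea (my own illustration, not Informer's actual implementation; the real method also approximates the scoring with a sample of the keys so it never builds the full score matrix): only the few queries whose score distribution is far from uniform get full attention, and the rest fall back to a simple average of the values.

```python
import math
import torch

def probsparse_like_attention(Q, K, V, top_u):
    """Simplified sketch: full attention only for the most 'active' queries."""
    d = Q.shape[-1]
    # NOTE: the real Informer estimates this measurement from sampled keys; computing
    # the full score matrix here is only for clarity, not for speed.
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)              # (L, L)
    sparsity = scores.max(dim=-1).values - scores.mean(dim=-1)   # how peaked each query is
    top = sparsity.topk(top_u).indices                           # queries worth full attention
    out = V.mean(dim=0).expand(Q.shape[0], -1).clone()           # lazy default: mean of values
    out[top] = torch.softmax(scores[top], dim=-1) @ V            # full attention for top queries
    return out

L, d = 96, 64
Q, K, V = (torch.randn(L, d) for _ in range(3))
u = int(5 * math.log(L))                                         # keep only ~O(log L) queries
print(probsparse_like_attention(Q, K, V, top_u=u).shape)         # torch.Size([96, 64])
```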
How does Informer work? The idea is: let's also add information about the week, the month, the holidays, for instance. Suppose we have a year of data and we feed the model with it; in the middle we have holidays, we have months, and your series may have a season — for instance it behaves differently in summer compared with winter or fall. You need to give that information to the model too, to push it to say: if it's summer, weight things differently when you represent the input data. The concept is similar to before, because what you have at the end is a projection with all of this information embedded: the global time features — weeks, months, and so on — plus the positional embeddings, which force the model to understand that this timestamp is position one, this one is position two, three, four, and so on. They did the same, as I said, with the attention module: they use the ProbSparse attention sketched above, which, instead of computing every token against every other one, does far fewer calculations based on likelihood. It works; the formula is nice to read, but that's essentially what it does.

The results they get are better — they compare against LSTM, of course. What I find really interesting is that there are standard benchmark series that people use when they evaluate something on time series, and they use them to predict 24, 48, 168, and up to 720 data points into the future. They feed in about a year of history, roughly 360 data points, and they try to predict far ahead. They show that it's better, but the really interesting part is the comparison with LSTMs: on the long horizons the Informer error is around 1.5 MSE while the LSTM is around 1.96, and the gap is even smaller if you only predict 24 data points ahead. So the conclusion we can draw is that it's probably better for long horizons. It's not so dramatic that you can say: here we have an MSE of 2 and with Informer it's 0.4 or 0.5. It's better — not something awesome, but clearly better than the traditional LSTMs.
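A simplified sketch of this Informer-style input representation described above (my own approximation of the idea, not the actual Informer code, which uses a convolutional value embedding and sinusoidal positions): the measured values at each timestamp, the position within the window, and calendar features such as month, weekday, and hour are each embedded and summed into one vector per timestamp.

```python
import torch
import torch.nn as nn

class TimeSeriesEmbedding(nn.Module):
    """values + position + calendar features -> one d_model vector per timestamp."""
    def __init__(self, n_features: int, d_model: int = 64, max_len: int = 512):
        super().__init__()
        self.value = nn.Linear(n_features, d_model)       # project the raw measurements
        self.position = nn.Embedding(max_len, d_model)    # where in the window we are
        self.month = nn.Embedding(12, d_model)            # "global" calendar features
        self.weekday = nn.Embedding(7, d_model)
        self.hour = nn.Embedding(24, d_model)

    def forward(self, values, month, weekday, hour):
        # values: (batch, length, n_features); calendar tensors: (batch, length) of ints
        pos = torch.arange(values.size(1), device=values.device)
        return (self.value(values) + self.position(pos)
                + self.month(month) + self.weekday(weekday) + self.hour(hour))

emb = TimeSeriesEmbedding(n_features=60)
values = torch.randn(2, 96, 60)                  # 2 windows of 96 timestamps, 60 features
month = torch.randint(0, 12, (2, 96))
weekday = torch.randint(0, 7, (2, 96))
hour = torch.randint(0, 24, (2, 96))
print(emb(values, month, weekday, hour).shape)   # torch.Size([2, 96, 64])
```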
Spacetimeformer tries to do something similar, but the representation is different again. They say: instead of using this positional encoding with weeks, days, months, and so on, let's represent the input differently. For every timestamp they say: these are the features present at this timestamp; the same for the next timestamp, and so on. They represent the series feature by feature and put everything together into one long sequence: for time zero, feature zero, this is the value; for feature one at time zero, this value; and so on for every feature and every timestamp. You can imagine it's a very different representation, but to explain it simply: if you embed all the features of a timestamp into a single token, you only get temporal attention, the relationships between timestamps, and you lose the relationships between the features. You would like to find relationships not only between timestamps but also between features. So with Informer you get this temporal attention, which may or may not be enough, but you may not be capturing all the relationships in the data. One thing you could do is force a graph — write one by hand. If I know the relationships between the features, I can manually add a graph between the features. But the problem is the same: you are hand-crafting something that may change over time; the relationship between the features may not stay the same, so you would probably need to keep changing it. What Spacetimeformer does instead is leave it open. If you look at the figure, the blue lines are the attention the model is capturing: with this representation you allow the model to pay attention across features and across time, not just one or the other. And it works pretty well, to be honest. I think the concept — letting the model learn how the features relate to each other instead of hand-specifying it — is pretty powerful. The architecture is almost the same; they work on optimizations in the attention layer, they make some modifications, they add a CNN in the middle, but that's not the main point. The main point is how we represent the data so the model can understand it; we need to expose that information.

These are the benchmarks they showed. Again better results, but the picture is similar to the previous one: the further ahead you predict, the bigger the error gets, and the LSTM error grows even more in some cases and not much more in others, so the gap widens a bit. But again, it's not what I would call spectacular: we have 2,135 — I think this is because the MSE is not normalized — compared with 2,211, so it's not a huge difference. It's better, again, but not by a huge margin.
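To make the Spacetimeformer-style token layout described above concrete, here is a toy sketch (my own simplification, not the paper's code): instead of one token per timestamp that mixes all the features together, each (timestamp, feature) pair becomes its own token, so attention can later relate any feature at any time to any other feature at any other time.

```python
import torch
import torch.nn as nn

batch, length, n_features, d_model = 2, 8, 3, 32
x = torch.randn(batch, length, n_features)   # multivariate series: one row per timestamp

# Informer-style tokens: one token per timestamp, all features mixed into it.
per_timestamp = nn.Linear(n_features, d_model)(x)                  # (2, 8, 32)

# Spacetimeformer-style tokens: flatten so each (time, feature) pair is its own token.
values = x.reshape(batch, length * n_features, 1)                  # one scalar per token
time_idx = torch.arange(length).repeat_interleave(n_features)      # 0,0,0,1,1,1,...
feat_idx = torch.arange(n_features).repeat(length)                 # 0,1,2,0,1,2,...

tokens = (nn.Linear(1, d_model)(values)
          + nn.Embedding(length, d_model)(time_idx)                # when it was measured
          + nn.Embedding(n_features, d_model)(feat_idx))           # which feature it is

print(per_timestamp.shape, tokens.shape)
# torch.Size([2, 8, 32]) torch.Size([2, 24, 32]) -- three times more tokens, but attention
# can now relate feature i at time t to feature j at time t' directly.
```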
Now, this is what I tried in my use case: a microservices architecture, where I wanted to predict the latency between the front-end service and the users. The information I have is the latency of each individual service; latency meaning the time each service takes to respond to or process a request. I put them all together, and my target is the latency from the front-end service to the user. I could have selected any other one, but I chose the front-end because I wanted to see the impact on the user at the end. Just to explain what we are doing — this is just one feature, of course — we take 100 data points from the past, which is the green line, and we predict the dotted blue line. We do a short prediction and also a long prediction; I wanted to see how it works with both. When I use 360 data points and predict the next 36, which is pretty short, I got better results with Informer, an MSE of about 0.06, and again the LSTM is not so bad. But what caught my attention is the time it takes to process a batch. Even if the Transformer is better, in some use cases you don't have that time to wait. It's not a huge amount of time, but it is more than the LSTM — an LSTM is a pretty simple network to build, after all. What directly helps here is the optimization they do in the attention layer: without it, this time could be double or triple. So all the research people are doing is getting them there, but it's still considerably higher than the LSTM. For a horizon of 120 it's pretty similar: the same picture, clearly better for the Informer, worse for the LSTM, but the time it takes keeps growing; again, the Informer takes longer to produce predictions than the LSTM. This chart is from the Spacetimeformer paper; I wanted to show it because they ran the same timing comparison with LSTM, with LogTrans, which is another Transformer, with Reformer, and so on. You can see that as the encoder length grows, the time keeps getting higher, and the model that takes the least time is always the LSTM, because it's simpler and easier to use. Informer is not that bad — it takes a bit more time, but it's not that bad.

So, conclusions. Transformers seem to be a good solution, but I think a lot of work still needs to be done, basically on optimizations and on trying different ways to represent the data. And even if you find a model that works on the benchmarks, it's probably not automatically the best one for you, because it depends. This is not like language, where, as we see with GPT, once you train a language model it understands language — of course you fine-tune it for your particular use case, but it understands language. That doesn't happen with time series, because every time series is different; there is no single shared "time series" the way there is a shared language. So for me, using one of these architectures is one more option to try, to see whether it's useful for my problem. It could work, it could not; you never know, you need to test it. And the third point: what matters with Transformers and time series is that the community drives the state of the art. It's not like GPT, where OpenAI or Google trains the huge models. Here it's the community that researches, tests, and shares information, because you can't just download a pretrained time series model and use it: you need to train the model on your data and see whether it works, and that's up to each use case, each problem. So if you are using Informer, Spacetimeformer, LogTrans, or any other Transformer, try to be involved, try to collaborate and report back. And there are cool projects like TSAI: if you want to work with time series, you need a framework to experiment with, because it's not easy to build your own LSTM, your own Transformer, your own Spacetimeformer from scratch, so frameworks with easy-to-use APIs really help. I recommend it: if you are working with time series, TSAI is a very good project, and this is the GitHub.

And that's it. Thank you so much for your time. I hope you have enjoyed it. Any questions you have, please, we have the microphone.

Thank you for the presentation. Two questions. The Spacetimeformer, that's open source as well, right? Because you mentioned Informer is the open source one, but...

All of them.

All of them, okay. And in your use case example, the data parameters: you basically took the latency, right? And the timestamp, that's it?
You didn't include, say, the number of users or the type of architecture — just those two data points?

Yeah — sorry, that was the question, right? In fact, I have the P95, P99, and P98 values for each feature. I had around 60 services, and I concatenated the P95, P99, and P96 or P98 values — I don't remember exactly which — plus the throughput, which is another 60 columns. So I had a matrix of about 250 features, if I'm not wrong, with 5,000 or 6,000 timestamps.

And is it possible to add more dimensions to that data and still train it in something reasonable, not a million years?

Yes, exactly. But once you add more data, the matrix calculations in the attention layer get bigger and bigger, so it takes time to train those models. And this is a pretty simple use case — it's just latency.

Right. And did you run the benchmarks against the actual results? Once you predicted the future, you took the actual values and looked at the variance between actual and predicted?

Yeah, actual versus predicted. I had an initial dataset of 10,000 points. I separated out 2,000 that I never showed to the model; I trained the model on the other 8,000, and once the model was trained I tested its performance on those 2,000, for both the 36-step and the 120-step horizons.

Nice. We were talking about this yesterday in the monitoring session — I don't know if you heard it — and they were talking about exactly this, how AI might help people identify impact. Because you could use this and say: look, based on trends, the impact is going to be higher, the acceleration is going to be higher. It's like when you drive a car with a trailer and it starts shaking: sometimes you know it's going to straighten out and sometimes you're going to fly off the highway, so this kind of thing could predict the impact.

Yeah, it should be enough, because you are modeling something. I don't know what features you could get there, but I can think of speed, vibration, movement, and so on. That's pretty interesting. But again, you can do the same with LSTMs.

Thanks for your talk. Do you know if anyone is doing work on putting confidence intervals on the model predictions? So you could see the uncertainty growing over time as you predict further into the future, and decide at what point the prediction is just not worth using.

No, I'm not aware of that, but I think you can handle it when you implement these use cases. The solution I tried actually has a model selector at the beginning. The model selector detects, for instance, that you don't have enough data to train a Transformer, so you use a regression or an LSTM instead. As you accumulate more timestamps you can try to train the other models, and along the way you may decide you now have enough data to train a Transformer — or, in the middle of the process, that it's not the best architecture for this problem after all. So when you implement this, you need to keep monitoring the performance of the different options: LSTMs, Transformers, or even a regression, which could be something pretty simple.
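As a side note, here is a minimal sketch of the hold-out evaluation described a moment ago (my own reconstruction under stated assumptions: the real experiment used trained Informer and LSTM models, which are replaced here by a naive placeholder): hold out the last 2,000 of 10,000 timestamps, never train on them, and compare predictions with the actual values using MSE for both horizons.

```python
import numpy as np

rng = np.random.default_rng(0)
series = rng.normal(size=(10_000, 60))          # toy stand-in for the latency matrix

train, test = series[:8_000], series[8_000:]    # the last 2,000 points are never trained on
# (a real model such as Informer or an LSTM would be fitted on `train` here)

def predict(history: np.ndarray, horizon: int) -> np.ndarray:
    """Placeholder for the trained model: naive 'repeat the last value' forecast."""
    return np.repeat(history[-1:], horizon, axis=0)

def evaluate(context: int, horizon: int) -> float:
    """Slide over the held-out data and average the MSE of actual vs. predicted."""
    errors = []
    for start in range(0, len(test) - context - horizon, horizon):
        window = test[start:start + context]
        actual = test[start + context:start + context + horizon]
        errors.append(np.mean((predict(window, horizon) - actual) ** 2))
    return float(np.mean(errors))

print("MSE, 36-step horizon :", evaluate(context=360, horizon=36))
print("MSE, 120-step horizon:", evaluate(context=360, horizon=120))
```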
Yes, but these are things you only see when you actually implement these algorithms, because all the research projects, everything you find about Informer and so on, say: here is the setup, this is the MSE, and that is how you measure how good it is — and they all use the same benchmarks, so it's more or less the same story every time. When you want to use it in a real scenario, you need to monitor the performance, you need to monitor the MSE, and even with a good MSE you may not be able to detect peaks, because MSE is a mean squared error — it's a mean. If the signal is similar most of the time but there is one peak in one second, the MSE will mostly stay the same. So if you only look at the MSE, you can be missing things; you need mechanisms to detect those peaks and see how the model is really behaving. Thank you for your question.

Yes, I really enjoyed this presentation, and I wanted to ask how the time series, sequence-to-sequence data is set up. When I think about training a Transformer, I think of something like the translation task: you have a dataset of a bunch of not necessarily related sentences, say a five-word sentence in one language that translates to a four-word sequence in the output language. I'd like to know how you set up a sequence-to-sequence model on time series data, where you just have one continuous sequence of data points.

Good question. It is a kind of sequence-to-sequence, but the difference is this: when you do sequence-to-sequence with an LSTM, you give the model 10 data points, but internally it processes one data point at a time. With Transformers, we feed the 10 data points in at the same moment, and the model builds a representation of those 10 data points together. You are not going step by step; you put the whole sequence in at once, the model calculates the attention, the relationships between the points, and that information goes to the decoder. Decoders can use it in different ways, but in Spacetimeformer, for instance, they don't produce the output one by one either: the decoder processes everything at once, so if you want to predict the next 10 timestamps, you get the 10 timestamps at once. So it is sequence-to-sequence, but not in the same sense as you would think of it for LSTMs.

So is the sequence length a hyperparameter in that case?

The sequence length is something you choose, yes, but the learned parameters of the Transformer are those three matrices I mentioned: the queries, keys, and values. Every Transformer has them, and for each attention head you have a separate set: three matrices for one head, three for the next, and so on. What you get out is a representation at the end — an output which, if you look at it, seems to have nothing to do with the input, because it's an embedding.
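To illustrate the earlier point about MSE hiding peaks, here is a small numeric example (my own, not from the talk): a single large mispredicted spike barely moves the mean squared error over a long window, so tracking a peak-sensitive metric such as the maximum absolute error alongside MSE helps.

```python
import numpy as np

actual = np.zeros(1_000)
predicted = np.zeros(1_000)
predicted[500] = 10.0          # one badly mispredicted peak out of 1,000 timestamps

mse = np.mean((predicted - actual) ** 2)
max_abs_error = np.max(np.abs(predicted - actual))

print(f"MSE over the window : {mse:.3f}")            # 0.100 -- looks fine on average
print(f"Worst-case error    : {max_abs_error:.1f}")  # 10.0  -- the peak is obvious here
```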
It's how the model thinks the input matters, the input as seen through the attention. Once you have that sequence, it's fed as the input to the decoder. So you can think of it as sequence-to-sequence in the sense that you feed an input, you get a representation, an extraction of features, and that is what the decoder uses to produce your output. The concept is similar, but how it works is quite different: you are not feeding it step by step, you are feeding it all together. In an LSTM doing sequence-to-sequence, if you have 10 cells, the first input goes to the first cell, the second to the second cell, then the third, the fourth, and so on, and the LSTM decides, say, that the fourth cell doesn't matter and shuts it down. Here you feed everything together and it does the calculations at once; all these multi-head attentions are calculated at once. So yes, conceptually it's sequence-to-sequence, but it behaves differently.

Interesting.

Yeah, it's pretty interesting, and it's pretty challenging to understand the details. Thank you so much for your question. All right, thank you for your time. Thank you, I appreciate it. Thank you.