Welcome everyone, I'm quite impressed with the size of the room, so bear with me, a bit of adrenaline. I'm Julien, Julien Peloton. I'm a research engineer at CNRS, one of the research institutes in France. This is a picture of an observatory, a real one, located in Chile: the Rubin Observatory. Together with Fabrice, who will join me on stage later, and Etienne, here in the room, I'll try to give you an idea of what we are doing. So Fink: Fink is the name of the project, and it mixes astronomy and computing. What exactly are we doing? We track changes in the sky. How do we do that? Imagine you own the observatory I showed you before. You take a picture of the sky. You come back the day after, take a picture of the same position, subtract the two images, and ask what has changed. And you would be surprised: a lot of things change, at every time scale. Seconds, minutes, hours, days, months. That can be obvious things like asteroids or comets passing by, or less obvious things, like the death of a star: something explodes in the sky. And that can be very quick, a matter of minutes; imagine a star just ripped apart with nothing left. So we often have to be very quick just to make sure we capture the information and understand what happened. The problem is that if you have a very powerful telescope, one that goes very deep and can scan every little detail, you will get many of those events, those changes that we call alerts. Typically millions per night. You cannot just collect them and look at them by eye; that would be impossible, even with an army of PhD students. So you need to automate all of that, and here comes Fink. Fink is a broker: a piece of software that serves the community by ingesting those streams of alerts that arrive in real time.
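The difference-imaging idea described above can be sketched in a few lines of Python. This is a toy illustration only, not Fink's actual detection pipeline; the function name, the nested-list "images" and the threshold value are all invented for the example:

```python
# Toy illustration of difference imaging: subtract last night's exposure
# from tonight's and flag pixels whose flux changed significantly.
# (Hypothetical example, not Fink's production code.)

def find_alerts(image_today, image_yesterday, threshold=5.0):
    """Return (row, col, delta) for every pixel whose flux changed
    by more than `threshold` between the two exposures."""
    alerts = []
    for r, (row_t, row_y) in enumerate(zip(image_today, image_yesterday)):
        for c, (flux_t, flux_y) in enumerate(zip(row_t, row_y)):
            delta = flux_t - flux_y
            if abs(delta) > threshold:
                alerts.append((r, c, delta))
    return alerts

yesterday = [[10.0, 10.0], [10.0, 10.0]]
today     = [[10.1, 10.0], [42.0, 10.0]]   # something brightened at (1, 0)
print(find_alerts(today, yesterday))        # [(1, 0, 32.0)]
```

A real survey does this on gigapixel CCD images with careful calibration and point-spread-function matching, which is exactly why the raw output is millions of candidate alerts per night rather than a handful.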
It tries to classify the events, because when they arrive the events are raw information in the purest sense: they don't contain any scientific interpretation, just a position, a change, a delta in luminosity, and a time, nothing else. So we try to classify them, and I will give you some details on how we do that. We filter them, because out of the several millions per night, maybe only a few are of interest to you. And we distribute them. That's basically what we do. How do we do it? All our services are deployed on academic clouds, large clouds that we operate 24/7, except when there are clouds, the real ones in the sky; we cannot deal with those. We serve more than 100 users, and by user I mean a human, a scientist; an observatory that wants to perform follow-up observations of what we have found; or an amateur. You, with your telescope in your backyard, wanting to observe something, can just connect to Fink and we will tell you what's interesting at your position. For that we have the pipeline you can see on the right. The alerts flow in from the bottom: we collect everything from various observatories, it's a real-time component, and the alerts arrive via Kafka. All of that is sent to Apache Spark clusters where the computation is performed. There you have a lot of machine learning algorithms, for example, that, based on the small subset of information we have, try to infer the nature of the physical process taking place. And we have to do that in real time, because the users don't want to miss the event of the year. Then you have the exits of Fink: on the left, the real-time component, where people can subscribe through various means, a Slack channel, Telegram, Kafka, you name it. The idea is to make it simple, so that when something happens they receive a notification.
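The filtering and distribution step can be pictured as a publish/filter loop: each subscriber registers a predicate on the alert fields, and only matching alerts trigger a notification. Here is a minimal sketch of that pattern; the `Alert` fields and `Broker` class are invented for illustration, and in reality this sits on top of Kafka topics rather than in-process callbacks:

```python
# Minimal sketch of alert distribution: subscribers register a predicate,
# and each incoming alert is delivered only to the subscribers it matches.
# (Field names and classes are illustrative, not Fink's real schema.)

from dataclasses import dataclass, field

@dataclass
class Alert:
    ra: float          # sky position, right ascension (degrees)
    dec: float         # sky position, declination (degrees)
    delta_mag: float   # change in luminosity
    timestamp: float

@dataclass
class Broker:
    subscribers: list = field(default_factory=list)

    def subscribe(self, name, predicate):
        self.subscribers.append((name, predicate))

    def publish(self, alert):
        """Return the names of subscribers notified for this alert."""
        return [name for name, pred in self.subscribers if pred(alert)]

broker = Broker()
# An amateur only wants bright changes near their field of view.
broker.subscribe("backyard", lambda a: a.delta_mag > 2.0 and 10 < a.ra < 20)

alert = Alert(ra=15.0, dec=-30.0, delta_mag=3.5, timestamp=0.0)
print(broker.publish(alert))   # ['backyard']
```

The point of the design is that the expensive work (classification) happens once, upstream, while each user only pays for the cheap predicate that selects their few alerts out of millions.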
And we archive everything as well. Those 10 million alerts a night, over many years, have to be stored on disk so they can be accessed later on. So we have an event database of about 1 billion entries, and for that we use Apache HBase, a distributed database. The nice thing about Fink is that I don't know anything about asteroids. I don't know anything about supernovae. I'm just a simple engineer. So how can I actually do classification if I don't know the physics behind it? This classification part is community-driven. The users come and bring the building bricks of the project, what we call science modules: the pieces where the physics actually lives. "Hey, if I have this piece of information and this one, it probably means this is an asteroid, or this is a supernova." So we outsource the classification to the community. If you think about it, it's as if you were asking your users or customers to change the source code of your application to get a tailored experience. And you know what, it works. It works very well. We have dozens of contributors. They come, they see the point, they modify things, and they get exactly what they want out of Fink, because I cannot do it for them. But there is a question, the question everybody should ask every day, that we ask ourselves every day: when will this break? When, and where are the failure points in this diagram? It's nice, it works, it does what it's supposed to do, but will it do it forever? The difficulty, of course, is the long run. It's easy to build quick and dirty things; it's way more challenging to keep them running for years. And Fink is set up for basically the next decade: the telescope will keep observing and we have to serve the community. First, the number of maintainers is really low. Universities don't have much money. If you want to invest, we are nice. But we are not many.
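The community-driven science-module idea amounts to a plugin mechanism: domain experts contribute a small function that maps alert features to a classification, and the broker simply runs every registered module on each alert. A toy sketch of that pattern follows; the decorator API, the module names and the thresholds are all hypothetical, since real Fink science modules run inside Spark jobs:

```python
# Toy plugin registry in the spirit of Fink's community science modules:
# experts contribute small classifier functions; the broker just runs
# them all. (Hypothetical API, not Fink's actual module interface.)

SCIENCE_MODULES = {}

def science_module(name):
    """Decorator used by contributors to register their classifier."""
    def register(func):
        SCIENCE_MODULES[name] = func
        return func
    return register

@science_module("asteroid")
def asteroid_like(alert):
    # An expert's rule of thumb: fast-moving, small brightness change.
    return alert["speed"] > 1.0 and abs(alert["delta_mag"]) < 0.5

@science_module("supernova")
def supernova_like(alert):
    # A stationary object that brightened a lot.
    return alert["speed"] == 0.0 and alert["delta_mag"] > 2.0

def classify(alert):
    """Apply every registered science module; return matching labels."""
    return [name for name, mod in SCIENCE_MODULES.items() if mod(alert)]

print(classify({"speed": 0.0, "delta_mag": 3.0}))   # ['supernova']
```

This is why the speaker can stay "just a simple engineer": the platform owns the registry and the execution, while the physics lives entirely in the contributed functions.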
So deployment and operations really have to be made easy; I don't want to struggle in production. Second, we don't observe all the time. Telescopes don't observe all the time: they shut down, they are closed, lots of things happen. So the throughput, the rate of incoming alerts, ranges from zero to a million, and we want the capability to autoscale. We don't want to deploy the whole infrastructure when nothing is coming in; that's just a waste of power, electricity and machines. As I said, we outsource everything to the experts, the domain experts, and they are really crucial agents for the scientific discoveries: they know where to look. The problem is that they usually have fairly low computing skills. When you ask them to deploy something, they just don't know where to start, and what they provide is often not usable as-is. So you should provide them with high-level abstractions. Definitely. Don't assume they will know all the details you know as a software engineer. They don't, and they don't want to; it's not their job. Also, they come with their code, and often with requirements and dependencies: "I want to use PyTorch, TensorFlow, Keras", you name it, all those huge beasts that first of all produce big images, but are also often incompatible from one project to another. So we really have to think in terms of microservices, not a monolith that will break at some point because you are trying to put everything in the same thing. There is also, and I think the transition started a few years ago, a shift: before, there was an infrastructure, it existed physically, and the users had to know how the infrastructure worked and adapt to it. That is no longer the case. I think we are really at a moment where it's the other way around: the infrastructure should adapt to specific user needs. We are flexible enough; we know how to do different things.
There are heterogeneous architectures: CPU, GPU, cloud, HPC machines, and so on. We should offer users the luxury of choosing depending on what they need. Another thing: we often test the code, but in our case the platform, the infrastructure, is part of the whole thing. The deployment is highly heterogeneous, there are many of those clusters, and they should be tested as well. So end-to-end testing is necessary, including both code and infrastructure. And when we compared this list of requirements to what Kubernetes offers, we said we should give Kubernetes a try and see if it can help. Now I will leave the stage to Fabrice, who will tell you all the implementation details. Thank you so much, Julien. So now I will explain how we leverage Kubernetes to scale the Fink broker. The goal is not only to make Fink work on cloud-native technology; the goal is to understand and use cloud-native technology, Kubernetes and cloud, on academic on-premise platforms. What we learn with Fink we also reuse for other academic projects. For example, we now provide on-demand Kubernetes clusters and on-demand Apache Spark clusters on our OpenStack platforms, we also provide on-demand self-hosted GitHub Actions runners to the academic community, and we run a lot of training on cloud technologies for students and staff. So we want to bring the whole academic community toward cloud-native technology. Now I will introduce how we run a scalable self-hosted CI, because, as Julien explained, the Fink broker is tons of data mining and machine learning algorithms that you have to run, and that's difficult to test in CI. Here is the GitHub Actions dashboard: at each commit, each time a developer commits to the Fink broker code, we run the whole stack; we replicate the production platform at every commit. That way we are sure the code always works.
If you want to see the details, look at each commit: on the runner we install a Kubernetes cluster, we install the Operator Lifecycle Manager, we install the Argo CD operator, then we install Kafka and MinIO, we simulate alerts from the telescope, and then we run the Fink broker to analyze the data with the machine learning algorithms, OK? So you see, this is a huge and complex stack, and we need to be sure it works at all times. To do this we use an open-source tool named ktbx, the Kubernetes toolbox, shown on the bottom right: in four lines you can install kind with Kubernetes, then the Operator Lifecycle Manager, then Argo CD, then Argo Workflows. So you have a Kubernetes cluster up and running in four lines, on which you can bootstrap and run your application, OK? Very easy. Then you can trigger the GitOps procedure with Argo Workflows to install the Strimzi Kafka operator, which works very well, and to install Kafka and MinIO, all the Fink dependencies, which are up and running in the cluster in a few lines of shell, OK? All you have to do is wait for the containers to come up on the cluster, wait for the images to download. For Spark it's a bit more complex, because the Spark startup script is a shell script, a very complex shell script with plenty of options, and as you know, installing an application that way is the imperative way, so it does not work with GitOps. What we have done is write a wrapper on top of spark-submit, with a configuration file, to ease the launch of this complex shell script. This is better than the pure shell script, of course, but it is not yet GitOps, so we are investigating the Spark operator, which is not the official Spark install procedure, but which will allow us to be more GitOps and deploy Spark with Argo CD, OK?
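The spark-submit wrapper mentioned above can be imagined as a small translator from a declarative configuration file, the kind of thing you can version in Git, to the imperative command line. Here is a sketch of that idea in Python; the flags used (`--master`, `--deploy-mode`, `--conf`) are standard spark-submit options, but the config layout and function name are invented for illustration and are not the project's actual tool:

```python
# Sketch of a spark-submit wrapper: turn a declarative config dict
# (loadable from a YAML/JSON file kept in Git) into the imperative
# spark-submit command line. (Illustrative, not the real wrapper.)

def build_spark_submit(config):
    cmd = ["spark-submit",
           "--master", config["master"],
           "--deploy-mode", config.get("deploy_mode", "cluster")]
    # Sorted so the generated command is deterministic and diffable.
    for key, value in sorted(config.get("conf", {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(config["application"])
    cmd += config.get("args", [])
    return cmd

config = {
    "master": "k8s://https://example.invalid:6443",   # placeholder API server
    "conf": {"spark.executor.instances": 4},
    "application": "fink_broker.py",                  # hypothetical entry point
    "args": ["--night", "20240320"],
}
print(" ".join(build_spark_submit(config)))
```

Keeping the inputs declarative is what makes the later move to the Spark operator natural: the same configuration can become a Kubernetes custom resource reconciled by Argo CD instead of a command line.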
It's a bit difficult to run Spark, a bit difficult to run batch jobs with GitOps, but the operator provides a technique for that. Also ciux: ciux is a very lightweight tool which helps a lot in our CI and also in our production deployment. Because, you see, sometimes in your repository you don't change the source code: you change the integration test procedure, you change the test data, but not the source code. So what's inside the image will not change, yet you normally have to wait for the image to rebuild even if nothing has changed. ciux avoids that: it uses the Git API to check whether some part of your code has changed, and if no code has changed it will not rebuild an image with the same code, so the CI build step runs in a few seconds and you can go straight to the CI step which runs the end-to-end tests, OK? For the end-to-end tests ciux is also very important, because we have different projects: the main project, but also other projects under development in other repositories. We are microservice-based, and we have to install a consistent stack of versions across all the microservices. ciux tracks the versions of all your microservices, so you know what you have run during your CI. It will fetch the source code and build it, or fetch Docker images if you need Docker images, and install the other products. So you see, in one line you can install all the dependencies for your microservices under development, track them, and log them in your CI, so you know exactly what you will deploy in production. What is also very important is the self-hosted runner, because, you see, Fink requires lots of disk and lots of memory even to run one end-to-end test, and a standard GitHub Actions runner does not scale. So the self-hosted runner is very important to scale and to be able to run Fink with all the science algorithms. The self-hosted runner runs on a Kubernetes cluster, OK, which is inside our
OpenStack platform, and it's based on ARC, the Actions Runner Controller, which works well now. Each time you commit on GitHub, it starts a pod, your CI script runs inside the pod, and we install kind, Kubernetes, OLM, everything we have seen previously, inside the pod, OK? We are not fully optimized at the moment, because we run kind inside the pod, so we run Kubernetes inside Kubernetes; we are investigating vcluster to create a virtual cluster instead of having kind inside the pod, to be more optimized. It works, but we are improving it. The CI process is powerful, but, you see, you reinstall the production platform each time you commit, and you have to maintain it. It means you have lots of problems with the CI, because it does a lot of stuff, so you are constantly maintaining the CI to work with up-to-date versions of your code; honestly, you spend a lot of time waiting for the CI to install everything, and then you debug. The good thing is that you don't have to debug an environment you set up from scratch each time there is a bug. The other good thing is that when you deploy in production, you push a button and it works. The time you spend on the CI you don't spend in production, because production is easy to set up once you have set up your end-to-end tests in CI. So you see, the self-hosted runner brings us scalability, so we can have plenty of resources for running CI at scale, and we also get interactive debugging, which is very interesting. Installing the production platform, or a development version of Fink on a new Kubernetes cluster, can take maybe two hours; here all you have to do is wait, and if you have an error in your CI, you run two lines: you get the kubeconfig, you connect to the CI cluster in two lines, and you are in the Fink environment with all the microservices running. Here, for example, you see on this line that I connect to the CI and I see there is an error on my pod, OK? If I do not have interactive access,
I have to reproduce all of this on a virtual machine, and it would take maybe two or three hours if I made no mistakes. So this is great for debugging, very easy. Then, you see, I can watch the logs for my pod, OK? There is an error in the Python code, or in the Python dependency management here, so I can fix it. So what you do now when you debug: you push a button, you wait, you look at the logs, but you do not have to reinstall everything; the CI does all of this for you in a deterministic way. This is powerful. With a GitHub-hosted runner you can't do that, because it does not scale and you don't have interactive access; with a self-hosted runner you can have interactive access, which is interesting for CI. As a conclusion, I will say that Fink needs to run until 2035, so we need to use stable technologies, OK? So we are very careful with the Kubernetes ecosystem, and we try to stick with graduated projects, so as not to put too much energy into a project that will disappear before the end of the maintenance period we have. We expect hundreds of users watching the sky, OK? People all over the world will embed their algorithms inside Fink, so it's an exciting challenge. On the Kubernetes development side there is still a lot of work, but we have something which works on our integration platform and in the CI, something which works and which we need to improve. And it's a great opportunity to learn how to better operate academic on-premise clouds; we want to use on-premise infrastructure, OK? It's great, because even when you have maintenance at every level, you understand pretty well what's happening. So it's a great challenge. Thank you so much for your attention, and feel free to ask any question. Quick question: you already mentioned that you were planning on sharing with the scientific community. Are you also planning on extending your current cluster to allow other telescopes to add their events to your cluster, kind of working towards one big system that allows you to combine all the astronomical events, making it easier to combine data
or whatever? That's a good question. Yes, we should bring in as many observatories, telescopes and surveys as we can. However, having a single system for everything is definitely not what we want, for many reasons. The first reason is mostly sociological: having one system with one team is less flexible than having multiple teams. And there are other brokers than Fink that are doing, I wouldn't say the same thing, because they do it worse, but they are doing similar things, and that's probably great for diversity, because manpower is limited as well. If there are other teams doing orthogonal things, it's good. But yes, definitely, the idea is still to converge and share as much as possible, because what you do as a scientist will probably benefit another scientist. Imagine, for example, that I am tracking gamma-ray bursts; they are very quick explosions in the sky, and they can easily be mimicked by something else that just flashes, like an asteroid or some other things. So if someone else is working on my contaminants, on asteroids and so on, and they can flag them for me, I don't have to think about those things, because they will be identified and removed by someone else. So yes, having more teams is definitely a good thing: one plus one is more than two in that case. I have a question over here. First of all, thank you for the presentation; I think it was the best one I've seen so far, real problems and real challenges. I have a question more about the organizational structure: how was your team created, how was it funded, and what kind of organizational challenges do you have? Because I think you have very limited funding, and you need to work around this and still scale all this up. Thanks for the question. The project was created in 2019, after the observatory I showed you at the beginning issued a call. They had this data volume problem: they knew they would have to send several terabytes every night, millions of alerts, and they didn't know how to do it, so they issued a
call. Meanwhile, at the university there are research labs doing R&D, and we were working on similar problems, streaming at scale with Spark, so we said, hey, that's actually a good use case. We started contacting scientists to explain that we had a solution to their problem and to propose a partnership; that was the first thing. Then we had to find funding, and there you need to convince funding agencies and research institutions. That's challenging; they don't have a lot, but still there is some money, I won't lie. Currently we are funded until 2035 in terms of hardware. To give you an idea, that's about a million euros for Fink in terms of hardware, including all the cloud infrastructure I showed you, the replacement of the hardware and so on, for the next ten years. And now we need people. Do you want to join us? That's difficult; it's difficult to attract people. We can still brainwash students and tell them we are better and they should stay with us, but clearly it's highly competitive: whenever they learn Kubernetes, Spark, Kafka, they become attractive on the market, so it's difficult to ask them to stay. There is a huge turnover. At this point, thank you. Yes, I had another small question: in the beginning you showed the processing pipeline from data to event alert, and I was wondering if the end users only get the event alerts, or are they also able to query the HBase archive, and if so, how do you provide that service? So they can do both: they receive alerts in real time, and they can go back to the archive. Maybe I can show, yeah. That's challenging too. The real-time part is easy, because they can use their favorite tools; for HBase, of course, we had to provide an API for them. I would say the scientific community is more used to SQL-like things, so when we give them something on HBase they say, what's this syntax? So on top of the API we provide a lot of abstraction layers, so that they can formulate their query in, let's say, a
meta-language that they understand, and then we translate that into HBase queries. One of the difficulties with HBase, if you don't know HBase: in a nutshell, tables have only a single row key, so you can index on only a single thing. The problem is that the queries people formulate are very rich, OK? For example, your key can partition the data on sky position, but if you formulate a query on time, it will just take forever. So we had to use, let's say, clever techniques in HBase to enable queries as rich as possible, and I would be happy, in another talk, to explain all the clever things, or stupid things, we've done with HBase. Yeah, thank you very much for the talk, it was very, very interesting. You were just a second ago talking about the longevity of the people contributing to maintaining this pipeline, and I saw you have the CD repo for this on your GitHub. I was wondering: are you trying to solicit external contributions to help maintain it, from the other institutions using the data? Or are you thinking that you'll just make this public, but really be the ones mostly controlling the direction of the platform, rather than of the subcomponents that go into it? That's a good question; I don't think I have a final answer, and maybe Fabrice will add a word on that. All the code is open source, infrastructure and science. Usually, when people come, they will just open a pull request: "hey, I have this idea." The difficulty is that often they don't know how to formulate it, or how to integrate their piece of code into Fink. We have a lot of tutorials, how-tos and so on, and a lot of templates, to explain how it should be done, but it rarely works the first time, so we spend a lot of time working with them. We have a dedicated team, basically, that spends time with the scientists; they sit down and we first work out the details: what are you trying to do? Then the requirements, in terms of computing, for example, and in terms of dependencies: do
you need deep learning, machine learning, or something very simple? The outcome would be very different. Typically, after a month or two, we can hand everything off to the science team, because they have learned everything they needed to learn, and then they can live on their own and contribute without our assistance. But yes, it doesn't work the first time; it takes a few months. I will also talk about the technical side of contributing. On the technical side there are two open-source tools. This one, ktbx: we are happy to take on new contributors; it's simple code, so if you open a pull request, of course we are happy to integrate it, because I think it covers broad use cases for people who use Kubernetes. So yes, we are happy to add new features to the tool. The other one is more dedicated, so I don't think it will interest many other people, except maybe people who use Spark. And ciux is also interesting for CI, because, you see, I have worked on CI systems for a long time now, and these two tools, yes, I think they are useful: most of the time you fix things with shell scripts, and you end up with complex shell scripts that you need to move across all your CI platforms, version, and deploy. These tools are easy to deploy, easy to version, and provide good features. Pull requests are welcome; if you want to contribute, we will be happy to welcome you to these projects. Thank you for sharing this. I was wondering: what was the architectural choice behind HBase together with Apache Spark? Because there is also Apache Flink, and I was just wondering if there were features within Apache Flink which could potentially replace both HBase and Apache Spark. Thank you for the question. You have to make a choice at some point. Concerning Spark, back in 2019 we had a proof of concept that was working with Spark, so that was definitely the technology we wanted to use, and here we are in 2024 and we are still happy with Spark, so definitely
it's part of the stack and it won't be removed. HBase: I hate HBase, I just hate it, every day. But the team is small; among the engineers around us at the time, we had one who had worked with CERN, the high-energy physics collider lab in Geneva, and who had huge expertise in HBase. We didn't know what to choose at the time, there were many different options, but we had this person with that knowledge, so we made this choice. I'm still not sure whether it was the right choice, or whether, back in 2019, it would have been better to take some training on something else. You mentioned a few names: yes, definitely, we are looking around. The stack of the project is not carved in stone. We are set to run until 2035, because the observatory will run until 2035, but every six months, basically, we reconsider every piece of the stack and start to think: in one, two, three years, will it still be there, and do we need to change something, or take some training? Thank you. Yes, in addition, I would say that our goal is to deliver a service to astronomers; we are not attached to particular technologies. If a technology comes along that is better than the one we use now and gets mature, of course we will switch, but for now this works well, provided we have the time and human power to switch. It's easy to get locked in. I have a follow-up question on HBase: I was wondering how you host this service. Is it collocated with the Kubernetes cluster, is it running on VMs, or is there a third party that helps you set it up? Because from experience I know that hosting HBase, actually having it in production, is not a nice thing. No, it isn't, and it's usually very costly as well, on the operations side but also on the support side. So I was wondering how that looks on your side. For now the deployment is bare metal; we don't use Kubernetes for that. We have an HDFS cluster, and we deploy HBase, I would say, manually; of course we have tooling that helps, but yes, this is definitely one of the weaknesses, so most
of the time there is some problem: a huge shuffle that HBase suddenly decides to trigger, we don't know why, and everything is paralyzed for hours. So yeah, that could definitely be easier. Also, yes, for databases: no, we don't run HBase in Kubernetes, but databases are running better and better in Kubernetes, so when we move databases to Kubernetes we will investigate whether we switch, or whether we run HBase there. I don't know whether that stack works well in Kubernetes right now, and maybe there is some other technology, but databases run pretty well in Kubernetes now, thanks to operators, so we need to study that. Thank you very much, I think time is over, I see a zero zero zero. Bon appétit!