again. Yeah, Cheryl, go for it please.

Cool. So Ricardo, I know you and I have collaborated and worked together in a few different ways now through the CNCF, but it'd be great if you could tell us a little bit more about your role, what you do within your team, and something about the infrastructure at CERN.

Yeah, so first of all, thanks for the invitation. I'm super excited to participate in this. I'm always happy to tell people about CERN and our use cases. I'm a computer engineer in the CERN Cloud team. I focus mostly on containerization and networking aspects, but also on accelerators: GPUs and other types of accelerators. More recently I've tried to help our users with machine learning use cases. I've been here for a while and had different roles, but right now that's what I'm doing mostly.

We have a pretty large infrastructure. I'm sure most people have heard of CERN: we are a big particle physics laboratory, and we have large scientific experiments that do cool things like accelerating protons to close to the speed of light and making them collide at specific points. The result of all of this is a lot of data, and that's why we have a large infrastructure to store and process it. We generate something like 70 petabytes of data a year, and we have more than half an exabyte already in-house. To analyze all of this, we have a large private cloud with more than 300,000 cores. And that's actually not enough for our requirements, so over the last 20 years we've built a large distributed computing infrastructure that more than doubles that capacity. This is done by collaborating with more than 200 centers around the world that help us with our tasks.

What was it that prompted the adoption of cloud native?

Yeah, so we've been doing big data things for many years now.
So we are always looking for the next thing that can help us be more efficient and basically process more events per second, which is our main measurement. The main thing is that we saw the potential to use these kinds of technologies to improve our infrastructure and be more efficient. First, in terms of simplifying our infrastructure, because we could rely on a uniform API that is declarative, separates the workloads well from the infrastructure, and allows our users to just tell us what they need. Then the infrastructure itself decides how to replicate the services or the workloads, and also does things like cluster autoscaling.

The second one was the ecosystem around it. We started with Kubernetes, but there's a very rich ecosystem that handles things like logging, metrics, even alerting. It gives us a very consistent set of tools for all our tasks, which is also very important, of course. Maybe a third one: I mentioned this distributed computing infrastructure that needs to scale out. The fact that the Kubernetes API got so popular also means there are a lot of options for people to deploy elsewhere, and the public clouds support it too. So we can more easily integrate external resources into our systems, and then instead of focusing so much on the infrastructure, we can focus more on the physics side.

That's actually a very good introduction to your infrastructure. You've mentioned before that you manage more than 600 clusters at the moment and have thousands of nodes. Would you be able to deep dive a bit more into some of the technical challenges you face managing infrastructure at scale? Do you have any challenges around access management, networking, storage? Just some examples of what exactly you have to solve on a daily basis.
Yeah, so one of the main reasons for this setup is to have a consistent infrastructure where we do the integration with our internal systems once for everyone. This includes identity, but as you mentioned it also includes networking and storage. The challenges are really quite varied. One of the main ones is the huge variety of use cases we have to cover. We have the more traditional IT services, the administrative services and the things that keep our campus going. That's not something we necessarily associate with CERN, but we actually have more than 10,000 physicists working here all the time, so it's a challenge to support that too. There we have to focus on availability and making sure those services are always up. On the other hand, we have to scale out for the physics analysis, so we need to be able to distribute workloads really fast and push a huge number of workloads into these clusters. I would say that having clusters that serve both use cases is a challenge; we need to be able to configure the clusters very differently depending on the targeted use cases.

As for technical challenges, the main one, which we've had for several years now, is managing upgrades. For application upgrades and deployments, we promote GitOps as much as possible, to be able to deploy the applications to different clusters. We also promote treating clusters as cattle internally, so that if a cluster breaks for whatever reason, people don't lose the service completely; they might just lose some capacity. It also means they can easily upgrade by moving their applications to new clusters. I would say those are the two main areas where managing many clusters is challenging for us.
You've actually mentioned two very good areas I'd like to ask a bit more about: GitOps and clusters as cattle. When managing multiple clusters, the usual choice is between managing multiple namespaces and distributing those to different teams, or having multiple clusters. Would you be able to elaborate a bit more on how you set up those clusters and what your tools of choice are to manage them overall?

Right. So we have two different types of deployments. One is one big, centrally managed cluster using OpenShift, which does isolation using namespaces and is very much centrally managed. Then we offer a service that is more Kubernetes as a service, where we deploy Kubernetes on top of our lower-level OpenStack deployment. Users can create their own clusters using an API that we provide, which allows them to create a cluster but also scale, upgrade, and delete it, so they manage the whole lifecycle like this.

Then on the layer above, for the tools we use in these clusters: we deploy Kubernetes, but alongside it we deploy Prometheus for monitoring and a log collection pipeline to manage the logs, a pretty standard stack I would say. On top of that, we promote using Helm as much as possible to manage applications. That's the first step towards the next level, GitOps, where we give internal tutorials and seminars to disseminate best practices so that people can use something like Flux or Argo CD to manage deployments across multiple clusters. So it's several steps: from the infrastructure side we focus a lot on making the clusters as reliable as possible, and then we disseminate best practices so the end users can do the best for their applications. Amazing.
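To make the "whole lifecycle via an API" idea concrete, here is a minimal hypothetical sketch in Python. The class and method names are invented for illustration; CERN's actual service sits on top of OpenStack and exposes similar create, scale, upgrade, and delete verbs, but this is not its real API.

```python
# Hypothetical sketch of a "Kubernetes as a service" lifecycle API.
# The names below are invented; they only illustrate the verbs
# described in the interview (create, scale, upgrade, delete).
class ClusterService:
    def __init__(self):
        self._clusters = {}

    def create(self, name, nodes, version):
        """Register a new cluster with an initial size and version."""
        self._clusters[name] = {"nodes": nodes, "version": version}
        return self._clusters[name]

    def scale(self, name, nodes):
        """Grow or shrink an existing cluster."""
        self._clusters[name]["nodes"] = nodes

    def upgrade(self, name, version):
        """Move a cluster to a new Kubernetes version."""
        self._clusters[name]["version"] = version

    def delete(self, name):
        """Tear the cluster down entirely."""
        del self._clusters[name]

svc = ClusterService()
svc.create("physics-analysis", nodes=10, version="1.20")
svc.scale("physics-analysis", nodes=50)   # burst for an analysis campaign
svc.upgrade("physics-analysis", "1.21")
print(svc._clusters["physics-analysis"])
```

The point is that users drive the whole lifecycle through one small API surface rather than touching the underlying infrastructure directly.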
That's actually a very nice segue into developer experience and how we can improve the application deployment lifecycle, because there's been a lot of thought about how to transition our engineers to work with YAML, maybe to work with templates, like with Kustomize or Helm, and now there's a big movement and shift towards GitOps. You've mentioned already that this practice is enabled, so would you be able to describe a bit more how exactly an application is deployed with GitOps, whether you use any particular tools specifically, and what kind of support you provide for your developers?

Yeah. As I mentioned, GitOps is not something entirely new for us; we've been doing similar things with the other configuration management systems we've had, like Puppet, and previously others, even with other types of version control. It's something we are used to. One of the main motivations here was to get people interested and onboarded into our Kubernetes deployments. Sometimes the learning curve can be kind of hard, so if they can go straight to version control, where they have a bit of YAML describing how things should look, and then build on that, it bootstraps their entry into the Kubernetes world. That was one motivation.

What we try to have is a single source of truth for all the application definitions. This covers the defaults for all the environments we have, but also allows customization per environment or per cluster. We used mostly Flux at the start; there are quite a few users now that also use Argo CD, so we don't mandate one of them. What we try to do is have people using either Helm or Kustomize as much as possible. We still have a couple of deployments that use plain manifests, and that also works with GitOps, but it's not as customizable.
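The single source of truth with per-environment overrides that Ricardo describes is essentially a layered merge, the same idea behind Kustomize overlays and per-cluster Helm values files. A minimal sketch in plain Python, with made-up application values (this is illustrative, not CERN's tooling):

```python
# Layered configuration: shared defaults plus a per-cluster override,
# merged so the override always wins. This mirrors what Kustomize
# overlays or per-environment Helm values files do.
def merge(base, override):
    """Recursively merge `override` into `base` (override wins)."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result

# Hypothetical application definition kept in version control.
defaults = {
    "image": "registry.example.org/app:1.4.2",
    "replicas": 2,
    "resources": {"cpu": "500m", "memory": "512Mi"},
}

# Per-cluster overlay: the production cluster runs more replicas.
prod_overlay = {"replicas": 6, "resources": {"memory": "1Gi"}}

rendered = merge(defaults, prod_overlay)
print(rendered["replicas"])           # 6, from the overlay
print(rendered["resources"]["cpu"])   # inherited from the defaults
```

Everything, defaults and overlays alike, lives in version control, so a GitOps agent like Flux or Argo CD can reconcile each cluster against its own rendered result.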
So we do a lot of internal training, where we advertise repos with small tutorials that show how to use these tools internally.

Cool. I wanted to ask a little bit more, Ricardo, about observability, because we did a CNCF End User Technology Radar on observability in September 2020 where we asked people what observability tools they were using. Obviously, since you run 600 clusters, you must be observing those. So can you tell us about the tools that you're using and the approach that you take?

Yeah, so we took an approach of trying to integrate with the infrastructure we already had for other ways of deploying applications. This was crucial so that people could migrate their workloads gradually. Our infrastructure for collecting logs and metrics already existed, and we have gateways where we can push both logs and metrics. For logs, we use Elasticsearch as a backend for short-term log storage and querying, and then we have an HDFS backend where we store logs for longer-term analysis. For the metrics it's a similar situation: there's a gateway where we push the metrics, and this splits into multiple destinations, which include InfluxDB and HDFS as well. When we started introducing Kubernetes, Prometheus took on a huge role, and we are able to push the same metric types from Prometheus sources into these gateways, where they are centrally collected.

One interesting part here is that we offer our users two levels of observability. In cluster they get very fine-grained metric collection, kept for something like two weeks; this is configurable, but on average I would say two weeks, and it allows us to debug very recent problems in detail. Then what we push centrally is an aggregation of these metrics for longer-term analysis. We actually do this using standard Prometheus federation for now.
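The two-level scheme, fine-grained short-retention metrics in cluster plus aggregates pushed centrally, amounts to downsampling. A toy illustration in Python (in practice Prometheus federation with recording rules does this work; the sample values here are invented):

```python
# Toy illustration of two-level metric retention: keep raw samples
# briefly in cluster, ship only coarse aggregates to central storage.
from statistics import mean

def downsample(samples, window):
    """Aggregate (timestamp, value) samples into fixed windows,
    keeping one averaged point per window for long-term storage."""
    buckets = {}
    for ts, value in samples:
        buckets.setdefault(ts // window, []).append(value)
    return [(bucket * window, mean(values))
            for bucket, values in sorted(buckets.items())]

# Raw, fine-grained samples scraped every 15 s in the cluster...
raw = [(t, 100 + t % 30) for t in range(0, 300, 15)]
# ...become one 60 s average per window for central, long-term storage.
central = downsample(raw, 60)
print(len(raw), "raw points ->", len(central), "aggregated points")
```

The in-cluster store keeps `raw` for debugging recent problems in detail; only `central` travels upstream, which is what keeps the central store cheap across hundreds of clusters.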
So the registration of the cluster is still a manual step, but one of the things we are looking at as we speak is projects like Thanos or Cortex, where we could do a more dynamic registration of the endpoints centrally.

Awesome. I was actually about to ask if you use anything like Cortex or Thanos to manage Prometheus at scale, but it seems like that's already in your pipeline. The other thing I wanted to mention is that if any of the attendees have questions, please just ask them; we'll be monitoring them and will try to ask them live if possible.

The other thing I wanted to focus on is community, because you seem to have already solved multiple challenges across different stacks, be it infrastructure or observability. I know that CERN has given a lot of talks at past KubeCons and CloudNativeCons, including keynotes and regular sessions, and we're going to have KubeCon Europe in May, from May 4th to 7th, where there are going to be three main talks delivered by CERN employees. The first one is going to be focused on CRDs and operators and how they can be used to manage Drupal websites. This is going to be given by a colleague of yours, so if you have any details about that one, I'd definitely ask you to elaborate a bit more. But the session that you'll actually be delivering is going to be on managing a centralized machine learning platform using Kubeflow. Would you be able to give a short summary of these sessions and why KubeCon attendees should sign up for these talks?

Yeah, I'm happy to, of course. I would say that's one of the things that is pretty special about this usage we have of cloud native: it touches everything at CERN, all the groups. We use a lot of open source.
We are very much users of open source technologies, but having a set of tools that touches so many groups in our community is kind of special. For the talk from Konstantinos and Verulia about managing Drupal sites: this is something that is quite important at CERN, because dissemination and communication are very important for us. We run tens of thousands of websites. Some of them are extremely visible, like the home page of CERN, and it runs on this infrastructure. What was developed is really an operator that allows us to not only deploy a website but manage the lifecycle of these websites: upgrades, backups and restores, things like this. So they will give you all a pretty cool talk about this, I'm sure.

Then for the machine learning talk: machine learning is really having a huge impact at CERN. There are many groups looking at it for different areas. There are use cases from particle classification in the detectors, to generating simulation data at a very fast pace, to things like reinforcement learning for beam calibration, things that I try to help with but don't always completely understand. The idea here is really to help our users. It's very common that a group will have one node with one GPU, or a couple of GPUs, and they can do their training there. But if they want to scale out, they don't necessarily have the infrastructure or the knowledge to do that. So we wanted to offer a service that helps them with this. Also, if we start thinking about multiple groups having these custom deployments, the resource usage is not super efficient. So we also wanted to bring everything together so that we can improve the efficiency of using resources that are kind of scarce these days, like GPUs and other accelerators.
So we try to offer a service, and we will describe how we do this, that manages the whole lifecycle of their machine learning workloads, from preparation of the data all the way to serving. But also how we integrate different types of resources, even types of accelerators we don't have on premises, like TPUs and IPUs, things we get from public clouds. I'm looking forward to talking to people with similar requirements afterwards as well.

That sounds like a great talk, actually. I think I need to build out my schedule for KubeCon. I'm sure there are plenty of other great ones; that's always the problem with KubeCon, so many talks indeed. But this one is definitely a great one. It sounds like a very good deep dive into how Kubeflow can be of help in a real use case.

Cool. Something I really appreciate about you, Ricardo, and about CERN, is how open you are to sharing what you've done and what you know with other companies and helping them on their journeys. I know you have another talk about the CNCF Research User Group, and you and I have collaborated within this for probably a year or two now. So can you tell us a little bit about the CNCF Research User Group, how you got involved with it, and what your experience has been?

Definitely. Yeah, I think that's one of the advantages of working at CERN: we are able to be really open, and this is encouraged, about what we do, and we really want people to learn more about our main mission but also about how we do things. The Research User Group actually came up after a lunch break during KubeCon Barcelona, with Bob Killen and other people, and we thought of having a group dedicated to the more research-oriented use cases, which are kind of different from the more traditional IT service deployments.
This includes things like managing batch workloads, having queues and ways to do fair share, and focusing on accelerators. Then we talked to you, and as always the response was very positive, and we kind of started this group. The goal is really to get together all these different institutions, academic institutions, and companies that have similar requirements, to learn from each other, but also to document the use cases we have and the solutions different people use, and then maybe give feedback to the whole community on how the systems can be improved to serve these use cases. It's been really exciting to be part of it. I think we have pretty good momentum right now, with a set of regular attendees and newcomers all the time, so I think it's quite exciting.

It definitely is exciting. I really enjoy learning about it. As you said, we have this core group of people who have the same use cases and therefore run into the same challenges, very openly sharing what they do. And I want to mention that you are one of the co-chairs for this group, so you've really been a leader for it, and it's been fantastic to see that.

Yeah, and it's very interesting to see; it really shows another benefit of this cloud native community. CERN has had these big data requirements for many, many years, but if we look 20 years back we were kind of by ourselves, and now there are so many people with similar requirements. We can really benefit from each other and then focus on other aspects, more physics-related ones, and not so much on the infrastructure, because there's so much momentum here.

So, as Katie mentioned, it does seem like you've solved a lot of challenges, but what do you see as future challenges? What's next for you at CERN?
Yeah, so we have one internal challenge that is pretty big: we have a big machine and we generate lots of data, but we are about to have an upgrade in a couple of years that will multiply this by 10, and it's not totally obvious how we can handle it. So we do what we always do: we look around and try to find solutions, and we work together with everyone to try to tackle these challenges. If we look at the integration with cloud native technologies, there are the things I mentioned before, like supporting batch workloads in a more advanced way, things like queues and doing fair share on workload submissions. This is very important for us; our current systems do this, and if we manage to integrate this functionality into the ecosystem, it will give us all the other benefits of scaling out and integrating external resources in a much easier way. The second one, again, is that we are moving a lot of our workloads from CPUs to GPUs and other accelerators, so support for these types of resources is extremely important. It's not only integrating these resources but even virtualizing them, so that we can partition them and make better use of them. I think these are the areas where we will be focusing quite a bit in the next year or couple of years.

Fantastic. I know that you are probably a little bit unusual in the infrastructure that you run, but if you were giving advice to somebody today about where to get started, what advice would you give to someone who's just getting started with cloud native, given what you've learned?
Yeah, so from our internal experience: resources to learn about the details and all the tooling are plentiful, but one important thing is to focus on transitioning gradually. We've seen internally that any kind of dramatic transition will always be very complicated. So make sure that all the required integrations are done properly first, and then gradually transition the services to the new infrastructure. Starting slow is very, very important, I would say.

Since we touched upon the community, I have another question about CERN's experience as an end user member. Would you be able to describe how you find the end user community and your engagement within it? Has it been helpful for you to reach out to other companies within the same industry? Just a bit about how you find this experience overall.

Yeah, so, as I mentioned earlier, one of the main benefits of being part of such a huge community is that a lot of the problems we had to solve internally by ourselves are now shared problems. It's extremely easy to approach other end users these days, in calls and the different groups that are organized. Joining the special interest groups or the end user calls is a good way, or Slack of course. But one thing that I really appreciate is also the conferences and the events, where the direct contact makes it really easy to engage with other communities and other end user members. We've had several follow-ups from lunch or coffee breaks at the conferences that turned into full-day sharing and exchanges between members of the different institutions. This has been great.

You definitely make me miss the in-person conference experience. Having coffees, yeah. Hopefully it's not too long, I hope
so. Definitely, I can only echo that. I also very much miss the coffee chats during the networking sessions at KubeCon; hopefully it's some time soon. Another thing I wanted to mention is that you have recently been elected to the TOC, the Technical Oversight Committee, so congratulations on your new position. Would you be able to tell us a bit more about your responsibilities within this role? Are you excited about it? How are you feeling about being in this position?

I'm very excited. It's been a very quick start, jumping in, but I'm learning about all the tasks and all the things to do. It's super interesting because I get exposure to a much wider set of projects and communities that are part of the cloud native ecosystem, which is something I could do externally, but I wouldn't necessarily have the time to focus on; now I can dedicate some time to it. Really, the task is to make sure that everything coming into the ecosystem integrates well together. In my case especially, I'm part of the TOC as an end user representative, to make sure we take the requirements of the end users into account, not just the wishes of the projects, and try to match those against the expectations of the end users. It's very important that we have a vibrant ecosystem with new technology coming in all the time, and it's also important that we keep our end users happy and don't constantly break systems just because there's a new one. So my role is really to try to bring an end user perspective into these decisions.

It's so critical to have that end user perspective, because there can be a lot of discussions about what the right thing to do is, but once you have an end user who says, I actually have this problem, and I see other companies who have
this same problem, then that cuts through a lot of that and keeps things moving. So that perspective is really, really important, and it's something that I think will grow over the coming years.

Yeah, and one thing that I keep saying internally as well is that all the work done in the TOC on evaluating the projects, and even giving them a maturity level, is extremely helpful for end users, because you know what to expect and you know what the criteria are for the different levels. So you can already approach a project with some kind of assurance of what to expect from it.

Personally, I'm definitely looking forward to your contribution as a TOC member to the entire ecosystem, so it's going to be quite exciting. Another question I have, maybe to wrap things up: since you are a TOC representative and CERN is a very big user of cloud native technologies, I'd like to ask if there are any new projects, methodologies, or practices on the horizon of the cloud native ecosystem that you'd like to use in the future, or something that draws your attention or your team's attention in particular. Is there anything you would highlight so far?
Yeah, so I can give a bit of my personal experience, but also what I hear from my colleagues, which is maybe also relevant. One project that I've been hearing about is OpenTelemetry, and how it aims to offer a more consistent way to handle all the observability parts. This is quite important for us: we have a good story for logging and metrics, but we probably don't have such a consistent story for tracing, so this is something some of our groups have been focusing on. Then personally, because I work a lot in this area of distributing workloads across multiple clouds and multiple clusters, one project I've started looking at recently is Crossplane. I think it's a nice way to integrate, again, existing infrastructure or existing services into the same ecosystem, and to make the APIs uniform and well managed in a central place. So that's another project I'm kind of excited about as well.

Absolutely, especially with Crossplane; I think there's going to be dedicated content on it at the upcoming KubeCon, so definitely looking forward to that one as well. And we actually have a question from Jerma, who is asking: are you using different public cloud providers to deploy the whole infrastructure, and is there communication between them? Would you be able to elaborate a bit more on that?

That's an excellent question. We are mostly on premises, or using data centers that are shared with us, that other institutions make available for us. We are investigating the use of public cloud, and we've been doing this for quite a while. We have several use cases where we try to scale out, and we gave a keynote about this where we managed to reproduce the Higgs discovery in just a couple of minutes using public cloud resources, in that case Google. In reality, what we want is to do the most with the resources we can access, so we will keep pushing on public cloud.
The other question, regarding communication, is a very good one, and I come back to the fact that we don't have traditional service workloads for most of what we want to scale out. We often say that our workloads are embarrassingly parallel, because we can just distribute them and they don't need to talk to each other. This allows us to distribute them even across different clouds in a much simpler way than if we had to have very tight interconnectivity between the different systems, so we are able to scale out more easily. This is true for our traditional workloads; machine learning workloads are posing new challenges, because they often need this sort of communication, so we are also investing in understanding the service mesh area and how it could contribute to our deployments.

I just want to mention that keynote at KubeCon; it was amazing. I do remember it, it was in Barcelona as well, and it was a live demo that used an amazing amount of compute within minutes, so I definitely recommend watching that keynote on YouTube. So, is there anything you'd like to highlight, or maybe some talks you'd like to recommend for our listeners to go to during KubeCon? Anything you'd like to leave us as last remarks regarding CERN and its journey towards cloud native?

Yeah, honestly, I haven't had time to go through the schedule in detail yet; I plan to do it very soon. The issue is really being able to watch all the talks, even later; it's very hard to find the time to go through all of it. But the last message I would leave here is that the community around cloud native is making a huge contribution to our mission.
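To make the earlier "embarrassingly parallel" point concrete: because each event can be processed with no coordination, fan-out is trivial, and the same shape maps onto many clusters or clouds with no interconnect requirements. A toy Python sketch with a stand-in workload (illustrative only, not CERN's actual processing code):

```python
# Toy fan-out of independent "event processing" tasks. No task talks
# to another, so the pool below could just as well be many clusters
# or public clouds, with no tight interconnectivity needed.
from concurrent.futures import ThreadPoolExecutor

def process_event(event_id):
    """Stand-in for analyzing one physics event independently."""
    return event_id, sum(i * i for i in range(event_id % 1000))

def run(events, workers=8):
    # Each task is independent, so results can arrive in any order
    # and from any worker; we just collect them into a dict.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_event, events))

results = run(range(100))
print(len(results))
```

Machine learning training is the counterexample: workers there must exchange gradients, which is exactly why it reintroduces the communication and service mesh questions mentioned above.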
I mentioned that we have this mission of doing fundamental research and understanding the universe better, but it implies a lot of technical challenges, including IT challenges, and the fact that we can get so much help from the worldwide community, all these institutions that collaborate on cloud native projects, is a huge plus. It changes the way we work compared to 20 years ago. So I would stress that message: the worldwide community is having a huge impact on us. In our work we try to contribute back, but what we get is much more.

Well, I want to say thank you to you, Ricardo, because I really admire how much you've contributed in all different ways: being a member of the TOC, sharing what CERN is doing, leading the Research User Group, and just being a really warm and welcoming part of our community. So thank you for being there.

Thank you both as well, for helping out in all areas.

And we actually have a very positive comment from, I'm trying to pronounce that name right, Aya, who mentions that the end user perspective at CERN and the support they've given to the community have been amazing, so thank you very much for mentioning that. With that, I think we're getting towards the end of this stream, and I would like to thank everyone who attended and was with us for this first episode of the end user launch. Thank you very much, Ricardo, for being our first guest, and Cheryl for co-hosting this as well. I would also like to mention that if you have any questions, please ask them in the future streams, and please join us for those as well. We will try to deliver this on the first Thursday of the month at 9 a.m. Pacific time. And please do join us at KubeCon and CloudNativeCon, which is going to be in May, May 4th to 7th, and of course we're going to have the co-located events if you want to focus on more targeted technologies or methodologies as well.
And if you'd like to join our end user community, which is vendor neutral and showcases your usage of cloud native tools, you can find more information at cncf.io slash end user. Thank you for joining us today, and hopefully we'll see you next time.