Hi everyone, hello Benoît. I'm happy to welcome you to this session of the Alephium Tech Talk with Benoît Perroud, who is the CTO of the Analytics Division of Open Systems, but also the Chief Foundry Officer and the person responsible for the Alephium infrastructure.

Hello everyone, thanks for having me here, and I think we can get started.

Maybe just wait a minute before you get started; I just want to confirm that the video is working properly... Okay, it seems everything is fine now. Please go ahead, thank you.

Cool. So I have a few slides to share, I will go through a practical demo of some stuff, and then we'll be open for a Q&A at the end. Let me share my screen first, and let's get started.

So, my name is Benoît. I graduated from EPFL with a Master's in Computer Science in 2007. I have two kids, I'm based in Switzerland, and I'm currently working at Open Systems, which acquired Sqooba, the company I was working for before that, in 2020. I have been building big data analytics platforms since 2010: I went through the whole cycle, from Hadoop and MapReduce up to today's S3- and Kubernetes-based, cloud-native platforms. What changed through the acquisition is that before I was building generic, or general-purpose, analytics platforms, and now I'm building analytics platforms mostly targeted at cybersecurity. I have a strong affinity for everything that is Ops, so building software, deploying software, operating software, and I've been doing this for a very long time; I was basically calling myself DevOps before the DevOps word was coined.

Let me quickly go through what a big data analytics platform is and how it evolved, so that we're all on the same page and can follow the rest of the discussion. A data platform looks something like this: we have data sources, relational databases, customer databases, production databases, and so on. The goal is to get the data into a central, common place, and then start doing analytics on top: transformations and so on. So the workflow, when we talk about a data platform, is always the same: from the data source we extract the data, we store the raw data in a storage layer, and then we run some transformations. A transformation here could be joining different data sources, changing or aggregating data, computing an average, whatever. And then, since the storage layer is not really optimized for the end user, like a data scientist or data analyst, or even a data product, once the properly transformed data is ready, we load it into what I call here a sink, which is basically a special-purpose data store. It could be a relational database, it could be whatever you have in mind for the purpose (there's a minimal code sketch of this flow at the end of this passage).

Notice that I talk about the data platform without saying anything about the business of the company. From my perspective, a data platform is a big data platform, and what the users put on top, their analytics, doesn't change the underlying platform much.

If I go a little bit into the capabilities, or the duties, of a data platform: a data platform should ensure that the data is available at any time. From the data source side, the extraction process should be able to write at any time; the data sink should be able to read at any time; and the data should be there. One of the key indicators here is the time to availability of the data.
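To make the extract, raw storage, transform, load flow concrete, here is a minimal sketch in Python. All names here, the tables, the paths, the connection strings, and the pandas-based helpers, are hypothetical illustrations of the pattern, not the actual platform code:

```python
# Minimal sketch of the extract -> raw store -> transform -> sink flow.
# Table names, paths and connection strings are hypothetical.
import pandas as pd
from sqlalchemy import create_engine

def extract(source_uri: str) -> pd.DataFrame:
    """Extract raw records from a data source (here, a relational database)."""
    return pd.read_sql("SELECT * FROM orders", source_uri)

def store_raw(df: pd.DataFrame, raw_path: str) -> None:
    """Land the untouched raw data in the storage layer first (s3fs needed for S3 paths)."""
    df.to_parquet(raw_path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """A transformation: aggregate, e.g. the average order amount per customer."""
    return df.groupby("customer_id", as_index=False)["amount"].mean()

def load_to_sink(df: pd.DataFrame, sink_uri: str) -> None:
    """Load the transformed data into a purpose-built sink for analysts."""
    df.to_sql("avg_order_by_customer", create_engine(sink_uri),
              if_exists="replace", index=False)

raw = extract("postgresql://source-db/crm")
store_raw(raw, "s3://raw-zone/orders.parquet")
load_to_sink(transform(raw), "postgresql://sink-db/analytics")
```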
So, the time from when the data is extracted from the source until it is available for the analyst or the data scientist on the data sink: this is a key metric. The data should also have integrity: it should not be altered, and it obviously shouldn't be lost, except when we apply a retention policy and deliberately drop data. At the end of the day, the data should be complete and correct too, but this is not really a duty of the platform: the platform should ensure availability and integrity, and the teams working on the platform should, on their side, ensure completeness and correctness.

All these properties can be achieved in a lot of different ways. Usually, a data platform scales up and down: we should be able to add more storage and add more compute. Recent, modern data platforms have a complete separation between storage and compute, so if there's no job running, the compute should go down to zero, and if data is cleansed or purged, the storage should eventually also be able to go down to zero. In terms of time to availability, traditional or historical data platforms did batch processing, while the most recent platforms do stream processing: once the data is available at the source, it flows directly through all the stages of the data platform up to the consumer, so the latency should be down to minutes or even seconds.

A good data platform can also provide data provenance and data lineage: where the data comes from, where it goes, and so on. Data can be tagged with annotations for classification, like "is this PII data?", and this classification can then be linked with ACLs, access control lists. If a user is not allowed to see PII data, then this user should not be able to see transformed data derived from a PII field, for instance (there's a tiny sketch of this at the end of this passage). All these kinds of capabilities should be available inside a proper data platform. And at the end, of course, we would like encryption of data at rest, when it's stored in the platform, and of data in motion, when it's flowing through the network. So this, from a very high-level perspective, is a data platform, and that's basically what I'm building on a daily basis.
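To make the classification-plus-ACL idea concrete, here is a minimal sketch; the tags, users, and column names are hypothetical, and a real platform would enforce this in its query or access layer rather than in application code:

```python
# Minimal sketch: columns tagged as PII, linked to an access control list.
PII_COLUMNS = {"email", "src_ip"}                 # classification tags
CAN_SEE_PII = {"alice": True, "bob": False}       # ACL per user

def visible_columns(user: str, columns: list[str]) -> list[str]:
    """Hide PII-derived columns from users without PII clearance."""
    if CAN_SEE_PII.get(user, False):
        return list(columns)
    return [c for c in columns if c not in PII_COLUMNS]

print(visible_columns("bob", ["email", "bytes_sent", "src_ip"]))
# -> ['bytes_sent']
```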
And I think the moderator has a question.

Yes, I wanted to ask if we can try again to put your presentation in full screen for you, because we have people asking if you can somehow go into presentation mode on your screen.

Is it like that? Does it help?

We still see the next slide on the side; we don't have the focus on the current slide.

I can try to zoom in a little bit. Is that better already?

It has to be better than what it was before, for sure. Let's try like this for now, thank you.

Yeah, so the data platform is something that has evolved a lot since the beginning, and I was lucky enough to start quite early in this journey. I really started with, not the very first version, but a very old version of Hadoop, which had only MapReduce at that time. That was Hadoop version one, from around 2006, which lasted until 2012, something like that. Basically, Hadoop is one of these big data platforms which provides storage and compute for data analytics, and it was one of the first to come out with something usable. The big driver of Hadoop was data locality: instead of reading all the data over the network and then doing the transformation, the processing was sent to the data, in order to execute as close to it as possible and reduce network consumption. Hadoop had this data locality built into the heart of the platform, and it worked quite well at the very beginning.

The main problem of the first version of Hadoop is that the only programming paradigm available on the platform was MapReduce. I'm not sure if you're familiar with MapReduce, but basically it's a three-step processing model where you map, shuffle, and reduce; and if you need more, you map again, shuffle, and reduce again (there's a small word-count sketch of this at the end of this passage). So the primitives, what you can express with such a programming paradigm, are quite limited. In 2012, Hadoop came out with version two of its processing platform, which was called YARN. YARN was meant to be a much more general-purpose processing framework, allowing you to run more than simple MapReduce jobs. It went well, and it really helped Hadoop take off and host other processing frameworks on top, the most famous one being Spark. But the problem with YARN is that there's a really tight dependency between the runtime and the job you're running: you can't run whatever version of Spark you want, you are tied to a Spark version given the version of the runtime you're running on.

And that's pretty much when Kubernetes came out. Kubernetes was released in 2014. People didn't hear a lot about it at first, but it really started rising in 2017, when a big alliance of initially competing companies joined forces around Kubernetes to make it the product it is now. The big advantage of Kubernetes over other orchestration platforms like YARN is containerization. Containerizing an application completely removes the dependency of the processing on the runtime version: it's a container that is executed, and the container carries all the dependencies it needs; it doesn't get dependencies from the platform. Kubernetes today is the de facto technology, mostly because it provides an API which looks like a cloud API but is really cloud-agnostic. Every time you hear in the market "we are cloud-agnostic" for any type of application, it usually means it runs on Kubernetes, and Kubernetes can then run on any cloud.

The last shift that happened in this space is the S3 shift. S3 is an object store, and it started replacing HDFS, which is a block-based store. Again, it's out of scope to explain the details, but S3 really took off, and especially the S3 API, or interface, which is REST-based on the HTTP protocol, unlike HDFS. That makes it very portable and reduces the operational cost of the overall data storage. So we started from Hadoop and MapReduce, and a modern data platform stack now looks pretty much like Kubernetes, with the processing running on Kubernetes and the data stored on S3 or a similar object store.

As I said, I was lucky: I went through all these transformations. I really had the impression that the day I mastered one of these components, for instance MapReduce, something new came out, so I was always running behind the trend. And today it's interesting, because I have a lot of history and I can really explain why this worked, why that didn't, why Kubernetes is the way it is, and so on. Quite an interesting history to follow.
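To illustrate the map, shuffle, reduce paradigm described above, here is a minimal word-count sketch in plain Python; real MapReduce runs these phases in parallel across a cluster, this just shows the three steps on one machine:

```python
# Minimal word-count sketch of the map -> shuffle -> reduce paradigm.
from itertools import groupby
from operator import itemgetter

lines = ["the quick fox", "the lazy dog", "the fox"]

# Map: emit (key, 1) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: bring all pairs with the same key together.
shuffled = groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0))

# Reduce: aggregate the values per key.
counts = {word: sum(v for _, v in pairs) for word, pairs in shuffled}
print(counts)  # {'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```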
So, as I mentioned at the beginning, I was building these data platforms for a US-based company, and then for Sqooba, which was a startup doing big data analytics for customers, and then Open Systems acquired Sqooba in 2020. Open Systems is really focused on cybersecurity, so now I'm building data platforms, but in the cybersecurity space. So what is different? Well, we have different data sources, different processing, different data sinks, but in the end, as I said, the platform is the same: it's S3-based, Kubernetes, and then Spark and that type of application on top.

If I focus a little bit on the cyber space, the primary data source of cybersecurity is obviously network traffic, and the capture of network traffic can be split into three categories. In the first one, we have pcap files: you capture pcaps somewhere, then you transport these files into the big data platform and you ingest them there, extracting things, indexing fields, and so on. The second type is a live network interface: you have a server, like a web server, and you have a probe listening on the network interface you want to monitor, capturing the packets, extracting as much information as you want, and sending this information into the big data platform. The third one is a tap. A tap is basically a mirrored feed of network traffic: you're not on the live network interface, but on a mirrored one, so you're off the critical path, and you listen to the traffic through that.

This network data source is then usually enriched, which means joined, basically, with any other external data source you can think of. The main ones we're dealing with are Active Directory, so AD logs; proxy logs, if your company has a proxy; firewall logs; honeypots, if you have them; and then all these threat intelligence feeds, where companies like C-Score publish lists of what are called IOCs, indicators of compromise: basically IP addresses used by malware, attackers, and so on. So you have the network traffic on one side, and these other data feeds, logs, and threat intelligence feeds on the other, and out of this you're trying to find the needle in the haystack, as always.
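To make the enrichment step concrete, here is a minimal sketch of joining extracted network metadata with a threat intelligence feed; the feed contents and field names are made up for illustration:

```python
# Minimal sketch: enrich session metadata with an IOC list of known-bad IPs.
# All records and field names below are hypothetical illustrations.
import pandas as pd

sessions = pd.DataFrame({
    "src_ip": ["203.0.113.7", "198.51.100.2", "192.0.2.9"],
    "dst_port": [443, 9973, 20032],
})

ioc_feed = pd.DataFrame({
    "ip": ["203.0.113.7"],        # IPs published as indicators of compromise
    "threat": ["known-malware-c2"],
})

# Join the traffic metadata with the IOC feed: matches are the needles.
flagged = sessions.merge(ioc_feed, left_on="src_ip", right_on="ip", how="inner")
print(flagged[["src_ip", "dst_port", "threat"]])
```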
So if I dig a little bit more into the current analytics data platform I'm working on for the cyber space, it looks something like this. You have the traffic coming in, in this case through a tap, and the tap is connected to capture nodes. Each capture node listens to the traffic and tries to extract as much data as possible: is it a DNS request, is it an HTTP request, whatever type of traffic it is, and it sends all the extracted metadata into the data platform. Then the data platform magic starts, mostly enrichment: GeoIP enrichment, so if you have an IP you know which country this IP comes from; then the threat intelligence feeds, so joining with a malware list or whatever else; and you can apply a custom machine learning model on top.

What is interesting in the end is what you do out of this. One easy thing to do is reporting: you report that today we monitored this many network sessions, or this many network packets matching a known malware, for instance. As you gain maturity in your platform, the next step is alerting, and then even mitigation. Mitigation means that the platform does not just alert, like "we detected some unauthorized network traffic to a malware host"; the mitigation could directly update the firewall rules of the server, or of the laptop of the user that is impacted, to try to stop this network communication as early as possible.

So here what I would like to do is a quick demo. I'm also helping the Metapool mining pool team with operating and securing the pool, and on one of the Metapool servers we have a tool called Arkime running. Arkime is one of these deep packet inspection tools: it listens on the network interface and tries to recognize the type of packet. You might be familiar with the OSI model, which details the anatomy of a network packet, and the idea is to go as high up the stack as possible: if it's HTTP traffic, what is the request, which page is this request trying to reach; if it's HTTPS traffic, what is the certificate used in the TLS handshake exchange; and so on.

So this is the Arkime tool. We can see that it's running and getting fresh sessions, and all these lines here are actual sessions: network packets entering and leaving this Metapool server. Here we can see some standard TCP traffic; there's some ICMP, so pings, basically, coming from random servers, random IPs, going to this one; that's the IP of this Metapool server; and this port 20032 is the default stratum port for the mining pool. And here I can start looking into the details, the details of the packets, and I can even see the payload of the TCP session between a miner and the Metapool server. This is a typical stratum session initialization: the miner connects to the Metapool, they exchange some data, and then it starts mining.

So this tool is really interesting to start slicing and dicing, digging into what is happening on the network side. One typical example is this type of session, on port 9973, which is the default Alephium full node port. If I filter on this, so I keep only the sessions that arrive on this port, it means these are other full nodes trying to connect: those are the IPs of the other full nodes connected to this full node. And here I can, for instance, on the source IPs, export all the IPs, and these are probably the IPs of all the full nodes this Metapool full node is connected to. It's really powerful, way beyond the demo we want to do here, but what is interesting is that I have a really nice and slick way of looking at the network traffic entering and leaving this node.
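The two demo steps, filtering sessions on the full node port and then looking for outliers among the normal patterns, can be sketched like this; the session records and the crude size-based outlier rule are hypothetical illustrations, not Arkime's actual API:

```python
# Minimal sketch of the demo workflow on extracted session metadata.
# Records, fields and thresholds are hypothetical illustrations.
import statistics

sessions = [
    {"src_ip": "198.51.100.2", "dst_port": 9973,  "bytes": 4_200},
    {"src_ip": "203.0.113.7",  "dst_port": 20032, "bytes": 9_800_000},
    {"src_ip": "192.0.2.9",    "dst_port": 20032, "bytes": 11_000},
]

# Step 1: keep only sessions arriving on the full-node port and export IPs;
# these are likely the peers this full node is connected to.
peer_ips = sorted({s["src_ip"] for s in sessions if s["dst_port"] == 9973})
print(peer_ips)

# Step 2: a crude outlier rule on session size; sessions much larger than
# the median are candidates to investigate (real models are far richer).
median = statistics.median(s["bytes"] for s in sessions)
outliers = [s for s in sessions if s["bytes"] > 10 * median]
print(outliers)
```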
We can see that some traffic patterns are really easily identifiable. If I look into this one here, it's a session on the stratum port with a lot of packets, so it's a big session, and if we look at the detail, it's basically a miner getting jobs to try to mine on. This is really easily identifiable, so building a machine learning model out of this makes it easy to start finding outliers, because in cybersecurity what is really interesting is the needle in the haystack. This server is getting a few megabits of inbound and outbound traffic, but what is interesting is the one session that is malicious, and that is the one we want to identify. This tool really helps identify all the normal sessions, the normal patterns, which then helps build a model to find the outliers.

So again, that's my daily work: slicing, dicing, trying to find the needle in the haystack. This tool, for instance, helps us find and identify attack patterns, because a mining pool is quite an interesting business where attacks are common: if your pool responds slowly, or is even knocked out of order, then the miners on other mining pools have more profitability. Apparently it's a game in the mining pool business: you have a pool, you have a team dedicated to operating the pool, and you may have another team dedicated to attacking the other pools, just to maximize the profit of your own miners. So that's pretty much it for the demo. As I said, I'm working really hands-on in the cybersecurity space, helping companies find outliers, attacks, or malicious network traffic on their own networks, and obviously I'm applying this on a daily basis.

So now let me switch to what I'm doing for the Alephium project. In the Alephium project, as you might have understood, what I'm doing is building and maintaining the Alephium infrastructure. The Alephium infrastructure has a few components, which are basically listed here; all of them are open source, with the GitHub repos linked here, and I have four categories. First, the Alephium full nodes: the ones hosting the chain, collaborating, discovering peers, and really building the blockchain network, where we have the groups, the shards, the transactions, and so on. The second group is the Alephium Explorer. The Explorer is this nice UI, you can go to explorer.alephium.org, which helps you navigate through the wallets and the transactions. The Alephium Explorer is based on three components: one is the frontend, linked here; the second one is the backend, basically the REST API serving the data; and the data is stored in a Postgres database, which is an open source component, but not one maintained by us. The third component of the Alephium infrastructure is miners and mining pools, and here again we have some repos; I pointed to a few: the standalone GPU miner, and the reference implementation of the stratum mining pool, which is the one listening on port 20032 that I showed in the UI before. And the fourth part of the Alephium infrastructure is the wallets, so what the end user uses and sees at the very end. If you think about it, it looks like an iceberg: the wallets are the visible part from the end user's perspective, and the full nodes, Explorer, mining pools, and so on are underwater, unseen by the end user. And that's usually the case in every software project.
So if I dig a little bit into all these components composing the infrastructure: first, we have the full node. The full node is a standalone Scala, so JVM-based, application, and we have multiple Alephium full nodes running out there, some of them being well known and referenced as bootstrap full nodes. When we started the blockchain, we started with one full node, bootstrap 0, and this one was the first one, setting the genesis blocks; then we started bootstrap 1, 2, 3, and so on, and these bootstrap nodes formed the first blockchain cluster at the very beginning. Now, every time you start a full node on your own, these bootstrap node endpoints are hard-coded, so they're usually the first ones your full node connects to, and from them it discovers other peers and starts connecting to all those other peers. Nothing really fancy here, except that we have geo-distributed full nodes: we try to have full nodes on every continent, and we have full nodes running almost everywhere on Earth, which is cool, mostly for the reliability of the chain, but also to be as close as possible to the end users in order to provide good response times to the chain.

At the very beginning we also ran some miners alongside the brokers, which we then stopped once we reached a certain threshold of hash rate, so as of today the Alephium team is no longer mining at all; at the end of the day, it's only the community mining. On our side, I spoke about all this Kubernetes orchestration and so on, but for the full nodes we wanted to keep it as simple as possible: each runs in a Docker container inside a VM, with no orchestration, so it's really a keep-it-simple type of deployment, listening on port 9973 in order to find peers, get discovered, and so on. And that's pretty much it; nothing really fancy on the Alephium node side except this geo-distribution.

The Explorer is what I would call a somewhat typical three-tier HTTP application. There's an endpoint, like wallet.mainnet.alephium.org for the backend, or the Explorer one, which I can quickly show: explorer.alephium.org. That's the frontend, serving the website, and the website, to display the blocks or the number of transactions and so on, calls the backend, which then looks into the database to find the required information. The mainnet Explorer is using multiple Explorer backends configured as read-only, so only reading data from Postgres replicas, and then we have one specific Explorer backend which is configured in read-write mode: this one listens for new transactions and new blocks on the full nodes and synchronizes them to the Postgres master.
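A heavily simplified sketch of that read-write backend loop might look like the following; the endpoint path, the table layout, the port, and the polling approach are assumptions for illustration (the real Explorer backend is a Scala application in the Alephium GitHub repos, and it listens for new blocks rather than polling):

```python
# Minimal sketch of the read-write explorer backend: pull new blocks from
# a full node and upsert them into the Postgres master. The /blocks
# endpoint, port and schema below are hypothetical illustrations.
import time
import requests
import psycopg2

NODE_API = "http://localhost:12973"   # assumed default full-node REST API port
conn = psycopg2.connect("dbname=explorer user=explorer")

def sync_once(height: int) -> int:
    resp = requests.get(f"{NODE_API}/blocks", params={"fromHeight": height})
    resp.raise_for_status()
    with conn, conn.cursor() as cur:
        for block in resp.json():
            cur.execute(
                "INSERT INTO blocks (hash, height) VALUES (%s, %s) "
                "ON CONFLICT DO NOTHING",
                (block["hash"], block["height"]),
            )
            height = max(height, block["height"] + 1)
    return height

height = 0
while True:
    height = sync_once(height)
    time.sleep(5)  # the real backend reacts to new blocks instead of sleeping
```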
All of these are load balanced across multiple instances, and wallet.mainnet is actually an endpoint on top of the full node API, load balanced the same way as the others. With this we have a quite responsive Explorer implementation, which as of today has worked well; there were some hiccups here and there, but most of the time it runs smoothly.

The last part here is the mining pools. I showed a very brief example with Metapool; basically, Metapool is based on the reference implementation of the mining pool that we provide as open source. The mining pool needs to be connected to full nodes to get work to mine on; it dispatches jobs to the miners and collects their shares, and miners can run our standalone GPU miner or any other type of mining software. The mining pool also has its own persistence behind it, so nothing too fancy. I will not go much further into detail here; most of the details are really mining pool specific, and since I don't speak for them, how they implemented and protected it is really their own internal business.

In addition to all the open source pieces, we have some things that are more Alephium specific. One is the monitoring: we are monitoring the blockchain, the full nodes, the Explorer, and so on. All of this is visible on the status pages; we have a status page per chain, one for the mainnet, one for the testnet, and a global one where everything is summarized and listed. The monitoring is mostly based on the Prometheus stack, and then we have Grafana dashboards, alerts, Alertmanager, alerts dispatched through Slack channels, and some other in-house tooling to mitigate some alerts directly and automatically.

One of the last Alephium-specific parts is the archives. We have archive nodes where we store the full node data, and the rationale behind them is to speed up the bootstrap of a full node: we take a snapshot of the state of the blockchain once a week, you download a recent snapshot, and then you start syncing from the snapshot, which helps you bootstrap a full node in minutes, like half an hour, instead of several hours as was the case before. And that's pretty much it for the Alephium infrastructure.
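The snapshot-based bootstrap just described boils down to something like the following sketch; the archive URL, file layout, and data directory are placeholders, so check the Alephium documentation for the actual locations:

```python
# Minimal sketch: restore a weekly snapshot before starting the full node,
# so it only has to sync the blocks produced since the snapshot was taken.
# The URL and paths below are assumptions, not the documented endpoints.
import tarfile
import urllib.request

SNAPSHOT_URL = "https://archives.example.org/mainnet/latest.tar"  # hypothetical
DATA_DIR = "/data/alephium"  # assumed full-node data directory

archive, _ = urllib.request.urlretrieve(SNAPSHOT_URL)
with tarfile.open(archive) as tar:
    tar.extractall(DATA_DIR)  # then start the full node on this data directory
print("snapshot restored, the node will sync the remaining blocks on startup")
```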
On this, and on cybersecurity in general, I'm very open to any kind of question on infra-related topics.

Just for the people watching us: you can either ask your questions directly on YouTube or ask them on Discord, and we will relay them for you here. If there's no question yet, I'll start with one. Thank you so much for the presentation. My first question is: why did you sound like a black hat, or maybe are you also a black hat in your free hours?

Well, that's an interesting question, and if I were a black hat, I would probably not tell you that I'm a black hat. I actually did my master's thesis in the security field: I did some reverse engineering of set-top boxes, satellite receivers, for a broadcast company, and I really fell in love with cybersecurity. It's really exciting to work in this field, which is in constant transformation; it's one of these thief-and-police games, always: the hackers find something new, and then the defenders have to update, and so on. At the end of the day, to be good at cybersecurity you need a very good understanding of every component of the computer. And obviously, when I was younger, I did some black-hat-related activities. I remember, in the city where I live, there was a poll to find a name for the theater, and it was like a simple web page where people could vote for names. There was absolutely no protection on it, so one evening with friends we started posting a crazy number of votes; at the end one name won with a billion votes, something completely crazy like that, and in the end the politicians had to cancel the vote. It made quite some noise at the time. But no, I'm not really attracted by the dark side of the force; I prefer staying in the light.

Thank you, that is interesting, thank you for sharing this. I have a very good question from the community: what is the Postgres table structure?

That's a very nice question, thanks Nicola for asking. Actually, I'm not the one writing the backend, so I'm not the best person to answer, but it's a typical relational schema where we have blocks, transactions, wallets, and so on. We started adding quite some denormalization to the model in order to speed up some queries, but the best ones to answer are Simon or Thomas, the backend dev team.

I think this question was also asked on Discord, so we can have the team member focusing on this answer it directly as well. Thank you. My next question is: when did you join Alephium, and what attracted you to the project in the first place?

I joined Alephium, I think, around May or June 2021, if I'm not mistaken. I heard through my network that this team based in Lausanne had started a new blockchain, and I started looking at the GitHub repos. I've been in the crypto space for quite some time, but I'm mostly interested in the tech around blockchains; I'm not a trader or a speculator in that field. What I found really interesting when I joined is that it was at a very early stage, which means everything was feasible, everything needed to be built. One of the first things I built was a faucet, for instance: the faucet on the testnet is one of the first things I built around Alephium. The core dev team are really great devs, but they were lacking all the packaging around the application, so another thing I did at the very beginning was building the Docker containers, in order to smoothen the developer experience of running Alephium full nodes. I started with simple tasks like that, then trust got established, and I started doing more and more, until it came to taking charge of the whole infrastructure, which I was up for.

And we're very grateful that you were; it's a pleasure to have you on the team. Maybe I can ask you: do you have a specific relationship with decentralization? Is it a topic that is important to you, and if so, how did you start being interested in it?
I've been building distributed systems for quite some time, but distributed is not decentralized, and what I was really missing was going one step further, into decentralization. Distributed is good, I mean, I did my homework on distributed systems, but I wanted to take this extra step towards decentralization. And it really appears that decentralization is harder than distributed, because you're not in charge of the whole infrastructure, so you can have any type of actor. Since I come from cybersecurity, I'm usually suspicious: when someone is running something or connecting, my first question is, is this authorized, or is this suspicious? And having something concrete like the Alephium project to apply this to, I found really, really interesting. Decentralization is hard because you somehow have to trust, even though the internet is the Wild Wild West and usually you should not trust. I found a kind of ambiguity there that I haven't been able to resolve properly yet, but I'm working on it and making some good progress.

Thank you. Maybe you can stop sharing your presentation now; I'm not sure how it's showing up on YouTube, but we can stop it. And to bounce off this question, this is a question we also asked during our last Tech Talk, and everyone has a different perspective on it: how does one measure decentralization? What is your perspective on this?

That's a very good one. As I said, decentralization is hard, and usually established actors try to avoid decentralization as much as possible, because decentralization is a sort of loss of power, if you want. If you think about banks, they are fighting against decentralization. Why? Because they want to keep the power and the control over what they're doing. So how can we measure decentralization? From my perspective, it's looking at the number of full nodes running on the network, for instance, and counting how many are managed by me and how many are managed by someone else. If I take this example, we are managing a small subset of the full nodes, the bootstrap nodes and a few others, but that's pretty much it, and I'm pretty happy to see that on the Alephium side decentralization works pretty well, I would say. Adding to this, the entry barrier to decentralization is also key: still with the example of the full nodes, if it took a lot of resources or a lot of effort to run a full node, that would be a high entry barrier, meaning fewer people willing to do it in their spare time or as a hobby, which means only big actors would end up running nodes, and we would see a shift towards centralization. So decentralization, for me, is a trade-off between trust and a low entry barrier, in order to allow anyone with some goodwill, but not too many skills or too much time, to start contributing within the ecosystem.

Thank you for your perspective. And to join the two topics of this talk, decentralization and cybersecurity: do you think cybersecurity services can be decentralized efficiently? For example, could you build a decentralized Cloudflare?
Yeah, that's a tough one. Cloudflare is a typical distributed but centralized service, and as I said, and I'm restating it, building a truly decentralized system is hard, and it doesn't fit all use cases. We are running a few nodes on Flux, for instance; they are doing decentralization, but in order to access their services you still need to go through a single entry point, which is managed by Flux. So having everything decentralized, without any single point of failure or any single bottleneck managed by someone specific, is very hard to achieve. Building a decentralized Cloudflare is, from my perspective, really hard as of today. I would love to find the solution to do that, and you would also have to fight against big actors who will, on their side, fight to bring this back to a centralized model, because they want to keep, or to gain, the contracts.

Thank you. I think I will ask one last question so we can end this call on time. To follow up on your previous answer: even decentralized blockchains have to rely on some central infrastructure or services, such as the bootstrap nodes you mentioned, or the wallet backend. Knowing the concerns you expressed just before, what can Alephium do to improve decentralization with regard to those points?

Yeah, so the wallet endpoint is something very interesting, because as of today we are running the default one, but since everything is open source, and the level of packaging keeps improving while the entry barrier of running these components decreases over time, anyone can run their own, or even a personal, wallet endpoint. To avoid centralization on this kind of topic, the entry barrier and the developer experience, or even the non-developer experience, should be as good as possible; the entry barrier should be as low as possible. And we are working in that direction on a daily basis: we are putting a huge effort into packaging, writing documentation, making this as smooth as possible. I mean, I would love to have my mother running a wallet endpoint on her laptop if she needed it.

Well, thank you so much. If there's no other question at this point, I will still ask if you want to add anything, maybe a topic you want to add to the conversation.

No, thanks for the time, and I mean, I'm really proud to be part of the team. The team is absolutely awesome, we have really, really great people everywhere in the team, and it's a big pleasure to be part of this ride.

This is a shared feeling; we are all very happy that you are part of the team, with everything you do within Alephium besides your main activity. Thank you everyone for tuning in to this Tech Talk edition, have a nice evening.

Thank you, bye bye.