Okay, cool. Thank you. So yeah, my name is Jirka Kremser. I work for Red Hat on a project called radanalytics.io, and we empower developers to build their intelligent applications. Today I'm going to talk about blockchain analysis and also about processing graphs in Spark.

This is the outline. First I will briefly describe blockchain; we have just 30 minutes, so it will be really brief. Then how to represent blockchain transactions as a graph. Then how to tackle this graph from the Spark perspective, which will be slightly relevant to the talk from the Neo4j guys about openCypher. A substantial part will be the demo; it will be half of the presentation. Hopefully it will work, because I tried it an hour ago and it didn't work because of IPv6. So I hope I will be luckier this time.

The room is crowded, I guess because of blockchain and the hype around it. It's everywhere, right? Unfortunately the price went down, I don't know, two days ago, because it's still volatile. Honestly, I don't care about the price. I don't care about the market. All I care about is the tech, and I'm going to describe how it works. I'm not going to tell you whether you should buy or sell. It's not my business.

So what is it? It is a distributed ledger system. It tries to create distributed consensus; that's a segue from the previous talk, right? But when it comes to the data structure, it's just a linked list of blocks. As you can see, there is one block, and there is a link to the other one, and another one. Each new block has a reference to the previous one, so it's basically a singly linked list. The trust stems from so-called Merkle trees, which are hash-based structures, and from something called mining.
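To make the "linked list of blocks" idea concrete, here is a minimal sketch in Python. It is not the real Bitcoin block format: the field names `prev_hash` and `payload`, the JSON encoding, and the helper names are all assumptions made for illustration. It only shows that each block stores the hash of its predecessor, so changing an old block breaks every later link.

```python
import hashlib
import json


def block_hash(block):
    # Hash a canonical JSON encoding of the block (illustrative, not Bitcoin's format).
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_block(chain, payload):
    # The first block points at an all-zero hash; later blocks point at their predecessor.
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "payload": payload})


def is_valid(chain):
    # Every block must reference the hash of the block right before it.
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
for payload in ["genesis", "alice pays bob", "bob pays carol"]:
    append_block(chain, payload)

print(is_valid(chain))
```

Editing the payload of any earlier block changes its hash, so `is_valid` starts returning `False` for the chain.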
Mining, which they also call proof of work, is the act of lending your computing power to ensure that the system, the network, behaves correctly. To make it more clear: people try to find a nonce, which is a piece of random data that is put into the block. If the block is hashed with the SHA-256 hash function and the result is lower than some number, some threshold (they call it difficulty), we have succeeded: we found a new block, and we are rewarded by the network with some bitcoins and also with all the fees that were in the block.

This is the detail of one block. As you can see, or not, it's very small. There is something called the magic number, which identifies the network. Today I'm going to talk about the Bitcoin blockchain, but there are different flavors of blockchain, and this magic number identifies the type of the blockchain. There are also blockchains such as Namecoin, which is very similar to Bitcoin, so if a chain is relatively similar, it can be identified by this magic number. If it is a different cryptocurrency, like Ethereum or, I don't know, IOTA, it has a completely different architecture for the trust. So those are more distant from the Bitcoin blockchain than the similar ones.

There is something called the Merkle root. This is the hash of everything in there, of all the transactions. And the nonce is the part that was used for solving the block. Today each block has a hard limit of one megabyte, which is relatively small.
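The nonce search can be sketched in a few lines of Python. This is a toy under stated assumptions: the header bytes and the target are made up, the target is far easier than any real one, and the hash is compared as a big-endian integer for simplicity. Bitcoin itself applies SHA-256 twice to an 80-byte header, which is why the helper below hashes twice.

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    # Bitcoin hashes block headers with SHA-256 applied twice.
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def mine(header: bytes, target: int, max_nonce: int = 2**32):
    # Try nonces one by one until the hash, read as an integer, is below the target.
    for nonce in range(max_nonce):
        digest = double_sha256(header + nonce.to_bytes(4, "little"))
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
    return None  # no nonce in the range solved the block


# Toy target: roughly 12 leading zero bits, so a few thousand tries on average.
TOY_TARGET = 1 << 244
result = mine(b"toy block header", TOY_TARGET)
print(result)
```

Lowering the target makes the search exponentially longer, which is how the network tunes how hard it is to find a block.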
So only a limited number of transactions fit into a block, and miners basically favor those transactions that offer a higher fee to the miners. When you create a new transaction, it's up to you how high the fee should be, so if it is a small number, it can take ages to get the transaction confirmed.

In the traditional banking system, with conventional banks, if I want to send money to someone, my bank basically holds my account, my balance, and it checks whether I have enough money for the transaction. I don't know the details, but I think it's stored in some relational database like Oracle or something, and then the bank subtracts my money and sends it to someone else.

In blockchain it is completely different. If Alice wants to send some money to Bob, she basically has to spend all her money, and there is an implementation detail that lets her send the rest of the transaction back to herself. There is something called an input. So she creates a transaction with one input and two outputs, and as you can see, there is an output whose address points back to her. So it is a self-loop. She has to spend all her money, but she is allowed to send money back to herself. And there is a simple rule that the sum of all the inputs has to be equal to or higher than the sum of the outputs. If it is higher, what does that mean?
I mean, if you subtract the sum of the outputs from the sum of the inputs, the difference is the fee. It is the fee of the transaction. And as I mentioned, confirming means finding the nonce for the block. Here in this example Alice is trying to send 0.2 bitcoins to Bob. Let's say he doesn't have any bitcoins. Once the transaction is confirmed, he has a balance of 0.2 bitcoins and Bob is happy. So this is happy Bob.

This is a transaction in the more general case. Before, we had just one input and two outputs, but it doesn't have to be this way. In this case we have three inputs, and as you can see, they reference unspent outputs from previous transactions. So this is basically the graph of all the transactions: each input has to be connected to a previously unspent output, and by referencing it we basically spend it. One thing I didn't mention is that miners also have to verify that the sender is allowed to make the transaction, because all the addresses are fingerprints of public keys, and he has to prove with his private key that he owns the money.

But now, how to represent it as a graph? Because we can have multiple-input and multiple-output transactions, it's hard to say which address sends money to whom. So my first shot was to create a bipartite graph in which I connected all the inputs with all the outputs. I use this representation in one of my notebooks, where I'm calculating PageRank, because that's all I need there. I also have a different representation, which I will show on the next slide and which has more details.

I use Apache Parquet as the format because it's suitable for Spark; Spark can open those Parquet files. It supports partitioning, so I can put different pieces of data on different nodes, it also has a schema, and it's columnar.
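The fee rule and the "connect all inputs with all outputs" representation described above can be sketched as follows. The amounts, the address names, and the helper names `fee` and `address_edges` are made up for illustration; they are not taken from the speaker's notebooks.

```python
from itertools import product


def fee(inputs, outputs):
    # The fee is whatever the inputs carry beyond what the outputs pay out.
    total_in = sum(amount for _, amount in inputs)
    total_out = sum(amount for _, amount in outputs)
    if total_in < total_out:
        raise ValueError("outputs exceed inputs; the transaction is invalid")
    return total_in - total_out


def address_edges(inputs, outputs):
    # Connect every input address to every output address (the first representation).
    return sorted({(src, dst) for (src, _), (dst, _) in product(inputs, outputs)})


# Hypothetical amounts in satoshis: Alice spends one 0.5 BTC output, pays Bob 0.2 BTC,
# and sends the change back to herself, which creates the self-loop.
inputs = [("alice", 50_000_000)]
outputs = [("bob", 20_000_000), ("alice", 29_990_000)]

print(fee(inputs, outputs))
print(address_edges(inputs, outputs))
```

Note that this edge list drops the amounts and the transaction itself, which is exactly the information the second, more detailed representation keeps.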
So it's suitable for this NoSQL or distributed processing. I've also developed an application called, sorry, the Parquet converter. It's based on another open source project. What it does: it takes the binary files you have on your disk. If you run the official Bitcoin client, it downloads the whole universe, it's, I don't know, 280 gigabytes, and you can run this converter on it to create this Parquet representation.

Ideally I would like to have some network and run some analysis on top of it. So this is the other approach I have chosen; I use it for the other notebook, and here I have more information. I have nodes for addresses, nodes for transactions, and also nodes for blocks. So I have the information about which block a transaction appeared in, and I also have the amounts of satoshis that were sent in the network. In the previous representation I have only an edge saying that some address was somehow communicating with another address; here I have more information. I'm basically capturing these multiple-input and multiple-output transactions.

Right. So how to tackle graphs from Spark? There is a pre-built library or module called GraphX and an external one called GraphFrames; for that one you have to put some jar files on the classpath to be able to use it. I have two notebooks: one uses GraphX and the other one uses GraphFrames. There are a couple of built-in functions in Spark, for instance for label propagation, PageRank, and counting triangles, but it's extensible.
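The richer representation with address, transaction, and block nodes can be sketched in plain Python like this. The dictionary layout, the direction of the transaction-to-block edge, and the type code "b" for blocks are assumptions for illustration; the talk itself only shows "a" for addresses and "t" for transactions.

```python
def build_graph(transactions):
    # nodes maps a node id to a one-letter type; edges are (src, dst, satoshis).
    nodes = {}
    edges = []
    for tx in transactions:
        nodes[tx["id"]] = "t"
        nodes[tx["block"]] = "b"  # "b" for blocks is an assumed code
        edges.append((tx["id"], tx["block"], None))  # which block the transaction is in
        for address, satoshis in tx["inputs"]:
            nodes[address] = "a"
            edges.append((address, tx["id"], satoshis))
        for address, satoshis in tx["outputs"]:
            nodes[address] = "a"
            edges.append((tx["id"], address, satoshis))
    return nodes, edges


# One hypothetical transaction: Alice pays Bob and returns the change to herself.
transactions = [
    {
        "id": "tx1",
        "block": "block1",
        "inputs": [("alice", 50_000_000)],
        "outputs": [("bob", 20_000_000), ("alice", 29_990_000)],
    }
]
nodes, edges = build_graph(transactions)
print(nodes)
print(edges)
```

Keeping all node kinds in one table with a type column is what lets a single vertex table and a single edge table describe the whole graph.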
It supports something called Pregel, which is a declarative toolkit where you describe something called a superstep. It's an iterative method where you specify that you accept messages from the neighbors, you update the state, and you send messages to your neighbors, and with this iterative approach you can implement quite complex algorithms. And then in GraphFrames there is a language called motif. It is inspired by Cypher; it's not as powerful, but it is there. If you were here at the lecture about Cypher, you probably know more than I do.

This is the basic architecture of Spark, for those who are not familiar with it. It's an ETL toolkit or framework. It keeps everything in memory where it can, and it also supports off-heap memory. It has a lot of bindings to other languages like Python or R, so it's suitable for data scientists. It's written in Scala, so there is native support for Scala and Java. As for the architecture, there is always something called the driver program, which holds the application, and it communicates with the master node, which coordinates all the workers and schedules the work to the executors. An executor runs on a worker node and runs the tasks.

So yeah, talk is cheap. Let's do the demo. Is the font visible? It should be, right? I have a cluster running OpenShift and I'm going to create a new Spark cluster there. I'm going to use something called Oshinko, which is open source, and deploy a new cluster. I will make two workers and call the cluster FOSDEM. And here, sorry, here I have the OpenShift console, and I can see there is nothing there. Yeah, there's nothing there.
So if I run the command, it should start deploying the cluster. As I can see, there is one master deployed for me and two worker pods. By the way, OpenShift is similar to Kubernetes, but it supports additional features like security and templates. It's very similar to Kubernetes; actually, Kubernetes is running under the covers.

So now the cluster is up and running and I can deploy my notebook application, the Jupyter notebook. To do that I will use new-app; oc is the OpenShift client. I am deploying my Docker container, but it also supports OCI containers through CRI-O. Here I have to specify the name of the master. I called it FOSDEM, right? If I do that, it will deploy the Jupyter notebook. What I also have to do is expose the route, because by default, if I deploy something to OpenShift, it is not exposed. I want to make it visible to the external world, so I'm exposing the route. By the way, I can do the same from the web console, and we also have a web-based application for deploying those Spark clusters, but I wanted to mix it, you know, so you can see that I can do something from the command line and it shows up in the web UI.

So now the route is exposed, so I can go there, and it doesn't work. I can check whether everything is up and running with get all. Yeah, it looks good. No, no, no, the container is still creating, so it's probably pulling the image. I can kill it. That sometimes helps, because Kubernetes or OpenShift tries to ensure that there is at least, in this case, one instance of a running pod, and if I kill it, it tries to spin up a new one. So I killed the previous one and the new one is now deployed. If I go to this page now, it works. So this is the Jupyter notebook. Previously you may have seen the Zeppelin notebook. These are different.
This is the more famous one. So what you can see here is Python. I can hide this, I can hide this; oh, unfortunately I can't hide the top bar, so it's relatively low on the page. I can check whether we are connected to the external cluster, and here we are. I can also open the Spark UI console and check whether there is an application attached to this cluster. Sorry, where? It should be right here on the master. Yeah, this one is the UI route. It's the console for Spark itself, and as we can see, there is one application connected to it and it has been running for 45 seconds. That is our driver application, the Jupyter notebook.

So now, if we evaluate these cells, they are sent to the Spark master, the Spark master schedules them to the workers, and once it receives the answers it passes them back to the driver, so we can see them in the Jupyter notebook. I also need to change this one.

So now what I'm doing: I'm reading these Parquet files, which I prepared before using that Parquet converter. You know what, let me run all the cells and I will describe them while they are evaluating, so it will be faster. What I'm doing here is reading those Parquet files with nodes, right here. The code is available on GitHub; I will give you the link at the end of the presentation. I'm reading all the addresses, all the transactions, and all the blocks, I think, yeah, and mixing them together, because this graph technology expects all the nodes in one big data frame.
So I'm merging them all together and adding this discriminator column called type. These are implementation details. And this is only a fraction of the blockchain network, not the whole blockchain, because it wouldn't be feasible to do it in reasonable time. As you can see, we have four million nodes, which covers the transactions of a couple of days. But the framework scales; you can use the same approach for the whole blockchain if you have enough time.

Here I'm creating the graph representation in GraphFrames. Sorry, I mentioned GraphX before; this is actually GraphFrames. Under the covers it uses data frames, not RDDs as the GraphX framework does. Here, for instance, I'm calculating in-degrees and out-degrees and creating some basic statistics, like standard deviation and average, for each type of node. So for instance, for type t, which is a transaction, the average in-degree is three, so three edges. In other words, each transaction has on average three incoming edges.

This is an example of the motif language. It is pretty similar to Cypher, but, for instance, I can't specify paths, I mean reachability over some kind of edges. I can specify only one edge at a time. I can't create a transitive closure or anything like that. It is very simple. I can create those patterns.
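The per-type degree statistics mentioned above can be sketched without Spark. The tiny graph below is made up, and the sketch uses the population standard deviation from Python's `statistics` module, which may differ from the aggregate used in the notebook.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def in_degree_stats(nodes, edges):
    # Count incoming edges per node, then aggregate per node type.
    in_degree = Counter(dst for _, dst in edges)
    by_type = defaultdict(list)
    for node, node_type in nodes.items():
        by_type[node_type].append(in_degree.get(node, 0))
    return {t: (mean(d), pstdev(d)) for t, d in by_type.items()}


# Hypothetical graph: "a" marks addresses, "t" marks transactions.
nodes = {"a1": "a", "a2": "a", "a3": "a", "t1": "t", "t2": "t"}
edges = [
    ("a1", "t1"), ("a2", "t1"), ("a3", "t1"),  # three inputs of t1
    ("a1", "t2"),                              # one input of t2
    ("t1", "a2"), ("t2", "a3"),                # outputs
]

stats = in_degree_stats(nodes, edges)
print(stats["t"])
```

The out-degree version is the same with `src` counted instead of `dst`.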
So for instance, there is an address and there is an edge to a transaction, and here I specify what the address actually means: address means that the type equals a. I can specify that the value on the edge is lower than 10,000 satoshi, and that the value on the outgoing edge from the transaction is also lower than 10,000 satoshi. And it's somehow time-windowed, because I have the reference to the block, and on the block I have a timestamp, so I can specify the time frame in which it happened. So these are the results for quite a complicated query.

I can also visualize a piece of the graph. It's really huge, and it wouldn't be nice if I showed everything, but I will do it anyway at the end. So this is one transaction visualized. The green one is the input of the transaction and the red ones are the outputs. The nice thing about the Jupyter notebook is that it's interactive: I can plug in a pretty arbitrary transaction ID right here and it would work; it would visualize a different transaction for me.

At the end of this notebook I'm using the NetworkX library. It's a Python library for graph visualization and processing, and I will visualize the same transaction with NetworkX, but it's not as nice. I can show the labels and it's even worse. The earlier picture used this library; it's called sigma.js. It's a JavaScript library for visualizing graphs. I could also have used D3, but I would have had to put more effort into it; sigma.js works out of the box. And here, in this case, I'm trying to visualize a bigger portion of the graph. It takes a random sample, a 0.0004 fraction, and visualizes it. It's really messy.

So this is the GraphFrames notebook. I've also got a different one, which is based on GraphX, so let me deploy that one. Oh, and the Docker containers are on Docker Hub.
I need to expose the route for this one as well. This one uses a different notebook technology; it's called Spark Notebook, and it should be accessible under this URL. I use Firefox for this because it has some issues with Chrome. So this is a similar concept, a notebook technology, but this time we use Scala. It's a different binding, a different language. Again I will evaluate everything and then describe what it does.

This time I'm using the representation where each address is connected to other addresses, but there is no other information like blocks, transactions, or even values on the edges. That's because I'm going to run the PageRank algorithm, which is provided by this library, and I don't need this information. It's faster this way. Here I'm massaging the data, cleaning it, because it contains some prefixes, so I remove them.

This is Spark, and I'm using RDDs. This is a different concept from GraphFrames. RDD stands for resilient distributed dataset. It's lower level; it's not optimized by the Catalyst optimizer the way data frames are, but in a way it's more powerful. Right now the evaluation of this cell is running. It's calculating the PageRank, so for each address it tries to find its rank.
I guess it's not necessary to introduce PageRank, but in a couple of words: it's an iterative algorithm, and this is the result of running PageRank on some very small graph. It tries to put a higher rank onto those nodes that have more incoming edges, but at the same time, if a node has a high rank, it contributes more to the nodes it points to. So in each step it updates the ranks. It's a form of eigenvector centrality measure. It tries to simulate a random walk through the graph: if we stop at a random moment, what is the probability that I end up in this node? For B, for instance, it's quite high. But this other node also has a high probability even though there is just one incoming edge, and that's because the edge comes from this highly ranked node. Sorry? These are arbitrary nodes, 40%. Yeah, this is a random graph on which I wanted to explain PageRank, and it doesn't mean anything for blockchain.

And actually, once this PageRank finishes, the results, the ranks, are not actually probabilities. They are just numbers, because they can be higher than one, but I can turn them into probabilities by summing everything and normalizing. All I need is to rank the addresses by this PageRank and see whether the highly ranked addresses in the network are somehow interesting. When I tried it before, it took two and a half minutes. It's still calculating.

Yeah, it has just finished, and as you can see, this one, for instance, has a rank higher than one. So here I'm sorting them by their rank. In this notebook I can actually query external websites.
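The iterative idea described above can be sketched in plain Python. This is a textbook power iteration, not the GraphX implementation the notebook runs; the damping factor 0.85, the iteration count, and the five-node graph are assumptions for illustration. The `normalize` helper shows the "sum everything and divide" step for scores that do not add up to one.

```python
def pagerank(edges, damping=0.85, iterations=50):
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - damping + damping * dangling) / count for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                # A node splits its rank evenly among the nodes it points to.
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def normalize(scores):
    # Turn arbitrary positive scores into values that sum to one.
    total = sum(scores.values())
    return {node: score / total for node, score in scores.items()}


# Hypothetical graph: b has many incoming edges; e has a single one, but from b.
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "e"), ("e", "b")]
ranks = pagerank(edges)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

In this toy graph `e` ends up ranked well above `a`, `c`, and `d` despite having only one incoming edge, because that edge comes from the highly ranked `b`.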
So I will be querying blockchain.info to fetch additional information about the nodes, because I have only a small portion of the graph. For instance, I want to have the overall incoming transactions and overall outgoing transactions for an address, like how many bitcoins it has received or sent over all time, and I wouldn't be able to reconstruct that from only this small fraction of the transaction graph. So that's why I'm cheating a little bit by querying an external site for this. I will be kind of augmenting this output with HTML.

Sorry? Sure. Yeah, so what I have right now is a framework for working with blockchain data. The benefits might be, for instance, that you can use this framework for identifying similar addresses in the network, because, as you may or may not know, Bitcoin or blockchain is a pseudonymous network. We can see all the transactions of each address, but we can't be sure who is behind the address. But if we can somehow inspect, I mean see, the behavior of certain addresses, we can cluster them together and say with some probability whether they are the same entity or not. These are just ideas, you know. And yes, it's not very useful as it is. This is just for demoing OpenShift, and, you know, you can use our technology, which is also open source, and you can do it in the cloud. It's pretty convenient.

So this is the list of Bitcoin addresses sorted by PageRank, and there should also be a link to the external site, in this case blockchain.info. And as we can see, we just found a mining pool called fish pool.
So just by looking at the graph and its topology, we were able to identify a kind of significant address in the network. By the way, mining pools are entities in the Bitcoin network that let miners obtain smaller amounts of money more often, because it's highly unlikely that it will be you alone who confirms the transactions. So miners aggregate into those mining pools, and the mining pools reward their members for doing the work: if the mining pool succeeds in finding a new block, it rewards all the members who were mining in that period. Okay, yeah, that's fine.

So I can see whether there are some other interesting ones. For instance, in this case we don't have any information. We had the information before because the pool itself put this information into blockchain.info. There wasn't any magic happening. I also have my own Bitcoin address, and I can go to blockchain.info, I think slash tags, and put some metadata on my address if I want.

This one is different; for instance, this one is a hot wallet. So this is a wallet in the browser, an online one, and they want to be identified. It makes sense that they put this tag on their Bitcoin address. And here you can see the address actually starts with a 1 followed by "fox"; that's the hash. Actually, finding such an address is also a difficult task, because they have to spend a lot of computing power to come up with a pair of private and public keys where the fingerprint of the public key starts with those characters. Sorry.

And as you may see, this is the total amount received and the total amount sent by these addresses, and this is the resulting, the current, balance. For instance, this address received a huge amount of money, but the balance is zero, and that makes sense if it sent everything it received. So yeah, these are the notebooks.
They are in the GitHub repository. Let me switch back to the presentation. So, key takeaways. Blockchain is out there. Its format is open. You can dive into it; you can see what's there. You can use my framework to convert it to a graph and process it further. This is just a starter, you know; you may do a lot of more interesting work.

There are two different technologies in Spark, one called GraphFrames and one called GraphX. I had bad luck with GraphFrames, for instance when running the PageRank algorithm. I'm not sure why. It was working for relatively small graphs, but once I reached some threshold it didn't work; I got a lot of out-of-memory errors even on relatively high-memory platforms.

I think it's really important to create reproducible experiments if you are in the data science community. What happened to me: I read some scientific paper and they have some data set that is publicly available in some zip file somewhere, and it's relatively hard to reproduce their work. If you put it into containers and you create these notebooks, it's pretty easy for other people to extend your work. And as always, a good representation may be crucial.

So, yeah, please visit our website, radanalytics.io. I will also put some stickers right here if you want some. And this is my presentation, if you want to download it and rate it. So, questions?

No, no, it's not streaming. The question was whether it works in streaming mode. No, it's just batch processing. I want to improve it by adding streaming. It's relatively difficult, because I can't do it using this binary data store in the file system. I would have to run the bitcoind daemon, hook into it using JSON-RPC, and, you know, try to see the new transactions and visualize them using Spark, what is it called, Streaming. This is something I want to do. Yeah, what?
Yeah, it is. I can show the notebook again; we have time. I actually found it this way: first I calculated the output edges of each transaction, I sorted the transactions by out-degree, and I took one relatively high in the list. I can show you right here. It's sorted by out-degree and I'm taking roughly the 800th one, 801 exactly. If I picked a random one, it would have just a couple of outputs and a couple of inputs. Yeah, exactly. Good question.

Thanks a lot. Are you around today? Yes, I'll be here, so you can catch me later. Sure. And the next presenter will be talking about G-CORE, a new graph query language.