Yeah, thanks a lot. Welcome to my talk, "Bridging the Gap from Data Science to Production". After the introduction, a few words about myself. I'm a data scientist at Inovex, and I have a mathematical background, which is why I really like mathematical modeling; that's important for being a data scientist. I've done a few projects on recommendation systems, which is a really nice and interesting topic. And I'm always interested in bringing things into production, meaning that I don't just build some proof of concept, some nice model, one time and then kind of forget about it. I really want to see the gained value that you only get if you put things into production. And of course, I'm a big fan of the Python data stack. Just a few words about the company I work for, Inovex, who gives me the possibility to speak at cool conferences like EuroPython. Inovex is an IT project house with a focus on digital transformation, and we offer everything around that: from operations to application development, big data and, of course, data science. We have many offices all over Germany. So, to the actual topic, data science to production: who of you has already worked on a data science project?
And in the end you had some really cool proof of concept, but it was never really put into production? Okay, so it seems like this really is a big topic. A lot of people are talking about it, and it's also a source of frustration. Data scientists get frustrated after a while if they see one proof of concept after the other that never really moves to production, and the business side gets frustrated too. They have maybe hired a huge team of data scientists that do cool things, but in the end they can never say: our data scientists did this, and now we have increased our revenue by, say, ten percent. This is exactly why one should care about moving things to production, and this topic is definitely not an easy one. It has many different facets, and throughout the talk I'm going to touch on many of them. One of the important things is the actual use case, this data product or model you are building. Let's look at it from a really high-level perspective first. If you look at the data product you want to build in your company, you can basically say: okay, it's quite easy. You have some data somewhere, you have your model, which is basically doing some transformations, and in the end you have some results, be it predictions or decisions or whatever. This is the really high-level perspective, but from these three components we can already classify what our use case is, and we have to keep this in mind later if we want to put things into production. For instance, the data: is it coming from some relational database or some NoSQL database? Is it coming from some distributed file system, or do you have to deal with stream-based data? Does your model need to consume data as a stream the whole time?
This is an important question, and depending on your use case you have to clarify how you are going to do this in production and what the latency and frequency requirements are: are you dealing with batch data? Does your model need to react in near real time, in real time, or in a stream-based fashion? Then the model itself. When I'm talking about the model, I'm not only talking about the machine learning algorithm. A lot of people say "my model" and actually mean the artificial neural network or the random forest, but the model includes everything from the point where you get your raw data to the point where you give back some kind of result. This includes the pre-processing: how you do cleansing and imputation, how you scale your data, and all the feature engineering you do, like the construction of new derived features, say an exponential moving average. This is all part of the model, because if you do this on your laptop in some proof of concept, you later have to put it into production as well, and you need to think about that so that you don't end up re-implementing everything. The last part is the results: what do you do with your results? In a proof of concept your results are maybe some CSV file, and you make some nice plots and show them to some product manager. But in production you need to care: do I put this into another database, and are the consumers of my predictions or decisions reading from that database? Then the database would be your interface. Or is it again some distributed file system?
Are you writing back to new topics in a stream? Or, in a real-time use case, as is quite often the case for recommendation systems, do you have to provide some kind of REST API, so that people can ask in real time for recommendations given a user's preferences? Looking back now at the whole picture: we have our data, we have a model, and we have our results, and everything needs to be in production in the end. So we care about deploying this model, and, as I've already said, we need interfaces so that we are in control of the model. You need to define how you access the data and how you return the results in the end, and most of the time there are many other teams who are in control of those parts, so it's important to speak to them, to communicate, and to define interfaces. To give some more characteristics of a use case: we already covered the delivery, so depending on your use case you may need a web service, a stream, or a database. Also important is the problem class, that you decide early on: I want to do classification, regression, or recommendation; I do or do not need explainability. This will later decide what kind of libraries you can use, so it's important to think about it early on. Then the volume and velocity: this will tell you what kind of scalability requirements your model needs to have. Then the inference and prediction: is it enough to do the inference once a day in a batch way, or again in real time or on a stream? All of this will later decide how you are going to put things into production. Additionally, you have technical side conditions. Maybe you are working in a company where there's a huge Java stack.
Many companies have a Java stack, and it could be that they say: in the end we can only roll out a Java model in a scalable way. That is a technical condition, and you should think about it early on, because if you then do everything in pure Python, you will be bound to just providing proofs of concept, because your code will not be able to be moved to production. Other conditions are things like: is it going to be an on-premise solution, or maybe in the cloud? One important thing: there is no one-size-fits-all solution for this. There are providers offering some holy grail, "use our framework and everything will work", but for me this is not true. You really have to evaluate your use case first and then decide on a use-case-by-use-case basis. So the takeaways from this high-level perspective on your data use case, your data product, are: state the requirements of your use case early on; think about how to move things into production before you actually start some kind of proof of concept; identify and check your data sources.
That means: don't just take a one-time data dump that someone gave you on a USB stick. Rather, think about where the data is coming from and how you could later access it in a productive way. Then define interfaces with the other departments, meaning: if there's a special team for the management of the databases, and people who are filling those databases, define with them how the data should be formatted, and so on. This will be important for production, because if someone later changes, say, the schema of a database, everything could fail. Another good piece of advice is to test the whole data flow early on with some kind of dummy model or heuristic: try to run the whole process, from reading the data, through a simple transformation, to writing the results back into a database or a stream. Test this technically early on to directly see where things could go wrong. So this was the part about the organizational, or rather the use-case, aspect. Another big, important thing I see in the topic of data science to production is quality assurance. Especially for data scientists it's quite often the case that they program in notebooks and code is more of a one-shot kind of deal, but actually building a data product is an iterative process. This is a really old insight; it has been invented by IBM and is more than 20 years old: the Cross-Industry Standard Process for Data Mining, CRISP-DM. Already there they said: if you do data mining, it's going to be an iterative process. You're going to grab your data, you're going to prepare your data, you come up with a model, you evaluate, you get more insights about the data, and this goes on and on and on. The same goes for any kind of data product.
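One practical way to do the early dummy-model smoke test mentioned above: wire the full read-transform-write path together before any real modeling. The column names and the heuristic here are made up, and the in-memory streams stand in for whatever real source and sink your use case has:

```python
import csv
import io

def heuristic_model(row):
    """Stand-in 'model': a trivial heuristic (predict last week's value).
    The point is to exercise the full data flow, not to be accurate."""
    return float(row["last_week_demand"])

def run_pipeline(source, sink):
    """Read rows from a CSV source, apply the dummy model, write results.
    In production, source and sink would be a database, file system, or stream."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(sink, fieldnames=["item_id", "prediction"])
    writer.writeheader()
    for row in reader:
        writer.writerow({"item_id": row["item_id"],
                         "prediction": heuristic_model(row)})

# Smoke test of the whole flow with in-memory stand-ins for the real interfaces
source = io.StringIO("item_id,last_week_demand\nA,10\nB,3\n")
sink = io.StringIO()
run_pipeline(source, sink)
print(sink.getvalue().splitlines()[1])  # → A,10.0
```

Once this trivial version runs end to end against the real interfaces, you know where the technical problems are, and you can swap the heuristic for the actual model later without touching the plumbing.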
If you now keep in mind that it's going to be iterative, quality is of course going to be an important aspect, and quality in different areas. For instance, if you program something, even if you just start with a kind of proof of concept, make your code clean. What I see quite often is that people use JupyterLab and notebooks a lot and just put everything into one huge notebook, and if they have a similar task they just copy things over. This is not good, clean coding, and here you can actually learn a lot from the clean-coding principles that Java developers often follow: software design patterns, the SOLID principles, and especially the Clean Code Developer. Who knows the website Clean Code Developer? Okay, not so many hands; that's actually what I thought. Clean code is really important in the end if you want to move things into production, because other people are going to read your code, you have to make adjustments, and so on, and there are many good resources, even, or especially, for Python developers. Another practical thing one should care about is continuous integration: if your team works on something, continuously integrate your code into a master branch, have unit tests that continuously test your code, think about versioning, about packaging, about putting your packages, your artifacts, on an artifact store. Optimize this and embrace some kind of development process. This is actually quite easy to do; there's the open source tool Jenkins. I guess most of you know Jenkins. Who knows Jenkins, and who is actually using Jenkins? Okay, that's good.
What I always do when I start a project is to directly implement a really simple continuous integration process, because it will help you so much later on, and you're going to need it later for production anyway. The next thing is monitoring. If you build any kind of data product, you are of course interested in improving some key performance indicator. For recommendations it could be: we want to improve our click-through rate, our conversions, and so on. If you do that, you need to monitor things the whole time: how was it before you implemented your cool new algorithm, how was it before you tuned something or retrained? So it's important to really monitor your KPIs. It's also important to monitor the whole setup: how many requests did you have, if you provide your recommendations or predictions as a REST service, to see if this is reaching some kind of limit. Also check the total number of predictions: maybe something was wrong in the data ingestion and now you're not predicting enough. And check the run times, and so on. Monitoring gives you sight; not having any kind of monitoring is like flying an airplane blindfolded, and this is also something that Google says.
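The bookkeeping just mentioned, request counts, prediction counts and run times, can start very small. A minimal sketch with hypothetical names (a real setup would forward these numbers to a monitoring system rather than keep them in process):

```python
import time
from collections import Counter

metrics = Counter()   # request and prediction counts
latencies = []        # per-call run times in seconds

def monitored(predict_fn):
    """Wrap a prediction function with basic bookkeeping: how many requests
    came in, how many predictions went out, and how long each call took."""
    def wrapper(batch):
        metrics["requests"] += 1
        start = time.perf_counter()
        result = predict_fn(batch)
        latencies.append(time.perf_counter() - start)
        metrics["predictions"] += len(result)
        return result
    return wrapper

@monitored
def predict(batch):
    # dummy model standing in for the real one
    return [x * 2 for x in batch]

predict([1, 2, 3])
print(metrics["requests"], metrics["predictions"])  # → 1 3
```

Even this much is enough to alarm on "the number of predictions dropped" or "calls are getting slower", which are exactly the failure modes described above.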
There's an open source book by Google, the Site Reliability Engineering guide, and they have this nice hierarchy where they say that for any kind of product, the most important and fundamental thing they ask for at Google is to have monitoring in place. And I've seen so many times that people start with some data science project and no one actually cares about monitoring until much later. Especially important for data science and data products is also monitoring how good the quality of your model is. You normally have your metrics, and of course you check them, but you can also do this in a live test. There is the so-called response distribution analysis. Take a classification task: say you're classifying whether a picture is a cat or a dog, with outputs around zero for a cat and around one for a dog. If you make a histogram over all the responses, you would directly see that A is a working model and B is a rather confused model that is not really sure about what it's outputting. Having a simple thing like this in place will tell you directly whether the model you maybe just deployed is nonsense and you have to replace it or fall back to another model. It's definitely better to see this yourself before another department calls, or maybe a customer, telling you that whatever you just deployed is not predicting anything meaningful. Another thing regarding monitoring: think about A/B tests. If you go to production, you will care about those iterative processes.
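The response distribution analysis just described takes only a few lines. The scores below are made-up classifier outputs in [0, 1]; a confident cat/dog model piles its mass near 0 and 1, while a confused one piles up around 0.5:

```python
def response_histogram(scores, bins=10):
    """Histogram of classifier outputs in [0, 1]."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    return counts

# Model A: confident, mass near 0 (cat) and 1 (dog)
confident = [0.02, 0.05, 0.9, 0.97, 0.99, 0.03, 0.95]
# Model B: confused, mass around 0.5
confused = [0.45, 0.5, 0.52, 0.48, 0.55, 0.51, 0.49]

print(response_histogram(confident))  # → [3, 0, 0, 0, 0, 0, 0, 0, 0, 4]
print(response_histogram(confused))   # → [0, 0, 0, 0, 3, 4, 0, 0, 0, 0]
```

Running this over the live responses of a freshly deployed model is a cheap sanity check that catches a broken deployment before any offline metric would.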
You will start implementing new features, you will make improvements to your model, so it becomes really important to keep track of how much you improved with respect to the current baseline. And you can't always do this in an offline test; you also have to show it in online metrics, and in the end the business unit, the product owner, the stakeholder will care about the KPI, because this is what he or she is going to report to their superiors. A nice additional advantage of using A/B tests is that you can, for instance, also do hyperparameter optimization with the help of multi-armed bandits. The technical requirements for A/B tests: of course you have to have versioning in place. I'm really eager on versioning, so version your models, version your things, provide proper Python packages, because you need to link those versions to the test groups of your A/B test, so that you can see: this was version 1.0, and this was version 1.1 with the cool new feature. You also need to be able to deploy several models at the same time in production, because you're going to have at least two groups. And you need to be able to track the results really up to the point where they face the customer, or whatever consumer you have. In the case of recommendations, for instance, you need to track that this prediction, this recommendation from model A, was shown to this user in group A; all this tracking has to be in place. If you're using TensorFlow, I can recommend TensorFlow Serving, which we used in one project. It's an open source tool by Google which does a lot of this organization and management of different models for you. So those were some quality assurance aspects. Another big topic in the field of data science to production is organizational, or cultural, problems, and again it's nothing really new. If we look at the problems that normal developers and operations people have: most of the time, if you have a team of pure developers and a team of pure operations people, the developers say: okay, our responsibility is to code, to test, to make releases. Of course they use version control, and in the best case they also do continuous integration, and so on. And when they are happy with something, they make a release and throw it over the wall of confusion. And the operations team is like: yeah, thank you, and now we have to package this. We don't understand what's in there, but we have to package it, we deploy it, we do the whole life cycle. Of course there's going to be some configuration management to do, and we have to care about security and monitoring. If you keep this completely split up then, as people realized years ago, this is not the way to develop software really fast and efficiently. And with data science and data products this thinking hurts even more. It's especially dangerous for data products and teams, and you are seriously going to have a problem with your speed and time to market if you just think as a data scientist: how do the things get into production? I don't care, it's not my job. This is definitely the wrong way of thinking. The better way of thinking is to have a team that thinks "let's build a great data product", not "I made a great model". It's just a different way of thinking, and in the world of software engineering there's this big movement for it. How many of you know DevOps, the DevOps culture, have heard of it?
Okay, so few. The idea is to overcome this wall of confusion, to do continuous delivery, which is continuous integration taken one step further, so that at any point in time you could decide to deploy and deliver your software, and to have heterogeneous teams of developers and operations people working together. On the side of data science we can apply the same thing. In my experience, pure teams of data scientists don't get anything into production, because they just lack the knowledge: how to deploy, and how to do all those things you need to do to get something into production. So the learning is that you have to have heterogeneous teams of software engineers, data scientists, data engineers and operations people. If they all work together, they also start sharing their knowledge, they can work together on a single product, and they see it as their responsibility to get that product into production. As a rule of thumb, for a single data scientist you need two to three data engineers who help to do the things around the model. So you don't actually need that many data scientists, and right now it's even harder to find good data engineers, at least on the German market, than to find a good data scientist. Optionally, it is also a good thing to have a product manager embedded directly in the team. And if your data product is in any way related to a customer-facing user interface, as with the recommendation topic again, then it's also good to have the user interface or UX expert directly in your team, because how you show things to your customer will also dramatically influence the results. It's good to have this close by, and not in another team where they maybe make completely different decisions without telling you about it. A company that actually does a lot of this organizationally is Spotify. They are really advanced when it comes to this: they have fully autonomous, vertical teams for every feature, with end-to-end responsibility. Really from the design, and from where the data comes, to how it is shown in the Spotify application or on the Spotify website, they are completely responsible, and this allows them to iterate really fast and, especially, to have less politics. I've added a link here so you can read about it later; it's really interesting, and there are also a lot of talks on the web about how Spotify organizes their teams. So this was the organizational, or cultural, aspect of data science to production, but we also have a language aspect, or, as I would call it, a two-language problem. As I've said before, in the industry many people use Java, and the reasons for this are quite persuasive. Many people argue that having a strongly typed language is safer, because the compiler already finds a lot of edge cases, and that it has a stronger emphasis on robustness and edge cases. It has been an industry standard for many years, and people know how to deploy things. So in many companies you will find that, if there's a separate operations team, they will say: only Java things will get into production in the end. I don't care what you do as a data scientist, but it's going to be Java in the end. And then there's the other side, the other world, where as a data scientist you're more of a science person: you like Python, you like the dynamic nature of the language, you have a stronger emphasis on cool methods and cool results and maybe not so much on robustness, and you are happy as long as it runs on your machine.
So there are just these two sides, and of course there are many ways to resolve this problem. I'm going to present several ways I've seen it done in projects, and we can discuss them. One is to just select one language to rule them all. I've once been in a project where it was said: in the end it's got to be Java, so start doing everything directly in Java. And I've heard that Netflix, for instance, does everything for their recommenders directly in Java. The upside of this is that with a single language, you later reduce the complexity of your deployment. Most companies know what to do with Java: you can package everything into a nice JAR and run it in some application server. The huge downside is, of course, that you're completely abandoning one ecosystem; in the case of Java it would be the Python ecosystem, so you don't have scikit-learn, you don't have pandas, and you have to re-implement a lot. But it's a solution that some companies choose. Another option: if you just say, okay, Python is the winner, how about putting everything into production in Python? This is especially nice if you are a data scientist, because you can keep your favorite programming language. I've found from my experience that it's especially useful for batch prediction use cases, in the categorization I've shown before. If you're doing some kind of prediction that you only have to do once a day, something like what we did at Luyander, predicting the demand of the next two weeks, and you have 24 hours to do one batch prediction, then that's a perfect use case for Python. If you need some kind of web service delivery, you also have many options; Python is a general-purpose language.
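A web-service delivery of predictions can stay very small. Here is a stdlib-only sketch (a library like Flask would shorten it further); the catalogue and the recommendation logic are made up purely for illustration, standing in for the real model:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def recommend(user_prefs):
    """Dummy recommender: rank a fixed catalogue by tag overlap
    with the user's preferences (hypothetical stand-in logic)."""
    catalogue = {"item_a": {"rock"}, "item_b": {"jazz"}, "item_c": {"rock", "jazz"}}
    prefs = set(user_prefs)
    ranked = sorted(catalogue, key=lambda i: len(catalogue[i] & prefs), reverse=True)
    return ranked[:2]

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # expects a JSON body like {"preferences": ["rock"]}
        body = self.rfile.read(int(self.headers["Content-Length"]))
        prefs = json.loads(body)["preferences"]
        payload = json.dumps({"recommendations": recommend(prefs)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

# HTTPServer(("", 8080), Handler).serve_forever()  # uncomment to actually serve
print(recommend(["jazz"]))  # → ['item_b', 'item_c']
```

Because the prediction logic is a plain function behind the HTTP handler, it can be unit-tested and versioned independently of the serving layer, which matters for the A/B testing and monitoring points made earlier.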
You have many nice libraries, like Flask, to make some small REST service. When you do Python, you can also always just scale horizontally, if someone makes the point that Python is maybe not fast enough compared to Java: you can scale horizontally during prediction, and during training what I like most is to have just a big bare-metal node with many cores and a huge amount of RAM where you can train your model. The good thing with Python is that you are also not only bound to the Python ecosystem: you can also tap into the Hadoop world, for instance by using PySpark and PyHive. Those libraries of course have some limitations compared to the Java libraries, but nowadays, with Spark 2.3, you can use a lot of things from Python. If you then later want to deploy something, it's good to think about isolated containers, and maybe use Docker, just to have all the dependencies packaged in one thing, because there exists nothing like a JAR file for Python, where you have everything packaged. Another solution to the problem, which I think is the worst-case scenario, is that you let a team of data scientists do something in Python or R, and then some poor person has to rewrite everything in Java. This is something that once happened to me: I wrote a lot of Python, and then we were sitting together doing a conversion to Java, because only Java was allowed in production. It's really a lot of effort, and it's slow. As we said, building a data product is an iterative process, so if you later decide on new features, you implement them first in Python, then someone moves them over to Java: it takes forever. It also causes a lot of bugs, and if you see a bug in production, it's always hard to find out: is the bug maybe in the Java code, or is the actual reason in the Python code, a mistake by design?
The upside is that everyone gets what they want, but I would never argue in favor of this solution to the two-language problem. Another thing, which I've never really tried out in production, is to use exchangeable formats. There are many around, like PMML or ONNX. They work great in theory, but if you experiment a little with them, as we did once without ever putting it into production, you find they have quite limited functionality, and you have no guarantee that if you use Python, build your model, save it in some exchangeable format, and then read it in Java, for instance, it really does the same thing. You have to trust those two implementations. I mean, even with something like HTML, a website never renders exactly the same way in two different browsers, so why would it work for those exchangeable formats? Another downside is that they often don't include the pre-processing and feature generation. This is what I said before: when I'm talking about the model, it's not only the machine learning algorithm, it's also all the imputations and all the things you did beforehand. Those exchangeable formats would need to be able to specify this; otherwise you are re-implementing things again. Another solution for the language problem is using frameworks. We've used TensorFlow, especially for some recommendation tasks, and it's really nice in that you use Python to train your model, you save it in some binary, protobuf-based format, and then this binary blob can be read and served by Java. This is a really nice thing. There are other frameworks.
H2O, of course, is quite common; we've also done something with it. There we had a bit of a problem in that it doesn't allow so much pre-processing: you have the basic machine learning algorithms in there, but not all of the pre-processing. There it's also the case that you use Python to build your model, then you save everything into a so-called MOJO file, and later on Java can run it. If you opt for this solution, which I think can be a valid one depending on your use case, as I've said many times, you should always keep in mind that you are paying with flexibility. If you decide on a framework, you will only ever be able to do what the framework provides, which can be fine, but it is a limitation. So we have basically seen different ways, different doors, to overcome this two-language problem: re-implementation, just re-implementing everything in Java; using a framework; or deciding on a single language. From my experience, re-implementation is definitely no option. Don't do this; I've been there, and it doesn't work well. Frameworks are a valid solution: if you use TensorFlow or H2O, they can really help you get things into production much more easily while overcoming the two-language problem. And if you decide on a single language, okay, I'm a bit biased here, I would definitely choose Python and not let data scientists program in Java, or even Scala, because that is really frustrating. So, we've talked about the language problem; now a little bit more about deployment and some general advice and good practices. For deployment, as I said before, there's no one-size-fits-all; it heavily depends on your use case and on the use-case evaluation that you've done before.
There are software engineering principles that you should always apply, like, as I said before, continuous integration and continuous delivery. I can't say it often enough: just do it. Also think about how big your machine learning code actually is compared to all the other things. There's a nice paper by Sculley et al. from 2015, already a few years old, about where the technical debt in machine learning systems actually is, and we see that in the middle, in your machine learning code, there's not much technical debt, but everything around it doesn't get enough focus. A lot of those boxes are actually related to deployment: your configuration, your process management, your machine resource management, your serving infrastructure, and especially your monitoring. These are all things you need to care about, and they don't get enough attention in really many projects. This Sculley paper was a kind of survey, and it's good to keep it in mind. So, general principles again: version your things, package them, and have processes and quality management in place. It also helps to keep the development and production environments as similar as possible. If you program on a Mac and then move everything onto a Linux system, you can already run into problems there, even if it's Python. Automate as much as possible, again continuous integration and continuous delivery; this also avoids human errors. And think about controllable environments, for instance by using Docker, or at least having conda environments or other environments where you can pin versions down. Google also thinks a lot about this, and they have a nice blog post about best practices for machine learning engineering.
I'm not going to go through all those different rules; many have been said already: design and implement metrics, and so on. Most of them come down to this: if you want to bring things into production, most of the problems are actually engineering problems. In the end it's not your cool data science model; it's really a lot of engineering problems you have to overcome to bring things into production. As practical advice on how easy it is to do continuous integration, there's also a blog post link; you will see the slides later. If you use Jenkins and, let's say, devpi as an artifact store to save your built packages, it's just two jobs: one Jenkins job that clones the repo, builds the package, and pushes it into some unstable index; and another Jenkins job that, after having cloned the repository again, installs the package, runs the unit tests, and then, depending on the results of the unit tests, pushes it into some testing or stable index, so that other people can use the new version. Speaking about packaging: a really cool tool for this, and really easy to use, like a five-seconds thing, is PyScaffold. It provides you with sane Python packages; it just gives you a kind of template, a scaffold, for a typical Python project.
It provides you with versioning for every commit: you basically just set Git tags, and it enumerates the commits, so you have unique versions out of the box. It integrates really well with Git, has pre-commit hooks, and gives you a declarative way of defining all the configuration for your package with the help of setup.cfg. It follows community standards, and you can even extend it with your own extensions.

As the last slide, a short recap of what we learned. The key learnings for data science to production really are: there's no one-size-fits-all solution; evaluate your use case and then think early on about how you can bring things into production. Think about quality: quality assurance is really important. Try to establish a DevOps culture and a team responsibility for the whole data product, not just for some fancy data science model. Then think about how you overcome the two-language problem that you might have as a Python developer. Embrace processes, and automate as much as possible. And the key thing really is: production is not an afterthought, so think early on about how you can later move things into production. With this, I want to close my talk. Thank you for your patience and your attention.

Thank you very much for a very interesting talk, with many interesting and important things you have to do when we develop software. Any questions? Yes.

Thanks, thanks for the talk; it's really great to see someone putting in the effort and sharing those insights. I've got a question on the monitoring part of your talk. How would you put a process in place to monitor the performance of the model, whether it's making suitable recommendations or predictions? Because I think you mentioned a technique whereby you can visually see if the model is confused. But what about when we don't really know what sort of input the model is going to get? How can we later on see if we can improve the model based on errors
it might have made, and put a process around that?

Yeah, so I would divide the monitoring into several parts. Of course, you need to have some monitoring for the incoming data. This is really important, because then you can easily see all the errors that are just due to the fact that you got new outliers, or maybe some not-available values somewhere. So you should have monitoring in place for the incoming data: does it still look like last week? You can define alarms on this, like: okay, suddenly we have not seven categories in this feature but ten, or why did the number of not-available values go up from ten percent to fifty percent, and so on. So this is the early alarming on what goes into your model.

Then you have the monitoring after your model, on the results of your model. There you can check simple things like: how many predictions did I make? Is the number of predictions still as high as maybe last week? That depends on your use case and on what you are predicting. Then, what I showed before on that one slide: really check each result and do this histogram, this response analysis of your model; this can really help. And of course, when you iterate and make a new model, you will have some offline metrics; save those as well, and put the version number next to them, so you can see that maybe it looked good offline, but then the online KPI metrics went down. So this is again the offline metrics, which you can automate a lot, checking for accuracy or recall or whatever you want to check, and at the same time you have to look at the KPIs, which might then be the click-through rate. So there are many aspects; it really depends on the use case. But I would say: input, output, then the model quality, and technical things, like: maybe your model is getting slower and you are running into a lot of timeouts, and all those things.

Hey, so I wanted to ask you
about the DevOps culture: whether you have experienced that before, and what problems you found in integrating the whole set of different skills to work as one team?

So, in one project, it was like that before: we were only data scientists, and then we had all those problems. Then a decision was made that we would have heterogeneous teams, and then we were doing more of a DevOps culture. I mean, of course, at first it's a little bit: okay, why do we now work together? And people react differently to this. Then there's also this struggle sometimes: let's say the software engineer comes and asks you about your model, and then some people get critical, like: hey, I'm the data scientist, what are you asking me? Why am I doing this in my model? I'm the expert. For some people this can be quite hard at first, but you have to overcome this. You need to communicate, and you need to think: okay, this person has another background, but the person has every right to know what is going on in the model. So it at least starts with a little struggle, I would say, but then it calms down, and in my experience it's definitely better in the end than it was before. But it also depends on what kind of people are in your team; if you have maybe some completely introverted data scientists, it could be hard for them.

Okay, last question.

Hi. So I think the choice of the language is definitely a big issue in my company. Basically, we have a very heavy Java legacy presence that serves two things: one thing is to build a pipeline, where you need one thing to feed the data to the next, and that's all in Java. But now we want to plug in Python computations. So the way we are treating it is that we still keep the backbone as Java, and then for the individual nodes we try to wrap around a Python script, basically Java wrapping around the Python, and
then fire up the Python process. You know, the data and the caching are certainly the problem, so we would like to explore Apache Arrow in the near future. So do you have any experience of Java firing up Python processes and sharing the cache, that sort of experience, and did it work well?

I've actually also tried this once; I had the idea: well, I'll just build some JAR where I put in all my Python code and then run it. And I had extreme problems getting this to run with any kind of library like NumPy and so on. There is this Py4J; you can do things like this, and for really simple Python applications it works, but only really simple ones. I would not recommend it; it's really a hack. And if you then have NumPy, which is C also wrapped in this, you have those conversion costs. But then again, I'm no expert in those Java-to-Python things on a really low software level. I know Arrow, and it's in Spark 2.3, and things get a lot faster, but I've never programmed against Arrow directly. I would actually be careful with doing things like this; wrapping your Python things in Java sounds like a huge hack to me. I would rather go for establishing interfaces. I mean, if you have a pipeline, then depending on your runtime requirements, you could use a database as an interface, a kind of place where the data is saved; you grab it from Python, do your calculation, and save it back there. Then it could work, depending on how fast it needs to be in the end. But I would rather define some clear interfaces and not do any kind of black magic with Python inside Java.

Thanks a lot.
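As a footnote to that last answer: the "database as an interface" idea can be sketched in a few lines of Python. Here, sqlite3 stands in for whatever shared database the Java pipeline actually uses, and all table and column names are invented for illustration:

```python
# Sketch of a database used as the interface between a Java pipeline
# and a Python computation step. Table/column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inputs (id INTEGER, value REAL)")
conn.execute("CREATE TABLE predictions (id INTEGER, score REAL)")

# Pretend the Java side of the pipeline inserted these input rows
# (in reality via plain JDBC against the same shared database).
conn.executemany("INSERT INTO inputs VALUES (?, ?)", [(1, 0.2), (2, 0.8)])

# Python side: read the inputs, apply the model (a trivial stand-in
# computation here), and write the results back for Java to pick up.
rows = conn.execute("SELECT id, value FROM inputs").fetchall()
conn.executemany(
    "INSERT INTO predictions VALUES (?, ?)",
    [(row_id, value * 2) for row_id, value in rows],
)
print(conn.execute("SELECT id, score FROM predictions").fetchall())
```

The Java side only needs plain inserts and selects against the same tables, so neither language has to embed the other, which is exactly the clear interface recommended above.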