So we have here Michelle Navavian, and she's going to talk about integrating systems using Python and Pandas. Michelle has been working at Bloomberg for 16 years, which is a lot of time, so I think that's really good. She's the team leader of an application team, and she works with Python and C++. I hope you enjoy her talk. Thank you.

Thank you. So my talk is a bit of a software engineering talk. Specifically, we're going to talk about integrating systems using Python and Pandas. What I want to talk about is Python as an integration facility, and what I'd like for you to take away is that Pandas is not just a data science tool. If you pull up the Pandas website, the entire description is about how it's awesome for data science. What I'd like for you to learn today is that you can also use it for software engineering, specifically for system integration. Hopefully I'm able to impart that.

The same features that make Pandas excellent for data science make it excellent for system integration: the data structures, the DataFrame, the slicing and dicing, the joining, and also the ability to read and write data sets in different formats very easily, with very little code.

In our use case, the software engineering problem is that we're integrating systems, which I've simplified to two endpoint systems: producers and consumers. Our goal is to be able to design these systems independently. The other really valuable thing is that with the integration facility we built, you're able to integrate both new systems and old, legacy systems. That's the value of having this integration layer: you don't have to redesign your legacy systems, and you can build your new systems in whatever design you think is suitable for them.

So we use the Pandas toolkit for its ability to let us maintain different data models in our different systems. We want to be able to have different interfaces. You don't want to have to make any concessions in one endpoint system's data model or interface because of the other system. And, this is also really nice, you want to be able to produce your data once, in one form, in your own data model, and then the consumer applications on the other end can each consume that same data in different ways. It's produced once and consumed in different ways by different applications.
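A minimal sketch of that "produce once, consume in different ways" idea, with made-up column names and formats, might look like this:

```python
import pandas as pd

# One data set, produced once, in the producer's own model.
prices = pd.DataFrame({
    "ticker": ["IBM", "MSFT"],
    "date": ["2023-01-02", "2023-01-02"],
    "price": [141.5, 239.6],
})

# Each consumer gets the same data in the form it expects, one line each:
prices.to_csv("prices.csv", index=False)         # a CSV-file consumer
prices.to_json("prices.json", orient="records")  # a JSON consumer

# A consumer that wants the same data pivoted, with tickers as columns:
prices.pivot(index="date", columns="ticker", values="price").to_csv("prices_pivoted.csv")
```

Each output is one line of code; that's the "very little code" point.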
So, our goals. Before I get into the next few slides, I want to give you a little background on why these are our goals. Traditionally, in my experience, the way systems are integrated is as follows. You pick some middleware, such as an MQ, and the systems being integrated make an agreement. There's a contract: you agree on some data structure, and that's the model, the object you're going to communicate. That data structure gets filled by the producer and enqueued, and on the consumer side it gets dequeued. So both sides need to know that contract: they need to know that data structure, and they need to know that there's a queue in the middle. Something more crude, which I've seen as well and is far worse: the producer system writes some data into a database, and the consumer system knows about that database and goes and retrieves it. Then you have complete dependence on each other; it becomes very messy, and it's a recipe for having to redesign your system very soon.

So our goals were to reduce this kind of interdependence, this need to have knowledge of each other, and to really reduce the need to redesign either side. Specifically, in our case, the consumer was a legacy system, and we were really not interested in going and messing with its data structures or its interface. We just didn't want to touch that.

So, the problems we aimed to solve. We wanted to acquire data, which was pretty much the role of our producer, and pass it along to the consumer. The consumer was receiving its data from someplace else, and we wanted to short-circuit that and start feeding it data from our new system. And in order to be agile, in order to make it safe, we wanted to do this in increments. We wanted to build our modern system up gradually, and incrementally migrate and deprecate the legacy system, or really the inputs to the legacy system. And I want to repeat: we wanted to make sure there was no interruption in the existing system. It wasn't just that it was a legacy system; it was serving a very important product, and it had to keep running. So, no changes to the legacy system's interfaces and data models. Key goal.

So we built a structure like this. We had our producer with its own data model, and our consumer on the other side with its own data model. The other thing was that the consumer's interface requirement was files: files for the data it would consume, and also files as a signaling mechanism, a "done" file or "OK" file. Everything was file integration. For our modern system, we didn't care about files. Why should I take data I'm writing to a database, or holding memory-resident in a service, and write it into a file? That's not my data model; I have a separate data model. In our case we had our databases, and we had a service interface for those.

What we did, to make sure there was no interdependence between these systems, was build an integration layer, and this is where we used Python and Pandas. The integration layer would take inputs from the producer, apply indexing, joins, concatenations and so on, and output the required format. The output would go to whatever applications needed it; multiple applications usually consume the same data, perhaps pivoted differently, depending on what each application needed to do.

So, the integration layer. We used Python with Pandas for consuming the data, munging it, and delivering it. On the input side we had the databases and services I mentioned. Also, sometimes we had a workflow where we had already produced a file, and we didn't want to produce the file again or go and access the data again, so sometimes our inputs were files as well. We used Pandas in the middle. On the output side, we used files, but it would be very easy to write to a database as well, or to call a service.
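A minimal sketch of one such integration pass, with made-up file names and join key (in the real system the inputs often came from a database or a service rather than files), might look like this:

```python
import pandas as pd
from pathlib import Path

def run_integration(positions_path: str, prices_path: str, out_dir: str) -> None:
    """One integration pass: read the producer's data, join it, write the consumer's files."""
    # Inputs could just as well come from a database (pd.read_sql) or a service call;
    # here we read two hypothetical files the producer already wrote.
    positions = pd.read_csv(positions_path)
    prices = pd.read_csv(prices_path)

    # Apply the transform: join the producer's data sets on their shared key.
    merged = positions.merge(prices, on="ticker", how="left")

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # The legacy consumer's contract: a data file...
    merged.to_csv(out / "positions_with_prices.csv", index=False)
    # ...and a "done" file as the signal that the data file is complete.
    (out / "positions_with_prices.done").touch()

run_integration("positions.csv", "prices.csv", "outbox")
```

The endpoints stay isolated: the producer never writes files, and the consumer never knows where the data came from.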
So again, our specific goal was complete isolation of our endpoints. We created a Python service, and in order to specify new integrations, and spin them up really quickly, we had JSON configs. In the JSON config you specify where the inputs come from, where the outputs should go, and what to do if you get collisions, that is, if you receive the same index twice. Sometimes the outputs have to go to multiple places; that can be configured in the JSON as well.
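A hypothetical sketch of that kind of JSON config, with every field name invented for illustration, might look like this:

```python
import json

# Hypothetical JSON config for one integration: the inputs, the named
# transform to apply, the collision policy, and one or more output targets.
config = json.loads("""
{
  "inputs": [
    {"name": "positions", "type": "csv", "path": "positions.csv"},
    {"name": "prices",    "type": "service", "endpoint": "pricing"}
  ],
  "transform": {"kind": "merge", "on": "ticker"},
  "on_collision": "keep_latest",
  "outputs": [
    {"type": "file", "path": "outbox/positions_with_prices.csv", "done_file": true},
    {"type": "file", "path": "archive/positions_with_prices.csv", "done_file": false}
  ]
}
""")

print(config["transform"]["kind"])  # the integration layer dispatches on this name
```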
Then, inside this integration layer, we used Pandas to read any format, filter, join, pivot, and then output into text formats, or pivoted text formats.

Now, this isn't the real integration layer, but I created some sample code to demonstrate the kind of design I'm talking about. It isn't using any JSONs; I just put together some test files, and it's minimal code. I created two files in different formats: one of them is space-delimited, the other is CSV. I print the DataFrames, I join the data, and then I output it in two different formats; I've made it very simplistic, so one output is a CSV and the other is a text file. Or take a pivot: say I want to take the same data again, a file, plus some existing file I'd used before, and I pivot the data, merge it again, put an index on it, and again send it out. And that's really it.
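A minimal reconstruction of that demo, assuming space-delimited and CSV inputs and guessing the file and column names, might look like this:

```python
import pandas as pd

# Two input files in different formats: one space-delimited, one CSV.
trades = pd.read_csv("trades.txt", sep=r"\s+")  # e.g. columns: ticker date qty
prices = pd.read_csv("prices.csv")              # e.g. columns: ticker,date,price
print(trades)
print(prices)

# Join the two data sets on their shared keys.
joined = trades.merge(prices, on=["ticker", "date"])

# Output the result in two different formats for two different consumers.
joined.to_csv("joined.csv", index=False)
with open("joined.txt", "w") as f:
    f.write(joined.to_string(index=False))

# Or pivot the same data, index it, and send that out instead.
pivoted = joined.pivot(index="date", columns="ticker", values="price")
pivoted.to_csv("pivoted.csv")
```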
So, Pandas allowed us to write much less code than we would have even in plain Python, and that's one of the reasons we used it. In addition, we had complete isolation of our producers and consumers. We were able to incrementally replace the data inputs to our legacy system, our consumer. We were able to integrate an old system with a new system without having to redesign the old system; honestly, without prejudice, we could be integrating any kinds of systems. And we were able to preserve business continuity, which, from our business and product point of view, is probably the most important thing. And that's it. Not a lot of things used, so the resources are fairly minimal. I'd like to take your questions.

OK, thank you for the talk. I see we have one question here.

Hi there. Thank you for the talk; it's a cracking idea, very inspiring. You mentioned JSON configuration for input and output, but maybe you could elaborate a bit more. I'm really curious how you define those things.

So the JSON configuration was basically where to get inputs from. There were some named entities in the JSON: essentially, where to get the data from, and specifically what kind of transform to apply to the data. Again, our data models are different, so you need to apply a transform, and there would be named entities for being able to do a merge or a pivot. We also configured in the JSON how to deal with collisions: if you get the same thing twice, what do you want to do? And then whatever the outputs are: the columns that you require; in our case it was files, so what columns, what file names, what file paths, where it needs to go. Those are the kinds of things we specified. The point was that any time we wanted to do a new integration, or rather replace one of the data inputs to our consumer system, we would just spin up a new JSON. And if the kind of data transform wasn't supported, we wrote some more Python code, a little bit more Pandas, and added a new name to the JSON.

Any other questions? I think you had one, right?

Yes, thank you for the talk. If I understand correctly, the JSON files provide recipes for the transformations. My question is more about when developers get to develop their transformation functions. Do you use any tools or packages that would type-check the schema of the Pandas DataFrames that go into those functions?

I don't remember, actually, if we did anything like that, and probably not, if I don't remember. I don't think so, but that's a good feature. Honestly, we owned the producer, so we were fairly confident that we knew what we were doing, and we owned both the consumer and the integration layer. So I don't think we did that.

OK, we have another question here.

Thank you very much for the talk. I have two questions. First, what kind of service was this? Was it pull or push? And second, did you ever come across use cases where you were hit by challenges such as latency, due to the size of the transformations? And if so, how did you deal with them?

OK, I'm going to take your second question first, because I remember it now. The latency: honestly, this is not a data science use case. We had maybe on the order of tens of thousands of things, so it really was fast enough for us. It was faster than other solutions we tried, so we didn't have any problems with that; it was working beautifully. And your first question, just say it again? What kind of service was it, was it a cron job? OK, so I'm at Bloomberg, and we have Bloomberg middleware that's very commonly used. You can build a service with it, and it provides the necessary tooling for queuing and so on. So it was a Python service built using that middleware, and I think you meant the integration layer, correct? I don't know if that answers your question, though. It's not like a message bus, and it's not like Kafka. Basically, you provide your interfaces: you specify how you want to work, what kinds of requests you want to make to your service, and you also specify what kind of queuing you want, for example how many messages you're willing to queue up, and when it should give up, that kind of thing.

OK, I saw somebody else had a question. We have time for the two questions I'm seeing now; I'll go to you after.

In addition to a technical problem, this also seems to be an organizational problem, because now you have more moving parts to worry about. What steps did you take to make sure that, for example, when somebody changes the legacy system, even though you don't expect it to happen anytime soon, there's some way of making sure your transformations get changed, aside from just remembering that it has to be done?

So we actually didn't suffer that problem, but maybe I can try to answer it anyway. We didn't suffer it mainly because this wasn't the first time this consumer system was consuming data. It already had contracts with its data providers, and it was consuming data in a very specific form, and it's unlikely that that form changes. Now, to your point: what if it does? How do you detect that something's wrong, or what can you put in place so that in production your whole system doesn't suddenly stop? I think doing things like the gentleman over there suggested, type checking and so on, is possibly useful. And obviously it's a legacy system, but if we had tests on it, those would be able to help, tests on the consumer side. But I haven't thought about it; I'll think about it more, and if you chase me, maybe I'll have a better answer.

We have time for one more question.

Thank you. So this is happening within Bloomberg, and you're describing that Bloomberg has the producer and the consumer, and then you have the middleware to process all this. Do you have any advice for the situation where Bloomberg is the producer and a client is the consumer, and you're building the integration from the client side?

Yeah, so we actually have lots of teams that support that. Bloomberg is a huge data provider, and we have lots of software for producing that data and delivering it in condensed forms. We've got tools for accessing the data with an API, from Excel or from non-Excel consumer systems; a lot of our financial clients actually use Excel, but it can be any system. So there's quite a lot of infrastructure around that. What I described here is internal, for internal systems, not for external clients. On the client side, the amount of data clients consume is vast, so the most interesting problems there are condensing the data and being able to transmit it quickly, and having accountability for what kind of data clients are taking off the system. It's a different set of problems, and they are solved, or hopefully.

Yeah, I'm sorry, but we don't have time for more questions. Thank you, everybody. We'll be around for the day, and you can stop by, maybe during lunchtime, if you have any more questions. So thank you, Michelle, for the talk. I really didn't know that Pandas could be used for integration, so that's one more thing to look into. OK, thank you very much. Thank you.