Now we have two speakers, Iztok Kučan and Joris Peeters, about algorithmic trading with Python. Very interesting, thank you. Hi, this talk is on algorithmic trading with Python. Just to clarify some terms: by trading, I mean buying and selling financial instruments on financial exchanges. By algorithmic, I mean there is a computer program running some kind of algorithm that decides what to buy and what to sell in these markets. At Winton Capital, we manage about $35 billion using a platform largely constructed in Python. We also use Python for research and data analysis around those activities. The talk is gonna go roughly as follows: we'll do a quick company overview, we'll have a little bit of an overview of our research activities and the trading pipeline itself, and then Joris is gonna go into quite a bit of detail about how and where we use Python. My name's Iztok Kučan. I'm the head of Core Technology at Winton, and Joris heads a very exciting new project we have, the Data Pipeline project, and there's particularly heavy use of Python there. If you've come across Winton in the past, you may have seen us called a quant fund, an algorithmic trading outfit, a hedge fund, or a commodity trading advisor. All of those are valid, but I think we would much rather be described as an investment manager: a company that uses the scientific method to conduct investment. What do we mean by scientific? Well, an empirical approach: heavy use of empiricism, hypothesis testing, experiment construction, and statistical inference in how we derive the strategies that we then trade on. We have around 100 researchers, which is about a quarter of the company, typically with a background in academia: academics, ex-academics, or post-docs. These are organized in teams. A lot of the activity is peer-reviewed, so it's a fairly open process by which we arrive at the signals.
Another quarter of the company is in engineering, which again is a fairly empirical discipline itself. Geographically, we're primarily a UK company, so roughly 400 staff in the UK, mainly in London and some in Oxford, but we're expanding globally: four offices in Asia and two in the US. A lot of those offices are not just sales offices; a lot of them are actively growing. So for example, we have a new Data Labs outfit in San Francisco looking at esoteric data. Okay, so this is a Python conference, so what about Python at Winton? Winton's been active for about 20 years, and for the initial few years the systems were far simpler than what they are now, and effectively ran off an Excel spreadsheet. Then gradually C++ extensions started creeping into that Excel spreadsheet, and gradually those things were taken out of Excel and formalized as a set of objects called the simulation framework. That was, and remains, the core modelling tool and also the execution tool for our trading systems. But we found that as the framework gained flexibility, we needed Python to start combining these objects in a more flexible way. So for example, if I want to do a delta series and then a volatility series, I would be using the same two objects as I would if I wanted to do a volatility series and then a delta on the volatility, but I want to combine them in a different order. Python was quite useful for that. As soon as we started using Python in that manner, it became very attractive to start writing strategies, or parts of strategies, in Python themselves. And from then on, it never really stopped. So over the last 10 years, we've adopted Python for constructing the trading platform, but also increasingly in data analysis and in research. So I'm starting to use these two terms: research and investment technology.
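The point about combining the same two objects in a different order can be sketched with plain Python. This is not Winton's actual framework; the `Delta` and `Volatility` classes and the `pipeline` helper are invented for illustration, assuming only that each transformation object exposes an `apply` method:

```python
import math

# Hypothetical stand-ins for two simulation-framework objects.
class Delta:
    """Transforms a series into its first differences."""
    def apply(self, series):
        return [b - a for a, b in zip(series, series[1:])]

class Volatility:
    """Transforms a series into a rolling population standard deviation."""
    def __init__(self, window=3):
        self.window = window

    def apply(self, series):
        out = []
        for i in range(self.window, len(series) + 1):
            chunk = series[i - self.window:i]
            mean = sum(chunk) / self.window
            out.append(math.sqrt(sum((x - mean) ** 2 for x in chunk) / self.window))
        return out

def pipeline(series, *transforms):
    # The same objects, composed in whichever order the researcher chooses.
    for t in transforms:
        series = t.apply(series)
    return series

prices = [100.0, 101.0, 103.0, 102.0, 105.0, 104.0]
delta_then_vol = pipeline(prices, Delta(), Volatility(window=3))
vol_then_delta = pipeline(prices, Volatility(window=3), Delta())
```

The objects stay the same; only the composition order changes, which is exactly the flexibility Python added on top of the C++ framework.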
We have quite a strict distinction between what is exploratory activity and what is trading activity. Exploratory activity, research, is looking into things that may lead to something, or often will not. The research itself is conducted along three lines, I would say: core research, which is research into signals and, let's call them, market behaviours; data research, which is research into data and the properties of that data, and in an extended sense deriving data analytics like volatility profiles, volume profiles, and correlations from that data directly; and we now, as I said before, have a Data Labs section in San Francisco which looks at esoteric, speculative data sets like satellite imagery or the deep dark corners of the internet. But once signals are derived, we transfer them into the investment technology section. Now there's a much more rigorous exercise where we have a quite static trading pipeline. The key there is that you can do things in a very repeatable, very reliable, very secure manner, with some sign-off, and the trading pipeline itself is composed of roughly four stages, let's call them data management, signal generation, order management, and post-trade monitoring. Now, Python is used a lot in research, but it's also used extensively now in the data management and signal generation parts of the trading pipeline. With data management, typically the things we do in the trading pipeline are: obtain large sets of data, clean them, and transform them into the things we need. We use things like versioning to make sure that we can repeatably see data as it changes. Python underlies all that architecture. For the signal generation part of the pipeline, we also use Python extensively. So Python still drives the simulation, which is a time series transformation engine, and increasingly so. Python is also interfaced to a data storage engine called the time series store, and Joris is gonna go into that in a bit more detail.
Right, so I'll give a bit more detail about how we actually use Python, some low-level detail about where exactly it sits in our stack. The main reason we use Python, really, is because it presents quite a friendly face to research. Our low-level code is typically all in C++: execution, our simulation platform. It's not something you want a researcher to write, so we expose all our code through APIs that are typically in Python. There are a few other options, but Python is definitely the main choice. And it's not just for research: because it's such a nice programmatic interface, we use it for monitoring, typically to serve web services as well, and directly in signal generation. The reasons we chose Python are extremely well known. It's very easy to learn: if you don't know Python, it's probably not too long before you do. And it comes with a lot of support for data analysis and visualization, so as a researcher it's quite nice to get all those batteries included. So this is a fairly large-scale overview of our trading pipeline. There are a few core principles to it. The whole thing is event-driven: something happens, which causes something else to happen. In this case, for example, we get our data from, say, Bloomberg. As soon as the data is there, we automatically construct our equities prices or futures prices. Once that's done, automatically all our strategies kick off, and that kind of event-driven flow sits really at the core of the Winton technology these days. And then, as Iztok mentioned, the simulation sits at the bottom right there. So whilst Winton is pretty much a real-time graph, with services just sitting there listening for stuff happening, we also have the simulation, which is like an in-memory, offline graph. Essentially it's really designed to do time series analysis. So you spin up a trading system, and that will kick off one of these simulations.
You run them, you can tear them down, you can serialize them. And that's the other main technology that we have. So I'll give a bit more detail about both the real-time graph, which we call COMETS, and then the simulations used to back-test strategies, which are written in C++. First, the simulation. It's written entirely in C++. It's been going for about 10 years now, I think, pretty much right after we moved away from Excel. If you just ignore the left-hand side, it's similar in concept to what appears these days in things like TensorFlow. Essentially it's a graph. It's very well optimized, and in our case strongly typed, so you can't just feed anything into anything: it's strongly typed data. There's an example there of a graph, quite a simple one. Two data series feed into something like a formula, which can be just the sum of these two series, and then you calculate that thing. Now, that's all running remotely on a calculation server. Typically these things can cover thousands, or tens of thousands, of assets. You don't want to run that on your local machine, so we run it on big calculation servers. But on the left-hand side we expose the Python client. So any user, any researcher, can just connect to, or launch, or spawn off any of these simulations, and has full control over the remote simulation. There's actually an example there on the left; it's a real Python script. The first thing it does is start a remote session, which is gonna cause one of these simulations to be constructed and launched on the server. Then it constructs these two time series, constructs a formula, and then calculates it. That's the only thing you have to write, and you have full control over the simulation. This means researchers don't need to know any C++. Anything they need, really, is in the simulation.
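The client script just described might look roughly as follows. Winton's actual client API is proprietary and not shown in the transcript, so every class and method name here is invented, and the remote calculation server is replaced by an in-process mock that only mimics the flow: start a session, build two series, combine them with a formula, calculate:

```python
# In-process mock of the remote-simulation workflow. The real client talks
# to a C++ calculation server; these names are illustrative only.
class Formula:
    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def calculate(self):
        # Resolve inputs (which may themselves be formulas), then apply fn
        # element-wise across the input series.
        resolved = [i.calculate() if isinstance(i, Formula) else i
                    for i in self.inputs]
        return [self.fn(*vals) for vals in zip(*resolved)]

class Session:
    """Stands in for a handle to a remote simulation on a calc server."""
    def time_series(self, values):
        return list(values)

    def formula(self, fn, *inputs):
        return Formula(fn, *inputs)

session = Session()                      # would launch a remote simulation
a = session.time_series([1.0, 2.0, 3.0])
b = session.time_series([10.0, 20.0, 30.0])
total = session.formula(lambda x, y: x + y, a, b)
result = total.calculate()               # the only code a researcher writes
```

The design point is that the researcher's script stays this short regardless of whether the graph runs locally or on a remote server.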
It comes with data, it comes with trading systems, it comes with universes, all the kind of stuff we need. Essentially it gives them fairly high-level control over anything they need to do. A little bit about the technology; I'm not gonna go too deep here. The Python bindings are extremely lightweight, so they don't know anything about the simulation per se. As soon as they launch a simulation, they get everything they need from that simulation. They populate your Python client with all the objects; the classes are dynamically generated and the objects are spawned into your namespace. If you create new objects, they are created both on the remote server and on the local client. Essentially, it gives you full control locally as if you were doing it remotely. It's very friendly in Python, so all the data is returned as pandas series, data frames, all that kind of stuff. One thing you can't do with the simulation bindings is go beyond controlling the existing graph: you're limited to the things, like formulas or value-based series or universes or particular trading systems, that technology has implemented in C++. What you can't really do this way is take a completely outlandish trading system that you wanna try and plug it into this graph. If it doesn't fit this kind of formula-or-data model, you're kind of stuck. And for that, we designed embedded Python. What you can actually do is write, in Python, one of these objects that run directly in the simulation graph. From then on, anybody can launch them remotely and run your trading system, and you don't have to write any C++. You can just contribute your Python code and everybody can run it. It wouldn't normally go into trading; this is more intended for rapid prototyping. A researcher can pretty much build their trading system in Python and test it in the simulation, which means they can back-test it from 1970 to now.
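The shape of an embedded-Python trading-system object might be something like the sketch below. The real node interface is not described in the talk, so the `on_price` method and the `run_backtest` driver are assumptions; the only idea taken from the source is that a Python object is driven point by point by the (normally C++) simulation graph:

```python
# Sketch of the "embedded Python" idea: a trading-system node written in
# Python that the host graph can drive over the whole history.
class MovingAverageCrossover:
    """Toy strategy node: long when price is above its running mean."""
    def __init__(self):
        self.prices = []

    def on_price(self, price):
        self.prices.append(price)
        mean = sum(self.prices) / len(self.prices)
        return 1 if price > mean else -1   # desired position

def run_backtest(node, history):
    # What the host graph would do: feed the node each point in order.
    return [node.on_price(p) for p in history]

positions = run_backtest(MovingAverageCrossover(), [100, 102, 101, 105, 99])
```

A researcher writes only the class; the hosting, distribution, and back-testing over decades of history is the framework's job.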
They don't have to write C++, because there you might often have to wait a month or two for technology to actually implement it, which is not a really good turnaround. So they can just build their thing, run it, test it, and if we're happy with it, then we can still implement it in C++ afterwards. That's kind of the idea. It's definitely aimed at rapid prototyping, although some of it is actually in trading as well. The technology there: unsurprisingly, the C++ executable hosts a Python interpreter. We use Boost.Python to do the marshalling. All the data is exposed as NumPy, so we use the NumPy C API for performance. Essentially, embedded Python gives you full control for making your own Python trading system available in the C++ backend. It's extremely powerful. So that's the simulation. As Iztok mentioned, we have this problem that we need to shift lots of time series back and forth. There's an enormous number of time series to be saved; we have hundreds of thousands of assets. You need to be able to load and write these to a database very quickly. Things like SQL are way too slow, because we do so much historical backtesting: we have to load all the data for 300,000 securities from the 1970s to now into memory, or distributed, and then write the results back. So what we designed, at the time when we started this, there wasn't really a good alternative. So we built our own versioned, deduplicated data store, which I'm not gonna go into in too much detail. It's a columnar format, so it's super effective for storing lots and lots of time series, lots of columns, very efficiently, and they're typed. It's backed by MongoDB, but that's kind of an implementation detail; anything that can map a key to a binary blob would have worked. And the real key is that it's immutable data. One thing we don't want is, once you've written your data frame, if you do it in Python, all you want is to get exactly that data frame back.
We don't want the data to change. If you've written something, it can never change; you always get it back exactly like that. That's really at the core of our strategies. Obviously, if you're testing a strategy, you don't want the data to change underneath you. For reproducibility, you want to know exactly what you did, and you want to keep doing it the same way forever. So this time series store was really revolutionary for us; it actually opened up a lot of possibilities. The technology there follows the same pattern again: we tend to do something low-level in a really optimized way, in C or C++, and then we expose all kinds of high-level libraries to make it more accessible to users. So the store here is backed by MongoDB. There's a C library that sits on top of it, and that deals with the columnar storage, so that we can store things very efficiently. And then we build very thin libraries on top of that: C++, C#, and Python here, and we're building a JVM one as well. Essentially these can be accessed by different kinds of technologies. C++ would typically be the simulation, but a researcher might use the Python library. And rather than having to deal with this kind of low-level columnar storage, a researcher can just put data frames in there. They get translated into C arrays, and then given back to you as data frames as well. A small implementation detail about how we've done this: we use cffi, the C foreign function interface. And the nice thing is that it's such a friendly Python interface: you give it a data frame and you get a data frame back. You don't need to know about any kind of table formats or type conversions and all that kind of stuff. Comet transforms: like I told you in the beginning, Winton is essentially a graph, and simulations and services are sitting there waiting for stuff to happen; they react to inputs, they produce outputs. The next thing is gonna listen to those outputs and go on again.
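The two guarantees just described, write a data frame, get exactly that data frame back, and never mutate what was written, can be modelled in a few lines. This is a toy in-memory model, not Winton's C-backed store; the `TimeSeriesStore` class and its `write`/`read` methods are invented names:

```python
import pandas as pd

# Toy model of a versioned, immutable time-series store: each write under a
# key appends a frozen version, and reads always return an exact copy.
class TimeSeriesStore:
    def __init__(self):
        self._versions = {}   # key -> list of frozen DataFrames

    def write(self, key, df):
        # Copy on write: later mutation by the caller can't leak in.
        self._versions.setdefault(key, []).append(df.copy())
        return len(self._versions[key]) - 1   # version id

    def read(self, key, version=-1):
        # Copy on read: the stored version itself can never be mutated.
        return self._versions[key][version].copy()

store = TimeSeriesStore()
prices = pd.DataFrame({"close": [100.0, 101.5]},
                      index=pd.to_datetime(["2017-01-02", "2017-01-03"]))
v0 = store.write("EQ.close", prices)       # "EQ.close" is an invented key
prices.loc[:, "close"] = 0.0               # caller mutates their copy...
restored = store.read("EQ.close", v0)      # ...but the store is unaffected
```

Defensive copies are the simplest way to get immutability; the real store gets the same guarantee much more cheaply by never rewriting stored binary blobs.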
So this is what we call the comet transform system. It's microservice-based: the services sit waiting on a topic from a bus, which is Kafka, actually. There's an example there, and it's super simple. We get the data from Bloomberg, we write it to the store, and we announce that we've done that. The equities transformation is gonna pick that up, it's gonna write to the store, and then, as soon as that's done, our strategies kick in. We've got loads of strategies, so there might be five strategies waiting for the equities prices; they're all gonna kick off simultaneously, distributed. And as soon as they're done, we can start going into execution. This is the next-to-last slide: a little bit about the technology of the comet transforms, bringing it all together a bit. All the red things are where we use Python. Everything that's not red is kind of low-level and exposes Python as its external API. So all our events are posted on Kafka, and we use protobufs throughout for the communication. That's really nice for the strong typing, and you can evolve your schema. And then our service stack is currently in C#, a proprietary service stack that essentially deals with receiving and translating the protobufs. But the comet transforms themselves are hosted by the C# stack as Python interpreters, so anybody can write anything and become part of the graph that is Winton just by writing some Python code. That Python interpreter might run a strategy that launches a simulation using the same bindings I explained in the beginning. That simulation can host your own trading system that you've written using embedded Python. The simulation will read and write its data from the versioned store, which is our efficient way of storing time series. And that, in turn, can be read through the Python store library, so anybody can read the data that's been written by the simulation.
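The event chain in that example, data lands, a transform announces completion, every waiting strategy kicks off, can be sketched with an in-memory stand-in for the Kafka bus. The `Bus` class and the topic names are made up for illustration; only the pub-sub shape of the flow comes from the talk:

```python
from collections import defaultdict

# In-memory stand-in for a Kafka-style pub-sub bus.
class Bus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, payload):
        # Deliver the event to every subscriber of the topic.
        for handler in self.subscribers[topic]:
            handler(payload)

bus = Bus()
log = []

def build_equity_prices(payload):
    log.append("prices built from " + payload)
    bus.publish("equity_prices.ready", "equity prices")   # announce completion

def strategy(name):
    def run(payload):
        log.append(name + " ran on " + payload)
    return run

bus.subscribe("bloomberg.done", build_equity_prices)
for n in ("strat_a", "strat_b"):                 # several strategies waiting
    bus.subscribe("equity_prices.ready", strategy(n))

bus.publish("bloomberg.done", "raw Bloomberg data")   # the initial event
```

In the real system the handlers run as distributed services and the bus is Kafka with protobuf payloads, so the strategies genuinely run in parallel rather than in a loop.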
Everybody has access to it through the Python store libraries. Whilst all of this is quite complicated, and there are a lot of technologies going around, the theme is always roughly the same. The low-level code, which is really optimized, tends to be written in C or C++, and the implementation details are quite proprietary: it can be protobuf, it can be Kafka. But as a user, you're only exposed to well-chosen APIs that we've defined. They're quite flexible, they're programmatic, and because it is Python you can do anything you want, but it is tailored and it is accessible. And by providing that as the interface, it's still extremely performant. And we find that this works really well. So roughly, as Iztok said, it's all good. We're quite happy with the system. Python throughout: if you're a researcher, or if you're in the business, you wouldn't see anything else but Python. You just see Python; you don't even need to know that there's any C code under there. It's the primary interface, really, for data management and signal generation. Because it gives such fine-grained control, you don't really need anything else. There's no need to go into C or C++. You can, and that's what technology does if it needs to go really fast, but as a researcher, typically you don't need it. So you can define all your own data transformations; you can do with the data whatever you want. You can store data, you can retrieve data, and you're guaranteed it will never change, through the time series store. As discussed, that's backed by very low-level C and C++ code that is implemented and owned by technology. And the main reason we're doing this is because it's so great for analysis, visualization, rapid prototyping, and maintainability, because it's such a programmatic interface to all the underlying code. You can write web services, you can write monitoring systems.
Everybody can essentially start contributing in Python, which means we have an enormous view on what is actually going on in Winton's trading systems. Yeah, so it's all good, and that's also all I had. So thank you. Thank you. Yes, one minute left. Okay, oh, many questions. Okay, so I'll just start from here. Hi, thanks for the talk. Thank you. So, are you using C and C++ code because you're in high-frequency trading, or is this legacy code? No, neither, actually. We are in low-frequency trading, and it's definitely not legacy. Even though it is low-frequency, we do continuous historical backtesting. So even though we might just trade on one new data point, we want to be able to very quickly test the simulation all the way from 1970 to now. That's the main reason it has to go fast: we test the whole of history. But we do trades over periods of months. So common tools like pandas and similar don't meet the data requirements, like for huge backtesting? I think we found that it probably doesn't cover our needs. The trading systems that researchers contribute are in pandas, and a number of them do go into trading. So things like the tracking error control run on pandas, and do run in trading; they don't actually have a C backend. But we find that if things need to go really fast, and we need that kind of speed, then the C implementation is still considerably faster, to the extent that it's worth doing. Hi, have you open-sourced your time series store? And if you haven't, why not? Open-sourced which one? Your time series store, for data. No, we haven't open-sourced that, for no particular reason. It's only quite recently that we started looking into open source. I think this is actually on the list of things potentially being open-sourced.
There's nothing particularly trading-specific about it; it's very generally applicable. So yeah, that might come up. More questions? Your slide suggests Python 2, probably 2.7. It is exactly 2.7, yeah. Why not 3, and what is the incremental cost of migrating to 3? So there's an enormous legacy code base in Python 2. Upgrading it, because we have all the C extensions, is not trivial, but it is being actively pursued now. All the new code that we start developing will be in Python 3, and then we should gradually migrate all the old stuff. The problem is there's not really an enormous business case right now: it's a lot of effort, and we don't necessarily get a lot back at this very moment. But we definitely realize that, especially as maintenance support is gonna be dropped, we will have to have moved to Python 3. So that's gonna be our main reason. And obviously there are a lot of features that would be considerably better, especially around multiprocessing and such; that's for me personally, at least. We need a good business case to move, really. Hi there, thanks for the presentation. Which exchanges do you trade on, and how many bytes of historical data do you have? Which exchanges? We trade on all the major exchanges, I'd say about 20 or 30 different ones: American equities, European equities, Asian equities, futures, FX now, and fixed income as well. How much data do we have? Depending on how you describe data, we ingest probably about a billion numbers a day, and we have petabyte-class total capacity. But that's a rough measure. How much we need is a different story, but that's how much we have. Thank you for your great talk. I have two questions. Is there any authentication or authorization system? So that some researchers can see only a few machines, or something like that? Yes. And how does it work? Through the API?
We have our own proprietary authorization system. It's basically token-based. Then we have SQL Server, so the SQL side is backed by Microsoft authentication, and we've got the Mongo database, which is backed by certificates, so that's certificate-based authorization. Okay, and my second question: when a researcher wants data, I guess it goes to a microservice and does scroll and scan operations. So is all your data going through HTTP, and how is it so fast, since millions of events can be sent? So a researcher would actually go directly to the store: they make a direct connection to Mongo. They wouldn't necessarily have to be mediated. They can be, and we're actually considering building high-performance services in the middle, gRPC-based or something, but right now the library that we expose to researchers, which sits on top of Mongo, makes a direct connection to Mongo. That's why it's so fast. And that's how your authentication works? By certificates, yeah, certificates to MongoDB. Okay, thank you. More questions? Just a fairly semi-light question, but have cryptocurrencies or anything like that crossed your radar yet? Crossed the radar and then left it, I guess. It's not something we do yet. Thank you. More questions? We still have some time. I have a question, if nobody... Okay, why is there a Kafka thing and a protobuf thing? What's the issue with this one? There are no other arrows than just this one. Kafka is the message bus there; you see the message bus at the top. We use Kafka to back all our events. We have pub-sub-style events, which means any consumer can connect to any event that happens, so we needed a pub-sub message bus, essentially. We chose Kafka, and we put protobufs on the wire because they're strongly typed and actually fairly compact. So if we need to send a lot of data over it, then Kafka plus protobuf is actually a really good combination.
So essentially Kafka sits really at the core of Winton: all the events go through Kafka, and everybody can chip in. Can you give an example of such an event? Is it a trade, or what is it? It's at different scales. One event that can be announced is Bloomberg saying: I'm done, I've actually downloaded the Bloomberg data, and it's in the store. But at a lower level, we actually do send every single piece of information across Kafka as well. So even before that, all the data that we ingest, that we download, is streamed over Kafka. And then, depending on who's interested, it can be stored in Mongo, it can be stored somewhere else, it can be transformed, tests can be run on it. So all the data goes over the bus as events itself as well. Thank you. Again, I just wanted to know why you aren't using any event-driven infrastructure such as Apache Storm or something like that? It looks like a perfect solution. It's possible, yeah. We are actually investigating things like Storm, Spark, Flink, all of them. Do you have something to say about that? We're a company that's 20 years old, so there's a lot of technology that comes on the radar that of course you would immediately like to have, but you can't, because it takes time to migrate and you need the business case to migrate as well. Something being sexy is not a business case. And of course, having 35 billion under management also means there's a lot of risk. Making a small mistake on such an investment just so that you can get sexy new technologies is, again, not something that's very easy to justify. So we do like adoption of new technology, but we have to be cautious at the same time. Hi. So I'm interested in how you are testing technical systems like this, because I could argue that there are a thousand things that can go wrong. Yeah, it's a distributed system. It's a real-time distributed system. So what's your approach to testing?
As I said initially, we gain so much by the immutability of data; that's definitely one thing. If you know that your data is not gonna change, then you don't have race conditions over it, worrying that this might need to be written before that reads. So immutability is definitely one of the core principles. And then everything is strictly event-driven; it's strictly a DAG. It means that, because everything is defined by the events, you can write extremely good tests. The whole history can be reconstructed from the events of Winton. More information: all the simulations are run every day for the entire history. So for example, for our simulation today, we'll compare against the simulation up to the previous point, let's say yesterday, and we'll make sure that every single data point in the entire history of the simulation is the same. And the incremental daily step is also usually human-verified. There's still some human interaction, not because it's needed, but because it's a sign-off process; there's a checkpoint for this human. More questions? Yes? Just say if you are tired. No, I can go on forever. Thank you. There was a slide with C#, C++, and Python usage all together, and probably service usage as well. Anyway, the question is how the communication is done between the components, between the different languages and so on. Sorry, can you repeat that? So once again, the question is how the communication is done between the libraries in different languages. The communication between the libraries, or in the company? Between the libraries, in C++, in Python, and C. I don't know the details for all of them. I do know that in Python we use the C foreign function interface. Essentially, what we always aim for is a fairly simple C89 interface, I think, to a C library which encompasses all the logic, and then the other libraries are built on top of that.
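The daily check described above, re-run the entire history, confirm every historical point matches the previous run, and only then accept the new points, can be sketched as a small comparison function. The function name and the return convention are illustrative, not Winton's actual process:

```python
# Sketch of a full-history regression check before accepting new points.
def accept_new_run(previous_run, new_run):
    """Accept a new run only if its overlap with the previous run is identical.

    Returns just the freshly generated points; raises if history changed.
    """
    overlap = len(previous_run)
    if new_run[:overlap] != previous_run:
        raise ValueError("historical results changed - refusing new points")
    return new_run[overlap:]

yesterday = [0.1, 0.2, 0.3]            # simulation results up to yesterday
today = [0.1, 0.2, 0.3, 0.4]           # today's full-history re-run
new_points = accept_new_run(yesterday, today)
```

Because the data is immutable and the pipeline is a deterministic DAG, any mismatch in the overlap can only mean a bug or an intentional change, which is what the human sign-off then has to decide between.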
They tend to be fairly high-level and just do the mediation and marshalling of data. So all the logic tends to be in the lowest layer, and then all the other layers just represent the data in something that is useful for the language itself. Does that answer it, roughly? Okay. No more... there's still one question. We are also at the booth, by the way, if you have more questions afterwards. Yeah, maybe this is then the last question, but why not? Hello. I would like to ask if you save some pre-computed data over the history. Do you save some signals, where you take the source data and compute something from it over all of history? Or do you re-compute everything every time, every day? Yeah, so that's what Iztok alluded to. In order to make sure, and it fits in with your question, that nothing has actually gone wrong in the meantime, that no bug has been introduced, we run everything from pretty much the beginning of history to now, to yesterday. We check that everything is exactly the same, and only then do we allow the newly generated points to go through, yes. So it gives us an enormous amount of certainty that nothing has gone wrong. So do you then have problems with the immutable data, in that it can change because you improve your algorithm or find some error in the algorithm? It can happen, and then we re-baseline, essentially. So if we do introduce a change, it has to be in a controlled fashion. The only thing we want to avoid is uncontrolled change. But of course, if there's an improvement, then we will re-baseline the system, yes. Thank you. Okay, so I think a very nice talk, very interesting. Thank you.