All right, thank you. Apologies for being wedged over this side by the microphones, but that's where we've got to be. This talk today is about how we upset someone. And that's never a good thing to have to talk about, because when you upset someone, it means that you've done something wrong. It means that you've changed something that they liked. It means that you've broken something they expected to keep working. And so this is where we start from: poor SkelSec is very disappointed. SkelSec is probably not someone you've come across before. We certainly hadn't. So we dug into what he works on and checked out his GitHub. One of his projects is pypykatz. Those familiar with information security will be aware of Mimikatz, which is one of the most popular hacking tools that gets used; he owns the pure Python implementation of that. He's also got proof of concepts for Kerberos attacks in pure Python, and a couple of CVE exploits in pure Python. So maybe upsetting this guy by breaking stuff is not actually that bad a thing. He doesn't seem to be doing nice stuff with Python. The way that we broke him is by implementing PEP 578 for Python, which adds audit hooks and can be used for security transparency. So that's the subject of today's talk. We're going to go into detail about what was changed, what's been added, and how you can use it to make people like SkelSec, hackers using Python, very unhappy. So my name's Steve Dower. I'm a CPython core developer and the original author of PEP 578. I worked with a few other people on it, and I especially want to acknowledge James Power, who did a lot of the implementation work and was a big help. You can find me on Twitter at zooba, and I work at Microsoft. Hi, I'm Christian Heimes. I'm also a CPython core developer, and I was the BDFL-delegate, so the person who worked with Steve to get the PEP landed.
Also, I work for Red Hat, and you can find me at Christian Heimes on Twitter too. So, today's agenda. We're going to first explain what audit hooks actually are and why you should use them, then how you can use audit hooks to improve the security of your programs. Finally, Steve will go through some Windows-based examples, I'm going to go through some Linux- and Unix-based examples, and we'll close with a summary. Oh, you've got the clicker, thank you. So, the runtime audit hooks. They're only a small fraction of the things you should actually do if you want to create a secure environment for your processes and your services. They're not a whole solution; they're a small part of the solution. With the audit hooks we have, you can see what the interpreter is doing internally and hook into different parts of the interpreter: which files is my process opening, which sockets am I connecting to, compiling and executing bytecode, importing modules, et cetera, et cetera. By default these hooks don't do anything, they're just empty stubs, and if you don't use any of the audit hooks, they also don't impact the performance of the interpreter, which is probably something that several of you care about. These hooks are designed for security engineers to inspect what's going on. They're not a ready-to-use solution; they're a foundation on which you have to build your own solution. And as a security engineer, if you want to take care of creating a secure environment for your services, there are multiple things you should do. First, of course, please install security updates. Then you want to run your services as a limited user account: don't run them as root or as admin. And then, please install security updates.
Then do things like use a firewall, restrict your processes, and, yeah, install security updates. Don't install random stuff from the internet; actually control what you deploy on your machines, and inspect what you put on them. Don't fall into the trap of typosquatting, where somebody has a PyPI package whose name sounds very similar but actually has a typo in it. And yeah, install security updates, please. And maybe at the end, once you've done all that and even more, you might think about adding these audit hooks. So let's have a look at what hooking in actually involves. We have two sets of APIs for this. It is meant to be used as a security feature, and if you're that committed to doing this, you should really be thinking about compiling Python from source, controlling those sources so you know where they came from, and possibly modifying the runtime itself. The aim of the hooks is to provide a C API that makes this safe and reliable, so you're not constantly modifying the full source code of CPython. There's also a Python API for hooking into these events, which is very helpful for testing, and there are a few scenarios where that makes sense. But these are the general APIs. So you're going to have some kind of callback, whether in C or in Python, and then it's a simple call: either PySys_AddAuditHook from C, or sys.addaudithook from Python. That gives you a callback that is called on each of these events. These events can be raised from anywhere. So your callback will get called for a variety of operations, in a variety of contexts that Python may be running in; certain locks may be held at a very low level, so you do need to be careful about what you do inside a hook, but for the most part hooks are not that complicated to implement now that you actually get a callback. The pros and cons of each of these approaches are important to consider.
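As a quick illustration of the Python-level API, here's a minimal sketch of registering a hook with sys.addaudithook (Python 3.8+). The callback receives the event name and a tuple of arguments; note that once added, a hook cannot be removed for the lifetime of the process.

```python
import sys

# Every audit hook is called with (event_name, args_tuple).
events_seen = []

def audit_hook(event, args):
    # Record the event name; a real hook would log or filter here.
    # Hooks should be fast and must not assume much about their context.
    events_seen.append(event)

sys.addaudithook(audit_hook)

# compile() raises the "compile" audit event, so our hook sees it.
compile("1 + 1", "<demo>", "eval")
```

Running this, `events_seen` will contain `"compile"` (plus any other audited operations the interpreter performs afterwards).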
Obviously, implementing in C is more complex than implementing in Python, but the upside is it's going to run faster, and it's going to be harder to bypass for an attacker who may be executing code in your Python that you don't expect. If you've added your hook in C, you're still going to get those events; there's very little an attacker can do running pure Python code to stop you from receiving them. But it does also require compiling and deploying your own copy of Python. Now, Linux users are probably not at all concerned by that, because you're almost certainly doing it anyway. People running on Windows or macOS are less likely to be doing that, so it's a bit more of a step up in complexity, but it is worth it if you're looking to enforce stronger security boundaries. Doing it from Python, of course, is very easy and very convenient. You can do sys.addaudithook(print), and that will give you a printout of every single event that's going on, very simply. The downside is that it's per sub-interpreter. If you're not familiar with sub-interpreters, have a look at those; they're a very complex area. But if you are using a Python hook, it only applies per sub-interpreter, so you'll need to keep adding it to each one, and it does run slower. You may notice more of a performance impact with Python-based hooks than with C-based hooks. So what kind of events can you expect to appear while you're running? This is a small sample; that link at the bottom has almost the full table of events that CPython currently raises. Other implementations of Python may add different events, and certain libraries that you use can also raise events, which we'll come back to. But here's a selection. The builtins input function is going to let you know any time input is called, and it's also going to send a separate event with whatever is typed in.
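To make that event table concrete, here's a small sketch of a Python hook that watches for a few of the events mentioned here. The set of "suspicious" names is just an illustrative policy for this example, not anything built into Python.

```python
import sys

# Illustrative policy: events we'd want flagged on a production box.
SUSPICIOUS = {"builtins.input", "exec", "compile", "glob.glob", "socket.__new__"}
alerts = []

def watch(event, args):
    if event in SUSPICIOUS:
        # In production you would forward this to your logging pipeline
        # instead of collecting it in a list.
        alerts.append(event)

sys.addaudithook(watch)

import glob
glob.glob("*.py")  # raises the "glob.glob" audit event
```

After the `glob.glob` call, `alerts` contains `"glob.glob"`; whether that's a warning sign depends on whether your app legitimately uses glob.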
Your production server is probably not using input, and if, for some reason, input starts happening on your production machine, you probably want to know about that. Things like exec, import, compile: very useful. You want to know what code's running, and you especially want to know if code that you didn't write is running. We have more on being able to prevent that in the first place later on, but this is the way that you can at least be notified that someone is dynamically compiling code on your production server. socket.__new__, and there's a range of socket events that will be raised. glob.glob, just in case; that's actually a very common thing that attackers are going to do if they manage to execute code on your machine. They want to find out what files you've got there. They want to see what tools they have access to. They want to know what user accounts are set up. If you start getting events for glob, and you know that your app uses glob, you can probably ignore them. If your app never uses glob and you start seeing it, that's a good warning sign. So what should you do with an event? You get the callback; what are you going to do? The first option you always have is to do nothing at all. Depending on your risk profile or your threat model, for certain events it's perfectly reasonable to ignore them. You can log it: write the details out somewhere and keep a semi-permanent or permanent record of all the things that your app has been doing. You can abort the operation: from inside your callback, if you raise an exception, most of the time it's going to interrupt the action. So if a socket raises an event saying "I'm about to connect to this address" and you throw an exception, it's just going to abort. It's going to be cancelled, your code will get an exception instead, and that socket will never be connected. So there's that option. And you can abort everything.
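Here's a sketch of that "abort the operation" option: raising from the hook cancels the audited action, so the connect below never actually happens (192.0.2.1 is a reserved documentation address, used here so nothing real is contacted).

```python
import socket
import sys

def block_connect(event, args):
    # Raising an exception from an audit hook aborts the audited operation.
    if event == "socket.connect":
        address = args[1]  # args is (socket_object, address)
        raise RuntimeError(f"outbound connection blocked: {address}")

sys.addaudithook(block_connect)

s = socket.socket()
try:
    s.connect(("192.0.2.1", 80))  # aborted before any packet is sent
except RuntimeError as exc:
    print(exc)
finally:
    s.close()
```

The calling code receives the exception instead of a connected socket, exactly as described above.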
If there's an event that you recognize that you really, really don't want to happen, you can just call exit and tear the whole thing down. But the correct answer is to log it. This comes out of an approach to security known as Assume Breach. The traditional approach to security for a lot of people is that you set up a really strong boundary so nobody can get in, and as long as nobody gets in, you're fine. And it turns out that the attackers are better at getting in than we are at keeping them out, and you only need one gap in that wall for someone to get through. Once they're in, if all your defenses are in that wall, you have nothing. Think about an office building. Your office building has locks on the doors; that'll keep people out. But there are also motion detectors and security cameras on the inside, which are completely useless if your doors are perfect. Why are they there? Because your doors aren't perfect. People will get in, and you want to know about it. Think of these audit hooks as security cameras or motion detectors inside your app, so that you know when someone's moving about in there. You will see legitimate people moving about in there; you need to be able to filter those out. But if you're not logging it, you have no chance. Logging everything is very important. Your first instinct is probably to filter out events that aren't that interesting, that aren't that relevant, and say we don't actually need to keep these: we can save disk space and network traffic by not logging certain events. That's actually a really bad idea. If you have complete logs, maybe you're not checking everything, maybe you're not constantly reading them. But when it comes to a retrospective analysis, when you discover your emails being published on torrent sites or wherever they get published these days, and you realize that someone's in your system, how are you going to find them?
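In that spirit, a minimal "log everything" hook might look like the following sketch. The `audit.log` path and the JSON-lines format are just choices for this example; a real deployment would ship these lines off the machine rather than keep them locally.

```python
import json
import sys

LOG_PATH = "audit.log"  # hypothetical path; point this at your log pipeline
log_file = open(LOG_PATH, "a")

def log_everything(event, args):
    # One line per event, with arguments repr()'d so anything serializes.
    # Deliberately no filtering: complete logs are what make retrospective
    # analysis possible.
    try:
        log_file.write(json.dumps({"event": event, "args": repr(args)}) + "\n")
    except ValueError:
        pass  # the file may already be closed during interpreter shutdown

sys.addaudithook(log_everything)

compile("x = 1", "<demo>", "exec")  # generates a "compile" event
log_file.flush()
```

Every audited operation from this point on, including the test's own file opens, lands in the log.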
If you have all of the logs, you've got a chance of working backwards and locating what those attackers have been doing in your system. Anomaly detection is a growing field, especially using machine learning: you keep a profile of what events happened during normal operation and recognize when that changes, even without someone having to review every single thing that's going on. And of course, incident response. If you have a live persistent threat inside your network and you want to find out what it's doing, what services it's approaching, what IPs it's pinging, which Twitter accounts it's looking at for its next set of instructions, having all of these in your logs already will give you that information and help you expel them nice and quickly. Premature log filtering is going to cripple your defense. So log everything. And as I said, there's a way to raise audit events as well. For the most part, the intent is to listen to them. But if you're developing a library, if you're developing various extensions, then it can be very helpful to raise your own events so that you know when these things are going on. So there's a C API, PySys_Audit, which is based on the Py_BuildValue API. If you're familiar with that one, it takes the name of the event and a format string for the arguments that are going to be passed in. If you have the option to use the C API, use the C API. There is a Python API, sys.audit, which similarly takes a list of arguments. It's very easy to bypass that one: you can reassign sys.audit to another function, and then those events go away. So it's very useful for testing, but if you can use the native API, that's strongly recommended. We recommend in the PEP that third-party events should include the module name as part of the event name, which helps namespace things. If you have two modules raising the same event for different purposes, that's probably going to cause issues.
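Here's a sketch of raising a namespaced third-party event from Python with sys.audit, following the validate-then-audit-then-operate ordering recommended in this talk. The `mybank` module name and the `transfer_funds` function are hypothetical, purely for illustration.

```python
import sys

def transfer_funds(account, amount):
    # 1. Validate first, so hooks can assume the arguments are sane.
    if amount <= 0:
        raise ValueError("amount must be positive")
    # 2. Audit before acting: a hook can veto by raising an exception.
    #    The event name is prefixed with the (hypothetical) module name.
    sys.audit("mybank.transfer_funds", account, amount)
    # 3. Only now perform the operation.
    return f"transferred {amount} to {account}"

seen = []

def hook(event, args):
    if event == "mybank.transfer_funds":
        seen.append(args)

sys.addaudithook(hook)
print(transfer_funds("alice", 100))
```

Because the event fires before the operation, a deployed hook could raise here and abort the transfer entirely.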
So putting the name of the module that you're raising it from into the event name helps keep things separated, so there are no collisions. And just as a recommendation: validate the parameters you're going to pass in first, then raise the event, and then do the operation. That means hooks can assume the arguments are going to be valid on the way in, you're not going to end up logging things that would just raise an exception and never run anyway, and it gives the hooks a chance to abort the operation. So if you notice something going on, even if you want to quickly deploy a fix to expel a current persistent threat, you may have a very specific case where you want to start raising an exception for a certain IP. And if you've implemented these events in a way where the operation is already underway, a hook that raises an exception can't actually interrupt it. So you want to put the audit events after parameters are known to be valid, the correct types, within range, but before you actually do anything with them. And next, Christian is going to tell us how you can just stop code running in the first place. Yeah, so this is a rather special case. It's not a hook but a new piece of code we added: the io.open_code function. If you look at how binaries are usually executed, with shared libraries and native code, the kernel and the operating system know the difference between code that actually runs on your CPU and data for that code. But with Python, we have this rather bad situation that .py and .pyc files are considered data by the operating system. The io.open_code function is the first step to teach the runtime that we're going to open something that's going to be executable code. It's a simple function that basically boils down to opening the path we pass in, binary read-only, and returning a file-like object.
So it doesn't have to actually be a file object. We have some examples later on that use BytesIO, which looks like a file but isn't backed by a file. If you hook in via the C API, you can override that, but you can only do that at the very beginning, before you actually start the rest of the interpreter. You pass in the callback, and the callback gets called with the file name plus additional user data, so you can pass custom information to your callback. And what can you do with the io.open_code hook? For example, you can verify different attributes of the file and check its properties: is it actually a regular file you're opening, or are you maybe opening a pipe, a socket, some kind of special file, or a file on a file system that you don't expect to load any code from? You can validate the content of the file: compare it to a checksum, or do something like code signing. You can make sure that nobody does something to the file while you're working on it: instead of returning the actual file, as I mentioned before, you first load the content into a memory buffer, into a BytesIO, and then do all the operations on that. Because if you read the file twice, the first time for checksumming and the second time to actually return it to Python, there's a possibility that an attacker can use a time-of-check/time-of-use attack. If an attacker can intercept the second read of the file and replace the content with some malicious code, that's bad. So always read the whole content into a buffer first, then validate the buffer, and pass the buffer down. And there are some caveats. One thing is that if you implement this in C, it's executed while the import lock is held, so you can't do another import while inside the hook. And some callers may assume it's an actual regular file, backed by a file descriptor or a physical file on the file system.
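The real open-code hook has to be installed from C with PyFile_SetOpenCodeHook before the interpreter starts, but the verification logic it would run can be sketched in Python. The `APPROVED_HASHES` allowlist below is hypothetical; the key point is the single read into a memory buffer, which closes the time-of-check/time-of-use window described above.

```python
import hashlib
import io

# Hypothetical allowlist mapping file paths to expected SHA-256 digests.
APPROVED_HASHES = {}

def open_code_checked(path):
    # Read the whole file ONCE into memory, then verify and serve that
    # buffer. Re-reading from disk would open a time-of-check/time-of-use
    # window where an attacker swaps the content between the two reads.
    with open(path, "rb") as f:
        data = f.read()
    digest = hashlib.sha256(data).hexdigest()
    if APPROVED_HASHES.get(path) != digest:
        raise PermissionError(f"unapproved code file: {path}")
    # The import system only needs a file-like object, not a real file,
    # so we hand back the already-verified buffer.
    return io.BytesIO(data)
```

Anything whose hash isn't on the allowlist is rejected before a single byte of it reaches the compiler.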
But the standard Python import system doesn't assume that; it just requires a file-like object. There are some additional things you have to do if you want to validate files with the io.open_code hook. You don't want any additional code in your project that bypasses the whole io.open_code infrastructure. In Python 3.8 we changed all the code for importing, and several other things like zipimport, to go through io.open_code, and we plan to do the same for pickle. But if you bypass that in your own application using compile and exec, or execfile in Python 2, well... Also, you want to make sure that you only load files that actually come from the file system, so you can use introspection and other file system and operating system tools to verify what kind of files you load. If you allow something like -c, where you can just pass in arbitrary Python code, or something like "curl some-evil-site | python3", piping shell data in, then the admin has no chance to see which code you're actually executing. Also restrict which environment variables you honor; environment variables can play funny tricks with your Python program. And restrict which places you allow code to be loaded from. Maybe you don't want to allow reading code from /tmp, when /tmp is the only place where an unrestricted user can store files on disk, or from a home directory. Maybe you want to restrict that. So now we come to the Windows section, where Steve will tell you how you can hook in all the hooks on Windows. Right, here is where we get to real applications and a few samples of things. I'm going to go through three points of integration that exist in Windows that this enables, which were previously unavailable. These are operating system features that are really powerful security features. I suspect there's a very low number of Windows developers in the room, because that's fairly typical for these conferences, unfortunately.
But these are features that get cheers at Windows security-focused conferences, because they're really powerful at locking things down. With these hooks, and with the open_code function, they become available to Python developers, and as a security engineer in those contexts you can integrate Python code, your Python apps, into the rest of the security infrastructure. All of these code samples are available at this GitHub repo, so you can go and grab them. You can get a copy of Python 3.8.2, compile them, and try them out for yourself. So the first one is the Windows event log. The canonical example of code that you don't want running on your production servers is something like this, and in fact this is a great example. You'll see that it's called spython. spython is the code name that we've been using for a Python build that has more of these protections enabled. There are a lot of good reasons for a development cycle to have a python binary that doesn't have these things enabled, and to use an spython binary in production that does have them enabled. That lets your developers use a whole lot of things that might otherwise be restricted. And then when you deploy, if you don't allow the plain python binary to exist, it also takes away an entry point. There's also some interesting semi-research, which I haven't linked to but probably should have, where some would say this is security by obscurity: simply renaming your python executable hides the fact that attackers can run Python when they get on your machine, and they're just going to figure it out and do it anyway. Someone did a study where they changed their SSH port by one and saw something like a 99.9% reduction in attacks. If you think that's a bad thing, then I'm going to disagree. If you can prevent 99.9% of attacks by changing the name of something, then you should just do that. So anyway, this is the canonical example.
If you get onto a machine and you have the opportunity to run one command, that command is very often, surprisingly often for most people, going to be something like: python -c, decrypt this Base64 string and then execute it. And quite often that string involves opening up urllib, downloading more files, decrypting those and executing those. In this case it just prints "hello, EuroPython", which is a much better option than all the rest of them. But this is fairly common. Looking at that command, you can't see what it's going to do. You don't know what it's going to do. And typically, once an attacker is inside Python, you have no idea what they're doing until you see the results being published on, you know... what is it, the site with all the passwords? Have I Been Pwned. That's when you find out that someone's been doing this. But we want to find out sooner. So the Windows event log is the central event stream service on Windows. It has a handful of integrations with other, more advanced features that are really helpful. There's an event log viewer for starters; okay, you need to be able to actually look at the events that are in there. The valuable ones from a security perspective: event forwarding, where you can configure it to automatically send all logged events to another machine. That completely prevents anyone on that machine from clearing the logs, because the messages have already been taken somewhere else. You have protected event logging, where all events can be encrypted immediately with asymmetric encryption, which means you can't actually read the log back on that machine; you have to take it to a machine that's enabled for it. Clearing or modifying logs automatically adds its own event, so you will be notified if someone clears a log, even if it's you. Because one of the best ways to find out that someone is doing stuff on your machine that they shouldn't be is that your event logs get cleared.
So knowing about that is helpful. But it also gives a very simple API for logging the events in the first place. In this case, this is my hook for when code gets compiled. These two lines, plus a little bit of boilerplate that's been generated elsewhere, are going to record the event: the code that's been compiled and the alleged file name it's coming from. The compile function takes a parameter that claims to be the file name; we can't necessarily trust that, but the code is legitimate, we know that that's what's been compiled. This leads to the event viewer, where you can see at the bottom that the event has been logged: we're compiling from a string, and there is the actual code that was decrypted from that Base64 string and passed into exec. And so now, because that code has been run through spython, we actually have a record of what it was. And if that's been sent to a central machine, maybe we're monitoring a thousand machines, but this gets red-flagged on our central server and we can see that ten of those machines are executing arbitrary code that they shouldn't be. Now we have something to start with: we know we're under attack, we can start investigating, see what it's doing, trace it down, kick it out. From the point of view of allow-listing or deny-listing code, the typical approach on Windows is code signing. In short, this is attaching a signed hash to the file: we make a hash of the file and sign that hash with a certificate, where the public key is on the machine to verify against, but we've signed with a private key on a trusted machine. Then we can verify that on every use, and depending on your configuration, the kernel will verify it whenever you start running a binary file. Python, now with the open_code function, can also verify its code files against cryptographically signed hashes.
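Back to that compile hook for a moment: outside of Windows, the same idea can be sketched with the stdlib logging module standing in for the event log, recording every compile event together with the code and the claimed file name. The Base64 round-trip below just simulates the canonical attack pattern described earlier; nothing here is the actual Windows sample code.

```python
import base64
import logging
import sys

# Stand-in for the Windows event log: the stdlib logging module.
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("audit")
records = []

def on_compile(event, args):
    if event == "compile":
        # The "compile" audit event carries (source, filename). The
        # filename is only what the caller claims; the source is real.
        code, filename = args
        records.append((filename, code))
        log.warning("compile: file=%r", filename)

sys.addaudithook(on_compile)

# Simulate the canonical attack: exec of a decoded Base64 payload.
encoded = base64.b64encode(b"print('hello, EuroPython')")
payload = base64.b64decode(encoded)
exec(compile(payload, "<string>", "exec"))
```

Even though the command line only showed a Base64 blob, the hook has recorded the decoded source that was actually compiled.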
Unfortunately, Python files don't have a standard for embedding a signature in them, which is the typical approach on Windows. But we can use catalog signing. A catalog file is essentially a list of hashes and file names for a set of files. The entire catalog is signed, and we can refer to it to see whether the file we're looking at has been signed and approved. The standard Python installers on Windows have, for a couple of versions now, included a signed catalog file covering every non-binary file in the package. So if you've installed Python on Windows recently, you already have one of these catalog files, and you can verify every file from the distribution against it to see if it's been modified, or if something extra has been added that's not allowed. And that file looks like this. It has a handful of properties. This one is signed by the Python Software Foundation, as with all the files in the standard distribution, and that security catalog tab, which I didn't show because it looks like random number generator output, is the list of all the hashes. So this is the code for doing it. This part is literally all boilerplate, nothing interesting here at all; you copy-paste this from somewhere, just like I did. This is the interesting function call, where we go: hey, Windows, can you verify whether we trust this file? And it will come back with a yes or no. And if it says no, then you don't trust that file: you abort and you get out. What this means is we can run our build of spython here that has this enabled, and we can import urllib, we can import asyncio. Those are part of the standard library. They include pre-compiled binary files that have been signed, and Python files that are included in the catalog, and because they're all there and they match, we're allowed to import them. And then if we try to import some unsigned file, we get an error message.
The error in this case has come straight from the operating system: "no signature was present in the subject". You can totally replace that with a different error message. Of course, this is not a built-in feature of Python at this point; you have to add it in, based on our code samples if you like, and you can return any error message you like at all. And bringing this all together: in the latest updates to Windows 10 and Windows Server 2016 and 2019, we now have a feature called Windows Defender Application Control, previously known as Device Guard, if you've heard that name. This provides a kernel-enforced, configurable policy for allowing or denying applications from running, which is actually just a short way of saying it's a massive XML file. It allows you to use signatures, catalog files, file names, paths, or other attributes to determine whether executable files are allowed to run or not. It's already integrated with event logging and detectors. It provides good feedback for users, so you don't just get random error messages, you get helpful messages. One thing I didn't put on that list: your configuration is signed, and includes the list of people who are allowed to sign replacements for that configuration, which is a very nice feature. It means that you can't actually break into someone's machine and replace their configuration with one that allows you to do stuff, unless you're signing it with a certificate that they previously said was allowed to sign it. So you end up with this chain of allowed configurations that keeps things really nice and secure. So here's how this looks. On this machine, here's my Python install. You can see spython towards the bottom there, but I'm on the regular python. What I've done is explicitly ban python.exe from running on this machine. Everything here except for spython is signed with the Python Software Foundation certificate, but I've explicitly banned python.exe.
So if I try to run that, I get this big pop-up message, which admittedly is not the most helpful message, but it's better than simply saying "access denied", which is what you'd probably expect. And if I try to run it from PowerShell, I get a similar message: "program python.exe has failed to run, contact your support person for more info". So this is an IT configuration set up for your machine that will prevent applications from running. Like I said, big XML file. This is about half of my part of the XML file; it gets merged into one that's about 20 times longer that lets the rest of Windows run, because by default you can block all of Windows from running with one of these things. This is the interesting part. We explicitly allow the spython executable, because it's not signed. We allow anything signed by the PSF, and then we deny python.exe. And just for fun, I denied sqlite3, ctypes and libssl, because that was part of the example I was demoing with. If you grab the sample code for this and try it, for starters, I recommend doing it on a virtual machine, not your main one. But then after that, you'll find that you can't import sqlite3 or ctypes or libssl either. And so in this case, if I run spython, it will import a whole lot of modules. There are a lot of built-in modules that are signed via the catalog file and approved here; otherwise we wouldn't have gotten as far as the REPL prompt. But then when I import my unsigned file, it tells me it's blocked by policy. Again, that's a customized error message; you can do anything you like there. You can shut down the entire machine at that point, if that's what you want to do. It's completely custom code: it's up to you what happens when something fails validation. As an extra bonus, I can tell from within the executable whether policy is being enforced on that machine or not, and I'm actually using that in this sample to disable the -c command.
So this is the code from earlier that was going to say a nice hello to everyone, but I've blocked it. I've said: you're enforcing a code integrity policy on this machine, so I'm not going to let anyone use -c. There is an option to go into what's known as audit mode, which doesn't actually block anything, but every time you hit something that would be blocked, it logs a message. So you get a nice record of everything that people are running, and you can go back, before you start enforcing it, and allow-list all of those things that should be permitted to run. In that case, I print a different message and do let you use -c. This is just proof of concept that you can tell the difference from within the program. You can behave differently based on how the machine is configured, which gives you a lot of flexibility to integrate Python with the operating-system-level protections that are available. As I said, all of these examples, and in fact all of Christian's examples as of this morning, are also in this repository. Feel free to go and check those out, and now we're going to jump over to some of the Linux options. All right, thank you, Steve. So, this is mostly about Linux. Some of the examples may work on macOS or a BSD, but I haven't verified that yet. First of all, a quick advertisement for our talk tomorrow on DTrace and SystemTap: I'm giving a talk about tracing and profiling tomorrow, so this will be just a quick intro. Then more about syslog; Steve already covered most of the logging story for Windows, and syslog is kind of similar, with a somewhat different feature set. And finally, how to do an io.open_code hook on Linux with a kind of code signing. But first, some of the prerequisites for doing this. Of course: please install security updates. If you take anything from this talk, it's that first point: always update your machines.
You don't want to run your Python interpreter or your application as a privileged admin user or as root, because root can usually replace files and do other things; a lot of the assumptions here are that you run in a way where your binaries can't easily be modified. You also want to restrict where the application can write to and what it can modify on the system, again with unprivileged users. Especially when running containers, you want some kind of kernel security policy: there's AppArmor, there's SELinux, there's TOMOYO, different frameworks to further restrict what an application in a particular context can execute, modify and do on the system. And finally, you want to configure central logging: there's syslog, there's rsyslog, or there's journald if you're running systemd-enabled systems, and you should forward your events to some remote machine so an attacker can't overwrite and destroy your log files. So, first point: SystemTap and DTrace are ways to tell the kernel to log what's going on. Again, tomorrow I will show in more detail how to use these features, but as part of PEP 578 I added markers so you can actually trace the audit events. If you run this script here, which attaches to Python, uses the audit probe and just prints out the first argument as a string, and then run something like python3.8 -c pass, you can see that while running the command it compiles something and then executes the actual pass. So that's one way to integrate with SystemTap and DTrace. Logging: syslog is very easy on most Unix systems. You first call openlog, which is not strictly required, but it lets you configure behaviour; here I pass an option so that in case logging doesn't work at all, maybe because the daemon is down, it still prints something to your console on stderr.
It also logs the PID, so you know which process identifier is doing something funky on your system. And then, if you have some kind of event, you call syslog with the severity, like critical, and a format string. And something you probably want to do if you detect something: when you abort the process, don't call exit or shut the Python interpreter down manually; use os._exit, which stops the process immediately without doing much cleanup, because if an attacker was able to modify your interpreter at runtime, then the cleanup, the shutdown and the exit hooks may execute additional code, and you just want to kill the process. And if you're running a container platform like Docker, Kubernetes, Podman, whatever, you have to set up your container environment to have a syslog endpoint inside the container; by default you don't get one on, as far as I know, most container platforms. So that's something you should keep in mind. So how could you implement io.open_code verification on Linux, since we don't have this fancy catalog file like on Windows, which is really cool? I came up with a rather simple proof of concept that does some verification. For example, one thing I want to verify is that my file is actually a regular file, not a socket or a pipe or something special on a special file system. And I also want to deny any kind of non-executable file system, so don't execute something that's maybe stored on the proc file system. If you have a hardened system, you often mark your temp directory as non-executable, so you can't copy a binary there and execute it; the kernel will disallow that. And you can do the same kind of check manually too.
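The two PEP 578 pieces combine naturally with the syslog module: an audit hook forwards interesting events to the system log, and the application can raise its own events with sys.audit(). This is a sketch under several assumptions: the event name myapp.intrusion is invented, LOG_PERROR is not available on every platform, and real code would filter much more carefully (Unix-only):

```python
import os
import sys
import syslog

# LOG_PID records the process id with every message; LOG_PERROR also
# copies messages to stderr, so something is visible even when no
# syslog daemon is reachable.
syslog.openlog("myservice",
               logoption=syslog.LOG_PID | syslog.LOG_PERROR,
               facility=syslog.LOG_AUTH)

def forward_to_syslog(event, args):
    # Forward only our own application events; the interpreter raises
    # far too many built-in events to log them all.
    if event.startswith("myapp."):
        syslog.syslog(syslog.LOG_CRIT, "%s %r" % (event, args))

sys.addaudithook(forward_to_syslog)

def report_intrusion(reason, fatal=False):
    sys.audit("myapp.intrusion", reason)
    if fatal:
        # os._exit() skips atexit handlers and interpreter shutdown,
        # which an attacker might otherwise abuse to run more code.
        os._exit(1)
    return True

report_intrusion("unexpected ctypes import")
```

The fatal=True path is the "kill the process immediately" behaviour described above; everything else is just transparency.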
I do the usual dance: I load my file into a bytes object and then use the OpenSSL libraries, which are used by Python anyway, to hash the file content, and then verify the file content against a special property stored in an extended file attribute. The example is also in Steve's GitHub repository. So what's an extended file attribute? Extended attributes are a feature on Linux, and Windows has something similar called alternate data streams. You have key-value pairs attached to your files or directories, additional properties where you can store mostly arbitrary data. There's a namespace; you usually have four namespaces: user, trusted, system and security. The last three are restricted, for example reserved for a kernel policy to do something with. So I used the user namespace, and the user attributes are controlled by DAC, discretionary access control, better known as the standard user/group/other read-write-execute bits. So if you can read a file, you can also read its user attributes; if you can write to a file, you can modify, create or delete its extended attributes. The whole concept is inspired by something called IMA, the Integrity Measurement Architecture, which is something like code signing and verification currently under development in the Linux kernel. And just to look at it, this is how it appears on the shell. I created a bunch of different files with an extended attribute containing a hash. This is one file, os.py, and this is my user.org.python attribute; the standard suggests you should use an identifier based on a domain name you own, and the value is a hash. The shell tools are called getfattr and setfattr, "f attr" for file attribute, while the Python API calls them xattr, as in os.getxattr, so the naming is a bit confusing, but that's how it is.
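The verify-and-stamp pair described here can be sketched with just the standard library: hashlib for the digest and os.getxattr/os.setxattr for the attribute. The attribute name user.org.python.hash and the choice of SHA-256 are my own for illustration, not necessarily what the original proof of concept used (Linux-only, and user.* attributes need a file system that supports them):

```python
import hashlib
import os

XATTR_NAME = "user.org.python.hash"  # hypothetical attribute name

def hash_file(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def stamp(path):
    """Store the file's current hash in an extended attribute."""
    os.setxattr(path, XATTR_NAME, hash_file(path).encode("ascii"))

def verify(path):
    """True if the file still matches its recorded hash."""
    try:
        expected = os.getxattr(path, XATTR_NAME).decode("ascii")
    except OSError:
        return False  # no attribute recorded, or xattrs unsupported
    return expected == hash_file(path)
```

An io.open_code hook would call verify() and refuse to return the file object on a mismatch; stamp() is the update script run after each upgrade.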
So, an example: if I have a modified os.py file whose hash doesn't match, my example will just crash and tell me there's a mismatch during the import. And then I have a script that can be used both to generate and to update the hashes; after updating Python to a new version, you need to regenerate them. It's a very simple Python script that just creates new hashes, and then I can run my example again and it just works. The script is fairly easy: I have a list of Python file names, I use the hashlib module to hash the content, here with SHA-256, and then use os.setxattr to update the hash. Yeah, that's it. There are some caveats: if you were to use this in production, you'd need to actually protect these attributes, because a user, or an attacker who has gained some control, could in theory create their own Python files and add their own hashes. So either you have to use one of the protected namespaces, which I can't use easily without special privileges or a kernel policy, or you can do something like a signed hash, but that's going to be slow if you verify a signature for every file. With catalog files, you have a list of hashes and the whole catalog is signed once; verifying a signature on every single file gets slow. Or, if you run your application in a container, you can also do something like blocking the syscalls: there's a Linux kernel facility called seccomp which disallows executing syscalls, and you'd have to block three different syscalls here. Again, tomorrow morning at 10:30 I'll explain that during my talk. And there are also some open issues with this, so an attacker could do something with LD_PRELOAD or by overwriting parts of your program. If you want to test this with a container: initially I wanted to offer a container you could play around with, but there are problems with how you can store these user attributes in a container image; the image format doesn't really support that, and there's an open issue about it.
You might also be allowing a user or attacker to write to /proc/self/mem, which is a file that is literally the entire process memory of your process; you can get around these protections that way. dlopen is something where we face issues both on Windows and on Unix platforms: that's the call you use to open an extension binary. So if you have a C extension Python module, or a library you load with ctypes or CFFI, it takes a file name, and there's no easy way to verify the content of these binaries without being subject to a time-of-check/time-of-use attack. There's a fun attack which abuses memory file descriptors, dlopen and the proc file system to download a binary over the network and inject code, and there are probably many, many more things that can go wrong; this is just a small list. There's currently an effort to implement a new feature in the Linux kernel called O_MAYEXEC. It's a flag for the open syscall, a hint to tell the kernel: I'm opening a file, but this file may be something I'm planning to execute; it contains code. It comes from CLIP OS 4, a hardened Linux distribution, and there are a couple of videos and talks about that topic. With O_MAYEXEC, the kernel knows this is a binary or text file that may contain code, so it can do extra checks, and a kernel security policy can perform additional checks, like requiring the extended attribute to be present on the file, or denying opens on non-executable file systems, et cetera. So, closing summary: Steve. Sure, thank you. First point: the whole idea here is that when your security is good, audit hooks can make it better. There's a long list of things that you as a security engineer need to do to lock down your production systems. If you miss anything on that list, this point is not going to save you; it's probably not even going to help you.
It's just going to end up wasting your time if you missed anything earlier. These hooks are intended to provide transparency. Security is your job; Python is going to help you with it by making sure you're aware. You can see what's going on, and along with everything else on your operating system that's helping with that, it gives you the added information to make smart decisions, and to make fast decisions when the time requires it. These hooks enable the use of operating system technologies that have often been around for decades at this point, but were previously unavailable to Python. They do require custom implementation; there is some work involved. We hope this whirlwind tour has been some inspiration, and hopefully extra information, for that. Python now gets to play with the rest of the operating system world as far as securing and hardening things goes. And, of course, the most important point: yes, please install security updates. Please. So thank you very much for coming. Here are some helpful resources. I believe we are out of time for questions, so feel free to come and chat with us outside; we'll be hanging around just out there. Contact us on Twitter. We'd love to hear what you'd like to do with this, how you'd like to be able to use it, what you want to integrate it with. So please come and chat with us. Thank you. Thank you. I want to start the next talk, but we're missing a speaker. So if anyone saw Yan or knows him, this is the time to speak up. So, yeah, we'll need to cancel this talk, and this is your chance to go to another one. It's a bit of a different topic, though. Sorry about that, everyone. Thank you. Who's going to tell us about the secret life of software? Okay. Great. Thank you. Thanks, everyone, for coming along. It's a good turnout, which is always nice. This is something that I've been thinking about quite a lot for the last while, it's hard to define how long, so I'm interested in sharing my thoughts.
I've also recently heard people talking about the idea of conference-driven development, which I think is essentially: you go and give a talk, you say something that's wrong, and people can correct you afterwards, and then I can learn as well, which is good. I probably won't be taking questions, just because I think I won't have time. But it's lunch after and we'll all be hungry, so just come and grab me; I'm definitely happy to speak to you. So just very quickly, as you heard, I'm Dougal. I work for Red Hat, or as of yesterday I work for IBM, which was just announced, and I work on OpenStack, primarily on infrastructure and cloud stuff. Today, what I want to talk about is: what is your code actually doing? What is your code doing in production? You probably have some idea, but usually you don't have the complete picture. You don't entirely know what's going on. You might feel like you do, but generally speaking you don't, because everything is becoming so much more complex. The goal of this talk is really to have you start thinking about your infrastructure, thinking about how you can improve the visibility into what is going on, and then improve your process, and hopefully improve the visibility into some of the more opaque parts. So, I mentioned complexity, and really, we are building more complex systems than ever before. It used to be that if you were making a website, you maybe had a web server and a database server, so Python and Postgres or something like that. When I started building websites, that was pretty much the normal case. But that just doesn't really happen anymore; there are so many more moving parts. You will likely have multiple web servers and proxies, some load balancing perhaps, maybe split between static content and dynamic content. You'll likely have multiple databases; maybe a relational database and a non-relational one, maybe Elasticsearch for some searching functionality.
Because of these, you'll then need high availability and failover for everything. You'll need to support multiple devices: phones, laptops, TVs, smart fridges, watches, who knows; there's a whole lot going on. To make this perform realistically, you then need cache servers. You might need multiple geographic regions, so multiple data centers, for example in different Amazon regions, that kind of thing. To distribute your static content efficiently around the world, you'll need a content delivery network. You'll need business analytics, because everyone wants to know what's happening: the user trends, the growth, how things are going. And finally, a CI/CD pipeline, probably, to manage all of this and realistically deploy and keep everything going. And probably more. I mean, honestly, it feels quite overwhelming to me; there are just so many moving parts in a modern system. You might not have all of those, but the chances are you'll have a selection of them, and possibly some others that I didn't include. You also need to take into account that people are now often adding microservices for each of these different components. And, I mean, this slide is somewhat in jest; I joke. But the reason I include it is that sometimes people think this is only really a problem for microservices. That's not necessarily the case, because everything I mentioned there wasn't really a microservice. But if you are using microservices as well, then it becomes a bigger problem. And it's always quite fun to poke at microservices, because that's a bit of a buzzword, isn't it? They do solve a certain set of issues, but they also bring their own. Anyway, that's just an aside. So, am I just talking about monitoring? Sort of. It's certainly related to what we're going to be talking about today. Observability is the buzzword that people are using now.
And I'm not a huge fan of it, because it is a buzzword. But the nice thing about buzzwords is they're a good way to rally people together, have a discussion about something, and hopefully improve process. To define the difference between monitoring and observability: people have explained this in multiple different ways, but for me, the biggest difference is that monitoring tends to be black box, whereas observability is, I guess, white box, so you can see inside. With monitoring, you might have a system where you're checking the endpoints: you're pinging it, checking that it's returning what you expect, and that it's returning within a reasonable timeframe, but you don't necessarily know what's happening behind that. Whereas with observability, you're able to see what's going on. But generally speaking, I would say there is a lot of overlap with monitoring. Essentially the goals are the same, in terms of making sure everything's working and operating. What you're trying to do is the same, but it's an evolution of the idea, going into further depth. And with this, I then think about the observability mindset: you're thinking about how you can better observe your system, what you can do to improve it, and highlighting the areas where you don't really know what's going on. One of the things you'll hear people talking about is the three pillars of observability. I'm not a huge fan of this either, but it's actually a decent place to start the conversation, a decent place for you to start thinking about how you can improve your setup. These three pillars, I don't know where the term came from originally, but I believe it came from large tech companies; Facebook and Twitter talk about it, and it's quite often used in marketing as well.
But one of the things you should definitely take into account when you're trying to adopt some of these practices is that you can do some things better than Facebook or Twitter or other big companies can, simply because you don't need to operate at the same scale. You could perhaps realistically store a lot more data, or a lot more logs, because you're generating so much less of it; relative to how much you have, you can store a lot more, for a longer time period. But to quickly define these three things: logs, I guess, are the most obvious. Everyone has seen log output from different systems and applications, and you're hopefully logging output yourself; it's semi-structured text. Metrics are counters and stats, generally time-series data. They're probably at a higher level, more about monitoring whole systems, though you can have some application-specific ones as well. And then tracing is application tracing: tracing the actual flow of the code. It's the most detailed in terms of your application; if you're doing a web app, you trace the request through the different services, see how everything links together, and monitor it that way. So, logs. This is probably familiar to all of you. This is an example I had to adapt to fit on a slide; something I noticed when creating it is that if you really want to view a lot of log files, you probably need a 4K monitor. And yeah, they're not very human-consumable, despite that being one of their purposes. This format should be very familiar: I had to remove the timestamp at the beginning, but you have a timestamp, often a process ID (although that's not in this example either), then a log level, a logger name, and then some kind of message. Now, this example has some good things and some bad things. One of the nice things is that it includes IDs.
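In stdlib terms, that familiar line shape (timestamp, pid, level, name, message) is just a logging.Formatter pattern, and IDs can travel alongside it via the `extra` dict. A minimal sketch; the logger name and the request_id field are invented for illustration, and with this format every log call must supply request_id via `extra`:

```python
import io
import logging

stream = io.StringIO()  # stand-in for a real log file or stderr
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(process)d %(levelname)s %(name)s "
    "[req=%(request_id)s] %(message)s"))

log = logging.getLogger("app.orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

# The request_id key is supplied per call via `extra`.
log.info("order created", extra={"request_id": "f3a1c"})
```

Each emitted line now carries a greppable ID next to the usual fields.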
IDs are incredibly useful when you're looking through logs and trying to track something down, because you can find everything related to it. So I'd encourage you to always include IDs when possible. The example has a semi-structure to it, but it's really bad, because it's not standardized at all: there are square brackets, round brackets, curly brackets, all sorts going on. You could write a parser for that, but it would probably be painful. So let's think about how we can improve it. But first, a slight step aside. Exceptions and errors are probably your most important logged events. You might not necessarily think of the two in relation to logging, but your exceptions are always logged, or sorry, they should always be logged; but they're also something you tend to want to be notified about, to be alerted on. More recently, I like using Sentry for this. Sentry is open source, but they do a SaaS offering, which I think is quite reasonably priced. Disclaimer: I'm friends with the founder. But they've got a very well-proven track record. And this is the simplest example of how it works: it will send the exception to the Sentry server, where it'll do some aggregation and allow you to view your exceptions, control notifications and so on. So if you've ever had the problem where an error happens, say, a hundred times and you've been bombarded with emails, you can use something like Sentry to get a much smarter way of managing all those notifications, and you can see that it's the same exception that happened a hundred times. Whereas with logs, you might have a log file with the same hundred exceptions, but then there's one error in there that's different, and you might miss it, because there's just too much there and it's hard to process.
And one of the nice features is that you can mark things as resolved, and then if the error comes up again in the future, it'll be flagged as a regression. So that's how I'd recommend dealing with errors. Going back to your logs again: adding structure to your logs is a great idea, and the best way to do that in Python is with structlog. The maintainer is over here. But yeah, it's a great library. Quite difficult to come up with an example that fits on a slide, though, so if you have one, that'd be good. It's really easy to use; it's essentially got an API that's compatible with standard Python logging. In this example, you'll see I log once and then I log again. The first is the default output, which is, I guess, like a development mode: it outputs in a human-readable form. But when you're using it in production, you can output to different formats; quite commonly you might want to use a JSON serializer. That means each line of your log file is essentially a JSON object. And the nice thing about that is you can attach more data to it, you can attach structure to it, and you can parse it later in other tools, which is just really useful. There are tons of other things you can do in terms of filtering, processing and transforming; the documentation is very good and there are a lot of good examples there, so that's where you should go to look for more on structured logging. I think it's generally becoming more popular. The other structured logging library I've used is zap for Go, which is quite nice; I think that came from Uber. So, one of the first things I would probably do is add unique request IDs to your logs.
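structlog is the right tool for this, but the core idea just described, one JSON object per line with arbitrary key-value pairs attached, can be sketched with only the standard library, which may help make the structure point concrete. The field names and the "context" convention here are my own, not structlog's API:

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Merge structured context attached via `extra`.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)

stream = io.StringIO()  # stand-in for a real log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("shop.checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment processed",
         extra={"context": {"user_id": 42, "cents": 1999}})
```

Downstream tools can now json.loads each line instead of regex-parsing brackets.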
And this is actually something I've maybe used before structlog, but because structlog gives you structured logs, it makes sense to do the two together if you're going to do both, because then you can add the ID in a structured way to your logging. And this is a very simple example of how you do that with Flask. The idea is that every request that comes into your system is given a unique ID. You can then use that when you're logging, so every log message has this ID attached to it. And again, you can then filter by that ID and see every log message specific to that request. You can also pass this onwards as a header and use it in other services as well. So if you've got microservices, or maybe you've just got two monolithic services, it doesn't really matter: they can pass that header on and you can start tracing things between different applications. Sorry, I just realized I forgot to say something on the previous slide, but it doesn't matter. Yeah, and this just adds that extra context information. So even if you're just using simple tools like grep, you can grep for that request ID and see all of the logs related to that request. One thing that's tricky about this is that not everything will support request IDs. If you're using different databases, it might be hard to add that context to them. I have seen somebody do a hack with Postgres using the application name: before each query, you essentially set the application name to the request ID, and Postgres can then include it in its logs, so your Postgres logs could be linked up with this. I'm not sure if that's a good idea or not; I haven't actually tried it, but it seems quite interesting. There are ways you can make this work, one way or another. So I'm just going to quickly jump back here.
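The request-ID pattern just described can be sketched framework-agnostically with the stdlib: stash the ID in a contextvars.ContextVar (safe across threads and async tasks) and inject it into every record with a logging.Filter. The names here are mine, not from the talk's Flask example:

```python
import contextvars
import io
import logging
import uuid

request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

stream = io.StringIO()  # stand-in for a real log destination
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(request_id)s %(message)s"))
handler.addFilter(RequestIdFilter())

log = logging.getLogger("web.views")
log.addHandler(handler)
log.setLevel(logging.INFO)

def handle_request():
    # In a web framework this would run once per incoming request,
    # e.g. in middleware, reusing an X-Request-ID header if present.
    request_id_var.set(uuid.uuid4().hex[:8])
    log.info("request started")
```

Every line logged while the request is in flight now carries its ID, so `grep <id>` pulls out the whole story.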
One problem you might see with this is that you're using structlog in your application, but all the libraries you're using are just using standard Python logging. There is a Python package called python-json-logger; I should have put that on the slide. If you use it, it can add JSON formatting to the standard Python logger, which means everything outputs JSON, because otherwise you could be in quite a strange situation where a few things output JSON to the file and others just output lines of text, and that would be much harder to parse. Okay, but logs do have their limitations. They can often be quite granular, and it can be hard to see trends. As you add structure and make the data easier to extract, this improves; but as I mentioned, you might be using other services and systems which don't provide this additional data and context in their logs. So it's still much harder to see trends, and harder to use logs for monitoring and alerting. It's also generally quite expensive to store logs, just because applications often produce a huge volume of them. I think something we struggle with as developers is figuring out what is useful to log. There isn't really a good answer, because when you're writing a log message, you're trying to answer the question: what will be useful to know in the future when something goes wrong? Sometimes it's obvious, but other times it isn't. I think the best way to handle that is to have a good balance of messages at different log levels, so at least you can increase or decrease the log level as needed to get more information, or reduce it to just the critical stuff. And next, I would think about metrics. Metrics are something which are very useful, essentially.
I mean, that's not a very good statement, but metrics are cool because you can build lots of nice alerts and monitoring, lots of pretty graphs and dashboards, and you can start seeing trends over time. Your very basic metrics are probably your error rate, your response time, your request volume, that kind of thing, and that's certainly a very good place to start when you're looking at something like a web app or an API. But you can start to think about more granular and more detailed metrics. For example, your database load might be more of a concern to you, depending on your application. So you can do some simple tracking of the number of queries done by a view, or of the query duration. These can be useful: if you're using one of the Python ORMs, it's fairly easy to accidentally do a query in a for loop, where you're doing N+1 queries, because every time a new record is added you're adding an additional query. If you start tracking the number of queries your endpoints are doing, you can see the trend if it's growing, and catch that it's increasing. The other one I've noticed is useful: if your query duration is starting to take longer, that likely means you're missing an index, and as your data size increases, the query slows down because whatever you're querying on is not indexed. One of the best integrations of this I've seen is actually in Django: they have database instrumentation, I think they call it. There's a really good page in the Django documentation describing how you can hook into Django's ORM and record timings, essentially wrapping all the queries so you can track things in great detail, which is very cool.
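The per-view query tracking described here can be sketched as a tiny wrapper; `fake_db` below is a stand-in for whatever your ORM or driver actually calls (a real integration would hook the driver itself, e.g. via Django's execute_wrapper mentioned above):

```python
import time

class QueryTracker:
    """Counts queries and accumulates their duration for one request."""
    def __init__(self):
        self.count = 0
        self.total_seconds = 0.0

    def record(self, run, sql):
        start = time.perf_counter()
        result = run(sql)  # the actual database call
        self.count += 1
        self.total_seconds += time.perf_counter() - start
        return result

def fake_db(sql):
    # Hypothetical stand-in for a real database call.
    return [("row",)]

tracker = QueryTracker()
for user_id in range(5):
    # The classic N+1 pattern: one query per loop iteration. Emitting
    # tracker.count as a per-endpoint metric makes this growth visible.
    tracker.record(fake_db,
                   f"SELECT * FROM orders WHERE user_id = {user_id}")
```

Exporting count and total_seconds per endpoint is exactly the kind of metric that exposes creeping N+1 queries and missing indexes.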
If you actually want to record very granular metrics yourself, StatsD is probably one of the most popular choices, and one of the nice things about StatsD is that it's supported in a number of different places. For example, Datadog, who are one of the sponsors, support the StatsD interface, and this is just a very simple example of how you can use it. It's very trivial; I mean, this would essentially report an instantaneous duration, because it's just wrapping mathematical operations. You'll also find there are integrations for things like Flask and Django; people have made packages like django-statsd, which will collect a bunch of metrics for you by default. One thing about metrics is that this actually starts to overlap with tracing, because when you're tracking things at the level of database queries through your application, you're collecting almost the same data as individual traces. So, tracing: possibly the most useful, and probably the hardest one to implement well, I think. There are some really good solutions for this from Datadog, Elastic, Zipkin, and others. There's the OpenTracing standard, which is on the rise, and it seems they're all starting to support it, which is nice, because before that you would essentially pick one of these projects and then have to integrate it into your code yourself. My primary experience is actually with the Elastic APM, but I'm going to show you the OpenTracing one, which is more agnostic, which is quite nice. Sorry, my speaker notes don't make any sense. So with OpenTracing, they have a bunch of integrations for different frameworks and libraries, and this is essentially how you can integrate it with Flask. What's nice is that it knows how Flask works, so it can wrap around the middleware in Flask (I think it uses the middleware), which will then track your request durations and everything that's going on.
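The StatsD interface mentioned above really is trivial: plain-text datagrams like `name:value|c` (counter) or `name:value|ms` (timer) over UDP, fire-and-forget. A toy client plus a local stand-in server, just to show the wire format; use a real client library in production, and note the metric names here are invented:

```python
import socket

class TinyStatsd:
    """Minimal StatsD-style client: fire-and-forget UDP datagrams."""
    def __init__(self, host, port):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name, value=1):
        # "|c" marks a counter in the StatsD line protocol.
        self.sock.sendto(f"{name}:{value}|c".encode(), self.addr)

    def timing(self, name, ms):
        # "|ms" marks a timer value in milliseconds.
        self.sock.sendto(f"{name}:{ms}|ms".encode(), self.addr)

# Local stand-in for a StatsD server, so the example is self-contained.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port

client = TinyStatsd("127.0.0.1", server.getsockname()[1])
client.incr("checkout.completed")
client.timing("checkout.duration", 87)

received = {server.recv(1024).decode(), server.recv(1024).decode()}
```

Because it's UDP, a dead metrics server never slows down or breaks the application, which is a large part of why the protocol is so widely supported.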
And essentially, the way you use this is all really quite standard stuff; there's not really interesting code to show you, because most of the benefit of tracing comes from using a framework that has tracing support, and then it's essentially a turnkey integration. So this is the OpenTracing code, which will then work with Elastic, Datadog and Zipkin as well. What it actually creates in the end is something like this: a graph plotting out everything that's being traced. This is far too small, and these graphs are always horrendous to read on a small screen; actually, that looks better up there. This is just an example from their documentation, but you can see it's showing the different SQL queries that are going on, and the different calls between the requests of different services. It gives you an incredible amount of insight into what is actually happening in that request, and the value of the data here can be really powerful. For example, in the middle column there, you can see there's a SELECT query which is taking up, well, it's actually quite quick, but it's taking up quite a large proportion of the request. So if this were an endpoint that was causing you problems, you'd want to start homing in on that, perhaps, and maybe improving that SELECT, or the one before it, and then you can really improve things overall. So, just to recap on these a little bit. Logs I generally think of as very specific and detailed; they tend to be per-service, but you can filter down into a request or whatever you need with the IDs you're adding. Metrics are higher level, more about system health. And tracing is for individual specific requests, which you can follow through your systems and your other services as needed. And at this point, are we done?
The short answer is no. This is one of the problems with the three pillars: people talk about it as, you must do this, this is how you do observability, you've done observability. And it's one of those cases where one does not simply do observability. It's just not really enough. You can collect all of these different things, but it's how you present the data, how you bring it together, that really matters fundamentally at the end. And to mention Datadog again, they actually have a very good example of this; you can visit their booth, I've got no affiliation with them. They have a good demo showing how you can filter through all the different parts of these requests, view a trace, click on one part, and see the logs related to that trace, which is quite nice. You can achieve this with other solutions as well, but they have a very good real-time demo they can show you. Generally speaking, I tend to take a very practical approach to observability. When you're looking at these three different pillars, they're all essentially just data, all different things you can learn from, different things you can use to adapt and develop with. So start collecting the data, learn from it, and then rinse and repeat and improve your processes as you're growing. There's actually a really good talk by Ben Sigelman, which helped enlighten me a little while I was doing research for this talk: "Three Pillars, Zero Answers". He's very critical of the term, and I've become so myself as well; unfortunately, I wrote the abstract before I was as critical as I am now. It's not to say that I think it's a bad idea. I think it's a great starting point, but you can't think of it as, okay, that's a three-step process and then you're done. There's a lot more to it.
And there's not necessarily the divide between them that you might expect. So, for example, from the tracing data that I showed you several slides ago, you could actually extrapolate logs and metrics. So there is some overlap between these, and you need to decide where you want to focus your different efforts. But otherwise, that's actually all I have. I've come through that quickly, I didn't expect this. Okay, so we have about four minutes left. If anyone has questions, we need to queue for that microphone there. Yes? Anyone? Okay, in that case, thank you very much. And it's lunchtime. He is also the maintainer of libraries like python-dateutil and, of course, setuptools, which everyone knows. He's also a Python core developer and a contributor to many other open source projects. His presentation is about how to build your Python extension using Rust. Let's give a round of applause to Paul Ganssle. Howdy, that was a good introduction. Thank you. All right, so today I'm going to be talking about building Python extensions with Rust. This talk came about because I wanted to write a backend for dateutil and found that there were a bunch of different options available, and I wasn't really sure which ones to choose or how they worked. So a couple of caveats I want to start with. One is: if you don't know Rust already, you should probably leave, because this talk really requires deep knowledge of Rust. I thought you would laugh. No, you do not need to know any Rust. There are also a bunch of slides with a bunch of code on them. Just get the general feel of the code. You don't have to look at all the lines and make sure that I did things correctly, because those are really more for reference later. I don't want you to feel overwhelmed by all the code on the slides. Yeah, so let's get started then.
Okay, so why would you even want to write any sort of extension for Python? Well, it's a fairly common thing to do, because Python is often known as a good glue language, in the sense that Python is very expressive, and sometimes people call it pseudocode you can actually run, and it's great that way, but that often comes at a runtime cost. It's slower than many compiled languages. But you can kind of have the best of both worlds if you write the slowest parts of your application in another language, or even just use Python for something like system orchestration, where you're calling system functions, which are fast, and then you're providing a high-level API to your users so that they can write their code in Python. And in fact, if you look at the Python ecosystem, a huge number of the fundamental libraries that we use, including the standard library itself, are essentially just glue libraries. NumPy is glue around some C and Fortran code, super-optimized scientific code with a nice Python wrapper that we can all use. OpenCV, TensorFlow, PyTorch, Pillow: these all have compiled backends. So let's look at exactly what it means to have a compiled backend. Here is the demo function that I'm going to be using for these comparisons: an implementation of the sieve of Eratosthenes. What this does is, you give it a number, and then it calculates all the prime numbers up to that number. The way it works is that it allocates an array of all the numbers, two to that number, then goes to the first prime and eliminates all the multiples of that, and then goes to the next prime, which is the next number in the list that hasn't been eliminated, and eliminates all the multiples of that. And you can see this Python implementation is like six lines or something. It's nice and easy to read.
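The slide with the speaker's roughly six-line Python version isn't captured in the transcript; the sketch below follows his description (allocate the numbers two through n, then repeatedly take the smallest survivor as the next prime and cross off its multiples). The exact code on the slide may have differed.

```python
def sieve(n):
    """Sieve of Eratosthenes: all primes up to and including n."""
    candidates = list(range(2, n + 1))  # allocate the numbers 2..n
    primes = []
    while candidates:
        p = candidates[0]               # smallest survivor is the next prime
        primes.append(p)
        # Eliminate every multiple of p from the remaining candidates.
        candidates = [c for c in candidates if c % p != 0]
    return primes

print(sieve(5))   # [2, 3, 5]
print(sieve(20))  # [2, 3, 5, 7, 11, 13, 17, 19]
```

This is the pure-Python baseline that the C, PyO3, CFFI, and Cython versions later in the talk are compared against.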
And the output seems to work: we get two, three, and five for five, and then something like 20 gives you all the prime numbers up to 20. And this is the implementation if you use the C API. You'll notice that the font is a little bit smaller, because I couldn't really fit it all on one slide, even using this weird terse style where I have a for loop all on one line and I'm cramming things together. So it's more verbose, but does that mean it's harder to program? That's not necessarily true. But you'll notice that the core part of the algorithm is this little part in the middle, the equivalent part, where it says "sieve out composite numbers". The first part is casting Python integers to regular integers and then checking for errors; there's a lot of error handling. I'm allocating memory here, and then I have to keep track of all the references and construct a Python list from a C array. So why would I bother doing this? The Python version seemed easy enough to write. The reason is performance. If you compare the performance of these two, you can see that for some modest number you get a 30 or 40 times speedup when you use the C version, because, again, it's much faster. But it has downsides, a lot of the things that I mentioned: the memory has to be managed manually, you have to handle all the reference counting that's usually done by the interpreter itself, and C is not a memory-safe language, which means that you can do things like double-free something, or allocate some memory and then never free it. And in fact, you can also do things like this, which, I don't know if you are just amazing programmers and spot the error right away even without the comment, but this is an actual section of code from the original version of this that I didn't notice, and it passed all my tests.
But what was actually happening was that here, when I went to iterate over the array, I didn't realize that because the sieve starts at the number two, its length is actually n minus one, not n. And so I just did the thing that's almost rote, which is iterate from zero to n. But what happens in C is that you just go one past the end of the array in memory, and then you take whatever's there, turn it into a Python object, and put it in the list. So my list, instead of 2, 3, 5, would be like 2, 3, 5, 82, which is not what we were going for. And that's a problem, because with C you're really bare bones: you're looking at the raw memory and you're manipulating it. So what's the alternative? I'm going to pitch that Rust is an excellent alternative. Rust has a lot to like about it. It is memory safe by default, and it tends to use zero-cost abstractions for this, so you actually still get high performance without losing your memory safety. It also tries to enable fearless concurrency, which means that instead of using the GIL or whatever, it's just kind of nitpicky about when things are able to be used from multiple threads, and then you can feel safe that if you manage to get something past the compiler, it will work. And something that will really appeal to Pythonistas is that it has a broad community and a really big open source ecosystem. So where normally you would just pip install something, here you can just add it to your Cargo.toml, and when you do your cargo build it will pull things from crates.io and you can just use them. And there are already a ton of crates out there. So I don't have time to get into exactly all the benefits of Rust and how it achieves these things, but I thought I would cover one little topic just to give you a flavor of what it's like to program in Rust. So I figured I would talk about ownership, which is one of the most important concepts in Rust, and this is, again, about handling resources.
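The off-by-one described above reads silently past the end of the array in C. To make the failure mode concrete, here is the same mistake reconstructed in Python, where (as in safe Rust) the out-of-bounds read fails loudly instead of handing back garbage; the variable names are mine, not from the slide.

```python
# The sieve array holds the numbers 2..n, so its length is n - 1, not n.
n = 5
sieve_array = list(range(2, n + 1))   # [2, 3, 4, 5], length n - 1 == 4

try:
    # The bug from the talk: iterating from 0 to n instead of 0 to n - 1.
    values = [sieve_array[i] for i in range(n)]
except IndexError as exc:
    # Python refuses to read one past the end; C would happily return
    # whatever bytes happen to live there (the "2, 3, 5, 82" effect).
    print("caught:", exc)
```

Rust gives the same guarantee either at compile time or via a runtime panic on an out-of-bounds index, which is the memory-safety property the speaker pitches next.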
So variable bindings in Rust have ownership over the resource that they're bound to. When I assign v to this vector, then when v goes out of scope, we know that because we own that vector, we can free up all the resources of the vector. And if you assign that variable to something else, it actually moves the ownership to the new variable. So here's an example where I have this function called take_ownership. All it does is take some variable, but it will take ownership of that variable. So here, when I assign v to this vector and then pass it to take_ownership, after that, take_ownership owns the resources in v, and at the end of the scope of take_ownership, when that v goes out of scope, you free up all those resources, which means that normally this last line would be a use-after-free or something. But Rust has this very picky compiler. It's also a very verbose compiler, and it'll tell you exactly what you did wrong. I just explained in all those words what went wrong, but if you just tried to compile this, you would see: hey, this thing was moved from here into here, and then you tried to use it afterwards. So generally speaking, if you can get your Rust code to compile, that's not saying it's a good program, but it eliminates large classes of bugs, like memory-safety bugs and some concurrency bugs. I'll also note that there are certain things, like dereferencing raw pointers, that you can do by bypassing these safety checks, by putting them in an unsafe block, and you'll see a lot more of that later. The value of that is that it makes your code a lot more auditable, in the sense that you can stop looking for these kinds of bugs in anything that's not an unsafe block, and then you double- or triple-check the unsafe blocks. Okay, so far we know how to write Rust programs, but how do we write Python programs using Rust?
Here is my Rust implementation of the sieve of Eratosthenes. This first part is just pure Rust; that could be in a crate somewhere, or it could be in your file. I separated it from the part that converts it to Python, just to show you what it would look like if you were actually wrapping a crate that already exists. So here I have the Rust version, which gives us a Vec of u32s, and to expose it to Python, I just put this little decoration on it, which is a procedural macro called #[pyfunction], and what that's going to do is transform this function such that it exposes a C function in a .so that takes an integer and returns a list. And this is how it works: you can see that I can just import it like normal, and it looks just like any other Python function, and you can see from the speed that it is of comparable speed to the C extension. In fact, in this case it's a little faster, but well within the noise, and I don't think that generalizes, so I would say it is on the same order of magnitude as C. How does it do all of this? Because the C API itself obviously has a lot of unsafe code; almost everything in Python under the hood is just some mutable reference or mutable pointer to some memory somewhere. The way it works is that PyO3 is in two layers. The first, lowest layer is the FFI layer, and here is an excerpt of the datetime bindings, which I actually added to PyO3. You can see that you have to recreate bindings to the functions, where you recreate exactly what the signatures are. You recreate the exact data structures, so it has to have the exact same data structures but exposed as a Rust struct, and then all the macros we have to reimplement manually, because there's no symbol to bind against.
But once that's all done, and all this is done in the PyO3 crate with a whole bunch of unsafe code, our end users can use it wrapped up in a nice safe layer. So assuming we did all the FFI stuff correctly at the lower level, we now have this safe layer, which has constructors and function calls that use unsafe code but are not themselves unsafe. Essentially, as long as the stuff in the unsafe block is correct, you can build on that with safe abstractions. So here is the implementation of the new constructor for PyDate, and it basically just passes some Rust stuff back down to the C API. For each of these safe Rust wrappers, we have some constructors and some access traits and various other traits. There's a lot to go into there, but I think you get the general gist: we have the safe and the unsafe layer. So when you go to actually use it, that's how it's implemented; this is how you would use it. You would use your #[pyfunction] procedural macro. Here is an implementation of something that just takes some seconds and gives us the date that was that many seconds ago. So this first block is a pyfunction, and then I have to expose it in a module, so I create this function which initializes the module, decorate it with this #[pymodule] procedural macro, and just add my existing function, and then I can call the function and it just works. It constructs the datetime, but you may note that the date constructor has a valid range and can throw exceptions, and I don't have any exception-handling code here. I do actually have this part that says PyResult: that's the return value, and it can be either Ok or an error. PyO3 does this nifty little thing where, if something can raise an exception, it returns one of these PyResults, and if you just let that PyResult bubble up to the Python layer, it will automatically turn it into an exception with a traceback. You can also make classes.
So you take a struct, and then you call it a class, and then you can implement whatever methods you want, including certain special methods like new. Here I have implemented something that is just a point, and you can take the normal function, add it to this module that I call classy, and then when I construct this point, I can look at it, I can calculate the norm of it. You'll notice that x and y, well, maybe I haven't shown you this, but x and y are not directly accessible on that. If I want them to be accessible at the Python layer, I need to write a little more code that will take those 32-bit integers and translate them into Python integers. Okay, so that is the PyO3 approach, the C API approach, and that, generally speaking, is going to be a much more all-inclusive experience, where it's really built to work for Python. There's another approach that you can use, which is to write C FFI bindings. And FFI, I forgot to mention, stands for foreign function interface. It's for exposing functions to other languages. In this case, most languages speak C: under the hood, either they're written in C, or they can handle C memory structures and C function pointers, because so much at the lower levels uses C as the lingua franca of programming. So what this approach does is it says, all right, we're going to take our Rust function and expose it in such a way that you can use it anywhere you could use an equivalent C function. So here I have this Rust function, and then I have this scary-looking function that is unsafe extern "C" and returns a mutable pointer. This is like a super unsafe function. And in fact, this little mem::forget says: actually, just forget that we ever owned this vector. So what's happening is I pass it a vector, and I say, allocate all this memory as a vector and then stop paying any attention to it.
And then I'm going to expose that to whoever calls the C equivalent of this function. And this is going to be a problem, because nothing except Rust properly knows how to deallocate the vector that I have. So now I also have to expose a similar C function that deallocates vectors: you give it some memory, and it deallocates the vector. So this is obviously a little bit more complicated in this sense, but it does bring one big advantage, which is that JavaScript, Ruby, Python, and a lot of other programming languages already have bindings for generic C FFI libraries. What that means is that you can write your low-level Rust library once and have bindings to it in all kinds of other places. On the Python side, the way you can talk to this is a library called CFFI, which allows you to work with C FFI interfaces. And then there's also this library called Milksnake, which comes from Sentry. What that'll do is, at setup time, it'll generate a bunch of Python code that wraps CFFI for you; it'll generate this ffi and lib for you, basically a thin wrapper in Python. But you still have to do things like, here, what I'm doing is converting this to a list and then deallocating the vector, and I'm doing this on the Python side, not on the Rust side. So this makes your Python APIs a little bit more complicated, but there's nothing saying that you can't just write this code for your users. And you can see, if we compare this to our other implementations, it's on the same order of magnitude of speed; in this case it's a little bit slower, sometimes it'll be a little bit faster. But generally speaking, there's not too much more to say about this approach, because it's not everything and the kitchen sink. It's a very bare-bones approach, but it is very versatile.
Okay, so far I've talked about the ways that we can use Rust, but I don't know that I've really made the case for why we should use Rust. I know that I've made the case for why we shouldn't use C, but those aren't the only options. Probably the best contender against all this is Cython. What Cython does is let you write this Python-like language, a superset of Python, which it compiles down to either C or C++ behind the scenes as part of your build process, and it creates a C function for you. It has a lot of the same advantages as Rust, in the sense that it has memory safety mostly guaranteed by the Python interpreter. You can do memory-unsafe things with it, and there are no unsafe blocks or anything, but for the most part, if you write things that look like Python, they'll probably be memory safe, and it'll generate code that is pretty fast. In this case, I've written something, compiled to C++, that is a little bit slower than the C extension, but still 10 times faster than regular Python. So what should we choose? We have Cython, we have two different kinds of Rust, we can just use pure Python. For this little function that I wrote, I have this little chart of speeds. Most of this slide I'm going to spend talking about how not to use this chart. You could look at this chart and say, oh well, everything from Cython over is on the same order of magnitude of speed. In my experience, Cython is a little bit slower than these other things, but it has a much nicer interface; you can write something that looks like Python. But you should also note that I didn't go out and choose a set of functions that are a perfectly representative benchmark of all the things that you would do. I picked a function that I thought would illustrate some of the difficulties of using Milksnake and using PyO3.
So I would recommend, especially if you just want to wrap a Rust library, picking a couple of things that you want to benchmark and trying out a couple of different approaches, if it's really important. But honestly, you could pick any of these and it wouldn't make much difference. So let's say that you do want to use Rust. Between the C FFI approach and the C API approach, what are the different upsides and downsides? The C FFI approach, as I mentioned, is more portable. It also has a smaller Rust dependency, in the sense that it doesn't have to compile this huge set of safe wrappers for the entire Python library; it just compiles in whatever it needs, so the binaries are going to be a little bit smaller, if you care about that. And then also, PyPy, the alternate implementation of Python, can apparently do better optimizations when it's looking at CFFI than when you're using the C API, so you'll probably get better performance on PyPy if you're using CFFI. The downside is that it pulls in a runtime dependency on both Milksnake and CFFI, so you're going to be pulling in third-party code at runtime, not just at compile time. It also has no support for Python-specific types like list, datetime, tuple; you have to write your own wrappers for all of that. And you have to manage your memory in Python. Also, I'm not crazy about the fact that the public interface you have to maintain is all unsafe Rust; I prefer to hide that away. But again, if the other pros outweigh that, I think it's still a great approach. On the C API side, you're using safe Rust for almost all the code that you're writing. It has no runtime dependencies. It has native support for all the Python-specific types. It's easy to call back into Python from Rust. And it also manages the GIL and reference counts and such for you. The biggest downsides have to do with its stability: it's still a somewhat immature library, so it's somewhat buggy.
It hasn't been, I don't know what happened there. It hasn't been optimized for speed. And it requires nightly Rust. Also, the API is still changing a little bit. But I recommend just jumping in and getting involved, because this can also be cast as an upside, in the sense that you, as an early user of PyO3, can probably influence the direction it evolves in. Okay, so I think both the Milksnake approach and the PyO3 approach are early enough in the game that there are lots of opportunities for improvement. Which is to say, it's kind of hard to even make a choice for the long term at this point, because both of them, I think, will improve a lot. For the CFFI or Milksnake approach, a lot of those problems I talked about, about how difficult it is to wrap these things and how much you have to manage the memory, I feel could probably be fixed by taking some of PyO3's approach and writing procedural macros which would automatically generate this kind of code from a macro. That would be a library-level implementation, and it doesn't have to go into Milksnake; this could just be a new crate that you write. And then you could similarly have an equivalent function in Python that says: given that you're using that procedural macro, we'll just import this function, and it will convert things correctly to the right Python types. Then you could get a pretty similar experience using the C FFI approach to what you get with PyO3, except without any of the CPython-specific stuff. And then for PyO3, I think the biggest thing you could do to help improve PyO3 would be to just contribute. It's a super active project, relative to how active open source projects usually are. This is now a somewhat dated screenshot, because I didn't want to make a screenshot right before I started this talk, but I think it had commits merged as of early today or late yesterday. So things are actively being merged.
I am not a committer on this library, but I would be happy to review any pull requests that you want to make, because I think that both of these approaches could really turn out to be something special in the long run, and it could allow us to use more Rust and more Python, because I think there are a lot of synergies between, did I just say the word synergies out loud? This is really going off the rails, people. All right, in any case, I think Rust and Python are natural good friends, and I'd like to see us foster that a little bit. All right, so that's the end of my talk. It looks like I do have a couple of minutes for questions, right? Thanks for your talk. I just wanted to ask a small question. I checked this thing out, and I think it is an amazing project, but you can also think of a different approach: this is not like taking all the C stuff and trying to put a nice layer on top of it. I just wanted to ask if you have any thoughts about the RustPython project, for example, which intends to write the whole of CPython in Rust. Well, RustPython is an interpreter written in Rust, so in some sense it's actually a completely different beast from these things; these are, to the extent that they can be, interpreter independent. With something like RustPython, it may be easier to write Rust extensions targeting RustPython. And I hope they continue, because I think that project is great. How many times, when I've been working on CPython code, have I been like, oh, I wish I were writing this in Rust? But yeah, I think they're sort of orthogonal, and both can work together. It may actually be easy to just add more stuff to PyO3 and/or Milksnake to say: hey, if you're using RustPython as your interpreter, you can take these shortcuts; you don't have to go through this whole C layer. Okay, thanks. Hey, great talk. So you said one of the cons of PyO3 is that it requires Rust nightly.
In my experience using Rust, it seems like a lot of stuff in the ecosystem requires Rust nightly. So how big a deal is that? I don't know; one of the problems is that I don't write a huge number of Rust applications, so from my perspective, it's never been any different. Rust nightlies do not feel unstable, and I think if you're using Rust at all, you may find it acceptable to use Rust nightly. But I have heard of a couple of places that are using Rust regularly in production, and I think they're a little uncomfortable with it; they prefer to use the stable builds. So it's just something that I've heard people worried about: they say, I would use PyO3 if it were using stable Rust. But Rust nightly seems fairly stable to me, and they're super responsive. Thank you. Thanks for the talk. One question: how much business value do you see in it? By optimizing, for example, this generator of prime numbers, what do you think is the scope in real-world applications which you could optimize? Yeah. So, obviously, the sieve of Eratosthenes thing is a toy problem. It's an algorithm that I could fit on the screen in three different languages. But there are two main ways that I can see this being useful. One is that people are writing things in Rust, and this is probably the bigger one. If people are already writing, say, cryptography libraries in Rust, hashing libraries in Rust, JSON parsers in Rust, those are going to be increasingly used as backends for other languages' things, and this is the way to get that exposed in Python: something that's already in Rust, and you just want to use the good implementation in Python. And then the other side of it is, in the same way that you usually just use NumPy if you're doing some big number-crunching thing, occasionally you have some very low-level bit twiddling that you want to do in a very tight loop.
If you can move that tight loop into Rust, then that will be helpful. But generally, the kinds of things that go into hot loops are common enough, or admit a sufficiently good abstraction, that someone will just write a general library for them. So you may be right, but you may not be writing too many Rust-optimized hot loops in your own code. Thanks. Thank you for your talk. Are we at time? Should we just do this? I'll be in the hall afterwards. Okay. Sorry, we have to go to the next speaker, so if you have any more questions, please do it after the talk. Thanks. Our next speaker is Alyssa Dummer. She's a data scientist at FreeNow, formerly known as MyTaxi, whose app you use for calling a cab in the dead of night. She loves programming in Python and Rust, doing game development on the side, and data science, of course. So give her an applause. And she'll be, sorry, she'll be presenting "Python versus Rust for simulation"; I forgot to say that. Hello, everybody. Since recently, I'm actually a machine learning engineer at FreeNow, formerly MyTaxi, and I do a lot of backend development, putting models from our data scientists into production, not just putting them into web servers or into some batch jobs, but trying to optimize them so that we hit a certain threshold on our performance. This is also why I stumbled across Rust, and as the previous speaker told you about PyO3, I also tried it out and it was very promising. But right now I'm working on simulation for our business purposes, and the questions of what should be executed, how fast, how expensive in terms of scalability, or maybe development time, all these different points are kind of important right now. This is why I decided to present this topic here as well. So what is a simulation? In a very simple manner, a simulation is an abstraction of some event. You could think of a state machine as technically a simulation.
An example would be, for instance, how water flows through the pipes in a system, or how blood goes through our veins, and then, for instance, if you have an aneurysm, how the blood flow behaves there; all of this can be recreated in a simulation. So it's just a program that recreates some real event. There are different types of simulation. Normally you will see continuous simulations, which are implementations of mathematical models that are continuous; they define the whole flow of the logic. A well-known example would be the Game of Life, or, for instance, in chemistry, how different chemicals react to each other, and whether you're going to get an explosion or not; this can also be played out in a continuous way in a simulation. A second type is discrete event simulation, which covers most man-made systems. For instance, a post office: an event occurs and something happens only when somebody, I as a customer, is in the post office. Manufacturing pipelines, logistics systems, and things like that can be modeled with discrete event simulation. But right now there are a lot of papers, and most of the production-ready simulation frameworks are mixed, where you can define a certain event-dispatching system that says, okay, these resources are dispatched, and then the system decides how to proceed further, or whether there are any other side effects that can be described in a continuous way. For instance, forestry is a good example, because you have the natural way a forest grows or recovers from a fire, but you can also dispatch discrete events like man-planted forest or man-made fire. So it's a mixture of both, and obviously it requires more development time. There are numerous tools in all the languages you can imagine: frameworks, libraries, game engines; a simulation can technically be seen as a game engine without manual, human input.
And of course, people build it from scratch in different programming languages. These are just some examples you can use. For instance, you can use Unity to simulate learning by providing a model that replaces the human input: a model that learns how to drive a car in Mario Kart, or something like that, can be considered a simulation. But in my case, there are different things you need to consider when choosing whether you want to go with a framework or a library, how you want to use the simulation, and whether you need a simulation in your business at all. The four main things are, first, cost: if you go with a closed-source product, it might cost you a lot, or it might result in a lot of development hours, which also translates directly into cost. Then speed of the simulation: you probably don't want to simulate something at real-time or near-real-time speed, right? Why bother then? The third point is scalability: for instance, you want to run different scenarios, and you cannot just wait for one to finish before starting another. So you want to scale horizontally, put it into a service, put it into the cloud, provide some resources. You need to know how the framework you've chosen or written behaves in this regard as well. The fourth, and one of the most important points, is extensibility. If you are using something that already exists, a backbone library or even a ready-made framework, you might run into a situation where you have a very important business use case which is not covered by the framework. Then you either have to pay the company to extend the framework, or you have to do it yourself, and again, that translates almost directly into cost. So whenever you decide to do something, you need to keep an eye on those four.
And for this presentation, I decided to go with a situation very similar to the one I have at work: we're going to simulate dispatching taxis to requests, or customers. We're going to have a world that spawns a request with a certain probability P. We can have at most N active requests, which you can think of as: only 1,000 passengers are using your app at once. And one passenger means one request at a time; it's not possible for a passenger to request a taxi to point A and point B simultaneously, because you have only one physical body. What would you do, split? I don't know. Then a request can be assigned to a free car only. We don't have anything like shuttle buses; it's just one request, one car. Requests can be canceled after a certain amount of time if they don't get assigned, which, again, is the real situation in the taxi business: passengers won't wait longer than, say, 15 minutes; they will just take a bus or a different taxi. Cars can be either free or occupied; we don't have any other states. And we will simulate one day, which means 24 × 60 × 60 ticks, so one second is our atomic unit of time driving the whole simulation. Criteria: the talk is called Python versus Rust, so we have to compare them objectively, or at least semi-objectively, and these are the criteria I came up with. The objective criteria are: the amount of code you need to at least prototype your flow; testing simplicity, meaning how many packages there are for testing, how simple it is to write a test, and how much time you would spend on it; documentation generation and documentation availability, because you're probably not going to write everything yourself and might need additional libraries and crates; performance, obviously; memory usage, which matters because of the cost and scalability points; and ecosystem, which will play a big role later.
Once you have your prototype in place, if you have an already existing business system with, for instance, Hadoop and company, you might run into problems where your simulation has no official adapters or connectors to Hive or Hadoop itself, or there are just no crates, and you have to either write them yourself or rewrite everything in a language that has the libraries. And language versions: the previous speaker mentioned this already, but they sometimes play a crucial role and sometimes don't, so you also have to evaluate this risk. The two subjective points are code simplicity (obviously, if I have more experience in Rust, I will say it's easy and claim I don't understand Python at all, spending one hour in Rust and one day in Python, and vice versa) and development speed. They are connected, but Python is notoriously known for letting people prototype fast. With other languages, especially statically typed languages, you have to think first about what structures you want to use, how you want to present your program, and the whole flow of the program, not only the logic: where objects are going, memory, collection, all that. So you have to invest some time in advance. Saying all that, let's go to the code. I will show you a couple of implementations; we will start with Python first. The implementations are identical, with slight changes for the unique things that only Rust or only Python has. Otherwise you have the same struct or class, which is Request. A Request has a uid, its unique ID; a driver ID that can be assigned, which is technically an option, so either there is no driver or there is one; a remaining lifetime and a fulfillment time, the parameters that will make our request either be canceled or be fulfilled; and is_alive, a utility function that tells us it's still kicking, still working.
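The Request she describes can be sketched as a small Python dataclass. This is a reconstruction from the talk, not the speaker's actual code; the field names and default tick counts are illustrative:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    uid: int
    driver_id: Optional[int] = None      # None until a taxi is assigned (the "option")
    remaining_lifetime: int = 15 * 60    # ticks left before the passenger gives up
    fulfillment_time: int = 10 * 60      # ticks of ride time once assigned

    def is_alive(self) -> bool:
        # Still waiting for a taxi, or still being driven.
        if self.driver_id is None:
            return self.remaining_lifetime > 0
        return self.fulfillment_time > 0
```

The Optional driver_id mirrors the "there is no driver or there is a driver" point, and corresponds to Rust's `Option<Uid>` mentioned later.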
Next is the Taxi, which has an ID so that we can see in the logs which request was assigned to which taxi, in case we have assignment logic more complicated than random-to-random. And then we have our World. The World is just a container that binds it all together, with a main loop that says: until the simulated day is over, do this. Each tick, a spawn happens if it can; then assignment, if there are free cars; then the update, where all pending requests start decaying and all in-progress requests move toward fulfillment; and the cleanup, which says, ah, this one is already done, free the taxi so it can be used in the next assignment cycle, and move the in-progress request into canceled or finished. You can see all of this in the attributes and functions. The world has a runtime, the number of seconds it will run for; an age, the current step; a request spawn chance, the random chance of spawning a request; a max active requests limit, how many active requests we can have at once, where active means in progress plus pending; and taxis, which, as I said before, can be free or occupied. Requests are split into four groups by state and execution: pending, in progress, finished, and canceled. Then we have maybe_spawn_request, which is obvious: if we can spawn a request, that is, we have not reached our max, we will do so with the random chance. Then we have distribute: if there is a taxi among the free taxis, we assign it to the first unassigned request; there is no special logic yet. update_request is exactly that: tick down either the remaining waiting time or the fulfillment time. The cleanup just moves requests into the next state if needed. And the main loop is: while we have time, do all the steps one after another. I'm not going to run this program for the whole day, because it's actually slow.
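The world loop she walks through, spawn, distribute, update, cleanup, might look roughly like the following in Python. This is a simplified sketch under the stated rules, not the speaker's code; names and constants are guesses, and requests are kept as plain dicts here for brevity:

```python
import random

class World:
    def __init__(self, runtime, spawn_chance, max_active, n_taxis):
        self.runtime = runtime          # total ticks (seconds) to simulate
        self.age = 0                    # current tick
        self.spawn_chance = spawn_chance
        self.max_active = max_active
        self.free_taxis = list(range(n_taxis))
        self.pending, self.in_progress = [], []
        self.finished, self.cancelled = [], []
        self.next_uid = 0

    def maybe_spawn_request(self):
        # Spawn with probability spawn_chance, unless the active cap is hit.
        active = len(self.pending) + len(self.in_progress)
        if active < self.max_active and random.random() < self.spawn_chance:
            self.pending.append({"uid": self.next_uid, "wait": 15 * 60,
                                 "ride": 10 * 60, "taxi": None})
            self.next_uid += 1

    def distribute(self):
        # Assign free taxis to the oldest pending requests, no special logic.
        while self.free_taxis and self.pending:
            req = self.pending.pop(0)
            req["taxi"] = self.free_taxis.pop()
            self.in_progress.append(req)

    def update(self):
        # Pending requests decay; in-progress requests move toward fulfillment.
        for req in self.pending:
            req["wait"] -= 1
        for req in self.in_progress:
            req["ride"] -= 1

    def cleanup(self):
        # Timed-out requests are cancelled; completed rides free their taxi.
        for req in [r for r in self.pending if r["wait"] <= 0]:
            self.pending.remove(req)
            self.cancelled.append(req)
        for req in [r for r in self.in_progress if r["ride"] <= 0]:
            self.in_progress.remove(req)
            self.free_taxis.append(req["taxi"])
            self.finished.append(req)

    def run_till_done(self):
        while self.age < self.runtime:
            self.maybe_spawn_request()
            self.distribute()
            self.update()
            self.cleanup()
            self.age += 1
```

A full day would be `World(runtime=24 * 60 * 60, spawn_chance=0.5, max_active=1000, n_taxis=200).run_till_done()`, matching the one-second tick she describes.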
So I will be going back to the presentation slides. Sorry, fine. All of our criteria translate into this: quite a small amount of code to prototype our flow, actually 94 lines without documentation. The performance for one day, without printing to the console, is 210 seconds. I measured it with hyperfine, which does several runs and averages them, so it's a believable number, not just one run with a couple of other programs in the background. Memory usage: just remember this number. For this execution, Python allocated about 35 megabytes on the heap, which is small. Tests: I don't have to tell you about Python tests. It's gorgeous. You can use unittest, or doctests if you don't want to bother, or additional packages like pytest, or something completely different like nose. The ecosystem is amazing: tell me one use case that is not applicable to Python. The Python version should not be a problem at all starting from 3.6. Sometimes you might run into rendering problems, but that's a slightly different issue and depends on the OS you're running on. I've spent about an hour writing this program, so I will say it's quite fast. The simplicity of the code you can judge for yourselves, but I will say it's also quite good: it's easy to understand what is happening, not only from the structure but from the Python code itself. Okay, so now the second contender, which is Rust. As you can see, it looks very similar. You have the use imports: this `use std::...` is just an import of a certain structure or function from a crate. Then there are these things called macros, which you probably saw from the previous speaker; they derive the implementation of certain functionality for your struct. That's a simplification. So we have the same Request.
You can think of the struct as a class that has an ID, a remaining waiting time, an assigned taxi, which is now an Option (so again, it can be None or, in this case, a uid), and a fulfillment time. Then we have our Request implementation; in Rust you create a constructor explicitly, and this is, let's say, a utility function where you can also provide default values if you want to, as I do here. And is_alive. The next one is Taxi, which looks almost the same as the Python one, with an implementation of the new function. And then we are at our World, which has the very same fields as the Python class: runtime, age, request spawn chance, max active requests; taxis is a vector (what would be a list in Python is a vector in Rust); then active requests and archived requests. I keep them separate rather than united in one hash map. You could do that as well, but this was the first idea I went with, so it's not a completely optimal Rust implementation, but something that works. Then you have the new function for the World. You have the RNG for randomness; you have to pass it through, in contrast to Python, which can just use it as a global. Then we have a print utility function and its implementation. Then maybe_spawn_request, which you've already seen, is very similar. Distributing unfulfilled requests is slightly different: you can now see this iter_mut().find, because in Rust you have the concept of mutable and immutable, as you probably all know, and you have to think in advance about whether you want to change certain structures or not, whether you want to reference them or not, and so on. You normally get used to this after a couple of days of writing Rust, but it is something very different from Python, just to note. Then update_requests is the same. And then we have cleanup_requests, which is slightly different from the Python version due to this concept of borrowing and reference checking.
So you cannot just take an element from one vector and append it to a different vector, because that would be a move, and you can have only one owner at a time. Ask me later about the technical details so we don't spend too much time on this; it's just the way it is. It's still somewhat understandable from the code you can see. Then you have your run_till_done, which is the same. This Display implementation is just for printing: Rust does not by default provide a print capability for every struct you have. And then we have our main. So what does Rust tell us? It results in 160 lines of code without documentation (documentation is written with three slashes). And the performance for one day, with no logging, is 154 milliseconds versus Python's 200-plus seconds. The memory usage is a magnitude less than Python's. It is actually heap allocation, not stack, so there is some shady business going on there, I must say, but in pure heap allocation Rust just beats Python flat out. The other criteria: the simplicity of the code is subjective, and you would understand it after reading the code two or three times even if you'd never written Rust before, but Python still wins in this regard. The amount of time I spent writing this one was one day, because my first implementation actually managed to be quadratic time, so it never finished; I ran out of patience before it did. So you really have to invest time in advance, thinking about what you are doing. You will also fight with the compiler a lot in the beginning, but it's your friend: if it compiles, it works. The next point was tests. Cargo is the package manager, and everything-manager, in Rust; you can just run cargo test, and write the tests either in the same file as your normal program or in a different file, annotated with the #[test] attribute. Then the ecosystem is not nearly as good as Python's.
It can be better in certain domains, like embedded programming or low-level programming in general, and right now it's catching up on web services as well, but if you have something like Presto and Hadoop, or Hive and Hadoop, in your business use case, just forget it: there is no crate in Rust for those SQL layers. Yeah, so let's have a more visual comparison. Amount of code: Rust is about two times more code. It can be even more, or about the same, if you know how to write your program, but for beginners Python is a clear advantage there. Simplicity: in both cases very good; both languages try to keep things simple and usable. Documentation: I would have said it's good in both cases, but Rust has one killer feature for me, which is that you can have offline documentation served on a local server, like a Python HTTP server, accessible offline if you have it cached with cargo. You build your project and you have the docs offline, internet or not, which, as far as I know, is not the case for Python. This is why I left Python unmarked for this one, and Rust is a clear yes. Memory efficiency and performance: you saw it. By the way, there was no parallelization in the program, so it could run even faster, and quite a bit cheaper for that matter. Ecosystem: Python is a clear advantage there; there is no denying that. Versions: in Python you should have no problems if you got rid of your Python 2, hopefully. In Rust it will depend on whether you have to use nightly or not. Stable has most of the features; nightly has really nice features that you might or might not need, depending on how deep you want to go in your implementation, since we are talking about a writing-from-scratch use case. And development simplicity and development speed: I am a Python developer, not a Rust developer, so this one is biased for me.
So would you rather write your simulation in Rust or in Python, if you need one? To be honest, I don't know, because you have to consider your pain points first: cost, scalability, extensibility, and speed. At some point, if you go with Python, you have to do optimization. It always happens. And optimization can take quite a lot of your resources in terms of development time, or even in getting libraries that are not open source. At the same time, if you go with Rust, you will be very slow in the beginning, maybe even very frustrated, but the end result is always worth it: you won't be able to reach this performance with pure Python code, without Cython or anything similar. So I'm sorry for not giving you an exact answer to this question. You have to decide for yourselves, depending on what simulation you're writing. How fast should it be? How scalable should it be? How simple should it be in the end for other users to use? If you want a simple, acceptably fast, right-now simulation, go with Python. If you are able to invest time, then consider Rust. Thank you. So I think we have time for questions, if there are any. Did you try to run your simulation with PyPy? Not yet, but in previous years we actually had a similar program, and I tried PyO3, Cython, and PyPy. PyPy out of the box gave me a 30% performance improvement, which is still not one or several orders of magnitude. And PyPy is not compatible with all packages, for instance scikit-learn, which, since I'm a machine learning engineer, is my biggest pain: we use scikit-learn and we cannot use PyPy for that reason. To be fair, Rust will probably always have problems with that as well. But yes, I tried PyPy in a similar situation; it didn't perform as well as Rust. Thanks for the talk, it was interesting. Tell me about the business of writing simulations. Is it a big thing, is it critical to the business, how many people work on that?
I would say simulation is the next big buzzword in the industry. I can give you a couple of examples from the mobility industry: Tesla, Uber, MyTaxi are all writing simulations for slightly different purposes. Tesla has a giant and very complex simulator for Autopilot, to test new features before rolling them out onto the streets, to avoid certain fatal flaws or at least catch them earlier. Uber has a simulator for marketing purposes. And at MyTaxi we are developing one first and foremost for development purposes, meaning we want to speed up our development cycle. For instance, we have a team that works on allocation, connecting a driver and a passenger. They may have a brilliant idea, but until they put it into production and run an A/B test for some time, they don't know whether it was worth it, and we can even lose customers if it was a bad decision. So we are building a simulation that will allow them to at least cut the bad ideas. Also, you can test things like: how would my system react if there were a strike? Suddenly we have tiny supply, one car instead of 200. We still want to make money, so we will probably have to serve our best customers first, and so on. You just cannot replay that in the real world, and replaying it from your historical data is hard, because it's just historical and not flexible. This is where simulation kicks in. Simulations are also heavily used for data generation purposes: if you want to test your system not only for robustness (okay, this service works, the request is coming in) but by producing edge cases and testing it end-to-end, content-wise, then simulation can also help you. Thanks. Hi, thank you for your presentation. You've mentioned data generation, and I know that there are papers on using machine learning models to speed up simulations.
You basically use the input and output of the simulation as a training dataset for a model. Have you heard of or considered that approach? Yeah, the thing we are building, and it's quite well known, is agent-based simulation. We have trained models of drivers or passengers, or baristas, say, if we're talking about a coffee shop, or customers, learned from historical data. For instance, I might have a customer agent that has learned that he is, sorry for the word, an asshole, and likes leaving no tips and spilling coffee everywhere. You would have something like this in your simulation. So I don't know exactly how you would use models to speed up the simulation itself, but using models inside the simulation is quite a common technique. Thank you. Thank you for your talk. If you have any more questions, you can ask them below. Thanks for joining us. Our next speaker is... I'll wait, okay? Our next speaker is Theresa Ingram. She's currently working to help the internet be more accessible for us all, and she's going to give a presentation about opting out of online sexism and open-source activism. So give an applause to Theresa. Good afternoon. My name is Theresa Ingram, and I am the founder of the non-profit Opt Out, which aims to build tools to help women and female-identifying people re-engage with healthy online debate. Just a warning before I begin this talk: there will be some extreme language, so FYI. Before I tell you about Opt Out, I would like to tell you why we need to exist. Diane Abbott, a senior Labour Party MP in Britain: during the 2017 general election, Amnesty International did a study of the tweets that all MPs received. Of those classified as hateful, Diane Abbott received 45.1% of them. Laurie Penny, a journalist, writer, and activist, often writing about feminist issues: by the age of 26 she'd received her first real bomb threat.
And Johanne Schmidt-Nielsen, former leader of the Danish left-wing party Enhedslisten: alongside the usual barrage of threats against her body and harassment, it was once reported by a troll that she was in fact dead. Now you might be sitting there thinking: okay, great, Theresa, but men and male-identifying people also suffer abuse online, and these women you're describing are public personas, so maybe they should expect something like this. But really, the stats speak for themselves, and it's not just public-facing women who suffer this. Women are twice as likely to be sexually harassed online as men, mainly affecting our young women. 90% of all victims of revenge porn are female, and women are twice as likely to suffer adverse consequences as a result of this online abuse. Women and female-identifying people disproportionately suffer online abuse, particularly those who challenge the status quo. Online sexism is real, and it's silencing the voices that society so desperately needs to hear. So at Opt Out, we're aiming to put a stop to the silencing nature of online sexism. We're building tools to help all female-identifying people who've got something to say get back to the online spaces they've been chased out from. We're doing this by building not only tools but also a movement: by holding workshops that give female-identifying people a chance to come together and share their experiences. We're not only building the vital social infrastructure these people need, but also allowing them to come together and, in doing so, act in a form of protest. By helping to form this community, we are able to spot needed technical infrastructure and make sure that the tools we build are fit for purpose, ensuring that our tech is as community-driven as possible. In addition to our workshops, we aim to build a website that supports women in getting their voices back.
It's inspired by HarassMap, an Egyptian-based NGO you can see here, where an individual can submit reports of physical harassment, which then get displayed on an online map. We will build our website to allow somebody to anonymously submit their experiences. This data will be stored, studied, and will feed the models our tools depend on. Our website will also transparently show what we're doing with the data, and the impact our tools are having for women across the world, hoping to fuel the movement. Our long-term goal is a virtual HarassMap showing which communities on your chosen social media platform are sexist, sexually aggressive, or just downright nasty, enabling women and female-identifying people to navigate the murky waters of online society as best they can. So, the Opt Out ethos: the General Data Protection Regulation, or GDPR, has changed our lives on social media platforms. We have the right to be forgotten, to dictate what is recorded about us, and to opt out if we wish. But the abuse that women and female-identifying people suffer online is not avoidable. We see Opt Out as an extension of the GDPR that also protects the human rights of these people online, allowing them to join online debate once more. So what tools are we talking about? Alongside the website and the workshops, our main idea is a browser extension that filters online sexism out of an individual's social media feed, and it does so with a sentiment classifier. As you can see here (apologies, the video is not brilliant, you can't see the button, but you get the picture). Currently our tool works on Twitter, and behind it we've got a very, very simple neural net trained on 10,000 ordinary troll tweets, nothing sexism-specific. Our plan going forward is to retrain this model on the sexism-labeled dataset from Zeerak Waseem and his coworkers.
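The extension's core behavior, hiding posts that a classifier flags, can be illustrated with a small stand-alone Python sketch. The `classify` callable here is a hypothetical stand-in for the trained Keras model, which is not shown in the talk; `toy_classify` is purely for demonstration:

```python
def filter_feed(tweets, classify, threshold=0.5,
                placeholder="[hidden by Opt Out - click to reveal]"):
    """Replace tweets that a classifier flags as abusive.

    `classify` is assumed to return a probability of abuse in [0, 1];
    in the real tool this would wrap a small neural net, here it is
    any callable with that contract.
    """
    return [placeholder if classify(t) >= threshold else t for t in tweets]

# Hypothetical stand-in for the trained model, for illustration only.
def toy_classify(text):
    return 1.0 if "lazy" in text.lower() else 0.0
```

Keeping the placeholder clickable matches the consent-focused design described later: the user, not the tool, makes the final call on what stays hidden.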
And once this is done, our website is up and running, and the word has been spread, hopefully we will start generating a larger sexism dataset. But we're going to need to annotate this dataset, and we are proposing to do so with a two-round annotation scheme. Taking inspiration from Zeerak Waseem and his coworkers, we're going to first label based on the categories generalized, directed, explicit, and implicit. Here are some examples of what that actually looks like in terms of language. Generalized: all students are lazy. Directed: you are a lazy student, which we may already have heard in our lives. Explicit: the candidate did not write enough papers. And implicit: the candidate was not an innovative researcher. But language is nuanced and complicated, and a comment can be a combination of all of these and also sexist. For example, the first comment there is both generalized, using the word bridezilla, and directed, because it's aimed at somebody; and it's a similar story for the second. It's important, even though this is going to be a challenge, that we identify what is explicitly sexist first, because if we are to encourage respectful debate and avoid creating any unintended echo chambers or biases with our tool, we need to get rid of the really obvious stuff first and understand the implicit, implied sexism later. Once we've done this initial round of annotation, we're going to further classify the comments with five labels taken from Maria Anzovino and colleagues' misogyny labels. What we have here underneath the labels are example tweets. Discredit: slurring of women with no other larger intention. Stereotype and objectification: making women subordinate, or descriptions of a woman's physical appearance and/or comparisons to narrow standards. And dominance: preserving male control, protecting male interests, and excluding women from the conversation.
Sexual harassment and threats of violence: physically asserting power over women, or intimidating and silencing women through threats. And derailing: justifying abuse, rejecting male responsibility, and attempting to disrupt the conversation in order to refocus it. With this two-level annotation scheme, we hope we will be able to identify the different faces of online sexism. In addition to this data annotation and understanding, we're going to deploy what I call the three C's approach: content, which is what I've already discussed, context, and conversation. Content will use the sentiment analyzer with the labelings I just talked about. Context: who is the abuser in relation to the target? Are they part of a bigger mob attack? This is important to know. And conversation: has the sentiment of the conversation between the two taken a sustained nosedive? This could be an indication of intimate partner violence, which requires a very different solution from what we're offering. With these labels and a better understanding of the behaviors and relationships of online sexism, we'll be better informed to answer the age-old question of "you know it when you see it", which is characteristic of online sexism. And once this is all done, we can start to build and test different models and really start to make a difference for women and female-identifying people all over the globe. But the coolest thing, which I really, really like about our tool, is that we are consent-focused, meaning we aim to block what an individual finds distressing, not what we think is distressing. We're doing this by deploying a technique I call big sister instead of big brother: there will be a local instance of the model in somebody's browser that they can give feedback to with the simple click of a button. The data stays local, but people will be encouraged to share their labelings with Opt Out via the website.
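The two-round annotation scheme she describes could be represented with a small Python helper. The label names follow the talk; the function itself is an illustrative sketch, not Opt Out's actual annotation code:

```python
# Round one: form/target categories (multi-label, Waseem-style).
ROUND_ONE = {"generalized", "directed", "explicit", "implicit"}

# Round two: misogyny labels (Anzovino-style), applied in a second pass.
ROUND_TWO = {"discredit", "stereotype_objectification", "dominance",
             "sexual_harassment_threats", "derailing"}

def annotate(text, round_one, round_two=frozenset()):
    """Attach validated labels to a comment; rejects unknown labels."""
    r1, r2 = set(round_one), set(round_two)
    if not r1 <= ROUND_ONE:
        raise ValueError(f"unknown round-one labels: {r1 - ROUND_ONE}")
    if not r2 <= ROUND_TWO:
        raise ValueError(f"unknown round-two labels: {r2 - ROUND_TWO}")
    return {"text": text, "round_one": r1, "round_two": r2}
```

Making both rounds multi-label reflects the point that a single comment, like the bridezilla example, can be generalized and directed at the same time.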
By focusing on individual consent rather than a one-model-fits-all approach, we ensure that the diverse range of online interactions is not stifled, and that productive and respectful interactions can flourish. Enabling female-identifying people to join healthy debate is only possible if we also ensure that these people are safe online. We plan to do this by making effective use of the moderators that most social media platforms have. Whenever our sentiment analyzer detects abuse, the comment will go automatically to the moderators with a traffic-light labeling scheme, allowing them to prioritize more effectively which comments need attention immediately. This ensures that user safety is never compromised. Once this is all said and done, years down the line, when we've got a great little NGO behind us, we're going to develop the browser extension further. We're going to have functionality that lets people use a blacklist of accounts, so that those accounts are automatically blocked from an individual's social media. These lists will be maintained and shared by what we call Digilantes: groups of people seeded from the workshops we're going to be holding, which also act as a support network for anybody who has suffered online sexism. Then we have the automatic replacement of comments, as I just described. And finally, a sentiment dashboard that pops up before the page loads, with a traffic-light label for each comment, allowing the user to preemptively decide what they do and don't want to see. So we've got a lot to do, as you can see. We're planning to get a working product by the end of August, and then it'll be so popular that we'll get a huge dataset straight away, and we can start playing with that by the end of September. And then, what's really important is that we move across to different languages.
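The traffic-light triage for moderators might be sketched like this. The score thresholds are invented for illustration, since the talk doesn't give any:

```python
def traffic_light(abuse_score):
    """Map a classifier score in [0, 1] to a moderator priority.

    Thresholds are illustrative assumptions, not from the talk.
    """
    if abuse_score >= 0.8:
        return "red"      # review immediately
    if abuse_score >= 0.4:
        return "yellow"   # review soon
    return "green"        # no action needed

def triage(comments, classify):
    """Label flagged comments and sort so moderators see the worst first."""
    scored = [(classify(c), c) for c in comments]
    return [(traffic_light(s), c)
            for s, c in sorted(scored, key=lambda sc: sc[0], reverse=True)]
```

The point of the scheme is ordering, not blocking: nothing is hidden from the moderators, they just get the highest-scoring comments at the top of the queue.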
We're going to design the web app so that all you'll need to do is change the dataset, and maybe some hyperparameters, and you can change the language from English to Spanish to Romanian to whatever you'd like. This will enable us to build the community we want behind it, because online sexism is not restricted to English. With a topic like this, I think it's really important to tell you all who's behind it. We're a bunch of volunteers at the moment; apart from myself, well, I'm working out of savings, but most people are working in their free time. We are a group of people ranging from social scientists to data nerds, but there is one characteristic we all share: we won't let hate win. Our vision: we want to champion women back into the online worlds they've been chased out from, support them and their voices while still protecting them, and hold perpetrators accountable. We need to exist. If you share our vision, if you believe in the cause, I ask you to join us, even if it's just by talking to somebody about the issue, about what I've discussed today, mentoring, code contributions: go to the GitHub, star the repos, all of that. I'm a relative novice; I've got about a year and a half of software engineering experience, but the community has rallied behind me an incredible amount, and this ship is sailing. So if you'd like to get on board, just let me know. Online sexism has to stop. Let's opt out. Are there any questions? You can use the mics in the... I hope it's not sexist to say it like this first. Yeah, thank you for not being a sexist. But then this is discriminating against my height. Yeah. Yeah, I think it's a really good idea.
I'm really impressed by what you and your team are doing, and I noticed one thing that I think is a really, really good idea, which is that it's really customized to a certain user, what he or she finds offensive, so it's not one model fits all. But that also raised a question in my mind: there may be technical challenges to making it a customized model, like it may require lots of resources. So has your team figured out approaches to overcome this challenge? I would really like to know; if not, then maybe we could find a solution together. No, so please. Let's talk about that. Let's talk about that, yeah, thank you. Well, first of all, congratulations on your wonderful talk, wonderful explanation. Congratulations for the project itself. Thank you. Sometimes it's too easy to pretend that things don't exist or just happen to others. However, I'd like to ask you more about the technical infrastructure that you developed, if you don't mind clarifying it a bit. Oh, you would like me to discuss it? Yes, of course. So the browser extension is currently using Keras and TensorFlow, and that's obviously for the NLP stuff, but it's a very, very simple model. It's not even using any RNN or LSTM; it's very, very simple. And the back end is just a nice, simple Flask app. We all make mistakes. Should have been Django, but there we go. It's fine. But yeah, it's a very lightweight thing at the moment. What we're really focusing on is just trying to understand the science behind it first. So we're putting a lot of effort into research and getting different data sets and playing around with them. So the actual web infrastructure is a bit thin on the ground, but if there are any front-end developers that would like to join, please do, because I have no front-end experience, so that would be really great, and we don't have a front end at the moment. So, any more questions? Feedback? Cool.
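The actual Keras/TensorFlow model behind the extension isn't shown in the talk. As a dependency-free illustration of how simple a text scorer can be, here is a hypothetical bag-of-words stand-in in pure Python; the word list and weights are invented for illustration only, and a real model would learn them from the labeled data set.

```python
# Hypothetical keyword weights; in the real project these would be
# learned by the Keras model, not hand-written.
ABUSIVE_WEIGHTS = {"stupid": 0.4, "shut": 0.3, "ugly": 0.4}


def toxicity_score(comment):
    """Crude bag-of-words score clipped to [0, 1]."""
    words = comment.lower().split()
    score = sum(ABUSIVE_WEIGHTS.get(w, 0.0) for w in words)
    return min(score, 1.0)
```

This is only a sketch of the interface such a scorer would expose (text in, score out); per-user customization, as described in the question above, would mean adapting the weights to each user's own labels.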
You mentioned switching data sets to switch languages. Why do you need that? Why can't you put everything in one data set? Are there things that are expressed in one language and not in the other? Or is it too much data? I just presumed we'd need to do that. We haven't. Okay. Yeah, I just presumed we'd need to do that, because I think sexism is so different in different languages. That's an advantage, because it's different: you can put it all in the same pot and it won't disturb each other. That is a very, very good point. I don't know. I'm not a data scientist. Yeah, yeah. Thanks. Thank you for your talk. Really nice. I actually have a question about business models, a non-technical question, because I'm curious. I guess all of those social media platforms now allow you to flag offensive content, right? You say you want to develop a browser extension, but do you know how effective this flagging is? And it takes some time, I guess? Sorry, you're saying that there's already something similar that the social media platforms do, right? Another approach, right? So on Facebook, and I guess you're targeting Twitter, it allows you to flag offensive content. Yeah, but the individual still has to see it. And so by just filtering it out, you just don't see it at all. Yeah, exactly. And also with Twitter and Facebook, I have heard of incidents when somebody has reported something and they've turned around and said, no, you're wrong. Or it's taken very, very long for them to do something and take down the comments. So what this is trying to do is: if you are, for example, a politician and you socialize online, but you say something and then your feed is full of misogynistic or sexist abuse, it dilutes what you're really trying to do, which is just read the news or talk with friends. And so with this way of filtering it, the good stuff remains. All right, thank you. Thank you for the talk.
Are there plans to collaborate with social media platforms, so that, for example, if a person has a lot of tweets or comments that get flagged by a wide range of women, their account gets blocked or something like that? That would be great. And you know, Twitter, I mean, this is a problem that a lot of the social media platforms are being pressured to solve. So it would be great if at some point we could be incorporated into the platforms, but we've not received an incredible amount of support, let's put it like that. So at some point it would be great, but yeah, we'll see. Okay, thank you. So I have a question. What kind of formats do you plan to support in the comments? On some social media platforms like Twitter, I suppose that a lot of the problems that people experience also come in images, or maybe audio or video. Do you have some plans to tackle those, maybe in the pipeline at some point? At some point, yeah, but for now just tweets, just text. Right. Going once, going twice. I'll ask then. Do you have many people using the extension right now? Is it live, or is it just like a demo? It's not live yet. Everything is in the testing, experimenting, proof-of-concept stage. Yeah, exactly. We've got a proof of concept, and we're hoping to take it to a funding body called the Prototype Fund in Berlin in August. Mozilla is also based in Berlin, so we're only going to bring this out on Firefox to begin with, because Chrome does have something similar, but if you hit their API, the Perspective API, with "you are a feminist", it comes back as toxic. So their one-size-fits-all approach is just not fitting anybody. So we're going to sit with Mozilla, maybe get some funding there. Living the dream, but yeah. So no, it's not live yet, but it will be soon. I've just struggled with Amazon Web Services for a week, so.
So you just mentioned you hope you might get funding from Mozilla. And you also said you're right now mostly operating it out of pocket, basically. So do you have any other funding plans? Yeah, so the Prototype Fund, which fits us perfectly. Their categories are things like, you know, internet health, and we already have some contacts within the Prototype Fund. So that's really the one that we're focusing on, but yeah. Cool, I hope that works out. Thank you.

Hello, everyone. I would like to welcome, all the way from London, not too far, Juk Ho. She's going to ask the question of whether we have a diversity problem in the Python community, right? Yep. Okay, give her a round of applause. Thank you. Thank you. Thank you very much. Yeah, so thank you for coming to this room. I think I can assume that everybody in this room, and also if you are watching the live stream, you care about diversity, especially in the Python community. That's why you're here with me. So thank you so much for that. You can keep up the discussion with me on Twitter; it's welcome. I will tell you a little bit about me. So thanks for introducing me. I currently live in London as a data scientist. In London, I'm very active in doing a lot of stuff. Because I love the community, I want to contribute more. So I co-organize some meetups, including the AI Club for Gender Minorities. I'm so glad that I have two co-organizers here with me at EuroPython. I'm so happy, I love them. Also, I organize some sprints, which are mainly focused on contributing to open source. So far, we have sprints almost once per month. It's a fun time, I love doing it. Also, I myself contribute to a lot of open source libraries. They're small contributions, but I think everybody can do it. If I can do it, you can do it. Also, I created this pics and mix. Thank you so much. I have a designer, and I hope I've pronounced his or her name correctly: it's Akvith. This designer created this logo for me.
Thank you so much. That's about open source, right? Everybody contributes. So, okay, enough about open source. A couple of months ago, I found this blog post which caught my attention: why women are flourishing in the R community but lacking in Python. My background is as a data scientist, and when I started doing data science I was using R, but for different reasons I switched to Python. From there, I learned that the Python community is lovely. And that's what got me thinking about why we have these problems: the Python community is lovely, so why are women not enjoying Python as much as R? I needed to have a look. So, we see that we actually have more than six times more Python users than R users, according to the Stack Overflow survey last year, in 2018. So there are a lot of people using Python, right? Python is lovely, the community is lovely. But we only have 1.25 times more members in PyLadies than in R-Ladies. So, what happened? Is it really that we don't have that many female Python users, or are female Python users just not active in the community? Why don't the numbers match? And then also something worrying is the contributors. Again, I contribute to open source, and I would love to do more. But why does R have four times more female contributors than Python? I was like, oh, we are lacking in presence in contributions to open source in the Python community. It's like, oh no, that's not good. Also, for networks: as an organizer, I know R-Ladies has 120 chapters in 40 countries; it's everywhere. But in PyLadies, we have only 45 active chapters in 12 countries. So, I'm very lucky. I'm in one of the biggest cities in Europe, in London, and we have PyLadies.
And I know that there's also PyLadies Remote, which is for everybody who's in a smaller city that doesn't have a PyLadies chapter; they can join remotely. But still, it's lacking. It would be great if we had more and more PyLadies chapters for women to network, to feel that they are not minorities, that they are not alone, that they are also important in the Python community. So it's like, hmm, there are problems there. Also, it could be that the problem is not one that anybody wants to make, right? Because there may be a fundamental difference in the users. If you think about the users: for R, I keep meeting different people, maybe statisticians, who come from an academic background and say, oh, R is amazing, it's so easy to use. And I was like, hmm, but Python is also amazing and easy to use; have you heard about that? There are actually statistics showing that most Python users come from a more engineering, computer science background. Most students in computer science, when they were at university, maybe learned using C++. My first computer course that I attended at university was actually using C++. And Python is now getting more and more popular; this morning Martin mentioned that in Basel there's Python in education, they're now teaching Python in schools. So yeah, maybe computer science students have a lot of exposure to Python. But students whose discipline is more like math or social science, or even economics, or even journalism, because data journalism is a thing now, right? They may be using R instead of Python. So yeah, there's really a different demographic between the two languages' users.
And we can also see that across different disciplines: computer science graduates, especially in the Western world, are mainly male-dominated. For statistics majors, it's more or less half and half, around 44% women, but for computer science it's only 19%. It's like, oh, maybe we should think about getting more girls to study computer science; maybe that's the thing. So yeah, there are a million questions that we have to attack, have to answer, to help the diversity. Also, I have seen this Jupyter Notebook, and I can show you the notebook here. Okay, the internet is slow at the moment. Okay, can you see that? Yeah, right. So this notebook is very, very good. The first time I came across it was when I joined the NumFOCUS DISC mailing list, and they were circulating this Jupyter Notebook, because what it talks about is inequality in the leadership of NumFOCUS projects. Yeah, PyData leadership: inequality for under-represented groups. And they do it in a very scientific way. I won't show you the details; I will upload my slides so you can check it out yourself. And they made this notebook, and I'm thinking that maybe I could change this data.json and do different research on it as well. So this is the result, and maybe I should switch back to the slides so it's easier to see. So they use that scientific method to show how diverse each project is. These are NumFOCUS projects. I think this statistic was done last year, so there are some new projects that are not included, but anyway. So you can see the inequalities. Actually, the way to read this is: if a project has a lower score, it is more diverse.
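The notebook's exact scoring method isn't shown in the talk. A plausible stand-in, consistent with what's described next (a "uniqueness" of 1 when a project's main contributors are all one gender, and a lower score when more diverse), is the share of the largest gender group among contributors; the function below is a hypothetical illustration, not the notebook's actual code.

```python
from collections import Counter


def uniqueness(contributor_genders):
    """Share of the largest gender group: 1.0 means a single gender
    among contributors; lower values mean a more diverse project."""
    counts = Counter(contributor_genders)
    return max(counts.values()) / len(contributor_genders)
```

For example, a project whose sampled contributors are all one gender scores 1.0, while a perfectly balanced two-gender project scores 0.5.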
So you can see that NumPy and PyMC3 are not very diverse. I think these are the ones that got a score of one, so the uniqueness is one: they have only one gender as the main contributor. Which gender, I think, is quite obvious and easy to guess. And from the bottom ones, the more equal ones, like some of the Open Journals projects, you can see that there's actually a domain difference as well: maybe things that are more numeric, maybe physics-related, are less diverse. So yeah, that's very interesting. Also, that Jupyter notebook mentions one idea which I think is very, very good. It's called active and passive diversity problems. An active diversity problem means that there are people, or a group of members, who are toxic: they don't want the community to be diverse. And passive means that nobody actually wants the problem, but if nobody does anything about it, the fundamental bias, the inequality, remains. So it is my opinion that the Python community has a passive problem. I haven't encountered any active problem in the Python community, very luckily. Yeah, and I think the passive kind is maybe even worse. In the workplace we perhaps have more chance of facing an active diversity problem, but in the community, so far, my experience is of a passive problem. And that means we have to do something to improve it. Okay, so the information I've given you so far is mainly based on the blog post that I saw and on the Jupyter notebook and what they did, which is amazing. So I tried to do something myself. This is research that I did myself last year. Because I had been to ten conferences last year, very luckily, and I enjoyed every single one of them, I was wondering: is there an imbalance in the gender ratio at Python conferences?
Because I think it's quite obvious: at a coffee break, if you go to the toilets — here the venue is lovely, it's very big, so maybe it's less obvious — but if you go to a smaller venue, you can see that usually outside the male toilet there will be a long queue, while at the female toilet I can just go in, it's empty. Which is good in that sense, but maybe it means that it's not good for diversity. So yeah, I tried to see: there's obviously an imbalance, but how imbalanced is it? It's difficult to count the participants, because I talk to people, but I can't talk to every single participant. So, okay, what was the easiest way to count? Maybe I should count the speakers, because all the talks are recorded. It's very easy, but I have to really say that this is not very accurate. The reason being: not all talks are recorded, because some speakers may prefer not to be recorded. Also, for the gender of the speakers, I didn't really research every single speaker, whether they are male or female or non-binary; I just went by the pronoun that the chair used to describe them. So yeah, if I made a mistake, that was difficult to avoid, but I hope this gives a general idea of how imbalanced it is. So this is the statistic; I hope it's clear enough. You can see that the blue bars are male speakers and the red bars are non-male, so including female and non-binary speakers. Except for PyCon UK, 75% of speakers are male. PyCon UK is doing very well, but elsewhere there aren't enough non-male speakers. And why is that? Because when the call for proposals opens, there are a lot of male submissions. This year I'm in the programme workgroup of EuroPython, and I can see that we really had this imbalance.
And then I think conference organizers take it into account; they have the diversity mission in their minds, but it is still difficult to achieve, because the submissions already have a very, very imbalanced demographic. So how can we improve that? How can we encourage more diverse speakers, maybe not just in gender, maybe speakers with different backgrounds, to speak? I will talk about that later. But before I get to specific suggestions for the Python community, I would like to talk about the diversity problem in theater, because as a person living in London, I love going to the theater. I'm very lucky that London is a theater city. So this is also a hot topic in London: theater-goers and critics have really noticed that there's a diversity problem in theaters. So maybe by learning from them, we can get some ideas on how we can improve. I also read a lot of news and blog posts about theater, and this is one of the posts that I came across. It talks about the best way to address theater's lack of diversity. So there's a problem. What's the problem, specifically? First, who is on the stage? It's quite similar to my problem with conference speakers: the people on the stage attract the people who will participate. For example, in London theater there's a problem because some productions have a whole cast of 22 white actors on stage. So for them, the diversity question they're addressing is maybe the ethnicity of the actors. Sometimes a play could be a modern play, where the race of the actors is not the most important thing, but it's still very white-dominated. And second, who is in the audience? I love London being such a diverse city: 44% of the people in London are classified as BAME, which means, we can say, they are ethnic minorities, they are Asian or black.
But still, if you are a London theater-goer like me, you can see that if you go to the Royal Court Theatre in London, most of the audience — in my experience, I went to see a play, an Asian play where almost all the cast were Asian girls, but because the Royal Court Theatre is located in a very prestigious part of London, the audience was still very white, middle-class dominated. So it's like, hmm. And there's also a problem with the influencers, the leaders of theater: who is making the decisions, who are the artistic directors of the theaters, who decides what productions to put on stage. And the critics: their identity may change the way they judge whether a play is a good play or not. So, hmm, among these influencers there are also not very many from ethnic minority groups. So, ugh. That blog post also talks about the chicken-and-egg situation. The writer mentioned that at one time there was a play called Fela on in London. Fela is actually about a Nigerian singer, who is very popular in Nigeria, I guess. At that time the writer was travelling in cabs, and a lot of the cab drivers in London are from Nigeria, and they mentioned to him: oh, I love this play, I went to see it again and again and again. You can imagine: cab drivers are not earning a lot of money, but they were ready to go to the theater because they are from Nigeria, and their identity gives them an attraction to this show about a Nigerian singer. So they loved the show and went to see it again and again. So what's on stage actually affects who is watching it, who is participating.
But sometimes the problem is: if your audience is mainly white, as I said before, then maybe you want to have white actors on stage, because you have to attract people to watch it. But then it becomes a loop, an unhealthy loop, where theater stays with only one majority group and ignores the minority groups, which is not very good. There's also another article, which is very, very interesting, about how to improve this situation: how to make theater more accessible for people from different backgrounds. For example, I mentioned the influence of the people who make the decisions, the gatekeepers. Nowadays in London you can see that at some very famous theaters, more and more of the artistic directors are people from minority groups. I've met one lady, I think she is not white, and she is the artistic director of one of these famous theaters; I really can't remember which one, but yeah, she's an artistic director. And I'm sure that there are a lot of talented artists in the minority groups; it's just taking quite a long time. If we discover them, they may end up driving these big institutions in different directions. So it's a good thing that this is changing now in London. And in the theater itself, there's accessibility. For example, in London, and I enjoy this as well, there are always some cheaper tickets, like 10 pounds, to watch a play in the West End, which is very, very good. If you book early enough, you'll be able to enjoy some cheap tickets. So theater has become accessible for all, even for people who don't have a big budget.
Also, in the theater itself, the bar and the restaurant try not to always have expensive stuff. It's maybe still more expensive than buying from a corner shop, but they try to make it not super fancy, not only for people in a dress or a suit to enjoy, which is good. Also, some accessibility measures have been implemented. For example, they have sign-language performances, so people with hearing impairments can enjoy the performance even though they can't hear. I've been to a production by one of the famous drama schools in London, and they had two casts. One cast was the students, very talented students, and the other cast, performing with them side by side, were actors who use sign language. They may be deaf, but they use sign language to perform, which I think is amazing; it's an amazing production because it's accessible for everybody. And sometimes there are audio-described performances, including one that I went to: some of the stage directions are described, said out loud, so that people who maybe have a sight impairment can still enjoy the performance. And also, for example, accessible toilets, for people who maybe have mobility problems. And there's one thing that a lot of places overlook, which I think is more popular in cinemas: relaxed performances, where the lights won't be totally dimmed and people are welcome to come in and out, because maybe they are parents with kids. They can bring the baby to the movie, and if the baby has needs, they can go in and out to care for it, and everybody at that screening understands that it is for parents with babies, which I think is a good thing. I mean, parents shouldn't be pinned down and unable to enjoy this. So, okay, I will spend the rest of the talk on what we could do in the Python community. Can we learn something from how the theater is improving?
So, in the blog post that I mentioned at the beginning, it's also mentioned that PyCon is now making a great improvement, because by encouraging women to come to the conference, they have gone from a very low percentage of talks by women towards 40%, which is really, really good. And this is PyCon in the US, and I think in Europe we are doing that as well, which is very, very good. We really need to carry on the momentum. Also NumFOCUS: they are an organization that is helping to fund the PyData community, and you can see that they also have measures. They have this DISC committee, and they have the mailing list and discussions; I joined the mailing list, which is why I receive information. Also, they give diversity scholarships, so people from minority groups who are struggling to get a ticket can maybe apply for a scholarship to attend the conferences. Also, Django Girls is amazing. We had the Django Girls day on Monday, which has been a recurring thing for EuroPython now. Every year they have this Django beginners' day to encourage women especially to start using Python and do something with Django, to build blogs using Django. And it's translated into 12 languages; it's blooming. And also, my friend mentioned it's very empowering for women, because Django Girls — of course they also have male mentors — prefer women who have been participants before to come back and become mentors. Because it shouldn't always be men teaching women; it shouldn't be about gender. The experienced should teach the people who don't have experience.
So in my opinion, for conferences — because I'm now involved in conferences a lot, I have a lot of ideas — for example, childcare facilities: I think PyCon UK is doing it, PyData London is doing it. So maybe EuroPython should consider it as well, because it's like that theater thing, right? Parents shouldn't have this mindset of, oh, I have to care for my baby, so I can't go to the conference. Also, diversity in topics. Again, how can we make the call for proposals more diverse? Maybe we encourage different topics, so people from, say, a journalism background or a statistics background could also talk about their work. Then we are not limited to one kind of profession, one domain, or one topic, so people from different backgrounds, different professions, can also present if their work relates to Python in some way. And we should also put some emphasis on education. Again, we see from the university example that maybe we have to inspire more young girls to do computer science or engineering subjects. Also, I think we should go beyond the Django Girls workshop. The Django Girls workshop is amazing, as I said, but I think it shouldn't be limited to one topic, one framework, Django. We should expand it to other topics, because Python is not just used for web development now; it's used for all different types of stuff. Maybe, because I'm a data scientist, I'm involved in data science, and Python is very popular in data science now, we should have something similar in data science: a workshop to encourage gender minorities, especially women, to start their careers in data science. And we should also break free of gender barriers.
For example, in PyLadies London they have the Trans*Code workshop, which is for people who identify as trans, or more generally, so that people with a certain identity can feel that they're not alone, that they're not minorities, and that they have support from the community as well. Also, this one I love: non-gender-labeled toilets. Last year at PyCon UK, they did a lot of work on this: the toilets are labeled by whether they have urinals or not, so it doesn't matter what your gender is, it's about what facility you're using. And also the t-shirts. My friend just told me that in the NumFOCUS guidelines, if you organize a PyData conference, your t-shirts should not be gender-labeled; they should be labeled by whether they're a straight cut or fitted. So it's about your body shape, not about your gender, which I think is amazing. And that's mainly for conferences; we also have other problems in the community that we have to try to solve. For example, maybe we need more female leadership and more female contributors. For me, my idea is that if I organize some sprints or meetups, maybe I should have some — again, like Django Girls — some meetups that are mainly for encouraging people in a minority: you are not alone, you have support from the community, let's do it together. To get them started, to get them working their way towards leadership, or becoming a maintainer through their contributions. Also, for young people in academia, or not just young people, maybe for researchers, we should do some outreach to tell them Python is a good tool: let's use Python instead of R. So yeah, I've only got one minute left, so I don't think I can take a lot of questions here, but I have this survey where you can give me some feedback.
I remember there are some free-text fields where you can type whatever you like. Please be kind to me and please give me feedback, because I hope I can keep giving this talk, and maybe I can add more and more, for example some opinions or information from my friends, so I can improve this presentation and keep raising awareness. I hope we can work together and make the community more diverse and better. So thank you so much. Yep, so. I think if anyone has one question, we have a minute to spare, or we can talk after the... Yeah, you can grab me at the conference, or, yeah, the survey is always there, so please, your feedback, I value it a lot. Thank you so much. This concludes our track.