We'll have a talk by Peter Hofmann from Blue Yonder. He will talk about how they use SQLAlchemy in the company, and then we'll have five minutes of questions and answers at the end. Thank you. Welcome, Peter.

My name is Peter Hofmann and I'm a senior software developer at Blue Yonder. Blue Yonder is the leading provider of cloud-based predictive applications for retail. We develop machine learning algorithms to deliver the best decisions daily to our customers. These decisions could be, for example, setting optimal prices in an online store, or providing the customer with replenishment orders so that their stores don't go out of stock and don't have too much waste on fresh products. We have evaluated NoSQL databases and MapReduce algorithms, but it came as a surprise to us how far we can scale the relational data model with the right database. At Blue Yonder, we use SQLAlchemy in all stages of our data science workflows to handle billions of records to feed into our predictive algorithms. This talk will dive into SQLAlchemy beyond the usual object-relational mapping parts and concentrate on the SQLAlchemy Core and the expression language, as well as database migrations with Alembic.

Before I start digging into SQLAlchemy, here is an overview of a simplified typical day at Blue Yonder; I want to show you the data flow in our data science company. The most important information for our machine learning algorithms is historical data with features and a target. In our case, master data like product or location information, as well as transactional data like sales and stock, combined with additional features like promotional data and weather features, is delivered to us daily by the customers. A typical day starts with the customer sending us XML files with flat, record-based sales information for the last day. We load the data into our data warehouse. The data warehouse is powered by EXASolution, a multi-node, in-memory, column-oriented database system from Germany. Once the data loading is done, reporting tools and interactive data analysis by our data scientists start. If the data for the day is correct, our machine learning algorithms start to query the data and generate predictions for the next day. The machine learning algorithms mostly work on two-dimensional feature matrices with features and a target. Most of the work in our workflow is done in a batch-oriented way. The data size is not really big data in terms of terabytes, but it is too large to fit into memory on a single machine. At the end of the day, when all the predictions are calculated, the customer retrieves the order proposals or price proposals via an XML API.

This presentation is split into two parts. The first one will give a brief overview of SQLAlchemy, and the second one will show some usage patterns of how we use SQLAlchemy in our company. So why is SQLAlchemy our backbone? Because all of our algorithms depend on data, and all the data wrangling, querying and processing is done through SQLAlchemy. We use a snowflake schema to store the master data and transactional data of our customers in the EXASolution database. Each table in the snowflake schema corresponds to a delivery category which our customers can deliver through the XML API. The XML schema and parsing code is directly derived from the SQLAlchemy table definitions. Our ETL processes use meta information from the tables and relations to generate the booking logic on the fly; a small sketch of this kind of metadata-driven code follows below.
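The actual Blue Yonder code is not part of the talk, so this is only a minimal, hypothetical sketch of how per-table field descriptions can be derived from SQLAlchemy table definitions; the table and column names are invented.

```python
# Hypothetical sketch: deriving field descriptions for a delivery category
# from a SQLAlchemy table definition. Table and column names are invented.
from sqlalchemy import MetaData, Table, Column, Integer, Date, Numeric

metadata = MetaData()

sales = Table(
    "sales", metadata,
    Column("product_id", Integer, primary_key=True),
    Column("location_id", Integer, primary_key=True),
    Column("sale_date", Date, primary_key=True),
    Column("quantity", Numeric(12, 3), nullable=False),
)

def describe_delivery_category(table):
    """Collect name, type, nullability and key information for every column."""
    return [
        {
            "name": col.name,
            "type": str(col.type),
            "required": not col.nullable,
            "primary_key": col.primary_key,
        }
        for col in table.columns
    ]

for table in metadata.sorted_tables:
    print(table.name, describe_delivery_category(table))
```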
The features for the machine learning algorithms are also collected through an automated query builder based on the customer-specific snowflake schema. New features can dynamically be added to the delivery categories and are available on the fly through SQLAlchemy for the query builder. Our reporting tools are built around dynamically generated REST APIs. All of this is only possible because we have an abstract model of our data schema built with SQLAlchemy.

So let's start with an architectural overview of SQLAlchemy. SQLAlchemy is the de facto way of working with relational databases in Python. It's a Pythonic way to represent common SQL statements through SQLAlchemy Core and the expression language. The diagram on the right side is the famous pancake diagram which shows the different layers of SQLAlchemy. At the bottom there is always a DB API compatible interface for a specific database. The Python Database API is specified in PEP 249 and defines a common set of operations to work with different databases. It provides an abstraction layer over different databases through different driver implementations. The top reason to use SQLAlchemy is to abstract your code away from the underlying database and its associated SQL peculiarities. SQLAlchemy supports common statements and types to ensure the SQL statements are generated efficiently and properly for each database type and vendor without you having to think about it. The SQLAlchemy ORM layer is similar to many other object-relational mappers that you might have heard of in other languages. It is focused around the domain model of an application and leverages the unit of work pattern to maintain object state. It also provides a high level of abstraction on top of the SQL expression language and enables the user to work in a more idiomatic way. This talk will concentrate on the SQLAlchemy Core and the expression language, because it's much lower-level, much closer to the actual database schema, and allows better performance. If you want to read more about the design decisions and the pancake diagram of SQLAlchemy, I can only recommend the chapter by Mike Bayer, the creator of SQLAlchemy, in the book The Architecture of Open Source Applications. It's available online as a free download.

Let's try the definition of SQLAlchemy again with a code example. SQLAlchemy Core is a Pythonic way of representing elements of both SQL commands and data structures, called the SQL Expression Language. We can define database schemas, types and relations of a table in pure Python syntax. The same applies to data manipulation like insert, update and delete statements, or data querying through select statements. SQLAlchemy can connect to lots of different databases through database-specific drivers. It works from single-file databases like SQLite to massively parallel multi-node databases like EXASolution in our case, or Amazon Redshift. Of course, you can also use the well-known open source database systems MySQL or Postgres; they are supported very well. The code at the bottom of the slide shows the definition of a minimal sales table (a reconstruction follows below). It defines different columns with different properties, primary keys and foreign keys. Each column has a SQLAlchemy type and a name. At the bottom you see a simple query to calculate the sales by location for a certain product. What you can already see here is how well the query syntax fits into Python code.
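The slide itself is not reproduced in this transcript, so the following is a hedged reconstruction of that kind of minimal sales table and aggregate query, written in SQLAlchemy 1.x style; the column names and the in-memory SQLite engine are assumptions made for illustration.

```python
from sqlalchemy import (create_engine, MetaData, Table, Column, Integer,
                        Date, Numeric, ForeignKey, select, func)

metadata = MetaData()

locations = Table(
    "locations", metadata,
    Column("location_id", Integer, primary_key=True),
)

sales = Table(
    "sales", metadata,
    Column("product_id", Integer, primary_key=True),
    Column("location_id", Integer, ForeignKey("locations.location_id"),
           primary_key=True),
    Column("sale_date", Date, primary_key=True),
    Column("quantity", Numeric(12, 3), nullable=False),
)

# An in-memory SQLite engine just so the example runs end to end.
engine = create_engine("sqlite://")
metadata.create_all(engine)

# Sales by location for one product: plain Python, no SQL strings.
query = (
    select([sales.c.location_id, func.sum(sales.c.quantity).label("total")])
    .where(sales.c.product_id == 42)
    .group_by(sales.c.location_id)
)

with engine.connect() as conn:
    for row in conn.execute(query):
        print(row.location_id, row.total)
```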
You can, for example, dynamically generate the list of columns in Python code and feed it into the query object. If you want to work with a database, the first thing you have to do is to define a database schema. SQLAlchemy provides a MetaData object which is used to collect the database structure so it can be quickly accessed inside SQLAlchemy. You can think of metadata as a kind of catalog of table objects with optional information about the engine and the connection. Table objects are initialized in SQLAlchemy Core by calling the Table constructor with the name of the table and the metadata object. Any additional arguments are assumed to be column objects. Column objects each represent a field in the table. Columns define the fields that exist in our tables, and they provide the primary means by which we define other constraints through their keyword arguments. Different types of columns have different primary arguments. For example, string types can have the column length as their primary argument, whereas numeric types have precision and scale arguments. Most types have no arguments.

SQLAlchemy defines a large number of generic types that are abstracted away from the actual SQL types supported by each backend database. For example, the Boolean generic type usually uses the BOOLEAN SQL type and on the Python side deals with True and False. However, it uses a SMALLINT on backend databases that don't support a Boolean type, and you as a developer don't have to deal with that in your Python code; you can just use True and False. The same applies to date and datetime types. When available, SQLAlchemy uses corresponding types on the database and creates Python datetime objects in the result set. Vendor-specific types are used in the same way as the SQL standard types; however, they are only available with specific backends. For example, the powerful JSON field that was introduced in Postgres 9.2, I think, is only available from the dialect implementation, and you can use it with the normal SQLAlchemy querying syntax.

You can use the SQL expression language provided by SQLAlchemy Core to insert data into the tables. To do this, you just have to call the insert method of a table object to create an insert statement and then add the values with keyword arguments for each column you want to fill. Once supplied with values, they are rendered as named bind parameters based on the column names in the SQL statement, which is how SQLAlchemy represents parameters when the statement is displayed via the str function. Parameters help to ensure that our data has been properly escaped, which mitigates security issues such as SQL injection attacks. We can also insert multiple records at once by using a list of dictionaries with the data we are going to submit. The example at the bottom shows how to build a delete statement (the insert and delete patterns are sketched below). SQLAlchemy heavily relies on the pattern of method chaining to add additional clauses or restrictions to a statement. All this is done in a very clean syntax and integrates nicely with the Python language. To build a query, you can use the select function, which is analogous to the standard SQL SELECT statement. As you can see in the example code, it is very easy to build a query in different steps and use the full dynamic power of Python to generate the queries on the fly with dynamic values, additional join conditions, or specialized restrictions.
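As a hedged illustration of these insert, delete and step-by-step select patterns, here is a short continuation of the previous sketch; it reuses the sales table and engine defined there, and all values are invented.

```python
from datetime import date
from sqlalchemy import and_, select

# Single insert: values become named bind parameters, never interpolated text.
ins = sales.insert().values(product_id=42, location_id=1,
                            sale_date=date(2016, 7, 18), quantity=3)
print(str(ins))  # INSERT INTO sales (...) VALUES (:product_id, :location_id, ...)

with engine.begin() as conn:
    conn.execute(ins)

    # Multiple records at once: a list of dictionaries (executemany).
    conn.execute(sales.insert(), [
        {"product_id": 42, "location_id": 2,
         "sale_date": date(2016, 7, 18), "quantity": 5},
        {"product_id": 43, "location_id": 1,
         "sale_date": date(2016, 7, 18), "quantity": 1},
    ])

    # Delete with method chaining to add restrictions.
    conn.execute(
        sales.delete().where(and_(sales.c.product_id == 43,
                                  sales.c.location_id == 1))
    )

    # Build a query in several steps and execute it only at the end.
    query = select([sales.c.location_id, sales.c.quantity])
    query = query.where(sales.c.product_id == 42)
    query = query.order_by(sales.c.location_id)
    print(conn.execute(query).fetchall())
```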
SQLAlchemy Core supports most of the features you can use in plain SQL, such as GROUP BY and the different joins like left join, right join and outer joins. Depending on the database and the driver implementation, it also supports more sophisticated statements like window functions. The return value of a query is a result set, which is a wrapper around a DB API cursor object. Its main goal is to make it easier to use and manipulate the results of a statement. It allows you to access the rows using an index, a column object, or the name of the field. If you want to see the columns that are available in a result, you can use the keys method on the result set to get a list of column names. In production, you should avoid using fetchall because it loads all the data from the result set into memory. The result set proxy implements the iterator protocol, so you can easily walk over the results one by one or stream the rows into another function. In addition to the fetchall method, you can use the first method, which returns just the first record of a statement and closes the connection afterwards. Or you can use fetchone, which returns one row and leaves the cursor open to make additional fetches. The scalar method returns a single value from a query with a single record and a single column.

Because at Blue Yonder we have a rather difficult snowflake schema, we have implemented a query builder which allows our data scientists to just list the features they want to use in their models, and the query builder does the rest: it does the right joins and applies the right restrictions. Providing an abstraction over our database schema where the data scientist can just specify the feature columns he wants to use in his model gave us a real boost in terms of developer productivity. Our snowflake schema is a rather complex schema and also includes temporal tables, which are a little bit more difficult to join and always led to bugs in our code. We could build this query builder because we were able to work with a structured definition of the database schema through the SQLAlchemy MetaData object, and this really provided much benefit. To be fully able to describe our different delivery categories and tables we had to add some more metadata to the schema, table and column definitions, but this was possible without much hassle. A rather simplified version of such a query builder for a star schema is sketched below; as I said, our implementation is a little bit more complex, but it uses the same principle. In principle, you just specify the names of the columns that you want to query, and then you loop over the tables in your metadata description and take the fields and tables the user has selected. After that you dynamically build the join expressions with the tables you need in the query and the columns the user has chosen.

One great feature of SQLAlchemy is that you can reuse parts of selectables in other queries, so you can build common building blocks and reuse them. Selectables behave the same as tables, so they can be used in other statements the same way as normal tables, and this is really a great feature for composing larger SQL statements: you define building blocks of queries and compose them together. They also support the same metadata as normal table definitions, so that you can, for example, query all their fields and their types, and all of this is inferred from the select statement you defined earlier.
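Since the slide code is not part of the transcript, here is a minimal sketch of such a query builder for a made-up star schema; it relies only on the foreign keys to infer the join conditions, and all table and feature names are assumptions.

```python
from sqlalchemy import (MetaData, Table, Column, Integer, Date, Numeric,
                        String, ForeignKey, select)

metadata = MetaData()

product = Table(
    "product", metadata,
    Column("product_id", Integer, primary_key=True),
    Column("product_group", String(30)),
)

location = Table(
    "location", metadata,
    Column("location_id", Integer, primary_key=True),
    Column("region", String(30)),
)

# The fact table of the star schema.
sales = Table(
    "sales", metadata,
    Column("product_id", Integer, ForeignKey("product.product_id"),
           primary_key=True),
    Column("location_id", Integer, ForeignKey("location.location_id"),
           primary_key=True),
    Column("sale_date", Date, primary_key=True),
    Column("quantity", Numeric(12, 3)),
)

def build_feature_query(fact, feature_names):
    """Return a select() over the requested feature columns, joining the
    dimension tables onto the fact table as needed."""
    columns, joined, joined_tables = [], fact, {fact}
    for name in feature_names:
        for table in metadata.sorted_tables:
            if name in table.c:
                columns.append(table.c[name])
                if table not in joined_tables:
                    # The ON clause is inferred from the foreign key relation.
                    joined = joined.join(table)
                    joined_tables.add(table)
                break
    return select(columns).select_from(joined)

query = build_feature_query(sales, ["sale_date", "quantity",
                                    "product_group", "region"])
print(query)  # SELECT ... FROM sales JOIN product ON ... JOIN location ON ...
```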
The merge statement is the workhorse in our internal extract, load and transform process to get the customer data into its normalized form in our snowflake schema in the data warehouse. The customers send us daily, record-based XML files for every delivery category defined in the snowflake schema. The first step is the validation process for syntactical errors, and then the bulk data upload into the staging area of the database starts. Each table in the snowflake schema has a corresponding table in the staging area with additional meta information. After all the data has been loaded into the staging area, we run a couple of SQL statements in subsequent steps and check the data for validity. All records which pass the validation are marked with a status flag. Once the data checking is done, we use the merge statement to update existing records in the core tables or insert new ones if they do not exist yet. The merge operation is a database-internal, highly efficient copy-and-update operation. It allows us to update 40 million stock records in a stock table with more than 40 billion records in 20 minutes. The merge statement is not supported by all databases. It's also known as an upsert and recently landed in Postgres, and it's supported by the rather big databases like Oracle, EXASolution or MS SQL Server.

As you can see here, this is a sample of how you build the merge statement. You always have to provide a source and a target table and then a condition on how to combine the two tables, together with an update and an insert statement to tell the merge statement what to do in case the record already exists in the target table and how to insert it if it does not exist. The merge statement is part of the SQLAlchemy EXASOL dialect, and it just generates plain SQL on the backend. That's how SQLAlchemy always works: you define the query or the statement in plain Python code, and it generates the SQL code on the backend side. The next slide shows the same statement as before, but now in plain SQL (a rough sketch of such a plain-SQL merge follows below); in my opinion it's much easier to build these kinds of queries in pure Python syntax than always doing string manipulation to build them as plain SQL.

SQLAlchemy allows you to perform functional tests against a database or to mock out queries and connections. Most of our applications contain both unit and functional tests, because doing integration tests against an actual database is important but often very costly. A common pattern is to use a data access layer which acts as a proxy for the database queries, so you can easily replace the database access in your unit tests if you just want to test your application logic. On the other hand, isolating the database queries behind the data access layer helps a lot with testing only the queries against a real database, without all the rest of your application. Because testing against a real database is rather time consuming, we test against an in-memory SQLite database in cases where we just want to test code that doesn't use database-specific features. That's a common pattern in testing with SQLAlchemy. If you want to reuse the fixtures between the test runs, you can also use a setup class instead of the setup method to have them available for the whole test case. In cases where we rely on database-specific features like the merge statement, we have set up schemas and fixtures in an actual database, and we use a transaction rollback mechanism in our tests to test the queries.
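Coming back to the merge statement: the exact Python API of the EXASOL dialect's merge construct is not shown in this transcript, so the following is only a rough, assumed sketch of the kind of plain SQL such a staging-to-core merge boils down to, sent through sqlalchemy.text() with a bound parameter; all table and column names are invented.

```python
from sqlalchemy import text

# Approximate plain-SQL shape of a staging-to-core merge (not the dialect's
# own construct). Table names, columns and the status flag are assumptions.
merge_stmt = text("""
    MERGE INTO sales tgt
    USING (SELECT * FROM sales_stage WHERE status = :status) src
       ON (tgt.product_id  = src.product_id
      AND  tgt.location_id = src.location_id
      AND  tgt.sale_date   = src.sale_date)
    WHEN MATCHED THEN
        UPDATE SET quantity = src.quantity
    WHEN NOT MATCHED THEN
        INSERT (product_id, location_id, sale_date, quantity)
        VALUES (src.product_id, src.location_id, src.sale_date, src.quantity)
""")

# Requires a database that supports MERGE (EXASolution, Oracle, MS SQL Server):
# with exasol_engine.begin() as conn:
#     conn.execute(merge_stmt, {"status": "valid"})
```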
In those tests, all code that needs to modify the database runs in a transaction and is rolled back once the test has passed. Testing all our core booking merge statements takes up to an hour, which is really long, but it's not possible any other way, because this is a feature only the EXASolution database has.

Another nice feature of SQLAlchemy is database reflection and introspection. As I showed you earlier, you can define the schema of a database in Python syntax with normal table definitions, but you can also reflect a whole existing schema into the SQLAlchemy meta model. Another way to introspect your database, or an unknown database, is the inspect module in SQLAlchemy. For example, you can list all the table names or get the foreign keys for a specific table.

I think most of you know the little Bobby Tables XKCD comic. One thing we learned is that you should never trust user input and should always use SQLAlchemy to generate your expressions. I don't know if you can read the code, but the point is: never do string interpolation with user-provided text or input. Even in our case, where we mostly work with in-house data analysts who are not interested in doing harm, we learned that it could still be possible to do harm in our production databases. We have been doing an ISO certification with a security company, and these guys are much more clever than you think: they tried to insert data into one of our meta databases which was used by internal services to build queries. So never trust the user, and never build queries with string formatting. Always use the SQLAlchemy expression language, which does parameter binding and escaping for you in a secure way.

If you build larger applications that evolve over time, an often forgotten aspect is that along with the evolving application, the persistence layer needs to be updated and changed. You might need to modify the structure of your data, for example add new fields, remove existing ones, or add a NOT NULL constraint to one of your columns. Or you need to update the data inside your tables because your business logic changed. Alembic is a great tool to help you with database migrations. It adds a version identifier to your database schema and provides a convenient way to define upgrades and downgrades in Python scripts. You run Alembic against your database, and it will check the current version of your schema and apply a series of upgrades to bring it up to the current version. Once you get used to this, you will never want to go back. A minimal sketch of such a migration script follows below.

Because we are heavy users of SQLAlchemy internally at Blue Yonder, we have developed some tools, and there is ongoing development; those are the tools I want to present to you. The first one is the SQLAlchemy EXASOL dialect. As I said earlier, EXASolution is an in-memory, column-oriented relational database system. We use it in cluster sizes from 8 to 16 nodes, and it's really a powerful database. We developed a SQLAlchemy dialect for the EXASolution database and open-sourced it on GitHub. If you want to try it out and do not have access to an EXASolution database, you can get EXASolo, which is proprietary but free, a single-node instance. Another thing we are working on is turbodbc, which is a DB API 2.0 compliant ODBC interface. At the moment we are working with an existing ODBC driver, but we are rather unhappy with its performance; that's why we started turbodbc, and the reason to use it is that it's faster.
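Returning to Alembic: here is a minimal sketch of what such a migration script can look like. The revision identifiers, the table and the added column are made up; Alembic normally generates the script skeleton for you with the alembic revision command.

```python
"""add promotion flag to sales

Revision ID: a1b2c3d4e5f6
Revises: 0f1e2d3c4b5a
"""
from alembic import op
import sqlalchemy as sa

# Revision identifiers used by Alembic (invented for this sketch).
revision = "a1b2c3d4e5f6"
down_revision = "0f1e2d3c4b5a"


def upgrade():
    # Structural change: add a new, initially nullable column.
    op.add_column("sales", sa.Column("on_promotion", sa.Boolean(), nullable=True))
    # Data change: backfill existing rows before tightening the constraint.
    op.execute("UPDATE sales SET on_promotion = FALSE WHERE on_promotion IS NULL")
    op.alter_column("sales", "on_promotion", nullable=False)


def downgrade():
    op.drop_column("sales", "on_promotion")
```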
With turbodbc we are also planning direct export into NumPy arrays, which will speed up our applications and lower the memory requirements. Another thing a colleague of mine, Uwe Korn, is working on is using turbodbc to get database results as NumPy arrays or Apache Arrow tables without an intermediate transformation. Apache Arrow is an in-memory data structure specification used by engineers building data systems. It's used in the Pandas and the R communities, among others, and is a language-agnostic format. Its columnar memory layout permits O(1) random access and is highly cache-efficient for analytic workloads. Developers can create fast algorithms that process Arrow data structures, and it's also possible to do efficient, fast data interchange between systems without the serialization cost associated with other systems like Thrift, Avro or Protocol Buffers. If you want to learn more about the ongoing project to get Apache Arrow into Python and out of the database, I can only recommend the Software Engineering Daily podcast episode with Uwe or his talk at the PyData Paris conference. So, that's the last slide: just some links to documentation. My slides will be available on GitHub. Any questions?

Please wait for the microphone so that we hear you in the recording. You talked about updating the database, right? If I already have a row of data and I want to update this row, what was the name of that statement? The merge statement? Yes, the merge statement. Is it protected if, on a multi-core machine, updates are made at the same time? It's protected by a transaction. Yeah, it's protected. Okay, thank you.

I wanted to ask if you measured the performance penalty implied by using the ORM layer, not just the Core. In other words, if you used the full stack of SQLAlchemy and not just the Core, would it be much slower? That really depends on the type of the query. Normally, in the ORM you have lazy loading, and if you access nested data structures from other tables, the ORM will do another query for each field. That will be much slower than generating one select statement and just getting the result set. So for data science and analytical workloads the ORM is not an option, because of the data access patterns it implements; it's much more suited to web applications, and there it's fine. But for getting a lot of data out of the database, that access pattern is not usable.

In the Django ORM there is prefetch_related and select_related, which already fetch the foreign keys. Is there anything similar, regarding the question he made? You can define the foreign keys, and then if you join two tables that have a foreign key relation, SQLAlchemy Core will automatically choose the right join condition. Okay, so it would already bring the foreign keys together, so in just one query you would have everything else? Yes, but the query builder I showed was a rather simplified version. Our join conditions are a little bit more complex, and they can't be expressed automatically in SQLAlchemy, so we have to put some more logic into the query builder. Okay, thank you.

Okay, so I've seen that you're using the mapper declaration style. Is there a specific reason for not using the declarative base approach? No, I'm not using the mapper; the mapper is only used in the ORM. I'm using the metadata, so I'm not using the mapper.
That's just part of the ORM and not of SQLAlchemy Core. Okay, thanks. Any more questions? No? Well, thank you for your attention, and thank you, Peter, again.