All right, I think we'll get started. So you can see I was a little bit starstruck yesterday at a barbecue. The inventor of pandas, Wes McKinney, was there. Francesc Alted, who's originally from Valencia; he's the guy that invented PyTables, which we'll see later on today. It's sort of a nice Pythonic interface to a file format called HDF5. Fernando, who you know well, the inventor of IPython. Travis Oliphant, the inventor of NumPy and former president of Enthought; he's now CEO of a new company called Continuum Analytics. And Paul Ivanov, one of the major contributors to Matplotlib. And then there were others around. And those of you that went to PyData, I guess, also got to meet Guido. So it's been like an awesome Python time. And then this week upcoming is PyCon. In fact, a number of people from my group, people I know, are probably not here because they're at PyCon. I'll mention that later on. Was there a question in the back? Oh, I thought you were doing OK.

So the subject of today is databases with Python. And probably many of you have seen this sort of symbolic representation of a database: three little disks or stacks on top of each other. Python has some very beautiful interfaces to some of the most powerful databasing systems available today. And some might argue that if there isn't a Python API or Python hook into a database, it's not going to get the kind of adoption that it needs. So many who are building new types of database systems are well aware that many of those who are going to be working with them will be Pythonistas.

The overview of today is a little bit different from what we've done in the past. Because we're going to be talking about databases, I thought it important to get everybody on the same page understanding what a database actually is. Then we'll get into some of the nitty-gritty of how Python interacts with a database. Really, for the first hour or so we're going to be talking about databases, and then we'll be interacting with a nice little database system that you've already installed called SQLite3. We'll be working with that from the command line, helping us flesh out our understanding of how databases actually work. Then we'll have Chris jump in, and he'll talk to you about how Python views those databases. And at the end we'll talk about a file format called HDF5, which is sort of a nice marriage of industrial-strength databases with a file format sensibility. You'll see more of that as we go through today's lecture.

What are databases? Why are we even using databases? That's the first subject we'll get into. Then we'll introduce the generic concepts of databases, and in particular talk about relational databases and the so-called SQL. Then we'll go into SQLite3 and its connection to Python; MySQL and Postgres we'll do a bit of. And then we'll end with HDF5 and a bit of netCDF4. So I'm sorry for those that are very conversant and well aware of what databases are; I do want to take a step back and present a high-level overview. A database is a collection of data stored in an organized way, and it's managed, essentially by definition, by a so-called database management system, or DBMS. DBMSs generally offer storage, which is really their main use, but also the creation and manipulation of data in the context of that storage platform.
The other main use is being able to access results from queries on that large store, and that's what we call search. And more and more, what you're seeing is that database systems aren't just there for collecting data and allowing you to search on it. They're also now letting you do native, high-level data operations. So instead of just saying "give me the average of everything in some row," you might now say "give me the fast Fourier transform of everything in that row." And while you could pull the data out of the database system and do a fast Fourier transform within Python, as you saw in your last homework, there are some nice advantages if the algorithms are built into the system itself and can be run natively, essentially within the database.

So before we get into third-party tools, you might say: why are we doing this? Doesn't Python already have a nice way for me to store data? For instance, on the command line I could say db equals range of 1 to 1000, skipping every other number. Then I stick "spam a lot" in at index 10. I grab index 31, which is perhaps like doing a lookup, and I get a number back. And if I want to store it, that's easy: I just pickle it up, or if it's a NumPy array, I could just do a savez. And this is great, because if I want to get that data back, I just open the file up and do what I want with it.

But there are significant limitations, as I'm sure you all know. What we've been doing in this course until now is really talking about algorithms, a bit of workflow, interacting with the real world, but we haven't really gotten to the core of what it means to be a scientist. And I think for many of us, across a number of different physical science disciplines, being a scientist now means dealing with tremendous amounts of data. Python has, as you know, the ability to deal with lots of data natively, but here are some of the major limitations. One is that searching a list by value is very slow: if we did listname.index(element) to find what index an element is at, that's order N. If we're going to keep these "databases" in memory, they can only be as big as the available RAM. Another issue, although savez with NumPy mitigates this a little, is that if we store those files by pickling, we have portability problems: if I want to hand that file off to somebody that uses MATLAB or some other code base, they're going to have to do some translation on that file format. And then, of course, if we're interacting with a file, only one Python instance is going to be able to interact with that file and sort of own it. Other Python instances could look at the same file and try to map what's on disk into memory, but giving multiple Python instances read-write access to the same part of memory is very scary, because it's not very well managed. In fact, it's not managed at all. In terms of searching, what you generally want is indexing, which allows you to make very fast queries: you're essentially hashing words or numbers and searching through hash tables instead of searching in a regular-expression sort of way.
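To make that concrete, here is a minimal sketch of the native-Python approach just described; the file names and particular values are illustrative, not taken from the slides:

```python
import pickle
import numpy as np

# A toy in-memory "database": every other number from 1 to 999.
db = list(range(1, 1000, 2))
db[10] = 'spam a lot'      # stick a value in at index 10
item = db[31]              # a "lookup" by index is fast...

# ...but searching by *value* scans the whole list: O(N), no index.
idx = db.index(15)

# Persisting is easy enough: pickle for arbitrary objects...
with open('db.pkl', 'wb') as f:
    pickle.dump(db, f)

# ...or np.savez for NumPy arrays. Neither is portable to, say,
# MATLAB, and only one process can safely own the file at a time.
np.savez('db.npz', data=np.arange(1, 1000, 2))
```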
Those that took the boot camp perhaps remember one of the first dictionaries that we presented. Here we have a dictionary from Monty Python of a bunch of different scenes. Remember, there's the cheese shop scene, there's the spam scene. The actors in those: John Cleese and Michael Palin in the first one, Eric Idle and John Cleese in the second one. And then we might have some information about those actors. This is effectively what a database is doing; it's a relational database at some level. We might know a lot about these various people: how funny they are (we might have some numerics on that), how old they are, et cetera. But if I now want to ask a question, for instance, what's the average age of the actors in the cheese shop scene, you essentially have to go through and do what we will wind up calling joins on these various key-value pairs. So how would you answer that question? Well, you might say: okay, go to the different scenes, give me the cheese shop, give me the actors in that, and then loop over the actors until I wind up getting Cleese and Palin, essentially doing a search on the values associated with the key "actors", and then wind up grabbing those ages and taking an average of the result.
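Here's roughly what that hand-rolled "join" looks like in plain Python; this is a sketch, and the ages and scores are made up for illustration:

```python
# Monty Python "tables" as nested dicts (values are illustrative).
scenes = {'cheese shop': {'actors': ['Cleese', 'Palin']},
          'spam':        {'actors': ['Idle', 'Cleese']}}
actors = {'Cleese': {'age': 43, 'funny': 9.9},
          'Palin':  {'age': 40, 'funny': 9.7},
          'Idle':   {'age': 39, 'funny': 9.5}}

# "What's the average age of the actors in the cheese shop scene?"
# This lookup-and-merge is effectively a join done by hand.
ages = [actors[name]['age'] for name in scenes['cheese shop']['actors']]
print(sum(ages) / len(ages))
```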
So there are ways to store data within Python, and there are ways to search within Python. But given all those limitations and caveats, it makes sense that Python's real solution, when you're dealing with massive amounts of data, and dealing with it the way modern scientists might want to, is to hand the problem off to third-party systems. If you go to wiki.python.org and look at the database interfaces page, you'll see all the different ways in which we can interact with different database solutions, and this is a huge ecology in the commercial world. There's also a lot of open-source support for some of these, in particular MySQL and Postgres. MySQL is now, I think, owned by Oracle, but they have an open-source version of it. For many of the others you actually have to pay a lot of money, but there are Python interfaces to those as well. So generally, when we're talking about these very large code bases, what you typically see are people in the Python scientific community working with MySQL and Postgres. Now, this itself is also changing as new, science-ready databases are starting to pop up; I'll mention that towards the end of the lecture. The list is somewhat outdated, and chances are, if in your own research lab you're used to working with a certain one of these solutions, there's already a Python interface to it. And we'll talk in detail about what it means to have a Python interface to these various databases.

Okay, so let's try to understand what we mean when we talk about a database management system. What are those frameworks? What are the different layers of our understanding of these systems? Well, first is the external: it's what users of the database will see. If you're interacting with a database at the Python command level, you're going to see some representation of the data, a high-level abstraction of how the data actually live in memory. And depending on how sophisticated your database system is, you're going to have perhaps different password protections, so some people might have different views of the data than you. Then there's the conceptual, which is the underlying architecture of how the data is meant to be accessed by the external users: the logical structure of the database. It defines the relationships between the data, the security details, and typically we wind up calling this the database schema. Do you know what you have when you have multiple schemas? Schemata, yeah, good.

Okay, and then there's the last layer, which is where a lot of the architecture comes in for those that are building new databases from scratch with a new concept of how you might access memory: the physical. How the data is actually stored, how it's managed at the low level, how you go from the actual bits and bytes on disk to the conceptual layer. And this physical layer is becoming increasingly important as we start thinking about routinely interacting with data that's much larger than RAM, much larger than what you can get on a single node of a computer. Now you start thinking about distributed databases across multiple nodes, perhaps not even in a single physical location like a data center, but distributed over the entire internet. In the end, what you'd like is for the external users not to notice and not really care how that data is distributed; but in practice, when you're interacting with that data and actually searching on it, how the data is distributed will wind up dictating what you can and can't do at the SQL level.

Okay, so let's talk a little bit about that middle layer, the conceptual layer, what we call the data model. There are a bunch of different flavors of data model, and recognize that having one file format doesn't push you into one particular data model; how the data lives on disk is in some ways separate from how you present and conceptualize that data. The first way we think about databases is the so-called relational database: data organized into two-dimensional tables, with connections across tables expressed through the concept of keys. We'll see much more of that later. Another way to think about the data is in an object-oriented sense. You think not about tables but about entities, which have a similarity to tables, but now there's the concept of inheritance; all the things we get out of the class structure in Python become possible. Most of the database solutions out there are relational in nature; db4o, which is a big one, and BigTable from Google are object-oriented by trade. There are abstractions that let you go back and forth between relational and object-oriented databases. But one model which is gaining some popularity is the so-called hierarchical database, and this is more a matter of thinking of the data in a tree-like way, or as a document. So think about an XML document or a JSON string: eXist is one of the well-known XML-based databases, and there's MongoDB, CouchDB, et cetera. These are sometimes known as NoSQL databases, and if people are interested in why that is, we can talk about it later; but effectively it's not "no SQL," it's SQL where you haven't defined your schema yet, or a dynamic schema, to be more precise. Okay, so let's look at the architecture, the conceptual layer, of a relational database. Here we have four tables.
The names of the tables are study, subject, questionnaire, and question, and at the very bottom of each of these tables you can see the kind of data that we might want to have in each of them; there will be an arbitrary number of rows in these tables. The other thing to note are these things marked PK, the primary keys. These are the things that are indexed, so they can be looked up, searched on, and compared against each other very rapidly. It makes sense to name these after the table: for study, the primary key would be the study ID; for subject, the subject ID; et cetera.

Think about the creation of a new row. Let's say I wanted to create a new study, and I'm going to be surveying a bunch of students, or people in general. I might have the name of this study (this might be "hot or not"), the PI name (that's me), whether it's completed, and how many subjects there are going to be. Every study I create gets a row like this, with this high-level metadata about the study. Then that study is going to involve a questionnaire, and I might have a number of different questionnaires for a given study. So I might have a hot-or-not study, and I might have a hot-or-not for animals or something; that could be part of the same study, to figure out whether animals or people are hot or not. Sorry, maybe I'm getting more specific than I ought to be.

Every time I create a new questionnaire, its row will carry so-called foreign keys, and the foreign keys are my way of looking back at what the questionnaire is associated with. It's essentially my way of merging tables: the foreign key of one table is, in general, the primary key of another table. So here the foreign key points back to the study. When I create a new row in the questionnaire table, it will be pointing back to one study. Now, there are ways of designing this architecture such that a questionnaire could be part of two studies, but the way it's laid out here, that's not actually possible; if you wanted that, you'd essentially have to repeat the row with a different foreign key in the STID column. The other thing is that there's going to be a subject taking part in this questionnaire; the subject ID is the primary key of the subject table, and a subject may be part of different questionnaires, so they may actually take a bunch of different questionnaires. And then there are the questions, et cetera. I won't go into the details, because it doesn't really matter what the implementation is here. What I want you to notice is that there's the data that each table is really about, and then there are the connections between the tables. The multiplicity annotations tell you how many there can be: "0..*" means that when I create a study, I don't have to create any questionnaires at all, and I'd have a really simple database because I haven't actually done anything interesting; but "1..*" means that when I create a questionnaire, I want at least one question associated with it.
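In SQL terms, the study-questionnaire relationship in that diagram would look something like the following. This is a hypothetical sketch, since the slide shows a diagram rather than code, and the column names just follow the slide's labels:

```sql
CREATE TABLE study (
    stid       INTEGER PRIMARY KEY,  -- the PK marked in the diagram
    name       TEXT,                 -- e.g. 'hot or not'
    pi_name    TEXT,
    completed  BOOLEAN,
    n_subjects INTEGER
);

CREATE TABLE questionnaire (
    qid  INTEGER PRIMARY KEY,
    stid INTEGER REFERENCES study(stid)  -- the FK pointing back at one study
);
```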
This next one is a very complicated object-oriented database: some sort of ecology of people taking classes, people teaching them, and people making money because they're teaching classes. I won't go into object-oriented databases in great detail, only to show you that they look a lot like classes and subclasses and inheritance. In particular, if you look at an employee, you might say that every employee has the concept of a salary, and that salary may not just be a number; it may include the date on which they're paid, whether they're paid on a nine-month basis or a twelve-month basis. So that in and of itself might be captured in the form of a table. But we have different types of people who are employees: there are assistants and assistant professors and staff, et cetera. So an assistant would have, if you want to think about it in a Pythonic way, its own set of attributes that are not part of the generic employee, but everything that the employee has and points to is inherited by the assistant, and likewise the professor, and likewise the staff. It allows you to create a high-level view of the relationships. But fundamentally, under the hood, what's happening in the end is that you're really creating primary keys and foreign keys.

A fairly different concept of a database, but one which tends to be pretty intuitive for a lot of scientists, is the hierarchical database, represented here in this little graph. We have the concept of what electronics are, and again, there could be metadata that goes along with each entry. Then we have two kinds of electronics, televisions and portable electronics, and different types of televisions and different types of portable electronics. Again, you could create this, and you will be able to create this, in a relational database sense, but conceptualizing the entire universe of the stuff you're interested in with this sort of Venn diagram of nested structures is actually very helpful. We'll see a lot more about hierarchical databases towards the end of the lecture today.

Okay, so we have different conceptual layers: I talked about relational, object-oriented, and hierarchical databases. Now let's talk about how they're actually implemented in the real world. Most database management systems have a server concept to them, where I'm a client and I want to connect to a database and interact with some tables in that database. This server sense is one where you think of it as persistent code which is just waiting to be interacted with. You might start up MySQL or Postgres and think of it as essentially sitting there, listening on a port, waiting for somebody, some client, to interact with it. The nice part of this, of course, is that it's persistent, so it's very fast when I want to connect to a database and get going. If I have a good server, it has already preloaded some of the things I'm going to want to interact with. For instance, if I'm going to connect and give it a password so I can get permissions to certain tables or certain databases, then you hope it has preloaded all the permissions tables, which are themselves represented as tables, and indeed that's generally what happens. The other nice thing about a server is that it could itself be doing a lot of master-slave replication for you, keeping the data fresh.
It's making sure that if a node goes down, you as a client connecting to that server don't even notice, because all the connections are managed for you. If a slave goes down (I've got to be careful with my language here), the master says: ah, we've got to make sure we're redirecting all of our search traffic to the other slaves. If some of the data gets corrupted, and the server knows how to deal with corrupted data, it will start replicating that data in other places to make sure you have a very robust data set. So that's the vast majority of the systems.

And then there are the so-called embedded database management systems, and this is where there is no server. What you do as a client is essentially connect to a persistent file, usually on disk, and you wind up interacting directly with the database. There's a set of low-level APIs that connect to the database, and then a layer between those and, say, Python. In Python, you say "connect to the database," and what you're doing is essentially attaching yourself to a file. So there is no server.

A client-server system is pretty nice, because you can have multiple clients logging in from all over the world, all interacting with the same data, and the server is going to manage the transactions that people might be doing on a database. So if I'm interacting with a table and I wind up updating a column, and somebody else is trying to do the same thing, the database management system will manage those conflicts. It will say: no, that person has priority, because that's how this database was configured. Or it'll have a sense of transactions, where everything you do inside of a transaction is effectively protected until you commit that transaction, and it manages conflicts between those. In an embedded database, you don't usually have that sort of mediation. So you generally don't use embedded databases unless you're pretty sure your client is the only thing interacting with that database. If it's just you sitting there playing with your data, even a big database, it isn't ridiculous to think about one of these embedded systems. If it's going to be you interacting with lots of data and updating lots of data, and then five other people from your lab logging in from all over the world and working with the same data set, it'd be silly to think about an embedded solution.

Regardless, all of them are fast, safe, and very scalable. Well, not all of them: the embedded ones tend not to be super scalable. But almost all of the modern systems that people use on a regular basis have the ability to operate over multiple nodes, operate very fast, and let you issue large, complex queries in a very efficient way.

Okay, so I've bandied around a couple of different terms; let me get specific about what I mean by the database model concepts. One is the database itself: that is the thing that contains tables. It has a concept of actions, which are sometimes called triggers, and it deals with permissions and security.
When you have a database management system, you can have multiple databases: one client can connect to one database, another client can connect to another, and a third client can connect to the second database at the same time as the second client, and the database management system deals with all of that for you. Actions are interesting, because if somebody, say client three, updates a row, you may have created what are called triggers, where every time a row in that table is updated by that person, somebody else gets an email. So you have the ability to embed actions that fire as the data is changed and interacted with. Tables: a table is data with a fixed number of columns, plus rows that you add data to; you'll add more and more rows as you get more data, and it's the thing you're going to wind up querying. And then keys, the thing I presented before: a key is essentially one column which is hashed, which makes it very efficient to look up and very efficient for joining tables. I already presented the concept of primary and foreign keys. In fact, you can create hashes, or indices, on multiple columns: I can essentially create a tuple of keys, create an index on that, and make those pairs or triplets or quadruplets very fast to search over. In document-based databases, like the hierarchical ones, tables are often called collections, but that's just parlance.

Okay, so let's start to play a bit with databases. If you haven't already, please go to bSpace, resources, weekly files, and download the two files there ending in .sql. You can actually look inside those SQL files; we'll start interacting with them now.

Databases are everywhere, whether you know it or not. Almost everything you're doing on the web when you're searching is, of course, interacting with a very large-scale database. And even when you go to specific websites that aren't search engines, when you're interacting with that webpage, generally you're being presented the results of some sort of high-level query. The example we'll be using now is interacting with collections of artwork. Here is the result of interacting with a database when you go to the de Young Museum website and say: just show me stuff in the collection. These are not static pages; these are pages that are generated when you make that call to the site. And of course you can also do searches. If I search on Diebenkorn, I'm going to get back all the artwork by Diebenkorn at the de Young Museum. What you're effectively doing is selecting from a table, probably called artist, where the artist's name is equal to Diebenkorn; you're getting back the artist ID number; and then you're searching another table for all the artwork that has that artist ID. Now, how you deal with an artwork that's made by several artists: not my problem. If you're designing the database, you have to think about that sort of stuff.

So we're going to create our own collection of artwork, just as a way of playing with databases. From the command line, if you don't mind, bring up a terminal and type sqlite3 and then some name of a database; we'll just call it art.db.
This file, which is the database (again, we're playing with SQLite3, which is an embedded database), doesn't have to exist. We're going to essentially connect to and create a new database, and what you should see is something like this. Is everyone able to do that? Most of you? Darien doesn't have his laptop open. Thank you, Darien.

Okay, and now I'd like you to type at this new command line (just like when you type python in your terminal you get a little Python command line, now you get a SQLite command line): .read art-create.sql. What this is doing is reading in and issuing SQL statements (we'll see more of what that means later on), creating a database on the fly from the scripted statements inside that SQL file. Whenever you see a command starting with a dot inside of SQLite, that is, if you want to think about it this way, like the magic functions within IPython. This is stuff that isn't about the database itself; it's about your view of that database. So I can read a script in and essentially issue those SQL commands. If you then type .schema, you will see, although not beautifully colorized like this, something like this. And I want to go through it a little to show you what's actually happening.

When I'm looking at the schema (and this will look very, very similar to what's inside the .sql file itself), you can see I essentially have three commands. What are these commands doing? They're creating three tables for me. So I'm going to create three... yes? What are your file permissions on that file when you pulled it over? If you do ls -l on the file name... well, maybe I'll ask somebody to go over and just check it out. Is anyone else having trouble with it? Okay, I bet it's a file permission issue.

Okay, so what are we doing in art-create.sql? We're creating three tables, and lo and behold, the syntax isn't so obscure: it's saying CREATE TABLE. In the first case, we're going to call that table artist. We're going to create a primary key, and we're going to call it AID, for artist ID. It's not allowed to be null, so it can't be empty. We're going to start off at zero, and every time I create a new row inside that table, I want that number to be auto-incremented for me; this guarantees that I have a unique ID. What are the other attributes of this table? These are essentially the other columns: the first name, last name, birth date. And you see that SQLite has the concept of a date type. If I don't supply a birth date when I add a new row, it will set the default to the current date, which is silly, right? An artist basically getting a birth date of right now. So probably I want to set a different default, like 1700, just to be sure I get a more reasonable date in there. Birth country is going to be text. I'm also going to create a table called museum: it'll have a museum ID, which will be the primary key of that table, plus name, country, city. Likewise work: I have a work ID, and now you notice I've got a couple of other keys. This one doesn't have to be called AID, but I like it to be called AID, because then I know it's really a foreign key into the artist table. And these arrows allow me to conceptually think about the connection between these tables.
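The art-create.sql file itself isn't reproduced in this transcript, but based on the description, its statements presumably look something like the following (exact column names and defaults may differ). The last line also illustrates the multi-column index idea mentioned a moment ago:

```sql
CREATE TABLE artist (
    aid           INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    first_name    TEXT,
    last_name     TEXT,
    birth_date    DATE DEFAULT CURRENT_DATE,  -- a fixed default like 1700-01-01 would be saner
    birth_country TEXT
);

CREATE TABLE museum (
    mid     INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    name    TEXT,
    country TEXT,
    city    TEXT
);

CREATE TABLE work (
    wid         INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    aid         INTEGER,   -- foreign key into artist
    mid         INTEGER,   -- foreign key into museum
    title       TEXT,
    type        TEXT,
    finish_date DATE
);

-- A composite index, the "tuple of keys" idea mentioned earlier:
CREATE INDEX artist_name_idx ON artist (last_name, first_name);
```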
When I create a work of art, it's going to have an artist associated with it, and it will also have a museum where that art currently lives. So if I take a piece of artwork and move it to another museum, I can update this value inside the work row. Likely I'm not going to attribute that artwork to another artist, but maybe one day I realize that this Diebenkorn was actually painted by a three-year-old, and then I'd wind up pointing to a different artist ID. All of this is called the data definition language, and in creating your tables you get pretty familiar with the different types of variables you can have: text, strings of a specific character length, something called a BLOB (which is a very large chunk of bytes), integers, floats, dates, et cetera.

All right, so now let's load in some data. You're already in SQLite3, so you won't have to start it again, but I want you to read in some data: a bunch of SQL commands that will essentially pump data into these three tables. So: .read art-load.sql. Did you guys get your issue worked out? Was it permissions? That is a permission issue; you are not allowed to look at things that are not there.

Okay, we're going to do a little bit of SQLite housekeeping just to give us some pretty output. If you type .header on, that will give us some nice headers; .mode column, and you can read in the documentation exactly what this does, will give us pretty output when I run the following command. In my data manipulation language, DML, the basic form is "select star from table name"; in this case we're going to select star from work. What do you think we get? We get all of the different columns of the work table and all the data associated with them. SQLite3 is pretty nice about showing us what a date looks like, and you see that I have every work ID, which got auto-incremented as data was added. I have different artists, different titles, different types of work, and they're in different museums (you see here that I'm only pointing back to two different museums), and you see when each work of art was finished. So these are the basic queries we're going to be doing in SQL; this is how we start interacting with our rudimentary database. Select star from work: get all rows from the table work and print out all the columns. That makes sense, right? Select title comma type from work (and usually you need to put a semicolon at the end; almost all SQL implementations require you to terminate your query with a semicolon): get all rows from table work and print only those columns. So if I've got a table with 50 columns and I really only want to know the birth date and the first name of an artist, I would type select birth_date comma first_name from artist, and I would get only those printed out. So you can give that a try now. Everyone okay with that? Okay, good.

So let's do some other things. Maybe I want to make some pretty columns, and I don't want this column to be called "type," because it's not really clear what that means; maybe I want to call it "work type." So I can select title comma type AS work type. Then in the top row, where it's printing out what each column is about, it's going to say work type. So why don't you try that?
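As a quick recap, here are those queries as you might type them at the sqlite3 prompt, with column names assumed from the description above:

```sql
.header on
.mode column

SELECT * FROM work;                          -- all rows, all columns
SELECT title, type FROM work;                -- all rows, just these columns
SELECT birth_date, first_name FROM artist;   -- same idea on another table
SELECT title, type AS "work type" FROM work; -- rename a column in the output
```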
And maybe I don't like first_name with the underscore; maybe I want a capital First, a space, and then Name, and now you know how to do that. One of the things to recognize here is that we're not actually changing anything about the underlying database. We are creating, effectively, a view of that database. This is a trivial view; we're really just renaming what we think of as one of these columns. And if I wanted to take the results of this select statement and select on that, I would select on "work type," not on type.

Here's a more complex query that actually does a little bit of math for you in the query itself. Select title, then date('now') (this is a built-in function within SQLite3, and almost all databases have some concept of now) minus finish_date, and I'm going to call the result of that subtraction age; from work, where the finish date is less than January 1st, 1970; and I want to order by age in a descending fashion. So this is a slightly non-trivial query, but you can see just how powerful it is: get only those rows where the work was finished before 1970, and show the title and the age in years, in descending order by age. You want to try that? The other thing to try is without the descending, and you can press the up arrow, I believe, in SQLite3 to get back the command you just issued. You could also try it with a greater-than sign, just to see what kind of differences you get. And again, your database has like 12 entries in it or something, so this would be trivial to do within Python, say by interacting with a dict with three keywords representing the tables. But if you think of this as some massive compilation of all artwork in the universe, this is the type of query you would issue, and you'd be able to interact with all of that data right away. Everyone okay with this?

So let's do some other interactions. Select star from work where type equals 'painting'. Hmm, that's weird; I didn't get anything, even though in my table I had works of painting and woodcut and whatever. That's because equals requires the value of type to be exactly 'painting'; it's case sensitive. If I say LIKE instead, that matches strings like 'painting' with capital-P 'Painting' treated the same as lowercase 'painting', so I get back a couple of different paintings. Let's try another one: select star from work where type GLOB 'P*'. That lets me (for those that are used to globbing; there is a Python module called glob as well) do Unix-like pattern matches on words or file names. So 'P*' gives me everything that starts with a capital P, with as many other characters as you like after that. I get the Painting, but I also get a Print, because it starts with a capital P. If I only wanted things that start with lowercase p or capital P, then "rint," and I don't care what happens after that, I'd use '[pP]rint*'. But if I put a star at the front as well, it means the characters "print" or "Print" are allowed to appear in the middle of the string. So what do I get back here? Ooh, I get a Screenprint. And then it runs off the page; you see this is cut off just because SQLite3 is trying to keep the columns small for you. You can play with the parameters of how you want it presented.
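Collected in one place, the date-arithmetic, LIKE, and GLOB examples just described would look roughly like this at the prompt (again, column names are assumed from the description):

```sql
-- Date arithmetic plus ordering:
SELECT title, date('now') - finish_date AS age
  FROM work
 WHERE finish_date < '1970-01-01'
 ORDER BY age DESC;

-- '=' is exact and case sensitive; LIKE is case insensitive:
SELECT * FROM work WHERE type = 'painting';    -- may match nothing
SELECT * FROM work WHERE type LIKE 'painting'; -- matches Painting too

-- GLOB does Unix-shell-style, case-sensitive matching:
SELECT * FROM work WHERE type GLOB 'P*';         -- Painting, Print, ...
SELECT * FROM work WHERE type GLOB '[pP]rint*';  -- starts with print/Print
SELECT * FROM work WHERE type GLOB '*[pP]rint*'; -- print anywhere: Screenprint
```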
So I hope you now have a bit of a sense of how you'd wind up searching within strings. I'll give you a tiny little breakout exercise which will last a minute or two: what work was created less than 35 years ago and starts with the letter s?

So, who's got a solution? Yes, you want to read out your SQL for us? Okay. So: less than 35, and title GLOB, and then in brackets, s. Yep, and what was your answer? Everyone get that? So you're really doing essentially two where conditions joined with AND: you say where title GLOB, in quotes, 's*'; and (AND is how you connect the different requirements) the date of now minus the finish date, as we're calling it, is less than 35. And you'd have to know something about SQLite to know that when you do date subtractions like this, what you get back are ages in years. There are other ways to do date subtractions where you can convert the result into seconds, if that's how you want to interact with it. That's knowing a bit about the language that we won't have time to go into, unfortunately. Okay, everyone okay with that?

So now things get a little bit interesting, because until now we've essentially been interacting with a single table. Now I want to ask this question: what if I want to view all the works that are in Berkeley by American artists? One way to think about this: select from the artist table all the artists whose birth country was the USA. But now I want to merge those with a bunch of different works, and I want to make sure those works are actually in the Berkeley Art Museum and not in the de Young. So I want to find everything by a certain set of artists that is currently in Berkeley, and print out those works. The information needed to answer this question is located in all three tables. Doing this in Python now starts to get a little bit hairy, but here's how we'd solve it in SQL. Select museum.name; this is a new notation here. Instead of saying select name, I'm saying museum.name, because I'm going to be bringing in a bunch of different tables here, I'm going to join tables, and if two tables in this query had the same column name, I would have a conflict and SQLite3 would complain at me. So: museum.name, which I'm going to call museum; artist.last_name, which I'm going to call artist, just to make it easy; work.title; work.type; work.finish_date as date; from work. In the end what I really want is "from work," but then how do I get access to the museum name, or the artist name? Well, I have to do a so-called join. A join basically says: there is a criterion that allows me to merge these tables together. So I want to join the artist table on the criterion that the artist AID is equal to the work AID. And I want to join the museum table as well, on the museum ID being equal to the work's museum ID. This is where the foreign key and primary key connections become very important. Now I've got a full view of all three tables simultaneously, and I add: where the birth country is equal to USA, and the museum city is equal to Berkeley.

And what do I get out? Exactly what I was looking for: the two works currently in the Berkeley Art Museum by artists born in the United States, Diebenkorn and Warhol. There's not a lot of Warhol in the Berkeley Art Museum, unfortunately; there's a lot of Diebenkorn. And of course, if I wanted to select only things made in the last 35 years, I could add another where criterion. So typically what you see in SQL is: you have the select, which is the stuff you want to print out; you have all of your join statements, to connect the different tables together; then you have your filter criteria, the where clauses; and at the very end, you have your ordering criteria.
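Put together, the three-table query narrated above presumably reads something like this (a reconstruction; the museum key is assumed to be called MID, by analogy with AID):

```sql
SELECT museum.name      AS museum,
       artist.last_name AS artist,
       work.title, work.type,
       work.finish_date AS "date"
  FROM work
  JOIN artist ON artist.aid = work.aid
  JOIN museum ON museum.mid = work.mid
 WHERE artist.birth_country = 'USA'
   AND museum.city = 'Berkeley';
```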
This is a reasonable place to start getting up to speed on SQLite. Unfortunately, SQL is not a perfectly common language across all database management systems; they all have their own quirky ways of dealing with things. Knowing SQL through SQLite will get you 95% of the way to doing the same thing in Postgres or MySQL, but you'll wind up seeing some slight differences. So you have to be careful when you're doing very complex things; they're not always easily translatable across database management systems.

There are more advanced concepts that I won't have a lot of time to go into. Within a database, you can create things called views, which act like virtual tables. In some sense I could take the query I had before: if I always want to query on artists that have artwork in Berkeley and were born in the United States, and do more complex queries on that, I might create a view of exactly that query, and every time I interact with that view, I'm effectively interacting with it as if it were its own table. But you're not actually creating any new data; it's the same concept we had with NumPy, where you create a new view of the same underlying data. I already mentioned triggers: these are built-in actions the database takes on itself when something else is done to it. So if I'm really worried about people adding stuff to a certain table, I might create a trigger that logs every time somebody interacts with that table. And then transactions: these are protected series of interactions with the database, typically where you're adding or updating something, such that if anything volatile fails in there, you roll back to the last point before the transaction started. So the database has a sense of security about it; it won't get corrupted.

Something else, which I didn't put in the slides but which is very important, and which SQLite3 really doesn't have (most of the other database management systems do), is referential integrity. If I declare a certain column to be a foreign key into another table, then when I delete the row it points to, the database, if I've set everything up right, will deal with the rows that reference it. Without that, I could delete, let's say, an art museum, and these works would all still be pointing back at the Berkeley Art Museum, which just doesn't exist anymore; there's no index 2 in the museum table anymore.
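For comparison, in a system that enforces referential integrity, you declare the behavior when you create the table. A minimal sketch of what that looks like; note that modern SQLite actually can do this, but only if you switch it on:

```sql
-- In SQLite, foreign key enforcement is off unless you enable it:
PRAGMA foreign_keys = ON;

-- A variant of the work table from before, with the behavior declared:
CREATE TABLE work (
    wid   INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    aid   INTEGER REFERENCES artist(aid) ON DELETE RESTRICT,
    mid   INTEGER REFERENCES museum(mid) ON DELETE CASCADE,
    title TEXT
);
-- RESTRICT: you can't delete an artist who still has works;
-- CASCADE: deleting a museum deletes the works that point at it.
```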
So referential integrity is something you typically want to have in your database, and most other database management systems build it in for you out of the box. Typically, when you're creating your tables, you tell the system what the foreign keys are and what to do if one of the referenced rows gets deleted; those are just other SQL statements that you make.

Okay, so I'm going to hand it off to Chris now. We're going to start talking about the Pythonic interfacing with SQLite3, and then I'll come back at the end and we'll talk about HDF5. Any questions while Chris is setting up? Yes. The slides should be on bSpace, under lectures, and then week six dot zip. Is that right? Week seven dot zip? I think we're confused about whether we're in week seven or not. Okay, sorry, I think we're in week six. It's the most recent thing that's been added to bSpace. Oh, Chris, you need the thingy.

Okay, so I'm back again, as is the anthropomorphic Python. The outline for probably the next 40 minutes to an hour: first I'm going to talk about how to use SQLite3 in Python. Josh just showed you how to do queries and joins, and I'm going to show you how to do those things, as well as creating a table and populating it with information, through Python. Then we'll do some examples, kind of like the museum stuff or the Monty Python stuff, using Saturday Night Live skits, and then an example of plotting stock data, using a database to simplify what would be more difficult to do in plain Python. After that, just a couple of slides on MySQL and PostgreSQL, which have modules for Python, and then a breakout exercise, which will use similar plotting techniques to what we did in part one to plot seismograph stations across the world. The other thing I wanted to say: MySQLdb and PyGreSQL and the other things I'll be talking about in part two, don't worry about whether you have those installed or not, because the purpose of this lecture, as Josh was saying, is to introduce you to the 95% of the syntax that's common. If you know how to do SQLite3, it'll be very easy for you to jump into MySQL, which is one of the more common things you'll encounter outside of this classroom.

Okay, so SQLite3 in Python is built-in SQL database access, and it allows you to store the database either as a file or in RAM; I'll be showing you how to do both in the examples. It uses the Structured Query Language, the syntax you'll be able to use across different types of database management systems, and it attempts to protect the database from corruption. This is the type of embedded system you would use if you were just working with stuff on your own local computer or laptop, and if you didn't have to share it easily with other collaborators. In the next slides I'll show you how to create a database, create a table in that database, insert some data, and then do queries, as well as joins on specific columns within the different tables and what are called left joins.
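Here is a minimal sketch of those two ways of connecting; the file path is just an example:

```python
import sqlite3

# Back the database with a file on disk...
conn_file = sqlite3.connect('/tmp/example.db')

# ...or keep the whole thing in RAM; it vanishes when you disconnect.
conn_mem = sqlite3.connect(':memory:')
```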
Okay, so let's get into the part where I actually run some code. I want you to follow along in the notebook, but I'm going to stay within the Keynote up here; it will be the same code. The notebook you should be looking at is the SQLite3 notebook, I think. The reason I'm doing it this way is that I've made some graphical displays of the database; the formatting of the output in the notebook contains the same information, but I think it's easier to see it as I've graphically presented it here. It should be part of the lecture materials from bSpace. Are people finding it? Can you say that again, where is it? It's in the lecture files; that's the one you should have. Is it a PDF version or the running notebook? That comes later. Basically, each one of these blocks I'll be working through on a different slide, but when the notebook prints out stuff like this, what do we have, it's kind of icky looking, so it'll be nicer in the Keynote. That's the reason I'm using it.

Okay, so do the import first, of course, and then we're going to create a connection with the sqlite3.connect command; this is how we say where the database is located. For now, if you run this in the notebook, you're actually going to be creating a file called example.db within the /tmp directory on your local disk. I'll show you in another example how to do it in memory, but you don't have to worry about that too much now. Basically, the advantage of storing it in memory is that you don't have to delete it when you're done; but if it's in /tmp, your system generally cleans that stuff out on its own, so you don't have to worry about deleting it either, and it's not going to be a large file anyway.

Then we create a cursor, through connection.cursor(), and this is how we'll access that database. Every time I want to send a command to SQLite3, I'm going to do it in a string; in this example I'm calling it sql_cmd, and I'm using triple quotes so that it can extend over multiple lines, just for readability. You can call the string whatever you want, but it does have to be a string, and it has to be SQL syntax. The command here is going to create a table, which we'll call dan_aykroyd, and give it columns: id, which is type integer, a primary key, and auto-incremented, similar to how we did artist ID or work ID in the previous examples from Josh; skit_title, which is text; air_date, which is a date; season, which is an int; episode, ep, which is an int; and role, which is text, okay? So we're just creating this table in that database, and that's it, we're done. Let's look at what that database and table look like, and this is it: just a bunch of empty columns right now. All we've done is set it up, and on the next slide we're going to populate it with some data.
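Reconstructed from the description, the code on this slide looks roughly like this, with table and variable names as given in the lecture:

```python
import sqlite3

# Connect to (and, the first time, create) the database file.
connection = sqlite3.connect('/tmp/example.db')
cursor = connection.cursor()

# Triple-quoted so the SQL can span multiple lines for readability.
sql_cmd = """CREATE TABLE dan_aykroyd (
                 id INTEGER PRIMARY KEY AUTOINCREMENT,
                 skit_title TEXT,
                 air_date DATE,
                 season INT,
                 ep INT,
                 role TEXT)"""
cursor.execute(sql_cmd)
```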
Okay, so now let's write out, in Python, a list of tuples, where each tuple is some skit data. We're going to follow the same format as we defined when we created the table, so we've got the skit name, the air date, the season, the episode, and then Dan Aykroyd's character, his role. We've got three of these here from the '70s, when he was on the show. Then we loop through skit_data and basically just build an SQL command and execute it for each skit in skit_data. So this is our command syntax: insert into table dan_aykroyd, and then in parentheses we give it skit_title, air_date, season, ep, and role; that part is a fixed string that doesn't yet contain anything from the skit_data list; then comes the keyword VALUES, and we convert each skit tuple into a string at the end. This is what makes it simple and advantageous to do in Python: we are scripting the insertion commands, in SQL syntax, in this loop. I'll walk through it once more. The syntax is insert into, then the table name, then in parentheses the column names, then VALUES, and after that we convert each tuple into its string form, which is what the SQL syntax expects here. Then we execute each of those insertions, and we build up the database. If you're following along, the notebook has a print command which prints out a similar-looking table; it's not organized quite as cleanly, but all the data should be there when you print it.

One of the important things to point out here is that the id was not something that you supplied. Right, ignore the id; thank you, that's a very good point. Because we declared it auto-increment, SQLite knows to do that itself: it auto-increments by one each time we give it an insertion command into dan_aykroyd. Any questions on that so far?

Isn't there a print in the notebook? Let me just load up everything here. You execute cursor.execute with a string, select star from; that's right here. It doesn't print out in that exact format, no, that's correct. So here I've done the database creation and created the first table, then here is the insertion of the data, then a simple query, and here we have it. So if you wanted to, you would just select star from dan_aykroyd, and you can see everything there, all three of them. Right, thank you. I should speak to this; well, it's actually the next slide, so I can do it here. I wasn't printing it out before, because I was just showing how to insert data.

So if you want to print it out, that's part of the query here, which is the next line in the notebook file. We want all of the information for skits that happened after October 12th, 1975. We write the SQL query as a string, like normal, and execute it on the cursor. And this fetchall command is how we get the data back from SQLite3 into Python. db_info is then an iterable list of all of our information, essentially the star. And for each entry in db_info we print it, and this is what it looks like.
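The insertion loop plus the query, again reconstructed from the description. The skit rows here are placeholders, since the actual slide data isn't reproduced in the transcript; note also that the sqlite3 module's '?' placeholders are the usual idiom for this, rather than hand-building the string:

```python
# Illustrative rows: (skit_title, air_date, season, ep, role).
skit_data = [('The Coneheads at Home', '1977-01-22', 2, 12, 'Beldar'),
             ('Bass-O-Matic',          '1976-04-17', 1, 17, 'Pitchman')]

for skit in skit_data:
    # The lecture builds the SQL by string formatting; the id column
    # is omitted, and AUTOINCREMENT fills it in for us.
    sql_cmd = ("INSERT INTO dan_aykroyd "
               "(skit_title, air_date, season, ep, role) VALUES " + str(skit))
    cursor.execute(sql_cmd)

# (The parameterized form, which handles quoting for you, would be:
#  cursor.execute("INSERT INTO dan_aykroyd (skit_title, air_date, "
#                 "season, ep, role) VALUES (?,?,?,?,?)", skit))

# Query it back: fetchall() brings the results from SQLite into Python.
cursor.execute("SELECT * FROM dan_aykroyd WHERE air_date > '1975-10-12'")
db_info = cursor.fetchall()
for entry in db_info:
    print(entry)
```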
And at the end, we commit and close our connection. I believe the commit is more important if you're actually doing insertions and changing data; in this case, for the query, we're not. But just get into the habit of doing both those things when you're done with the data. There's a sort of temporary holding area for all the stuff you do until you commit. So if you're adding a whole bunch of data, you can essentially add it into that staging area and then do one commit, and it goes much more quickly than if you constantly commit. That's also part of the transaction idea: if an error happens inside all of what you're doing, say inside a try/except, and you haven't committed, it won't actually change the database. Why didn't we commit right after inserting the data, though? Because we didn't want to close things out yet; it's all the same interpreter session, and if I'd done that, I'd have to start over again from the beginning. Anyway, so here is the full database, and to signify what we queried, I've just boxed it; that's the information we got out. All right.

Can you make more than one cursor per connection? I have not tried to do that, and I don't know what the benefit would necessarily be. Well, if you were doing some threaded operation, with multiple threads, you could have multiple cursors playing with the same data. This is part of concurrency. In SQLite3 this is not recommended, because SQLite3 is not managing transactions very gracefully, and so if you have multiple cursors all trying to write stuff in... reading is fine, right? If you're just reading, you can have multiple cursors, but you generally don't want to have more than one cursor writing. If you had MySQL or Postgres as your back end, it would handle things a little more gracefully, but I wouldn't do it.

What exactly is a cursor, and why is it distinct from the connection itself? What's the difference? In some sense, I think of it like a session. When you open up a connection, you're basically getting access to that database, and then a cursor is a session on that connection. You can have a whole bunch of things that one cursor has done and not committed, and a whole bunch of things that another cursor has done and not committed. If I bring in a third cursor and try to pull out that data, it won't be in the database: until one cursor commits, say, a whole bunch of rows, none of the other sessions will see them. In the context of versioning, you have Git, where you're committing to your local system, and then you do a push; the push is the equivalent of a commit here, where you're actually saying, now the central server has knowledge of what I've done. I think of it as a staging area.
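A minimal sketch of that staging-area idea, continuing with the connection and cursor from above:

```python
# Nothing a cursor has done is visible to other connections until commit:
try:
    cursor.execute("INSERT INTO dan_aykroyd (skit_title) VALUES ('oops')")
    # ... more work that might raise an exception ...
    connection.commit()      # only now do the changes become permanent
except Exception:
    connection.rollback()    # on failure, fall back to the last commit

# connection.close() goes at the very end, once you're done entirely.
```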
Concurrent commits like that happen all the time, and they will wind up essentially corrupting your data. But this concept of a cursor persists over almost all database management systems, so getting comfortable with the idea is fine; with other database management systems you generally don't get into as much trouble. Okay, so this first slide of this process, creating another table and then running the join, should just be building on what we did previously, inserting different data. So we're gonna create an almost identical table for Jane Curtin; we're just calling it JaneCurtin, and it has the same columns. This is not necessary, but in this example it just worked out that way. And then we're gonna insert three skits for her in the same way. And then this last entry here, this SQL command, select DanAykroyd.skit_title, DanAykroyd.air_date, et cetera, this is where we're gonna do the join. So the operation we wanna do is we wanna know all of the skits which featured both Dan Aykroyd and Jane Curtin, and we wanna know the roles that they both played. So we're selecting from DanAykroyd the skit title, and from DanAykroyd the air date and the season and the episode and Dan Aykroyd's role, and from JaneCurtin we want her role. And everything except for the roles is gonna be shared across both tables, so it doesn't really matter whether we pull the air date, skit title, episode, or season from DanAykroyd or JaneCurtin. Oh, and this episode here should be dot ep, I'm sorry; I think it's fixed in the notebook though. And then we wanna go from DanAykroyd comma JaneCurtin. So we're saying select all of this information from both tables, table DanAykroyd and table JaneCurtin, where, and this is the important part, this is a simple type of join on columns, where DanAykroyd.skit_title equals JaneCurtin.skit_title. So we're saying where these two things match, and where the air dates match for both tables. So it's possible, for example, that the Coneheads makes multiple appearances, since it was such a popular skit, and it's always titled just The Coneheads or The Coneheads at Home or whatever it is. But maybe Jane Curtin was sick that day and somebody else covered for her. We wanna know whether the air date was the same. So if both the air date and the skit title are the same, then they must have appeared in the skit together. And then we run the print and we get The Coneheads at Home, which is the one skit that they were both in, where Dan Aykroyd was Beldar and Jane was Prymaat. And so here is a graphical representation of this simple join on columns. So this is our database with all the data in it, with tables DanAykroyd and JaneCurtin. And we're gonna ask where DanAykroyd.skit_title equals JaneCurtin.skit_title and DanAykroyd.air_date equals JaneCurtin.air_date. Where are these the same? And here I highlight in orange the one entry in both tables where these conditions are satisfied. And then the red box outlines all the data that is fetched: the select arguments, which were Dan Aykroyd's skit title, air date, season, episode, and role, and Jane Curtin's role. All right. So that's one of the simple ways to run a join.
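Here's a sketch of that simple join on columns, continuing the session from the insertion sketch above. The JaneCurtin row is a placeholder chosen so the join returns something.

```python
import sqlite3

connection = sqlite3.connect("snl.db")   # same placeholder file as before
cursor = connection.cursor()
cursor.execute("""CREATE TABLE IF NOT EXISTS JaneCurtin
                  (id INTEGER PRIMARY KEY AUTOINCREMENT,
                   skit TEXT, air_date TEXT, season INT, ep INT, role TEXT)""")
cursor.execute("INSERT INTO JaneCurtin (skit, air_date, season, ep, role) "
               "VALUES ('The Coneheads', '01/15/77', 2, 11, 'Prymaat')")

# a simple join on columns: rows where skit title AND air date match
query = """SELECT DanAykroyd.skit, DanAykroyd.air_date, DanAykroyd.season,
                  DanAykroyd.ep, DanAykroyd.role, JaneCurtin.role
           FROM DanAykroyd, JaneCurtin
           WHERE DanAykroyd.skit = JaneCurtin.skit
             AND DanAykroyd.air_date = JaneCurtin.air_date"""
for row in cursor.execute(query).fetchall():
    print(row)   # skits featuring both actors, with both roles
```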
And it's, I think, one of the easier ones to wrap your head around, because it's more explicit about the column information you wanna merge these tables on. A left join is another way of doing this. And I say this because, for example, I don't wanna have to reopen the database. The new SQL command is this, where we're gonna select the same information as before: the skit title, air date, season, episode, and role from DanAykroyd, as well as the role from JaneCurtin. And we're gonna do it from DanAykroyd left join JaneCurtin here. This can be thought of as a new table: DanAykroyd left join JaneCurtin is sort of a new table that we're creating through this left join, a joined table with information from both DanAykroyd and JaneCurtin. And we're gonna select from this table on the requirements that DanAykroyd.skit_title equals JaneCurtin.skit_title and DanAykroyd.air_date equals JaneCurtin.air_date, the same requirements as before, and fetch all the data and then print it. But this time it's gonna be a little bit different, and I'll show this graphically on the next slide. So if we imagine this is the DanAykroyd table and this is the JaneCurtin table, a left join is sort of like we take the JaneCurtin table and overlay it on the DanAykroyd table and see what information matches. And in this case the first two don't match, because they've got different skit titles and different air dates, but the third, The Coneheads at Home, does, and the information that we get out of that is the information we care about. So the left join takes the DanAykroyd table, with all the information there, and says: if the skit title and the air date match, put in Jane Curtin's role and build this joined table; if they don't match, Jane Curtin's role is None. And the final answer was of course the third row in the joined table, which is the same result we got before. We're essentially running the same query, just through a left join instead of a simple join on columns. So are there any questions about joins in SQLite 3? Oh, it's just a different way of doing it in this case; sometimes it can be more efficient if your data lends itself to that case. So why would you do a left join versus a join on columns? A left join allows you to not have to specify the columns in quite the same way, and I think it's a little bit less taxing on the memory resources. Whereas in the simple join, the first one we did, it basically has to scan through and ingest all of both tables, in the left join it just kind of smushes them together: if there is matching data, you get the full row, and if there's no data for Jane Curtin, you just get a None. So they produce somewhat different outputs. But when we apply the where, where we require that the skit title is the same and the air date is the same, we're taking all of the information that could be returned and saying we only care about this one smaller single skit. And that skit happens to be the same from both these queries. But had we not used these same specifications here, it would have returned the three-row table, which is mostly Dan Aykroyd, with a None in the first two entries for Jane Curtin.
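Here is the same query written as a LEFT JOIN, continuing the sketch above. One design note: the lecture filters with a WHERE clause after the join; putting the matching conditions in the ON clause instead, as below, lets the unmatched Dan Aykroyd rows survive with None standing in for Jane Curtin's role, which is exactly the three-row table just described.

```python
query = """SELECT DanAykroyd.skit, DanAykroyd.air_date, DanAykroyd.season,
                  DanAykroyd.ep, DanAykroyd.role, JaneCurtin.role
           FROM DanAykroyd LEFT JOIN JaneCurtin
           ON DanAykroyd.skit = JaneCurtin.skit
              AND DanAykroyd.air_date = JaneCurtin.air_date"""
for row in cursor.execute(query).fetchall():
    print(row)   # every DanAykroyd row; unmatched ones end in None
```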
So we could have asked on that, where does Jane Curtin not have a role? And it would have been everything the same, except with JaneCurtin.role equals None from this table. The way I think about joins is that you only get results back if there is a match on both the left table and the right table. But if you do a left join, you're always guaranteed to get back everything from the left table, and if you wind up getting a match from the right table, you get that information as well; if not, you get a None. So if you want all the information from table one, and you also want to append the matching information from table two, you do a left join. Likewise, with a right join you get everything from the right table, and a None where there's no match on the left. So does it output a new table? It's a virtual table that is in memory for the moment that you get back the result. You can save it: there's a concept of a temporary table that you can name. So for your entire session you could save a temporary table of the result of your crazy query, and then you can start querying that table; and as soon as you close your connection, that temporary table gets destroyed. And if you don't want a table anymore, say we don't want the Jane Curtin table, we can drop it: you say drop table and then the name. I mean, realistically, if you're going to do this in a better database sense, you'd have one table called actors and another table called skits, and you would do the joins that way. Calling a table after the actor's name is silly, right? Well, unless you're Dan Aykroyd and all you care about is Dan Aykroyd. For the purpose of what we're trying to illustrate it's just fine, but it's not great database architecture. So the next example, before we get to the breakout, is how we can do plotting, and specifically why it's advantageous to do plotting using a database for some types of problems. And everything is in the notebook for this. So this first block is just gonna load up the things we're gonna need. We're actually gonna be pulling over data from finance.yahoo.com for stocks. And so we're defining this function, get_stock_data, which is essentially just a urllib2 call to download the data for a stock symbol, then creating a file to save the data into, then loading it in and returning the array with that data in it. So everything in this function should be familiar to everybody, so I'll move on. This is an example of what the data file looks like. Each stock symbol has a very long list of dates that the stock was open for trade, and then its open value, the high value during the day, the low value, the close at the end of the day, the volume, and this adjusted close value. And it goes from the day that you downloaded it, so it might be up to date to today, all the way down through February 1st, 2000, because that's when Yahoo started doing this. Okay, so the way we're gonna use this function is we're going to create a database with all the stock information. And the first line here is how we create a database in RAM: instead of passing it a file name, just give it the string colon memory colon, and this will tell SQLite 3 to create the database in RAM.
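Before the stocks example, a quick sketch of the temporary-table and drop-table ideas just mentioned, continuing the same session; the table name both_in is hypothetical.

```python
# save the join result as a named temporary table, query it, then drop
# it; it would also vanish on its own when the connection closes
cursor.execute("""CREATE TEMPORARY TABLE both_in AS
                  SELECT DanAykroyd.skit, DanAykroyd.air_date
                  FROM DanAykroyd, JaneCurtin
                  WHERE DanAykroyd.skit = JaneCurtin.skit
                    AND DanAykroyd.air_date = JaneCurtin.air_date""")
print(cursor.execute("SELECT * FROM both_in").fetchall())
cursor.execute("DROP TABLE both_in")
```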
And for this plotting example, the data you download is gonna be stored in files on your computer called, for example, vz.txt, where VZ is Verizon's stock symbol. When you run it, it'll create that file and store it in your local directory. So you have all the data in human-readable format in these text files, but you're also gonna load it into the database, so that we can take advantage of the queries the database provides. So we create a table stocks in our database in memory, with a column for each one of the columns in the data files. So there'll be a primary key auto increment ID, which is just as normal, and then the stock symbol, and the day, which will be an int, because for the purposes of this example I'm converting dates into integers so that when we plot it, it's a little bit smoother and we don't have to worry about datetime objects. Then the open value, the high value, the low value, the close value, the volume, and the adjusted close, all as floats, because they're decimal data. Then we execute the command to create that table, and for each company that we're interested in, in this case Verizon, Apple, Ford, Microsoft, and Bank of America, we create an entry in the stock symbols list; if you wanted to, you could modify this to track whatever stocks you care about, I just picked some popular ones. And we're creating a reference date, which is where we're gonna be incrementing the integer day from; when we end up plotting this data it will be the number of days from January 1st, 2000, just to simplify the plotting. So for each company in stock symbols, we're gonna call get_stock_data, which as we recall returns an array of the data, and then for each row in the data we're gonna have an SQL command to enter that row into our table. So here we're just building up the database table with all of the stock information that we download when we run get_stock_data. So I'm gonna run this, and this will take about five seconds because it has to download all that stuff, and there, it's done. So if you do that as well, you will then have populated in your working directory files like this. So this is Apple, Bank of America, Ford; we just downloaded and parsed those and loaded them into NumPy arrays. And then the last step is gonna be plotting, and this is hopefully where you can see how the database provides a little bit of an advantage over doing this with straight Python and NumPy for the parsing of the tables. So for each symbol in stock symbols, we're gonna create a query where we select the day and the close from the table stocks where the symbol is the one we care about. And then we execute that, and we get back a list, and it is super simple to slice into that and plot it up. So we call np.array on it so we can do the slicing, and then we slice on the zeroth column, which gives us the day, and the slice on the next column gives us the closing value, and then we can just plot it up and label it with the stock symbol, and that's all we have to do for each one. Some of these last four lines are just to make the plot look nice and to put the legend in a decent location. Well, let's try six and five. Yeah, yeah. Well, I got in the habit of it. Anyway, there it is. And so the advantage of this is that we could use the database to do the query on our stock symbol. Yeah, it's ridiculous.
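Here is a condensed, self-contained sketch of that example. Since the point is the in-RAM database and the per-symbol query, not the download, the Yahoo CSV fetch is replaced by a placeholder function returning fake data.

```python
import sqlite3
import numpy as np
import matplotlib.pyplot as plt

def get_stock_data(symbol):
    """Stand-in for the lecture's urllib2 download: returns an array
    with a day-number column and a closing-price column (fake data)."""
    days = np.arange(100)
    close = 30 + np.cumsum(np.random.randn(100))
    return np.column_stack([days, close])

connection = sqlite3.connect(":memory:")     # the database lives in RAM
cursor = connection.cursor()
cursor.execute("""CREATE TABLE stocks
                  (id INTEGER PRIMARY KEY AUTOINCREMENT,
                   symbol TEXT, day INT, close FLOAT)""")

symbols = ["VZ", "AAPL", "F"]
for symbol in symbols:
    for day, close in get_stock_data(symbol):
        cursor.execute("INSERT INTO stocks (symbol, day, close) "
                       "VALUES (?, ?, ?)", (symbol, int(day), float(close)))

for symbol in symbols:
    # the database does the filtering that a Python if-test would do
    cursor.execute("SELECT day, close FROM stocks WHERE symbol = ?",
                   (symbol,))
    data = np.array(cursor.fetchall())
    plt.plot(data[:, 0], data[:, 1], label=symbol)

plt.legend(loc="best")
plt.show()
```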
And you're gonna see this in the breakout that you'll encounter shortly: you can use the database to effectively enforce a logic statement that you would normally write in Python. So if instead of a database we just had a huge list or array, where each row in the database was a row in our list or array, we would have to say, if this entry's stock symbol equals the stock symbol we care about, plot that single row's point and make the color this. Instead, we can use the database to do a query only on that stock symbol, get back essentially a whole list that we convert into an array of the information we wanna plot, and plot it all at once. So that saves a lot of time and computation over plotting individual points for individual rows, if we wanted to be fancy with our coloring and manipulations in the visualization. Okay, so are there any questions on this example? And I believe, yeah, here it is. This is the same thing as I had before, and you can also play around and plot other things than just the closing value. You can plot the volume, for example; Microsoft's volume has a very different kind of pattern to it than this does, and you can see that. It's kinda cool. Okay, so in the next few slides I'll be talking about MySQL and PostgreSQL before the breakout. So MySQLdb is the module for Python which allows you easy access into a MySQL database. The main difference is that it's thread compatible, and, compared with SQLite, it is safer to use, because the MySQL server is in charge of making sure that you're not concurrently writing into the same blocks in the database, so it has better checks and fail-safes for preventing data corruption in that sense. It also has a user/password permission system built into the server, so that if multiple people are sharing the MySQL database, you have to provide a username and password, and it knows what you're able to do, what tables you can query, what tables you can insert into; so if you are the administrator of the database, you have protection and kind of a read-permission setup if you're sharing it with other people. It is not built in with Python; you have to download it yourself. You also, to be able to use it, have to have MySQL on your system. I suppose if you are connecting to a MySQL database that is not locally hosted, which is one of the more common use cases, you don't have to have MySQL installed locally, but if you want to test and develop with it, you should probably get it from the SourceForge page. And just a word of caution: it's possible, but sometimes a little bit tricky, to get MySQL installed on a normal consumer laptop. Most of the syntax is similar. One thing that's different, though, and I was reminded of this when Josh was using the glob command in previous examples: in SQLite 3 an asterisk is the wildcard, and that is not the same in MySQL syntax. I believe it's a percent sign, and there's another one, but just be aware that stuff like that can change. So look those up and know, when you're working in a specific database dialect, which one to use.
This is just a very small blob to show you how to get started with MySQLdb. Just import it, and then this MySQLdb.connect, oh, I lost my cursor again, but that's how you would connect to the database. You have to provide it the host, so you would give it an IP address there, for example, or localhost if it's local, and then you give it a username and a password, and which database on the host you want to connect into, because a MySQL database server can have multiple databases exposed, each with multiple tables in it. And a connection timeout, because MySQLdb will often, as I said, let you access MySQL servers that are not local, and if the connection is broken you want the fail-safe to know to exit out of your connection when the timeout occurs. And then after that, the cursor syntax and the SQL syntax are pretty much the same: just connect the cursor, and then you can execute queries and use fetchall to get back your data. What's kinda nice is that MySQLdb goes through the trouble of translating data types from MySQL essentially into Python data types. So if there was a date in MySQL and you retrieved it, MySQLdb would, in this db_info object, which is a list, create the proper Pythonic datetime object to represent the data you got back from the database. And likewise, if you had, for example, a float versus a string versus an integer, MySQLdb would try to intelligently translate the data types from MySQL data types to Python data types. Okay, and then this slide is just to say that there's also PyGreSQL and many different others to use with different types of database servers and management systems. I've only ever gotten familiar with MySQLdb, but if there's a problem, there's probably a solution already out there for it in Python. All right, so let's move on to the breakout exercise. I have provided for you in your lecture files a world.txt and a stationslist.txt. So world.txt is a series of lines to create polygons which will draw a kind of crude Mercator-projection map of the world, and stationslist.txt is a large list of seismograph stations located by latitude and longitude in the world. And I want you to plot up what the world looks like just from the polygons, then import the stationslist.txt data into a database locally with SQLite 3, and plot the stations, their locations in latitude and longitude, on top of the world map. But I want you to use different colors for open, reserved, and closed stations. There's a column in stationslist.txt that gives the status of the station, which can be open, reserved, or closed. And I want you to use queries, for example where status equals reserved, to pull out a list which you'll turn into an array, and then you can plot them all as one color, all with the same parameters for alpha values or symbol sizes and things like that, instead of doing a really inefficient for-loop and logic-test combination. For extra credit, there's also station elevation, so why don't you try and colorize the open stations based on station elevation; that should be for those who get it done quickly. So go ahead, and I'll be here to answer questions. I've provided breakout solutions again in a notebook, but don't look at them. I'm sure you know the drill: look when we're done, and I'll show you the answers when we get back from break.
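To make the MySQLdb connection pattern from earlier in this section concrete, here is a minimal sketch; the host, credentials, database, and table names are all placeholders. (MySQLdb is the Python 2-era module; the mysqlclient fork provides the same interface for Python 3.)

```python
import MySQLdb

# placeholder host/credentials; a MySQL server must actually be running
connection = MySQLdb.connect(host="192.168.1.10",   # or "localhost"
                             user="username",
                             passwd="secret",
                             db="research",
                             connect_timeout=10)    # bail out if unreachable
cursor = connection.cursor()
cursor.execute("SELECT * FROM some_table")
db_info = cursor.fetchall()   # MySQLdb maps SQL types to Python types,
                              # e.g. DATE columns come back as datetime.date
connection.close()
```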
And our XKCD: now that you know databases, and that when you create tables you can also drop tables. The first panel says, hi, this is your son's school, we're having some computer trouble. Oh dear, did he break something? In a way. Did you really name your son Robert'); DROP TABLE Students;--? Oh yes, little Bobby Tables, we call him. Well, we've lost this year's student records. I hope you're happy. And I hope you've learned to sanitize your database inputs. Obviously we've been presenting to you kind of anesthetized data, where nothing is actually malicious, but you can imagine executing SQL inputs where somebody constructs a malicious piece of string. Many of the database management systems will try to protect against that, and the Python APIs also give you a way of sending essentially sanitized data when you do inserts. So if you really care about this, which probably some of you will, you'll have to get to know some of the details of the security, and more than security, the integrity, of databases. One of the powers of coercing your data into a database, other than what Chris was just telling you about with SQLite 3, is first of all that it makes you think about the structure of your data ahead of time. As you get data from all over the place, perhaps heterogeneous data, it makes you coerce it into a more regularized form, so that when you do your analytics on that data, you've essentially already got a sanitized version of it. So that's one great thing about databasing large amounts of your data. The other thing, of course, which you don't get from SQLite 3, is that perhaps those stations are being opened and closed, and maybe somebody in a research group that you're collaborating with is in charge of monitoring which stations are opened and closed, and they're writing into a MySQL database that both of you have access to. They're dealing with that, but your job is to do analytics on it on a daily basis, so that database could constantly be updating. So it's the fact that you can share data across platforms, because your collaborator, who might be in Europe, might be using a completely different interface: a web-based interface to that database, a command-line interface, maybe some JavaScript thing they're using to interact with the data. You don't really care where the data came from; if you're all sharing that database, that's really one of the big powers of it all. The thing I do like about SQLite 3, of course, is that instead of writing the database into memory, which gets destroyed as soon as you close your session, you can save it into a file, and then you could actually send it to your friend and say, here's my little Python code that makes this nice little plot of the world, and here's the database that goes along with it. So you hand them a file, which is the database, because with SQLite 3 it's essentially just a single file, something.db, plus the Python script; you hand them those two, and they can completely reproduce what you've done. Whereas if you have a bunch of different text files all over the place in a big tarball and you give that to them, if they're not able to wrangle the data in the same way that you did, they're not able to reproduce what you've done. So it's worth spending time getting to know sqlite3 as the module in Python that interacts with SQLite 3.
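Here is a small sketch of the sanitized-insert idea behind the Bobby Tables joke: let the sqlite3 module do the parameter substitution with ? placeholders instead of splicing values into the SQL string yourself, so a malicious value can never terminate the statement and smuggle in a DROP TABLE.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE students (name TEXT)")

name = "Robert'); DROP TABLE students;--"   # little Bobby Tables

# Unsafe: string formatting would splice the payload straight into the SQL.
# cursor.execute("INSERT INTO students VALUES ('%s')" % name)

# Safe: the value is bound as data and never parsed as SQL.
cursor.execute("INSERT INTO students VALUES (?)", (name,))
print(cursor.execute("SELECT * FROM students").fetchall())
```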
But one of the things that I think you're gonna wanna start doing, if you are dealing with lots of data on a daily basis, is start playing with something called SQLAlchemy, which is a so-called object-relational mapper. What this does is abstract away the peculiarities of a given database management system, so that when you're interacting with that data in Python, you see a consistent view of it. So at some level it means it doesn't matter whether you have SQLite 3 under the hood, or MySQL or Oracle or Postgres or what have you: SQLAlchemy allows you to write Python code where you're viewing the data in the database as essentially Pythonic objects. All of the different APIs are effectively letting you do that, but they all have their own peculiarities, and when you wanna issue an SQL query, like in SQLite 3, you have to literally write a string, which is the SQL query, which you then execute. But in SQLAlchemy, what you wind up doing is joins on objects, for instance, and if you wanna do sorting, you're sorting on objects, which under the hood is effectively doing that query. What SQLAlchemy has done for you is let you interact with all of these different databases that I've listed up here in a coherent fashion, and that is really nice. Because one day you may realize, I don't wanna just have a local file called something.db, I wanna start interacting with a database that lives in a national supercomputing center or something, right? And so you wouldn't have to change your code at all. I've been through the pain enough times to tell you that I've written code that just does SQLite 3, and then someone says, yeah, but now it doesn't scale, now we have to do distributed; and then I try to move to MySQL or a different database management system like Postgres, and all of a sudden you're spending all your time dealing with the peculiarities of the different SQL queries you would make, and the slightly different constructions of the creation of those tables in those different databases, and that's incredibly painful. If I'd done this with SQLAlchemy, it would all be okay. So if you're really serious about doing scientific programming, you should give some real thought to SQLAlchemy. We didn't get you up to speed on it because it's a bit more of an abstraction away from SQL than I think you need right now, but it's worth spending some time with.
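A minimal SQLAlchemy sketch of that idea, with hypothetical names: define a Pythonic class once, and the same code runs against SQLite, MySQL, Postgres, and so on, just by changing the engine URL. (This uses the newer API; older SQLAlchemy imports declarative_base from sqlalchemy.ext.declarative.)

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Skit(Base):
    __tablename__ = "skits"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    season = Column(Integer)

# swap "sqlite:///skits.db" for e.g. a Postgres URL and nothing below changes
engine = create_engine("sqlite:///skits.db")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add(Skit(title="The Coneheads", season=2))
session.commit()

# queries are expressed on objects, not on hand-written SQL strings
for skit in session.query(Skit).filter(Skit.season == 2):
    print(skit.title)
```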
Okay, so now we're gonna change course a little bit and talk about a different way of thinking about databases than a relational database. This is sort of akin to SQLite, in that you've got a single file somewhere, but instead of storing a relational database, you're gonna store a hierarchical database. And one of the nice websites you can go to, maintained at NERSC at LBL, the scientific computing center, gives a consistent snapshot of what people are actually using in the scientific computing world for data management. They have this statement: there is no common scientific data format used across all scientific disciplines. For instance, NetCDF is commonly used in climate modeling, FITS is preferred for storing astronomy data, while HDF5 is used, for instance, in the combustion and fusion simulation communities. So when you're talking about transport of large data, say I hand you some 100-megabyte file with a lot of data that you and I are gonna wanna talk about and interact over, generally we wind up getting into domain-specific data formats. SQLite 3 is great because I can just hand you a file, but what's very clear is that SQLite 3 is not a great way to be transporting binary images. So if I have a bunch of 2K-by-2K images and I wanna put them all together into one file and hand them to you, I should probably just do it as a tarball; but if I wanna do it as a self-contained unit, SQLite 3 is not the way to do it. You wanna do it with some of these other file formats. And these are all essentially hierarchical, and this gets back to the view of a hierarchical database that we had before. So there is a C-based API, developed by NCSA, essentially a file format standard called HDF5, that allows us to see our data in a hierarchical way. And when we think hierarchical, one way to think about it is as a document structure, like an XML file or a JSON string or a Python dictionary, where you've got keywords and values, and then you might have more dictionaries nested inside; that nested sense is hierarchical. Or think of the slash on my Unix computer, the top-level directory of what I have on disk, and then I go to slash users and I get my user accounts; that is hierarchical. And what's nice is that this database, which is essentially a file, lets you interact with the data inside of it as if it were part of a file system. So there are two types of objects in HDF5. There are data sets: multi-dimensional homogeneous arrays, which can have attributes associated with them, and tables, which are essentially like record arrays in NumPy. And there are groups: containers that hold data sets and other groups, like folders. So you think of groups as essentially directories; you can traverse through a bunch of different groups, and finally you get to the stuff you want to deal with, the data sets. And I'll show you some examples of that in a bit. So, the really nice things about HDF5. One, it's very portable: unlike a pickle file, where somebody who doesn't have Python and doesn't know how to interact with Python is completely hosed, HDF5 is a standard where there are API plug-ins across essentially all different languages. So you could hand somebody an HDF5 file and they would know what to do with it, because it's self-describing: it has metadata at each node, at each group, to describe what's going to be in that group. The other thing is it's hugely scalable, so it's essentially as big as you can make files and move them around. And you can even think about having a single HDF5 file which is conceptually one file but physically many different files, all connected to each other essentially with pointers, where you can say, to go farther down into this group, point to another HDF5 file which lives somewhere else, because everything inside of the HDF5 file is considered essentially what you think of as a directory. Why not, when you get to the endpoint, just point to another directory in another part of the file system?
And because file systems, as viewed by these different APIs, can be made to look like they're on a common platform even though they're distributed across the net, all of this stuff can run in a very massively parallel environment. And the other issue, which I didn't mention, is the endian issue, which is a big one of course with pickles; here the file is self-describing, so it says, what you're about to see is in little-endian format, and whenever an API is working on a computer system that's natively not little-endian, it knows it has to switch things around. So there's a very nice third-party module called PyTables, which is a Pythonic API to HDF5, and there is another competing version of this, both of which I think come with the Enthought distribution. These are very mature; in fact, there's a company built on top of PyTables with sort of an enterprise version of it, but it's open source and quite powerful. And it's also built on top of NumPy, so that when you're pulling over and dealing with arrays, you're basically seeing them, viewing them, and able to slice them up as if they were NumPy arrays. And it works on compressed data files: there is a concept within HDF5 of compressed data, so that when you're writing a huge chunk of data into the file, you can actually set the compression level, which is really nice. Okay, so here's the structure of an HDF5 file. If I say tables.openFile, effectively what I'm getting is the root, and its attributes are the metadata about that root: it's saying, at the root level, how is the data compressed? Is there any sort of hash checking, so you can do checksum verification on all the data inside of it, et cetera. I can create data sets: I can create a group, call it datasets, and inside of that I can create arrays, and I can have tables in there as well, and I can create links, et cetera. It takes some getting used to, you actually have to play with this, but it becomes pretty intuitive pretty quickly. The thing to note about the arrays is that, unlike NumPy arrays, these things can be extended; so if you wanna think about this in a row-database sense, I can just keep appending more and more rows onto this data, and it's totally fine within the HDF5 framework. All right, so let me go to the notebook, which I think you have already; it's called HDF5, and we'll step through the creation of one of these file systems within HDF5. So I'll import numpy, import tables, I'll just grab everything to make things easier, and I'll get an HDF5 file, which is basically just an open file handle pointing to an HDF5 file. I say openFile, I'll create something called spam.h5 in write mode, and I'll give it a title, which is part of the metadata. And if I print out h5file, it tells me it's a file, what the file name is, what the title is, what the mode is, and some other information. In particular, this filters attribute is pretty interesting, because it's telling us, at the root level, when I start shoving data in, how should that data be handled?
So I have some concept of a compression level, and this thing Fletcher32: setting it to False says don't do checksums on the data. You can go into the documentation to see what all these other different things are. Obviously, if I set checksums to be true, I could do it up here, and then writing in data is gonna be a little bit slower, and pulling out data is gonna be slower, because it's one way of making sure that the file doesn't get corrupted and the data inside of it doesn't get corrupted. And I've got a root group associated with this. Okay, and I said a little bit in here about what these different filters are. Now let's create a 100-by-100 random image with create array and associate it with a group called datasets. So we've got this thing called datasets, which is a new group, so essentially think of this as making a new directory; it will hang off the root of this file, I'll call it datasets, and I'll give it a title, test data sets. Then I'll do a create array, which will be created inside of datasets; I'll call this thing dataset one, and I'll make it a 100-by-100 random image, okay? So you can see some of the information about the byte ordering, that's the endian ordering, and what flavor this is in; it recognizes this was a NumPy array. Now, inside of datasets, we'll create some complex structure: we'll create a particle, which will basically be the format of a row inside of a table. So effectively we're gonna create a table that will have rows, and these rows will have a complex understanding of what is in each row. Particle, because I'm thinking of atomic-mass stuff: I've got a name of the particle, the atomic number, the mass, and, just randomly, something about pressure. And you notice these things can be of different sizes, and I can say that I want this thing to be in position one, position two, position three, so I'm saying this is column one, this is column two, this is column three. Here I didn't say which column it is; I could have called it column five and then later on said I want to have something in position four. I won't have a lot of time to go into the details of how you do this, but essentially what I'm doing is creating a table, and it's going to have this complex notion of a particle. Okay, now I'll just point to this table's row object, and I'll print out what this row is, so I'm pointing to the first row of this table, and now I'm gonna add some data. So I'll say row name is oxygen, atomic number eight, mass is this, pressure is some two-by-three array, and I'll append this row onto the table. So this is where we start diverging from the concept of a NumPy array, where I wouldn't be able to append onto it; I'd have to do some fancy slicing and dicing and essentially make a new array in memory. So let me do that, and I'm actually now gonna get back what that first particle is, essentially that first row, and there's my answer. So now I'll get another row and add some more data; you don't have to do it this way, you can obviously loop through if you had a big data set you wanted to fill up. Name clinium, atomic number 150, mass 360, and now I've got two different particles inside of this table, which is inside of a group called datasets.
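Here is a condensed sketch of that whole walkthrough, including the search described next. It uses the modern PyTables spelling (open_file, create_group, and so on); the lecture-era API used camelCase names like openFile. The particle fields and values follow the lecture's example.

```python
import numpy as np
import tables

class Particle(tables.IsDescription):
    name = tables.StringCol(16, pos=1)          # fixed-width string column
    number = tables.Int64Col(pos=2)             # atomic number
    mass = tables.Float64Col(pos=3)
    pressure = tables.Float64Col(shape=(2, 3), pos=4)

h5file = tables.open_file("spam.h5", mode="w", title="Test file")
group = h5file.create_group("/", "datasets", "Test data sets")
h5file.create_array(group, "dataset1", np.random.random((100, 100)))

table = h5file.create_table(group, "particles", Particle)
row = table.row
for name, number, mass in [("oxygen", 8, 15.999), ("clinium", 150, 360.0)]:
    row["name"], row["number"], row["mass"] = name, number, mass
    row["pressure"] = np.random.random((2, 3))
    row.append()                                # extend the table row by row
table.flush()

# in-kernel query, analogous to an SQL WHERE clause
hits = [r["name"] for r in table.where("(number > 5) & (mass < 100)")]
print(hits)   # -> the oxygen row (as bytes on Python 3)

h5file.close()
```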
Now I can do some list comprehension where effectively I'm searching over this data; this is why we think of this as a database. And you notice I'm essentially gonna do what looks like a where statement, but I'm gonna do it with a string, because this is all being done by the API: where the atomic number is greater than five and the mass is less than 100, and I get back oxygen. So I can search through this data. And what I haven't told you about is how you would actually index some of these things; if you wanted to merge across multiple tables, you could do that as well. Any questions about what I did? The nice thing is, I've now got a file called spam.h5, and I can send that by email to you, or put it on an FTP site or Dropbox and you can grab it, and there are different ways to look at this data. There are some nice graphical interfaces to HDF5 files; I've made reference to them in the lecture notes, and I'll show you in a second. Here I've just opened up spam.h5 and you can see the structure of this file. So it looks basically like I've handed you an entire directory with lots of information in it, and I've got something called datasets, and I can see the information and the metadata about that data set. And then I've got this 64-bit floating point array of size 100 by 100; it's got a number of attributes associated with it, and I can visualize it, and that's what I've done over here. I can see this data in this way. And instead of looking at it as an image, I think I can open it as a spreadsheet, zero-based or one-based, so now I can look at the data inside of it. This looks too much like Excel to me, so I will close it. And what you basically have is a little palette over here where you're able to look at the, I don't know where it's throwing that table. Huh. I don't know why it's not letting me look at that. Let me try opening it again. There's the data set. Where are the particles? Huh. I don't know why it's not showing me the particles; it should be showing me a similar table for what these particles are, because it's got two data points in it. Can you try with that? Yeah. Okay. So you have to play around with the view of it. Yeah, that's a good point. Maybe if I just do this. Yeah, it shows all that. It's weird, because I opened it up last night and it knew how to display the sort of nested arrays. Anyway, so you can interact with this data. And in fact, if we wanted to, we could even create some more nodes on this; I think I opened this up in read-only mode, but you can get some versions of these viewers that will allow you to create more data sets and more groups. Anyway, so that's kind of cute. And interestingly, on March 10th, what's today, the fifth? So in a couple of days, at the end of the week, there'll be a talk on this at PyCon: Python and HDF5, fast storage for large data. So this is certainly a very hot topic in the Python community. NetCDF4 is another hierarchical data format, primarily used in climate modeling; for those climate modelers out there, and climate deniers, you're probably using NetCDF4. What, deniers? Don't some people deny that there is even a climate? Sorry for those listening on the web. Okay, so there's another service, which is called OPeNDAP, which is essentially a protocol for accessing large amounts of data across the web.
I won't have a lot of time to go into that either, but it's actually quite interesting. Say you and your group are working with terabytes of data, and you wanna get a small slice of that data; you wanna query on it, let's say, and pull back a small part of it. If you have a server which is serving certain data sets with this OPeNDAP protocol, you can connect with a bunch of different Pythonic modules; there's netCDF4, for instance, that allows you to connect to that data set and then, say, slice it and do a where statement on it. And instead of pulling back the terabyte data set, you only get back the pieces that you actually need. So you can be interacting with huge data sets remotely. Now that you're very excited about databases, you should be completely comfortable with the idea that if you've got lots of data, and heterogeneous data, there's something out there for you in the Pythonic community. There are a lot of different things you can do in terms of working with distributed databases. There's something called HBase, which is built on top of Hadoop. More interesting, I think, for the Python community are the column-store databases; the big one the Python community is using is something called MonetDB. That uses a concept of column stores rather than row stores, and it allows you to do these so-called MapReduce problems, which are very easily scalable across multiple nodes. And then there's SciDB, which is an up-and-comer in the scientific computing database world. What they've done is recognize that, A, we've got tremendously large amounts of data; B, that data probably needs to be distributed over multiple nodes, where there's a lot of computing power on each of those nodes; and C, we actually probably want to do some high-level manipulation and computation on the data in the database itself. So SciDB is growing, it's open source, and there are Python APIs. And it views itself essentially as a platform: it's a distributed data management platform, and it's sort of a computational engine. So you can do things like cross-correlate two fast Fourier transforms, and you do that at the SQL level. And if you don't have something like the fast Fourier transform, you can just write it in Python, embed it in the database, and then do very, very complex queries on the data. And it does this in an intelligent way, because at the very base level, how the data is physically stored and accessed is quite different from the way data is stored in other database management systems. So if you really want to start working on the bleeding edge, take a look at SciDB. There are Python APIs, and I think there's a new release coming out in about nine days or so. And it seems to be well funded and well liked by the sort of massive-data systems people in the scientific community. So I think with that, we'll end our database session within Python and just close with our XKCD again. And I guess we should show you the homework. So we think the homework's gonna be fun. Again, one of our small issues here is that we're trying to do scientific programming within Python, but we're also trying not to make it too domain specific, so we want something that's dealing with lots of data where you're able to ask high-level questions. We've appealed to sports, and now we're appealing to elections.
Well, before we were dealing with bear populations; now we're gonna do elections. So in this assignment, you're gonna create a database to analyze event-prediction data pertaining to three ongoing elections: in particular, the Republican presidential nominee election, well, nomination; the 2012 presidential election; and the vice presidential nomination. So what we'll ask you to do is create a database that has a table called races, and populate it with each race name, election date, and URLs, as we show below. There is a trading market called Intrade. For those of you that haven't played with it, it's kind of interesting. You can basically trade off of lots of different types of possible events, like the possibility that we're going to bomb a country in the Middle East by December 31st, 2012. That has a dollar value associated with it. Essentially, prices go from zero to $10, and you can buy and sell stock at any one of these values; every single event has an associated value. And the idea here is that there is some sort of common knowledge across the universe, where if somebody thought the market's probability for one of these things was vastly wrong, or vastly right, depending on how you view it, if they really thought it was off base, they might be willing to buy or sell this possibility for more or less than it was trading at. So the idea is that the current value of these various stocks, if you want to think about them that way, really does reflect the total knowledge of how likely this thing is to happen. So if you want to look at the probability that Marco Rubio is going to become the vice presidential nominee for the Republican Party, you have to pay basically $2.40 for that option. And if he is not nominated, and there's a whole bunch of legalese behind what it means to be nominated, but assuming there's a moment in time when this thing cuts off, if he's not nominated, holding this piece of stock will be worth $0, and if he is nominated, it will be worth $10, right? So right now there's something like a one-in-four chance, effectively, that he's going to become the vice presidential nominee for the Republican Party. You can pull over the data from Intrade for these three different races, and what you'll do is crawl the HTML that comes back from all three. You'll figure out who all the different people are in all of these different races. Some of them will be the same person: the probability that Rick Perry will be the presidential nominee is essentially zero, but the probability that he'll be the VP nominee is not zero. So create another table in your database called candidates, and write a program to automatically populate it with biographical information about each candidate, where you'll essentially automatically crawl the Wikipedia articles on these candidates. Include at least hometown, home state, party affiliation, birth date, and a link to a local file containing a photographic portrait of the candidate, which you will also automatically pull over from the web. And then create a table called predictions and populate it with the prediction data, which you can essentially get as CSV files for the lifetime of each race. And we'll ask you to intelligently create primary keys and foreign keys. And then we'll ask you two questions.
Use your database to plot, as a function of time, the probability of a candidate with a home state north or south of the Mason-Dixon line winning each race. So you'll have to look up the Mason-Dixon line if you don't know it already. But this is kind of interesting to see, because you're gonna aggregate over all candidates, summing over those that were basically from the north and those from the south. I don't know what that looks like; it'll look like something. But you might be interested in what it looks like, essentially whether the south is winning or the north is winning. And then one of the things that you might be curious about is the efficient-market theory, where you would think that the probability that Obama wins is equal to one minus the sum, over all the Republican candidates, of the probability that each of them wins. So this means that, as a function of time, if Obama's probability is going down, then you'd think that the sum over all the other candidates is going up. But that's not gonna be true in general, and that's because this is not a very efficient market. So you'll find the moments when this is not true, and you might wanna try to cross-correlate them with events that were happening; you can look at Google News or something to figure out what was happening on or around those days when this market became inefficient. And from now on, every time you notice there's an inefficiency, you can arbitrage, and you can give us back all the money that you make. Okay, so some hints. You don't have to do all the first steps in order. You may wanna create the tables ahead of time, think about the columns and the structures of those tables, and then crawl through and populate those tables; that's probably how I would do it. And because there's basically not data on a daily or even hourly basis for many of these candidates, if you're gonna be summing up stuff as a function of time, you probably wanna interpolate, like every day or every two hours or something, so that you can put everything on an apples-to-apples basis and do that sort of comparison, as sketched below. You can play with pandas, and this is a hyperlink in the PDF that's already up in bSpace. We'll get to pandas in a couple of weeks, but pandas is a really nice way of manipulating time series data when they're sparse and not naturally and easily alignable. And you could play with SQLAlchemy instead of doing this all within the SQLite 3 module. Are there any questions about that? Everyone's super jazzed? Okay, go out and arbitrage.
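A sketch of that interpolation hint: put each candidate's sparse Intrade price series onto a common daily grid so the series can be summed and compared point by point. The (day, price) arrays here are made-up placeholders, not real Intrade data.

```python
import numpy as np

# (day-number, price-in-dollars) pairs, irregularly sampled, one per candidate
perry = np.array([[0.0, 0.8], [3.0, 0.5], [9.0, 0.2]])
rubio = np.array([[1.0, 2.1], [4.0, 2.4], [9.0, 2.6]])

grid = np.arange(0, 10)   # one point per day
perry_daily = np.interp(grid, perry[:, 0], perry[:, 1])
rubio_daily = np.interp(grid, rubio[:, 0], rubio[:, 1])

# efficient-market check: P(Obama) should be about 1 minus the sum of the
# Republican candidates' probabilities; divide by 10 because Intrade
# prices run from $0 to $10
republican_total = (perry_daily + rubio_daily) / 10.0
implied_obama = 1.0 - republican_total
print(implied_obama)      # compare against Obama's own daily series;
                          # the mismatches are the market inefficiencies
```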