Well, thank you for having me. My name is Alyssa Travitz. I'm a PhD student at the University of Michigan, but today I'll be talking not about my thesis work, but about signac, an open source software package that originated here at the University of Michigan.

Just a quick background on signac and who the team is. We have five maintainers and three committers, of which I'm one, plus over 45 contributors, more than 40 citing papers, and we are a NumFOCUS affiliated project. To give some context, about half of us are still at the University of Michigan, but we have maintainers and committers in several different countries, so we are a very distributed, open source project.

The motivation behind signac is essentially this: there are two common approaches to file naming and organization when you're doing academic research. Typically people go for very long file names, where you just keep appending a new variable, or for very deeply nested directories. This is great if you know exactly what your data is going to look like at the end of your project, but that almost never happens in academic research, or ever. And that raises questions: how do you introduce a new parameter? How can you quickly check what parameter space you've already analyzed, what has yet to be analyzed, and what the status of your project is — how do you get information quickly? Those are the two main motivations for the signac framework.

signac can be broken down into two separate packages. It originated as signac, the data management framework, and then evolved to include signac-flow, which is our workflow management framework.

To be more specific and give an example: instead of a nested directory structure or very long file names, signac flattens everything, and everything belongs to a single workspace. Every "job", as we call it, is unique and has a specific hash ID. While this may look like nonsense here, what it actually does is make sure that every single parameter combination you're running is unique. For example, if I'm running a study where I'm varying the number of particles, the pressure of my simulation, the volume of my simulation, and maybe how long I'm equilibrating it for, then as a materials scientist I would run this over a very large parameter space and I would want to know that each of those combinations is unique. Each of these jobs has a unique state point, a dictionary that describes all the parameters intrinsic to the data within that directory. In that same directory lives all of the accompanying data for that state point: lightweight metadata in something we call the job document, plus any file type, any extra data you want to store. So everything is just one directory per job, all belonging to the workspace, and each job is guaranteed to be unique because it's defined by the dictionary that lives in a single signac_statepoint.json file. That's our approach to managing your data space.

Then, when it comes to operating on that data — knowing what has been analyzed and what the status of the whole workflow pipeline is — that's where signac-flow comes in. signac-flow takes anything you can do to your data space and lets you define it as a Python function.
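(As a rough illustration of the data model just described — not code from the talk — here is a minimal sketch using the signac Python API. The project name and the state-point keys are made up, and the exact init_project signature varies between signac versions; this follows the signac 1.x style.)

```python
import signac

# Create (or reopen) a project; its workspace/ directory holds one
# sub-directory per job.
project = signac.init_project("md-study")

# One unique state point dictionary == one job == one hashed directory.
job = project.open_job({"N": 1000, "pressure": 1.0, "volume": 8.0}).init()

print(job.id)       # the hash ID that names the job's directory
print(job.sp.N)     # state-point parameters, read back from signac_statepoint.json
job.doc["stage"] = "initialized"   # lightweight metadata in the job document

# Any other files simply live in the same directory, next to the state point.
with open(job.fn("results.txt"), "w") as fh:
    fh.write("raw output goes here\n")
```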
So let's say I want a function called calculate that takes my volume and divides it by the number of particles. I can access the volume and the number of particles from the signac state point of every single job in the workspace, and what signac-flow allows me to do is iterate over every job in the workspace and operate on it using this function.

I know that at csv,conf there are probably a lot of people working on data management solutions, so I want to be really clear about what we're good for and what we're not good for — who our target audience is and what problems we're trying to solve. We're very good for managing file-based, heterogeneous data. I gave a very brief example just now, but signac works well when your data is messy and heterogeneous: you don't need everything to be perfectly defined and organized up front. It's also very good for searching and accessing the data within Python or on the command line, so you really can use any tool you want as long as it can be run on the command line, but it's very Python-friendly. And it's very good for scalable and reproducible workflows. I say scalable to mean that it's very easy to go from your first prototype to a small project; this is not scaling in the sense that some other people here might mean. It's good for reproducible workflows because it really forces you to do things in a modular, well-documented way. With that, it's also good for prototyping: it encourages best practices during the very first, initial part of your data science workflow. And as I said, it's good for integrating with existing tools — anything you can access on the command line can be integrated with signac.

What signac is not good for: as I said about scalability, once you get over about 100,000 individual jobs — that is, individual directories in the workspace — we start to see poor performance. So this is not for enormous data sets; it's meant for developing things very quickly and being very agile. It's also not great for existing databases, so if you have a lot of distributed data, this is not for you. And if you have purely tabular data, it's going to be overkill.

With that out of the way, we'll go through an example and quickly see how some of these features work. This is not all of the features by far — it's very flexible and very powerful — so I encourage you to ask me follow-up questions in the Slack after this, or go to our documentation, which I'll link to at the end. This example was a course project by one of our maintainers, Bradley Dice, along with two other classmates. They were looking at the network structure of U.S. air traffic. We don't really need to get into the details, but I like to use it as a way to conceptualize how signac integrates with a data science problem. I'll cover the data management side as well as the signac-flow workflow side. So, we know that we want to analyze U.S. air traffic and we need to first create a parameter space, and we know that we want to look at how traffic patterns have changed over time.
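(Before moving to the air-traffic example, here is a minimal sketch of the calculate idea mentioned a moment ago, written with plain signac; the state-point keys are the made-up ones from the particle example above.)

```python
import signac

project = signac.get_project()  # assumes we are inside an initialized project

def calculate(job):
    # Volume per particle, computed from the job's state point and stored
    # in its lightweight job document.
    job.doc["volume_per_particle"] = job.sp.volume / job.sp.N

# Iterate over every job in the workspace and apply the operation.
for job in project:
    calculate(job)
```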
So let's first initialize a parameter space with the full range of years, and the quarters within those years, that we want to investigate. All this requires is a tiny nested for loop in which we initialize the project and then, with that project, open each job just by defining the dictionary that will uniquely identify it, and then initialize it. If you later try to re-initialize a duplicate state point, signac won't allow a duplicate: it makes sure every single state point is unique and won't create redundant directories. What this for loop creates is a workspace that looks something like this, where each of these points is a directory within your workspace, varying over year and then over quarter. We specifically separate quarters from years because the way the data is analyzed depends on which quarter the data is from. For example, if we go into one of these directories — say this is its hash ID, which is the name of the directory — the JSON file inside would just contain the dictionary with year 1994 and quarter 4.

Now we have that workspace. Say we leave and come back a year later and want to know what this workspace contains and what range of parameters it covers. We can do that easily with the command signac schema, which we can run on the command line, and it will quickly tell us the range of parameters and their data types. You can imagine this is much more useful when your data has more dimensions. The same goes for querying: we can use signac find, which tells us which jobs match a specific query. If we want to know which jobs correspond to the year 1994, it will slice our workspace and tell us which job directories — which hash IDs — correspond to 1994. We can do much more complex filtering, but this gives you an idea of how you would access the data once it's been initialized in signac.

When it comes to modifying a workspace, this is where signac is actually very powerful, because it allows you to modify it while maintaining the integrity of your data. If we want to add country as another parameter, we can just iterate over all of the jobs and add it. We can also rename an existing state point key — if we decide we now want to capitalize quarter, for example — and you can see how this would be very useful if you want to change to a different naming convention without worrying about invalidating any of your data.

Then there's the signac-flow side: once we've decided what parameters we want to investigate, how do we actually go in and do the data science? For each of these directories we have a correspondence between year and quarter, and we know we can fetch the traffic data corresponding to those years and quarters — in this case from a government website — so we can have a fetch_data operation. Once we've acquired that data and pulled everything down, there's going to be a lot of data cleaning to do. Let's say every single directory now has a readme.html file that we don't need and that is going to get in the way of our data analysis.
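(Before moving to the cleanup operation, here is a rough sketch pulling together the initialization loop and the state-point edits just described; the year range and the country value are illustrative, not taken from the actual course project.)

```python
import signac

project = signac.init_project("airtraffic")  # signac 1.x style signature

# One job per (year, quarter) combination; re-running this loop is harmless,
# because a duplicate state point maps back to the same job directory.
for year in range(1994, 2021):
    for quarter in (1, 2, 3, 4):
        project.open_job({"year": year, "quarter": quarter}).init()

# On the command line, `signac schema` summarizes the parameter space and
# `signac find '{"year": 1994}'` lists the job IDs matching a query.

# Later schema changes, without invalidating the data in each job directory:
for job in project:
    job.sp.setdefault("country", "US")          # add a new parameter
    if "quarter" in job.sp:
        job.sp.Quarter = job.sp.pop("quarter")  # rename quarter -> Quarter
```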
So, in order to go through and clean out these unnecessary files, we would set up a flow operation. This function is a flow operation; the only things that make it special are the project.operation decorator and the fact that it takes a job as its argument. We also have these pre- and post-conditions, which are what set this up in a flow-diagram type structure. You can see that one precondition here is that fetch_data has run, and another condition is has_readme, which makes sure this operation is only run if a readme exists. The way we check that is with a function called has_readme, which is just a simple one-liner that checks whether a readme exists in the job directory.

I also want to mention the labeling functionality, which is very useful when you want to see the status of your overall project. If you run python project.py status — project.py is the file containing all of your functions, everything I showed on the last slide — then, because we've assigned has_readme as a label, it tells you what percentage of jobs that label is true for. You can also see the function fetch_data, and it tells you how many jobs are eligible to be run: if you ran this workflow, it would run fetch_data on 104 eligible jobs and remove_readmes on 104 eligible jobs. Once that has run, those numbers would read zero.

Finally, I'm going to go over a few other things that you can read about in our documentation or on our homepage if you're interested. This was a very high-level overview with a very simple example, but I'm happy to discuss specific use cases. Something we're very good at is automated cluster submission: if you're working on several different supercomputers, we have custom templating so that you can simply run python project.py submit and signac will take care of the queuing, how many CPUs you need, and the parallelization, all with cluster-specific templating. If you have a cluster that we don't already support, you can easily write your own template and have it automated for you. It's also good for exporting for data sharing: if you have a collaborator who doesn't use signac, or if you want to archive something, you can export to that nested directory structure, and you can convert back from a nested directory structure to signac. We also have signac-dashboard, which I'm just plugging here, but I encourage you to look at it on our website — it's the third package alongside signac and signac-flow, and it allows you to view graphical output in a very interactive way. We also have some newer features like groups, which let you group those operation functions into more complex workflows and submissions; we integrate with pandas and HDF5 data stores; and we have container support with Docker and Singularity.

That's everything I wanted to cover today. Please go to signac.io if you're curious about our project, follow @signacdata on Twitter, and we're very active on our Slack — if you have any questions or just want to know more about the project, come say hi; our Slack is linked from our homepage. Thank you.
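(For readers following along, the workflow pieces described above — the operation decorator, the pre/post conditions, and the has_readme label — fit together in a project.py roughly like this. This is a sketch based on signac-flow's documented decorator API, not the actual course project; the class name and the fetch_data body are placeholders.)

```python
# project.py
import os

from flow import FlowProject


class AirTrafficProject(FlowProject):
    pass


@AirTrafficProject.label
def has_readme(job):
    # One-liner condition, also shown in `python project.py status`:
    # does this job's directory contain a readme.html?
    return job.isfile("readme.html")


@AirTrafficProject.operation
@AirTrafficProject.post.isfile("data.zip")
def fetch_data(job):
    # Placeholder for downloading the traffic data for this job's year and
    # quarter into the job directory (the real project pulled it from a
    # government website).
    with open(job.fn("data.zip"), "wb"):
        pass  # real download code would go here


@AirTrafficProject.operation
@AirTrafficProject.pre.after(fetch_data)
@AirTrafficProject.pre(has_readme)
@AirTrafficProject.post(lambda job: not has_readme(job))
def remove_readmes(job):
    # Clean out the unneeded readme once the data has been fetched.
    os.remove(job.fn("readme.html"))


if __name__ == "__main__":
    # Enables `python project.py status`, `run`, and `submit`.
    AirTrafficProject().main()
```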
Thank you so much. We have about two minutes for questions, so I'm going to check the queue. One question that came up from the audience, from Kelsey Montgomery, is: is storage with signac in the cloud, such that compute can be performed without moving any data?

I suppose I didn't make this clear, but signac is completely file-based, so that is one of the cases signac is not built for. Everything is a file-based system. In my personal case, I'll run things on a cluster; you can keep things separate, but not in the way you're describing, not in the typical cloud computing sense.

All right, we have time for one more. How would signac work in terms of data portability or archived, compressed data? Would there be a need for some pre-processing to get it up to speed?

I'd like to talk with you more about this afterwards so I can ask some clarifying questions, but people have used signac for archiving data, specifically through the University of Michigan. Because you can export it to that nested directory structure, where everything is nicely labeled for you, it's best to export it and then compress it, essentially. But I'm happy to talk about that offline.

All right, well, we'll move that question over to the Q&A in Slack. Thank you so much for a really great presentation. Thank you.