Hello everybody, thanks for coming. I am a fourth-year graduate student at Oregon Health and Science University, and while I'm not actually going to show you data much like Sarah Kate did, I will be discussing one of my favorite topics in science, which is cell identity. As Sarah Kate mentioned, cell biologists look at cells, and one of the ways they try to deconstruct the complexity of a biological system like the human body is by looking at the fundamental unit of that body, which is the cell. Now, cells come in many shapes and sizes and perform many different functions. So biologists work to categorize cells into a logical taxonomy that describes their lineage, or where they come from, what they do, how they look, and where they are in your body. Think of the catalog of human cell identities, or cell types: the cells that make your eyes function, the cells that produce mucus in your lungs, the cells that help you digest food in your stomach. That list of functional cell types or cell identities is long and ever-growing. As our technologies improve, we're now capable of exploring cell identities in ways that we couldn't before. But what exactly is cell identity? How do we define a discrete cell identity, or tell two cell types apart? Historically, one of the ways that biologists did this was by looking at location within the body. As we all know, your organs do different things. Sometimes they have overlapping functions, but by and large the role each plays in the body is specialized. For example, cells in your lung have cilia, these small strings that you see here, which whip back and forth to bring contaminants out of the lung. Meanwhile, cells here in the pancreas form these circles called ducts, which carry enzymes that help you digest food into your intestine.
But the thing is, location wasn't sufficient for defining these two cell types, because they're actually the same type of cell. We call them epithelial cells. These are the cells that line your organs. They share many of the same active regions of DNA. They look the same. Technically speaking, if you culture them in a dish, they're probably going to do the same thing. So if we can't just define them based on location, what else can we do? We can try their appearance. What do the cells look like? This is a motor neuron. You can see just from its shape, from the protrusions of the neuron, that its role is communicative. It is trying to transmit information from one end of the body to another, or to another cell. Meanwhile, we have another type of cell. It looks stretched out, doesn't it? This, of course, explains part of its function. This is a myofibroblast. It's one of the cell types that define the connective tissue in your body. It helps keep things together, keeps things stuck. For example, if you get a wound, the body has to close that wound somehow, and fibroblasts are part of that healing response. But is appearance enough? No. Here's a T cell, which is one of the immune cell types in the body. You can have a killer T cell, which performs its immune function by destroying other cells that have been infected by viruses or have turned cancerous. Or you can have a helper T cell, which performs more of a supporting or recruiting role by working with other cell types in the body. Under a microscope these cells look very, very similar, unless they're actually, for example, in the process of killing another cell. So while the killer T cells work to destroy other cells, the helper T cells instead interact with other functional immune cell types in the body and support the immune response that way.
Despite looking very similar, their actual functions are very distinct. So can cell identity be described as function? Not necessarily. Just take the killer T cell that I was just describing. We have its active immune function, where it interacts with other cells and destroys them if they become problematic. But we can also have another form of this killer T cell, the exhausted T cell. This is a very distinct and stable state that the cell can take. It often happens, for example, in cancer, where the tumor is trying to avoid being killed by your T cells. Tumor cells will secrete things into the microenvironment that tell T cells to just stop doing what they're doing, or the T cells can't reach the tumor, and they become exhausted. It's like running on a treadmill and never reaching your destination. Cell identity, then, can also be considered a matter of state. These are very different cell states despite being the same overall cell type. So how do we study cell state? One way is through the field of epigenetics. This is where I work. This is what I do as a graduate student. Basically, epigenetics is a way, independent of your DNA code, to dictate what type of cell you are. In one cell type, like the killer T cell, you'll have some regions of DNA active and other regions closed. But that T cell won't have the same active regions of DNA that, say, a fibroblast or a neuron will have. Figuring out these combinations is part of what I do. I work with chromatin state. To define chromatin: you have a cell, and its DNA is contained in the nucleus. It's organized into functional units that are replicable, called chromosomes. The chromosomes themselves are wound up. So if you look here, here's the DNA. It's actually wound around these proteins, called histones, forming another unit called the nucleosome. What I actually study is this: the regions of DNA that aren't being blocked by these bulky histone proteins.
And the combinations, basically where in your DNA these regions are open, dictate in part what state you're in and what kind of cell you are. So it's not just how you look. It's not just where you are. It's not just what you do. It's also, so to speak, what's happening under the hood. Which regions in your DNA are open? Which regions in your DNA are closed? This is what I study. To give you just a quick example, here's a pancreatic cell. It's a type of fibroblast. These cells store fat, and they help maintain the extracellular environment, the environment in between cells. They help maintain that balance. They don't really proliferate and they don't really signal to other cells in this state. But in response to injury, when you get a cut, when you become inflamed, when certain environmental factors like TGF-beta are in proximity to this cell, it will actually transform into a myofibroblast, which has a different chromatin state. So depending on what happens, you can have that same cell with all of these functions. The moment that injury happens, it will transition by remodeling its chromatin, again those combinations of open and closed regions, to become an activated myofibroblast, which has a very different function. Again, this was the same cell. It just changed its state. So can we define the full spectrum of states that a cell can take? That's part of the equation of defining what a cell's identity is. So how exactly do we learn about chromatin state? Well, what I use is an assay called ATAC-seq, or assay for transposase-accessible chromatin. To simplify it, we put all these hyperactive proteins, the transposases, into the nucleus of the cell. They come in and they cut up the DNA. But the only places where they can cut up DNA are these open regions. So the assay fragments the DNA, and what we get as the end product are just the pieces of DNA that were open, that didn't have nucleosomes associated with them.
So what we can essentially do, when you have a sample filled with different cell types, and to oversimplify it, is just put it in a blender. Then you get a kind of average chromatin state. It's not representative of every single cell type in there. If you look at it, there are many different cell types here. So some regions will have a stronger nucleosome signal and others will not. It's not perfect. This is inadequate for a heterogeneous sample. When you think of cancer, just for example, you have your malignant cells, but you also have immune cells that are trying to fight the cancer. You also have stromal cells, which try to contain the tumor. You have endothelial cells, or blood vessels, which supply nutrients to the tumor. Samples are heterogeneous. There are many different cell types, many different cell identities. And even within that same region, some of those cancer cells, for example, can look one way and others can look a different way. It's a matter of cell state. More recently, within the last decade, there's been a kind of revolution in this type of assay. Rather than taking that average I was talking about, we're now capable of profiling the chromatin state of every single cell in your sample, one at a time, and then collecting all that information. So we'll know that the chromatin state, the open and closed regions of DNA, looks like this for one cell, like that for a different cell, and like that for yet another cell. This is very, very powerful for understanding the heterogeneity of a tissue. But the problem is that single cell methods are cursed with sparsity. This is where the information you've collected is mostly zeros, mostly negative information rather than positive information. You may just not have sampled a region, or there could have been dropout.
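The "blender" point above can be made concrete with a small sketch. Everything here is hypothetical: the two cell types, their open regions, and the 70/30 mix are made-up values chosen only to show how a bulk average can describe a cell that does not exist in the sample.

```python
# Toy illustration with made-up values: 1 = open region, 0 = closed region.
tumor_cell = [1, 1, 0, 0, 1, 0]
immune_cell = [0, 0, 1, 1, 1, 0]

# A hypothetical heterogeneous sample: 70 tumor cells and 30 immune cells.
sample = [tumor_cell] * 70 + [immune_cell] * 30


def bulk_profile(cells):
    """Average accessibility per region across all cells (the 'blender')."""
    n_cells = len(cells)
    n_regions = len(cells[0])
    return [sum(cell[i] for cell in cells) / n_cells for i in range(n_regions)]


bulk = bulk_profile(sample)
print(bulk)  # [0.7, 0.7, 0.3, 0.3, 1.0, 0.0]
```

The averaged profile matches neither cell type: it only reports how common each open region is across the mixture, which is why per-cell profiling is needed for heterogeneous tissue.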
So to give an example of this problem, and really the point of the talk here, can anyone identify what this painting is? It's a famous one. No, close. But this is the point I'm getting at. This incomplete painting represents a harsh fact, which is that single cell data is sparse. At the single cell level, we only collect maybe 1 to 3% of all the open chromatin regions I was just telling you about. This is, let's see, 100 squares actually, and I've only shown three of them. That's as much information as we can collect on a single cell at a time. So what do we do about this? We sequence more cells. As long as they're in the same state, as long as you're looking at the same type of cell with the same identity, you can just do this again. As long as there's some shared identity between them, if we take this cell, or this picture, and overlay it on top, we start to see a bit more of the picture. Can anyone identify this now? Again, we take another one, find what's common between them, and overlap them. We repeat this again, and again. If we continue to do this, eventually we start to see a better picture of what the cell state is. So in my field, we try to capture the quote-unquote true picture or, at least with the type of data I work with, the chromatin configuration that represents the true cell state we're looking at, whether that's an activated fibroblast or an exhausted T cell. Another idea I'm trying to get at here, especially with this whole jigsaw puzzle analogy, is that if you had the complete picture beforehand, then even if a cell doesn't carry much information, overlaying it on a reference enables us to identify what that cell type was.
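The overlay idea and the reference lookup can be sketched in a few lines. This is a toy simulation under stated assumptions, not the speaker's method: the reference states, their 40 open regions each, and the helper names `observe_cell`, `overlay`, and `best_reference` are all invented for illustration, and each simulated cell reveals only 3 regions, echoing the 3 of 100 squares in the painting.

```python
import random

SEEN_PER_CELL = 3  # regions detected per cell, as in the painting analogy


def observe_cell(open_regions, rng, k=SEEN_PER_CELL):
    """Simulate one sparse cell: only k of its open regions are detected."""
    return set(rng.sample(sorted(open_regions), k))


def overlay(cells):
    """Stack cells of the same state: a region counts once any cell shows it."""
    seen = set()
    for cell in cells:
        seen |= cell
    return seen


def best_reference(cell, references):
    """Pick the reference state sharing the most regions with a sparse cell."""
    return max(references, key=lambda name: len(cell & references[name]))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Two made-up reference states with non-overlapping open regions.
references = {
    "activated_fibroblast": set(range(0, 40)),
    "exhausted_T_cell": set(range(60, 100)),
}

# 200 sparse cells that all share the exhausted T cell state.
cells = [observe_cell(references["exhausted_T_cell"], rng) for _ in range(200)]

for n in (1, 10, 50, 200):
    covered = len(overlay(cells[:n]))
    print(f"{n:>3} cells -> {covered} of 40 open regions recovered")

# Even a single 3-region cell can be matched when a reference exists.
print(best_reference(cells[0], references))  # exhausted_T_cell
```

Real data is messier than this sketch: states share many open regions, so matching relies on the regions that differ between references rather than on a clean split.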
So public data is so, so important. You may have an experiment that didn't work too well, but if another dataset exists that already describes the cell state you're looking at, you can do something like this. It makes unusable data usable. It lowers the cost of a sequencing experiment. I'm going to identify one more problem. Can anyone identify this? Is this the Mona Lisa? Not quite. This is actually the Lama Lisa, which reminds us of another great reason to use reference data, which is that if you have an incomplete dataset, you can be fooled. It may look like the cell state you're trying to describe, which is why bringing in public data and cross-referencing information between labs and between publications can be so helpful for things like reproducibility, and for confirming that the cell states you're seeing are the same or maybe even distinct. The point is that making these comparisons is important, because while they share some similarities, they're not exactly the same thing. So, to wrap this up in terms of CSV: in this type of data, remember I said we just get those open chromatin regions, right? We pile those up and then say these are the chromatin regions that were significant. The point I'm getting at is that all it takes to have a public dataset for something this important is a CSV, sorry, a tab-separated file that is, for the largest experiments, no more than 20 megabytes. Curation of reference peak sets, those open chromatin regions, can vastly improve public data. It can vastly improve sequencing experiments everywhere around the world. We're all human beings, and we all use the same reference genome when we're aligning things. An exhausted T cell state, for example, will look very similar from one human to the next.
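To show how small such a reference peak set can be, here is a minimal sketch that writes and reads one as a tab-separated file using Python's standard `csv` module. The coordinates, peak names, and column headers (`chrom`, `start`, `end`, `name`) are assumptions made for illustration and are not taken from the talk or from any particular database.

```python
import csv
import io

# Hypothetical peaks: chromosome, start, end, and a made-up peak name.
peaks = [
    ("chr1", 1000, 1500, "peak_1"),
    ("chr1", 5200, 5900, "peak_2"),
    ("chr2", 300, 800, "peak_3"),
]

HEADER = ["chrom", "start", "end", "name"]  # assumed column names


def write_peaks_tsv(rows, handle):
    """Write peaks as tab-separated text with one header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)


def read_peaks_tsv(handle):
    """Read the peaks back, restoring integer coordinates."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        (row["chrom"], int(row["start"]), int(row["end"]), row["name"])
        for row in reader
    ]


buffer = io.StringIO()  # stands in for a file on disk
write_peaks_tsv(peaks, buffer)
text = buffer.getvalue()
print(text)

restored = read_peaks_tsv(io.StringIO(text))
size_in_bytes = len(text.encode("utf-8"))
print(f"{len(restored)} peaks in {size_in_bytes} bytes")
```

Each peak costs only a few dozen bytes in this layout, which is consistent with the point that even a large curated peak set stays a small file.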
There will of course be differences, but that's the point I was getting at with the Lama Lisa, which is that we need to define the full spectrum. Curating public datasets is one of the only ways we can do that. Having that information easily accessible is one of the only ways we can do that. So that's the end of my talk. I just wanted to say thank you for inviting me to CSVConf to give this presentation. And a lot of these pictures were actually made with AI, so here are the prompts that I used.
That's a wonderful talk. Thank you also for showing your prompts for the AI. Good transparency. Has anyone got a question? Andres?
Thank you for this great talk. You mentioned some kind of public database of cell states. Does everyone in the science community agree that if they get data they will share it, that all of you share to the same place, and that these things exist and everyone respects them and uses this data?
There are numerous databases where researchers are called on to submit their information in a way that makes it easy, because a small team can't possibly go through every single publication out there. I really encourage my colleagues, when they publish a paper and a dataset, to find as many databases relevant to their information as they can and put it up there, or request that the information be put up there, for visibility and for the reasons that I discussed here. One of the big ones I can think of with this epigenetics focus is ENCODE. It's a database with many, many different epigenetic datasets. But the single cell aspect that I was describing, the single cell datasets, are still being curated. It's not quite as good as some of the bulk, averaged datasets that are out there. I'm just saying there's a bit of a hole for single cell epigenetic datasets.
The databases that exist right now are often selected or biased towards those labs that want the highest visibility, not necessarily the best coverage. The more we curate, the more context we'll have. So it's a bit of both, yeah.
I'm going to be cheeky and ask one first, sorry. Have you got allies, or a coalition of the willing, to help you make that vision a reality?
Not quite, as in starting a database, or at least ensuring that it's part of our standards as an institution to submit this information.
Right. What should I call the website?
And a related question. You speak not only to the infrastructure gap, the need for more databases or repositories, but also to the need for appropriate documentation and curation of the data that would go into them. Are there existing epigenetic data standards that you know of?
Absolutely. I already mentioned ENCODE, and ENCODE does a fantastic job with this. For every dataset on their website, not only do they provide a detailed breakdown of how the data was processed and created, but they also provide a set of guidelines, community standards, that they believe can serve as a template for future studies, especially given how much information they've collected. I think it's an absolutely invaluable resource. Again, it's just the single cell aspect where it's lacking. But yes, they publish where the data came from, how it was collected, and the computational steps that went into processing the data, and they even provide metrics for, say, a dataset of acceptable quality, of exceptional quality, or of poor quality. These are all very helpful, especially when you're trying to parse your way through hundreds of datasets. I think it's a really good model for that kind of thing.
Any other questions? No?
Well, there's always the opportunity to use a data table to brainstorm the name of what you want to do with this, how you can build on ENCODE, and how you can get people together. I guess it only takes three people at an epigenetics single cell conference and the dream can come true. CSVConf in five years' time, we'll hear about it. Good idea. Cool, okay, thanks so much, Kevin.