Hello, everyone. My name is Serena. My pronouns are she/her. I'm based in Toronto, Canada, and I'm currently on a three-month sabbatical at the Recurse Center. If you haven't heard of the Recurse Center, it's a magical place that is basically a writer's retreat for programmers, and I encourage you to ask me about it if you're interested. Previously, I was a data scientist and consultant at the Bardis Group. As part of that, I worked on a project in collaboration with the Government of Ontario, where our goal was to explore ways in which natural language processing could be used to extract information from laws and make it usable to the public.

So today I'm here to talk to you about improving law interpretability using natural language processing. But what does interpretability mean? Let's take a step back and think about the context. The definition of law is the system of rules which a particular country or community recognizes as regulating the actions of its members. So it's basically a system of rules that tells you how you may or may not behave in a society, and it's pretty important that you know about it, because ignorance is no excuse for breaking the law. So it's really important to understand these rules.

In terms of interpretability in this context, there are two aspects I want to draw attention to. On the content side, we want to be able to extract rules and obligations from the text, and we want to be able to identify the entities that are responsible for compliance and are affected by the legislation. In terms of technical implementation, that means finding a representation of the rules that makes them more accessible and understandable. If you do that, you may be able to see things like patterns across industries, differences in responsibilities between the private and public sector, or even highlight ambiguities.

There are several challenges in this space.
The first one is that there's no label set available to us: no one sat down in Ontario and annotated the text of all the laws, saying this is interesting, this is not. So the first problem is that we can't run any supervised analysis. Second, parsing and tokenization are hard on laws. They're all formatted with bullet points, bullet points refer to other bullet points, and there are lots of references to other laws, and that can break parsing and tokenization algorithms very easily. Next, laws tend to have a limited lexicon, so the vocabulary is relatively small, but it's very specialized and context-specific: the same word can be used in different places in the same text but in different contexts, and so have different meanings. Sentences can also be very complex. They can be really long and convoluted, and sometimes it's hard to tell who's responsible for what and what the rule actually is; you can get lost in the sentences, basically. And finally, laws are very domain-specific, where a domain can be an industry, a topic, or a geography. So if we had a data set from Europe, for example, and we trained a model on it, it wouldn't generalize well to Canada.

In light of these challenges, instead of building a single model, we decided to build a framework of analysis, and we put together a mix of pre-trained natural language processing models and unsupervised machine learning. This was with the goal of extracting information, and that information boils down to the rules that are defined in the text and the entities that are responsible for compliance, but also things like the difference between public and private responsibilities, and organizing the rules into homogeneous groups. We'll see in a few minutes what I mean by homogeneous.
Before moving forward, I want to do a quick grammar refresher, just because I'm going to use these words a lot in the next few slides and I want everyone to be on the same page. In grammar, the subject is the word or phrase that indicates who or what performs the action of the verb. In the context of law, this is going to point us to the entities who are responsible for complying with the rules. I have an example sentence here where "every employer" is the subject grammatically, and it's also the entity that is responsible for providing something. Next, we have the object of the sentence: that's the entity that is acted upon by the subject, and in this context it's the rule specification. So while "every employer" is the subject, the object of this sentence is "workplace emergency response information". That's what the employer should provide.

Okay, getting into the analysis: we built a proof of concept based on the Accessibility for Ontarians with Disabilities Act. This is a statute and a regulation passed in 2015. It defines rules and requirements for accessibility in Ontario, and it also sets out processes for eliminating barriers. In this context, the rules we're interested in are called burdens. A burden is a requirement or obligation that organizations have to comply with. That can relate to physical and architectural barriers like steps, stairs, and sidewalks, but also to things that are less tangible, like documentation and training.

The analysis breaks down into three steps. We don't have time to cover all the details of the implementation, because I would like to leave some time for questions, but I'll try to give an overview of the challenges and the solutions at each of these steps. Step one: burden extraction, that is, finding the sentences in the text that define rules. In this case, having a limited vocabulary actually works in our favor.
So we can come up with a short list of verbs that are very likely to point us at the definition of a burden. For each sentence, if any of these verbs appears in the text, we label that sentence as a burden. It is a pretty coarse classification rule, but we don't have a label set. And even being such a high-level business rule, we get 0.89 accuracy, so the proportion of sentences correctly classified is 89%. And we get 0.97 recall, which is the proportion of actual burdens correctly classified as burdens.

The next step is to identify the subject: we want to know who's responsible for complying with these rules. The problem here is that the sentences can be really long and complicated, and sometimes the subject can almost be a sentence on its own. So we use a dependency parser to represent the syntactic relationships between words as a tree structure, and then we navigate the subtree of the subject to identify all the words that define it. I have an example here to make it clearer. The main verb of the sentence is "keep", and that's the head of our tree. The dependency parser identifies "organizations" as the subject of the sentence, and all of the words that define "organizations" have a parent-child relationship with it. So navigating that subtree lets us find the full definition of the subject.

The final step is a clustering analysis of the subjects. The objective here is to organize the burdens into homogeneous groups based on the entities that they affect. At this point we're still not entirely sure what to expect from the results of the analysis; that's why we use clustering. We're looking for any kind of pattern or regularity.
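The verb-list labeling rule from step one can be sketched in a few lines of Python. The verb set below is illustrative; the actual list used in the project isn't given in the talk.

```python
# A coarse rule-based burden classifier: a sentence counts as a burden
# if it contains any verb from a hand-picked list of obligation verbs.
# NOTE: this verb list is illustrative, not the one used in the project.
BURDEN_VERBS = {"shall", "must", "ensure", "provide", "comply", "maintain"}

def is_burden(sentence: str) -> bool:
    # Lowercase and strip punctuation so "provide," still matches "provide".
    tokens = {word.strip(".,;:()").lower() for word in sentence.split()}
    return not tokens.isdisjoint(BURDEN_VERBS)

print(is_burden("Every employer shall provide workplace emergency response information."))  # True
print(is_burden("This Part applies to the Government of Ontario."))  # False
```

Accuracy and recall can then be measured against a small hand-labeled sample, which is presumably how the 0.89 and 0.97 figures were obtained.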
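The subject-subtree walk from step two can be illustrated without a full NLP library by hand-encoding the parse that a dependency parser (such as spaCy) would produce for a short sentence. The token list and labels below are made up for the example.

```python
# Each token is (text, head_index, dependency_label); the root verb points to itself.
# This parse is hand-built to stand in for a real dependency parser's output.
SENT = [
    ("Designated", 3, "amod"),      # modifies "organizations"
    ("public", 2, "amod"),          # modifies "sector"
    ("sector", 3, "compound"),      # modifies "organizations"
    ("organizations", 5, "nsubj"),  # grammatical subject of "keep"
    ("shall", 5, "aux"),
    ("keep", 5, "ROOT"),
    ("a", 7, "det"),
    ("record", 5, "dobj"),
]

def subject_phrase(tokens):
    """Find the nsubj token and collect its whole subtree, in sentence order."""
    subj = next(i for i, (_, _, dep) in enumerate(tokens) if dep == "nsubj")
    keep, stack = set(), [subj]
    while stack:
        i = stack.pop()
        keep.add(i)
        # children are tokens whose head is i (excluding the root's self-link)
        stack.extend(j for j, (_, head, _) in enumerate(tokens)
                     if head == i and j != i)
    return " ".join(tokens[i][0] for i in sorted(keep))

print(subject_phrase(SENT))  # Designated public sector organizations
```

In spaCy the same walk is a one-liner over `token.subtree`, but spelling out the tree traversal shows why the parent-child links recover the full definition of the subject.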
And we would like to see a difference between public and private responsibilities, if one exists, and maybe similarities across industries. Before we can do the clustering analysis, we need to represent the subjects of the burdens in a vector space: we need numbers instead of strings. And before we even do that, we need to do some sentence normalization. One of the steps here, for example, is deleting the stop words. These are words that appear very frequently in the text, like "the", but don't really add a lot of information, so we just ignore them.

The next step is to project the sentences into a vector representation, and we do that by using a semantic space. In a semantic space, words that have similar meaning are projected to vectors that are close to each other. And so you can do operations on the vectors. One popular example is that if you take the vector for "king", subtract the vector for "man", and then add the vector for "woman", you get the vector for "queen".

The third step is dimensionality reduction. The vectors in the semantic space are still pretty large, so we do some fancy math to represent them in a two-dimensional space. And then finally we can do the clustering. For clustering, I used k-means. The goal of k-means is to partition n data points into k clusters, and it does so by assigning each data point to the cluster with the nearest mean. One of the reasons to choose k-means is that its results are very easy to interpret and very intuitive: the mean of each cluster can serve as a prototype for the group. That means you can look at that one average centroid and get a feel for what the group represents, if you've done a good job. The plot here shows the representation of the burden subjects in the two-dimensional space; that's what each point is.
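The king/man/woman/queen arithmetic can be demonstrated with a toy two-dimensional semantic space (one axis for gender, one for royalty). Real embeddings such as word2vec or GloVe have hundreds of learned dimensions; the coordinates below are invented purely for illustration.

```python
import math

# Toy 2-D semantic space: axis 0 ~ "maleness", axis 1 ~ "royalty".
# These coordinates are invented; real embeddings are learned from corpora.
VECTORS = {
    "king":   (1.0, 5.0),
    "queen":  (0.0, 5.0),
    "man":    (1.0, 0.1),
    "woman":  (0.0, 0.1),
    "person": (0.5, 0.1),
}

def analogy(a, b, c):
    """Return the word closest (by cosine) to vector(a) - vector(b) + vector(c)."""
    target = tuple(VECTORS[a][i] - VECTORS[b][i] + VECTORS[c][i] for i in range(2))

    def cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))

    # exclude the query words themselves, as word2vec-style tools do
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))

print(analogy("king", "man", "woman"))  # queen
```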
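The k-means step itself is simple enough to sketch in pure Python on 2-D points like those produced by the dimensionality reduction. In practice you would use a library such as scikit-learn's KMeans; the deterministic "first k points" seeding below is a simplification (real implementations use smarter seeding like k-means++).

```python
from statistics import mean

def kmeans(points, k, iters=10):
    """Minimal k-means on 2-D points. Seeds centroids with the first k
    points (deterministic); real implementations use k-means++ seeding."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        for i, (x, y) in enumerate(points):
            labels[i] = min(range(k), key=lambda c:
                            (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2)
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (mean(x for x, _ in members),
                                mean(y for _, y in members))
    return labels, centroids

# Three well-separated toy clusters, ordered so the naive seeding works.
pts = [(0, 0), (10, 10), (20, 0), (0, 1), (10, 11), (20, 1)]
labels, cents = kmeans(pts, k=3)
print(labels)  # [0, 1, 2, 0, 1, 2]
```

Each centroid in `cents` is the prototype for its group, which is exactly the interpretability property mentioned above.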
And you can see that the projection has these three nice branches coming out of the center of the plot, so it looks like we've done a good job in terms of creating the groups. For evaluation, we're going to use TF-IDF. TF stands for term frequency: the number of times a word or term appears in a document, which in our case is a sentence. IDF stands for inverse document frequency, which decreases with the number of documents (again, sentences in this case) in which the word appears. So what TF-IDF does, basically, is give you a measure of the term frequency, but weighted down if the word appears in many documents. That's important because a word that appears in many documents has less power to explain the difference between them. We consider the TF-IDF scores for the top words in each group, and what's important here is that they're pretty well separated; there's no overlap. That's what we want.

Let's look at the groups in more detail. In the first group, you can see that the top three words by TF-IDF are "transportation", "service", and "provider", and every other word has a much lower score. So we can say that this is a group that focuses on transportation standards. We don't see a difference between public and private responsibilities here, but that kind of makes sense. About 21% of the burdens are in this group. The second group has words like "surface", "trail", "access", and "parking". These are all words that refer to physical barriers in public spaces, and so that's what we're going to call this group. Again, there's no difference between public and private here, and there's really no reason for that to happen, so that's fine. About 25% of the burdens are classified in this group. And then finally, we have this group where "organization" has by far the highest score, and everything else is a little bit confusing.
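A minimal TF-IDF computation, treating each cluster's burden subjects as one document, might look like this. The documents below are toy examples built from the top words mentioned in the talk, using the stdlib only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Score each word in each document: raw count times log(N / doc-frequency).
    A word appearing in every document gets idf = log(1) = 0, i.e. no weight."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return [{word: count * math.log(n / df[word])
             for word, count in Counter(doc).items()}
            for doc in docs]

# Toy "documents": the words of the burden subjects in each cluster.
groups = [
    ["transportation", "service", "provider", "transportation", "the"],
    ["surface", "trail", "access", "parking", "the"],
    ["organization", "person", "employer", "municipality", "the"],
]
scores = tfidf(groups)
print(scores[0]["the"])             # 0.0 -- "the" appears in every group
print(scores[0]["transportation"])  # highest score in group 0
```

This shows the weighting-down behavior described above: a word shared by every group, like "the", contributes nothing to distinguishing them.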
So if we look, we have "public" over here, and we have "municipality" and "minister", which point us to burdens that are probably government responsibilities. But then we have words like "organization", "person", "employer", and others that are kind of ambiguous: we can't really say whether it's a public or private responsibility. A more in-depth analysis of the burdens in this group shows that they're mostly about administration, compliance, and standards. This group contains about half of the burdens we extracted from the text, so there is some ambiguity there.

To recap: in terms of content, we wanted to find rules and obligations, and we were able to automate the extraction of burdens. We wanted to know which entities were affected, and we were able to do that by extracting the subjects of the sentences. We wanted to find homogeneous groups, which we did successfully by organizing the burdens into three groups with k-means. The next thing we wanted to look at was patterns across industries. We didn't really find those, but we did find that the legislation has a strong focus on physical barriers and transportation, plus a large proportion dedicated to administration and standard definition, so we got these three main topics. We weren't really able to find the difference between private and public, partly because it wasn't really important, but also because it wasn't clear in the last group. And so we were able to highlight some ambiguities with respect to that last group.

This project was a proof of concept, so it was limited in scope, and there's definitely more that can be done in terms of refining the kind of information that is extracted. But it is a framework that can be generalized and easily applied to any legislation and any domain.
So I see it as a first step towards an abstract representation of laws that can serve the purpose of improving law interpretability in at least a couple of ways. On one end, it helps extract and summarize information so that the rules and requirements can be made more accessible to anyone who needs to follow them, whether they're an organization with a legal department or a regular person. On the other end, it can help lawmakers by highlighting parts of the legislation that are ambiguous and could be rewritten or adapted to be clearer and more accessible. So this could be an instrument both for helping the understanding of existing legislation and for improving the way we write laws, so that going forward our legislators can write laws that are more interpretable and more approachable for everyone.

If you're interested in seeing the code and the technical details of the implementation, there's a repository on GitHub at Vardis. The slides are on my personal GitHub, and you can find me in the Slack at csv5-qna or feel free to DM me on Twitter. Great. Thank you very much.

Yeah. So we did get a couple of questions. I don't know if you want to jump straight into the questions, or I can help as well. Oh, I can see them. Okay. So the first one: both Canada and the US have common law systems; would you need to approach a European-style or even Quebec civil law system, with a more closed, non-historical system, differently? Let me just digest the question. I'm not entirely sure that I understand what you mean by that. In terms of technical implementation, I think this question is more focused on access to information with common law versus civil law, systems that are more closed and less based on historical precedent. I don't think they need to be approached differently, to be honest.
If I understand the question correctly, in terms of technical implementation I think it would be possible to apply the same framework. You can tweak it very easily: because it doesn't use a supervised methodology, there's a lot more freedom in how you play around with the parameters and the steps of the process in general. But Kyla, feel free to ping me if I haven't actually answered your question, because I'm not entirely sure that I've hit the point.

The next one is about the fancy math. Okay, the fancy math is spectral embeddings. I didn't want to go down the path of explaining what a manifold is and then lose everyone. Yeah. It's hard to fit everything into a short 15 minutes. Yeah, I hope that was clear. There's definitely a lot of technical stuff that I couldn't pack into the 20 minutes. Yeah. And there's a lot of conversation going on in the thread here around different types of tools, with different people pointing to different workflows and processes. But I think this is a great thing for us to continue in Slack, and we can keep the thread here available to you. And anybody who's got questions, just jump over to the Slack channel. So yeah, if you can give me access to this thread, I will respond to everyone. Sounds great. Awesome. Thank you very much. Thank you.