I'm Betsy Habel, and this afternoon we're going to be speaking about regexes, and specifically their DSL design and what we can learn from it when we're designing other DSLs. So just to keep everyone on the same page, we're going to start with a quick introduction to regular expressions for anyone who's not familiar with them, or for anyone in the audience who could use a refresher since they haven't worked with them in a while.
Here's the simplest regex I can think of. It searches a given text for the letters d, o and g, in that order and with no characters between them. So it'll match any of these strings here. And here's a less trivial example. In this one we use the period wildcard to match any character. Since this wildcard matches any character, the regular expression d-period-g, which is now on the screen, thank you Google, can match the strings "dig", "d g", "d!g", or a lot of other things.
There are a lot of other little wildcards. They can match more specific things as well: word characters, whitespace, even a thing called a word boundary, which marks the start or end of any given word. Both characters and wildcards can be grouped if the default groupings aren't powerful enough, and you can specify the number of characters to be matched with other wildcards like star and plus. The specifics matter less right now than the mere fact that there are a lot of things you can do. Getting a little more complex, you can use capture groups to single out specific subsets of your match for special treatment, and a backreference to refer to a previously captured group later on within a single regular expression. Also, Peter probably would pick up pickled peppers.
So we've got all of these building blocks, and individually they're pretty simple. Good little computer. There we go. And I'm not going to pretend that all regexes are simple.
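The slides for this introduction are not part of the transcript, so the following is only a rough Ruby sketch of the building blocks just described. The sample strings and variable names are made up for illustration; the regex syntax itself is standard Ruby.

```ruby
# Literal characters match themselves, in order, with nothing in between.
dog = /dog/
dog.match?("hotdog stand")   # matches: contains d, o, g in a row
dog.match?("d o g")          # no match: spaces sit between the letters

# The period wildcard matches any single character
# (any character except a newline, by default).
d_dot_g = /d.g/
dot_matches = ["dig", "d g", "d!g"].all? { |s| d_dot_g.match?(s) }

# \w is a word character, \s is whitespace, and \b is a word boundary:
# a zero-width position at the start or end of a word.
# + means "one or more"; * means "zero or more".
words = "the quick  brown fox".scan(/\b\w+\b/)

# A capture group (...) singles out part of the match, and the
# backreference \1 means "whatever group 1 matched, again".
doubled_word = /\b(\w+)\s+\1\b/
m = doubled_word.match("it was the the best of times")
repeated = m[1]
```

Here `words` collects each run of word characters, and `repeated` holds the word that the backreference found twice in a row.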
This, for example, is an email validation regex that someone, somewhere, for some reason, recommended that other programmers use in production. The simple elements that make up regexes can be combined in ghastly, hieroglyphic-esque ways, and often are. So at this point you may be wondering some things. Things like whether it is possible to learn about designing DSLs, or indeed about designing anything, from something that produces screenfuls of mess and that can't even fully parse an email address in the process. Because of course that email validation regex I just showed you did not actually work.
And the answer is that regex are old. Like C, like shell scripts, like Vim, regex are gawky and horrible, and everyone has used them for decades anyway. They are too bloody useful to erase. They are too bloody useful to give up, no matter how much we try to replace them with tools that are only aesthetically prettier. Anything that bloody useful has to teach us design lessons, whether its surface seems polished or not. The biggest goal of software design, over and above how elegant things are, is getting the damn thing to work. And regex, bless them, do that if nothing else. Some of that, as we will see later in this talk, is because they get to cheat. But we can still learn from the ways they cheat.
So how old are regex anyway? They were first defined as a mathematical concept back in 1958. They were an outgrowth of set theory, used for describing the grammar of regular languages. A decade later, they were implemented as a simple independent programming language. Note that this first implementation treated them as a programming language in their own right. A few years after that, they began to see wider use when they were embedded into a concrete tool, the Unix utility grep. They then became embedded in more and more powerful tools, such as sed and awk, and were embedded into the programming language Perl in 1987 as a first-class language concept.
In other words, regular expressions got a lot more powerful and useful, and therefore a lot more used, when they became a domain-specific language for string processing embedded within a more general-purpose language. In the 28 years since Perl came on the scene, regex implementations have been baked into countless other programming languages. We're at the point where they're considered a language feature rather than a language in their own right. Most programmers have forgotten that or never knew.
And when I frame regex historically like that, contrasting their early days as a programming language in their own right with their modern days as an embedded DSL, it naturally raises a question. What are DSLs anyway? Are they appreciably different from programming languages? Well, I don't necessarily think that the C2 wiki is an authoritative source, but it's someplace where a lot of smart people have had a number of informed opinions over the years. And they define DSLs, in a consensus that was reached through a sheer stunning amount of debate, as programming languages designed specifically to express solutions to problems in a specific domain. There are a lot of spirited discussions about the merits of this pattern, because two programmers, three opinions, and C2 wiki. But it's universally agreed by all of these programmers with all of these opinions that both the potential beauty and the potential horror of DSLs stem from their place as languages in their own right, because languages are difficult to design.
They also do some talking about whether regex are actually a DSL. Fascinatingly enough, a lot of people don't think they're complex enough to count as a language. To each their own, but I am the kind of person who will die on the hill that CSS and SQL are also programming languages.
And regex have far more complex control structures, even if these control structures are not actually powerful enough to avoid this kind of email validation regex and to let you express those ideas in a more concise fashion. But that cautionary tale aside, which is absolutely what we think of when we think of regex in fear, in the wild most production regex are a lot closer to this basic example. And while d.g isn't necessarily what we think of, it's a perfectly valid regex, and it exemplifies one of regex's genuine intuitive strengths. It's not that far a leap to figure out that a regex containing the letter d will match on the letter d.
More generally expressed, we can call this feature of regexes tight domain integration. Wow, I timed that right. Remember, DSLs are programming languages designed specifically to express solutions to problems in a specific domain. When DSLs tie themselves closely to the quirks and structures specific to a domain, they get a leg up in solving domain-specific problems. This is something that goes a bit deeper than the ordinary programmer superpower of naming things. You're not just importing concepts from the problem domain into Ruby. You're replacing the logic of Ruby with the logic of that problem domain.
Regexes get to cheat a bit when it comes to this tight domain integration: they're a text processing language, and they're made of text. Most DSLs we write don't get that automatic cheat, but we can achieve this tight domain integration with a little more work to figure it out. For example, we're going to build a query language that runs Twitter searches. Targeted Twitter searches, specifically, and we'll start with the simplest query possible, which is searching my Twitter feed for photos of my cat. We can see here why that is the simplest thing possible, or we will in about 10 seconds. At this point, you don't really need a DSL to express the thought.
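The slide code for the Twitter query example is not in the recording, so what follows is only a sketch of the comparison being set up here. Everything in it is hypothetical: the `Tweet` struct, the `FOLLOWS` graph, the `search` method and the `TweetQuery` class are invented for illustration, and an in-memory array stands in for Twitter so that no real API is called.

```ruby
# Hypothetical in-memory stand-in for a Twitter feed.
Tweet = Struct.new(:author, :text, :has_photo)

# Who follows whom (made-up data).
FOLLOWS = {
  "betsy" => ["noel", "tina"],
  "noel"  => ["chris"],
  "tina"  => []
}.freeze

TWEETS = [
  Tweet.new("betsy", "my cat on the couch", true),
  Tweet.new("betsy", "regex are old", false),
  Tweet.new("noel",  "cat in a box", true),
  Tweet.new("chris", "cat on a keyboard", true),
  Tweet.new("chris", "dog at the park", true)
].freeze

# Hash-style interface: perfectly adequate for the simplest query.
def search(tweets, from:, text:, photos_only: false)
  tweets.select do |t|
    t.author == from && t.text.include?(text) && (!photos_only || t.has_photo)
  end
end

# Query-language style: each method narrows or widens the query and
# returns self, so the calls read in terms of people and relationships.
class TweetQuery
  def initialize(tweets, follows)
    @tweets = tweets
    @follows = follows
    @authors = nil
    @filters = []
  end

  def from(user)
    @authors = [user]
    self
  end

  # Widen the author set by one hop in the follow graph.
  def and_people_they_follow
    raise ArgumentError, "call from first" if @authors.nil?
    @authors = (@authors + @authors.flat_map { |a| @follows.fetch(a, []) }).uniq
    self
  end

  def mentioning(word)
    @filters << ->(t) { t.text.include?(word) }
    self
  end

  def with_photos
    @filters << ->(t) { t.has_photo }
    self
  end

  def results
    @tweets.select do |t|
      (@authors.nil? || @authors.include?(t.author)) &&
        @filters.all? { |f| f.call(t) }
    end
  end
end

# Simplest query: photos of the cat in one person's feed.
mine = search(TWEETS, from: "betsy", text: "cat", photos_only: true)

# Social-circle query: two hops out through the follow graph.
circle = TweetQuery.new(TWEETS, FOLLOWS)
                   .from("betsy")
                   .and_people_they_follow
                   .and_people_they_follow
                   .mentioning("cat")
                   .with_photos
                   .results
```

For the first query the keyword-argument version is arguably the clearer of the two. The chained version starts to earn its keep only once the question is about the shape of the follow graph, which a flat options hash has no natural way to express.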
A simple hash interface would convey my intent as clearly, and implementing that interface would be far more straightforward. But what if you want photos of cats in my general social circle? Suddenly, a more complex query language starts to make sense. These two examples are roughly comparable, but when we start to add more complicated logic around the network diagram of my Twitter friends, then our quote-unquote simple hash interface starts to look a lot less simple. This hash below would be difficult for the search function to parse and difficult to actually use. It would be difficult to document and difficult to remember. This is happening because we're defining our API in Ruby's terms rather than our domain's terms. It starts to look like a bad DSL, actually, and specifically one without tight domain integration. In the first example, by admitting that we are writing a DSL, we were able to maintain a tight focus on the core domain concepts, which ultimately led to a smoother design.
Now, you'll note one thing that I am not saying here. A lot of people talk about strings like this as examples of successful API design because they're English-y. What's actually happening, though, is more complex. The two examples we're going to be looking at in about five seconds are both RSpec, from different years of the framework. They're both, I suppose, English-y in the loose way that we were using the term before. That is to say, they both use English words to name things, and their grammar occasionally causes those English words to flow together in a way that apes an English sentence. The top example is definitely the English-y-er of the two. It's pretty much a sentence in its own right. But it's been supplanted by the second style as RSpec has evolved, which is against what we'd expect if English-y-er was always the goal of API design. It's been replaced by that for a lot of reasons, among them a much cleaner implementation.
And it actually isn't any harder to work with in practice, which goes against the idea that English-y is the goal. The mark of a good DSL isn't how closely it approaches English. It's whether it enables programmers to write programs. The RSpec DSL neatly encapsulates domain concepts like test cases and assertions, achieving the same tight and necessarily intuitive domain integration that regex achieve by having "dog" match "dog". And only some of RSpec's tight domain integration comes from it choosing good names for things. The vocabulary of the DSL makes sense, but languages are made of grammar as well as vocabulary.
And this brings us to our next big principle of good DSL design, namely composability. If I want to make a regex that searches for either dog or cat, the answer is pretty easy. Regex's grammar is simple and for the most part intuitive. "Or" combination and backreferences are really as complicated as it ever gets. Since all it's doing is providing a facility for simple text matching, and since it's made out of text, it once again gets to cheat and for the most part lean on its own structure to develop a grammar. Since most domains aren't quite such natural fits for one character after the next, they need to develop more complex composition rules.
When we build Ruby DSLs, we are building languages that are implemented in Ruby and which lean on the Ruby parser. And because of that, we're constrained by Ruby's grammar in deciding which composition rules to adopt. In practice, this leads us toward three basic shapes. The first and simplest is the class macro DSL, specifically the class macro with a lot of configuration options. This sort of example is useful as a top-level hook interface between a library and classes that want to make use of its features. It's how a lot of the Rails framework, for example, is expressed, as well as a lot of image attachment libraries.
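A minimal sketch of the class macro shape, assuming a made-up `HasAttachment` module and `has_attachment` macro. It is loosely in the spirit of image attachment libraries but does not reproduce any real library's API.

```ruby
# Hypothetical library module; none of these names come from a real gem.
module HasAttachment
  def self.included(base)
    base.extend(ClassMethods)
  end

  module ClassMethods
    # The class macro: called in the class body, it records a
    # configuration hash and defines a few instance methods.
    def has_attachment(name, styles: {}, default_url: nil)
      attachment_options[name] = { styles: styles, default_url: default_url }

      attr_writer "#{name}_path"

      define_method("#{name}_url") do |style = :original|
        stored = instance_variable_get("@#{name}_path")
        stored.nil? ? default_url : "/#{style}/#{stored}"
      end
    end

    # Per-class registry of everything the macro was configured with.
    def attachment_options
      @attachment_options ||= {}
    end
  end
end

class User
  include HasAttachment

  # Everything the macro can express has to fit in these options.
  has_attachment :avatar,
                 styles: { thumb: "100x100" },
                 default_url: "/missing.png"
end

user = User.new
before_upload = user.avatar_url
user.avatar_path = "cat.jpg"
after_upload = user.avatar_url(:thumb)
```

The one line in `User` is the whole DSL surface. Anything that does not fit into `styles:` or `default_url:` has no way to be said.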
It's not necessarily that expressive, because you can only build concepts with it that can be expressed in a configuration hash, but it's easy to read and easy to implement and hard to screw up.
The next most complex of the DSL styles that we're going to talk about is method chaining. In this style, which will hopefully appear on the screen now, you use a series of methods that return self to build code sequences that continuously refine what an object means before using that object. This is a very common JavaScript DSL structure, but in the Ruby world, I've mostly only seen it in test libraries like Mocha mocks or RSpec matchers. Honestly, I wish it were used much more often. Since it's designed around the idea of continuously modifying objects, it's easy to manipulate and reason about, and it can be bent to match a lot of different domain models. In our example Twitter query DSL, our composition rules focus on the shapes of the relationships that people have with each other. In Mocha, they focus on the different properties of mock objects. In each case, the grammar which defines how elements can be composed also echoes the domain structure. In other words, tight domain integration matters at both the vocabulary and the grammar levels of a domain-specific language.
The last common Ruby DSL style is the block structure. In its simplest form, the one-level block DSL, it's a common choice for tiny configuration DSLs. It provides a really pretty interface with a minimum of implementation. Good little computer. And, there we go. You can also build nested block DSLs. Since the style pushes you toward code that takes on a tree or a nested structure, it's a strong choice when the pattern echoes the landscape of that domain. In the Rails routing DSL, for example, the tree shape echoes the directory structures that web routes visually imitate. This block structure is a common one in Ruby DSLs.
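As a rough illustration of the nested block shape, here is a toy route tree. It is not the Rails routing implementation; `RouteTree`, `draw`, `namespace` and `get` are hypothetical names, and the sketch assumes the common technique of evaluating each block with `instance_eval`.

```ruby
# A toy nested block DSL; not how Rails implements routing.
class RouteTree
  attr_reader :paths

  # Entry point: evaluates the block with self set to a new RouteTree.
  def self.draw(&block)
    tree = new
    tree.instance_eval(&block)
    tree
  end

  def initialize
    @paths = []
    @prefix = []
  end

  # Each nested block adds one path segment for everything inside it.
  def namespace(name, &block)
    @prefix.push(name)
    instance_eval(&block)
  ensure
    @prefix.pop
  end

  def get(name)
    @paths << "/" + (@prefix + [name]).join("/")
  end
end

context_seen = nil

routes = RouteTree.draw do
  # Inside these blocks self is the RouteTree, not the caller, so the
  # caller's own methods and instance variables are out of reach here.
  context_seen = self.class

  get "about"
  namespace "admin" do
    get "users"
    namespace "reports" do
      get "daily"
    end
  end
end

all_paths = routes.paths
```

The nesting in the source mirrors the nesting in the paths, which is what makes the shape a good fit for this domain. The `instance_eval` calls are also where the trouble starts: the code inside each block runs in a different context from the code around it.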
It defines a grammar that feels removed from the ordinary one-method-after-another rhythm of Ruby. And so it feels DSL-y in the same way that arranging things in sentences feels English-y. It's not that hard to implement, necessarily, from a lines-of-code perspective, but because it relies on passing blocks of code in between different contexts, it's sometimes hard to reason about. When things go wrong, it can be difficult to intuit or even find the context in which any given line of code is executing.
And this leads to one of my most common frustration points with other people's DSLs, namely them using the block structure inappropriately because it looked DSL-y. A slide demonstrating my point should appear in a few seconds, but in the interest of time: the abstraction that they try to implement with these inappropriate block structures doesn't neatly fall into a nested structure, necessarily. And so when I write code that tries to fit what I'm trying to express within this nested structure that doesn't fit it very well, I wind up needing to pass around procs a lot, or use a bunch of instance_evals, or both, in order to get things done in a DRY way. Worse yet, because I'm passing around all of these blocks that are evaluated in various contexts that I know very little about immediately, I need to read the library's code and really know a lot about what contexts these blocks are being evaluated in. I need to care about the internals in a way that I wouldn't necessarily need to care in a less leaky abstraction.
And to be frank, this talk was inspired by a DSL that made me do that. It also was designed in a way that wasn't easy to extend or modify, and so I wound up needing to monkey patch it a lot. It was a really perfect storm of frustration, and so I was trying to write a talk to figure out why I hated that entire process. All through the project I was working with that DSL on, I wanted two big things from it.
I wanted it to be easily extensible with ordinary object-oriented techniques, so I didn't need to monkey patch it all the time, and I wanted to be easily able to merge blocks of code written in that DSL. In other words, I wanted its grammar to allow for better composability. And when I started working on this talk, I figured that those two were the same thing. I really did think I was going to wind up proving that DSLs were irrelevant, and I was wrong. And here's why.
Regexes are made of strings. You can trivially build a regex with Ruby using perfectly ordinary string-manipulation Ruby. You don't need to use class_eval and feel dirty about it the way I did in the examples I was showing before. And I figured that as long as I was going to say that you can do stuff like this with your DSL, it was going to be perfectly fine, it was going to be great. And this talk was going to just be about how to make it possible to do that stuff. But if we accept that domain-specific languages are just languages, then what actually is the difference between combining regex fragments with Ruby and intermixing Ruby with other languages? What's the difference between the regex with embedded Ruby up top and the JavaScript with embedded Ruby below? There isn't all that much of one. And if we poke at our instinctive "eww" reaction to that JavaScript with embedded Ruby, we can figure out why.
So in this example, we're initializing a JavaScript array and then using embedded Ruby to manually build up a set of literal push calls that reassembles a Ruby array in JavaScript world. When I've seen this first example in the wild, and yes, I have seen it in three different production code bases, God help me, it's generally been in the context of a web application view. In other words, the developer was writing that code to transfer an in-memory Ruby array on the server to an array on the client. But of course, there's another, more widely accepted way to do that. It's the example below.
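The two slide examples are not in the recording. The following Ruby sketch is a reconstruction of the contrast being described, not the slide code: the `pets` array, the template text and the `pets_endpoint` method are assumptions made for illustration. It uses only the standard `erb` and `json` libraries.

```ruby
require "erb"
require "json"

pets = ["dog", "cat", "goldfish"]

# Approach 1: embedded Ruby writes JavaScript source code, one literal
# push call per element. Ruby is reaching into the JavaScript program
# and assembling its statements by hand.
push_template = ERB.new(<<~TEMPLATE)
  var pets = [];
  <% pets.each do |pet| %>
  pets.push("<%= pet %>");
  <% end %>
TEMPLATE
generated_js = push_template.result_with_hash(pets: pets)

# Approach 2: the server exposes the data through a defined interface
# (here, a hypothetical endpoint body that returns JSON), and the client
# fetches and parses it without Ruby ever writing JavaScript.
def pets_endpoint(pets)
  JSON.generate(pets)
end

payload = pets_endpoint(pets)

# What the client side would recover after parsing the response.
round_tripped = JSON.parse(payload)
```

In the first approach the boundary between the two languages runs through the middle of a string of JavaScript source. In the second, each side stays inside its own language and only data crosses over.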
You just write an API endpoint on the server that returns the array, and then the client-side JavaScript accesses it using an ordinary Ajax call. In writing the embedded Ruby, we're ignoring an existing, well-defined interface for transferring information between the client side and the server side. And in ignoring that interface, we can figure out what's going wrong. It's not just that we're ignoring the interface, by the way. When I first had this "eww" reaction to the array push, I didn't actually know enough JavaScript to understand that there was an accepted way to not bullshit that.
But if there's a defined interface for us to ignore, then that means that we must have two objects that the interface is between. In this case, the objects are the Ruby server and the JavaScript client. But we can as easily think about that as the Ruby and the JavaScript. We can think of the languages as kind of objects in the CS meta sense. This is a little easier to understand when we look at the regex example. It's very clear that the two different objects are the languages themselves. And if a chunk of any given language is an object in its own right, in, again, some very interesting meta sense, then what we're doing when we use Ruby to compose a regex or assemble a JavaScript array is crossing those object boundaries of the language. Those interpolated Ruby strings are not actually spiritually different from using instance_eval to call a private method. They're reaching into the JavaScript's business and messing around with it, which is part of why code generated using this method is so very hard to understand and debug.
And suppose we've got that mental framework in place. What's the difference between interpolating Ruby into JavaScript, like the example above, and interpolating Ruby into RSpec? And I know I just said a really weird thing. RSpec is written using Ruby, so it sounds funny to talk about interpolating Ruby into RSpec.
But again, in order for a DSL to be useful, it needs to be a language in its own right. We need to give it that respect. And so we need to accord RSpec that respect. And RSpec is kind of weird in this way, right, because it expects you to embed Ruby into it, but it expects you to embed this Ruby in specific, cordoned-off, and well-defined places. When you embed Ruby in a place that isn't one of those, like by using an each loop to define a group of similar examples, then you're crossing language boundaries, and it feels icky in the way that that always does and should always do.
If I were to try to use ordinary object-oriented techniques to try and extend RSpec, like I wanted to be able to do with that bad DSL I was talking about earlier, that would also be crossing those boundaries. When was the last time you tried to extend the class that all describe blocks build instances of? For that matter, when was the last time, outside of Sam's talk earlier, that you thought about the fact that describe blocks instantiate an object? RSpec's language design successfully hides these implementation details from you, just like a good library and a good language should. You don't think about C when you're writing Ruby, unless you're doing weird optimization. More than that, it successfully obscures its own Ruby-ness. We nearly forget, using it, that it was written in Ruby and therefore must be made up of the objects and classes that make up all Ruby implementations. We get to do that because RSpec has removed the need to think about it.
Instead of asking users to use ordinary object techniques to extend RSpec, its maintainers have defined some specific extension APIs, such as the shared example API and the matcher API. And for matters connected to the actual purpose of RSpec, namely the structure of example groups and examples and expectations, you're expected to still not interfere. In other words, any language's rules of composition stay within the language.
Composability is not about how easy it is to cross language boundaries to do whatever you want. It's about how easy it is to do what you want in a sensible way while staying within the bounds of the language. And that's great and all, but it doesn't solve one of the problems I had with that other DSL, the terrible one that I'm deliberately not naming: that I couldn't do all the things I wanted to do with it, period. Never mind sensibly while staying within the bounds. That's why I needed to monkey patch its internals.
And so how do we avoid that problem in our DSL designs? Well, we can provide a small, defined extension API like RSpec does. And that lets us define new words in the language without bending its grammar out of shape. But there's another way, and I like this one better. And it's very simple. One of the beautiful things about regular expressions is that they search within text and they occasionally replace text. They do not try to do anything more. They do not claim to do anything more. They have chosen one specific problem space and they don't try to solve any other problems. As Stack Overflow's funniest answer is quick to remind us, regular expressions can only parse regular languages, and those are a very small subset of all the languages in the world. They have their limits. They are not a complete parsing engine for anything, especially not HTML. And also, again, not to beat a dead horse, email validation. That is totally okay, because they do not need to do anything but search text.
I'm going to call this closed domain integration. It's not enough to integrate deeply with a domain. You need to go to the limits of that domain and no further. In order to get there, you need the flip side of this coin, namely constraining the domain definition so that you know where those limits are. It's okay to define these limits with big red placeholder boxes like RSpec does and say "user code goes here", but you need to have that really specific definition.
You need to know where those boxes lie. If you do that, it makes the problem of covering the domain completely one that is even solvable in the first place.
So I'll start wrapping up now. As Rubyists, we are not going to stop writing DSLs anytime soon. It's one of the things everyone jokes about us, but actually it's a strength, because DSLs are very powerful and they're kind of cool when they're done right. So the question then becomes how do we write the good ones, rather than the ones that Aaron is having feelings about right here. And so you can treat your DSL like you would any other API. You can expose what people need, you can close off the other stuff, you can stay close to the domain you're describing and have sensible composition rules, and you can keep everything small enough to complete.
Getting there, though, is again a very hard problem. While a good DSL is often more usable than a good vanilla library API, a bad DSL is much less usable, as we've all experienced, than a bad vanilla library API. I'm not saying right now that you're doomed to screw up, because obviously you've seen this talk and every DSL you design from now on is going to be perfect, but a good DSL is a lot more work than a decent vanilla API, and that's something that you have to respect. You're going to need to write that decent vanilla API anyway in order to implement the DSL. And so I'm going to suggest that you do that first, and figure out if you need more, and let things lie like that.
That's everything I need to say right now. I am Betsy Habel again. I am Betsy the Muffin on Twitter, which is going to pop up on the screen in about five seconds. I am very sorry about the AV issues. I'm not entirely sure what's going on with Google Docs. This talk is going to be up on my website, at the URL on the screen, shortly after this talk, probably sometime during the lightning talks or dinner. Whenever I can get a decent lock on to GitHub with the conference internet, really.
I tweet about books, code, my cat and feminism at Betsy the Muffin. And I co-organize a meetup back home called Learn Ruby in DC. This is an informal space for newbies to ask questions and find mentorship. If you are interested in making a meetup like that in your own hometown, or if you also run a meetup like that and want to talk shop, then please talk to me. I think it's a really good model for building the community, and I would love to share nice stories and also pitfalls so you can avoid them.
I work for a great little organization called ActBlue that builds fundraising tech for Democratic candidates and causes. We focus on small-dollar donations, which is a surprisingly powerful thing. Our average donation size is around $30, and we've raised nearly $850 million over the approximately a decade we've been in business. And this really helps those donors' voices be heard in a way that keeps the party accountable to the voices of people who only have $30 to spare at a time. It's something that means a lot to me. We are also committed to building sustainably at the kind of scale that can bring in that much money over time. We have a modern, tested stack, and we have a focus on maintaining culture that, well, my third day was one of our biggest days of all time, right? And pretty much everyone on my team HipChatted me over the course of the day saying, "By the way, Betsy, I know it's end of quarter. You're going to close your laptop at 5:30 and you're going to have dinner and you're going to do everything but be on call." And we're also hiring Rails, UX and DevOps people right now. So if the values I just outlined sound good to you, if they resonate, then please talk to me. I'd love to work with you.
Many thanks to Noel Rappen, Kenzie Connor, Chris Hoffman, Tina Wiest and the entire membership of Arlington Ruby users group for invaluable feedback while I was developing this talk.
That I personally have built? I have not built enough things that actually require a DSL.
Like, I really do take the responsibility to go up to those bounds and no further quite seriously. And so I've built some templating stuff that I'm pretty proud of, but other than that I haven't worked in any problem spaces that I feel require that level of power. Unfortunately, that's all been proprietary stuff, so I can't point you to a GitHub repo.
The question from Walter is whether I have any mental litmus tests for when something does want a DSL. So for that, let's kind of go back a few slides. A lot of slides. Hi there, now you're working quickly. What the hell? So if you can see in that second example on the bottom, we're getting an increasingly complex hash interface. And one of the things about that is that as you acquire more and more options for any given library access point (we'll call it that even though that sounds really fancy and it's not a fancy concept), any given method call that's at the front edge of your API winds up taking a lot and a lot and a lot of parameters. You should start thinking about ways to encapsulate all of those parameters within an object, and a lot of the time a nice simple method-chaining DSL is a great way to actually build that parameter object in a way that's clean and readable.
One of the questions I kind of anticipated getting was someone calling me on differences between RSpec and Minitest, because they're very different stylistically in terms of implementation. But in terms of the ways the Minitest DSL has evolved over the years and the RSpec DSL has evolved over the years, one of the interesting things is that they've actually evolved toward each other. I think that it's valid to want something like the full-on Test::Unit, prefix-everything-with-test style. It drives me bonkers. And through the years we've seen a lot of things, like RSpec, like Minitest spec syntax, like Shoulda, that attempt to impose more structure than the test case magic API gives you.
And there are no hard and fast rules in programming, so this is going to be a matter of taste. But the outer edges of the RSpec API, with describe blocks and it blocks, seem to be something that a lot of different things just ultimately, eventually decide works for test cases, even if that's not where they start out.
Cool, wonderful. Well, I will let you all get to the lightning talks. Thank you so much.