 Well, it's that time of the week again. It's time for Chitchat Across the Pond. This is episode number 789 for March 16th, 2024. And I'm your host, Allison Sheridan. This week, our guest is Bart Bouchots with Programming by Stealth number 163. We're still having fun with JQ, right, Bart? Well, I'm certainly having fun. I hope you are, too. I am, I am. It's, as I told you, I find that JQ is not particularly sticky in my head that in between the two weeks, I complained to Bart while I was, I was actually trying to work on the JQ homework while in a car on an unpaved road out in the desert. And I was basically, I was just sitting there looking and going, I have no idea. I know you start with JQ in a single quote, but I, maybe some square brackets and a, but wow, I'm foggy on the beginning. I mean, like just the simplest things. And then I got to one point where I was just like, I know what I'm doing is right, but why is it not working? And I went to chat GPT and I gave it like a super simplified example of what I was trying to do. And it said exactly what I was saying. And I was looking at the two things. And the final thing was like, I hadn't, after I'd done everything, I had everything in an array. So to get to display name, I needed to explode the array and I kept forgetting to explode the array. But even looking at the answer that chat GPT gave me of the simplified example, I look back and I'm like, those are the same thing. How could that be different? So it seems to squirt out of my brain. And I know you're doing it regularly at work. I am not. So the homework is definitely critical for gluing the broken bits back into my brain. Well, some very wise person suggested that maybe we should do challenges as part of the series. I don't know who that was. I don't know. One other little thing I wanted to ask people is, how many different ways have y'all learned to spell laureates? Because between Bart and me, I think we may have found all of them, but it's not just one way we can misspell them. Well, I haven't found how to spell it right. I've just found all the wrong ways. So. Well, eventually it's funny. We get it right in the code because otherwise it's not gonna work. But if you write a comment, there's no chance it's spelled correctly in the comment. Yeah, that is a fair point. Oh, I see one wrong. Yeah, there we go. That's wrong there in line five with the first challenge solution, isn't it? Yeah, I thought I fixed it on that one. Maybe it's a different one. Okay, well, we'll find, oh yeah, no, I may have changed it to a different, that is a new, you did not spell it that way. I did that. Oh, great. In case it's not just me, right? There's like A's and E's going all over the place. But yeah, so it may or may not be spelled right in the comments. Good luck to all of you in following what we're saying. So before we get to our challenge solution, just to set the scene, I'm a big believer in the old tell them what you're gonna tell them, tell them, tell them what you told them approach. So I had promised you, I had teased that we would earn all about lookups in this installment. And that's half true because as I began writing the show notes, I realized that there were too many concepts. And if I didn't split it in half, I would end up teaching nothing instead of twice as much because none of it would actually go in. So we've ended up actually with quite a nice split. So we're gonna spend a lot of time today trying to understand why we are interested in lookups. And it's kind of a much bigger lesson than just JQ. It's a lesson for any data processing you do in any language. Because with my work head on, I do a lot of data processing with JavaScript because I like JavaScript. And with air node, you can write JavaScript to run shell scripts. So I do that a lot. And then when I have JSON data, I obviously use a lot of JQ. And when I have SQL data, I use a lot of MySQL or Postgres SQL or if I'm really unfortunate Oracle, very unfortunate, bad day. But a lot of the concepts are actually the same. It doesn't really matter what language you're in. So a lot of the stuff we're gonna do for the first half of this installment is way more philosophical than you're probably used to. I think we might have done this more in the very early days of JavaScript when we were meeting our very first programming language ever. I think we got quite philosophical sometimes, but we're going back on that road for the first half of today. Then we're gonna look at how JQ applies that concept in its own unique and special way because JQ is kind of unique and special. Because I don't know if you've noticed this, but you've been doing an awful lot of looping in JQ, but we haven't met a single looping keyword you'd recognize. We haven't met a for, we haven't met a while, we haven't met a for each. And yet when you explode an array, you're looping over every element in the array. Right, right. So we do loop a lot. We do a lot, a lot, a lot, a lot of loops. We've never looped explicitly. And so how JQ does things is usually quite different to how other languages do it. So how JQ does lookups is very JQ specific. And then we're gonna build some. And that's the point we're gonna draw a line under for the day. And then when we come back in two weeks time, we're gonna pick it up, is the opposite of building is taking them apart because it's actually very important to be able to do that for a couple of reasons. A lookup is a really great way to access data, but it's actually terrible for trying to process the data. So if the problem to be solved isn't find me something quickly, but do some sort of a search or do a transformation, you may actually want to break the lookup apart, assemble it into a more traditional structure, process it and then maybe put it back into a different lookup. Or the other thing is someone may have written the data with a lookup on one field and you don't want it on that field. And the example, this is when I in my real life learned how to do these things with JQ because when you do a lookup, it's on a key. And the wonderful, wonderful Troy Hunt does have I been pwned. Amazing service, amazing person. But his JSON is in the form of what we're gonna discover today is a lookup. And his key is designed for when the problem to be solved is what breaches was this person involved in. And the problem I have to solve is the opposite. A breach has just happened, which people are in that breach. Which means the data is- Oh, that's the exact opposite. Right, the data is wrong. So I have to pull- Well, it's not wrong for me. It's just not what you want. Yeah, it's wrong for my purposes, right? The context is important. So I have to pull his lookup apart completely, rejigger all the pieces, and build the lookup that I need to answer my question. And that's actually a very important skill. So the entire next installment is about disassembling and potentially if needed reassembling into what you need for your purposes. But before we do any of that, we of course had a challenge. We spent a lot of time in the previous installment getting very friendly with processing arrays. With manipulating them in various ways and two very useful manipulations we learned about was reordering arrays so that they were sorted in the same order. And deduplicating arrays so that instead of there being five copies of the value two, they did only the one copy of the value two. And that's actually something you need to do quite a bit when processing data. And a fantastic example of why you would want to do that is if you just want to list basically the honor roll of everyone who when they come to Princeton gets to park in the Nobel laureates area. Because they actually have a different car park for Nobel laureates because they have a few. And so you just want everyone once, right? Einstein actually he only had one Nobel prize. Marie Curie should she come back from the dead and drive? Don't think she drove. But she doesn't get three parking places. She gets one parking place. So obviously you want a deduplicator. And that's sort of why the example I picked was to do a list of all the laureates which made you have to deduplicate them because lots of people won more than one Nobel prize. And also I wanted them sorted alphabetically because well, it seemed like a sensible thing to do with the list of names. So how else you find whether or not someone's on the list, right? If you're the bouncer at the car park for the Nobel laureates and someone comes up to me and says, hi, I'm Andrea Gez. And if it isn't alphabetic, how's the bouncer supposed to say, wait a minute there, Dr. Gez. I have 100 and something or I think it's 176, I think is what I got in total or something. Might've been more. Anyway, whatever it is, a lot of rating to do, right? So anyway, that's why that was the problem to be solved kind of semi-practical anyway. So you will find my challenge solution for the basic solution, which is just print all the names and sort them alphabetically as you would write them. So Andrea Gez would sort as an A, the Institute of Peace, I think was something called, actually the International Panel on Climate Change would sort as a T. And just sort them as you see them basically. That was for full marks. And then for a bit of extra credit, you could sort them on surname, which is actually quite fiddly because organizations don't have surnames. So we'll deal with the fiddly bit after we do the important bit, which is the bit to get you full marks. So in order to start working with the laureates, the first problem to be solved is to get all the laureates. You're gonna have duplicates, you're gonna be in the wrong order, but just get them all. And so the first step to getting out all the laureates is to loop through all the prizes so we explode the prizes array. Then we have our prizes and we don't care about the year, we don't care about the category, the only thing we care about is the laureates. So we explode those out. And now we have all the laureates that ever existed from all the prizes. But we only want the names, only some of them have a first name and a surname and some of them only have a first name. So I basically copied and pasted my solution from the previous challenge, which was to do the pretty printing of the names and I just pasted that in. So at this point in time, what I have is many, many outputs, one for each laureate, but they are individual outputs because we've exploded and then we've exploded again. So we now have many, many pieces, but the functions for sorting and so forth, expect an array. So we need to collect our pieces together in an array. So we wrap everything we've done so far in square brackets and now we've recollected them all back into an array. And then a nice side effect about how the unique function does its magic of deduplicating is that it sorts the array and then it just goes through it from start to finish. And if it meets the same value multiple times in a row, it throws all of them away, but the last one. So that's actually how unique works is it's a sort followed by throw half of it away. And so by simply piping our array to unique, we get straight to the answer. So we explode twice, we pull out the name, group it all into an array and run it through unique. And hey, presto full marks. You know, I forgot that unique would do this, would do the, I knew it would do the deduplicating, but I forgot that it would do the sorting, but I kind of cheated. I thought, you know what? I know of a data set, a JSON data set that already has display names created because Bart did it last week in this show. So I started from your last time solution because I was like, why should I redo all that work? I already know how to do that if I remembered. And so I started from that, but you must have deduplicated last time because I didn't use unique and I don't have three Marie Curie's. That's unusual. That sounds like a side effect of something earlier in your logic. I haven't seen your solution. So I don't, I can't comment on the why. That is unexpected because I know for a fact that before I run mine through unique and I poked it through a word count minus L, there were many, many more of them. And when I run it through a word count minus L after I showed it through unique, which is my little test to myself that things were going as expected. Huh. Okay. So the question- I just checked and I only have one. I think I got accidentally lucky. Hey, if it works. Although it is kind of interesting to understand why it works. So I might be curious to have a look at that maybe after we record tomorrow. What the listeners don't know is that the clocks have done that really annoying thing where we're the wrong amount of hours apart. So it's now later for me. Anyway, it's all time you whine me wibbly wobbly. So in order to get the bonus, we then want to do something a little bit tricky. We want to sort on a different value than what we display because we don't want to display them in the wrong order. We just want to sort them by their surname but still display their name normally. Which gives me an excuse to use a trick that iTunes of all things taught me. So when you would look at an MP3 file in its ID3 headers, there was a field for title of the song. And then there was a field called sort order. And you could put Beatles comma the into the sort order to have your albums have the Beatles sort under B but they would still show up in your list as the Beatles. They would just be... Oh, that's kind of neat. After the Yard Varks who don't exist as a band but that's much in the video. And then after ABBA. Oh, you know, ABBA would come and then the Yard Varks and yeah, exactly. And so that gave me basically my basic approach is instead of exploding all the way to the name so explode, explode and then don't reduce it to the name instead inject two keys. A display name, copying and tasting my logic from previous challenge and another field called sort order. I think I called it sort by in my solution. But same logic, right? Yeah, I did call it sort by. I think. And the sort by field, I applied different logic so that it would basically say, do you have a surname? If you do have a surname, then build me the string surname space first name. If you don't have a surname, just give me the string first name. Which meant that my sort order would let the international panel on climate change be the international panel on climate change and Einstein would sort as Einstein, Albert, instead of Albert Einstein. So he'd sort from the ease. So I recollect all of those into an array. So I now have too many laureates in the wrong order. So the first thing I do is I use the unique underscore by function to de-duplicate based on- Wait, why are they in the wrong order? You just did a sort by. Not in my bonus solution, which is further than the page. So- So you just described creating the dot sort by. Yeah. But you haven't dot sort by. You've created that as like a variable. It's a field, yeah, exactly. So it's an extra key in the, so every dictionary for the laureates now has a sort by key. And a display name key, as well as the name, the surname, whatever it had when it started, it now has two extra keys, sort by and display name. Okay, and sort by has what in it, what are the values of sort by? So if there's a surname, the value of sort by will be surname, space, first name. Otherwise, it will be first name. Okay. Why not just sort by display name? Well, because that's not what we want. We want the display name is first name, space, surname, or just- Oh, right, we want it by last name. Okay. Right, that's the whole point. We're sorting by something different to what we're showing. So they're two different. So at the end of that, when I reassembled everything into an array on line 19 of my solution, then I now have an array of dictionaries. Not an array of strings, I have an array of dictionaries. So to sort the array of dictionaries, I need to tell it which key in the dictionary it should use to order things. So that's why I use a unique underscore by, which will have the side effect of sorting by the display name, but at least it's getting rid of my duplicates. So Einstein is now gone, or sorry, Curie now is now to one Marie Curie instead of three Marie Curie. But they're now sorted in a different wrong order, but there's the right amount of them. So now we pass it to sort underscore by, and we tell it this time that the key it should use is dot sort by. So now we have an array of dictionaries without duplicates in the right order. And then the last thing we do is we explode it one more time. We pull out the display name, and then I chose to put it back together to return an array as the output. You could have put the individual pieces. I didn't specify, so both would be perfectly valid. I set a list of names, so you could take that in English to mean names in a row, or in JSON, in which case it's names in an array, but they're both valid interpretations of me giving the vague hand waving description of list. You didn't actually say a list of names. So when I first showed you my answer, it was a list of dictionary objects in alphabetical order by surname. And you said that was right. Technically what you said was build an alphabetical sorted list with the names of all laureates. So I would say there's a list of names, but okay. Well, but not only names. It's not the best English I have ever crafted. I will hands up say that it is not the clearest instructions I have ever given. I mean, I botched a lot of other stuff up in my answer, but I was going for points from the professor for it to on a technicality there. Hey, we're working an exam. You know, you go to the exam's appeal board and I've sat on those and he was like, yeah, that question wasn't clear. Mark said the student. A little fun aside, before we'd even recorded this installment, we're recording now, basically the moment I hit publish on the previous installment, we got an email from one of our loyalist of listeners, one of the many listeners, Jill. Yeah, we have multiple Jill's, but this is a different listener, Jill. And Jill immediately pointed out that I had basically set an impossible challenge because the rules for actually in the real world, sorting by surname, are hella complicated. Because our own data set shows us that we have names like huffed as a surname, which is... What's up with that guy? It's an abbreviation of huthuffed, which technically means the head. So he's actually Gerardo's the head, which is kind of cool. I don't know if that means the head is on top of the body or the head is in chief of an organization. Both are valid meanings of the word in Dutch. But the surname is abbreviated with a, you know, huffed. Then you have special characters don't sort in the right order in JQ because in actual fact, when you're sorting on names, you should pretend that an E-Agu is just a plain O-D. Mac, M-C, should be sorted as M-A-C. O-Neil, should just be O-Neil. You should remove von and van. That's all these rules. She linked to an amazing blog post describing all the complexity. You could do... I think Jill is for the first person who talked to me about the joy of being a pedant. She's our people, that's fair. Or my people, anyway. I may be known for birds of pedantry. I also married a grade A pedant, who I love, but grade A pedant. One of the most interesting experiences is watching stand-up comedy because he pulls apart the logic instead of the joke. I think it's hilarious. Oh. That joke isn't factually correct. True. It is hilarious. Anyway, you actually could do all of this in JQ and you can, in fact, you can make your own JQ functions. I haven't quite... I don't think we're going to learn how to do that in this series because that is very rare to need, but you could write your own function called sort by name or something like that and have it take two arguments, you know? Anyway, you could absolutely do it so you basically have to remove all the prefixes. You'd have to run some regular expressions to sanitize things, then convert the whole Kicking Caboodle lowercase, pull out all of the unaccented version of the characters and replace them with their regular versions and remove any dashes and other bits and bobs and then you would have a perfect sorting by surname. If anyone... I think you even said you'd have to change people who were Mick, MC to Mack. You would. But then you'd need to put it back because you can't leave them as Mack if they're not Mack. But remember, I would still do this as a separate field so the display name we don't touch. Oh. Right? That's the power of having the separate field. You can be as brutal as you want, right? You're not going to offend anyone, right? And yeah, generally speaking, the order you sort on is not what you will display because these rules would make the name quite in unintelligible too or a lot of names quite weird looking. I think if I was Vaan something or other, I'd be very cranky if my Vaan disappeared. Mainly because it's the title of nobility. So nobility don't like it when you take away their nobility. Basically the duke of whatever is Baron Vaan, whatever. Anyway, so all of that to say, to get your full marks or to get bonus marks, that was what you had to do. And if you managed to do everything Jill described, you get so many bonus marks, you just win. You win programming by stealth, that's it. Congratulations. One of the things I do pull a little more detail on the show notes is the actual syntax for the sort by field creation because it's a good example of some features of the select function and of the alternate operator. So the select function returns either the full thing you gave it or absolutely nothing at all. So when you say select has surname, what comes out is out of the full dictionary or absolutely nothing at all. And then what you'll see in my solution is you take select has surname and then you pipe that into some string construction. So if there's no surname, what happens? Well, you end up with- Throw it away, right? Right, it gets thrown away. So nothing goes into that pipe. So that string construction saying take the surname, follow it by the first name, that doesn't do anything because it got zero inputs. So what arrives at the left-hand side of the alternate operator is absolute nothingness, which does not evaluate to true. So the rule for the alternate operator is if the left-hand side is empty, null or false, then you do whatever is on the right-hand side. So we come in on our left-hand side with empty. So if there was no surname, we get nothing on the left, so what happens is the alternate operator returns first name. So for the international panel on climate change, that's what will get returned. If there is a surname, the selector won't pass the surname, the selector will pass the entire dictionary. So the string construction gets a full dictionary and it can then correctly pull out surname followed by first name, which is not empty or null. Therefore, when the alternate operator gets that, it goes, oh, great, I'm done. And that gets handed back to be stored. And the stuff on the right-hand side never happens. My solution was that I decided organizations never deserve to get the Nobel Prize and I threw them all away. You would be very unpopular in Manus University where I work because we don't have many Nobel Prizes, not enough of them for a car park. But the one we do have is that one of the 1,000 and something scientists who won as part of the IPCC is one of ours. So we have like 1,000 international panel on climate change. Oh, okay. So we have like one tenth of a percent of a Nobel Prize. So put through an organization so we'd get very cranky if you took that away from us. Anyway, there we are. So that is the full-on challenge solution with all of its bits and bobs. So if we do get a chance to look at my solution, which I don't consider good enough to show to anybody but you, because you don't mock me, it will be curious to me how I managed to get a list of everybody in alphabetical order down to Richard's money. But the last like 20 are just people completely out of order. Like Anne-Marie Al-Sadat, Paul-Anne-Reed de Tournel de Castin. Oh, these people all have like de in there or they're the strange ones. Oh, I would be very curious to have a look at your solution. If you want to play with it yourself, it may be interesting to throw a few debug statements in at various points and see what the intermediate states are. Yeah. Yeah, I think it's the people with the weird stuff like Vander Walls, Vander Meer. Yeah, they do have that in common. I know I'm defending everybody in Holland, but okay. Yes. In Holland, by the way, Havana is not a noble title, it just means from. So it's from the old days, you are from the field, Vander Vilden is also one of my favorites. Ah, yeah, that's Bob from the field. Okay, you better get to the new stuff. Yes, so we're starting with the philosophy. So a dictionary is a spectacularly flexible data structure. It is a key value pair. Sorry, it is multiple key value pairs collected together. So you will see them used throughout programming in different languages for 20 kabillion different things. But like animals can be grouped into species, we can take different uses of the dictionary and describe them in sort of broad philosophical terms to say that this is a way of using a dictionary. We can use them to create, for our purposes today, records or lookups. They're not a different technology, they're dictionaries, they're collections of key value pairs, but how you use them to represent your data means they behave so differently that you really can think of them as a different species. Because- Are records and lookups two different things or are you saying those as synonyms? No, they're not quite antonyms, but they are two very distinct ways of, two very distinct species of dictionary, shall we say? I don't know, I sort of went with types and inverted commas, massive air quotes. Broad categories of dictionary. So to understand why we care about lookups, we're actually, that's really where we want to go, that's our destination for this story. But for those to make any sense, I need to say in the beginning, so we're gonna go doodle doodle right back and we're gonna look at a different concept, which is probably the most common use of the dictionary, which is the record. So we are going to use as our examples a nice small little data set, our menu. Now I've added to our menu, so last time we had hot dogs, pancakes, and waffles, we now have some scones. And scones, I love scones, particularly actually I discovered, America taught me that cranberry and orange is an amazing scone. Amazing scone, yeah. So scones are quite difficult to make in my mind and they're quite fancy, so I priced them the same as waffles, they're $7.50 like waffles are. So anyway, they've been added to our menu. And our file is still menu.json and it contains an array of dictionaries. And each of those dictionaries has a name, a price, and an amount of stock. So we have name, hot dogs, price, whatever stock, whatever name, pancakes, price, whatever stock, whatever name, waffles, right. Each of them is the same. So that's actually how you represent tabular data in JSON. That array of what I'm now telling you we call records is entirely equivalent to a database table with the three columns name, price, and stock, or an Excel spreadsheet with three headings across the top name, price, and stock. And if you don't believe me, I've created the table in the show notes, right? It is. Okay. The same column is hot dogs, pancakes, waffles, and scones. We have prices, $5.99, $3.10, $7.50, $7.50, $143 in stock, $43 in stock, $14.11. It is a table. And so what makes, when we have JSON representing a table, we call each row a record because we're thinking databases. It's a database table which contains records. So those types of dictionaries we call records. And what makes them records is that fact that you have every dictionary has a name, every dictionary has a price, and every dictionary has a stock. And you group them together and now they are the records in a table. So with records, the keys are column names and the values are the values in your table. So keys are names of a column. When you're dealing in records, and that's how you think of them. Now, I'm gonna take you even more philosophical. We like to say in our day-to-day life that there's no such thing as a free lunch. And that is very true of computer science. And one of the biggest trade-offs that you will come across when you start to do theoretical computer science, you will forever be told that you are forced at all points in time to trade space and time. Space and time are continuously fighting for each other in computer science. So an array of record-style dictionaries is the least inefficient way Jason can store tabular data. Jason is human-friendly to read. It is not the most efficient format. If you were trying to put data onto a spaceship made in the 70s with like four bytes of memory, you would never use Jason. But if you are gonna use Jason, the least inefficient way to use Jason is actually arrays of records. So there are very, there are as only inefficient as Jason gets in terms of space. So if you had a giant data set and you needed to save it in Jason, that would be the least bad way to do it. But it's a really terrible way of processing large amounts of data because the only way to find anything when you have an array of records is to loop through them every single time. You loop, loop, loop, loop, loop, loop, loop. So if you were using a traditional programming language, you'd be using a forage wall of reaches. In JQ world, we explode them and we show them through select, which is just a loop. We're just processing each one in order and checking it for some sort of a condition. That's no different to for whatever, if the name is whatever, yay. I would just say exploded, pipe, select, name, double equals, whatever, yay. So that is very inefficient at processing. So you're probably thinking, hang on a second, there's gotta be a better way to do this. And the answer is absolutely, you can make it way more quicker to process, but you shall pay. The cost is going to be disk space or RAM and probably a little bit of CPU as well. So not only are you like, you're definitely gonna end up paying in space, be it RAM or disk, you're also probably gonna pay a bit of CPU time as well, but you will end up speeding up your processing. So if we were living in the world of databases, which is the best way to store data, we have a name for this concept of being able to search your data quicker than a loop. We call them indexes. And an index on a database table is effectively a little mini table that sits next to your main table that just says that, yeah, if you're looking for Bob, he's on row 15. If you're looking for Alice, she's on row 32. If you're looking for Willie, he's up on row 956. It's just one of your data fields mapped to whatever row the full record is on. And so it's way quicker to sort, to find stuff in the index than it is to go look through the entire dataset. Bit like a library. Why don't you find it in an index without looking through the entire dataset for it? But it's actually stored as something called a hash table, which means that it's stored in memory. It's stored like a dictionary, in fact, which means that you just tell it the key and it will then go, okay, well, that key maps to the value four, then you go, great, okay, now let's go to row four in my database tables. You don't have to go searching through everything, you just jump straight to row four. It's like the same thing, but keep going. Oh, no, it's like, if you get the indexes and your database is wrong, your site will crawl, your everything will crawl. It's spectacularly efficient, but you have traded off space because you might think, well, I have a database with a table with 50 columns, I'll just make 50 indexes. Okay, well, you better start paying for that disk storage to go daddy or whoever because you are having one heck of an inefficient database space-wise. So you're always trading off space for time. Yay, I can search this really quickly. Ooh, that's a lot of disk space I'm wasting. The other trade-off you have is that every time you update your database, you also have to update every index. So if you have data you write once, query it often, indexes are a no-brainer, right? You save on searching, you've given up some disk space as well, but you don't have any cost on writing. If you have data that writes often and you have too many indexes, you just now crippled yourself the other way, you've now made it really difficult to actually manage your data. But I can search it really quickly, but I have this backlog of input here. You know, I'm supposed to be a credit card processor and no one can get any payment through, but yeah, I can search the payments very quickly that I'm not making. So you're always, always, always trading these things off against each other. It's always a trade-off, it's always a trade-off. Now, when we're working not in a relational database, but when we're working with pieces of data that can be represented in JSON, the way we simulate an index is with a lookup, which is another type of dictionary. But with a lookup, the keys are values. The keys aren't names of columns. The keys are the value from the column we want to index. So if we want to index names. So that would use less data because you're not giving it a key and then the value that you're just making the value be the key? Compared to an array where you don't have any top-level keys, it is still less efficient because in an array it's just a list, whereas you now have a dictionary of dictionaries. So afraid you don't, you do trade space. Keep talking. So the key point, we're gonna be a lot of puns in this and I'm not gonna mean them. The key point is that with a record, the keys are column names. With a lookup, the keys are the values from a chosen column. That's the column you're indexing. So if we index our menu by the most obvious thing you could want to index a menu by, which is the name of the item, then the keys will be the values in the name column. So our keys will be hot dogs, pancakes, waffles, scones. They will be our keys. So instead of our keys being name, price, and stock, our keys are now hot dogs, pancakes, waffles, and scones. And then the values are whatever it is you want the index to point to. So the simplest thing to do is to put the entire record as the value. So what you end up with is instead of a top level array, where each entry in the array is a record, you end up with a top level dictionary where the keys are the names and the values of the records. So we have a dictionary with the key hot dogs and the value name, hot dogs, price, whatever, stock, whatever. Then we have the next key pancakes, which is another record with the name of price of stock. And then we have waffles, which is another record with the name of price and stock, scones, yadda yadda yadda. So that, if you do the byte count, that is less efficient than what we started with because we now have hot dogs, pancakes, waffles, and scones twice. Instead of just being a square bracket, we now have a whole extra key everywhere. So it is actually a more complicated data structure. What you can do is you can remove the duplication by taking the name out of the record and only having the name exist as the key. So you can basically pull it back and say, hot dogs, colon, open a smaller record, price and stock. Pancakes, open a smaller record, price and stock. I hate, hate, hate that. Because when I do then look up a record, well, think about if you save that record somewhere. It's now incomplete. The record isn't the full record anymore. So you now always have to remember two things to remember one thing. Why, why so? What do you mean it's not complete? It has the same information, just more densely packed. Okay, but when you pull it out, you leave a piece behind. If you run that through a select, okay, if you run that through a select, you will be left with price and stock. The name is now disconnected once you pull it apart. While it's in a lookup, it's all there, but if you take a piece of the lookup out, it's now the record is incomplete. And that is sometimes very annoying. It depends on your use. You can also go the other way and you can make your lookup be, I'll call it a Tim Verporten lookup in memory of our mutual good friend who was a huge believer in menu bar apps because they did one thing and did it well. You can make lookups for one job and one job only. And this is actually one of the best uses of lookup. So you can make a lookup of price because you probably want the price of things really often. So when you make a special purpose lookup, you throw away all the data apart from your special purpose. And what you get is a really simple diction. It just says hot dogs are $5.99, pancakes are $3.10, waffles are $7.50, scones are $7.50, right? Really simple and for a one-to-one mapping, it's actually very, very simple. You just have a key going to a value, fantastic. What would happen if we tried to make a lookup where you use the price as the thing you're indexing and the items, the names, as the things that you're indexing to? Well, waffles and scones are about $7.50. This is why the other thing you will see often is that the lookup is to an array because an array lets you do a one-to-many. So we can have a lookup of product name by price by using arrays. So we say $3.10, just pancakes, $5.99, just hot dogs. But if you have $7.50 in your wallet, you have the choice of waffles or scones. So this is still bad, though, because you're gonna throw away the price when you do the lookup, right? But these are for really specific purposes. So these special purpose lookups are for very, very quickly answering what can I get for $3.99? Or very, very quickly answering the question, what price is a hot dog? So these are named special purpose. So you have a completely generic lookup which is name to full record. And you have completely special purpose lookups which is one key to one value or one array of values. The bit that I have problem with is the halfway house where you neither have everything nor exactly what you need. You have almost everything. And I just find that useless. I have never in my professional programming life found a use for that middle ground that doesn't know whether it's special purpose or general purpose. It's, I don't know what its purpose is. So at this stage, you're probably wondering, when are lookups worth the effort? Cause I gotta build a lookup, I gotta maintain the lookup. When is it worth all of this faffing about? And as a general rule, I would say that data you update infrequently but query frequently, that's a really good candidate for lookups. Cause what you're doing there is you're giving away very, very little CPU upfront because every time you update a lookup, it takes a bit of work. But if you, let's say the Nobel Prizes, they're awarded once a year. I think I can spare a few CPU cycles to build me a lookup of Nobel Prizes. I don't think that's gonna break the bank, you know? And then I now have the data the Nobel Prizes set in a structure that suits my purposes for an entire year. And I can query it efficiently for an entire year. And then next year we come along and we update it again. That's an ideal example, right? Real-world data is not usually that good. So in my real-world life, if you're wondering why do I do this, right? With my professional hat on, there are three places I use lookups and I use them a lot. So data on a schedule is a fantastic candidate here, right? There are a lot of databases and stuff in an enterprise system where, yes, technically speaking, they are updated all the time. But for reporting purposes, you snapshot them once a day and you say that between today and tomorrow, this is our enrollments. Yes, the students can log into the website and re-enroll in a different course, but none of that is reflected anywhere else until tomorrow because overnight we're gonna export the current state of the enrollments and then every system on the university is gonna pull that updated set of enrollments into themselves and then for the next day, that is the state of the university. So that's data on a schedule. So once a day... Okay, but hang on. That's not updated infrequently. That's updated really frequently. Well, okay, but the bit you care about is updated daily. So it's data on a schedule. So the on a schedule bit is where you build the lookup. That's just once a day. Oh, so once a day is not considered frequently and updated frequently. Oh yeah, when I'm talking frequently, I'm talking to something like Twitter, right? Posts, right? The ratio of posting to reading posts is really different. Yeah, exactly, right? Yeah, so everything's faster on computers. Yeah, right. So a lot of data on enterprise systems, whether it be accounts which are done by the day, well, accounts won't even be done by the week. You might not close your books until they're over the end of every week, right? So some of these things are gonna be on a long cadence, but there is a point in time when the system says, and this shall now be the snapshot for everyone's use for the next n hours, days, whatever. And so you're gonna have a script that does a dump of that snapshot. And that script probably already writes the file in multiple formats. It's probably writing a CSV because that's just the oldest data format that there ever been. Everything can read CSVs. I have yet to come across an enterprise system that doesn't output CSVs, right? For a few decades, XML was all so fashionable in the enterprise, especially if there were Java came anywhere near the equation. So most enterprise systems will give you a CSV and an XML file of the same data. And these days, an awful lot of JSON is used because it's actually a really nice format. So these days, a lot of enterprise systems will take their snapshot for the day for the hour and they will write it three times. CSV, XML and JSON. The amount of work involved in having it write two JSONs, one with the index you want and one as a flat file, basically an array of records is minimal. And if you know that this department needs to query it by academic year and this department needs to query it by student name, surname first, well, actually the amount of extra work in writing three or four lookups as long as you give them sensible file names is almost nothing. So why not dump the data in all the formats that all of the consumers want? They just have them all sitting in a folder and you might have students dash by name, students dash by academic year, students dash by department. Just write all the lookups you want, right? And yes, it's duplicative. So you're trading space for time because everyone consuming that data can now consume it way more efficiently because you've given it to them in the lookup they actually need for their business goals. So data on a schedule is a fantastic use case for building lookups. Another one is mostly static data. Like, we have information that we need to use inside scripts all the time. Like, what are the IP ranges we own? What are the domain names you own? What are they used for? Yeah, we have these 20 domain names. Three of them have email, four of them are used for this purpose or whatever. So you can represent that as a data structure that updates itself once or twice a year. You know, how often do you buy a new bit of IP space or whatever, right? It's not very often. Or maybe it's a list, maybe it's an org chart. Yeah, you hire people from time to time. And a general approach to this would be that you would have one copy that you consider to be canonical, which is probably just the array of records. And you probably write a little script that just takes your canonical version and spits out all the indexes you care about. So you keep one copy as your master copy and a little script that says, and now update me the indexes. You know, that works perfectly well for data that you manually edit a few times a month or whatever. Just edit the data on the script. Now you have your indexes again and everyone who consumes the data, they can have it in whatever shape they want, very efficient again. And the other way is, that should say complex, by the way, if you're wondering what that type of is at point number three, there's an X missing. I fixed it. Excellent. The other place where it's worth building an index is if you're going to do a whole bunch of complex processing within the one script. The data might be different every single time you're on the script, but nonetheless, the script itself is going to use the same data so many times that it's still worth spending the CPU cycles up front to build the index, even though you know it's going to be stale the next time you're on the script. Doesn't matter. You may never even write it to disk. It may just exist as a variable in memory. It's still worth the effort because the rest of your script is going to use the bejesus out of that data. And I do this so often that I actually have written my, I actually start my script by putting giant big comment blocks that say fetch data, validate data, build lookups, do work. You'll actually find that in my script as those four headings. And I genuinely put everything into those headings. And as I'm in the do work bit, if I need to know the lookup, I'll scroll up to the make lookups bit and I'll stick in another lookup. And it's also vital to validate your data because dirty data is the bait of my life. Anyway, so I do that so often in big scripts, particularly scripts that reconcile data or that synchronize data or that validate data. And they're all things that I have to do a lot. The other thing is there's a halfway house there that I also do quite a bit is if you have data where it's like, yeah, this script runs every minute and I'm okay with it being wrong by an hour for this data source. I'll build a lookup, write it to disk. And then before I read it from disk or as I read it from disk, I'll read it from disk and I'll use a script to check the modified time on the file. And if what I pulled in from disk was modified less than whatever I consider to be acceptable, just pull it from disk and don't recalculate it. If I pull it from disk and it's too old, delete the file, recalculate it, then shove it out to disk. And then the next time the script- It's like you've got a time to live in there. Precisely, exactly. So the data, I call it going stale, but yeah, it is effectively a TTL. And so that again is this is all very real world stuff. And I am sure there are people now shouting at their iPods or whatever iPods. How great. iPods. iPods. Shouting at their listening devices. Thinking about Bart, our CPUs are so fast. So what if it's inefficient? Just query the data the hard way. Why build all these lookups? There's a really good reason that will never cease to be true. Code that is simpler to write is simpler to debug and simpler to maintain over time. If you have your data structured in the right lookups, your JQ code is shorter. Your JavaScript code is shorter. All of your code is shorter, but not just shorter in terms of characters typed, shorter in terms of logical steps because you're jumping straight to the right data. So when you read the code, it almost reads like English if you have all the right lookups in place, especially if you name your lookups by what they do. If you call your lookups, name you by name, price by name. And when you read it out, it's really obvious what's happening here. All right, yeah, I'm getting, you know, IP address by department or whatever, right? Whatever it is that you need, if you name your rails blah, blah, blah, you get wonderfully readable code and anyone can maintain that code, you know, when you're gone and moved on to something else or you've handed the project over just into the operations team instead of it being in the project team or whatever. It's such the value of making your code more human friendly by using lookups will never go away even if we get the world's fastest computers. That is a win always and forever. And I saved that for last because that's the one I want you to remember. You and I are working with data. That makes a lot of sense. Especially if you're working with teams of people where they may not know exactly how you did something, but if they can look at the names, they can maybe parse it back out and say, oh, I see what he's doing here. Right, and also if you come to debug a script and you go, right, well, I can see here that they've built a lookup and then you check the lookup and you go, well, actually the lookup's wrong. Well, then everything in the script after here is no longer in your field of view for debugging. The only part of your script that's in your field of view is the bit where you build the lookup. If you look and go, oh, they built a lookup, right? Well, then all the code above there can go out of your field of view. So I don't worry about how they built that complicated lookup. Thank goodness they did. Ooh, that's horrible code. And then you just look down at the bottom where it all has a sense with everyone's go, ah, I see what the mistake is. He used, you know, name by IP address and he should have been using name by DNS name, whatever, right? It's just making code more human-friendly even if it saved zero CPU time would make lookups 110% worth the effort. They're also more efficient for other things. But the fact that they're more efficient for us human beings is actually why I adore them so much and why I'm so passionate about making sure we teach this as part of our work on JQ. So I said the best for last. Very good. So how do we know we're near done? Well, now we get into the JQ specific bit. So now we go from philosophy to practical, right? We were way up in the clouds here folks, right? We were, you know, software astronauting here. Now we're getting right back down to earth. So how does JQ do this? So before I tell you how they did it, I want you to put yourself in the shoes of the people designing the JQ language. The problem to be solved is write a function which can take any one of the infinity of possible JSON data structures, run it through this function you have to write to spit out any one of the infinity of possible lookup tables. How do you write one function that can do such a complex task? How many possible arguments is it going to need to be able to do every possible transformation? How do you attack a pie so big? So one approach a computer scientist can use when facing a problem like this is to build what I call a mega function, right? The documentation of this function is probably 20 pages long. It's like you can give it this argument and this argument and this argument and if you give it these four it'll do that and if you give it these three it'll do that. And if you're wondering what that looks like, have a look at the man page for the find command. The find command solves an extremely generic problem. Everything you might want to search the file system for. And their approach was we will provide you with a command line flag for every type of search we can conceive of and we have a good imagination. And the end result is that the command can do almost everything. But it's so hard to find what you want that most people give up. So actually that's a terrible solution and thank goodness that is not what the JQ developers chose to do. Another very sensible approach in a lot of cases when you're dealing with a problem this big is to use what we call a domain specific language. So you write your function, so it takes very few arguments. But one of those arguments is a string in a different language and that language is designed to do a very specific task. And we've seen this. That's how pattern matching is done in almost every programming language. Instead of trying to write a function that could take 20 million optional arguments for describing every possible type of pattern, they take a string which has a domain specific language we call regular expressions. And that domain specific language does one thing and one thing only, it describes patterns. So the function can be really simple. We take an input, which is the data to be processed, a second input, which is this domain specific language and that domain specific language describes the pattern we want to match. And so the function is simple because the work has been handed off to this domain specific language. So if you look at the JQ functions for matching patterns, they're very simple. They take what it is you wanna match and the regular expression as an argument and maybe some flags as an optional second argument. JavaScript was the same, the match function, the test function. They just take a regular expression as a single argument. The function is simple. The work has been done by the special purpose language that is regular expressions. So, okay, we're still pretty abstract here for me to see where we're going, but keep going. Right. The problem is there's no language for describing mappings from one dataset to a lookup. So that's not what they did either. What they did was another very common approach. It's called an intermediate data format. So instead of solving the problem in one step, you break the problem in two. There are an infinity of possible shapes of JSON, right? You can have a JSON data structure in any shape imaginable. Well, the JQ function say, if you wanna make a lookup, you must take that data and transform it into this very specific shape. And if you give it to me in that shape, I will make you a lookup. And that middle point we call an intermediate format. And JSON's job is to transform data. So this is a fantastic solution. Use JQ's core ability to put it in this shape and then give it to me and I'll make your lookup for you. Well, reshaping data is JQ. So use JQ to solve almost all of the problem, give it to me in the shape that I wanted and then I'll take the last mile and I'll make it a lookup for you. And that is how JQ deals with lookups. Transform it into an intermediate format and then run that format through a trivially simple function that takes zero arguments. You just give it the input and it gives you the lookup. As long as the input has been transformed into this intermediate format, which JQ refers to as entries. And entries are another flavor or species of dictionary. There are very specific species of dictionary. They are dictionaries with two keys. A key named key and a key named value. I know you're gonna hate this. You're gonna, I know, I know, I did my best to write it. I did my best to write it out as clearly as I can, but in effect what you're describing is an entry in a lookup. You're saying key colon hot dogs value colon 5.99, key colon pancakes value colon 3.10, key colon waffles value colon 7.50, key colon scones value colon 7.50. If that, they are the elements of our price lookup, right? Our price lookup is a lookup table that wraps hot dogs to 5.99, pancakes to 3.10, waffles to 7.50, scones to 7.50. In the intermediate format, that's more explicit. The key is hot dogs, the value is 5.99. The next key is pancakes, the next value is 3.10. The next key is waffles and its value is 7.50. So this is an example of an entry? Okay, so the intermediate format is an array of dictionaries, each of which has the key named key and the key named value. And I'm showing you. Right, so this is all entries have this exact format. That's all they can have in them, a key called key and a key called value. Correct. All those keys can have values in them. And that's it, that is the intermediate format. That is the intermediate format, an array of those really simple dictionaries. What is that? That says that I can't know three things about hot dogs or two things about hot dogs. Of course you can, of course you can. Only the one thing. No, no, the value of value. Oh, I hate, this is why I hate this, right? The value of the key value. The value of the key value could be a dictionary. If you put a dictionary in there of name colon hot dogs, price colon 599, then you can build a lookup that maps the name hot dog to the dictionary, name hot dog, right? You can put the full record here as the value for value. I'm just showing you a simple one, because otherwise, if I put the full record in for each of these, in this example, you wouldn't see the structure. But that 599 could be anything. Now, I appreciate you simplifying it, but now that I understand it doesn't, it could be more than that, great? Yes, exactly. So that actually lets us build any lookup conceivable. That actually gets us to the infinity of possible lookups, because all lookups map something to something. And this lets us say, here are all the mappings of something to something. That is a lookup. That is the most generic form of lookup. So the only last piece we have is, well, what's the jq function? It's called from entries. So you take your array of entries, you show them into from entries and out pops a dictionary, or sorry, out pops a lookup. Wait, wait, wait, wait, wait, wait. Okay, we finally got to the part I was gonna try to start really understanding, because we've been philosophy here. Say that all again, we've got this structure, this is entries, those have to look like that. We know the structure of those. Now we're gonna build a lookup by doing what? We take that and we pipe it into the function named for from entries, and that function will make our lookup for us. That's it, that's all there is to it. You just take that structure, shove it into from entries, and you're finished. I don't understand. Okay. What have we asked it? We haven't asked it anything. We haven't done a lookup. We haven't told it what we want the data to look like that we look at. Well, we have, but we have. We've given it this list of keys and values. That is all a lookup is, is a list of keys and values. If you build your list of keys and values and shove it to from entries, you get your lookup. What have we looked up? We're building the lookup. We're not using the lookup. But you said we're done. We're not building, we're done. You said we've already finished. Right, exactly once you shove it to from entries, it will build your lookup. It's that simple. It's so simple you think. I don't know what a lookup is then. Even though you just spent the last hour and three minutes telling me what a lookup is, I don't understand what's gonna squirt out the other side. Okay, so all of the examples above where we had hot dogs mapping to the full record or hot dogs mapping to the price, those are lookups, right? We have the full record by the name. We have the price by the name, right? They are lookups. Nope, not in my head. I, what? I thought you, I thought we were finally getting to what is a lookup like in JQ? Okay, JQ is a language for transforming Jason. So a lookup in JQ looks like Jason. So a lookup looks like JQ. But I haven't seen it in Jason. Okay, so in the section that starts with the heading lookups, there are one, two, three, four examples. Okay, I'm gonna have to scroll back. Types of dictionary after that. So that, yeah, so the section that just has the heading lookups. Right, after the tables, after the space time trade-off, we have the section called lookups and in there we give four examples of lookups. Different lookups all built from our menu. So we can have our menu as a lookup that maps names to the full record. We can take our menu, transform it into a lookup that maps the names of our food items to just their prices and their stocks. We can make a lookup that maps the names of our food items to their price. We can have a lookup that maps prices to which food items cost that amount. There are four example lookups built from our menu.json file. I think where I might be stuck is like a noun verb sort of thing. Like a lookup to me is like, I am going to go look something up. I'm gonna look up what is the price of pancakes. But that's not what a lookup is. A lookup is a data structure. Like a take is a data structure. Call it a lookup table. In your head, call it a lookup table. Call it a lookup table in your head. Because. Okay. That'll help a lot. So now where we were, where we're saying we're gonna pipe that structure, the entries structure into the command from entries. That will build the lookup table. We haven't told it what structure to make, but we haven't told it what structure to make our lookup table yet. Okay, the entries are the structure. They say you will have a key with this value. That's going to be an entry in our lookup table. But that's not the way I want my lookup table to look. I want my lookup table to look this other way. One of those four examples you gave up a couple of them. We haven't told it that yet. No, no, what we have. So when you transform it into that ugly shape and you run it through from entries, what spits out the far side is the pretty shape. Which one of the four pretty shapes you described? Okay, so in the example I'm showing you, in the show notes, sorry, let me scroll back to my own show notes. It says we're going to have the key hot dogs mapping to the value 5.99, the key pancakes mapping to the value 3.10. So that is exactly the representation of the third of our examples, hot dogs, 5.99, pancakes, 3.10. That is our lookup by price. Sorry, a lookup of price by name. Maybe, sorry about that really loud noise outside. Your mic didn't pick it up. Oh, good. Wow, we had to build a file that was key colon hot dogs value colon 5.99. We had to build that, but that's not the lookup table we want. Right, I'm saying that's the intermediate format. That is the known shape. So how do we tell it the format we do want? We just haven't told the set yet. No, no, that shape, that intermediate format is the specification. Look at what it says. You said we had no choice. It's always like this. You said it's always key colon hot dogs value colon 5.99. That's the only way we can put the data in. Okay, but the actual, right, but look at what that is. That is a description of a specific lookup table. So when you run that through the function from entries, you get out the lookup table you have specified here. You're specified you want a lookup table that maps hot dogs to 5.99, pancakes to 3.10, waffles to 7.50, scones to 7.50. That will produce the lookup with the key hot dogs and the value 5.99 and the key pancakes and the value 3.10. But you said we had to start with that. So how does from entries create that if we had to start with that? Okay, no, no, that is the big ugly one which has a key named key with the value hot dogs. That will produce a lookup with a key named hot dogs with the value 5.99. As opposed to a key hot dogs and value 5.99. It's a key named hot dogs with a value 5.99, which are apparently two totally different things. I won't be able to say it in five minutes, but I think I know what you just said. Maybe you've got examples. Oh, heck yes, I do. This is gonna, okay, let's go. Let's just keep going and maybe this is gonna drop a little bit better. Okay, so our examples are actually the examples we've already seen because the examples I put open the show notes when I was all open the sky, those examples were written in JQ, of course. So we're gonna rewrite them. So we're gonna see them in action. We're gonna build the lookups we just used as our examples. So our examples are our examples. So the first thing we're gonna do is we're going to build a nice, simple lookup, a one-to-one that's going to map the names of our food items to their prices because that is probably the single most useful mapping to want because you charge people for things. So in order to build that lookup, we start by exploding our menu.json file, which is an array at the top level. So we explode the array and then we need to construct the entries. And we do that using JQ's syntax for building a dictionary. So we open a curly bracket and we're going to build a dictionary with a key named key and the value is going to be dot name. Got it, okay. Then we have a value, a key named value whose value is dot price. Got it. Okay, all right. So we then got to wrap all of that in square brackets to capture the pieces because our from items function expects an array. So if I take exactly what I wrote there and I just wrap it in square brackets and put some nice comments around it to really make it clear what I'm doing, then we actually have what is needed to build the intermediate format. So to show what I mean, I'm gonna make you run this JQ command here, which is going to show us not the final product. This is going to show us the intermediate shape. So am I supposed to be in that folder with this open and not reading in another application? Like I want to see. And this is quite how far into the astronaut space we have been in the philosophy section here. We are quite some time into the recording. And now for the first time, I have said Alison wanna command. Normally we get to the Alison wanna command bit. I guess I got to be in installment resources. Where do I need to be? PBN, probably the error. Installment resources though. Okay. Yeah. I did not have a heads up for Bart that I would be doing this. And actually, and I should have known to give you the heads up because you told me you only managed to prove free to a certain point. So that might be. But I didn't know I was going to be running any of this live. So which one do you want me to run? The one where you've wrapped it in nice square brackets and such. So straight after that, I have it as a JQ command. Okay. So, okay. So it's got you the thing he just described. Okay. So if I do that, I get that nice little structure you talked about key, colon, hot dogs, value, colon 599. So that's a set of dictionaries inside of an array. Gotcha. So that is the input. That is our intermediate format. And that's what we shove into the from entries function to build the actual lookup table. So if we take all of that code from above at the very, very early sec pipe from entries, then we are running it through the from entries function. Now I have saved that as menu-pricebyname.jq. So you can build that yourself by running JQ minus F menu-pricebyname.jq with menu.json of the input. And that will show you the actual lookup we have built from that intermediate format. And I'm hoping the penny now goes, bing, bing. Sure. Okay, maybe you're expecting bigger fireworks. Well, so it was an array with a bunch of dictionaries in it of key, colon, scones, value, colon 750, et cetera. And now we have a dictionary. Not sure where we told that it was supposed to be a dictionary when it came out, but it's maybe that's what from entries does. Right, exactly so. Okay, but it expects as an array as an input. Right, yes. Yes, exactly. And then the intermediate format's been created and now we have hot dogs colon 599. Right. So the output is the lookup we wanted. The input is this strange intermediate format that is a list of entries. Where did we tell it the format we wanted? Which is what my question's been since we started this piece. The output from the previous command is where we told it the format we wanted. No, we told it in the previous command, we told that we wanted key colon name, dot name, value colon, dot price. Okay, and then look at the output. And then fermentories changed it into some other format. Okay, and how could you describe the output you got in any way better than the intermediate format which said you want a key hot dogs with the value, whatever, and then you want a key, scones with the value, whatever. That's where you told it the structure. You know what, I'm always yelling at you for making overly complex examples that throw in a whole bunch of other crap that confuses me. And this time you made it so simple, I can't see it. Because it's just exactly what we said. We created an intermediate format that was key colon pancakes, value colon 310. And now we have a dictionary with each of those key value pairs. Yeah, yeah. Yeah, that's our lookup. We can take a name and it gives us the price. That is a lookup of price by name. So we have... Okay. It's so simple, it's complicated. Yeah, yeah, that is exactly what the problem was. Okay. So that, we have now viewed our lookup, but in the real world, we were on a save it to a file. And we would like to save it in the efficient way computers like they're JSON, not in the inefficient way as humans like our JSON. You, I'm not gonna like this. No. So we're gonna take the command above and we're gonna pipe it back into JQ. And we're gonna give it the minus or a flag which tells JQ, I want raw output. I do not want you to wrap this in quotation marks to make it one big string. I want raw output. And then the JQ filter is at JSON, which is gonna format it into JSON the way computers like. And then we're gonna take the arrow. I'm gonna shove that into a file named menu-pricebyname.json. And that is my naming scheme, but I always say whatever my data source is dash, what is indexed by what? Price by name.json. I probably shouldn't have just created that file in the same repo that I'm about to push, right? That should not be in there. Well, you can put it there all you like because the ignore file for the repository ignores it. Sweet, okay. I normally copy it first, but I was caught with my- Yeah, no, no, I'm- I don't quite know where we're going. Remember, I make these show notes in place. So the ignore file usually covers you for that. Nice, nice. Okay. So that is our first of our example locus from above, price by name. We also have the more complicated one, which is the name leading to the full record. But you know something? The code for that is almost identical because we explode our top level array. We then build our mappings, our list of entries. Key is still name, so key colon dot name, and the value now is the entire record. So the value is dot. Then we pipe that to from entries. So I forget what the input file looks like, what menu.json looks like, but... So it's an array of dictionaries where each dictionary has name, colon, price, colon, stock, colon, so it's a record. Oh, okay. So what we're doing is we're saying for the value, give me the entire record, give me dot, and then we shove all of that to from entries. And then if you run jq minus f menu dash by name dot jq on menu dot json, you will see a lookup that maps the name to the full record, which as far as I'm concerned is the most powerful lookup. You have the full data indexed by the logicalist of possible indexes, its primary key. It's name. Okay, and you like, but this is okay because it's hot dogs, colon, and then name is hot dogs, price is 599, stock is 143, so that way if you pull anything out, you still have hot dogs with you. Yes, yes, yes, yes, yes. So basically I am a person who believes you have all or one. So give me the price, great. Or give me everything, but don't give me anything in between. I like my lookups, they're one specific thing or everything. So that's our example there. If we wanted to pull, I give you the code, if you want to pull the name out. All you do is after you build your entry, you use the del command we learned, the del function we learned about last time to delete dot value dot name. Because we've just said that value is going to be the entire record, which means value contains the name. So if we then run that through del dot value dot name, we then delete the name. And then we run through from entries. Because it's already the primary key. Right, so at that point, we've just made a dictionary with a key named value, whose value is the full record. So if I want to pull the name out, I now say dot value dot name. I don't want you to do that, but the notes wouldn't be complete if I didn't tell you how to do the thing. So there it is. So the last thing then is a little subtlety is that because of how dictionaries work, the value for the key key in our entries must be a string. So name has been a string all along, so we haven't had to worry about this. But if we do a lookup of something by price, the price is our numbers, we will get an error unless we take dot price and run it through two string. So if we build a lookup of the name of our product by its price, we have to run it through two string. If we don't, JQ will just get very cranky. From entries, we'll go, no, I insist on strings. And it will tell you key must be string. Now, this is an interesting one because it illustrates why everything I said so far works great for one-to-one mapping. So we're going to build an index that goes from price to the name of the item. And if we take our logic from above and we just say dot price, two string and dot name, well, that's the same logic as dot name and dot price, we just reverse them, this should work fine. Well, it kind of works fine. We get hot dogs, pancakes and scones. Waffles. What happened to waffles? I know waffles of all things. Well, waffles came before scones. So scones came and stomped on the waffles because scones and waffles are both 750. So this is why when you have a one-to-many, you use an array as the value. And assembling that is actually really hard. I try to do it from first principles. And I spent an hour banging my head at it going, this is impossible, this is impossible. This is a really common thing. How can it be impossible? This is impossible. Then I went, wait a second, maybe I don't do this from first principles. Maybe there's a special function whose job in life is this thing I've spent an hour trying to do. Yes, they're such a function. It's called group underscore by. You give group by an array of dictionaries and you give it a filter that says how you would like to group those dictionaries and you get back an array of arrays. Effectively, it subdivides the array. And so if you give it the array of our menu and you say group by dot price, what you get back is an array. And then the first child array is only the dictionary for pancakes because that's the only thing with a price of 3.10. The next thing is you get an array with only the dictionary for hot dogs because it's the only thing with a price of 5.99. And then you get an array with two dictionaries. Waffles and scones because they both share a price. So when we group by price, we get an array for all of the ones for 3.10 and an array for all of the ones at 5.99 and an array for all of the ones at 7.50. So this is really different. We were getting dictionaries when we did it the other way but with group by all of a sudden we got a bunch of arrays of arrays. Right, because this is, we need group by to build the entries to make our dictionary so make our lookup. Without the group by, I challenge you to build a lookup that maps to an array. Well, we need to use this before we use from entries. We're still going to end up. This is before from entries. Right, and I'm saying to you, you can't get it to a stage where it's in the right shape, that intermediate format without the help of group by. I spent oh so long trying to do it without group by because I didn't know about group by and I thought it must be easy. No, that's why they invented group by. So now let us build our lookup the other way around. So let's now build our lookup without losing any waffles. We start with the group by. So we start by saying, break this array of menu items into sub arrays by price. So we say group by dot price. So we now have an array of arrays. So we now need to build those into an array of entries. So we're gonna explode our array of arrays. We now don't yet have individual pieces. So we're going to explode the inner array. Wait, sorry. I'd pretend I didn't say those last few words. No. Okay. One explosion, one explosion, one explosion. So we explode the outer array. And so we are now processing a group at a time. So we're processing everything that is 399. Which is what we expect it to be. Yeah, sorry, my bad. We just explode once. Yeah, no, no. And so even though we've exploded our array, we still have an array because we had an array of arrays. We will do that. I've done my, I did so many comments here. So we now need to map a key to an array. So the key is going to be the price. But the price is inside each element in the array. But the price is the same because we grouped by price. So we could take the price out of any element in the array. So I'm just taking it out of the first one because there's always a first element. So if I say .0.price, that's the first one. That's always going to exist. Oh, and you don't have to know how many were in each one because you know there's going to be at least one. Exactly, and you grouped by price. So that's always going to be the same. So therefore I just say .0.price to string. Great, that's the key taken care of. The value, I want the value to be an array again, but I only want the names. I don't want the full dictionaries. So I start with a square bracket because I know I want an array. But the first thing I do is I explode the inner array. I just take the name and then I close my square bracket. So now the value is the array pancake scones for 750. Now it's in the right shape where we shove it to from entries. And the output we get is what we wanted, 310 to the array pancakes, 599 to the array hot dogs, and 750 to the array waffles and scones. Wow. Like I say. Using that group by it's still a little mind-bendy. It's only like five lines of code with your comments that are very helpful, but it's still sort of mind-bendy. This is why I would say the debug function is your friend because you can stick that in at any point to see what it looks like, right? So the last thing we have in the show notes is just a bonus extra. Like in the real world, if I were to be actually caring about the menu.json file, I would keep that as my canonical record, right? This is my actual master copy of my menu. It's a nice simple file as an array of records. And if I change my menu, I change that one file. And then I build from that file the lookups that are of value to me, which is the by name lookup, which is the full record by name, and the price by name lookup, which is the price by name. And I have the JQ files to do that work. And then I have a shell script that just does all the updates and wouldn't go. And I use a naming convention. And this is very much for real. This is how I actually do it in the real world. So the master list is the thing with the shortest name. Menu.json. Back up, back up, back up, back up. Cause you're going into a thing beyond where I'm lost. What do you mean you create a shell to update things, to update what? So a little shell script to rebuild my index, my index, my json files that have the index and then whenever I change my menu. So if I change my menu, I need to remake my indexes. If you change the menu, thank you. That's what you didn't say that. I'm like, what's updating? Okay. So on the rare occasion, I add some scones, or whatever I do, then I need to rebuild my lookups and instead of doing it by hand, I have a script. And that script makes use of JQ files to do the actual work. And my naming convention is such that my data files are the canonical one has the shortest name. So menu.json must be the master file. So if I go to a folder and I see something called menu.json and menu dash by name.json and menu dash price by name.json, I know which one is the master copy because it has the shortest name. And everything I build from the master copy gets a post fix. So I append to it a description of what it is. So the menu indexed by name. The menu prices by name, right? So it's obvious what's derived from what. The scripts then have the same names following them basically. So menu dash by name.jq is the actual JQ syntax for building that lookup. And menu price by name.jq is the JQ for generating the price lookup. And then my script for actually building it, all I call menu dash generate lookups.sh because I might have another script called sync to production server.sh and upload to third party service.sh, right? I mean, in the real world, you do a lot of things with your data. So I give it a nice generic name so that anyone else can look at that folder and go, ah, I've updated the menu. Therefore I must run generate lookups.sh. Generate lookups.sh. And then when Larry comes in and says, I want to look up by stock, you can create menu, stock, menu dash stock by name or, Right. Yeah, whatever Larry wants. Stock by price, whatever, yeah. Yeah, exactly. Okay. And then the actual shell script is wonderfully simple. We just have our normal everyday bash, shebangline followed by the JQ commands, JQ minus F, menu dash by name dot JQ, menu dot JSON of the input, pipe it to JQ minus or at JSON, shove it all into the file, menu by name dot JSON. And the same thing. I have a question. Why don't you, can you not do the dash R and at JSON at the end inside that JQ file? I could, but then if I need to see it. You don't have it. I don't have it. Yeah. So by doing it as a separate step, it means I get to have my cake and eat it. If I run the JQ file manually, I see it pretty, but my shell script will write it efficiently. So the actual dot JSON files. So it'll actually come up on the screen, but it'll actually be writing the file in the right format. Well, no, so if I just run the command JQ minus F and I stop, right? So on the terminal, I would not do the pipe to the second JQ. On the terminal, I only run JQ minus F menu by name dot JQ menu dot JSON. And I would stop typing and then I would see it. Look at it. And then do the second one. Oh, okay. And then the shell script, it does the extra piece. And only the shell. Oh, gotcha, gotcha, gotcha. Okay. Yeah. Okay. All right. We got a challenge, huh? We do have a challenge. So I have been working with our menu, which is a very, very small data set. And therefore anything I've done here in JQ, you could do manually in seconds. Our Nobel Prize data set is kind of bigger. It's a much more real world data set, data set coming from the actual real world. So I would like you to build the most logical lookup I could think of. I would like the Nobel Prizes by year, please. I would like to be able to say dot 2020 basically take JQ, the entire filter dot 2020, inside square brackets or whatever, and just give it my lookup. And I would like to know, who won all the prizes in 2020? Instead of having to do the piping to select and exploding and all the yada, yada, yada, we have to do at the moment. If I had that lookup, I could just tell you who won the prize in 2020, right? That would be a very useful lookup. So that is the challenge to build that very useful lookup. For bonus credits, can you build a two level lookup where we have a lookup by year that gives us a lookup by category that gives us the prizes. So then we could say dot 2020 dot chemistry and actually figure out who got the Nobel Prize for Chemistry in 2020. Okay. And I'll give you a hint. A lookup can go to a lookup, right? So each entry we have a key and a value. Why can't the value be a whole other lookup? Of course it can. Okay, that's not in the show notes and I won't remember that. No, but that's why it's an optional extra challenge here. We will go through the sample solution in great detail, but if people want to have some thinking about it, it's both simple and head exploding all at the same time, because it is actually very simple to do a multi-level lookup and yet it makes your head explode. And depending on the kind of person you are, you may just go, Bart, what are you talking about? Really easy, okay. All right, well, hey Bart, I'm glad you only did half of this. It's weird as I would have raced it. You would have been lost. Yeah, anyway. Right, we shall leave it there. Next time, we shall do the opposite and tear our lookups down and rebuild them into different shapes and do cool things with them. So until then, have fun with your challenge and happy computing. If you learn as much from Bart each week as I do, I'd like you to go over to letstalk.ie and press one of the buttons over there to help support him. He does 98% of the work here. I'm just the stooge that listens to him and asks the dumb questions. If you go over to letstalk.ie, you can support him on Patreon, you can donate via PayPal, or you can use one of his referral links. I really hope you'll go over and help him out. In the meantime, you can contact me at Podfeet or check out all of the shows we do over there over at podfeet.com. Thanks for listening and stay subscribed.