Generated Shownotes
Chapters
0:00:07 Introduction to Programming with JQ
0:03:00 Understanding the Importance of Lookups
0:05:23 Exploring the Philosophy of Dictionaries
0:33:58 Storing Data Efficiently with Indexes
0:36:25 Trade-offs in Space and Time
0:44:31 Efficient Data Handling in Enterprise Systems
0:50:03 Utilizing Stale Data with Lookups
0:52:44 Simplifying Code with Lookups
0:53:52 Transition to Practical JQ Implementation
0:57:19 Designing an Intermediate Data Format
1:02:23 Building a Lookup with "fromEntries"
1:06:53 Understanding the Lookup Process
1:12:48 Generating Lookup Tables from Entries
1:20:21 Dealing with Data Types in Lookups
1:23:48 Utilizing "group_by" for Multi-Level Lookups
Long Summary
In this engaging episode of Chit Chat Across the Pond, we are joined by Bart Busschots as we embark on a deep dive into programming in Programming By Stealth Number 163. The conversation kicks off with Bart's candid sharing of the challenges he faced while grappling with the intricacies of JQ, only to find a sense of clarity and simplicity in chat GPT through its simplified examples. We venture into the complexities of handling laureates, spelling variations, and the significance of lookups and data disassembly for optimal processing efficiency. Sorting and deduplicating arrays take center stage, with practical examples like managing Nobel laureate names for parking lots adding a touch of real-world relevance. As we explore sorting by surnames, we conclude the episode with a solution employing the unique underscore by method to refine arrays of laureate dictionaries, underscoring the intricate yet rewarding nature of data processing across diverse programming languages and the profound impact of nuanced problem-solving approaches.
Transitioning into the subsequent segment, our attention shifts towards sorting and cleansing a JSON dataset utilizing the JQ language. The episode delves into the meticulous process of sorting dictionary objects and eliminating duplicates to craft an organized alphabetical list. As we navigate the complexities of sorting surnames with varying rules and characters like "Het Hooft," precision and accuracy in instructions emerge as pivotal elements in such tasks.
The episode takes a personal turn as the host shares insights on perfectionism, the joys of being a pedant, and draws parallels to experiences in stand-up comedy. Technical recommendations for custom functions are shared, alongside a strong emphasis on the importance of clear instructions, particularly in academic assessments.
Further delving into a theoretical realm, the conversation delves into the philosophical underpinnings of dictionaries and their role in representing data structures such as records and lookups. The host introduces the concept of trade-offs in computer science and expounds on the efficiency considerations linked to storing tabular data as arrays of records in JSON.
Expanding the discussion to indexes in databases for expedited data retrieval vis-a-vis looping, the host delves into the associated trade-offs and nuances, stressing the balance between search speed and resource utilization. The episode wraps up by elucidating how lookups in JSON mimic indexes, facilitating swifter data retrieval processes.
Brief Summary
In this enlightening episode of Chit Chat Across the Pond, Bart Busschots guides us through the intricate world of programming in Programming By Stealth Number 163. The discussion unravels the complexities of working with JQ, culminating in a newfound clarity through GPT chat examples. Practical applications such as managing Nobel laureate names for parking lots illuminate the real-world relevance of sorting and deduplicating arrays. The episode transitions to the meticulous process of sorting and cleansing a JSON dataset, emphasizing precision and accuracy in handling surname variations like "Het Hooft." Delving into philosophical discussions, the importance of dictionaries in data representation and the efficiency considerations in computer science are explored. The conversation concludes with insights on indexes in databases and JSON lookups, shedding light on expedited data retrieval methods.
Tags
Chit Chat Across the Pond, Bart Busschots, programming, Programming By Stealth, JQ, GPT chat, Nobel laureate names, arrays, JSON dataset, dictionaries, computer science, indexes, databases, JSON, data retrieval
Transcript
[0:00] Music.
Introduction to Programming with JQ
[0:08] It's that time of the week again it's time for chit chat across the pond this is episode number 789 for march 16th 2024 and i'm your host allison sheridan this week our guest is barbou shots with programming by stealth number 163 we're still having fun with jq right bart well i'm certainly having fun i hope you are too i am i am it's uh as i told you uh i find that jq is not particularly particularly sticky in my head that in between the two weeks, I complained to Bart while I was actually trying to work on the JQ homework while in a car on an unpaved road out in the desert.
[0:48] And I was basically, I was just sitting there looking and going, I have no idea.
I know you start with JQ and a single quote, but maybe some square brackets, but wow, I'm foggy on the the beginning. I mean, like just the simplest things.
And then I got to one point where I was just like, I know what I'm doing is right, but why is it not working?
And I went to ChatGPT and I gave it like a super simplified example of what I was trying to do.
And it said exactly what I was saying. I was looking at the two things.
And the final thing was like, I hadn't, after I'd done everything, I had everything in an array.
So to get to display name, I needed to explode the array. And I kept forgetting to explode the array.
But even looking at the answer that ChatGPT gave me of the simplified example, I look back and I'm like, those are the same thing. How could that be different?
So it seems to squirt out of my brain. And I know you're doing it regularly at work. I am not.
So the homework is definitely critical for, you know, gluing the broken bits back into my brain.
Well, some very wise person suggested that maybe we should do challenges as part of this series. I don't know who that was. Yeah.
I don't know. One other little thing I wanted to ask people is, how many different ways have y'all learned to spell laureates?
Because between Bart and me, I think we may have found all of them, but it's not just one way we can misspell them.
Well, I haven't found how to spell it right. I've just found all the wrong ways.
[2:18] Well, eventually, it's funny, we get it right in the code, because otherwise it's not going to work.
But if you write a comment, there's no chance it's spelled correctly in the comment.
[2:28] Yeah that is that is a fair point oh i see one wrong yeah there we go it's wrong there in line five of the first challenge solution isn't it uh yeah i thought i fixed it on that one maybe it's a different one okay well we'll uh we'll find oh yeah no i may have changed it to a different that is a new you did not spell it that way i did that oh great at least it's not just me right there's like a's and e's going all over the place but uh yeah so it may or may not be spelled right in the the comments. Good luck to all of you in following what we're saying.
Understanding the Importance of Lookups
[3:00] Right. So before we get to our challenge solution, just to set the scene, I'm a big believer in the old tell them what you're going to tell them, tell them, tell them what you told them approach.
So I had promised you, I had teased that we were doing all about lookups in this installment.
And that's half true, because as I began writing the show notes, I realized that there were too many concepts.
And if I didn't split it in half I would end up teaching nothing instead of twice as much because none of it would actually go in.
So we've ended up actually with quite a nice split.
So we're going to spend a lot of time today trying to understand why we are interested in lookups and it's kind of a much bigger lesson than just JQ.
[3:50] It's a lesson for any data processing you do in any language.
[3:56] Because with my work head-on, I do a lot of data processing with JavaScript because I like JavaScript.
And with Node, you can write JavaScript to run shell scripts. So I do that a lot.
And then when I have JSON data, I obviously use a lot of JQ.
And when I have SQL data, I use a lot of MySQL or PostgreSQL or, if I'm really unfortunate, Oracle.
Very unfortunate, bad day. um but a lot of the concepts are actually the same it doesn't really matter what language you're in so a lot of the stuff we're going to do for the first half of this installment is way more philosophical than you're probably used to i think we might have done this more in the very early days of javascript when we were meeting our very first programming language ever i think we got quite philosophical sometimes but you know we're going back down that road for the first half of today then we're going to look at how jq applies that concept in its own unique and special way because jq is kind of unique and special because i don't know if you've noticed this but you've been doing an awful lot of looping in jq but we haven't met a single looping keyword you'd recognize we haven't met a for we haven't met a while we haven't met a for each and yet when you explode an array you're looping over every element in the array right right so we do loop a lot we do do a lot a lot a lot a lot of loops we've never looped explicitly um and so how jq does things.
Exploring the Philosophy of Dictionaries
[5:23] Is usually quite different to how other languages do it so how jq does lookups is very jq specific and then we're going to build some and that's the point we're going to, draw a line under for the day and then when we come back in two weeks time we're going to pick it up is the opposite of building is taking them apart because it's actually very important to be able to do that for a couple of reasons a lookup is a really great way to access data but it's actually terrible for trying to process the data so if the problem to be solved isn't find me something quickly but do some sort of a search or do a transformation you may actually want to break the lookup apart, assemble it into a more traditional structure, process it and then maybe put it back into a different lookup.
Or the other thing is someone may have written the data with a lookup on one field and you don't want it on that field.
And the example, this is when I in my real life learned how to do these things with jq because when you do a lookup it's on a key.
[6:34] And the wonderful, wonderful Troy Hunt does Have I Been Pwned?
Amazing service, amazing person.
But his Jason is in the form of what we're going to discover today is a lookup.
And his key is designed for when the problem to be solved is what breaches was this person involved in?
And the problem I have to solve is the opposite.
A breach has just happened. Which people are in that breach?
Which means the data is opposite right the data is wrong so i have to pull well it's not wrong for me it's just not what you want yeah it's wrong for my purposes right the context is important so i have to pull his lookup apart completely rejigger all the pieces and build the lookup that i need to answer my question and that's actually a very important skill so the entire next installment is about disassembling and potentially, if needed, reassembling into what you need for your purposes.
But before we do any of that, we of course had a challenge.
We spent a lot of time in the previous installment getting very friendly with processing arrays.
[7:47] With manipulating them in various ways, and two very useful manipulations we learned about out was reordering arrays so they were sorted in the same order and deduplicating arrays so that instead of there being five copies of the value two, they'd be only the one copy of the value two.
And that's actually something you need to do quite a bit when processing data.
And a fantastic example of why you would want to do that is if you just want to list basically an an honour roll of everyone who when they come to Princeton gets to park in the Nobel laureates area because they actually have a different car park for Nobel laureates because they have a few.
[8:28] And so you just want everyone once right and, And Einstein, actually he only had one Nobel Prize, Marie Curie, should she come back from the dead and drive, I don't think she drove, but she doesn't get three parking places, she gets one parking place.
So obviously we want to de-duplicate her. And that's sort of why the example I picked was to do a list of all the laureates, which meant you had to de-duplicate them because lots of people won more than one Nobel Prize.
And also I wanted them sorted alphabetically because, well, it seemed like a sensible thing to do with a list of names. So how else do you find whether or not someone's on the list, right?
If you're the bouncer at the car park for the Nobel laureates and someone comes up to you and says, Hi, I'm Andrea Ghez.
And if it isn't alphabetic, how's the bouncer supposed to say, Wait a minute there, Dr. Ghez.
I have a hundred and something or I think it's 176, I think is what I got in total or something.
It might have been more. Anyway, whatever it is, a lot of reading to do, right? Right, so anyway, that's why, that was the problem to be solved, kind of semi-practical anyway.
[9:34] So you will find my challenge solution for the basic solution, which is just print all the names and sort them alphabetically as you would write them.
So Andrea Ghez would sort as an A, the Institute of Peace, I think was something called. Actually, the International Panel on Climate Change would sort as a T.
Right, and just sort them as you see them. basically. That was for full marks.
And then for a bit of extra credit, you could sort them on surname, which is actually quite fiddly because organizations don't have surnames.
So we'll deal with the fiddly bit after we do the important bit, which is the bit to get you full marks.
So in order to start working with the laureates, the first problem to be solved is to get all the laureates.
You're going to have duplicates, you're going to be in the wrong order, but just get the ball.
And so, the first step to getting at all the laureates is to loop through all the prizes, so we explode the prizes array.
[10:38] Then we have our prizes, and we don't care about the year, we don't care about the category, the only thing we care about is the laureates.
So we explode those out.
And now we have all the laureates that ever existed from all the prizes.
But we only want the names, only some of them have a first name and a surname, name and some of them only have a first name so I basically copied and pasted my solution from the previous challenge which was to do the pretty printing of the names and I just pasted that in so at this point in time what I have is many many outputs one for each laureate but they are individual outputs because we've exploded and then we've exploded again so we now have many many pieces but the functions for sorting and so forth expect an array so we need to collect our pieces together in an array so we wrap everything we've done so far in square brackets and now we've recollected them all back into an array.
[11:40] And then a nice side effect about how the unique function does its magic of deduplicating is that it sorts the array and then it just goes through it from start to finish and if it meets the same value multiple times in a row it throws all of them away but the last one so that's actually how unique works is it's a sort followed by it throw half of it away and so by simply piping our array to unique we get straight to the answer, so we explode twice we pull out the name group it all into an array and run it through unique, and hey presto full marks you know i um forgot that unique would do this would do the d i knew I knew it would do the deduplicating, but I forgot that it would do the sorting.
But I kind of cheated. I thought, you know what?
I know of a JSON dataset that already has display names created because Bart did it last week.
So I started from your last time solution because I was like, why should I redo all that work? I already know how to do that, if I remembered.
And so I started from that. but you must have deduplicated last time because I didn't use Unique and I don't have three Marie Curie's.
[12:52] That's unusual. That sounds like a side effect of something earlier in your logic.
I haven't seen your solution, so I can't comment on the why.
That is unexpected, because I know for a fact that before I run mine through Unique and I piped it through a word count minus L, there were many, many more of them.
And when I ran it through a word count minus L after I showed it through Unique, which is my little test to myself that things were going as expected.
Huh. Okay. So the question... Well, I just checked and I only have one.
I think I got accidentally lucky. Hey, if it works.
Although it is kind of interesting to understand why it works.
I'll be curious to have a look at that maybe after we record tomorrow.
What the listeners don't know is that the clocks have done that really annoying thing where we're the wrong amount of hours apart.
[13:45] So it's now later for me. Anyway, it's all timey-wimey, wibbly-wobbly. probably.
So in order to get the bonus, we then want to do something a little bit tricky.
We want to sort on a different value than what we display, because we don't want to display them in the wrong order.
We just want to sort them by their surname, but still display their name normally.
Which gives me an excuse to use a trick that iTunes, of all things, taught me. so when you would look at an mp3 file in its id3 headers there was a field for title of the song and then there was a field called sort order.
[14:26] And you could put Beatles, the, into the sort order to have your albums, to have the Beatles sort under B, but they would still show up in your list as the Beatles.
They would just be. Oh, that's kind of neat. After the Yardvarks, who don't exist as a band, but let's imagine they do.
And then after ABBA. Oh, you know, ABBA would come and then the Yardvarks.
Yeah, exactly. Exactly.
And so that gave me, basically, my basic approach is, instead of exploding all the way to the name, so explode, explode, and then don't reduce it to the name.
[15:06] Instead, inject two keys, a display name, copying and pasting my logic from previous challenge, and another field called sort order.
I think I called it sort by in my solution, but same logic, right?
Yeah, I did call it sort by, I think.
And the sort by field, I applied different logic so that it would basically say, do you have a surname?
If you do have a surname, then build me the string surname space first name.
If you don't have a surname, just give me the string first name.
Which meant that my sort order would let, you know, the International Panel on Climate Change be the International Panel on Climate Change and Einstein would sort as Einstein-Albert instead of Albert Einstein. So he'd sort under the E's.
So I recollect all of those into an array. So I now have too many laureates in the wrong order.
[16:05] So the first thing I do is I use the unique underscore by function to deduplicate based on...
Wait, why are they in the wrong order? You just did a sort by.
Not in my bonus solution, which is further down the page.
So you just described creating the dot sort by.
Yeah. Oh, but you haven't dot sort by'd. You've created that as like a variable.
It's a field. Yeah, exactly. So it's an extra key in the, so every dictionary for the laureates now has a sort by key and a display name key, as well as the name, the surname, whatever it had when it started.
It now has two extra keys, sort by and display name.
Okay, and sort by has what in it? What are the values of sort by?
So if there's a surname, the value of sort by will be surname space first name.
Otherwise, it will be first name.
[16:59] Okay, why not just sort by display name?
Well, because that's not what we want. We want the display name is first name, space, surname, or just the first name. Oh, right, we want it by last name. Okay.
Right, that's the whole point. We're sorting by something different to what we're showing. So they're too different.
So at the end of that, when I reassemble everything into an array on line 19 of my solution, then I now have an array of dictionaries.
Not an array of strings.
I have an array of dictionaries. so to sort the array of dictionaries I need to tell it which key in the dictionary it should use to order things so that's why I use a unique underscore by, which will have the side effect of sorting by the display name but at least it's getting rid of my duplicates so Einstein is now gone or sorry Curie now is down to one Marie Curie instead of three Marie Curies.
[17:59] But they're now sorted sorted in a different wrong order, but there's the right amount of them.
So now we pass it to sort underscore by and we tell it this time that the key it should use is dot sort by.
So now we have an array of dictionaries without duplicates in the right order.
And then the last thing we do is we explode it one more time, we pull out the display name and then I chose to put it back together to return an array as the output.
Put you could I've put the individual pieces I didn't specify so both would be perfectly valid I said a list of names so you could take that in English to mean names in a row or in JSON in which case it's names in an array but they're both valid interpretations of me giving the vague hand-waving description of list you didn't actually say a list of names so when I first showed you my answer it was a list of dictionary objects in alphabetical order by surname and you said that was right technically what you said was build an alphabetical sorted list with the names of all laureates so i would say there's a list of names but okay but not only names it's not the best english i have ever crafted i will i will hands up say that was it is not the clearest instructions i have ever given.
[19:23] I mean, I botched a lot of other stuff up in my answer, but I was going for points from the professor on a technicality there.
Hey, it would work in an exam.
[19:33] You know, he go to the exam as a peel board and I've sat on those and he was like, yeah, that question wasn't clear. Marks of the student.
[19:40] A little fun aside, before we'd even recorded this installment we're recording now, basically the moment I had publish, on the previous installment, installment we got an email from one of our loyalist of listeners um one of the many listeners jill yeah we have we have money we have multiple jills but this is this is a different listener jill and jill immediately pointed out that i had basically set an impossible challenge because the rules for actually in the real world sorting by surname are hella complicated because our own data set shows us that we have names like Het Hooft as a surname, which is... Oh yeah, what's up with that guy?
It's an abbreviation of Het Hooft, which technically means the head.
So he's actually Gerardo's The Head, which is kind of cool. I don't know if that means the head is in top of the body or the head is in chief of an organisation.
Both are valid meanings of the word in Dutch.
But the surname is abbreviated with it, you know, Het Hooft.
Then you have special characters don't sort in the right order in JQ because in actual fact when you're sorting on names, you should pretend that an Iagu is just a plain old E.
Mac MC should be sorted as M-A-C. O-Neil should just be O-Neil.
You should remove Von and Van. There's all these rules.
She linked to an amazing blog post describing all the complexity.
[21:05] Jill is the first person who talked to me about the joy of being a a pedant.
Jeez, our people. That's fair. Or my people, anyway. I may be known for bouts of pedantry.
I also married a grade A pedant who I love. But grade A pedant.
One of the most interesting experiences is watching stand-up comedy because he pulls apart the logic instead of the joke. Which I think is hilarious.
[21:30] That joke isn't factually correct. True. It is. Hilarious.
Anyway. You actually could could do all of this in jq and you can in fact you can make your own jq functions i haven't quite i don't think we're going to learn how to do that in this series because that is very rare to need but you could write your own function called sort by name or something like that and have it take two arguments you know anyway you could absolutely do it so you'd basically have to remove all the prefixes you'd have to run some regular expressions to sanitize things then convert the whole kick You can put a lowercase, pull out all of the unaccented version of the characters and replace them with their regular versions and remove any dashes and other bits and bobs.
And then you would have a perfect sorting by surname.
I think you even said you'd have to change people who were Mick, MC, to Mac.
You would. You'd need to put it back because you can't leave them as Mac if they're not Mac.
Remember, I would still do this as a separate field so the display name we don't touch.
Oh. Right? Oh, okay. That's the power of having the separate field.
You can be as brutal as you want, right? You're not going to offend anyone. Oh, right.
And yeah, generally speaking, the order you sort on is not what you will display.
Because these rules would make the name quite unintelligible to, or a lot of names quite weird looking.
I think if I was Vaughan something or other, I'd be very cranky if my Vaughan disappeared, mainly because it's the title of nobility.
[22:59] Nobility don't like it when you take away their nobility.
[23:03] Basically the Duke of whatever, Baron Vaughan, whatever.
Anyway, so all of that to say, to get your full marks, or to get bonus marks, that was what you had to do.
And if you managed to do everything Jill described, you get so many bonus marks you just win you just you win programming by stealth that's it congratulations um one of the things i do pull a little more detail on the show notes is the actual syntax for the sort by field creation because it's a good example of some features of the select function and of the alternate operator so the select function returns either the full thing you gave it or absolutely nothing at all.
So when you say select has surname what comes out is either the full dictionary or absolutely nothing at all.
And then what you'll see in my solution is you take select has surname and then you pipe that into some string construction.
So if there's no surname what happens?
[24:12] Well, you end up with nothing. It throws it away, right? Right, it gets thrown away. So nothing goes into that pipe.
So that string construction saying, take the surname, follow it by the first name, that doesn't do anything because it got zero inputs.
So what arrives at the left-hand side of the alternate operator is absolute nothingness, which does not evaluate to true.
So the rule for the alternate operator is if the left-hand side is empty, null, or false, then you do whatever is on the right-hand side.
So we come in on our left-hand side with empty.
So if there was no surname, we get nothing on the left. So what happens is the alternate operator returns first name.
[24:55] So for the International Panel on Climate Change, that's what will get returned.
If there is a surname, the select won't pass the surname.
The select will pass the entire dictionary. so the string construction gets a full dictionary and it can then correctly pull out surname followed by first name which is not empty or null therefore when the alternate operator gets that it goes oh great i'm done and that gets handed back to be stored and the stuff on the right hand side never happens my solution was that i decided organizations never deserve to get the nobel prize and I threw them all away.
You would be very unpopular in our in in Maynooth University where I work because we don't have many Nobel prizes not enough of them for a car park but the one we do have is that one of the 1,000 and something scientists who won as part of the IPCC is one of ours.
So we have like 1,000 international panel on climate change.
Oh okay. So we have like one-tenth of a percent of a Nobel Prize put through an organization.
So we'd get very cranky if you took that away from us. Anyway, there we are.
So that is the full-on challenge solution with all of its bits and bobs.
[26:13] If we do get a chance to look at my solution, which I don't consider good enough to show to anybody but you because you don't mock me, It will be curious to me how I managed to get a list of everybody in alphabetical order down to Richard Sismondi, but the last 20 are just people completely out of order, like Anwar al-Sadat, Paul-Henri de Tournelle de Constant.
Oh, these people all have like de in there, or they're the strange ones. Yes.
I would be very curious to have a look at your solution. If you want to play with it yourself, it may be interesting to throw a few debug statements in at various points and see what the intermediate states are.
Yeah, I think it's the people with the weird stuff, like van der Waals, van der Meer.
Yeah, they do have that in common. I know I'm offending everybody in Holland, but okay.
Yes. In Holland, by the way, van is not a noble title. It just means from.
So it's from the old days. You were from the field.
Van der Vilden is also one of my favorites. Oh yeah, that's Bob from the field.
[27:25] Okay, we better get to the new stuff. Yes. So we're starting with the philosophy.
So a dictionary is a spectacularly flexible data structure.
It is a key value pair. Sorry, it is multiple key value pairs collected together.
So you will see them used throughout programming in different languages for 20 kabillion different things.
But like animals can be grouped into species we can take different uses of the dictionary and describe them in sort of broad philosophical terms to say that this is a way of using a dictionary we can use them to create for our purposes today records or lookups they're not a different technology they're dictionaries they're collections of key value pairs but how you use them to represent your data means they behave so differently that you really can think of them as a different species.
[28:23] Are records and lookups two different things or are you saying those as synonyms?
No they're not quite antonyms but they are two very distinct ways of representing, two very distinct species of dictionary, shall we say?
I don't know, I sort of went with types in inverted commas, you know, massive air quotes. So it's broad categories of dictionary.
So to understand why we care about lookups, we're actually, that's really where we want to go. That's our destination for this story.
But for those to make any sense, I need to say in the beginning.
So we're going to go doodaloodaloo right back. And we're going to look at a different concept, which is probably the most common use of the dictionary, which is the record.
So we are going to use as our examples a nice small little data set, our menu.
Now I've added to our menu. So last time we had hot dogs, pancakes and waffles. We now have some scones.
I like scones. I love scones.
Particularly actually I discovered America taught me that cranberry and orange is an amazing scone. Amazing scone. Yeah.
[29:34] So scones are quite difficult to make in my mind and they're quite fancy.
So I priced them the same as waffles. They're $7.50 like waffles are.
So anyway, they've been added to our menu.
And our file is still menu.json and it contains an array of dictionaries.
And each of those dictionaries has a name, a price and an amount in stock.
So we have name, hot dogs price, whatever stock, whatever name, pancakes price, whatever stock, whatever name, waffles. Right. Each of them is the same.
So that's actually how you represent tabular data in JSON.
[30:10] That array of what I'm now telling you we call records is entirely equivalent to a database table with the three columns name, price, and stock.
Or an Excel spreadsheet with three headings across the top, name, price, and stock.
And if you don't believe me, I've created the table in the show notes.
Right it is okay the name column is hot dogs pancakes waffles and scones we have prices 599 310 750 750 143 in stock 43 in stock 14 and 11 okay it is a table and so what makes when we have jason representing a table we call each row a record because we're thinking databases it's a It's a database table which contains records.
So those types of dictionaries we call records.
And what makes them records is that fact that you have, every dictionary has a name, every dictionary has a price, and every dictionary has a stock.
And you group them together and now they are the records in a table.
So with records, the keys are column names and the values are the values in your table.
So keys are names of a column when you're dealing in records and that's how you think of them.
[31:34] Now, I'm going to take you even more philosophical. We like to say in our day-to-day life that there's no such thing as a free lunch.
And that is very true of computer science.
And one of the biggest trade-offs that you will come across when you start to do theoretical computer science, you will forever be told that you are forced at all points in time to trade space and time.
Space and time are continuously fighting for each other in computer science.
So an array of record style dictionaries is the least inefficient way json can store tabular data json is human friendly to read it is not the most efficient format if you were like if you were trying to put data onto a spaceship made in the 70s with like four bytes of memory you would never use json but if you are going to use json the least inefficient way to use json is actually arrays of records.
[32:32] So they're as inefficient as JSON gets in terms of space.
So if you had a giant data set and you needed to save it in JSON, that would be the least bad way to do it.
But it's a really terrible way of processing large amounts of data.
Because the only way to find anything when you have an array of records is to loop through them every single single time. You're loop, loop, loop, loop, loop, loop, loop.
So if you were using a traditional programming language, you'd be using your for's, your while, or for each's.
In JQ world, we explode them and we show them through select, which is just a loop.
We're just processing each one in order and checking it for some sort of a condition.
That's no different to for whatever, if the name is whatever, yay.
Right? We just say, explode it, pipe, select, name double equals, whatever, yay. Yay.
So that is very inefficient at processing.
[33:31] So you're probably thinking, well, hang on a second, there's got to be a better way to do this.
And the answer is absolutely, you can make it way more quicker to process, but you shall pay.
The cost is going to be disk space or RAM and probably a little bit of CPU as well.
So not only are you like, you're definitely going to end up paying in space, be it RAM or disk, you're also probably going to pay a bit of CPU time as well, but you will end up speeding up your processing.
Storing Data Efficiently with Indexes
[33:59] So, if we were living in the world of databases, which is the best way to store data, we have a name for this concept of being able to search your data quicker than a loop.
We call them indexes, and an index on a database table is effectively a little mini table that sits next to your main table that just says that, yeah, if you're looking for Bob, he's on row 15.
If you're looking for Alice, she's on row 32. If you're looking for Willy, he's up on row 956.
It's just one of your data fields mapped to whatever row the full record is on.
And so it's way quicker to find stuff in the index than it is to go look through the entire data set.
[34:44] A bit like a library. How do you find it in an index without looking through the entire data set for it?
It's actually stored as something called a hash table, which means that it's stored in memory.
It's stored like a dictionary, in fact, which means that you just tell it the key and it will then go, okay, well, that key maps to the value four.
Then you go, great, okay, now let's go to row four in my database table.
So you don't have to go searching through everything. You just jump straight to row four.
[35:10] It sounds like the same thing, but keep going. Oh, no, it's math.
Like if you get the indexes in your databases wrong, your site will crawl, your everything will crawl.
It's spectacularly efficient. but you have traded off space.
Because you might think, well, I have a database with a table with 50 columns.
I'll just make 50 indexes.
[35:30] Okay, well, you better start paying for that disk storage to GoDaddy or whoever, because you are having one heck of an inefficient database space-wise.
So you're always trading off space for time. Yay, I can search this really quickly.
Ooh, that's a lot of disk space I'm wasting.
The other trade-off you have is that every time you update your database, you also have to update every index.
So if you have data you write once queried often, indexes are a no-brainer, right? You save on searching.
You've given up some disk space. are well, but you don't have any cost on writing.
If you have data that writes often and you have too many indexes, you just now crippled yourself the other way. You've now made it really difficult to actually manage your data.
But I can search it really quickly, but I have this backlog of input here.
You know, I'm supposed to be a credit card processor and no one can get any payment through, but yeah, I can search the payments very quickly that I'm not making.
So you're always, always, always trading these things off against each other.
Trade-offs in Space and Time
[36:25] It's always a trade-off. It's always a a trade-off.
Now, when we're working not in a relational database, but when we're working with pieces of data that can be represented in JSON, the way we simulate an index is with a lookup, which is another type of dictionary.
But with a lookup, the keys are values.
[36:50] The keys aren't names of columns the keys are the value from the column we want to index so if we want to index names so that would use less data because you're not giving it a key and then the value that you're you're just making the value be the key um compared to an array where you don't have any top level keys it is still less efficient because in an array it's just a list whereas you now have a dictionary of dictionaries, so I'm afraid, you don't, you do trade space.
Anyway, I'll keep talking. So the key point, we're going to be a lot of puns in this and I'm not going to mean them.
The key point is that with a record, the keys are column names.
With a lookup, the keys are the values from a chosen column.
That's the column you're indexing.
So if we index our menu by the most obvious thing you could want to index a menu by, which is the name of the item, then the keys will be the values in the name column so our keys will be hot dogs pancakes.
[38:02] Waffles scones they will be our keys so instead of our keys being name price and stock our keys are now hot dogs pancakes waffles and scones and then the values are whatever it is you want the index to point to so the simplest thing to do is to put the entire record as the value so what you you end up with is instead of a top level array where each entry in the array is a record you end up with a top level dictionary where the keys are the names and the values of the records, so we have a dictionary with the key hot dogs and the value name hot dogs price whatever stock whatever then we have the next key pancakes which is another record with a name a price of stock and then we have waffles which is another record with the name of price in the stock scones yada yada yada so that if you do the byte count that is less efficient than what we started with because we now have hot dogs pancakes waffles and scones twice instead of just being a square bracket we now have a whole extra key everywhere so it is actually a more complicated data structure.
[39:09] What you can do is you can remove the duplication by taking the name out of the record and only only having the name exist as the key.
So you can basically pull it back and say, hot dogs colon open a smaller record price and stock.
Pancakes open a smaller record price and stock. I hate, hate, hate that.
Because when I do then look up the record, well, think about if you save that record somewhere, it's now incomplete.
The record isn't the full record anymore. So you now always have to remember two things to remember one thing.
[39:46] Why so? What do you mean it's not complete? It has the same information, just more densely packed.
Right, but when you pull it out, you leave a piece behind.
If you run that through a select, you will be left with price and stock.
The name is now disconnected once you pull it apart.
While it's in a lookup, it's all there. But if you take a piece of the lookup out, it's now, the record is incomplete.
Complete and that is okay sometimes very annoying it depends on your use you can also go the other way and you can make your lookup be i'll call it a tim report and look up in in memory of our mutual good friend who is a huge believer in menu bar apps because they did one thing and did it well, you can make lookups for one job and one job only and this is actually one of the best uses of lookup.
So you can make a lookup of price because you probably want the price of things really often.
So when you make a special purpose lookup, you throw away all the data apart from your special purpose.
And what you get is a really simple dictionary. It just says hot dogs are $5.99, pancakes are $3.10, waffles are $7.50, scones are $7.50. Really simple.
And for a one-to-one mapping, it's actually very, very simple.
You just have a key going to a value. Fantastic.
[41:09] What would happen if we tried to make a lookup where you use the price as the thing you're indexing and the items, the names, as the things that you're indexing to?
Well, waffles and scones are both 750. This is why the other thing you will see often is that the lookup is to an array, because an array lets you do a one-to-many.
So we can have a lookup of product name by price by using arrays.
So we say 310, just pancakes.
599, just hot dogs. But if you have 750 in your wallet, you have the choice of waffles or scones.
[41:48] So this is still bad, though, because you're going to throw away the price when you do the lookup, right?
But these are for really specific purposes. So these special purpose lookups are for very, very quickly answering what can I get for $3.99?
Or very, very quickly answering the question, what price is a hot dog?
So these are now special purpose. So you have a completely generic lookup, which is name to full record.
And you have completely special purpose lookups, which is one key to one value or one array of values, the bit that I have problem with is the halfway house where you neither have everything nor exactly what you need. You have almost everything.
And I just find that useless. I have never in my professional programming life found a use for that middle ground that doesn't know whether it's special purpose or general purpose. It's, I don't know what its purpose is.
[42:45] So at this stage, you're probably wondering, well, when are lookups worth the effort?
Because I've got to build a lookup, I've got to maintain the lookup.
When is it worth all of this faffing about?
And as a general rule, I would say that data you update infrequently.
[43:04] But query frequently, that's a really good candidate for lookups.
Because what you're doing there is you're giving away very very little cpu up front because every time you update a lookup it takes a bit of work but if you let's say the nobel prizes they're awarded once a year i think i can spare a few cpu cycles to build me a lookup of nobel prizes i don't think that's going to break the bank you know and then i now have the day the nobel prizes set in a structure that suits my purposes for an entire year and i can create efficiently for an entire year and then next year we come along and we update it again that's that's an ideal example right real world data is not usually that good so in my real world life if you're wondering why do i do this right with my professional hat on there are three places i use lookups and i use them a lot so data on a schedule is a fantastic candidate here right there are there are a lot of of databases and stuff in an enterprise system where yes technically speaking they are updated all the time but for reporting purposes you snapshot them once a day and you say that between today and tomorrow this is our enrollments yes the students can log into the website and re-enroll in a different course but none of that is reflected anywhere else until tomorrow because overnight we're going to export the current state of the enrollments and then every system on the university is going to pull that updated set of enrollments into themselves and then for the next day that That is the state of the university.
Efficient Data Handling in Enterprise Systems
[44:31] So that's data on a schedule.
[44:33] So once a day... Okay, but hang on.
[44:35] That's not updated infrequently. That's updated really frequently.
[44:40] Okay, but the bit you care about is updated daily. So it's data on a schedule.
So the on a schedule bit is where you build the lookup. That's just once a day.
[44:50] Oh, so once a day is not considered frequently and updated frequently.
Oh, yeah. When I'm talking frequently, I'm talking something like Twitter.
Right? Posts. Oh, okay. Right? The ratio of posting to reading posts is way, way different. Yeah, exactly.
Right? Yeah. So everything's faster on computers. Yeah. Right?
So a lot of data on enterprise systems, whether it be accounts, which are done by the day, accounts might even be done by the week.
You might not close your books until the end of every week.
Right? So some of these things can be in a long cadence.
But there is a point in time when the system says, and this shall now be the snapshot for everyone's use for the next N hours, days, whatever.
And so you're going to have a script that does a dump of that snapshot.
And that script probably already writes the file in multiple formats.
It's probably writing a CSV because that's just the oldest data format that there ever been.
Everything can read CSVs. I have yet to come across an enterprise system that doesn't output CSVs.
Right. For a few decades, XML was oh so fashionable in the enterprise, especially if the word Java came anywhere near the equation.
So most enterprise systems will give you a CSV and an XML file of the same data.
And these days, an awful lot of JSON is used because it's actually a really nice format.
So these days, a lot of enterprise systems will take their snapshot for the day for the hour and they will write it three times.
[46:16] CSV, XML and JSON.
The amount of work involved in having it write two Jasons, one with the index you want, and one as a flat file, you know, basically an array of records, is minimal.
[46:30] And if you know that, you know, this department needs to query it by academic year, and this department needs to query it by student name, surname first, well actually the amount of extra work in writing three or four lookups, as long as you give them sensible file names, is almost nothing.
So why not dump the data in all the formats all of the consumers want.
They just have them all sitting in a folder and you might have, you know, students dash by name, students dash by academic year, students dash by department.
You just write all the lookups you want, right? And yes, it's duplicative.
So you're trading space for time because everyone consuming that data can now consume it way more efficiently because you've given it them in the lookup they actually need for their business goals.
So data on a schedule is a fantastic use case for building lookups another one is mostly static data like we have information that we need to use inside scripts all the time like what are the ip ranges we own what are the domain names you own what are they used for yeah we have these 20 domain names three of them have email four of them are used for this purpose whatever so you can represent that as a data structure that updates itself once or twice a year you know How often do you buy a new bit of IP space or whatever, right? It's not very often.
[47:44] Or maybe it's an org chart. Yeah, you hire people from time to time.
And a general approach to this would be that you would have one copy that you consider to be canonical, which is probably just the array of records, and you probably write a little script that just takes your canonical version and spits out all the indexes you care about.
So you keep one copy as your master copy and a little script that says, and now update me the indexes.
[48:08] You know, that works perfectly well for data that you manually edit a few times a month or whatever.
Just edit the data, run the script. Now you have your indexes again, and everyone who consumes the data, they can have it in whatever shape they want. Very efficient again.
[48:22] And the other way is, I should say complex, by the way, if you're wondering what that typo is at point number three. There's an X missing. I fixed it.
Excellent. The other place where it's worth building an index is if you're going to do a whole bunch of complex processing within the one script.
The data might be different every single time you run the script but nonetheless the script itself is going to use the same data so many times, that it's still worth spending the CPU cycles up front to build the index even though you know it's going to be stale the next time you run the script doesn't matter you may never even write it to disk it may just exist as a variable in memory it's still worth the effort because the rest of your script is going to use the bejesus out of that data and i do this so often that i actually have written my i actually start my script by putting giant big comment blocks that say fetch data, validate data build lookups do work you'll actually find that in my script as those four headings and i genuinely put everything into those headings and as i'm in the do work bit if i need another lookup, I'll scroll up to the make lookups bit and I'll stick in another lookup.
And it's also vital to always validate your data because dirty data is the bait of my life.
Anyway, so I do that so often in big scripts, particularly scripts that reconcile data or that synchronize data or that validate data. And they're all things that I have to do a lot.
[49:50] The other thing is there's a halfway house there that I also do quite a bit is if you have data where it's like yeah this script runs every minute and I'm okay with it being wrong by an hour for this data source.
Utilizing Stale Data with Lookups
[50:03] I'll build a lookup right at a disk and then before I read it from disk or as I read it from disk I'll read it from disk and I'll use a script to check the modify time on the file, and if what I pulled in from disk was modified less than whatever I consider to be acceptable the bull just pull it from disk and don't recalculate it if i pull it from disk and it's too old delete the file recalculate it then shove it out to disk and then the next time it's like you've got a time to live in there precisely exactly so the data i call it going stale but yeah it's it is effectively a ttl and so that again is this is all very real world stuff, and i am sure there are people now shouting at their ipods or whatever ipods.
[50:52] Think about Bart Our CPUs are so fast So what if it's inefficient Just query the data the hard way Why build all these lookups There's a really good reason that will never cease to be true, Code that is simpler to write Is simpler to debug And simpler to maintain over time eye.
If you have your data structured in the right lookups, your JQ code is shorter.
Your JavaScript code is shorter.
All of your code is shorter, but not just shorter in terms of characters typed, shorter in terms of logical steps, because you're jumping straight to the right data.
So when you read the code, it almost reads like English if you have all the right lookups in place, especially if you name your lookups by what they do.
If you call your lookups Meng Yu by name.
[51:47] Price by name and when you read it out it's really obvious what's happening here all right yeah i'm getting you know ip or yeah ip address by department or whatever right whatever it is that you need if you name your variables blah by blah you get wonderfully readable code and anyone can maintain that code you know when you're gone and moved on to something else or you've handed the project over just into main or into the operations team instead of being in the project team or whatever it's such the value of making your code more human friendly by using lookups will never go away even if we get the world's fastest computers that is a win always and forever and i saved that for last because that's the one i want you to remember you and i are working with data that makes a lot of sense if you're especially if you're working with teams of people where they may not know exactly how you did something but if they can look at the names they can maybe maybe parse it back out and say, oh, I see what he's doing here.
Simplifying Code with Lookups
[52:45] Right. And also, if you come to debug a script and you go, right, well, I can see here that they've built a lookup, and then you check the lookup and you go, well, actually, the lookup's wrong.
Well, then everything in the script after here is no longer in your field of view for debugging.
The only part of your script that's in your field of view is the bit where you build the lookup.
[53:01] If you look and go, oh, they built the lookup, right, well, then all the code above there can go out of your field of view.
It's like, well, I don't have to worry about how they built that complicated lookup. thank goodness they did. Ooh, that's horrible code.
And then you just look down at the bottom where it all has a sensible frame and go, ah, I see where the mistake is.
He used, you know, name by IP address and he should have been using name by DNS name. Whatever, right?
[53:23] It's just making code more human friendly, even if it saved zero CPU time, would make lookups 110% worth the effort.
They're also more efficient for other things. But the fact that they're more efficient for us human beings is actually Julie, why I adore them so much and why I'm so passionate about making sure we teach this as part of our work on JQ.
[53:47] So I saved the best for last. Very good. But we're nowhere near done.
Transition to Practical JQ Implementation
[53:53] Well, now we get into the JQ specific bit. So now we go from philosophy to practical, right?
So we were way up in the clouds here, folks, right? We were, you know, software astronauting here.
Now we're getting right back down to earth. So how does JQ do this?
So before I tell you how they did it, I want you to put yourself in the shoes of the people designing the JQ language.
The problem to be solved is write a function which can take any one of the infinity of possible JSON data structures, run it through this function you have to write, to spit out any one of the infinity of possible lookup tables.
[54:34] How do you write one function that can do such a complex task? task?
How many possible arguments is it going to need to be able to do every possible transformation?
How do you attack a pie so big?
So one approach a computer scientist can use when facing a problem like this is to build what I call a mega function, right?
The documentation of this function is probably 20 pages long and it's like you can give it this argument and this argument and this argument and if you give it these four it'll do that and if you give it these three it'll do that.
And if you're wondering what that looks like, have a look at the man page for the find command.
The find command solves an extremely generic problem.
Everything you might want to search the file system for. And their approach was, we will provide you with a command line flag for every type of search we can conceive of. And we have a good imagination.
And the end result is that the command can do almost everything.
But it's so hard to find what you want that most people give up.
[55:41] So actually, that's a terrible solution. And thank goodness that is not what the JQ developers chose to do.
[55:47] Another very sensible approach in a lot of cases when you're dealing with a problem this big is to use what we call a domain-specific language.
So you write your function, so it takes very few arguments.
But one of those arguments is a string in a different language, and that language is designed to do a very specific task. task. And we've seen this.
That's how pattern matching is done in almost every programming language.
Instead of trying to write a function that takes 20 million optional arguments for describing every possible type of pattern, they take a string which has a domain-specific language we call regular expressions, and that domain-specific language does one thing and one thing only, it describes patterns.
So the function can be really simple. We take an input, which is is the data to be processed, a second input, which is this domain-specific language, and that domain-specific language describes the pattern we want to match.
And so the function is simple because the work has been handed off to this domain-specific language.
So if you look at the JQ functions for matching patterns, they're very simple.
They take what it is you want to match and the regular expression as an argument, and maybe some flags as an optional second argument.
JavaScript was the same. The match function, the test function, they just take a regular expression as a single argument. the function is simple, the work has been done by the special purpose language that is regular expressions.
[57:14] So, okay, we're still pretty abstract here for me to see where we're going, but keep going. Right.
Designing an Intermediate Data Format
[57:19] The problem is there's no language for describing mappings from one data set to a lookup. So that's not what they did either.
What they did was another very common approach. It's called an intermediate data format.
So instead of solving the problem in one step, you break the problem in two.
There are an infinity of possible shapes of JSON, right? You can have a JSON data structure in any shape imaginable.
Well, the JQ function say, if you want to make a lookup, you must take that data and transform it into this very specific shape.
And if you give it to me in that shape, I will make you a lookup.
And that middle point we call an intermediate format.
[58:04] And JSON's job is to transform data. So this is a fantastic solution.
Use JQ's core ability to put it in this shape and then give it to me and I'll make your lookup for you.
Well, reshaping data is JQ.
So use JQ to solve almost all of the problem.
Give it to me in the shape that I want it and then I'll take the last mile and I'll make it a lookup for you.
And that is how JQ deals with lookups.
Transform it into an intermediate format and then run that format through a trivially simple function that takes zero arguments.
You just give it the input and it gives you the lookup.
As long as the input has been transformed into this intermediate format, which JQ refers to as entries.
And entries are another flavor or species of dictionary.
They're a very specific species of dictionary. They are dictionaries with two keys.
A key named key. And a key named value.
[59:09] I know you're going to hate this. Don't start with me, Bart.
You're going to hate... I know, I know, I did my best to write it out as clearly as I can.
But in effect, what you're describing is an entry in a lookup.
You're saying key colon hot dogs, value colon 599.
Key colon pancakes, value colon 310. Key colon waffles, value colon 750.
Key colon scones, value colon 750.
They are the elements of our price lookup.
Our price lookup is a lookup table that wraps hot dogs to $5.99, pancakes to $3.10, waffles to $7.50, scones to $7.50.
In the intermediate format, that's more explicit. The key is hot dogs.
The value is $5.99. The next key is pancakes.
The next value is $3.10. The next key is waffles. The next value is $7.50.
[1:00:01] So this is an example of an entry?
Okay, so the intermediate format is an array of dictionaries, each of which has the key named key and the key named value.
And I'm showing you... Right, so this is... So this is... All entries have this exact format.
That's all they can have in them. A key called key and a key called value.
Correct. And each of those keys can have values in them. And that's it. That's it.
That is the intermediate format. That is the intermediate format.
But an array of those really simple dictions. What good is that?
[1:00:38] That says that I can't know three things about hot dogs, or two things about hot dogs. Of course you can, of course you can. I can only know one thing.
No, no. The value of value. Oh, I hate, this is why I hate this, right?
The value. The value of the key value. The value of the key value could be a dictionary.
If you put a dictionary in there of name colon hot dogs, price colon 599, then you can build a lookup that maps the name hot dog to the dictionary name hot dog, right? You can put the full record here as the value for value.
[1:01:11] I'm just showing you a simple one because otherwise, if I put the full record in for each of these in this example, you wouldn't see the structure.
But that 599 could be anything.
Now, I appreciate you simplifying it, but now that I understand it doesn't, it can be more than that, right? Yes, exactly.
So that actually lets us build any lookup conceivable. That actually gets us to the infinity of possible lookups because all lookups map something to something.
And this lets us say, here are all the mappings of something to something.
That is a lookup. That is the most generic form of lookup. So the only last piece we have is, well, what's the JQ function? It's called from entries.
So you take your array of entries, you shove them into from entries, and out pops a dictionary. Or sorry, out pops a lookup.
Wait, wait, wait, wait, wait, wait. Okay, we finally got to the part I was going to try to start really understanding because we've been philosophy here. here.
Say that all again. We've got this structure, this is entries, those have to look like that.
We know the structure of those. Now we're going to build a lookup by doing what?
We take that and we pipe it into the function named fromEntries and that function will make our lookup for us. That's it. That's all there is to it.
Building a Lookup with "fromEntries"
[1:02:23] You just take that structure, shove it into fromEntries and you're finished.
[1:02:31] I don't understand. Okay. What have we asked it? We haven't asked it anything.
We haven't done a lookup. We haven't told it what we want the data to look like.
We have. But we have. We've given it this list of keys and values.
That is all a lookup is, is a list of keys and values.
If you build your list of keys and values and shove it to from entries, you get your lookup.
What have we looked up? We're building the lookup.
We're not using you said we're done we're not building we're done you said we've already finished right exactly once you this right once you shove it to from entries it will build your lookup it's that simple it's it's so simple you think i don't know what a lookup is then even though you just spent the last uh okay hour and three minutes telling me what a lookup is i don't understand what's what's going to squirt out the other side okay so all of the examples where we had hot dogs mapping to the full record or hot dogs mapping to the price, those are lookups, right?
We have the full record by the name. We have the price by the name, right? They are lookups.
[1:03:45] Nope, not in my head. What? I thought we were finally getting to what is a lookup, what is a lookup like in JQ?
Okay, JQ is a language for transforming JSON, so a lookup in JQ looks like JSON.
But I haven't seen any JSON. Okay, so in the section that starts with the heading lookups, there are one, two, three, four examples.
[1:04:15] Okay, I'm going to have to scroll back. It types a dictionary after that.
Yeah, so the section that just has the heading lookups.
[1:04:23] Right, after the tables, after the space-time trade-off, we have the section called lookups and in there we give four examples of lookups.
Different lookups all built from our menu. So we can have our menu as a lookup that maps names to the full record.
We can take our menu, transform it into a lookup that maps the names of our food items to just their prices and their stocks.
We can make a lookup that maps the names of our food items to their price.
We can have a lookup that maps prices to which food items cost that amount.
There are four example lookups built from our menu.json file.
I think where I might be stuck is like a noun verb sort of thing.
Like a lookup to me is like, I am going to go look something up.
I'm going to look up what is the price of pancakes, but that's not what a lookup is.
A lookup is a data structure. like a tape is it call it a lookup table in your head call it a lookup table call it a lookup table in your head because okay that'll help a lot so so now you know where we were where we're saying we're going to pipe um that that structure the entries structure into the command from entries, that will build what structure to make but we haven't told it what structure to make our lookup table yet.
Okay. The entries are the structure.
[1:05:51] They say you will have a key with this value. That's going to be an entry in our lookup table.
But that's not the way I want my lookup table to look.
I want my lookup table to look this other way, one of those four examples you gave up about them. We haven't told it that yet.
No, no, but we have. So when you transform it into that ugly shape and you run it through from entries, what spits out the far side is the pretty shape.
Which one of the four pretty shapes you described? Okay, so in the example I'm showing you, in the show notes, sorry, let me scroll back to my own show notes, it says we're going to have the key hot dogs mapping to the value 599, the key pancakes mapping to the value 310.
So that is exactly the representation of the third of our examples, hot dogs colon 599, pancakes colon 310.
That is our lookup by price. Sorry, a lookup of price by name.
Understanding the Lookup Process
[1:06:53] Maybe, sorry about that really loud noise outside. Your mic didn't pick it up.
Oh, good. Wow.
We had to build a file that was key colon hot dogs value colon 599.
We had to build that. But that's not the lookup table we want.
Right. I'm saying that's the intermediate format.
That is the known shape.
So how do we tell it the format we do want? But that is what's telling it.
Maybe you just haven't told us that yet.
No, no. That shape, that intermediate format is the specification.
[1:07:35] Look at what it says. You said we had no choice. It's always like this.
You said it's always key colon hot dogs value colon 599.
That's the only way we can put the data in. Okay, but the actual, right.
But look at what that is. That is a perfect description of every, that is a description of a specific lookup table.
So when you run that through the function from entries, you get out the lookup table you have specified here.
You have specified you want a lookup table that maps hot dogs to 599, pancakes to 310, waffles to 750, scones to 750.
That will produce the lookup with the key hot dogs and the value 599 and the key pancakes and the value 310.
But you said we had to start with that.
So how does from entries create that if we had to start with that? Okay. No, no.
That is the big ugly one, which has a key named key with the value hot dogs.
That will produce a lookup with a key named hot dogs with the value 599. Okay.
[1:08:40] As opposed to a key hot dogs in value 599.
It's a key named hot dogs with a value 599, which are apparently two totally different things.
I won't be able to say it in five minutes, but I think I know what you just said.
But can we get, maybe you've got examples. Oh, heck yes, I do.
This is going to, okay, let's go. Let's just keep going. And maybe this is going to drop a little bit better.
Okay. So our examples are actually the examples we've already seen.
Because the examples I put open the show notes when I was all open the sky, those examples were written in jq of course so we're going to rewrite them so we're going to see them in action we're going to build the lookups we just used as our examples so our examples are our examples so the first thing we're going to do is we're going to build a nice simple lookup a one-to-one that's going to map the names of our food items to their prices because that is probably the single most useful mapping to want because you charge people for things.
[1:09:41] So in order to build that lookup we start by exploding our menu.json file which is an array at the top level so we explode the array and then we need to construct the entries, and we do that using jq's syntax for building a dictionary so we open a curly bracket and we're We're going to build a dictionary with a key named key.
And the value is going to be dot name. Got it. Okay.
Then we have a value, a key named value, whose value is dot price.
Got it. Okay. All right.
So we then got to wrap all of that in square brackets to capture the pieces because our from items function expects an array.
So if I take exactly what I wrote there, and I just wrap it in square brackets and put some nice comments around it to really make it clear what I'm doing, then we actually have what is needed to build the intermediate format.
So to show what I mean, I'm going to make you run this jq command here, which is going to show us not the final product, this is going to show us the intermediate shape.
[1:10:59] Oh, so am I supposed to be in that folder with this open and not reading in another application like it was?
And this is quite how far into the astronaut space we have been in the philosophy section here. We are quite some time into the recording.
And now, for the first time, I said, Alison, run a command.
Normally we get to the Alison running command. I guess I got to be in installment resources. Where do I need to be? PBS 163.
Installment resources, though. Okay. Yeah. I did not have a heads up from Bart that I would be doing this.
And actually, I should have known to give you the heads up because you told me you only managed to proofread to a certain point, so that might have happened.
But I didn't know I was going to be running any of this live.
So which one do you want me to run?
The one where you've wrapped it in nice square brackets and such?
So straight after that, I have it as a JQ command.
[1:11:48] Okay, so it's, gotcha. The thing he just described.
Okay, so if I do that, I get that nice little structure you talked about, key colon hotdogs, value colon 599. in a set of dictionaries inside of an array.
So that is the input. That is our intermediate format.
And that's what we shove into the from entries function to build the actual lookup table.
So if we take all of that code from above and at the very, very end we take pipe from entries, then we are running it through the from entries function.
Now I have saved that as menu-pricebyname.jq So you can build that yourself by running jq-f menu-price by name.jq with menu.json as the input and that will show you the actual lookup we have built from that intermediate format and I'm hoping the penny now goes ah bing bing!
Generating Lookup Tables from Entries
[1:12:49] Sure.
Okay, maybe you're inspecting bigger fireworks. Well, so it was an array with a bunch of dictionaries in it of key colon scones, value colon 750, etc.
Now we have a dictionary.
Not sure where we told it it was supposed to be a dictionary when it came out, but maybe that's what from entries does?
Right, exactly. Preacher dictionary doesn't, okay.
But it expects an array as an input. Right. Yes. Yes, exactly.
And then the intermediate format's been created, and now we have hotdogs colon 599. Right.
So the output is the lookup we wanted. The input is this strange intermediate format that is a list of entries.
Where did we tell it the format we wanted? Which is what my question's been since we started this piece.
[1:13:46] The output from the previous command is where we told it the format we wanted.
No, we told it in the previous command, we told it we wanted key colon name dot name value colon dot price.
Okay, and then look at the output. And then from entries changed it into some other format.
Okay, and how could you describe the output you got in any way better than the intermediate format, which said you want a key...
[1:14:13] Hot dogs with the value whatever and then you want a key scones with the value whatever that's where you told it the structure you know what i'm always yelling at you for making overly complex examples that throw in a whole bunch of other crap that confuses me and this time you made it so simple i can't see it because because it's just exactly what we said we we created an intermediate immediate format that was key colon pancakes value colon 310 and now we have a dictionary with each of those uh uh key value pairs yeah yeah yeah that's our lookup yeah we can take a name and it gives us the price that is a lookup of price by name so we have okay it's so simple it's complicated yeah yeah that is that is exactly what the problem was okay so that we have now now viewed our lookup, but in the real world, we were going to save it to a file and we did, we would like to save it in the efficient way computers like their JSON, not in the inefficient way us humans like our JSON. You, I'm not going to like this.
No. So we're going to take the command above and we're going to pipe it back into JQ and we're going to give it the minus or flag, which tells JQ, I want raw output.
I do not want you to wrap this in quotation marks to make it one big string. I want row output.
[1:15:42] And then the jq filter is at JSON, which is going to format it into JSON the way computers like.
And then we're going to take the arrow. I'm going to shove that into a file named menu-pricebyname.json.
And that is my naming scheme. But I always take whatever my data source is, dash, what is indexed by what? price by name dot chase, I probably shouldn't have just created that file in the same repo that I'm about to push, right? That should not be in there?
Well, you can put it there all you like because the ignore file for the repository ignores it.
Ah, sweet. Okay. I normally copy it first, but I was caught with my not quite knowing where we're going.
Remember, I make the show notes in place, so the ignore file usually covers you for that.
[1:16:36] Nice, nice. Okay. So, that is our first of our example locals from above, price by name.
We also have the more complicated one, which is the name leading to the full record.
But you know something? The code for that is almost identical.
Because we explode our top level array.
We then build our mappings, our list of entries.
Key is still name, so key colon dot name. And the value now is the entire record. So the value is dot.
And then we pipe that to from entries. So I forget what the input file looks like, what menu.json looks like, but...
So it's an array of dictionaries where each dictionary has name, colon, price, colon, stock, colon. So it's a record. Okay.
So what we're doing is we're saying for the value, give me the entire record. Give me dot. Okay.
And then we shove all of that to from entries. And then if you run jq-f menu-by-name.jq on menu.json, you will see a lookup that maps the name to the full record, which as far as I'm concerned is the most powerful lookup.
You have the full data indexed by the logicalist of possible indexes, its primary key, its name.
[1:17:59] Okay. Okay. And you like, but this is okay because it's hot dogs colon, and then name is hot dogs, price is $5.99, stock is $1.43.
So that way, if you pull anything out, you still have hot dogs with you.
Yes, yes, yes, yes, yes. So basically, I am a person who believes you have all or one.
So give me the price, great, or give me everything, but don't give me anything in between.
I like my local either one specific thing or everything so that has that's our example there if we wanted to pull i give you the code if you want to pull the name out all you do is after you build your entry you use the del command we learned the del function we learned about last last time to delete dot value dot name because we've just said that value is going to be the entire record which means value contains the name so if we then run that through del.
[1:18:59] Dot value dot name we then delete the name and then we run through from entries because it's already the primary key right so at that point we have we've just right but we've just made a dictionary with a key named value whose value is the full record so if i want to pull the name out and i say dot value dot name i don't want you to do that but the notes wouldn't be complete if i didn't tell you how to do the thing that you know so there it is so the last thing then is a little subtlety is that because of how dictionaries work the value for the key key in our entries must be a string so name has been a string all along so we haven't had to worry about at this but if we do a lookup of something by price the price is our numbers we will get an error unless we take dot price and run it through to string so if we build a lookup of the name of our product by its price we have to run it through to string.
Dealing with Data Types in Lookups
[1:20:21] If we don't, JQ will just get very cranky. From entries, we'll go, nope, I insist on strings. And it will tell you key must be string.
Now, this is an interesting one because it illustrates why everything I've said so far works great for one-to-one mapping.
So we're going to build an index that goes from price to the name of the item.
And if we take our logic from above and we just say .price to string and .name, Well, that's the same logic as .name and .price.
We just reversed them. This should work fine.
Well, it kind of works fine. We get hot dogs, pancakes, and scones.
Waffles? What happened to waffles? I know, waffles of all things.
Well, waffles came before scones, so scones came and stomped on the waffles, because scones and waffles are both $7.50.
So this is why when you have a one-to-many, you use an array as the value and assembling that is actually really hard, i tried to do it from first principles and i spent an hour banging my head at it going this is impossible this is impossible this is a really common thing how can it be impossible this is impossible then i went wait a second maybe i don't do this from first principles maybe there's There's a special function whose job in life is this thing I've spent an hour trying to do.
[1:21:47] Yes, there's such a function. It's called group underscore by.
You give group by an array of dictionaries. And you give it a filter that says how you would like to group those dictionaries.
And you get back an array of arrays.
[1:22:06] Effectively, it subdivides the array. and so if you give it the array of our menu and you say group by dot price what you get back is an array and then the first child array is only the dictionary for pancakes because that's the only thing with the price of 310 the next thing is you get an array with only the dictionary for hot dogs because it's the only thing with the price of 599 and then you get an array with it's two dictionaries waffles and scones because they both share a price so when we group by price we get a dictionary we get an array for all of the ones for 310 and an array for all of the ones at 599 and an array for all of the ones at 750 so this is really different we were getting dictionaries when we did it the other way but with group by all of a sudden we got a bunch of arrays Arrays of arrays.
Right, because this is, we need group by to build the entries to make our dictionary, to make our lookup.
Without the group by, I challenge you to build a lookup that maps to an array.
[1:23:18] Well, we need to use this before we use from entries.
We're still going to end up... Oh, this is before from entries. Right.
And I'm saying to you, you can't get it to a stage where it's in the right shape, that intermediate format, without the help of group by.
I spent oh so long trying to do it without group by, because I didn't know about group by and I thought it must be easy. No!
That's why they invented group by.
Utilizing "group_by" for Multi-Level Lookups
[1:23:49] So now let us build our lookup the other way around. So let us now build our lookup without losing any waffles.
We start with the group by.
So we start by saying break this array of menu items into subarrays by price.
So we say group by dot price. So we now have an array of arrays. arrays.
So we now need to build those into an array of entries. So we're going to explode our array of arrays.
We now don't yet have individual pieces. So we're going to explode the inner array.
[1:24:29] Wait, sorry. I pretend I didn't say those last few words.
No. Okay. One explosion, One explosion, one explosion, one explosion.
So we explode the outer array. And so we are now processing a group at a time.
So we're processing everything that is 399. Which is what we expect it to be.
Yeah. Sorry, my bad. We just explode once. No, no.
So even though we've exploded our array, we still have an array.
Because we had an array of arrays.
[1:25:01] I did so many comments here. so we now need to map a key to an array so the key is going to be the price but the price is inside each element in the array but the price is the same because we grouped by price so we could take the price out of any element in the array so i'm just taking it out of the first one because there's always a first element.
So if I say .0.price, that's the first one. That's always going to exist.
Oh, and you don't have to know how many were in each one because you know there's going to be at least one.
Exactly. And you grouped by price. So that's always going to be the same.
So therefore I just say .0.price to string. Great.
That's the key taken care of. The value, I want the value to be an array again, but I only want the names. I don't want the full dictionaries.
So, I start with a square bracket because I know I want an array, but the first thing I do is I explode the inner array, I just take the name, and then I close my square bracket.
So now the value is the array.
[1:26:10] Pancake scones for $7.50. Now it's in the right shape where we shove it to from entries, and the output we get is what we wanted, $3.10 to the array pancakes, $5.99 to the array hot dogs, and $7.50 to the array waffles and scones.
[1:26:31] Wow. Like I say.
Even using that group by it, it's still a little mind-bending.
It's only like five lines of code with your comments that are very helpful, but it's still sort of mind-bending.
This is why I would say the debug function is your friend, because you can stick that in at any point to see what it looks like, right?
So the last thing we have in the show notes is just a bonus extra.
In the real world, if I were to be actually caring about the menu.json file, I would keep that as my canonical record, right? This is my actual master copy of my menu.
It's a nice simple file. It's an array of records.
And if I change my menu, I change that one file.
And then I build from that file the lookups that are of value to me, which is the by name lookup, which is the full record by name, and the price by name lookup, which is the price by name.
[1:27:29] And I have the JQ files to do that work.
And then i have a shell script that just does all the updates in one go and i use a naming convention and this is very much for real this is how i actually do it in the real world so the master list is the thing with the shortest name menu dot json hang on back up back up back up back up because you're going to a thing beyond where i'm lost uh okay what do you mean you create a shell to update things to update what.
[1:27:56] So a little shell script to rebuild my index, my index, my JSON files that have the index in them whenever I change my menu.
So if I change my menu, I need to remake my indexes. If you change the menu.
Thank you. That's what you didn't say there. I'm like, what's updating?
Okay. Yeah. So on the rare occasion, I add some scones or whatever I do, then I need to rebuild my lookups.
And instead of doing it by hand, I have a script and that script makes use of JQ files that do the actual work. And my naming convention is such that my data files are the canonical one has the shortest name.
So menu.json must be the master file. So if I go to a folder and I see something called menu.json and menu-by-name.json and menu-price-by-name.json, I know which one is the master copy because it has the shortest name.
And everything I build from the master copy gets a postfix. So I append to it a description of what it is.
So the menu indexed by name, the menu prices by name, right?
So it's obvious what's derived from what.
The scripts then have the same names following them, basically.
So menu-by-name.jq is the actual jq syntax for building that lookup.
And menu-price-by-name.jq is the jq for generating the price lookup.
And then my script for actually building it all i call menu dash generate lookups.sh.
[1:29:24] Because i might have another script called sync to production server.sh and upload to third-party service.sh right i mean in the real world you do a lot of things with your data so i give it a nice generic name so that anyone else can look at that folder and go ah i've updated the menu therefore i must run generate lookups.sh generate look and then when larry comes in and says i want to look lookup by stock, you could create menu stock.
Menu dash stock by name. Or, yeah, whatever Larry wants. Stock by price, whatever, yeah. Yeah, exactly. Okay.
And then the actual shell script is wonderfully simple.
We just have our normal everyday bash shebang line followed by the jq commands.
Jq minus f menu dash by name dot jq menu dot json of the input.
Pipe it to jq minus or at json. shove it all into the file menu by name.json, and the same thing I have a question why don't you.
[1:30:24] Can you not do the dash r and at json at the end inside that jq file I could but then if I need to see it, you don't have it I don't have it yeah so by doing it as a separate step it means I get to have my cake and eat it if I run the jq file manually I see it pretty, but my shell script will write it efficiently.
So the actual .json files on disk are efficient. Okay, so it'll actually come up on the screen, but it'll actually be writing the file in the right format.
[1:30:56] Well, no. So if I just run the command jq minus f and I stop, right? So on the terminal, I would not do the pipe to the second jq.
On the terminal, I would only run jq minus f menu by name dot jq menu dot json.
And I would stop typing. And then I would see it. Look at it.
And then do the second one. Oh, okay. Yeah. And then the shell script, it does the extra piece.
And only the shell script does the extra piece. Oh, gotcha, gotcha, gotcha. Okay.
Yeah. Okay. All right. We got a challenge, huh? We do have a challenge.
So I have been working with our menu, which is a very, very small data set.
And therefore, anything I've done here in JQ, you could do manually in seconds.
Our Nobel Prize data set is kind of bigger.
It's a much more real world data set coming from the actual real world.
So I would like you to build the most logical lookup I could think of.
I would like the Nobel Prizes by year, please.
I would like to be able to say dot 2020 basically take jq.
[1:32:00] The entire filter dot 2020 inside square brackets or whatever and just give it my lookup and I would like to know who won all the prizes in 2020 instead of having to do the piping to select and exploding and all the yada yada yada we have to do at the moment if I had that lookup I could just tell you who won the prize in 2020 20 right that will be a very useful lookup so that is the challenge to build that very useful lookup for bonus credit can you build a two-level lookup where we have a lookup by year that gives us a lookup by category that gives us the prizes so then we could say dot 2020 dot chemistry chemistry and actually figure out who got the Nobel Prize for chemistry in 2020.
Okay.
[1:32:55] And I'll give you a hint. A lookup can go to a lookup.
So each entry, we have a key and a value.
Why can't the value be a whole other lookup?
Of course it can. Okay, that's not in the show notes, and I won't remember that.
No, but that's why it's an optional extra challenge here. We will go through the sample solution in great detail.
But if people want to have some thinking about it it's it's both simple and head exploding.
[1:33:31] All at the same time because it is actually very simple to do a multi-level lookup and yet, it makes your head explode and depending on the kind of person you are you may just go what are you talking about really easy okay all right well hey bart i'm glad you only did half of of this.
To be honest, I would have raced it. You would have been lost.
Yeah, anyway. So anyway, right. We shall leave it there.
Next time, we shall do the opposite and tear our lookups down and rebuild them into different shapes and do cool things with them.
So until then, have fun with your challenge and happy computing.
If you learn as much from Bart each week as I do, I'd like you to go over to let's-talk.ie and press one of the buttons over there to help support him.
He does 98% of the work here. I'm just the stooge that listens to him and asks the dumb questions.
If you go over to let's-talk.ie, you can support him on Patreon.
You can donate via PayPal, or you can use one of his referral links.
I really hope you'll go over and help him out. In the meantime, you can contact me at Podfeet, or check out all of the shows we do over there over at podfeet.com.
Thanks for listening, and stay supportive.
[1:34:45] Music.