Generated Shownotes
Chapters
0:00:08 Introduction to Chit Chat Across the Pond with Bart Busschots
0:03:41 The Importance of Transforming Data in JQ
0:09:12 Introduction to Dense Syntax with Slash B and JQ
0:12:24 Using String Interpolation in JQ
0:15:29 Exploring the Power of Arrays in JQ
0:18:54 Converting strings to numbers for proper comparison
0:27:30 Introduction to building dictionaries in JQ
0:37:13 Using Regular Expressions to Build Dictionaries
0:41:26 Creating Named Capture Groups in Regular Expressions
0:45:02 Explanation of the capture function and its input requirements
0:45:22 JQ Show Notes: Copying, Pasting, and Laughing at Typos
0:55:34 Exploding the menu array for CSV conversion
1:06:19 Introduction to the topic of quotes and feeding NobelPrizes.json
1:06:51 The presence of quotes in the JQ command
1:07:42 Running the same JQ command without URI query
1:08:07 Confirming the absence of quotes with JQ command
1:09:27 Different positions of 'at csv' command in a pipeline
1:11:49 String Manipulation: Answer Extraction
1:12:23 Understanding the use of "at" for input manipulation
1:17:05 Formatting text using different encodings and separators
1:23:08 A Good Story that Holds Together
Long Summary
In this episode of Programming by Stealth, we share exciting news about a new functionality added to the show notes website that allows users to easily navigate forward and backward through the installments. We express our gratitude for the community's contributions.
Moving on, we discuss the third problem that JQ solves, which is transforming data into the desired format. This topic will be covered in the upcoming episodes and involves building new data structures and applying functions to the data. We rearranged the show notes for better flow and set a challenge from the previous installment to find Nobel laureates awarded for their work in quantum physics.
We then dive into the use of regular expressions to search for the reason Nobel laureates were given their prizes. We explain that we'll test for the word "quantum" using regular expressions with the "i" flag for case insensitivity. This gives us 23 prizes, from the oldest belonging to Werner Heisenberg to the most recent for Quantum Dots.
Next, we move on to building strings, arrays, and dictionaries in JQ. We demonstrate the use of string interpolation to insert values into placeholders within a string. We then explore building arrays using JSON syntax in JQ, explaining how elements can be included based on filters applied. We also show how to convert strings to numbers when needed using the tonumber filter.
We discuss the importance of making numbers be numbers before doing math and acknowledge that searching for information can sometimes be difficult. We explain the process of combining dictionaries into one array and converting between strings and arrays using the split and join functions in JQ.
Moving on, we dive into creating dictionaries using curly brackets in JQ. We apply this concept to work with the messy Nobel Prizes dataset, specifically building a dictionary for Andrea Ghez's Nobel Prize with the desired information.
We then address formatting issues, such as fixing the year format, extracting the laureate's name, and cleaning up the motivation field. Despite encountering challenges, we successfully complete three out of the four tasks.
In the next part of the conversation, we discuss using regular expressions to build dictionaries. We introduce capture groups and explain how they allow us to extract specific parts of a pattern. We demonstrate using named capture groups in JQ and provide an example of extracting hours, minutes, and seconds from a time string.
We explain how regular expressions in JQ are used to work on strings and how key-value pairs don't have to be strictly strings. We mention that JQ is commonly used in a chain of commands to process JSON data and send it somewhere else. We discuss the different formats that JQ can handle, such as encoding special characters in a URL or converting data to CSV or HTML format.
We explain how the at symbol (@) is used for format filters in the pipeline. We clarify that when a format is used as a filter on its own, it applies to the entire output, while in string interpolation, it only applies to the placeholders. We provide examples of using the @csv filter to convert JSON data into CSV format.
We discuss creating a CSV file from a JSON file using the JQ command. We explain the process of creating a header row, adding data rows, and saving the output as a CSV file. We mention the need to prevent JQ from converting CSV to JSON using the "--raw-output" flag.
We explain how to create a search URL using JQ, including encoding special characters using the URI encoding scheme. We demonstrate how to build a search URL to find the winner of the first ever Nobel Prize and provide the necessary string interpolation and formatting.
Lastly, we discuss using the @csv filter in JQ to transform data into comma-separated values (CSV). We explain the purpose of the other formats, such as HTML and Base64. We also mention the importance of safe shell escaping and summarize the difference between @csv and @uri.
We conclude the conversation by discussing the challenge of transforming the NobelPrizes.json file into a top-level array containing dictionaries for each awarded prize, excluding those with no laureates. We explain the steps involved and the need to ensure correct formatting for prizes where the winner is an organization.
Overall, we cover a range of topics including regular expressions, string interpolation, building arrays and dictionaries, formatting data, and using JQ to transform and process JSON data.
Brief Summary
In this episode of Programming by Stealth, we introduce new navigation functionality for the show notes website and express gratitude for community contributions. We discuss using regular expressions to search for Nobel Prize criteria and cover topics like building arrays and dictionaries in JQ, string interpolation, and formatting data. We explore creating search URLs, converting data to CSV format, and demonstrate the challenge of transforming the NobelPrizes.json file.
Tags
episode, Programming by Stealth, navigation functionality, show notes website, community contributions, regular expressions, Nobel Prize criteria, arrays, dictionaries, JQ, string interpolation, formatting data, search URLs, CSV format, NobelPrizes.json file
Transcript
[0:00] Music.
Introduction to Chit Chat Across the Pond with Bart Busschots
[0:08] Well, it's that time of the week again. It's time for Chit Chat Across the Pond.
This is episode number 784 for January 20th, 2024, and I'm your host, Allison Sheridan.
This week, our guest is Bart Busschots. He's back with another installment of Programming by Stealth. Long time no see, Bart.
It has indeed been a while. I was going to say, has it been a year?
But we did record earlier in January, didn't we?
Did we? Oh, boy, I don't even remember. Maybe we didn't. Feels like last year.
Wait, before we dig in, I want to tell the audience about something fantastic.
And one thing I have always, always, always wanted is a way to flip forward and back through the show notes in pbs.bartificer.net from lesson to lesson.
Because a lot of times I'm looking for something and I don't remember which in a series where it was. Was it in 101 or 102 or 103?
And so I end up going into 102, going back in the browser and then clicking on the next one and going forward and then turning around and going back.
And Mark from Australia, also known as Deafiant in GitHub, submitted a script ages ago to add this functionality.
Just recently, Helma from the Netherlands has implemented it.
So now when you go to a specific installment down all the way at the bottom, you'll find a link to go forward or back to the installments by name.
It's not just a forward and back arrow. You see the name of the installment and a button to go back to the index directly.
So this is fantastic. This will be a huge time saver. Even if it's only for me, I thank Deafiant and Helma from the bottom of my heart.
[1:35] Absolutely. One of the nicest things about the way we do this is that the community gets to contribute, and these are two fantastic community contributions.
So thank you, thank you, thank you.
Absolutely. I don't know how far back it was, but it was a long time ago.
Oh, February 18th, 2022 is how long he's been waiting for us to implement it.
So luckily Helma got on it because I didn't know how to do it.
Yeah, I think Helma added the improvement that you could rerun the script without it adding more links.
So basically now we can just run it infinity times and it won't add extra links into the bottom of every episode.
Okay. Well, and I think what he did was he wrote it for just the initial ones, like up to 101 or wherever he was, and we didn't have a way to automatically have it do the next one.
Yeah, basically, he had a script and Helma made it better, and now because two people work together, we have amazing. So, thank you.
[2:30] All right, we've got a long one here today, so we should probably dig in.
We should. So, maybe to remind myself as much as anything else, just to set the scene.
So, we have been saying that JQ sort of solves three big picture problems.
Pretty printing JSON, which is very easy.
And we did that in the very first installment, half of the very first installment, even that was easy.
And then it was searching JSON, which has kept us very entertained for quite a few installments because filtering a large data set down to smaller pieces is a substantial piece.
You know, there's a lot, there's a lot of power here. So there was a lot for us to learn, but searching allows you to narrow down a data set you've been handed.
But the third piece of the puzzle is to transform the data that you want into the form that you want it.
So making it look good is useful. Being able to find only the pieces you want is useful.
But being able to convert those pieces into the shape you need them to be is very, very, very useful.
And that's kind of the final piece that we're now getting stuck into, and it will take us a few installments to get our way through.
The Importance of Transforming Data in JQ
[3:41] Mentally, sort of thinking about the different pieces to learn, it sort of falls into two groups. You can build your own data structures from scratch, or you can basically apply functions to the data, like, you know, multiply by five, that kind of a thing. And we're actually going to do it slightly backwards, which is we're going to build new data structures first, because that's actually kind of needed before we can get a lot of value out of the ability to transform data by applying functions to it.
So I had written all of these show notes the other way around, and then I rewrote them all while spending two hours on the tarmac in Brussels airport.
And I think it works better this way around. And I think, Allison, you helped me find a few places where it sounded like I was saying something for the first time way down the show notes.
It's because when I first wrote it, that was way up the top of the show notes.
Okay. Yeah. You said something about, well, we learned this last time.
And I was like, boy, why didn't you tell us the first time you said it that we learned it last time?
And that's because it had been rearranged. I got it. That makes sense.
It flows very, very well.
I have done a pre-read, which I think will help the audience to enjoy it more when I'm not asking silly questions. I think I know what I'm, I think I know where you're going.
[4:59] Cool. So before we go, we should go slightly back because I set a challenge at the end of the previous installment. So I guess the first question is, how did you get on?
My dog ate my homework.
Your dog ate your homework? Oh, no. I thought it would have been your grandchild, but OK.
Yes, yes, that was it. It was Teddy. Teddy chewed up my homework. That's what it was.
[5:21] Well, OK. Well, anyway, our challenge was to find all the laureates awarded their prize for something to do with quantum physics.
And what we wanted for these laureates was their first name, their surname, why they were awarded the prize.
And really our condition is simply: does the motivation contain any word that starts with quantum? Whether it's uppercase or lowercase, we don't care, which is my way of saying, remember those regular expression things? Let's go play with those.
When you say starts with, is that so you could get quantum mechanics if it was one word? But it isn't.
Correct. Or quantum chromodynamics or...
Okay.
[6:02] Just in case. Physicists... Yeah, physicists tend to prefix quantum to things or tend to stick things after quantum.
We're not sure quite which way to put that.
And it also gives us an excuse to talk a little bit more about regular expression.
So we had done some stuff looking for, I think the example in the previous installment was to look for Nobel laureates whose name begins with a vowel, which was slightly arbitrary.
So now we're not searching the name. Now we're searching the reason they were given their prize.
But the basic structure is going to be rather similar.
So we're going to want to explode the prizes and then inside those exploded prizes we want to explode the laureates and then we need to do select something.
We need to do some sort of logic to implement our contains a word starting with quantum.
So the question then is, what is our test?
[6:58] Well, so we learned last time that the test function applies a regular expression to its input.
So you send a string into the test function with a pipe, and then you give it the first argument is a regular expression as a string.
And if you need to specify some flags, like maybe I for case insensitive, then you put that as an optional second argument.
[7:24] And I've just noticed I used a comma instead of a semicolon to separate my arguments in my example, because I do that all the time, just like I promised you I would. So I've just fixed that.
[7:36] So the question is, what is our regular expression? Well, I mentioned as a hint that in Perl-compatible regular expressions, PCRE, there is actually an escape sequence for word boundary, which is start or end of a word.
So a word boundary is basically, it could be the start of the string.
It could be after a space.
It could be after a dash, I think.
Basically, it's an intelligent version of this is the start of a word.
It sort of comports with what us humans think.
And also, this is the end of a word. So if it's before a period, it will be a word boundary as well.
And it's backslash B for boundary because backslash W is taken.
So word boundary. And the other thing I kept on saying last time was that the regular expression is a string.
So if we want to have a backslash b in the regular expression, the backslash needs to be escaped to be in the string. So it's actually backslash backslash b.
So what we're actually testing for is backslash backslash b quantum, semicolon.
And then we want one flag, which is i for case insensitive.
And that will then give us our 23 prizes, the oldest of which belongs to a certain Werner Heisenberg, who you might have heard of. He has a wee bit of a principle named after him.
And the most recent is for Quantum Dots, which is making our modern televisions so nice to look at.
Oh, that's pretty fun to know. I got to say that test looks kind of funny.
[9:03] Test, open roundy bracket, quote, backslash be quantum, unquote.
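For readers following along in text, the homework filter being read out here can be sketched as a small shell snippet. This is only an illustration: it runs against a tiny hand-written stand-in for NobelPrizes.json (the entries are hypothetical, only the shape mirrors the real file), and the trailing .surname is an addition to keep the output short.

```shell
# Hypothetical stand-in for NobelPrizes.json: same shape, made-up entries.
# The second prize has no laureates at all.
sample='{"prizes":[
  {"year":"2001","laureates":[
    {"firstname":"Ann","surname":"Alpha","motivation":"\"for the creation of quantum mechanics\""}]},
  {"year":"2002"},
  {"year":"2003","laureates":[
    {"firstname":"Bob","surname":"Beta","motivation":"\"for a nonquantum model of heat\""},
    {"firstname":"Cat","surname":"Gamma","motivation":"\"for Quantum dots\""}]}
]}'

# Explode the prizes, explode the laureates (the ? skips prizes without any),
# then keep laureates whose motivation has a word starting with "quantum".
# "\\b" becomes a backslash-b word boundary once jq has parsed the string;
# the second argument "i" makes the match case-insensitive.
quantum_surnames=$(printf '%s' "$sample" | jq -r '
  .prizes[] | .laureates[]?
  | select(.motivation | test("\\bquantum"; "i"))
  | .surname')

echo "$quantum_surnames"
```

With this sample, Alpha and Gamma are printed. Beta is skipped because "nonquantum" has no word boundary in front of "quantum".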
Introduction to Dense Syntax with Slash B and JQ
[9:12] Slash be quantum. I like it. It is odd syntax, isn't it? But yeah, the slash B means a word boundary. So yeah. Well, you promised us dense.
Yeah, that's regular expressions for you. They're dense. And JQ is dense, so double density here. Ooh, like the old floppy disks.
So I'm not sure I could write this myself, but it's very readable.
It says JQ dot prizes, square brackets, so it explodes the prizes.
Then you explode the laureates with a question mark to make sure you, what, skip the empty ones?
Is that what that does? Skip the ones that don't have laureates.
Right. So remember we had prizes where they didn't give them to anyone.
Yeah. And then you pipe that to the select. We're looking for dot motivation, piping it to the test: does it start with quantum, and I want it case insensitive. Yeah, that's readable.
So there we go. Excellent. Right, well, that is our homework done. So now we are going to build three things today. We're going to build strings, we're going to build arrays, and we're going to build dictionaries, because they are our three, you know... Now, you can't build an integer, right?
It's just there, right? You can't build a Boolean. It's just there.
But you can build strings, you can build arrays, and you can build dictionaries.
So that's what we're going to do.
I want to start with the simplest one, which is strings.
[10:29] So I'm going to ask you to cast your mind back to JavaScript some time ago in this series, before we got distracted by Git and Bash, which were fun distractions, but, you know, it's been a while.
And in JavaScript, we could make strings using something called a template string, where you would have a special character to say, I'm about to stick something into this string.
You'd put the name of a variable, and then when the string was made, wherever it met that special character, the value of the variable would get shoved into the string and then you could continue on.
And we called them template strings.
I think of them as like concatenation in Excel.
[11:08] Take this thing from this cell, put, say, a slash between this, and take this thing from that cell, do another slash, and so you're flipping back and forth between an equation to point to something and some text, a string that you're throwing in there.
Yeah, exactly. So you're basically shoving values of variables into placeholders.
Right. Yes, that's probably the nicest way to say it.
And the fancy-pants computer science way of saying it is string interpolation.
That is what string interpolation is. It is sticking things into those placeholders.
And that is what JQ allows us to do.
So we know we can make a string by saying open a quote and then some letters and then closing the quote.
If we want to stick a value into that string, we have to have an escape sequence which tells JQ this is a placeholder.
[11:59] And JQ's choice of placeholder is unique.
I don't think I've ever seen a language do it quite the same way.
It's backslash open roundy bracket then you put in any JQ filter you like that's going to make a value and then you close the roundy bracket.
And so it will take whatever's between those roundy brackets and do it and the answer gets shoved into the string at that point, into that placeholder.
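As a minimal sketch of the placeholder syntax just described (the dictionary and its keys name and count are invented for illustration), any filter can sit between the roundy brackets, not just a key lookup:

```shell
# -n: run without an input file; -r: print the raw string without JSON quotes.
# The second placeholder holds a small calculation rather than a plain key.
greeting=$(jq -n -r '{"name": "Bart", "count": 3}
  | "Hi \(.name), you have \(.count + 1) items"')

echo "$greeting"
```

Each placeholder is evaluated against the current input and its result is dropped into the string at that position.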
Using String Interpolation in JQ
[12:24] So backslash two roundy brackets and then in goes any valid JQ filter you like.
Yeah, that is a hard way of doing it, but...
Yeah, it does work. And a lot of the time, the filter is really simple. It's .nameofvariable.
It's usually just the simplest of filters, which is basically go into that dictionary that we're currently in and grab me whatever.
Almost always, that's what you're doing.
[12:50] So as an example, let us play some more with our Nobel Prizes in NobelPrizes.json, sitting in the installment zip.
We can use our existing experience to find the prize for a friend of the NosillaCast, Dr. Andrea Ghez, and we're going to use string interpolation to print out her entry in a nicer way than just a dictionary. So what we're going to do is our usual: explode the prizes, explode the laureates, do a select, in this case dot surname double equals Ghez is our select. And then we're going to pipe that into our new string interpolation thingy.
So we're going to open a string and immediately we're going to put in a placeholder.
So backslash, open roundy bracket, dot firstname, close roundy bracket, space, backslash, open roundy bracket, dot surname, close roundy bracket, space, was awarded her prize for, space, backslash, open roundy bracket, dot motivation, close roundy bracket, and close our string.
So we have three placeholders there: first name, surname, and motivation. And so when you run that, you get: Andrea Ghez was awarded her prize for the discovery of a supermassive compact object at the center of our galaxy. Which is what she did.
Okay. One comment, one question. One thing I do like about this better than concatenation in Excel is that in Excel you always have to put the text part inside quotes.
[14:17] In this case, the whole thing is in quotes, so you're fine. Yes. But how did...
[14:23] For the discovery of a supermassive compact object at the center of our galaxy, how did that end up in quotes?
Because if you look inside the raw JSON file, the motivations are all quoted.
They are already wrapped in quotes in the raw data, which is very annoying.
Well, in this case, it's useful. It will become annoying later.
Yeah, it's annoying. There's one case.
It's annoying with numbers and dates, or dates especially. But okay, yeah, that's great.
Yeah. So that is string interpolation in action.
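The command read out above can be sketched as follows. The one-entry sample is a hand-written stand-in for NobelPrizes.json in the installment zip, so treat the exact field values as an assumption; the point is that the motivation value carries its own escaped quotes, which is where the quotes in the output come from.

```shell
# Hand-written one-entry stand-in for NobelPrizes.json (same shape as the real file).
# The motivation string starts and ends with an escaped double quote.
sample='{"prizes":[{"year":"2020","category":"physics","laureates":[
  {"firstname":"Andrea","surname":"Ghez",
   "motivation":"\"for the discovery of a supermassive compact object at the centre of our galaxy\""}]}]}'

# Explode prizes and laureates, select on the surname, then build the sentence
# with three placeholders.
sentence=$(printf '%s' "$sample" | jq -r '
  .prizes[] | .laureates[]?
  | select(.surname == "Ghez")
  | "\(.firstname) \(.surname) was awarded her prize for \(.motivation)"')

echo "$sentence"
```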
[14:58] Okay. So I'm actually in the show notes. I went into way more detail.
I broke it right down, didn't I?
Hang on. Sorry, I'm losing track of my own show notes here. Yeah.
Yeah, I don't know why I broke that down further in the show notes because that all made perfect sense when I said it in English.
Yeah, I've got to make sure that's not a leftover piece we're looking at.
Breaking the example down for a few filters. There's, yeah, I think you just broke it down more clearly, but I think we're getting the hang of it.
I think we're okay with that.
Exploring the Power of Arrays in JQ
[15:29] Okay. So believe it or not, that's kind of all there is to it.
Basically, backslash, open your roundy bracket, whatever it is you want to calculate, close your roundy bracket.
That's it. That's string interpolation. So that's not too bad.
Then we move on to building ourselves some arrays. So we've already seen...
I do think we might have something misarranged because you start talking about the dictionary, creating a, producing a dictionary.
Or was that still the rest of that explanation?
[16:00] Uh, no, that's still the explanation. That's showing, that's showing where we're putting the first name, surname and motivation from.
Now what you can see is that motivation has quote backslash quote for the discovery of backslash quote quote.
So there's where your two quotes are coming from. There we go. Okay.
This is perfect. I think you described it perfectly in audio, but now we have all of the context in the show notes, but we don't need to go through it again.
Got it. Okay. Now I'm caught up. So now we've done our interpolating strings, we get to build arrays.
[16:31] Yes. And I've already kind of been subtly showing you this.
So in some of our examples, you'll have seen me make an array of simple values by saying square bracket, you know, 42 comma 11, close square bracket.
And that's JSON syntax. And so you didn't call it out to me as a weird thing to see inside JQ because JQ is all about processing JSON.
And you'd be right not to call it out because it is very sensible.
But you can do more. The thing to go between the square brackets doesn't have to be a string, a number, a boolean.
It can be any JQ filter.
And then all of the answers become individual elements in the array.
So if the thing you stick inside the square brackets explodes something that has 50 elements, then your array will have 50 elements.
If you explode something, run it through a select and only five of them make it through the select, then your array will have just those five that survived the select.
[17:32] Okay, okay. So that is stupendously powerful.
Because you have been asking me a lot, if I explode something, how do I put it back together? And I have told you, yeah, avoid exploding it.
Well, the way you put it back together is you just put the square brackets before and after you explode it, and then that will reassemble it back in.
And then when you pipe it, you're back to having one array.
Okay. So if you need to explode something, just wrap it in square brackets, and then in the next pipe, it's an array again.
And you're back to it being one thing instead of however many pieces you made by exploding it.
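A small sketch of the explode-and-reassemble idea, using made-up number arrays rather than the Nobel data:

```shell
# Exploding gives three separate outputs, printed one per line with -c.
exploded=$(jq -n -c '[10, 20, 30] | .[] | . + 1')

# Wrapping the same filter in square brackets collects them into one array.
collected=$(jq -n -c '[10, 20, 30] | [.[] | . + 1]')

# Only the elements that survive the select end up in the new array.
filtered=$(jq -n -c '[1, 2, 3, 4, 5, 6] | [.[] | select(. % 2 == 0)]')

printf '%s\n' "$exploded" "$collected" "$filtered"
```

The first command prints 11, 21 and 31 as three independent values, while the second prints the single array [11,21,31], which the next filter in a pipeline would receive as one thing.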
That's kind of mind-bending, but I think I know what you mean.
Yeah, it's very powerful, though. And so we can use this to build any array we like. So basically anything can go in there.
So what do I have as my example in the show notes? Let's go back to our Nobel Prizes again.
So we know that we can explode the top-level array of prizes, dot prizes, open square bracket, close square bracket.
And we know that we can filter those exploded values with our select function.
So we can say every prize after 2020 by just saying dot prizes, open square bracket, close bracket, pipe it to select.
Inside the select, we have dot year, pipe to tonumber, and then greater than 2020. And that gives us all of those prizes.
Converting strings to numbers for proper comparison
[18:54] And just a reminder, tonumber, we talked about last time, and that just converts a string to a number, which is needed because our data set stores years as strings.
For reasons that make no sense.
[19:08] And if we didn't convert it to a number, when we tried to compare it to 2020, it would compare it alphabetically, as we learned in installment 157.
And sometimes that's okay.
And sometimes it's really not. And so it's a very bad idea to not make your numbers be numbers before doing math.
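A short sketch of how the comparison can go wrong without tonumber, using literal values instead of the real data. Two strings are compared character by character, and in jq's documented sort order any string sorts after any number, so an unconverted year string always counts as greater than 2020.

```shell
# String against number: jq orders by type first, and strings sort after numbers.
string_vs_number=$(jq -n '"1901" > 2020')          # true, which is not what we want

# String against string: compared alphabetically, so "100" sorts before "99".
string_vs_string=$(jq -n '"100" > "99"')           # false

# Converting first gives the numeric comparison.
converted=$(jq -n '("1901" | tonumber) > 2020')    # false

printf '%s %s %s\n' "$string_vs_number" "$string_vs_string" "$converted"
```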
So, tonumber, and then we can with great confidence do a greater than 2020.
So you put a link in the show notes where you explain that tonumber in installment 157. And I'm glad you did.
The reason I noticed that one, actually, that'll be two installments ago because we're in 159.
So tonumber was one ago.
And the explanation of the greater than sometimes being alphabetic and sometimes being numeric is two ago.
Okay, I got you. I did this before I saw that you said you reminded us that you told us about it.
I was like, what is that tonumber thing? And searching pbs.bartificer.net for tonumber, it doesn't find it. And yet it's in the show notes.
It's there. You are 100% right, but it doesn't find it. So I don't know what, search is getting worse these days.
That's interesting. Yeah. That's very interesting. Not good, but okay. But now your show notes will tell us where to go find the explanation. So that's good.
[20:25] Now that is a simple enough query, but if we run it, what does JQ give us?
It doesn't give us an array. It gives us lots and lots and lots of separate outputs. So if you look at that output, what you have is: open curly bracket to start a dictionary, some values, close curly bracket. No comma. There was no leading open square bracket at the top. So you have a dictionary, new line, a dictionary, new line, a dictionary. You have lots and lots of separate dictionaries.
So you don't have one dictionary and you don't have an array of dictionaries.
You have a pile of dictionaries nonsensically attached to each other.
[21:04] Correct, because actually from JQ's point of view, it gave you lots of answers.
It didn't give you an answer, it gives you lots of answers.
But what if you want an array? Maybe you need to process this with something else, or maybe you need to send this to a whole other web service or something that needs true JSON.
It can't handle these pieces of JSON, it wants an array.
Well, the great thing is we can just take exactly the query we had before, stick a square bracket at the very front and a square bracket at the very end, and now all of those answers are going to get combined into one array, and that one array is going to be the output of our call to jq. So now if you run it again, you'll see that the very first thing in the output is an open square bracket. You still see all of your dictionaries, but now they're all tabbed in because they're inside the array, and there's a nice comma at the end of every dictionary because they're all members of the array. And the very, very, very last thing is a closing square bracket.
So now your script is outputting one single array, not many, many, many, many, many disconnected dictionaries.
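The before and after can be sketched like this, again with a hand-written three-prize stand-in for NobelPrizes.json (the entries are hypothetical and trimmed to two keys):

```shell
# Hypothetical, trimmed stand-in for NobelPrizes.json; years are strings, as in the real file.
sample='{"prizes":[
  {"year":"2019","category":"physics"},
  {"year":"2021","category":"physics"},
  {"year":"2022","category":"peace"}
]}'

# Without the outer brackets: one separate dictionary per matching prize.
separate=$(printf '%s' "$sample" | jq -c '.prizes[] | select((.year | tonumber) > 2020)')

# With the outer brackets: the same dictionaries, collected into a single array.
as_array=$(printf '%s' "$sample" | jq -c '[.prizes[] | select((.year | tonumber) > 2020)]')

printf '%s\n' "$separate"
printf '%s\n' "$as_array"
```

The first output is two unconnected lines of JSON, while the second is one valid JSON array that another tool or web service could parse in one go.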
I was afraid you were going to tell me I had to do string interpolation and throw those commas in between.
But just throwing the square brackets says, okay, let's just make an array of these dictionaries.
Cool. Yeah, exactly. So that's how you build arrays. You just put square brackets.
I want this to be an array, square brackets. Ta-da, you have an array.
[22:25] Now, something that I'm going to slip in here, because I wasn't really sure where this fit in the show notes, because it both builds an array and it builds a string.
So given that we've just built strings and we've just built arrays, let's do it here.
It is very common to need to go between strings and arrays.
We've seen this in JavaScript, where we literally have functions called split and join, where we say split, you give it a regular expression, and then it takes a string and makes an array.
So if you say split on comma space, then it will take a string with commas and spaces.
And for everything that isn't a comma or a space, you get an element of the array, element of the array, element of the array, and the commas and the spaces evaporate.
They're just separators, right? They're gone. But you have an array.
And join does the opposite. You tell it what you want to use to connect them together.
And it takes an array and then it's first element, that separator, second element, that separator again. And so it's injecting your little string to join your array into one big string. And the really nice thing is, in jq, the two functions that do exactly the same thing are called split and join. Yay.
[23:31] So, split requires a string as an input, which is completely sensible because it takes a string and splits it apart. So of course you have to give it a string as an input.
Its first argument is a string telling it what characters it should split on.
So if you give it as an input the string 1, 2, 3, and you pipe that to split with the single argument the string comma, you will get back an array, one, two, three.
Because it will have split it on the comma. Hang on. Hold up. Hold up. Hold up.
We started with a string, and I don't remember you saying it was going to turn it into an array, but I guess it just does.
[24:14] That's Split's job, right? Split's job is to take a string. When you tell it what to split on, you get an array. Okay. One value becomes many.
Okay. And join is the inverse. You give it an array and it smushes them together to make one big string and you tell it what to use as the glue.
Okay. We'll go on to a better description.
So if we take the exact opposite, we give the actual JSON for an array into JQ and we say join it with a comma.
Then we get back the string 1, 2, 3.
Okay. So the join is going array to string and the split is going string to array. They're inverses of each other. It's very clean. Yeah.
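The two directions can be sketched with the same one, two, three example. Note that the pieces split produces are themselves strings, not numbers.

```shell
# String in, array out: split on a literal comma.
parts=$(jq -n -c '"1,2,3" | split(",")')

# Array in, string out: join with a comma as the glue.
joined=$(jq -n -r '["1","2","3"] | join(",")')

# Doing one after the other gets back to the starting string.
round_trip=$(jq -n -r '"1,2,3" | split(",") | join(",")')

printf '%s\n' "$parts" "$joined" "$round_trip"
```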
It is very clean. There's a tiny amount of uncleanliness, but it's not massively unclean.
If you give split one argument, it is going to say, OK, that's a string.
I am going to just not treat this as a regular expression. It's a string.
I'm looking for exactly a comma.
[25:12] Some people are sloppy. Some people put comma space, comma space space, comma no space.
You know, we humans do these kind of things. So you may often want to split on a regular expression.
And to do that, you use two arguments. The second argument will be interpreted as flags for the regular expression, but even if you don't need any flags, you still have to give the second argument because otherwise JQ doesn't know you mean regular expression.
[25:42] So two arguments means this is a regular expression, even if you have no need of a flag.
Okay, I'm lost at how we're telling it we have a regular expression.
[25:52] By simply saying semicolon, second argument. So once you give a second argument, the first one becomes a regular expression.
That's the rule. One argument, I'm a string. Two arguments, I'm a regular expression.
Okay. And can you explain the regular expression? It says split, open roundy, quote, comma, and then square bracket with a space in it and a question mark. That means one or more spaces?
No. It means zero or one spaces. So the question mark is the zero or one operator in regular expressions. Okay.
Arguably, plus would have been more powerful, which is one.
Sorry, star would have been more powerful, which is zero or more, which would have allowed for even sloppier humans. But we don't like them.
Yeah. Yeah. Yeah, actually, yeah, I wish I'd done that with a star now.
But anyway, it is a valid regular expression.
That is the key point. The key point is, second argument means I'm a regular expression, not a plain old string.
That is the takeaway, which is why I popped it in bold. And it's a, this second argument, to repeat it one more time for clarity, is simply, quote, quote.
[27:00] In this case, it's quote, quote, because, yeah, we don't need a flag.
If we needed to be case-insensitive, we could have put I in there. Quote I, quote...
Yeah. Okay. But in this case, all we're doing that second argument for is to tell JQ, hey, that's a regular expression to your left. Okay.
Precisely. Precisely. It's a little inelegant, like I say, a little bit sloppy, but... I will forget this, just so you know. That's why I put it in bold.
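To make the one-argument versus two-argument rule concrete, here is a small sketch, assuming jq 1.5 or later. The sloppy input string is made up for illustration.

```shell
# A hypothetical sloppy input: one comma followed by a space, one without.
sloppy='"1, 2,3"'

# One argument: the comma is treated as a plain string, so the space survives.
literal=$(echo "$sloppy" | jq -c 'split(",")')
echo "$literal"   # ["1"," 2","3"]

# Two arguments: the first is now a regular expression (comma, then zero or
# one spaces) and the second is an empty string of flags.
regex=$(echo "$sloppy" | jq -c 'split(",[ ]?"; "")')
echo "$regex"     # ["1","2","3"]
```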
Introduction to building dictionaries in JQ
[27:30] Because I will too, and then I'll be scrolling through the show notes, and I always look for the bits I put in bold, because that's like a message to me. Bart, you will forget this.
Yeah, or, you know, both of us will forget this.
So, so far, we're not doing too bad. We have string interpolation to build strings.
We build arrays, which is our square brackets. And now we've figured out that we can go from arrays to strings with split and join in a very JavaScript-like way.
And it's not just JavaScript. Almost every language has split and join.
So now let's move on to building our dictionaries.
And again, JQ is a language for processing JSON.
So of course jq is stealing most of its syntax from JSON. So if we want to make a dictionary, we open a curly bracket, we give a key name of our choosing, we put a colon, and then we put a value of our choosing, comma, and do it again as often as we like. Again, just like with arrays, that value can be the result of running a filter.
So we can have a jq statement as the value in our dictionary.
So you may have seen me do it with just simple names and values.
Well, it could be a name and a piece of jq, and then it will get calculated, and that will be what goes into our dictionary.
So let's sort of do another worked example.
[28:55] Again, we're going back to our Nobel Prizes data set, because I'm very fond of that data set, even if it is messy.
So that data set... Actually, especially because it's messy.
[29:06] That's actually, yeah, actually, that's a really good point.
It's helpful that it's representative of that sort of thing.
Reality. Yeah, that happens a lot.
[29:15] So, the data about Andrea Ghez's Nobel Prize is all in there, but it's in a shape I don't like.
The shape of data that I think there should be about Dr. Ghez's Nobel Prize is just very simple. I want a year.
That's a number. I want something called prize that is physics in Andrea's case. I want the name, that's her full name.
And I want the citation for why she got the prize without any of these sloppy extra quotation marks.
So basically, I would like us to build the dictionary that I think it should have been in the first place.
So let's work our way up to that. So the first thing to do is, we already know that we can explode the prizes, pipe that to select.
If we go any dot laureates with surname double equals Ghez,
we will end up with Andrea's dictionary without it being changed by us.
So what we get back is the full dictionary for the entire prize that Dr. Ghez was one of two winners, one of three winners on.
Oh, Roger Penrose. How did I not notice that before?
Huh. Wow. Wow. Okay. So we see that there's a year 2020 physics and then the laureates array is in there with one for Roger Penrose, one for a Reinhard Genzel, cool name, and then one for Andrea.
[30:42] So of the things I want, year and prize are right there for the taking, right?
So I can already see that if I say open curly bracket year colon dot year comma prize colon dot category, I've got two out of the four categories with very little work.
Right? So, if we take that existing JQ to give us Andrea's full dictionary, and then we pipe that to our new filter, open curly bracket, year colon dot year comma prize colon dot category, then we get
[31:17] this dictionary out now, which is year 2020, prize physics, which is two out of four.
No, it isn't. It's 1.5 out of four because 2020 is a string.
But hey, we're making progress.
So how do we fix the year? Well, that's an easy one. So instead of saying year colon dot year, we just say year colon dot... Wait, can I guess, is it tonumber?
Ding, ding, ding. Okay, goody. So we just pipe it to tonumber and then we just go comma prize colon dot category.
Great. Now we're halfway. Year 2020 prize physics.
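As a rough sketch of where things stand at this point: the JSON below is a hypothetical, cut-down stand-in for NobelPrizes.json with a single prize in it, and the parentheses around the year value are added only for readability.

```shell
# Hypothetical one-prize stand-in for NobelPrizes.json (not the real file).
sample='{"prizes":[{"year":"2020","category":"physics","laureates":[{"id":"990","firstname":"Andrea","surname":"Ghez","share":"4"}]}]}'

# Explode the prizes, keep the one with a laureate named Ghez, then build
# a new dictionary with the year converted from a string to a number.
year_prize=$(printf '%s\n' "$sample" | jq -c '
  .prizes[]
  | select(any(.laureates[]; .surname == "Ghez"))
  | {year: (.year | tonumber), prize: .category}
')
echo "$year_prize"   # {"year":2020,"prize":"physics"}
```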
Now we need to get a little, now we need to bring in our knowledge from string interpolation to build Andrea's name.
But the problem is we can't just get her name immediately from the top level dictionary for the whole prize because her name is inside her laureates entry inside the laureates array. So we got to go digging deeper.
Now we learned in previous installments that if we take dot laureates, so we explode it, and then we pipe that to select dot surname double equals Ghez, then we get the one dictionary that is actually Andrea's dictionary.
And so I've put the full command in the show notes so that you can make the dictionary yourself, but you can see it's ID 990, first name Andrea, surname Ghez, motivation with the silly extra quotes, share four.
[32:35] So we can see straight away that the two pieces of info we want are first name and surname, so string interpolation tells us we can get those with quote backslash roundy bracket dot first name close it space backslash roundy bracket dot surname close it, close the quote.
We now have to say that the string interpolation, as well as all of the explodey stuff, all of that goes into name.
So our previous example ended with dot category, close the curly bracket. Now we're saying comma, name, colon: make me another key called name.
And then all of that logic goes in there before we close our curly.
Without any quotes around it.
So it's name, colon, space, dot laureates, square brackets, pipe, blah, blah, blah.
There's not a quote around it. There's no roundy, square, or any other bracket around all that. It just splats right in there.
[33:32] Splats right in there, because it ends with either a comma for the next dictionary entry or the squirrely bracket for the dictionary being done. Okay, so until you meet either a comma for a next key or the squirrely bracket for we're out of here, you can just keep adding them.
So yeah, there it is. So name colon dot laureates, open square bracket, close it, pipe, select, pipe, our string interpolation. So now we're on three out of four.
We're doing pretty well here, right? We have our year, prize, and name. So the final step is we want the motivation. And the logic is almost the same as for the name, but instead of doing the whole string interpolation, we can just say pipe it to dot motivation.
[34:10] Hey, that's not looking too bad, but now we run into the problem with those annoying extra quotation marks.
And this gives me an excuse to give you a preview of what we're going to be doing in the next installment, which is learning about all the functions for manipulating our data.
And one of the things you very often have to do because a lot of APIs prefix or postfix answers with things like debug, colon, space, blah, blah, blah, or a timestamp.
Like all sorts of things get prefixed and postfixed to things.
And in this case, quotation marks.
And so if you want to remove something from the front, it's called left trimming because you're trimming from the left of the string.
And if you want to trim something from the right, it's called right trimming.
And that tells you why these functions have really odd names.
ltrimstr, left trim string, or rtrimstr, right trim string.
That's how you remember them. That's what they stand for.
And what you give is the argument. I'm singing the same song over and over again, but in Excel, it's left and right.
Oh, okay. And you tell it the number of characters you want.
You don't tell it which characters, but you tell it how many of them.
So I want the left three or the right four.
[35:27] So the nice thing with the way it works here in JQ is that you tell it what characters you want to remove, and it won't care if they're not present.
Nice. So if there is a quote, take it off.
And if there isn't, don't give me an error. Don't get cranky.
Just carry on, which is nice. I like that.
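A quick sketch of that forgiving behaviour, assuming jq is on the path. Both input strings are invented for illustration.

```shell
# The leading quotation mark is present, so ltrimstr removes it.
with_quote=$(jq -n -r '"\"for services to pancakes" | ltrimstr("\"")')
echo "$with_quote"      # for services to pancakes

# No leading quotation mark: the string passes through unchanged, no error.
without_quote=$(jq -n -r '"plain" | ltrimstr("\"")')
echo "$without_quote"   # plain
```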
So we basically pipe our motivation to... there's just a typo there.
There is no dot. You pipe the motivation straight to the function ltrimstr, and we have to give it the string quote, which means it's quote, backslash, quote, quote, because we have to quote the bloody things inside our strings. And that takes care of half of our problem. Then we pipe it to rtrimstr and we do the same thing again, so we take away the quote from the right, and after all of that our motivation is nice and clean.
And so if we copy and paste that final key into our dictionary we're constructing, then we finally get our four out of four year prize name citation without any messy extra square brackets or extra quotation marks.
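Putting the four keys together might look something like the sketch below. The JSON is again a hypothetical two-laureate stand-in for NobelPrizes.json, and the parentheses around each value are added for readability rather than because the episode uses them.

```shell
# Hypothetical cut-down stand-in for NobelPrizes.json (not the real file).
sample=$(cat <<'EOF'
{"prizes":[{"year":"2020","category":"physics","laureates":[
  {"firstname":"Roger","surname":"Penrose","motivation":"\"for work on black holes\""},
  {"id":"990","firstname":"Andrea","surname":"Ghez","share":"4",
   "motivation":"\"for the discovery of a supermassive compact object at the centre of our galaxy\""}
]}]}
EOF
)

ghez=$(printf '%s\n' "$sample" | jq -c '
  .prizes[]
  | select(any(.laureates[]; .surname == "Ghez"))
  | {
      year: (.year | tonumber),
      prize: .category,
      # Dig into the laureates array and interpolate the two name parts.
      name: (.laureates[] | select(.surname == "Ghez") | "\(.firstname) \(.surname)"),
      # Same digging, then trim the extra quotation mark off each end.
      citation: (.laureates[] | select(.surname == "Ghez") | .motivation
                 | ltrimstr("\"") | rtrimstr("\""))
    }
')
echo "$ghez"
```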
I do like predicting where I'm going to get stuck. When I look back at ltrimstr, open roundy bracket, quote, backslash, quote, quote, I'm going to think we were removing the backslash.
That's what it looks like to me. But backslash means to escape.
So I'm escaping the fact that I'm looking for a quote, but the whole thing has to be a string. So it's inside of quotes.
Yeah, I know. I hate, yeah.
[36:56] Prediction number seven of what I'm going to get wrong later.
I hate having to backslash things. It always breaks my head. Yeah.
So we have now done actually the vast majority of what I want to talk about today, but I do want to teach you one more cool thing for two reasons.
Using Regular Expressions to Build Dictionaries
[37:13] A, because it involves regular expressions, which I love, and B, because it is very much related to building dictionaries.
So we built a dictionary explicitly.
We said, I want a key called year, and I want you to go fetch the value from here.
I want a key called name. I want you to go build the value out of these two pieces.
But another way that you very often end up with a dictionary is that you have a giant big string which contains multiple pieces of information you care about.
So the string could be a timestamp, in which case it contains a year, a month, a day, a number of hours, a number of minutes, a number of seconds.
If it's ISO 8601, a number of milliseconds even, right?
Or, I mean, it could be any structured piece of data that contains multiple things.
So you can write a regular expression that matches a date, right?
[38:08] But one of the things you can do with regular expressions is so-called capture groups, where you put roundy brackets inside your regular expression that basically says, this little sub piece of the pattern, this is a capture group.
I want you to remember it separately.
And in the bad old days of all the regular expressions that we've come across so far in Taming the Terminal and here, they get numbered.
They become the first capture group. Okay. Okay. I haven't figured out what a capture group is yet.
It's a piece of a regular expression. So if you have, let's not do dates because then American and European gets messed up. Let's do time.
Okay. So the pattern for a time is one or two digits followed by a colon followed by two digits.
[39:02] Okay. So you can write all of that as a regular expression. So I would write that as open square bracket, zero to nine, close square bracket, open a curly bracket, one comma two, close my curly bracket.
So that means one or two digits, colon, open square bracket, zero to nine, close square bracket, open curly, two, close curly, exactly two digits.
Right. So that's my full regular expression. And that captures all of the time.
Okay. Digits, colon, digits.
The hour is a sub pattern within my pattern.
And the minutes are a sub pattern within my pattern.
They're called capture groups. So the name for a pattern within a pattern is a capture group. Okay. All right.
In the bad old days and everything I've ever taught you so far in anything we've ever done together, those capture groups are made by saying, open a round bracket, whatever you want, close a round bracket, and we don't get to name them.
[40:05] The first round bracket is capture group one. The second round bracket is capture group two, whether we like it or not.
And that's really brittle because if you change your pattern and you say, oh, I need to capture a third thing.
Well, if that is between your first two, all of your code is now wrong because what was two has become three.
And when you're debugging your code, you're seeing one, two.
They're meaningless magic numbers.
[40:36] A fantastic thing that was added to Perl-compatible regular expressions relatively recently is called named capture groups.
So instead of them becoming 1, 2, you say at the point in time you create them, I shall name you blah.
And then in your code, you can refer to them by name.
In JQ, what happens is you take a string, you pipe it to the function called capture, you give it the regular expression with the capture groups, and it will make a dictionary for you.
So it will give you all of the answers and the keys in the dictionary will be the names you chose for your capture groups.
And the values will be those parts of the regular expression.
Okay. So that lets you pull data from a string. So you basically say, this string represents a time.
Creating Named Capture Groups in Regular Expressions
[41:26] I want the hours and the minutes, and I want them as a dictionary with the key hours and the key minutes with the two relevant values.
And you could do all of that with a single regular expression. Huh. Okay.
I love it. So powerful. So let's look at an example.
[41:42] So to make a named capture group, you open your roundy bracket as you normally would to make any capture group in any regular expression you know in any context. And instead of just opening the roundy bracket, you say question mark, open angle bracket, your name, close angle bracket. So the question mark angle brackets is like a label. You're basically saying, I dub thee whatever. Then you carry on your regular expression, and when you're done with that sub piece of the pattern, close the roundy bracket, so we stop capturing. So that gives us one named capture group, and we would then rinse and repeat for as many capture groups as we would like in our dictionary. So if we want a regular expression to capture time, it's going to be open roundy bracket, question mark, inside angle brackets hours. So we're basically saying everything until the closing roundy bracket is going to be the pattern for an hour, which is 0 to 9, 1 comma 2: one or two 0 to 9s. We've then closed our capture group, so we are no longer capturing the hours. Colon, because that is part of the big picture pattern, still part of the regular expression.
[42:51] Open another capture group, question mark, angle bracket, minutes, close our angle bracket, and then the pattern 0 to 9, 2: I want two of those. Close that capture group. Colon. Another capture group, seconds, 0 to 9, 2. So all of that together is the regular expression with three named capture groups. Okay.
[43:14] That makes perfect sense, but I don't know what we use it for yet.
Okay, so now let us imagine we have a string, 9 colon 0 0 colon 0 0.
And it doesn't matter where that came from. We're going to pipe that into JQ.
We're going to say, shove that into the capture function with that horrible big regular expression I read out to you.
And what will come out of that JQ statement is a dictionary, hours 9, minutes 0, seconds 0.
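Written out, the command read aloud above would look roughly like this sketch, assuming jq 1.5 or later. Strictly speaking the minutes and seconds come back as the two-character strings "00", exactly as they were matched.

```shell
# String in, regular expression with three named capture groups as the
# argument, dictionary of strings out.
time_parts=$(echo '"9:00:00"' | jq -c \
  'capture("(?<hours>[0-9]{1,2}):(?<minutes>[0-9]{2}):(?<seconds>[0-9]{2})")')
echo "$time_parts"   # {"hours":"9","minutes":"00","seconds":"00"}
```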
And they're strings, not numbers.
They're strings, yes. We could pipe those. We could absolutely run those through tonumber. We could absolutely do that, yes.
[43:53] But yeah, so regular expressions work on strings. And key value pairs, they have to be strings, don't they?
Or no, they don't. In a dictionary, they didn't have to be.
Correct. But the regular expression makes strings. Because the regular expression is a string matching machine, right?
Takes a string, finds the pieces within a string. So those pieces are strings.
So yeah, you end up with a dictionary of strings.
And if you need to do more, you can then process that dictionary.
You could pipe that to another jq command to convert it with tonumber or whatever you need to do.
But basically the capture command is string plus regular expression to dictionary. Okay.
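One possible way to do that follow-up conversion is sketched below. Using map_values to apply tonumber to every value is an assumption on my part rather than something from the episode, and the input time is a made-up one.

```shell
# Capture the pieces as strings, then convert every value to a number.
numeric=$(echo '"12:34:56"' | jq -c '
  capture("(?<hours>[0-9]{1,2}):(?<minutes>[0-9]{2}):(?<seconds>[0-9]{2})")
  | map_values(tonumber)
')
echo "$numeric"   # {"hours":12,"minutes":34,"seconds":56}
```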
So all this we've just learned about the question mark angle bracket hours, close angle bracket to make it a named capture group.
That's all regular expressions. That's nothing to do with JQ. Yes.
Correct. We're just using it with JQ now. We're using it in anger.
Yes, exactly. So the capture groups are PCRE, Perl Compatible Regular Expressions.
The capture function is JQ.
Explanation of the capture function and its input requirements
[45:02] Oh. So the capture function takes as an input a string and as one argument a regular expression.
Okay. Okay. Which all ends up in quotes.
Yes, because the capture function's one argument is a string that is a regular expression.
JQ Show Notes: Copying, Pasting, and Laughing at Typos
[45:23] Okay. Count your quote marks on this. It looks right, but wow.
Well, it is right, because I ran it, because otherwise it wouldn't have been right.
The secret, by the way, to my show notes generally not being terrible is because I have a terminal open all the time, and I'm constantly copying and pasting and laughing at myself for the amount of silly typos I make while trying to write JQ.
And mind you, the same is true in Bash and everything else, but it's extra true in JQ, I have to say.
[45:51] So at this stage, we've actually, we've learned a lot, but there's one more piece that I think is very much related here, and I think this is the perfect time to throw it in, is that it is very normal with JQ.
So you're starting off with a data set in JSON. It may have come by calling curl on some sort of web API. It may have come from a file, but you have some JSON.
You're processing it and you're sending it somewhere, right?
JQ is a terminal command. So it's that sort of Taming the Terminal thing of do one thing and do it well.
But it's going to be in a chain, right? Curl to fetch it from the web, JQ to process it, and then send it to a file or send it to something else.
And the something else may be quite picky about the format.
So if the something else is a CSV file, well, you need to write CSV.
Now, you could absolutely do that with string interpolation.
You could find all of the rules for CSV formatting and implement them manually.
And that would mean you'd need to escape your quotation marks in very weird and wonderful ways.
You could do it. You wouldn't enjoy it, but you could do it. And you could also, similarly, with a bit of jiggery pokery, make it produce JSON format, or you could make it produce plain text.
[47:06] I'm going to clarify real quickly, just in case, I don't think we've said it: CSV is comma separated values. It's a standard input format for spreadsheets. Indeed, and yes, we're going to get to CSV shortly. The other thing that it can do, which is a related format that's not so popular these days but was once very popular, is TSV. Do you remember TSV, tab separated? I've never heard of TSV, but I know you can do tab separated. Yeah, Excel will happily ingest both. Excel will give both of those as an option, CSV and TSV. So another thing that you often end up doing with the output from jq is building a URL with it. So we know that in a URL you can put a question mark and then you can start giving data to it, but that data needs to be in... Query strings. Thank you, yes. So at the end of the URL you can stick on query strings, and they have to be encoded, where every special character becomes percent and then two hexadecimal digits.
Again, you could write a whole bunch of JQ syntax.
You'd have to use a substitute command to manually fix all of those characters.
[48:25] That would be painful. Another thing you very often need to do is encode stuff in good old base 64 encoding.
The amount of APIs that want base 64 encoded data is many.
There are a lot of things that talk JSON, and there are a lot of things that talk base 64 encoded. I've never even heard of that. The SMTP protocol, for example.
[48:48] If you ever need to send attachments from a script, you won't like it, But you'll get to know Base64 because that's how the email protocol works.
Anyway, all of this is what I'm getting around to saying is there are many, many real world reasons where you want to take the output from JQ and either replace all the special characters using some sort of common scheme or take an entire piece of data, be it a dictionary or an array, and format it in a well-known data format.
And instead of you doing all the heavy lifting, JQ can do it all for you.
You basically just tell it, I would like this and that format, please.
So the syntax for doing that is the at operator, followed by the name of a format that JQ knows about.
So you say at CSV and you will get CSV formatted data. You say at URI and you will get that percent 20 carry-on, the weird stuff.
The other one that's really powerful is in HTML, you're supposed to say ampersand some sort of silly abbreviation semicolon.
So the HTML for an actual ampersand is ampersand AMP semicolon.
[50:04] Let's give an easier one that people would have seen: ampersand 20, I think it is, is a space.
That's percent 20. That's in URI. Percent 20. Percent 20.
Yeah. So one you see a lot in HTML is ampersand, Q-U-O-T, semicolon, which is a quotation mark.
[50:23] So the at symbol can handle this for us and it has, I have a table below of everything it can do.
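As a small taste of three of those formats used as plain filters, here is a sketch assuming jq is on the path. The input strings are invented for illustration.

```shell
# URI escaping: space becomes %20 and ampersand becomes %26.
uri=$(jq -n -r '"pancakes & waffles" | @uri')
echo "$uri"    # pancakes%20%26%20waffles

# HTML escaping: ampersand and quotation marks become entities.
html=$(jq -n -r '"pancakes & \"waffles\"" | @html')
echo "$html"   # pancakes &amp; &quot;waffles&quot;

# Base64 encoding of a whole string.
b64=$(jq -n -r '"pancakes" | @base64')
echo "$b64"    # cGFuY2FrZXM=
```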
And the way it works is either you just make it a whole filter, right?
So you just, in your stream of pipes, you just say pipe, and then you give it the name of a format, and it will take all of its input and do whatever you say to it. And that's just stick it in the pipeline. That's cool. The other thing you can do is, when you're doing string interpolation, you can say, every substitution you make into this string, I want you to apply this escaping mechanism to it. And you do that by simply putting the at sign in front of your string interpolation. So if you're building a URL, you have backslash, open roundy bracket, however you find the value, close roundy bracket.
If at the very, very start of the string you have at URI, say, then the URI escaping gets automatically applied to every single one of your placeholders.
Okay, hold on here. So I'm confused because you said pipe it to at CSV, but now you're saying at URI would be at the beginning.
[51:34] I'm saying there are two ways of doing it. One of the two ways is that you just apply it as a whole filter, which means it applies to the entire output, right?
Okay. So if you put it in your chain, you're applying it to everything, right? It's just another part of your chain.
But if you're building a string, you don't want the bit of the string that you're explicitly typing to get messed up.
You only want it to apply to the inserts, right?
You have a string with placeholders. I have two worked examples to explain both uses.
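A minimal sketch of the second use, the format in front of a string with a placeholder. The search URL is only an illustrative target, and the variable name q passed in with --arg is a hypothetical choice.

```shell
# @uri in front of the string: only the interpolated value is escaped,
# the typed part of the URL (including ?q=) is left alone.
url=$(jq -n -r --arg q 'pancakes & waffles' \
  '@uri "https://www.google.com/search?q=\($q)"')
echo "$url"   # https://www.google.com/search?q=pancakes%20%26%20waffles
```

Piping the finished string to at URI instead would also escape the colon, slashes, and question mark that were typed on purpose, which is why the two uses exist.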
Okay. So I may be over-explaining this to the point where...
No, I'm confusing you instead of helping. I'm getting clues. You're getting clues, okay, good. So there are two completely different ways of using this one operator, I guess, is the takeaway for now, and I'm going to demonstrate both of them to you. So the first thing we're going to do is we're going to do some CSV. Hold on, you have something in bold you didn't tell us. Yeah, that's because I'm about to say it in about two sentences. Okay. In English it works better this way. In text, it works better the other way.
So we have, from a previous installment, menu.json, which was an array of dictionaries for pancakes, waffles, and a few other things, of course.
[52:49] Which contained a name of whatever it is on our menu, a price for whatever was in our menu, and how many of them we had in stock in our imagined restaurant. So that was menu.json.
And so it's just an array of dictionaries.
Let us imagine that we have a need for having our menu in CSV format, that is not unreasonable.
And we can use the at CSV filter to take our array of inputs and turn them into what we need.
Now, the at CSV filter does things one line at a time.
So if we would like it to produce a whole file, we need to produce multiple outputs from a jq command.
[53:30] The other thing to bear in mind is that jq is... JSON is its native language.
Jq likes to give you JSON unless you tell it otherwise.
So if you take a valid CSV string and you turn that into JSON, what you get is broken CSV.
Because it gets wrapped with superfluous quotation marks.
Because in JSON strings are wrapped in quotation marks. But CSV, that's wrong.
[53:57] So you have to tell JQ to stop doing that. And as we learned many installments ago, the minus minus raw minus output flag tells JQ not to do its JSON thing.
Just give me the raw output.
But we don't have to do all of that typing every time. We're going to say minus r. So if you're going to use at CSV, you also should use minus r, because it doesn't make sense to say give me CSV in JSON. Okay. So that's a little subtlety there. So if we want to make a line of CSV, we take any array we like and we pipe it to at CSV and we will get one line of CSV. So could we put at CSV at the beginning, before the array? No, because, okay, so in jq you have filter, pipe, filter. So the output of one filter is the input to the next. That makes sense. So at CSV needs as its input an array, okay, and it will turn that array into fields on one line of CSV. Okay, that makes sense.
[55:06] So if we would like to output, say, a header row at the top of our CSV file, we would say open square bracket name, price, stock, close square bracket, which makes an array containing three elements, name, price, and stock.
And we then send that with the pipe into the next filter in the chain, which is at CSV.
So the input is an array. The output will be that array in valid CSV format.
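Here is a sketch of that header row, run once without and once with the raw output flag, to show why minus r matters. It uses minus n so that no input file is needed.

```shell
# Without -r the CSV line is itself wrapped and escaped as a JSON string.
header_json=$(jq -n '["name","price","stock"] | @csv')
echo "$header_json"   # "\"name\",\"price\",\"stock\""

# With -r we get the line of CSV as-is.
header_raw=$(jq -n -r '["name","price","stock"] | @csv')
echo "$header_raw"    # "name","price","stock"
```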
Exploding the menu array for CSV conversion
[55:34] That make sense? Okay. Now, our menu.json contains an array of dictionaries.
We need that array of dictionaries to become an array of arrays for use with at CSV.
So if we say dot open square bracket, close square bracket, we are exploding the menu.
Right? So the menu is an array. We're exploding it. So the next thing in the pipe happens once for every single line, every single element in our menu.
What are we doing for every element in our menu? We're saying make me an array.
.name, .price, .stock. And then we're taking that and we're sending it to at CSV.
So that will happen three times. Once for pancakes, once for waffles, once for whatever else I put on the menu.
I don't remember what else I put on the menu. Hot dogs.
I put hot dogs on the menu for a change.
So that means that because we exploded, the middle thing happens three times and the third thing happens three times.
So the end result is three lines of CSV.
One for each explosion.
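The data rows on their own could be sketched like this. The menu below is a hypothetical stand-in for menu.json with illustrative prices and stock levels, not the file from the show notes.

```shell
# Hypothetical stand-in for menu.json: an array of three dictionaries.
menu='[{"name":"hotdogs","price":5.99,"stock":143},{"name":"pancakes","price":3.1,"stock":43},{"name":"waffles","price":7.5,"stock":14}]'

# Explode the array, build a small array per item in a fixed order,
# and turn each of those arrays into one line of CSV.
rows=$(printf '%s\n' "$menu" | jq -r '.[] | [.name, .price, .stock] | @csv')
echo "$rows"
# "hotdogs",5.99,143
# "pancakes",3.1,43
# "waffles",7.5,14
```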
[56:54] Oh. Yeah, there's a lot going on here, isn't there?
Yeah, so I was with you up to, we did our JQ to create the array, name price stock, pipe it to CSV. So that gives us our header row.
But then a simple comma will keep adding to this array?
Okay, so it's going to just print one output, which is one line, right?
Okay, we aren't building an array here. We're just spitting stuff out on the screen right now.
Right, yeah, so this is just the filter, right? So I haven't wrapped it in the jq command. This is just the jq syntax. This is just the filter.
No, I'm looking at the jq syntax. It says jq minus r. Did I jump ahead? Okay, okay.
You did slightly, but only slightly. Sorry, okay, backing up.
So backing up, we have two pieces of jq here. The first one gives us the header row only.
The second one gives us one row for every element in our menu.
So together those two things are everything we need: header row, three data rows. So how do we do all of that together? Before you go on, I was still trying to figure out the name, price, stock, and studying that one when you explained the second one. It says dot, open square bracket, close square bracket. What's that? Okay, so our menu has at its top level an array. So dot is an array that is our menu.
[58:23] How do we know it's our menu? Oh, we aren't in the JQ yet, so we haven't told it what it is we're going to stuff into it.
[58:32] We're working with our menu.json file, so I guess I should have copied and pasted menu.json into the show notes again.
So that you can see what we're working with. What does dot open close square bracket do without... Okay, so dot means the thing we're processing, and open and close square bracket means explode it. But we haven't told it...
We normally tell it what to explode, like dot laureates, open, close, square bracket. We haven't told it what to explode.
We're saying dot, so explode the whole current thing. Okay.
Sorry, I was trying to simplify this by not complicating it by showing you the file as well.
Maybe I've made it more complicated by trying to make it simple.
No, I think it's back to me not remembering what it means when you just say dot and you don't tell it what part of the thing.
I'm used to us exploding part of what the input file is.
So we've had the input file of NobelPrizes.json, but we never explode NobelPrizes.json.
We've been exploding dot laureates or dot prizes.
Correct. Because at the very, very top level of that file is a dictionary.
[59:40] Oh, okay. And in this case, this is an array.
There we are. Okay. So we're literally exploding the whole thing. Okay.
I suspect you told us that four installments ago, but it was gone.
Okay. So we have dot open, close, square brackets is exploding everything that's the input.
And then we're going to pipe that to pulling name, price, and stock.
Into a new array, which exists for... We're making an array that exists really briefly because at csv needs an array.
[1:00:15] Okay. Right. Oh, because each piece of that is a dictionary.
Right. We have a... No, we have to get the... I think we're getting a delay here, but we've got an array that has a series of dictionaries in it, name, price, stock for each one, but we need that to be an array in order to be the input to the at CSV filter.
Exactly. Okay. And we have to get the order right because otherwise our data's a mess.
So that's why we're explicitly saying open square bracket dot name comma dot price comma dot stock.
You can't go wrong, right? We're taking the dictionary, we're pulling the pieces out in the order, which is the same order as our header row, which is very good of us because otherwise we're talking rubbish.
So we have two small pieces of JQ that do the two things we want.
Print me a header, print me every data row.
So if we combine all of that together into one JQ command, we say JQ space minus r because we want the raw CSV.
And then we have as our one giant big filter the entire thing I gave at the start, comma the entire second thing.
And I've had to wrap them in brackets because otherwise the pipes don't know where to go.
[1:01:28] Roundy brackets. So we have brackets the first thing to print the header row, comma. I want you to be specific. Roundy brackets.
Yes, exactly. We're grouping together: do this thing, comma, do this other thing.
So the first this thing is make our header row, and the second this thing is make all of our data rows.
Okay. And then we just use the good old-fashioned terminal arrow, greater than sign, menu.csv.
[1:01:56] And now our JSON file has magically become a very pretty CSV file.
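Put together, the combined command might look something like the sketch below. The menu data is a made-up stand-in for the real file, and the prices were picked only to keep the output unambiguous.

```shell
# Hypothetical stand-in for the menu JSON file.
menu='[{"name":"hotdogs","price":5.5,"stock":143},{"name":"pancakes","price":3.25,"stock":43},{"name":"waffles","price":7.5,"stock":14}]'

# First group: the header row. Second group: one row per menu item.
# Each group is wrapped in roundy brackets so its pipes stay inside it.
csv=$(printf '%s' "$menu" | jq -r '(["name", "price", "stock"] | @csv), (.[] | [.name, .price, .stock] | @csv)')
printf '%s\n' "$csv"
```

Note that jq wraps string fields in double quotes, so the header comes out as "name","price","stock" and the first row as "hotdogs",5.5,143. That is still valid CSV. Adding a greater than sign and a file name to the jq command would send the same output to a file.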
You know, you can open it in Excel, but you can also just print it in the terminal. It's just name comma price comma stock, hot dogs 5.9 143, pancakes 3.10 43, waffle 7.5 14. It is valid CSV and it works in Excel as it should.
Interesting. Cool.
So that is using our formatting as an entire filter, right? We pipe it to at csv. So the second thing I said was you can also use this to control what happens inside string interpolation, and the canonical example here is building a URL, right? And I think you have some plugins and stuff on your Mac that let you build custom URLs that you tie to key thingies. Didn't you do a thing once where you had special keys to search Google or to search something?
Oh yeah, um, keyword search does that.
[1:02:54] So, this same concept we can use in JQ. So, we, well, I'm going to tell you.
If you want to give someone a URL to a Google search, let me Google it for you.
Then it's https://www.google.com forward slash search question mark Q equals and then you put your search query.
So, if you want to search for pancakes, it will be Q equals pancakes.
If you want to search for waffles, Q equals waffles.
If you need to search for something which contains a space, or a comma, or frankly any special character, you have to encode that special character using the URI encoding scheme, which is %20 for space.
I don't remember the other ones, but there is one for everything.
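For a feel of what the encoding does, here is a small sketch that runs a made-up search phrase through at URI on its own. The phrase is just an illustration.

```shell
# -n means "no input file", -r means "raw output without JSON quotes".
encoded=$(jq -rn '"hot dogs, pancakes & waffles" | @uri')
printf '%s\n' "$encoded"
```

The space becomes %20, the comma becomes %2C, and the ampersand becomes %26, so this prints hot%20dogs%2C%20pancakes%20%26%20waffles.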
So we're going to use JQ to build a search URL.
What are we going to search for? We are going to search for the winner of the first ever Nobel Prize.
So we're going to use our Nobel Prize data sets to find the thing we want to shove into the Google query.
So we don't know who won the first Nobel Prize. We just know we want to Google them.
Wait, so we're going to query NobelPrizes.json and make that be the input to a Google search without ever seeing what the thing we're searching for is.
[1:04:15] Precisely. In other words, we want to turn the answer to a question from our data set into a working Google link. Okay.
Which is a realistic thing to want to do, right? You look something up on a web API, you get back some JSON, you pull out the piece you want, and you turn that into a Google link.
In this case, the first ever Nobel Prize winner is my arbitrarily chosen thing to Google for.
Okay. So this gives me an excuse to remind us all, myself included, that one of the cool things JQ does is it lets us read arrays backwards.
Because this data set has the prizes in reverse chronological order.
So the most recent prize is at the top of the file, and the first ever prize is at the bottom of the file.
So if we say dot prizes minus one, we get the first ever Nobel Prize.
I remember you telling us that. Yeah, it's so cool. I love languages that let you do that.
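A quick sketch of reading an array backwards, using a three-prize stand-in in the same newest-first order as the real file (the stand-in data is an assumption for illustration):

```shell
# Hypothetical stand-in: newest prize first, oldest prize last.
data='{"prizes":[{"year":"2023"},{"year":"1902"},{"year":"1901"}]}'

# Index -1 counts from the end of the array, index 0 from the start.
first_ever=$(printf '%s' "$data" | jq -r '.prizes[-1].year')
most_recent=$(printf '%s' "$data" | jq -r '.prizes[0].year')
printf '%s %s\n' "$first_ever" "$most_recent"
```

This prints 1901 followed by 2023.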
So if we say dot prizes minus one, dot laureates zero, we get the first winner of the first Nobel Prize.
[1:05:19] If we then pipe that to string interpolation, backslash roundy bracket dot first name, close roundy bracket, space, backslash roundy bracket dot surname, close roundy bracket, we now have a way of looking up the name of the first ever Nobel Prize winner, which was Emil von Behring.
And we have the jq command here to do that. Okay.
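Written out, the lookup might look like the sketch below. The data is a trimmed stand-in, and the key names firstname and surname are an assumption about how the data set spells them.

```shell
# Hypothetical two-prize stand-in, newest first. The 1902 entry is a placeholder.
data='{"prizes":[{"year":"1902","laureates":[{"firstname":"Placeholder","surname":"Laureate"}]},{"year":"1901","laureates":[{"firstname":"Emil","surname":"von Behring"}]}]}'

# Last prize, first laureate, then glue the two name parts together
# with string interpolation: \( ... ) inside a double-quoted string.
name=$(printf '%s' "$data" | jq -r '.prizes[-1].laureates[0] | "\(.firstname) \(.surname)"')
printf '%s\n' "$name"
```

This prints Emil von Behring.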
Now we want to turn that answer into a working URL.
And so we can do that by having as our jq query, at URI, quote, HTTPS colon slash slash www.google.com forward slash search question mark q equals.
Then we start our string interpolation, backslash roundy bracket.
All the logic I just gave you, close our roundy bracket, close our string.
Introduction to the topic of quotes and feeding NobelPrizes.json
[1:06:19] I'm just laughing at how many quotes there are at the end. Okay.
So that is going to find... And then you still feed it NobelPrizes.json at the end. We do, exactly, yes.
So I was going to take Emil von Behring and shove that answer into the Q equals, but it's not going to just shove it in.
It's going to apply the URI encoding before it shoves it in because we put at URI before we opened our string.
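A sketch of the whole thing, assuming the same trimmed stand-in data and the key names firstname and surname. All of the lookup logic goes inside one interpolation, so the complete name is what gets encoded.

```shell
# Hypothetical one-prize stand-in for the data set.
data='{"prizes":[{"year":"1901","laureates":[{"firstname":"Emil","surname":"von Behring"}]}]}'

# @uri in front of the string: only what is inside \( ... ) gets encoded.
url=$(printf '%s' "$data" | jq -r '@uri "https://www.google.com/search?q=\(.prizes[-1].laureates[0] | "\(.firstname) \(.surname)")"')
printf '%s\n' "$url"
```

This prints https://www.google.com/search?q=Emil%20von%20Behring, with the front of the URL untouched and the spaces in the name turned into %20.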
The presence of quotes in the JQ command
[1:06:51] Two questions. Why doesn't Emil von Behring end up in quotes?
Because there are no quotes in the data set. The return of our standalone JQ did return it inside quotes, but those quotes were just wrapped around it by JQ. If we'd done a jq minus r, they wouldn't have been there. So those quotes weren't there until the very, very end, when they fell off the end of the command. I wish I'd put a minus r in there to not confuse you.
Oh, so if we ran that same jq command, not stuffing it into the URI query and all that, you're saying that that would have...
Running the same JQ command without URI query
[1:07:42] It would not have had quotes. Those quotes were added at the very, very end by the JQ command.
I'm going to go back and do just the JQ with a minus r. You're saying that I wouldn't get it in quotes?
Correct.
Okay.
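The point about the quotes can be checked with a tiny sketch. The string here is typed in directly rather than looked up, purely to keep the example short.

```shell
# Without -r, jq prints a string as JSON, so it adds the quotes on output.
with_quotes=$(jq -n '"Emil von Behring"')

# With -r, jq prints the raw string and the quotes are gone.
without_quotes=$(jq -rn '"Emil von Behring"')

printf '%s\n%s\n' "$with_quotes" "$without_quotes"
```

The first line printed has double quotes around the name and the second does not. The value inside jq is the same in both cases.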
Confirming the absence of quotes with JQ command
[1:08:07] Okay, so second question. Why is at URI at the front, not at the end, like at CSV?
I don't get it. Okay, right. So this case we are saying explicitly, we are doing string interpolation, and I want you to apply the at URI to the string interpolation.
So the first time the at whatever was the entire filter. Pipe, at whatever, end of story.
Which means apply me to everything.
That's where we had it at the end?
Well, it wasn't at the end. It was the entire filter, right?
The pipe symbol says start a new filter.
So the entire filter was at CSV. Okay.
Here we are saying at CSV space, open quote.
At URI. Sorry, at URI. Okay, at anything. Right? At thing. Okay.
So that means apply this encoding to the string interpolation.
So when you just say at and you don't give it any more information, it applies it to everything.
If you say at followed by a string, it only applies it to the interpolation.
At on its own, apply to everything. At followed by a string, apply to the interpolation. That's the rule.
Different positions of 'at csv' command in a pipeline
[1:09:27] Not sinking in one tiny little bit. Okay. Is there an example where at csv could be at the beginning? But it is at the beginning here: you've got jq minus r, single quote, at URI, and then all the stuff we're going to do, all our query strings and the URL and all that, is after at URI. In the other example, all the query string stuff was to the left and then we piped it to at csv, so at csv was at the end. One's at the beginning, one's at the end of the command. Okay.
So in one case, we are saying take the input...
[1:10:18] And run all of the input through the CSV command. Agreed. I get that one.
So the entire filter is just at CSV. And its input is whatever came before.
It could be the entire file.
Or in this case, we're saying, give me this array and then send that to at CSV.
So you could actually say jq minus r at CSV, name a file.
And it would take the entire file and run it through at CSV.
Okay. So that CSV is a filter all by itself. We just have it at the end of a pipeline.
Okay. Why do you put this at URI at the beginning?
Why not pipe it to at URI at the end? Would that be the same thing?
[1:11:02] No. Good. This is the perfect way to ask me the question. This is the perfect way to ask the question, because then I can tell you the difference.
So if I take http://www.google.com and I pipe all of that through at URI, I get HTTP colon, sorry, I get HTTPS, percent something, percent something, percent something, www, percent something, Google, percent something, com, percent something, search, percent something.
I don't want to apply it to everything. I only want to apply it to the bits I'm inserting.
That's why it's at URI space the string interpolation.
String Manipulation: Answer Extraction
[1:11:49] Only the answer, only the bit that comes after my slash open roundy bracket, the bit I calculate gets converted, the rest of the string is left alone.
So the answer is https colon slash slash www.google.com forward slash search question mark q equals.
And now all the weird stuff happens.
Emil percent 20 von percent 20 Behring.
It's only applied to the string interpolation. Hmm.
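The two shapes can be compared side by side with a sketch like this one. The name is typed in directly and held in a variable called $name, which is just an illustrative choice.

```shell
# Shape one: build the whole string, then pipe it to @uri.
# Everything gets encoded, including the colon, slashes, ? and =.
whole=$(jq -rn '"https://www.google.com/search?q=Emil von Behring" | @uri')

# Shape two: @uri in front of the string.
# Only the interpolated part gets encoded.
partial=$(jq -rn '"Emil von Behring" as $name | @uri "https://www.google.com/search?q=\($name)"')

printf '%s\n%s\n' "$whole" "$partial"
```

The first result is https%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3DEmil%20von%20Behring, which is no use as a link. The second is https://www.google.com/search?q=Emil%20von%20Behring, which is.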
Understanding the use of "at" for input manipulation
[1:12:23] Okay, I believe you. Okay, think of it this way.
If you just put the at and nothing else, it applies to all of its input.
[1:12:36] If you want to be specific, you use the at in front of string format.
Okay, I think we should keep going. I don't completely follow it, but I believe you.
I guess it's a pattern, so you can copy and paste the same logic into anything you're doing, right? And it will behave in the same way.
I guess what bothers me is that the thing immediately following at URI is not the string interpolation. It's the HTTPS www.google.com. You get to the string interpolation later. You have the backslash open roundy bracket. That's the string interpolation, right?
Right. Okay, but the thing straight after it is a string that contains interpolation.
Mm-hmm.
Oh, wait a minute, wait a minute. So that could be at URI, quote, Bob backslash open roundy bracket dot prizes minus one.
Yeah, absolutely. Yeah. Okay.
That's just text. It's literally just text to at URI at that point.
Precisely, exactly. From a string interpolation perspective.
Okay. Yeah. That sort of makes sense. I think that's as far as we're going to get me.
[1:13:56] Well, I mean, they're just patterns, right? So if you ever need to do this for real, you just replace the HTTPS bit with the bit you want, and you replace the prizes minus one, whatever you want, and it'll work.
It's just a pattern. It's a shape, right? And it behaves one way with one shape and one way with another shape.
So the last thing I want to get to today is just to tell you what you can do.
So in terms of formatting an entire line of text, we have at text, which is just a shortcut for to string. So if you just need to force plain text out, at text, and it's as if you run it through to string.
At JSON gives you JSON the way an API expects it.
So not pretty, not lots of new line characters with little tabs.
It gives you the kind of JSON you get from a URL, which is not a single wasted character.
It's just mushed together as one giant big barf of JSON.
Right. Great for computers, terrible for humans, but at JSON will give you the computer-friendly version.
At CSV is our comma-separated values, and at TSV is our tab-separated values.
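A small sketch of the four whole-input formats, each fed a made-up value typed directly into the command:

```shell
# @text is the same as tostring: non-strings become compact JSON text.
text=$(jq -rn '[1, "two"] | @text')

# @json gives compact JSON with no spare whitespace.
json=$(jq -rn '{"name": "waffles", "tags": ["hot", "sweet"]} | @json')

# @csv and @tsv both need an array as their input.
csv=$(jq -rn '["waffles", 7.5, 14] | @csv')
tsv=$(jq -rn '["waffles", 7.5, 14] | @tsv')

printf '%s\n' "$text" "$json" "$csv" "$tsv"
```

The at CSV row comes out with the string in double quotes and commas between fields, while the at TSV row has no quotes and a tab between fields.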
In terms of encoding, which is escaping special characters is how I think of encoding, we have at HTML gives us things like ampersand, AMP, semicolon for and.
And percent, sorry, at URI gives us the percent 20 stuff.
[1:15:20] At base 64 will do a base 64 encoding of the input, so this and that gets base 64 encoded to dGhpcy blah, blah, blah, blah, blah, blah, blah.
That is, you can base 64 encode anything and you get back this weird hexadecimal glop with equal signs in it. It's very pretty.
[1:15:40] There is also at base64d, which is decoding Base64, so if you take that glop and run it through at base64d you get back this and that.
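Here is a short sketch of the HTML and Base64 formats in action. It assumes a jq recent enough to have the decoding format (version 1.6 or later).

```shell
# @html escapes the characters that are special in HTML.
html=$(jq -rn '"this & that" | @html')

# @base64 encodes, @base64d decodes.
b64=$(jq -rn '"this and that" | @base64')
back=$(jq -rn '"dGhpcyBhbmQgdGhhdA==" | @base64d')

printf '%s\n' "$html" "$b64" "$back"
```

This prints this &amp; that, then dGhpcyBhbmQgdGhhdA==, then this and that.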
And the other very convenient one, because you often use JQ as part of a big terminal command, is at sh, which will do shell escaping
[1:16:00] on a string, so you can safely send a string of jq output as an argument to another terminal command with at sh.
Okay. So that then brings us to an optional challenge. I've already shown you that we can make a custom dictionary to represent Andrea's prize, and that was very pretty.
But I would like you to not just make a dictionary, I would like you to take as your input our NobelPrizes.json file and give me back what I think it should have been in the first place.
So not a dictionary with a key called prizes that is an array with lots of dictionaries which each contain another array of even more dictionaries for the laureates.
I would like that entire file to come back to me as one top level dictionary, sorry, one top level array containing one dictionary for every prize that was actually awarded.
So I don't care about the ones with no laureates. Don't want them in my list.
Formatting text using different encodings and separators
[1:17:05] And then for each element in that array, I just want four keys.
I want the year as a number.
[1:17:14] I want which prize it was as a string, so chemistry, physics, whatever.
I want the number of winners there were as a number, called numWinners, and then I would like a simple array of strings that is just the names of the winners, called winners.
[1:17:33] So as an example of what a correctly formatted one would look like, the Peace Prize from 1907 should be year colon 1907, as a number, prize colon peace, as a string, numWinners, two, and then the winners array contains Ernesto Teodoro Moneta and Louis Renault.
That is it. So nice, simple representation of that prize.
I'm going to give you a warning and a tip. I don't know if that's me being kind or mean.
So it's easy enough to do this for the prizes where the winners are human beings, because they have a first name and a surname and you can just stick them together.
But the prizes where the winner is an organization have no surname.
And I promise you that your first attempt at solving this problem will result in trailing spaces.
[1:18:30] And you can check whether or not you have this problem by looking at the 1904 Peace Prize.
If you have successfully accounted for the lack of surnames, then the winner will be an array with one string, Institute of International Law with no trailing space.
And if you haven't, you will have space null as probably your first problem.
That's what I certainly got the first time I tried to solve this problem.
Institute of International Law, space null.
Then I eventually got it to Institute of International Law space, which is still wrong.
I promise you, you can get it without the trailing space.
We may have trimmed some sort of trailing something at some point today, for example.
Okay. So you can get full credit by allowing the space to appear and then trimming it away later. That is a valid solution.
[1:19:29] But there is a more elegant solution for bonus credit, because it's mildly non-obvious, but you can stop it ever happening.
Right? You don't have to remove a trailing space you never create.
And the key to never creating it is the fact that if you join an array of one element, then the joining symbol never appears.
So if you have Bob and Dylan and you join it with a space, you get Bob space Dylan.
But if you have the array Bob and you join it with a space, you just get Bob.
Hmm, no trailing space.
[1:20:10] So if you can arrange to have your name as an array, then you can never have the space be a problem.
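The join behaviour described here is easy to check with a two-line sketch:

```shell
# Two elements: the separator appears once, between them.
two=$(jq -rn '["Bob", "Dylan"] | join(" ")')

# One element: the separator never appears, so no trailing space.
one=$(jq -rn '["Bob"] | join(" ")')

printf '[%s]\n[%s]\n' "$two" "$one"
```

The square brackets in the printed output are only there to make any stray space visible. This prints [Bob Dylan] and [Bob].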
To help you make that true, I'm going to remind you of the existence of the alternate operator, which we both agree is terribly named because it's forward slash forward slash, which you, me, and half the planet think means this is a comment.
No. Or I'm about to escape the boundary of a word.
Yeah, I know, right? And then the other thing I'm going to tell you about is a function called empty.
And what it does is it produces absolute nothingness. Not null.
Actual, genuine nothingness. So if there is no surname, you don't want null.
You want actually nothing.
And the way to get actually nothing is empty.
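As a sketch of the two hints on their own, without assembling them into a solution to the challenge, the values below are made up and typed directly into the commands:

```shell
# The alternate operator: use the left side unless it is null or false.
fallback=$(jq -rn 'null // "no surname"')
kept=$(jq -rn '"Dylan" // "no surname"')

# null is still a value, so it takes up a slot in an array.
count_null=$(jq -n '["Bob", null] | length')

# empty produces nothing at all, so the slot never exists.
count_empty=$(jq -n '["Bob", empty] | length')

printf '%s\n' "$fallback" "$kept" "$count_null" "$count_empty"
```

This prints no surname, Dylan, 2, and 1. The last two lines show the difference between null and genuine nothingness.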
We're going to look at empty in the next installment, because the documentation for empty is hilarious. It is genuinely hilarious.
Just so you know, my dog already ate my bonus credit.
[1:21:27] Yeah, look, the bonus credit, right, it's for bonus credit. I am going to explain it in the sample solution because I want to tell you about empty. Okay. Just so you can read the documentation.
Yeah, it's copied and pasted in the show notes. Yeah, it's cool.
But honestly, if you can build that nice four key dictionary for every Nobel Prize that was actually handed out, 100% full marks and a really good example of what you want JQ for, right?
You have a piece of data available to you. It's in the wrong shape.
Beat it into the right shape and output it as valid JSON, and then you can use it for something else. So this is the perfect example of why we want JQ.
Cool.
[1:22:16] Right, that was fun, but there was a lot of it here.
Yeah, we looked at it up front, knew it was long, but it held together, I think.
It did. So this is all of the construction work, right? We know how to build strings, you know how to build arrays, you know how to build dictionaries.
And I've told you that there are functions for transforming.
We've seen ltrimstr and rtrimstr. There's loads of them.
You can do all sorts of cool things to strings and to arrays and to dictionaries.
Next time, we're just going to look at all the cool things we can do to strings, arrays, and dictionaries, and how to do math. The other thing we'll learn next time.
So compared to this, it's a way lighter lift because we're just going to learn how to make data change shape.
Sounds fun. You'll love it. Yeah, you'll love it with your Excel head.
So all of those functions that exist in Excel, there's equivalents of them in JQ because they're solving the same problem. We need to manipulate data.
And so that's what we're going to do next time.
A Good Story that Holds Together
[1:23:08] Very good. Sounds like fun. I know this was a long one and there was a lot in it, but like I said, it holds together.
I think it's a good story and I'm glad it's in one lesson, in one set of show notes. We briefly debated whether we should split it, but it just held together.
So I think it was good. We powered through, Bart. We made it. Yay.
And of course, the most important thing, until next time, happy computing.
If you learn as much from Bart each week as I do, I'd like you to go over to lets-talk.ie and press one of the buttons over there to help support him.
He does 98% of the work here. I'm just the stooge that listens to him and asks the dumb questions.
If you go over to lets-talk.ie, you can support him on Patreon, you can donate via PayPal, or you can use one of his referral links.
I really hope you'll go over and help him out. In the meantime, you can contact me at Podfeet or check out all of the shows we do over there over at podfeet.com. Thanks for listening.
[1:24:10] Music.