Awk Tutorial

April 5, 2016

I already mentioned why you should learn AWK.
Let me show you how you can start using it today.

Example Data

I think it’s hard to learn AWK in a vacuum. I looked for open data on the web and picked Netflix historical stock prices. The CSV data is available to download from Yahoo Finance or Google Finance. It’s possible to parse this CSV data in AWK, but I replaced the commas with TAB characters to make the examples easier. Here is the data we’re going to use:

    Date        Open        High        Low         Close       Volume    Adj Close
    2016-03-24  98.639999   98.849998   97.07       98.360001   10646900  98.360001
    2016-03-23  99.75       100.389999  98.809998   99.589996   8292300   99.589996
    2016-03-22  100.480003  101.519997  99.199997   99.839996   9039500   99.839996
    2016-03-21  101.150002  102.099998  99.50       101.059998  9562900   101.059998
    2016-03-18  100.50      102.410004  100.010002  101.120003  15437300  101.120003
    ...

Download the data if you want to try examples yourself.
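The comma-to-TAB conversion mentioned above can be sketched with tr. This is only one possible way to do it, and it assumes the CSV has no quoted fields that contain commas. The file names netflix-sample.csv and netflix-sample.tsv are made up for this illustration, and the rows are copied from the listing above.

```shell
# Build a tiny CSV from the header and the first three rows shown above.
# netflix-sample.csv and netflix-sample.tsv are hypothetical file names.
cat > netflix-sample.csv <<'EOF'
Date,Open,High,Low,Close,Volume,Adj Close
2016-03-24,98.639999,98.849998,97.07,98.360001,10646900,98.360001
2016-03-23,99.75,100.389999,98.809998,99.589996,8292300,99.589996
2016-03-22,100.480003,101.519997,99.199997,99.839996,9039500,99.839996
EOF

# Replace every comma with a TAB character.
tr ',' '\t' < netflix-sample.csv > netflix-sample.tsv

# Show the result.
cat netflix-sample.tsv
```

A plain character swap like this would break on a field such as "1,000", which is why the no-quoted-commas assumption matters.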

Printing Columns

Printing columns is probably the most useful thing you can do in AWK:

    $ cat netflix.tsv | awk '{print $2}'
    Open
    98.639999
    99.75
    100.480003
    101.150002
    100.50
    # snip

Let’s take it one step at a time:

  • cat netflix.tsv | awk to send netflix.tsv to the STDIN of AWK

Alternatively, awk '{print $2}' netflix.tsv would have given us the same result. For this tutorial, I use cat to visually separate the input data from the AWK program itself. Using cat also emphasizes that AWK can process any input, not just existing files.

  • {print $2} to print the 2nd column

Yes, you need the curly brackets – I’ll come to that shortly. You already guessed it: column 1 is $1, column 2 is $2, column 7 is $7, etc…

  • # snip to indicate omitted output

There are 3485 lines in the data file. For most examples, I’ll truncate the output because more isn’t always better.
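To make the points above concrete, here is a small sketch showing that the cat form and the file-argument form print the same thing, and that the input need not be a file at all. The file name netflix-sample.tsv and the echo input are invented for this example.

```shell
# Hypothetical three-line sample in the same TAB-separated layout.
printf 'Date\tOpen\tHigh\n2016-03-24\t98.639999\t98.849998\n2016-03-23\t99.75\t100.389999\n' > netflix-sample.tsv

# Reading from STDIN and reading a file argument print the same column.
cat netflix-sample.tsv | awk '{print $2}'
awk '{print $2}' netflix-sample.tsv

# The input can come from any command, not only from a file.
echo 'one two three' | awk '{print $3}'
```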

Always Use Single-Quotes with AWK

Let’s get this out of the way: always use single-quotes with AWK.

As you’ve seen above, column names have dollar signs in them ($1, $2, $7…), which BASH would normally try to substitute. Single-quotes are how you tell BASH to keep the content of your strings untouched. Inside double-quotes, BASH expands $2 before AWK ever sees it, and backslash escapes might work but are not worth fighting for.

Let’s keep things simple with single-quotes.

If you need to inject some values into your script, I’ll show you how in a follow-up tutorial.
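Here is a quick sketch of what goes wrong with double-quotes. It assumes the shell's own $2 is empty, which the set -- line forces here. The input text is made up.

```shell
# Clear the shell's positional parameters so that the shell's $2 is empty,
# as it normally is in an interactive session.
set --

# Single quotes: AWK receives {print $2} and prints the 2nd field.
echo 'alpha beta gamma' | awk '{print $2}'

# Double quotes: the shell expands $2 to nothing first, so AWK
# receives {print } and prints the whole line instead.
echo 'alpha beta gamma' | awk "{print $2}"
```

The second command does not fail. It silently prints something else, which makes the mistake easy to miss.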

What’s With Those Curly Brackets? { }

What’s the difference between:

    awk '{print $2}'
    awk 'print $2'

Answer: one works and the other doesn’t! (rimshot) We’ll need to take a step back to explain the difference. In AWK, a program is composed of rules which look like:

some-condition { one or many statements }

If it were C code:

if (some-condition) { one or many statements; }

In short, the curly brackets ({ }) tell AWK to do something. AWK allows either the condition or the action to be missing.

What does it mean when the condition is missing?

A missing condition defaults to “always run”:

    awk '{print $2}'
    # means:
    awk '1 {print $2}'
    # 0 and the empty string are false, any other value is true

Read it as: if true, print the 2nd column.
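To see the condition at work, this small sketch feeds two made-up lines to AWK, once with a condition that is always true and once with one that is always false.

```shell
# 1 is true, so the action runs for every line: prints b, then d.
printf 'a b\nc d\n' | awk '1 {print $2}'

# 0 is false, so the action never runs: prints nothing.
printf 'a b\nc d\n' | awk '0 {print $2}'
```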

What does it mean when the action is missing?

A missing action defaults to “print”:

    $ cat netflix.tsv | awk '$2 > 100'
    Date        Open        High        Low         Close       Volume    Adj Close
    2016-03-22  100.480003  101.519997  99.199997   99.839996   9039500   99.839996
    2016-03-21  101.150002  102.099998  99.50       101.059998  9562900   101.059998
    2016-03-18  100.50      102.410004  100.010002  101.120003  15437300  101.120003
    2016-03-07  101.00      101.790001  95.25       95.489998   23855200  95.489998
    2016-01-22  104.720001  104.989998  99.220001   100.720001  26772700  100.720001
    # snip -- output has been reformatted to align

A missing block prints the whole matching line.

    awk '$2 > 100'
    # means:
    awk '$2 > 100 { print }'
    # means:
    awk '$2 > 100 { print $0 }'

$0 is a special variable that contains the current line, before it was separated into fields. print $0 means “print the current line”. print, by itself, also prints the current line.
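A rule can of course carry both a condition and an action. The sketch below combines the two pieces covered so far. The file name netflix-sample.tsv is made up, and the rows are copied from the data shown earlier.

```shell
# Hypothetical sample: the header plus four rows, TAB-separated.
# printf reuses the format string for each group of seven arguments.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  Date Open High Low Close Volume 'Adj Close' \
  2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001 \
  2016-03-23 99.75 100.389999 98.809998 99.589996 8292300 99.589996 \
  2016-03-22 100.480003 101.519997 99.199997 99.839996 9039500 99.839996 \
  2016-03-21 101.150002 102.099998 99.50 101.059998 9562900 101.059998 \
  > netflix-sample.tsv

# Condition and action together: keep only lines whose Open is above 100,
# and print only the Date and Open columns of those lines.
awk '$2 > 100 {print $1, $2}' netflix-sample.tsv
```

The header line still gets through, just as in the output above: the text Open is not a number, so AWK compares it with 100 as a string, and that comparison happens to be true.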

More Printing

You know how to print one column, but what if you need to print many?

    $ cat netflix.tsv | awk '{print $1, $6, $5}'
    Date        Volume    Close
    2016-03-24  10646900  98.360001
    2016-03-23  8292300   99.589996
    2016-03-22  9039500   99.839996
    2016-03-21  9562900   101.059998
    2016-03-18  15437300  101.120003
    # snip -- output has been reformatted to align

A comma between print values inserts a space in the output (more precisely, AWK’s output field separator, which is a space by default). AWK also has printf, which unleashes infinite formatting power:

    $ cat netflix.tsv | awk '{printf "%s %15s %.1f\n", $1, $6, $5}' | sed 1d
    2016-03-24        10646900 98.4
    2016-03-23         8292300 99.6
    2016-03-22         9039500 99.8
    2016-03-21         9562900 101.1
    2016-03-18        15437300 101.1
    # snip

I removed the header line with sed 1d because printf mangles it: %.1f treats the text “Close” as a number and prints 0.0.
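As an alternative to piping through sed, a condition can skip the header inside AWK itself. This is a sketch of that idea, not the approach used in the command above. The file name netflix-sample.tsv is made up, and the rows are copied from the data shown earlier.

```shell
# Hypothetical sample: the header plus two rows, TAB-separated.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  Date Open High Low Close Volume 'Adj Close' \
  2016-03-24 98.639999 98.849998 97.07 98.360001 10646900 98.360001 \
  2016-03-23 99.75 100.389999 98.809998 99.589996 8292300 99.589996 \
  > netflix-sample.tsv

# Without any filtering, the first output line is the mangled header:
# the text "Close" is formatted by %.1f as 0.0.
awk '{printf "%s %15s %.1f\n", $1, $6, $5}' netflix-sample.tsv

# A condition on the first column skips the header line.
awk '$1 != "Date" {printf "%s %15s %.1f\n", $1, $6, $5}' netflix-sample.tsv
```

Matching on the literal word Date only works because no data line starts with it, which holds for this file.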

AWK does string concatenation without an operator: just put 2 values next to each other. This is useful when you don’t want to reach for printf but still want some formatting flexibility:

    $ cat netflix.tsv | awk '{print $1 "," $6}'
    Date,Volume
    2016-03-24,10646900
    2016-03-23,8292300
    2016-03-22,9039500
    2016-03-21,9562900
    2016-03-18,15437300
    # snip

Ooooh, we’re back to CSV.
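The difference between a comma and plain juxtaposition is easy to mix up, so here is a side-by-side sketch on a single made-up input line.

```shell
# Comma: the values are separated by a space in the output.
echo '2016-03-24 10646900' | awk '{print $1, $2}'      # 2016-03-24 10646900

# No comma: the values are concatenated with nothing in between.
echo '2016-03-24 10646900' | awk '{print $1 $2}'       # 2016-03-2410646900

# Concatenation with a literal string in the middle.
echo '2016-03-24 10646900' | awk '{print $1 "," $2}'   # 2016-03-24,10646900
```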

Taking inventory: what can you do?

This is just the beginning, and there’s more to cover. But you now have a solid foundation: you know about conditions and actions, columns and printing. You can:

  • print only the columns you want
  • print them in the order you want
  • format with all the power of printf
  • use conditions to print only lines you want

Exercises

Try to:

  • only print the ‘Date’, ‘Volume’, ‘Open’, ‘Close’ columns, in that order
  • only print lines where the stock price increased (‘Close’ > ‘Open’)
  • print the ‘Date’ column and the stock price difference (‘Close’ - ‘Open’)
  • print an empty line between each line – double-space the file

Answers are here.

What’s next?

Part 2
