#24: The awk basics

awk is very complex and hard to learn. It is the most difficult task in Linux administrators' lives and drives everyone crazy who tries to learn it. It was written by gods and is used by them; no mortal can ever learn awk.

In fact…, no! If you know any C-like programming language and regular expressions, awk is absolutely easy and it can make things much easier. It's pretty mighty, but I've never seen any programming language easier than awk (except QBasic).

First, let's clarify some terms. awk is a programming language for processing text data (or call it a “scripting language” if you like). It's named after its authors Alfred Aho, Peter Weinberger, and Brian Kernighan. Like any other interpreted programming language, awk needs an interpreter, which is also called awk. Today, there are several implementations of this interpreter. The most common one is GNU awk (gawk), and this is the one I refer to in this blog post. However, the basics mentioned here should also work with any other implementation. If you don't know which implementation you have, run

ls -l `which awk`

in your terminal. If you don't have the original awk implementation installed, then /usr/bin/awk should be a symlink to the real binary such as /usr/bin/gawk.
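On some systems (Debian-based ones, for example) the symlink takes a detour through /etc/alternatives, so ls -l only shows the first hop. Here is a small sketch that follows the whole chain. It assumes that your readlink supports -f, which the GNU coreutils version does; the variable name awk_binary is just for illustration.

```shell
# Follow every symlink in the chain and print the final binary.
# Assumption: readlink supports -f (true for GNU coreutils).
awk_binary=$(readlink -f "$(command -v awk)")
printf '%s\n' "$awk_binary"
```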

The main syntax of awk is pretty simple:

awk '[pattern] [{script}]' [file]

All parts are optional (therefore set in square brackets), but you have to specify at least a pattern or a script/expression.

The pattern is a filter. Like sed, awk processes files (or STDIN, if no file is given) line by line. The pattern acts as a pre-filter: only lines matching the pattern are processed. Therefore you can basically do what grep normally does:

awk '/foobar/' file

If no awk script is given, the lines matching the pattern are printed as is. An awk script that does nothing but print the lines has the same effect:

awk '/foobar/ { print }' file
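To try this without a file at hand, here is a minimal sketch that feeds a few made-up lines through the pattern. The negated form with ! is not mentioned above; it is my addition and behaves roughly like grep -v.

```shell
# Made-up sample input, three lines.
input='foobar one
baz two
xx foobar yy'

# Keep only the lines that match the pattern (like grep).
matches=$(printf '%s\n' "$input" | awk '/foobar/')
# Keep only the lines that do NOT match (like grep -v).
others=$(printf '%s\n' "$input" | awk '!/foobar/')

printf 'matching:\n%s\n' "$matches"
printf 'not matching:\n%s\n' "$others"
```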

The awk script is written in curly braces. You can either pass it on the command line or reference a file with the parameter -f:

awk -f script.awk inputfile
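A quick way to play with -f is to write the script into a temporary file first. This is only a sketch: the temporary file stands in for script.awk, and the sample input is made up.

```shell
# Create a throwaway script file so no real file gets overwritten.
script=$(mktemp)
cat > "$script" <<'EOF'
# Print only lines that contain "foobar", prefixed with a marker.
/foobar/ { print "match: " $0 }
EOF

# Run the script file against two made-up input lines.
output=$(printf 'foobar\nnothing here\n' | awk -f "$script")
printf '%s\n' "$output"
rm -f "$script"
```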

You can also write an executable script file:

#!/usr/bin/awk -f
{
    # Do some stuff
}

That's the basic usage. Let's now come to the more interesting stuff. awk scripts can consist of three main blocks: the BEGIN, the main and the END block.

BEGIN {
    # Do some stuff
}
{
    # The main block, process lines
}
END {
    # And conclude with this
}

The BEGIN and END blocks are marked with a keyword; the main block doesn't have one. The main block is executed for each line of input, the BEGIN and END blocks before and after processing the lines. So the BEGIN block can be used to initialize variables etc. and the END block to do some concluding stuff. If you want to use a pattern, you have to write it right in front of the main block:

BEGIN {
    # Do some stuff
}
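/pattern/ {
    # The main block, processes matching lines only. The opening brace
    # must be on the same line as the pattern, otherwise awk treats the
    # pattern and the block as two separate rules.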
}
END {
    # And conclude with this
}
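To see the three blocks working together, here is a minimal sketch that counts the lines matching a pattern. The sample input and the variable name count are made up for illustration; NR is awk's built-in counter of lines read so far.

```shell
report=$(printf 'error: disk\ninfo: ok\nerror: net\n' | awk '
BEGIN   { count = 0 }                              # runs once, before the first line
/error/ { count++; print "hit: " $0 }              # runs only for lines matching the pattern
END     { print count " of " NR " lines matched" } # runs once, after the last line
')
printf '%s\n' "$report"
```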

That's the structure; let's move on to the input format. Every input line is split into columns/fields. By default, this is done by dividing the line at spaces and tabs. A field can be accessed through the variables $1…$n, and the variable $0 contains the complete line. This resembles the positional parameters in shell programming, except that $0 holds the script name there.

echo 'abc def' | awk '{ print "First field: " $1; print "Second field: " $2 }'
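If you don't know the number of fields in advance, the built-in variable NF helps: it holds the field count of the current line, so $NF is the last field. NF is not mentioned above, but it is part of standard awk. A short sketch with made-up input:

```shell
fields=$(echo 'abc def ghi' | awk '{
    print "Fields: " NF        # number of fields in this line
    print "Last field: " $NF   # $NF expands to the field with index NF
    print "Whole line: " $0    # the unsplit input line
}')
printf '%s\n' "$fields"
```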

You can also specify a custom delimiter either with the parameter -F or with the variable FS (field separator). Let's, for example, set the delimiter to : to process /etc/group:

awk -F: '{ print "Group name: " $1 }' /etc/group

or with FS:

awk 'BEGIN { FS=":" } { print "Group name: " $1 }' /etc/group

Note that you have to set this variable in the BEGIN block; otherwise the first line would not be processed properly, since it has already been split with the default separator by the time the variable is set to :.
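The difference is easy to see with two made-up lines. In this sketch, the first command sets FS in the main block and the second one in BEGIN. With gawk (and any other implementation that follows POSIX here), the first command should still treat the whole first line as one field.

```shell
# FS set in the main block: the first line was already split at whitespace,
# so $1 is the complete line "a:b". Only the second line is split at ":".
late=$(printf 'a:b\nc:d\n' | awk '{ FS=":" } { print $1 }')

# FS set in BEGIN: every line is split at the colon.
early=$(printf 'a:b\nc:d\n' | awk 'BEGIN { FS=":" } { print $1 }')

printf 'late:\n%s\nearly:\n%s\n' "$late" "$early"
```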

Like any other normal programming language, awk supports loops, conditions and operators. If you know any C-like programming language, they are nothing new to you. They look like this:

for (exp1; exp2; exp3) {
}

for (var in arr) {
}

while (exp) {
}

if (cond) {
}

etc. As you can see, awk supports for/in loops, thus it also supports arrays.

arr[0] = "foo"
arr[1] = "bar"
arr["bla"] = "blub"
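Loops, conditions and arrays fit together nicely in a small word counter. This is just a sketch with made-up input; the array name seen is arbitrary. Since for/in walks the keys in no particular order, the result is piped through sort to get a stable output.

```shell
counts=$(printf 'foo bar foo\nbaz foo bar\n' | awk '
{
    # Classic for loop over all fields of the current line.
    for (i = 1; i <= NF; i++) {
        seen[$i]++    # array elements spring into existence on first use
    }
}
END {
    # for/in visits every key of the array, in unspecified order.
    for (word in seen) {
        if (seen[word] > 1) {
            print word, seen[word]
        }
    }
}' | sort)
printf '%s\n' "$counts"
```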

As in most scripting languages, variables (and arrays) don't have to be declared first. You can just use them by initializing them with values.

var = "xyz"
print var

Variable names follow the usual convention: alphanumeric characters and underscores are allowed, and the first character must not be a digit.
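A variable that has never been assigned is not an error either: it acts as 0 in arithmetic and as an empty string otherwise. The following sketch shows this, along with the -v option for passing a value in from the shell. The names greeting and unset_var are made up.

```shell
vars=$(awk -v greeting="hello" 'BEGIN {
    print greeting            # value passed in with -v
    print unset_var + 0       # never assigned: counts as 0 in arithmetic
    print "[" unset_var "]"   # never assigned: empty string in string context
}')
printf '%s\n' "$vars"
```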

That's basically it. The only things you really have to learn are the built-in variables (which are very important to know), the output statements print and printf, the built-in functions such as match, and of course regular expressions. All are described in the manual pretty well. Once you've understood the concept of awk, it's fairly simple. Still, many people don't have the heart to learn this fantastic language. It's not that hard, so don't hesitate, dude! ;-)
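To give a taste of those built-ins, here is a small sketch that combines match, printf and the variables NR and NF. The input format (id=… name=…) is invented for this example; RSTART, RLENGTH and substr are standard awk, although not covered above.

```shell
built_ins=$(printf 'id=42 name=foo\nname=bar id=7\n' | awk '{
    # match() returns the position of the first match (0 if there is none)
    # and sets RSTART and RLENGTH, which substr() uses to cut out the text.
    if (match($0, /id=[0-9]+/)) {
        id = substr($0, RSTART + 3, RLENGTH - 3)   # skip the "id=" prefix
        printf "line %d has %d fields, id %s\n", NR, NF, id
    }
}')
printf '%s\n' "$built_ins"
```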

Comments

Per Wigren wrote:
The "pattern" can actually be any kind of boolean expression. One of my favourite uses of awk is to search the PostgreSQL log for queries that took more than a certain amount of time. For example, this command will print queries that took more than 20ms (depending on how you configured the log format):

awk '$8 == "duration:" && $9 > 20' pg.log
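A minimal sketch of such an expression pattern, for trying it out without a real log. The three input lines are made up and much shorter than a PostgreSQL log line, so the field numbers differ from the command above.

```shell
# Print the first field of every line whose second field is "duration:"
# and whose third field is numerically greater than 20.
slow=$(printf 'q1 duration: 12.5 ms\nq2 duration: 48.1 ms\nq3 rows: 99\n' |
    awk '$2 == "duration:" && $3 > 20 { print $1 }')
printf '%s\n' "$slow"
```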

Design and Code Copyright © 2010-2025 Janek Bevendorff Content on this site is published under the terms of the GNU Free Documentation License (GFDL). You may redistribute content only in compliance with these terms.