awk hello world examples

Hello World

Here's the simplest possible awk program:
#!/usr/local/bin/mawk -We
{print "Hello World"}

The preceding, when run on the people.table test file, produces this output:
[slitt@mydesk awk]$ cat people.table | ./hello.awk 
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
Hello World
[slitt@mydesk awk]$

Notice that the Hello World program prints 10 lines instead of 1. That's because most parts of an awk program execute on every line of the input file (in this case stdin). 

The print statement between the braces is called an action. Actions are often taken only when a rule is satisfied. Now let's change the program slightly, adding a rule so that it prints only on line 1:

#!/usr/local/bin/mawk -We
NR == 1 {print "Hello World"}

In the preceding code, NR == 1 is a rule, and the print statement within the braces is an action. A rule is the equivalent of an if statement in other languages: the action happens only if the rule is true. In this case, the built-in variable NR holds the number of the current input line, so the preceding prints only on line 1.

[slitt@mydesk awk]$ cat people.table | ./hello.awk 
Hello World
[slitt@mydesk awk]$

Now let's change it so as to print more useful information, and to print that info on every line:

#!/usr/local/bin/mawk -We
NR == 1 {
 print "Header: " $0
}
NR > 1 {
 print "Line " NR ": " $0
}

The preceding has two rules -- one for line 1, and one for all lines below line 1 (greater NR). On line 1, the header is marked and $0, the line just read, is printed. On other lines, the line number and the line itself are printed.

The preceding code produces the following output:

[slitt@mydesk awk]$ cat people.table | ./hello.awk 
Header: person_id lname fname job_id
Line 2: 1001 Strozzi Carlo 1
Line 3: 1002 Torvalds Linus 1
Line 4: 1003 Stallman Richard 1
Line 5: 1004 Litt Steve 2
Line 6: 1005 Bush George 3
Line 7: 1006 Clinton Bill 3
Line 8: 1007 Reagan Ronald 3
Line 9: 1008 Cheney Dick 4
Line 10: 1009 Gore Al 4
[slitt@mydesk awk]$

Now let's modify the preceding slightly. Specifically, we'll finish the line 1 action with a next command, and eliminate the rule on the second action. This doesn't actually eliminate the rule -- it simply makes it an always-true rule.

The next command terminates all processing on the current line, so that the next line is read. With the always-true rule on the second action, if we hadn't included the next command, program flow would have fallen through the first action and both actions would have been taken on line 1. 
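
To see the fall-through for yourself, here's a minimal sketch with the NR == 1 rule but no next statement -- on line 1, both actions fire, so the header line prints twice in two forms:

#!/usr/local/bin/mawk -We
NR == 1 {
 print "Header: " $0
}
{
 print "Line " NR ": " $0
}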

Rule/action combinations ending with next statements can quickly get rid of lines that shouldn't be processed, or lines you know have been completely processed and need no further attention. Such use corresponds to a continue statement in C and therefore violates the principles of structured programming, but awk programmers use the construct anyway to avoid nested and compound if statements, and to make the program run faster. If you use awk, don't hesitate to put aside structured programming principles: awk is intended for small programs (let's say less than 400 lines), so structured programming isn't important enough to forgo the simplicity and readability of strategic next statements.
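
For instance, here's a minimal sketch of that filtering idiom (the comment and blank-line rules are illustrative assumptions, not tied to people.table):

#!/usr/local/bin/mawk -We
/^#/ { next }  # comment lines need no further attention
/^$/ { next }  # neither do blank lines
{ print }      # only lines surviving the filters get here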

Sometimes you want it to fall through, but often you don't. When you don't want it to fall through, use a next statement.

Here's a version of the previous program, but using the next statement:

#!/usr/local/bin/mawk -We
NR == 1 {
 print "Header: " $0
 next
}
{
 print "Line " NR ": " $0
}

As expected, the preceding program produces the same output as its two-rule predecessor:

[slitt@mydesk awk]$ cat people.table | ./hello.awk
Header: person_id lname fname job_id
Line 2: 1001 Strozzi Carlo 1
Line 3: 1002 Torvalds Linus 1
Line 4: 1003 Stallman Richard 1
Line 5: 1004 Litt Steve 2
Line 6: 1005 Bush George 3
Line 7: 1006 Clinton Bill 3
Line 8: 1007 Reagan Ronald 3
Line 9: 1008 Cheney Dick 4
Line 10: 1009 Gore Al 4
[slitt@mydesk awk]$

BEGIN and END

When processing a file, you often want a header and footer. You can put those in actions associated with the BEGIN and END rules:
#!/usr/local/bin/mawk -We
BEGIN {
 total_lines=0
 print "BEGIN SHOWING LINES"
}
{
 print "Line " NR ": " $0
 total_lines++
}
END {
 print total_lines " LINES WERE SHOWN"
}

In the BEGIN action in the preceding code, you set a total to zero (not strictly necessary, since awk initializes variables to zero, but instructive here), and then print the header. In the always-true section, line items are printed and the total is incremented. The END action prints the footer, as follows:

[slitt@mydesk awk]$ cat people.table | ./hello.awk
BEGIN SHOWING LINES
Line 1: person_id lname fname job_id
Line 2: 1001 Strozzi Carlo 1
Line 3: 1002 Torvalds Linus 1
Line 4: 1003 Stallman Richard 1
Line 5: 1004 Litt Steve 2
Line 6: 1005 Bush George 3
Line 7: 1006 Clinton Bill 3
Line 8: 1007 Reagan Ronald 3
Line 9: 1008 Cheney Dick 4
Line 10: 1009 Gore Al 4
10 LINES WERE SHOWN
[slitt@mydesk awk]$

Fields

Awk's field handling abilities give the programmer easy and convenient parsing capability. Here are some variables used in field handling:

$0 - The entire line.

NF - Number of fields. This is the number of fields on the current line.

$1, $2, $3... - The first, second, third fields, and so on. You can iterate through fields like this:
for(i=1; i<=NF; i++){act_on_field()}

FS - Field separator. This is what separates the fields from each other. The default is " ", a single space character, which means "any combination of whitespace". For tab-delimited lines you can change it to "\x09", representing a single tab. On a comma-delimited line with every field enclosed in doublequotes, it could be "\",\"", but only if ALL fields are quoted (see the sketch after this list). "Intelligent" quoting, where a field is quoted only if it contains commas, would be a nightmare. The field separator can also be a visible character like "|", or a string of characters like ":-:".

OFS - Output field separator. Defaults to " ", but can be set. In a print statement like this:
print $3, $4, $2
the fields would be separated by the output field separator.
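
To illustrate the fully quoted comma-delimited case described under FS, here's a hedged sketch (the input layout is an assumption, not people.table):

#!/usr/local/bin/mawk -We
BEGIN {
 FS = "\",\""  # fields are separated by ","
}
{
 gsub(/^"|"$/, "")  # strip the outer quotes, which also re-splits $0 on FS
 print "Second field: " $2
}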

The following is a simple program that reads each line and, for each field, prepends the field number. Totally useless, but it shows some basics. As always, the BEGIN action tells the program what the file uses as a field separator -- in this case the tab character.
#!/usr/local/bin/mawk -We
BEGIN {
 FS = "\x09" # Define fields as separated by tabs
}
{
 tempstring = "1:" $1
 for(i=2; i<=NF; i++){
  tempstring = tempstring "\x09" i ":" $i
 }
 print tempstring
}

You'll notice the loop, where i is initialized to 2 and iterates up through the number of fields. Fields are 1-based: $1 is the first field, while $0 is the whole line. The reason we start at 2 is that field 1 was put in tempstring manually -- the first field is a special case because it is not preceded by a field separator.

Speaking of output field separators, you might wonder why we didn't use OFS. It turns out we aren't using $0 here, so OFS wouldn't have done us much good. We'll use it later.

The reason you must use the temporary string is that a print statement always appends a newline (the output record separator, ORS), so you can't use print to emit one field at a time.

Anyway, after tempstring has been constructed, it is printed. The output follows:

[slitt@mydesk awk]$ cat people.table | ./hello.awk
1:person_id 2:lname 3:fname 4:job_id
1:1001 2:Strozzi 3:Carlo 4:1
1:1002 2:Torvalds 3:Linus 4:1
1:1003 2:Stallman 3:Richard 4:1
1:1004 2:Litt 3:Steve 4:2
1:1005 2:Bush 3:George 4:3
1:1006 2:Clinton 3:Bill 4:3
1:1007 2:Reagan 3:Ronald 4:3
1:1008 2:Cheney 3:Dick 4:4
1:1009 2:Gore 3:Al 4:4
[slitt@mydesk awk]$

You can use the printf statement to do the same thing without using the temporary string:

#!/usr/local/bin/mawk -We
BEGIN {
 FS = "\x09" # Define fields as separated by tabs
}
{
 printf("1:%s", $1)
 for(i=2; i<=NF; i++){
  printf("\x09%d:%s", i, $i)
 }
 printf("\n")
}

The preceding code, using printf but no temporary string, produces the exact same output, as follows:

[slitt@mydesk awk]$ cat people.table | ./hello.awk
1:person_id 2:lname 3:fname 4:job_id
1:1001 2:Strozzi 3:Carlo 4:1
1:1002 2:Torvalds 3:Linus 4:1
1:1003 2:Stallman 3:Richard 4:1
1:1004 2:Litt 3:Steve 4:2
1:1005 2:Bush 3:George 4:3
1:1006 2:Clinton 3:Bill 4:3
1:1007 2:Reagan 3:Ronald 4:3
1:1008 2:Cheney 3:Dick 4:4
1:1009 2:Gore 3:Al 4:4
[slitt@mydesk awk]$

Here's an even trickier way to do it -- modify the fields in place and print $0, using OFS to make sure tabs get printed between fields. No more treating $1 as a special case:

#!/usr/local/bin/mawk -We
BEGIN {
 FS = "\x09" # Define fields as separated by tabs
 OFS = "\x09"
}
{
 for(i=1; i<=NF; i++){
  $i = i ":" $i
 }
 print $0
}
[slitt@mydesk awk]$ cat people.table | ./hello.awk
1:person_id 2:lname 3:fname 4:job_id
1:1001 2:Strozzi 3:Carlo 4:1
1:1002 2:Torvalds 3:Linus 4:1
1:1003 2:Stallman 3:Richard 4:1
1:1004 2:Litt 3:Steve 4:2
1:1005 2:Bush 3:George 4:3
1:1006 2:Clinton 3:Bill 4:3
1:1007 2:Reagan 3:Ronald 4:3
1:1008 2:Cheney 3:Dick 4:4
1:1009 2:Gore 3:Al 4:4
[slitt@mydesk awk]$

To show the utility of OFS, here's the output if, in the BEGIN section, we set OFS to the string "|||":

[slitt@mydesk awk]$ cat people.table | ./hello.awk
1:person_id|||2:lname|||3:fname|||4:job_id
1:1001|||2:Strozzi|||3:Carlo|||4:1
1:1002|||2:Torvalds|||3:Linus|||4:1
1:1003|||2:Stallman|||3:Richard|||4:1
1:1004|||2:Litt|||3:Steve|||4:2
1:1005|||2:Bush|||3:George|||4:3
1:1006|||2:Clinton|||3:Bill|||4:3
1:1007|||2:Reagan|||3:Ronald|||4:3
1:1008|||2:Cheney|||3:Dick|||4:4
1:1009|||2:Gore|||3:Al|||4:4
[slitt@mydesk awk]$

DANGER WILL ROBINSON

The OFS variable takes effect only if one of the fields is changed, or at least meddled with. A simple $1=$1 will do, but you must meddle with the line in some way to force awk to rebuild $0.

An exception to this inconvenience occurs when you use commas in a print statement such as this:
print $3,$2, $4
The preceding print statement will honor OFS.

Now that you know about FS and OFS, and that little secret about meddling with a field, it's easy to write a program whose output differs from the input only in its field separator:

#!/usr/local/bin/mawk -We
BEGIN {
 FS = "\x09"
 OFS = "|||"
}
{
 $1=$1
 print $0
}

The preceding code yields the following output:

[slitt@mydesk awk]$ cat people.table | ./hello.awk
person_id|||lname|||fname|||job_id
1001|||Strozzi|||Carlo|||1
1002|||Torvalds|||Linus|||1
1003|||Stallman|||Richard|||1
1004|||Litt|||Steve|||2
1005|||Bush|||George|||3
1006|||Clinton|||Bill|||3
1007|||Reagan|||Ronald|||3
1008|||Cheney|||Dick|||4
1009|||Gore|||Al|||4
[slitt@mydesk awk]$

Break Logic

Did you take DP 101? Remember how break logic struck fear into your heart? Where do you set the "last" variables? Where do you print the header -- the footer? Awk's pretty darned good at break logic. 

In order to demonstrate break logic, let's make a test file called test.file. The test file has three fields, each a four-digit number. The leftmost field has values of 1001 through 1004, the middle field 2001 through 2004, and the rightmost 3001 through 3004.

The test file could be made with any programming language, but since this is an awk program, let's do it in awk:
#!/usr/local/bin/mawk -We
BEGIN {
 srand(44)
 for(line=1; line <= 12; line++){
 printf "%d%s", (int(4*rand()) + 1001), ":::"
 printf "%d%s", (int(4*rand()) + 2001), ":::"
 printf "%d\n", (int(4*rand()) + 3001)
 }
 exit(0)
}

The preceding seeds the random number generator with a constant (obviously a real program would seed with the time of day or some other varying value -- see the aside after the listing -- but this is just an exercise). Then 12 lines of 3 fields each are printed, each field random within a tight range. When sorted, these lines make perfect fodder for break logic. This all happens in the BEGIN section because we don't want to require (or even process) input, and for the same reason we exit the program after completing the task. Here's how this is used to create test.file:
[slitt@mydesk awk]$ ./hello.awk | sort > test.file
[slitt@mydesk awk]$ cat test.file
1001:::2002:::3002
1001:::2004:::3002
1001:::2004:::3004
1002:::2002:::3002
1002:::2004:::3001
1003:::2001:::3002
1003:::2002:::3004
1003:::2003:::3001
1003:::2003:::3004
1003:::2004:::3001
1004:::2001:::3003
1004:::2004:::3001
[slitt@mydesk awk]$
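
As an aside, a real program would seed from the time of day; calling srand() with no argument does exactly that. A minimal sketch:

#!/usr/local/bin/mawk -We
BEGIN {
 srand()  # no argument: seed from the current time
 print int(4*rand()) + 1001  # varies from run to run
}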

Now that we have our test file, let's print out every line, but also, for each change in the first field, print a footer telling how many lines had that number:
#!/usr/local/bin/mawk -We
BEGIN {
 FS = ":::"
 OFS = ":::"
 lastdollar1 = "..INIT.."
 fieldcount = 0
}

$1 != lastdollar1 && NR > 1{
 print "There were " fieldcount " lines with value " lastdollar1 "."
 print ""
}
$1 != lastdollar1{
 lastdollar1 = $1
 fieldcount = 0
}

{
 print $0
 fieldcount++
}

END{
 print "There were " fieldcount " lines with value " lastdollar1 "."
}

In the preceding, the BEGIN section sets the field separators, then initializes the break variable (lastdollar1) and the count (fieldcount).

The first rule says that if $1 has changed and this isn't the first line, print the footer. The NR > 1 restriction is so you don't print a footer for the break variable's initial value. The next rule says that if $1 has changed, set the break variable to the new value and zero the count.

The always-true rule's action just prints the line and increments the total.

The END section prints a final total, because no $1 change occurred after the last line.

You might prefer to combine the first two rules into a single rule when $1 changes, and put in an if statement so the header doesn't print on the first line. That code would look something like this:
$1 != lastdollar1{
 if(NR > 1){
 print "There were " fieldcount " lines with value " lastdollar1 "."
 print ""
 }
 lastdollar1 = $1
 fieldcount = 0
}

Use whichever you think is more readable. The following is the output:

[slitt@mydesk awk]$ cat test.file | ./hello.awk
1001:::2002:::3002
1001:::2004:::3002
1001:::2004:::3004
There were 3 lines with value 1001.

1002:::2002:::3002
1002:::2004:::3001
There were 2 lines with value 1002.

1003:::2001:::3002
1003:::2002:::3004
1003:::2003:::3001
1003:::2003:::3004
1003:::2004:::3001
There were 5 lines with value 1003.

1004:::2001:::3003
1004:::2004:::3001
There were 2 lines with value 1004.
[slitt@mydesk awk]$

Here's a small simplification that moves the break initialization out of BEGIN and into a rule testing for line 1:
#!/usr/local/bin/mawk -We
BEGIN {
 FS = ":::"
 OFS = ":::"
}

NR == 1{
 lastdollar1 = $1
 fieldcount = 0
}

$1 != lastdollar1 {
 print "There were " fieldcount " lines with value " lastdollar1 "."
 print ""
 lastdollar1 = $1
 fieldcount = 0
}

{
 print $0
 fieldcount++
}

END{
 print "There were " fieldcount " lines with value " lastdollar1 "."
}

The output's the same as before, so it won't be listed again.

Tealeaves Programs

Now for a tealeaves problem: print each group's total above the lines it summarizes, instead of below.

When I was a junior programmer, a DP Manager presented me with just such a task, and I said, "What do you expect the program to do, read some tealeaves to guess what the total will be after everything's counted?" The Lead Programmer, my mentor, just looked on and smiled. He knew what the DP Manager's answer would be, and he knew how I would respond.

The DP Manager said "The customers need it above, so do whatever you have to do, but get it done!"

I said "It's impossible!" and slinked off to try to find some way to comply with her request. The Lead Programmer smiled -- he knew I was getting some schooling.

If your computer career started after 1995, you should know that back in those days, computer memory was scarce and costly. Building a table within memory was not an option. This thing had to read a record, then write a record.

A few days later I came back, with the header printed before the data. The DP Manager said "I knew you could do it!" The Lead Programmer smiled -- he knew I could do it too.

Can you guess what I did? I didn't do it in memory, and it wasn't particularly difficult. I did, however, have to write two programs. Here's an example with the data discussed previously in this section:

First, addfields.awk:
#!/usr/local/bin/mawk -We
BEGIN {
 FS = ":::"
 OFS = ":::"
}

NR == 1{
 lastdollar1 = $1
 fieldcount = 0
}

$1 != lastdollar1 {
 print lastdollar1,"10", $2, $3, fieldcount
 print lastdollar1,"12", $2, $3, fieldcount
 lastdollar1 = $1
 fieldcount = 0
}

{
 print $1, "11", $2, $3, "0"
 fieldcount++
}

END{
 print lastdollar1,"10", $2, $3, fieldcount
 print lastdollar1,"12", $2, $3, fieldcount
}

Next, addheaders.awk:

#!/usr/local/bin/mawk -We
BEGIN {
 FS = ":::"
 OFS = ":::"
}

$2 == "10"{
 print "There are " $5 " lines with value " $1 "."
}

$2 == "11"{
 print $1,$3,$4
}
$2 == "12"{
 print "This concludes value " $1 "."
 print ""
}

Now watch this:

[slitt@mydesk awk]$ cat test.file | ./addfields.awk | sort | ./addheaders.awk
There are 3 lines with value 1001.
1001:::2002:::3002
1001:::2004:::3002
1001:::2004:::3004
This concludes value 1001.

There are 2 lines with value 1002.
1002:::2002:::3002
1002:::2004:::3001
This concludes value 1002.

There are 5 lines with value 1003.
1003:::2001:::3002
1003:::2002:::3004
1003:::2003:::3001
1003:::2003:::3004
1003:::2004:::3001
This concludes value 1003.

There are 2 lines with value 1004.
1004:::2001:::3003
1004:::2004:::3001
This concludes value 1004.

[slitt@mydesk awk]$

I put in fields that, after sorting, would provide a header and a footer record complete with the total. Therefore, printing the header is simply a matter of detecting the header line ($2 == "10") and printing the header information.

What if I didn't want the output sorted on fields $3 and $4? No problem -- I could have inserted an extra field after $2 corresponding to the original line order, and the data would have sorted just like the original.
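
Here's a hedged sketch of that variation -- the always-true rule of addfields.awk embedding a zero-padded line number as an extra sort key (the %06d width is an assumption about maximum file size):

{
 print $1, "11", sprintf("%06d", NR), $2, $3, "0"
 fieldcount++
}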

The preceding demonstrates a fundamental part of the awk philosophy -- if an algorithm starts looking too complex, split it into separate programs and run them both -- usually through a pipeline.

Memory was an issue in 1984, and in certain circumstances it's still an issue today. What if you had half a million lines in the file? Would you really want to build up an in-memory header table? Probably not.

But what if it gets really big -- maybe 10 million rows? Now even the sort is problematic. What to do?

In that case you'd modify addfields.awk to write two files -- one containing header and footer lines without total data, and one containing totals for each distinct key. Then you'd change addheaders.awk into a merge that, upon encountering a header flag in the main file, inserts a header based on the next line of the merge file.

Awk isn't especially well suited to a merge algorithm -- you might want to do the merge in Perl, Python or Ruby, or for speed you might want to do it in C. Remember, the addfields.awk program can be tweaked to write trivially parsable data easily digestible by the C program.

The following is a merge solution, with both programs written in awk:
#!/usr/local/bin/mawk -We
BEGIN {
 FS = ":::"
 OFS = ":::"
 mergefn = "temp.tmp" # MERGE FILE FILENAME
}

NR == 1{
 lastdollar1 = $1
 fieldcount = 0
 print $1,"10" # PRINT MAIN FILE HEADER FOR FIRST KEY
}

$1 != lastdollar1 {
 # PRINT TOTAL FOR LAST KEY TO MERGE FILE
 print lastdollar1, fieldcount > mergefn

 # PRINT LAST KEY'S FOOTER FLAG REC TO STDOUT
 print lastdollar1, "12"

 # PRINT NEW KEY'S HEADER FLAG REC TO STDOUT
 print $1,"10"

 lastdollar1 = $1
 fieldcount = 0
}

{
 print $1, "11", $2, $3
 fieldcount++
}

END{
 # PRINT FINAL KEY'S TOTAL TO MERGE FILE
 print lastdollar1, fieldcount > mergefn

 # PRINT FINAL KEY'S FOOTER FLAG TO STDOUT
 print lastdollar1, "12"
}
The preceding code is the first part of the merge version of the tealeaves algorithm. The only change to the BEGIN section is the addition of the filename that will hold each key group's totals.

The NR==1 rule functions to write the first main file (stdout) header flag. The purpose of the header and footer flags is to simplify the algorithm in the next program in the pipeline. By having this header marker, the next program down the pipe can print a header, and nothing but a header, confident that all data lines will follow.

The $1!=lastdollar1 rule prints the total for the last key to the merge file, then prints the footer flag for the last key to the main file, then prints the header for the new key. Lastly, it resets the break variable and zeros the total.

The always true rule prints a data record.

The END rule prints the final key's total to the merge file, and prints the footer flag for the final key to the main file.

The preceding code does two things:
  1. Writes total to a merge file in the same order as the keys in the main file
  2. Writes dummy header and footer flags to the main file. These dummy records eliminate the need for the next program in the pipeline to do any break logic, because the flags signal the beginning and end of the key group.
The following is the output of this program:

Stdout:
[slitt@mydesk awk]$ cat test.file | ./addfields.awk
1001:::10
1001:::11:::2002:::3002
1001:::11:::2004:::3002
1001:::11:::2004:::3004
1001:::12
1002:::10
1002:::11:::2002:::3002
1002:::11:::2004:::3001
1002:::12
1003:::10
1003:::11:::2001:::3002
1003:::11:::2002:::3004
1003:::11:::2003:::3001
1003:::11:::2003:::3004
1003:::11:::2004:::3001
1003:::12
1004:::10
1004:::11:::2001:::3003
1004:::11:::2004:::3001
1004:::12
[slitt@mydesk awk]$

Merge file:

[slitt@mydesk awk]$ cat temp.tmp
1001:::3
1002:::2
1003:::5
1004:::2
[slitt@mydesk awk]$

The following is the code to actually add the headers and footers:
#!/usr/local/bin/mawk -We
BEGIN {
 FS = ":::"
 OFS = ":::"
 mergefn = "temp.tmp" # MERGE FILE FILENAME
}

$2 == "10"{
 getline mrg < mergefn
 split(mrg, keytot, ":::")
 if(keytot[1] != $1){
  print "Internal error: Main record doesnt match merge, aborting..."
  print " Main key: " $1
  print "Merge key: " keytot[1]
  exit 1
 }
 curtot = keytot[2]
 print "There are " curtot " lines with value " $1 "."
}

$2 == "11"{
 print $1,$3,$4
}
$2 == "12"{
 print "This concludes value " $1 "."
 print ""
}
The preceding code identifies the merge file in the BEGIN section.

The three rules correspond to the three types of records -- header flags (10), data lines (11) and footer flags (12). You can see all three in the output preceding this code. All three are mutually exclusive, so there's never a need to fall through and execute anything else. This makes the algorithm incredibly simple.

On encountering a header flag, the program reads the next line of the merge file and uses its data to write the top header.

On encountering a data line, the line is simply printed. On encountering a footer flag line, a footer is printed.

The preceding two pieces of code produce the following output:

[slitt@mydesk awk]$ cat test.file | ./addfields.awk | ./addheaders.awk
There are 3 lines with value 1001.
1001:::2002:::3002
1001:::2004:::3002
1001:::2004:::3004
This concludes value 1001.

There are 2 lines with value 1002.
1002:::2002:::3002
1002:::2004:::3001
This concludes value 1002.

There are 5 lines with value 1003.
1003:::2001:::3002
1003:::2002:::3004
1003:::2003:::3001
1003:::2003:::3004
1003:::2004:::3001
This concludes value 1003.

There are 2 lines with value 1004.
1004:::2001:::3003
1004:::2004:::3001
This concludes value 1004.

[slitt@mydesk awk]$ 

The preceding merge algorithm requires two passes through the data plus a pass through the merge file. No sort, no storage of multiple data pieces -- this algorithm is suitable for tens of millions of lines of data, as long as the initial data is properly sorted.

Multifile Programs

Awk has what could be called a feature, or could be called a curse: it automatically considers each command line argument a file to process, and processes them in order. It's a curse if you absolutely need to read two files at once (a true merge); otherwise it's a feature, and a darned nice one.

Here's a contrived example. The first file is a configuration file listing all fields to be output, and their order. Call it people.config:

[slitt@mydesk awk]$ cat people.config
job_id
lname
fname   
[slitt@mydesk awk]$

As a reminder, here's people.table:
^Aperson_id     ^Alname ^Afname ^Ajob_id
1001    Strozzi Carlo   1
1002    Torvalds        Linus   1
1003    Stallman        Richard 1
1004    Litt    Steve   2
1005    Bush    George  3
1006    Clinton Bill    3
1007    Reagan  Ronald  3
1008    Cheney  Dick    4
1009    Gore    Al      4

Let's start with a trivial program to display the file number (which argument), the line number within that file, the line number overall (cumulative starting from the first file), and the line itself: 
#!/usr/local/bin/mawk -We

BEGIN{filenumber = 0}
FNR==1{filenumber++}
{print filenumber, FNR, NR, FILENAME, $0}

The preceding program produces the following output:

[slitt@mydesk awk]$ ./hello.awk people.config people.table 
1 1 1 people.config job_id
1 2 2 people.config lname
1 3 3 people.config fname       
2 1 4 people.table person_id    lname   fname   job_id
2 2 5 people.table 1001 Strozzi Carlo   1
2 3 6 people.table 1002 Torvalds        Linus   1
2 4 7 people.table 1003 Stallman        Richard 1
2 5 8 people.table 1004 Litt    Steve   2
2 6 9 people.table 1005 Bush    George  3
2 7 10 people.table 1006        Clinton Bill    3
2 8 11 people.table 1007        Reagan  Ronald  3
2 9 12 people.table 1008        Cheney  Dick    4
2 10 13 people.table 1009       Gore    Al      4
[slitt@mydesk awk]$

Notice that the first number is the file number. In other awk versions, such as gawk, this file number is supplied automatically as ARGIND. However, mawk doesn't provide that variable, so simple break logic was used to increment the file number. The second number is FNR, the line number in the current file. The third is NR, the line number processed cumulatively across all files so far. The next value is FILENAME, the name of the current file. The final content of each line is the original line content.
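
Under gawk, the demonstration shrinks to a one-liner, because ARGIND is maintained automatically:

gawk '{print ARGIND, FNR, NR, FILENAME, $0}' people.config people.table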

The purpose of this trivial program is to familiarize you with the variables used in multifile programs. Now let's write a simple program that uses the first file for configuration, and the second for data. Specifically, it will print out only the fields described in the first file, in the order described in the first file. Here it is:
#!/usr/local/bin/mawk -We

BEGIN{
 filenumber = 0
 FS="\x09"
 OFS=":::"
}

FNR==1{filenumber++}
filenumber==1{
 fields[FNR+1000] = $1
 fields[0] = FNR + 1000
 next     # No lines from first file get below here
}

FNR==1{
 newheader=""
 for(fn=1001;fn<=fields[0];fn++){
  for(i=1;i<=NF;i++){
   sub(/^\x01/, "", $i)
   if(fields[fn] == $i){
    if(fn==1001)
     newheader = "\x01" $i
    else
     newheader = newheader "\x09\x01" $i
    fields[fn] = i  # make it a translate table
   }
  }
 }
 print newheader
 next # don't print this line, it's a header already printed
}

{
 for(fn=1001;fn<=fields[0];fn++){
  if(fn==1001)
   printf("%s", $fields[fn])
  else
   printf("%s%s", "\x09",$fields[fn])
 }
 print ""

}
The BEGIN section is business as usual, except that it initializes filenumber to set up the file number break logic.

The FNR==1 rule increments the file number when FNR drops back to 1, meaning a new file has been encountered. Remember, according to gawk's documentation, gawk provides you automatically with ARGIND to take the place of filenumber.

The rule for the first file does nothing but load an array called fields with the field names listed in that file. This configures the program to print those fields in that order. The reason fields[0] is set to FNR+1000 is so that an upper limit can be recorded without using another global variable. The reason 1000 is added to all subscripts is so that subscripts compare correctly whether the comparison is done as strings or as numbers.

The second FNR==1 rule, which only lines of the second file can reach, is where the fields array is turned into a translation table, relating the input file's fields to the output file's fields. It also creates a header for the output file. Because all data in the first line of the second file has already been used, there's no need for further printing, so this action is terminated with a next statement.

That leaves the always-true action. It goes through the fields array in order, printing the input field corresponding to each element of fields. This is a very efficient algorithm, with no comparisons done in the processing of data lines.

The preceding code produces the following output:

[slitt@mydesk awk]$ ./hello.awk people.config people.table 
job_id  lname   fname
1       Strozzi Carlo
1       Torvalds        Linus
1       Stallman        Richard
2       Litt    Steve
3       Bush    George
3       Clinton Bill
3       Reagan  Ronald
4       Cheney  Dick
4       Gore    Al
[slitt@mydesk awk]$

The preceding output is just what's expected -- the lines are in original order, but the fields are those that appeared in the config file, ordered the same as they were in the config file.

The preceding was a demonstration of the use of multiple files. Once again, this works only when the files can be read consecutively -- if they need to be read concurrently, you need to use other techniques, such as those in the merge algorithm in the Break Logic section of this document.
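
The heart of any such technique is getline from a named file, which reads a second file independently of the main input loop. A minimal sketch, assuming a hypothetical second file named extra.txt with one line per main input line:

#!/usr/local/bin/mawk -We
{
 if((getline extraline < "extra.txt") > 0)  # returns 1 on success, 0 at EOF, -1 on error
  print $0 " ::: " extraline
 else
  print $0 " ::: (extra.txt exhausted)"
}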

Non-File Command Line Arguments

As mentioned, command line arguments are treated by default as files to process. Normally, you should try to use that fact -- the first file can be a config file. In addition, you can feed information into an awk program via environment variables.
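
Here's a minimal sketch of the environment variable approach, using awk's built-in ENVIRON array (the MOOD variable name is an assumption for illustration):

#!/usr/local/bin/mawk -We
BEGIN {
 print "Your mood is " ENVIRON["MOOD"]
}

Run it as MOOD=happy ./hello.awk and it prints "Your mood is happy".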

Once again, to use command line arguments as non-files is fighting awk. Nevertheless, sometimes it's necessary, and when it is, you can do it.

First, let's explore what command line arguments do:

#!/usr/local/bin/mawk -We

BEGIN{
 for(i=0; i<=ARGC; i++){
  print "ARGV["i "]=" ARGV[i]
 }
}

The preceding code produces the following output.

[slitt@mydesk awk]$ ./hello.awk --mood=happy people.config --job="Awk Professor" people.table
ARGV[0]=mawk
ARGV[1]=--mood=happy
ARGV[2]=people.config
ARGV[3]=--job=Awk Professor
ARGV[4]=people.table
ARGV[5]=
[slitt@mydesk awk]$

As you can see, ARGC is the total number of arguments, including the program name. ARGV[0] is the program name, while ARGV[1] through ARGV[ARGC-1] are the actual command line arguments.

The following awk program's BEGIN section takes all arguments beginning with -- and puts them either in the options hash (for options with equal signs and values), or the flags array (for those without values), or the args array (for genuine command line arguments). It then uses the args array to rewrite the ARGV[] array, and reduces ARGC appropriately. The remaining ARGV[] arguments are assumed to be genuine filenames, which this program demonstrates:

#!/usr/local/bin/mawk -We

BEGIN{
 ### FROM CMD LINE, LOAD flags[], args[] and options[]
 flags[0] = 0
 args[0]=0
 for(arg=1; arg<ARGC; arg++){
  if(ARGV[arg] ~ /^--/){
    sub(/^--/, "", ARGV[arg])
    if(ARGV[arg] ~/=/){
     sub(/=/, "..EQUAL..", ARGV[arg])
     split(ARGV[arg], temparray, /\.\.EQUAL\.\./)
     options[temparray[1]] = temparray[2]
    } else {
     flags[++flags[0]] = ARGV[arg]
    }
  } else {
   args[0]++
   args[args[0]] = ARGV[arg]
  }
 }

 ### RESET ARGC AND ARGV FOR REAL ARGS
 for(i=1; i <= args[0]; i++)
  ARGV[i] = args[i]
 ARGC = args[0] + 1

 ### DIAGNOSTIC: PRINT flags, options AND ARGV
 print "\nARGS:"
 for(i=1;i<=ARGC-1;i++)
  print "ARGV[" i "]=" ARGV[i]
 print "\nFLAGS:"
 for(i=1; i <= flags[0]; i++)
  print flags[i]
 print "\nOPTIONS:"
 for(i in options)
  print i "=" options[i]

 ### ZAP UNNEEDED GLOBAL VARS
 for(i in temparray) delete temparray[i]
 delete temparray
 i=NULL
 arg=NULL

 ### ANNOUNCE COMMENCEMENT OF READING FILES ###
 print "\nREADING FROM FILES"
}

### PRINT FILES NAMED WITH NON -- CMD LINE ARGS
{print FILENAME "   :::  " $0}

The preceding code produces the following output. Note that files one.txt, two.txt, three.txt and four.txt are one-line files. Files junk1, junk2 and junk3 are also one-line files, each containing text stating that an error has happened -- if those files' contents are printed, an error has occurred. Only the text from one.txt, two.txt, three.txt and four.txt should be printed, and printed in their order on the command line.

[slitt@mydesk awk]$ ./args.awk one.txt --junk1=junk2 two.txt --junk3 three.txt --junk3=junk2 four.txt

ARGS:
ARGV[1]=one.txt
ARGV[2]=two.txt
ARGV[3]=three.txt
ARGV[4]=four.txt

FLAGS:
junk3

OPTIONS:
junk1=junk2
junk3=junk2

READING FROM FILES
one.txt   :::  Steve was here
two.txt   :::  and now is gone
three.txt   :::  but left his name
four.txt   :::  to carry on.
[slitt@mydesk awk]$ 

The preceding output is what was expected. The --junk3 prints out as a flag. --junk1=junk2 and --junk3=junk2 print out under options. The four input files print in order.

Functions and Local Variables

Functions aren't as important in awk as in other languages, because awk is meant for short parsing programs. Nevertheless, sometimes they're very helpful, as will be shown later in the Using an Array as a Stack subsection of Using Arrays to Structure Data.

The following is a function Hello World:

#!/usr/local/bin/mawk -We

function circumference(diameter){
 return(3.14159 * diameter)
}

BEGIN{
 print circumference(1)
 print circumference(2)
}

The preceding code produces the expected output:

[slitt@mydesk awk]$ ./hello.awk
3.14159
6.28318
[slitt@mydesk awk]$
The function definition starts with the word function, followed by the function's name, followed by its arguments enclosed in parentheses. The body of the function's code is enclosed in curly braces. If desired, a value is returned via a return statement.

Variables declared or used in the body of the function are global. They overwrite identically named variables in the program's actions (or other functions), and upon entry they have the values of those identically named variables. Often this is not what you want...

The only place it's possible to have local variables is within functions, and only if those local variables are declared within the function's parentheses, after the arguments. In other words, local variables are just extra arguments that are not named in the call to the subroutine. Here's an example:

#!/usr/local/bin/mawk -We

function test(realarg, localvar){
 print "top of test, realarg=" realarg ", localvar=" localvar ", globalarg=" globalarg
 realarg = "set by test"
 localvar = "set by test"
 globalarg= "set by test"
 print "bottom test, realarg=" realarg ", localvar=" localvar ", globalarg=" globalarg
}

BEGIN{
 realarg = "top of begin"
 localvar = "top of begin"
 globalarg= "top of begin"
 print "top of begin, realarg=" realarg ", localvar=" localvar ", globalarg=" globalarg

 test("set in test call")

 print "bottom begin, realarg=" realarg ", localvar=" localvar ", globalarg=" globalarg
}

In the preceding, only the argument and local variable declared in the parentheses of test() are local -- all other variables are global and they clobber each other. The following output proves the point:

[slitt@mydesk awk]$ ./hello.awk
top of begin, realarg=top of begin, localvar=top of begin, globalarg=top of begin
top of test, realarg=set in test call, localvar=, globalarg=top of begin
bottom test, realarg=set by test, localvar=set by test, globalarg=set by test
bottom begin, realarg=top of begin, localvar=top of begin, globalarg=set by test
[slitt@mydesk awk]$

Using Arrays to Structure Data

Awk has no keywords like struct, class, typedef or the like. The only data structure keyword it has is array. At first you might conclude that awk lacks even rudimentary structures. That's not quite true, because you can use awk arrays to represent arrays, hashes, structures and stacks. These usages are a result of awk arrays' ability to take either a number or a string as a subscript. The fact that they can take a number enables them to be used as an array or stack. The fact that they can take strings as subscripts enables them to be used as a hash or structure.

Using an Array as an Array

Here's some code to implement an array, followed by the results.
#!/usr/local/bin/mawk -We

BEGIN{
 myarray["firstss"] = 1001
 myarray[1001] = "one"
 myarray[1002] = "two"
 myarray[1003] = "three"
 myarray[1004] = "four"
 myarray[1005] = "five"
 myarray[1006] = "six"
 myarray[1007] = "seven"
 myarray[1008] = "eight"
 myarray[1009] = "nine"
 myarray[1010] = "ten"
 myarray["lastss"] = 1010
 for(i in myarray){
  print "Element " i "=" myarray[i] "."
 }
 printelements(myarray)
 exit 0
}

function printelements(arr, ss){
 print "================"
 for(ss=arr["firstss"]; ss <= arr["lastss"]; ss++){
  print "Element " ss "=" arr[ss] "."
 }
}

[slitt@mydesk awk]$ ./test.awk
Element lastss=1010.
Element 1001=one.
Element 1002=two.
Element firstss=1001.
Element 1010=ten.
Element 1003=three.
Element 1004=four.
Element 1005=five.
Element 1006=six.
Element 1007=seven.
Element 1008=eight.
Element 1009=nine.
================
Element 1001=one.
Element 1002=two.
Element 1003=three.
Element 1004=four.
Element 1005=five.
Element 1006=six.
Element 1007=seven.
Element 1008=eight.
Element 1009=nine.
Element 1010=ten.
[nosql_slitt@mydesk awk]$

Notice the subscripts start at 1001 rather than 1. This is because awk sometimes converts numbers to strings and compares them alphabetically rather than numerically, in which case 2 would be considered bigger than 10. Adding a large number guarantees that all subscripts have the same number of digits and therefore sort correctly both numerically and alphabetically.
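
Here's a minimal sketch of the pitfall:

#!/usr/local/bin/mawk -We
BEGIN {
 print ("10" < "2")   # prints 1: two strings compare alphabetically
 print (1010 < 1002)  # prints 0: two numbers compare numerically
}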

Notice also that the for(i in myarray) loop doesn't print in numerical order. Awk stores array elements in a hash, in the hash's internal order, not in subscript order. Therefore, to print in subscript order, one must iterate from the first to the last subscript, which means the first and last subscripts must be stored. In this case they were stored in myarray["firstss"] and myarray["lastss"].

In gawk you can also use a function called asorti(array, newarray) to access elements in order, but that can be a hassle, requiring a new variable (and remember, all variables are global except in special circumstances).
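
A hedged sketch of that approach, which must run under gawk because mawk lacks asorti():

gawk 'BEGIN {
 a[1003] = "three"; a[1001] = "one"; a[1002] = "two"
 n = asorti(a, sorted)  # sorted[1..n] holds the subscripts in order
 for(i = 1; i <= n; i++)
  print sorted[i] "=" a[sorted[i]]
}'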

If you know you'll never need to access an array in an ordered way, you can simply use numbers starting with 1, and use the for(i in myarray) method.

Using an Array as a Hash

In fact, awk arrays are always hashes (also called associative arrays), because they take arbitrary strings as subscripts. Hashes are powerful data elements. They can be used to "ringtoss" different types of events, coming out with a count of occurrences for each event (see the sketch below). Or they can be used to simulate a structure, or nested structures...
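
Here's a minimal sketch of that tallying idiom, counting how many people in people.table hold each job_id (assuming the field layout shown earlier):

#!/usr/local/bin/mawk -We
NR > 1 { jobs[$4]++ }  # skip the header line, tally field 4
END {
 for(j in jobs)
  print "job_id " j ": " jobs[j] " people"
}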

Using an Array as a Structure

A hash can be used as a simple structure. Consider these two implementations of a simple person structure, the first in C and the second in awk. First, the C version:
#include <stdio.h>

struct person {
 char * lname;
 char * fname;
 char * phone;
};

void printperson (struct person *p){
 printf("%s\n", p->lname);
 printf("%s\n", p->fname);
 printf("%s\n", p->phone);
}
int main(int argc, char * argv[]){
 struct person p;
 p.lname="Litt";
 p.fname="Steve";
 p.phone="123-456-7890";
 printperson(&p);
}

The awk version:

#!/usr/local/bin/mawk -We

function printperson(p){
 print p["lname"]
 print p["fname"]
 print p["phone"]
}

BEGIN{
 person["lname"] = "Litt"
 person["fname"] = "Steve"
 person["phone"] = "123-456-7890"
 printperson(person)
 exit 0
}

The results:

[slitt@mydesk awk]$ gcc test.c
[slitt@mydesk awk]$ ./a.out
Litt
Steve
123-456-7890
[slitt@mydesk awk]$
[slitt@mydesk awk]$ ./test.awk
Litt
Steve
123-456-7890
[slitt@mydesk awk]$

The preceding imitates a C struct using an awk array.

The real power of data driven design occurs with levels of abstraction, where structs contain other structs containing yet other structs...

Because awk supports multidimensional arrays, it can imitate such levels of abstraction.

The following code shows how a multilevel array, indexed sometimes by strings and sometimes by numbers, can be used to simulate an entire data structure (a person's first and last name, and the various components of their address). In this case, an array of people has been loaded with one person's info, keyed by "Fred", and function printaddress() prints Fred's mailing address:
#!/usr/local/bin/mawk -We

BEGIN{

 people["Fred", "lname"]="Johnson"
 people["Fred", "fname"]="Fred"
 people["Fred", "address", "firstss"]=101
 people["Fred", "address", 101]="331 West Main Street"
 people["Fred", "address", 102]="Apartment 2"
 people["Fred", "address", "lastss"]=102
 people["Fred", "address", "city"]="Melbourne"
 people["Fred", "address", "state"]="FL"
 people["Fred", "address", "country"]="USA"
 people["Fred", "address", "zip"]="33333"

 printaddress(people, "Fred")
 exit 0
}

function printaddress(person_hash, person_id, line){
 print "ATTN: " person_hash[person_id, "lname"] " " person_hash[person_id, "fname"]
 for(i=person_hash[person_id,"address","firstss"]; i<=person_hash[person_id,"address","lastss"]; i++){
  print person_hash[person_id, "address", i]
 }
 line = person_hash[person_id, "address", "city"] ", "
 line = line person_hash[person_id, "address", "state"] ", "
 line = line person_hash[person_id, "address", "country"] ", "
 line = line person_hash[person_id, "address", "zip"]
 print line
}

The preceding code produces the following output, which is the address you'd expect:

[slitt@mydesk awk]$ ./test.awk
ATTN: Johnson Fred
331 West Main Street
Apartment 2
Melbourne, FL, USA, 33333
[slitt@mydesk awk]$

It's obvious that languages enabling declarations of whole classes, as opposed to awk's declaration of a single instance using arrays, would be advantageous. Nevertheless, awk enables the programmer to keep and organize significant amounts of data in an understandable way.

Using an Array as a Stack

Stacks are wonderful. You can simulate recursion using stacks. Stacks act like Tom Sawyer's and Becky Thatcher's ball of string, returning through twists and turns to the place you started, then allowing you to explore again. You can easily implement a stack in awk. Here's a stack implementation using local variables so as not to pollute the namespace:
#!/usr/local/bin/mawk -We

# THIS CODE IS PUBLIC DOMAIN, NO WARRANTY!

function push(stack, value){
 stack[++stack["lastss"]] = value
}

function pop(stack, locx){
 if(stack["firstss"] > stack["lastss"]){
  return NULL # stack spent
 } else {
  locx = stack[stack["lastss"]]
  delete stack[stack["lastss"]--]
  return locx
 }
}

function stacklook(stack, num){
 if(stack["firstss"] > stack["lastss"])
  return NULL # stack spent
 if(num <= 0){
  num = stack["lastss"] + num
  if(num < stack["firstss"]) return NULL
  return stack[num]
 } else {
  num = stack["firstss"] + num - 1
  if(num > stack["lastss"]) return NULL
  return stack[num]
 }
}

function stackoutofrange(stack, num){
 if(stack["firstss"] > stack["lastss"])
  return 3 # stack spent
 if(num <= 0){
  if(stack["lastss"] + num < stack["firstss"])
   return -1
  else
   return 0
 } else {
  if(num + stack["firstss"] > stack["lastss"] + 1)
   return 1
  else
   return 0
 }
}


function stackspent(stack){
 return (stack["firstss"] > stack["lastss"])
}

BEGIN{
 ### INITIALIZE mystack ###
 mystack["firstss"] = 10001
 mystack["lastss"] = 10000

 push(mystack, "one")
 push(mystack, "two")
 push(mystack, "three")
 push(mystack, "four")
 push(mystack, "five")

 print "===== TESTING stackoutofrange() BELOW ========"
 for(i=-7; i < 8; i++){
 print "range(" i ") returns ", stackoutofrange(mystack, i)
 }

 print "===== POSITIVE STACKLOOKS BELOW ========"
 for(i=1; !stackoutofrange(mystack, i); i++)
 {print "pos " stacklook(mystack, i)}

 print "\n===== NEGATIVE STACKLOOKS BELOW ========"

 for(i=0; !stackoutofrange(mystack,i); i--)
 {print "neg " stacklook(mystack,i)}

 print "\n===== POPS BELOW ========"

 while(!stackspent(mystack))
 {print "pop " pop(mystack)}

 print "\n===== MORE RANGE TESTING BELOW ========"
 print "===== SHOULD RETURN ALL 3 BECAUSE STACK SPENT ========"
 print "===== TESTING stackoutofrange() BELOW ========"
 for(i=-7; i < 8; i++){
 print "range(" i ") returns ", stackoutofrange(mystack, i)
 }
 exit 0
}
Stack operations are accomplished by the functions push() and pop(). push() simply appends its value argument to the end of the array identified by the stack argument, and increments stack["lastss"] so future pop(), push() and stacklook() calls will work on the right element.

The pop() function deletes the last element from the stack and returns it. Notice the locx "argument". It's not an argument at all -- it's a local variable. In awk, local variables can be declared only in the same parentheses as the arguments, after the arguments.

The stacklook() function is a way to observe the stack non-destructively. stacklook(mystack, 0) returns the element that pop() would return if you called it. stacklook(mystack, 1) returns the most deeply embedded element in the stack -- the last valid pop().

Another way to look at stacklook() is to see the stack as an array instead of a stack. Positive numerical arguments to stacklook() correspond to array subscripts. Negative numerical arguments indicate how far from the end of the array you want to look (the stack interpretation would be how many pops you'd need to do before popping that argument).

Both pop() and stacklook() contain code to return NULL if a numerical argument points to something before or beyond the array comprising the stack, or if the stack has no elements, indicating a spent stack. HOWEVER, that can backfire if a NULL element was pushed -- how can you differentiate a deliberate NULL element from a spent stack or an out-of-bounds numerical argument?

Two functions, stackoutofrange() and stackspent(), are included to check the actual stack rather than testing the return value. stackspent() returns 0 if the stack is not spent, and a positive number otherwise.

 stackoutofrange() returns 0 if the stack is not spent and the numerical argument is within the stack's range. It returns 3 if the stack is spent, 1 if a positive numerical argument is too positive, and -1 if a negative numerical argument is too negative.

When using loops, always test using stackoutofrange() and stackspent(), because you never know, especially during development, whether a NULL has accidentally been pushed onto the stack, or arrived there some other way (via array techniques, for instance).



Viewing the main routine of the preceding: first, five values are pushed onto stack mystack. Then a loop tests potential stacklook() numerical arguments from -7 to 7, determining whether they're out of range. As expected (the output follows this explanation), everything more negative than -4 and more positive than 5 returns non-zero. Next, stacklook() loops are done with positive and then negative numerical arguments. In each case the loop condition relies on stackoutofrange() rather than the stacklook() return.

Next, a pop() loop is run; once again the loop condition is stackspent() rather than a check for a NULL return. Once all the pops have been done, the stack should be spent. This is proven by yet another stackoutofrange() loop with numerical arguments from -7 to 7. This time every call to stackoutofrange() returns 3, indicating a spent stack. The output follows:

===== TESTING stackoutofrange() BELOW ========
range(-7) returns -1
range(-6) returns -1
range(-5) returns -1
range(-4) returns 0
range(-3) returns 0
range(-2) returns 0
range(-1) returns 0
range(0) returns 0
range(1) returns 0
range(2) returns 0
range(3) returns 0
range(4) returns 0
range(5) returns 0
range(6) returns 1
range(7) returns 1
===== POSITIVE STACKLOOKS BELOW ========
pos one
pos two
pos three
pos four
pos five

===== NEGATIVE STACKLOOKS BELOW ========
neg five
neg four
neg three
neg two
neg one

===== POPS BELOW ========
pop five
pop four
pop three
pop two
pop one

===== MORE RANGE TESTING BELOW ========
===== SHOULD RETURN ALL 3 BECAUSE STACK SPENT ========
===== TESTING stackoutofrange() BELOW ========
range(-7) returns 3
range(-6) returns 3
range(-5) returns 3
range(-4) returns 3
range(-3) returns 3
range(-2) returns 3
range(-1) returns 3
range(0) returns 3
range(1) returns 3
range(2) returns 3
range(3) returns 3
range(4) returns 3
range(5) returns 3
range(6) returns 3
range(7) returns 3

So if you need a stack, here it is!
