unix  awk  perl  ruby  python  

Jan 2, 2015 • Michael Chen

Recently, I saw an interesting post on StackOverflow comparing sed, AWK, Perl, and Python, and I decided to learn the “old tricks”. I then tried some simple tasks in AWK, Perl, Ruby, and Python. Why Ruby? Although Ruby is not a standard tool in the traditional Unix toolbox, it has become more popular in recent years and is sometimes viewed as the unadvertised successor of Perl 5.

If you want to try these tasks, go to the UCI Machine Learning Repository and download a dataset. I used the Wine dataset, but any dataset should be fine. The tasks here are just for demonstration; for serious statistical analysis of these datasets, it is probably better to open an R session and feed it some R commands.

Let’s take a glimpse at the Wine dataset:

$ head -n3 wine.data
1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185

As you can see, the dataset is in CSV format. There are 14 variables in total. The first variable is the class of the wine; the other variables are properties of the wine. Since we are not doing a serious analysis here, just ignore these facts and concentrate on the numbers.
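As a quick sanity check of that layout, here is a minimal Python sketch that parses the three sample rows shown above and confirms each one has a class label followed by 13 numeric properties. The helper name `parse_rows` is made up for this illustration; it is not part of the scripts in this post.

```python
import csv
import io

# The three sample rows from the `head -n3 wine.data` output above.
SAMPLE = """\
1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
"""


def parse_rows(text):
    """Split CSV text into (class_label, [float properties]) pairs."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if not fields:  # skip blank lines
            continue
        rows.append((fields[0], [float(x) for x in fields[1:]]))
    return rows


rows = parse_rows(SAMPLE)
for label, props in rows:
    # One class label plus 13 properties makes the 14 variables.
    print(label, len(props))
```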

Say I want to replace the first column with its square. Try this simple AWK one-liner:

$ awk -F',' -v OFS=',' '{ $1 = $1^2; print }' wine.data | tail -n3
9,13.27,4.28,2.26,20,120,1.59,.69,.43,1.35,10.2,.59,1.56,835
9,13.17,2.59,2.37,20,120,1.65,.68,.53,1.46,9.3,.6,1.62,840
9,14.13,4.1,2.74,24.5,96,2.05,.76,.56,1.35,9.2,.61,1.6,560

The first column changes from 3 to 9. Field computations are intuitive in AWK because field splitting and field assignment are built in.

Let’s try the same task with a Perl one-liner:

$ perl -a -F',' -ne '$F[0] = $F[0]**2; $, = ","; print @F;' wine.data | tail -n3
9,13.27,4.28,2.26,20,120,1.59,.69,.43,1.35,10.2,.59,1.56,835
9,13.17,2.59,2.37,20,120,1.65,.68,.53,1.46,9.3,.6,1.62,840
9,14.13,4.1,2.74,24.5,96,2.05,.76,.56,1.35,9.2,.61,1.6,560

The result is the same, as expected, but the command is a little longer.

This time, a Ruby one-liner:

$ ruby -a -F, -ne '$F[0] = $F[0].to_i ** 2; print $F.join(",");' wine.data | tail -n3
9,13.27,4.28,2.26,20,120,1.59,.69,.43,1.35,10.2,.59,1.56,835
9,13.17,2.59,2.37,20,120,1.65,.68,.53,1.46,9.3,.6,1.62,840
9,14.13,4.1,2.74,24.5,96,2.05,.76,.56,1.35,9.2,.61,1.6,560

The Ruby code looks like its Perl cousin. However, there is one difference: you have to do explicit type conversion.

Since Python has no equivalent of the -n, -a, and -F switches and so no direct support for this kind of one-line program, we skip the Python one-liner in this round.
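For comparison only, the same transformation can be sketched as a short Python script rather than a one-liner. This is a minimal sketch, not something from the original experiment: the function name `square_first_field` is made up here, and the sample line is the last-but-two row of the output above with its first field restored to 3.

```python
def square_first_field(line, sep=","):
    """Return the line with its first field replaced by its square."""
    fields = line.rstrip("\n").split(sep)  # explicit split, no -a/-F switch
    fields[0] = str(int(fields[0]) ** 2)   # explicit conversion, as in Ruby
    return sep.join(fields)


sample = "3,13.27,4.28,2.26,20,120,1.59,.69,.43,1.35,10.2,.59,1.56,835\n"
print(square_first_field(sample))
# prints: 9,13.27,4.28,2.26,20,120,1.59,.69,.43,1.35,10.2,.59,1.56,835
```

To process a whole file, the function could be applied to each line of an open file object in a for loop, which is exactly the part that the -n switch gives you for free in Perl and Ruby.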

Let’s try a more complex task: our dataset has three classes of wine, and we want the average of each property for each class.

Let’s start with an AWK script:

{
    for (i = 2; i <= 14; i++) {
	if ($1 == 1) {
	    sum[1","i] += $i;
	}
	if ($1 == 2) {
	    sum[2","i] += $i;
	}
	if ($1 == 3) {
	    sum[3","i] += $i;
	}
    }

    if ($1 == 1) {
	count[1]++;
    }
    if ($1 == 2) {
	count[2]++;
    }
    if ($1 == 3) {
	count[3]++;
    }
}

END {
    for (i = 1; i <= 3; i++) {
	printf "%d,", i
	for (j = 2; j <= 13; j++) {
	    printf "%.3f,", sum[i","j] / count[i];
	}
	printf "%.3f", sum[i","14] / count[i];
	print "";
    }
}

We can see that the AWK code is simple and intuitive. One trick here: the original AWK has no true multi-dimensional arrays, and the index of an AWK array is not an integer but a string, hence the name “associative array”. We put a comma between our index numbers to mimic a 2-D array. (POSIX AWK also accepts sum[i, j], which joins the subscripts with SUBSEP into a single string key.) If you want real multi-dimensional arrays in AWK, consider GAWK. Note that the script does not set the field separator itself, so it has to be run with -F, on the command line.
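The composite-key trick is not specific to AWK. Here is a rough Python sketch of the same idea, using a flat dictionary whose keys are "class,column" strings. The helper `key` and the shortened two-property rows are assumptions made for illustration only; the real rows have 13 properties.

```python
def key(i, j):
    """Build the "i,j" string key, as sum[i","j] does in the AWK script."""
    return "%d,%d" % (i, j)


sums = {}  # flat associative array, like an AWK array

# Shortened rows for illustration: class plus the first two properties.
rows = [
    "1,14.23,1.71",
    "1,13.2,1.78",
]

for row in rows:
    fields = row.split(",")
    cls = int(fields[0])
    # AWK fields are 1-based, so property columns start at 2.
    for col in range(2, len(fields) + 1):
        k = key(cls, col)
        sums[k] = sums.get(k, 0.0) + float(fields[col - 1])

print(sorted(sums))
# prints: ['1,2', '1,3']
```

The dictionary never knows it is two-dimensional; the structure lives entirely in how the key string is built, which is also why a separator that could appear inside an index would break the scheme.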

Then, the same task with a Perl script:

#!/usr/bin/perl -a -F"," -n
for $i (1..13) {
  $sum[1]->[$i] += $F[$i] if $F[0] == 1;
  $sum[2]->[$i] += $F[$i] if $F[0] == 2;
  $sum[3]->[$i] += $F[$i] if $F[0] == 3;
}

$count[1]++ if $F[0] == 1;
$count[2]++ if $F[0] == 2;
$count[3]++ if $F[0] == 3;

END {
  for $i (1..3) {
    printf "%s,", $i;
    for $j (1..12) {
      printf "%.3f,", $sum[$i]->[$j] / $count[$i];
    }
    printf "%.3f", $sum[$i]->[13] / $count[$i];
    print "\n";
  }
}

You can see that the equivalent code is more compact in Perl than in AWK, because Perl supports statement modifiers (postfix if/while/for). Besides, Perl supports nested, multi-dimensional arrays natively.

Let’s try a Ruby one:

#!/usr/bin/ruby -a -F, -n

BEGIN {
  count = (0..3).to_a.map { |i| 0 }
  sum = (0..3).to_a.map { |i| (0..13).to_a.map { |j| 0 }}
}

(1..13).to_a.map do |i|
  sum[1][i] += $F[i].to_f if $F[0] == '1'
  sum[2][i] += $F[i].to_f if $F[0] == '2'
  sum[3][i] += $F[i].to_f if $F[0] == '3'
end

count[1] += 1 if $F[0] == '1'
count[2] += 1 if $F[0] == '2'
count[3] += 1 if $F[0] == '3'

END {
  (1..3).each do |i|
    printf "%s,", i.to_s
    (1..12).each do |j|
      printf "%.3f,", sum[i][j] / count[i]
    end
    printf "%.3f", sum[i][13] / count[i]
    puts
  end
}

The Ruby script and the Perl one share some similarities, but there are differences. First, you cannot rely on auto-vivification as in the Perl or AWK script; the arrays have to be initialized in the BEGIN block. Second, as mentioned before, you have to do type conversion. Third, you use more object-oriented features in Ruby than in Perl or AWK.

Finally, we try a Python solution. Python has no direct support for line-oriented programs as the other three languages do; Python code does things in a more “formal” way:

#!/usr/bin/env python
import sys

_file = sys.argv[1]

count = [0 for i in range(4)]
_sum = [[0 for j in range(14)] for i in range(4)]

with open(_file, 'r') as f:
    for line in f:
        array = line.split(',')
        for i in range(1, 14):
            if array[0] == '1':
                _sum[1][i] += float(array[i])
            elif array[0] == '2':
                _sum[2][i] += float(array[i])
            elif array[0] == '3':
                _sum[3][i] += float(array[i])

        if array[0] == '1':
            count[1] += 1
        elif array[0] == '2':
            count[2] += 1
        elif array[0] == '3':
            count[3] += 1

for i in range(1, 4):
    sys.stdout.write("%s," % i)
    for j in range(1, 13):
        sys.stdout.write("%.3f," % float(_sum[i][j] / count[i]))
    sys.stdout.write("%.3f\n" % float(_sum[i][13] / count[i]))

As you can see, you have to explicitly split each line of the text file into an array. Besides, you have to do type conversion as well, though in a non-OO way. No auto-vivification, of course.
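That said, the standard library can approximate auto-vivification. The following is a hedged sketch using `collections.defaultdict`, not the approach taken in the script above; the three shortened input lines (class plus two properties) are made up from rows shown earlier, purely for illustration.

```python
from collections import defaultdict

# sums[cls][col] springs into existence as 0.0 on first access,
# much like $sum[$i]->[$j] does in the Perl script.
sums = defaultdict(lambda: defaultdict(float))
counts = defaultdict(int)

# Shortened lines for illustration: class plus the first two properties.
lines = [
    "1,14.23,1.71\n",
    "1,13.2,1.78\n",
    "3,13.27,4.28\n",
]

for line in lines:
    fields = line.strip().split(",")
    cls = fields[0]
    counts[cls] += 1
    for col, value in enumerate(fields[1:], start=1):
        sums[cls][col] += float(value)

# Print one line of per-class averages, in the same format as the scripts.
for cls in sorted(sums):
    averages = ["%.3f" % (sums[cls][col] / counts[cls])
                for col in sorted(sums[cls])]
    print(",".join([cls] + averages))
```

A side effect of this design is that the class labels no longer need to be hard-coded as 1, 2, and 3: any label that appears in the first column gets its own entry.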

Let’s gather some simple statistics for our four scripts (wc reports lines, words, and bytes):

$ wc calc_class_avg.*
      34      98     473 calc_class_avg.awk
      21      73     447 calc_class_avg.pl
      31     107     861 calc_class_avg.py
      27      95     563 calc_class_avg.rb

By these numbers, Perl code tends to be compact and frugal, which saves time. AWK also produces simple code, though it is less compact. Ruby also produces tight code, but you have to think in an OO way. Python has no direct support for one-line programs or line-oriented programming, and Python code tends to be longer. However, Python code is more readable.

Back to our original question: is AWK still worth learning even if we have Perl or Ruby? If you just want to learn one tool for text processing, Perl is the best choice. However, AWK is simple and intuitive in some situations; some AWK programs are even simpler than their Perl equivalents. For simple tasks, sed and AWK are usually handy. Although Ruby absorbs many features of Perl, you still have to re-learn some Ruby features and think in the Ruby way. Python code is clean and readable, but Python offers less support for quick-and-dirty text processing tasks. I think old tools like AWK are still worth learning, but for long and complex tasks, Perl, Python, or Ruby may be more suitable.

These demos just reflect my personal taste of AWK, Perl, Python, and Ruby. They don’t represent the best practices of these languages. Every utility and language was born for a reason. Keep an open mind and enjoy your tools.