Instead of relying on a single application tied to a proprietary file format, Unix utilities work on text streams. A text stream can be a text file, but it can also be command-line input or output. Unix comes with several handy text-processing utilities, and these tools cooperate well with text streams. Therefore, you should save your documents in plain-text formats whenever possible. Here we briefly introduce some of them.
Before we jump into these utilities, let’s look at regular expressions. Regular expressions are not standalone commands but a mini-language built into many Unix utilities and programming languages. They are compact search patterns for strings and save you from writing a lot of conditions and loops. You can still use these Unix utilities without knowing regular expressions, but the utilities become more powerful with them.
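As a rough illustration of how one pattern can stand in for a hand-written loop, the sketch below uses grep -E to pick out the lines that contain a date-like token. The sample text and the variable names are made up for this example.

```shell
# Hypothetical sample input: three lines, two of which contain a date-like token.
sample='released 2021-03-15
no date here
patched 2022-11-02'

# One pattern replaces a loop over characters:
# four digits, a hyphen, two digits, a hyphen, two digits.
matches=$(printf '%s\n' "$sample" | grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2}')
printf '%s\n' "$matches"
```

The pattern only checks the shape of the token; it does not validate that the date exists.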
Different Unix utilities use different dialects of regular expressions, which causes confusion. We suggest starting with Perl regular expressions, one of the most complete dialects. Check perlrequick, perlretut, and perlre for more information. A simpler way to practice Perl regular expressions on the command line is pcregrep, a utility bundled with the Perl Compatible Regular Expressions (PCRE) library.
Now, let’s go back to the text-processing utilities. These utilities are practically part of Unix, so installation is seldom needed. We won’t cover every text-processing utility here, only some common ones. They are:
perl can mimic many Unix utilities, which is another application of Perl. The advantage is that you don’t need to memorize the usage of many utilities, but the equivalent perl command is usually longer. See perlrun for details. There are also books on Perl one-liners, such as Minimal Perl for Unix and Linux People (Manning) and Perl One-Liners (No Starch Press).
iconv converts text files from one character encoding to another. For example:
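A minimal sketch of such a conversion, assuming a Latin-1 (ISO-8859-1) input file that should become UTF-8. The file names and the scratch directory are hypothetical and exist only for this example.

```shell
# Work in a scratch directory so that no real files are touched.
workdir=$(mktemp -d)

# Create a hypothetical Latin-1 file: byte 0xE9 (octal 351) is "é" in ISO-8859-1.
printf 'caf\351\n' > "$workdir/latin1.txt"

# -f names the source encoding, -t the target encoding.
iconv -f ISO-8859-1 -t UTF-8 "$workdir/latin1.txt" > "$workdir/utf8.txt"

# In UTF-8 the same character takes two bytes (0xC3 0xA9); dump the bytes to check.
od -An -tx1 "$workdir/utf8.txt"
```

iconv writes to standard output, so the redirection decides where the converted text ends up.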
Perl comes with a utility called piconv, which behaves like iconv. It is handy on systems that have no iconv, such as Microsoft Windows.
sed modifies text streams and prints the result. sed can be used with or without regular expressions. Normally, sed doesn’t alter your file but prints the result to standard output. A simple usage of sed looks like this:
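The sketch below shows both behaviors on two throwaway files. The file contents and the scratch directory are hypothetical, and -i with an attached suffix is assumed to be available (it is in GNU and BSD sed, but it is not part of POSIX).

```shell
# Work in a scratch directory so that no real files are touched.
workdir=$(mktemp -d)
printf 'foo one\nfoo two\n' > "$workdir/file01"
printf 'three foo\n' > "$workdir/file02"

# Without -i, sed prints the edited text to standard output
# and leaves the file itself unchanged.
sed 's/foo/bar/' "$workdir/file01"

# With -i.bak, sed edits each file in place and keeps the
# original content in a copy named <file>.bak.
sed -i.bak 's/foo/bar/g' "$workdir/file01" "$workdir/file02"
cat "$workdir/file01" "$workdir/file02"
```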
In this case, -i means in-place editing; the original files are saved as file01.bak, and so on.
You may use
tr replaces text character by character. tr doesn’t accept regular expressions. To convert uppercase letters to lowercase letters with tr, do this:
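A minimal sketch, assuming the text arrives on standard input. The sample sentence and the variable name are made up.

```shell
# tr maps each character of the first set to the character at the
# same position in the second set; everything else passes through.
lower=$(printf 'Hello, Unix World\n' | tr 'A-Z' 'a-z')
printf '%s\n' "$lower"
```

tr reads only standard input, so a file has to be fed in with a redirection or a pipe.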
If you want to list all words in a file, use tr to replace every character other than alphabetic letters:
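One way to sketch this, assuming that each run of non-letters should become a single newline so that every word lands on its own line. The sample sentence is hypothetical.

```shell
# -c complements the first set (every character that is NOT a letter);
# -s squeezes each run of the replacement character into one newline.
words=$(printf 'The quick, brown fox; the lazy dog.\n' | tr -cs 'A-Za-z' '\n')
printf '%s\n' "$words"
```

Piping the result through sort and uniq -c would turn the word list into a simple frequency table.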
Again, you can substitute
AWK is an interpreted programming language for data extraction and report generation. AWK is well suited to quick one-liner text processing. To list all users on the system with AWK, do this:
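A minimal sketch, assuming a system that keeps its accounts in /etc/passwd with colon-separated fields. The variable name is made up for this example.

```shell
# -F: sets the field separator to a colon;
# $1 is then the first field of each line, which is the login name.
users=$(awk -F: '{ print $1 }' /etc/passwd)
printf '%s\n' "$users"
```

Accounts that come from a directory service rather than /etc/passwd would not show up in this list.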
Many features of AWK have been absorbed into Perl. To do the same task with perl in place of awk, do this:
csplit splits one file into several files by regular expression or line number. Since the behavior of csplit involves file I/O, there is no easy way to mimic it. To split a file with csplit, do this:
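A small sketch, assuming a document whose sections start with lines beginning with "Chapter ". The file name, its contents, and the output prefix are all hypothetical.

```shell
# Work in a scratch directory so that the pieces do not clutter a real one.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A hypothetical document: a preface followed by two chapters.
printf 'Preface\nintro\nChapter 1\nalpha\nChapter 2\nbeta\n' > book.txt

# Split before each line matching /^Chapter /. '{1}' repeats the
# pattern one more time, -s suppresses the byte counts, and
# -f sets the prefix of the output file names.
csplit -s -f part book.txt '/^Chapter /' '{1}'

ls part*
```

With this input the pieces are part00 (the preface), part01 (Chapter 1), and part02 (Chapter 2).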
head prints the first several lines of a file. Similarly, tail prints the last several lines of a file. When run without options, tail prints the last 10 lines of a file. To print the first 5 lines of a file, do this:
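A minimal sketch on a throwaway file of twelve numbered lines. The file and the variable name are hypothetical.

```shell
# Create a hypothetical sample file containing the numbers 1 to 12, one per line.
sample=$(mktemp)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 11 12 > "$sample"

# -n sets how many lines to print.
head -n 5 "$sample"    # the first five lines
tail -n 3 "$sample"    # the last three lines
```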
It is also possible to mimic this with perl, but the command is longer.
nl numbers the lines of a file and prints each line number together with the content of that line. It is convenient when you need the line numbers.
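A short sketch with made-up input. By default nl leaves blank lines unnumbered, so the example passes -ba to number every line; whether that is wanted depends on the task.

```shell
# -ba numbers all lines, including the empty one in the middle.
numbered=$(printf 'first\n\nthird\n' | nl -ba)
printf '%s\n' "$numbered"
```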
Here is a longer example that combines these utilities with perl. We extract the line numbers of the titles in a file and split the file at those line numbers.
wc is convenient for basic file statistics such as character counts, word counts, and line counts. Be aware of Unicode issues: files may contain multibyte characters. An alternative program is uniwc, part of the Unicode::Tussle Perl module.
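The sketch below shows where the multibyte issue appears. The sample text is hypothetical, and "é" is written as octal escapes so that the example does not depend on the encoding of the script itself.

```shell
# "café au lait": in UTF-8, "é" is one character but two bytes (0xC3 0xA9).
text=$(printf 'caf\303\251 au lait')

bytes=$(printf '%s\n' "$text" | wc -c)   # counts bytes, including the newline
words=$(printf '%s\n' "$text" | wc -w)   # counts whitespace-separated words
lines=$(printf '%s\n' "$text" | wc -l)   # counts newline characters
chars=$(printf '%s\n' "$text" | wc -m)   # counts characters; depends on the locale

printf 'bytes=%s words=%s lines=%s chars=%s\n' "$bytes" "$words" "$lines" "$chars"
```

The byte count is 14 here, while the character count is 13 only when the current locale decodes the input as UTF-8.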
There are more commands, and more ways to use them, among the text-processing utilities of Unix, but I won’t dig too deeply. Consult the system manual or online resources if you are interested in this topic. Good luck.