This article is part of a self-published book project by Balthazar Rouberol and Etienne Brodu, ex-roommates, friends and colleagues, aiming at empowering the up and coming generation of developers. We currently are hard at work on it!
If you are interested in the project, we invite you to join the mailing list!
Text processing in the shell
One of the things that makes the shell an invaluable tool is the amount of available text processing commands, and the ability to easily pipe them into each other to build complex text processing workflows. These commands can make it trivial to perform text and data analysis, convert data between different formats, filter lines, etc.
When working with text data, the philosophy is to break any complex problem you have into a set of smaller ones, and to solve each of them with a specialized tool.
Make each program do one thing well.
The examples in this chapter might seem a little contrived at first, but this is also by design. Each of these tools was designed to solve one small problem. They become extremely powerful, however, when combined.
We will go over some of the most common and useful text processing commands the shell has to offer, and will demonstrate real-life workflows piping them together. I suggest you take a look at the man pages of these commands to see the full breadth of options at your disposal.
The example CSV (comma-separated values) file is available online. Feel free to download it yourself to test these commands.
cat
As seen in the previous chapter, cat is used to concatenate a list of one or more files and display their content on screen.
$ cat Documents/readme
Thanks again for reading this book!
I hope you're following so far!
$ cat Documents/computers
Computers are not intelligent
They're just fast at making dumb things.
$ cat Documents/readme Documents/computers
Thanks again for reading this book!
I hope you are following so far!
Computers are not intelligent
They're just fast at making dumb things.
head
head prints the first n lines in a file. It can be very useful to peek into a file of unknown structure and format without burying your shell under a wall of text.
$ head -n 2 metadata.csv
metric_name,metric_type,interval,unit_name,per_unit_name,description,orientation,integration,short_name
mysql.galera.wsrep_cluster_size,gauge,,node,,The current number of nodes in the Galera cluster.,0,mysql,galera cluster size
If -n is unspecified, head will print the first 10 lines in its argument file or input stream.
tail
tail is head’s counterpart. It prints the last n lines in a file.
$ tail -n 1 metadata.csv
mysql.performance.queries,gauge,,query,second,The rate of queries.,0,mysql,queries
If you want to print all lines in a file located after the nth line (included), you can use the -n +n argument.
$ tail -n +42 metadata.csv
mysql.replication.slaves_connected,gauge,,,,Number of slaves connected to a replication master.,0,mysql,slaves connected
mysql.performance.queries,gauge,,query,second,The rate of queries.,0,mysql,queries

Our file has 43 lines, so tail -n +42 only prints the 42nd and 43rd lines in our file.
If -n is unspecified, tail will print the last 10 lines in its argument file or input stream.
tail -f or tail --follow displays the last lines in a file and displays each new line as the file is being written to. It is very useful to see real time activity that is written to a log file, for example a web server log file, etc.
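head and tail can also be chained to extract a range of lines from the middle of a file. The following is a minimal sketch using a throwaway sample file (the file content and the variable names are made up for the illustration).

```shell
# Build a small sample file with 10 numbered lines (hypothetical data).
sample=$(mktemp)
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 > "$sample"

# Keep the first 5 lines, then the last 3 of those: lines 3 to 5.
range_a=$(head -n 5 "$sample" | tail -n 3)

# Same result the other way round: start at line 3, then keep 3 lines.
range_b=$(tail -n +3 "$sample" | head -n 3)

echo "$range_a"
rm -f "$sample"
```

Both pipelines print lines 3, 4 and 5. The second form reads more naturally when you know the starting line and the number of lines you want.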
wc
wc (for word count) prints either the number of bytes (when using -c), words (when using -w) or lines (when using -l) in its argument files or input stream. For plain ASCII text, the number of bytes is also the number of characters.
$ wc -l metadata.csv
43 metadata.csv
$ wc -w metadata.csv
405 metadata.csv
$ wc -c metadata.csv
5094 metadata.csv
By default, wc prints all of the above.
$ wc metadata.csv
43 405 5094 metadata.csv
Only the count will be printed out if the text data is piped in or redirected into stdin.
$ cat metadata.csv | wc
43 405 5094
$ cat metadata.csv | wc -l
43
$ wc -w < metadata.csv
405
grep
grep is the Swiss Army knife of line filtering. It allows you to filter lines matching a given pattern.
For example, we can use grep to find all occurrences of the word mutex in our metadata.csv file.
$ grep mutex metadata.csv
mysql.innodb.mutex_os_waits,gauge,,event,second,The rate of mutex OS waits.,0,mysql,mutex os waits
mysql.innodb.mutex_spin_rounds,gauge,,event,second,The rate of mutex spin rounds.,0,mysql,mutex spin rounds
mysql.innodb.mutex_spin_waits,gauge,,event,second,The rate of mutex spin waits.,0,mysql,mutex spin waits

grep can process either files passed as arguments, or a stream of text passed to its stdin. We can thus chain multiple grep commands to further filter our text. In the next example, we filter lines in our metadata.csv file that contain both the mutex and OS words.

$ grep mutex metadata.csv | grep OS
mysql.innodb.mutex_os_waits,gauge,,event,second,The rate of mutex OS waits.,0,mysql,mutex os waits
Let’s go over some of the options you can pass to grep and their associated behavior.
grep -v performs inverted matching: it only keeps the lines that do not match the argument pattern.

$ grep -v gauge metadata.csv
metric_name,metric_type,interval,unit_name,per_unit_name,description,orientation,integration,short_name

grep -i performs case-insensitive matching. In the next example grep -i os matches both OS and os.

$ grep -i os metadata.csv
mysql.innodb.mutex_os_waits,gauge,,event,second,The rate of mutex OS waits.,0,mysql,mutex os waits
mysql.innodb.os_log_fsyncs,gauge,,write,second,The rate of fsync writes to the log file.,0,mysql,log fsyncs

grep -l only lists files containing a match.

$ grep -l mysql metadata.csv
metadata.csv

grep -c counts the number of lines in which a pattern was found.

$ grep -c select metadata.csv
3
grep -r recursively searches files in the given directory and all subdirectories below it.

$ grep -r are ~/Documents
/home/br/Documents/computers:Computers are not intelligent
/home/br/Documents/readme:I hope you are following so far!

grep -w only matches whole words.

$ grep follow ~/Documents/readme
I hope you are following so far!
$ grep -w follow ~/Documents/readme
$
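One subtlety worth illustrating: grep -c reports the number of matching lines, not the number of matches. If a pattern can appear several times on the same line, a common workaround is to combine -o (print each match on its own line, supported by both GNU and BSD grep) with wc -l. A small sketch on made-up data:

```shell
# Hypothetical sample: the word "select" appears 3 times, on 2 lines.
sample=$(mktemp)
printf '%s\n' 'select select' 'insert' 'select' > "$sample"

# -c counts the lines containing at least one match.
matching_lines=$(grep -c select "$sample")

# -o prints every match on its own line, so wc -l counts occurrences.
# tr -d ' ' strips the padding some wc implementations add.
occurrences=$(grep -o select "$sample" | wc -l | tr -d ' ')

echo "lines: $matching_lines, occurrences: $occurrences"
rm -f "$sample"
```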
cut
cut cuts out a portion of a file (or, as always, its input stream). cut works by defining a field delimiter (what separates two columns) with the -d option, and what column(s) should be extracted, with the -f option.
For example, the following command extracts the first column of the last 5 lines of our CSV file.
$ tail -n 5 metadata.csv | cut -d , -f 1
mysql.performance.user_time
mysql.replication.seconds_behind_master
mysql.replication.slave_running
mysql.replication.slaves_connected
mysql.performance.queries
As we are dealing with a CSV file, we can extract each column by cutting over the , character, and extract the first column with -f 1.
We could also select both the first and second columns by using the -f 1,2 option.
$ tail -n 5 metadata.csv | cut -d , -f 1,2
mysql.performance.user_time,gauge
mysql.replication.seconds_behind_master,gauge
mysql.replication.slave_running,gauge
mysql.replication.slaves_connected,gauge
mysql.performance.queries,gauge
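The -f option also accepts ranges, which avoids listing every column by hand. Here is a short sketch run against the last row of our CSV file (the variable names are only there for the illustration):

```shell
# The last row of metadata.csv, as shown earlier in this chapter.
row='mysql.performance.queries,gauge,,query,second,The rate of queries.,0,mysql,queries'

# -f 4-5 selects a closed range: columns 4 and 5.
units=$(echo "$row" | cut -d , -f 4-5)

# -f 8- selects everything from column 8 to the end of the line.
last_columns=$(echo "$row" | cut -d , -f 8-)

echo "$units"          # query,second
echo "$last_columns"   # mysql,queries
```

Keep in mind that cut knows nothing about CSV quoting: a comma inside a quoted field would be treated as a delimiter like any other.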
paste
paste can merge together two different files into one multi-column file.
$ cat ingredients
eggs
milk
butter
tomatoes
$ cat prices
1$
1.99$
1.50$
2$/kg
$ paste ingredients prices
eggs    1$
milk    1.99$
butter  1.50$
tomatoes        2$/kg
By default, paste uses a tab delimiter, but you can change that using the -d option.
$ paste ingredients prices -d:
eggs:1$
milk:1.99$
butter:1.50$
tomatoes:2$/kg

Another common use of paste is to join all lines within a stream or a file using a given delimiter, using a combination of the -s and -d arguments.

$ paste -s -d, ingredients
eggs,milk,butter,tomatoes

If - is specified as an input file, stdin will be read instead.

$ cat ingredients | paste -s -d, -
eggs,milk,butter,tomatoes
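A handy application of paste -s is to sum a column of numbers: joining the lines with + yields an arithmetic expression the shell can evaluate. This is only a sketch on a made-up numbers file, and it assumes the file contains nothing but integers.

```shell
# Hypothetical input: one integer per line.
numbers=$(mktemp)
printf '%s\n' 0 2 1 10 3 > "$numbers"

# Join all lines with "+": 0+2+1+10+3
expression=$(paste -s -d+ "$numbers")

# Let the shell evaluate the resulting expression.
total=$(( $expression ))

echo "$expression = $total"
rm -f "$numbers"
```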
sort
sort, well, sorts argument files or input.
$ cat ingredients
eggs
milk
butter
tomatoes
salt
$ sort ingredients
butter
eggs
milk
salt
tomatoes

sort -r performs a reverse sort.

$ sort -r ingredients
tomatoes
salt
milk
eggs
butter

sort -n performs a numerical sort, by sorting lines by their arithmetic value.

$ cat numbers
0
2
1
10
3
$ sort numbers
0
1
10
2
3
$ sort -n numbers
0
1
2
3
10
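sort can also sort on a specific column, using -t to set the field separator and -k to select the sort key. The sketch below sorts a small made-up CSV file by its second column, numerically.

```shell
# Hypothetical data: an item and a quantity, separated by a comma.
stock=$(mktemp)
printf '%s\n' 'eggs,3' 'milk,12' 'butter,7' > "$stock"

# -t , splits fields on commas.
# -k 2,2n uses only the second field as the key, compared numerically.
by_quantity=$(sort -t , -k 2,2n "$stock")

echo "$by_quantity"
rm -f "$stock"
```

Without the n modifier, 12 would be sorted before 3 and 7, for the same reason 10 comes before 2 in the plain sort numbers example.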
uniq
uniq detects or filters out adjacent identical lines in its argument file or input stream.
$ cat duplicates
and one
and one
and two
and one
and two
and one, two, three
$ uniq duplicates
and one
and two
and one
and two
and one, two, three

As uniq only filters out adjacent identical lines, we can still see duplicate lines in its output. To filter out all identical lines from our duplicates file, we need to sort its content first.

$ sort duplicates | uniq
and one
and one, two, three
and two

uniq -c prepends each line with its number of occurrences.

$ sort duplicates | uniq -c
   3 and one
   1 and one, two, three
   2 and two

uniq -u only displays the unique lines within its input.

$ sort duplicates | uniq -u
and one, two, three

uniq is particularly useful when used in conjunction with sort, as | sort | uniq allows you to remove any duplicate line in a file or a stream.
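Combined with cut and a second sort, sort | uniq -c gives a quick frequency table. The sketch below counts how often each value appears in the second column of a small made-up CSV file, most frequent first.

```shell
# Hypothetical data: a metric name and its type.
metrics=$(mktemp)
printf '%s\n' \
  'a.queries,gauge' 'a.uptime,count' 'a.threads,gauge' \
  'a.errors,rate' 'a.waits,gauge' 'a.restarts,count' > "$metrics"

# Extract the type column, group identical values, count them,
# sort by count (descending), then print "value count".
frequencies=$(cut -d , -f 2 "$metrics" |
  sort |
  uniq -c |
  sort -rn |
  awk '{ print $2, $1 }')

echo "$frequencies"
rm -f "$metrics"
```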
awk
awk is a little more than a text processing tool: it’s actually a whole programming language of its own. One thing awk is really good at is splitting files into columns, and it especially shines when these files contain a mix and match of spaces and tabs.
$ cat -t multi-columns
John Smith Doctor^ITardis
Sarah-James Smith^I Companion^ILondon
Rose Tyler Companion^ILondon

cat -t displays tabs as ^I.
We can see that these columns are either separated by spaces or tabs, and that they are not always separated by the same number of spaces. cut would be of no use there, because it only works on a single character delimiter. awk however, can easily make sense of that file.
awk '{ print $n }' prints the nth column in the text.
$ cat multi-columns | awk '{ print $1 }'
John
Sarah-James
Rose
$ cat multi-columns | awk '{ print $3 }'
Doctor
Companion
Companion
$ cat multi-columns | awk '{ print $1,$2 }'
John Smith
Sarah-James Smith
Rose Tyler

There is so much more we can do with awk. However, printing columns probably accounts for 99% of my personal usage.
{ print $NF } prints the last column in the line.
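To give a small taste of what lies beyond printing columns, here is a sketch that uses -F to change the field separator, a condition placed before the braces to filter rows, and the built-in NR (current line number) and NF (number of fields) variables. The CSV content is made up for the example.

```shell
# Hypothetical CSV file with a header row.
csv=$(mktemp)
printf '%s\n' \
  'name,type,unit' \
  'queries,gauge,query' \
  'uptime,count,second' \
  'threads,gauge,thread' > "$csv"

# -F, splits on commas. The block only runs for rows whose
# second column is "gauge"; $NF is the last column of the row.
gauge_units=$(awk -F, '$2 == "gauge" { print $1 ": " $NF }' "$csv")

# NR > 1 skips the header; END runs once all input has been read.
row_count=$(awk 'NR > 1 { n++ } END { print n }' "$csv")

echo "$gauge_units"
echo "$row_count data rows"
rm -f "$csv"
```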
tr
tr stands for translate: it replaces characters with others. It works either on characters or on character classes, such as lowercase, printable, spaces, alphanumeric, etc.
tr <char1> <char2> translates all occurrences of <char1> from its standard input into <char2>.
$ echo "Computers are fast" | tr a A
Computers Are fAst
tr can also translate character classes by using the [:class:] notation. The full list of available classes is described in the tr man page, but we’ll demonstrate some of them here.
[:space:] represents all types of spaces, from a simple space, to a tab or a newline.
$ echo "computers are fast" | tr '[:space:]' ','
computers,are,fast,%
All space-like characters were translated into a comma. Note that the % character at the end of the output represents the lack of a trailing newline. Indeed, that newline was translated to a comma as well.
[:lower:] represents all lowercase characters, and [:upper:] represents all uppercase characters. Converting between cases is thus made very easy.
$ echo "computers are fast" | tr '[:lower:]' '[:upper:]'
COMPUTERS ARE FAST
$ echo "COMPUTERS ARE FAST" | tr '[:upper:]' '[:lower:]'
computers are fast
tr -c SET1 SET2 will transform any character not in SET1 into the characters in SET2. The following example replaces all non-vowels by spaces.
$ echo "computers are fast" | tr -c '[aeiouy]' ' '
 o  u e   a e  a
tr -d deletes the matched characters, instead of replacing them. Conceptually, it translates each matched character into nothing.
$ echo "Computers Are Fast" | tr -d '[:lower:]'
C A F
tr can also replace character ranges, for example all letters between a and e, or all numbers between 1 and 8, by using the notation s-e, where s is the start character and e is the end one.
$ echo "computers are fast" | tr 'a-e' 'x'
xomputxrs xrx fxst
$ echo "5uch l337 5p34k" | tr '1-4' 'x'
5uch lxx7 5pxxk
tr -s string1 compresses any multiple occurrences of the characters in string1 into a single one. One of the most useful uses of tr -s is to replace multiple consecutive spaces by a single one.
$ echo "Computers         are       fast" | tr -s ' '
Computers are fast
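Squeezing spaces with tr -s is what makes cut usable on input aligned with a variable number of spaces. A minimal sketch, on a made-up line:

```shell
# Hypothetical line whose columns are separated by several spaces.
line='Rose    Tyler      Companion'

# Without squeezing, each extra space creates an empty field,
# so the third field is empty.
unsqueezed=$(echo "$line" | cut -d ' ' -f 3)

# After squeezing, columns are separated by exactly one space.
job=$(echo "$line" | tr -s ' ' | cut -d ' ' -f 3)

echo "without tr -s: '$unsqueezed'"
echo "with tr -s: '$job'"
```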
fold
fold wraps each input line to fit in a specified width. It can be useful to make sure an argument text fits in a small display size for example. fold -w n folds the lines at n characters.
$ cat ~/Documents/readme | fold -w 16
Thanks again for
 reading this bo
ok!
I hope you're fo
llowing so far!
fold -s will only break lines on a space character, and can be combined with -w to fold up to a given number of characters.
$ cat ~/Documents/readme | fold -w 16 -s
Thanks again
for reading
this book!
I hope you're
following so
far!
sed
sed is a non-interactive stream editor, used to perform text transformation on its input stream, on a line-per-line basis. It can take its input from a file or its stdin, and will output its result either in a file or its stdout.
It works by taking one or many optional addresses, a function and parameters. A sed command thus looks like this:
[address[,address]]function[arguments]
While sed can perform many functions, we will cover only substitution, as it is probably sed’s most common use.
Substituting text
A sed substitution command looks like this:
s/PATTERN/REPLACEMENT/[options]
Example: replacing the first instance of a word for each line in a file
$ cat hello
hello hello
hello world
hi
$ cat hello | sed 's/hello/Hey I just met you/'
Hey I just met you hello
Hey I just met you world
hi

We can see that only the first occurrence of hello was replaced in the first line. To replace all occurrences of hello in each line, we can use the g (for global) option.

$ cat hello | sed 's/hello/Hey I just met you/g'
Hey I just met you Hey I just met you
Hey I just met you world
hi

sed allows you to specify any other separator than /, which is especially useful to keep the command readable if the search or replacement pattern contains forward slashes.

$ cat hello | sed 's@hello@Hey I just met you@g'
Hey I just met you Hey I just met you
Hey I just met you world
hi
By specifying an address, we can tell sed on which line or line range to actually perform the substitution.
$ cat hello | sed '1s/hello/Hey I just met you/g'
Hey I just met you Hey I just met you
hello world
hi
$ cat hello | sed '2s/hello/Hey I just met you/g'
hello hello
Hey I just met you world
hi

The address 1 tells sed to only replace hello by Hey I just met you on line 1. We can specify an address range with the notation <start>,<end> where <end> can either be a line number or $, meaning the last line in the file.

$ cat hello | sed '1,2s/hello/Hey I just met you/g'
Hey I just met you Hey I just met you
Hey I just met you world
hi
$ cat hello | sed '2,3s/hello/Hey I just met you/g'
hello hello
Hey I just met you world
hi
$ cat hello | sed '2,$s/hello/Hey I just met you/g'
hello hello
Hey I just met you world
hi

By default, sed displays its result in its stdout, but it can also edit the initial file in-place, with the use of the -i option.

$ sed -i '' 's/hello/Bonjour/' sed-data
$ cat sed-data
Bonjour hello
Bonjour world
hi

On Linux, only -i needs to be specified. However, because sed’s behavior on macOS is slightly different, the '' needs to be added right after -i.
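If you need a single command that works on both Linux and macOS, one option is to attach a backup suffix directly to -i (with no space in between). Both GNU sed and the macOS sed accept that form, and the original file is kept next to the edited one. A sketch in a temporary directory, with made-up file content:

```shell
# Work in a throwaway directory (hypothetical data).
workdir=$(mktemp -d)
printf '%s\n' 'hello hello' 'hello world' 'hi' > "$workdir/sed-data"

# -i.bak edits sed-data in place and saves the original content
# as sed-data.bak.
sed -i.bak 's/hello/Bonjour/' "$workdir/sed-data"

edited=$(cat "$workdir/sed-data")
backup=$(cat "$workdir/sed-data.bak")

echo "$edited"
rm -rf "$workdir"
```

The backup file can simply be deleted once you are happy with the result.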
Real-life examples
Filtering a CSV using grep and awk
$ grep -w gauge metadata.csv | awk -F, '{ if ($4 == "query") { print $1, "per", $5 } }'
mysql.performance.com_delete per second
mysql.performance.com_delete_multi per second
mysql.performance.com_insert per second
mysql.performance.com_insert_select per second
mysql.performance.com_replace_select per second
mysql.performance.com_select per second
mysql.performance.com_update per second
mysql.performance.com_update_multi per second
mysql.performance.questions per second
mysql.performance.slow_queries per second
mysql.performance.queries per second
This example filters the lines containing the word gauge in our metadata.csv file using grep, then keeps the lines with the string query as their 4th column, and displays the metric name (1st column) with its associated per_unit_name value (5th column).
Printing the IPv4 address associated with a network interface
$ ifconfig en0 | grep inet | grep -v inet6 | awk '{ print $2 }'
192.168.0.38

ifconfig <interface name> prints details associated with the argument network interface name. For example:

en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
    ether 19:64:92:de:20:ba
    inet6 fe80::8a3:a1cb:56ae:7c7c%en0 prefixlen 64 secured scopeid 0x7
    inet 192.168.0.38 netmask 0xffffff00 broadcast 192.168.0.255
    nd6 options=201<PERFORMNUD,DAD>
    media: autoselect
    status: active

We then grep for inet, which will match 2 lines.

$ ifconfig en0 | grep inet
    inet6 fe80::8a3:a1cb:56ae:7c7c%en0 prefixlen 64 secured scopeid 0x7
    inet 192.168.0.38 netmask 0xffffff00 broadcast 192.168.0.255

We then exclude the IPv6 line (the one containing inet6) by using grep -v.

$ ifconfig en0 | grep inet | grep -v inet6
    inet 192.168.0.38 netmask 0xffffff00 broadcast 192.168.0.255

We finally use awk to get the 2nd column in that line: the IPv4 address associated with our en0 network interface.

$ ifconfig en0 | grep inet | grep -v inet6 | awk '{ print $2 }'
192.168.0.38

It has been suggested to me that grep inet | grep -v inet6 could be replaced by the following future-proof awk command:

$ ifconfig en0 | awk ' $1 == "inet" { print $2 }'
192.168.0.38
It is shorter and specifically targets IPv4 using the $1 == "inet" condition.
Extracting a value from a config file
$ grep 'editor =' ~/.gitconfig | cut -d = -f2 | sed 's/ //g'
/usr/bin/vim

We look for the editor = value in the current user’s git configuration file, then cut over the = sign, get the second column and remove any space around that column.

$ grep 'editor =' ~/.gitconfig
editor = /usr/bin/vim
$ grep 'editor =' ~/.gitconfig | cut -d'=' -f2
 /usr/bin/vim
$ grep 'editor =' ~/.gitconfig | cut -d'=' -f2 | sed 's/ //g'
/usr/bin/vim
Extracting IP addresses from a log file
The following real life example looks for the message Too many connections from in a database log file (which is followed by an IP address) and displays the 10 biggest offenders.
$ grep 'Too many connections from' db.log | \
  awk '{ print $12 }' | \
  sed 's@/@@' | \
  sort | \
  uniq -c | \
  sort -rn | \
  head -n 10 | \
  awk '{ print $2 }'
10.11.112.108
10.11.111.70
10.11.97.57
10.11.109.72
10.11.116.156
10.11.100.221
10.11.96.242
10.11.81.68
10.11.99.112
10.11.107.120

Let’s break down what this pipeline of commands does. First, let’s look at what a log line looks like.

$ grep "Too many connections from" db.log | head -n 1
2020-01-01 08:02:37,617 [myid:1] - WARN [NIOServerCxn.Factory:1.2.3.4/1.2.3.4:2181:NIOServerCnxnFactory@193] - Too many connections from /10.11.112.108 - max is 60

awk '{ print $12 }' then extracts the IP from the line.

$ grep "Too many connections from" db.log | awk '{ print $12 }'
/10.11.112.108
...

sed 's@/@@' removes the leading slash from the IPs.

$ grep "Too many connections from" db.log | awk '{ print $12 }' | sed 's@/@@'
10.11.112.108
...

As we have previously seen, we can use whatever separator we want for sed. While / is commonly used as a separator, we are currently replacing that very character, which would make the substitution expression slightly less readable.

sort | uniq -c sorts the IPs lexicographically, and then removes duplicates while prefixing IPs by their associated number of occurrences.

$ grep 'Too many connections from' db.log | \
  awk '{ print $12 }' | \
  sed 's@/@@' | \
  sort | \
  uniq -c
   1379 10.11.100.221
   1213 10.11.103.168
   1138 10.11.105.177
    946 10.11.106.213
   1211 10.11.106.4
   1326 10.11.107.120
...

sort -rn | head -n 10 sorts the lines by the number of occurrences, numerically and in reverse order, which displays the biggest offenders first, 10 of which are displayed. The final awk '{ print $2 }' extracts the IPs themselves.

$ grep 'Too many connections from' db.log | \
  awk '{ print $12 }' | \
  sed 's@/@@' | \
  sort | \
  uniq -c | \
  sort -rn | \
  head -n 10 | \
  awk '{ print $2 }'
10.11.112.108
10.11.111.70
10.11.97.57
10.11.109.72
10.11.116.156
10.11.100.221
10.11.96.242
10.11.81.68
10.11.99.112
10.11.107.120
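If you do not have such a log file at hand, the pipeline can be tried on generated data. The sketch below writes a handful of fake log lines that mimic the format shown above (the IPs and the shortened [factory] field are made up) and keeps the 2 biggest offenders instead of 10.

```shell
# Generate a small fake log: one line per IP in the list.
log=$(mktemp)
for ip in 10.0.0.1 10.0.0.2 10.0.0.1 10.0.0.3 10.0.0.1 10.0.0.2; do
  echo "2020-01-01 08:02:37,617 [myid:1] - WARN [factory] - Too many connections from /$ip - max is 60"
done > "$log"

# Same pipeline as above. A line ending with | continues on the
# next line, so the trailing backslashes are not needed here.
top_offenders=$(grep 'Too many connections from' "$log" |
  awk '{ print $12 }' |
  sed 's@/@@' |
  sort |
  uniq -c |
  sort -rn |
  head -n 2 |
  awk '{ print $2 }')

echo "$top_offenders"
rm -f "$log"
```

Here 10.0.0.1 appears 3 times and 10.0.0.2 twice, so they are printed in that order.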
Renaming a function in a source file
Let’s imagine that we are working on a code project, and we would like to rename a poorly named function (or class, variable, etc) in a code file. We can do this by using sed -i, which performs an in-place replacement in a file.

$ cat izk/utils.py
def bool_from_str(s):
    if s.isdigit():
        return int(s) == 1
    return s.lower() in ['yes', 'true', 'y']

$ sed -i 's/def bool_from_str/def is_affirmative/' izk/utils.py
$ cat izk/utils.py
def is_affirmative(s):
    if s.isdigit():
        return int(s) == 1
    return s.lower() in ['yes', 'true', 'y']

Use sed -i '' instead of sed -i on macOS, as the sed version behaves slightly differently.

We’ve however only renamed this function in the file it was defined in. Any other file that imports bool_from_str will now be broken, as this function is not defined anymore. We’d need a way to rename bool_from_str everywhere it is found in our project. We can achieve just that by using grep, sed, and either for loops or xargs.
Going further: for loops and xargs
To replace all occurrences of bool_from_str in our project, we first need to recursively find them using grep -r.
$ grep -r bool_from_str .
./tests/test_utils.py:from izk.utils import bool_from_str
./tests/test_utils.py:def test_bool_from_str(s, expected):
./tests/test_utils.py:    assert bool_from_str(s) == expected
./izk/utils.py:def bool_from_str(s):
./izk/prompt.py:from .utils import bool_from_str
./izk/prompt.py:    default = bool_from_str(os.environ[envvar])

As we are only interested in the matching files, we also need to use the -l/--files-with-matches option:

-l, --files-with-matches
    Only the names of files containing selected lines are written to standard output. grep will only search a file until a match has been found, making searches potentially less expensive. Pathnames are listed once per file searched. If the standard input is searched, the string ``(standard input)'' is written.

$ grep -r --files-with-matches bool_from_str .
./tests/test_utils.py
./izk/utils.py
./izk/prompt.py

We can then use the xargs command to perform an action on each line in the output (each file containing the bool_from_str string).

$ grep -r --files-with-matches bool_from_str . | \
  xargs -n 1 sed -i 's/bool_from_str/is_affirmative/'
-n 1 tells xargs that each line in the output should cause a separate sed command to be executed.
The following commands were then executed:
$ sed -i 's/bool_from_str/is_affirmative/' ./tests/test_utils.py
$ sed -i 's/bool_from_str/is_affirmative/' ./izk/utils.py
$ sed -i 's/bool_from_str/is_affirmative/' ./izk/prompt.py
If the command you call with xargs (sed, in our case) supports multiple arguments, you can (and should, as a single command will execute faster) drop the -n 1 argument and run
grep -r --files-with-matches bool_from_str . | xargs sed -i 's/bool_from_str/is_affirmative/'
which will then execute
$ sed -i 's/bool_from_str/is_affirmative/' ./tests/test_utils.py ./izk/utils.py ./izk/prompt.py
We can see that sed can take multiple arguments by looking at its synopsis, in its man page.
SYNOPSIS
    sed [-Ealn] command [file ...]
    sed [-Ealn] [-e command] [-f command_file] [-i extension] [file ...]
Indeed, as we’ve seen in the previous chapter, file ... means that multiple arguments representing file names are accepted.
We can see that all bool_from_str occurrences have been replaced.
$ grep -r is_affirmative .
./tests/test_utils.py:from izk.utils import is_affirmative
./tests/test_utils.py:def test_is_affirmative(s, expected):
./tests/test_utils.py:    assert is_affirmative(s) == expected
./izk/utils.py:def is_affirmative(s):
./izk/prompt.py:from .utils import is_affirmative
./izk/prompt.py:    default = is_affirmative(os.environ[envvar])

As is often the case, there are multiple ways of achieving the same result. Instead of using xargs, we could have used for loops, which allow you to iterate over a list of lines and perform an action on each element. These for loops have the following syntax:

for item in list; do
    command $item
done

By wrapping our grep command in $(), we cause the shell to execute it in a subshell, the result of which is then iterated on by the for loop.

$ for file in $(grep -r --files-with-matches bool_from_str .); do
    sed -i 's/bool_from_str/is_affirmative/' $file
  done

which will execute

$ sed -i 's/bool_from_str/is_affirmative/' ./tests/test_utils.py
$ sed -i 's/bool_from_str/is_affirmative/' ./izk/utils.py
$ sed -i 's/bool_from_str/is_affirmative/' ./izk/prompt.py

I tend to find the for loop syntax clearer than xargs’s. xargs can however execute the commands in parallel using its -P n option, where n is the maximum number of parallel commands to be executed at a time, which can be a performance win if your command takes time to run.
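One caveat with for file in $(...) is that the shell splits the command output on any whitespace, so a path containing a space would be cut into several pieces. A common alternative is a while read loop, which consumes one full line at a time. The following is only a sketch: the project layout is made up, and it uses sed -i.bak (a form accepted on both Linux and macOS) and removes the backups afterwards.

```shell
# Hypothetical project, with a directory name containing a space.
project=$(mktemp -d)
mkdir "$project/my module"
echo 'def bool_from_str(s): pass' > "$project/my module/utils.py"
echo 'from utils import bool_from_str' > "$project/prompt.py"

# read -r stores a whole line in $file, spaces included.
# Quoting "$file" keeps the path in one piece when passed to sed.
grep -r --files-with-matches bool_from_str "$project" |
while IFS= read -r file; do
  sed -i.bak 's/bool_from_str/is_affirmative/g' "$file"
done

# Remove the backup files, then check what is left of the old name.
find "$project" -name '*.bak' -exec rm {} +
remaining=$(grep -r --files-with-matches bool_from_str "$project" || true)
renamed=$(grep -r --files-with-matches is_affirmative "$project" | wc -l | tr -d ' ')

echo "files using the new name: $renamed"
rm -rf "$project"
```

This still assumes that file names do not contain newlines, which is a reasonable assumption for most code projects.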
Summary
All these tools open up a world of possibilities, as they allow you to extract data and transform its format, making it possible to build entire workflows out of commands that were possibly never intended to work together. Each of these commands has a relatively small function (sort sorts, cat concatenates, grep filters, sed edits, cut cuts, etc).
Any given task involving text can then be reduced to a pipeline of smaller tasks, each of them performing a simple action and piping its output into the next task.
For example, if we wanted to know how many unique IPs could be found in a log file, and these IPs always appeared in the same column, we could:
- grep lines on a pattern specific to lines containing an IP address
- locate the column the IPs appear in, and extract them with awk
- sort the list of IPs with sort
- compute the list of unique IPs with uniq
- count the number of lines (aka, of unique IPs) with wc -l
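The steps above can be sketched as a single pipeline. The log format, the grep pattern and the column number below are made up for the example; they would need to be adapted to your own log file.

```shell
# Hypothetical access log: the IP is always in the 3rd column.
log=$(mktemp)
printf '%s\n' \
  'GET /index.html 10.0.0.1' \
  'GET /about.html 10.0.0.2' \
  'POST /login 10.0.0.1' \
  'healthcheck ok' \
  'GET /index.html 10.0.0.3' > "$log"

# grep: keep the request lines; awk: extract the IP column;
# sort | uniq: deduplicate; wc -l: count what is left.
# tr -d ' ' strips the padding some wc implementations add.
unique_ips=$(grep -E '^(GET|POST) ' "$log" |
  awk '{ print $3 }' |
  sort |
  uniq |
  wc -l |
  tr -d ' ')

echo "$unique_ips unique IPs"
rm -f "$log"
```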
As there is a plethora of text processing tools, either available by default or installable, there are bound to be many ways to solve any given task.
The examples in this article were contrived, but I suggest you read the amazing article “Command-line Tools can be 235x Faster than your Hadoop Cluster” to get a sense of how useful and powerful these text processing commands really are, and what real-life problems they can solve.
Going further
2.1: Count the number of files and directories located in your home directory.
2.2: Display the content of a file in all caps.
2.3: Count how many times each word was found in a file.
2.4: Count the number of vowels present in a file. Display the result from the most common to the least.