More about Bash
Don’t forget this week’s reading.
In this lecture, we continue our discussion of the Unix shell and its commands.
Goals
We plan to learn the following today:
- Redirection and pipes
- Special characters and quoting
- Standard input, output, and error
We’ll do this activity in today’s class:
- Sit with your group, and get to know your group mates.
- Experiment with shell pipelines in this Activity.
Class scripts
I recorded my Terminal windows from flume and Mac.
Redirection and pipes
So far we have seen Unix programs using their default input and output, called standard input (stdin) and standard output (stdout). By default the keyboard is the standard input and the display is the standard output. The Unix shell can redirect both the input and the output of programs. As an example of output redirection, consider the following.
[engs50@mc ~]$ date > listing
[engs50@mc ~]$ ls -lR public_html/ >> listing
The output redirection > writes the output of date to the file called listing; that is, the ‘standard output’ of the date process has been directed to the file instead of the default, the display. Note that the > operation created a file that did not exist before the output redirection command was executed. Next, we append a recursive, long-format directory listing to the same file; by using the >> (double >) we tell the shell to append to the file rather than overwriting the file.
Note that the > or >> and their target filenames are not arguments to the command - the command simply writes to stdout, as it always does, but the shell has arranged for stdout to be directed to a file instead of the terminal.
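Here is a small sketch you can run to see the difference between > and >> for yourself. It is only an illustration: the scratch directory made with mktemp and the three echoed words are my own assumptions, and the file name listing is simply borrowed from the example above. The $(...) form captures a command's output in a variable.

```shell
# Work in a scratch directory so no real files are touched (mktemp is an
# assumption here; any empty directory would do).
workdir=$(mktemp -d)

echo "first"  >  "$workdir/listing"   # > creates the file, or empties it if it exists
echo "second" >> "$workdir/listing"   # >> appends after the existing contents
after_append=$(cat "$workdir/listing")

echo "third"  >  "$workdir/listing"   # > again: the two earlier lines are overwritten
after_overwrite=$(cat "$workdir/listing")

echo "after append:"
echo "$after_append"
echo "after overwrite:"
echo "$after_overwrite"

rm -rf "$workdir"
```

After the append the file holds two lines; after the second > it holds only the last one.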
The shell also supports input redirection, which provides input to a program from a file rather than from the keyboard. First, let’s create a file of prime numbers using output redirection. The input to the cat command can come from the standard input (i.e., the keyboard). We can instruct the shell to redirect the cat command’s output (stdout) to a file named primes. The numbers are typed at the keyboard, and CTRL-D is used to signal the end of the file (EOF).
[engs50@mc ~]$ cat > primes
61
53
41
2
3
11
13
18
37
5
19
23
29
31
47
53
59
[engs50@mc ~]$
Input redirection < tells the shell to use a file as input to the command rather than the keyboard.
In the input redirection example below, primes is used as input to cat, which sends its standard output to the screen.
[engs50@mc ~]$ cat < primes
61
53
41
2
3
11
13
18
37
5
19
23
29
31
47
53
59
[engs50@mc ~]$
Many Unix commands (e.g., cat, sort) read their input from stdin if you do not specify a file on the command line.
For example, if you type cat followed by a carriage return (CR), the command waits for input from the standard input.
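As a quick sketch of that point, the two sort commands below should produce the same output: in the first, sort opens the file named in its argument; in the second, sort gets no file argument and simply reads stdin, which the shell has connected to the file. The scratch directory and the three sample numbers are assumptions made for the illustration.

```shell
workdir=$(mktemp -d)                    # scratch directory (an assumption)
printf '61\n2\n11\n' > "$workdir/primes"

# File named as an argument: sort opens primes itself.
from_argument=$(sort -n "$workdir/primes")

# No file argument: the shell opens primes and sort just reads its stdin.
from_stdin=$(sort -n < "$workdir/primes")

echo "$from_argument"
echo "$from_stdin"
rm -rf "$workdir"
```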
Unix also supports a powerful ‘pipe’ operator for passing data between commands: the operator | (a vertical bar, usually located above the \ key on your keyboard).
Pipes connect commands that run as separate processes; the processes are scheduled to run as data becomes available.
Pipes were invented by Doug McIlroy while he was working with Ken Thompson and Dennis Ritchie at AT&T Bell Labs. (As I mentioned earlier, Doug has been an adjunct professor here at Dartmouth College for several years now.) In this two-page interview, at the middle of the third column, Doug tells how pipes were invented and the | character selected as the operator. Pay special attention to the next paragraph: the Dartmouth Time Sharing System had something similar, even earlier!
Pipes are a clever invention indeed, since they remove the need for separate temporary files to share data between processes. Because commands are implemented as processes, a program reading an empty pipe will be ‘suspended’ until there is data ready for it to read. There is no limit to the number of programs or commands in the pipeline. In our example below there are four programs in the pipeline, all running simultaneously, each waiting on its input:
[engs50@mc ~]$ sort -n primes | uniq | grep -v 18 | more
2
3
5
11
13
19
23
29
31
37
41
47
53
59
61
[engs50@mc ~]$
What is the difference between pipes and redirection?
Basically, redirection (>, >>, <) is used to direct the stdout of a command to a file, or the contents of a file to the stdin of a command.
Pipes (|) are used to connect the stdout of one command to the stdin of another command.
This operator allows us to ‘glue’ together programs as ‘filters’ to process the plain text sent between them (plain text between the processes - a nice design decision).
This supports the notion of reuse and allows us to build sophisticated programs quickly and simply.
It’s another cool feature of Unix.
Notice three new commands above: sort, uniq, and grep.
- sort reads lines from stdin and outputs the lines in sorted order; here, -n tells sort to use numeric order (rather than alphabetical order);
- uniq removes duplicates, printing only one of a run of identical lines;
- grep prints lines matching a pattern (more generally, a regular expression); here, -v inverts this behavior: print lines that do not match the pattern. In this case, the pattern is simply 18 and grep does not print that number as it comes through.
And, as we saw last time, more pauses the output when it would scroll off the screen.
Note that the original file - primes - is not changed by executing the command line above.
Rather, the file is read in by the sort command and the data is manipulated as it is processed by each stage of the command pipeline.
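To make the contrast between redirection and pipes concrete, here is a hedged sketch that computes the same result both ways: once with redirection and a temporary file between the two stages, and once with a pipe. It also checks that the input file is left untouched. The file names numbers and sorted.tmp and the sample data are invented for the illustration.

```shell
workdir=$(mktemp -d)                    # scratch directory (an assumption)
printf '3\n2\n3\n5\n2\n' > "$workdir/numbers"
before=$(cat "$workdir/numbers")

# Redirection only: the stages run one after another and need a temporary file.
sort -n < "$workdir/numbers" > "$workdir/sorted.tmp"
via_files=$(uniq < "$workdir/sorted.tmp")

# Pipe: sort's stdout feeds uniq's stdin directly, with no temporary file.
via_pipe=$(sort -n < "$workdir/numbers" | uniq)

after=$(cat "$workdir/numbers")         # the input file is never modified

echo "$via_files"
echo "$via_pipe"
rm -rf "$workdir"
```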
Because sort and cat are happy to read their input data from stdin, or from a file given as an argument, the following pipelines all achieve the same result:
[engs50@mc ~]$ sort -n < primes | uniq | grep -v 18 | more
[engs50@mc ~]$ cat primes | sort -n | uniq | grep -v 18 | more
[engs50@mc ~]$ cat < primes | sort -n | uniq | grep -v 18 | more
Which do you think would be most efficient?
Another pipeline: How did I create that list of existing usernames, for Lab0?
[engs50@mc ~]$ cut -d : -f 1 /etc/passwd | sort > usernames.txt
See man cut to understand what the first command does.
Another example: what is the most popular shell? Try each of these in turn:
[engs50@mc ~]$ cut -d : -f 7 /etc/passwd
[engs50@mc ~]$ cut -d : -f 7 /etc/passwd | less
[engs50@mc ~]$ cut -d : -f 7 /etc/passwd | sort
[engs50@mc ~]$ cut -d : -f 7 /etc/passwd | sort | uniq -c
[engs50@mc ~]$ cut -d : -f 7 /etc/passwd | sort | uniq -c | sort -n
[engs50@mc ~]$ cut -d : -f 7 /etc/passwd | sort | uniq -c | sort -nr
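If you are not on a machine with an interesting /etc/passwd, the following sketch runs the same counting idiom on a tiny made-up file. The user names and the file passwd.sample are hypothetical; only the seventh colon-separated field (the login shell) matters here.

```shell
workdir=$(mktemp -d)                    # scratch directory (an assumption)
# A made-up stand-in for /etc/passwd; field 7 is the login shell.
printf '%s\n' \
    'alice:x:1001:1001::/home/alice:/bin/bash' \
    'bob:x:1002:1002::/home/bob:/bin/zsh' \
    'carol:x:1003:1003::/home/carol:/bin/bash' \
    'dave:x:1004:1004::/home/dave:/bin/tcsh' \
    'erin:x:1005:1005::/home/erin:/bin/bash' \
    'frank:x:1006:1006::/home/frank:/bin/zsh' > "$workdir/passwd.sample"

# sort groups identical shells together, uniq -c counts each run,
# and sort -nr puts the largest count first.
shell_counts=$(cut -d : -f 7 "$workdir/passwd.sample" | sort | uniq -c | sort -nr)

echo "$shell_counts"
rm -rf "$workdir"
```

The first sort matters: uniq only collapses adjacent identical lines, so unsorted input would be undercounted.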
MacOS tip:
There are three great commands you should know - they are not on Linux, because they interact with MacOS: open, pbpaste, pbcopy.
You’ve already seen the open command, which, at the MacOS Terminal command line, opens a file for viewing in the relevant MacOS application; for example, for a photo file you might type open photo.jpg and see Preview launch and open that file; for an html file you might type open index.html and see Safari launch and render that page.
The commands pbpaste and pbcopy fit nicely into many pipelines.
The first command prints the MacOS ‘clipboard’ to its standard output, and the second copies its standard input into the MacOS clipboard.
For example, select some text in a window somewhere, then cmd-C to copy it to the clipboard, then
pbpaste | wc
to count the lines, words, and characters in the clipboard. Or,
ls -l | pbcopy
saves the directory listing in the clipboard, where you might paste it into Piazza or an email message or some document.
I’m not certain you can combine pbpaste and pbcopy on the same command line, however; if you wanted to replace the clipboard with its number of lines, you might try
pbpaste | wc -l | pbcopy
but be careful mixing those two commands.
Standard input, output and error
As we learned above, every process (a running program) has a standard input (abbreviated to stdin) and a standard output (stdout). The shell sets stdin to the keyboard by default, but the command line can tell the shell to redirect stdin using < or a pipe. The shell sets stdout to the display by default, but the command line can tell the shell to redirect stdout using > or >>, or to a pipe.
Each process also has a standard error (stderr), which most programs use for printing error messages. The separation of stdout and stderr is important when stdout is redirected to a file or pipe, because normal output can flow into the file or pipe while error messages reach the user on the screen.
Inside the running process these three streams are represented with numeric file descriptors:
- 0: stdin
- 1: stdout
- 2: stderr
You can tell the shell to redirect using these numbers; > is shorthand for 1> and < is shorthand for 0<. You can thus redirect the standard error (file descriptor 2) with the symbol 2>.
Suppose I were logged in as dfk and wanted to list every web file named index.html anywhere in the cs50 account’s home directory tree.
[engs50@mc ~]$ find ~cs50 -name index.html > pages
find: `/net/class/cs50/Archive/src': Permission denied
find: `/net/class/cs50/.emacs.d': Permission denied
find: `/net/class/cs50/private/': Permission denied
[engs50@mc ~]$
I redirected the output of find to a file called pages - and indeed that file filled with useful information - but find printed its error messages to the screen so I could still see them. Suppose I wanted to capture those errors in a file too:
[engs50@mc ~]$ find ~cs50 -name index.html > pages 2> errors
[engs50@mc ~]$
The file errors contains the error messages we saw earlier.
As another alternative, we could ignore the error output entirely by sending it to a place where all characters go and never return!
[engs50@mc ~]$ find ~cs50 -name index.html > pages 2> /dev/null
[engs50@mc ~]$
The file called /dev/null is a special kind of file - it’s not a file at all, actually; it’s a ‘device’ that simply discards anything written to it.
(If you read from it, it appears to be an empty file.)
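You may not have permission problems handy to experiment with, so here is a small sketch that produces both kinds of output on purpose. It asks ls about one file that exists and one that does not; the names present and absent and the scratch directory are assumptions for the illustration. The first command uses the explicit descriptor numbers 1> and 2>; the second throws the error message away.

```shell
workdir=$(mktemp -d)                    # scratch directory (an assumption)
touch "$workdir/present"                # this file exists; "absent" does not

# ls names the existing file on stdout (fd 1) and complains about the missing
# one on stderr (fd 2), so the two streams can be sent to different files.
# ls exits with a failure status here; "|| true" just ignores that.
ls "$workdir/present" "$workdir/absent" 1> "$workdir/out" 2> "$workdir/err" || true
normal_output=$(cat "$workdir/out")
error_output=$(cat "$workdir/err")

# Same command, but the error message is discarded.
quiet_output=$(ls "$workdir/present" "$workdir/absent" 2> /dev/null || true)

echo "stdout: $normal_output"
echo "stderr: $error_output"
rm -rf "$workdir"
```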
Special characters
There are a number of special characters interpreted by the shell - spaces, tabs, wildcard (‘globbing’) characters for filename expansion, redirection symbols, and so forth. Because the shell gives these characters special meaning, they cannot be used as regular characters unless they are escaped or quoted. These special characters include:
& ; | * ? ` " ' [ ] ( ) $ < > { } # / \ ! ~
We have already used several of these special characters. Don’t try to memorize them at this stage. Through use, they will become second nature. We will just give some examples of the ones we have not discussed so far.
Quoting
If you need to use one of these special characters as a regular character, you can tell the shell not to interpret it by escaping or quoting it. To escape a single special character, precede it with a backslash \; earlier we saw how to escape the character * with \*. To escape multiple special characters (as in **), escape each: \*\*. You can also quote using single quotation marks such as '**' or double quotation marks such as "**" - but these have subtly different behavior. You might use this form when quoting a filename with embedded spaces: "Homework assignment".
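A short sketch of the three ways to protect a * from the shell, compared with leaving it unprotected. The two file names a.txt and b.txt and the scratch directory are assumptions; each echo runs inside that directory so the unprotected * has something to expand to.

```shell
workdir=$(mktemp -d)                    # scratch directory (an assumption)
touch "$workdir/a.txt" "$workdir/b.txt"

# Each echo runs inside the scratch directory; only the first * is expanded.
unquoted=$(cd "$workdir" && echo *)     # shell expands * to: a.txt b.txt
escaped=$(cd "$workdir" && echo \*)     # backslash: echo receives a literal *
single=$(cd "$workdir" && echo '*')     # single quotes: literal *
double=$(cd "$workdir" && echo "*")     # double quotes: literal *

echo "$unquoted"
echo "$escaped $single $double"
rm -rf "$workdir"
```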
You will often need to pass special characters as part of arguments to commands and other programs - for example, an argument that represents a pattern to be interpreted by the command, as happens often with find and grep.
There is a situation where single quotes work differently than double quotes. If you use a pair of single quotes around a shell variable substitution (like $USER), the variable’s value will not be substituted, whereas it would be substituted within double quotes:
[engs50@mc ~]$ echo "$LOGNAME uses the $SHELL shell and his home directory is $HOME."
cs50 uses the /bin/bash shell and his home directory is /net/class/cs50.
[engs50@mc ~]$ echo '$LOGNAME uses the $SHELL shell and his home directory is $HOME.'
$LOGNAME uses the $SHELL shell and his home directory is $HOME.
[engs50@mc ~]$
Example 1. Double quotes are especially important in shell scripts, because the variables involved might hold user input (a command-line argument or keyboard input), a file name, or the output of a command; such variables should always be quoted when substituted, because spaces (and other special characters) embedded in the value of the variable can cause confusion. Thus:
directoryName="Homework three"
...
mkdir "$directoryName"
mkdir $directoryName
Try it!
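If you would rather not clutter your own directory, this sketch runs both mkdir commands in separate scratch directories (made with mktemp, an assumption of mine) and lists what each one created. The quoted form should create one directory; the unquoted form is split at the space and creates two.

```shell
directoryName="Homework three"
quoted_dir=$(mktemp -d)                 # scratch directories (an assumption)
unquoted_dir=$(mktemp -d)

# Quoted: mkdir receives one argument, "Homework three".
(cd "$quoted_dir" && mkdir "$directoryName")

# Unquoted: the shell splits the value at the space, so mkdir receives
# two arguments and creates two directories, Homework and three.
(cd "$unquoted_dir" && mkdir $directoryName)

quoted_result=$(ls "$quoted_dir")
unquoted_result=$(ls "$unquoted_dir")

echo "quoted:   $quoted_result"
echo "unquoted: $unquoted_result"
rm -rf "$quoted_dir" "$unquoted_dir"
```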
Example 2. Escapes and quoting can pass special characters and patterns to commands.
Suppose I have a list of email addresses, one per line.
Eric Chen <eric.t.chen.17@dartmouth.edu>
Rui Liu <Rui.Liu.GR@dartmouth.edu>
Taylor Hardin <Taylor.A.Hardin.GR@dartmouth.edu>
Elena N. Horton <Elena.N.Horton.18@dartmouth.edu>
Jinnan Ge <Jinnan.Ge.TH@dartmouth.edu>
Mauricio Esquivel Rogel <Mauricio.Esquivel.Rogel.18@dartmouth.edu>
James Edwards <James.G.Edwards.18@dartmouth.edu>
Emily J. Lin <Emily.J.Lin.18@dartmouth.edu>
John Cardwell <John.R.Cardwell.18@dartmouth.edu>
John Kotz <John.P.Kotz.19@dartmouth.edu>
Joel J. Katticaran <joel.j.katticaran.ug@dartmouth.edu>
Kaya M. Thomas <Kaya.M.Thomas.17@dartmouth.edu>
Trevor L. Davis <trevor.l.davis.18@dartmouth.edu>
Thomas D. Kim <Thomas.D.Kim.19@dartmouth.edu>
Kyle Dotterrer <kyle.dotterrer.18@dartmouth.edu>
Some email programs, or websites, require these to be comma-separated. If I have the above text in my clipboard, I can change those newline characters to commas:
$ pbpaste | tr \\n ,
Eric Chen <eric.t.chen.17@dartmouth.edu>, Rui Liu <Rui.Liu.GR@dartmouth.edu>, Taylor Hardin <Taylor.A.Hardin.GR@dartmouth.edu>, Elena N. Horton <Elena.N.Horton.18@dartmouth.edu>, Jinnan Ge <Jinnan.Ge.TH@dartmouth.edu>, Mauricio Esquivel Rogel <Mauricio.Esquivel.Rogel.18@dartmouth.edu>, James Edwards <James.G.Edwards.18@dartmouth.edu>, Emily J. Lin <Emily.J.Lin.18@dartmouth.edu>, John Cardwell <John.R.Cardwell.18@dartmouth.edu>, John Kotz <John.P.Kotz.19@dartmouth.edu>, Joel J. Katticaran <joel.j.katticaran.ug@dartmouth.edu>, Kaya M. Thomas <Kaya.M.Thomas.17@dartmouth.edu>, Trevor L. Davis <trevor.l.davis.18@dartmouth.edu>, Thomas D. Kim <Thomas.D.Kim.19@dartmouth.edu>, Kyle Dotterrer <kyle.dotterrer.18@dartmouth.edu>
On Unix a single special character called ‘newline’ is what defines the end of one line and the beginning of the next. A common syntax for the newline character, in programming languages, is \n. But the \ is special, in bash, so we need to escape it, um, with \; thus we have \\n.
The tr command filters stdin to stdout, translating each instance of the character given in the first argument (\\n) to the character given in the second argument (,). Here, tr spit out the translated input - on one very long line that does not end with a newline. I can then copy-paste that result into the pesky web browser.
Outlook and Microsoft tools want semicolons, not commas; sigh. But the semicolon is also special to bash; so we must escape it too:
$ tr \\n \;
On MacOS I can connect the MacOS “Clipboard” to this tool as follows:
$ pbpaste | tr \\n \; | pbcopy
Great! Copy the list of addresses to the clipboard, type that command in the shell, and now the clipboard contains a semicolon-separated list of addresses.
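Since pbpaste and pbcopy exist only on MacOS, here is a sketch of the same tr idea that should run anywhere, using printf to stand in for the clipboard. The three names are made up. It also shows that single quotes are an alternative to the backslash escapes: both forms hand tr the same two arguments.

```shell
# Three made-up names, one per line, standing in for the clipboard contents.
names=$(printf 'ann\nbob\ncat')

# Backslash escapes, as in the commands above...
with_backslashes=$(printf '%s\n' "$names" | tr \\n \;)
# ...and the same thing with single quotes around each argument.
with_quotes=$(printf '%s\n' "$names" | tr '\n' ';')

echo "$with_backslashes"
echo "$with_quotes"
```

Note that the newline ending the last line is translated too, so the result ends with a separator.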
Example 3.
An even more powerful filtering tool - the stream editor called sed - allows you to transform occurrences of one or more patterns in the input file(s):
sed pattern [file]...
For example,
[engs50@mc ~]$ cat .forward
David Kotz <kotz@cs>
Eric Chen <eric.t.chen.17@dartmouth.edu>
Rui Liu <Rui.Liu.GR@dartmouth.edu>
Taylor Hardin <Taylor.A.Hardin.GR@dartmouth.edu>
Elena N. Horton <Elena.N.Horton.18@dartmouth.edu>
Jinnan Ge <Jinnan.Ge.TH@dartmouth.edu>
Mauricio Esquivel Rogel <Mauricio.Esquivel.Rogel.18@dartmouth.edu>
James Edwards <James.G.Edwards.18@dartmouth.edu>
Emily J. Lin <Emily.J.Lin.18@dartmouth.edu>
John Cardwell <John.R.Cardwell.18@dartmouth.edu>
John Kotz <John.P.Kotz.19@dartmouth.edu>
Joel J. Katticaran <joel.j.katticaran.ug@dartmouth.edu>
Kaya M. Thomas <Kaya.M.Thomas.17@dartmouth.edu>
Trevor L. Davis <trevor.l.davis.18@dartmouth.edu>
Thomas D. Kim <Thomas.D.Kim.19@dartmouth.edu>
Kyle Dotterrer <kyle.dotterrer.18@dartmouth.edu>
[engs50@mc ~]$ # remove myself, and excess white space
[engs50@mc ~]$ sed -e /kotz@cs/d -e 's/\t/ /g' -e 's/  */ /g' .forward
Eric Chen <eric.t.chen.17@dartmouth.edu>
Rui Liu <Rui.Liu.GR@dartmouth.edu>
Taylor Hardin <Taylor.A.Hardin.GR@dartmouth.edu>
Elena Horton <Elena.N.Horton.18@dartmouth.edu>
Jinnan Ge <Jinnan.Ge.TH@dartmouth.edu>
Mauricio Esquivel Rogel <Mauricio.Esquivel.Rogel.18@dartmouth.edu>
James Edwards <James.G.Edwards.18@dartmouth.edu>
Emily Lin <Emily.J.Lin.18@dartmouth.edu>
John Cardwell <John.R.Cardwell.18@dartmouth.edu>
John Kotz <John.P.Kotz.19@dartmouth.edu>
Joel Katticaran <joel.j.katticaran.ug@dartmouth.edu>
Kaya Thomas <Kaya.M.Thomas.17@dartmouth.edu>
Trevor Davis <trevor.l.davis.18@dartmouth.edu>
Thomas Kim <Thomas.D.Kim.19@dartmouth.edu>
Kyle Dotterrer <kyle.dotterrer.18@dartmouth.edu>
[engs50@mc ~]$ # remove myself, remove names, make comma-sep list of addresses
[engs50@mc ~]$ sed -e /kotz@cs/d -e 's/.*<//' -e 's/>.*/,/' .forward
eric.t.chen.17@dartmouth.edu,
Rui.Liu.GR@dartmouth.edu,
Taylor.A.Hardin.GR@dartmouth.edu,
Elena.N.Horton.18@dartmouth.edu,
Jinnan.Ge.TH@dartmouth.edu,
Mauricio.Esquivel.Rogel.18@dartmouth.edu,
James.G.Edwards.18@dartmouth.edu,
Emily.J.Lin.18@dartmouth.edu,
John.R.Cardwell.18@dartmouth.edu,
John.P.Kotz.19@dartmouth.edu,
joel.j.katticaran.ug@dartmouth.edu,
Kaya.M.Thomas.17@dartmouth.edu,
trevor.l.davis.18@dartmouth.edu,
Thomas.D.Kim.19@dartmouth.edu,
kyle.dotterrer.18@dartmouth.edu,
[engs50@mc ~]$
The above uses the -e switch to sed, which allows one to list more than one pattern on the same command line.
A few quick notes about sed’s patterns:
- d deletes lines matching the pattern
- p prints lines matching the pattern (useful with -n)
- s substitutes text for matches to the pattern.
Patterns are actually regular expressions; I use some above.
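Here is a minimal sketch of d, p, and s, each applied to the same three-line file. The file menu and its contents are invented for the illustration.

```shell
workdir=$(mktemp -d)                    # scratch directory (an assumption)
printf 'apple pie\nbanana split\ncherry pie\n' > "$workdir/menu"   # made-up data

deleted=$(sed '/banana/d' "$workdir/menu")        # d: drop lines matching banana
printed=$(sed -n '/banana/p' "$workdir/menu")     # -n with p: print only matching lines
substituted=$(sed 's/pie/tart/' "$workdir/menu")  # s: replace pie with tart

echo "$deleted"
echo "$printed"
echo "$substituted"
rm -rf "$workdir"
```

Without -n, sed prints every line anyway, so p would print the matching lines twice.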
Example 4.
I saved a list of students enrolled in CS50 in the file ~cs50/demo/students.txt.
Each line is of the form First.M.Last.18@Dartmouth.edu; (note the trailing semicolon).
Let’s suppose you all decide to move to Harvard.
[engs50@mc ~]$ sed s/Dartmouth/Harvard/ demo/students.txt
...
[engs50@mc ~]$ sed -e s/Dartmouth/Harvard/ -e 's/\.[0-9][0-9]//' -e 's/;$//' demo/students.txt
...
The second form removes the dot and two-digit class number, and the trailing semicolon. Notice how I quoted those patterns from the shell, and even escaped the dot from sed’s normal meaning (dot matches any character) so sed would look for a literal dot in that position. The dollar $ constrains the semicolon match to happen at the end of the line.
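You cannot read my students.txt, so here is a sketch that applies the same three sed expressions to a single line in the format described above, followed by a small experiment showing why the backslash before the dot matters. The string ab12.34 is made-up test text.

```shell
address='First.M.Last.18@Dartmouth.edu;'   # one line in the format described above

cleaned=$(echo "$address" | sed -e 's/Dartmouth/Harvard/' -e 's/\.[0-9][0-9]//' -e 's/;$//')

# Why the backslash matters: an unescaped dot matches any character.
sample='ab12.34'                                       # made-up text
unescaped=$(echo "$sample" | sed 's/.[0-9][0-9]//')    # removes b12
escaped=$(echo "$sample" | sed 's/\.[0-9][0-9]//')     # removes .34

echo "$cleaned"
echo "$unescaped $escaped"
```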
Here’s another fun pipe: count the number of students from each class (leveraging the class numbers in email addresses):
[engs50@mc ~]$ tr -c -d 0-9\\n < demo/students.txt | sed 's/^$/other/' | sort | uniq -c | sort -nr
27 18
22 19
9 17
5 other
1 16
[engs50@mc ~]$
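To see what each stage of that pipeline contributes, here is a sketch on six made-up addresses in the same format (the names and the scratch file are hypothetical). It first shows the output of the tr stage alone, then the whole pipeline.

```shell
workdir=$(mktemp -d)                    # scratch directory (an assumption)
# Made-up addresses in the same format as students.txt.
printf '%s\n' \
    'Ann.B.Cole.18@Dartmouth.edu;' \
    'Dan.Eaton.19@Dartmouth.edu;' \
    'Fay.G.Hill.18@Dartmouth.edu;' \
    'Ian.Jones.GR@Dartmouth.edu;' \
    'Kim.L.Moss.18@Dartmouth.edu;' \
    'Ned.Owen.19@Dartmouth.edu;' > "$workdir/students.txt"

# Stage 1: tr -c -d deletes every character except digits and newlines,
# leaving the class number (or an empty line) for each student.
digits_only=$(tr -c -d '0-9\n' < "$workdir/students.txt")

# Full pipeline: name the empty lines "other", then count and rank as before.
class_counts=$(tr -c -d '0-9\n' < "$workdir/students.txt" | sed 's/^$/other/' | sort | uniq -c | sort -nr)

echo "$digits_only"
echo "$class_counts"
rm -rf "$workdir"
```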
See man sed or the sed FAQ for more info.
You’ll want to learn a bit about regular expressions, which are used to describe patterns in sed’s commands; see sed regexp info.
Compressing and archiving files
It is often useful to bundle several files into a compressed archive file.
You may have encountered files like this on the Internet - such as files.zip, something.tar, or something-else.tgz.
Each packs together several files - including directories, creation dates, access permissions, as well as file contents - into one single file.
This is a convenient way to transfer large collections of files.
On Unix it is most common to use the tar utility (short for tape archive - from back when we used tapes) to create an archive of all the files and directories listed in the arguments, giving the archive an appropriate name.
We often ask tar to compress the resulting archive too.
Given a directory stuff, you can create (c) a compressed tar archive (aka, a “tarball”), and then list (t) its contents.
The f refers to a file and the z refers to compression.
[engs50@mc ~]$ mkdir stuff
[engs50@mc ~]$ echo > stuff/x
[engs50@mc ~]$ tar cfz stuff.tgz stuff
98.8%
[engs50@mc ~]$ tar tfz stuff.tgz
stuff/
stuff/x
[engs50@mc ~]$
The command leaves the original directory and files intact.
Notice that tar has an unconventional syntax for its switches - there is no dash (-) before them.
To unpack the archive,
[engs50@mc ~]$ tar xfz stuff.tgz
In short, c for create, t for type (list), x for extract.
The f implies that the next argument is the tarball file name.
The z indicates that the tarball should be compressed.
By convention, a tarball filename ends in .tar if not compressed, .tgz (or .tar.gz) if compressed.
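Putting the three operations together, here is a sketch of a full round trip: create a tarball, list it, and extract it somewhere else to confirm the contents survive. The scratch directory, the unpacked directory, and the word hello are assumptions for the illustration.

```shell
workdir=$(mktemp -d)                    # scratch directory (an assumption)
mkdir "$workdir/stuff"
echo "hello" > "$workdir/stuff/x"

# c: create, f: archive file name follows, z: compress.
(cd "$workdir" && tar cfz stuff.tgz stuff)

# t: list the contents without unpacking.
listing=$(tar tfz "$workdir/stuff.tgz")

# x: extract, here into a separate directory to show the copy is complete.
mkdir "$workdir/unpacked"
(cd "$workdir/unpacked" && tar xfz ../stuff.tgz)
restored=$(cat "$workdir/unpacked/stuff/x")

echo "$listing"
echo "$restored"
rm -rf "$workdir"
```

Extracting into a separate directory is a deliberate choice: tar x writes into the current directory and will overwrite files of the same name.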