This lesson is in the early stages of development (Alpha version)

Wildcards and pipes

Overview

Teaching: 45 min
Exercises: 10 min
Questions
  • How can I run a command on multiple files at once?

  • Is there an easy way of saving a command’s output?

Objectives
  • Redirect a command’s output to a file.

  • Process a file instead of keyboard input using redirection.

  • Construct command pipelines with two or more stages.

  • Explain what usually happens if a program or pipeline isn’t given any input to process.

Required files

If you didn’t get them in the last lesson, make sure to download the example files used in the next few sections:

Using wget: wget https://rse.shef.ac.uk/hpc-shell-tuos-citc/files/bash-lesson.tar.gz

Using a web browser: https://rse.shef.ac.uk/hpc-shell-tuos-citc/files/bash-lesson.tar.gz

Now that we know some of the basic UNIX commands, we are going to explore some more advanced features. The first of these features is the wildcard *. In our examples before, we’ve done things to files one at a time and otherwise had to specify things explicitly. The * character lets us speed things up and do things across multiple files.

Ever wanted to move, delete, or just do “something” to all files of a certain type in a directory? * lets you do that, by taking the place of zero or more characters in a piece of text. So *.txt matches every .txt file in a directory, for instance, and * by itself matches all files. Let’s use our example data to see how this works.

$ tar xvf bash-lesson.tar.gz
$ ls
NA12873_1.fastq
CosmicCodingMuts.vcf
Cancer_Gene_Census_Hallmarks_Of_Cancer.tsv
GRCh38_chr20.gtf
NA12873_2.fastq
NA12874_1.fastq
NA12874_2.fastq
NA12878_1.fastq
NA12878_2.fastq

Now we have a whole bunch of example files in our directory. For this example we are going to learn a new command that tells us how long a file is: wc. wc -l file tells us the length of a file in lines.

$ wc -l GRCh38_chr20.gtf
75706 GRCh38_chr20.gtf

Interesting, there are over 75000 lines in our GRCh38_chr20.gtf file. What if we wanted to run wc -l on every .fastq file? This is where * comes in really handy! *.fastq would match every file ending in .fastq.

$ wc -l *.fastq
20000 NA12873_1.fastq
20000 NA12873_2.fastq
20000 NA12874_1.fastq
20000 NA12874_2.fastq
20000 NA12878_1.fastq
20000 NA12878_2.fastq
120000 total

That was easy. What if we wanted to do the same command, except on every file in the directory? A nice trick to keep in mind is that * by itself matches every file.

$ wc -l *
100 Cancer_Gene_Census_Hallmarks_Of_Cancer.tsv
100 CosmicCodingMuts.vcf
75706 GRCh38_chr20.gtf
20000 NA12873_1.fastq
20000 NA12873_2.fastq
20000 NA12874_1.fastq
20000 NA12874_2.fastq
20000 NA12878_1.fastq
20000 NA12878_2.fastq
11403 bash-lesson.tar.gz
207309 total

Multiple wildcards

You can even use multiple *s at a time. How would you run wc -l on every file with “878” in it?

Solution

wc -l *878*

i.e. anything or nothing, then 878, then anything or nothing.
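To check which names a pattern picks up before handing it to a command like wc, echo can print the expanded list. The sketch below is only an illustration: it works in a throwaway directory, and the file names and the scratch and matches variables are made up rather than taken from the lesson data.

```shell
# Create a throwaway directory so the lesson files are left untouched.
scratch=$(mktemp -d)

# Hypothetical file names, chosen only to exercise the pattern.
touch "$scratch/878.txt" "$scratch/NA12878_1.fastq" \
      "$scratch/sample878" "$scratch/NA12873_1.fastq"

# The shell replaces *878* with every matching name before echo runs.
# The cd happens inside $( ), so the current directory is not changed.
matches=$(cd "$scratch" && echo *878*)
echo "$matches"
```

Three of the four names should be printed: one where 878 starts the name, one where it sits in the middle, and one where it ends the name. NA12873_1.fastq is left out because it does not contain 878.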

Using other commands

Now let’s try cleaning up our working directory a bit. Create a folder called “fastq” and move all of our .fastq files there in one mv command.

Solution

mkdir fastq
mv *.fastq fastq/

Redirecting output

Each of the commands we’ve used so far does only a very small amount of work. However, we can chain these small UNIX commands together to perform otherwise complicated actions!

For our first foray into redirecting output, we are going to use the > operator to write output to a file. When using >, the output of the command on the left of the > is written to the filename you specify on the right of the arrow. The actual syntax looks like command > filename.

Let’s try several basic usages of >. echo simply prints back, or echoes whatever you type after it.

$ echo "this is a test"
this is a test
$ echo "this is a test" > test.txt
$ ls
Cancer_Gene_Census_Hallmarks_Of_Cancer.tsv  GRCh38_chr20.gtf    fastq
CosmicCodingMuts.vcf                        bash-lesson.tar.gz  test.txt
$ cat test.txt
this is a test

Awesome, let’s try that with a more complicated command, like wc -l.

$ wc -l * > word_counts.txt
wc: fastq: Is a directory
$ cat word_counts.txt
 100 Cancer_Gene_Census_Hallmarks_Of_Cancer.tsv
 100 CosmicCodingMuts.vcf
75706 GRCh38_chr20.gtf
11403 bash-lesson.tar.gz
   0 fastq
   1 test.txt
87310 total

Notice how we still got some output to the console even though we redirected the output to a file? Our expected output still went to the file, but why did the error message bypass the file and appear on the console instead?

This phenomenon is an artefact of how UNIX systems are built. There are 3 input/output streams for every UNIX program you will run: stdin, stdout, and stderr.

Let’s dissect these three streams of input/output in the command we just ran: wc -l * > word_counts.txt
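The two output streams can be pulled apart in a small experiment. This is a minimal sketch assuming a POSIX-style shell, where 2> redirects stderr in the same way that > redirects stdout. The scratch directory and the names present.txt, missing.txt, out.txt and err.txt are hypothetical.

```shell
# Throwaway directory holding one file that exists.
scratch=$(mktemp -d)
touch "$scratch/present.txt"

# ls writes the name of the existing file to stdout and a complaint about
# the missing one to stderr; each stream is sent to its own file here.
# ls exits with a non-zero status because of the missing file, so
# "|| true" keeps a script running under "set -e" from stopping.
ls "$scratch/present.txt" "$scratch/missing.txt" \
    > "$scratch/out.txt" 2> "$scratch/err.txt" || true

cat "$scratch/out.txt"   # the normal listing (stdout)
cat "$scratch/err.txt"   # the error message (stderr)
```

Nothing reaches the console from the ls command itself, because both of its output streams were redirected. In the wc command above only stdout was redirected, which is why the error message still appeared on screen.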

Knowing what we know now, let’s try re-running the command, and send all of the output (including the error message) to the same word_counts.txt file as before.

$ wc -l * &> word_counts.txt

Notice how there was no output to the console that time. Let’s check that the error message went to the file like we specified.

$ cat word_counts.txt
     100 Cancer_Gene_Census_Hallmarks_Of_Cancer.tsv
      14 CosmicCodingMuts.vcf.gz
   75706 GRCh38_chr20.gtf
   11403 bash-lesson.tar.gz
       4 demo.sh
wc: fastq: Is a directory
       0 fastq
       9 loop.sh
       1 test.txt
       9 word_counts.txt
   87246 total

Success! The wc: fastq: Is a directory error message was written to the file. Also, note how the file was silently overwritten by directing output to the same place as before. Sometimes this is not the behaviour we want. How do we append (add) to a file instead of overwriting it?

Appending to a file is done the same way as redirecting output. However, instead of >, we will use >>.

$ echo "We want to add this sentence to the end of our file" >> word_counts.txt
$ cat word_counts.txt
 100 Cancer_Gene_Census_Hallmarks_Of_Cancer.tsv
 100 CosmicCodingMuts.vcf
75706 GRCh38_chr20.gtf
11403 bash-lesson.tar.gz
   0 fastq
   1 test.txt
87310 total
We want to add this sentence to the end of our file
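The difference between > and >> is easy to check on a scratch file. The sketch below uses a temporary file created with mktemp; the log variable and the three words written to it are placeholders.

```shell
# A temporary file to experiment on.
log=$(mktemp)

echo "first"  > "$log"    # creates the file contents
echo "second" > "$log"    # overwrites: "first" is gone
echo "third" >> "$log"    # appends: "second" is kept

cat "$log"
```

The file should end up with two lines, second and third, since only the last > and the >> that follows it leave a trace.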

Chaining commands together

We now know how to redirect stdout and stderr to files. We can actually take this a step further and redirect output (stdout) from one command to serve as the input (stdin) for the next. To do this, we use the | (pipe) operator.

grep is an extremely useful command. It finds things for us within files. Basic usage (there are a lot of options for more clever things, see the man page) uses the syntax grep whatToFind fileToSearch. Let’s use grep to find all of the entries pertaining to the NM_001323679.2 gene in the human genome.

$ grep NM_001323679.2 GRCh38_chr20.gtf

The output is nearly unintelligible since there is so much of it. Let’s send the output of that grep command to head so we can just take a peek at the first line. The | operator lets us send output from one command to the next:

$ grep NM_001323679.2 GRCh38_chr20.gtf | head -n 1
chr20   hg38_ncbiRefSeq exon    347111  347142  0.000000        +       .       gene_id "NM_001323679.2"; transcript_id "NM_001323679.2";

Nice work, we sent the output of grep to head. Let’s try counting the number of entries for NM_001323679.2 with wc -l. We can do the same trick to send grep’s output to wc -l:

$ grep NM_001323679.2 GRCh38_chr20.gtf | wc -l
11

Note that this is just the same as redirecting output to a file, then reading the number of lines from that file.
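The equivalence can be checked side by side. This sketch uses three made-up lines in place of the real data file, and the variable names are hypothetical. The two-step version also uses <, which feeds a file to a command's stdin instead of the keyboard.

```shell
# Three made-up lines stand in for a real data file.
data=$(mktemp)
printf 'geneA exon\ngeneB exon\ngeneA CDS\n' > "$data"

# Two steps: save grep's output, then count the lines of the saved file.
saved=$(mktemp)
grep geneA "$data" > "$saved"
via_file=$(wc -l < "$saved")

# One step: pipe grep's output straight into wc.
via_pipe=$(grep geneA "$data" | wc -l)

# Both counts should be 2, since two of the three lines mention geneA.
echo "$via_file $via_pipe"
```

The pipe gives the same count without leaving an intermediate file behind.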

Writing commands using pipes

How many files are there in the “fastq” directory we made earlier? (Use the shell to do this.)

Solution

ls fastq/ | wc -l

When its output is sent through a pipe like this, ls prints one item per line, so counting the lines gives the number of files.

Reading from compressed files

Let’s compress one of our files using gzip.

$ gzip CosmicCodingMuts.vcf

zcat acts like cat, except that it can read information from .gz (compressed) files. Using zcat, can you write a command to take a look at the top few lines of the CosmicCodingMuts.vcf.gz file (without decompressing the file itself)?

Solution

zcat CosmicCodingMuts.vcf.gz | head

The head command without any options shows the first 10 lines of a file.
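The same pattern can be tried on stand-in data. This is a sketch under a few assumptions: seq is available to generate numbered lines, zcat is the gzip-aware version found on most Linux systems, and the file name numbers.txt is invented for the example.

```shell
# Twenty numbered lines as stand-in data.
scratch=$(mktemp -d)
seq 1 20 > "$scratch/numbers.txt"

# gzip replaces numbers.txt with numbers.txt.gz.
gzip "$scratch/numbers.txt"

# Decompress to stdout and keep only the first three lines.
# The .gz file on disk is left as it is.
top=$(zcat "$scratch/numbers.txt.gz" | head -n 3)
echo "$top"

ls "$scratch"
```

The final ls should list only numbers.txt.gz, showing that the data was read without writing a decompressed copy to disk.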

Key Points

  • The * wildcard is used as a placeholder to match any text that follows a pattern.

  • Redirect a command’s output to a file with >.

  • Commands can be chained with |.