MCB 5472 : Intro to UNIX

Please let me know how far you got during the lab. If most students did not finish, we will continue with this exercise next week!

Questions you should answer are given in blue. Please send your answers by email to gogarten@uconn.edu with the subject MCB5472.

The UNIX exercises were adapted from Keith Bradnam & Ian Korf's Unix and Perl Primer for Biologists <http://korflab.ucdavis.edu/Unix_and_Perl/>; the vi exercise is from "Learning the vi Editor, 6th Edition" by Linda Lamb and Arnold Robbins, O'Reilly Media, Inc., print ISBN-13: 978-1-565-92426-0, available through Safari Online.

Introduction to Unix

These exercises will (hopefully) teach you to become comfortable when working in the environment of the Unix terminal. Unix contains many hundreds of commands, but you will probably use just 10 or so to achieve most of what you want to do.

You are probably used to working with programs like the Apple Finder or the Windows File Explorer to navigate around the hard drive of your computer. Some people are so used to using the mouse to move files, drag files to the trash etc. that it can seem strange switching from this behavior to typing commands instead. Be patient, and try, as much as possible, to stay within the world of the Unix terminal. Please make sure you complete and understand each task before moving on to the next one.

U1. The Terminal

A ʻterminalʼ is the common name for the program that does two main things. It allows you to type input to the computer (i.e. run programs, move/view ﬁles etc.) and it allows you to see output from those programs. All Unix machines will have a terminal program and on Apple computers, the terminal application is unsurprisingly named ʻTerminalʼ.

[If your private or lab PC runs Windows: In the computer lab we will use iMacs, but most of the programs we use will run on the cluster in the Bioinformatics Services facility. UConn provides an excellent ssh program for Windows machines; see ftp://ftp.uconn.edu/restricted/ssh/ . If you normally use a Windows machine, I recommend installing this program and doing your exercises on the bioinformatics cluster. Let me know if you need help with the installation.]

Task U1.1: Use the ʻSpotlightʼ search tool (the little magnifying glass in the top right of the menu bar) to ﬁnd and launch Appleʼs Terminal application.

You should now see something that looks like the following (the text that appears inside your terminal window will be slightly different):

Before we go any further, you should note that you can:

make the text larger/smaller (hold down ʻcommandʼ and either ʻ+ʼ or ʻ–ʼ)

resize the window (this will often be necessary)

have multiple terminal windows on screen (see the ʻShellʼ menu)

have multiple tabs open within each window (again see the ʻShellʼ menu)

change the display by using "preferences" (see the 'Terminal' menu)

There will be many situations where it will be useful to have multiple terminals open and it will be a matter of preference as to whether you want to have multiple windows, or one window with multiple tabs.

U2. Your ﬁrst Unix command

Unix keeps files arranged in a hierarchical structure. From the 'top-level' of the computer, there will be a number of directories, each of which can contain files and subdirectories, and each of those in turn can of course contain more files and directories and so on, ad infinitum. It's important to note that you will always be "in" a directory when using the terminal. The default behavior is that when you open a new terminal you start in your own 'home' directory (containing files and directories that only you can modify).

To see what ﬁles are in our home directory, we need to use the ls command. This command ʻlistsʼ the contents of a directory. So why donʼt they call the command ʻlistʼ instead? Well, this is a good thing because typing long commands over and over again is tiring and time-consuming. There are many (frequently used) Unix commands that are just two or three letters. If we run the ls command we should see something like:

There are four things that you should note here:

You will probably see different output from what is shown here; it depends on your computer. Don't worry about that for now.

The 'lamarck:~ jpgogarten$' text that you see is the Unix command prompt. It contains my user name (jpgogarten), the name of the machine that I am working on ('lamarck') and the name of the current directory ('~', more on that later). Note that the command prompt might not look the same on different Unix systems. In this case, the $ sign marks the end of the prompt.

The output of the ls command lists twenty things. In this case, most are directories, but they could also be ﬁles. Weʼll learn how to tell them apart later on.

After the ls command ﬁnishes it produces a new command prompt, ready for you to type your next command.

The ls command is used to list the contents of any directory, not necessarily the one that you are currently in. If you want to list the files on your desktop type the following:

> ls ~/Desktop/

To obtain more information about the contents of a directory, you can use the command with flags. E.g.

> ls -l
-l (The lowercase letter ``ell''.) List in long format. (See below.) If the output is to a terminal, a total sum for all the file sizes is output on a line before the long listing.
-G Enable colorized output.
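
As a small sketch of how flags change what ls prints, the commands below build a throwaway directory and list it in several ways. The directory and file names are made up for illustration, and the mktemp -d command is assumed to be available (it is on macOS and Linux).

```shell
# Build a throwaway directory so the listing is predictable.
sandbox=$(mktemp -d)
mkdir "$sandbox/data"
printf 'hello\n' > "$sandbox/notes.txt"
printf 'hidden\n' > "$sandbox/.secret"

ls "$sandbox"        # short listing: data and notes.txt
ls -l "$sandbox"     # long listing: permissions, owner, size, date
ls -a "$sandbox"     # also shows names that start with a dot
ls -la "$sandbox"    # single-letter flags can be combined

# rm -r "$sandbox"   # uncomment to clean up afterwards
```

Because the directory is given as an argument, none of these commands depend on which directory you are currently in.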

To obtain more information on a Unix command you can type man 'name of the command' at the prompt. For example

> man ls

This command will display the manual page for the ls command.

You can use the up or down arrows to move around on the manual page, or you can use the spacebar to scroll through the text page by page. You can leave the manual page by typing 'q'.

U3: The Unix tree

Looking at directories from within a Unix terminal can often seem confusing. But bear in mind that these directories are exactly the same type of folders that you can see if you use Appleʼs graphical ﬁle-management program (known as ʻThe Finderʼ). A tree analogy is often used when describing computer ﬁlesystems. From the root level (/) there can be one or more top level directories, though most Macs will have about a dozen. In the example below, we show just three. When you log in to a computer you are working with your ﬁles in your home directory, and this will nearly always be inside a ʻUsersʼ directory. On many computers there will be multiple users.

All Macs have an applications directory where all the GUI (graphical user interface) programs are kept (e.g. iTunes, Microsoft Word, Terminal). Another directory that will be on all Macs is the Volumes directory. In addition to any attached external drives, the Volumes directory should also contain directories for every internal hard drive (of which there should be at least one, in this case itʼs simply called ʻMacʼ). It will help to think of this tree when we come to copying and moving ﬁles. E.g. if I had a ﬁle in the ʻCodeʼ directory and wanted to copy it to the ʻkeithʼ directory, I would have to go up four levels to the root level, and then down two levels.

U4: Finding out where you are

There may be many hundreds of directories on any Unix machine, so how do you know which one you are in? The command pwd will Print the Working Directory and thatʼs pretty much all this command does:

When you log in to a Unix computer, you are typically placed into your home directory. In this example, after I log in, I am placed in a directory called 'jpgogarten', which itself is a subdirectory of another directory called 'Users'. Conversely, 'Users' is the parent directory of 'jpgogarten'.
The first forward slash that appears in a list of directory names always refers to the top-level directory of the file system (known as the root directory). The remaining forward slash (between 'Users' and 'jpgogarten') delimits the various parts of the directory hierarchy. If you ever get 'lost' in Unix, remember the pwd command.

As you learn Unix you will frequently type commands that donʼt seem to work. Most of the time this will be because you are in the wrong directory, so itʼs a really good habit to get used to running the pwd command a lot.

U5: Getting from ʻAʼ to ʻBʼ

We are in the home directory on the computer, but we want to work in the Desktop folder. To change directories in Unix, we use the cd command:

> cd /Users/jpgogarten/Desktop
> pwd
/Users/jpgogarten/Desktop

The first command reads as 'change directory to the Desktop directory, which is inside the user's home directory, which itself is inside the Users directory that is at the root level of the computer'. Did you notice that the command prompt changed after you ran the cd command? The '~' sign should have changed to 'Desktop', so the prompt now starts with 'lamarck:Desktop'. This is a useful feature of the command prompt. By default it reminds you where you are as you move through different directories on the computer.

U6: Absolute and relative targets

In the previous example, we could have achieved the same result by giving the name of the directory that is inside the current directory:
> cd Desktop

Note that the command does not include a forward slash. When you specify a directory that starts with a forward slash, you are referring to a directory that should exist one level below the root level of the computer. What happens if you try the following two commands? The ﬁrst command should produce an error message.
$ cd Users
$ cd /Users

The error is because without including a leading slash, Unix is trying to change to a 'Users' directory below your current level in the ﬁle hierarchy, and there is no directory called Users at this location.
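
The difference between the two kinds of target can be tried safely in a scratch directory. This is only a sketch: the directory names project and data are invented for the example, and mktemp -d is assumed to be available.

```shell
sandbox=$(mktemp -d)                 # mktemp prints an absolute path
mkdir -p "$sandbox/project/data"

cd "$sandbox/project"      # absolute target: begins with a slash
cd data                    # relative target: looked up inside the current directory
pwd

# A relative name that does not exist here fails, and we stay where we are.
if cd project 2>/dev/null; then
    echo "changed directory"
else
    echo "no 'project' directory inside $(pwd)"
fi
```

The last command fails for the same reason as cd Users above: the name is searched for below the current directory, not from the root.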

U7: Up, up, and away

Frequently, you will ﬁnd that you want to go 'upwards' one level in the directory hierarchy. Two dots (..) are used in Unix to refer to the parent directory of wherever you are. Every directory has a parent except the root level of the computer:

$ cd /Applications/clustalx.app/contents

$ pwd

/Applications/clustalx.app/contents

$ cd ..

$ pwd

/Applications/clustalx.app

What if you wanted to navigate up two levels in the ﬁle system in one go? Itʼs very simple, just use two sets of the .. operator, separated by a forward slash:

$ cd /Applications/clustalx.app/contents

$ cd ../..

$ pwd

/Applications

U8: Iʼm absolutely sure that this is all relative (read this only, nothing to type or execute here)

Using cd .. allows us to change directory relative to where we are now. You can also always change to a directory based on its absolute location.
E.g. if you are working in the /Applications/phylip3.65/exe directory and you then want to change to the /Applications/phylip3.65/doc directory, then you could do either of the following:

$ cd ../doc
or
$ cd /Applications/phylip3.65/doc

They both achieve the same thing, but the 2nd example requires that you know about the full path from the root level of the computer to your directory of interest (the 'path' is an important concept in Unix). Sometimes it is quicker to change directories using the relative path, and other times it will be quicker to use the absolute path.
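
To check that a relative and an absolute path really lead to the same place, one possible sketch is shown below. It recreates a small exe/doc layout inside a scratch directory rather than under /Applications; the names are illustrative only.

```shell
sandbox=$(mktemp -d)
mkdir -p "$sandbox/phylip/exe" "$sandbox/phylip/doc"

cd "$sandbox/phylip/exe"
cd ../doc                     # relative: up to phylip, then down into doc
via_relative=$(pwd)

cd "$sandbox/phylip/exe"
cd "$sandbox/phylip/doc"      # absolute: the full path from the root
via_absolute=$(pwd)

[ "$via_relative" = "$via_absolute" ] && echo "same directory"
```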

U9: Shortcuts

If you are at the command line in a Unix terminal, pressing the up or down cursor keys (arrows) recalls your most recent commands (this works in most cases), and the left and right arrows let you move along the line to edit a command.

If you type a command like cd Desktop, you can type up to "cd D" and then press the <tab> key to complete the line. This completes the line as long as it is unambiguous. (If you have the directories Desktop, Desktop1 and Desktop.old, the system would only complete to Desktop and wait for your input.)

Another great time-saver is that Unix stores a list of all the commands that you have typed in each login session. Type history to see all of the commands you have typed so far. You can use the up and down arrows to access anything from your history. So if you type a long command but make a mistake, press the up arrow and then you can use the left and right arrows to move the cursor in order to make a change.

You also can cut down on typing by using the wild-card character (*), essentially meaning ʻmatch anything'. Using wild-card characters can save you a lot of typing.
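
A harmless way to see what a wild-card pattern matches is to hand it to echo, which simply prints the names the shell substitutes for the pattern. The sketch below assumes a bash-like shell and uses invented file names in a scratch directory.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
touch seq1.fasta seq2.fasta notes.txt

echo *.fasta     # seq1.fasta seq2.fasta
echo seq*        # seq1.fasta seq2.fasta
echo *           # notes.txt seq1.fasta seq2.fasta
```

The shell replaces the pattern with the matching names before the command runs, so the same patterns work with ls, cat, cp and most other commands.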

U10: Make directories, not war

If we want to make a new directory (e.g. to store some work related data), we can use the mkdir command:

$ cd ~/Desktop

$ mkdir work

$ ls

$ cd work

$ ls

$ mkdir temp1

$ mkdir temp2

$ mkdir temp3

$ ls -l (for long, lower case L; NOT the number 1)

To remove directory temp3 type

rmdir temp3

To remove multiple directories you can use the wild card e.g. rmdir temp*

Using a wildcard character in conjunction with deletion is very dangerous. For example, rm temp* removes all files whose names start with temp. If by accident you insert a space and type rm temp * instead, the file temp and all other files in the current directory, whatever their names, will be removed. To protect against this mishap, you can use rm -i, which asks for confirmation before each removal.
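
One additional habit that can help (a suggestion, not part of the original exercise) is to preview a wild-card deletion by putting echo in front of it, so the shell shows the expanded command without running it. The file names below are made up and live in a scratch directory.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
touch temp1 temp2 keep.txt

echo rm temp*     # preview only, prints: rm temp1 temp2
echo rm temp *    # the stray space, prints: rm temp keep.txt temp1 temp2

rm temp*          # run the real command once the preview looks right
ls                # keep.txt
```

The second preview makes the mistake visible: with the extra space, every file in the directory ends up on the rm command line.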

U11: Text editors

<CTRL> click (press the ctrl key and click with the mouse; this is equivalent to a right click if you have a multi-button mouse) on the following links and save the files in the subdirectory work on your desktop: seq1 seq2 seq3 seq4. (Depending on the browser you use, you might need to first save the files to the download folder and then copy them to the work folder.)

Using the command cat you can list the content of a file to the standard output (the screen).

cd ~/Desktop/work
cat sequences-1.fasta

To list the content of multiple files you can use the wild-card character:

cat *.fasta

Unix allows you to redirect output (using the > character) or to pipe (using the character | ) the output through other programs. E.g.:

cat *.fasta > all.seqs
copies the content of the files ending with fasta into a new file called all.seqs. Check the file with
cat all.seqs
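
As a sketch of redirection on files small enough to inspect by eye, the commands below create two tiny made-up fasta files and combine them. The >> operator is an addition not used elsewhere in this exercise: it appends to a file instead of overwriting it.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf '>seqA\nACGT\n' > a.fasta
printf '>seqB\nGGCC\n' > b.fasta

cat *.fasta > all.seqs     # > creates all.seqs, or overwrites it if it exists
cat all.seqs               # four lines: both records
cat a.fasta >> all.seqs    # >> appends, so all.seqs now has six lines
cat all.seqs
```

Giving the output a name that the pattern does not match (all.seqs rather than all.fasta) keeps the combined file from being picked up as input the next time the command is run.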

An example for '|' uses the command wc (word count). wc counts lines, words, and characters (for more information type man wc). To count the number of characters in all the fasta files you could use the command
cat *.fasta | wc -c

The command grep can be used to grab lines that contain a pattern. E.g.:
cat *.fasta |grep '>'
grabs and displays all the annotation lines from the fasta files. Try it!
(Note that a line on the screen is different from a line in the file: a single line in the file can wrap around several lines on the screen.)

If you were just interested in the number of annotation lines (which is the number of sequences in a file or output), you could pipe the output through wc with the line flag:
cat *.fasta |grep '>'| wc -l
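
The counting pipeline can be checked on a file whose answer is known in advance. The sketch below writes a made-up two-sequence fasta file and counts its annotation lines twice: once with the pipeline from the text, and once with grep -c, a grep flag that counts matching lines directly.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf '>seq1 first example\nACGTACGT\nACGT\n>seq2 second example\nGGCCGGCC\n' > demo.fasta

cat demo.fasta | grep '>' | wc -l    # pipeline from the text: 2
grep -c '>' demo.fasta               # same count without the pipes: 2
```

The quotes around '>' matter: without them the shell would treat > as a redirection instead of passing it to grep as the pattern. Note also that when grep -c is given several files, it prints one count per file rather than a single total.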

Another way to look at files is to use the more or the less command. This allows you to page through a file using the cursor keys or the spacebar.
more all.seqs

Type 'q' to quit.

Several text editors are available on UNIX systems. It helps to be at least somewhat familiar with one of them. vi is very powerful, but has a rather steep learning curve. Among the nice things is that it is available on all Unix systems and that it can be taught to do context-dependent coloring (nice for scripts, see below). While vi is great, it helps to have an additional editor on the computer you normally work on. I use TextWrangler http://www.barebones.com/products/TextWrangler/, which is free and very useful. On PCs, Crimson Editor is recommended http://www.crimsoneditor.com/ (Notepad or MS Word are not recommended).

The vi Text editor

vi has two main modes of operation: command mode and insert mode. This is the cause of much of the confusion when a new user is learning vi, but it is actually very simple to understand. When you first load the editor, you will be placed into command mode. To switch into insert mode, simply press the 'i' key. Although nothing will change on the screen to indicate the new mode, anything that you type from now on will appear on the screen. This is what you are used to if you have ever used any other editor or word processor. Try typing a few lines of text. When you press 'return' or 'enter' a new line will be created, and you may continue typing.

When you have finished typing, you may return to command mode. This is done by pressing your 'Esc' key. In command mode, key presses do not appear on the screen, but instead are used to indicate various commands to vi.

At first, you may often confuse command mode and insert mode. For example, you may think you are in insert mode and start typing your text, when in fact you are in command mode, and each keypress you make will issue a command to vi. Be careful: you may accidentally modify or delete parts of your file. If you are unsure which mode you are in, press 'Esc'. If you were in insert mode, you will be returned to command mode. If you were already in command mode, you will be left in command mode (possibly with a 'beep', to indicate that you were already in command mode).

To run vi and create a new file, simply run the command 'vi' from any shell prompt (after changing into the work directory on your desktop):
$ vi

Alternatively, to load an existing text file into vi, run the command 'vi [filename]', from the shell prompt:
$ vi all.seqs

1. Load vi (if you haven't already), by typing:

$ vi
(don't forget - the $ is the system's prompt, and may be different on your system. You only type the 'vi' part, shown in bold.)

The screen should show a blank file, with each blank line represented by a tilde (~). By default you are in command mode. For example:
~
~
~
~
~

2. Switch to insert mode, by pressing the 'i' key. Then type some text, using ENTER or RETURN to start new lines. For example:

Hello. This is my first session in the vi editor.
This is the second line of text.
~
~
~
3. When you have finished entering your sample text, press 'Esc' to return to command mode.

4. Now we will learn a useful command: to save the file. The command for this is ':w' (note the colon before the 'w'). After the 'w', put a space, and the name you want to store the file as. For example:

:w file1
Type it now. Notice how the text ':w' appears at the bottom on the screen. When you have finished the command, press ENTER or RETURN. You should see a confirmation that the file has been saved, which may include the number of lines in the file, and possibly the file size.

5. To finish this simple introduction to the editor, we will learn one final command: How to close the editor. The command for this is ':q' (again, with a colon before the 'q'). Don't forget you need to be in command mode. If you're not already in command mode, or you're not sure which mode you're in, press 'Esc' now. Then issue the command:
:q
You should be returned to the UNIX prompt.

Sometimes, you may have made changes to a file that you do not want to save. This may be because you have decided the changes are incorrect, or you have become confused using vi (not unusual at first!), and incorrectly made some changes, maybe by typing into command mode instead of insert mode. To exit vi without saving, and ignoring any warnings about unsaved data, use a variation of the ':q' command, with an exclamation mark after it:
:q!
This will return you to the prompt, without saving any changes to the file, and with no warnings about unsaved data. Use this command carefully.

If the file you are working on already has a name, you can combine write and quit into a single command
<ESC>
:wq

.profile files

Change to your home directory. Then execute the following commands:
$ ls
$ ls -p

The latter indicates which entries are directories and which are files. Unix allows you to create a profile to which you can add aliases. The profile is stored in .profile and is executed every time you start a new shell. Files whose names start with a dot are usually not listed. To see them, use
$ ls -a

Use vi to add aliases to your profile:
$ vi .profile

Move the cursor to the first line and switch to insert mode by typing 'i'. Type the following text:
# some useful command line short-cuts
alias ls='ls -p'
alias rm='rm -i'
<ESC> :wq

The commands in .profile will be executed every time you log in. To make the shell read the .profile file without logging out and in again, type
$ source .profile

From now on, every time you type ls, ls -p will be executed, and if you remove something using a wildcard, you will be asked to confirm the removal of each individual file.
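
Aliases can also be defined, inspected and removed directly at the prompt, which is a convenient way to try one out before adding it to .profile. In this sketch the alias name ll is a hypothetical example, not something the exercise requires.

```shell
alias ll='ls -l'          # define a shortcut (ll is a hypothetical name)
definition=$(alias ll)    # ask the shell what ll stands for
echo "$definition"
alias                     # with no arguments: list every alias currently defined
unalias ll                # remove the alias again for this session
```

An alias defined this way lasts only until the shell is closed. To run the original command once while an alias such as rm='rm -i' is active, you can put a backslash in front of the name (for example \rm).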

Other hidden files instruct programs to start up with useful parameters. The one used by vi is called .vimrc . The .vimrc file I use on my laptop has only two lines, but they make a big difference. To create this file in the chem-lab-user account:

$ vi .vimrc
enter insert mode by typing i, then type or copy the following two lines:
set term=xterm-color
syn on
then write the file and quit
:wq

To see the effect of syntax dependent coloring, open the .vimrc file again. As you restart vi, it should now use context dependent coloring.
$ vi .vimrc

On the bioinformatics cluster I use a more complex .vimrc file. If you want to use it, copy it from here (<CTRL> click) and save it as .vimrc in your home directory (on the cluster, or on your iMac).

U12: Moving and copying files
Now, letʼs assume that we want to move ﬁles to a new directory (ʻTempʼ). We will
do this using the Unix mv (move) command:
$ cd ~/Desktop
$ mkdir Temp
$ mv ~/Desktop/work/all.seqs Temp/

Copying files with the cp (copy) command is very similar to moving them. For both the mv and the cp command, we always have to specify a source file (or directory) that we want to move or copy, and then a target location. To move or copy several files at once, we can use the asterisk (*) as a wild-card character:
$ cp ~/Desktop/work/*.fasta Temp/

The cp command also allows us (with the use of a command-line option) to copy entire directories (also note how the ls command in this example is given two directories to list):

$ cp -R work/ temp1/
$ ls work temp1
temp1:
sequences-1.fasta sequences-3.fasta
sequences-2.fasta sequences-4.fasta

work:
sequences-1.fasta sequences-3.fasta
sequences-2.fasta sequences-4.fasta

The -R option means 'copy recursively'; many other Unix commands have a similar option. See what happens if you don't include the -R option. A recursive option is also particularly useful when changing permissions.
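
The moving and copying commands can be rehearsed in a scratch directory before using them on real data. The following is a minimal sketch with invented file names. It also shows that mv renames a file when the target is a new name in the same directory.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir -p work/sub
printf 'ACGT\n' > work/seq.fasta
printf 'note\n' > work/sub/readme.txt

cp work/seq.fasta copy.fasta    # copy a single file: source first, then target
mv copy.fasta renamed.fasta     # mv with a new name renames the file
cp -R work backup               # copy the directory and everything below it
ls -R backup                    # lists seq.fasta, sub and sub/readme.txt
```

Here the target backup does not exist beforehand, so cp -R creates it as a copy of work, including the subdirectory.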

U13: Permissions and scripts

Use vi to create a shell script named hello.sh:

vi hello.sh

type i to enter insert mode

type
# my first Unix shell script
echo "Hello World"
<esc>:wq

All files in UNIX have separate read, write, and execute permissions for the user who owns the file (u), the group of the owner (g), and everyone else (o). In chmod, the letter a stands for all three classes at once.

If you execute ls -l, every file and directory is listed on a separate line. The line starts with a number of letters.
The first letter is a d if the entry is a directory. The next three letters give the read (r) / write (w) / execute (x) permissions for the owner, the following three give the same for the group, and the last three for everyone else.

-rw-r--r-- 1 peter peter 49 Jan 19 14:44 hello.sh

indicates that user peter from the group peter owns the file, and that the owner has read and write permission, but everyone else can only read the file.

To execute a script you type ./ followed by the name of the script:
$ ./hello.sh
This should result in an error message, because you do not have execute permission.

To add permission for you to execute the file
$ chmod u+x hello.sh
$ ./hello.sh

To give everyone permission to execute the hello.sh script, you type
$ chmod a+x hello.sh

To remove everyone's permission to read the file:
$ chmod a-r hello.sh

(Note: this also removed your permission to read the file :))

If you want to give everyone permission to read the files in the work directory, you could execute the following
cd ~/Desktop
chmod -R a+r work/
This will give everyone read permission to all files contained in the work folder, including all subfolders.
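
The permission steps in this section can be repeated without vi by writing the script with printf. This is a sketch in a scratch directory; the permission strings shown in the comments are typical values and depend on your system's default settings.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
printf '# my first Unix shell script\necho "Hello World"\n' > hello.sh

ls -l hello.sh          # typically -rw-r--r-- : no x, so ./hello.sh would be refused
chmod u+x hello.sh      # add execute permission for the owner only
ls -l hello.sh          # typically -rwxr--r-- now
./hello.sh              # prints: Hello World
```

Comparing the two ls -l lines shows that only the fourth character changed, which is the owner's execute slot.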

U14: Time to go home

Remember that the command prompt shows you the name of the directory that you are currently in, and that when you are in your home directory it shows you a tilde character (~) instead? This is because Unix uses the tilde character as a short-hand way of specifying a home directory.

Task U14.1: See what happens when you try the following commands (use the pwd command after each one to confirm the results):

$ cd /

$ cd ~

$ cd /

$ cd

Hopefully, you should ﬁnd that cd and cd ~ do the same thing, i.e. they take you back to your home directory (from wherever you were). Also notice how you can specify the single forward slash to refer to the root directory of the computer. When working with Unix you will frequently want to jump straight back to your home directory, and typing cd is a very quick way to get there.

If you have time and energy left:

Use Entrez to find a Protein sequence that is of interest to you. (If you don't find something of interest, use gi|405795).
How many related protein sequences does your sequence have (see the menu on the right under related information)?
How many related nucleotide sequences does the encoding nucleotide sequence have (click on nucleotide under related information, then related sequences)?
Explore the BLink page.
What is shown on this page?
What do the colors in the symbolic alignment on the right hand side signify?
Where do the three links in every entry link to?
What are the limitations of BLINK?
Note: all of these results are already linked to your sequence, you did not need to perform a new search to get the results.
Blast using the NCBI web interface:
There are different BLAST programs.
Why/when would you want to use blastp?
When blastn, blastx, tblastn or tblastx?
The NCBI maintains many different web pages that link to blast searches. One of the more useful ones to assemble data sets for phylogenetic analysis is at http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi
Using a protein sequence of your choice (see above, or gi|59713171) search a group of prokaryotic genomes.
How many significant hits did you find in how many genomes?
Note: you can place checkmarks in front of the sequences you are interested in and retrieve them.
Why might this tool be preferable over BLINK in Entrez?
Do the results change when you use a smaller word size or a different scoring matrix? (Try one or two repeats. You need to select advanced search. Be radical in your choices. If you want to know what a parameter means, click on the ?)
Do all the target sequences have similar description lines?
Did the low complexity filter replace any part of your query sequence?

Assignments for next week:

Finish or at least read through today's exercise