class: center, middle, inverse, title-slide

# Introduction to Linux for Bioinformatics
## Talk is cheap. Show me the code.
### Lijia Yu @ NCCL in China
### 2019/06/17
(updated: 2020-10-03)

---

# Agenda

1. Login to RStudio Server
2. UNIX file system
3. Command line + operations on files
4. FASTA and FASTQ

---

# Login to RStudio Server

1. Open this link in a browser: 192.168.0.105:8787

![](./2019-06-17-Introduction_to_Linux_for_Bioinformatics_files/figure/RStudio.png)

---

| RStudio Windows / Tabs | Location | Description |
|---------------------------|-------------|--------------------------------------------------------------|
| **Console & Terminal Window** | lower-left | location where commands are entered and the output is printed |
| **Source Tabs** | upper-left | built-in text editor |
| Environment Tab | upper-right | interactive list of loaded R objects |
| History Tab | upper-right | list of commands entered into the Console |
| **Files Tab** | lower-right | file explorer to navigate folders |
| Plots Tab | lower-right | output location for plots |
| Packages Tab | lower-right | list of installed packages |
| Help Tab | lower-right | output location for help commands and help search window |
| Viewer Tab | lower-right | advanced tab for local web content |

---

# UNIX file system

![](./2019-06-17-Introduction_to_Linux_for_Bioinformatics_files/figure/UnixFileSystem.png)

.footnote[
[1] [Linux for bioinformatics](https://data.bits.vib.be/pub/trainingen/Linux/Linux_01_2018.pdf)
]

---

# UNIX file system

## ls

`ls` is a command to list computer files in Unix and Unix-like operating systems.

```bash
ls /home/admin
```

![](./2019-06-17-Introduction_to_Linux_for_Bioinformatics_files/figure/ls.png)

---

# UNIX file system: files and extensions

- On some operating systems, file extensions matter and define the file type. On Windows, for example, files have a three- or four-letter extension (e.g. .jpg, .exe, .docx, …)

--

- In UNIX, file extensions are arbitrary and carry no particular meaning, at least for the operating system. A file can have several extensions; it is recognized by its permissions and content.
--

- Popular file extensions in bioinformatics:
  - scripts
    - .sh (shell/bash)
    - .pl (perl)
    - .py (python)
    - .r (R)
  - data
    - .txt (text)
    - .csv/.tsv (tabular)
    - .fasta/.fastq (sequence files)

.footnote[
[1] [Linux for bioinformatics](https://data.bits.vib.be/pub/trainingen/Linux/Linux_01_2018.pdf)
]

---

# UNIX file system: users and access rights

Every file and directory is protected. A set of permissions determines who can access a certain file and what kind of access is allowed.

- User: the user who owns the file
- Group: other users from the same group
- Others: all others in the system
- Read: display the file
- Write: modify the content of the file
- Execute: run a file (only for scripts and compiled programs)

.footnote[
[1] [Linux for bioinformatics](https://data.bits.vib.be/pub/trainingen/Linux/Linux_01_2018.pdf)
]

---

# UNIX file system: users and groups

```bash
ls -a
```

Prints a list of all files and directories, including hidden ones. Hidden files are recognized by a ‘.’ at the start of the name; ‘.’ is the current directory and ‘..’ is the parent directory.

```
[root@nccl /home/lirui]$ ls -a
. .. .bash_logout .bash_profile .bashrc
```

![](./2019-06-17-Introduction_to_Linux_for_Bioinformatics_files/figure/ls-a.png)

.footnote[
[1] [Linux for bioinformatics](https://data.bits.vib.be/pub/trainingen/Linux/Linux_01_2018.pdf)
]

---

# UNIX file system: users and groups

```bash
ls -l /usr/bin/gpg
```

```
lrwxrwxrwx 1 root root 4 Feb 18 11:03 /usr/bin/gpg -> gpg2
```

From left to right, the fields are:

- file permissions,
- number of hard links,
- owner name,
- owner group,
- file size,
- time of last modification, and
- file/directory name

.footnote[
[1] [What do the fields in ls -al output mean?
](https://unix.stackexchange.com/questions/103114/what-do-the-fields-in-ls-al-output-mean)
]

---

# UNIX file system: users and groups

File permissions are displayed as follows:

![](./2019-06-17-Introduction_to_Linux_for_Bioinformatics_files/figure/access_right.png)

- r = readable
- w = writable
- x = executable

.footnote[
[1] [What do the fields in ls -al output mean?](https://unix.stackexchange.com/questions/103114/what-do-the-fields-in-ls-al-output-mean)

[2] [Linux for bioinformatics](https://data.bits.vib.be/pub/trainingen/Linux/Linux_01_2018.pdf)
]

---

# UNIX file system: file permission

For example, <font color="red">-</font><font color="blue">rwx</font><font color="green">rw-</font><font color="orange">r--</font> means that the listed entry is:

--

- <font color="red">a regular file (displayed as -)</font>

--

- <font color="blue">readable, writable and executable by owner (rwx)</font>

--

- <font color="green">readable, writable, but not executable by group (rw-)</font>

--

- <font color="orange">readable but not writable or executable by others (r--)</font>

.footnote[
[1] [Linux for bioinformatics](https://data.bits.vib.be/pub/trainingen/Linux/Linux_01_2018.pdf)
]

---

# UNIX file system

When entering a command, you can use wildcard characters, which substitute for one or more other characters. They are often used with file and directory names in filesystem commands.
- `*` matches any number of characters (including none)
- `?` matches exactly one character
- `[]` matches one character from the given set or range at that position
- `{}` expands to each term in a comma-separated list
- `!` inside `[]` excludes the given characters

.footnote[
[1] [Linux for bioinformatics](https://data.bits.vib.be/pub/trainingen/Linux/Linux_01_2018.pdf)
]

---

# UNIX file system

How many files does each of the following commands find?

- `ls -l *.c`
- `ls -l bam*`
- `ls -l samtools.?`
- `ls -l *.*[!a]`
- `ls -l {*.c,*.h}`
- `ls -l version.[a-z]h`

.footnote[
[1] [Linux for bioinformatics](https://data.bits.vib.be/pub/trainingen/Linux/Linux_01_2018.pdf)
]

---

# Command line + operations on files

## command pattern

```bash
ls --help
```

```
Usage: ls [OPTION]... [FILE]...
```

```bash
R --help
```

```
Usage: R [options] [< infile] [> outfile]
   or: R CMD command [arguments]

Start R, a system for statistical computation and graphics, with the
specified options, or invoke an R tool via the 'R CMD' interface.

Options:
  -h, --help            Print short help message and exit
  --version             Print version info and exit
  --encoding=ENC        Specify encoding to be used for stdin
  --encoding ENC
```

---

# Command line: pwd, cd

-
```bash
pwd
```
Print the working directory.
```
[root@nccl /home/lirui]$ pwd
/home/lirui
```

-
```bash
cd
```
Change directory.
```bash
cd /tmp
cd ./tmp
cd ../tmp
cd ~
```

---

# Command line: cp, rm

-
```bash
cp /tmp/abc.txt /tmp/yulijia.txt
cp -r /tmp/yulijia /tmp/yourname
```
Copy a single file or, with `-r`, a directory and everything in it.

-
```bash
rm /tmp/yulijia.txt
rm -r /tmp/yulijia
```
Remove files, directories (with `-r`), device nodes, symbolic links, and so on from the filesystem.

---

# Command line: rm

.center[![](https://i.pinimg.com/236x/99/32/b1/9932b1cebca00d51d1b6474215252bf5.jpg)]

--

.center[**NEVER use `rm -f`!**]

---

# Command line: less, cat, head, tail

-
```bash
less /home/admin/database/reference/human/hg19/chr1.fa
```
View the contents of a text file one screen at a time.
It is similar to `more`, but has the extended capability of allowing both forward and backward navigation through the file.

-
```bash
cat /home/admin/database/reference/human/hg19/chr1.fa
```
Write the contents of a file to standard output.

-
```bash
head /home/admin/database/reference/human/hg19/chr1.fa
```
Display the beginning of a text file or piped data (the first 10 lines by default).

-
```bash
tail /home/admin/database/reference/human/hg19/chr1.fa
```
Display the end of a text file or piped data (the last 10 lines by default).

---

<!-- # Command line + operations on files -->
<!-- ```bash -->
<!-- less /home/yulijia/.bashrc -->
<!-- ``` -->
<!-- ![](./2019-06-17-Introduction_to_Linux_for_Bioinformatics_files/figure/bashrc.png) -->
<!-- --- -->
<!-- # Command line + operations on files -->
<!-- - -->
<!-- ```bash -->
<!-- vim -->
<!-- ``` -->
<!-- Vim is a highly configurable text editor for efficiently creating and changing any kind of text. It is included as "vi" with most UNIX systems and with Apple OS X. -->
<!-- ```bash -->
<!-- vim /home/yulijia/.bashrc -->
<!-- ``` -->
<!-- - `<Esc>`: Turn to command mode by type <Esc>. -->
<!-- - `a`: Append text after the cursor [count] times. -->
<!-- - `wq`: Write the current file and exit. -->
<!-- - `q`: Quit Vim. This fails when changes have been made. -->
<!-- - `q!`: Quit without writing. -->
<!-- - `wq!`: Write the current file and exit always. -->
<!-- .footnote[ -->
<!-- [1] [vimCheatSheet -->
<!-- ](https://www.fprintf.net/vimCheatSheet.html) -->
<!-- ] -->
<!-- --- -->
<!-- # Command line + operations on files -->
<!-- 0. -->
<!-- ```bash -->
<!-- vim /home/yulijia/.bashrc -->
<!-- ``` -->
<!-- 1. type `a` -->
<!-- 2. add one line in the `.bashrc` file -->
<!-- ````bash -->
<!-- alias rm='trash-put' -->
<!-- ```` -->
<!-- 3. type `<Esc>` -->
<!-- 4.
type `:wq` -->

---

# Command line: chmod

-
```bash
chmod
```
Modify access rights. Each digit sets the permissions for user, group and others, in that order.
```bash
chmod 755 /tmp/zyx.txt
chmod 600 /tmp/zyx.txt
```

| # | Permission | rwx | Binary |
|---|-------------------------|-----|--------|
| 7 | read, write and execute | rwx | 111 |
| 6 | read and write | rw- | 110 |
| 5 | read and execute | r-x | 101 |
| 4 | read only | r-- | 100 |
| 3 | write and execute | -wx | 011 |
| 2 | write only | -w- | 010 |
| 1 | execute only | --x | 001 |
| 0 | none | --- | 000 |

---

# Command line: mkdir, mv

-
```bash
mkdir /home/yulijia/project1
```
Make a new directory.

-
```bash
#cp /tmp/abc.fasta /home/yulijia/project1
#cd /home/yulijia/project1/
#ls
mv /tmp/yourname.txt /home/yourname/yourname.txt
# <Gene Official Symbol> is a placeholder: replace it with the official symbol of your gene
mv abc.fasta <Gene Official Symbol>.fasta
```
Move or rename one or more files or directories.

---

# FASTA format

In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.

```bash
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
>P01013 GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
```

.footnote[
[1] [FASTA format used in blast](https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp)
]

---

# FASTQ format

A FASTQ file normally uses four lines per sequence.
- Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
- Line 2 is the raw sequence letters (A, T, C, G, N).
- Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
- Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

A FASTQ file containing a single sequence might look like this:

```bash
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
```

---

# FASTQ format

## Illumina sequence identifiers

**@HWUSI-EAS100R:6:73:941:1973#0/1**

| Name | Meaning |
|---------------|---------------------------------------------------------------------|
| HWUSI-EAS100R | the unique instrument name |
| 6 | flowcell lane |
| 73 | tile number within the flowcell lane |
| 941 | 'x'-coordinate of the cluster within the tile |
| 1973 | 'y'-coordinate of the cluster within the tile |
| #0 | index number for a multiplexed sample (0 for no indexing) |
| /1 | the member of a pair, /1 or /2 (paired-end or mate-pair reads only) |

---

# Quality

A quality value `Q` is an integer mapping of `P` (i.e., the probability that the corresponding base call is <u>incorrect</u>). Two different equations have been in use.
The first is the standard Sanger variant to assess the reliability of a base call, otherwise known as the Phred quality score:

`$$Q_{sanger}=-10\log_{10}P$$`

The second is the variant used by early Solexa pipelines, which encodes the odds `P/(1-P)` instead of the probability:

`$$Q_{solexa}=-10\log_{10}\frac{P}{1-P}$$`

---

# Phred Quality Score

| Phred Quality Score | Probability of incorrect base call | Base call accuracy |
|---------------------|------------------------------------|--------------------|
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10,000 | 99.99% |
| 50 | 1 in 100,000 | 99.999% |
| 60 | 1 in 1,000,000 | 99.9999% |

---

# Encoding

![](./2019-06-17-Introduction_to_Linux_for_Bioinformatics_files/figure/Phred.png)

---

# Command line: passwd

## Change your password

```bash
passwd
```

```
[yulijia@nccl ~/]$ passwd
Changing password for user yulijia.
Changing password for yulijia.
(current) UNIX password:
New password:
Retype new password:
password updated successfully
```

---

# Quiz

1. What do these commands do?

```bash
ls -lh
ls -lt
ls -R
```

---

# Quiz

2. Edit your FASTA file: change the sequence name from `>X54156|Homo sapiens p53` to `>TP53|tumor protein p53`

---

# Resources

[1] [Introduction to Linux for bioinformatics](https://www.bits.vib.be/training-list/112-bits/training/upcoming-trainings/124-linux-for-bioinformatics)

[2] [An introduction to Linux for bioinformatics](https://sites.ualberta.ca/~stothard/downloads/linux_for_bioinformatics.pdf)

.footnote[
* All pictures without a footnote are taken from Wikipedia or are browser screenshots.
]
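
---

# Extra: decoding a quality string

The Sanger formula and the FASTQ quality line from the earlier slides can be tied together in a few lines of shell. This is only a sketch: it assumes the Phred+33 (Sanger) ASCII offset, which the slides do not state explicitly, and `phred_scores` and `error_prob` are hypothetical helper names introduced here for illustration.

```bash
#!/usr/bin/env bash
# Sketch only. Assumption: Phred+33 (Sanger) encoding, i.e. Q = ASCII code - 33.

# phred_scores (hypothetical helper): print the Phred score of every
# character in a FASTQ quality string, separated by spaces.
phred_scores() {
  local qual=$1 i code out=""
  for ((i = 0; i < ${#qual}; i++)); do
    # printf with a leading single quote yields the character's ASCII code
    printf -v code '%d' "'${qual:i:1}"
    out+="$((code - 33)) "
  done
  echo "${out% }"
}

# error_prob (hypothetical helper): invert the Sanger formula, P = 10^(-Q/10).
error_prob() {
  awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_scores '!5CF'   # prints: 0 20 34 37
error_prob 30         # prints: 0.001
```

The second call matches the Phred table: a score of 30 corresponds to 1 incorrect base call in 1000. Files written with a different offset (for example older Phred+64 data) would need the constant 33 changed accordingly.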