class: center, middle, inverse, title-slide

# A Beginner's Guide to Calling Somatic Mutations
## Part II
### Lijia Yu
### 2020/10/03
(updated: 2020-10-05)

---

# Outline

1. Quality Control

2. Alignment/Mapping

---

# FastQC

- [Good Illumina Data](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/good_sequence_short_fastqc.html)

- [Bad Illumina Data](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/bad_sequence_fastqc.html)

| Measure                           | Value                   |
|-----------------------------------|-------------------------|
| Filename                          | good_sequence_short.txt |
| File type                         | Conventional base calls |
| Encoding                          | Illumina 1.5            |
| Total Sequences                   | 250000                  |
| Sequences flagged as poor quality | 0                       |
| Sequence length                   | 40                      |
| %GC                               | 45                      |

---

# [fastp](https://github.com/OpenGene/fastp)

```bash
fastp -i in.R1.fq.gz -I in.R2.fq.gz -o out.R1.fq.gz -O out.R2.fq.gz
```

---

# Command line: screen/tmux

The screen package is pre-installed on most Linux distros nowadays. You can check whether it is installed on your system by typing:

```bash
screen --version
```

## Basic Linux Screen Usage

Below are the most basic steps for getting started with screen:

- At the command prompt, type `screen`.
- Run the desired program.
- Use the key sequence `Ctrl-a` + `Ctrl-d` to detach from the screen session.
- Reattach to the screen session by typing `screen -r`.

.footnote[
[1] [How To Use Linux Screen](https://linuxize.com/post/how-to-use-linux-screen/)
]

---

# Command line: screen/tmux

Tmux is better than screen.

1. Starting a Named tmux Session

```bash
tmux new -s geek-1
```

2. Detaching and Attaching Sessions

   To detach, press `Ctrl+B`, and then `D`.

3. To attach a detached session:

   `tmux attach-session -t geek-1`

.footnote[
[1] [How to Use tmux on Linux (and Why It’s Better Than Screen)](https://www.howtogeek.com/671422/how-to-use-tmux-on-linux-and-why-its-better-than-screen/)
]

---

# [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic): a flexible trimmer for Illumina sequence data

- Removal of technical sequences

- Quality filtering

- Conversion of quality scores from Phred 64 to Phred 33

```bash
# Convert Phred64 to Phred33
java -Xmx4G \
  -jar /home/admin/software/Trimmomatic-0.36/trimmomatic-0.36.jar PE \
  -threads 1 \
  -phred64 \
  ../in/cancer_R1.fq.gz \
  ../in/cancer_R2.fq.gz \
  ../out/cancer_R1.trimed.fq.gz \
  ../out/cancer_R1.unpaired.fq.gz \
  ../out/cancer_R2.trimed.fq.gz \
  ../out/cancer_R2.unpaired.fq.gz \
  TOPHRED33
```

---

# Alignment/Mapping

.center[<img src="https://www.ebi.ac.uk/training/online/sites/ebi.ac.uk.training.online/files/resize/user/18/Figure19-700x527.png" width="550">]

.footnote[
[1] [Read mapping or alignment](https://www.ebi.ac.uk/training/online/course/functional-genomics-ii-common-technologies-and-data-analysis-methods/read-mapping-or)
]

---

# Alignment/Mapping

- When studying an organism **with** <u>a reference genome</u>, it is possible to infer which transcripts are expressed by mapping the reads to the reference genome (genome mapping) or transcriptome (transcriptome mapping). Mapping reads to the genome requires no knowledge of the set of transcribed regions or the way in which exons are spliced together. This approach allows the discovery of new, unannotated transcripts.

- When working on an organism **without** <u>a reference genome</u>, reads need to be assembled first into longer contigs (de novo assembly). These contigs can then be considered as the expressed transcriptome to which reads are re-mapped for quantification.
.footnote[
[1] [Read mapping or alignment](https://www.ebi.ac.uk/training/online/course/functional-genomics-ii-common-technologies-and-data-analysis-methods/read-mapping-or)
]

---

# Alignment/Mapping

## Tools

- BWA (Burrows-Wheeler Aligner, recommended for Illumina/BGI users)

  BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome.

- Novoalign

  A powerful tool designed for mapping short reads from <u>Illumina, Ion Torrent, and 454 NGS platforms</u> onto a reference genome.

- TMAP (recommended for Ion Torrent users)

  TMAP (Torrent Mapping Alignment Program) is fast and accurate alignment software for short and long nucleotide sequences produced by next-generation sequencing technologies.

---

# Alignment/Mapping

## Tools

- Bowtie2

  Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.

- STAR

  STAR is an ultrafast universal RNA-seq aligner.

---

# De novo sequence assemblers

- ABySS

  A de novo, parallel, paired-end sequence assembler designed for the assembly of short reads. There are two versions: ABySS (genomic) and Trans-ABySS (transcriptomic).

- Trinity

  Trinity uses a three-step process to produce high-quality transcriptome assemblies. This method has been found to reconstruct high-quality transcriptomes.

- HGAP/Falcon

  HGAP was the first long-read assembler. It was designed mainly for haploid organisms. Falcon is a long-read assembler designed by Pacific Biosciences to work on diploid organisms.
.footnote[
[1] [De novo sequence assemblers](https://en.wikipedia.org/wiki/De_novo_sequence_assemblers)
]

---

# [BWA](https://github.com/lh3/bwa) mapping

```bash
Usage: bwa mem [options] <idxbase> <in1.fq> [in2.fq]
```

```bash
/home/admin/software/bwa-0.7.12/bwa mem -M \
  -t 1 \
  /home/admin/database/reference/hg19/ucsc.hg19.fasta \
  ../out/cancer_R1.trimed.fq.gz \
  ../out/cancer_R2.trimed.fq.gz > ../out/cancer.sam
```

- `mem`: run the BWA-MEM algorithm

- `-M`: mark shorter split hits as secondary

---

# Samtools sorting

```bash
Usage: samtools <command> [options]
```

```bash
/home/admin/software/samtools-1.9/samtools sort \
  -o ../out/cancer.sorted.bam ../out/cancer.sam
```

- `sort`: sort the alignment file

---

# BWA mapping and Samtools sorting

```bash
/home/admin/software/bwa-0.7.12/bwa mem -M \
  -t 1 \
  -R "@RG\tID:H3FCTDSXY.2\tLB:hg19\tPL:ILLUMINA\tPU:H3FCTDSXY.2.cancer\tSM:cancer" \
  /home/admin/database/reference/hg19/ucsc.hg19.fasta \
  ../out/cancer_R1.trimed.fq.gz \
  ../out/cancer_R2.trimed.fq.gz | \
  /home/admin/software/samtools-1.9/samtools view -Sb - | \
  /home/admin/software/samtools-1.9/samtools sort \
  -o ../out/cancer.sorted.bam -
```

### Piping in Unix or Linux

```bash
$BWA mem ... | $SAMTOOLS sort ...
command_1 | command_2 | command_3 | .... | command_N
```

A pipe combines two or more commands: the output of one command becomes the input of the next, whose output may in turn feed the command after it, and so on.

---

# Quiz

1. Try to preprocess and map the normal sample (FASTQ file) to the reference genome hg19. Please write your code in a script file (script.sh).

2. Please explain the meaning of the Trimmomatic code line by line.
```bash
# Convert Phred64 to Phred33
java -Xmx4G \
  -jar /home/admin/software/Trimmomatic-0.36/trimmomatic-0.36.jar PE \
  -threads 1 \
  -phred64 \
  ../in/cancer_R1.fq.gz \
  ../in/cancer_R2.fq.gz \
  ../out/cancer_R1.trimed.fq.gz \
  ../out/cancer_R1.unpaired.fq.gz \
  ../out/cancer_R2.trimed.fq.gz \
  ../out/cancer_R2.unpaired.fq.gz \
  TOPHRED33
```

---

# Resource

[1] [FastQC](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)

[2] [Trimmomatic: A flexible read trimming tool for Illumina NGS data](http://www.usadellab.org/cms/?page=trimmomatic)

[3] [BWA](https://github.com/lh3/bwa)

[4] [Samtools](https://github.com/samtools/samtools)
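---

# Piping: a small worked example

The piping pattern `command_1 | command_2 | command_3` used for BWA and Samtools can be tried without any alignment tools installed. The sketch below is only an illustration: the three toy FASTQ records, the temporary file, and the variable names `fq`, `n_reads`, and `total_bases` are made up for this slide and are not part of the course data.

```bash
#!/usr/bin/env bash
# Stop on errors, on unset variables, and on a failure in any pipeline stage.
set -euo pipefail

# Hypothetical toy input: three FASTQ records (4 lines each) in a temp file.
fq=$(mktemp)
printf '%s\n' \
  '@read1' 'ACGTACGT' '+' 'IIIIIIII' \
  '@read2' 'ACGT'     '+' 'IIII' \
  '@read3' 'ACGTAC'   '+' 'IIIIII' > "$fq"

# command_1 | command_2 | command_3
#   cat   : stream the file (for a .gz file, `gzip -dc` would take this role)
#   awk   : keep only the sequence line, i.e. line 2 of every 4-line record
#   wc -l : count the lines that awk lets through
# The trailing `tr -d ' '` strips the padding some `wc` versions print.
n_reads=$(cat "$fq" | awk 'NR % 4 == 2' | wc -l | tr -d ' ')

# Same idea, different last stages: join all sequences and count characters.
total_bases=$(awk 'NR % 4 == 2' "$fq" | tr -d '\n' | wc -c | tr -d ' ')

echo "reads=$n_reads bases=$total_bases"   # prints: reads=3 bases=18

rm -f "$fq"
```

`set -o pipefail` matters for pipelines such as `bwa mem ... | samtools sort ...`: without it, bash reports only the exit status of the last command, so a failure in an earlier stage can go unnoticed.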