class: center, middle, inverse, title-slide # A beginners guide to Call SNPs and indels: Part II ##
Mark Duplicates and Base (Quality Score) Recalibration
### Lijia Yu ### 2019/07/31
(updated: 2020-10-04) --- # Outline 1. Review 2. Mark Duplicates 3. Base (Quality Score) Recalibration --- ### Q1: How to create a new directory with multiple subdirectories? ```bash cd ~ mkdir ./project mkdir ./project/out mkdir ./tmp mkdir ./project/src mkdir ./project/in ``` ```bash mkdir -p ./project/{in,out,src,tmp} ``` --- ### Q2: What is src stand for ? ``` . ├── in ├── out ├── src └── tmp ``` - in: input (.fq.gz) - out: output (.bam) - src: source code (script.sh) Always run the code from `src` directory. --- ### Q3. How to use pipe? ```bash java -Xmx4G -jar trimmomatic-0.36.jar PE \ -threads 4 \ -phred64 \ ../in/normal_R1.fq.gz \ ../in/normal_R2.fq.gz \ ../out/normal_R1.trimed.fq.gz \ ../out/normal_R1.unpaired.fq.gz \ ../out/normal_R2.trimed.fq.gz \ ../out/normal_R2.unpaired.fq.gz TOPHRED33 |\ bwa mem -M -t 4\ /home/admin/database/reference/hg19/ucsc.hg19.fasta \ ../out/normal_R1.trimed.fq.gz \ ../out/normal_R2.trimed.fq.gz |\ samtools sort -o ../out/normal.sorted.bam - ``` --- ### Q3. How to use pipe? ```bash $BWA mem ... | $SAMTOOLS sort ... command_1 | command_2 | command_3 | .... | command_N ``` Pipe is used to combine two or more commands, and in this, the output of one command acts as input to another command, and this command’s output may act as input to the next command and so on. A pipe is a form of redirection (transfer of <u>standard output</u> to some other destination) that is used in Linux and other Unix-like operating systems to send the output of one command/program/process to another command/program/process for further processing. .footnote[ [1][Piping in Unix or Linux](https://www.geeksforgeeks.org/piping-in-unix-or-linux/) ] --- # Pipeline
--- # GATK workflow  .footnote[ [1][Introduction to Variant Discovery with GATK Best Practices](https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr12-1-GATK_primer.pdf) ] --- # SAM/BAM format .center[  ] SAM stands for <u>Sequence Alignment Map</u> and is described in the standard specification [here](http://samtools.github.io/hts-specs/). BAM and CRAM are both compressed forms of SAM; BAM (for Binary Alignment Map) is a lossless compression while CRAM can range from lossless to lossy depending on how much compression you want to achieve (up to very much indeed). .footnote[ [1] [SAM / BAM / CRAM - Mapped sequence data formats](https://gatkforums.broadinstitute.org/gatk/discussion/11014/sam-bam-cram-mapped-sequence-data-formats) [2] [SAM/BAM and related specifications](http://samtools.github.io/hts-specs/) ] --- # SAM/BAM format: header ```bash samtools view -H read_data.bam ``` > @HD VN:1.5 GO:none SO:coordinate > @SQ SN:20 LN:63025520 > @RG ID:H0164.2 PL:illumina PU:H0164ALXX140820.2 LB:Solexa-272222 > PI:0 DT:2014-08-20T00:00:00-0400 SM:NA12878 CN:BI > @PG [program group metadata] The header should contain read groups with **sample names** (SM:NA12878). The presence of the @RG tags indicate the presence of read groups. Each read group has a SM tag, indicating the sample from which the reads belonging to that read group originate. .footnote[ [1] [SAM / BAM / CRAM - Mapped sequence data formats](https://gatkforums.broadinstitute.org/gatk/discussion/11014/sam-bam-cram-mapped-sequence-data-formats) ] --- # SAM/BAM format: CIGAR .center[ ] [1][Introduction to Variant Discovery with GATK Best Practices](https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr12-1-GATK_primer.pdf) --- # Genome Analysis Toolkit .center[ ] - the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. - Mutect2 / MuTect2 / MuTect - Picard ( a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF. ) - GenomeAnalysisTK (<4.0) / gatk (4.0) --- # Add header ```bash /home/admin/software/gatk-4.1.0.0/gatk AddOrReplaceReadGroups \ --VALIDATION_STRINGENCY=SILENT \ --RGLB=hg19 \ --RGPL=illumina \ --RGPU=hg19PU \ --RGSM=cancer \ -I=../out/cancer.sorted.bam \ -O=../out/cancer.addheader.bam ``` - RGSM is Read-Group sample name (cancer/normal) .footnote[ [1] [AddOrReplaceReadGroups (Picard)](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.1.1/picard_sam_AddOrReplaceReadGroups.php) ] --- # Mark duplicates to mitigate duplication artifacts .center[  ] --- # MarkDuplicates on Spark (GATK4) ```bash /home/admin/software/gatk-4.1.0.0/gatk MarkDuplicatesSpark \ -I input.bam \ -O marked_duplicates.bam \ -M marked_dup_metrics.txt \ --conf 'spark.executor.cores=1' ``` This is a Spark implementation of Picard MarkDuplicates that allows the tool to be run in parallel on multiple cores on a local machine .footnote[ [1] [MarkDuplicatesSpark](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_spark_transforms_markduplicates_MarkDuplicatesSpark.php) ] --- # Local realignment around indels corrects mapping errors .center[<img src="2019-07-31-A_beginners_guide_to_Call_SNPs_and_indels_Part_II_files/figure/indel_1.png">] | position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | |------------|---|---|---|---|---|---|---|---|---|---|---|---|---|---| | Reference| C | A | T | T | T | T | T | T | T | C | T | A | A | G | | group 1 |...| C | A | T | T | T | T | T | T |C | T | A | A | G | | group 2 | C | A | T | T | T | T | T | T |C | T | A | A | G |...| --- # Local realignment around indels corrects mapping errors .center[<img src="2019-07-31-A_beginners_guide_to_Call_SNPs_and_indels_Part_II_files/figure/indel_2.png" width="550">] | position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | |------------|---|---|---|---|---|---|---|---|---|---|---|---|---|---| | Reference| C | A | T | T | T | T | T | T | T | C | T | A | A | G | | group 1 | C | A | - | T | T | T | T | T | T |C | T | A | A | G | | group 2 | C | A | - | T | T | T | T | T | T |C | T | A | A | G | --- # RealignerTargetCreator and IndelRealigner (GATK3, retired) ```bash java -jar GenomeAnalysisTK.jar \ -T RealignerTargetCreator \ -R reference.fasta \ -I input.bam \ --known indels.vcf \ -o forIndelRealigner.intervals java -jar GenomeAnalysisTK.jar \ -T IndelRealigner \ -R reference.fasta \ -I input.bam \ -known indels.vcf \ -targetIntervals intervalListFromRTC.intervals \ -o realignedBam.bam ``` .footnote[ [1] [RealignerTargetCreator](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php) [2] [IndelRealigner](https://software.broadinstitute.org/gatk/documentation/tooldocs/3.8-0/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php) ] --- # Base (Quality Score) Recalibration  [1][Introduction to Variant Discovery with GATK Best Practices](https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr12-1-GATK_primer.pdf) --- # Base (Quality Score) Recalibration: BaseRecalibrator ```bash /home/admin/software/gatk-4.1.0.0/gatk BaseRecalibrator \ -I my_reads.bam \ -R reference.fasta \ --known-sites dbsnp_138.hg19.vcf \ --known-sites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \ --known-sites 1000G_phase1.indels.hg19.sites.vcf \ --known-sites 1000G_phase1.snps.high_confidence.hg19.sites.vcf \ -L ../in/Illumina.bed \ -O recal_data.table ``` ExAc, gnomAD, or dbSNP resources can be used as known sites of variation. We assume that all reference mismatches we see are therefore errors and indicative of poor base quality. .footnote[ [1] [BaseRecalibrator](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_bqsr_BaseRecalibrator.php) ] --- # Resource bundle - `dbSNP 138` + evaluation of dbSNP rate and Ti/Tv values at novel sites. - `Mills_and_1000G_gold_standard` + best set of known indels to be used for local realignment - `1000G_phase1.indels` + best set of known indels to be used for local realignment - `1000G_phase1.snps.high_confidence` + genotype refinement The GATK resource bundle is a collection of standard files for working with human resequencing data with the GATK. .footnote[ [1] [Resource Bundle: Reference materials for human analysis](https://software.broadinstitute.org/gatk/documentation/article.php?id=11017) ] --- # Base (Quality Score) Recalibration: ApplyBQSR ```bash /home/admin/software/gatk-4.1.0.0/gatk ApplyBQSR \ -R reference.fasta \ -I input.bam \ --bqsr-recal-file recal_data.table \ -L ../in/Illumina.bed \ -O output.bam ``` It recalibrates the base qualities of the input reads based on the recalibration table produced by the BaseRecalibrator tool, and outputs a recalibrated BAM or CRAM file. .footnote[ [1] [ApplyBQSR](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_bqsr_ApplyBQSR.php) ] --- # Quiz 1.Could you write the commands of Mark Duplicates and Base (Quality Score) Recalibration by yourself? Please write your codes in a script file (script.sh). --- # Resource This section is based on the GATK 4 Best Practices workflow at [https://software.broadinstitute.org/gatk/best-practices/](https://software.broadinstitute.org/gatk/best-practices/)