class: center, middle, inverse, title-slide # A Beginners Guide to Call Somatic Mutation ## Part III ### Lijia Yu ### 2020/10/04
(updated: 2020-10-04) --- # Outline 0. Review 1. Reference genome 2. SAM and BAM 3. Mark Duplicates --- # Answer the Quiz ## How to use pipe? ```bash java -Xmx4G -jar trimmomatic-0.36.jar PE \ -threads 4 \ -phred64 \ ../in/normal_R1.fq.gz \ ../in/normal_R2.fq.gz \ ../out/normal_R1.trimed.fq.gz \ ../out/normal_R1.unpaired.fq.gz \ ../out/normal_R2.trimed.fq.gz \ ../out/normal_R2.unpaired.fq.gz TOPHRED33 bwa mem -M -t 4 \ /home/admin/database/reference/hg19/ucsc.hg19.fasta \ ../out/normal_R1.trimed.fq.gz \ ../out/normal_R2.trimed.fq.gz |\ samtools sort -o ../out/normal.sorted.bam - ``` --- # Reference genome version - GRCh37: Genome Reference Consortium - unlocalized sequences - unplaced sequences - alternate loci - 1,..,22,X,Y,MT - GRCh37lite: Genome Reference Consortium - 1,..,22,X,Y,MT - hg19: UCSC - chr*_random sequences - chrUn_* sequences - chr1,...,chr22,chrX,chrY,chrM() - b37: 1k genome phase 1 - unlocalized sequences - unplaced sequences - 1,..,22,X,Y,MT - hs37d5: 1k genome phase 2 (**decoy** sequence) - unlocalized sequences - unplaced sequences - 1,..,22,X,Y,MT - human herpesvirus 4 type 1 sequence, etc. - GRCh38: Genome Reference Consortium - hg38: UCSC --- # Reference genome - If you map reads to GRCh37 or hg19, use `hs37-1kg`: <ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/ reference/human_g1k_v37.fasta.gz> - If you map to GRCh37 and believe decoy sequences help with better variant calling, use `hs37d5`: <ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/ reference/phase2_reference_assembly_sequence/hs37d5.fa.gz> - If you map reads to GRCh38 or hg38, use the following: <ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/ GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/ GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz> .footnote[ [1][Which human reference genome to use?](https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use) ] --- # SAM/BAM format .center[ ] SAM stands for <u>Sequence Alignment Map</u> and is described in the standard specification [here](http://samtools.github.io/hts-specs/). BAM and CRAM are both compressed forms of SAM; BAM (for Binary Alignment Map) is a lossless compression while CRAM can range from lossless to lossy depending on how much compression you want to achieve (up to very much indeed). .footnote[ [1] [SAM / BAM / CRAM - Mapped sequence data formats](https://gatkforums.broadinstitute.org/gatk/discussion/11014/sam-bam-cram-mapped-sequence-data-formats) [2] [SAM/BAM and related specifications](http://samtools.github.io/hts-specs/) ] --- # SAM/BAM format: header ```bash samtools view -H read_data.bam ``` > @HD VN:1.5 GO:none SO:coordinate > @SQ SN:20 LN:63025520 > @RG ID:H0164.2 PL:illumina PU:H0164ALXX140820.2 LB:Solexa-272222 > PI:0 DT:2014-08-20T00:00:00-0400 SM:NA12878 CN:BI > @PG [program group metadata] The header should contain read groups with **sample names** (SM:NA12878). The presence of the @RG tags indicate the presence of read groups. Each read group has a SM tag, indicating the sample from which the reads belonging to that read group originate. .footnote[ [1] [SAM / BAM / CRAM - Mapped sequence data formats](https://gatkforums.broadinstitute.org/gatk/discussion/11014/sam-bam-cram-mapped-sequence-data-formats) ] --- # SAM/BAM format: CIGAR .center[ ] [1][Introduction to Variant Discovery with GATK Best Practices](https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr12-1-GATK_primer.pdf) --- # Pipeline: [Data pre-processing for variant discovery](https://gatk.broadinstitute.org/hc/en-us/articles/360035535912)
--- # Genome Analysis Toolkit .center[ ] - the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size. - Mutect2 / MuTect2 / MuTect - Picard ( a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF. ) - GenomeAnalysisTK (<4.0) / gatk (4.0) --- # Add header ```bash /home/admin/software/gatk-4.1.0.0/gatk AddOrReplaceReadGroups \ --VALIDATION_STRINGENCY=SILENT \ --RGLB=hg19 \ --RGPL=illumina \ --RGPU=hg19PU \ --RGSM=cancer \ -I=../out/cancer.sorted.bam \ -O=../out/cancer.addheader.bam ``` - RGSM is Read-Group sample name (cancer/normal) .footnote[ [1] [AddOrReplaceReadGroups (Picard)](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.1.1/picard_sam_AddOrReplaceReadGroups.php) ] --- # Mark duplicates to mitigate duplication artifacts .center[ ] --- # MarkDuplicates ```bash /home/admin/software/gatk-4.1.0.0/gatk \ --java-options "-Xmx4G -Djava.io.tmpdir=./tmp" \ MarkDuplicates \ -I Input.sorted.bam \ -O Output.sortedRMdup.bam \ -M Output.marked_dup_metrics.txt ``` --- # MarkDuplicates on Spark (GATK4) ```bash /home/admin/software/gatk-4.1.0.0/gatk MarkDuplicatesSpark \ -I input.bam \ -O marked_duplicates.bam \ -M marked_dup_metrics.txt \ --conf 'spark.executor.cores=1' ``` This is a Spark implementation of Picard MarkDuplicates that allows the tool to be run in parallel on multiple cores on a local machine .footnote[ [1] [MarkDuplicatesSpark](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_spark_transforms_markduplicates_MarkDuplicatesSpark.php) ] --- # Base Quality Score Recalibration (BQSR)  the scores produced by the machines are subject to various sources of systematic (non-random) technical error, leading to over- or under-estimated base quality scores in the data. [1][Base Quality Score Recalibration (BQSR)](https://gatk.broadinstitute.org/hc/en-us/articles/360035890531) --- # BQSR: How does it work? - `BaseRecalibrator` builds the model: `BaseRecalibrator` tool builds a model of covariation based on the input data and a set of known variants, producing a recalibration file. ```bash /home/admin/software/gatk-4.1.0.0/gatk \ BaseRecalibrator \ -I my_reads.bam \ -R reference.fasta \ --known-sites dbsnp_138.hg19.vcf \ --known-sites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \ --known-sites 1000G_phase1.indels.hg19.sites.vcf \ --known-sites 1000G_phase1.snps.high_confidence.hg19.sites.vcf \ -L ../in/Illumina.bed \ -O recal_data.table ``` ExAc, gnomAD, or dbSNP resources can be used as known sites of variation. We assume that all reference mismatches we see are therefore errors and indicative of poor base quality. [1][Base Quality Score Recalibration (BQSR)](https://gatk.broadinstitute.org/hc/en-us/articles/360035890531) --- # BQSR: How does it work? - `ApplyBQSR` adjusts the scores: `ApplyBQSR` tool adjusts the base quality scores in the data based on the model, producing a new BAM file. ```bash /home/admin/software/gatk-4.1.0.0/gatk \ --java-options "-Xmx4G -Djava.io.tmpdir=./tmp" \ ApplyBQSR \ -R reference.fasta \ -I Input.sortedRMdup.bam \ -bqsr recal_data.table \ -L ../in/Illumina.bed \ --add-output-sam-program-record \ --static-quantized-quals 10 \ --static-quantized-quals 20 \ --static-quantized-quals 30 \ -O Output.recal.bam ``` --- # Quiz 1.Could you write the commands of Mark Duplicates and Base (Quality Score) Recalibration by yourself? Please write your codes in a script file (script.sh). --- # Resource This section is based on the GATK 4 Best Practices workflow at [https://software.broadinstitute.org/gatk/best-practices/](https://software.broadinstitute.org/gatk/best-practices/)