A Beginners Guide to Call Somatic Mutation

class: center, middle, inverse, title-slide

# A Beginners Guide to Call Somatic Mutation
## Part III
### Lijia Yu
### 2020/10/04<br/>(updated: 2020-10-04)

---

# Outline

0. Review
1. Reference genome
2. SAM and BAM
3. Mark Duplicates

---

# Answer the Quiz

## How to use pipe?

```bash
java -Xmx4G -jar trimmomatic-0.36.jar PE \
-threads 4 \
-phred64 \
../in/normal_R1.fq.gz \
../in/normal_R2.fq.gz \
../out/normal_R1.trimed.fq.gz \
../out/normal_R1.unpaired.fq.gz \
../out/normal_R2.trimed.fq.gz \
../out/normal_R2.unpaired.fq.gz TOPHRED33

bwa mem -M -t 4 \
/home/admin/database/reference/hg19/ucsc.hg19.fasta \
../out/normal_R1.trimed.fq.gz \
../out/normal_R2.trimed.fq.gz |\
samtools sort -o ../out/normal.sorted.bam -
```

---

# Reference genome version

- GRCh37: Genome Reference Consortium
    - unlocalized sequences
    - unplaced sequences
    - alternate loci
    - 1,..,22,X,Y,MT
- GRCh37lite: Genome Reference Consortium
    - 1,..,22,X,Y,MT
- hg19: UCSC
    - chr*_random sequences
    - chrUn_* sequences
    - chr1,...,chr22,chrX,chrY,chrM()
- b37: 1k genome phase 1
    - unlocalized sequences
    - unplaced sequences
    - 1,..,22,X,Y,MT
- hs37d5: 1k genome phase 2 (**decoy** sequence)
    - unlocalized sequences
    - unplaced sequences
    - 1,..,22,X,Y,MT
    - human herpesvirus 4 type 1 sequence, etc.
- GRCh38: Genome Reference Consortium
- hg38: UCSC

---

# Reference genome

- If you map reads to GRCh37 or hg19, use `hs37-1kg`:

<ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/
reference/human_g1k_v37.fasta.gz>

- If you map to GRCh37 and believe decoy sequences help with better variant calling, use `hs37d5`:

<ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/
reference/phase2_reference_assembly_sequence/hs37d5.fa.gz>

- If you map reads to GRCh38 or hg38, use the following:

<ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/
GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/
GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz>

.footnote[
[1][Which human reference genome to use?](https://lh3.github.io/2017/11/13/which-human-reference-genome-to-use)
]

---

# SAM/BAM format

.center[
![](https://us.v-cdn.net/5019796/uploads/editor/f4/uuzmf2cbau1y.png)]

SAM stands for <u>Sequence Alignment Map</u> and is described in the standard specification [here](http://samtools.github.io/hts-specs/). BAM and CRAM are both compressed forms of SAM; BAM (for Binary Alignment Map) is a lossless compression while CRAM can range from lossless to lossy depending on how much compression you want to achieve (up to very much indeed).

.footnote[
[1] [SAM / BAM / CRAM - Mapped sequence data formats](https://gatkforums.broadinstitute.org/gatk/discussion/11014/sam-bam-cram-mapped-sequence-data-formats)

[2] [SAM/BAM and related specifications](http://samtools.github.io/hts-specs/)
]

---

# SAM/BAM format: header

```bash
samtools view -H read_data.bam
```

> @HD VN:1.5  GO:none SO:coordinate

> @SQ SN:20   LN:63025520

> @RG ID:H0164.2  PL:illumina PU:H0164ALXX140820.2 LB:Solexa-272222
> PI:0 DT:2014-08-20T00:00:00-0400 SM:NA12878 CN:BI

> @PG [program group metadata]

The header should contain read groups with **sample names** (SM:NA12878).

The presence of the @RG tags indicate the presence of read groups. Each read group has a SM tag, indicating the sample from which the reads belonging to that read group originate.

.footnote[
[1] [SAM / BAM / CRAM - Mapped sequence data formats](https://gatkforums.broadinstitute.org/gatk/discussion/11014/sam-bam-cram-mapped-sequence-data-formats)
]

---

# SAM/BAM format: CIGAR

.center[
![](./2019-07-31-A_beginners_guide_to_Call_SNPs_and_indels_Part_II_files/figure/CIGAR.png)]

[1][Introduction to Variant Discovery with GATK Best Practices](https://qcb.ucla.edu/wp-content/uploads/sites/14/2016/03/GATKwr12-1-GATK_primer.pdf)

---

# Pipeline: [Data pre-processing for variant discovery](https://gatk.broadinstitute.org/hc/en-us/articles/360035535912)

<div id="htmlwidget-9bee415cc47b37e0f9f7" style="width:504px;height:504px;" class="grViz html-widget"></div>
<script type="application/json" data-for="htmlwidget-9bee415cc47b37e0f9f7">{"x":{"diagram":"digraph flowchart {\n      # node definitions with substituted label text\n      node [fontname = Helvetica, shape = rectangle] \n      tab1 [label = \"Quality Control: fastQC\"]\n      tab2 [label = \"Quality Control and filtering: FASTP\"]\n      tab3 [label = \"Phred 64 to 33: Trimmomatic\"]\n      tab4 [label = \"Map to Reference: Burrows-Wheeler Aligner\"]\n      tab5 [label = \"Mark Duplicates: ??\"]\n      tab6 [label = \"Base Recalibration: ??\"]\n\n      # edge definitions with the node IDs\n      tab1 -> tab3 -> tab4 -> tab5 -> tab6;\n      tab2 -> tab4;\n      tab2 -> tab3;\n      }","config":{"engine":"dot","options":null}},"evals":[],"jsHooks":[]}</script>

---

# Genome Analysis Toolkit

.center[
![](./2019-07-31-A_beginners_guide_to_Call_SNPs_and_indels_Part_II_files/figure/gatk.png)]

- the toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping. Its powerful processing engine and high-performance computing features make it capable of taking on projects of any size.

- Mutect2 / MuTect2 / MuTect
- Picard ( a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF. )
- GenomeAnalysisTK (<4.0) / gatk (4.0)

---

# Add header

```bash
/home/admin/software/gatk-4.1.0.0/gatk AddOrReplaceReadGroups \
--VALIDATION_STRINGENCY=SILENT \
--RGLB=hg19 \
--RGPL=illumina \
--RGPU=hg19PU \
--RGSM=cancer \
-I=../out/cancer.sorted.bam \
-O=../out/cancer.addheader.bam 
```

- RGSM is Read-Group sample name (cancer/normal)

.footnote[
[1] [AddOrReplaceReadGroups (Picard)](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.1.1/picard_sam_AddOrReplaceReadGroups.php)
]

---

# Mark duplicates to mitigate	duplication	artifacts

.center[
![](./2019-07-31-A_beginners_guide_to_Call_SNPs_and_indels_Part_II_files/figure/duplication.png)]

---

# MarkDuplicates

```bash
/home/admin/software/gatk-4.1.0.0/gatk \
--java-options "-Xmx4G -Djava.io.tmpdir=./tmp"  \
MarkDuplicates \
-I Input.sorted.bam \
-O Output.sortedRMdup.bam \
-M Output.marked_dup_metrics.txt
```

---

# MarkDuplicates on Spark (GATK4)

```bash
/home/admin/software/gatk-4.1.0.0/gatk MarkDuplicatesSpark \
-I input.bam \
-O marked_duplicates.bam \
-M marked_dup_metrics.txt \
--conf 'spark.executor.cores=1'
```

This is a Spark implementation of Picard MarkDuplicates that allows the tool to be run in parallel on multiple cores on a local machine

.footnote[
[1] [MarkDuplicatesSpark](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_spark_transforms_markduplicates_MarkDuplicatesSpark.php)
]
---

# Base Quality Score Recalibration (BQSR)

![](./2019-07-31-A_beginners_guide_to_Call_SNPs_and_indels_Part_II_files/figure/Recalibration.png)

the scores produced by the machines are subject to various sources of systematic (non-random) technical error, leading to over- or under-estimated base quality scores in the data.

[1][Base Quality Score Recalibration (BQSR)](https://gatk.broadinstitute.org/hc/en-us/articles/360035890531)
---

# BQSR: How does it work?

- `BaseRecalibrator` builds the model:
`BaseRecalibrator` tool builds a model of covariation based on the input data and a set of known variants, producing a recalibration file.

```bash
/home/admin/software/gatk-4.1.0.0/gatk \
BaseRecalibrator \
-I my_reads.bam \
-R reference.fasta \
--known-sites dbsnp_138.hg19.vcf \
--known-sites Mills_and_1000G_gold_standard.indels.hg19.sites.vcf \
--known-sites 1000G_phase1.indels.hg19.sites.vcf \
--known-sites 1000G_phase1.snps.high_confidence.hg19.sites.vcf \
-L ../in/Illumina.bed \
-O recal_data.table
```

ExAc, gnomAD, or dbSNP resources can be used as known sites of variation.

We assume that all reference mismatches we see are therefore errors and indicative of poor base quality.

[1][Base Quality Score Recalibration (BQSR)](https://gatk.broadinstitute.org/hc/en-us/articles/360035890531)

---

# BQSR: How does it work?

- `ApplyBQSR` adjusts the scores:
`ApplyBQSR` tool adjusts the base quality scores in the data based on the model, producing a new BAM file.

```bash
/home/admin/software/gatk-4.1.0.0/gatk \
--java-options "-Xmx4G -Djava.io.tmpdir=./tmp" \
ApplyBQSR \
-R reference.fasta  \
-I Input.sortedRMdup.bam \
-bqsr recal_data.table \
-L ../in/Illumina.bed \
--add-output-sam-program-record \
--static-quantized-quals 10 \
--static-quantized-quals 20 \
--static-quantized-quals 30 \
-O Output.recal.bam
```

---

# Quiz

1.Could you write the commands of  Mark Duplicates and Base (Quality Score) Recalibration by yourself?

Please write your codes in a script file (script.sh).

---

# Resource

This section is based on the GATK 4 Best Practices workflow at [https://software.broadinstitute.org/gatk/best-practices/](https://software.broadinstitute.org/gatk/best-practices/)