Introduction to Bioinformatics

with algorithms

Lijia Yu @ NCCL in China

2019/05/15
(updated: 2019-06-03)

What is Bioinformatics?

Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. [1]

[1] Molecular Biology Review - NCBI

[2] Venn diagram: http://acm.na.edu/img/bioInfoLab/bioinformatics_venn.png

The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.

There are three important sub-disciplines within bioinformatics:

  • the development of new algorithms and statistics with which to assess relationships among members of large data sets

  • the analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein domains, and protein structures

  • the development and implementation of tools that enable efficient access and management of different types of information

1. Molecular data collection and management

  • In 1979, the Los Alamos Sequence Database, the precursor of GenBank, was established at Los Alamos National Laboratory (USA); GenBank itself was launched in 1982.

  • In 1982, the nucleotide sequence database of the European Molecular Biology Laboratory (also known as EMBL-Bank) was created (Europe).

The EMBL Nucleotide Sequence Database (also known as EMBL-Bank) is the section of the ENA which contains high-level genome assembly details, as well as assembled sequences and their functional annotation.

As of release 114 (December 2012), the EMBL Nucleotide Sequence Database contains approximately 5×10^11 nucleotides with an uncompressed file size of 1.6 terabytes.

  • In 1986, the DNA Data Bank of Japan (DDBJ) began data bank activities at the National Institute of Genetics (Japan).

  • In the early 1990s, the International Nucleotide Sequence Database Collaboration (INSDC) was founded as a collaboration among GenBank, EMBL, and DDBJ.

  • BIG Data Center: the Genome Sequence Archive (GSA) is a data repository for archiving raw sequence reads.

2. Sequence alignment

  • In 1962, Zuckerkandl and Pauling introduced the concept of the "molecular clock", which later supported the development of the neutral theory of molecular evolution.

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.

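In 1962, Linus Pauling and Emile Zuckerkandl proposed the molecular clock, which estimates the time since two lineages diverged by comparing the differences between their homologous proteins.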

position    1  2  3  4  5  6  7  8  9 10 11 12 13 14
sequence 1  G  C  A  T  G  A  C  G  A  A  T  C  A  G
sequence 2  T  A  T  G  A  C  A  A  A  C  A  G  C

Delete(1,-)

position    1  2  3  4  5  6  7  8  9 10 11 12 13 14
sequence 1  G  C  A  T  G  A  C  G  A  A  T  C  A  G
sequence 2  -  T  A  T  G  A  C  A  A  A  C  A  G  C

Delete(8,-)

position    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
sequence 1  G  C  A  T  G  A  C  G  A  A  T  C  A  G  -
sequence 2  -  T  A  T  G  A  C  -  A  A  A  C  A  G  C
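
The two steps on this slide can be tried out in R. The sketch below reads Delete(1,-) and Delete(8,-) as "put a gap in sequence 2 at that column", which is an assumption about the slide's notation; the helper names insert_gap and count_matches are made up for illustration.

```r
# Insert a gap character "-" so that it occupies the 1-based column `pos`.
insert_gap <- function(seq, pos) {
  paste0(substr(seq, 1, pos - 1), "-", substr(seq, pos, nchar(seq)))
}

# Pad the shorter sequence with trailing gaps, then count the columns
# in which both sequences carry the same base.
count_matches <- function(a, b) {
  width <- max(nchar(a), nchar(b))
  a <- paste0(a, strrep("-", width - nchar(a)))
  b <- paste0(b, strrep("-", width - nchar(b)))
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  sum(a_chars == b_chars & a_chars != "-")
}

seq1 <- "GCATGACGAATCAG"
seq2 <- "TATGACAAACAGC"

step1 <- insert_gap(seq2, 1)   # Delete(1,-): "-TATGACAAACAGC"
step2 <- insert_gap(step1, 8)  # Delete(8,-): "-TATGAC-AAACAGC"

count_matches(seq1, seq2)   # 1 matching column without gaps
count_matches(seq1, step1)  # 7 after the first gap
count_matches(seq1, step2)  # 10 after the second gap
```

Each gap shifts the rest of sequence 2 one column to the right, which is why the number of matching columns rises from 1 to 7 to 10.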

  • Hamming Distance
Hamming Distance(s,t)   2     3       5
s                       AAT   AGCAA   AGCACACA
t                       TAA   ACATA   ACACAATA
  • Scoring matrix
[Figures: BLOSUM 62 matrix; PAM 250 matrix]
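
The Hamming distance counts the positions at which two sequences of equal length differ. A minimal sketch in R, where the function name hamming_distance is an assumption for illustration and the three pairs are the ones from the slide, read as AAT/TAA, AGCAA/ACATA and AGCACACA/ACACAATA:

```r
# Hamming distance: number of positions where two equal-length strings differ.
hamming_distance <- function(s, t) {
  stopifnot(nchar(s) == nchar(t))   # only defined for equal lengths
  s_chars <- strsplit(s, "")[[1]]
  t_chars <- strsplit(t, "")[[1]]
  sum(s_chars != t_chars)
}

hamming_distance("AAT", "TAA")            # 2
hamming_distance("AGCAA", "ACATA")        # 3
hamming_distance("AGCACACA", "ACACAATA")  # 5
```

The length check matters: the Hamming distance has no notion of gaps, so sequences of different lengths need an alignment step first.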

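The BLOSUM matrices were proposed by Henikoff and are derived from alignments of short protein sequences.

The PAM matrices were proposed by Dayhoff and are derived from global alignments of protein sequences.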
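
A scoring matrix assigns a score to every pair of aligned residues, and the score of an alignment is the sum over its columns. The sketch below shows only that mechanism in R. The toy nucleotide matrix (+1 match, -1 mismatch), the gap penalty of -2 and the function name score_alignment are assumptions for illustration; they are not BLOSUM 62 or PAM 250 values.

```r
# Toy substitution matrix for nucleotides (hypothetical values).
bases <- c("A", "C", "G", "T")
toy_scores <- matrix(-1, nrow = 4, ncol = 4, dimnames = list(bases, bases))
diag(toy_scores) <- 1

# Score two aligned strings of equal length, column by column.
# A column containing a gap "-" gets the gap penalty; every other
# column is looked up in the scoring matrix.
score_alignment <- function(a, b, scores, gap = -2) {
  stopifnot(nchar(a) == nchar(b))
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  column_scores <- vapply(seq_along(a_chars), function(i) {
    if (a_chars[i] == "-" || b_chars[i] == "-") {
      gap
    } else {
      scores[a_chars[i], b_chars[i]]
    }
  }, numeric(1))
  sum(column_scores)
}

# Hypothetical gapped alignment of two short DNA sequences:
# 10 matches, 2 mismatches and 3 gap columns.
score_alignment("GCATGACGAATCAG-", "-TATGAC-AAACAGC", toy_scores)  # 2
```

Swapping toy_scores for a 20 x 20 amino acid matrix would give protein scoring with the same function, since only the row and column names change.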

3. Genome Analysis

Genome                                                Where                                 Year
H. influenzae                                         TIGR                                  1995
E. coli K-12                                          Wisconsin                             1997
S. cerevisiae (yeast)                                 international collaboration           1997
C. elegans (worm)                                     Washington U./Sanger                  1998
D. melanogaster (fruit fly)                           multiple groups                       2000
E. coli O157:H7 (pathogen)                            Wisconsin                             2000
H. sapiens (that’s us)                                international collaboration/Celera    2001
Mus musculus (mouse)                                  international collaboration           2002
Oryza sativa L. ssp. indica and ssp. japonica (rice)  international collaboration           2002

  • Cancer genome mutation

[Figure: driver mutations]

[1] Cancer Genome Landscapes

  • Genetic disease

[Figure: gene mutation that causes Huntington disease]

The HTT mutation that causes Huntington disease involves a DNA segment known as a CAG trinucleotide repeat.
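
A trinucleotide repeat is simply the same three bases occurring back to back, so its length can be measured with a regular expression. A minimal sketch in R, assuming the helper name longest_cag_run and a made-up toy sequence (not the real HTT gene):

```r
# Length, in repeat units, of the longest uninterrupted run of "CAG".
longest_cag_run <- function(dna) {
  hits <- regmatches(dna, gregexpr("(CAG)+", dna))[[1]]
  if (length(hits) == 0) {
    return(0)                 # no CAG at all
  }
  max(nchar(hits)) / 3        # three bases per repeat unit
}

# Toy sequence: ATG, then CAG x 4, then TTC, then CAG x 2, then A.
toy_dna <- "ATGCAGCAGCAGCAGTTCCAGCAGA"
longest_cag_run(toy_dna)      # 4
```

The pattern only counts perfect, uninterrupted repeats; a single substituted base splits a run in two.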

4. Transcriptome Analysis

[Figures: gene expression heatmap; gene network]

5. Proteome Analysis

Related knowledge

  1. Biology: Cell Biology, Molecular Biology, Biochemistry, Developmental Biology, Genomics, Genetics, etc.

  2. Programming: R, Perl/Python, C/C++, etc.

  3. Statistics: Hypothesis test, Interval estimation, Significance, Regression, etc.

Bioinformatics in Action

  • High performance computing

What is R?

R is a language and environment for statistical computing and graphics. It was designed by Ross Ihaka and Robert Gentleman.

x <- 1:6                # Create a vector of the integers 1 to 6.
y <- x^2                # Create a second vector by squaring each element.
plot(x, y, type = "b")  # Plot the points and connect them with lines.

What is RStudio?

RStudio Server with Terminal

  • Log in to 192.168.0.105:8787

Syllabus

  • Basic knowledge of Linux
  • NGS Data analysis (Genome data)
  • Online tools and databases for NGS analysis

Resource

[1] Bioinformatics (For undergraduates) (in Chinese)

  • All pictures without a footnote are taken from Wikipedia or browser screenshots.

Thanks!

Slides created via the R package xaringan.

The chakra comes from remark.js, knitr, and R Markdown.
