working with VCF files

Posted in HowTo, Bioinformatics and tagged VCF, sort, index, intersect, extract, merge, concatenate, combine on Apr 9, 2020

1.sort VCF files
2.index VCF files
3.extract vcf from a bed region
4.intersect VCF files
5.merge VCF files
6.Concatenate or combine or append VCF files
7.Change Chromosome Notation

This is a note of working with VCF files. Try to avoid use awk/sed and other linux default command. Use the professional tools!

1.sort VCF files

java -jar picard.jar SortVcf INPUT=in.vcf OUTPUT=out.vcf
java -jar picard.jar SortVcf INPUT=in.vcf.gz OUTPUT=out.vcf.gz

It will provide index file if you set the output as gz file.

2.index VCF files

bgzip genotypes.vcf && tabix -p vcf genotypes.vcf.gz

Ref:Question: Generate vcf.gz file and its index file vcf.gz.tbi

3.extract vcf from a bed region

Need have a vcf index file before extract from the vcf file. Print header line with -h.

tabix -R region.bed myvcf.gz -h > extract.vcf

Ref:Question: Extract Sub-Set Of Regions From Vcf File

4.intersect VCF files

bcftools isec  -p output_dir  A.vcf.gz B.vcf.gz

output_dir/0000.vcf.gz would be variants unique to A.vcf.gz
output_dir/0001.vcf.gz would be variants unique to B.vcf.gz
output_dir/0002.vcf.gz would be variants shared by A.vcf.gz and B.vcf.gz as represented in A.vcf.gz
output_dir/0002.vcf.gz would be variants shared by A.vcf.gz and B.vcf.gz as represented in B.vcf.gz

Ref: Question: intersect VCF files

5.merge VCF files

When we say merge VCF files, it means that all genotype of one snv/indel will be merged in single line.

A.vcf

 ##fileformat=VCFv4.1 
 ##FILTER=<ID=PASS,Description="Passed all filters">
 ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Read Depth">
 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
 #CHROM POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  S1  S2  S3
 10  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/1 0/0 0/1
 11  .   C   A   .   .   DP=3;CALLER=Samtools    GT  .   .   1/1
 12  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/0 0/0 0/0
 13  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/1 1/1 1/1

B.vcf

 ##fileformat=VCFv4.1 
 ##FILTER=<ID=PASS,Description="Passed all filters">
 ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Read Depth">
 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
 #CHROM POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  S4  S5  S6
 10  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/1 0/0 0/1
 11  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/1  .  1/1
 12  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/0 0/0 1/1 
 13  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/0 0/0 0/0

merged.vcf

 ##fileformat=VCFv4.1 
 ##FILTER=<ID=PASS,Description="Passed all filters">
 ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Read Depth">
 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
 #CHROM POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  S1  S2  S3 S4  S5  S6
 10  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/1 0/0 0/1 0/1 0/0 0/1
 11  .   C   A   .   .   DP=3;CALLER=Samtools    GT  .   .   1/1 0/1  .  1/1
 12  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/0 0/0 0/0 0/0 0/0 1/1 
 13  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/1 1/1 1/1 0/0 0/0 0/0

bcftools only accept zipped vcf files. -Oz is the output format of .gz

bcftools merge *vcf.gz -Oz -o Merged.vcf.gz

Ref:Question: Best way to merge multiple VCF files

6.Concatenate or combine or append VCF files

All source files must have the same sample columns appearing in the same order. The program can be used, for example, to concatenate chromosome VCFs into one VCF, or combine a SNP VCF and an indel VCF into one. The input files must be sorted by chr and position.

A.vcf

 ##fileformat=VCFv4.1 
 ##FILTER=<ID=PASS,Description="Passed all filters">
 ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Read Depth">
 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
 #CHROM POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  S1  S2  S3
 10  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/1 0/0 0/1
 11  .   C   A   .   .   DP=3;CALLER=Samtools    GT  .   .   1/1
 12  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/0 0/0 0/0
 13  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/1 1/1 1/1

B.vcf

 ##fileformat=VCFv4.1
 ##FILTER=<ID=PASS,Description="Passed all filters">  
 ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Read Depth">
 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
 #CHROM POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  S1  S2  S3
 14  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/1 0/0 0/1
 15  .   C   A   .   .   DP=3;CALLER=Samtools    GT  .   .   1/1
 16  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/0 0/0 0/0
 17  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/1 1/1 1/1

concat.vcf

 ##fileformat=VCFv4.1
 ##FILTER=<ID=PASS,Description="Passed all filters">
 ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Read Depth">
 ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
 #CHROM POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  S1  S2  S3
 10  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/1 0/0 0/1
 11  .   C   A   .   .   DP=3;CALLER=Samtools    GT  .   .   1/1
 12  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/0 0/0 0/0
 13  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/1 1/1 1/1
 14  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/1 0/0 0/1
 15  .   C   A   .   .   DP=3;CALLER=Samtools    GT  .   .   1/1
 16  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/0 0/0 0/0
 17  .   C   A   .   .   DP=3;CALLER=Samtools    GT  0/1 1/1 1/1

bcftools concat A.vcf.gz B.vcf.gz -Oz -o output.vcf.gz

Ref:Question: How to merge vcf files with different variants but same samples?

7.Change Chromosome Notation

Please remember that in the VCF header line there also have chromosome notaion, if you use awk/sed to change the notaion, you should also change it on the VCF header line.

echo "1 chr1" >> chr_name_conv.txt
echo "2 chr2" >> chr_name_conv.txt
bcftools annotate --rename-chrs chr_name_conv.txt original.vcf.gz | bgzip > rename.vcf.gz

Ref:Question: VCF files: Change Chromosome Notation

Rising of Populism

Posted in Uncategorized and tagged Trump, virus, COVID-19, populism on Mar 22, 2020

Trump started to use the term “Chinese Virus” in twitter and briefings in the White House since 17 March. It really let me feel angry, he almost did nothing to help the Americans stop the spread of coronavirus except blame China. However, he stands for Americans, it implies most of Americans may agree with his view – The coronavirus is a “Chinese Virus”. Trump’s actions are all driven by domestic politics in preparation for the 2020 election. COVID-19 is a global public health event. The United States is facing a huge pandemic prevention challenge, and the pandemic is likely to develop into a public crisis. Trump has a very clear mind on this, in order to ensure that his re-election is not affected by the COVID-19 pandemic, he needs to shape a “story” for COVID-19. “Foreign virus” is such a “story”. Its theme is: the United States is being invaded by foreign virus, the virus is imported from “other countries”, threatening the health and safety of the American public. Americans are innocent victims. Trump’s role as president is to protect the United States.

Trump wants to “up the ante”, to make the stroy more vivid by throw out the “Chinese virus”. It clearly links the virus to China. And with China-threat theory, U.S.-China confrontation being the topic of U.S. politics for the last two years, cracking down and containing China is now politically correct in Washington. China was supposed to be the central topic of the Trump 2020 election.

The use of term “Chinese virus” is an attempt to link infectious diseases to anti-Chinese sentiment and anti-Chinese politics.

Trump’s story is that the COVID-19 is coming precisely from the biggest threat - China. Even though China did not “intentionally export” the virus, its “failure to disclose” and “poor management” led to its spread to the U.S, causing serious damage to the U.S.

In this way, people unconsciously shift their attention from their own epidemic prevention and government responsibility to China. They do not blame the government for its poor epidemic prevention, but to think that China exports viruses to the world and hurts the world. American politicians can find a scapegoat for their own failings.

I think all of the Trump supporters will agree on his view. I know from my former hospital that an American physician never goes abroad, this person believes U.S. is the best country and never visits any other place in the world (because other countries are not better than the U.S.). It is real ignorance.

The appeal of the populists has grown with mounting public discontent over the status quo. In the West, many people feel left behind by technological change, the global economy, and growing inequality. Horrific incidents of terrorism generate apprehension and fear. Some are uneasy with societies that have become more ethnically, religiously and racially diverse. There is an increasing sense that governments and the elite ignore public concerns. The Dangerous Rise of Populism

Nativism, xenophobia, racism are on the rise. We couldn’t avoid facing more and more conflict between China and the West. Nobody will win in the last.

In China, there are also lots of people support for Populism, some of them believe the COVID-19 is de novo synthetic viruses, some of them believe traditional medicine is better than modern medical medicine. They believe the person who says the truth but not consistent with the government voice is the spokesperson for the West. Personal attacks via social media are more and more common.

The ideology gap between the West and China will become larger than we could think.

The economic gap between the developed countries and developing countries will become larger than we could predict. By the way, do you see many artists from South East Asia or Latin American or Africa at the virtual concert series – One World: Together at Home? Do you think those people have their voice in the world?

How to set up Samba on CentOS

Posted in HowTo and tagged Samba, CentOS on Feb 28, 2020

Samba is the standard Windows interoperability suite of programs for Linux and Unix. People can browse a server directory like use Windows OS on their own computer. It also could avoid the scp step if you want to download and open a text file in the server, yes, you could open any files on the server and edit it like on your own computer.

Here I will introduce the steps to set up Samba on CentOS in a home network. There may be a problem if someone wants to use this Samba function on Windows 10 systems with a server that support different version of the Samba protocol. My laptop OS is Fedora, and the server OS is CentOS.

Install Samba on local computer and server, eg: yum install samba samba-client samba-client-libs.
Start the protocol on server, sudo systemctl start smb.service.
- The smbd service provides file sharing and printing services and listens on TCP ports 139 and 445. At here, for personal reason, I didn’t want to start the nmbd server, which provides NetBIOS over IP naming services to clients and listens on UDP port 137.

Set up firewall

firewall-cmd --list-services
firewall-cmd --list-ports
firewall-cmd --permanent --zone=public --add-service=samba

I want to use the /home/yulijia directory via Samba, so the Samba configuration file (/etc/samba/smb.conf) could be:
```
[homes]
 comment = Home Directories
 valid users = %S, %D%w%S
 browseable = Yes
 read only = No
 create mask = 0700
     directory mask = 0700
 inherit acls = Yes
```
Once done, please restart Samba services, systemctl restart smb.service
Connect from your local machine. Because I use Fedora, so I could open the Nautilus to add a new connect to server link (smb://192.168.0.111:139/), then connect to the server, select Registered User, enter your server username and password.

connect_to_samba

fill_in_username_and_password

Then the files on the Samba server will be shown.

Note: the default port of Samba service is 139.

Reference:

[1] How to Install and Configure Samba on CentOS 7

2019 nCoV

Posted in Uncategorized and tagged 2019-nCoV, COVID-19 on Feb 2, 2020

My friend in US ask me if there are lots of people are dying in China recently.

From what I know many people outside China are scared by the fake news or some conspiracy theories. As The Coronavirus Spreads, So Does Racism.

First of all, there are around 350 people deaths by date. In the early January some people died due to this disease, but the numbers are not count on the report because they didn’t get a correct diagnosis. The defination of correct diagnosis is that PCR results for the throat swab samples must be shown positive at least twice between a time span. Even on today, there are many people who cannot get a correct diagnosis, the capacity of medical care at Wuhan city are still lower. Not all patients have been hosipitalized, suspected cases are stay at home wait for the diagnosis, their family members are infected by them.

The local government made actions too late to stop the spread of the disease at the earlier stage. All Chinese feel angry with them. I believe some governers will go to jail after the end of the epidemic.

I would like to take a note to remember what I learned that happened at Wuhan City, and other cities.

2019 Dec 24-26

The first strain of 2019-nCoV virus was sequenced in a company that is mainly focused on Metagenomic next-generation sequencing. The company shared the result with CDC, they found that the virus is clustered close with SARS-CoV at the phylogenetic tree.

2019 Dec 30

Another company confirm the preview result, and this company provide a clincinal diagnosis report, in the report they believe the virus is SARS-CoV.

2019 Dec 30-31

‘A Rumor’ was spread in the social media that some people are diagnosed with SARS at Wuhan.

2020 Jan 1

Law enforcement agency at Wuhan warned 8 doctors, accuse them of spreading rumors. By now (2020 Feb), people called them 8 Braves in China. One of the 8 Braves, an ophthalmologist(oculist) passed away at 2020 Feb 6th. He was hospitalizated at 2020 Feb 1st.

2020 Jan 10

Novel 2019 coronavirus genome release

I feel regret for myself, although I am working on the medical science field, at the first glance I still ignore the similarity between SARS-CoV and 2019-nCoV is around 80%, which means they may have the same characterization: spread from person to person, breath problem, etc.

2020 Jan 10-19

More and more cases were reported at Wuhan City, and suspected cases were reported in other countries. The expert said that the virus may not spread between humans. My friend and I discussed the strange phenomenon that there are no suspected cases in other place of China but we know there are the suspected cases in other countries.

2020 Jan 20

In the night, Dr. Zhong Nanshan in the interview of CCTV told all people in China that ‘it is certain that it is a human-to-human transmission phenomenon’. He suggested that people in Wuhan should not travel to other places, and people from other places should not travel to Wuhan. It implied that the city should be locked down.

2020 Jan 21

masks were sold out in pharmacies/stores.

2020 Jan 23 - Jan 31

Wuhan is locked down.

People were angry at the mayor’s slow actions. There is no warning message before Jan 20, 2020, however, the virus already spread to many places. The governor of Hubei province and Wuhan city mayor said the medical supplies are enough for the local hospitals to treat all patients, however, they even don’t know how many masks could be made in the factories in Hubei province every year. By meanwhile, many hospitals are lack of medical protection supplies, they began to ask donations from individuals all over China and the world. The governors feel those hospitals ignore the Hubei government. The local red cross society (this is not a part of red cross global) received many donations but this organization doesn’t have the ability to assign supplies in a fair. Medical supplies are lacked in some hospitals until Feb 2, 2020.

Other news also happened during this period, for instance, some researchers pay more attention to submit publications than find a way to treat patients, some researchers in the Chinese Academy of Science posted silly news that they believe some Chinese traditional medicine can help cure illness.

2020 Jan 31

First Case of 2019 Novel Coronavirus in the United States

I believe key to cure is Remdesivir.

I feel totally angry with the Hubei and Wuhan local governers, but I still feel proud of the people of China, feel proud of the local medical workers and volunteers. The people always unite together, we all carry something when we forward.

2020 Feb 11

I heard from the social media in China that one old man in Wuhan suicided due to he cannot get medical care for his uremia problem, his wife was also diagnosed with COVID-19, but she cannot be hospitalized, because of the limitation of medical support at Wuhan.

2020 Feb 13

By now, there are 30 thousand people in Wuhan were diagnosed with COVID-19, it is more than I thought. In January, I told my friend that I guess the summit of the number of sick people will be 30 thousand at the end of February. All the people in this city are heroes, they suffered themselves to save other people in the whole world.

2020 Apr 4

China held a national mourning on Saturday (April 4) for citizens who died in the fight against the novel coronavirus outbreak and patients who died of the Covid-19 disease.

2020 Apr 17

Wuhan coronavirus death toll by 1290, it is a good sign that the government has the courage to face the truth that there were many deaths linked to the virus outside hospitals, such as people who died at home, had not previously been recorded.

How to draw multiple ggplot2 figures on a page

Posted in HowTo and tagged R, ggplot2 on Jan 15, 2020

There are multiple ways to layout graphs on one page with R¹². I prefer using grid.arrage function in gridExtra package.

The widths and heights options in the grid.arrage function will help you to arrange graphs with different size. Using do.call function may help you to pass all figures (in the list) to grid.arrange function.

Here is a short example of it.

1.With same figure size

library(ggplot2)
library(gridExtra)
p <- list()
p[[1]]<- ggplot(mtcars, aes(wt, mpg))+geom_point()
p[[2]] <-  ggplot(ChickWeight, aes(x=Time, y=weight, colour=Diet, group=Chick)) +
  geom_line() +
  ggtitle("Growth curve for individual chicks")

p[[3]] <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am)))+geom_bar()
p[[4]] <- ggplot(subset(ChickWeight, Time==21), aes(x=weight, fill=Diet)) +
  geom_histogram(colour="black", binwidth=50) +
  facet_grid(Diet ~ .) +
  ggtitle("Final weight, by diet") +
  theme(legend.position="none")

do.call("grid.arrange", c(p,ncol=2)) 

multiple-plots-on-one-page

2.With different figure size

grid.arrange(p[[1]],p[[2]],p[[3]],p[[4]],
             ncol=2, nrow=2, widths=c(4, 2), heights=c(3, 4))

multiple-plots-on-one-page-with-different-size

Biomedical science and Mathematics

To Infinity and Beyond!