Venn diagrams, Euler diagrams and Visualization of Intersecting Sets

Posted in Statistical graphics and tagged Venn diagram, Euler diagram, UpSet on Jul 15, 2015

Venn diagrams are similar to Euler diagrams. However, a Venn diagram for n component sets must contain all 2n hypothetically possible zones that correspond to some combination of inclusion or exclusion in each of the component sets. Euler diagrams contain only the actually possible zones in a given context. I often draw a Venn diagram with same zone size, however Euler diagram do not need same zone size. All Venn diagrams are Euler diagrams, but not all Euler diagrams are Venn diagrams.

In some R packages, they only provide us to draw Venn diagram with 2~5 subsets. It is difficult to find a symmetric Venn diagram over 5 subset. For more information of Venn diagram please see wiki

Drew Skau gave us a great explain of Venn and Euler Diagrams.

Last year, Alexander Lex and Nils Gehlenborg create a new set-based data visualization technique UpSet and published their Points of view: Sets and intersections. UpSet is a new way to visualize sets intersection, compare to Euler diagram, it can provides all intersection set info (more than 6 subsets) in one figure. Although it isn’t as beautiful as Euler diagram, I think it provide the more effective way to represent intersection set.

Here is the set visualization techniques we often used in published paper.

run sciClone without rgl package

Posted in R-Language and tagged R, sciClone, rgl on Jun 23, 2015

rgl_package_problem

Maybe you have met the problem above when apply sciClone package with a server without libGLU (X server).

For me, our lab use a cluster can not connect to outside network and without X-server. When apply sciClone package, it would got a error message like,

.onLoad filed in loadNamespace() for 'rgl'  
libGLU.so.1 can not open shared object file 

The fast way to fix this problem is, if you do not need sc.plot3d() function, just remove it, then you can run other subfunction in this package successfully.

If you do not know how to remove the sc.plot3d() function, you can download a modified sciClone package from my repository or install the modified package using install_git()

install_github("sciclone",username="yulijia",ref = "master")

An Explanation of nt and nct in GATK

Posted in Bioinformatics and tagged GATK, threads, CPU on Jun 3, 2015

I am confused with GATK Multi-threading options, -nt / --num_threads controls the number of data threads sent to the processor, -nct / --num_cpu_threads_per_data_thread controls the number of CPU threads allocated to each data thread.

What’s data threads?

Here is an answer from Geraldine Van der Auwera, PhD

In the meantime, what you need to know is that -nct is the number of CPU threads, ie threads that can be run by different cores if you have a multicore CPU, while -nt is the number of data threads, ie number of “clones” of the GATK that are run in parallel on your machine.

So nt is based on how many copies you want to run in the same time.

nct can control the CPU threads. Take CB60-G16 Network Server for example,

Tested Configuration:
Computer Type:	Blade Module
Mother Board Revision:	C01
BIOS/uEFI:	uEFI: 1.61 (09/06/2013)
CPU:	2 Intel Xeon® Processor E5-2690 v2 3.0 GHz
RAM:	32 GB

Each node has two CPUs (10 cores, 20 threads).

If I use -nct 10, it means I use 10 threads on one node. -nt 10 means I use 10 threads for one data thread(one sample). nt might cause an increasing in memory usage, each data thread would be allowed with the same memory size.

HotNet2

Posted in Bioinformatics and tagged HotNet2, random walk, sample size on Jun 2, 2015

HotNet2 is a Python based software to identify subnetworks of gene interaction network with mutation infomation.

I read the Algorithm last month and got a basic knowledge of random walk.

Random walk with restart

concept

For gene(protein) interaction network, random walk starts from a protein g and at each time step moves to one of the neighbors with the probability \(1-\beta\) (\(0 \leq \beta \leq 1\)).

The walk can also restarts from g with probability \(\beta\). This process is defined by a transition matrix \(W\).

\[W_{ij}= \left\{ \begin{array}{ll} {\frac{1}{deg(j)}} &\text{if node i interacts with node j,}\\ {0} &\text{otherwise.} \end{array} \right.\]

\(deg(j)\) is the number of neighbors (the degree) of protein \(g_{i}\) in the interaction network.

\(\beta\) represent the probability with which the walk starting at \(g_{i}\) is forced to restart from \(g_{i}\).

The random walk will reaches a stationary discribution described by the vector \(\vec{s}_{i}\)

\[\vec{s}^{\prime}_{i}=(1-\beta)W\vec{s}_{i}+\beta\vec{s}_{i}\]

When \(\vec{s}^{\prime}_{i}=\vec{s}_{i}\), we can get

\[\vec{s}_{i}=\beta(I-(1-\beta)W)^{-1}\vec{e}_{i}\]

where\(\vec{e}_{i}\) is the vector with a 1 in position \(i\) and 0 is in the remaining positions.

This part \(\beta(I-(1-\beta)W)^{-1}\) is called diffusion matrix \(F\).

\[F=\beta(I-(1-\beta)W)^{-1}\]

Note that \(\vec{s}_{i}\) is the \(i^{th}\) column of \(F\).

parameter

To calculate the diffusion matrix, we need know the value of \(\beta\). In HotNet2, they chose \(\beta\) to balance the amount of heat that diffuses from a protein to its immediate neighbors and to the rest of the network.

There is another parameter \(\delta\), it is the edge wight parameter. It’s used to make sure the HotNet2 will not find large subnetworks using random data.

For more detail of these parameters, please see the supplementary of this paper.

sample size problem

I ask a question about sample size on HotNet google group.

I read the HotNet2 paper and find the size of samples used in this paper is very large. If my sample size is small ( for example, 10 samples or 20 samples ), could I use hotnet2 to do the pathway analysis? Is there any baseline of sample size ?

As I understand, you can use hotnet2 with whatever sample size you have. For example if you take a look on this analysis here: https://cs.brown.edu/research/pubs/theses/ugrad/2014/jain.pdf you can see that they used p-values as input, thus hotnet itself does not depend on the sample size. This means, that you have to take into account sample size caused biases at the p-value calculation. Therefore, if your sample size is low, you might want to consider more robust ways to calculate p-values, e.g. some rank based approaches (for example Rank Product).

Answer by Akos Tenyi

John Nash

Posted in Uncategorized and tagged game theory, mathematics on May 24, 2015

John Nash and his wife were killed in crash.

For me, it marks the end of an era.

Biomedical science and Mathematics

To Infinity and Beyond!