Nov 13 2012
 

I often need to compare distributions from data sets with different numbers of data points. This is fairly easy to do with R and ggplot2.

 

Step 1. Read the data from the first and second sources.

> e = read.table("ensembl_last_exon_distance.txt",header=T)
> r = read.table("refgene_last_exon_distance.txt",header=T)
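If you want to follow along without the original files, here is a minimal sketch that reads made-up values through read.table's text argument. The column names ensembl and refgene are the ones step 2 relies on; the numbers themselves are invented for illustration and are not the real data.

```r
# Made-up stand-ins for the two input files (illustrative values only).
# Each "file" has a single column whose header names the source.
e <- read.table(text = "ensembl
6157
18815
43723", header = TRUE)

r <- read.table(text = "refgene
71914
259289", header = TRUE)

# The two data frames have different column names and different row counts.
str(e)
str(r)
```

With real files, replace the text argument with the file name, as in the two commands above.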

Step 2. The two data sources have different column headers, and rbind requires matching column names, so I can’t combine them right away. Even if I could, I wouldn’t know which rows came from which source, since all the values would sit in the same column. So I copy each data set into a new data.frame with a shared header (Distance), then add a second column (DataSource) to tell the two sources apart.

> ensemblu = data.frame(Distance = (e$ensembl))
> refgeneu = data.frame(Distance = (r$refgene))

> ensemblu$DataSource = 'ensembl'
> refgeneu$DataSource = 'refgene'

> head(refgeneu)
  Distance DataSource
1    71914    refgene
2   259289    refgene
3    24759    refgene
4     8520    refgene
5   103292    refgene
6   148873    refgene
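The relabelling in this step can also be wrapped in a small function, which may be handy with more than two sources. This is only a sketch: label_source is a hypothetical helper name, not something from base R or ggplot2, and the values are made up.

```r
# Hypothetical helper: put a vector of values into a data frame with the
# shared column names, tagging every row with the name of its source.
label_source <- function(values, source) {
  data.frame(Distance = values, DataSource = source, stringsAsFactors = FALSE)
}

# Illustrative values only, not the real data.
ensemblu <- label_source(c(6157, 18815, 43723), "ensembl")
refgeneu <- label_source(c(71914, 259289), "refgene")

# Both data frames now share the same headers, so rbind will accept them.
identical(names(ensemblu), names(refgeneu))
```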

Step 3. Combine both data sets. The head and tail show that both data sets are now stacked in a single Distance column. I will use the second column, DataSource, to distinguish the two data sets visually when plotting.

> both = rbind(ensemblu,refgeneu)

> head(both)
  Distance DataSource
1     6157    ensembl
2    18815    ensembl
3    43723    ensembl
4    48196    ensembl
5    31755    ensembl
6    93981    ensembl

> tail(both)
      Distance DataSource
23037    42503    refgene
23038    26796    refgene
23039    34782    refgene
23040    18100    refgene
23041     6066    refgene
23042     7635    refgene
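Besides head and tail, a quick way to confirm the combination worked is to count rows per source and compare a summary statistic. The sketch below uses made-up values, so the counts will not match the real data.

```r
# Illustrative values only; the real data has thousands of rows per source.
both <- rbind(
  data.frame(Distance = c(6157, 18815, 43723), DataSource = "ensembl"),
  data.frame(Distance = c(71914, 259289), DataSource = "refgene")
)

# Number of rows contributed by each source.
counts <- table(both$DataSource)
print(counts)

# Median distance per source, as a first numeric comparison.
medians <- tapply(both$Distance, both$DataSource, median)
print(medians)
```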

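Step 4. Plot. You can plot a histogram of the raw values or of the log-transformed values very easily. Note that rnorm(Distance) does not transform the data: it draws length(Distance) random values from a standard normal distribution, so the last plot below shows simulated noise split by DataSource, not the measured distances.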

> library(ggplot2)

> ggplot(both, aes(Distance, fill=DataSource)) + geom_histogram(alpha=0.5)

> ggplot(both, aes(log(Distance), fill=DataSource)) + geom_histogram(alpha=0.5)

> ggplot(both, aes(rnorm(Distance), fill=DataSource)) + geom_histogram(alpha=0.5)
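Because the two sources have different numbers of rows, raw counts can make the larger set dominate the plot. One option, sketched here on simulated data, is to plot densities instead of counts and to overlay the histograms rather than stack them. The simulated log-normal values and the bin count are assumptions for illustration; after_stat requires ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Simulated stand-in for the combined data: two sources of unequal size.
set.seed(1)
both <- rbind(
  data.frame(Distance = rlnorm(2000, meanlog = 10, sdlog = 1),
             DataSource = "ensembl"),
  data.frame(Distance = rlnorm(500, meanlog = 10.5, sdlog = 1),
             DataSource = "refgene")
)

# Overlaid histograms scaled to density, so each source integrates to 1
# regardless of how many rows it has. position = "identity" draws the
# bars on top of each other instead of stacking them.
p <- ggplot(both, aes(log10(Distance), fill = DataSource)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 alpha = 0.5, position = "identity")

# Smoothed alternative: kernel density estimate per source.
p_density <- ggplot(both, aes(log10(Distance), fill = DataSource)) +
  geom_density(alpha = 0.5)

print(p)
print(p_density)
```

The density scale makes the shapes directly comparable; use the count-based histograms above when the absolute numbers matter.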

For more details, go here.
