I often want to compare distributions with differing numbers of data points. This is fairly easy to do with R and ggplot2.

Step 1. Get data from first and second sources.

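```
> e = read.table("ensembl_last_exon_distance.txt", header=TRUE)
> # r, the second (refgene) source, is loaded the same way; that read.table call is not shown here
```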

Step 2. Since the two data sources have different headers, I can't use rbind immediately to combine them. Even if I could, I wouldn't know which value came from which source, since they'd all be in the same column. So I copy each data set into a new data.frame with the same header, then add a second column to distinguish the two data sources.

```
> ensemblu = data.frame(Distance = (e$ensembl))
> refgeneu = data.frame(Distance = (r$refgene))

> ensemblu$DataSource = 'ensembl'
> refgeneu$DataSource = 'refgene'

> head(refgeneu)
  Distance DataSource
1    71914    refgene
2   259289    refgene
3    24759    refgene
4     8520    refgene
5   103292    refgene
6   148873    refgene
```
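
The relabeling step can also be tried without the real files. This is a minimal sketch, assuming two made-up inputs `e` and `r` (random values, not the actual distances) that have different column names and different lengths. It sets the shared header and the label column in a single `data.frame` call, relying on R recycling the length-one label across all rows.

```r
# Hypothetical stand-ins for the two inputs: different headers, different lengths.
set.seed(1)
e <- data.frame(ensembl = rexp(8, rate = 1 / 50000))
r <- data.frame(refgene = rexp(5, rate = 1 / 50000))

# Copy each into a data frame with the same header and add the label column.
# The single string is recycled to match the number of rows.
ensemblu <- data.frame(Distance = e$ensembl, DataSource = "ensembl")
refgeneu <- data.frame(Distance = r$refgene, DataSource = "refgene")

# Both data frames now share identical column names, which rbind needs.
print(names(ensemblu))
print(names(refgeneu))
```

Whether to add the label in the same call or in a separate assignment is a matter of taste; the result is the same.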

Step 3. Combine the two data sets. You can see from the head and tail that both data sets are now together in one column. I will use the second column when plotting to distinguish the data sets visually.

```
> both = rbind(ensemblu,refgeneu)

> head(both)
  Distance DataSource
1     6157    ensembl
2    18815    ensembl
3    43723    ensembl
4    48196    ensembl
5    31755    ensembl
6    93981    ensembl

> tail(both)
      Distance DataSource
23037    42503    refgene
23038    26796    refgene
23039    34782    refgene
23040    18100    refgene
23041     6066    refgene
23042     7635    refgene
```
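
Before plotting, it can help to confirm how many rows each source contributed, since the whole point is that the counts differ. This is a small sketch on a toy subset (the first few distances printed above, not the full data), using `table` for the counts and `tapply` for a per-source summary.

```r
# Toy subset only: a handful of the distances shown above.
ensemblu <- data.frame(Distance = c(6157, 18815, 43723), DataSource = "ensembl")
refgeneu <- data.frame(Distance = c(71914, 259289), DataSource = "refgene")

# Same headers, so rbind stacks the rows.
both <- rbind(ensemblu, refgeneu)

# Rows per source: this is where differing sample sizes show up.
counts <- table(both$DataSource)
print(counts)

# A per-source summary, here the median distance.
print(tapply(both$Distance, both$DataSource, median))
```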

Step 4. Plot. You can very easily histogram the raw values, log-transform them, pass them through rnorm, etc.

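```
> library(ggplot2)

> # geom_histogram bins continuous values; in current ggplot2, geom_bar counts each distinct value instead
> ggplot(both, aes(Distance, fill=DataSource)) + geom_histogram(alpha=0.5)

> ggplot(both, aes(log(Distance), fill=DataSource)) + geom_histogram(alpha=0.5)

> # rnorm(Distance) does not transform the data: it draws one random standard-normal value per row
> ggplot(both, aes(rnorm(Distance), fill=DataSource)) + geom_histogram(alpha=0.5)
```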
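
Because the two sources have different numbers of points, raw counts make the smaller set look shorter even when the shapes match. One option is to plot densities, so each group is scaled to the same total area. This is a sketch with made-up log-normal data; the group sizes (2000 and 500) and the distribution parameters are assumptions for illustration, not values from the real files.

```r
library(ggplot2)

# Made-up data: two groups of very different sizes.
set.seed(42)
both <- rbind(
  data.frame(Distance = rlnorm(2000, meanlog = 10,   sdlog = 1), DataSource = "ensembl"),
  data.frame(Distance = rlnorm(500,  meanlog = 10.3, sdlog = 1), DataSource = "refgene")
)

# geom_density estimates each group's density separately, so the curves
# are comparable regardless of how many rows each group has.
p <- ggplot(both, aes(log10(Distance), fill = DataSource)) +
  geom_density(alpha = 0.5)
print(p)
```

If you prefer bars to smooth curves, `geom_histogram(alpha = 0.5, position = "identity")` overlays the two histograms instead of stacking them, though the heights are still raw counts.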

For more details, go here.

•  November 13, 2012