Jan 282013

When doing loops and renaming files it is very important to able to strip the end of the original filename off. Otherwise you quickly end up with long and awkward filenames like “ZF_rnaseq.bed.gz.processed.uniques.gz” for i in *.bed.gz; do zcat $i | cut -f1 | gzip -f  > ${i%.bed.gz}.cut.gz; done The ${variable%.end_of_filename_you_want_remove}.new_file_name syntax is easy to use and easy to forget.

Jan 042013

This isn’t elegant at all, but it is an easy and adaptable way to delete lots of files in a directory. First, get the list, using grep (if the file-names have a sort-of-pattern) and/or ls -lt (if the files are grouped by date). Next, copy the list (click-and-drag), open up vim (emacs), and paste the info to create a text file. If you were sorting with ls -lt you will have a lot more info than the file name. You can use awk ‘{print $whateverColumnNumber}’ file.txt > new_file.txt to create a text file with just the file names. Then you can simply use a loop in bash: while read p; […]

Nov 162012

I’m working with publicly available RNA-seq data-sets generated from ABI SOLiD 3.0 sequencing. It’s been a real nightmare working with this data. The reason(s) why this is so is better left for another time. I just got access to ABI’s LifeScope alignment software. It requires data to be in their new(ish) XSQ format. The data I have from the SRA is in the csfasta format, courtesy of their abi-dump script. I thought conversion would be straightforward – ABI does provide a script to do it as well as a converter (probably the same script) in the LifeScope GUI interface. But, of course, it wasn’t straightforward. I kept getting this error: [mcgaugheyd@gryphon raw_data]$ ~/XSQ_Tools/./convertToXSQ.sh […]

Nov 132012

I often have the desire to compare distributions with differing numbers of data points. This is fairly easy to do with R and ggplot2.   Step 1. Get data from first and second sources. > e = read.table(“ensembl_last_exon_distance.txt”,header=T) > r = read.table(“refgene_last_exon_distance.txt”,header=T) Step 2. Since the two data sources have different headers, I can’t use rbind immediately to combine them together. Also, I wouldn’t know what was what (since they’d be on the same column). So I need to, basically, copy the data over to a new data.frame with the same headers. Then I can add a second column to distinguish the two data types. > ensemblu = data.frame(Distance = (e$ensembl)) > […]

Nov 022012

R and loops don’t get along. It’s usually easier and faster to vectorize your problem.   The apply function allows you to perform calculations across an entire matrix, extremely rapidly. It’s also easier to write than a for loop.   I have some data tables I entered using read.table. > head(data) X3dpfE X3dpfS brain1 brain2 brain3 heart1 heart2 heart3 liver1 liver2 liver3 ENSDARG00000000001 1.562704 1.302656 3.386386 4.501326 3.770699 4.410918 3.119245 4.981105 4.543457 4.660851 4.166984 ENSDARG00000000002 1.868292 2.669400 3.774026 2.148060 3.342408 6.346923 5.704955 6.396921 3.177544 2.848264 3.031184 ENSDARG00000000018 5.537198 6.120828 6.227827 5.199024 6.745472 7.235136 4.266326 7.304808 6.080495 5.901095 6.149637 ENSDARG00000000019 8.307379 8.347306 7.602548 7.846188 7.367002 5.948792 6.926612 5.624044 7.266805 6.870143 7.184076 […]