I’m working with publicly available RNA-seq datasets generated from ABI SOLiD 3.0 sequencing. It’s been a real nightmare working with this data. The reasons why are better left for another time.
I just got access to ABI’s LifeScope alignment software. It requires data to be in their new(ish) XSQ format. The data I have from the SRA is in the csfasta format, courtesy of their abi-dump script.
I thought conversion would be straightforward – ABI provides a script to do it, as well as a converter (probably the same script) in the LifeScope GUI.
But, of course, it wasn’t straightforward. I kept getting this error:
[mcgaugheyd@gryphon raw_data]$ ~/XSQ_Tools/./convertToXSQ.sh --mode Fragment --libraryName na --laneNumber 1 --runStartTime '2012-11-15 10:10:10' --c1 test.csfasta --q1 test.qual -xsqfile test.xsq
[16Nov12 07:59:15,343][ERROR]- Invalid bead id: >mbt_solid3.0.1 1_19_223_F3
com.lifetechnologies.exception.ConverterException: Invalid bead id: >mbt_solid3.0.1 1_19_223_F3
    at com.lifetechnologies.util.AbstractXSQMapper.validateBeadId(AbstractXSQMapper.java:602)
    at com.lifetechnologies.util.AbstractXSQMapper.getNextBeadId(AbstractXSQMapper.java:135)
    at com.lifetechnologies.util.CsfastaXSQMapper.getMaxReadLength(CsfastaXSQMapper.java:311)
    at com.lifetechnologies.converter.SolidFragmentConverter.convertFile(SolidFragmentConverter.java:73)
    at com.lifetechnologies.XSQConverter.runXSQConverter(XSQConverter.java:48)
    at com.lifetechnologies.XSQConverter.main(XSQConverter.java:144)
My csfasta file looks something like this:
#
# Title: Solid_ZebrafishWTA_Zebrafish_WTA
#
>mbt_solid3.0.1 1_19_223_F3
T32223.10221000030202232102222210123302221222030222
>mbt_solid3.0.2 1_19_963_F3
T22132.12310012033102112223111333111221102011133131
>mbt_solid3.0.3 1_19_971_F3
T01221.30021010132130031131311211010102221022213121
>mbt_solid3.0.4 1_19_1057_F3
T10221.10031313310021320301032000012010333121330113
>mbt_solid3.0.5 1_19_1217_F3
T12000.13210313200001011313110002232022200010000000
>mbt_solid3.0.6 1_19_1360_F3
T31013.21202220213111212222011110112302110011103122
>mbt_solid3.0.7 1_19_1407_F3
T10010.10113211310110112213222231210312223211133223
>mbt_solid3.0.8 1_20_584_F3
T20210.31212311131112121120111200111101130211111111
>mbt_solid3.0.9 1_20_717_F3
T32121.01221020231223122301202111123131103010113313
I first thought the issue was with the space in the tag. So I wrote a script to remove it. Didn’t help. Then I tried simply replacing the tag with numbers counting up from 1. Didn’t help. Then I started changing all sorts of random options. Didn’t work. Tried using the LifeScope GUI. I started up the conversion 16 hours ago and it’s still at 0% complete. Great.
After some Googling I found this.
“The Tag_ID is a unique identifier for every tag, which consists of four components: panel_xpixel_ypixel_tagtype. For example, 1_567_321_F3 describes a bead in panel 1 at coordinates 567, 321 (X,Y) with the F3 tag (first tag in a mate pair, only tag in a fragment run).”
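To make that format concrete, here is a minimal Python sketch that checks whether a csfasta header is a bare Tag_ID. The regular expression and the `parse_bead_id` helper are my own assumptions based only on the quoted description (panel_xpixel_ypixel_tagtype); they are not LifeScope's actual validation logic, which may be stricter or looser.

```python
import re

# Assumed pattern, derived from the quoted description:
# panel_xpixel_ypixel_tagtype (e.g. 1_19_223_F3).
# This is a guess at the format, not the converter's real check.
BEAD_ID = re.compile(r"^(\d+)_(\d+)_(\d+)_([A-Za-z0-9-]+)$")


def parse_bead_id(header):
    """Return (panel, x, y, tag_type) for a bare bead id, else None."""
    # Drop the leading '>' and surrounding whitespace before matching.
    match = BEAD_ID.match(header.lstrip(">").strip())
    if match is None:
        return None
    panel, x, y, tag_type = match.groups()
    return int(panel), int(x), int(y), tag_type


if __name__ == "__main__":
    # A bare bead id parses; a header with an extra prefix does not.
    print(parse_bead_id(">1_19_223_F3"))
    print(parse_bead_id(">mbt_solid3.0.1 1_19_223_F3"))
```

Under this assumed pattern, any header carrying extra text in front of the four components fails to parse, which is consistent with the "Invalid bead id" error above.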
Aha! The issue is that the abi-dump script I used to convert the SRA file to csfasta prepends the file name to the tag, and the XSQ converter doesn’t account for that possibility. This little Python script removes the offending information.
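#!/usr/bin/env python
# Strip the prefix that abi-dump adds to each header, leaving only the bead id.
import sys

with open(sys.argv[1]) as infile:
    for line in infile:
        if line.startswith('>'):
            # ">mbt_solid3.0.1 1_19_223_F3" becomes ">1_19_223_F3"
            fields = line.split()
            print(">" + fields[1])
        else:
            # Comment and color-space lines pass through unchanged.
            print(line.rstrip('\n'))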
Now it works!
Update: 2013-01-28.
Never mind! These files fail when you try to import them into LifeScope (they do not seem to be recognized as XSQ files). The only way I could get these files to work properly was to convert them to the XSQ format using the LifeScope GUI.