- Extended fasta-to-compact and compact-to-fasta to handle paired-end runs. See the new
command line arguments --paired-end and pair-indicator arguments in fasta-to-compact and
the --pair-output argument in compact-to-fasta.
- Draft support for paired sequence runs. The compact file format is extended to store
sequence, sequence length and quality scores for the paired run. This extension makes
it possible to store both runs of a paired-end experiment in a single compact file,
which helps keep the data together.
- Implemented translation to and from the Solexa quality score encoding in fasta-to-compact
and compact-to-fasta. Thanks to Cock PJA et al., NAR 2010, for the clear description of the
Solexa base quality scores.
- The sort mode now supports reading only a slice of an input alignment (see options
--start-position and --end-position).
- Refactored CompactAlignmentToAnnotationCountsMode to use IterateAlignments (provides
large speedups when working with sorted/indexed alignments and selecting a subset of
reference sequences for differential expression (DE)).
- IterateAlignments now takes advantage of the skipTo method when the alignment is sorted
and indexed. This provides large performance improvements when one needs to access data
for only a few reference sequences in an alignment. All the modes that use
IterateAlignments benefit, including display-sequence-variations and
sequence-variation-stats.
- Alignments that are sorted are now indexed upon writing. The skipTo method leverages
the index to provide fast semi-random access to entries by genomic location. This feature
is used by the IGV Goby plugin, which requires Goby 1.7+.
- Concatenating alignments now produces a sorted alignment if all the input alignments
are sorted.
- Added a mode to sort an alignment by reference sequence and then by position
on the reference sequence.
- Support for estimating the read weights described in Hansen KD et al., NAR 2010.
See http://campagnelab.org/software/goby/tutorials/estimate-heptamer-weights/
In contrast to the initial publication, Goby supports using the weights to
reweight annotation counts and transcript counts.
- Support for estimating GC content weights for reads and for reweighting raw counts to
remove the dependence of counts on read GC content.
- Preliminary support for barcoded reads (barcodes in the sequence), see new
mode decode-barcodes (and tutorial online at
http://campagnelab.org/software/goby/tutorials/handling-barcoded-reads/).
- alignment-to-*-counts: The new --eval argument specifies which statistics
to evaluate when comparing samples.
- alignment-to-*-counts: The new eval option 'samples' writes a column per sample
for RPKM, log2(RPKM) and raw counts. RPKM and log2(RPKM) are written once per
combination of sample and global normalization method.
- Reduce memory requirements when concatenating many alignments. A change
introduced in 1.6 caused more memory than needed to be allocated for each
split of an alignment (as much as the number of reads in the file that
was split). Each split now uses only as much memory as needed to keep
query lengths for the split.
- Dramatically improved performance for differential expression tests with millions of
DE elements (e.g., exon+gene+other). Previously, the code incorrectly grew internal
arrays from zero to the number of new DE elements described in the annotation file.
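
The Solexa quality score translation listed above can be illustrated with the conversion
formulas described in Cock PJA et al., NAR 2010. This is a minimal sketch in Python, not
the Goby implementation: the function names are hypothetical, and the clamping of low
scores to -5 is an assumed convention for the lowest Solexa score.

```python
import math

# Assumed convention: the lowest Solexa score used in practice is -5.
MIN_SOLEXA = -5.0


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa score to a Phred score.

    Q_phred = 10 * log10(10^(Q_solexa / 10) + 1)
    """
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)


def phred_to_solexa(q_phred: float) -> float:
    """Convert a Phred score to a Solexa score.

    Q_solexa = 10 * log10(10^(Q_phred / 10) - 1)
    The formula is undefined at Q_phred = 0, so that value (and any result
    below MIN_SOLEXA) is mapped to MIN_SOLEXA.
    """
    if q_phred <= 0:
        return MIN_SOLEXA
    return max(MIN_SOLEXA, 10.0 * math.log10(10.0 ** (q_phred / 10.0) - 1.0))
```

The two scales differ mostly for low-quality bases and converge for high scores, so a
round trip through both functions recovers the original value for scores well above zero.
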
Changes that impact the compact alignment format:
- The compact file format is extended to store sequence, sequence length and quality scores
for the paired run. This extension makes it possible to store both runs of a paired-end
experiment in a single compact file, which helps keep the data together.
- Moved query lengths from header to alignment entries. This scales much
better when processing large alignment files (generated from more than
a few hundred million reads).
- The optional 'sorted' attribute in the header indicates whether an alignment has been sorted.
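
To illustrate how a 'sorted' flag can interact with concatenation, here is a minimal
sketch in Python under stated assumptions. The Entry type, its field names and the
concatenate function are hypothetical and do not reflect Goby's classes or file format.
The sketch uses the sort order described in this release (reference sequence, then
position) and a k-way merge, which is one possible way to keep the output sorted when
every input is sorted.

```python
import heapq
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    """Hypothetical alignment entry; field names are illustrative only."""
    reference_index: int
    position: int
    query_index: int


def sort_key(entry: Entry) -> Tuple[int, int]:
    # Order by reference sequence first, then by position on that reference.
    return (entry.reference_index, entry.position)


def is_sorted(entries: List[Entry]) -> bool:
    return all(sort_key(a) <= sort_key(b) for a, b in zip(entries, entries[1:]))


def concatenate(alignments: List[List[Entry]]) -> Tuple[List[Entry], bool]:
    """Return the concatenated entries and the value of the 'sorted' flag."""
    if all(is_sorted(a) for a in alignments):
        # Every input is sorted: merge them so the output stays sorted.
        return list(heapq.merge(*alignments, key=sort_key)), True
    # At least one input is unsorted: append inputs in order, flag is False.
    return [e for a in alignments for e in a], False


a = [Entry(0, 5, 1), Entry(1, 2, 2)]
b = [Entry(0, 7, 3), Entry(2, 1, 4)]
merged, sorted_flag = concatenate([a, b])
```

A reader of the output only needs to check the flag to decide whether position-based
access is possible, instead of scanning the entries.
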