Goby

Latest version: v2.0

1.9.7

- Now using protobuf 2.4.1. Please upgrade your local version of protobuf if you are recompiling from sources.
- AlignmentWriter now correctly records Goby version in header upon close(). This fixes a problem where alignments
read from read-only files would fail when an upgrade was attempted.
- Optimized the performance of VCFParser on files with a large number of columns. The VCF format seems designed
without performance in mind, so it is hard to come up with a reasonably fast implementation. The current
implementation of the Goby VCF parser can only process about 8,000 lines of compressed VCF per second on
a desktop machine.
- AlignmentEntry schema change: a new field sample_index holds the index of the alignment from which the
entry was read. This is useful when concatenating over multiple alignments and realigning reads that span
indels, to reliably track the alignment origin of each entry. The concatenation readers have been
modified to set sample_index accordingly. Please note that the activeIndex field of the sorted reader
is not a reliable way to identify the alignment of origin when realignment is active. Please use the
new sample_index field instead.
- We have added the capability to perform on-the-fly realignment around indels. This feature is available
in mode discover-sequence-variants and in concatenate-alignments. The feature is activated with the new
--processor realign_near_indels option. When the option is provided, a compressed reference genome must
also be given on the command line (with the --genome option). This will trigger realignment of reads in
regions where candidate indels are found by the aligner. The algorithm is very fast, in fact much faster
than previously described approaches and consumes a reasonable amount of memory (function of maximum
depth of coverage in the region where candidate indels are observed, but typically <2GB). Realignment
correctly removes artefactual SNPs that can be introduced when an aligner fails to align the read ends
properly through a read deletion. Please note that this version realigns read deletions. Realignment of
read insertions has not been implemented.
- Make it possible to open an alignment if the header file is present but the entries file is missing.
This allows reading the header only, for instance when we need to load counts and have access to targetIds.
- Add mode to convert annotations to counts archive format.
- Add new coverage mode to calculate coverage stats over annotation regions. When annotation regions are
defined with capture regions, this mode outputs enrichment efficiency and depth of coverage for
specific proportions of captured sites.
The mode uses just .header and .count files and traverses count transitions. The algorithm used to iterate
through count transitions is very efficient (for instance, it takes about 20 seconds to estimate coverage
stats for an alignment with ~20M aligned reads). Count files are produced with GobyWeb together with the
alignment or with the alignment-to-counts mode.
- Add CountBinningAdaptor, useful to bin counts on the fly at any resolution for display in IGV.
- Added ability to record total number of bases and sites seen in count archive.
- Added a new mode (file-to-attributes) to generate a sample attribute file suitable for loading in IGV.
Useful when files are named with the convention attr1-attr2-attr3.counts
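
The attr1-attr2-attr3.counts naming convention mentioned for file-to-attributes can be illustrated with a
short Python sketch. This is not the Goby implementation: the function names, the "sample" and
"attributeN" column labels, and the tab-delimited layout are assumptions made for illustration only, and
the real mode may produce different columns.

```python
import os


def filename_to_attributes(path, extension=".counts", separator="-"):
    """Split 'attr1-attr2-attr3.counts' into (sample_name, [attr1, attr2, attr3])."""
    name = os.path.basename(path)
    if name.endswith(extension):
        name = name[: -len(extension)]
    return name, name.split(separator)


def attribute_table(paths):
    """Return tab-delimited lines: one header row, then one row per file."""
    rows = [filename_to_attributes(p) for p in paths]
    width = max((len(attrs) for _, attrs in rows), default=0)
    # Column labels are placeholders (assumption), not names used by Goby.
    header = ["sample"] + [f"attribute{i + 1}" for i in range(width)]
    lines = ["\t".join(header)]
    for sample, attrs in rows:
        # Pad short rows so every line has the same number of columns.
        padded = attrs + [""] * (width - len(attrs))
        lines.append("\t".join([sample] + padded))
    return lines


if __name__ == "__main__":
    for line in attribute_table(["liver-treated-rep1.counts", "liver-control-rep1.counts"]):
        print(line)
```

Files with fewer attributes than the widest file are padded with empty cells so that the table stays rectangular.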

1.9.6.1

- Patched VCF output for compatibility with the VCF specification. Specifically, we now write . in the QUAL
field and write genotype as the first field in the methylation output format. Additionally, we only
write a VCF line if the site can be typed in at least one of the samples. These changes make Goby VCF
output compatible with the IGV 2.0 VCFTrack.
- Fix a bug in merge that could trigger an ArrayIndexOutOfBoundsException with some alignments.
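
The three VCF output rules listed in the first item of this release (a . placeholder in QUAL, genotype
first in the per-sample format, and no line when no sample can be typed) can be sketched in Python as
follows. This is an illustration, not Goby's VCFWriter: the function name, the MR (methylation rate) format
key, and the choice to leave ID, FILTER and INFO empty are assumptions.

```python
def format_vcf_line(chrom, pos, ref, alt, samples):
    """Format one VCF data line.

    samples is a list of (genotype, methylation_rate) tuples; genotype is
    None when the site could not be typed in that sample.
    Returns None when no sample could be typed, so the caller writes nothing.
    """
    if all(genotype is None for genotype, _ in samples):
        return None
    # Fixed columns: CHROM POS ID REF ALT QUAL FILTER INFO.
    # "." is the VCF missing-value marker, used here for ID, QUAL, FILTER, INFO.
    fixed = [chrom, str(pos), ".", ref, alt, ".", ".", "."]
    # GT comes first in FORMAT; "MR" is a hypothetical key for this sketch.
    format_column = "GT:MR"
    sample_columns = []
    for genotype, rate in samples:
        genotype_text = genotype if genotype is not None else "./."
        rate_text = "." if rate is None else f"{rate:.2f}"
        sample_columns.append(f"{genotype_text}:{rate_text}")
    return "\t".join(fixed + [format_column] + sample_columns)


if __name__ == "__main__":
    print(format_vcf_line("chr1", 100, "C", "T", [("0/1", 0.5), (None, None)]))
```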

1.9.6

- AlignmentReaderImpl now supports full random access to an alignment. Use reposition(ref,pos) followed
by skipTo(ref,pos) to obtain the first entry matching at location (ref,pos). Prior to 1.9.6, the
reposition method would not reposition to a location already visited, forcing clients to close the
alignment reader and reopen it (this new behaviour should improve performance in IGV).
- The indexing logic used in versions of Goby up to 1.9.5 (inclusive) had subtle flaws. This could cause
the skipTo method to behave incorrectly for some alignments. For instance, if reads matched on target N
at a position larger than the length of target N+1, these reads would not be returned by skipTo.
Thanks to Alec Chapman for identifying these issues.
We have corrected the problem and added additional unit tests to check the behavior of the implementation
in various edge cases. A consequence of this change is that the new indexing logic requires recalculating
the .index data structure for alignments sorted and indexed with a version of Goby prior to 1.9.6.
We provide a new mode, goby upgrade, to perform these calculations and fix such alignments. To upgrade
alignments off-line, simply do:
goby 3g upgrade [files].
This command will upgrade each alignment corresponding to the filenames provided. It skips those alignments
produced by versions of Goby that do not require upgrading. The upgrade process creates a backup of the
files that are affected: .index and .header are backed to .index.bak and .header.bak respectively.
The upgrade process is relatively fast: in our tests, we upgraded a 750Mb alignment file in 2 minutes 30 seconds.
- Version 1.9.6 will try to upgrade alignments on the fly to the new version of the index data structures.
- Detect when FastaToCompact is running in API mode versus command line. Do NOT do System.exit in API
mode and instead throw exceptions. Also, API mode doesn't run conversions in parallel but instead runs
them serially for easier exception catching.
- VCFParser now splits headers by tab instead of whitespace so column names that contain spaces
are read correctly.
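
The difference between splitting a header on whitespace and splitting it on tabs can be shown with a small
Python sketch. The function names and the sample header line are made up for illustration; this is not
the VCFParser code.

```python
def split_header_on_whitespace(line):
    """Old-style split: any run of whitespace separates columns."""
    return line.lstrip("#").split()


def split_header_on_tabs(line):
    """Tab-only split: spaces inside a column name are preserved."""
    return line.rstrip("\r\n").lstrip("#").split("\t")


if __name__ == "__main__":
    # Hypothetical header with two column names that contain a space.
    header = "#CHROM\tPOS\tsample A\tsample B\n"
    print(split_header_on_whitespace(header))
    print(split_header_on_tabs(header))
```

With the whitespace split, "sample A" is broken into two columns, so every following column is shifted; the tab split keeps one name per column.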

1.9.5

- Determine alignment sortedness and index state from the header and by checking that the index file exists.
This makes it possible to recover alignments when the index file was deleted. In such cases, the alignment can
be sorted again, which is preferable to losing the alignment data.
- New mode simulate-reads will generate reads artificially against a reference sequence. We use this mode
to create simulated datasets of bisulfite converted reads or mutated reads and to test that Goby produces
the expected results.
- Show phred scores in DisplaySequenceVariants (tab + base)
- Add a QualityEncoding.PHRED in case one just wants to transfer quality scores without changing quality scale
- Rewritten sam-to-compact mode that handles sequence variations better, handles bsmap sam files better,
and handles quality score conversions more flexibly. The old mode is still available as
sam-to-compact-old for comparison. The new mode has slightly different command line parameters.
- Added a discover-sequence-variants mode format 'methylation' to estimate methylation rates for RRBS and
Methyl-Seq alignments.
- Dramatically improved TMH loading times for large alignments.
- Completely removed support for queryLength in header. This usage was deprecated in Goby 1.7, complicates
the code unnecessarily and is error prone (because we had two ways to store read length in the previous
versions of Goby). Note that versions since 1.7 had a concat mode that transferred information from the
header to the alignment entries transparently. Use this mode from a pre 1.9.4 release if you need to
migrate a 1.6- alignment to work with Goby 1.9.5+.
- Fixed a bug where merge-compact-alignments would throw an ArrayIndexOutOfBounds because a TMH
query index was smaller than the first query index in the alignment.
- Changed discover-sequence-variants mode to filter out alignment entries whose read mapped to multiple locations in the
reference (as determined by the aligner argument, e.g., -n for gsnap).
- Made AlignmentReader an interface. The previous AlignmentReader class is now called AlignmentReaderImpl.
- ConcatSortedAlignmentReader and ConcatAlignmentReader now support a configurable AlignmentReaderFactory.
The factory makes it possible to plug in alignment readers that filter entries as they are read. The default
factory returns all reads. However, if the NonAmbiguousAlignmentReader factory is installed, the concatenate
reader returns only entries for which the read did not match other locations in the genome. Other filtering
behaviour can be implemented in a sub-class of AlignmentReader (see NonAmbiguousAlignmentReader for an example)
and a factory created to return instances of this class.
This mechanism is used to filter out entries whose reads match several locations on the reference sequence.
- Goby now includes a VCFParser class (see package edu.cornell.med.icb.goby.readers.vcf). VCF stands
for Variant Call Format. The VCF format is described at http://www.1000genomes.org/node/101.
The Goby VCFParser class implements a VCF 4.0+ parser. Importantly, this implementation can also be
used to parse plain TSV files, or VCF files that do not include the fixed VCF columns. It therefore supports
an extended version of the VCF format that is as generic as a TSV file, but can also provide meta-information
about the columns in the specific file. Another difference with VCF 4.0 is that we support the Group
attribute on column fields. This makes it possible to indicate that fields are part of the same group.
Such a feature can be used by user interfaces that would like to offer the ability to manipulate multiple
column fields as a group (for instance to hide or show an entire group of fields).
- FDR mode now supports VCF input files and outputs. See the option --vcf to activate processing of VCF formatted
files.
- Added a VCFWriter class to write files in the VCF4 format. This class is now used by discover-sequence-variants
when writing in genotypes format. This should make it possible to use vcf-tools on the genotype files produced.
- Fix logic for IterateSortedAlignments which, in turn, fixes sequence-variation-stats2. The issue primarily
dealt with insertions, deletions, and left and/or right padding.
- Fixed the logic for TAB_SINGLE_BASE in display-sequence-variation mode to report the correct
read_index and ref_position.
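
The reader-factory mechanism described above for the concatenating readers can be sketched in Python.
Goby itself is a Java library, and none of the names below are Goby classes: Entry, ListReader,
NonAmbiguousReader and concat are hypothetical stand-ins that only illustrate the pattern of passing a
factory that decides which kind of reader wraps each input alignment.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class Entry:
    query_index: int
    ambiguous: bool  # True when the read matched several reference locations


class ListReader:
    """Stand-in for a reader over one alignment; yields every entry."""

    def __init__(self, entries: Iterable[Entry]):
        self._entries = list(entries)

    def __iter__(self) -> Iterator[Entry]:
        return iter(self._entries)


class NonAmbiguousReader(ListReader):
    """Filtering reader: skips entries whose read matched several locations."""

    def __iter__(self) -> Iterator[Entry]:
        return (e for e in self._entries if not e.ambiguous)


# A factory is anything that builds a reader from the entries of one alignment.
ReaderFactory = Callable[[Iterable[Entry]], ListReader]


def concat(alignments: List[List[Entry]], factory: ReaderFactory = ListReader) -> List[Entry]:
    """Concatenate alignments, reading each one through a factory-made reader."""
    result: List[Entry] = []
    for alignment in alignments:
        result.extend(factory(alignment))
    return result


if __name__ == "__main__":
    first = [Entry(0, False), Entry(1, True)]
    second = [Entry(2, False)]
    print([e.query_index for e in concat([first, second])])
    print([e.query_index for e in concat([first, second], NonAmbiguousReader)])
```

The concatenation code never changes when a new filter is needed; only a new reader class and the factory passed in differ.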

1.9.4

- The C API (used by BWA, GSNAP) has been updated to more accurately write sequence variations (this version
fixes problems in reporting of the read index). We have created examples of how sequence variations are
encoded in Goby alignment files. These examples are available at http://tinyurl.com/goby-sequence-variations
- Mode concatenate-alignments now propagates names and versions of the aligners that contributed input alignments.
- Mode sort now propagates the name and version of the aligner that produced the alignment.
- Mode compact-file-stats now reports the name and version of the aligner that produced a Goby alignment file.
- Mode discover-sequence-variants has been extended to support multiple types of outputs (see --format flag).
One output format prints genotypes (--format genotypes), while another estimates the proportion of the
reference allele in each sample (--format allele_frequencies).
- Added a mechanism to support base filters in discover-sequence-variants. To activate these filters, you must provide
the --eval option with the "filter" option. Two filters are currently active when --eval filter is used: one
filters variant bases by quality score (keeping only bases with q-phred>=30) and another is a simple and efficient
strategy to remove bases that do not quite agree across all the observations.
Future versions will make it possible to customize the set of filters and their options.
- sequence-variation-stats2 now runs in parallel up to the available number of threads when multiple alignments are
given as input.
- display-sequence-variations and sequence-variation-stats modes: Fix problems in the logic to calculate
read-index for large insertions/deletions.
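
The quality-score base filter described for --eval filter (keeping only bases with a Phred score of at
least 30) can be sketched in Python. This is an illustration under assumptions: the representation of
observations as (base, phred) pairs and the function names are hypothetical, and the second filter
mentioned above is not reproduced because the notes do not specify its rule.

```python
def filter_by_quality(observations, min_phred=30):
    """Keep (base, phred) observations whose quality score is at least min_phred."""
    return [(base, phred) for base, phred in observations if phred >= min_phred]


def count_bases(observations):
    """Tally how many times each base was observed."""
    counts = {}
    for base, _ in observations:
        counts[base] = counts.get(base, 0) + 1
    return counts


if __name__ == "__main__":
    # Hypothetical observations at one reference position.
    observed = [("A", 40), ("T", 12), ("A", 30), ("G", 29)]
    kept = filter_by_quality(observed)
    print(count_bases(observed))
    print(count_bases(kept))
```

In this example the low-quality T and G observations are dropped before counting, so they cannot be reported as variant bases.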

1.9.3

- This release has a C API compatible with our development version of GSNAP. A version of GSNAP released
after 2011-03-11 should compile with Goby 1.9.3.
- Add new statistics for discover-sequence-variants. Notably, we now record the log odds ratio,
the estimated standard error of the log odds ratio, as well as a Z-score for the log odds.
Standard error and Z-score are only estimated if more than 10 counts exist in each cell of the contingency table.
Also added the proportion of the reference allele (refCount / (refCount+varCount)).
- Fix reformat-compact-reads bug where quality scores were longer by 1 than the sequence.
- Reduce the memory needed by compact-file-stats to determine the number of reads in a compact reads file.
- Changed how the number of reads in an alignment file is determined by compact-file-stats. We now report the number
stored in the alignment header.
- Change how log2 fold change is estimated. We used to estimate it as ((log2_rpkm_group_a+1)/ (log2_rpkm_group_b+1)).
This can cause problems when log2 RPKMs are negative in one group and positive in the other. We now add 1 to counts
before calculating RPKMs and taking the log. Similar changes were made to the fold-change. RPKM columns now return
the RPKM of (count+1).
- Mode reformat-compact-reads now takes an optional -f argument to filter reads. This option can be used to
remove redundant reads from a compact-reads file (see tally-reads mode to produce the read filter). It is no longer
necessary to do round-trips to fastq to remove redundant reads.
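
The discover-sequence-variants statistics listed in this release (log odds ratio, its standard error, a
Z-score, and the proportion of the reference allele) can be sketched in Python. The notes do not give
Goby's formulas, so this sketch assumes a 2x2 table of reference and variant counts in two groups and the
usual large-sample standard error, sqrt(1/a + 1/b + 1/c + 1/d); the function and key names are
hypothetical.

```python
import math


def log_odds_stats(ref_a, var_a, ref_b, var_b, min_count=10):
    """Log odds ratio, standard error and Z-score for a 2x2 count table.

    Standard error and Z-score are reported only when every cell holds
    more than min_count counts; otherwise they stay None.
    """
    cells = (ref_a, var_a, ref_b, var_b)
    result = {"log_odds_ratio": None, "standard_error": None, "z_score": None}
    if min(cells) > 0:  # the ratio is undefined when any cell is empty
        result["log_odds_ratio"] = math.log((ref_a * var_b) / (var_a * ref_b))
    if min(cells) > min_count:
        standard_error = math.sqrt(sum(1.0 / c for c in cells))
        result["standard_error"] = standard_error
        result["z_score"] = result["log_odds_ratio"] / standard_error
    return result


def reference_proportion(ref_count, var_count):
    """refCount / (refCount + varCount), or None when there are no counts."""
    total = ref_count + var_count
    return ref_count / total if total else None


if __name__ == "__main__":
    print(log_odds_stats(40, 20, 25, 35))
    print(reference_proportion(40, 20))
```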