Goby

Latest version: v2.0

1.9.8.2

- Make it possible to activate indel calling without recompilation. Mode discover-sequence-variants now accepts
the boolean argument --call-indels true/false.
- Preliminary support for calling indels with discover-sequence-variants. Candidate indels are now written
in the formats that use GenotypeOutputFormat (e.g., genotypes, compare_groups, allele_frequency).
The method of Krawitz et al. is used to determine the equivalent indel region (EIR) for each possible candidate.
After possible realignment, and filtering to remove possible errors, EIRs are reported with their frequencies.
Please be advised that the VCF specifications are rather vague and, as a result, are often interpreted differently
by different programmers. This is especially true of the parts that describe how to report indels. As a result,
you might run into problems when trying to load indel-containing VCF files generated with Goby into other
tools.
- vcf-subset: Add ability to exclude positions at which all samples match the reference.
- Add a replacement for the VCF-tools VCF-subset program. The Goby tool is orders of magnitude faster.
- Improve vcf-compare mode. It can now provide a random sample of the positions that differ between the
files being compared. Random samples are calculated for each kind of difference (missing from one file, missing
one allele, two alleles, different genotypes).
- vcf-compare now outputs Ti/Tv ratios for each sample in input file (in the output file only).
- Fix scalability problem with local realignment code. Local realignment around indels would slow down as more entries
were processed. This is now fixed so that speed is constant across large alignments.
- Fixed index file writing. In some conditions, parts of the alignment past the 2GB mark were not accessible
with skipTo when reading files larger than 2GB. Use the upgrade mode to fix old alignments at a specific time, or
use Goby as usual to have alignments upgraded on the fly.
- Add mechanism to upgrade/fix large alignment indices with Goby 1.9.8.2. The upgrade mechanism uses concatenate
alignment to rewrite an alignment index file if the size of the entries file exceeds 2GB. This is rather slow as
the process reads and writes large alignments, to produce the new index file. While slow, upgrading is still faster
than aligning the reads again. The process also requires approximately double the alignment size as the new alignment
files are written. Alignments smaller than 2GB are quietly ignored since they were not affected by the bug.
- Codecs: Add support to decode alignments with a codec in AlignmentReader.
- Improved ReadsReader to find a suitable decoder when several codecs exist.
- Prevents local realignment from running out of memory when processing positions where clonal reads create huge peaks.
- Make filterIndels remove from sample count info object, not just from list of bases.
- Fix VCF genotypes that could look like 0/0/1/1 to be 0/1 (seen with indels only).
- Only write allele base count in VCF BC field when the count is not zero (useful with indels).
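
The Ti/Tv ratio reported by vcf-compare can be illustrated with a short sketch. This is not Goby's code: the
function names and the input shape (a list of single-base reference/alternate pairs for one sample) are
assumptions made for illustration. It relies only on the standard definition: a transition swaps two purines
(A, G) or two pyrimidines (C, T); any other single-base substitution is a transversion.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True when both bases are purines or both are pyrimidines."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(substitutions):
    """Return transitions / transversions for (ref, alt) pairs of one sample.

    Pairs that are not single-base substitutions (indels, identical bases,
    unknown symbols) are skipped. Returns None when no transversion is seen.
    """
    ti = tv = 0
    for ref, alt in substitutions:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else None
```

For example, `ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")])` counts two transitions and one transversion.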

1.9.8.1

- Discover-sequence-variants: add ability to describe zero, one or more group comparisons. Syntax is A/B,A/C to compare
group A to B and group A to C. Additional pairs can be described, separated by commas.
- Extend methyl-stats mode to estimate fraction of methylated cytosine observed in CpX contexts.
- Discover-sequence-variants, genotype format: Fix a bug where alleleSet was cleared in each sample, rather than before
any sample is processed. This made it possible for some positions to be ignored erroneously when samples were given
in a specific order on the command line. Specifically, positions would be ignored if they were not typed (i.e., not
enough good bases) in the last sample given on the command line.
- Optimize merging of TMH when the files are large (>100M compressed).
- Fixed a major bug where NonAmbiguousAlignmentReader would stop iterating after encountering an ambiguous alignment.
Alignments with shorter reads were much more likely to be affected.
- Fix sam-extract-reads for paired-end BAM files. Each BAM file contains both pairs. To convert to compact reads, the
input BAM file must be sorted by read name, since this is the only way we can put the pairs back together in one
Goby record.
- Mode discover-sequence-variants now limits the maximum coverage per site in order to limit the impact on peak memory
of a few very high coverage sites. The default setting is set to 500,000x and can be changed with
option --max-coverage-per-site
- Switched IndexedIdentifier to an AVLTreeMap to help scale when we have millions of elements to compare in diff exp.
- Fixed a subtle bug in IterateSortedAlignment that would cause iteration to return partial results for some alignments
when restricting results to a window. The problem would manifest more clearly for alignments against genomes where
contigs have smaller indices than chromosomes and chromosome sequences are listed in non-increasing order (e.g., chr
16 appearing before chr 10) and restricting to window from chr16 to MT (which should include chr 10 in that genome,
but returned no result on chr 10).
- Trim mode: Fix exception that could occur when trimming reads with no quality scores.
- Change goby script to request the bash shell explicitly. This is needed on systems where /bin/sh is not a synonym for
bash. Thanks to Martin Frith for catching this on Ubuntu.
- Change how targetLengths are concatenated. It turns out that last-to-compact needs alignment entries matching
the target to record the length in the alignment. We need to keep any length seen when we concat because the first
chunk may just not have the length for the remaining parts.
- Improved logic for --paired-end filename support in the fastaToCompactMode.
- Fix a NPE in suggest-position-slices that could occur with very small alignment files.
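
The group comparison syntax described above (A/B,A/C) can be parsed as follows. This is a minimal sketch, not
Goby's parser: the function name and the error handling are assumptions. An empty string yields zero
comparisons, matching the "zero, one or more" wording.

```python
def parse_group_comparisons(spec):
    """Parse 'A/B,A/C' into [('A', 'B'), ('A', 'C')]."""
    pairs = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty spec or stray commas
        parts = [part.strip() for part in item.split("/")]
        if len(parts) != 2 or not all(parts):
            raise ValueError(f"expected GROUP1/GROUP2, got {item!r}")
        pairs.append((parts[0], parts[1]))
    return pairs
```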

1.9.8

- The BaseStats utility was transformed into a Goby mode (base-stats). The new mode has the ability to tally occurrence
of CpX motifs in reads. Useful as a proxy to the amount of unconverted Cs in bisulfite converted reads.
- The methyl-stats mode takes a VCF file produced by Goby methylation output and a genome and calculates various
statistics about the distribution of fragment lengths between CpG interrogated by the assay.
- FDR mode now accepts --column-selection-filter to select columns matching string.
- Proof of principle that protocol buffer can seamlessly cohabit with data-specific compression schemes. The
--codec option on fasta-to-compact is introduced to activate compression of reads when writing compact reads.
The codec provided (called read-codec-1) achieves about 10-12% better compression of read files than pure
protocol-buffer encoding. This read-codec-1 codec stores bases and quality scores with an arithmetic coder in
a protocol buffer field called 'compressed_data'. Please note that we do not recommend using this option at
this stage since the C/C++ APIs cannot load data encoded with this codec at this time.
- Add ability to run alignment-to-annotation-counts on a specific genomic region (see --start-position and
--end-position).
- alignment-to-annotation mode has a new option (--remove-shared-segments). When active, this option will remove
annotation segments when they partially overlap with more than one primary annotation id. When this option is
selected and the primary id is a gene, and secondary id is an exon, the mode will remove exons that are associated
with several genes. When the option is used with transcript id as primary and exon as secondary, exons are removed
that are shared across different transcripts of the same gene.
- mode base-stats now supports multiple input files.
- VCFParser will now set column type when reading TSV files by using TabToColumnInfoMode to scan the actual values
stored in the TSV file. The first time this is done for each file, a .colinfo file will be created and then
used if the file is read again by VCFParser in the future.
- Added the mode tab-to-column-info to read the data from TSV files to determine the column types
(double/integer/string). It writes a .colinfo file detailing the column names and types.
- Upgraded to SAM JDK 1.52
- Modes sam-to-compact and sam-extract-reads now set SILENT validation before reading file header. This is required
because the SAM JDK validation rules are more stringent than required by the specification. This means that
some valid SAM files (per the SAM spec) cannot be parsed without error when the strict validation is used.
- Fixed a bug with ReadsQualityStatsMode when SampleFraction == 1.0d, such as for files with a small
number of reads.
- Mode sam-extract-reads now supports extracting reads from paired samples. See the new options --paired-end
and --pair-indicator. These options work similarly to the fasta-to-compact options.
- Fix problem with suggest-position-slices that could create empty slices.
- Fix bug in discover-sequence-variants methylation format that wrote methylation rates only for up to two samples.
- Fix bug in alignment-to-counts that caused problems with large alignments.
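
The column typing performed by tab-to-column-info (double/integer/string) can be sketched as a scan over the
values of each column. This is an illustration under stated assumptions, not Goby's implementation: the
function names are hypothetical, the input is assumed to be rectangular with a header row, and the result is
returned as a dictionary rather than written in the .colinfo format, which these notes do not describe.

```python
import csv
import io


def infer_type(values):
    """Return 'integer', 'double' or 'string' for the values of one column."""
    def all_parse(parser):
        try:
            for value in values:
                parser(value)
        except ValueError:
            return False
        return True

    if not values:
        return "string"  # assumption: a column with no data stays a string
    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "double"
    return "string"


def column_info(tsv_text):
    """Map each header name to the narrowest type that fits all its values."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    header, data = rows[0], rows[1:]
    return {
        name: infer_type([row[i] for row in data])
        for i, name in enumerate(header)
    }
```

A column is typed as integer only if every value parses as an integer; a single non-numeric value makes the
whole column a string.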

1.9.7.3

- Fix allele frequency format to write genotype first in FORMAT per vcf spec.
- Add new INFO fields in compare group vcf format to show allele counts in each group.
- Add support for short versions of mode names. For example, "compact-file-stats" has the short mode
name "cfs". There is a default short mode name generation implementation in
AbstractCommandLineMode.getShortModeName() but each mode class can override this method in the case
of short mode name collisions. In the case of a collision, the command line parser will not offer/accept
ANY short mode name for the classes in question.
- SamToCompact: Generate sorted goby alignments when a sorted BAM file is provided as input (use --sorted
flag to activate this option). Thanks to Bradford Powell for the suggestion and draft implementation.
- Fixed a bug in tally-reads that was triggered by reads of different lengths. Thanks to Adrian Platts for
the bug report.
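
The short mode name behavior can be sketched as follows. The only example given in these notes is
"compact-file-stats" becoming "cfs", so the rule used here (first letter of each hyphen-separated word) is an
assumption about the default in AbstractCommandLineMode.getShortModeName(), and the function names are
hypothetical. The collision rule follows the description above: a short name shared by several modes is not
offered for any of them.

```python
from collections import defaultdict


def short_name(mode_name):
    """Assumed default: first letter of each hyphen-separated word."""
    return "".join(word[0] for word in mode_name.split("-") if word)


def build_short_name_table(mode_names):
    """Map short name -> mode name, dropping every short name that collides."""
    by_short = defaultdict(list)
    for name in mode_names:
        by_short[short_name(name)].append(name)
    return {
        short: names[0]
        for short, names in by_short.items()
        if len(names) == 1
    }
```

With the modes "tally-reads" and a hypothetical "tally-runs", both map to "tr", so neither gets a short name,
while "compact-file-stats" keeps "cfs".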

1.9.7.2

- Fix realignment around indels bug that prevented reads from being realigned to the left in exome data.
Now correctly updates the start position of the moving window.
- Renamed AlignmentEntry.splicedAlignmentLink to AlignmentEntry.splicedForwardAlignmentLink and added
AlignmentEntry.splicedBackwardAlignmentLink so splice links can be both bidirectional and more than
two segments long. This change is included in the C/C++ APIs and makes it possible for GSNAP to write
splice information to Goby alignment files.
- FDR mode now supports reporting the top n hits irrespective of corrected q-value threshold (top n hits are
defined by the ranking produced by ordering the hits by increasing p-value, for the last column adjusted).
- Significantly reduced memory consumption when performing FDR BH adjustment on hundreds of millions of elements.
- VCFWriter now writes missing value '.' in ID, ALT and FILTER fields, as required by VCF 4.1 documentation
(http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41)
This change is required to read the files generated by Goby with the latest version of Tribble used in IGV EA.
- AlignmentToTextMode will now display splice information.
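
The FDR BH adjustment and the top n option can be sketched as follows. The adjustment is the standard
Benjamini-Hochberg step-up procedure; the function names are hypothetical, and the choice to report the union
of the hits passing the q-value threshold and the top n hits by increasing p-value is an assumption about how
the two options combine, not a description of Goby's code.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum of
    # p * m / rank over this rank and all higher ranks (keeps q monotone).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


def select_hits(p_values, q_threshold, top_n=0):
    """Indices passing the q-value threshold, plus the top_n smallest p-values."""
    q_values = benjamini_hochberg(p_values)
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    keep = {i for i in order if q_values[i] <= q_threshold}
    keep.update(order[:top_n])  # reported irrespective of the threshold
    return sorted(keep)
```

With `top_n` left at 0 only the threshold applies; a positive `top_n` guarantees that at least that many of
the best-ranked hits are reported.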

1.9.7.1

- alignment-to-counts now generates indexed base-level histogram files. Indexing makes it possible to jump quickly
to a new genomic location in IGV. This is especially useful when viewing coverage for tens of tracks.
- Filter out ambiguous reads from alignment-to-counts base level histogram output. Pre-1.9.7.1 behaviour can be
obtained by setting the argument --filter-ambiguous-reads to false.
- alignment-to-counts: also tried a new way to create base-level histograms from sorted alignment files.
This turns out to be about 3 times slower than the current approach. We still keep the new approach because it
should scale to any size alignment. Mode alignment-to-counts will use the new approach if an alignment is sorted
and has more than 50 million aligned reads.
- Filter out ambiguous reads from alignment-to-annotation-counts by default. Pre-1.9.7.1 behaviour can be obtained
by setting the argument --filter-ambiguous-reads to false.
- Add ability to switch off the recording of sampleIndex. This is useful when concat is just used to put pieces
of a large alignment back together after splitting reads for parallel processing.
- Do not print indices at the end of upgrade. This caused upgrade to fail on some alignments with an exception.
- Extended IterateAlignments to create alignment reader with a configurable AlignmentReaderFactory.
- Set the default normalization method for alignment-to-annotation-counts to bullard normalization only.
- Fix a bug in VCFParser that affected parsing tab delimited files. Some files would be parsed with a tab in the
value of the last column, separating the values of the last two actual columns.
