chewBBACA

Latest version: v3.3.5

3.2.0

- New version of the SchemaEvaluator module. The updated version fixes several issues related to outdated dependencies that were leading to errors in the previous version. The new version also includes new features and components. Read the [docs page](https://chewbbaca.readthedocs.io/en/latest/user/modules/SchemaEvaluator.html) to learn more about the latest version of the SchemaEvaluator module.
- Updated the link to the UniProt FTP used by the UniprotFinder module.
- Added the `.fas` file extension to the list of file extensions accepted by chewBBACA. chewBBACA accepts genome assemblies and external schemas with FASTA files that use any of the following file extensions: `.fasta`, `.fna`, `.ffn`, `.fa` and `.fas`. The FASTA files created by chewBBACA use the `.fasta` extension.
- Fixed issue in the PrepExternalSchema module where it would only detect FASTA files if they ended with the `.fasta` extension.
- Added the `--size-filter` parameter to the PrepExternalSchema module to define whether the adaptation process should filter out alleles based on the minimum length and size threshold values.
- Added the `--output-novel` parameter to the AlleleCall module. If this parameter is used, the AlleleCall module creates a FASTA file with the novel alleles inferred during the allele calling. This file is created even if the `--no-inferred` parameter is used and the novel alleles are not added to the schema.
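
The extension handling described above can be illustrated with a minimal sketch. The function names `is_fasta_path` and `list_fasta_files` are hypothetical and are not part of chewBBACA; only the list of extensions comes from the notes above, and the case-insensitive comparison is an assumption.

```python
from pathlib import Path

# Extensions listed in the 3.2.0 notes as accepted by chewBBACA.
FASTA_EXTENSIONS = (".fasta", ".fna", ".ffn", ".fa", ".fas")


def is_fasta_path(path):
    """Return True if the file name ends with one of the accepted extensions."""
    # Assumption of this sketch: the comparison ignores letter case.
    return Path(path).suffix.lower() in FASTA_EXTENSIONS


def list_fasta_files(directory):
    """Return the sorted paths of the FASTA files directly inside `directory`."""
    return sorted(
        str(entry)
        for entry in Path(directory).iterdir()
        if entry.is_file() and is_fasta_path(entry)
    )
```

Checking against the full tuple of extensions, rather than only `.fasta`, is the kind of change that would avoid the detection issue mentioned for PrepExternalSchema.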

3.1.2

- Fixed issue related to the sequence header format when FASTA files with coding sequences (CDSs) were provided to the CreateSchema and AlleleCall modules through the `--cds-input` parameter. chewBBACA expected the sequence headers to have the same format used to save the CDSs extracted during the gene prediction step, `<genomeID>-protein<cdsID>`. This would lead to errors if the FASTA files with CDSs provided by users had a different sequence header format. To fix this issue, chewBBACA determines the unique genome identifier for each input/genome and renames the sequence headers to match the `<genomeID>-protein<cdsID>` format (e.g., given a file named `GCF_000007125.1_ASM712v1_cds_from_genomic.fna`, the unique genome identifier is `GCF_000007125`, and the sequence headers are renamed to `GCF_000007125-protein1`, `GCF_000007125-protein2`, ..., `GCF_000007125-proteinN`). FASTA files with renamed headers are stored in the temporary directory created by chewBBACA. The sequence headers are renamed sequentially, so the integer after `protein` indicates the order of the sequences in the original FASTA file.
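
The renaming scheme can be sketched as follows. This is an illustration only, not chewBBACA's code: the helper names are hypothetical, and taking the file name up to the first dot as the genome identifier is an assumption that happens to match the example given in the note.

```python
import os


def genome_identifier(file_path):
    """Derive a genome identifier from a FASTA file name.

    Assumption of this sketch: the identifier is the part of the
    file name that precedes the first dot.
    """
    return os.path.basename(file_path).split(".")[0]


def rename_headers(records, genome_id):
    """Rename headers to the `<genomeID>-protein<cdsID>` format.

    `records` is a list of (header, sequence) tuples in file order.
    The integer after `protein` is the 1-based position of the
    sequence in the original file.
    """
    return [
        (f"{genome_id}-protein{index}", sequence)
        for index, (_, sequence) in enumerate(records, start=1)
    ]
```

For the file named in the note, `genome_identifier` returns `GCF_000007125`, and the first two records are renamed to `GCF_000007125-protein1` and `GCF_000007125-protein2`.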

3.1.1

- Fixed issue in the ExtractCgMLST module. The cgMLST matrix only contained 1's and 0's because it was a subset of the presence-absence matrix. The module now uses the list of loci in the core genome to subset the masked matrix with the allele identifiers.
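
A small sketch may help show why the choice of source matrix matters. The data layout (nested dictionaries) and the function names are assumptions made for illustration, as is the convention that `0` marks a locus without an allele identifier in the masked matrix.

```python
def to_presence_absence(masked_matrix):
    """Convert allele identifiers to 1 (present) or 0 (absent)."""
    return {
        genome: {locus: int(allele != 0) for locus, allele in row.items()}
        for genome, row in masked_matrix.items()
    }


def subset_matrix(matrix, loci):
    """Keep only the columns for the given loci, in the given order."""
    return {
        genome: {locus: row[locus] for locus in loci}
        for genome, row in matrix.items()
    }


# Hypothetical masked matrix: values are allele identifiers, 0 means absent.
masked = {
    "genome1": {"locusA": 3, "locusB": 12, "locusC": 0},
    "genome2": {"locusA": 7, "locusB": 12, "locusC": 5},
}
core_loci = ["locusA", "locusB"]

# Subsetting the presence-absence matrix yields only 1's and 0's.
only_flags = subset_matrix(to_presence_absence(masked), core_loci)
# Subsetting the masked matrix keeps the allele identifiers.
cgmlst = subset_matrix(masked, core_loci)
```

Here `only_flags` holds `1` for every retained cell, while `cgmlst` keeps identifiers such as `3` and `12`, which is the behavior the fix describes.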

3.1.0

- Updated the ExtractCgMLST module ([PR](https://github.com/B-UMMI/chewBBACA/pull/153)). It can now accept several loci presence threshold values and creates an HTML file with a line plot that displays the number of loci in the cgMLST per threshold value (default: compute for 0.95, 0.99 and 1 thresholds).
- Removed the TestGenomeQuality module (superseded by the new functionalities implemented in the ExtractCgMLST module).
- Bugfix for the determination of the BLASTp raw scores for the representative alleles ([PR](https://github.com/B-UMMI/chewBBACA/pull/156)).
- Changed the value passed to the BLASTp `-max_target_seqs` parameter during representative determination from 20 to the default value used by BLASTp (500). This leads to a slight improvement in classification accuracy.
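
The per-threshold loci counts plotted by ExtractCgMLST can be approximated with a short sketch. It assumes that a locus belongs to the cgMLST for a given threshold when the fraction of genomes that contain it is greater than or equal to that threshold; the function name and the data layout are hypothetical.

```python
def loci_per_threshold(presence_absence, thresholds=(0.95, 0.99, 1.0)):
    """Count the loci whose presence fraction reaches each threshold.

    `presence_absence` maps genome identifiers to dictionaries that
    map locus identifiers to 1 (present) or 0 (absent).
    """
    rows = list(presence_absence.values())
    total_genomes = len(rows)
    loci = list(rows[0])
    counts = {}
    for threshold in thresholds:
        counts[threshold] = sum(
            1
            for locus in loci
            if sum(row[locus] for row in rows) / total_genomes >= threshold
        )
    return counts


# Hypothetical input: locusB is missing from one of four genomes.
matrix = {
    "g1": {"locusA": 1, "locusB": 1, "locusC": 1},
    "g2": {"locusA": 1, "locusB": 1, "locusC": 1},
    "g3": {"locusA": 1, "locusB": 0, "locusC": 1},
    "g4": {"locusA": 1, "locusB": 1, "locusC": 1},
}
counts = loci_per_threshold(matrix, thresholds=(0.75, 1.0))
```

With this input, three loci pass the 0.75 threshold and two pass the 1.0 threshold.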

3.0.0

New implementation of the **AlleleCall** process. The new implementation was developed to reduce execution time, improve accuracy and provide more detailed results. It uses available computational resources more efficiently to allow for analyses with thousands of strains on a laptop. This new version is fully compatible with schemas created with previous versions.

AlleleCall changes

- The new implementation avoids redundant comparisons through the identification of the set of distinct CDSs in the input files. The classification for a distinct CDS is propagated to classify all input genomes that contain the CDS.
- Implemented a clustering step based on minimizers to cluster the translated CDSs. This step complements the alignment-based strategy with BLASTp to increase computational efficiency and classification accuracy.
- The AlleleCall process has 4 execution modes (1: only exact matches at DNA level; 2: exact matches at DNA and Protein level; 3: exact matches and minimizer-based clustering to find similar alleles with BSR > 0.7; 4: runs the full process to find exact matches and all matches with BSR >= 0.6).
- Files with information about loci length modes (`loci_modes`) and the self-alignment raw score for the representative alleles (`short/self_scores`) are pre-computed and automatically updated (the process no longer creates and updates a file with the self-alignment raw score per locus).
- The process creates the `pre_computed` folder to store files with hash tables that are used to speed up exact matching and avoid running the step to translate the schema alleles in every run.
- Added the `--cds` parameter to accept FASTA files with CDSs (one FASTA file per genome) and skip gene prediction with Prodigal.
- Users can control the addition of novel alleles to the schema with the `--no-inferred` parameter.
- Added the `--output-unclassified` parameter to write a FASTA file (`unclassified_sequences.fasta`) with the distinct CDSs that were not classified in a run.
- Added the `--output-missing` parameter to write a FASTA file (`missing_classes.fasta`) and a TSV file with information about the classified sequences that led to a locus being classified as ASM, ALM, PLOT3, PLOT5, LOTSC, NIPH, NIPHEM and PAMA.
- Added the `--no-cleanup` parameter to keep the temporary folder with intermediate files created during a run.
- Removed the `--contained`, `--force-reset`, `--store-profiles` (to be reimplemented in a future release), `--json` and `--verbose` parameters.
- The `--force-continue` parameter no longer allows users to continue a run that was interrupted. This parameter is now used to ignore warnings and prompts about missing configuration files and the usage of multiple argument values per parameter.
- The allelic profiles in the `results_alleles.tsv` file can be hashed by providing the `--hash-profiles` parameter and a valid hash type as argument (hash algorithms available from the [hashlib](https://docs.python.org/3/library/hashlib.html) library and crc32 and adler32 from the [zlib](https://docs.python.org/3/library/zlib.html) library).
- The process creates a TSV file, `cds_coordinates.tsv`, with the genomic coordinates for all CDSs identified in the input files.
- The process creates a TSV file, `loci_summary_stats.tsv`, with summary statistics for loci classifications.
- The process no longer creates the `RepeatedLoci.txt` file. It now creates the `paralogous_counts.tsv` and `paralogous_loci.tsv` files with more detailed information about the loci identified as paralogous.
- The PLNF class is attributed in modes 1, 2 and 3 to indicate that a more thorough analysis might have found a match for the loci that were not found (LNF).
- CDSs that match several loci are classified as PAMA.
- Bugfix for PLOT3, PLOT5 and LOTSC classification types. The LOTSC classification was not always attributed when a contig was smaller than the matched representative allele, and some PLOT5 cases were classified as LOTSC. LOTSC cases were counted as exact matches in the `results_statistics.tsv` file.
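
Two of the items above lend themselves to a short sketch: classifying each distinct CDS once and propagating the result to every genome that contains it, and hashing with the algorithm families named for `--hash-profiles`. The function names are hypothetical, and hashing the allele DNA sequence is an assumption about what is hashed; only the `hashlib` and `zlib` calls are real library APIs.

```python
import hashlib
import zlib


def group_distinct_cdss(genome_cdss):
    """Map each distinct CDS sequence to the genomes that contain it.

    `genome_cdss` maps genome identifiers to lists of CDS sequences.
    """
    distinct = {}
    for genome, sequences in genome_cdss.items():
        for sequence in sequences:
            distinct.setdefault(sequence, []).append(genome)
    return distinct


def propagate(distinct, classify):
    """Classify each distinct CDS once and copy the result to its genomes."""
    results = {}
    for sequence, genomes in distinct.items():
        label = classify(sequence)  # called once per distinct CDS
        for genome in genomes:
            results.setdefault(genome, []).append((sequence, label))
    return results


def hash_allele(sequence, hash_type):
    """Hash a sequence with crc32, adler32 or a hashlib algorithm.

    Variable-length hashlib algorithms (the shake family) are not
    handled by this sketch.
    """
    data = sequence.encode()
    if hash_type == "crc32":
        return zlib.crc32(data)
    if hash_type == "adler32":
        return zlib.adler32(data)
    return hashlib.new(hash_type, data).hexdigest()
```

Grouping by sequence means the cost of classification scales with the number of distinct CDSs rather than with the total number of CDSs across all input genomes.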

Additional changes

- The UniprotFinder module allows users to search for annotations through UniProt's SPARQL endpoint, through matches against UniProt's reference proteomes, or both.
- Bugfix for an issue in the UniprotFinder module that was leading to errors when the data returned by UniProt's SPARQL endpoint only contained one set of annotation terms.
- Bugfix for an issue in the UniprotFinder module that was preventing the annotations from being written to the output file.
- Bugfix for an issue in the [map_async_parallelizer](https://github.com/B-UMMI/chewBBACA/blob/d7572c085677319500546dbb4ed8eee69cc3d2c2/CHEWBBACA/utils/multiprocessing_operations.py#L51) function that led to high memory usage.
- Implemented and changed several functions in the modules included in the `utils` folder to optimize code reusability, reduce runtime and peak memory usage, especially for large schemas and datasets (these changes affect mostly the CreateSchema and AlleleCall modules).
- Updated function docstrings and added comments.

2.8.5

- Updated **JoinProfiles** process to accept any number of files to join. Added `--common` flag to identify and join results for the set of loci common to all input files.
- Added line breaks at the end of file for **AlleleCall** outputs.
- Improved code readability and efficiency for the **RemoveGenes** process.
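
The idea behind the `--common` flag can be sketched as an intersection of locus sets followed by a join. The data layout and the function name are assumptions made for illustration and do not reflect how JoinProfiles reads or writes its TSV files.

```python
def join_common(tables):
    """Join profile tables on the loci shared by all of them.

    Each table maps genome identifiers to dictionaries that map locus
    identifiers to allele identifiers. Tables are assumed non-empty.
    Returns the ordered list of common loci and the joined profiles.
    """
    locus_sets = [set(next(iter(table.values()))) for table in tables]
    common = set.intersection(*locus_sets)
    # Keep the locus order used by the first table.
    first_row = next(iter(tables[0].values()))
    ordered = [locus for locus in first_row if locus in common]
    joined = {}
    for table in tables:
        for genome, row in table.items():
            joined[genome] = {locus: row[locus] for locus in ordered}
    return ordered, joined


# Hypothetical inputs: the second file lacks locusB and adds locusD.
file1 = {"g1": {"locusA": 1, "locusB": 4, "locusC": 2}}
file2 = {"g2": {"locusA": 3, "locusC": 2, "locusD": 9}}
loci, profiles = join_common([file1, file2])
```

In this example only `locusA` and `locusC` survive the join, and both genomes end up with profiles over the same two loci.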
