Mikado

Latest version: v2.3.4

2.2.0

Removed Cython from the `requirements.txt` file. This allows the tests to run correctly in a Conda environment (Conda disallows installing Cython as part of a distributed package).
As a result of this change, the preferred installation procedure from source has to be slightly amended:
- either install using `pip wheel -w dist . && pip install dist/Mikado*whl`
- or install with `python setup.py bdist_wheel` **after** having forcibly installed Cython, with `pip install Cython` or the like.

Other changes:
- Fix [381](https://github.com/EI-CoreBioinformatics/mikado/issues/381): now Mikado will be able to guess correctly
the input file format, instead of relying on the file name extension or user's settings. Sniffing for files
provided as a stream is *disabled* though.
- Fix [382](https://github.com/EI-CoreBioinformatics/mikado/issues/382): now Mikado can accept generic BED12 files
as input junctions, not just Portcullis junctions. This allows e.g. a user to provide a ***set of gene models***
in BED12 format as sources of valid junctions.
- Fix [384](https://github.com/EI-CoreBioinformatics/mikado/issues/384): now Mikado convert deals properly with
unsorted GTFs/GFFs.
- Fix [386](https://github.com/EI-CoreBioinformatics/mikado/issues/386): dealing better with unsorted GFFs/GTFs for
the stats utility.
- Fix [387](https://github.com/EI-CoreBioinformatics/mikado/issues/387): Mikado now always uses a static seed instead of
generating a new one per call, unless specifically instructed otherwise. The old behaviour can still be
replicated either by setting the `seed` parameter to `null` (i.e. `None`) in the configuration file, or by
specifying `--random-seed` during the command invocation.
- General increase in code unit-test coverage; in particular:
  - Slightly increased the unit-test coverage for the locus classes, e.g. properly covering the `as_dict` and `load_dict`
    methods. Minor bugfixes related to the introduction of these unit-tests.
- `Mikado.parsers.to_gff` has been renamed to `Mikado.parsers.parser_factory`.
- The code related to the transcript padding has been moved to the submodule `Mikado.transcripts.pad`, rather than
being part of the `Mikado.loci.locus` submodule.
- Mikado will error informatively if the scoring configuration file is malformed.
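
The seed behaviour described for issue 387 can be sketched as follows. This is a minimal illustration and not Mikado's own code: the function name `resolve_seed` and the `random_seed` argument are hypothetical stand-ins for the `seed` configuration value and the `--random-seed` switch.

```python
import random

MAX_SEED = 2**32 - 1  # seeds are treated as 32-bit unsigned integers in this sketch


def resolve_seed(configured_seed, random_seed=False):
    """Return the seed to use for a run (hypothetical helper).

    A configured integer is used as-is, so repeated runs are identical.
    A fresh seed is drawn only when the configuration holds None (`null`)
    or when the caller explicitly asks for a random seed.
    """
    if random_seed or configured_seed is None:
        return random.SystemRandom().randint(0, MAX_SEED)
    return configured_seed


static = resolve_seed(42)                    # always 42
fresh = resolve_seed(None)                   # new value on every call
forced = resolve_seed(42, random_seed=True)  # the switch overrides the configured value
```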

2.1.1

Hotfix release:
- **IMPORTANT** Mikado now correctly uses the scores associated with a given source.
- **IMPORTANT** Mikado was not forwarding the original source to transcripts derived by chimera splitting. This compounded the issue above.
- Corrected the root cause of the issues above, i.e. transcripts were not dumping and reloading all relevant fields. This is now implemented properly and tested with specific new routines.
- Corrected an issue that caused Mikado to erroneously calculate the metrics and scores of loci twice, therefore reporting some wrong values in the output files.
  - Affected metrics were e.g. `selected_cds_intron_fraction` and `combined_cds_intron_fraction`.
- Removed `quicksect` from the requirements.

2.1.0

Bugfix and speed improvement release.

- Fix a bug that prevented Mikado from reporting the correct metrics/scores in the output of *loci* files. This bug only affected reporting, not the results themselves. See [issue 376](https://github.com/EI-CoreBioinformatics/mikado/issues/376)
- Fix a bug in printing out the statistics for an annotation file with `mikado util stats` ([issue 378](https://github.com/EI-CoreBioinformatics/mikado/issues/378))
- When serialising, Mikado now drops and reloads everything by default. The previous default behaviour resulted in hard-to-parse errors and is not what is usually desired anyway.
- Improved the performance of pick in multiple ways ([issue 375](https://github.com/EI-CoreBioinformatics/mikado/issues/375)):
  - now only external metrics that are requested in the scoring file will be printed out in the final `metrics` files. This reduces runtime in e.g. Minos. The new CLI switch `--report-all-external-metrics` (both in `configure` and `pick`) can be used to revert to the old behaviour.
  - the `external` table in the Mikado database now is indexed properly, increasing speed.
  - batch and compress the results before sending them through a queue (ljyanesm)
  - brentp enhanced the bcbio `intervaltree.pyx` into `quicksect`. Copied this new version of interval tree and adapted it to Mikado.
  - Using sqlalchemy bakeries for the SQLite queries, as well as LRU caches in various parts of Mikado.
  - Removed excessive copying in multiple parts of the program, especially regarding the configuration objects and during padding.
  - Using `operator.attrgetter` instead of a custom (and slower) recursive `getattr` function.
- Removed unsafe calls to tempfile.mktemp and the like, for increased security according to CodeQL.
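
As an illustration of the `operator.attrgetter` change listed above, the sketch below compares a hand-written recursive lookup with the standard-library getter. The object layout and attribute names are invented for the example and are not taken from Mikado's configuration classes.

```python
from functools import reduce
from operator import attrgetter
from types import SimpleNamespace


def recursive_getattr(obj, dotted_name):
    """Custom lookup: walk the dotted path one attribute at a time."""
    return reduce(getattr, dotted_name.split("."), obj)


# Hypothetical nested configuration object, for illustration only.
config = SimpleNamespace(
    pick=SimpleNamespace(
        alternative_splicing=SimpleNamespace(min_cdna_overlap=0.6)
    )
)

# attrgetter accepts dotted names, and the getter can be built once and reused.
get_overlap = attrgetter("pick.alternative_splicing.min_cdna_overlap")

custom_value = recursive_getattr(config, "pick.alternative_splicing.min_cdna_overlap")
stdlib_value = get_overlap(config)
```

Both lookups return the same value. The standard-library getter avoids splitting the string and looping in Python on every call.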

2.0.2

Bugfix release.

- Fix infinite recursion bug when trying to recover lost transcripts
- Fix performance regression by passing the configuration to Excluded locus objects.

2.0.1

BugFix release.

- Fixed a bug that caused Mikado configure (but not daijin configure, or "mikado configure --daijin") to print out invalid configuration files.
- Restored the functionality of "--full" - now Mikado can print out both partial (but still valid) or fully-fledged configuration files.
- Configured bumpversion
- Corrected a small bug in parsing EnsEMBL GFF3
- Cured some deprecation warning messages from marshmallow and numpy
- Small bug fix in the CLIs of mikado/daijin configure.
- The default value of the seed is now 0 (i.e. undefined: a random one will be selected). Only integers are allowed values.
- Small bugfixes/extensions in the test suite.

2.0

This is the second major release of Mikado. It contains **backwards-incompatible changes** to the data structures used in the program;
as such, all users are **warmly invited to update the program as soon as possible**. Due to these changes, old runs might need to be redone
(e.g. for Mikado serialise).

This release has been greatly focused on making Mikado capable of integrating not just transcript assemblies but rather a mixture of transcript assemblies
and _ab initio_ gene annotations. We also made it possible to flag certain sets of transcripts for Mikado as of *reference quality*, and improved the possibility
of passing external metrics (e.g. expression values) to Mikado. In practice, these changes **make Mikado a robust program to integrate gene annotations from multiple
sources into a coherent, final gene annotation**. Mikado had already been used in this capacity
[for the annotation of _T. aestivum_](https://science.sciencemag.org/content/361/6403/eaar7191); the changes in this version build upon that early work.

Following these changes, we plan to use Mikado in this capacity in a fully automated gene annotation pipeline. Please also note that, due to our work
on this new product, **we are planning to retire Daijin in the near future and its development is now discontinued**.

Aside from numerous bug fixes, this release brings the following highlights:

- Now Mikado will use [TOML](https://github.com/toml-lang/toml) as default configuration language.
- Many parts of `mikado`, especially in `serialise` and `pick`, have been rewritten to be much more performant. Specifically:
  - `mikado pick` underwent a strict code revision to remove quadratic (or worse) bottlenecks. This now allows `mikado pick` to run on much denser, larger inputs without prohibitive computational resources or times.
  - `mikado serialise` now is fully parallelised both for ORF and BLAST loading (280).
  - `mikado serialise` can now load data from custom-field tabular BLAST data, rather than only from XML files (280).
  - both steps now use temporary SQLite3 databases for fast inter-process data exchange.
- Mikado will now function correctly with soft-masked genomes.
- Mikado pick now will **backtrack** during the picking stage. This prevents loci from being missed due to chaining in the early stages of the algorithm.
- Mikado is now capable of padding transcripts in a locus so that they will share the same 5' and 3' ends, if appropriate.
This leads to more coherent gene models, and can help recover gene models that are present only in fragmentary form,
by piggybacking on other, more complete models. This padding behaviour is now **default** in Mikado.
- The Mikado database (for Mikado serialise) and the GF index (used by Mikado compare) have been overhauled and are **not** backwards compatible.
- Mikado compare is now fully multi-processed.
- Mikado compare now **can consider fuzzy matching for the introns**.
This helps in e.g. evaluating the results from noisy long reads, such as those from NanoPore. Briefly, when activated,
Mikado compare will consider a predicted intron to match a reference intron if it lies within the specified number of bases of it. A similar fuzzy logic will apply to intron chains.
- Mikado can now load arbitrary numerical or boolean external metrics for all transcripts. They are not limited any longer to floats between 0 and 1.
- Alternative transcript events will now have to have the same coding frame, in coding loci.
- Mikado now provides only two scoring files ("*plant.yaml*" and "*mammalian.yaml*").
"*Plant.yaml*" should function also for insect or fungal species, but we have not tested it extensively.
Old scoring files can be found under "HISTORIC".
- A **random seed** can now be specified for Mikado as a 32-bit integer. This makes it possible to produce fully reproducible runs.
- Mikado will now exit without hanging in case of a crash during a multi-processed run.

With this release, we are also officially dropping support for Python 3.4. Python 3.5 will not be automatically tested for, as many Conda dependencies are not up-to-date, complicating the TRAVIS setup.

Contributors to this release:

- Gemy George Kaithakottil (gemygk)
- Christian Schudoma (cschu)
- David Swarbreck (swarbred)

Acknowledgements for contributing by bug reports and suggestions:

- Tom Mathers (tommathers)
- AsclepiusDoc
- Justin S (codeandkey)
- zebrafish-507
- Dr Robert King (rob123king)
- mndavies286
- Ole Tørresen (Thieron)
- Ferdinand Marlétaz (fmarletaz)
- Luohao Xu (lurebgi)
- Sagnik Banerjee (sagnikbanerjee15)
- lijing28101
- Lawrence Percival Alwyn (for the suggestion on random seeds)

Detailed list of bugfixes and improvements:

General

- Many internal algorithms of `mikado pick` have been rewritten to avoid quadratic bottlenecks. This allows Mikado to analyse datasets that are much denser or richer, without the processing time getting out of hand.
- `mikado pick` is now much more efficient in using multiple processors.
- Mikado has now been tested to be compatible with Python 3.7.
- Mikado can now specify a static random seed, ensuring full reproducibility of the runs ([183](https://github.com/EI-CoreBioinformatics/mikado/issues/183))
- Mikado will now correctly terminate all child processes in the event of a crash, and exit without hanging ([205](https://github.com/EI-CoreBioinformatics/mikado/issues/205))
- Mikado now always uses PySam, instead of PyFaidx, to fetch chromosomal regions (e.g. during prepare and pick).
This speeds up and lightens the program, as well as making tests more manageable.
- Made logging more sensible and informative for all three steps of the pipeline (prepare, serialise, pick)
- Mikado now supports the BED12+1 format (ie a BED12 format with GFF-like attributes on the 13th field)
- Now Mikado can use alternative translation tables among those provided by [NCBI through BioPython](ftp://ftp.ncbi.nih.gov/entrez/misc/data/gc.prt). The default is "0", ie the Standard table
but with only the canonical "ATG" being accepted as valid start codon. ([34](https://github.com/EI-CoreBioinformatics/mikado/issues/34)).
Please note that this is still a **global** value, it is not yet possible to specify a subset of chromosomes functioning with a different table.
- Now Mikado correctly considers the phase (instead of the incorrect frame) for GTFs. This makes it
compatible with EnsEMBL and [GenomeTools](http://genometools.org/) or [GffRead](https://github.com/gpertea/gffread), among others ([#135](https://github.com/EI-CoreBioinformatics/mikado/issues/135))
- Mikado was not dealing correctly with soft-masked genomes ([139](https://github.com/EI-CoreBioinformatics/mikado/issues/139))
- Increased coverage of the unit tests to approximately 83% ([137](https://github.com/EI-CoreBioinformatics/mikado/issues/137))
- Created proper Docker and Singularity recipes for Mikado ([149](https://github.com/EI-CoreBioinformatics/mikado/issues/149), [#164](https://github.com/EI-CoreBioinformatics/mikado/issues/164))
- Fixed an incorrect algorithm for merging overlapping intervals ([150](https://github.com/EI-CoreBioinformatics/mikado/issues/150))
- Improved Mikado performance by removing the default overloading of `__getattribute__` in the *Transcript* class ([153](https://github.com/EI-CoreBioinformatics/mikado/issues/153), [#154](https://github.com/EI-CoreBioinformatics/mikado/issues/154))
- The configuration file has been overhauled for simplicity's sake ([158](https://github.com/EI-CoreBioinformatics/mikado/issues/158))
- Dropped the by-now obsolete "nosetests" suite for testing, moving to the modern and maintained "pytest".
- Now Mikado will be forced to run in single-threaded mode if the user is asking for debugging-level logs.
This is to prevent a [re-entrancy race condition that causes deadlocks](https://codewithoutrules.com/2017/08/16/concurrency-python/).
- During configure and prepare, Mikado can now flag some transcripts as coming from a "reference".
Transcripts flagged like this **will never be modified nor dropped during a mikado prepare run**, unless generic or
critical errors are registered. Moreover, if source scores are provided, Mikado will preferentially keep one identical
transcript from those that have the highest *a priori* score. This makes it possible to e.g. prioritise PacBio or reference
assemblies during prepare ([141](https://github.com/EI-CoreBioinformatics/mikado/issues/141)).
  - Please note that this change **does not affect the final picking**, but rather is just a mechanism for allowing Mikado to accept pass-through data.
  - If you desire to prioritise reference transcripts, please directly assign a source score higher than 0 to these sets.
  - Alternatively, use the `--only-update-reference` flag to have Mikado only try to add ASEs to known loci (see under *Mikado pick*)
- Mikado runs should now be fully reproducible, by specifying a seed. One will be generated automatically by Mikado
when launching the configuration, so that repeated runs using the same configuration file will be deterministically identical.
- [136](https://github.com/EI-CoreBioinformatics/mikado/issues/136): documentation has been updated to reflect the changes in the latest releases.
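
One of the fixes above concerns the merging of overlapping intervals. The sketch below shows the standard sort-and-sweep approach to that problem. It is not the code used in Mikado, and it assumes closed `(start, end)` intervals that are merged only when they share at least one position. Whether Mikado also merges directly adjacent intervals is not stated in these notes.

```python
def merge_intervals(intervals):
    """Merge overlapping closed intervals given as (start, end) tuples."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the last merged interval: extend it if needed.
            # Taking the max matters when the new interval is nested inside.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


exons = [(300, 400), (100, 250), (200, 220), (390, 450)]
merged_exons = merge_intervals(exons)
```

Sorting first means each interval only has to be compared with the last merged one, so the sweep is linear after the sort.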

Mikado prepare
- Mikado will now always strip the CDS when a transcript is reversed ([126](https://github.com/EI-CoreBioinformatics/mikado/issues/126)).
- Mikado prepare now will *not* consider redundant transcripts that have the same cDNA but *different* CDS
([127](https://github.com/EI-CoreBioinformatics/mikado/issues/127)).
- Mikado prepare will consider for redundancy whether a transcript is *contained* within another and *shares its intron chain in its entirety*.
This allows a drastic reduction in the number of inputs to the other steps ([270](https://github.com/EI-CoreBioinformatics/mikado/issues/270)).
- Mikado prepare now allows deciding *per-source* whether redundant transcripts should be kept or discarded ([270](https://github.com/EI-CoreBioinformatics/mikado/issues/270)).
- Mikado prepare will now ascertain whether a CDS has a valid start and/or stop codon ([132](https://github.com/EI-CoreBioinformatics/mikado/issues/132)) and will retain the original phase values ([#133](https://github.com/EI-CoreBioinformatics/mikado/issues/133)).
- Mikado prepare now will preferentially keep "reference" transcripts and transcripts with a higher source score, in this order.
Reference transcripts will never be discarded for failing a requirements check ([141](https://github.com/EI-CoreBioinformatics/mikado/issues/141)).
- Mikado prepare was not considering correctly GTFs without a `transcript` line feature ([196](https://github.com/EI-CoreBioinformatics/mikado/issues/196)).
- Mikado prepare now can accept models that lack any exon features but still have valid CDS/UTR features - this is necessary for some protein prediction tools.
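
The containment-based redundancy check described above (issue 270) can be sketched as follows. This is an illustrative reading of the rule and not Mikado's implementation: transcripts are reduced to sorted lists of exon coordinates, strand and CDS are ignored, and the function names are hypothetical.

```python
def intron_chain(exons):
    """Derive the introns, as (start, end) tuples, from sorted exon coordinates."""
    exons = sorted(exons)
    return tuple((exons[i][1] + 1, exons[i + 1][0] - 1)
                 for i in range(len(exons) - 1))


def is_contained_redundant(candidate, container):
    """True if `candidate` lies within `container` and has the same full intron chain."""
    cand, cont = sorted(candidate), sorted(container)
    within = cont[0][0] <= cand[0][0] and cand[-1][1] <= cont[-1][1]
    return within and intron_chain(cand) == intron_chain(cont)


full_model = [(100, 200), (300, 400), (500, 600)]
shorter_ends = [(150, 200), (300, 400), (500, 550)]  # same introns, shorter terminal exons
fragment = [(150, 200), (300, 400)]                  # lacks the second intron

redundant = is_contained_redundant(shorter_ends, full_model)  # True
kept = is_contained_redundant(fragment, full_model)           # False
```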

Mikado serialise
- Use of temporary SQLite databases for inter-process communication in Mikado serialise, with consequent speedup ([97](https://github.com/EI-CoreBioinformatics/mikado/issues/97))
- Fixed bugs related to Prodigal ORFs on the negative strand ([181](https://github.com/EI-CoreBioinformatics/mikado/issues/181))
- BLAST HSPs now also store whether there is an in-frame stop codon.
- Mikado serialise is now much faster when serialising the ORFs or BLAST data.
This is due to better multiprocessing and to having moved to Cython the most expensive steps ([280](https://github.com/EI-CoreBioinformatics/mikado/issues/280))
- Mikado serialise is now able to use *tabular* BLAST data as input, not just XML.
The tabular output should contain the standard columns plus, *at the end*, the following two:
  - ppos
  - btop
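
A minimal sketch of reading such a tabular file is shown below. It assumes that the "standard columns" are the twelve default fields of BLAST+ tabular output, which these notes do not enumerate, and the sample line is invented. It is not the parser used by Mikado serialise.

```python
import csv
import io

# Assumed default BLAST+ tabular columns, followed by the two extra fields.
STANDARD_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                    "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
COLUMNS = STANDARD_COLUMNS + ["ppos", "btop"]


def parse_blast_tabular(handle):
    """Yield one dictionary per hit, skipping empty and comment lines."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        if len(row) != len(COLUMNS):
            raise ValueError(f"Expected {len(COLUMNS)} columns, found {len(row)}")
        yield dict(zip(COLUMNS, row))


# Invented example line with 14 tab-separated fields.
sample = "tx1\tprot1\t98.5\t200\t3\t0\t1\t600\t10\t209\t1e-100\t400\t99.0\t150AG49\n"
hits = list(parse_blast_tabular(io.StringIO(sample)))
```

All values are kept as strings here. A real loader would convert the numeric fields and interpret the `btop` string.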

Mikado pick

- For the external scores, Mikado can now accept any type of numerical or boolean value. Mikado will understand at
serialisation time whether a particular score can be used raw (ie its values are strictly comprised between 0 and 1)
or whether it has to be forcibly scaled.
- This allows Mikado to use e.g. transcript expression as a valid metric.
- Mikado is now capable of correctly padding the transcripts so as to make their ends uniform within a single locus. This will
also have the effect of trying to enlarge the ORF of a transcript if it is truncated to begin with. Please note that
padded transcripts will add terminal *exons* rather than just extending their terminal ends. This should prevent the
creation of faux retained introns. Moreover, now the padding procedure will explicitly find and discard transcripts
that would become invalid after padding (e.g. because they end up with a far too long UTR, or retained introns).
If some of the invalid transcripts had been used as template for the expansion, Mikado will remove the offending
transcripts and restart the procedure ([129](https://github.com/EI-CoreBioinformatics/mikado/issues/129),
[142](https://github.com/EI-CoreBioinformatics/mikado/issues/142)). Moreover:
- Mikado will remove fully redundant (ie 100% identical transcripts) after padding ([208](https://github.com/EI-CoreBioinformatics/mikado/issues/208))
- As a consequence of this change, Transcript objects have been modified to expose the following methods related to the internal interval tree:
- find/search (to find intersecting exonic or intronic intervals)
- find_upstream (to find all intervals upstream of the requested one in the transcript)
- find_downstream (to find all intervals downstream of the requested one in the transcript)
- Moreover, transcript objects no longer have the unused "cds_introntree" property. Combined CDS and CDS introns are now present in the "cds_tree" object.
- Again as a consequence, now Locus objects have a new private method - _swap_transcript - that allows two Transcript
objects with the same ID to be exchanged within the locus. This is essential to allow the Locus to recalculate most
scores and metrics (e.g. all the exons or introns in the locus).
- Fixed a bug which caused some loci to crash at the last part of the picking stage.
- After picking, loci will be either coding or non-coding - no admixture.
- Solved a bug which led Mikado to recalculate the phases for each model during picking, potentially creating mistakes
for models truncated at the 5' end ([138](https://github.com/EI-CoreBioinformatics/mikado/issues/138)).
- Transcript padding has been overhauled and bugs related to it fixed ([124](https://github.com/EI-CoreBioinformatics/mikado/issues/124),
[142](https://github.com/EI-CoreBioinformatics/mikado/issues/142)).
- During scoring, it is now possible to specify conditions **related to a different metric** as a filtering option; moreover,
Mikado now will ignore for the purposes of scoring transcripts that have not passed the minimum filter.
See [130](https://github.com/EI-CoreBioinformatics/mikado/issues/130) and documentation for details.
- Mikado pick now will backtrack if it realises that some loci have been lost due to chaining.
Previously, Mikado could have missed loci if they were lost between the sublocus and monosublocus stages.
Now Mikado implements a basic backtracking recursive algorithm that should ensure no locus is missed.
This check happens during the last stage of the picking. ([131](https://github.com/EI-CoreBioinformatics/mikado/issues/131))
- Now all coding transcripts of a Mikado pick locus will share the same frame. Moreover,
**Mikado will now calculate the CDS overlap percentage based on the primary transcript CDS length**, not the minimum
CDS length between primary and candidate. Please note that the change **regarding the frame** also affects the monosublocus stage.
Mikado still considers only the primary ORFs for the overlap. ([134](https://github.com/EI-CoreBioinformatics/mikado/issues/134))
- Mikado pick was forgetting the original phases of transcripts, when not loading them from a database ([138](https://github.com/EI-CoreBioinformatics/mikado/issues/138)).
- Mikado pick will never discard a reference transcript for failing the requirements check. Moreover,
**it is now possible to instruct Mikado to only update a reference** rather than trying to come up with an annotation on its own.
When so instructed, Mikado pick will ignore any locus without a reference transcript, consider those as pass-through, and try to add
new transcripts that are compatible with the known loci ([148](https://github.com/EI-CoreBioinformatics/mikado/issues/148)).
- Mikado now contains only two scoring files, *plants.yaml* and *mammals.yaml* ([155](https://github.com/EI-CoreBioinformatics/mikado/issues/155)).
- Mikado pick now uses the [WAL](https://www.sqlite.org/wal.html) method for faster dispatching of data and to avoid crashes
([205](https://github.com/EI-CoreBioinformatics/mikado/issues/205)).
- Corrected a long-standing bug that made Mikado lose track of some fragments during the fragment removal phase.
Somewhat confusingly, Mikado printed those loci into the output, but reported in the log file that there was a
"missing locus". Now Mikado correctly keeps track of them and removes them.
- Corrected issues that caused a crash due to the data exchange databases being locked ([205](https://github.com/EI-CoreBioinformatics/mikado/issues/205))
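
The change in how the CDS overlap percentage is calculated (issue 134) can be illustrated with a small worked example. The sketch assumes that each CDS is a list of non-overlapping closed `(start, end)` segments. The function names and the `use_primary` switch are hypothetical and exist only to contrast the new and the old denominator.

```python
def total_length(segments):
    """Total number of bases covered by closed (start, end) segments."""
    return sum(end - start + 1 for start, end in segments)


def shared_length(first, second):
    """Number of bases shared between two sets of non-overlapping segments."""
    return sum(max(0, min(e1, e2) - max(s1, s2) + 1)
               for s1, e1 in first for s2, e2 in second)


def cds_overlap_fraction(primary_cds, candidate_cds, use_primary=True):
    """Overlap relative to the primary CDS (new) or to the shorter CDS (old)."""
    shared = shared_length(primary_cds, candidate_cds)
    if use_primary:
        denominator = total_length(primary_cds)
    else:
        denominator = min(total_length(primary_cds), total_length(candidate_cds))
    return shared / denominator


primary = [(1, 100), (201, 300)]   # 200 coding bases
candidate = [(51, 100)]            # 50 coding bases, all inside the primary CDS

new_value = cds_overlap_fraction(primary, candidate)                     # 50 / 200
old_value = cds_overlap_fraction(primary, candidate, use_primary=False)  # 50 / 50
```

With the primary CDS as denominator, a short candidate that covers only a quarter of the primary CDS no longer counts as a complete overlap.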

Mikado compare

- Mikado compare now reports statistics related to **non-redundant introns and intron chains**. This provides a better picture of the prediction in some instances, eg. when analysing IsoSeq/ONT runs.
- Mikado compare can also consider "fuzzy matches" for the introns. This means that two transcripts might be considered as a "match" even if their introns
are slightly staggered. This helps e.g. when assessing imperfect data such as Nanopore, where the experimenter usually knows that the per-base precision is quite low.
- Switched to the lighter [msgpack](https://github.com/msgpack/msgpack-python) from ujson, with increase in performance, for the Mikado index ([#168](https://github.com/EI-CoreBioinformatics/mikado/issues/168))
- Mikado compare has been greatly improved ([166](https://github.com/EI-CoreBioinformatics/mikado/issues/166)), with the addition of:
- proper multiprocessing
- faster startup times

Daijin
- Daijin now supports the `--use-conda` command line switch, to download and install seamlessly the necessary packages.

Other

- The `add_transcript_feature.py` script has been improved. It now automatically splits chimeric transcripts
and corrects mistakes related to the intron size, mostly to deal with Nanopore reads ([123](https://github.com/EI-CoreBioinformatics/mikado/issues/123))
- Fixed some parsing errors for GTFs created by converting from BAM files ([157](https://github.com/EI-CoreBioinformatics/mikado/issues/157))
- Mikado util convert now functions with BAM files ([197](https://github.com/EI-CoreBioinformatics/mikado/issues/197))
- Mikado `util grep -v` functions also for GTFs ([203](https://github.com/EI-CoreBioinformatics/mikado/issues/203))
- [209](https://github.com/EI-CoreBioinformatics/mikado/issues/209): now `daijin` supports conda environments. Moreover, we test the assemble part properly to ensure its correct functioning.
