Exoskeleton

Latest version: v2.1.1

1.2.3

* Update required versions of dependencies `agg` and `bote`.
* Refactored code.

1.2.2

* Update required versions of dependencies:
* The new required version of `urllib3` deprecated `TLSv1` and `TLSv1.1`. Connections using those protocols will not fail, but will trigger a warning (see the sketch after this list).
* Require version `1.0.2` of `pymysql` instead of `0.10.1`: with the new major branch released in 2021, `pymysql` finally dropped support for Python 2.x/3.5 and improved code quality.
* New dependency: [`compatibility`](https://github.com/RuedigerVoigt/compatibility) (`>=0.8.0`) has been added. It warns you if you use exoskeleton with an untested or unsupported version of Python. As a sister project of `exoskeleton`, its development is coordinated with that of exoskeleton.
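
Since the deprecation surfaces through Python's `warnings` machinery, you can record or escalate it to find servers that still negotiate old TLS versions. A minimal sketch, assuming `urllib3` emits a `DeprecationWarning` on such connections (the exact category and message depend on your `urllib3` version):

```python
import warnings

import urllib3

# Record warnings instead of printing them; recent urllib3 releases warn
# when a server negotiates the deprecated TLSv1 or TLSv1.1 protocols.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    http = urllib3.PoolManager()
    http.request("GET", "https://www.example.com/")

for warning in caught:
    if issubclass(warning.category, DeprecationWarning):
        print(f"Deprecated TLS version in use: {warning.message}")
```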

1.2.1

* Require `lxml` version >= 4.6.2 (released 2020-11-26) as it fixes a vulnerability *and* works with Python 3.9.

1.2.0

* The code has been refactored to make it easier to maintain.
* Made the methods `assign_labels_to_master`, `assign_labels_to_uuid`, and `get_label_ids` directly accessible again (with aliases in the main class, as they are sometimes useful to call directly).

Bugfixes:

* The database script for version 1.0.0 onward hardcoded the database name `exoskeleton` in the view `v_errors_in_queue`, so the create script would not complete if your database had a different name.
* The dependency `userprovided` has been updated from version 0.7.5 to 0.8.0, as some non-RFC-compliant URLs caused exceptions.

1.1.0

* The [agg](https://github.com/RuedigerVoigt/agg) package has been added as a dependency for a planned feature. It joins multiple CSV files into one. As a sister project of exoskeleton, it follows the same development steps.
* If a user provides an unsupported browser to the "save as PDF" functionality, exoskeleton now checks whether supported browsers are in the PATH and suggests them (a minimal PATH check is sketched after this list).
* The database schema now stores its version in the table `exoinfo`. This makes it possible to alert users who combine different, potentially incompatible versions of the database schema and exoskeleton.
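
Checking for usable browsers amounts to probing the PATH. A minimal sketch of that idea using only the standard library; the candidate names below are assumptions, not necessarily the list exoskeleton uses:

```python
import shutil

# Hypothetical candidates; exoskeleton's actual suggestions may differ.
CANDIDATES = ("chromium", "chromium-browser", "google-chrome")

def find_browsers() -> list:
    """Return the candidate browsers available in the PATH."""
    return [name for name in CANDIDATES if shutil.which(name)]

found = find_browsers()
if found:
    print("Supported browsers in PATH:", ", ".join(found))
else:
    print("No supported browser found in PATH.")
```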

Breaking Changes:
* URLs are now normalized using `userprovided.url.normalize_url()` ([more info](https://github.com/RuedigerVoigt/userprovided/blob/master/CHANGELOG.md#version-075-beta-2020-10-27)). This is a more elaborate method than the one used before and will reduce the number of duplicates in certain scenarios. For example, two links pointing to the same page are now recognized as duplicates if their only difference is a fragment (e.g. `https://www.example.com/index.html` and `https://www.example.com/index.html#foo`); see the sketch after this list. If you switch from 1.0.0 to 1.1.0 in an already running project, this might lead to some resources being downloaded again, as old URLs were normalized in a different way.
* The parameter `chrome_name` now defaults to an empty string. So if you want to use Chromium or Google Chrome to download PDF versions of webpages, you have to provide the name of an executable in the PATH.
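
The effect of the fragment handling can be illustrated with the standard library. This sketch mimics only that one aspect of `userprovided.url.normalize_url()`; the real function does more, as its changelog describes:

```python
from urllib.parse import urldefrag

def strip_fragment(url: str) -> str:
    """Drop the #fragment so URLs differing only there compare equal.

    Illustrative only -- userprovided's normalization covers more cases.
    """
    return urldefrag(url).url

a = strip_fragment("https://www.example.com/index.html")
b = strip_fragment("https://www.example.com/index.html#foo")
print(a == b)  # True: the fragment no longer creates a duplicate
```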

1.0.0

New Features:
* **System Test**: Each push and every pull request now also triggers a system test. This test launches an Ubuntu instance and loads a MariaDB container. It then creates the database, adds tasks to the queue, processes the queue, and checks the resulting structure.
* New function `add_save_page_text`, which saves only the text of a page, not its HTML code.
* The parameter `queue_max_retries` (in `bot_behavior`) is now evaluated: after a task fails, the wait time until the next try increases. After the specified number of tries (default: 5), the error is assumed to be permanent rather than temporary, and exoskeleton stops trying to execute the task (see the first sketch after this list).
* If a crawl delay is added to a specific task in the queue, it is now also added to all other tasks that affect the same URL. The number of tries is still counted per individual task, not per URL.
* If a crawler hits a rate limit, a server should respond with the HTTP status code 429 ("Too Many Requests"). In that case, exoskeleton now adds the fully qualified domain name (like `www.example.com`) to a rate-limit list and blocks contact with this FQDN for a predefined time (default: 31 minutes); see the second sketch after this list.
* Not all servers signal a rate limit with the HTTP status code 429; some use codes like 404 ("Not Found") or 410 ("Gone") instead. Tasks that cause such errors stay in the queue, but exoskeleton does not try to carry them out. Therefore, some helper functions were introduced to reset those errors and process the affected tasks again: `forget_all_errors`, `forget_permanent_errors`, `forget_temporary_errors`, and `forget_specific_error`.
* Added the database view `v_errors_in_queue`, which makes information about errors that occurred while crawling more accessible.
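
The incrementing wait time can be pictured as a simple backoff loop. Below is a minimal sketch of the pattern, not exoskeleton's internal code; the function name and the doubling formula are assumptions:

```python
import time

def run_with_backoff(task, max_retries: int = 5, base_delay: float = 1.0):
    """Retry a task with an increasing wait; give up after max_retries.

    Hypothetical sketch -- exoskeleton itself tracks this state in its
    database queue rather than in memory.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception as err:
            if attempt == max_retries:
                # Assume the error is permanent, not temporary.
                raise RuntimeError("Permanent error: giving up") from err
            # Wait longer after each failed try: 1s, 2s, 4s, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```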
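
The rate-limit handling boils down to a per-host block list with an expiry time. This sketch only illustrates the idea; the 31-minute default comes from the changelog, while the names and data structures are illustrative:

```python
import time
from urllib.parse import urlsplit

RATE_LIMIT_SECONDS = 31 * 60  # default block duration per the changelog
blocked_until: dict = {}      # maps FQDN -> timestamp when the block ends

def note_rate_limit(url: str) -> None:
    """Block the URL's host after an HTTP 429 ("Too Many Requests")."""
    fqdn = urlsplit(url).hostname
    blocked_until[fqdn] = time.monotonic() + RATE_LIMIT_SECONDS

def is_blocked(url: str) -> bool:
    """Return True while the host's rate-limit window is still open."""
    fqdn = urlsplit(url).hostname
    return time.monotonic() < blocked_until.get(fqdn, 0.0)
```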

Breaking Changes:
* The function `mark_error` has been renamed to `mark_permanent_error` to better reflect its purpose.
* The function `add_crawl_delay_to_item` was renamed to `__add_crawl_delay` and the optional parameter `delay_seconds` was removed.
* The database table `statistics_host` was extended.
* The method `get_queue_id` is now `QueueManager.__get_queue_uuids`, as there is no reason to access it from a script.
* The method `num_items_in_queue` has been replaced with `QueueManager.queue_stats`, which returns more information as a dictionary.
