Trafilatura

Latest version: v1.8.1

1.8.1

Maintenance:
- Pin LXML to prevent broken dependency (535)

Extraction:
- Improve extraction accuracy for major news outlets (530)
- Fix formatting by correcting order of element generation and space handling with dlwh (528)
- Fix: prevent tail insertion before children in nested elements by knit-bee (536)

1.8.0

Extraction:
- Better precision by felipehertzer (509, 520)
- Code formatting in TXT/Markdown output added (498)
- Improved CSV output (496)
- LXML: compile XPath expressions (504)
- Overall speedup of about 5%

Downloads and Navigation:
- More robust scans with `is_live_page()` (501)
- Better sitemap start and safeguards (503, 506)
- Fix for headers in response object (513)

Maintenance:
- License changed to Apache 2.0
- `Response` class: convenience functions added (497)
- `lxml.html.Cleaner` removed (491)
- CLI fixes: parallel cores and processing (524)

1.7.0

Extraction:
- improved `html2txt()` function

Downloads:
- add advanced `fetch_response()` function
→ pending deprecation for `fetch_url(decode=False)`

Maintenance:
- support for LXML v5+ (484 by knit-bee, 485)
- update [htmldate](https://github.com/adbar/htmldate/releases/tag/v1.7.0)

1.6.4

Maintenance:
- macOS: fix setup, update htmldate and add tests (460)
- drop invalid XML element attributes with vbarbaresi in 462
- remove cyclic imports (458)

Navigation:
- introduce `MAX_REDIRECTS` config setting and fix urllib3 redirect handling by vbarbaresi in 461
- improve feed detection (457)
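
As a rough illustration of the `MAX_REDIRECTS` setting above, the sketch below reads such a value with Python's standard `configparser`. It assumes the option sits in the `[DEFAULT]` section of a settings file; the value `2` and the inline `SETTINGS` string are placeholders for illustration, not values taken from these notes.

```python
from configparser import ConfigParser

# Hypothetical settings fragment; the section name and the value are assumptions.
SETTINGS = """
[DEFAULT]
MAX_REDIRECTS = 2
"""

config = ConfigParser()
config.read_string(SETTINGS)

# Option lookup is case-insensitive in configparser.
max_redirects = config.getint("DEFAULT", "MAX_REDIRECTS")
print(max_redirects)
```

A lower value makes downloads give up sooner on redirect chains, while a higher one tolerates more hops.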

Documentation:
- enhancements to documentation and testing with Maddesea in 456

1.6.3

Extraction:
- preserve space in certain elements with idoshamun (429)
- optional list of XPath expressions to prune by HeLehm (414)

Metadata:
- more precise date extraction (see [htmldate](https://github.com/adbar/htmldate/releases/tag/v1.6.0))
- new `htmldate` extensive search parameter in config (434)
- changes in URLs: normalization, trackers removed (see [courlan](https://github.com/adbar/courlan/releases/tag/v0.9.5))

Navigation:
- reviewed code for feeds (443)
- new config option: external URLs for feeds/sitemaps (441)

Documentation:
- update, add page on text embeddings with tonyyanga (428, 435, 447)
- fix quickstart by sashkab (419)

1.6.2

Extraction:
- more lenient HTML parsing (370)
- improved code block support with idoshamun (372, 401)
- conversion of relative links to absolute by feltcat (377)
- remove use of signal from core functions (384)
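
The conversion of relative links to absolute ones mentioned above can be pictured with the standard library alone. This is a minimal sketch of the general technique using `urllib.parse.urljoin`, not Trafilatura's own implementation; the helper name `absolutize` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def absolutize(link: str, base_url: str) -> str:
    """Resolve a possibly relative link against the URL of the page it was found on."""
    return urljoin(base_url, link)


# Hypothetical page URL and links, for illustration only.
base = "https://example.org/blog/post.html"
links = ["../about", "/contact", "img/pic.png", "https://other.example/x"]

resolved = [absolutize(link, base) for link in links]
for url in resolved:
    print(url)
```

Links that are already absolute pass through unchanged, so the same helper can be applied to every link on a page.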

Metadata:
- JSON-LD fix for sitenames by felipehertzer (383)

Command-line interface:
- more robust batch processing (381)
- added `--probe` option to CLI to check for extractable content (378, 392)

Maintenance:
- simplified code (408)
- support for Python 3.12
- pinned LXML version for macOS (393)
- updated dependencies and parameters (notably `htmldate` and `courlan`)
- code cleaning by marksmayo (406)
