Trafilatura

Latest version: v1.9.0

1.9.0

Extraction:
- add markdown as explicit output (550)
- improve recall preset (571)
- speedup for readability-lxml (547)
- add global options object for extraction and use it in CLI (552)
- fix: better encoding detection (548)
- recall: fix for lists inside tables with mikhainin (534)
- add symbol to preserve vertical spacing in Markdown (499)
- fix: table cell separators in non-XML output (563)
- slightly better accuracy and execution speed overall

Metadata:
- add file creation date (date extraction, JSON & XML-TEI) (561)
- fix: empty content in meta tag by felipehertzer (545)

Maintenance:
- restructure and simplify code (543, 556)
- CLI & downloads: revamp and use global options (565)
- eval: review code, add guidelines and small benchmark (542)
- fix: raise error if config file does not exist (554)
- deprecate `process_record()` (549)
- docs: convert readme to markdown and update info (564, 578)
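
The config-file fix above (554) can be pictured with a small sketch. It assumes the settings are read with Python's `configparser`, whose `read()` method silently skips missing files, so an explicit check is needed to surface the problem. `load_config` is a hypothetical helper written for this illustration, not trafilatura's actual code.

```python
from configparser import ConfigParser
from pathlib import Path


def load_config(path):
    """Hypothetical helper: read an INI-style settings file or fail loudly."""
    config_path = Path(path)
    # ConfigParser.read() ignores files it cannot find, so check first.
    if not config_path.is_file():
        raise FileNotFoundError(f"config file not found: {config_path}")
    config = ConfigParser()
    config.read(config_path, encoding="utf-8")
    return config
```

Without the check, a mistyped path would leave the parser empty and the caller would run with defaults without noticing.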

1.8.1

Maintenance:
- Pin LXML to prevent broken dependency (535)

Extraction:
- Improve extraction accuracy for major news outlets (530)
- Fix formatting by correcting order of element generation and space handling with dlwh (528)
- Fix: prevent tail insertion before children in nested elements by knit-bee (536)

1.8.0

Extraction:
- Better precision by felipehertzer (509, 520)
- Code formatting in TXT/Markdown output added (498)
- Improved CSV output (496)
- LXML: compile XPath expressions (504)
- Overall speedup of about 5%

Downloads and Navigation:
- More robust scans with `is_live_page()` (501)
- Better sitemap start and safeguards (503, 506)
- Fix for headers in response object (513)

Maintenance:
- License changed to Apache 2.0
- `Response` class: convenience functions added (497)
- `lxml.html.Cleaner` removed (491)
- CLI fixes: parallel cores and processing (524)

1.7.0

Extraction:
- improved `html2txt()` function

Downloads:
- add advanced `fetch_response()` function
→ pending deprecation for `fetch_url(decode=False)`

Maintenance:
- support for LXML v5+ (484 by knit-bee, 485)
- update [htmldate](https://github.com/adbar/htmldate/releases/tag/v1.7.0)

1.6.4

Maintenance:
- macOS: fix setup, update htmldate and add tests (460)
- drop invalid XML element attributes with vbarbaresi (462)
- remove cyclic imports (458)

Navigation:
- introduce `MAX_REDIRECTS` config setting and fix urllib3 redirect handling by vbarbaresi (461)
- improve feed detection (457)
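
As a rough sketch of how a setting such as `MAX_REDIRECTS` could be declared and read, the snippet below parses an INI-style fragment with Python's `configparser`. The `[DEFAULT]` section name and the value `2` are assumptions made for illustration; check the settings file shipped with your installed version for the actual layout and default.

```python
from configparser import ConfigParser

# Illustrative settings fragment (section name and value are assumptions).
SETTINGS = """
[DEFAULT]
MAX_REDIRECTS = 2
"""

config = ConfigParser()
config.read_string(SETTINGS)

# Option names are matched case-insensitively by configparser.
max_redirects = config.getint("DEFAULT", "MAX_REDIRECTS")
print(max_redirects)
```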

Documentation:
- enhancements to documentation and testing with Maddesea (456)

1.6.3

Extraction:
- preserve space in certain elements with idoshamun (429)
- optional list of XPath expressions to prune by HeLehm (414)

Metadata:
- more precise date extraction (see [htmldate](https://github.com/adbar/htmldate/releases/tag/v1.6.0))
- new `htmldate` extensive search parameter in config (434)
- changes in URLs: normalization, trackers removed (see [courlan](https://github.com/adbar/courlan/releases/tag/v0.9.5))
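
To give an idea of what URL normalization with tracker removal means, here is a standard-library sketch. It is not courlan's implementation: `strip_trackers` and the `utm_` prefix list are hypothetical and only show the general technique of lowercasing the host and dropping tracking parameters from the query string.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption for this sketch: parameters starting with these prefixes are trackers.
TRACKER_PREFIXES = ("utm_",)


def strip_trackers(url):
    """Hypothetical helper: lowercase the host and drop tracking parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKER_PREFIXES)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), parts.fragment)
    )


print(strip_trackers("https://Example.org/page?id=3&utm_source=feed"))
```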

Navigation:
- reviewed code for feeds (443)
- new config option: external URLs for feeds/sitemaps (441)

Documentation:
- update, add page on text embeddings with tonyyanga (428, 435, 447)
- fix quickstart by sashkab (419)
