Extruct

Latest version: v0.16.0

0.6.0

-------------------

* JSON-LD parsing is less strict now: control characters are allowed.
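
To illustrate what allowing control characters means: the standard-library ``json`` module rejects raw control characters inside strings by default and accepts them when called with ``strict=False``. The sketch below only demonstrates that behaviour; whether extruct implements the change this way is an assumption, and the sample payload is made up.

```python
import json

# Hypothetical JSON-LD payload with a raw newline (a control character)
# inside a string value.
payload = '{"@type": "Product", "description": "line one\nline two"}'

# Strict parsing (the json module's default) rejects the raw newline.
try:
    json.loads(payload)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# Relaxed parsing accepts control characters inside strings.
relaxed = json.loads(payload, strict=False)
```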

0.5.0

-------------------

* Add OpenGraph and Microformat extractors.
* Add argument ``syntaxes`` to ``extract`` and the command line function; it
selects which syntaxes to extract.
* Add argument ``uniform`` to ``extract`` and the command line function; if True, it
maps the output of Microdata, OpenGraph, Microformat and JSON-LD to the same template.
* Add argument ``errors`` to ``extract`` and the command line function; it
defines whether errors are raised, logged or ignored.
* Fix RDFa memory leak: ``RDFaExtractor`` now resets ``_lookups`` after each
extraction.
* Fix regex pattern in ``JsonLdExtractor`` to avoid removing comments from
within valid JSON.
* In ``w3microdata``, strip whitespace, newlines, etc. from URLs extracted from
HTML nodes.
* ``base_url`` replaces ``url`` in ``MicroformatExtractor``, ``JsonLdExtractor``,
``OpenGraphExtractor``, ``RDFaExtractor`` and ``MicrodataExtractor``.
* Individual extractors accept ``base_url`` instead of ``url``; unused keyword
arguments are removed.
* In ``w3microdata.extract_items`` ``items_seen`` and ``url`` are no longer
class variables but are passed as arguments.
* In ``w3microdata`` the following functions are now private:
``extract_item``, ``extract_property_value``, ``extract_textContent``,
``_extract_property``, ``_extract_properties``, ``_extract_property_refs``
and ``_extract_textContent``.
* In ``w3microdata`` ``_extract_properties``, ``_extract_property_refs``,
``_extract_property``, ``_extract_property_value`` and ``_extract_item``
now need ``items_seen`` and ``url`` to be passed as arguments.
* Add argument ``return_html_node`` to ``extract``; it returns the HTML
node along with the result of metadata extraction. It is supported only by
the microdata syntax.
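
The raise, log or ignore choice described for ``errors`` can be sketched as a small dispatch around an extractor call. This is an illustrative sketch, not extruct's code: ``run_extractor``, ``broken_extractor`` and the logger name are hypothetical, and the policy names ``"strict"``, ``"log"`` and ``"ignore"`` are assumed here.

```python
import logging

logger = logging.getLogger("extract_sketch")  # hypothetical logger name


def run_extractor(extractor, document, errors="strict"):
    """Call ``extractor(document)`` and apply the chosen error policy."""
    if errors not in ("strict", "log", "ignore"):
        raise ValueError("errors must be 'strict', 'log' or 'ignore'")
    try:
        return extractor(document)
    except Exception:
        if errors == "strict":
            raise  # propagate the original exception to the caller
        if errors == "log":
            logger.exception("extraction failed")
        return []  # 'log' and 'ignore' both fall back to an empty result


def broken_extractor(document):
    """Stand-in extractor that always fails."""
    raise RuntimeError("malformed markup")
```

With ``errors="ignore"`` or ``errors="log"`` the failing extractor yields an empty list; with ``errors="strict"`` the ``RuntimeError`` reaches the caller.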

Warning: backward-incompatible changes:

* ``base_url`` is used instead of ``url`` in ``extruct.extract``; ``url`` is
still supported but deprecated.
* In ``extruct.extract`` default ``base_url`` is now ``None`` to avoid wrong
results with ``urljoin``.
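
Why a ``None`` default is safer with ``urljoin`` can be seen with the standard library alone. The sketch below is a minimal illustration: the ``resolve`` helper and all URLs are hypothetical, and it does not reproduce extruct's internals.

```python
from urllib.parse import urljoin


def resolve(link, base_url=None):
    """Resolve ``link`` against ``base_url``; leave it untouched without a base."""
    if base_url is None:
        return link
    return urljoin(base_url, link)


relative = "images/photo.jpg"

# With the page's real URL the result is correct.
correct = resolve(relative, "https://shop.example/products/item.html")

# With a placeholder base the result looks absolute but points to the wrong place.
misleading = resolve(relative, "http://placeholder.invalid/")

# With no base the link stays relative, so nothing is silently mis-resolved.
unresolved = resolve(relative)
```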

0.4.0

-------------------

* New ``extruct`` command line tool to fetch a page and extract its metadata.
Works either via ``extruct`` directly or ``python -m extruct``.
* Accept leading HTML comment in JSON-LD payload.
* rdflib log messages were silenced to avoid the noise when importing extruct.
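
Accepting a leading HTML comment in a JSON-LD payload can be sketched as stripping one comment at the start of the payload before parsing. The regex, the ``parse_jsonld`` helper and the sample payload are assumptions made for illustration, not the pattern extruct uses.

```python
import json
import re

# Matches a single HTML comment at the very start of the payload
# (optionally preceded by whitespace). Hypothetical pattern.
LEADING_HTML_COMMENT = re.compile(r"^\s*<!--.*?-->", re.DOTALL)


def parse_jsonld(payload):
    """Drop one leading HTML comment, then parse the rest as JSON."""
    cleaned = LEADING_HTML_COMMENT.sub("", payload, count=1)
    return json.loads(cleaned)


payload = '<!-- generated by CMS -->\n{"@type": "Thing", "description": "a <!-- b --> c"}'
parsed = parse_jsonld(payload)
```

Because the pattern is anchored at the start, comment-like text inside a JSON string value is left alone.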

0.3.1

-------------------

* Fix dependencies and support RDFa by default (hence depend on rdflib by default).
* Update README with all-in-one extractor examples.

0.3.0

-------------------

* All extractors have an ``.extract_items()`` method, taking an lxml-parsed
document as input, if you want to reuse one you already have.
* Add generic extraction: use ``extruct.extract()`` to call all extractors
at once.

0.3.0a2

---------------------

Warning: backward-incompatible change:

* ``.extract()`` methods now return a list of Python dicts (the items)
instead of a dict with an "items" key having this list as value.
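
The shape change can be illustrated with plain data. The sketch below uses made-up items and a hypothetical ``items_of`` helper that code written against the old shape could use while migrating; it is not part of extruct.

```python
def items_of(result):
    """Return the list of items from either the old or the new return shape."""
    if isinstance(result, dict) and "items" in result:
        return result["items"]  # old shape: {"items": [...]}
    return result  # new shape: [...]


# Made-up sample data for the two shapes.
old_style = {"items": [{"type": "http://schema.org/Person"}]}
new_style = [{"type": "http://schema.org/Person"}]
```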
