Advertools

Latest version: v0.14.2

0.11.0

-------------------

* Added
- New parameter `recursive` for ``sitemap_to_df`` to control whether to
fetch all sub-sitemaps (the default) or only the current (sitemap index)
file.
- New columns for ``sitemap_to_df``: ``sitemap_size_mb``
(1 MB = 1,024 × 1,024 bytes), and ``sitemap_last_modified`` and ``etag``
(if available).
- Option to request multiple robots.txt files with ``robotstxt_to_df``.
- Option to save downloaded robots DataFrame(s) to a file with
``robotstxt_to_df`` using the new parameter ``output_file``.
- Two new columns for ``robotstxt_to_df``: ``robotstxt_last_modified`` and
``etag`` (if available).
- Raise `ValueError` in ``crawl`` if ``css_selectors`` or
``xpath_selectors`` contain any of the default crawl column headers.
- New XPath code recipes for custom extraction.
- New function ``crawllogs_to_df`` which converts crawl logs to a DataFrame
provided they were saved while using the ``crawl`` function.
- New columns in ``crawl``: `viewport`, `charset`, all available `h`
headings, and nav, header, and footer links and text, where available.
- Crawl errors no longer stop the crawl; the error message is included
in the output file under a new `errors` and/or `jsonld_errors` column.
- When JSON-LD errors occur, they are reported in their respective
column, and the rest of the page is still scraped.
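Taken together, the new ``sitemap_to_df`` and ``robotstxt_to_df`` options can be used along these lines. This is a sketch with hypothetical URLs and file names; both calls perform network requests:

```python
import advertools as adv

# Fetch only the sitemap index itself; recursive=True (the default)
# would also fetch every sub-sitemap it links to.
index_df = adv.sitemap_to_df(
    "https://www.example.com/sitemap_index.xml",  # hypothetical URL
    recursive=False,
)
# Where the server provides them, the result also includes the new
# 'sitemap_size_mb', 'sitemap_last_modified', and 'etag' columns.

# Download several robots.txt files in one call and append the combined
# DataFrame to a .jl (JSON lines) file instead of returning it.
adv.robotstxt_to_df(
    ["https://www.example.com/robots.txt",
     "https://www.example.org/robots.txt"],
    output_file="robots_downloads.jl",  # hypothetical file name
)
```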

* Changed
- Removed the column prefix `resp_meta_` from columns containing it.
- Redirect URLs and reasons are separated by '@@' for consistency with
other multiple-value columns.
- Links extracted while crawling are no longer deduplicated (all links
are extracted).
- Emoji data updated with v13.1.
- Heading tags are scraped even if they are empty, e.g. <h2></h2>.
- Default user agent for crawling is now advertools/VERSION.

* Fixed
- Handle sitemap index files that contain links to themselves; an error
message is included in the final DataFrame.
- Error in robots.txt files caused by comments preceded by whitespace
- Zipped robots.txt files causing a parsing issue
- Crawl issues on some Linux systems when providing a long list of URLs

* Removed
- Columns from the ``crawl`` output: `url_redirected_to`, `links_fragment`

0.10.7

-------------------

* Added
- New function ``knowledge_graph`` for querying Google's API
- Faster ``sitemap_to_df`` with threads
- New parameter `max_workers` for ``sitemap_to_df`` to set the number of
concurrent download threads.
- New parameter `capitalize_adgroups` for ``kw_generate`` to control
whether to keep ad group names as provided or convert them to title case
(the default).

* Fixed
- Remove restrictions on the number of URLs provided to ``crawl``,
assuming `follow_links` is set to `False` (list mode)
- JSON-LD issue breaking crawls when it's invalid (now skipped)

* Removed
- Deprecated ``youtube.guide_categories_list``, which is no longer
supported by the API.

0.10.6

-------------------

* Added
- JSON-LD support in crawling. If available on a page, JSON-LD items will
have special columns, and multiple JSON-LD snippets will be numbered for
easy filtering
* Changed
- Stricter parsing for rel attributes, making sure they are in link
elements as well
- Date column names for ``robotstxt_to_df`` and ``sitemap_to_df`` unified
as "download_date"
- Numbering of OG, Twitter, and JSON-LD elements, where multiple are
present on the same page, follows a unified approach: no number for the
first element, with numbering starting at "1" from the second element
on: "element", "element_1", "element_2", etc.
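Because of this numbering scheme, all snippets of one type can be selected with a single column filter. A minimal sketch using pandas, with made-up column values in the style of a crawl DataFrame:

```python
import pandas as pd

# Hypothetical crawl output: the first JSON-LD snippet gets un-numbered
# columns, the second gets the "_1" suffix, the third would get "_2",
# and so on.
crawl_df = pd.DataFrame({
    "url": ["https://example.com/article"],
    "jsonld_@type": ["Article"],
    "jsonld_1_@type": ["BreadcrumbList"],
    "og:title": ["Example article"],
})

# Select every JSON-LD column, numbered or not.
jsonld_cols = crawl_df.filter(regex=r"^jsonld_")
print(list(jsonld_cols.columns))  # ['jsonld_@type', 'jsonld_1_@type']
```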

0.10.5

-------------------

* Added
- New features for the ``crawl`` function:
* Extract canonical tags if available
* Extract alternate `href` and `hreflang` tags if available
* Open Graph data "og:title", "og:type", "og:image", etc.
* Twitter cards data "twitter:site", "twitter:title", etc.
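A sketch of how these extra extractions surface in a crawl, using a hypothetical URL and output file (``crawl`` performs network requests and writes to disk):

```python
import advertools as adv
import pandas as pd

# Crawl a single page in list mode; the output file must use the
# .jl (JSON lines) extension.
adv.crawl(["https://www.example.com/"], "example_crawl.jl",
          follow_links=False)

crawl_df = pd.read_json("example_crawl.jl", lines=True)
# Where present on the page, look for columns such as 'canonical',
# 'alternate_href', 'alternate_hreflang', 'og:title', 'og:image',
# 'twitter:site', and 'twitter:title'.
```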

* Fixed
- Minor fixes to ``robotstxt_to_df``:
* Allow whitespace in fields
* Allow case-insensitive fields

* Changed
- ``crawl`` now only supports `output_file` with the extension ".jl"
- ``word_frequency`` drops `wtd_freq` and `rel_value` columns if `num_list`
is not provided

0.10.4

-------------------

* Added
- New function ``url_to_df``, which splits URLs into their components
and returns them as a DataFrame.
- Slight speed up for ``robotstxt_test``

0.10.3

-------------------

* Added
- New function ``robotstxt_test``, which tests whether specified
user agents are allowed to fetch given URLs.
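Usage is along these lines (a sketch with a hypothetical robots.txt URL; the file is fetched over the network):

```python
import advertools as adv

# Returns one row per (user agent, URL) combination, with a boolean
# column indicating whether that agent may fetch that URL.
test_df = adv.robotstxt_test(
    "https://www.example.com/robots.txt",
    user_agents=["Googlebot", "*"],
    urls=["/", "/admin/", "/products/shoes"],
)
```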

* Changed
- Documentation main page relayout, grouping of topics, & sidebar captions
- Various documentation clarifications and new tests
