Changelogs » Advertools


  * Added
  - JSON-LD support in crawling. If available on a page, JSON-LD items will
  have special columns, and multiple JSON-LD snippets will be numbered for
  easy filtering
  * Changed
  - Stricter parsing for rel attributes, making sure they are in link
  elements as well
  - Date column names for ``robotstxt_to_df`` and ``sitemap_to_df`` unified
  as "download_date"
  - Numbering OG, Twitter, and JSON-LD where multiple elements are present in
  the same page, follows a unified approach: no numbering for the first
  element, and numbers start with "1" from the second element on. "element",
  "element_1", "element_2" etc.
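
The numbering rule above can be illustrated with a minimal sketch. The helper name `number_columns` is hypothetical and is not part of advertools; it only shows how the first element keeps the bare name while later ones get a numeric suffix.

```python
def number_columns(base, count):
    """Return column names for `count` repeated elements named `base`.

    Hypothetical helper for illustration only: the first element keeps
    the bare name, later ones are suffixed with _1, _2, and so on.
    """
    return [base if i == 0 else f"{base}_{i}" for i in range(count)]


# Three elements of the same kind found on one page
print(number_columns("element", 3))
```

Filtering on the shared prefix then selects every numbered variant of an element at once.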


  * Added
  - New features for the ``crawl`` function:
  * Extract canonical tags if available
  * Extract alternate `href` and `hreflang` tags if available
  * Open Graph data "og:title", "og:type", "og:image", etc.
  * Twitter cards data "twitter:site", "twitter:title", etc.
  * Fixed
  - Minor fixes to ``robotstxt_to_df``:
  * Allow whitespace in fields
  * Allow case-insensitive fields
  * Changed
  - ``crawl`` now only supports `output_file` with the extension ".jl"
  - ``word_frequency`` drops `wtd_freq` and `rel_value` columns if `num_list`
  is not provided
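
Since crawl output is restricted to ".jl" (JSON lines) files, a short sketch of reading that format may help. It uses only the standard library, and the file content written here is a made-up stand-in for real crawl output; the column names are assumptions.

```python
import json
import os
import tempfile


def read_jsonlines(path):
    """Read a JSON lines file into a list of dicts, one per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Made-up rows standing in for crawl output (assumed column names)
rows_in = [
    {"url": "https://example.com/", "title": "Home"},
    {"url": "https://example.com/a", "title": "Page A"},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "crawl_output.jl")
    with open(path, "w", encoding="utf-8") as f:
        for row in rows_in:
            f.write(json.dumps(row) + "\n")
    rows_out = read_jsonlines(path)

print(rows_out)
```

With pandas installed, `pandas.read_json(path, lines=True)` reads the same format directly into a DataFrame.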


  * Added
  - New function ``url_to_df``, splitting URLs into their components and to a
  - Slight speed up for ``robotstxt_test``
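
To illustrate what splitting URLs into components can look like, here is a sketch based on the standard library's `urllib.parse`. The function `split_url` and its output keys are assumptions made for this example; they are not the actual columns produced by ``url_to_df``.

```python
from urllib.parse import parse_qs, urlsplit


def split_url(url):
    """Split one URL into a flat dict of components (illustrative keys)."""
    parts = urlsplit(url)
    return {
        "url": url,
        "scheme": parts.scheme,
        "netloc": parts.netloc,
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
        # Each query parameter maps to a list of its values
        "query_params": parse_qs(parts.query),
    }


row = split_url("https://example.com/shop/shoes?color=red&size=9#reviews")
print(row)
```

A list of such dicts, one per URL, can be turned into a table with one row per URL and one column per component.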


  * Added
  - New function ``robotstxt_test``, testing URLs and whether they can be
  fetched by certain user-agents
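
The idea of testing URLs against user-agents can be sketched with the standard library's `urllib.robotparser`. This is not how ``robotstxt_test`` is implemented; the robots.txt content, the user-agent names, and the helper `can_fetch_table` are all invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content for illustration
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: examplebot
Disallow: /
"""


def can_fetch_table(robots_text, user_agents, urls):
    """Return one row per (user-agent, URL) pair with the fetch verdict."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [
        {"user_agent": ua, "url": url, "can_fetch": parser.can_fetch(ua, url)}
        for ua in user_agents
        for url in urls
    ]


results = can_fetch_table(
    ROBOTS_TXT,
    user_agents=["otherbot", "examplebot"],
    urls=["https://example.com/page", "https://example.com/private/x"],
)
for r in results:
    print(r)
```

Testing every combination of user-agent and URL in one call makes it easy to spot rules that block more than intended.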
  * Changed
  - Documentation main page relayout, grouping of topics, & sidebar captions
  - Various documentation clarifications and new tests


  * Added
  - User-Agent info to requests getting sitemaps and robotstxt files
  - CSS/XPath selectors support for the crawl function
  - Support for custom spider settings with a new parameter ``custom_settings``
  * Fixed
  - Update changed supported search operators and values for CSE


  * Changed
  - Links are better handled, and new output columns are available:
  ``links_url``, ``links_text``, ``links_fragment``, ``links_nofollow``
  - ``body_text`` extraction is improved by containing <p>, <li>, and <span>


  * Added
  - New function ``crawl`` for crawling and parsing websites
  - New function ``robotstxt_to_df`` downloading robots.txt files into


  * Added
  - Ability to specify robots.txt file for ``sitemap_to_df``
  - Ability to retrieve any kind of sitemap (news, video, or images)
  - Errors column to the returned DataFrame if any errors occur
  - A new ``sitemap_downloaded`` column showing datetime of getting the
  * Fixed
  - Logging issue causing ``sitemap_to_df`` to log the same action twice
  - Issue preventing URLs not ending with xml or gz from being retrieved
  - Correct sitemap URL showing in the ``sitemap`` column


  * Added
  - New function ``sitemap_to_df`` imports an XML sitemap into a


  * Changed
  - Column `query_time` is now named `queryTime` in the `youtube` functions
  - Handle json_normalize import from pandas based on pandas version


  * Added
  - New module `youtube` connecting to all GET requests in the API
  - `extract_numbers` new function
  - `emoji_search` new function
  - `emoji_df` new variable containing all emoji as a DataFrame
  * Changed
  - Emoji database updated to v13.0
  - `serp_goog` with expanded `pagemap` and metadata
  * Fixed
  - `serp_goog` errors, some parameters not appearing in result
  - `extract_numbers` issue when providing dash as a separator
  in the middle


  * Added
  - New function `extract_exclamations` very similar to
  - New function `extract_urls`, also counts top domains and
  top TLDs
  - New keys to `extract_emoji`: `top_emoji_categories`
  & `top_emoji_sub_categories`
  - Groups and sub-groups to `emoji db`


  * Changed
  - `Emoji regex` updated
  - Simpler extraction of Spanish `questions`


  * Fixed
  - Missing __init__ imports.


  * Added
  - New `extract_` functions:
  * Generic `extract` used by all others, and takes
  arbitrary regex to extract text.
  * `extract_questions` to get question mark statistics, as
  well as the text of questions asked.
  * `extract_currency` shows text that has currency symbols in it, as
  well as surrounding text.
  * `extract_intense_words` gets statistics about, and extracts, words with
  any character repeated three or more times, indicating an intense
  feeling (+ve or -ve).
  - New function `word_tokenize`:
  * Used by `word_frequency` to get tokens of
  1,2,3-word phrases (or more).
  * Split a list of text into tokens of a specified number of words each.
  - New stop-words from the ``spaCy`` package:
  **current:** Arabic, Azerbaijani, Danish, Dutch, English, Finnish,
  French, German, Greek, Hungarian, Italian, Kazakh, Nepali, Norwegian,
  Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.
  **new:** Bengali, Catalan, Chinese, Croatian, Hebrew, Hindi, Indonesian,
  Irish, Japanese, Persian, Polish, Sinhala, Tagalog, Tamil, Tatar, Telugu,
  Thai, Ukrainian, Urdu, Vietnamese
  * Changed
  - `word_frequency` takes new parameters:
  * `regex` defaults to words, but can be changed to anything, for example
  '\S+' to split words and keep punctuation.
  * `sep` no longer used as an option; the above `regex` can
  be used instead
  * `num_list` now optional, and defaults to counts of 1 each if not
  provided. Useful for counting `abs_freq` only if data not
  * `phrase_len` the number of words in each split token. Defaults
  to 1 and can be set to 2 or higher. This helps in analyzing phrases
  as opposed to words.
  - Parameters supplied to `serp_goog` appear at the beginning
  of the result df
  - `serp_youtube` now contains `nextPageToken` to make
  paginating requests easier
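
The `phrase_len` and `regex` parameters described above can be illustrated with a small sketch of phrase tokenization and absolute-frequency counting. The functions `tokenize_phrases` and `absolute_frequency` are hypothetical stand-ins, not the library's `word_tokenize` or `word_frequency`, and lowercasing the text is an assumption of this example.

```python
import re
from collections import Counter


def tokenize_phrases(text_list, phrase_len=1, regex=r"\w+"):
    """Split each text into tokens of `phrase_len` consecutive words.

    Lowercasing is an assumption made for this sketch.
    """
    result = []
    for text in text_list:
        words = re.findall(regex, text.lower())
        result.append([
            " ".join(words[i:i + phrase_len])
            for i in range(len(words) - phrase_len + 1)
        ])
    return result


def absolute_frequency(text_list, phrase_len=1, regex=r"\w+"):
    """Count each phrase once per occurrence (no weighting list supplied)."""
    counts = Counter()
    for tokens in tokenize_phrases(text_list, phrase_len, regex):
        counts.update(tokens)
    return counts


docs = ["blue shoes for sale", "red shoes for sale"]
freq = absolute_frequency(docs, phrase_len=2)
print(freq.most_common())

# A '\S+' pattern keeps punctuation attached to words
with_punct = tokenize_phrases(["Hello, world!"], regex=r"\S+")
print(with_punct)
```

Counting each occurrence as 1 corresponds to the case where no `num_list` is given, so only absolute frequency is meaningful.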


  * New function
  - `extract_words` to extract an arbitrary set of words
  * Minor updates
  - `ad_from_string` slots argument reflects new text
  ad lengths
  - `hashtag` regex improved


  * Fix minor bugs
  - Handle Twitter search queries with 0 results in final request


  * Fix minor bugs
  - Properly handle requests for >50 items (`serp_youtube`)
  - Rewrite test for _dict_product
  - Fix issue with string printing error msg


  * Fix minor bugs
  - _dict_product implemented with lists
  - Missing keys in some YouTube responses


  * New function `serp_youtube`
  - Query YouTube API for videos, channels, or playlists
  - Multiple queries (product of parameters) in one function call
  - Response looping and merging handled, one DataFrame
  * `serp_goog` returns Google's original error messages
  * Twitter responses with entities get the entities extracted, each in a
  separate column
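
"Multiple queries (product of parameters)" can be pictured as expanding a dict of parameter lists into one dict per combination. The sketch below uses `itertools.product`; the function `dict_product` here is an illustrative reimplementation under that assumption, not the library's internal `_dict_product`, and the parameter values are examples.

```python
from itertools import product


def dict_product(params):
    """Expand a dict of lists into one dict per combination of values.

    Scalar values are treated as single-item lists.
    """
    keys = list(params)
    value_lists = [
        v if isinstance(v, (list, tuple)) else [v] for v in params.values()
    ]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


requests_to_make = dict_product({
    "q": ["python", "pandas"],
    "type": ["video", "channel"],
    "maxResults": 10,
})
for params in requests_to_make:
    print(params)
```

Each resulting dict would correspond to one request, and the responses could then be merged into a single table.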


  * New function `serp_goog` (based on Google CSE)
  - Query Google search and get the result in a DataFrame
  - Make multiple queries / requests in one function call
  - All responses merged in one DataFrame
  * twitter.get_place_trends results are ranked by town and country


  * New Twitter module based on twython
  - Wraps 20+ functions for getting Twitter API data
  - Gets data in a pandas DataFrame
  - Handles looping over requests higher than the defaults
  * Tested on Python 3.7


  * Search engine marketing cheat sheet.
  * New set of `extract_` functions with summary stats for each:
  * extract_hashtags
  * extract_mentions
  * extract_emoji
  * Tests and bug fixes
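
As a rough picture of extraction "with summary stats", the following sketch pulls hashtags out of a list of posts and reports per-post counts and overall top hashtags. The pattern `#\w+`, the function `hashtag_summary`, and the output keys are assumptions for this example; the library's own regex and result keys may differ.

```python
import re
from collections import Counter

# Simplified hashtag pattern (an assumption, not the library's regex)
HASHTAG = re.compile(r"#\w+")


def hashtag_summary(text_list):
    """Extract hashtags per post and summarize them across all posts."""
    per_post = [HASHTAG.findall(text) for text in text_list]
    flat = [tag.lower() for tags in per_post for tag in tags]
    return {
        "hashtags": per_post,
        "hashtag_counts": [len(tags) for tags in per_post],
        "top_hashtags": Counter(flat).most_common(),
    }


posts = ["Loving #Python and #pandas", "no tags here", "#python again"]
summary = hashtag_summary(posts)
print(summary)
```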


  * New set of kw_<match-type> functions.
  * Full testing and coverage.


  * First release on PyPI.
  * Functions available:
  - ad_create: create a text ad by placing words in placeholders
  - ad_from_string: split a long string into shorter strings that fit into
  given slots
  - kw_generate: generate keywords from lists of products and words
  - url_utm_ga: generate a UTM-tagged URL for Google Analytics tracking
  - word_frequency: measure the absolute and weighted frequency of words in
  a collection of documents
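
To show what a UTM-tagged URL looks like, here is a minimal sketch using `urllib.parse`. The function `add_utm_params` and its argument names are hypothetical and do not reflect the signature of url_utm_ga; only the three most common UTM parameters are handled.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def add_utm_params(url, source, medium, campaign):
    """Append utm_source, utm_medium and utm_campaign to a URL.

    Existing query parameters and the fragment are preserved.
    """
    parts = urlsplit(url)
    utm = urlencode({
        "utm_source": source,
        "utm_medium": medium,
        "utm_campaign": campaign,
    })
    query = f"{parts.query}&{utm}" if parts.query else utm
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, query, parts.fragment)
    )


tagged = add_utm_params(
    "https://example.com/shoes", "newsletter", "email", "spring sale"
)
print(tagged)
```

Encoding the values with `urlencode` keeps spaces and special characters in campaign names from breaking the URL.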