Scrapy-poet
===========

0.10.1
-------------------

* More robust time freezing in the ``scrapy savefixture`` command.

0.10.0
-------------------

* Now requires ``web-poet >= 0.8.0``.

* The ``savefixture`` command now also saves requests made via the
  :class:`web_poet.page_inputs.client.HttpClient` dependency and their
  responses.

0.9.0
------------------

* Added support for item classes which are used as dependencies in page objects
  and spider callbacks. The following is now possible:

  .. code-block:: python

      import attrs
      import scrapy
      from web_poet import WebPage, handle_urls, field
      from scrapy_poet import DummyResponse


      @attrs.define
      class Image:
          url: str


      @handle_urls("example.com")
      class ProductImagePage(WebPage[Image]):
          @field
          def url(self) -> str:
              return self.css("product img ::attr(href)").get("")


      @attrs.define
      class Product:
          name: str
          image: Image


      @handle_urls("example.com")
      @attrs.define
      class ProductPage(WebPage[Product]):
          # ✨ NEW: The page object can ask for items as dependencies. An
          # instance of ``Image`` is injected behind the scenes by calling the
          # ``.to_item()`` method of ``ProductImagePage``.
          image_item: Image

          @field
          def name(self) -> str:
              return self.css("h1.name ::text").get("")

          @field
          def image(self) -> Image:
              return self.image_item


      class MySpider(scrapy.Spider):
          name = "myspider"

          def start_requests(self):
              yield scrapy.Request(
                  "https://example.com/products/some-product", self.parse_product
              )

          # ✨ NEW: We can directly use the item here instead of the page object.
          def parse_product(self, response: DummyResponse, item: Product) -> Product:
              return item


In line with this, the following new features were added:

* New :class:`scrapy_poet.page_input_providers.ItemProvider` which makes the
  usage above possible.

* An item class is now supported by :func:`scrapy_poet.callback_for`
  alongside the usual page objects. This means that it won't raise a
  :class:`TypeError` anymore when not passing a subclass of
  :class:`web_poet.pages.ItemPage`.

* New exception: :class:`scrapy_poet.injection_errors.ProviderDependencyDeadlockError`.
  This is raised when it's not possible to create the dependencies due to
  a deadlock in their sub-dependencies, e.g. due to a circular dependency
  between page objects.

* New setting named ``SCRAPY_POET_RULES``, with a default value of
  :meth:`web_poet.default_registry.get_rules <web_poet.rules.RulesRegistry.get_rules>`.
  This deprecates ``SCRAPY_POET_OVERRIDES``.

* New setting named ``SCRAPY_POET_DISCOVER`` to ensure that ``SCRAPY_POET_RULES``
  has properly loaded all intended rules annotated with the ``handle_urls``
  decorator.

* New utility functions in ``scrapy_poet.utils.testing``.

* The ``frozen_time`` value inside the :ref:`test fixtures <testing>` won't
  contain microseconds anymore.

* Supports the new :func:`scrapy.http.request.NO_CALLBACK` introduced in
  **Scrapy 2.8**. This means that the :ref:`pitfalls` (introduced in
  ``scrapy-poet==0.7.0``) don't apply when you're using Scrapy >= 2.8, unless
  you're using third-party middlewares which directly use the downloader to add
  :class:`scrapy.Request <scrapy.http.Request>` instances with callback set to
  ``None``. In that case, you need to set the callback value to
  :func:`scrapy.http.request.NO_CALLBACK`.

* Fix the :class:`TypeError` that's raised when using Twisted <= 21.7.0, since
  scrapy-poet previously used the ``twisted.internet.defer.Deferred[object]``
  type annotation, which was not subscriptable in those earlier Twisted versions.

* Fix the ``twisted.internet.error.ReactorAlreadyInstalledError`` error raised
  when using the ``scrapy savefixture`` command and Twisted < 21.2.0 is installed.

* Fix the test configuration that didn't follow the intended commands and
  dependencies in these tox environments: ``min``, ``asyncio-min``, and
  ``asyncio``. This ensures that page objects using ``asyncio`` work properly
  with the minimum specified Twisted version.

* Various improvements to tests and documentation.

* Backward incompatible changes:

  * For the :class:`scrapy_poet.page_input_providers.PageObjectInputProvider`
    base class:

    * It now accepts an instance of :class:`scrapy_poet.injection.Injector`
      in its constructor instead of :class:`scrapy.crawler.Crawler`. You can
      still access the :class:`scrapy.crawler.Crawler` via the
      ``Injector.crawler`` attribute.

    * :meth:`scrapy_poet.page_input_providers.PageObjectInputProvider.is_provided`
      is now an instance method instead of a class method.

  * The :class:`scrapy_poet.injection.Injector`'s attribute and constructor
    parameter called ``overrides_registry`` is now simply called ``registry``.

  * Removed the ``SCRAPY_POET_OVERRIDES_REGISTRY`` setting which overrides the
    default registry.

  * The ``scrapy_poet.overrides`` module, which contained ``OverridesRegistryBase``
    and ``OverridesRegistry``, has been removed. Instead, scrapy-poet directly
    uses :class:`web_poet.rules.RulesRegistry`.

    Everything works pretty much the same, except that
    :meth:`web_poet.rules.RulesRegistry.overrides_for` now accepts :class:`str`,
    :class:`web_poet.page_inputs.http.RequestUrl`, or
    :class:`web_poet.page_inputs.http.ResponseUrl` instead of
    :class:`scrapy.http.Request`.

  * This also means that the registry doesn't accept tuples as rules anymore.
    Only :class:`web_poet.rules.ApplyRule` instances are allowed. The same goes
    for ``SCRAPY_POET_RULES`` (and the deprecated ``SCRAPY_POET_OVERRIDES``).

  * The following type aliases have been removed:

    * ``scrapy_poet.overrides.RuleAsTuple``
    * ``scrapy_poet.overrides.RuleFromUser``
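
The two new settings can be wired together as in this minimal sketch;
``myproject.pages`` is a hypothetical module path standing in for wherever your
``handle_urls``-annotated page objects live:

.. code-block:: python

    # settings.py -- a sketch; the module path below is a placeholder.
    from web_poet import default_registry

    # Modules listed here are imported so that their @handle_urls decorators
    # run and register rules before get_rules() is called.
    SCRAPY_POET_DISCOVER = ["myproject.pages"]

    # This is already the default value; shown explicitly for illustration.
    SCRAPY_POET_RULES = default_registry.get_rules()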

0.8.0
------------------

* Now requires ``web-poet >= 0.7.0`` and ``time_machine``.

* Added a ``savefixture`` command that creates a test for a page object.
  See :ref:`testing` for more information.
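
For example, assuming a page object class importable as
``myproject.pages.ProductImagePage`` (a hypothetical path), a test fixture
could be generated with:

.. code-block:: shell

    scrapy savefixture myproject.pages.ProductImagePage "https://example.com/products/some-product"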

0.7.0
------------------

* Fixed the issue where a new page object containing new response data was not
  properly created when :class:`web_poet.exceptions.core.Retry` is raised.

* In order for the above fix to be possible, overriding the callback dependencies
  created by **scrapy-poet** via :attr:`scrapy.http.Request.cb_kwargs` is now
  unsupported. This is a **backward incompatible** change.

* Fixed the broken
  :meth:`scrapy_poet.page_input_providers.HttpResponseProvider.fingerprint`,
  which errors out when running a Scrapy job with the ``SCRAPY_POET_CACHE``
  setting enabled.

* Improved behavior when ``spider.parse()`` method arguments are supposed
  to be provided by **scrapy-poet**. Previously, it was causing
  unnecessary work in unexpected places like
  :class:`scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware`,
  :class:`scrapy.pipelines.images.ImagesPipeline` or
  :class:`scrapy.pipelines.files.FilesPipeline`. It is also a reason
  :class:`web_poet.page_inputs.client.HttpClient` might not be working
  in page objects. Now these cases are detected, and a warning is issued.

  As of Scrapy 2.7, it is not possible to fix the issue completely
  in **scrapy-poet**. Fixing it would require Scrapy changes; some 3rd party
  libraries may also need to be updated.

  .. note::

      The root of the issue is that when ``request.callback`` is ``None``,
      the ``parse()`` callback is normally assumed. But sometimes
      ``callback=None`` is used when a :class:`scrapy.http.Request` is added
      to Scrapy's downloader directly, in which case no callback is used.
      Middlewares, including **scrapy-poet**'s, can't distinguish between
      these two cases, which causes all kinds of issues.

  We recommend that all **scrapy-poet** users modify their code to
  avoid the issue. Please **don't** define a ``parse()``
  method with arguments which are supposed to be filled by **scrapy-poet**,
  and rename existing ``parse()`` methods if they have such arguments.
  Any other name is fine. This avoids all possible issues, including
  incompatibility with 3rd party middlewares or pipelines.

  See the new :ref:`pitfalls` documentation for more information.

  There are backwards-incompatible changes related to this issue.
  They only affect you if you don't follow the advice of avoiding a
  ``parse()`` method with **scrapy-poet** arguments:

  * When the ``parse()`` method has its response argument annotated with
    :class:`scrapy_poet.api.DummyResponse`, for instance
    ``def parse(self, response: DummyResponse)``, the response is downloaded
    instead of being skipped.

  * When the ``parse()`` method has dependencies that are provided by
    **scrapy-poet**, the
    :class:`scrapy_poet.downloadermiddlewares.InjectionMiddleware` won't
    attempt to build any dependencies anymore.

    This causes the following code to fail with ``TypeError: parse()
    missing 1 required positional argument: 'page'``:

    .. code-block:: python

        class MySpider(scrapy.Spider):
            name = "my_spider"
            start_urls = ["https://books.toscrape.com"]

            def parse(self, response: scrapy.http.Response, page: MyPage):
                ...

* :func:`scrapy_poet.injection.is_callback_requiring_scrapy_response` now accepts
  an optional ``raw_callback`` parameter meant to represent the actual ``callback``
  attribute value of :class:`scrapy.http.Request`, since the original ``callback``
  parameter could be normalized to the spider's ``parse()`` method when the
  :class:`scrapy.http.Request` has ``callback`` set to ``None``.

* Official support for Python 3.11.

* Various updates and improvements on docs and examples.
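
The callback ambiguity that ``raw_callback`` works around can be sketched in
plain Python; this is an illustration of the idea, not scrapy-poet's actual
implementation, and ``needs_response`` is a made-up name:

.. code-block:: python

    # By the time a downloader middleware sees a request, callback=None may
    # already have been normalized to spider.parse, hiding the original intent.
    def needs_response(callback, raw_callback):
        # raw_callback preserves the value originally set on the Request.
        if raw_callback is None:
            # The request was fed to the downloader directly; no callback will
            # run, so there is no point building scrapy-poet dependencies.
            return False
        # A real implementation would go on to inspect the callback's
        # annotations (e.g. for DummyResponse) to decide about the download.
        return True


    def parse(response):  # stands in for spider.parse after normalization
        ...


    assert needs_response(parse, raw_callback=None) is False
    assert needs_response(parse, raw_callback=parse) is True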

0.6.0
------------------

* Now requires ``web-poet >= 0.6.0``.

  * All examples in the docs and tests now use ``web_poet.WebPage``
    instead of ``web_poet.ItemWebPage``.
  * The new ``instead_of`` parameter of the ``handle_urls`` decorator
    is now preferred over the deprecated ``overrides`` parameter.
  * ``scrapy_poet.callback_for`` doesn't require an implemented ``to_item``
    method anymore.
  * The new ``web_poet.rules.RulesRegistry`` is used instead of the old
    ``web_poet.overrides.PageObjectRegistry``.
  * The registry now uses ``web_poet.ApplyRule`` instead of
    ``web_poet.OverrideRule``.

* Provider for ``web_poet.ResponseUrl`` is added, which allows accessing the
  response URL in the page object. Unlike the provider for
  ``web_poet.RequestUrl``, this triggers a download.
* Fixed the error when using ``scrapy shell`` while
  ``scrapy_poet.InjectionMiddleware`` is enabled.
* Fixes and improvements on code and docs.
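
What an ``instead_of`` rule expresses can be sketched with a simplified,
pure-Python stand-in for ``web_poet.ApplyRule``; the class and function names
here are made up for illustration:

.. code-block:: python

    from dataclasses import dataclass


    @dataclass(frozen=True)
    class ApplyRule:  # much-simplified stand-in for web_poet.ApplyRule
        for_patterns: str  # domain the rule applies to
        use: type          # page object to inject
        instead_of: type   # page object the callback originally asked for


    class GenericProductPage: ...
    class ExampleProductPage: ...


    RULES = [ApplyRule("example.com", use=ExampleProductPage,
                       instead_of=GenericProductPage)]


    def resolve(requested: type, domain: str) -> type:
        """Return the replacement page object type, if a rule matches."""
        for rule in RULES:
            if rule.instead_of is requested and rule.for_patterns == domain:
                return rule.use
        return requested


    assert resolve(GenericProductPage, "example.com") is ExampleProductPage
    assert resolve(GenericProductPage, "other.com") is GenericProductPage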
