Changelogs » Scrapydd




* docker image changed to a debian-python3.8 base image
  * add grpc and restapi for node/server communication.
  * add a client module supporting both REST and gRPC clients, remove the old async HTTP client.
  * add openapi doc.
  * modify the node register / login process.
  * admin can delete nodes on the web UI.
  * add node name field, auto-populated with a random name on creation.
  * change to bigint type, using the Twitter Snowflake algorithm to generate new values.
  * restructure some managers to receive a session parameter, giving the caller better control.
  * (Important) enable_authentication default value changed to false.
  * fix key generation issue.
  * fix: jobs were not scheduled concurrently across spiders (caused by a global counter).
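The Snowflake-based bigint IDs mentioned above could be generated roughly as follows. This is a generic sketch of the standard Twitter layout (41-bit timestamp, 10-bit node id, 12-bit sequence), not Scrapydd's actual implementation; the `SnowflakeGenerator` name is illustrative.

```python
import threading
import time

class SnowflakeGenerator:
    """Minimal Twitter-Snowflake-style 64-bit ID generator:
    41 bits of milliseconds since an epoch, 10 bits of node id,
    and a 12-bit per-millisecond sequence counter."""

    EPOCH = 1288834974657  # Twitter's original epoch, in milliseconds

    def __init__(self, node_id):
        if not 0 <= node_id < 1024:
            raise ValueError("node_id must fit in 10 bits")
        self.node_id = node_id
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF
                if self.sequence == 0:
                    # sequence exhausted for this millisecond: spin to the next one
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.node_id << 12) | self.sequence

gen = SnowflakeGenerator(node_id=1)
a, b = gen.next_id(), gen.next_id()
```

IDs generated this way are time-ordered and collision-free per node, which is why they work well for distributed node/job records.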


* fix: docker runner AttributeError
  * admin users page.
  * move jobs page into the admin pages.
  * relate projects to an owner user, add related authorization.
  * fix: delete project error.
  * upgrade dependencies: tornado v6.0, scrapy v2.1
  * a page for creating an empty project, and a separate page for uploading a package to an existing project.
  * a project info page.
  * share the database session instance in the request processing context.
  * fix: IO error when uploading a failed job's log.


* Fix `scrapy package` command error.
  * ProjectWorkspace subprocess cross-platform compatibility, including PIPE length limitation and encoding.
  * Server-side job-killing timeout.
  * Fix: upload stream file not closed.
  * Webhook uses a new post mechanism.
  * Add GetJob and GetJobItems REST APIs.
  * Dockerfile adds only necessary files.
  * Add new Runner2, supporting a plugin system.


* add storage_version to the Project entity
  * add default_project_storage_version configuration; version "2" supports data stored per project.
  * support project deletion.
  * add ProjectPackage entity to store egg-related information.
  * add "runner_type" config, supporting the "docker" runner type.
  * add "runner_docker_image" config: the docker image for the runner.


* modify tag matching rule: an agent with no tag can no longer run tagged spiders.
  * tag matching uses the spider's tag; modifying a spider's tag after a job has been created will affect the existing job.
  * enhance authentication checks of web views and APIs.
  * make the HTTP stream tool Python 3 compatible.
  * move agent-related handlers into the handlers package.
  * fix: job completion failure issue.
  * Workspace list_spiders returns a list of str.
  * add new unit tests.
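One plausible reading of the tag rule above can be sketched in a few lines. The `can_run` name and the exact semantics for tagged agents are assumptions; the changelog only states that an untagged agent cannot run tagged spiders.

```python
def can_run(agent_tags, spider_tag):
    """Illustrative tag-matching check (names are assumptions):
    an untagged spider runs only on an untagged agent, and a
    tagged spider requires an agent carrying that tag."""
    if spider_tag is None:
        return not agent_tags
    return spider_tag in agent_tags
```

Because matching reads the spider's current tag at dispatch time, retagging a spider changes where its already-queued jobs may run, as the release note warns.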


* enhance security: add XSRF protection.
  * passwords hashed with SHA1.
  * support both HMAC and Basic Authentication for API endpoints.
  * fix error on the node key page.
  * remove scrapyd dependency.
  * runner supports the SCRAPY_EXTRA_SETTINGS_MODULE env parameter.
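HMAC API authentication of the kind listed above usually works by signing each request with a shared secret. The fields signed here (method, path, body hash) and the header names are assumptions for illustration; Scrapydd's actual signature layout is not shown in the changelog.

```python
import hashlib
import hmac

def sign_request(app_key, app_secret, method, path, body=b""):
    """Build illustrative HMAC auth headers for one request.
    Header names and the signed-message layout are assumptions."""
    message = "\n".join([method, path, hashlib.sha1(body).hexdigest()])
    digest = hmac.new(app_secret.encode(), message.encode(), hashlib.sha1).hexdigest()
    return {"X-App-Key": app_key, "X-Signature": digest}

def verify(headers, secrets, method, path, body=b""):
    """Server side: recompute the digest with the shared secret and
    compare in constant time."""
    secret = secrets.get(headers.get("X-App-Key"))
    if secret is None:
        return False
    expected = sign_request(headers["X-App-Key"], secret, method, path, body)
    return hmac.compare_digest(expected["X-Signature"], headers["X-Signature"])

headers = sign_request("agent-1", "s3cret", "GET", "/jobs")
```

Using `hmac.compare_digest` rather than `==` avoids timing side channels when comparing signatures.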

0.6.1 not secure

* add a new spider setting "extra_settings" to control installing a spider's additional requirements.
  * support reading config from environment variables.
  * new docker build.
  * override BOT_NAME with the project_name on the server side.
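Reading config from environment variables typically means checking the environment before falling back to file-based defaults. The `SCRAPYDD_` prefix and precedence order below are assumptions for illustration, not Scrapydd's documented behaviour.

```python
import os

def get_config(name, default=None):
    """Return a setting from the environment if present, otherwise the
    fallback value. The SCRAPYDD_ prefix is an assumed convention."""
    env_name = "SCRAPYDD_" + name.upper()
    if env_name in os.environ:
        return os.environ[env_name]
    return default

os.environ["SCRAPYDD_BIND_PORT"] = "6801"
port = get_config("bind_port", default="6800")
```

Environment-first lookup is what makes this style of configuration convenient in docker builds like the one added in this release.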

0.6.0 not secure

* new project egg download api.
  * fix: error occurred when a project has no dependencies.
  * fix: server- and client-side memory leak; fetching project dependencies kept importing the egg into the current process, now requirements are read directly from the egg (zip) file.
  * move project operations into ProjectWorkspace; the workspace executes in a tmp folder on both server and agent side.
  * add agent error log file.
  * fix: incorrect requirements when installing this package.
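Reading requirements straight out of the egg (zip) archive, rather than importing it, avoids the memory leak described above. `EGG-INFO/requires.txt` is the conventional setuptools location inside an egg; treat the path and the `read_requirements` name as assumptions.

```python
import io
import zipfile

def read_requirements(egg_path):
    """Read the requirements list from an egg (zip) archive without
    importing it into the current process."""
    with zipfile.ZipFile(egg_path) as egg:
        try:
            data = egg.read("EGG-INFO/requires.txt")
        except KeyError:
            # egg declares no requirements
            return []
        return [line.strip() for line in data.decode().splitlines() if line.strip()]

# build a tiny in-memory egg to demonstrate
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as egg:
    egg.writestr("EGG-INFO/requires.txt", "scrapy>=2.1\nrequests\n")
reqs = read_requirements(buf)
```

Since nothing is imported, repeated dependency lookups no longer accumulate modules in the server or client process.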

0.5.0 not secure

* add tag system.
  * add database_url configuration

0.4.21 not secure

* job failure email notification.
  * spider parameters.
  * add config file location ./conf/scrapydd.conf

0.4.18 not secure

- fix: error occurred when modifying spider setting.

0.4.17 not secure

- agent: add configuration request_timeout
  - fix: add_schedule.json issue
  - server & agent: kill the process when a job times out
  - move webhook setting into spider settings
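Killing a job process on timeout can be sketched generically with the standard library; this is not Scrapydd's actual runner code, and `run_job` is an illustrative name.

```python
import subprocess
import sys

def run_job(cmd, timeout):
    """Run a job command, killing it if it exceeds the timeout.
    Returns the exit code, or None if the job was killed."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()   # force-terminate the overdue job
        proc.wait()   # reap the process so it does not linger as a zombie
        return None

exit_code = run_job([sys.executable, "-c", "pass"], timeout=30)
```

Both server and agent applying the same rule means a hung spider is reclaimed even if one side misses the deadline.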

0.4.16 not secure

- fix: agent's job-complete request sometimes timed out; timeout increased to 60 secs

0.4.15 not secure

- fix: new trigger failed to fire job start

0.4.14 not secure

- fix: TaskExecutor memory leak
  - spider view, show project name
  - add listversions.json interface

0.4.13 not secure

- fix: webhook store too many files, never clear up.

0.4.12 not secure

- fix: job's pid not updated.
  - server: set job status to WARNING when there are errors/warnings in the log.

0.4.11 not secure

- fix: error handling for an invalid cron expression when creating a trigger
  - fix: unhandled agent node-expired exception
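The invalid-cron fix above concerns validation like the following minimal sanity check for a 5-field expression. This is not Scrapydd's parser: real cron syntax also allows names, ranges, and step values (`*/5`) that this sketch ignores.

```python
def validate_cron(expr):
    """Reject obviously malformed 5-field cron expressions by checking
    field count and numeric ranges. Illustrative only."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("cron expression needs 5 fields")
    bounds = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]  # min, hour, dom, month, dow
    for field, (lo, hi) in zip(fields, bounds):
        if field == "*":
            continue
        for part in field.split(","):
            if not part.isdigit() or not lo <= int(part) <= hi:
                raise ValueError("bad cron field: %r" % field)
    return True
```

Raising a specific error here lets the trigger-creation handler report the bad expression to the user instead of failing later at schedule time.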

0.4.10 not secure

- server and client: SSL validation, supporting both client and server validation
  - agent: isolate the spider's workdir in a tmp folder

0.4.9 not secure

- server: spider concurrency control.

0.4.8 not secure

- server: project upload checks parts before reading the items log
  - server: error handling for queued jobs with no spider or project
  - agent: remove eggs after the spider run
  - delete eggs when deleting a project
  - workspace init: return if an exception is caught
  - home page shows the spider's last status
  - delproject.json
  - fix: uploading a project did not update the project version

0.4.7 not secure

- webui: remove trigger
  - cluster state sync

0.4.6 not secure

- handle subprocess errors when uploading a project
  - upload project: test the egg before storing it
  - items page
  - read log file path from the job_history table

0.4.5 not secure

- fix: agent python run path incorrect on the win platform
  - fix: agent workspace-init future sync wait throws an exception
  - server forks multiple subprocesses, shutting all of them down on closing
  - server: item counts read from the log file.
  - web page modifications.
  - fix: webhook payload_url not updated when editing

0.4.3 not secure

- agent daemon mode, agent cmdline help
  - server daemon stdout/stderr redirected to /dev/null
  - server: project upload runs non-blocking.
  - server and agent: run projects in an isolated virtualenv
  - fix: item file not completely written on job completion

0.4.2 not secure

- agent logging file
  - agent removes local temp files on job completion
  - agent keeps local files in the workspace folder
  - server checks node id and job status when committing a job-complete request

0.4.1 not secure

- job history limitation
  - webhook uses an isolated file instead of the job's original items file; the file is cleared when the webhook job completes.

0.4.0 not secure

- Webhook

0.3.3 not secure

- stream-upload items file
  - async upload of items via file stream on task completion
  - agent redirects the executed process's stdout to a file descriptor
  - agent uses a slot container to control concurrency, instead of the queue
  - fix: agent removes the tmp egg file after the requirements test
  - heartbeat posts running jobs
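A slot container for concurrency control, as mentioned above, can be sketched with a bounded semaphore: the agent claims one of N slots before starting a job and returns it on completion. The `SlotContainer` class and its methods are illustrative names, not Scrapydd's API.

```python
import threading

class SlotContainer:
    """Slot-based concurrency control: at most `slots` jobs run at once.
    A non-blocking acquire lets the agent simply report 'no free slot'
    instead of queueing."""

    def __init__(self, slots):
        self._sem = threading.BoundedSemaphore(slots)

    def acquire(self):
        # returns False immediately when all slots are taken
        return self._sem.acquire(blocking=False)

    def release(self):
        self._sem.release()

slots = SlotContainer(2)
got = [slots.acquire() for _ in range(3)]   # third acquire fails
slots.release()                              # one job finished, slot freed
```

Unlike a work queue, the slot model keeps admission control local to the agent, which fits the heartbeat-driven task flow described in these releases.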

0.3.2 not secure

- items file upload
  - log and item file error handling
  - fix: spider page displays all job history
  - scrapydd cmdline -v/--version to check the version
  - double-check task_queue get, and also check database consistency
  - heartbeat response indicates whether there is a new task on the server, so the client no longer needs to poll GETTASK
  - fix: get_task might make the queue's task status inconsistent with the database