
Development

Here you'll find a contributing guide to get started with development.

Environment

For local development, you need Python 3.10 or a later version installed.

We use uv for project management. Install it and set up your IDE accordingly.

Dependencies

To install this package and its development dependencies, run:
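Assuming the repository's Makefile provides the `install-dev` target used by the upstream crawlee-python project, that is:

```shell
make install-dev
```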

Code checking

To execute all code checking tools together, run:
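Assuming the Makefile's combined `check-code` target (as in the upstream project), this is:

```shell
make check-code
```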

Linting

We utilize ruff for linting, which analyzes code for potential issues and enforces consistent style. Refer to pyproject.toml for configuration details.

To run linting:
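Assuming the Makefile's `lint` target:

```shell
make lint
```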

Formatting

Our automated code formatting also leverages ruff, ensuring uniform style and addressing fixable linting issues. Configuration specifics are outlined in pyproject.toml.

To run formatting:
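Assuming the Makefile's `format` target:

```shell
make format
```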

Type checking

Type checking is handled by mypy, verifying code against type annotations. Configuration settings can be found in pyproject.toml.

To run type checking:
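Assuming the Makefile's `type-check` target:

```shell
make type-check
```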

Unit tests

We employ pytest as our testing framework, equipped with various plugins. Check pyproject.toml for configuration details and installed plugins.

To run unit tests:
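Assuming the Makefile's `unit-tests` target:

```shell
make unit-tests
```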

To run unit tests with HTML coverage report:
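Assuming the Makefile's `unit-tests-cov` target, which wraps pytest with coverage reporting:

```shell
make unit-tests-cov
```

The HTML report is typically written to an `htmlcov/` directory by pytest-cov.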

End-to-end tests

Pre-requisites for running end-to-end tests:

  • apify-cli correctly installed
  • apify-cli available on your PATH
  • your Apify API token set in the APIFY_TEST_USER_API_TOKEN environment variable

To run end-to-end tests:
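Assuming the Makefile's end-to-end target (named `e2e-templates-tests` in the upstream crawlee-python project):

```shell
make e2e-templates-tests
```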

Documentation

We follow the Google docstring format for code documentation. All user-facing classes and functions must be documented. Documentation standards are enforced using Ruff.

Our API documentation is generated from these docstrings using pydoc-markdown with custom post-processing. Additional content is provided through markdown files in the docs/ directory. The final documentation is rendered using Docusaurus and published to GitHub pages.

To run the documentation locally, ensure you have Node.js 20+ installed, then run:
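Assuming the Makefile's `run-docs` target, which builds the API docs and serves the Docusaurus site locally:

```shell
make run-docs
```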

Release process

Publishing new versions to PyPI is automated through GitHub Actions.

  • Beta releases: On each commit to the master branch, a new beta release is automatically published. The version number is determined based on the latest release and conventional commits. The beta version suffix is incremented by 1 from the last beta release on PyPI.
  • Stable releases: A stable version release may be created by triggering the release GitHub Actions workflow. The version number is determined based on the latest release and conventional commits (auto release type), or it may be overridden using the custom release type.

Publishing to PyPI manually

  1. Do not do this unless absolutely necessary. In all conceivable scenarios, you should use the release workflow instead.

  2. Make sure you know what you're doing.

  3. Update the version number:

  • Modify the version field under project in pyproject.toml.
[project]
name = "crawlee"
version = "x.y.z"
  4. Generate the distribution archives for the package:
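With uv, the distribution archives (sdist and wheel, written to `dist/`) are built with:

```shell
uv build
```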
  5. Set up the PyPI API token for authentication and upload the package to PyPI:
uv publish --token YOUR_API_TOKEN