feat: Persist `DefaultRenderingTypePredictor` state by Mantisus · Pull Request #1340 · apify/crawlee-python

Pull Request Overview

This PR adds persistence capabilities to the DefaultRenderingTypePredictor by implementing state management that saves and restores the trained model and associated data to/from a key-value store. This allows the predictor to maintain its learned patterns across different runs.

  • Adds persistence support with configurable key-value storage integration
  • Implements async context manager pattern for proper resource management
  • Introduces state serialization/deserialization for scikit-learn models
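The pattern summarized above (load state on entering an async context, save it on exit) can be sketched as follows. This is a minimal illustration, not the PR's implementation: `InMemoryKeyValueStore`, `PersistentPredictor`, `store_result`, `predict`, and the `"predictor-state"` key are all hypothetical names. A plain dictionary serialized as JSON stands in for the scikit-learn model and Crawlee's `RecoverableState`.

```python
import asyncio
import json
from typing import Dict, Optional


class InMemoryKeyValueStore:
    """Hypothetical stand-in for a key-value store (not Crawlee's storage API)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    async def get_value(self, key: str) -> Optional[str]:
        return self._data.get(key)

    async def set_value(self, key: str, value: str) -> None:
        self._data[key] = value


class PersistentPredictor:
    """Toy predictor that remembers the last rendering type recorded per label.

    Assumption: the real predictor persists a trained model; a dict is used
    here only to keep the sketch self-contained.
    """

    def __init__(self, store: InMemoryKeyValueStore, persist_key: str = "predictor-state") -> None:
        self._store = store
        self._persist_key = persist_key
        self._labels: Dict[str, str] = {}

    async def __aenter__(self) -> "PersistentPredictor":
        # Restore previously saved state, if the store holds any.
        raw = await self._store.get_value(self._persist_key)
        if raw is not None:
            self._labels = json.loads(raw)
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        # Save the current state so that a later run can pick it up.
        await self._store.set_value(self._persist_key, json.dumps(self._labels))

    def store_result(self, label: str, rendering_type: str) -> None:
        self._labels[label] = rendering_type

    def predict(self, label: str) -> Optional[str]:
        return self._labels.get(label)


async def demo() -> Optional[str]:
    store = InMemoryKeyValueStore()
    # First "run": learn something, then persist it on exit.
    async with PersistentPredictor(store) as first_run:
        first_run.store_result("product", "static")
    # Second "run": a fresh instance restores the saved state on entry.
    async with PersistentPredictor(store) as second_run:
        return second_run.predict("product")


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Tying load and save to `__aenter__` and `__aexit__` means callers cannot forget to persist state, which is presumably why the PR wires the predictor into the crawler's context managers.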

Reviewed Changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/crawlee/crawlers/_adaptive_playwright/_rendering_type_predictor.py | Core implementation of persistence with RecoverableState integration and async context manager |
| src/crawlee/crawlers/_adaptive_playwright/_utils.py | Utility functions for scikit-learn model serialization and validation |
| src/crawlee/crawlers/_adaptive_playwright/_adaptive_playwright_crawler.py | Integration of the predictor into the crawler's context managers |
| tests/unit/crawlers/_adaptive_playwright/test_predictor.py | Updated tests to use the async context manager and added persistence tests |
| tests/unit/crawlers/_adaptive_playwright/test_adaptive_playwright_crawler.py | Added `super().__init__()` call to the test mock class |
| docs/guides/code_examples/playwright_crawler_adaptive/init_prediction.py | Updated example to properly call the parent constructor |

Comments suppressed due to low confidence (1)

tests/unit/crawlers/_adaptive_playwright/test_predictor.py:27

  • The function name 'ictor_same_label' appears to be truncated or misspelled. It should likely be 'test_predictor_same_label' or similar.
async def ictor_same_label(url: str, expected_prediction: RenderingType, label: str | None) -> None: