ParselCrawler

Index

Methods

  • __init__(*, navigation_timeout, request_handler, statistics, configuration, event_manager, storage_client, request_manager, session_pool, proxy_configuration, http_client, max_request_retries, max_requests_per_crawl, max_session_rotations, max_crawl_depth, use_session_pool, retry_on_blocked, concurrency_settings, request_handler_timeout, abort_on_error, configure_logging, statistics_log_format, keep_alive, additional_http_error_status_codes, ignore_http_error_status_codes, respect_robots_txt_file, status_message_logging_interval, status_message_callback, id): None

  • Parameters

    • keyword-only, optional navigation_timeout: timedelta | None
    • keyword-only, optional request_handler: NotRequired[Callable[[TCrawlingContext], Awaitable[None]]]
    • keyword-only, optional statistics: NotRequired[Statistics[TStatisticsState]]
    • keyword-only, optional configuration: NotRequired[Configuration]
    • keyword-only, optional event_manager: NotRequired[EventManager]
    • keyword-only, optional storage_client: NotRequired[StorageClient]
    • keyword-only, optional request_manager: NotRequired[RequestManager]
    • keyword-only, optional session_pool: NotRequired[SessionPool]
    • keyword-only, optional proxy_configuration: NotRequired[ProxyConfiguration]
    • keyword-only, optional http_client: NotRequired[HttpClient]
    • keyword-only, optional max_request_retries: NotRequired[int]
    • keyword-only, optional max_requests_per_crawl: NotRequired[int | None]
    • keyword-only, optional max_session_rotations: NotRequired[int]
    • keyword-only, optional max_crawl_depth: NotRequired[int | None]
    • keyword-only, optional use_session_pool: NotRequired[bool]
    • keyword-only, optional retry_on_blocked: NotRequired[bool]
    • keyword-only, optional concurrency_settings: NotRequired[ConcurrencySettings]
    • keyword-only, optional request_handler_timeout: NotRequired[timedelta]
    • keyword-only, optional abort_on_error: NotRequired[bool]
    • keyword-only, optional configure_logging: NotRequired[bool]
    • keyword-only, optional statistics_log_format: NotRequired[Literal['table', 'inline']]
    • keyword-only, optional keep_alive: NotRequired[bool]
    • keyword-only, optional additional_http_error_status_codes: NotRequired[Iterable[int]]
    • keyword-only, optional ignore_http_error_status_codes: NotRequired[Iterable[int]]
    • keyword-only, optional respect_robots_txt_file: NotRequired[bool]
    • keyword-only, optional status_message_logging_interval: NotRequired[timedelta]
    • keyword-only, optional status_message_callback: NotRequired[Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]]]
    • keyword-only, optional id: NotRequired[int]

    Returns None
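
A minimal usage sketch of the constructor, assuming a recent Crawlee release where `ParselCrawler` and `ParselCrawlingContext` are exported from `crawlee.crawlers`; only a few of the keyword-only options listed above are shown:

```python
import asyncio
from datetime import timedelta

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    # All constructor arguments are optional and keyword-only.
    crawler = ParselCrawler(
        max_requests_per_crawl=10,
        request_handler_timeout=timedelta(seconds=30),
    )

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        # context.selector is a parsel Selector over the downloaded page.
        title = context.selector.css('title::text').get()
        await context.push_data({'url': context.request.url, 'title': title})

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```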

  • async add_requests(requests, *, forefront, batch_size, wait_time_between_batches, wait_for_all_requests_to_be_added, wait_for_all_requests_to_be_added_timeout): None

  • Parameters

    • requests: Sequence[str | Request]
    • optional, keyword-only forefront: bool = False
    • optional, keyword-only batch_size: int = 1000
    • optional, keyword-only wait_time_between_batches: timedelta = timedelta(0)
    • optional, keyword-only wait_for_all_requests_to_be_added: bool = False
    • optional, keyword-only wait_for_all_requests_to_be_added_timeout: timedelta | None = None

    Returns None
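
A hedged sketch of enqueueing extra URLs, continuing the constructor example above; the remaining batching options keep their defaults:

```python
async def seed(crawler: ParselCrawler) -> None:
    # forefront=True puts these requests at the front of the queue;
    # wait_for_all_requests_to_be_added blocks until the queue confirms them.
    await crawler.add_requests(
        ['https://crawlee.dev/docs', 'https://crawlee.dev/api'],
        forefront=True,
        wait_for_all_requests_to_be_added=True,
    )
```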

  • create_parsed_http_crawler_class(static_parser): type[AbstractHttpCrawler[ParsedHttpCrawlingContext[TParseResult], TParseResult, TSelectResult]]

  • Parameters

    • static_parser: AbstractHttpParser[TParseResult, TSelectResult]

    Returns type[AbstractHttpCrawler[ParsedHttpCrawlingContext[TParseResult], TParseResult, TSelectResult]]
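
A heavily hedged sketch of the class factory; `MyParser` below is a hypothetical `AbstractHttpParser` implementation (not part of Crawlee), and the returned class is instantiated like any other HTTP crawler:

```python
# MyParser is a hypothetical AbstractHttpParser subclass supplied by you; the
# factory returns a crawler class whose contexts carry MyParser's parse results.
CustomCrawler = ParselCrawler.create_parsed_http_crawler_class(static_parser=MyParser())
crawler = CustomCrawler(max_requests_per_crawl=10)
```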

  • error_handler(handler): ErrorHandler[TCrawlingContext]

  • Parameters

    • handler: ErrorHandler[TCrawlingContext | BasicCrawlingContext]

    Returns ErrorHandler[TCrawlingContext]
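
A sketch of registering an error handler, which runs before a failed request is retried; parameter annotations are omitted because the handler may receive only a `BasicCrawlingContext` together with the raised exception:

```python
@crawler.error_handler
async def retry_handler(context, error):
    # Runs before the request is retried; `error` is the exception that was raised.
    context.log.warning(f'Retrying {context.request.url} after: {error}')
```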

  • async export_data(path, dataset_id, dataset_name, dataset_alias, additional_kwargs): None

  • Parameters

    • path: str | Path
    • optional dataset_id: str | None = None
    • optional dataset_name: str | None = None
    • optional dataset_alias: str | None = None
    • additional_kwargs: Unpack[ExportDataJsonKwargs | ExportDataCsvKwargs]

    Returns None
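
A sketch of exporting the scraped items after a crawl, assuming the JSON/CSV choice follows the file extension and that omitting the dataset selectors targets the default dataset:

```python
async def export_results(crawler: ParselCrawler) -> None:
    # Dump the default dataset to disk once crawler.run() has finished.
    await crawler.export_data('results.json')
    # 'products' is a hypothetical named dataset, exported as CSV.
    await crawler.export_data('products.csv', dataset_name='products')
```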

  • failed_request_handler(handler): FailedRequestHandler[TCrawlingContext]

  • Parameters

    • handler: FailedRequestHandler[TCrawlingContext | BasicCrawlingContext]

    Returns FailedRequestHandler[TCrawlingContext]
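
A sketch of the companion handler that runs only after all retries for a request are exhausted:

```python
@crawler.failed_request_handler
async def failed_handler(context, error):
    # Called once per request that ultimately failed; no further retries follow.
    context.log.error(f'Giving up on {context.request.url}: {error}')
```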


  • async get_data(dataset_id, dataset_name, dataset_alias, kwargs): DatasetItemsListPage

  • Parameters

    • optional dataset_id: str | None = None
    • optional dataset_name: str | None = None
    • optional dataset_alias: str | None = None
    • kwargs: Unpack[GetDataKwargs]

    Returns DatasetItemsListPage
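
A sketch of reading the scraped items back programmatically instead of exporting them to a file; the returned page object exposes the pushed dictionaries on its `items` attribute:

```python
async def print_results(crawler: ParselCrawler) -> None:
    data = await crawler.get_data()  # default dataset
    for item in data.items:
        print(item)
```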

  • async get_dataset(*, id, name, alias): Dataset

  • Parameters

    • optional, keyword-only id: str | None = None
    • optional, keyword-only name: str | None = None
    • optional, keyword-only alias: str | None = None

    Returns Dataset
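
A sketch of working with the underlying `Dataset` object directly, for example to push an item outside of a request handler:

```python
async def record_run_marker(crawler: ParselCrawler) -> None:
    dataset = await crawler.get_dataset()  # default dataset
    await dataset.push_data({'source': 'manual'})
```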


  • async get_key_value_store(*, id, name, alias): KeyValueStore

  • Parameters

    • optional, keyword-only id: str | None = None
    • optional, keyword-only name: str | None = None
    • optional, keyword-only alias: str | None = None

    Returns KeyValueStore
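
A sketch of using the key-value store for artifacts that are not tabular items, such as configuration blobs; the key name is just an example:

```python
async def save_run_config(crawler: ParselCrawler) -> None:
    kvs = await crawler.get_key_value_store()  # default store
    await kvs.set_value('run-config', {'label': 'nightly'})
    config = await kvs.get_value('run-config')
    print(config)
```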

  • post_navigation_hook(hook): None

  • Parameters

    • hook: Callable[[HttpCrawlingContext], Awaitable[None]]

    Returns None
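
A sketch of a post-navigation hook; per the parameter type above it receives an `HttpCrawlingContext`, so the HTTP response is already available when it runs:

```python
@crawler.post_navigation_hook
async def after_fetch(context):
    # Runs after each response is received, before the request handler.
    context.log.info(f'Fetched {context.request.url} -> {context.http_response.status_code}')
```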

  • pre_navigation_hook(hook): None

  • Parameters

    • hook: Callable[[BasicCrawlingContext], Awaitable[None]]

    Returns None
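
The pre-navigation counterpart runs before the request is dispatched and sees only a `BasicCrawlingContext`; a sketch:

```python
@crawler.pre_navigation_hook
async def before_fetch(context):
    # The request has not been sent yet at this point.
    context.log.info(f'About to fetch {context.request.url}')
```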


  • async run(requests, *, purge_request_queue): FinalStatistics

  • Parameters

    • optional requests: Sequence[str | Request] | None = None
    • optional, keyword-only purge_request_queue: bool = True

    Returns FinalStatistics
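
A sketch of starting the crawl, as a fragment for the `main()` coroutine from the constructor example; `run` drives the whole request queue and resolves to the final statistics once the queue is drained or a limit such as `max_requests_per_crawl` is reached:

```python
stats = await crawler.run(['https://crawlee.dev'])
print(f'Requests finished: {stats.requests_finished}')
```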

  • stop(reason): None

  • Parameters

    • optional reason: str = 'Stop was called externally.'

    Returns None
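
A sketch of stopping a crawl gracefully from inside a request handler, continuing the earlier example; after the call no new requests are started:

```python
@crawler.router.default_handler
async def handler(context: ParselCrawlingContext) -> None:
    if context.selector.css('form#captcha'):  # hypothetical blocking marker
        crawler.stop(reason='Captcha encountered, stopping the crawl.')
        return
    await context.push_data({'url': context.request.url})
```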

  • async use_state(default_value): dict[str, JsonSerializable]

  • Parameters

    • optional default_value: dict[str, JsonSerializable] | None = None

    Returns dict[str, JsonSerializable]
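
A sketch of the persistent state dictionary, as a fragment for use inside a request handler or another coroutine; the state is kept in the key-value store and restored on subsequent runs, and the counter name is just an example:

```python
state = await crawler.use_state({'pages_processed': 0})
state['pages_processed'] += 1  # mutations are persisted automatically
```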

Properties

router: Router[TCrawlingContext]

statistics: Statistics[TStatisticsState]
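
The `router` property is where request handlers are attached; besides the default handler shown earlier, label-specific handlers can be registered, while `statistics` exposes the live counters during the crawl. A sketch, with the 'DETAIL' label being an arbitrary example:

```python
@crawler.router.handler('DETAIL')
async def detail_handler(context: ParselCrawlingContext) -> None:
    # Handles only requests that were enqueued with label='DETAIL'.
    await context.push_data({'url': context.request.url})
```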