ParselCrawler | API | Crawlee for Python
Index
Methods
- __init__(*, navigation_timeout, request_handler, statistics, configuration, event_manager, storage_client, request_manager, session_pool, proxy_configuration, http_client, max_request_retries, max_requests_per_crawl, max_session_rotations, max_crawl_depth, use_session_pool, retry_on_blocked, concurrency_settings, request_handler_timeout, abort_on_error, configure_logging, statistics_log_format, keep_alive, additional_http_error_status_codes, ignore_http_error_status_codes, respect_robots_txt_file, status_message_logging_interval, status_message_callback, id): None
Parameters
navigation_timeout: timedelta | None (keyword-only, optional)
request_handler: NotRequired[Callable[[TCrawlingContext], Awaitable[None]]] (keyword-only, optional)
statistics: NotRequired[Statistics[TStatisticsState]] (keyword-only, optional)
configuration: NotRequired[Configuration] (keyword-only, optional)
event_manager: NotRequired[EventManager] (keyword-only, optional)
storage_client: NotRequired[StorageClient] (keyword-only, optional)
request_manager: NotRequired[RequestManager] (keyword-only, optional)
session_pool: NotRequired[SessionPool] (keyword-only, optional)
proxy_configuration: NotRequired[ProxyConfiguration] (keyword-only, optional)
http_client: NotRequired[HttpClient] (keyword-only, optional)
max_request_retries: NotRequired[int] (keyword-only, optional)
max_requests_per_crawl: NotRequired[int | None] (keyword-only, optional)
max_session_rotations: NotRequired[int] (keyword-only, optional)
max_crawl_depth: NotRequired[int | None] (keyword-only, optional)
use_session_pool: NotRequired[bool] (keyword-only, optional)
retry_on_blocked: NotRequired[bool] (keyword-only, optional)
concurrency_settings: NotRequired[ConcurrencySettings] (keyword-only, optional)
request_handler_timeout: NotRequired[timedelta] (keyword-only, optional)
abort_on_error: NotRequired[bool] (keyword-only, optional)
configure_logging: NotRequired[bool] (keyword-only, optional)
statistics_log_format: NotRequired[Literal['table', 'inline']] (keyword-only, optional)
keep_alive: NotRequired[bool] (keyword-only, optional)
additional_http_error_status_codes: NotRequired[Iterable[int]] (keyword-only, optional)
ignore_http_error_status_codes: NotRequired[Iterable[int]] (keyword-only, optional)
respect_robots_txt_file: NotRequired[bool] (keyword-only, optional)
status_message_logging_interval: NotRequired[timedelta] (keyword-only, optional)
status_message_callback: NotRequired[Callable[[StatisticsState, StatisticsState | None, str], Awaitable[str | None]]] (keyword-only, optional)
id: NotRequired[int] (keyword-only, optional)
Returns None
- async add_requests(requests, *, forefront, batch_size, wait_time_between_batches, wait_for_all_requests_to_be_added, wait_for_all_requests_to_be_added_timeout): None
Parameters
requests: Sequence[str | Request]
forefront: bool = False (keyword-only, optional)
batch_size: int = 1000 (keyword-only, optional)
wait_time_between_batches: timedelta = timedelta(0) (keyword-only, optional)
wait_for_all_requests_to_be_added: bool = False (keyword-only, optional)
wait_for_all_requests_to_be_added_timeout: timedelta | None = None (keyword-only, optional)
Returns None
- create_parsed_http_crawler_class(static_parser): type[AbstractHttpCrawler[ParsedHttpCrawlingContext[TParseResult], TParseResult, TSelectResult]]
Parameters
static_parser: AbstractHttpParser[TParseResult, TSelectResult]
Returns type[AbstractHttpCrawler[ParsedHttpCrawlingContext[TParseResult], TParseResult, TSelectResult]]
- error_handler(handler): ErrorHandler[TCrawlingContext]
Parameters
handler: ErrorHandler[TCrawlingContext | BasicCrawlingContext]
Returns ErrorHandler[TCrawlingContext]
- async export_data(path, dataset_id, dataset_name, dataset_alias, additional_kwargs): None
Parameters
path: str | Path
dataset_id: str | None = None (optional)
dataset_name: str | None = None (optional)
dataset_alias: str | None = None (optional)
additional_kwargs: Unpack[ExportDataJsonKwargs | ExportDataCsvKwargs]
Returns None
- failed_request_handler(handler): FailedRequestHandler[TCrawlingContext]
Parameters
handler: FailedRequestHandler[TCrawlingContext | BasicCrawlingContext]
Returns FailedRequestHandler[TCrawlingContext]
- async get_data(dataset_id, dataset_name, dataset_alias, kwargs): DatasetItemsListPage
Parameters
dataset_id: str | None = None (optional)
dataset_name: str | None = None (optional)
dataset_alias: str | None = None (optional)
kwargs: Unpack[GetDataKwargs]
Returns DatasetItemsListPage
- async get_dataset(*, id, name, alias): Dataset
Parameters
id: str | None = None (keyword-only, optional)
name: str | None = None (keyword-only, optional)
alias: str | None = None (keyword-only, optional)
Returns Dataset
- async get_key_value_store(*, id, name, alias): KeyValueStore
Parameters
id: str | None = None (keyword-only, optional)
name: str | None = None (keyword-only, optional)
alias: str | None = None (keyword-only, optional)
Returns KeyValueStore
- async get_request_manager(): RequestManager
- on_skipped_request(callback): SkippedRequestCallback
- post_navigation_hook(hook): None
Parameters
hook: Callable[[HttpCrawlingContext], Awaitable[None]]
Returns None
- pre_navigation_hook(hook): None
Parameters
hook: Callable[[BasicCrawlingContext], Awaitable[None]]
Returns None
- async run(requests, *, purge_request_queue): FinalStatistics
Parameters
requests: Sequence[str | Request] | None = None (optional)
purge_request_queue: bool = True (keyword-only, optional)
Returns FinalStatistics
- stop(reason): None
Parameters
reason: str = 'Stop was called externally.' (optional)
Returns None
- async use_state(default_value): dict[str, JsonSerializable]
Parameters
default_value: dict[str, JsonSerializable] | None = None (optional)
Returns dict[str, JsonSerializable]
Properties
router: Router[TCrawlingContext]
statistics: Statistics[TStatisticsState]