Implement/document a way to pass information between handlers
I came across a situation where I scrape half of an item's data in the listing page handler and the other half in the handler that takes care of the detail page. I think this must be quite a common case, yet I struggle to see how to pass the information from one handler to the other. See the concrete example below:
import re
import asyncio
from enum import StrEnum, auto

import click
from crawlee.beautifulsoup_crawler import (
    BeautifulSoupCrawler,
    BeautifulSoupCrawlingContext,
)
from crawlee.router import Router


LENGTH_RE = re.compile(r"(\d+)\s+min")


class Label(StrEnum):
    DETAIL = auto()


router = Router[BeautifulSoupCrawlingContext]()


@click.command()
def edison():
    asyncio.run(scrape())


async def scrape():
    crawler = BeautifulSoupCrawler(request_handler=router)
    await crawler.run(["https://edisonfilmhub.cz/program"])
    await crawler.export_data("edison.json", dataset_name="edison")


@router.default_handler
async def default_handler(context: BeautifulSoupCrawlingContext):
    await context.enqueue_links(selector=".program_table .name a", label=Label.DETAIL)


@router.handler(Label.DETAIL)
async def detail_handler(context: BeautifulSoupCrawlingContext):
    context.log.info(f"Scraping {context.request.url}")

    description = context.soup.select_one(".filmy_page .desc3").text
    length_min = LENGTH_RE.search(description).group(1)
    # TODO get starts_at, then calculate ends_at

    await context.push_data(
        {
            "url": context.request.url,
            "title": context.soup.select_one(".filmy_page h1").text.strip(),
            "csfd_url": context.soup.select_one(".filmy_page .hrefs a")["href"],
        },
        dataset_name="edison",
    )
I need to scrape starts_at in the default_handler, then add more details to the item on the detail page, and calculate the ends_at time from the length of the film. Even if I replaced enqueue_links with something more fine-grained, how do I pass data from one request to another?
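To make the desired data flow concrete, here is a minimal, library-free sketch of the pattern I have in mind: the listing step attaches starts_at to each detail request, and the detail step reads it back and derives ends_at. `SketchRequest`, `listing_to_requests`, `detail_to_item`, and the example.com URL are hypothetical names made up for illustration, not Crawlee API. I am assuming that a per-request `user_data` mapping on Crawlee's request objects could play the role of the `user_data` field here, which would need to be checked against the Crawlee docs. The timestamp is stored as an ISO string on the assumption that request data has to stay JSON-serializable.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional

LENGTH_RE = re.compile(r"(\d+)\s+min")


@dataclass
class SketchRequest:
    """Hypothetical stand-in for a crawler request; NOT a Crawlee class."""

    url: str
    label: Optional[str] = None
    # Arbitrary data that travels with the request to the next handler.
    user_data: dict = field(default_factory=dict)


def listing_to_requests(rows):
    """Listing step: one detail request per (url, starts_at) row."""
    return [
        SketchRequest(
            url=url,
            label="detail",
            # Stored as an ISO string so the payload stays JSON-friendly.
            user_data={"starts_at": starts_at.isoformat()},
        )
        for url, starts_at in rows
    ]


def detail_to_item(request, title, description):
    """Detail step: merge the carried starts_at with detail-page data."""
    starts_at = datetime.fromisoformat(request.user_data["starts_at"])
    match = LENGTH_RE.search(description)
    # ends_at stays None when the description has no "<n> min" length.
    ends_at = None
    if match:
        ends_at = starts_at + timedelta(minutes=int(match.group(1)))
    return {
        "url": request.url,
        "title": title,
        "starts_at": starts_at.isoformat(),
        "ends_at": ends_at.isoformat() if ends_at else None,
    }


if __name__ == "__main__":
    requests = listing_to_requests(
        [("https://example.com/film/1", datetime(2024, 9, 1, 18, 0))]
    )
    print(detail_to_item(requests[0], "Example Film", "Drama, 95 min"))
```

In the real crawler, the listing handler would build the requests from the rows of the program table instead of calling enqueue_links, and the detail handler would read the carried value from the request it receives.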