How can I disable caching completely?
I am trying to write a simple function to crawl a website, and I don't want Crawlee to cache anything (each time I call this function, it should do everything from scratch).
Here is my attempt so far. I tried setting `persist_storage=False` and `purge_on_start=True` in the `Configuration`, and also removing the storage directory entirely, but I keep getting either a concatenated result of all previous requests, or an empty result when I delete the storage directory.
```python
from crawlee import Glob
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from crawlee.configuration import Configuration
from crawlee.storages import Dataset


async def main(
    website: str,
    include_links: list[str],
    exclude_links: list[str],
    depth: int = 5,
) -> str:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=depth,
    )
    dataset = await Dataset.open(
        configuration=Configuration(
            persist_storage=False,
            purge_on_start=True,
        ),
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:  # type: ignore
        # Extract data from the page.
        text = context.soup.get_text()
        await dataset.push_data({"content": text})

        # Enqueue all links found on the page.
        await context.enqueue_links(
            include=[Glob(url) for url in include_links],
            exclude=[Glob(url) for url in exclude_links],
        )

    # Run the crawler with the initial list of URLs.
    await crawler.run([website])

    data = await dataset.get_data()
    content = "\n".join([item["content"] for item in data.items])  # type: ignore
    return content
```
Also, is there a way to simply get the result of the crawl as a string, without using `Dataset`?
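To illustrate what I mean: the shape I'm hoping for is just accumulating page text in a local list inside the handler and joining it at the end, skipping `Dataset` entirely. A minimal sketch of that accumulation pattern in plain Python (`handle_page` is a hypothetical stand-in for the real crawlee request handler, not an actual API):

```python
# Sketch of the "no Dataset" idea: accumulate text in a plain list.
# handle_page is a hypothetical stand-in for the crawlee request handler;
# in the real code it would receive a crawling context and append
# context.soup.get_text() instead of a ready-made string.
pages: list[str] = []


def handle_page(text: str) -> None:
    pages.append(text)


# Simulate the crawler visiting two pages.
for fake_text in ["first page text", "second page text"]:
    handle_page(fake_text)

# Join everything into one string at the end of the crawl.
content = "\n".join(pages)
```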
Any help is appreciated 🤗 thank you in advance!