`enqueue_links` does not process all links on the page
Both the `PlaywrightCrawler` and the `BeautifulSoupCrawler` exhibit the same issue with `enqueue_links`.
BeautifulSoup:
```python
import asyncio
import logging

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext

logging.basicConfig(level=logging.INFO)


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        await context.enqueue_links()

        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        }
        await context.push_data(data)

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```
or Playwright:
```python
import asyncio
import logging

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

logging.basicConfig(level=logging.INFO)


async def main() -> None:
    crawler = PlaywrightCrawler()

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        await context.enqueue_links()

        record = {
            'request_url': context.request.url,
            'page_url': context.page.url,
            'page_title': await context.page.title(),
        }
        await context.push_data(record)

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```
Only 16 URLs are processed:
```json
[
    "https://crawlee.dev/",
    "https://crawlee.dev/docs/guides/typescript-project",
    "https://crawlee.dev/docs/guides/javascript-rendering",
    "https://crawlee.dev/docs/guides/avoid-blocking",
    "https://crawlee.dev/docs/guides/cheerio-crawler-guide",
    "https://crawlee.dev/docs/guides/jsdom-crawler-guide",
    "https://crawlee.dev/api/core/class/AutoscaledPool",
    "https://crawlee.dev/docs/guides/proxy-management",
    "https://crawlee.dev/docs/guides/result-storage",
    "https://crawlee.dev/docs/guides/request-storage",
    "https://crawlee.dev/api/utils/namespace/social",
    "https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions",
    "https://crawlee.dev/api/utils",
    "https://crawlee.dev/docs/quick-start",
    "https://crawlee.dev/docs/deployment/aws-cheerio",
    "https://crawlee.dev/docs/deployment/gcp-cheerio"
]
```

Clearly, there are many more pages on crawlee.dev than are being processed.
The problem probably lies in `BasicCrawler._check_enqueue_strategy` or a related method that filters extracted links before they are enqueued.
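For context, a strategy check of this kind typically compares each extracted link against the originating request URL and silently drops links that fail the comparison. The sketch below is hypothetical (it is not crawlee's actual implementation) and only illustrates how an overly strict hostname comparison could discard valid links:

```python
from urllib.parse import urlparse


def passes_same_hostname_check(origin_url: str, target_url: str) -> bool:
    """Hypothetical sketch of a 'same hostname' enqueue-strategy check.

    A link is kept only when its hostname matches the originating
    request's hostname exactly.
    """
    return urlparse(origin_url).hostname == urlparse(target_url).hostname


# A link on the same hostname passes the check and gets enqueued...
assert passes_same_hostname_check(
    'https://crawlee.dev/', 'https://crawlee.dev/docs/quick-start'
)

# ...while a link on a subdomain fails the exact-match comparison and is
# silently dropped, which is one way pages can go missing from a crawl.
assert not passes_same_hostname_check(
    'https://crawlee.dev/', 'https://docs.crawlee.dev/'
)
```

If the filtering logic (or the default strategy it enforces) is stricter than intended, it would explain why only a small subset of the site's links ever reaches the request queue.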