`MemoryStorageClient` issue with deduplication
Test code:
```python
import asyncio

from crawlee import ConcurrencySettings
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    # Use the in-memory storage client, where the deduplication issue shows up.
    storage_client = MemoryStorageClient()

    crawler = ParselCrawler(
        storage_client=storage_client,
        concurrency_settings=ConcurrencySettings(desired_concurrency=20),
    )

    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        data = {
            'url': context.request.url,
            'title': context.selector.css('title::text').get(),
        }
        await context.push_data(data)
        # Enqueue every same-domain link; the request queue should deduplicate them.
        await context.enqueue_links(strategy='same-domain')

    await crawler.run(['http://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```
Results:
```
[crawlee._autoscaling.autoscaled_pool] INFO Waiting for remaining tasks to finish
[ParselCrawler] INFO Final request statistics:
┌───────────────────────────────┬────────────────┐
│ requests_finished             │ 9024           │
│ requests_failed               │ 0              │
│ retry_histogram               │ [9024]         │
│ request_avg_failed_duration   │ None           │
│ request_avg_finished_duration │ 979.7ms        │
│ requests_finished_per_minute  │ 655            │
│ requests_failed_per_minute    │ 0              │
│ request_total_duration        │ 2h 27min 20.9s │
│ requests_total                │ 9024           │
│ crawler_runtime               │ 13min 46.7s    │
└───────────────────────────────┴────────────────┘
```
The expected number of unique links on crawlee.dev is 4512, yet 9024 requests finished (2 × 4512), so each link was apparently processed twice; request deduplication does not seem to take effect with `MemoryStorageClient`.
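One way to confirm that the duplicates actually reach the dataset, rather than only showing up in the statistics, is to count repeated URLs among the pushed items after the run. A minimal sketch, assuming `crawler.get_data()` returns the items stored via `push_data()` (the exact accessor may differ between crawlee versions):

```python
from collections import Counter

# Inside main(), after `await crawler.run(...)`:
# count how many times each URL appears among the stored items.
result = await crawler.get_data()
url_counts = Counter(item['url'] for item in result.items)

duplicates = {url: n for url, n in url_counts.items() if n > 1}
print(f'{len(duplicates)} of {len(url_counts)} unique URLs were stored more than once')
```

If deduplication worked, every count should be 1; with the run above, the totals suggest each URL would appear twice.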