Reconsider crawler inheritance
Currently, we have the following inheritance chains:

- `BasicCrawler` -> `HttpCrawler`
- `BasicCrawler` -> `BeautifulSoupCrawler`
- `BasicCrawler` -> `PlaywrightCrawler`
- `BasicCrawler` -> `ParselCrawler` (feat: Implement ParselCrawler that adds support for Parsel #348)
This is an intentional difference from the JS version, where

- `BrowserCrawler` is a common ancestor of `PlaywrightCrawler` and `PuppeteerCrawler` - this is not relevant in the Python ecosystem, as we won't implement anything similar to Playwright anytime soon
- `CheerioCrawler` and `JSDomCrawler` inherit from `HttpCrawler` - this is the important difference
- We decided to do this differently to avoid inheritance chains, which make it harder to track down the code that is actually being executed. The cost is a bit of code duplication.
- In the Python version, we also have the `HttpClient` abstraction, and most of the HTTP-handling logic is contained there
We might want to reconsider this because:

- New HTML parsers are being added as we speak
- This might make the code duplication too costly to maintain
- For Adaptive playwright crawler #249, we would like to have a "parse the current HTML" helper that works with all supported HTML parsers, not just BeautifulSoup, for instance
The possible ways out are:

- Leave it as it is now
- Parametrize `HttpCrawler` with an HTML parser
    - this would make `BeautifulSoupCrawler` and `ParselCrawler` very thin - they would just pass the right `HttpClient` and `HtmlParser` to `HttpCrawler`
    - we may want to consider moving the `send_request` context helper from `BasicCrawlingContext` to `HttpCrawlingContext`
- Remove `HttpCrawler` altogether and pull its functionality into `BasicCrawler`
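To make the parametrization option concrete, here is a minimal sketch of what a parser-parametrized `HttpCrawler` could look like. All names here (`HttpCrawler`, `HttpResponse`, the `parser` callable, `TitleOnlyCrawler`) are hypothetical stand-ins for illustration, not the real Crawlee API; a toy string-splitting parser stands in for BeautifulSoup/Parsel to keep the sketch dependency-free:

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

TParseResult = TypeVar("TParseResult")


@dataclass
class HttpResponse:
    """Hypothetical minimal response object carrying the raw HTML body."""
    body: str


class HttpCrawler(Generic[TParseResult]):
    """Sketch of an HttpCrawler parametrized with an HTML parser.

    The parser is injected as a callable, so the shared "parse the
    current HTML" helper works for any supported parser.
    """

    def __init__(self, parser: Callable[[str], TParseResult]) -> None:
        self._parse = parser

    def parse_response(self, response: HttpResponse) -> TParseResult:
        # Shared helper - the only parser-specific piece is self._parse.
        return self._parse(response.body)


class TitleOnlyCrawler(HttpCrawler[str]):
    """A thin subclass: all it does is wire in a concrete parser,
    which is exactly how thin BeautifulSoupCrawler/ParselCrawler
    could become under this option."""

    def __init__(self) -> None:
        super().__init__(
            parser=lambda html: html.split("<title>")[1].split("</title>")[0]
        )


crawler = TitleOnlyCrawler()
print(crawler.parse_response(HttpResponse("<html><title>Example</title></html>")))
# prints "Example"
```

Under this design, `TParseResult` would be `BeautifulSoup` for the BeautifulSoup-backed crawler and a `Selector` for the Parsel-backed one, and the generic `parse_response` helper is what the adaptive crawler in #249 could rely on.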