improve title extractor by prnake · Pull Request #924 · ArchiveBox/ArchiveBox
Summary
This PR tries to get the title from offline HTML first, such as the singlefile output, following the same idea as the readability extractor. This handles anti-scraping pages like https://xz.aliyun.com/t/10870, whose title cannot be retrieved with wget or curl.
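As a rough illustration of the idea (not the code in this PR), the lookup could be sketched as follows. The names `TitleParser` and `title_from_snapshot`, and the candidate filename `singlefile.html`, are assumptions made for this example. The function only reads HTML already saved in the snapshot directory and returns `None` when no title is found, so a caller could then fall back to the existing wget/curl based fetch.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import List, Optional, Sequence


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones (e.g. inside inline SVG) are ignored.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    @property
    def title(self) -> Optional[str]:
        # Collapse runs of whitespace and treat an empty title as missing.
        text = " ".join("".join(self.parts).split())
        return text or None


def title_from_snapshot(
    snapshot_dir: Path,
    candidates: Sequence[str] = ("singlefile.html",),  # assumed output filename
) -> Optional[str]:
    """Return the first non-empty title found in already-saved HTML files."""
    for name in candidates:
        path = snapshot_dir / name
        if not path.is_file():
            continue
        parser = TitleParser()
        parser.feed(path.read_text(encoding="utf-8", errors="replace"))
        parser.close()
        if parser.title:
            return parser.title
    # Nothing usable offline: the caller can fall back to a network fetch.
    return None
```

Trying the offline files first means the title comes from the page as a real browser rendered it, which is the case where a plain wget or curl request may be blocked.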
Changes these areas
- Bugfixes
- Feature behavior
- Command line interface
- Configuration options
- Internal architecture
- Snapshot data layout on disk