Warn on broken steps, use yt-dlp to avoid youtube-dl errors, and don't crash on bad UTF-8 by turian · Pull Request #1026 · ArchiveBox/ArchiveBox

Summary

If some step is broken, we issue a warning instead of crashing.
We use yt-dlp instead of youtube-dl by default.
On invalid UTF-8, we don't crash.
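
A minimal sketch of the first point, warning instead of crashing. The `run_steps` helper and the step callables are hypothetical names for illustration, not ArchiveBox's actual `archive_link` code:

```python
import warnings
from typing import Callable, Dict


def run_steps(steps: Dict[str, Callable[[], object]]) -> Dict[str, object]:
    """Run each step; a failing step produces a warning, not a crash.

    Hypothetical helper: the real code path in ArchiveBox is archive_link.
    """
    results: Dict[str, object] = {}
    for name, step in steps.items():
        try:
            results[name] = step()
        except Exception as err:
            # Report the failure and keep going with the remaining steps.
            warnings.warn(f"step {name!r} failed: {err!r}")
            results[name] = None
    return results
```

With this shape, one broken step (for example, the media step) no longer prevents the other steps from running.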

Quickest workaround for many people, until this is merged

Add this to ArchiveBox.conf:

YOUTUBEDL_BINARY=/usr/bin/yt-dlp

If that doesn't work, you can use my Docker image turian/archivebox:kludge-984-UTF8-bug, which includes this patch, instead of archivebox/archivebox for now.

@jgoerzen I think you also said this bug was a showstopper for you

Try it out

This finally works with this patch:

archivebox add 'https://www.ashra.com/news.php?m=A'

Related issues

Should close these issues:

UnicodeDecodeError: Bug: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 264619: invalid start byte #1014, Bug: Indexing subtitles in media extractor fails when they're not UTF-8 encoded #984, and Bug: UnicodeDecodeError when archiving site #999
Not able to save YouTube videos, which might be UnicodeDecodeError but are probably also youtube-dl errors: Bug: Docker install not able to save YouTube videos (media failure) #991 and Question: YouTube videos not completing media archive method #998

Changes these areas

Bugfixes
Feature behavior
Command line interface
Configuration options
Internal architecture
Snapshot data layout on disk

`youtube-dl` -> `yt-dlp`

I began by changing archive_link so that a failure issues a warning instead of raising a crashing exception. This gave me better diagnostics than the exceptions you see in the issues above. I observed the following:

       Extractor failed:
             Failed to save media
        Run to see full output:
            cd /home/archivebox/archivebox-turian/ArchiveBox/pipdata/archive/1663011964.432896;
            youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --write-sub --all-subs --write-auto-sub --convert-subs=srt --yes-playlist --continue --ignore-errors --no-abort-on-error --geo-bypass --add-metadata --max-filesize=750m https://www.ashra.com/news.php?m=A

Inspecting that, I saw:

$ youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --write-sub --all-subs --write-auto-sub --convert-subs=srt --yes-playlist --continue --ignore-errors --no-abort-on-error --geo-bypass --add-metadata --max-filesize=750m https://www.ashra.com/news.php?m=A
Usage: youtube-dl [OPTIONS] URL [URL...]

youtube-dl: error: no such option: --no-abort-on-error

Oops. youtube-dl doesn't have --no-abort-on-error. I think it only has --abort-on-error, but I'm not entirely sure what the default behavior is.

I switched the default from youtube-dl to yt-dlp, a more actively maintained fork that constantly pulls from upstream youtube-dl. It also has the --no-abort-on-error option, so the media download now works.

It operates essentially identically to youtube-dl (and FASTER).
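
For anyone who wants both binaries to keep working locally, here is a rough sketch of preferring yt-dlp and falling back to youtube-dl. This is not what the patch does (the patch only changes the default); `pick_media_binary` is a hypothetical helper, and the candidate order is an assumption:

```python
import shutil
from typing import Callable, Optional, Sequence


def pick_media_binary(
    candidates: Sequence[str] = ("yt-dlp", "youtube-dl"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first candidate found on PATH, else None.

    Hypothetical helper for illustration; ArchiveBox itself reads the
    binary from the YOUTUBEDL_BINARY config option.
    """
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None
```

The `which` parameter exists only so the lookup can be swapped out in tests; by default it is `shutil.which`.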

The youtube-dl options above are all present in yt-dlp, with the following caveats:

We aren't doing this, but "When --embed-subs and --write-subs are used together, the subtitles are written to disk and also embedded in the media file. You can use just --embed-subs to embed the subs and automatically delete the separate file."
We aren't doing this, but "--add-metadata attaches the infojson to mkv files in addition to writing the metadata when used with --write-info-json. Use --no-embed-info-json or --compat-options no-attach-info-json to revert this"
--write-annotations: "No supported site has annotations now"

I haven't pushed the yt-dlp default change to the submodules yet (deb_dist, brew_dist, docker, docs, etc/ArchiveBox.conf.default, pip_dist), but I'd like to. I would even go so far as to deprecate youtube-dl, but perhaps that's too radical for some people.

UnicodeDecodeError fixes

A common complaint. In media.py, we used to have:

text_file.read_text(encoding='utf-8').strip()

This decodes in strict mode, so any decoding error causes a crash. I have changed the error handling to xmlcharrefreplace ("Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;."), which seemed best to me, but you can read about other options in my comment there or in more detail in the documentation for Python's open function.
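
A small sketch of the difference between strict and lenient reads. One caveat: Python's open documentation describes xmlcharrefreplace as only supported when writing to a file, so this sketch uses the `replace` handler, which does work when reading. That choice, the `read_text_lenient` name, and the sample file are assumptions for illustration, not what the patch contains:

```python
import tempfile
from pathlib import Path


def read_text_lenient(path: Path, errors: str = "replace") -> str:
    """Read a file as UTF-8, substituting U+FFFD for invalid bytes."""
    return path.read_text(encoding="utf-8", errors=errors).strip()


with tempfile.TemporaryDirectory() as tmp:
    sub = Path(tmp) / "subs.srt"
    # 0xb3 is not a valid UTF-8 start byte (the byte from issue #1014).
    sub.write_bytes(b"caf\xb3 \n")

    try:
        sub.read_text(encoding="utf-8")  # strict mode, the old behavior
        strict_ok = True
    except UnicodeDecodeError:
        strict_ok = False

    lenient = read_text_lenient(sub)
```

Here the strict read raises UnicodeDecodeError, while the lenient read returns the text with the bad byte replaced by U+FFFD.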

My preferred workaround, because sometimes things say they are utf-8 but are actually a different encoding, would be this:

Try utf-8 in strict mode. If clean, then good.
If error, use chardet to guess the encoding. (It's a little slow, so it's better as a fallback option.) Attempt to decode to Unicode using the guessed encoding in strict mode. If clean, then good.
Worst case: use either utf-8 or chardet's highest-confidence guess, and decode with xmlcharrefreplace for errors.

This would be a separate PR, if you approve of me including chardet as a pip dependency here and in all submodules (same as with yt-dlp).
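
The three-step fallback above could look roughly like this. It is a sketch under assumptions: `decode_best_effort` and `guess_encoding` are hypothetical names, chardet is treated as optional, the worst case always falls back to utf-8, and it uses the `replace` error handler because Python documents xmlcharrefreplace as a write-side handler:

```python
from typing import Callable, Optional

try:
    import chardet  # third-party; optional in this sketch
except ImportError:
    chardet = None


def guess_encoding(data: bytes) -> Optional[str]:
    """Return chardet's best guess, or None if chardet is unavailable."""
    if chardet is None:
        return None
    return chardet.detect(data)["encoding"]


def decode_best_effort(
    data: bytes,
    guess: Callable[[bytes], Optional[str]] = guess_encoding,
) -> str:
    # Step 1: strict utf-8. If it decodes cleanly, we are done.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Step 2: only now pay for encoding detection, then decode strictly.
    guessed = guess(data)
    if guessed:
        try:
            return data.decode(guessed)
        except (UnicodeDecodeError, LookupError):
            pass  # wrong guess, or a codec Python does not know

    # Step 3: worst case, decode as utf-8 and replace the bad bytes.
    return data.decode("utf-8", errors="replace")
```

The `guess` parameter keeps chardet out of the fast path and makes the fallback easy to test without the dependency installed.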

Postscript

I ran flake8 and caught all the flakes in the code I introduced, as far as I know. Since I am a new committer, CI/CD won't run for me here.