Warn on broken steps, use yt-dlp to avoid youtube-dl errors, and don't crash on bad UTF-8 by turian · Pull Request #1026 · ArchiveBox/ArchiveBox

Summary

  • If some step is broken, we issue a warning instead of crashing.
  • We use yt-dlp instead of youtube-dl by default.
  • On bad unicode, we don't crash.

Quickest workaround for many people, until this is merged

Add this to ArchiveBox.conf:

YOUTUBEDL_BINARY=/usr/bin/yt-dlp

If that doesn't work, you can use my Docker image turian/archivebox:kludge-984-UTF8-bug, which includes this patch, instead of archivebox/archivebox for now.

@jgoerzen I think you also said this bug was a showstopper for you

Try it out

This finally works with this patch:

archivebox add 'https://www.ashra.com/news.php?m=A'

Related issues

Should close these issues:

Changes these areas

  • Bugfixes
  • Feature behavior
  • Command line interface
  • Configuration options
  • Internal architecture
  • Snapshot data layout on disk

youtube-dl -> yt-dlp

I began by changing archive_link so that a failing step produces a warning instead of an exception that crashes the run. This gave me better diagnostics than the exceptions you see in the issues above. I observed the following:
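To illustrate the warn-instead-of-crash behavior, here is a minimal sketch. It is not the actual archive_link code: run_steps and the step callables are hypothetical names chosen for this example, and the warning format is an assumption.

```python
import sys
from typing import Callable, Dict


def run_steps(url: str, steps: Dict[str, Callable[[str], object]]) -> Dict[str, str]:
    """Run every step for a URL; a failing step is reported, not fatal.

    Hypothetical helper for illustration only, not ArchiveBox's archive_link.
    """
    statuses: Dict[str, str] = {}
    for name, step in steps.items():
        try:
            step(url)
            statuses[name] = "succeeded"
        except Exception as err:
            # Warn on stderr and keep going so the remaining steps still run.
            print(f"[!] Warning: step {name!r} failed for {url}: {err!r}", file=sys.stderr)
            statuses[name] = "failed"
    return statuses
```

With this shape, one broken step (for example, media) no longer prevents the other steps from running for the same URL.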

       Extractor failed:
             Failed to save media
        Run to see full output:
            cd /home/archivebox/archivebox-turian/ArchiveBox/pipdata/archive/1663011964.432896;
            youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --write-sub --all-subs --write-auto-sub --convert-subs=srt --yes-playlist --continue --ignore-errors --no-abort-on-error --geo-bypass --add-metadata --max-filesize=750m https://www.ashra.com/news.php?m=A

Inspecting that, I saw:

$ youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --write-sub --all-subs --write-auto-sub --convert-subs=srt --yes-playlist --continue --ignore-errors --no-abort-on-error --geo-bypass --add-metadata --max-filesize=750m https://www.ashra.com/news.php?m=A
Usage: youtube-dl [OPTIONS] URL [URL...]

youtube-dl: error: no such option: --no-abort-on-error

Oops. youtube-dl doesn't have --no-abort-on-error. I think it only has --abort-on-error, but I'm not entirely sure what the default behavior is.

I switched the default from youtube-dl to yt-dlp, a more actively maintained fork that constantly pulls upstream changes from youtube-dl. It also has the --no-abort-on-error option, so now the media download works.

It operates essentially identically to youtube-dl (and FASTER).

The above youtube-dl options are present, with the following caveats:

  • We aren't doing this, but "When --embed-subs and --write-subs are used together, the subtitles are written to disk and also embedded in the media file. You can use just --embed-subs to embed the subs and automatically delete the separate file."
  • We aren't doing this, but "--add-metadata attaches the infojson to mkv files in addition to writing the metadata when used with --write-info-json. Use --no-embed-info-json or --compat-options no-attach-info-json to revert this"
  • --write-annotations: "No supported site has annotations now"

I haven't pushed the yt-dlp default changes to the submodules yet (deb_dist, brew_dist, docker, docs, etc/ArchiveBox.conf.default, pip_dist), but I'd like to. I would even go so far as to deprecate youtube-dl, but perhaps that's too radical for some people.
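A middle ground between switching the default and deprecating youtube-dl outright could be to prefer yt-dlp when it is installed and fall back to youtube-dl otherwise. The sketch below is an assumption about how that lookup might work, not what this patch does; find_media_binary is a hypothetical name.

```python
import shutil
from typing import Callable, Optional, Sequence


def find_media_binary(
    candidates: Sequence[str] = ("yt-dlp", "youtube-dl"),
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first candidate found on PATH, or None.

    Hypothetical helper: candidates are tried in order, so yt-dlp wins
    whenever both tools are installed.
    """
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None
```

The which parameter exists only so the lookup can be exercised without either tool installed; in normal use the default shutil.which is enough.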

UnicodeDecodeError fixes

A common complaint. In media.py, we used to have:

text_file.read_text(encoding='utf-8').strip()

This decodes in strict mode, so any decoding error causes a crash. I have changed the error handling to xmlcharrefreplace ("Characters not supported by the encoding are replaced with the appropriate XML character reference &#nnn;."), which seemed the best to me, but you can read about other options in my comment there or in more detail at Python's open function documentation.
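For reference, a tolerant read looks like the sketch below. One caveat: Python's documentation for open lists xmlcharrefreplace as supported only when writing, so this sketch uses backslashreplace on the read path instead. That substitution is an assumption of the example, not what the patch does, and read_text_tolerant is a hypothetical name.

```python
import tempfile
from pathlib import Path


def read_text_tolerant(path: Path) -> str:
    """Read a file as UTF-8 without raising on undecodable bytes.

    Assumption: backslashreplace is swapped in as the error handler, so an
    invalid byte such as 0xE9 shows up as the literal text \\xe9.
    """
    return path.read_text(encoding="utf-8", errors="backslashreplace").strip()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "description.txt"
        # 0xE9 is "e acute" in Latin-1 but is not valid UTF-8 here.
        sample.write_bytes(b"caf\xe9 menu\n")
        print(read_text_tolerant(sample))
```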

My preferred workaround, because sometimes things say they are utf-8 but are actually a different encoding, would be this:

  1. Try utf-8 in strict mode. If clean, then good.
  2. If error, use chardet to guess the encoding. (It's a little slow, so it's better as a fallback option.) Attempt to decode to Unicode using the guessed encoding in strict mode. If clean, then good.
  3. Worst case: Either use utf-8 or the highest confidence chardet guessed encoding, and decode it using xmlcharrefreplace for errors.

This would be a separate PR, if you approve of me including chardet as a pip dependency here and in all submodules (same as with yt-dlp).
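The three-step fallback described above can be sketched as follows. This is a rough outline under stated assumptions, not code from this PR: decode_with_fallback and guess_encoding are hypothetical names, chardet is imported lazily so the sketch still runs without it, and the last step uses backslashreplace because xmlcharrefreplace is documented as a write-side handler.

```python
from typing import Callable, Optional


def guess_encoding(raw: bytes) -> Optional[str]:
    """Return chardet's best guess, or None if chardet is not installed."""
    try:
        import chardet  # third-party; optional in this sketch
    except ImportError:
        return None
    return chardet.detect(raw).get("encoding")


def decode_with_fallback(
    raw: bytes,
    guess: Callable[[bytes], Optional[str]] = guess_encoding,
) -> str:
    # Step 1: strict UTF-8. If it decodes cleanly, we are done.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Step 2: only now pay for detection, then decode strictly with the guess.
    encoding = guess(raw)
    if encoding:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            pass

    # Step 3: worst case, decode leniently so the caller never crashes.
    try:
        return raw.decode(encoding or "utf-8", errors="backslashreplace")
    except LookupError:
        # The guessed codec name is unknown to Python; fall back to UTF-8.
        return raw.decode("utf-8", errors="backslashreplace")
```

Passing the detector in as a parameter keeps the slow chardet call out of the common path and makes the fallback order easy to test with a stub.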

Postscript

I ran flake8 and caught all the flakes in the code I introduced, as far as I know. Since I am a new committer, CI/CD won't run for me here.