justhtml
Command Line Interface
JustHTML ships with a small CLI for parsing HTML and extracting HTML/text/Markdown from selected parts of a document.
Running
If you installed JustHTML (for example with pip install justhtml or pip install -e .), you can use the justhtml command.
If you donβt have it available, use the equivalent python -m justhtml ... form.
Basic usage
# Pretty-print an HTML file
justhtml page.html
# Read HTML from stdin
curl -s https://example.com | justhtml -
Selecting nodes
Use --selector to choose which nodes to extract.
# Extract text from all paragraphs
justhtml page.html --selector "p" --format text
# Only output the first match
justhtml page.html --selector "main p" --format text --first
# In Python, this corresponds to: doc.query_one("main p")
Fragments
Use --fragment to parse the input as an HTML fragment (instead of a full document). This avoids implicit <html>, <head>, and <body> insertion.
echo '<li>Hi</li>' | justhtml - --fragment
Strict Mode
Use --strict to fail immediately and print an error to stderr if the input HTML contains parsing errors or is otherwise malformed according to the WHATWG specification.
echo '</b>' | justhtml - --strict --fragment
# Exits with code 2 and prints:
# StrictModeError: Element 'b' has no matching start tag
Output formats
--format controls what is printed:
html(default): pretty-printed HTML for each matchtext: concatenated text (same semantics asto_text(separator=" ", strip=True); sanitized by default)markdown: a pragmatic subset of GitHub Flavored Markdown (GFM)
Notes:
markdownkeeps tables (<table>) and images (<img>) as raw HTML.- For multiple matches:
htmlandtextprint one result per line.markdownprints matches separated by a blank line.
Sanitization
By default, the CLI sanitizes output (same safe-by-default behavior as JustHTML(..., sanitize=True)).
To disable sanitization for trusted input, pass --unsafe.
In safe mode, you can allow additional tags via --allow-tags (comma-separated). This augments the default policy (document vs fragment).
Example:
justhtml page.html --selector "article" --allow-tags article,section --format markdown
Cleanup
--cleanup removes common unhelpful output artifacts:
- unwrap
<a>tags that have nohref - drop
<img>tags that have nosrc(orsrc="") - prune empty tags
This is useful when sanitization has stripped attributes and left behind empty tags.
curl -s https://example.com | justhtml - --format html --cleanup
Text options
When using --format text, you can control whitespace handling:
--separator "..."(default: a single space) joins text nodes--separator-blocks-onlyapplies--separatoronly between block-level elements (avoid separators inside inline tags)--strip/--no-stripcontrols whether each text node is stripped and empty segments dropped
Example:
justhtml page.html --selector "main" --format text --separator "" --no-strip
Exit codes
0: success1: missing input path or no matches for the selector2: invalid selector, or a parse error when--strictis enabled
Real-world example
curl -s https://github.com/EmilStenstrom/justhtml/ | justhtml - --selector '.markdown-body' --format markdown | head -n 15
Output:
# JustHTML
[](#justhtml)
A pure Python HTML5 parser that just works. No C extensions to compile. No system dependencies to install. No complex API to learn.
**[π Read the full documentation here](/EmilStenstrom/justhtml/blob/main/docs/index.md)**
## Why use JustHTML?
[](#why-use-justhtml)
### 1. Just... Correct β
[](#1-just-correct-)