An HTML to Markdown converter.
Installation
Usage
In Ruby:
require "upmark"
html = "<p>messenger <strong>bag</strong> skateboard</p>"
markdown = Upmark.convert(html)
puts markdown
From the command line:
You can also pipe poorly formatted HTML documents through tidy before piping them into upmark:
> cat bar.html | tidy -asxhtml -indent -quiet --show-errors 0 --show-warnings 0 --show-body-only 1 --wrap 0 | upmark
Features
Upmark will convert the following (arbitrarily nested) HTML elements to Markdown:
- strong
- em
- p
- a
- h1, h2, h3, h4, h5, h6
- ul
- ol
- br
It will also pass through block and span-level HTML elements that Markdown does not use (e.g. table, div, span), leaving them as HTML in the output.
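To make the convert-or-pass-through split concrete, here is a minimal sketch in plain Ruby. It is not Upmark's code: the helper names (markdown_converters, render_element) are hypothetical, only a few of the supported elements are covered, and the Markdown it emits is the conventional syntax rather than Upmark's confirmed output.

```ruby
# Hypothetical lookup table (not Upmark's internals): maps a tag name to a
# lambda that wraps already-converted inner text in conventional Markdown.
def markdown_converters
  table = {
    "strong" => ->(text) { "**#{text}**" },
    "em"     => ->(text) { "*#{text}*" },
    "p"      => ->(text) { "#{text}\n\n" }
  }
  # h1..h6 become one to six leading '#' characters.
  (1..6).each do |level|
    table["h#{level}"] = ->(text) { ("#" * level) + " #{text}\n\n" }
  end
  table
end

# Convert a known element to Markdown; anything else is emitted as HTML,
# unchanged, which is the pass-through behaviour described above.
def render_element(tag, text)
  converter = markdown_converters[tag]
  return converter.call(text) if converter
  "<#{tag}>#{text}</#{tag}>"
end

puts render_element("strong", "bag")      # **bag**
puts render_element("h2", "Features")     # ## Features (plus a blank line)
puts render_element("span", "untouched")  # <span>untouched</span>
```

A lookup table keeps the set of supported elements in one place, so an unknown tag falls through to the HTML branch without any special casing.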
How it works
Upmark defines a parsing expression grammar (PEG) using the very awesome Parslet gem. This PEG is then used to convert HTML into Markdown in 4 steps:
- Parse the XHTML into an abstract syntax tree (AST).
- Normalize the AST into a nested hash of HTML elements.
- Mark the block and span-level subtrees which should be ignored (table, div, span, etc).
- Convert the AST leaves into Markdown.
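The four steps above can be sketched end to end in plain Ruby. This is an illustration under loud assumptions, not Upmark's implementation: a small regex tokenizer stands in for the Parslet grammar, all method names (parse, normalize, mark_ignored, convert, html_to_markdown) are hypothetical, and the input must be well-formed XHTML with no attributes that contain ">" and no self-closing tags.

```ruby
# Step 1: parse XHTML into a raw AST of [tag, children] pairs.
# Text nodes are plain strings. A regex tokenizer stands in for Parslet here.
def parse(xhtml)
  root = ["root", []]
  stack = [root]
  xhtml.scan(%r{<(/?)(\w+)[^>]*>|([^<]+)}) do |closing, tag, text|
    if text
      stack.last[1] << text
    elsif closing == "/"
      stack.pop                      # closing tag: step back up one level
    else
      node = [tag, []]
      stack.last[1] << node          # opening tag: attach, then descend
      stack.push(node)
    end
  end
  root
end

# Step 2: normalize the raw AST into a nested hash of elements.
def normalize(node)
  return { text: node } if node.is_a?(String)
  tag, children = node
  { element: tag, children: children.map { |child| normalize(child) } }
end

# Step 3: mark subtrees that should be left as HTML.
def mark_ignored(node, ignored_tags = %w[table div span])
  return node if node.key?(:text)
  return node.merge(ignore: true) if ignored_tags.include?(node[:element])
  node.merge(children: node[:children].map { |c| mark_ignored(c, ignored_tags) })
end

# Re-serialize an ignored subtree as HTML.
def to_html(node)
  return node[:text] if node.key?(:text)
  inner = node[:children].map { |child| to_html(child) }.join
  "<#{node[:element]}>#{inner}</#{node[:element]}>"
end

# Step 4: convert the remaining tree into Markdown, leaves first.
def convert(node)
  return node[:text] if node.key?(:text)
  return to_html(node) if node[:ignore]
  inner = node[:children].map { |child| convert(child) }.join
  case node[:element]
  when "strong" then "**#{inner}**"
  when "em"     then "*#{inner}*"
  when "p"      then "#{inner}\n\n"
  else inner                         # root and unhandled tags: keep the text
  end
end

def html_to_markdown(xhtml)
  convert(mark_ignored(normalize(parse(xhtml)))).strip
end

puts html_to_markdown("<p>messenger <strong>bag</strong> skateboard</p>")
# messenger **bag** skateboard
```

Keeping each stage as a separate method mirrors the list above: marking happens before conversion, so the converter only has to check a flag instead of knowing which tags Markdown lacks.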