blog: Don Marti


Removing extra, complex markup from an HTML page

20 January 2020

Interesting question from Doc Searls, on Twitter:

Here's one way. Save your messy HTML file (here I'm calling it index.html), then:

pandoc -f html -t commonmark < index.html \
| grep -v '<' \
| pandoc -f commonmark -t html -o clean_index.html

Open clean_index.html and there's your old-school web page.

What this command is doing is...

First, convert the contents of index.html to the simpler CommonMark markup format.

pandoc -f html -t commonmark index.html 

Next, filter out any lines containing HTML markup that Pandoc wasn't able to translate. Note that this drops every line containing a "<", along with any other text on that line.

grep -v '<' 
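To see what the filter step does on its own, here is a small sketch that doesn't need Pandoc. The sample text is made up for illustration; it just imitates the sort of CommonMark-with-leftover-HTML that the first step might hand you.

```shell
# Made-up sample: CommonMark text with some raw HTML lines left in it.
sample='# My page

<div class="wrapper">

Some plain paragraph text.

</div>'

# Keep only the lines that do not contain a "<" character.
filtered=$(printf '%s\n' "$sample" | grep -v '<')
printf '%s\n' "$filtered"
```

The heading and the paragraph come through, and the two div lines are gone.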

Finally, convert back to HTML and write it out to a new file.

pandoc -f commonmark -t html -o clean_index.html
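If you end up doing this more than once, the three steps can be wrapped in a shell function. This is only a sketch: clean_html is a name I made up here, not something Pandoc provides, and it assumes pandoc is installed and on your PATH.

```shell
# clean_html is a hypothetical helper name, not part of Pandoc.
# Usage: clean_html INPUT.html OUTPUT.html
clean_html() {
    # Step 1: convert the messy HTML to CommonMark.
    # Step 2: drop every line that still contains a "<".
    # Step 3: convert the remaining CommonMark back to HTML.
    pandoc -f html -t commonmark "$1" \
        | grep -v '<' \
        | pandoc -f commonmark -t html -o "$2"
}
```

For example, clean_html index.html clean_index.html does the same thing as the one-liner at the top.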

Where to get Pandoc, and more info: Pandoc

At Southern California Linux Expo a few years ago I did a talk with more Pandoc tricks: Using git and make for tasks beyond coding [LWN.net]

By the way, I'm scheduled to speak at SCALE again this year: Hacking the California Consumer Privacy Act for Fun and Profit (and freedom and privacy), so I hope to see you there.

