blog: Don Marti


Removing extra, complex markup from an HTML page

20 January 2020

Interesting question from Doc Searls, on Twitter:

Here's one way. Save your messy HTML file (here I'm calling it index.html), then run:

pandoc -f html -t commonmark < index.html \
| grep -v '<' \
| pandoc -f commonmark -t html -o clean_index.html

Open clean_index.html and there's your old-school web page.

Here's what this command is doing, step by step.

First, convert the contents of index.html to the simpler CommonMark markup format.

pandoc -f html -t commonmark index.html 

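Next, filter out any lines containing HTML markup that Pandoc wasn't able to translate. This drops every line that has a < character in it, so a line of ordinary text that happens to contain one goes too.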

grep -v '<' 
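To see what the filter does on its own, here is a small sketch that runs it on a made-up chunk of CommonMark. The sample text and the variable names `sample` and `filtered` are invented for illustration and are not output from a real page.

```shell
# Hypothetical sample: CommonMark with some raw HTML lines left in it.
sample='# My page
<div class="sidebar">
Some plain text.
</div>
A [link](https://example.com/) survives.'

# Keep only the lines that do not contain a "<" character.
filtered=$(printf '%s\n' "$sample" | grep -v '<')

# Show what is left after filtering.
printf '%s\n' "$filtered"
```

The two div lines are gone, and the heading, the plain text, and the Markdown-style link are kept.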

Finally, convert back to HTML and write it out to a new file.

pandoc -f commonmark -t html -o clean_index.html
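If you do this often, the three steps could be wrapped in a shell function. This is only a sketch: the function name `clean_html` and its two-argument interface are my own invention, and it assumes pandoc is installed and on your PATH.

```shell
# clean_html INPUT OUTPUT
# Hypothetical helper: runs the same three-step pipeline as above
# on any input file and writes the cleaned page to OUTPUT.
clean_html() {
    if [ "$#" -ne 2 ]; then
        echo "usage: clean_html INPUT.html OUTPUT.html" >&2
        return 2
    fi
    # 1. HTML to CommonMark, 2. drop lines containing "<",
    # 3. CommonMark back to HTML, written to the output file.
    pandoc -f html -t commonmark "$1" \
        | grep -v '<' \
        | pandoc -f commonmark -t html -o "$2"
}
```

With that defined, the original example becomes clean_html index.html clean_index.html.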

Where to get Pandoc, and more info: Pandoc

At Southern California Linux Expo a few years ago I did a talk with more Pandoc tricks: Using git and make for tasks beyond coding [LWN.net]

By the way, I'm scheduled to speak at SCALE again this year: Hacking the California Consumer Privacy Act for Fun and Profit (and freedom and privacy), so hope to see you there.

Bonus links

The last radio station

Publishers Sense Opportunity As Chrome Drops Third-Party Cookies

How Will Publishers Fare As Google Moves To Kill Cookies In Chrome?

Newish things that haven’t made advertising better, part 9: certainty.

The anti-predictions: What won’t happen in media and marketing in 2020

This Page is Designed to Last