Removing extra, complex markup from an HTML page
20 January 2020
Interesting question from Doc Searls, on Twitter:
Is there a way to export a non-complex @Medium post (one, say, just with links) that isn't full of cruft like "p id="4f89" class="gw gx ar bj gy b gz ha hb hc hd he hf hg hh hi hj" data-selectable-paragraph="" ? How about an "Export as old school html?" option in the ••• menu?
— Doc Searls (@dsearls) January 16, 2020
Here's one way. Save your messy HTML file (here I'm calling it index.html), then run:
pandoc -f html -t commonmark < index.html \
| grep -v '<' \
| pandoc -f commonmark -t html -o clean_index.html
Open clean_index.html and there's your old-school web page.
What this command is doing is...
First, convert the contents of index.html to the simpler CommonMark markup format.
pandoc -f html -t commonmark index.html
Next, filter out every line that contains a < character, which removes the raw HTML markup that Pandoc wasn't able to translate into CommonMark.
grep -v '<'
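To see what that filter keeps and drops, here is a small sketch that runs it on a made-up chunk of intermediate CommonMark. The sample text and the variable names are assumptions for illustration, not real Pandoc output. Note that the filter works on whole lines, so a line of ordinary prose that happens to contain a < (for example a CommonMark autolink such as <https://example.com/>) is dropped too.

```shell
# Hypothetical sample of what the intermediate CommonMark might look like.
sample='A paragraph with a [link](https://example.com/).
<div class="gw gx ar">
Text inside the leftover div.
</div>
See <https://example.com/> for more.'

# grep -v prints only the lines that do NOT match the pattern.
filtered=$(printf '%s\n' "$sample" | grep -v '<')

# Show what survived the filter.
printf '%s\n' "$filtered"
```

Only the first and third sample lines survive. If losing autolinked lines matters for your page, a narrower pattern might be worth trying, at the cost of letting more leftover markup through.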
Finally, convert back to HTML and write it out to a new file.
pandoc -f commonmark -t html -o clean_index.html
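If you do this often, the three stages can be wrapped in a shell function. This is a minimal sketch: the function name clean_html and its two-argument interface are my own invention for illustration, and it assumes pandoc is installed and on your PATH.

```shell
# clean_html: hypothetical helper name, not part of Pandoc.
# Usage: clean_html INPUT.html OUTPUT.html
clean_html() {
    if [ "$#" -ne 2 ]; then
        echo "usage: clean_html INPUT.html OUTPUT.html" >&2
        return 2
    fi
    # Stage 1: convert the messy HTML to CommonMark.
    # Stage 2: drop every line that still contains a '<'.
    # Stage 3: convert the CommonMark back to plain HTML.
    pandoc -f html -t commonmark "$1" \
        | grep -v '<' \
        | pandoc -f commonmark -t html -o "$2"
}
```

After defining it in your shell, you could run something like clean_html index.html clean_index.html and open the result in a browser.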
Where to get Pandoc, and more info: Pandoc
At Southern California Linux Expo a few years ago I did a talk with more Pandoc tricks: Using git and make for tasks beyond coding [LWN.net]
By the way, I'm scheduled to speak at SCALE again this year: Hacking the California Consumer Privacy Act for Fun and Profit (and freedom and privacy), so hope to see you there.
Bonus links
Publishers Sense Opportunity As Chrome Drops Third-Party Cookies
How Will Publishers Fare As Google Moves To Kill Cookies In Chrome?
Newish things that haven’t made advertising better, part 9: certainty.
The anti-predictions: What won’t happen in media and marketing in 2020