Removing extra, complex markup from an HTML page
20 January 2020
Interesting question from Doc Searls, on Twitter:
Is there a way to export a non-complex @Medium post (one, say, just with links) that isn't full of cruft like "p id="4f89" class="gw gx ar bj gy b gz ha hb hc hd he hf hg hh hi hj" data-selectable-paragraph="" ? How about an "Export as old school html?" option in the ••• menu?— Doc Searls (@dsearls) January 16, 2020
Here's one way. Save your messy HTML file (here I'm calling it index.html) and run:
pandoc -f html -t commonmark < index.html \
  | grep -v '<' \
  | pandoc -f commonmark -t html -o clean_index.html
Open clean_index.html and there's your old-school web page.
What this command is doing is...
First, convert the contents of index.html to the simpler CommonMark markup format.
pandoc -f html -t commonmark index.html
Next, filter out any lines containing HTML markup that Pandoc wasn't able to translate.
grep -v '<'
Finally, convert back to HTML and write it out to a new file.
pandoc -f commonmark -t html -o clean_index.html
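The middle filtering step can be tried on its own, without Pandoc, to see what it keeps and what it throws away. This is only a sketch: the sample text below is made up to resemble CommonMark with leftover raw HTML, and is not actual Pandoc output from a Medium page.

```shell
# Hypothetical sample: CommonMark text with raw HTML lines left in,
# similar to what might remain when markup can't be translated.
sample='A plain paragraph with a [link](https://example.com).
<div class="gw gx ar">
Another plain paragraph.
</div>'

# grep -v prints only the lines that do NOT match the pattern,
# so every line containing a "<" is dropped.
filtered=$(printf '%s\n' "$sample" | grep -v '<')
printf '%s\n' "$filtered"
```

Only the two plain paragraph lines should come through. Because grep works a line at a time, a line of ordinary text that happens to contain a `<` character would be dropped too, so it's worth skimming the result before publishing it.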
Where to get Pandoc, and more info: Pandoc
At Southern California Linux Expo a few years ago I did a talk with more Pandoc tricks: Using git and make for tasks beyond coding [LWN.net]
By the way, I'm scheduled to speak at SCALE again this year: Hacking the California Consumer Privacy Act for Fun and Profit (and freedom and privacy), so hope to see you there.