I’ll write up more information later, but a couple people have asked for this in #PowerShell on irc.freenode.net, and I had it already written, so here you go … my ConvertFrom-Html cmdlet (in a Huddled.HtmlSnapin). It converts HTML to valid xml using the SGML Parser which was available on GotDotNet years ago. It only works with files (doesn’t do URL downloads yet). Use it like this:


$url = "http://huddledmasses.org/"
$file = Join-Path $pwd "HuddledMasses.html"

$client = new-object System.Net.WebClient
$client.DownloadFile( $url, $file ) #NOTE: You need to use a full path here, not relative

$xml = ConvertFrom-Html $file

# Or even
(ConvertFrom-Html $file).Save($file)
 

The source code to my plugin may be considered public domain, and is included in the Huddled HTML SnapIn Zip.

However, the SgmlReader library is a Microsoft Sample which is licensed under the old MS Samples license which doesn’t allow reuse with viral open source software. I’ve seen some work being done on an HtmlAgilityPack on CodePlex (using a Creative Commons ASA license) but I have not really looked at it except to see that it has a several active issues related to entity encoding and dropping malformed tags which I haven’t encountered in SgmlReader …