I’ll write up more information later, but a couple of people have asked for this in #PowerShell on irc.freenode.net, and I had it already written, so here you go: my ConvertFrom-Html cmdlet (in the Huddled.HtmlSnapin snap-in). It converts HTML to valid XML using the SGML parser (SgmlReader) which was available on GotDotNet years ago. It only works with files (it doesn’t do URL downloads yet). Use it like this:
$url = "http://huddledmasses.org/"
$file = Join-Path $pwd "HuddledMasses.html"
$client = New-Object System.Net.WebClient
$client.DownloadFile( $url, $file ) # NOTE: DownloadFile needs a full path here, not a relative one
$xml = ConvertFrom-Html $file
# Or even convert in place, overwriting the downloaded HTML with the XML
(ConvertFrom-Html $file).Save($file)
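Since the cmdlet only accepts files, the download-then-convert steps above could be wrapped in a small helper. This is only a rough sketch: the function name Get-HtmlAsXml is made up for illustration and is not part of the snap-in, it assumes ConvertFrom-Html is already loaded and reads the whole file before returning, and it uses try/finally, which needs PowerShell 2.0 or later.

```powershell
# Hypothetical helper (not part of Huddled.HtmlSnapin): download a page to a
# temporary file, then hand that file to ConvertFrom-Html.
function Get-HtmlAsXml {
    param(
        [Parameter(Mandatory = $true)]
        [string]$Url
    )

    # GetTempFileName returns a full path, which is what DownloadFile needs.
    $file = [System.IO.Path]::GetTempFileName()
    $client = New-Object System.Net.WebClient
    try {
        $client.DownloadFile($Url, $file)
        # Assumes ConvertFrom-Html has finished reading the file when it returns.
        ConvertFrom-Html $file
    }
    finally {
        # Clean up the client and the temporary file even if something failed.
        $client.Dispose()
        Remove-Item $file -ErrorAction SilentlyContinue
    }
}
```

With that in place, something like `$xml = Get-HtmlAsXml "http://huddledmasses.org/"` should give the same result as the longer version, without leaving the downloaded HTML file behind.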
The source code to my plugin may be considered public domain, and is included in the Huddled HTML SnapIn Zip.
However, the SgmlReader library is a Microsoft sample, licensed under the old MS Samples license, which doesn’t allow reuse with viral open source software. I’ve seen some work being done on HtmlAgilityPack on CodePlex (using a Creative Commons ASA license), but I have not really looked at it except to see that it has several active issues related to entity encoding and dropping malformed tags, which I haven’t encountered in SgmlReader …
The SgmlReader library has since been re-released on MSDN Code Gallery under an Ms-PL license, and all is well with the world.
Please don’t hesitate to drop me a note (with a reproducible issue) if you need advice on Html Agility Pack (I wrote it). It should handle encoding and malformed tags like a charm, but sometimes its usage has to be tweaked.