Well, after multiple attempts at a wget PowerShell script (the last one works very well for downloading web pages and files, as long as you don’t need to send POST parameters or anything like that) … I found myself writing a script last week that included a custom HTTP POST function and reused some functionality I wrote earlier (ConvertFrom-Html) to convert HTML files to XML documents, which PowerShell can deal with nicely.

So I’ve taken my Huddled.Html code and worked it in together with a few extra bits … and I’m releasing it under the new name “PoshHttp” (which has the sole benefit of being short, for the cases when you need to type the full command name, like PoshHttp\Get-Web).

In lieu of providing proper documentation for this, there’s a script below which shows off some of the features of the Get-Web cmdlet … but first, you’re going to want to download PoshHttp and I can give you a basic overview.

Get-Web

Downloads a file from a web URI (Uniform Resource Identifier), optionally passing POST or GET parameters, which it accepts as a Hashtable. If the result is an XML or HTML file, it converts it to XML and outputs it as an XmlDocument object. Otherwise, it saves it to a file and outputs the FileInfo object.

  • Plain will prevent the conversion to XML
  • Save will force saving to file (either XML or binaries)
  • Path lets you specify the file name/path (normally, it selects a path based on what you’re downloading).
  • Force will prevent it from asking if you want to overwrite files
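
PoshHttp's own code isn't shown in this post, so the following is not how Get-Web is implemented. It's only a minimal sketch of the same idea (POST parameters supplied as a Hashtable) using the plain .NET WebClient class. The function name ConvertTo-NameValueCollection, the parameter values, and the example.com URL are all made up for illustration.

```powershell
# Hypothetical helper: turn a Hashtable into the NameValueCollection
# that System.Net.WebClient.UploadValues expects as form data.
function ConvertTo-NameValueCollection([hashtable]$Parameters) {
    $collection = New-Object System.Collections.Specialized.NameValueCollection
    foreach ($key in $Parameters.Keys) {
        $collection.Add([string]$key, [string]$Parameters[$key])
    }
    # Emit as one object: without the leading comma PowerShell would
    # enumerate the collection and output only its keys.
    ,$collection
}

# Made-up form fields, purely for illustration
$form = ConvertTo-NameValueCollection @{ q = "powershell"; page = 2 }

# Sending it would look roughly like this (left commented out so the
# sketch doesn't touch the network; the URL is a placeholder):
# $client = New-Object System.Net.WebClient
# $bytes  = $client.UploadValues( "http://example.com/search", "POST", $form )
# [System.Text.Encoding]::UTF8.GetString( $bytes )
```

UploadValues returns the raw response bytes, so converting HTML to XML or saving to a file, the parts Get-Web handles for you, would still be left to the caller.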

ConvertFrom-Html

Converts HTML to XML. I’ll eventually tweak this so it supports passing it a file path and even saving the XML in-place on the original file … but for now you can use it on a stream of text, like this: (Get-Content file.html | ConvertFrom-Html).save( "file.xml" ). Note that it outputs an XmlDocument object … you can get the XML as text on the console by calling Get_OuterXml() on it.

ConvertTo-Hashtable

Converts a URL-style query string or NameValueCollection into a Hashtable. It’s mainly there to make it easier to use Get-Web in cases where you’re getting the URL or POST data from elsewhere.
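
To make the query-string case concrete, here is a rough sketch of that kind of conversion. It is not the ConvertTo-Hashtable source: the function name ConvertFrom-QueryStringSketch and the sample query are hypothetical, and it assumes that when a name appears more than once, keeping the last value is acceptable.

```powershell
# Hypothetical stand-in, not the PoshHttp implementation
function ConvertFrom-QueryStringSketch([string]$Query) {
    $table = @{}
    foreach ($pair in $Query.TrimStart('?').Split('&')) {
        if (-not $pair) { continue }          # skip empty segments such as "a=1&&b=2"
        $name, $value = $pair.Split('=', 2)   # split on the first '=' only
        # In form-encoded data '+' stands for a space; swap it before unescaping
        $name  = [Uri]::UnescapeDataString( $name.Replace('+', ' ') )
        $value = [Uri]::UnescapeDataString( "$value".Replace('+', ' ') )
        $table[$name] = $value                # assumption: the last duplicate wins
    }
    $table
}

# Made-up query string, purely for illustration
$params = ConvertFrom-QueryStringSketch "?q=power+shell&lang=en%2Dus&page=2"
```

The resulting Hashtable is the shape of input that Get-Web is described as accepting for its POST or GET parameters.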

Here’s a script with a few example uses of Get-Web: (more…)

Just a quick post to upload a simple PowerShell script that transforms XML files with XSL files. There’s actually a cmdlet in PowerShell Community Extensions which does this, but believe it or not, in all my tests the script outperforms the cmdlet (on the same file, the cmdlet takes 117 ms on average while the script takes 63 ms).

function Convert-WithXslt($originalXmlFilePath, $xslFilePath, $outputFilePath)
{
   ## Simplistic error handling: test each input before resolving it,
   ## because resolve-path itself fails on a path that doesn't exist
   if( -not (test-path $xslFilePath) ) { throw "Can't find the XSL file" }
   $xslFilePath = (resolve-path $xslFilePath).ProviderPath
   if( -not (test-path $originalXmlFilePath) ) { throw "Can't find the XML file" }
   $originalXmlFilePath = (resolve-path $originalXmlFilePath).ProviderPath

   ## The output file may not exist yet, so resolve its folder instead
   $outputFolder = split-path $outputFilePath
   if( -not $outputFolder ) { $outputFolder = "." }
   if( -not (test-path $outputFolder) ) { throw "Can't find the output folder" }
   $outputFilePath = join-path (resolve-path $outputFolder).ProviderPath (split-path $outputFilePath -leaf)

   ## Get an XSL Transform object (try for the faster XslCompiledTransform first)
   $EAP = $ErrorActionPreference
   $ErrorActionPreference = "SilentlyContinue"
   $script:xslt = new-object system.xml.xsl.xslcompiledtransform
   trap [System.Management.Automation.PSArgumentException]
   {  # XslCompiledTransform isn't available, use the older, slower XslTransform
      $ErrorActionPreference = $EAP
      $script:xslt = new-object system.xml.xsl.xsltransform
      continue  # swallow the error and carry on after the failed new-object
   }
   $ErrorActionPreference = $EAP
   
   ## load xslt file
   $xslt.load( $xslFilePath )
     
   ## transform
   $xslt.Transform( $originalXmlFilePath, $outputFilePath )
}
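
If you want to try the Load and Transform calls from the function in isolation, here is a small self-contained check. The file names and the tiny XML and XSL documents are invented for the example, and it assumes XslCompiledTransform is available on your machine, so it skips the fallback.

```powershell
# Hypothetical sample files in the temp folder, purely for illustration
$dir     = [System.IO.Path]::GetTempPath()
$xmlPath = Join-Path $dir "sample-input.xml"
$xslPath = Join-Path $dir "sample-style.xsl"
$outPath = Join-Path $dir "sample-output.txt"

Set-Content -Path $xmlPath -Value '<greeting><name>World</name></greeting>'
Set-Content -Path $xslPath -Value @'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">Hello, <xsl:value-of select="greeting/name"/>!</xsl:template>
</xsl:stylesheet>
'@

# The same two calls the function makes, with full paths throughout
$xslt = New-Object System.Xml.Xsl.XslCompiledTransform
$xslt.Load( $xslPath )
$xslt.Transform( $xmlPath, $outPath )

# Read the transformed text back
$result = Get-Content $outPath
$result
```

Wrapping the Transform call in Measure-Command is an easy way to get timings of your own for comparison.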
 

I’ll write up more information later, but a couple people have asked for this in #PowerShell on irc.freenode.net, and I had it already written, so here you go … my ConvertFrom-Html cmdlet (in a Huddled.HtmlSnapin). It converts HTML to valid xml using the SGML Parser which was available on GotDotNet years ago. It only works with files (doesn’t do URL downloads yet). Use it like this:


$url = "http://huddledmasses.org/"
$file = Join-Path $pwd "HuddledMasses.html"

$client = new-object System.Net.WebClient
$client.DownloadFile( $url, $file ) #NOTE: You need to use a full path here, not relative

$xml = ConvertFrom-Html $file

# Or even
(ConvertFrom-Html $file).Save($file)
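
A note on the full-path comment above: .NET methods resolve relative paths against the process working directory, which usually isn't the folder PowerShell shows as your current location. Join-Path $pwd is one way around that. Another, sketched below with a made-up relative file name, is to ask PowerShell to expand the path itself.

```powershell
# Hypothetical relative file name, for illustration only
$relative = "HuddledMasses.html"

# Expand it against PowerShell's current location; the file doesn't have to exist yet
$full = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath( $relative )

# The two "current directories" can differ once you've used Set-Location
"PowerShell location:  $pwd"
".NET working folder:  " + [Environment]::CurrentDirectory
"Path to hand to .NET: $full"
```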
 

The source code to my plugin may be considered public domain, and is included in the Huddled HTML SnapIn Zip.

However, the SgmlReader library is a Microsoft Sample which is licensed under the old MS Samples license, which doesn’t allow reuse with viral open source software. I’ve seen some work being done on an HtmlAgilityPack on CodePlex (using a Creative Commons ASA license), but I have not really looked at it except to see that it has several active issues related to entity encoding and dropping malformed tags, which I haven’t encountered in SgmlReader …