Well, after multiple attempts at a wget-style PowerShell script (the last one works very well for downloading web pages and files, as long as you don’t need to send POST parameters or anything like that), I found myself writing a script last week that included a custom HTTP POST function and reused some functionality I wrote earlier (ConvertFrom-Html) to convert HTML files to XML documents, which PowerShell can deal with nicely.
So I’ve taken my Huddled.Html code, worked it in together with a few extra bits, and I’m releasing it under the new name “PoshHttp” (which has the sole benefit of being short, for the cases when you need to type the full command names, like PoshHttp\Get-Web).
In lieu of providing proper documentation for this, there’s a script below which shows off some of the features of the Get-Web cmdlet … but first, you’re going to want to download PoshHttp, so here’s a basic overview.
Get-Web
Downloads a file from a web URI, optionally passing POST or GET parameters, which it accepts as a Hashtable. If the result is an XML or HTML file, it converts it to XML and outputs it as an XmlDocument object. Otherwise, it saves the download to a file and outputs the FileInfo object.
- Plain will prevent the conversion to xml
- Save will force saving to file (either xml or binaries)
- Path lets you specify the file name/path (normally, it selects a path based on what you’re downloading).
- Force will prevent it from asking if you want to overwrite files
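To give a rough idea of what passing parameters as a Hashtable involves, here is a minimal sketch of how such a table might be flattened into the name=value&name=value text that a GET query or POST body carries. ConvertTo-QueryString is a hypothetical helper written for this illustration, not a PoshHttp command, and Get-Web may well do this differently internally.

```powershell
# Hypothetical helper (not part of PoshHttp): encode a Hashtable as
# form-style text, usable as a GET query string or a POST body.
function ConvertTo-QueryString {
    param([System.Collections.IDictionary]$Parameters)

    # Sort the keys so the output order is predictable (Hashtables are unordered)
    $pairs = foreach ($key in ($Parameters.Keys | Sort-Object)) {
        # EscapeDataString percent-encodes reserved characters such as '&', '=' and spaces
        $name  = [System.Uri]::EscapeDataString([string]$key)
        $value = [System.Uri]::EscapeDataString([string]$Parameters[$key])
        "$name=$value"
    }
    $pairs -join '&'
}

$body = ConvertTo-QueryString @{ text = 'hola mundo'; langpair = 'es|en' }
$body
```

For a GET request the resulting string would be appended to the URL after a "?"; for a POST it would be sent as the request body.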
ConvertFrom-Html
Converts HTML to XML. I’ll eventually tweak this so it supports passing it a file path and even saving the XML in place over the original file … but for now you can use it on a stream of text, like this: (Get-Content file.html | ConvertFrom-Html).Save( "file.xml" ). Note that it outputs an XmlDocument object … you can get the XML as text on the console using get_OuterXml() (or simply the OuterXml property).
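Once you have an XmlDocument, PowerShell lets you walk it with plain property syntax or with XPath. The sketch below does not call ConvertFrom-Html at all; it casts a small, made-up, well-formed snippet with [xml] just to show what working with the resulting object looks like.

```powershell
# A tiny well-formed document standing in for converted HTML (the markup is invented)
[xml]$doc = "<html><body><div><img src='/strip.png' /></div></body></html>"

# Property-style navigation through PowerShell's XML adapter
$src = $doc.html.body.div.img.src

# The same element located with an XPath query
$node = $doc.SelectSingleNode("//img")

# The whole document serialized back to text
$text = $doc.OuterXml

$src
$node.GetAttribute("src")
```

Property navigation is the quickest option for a fixed page layout, while SelectSingleNode tends to survive small changes in the surrounding markup.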
ConvertTo-Hashtable
Converts a URL-style query string or NameValueCollection into a Hashtable. It’s mainly there to make it easier to use Get-Web in cases where you’re getting the URL or POST data from elsewhere.
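As an illustration of the query-string case only, here is a minimal sketch of that kind of conversion. ConvertFrom-QueryString is a hypothetical stand-in written for this example; it is not the code behind ConvertTo-Hashtable, and it assumes that when a name repeats, the last value wins.

```powershell
# Hypothetical stand-in (not PoshHttp's code): parse "a=1&b=2" into a Hashtable
function ConvertFrom-QueryString {
    param([string]$Query)

    $table = @{}
    foreach ($pair in ($Query.TrimStart('?') -split '&')) {
        if (-not $pair) { continue }                  # skip empty segments, e.g. a trailing '&'
        $rawName, $rawValue = $pair -split '=', 2     # split on the first '=' only

        # '+' means a space in form encoding; UnescapeDataString decodes %XX sequences
        $name  = [System.Uri]::UnescapeDataString(([string]$rawName).Replace('+', ' '))
        $value = [System.Uri]::UnescapeDataString(([string]$rawValue).Replace('+', ' '))

        $table[$name] = $value                        # assumption: last value wins on duplicates
    }
    $table
}

$params = ConvertFrom-QueryString 'text=hola+mundo&langpair=es%7Cen'
$params['text']
$params['langpair']
```

A table like this could then be handed to the -Post or -Get parameter of Get-Web.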
Here’s a script with a few example uses of Get-Web:
## Find the url of the image (download an html file as an xmldocument)
$ImgURL = (Get-Web http://dilbert.com/fast/).html.body.div.img.src
if($Host.Name -eq "PoshConsole") {
    ## In PoshConsole, show it inline
    $ImgUrl | Out-WPF -source "<Image Height='174' Source='{Binding Path=AbsoluteUri}' xmlns='http://schemas.microsoft.com/winfx/2006/xaml/presentation' />"
} else {
    ## Use Get-Web to download an image (it's saved to file by default), and open it in my default viewer
    ## When Get-Web saves a file, it outputs the [FileInfo] which can be invoked with &
    &(Get-Web "http://dilbert.com$ImgURL" -force)
}
###################################################################################################
## Determine the language of a snippet of text (5 words minimum for best results)
function Resolve-Language([string]$text) {
    ## This time, we're using Get-Web with POST parameters ...
    return (Get-Web "http://www.xrce.xerox.com/cgi-bin/mltt/LanguageGuesser" -Post (@{Text=$text})
    ).SelectSingleNode("//font")."#text".Trim()
}
###################################################################################################
## Translate text to English ... This is obviously reworkable as a general translation tool
## But I don't have much use for that, since I only speak Spanish, English, and code ...
function Get-English([string]$text,[string]$FromLanguage) {
    if(!$FromLanguage) {
        $FromLanguage = Resolve-Language $text
    }
    $post = @{text=$text}
    switch($FromLanguage) {
        "Arabic"     { $post["langpair"] = "ar|en" }
        "Chinese"    { $post["langpair"] = "zh|en" }
        "Dutch"      { $post["langpair"] = "nl|en" }
        "French"     { $post["langpair"] = "fr|en" }
        "German"     { $post["langpair"] = "de|en" }
        "Greek"      { $post["langpair"] = "el|en" }
        "Italian"    { $post["langpair"] = "it|en" }
        "Japanese"   { $post["langpair"] = "ja|en" }
        "Korean"     { $post["langpair"] = "ko|en" }
        "Portuguese" { $post["langpair"] = "pt|en" }
        "Russian"    { $post["langpair"] = "ru|en" }
        "Spanish"    { $post["langpair"] = "es|en" }
        default      { return "Sorry, but I can't translate $FromLanguage" }
    }
    ## Using Get-Web with POST parameters ... nothing remarkable, but it's just nice with the previous function ...
    return (Get-Web "http://www.google.com/translate_t" -Post $post
    ).SelectSingleNode("//div[@id='result_box']")."#text".Trim()
}
####################################
## Figure out the real url behind those shortened forms
function Resolve-URL([string[]]$urls) {
    ## (?i) makes these patterns case-insensitive, like switch -regex itself; dots in host names are escaped
    [regex]$snip  = "(?i)(?:https?://)?(?:snurl|snipr|snipurl)\.com/([^?/ ]*)\b"
    [regex]$tiny  = "(?i)(?:https?://)?tinyurl\.com/([^?/ ]*)\b"
    [regex]$isgd  = "(?i)(?:https?://)?is\.gd/([^?/ ]*)\b"
    [regex]$twurl = "(?i)(?:https?://)?twurl\.nl/([^?/ ]*)\b"
    switch -regex ($urls) {
        $snip {
            ## Notice that Get-Web ... -plain returns plain text, not an xml doc
            write-output $snip.Replace( $_, (Get-Web "http://snipurl.com/resolveurl?id=$($snip.match( $_ ).groups[1].value)" -plain))
        }
        $tiny {
            ## GET parameters this time ...
            $doc = Get-Web "http://tinyurl.com/preview.php" -Get @{num=$tiny.match( $_ ).groups[1].value}
            write-output $tiny.Replace( $_, [string]$doc.SelectSingleNode("//a[@id='redirecturl']").href )
        }
        $isgd {
            $doc = Get-Web "http://is.gd/$($isgd.match( $_ ).groups[1].value)-"
            write-output $isgd.Replace( $_, [string]$doc.SelectSingleNode("//div[@id='main']/p/a").href )
        }
        $twurl {
            $doc = Get-Web "http://tweetburner.com/links/$($twurl.match( $_ ).groups[1].value)"
            write-output $twurl.Replace( $_, [string]$doc.SelectSingleNode("//div[@id='main-content']/p/a").href )
        }
        default { write-output $_ }
    }
}
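One subtlety when pairing switch -regex with [regex] objects, as Resolve-URL does: switch -regex matches case-insensitively by default, while a pattern cast with [regex] is case-sensitive unless it carries an option such as (?i). The sketch below uses a made-up URL and variable names of its own to show how the two can disagree, so that a branch is entered but the capture group comes back empty.

```powershell
[regex]$strict  = "tinyurl\.com/([^?/ ]*)"       # a plain [regex] cast is case-sensitive
[regex]$relaxed = "(?i)tinyurl\.com/([^?/ ]*)"   # the inline (?i) option ignores case

$url = "http://TinyURL.com/abc123"               # invented example with mixed-case host

# switch -regex ignores case, so the branch is taken even for the strict pattern
$switchHit = switch -regex ($url) { $strict { $true } default { $false } }

# Calling Match() on the objects directly shows the difference
$strictId  = $strict.Match($url).Groups[1].Value
$relaxedId = $relaxed.Match($url).Groups[1].Value

"switch matched: $switchHit"
"strict id:  '$strictId'"
"relaxed id: '$relaxedId'"
```

Putting (?i) at the front of a pattern is one way to keep the switch test and any later Match() or Replace() calls on the same regex in agreement.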
Hey – cool stuff. I've been looking to do some HTML scraping for a while so I can collect data for historical trending. Get-Web does a great job of returning the page I'm interested in, but I would love to convert the html table into a PS Object. Are you aware of any cmdlets/functions that make this easy?
Here's an example of the web page: "https://senderscore.org/lookup.php?lookup=216.235…"
Thanks,
Chris