In July of last year I wrote a PowerShell script with the goal of allowing me to generate XML from PowerShell with a simple markup that would look a little like the resulting XML ... this week I was using that script again, and had a couple of issues that made me go back and look at the source.
While I was playing with the source and tweaking things a little bit to improve the way it handles namespaces, I started playing with the idea that I could improve the syntax. At the very least, I thought, I ought to be able to do away with all those “xe” aliases…
Well, I was able to. And what’s more, I managed to dramatically clean up the way namespaces work, and make it so that really, the only ugly part of the syntax is the initial declaration of namespaces! I’m going to start with two examples, and use them to walk you through the features.
The simplest example I could think of is to list all the files in a folder, with the file size and last modified stamp:
The output of that, when run on my formats folder, looks like this:
You can immediately see what the script does: New-XDocument (which is aliased as ‘xml’) actually generates the root xml node, so the first argument to it is the name of that node, and any other arguments become attributes … except for the script block. That script block turns into the contents of the node.
Inside the script block, PowerShell code is parsed as usual, but whenever a command that doesn’t exist is encountered, it is turned into an xml node! Pretty simple, right? Of course, if you wanted to create a node with a name that’s already taken by a PowerShell command, you can just replace file with New-XElement file, or (using aliases) xe 'file', which explicitly creates an xml node with the given name.
That’s pretty much it for our first example, so let’s look at a more complicated example, with multiple namespaces, and deeper nesting.
This time, we’ll create an Atom document, and we’ll include some namespace extensions (including a made up one for listing my files as we did above):
There are four things you should notice, in particular:
First: the initial tag has a [XNamespace] added to it. You can specify a tag name that has a namespace by adding them together this way, or by embedding the namespace in the string like "{http://www.w3.org/2005/Atom}feed" instead. Either way works. This initial namespace becomes the default namespace for the document. If you don’t specify a namespace on tags later, they automatically belong to that one.
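To make the two spellings concrete, here is a minimal sketch that uses System.Xml.Linq directly rather than the New-XDocument script. The variable names are only for illustration, and the full type names are spelled out because a bare [XNamespace] shortcut is assumed not to be available outside the module.

```powershell
# Load LINQ to XML (needed on Windows PowerShell; harmless elsewhere)
Add-Type -AssemblyName System.Xml.Linq

# Spelling 1: add a namespace and a local name together to get a qualified name
$atom = [System.Xml.Linq.XNamespace]'http://www.w3.org/2005/Atom'
$viaAddition = $atom + 'feed'

# Spelling 2: embed the namespace in the string using {namespace}name notation
$viaString = [System.Xml.Linq.XName]'{http://www.w3.org/2005/Atom}feed'

# Both spellings give the same qualified name
$sameName = ($viaAddition -eq $viaString)

# Either one can be used to create the element
$feed = New-Object System.Xml.Linq.XElement -ArgumentList $viaAddition
```

Which spelling to use is mostly a matter of taste: the addition form reads better when one namespace is reused for many tags.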
Second: when you want to add additional namespaces, you can do so with a custom prefix like: -dc ([XNamespace]"http://purl.org/dc/elements/1.1"), and that prefix (dc) takes on a special meaning. When you want to have a tag later on that is part of that namespace, you just prefix the tag, like dc:rights —the same way you would in XML.
Third: any number of attributes can be specified using the -name value syntax, but anything in a {scriptblock} becomes the content — and is subject to the same rules as the outer sections.
Fourth: This generates an XDocument. When you cast an XDocument to string, the xml declaration is left off, so if you want it, you need to manually add it via $XDocument.Declaration. Incidentally, XDocuments are not XMLDocuments, but they are trivially castable to them.
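A small sketch of that last point, assuming a hand-built XDocument rather than one produced by New-XDocument (the sample XML and variable names are made up for the example):

```powershell
Add-Type -AssemblyName System.Xml.Linq

# A stand-in document; in practice this would come from New-XDocument
$doc = [System.Xml.Linq.XDocument]::Parse('<folder><file name="a.txt" /></folder>')

# ToString() on an XDocument does not include the xml declaration,
# so set one and prepend it by hand when the text form needs it
$doc.Declaration = New-Object System.Xml.Linq.XDeclaration -ArgumentList '1.0', 'utf-8', 'yes'
$text = $doc.Declaration.ToString() + [Environment]::NewLine + $doc.ToString()

# Going to an XmlDocument: cast the string form with [xml]
$xmlDoc = [xml]$doc.ToString()
```

Casting the string form is the conservative route; it costs a re-parse, which is unlikely to matter for documents of this size.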
The output of that particular section of New-XDocument is this:
The New-XDocument script itself is on PoshCode in the Xml Module 4 along with a few interesting functions like Select-XML (which improves over the built-in by being able to ignore namespaces when you write XPath) and Remove-XmlNamespace (which was instrumental in removing namespaces for Select-Xml). There’s also a Format-Xml for pretty-printing, and a Convert-Xml for processing XSL transformations.
I’ll probably post some more examples of this in the next week or two, and I really should write some commentary about the function itself, which uses the tokenizer to discover which “commands” are really xml nodes … but for now, I’ll leave you to enjoy.
I know that I just wrote a post last week about XPath and namespaces in PowerShell, but at the time I left out one possible way of dealing with namespaces, because it’s not the right way of doing things. However, sometimes it’s nice to have options, and when you’re working on the command-line in PowerShell, or just trying to figure out a proof-of-concept call to a web service, you really don’t need to deal with namespaces correctly, you just need it to work.
With that in mind, I present to you the fourth option: just strip the namespaces out! The simplest way to do that is to run the XML through an XSL stylesheet which just outputs the local-name() of each node (including attributes), and remove any namespace definitions (processing instructions).
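The idea can be sketched roughly as follows. This is not the Remove-XmlNamespace function from the module, just an illustration of the technique: the function name Remove-NamespaceSketch and the exact stylesheet are assumptions made for this example. In this version, text and comments are copied through unchanged.

```powershell
function Remove-NamespaceSketch {
    param([Parameter(Mandatory = $true)][xml]$Xml)

    # Rebuild every element and attribute using only its local name.
    # Namespace declarations are not attributes in XPath, so @* skips them.
    [xml]$stylesheet = @'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes" indent="no"/>
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>
  <xsl:template match="@*">
    <xsl:attribute name="{local-name()}"><xsl:value-of select="."/></xsl:attribute>
  </xsl:template>
  <xsl:template match="text()|comment()">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
'@

    # Compile the stylesheet
    $xslt = New-Object System.Xml.Xsl.XslCompiledTransform
    $xslt.Load((New-Object System.Xml.XmlNodeReader $stylesheet))

    # Run the transform into a string, then parse the result back into a document
    $writer = New-Object System.IO.StringWriter
    $arguments = New-Object System.Xml.Xsl.XsltArgumentList
    $xslt.Transform((New-Object System.Xml.XmlNodeReader $Xml), $arguments, $writer)
    [xml]$writer.ToString()
}

# Usage: after stripping, plain XPath works without any prefixes
$source = [xml]'<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event"><Data Name="BootStartTime">sample</Data></Event>'
$clean = Remove-NamespaceSketch $source
$node = $clean.SelectSingleNode('/Event/Data[@Name="BootStartTime"]')
```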
That stylesheet and the basic steps of the process will work anywhere, from Java to C# to the web … but since my current language of choice for prototyping is PowerShell, I’ll show you how to implement it there as Remove-XmlNamespace. Once you have that, I think you’ll see that it was relatively simple for me to write a new Select-XML which adds a parameter RemoveNamespace which is implemented by calling this Remove-XmlNamespace …
That actually allows you to call Select-Xml with the -RemoveNamespace parameter just as though the namespaces didn’t exist. Of course, the returned XML nodes will, in fact, NOT have namespaces … so they may not be quite the same as the source, but the data will all be there.
Although there are a few “CSS Selector” libraries, most browsers haven’t even implemented CSS3 selectors, never mind frameworks like .Net or scripting languages like Javascript or PowerShell, so XPath remains the most powerful way to find specific data in an XML file and, by extension, in XHTML and even HTML files (if you can convert them using something like SgmlReader).
There are a lot of XPath tutorials around the web, so there’s no need for me to get into that very much, but I just wanted to write a brief note about using XPath with documents that have namespaces (particularly, from .Net). The problem is that in order to select nodes that have a namespace assigned, you must use a namespace prefix and an XmlNamespaceManager. So even if it’s the default namespace, if there’s an xmlns="..." on the document, you have to create and use a prefix.
The bottom line is this: if you have an XML document that looks like this:
That bit at the top where it says: xmlns="http://schemas.microsoft.com/win/2004/08/events/event" is assigning a default namespace. Sometimes you’ll see something like this (eg: in RSS):
This one assigns a specific prefix “media” to the namespace url “http://search.yahoo.com/mrss/” ...
In either case, if you want to select a node that’s assigned to a namespace (which is ALL the nodes in the first example, but just the ones that start with media: in the second example) in .Net, you have to specify the namespace in order to select those nodes with XPath. PowerShell 2.0 has a Select-Xml cmdlet which accepts -Namespace as a parameter: you simply provide a hashtable of prefixes to urls.
If you had loaded the first document above into $xml, you could select the BootStartTime and BootEndTime using an XPath query like this: //e:Data[@Name = 'BootStartTime' or @Name = 'BootEndTime'] but we have to DEFINE that “e” namespace. To do so using Select-Xml, you just pass it into the command:
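As a rough sketch of that call: the XML below is a cut-down, made-up stand-in for an event document (only the namespace and the Data/Name shape matter), and the prefix “e” is arbitrary.

```powershell
# Made-up sample with a default namespace, shaped like an event document
[xml]$xml = @'
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <EventData>
    <Data Name="BootStartTime">2009-01-01T08:00:00Z</Data>
    <Data Name="BootEndTime">2009-01-01T08:01:30Z</Data>
    <Data Name="BootTime">90000</Data>
  </EventData>
</Event>
'@

# Map a prefix of our choosing to the document's default namespace
$ns = @{ e = 'http://schemas.microsoft.com/win/2004/08/events/event' }

# Select-Xml returns wrapper objects; the matched XML node is in .Node
$nodes = Select-Xml -Xml $xml -Namespace $ns -XPath "//e:Data[@Name='BootStartTime' or @Name='BootEndTime']" |
    ForEach-Object { $_.Node }
```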
Of course, you don’t have to use Select-Xml, you can do this in plain .Net without cmdlets (and this is what you would have to do in C#). In fact, depending on the situation, it might even be simpler:
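A minimal sketch of the plain .Net route, again against a made-up sample document. XmlNamespaceManager and SelectNodes are the standard System.Xml pieces; the prefix “e” and the variable names are assumptions for the example.

```powershell
# Made-up sample with a default namespace
[xml]$xml = @'
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <EventData>
    <Data Name="BootStartTime">2009-01-01T08:00:00Z</Data>
    <Data Name="BootEndTime">2009-01-01T08:01:30Z</Data>
  </EventData>
</Event>
'@

# The namespace manager is tied to the document's name table
$nsm = New-Object System.Xml.XmlNamespaceManager -ArgumentList $xml.NameTable
$nsm.AddNamespace('e', 'http://schemas.microsoft.com/win/2004/08/events/event')

# Pass the manager as the second argument so the e: prefix can be resolved
$nodes = $xml.SelectNodes("//e:Data[@Name='BootStartTime' or @Name='BootEndTime']", $nsm)
```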
There are, however, two ways that you could avoid specifying the namespaces. The first is to just avoid specifying the node name at all. In that first example, that would be fairly easy, because the “BootStartTime” and “BootEndTime” names are unique to the nodes we’re interested in (even if there were boatloads of identical events, you’ll never have a Name=“BootStartTime” attribute on the
So to ignore the tag name, we can just use a * for a wildcard. The only real difference is that we don’t need the namespace, and the XPath pattern will be different:
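For illustration, a wildcard query against a made-up sample document might look like this (no namespace table or manager is involved):

```powershell
# Made-up sample with a default namespace
[xml]$xml = @'
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <EventData>
    <Data Name="BootStartTime">2009-01-01T08:00:00Z</Data>
    <Data Name="BootEndTime">2009-01-01T08:01:30Z</Data>
    <Data Name="BootTime">90000</Data>
  </EventData>
</Event>
'@

# * matches an element of any name in any namespace, so only the attribute test matters
$nodes = $xml.SelectNodes("//*[@Name='BootStartTime' or @Name='BootEndTime']")
```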
Or to specify the local name, we can use the local-name() function (XPath function names are case-sensitive and lowercase). Again, we don’t need a namespace, we are just changing the XPath to be a more specific match:
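A sketch of the local-name() variant, using a made-up sample document in which a differently named element carries the same attribute value, so the difference from the wildcard is visible:

```powershell
# Made-up sample: an Other element reuses a Name value we are looking for
[xml]$xml = @'
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <EventData>
    <Data Name="BootStartTime">2009-01-01T08:00:00Z</Data>
    <Data Name="BootEndTime">2009-01-01T08:01:30Z</Data>
    <Other Name="BootStartTime">not this one</Other>
  </EventData>
</Event>
'@

# local-name() compares the tag name without its namespace, so no prefix is required
$xpath = "//*[local-name()='Data' and (@Name='BootStartTime' or @Name='BootEndTime')]"
$nodes = $xml.SelectNodes($xpath)
```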
Well, after multiple attempts at a wget PowerShell script (the last one works very well for downloading web pages, files, as long as you don’t need to send post parameters or anything like that) ... I found myself writing a script last week that included a custom HTTP POST function as well as using some prior functionality I wrote (ConvertFrom-Html) to convert HTML files to XML documents — which PowerShell can deal with nicely.
So I’ve taken my Huddled.Html code and worked it in together with a few extra bits … and I’m releasing it under the new name “PoshHttp” (which has the sole benefit of being short, in the case when you need to type the full command names, like: PoshHttp\Get-Web).
In lieu of providing proper documentation for this, there’s a script below which shows off some of the features of the Get-Web cmdlet … but first, you’re going to want to download PoshHttp and I can give you a basic overview.
Downloads a file from the web URI, optionally passing POST or GET parameters which it accepts as a Hashtable. If the resulting file is an xml or html file, it converts it to XML and outputs it as an XmlDocument object. If they are not, it saves them to file, and outputs the FileInfo object.
Converts HTML to XML. I’ll eventually tweak this so it supports passing it a file path and even saving the xml in-place on the original file … but for now you can use it on a stream of text, like this: (Get-Content file.html | ConvertFrom-Html).Save( "file.xml" ). Note that it outputs an XmlDocument object … you can get the xml as text on the console using its OuterXml property (or get_OuterXml()).
Converts a url-style query string or NameValueCollection into a Hashtable. It’s mainly to make it easier to use the Get-Web in cases where you’re getting the URL or POST data from elsewhere.
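To give a feel for that last conversion, here is a hand-rolled sketch. It is not the PoshHttp implementation: the function name ConvertFrom-QueryStringSketch is hypothetical, it handles only the query-string case (not a NameValueCollection), and a repeated key simply keeps the last value.

```powershell
function ConvertFrom-QueryStringSketch {
    param([string]$Query)

    $result = @{}
    # Drop a leading '?' and walk the name=value pairs
    foreach ($pair in $Query.TrimStart('?').Split('&')) {
        if (-not $pair) { continue }

        # Split on the first '=' only; a missing value becomes an empty string
        $name, $value = $pair.Split('=', 2)

        # '+' means a space in query strings; %XX escapes are decoded afterwards
        $key = [Uri]::UnescapeDataString($name.Replace('+', ' '))
        $result[$key] = [Uri]::UnescapeDataString("$value".Replace('+', ' '))
    }
    $result
}

# Usage
$params = ConvertFrom-QueryStringSketch '?q=power+shell&lang=en%2Dus&flag'
```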
Here’s a script with a few example uses of Get-Web:
Just a quick post to upload a simple PowerShell script that transforms XML files with XSL files. There’s actually a cmdlet in PowerShell Community Extensions which does this, but believe it or not, in all my tests the script outperforms the cmdlet (the cmdlet takes, on average, 117ms on a file which takes the script 63ms on average).
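For readers who just want the shape of such a script, here is a minimal sketch built on the standard XslCompiledTransform class. It is not the script from this post, and the function name Convert-XmlFileSketch and its parameter names are made up for the example.

```powershell
function Convert-XmlFileSketch {
    param(
        [Parameter(Mandatory = $true)][string]$XmlPath,
        [Parameter(Mandatory = $true)][string]$XslPath,
        [Parameter(Mandatory = $true)][string]$OutPath
    )

    # .Net resolves relative paths against the process directory, not the
    # current PowerShell location, so turn everything into full paths first
    $xmlFull = (Resolve-Path $XmlPath).ProviderPath
    $xslFull = (Resolve-Path $XslPath).ProviderPath
    $outFull = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($OutPath)

    # Compile the stylesheet, then transform file to file
    $xslt = New-Object System.Xml.Xsl.XslCompiledTransform
    $xslt.Load($xslFull)
    $xslt.Transform($xmlFull, $outFull)

    # Emit the result file so it can be piped onward
    Get-Item $outFull
}
```

Something like `Convert-XmlFileSketch .\data.xml .\style.xsl .\out.html` would then write the transformed file and return it.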
I’ll write up more information later, but a couple people have asked for this in #PowerShell on irc.freenode.net, and I had it already written, so here you go … my ConvertFrom-Html cmdlet (in a Huddled.HtmlSnapin). It converts HTML to valid xml using the SGML Parser which was available on GotDotNet years ago. It only works with files (doesn’t do URL downloads yet). Use it like this:
The source code to my plugin may be considered public domain, and is included in the Huddled HTML SnapIn Zip.
However, the SgmlReader library is a Microsoft Sample which is licensed under the old MS Samples license which doesn’t allow reuse with viral open source software. I’ve seen some work being done on an HtmlAgilityPack on CodePlex (using a Creative Commons ASA license) but I have not really looked at it except to see that it has several active issues related to entity encoding and dropping malformed tags which I haven’t encountered in SgmlReader …
The SGMLReader library has been re-released on MSDN Code Gallery under an Ms-PL license, and all is well with the world.