This is an old post that’s been sitting in my queue for a long time, and I just finally thought I’d publish it, for what it’s worth. The word on the street was that SQL Server 2008’s PowerShell support is going to be limited.
I’m here to say: it’s not so much limited, as it it cautiously restrained. A lot of people were apparently hoping for a SQL Server provider like this demo one that could fully navigate tables and list their contents etc., but I think the SQL Team has decided not to allow that, and I think they probably did the right thing.
Honestly, a provider that can navigate into SQL datatables would be extremely frightening, to the point that it might need to be banned or blocked in some instances. Just imagine a SQL Server dba who’s brand new to PowerShell, but thinks that new provider sounds interesting … so he fires up PowerShell and switches over to the customer orders table (you know, the one with 2 million records of purchases) and does:
Just think about that line for a second. Not only is it fetching every record in the table, it has to build a .net object for every single one. One of the most commonly used cmdlet’s in PowerShell is Get-Member, which we typically see used when you’re trying to figure out which properties are available on an item. Applied to the previous example, you can imagine doing something like this:
Now you’ve retrieved every record from the table … just for the sake of finding out what the columns are! You’re not even using the data. If you’re lucky, the memory use of creating 2 million System.Data.Row objects would cripple your dba’s PC before they were able to try something like this:
Anyway. Hopefully this counter example will be enough to not only make you feel better about using a cmdlet instead of a provider, but also to make some people think about how they choose to design things that can cause potentially crippling network traffic or server load. The thing is that although it’s possible to allow the -Filter parameter of Get-ChildItem (ls) to specify both a WHERE clause and a TOP count … that wouldn’t help unless you were sure your users would always filter that way instead of by passing it through | Where-Object.
As a side note: The management studio protects you from this sort of load on your system to some extent by caching data and limiting how many rows it pulls over at a time, but not as much as it could if it allowed you to reorder columns and sort them after you had retrieved the data. It also has the distinct advantage of not necessarily having to create .Net objects to pass into the pipeline for each row on the fly (although, maybe that’s how it works, I don’t know, but I would hope not).
Anyway, from what I’ve understood, the SQL powershell provider will let you navigate the schemas, and will provide some cmdlets to do queries that should replace some of the functions we’ve all been using to do queries. Maybe they’ll even provide something that can do built in paging, like: Get-SqlQuery “Select * from Northwind.Orders” -Next 10 …
So, I posted a story about Fibonacci sequences and noticed that the recursion in PowerShell is really rather slow. But I also noticed something interesting — I had expected that recursing using a function would be a little cheaper (and faster) than recursing using the pipeline, and it turns out that it’s not. At all.
I’d love to hear some thoughts on why I got the speed results I got here. At first I thought it was something about type coercion, but as you can see, I tried casting everything to no avail. There’s no difference if I don’t output anything, and in fact, the non-recursive algorithm can trivially output every Fibonacci number in the sequence without affecting it’s time (I actually altered it so it didn’t to keep the output clean).
Obviously I expected the recursive algorithms to be slow and scale badly because of the amount of work involved, but how is it that the pipeline-recursive “filter” outperforms the plain old recursive function (albeit by an infinitesimal fraction)?
P.S.: The numbers in braces with the colon (like [18]: ... ) is my prompt.
I started this post a few weeks ago, but never quite got around to posting it, and then I saw today that Dmitry wrote about the memory problems with using measure-object and I thought this would be a great time to post this. Essentially, this is a continuation of my exploration of development in the PowerShell pipeline, and we’ll see some of the problems with the way that the pipeline works and what happens if you overlook the numbers of items that could be in the pipeline.
Dmitry was trying to count the number of users in Active Directory, and ran into ridiculous memory use when using Measure-Object because it appears to collect all the items in the pipeline into memory before it counts, resulting in huge amounts of memory use based on the number of objects you’re counting. Of course, for the purpose of counting, it’s a pretty obvious fix (as Dmitry explains), but for other purposes it may not be as obvious.
When I originally started this post, I was trying to upgrade my original Select-Random script (which was published in the PowerShell Community Extensions) to use reservoir sampling in an attempt to solve the same problem. Select-Random is a script which selects a random element from a collection or from the pipeline. In the past, if the elements were passed in as an argument, it simply collected them into an array (in a really inefficient way, but never mind) and then picks a random number and selects the item by index.
I started by improving the collection method to use an ArrayList and reduced the memory usage to about 1/5th of the original method, and then rewrote the whole script to use reservoir sampling instead. Greg wrote a good explanation of reservoir sampling but basically it’s an algorithm for choosing random items from a collection of unknown size without having to traverse the collection twice (to count it the first time and then to select) — in my case, the need to count the collection was requiring me to store the pipeline input in an array, so the improvement was more about memory use than about needing to enumerate the list twice. However, there is a downside: it’s slower. Read the rest of this entry »