Posts Tagged ‘Reservoir Sampling’

I started this post a few weeks ago, but never quite got around to posting it, and then I saw today that Dmitry wrote about the memory problems with using measure-object and I thought this would be a great time to post this. Essentially, this is a continuation of my exploration of development in the PowerShell pipeline, and we’ll see some of the problems with the way that the pipeline works and what happens if you overlook the numbers of items that could be in the pipeline.

Dmitry was trying to count the number of users in Active Directory, and ran into ridiculous memory use when using Measure-Object because it appears to collect all the items in the pipeline into memory before it counts, resulting in huge amounts of memory use based on the number of objects you’re counting. Of course, for the purpose of counting, it’s a pretty obvious fix (as Dmitry explains), but for other purposes it may not be as obvious.

Select Random

When I originally started this post, I was trying to upgrade my original Select-Random script (which was published in the PowerShell Community Extensions) to use reservoir sampling in an attempt to solve the same problem. Select-Random is a script which selects a random element from a collection or from the pipeline. In the past, if the elements were passed in as an argument, it simply collected them into an array (in a really inefficient way, but never mind) and then picks a random number and selects the item by index.

I started by improving the collection method to use an ArrayList and reduced the memory usage to about 1/5th of the original method, and then rewrote the whole script to use reservoir sampling instead. Greg wrote a good explanation of reservoir sampling but basically it’s an algorithm for choosing random items from a collection of unknown size without having to traverse the collection twice (to count it the first time and then to select) — in my case, the need to count the collection was requiring me to store the pipeline input in an array, so the improvement was more about memory use than about needing to enumerate the list twice. However, there is a downside: it’s slower. Read the rest of this entry »

Search My Content