Reservoir sampling and performance considerations

I started this post a few weeks ago but never quite got around to posting it, and then I saw today that Dmitry wrote about the memory problems with using Measure-Object, so I thought this would be a great time to post it. Essentially, this is a continuation of my exploration of development in the PowerShell pipeline, and we’ll see some of the problems with the way the pipeline works and what happens if you overlook the number of items that could be in the pipeline.

Dmitry was trying to count the number of users in Active Directory and ran into ridiculous memory use with Measure-Object, because it appears to collect all the items in the pipeline into memory before it counts them, so memory use grows with the number of objects you’re counting. Of course, for the purpose of counting, there’s a pretty obvious fix (as Dmitry explains), but for other purposes it may not be as obvious.
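
Just so the counting fix is concrete: I don’t have Dmitry’s exact code in front of me, but the general low-memory counting pattern is to increment a counter as items stream past instead of letting anything buffer them. Something like this (with Get-ChildItem standing in for whatever cmdlet you’re actually counting):

# The memory-hungry way buffers everything: (ls -recurse | measure-object).Count
# The memory-lean way counts the items as they stream past, keeping none of them:
$n = 0
ls -recurse | foreach-object { $n++ }
"$n items"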

Select Random

When I originally started this post, I was trying to upgrade my original Select-Random script (which was published in the PowerShell Community Extensions) to use reservoir sampling in an attempt to solve the same problem. Select-Random is a script which selects a random element from a collection or from the pipeline. In the past, if the elements were passed in as an argument, it simply collected them into an array (in a really inefficient way, but never mind), picked a random number, and selected the item by index.
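
Just so you can picture the old behavior, the index-based approach looks roughly like this (not the original script verbatim, just the shape of it):

# Collect everything first, then pick one item by index ... fast, but the whole
# collection has to fit in memory
$rand  = new-object Random
$items = @(ls -recurse)
$items[$rand.Next(0, $items.Count)]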

I started by improving the collection method to use an ArrayList, which reduced the memory usage to about 1/5th of the original method, and then I rewrote the whole script to use reservoir sampling instead. Greg wrote a good explanation of reservoir sampling, but basically it’s an algorithm for choosing random items from a collection of unknown size without having to traverse the collection twice (once to count it and once to select). In my case, the need to count the collection was what forced me to store the pipeline input in an array, so the improvement was more about memory use than about enumerating the list twice. However, there is a downside: it’s slower.

Reservoir Sampling

Reservoir sampling means that instead of generating a single random number and indexing into the collection, you sample as you iterate. You start by filling your “reservoir” with the first $count items from the pipeline, and then for each remaining item you generate a random number: with a steadily decreasing probability, the new item replaces a random item already in the reservoir. You’re trading speed for lower memory usage. Basically, if you want $count items, then the script stores only $count elements (the eventual result) at all times, and it continues processing elements until it reaches the end of the input. There’s a stripped-down sketch of the core loop right after the list below.

  • For each input element $n (where $n is the count of the inputs so far) there is a $count/$n chance that it becomes part of the result.
  • For each element already in the reservoir, the chance that it was selected at some point before the $n-th input arrived is $count/($n-1).
  • When the $n-th input is processed, an already-selected element has a ($count/$n * 1/$count = 1/$n) chance of being replaced, and therefore a ($n-1)/$n chance of remaining … thus, its cumulative probability of being among the selected elements after the $n-th input is processed is $count/($n-1) * ($n-1)/$n = $count/$n, as it should be.
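
To make that list concrete, here’s the reservoir loop stripped down to its bare bones. “Select-Reservoir” is just a throwaway name for this sketch; the real script at the bottom of this post adds help, parameters, and the -collect fallback.

# A bare-bones reservoir sampling filter for $count items
function Select-Reservoir([int]$count = 1) {
   begin {
      $rand = new-object Random
      $seen = 0
      $selected = new-object object[] $count
   }
   process {
      $seen++
      if($seen -le $count) {
         $selected[$seen - 1] = $_                # fill the reservoir with the first $count items
      } elseif($rand.NextDouble() -lt ($count / $seen)) {
         $selected[$rand.Next(0, $count)] = $_    # item $n survives here with probability $count/$n
      }
   }
   end { $selected }
}

# e.g.:  1..100000 | Select-Reservoir 5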

The bottom line is that using this method is a LOT slower than random indexing. However, it uses a LOT less memory when iterating large numbers of items output by cmdlets, as long as they weren’t already collected into an array. As a simple example, Get-ChildItem outputs each item as it traverses the provider, rather than creating an in-memory collection. If nothing in the pipeline “collects” all of the output, then it never needs much memory beyond what’s used for the text display of the items … or, if it’s piped through a reservoir sampling algorithm, it only needs enough memory to hold as many items as the size of our sample.
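
Just to spell out the two shapes that paragraph is contrasting (none of this is specific to Select-Random):

# Streaming: each item flows down the pipe and is dropped, so memory stays flat
ls -recurse | foreach-object { $_.Name } | out-null

# Collecting: the variable keeps every object alive, so memory grows with the item count
$all = ls -recurse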

Performance comparison

Using as an example the output of Get-ChildItem -Recurse on my profile directory, where there are about 98 thousand items total in the subdirectories, I ran each test in its own clean console so that I could see the memory usage cleanly. So you can tell the tests apart, I changed the letters in the prompt for each one (I also use a counter in my prompt, and I change the trailing character from > to : to make my life easier when pasting into HTML).

In case you’re not familiar with the commands used here, Measure-Command is a cmdlet that times how long a script takes to run, and if you wrap it in a string, it outputs a simple elapsed-time statement instead of more verbose output. There are other ways of accomplishing the same thing, but this is the least distracting way. Also, -f is the format operator, so "{0}" -f 42 is the same as [string]::Format("{0}", 42), just with less code.
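
One note on the memory checks: the transcripts grab (ps powershell*)[1], which assumes the console being measured happens to be the second PowerShell process in that list. If you just want the working set of the console you’re typing in, the automatic $PID variable does the same job without guessing at indexes:

# The current console's working set, in KB
"{0:#,#}" -f ((get-process -id $PID).WorkingSet / 1KB)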


[A1]: # Collect the items first, and pass them as a parameter to Select-Random
[A2]: "$(measure-command {Select-Random -count 5 -input (ls -recurse)})"
00:00:32.8902287
[A3]: "{0:#,#}" -f ((ps powershell*)[1].WorkingSet/1KB)
189,624

[B1]: # Pass the items via the pipeline, but use the -collect method
[B2]: "$(measure-command {ls -recurse | Select-Random -count 5 -collect})"
00:00:40.6224274
[B3]: "{0:#,#}" -f ((ps powershell*)[1].WorkingSet/1KB)
73,688

[C1]: # Pass the items on the pipeline and use reservoir sampling
[C2]: "$(measure-command {ls -recurse | Select-Random -count 5})"
00:00:52.7225395
[C3]: "{0:#,#}" -f ((ps powershell*)[1].WorkingSet/1KB)
43,188

Hopefully I didn’t make that example too complicated, and you can see not only that collecting the output into an array is faster, but that it uses a boatload more memory. You can also see that my -collect method (which now uses an ArrayList; see the code below) uses a lot less memory than collecting the items in an array the way the first example does. I’m not 100% sure why this is, but my theory is that doing (ls -recurse) is in effect the same as using the += operator on an array: it makes a copy. To be more precise, it makes about 98 thousand copies. The sketch below shows the pattern I mean.
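
To illustrate the kind of copying I mean, here’s the general difference between the two growth patterns (I’m not claiming this is literally what the engine does inside a (ls -recurse) subexpression, just showing why += gets expensive):

# Plain array: += allocates a brand new array and copies every element, every time
$array = @()
ls -recurse | foreach-object { $array += $_ }

# ArrayList: Add() appends in place, only occasionally reallocating as it grows
$list = new-object Collections.ArrayList
ls -recurse | foreach-object { $list.Add($_) | out-null }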

You’re not getting quite the whole picture in that example, but to fill it in a little: the memory use starts at around 39 MB … JUST doing the Get-ChildItem to count the files increases the working set memory usage of the PowerShell process by about 3.5 MB. Using the reservoir method to select 5 of them increased the memory usage of the PowerShell process by about 5 MB of memory (only 2 MB more than simple counting). Using the pre-collection method requires around 151 MB. That’s a really heavy price to pay in memory, at over 1.5 KB per item in the collection. And even the -collect method uses almost 10x as much memory.

Some perspective

To make sure you get the same perspective I have on this … let me show you the output another way. The key thing to see in these last two examples is that the memory cost is the same for both: the same 151 MB as in the earlier collection example. The difference in time, though, is large and undeniable. If you compare this with example A, you can see that, basically, listing the files takes about 32 seconds, picking 25 of them by index takes about a tenth of a second, and iterating over all of them (which requires generating about 98 thousand random numbers) takes about 20 seconds.


[D1]: $files = ls -recurse
[D2]: "$(measure-command {Select-Random -Count 25 -input $files})"
00:00:00.1337390

[E1]: $files = ls -recurse
[E2]: "$(measure-command {$files | Select-Random -Count 25})"
VERBOSE: Selected 25 of 97849 elements
00:00:19.9349478

Conclusion

So, the point is: there are times when you need to sacrifice speed in order to keep your memory usage low … But when you’re writing a general-purpose script or a cmdlet for the PowerShell pipeline, you need to at least understand the choice you’re making so you can make it deliberately. Sometimes you’ll choose to write it fast and memory-expensive. Sometimes you’ll choose to write it slower and memory-lean. And sometimes, like I did here, you might choose to do both ;) .

The script …

It’s also script 118 on the PowerShell central script repository :) (where it has a comment header that’s giving me problems here).


param([int]$count=1, [switch]$collectionMethod, [array]$inputObject=$null)

BEGIN {
   if ($args -eq '-?') {
@"
Usage: Select-Random [[-Count] <int>] [-inputObject] <array> (from pipeline) [-?]

Parameters:
 -Count            : The number of elements to select.
 -inputObject      : The collection from which to select a random element.
 -collectionMethod : Collect the pipeline input instead of using reservoir
 -?                : Display this usage information and exit

Examples:
 PS> $arr = 1..5; Select-Random $arr
 PS> 1..10 | Select-Random -Count 2

"@
exit
   }
   else
   {
      $rand = new-object Random
      if ($inputObject)
      {
         # Write-Output $inputObject | &($MyInvocation.InvocationName) -Count $count
      }
      elseif($collectionMethod)
      {
         Write-Verbose "Collecting from the pipeline "
         [Collections.ArrayList]$inputObject = new-object Collections.ArrayList
      }
      else
      {
         $seen = 0
         $selected = new-object object[] $count
      }
   }
}
PROCESS {
   if($_ -ne $null)   # explicit null check, so "falsy" items like 0 or an empty string aren't skipped
   {
      if($collectionMethod)
      {
         $inputObject.Add($_) | out-null
      } else {
         $seen++
         if($seen -le $count) {
            $selected[$seen-1] = $_
         } ## For each input element $n there is a $count/$n chance that it becomes part of the result.
         elseif($rand.NextDouble() -lt ($count/$seen))
         {
            ## For the ones previously selected, there's a 1/$n chance of it being replaced
            $selected[$rand.Next(0,$count)] = $_
         }
      }
   }
}
END {
   if (-not $inputObject)
   {  ## DO ONCE: (only on the re-invoke, not when using -inputObject)
      Write-Verbose "Selected $count of $seen elements."
      Write-Output $selected
      # foreach($el in $selected) { Write-Output $el }
   }
   else
   {
      Write-Verbose ("{0} elements, selecting {1}." -f $inputObject.Count, $Count)
      foreach($i in 1..$Count) {
         Write-Output $inputObject[$rand.Next(0,$inputObject.Count)]
      }  
   }
}


2 thoughts on “Reservoir sampling and performance considerations”

  1. Quick question Joel, why not use PeakWorkingSet rather than just WorkingSet when detecting the memory consumption? This would show peak consumption regardless of how Windows frees up memory, right?

    Of course you’ll need to start the tests in different processes to make sure that you don’t get the same peak over and over. ;)

    Just my two cents.

    Dmitry

  2. Actually, I did run every test in its own process, sequentially rather than all at once, to make sure they had a situation that was as similar to each other as possible … I was watching memory use in the Task Manager and checking Working Set and Peak Working Set, as well as the Private Working Set.

    In this case (and, I suspect, in any case like this) the numbers are essentially the same for the Working Set at the end of the script and the Peak Working Set (at most a few KB difference on a process using over 180 MB), so I just went with what I thought was the simpler number because the differences were so small.

    Actually, it’s an interesting side effect of non-deterministic garbage collection (and having plenty of RAM) that it takes a couple of minutes for the Working Set memory to start coming down from its peak unless you start thrashing RAM with some other application.
