Friday, March 11, 2016

Using GNU Parallel to speed up nested loops in bash

Get it right ...

At $ork we have a small Perl application that uses a CSV source file to generate configuration files that are consumed by Ecelerity. The application invocation looks like this:
app.pl myconfig.csv some-binding some-day
The CSV file is the master configuration, the binding is an Ecelerity concept, and the day selects the day-specific configuration we're generating.  The output is a wad of text containing the binding- and day-specific configuration.

Since we need to generate configuration files for many bindings and many days, the script is usually run many times in succession in a bash script, e.g.:
for binding in $bindings; do
  for day in $days; do
    app.pl "$csvfile" "$binding" "$day" > "$binding/$day.conf"
  done
done
This solution does a good job of separating concerns; each file is generated by a single call to the Perl script, which keeps that script simple and easy to maintain.

... then make it fast!


The above was intended as a stopgap, good for a few months at most.  More than a year later, the stopgap is part of the process. Over time, the number of bindings and the size of the configuration file have both grown, and config generation time has crept from about 2 minutes to about 10.

One early optimization was to limit the configuration files generated to days in the future.  There's no sense in generating a config file for day 20 when it's day 21 now. This was a great improvement, but the run time crept up nonetheless.
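The filtering might have looked something like this (a sketch only; it assumes the days are plain day-of-month numbers and a 31-day month, which the real script may not):

# hypothetical filter: only generate configs for days after today
today=$(date +%-d)
days=$(seq $((today + 1)) 31)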

Possible areas of improvement 

Ten minutes is too long.  What options do we have to speed it up?  Ordinarily I'd want to use profiling to make sure I'm addressing the areas with the greatest opportunity, but two options present themselves based on experience with the process:
  • Execution of the wrapper shell script with the nested loops is serial--only one app.pl runs at a time.  This is running on a 16-core machine; can we take advantage of those cores?
  • The CSV file is increasing in size because there's old data in it.  Can we remove the old data, and thus speed up the parsing?

Improvement #1: using GNU Parallel

With a little fudging, we can replace the nested loops with a call to GNU Parallel, which makes it easy to run, say, 10 invocations of our Perl script at once and put more of the machine's resources to work.  Here's one way:

# helper function to simplify our parallel invocation
helper() {
  local csv=$1
  local binding=$2
  local day=$3
  app.pl "$csv" "$binding" "$day" > "$binding/$day.conf"
}
# need to export the helper so that subshells
# invoked by parallel have access to it
export -f helper
# run 10 jobs in parallel
parallel --jobs 10 helper "$csv" ::: $bindings ::: $days

To debug this, I made use of parallel's "--dry-run" option, which prints to STDOUT the commands parallel would run, without actually running them.
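For example, with bindings "b1 b2" and days "20 21" (hypothetical values), the dry run prints something like this, though the order may vary:

parallel --dry-run --jobs 10 helper "$csv" ::: $bindings ::: $days
helper myconfig.csv b1 20
helper myconfig.csv b1 21
helper myconfig.csv b2 20
helper myconfig.csv b2 21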

Right off the bat, the overall run time dropped from 10 minutes to under 1 minute--a factor-of-10 speedup.  Raising the job count beyond 10 doesn't reduce the run time further, so some other bottleneck must be limiting our throughput.

Improvement #2: CSV source size reduction

Our CSV "spreadsheet" contains a large number of columns (one per day). As mentioned earlier, we already skip generating config files for days in the past, but every invocation still parses every column, old or not, and the cost of parsing the ever-larger file is catching up to us.

I put together a robust, fast Perl script to remove the unwanted columns. It runs once, up front, and produces a temporary CSV file pruned to about 1/4 the size of the original. With the parallelized config generation reading the pruned file, the run time drops to under 30 seconds.
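I won't reproduce the real script here, but the core of the idea looks something like this (a sketch only; it assumes Text::CSV is installed and that the columns to drop are matched by a regex passed on the command line):

use strict;
use warnings;
use Text::CSV;

my ($pattern, $infile, $outfile) = @ARGV;
my $csv = Text::CSV->new({ binary => 1, eol => "\n" });

open my $in,  '<', $infile  or die "$infile: $!";
open my $out, '>', $outfile or die "$outfile: $!";

# use the header row to decide which column indexes to keep
my $header = $csv->getline($in);
my @keep = grep { $header->[$_] !~ /$pattern/ } 0 .. $#$header;

# write the pruned header, then prune each data row the same way
$csv->print($out, [ @{$header}[@keep] ]);
while (my $row = $csv->getline($in)) {
    $csv->print($out, [ @{$row}[@keep] ]);
}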

Not too shabby--a 95% reduction in run time overall!

Future directions

Some other changes suggest themselves:

  • Maybe cache the parse? This is hard to justify without some profiling.
  • Improve the CSV pruning script for ease of use.  As currently implemented, the pruning script identifies the columns to remove via regular expressions. A less general but friendlier interface would remove all columns that look like "Day N" for N below a cutoff passed as an argument to the script; see the tweak sketched below.
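In terms of the pruning sketch above, that could be a one-line change (again assuming the day columns are literally headed "Day N"):

# keep a column unless it is "Day N" with N below the (hypothetical) cutoff argument
my @keep = grep { $header->[$_] !~ /^Day (\d+)$/ || $1 >= $cutoff } 0 .. $#$header;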

Wednesday, March 9, 2016

Powershell, foreach, and foreach-object

Wat

An interesting discussion cropped up at work recently regarding looping constructs in Windows PowerShell. The context was a minor code change (replacing ForEach-Object with Foreach in two lines); the claim was that Foreach was much faster than ForEach-Object, and that the change saved many seconds per script invocation.

Not being familiar with the script, and being a relative newcomer to PowerShell, I found this odd but plausible. The obvious question arose in my mind: how could such a minor change in code result in such a large change in run time?

Loop Construct versus Commandlet

I started with the first link in the pull request supporting the change. It goes into some detail on how a developer improved the performance of a script, and it reads in part:
The difference between the ForEach loop construct and ForEach-Object cmdlet is not well understood by those beginning to write Windows PowerShell scripts. This is because they initially appear to be the same. However, they are different in how they handle the loop object enumeration (among other things), and it is worth understanding the difference between the two—especially when you are trying to speed up your scripts.
Digging further, the foreach loop construct looks like this:
foreach ($object in $objects) { commands ... }
And the commandlet version of foreach looks like this:
dir C:\ | ForEach-Object { $_.Name }
Superficially, they look similar, but there are hints in the syntax that make them visually distinct:
  • The commandlet version looks like " | foreach-object { commands } "
    • it has a pipe
    • it has the objects before the "foreach"
  • The loop construct looks like " foreach (obj in objects) { commands } "
    • it does not have a pipe
    • its objects come after the foreach

Aliases

PowerShell muddies the waters by allowing you to define aliases for your commandlets.  The Foreach-Object commandlet comes with two predefined aliases: Foreach and %.  Don't be fooled though--when you see this command:
dir C:\ | Foreach { $_.Name }
you're still using the commandlet version of ForEach, not the foreach loop construct. We know this because (1) it has a pipe, and (2) the objects (from the dir command in this case) come before the ForEach.  Even though it looks like the looping construct, this is the commandlet alias, which is not the same thing.
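You can ask PowerShell about the aliases directly; this should list both foreach and %:

# show the predefined aliases that resolve to ForEach-Object
Get-Alias -Definition ForEach-Object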

What's the difference?

As it turns out, beyond the surrounding syntax, people claim there are real differences between the commandlet form and the loop construct form of ForEach:
  • Memory use. The claim is that the loop construct version of ForEach evaluates its entire input collection before the loop body runs, consuming memory the commandlet version avoids by handling objects as they arrive. (A quick demonstration follows this list.)
  • Delayed start of loop. See above.
  • Speed.  According to this blog post, Bruce Payette (PowerShell development lead) says that there are optimizations to the ForEach loop construct that allow it to outperform the commandlet in some circumstances.
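Here's a quick way to see the streaming difference for yourself (a sketch; Get-SlowNumbers is a made-up stand-in for any slow producer):

# a deliberately slow producer: emits 1, 2, 3 one second apart
function Get-SlowNumbers { 1..3 | ForEach-Object { Start-Sleep -Seconds 1; $_ } }

# commandlet: "got 1" appears after about a second; objects are handled as they arrive
Get-SlowNumbers | ForEach-Object { "got $_" }

# loop construct: the whole collection is gathered first, so nothing appears for about 3 seconds
foreach ($n in Get-SlowNumbers) { "got $n" }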

Back to the Beginning

With this basic knowledge under our belts, we can revisit the original claim that this code:
dir C:\ | ForEach { $_.Name }
is significantly faster than this code:
dir C:\ | ForEach-Object { $_.Name }
Any difference between the two would be entirely due to how PowerShell invokes aliases of commandlets versus the commandlets themselves.  This also suggests an interesting corollary--if there is a difference between the above versions, the "%" alias for ForEach-Object should outperform ForEach-Object as well!

Looking at the benchmark done here, and performing my own benchmark, I am confident that for this use case, there is little performance difference between the commandlet and the loop construct ForEach, and even less (if any) difference between the commandlet aliases.
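A benchmark along these lines is easy to run yourself (a sketch using Measure-Command; the absolute numbers will vary with machine and PowerShell version):

$items = 1..100000

# time the commandlet, its two aliases, and the loop construct doing identical work
(Measure-Command { $items | ForEach-Object { $_ * 2 } }).TotalMilliseconds
(Measure-Command { $items | ForEach { $_ * 2 } }).TotalMilliseconds
(Measure-Command { $items | % { $_ * 2 } }).TotalMilliseconds
(Measure-Command { foreach ($i in $items) { $i * 2 } }).TotalMilliseconds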

Conclusions

The facts:
  • The commandlet and loop construct versions are distinct entities in PowerShell.
  • Many people get the variations in ForEach mixed up.
  • There are substantiated claims of real performance differences between the commandlet and loop construct versions of ForEach.
  • Casual PowerShell users are unlikely to see any important differences between the two; any time spent worrying about the two is likely to be premature optimization.
  • If your teammate is convinced that the ForEach commandlet alias is faster than the ForEach-Object commandlet, that teammate is certainly confused about the various forms ForEach can take in PowerShell.

Resources

  1. https://blogs.technet.microsoft.com/heyscriptingguy/2014/05/18/weekend-scripter-powershell-speed-improvement-techniques/
  2. http://poshoholic.com/2007/08/21/essential-powershell-understanding-foreach/
  3. http://powershelladministrator.com/2015/11/15/speed-of-loops-and-different-ways-of-writing-to-files-which-is-the-quickest/
  4. https://www.pluralsight.com/blog/it-ops/what-you-need-to-know-about-foreach-loops-in-powershell 
  5. http://zduck.com/2013/benchmarking-with-Powershell/ 
  6. https://technet.microsoft.com/en-us/library/hh847816.aspx
  7. https://technet.microsoft.com/en-us/library/hh849731.aspx


Thursday, August 28, 2008

How many of me?


According to HowManyOfMe.com, there are 216 people with my name in the U.S.A.

How many have your name?

Monday, August 4, 2008

Docile!

From Paul Graham's What You'll Wish You'd Known:
Your teachers are always telling you to behave like adults. I wonder if they'd like it if you did. You may be loud and disorganized, but you're very docile compared to adults. If you actually started acting like adults, it would be just as if a bunch of adults had been transposed into your bodies. Imagine the reaction of an FBI agent or taxi driver or reporter to being told they had to ask permission to go to the bathroom, and only one person could go at a time. To say nothing of the things you're taught. If a bunch of actual adults suddenly found themselves trapped in high school, the first thing they'd do is form a union and renegotiate all the rules with the administration.
I got a few yuks out of the quote, and the article.

Friday, July 18, 2008

docs.python.org lays an egg

Want to learn how to install or build a Python "egg" file? Don't bother looking on docs.python.org.

Monday, July 14, 2008

My cat gets around

Amazing--this LOLcat looks just like my Gladys!


U HAS A FLAVOR

... and with a guinea pig no less!

MySQL stored procedure to generate "safe noise"

I'm playing around with stored procedures, and thought I'd try my hand at creating a function used in various capacities at work. Here it is.



USE test;
DROP FUNCTION IF EXISTS safe_noise;
DELIMITER $$
CREATE FUNCTION safe_noise (n INT UNSIGNED)
RETURNS VARCHAR(100) NOT DETERMINISTIC
BEGIN
  -- 44-character pool: consonants and digits, minus easily-confused
  -- characters like 0/O and 1/l
  DECLARE x CHAR(100) DEFAULT
    "bcdfghkmnpqrstvwxzBCDFGHKMNPQRSTVWXZ23456789";
  DECLARE y CHAR(100) DEFAULT "";
  DECLARE i INT;
  -- append n characters drawn at random from the pool
  WHILE n > 0 DO
    SET n = n - 1;
    SET i = 1 + FLOOR(44 * RAND());
    SET y = CONCAT(y, SUBSTR(x, i, 1));
  END WHILE;
  RETURN y;
END
$$
DELIMITER ;

-- Sample output:
-- mysql> select safe_noise(50);
-- +----------------------------------------------------+
-- | safe_noise(50)                                     |
-- +----------------------------------------------------+
-- | 96Pw5CKDPW4g5pv984DSdp4VfpXvtK3KpDR6KfSp9tHVkFbKXp |
-- +----------------------------------------------------+
-- 1 row in set (0.01 sec)



Aside: I'm really surprised that blogger.com can't handle verbatim text excerpts. There's more to life than chin music, people! God forbid I should post some Python...