
Expensive join

This post was imported from my old WordPress installation. If there are display problems, broken links or missing images, please leave a comment here. Thanks.


How do you concatenate an array into a string? Perl is TIMTOWTDI, and I can imagine two common ways - but which one is best?

Actually, "best" isn't obvious, it depends on the data and project preferences. I had to improve the memory usage of a module handling a huge amount of data. I thought joinmight be the best choice to concat the array into a string (not doing this at all would be best, but it's not an option here), but it isn't.

First, get a baseline for the data size. The number isn't exact and includes other overhead, but it's good enough to get an idea of the sizes we're working with:

perl -le '@a = (map { "x"x1024000 } (0..100)); system "ps u $$";'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
sewi     28588  6.0  0.8 124344 104180 pts/0   S+   16:31   0:00 perl -le @a = (map { "x"x1024000 } (0..100)); syst
Next, compare join against an element-by-element concat using a for loop:
perl -le '@a = (map { "x"x1024000 } (0..100)); $y = join("\n",@a); system "ps u $$";'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
sewi     28602  0.0  2.5 326352 306136 pts/0   S+   16:32   0:00 perl -le @a = (map { "x"x1024000 } (0..100)); $y =

perl -le '@a = (map { "x"x1024000 } (0..100)); for my $b (@a) { $y .= $b; } system "ps u $$";'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
sewi     28606  0.0  1.6 245596 205148 pts/0   S+   16:32   0:00 perl -le @a = (map { "x"x1024000 } (0..100)); for

Look at the RSS column. It's not 100% accurate, because swapped-out memory isn't shown and Perl's own memory manager sits between the script and the kernel, but it's good enough to compare things.

A very quick-and-dirty benchmark, but it still shows that (after subtracting the "base" data usage) join uses twice the memory of the concat loop: 306136 - 104180 = 201956 for join vs. 205148 - 104180 = 100968 for the loop. All numbers are kilobytes (kB)*.
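
If you'd rather measure the payload itself instead of the whole process, Devel::Size from CPAN can report the bytes behind a variable - a sketch, not from the original post, assuming the module is installed:

perl -MDevel::Size=total_size -le '@a = (map { "x"x1024000 } (0..100));
$y = join("\n",@a);
print "array:  ", total_size(\@a), " bytes";
print "string: ", total_size(\$y), " bytes";'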

But memory isn't important if you're dealing with small pieces of data; maybe speed is an issue:

 

perl -MBenchmark -le '@a = (map { "x"x1024000 } (0..100));
timethese(0, {
  _join   => sub { my $y = join("\n",@a); },
  _concat => sub { my $y; for my $b (@a) { $y .= $b; } }
});'

Benchmark: running _concat, _join for at least 3 CPU seconds...
   _concat: 3 wallclock secs ( 3.26 usr + 0.00 sys = 3.26 CPU) @ 59.20/s (n=193)
     _join: 3 wallclock secs ( 3.10 usr + 0.00 sys = 3.10 CPU) @ 32.58/s (n=101)

The manual concat loop is even faster than Perl's built-in join: 59 runs per second compared to 32 runs per second for join.

But wait... there is no single "best" way. The same benchmark with a very small change:

perl -MBenchmark -le '@a = (map { "x"x1024 } (0..10));
timethese(0, {
  _join   => sub { my $y = join("\n",@a); },
  _concat => sub { my $y; for my $b (@a) { $y .= $b; } }
});'

Benchmark: running _concat, _join for at least 3 CPU seconds...
   _concat: 3 wallclock secs ( 3.15 usr + 0.00 sys = 3.15 CPU) @ 644350.48/s (n=2029704)
     _join: 3 wallclock secs ( 3.11 usr + 0.00 sys = 3.11 CPU) @ 749435.69/s (n=2330745)

The built-in join clearly wins this time, some 100,000 runs per second ahead of the concat loop. Most arrays are closer to "10 times 1 kB" than "100 times 1 MB", so join seems to be the right choice for most operations - but not all.

I retried the same test with "100 times 10 bytes" and join turned out to be nearly twice as fast as the loop.
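
For reference, that "100 times 10 bytes" run would look like this (a reconstruction; the original post doesn't show the command):

perl -MBenchmark -le '@a = (map { "x"x10 } (0..100));
timethese(0, {
  _join   => sub { my $y = join("\n",@a); },
  _concat => sub { my $y; for my $b (@a) { $y .= $b; } }
});'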

* Don't give me those silly creations like "kiB" or "MiB". One kB is 1024 bytes and one MB is 1024 * 1024 bytes; nobody needs that "i" junk. I started developing at a time when people knew that a "kB" is 1024 bytes; there is absolutely no reason why anyone should start with 1000 bytes. Show me a binary memory chip holding 1000 bytes and I might start accepting kiB's. "MiB" is "Men in Black", nothing else. Read this before complaining.

 

15 comments

  1. Alexander Hartmaier (abraxxa)

    Note that the bandwidth of network links is specified in kB/s (1000-based), so those units are useful as well!

  2. [Imported from blog]

  3. [Imported from blog]

  4. Sebastian

    base-1000 is also useless there. We used to calculate them base-1024 and no one complained.

  5. [Imported from blog]

  6. [Imported from blog]

  7. Andrew

    Those of us who actually learned to count in school know that k is a prefix for 1000, and M is a prefix for 1,000,000. To blow off the SI and create confusion by inventing different values for them is the height of arrogance and stupidity.

  8. [Imported from blog]

  9. [Imported from blog]

  10. Chas. Owens

    You are making the join do extra work (concatenating with "\n"). If you make the loop perform the same work, the performance advantage drops to next to nothing (in fact, the join eventually starts to beat the concat):



    #!/usr/bin/perl

    use strict;
    use warnings;

    use Benchmark qw/cmpthese/;

    my @a = qw/a b c d e/;

    my %subs = (
        join => sub {
            my $s = join "-", @a;
            return $s;
        },
        concat => sub {
            my $s = $a[0];
            for my $v (@a[1 .. $#a]) {
                $s .= "-$v";
            }
            return $s;
        },
    );

    for my $sub (keys %subs) {
        print "$sub: ", $subs{$sub}(), "\n";
    }

    for my $n (10, 50, 100) {
        @a = ("x" x 1024000) x $n;
        print "\n$n 1,024,000 byte items\n";
        cmpthese -3, \%subs;
    }

  11. [Imported from blog]

  12. [Imported from blog]

  13. Sebastian

    Thank you, you're right.

  14. [Imported from blog]

  15. [Imported from blog]
