
Expensive join

This post was imported from my old WordPress installation. If there are display problems, broken links or missing images, please leave a comment here. Thanks.


How do you concatenate an array into a string? Perl is TIMTOWTDI, and I can imagine two common ways - but which one is best?

Actually, "best" isn't obvious, it depends on the data and project preferences. I had to improve the memory usage of a module handling a huge amount of data. I thought joinmight be the best choice to concat the array into a string (not doing this at all would be best, but it's not an option here), but it isn't.

First, get a baseline for the data size. The number isn't exact and includes other overhead, but it's good enough to get an idea of the sizes we're working with:

perl -le '@a = (map { "x"x1024000 } (0..100)); system "ps u $$";'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
sewi     28588  6.0  0.8 124344 104180 pts/0   S+   16:31   0:00 perl -le @a = (map { "x"x1024000 } (0..100)); syst
Next, compare join against an element-by-element concat using a for loop:
perl -le '@a = (map { "x"x1024000 } (0..100)); $y = join("\n",@a); system "ps u $$";'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
sewi     28602  0.0  2.5 326352 306136 pts/0   S+   16:32   0:00 perl -le @a = (map { "x"x1024000 } (0..100)); $y =

perl -le '@a = (map { "x"x1024000 } (0..100)); for my $b (@a) { $y .= $b; } system "ps u $$";'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
sewi     28606  0.0  1.6 245596 205148 pts/0   S+   16:32   0:00 perl -le @a = (map { "x"x1024000 } (0..100)); for

Look at the RSS column. It's not 100% accurate, because swapped-out memory isn't shown and Perl's own memory manager sits between the script and the kernel, but it's good enough to compare things.

A very quick-and-dirty benchmark, but it still shows that (after subtracting the "base" data usage) join uses twice the memory of the concat loop: 306136 - 104180 = 201956 for join vs. 205148 - 104180 = 100968 for the loop. All numbers are kilobytes (kB)*.
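
If you'd rather measure the payload itself instead of the whole process, Devel::Size from CPAN can report the bytes behind a variable - a sketch, not from the original post, assuming the module is installed:

perl -MDevel::Size=total_size -le '@a = (map { "x"x1024000 } (0..100));
$y = join("\n",@a);
print "array:  ", total_size(\@a), " bytes";
print "string: ", total_size(\$y), " bytes";'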

But memory isn't important if you're dealing with small pieces of data; maybe speed is an issue:

 

perl -MBenchmark -le '@a = (map { "x"x1024000 } (0..100));
timethese(0, {
  _join   => sub { my $y = join("\n",@a); },
  _concat => sub { my $y; for my $b (@a) { $y .= $b; } }
});'

Benchmark: running _concat, _join for at least 3 CPU seconds...
   _concat: 3 wallclock secs ( 3.26 usr + 0.00 sys = 3.26 CPU) @ 59.20/s (n=193)
     _join: 3 wallclock secs ( 3.10 usr + 0.00 sys = 3.10 CPU) @ 32.58/s (n=101)

The manual concat loop is even faster than Perl's built-in join: 59 runs per second compared to 32 runs per second for join.

But wait... there is no single "best" way. The same benchmark with a very small change:

perl -MBenchmark -le '@a = (map { "x"x1024 } (0..10));
timethese(0, {
  _join   => sub { my $y = join("\n",@a); },
  _concat => sub { my $y; for my $b (@a) { $y .= $b; } }
});'

Benchmark: running _concat, _join for at least 3 CPU seconds...
   _concat: 3 wallclock secs ( 3.15 usr + 0.00 sys = 3.15 CPU) @ 644350.48/s (n=2029704)
     _join: 3 wallclock secs ( 3.11 usr + 0.00 sys = 3.11 CPU) @ 749435.69/s (n=2330745)

The built-in join clearly wins this time, some 100,000 runs per second ahead of the concat loop. Most arrays are closer to "10 times 1 kB" than "100 times 1 MB", so join seems to be the right choice for most operations - but not all.

I retried the same test with "100 times 10 bytes" and join turned out to be nearly twice as fast as the loop.
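
For reference, that "100 times 10 bytes" run would look like this (a reconstruction; the original post doesn't show the command):

perl -MBenchmark -le '@a = (map { "x"x10 } (0..100));
timethese(0, {
  _join   => sub { my $y = join("\n",@a); },
  _concat => sub { my $y; for my $b (@a) { $y .= $b; } }
});'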

* Don't give me those silly creations like "kiB" or "MiB". One kB is 1024 bytes and one MB is 1024 * 1024 bytes; nobody needs that "i" junk. I started developing at a time when people knew that a "kB" is 1024 bytes; there is absolutely no reason why anyone should start with 1000 bytes. Show me a binary memory chip holding 1000 bytes and I might start accepting kiB's. "MiB" is "Men in Black", nothing else. Read this before complaining.

 

15 comments

  1. Alexander Hartmaier (abraxxa)

    Note that the bandwidth of network links is specified in kB/s (1000-based), so those units are useful as well!

  2. [Imported from blog]

  3. [Imported from blog]

  4. Sebastian

    base-1000 is also useless there. We used to calculate them base-1024 and no one complained.

  5. [Imported from blog]

  6. [Imported from blog]

  7. Andrew

    Those of us who actually learned to count in school know that k is a prefix for 1000, and M is a prefix for 1,000,000. To blow off the SI and create confusion by inventing different values for them is the height of arrogance and stupidity.

  8. [Imported from blog]

  9. [Imported from blog]

  10. Chas. Owens

    You are making the join do extra work (concatenating with "\n"). If you make the loop perform the same work, the performance advantage drops to next to nothing (in fact, the join eventually starts to beat the concat):



    #!/usr/bin/perl

    use strict;
    use warnings;

    use Benchmark qw/cmpthese/;

    my @a = qw/a b c d e/;

    my %subs = (
        join => sub {
            my $s = join "-", @a;
            return $s;
        },
        concat => sub {
            my $s = $a[0];
            for my $v (@a[1 .. $#a]) {
                $s .= "-$v";
            }
            return $s;
        },
    );

    for my $sub (keys %subs) {
        print "$sub: ", $subs{$sub}(), "\n";
    }

    for my $n (10, 50, 100) {
        @a = ("x" x 1024000) x $n;
        print "\n$n 1,024,000 byte items\n";
        cmpthese -3, \%subs;
    }

  11. [Imported from blog]

  12. [Imported from blog]

  13. Sebastian

    Thank you, you're right.

  14. [Imported from blog]

  15. [Imported from blog]
