
Hunting down a very well-secured memory leak

This post was imported from my old WordPress installation. If there are display problems, broken links or missing images, please just leave a comment here. Thanks.


I recently cut a freshly developed job's memory usage by 90%, but it's still leaking memory and I don't want to accept this.

There is an easy way to solve this: process each user in a separate task (or maybe 10 or 100 of them per task, then start a fresh one), but I don't want to give up on finding the reason. These classes are also used in long-running webservers and may leak memory there as well, which isn't critical (at the moment), but still not good.

Today is Sunday and here is my wager for today: I will only count this as working time if I cut memory usage down by another 25% today. The job currently uses 87,992 kB right before doing any real work (after loading modules, database connections, etc.) and 209,252 kB after processing one user (only a few users seem to leak memory, most don't). The per-user leak is 209,252 - 87,992 = 121,260 kB; cutting that by 25% leaves 90,945 kB, so my target line is 90,945 kB + 87,992 kB base usage = 178,937 kB after that user has been processed.

First attempt: try to discover the leaking subroutines (see the first post about this job for details) by adding a line before and after each main script line (the ones calculating data):

print __LINE__."\t"; system "ps hu $$";
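Sprinkling that raw line everywhere gets noisy, so here is a minimal helper sketch (the sub name mem and the reliance on the usual "ps hu" column order are my assumptions) that prints the caller's line number together with the current RSS:

# Hypothetical helper: print the caller's line number plus the current RSS in kB.
# Assumes the usual "ps hu" column order (USER PID %CPU %MEM VSZ RSS ...).
sub mem {
    my $line = (caller)[2];
    my ($ps) = grep { /\S/ } `ps hu $$`;
    my $rss_kb = (split ' ', $ps)[5];
    print "line $line\t$rss_kb kB\n";
}

# Usage: call mem() before and after each suspicious statement.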

I suspect that some objects are created but not destroyed as they should be. I need to see every created and destroyed object to prove this, but as we're using a common parent class for all database tables involved, that's easy:

Add this at the end of "sub new":

print "NEW $self\t".$self->id."\n";

And add a DESTROY sub:

print "DESTROY $self\t".$self->id."\n";

Now run the job, redirect the output into a file and check the results. Every single object will be destroyed by Perl while the script shuts down, but one of the "ps hu" calls is placed just before the "exit" call that ends the script once my test user has been processed. Nothing was destroyed after this line except for unprocessed user objects (which is OK, as they're very light on memory).
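A quick way to check the pairing across a long log is a small throwaway script that matches every NEW against its DESTROY (a sketch, assuming exactly the log format printed above; Perl may reuse addresses, so treat it as a rough filter):

#!/usr/bin/perl
# leak_check.pl - read the NEW/DESTROY log and list objects that were never destroyed.
use strict;
use warnings;

my %alive;
while (<>) {
    if    (/^NEW (\S+)\t(\S*)/) { $alive{$1} = $2; }
    elsif (/^DESTROY (\S+)/)    { delete $alive{$1}; }
}
print "still alive: $_ (id $alive{$_})\n" for sort keys %alive;

Run it as perl leak_check.pl job_output.log.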

Next try: is this a real leak, or could this user simply be heavy enough to eat up 120 MB with his child objects? I'll process him twice in the same job (after removing the NEW and DESTROY debug prints).

Interesting: processing the user twice doesn't add more memory, and restarting the job within a very short timeframe even finishes at 90,236 kB. The leak must be in a place that is only reached when one of my very-short-term Memcached cache entries gets too old. I'll add a five-minute delay after the first processing and see what happens. I don't remember any caching time between 90 seconds and 1 day, so a 5-minute delay should show whether the leak adds up across two runs for this guy. I'll go searching manually while the delay is running...
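The test setup itself is trivial; process_user() here is just a stand-in for the job's real per-user entry point:

# Sketch of the two-pass test: same user twice, with a pause long enough in between
# to let the very-short-term Memcached entries expire.
process_user($test_user);                      # hypothetical per-user entry point
print "after first pass:\t";  system "ps hu $$";
sleep 5 * 60;                                  # the five-minute delay
process_user($test_user);
print "after second pass:\t"; system "ps hu $$";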

There is one "universal" method called from many other places which caches short-term data for 69 seconds. Two memory usage prints will show whether it's leaking:

print "$self\t$method\t$month/$year\n\t"; system "ps hu $$";return [map {$self->$method($_, $month, $year) || 0} 1 .. Days_in_Month($year, $month)];print "\t"; system "ps hu $$";

The five-minute delay is over: no additional leaking during the second run, even though no object is reused. Processing might really require that much memory and free it again for the next user. Could my huge memory usage simply be the amount needed to walk through the biggest user?

Did you find the bug above? The second "ps hu" call will never be executed - it comes after the return. A temporary hack fixes this:

print "$self\t$method\t$month/$year\n\t"; system "ps hu $$";my $x = [map {$self->$method($_, $month, $year) || 0} 1 .. Days_in_Month($year, $month)];print "\t"; system "ps hu $$";return $x;

Here is the interesting part out of many output lines:

MyProject::DOA::User=HASH(0x5b43458) clickrate_per_day 1/2012
sftp 13330 68.5 0.1 219124  90296 pts/0 S+ 17:09 0:01 /usr/bin/perl cron_test.pl
sftp 13330 72.6 0.4 336556 207568 pts/0 S+ 17:09 0:02 /usr/bin/perl cron_test.pl

Yeah, I'm getting closer. Adding memory usage prints before and after the $self->$method call inside the map showed that the first day alone pushes memory from 90,292 to 206,980 kB, while the remaining days add up to less than 2 MB. I'll hunt the big fish first.
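For reference, the per-day instrumentation inside the map looked roughly like this (again a temporary hack, not production code):

print "$self\t$method\t$month/$year\n";
my $x = [map {
    my $day = $_;
    print "day $day:\t"; system "ps hu $$";        # RSS before calculating this day
    $self->$method($day, $month, $year) || 0;      # last expression = the day's value
} 1 .. Days_in_Month($year, $month)];
print "done:\t"; system "ps hu $$";                # RSS after the whole month
return $x;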

Results aren't always what you want: half a dozen "ps hu" calls inside that $method just moved the problem down one level. It really looks like the memory isn't leaked at all but simply used to store child objects. I won't be able to cut those down much, because removing them from memory before the user has been completed would mean re-loading them from the database later on, and memory is much cheaper than database power.

Devel::Size proved what I expected: the $user object holding all the children grows by megabytes for every new item loaded. But why are they that heavy? I can't believe this.

The plain objects copied from the database into memory take about 2 kB each, making no more than 200 or 300 kB in total, but they store additional data while calculating the numbers they pass up to their parent objects. Because all objects are linked as parents and children, I had to take the total_size from Devel::Size before loading the new objects and subtract it from the total_size after everything has been loaded.
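That measurement is a simple before/after delta with Devel::Size (the load_children() call is a stand-in for whatever really pulls in the next batch of child objects):

use Devel::Size qw(total_size);

# Because parents and children reference each other, total_size($user) already covers
# the whole object tree - so measure the growth, not each child on its own.
my $before = total_size($user);
$user->load_children($month, $year);    # hypothetical loader for the next batch of child objects
my $growth_kb = (total_size($user) - $before) / 1024;
printf "loading children for %02d/%d added %.0f kB\n", $month, $year, $growth_kb;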

Going deeper and deeper into the structures, I discovered that, surprisingly, the "old" data is taking up all the space. I have to explain this a little: today's and yesterday's data is not put into Memcached because it changes too often to live long enough in the cache to be useful. (Yesterday's numbers may still change in this special case for a very good reason; otherwise only today's data would be exempt from caching.) I already wrote that the first day of the month eats up everything, but until now I didn't realize that this rules out the "realtime" data calculation as the reason - the first of the month is "old" data.

Got it. Not really solved, but it's clear now: the script is not leaking memory in the sense of "keeping memory allocated that is never used again after the object went out of scope". The data used for the calculations is simply big enough to actually need all this memory. But I don't want to lose my bet and keep on working with a module that heavy.

Here is the situation: I have to calculate statistics based on day sums for sub-sub-sub-objects fetched from the database. I save time by putting intermediate values like day and month sums for each level of sub-object into Memcached, using a very long caching time for "old" data (last month's or last year's data won't change, so there's no need to ever refresh it), but the "current month" isn't cached at the higher levels because the "current day" (and maybe yesterday's) values change every second or minute or so.
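In code, that caching rule boils down to a per-month TTL decision; here is a sketch with assumed names (the real logic also exempts yesterday for the reason mentioned above):

use Time::Piece;

# Hypothetical TTL policy: finished months never change, so cache them for a long time;
# the current (still changing) month skips the high-level cache entirely.
sub cache_ttl_for_month {
    my ($month, $year) = @_;
    my $now = localtime;                                        # Time::Piece object
    return 0 if $year > $now->year
             or ($year == $now->year and $month >= $now->mon);  # 0 = don't cache
    return 30 * 24 * 3600;                                      # "old" data: 30 days
}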

The lowest-level objects stored their full dump of per-day sums in Memcached, a 600 to 800 kB block for each object - but I don't need that much data. I just changed the caching to use one Memcached item per year, which currently takes up 3 kB or so. It's January 8th and not much data is stored for this year yet, so as the year goes on I might trade the delay of more Memcached accesses for lower memory usage by loading only one month at a time.
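The per-year item is nothing more than a small hashref of day sums stored under one key; a sketch with an assumed key layout and an assumed $memd handle (the Cache::Memcached get/set calls are the standard ones):

use Cache::Memcached;

my $memd = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });

# One small cache item per object and year instead of one huge per-day dump per object.
sub day_sums {
    my ($self, $year) = @_;
    my $key  = join ':', 'daysums', ref($self), $self->id, $year;
    my $sums = $memd->get($key);
    unless ($sums) {
        $sums = $self->calculate_day_sums($year);   # hypothetical, returns a small hashref
        $memd->set($key, $sums, 24 * 3600);         # conservative TTL; old years could keep it much longer
    }
    return $sums;
}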

Here is the final result: 90,260 kB after processing that user, way below my 178,937 kB target line.

PS: Yes, doing all those calculations in the database would be much better, but that isn't possible here for reasons outside my control.

 
