Seitenanfang

Intelligent caching for alltime statistics

Dieser Post wurde aus meiner alten WordPress-Installation importiert. Sollte es Darstellungsprobleme, falsche Links oder fehlende Bilder geben, bitte einfach hier einen Kommentar hinterlassen. Danke.


Statistics are important for many projects, but not all of them are easy to build. Many long-term stats involve heavy SQL queries and some may of these queries may also freeze database servers or tables. Caching intermediate results may limit these queries to a minimum.

Todays problem: Get the number of livetime impressions for each banner. These numbers are important and required for many followup calculations within this project. The logging table has four columns: banner_id, count_of_impressions, month, year. Each impression is being added to the count column within the row for the current month, year and delivered banner.

A low traffic project on an oversized server may use

SELECT SUM(count) FROM IMPRESSION_LOG WHERE id=?
every time but even MySQL query & result caching won't help if more and more users are requesting this data.

This project is using a pool of Memcached servers where data may be cached longer than the livetime of a request and survive even a webserver restart. I already wrote a combination of DBI's select commands and the memcache interface, but what exactly should be cached? Current data is changing every few seconds due to new impressions of banners being delivered, these data must not be cached for a longer time, but older data won't ever change.

Here is the final result:

my $MAX_CACHE_TIME = (30 * 86400) - 1;

sub all_imps { my $self = shift;

# Calculate end-of-month my @now = localtime(time); ++$now[4]; $now[5] += 1900; my $cache_length = timelocal(23,59,59,Days_in_Month(@now[5,4]),($now[4] - 1), ($now[5] + 1900)) - time; $cache_length = $MAX_CACHE_TIME if $cache_length > $MAX_CACHE_TIME;

# Cache everything but current month up to the end of this month # next month will cache again including results of this month my $self->{pre_month_sums} = { select2col_hash_cached($self->dbh, 'SELECT banner_id, SUM(count) FROM IMPRESSION_LOG WHERE year < ? OR (year = ? AND month < ?) GROUP BY banner_id', $cache_length, $now[5], $now[5], $now[4], )};

# Cache current month values for up to 30 seconds my $self->{this_month_sums} = { select2col_hash_cached($self->dbh, 'SELECT banner_id, SUM(count) FROM IMPRESSION_LOG WHERE year = ? AND month = ? GROUP BY banner_id', 30, $now[5], $now[4], )};

return $self->{pre_month_sums}->{$self->id} + $self->{this_month_sums}->{$self->id};}

This method is part of a class module for one banner. It should return the number of all impressions delivered since the beginning of the (recording) time.

A MAX_CACHE_TIME pseudo-constant is defined at the beginning of the module. It's the maximum time Cache::Memcached::Fast may use as max age value for caching. Longer times may be given as unix timestamps but the DBI/caching functions have to be backward compatible and may not only be used with a Memcached backend. I didn't want to use unix timestamps as maximum age values because non-Memcached solutions don't support them.

Days_in_Month() is provided by Date::Calc, the famous Gregorian date calculation module. It returns the number of the last day of the given (current) month (28 for most Februaries, 31 for March, etc.). timelocal (from Time::Local) is the reversed localtime function calculating the last unix timestamp of the current month. The difference between this timestamp and the current timestamp is the number of seconds until the end of the month.

A new statistics row is created for every month, the statistics row for the last month won't ever change and may be cached endless.

The select2col_hash_cached function fetches two columns from a table, and converts them into a hash. The first columns becomes the key and the second one the value. The result is cached and only re-fetched if it's not fresh within the cache.

The first select2col call fetches all data for up to the end of the previous month. The caching timeout is set to the end of the current month as the alltime sums have to include current months data from that on.

The second statement requests the current month values and everything is added together and returned.

Data for all columns is fetched together to limit the overall number of SQL queries. The first query is filling up the cache for all following queries and all other banner_id's.

 

Noch keine Kommentare. Schreib was dazu

Schreib was dazu

Die folgenden HTML-Tags sind erlaubt:<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>