Unslow Regexps

Our mail logs accumulate a few million lines per weekday. Some of them contain information on whether SpamAssassin considered a given message to be spam:

Jun 11 23:58:15 sunapee MailScanner[7997]: Message q5C3vVli019865 from 71.243.115.147 (info@softwareevals.org) to unh.edu is spam, SpamAssassin (not cached, score=15.565, required 5, BAYES_00 -1.90, DIGEST_MULTIPLE 0.29, HELO_DYNAMIC_IPADDR 1.95, HTML_MESSAGE 0.00, KHOP_DNSBL_BUMP 2.00, MIME_HTML_ONLY 0.72, PYZOR_CHECK 1.39, RAZOR2_CF_RANGE_51_100 0.50, RAZOR2_CF_RANGE_E8_51_100 1.89, RAZOR2_CHECK 0.92, RCVD_IN_HOSTKARMA_BL 1.70, URIBL_BLACK 1.73, URIBL_JP_SURBL 1.25, URIBL_RHS_DOB 1.51, URIBL_WS_SURBL 1.61)

or “ham” (not spam):

Jun 11 23:59:54 sunapee MailScanner[7634]: Message q5C3xp77020291 from 208.117.48.80 (bounces+54769-353f-cfg6=cisunix.unh.edu@email.news.spotifymail.com) to cisunix.unh.edu is not spam, SpamAssassin (not cached, score=-2, required 5, autolearn=not spam, BAYES_00 -1.90, HTML_MESSAGE 0.00, RCVD_IN_HOSTKARMA_NO -0.09, SPF_PASS -0.00, T_RP_MATCHES_RCVD -0.01)

(Lines wrapped for clarity.)

Perl scripts scan through the logs and produce plots of ham/spam traffic, typical example here. Such scripts need to (for lines like the above) extract the date/time information (always the first 15 characters) and whether the line says “is spam” or “is not spam”. My initial approach (years ago) was very simple-minded. Old code snippet:

while (<>) {
    if (/is spam/ || /is not spam/) {
        $date = ParseDate(substr($_, 0, 15));
        $dt = UnixDate($date, "%Y-%m-%d %H:00");
        if (/is spam/) {
             $spamcount{$dt}++;
        else {
             $hamcount{$dt}++;
        }
    }
}

(ParseDate and UnixDate are routines from the Date::Manip package. Without getting bogged down in details, they allow the messages to be counted in hour-sized buckets for the plot linked above.)

For not-particularly-important reasons, I decided to try an “all at once” regular expression on each line instead. New code snippet:

use Readonly;
use Regexp::Common qw/net/;

Readonly my $SPAMLINE_RE => qr {
    ^
    (\w{3} \s+ \d+ \s \d{2}:\d{2}:\d{2}) \s
    \w+ \s MailScanner \[\d+\]: \s
    Message \s \w+ \s
    from \s ($RE{net}{IPv4}) \s
    \([^)]*\) \s to \s \S+ \s
    (is \s (?:not \s)? spam)
}x;

while (<>) {
    if ( $_ =~ $SPAMLINE_RE ) {
        my $d        = $1;
        my $spamflag = $3;
        my $date     = ParseDate($d);
        my $dt       = UnixDate( $date, '%Y-%m-%d %H:00' );
        if ( $spamflag eq 'is spam' ) {
            $spamcount{$dt}++;
        }
        else {
            $hamcount{$dt}++;
        }
    }
}

There’s plenty to criticize about both snippets, but mainly I was thinking: running the second version’s huge, hairy regular expression on each log line will just have to be way slower than the simple fixed-string searching in the first version. (Even though there can be up to three searches per line in the first version, they’re searches for fixed strings, right? Right?)

Wrong, as it turns out. The hairy second version ran slightly faster than the simple-minded first version. (In addition it extracts the IP address of the server that sent us the message; not used here, but important in other scripts.)

Humbling moral: even after many years of Perl coding, my gut feelings about efficiency are not to be trusted.

3 Responses to “Unslow Regexps”

  1. Be not humble, writing regexes like that one!

  2. Bill Costa says:

    Nice use of Regexp::Common, but why the Readonly module?

  3. Paul Sand says:

    Basically the rewrite was to bring code into better compliance with Perl Best Practices. Which encourages using Readonly for magic constant numbers. But once it’s in use for that, it’s not much of a stretch to use it for magic scalars in general.

    So, just the standard benefits attached to named constants: slight readability improvment, emphasis that we’re not going to alter it elsewhere, errors thrown if we try to do that anyway.

Leave a Reply

Panorama theme by Themocracy