Posts tagged: regexp

Unslow Regexps

Our mail logs accumulate a few million lines per weekday. Some of them contain information on whether SpamAssassin considered a given message to be spam:

Jun 11 23:58:15 sunapee MailScanner[7997]: Message q5C3vVli019865 from 71.243.115.147 (info@softwareevals.org) to unh.edu is spam, SpamAssassin (not cached, score=15.565, required 5, BAYES_00 -1.90, DIGEST_MULTIPLE 0.29, HELO_DYNAMIC_IPADDR 1.95, HTML_MESSAGE 0.00, KHOP_DNSBL_BUMP 2.00, MIME_HTML_ONLY 0.72, PYZOR_CHECK 1.39, RAZOR2_CF_RANGE_51_100 0.50, RAZOR2_CF_RANGE_E8_51_100 1.89, RAZOR2_CHECK 0.92, RCVD_IN_HOSTKARMA_BL 1.70, URIBL_BLACK 1.73, URIBL_JP_SURBL 1.25, URIBL_RHS_DOB 1.51, URIBL_WS_SURBL 1.61)

or “ham” (not spam):

Jun 11 23:59:54 sunapee MailScanner[7634]: Message q5C3xp77020291 from 208.117.48.80 (bounces+54769-353f-cfg6=cisunix.unh.edu@email.news.spotifymail.com) to cisunix.unh.edu is not spam, SpamAssassin (not cached, score=-2, required 5, autolearn=not spam, BAYES_00 -1.90, HTML_MESSAGE 0.00, RCVD_IN_HOSTKARMA_NO -0.09, SPF_PASS -0.00, T_RP_MATCHES_RCVD -0.01)

(Lines wrapped for clarity.)

Perl scripts scan through the logs and produce plots of ham/spam traffic, typical example here. Such scripts need to (for lines like the above) extract the date/time information (always the first 15 characters) and whether the line says “is spam” or “is not spam”. My initial approach (years ago) was very simple-minded. Old code snippet:

while (<>) {
    if (/is spam/ || /is not spam/) {
        $date = ParseDate(substr($_, 0, 15));
        $dt = UnixDate($date, "%Y-%m-%d %H:00");
        if (/is spam/) {
             $spamcount{$dt}++;
        else {
             $hamcount{$dt}++;
        }
    }
}

(ParseDate and UnixDate are routines from the Date::Manip package. Without getting bogged down in details, they allow the messages to be counted in hour-sized buckets for the plot linked above.)

For not-particularly-important reasons, I decided to try an “all at once” regular expression on each line instead. New code snippet:

use Readonly;
use Regexp::Common qw/net/;

Readonly my $SPAMLINE_RE => qr {
    ^
    (\w{3} \s+ \d+ \s \d{2}:\d{2}:\d{2}) \s
    \w+ \s MailScanner \[\d+\]: \s
    Message \s \w+ \s
    from \s ($RE{net}{IPv4}) \s
    \([^)]*\) \s to \s \S+ \s
    (is \s (?:not \s)? spam)
}x;

while (<>) {
    if ( $_ =~ $SPAMLINE_RE ) {
        my $d        = $1;
        my $spamflag = $3;
        my $date     = ParseDate($d);
        my $dt       = UnixDate( $date, '%Y-%m-%d %H:00' );
        if ( $spamflag eq 'is spam' ) {
            $spamcount{$dt}++;
        }
        else {
            $hamcount{$dt}++;
        }
    }
}

There’s plenty to criticize about both snippets, but mainly I was thinking: running the second version’s huge, hairy regular expression on each log line will just have to be way slower than the simple fixed-string searching in the first version. (Even though there can be up to three searches per line in the first version, they’re searches for fixed strings, right? Right?)

Wrong, as it turns out. The hairy second version ran slightly faster than the simple-minded first version. (In addition it extracts the IP address of the server that sent us the message; not used here, but important in other scripts.)

Humbling moral: even after many years of Perl coding, my gut feelings about efficiency are not to be trusted.

My Premature Optimization Problem

Many of you know the adage; but learning it anew can still be fun.

I had to trim leading and trailing whitespace in Perl. So:

# strip any leading or trailing whitespace
$string =~ s/^\s+//;
$string =~ s/\s+$//;

No prob.

Then I made the mistake I often make, and started thinking. Two lines for that? Please. That can be done in one line, especially in Perl. So I found this:

$string =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;

Yuck! Oh well, at least it probably runs faster.

Think again.

Or better yet, Marcus, just stop thinking altogether. This one-liner idea would have made the routine both slower and less readable for the next person (notice I forgot the #comment the second time).

Panorama theme by Themocracy