out of sorts

Not that it matters, but while reading through the sort(1) man page, I noticed a new (to me) option:

-R, --random-sort
sort by random hash of keys

Yes, newer versions of sort will actually shuffle your input data. I’m not sure if that’s a cool thing for a command named sort to do, but I like it anyway.

A quick test (on Red Hat 6) shows that it really is random: you don’t get the same shuffle each time.

Good replacement for the Perl one-liner:

perl -MList::Util -e 'print List::Util::shuffle <>'

(Added later: many versions of Linux include the “shuf” command.)
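One caveat hides in the man page wording, "sort by random hash of keys": identical keys hash to the same value, so duplicate lines end up next to each other, whereas shuf (or List::Util's shuffle) scatters them like any other lines. The sketch below is a toy imitation of that idea, not GNU sort's actual implementation; the helper name random_hash_sort and the salting scheme are made up for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use List::Util qw(shuffle);

# Toy imitation of "sort by random hash of keys" (hypothetical helper):
# salt every line with a per-run random string, hash it, sort by the hash.
# Equal lines get equal hashes, so duplicates always land side by side.
sub random_hash_sort {
    my @lines   = @_;
    my $salt    = join '', map { chr( 97 + int rand 26 ) } 1 .. 16;
    my %hash_of = map { $_ => md5_hex( $salt . $_ ) } @lines;
    return sort { $hash_of{$a} cmp $hash_of{$b} } @lines;
}

my @input = qw(apple banana apple cherry banana apple);

# Duplicates stay grouped here, in an order that changes from run to run ...
print "hashed:   @{[ random_hash_sort(@input) ]}\n";

# ... while a true shuffle treats every element independently.
print "shuffled: @{[ shuffle(@input) ]}\n";
```

If the input has no duplicate lines, the two approaches are hard to tell apart; with duplicates, shuf or the Perl one-liner is the safer choice for a real shuffle.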

Unslow Regexps

Our mail logs accumulate a few million lines per weekday. Some of them contain information on whether SpamAssassin considered a given message to be spam:

Jun 11 23:58:15 sunapee MailScanner[7997]: Message q5C3vVli019865 from 71.243.115.147 (info@softwareevals.org) to unh.edu is spam, SpamAssassin (not cached, score=15.565, required 5, BAYES_00 -1.90, DIGEST_MULTIPLE 0.29, HELO_DYNAMIC_IPADDR 1.95, HTML_MESSAGE 0.00, KHOP_DNSBL_BUMP 2.00, MIME_HTML_ONLY 0.72, PYZOR_CHECK 1.39, RAZOR2_CF_RANGE_51_100 0.50, RAZOR2_CF_RANGE_E8_51_100 1.89, RAZOR2_CHECK 0.92, RCVD_IN_HOSTKARMA_BL 1.70, URIBL_BLACK 1.73, URIBL_JP_SURBL 1.25, URIBL_RHS_DOB 1.51, URIBL_WS_SURBL 1.61)

or “ham” (not spam):

Jun 11 23:59:54 sunapee MailScanner[7634]: Message q5C3xp77020291 from 208.117.48.80 (bounces+54769-353f-cfg6=cisunix.unh.edu@email.news.spotifymail.com) to cisunix.unh.edu is not spam, SpamAssassin (not cached, score=-2, required 5, autolearn=not spam, BAYES_00 -1.90, HTML_MESSAGE 0.00, RCVD_IN_HOSTKARMA_NO -0.09, SPF_PASS -0.00, T_RP_MATCHES_RCVD -0.01)

(Lines wrapped for clarity.)

Perl scripts scan through the logs and produce plots of ham/spam traffic (typical example here). Such scripts need to (for lines like the above) extract the date/time information (always the first 15 characters) and whether the line says “is spam” or “is not spam”. My initial approach (years ago) was very simple-minded. Old code snippet:

while (<>) {
    if (/is spam/ || /is not spam/) {
        $date = ParseDate(substr($_, 0, 15));
        $dt = UnixDate($date, "%Y-%m-%d %H:00");
        if (/is spam/) {
            $spamcount{$dt}++;
        }
        else {
            $hamcount{$dt}++;
        }
    }
}

(ParseDate and UnixDate are routines from the Date::Manip package. Without getting bogged down in details, they allow the messages to be counted in hour-sized buckets for the plot linked above.)

For not-particularly-important reasons, I decided to try an “all at once” regular expression on each line instead. New code snippet:

use Readonly;
use Regexp::Common qw/net/;

Readonly my $SPAMLINE_RE => qr {
    ^
    (\w{3} \s+ \d+ \s \d{2}:\d{2}:\d{2}) \s
    \w+ \s MailScanner \[\d+\]: \s
    Message \s \w+ \s
    from \s ($RE{net}{IPv4}) \s
    \([^)]*\) \s to \s \S+ \s
    (is \s (?:not \s)? spam)
}x;

while (<>) {
    if ( $_ =~ $SPAMLINE_RE ) {
        my $d        = $1;
        my $spamflag = $3;
        my $date     = ParseDate($d);
        my $dt       = UnixDate( $date, '%Y-%m-%d %H:00' );
        if ( $spamflag eq 'is spam' ) {
            $spamcount{$dt}++;
        }
        else {
            $hamcount{$dt}++;
        }
    }
}

There’s plenty to criticize about both snippets, but mainly I was thinking: running the second version’s huge, hairy regular expression on each log line will just have to be way slower than the simple fixed-string searching in the first version. (Even though there can be up to three searches per line in the first version, they’re searches for fixed strings, right? Right?)

Wrong, as it turns out. The hairy second version ran slightly faster than the simple-minded first version. (In addition it extracts the IP address of the server that sent us the message; not used here, but important in other scripts.)

Humbling moral: even after many years of Perl coding, my gut feelings about efficiency are not to be trusted.
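For anyone who wants to check their own gut feelings, a comparison like this can be sketched with the core Benchmark module. This is a minimal sketch under stated assumptions, not the scripts described above: the sample log lines are made up, the date handling is left out (both versions do it the same way), and a simplified IPv4 pattern stands in for $RE{net}{IPv4} so that only core modules are needed. The names classify_simple and classify_hairy are hypothetical.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Simplified stand-in for $RE{net}{IPv4} (assumption: no octet range check).
my $IPV4_RE = qr/\d{1,3}(?:\.\d{1,3}){3}/;

my $SPAMLINE_RE = qr{
    ^
    (\w{3} \s+ \d+ \s \d{2}:\d{2}:\d{2}) \s
    \w+ \s MailScanner \[\d+\]: \s
    Message \s \w+ \s
    from \s ($IPV4_RE) \s
    \([^)]*\) \s to \s \S+ \s
    (is \s (?:not \s)? spam)
}x;

# Made-up sample lines shaped like the log excerpts; real logs will differ.
my @lines = (
    'Jun 11 23:58:15 host MailScanner[7997]: Message abc123 from 192.0.2.1 '
      . '(a@example.org) to example.edu is spam, SpamAssassin (score=15.5)',
    'Jun 11 23:59:54 host MailScanner[7634]: Message def456 from 192.0.2.2 '
      . '(b@example.com) to example.edu is not spam, SpamAssassin (score=-2)',
    'Jun 11 23:59:59 host sendmail[1234]: some unrelated line',
);

# First approach: up to three fixed-string searches per line.
sub classify_simple {
    my ($line) = @_;
    return unless $line =~ /is spam/ || $line =~ /is not spam/;
    return $line =~ /is spam/ ? 'spam' : 'ham';
}

# Second approach: one anchored "all at once" expression per line.
sub classify_hairy {
    my ($line) = @_;
    return unless $line =~ $SPAMLINE_RE;
    return $3 eq 'is spam' ? 'spam' : 'ham';
}

# Run each version for at least one CPU second and print a comparison table.
cmpthese(
    -1,
    {
        simple => sub { classify_simple($_) for @lines },
        hairy  => sub { classify_hairy($_) for @lines },
    }
);
```

The numbers will vary with the Perl version and, above all, with the mix of lines in the log: an anchored pattern can reject a non-matching line after looking at only its first few characters, while an unanchored fixed-string search has to scan the whole (long) line before giving up. Treat the output as a measurement of your own data, not a general verdict.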

Passphrase Security

Bruce Schneier has a recent post about a new research paper that seems to throw a little bit of cold water on the obvious superiority of passphrases over passwords. Schneier has a pointer to the paper and a less-formal blog summary. The bottom line seems to be: users can choose poor “easily guessed” passphrases, and left to their own devices, they probably will. As usual with Schneier’s blog, many of the comments to the post are insightful and worth reading.

It seems that it might also be much more difficult to check the “quality” of a passphrase than a password. You’d like to be able to say things like: “Maybe you shouldn’t use Psalm 23:1 (King James Version) as a passphrase.”

Grammatically-Correct Random Pass Phrase Generator (in Perl)

I change my passwords every 3 months, but my passphrases are getting kind of stale. The argument for changing your passphrases is about the same as for changing your password. Being lazy (but curious), I went looking for a passphrase generator similar in philosophy to the apg password generator that I’ve been using for a long time.

I didn’t find anything comparable to apg, but I came across a blog post by Curtis Copley describing his algorithm for generating grammatically-correct random passphrases, a neat idea. He provided source code in PL/SQL and Java. Both painful languages for me. How about a Perl translation? Looking at the code, I was discouraged: “This is impossible.”

But then, looking closer at the algorithm: “Hey, maybe not that bad.”

So I coded up a Perl implementation. As far as I can tell, it’s accurate. If you’d like to look, more information and the code are here.

Still haven’t changed my passphrases, though.

Assumptions

From Stack Overflow:



I am the developer of some family tree software (written in C++ and Qt). I had no problems until one of my customers mailed me a bug report. The problem is that he has two children with his own daughter, and he can’t use my software because of errors.

Those errors are the result of my various assertions and invariants about the family graph being processed (for example, after walking a cycle, the program states that X can’t be both father and grandfather of Y).

How can I resolve those errors without removing all data assertions?

A Joke At Which I Laughed

A wife asks her programmer husband, “Could you please go shopping for me and buy one carton of milk, and if they have eggs, get 6.”

A short time later the husband comes back with 6 cartons of milk.

The wife asks him, “Why the hell did you buy 6 cartons of milk?”

He replied, “They had eggs.”

Nine traits of the veteran Unix admin

I could not help but read an article with a title like “Nine traits of the veteran Unix admin”. It’s good! And not just because the first two traits are ones I bitterly cling to myself!

Can’t resist commenting on this one:

Veteran Unix admin trait No. 3: We wield regular expressions like weapons

“…and sometimes wind up shooting ourselves in the foot.” I was reminded of a Jamie Zawinski quote:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

And there’s this one:

Veteran Unix admin trait No. 5: We prefer elegant solutions

Well, who doesn’t? Let’s add: “…and we’re darn good at deluding ourselves that our solutions are elegant.”

Women in CS

Women in CS

History of Programming Languages

… is right here. Sample:

1801 – Joseph Marie Jacquard uses punch cards to instruct a loom to weave “hello, world” into a tapestry. Redditers of the time are not impressed due to the lack of tail call recursion, concurrency, or proper capitalization.

Perl 5.12 is out!

How have we survived til now without the yada yada operator?

And is this the first operator named after a TV catchphrase?
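For the curious, the yada yada operator is the literal three dots, used as a placeholder statement for code that hasn't been written yet. It compiles fine and dies with an "Unimplemented" message if execution ever reaches it. A minimal sketch, assuming Perl 5.12 or later (the sub name not_written_yet is made up):

```perl
use 5.012;
use strict;
use warnings;

# The body is the yada yada operator itself: a stub that parses cleanly.
sub not_written_yet { ... }

# Calling the stub throws an exception, so trap it with eval.
my $ok = eval { not_written_yet(); 1 };
print "died with: $@" unless $ok;
```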
