Home » Projekte » Similarreceiver

Similarreceiver

We once had a user "elke.hanssmann@somewhere". Unfortunately, this name is very strange to english speakers, so we got incoming e-mail for "elke.hansman", "elke.hanssman", "elke.hannsmann" and whatever else is possible.

Anyway, not being able to receive e-mail was undesirable. Or, in another sense, it was very user-unfriendly. But the worst: all of this misspelled e-mail would either go back to the sender or to me. Having understood Larry Wall's reasoning that laziness is a virtue (for programmers!), I wrote this software code.

Now, at that time I worked in a multinational organization. The most widely deployed "fuzzy word matching algorithms" like Soundex were really unusable in such a context. They are based on a german or english phoneme systems. That has not much in conjunction with, say, Mongolian language and usually gives very poor results than applied. Maybe that's the reason why Soundex is never really used in products?

I made some adaptions to real world scenario (e.g., two long words with one different character are considered to be nearer to each other then two three-letter words with one differing character). With those heuristics, the code has shown to be quite usable.

The program is in use at about 20-30 locations worldwide, since about 1997.

How it works 

SimilarReceiver first collects all possible receivers. To have things simply, only the local user base is considered (i.e. no LDAP, NIS or NIS+ queries). During the collection, the pattern is tested against each collected Receiver. Local users and entries in your sendmail alias file apply for being a receiver.

The check is done by a algorithm known as "Levenshtein distance", sometimes also known a as "Edit distance". This algorithm calculates the necessary numbers of character additions, character deletions and character replacements necessary to transform one string into the other.

The transfer "postmast" into "postmaster" needs two character additions, therefore the edit distance is two. The distance between "holge.schurick" and "holger.schurig" is 3:

  1. you have to add an 'r'
  2. substitute the 'c' with a 'g' and
  3. delete the 'k'

Now, one would make this simply assumption "The smaller the edit distance is, the more similar are the words". Unfortunately this is not always the case.

The raw edit distance is sometimes not that good to see the similarity, especially if you have small words. "tst-a" and "tst-b" are obviously much more different than "Elke.Hanssmann" and "Elke.Hanssman", but the edit distance is in both cases 1. Therefore the length of both the pattern and the match string is taken into account to yield an error ratio between 0 and 100, with 0 as a perfect match.

Okay, now we have our misspelled name and a list of our 200 local receivers names, each marked with an error-ratio. Now the best best hit is taken. But only if one of several heuristics apply:

How to verify the results 

Just use the -d switch ... Use it several times for more and more debugging output.

Notes on user processing 

Your user database is read by the C library function getpwent(). Only users with a password are considered. The login name (e.g. "hschurig") is stored into the receiver list.

The full name, e.g. "Holger Schurig" is lowercased. It will the be stored into the database as "holger.schurig", "holger" and "schurig". But if the full name field contains too much spaces, e.g. "Gesellschaft fuer Mission und Lebenshilfe", this is considered to be strange and not entered.

Mailer integration 

The big picture on how to integrate this technology into your mailer:

This scheme should probably work with any mailer, e.g. sendmail, postfix or qmail. The contrib directory contains two sample files for the sendmail-integration.

Download 

Get the file here: similarreceiver-1.3.tar.gz.