Emmesmail takes a multi-faceted approach to junk mail that, after many years of development, achieves a spam rejection rate approaching 99% with a false-positive rate below 1% (each rejection of a valid email counts as a false positive). Emmesmail initially used a whitelist, a blacklist, a Bayesian filter and an "appropriateness" filter. The whitelist and blacklist were user-specific, locally generated files. In 2016, this scheme was simplified so that Emmesmail used only a locally generated whitelist, a Bayesian filter, and a filter that looked at the percentage of unrecognized words. In 2017, the Bayesian filter's handling of unrecognized words was changed and the filter examining the percentage of unrecognized words was eliminated. Since 2020, Emmesmail has been practically a pure Bayesian filter.
Here is an outline of how the filtering works:
1) Until 2020, the first thing Emmesmail did was to check whether the sender was included in the local whitelist, a list of senders previously determined not to be spammers. If so, the email was immediately delivered. This procedure was abandoned in 2020.
2) Emmesmail used to check next whether the sender was included in the local blacklist, a list of senders previously found to be spammers. If so, the email was re-directed from the recipient's mailbox to the spam mailbox. This step was eliminated in 2016, since virtually every email whose sender was on the blacklist was deemed spam by the Bayesian filter anyway.
3) If the sender is not included in the whitelist, the entire email, including the header, is next examined by a Bayesian filter modeled after that of Paul Graham.
4) If the Bayesian filter reports that the email is likely spam, it is re-directed to the spam mailbox, appended to the database of spam emails (see information on Bayesian filtering below), and the sender is added to the blacklist. If the Bayesian analysis suggests the email is not spam, it is then examined by additional filters. Prior to 2016, the next filter examined the ratio of characters to actual tokens (tokens are defined below), after which an "appropriateness" filter examined the appropriateness of the words used in the email. Starting in 2016, the token-paucity filter was eliminated and the "appropriateness" filter simply labeled as spam those emails in which more than 40% of the tokens were not already in the Bayesian filter's corpora.
5) If an email passes all the filtering, it is forwarded to the recipient's mailbox. Just having an email passed by the filtering process is not sufficient to add that email's sender to the whitelist. This only occurs once the email is saved by the recipient.
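The ordering of the filters in the outline above can be sketched in a few lines of Python. This is only an illustration of the 2016-era sequence (whitelist, then Bayesian filter, then the 40% unrecognized-token check), not Emmesmail's actual code. The function name, its arguments, and the bayes_is_spam callback are hypothetical, and the bookkeeping described in step 4 (updating the spam database and the blacklist) is left out.

```python
def classify(sender, tokens, whitelist, bayes_is_spam, known_tokens,
             max_unknown=0.40):
    """Return "inbox" or "spam" using the 2016-era order of filters.

    All names here are illustrative assumptions, not Emmesmail identifiers.
    bayes_is_spam is any callable that takes the token list and returns a bool.
    """
    # Step 1: mail from a whitelisted sender is delivered immediately.
    if sender in whitelist:
        return "inbox"
    # Steps 3-4: the Bayesian filter sees everything that is not whitelisted.
    if bayes_is_spam(tokens):
        return "spam"
    # Step 4, 2016 variant: too many tokens that the corpora have never seen.
    if tokens:
        unknown = sum(1 for t in tokens if t not in known_tokens)
        if unknown / len(tokens) > max_unknown:
            return "spam"
    # Step 5: the email passed every filter.
    return "inbox"
```

Because each filter can only end the chain early, removing a filter (as was done repeatedly over the years) amounts to deleting one of the if-blocks without touching the others.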
If, upon checking the spam mailbox, the user finds that a mistake has been made and an innocent email has been diverted there, a single click will correct the mistake, deliver the mail to the intended recipient, and correct the databases.
Initially, Emmesmail rejects spam based upon Emmes Technologies' databases that come with the software, until such time as the user's databases become large enough to use.
When Emmesmail has determined that an email is spam, it can, if configured to do so, return the email to the spammer with a customizable "failure-to-deliver" message. Most authorities recommend that this feature not be used.
We found that in implementing the Bayesian filter described by Paul Graham, the following parameters needed to be defined.
Parameter | Definition | Value chosen |
MAXW | Maximum number of tokens allowed in the hash table | 400000 |
MWDS | Maximum number of words considered when calculating weights | 9000 |
WMIN | Minimum length of a hash table token | 2 |
WMAX | Maximum length of a hash table token | 40 |
PMIN | Minimum weight of a token | 0.001 |
PMAX | Maximum weight of a token | 0.999 |
PUNK | Weight assigned to a token not seen previously | 0.75 |
MINO | Minimum number of times a token must appear in corpora to count | 3 |
MNUM | Maximum number of emails in each corpus before thinning | 410 |
RNUM | Number of emails remaining after thinning | 390 |
CUT | Likelihood above which an email is considered spam | 0.5 |
NTW | Number of words to weigh in likelihood calculation | 15 |
AFPB | Anti false-positive bias factor | 2.0 |
- | Characters which act as token separators | \040 (space), \011 (tab), \012 (newline), @, ? |
WMIN: Set to 2 to avoid examining single letters.
WMAX: Eliminates long, undecipherable tokens such as those that occur in PDF documents.
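As a rough illustration of how the separator list, WMIN and WMAX interact, here is a minimal Python tokenizer sketch. It assumes that tokens outside the allowed length range are simply dropped (rather than truncated), which is our reading of the descriptions above. The function name is hypothetical.

```python
import re

WMIN = 2    # minimum token length, from the parameter table
WMAX = 40   # maximum token length, from the parameter table

# Separators from the table: space (\040), tab (\011), newline (\012), @ and ?
_SEPARATORS = re.compile(r"[ \t\n@?]+")

def tokenize(text):
    """Split text on the separator characters and keep tokens whose
    length lies between WMIN and WMAX inclusive."""
    return [t for t in _SEPARATORS.split(text) if WMIN <= len(t) <= WMAX]
```

For example, tokenize("hi bob@example.com? a\tok") yields the tokens hi, bob, example.com and ok; the single letter is discarded by WMIN, and the address is split at the @ sign.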
PMIN, PMAX: Not 0 or 1, in order to avoid division by zero in the calculations. Also, if PMIN is too close to 0 (or PMAX too close to 1), a single word can carry too much weight. Currently, out of the approximately 18,500 tokens in our hash file, 26 are assigned the maximum weight and 12 the minimum weight.
MINO: A word must occur at least four times in our corpora to be significant with regard to determining whether an email is spam. Graham used five, but we felt four might allow one fewer spam email to be passed during the filter's training period.
MNUM, RNUM: When one of our corpora grows to 350 emails, we reduce it to the 250 most recent and then add new ones until the total again reaches 350.
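The thinning rule can be sketched as follows. This is a minimal illustration that assumes a corpus is held as a list ordered from oldest to newest; the function name is hypothetical, and the two thresholds are passed in as arguments rather than fixed.

```python
def thin_corpus(emails, mnum, rnum):
    """Return the corpus after thinning.

    emails: list ordered oldest first (an assumption of this sketch).
    mnum:   size at which thinning is triggered (MNUM).
    rnum:   number of most recent emails kept afterwards (RNUM, must be > 0).
    """
    if len(emails) >= mnum:
        return emails[-rnum:]   # keep only the rnum newest entries
    return emails
```

With the figures quoted in the paragraph above, thin_corpus(corpus, 350, 250) leaves a 349-email corpus untouched and cuts a 350-email corpus down to its 250 newest entries.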
CUT, NTW: Like the original Paul Graham filter, we calculate the likelihood of an email being spam according to the formula
Likelihood = pspam/(pspam + pnspam)
where pspam = w1*w2*...*wn and pnspam = (1-w1)*(1-w2)*...*(1-wn), and where w1 through wn are the weights of the tokens in the email. Like the original Graham protocol, we arbitrarily consider only the NTW (15) most significant weights (those furthest from 0.5, i.e. closest to 0 or 1) in the calculation of likelihood, and we reject emails whose likelihood of spam is greater than CUT. We set CUT to 0.5, a logical choice. Setting CUT to 0.9, as in Graham's formulation, gives the same results since, as he points out, the probabilities tend to be close to 0 or 1, with hardly any falling between 0.5 and 0.9.
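The likelihood formula can be written out directly in Python. The sketch below is an illustration under the definitions given here, not Emmesmail's own code: the function names are hypothetical, and it assumes the weights have already been clamped to the range PMIN to PMAX, so that the denominator cannot be zero. It needs Python 3.8 or later for math.prod.

```python
import math

NTW = 15    # number of most significant weights used
CUT = 0.5   # likelihood above which an email is treated as spam

def spam_likelihood(weights, ntw=NTW):
    """Combine token weights into a single spam likelihood.

    Only the ntw weights furthest from 0.5 (closest to 0 or 1) are used.
    """
    top = sorted(weights, key=lambda w: abs(w - 0.5), reverse=True)[:ntw]
    pspam = math.prod(top)
    pnspam = math.prod(1.0 - w for w in top)
    return pspam / (pspam + pnspam)

def is_spam(weights, cut=CUT):
    """Apply the CUT threshold to the combined likelihood."""
    return spam_likelihood(weights) > cut
```

As a small worked case, two tokens with weight 0.9 give pspam = 0.81 and pnspam = 0.01, so the likelihood is 0.81/0.82, or roughly 0.988. This shows how quickly a few strong tokens push the result toward 0 or 1, which is why the choice between CUT = 0.5 and CUT = 0.9 makes little difference.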
AFPB: The anti false-positive bias factor. The weights, wn, strictly should be calculated according to the formula
wn = a/( a + b )
where a and b are the frequencies of the word in the spam and non-spam corpora, respectively. The description of the original Graham filter recommended counting the words in the non-spam corpus twice in order to reduce the incidence of false positives. In our implementation this amounts to using the formula
wn = a/( a + b*AFPB )
where AFPB is 2.0. We tried values for AFPB ranging from 0.4 to 3.0 before setting AFPB to 1.0 for many years, essentially eliminating it as a variable. Eventually we found that using Graham's value (2.0) gave better results.
PUNK: Initially this was set to 0.5, which meant that unrecognized words were essentially ignored. However, we got better results as PUNK was increased, with the largest value tested, 0.95, giving the best results. After spates of false positives, it was lowered to 0.90 on 15 June 2021, 0.80 on 1 January 2022, and 0.75 on 1 April 2022.
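Putting AFPB, PMIN, PMAX, MINO and PUNK together, the weight of a single token might be computed as in the sketch below. Two points are assumptions of this sketch and are not stated in the text: that a token "counts" only when its combined occurrences exceed MINO (which reconciles the table value of 3 with the "at least four times" wording), and that a token falling short of this is treated like an unrecognized token and given the weight PUNK. The function name is hypothetical.

```python
PMIN, PMAX = 0.001, 0.999   # bounds on a token's weight
PUNK = 0.75                 # weight for a token not seen previously
MINO = 3                    # occurrence threshold from the parameter table
AFPB = 2.0                  # anti false-positive bias factor

def token_weight(a, b):
    """Weight of a token seen a times in spam and b times in non-spam."""
    # Assumption: rare tokens are handled like unrecognized ones.
    if a + b <= MINO:
        return PUNK
    w = a / (a + b * AFPB)
    # Clamp so that no weight is exactly 0 or 1.
    return min(PMAX, max(PMIN, w))
```

For instance, a token seen 10 times in spam and 5 times in valid mail gets 10/(10 + 5*2.0) = 0.5, whereas without the bias factor it would get 10/15, about 0.67. Doubling the non-spam count in this way pulls borderline tokens toward the innocent side.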
Our attempt to implement Graham's formulation exactly did not, initially, achieve as high a spam rejection rate as he reported, so we made a number of changes to our spam filtering. Starting in 2004, we introduced what we refer to as hierarchical filtering. With this system, the Bayesian filter is just one of a number of filters, applied in a linear fashion.
In 2004, the first filter we applied was sender-filtering, which used a user-specific whitelist and blacklist. Only afterward was the Bayesian filter applied.
Between 2004 and 2014, we added and tweaked a number of filters, including a "token-paucity" filter which examined the ratio of characters to actual tokens and trapped those spam emails that avoid detection by Bayesian filters by not containing very many words in ASCII or UTF-8 format.
Ironically, once our success rate approached that of Graham et al., we developed a means of evaluating the efficiency of each individual filter and concluded that the improvement in our success rate was probably due more to the elimination of bugs in the original program than to the specific modifications. As a result, we slowly eliminated many of the modifications we had added.
After many years of work, we feel that the one improvement we have made to the original Graham formulation is in the weight assigned to an unrecognized token. Initially we assigned an unrecognized token a weight of PUNK = 0.5, which meant that, essentially, that token was ignored. We are not sure what Graham, et al. did with unrecognized tokens, but suspect they also ignored them. Our recent data suggested that assigning unrecognized tokens a weight of PUNK > 0.5 increases the spam rejection rate without increasing the rate of false-positives. Currently we are using PUNK = 0.90.
The table below summarizes our year-to-year results. The specific details of Emmesmail's development post-2004 are described here.
Year | Spam Emails Received | Spam Emails Rejected | Rejection Rate (%) | Valid Emails Received | Valid Emails Rejected | False Positives (%) |
2003 | 276 | 256 | 92.8 | 682 | 28 | 4.1 |
2004 | 1173 | 1099 | 93.7 | 834 | 15 | 1.8 |
2005 | 2749 | 2624 | 95.5 | 1008 | 10 | 1.0 |
2006 | 11677 | 11401 | 97.6 | 804 | 16 | 2.0 |
2007 | 11622 | 11433 | 98.4 | 642 | 9 | 1.4 |
2008 | 11879 | 11579 | 97.5 | 1060 | 11 | 1.0 |
2009 | 1523 | 1504 | 98.8 | 607 | 4 | 0.7 |
2010 | 805 | 785 | 97.5 | 678 | 8 | 1.2 |
2011 | 784 | 773 | 98.6 | 528 | 5 | 0.9 |
2012 | 874 | 863 | 98.7 | 568 | 9 | 1.6 |
2013 | 1905 | 1882 | 98.8 | 639 | 9 | 1.4 |
2014 | 1982 | 1970 | 99.4 | 658 | 4 | 0.6 |
2015 | 2020 | 2001 | 99.1 | 611 | 15 | 2.5 |
2016 | 2264 | 2234 | 98.7 | 647 | 13 | 2.0 |
2017 | 2175 | 2131 | 98.0 | 949 | 11 | 1.2 |
2018 | 1595 | 1557 | 97.6 ± 0.4 | 754 | 7 | 0.9 |
2019 | 3652 | 3528 | 96.6 ± 0.3 | 735 | 11 | 1.5 |
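The percentage columns in the table follow directly from the counts, as the short Python sketch below shows. The text does not say how the ± figures for 2018 and 2019 were obtained; the second function assumes they are binomial standard errors, an interpretation that happens to reproduce both values but is not confirmed by the source. Both function names are hypothetical.

```python
import math

def rate_percent(hits, total):
    """Percentage of total accounted for by hits."""
    return 100.0 * hits / total

def binomial_std_error_percent(hits, total):
    """Standard error of a proportion, in percent.

    Assumption: the table's +/- values are of this kind.
    """
    p = hits / total
    return 100.0 * math.sqrt(p * (1.0 - p) / total)
```

For 2018, rate_percent(1557, 1595) rounds to 97.6 and binomial_std_error_percent(1557, 1595) rounds to 0.4. For 2019, the same calculation on 3528 of 3652 gives 96.6 and 0.3.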
In 2020 there was a major change in our spam-filtering formulation: nearly every email received was tested by the Bayesian filter, and the other filters (token-paucity, inappropriate-tokens, and sender-blacklist) were ignored. We retained our sender-whitelist, but it was applied after the Bayesian filter, and any emails classified as spam by the Bayesian filter that came from senders on our sender-whitelist were reclassified as Bayesian false positives. As it turned out, there were hardly any of these. Starting in 2020, false positives were not tabulated separately, but were combined with the false negatives and missed spam in calculating filter efficiency. Subsequent development and results are reported here.