Emmesmail Development, 2004 to Present

This entry documents Emmesmail's development from 2004 to the present.   Little or no effort has been made to reinterpret older entries in light of our current thinking.   A summary of the work is here.

2004-2014

Use of a whitelist and blacklist to augment the Bayesian filter was introduced in mid-2004. A filter looking at "unrecognized" words in the email was added in 2006. In 2008, we briefly tried using "tokens" made up of two consecutive words in our comparisons, but this did not significantly improve spam rejection. In early 2014, token-paucity filtering was introduced, but only when the user had specified the use of sender filtering. The logic was that many spammers were reducing the number of words in an email to hinder Bayesian filtering, and that while a friend might send you an email containing only a few words, such as "my pics", a legitimate stranger would have to explain, for example, "these are pics I am sending to persuade you to go on a date with me". Emmesmail therefore treats an email containing just "My pics" from someone you don't know as likely spam. Prior to 2011, Emmesmail ran only under Microsoft Windows; since then, it has run only under Linux.
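
As an illustration only (Emmesmail's actual thresholds are not given here, so the numbers below are assumptions), a token-paucity test amounts to flagging a body with too few word tokens for its character count:

    def is_token_paucity_spam(body, min_chars=200, min_tokens_per_100_chars=2.0):
        """Flag an email whose body has too few word tokens for its length.

        The thresholds are illustrative assumptions, not Emmesmail's
        actual parameters.
        """
        if len(body) < min_chars:      # very short bodies: leave to other filters
            return False
        return 100.0 * len(body.split()) / len(body) < min_tokens_per_100_chars

    print(is_token_paucity_spam("x" * 500))   # long body, one token -> True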

2015

In early 2015, we refined our email classification scheme in order to allow better assessment of the efficiency of each filter. Each email was classified according to one of the following:

ok-whitelist (sender is in the whitelist)
ok-passed-all (all filters used thought the email was not spam)
spam-blacklist (sender is in the blacklist)
ok-fp-blacklist (sender appeared to be in blacklist, probably as a wildcard entry, but the email was not spam)
spam-bayes (the Bayesian filter thought the email was spam)
ok-fp-bayes (the Bayesian filter thought the email was spam, but it was not)
spam-token-paucity (too few tokens for the number of characters in email)
ok-fp-token-paucity (the token-paucity filter misdiagnosed this valid email)
spam-unrecognized-words (high fraction of words not previously seen in emails)
ok-fp-unrecognized-words (high fraction of words not previously seen in emails, but not spam)
spam-missed (spam email missed by all the filters)
spam-user-defined (someone in, or someone posing as someone in, the whitelist sent this spam email)

This allowed a more detailed description of the filter results.

In the table below, the filters are listed in the order applied. The number of emails tested by each spam filter goes down sequentially because the filtering is hierarchical: once an email is declared spam by one test, or valid by the whitelist, no further tests are done. The "No. tested" column lists the number of spam emails tested by that filter and is the denominator for the "% Spam rej." entry. The denominator for "% False Pos." is the total number of valid emails received (including the false positives).
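
In outline, the hierarchy stops at the first filter that makes a decision. A minimal sketch of the control flow (the stub tests below are hypothetical stand-ins, not Emmesmail's actual filters):

    def classify(email, is_whitelisted, spam_tests):
        """Hierarchical filtering: return the label of the first test that fires.

        spam_tests is an ordered list of (test, label) pairs, in the same
        order as the filters in the table below.
        """
        if is_whitelisted(email):
            return "ok-whitelist"          # whitelisted: no further tests
        for test, label in spam_tests:
            if test(email):                # declared spam: stop testing
                return label
        return "ok-passed-all"

    # Illustrative usage with stub tests:
    tests = [
        (lambda e: e["sender"] in {"junk@example.com"}, "spam-blacklist"),
        (lambda e: "viagra" in e["body"].lower(),       "spam-bayes"),
    ]
    mail = {"sender": "stranger@example.net", "body": "Cheap viagra!!!"}
    print(classify(mail, lambda e: e["sender"].endswith("@example.org"), tests))
    # -> spam-bayes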

Emmesmail's Spam Rejection Statistics by Filter

Spam Filter Type     | No. tested | No. caught | False Pos. | % Spam rej. | % False Pos.
Sender-filtering     |       2020 |       1404 |          0 |          70 |            0
Bayesian-filtering   |        616 |        520 |          7 |          84 |          1.1
Token-paucity        |         96 |         67 |          9 |          70 |          1.4
Unrecognized-words   |         29 |         10 |          0 |          34 |            0
All filters combined |       2020 |       2001 |         15 |        99.1 |          2.5

Total Valid Emails | Whitelisted | Passed all filters | False Pos. | % False Pos.
               611 |         590 |                  6 |         15 |          2.5


2016

In years prior to 2016, Emmesmail used a whitelist, a blacklist, a Bayesian filter, a token-paucity filter, and an "appropriateness" or "unrecognized words" filter.  Filtering based upon the whitelist and blacklist was referred to as sender-filtering.  In 2016, sender-filtering was separated into filtering against a whitelist and filtering against a blacklist.  Black-sender-filtering was eliminated after it was noted that nearly all emails from senders on the blacklist were caught by the Bayesian filter as well.  White-sender-filtering was continued in order to reduce false positives when people on the whitelist occasionally sent unusual, but nonetheless valid, emails that failed Bayesian filtering.  Token-paucity filtering was also eliminated because it caused too many false positives relative to the number of additional spam emails it caught.

Emmesmail's Spam Rejection Statistics by Filter

Spam Filter Type     | No. tested | No. caught | False Pos. | % Spam rej. | % False Pos.
Bayesian-filtering   |       2264 |       2193 |          7 |          97 |          1.1
Unrecognized-words   |         69 |         39 |          6 |          57 |          0.9
All filters combined |       2264 |       2234 |         13 |        98.7 |          2.0

Total Valid Emails | Whitelisted | Passed all filters | False Pos. | % False Pos.
               647 |         579 |                 55 |         13 |          2.0


2017

The success of eliminating the blacklist encouraged us to try to eliminate the "unrecognized" filter as well. In 2016, the "unrecognized" filter increased our overall spam rejection rate (from 97% to 98.7%) but also increased our false positive rate (from 1.1% to 2.0%).  In 2017, our intention was to experiment with modifying our Bayesian filter to operate without any additional filters (aside from whitelist filtering).  For many years prior to 2017, an unrecognized word in the Bayesian filter was assigned a spam probability of PUNK = 0.5, essentially eliminating it from the calculation of the overall probability of spam.  This was done even though the assumption behind our unrecognized filter was that the higher the fraction of unrecognized words, the higher the likelihood that an email is spam.  Our plan for 2017, after eliminating the unrecognized filter, was to slowly increase the value of PUNK and examine the effect this had on the spam rejection and false positive rates.  While the results are not conclusive, plots of the spam rejection and false positive rates versus PUNK suggest that higher values of PUNK are well correlated with a higher spam rejection rate and much less well correlated with the false positive rate. We will attempt to confirm this in 2018.
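
This entry does not give Emmesmail's exact combining formula; the Graham-style naive-Bayes combination below is a common choice and is shown only to illustrate how PUNK enters the calculation:

    import math

    def spam_probability(tokens, p_spam_given_token, punk=0.5):
        """Combine per-token spam probabilities into one score, in log space.

        Tokens never seen before are assigned the probability `punk`.  At
        punk = 0.5 an unknown token contributes equally to the spam and
        ham products and so cancels out; larger values push emails with
        many unknown tokens toward spam.
        """
        log_spam = log_ham = 0.0
        for t in tokens:
            p = p_spam_given_token.get(t, punk)
            p = min(max(p, 0.01), 0.99)        # clamp away from 0 and 1
            log_spam += math.log(p)
            log_ham += math.log(1.0 - p)
        # P(spam) = prod(p) / (prod(p) + prod(1 - p))
        return 1.0 / (1.0 + math.exp(log_ham - log_spam))

    probs = {"meeting": 0.05}                  # one known, two unknown tokens
    print(spam_probability(["meeting", "zzyzx", "qwerty"], probs, punk=0.5))  # 0.05
    print(spam_probability(["meeting", "zzyzx", "qwerty"], probs, punk=0.9))  # 0.81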

Efficiency of Bayesian Filter with Different Values of PUNK

Period       | PUNK | Spam Received | Spam Caught | Valid W.L. | Valid Passed-all | False Pos. | % Spam rej. | % False Pos.
All 2016     |  0.5 |          2264 |        2193 |        579 |               55 |          7 |        97.0 |          1.1
Jan-Mar 2017 |  0.6 |           576 |         558 |        222 |               20 |          1 |        96.9 |          0.5
Apr-Jun 2017 |  0.7 |           523 |         514 |        245 |               25 |          6 |        98.3 |          2.2
Jul-Sep 2017 |  0.8 |           503 |         490 |        208 |               59 |          3 |        97.4 |          1.1
Oct-Dec 2017 |  0.9 |           573 |         569 |        138 |               42 |          1 |        99.3 |          0.6

2018

By the end of 2017 it became apparent that our Bayesian filter had trouble distinguishing between required bank emails containing a one-use security code and the irritating ones, from the same bank email address, announcing that a requested transaction had been processed.  Thus, in 2018, we used our director file (which determines, based upon the sender, recipient, and tokens in the subject, to which account an incoming email should be directed) to redirect the security-code emails to a totally separate account.  As a result, virtually no messages from our bank end up in our normal email corpus; almost all are directed by the Bayesian filter to the spam corpus. Another thing learned in 2017 is that our filtering lets through so few spam emails that a test period of just three months does not contain sufficient data to discern small differences in performance.  Accordingly, in 2018 we extended the test period for each value of PUNK from three to six months and began calculating an estimate of the uncertainty in our measurement.
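
The entry does not say how the uncertainty was estimated; a simple binomial standard error reproduces the ± values in the table below, so we assume something like the following was used:

    import math

    def rejection_rate(n_caught, n_tested):
        """Spam rejection rate and its binomial standard error, in percent.

        sigma = sqrt(p * (1 - p) / n); an assumption on our part, but it
        reproduces the tabulated ± values to the digits shown.
        """
        p = n_caught / n_tested
        return 100.0 * p, 100.0 * math.sqrt(p * (1.0 - p) / n_tested)

    rate, err = rejection_rate(911, 924)    # Jan-Jun 2018 row
    print(f"{rate:.1f} ± {err:.1f}")        # -> 98.6 ± 0.4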

Efficiency of Bayesian Filter with Different Values of PUNK

Period       | PUNK | Spam Directed | Spam Bayes-tested | Spam Bayes-Caught | Valid W.L. | Valid Passed-all | False Pos. | % Spam rej. | % False Pos.
Jan-Jun 2018 |  0.9 |             2 |               924 |               911 |        327 |               37 |          2 |  98.6 ± 0.4 |          0.5
Jul-Dec 2018 |  0.8 |             5 |               664 |               639 |        356 |               27 |          5 |  96.2 ± 0.7 |          1.3

2019

The data from 2018 support the hypothesis that higher values of PUNK lead to better Bayesian filtering, and unlike the data for 2017, the differences appear to be statistically significant. We were concerned that the significant development work done on our email program during the latter half of 2018 may have accidentally introduced a systematic error, particularly since the 2018 results are worse than those in 2017. Nevertheless, for 2019 we decided to set PUNK to 0.95, slightly larger than any value we had tried previously.

In 2018, we began to use our director file not only to determine to which account to attempt to deliver an incoming email, but also to determine, from the same information, whether the incoming email should already be classified as spam.   The director file, therefore, acted as a very efficient filter.   Since the director file information needs to be evaluated early in the email-evaluation process, we applied the director file filtering second only to whitelist filtering within our hierarchical filter.   It is clear that for 2019 we can get the same overall efficiency, and also a better evaluation of the efficiency of the Bayesian filter, by placing the director filtering after the whitelist and Bayesian filtering.  In 2019, we added what we call Bayes-failure filtering.   Unlike the sender blacklist filtering we used many years ago, which had a huge (7000+ entries) blacklist of all senders determined to be spammers and which was applied before Bayesian filtering, this blacklist, called bflist, is applied only to emails that pass the Bayesian and director filters, and it will consist of only those senders whose emails were missed by the Bayesian filter within the previous twelve months.   To facilitate generation of this list, we wrote a simple bash script to automate our monthly log backup and rotation.
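
A sketch of how such a list might be assembled from the monthly logs (the log naming and format here are assumptions; the entry does not describe them):

    import glob

    def build_bflist(pattern="logs/bayes-missed-*.txt"):
        """Collect senders of Bayes-missed spam from the last 12 monthly logs.

        Assumes one sender address per line and one log file per month,
        named so that lexicographic order is chronological order.
        """
        senders = set()
        for path in sorted(glob.glob(pattern))[-12:]:   # newest twelve months
            with open(path) as f:
                senders.update(line.strip().lower() for line in f if line.strip())
        return senders

    # Applied only to emails that have already passed the Bayesian and
    # director filters:
    #     if email.sender.lower() in bflist:  ->  spam (bflist)
    bflist = build_bflist()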

We changed our DNS in 2019. This resulted in a 6-fold increase in spam received, which quickly came to dominate our spam corpus, leading to a significant drop in Bayes filter efficiency.  To prevent this in the future, we modified our program so that instead of adding every spam email caught to the spam corpus, as done previously, we added only 1 out of every X spam emails, where X, determined from the average number of spam emails received per day during the previous month, was chosen to cause the spam corpus to turn over about twice per year.   To keep the corpora in sync, we similarly changed our procedure so that only 1 out of every Y emails was added to the normal corpus, where Y was calculated from the average number of valid emails saved per day during the previous month.   Generally, X was about 5 and Y was 1.
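
As a worked example (the corpus size is our assumption; the entry does not give one), the rule for X amounts to:

    def sampling_interval(avg_per_day, corpus_size, turnovers_per_year=2.0):
        """Keep 1 of every X new emails so the corpus turns over ~twice a year.

        corpus_size * turnovers_per_year emails must be added per year, out
        of avg_per_day * 365 received, so X is the ratio of the two.
        """
        x = (avg_per_day * 365.0) / (corpus_size * turnovers_per_year)
        return max(1, round(x))

    # About 10 spam per day (roughly the 2019 rate) with an assumed spam
    # corpus of ~365 messages gives X = 5, matching "X was about 5":
    print(sampling_interval(10, 365))   # -> 5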

Efficiency of Bayesian Filter with PUNK = 0.95

Period | Spam Bayes-tested | Spam Bayes-Caught | Spam Directed | Spam-bflist | Spam-missed | Valid W.L. | Valid Passed-all | False Pos. | % Spam rej. | % False Pos.
2019H1 |              1878 |              1812 |             3 |          10 |          53 |        412 |               24 |          3 |  96.5 ± 0.4 |          0.7
2019H2 |              1542 |              1495 |             0 |           5 |          37 |        197 |               91 |          6 |  96.7 ± 0.4 |          2.0

2020

Prior to 2020, except for a brief testing period in 2019, emails from whitelisted senders did not undergo any further testing and were classified as "ok-whitelist". In 2020, we will instead subject all received emails to all the tests. This does away with the ok-whitelist category and should allow better characterization of the Bayes filter itself.   The percentages given in the table are currently placeholders to illustrate the format; they will be replaced with proper values once data are available.

Efficiency of Bayesian Filter with PUNK = 0.95

Period | Spam Received | Caught-Bayes | Caught-Token-Pauc. | Caught-Bflist | Caught-Director | Spam-missed | Valid Passed-all | False-Pos. Bayes | False-Pos. Token | % Spam rej. | % False Pos.
2020H1 |             0 |            0 |                  0 |             0 |               0 |           0 |                0 |                0 |                0 |  96.5 ± 0.4 |          0.7
2020H2 |             0 |            0 |                  0 |             0 |               0 |           0 |                0 |                0 |                0 |  96.6 ± 0.5 |          0.4



Emmes Technologies
Updated 2 Nov, 2019