Emmesmail Development, 2004 to Present

This page documents Emmesmail's development from 2004 to the present.   An overall summary of Emmesmail's history is here.

2004-2014

Use of a whitelist and blacklist to augment Emmesmail's Bayesian filter was introduced in mid-2004. A filter looking at "unrecognized" words in the email was added in 2006. In 2008, we briefly tried using "tokens" made up of two consecutive words in our comparisons, but this did not significantly improve spam rejection. In early 2014, token-paucity filtering was introduced, but it was applied only when the user had specified the use of sender filtering. The logic behind this was that many spammers were reducing the number of words in an email to hinder Bayesian filtering. Whereas a friend might send you an email containing only a few words, like "my pics", a legitimate stranger would have to explain with something like "these are pics I am sending to persuade you to go on a date with me". Emmesmail assumes that if someone you don't know sends you an email with just "My pics", they are likely a spammer. Prior to 2011, Emmesmail ran only under Microsoft Windows. Since then, it has run only under Linux.

2015

In early 2015, we refined our email classification scheme in order to allow better assessment of the efficiency of each filter. Each email was classified according to one of the following:

ok-whitelist (this category was used only until 2018, when the Bayesian filter started testing emails from senders on the whitelist)
ok-passed-all (all filters used thought the email was not spam)
spam-blacklist (sender is in the blacklist)
ok-fp-blacklist (sender appeared to be in blacklist, probably as a wildcard entry, but the email was not spam)
spam-bayes (the Bayesian filter thought the email was spam)
ok-fp-bayes (the Bayesian filter thought the email was spam, but it was not)
spam-token-paucity (too few tokens for the number of characters in email)
ok-fp-token-paucity (the token-paucity filter misdiagnosed this valid email)
spam-unrecognized-words (high fraction of words not previously seen in emails)
ok-fp-unrecognized-words (high fraction of words not previously seen in emails, but not spam)
spam-missed (this spam email was missed by all the filters)
spam-user-defined (someone on the whitelist, or someone posing as them, sent this spam email)

This allowed a more detailed description of the filter results.

In the table below, the filters are listed in the order applied. The number of emails tested by each spam filter goes down sequentially because the filtering is hierarchical and if the email is declared spam by one test, or valid by the whitelist, no more tests are done. The "No. tested" column lists the number of spam emails tested by that filter and is the denominator for the "% spam rejected" entry. The denominator for "% false positives" is the total number of valid emails received (including the false positives).

Emmesmail's Spam Rejection Statistics by Filter

Spam Filter Type      No. tested  No. caught  False Pos.  % Spam rej.  % False Pos.
Sender-filtering      2020        1404        0           70           0
Bayesian-filtering    616         520         7           84           1.1
Token-paucity         96          67          9           70           1.4
Unrecognized-words    29          10          0           34           0
All filters combined  2020        2001        15          99.1         2.5

Total Valid Emails  Whitelisted  Passed all filters  False Pos.  % False Pos.
611                 590          6                   15          2.5
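
The two percentage columns use different denominators, as explained before the table. The following is a minimal Python sketch of that arithmetic using the 2015 counts. The function names are illustrative and are not taken from Emmesmail itself.

```python
def pct_spam_rejected(caught, tested):
    """Percent of the spam emails tested by a filter that the filter caught."""
    return 100.0 * caught / tested


def pct_false_positives(false_pos, total_valid):
    """Percent of all valid emails received (false positives included)
    that were wrongly rejected."""
    return 100.0 * false_pos / total_valid


# 2015 counts taken from the table above.
TOTAL_VALID = 611  # 590 whitelisted + 6 passed all filters + 15 false positives

print(round(pct_spam_rejected(1404, 2020)))            # sender-filtering
print(round(pct_spam_rejected(2001, 2020), 1))         # all filters combined
print(round(pct_false_positives(15, TOTAL_VALID), 1))  # all filters combined
```

With these inputs the three printed values round to 70, 99.1 and 2.5, matching the corresponding table entries.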


2016

In years prior to 2016, Emmesmail used a whitelist, a blacklist, a Bayesian filter, a token-paucity filter, and an "appropriateness" or "unrecognized words" filter. Filtering based upon the whitelist and blacklist was referred to as sender-filtering. In 2016, sender-filtering was separated into filtering against a whitelist and filtering against a blacklist. Black-sender-filtering was eliminated after it was noted that nearly all emails from senders on the blacklist were caught by the Bayesian filter as well. White-sender-filtering was continued in order to reduce false positives when people on the whitelist occasionally sent unusual, but nonetheless valid, emails that failed Bayesian filtering. Token-paucity filtering also was eliminated because it caused too many false positives relative to the number of additional spam emails it caught.

Emmesmail's Spam Rejection Statistics by Filter

Spam Filter Type      No. tested  No. caught  False Pos.  % Spam rej.  % False Pos.
Bayesian-filtering    2264        2193        7           97           1.1
Unrecognized-words    69          39          6           57           0.9
All filters combined  2264        2234        13          98.7         2.0

Total Valid Emails  Whitelisted  Passed all filters  False Pos.  % False Pos.
647                 579          55                  13          2.0


2017

The success of eliminating the blacklist encouraged us also to try to eliminate the "unrecognized" filter. In 2016, the "unrecognized" filter increased our overall spam rejection rate and also increased our false positive rate. In 2017, our intention was to experiment with how to modify our Bayesian filter to operate without any additional filters (aside from whitelist filtering). For many years prior to 2017, an unrecognized word in the Bayesian filter was assigned a spam-probability of PUNK = 0.5, essentially eliminating it from the calculation of the overall probability of spam. This was done even though the assumption of our unrecognized filter was that the higher the fraction of unrecognized words, the higher the likelihood that an email is spam. Our plan for 2017, after eliminating the unrecognized filter, was to slowly increase the value of PUNK and examine the effect this had on the spam rejection and false-positive rates. While the results are not conclusive, plots of the spam rejection and false-positive rates versus PUNK suggest that higher values of PUNK are well correlated with a higher spam rejection rate and much less well correlated with the false-positive rate. We will attempt to confirm this in 2018.

Efficiency of Bayesian Filter with Different Values of PUNK

Period        PUNK  Spam Received  Spam Caught  Valid W.L.  Valid Passed-all  False Pos.  % Spam rej.  % False Pos.
All 2016      0.5   2264           2193         579         55                7           97.0         1.1
Jan-Mar 2017  0.6   576            558          222         20                1           96.9         0.5
Apr-Jun 2017  0.7   523            514          245         25                6           98.3         2.2
Jul-Sep 2017  0.8   503            490          208         59                3           97.4         1.1
Oct-Dec 2017  0.9   573            569          138         42                1           99.3         0.6
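
To illustrate why PUNK = 0.5 removes an unrecognized word from the calculation while larger values push the result toward spam, here is a minimal Python sketch. It assumes the common naive-Bayes combination P = prod(p) / (prod(p) + prod(1 - p)). This page does not state Emmesmail's actual combining formula, so the function, its clamping limits, and the sample token probabilities are all hypothetical.

```python
import math


def spam_probability(tokens, token_probs, punk=0.5):
    """Combine per-token spam probabilities into one overall probability.

    Assumed formula (not confirmed for Emmesmail):
        P = prod(p) / (prod(p) + prod(1 - p))
    evaluated in log space to avoid underflow on long emails.
    """
    log_spam = 0.0
    log_ham = 0.0
    for tok in tokens:
        p = token_probs.get(tok, punk)   # unrecognized token gets PUNK
        p = min(max(p, 0.01), 0.99)      # clamp so log() stays finite
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    diff = log_ham - log_spam
    if diff > 700.0:                     # exp() would overflow; P is ~0
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))


# Hypothetical learned probabilities; "xqzt" and "blorp" were never seen.
known = {"meeting": 0.2}
email = ["meeting", "xqzt", "blorp"]

print(spam_probability(email, known, punk=0.5))
print(spam_probability(email, known, punk=0.9))
```

Under this assumed formula, a token with p = 0.5 multiplies both products by the same factor and so cancels out, whereas two unrecognized tokens at PUNK = 0.9 are enough to move this sample email from a low spam probability to a high one.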

2018

By the end of 2017 it became apparent that our Bayesian filter had trouble distinguishing between required bank emails containing a one-use security code and the irritating ones from the same bank email address announcing that a requested transaction had been processed. Thus, in 2018, we used our director file (which determines, based upon the sender, recipient, and tokens in the subject, to which account the incoming emails should be directed) to redirect the security-code emails to a totally separate account. As a result, virtually no messages from our bank end up in our normal email corpus; almost all of the rest are directed by the Bayesian filter to the spam corpus. Another thing learned in 2017 is that our filtering lets in so few spam emails that a test period of just three months does not contain sufficient data to discern small differences in performance. Accordingly, in 2018 we extended the test period for each value of PUNK from three to six months and began calculating an estimate of the uncertainty in our measurement.

Efficiency of Bayesian Filter with Different Values of PUNK

Period        PUNK  Spam Directed  Spam Bayes-tested  Spam Bayes-Caught  Valid W.L.  Valid Passed-all  False Pos.  % Spam rej.  % False Pos.
Jan-Jun 2018  0.9   2              924                911                327         37                2           98.6 ± 0.4   0.5
Jul-Dec 2018  0.8   5              664                639                356         27                5           96.2 ± 0.7   1.3
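
This page does not say how the uncertainty was estimated. One plausible approach, sketched below in Python, is the one-sigma binomial standard error of the fraction of spam caught. Treat the formula and the function name as assumptions rather than as Emmesmail's documented method.

```python
import math


def rejection_rate_with_error(caught, tested):
    """Return (percent of spam caught, binomial standard error in percent)."""
    p = caught / tested
    stderr = math.sqrt(p * (1.0 - p) / tested)
    return 100.0 * p, 100.0 * stderr


# (period, Spam Bayes-Caught, Spam Bayes-tested) from the 2018 table.
for period, caught, tested in [("Jan-Jun 2018", 911, 924),
                               ("Jul-Dec 2018", 639, 664)]:
    rate, err = rejection_rate_with_error(caught, tested)
    print(f"{period}: {rate:.1f} ± {err:.1f}")
```

For the two 2018 rows this gives 98.6 ± 0.4 and 96.2 ± 0.7, which agrees with the table, although that agreement alone does not prove this is the formula that was used.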

2019

The data from 2018 support the hypothesis that higher values of PUNK lead to better Bayesian filtering and, unlike the data for 2017, they appear to be statistically significant. For 2019, we decided to set PUNK to 0.95, slightly larger than any value we had tried previously.

In a change from all previous years, starting in mid-2019 any other filtering was applied only after Bayesian filtering, in order to allow better evaluation of the efficiency of the Bayes filter itself. We still used a whitelist, but by applying it after the Bayes filter, we effectively eliminated the "ok-whitelist" category. Any email that was classified as spam by the Bayesian filter but whose sender was on the whitelist ended up being classified as "ok-fp-bayes". We also added a Bayes-failure list to filter on. The sender blacklist we used many years ago was huge (7000+ entries), listed all senders determined to be spammers, and was applied before Bayesian filtering. By contrast, the new list, called bflist, was applied only to emails that passed the Bayesian and director filters, and it contained the addresses of only those senders whose emails had been missed by the Bayesian filter within the previous twelve months. To facilitate generation of this list, we wrote a simple bash script to automate our monthly log backup and rotation.
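
The mid-2019 ordering of the checks can be sketched in Python as follows. This is only an illustration of the order described above: the function name and arguments are hypothetical, the director filter is omitted, and the label "spam-bflist" is borrowed from the column heading of the 2019 results table.

```python
def classify(sender, bayes_says_spam, whitelist, bflist):
    """Return a category label for one email, applying the Bayes verdict first."""
    if bayes_says_spam:
        # The whitelist is consulted only after the Bayesian verdict, so a
        # whitelisted sender turns a spam verdict into a Bayes false positive.
        return "ok-fp-bayes" if sender in whitelist else "spam-bayes"
    # Only emails that passed the Bayesian filter reach the Bayes-failure list.
    if sender in bflist:
        return "spam-bflist"
    return "ok-passed-all"


# Hypothetical addresses for illustration only.
whitelist = {"friend@example.org"}
bflist = {"repeat-offender@example.net"}

print(classify("friend@example.org", True, whitelist, bflist))
print(classify("repeat-offender@example.net", False, whitelist, bflist))
```

Because the whitelist no longer short-circuits the Bayesian test, every whitelisted email still contributes to the measured Bayes statistics.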

We changed our DNS in the second quarter of 2019. This resulted in an immediate 6-fold increase in spam received, which quickly came to dominate our spam corpus and led to a significant drop in Bayes filter efficiency. To prevent this in the future, we modified our program so that, instead of adding each spam caught to the spam corpus as before, we added only 1 out of every X spam emails. X was determined from the average number of spam emails received per day in the previous month and was chosen to cause the spam corpus to turn over about twice per year. To keep the corpora in sync, we similarly changed our procedure so that only 1 out of every Y valid emails was added to the normal corpus, where Y was calculated from the average number of valid emails saved per day in the previous month. Generally, X was about 5 and Y was 1.
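
A minimal Python sketch of how X (or Y) might be derived is shown below. It assumes a fixed-size corpus in which new entries displace the oldest. The corpus size, the rounding rule, and all names here are hypothetical, since this page gives only the goal of about two turnovers per year.

```python
def sampling_interval(avg_per_day, corpus_size, turnovers_per_year=2.0):
    """Return N such that keeping 1 of every N emails replaces a corpus of
    `corpus_size` entries about `turnovers_per_year` times a year.
    The result is never less than 1 (keep every email)."""
    emails_per_turnover = avg_per_day * 365.0 / turnovers_per_year
    return max(1, round(emails_per_turnover / corpus_size))


class OneInN:
    """Counter that answers True for every N-th email it is asked about."""

    def __init__(self, n):
        self.n = n
        self.seen = 0

    def keep(self):
        self.seen += 1
        return self.seen % self.n == 0


# Hypothetical numbers: 20 spam/day and a 730-entry corpus give X = 5.
x = sampling_interval(avg_per_day=20, corpus_size=730)
sampler = OneInN(x)
kept = sum(sampler.keep() for _ in range(10))
print(x, kept)
```

A deterministic counter is used here for simplicity; random sampling with probability 1/N would give the same average rate.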

Eventually our email account on this new server was specifically targeted and we started receiving 25,000 spam emails/day. Our filter, which operates only after downloading emails to a local computer, was never designed to handle that quantity of spam, which has to be handled at the server level. Since the server staff offered no immediate solution, we had no choice but to retire that email address, which had been used for more than 15 years and whose filtering history is documented above. We ceased collecting data on spam-filtering efficiency on November 10 and moved to a new server, using a new email address. Our experience with the new email address will be documented starting 2020.

Efficiency of Bayesian Filter with PUNK = 0.95

Period  Spam Bayes-tested  Spam Bayes-Caught  Spam Directed  Spam-bflist  Spam-missed  Valid W.L.  Valid Passed-all  False Pos.  % Spam rej.  % False Pos.
2019H1  1878               1812               3              10           53           412         24                3           96.5 ± 0.4   0.7
2019H2  1774               1716               0              5            37           197         91                8           96.7 ± 0.4   2.5

2020 - 2021

After switching our DNS and email address, the quantity of spam received decreased dramatically. It is difficult to compare the data from 2020 onward with those from before 2020, as we also changed the spam-filtering formulation. Prior to 2020, emails from senders on our whitelist did not undergo Bayes-filtering. From 2020 onward, every received email that was not part of 2-step authentication or internal testing was tested by the Bayesian filter, and results from the other filters (token-paucity, token-inappropriateness, and the blacklist) were ignored. We retained a sender-whitelist, but it was applied only after the Bayes-filter and was used to find false positives. We also changed the formula for calculating the efficiency of our filtering. Prior to 1 January 2020, it was 100%*((spam-emails caught)/((spam-emails caught) + (spam-emails missed))); from that date on, when virtually[1] all emails were tested, it was 100%*(1 - (false-negatives + false-positives)/(emails-tested)). A perfect result in both instances is 100%. The term false-positives has frequently been defined and is well known. The term false-negatives also is well known in statistics, but with regard to spam-filtering, it has more frequently been referred to as "spam-missed".

[1] 2-factor authentication and internal test emails were the only emails delivered unfiltered.


Efficiency of Bayesian Filter with PUNK = 0.95 (changed to 0.90, 15 June 2021)

Period  Emails-filtered  Spam Bayes-Caught  Spam Directed  Spam-blist  Spam-missed  Valid Passed-all  False Pos.  Filter-Efficiency(%)
2020    979              26                 0              0           25           921               9           96.5 ± 2.6
2021    1257             167                2              0           9            101               80          92.9 ± 1.9
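
The two efficiency formulas given in the text for this period can be written as a short Python sketch. The function names are illustrative. The "± " figures in the table are not reproduced here, because this page does not state how they were calculated.

```python
def spam_rejection_pct(caught, missed):
    """Measure used before 1 January 2020: percent of spam that was caught."""
    return 100.0 * caught / (caught + missed)


def filter_efficiency_pct(false_negatives, false_positives, emails_tested):
    """Measure used from 2020 on: percent of tested emails classified correctly."""
    return 100.0 * (1.0 - (false_negatives + false_positives) / emails_tested)


# (Spam-missed, False Pos., Emails-filtered) from the table above.
print(round(filter_efficiency_pct(25, 9, 979), 1))    # 2020 row
print(round(filter_efficiency_pct(9, 80, 1257), 1))   # 2021 row
```

With the table's counts, the second formula gives 96.5 for 2020 and 92.9 for 2021, matching the Filter-Efficiency column. Unlike the older measure, it penalizes false positives as well as missed spam.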



Emmes Technologies