Emmesmail Development, 2004 to Present

This entry documents Emmesmail's development from 2004 to the present.   Little or no effort has been made to reinterpret older entries in light of our current thinking.   A summary of the work is here.

2004-2014

Use of a whitelist and blacklist to augment the Bayesian filter was introduced in mid-2004. A filter looking at "unrecognized" words in the email was added in 2006. In 2008, we briefly tried using "tokens" made up of two consecutive words in our comparisons, but this did not significantly improve spam rejection. In early 2014, token-paucity filtering was introduced, but only when the user had specified the use of sender filtering. The logic was that many spammers were reducing the number of words in an email to hinder Bayesian filtering, and that while a friend might send you an email containing only a few words, such as "my pics", a legitimate stranger would have to explain, for example, "these are pics I am sending to persuade you to go on a date with me". Emmesmail therefore treats an email containing just "My pics" from someone you don't know as likely spam. Prior to 2011, Emmesmail ran only under Microsoft Windows; since then, it has run only under Linux.
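
As an illustration only (Emmesmail's actual thresholds are not given here, so the numbers below are assumptions), a token-paucity test amounts to flagging a body with too few word tokens for its character count:

    def is_token_paucity_spam(body, min_chars=200, min_tokens_per_100_chars=2.0):
        """Flag an email whose body has too few word tokens for its length.

        The thresholds are illustrative assumptions, not Emmesmail's
        actual parameters.
        """
        if len(body) < min_chars:      # very short bodies: leave to other filters
            return False
        return 100.0 * len(body.split()) / len(body) < min_tokens_per_100_chars

    print(is_token_paucity_spam("x" * 500))   # long body, one token -> True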

2015

In early 2015, we refined our email classification scheme in order to allow better assessment of the efficiency of each filter. Each email was classified according to one of the following:

ok-whitelist (sender is in the whitelist)
ok-passed-all (all filters used thought the email was not spam)
spam-blacklist (sender is in the blacklist)
ok-fp-blacklist (sender appeared to be in blacklist, probably as a wildcard entry, but the email was not spam)
spam-bayes (the Bayesian filter thought the email was spam)
ok-fp-bayes (the Bayesian filter thought the email was spam, but it was not)
spam-token-paucity (too few tokens for the number of characters in email)
ok-fp-token-paucity (the token-paucity filter misdiagnosed this valid email)
spam-unrecognized-words (high fraction of words not previously seen in emails)
ok-fp-unrecognized-words (high fraction of words not previously seen in emails, but not spam)
spam-missed (spam email missed by all the filters)
spam-user-defined (someone in, or someone posing as someone in, the whitelist sent this spam email)

This allowed a more detailed description of the filter results.

In the table below, the filters are listed in the order applied. The number of emails tested by each spam filter goes down sequentially because the filtering is hierarchical: once an email is declared spam by one test, or valid by the whitelist, no further tests are done. The "No. tested" column lists the number of spam emails tested by that filter and is the denominator for the "% Spam rej." entry. The denominator for "% False Pos." is the total number of valid emails received (including the false positives).
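
In outline, the hierarchy stops at the first filter that makes a decision. A minimal sketch of the control flow (the stub tests below are hypothetical stand-ins, not Emmesmail's actual filters):

    def classify(email, is_whitelisted, spam_tests):
        """Hierarchical filtering: return the label of the first test that fires.

        spam_tests is an ordered list of (test, label) pairs, in the same
        order as the filters in the table below.
        """
        if is_whitelisted(email):
            return "ok-whitelist"          # whitelisted: no further tests
        for test, label in spam_tests:
            if test(email):                # declared spam: stop testing
                return label
        return "ok-passed-all"

    # Illustrative usage with stub tests:
    tests = [
        (lambda e: e["sender"] in {"junk@example.com"}, "spam-blacklist"),
        (lambda e: "viagra" in e["body"].lower(),       "spam-bayes"),
    ]
    mail = {"sender": "stranger@example.net", "body": "Cheap viagra!!!"}
    print(classify(mail, lambda e: e["sender"].endswith("@example.org"), tests))
    # -> spam-bayes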

Emmesmail's Spam Rejection Statistics by Filter

Spam Filter Type     | No. tested | No. caught | False Pos. | % Spam rej. | % False Pos.
Sender-filtering     |       2020 |       1404 |          0 |          70 |            0
Bayesian-filtering   |        616 |        520 |          7 |          84 |          1.1
Token-paucity        |         96 |         67 |          9 |          70 |          1.4
Unrecognized-words   |         29 |         10 |          0 |          34 |            0
All filters combined |       2020 |       2001 |         15 |        99.1 |          2.5

Total Valid Emails | Whitelisted | Passed all filters | False Pos. | % False Pos.
               611 |         590 |                  6 |         15 |          2.5


2016

In years prior to 2016, Emmesmail used a whitelist, a blacklist, a Bayesian filter, a token-paucity filter, and an "appropriateness" or "unrecognized words" filter.  Filtering based upon the whitelist and blacklist was referred to as sender-filtering.  In 2016, sender-filtering was separated into filtering against a whitelist and filtering against a blacklist.  Black-sender-filtering was eliminated after it was noted that nearly all emails from senders on the blacklist were caught by the Bayesian filter as well.  White-sender-filtering was continued in order to reduce false positives when people on the whitelist occasionally sent unusual, but nonetheless valid, emails that failed Bayesian filtering.  Token-paucity filtering was also eliminated because it caused too many false positives relative to the number of additional spam emails it caught.

Emmesmail's Spam Rejection Statistics by Filter

Spam Filter Type     | No. tested | No. caught | False Pos. | % Spam rej. | % False Pos.
Bayesian-filtering   |       2264 |       2193 |          7 |          97 |          1.1
Unrecognized-words   |         69 |         39 |          6 |          57 |          0.9
All filters combined |       2264 |       2234 |         13 |        98.7 |          2.0

Total Valid Emails | Whitelisted | Passed all filters | False Pos. | % False Pos.
               647 |         579 |                 55 |         13 |          2.0


2017

The success of eliminating the blacklist encouraged us to try to eliminate the "unrecognized" filter as well. In 2016, the "unrecognized" filter increased our overall spam rejection rate (from 97% to 98.7%) but also increased our false positive rate (from 1.1% to 2.0%).  In 2017, our intention was to experiment with modifying our Bayesian filter to operate without any additional filters (aside from whitelist filtering).  For many years prior to 2017, an unrecognized word in the Bayesian filter was assigned a spam probability of PUNK = 0.5, essentially eliminating it from the calculation of the overall probability of spam.  This was done even though the assumption behind our unrecognized filter was that the higher the fraction of unrecognized words, the higher the likelihood that an email is spam.  Our plan for 2017, after eliminating the unrecognized filter, was to slowly increase the value of PUNK and examine the effect this had on the spam rejection and false positive rates.  While the results are not conclusive, plots of the spam rejection and false positive rates versus PUNK suggest that higher values of PUNK are well correlated with a higher spam rejection rate and much less well correlated with the false positive rate. We will attempt to confirm this in 2018.
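
This entry does not give Emmesmail's exact combining formula; the Graham-style naive-Bayes combination below is a common choice and is shown only to illustrate how PUNK enters the calculation:

    import math

    def spam_probability(tokens, p_spam_given_token, punk=0.5):
        """Combine per-token spam probabilities into one score, in log space.

        Tokens never seen before are assigned the probability `punk`.  At
        punk = 0.5 an unknown token contributes equally to the spam and
        ham products and so cancels out; larger values push emails with
        many unknown tokens toward spam.
        """
        log_spam = log_ham = 0.0
        for t in tokens:
            p = p_spam_given_token.get(t, punk)
            p = min(max(p, 0.01), 0.99)        # clamp away from 0 and 1
            log_spam += math.log(p)
            log_ham += math.log(1.0 - p)
        # P(spam) = prod(p) / (prod(p) + prod(1 - p))
        return 1.0 / (1.0 + math.exp(log_ham - log_spam))

    probs = {"meeting": 0.05}                  # one known, two unknown tokens
    print(spam_probability(["meeting", "zzyzx", "qwerty"], probs, punk=0.5))  # 0.05
    print(spam_probability(["meeting", "zzyzx", "qwerty"], probs, punk=0.9))  # 0.81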

Efficiency of Bayesian Filter with Different Values of PUNK

Period       | PUNK | Spam Received | Spam Caught | Valid W.L. | Valid Passed-all | False Pos. | % Spam rej. | % False Pos.
All 2016     |  0.5 |          2264 |        2193 |        579 |               55 |          7 |        97.0 |          1.1
Jan-Mar 2017 |  0.6 |           576 |         558 |        222 |               20 |          1 |        96.9 |          0.5
Apr-Jun 2017 |  0.7 |           523 |         514 |        245 |               25 |          6 |        98.3 |          2.2
Jul-Sep 2017 |  0.8 |           503 |         490 |        208 |               59 |          3 |        97.4 |          1.1
Oct-Dec 2017 |  0.9 |           573 |         569 |        138 |               42 |          1 |        99.3 |          0.6

2018

By the end of 2017 it became apparent that our Bayesian filter had trouble distinguishing between required bank emails containing a one-use security code and the irritating ones, from the same bank email address, announcing that a requested transaction had been processed.  Thus, in 2018, we used our director file (which determines, based upon the sender, recipient, and tokens in the subject, to which account an incoming email should be directed) to redirect the security-code emails to a totally separate account.  As a result, virtually no messages from our bank end up in our normal email corpus; almost all are directed by the Bayesian filter to the spam corpus. Another thing learned in 2017 is that our filtering lets through so few spam emails that a test period of just three months does not contain sufficient data to discern small differences in performance.  Accordingly, in 2018 we extended the test period for each value of PUNK from three to six months and began calculating an estimate of the uncertainty in our measurement.
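
The entry does not say how the uncertainty was estimated; a simple binomial standard error reproduces the ± values in the table below, so we assume something like the following was used:

    import math

    def rejection_rate(n_caught, n_tested):
        """Spam rejection rate and its binomial standard error, in percent.

        sigma = sqrt(p * (1 - p) / n); an assumption on our part, but it
        reproduces the tabulated ± values to the digits shown.
        """
        p = n_caught / n_tested
        return 100.0 * p, 100.0 * math.sqrt(p * (1.0 - p) / n_tested)

    rate, err = rejection_rate(911, 924)    # Jan-Jun 2018 row
    print(f"{rate:.1f} ± {err:.1f}")        # -> 98.6 ± 0.4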

Efficiency of Bayesian Filter with Different Values of PUNK

Period       | PUNK | Spam Directed | Spam Bayes-tested | Spam Bayes-Caught | Valid W.L. | Valid Passed-all | False Pos. | % Spam rej. | % False Pos.
Jan-Jun 2018 |  0.9 |             2 |               924 |               911 |        327 |               37 |          2 |  98.6 ± 0.4 |          0.5
Jul-Dec 2018 |  0.8 |             5 |               664 |               639 |        356 |               27 |          5 |  96.2 ± 0.7 |          1.3

2019

The data from 2018 support the hypothesis that higher values of PUNK lead to better Bayesian filtering, and unlike the data for 2017, the differences appear to be statistically significant. We were concerned that the significant development work done on our email program during the latter half of 2018 may have accidentally introduced a systematic error, particularly since the 2018 results are worse than those in 2017. Nevertheless, for 2019 we decided to set PUNK to 0.95, slightly larger than any value we had tried previously.

In 2018, we began to use our director file not only to determine to which account to attempt to deliver an incoming email, but also to determine, from the same information, whether the incoming email should already be classified as spam.   The director file, therefore, acted as a very efficient filter.   Since the director file information needs to be evaluated early in the email-evaluation process, we applied the director file filtering second only to whitelist filtering within our hierarchical filter.   It is clear that for 2019 we can get the same overall efficiency, and also a better evaluation of the efficiency of the Bayesian filter, by placing the director filtering after the whitelist and Bayesian filtering.  In 2019, we added what we call Bayes-failure filtering.   Unlike the sender blacklist filtering we used many years ago, which had a huge (7000+ entries) blacklist of all senders determined to be spammers and which was applied before Bayesian filtering, this blacklist, called bflist, is applied only to emails that pass the Bayesian and director filters, and it will consist of only those senders whose emails were missed by the Bayesian filter within the previous twelve months.   To facilitate generation of this list, we wrote a simple bash script to automate our monthly log backup and rotation.
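
A sketch of how such a list might be assembled from the monthly logs (the log naming and format here are assumptions; the entry does not describe them):

    import glob

    def build_bflist(pattern="logs/bayes-missed-*.txt"):
        """Collect senders of Bayes-missed spam from the last 12 monthly logs.

        Assumes one sender address per line and one log file per month,
        named so that lexicographic order is chronological order.
        """
        senders = set()
        for path in sorted(glob.glob(pattern))[-12:]:   # newest twelve months
            with open(path) as f:
                senders.update(line.strip().lower() for line in f if line.strip())
        return senders

    # Applied only to emails that have already passed the Bayesian and
    # director filters:
    #     if email.sender.lower() in bflist:  ->  spam (bflist)
    bflist = build_bflist()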

We changed our DNS in 2019. This resulted in a 6-fold increase in spam received, which quickly came to dominate our spam corpus, leading to a significant drop in Bayes filter efficiency.  To prevent this in the future, we modified our program so that instead of adding every spam email caught to the spam corpus, as done previously, we added only 1 out of every X spam emails, where X, determined from the average number of spam emails received per day during the previous month, was chosen to cause the spam corpus to turn over about twice per year.   To keep the corpora in sync, we similarly changed our procedure so that only 1 out of every Y emails was added to the normal corpus, where Y was calculated from the average number of valid emails saved per day during the previous month.   Generally, X was about 5 and Y was 1.
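
As a worked example (the corpus size is our assumption; the entry does not give one), the rule for X amounts to:

    def sampling_interval(avg_per_day, corpus_size, turnovers_per_year=2.0):
        """Keep 1 of every X new emails so the corpus turns over ~twice a year.

        corpus_size * turnovers_per_year emails must be added per year, out
        of avg_per_day * 365 received, so X is the ratio of the two.
        """
        x = (avg_per_day * 365.0) / (corpus_size * turnovers_per_year)
        return max(1, round(x))

    # About 10 spam per day (roughly the 2019 rate) with an assumed spam
    # corpus of ~365 messages gives X = 5, matching "X was about 5":
    print(sampling_interval(10, 365))   # -> 5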

Efficiency of Bayesian Filter with PUNK = 0.95

Period | Spam Bayes-tested | Spam Bayes-Caught | Spam Directed | Spam-bflist | Spam-missed | Valid W.L. | Valid Passed-all | False Pos. | % Spam rej. | % False Pos.
2019H1 |              1878 |              1812 |             3 |          10 |          53 |        412 |               24 |          3 |  96.5 ± 0.4 |          0.7
2019H2 |              1542 |              1495 |             0 |           5 |          37 |        197 |               91 |          6 |  96.7 ± 0.4 |          2.0

2020

Prior to 2020, except for a brief testing period in 2019, emails from whitelisted senders did not undergo any further testing and were classified as "ok-whitelist". In 2020, we will instead subject all received emails to all the tests. This does away with the ok-whitelist category and should allow better characterization of the Bayes filter itself.   The percentages given in the table are currently placeholders to illustrate the format; they will be replaced with proper values once data are available.

Efficiency of Bayesian Filter with PUNK = 0.95

Period | Spam Received | Caught-Bayes | Caught-Token-Pauc. | Caught-Bflist | Caught-Director | Spam-missed | Valid Passed-all | False-Pos. Bayes | False-Pos. Token | % Spam rej. | % False Pos.
2020H1 |             0 |            0 |                  0 |             0 |               0 |           0 |                0 |                0 |                0 |  96.5 ± 0.4 |          0.7
2020H2 |             0 |            0 |                  0 |             0 |               0 |           0 |                0 |                0 |                0 |  96.6 ± 0.5 |          0.4



Emmes Technologies
Updated 2 Nov, 2019