A computer is good at mathematics, but it understands little of what an
incoming email actually says. You (the reader) easily understand the contents
of an email from a friend. You will just as easily recognize an incoming
message as spam, and might delete it since it is of no value to you.
The principle is: you read the message, and then
decide whether it is a spam email or a good email (aka ham).
A computer program that handles email has to rely on statistics to decide whether an email is likely to be spam or ham. Since a filtering program is based on word statistics (ham words and spam words), it can compute the likelihood that a mail is spam or ham, but it will not be 100 % correct all the time. An email might be a false positive, for example: a ham email wrongly marked as spam.
To make a spam filter program, you need to know the probabilities of spam words and ham words. A filter program must therefore be "trained" to separate good words from bad ones in some way. Most of today's spam filter programs do this by using Bayesian filtering, but several variants have evolved from this theory that seem to work more accurately.
For example, the word "buy" (viagra, medicine, holiday etc.) might be found in 30 ham emails and in 66 spam emails, occurring 40 and 93 times in total in the ham and spam categories, respectively. In this example, an email containing the word "buy" is therefore probably a spam mail.
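As a rough illustration of how such counts can be turned into a per-word spam value, the sketch below divides the word's spam frequency by the sum of its spam and ham frequencies. The function name `word_spam_value` and the corpus size of 500 emails each are assumptions made for this illustration; only the counts 30 and 66 come from the example above.

```python
def word_spam_value(spam_emails, ham_emails, n_spam, n_ham):
    """Estimate how spam-like a word is from the number of emails containing it.

    spam_emails / ham_emails: spam / ham emails that contain the word.
    n_spam / n_ham: total emails in each corpus (assumed values).
    """
    spam_freq = spam_emails / n_spam
    ham_freq = ham_emails / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # assumption: a word never seen gives no evidence either way
    return spam_freq / (spam_freq + ham_freq)


# "buy": found in 66 spam emails and 30 ham emails,
# assuming both corpora hold 500 emails
buy_value = word_spam_value(66, 30, 500, 500)
print(round(buy_value, 4))  # prints 0.6875
```

A value above 0.5 means the word leans towards spam, which matches the conclusion drawn for "buy".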
The filtering program should also consider the email sender. You can put your friends in a friends list (whitelist). If the sender is in your whitelist, the program doesn't have to compute any spam probabilities at all. It just lets the email pass, because it's from a friend.
On the other hand, if the sender is unknown, the spam filter program has to examine every word in the message to determine the final spam probability.
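The sender check could be sketched like this. The function name `needs_word_analysis` and the addresses are hypothetical, and comparing addresses in lowercase is my own assumption.

```python
def needs_word_analysis(sender, whitelist):
    """Return False when the sender is a known friend, True otherwise."""
    # Assumption: addresses are compared case-insensitively
    return sender.strip().lower() not in whitelist


# Hypothetical addresses, for illustration only
whitelist = {"alice@example.com", "bob@example.com"}

for sender in ["Alice@example.com", "stranger@example.net"]:
    if needs_word_analysis(sender, whitelist):
        print(sender, "-> unknown sender, examine every word")
    else:
        print(sender, "-> whitelisted, let it pass")
```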
One of the statistical means available is Bayes' theorem, which is the foundation of Bayesian filtering. This theorem states that
$$P(B_i \mid A) = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{j=1}^{k} P(A \mid B_j)\,P(B_j)}$$
(B1, B2, ..., Bk form a partition of a set S, and A is any event in S)
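For a spam filter, the partition can be taken as just two sets, spam and ham, with A being the event that an email contains a given word. Writing these as S, H and W (symbols introduced here for illustration only, with S now meaning spam rather than the whole set), the theorem reduces to:

$$P(S \mid W) = \frac{P(W \mid S)\,P(S)}{P(W \mid S)\,P(S) + P(W \mid H)\,P(H)}$$

If spam and ham are assumed equally likely beforehand, so that P(S) = P(H), the priors cancel and only the two word frequencies remain.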
I'll adapt and apply this theorem below to compute the spam probability for a whole email.
At this stage it's obvious that you will need a database of words (tokens), together with their spam and ham probabilities, in order to compute an email's spam score. A spam score is nothing other than the probability that an email is spam, given that it includes spam words.
Make a word database and populate it with words from spam and ham emails. This database has at least three fields: a Word text field, a HamWord number field and a SpamWord number field. You will need at least one spam corpus and one ham corpus. A corpus is a set of emails, for example a folder in Outlook (Express). About 500 emails or so should do fine.
Now, start with one of the corpora, for example the ham corpus, taking one email at a time. First, however, you have to define a set of word delimiters. These are Delim in {'&', '/', '(', ')', ',', 'Space'}. Split each ham email into words at these delimiters, and add each ham word to your database's HamWord number field.
Do the same with the spam corpus, adding each spam word to your database's SpamWord number field.
Your database should now contain several thousand records.
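The training pass might be sketched as follows, with a dictionary standing in for the database. The delimiters and the field names HamWord and SpamWord are taken from the text. The helper names `tokenize` and `train`, the lowercasing of words and the tiny corpora are assumptions for illustration.

```python
import re
from collections import defaultdict

# Delimiters listed in the text: & / ( ) , and space
DELIMITERS = r"[&/(), ]+"


def tokenize(text):
    """Split text at the delimiters and lowercase each word (assumption)."""
    return [w.lower() for w in re.split(DELIMITERS, text) if w]


def train(database, emails, field):
    """Add every word of every email to the given count field."""
    for email in emails:
        for line in email.splitlines():
            for word in tokenize(line):
                database[word][field] += 1
    return database


# Each record holds the word's HamWord and SpamWord counts
database = defaultdict(lambda: {"HamWord": 0, "SpamWord": 0})

# Tiny made-up corpora, for illustration only
ham_corpus = ["Lunch on Friday?", "Did you buy the tickets (for Friday)"]
spam_corpus = ["Buy now", "Buy cheap medicine, buy now"]

train(database, ham_corpus, "HamWord")
train(database, spam_corpus, "SpamWord")

print(database["buy"])
```

Note that punctuation outside the delimiter set stays attached to the word, so "Friday?" and "Friday" end up as two different records here.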
For computing the spam score, we will adapt Bayes' theorem in this way:
| For one word (its spam value is element a1): | P(spam) = a1 / (a1 + (1-a1)) |
| For two words with corresponding spam values a1 and a2: | P(spam) = (a1*a2) / (a1*a2 + (1-a1)*(1-a2)) |
| and so on for n words: | P(spam) = (a1*a2*...*an) / (a1*a2*...*an + (1-a1)*(1-a2)*...*(1-an)) |
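The same pattern for any number of words can be written as a small function. This is a sketch only: the name `combine`, the example values 0.7 and 0.8, and the neutral 0.5 returned when the values contradict each other are my assumptions.

```python
from math import prod


def combine(spam_values):
    """Combine per-word spam values a1..an into one probability."""
    numerator = prod(spam_values)
    complement = prod(1 - a for a in spam_values)
    if numerator + complement == 0:
        # A value of 0 together with a value of 1: assumed neutral
        return 0.5
    return numerator / (numerator + complement)


# Made-up spam values for one and for two words
print(combine([0.7]))
print(combine([0.7, 0.8]))
```

With two spam-leaning words the combined value is higher than either word's value alone, which is why each additional spam word pushes the score up.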
Once you have your database with tens of thousands of words in it, together with each word's ham and spam occurrences, you can start using it to compute an email's spam score. The mathematics of a spam score is something like this:
9. End of loop on I
10. If there are more lines in the email, goto 4)
11. If there are more emails to process, goto 2)
12. Compute the spam score from the formula P = Numerator / (Numerator + L)
13. End
P is now the spam score.
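A minimal sketch of how such a scoring loop might look in Python, under several assumptions: the word database is a plain dictionary of (ham occurrences, spam occurrences) pairs, each word's spam value is its spam count divided by its total count, unknown words are skipped, and values are clamped so that no single word forces the result to 0 or 1. The names `spam_score` and `WORDS` are hypothetical, and every count except those for "buy" is made up. `numerator` and `complement` play the roles of Numerator and L in the formula P = Numerator / (Numerator + L).

```python
import re

# Hypothetical word database: word -> (ham occurrences, spam occurrences).
# The counts for "buy" come from the earlier example; the rest are made up.
WORDS = {
    "buy": (40, 93),
    "now": (25, 60),
    "viagra": (1, 120),
    "lunch": (50, 2),
}


def spam_score(text, words=WORDS):
    """Return P = Numerator / (Numerator + L) over the known words in text."""
    numerator = 1.0   # running product of the spam values a_i
    complement = 1.0  # running product of (1 - a_i), called L in the text
    for line in text.splitlines():
        for token in re.split(r"[&/(), ]+", line.lower()):
            if token not in words:
                continue  # assumption: unknown words carry no evidence
            ham, spam = words[token]
            a = spam / (ham + spam)
            a = min(max(a, 0.01), 0.99)  # assumption: clamp away from 0 and 1
            numerator *= a
            complement *= 1.0 - a
    return numerator / (numerator + complement)


for phrase in ["Buy", "Buy now", "Buy viagra now"]:
    print(phrase, round(spam_score(phrase), 3))
```

An email with no known words gets a score of 0.5 in this sketch, since both products stay at 1.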
This system is implemented on this web server; go to the spam computing page. The spam computing web page runs on a standalone server, which might occasionally be down. (You might drop a notification to tore [at] aasli.com if the server is down.)
To test for spam, try these words and notice how the Bayesian spam filter reacts:
Test 1: Buy
Test 2: Buy viagra
Test 3: Buy now
Test 4: Buy now!
Test 5: Buy viagra now!
etc., and watch how the filter returns an increasing spam score! You might also test some of your ham and spam emails if you like.
©Tore Aasli 2006
11 December, 2006