Return Path’s Guide to Spam Filters – An Excerpt
“People spend entire lifetimes trying to avoid the things that have already happened.”
There are countless articles on the web entitled “How to Avoid Spam Filters.” A quick Google search results in at least ten pages of links to editorials and “how-to” guides meant to steer email marketers away from the spam folder and into the inbox with advice like, “Don’t use cheesy subject lines.” Many of these articles accuse overzealous spam filters of tossing perfectly legitimate marketing mail in the spam folder.
The problem with most of these articles is that their advice is based on a false premise: that it’s possible to avoid a spam filter. Spam filters are part of the process. If you send email, it will be filtered: to the inbox, a category tab, or the spam folder, or it will be blocked completely. Filter technology plays a massive role in the success of your email campaigns. This is why at Return Path, we encourage our clients to embrace spam filters, learn how they work, and understand how the Mailbox Providers (MBPs) use them.
We recently published a guide for Return Path clients aimed at helping them better understand the filter technology that decides the destination of email: how it works, who uses it, and what you can do to work with the filter. What do we mean by “work with the filter”? Well, if you know your mail will be subjected to analyses (both machine and human), you can use best practices and specific knowledge to maximize your chance of being delivered to the inbox, and to turn things around quickly when you’re not!
Here is a small excerpt from the 13-page spam filter guide:
What is a Filter?
In principle, email filtering is organizing email according to specified criteria. Originally, filters were designed primarily to identify spam and block it or place it in the spam folder. Today, some MBPs also use email filters to sort messages into categories (like social media and newsletters) for inbox-organization purposes.
MBPs have strong motivations to use spam filters, whether they build their own system or leverage third-party spam filter technology. Spam is annoying, no doubt, but it can also be dangerous. Malware and phishing are hugely profitable for scammers and can be equally costly for mailbox providers who face intense market competition.
Practically speaking, spam filters drastically reduce the load on server resources as well. Considering that 95% of all mail sent globally is spam, the sooner the mail can be filtered, the less processing and storage is required.
Third-party spam filter technologies like Cloudmark, Vade Retro, and Symantec are relatively easy for an MBP to use and configure according to its needs. They also provide service models that ensure a postmaster has support when needed.
Homebuilt systems require planning, engineering, and programming expertise, and may be limited in terms of global data sharing. Large providers like Gmail and Yahoo have their own systems in place because they can afford the resources necessary for building and maintaining them. Smaller regional providers are more likely to use external technology, which they can configure to enforce their own unique policies.
At the end of the day, filtering mail is about two things for MBPs: protecting their customers and keeping them.
The Basics of Filter Technology
Although there are more than a few ways to develop a spam filter, the three methodologies most often employed are algorithms, heuristics, and the more advanced form of heuristics known as Bayesian.
Algorithms in this context are rules that tell a program what to do. Heuristic filtering, which amounts to educated guessing, works by checking the message envelope, header, and content against thousands of predefined rules (algorithms). Each rule that matches contributes a numerical score, and the combined score expresses how likely the message is to be spam.
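The rule-and-score idea can be sketched in a few lines of Python. This is a minimal illustration only: the four rules, their weights, the message fields, and the threshold of 5.0 are all hypothetical and are not taken from any real filter, which would use thousands of rules.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    score: float                       # weight added when the rule matches
    matches: Callable[[dict], bool]    # predicate over a message dict


# Hypothetical rules covering envelope, header, and content.
RULES = [
    Rule("subject_all_caps", 1.5, lambda m: m["subject"].isupper()),
    Rule("many_exclamations", 1.0, lambda m: m["subject"].count("!") >= 3),
    Rule(
        "envelope_header_mismatch",
        2.0,
        lambda m: m["envelope_from"].split("@")[-1] != m["header_from"].split("@")[-1],
    ),
    Rule(
        "spammy_phrase",
        2.5,
        lambda m: re.search(r"\bfree money\b", m["body"], re.IGNORECASE) is not None,
    ),
]


def score_message(message: dict, rules=RULES):
    """Return the total score and the names of the rules that fired."""
    hits = [rule for rule in rules if rule.matches(message)]
    return sum(rule.score for rule in hits), [rule.name for rule in hits]


def is_spam(message: dict, threshold: float = 5.0) -> bool:
    """Flag the message when its total score reaches the (assumed) threshold."""
    total, _ = score_message(message)
    return total >= threshold
```

A message that trips several rules at once crosses the threshold, while a plain newsletter with matching domains and a calm subject line scores zero. No single rule decides the outcome on its own.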
An equation might look like this:
$$\Pr(S \mid W) = \frac{\Pr(W \mid S)\,\Pr(S)}{\Pr(W \mid S)\,\Pr(S) + \Pr(W \mid H)\,\Pr(H)}$$
Pr(S|W) is the probability that a message is spam, knowing that the word “Viagra” is in it.
Pr(S) is the overall probability that any given message is spam.
Pr(W|S) is the probability that the word “Viagra” appears in spam messages.
Pr(H) is the overall probability that any given message is not spam (is “ham”).
Pr(W|H) is the probability that the word “Viagra” appears in ham messages.
You don’t need to be a statistician or mathematician to understand how spam filters work. The point here is that it’s about logic and probability, and most importantly, it’s not personal. If a filter decides your mail is spam, its conclusion is based on thousands of such predefined rules.
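To make the equation concrete, here is a small worked sketch in Python. The function name and the input probabilities are illustrative assumptions, not measured values: suppose the word appears in 20% of spam, in 0.1% of ham, and half of all mail is spam.

```python
def spam_probability_given_word(p_word_in_spam: float,
                                p_word_in_ham: float,
                                p_spam: float) -> float:
    """Compute Pr(S|W) from Pr(W|S), Pr(W|H), and Pr(S) using Bayes' rule."""
    p_ham = 1.0 - p_spam                            # Pr(H)
    numerator = p_word_in_spam * p_spam             # Pr(W|S) * Pr(S)
    denominator = numerator + p_word_in_ham * p_ham
    if denominator == 0:
        raise ValueError("the word never appears, so Pr(S|W) is undefined")
    return numerator / denominator


# Assumed inputs: Pr(W|S) = 0.20, Pr(W|H) = 0.001, Pr(S) = 0.5
result = spam_probability_given_word(0.20, 0.001, 0.5)
print(f"Pr(S|W) = {result:.3f}")
```

With these made-up numbers the result is above 0.99. A word that is common in spam and rare in ham pushes the probability close to 1.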
The most advanced form of heuristics is Bayesian filtering. Bayesian filtering differs from simple heuristics in that it can learn. Bayesian filters are capable of comparing two sets of information (for example, mail already confirmed as spam and mail confirmed as ham) and acting on the result. This is in direct contrast to the vast majority of other heuristic filters, which use pre-built rules to decide which email is spam and which is not.
Postmasters at MBPs will often say they are “training the machine.” Postmasters can manually review the decisions their Bayesian filters made, effectively telling the machine “yes, that was the right decision” or “no, this was a false positive.”
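“Training the machine” can be illustrated with a toy Bayesian filter in Python. This is a sketch under stated assumptions, not how any particular MBP implements it: the class name `TinyBayesFilter`, the whitespace tokenizer, and the add-one smoothing are all choices made for the example.

```python
import math
from collections import Counter


class TinyBayesFilter:
    """Toy naive Bayes filter that learns from labeled messages."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text: str) -> set:
        # Assumption: lowercase whitespace tokens, each counted once per message.
        return set(text.lower().split())

    def train(self, text: str, label: str) -> None:
        """Record one message whose label ("spam" or "ham") has been confirmed."""
        self.message_counts[label] += 1
        self.word_counts[label].update(self.tokenize(text))

    def spam_probability(self, text: str) -> float:
        """Estimate Pr(spam | words) using add-one smoothing and log odds."""
        n_spam = self.message_counts["spam"]
        n_ham = self.message_counts["ham"]
        if n_spam + n_ham == 0:
            return 0.5  # nothing learned yet
        log_odds = math.log((n_spam + 1) / (n_ham + 1))
        for word in self.tokenize(text):
            p_word_spam = (self.word_counts["spam"][word] + 1) / (n_spam + 2)
            p_word_ham = (self.word_counts["ham"][word] + 1) / (n_ham + 2)
            log_odds += math.log(p_word_spam / p_word_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))


bayes = TinyBayesFilter()
bayes.train("cheap pills online", "spam")
bayes.train("cheap watches online", "spam")
bayes.train("meeting agenda attached", "ham")
bayes.train("lunch meeting tomorrow", "ham")

before = bayes.spam_probability("cheap flights")
# A postmaster reviews the decision and reports a false positive.
bayes.train("cheap flights", "ham")
after = bayes.spam_probability("cheap flights")
```

After the correction, `after` is lower than `before`. The filter has adjusted its view of the word “cheap” without anyone rewriting a rule.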
The Main Types of Analysis
Mailbox Providers and the filters they use look at three main aspects of mail: the reputation of the sender, the source of the mail (the IP and/or sending network), and the content of the mail itself. Most senders know that their mail will be scrutinized for “spammy” content, but you may not be aware of just how much of your content is examined.
One well-known method of analyzing content is called “fingerprinting.” Technology providers like Cloudmark are known for creating fingerprints of email content. Fingerprinting in and of itself is not a filter, but it is a technology that helps filters make decisions about mail. Fingerprints are hashes or checksums of content.
These hashes are many times smaller than the content they are generated from (64 bytes), which makes them easier to store. Once the fingerprints are created and stored, they can be compared to other fingerprints, and the result of the comparison helps filters decide whether mail is spam or not. Filters score the similarity of fingerprints: if your fingerprint is more similar to a fingerprint of mail that has been confirmed as spam than to a fingerprint of mail that has been confirmed as ham, your mail will likely be flagged as unwanted.
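The store-and-compare idea can be sketched in Python. This is not Cloudmark's algorithm, which is not described here. As an assumption for illustration, the sketch hashes overlapping three-word runs of the text and keeps the eight smallest 64-bit hash values, which happens to give a 64-byte fingerprint that can still be compared for similarity.

```python
import hashlib


def shingles(text: str, size: int = 3) -> set:
    """Split text into overlapping runs of `size` lowercase words."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def fingerprint(text: str, k: int = 8) -> tuple:
    """Keep the k smallest 64-bit shingle hashes (8 x 8 bytes = 64 bytes)."""
    hashes = {
        int.from_bytes(hashlib.sha256(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles(text)
    }
    return tuple(sorted(hashes)[:k])


def similarity(fp_a: tuple, fp_b: tuple) -> float:
    """Estimate content overlap (0.0 to 1.0) from two fingerprints."""
    k = min(len(fp_a), len(fp_b))
    smallest_of_union = sorted(set(fp_a) | set(fp_b))[:k]
    shared = set(fp_a) & set(fp_b)
    return sum(1 for h in smallest_of_union if h in shared) / k


def looks_like_spam(fp: tuple, spam_fps: list, ham_fps: list) -> bool:
    """True when fp is closer to a confirmed-spam than a confirmed-ham fingerprint."""
    best_spam = max((similarity(fp, s) for s in spam_fps), default=0.0)
    best_ham = max((similarity(fp, h) for h in ham_fps), default=0.0)
    return best_spam > best_ham


known_spam = fingerprint("limited offer buy cheap watches online today and save big money now")
known_ham = fingerprint("please find attached the agenda for our quarterly planning meeting next tuesday morning")
incoming = fingerprint("limited offer buy cheap watches online today and save big money fast")
print(looks_like_spam(incoming, [known_spam], [known_ham]))
```

Changing one word of the spam message still leaves most of the fingerprint intact, which is why lightly edited copies of a known spam campaign remain recognizable.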
Content analysis technology scans everything in a message: the header, footer, code, HTML markup, images, text color, timestamp, URLs, subject line, text-to-image ratio, language, attachments, and more. Some content filters ignore no part of the message. Other content filters may look only at the structure of an email, or they might simply parse URLs out of the message and then check them against a blacklist (like SURBL).
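The URL-parsing approach can be sketched with Python's standard library. Real URI blacklists such as SURBL are typically queried over the network. To keep the example self-contained, the lookup here is a hypothetical in-memory set, and the listed domains and function names are made up for illustration.

```python
import re
from urllib.parse import urlparse

# Simple pattern for http(s) URLs in text or HTML; stops at whitespace, quotes, and angle brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)

# Hypothetical local stand-in for a blacklist lookup.
BLOCKED_DOMAINS = {"bad-pills.example", "phish.example"}


def extract_hosts(body: str) -> set:
    """Return the set of hostnames found in URLs within the message body."""
    hosts = set()
    for url in URL_PATTERN.findall(body):
        host = urlparse(url).hostname
        if host:
            hosts.add(host)
    return hosts


def listed_hosts(body: str, blocklist=BLOCKED_DOMAINS) -> set:
    """Return hosts that equal, or are subdomains of, a listed domain."""
    return {
        host
        for host in extract_hosts(body)
        if any(host == domain or host.endswith("." + domain) for domain in blocklist)
    }


body = 'Visit <a href="http://shop.bad-pills.example/buy?id=1">here</a> or https://news.example/today'
print(listed_hosts(body))
```

Only the link under the listed domain is reported; the other link passes. A filter built this way never looks at the wording of the message at all, only at where its links point.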
The second most commonly analyzed aspect of email is the reputation of the IP address and/or domain being used by the sender. This is sometimes called “sender reputation.” Using the algorithms and heuristics that we previously discussed, reputation-based filters leverage millions of data points and hundreds of parameters to generate a reputation score that may range from 0 to 100 (Return Path’s Sender Score) or from -10 to +10 (IronPort’s SenderBase).
Some of the parameters leveraged to generate a reputation score are:
- Spam-Trap Data
- Message Composition Data
- Volume Data
It’s important to note that the filter provider does not block your mail or decide where it goes (inbox or bulk). Those decisions are always made by the MBP. Reputation-based filters can automatically apply the MBP’s mail flow policies based on the reputation score of the sender. As the filter receives inbound mail, a threat assessment of the sender is performed. This assessment returns a reputation score which is linked to mail flow policies specified by the administrator.
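The link between a reputation score and the MBP's mail flow policies might look like the following Python sketch. The tiers, thresholds, and action names are hypothetical, chosen by an imaginary administrator; the 0 to 100 scale mirrors the range mentioned earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyTier:
    min_score: int   # lowest reputation score this tier applies to
    action: str      # what the MBP has chosen to do with mail in this tier


# Hypothetical policy configured by the MBP's administrator.
MAIL_FLOW_POLICY = [
    PolicyTier(80, "accept"),
    PolicyTier(50, "throttle"),
    PolicyTier(0, "reject"),
]


def action_for(score: int, policy=MAIL_FLOW_POLICY) -> str:
    """Return the action of the highest tier whose minimum the score meets."""
    if not 0 <= score <= 100:
        raise ValueError("reputation score must be between 0 and 100")
    for tier in sorted(policy, key=lambda t: t.min_score, reverse=True):
        if score >= tier.min_score:
            return tier.action
    raise ValueError("policy has no tier covering this score")


for sender_score in (95, 65, 10):
    print(sender_score, action_for(sender_score))
```

The filter only supplies the score. Which action each score range triggers is entirely the MBP's decision, so the same sender can be treated differently by two providers using the same filter.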
There is a difference between filtered mail and blocked mail. Filtered mail is mail that is accepted by the MBP’s network and then delivered to a mailbox folder (inbox, spam or other). Blocked mail is mail that is not accepted and thus is never seen by the MBP’s filter technology.
Spammers have ways of gaming reputation systems by using multiple IPs. All MBPs therefore rely on both content analysis and reputation. Complaints are a contributing factor to your reputation, and complaints are minimized by keeping your content (and frequency) relevant to the individual user. If you have a reputation issue, the first thing you should do is reassess your content and look at engagement metrics.
If you would like to learn more about filters, ask your Return Path Account Manager for our guide entitled A Marketer’s Guide to Spam Filters. The guide covers all of the well-known spam filters, from Abaca to Vade Retro, as well as blacklists. It also provides region-specific advice for Europe, Australia, Asia, South America, and the United States.
About Dana Huten
Dana Huten is an Anti-Spam and Security Consultant in the Email Intelligence Team of Return Path. With 15 years in the European telecommunications and internet industry, Dana has a broad spectrum of knowledge. She has experience working in mobile, internet and email marketing as well as in IT and internet security.