Big Data and Email – What Would Sherlock Do?
“The world is full of obvious things which nobody by any chance ever observes.”
― Sherlock Holmes, The Hound of the Baskervilles
Sherlock Holmes was a keen observer and brilliant at spotting patterns. Not everyone is lucky enough to possess such skill. And yet, throughout all of his cases there were recurring themes and ways of perceiving data that helped even his companion, Dr. Watson, learn to connect the dots and develop a coherent strategy for mapping the landscape of a situation. You always need to know where you’re standing and have an orientation before mapping your surroundings.
The same holds true for troubleshooting, intelligence gathering, email optimization and product development. Asking the right questions and identifying blind spots (what you don’t know, i.e. the data that you’re missing) is the first, most crucial step. That’s easier said than done, which is why, despite the massive opportunity that Big Data provides for the email industry at large, making Big Data work for you can be a challenge.
A couple of years ago I undertook a personal project: to read every Sherlock Holmes story in one large volume, from the very first story published by Sir Arthur Conan Doyle in 1887, A Study in Scarlet, to the last, The Adventure of Shoscombe Old Place, published in 1927. Like many of you, I had read some Sherlock Holmes in my youth, just a story here and there, but suddenly I decided that I wanted to get to know Sherlock as Watson did, from the very first, uninformed impression. At the time, I couldn’t predict just how well my new private venture would complement my understanding of the challenges of using Big Data and the potential impact it has on email.
Before I go any further, let’s define Big Data. What is it? Most simply put, it’s “Emerging technologies and practices that enable the collection, processing, discovery, and storage of large volumes of structured and unstructured data quickly and cost-effectively.” Sherlock would be beside himself with joy.
I’ve found parallels in Sherlock Holmes that fit Big Data in one way or another. Some are rather striking. So let’s get to it. Here are some quotes from Sherlock that reflect the best of Big Data philosophy and methodology:
"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay!" – Sherlock Holmes, The Adventure of the Copper Beeches.
This is a problem we no longer have. Organizations have had a plethora of data for years now. The challenge hasn’t been getting the data but rather finding ways to collect, manage and store it efficiently. When your server resources are limited, you don’t want to store data if you don’t know whether you will eventually use it.
"A man should keep his little brain-attic stocked with all the furniture that he is likely to use, and the rest he can put away in the lumber-room of his library, where he can get it if he wants it." – Sherlock Holmes, The Five Orange Pips.
Enter Hadoop. Hadoop is an open-source software framework that was originally developed by two employees working at Yahoo! for the purpose of indexing the data they were collecting. It was specifically designed to solve the problem presented by having a lot of different kinds of data, both structured and unstructured: how to store it, how to index it and how to query it.
In a nutshell, Hadoop and derivative classes of products enable organizations to keep their data…forever.
Return Path has been using Hadoop for the past five years for raw-data storage and to resolve the 25 million or so IPs that we score for the Return Path Sender Score program. Before Hadoop it took us 24 hours to score all of the IPs; now we can score them in real time.
Hadoop, as a distributed computing paradigm, has enabled ISPs to use Big Data to fight spam in ways they never could before, or in many cases, has simply made anti-spam processes more efficient. On the ISP side of the email industry, Big Data is being used to fight spam and reduce false positives. On the Sender side, Big Data is being used to optimize email communications and campaigns.
Let’s start with ISPs. One such ISP is Yahoo!. Like all organizations processing vast amounts of data, Yahoo! ran up against scalability and storage issues. It used to take them weeks to analyse certain issues. Now Hadoop enables them to create queries that “slice through billions of messages to isolate problems and identify spammers” in just a matter of minutes.
Hadoop also enables the use of certain algorithms that work best when you can store vast amounts of historical data. The ability to apply a “frequent item-set mining” (FIM) algorithm is one example of Hadoop’s impact on Yahoo!’s anti-spam technology. Frequent item-set mining algorithms track the frequency of any given item-set in a continuous stream of data. In plain language, they look for interesting relationships between variables. To use an FIM algorithm to find item-sets that indicate spam, you need a new way of mining databases. Because of the speed at which new data arrives, the history of the data cannot be revisited unless it’s stored, and previously, storing such large volumes of data was impossible. Hadoop changed that.
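To make the idea of an item-set concrete, here is a minimal sketch in Python. It is not Yahoo!’s method and it does not run on Hadoop; it simply counts every combination of features by brute force over a small in-memory list, where production systems would use algorithms such as Apriori or FP-Growth. The feature names (sending IP, URL domain, subject keyword) and the sample messages are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: count} for item-sets seen in at least min_support transactions.

    Brute-force counting for illustration only; it enumerates every
    combination up to max_size, which does not scale to real mail volumes.
    """
    counts = Counter()
    for items in transactions:
        unique = sorted(set(items))  # ignore duplicates, fix a stable order
        for size in range(1, max_size + 1):
            for combo in combinations(unique, size):
                counts[combo] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical features extracted from four messages.
messages = [
    ["ip:203.0.113.7", "url:example.test", "subject:prize"],
    ["ip:203.0.113.7", "url:example.test", "subject:invoice"],
    ["ip:198.51.100.2", "url:example.org", "subject:newsletter"],
    ["ip:203.0.113.7", "url:example.test", "subject:prize"],
]

frequent = frequent_itemsets(messages, min_support=3)
for itemset, count in sorted(frequent.items()):
    print(itemset, count)
```

In this toy data the pair of one sending IP and one URL domain shows up together in three of the four messages. That is the kind of recurring combination an analyst would want to look at more closely, and it only becomes visible because the earlier messages were kept.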
Anti-phishing technology is another area in which Big Data can make a difference. Phishing attacks are becoming increasingly targeted and can slip through spam filters.
Gary Warner, a researcher at the University of Alabama and recent recipient of the J.D. Falk Award, says of anti-phishing techniques, "People are really trying to rely on old ways of dealing with the problem like outmoded methods such as end-user education. What we are really talking about here is the intersection of big data – collected data that we can take a look at … to be able to find that needle in the haystack."
This reminds me of something my colleague Georges Smine says about the Return Path approach to anti-phishing, “To find more needles, you need more hay.” You need more data and our anti-phishing solution is supported by just that. We analyze more email data than any other security provider, and we analyze it more quickly.
“You see but you do not observe.” – Sherlock Holmes, A Scandal in Bohemia.
I mentioned earlier that both ISPs and email senders are benefiting from Big Data. Let’s consider the Sender side now. How are large email senders using Big Data to glean applicable insights and optimize their email programs?
I chose the aforementioned “frequent item-set mining” algorithm as a primary example because it’s a technique, supported by Big Data, that is also used by Senders. FIM algorithms are often applied to something called “Market Basket Analysis”, also known as “Affinity Analysis”. Market Basket Analysis helps retailers better understand the behaviour of their customers. We can immediately see the applications of this kind of intelligence: cross-selling and up-selling, campaign optimization and loyalty programs.
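As a rough illustration of Market Basket Analysis, the sketch below scores a single rule of the form “customers who bought X also bought Y” using three standard measures: support, confidence and lift. The baskets and product names are made up for the example, and the function name rule_metrics is my own.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(baskets)
    n_a = sum(1 for b in baskets if a <= set(b))           # baskets containing X
    n_c = sum(1 for b in baskets if c <= set(b))           # baskets containing Y
    n_both = sum(1 for b in baskets if (a | c) <= set(b))  # baskets containing both

    support = n_both / n                          # how common the pair is overall
    confidence = n_both / n_a if n_a else 0.0     # share of X buyers who also bought Y
    lift = confidence / (n_c / n) if n_c else 0.0  # above 1 means more often than chance
    return support, confidence, lift


# Hypothetical purchase histories.
baskets = [
    ["stand mixer", "baking sheet"],
    ["stand mixer", "baking sheet", "whisk"],
    ["whisk"],
    ["baking sheet"],
]

support, confidence, lift = rule_metrics(baskets, ["stand mixer"], ["baking sheet"])
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two products are bought together more often than their individual popularity would predict, which is what makes the pairing a candidate for a cross-sell email.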
One well-known sender, Williams-Sonoma, has successfully used Big Data and Hadoop to expand their market basket analysis time-frame. They now process over 50 million data records daily and use them to custom-tailor marketing emails to their customers.
Williams-Sonoma also uses Big Data to apply techniques like the chi-square test to their list segmentation. Chi-square test results will show whether your segmentation practice is actually dividing the customer list into significant and meaningful groups. As a sender, this is essential to making segmentation work for you.
In the end, Williams-Sonoma successfully used Big Data to increase their email engagement (opens and clicks) by 20%. As a sender they aimed to send mail only to people who actually want it. Big Data helped them do that.
When it comes to Big Data, ISPs, top senders and Return Path all have one thing in common: we understand that it’s not enough to store and mine large quantities of data. You have to find the meaning in it.
Sherlock was nothing if not insightful. The ability to gather data is one thing, but understanding what it means is altogether different. Insight surpasses knowledge. Knowledge (information/data) comes first, the insights follow.
According to Robert Barclay, VP of Analytics at Return Path, “The quality of your insights scales with two things: the number of ideas you have to test and the connection in your data.”
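To show what such a test looks like in practice, here is a minimal sketch of a chi-square test of independence on two segments. It is not Williams-Sonoma’s implementation, and the open counts are invented. It assumes every expected count is greater than zero, and it uses 3.841 as the threshold, which is the standard critical value for one degree of freedom at the 5% significance level.

```python
def chi_square_statistic(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts.

    Rows are segments and columns are outcomes. Assumes no row or column
    total is zero, so every expected count is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect if segment and outcome were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical results: [opened, did not open] for each segment.
observed = [
    [300, 700],  # segment A
    [200, 800],  # segment B
]

stat, dof = chi_square_statistic(observed)
CRITICAL_5PCT_DOF1 = 3.841  # chi-square critical value for dof = 1, alpha = 0.05
significant = dof == 1 and stat > CRITICAL_5PCT_DOF1
print(f"chi-square={stat:.2f}, dof={dof}, significant={significant}")
```

If the statistic clears the threshold, the difference in open rates between the two segments is unlikely to be down to chance, so the segmentation is separating customers who genuinely behave differently. If it does not, the segments may not be worth treating separately.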
Just like Sherlock, if you want to use data to fight spam or optimize your email programs you begin by asking questions and having ideas that you can test.
Finding meaning in the data is the insight. Big Data doesn’t matter; Big Insights do!
About Dana Huten
Dana Huten is an Anti-Spam and Security Consultant in the Email Intelligence Team of Return Path. With 15 years in the European telecommunications and internet industry, Dana has a broad spectrum of knowledge. She has experience working in mobile, internet and email marketing as well as in IT and internet security.