Validating Addresses in an Unbounded Namespace
by J.D. Falk
Director of Product Strategy
Earlier this week, we wrote about the expansion of top-level domain names, and the decreasing importance of domain names to users looking for web content.
End users don’t always realize it, but accuracy in domain names matters to them when it comes to their email addresses, and it matters just as much to anyone who collects email addresses for any purpose.
Mistyped email addresses can have far-reaching consequences. One of the canonical examples is the story of Nadine, where someone input the wrong address when signing up at a sweepstakes site in 2001 — and the owner of the domain name still receives more than 70 spam messages addressed to her every day.
Closer to home, there’s somebody out there named James Falk (no relation to me) who occasionally gives sites my yahoo.com email address, which I’ve had since 1996 or ’97. One of those legal document email sites (a competitor to our partner RPost) even sent me what appeared to be the lease to his new house! There was no confirmation (or “double opt in”) step, no “this is not me” link, no way to unsubscribe. Often, these sites will happily send all sorts of personal information, with the unfortunate exception of his actual email address, which I would need in order to tell him about his mistake.
In both of these cases, a user typed in the wrong address at a valid domain. There’s no way to gather statistics, but I’m sure it’s far more common that typos point to entirely invalid domains: yahoo.cmo, or returnpath.nett. These can be caught in software, but it’s still not as easy as it looks.
Consider a regular expression such as:
That would match email addresses at the original six generic top-level domains, or gTLDs. It wouldn’t match two-letter country code TLDs (ccTLDs), but there are hundreds of those, so let’s include them more simply:
For those who can’t read regular expressions, this means: from the start of the line, match any number of any characters, then an @ symbol, then any number of any characters, then a . symbol, then either: one of com, edu, gov, int, mil, net, or org, or two characters — after which the line ends.
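As a rough illustration, the pattern described above can be sketched in Python. This is a reconstruction from the prose description, not a definitive copy of the original expressions: the names GTLD_ONLY, WITH_CCTLD, and matches are made up for this sketch, and the seven gTLDs are the ones listed in the paragraph above.

```python
import re

# Assumed reconstruction: anything, "@", anything, ".", then one of the
# listed gTLDs, then end of line.
GTLD_ONLY = re.compile(r"^.*@.*\.(com|edu|gov|int|mil|net|org)$")

# Same pattern, plus "any two characters" to stand in for every ccTLD.
WITH_CCTLD = re.compile(r"^.*@.*\.(com|edu|gov|int|mil|net|org|..)$")


def matches(pattern, address):
    """Return True if the whole address fits the given pattern."""
    return pattern.match(address) is not None


if __name__ == "__main__":
    for candidate in ["user@example.com", "user@example.co.uk",
                      "user@yahoo.cmo", "user@returnpath.nett"]:
        print(candidate, matches(WITH_CCTLD, candidate))
```

Under these assumptions, both typos mentioned earlier (yahoo.cmo and returnpath.nett) are rejected, while any two-character ending is accepted whether or not it is a real country code.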
But now there are more gTLDs. If they were all three characters long, it’d be easy — but they’re not, so we’re left with:
And with ICANN poised to add more soon, that list will keep getting longer — requiring constant maintenance just for this deceptively simple and woefully incomplete email address checking algorithm.
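To make the maintenance burden concrete, here is a minimal sketch that builds the pattern from a list instead of typing it by hand. The list of later gTLDs is illustrative only: these are real gTLDs of varying length, but not necessarily the ones the original example used, and the names ORIGINAL_GTLDS, LATER_GTLDS, and build_pattern are hypothetical.

```python
import re

ORIGINAL_GTLDS = ["com", "edu", "gov", "int", "mil", "net", "org"]

# Illustrative, hand-maintained list of gTLDs added later. Any list like
# this goes stale as soon as another TLD is delegated.
LATER_GTLDS = ["aero", "asia", "biz", "cat", "coop", "info", "jobs",
               "mobi", "museum", "name", "pro", "tel", "travel"]


def build_pattern(tlds):
    """Compile the naive address pattern for the given TLD list."""
    alternation = "|".join(re.escape(tld) for tld in sorted(tlds))
    # The trailing ".." still stands in for any two-letter ccTLD.
    return re.compile(r"^.*@.*\.(" + alternation + r"|..)$")


PATTERN = build_pattern(ORIGINAL_GTLDS + LATER_GTLDS)

if __name__ == "__main__":
    for candidate in ["user@example.museum", "user@example.newtld"]:
        print(candidate, PATTERN.match(candidate) is not None)
```

Generating the pattern from a list makes updates less error-prone, but somebody still has to keep the list current: an address at a TLD that is missing from it (newtld here is a made-up placeholder) is rejected until the list is edited.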
Why incomplete? For one thing, it only tells you that the domain might exist, not that it does. It also lets through all sorts of characters which aren’t valid, and has nothing to prevent SQL injection or similar attacks. Seriously, it’s just an example, don’t use it.
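A short sketch shows how much the naive pattern lets through. The pattern is the same assumed reconstruction of the expression described earlier, and NAIVE, SUSPECT_INPUTS, and accepted are hypothetical names chosen for this example.

```python
import re

# Assumed reconstruction of the naive pattern described earlier.
NAIVE = re.compile(r"^.*@.*\.(com|edu|gov|int|mil|net|org|..)$")

SUSPECT_INPUTS = [
    "@.com",                            # nothing before the @ or before the dot
    "two words@exa mple.com",           # unquoted spaces
    "a@@b..org",                        # doubled @ and consecutive dots
    "x'; DROP TABLE users; --@y.net",   # hostile input is passed along untouched
    "someone@no-such-domain-here.com",  # well formed, but the pattern cannot
                                        # tell whether the domain exists
]


def accepted(address):
    """Return True if the naive pattern accepts the address."""
    return NAIVE.match(address) is not None


if __name__ == "__main__":
    for candidate in SUSPECT_INPUTS:
        print(repr(candidate), accepted(candidate))
```

Every entry in the list is accepted, which is the point: matching the pattern says very little about whether an address is usable, and nothing about whether it is safe to hand to a database.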
Last year Steve Atkins wrote about the legal components of an email address (which are far more limited than my simple regular expressions here), and gave a list of things to check to make sure an address is valid. His list is a lot longer and more accurate than my example above, and should be proof against the now-unbounded TLD namespace.
And, remember: just because a domain is valid, and the email address is correctly formed, doesn’t mean that there isn’t a spam trap on the other end — or that the recipient, if there is one, wants to receive your email. Luckily, there’s an easy way to ask — and to make sure the address is valid at the same time.
It’s just a shame that invalid addresses can’t be caught as easily in software anymore.