Abuse Reporting Format Demystified

As we discussed earlier, the standard Abuse Reporting Format (ARF) is used by nearly all complaint feedback loops. Today, let's talk about what that format consists of.  (This is just an overview; programmers should also refer to RFC 5965 for further details.)

ARF is intended for software-to-software and software-to-human communication, rather than human-to-human or human-to-software. In other words: an ARF report is generated by software, rather than by hand. Every report includes the entire original message, along with some machine-readable metadata and a free-form text area intended primarily for human viewers.

ARF is MIME. You can probably extract the necessary data from an ARF report without a MIME processor, but there's a good chance your software could miss something. So, seriously, start by learning about MIME.

More specifically, an ARF message has a MIME type of multipart/report, just like a modern non-delivery notice (defined in RFC 3462). The report-type is "feedback-report". Within that are three parts:

  1. text/plain
  2. message/feedback-report
  3. message/rfc822 or text/rfc822-headers
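
To make that layout concrete, here is a minimal Python sketch using the standard library's email package. The sample report, its addresses, and the helper name split_arf are all made up for illustration; real reports differ from provider to provider, so treat this as a starting point rather than a complete parser.

```python
import email

# A made-up, heavily trimmed ARF report (all names and addresses are illustrative).
SAMPLE = """\
From: feedback@isp.example
To: abuse@sender.example
Subject: Abuse report
MIME-Version: 1.0
Content-Type: multipart/report; report-type=feedback-report;
 boundary="arf-boundary"

--arf-boundary
Content-Type: text/plain

This is an email abuse report.

--arf-boundary
Content-Type: message/feedback-report

Feedback-Type: abuse
User-Agent: ExampleFBL/1.0
Version: 1

--arf-boundary
Content-Type: message/rfc822

From: someone@sender.example
To: user@isp.example
Subject: Hello

Original body.

--arf-boundary--
"""


def split_arf(raw):
    """Return the (human, machine, original) parts of an ARF report."""
    msg = email.message_from_string(raw)
    if msg.get_content_type() != "multipart/report":
        raise ValueError("not a multipart/report message")
    if (msg.get_param("report-type") or "").lower() != "feedback-report":
        raise ValueError("report-type is not feedback-report")
    parts = msg.get_payload()
    if len(parts) != 3:
        raise ValueError("expected exactly three MIME parts")
    return tuple(parts)


human, machine, original = split_arf(SAMPLE)
print([p.get_content_type() for p in (human, machine, original)])
```

Rejecting anything that isn't a three-part multipart/report is a deliberately strict choice for the sketch; a production consumer might prefer to log and skip malformed reports instead.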

The first, text/plain portion, is intended to be entirely human-readable. It often contains a canned message such as "This is a Comcast email abuse report for an email message received from IP 10.11.12.13 on Tue, 31 Aug 2010 09:40:55 +0000". Simple, boring, but occasionally useful.

Next is the message/feedback-report portion, containing various header-like fields (though they're not email headers). This is where the report generator can put information that they think is relevant to the message; the report consumer must decide for themselves whether that information is accurate or actionable. These fields may appear in any order.
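
From the generator's side, those fields can be written out as plain "Name: value" lines and wrapped into the report. The following is only a sketch using Python's standard email.mime classes: the helper name build_report, the field values, and the addresses are assumptions for the example, not anything a particular mailbox provider is known to use.

```python
import email
from email.message import Message
from email.mime.base import MIMEBase
from email.mime.message import MIMEMessage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_report(original, fields, note):
    """Wrap an already-parsed message in a three-part feedback report."""
    report = MIMEMultipart("report", report_type="feedback-report")
    report.attach(MIMEText(note, "plain"))

    # Carry the fields as the headers of a body-less nested message,
    # which serializes to plain "Name: value" lines.
    field_block = Message()
    for name, value in fields:
        field_block[name] = value
    machine = MIMEBase("message", "feedback-report")
    machine.attach(field_block)
    report.attach(machine)

    report.attach(MIMEMessage(original))
    return report


# Illustrative input: a tiny "original" message and a few field values.
original = email.message_from_string(
    "From: someone@sender.example\nSubject: Hello\n\nOriginal body.\n"
)
report = build_report(
    original,
    [("Feedback-Type", "abuse"), ("User-Agent", "ExampleFBL/1.0"), ("Version", "1")],
    "This is an email abuse report.",
)
print(report.as_string())
```

A real generator would also set the usual From:, To:, Subject: and Date: headers on the outer message before sending it; they are left out here to keep the example short.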

Three of these fields are required to appear in every ARF message:

Feedback-Type: may be abuse (spam and the like), fraud (phishing), virus, not-spam (the reporter is saying the message was wrongly treated as spam), or other. There used to be some additional types, but these were removed because they weren't being used. A new type for DKIM failures will be introduced soon.

User-Agent: is purely informational, and follows the same format as the HTTP User-Agent string sent by web browsers. There is no standard registry of agents.

Version: has been "0.1" since ARF's inception, but becomes the integer "1" now that the standard has been published as RFC 5965. It may take some time for all of the 0.1 versions to catch up, particularly if they don't need to change anything else.
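
A consumer might check for the three required fields along these lines. This is a sketch in Python with the standard email package; the sample part, the REQUIRED tuple, and the helper name read_required_fields are invented for the example.

```python
import email

# Illustrative machine-readable part, as it would appear inside a report.
PART = """\
Content-Type: message/feedback-report

Feedback-Type: abuse
User-Agent: ExampleFBL/1.0
Version: 0.1
"""

REQUIRED = ("Feedback-Type", "User-Agent", "Version")


def read_required_fields(part):
    """Return the three required fields, raising if any is absent."""
    # Python's email parser treats the body of a message/* part as a
    # nested message, so the fields show up as that message's headers.
    fields = part.get_payload(0)
    missing = [name for name in REQUIRED if fields[name] is None]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return {name: fields[name].strip() for name in REQUIRED}


required = read_required_fields(email.message_from_string(PART))
# Accept both the pre-RFC "0.1" and the published "1".
is_known_version = required["Version"] in ("0.1", "1")
print(required, is_known_version)
```

Because header lookups in the email package are case-insensitive and don't depend on position, the same code works whatever order the generator wrote the fields in.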

There's also an assortment of optional fields:

Original-Envelope-ID: is the envelope ID string used in the original SMTP transaction. I've never seen this.

Original-Mail-From: is the SMTP MAIL FROM string, often seen in the Return-Path: header -- not the From: header.

Original-Rcpt-To: is the SMTP RCPT TO string, and can appear multiple times for multiple recipients.

Arrival-Date: is the date & time that the message arrived at the receiving MTA, regardless of what the Date: header may say.

Source-IP: is the "last hop" IP address.

Incidents: is a count of how many individual incidents the report refers to. For example, a feedback generator may choose to send one report for dozens or even hundreds of spam messages, indicating the total number of messages in this field.

Authentication-Results: is identical to the Authentication-Results: mail header (RFC 5451), and can appear multiple times.

Reported-URI: is a uniform resource identifier (RFC 3986) derived from the message, which the report generator believes to be related to the incident being reported. Most often this'll be an HTTP or MAILTO URL. This can also appear multiple times.
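
Since any of these may be missing and some may repeat, a consumer has to read them defensively. Here is one way that could look in Python; the field values are made up, and treating a missing Incidents field as 1 is my reading of RFC 5965 rather than something to take on faith.

```python
import email
import ipaddress
from email.utils import parsedate_to_datetime

# Illustrative field block; every value here is invented.
FIELDS = """\
Feedback-Type: abuse
User-Agent: ExampleFBL/1.0
Version: 1
Original-Mail-From: <bounce@sender.example>
Original-Rcpt-To: <user1@isp.example>
Original-Rcpt-To: <user2@isp.example>
Arrival-Date: Tue, 31 Aug 2010 09:40:55 +0000
Source-IP: 10.11.12.13
Reported-URI: http://sender.example/offer
Reported-URI: mailto:sales@sender.example
"""

fields = email.message_from_string(FIELDS)

# Fields that may repeat: get_all() returns every occurrence, in order.
recipients = fields.get_all("Original-Rcpt-To", [])
uris = fields.get_all("Reported-URI", [])

# Single-valued optional fields: get() returns None when absent.
arrival = fields.get("Arrival-Date")
arrival = parsedate_to_datetime(arrival) if arrival else None
source = fields.get("Source-IP")
source = ipaddress.ip_address(source) if source else None

# Assumption: a report without an Incidents field covers one incident.
incidents = int(fields.get("Incidents", "1"))

print(recipients, uris, arrival, source, incidents)
```

Converting Arrival-Date and Source-IP into real date and address objects early means malformed values fail loudly at parse time instead of further down the pipeline.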

Finally, the third part of the ARF report is either the entire original message as a message/rfc822 part, or (rarely) just the headers.
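
Either form can be reduced to the same thing for processing: a set of original headers. The sketch below assumes the headers-only form is labelled text/rfc822-headers (the type registered by RFC 3462) and arrives without a content transfer encoding; the helper name original_headers and both sample parts are invented for the example.

```python
import email
from email.parser import HeaderParser


def original_headers(part):
    """Return the reported message's headers from the third MIME part."""
    ctype = part.get_content_type()
    if ctype == "message/rfc822":
        # The parser has already turned the body into a nested message.
        return part.get_payload(0)
    if ctype == "text/rfc822-headers":
        # Headers only: the payload is plain text that still needs parsing.
        return HeaderParser().parsestr(part.get_payload())
    raise ValueError("unexpected content type: " + ctype)


# Illustrative third parts, one in each form.
FULL = """\
Content-Type: message/rfc822

From: someone@sender.example
Subject: Hello

Original body.
"""

HEADERS_ONLY = """\
Content-Type: text/rfc822-headers

From: someone@sender.example
Subject: Hello
"""

for raw in (FULL, HEADERS_ONLY):
    headers = original_headers(email.message_from_string(raw))
    print(headers["From"], "|", headers["Subject"])
```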

Some mailbox providers choose to redact portions of the original message, citing concerns of privacy and/or legal liability. Because complaint feedback never happens randomly -- there's always a subscription process, and an approval process, and often Terms of Service or other legal agreements along the way -- it's expected that feedback consumers will know whether to expect redaction from any particular feedback generator.

There are a few open source software libraries to help with generating and/or parsing ARF reports:

ARF could easily be extended to include new feedback types or metadata fields. One of these, for DKIM error reporting, is already being discussed -- and we'll write about it here next week. The appropriate place to suggest extensions is the IETF MARF Working Group, preferably after reviewing the mailing list archives to make sure your idea isn't already in development.

I've been actively participating in the group, and have a few proposals of my own in play. We'll let you know when anything particularly important or interesting comes along.