For the past week or so I've been looped into various projects that, for one reason or another, required analysis of web logs. This little side trip has (re)taught me a number of things about large, production websites and how attackers work.
Perhaps the most intriguing was an analysis of HTTP return codes from our production web tier compared to what Akamai had logged for the same time period. I had initially assumed that, for the really strange error codes, the numbers Akamai had recorded and the numbers we saw should match. Well, I was wrong. For example, for HTTP 500 ("Internal Server Error"), even though our production web tier received 25% less traffic than Akamai handled for us, it saw 10% more HTTP 500 errors than Akamai did. The picture was different for 503s and 504s, which are arguably worse than 500s. The production web tier recorded no HTTP 504s, whereas Akamai showed roughly 3,500. For 503s, the production web tier recorded roughly 10% more than Akamai had logged.
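For anyone who wants to run the same comparison, here is a minimal sketch of the tallying step. It assumes both sets of logs can be read as lines in Common or Combined Log Format, which may not match what your web tier or Akamai log delivery actually produces. The function names `count_statuses` and `compare` are mine, made up for illustration.

```python
import re
from collections import Counter

# Assumption: the status code is the three-digit field right after the
# quoted request line, as in Common/Combined Log Format.
STATUS_RE = re.compile(r'"\s(\d{3})\s')

def count_statuses(lines):
    """Tally HTTP status codes found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:  # silently skip lines that do not look like log entries
            counts[match.group(1)] += 1
    return counts

def compare(origin, edge):
    """Return {code: (origin_count, edge_count, origin_minus_edge)}."""
    rows = {}
    for code in sorted(set(origin) | set(edge)):
        o, e = origin.get(code, 0), edge.get(code, 0)
        rows[code] = (o, e, o - e)
    return rows
```

Feed it the origin logs and the Akamai logs for the same time window. Codes with a large positive or negative difference are the ones worth a closer look.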
This got me thinking about why the error code distribution varied so widely. I could explain away some of the HTTP 503s and 504s with global network wonkiness and some handwaving, but the 500s and most of the 4xx series codes eluded me temporarily. On the way home, it hit me, and I spent the rest of my drive in a dead stare thinking about this.
The variations in HTTP error codes on Akamaized sites can be largely explained by the way most web-based attacks work. I'd venture to say that 90% of the HTTP attacks that any publicly exposed web server faces are not targeted at that host in particular; rather, its IP address(es) were simply part of a massive sweep that someone somewhere launched in the hopes of getting lucky. Those sweeps hit the origin's IP addresses directly rather than going through Akamai, so the errors they generate show up in the web tier's logs but never in Akamai's. So, in theory, for Akamaized sites no "valid" traffic should ever trickle its way to your web tier unless it comes through Akamai. Anything that bypasses Akamai is likely malicious and should be dropped, or is at least screwy enough to warrant investigation. Filtering out such traffic should be fairly easy, as Akamai clearly identifies itself by way of the Akamai-Origin-Hop, Via and X-Forwarded-For headers. Sure, those headers can be forged, but we are not looking to hinder attackers who are intent on attacking our specific web presence. The downside is that you must do L7 inspection for this to work, so processing time and power can become an issue.
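To make the idea concrete, here is a rough sketch of the header check, written as a plain function rather than as a config for any particular load balancer or web server. Requiring all three headers to be present is my assumption, not something Akamai documents as a rule. The names `looks_like_akamai` and `decide` are made up, and a real deployment would probably want to inspect the header values as well.

```python
# Headers named in the post as Akamai's calling cards. HTTP header names
# are case-insensitive, so everything is compared in lowercase.
AKAMAI_HEADERS = ("akamai-origin-hop", "via", "x-forwarded-for")

def looks_like_akamai(headers):
    """Return True if every expected header is present.

    `headers` is any mapping of header name to value. Only presence is
    checked (an assumption for this sketch); values are not validated,
    so a forged request would still pass.
    """
    names = {name.lower() for name in headers}
    return all(expected in names for expected in AKAMAI_HEADERS)

def decide(headers):
    """Hypothetical policy: allow edge traffic, flag everything else."""
    return "allow" if looks_like_akamai(headers) else "flag"
```

Starting with "flag" (log and investigate) rather than an outright drop seems like the safer first step, since it shows how much legitimate traffic, if any, is reaching the origin directly.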
If anyone has considered this approach and/or actually implemented it, I'd like your input -- leave a comment or drop me an email.