Saving bandwidth by ... removing comments?

I fixed up a number of shortcomings in htcomment the other day. As part of developing this tool, I regularly run it against sites I frequent. I started to notice what appeared to be an inordinate number of comments on some sites, which got me thinking -- on average, what percentage of a site's web content is comments?

Some quick analysis gave some interesting results:

$ for site in boston.com latimes.com cnn.com neu.edu mit.edu \
    www.whitehouse.gov www.fbi.gov sony.com slashdot.org \
    google.com reddit.com craigslist.org blogger.com myspace.com \
    amazon.com w3c.org php.net; do
    # total bytes in the front page
    full=$(lynx -source "http://$site" | wc -c)
    # bytes of comment output reported by htcomment
    comments=$(./htcomment -q "http://$site" | wc -c)
    echo "$site is $(echo "scale=4; ($comments/$full)*100" | bc) % comments"
done
boston.com is 16.9600 % comments
latimes.com is 5.5300 % comments
cnn.com is 1.6200 % comments
neu.edu is 15.2600 % comments
mit.edu is 2.6600 % comments
www.whitehouse.gov is 5.8500 % comments
www.fbi.gov is 4.5900 % comments
sony.com is .7300 % comments
slashdot.org is 3.7300 % comments
google.com is 0 % comments
reddit.com is 0 % comments
craigslist.org is 0 % comments
blogger.com is 2.7500 % comments
myspace.com is 17.0700 % comments
amazon.com is .2600 % comments
w3c.org is 0 % comments
php.net is .3400 % comments

Unfortunately these numbers are not 100% accurate -- htcomment can't differentiate between "kjdflakjfdaf <!-- comment -->" and just "<!-- comment -->", so the numbers for the sites that do have comments can be somewhat skewed, but they are a good first-order approximation. It is no coincidence, in my opinion, that Google, W3C and Craigslist have 0 comments on their front pages. For sites whose front page alone is more than 5% comments, you can't help but wonder how the behavior of their site or their bandwidth expenses would change if those comments were filtered out at their edge, or never put there in the first place.
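
One way around the "kjdflakjfdaf <!-- comment -->" ambiguity would be to count only the bytes between "<!--" and "-->" and ignore everything else on the line. The following is a minimal sketch of that idea, not how htcomment works: comment_percent is a hypothetical helper name, it assumes a POSIX awk is available, it does no real HTML parsing (so a "<!--" inside a script block or attribute is counted too), and it stops counting at an unterminated comment.

```sh
#!/bin/sh
# comment_percent (hypothetical helper, not part of htcomment):
# reads HTML on stdin and prints the percentage of bytes that fall
# inside <!-- ... --> spans, delimiters included.
comment_percent() {
    # LC_ALL=C so that awk's length() counts bytes, not characters
    LC_ALL=C awk '
    { buf = buf $0 "\n" }                 # slurp the whole page
    END {
        total = length(buf)
        n = 0
        while ((s = index(buf, "<!--")) > 0) {
            rest = substr(buf, s + 4)     # text after the opening delimiter
            e = index(rest, "-->")
            if (e == 0) break             # unterminated comment: stop counting
            n += 4 + (e - 1) + 3          # "<!--" + comment body + "-->"
            buf = substr(rest, e + 3)     # continue after the closing delimiter
        }
        if (total == 0) print "0.00"
        else printf "%.2f\n", n * 100 / total
    }'
}

# 16 comment bytes out of 30 total
printf 'kjdflakjfdaf <!-- comment -->\n' | comment_percent   # prints 53.33
```

Under the same assumptions, swapping this into the loop above would look like lynx -source "http://$site" | comment_percent. Note that awk appends a newline to the last line if the page lacks one, so the total can be off by one byte.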

About

Name: Jon Hart

Location: Hiding between the smog and the Pacific

Occupation: Security Ninja, Thrill Seeker.
