[Air-l] Website/weblog word counts

Alex Halavais alex at halavais.net
Wed May 16 06:11:59 PDT 2007


It gets pretty thorny, actually, depending on how you have things
archived, and what you are trying to get at.

1. Are you looking for words-per-post? If so, you should probably be
archiving permalinked posts, but not all blogs allow you to address
individual posts with a specific URL. Most also include comments at
that permalink.

2. Just stripping out the HTML still leaves you with the cruft
(sidebar, etc.) that is automatically generated, along with the
comments if they are included.
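
To make point 2 concrete, here is a minimal sketch in Python using only the
standard library. The class and function names are made up for illustration,
and "word" here just means a whitespace-separated token, which is an
assumption you may want to change. It strips the tags and counts what is left,
so sidebar text is counted along with the post.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def count_words(html):
    """Count whitespace-separated tokens in the visible text of a page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    # Joining with a space keeps words in adjacent elements apart.
    return len(" ".join(parser.chunks).split())
```

Run on a page with a three-word sidebar and a three-word post, this reports
six words, which is exactly the problem: the count cannot tell post text from
generated text.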

Words-per-month might be easier, since most blogging platforms/systems
provide a monthly archive at a single URL and without comments. You will
still have cruft, but if you are sneaky about it (including a future month
in your archive, which should contain only the cruft and no posts), you
might be able to subtract this out from your counts.
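
The subtraction might look something like the sketch below. It assumes you
have already reduced both pages to plain text, and the function names are
hypothetical. It also assumes the template is the same on both pages; if the
sidebar changes from month to month, the estimate will drift.

```python
def count_words(text):
    """Count whitespace-separated tokens in already-stripped text."""
    return len(text.split())


def words_minus_cruft(month_text, empty_month_text):
    """Estimate post words for a month.

    month_text: plain text of a monthly archive page (posts plus template).
    empty_month_text: plain text of a month with no posts (template only).
    The result is clamped at zero in case the templates differ.
    """
    return max(count_words(month_text) - count_words(empty_month_text), 0)
```

For example, if the month page has 8 words and the empty month has 3, the
estimate for the posts is 5.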

The other possibility is to use the RSS feed, assuming you have been
archiving it. You can either feed it through an RSS parser (most
scripting languages have them), or apply a regex to the feed. This,
unfortunately, excludes those blogs that do not have RSS--a shrinking
but still substantial number.
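
As a rough sketch of the parser route, the following uses Python's standard
xml.etree.ElementTree. It assumes a plain RSS 2.0 feed with un-namespaced
<item> elements whose text is in <description>; Atom and RSS 1.0 feeds would
need different element names. The function name is made up, and the tag
stripping inside each description is a crude regex, not a real HTML parser.

```python
import re
import xml.etree.ElementTree as ET
from html import unescape


def words_per_item(rss_xml):
    """Return a list of (title, word_count) pairs, one per <item>."""
    root = ET.fromstring(rss_xml)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title", default="(untitled)")
        body = item.findtext("description", default="")
        # Descriptions often carry escaped HTML; drop the tags crudely.
        text = re.sub(r"<[^>]+>", " ", body)
        # Decode any entities left over after XML parsing (e.g. &nbsp;).
        text = unescape(text)
        results.append((title, len(text.split())))
    return results
```

Keep in mind that some feeds carry only excerpts rather than full posts, so
counts from a feed may undercount.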

The final possibility is to get hold of a sample--like the Blogpulse
sample--that has already had some of the munging done. I would be
pretty surprised if someone hadn't already done a word-count on the
Weblogging Ecosystem data this year:

http://www.blogpulse.com/www2006-workshop/

Best,

Alex


--
//
// This email is
// [X] assumed public and may be blogged / forwarded.
// [ ] assumed to be private, please ask before redistributing.
//
// Alexander C. Halavais
// Social Architect
// http://alex.halavais.net
//


