My blog search engine, SproutSearch, is now indexing over 8 million blogs, and I am working on changing the way those blogs are ranked. For now, they are sorted by the sheer amount of content they contain. A big problem with this method is that many spam blogs contain masses of content, so they end up ranked highly. I don’t like SproutSearch linking to so much spam, so I need to find a way to remove a lot of these listings.

It is not practical for me to read 8 million blogs, so I need an automated method to detect spam. Many spam blogs use the same words over and over, so I wrote a program to count the number of repeated words. Most spam blogs also seem to use a similar number of words in every post, so I made another program that computes the standard deviation of the number of words per post. Using these two metrics, I will make a program that flags potential spam so I can review and delete it.
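To make the two metrics concrete, here is a rough Python sketch of how they could be computed and combined. This is not the actual SproutSearch code: the function names, the tokenizer, and the thresholds (`min_repeat_ratio`, `max_stdev`, `min_posts`) are all assumptions made up for illustration, and real cutoffs would have to be tuned against blogs already known to be spam or not.

```python
import re
import statistics
from collections import Counter


def tokenize(text):
    """Split text into lowercase words (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def repeated_word_ratio(posts):
    """Fraction of all word occurrences that repeat a word already seen."""
    words = [w for post in posts for w in tokenize(post)]
    if not words:
        return 0.0
    counts = Counter(words)
    # Every occurrence of a word beyond its first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(words)


def post_length_stdev(posts):
    """Population standard deviation of the number of words per post."""
    lengths = [len(tokenize(post)) for post in posts]
    if not lengths:
        return 0.0
    return statistics.pstdev(lengths)


def flag_potential_spam(posts, min_repeat_ratio=0.6, max_stdev=5.0, min_posts=3):
    """Flag a blog for manual review; all thresholds are hypothetical."""
    if len(posts) < min_posts:
        return False  # too few posts to judge
    return (repeated_word_ratio(posts) >= min_repeat_ratio
            and post_length_stdev(posts) <= max_stdev)
```

A flagged blog only goes on the review list; nothing is deleted automatically. Requiring both signals at once is one way to cut down on false positives, since a legitimate blog can be repetitive or have uniform post lengths without being spam.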
This entry was posted on Tuesday, June 19th, 2007 at 4:54 pm and is filed under news.