splogs

My blog search engine, SproutSearch, is now indexing over 8 million blogs, and I am working on changing the way those blogs are ranked. For now, they are sorted by the sheer amount of content they contain. The big problem with this method is that many spam blogs contain masses of content, so they rank well. I don’t like SproutSearch linking to so much spam, so I need to find a way to remove a lot of these listings.

It is not practical for me to read 8 million blogs, so I need an automated method to detect spam. Many spam blogs use the same words over and over, so I wrote a program to count the number of repeated words. Most spam blogs also seem to use a similar number of words in every post, so I made another program that computes the standard deviation of the number of words per post; a low deviation is a warning sign. Using these metrics, I will make a program that flags potential spam so I can review and delete it.
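Here is a rough Python sketch of how the two metrics could be computed and combined. The original programs are not shown, so everything here is an assumption made for illustration: the function names, the tokenizer, the two threshold values, and the rule that a blog is flagged only when both metrics look suspicious.

```python
import re
import statistics
from collections import Counter

# Assumed tokenizer: lowercase runs of letters, digits, and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split a post into lowercase words."""
    return WORD_RE.findall(text.lower())


def repeated_word_ratio(posts):
    """Fraction of word occurrences that repeat a word already used in the blog."""
    counts = Counter(word for post in posts for word in tokenize(post))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every occurrence beyond the first of each distinct word is a repeat.
    return (total - len(counts)) / total


def post_length_stdev(posts):
    """Population standard deviation of words per post, or None if under two posts."""
    lengths = [len(tokenize(post)) for post in posts]
    if len(lengths) < 2:
        return None  # a single post cannot show any variation
    return statistics.pstdev(lengths)


def looks_like_spam(posts, max_stdev=5.0, min_repeat_ratio=0.6):
    """Flag a blog for manual review. Both thresholds are hypothetical guesses."""
    stdev = post_length_stdev(posts)
    if stdev is None:
        return False
    return stdev <= max_stdev and repeated_word_ratio(posts) >= min_repeat_ratio


if __name__ == "__main__":
    blog = ["buy cheap pills buy cheap pills now"] * 3
    print(repeated_word_ratio(blog), post_length_stdev(blog), looks_like_spam(blog))
```

The function only flags a blog; it does not delete anything, which matches the plan of reviewing flagged blogs by hand. The thresholds would need tuning against blogs already known to be spam or not.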






