Newsgroups : Borland : borland.public.delphi.internet.winsock : 2006 Jun : Re: spam sender addresses

www.cryer.info
Managed Newsgroup Archive

Re: spam sender addresses

Subject: Re: spam sender addresses
Posted by: "Charles Stack" (csta..@codysystems.com)
Date: Mon, 5 Jun 2006 16:25:15

Have you considered using a naive Bayesian classifier (NBC) to further
refine your checks?  One approach would be to compute, for each element
of the address, the probability that it indicates spam vs non-spam.

It could then "learn" from what you call valid vs non-valid if you
provide a batch or some sort of incremental update / purge mechanism.
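
A rough sketch of what such a learning classifier might look like, in
Python rather than Kylix for brevity. The class name TokenBayes, the
"spam"/"ham" labels and the add-one smoothing are illustrative
assumptions, not something taken from theo's code:

```python
import math
from collections import Counter


class TokenBayes:
    """Tiny naive Bayes model over address tokens (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}  # addresses seen per class

    def update(self, tokens, label, weight=1):
        # weight=+1 learns an address; weight=-1 purges one learned earlier.
        for tok in tokens:
            self.counts[label][tok] += weight
        self.totals[label] += weight

    def spam_probability(self, tokens):
        vocab = set(self.counts["spam"]) | set(self.counts["ham"])
        # Start from the (smoothed) prior odds of spam vs non-spam.
        log_odds = math.log((self.totals["spam"] + 1) / (self.totals["ham"] + 1))
        for tok in tokens:
            for label, sign in (("spam", 1), ("ham", -1)):
                n_tok = self.counts[label][tok]
                n_all = sum(self.counts[label].values())
                # Add-one smoothing so unseen tokens never give log(0).
                log_odds += sign * math.log((n_tok + 1) / (n_all + len(vocab) + 1))
        # Numerically stable conversion from log-odds to a probability.
        if log_odds >= 0:
            return 1 / (1 + math.exp(-log_odds))
        e = math.exp(log_odds)
        return e / (1 + e)


model = TokenBayes()
model.update(["kdkdkdkdkd", "xxx"], "spam")   # marked non-valid by hand
model.update(["elmer", "fudd"], "ham")        # marked valid by hand
print(model.spam_probability(["elmer"]))
print(model.spam_probability(["kdkdkdkdkd"]))
```

Calling update() once per address gives the incremental behaviour; a
batch update is just a loop over a list of labelled addresses, and
weight=-1 undoes an earlier entry.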

You could also look at the concept of chained tokens when using a
classifier, in addition to straight NBC.  For "elmer.fudd@someplace.com"
this would allow the classifier to take the additional tokens into
account and perform classification on elmer.fudd, fudd.someplace and
someplace.com as well.

Using a chain span of 2, this would leave you with:

elmer            could be real
elmer|fudd       not likely - good candidate for chaining
fudd             could be real
someplace        limit to dictionary search?
fudd|someplace   could be real - good candidate for chaining
com              could probably ignore the top-level domain
someplace.com    could also use a dictionary search (or DNS MX lookup);
                 if it fails the DNS lookup, it's probably bogus

You could also consider chain spans of 3.
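
To make the chaining concrete, here is a minimal sketch in Python of
how the tokens for a given span could be generated. The function name
chained_tokens and the choice to split on "." and "@" are my
assumptions; note that it writes the domain chain as someplace|com
rather than someplace.com:

```python
import re


def chained_tokens(address, span=2):
    """Return single tokens plus chains of up to `span` adjacent tokens."""
    # Split the address into its elements: elmer, fudd, someplace, com.
    parts = [p for p in re.split(r"[.@]", address.lower()) if p]
    tokens = []
    for width in range(1, span + 1):
        # Slide a window of `width` adjacent elements across the address.
        for i in range(len(parts) - width + 1):
            tokens.append("|".join(parts[i:i + width]))
    return tokens


print(chained_tokens("elmer.fudd@someplace.com"))
# ['elmer', 'fudd', 'someplace', 'com',
#  'elmer|fudd', 'fudd|someplace', 'someplace|com']
```

With span=3 the same call additionally yields elmer|fudd|someplace and
fudd|someplace|com. Each resulting token would then be scored by the
classifier like any single token.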

It's an interesting problem you are looking to solve - I think NBC might
be a further way to increase the accuracy of the test.

Good Luck!

Charles


theo wrote:
> Wow, my stupid little function is now more reliable than I ever dreamt of.
>
> Download the test version here:
>
> www.theo.ch/kylix/spamdet.zip
>
> It's for Kylix, but should compile under Windows with a few changes.
>
> I've added some manual learning to it, and it can now detect strange
> words in a sample text (TextDatei.txt) that consists of unedited wiki
> texts in English and German about snails. I've added some spam words to
> test it.
>
> The output for the entire text consists of some numbers and crap text
> and looks like this:
>
> 1797
> 000
> 78%
> Spammmtext
> 1996
> crappytexxxt
> 600
> HelpMeImWWzrong
> kdkdkdkdkd
> 100
>
>
> Cool, huh?
