Newsgroups : Borland : borland.public.delphi.internet.winsock : 2006 Jun : Re: spam sender addresses
Subject: Re: spam sender addresses
Posted by: "Charles Stack" (csta..@codysystems.com)
Date: Mon, 5 Jun 2006 16:25:15
Have you considered using a naive Bayesian classifier (NBC) to further
refine your checks? One approach would be to compute, for each element
of the address, the probability that it is spam vs non-spam.
It could then "learn" from what you call valid vs non-valid if you
provide a batch or some sort of incremental update / purge mechanism.
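
To make the idea concrete, here is a minimal sketch of such a classifier in Python (not Delphi/Kylix, purely for illustration). Everything in it is an assumption rather than part of the original function: the names AddressNBC, learn, forget and spam_probability are hypothetical, the address is split on "@" and ".", and add-one smoothing is used so unseen tokens do not zero out the result.

```python
import math
import re
from collections import Counter


def tokenize(address):
    """Split an address into lowercase tokens on '@' and '.'."""
    return [t for t in re.split(r"[@.]", address.lower()) if t]


class AddressNBC:
    """Hypothetical naive Bayes classifier over address tokens."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def learn(self, address, label):
        """Incremental update: add one labelled address."""
        self.counts[label].update(tokenize(address))
        self.docs[label] += 1

    def forget(self, address, label):
        """Purge: undo an earlier learn() for this address."""
        for tok in tokenize(address):
            if self.counts[label][tok] > 0:
                self.counts[label][tok] -= 1
            if self.counts[label][tok] == 0:
                del self.counts[label][tok]  # no error if the key is absent
        self.docs[label] = max(0, self.docs[label] - 1)

    def spam_probability(self, address):
        """Return P(spam | tokens) using add-one smoothing."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"])) + 1
        n_spam = sum(self.counts["spam"].values())
        n_ham = sum(self.counts["ham"].values())
        # Smoothed prior odds, then one likelihood ratio per token.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in tokenize(address):
            p_spam = (self.counts["spam"][tok] + 1) / (n_spam + vocab)
            p_ham = (self.counts["ham"][tok] + 1) / (n_ham + vocab)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in range
        return 1.0 / (1.0 + math.exp(-log_odds))
```

You would call learn() for each address you have judged valid ("ham") or non-valid ("spam"), and forget() to purge a judgement you later change your mind about. A result above 0.5 from spam_probability() leans towards spam.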
You could also look at the concept of chained tokens when using a
classifier, in addition to straight NBC. This would allow the classifier
to take into account the additional tokens of "elmer.fudd@someplace.com"
and perform classification on the chained tokens elmer.fudd,
fudd.someplace and someplace.com as well as on the single tokens.
Using a chain span of 2, this would leave you with:
elmer           could be real
elmer|fudd      not likely - good candidate for chaining
fudd            could be real
someplace       limit to dictionary search?
fudd|someplace  could be real - good candidate for chaining
com             could probably ignore the top-level domain
someplace.com   could also use a dictionary search (or DNS MX
                lookup); if it fails the DNS lookup, it's probably bogus
You could also consider a chain span of 3.
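
Generating the chained tokens could look roughly like the sketch below (Python, for illustration only). The function name chained_tokens and its parameters are hypothetical, and it joins every chain with "|", including the domain part that is written as someplace.com above.

```python
import re


def chained_tokens(address, max_span=2, sep="|"):
    """Return single tokens plus chains of up to max_span adjacent tokens."""
    tokens = [t for t in re.split(r"[@.]", address.lower()) if t]
    chains = []
    for span in range(1, max_span + 1):
        # Slide a window of `span` tokens across the address.
        for i in range(len(tokens) - span + 1):
            chains.append(sep.join(tokens[i:i + span]))
    return chains


print(chained_tokens("elmer.fudd@someplace.com"))
print(chained_tokens("elmer.fudd@someplace.com", max_span=3))
```

Each chain can then be fed to the classifier like any other token, so a pair such as fudd|someplace gets its own spam vs non-spam counts.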
It's an interesting problem you are looking to solve - I think NBC might
be a further way to increase the accuracy of the test.
Good Luck!
Charles
theo wrote:
> Wow, my stupid little function is now more reliable than I ever dreamt of.
>
> Download the test version here:
>
> www.theo.ch/kylix/spamdet.zip
>
> It's for Kylix, but should compile under Windows with few changes.
>
> I've added some manual learning to it, and it can now detect strange
> words in a sample text (TextDatei.txt) that consists of unedited wiki
> texts in English and German about snails. I've added some spam words to
> test it.
>
> The output of the entire text consists of some numbers and crap text and
> looks like this
>
> 1797
> 000
> 78%
> Spammmtext
> 1996
> crappytexxxt
> 600
> HelpMeImWWzrong
> kdkdkdkdkd
> 100
>
>
> Cool hu?