Newsgroup: borland.public.delphi.internet.winsock
Subject: Re: spam sender addresses
Posted by: "theo" (nospam@for.me)
Date: Thu, 1 Jun 2006 23:02:20
Remy Lebeau (TeamB) wrote:
>
> Syntax-wise, those are valid addresses. So a syntax filtering algorithm
> will not help you. What you are asking for is a content filtering
> algorithm, and that is very hard to implement.
After reading some computational linguistics websites today, I gave up
on the idea of doing it "the scientific way".
Then I simply tried to write a function that does what I mean (see below).
This is all I need. The function guesses whether a word is a "natural"
word or consists of random characters (in which case Result > 0).
It flags 3 of the 4 words I initially posted in this thread as spam:
WxOHT -> not detected as spam
rokqawmwrp -> spam
jhmtnr -> spam
xwldsxu -> spam
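It simply analyzes the sequence of Soundex-similar letters, consonants
and vowels. A word is flagged when three letters in a row share a
Soundex code, or when it has three vowels or four consonants in a row.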
All my normal email contacts pass the test (Result=0), even
"Gschwandtner" and "Burckhardt".
That's really all I wanted: just to tag messages as "suspicious" based
on the sender (user name, not host).
The function can certainly be improved. I wrote it two hours ago...
----------------------
// Requires SysUtils (for StringReplace).
// Returns 0 if the word looks "natural"; every suspicious run of
// characters adds 1 to the result.
function AnalyzeWord(input: ShortString): Integer;
var
  i, len, countsdex, countvow, countcons: Integer;
  // Integer rather than Byte: Score can return -1 (for H and W)
  tmpChar, last: Integer;
const
  CSoundexTable: array[65..122] of ShortInt =
  // A  B  C  D  E  F  G   H  I  J  K  L  M
    (0, 1, 2, 3, 0, 1, 2, -1, 0, 2, 2, 4, 5,
  // N  O  P  Q  R  S  T  U  V   W  X  Y  Z
     5, 0, 1, 2, 6, 2, 3, 0, 1, -1, 2, 0, 2,
  // [  \  ]  ^  _  `
     0, 0, 0, 0, 0, 0,
  // a  b  c  d  e  f  g   h  i  j  k  l  m
     0, 1, 2, 3, 0, 1, 2, -1, 0, 2, 2, 4, 5,
  // n  o  p  q  r  s  t  u  v   w  x  y  z
     5, 0, 1, 2, 6, 2, 3, 0, 1, -1, 2, 0, 2);

  // Soundex code of a character; anything outside the table counts as 0
  function Score(AChar: Integer): Integer;
  begin
    Result := 0;
    if (AChar >= Low(CSoundexTable)) and
       (AChar <= High(CSoundexTable)) then
      Result := CSoundexTable[AChar];
  end;

  // Count one suspicious run and restart all counters
  procedure ExceptionalCase;
  begin
    Inc(Result);
    countsdex := 0;
    countvow := 0;
    countcons := 0;
  end;

begin
  last := 99; // sentinel that matches no table value
  countsdex := 0;
  countvow := 0;
  countcons := 0;
  Result := 0;
  // Replace some known combinations (I don't know the linguistic term
  // for this). This reduces false alerts for names like
  // "Gschwandtner" --> "Gswantner" and "Burckhardt" --> "Burkhart".
  input := StringReplace(input, 'sch', 's', [rfReplaceAll, rfIgnoreCase]);
  input := StringReplace(input, 'dt', 't', [rfReplaceAll, rfIgnoreCase]);
  input := StringReplace(input, 'sh', 's', [rfReplaceAll, rfIgnoreCase]);
  input := StringReplace(input, 'ck', 'k', [rfReplaceAll, rfIgnoreCase]);
  input := StringReplace(input, 'ch', 'k', [rfReplaceAll, rfIgnoreCase]);
  input := StringReplace(input, 'th', 't', [rfReplaceAll, rfIgnoreCase]);
  input := StringReplace(input, 'oao', 'oan',
    [rfReplaceAll, rfIgnoreCase]); // Portuguese "Joao"
  // to be extended
  len := Length(input);
  for i := 1 to len do
  begin
    tmpChar := Score(Ord(input[i]));
    if last = tmpChar then Inc(countsdex) else countsdex := 0;
    if tmpChar = 0 then Inc(countvow) else countvow := 0;
    if tmpChar <> 0 then Inc(countcons) else countcons := 0;
    last := tmpChar;
    if countsdex > 1 then ExceptionalCase
    else if countvow > 2 then ExceptionalCase
    else if countcons > 3 then ExceptionalCase;
  end;
end;
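
Since only the user name (not the host) is meant to be checked, the
address has to be split before AnalyzeWord is called. A minimal sketch
of that step, assuming the sender address has the usual user@host form.
UserNameOf is a hypothetical helper that is not part of the function
above, and the address is made up:

```pascal
program SenderCheck;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical helper (not from the post): returns the part of an
// address before the first '@', or the whole string if there is none.
function UserNameOf(const Address: string): string;
var
  p: Integer;
begin
  p := Pos('@', Address);
  if p > 0 then
    Result := Copy(Address, 1, p - 1)
  else
    Result := Address;
end;

begin
  // Made-up address; only the user name part is of interest.
  WriteLn(UserNameOf('rokqawmwrp@example.com'));
  // With AnalyzeWord in scope, a message could then be tagged with:
  //   if AnalyzeWord(UserNameOf(Sender)) > 0 then ...
end.
```

AnalyzeWord takes a ShortString, so a user name longer than 255
characters would be cut off when it is passed in.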