LinuxLists.cc - Spam, bogofilter, etc

2006-09-29 14:22:32

by Lee Revell

Subject: Spam, bogofilter, etc

What ever happened with bogofilter on vger? The spam problem is
considerably worse in the past few weeks.

Lee

2006-09-29 14:29:05

by Ismail Dönmez

Subject: Re: Spam, bogofilter, etc

On Friday 29 September 2006 17:23, you wrote:
> What ever happened with bogofilter on vger? The spam problem is
> considerably worse in the past few weeks.

It's busy filtering legitimate ham messages, I guess ;-)

Sorry, couldn't resist.

/ismail
--
They that can give up essential liberty to obtain a little temporary safety
deserve neither liberty nor safety.
-- Benjamin Franklin

2006-10-01 23:23:21

by Chris Wedgwood

Subject: Re: Spam, bogofilter, etc

On Fri, Sep 29, 2006 at 10:23:12AM -0400, Lee Revell wrote:

> What ever happened with bogofilter on vger? The spam problem is
> considerably worse in the past few weeks.

run it locally and see how well it works for you (my guess is not very
well)

2006-10-02 00:42:13

by Kasper Sandberg

Subject: Re: Spam, bogofilter, etc

On Sun, 2006-10-01 at 16:23 -0700, Chris Wedgwood wrote:
> On Fri, Sep 29, 2006 at 10:23:12AM -0400, Lee Revell wrote:
>
> > What ever happened with bogofilter on vger? The spam problem is
> > considerably worse in the past few weeks.
>
> run it locally and see how well it works for you (my guess is not very
> well)
This latest spam seems relentless. I have spent 15 minutes moving it all
away to have my filter learn from it, yet it doesn't seem to work.

2006-10-02 10:03:04

by Matti Aarnio

Subject: Re: Spam, bogofilter, etc

On Fri, Sep 29, 2006 at 10:23:12AM -0400, Lee Revell wrote:
> What ever happened with bogofilter on vger? The spam problem is
> considerably worse in the past few weeks.

It is globbing (and blocking) lots of spam.
And now it is rarely blocking ham - under 10 cases per week.

Yes, the thing is NOT 100% perfect.
Especially very short spams are prone to leak thru it,
and those hams that do get blocked tend to be longish, and
never before seen. (It all comes from Bayes Statistics.)

However what is now apparent is that we are no longer
adding very many new patterns into the Majordomo filter.
Indeed we are taking off old patterns that bite at wrong things.

I do think that Markov Chains combined with Bayes Statistics
might do a wee bit better. (Except with very short emails.)
However all that these things are able to do is essentially
grow the key database when spammers are producing new mutated
(mis-spelled) texts by mixing in spaces, punctuations, and even
occasional characters.

For recognizing those pill merchants one needs complex software
to read the site at the URL, and to read texts out of the IMAGES
at the site. Captcha to get thru spam filters...

The idea of closed lists is ever more appealing.
We do need to do something for the bug-report addresses like
[email protected] -- so that addresses specified for
receiving the bug reports will still receive them.

I can do fairly easily "this address is allowed to post" type
filters at Majordomo - it has a way to specify allowed posters.
Usually it is used to permit list members to post, but it can
also be configured to use other datasets.

> Lee

/Matti Aarnio

2006-10-02 15:21:14

by Lee Revell

Subject: Re: Spam, bogofilter, etc

On Mon, 2006-10-02 at 13:03 +0300, Matti Aarnio wrote:
> I do think that Markov Chains combined with Bayes Statistics
> might do a wee bit better. (Except with very short emails.)
> However all that these things are able to do is essentially
> grow the key database when spammers are producing new mutated
> (mis-spelled) texts by mixing in spaces, punctuations, and even
> occasional characters.
>
> For recognizing those pill merchants one needs complex software
> to read the site at the URL, and to read texts out of the IMAGES
> at the site. Captcha to get thru spam filters...
>

Could a heuristic be added to reject messages with wildly incorrect
dates? I notice that the last 5-10 messages in my LKML folder every
morning are spam with a date that's ~24 hours in the future.
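
A minimal sketch of such a date heuristic, in Python. It assumes the filter
knows when the message actually arrived; the function name, the 12-hour
tolerance, and the choice to flag a missing Date: header are illustrative
assumptions, not anything vger is known to run.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical tolerance: how far Date: may disagree with the time of receipt.
MAX_SKEW = timedelta(hours=12)


def has_wild_date(raw_message, received_at):
    """Return True if Date: is missing, unparseable, or far from received_at.

    received_at must be a timezone-aware datetime.
    """
    msg = message_from_string(raw_message)
    header = msg.get("Date")
    if header is None:
        return True  # assumption: treat a missing Date: header as suspicious
    try:
        sent = parsedate_to_datetime(header)
    except (TypeError, ValueError):
        return True
    if sent.tzinfo is None:
        # The header carried no usable zone information: treat it as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    return abs(sent - received_at) > MAX_SKEW
```

A message dated about 24 hours ahead of its arrival time would be flagged,
while ordinary clock drift and time-zone differences stay inside the tolerance.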

Lee

2006-10-02 15:26:09

by Martin Bligh

Subject: Re: Spam, bogofilter, etc

Lee Revell wrote:
> On Mon, 2006-10-02 at 13:03 +0300, Matti Aarnio wrote:
>> I do think that Markov Chains combined with Bayes Statistics
>> might do a wee bit better. (Except with very short emails.)
>> However all that these things are able to do is essentially
>> grow the key database when spammers are producing new mutated
>> (mis-spelled) texts by mixing in spaces, punctuations, and even
>> occasional characters.
>>
>> For recognizing those pill merchants one needs complex software
>> to read the site at the URL, and to read texts out of the IMAGES
>> at the site. Captcha to get thru spam filters...
>>
>
> Could a heuristic be added to reject messages with wildly incorrect
> dates? I notice that the last 5-10 messages in my LKML folder every
> morning are spam with a date that's ~24 hours in the future.

If you got rid of "slut" and "schoolgirl" that'd get rid of half of it.

M.

2006-10-02 15:48:45

by Lee Revell

Subject: Re: Spam, bogofilter, etc

On Mon, 2006-10-02 at 08:24 -0700, Martin J. Bligh wrote:
> > Could a heuristic be added to reject messages with wildly incorrect
> > dates? I notice that the last 5-10 messages in my LKML folder every
> > morning are spam with a date that's ~24 hours in the future.
>
> If you got rid of "slut" and "schoolgirl" that'd get rid of half of
> it.

That will work for a day then they'll just change the spelling. But
I've seen spammers using incorrect dates (presumably to appear at the
beginning or end of a mailbox) for years.

You could also flag a very short message that contains a URL and is not
a reply to an existing thread - I can't think of a legitimate post to
LKML fitting this pattern.
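
That pattern could be sketched roughly as follows, in Python. The 30-word
cut-off, the URL regular expression, and the function name are assumptions
made for illustration; multipart messages are simply skipped here to keep the
example short.

```python
import re
from email import message_from_string

URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
MAX_BODY_WORDS = 30  # hypothetical cut-off for "very short"


def looks_like_drive_by_url(raw_message):
    """Flag a short, non-reply, single-part message whose body contains a URL."""
    msg = message_from_string(raw_message)
    # A reply to an existing thread normally carries one of these headers.
    is_reply = bool(msg.get("In-Reply-To") or msg.get("References"))
    if is_reply or msg.is_multipart():
        return False
    body = msg.get_payload()
    is_short = len(body.split()) <= MAX_BODY_WORDS
    return is_short and URL_RE.search(body) is not None
```

A flag like this is probably better used as one input to a score than as an
outright reject, since a short announcement with a link is not impossible.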

I would hate to see the list closed as that would amount to surrendering
to the spammers.

Lee

2006-10-02 16:40:53

by Linus Torvalds

Subject: Re: Spam, bogofilter, etc

On Mon, 2 Oct 2006, Martin J. Bligh wrote:
>
> If you got rid of "slut" and "schoolgirl" that'd get rid of half of it.

The problem with bogo-filter is that THE WHOLE CONCEPT IS FLAWED.

I'm sorry, but spam-filtering is simply harder than the bayesian
word-count weenies think it is. I even used to _know_ something about
bayesian filtering, since it was one of the projects I worked on at uni,
and dammit, it's not a good approach, as shown by the fact that it's
trivial to get around.

I don't know why people got so excited about the whole bayesian thing.
It's fine as _one_ small clause in a bigger framework of deciding spam,
but it's totally inappropriate for a "yes/no" kind of decision on its own.

If you want a yes/no kind of thing, do it on real hard issues, like not
accepting email from machines that aren't registered MX gateways. Sure,
that will mean that people who just set up their local sendmail thing and
connect directly to port 25 will just not be able to email, but let's face
it, that's why we have ISP's and DNS in the first place.

But don't do it purely on some bogus word analysis.

If you want to do word analysis, use it like SpamAssassin does it - with
some Bayesian rule perhaps adding a few points to the score. That's
entirely appropriate. But running bogo-filter _instead_ of spamassassin is
just asinine.
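
The "few points to the score" idea can be sketched like this, in Python. The
rule names, point values, and threshold below are made up for illustration
and are not SpamAssassin's actual rules or scores; the point is only that the
Bayesian estimate is capped so it cannot decide the outcome alone.

```python
SPAM_THRESHOLD = 5.0  # hypothetical cut-off for calling a message spam


def bayes_points(p_spam):
    """Map a Bayesian spam probability to a small, capped contribution."""
    if p_spam >= 0.99:
        return 3.5
    if p_spam >= 0.80:
        return 2.0
    if p_spam <= 0.05:
        return -1.0
    return 0.0


def total_score(rule_hits, p_spam):
    """Add the points of every rule that fired to the Bayesian contribution."""
    return sum(rule_hits.values()) + bayes_points(p_spam)


def is_spam(rule_hits, p_spam):
    return total_score(rule_hits, p_spam) >= SPAM_THRESHOLD
```

With these example numbers even a near-certain Bayesian verdict contributes
3.5 points, so at least one other rule has to fire before a message is
rejected.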

Linus

2006-10-02 17:24:36

by Alan

Subject: Re: Spam, bogofilter, etc

On Mon, 2006-10-02 at 09:40 -0700, Linus Torvalds wrote:
> If you want a yes/no kind of thing, do it on real hard issues, like not
> accepting email from machines that aren't registered MX gateways. Sure,
> that will mean that people who just set up their local sendmail thing and
> connect directly to port 25 will just not be able to email, but let's face
> it, that's why we have ISP's and DNS in the first place.

Except most of the ISPs are incompetent and many people have to run
their own mail system in order to get mail that actually *works*. I've
had that experience several times, although thankfully I now have a sane
ISP.

MX checking is as broken or more broken than bayes.

There is another reason bayes is not very good too - every good spammer
reruns their message through spamassassin adding random text till they
get a good score *then* they spew it out.

Alan

2006-10-02 17:34:26

by David Lang

Subject: Re: Spam, bogofilter, etc

On Mon, 2 Oct 2006, Alan Cox wrote:

> On Mon, 2006-10-02 at 09:40 -0700, Linus Torvalds wrote:
>> If you want a yes/no kind of thing, do it on real hard issues, like not
>> accepting email from machines that aren't registered MX gateways. Sure,
>> that will mean that people who just set up their local sendmail thing and
>> connect directly to port 25 will just not be able to email, but let's face
>> it, that's why we have ISP's and DNS in the first place.
>
> Except most of the ISPs are incompetent and many people have to run
> their own mail system in order to get mail that actually *works*. I've
> had that experience several times, although thankfully I now have a sane
> ISP.
>
> MX checking is as broken or more broken than bayes.
>
> There is another reason bayes is not very good too - every good spammer
> reruns their message through spamassassin adding random text till they
> get a good score *then* they spew it out.

that's why you don't use a fixed table like that. if the table is customized for
your mail then it's unlikely to agree with anyone else's, so mail that will get
through their filter won't get through yours (and vice versa)

David Lang

2006-10-02 17:35:15

by Thomas Davis

Subject: Re: Spam, bogofilter, etc

Matti Aarnio wrote:
>
> I can do fairly easily "this address is allowed to post" type
> filters at Majordomo - it has a way to specify allowed posters.
> Usually it is used to permit list members to post, but it can
> also be configured to use other datasets.
>
>

Sounds like a version of greylisting.

I know, people will argue against it - but it does work, and it
still works. I use it on a website I run, and it cuts 2k+ spams per
day into about 50 that spamassassin has to process.

Yes, there are broken mailers that cannot deal with it properly.
They should fix their mailers.

thomas

2006-10-02 17:40:13

by Erik Andersen

Subject: Re: Spam, bogofilter, etc

On Mon Oct 02, 2006 at 11:48:57AM -0400, Lee Revell wrote:
> You could also flag a very short message that contains a URL and is not
> a reply to an existing thread - I can't think of a legitimate post to
> LKML fitting this pattern.

Blocking emails containing URLs pointing to domains registered
less than a week ago would block most of the recent spams.

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2006-10-02 18:02:43

by Linus Torvalds

Subject: Re: Spam, bogofilter, etc

On Mon, 2 Oct 2006, Alan Cox wrote:
>
> On Mon, 2006-10-02 at 09:40 -0700, Linus Torvalds wrote:
> > If you want a yes/no kind of thing, do it on real hard issues, like not
> > accepting email from machines that aren't registered MX gateways. Sure,
> > that will mean that people who just set up their local sendmail thing and
> > connect directly to port 25 will just not be able to email, but let's face
> > it, that's why we have ISP's and DNS in the first place.
>
> Except most of the ISPs are incompetent and many people have to run
> their own mail system in order to get mail that actually *works*. I've
> had that experience several times, although thankfully I now have a sane
> ISP.

Sure. I kind of agree - I'm just saying that if you have a _hard_
decision, you should base it on _hard_ data.

The MX checking is at least hard, and is a valid reason to just deny
email. I'm not claiming it's "perfect", but it's a hell of a lot better
than bayes.

> MX checking is as broken or more broken than bayes.

I have to say, OSDL has been doing MX checking, and it's effective as
hell. Most importantly, when it _does_ break, it's not because some
"content" is considered inappropriate, it's because some ISP does
something technically wrong.

OSDL also refused to talk to open mail relays etc. I got into something of
a (fairly civilized) shouting match with John Gilmore over it, who used to
send out email from a "fake open mail relay" on principle (maybe he still
does). He claimed I was censoring his free speech rights when I didn't
read his emails, but I just told him that I was expressing my right to not
listen to people who are so stupid that they can't configure their email
servers.

(I'm not saying that John is stupid, since he did it on purpose, but he
was also clever enough to know exactly what was involved, so it's not like
he couldn't be heard if he wanted to - it's not "censoring" if nobody
listens to you because you built your own sound-proof walls around you).

> There is another reason bayes is not very good too - every good spammer
> reruns their message through spamassassin adding random text till they
> get a good score *then* they spew it out.

Yes. Which is why it's better to rely on hard technical data, or on a
large body of different small rules, including some that are personalized
(ie white-lists and blacklists that are site-specific, including making
things like the bayesian rules be per-site - perhaps _seeded_ by some
common data, but updated locally).

Of course, the MX checking can also be avoided, and a lot of spam-bots
know to use the ISP connection instead of a direct port-25 approach. But
at least that way, the mail gateway can (and often does) notice the
flooding, and many ISP's successfully throttle at least some spam at the
source, so it does actually have real meaning.

Linus

2006-10-02 18:07:58

by Martin Bligh

Subject: Re: Spam, bogofilter, etc

>>MX checking is as broken or more broken than bayes.
>
> I have to say, OSDL has been doing MX checking, and it's effective as
> hell. Most importantly, when it _does_ break, it's not because some
> "content" is considered inappropriate, it's because some ISP does
> something technically wrong.
>
> OSDL also refused to talk to open mail relays etc. I got into something of
> a (fairly civilized) shouting match with John Gilmore over it, who used to
> send out email from a "fake open mail relay" on principle (maybe he still
> does). He claimed I was censoring his free speech rights when I didn't
> read his emails, but I just told him that I was expressing my right to not
> listen to people who are so stupid that they can't configure their email
> servers.

That was actually pretty broken. Sending Andrew email stopped working
for ages. IIRC because I was sending email from my home address through
the IBM work server. It's not a trouble-free solution, and otherwise
fairly reasonable things stop working. I forget what the OSDL admins
did in the end ... I think put in a specific exception for an IP range.

M.

2006-10-02 18:23:08

by Valdis Klētnieks

Subject: Re: Spam, bogofilter, etc

On Mon, 02 Oct 2006 11:02:36 PDT, Linus Torvalds said:

> > MX checking is as broken or more broken than bayes.
>
> I have to say, OSDL has been doing MX checking, and it's effective as
> hell. Most importantly, when it _does_ break, it's not because some
> "content" is considered inappropriate, it's because some ISP does
> something technically wrong.

How did OSDL's MX checking deal with split in/out configurations like ours,
where our MX points at a load-balanced farm of Mirapoint front end appliances
with 1 IP address, but our main off-campus *outbound* comes from a different
address?

2006-10-02 18:29:48

by Linus Torvalds

Subject: Re: Spam, bogofilter, etc

On Mon, 2 Oct 2006, [email protected] wrote:
>
> How did OSDL's MX checking deal with split in/out configurations like ours,
> where our MX points at a load-balanced farm of Mirapoint front end appliances
> with 1 IP address, but our main off-campus *outbound* comes from a different
> address?

Hey, if I knew what I was doing, I'd be in MIS.

As it is, I just criticise other people's patches.

Linus

2006-10-02 19:32:05

by jdow

Subject: Re: Spam, bogofilter, etc

From: "Linus Torvalds" <[email protected]>
> On Mon, 2 Oct 2006, [email protected] wrote:
>>
>> How did OSDL's MX checking deal with split in/out configurations like ours,
>> where our MX points at a load-balanced farm of Mirapoint front end appliances
>> with 1 IP address, but our main off-campus *outbound* comes from a different
>> address?
>
> Hey, if I knew what I was doing, I'd be in MIS.
>
> As it is, I just criticise other people's patches.

DK or DKIM comes to mind. SpamAssassin 3.1.5 handles it neatly.

Offhand, expecting a list to maintain perfect anti-spam filtering is
rather difficult. Distributed processing works better. Folks should have
their own anti-spam tools and train them to their own preferences.

(It helps with a list like this one to have a SpamAssassin meta
rule that boosts the scores for BAYES_80 and above while reducing
scores for BAYES_40 and below. It also helps to run a lot of the
SARE, SpamAssassin Rules Emporium, rule sets. Pick and choose for
your particular needs. http://www.rulesemporium.com/rules)

{^_^} Joanne

2006-10-02 19:31:53

by Antonio Vargas

Subject: Re: Spam, bogofilter, etc

On 10/2/06, Linus Torvalds <[email protected]> wrote:
>
>
> On Mon, 2 Oct 2006, [email protected] wrote:
> >
> > How did OSDL's MX checking deal with split in/out configurations like ours,
> > where our MX points at a load-balanced farm of Mirapoint front end appliances
> > with 1 IP address, but our main off-campus *outbound* comes from a different
> > address?
>
> Hey, if I knew what I was doing, I'd be in MIS.
>

I'd rather say you are not in MIS exactly because you prefer knowing
what you are doing.

> As it is, I just criticise other people's patches.
>
> Linus

--
Greetz, Antonio Vargas aka winden of network

Every day, every year
you have to work
you have to study
you have to scene.

2006-10-02 21:33:54

by Horst H. von Brand

Subject: Re: Spam, bogofilter, etc

Linus Torvalds <[email protected]> wrote:

[...]

> If you want a yes/no kind of thing, do it on real hard issues, like not
> accepting email from machines that aren't registered MX gateways. Sure,
> that will mean that people who just set up their local sendmail thing and
> connect directly to port 25 will just not be able to email, but let's face
> it, that's why we have ISP's and DNS in the first place.

Larger sites have incoming (MX) machines and outgoing (non-MX) ones... so this
is useless. And the whole SPF fiasco shows that such mechanisms (DNS based,
remote site publishes the data) are even easier to bypass (I've seen
statistics showing that the overwhelming majority of SPF-"protected" email
is spam).

What does work rather well is greylisting (on first try tell them to come
back later, spammers rarely retry their junk).
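
The mechanism can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the class name, the in-memory
dictionary, and the 5-minute minimum delay are hypothetical, and a real
implementation would persist its state and expire old entries.

```python
from datetime import datetime, timedelta

MIN_DELAY = timedelta(minutes=5)  # hypothetical wait before a retry is accepted


class Greylist:
    """Temporarily reject the first delivery attempt of each unseen triplet."""

    def __init__(self):
        # (client IP, envelope sender, envelope recipient) -> time first seen
        self._first_seen = {}

    def check(self, client_ip, mail_from, rcpt_to, now):
        """Return an SMTP reply code: 451 (try again later) or 250 (accept)."""
        key = (client_ip, mail_from.lower(), rcpt_to.lower())
        first = self._first_seen.setdefault(key, now)
        if now - first < MIN_DELAY:
            return 451
        return 250
```

A real mail server queues the message and retries after the delay, at which
point the triplet is old enough to be accepted; a bulk mailer that never
retries is never let through.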

Add blacklists (sadly, there are few reliable ones, AFAICS) and you cut it
down even more.

And yes, there is no silver bullet. This is an arms race, get a new
anti-spam device (filter configuration, ...) and soon they will figure out
how to bypass it.

In any case, I've seen claims that around 80% of email now is spam. That
it is still only a little in LKML says that the listmasters are doing an
outstanding job.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 2654431
Universidad Tecnica Federico Santa Maria +56 32 2654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 2797513

2006-10-02 21:33:44

by Alan

Subject: Re: Spam, bogofilter, etc

> Of course, the MX checking can also be avoided, and a lot of spam-bots
> know to use the ISP connection instead of a direct port-25 approach. But
> at least that way, the mail gateway can (and often does) notice the
> flooding, and many ISP's successfully throttle at least some spam at the
> source, so it does actually have real meaning.

Actually some of the smarter big ISPs with the less technical customers
transproxy port 25 anyway - using big Linux boxes and the netfilter
code.

2006-10-03 03:37:57

by dean gaudet

Subject: Re: Spam, bogofilter, etc

On Mon, 2 Oct 2006, Erik Andersen wrote:

> On Mon Oct 02, 2006 at 11:48:57AM -0400, Lee Revell wrote:
> > You could also flag a very short message that contains a URL and is not
> > a reply to an existing thread - I can't think of a legitimate post to
> > LKML fitting this pattern.
>
> Blocking emails containing URLs pointing to domains registered
> less than a week ago would block most of the recent spams.

unless they changed pattern the past week this wouldn't work... two weeks
ago the domains from the 1-liner porn spams were registered 3 or 4 months
ago. i checked a dozen+ of them looking for anything useful for
filtering.

if you visited the urls they lead to the same web page text -- something
so obviously a porn front-door even bayes could have got it right. (i.e.
"are you 18?").

it sure would be nice if posting were subscribers-only.

-dean

2006-10-03 04:06:09

by NeilBrown

Subject: Re: Spam, bogofilter, etc

On Monday October 2, [email protected] wrote:
>
> it sure would be nice if posting were subscribers-only.
>

How about adding a header to messages posted by non-subscribers
X-vger-kernel-org: non-subscriber

and maybe even a different footer:

--
This mail was from a non-subscriber and may not have reached all
subscribers.

Then people who think the list should be subscriber-only could enforce
that locally, and people who want to be more broad-minded still have
that option.

Then something like:
IF it is from a subscriber
OR it has a subject mentioning my area of interest
THEN let it through
ELSE apply strict spam checks.

I'd really like to add
OR in reply to some lkml message, but as Message-id: is left
untouched when the mail is forwarded (probably a good thing) that
doesn't seem to be possible.
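
A rough Python rendering of the rule above, for illustration only. It assumes
the list really did add the proposed X-vger-kernel-org header (it is only a
suggestion in this mail), and the function names and interest terms are
hypothetical.

```python
from email import message_from_string


def is_from_subscriber(msg):
    """Assume the list marks non-subscriber posts with the proposed header."""
    marker = msg.get("X-vger-kernel-org", "")
    return marker.strip().lower() != "non-subscriber"


def needs_strict_checks(raw_message, interests):
    """Return False to let the mail through, True to apply strict spam checks."""
    msg = message_from_string(raw_message)
    subject = msg.get("Subject", "").lower()
    mentions_interest = any(term.lower() in subject for term in interests)
    return not (is_from_subscriber(msg) or mentions_interest)
```

Plain substring matching on the subject is crude; very short interest terms
would match far too much.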

NeilBrown

2006-10-03 06:08:34

by Paul Zimmerman

Subject: Re: Spam, bogofilter, etc

Oh _come on_! Do you guys mean to say that none of
these Bogofilter/SpamAssassin/MX do-hickies can figure
out that a message titled "Youngest pleasantly
Schoolgirls fuckedd by oldman" is probably spam?
That's ridiculous! Run a spell-checker on the title,
and then filter it. How hard could that be?

--
Paul Z.

2006-10-03 07:02:16

by NeilBrown

Subject: Re: Spam, bogofilter, etc

On Monday October 2, [email protected] wrote:
> Oh _come on_! Do you guys mean to say that none of these
> Bogofilter/SpamAssassin/MX do-hickies can figure out that a message
> titled "Youngest pleasantly Schoolgirls fuckedd by oldman" is probably
> spam? That's ridiculous! Run a spell-checker on the title, and then
> filter it. How hard could that be?

Sounds sensible. Did you have a procmail stanza I could test out :-)

NeilBrown

2006-10-03 08:15:10

by John Graham-Cumming

Subject: Re: Spam, bogofilter, etc

2006-10-03 08:53:53

by Howard Chu

Subject: Re: Spam, bogofilter, etc

John Graham-Cumming wrote:
> Linus Torvalds <torvalds <at> osdl.org> writes:
>> I'm sorry, but spam-filtering is simply harder than the bayesian
>> word-count weenies think it is. I even used to _know_ something about
>> bayesian filtering, since it was one of the projects I worked on at uni,
>> and dammit, it's not a good approach, as shown by the fact that it's
>> trivial to get around.

> Have you actually followed any of the research into Bayesian (and similar
> machine learning based) anti-spam filtering, and attacks on such filters? Are
> you making a claim that these filters are 'trivial to get around' based on a
> project you did at University over 10 years ago?

Well the recent spate of spams with technical/jargon keywords in their
subjects was enough to make my Seamonkey client start marking all
incoming mail as spam. Interesting that recent journals talk about this
as an approach to get spam past current filters; instead it had a
reverse effect.

So much for email management at our hosting provider. At least on my
highlandsun.com domain I've got my own sendmail milter blocking spams
before they get into the server. It's basically the equivalent of a
sendmail accessdb in LDAP, plus simple rules to reject relays from
unregistered IP addresses, or addresses with dynamically generated
hostnames. Rejecting with 451 temporary failure is also useful, most
bulk mailer programs fail immediately and go away. Real mail servers
will retry; by looking at the logs of the envelope FROM and RCPT I can
pick out any emails that should have been let thru and add an OK
exception to LDAP so the message eventually gets redelivered. I suppose
I could put a URL in the reject error message, and let the sender
confirm it from there. At this point the only spam that gets thru is
from dedicated mass marketers with legitimate DNS registrations and I
just manually add their subnets to my blacklist.

(One then is faced with the interesting question - what if someone from
one of those companies was actually trying to hire my services? Their
loss I guess, sometimes money really is tainted...)
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/

2006-10-03 09:46:50

by Helge Hafting

Subject: Re: Spam, bogofilter, etc

Linus Torvalds wrote:
> On Mon, 2 Oct 2006, Martin J. Bligh wrote:
>
>> If you got rid of "slut" and "schoolgirl" that'd get rid of half of it.
>>
>
> The problem with bogo-filter is that THE WHOLE CONCEPT IS FLAWED.
>
Perhaps, but it works remarkably well anyway. After training with a
few thousand messages of each kind the amount of wrong
decisions is low. Each month I retrain the filter with the 20
or so messages it wasn't able to classify. (I sort into
spam, nonspam, and "dubious".)

Helge Hafting

2006-10-03 10:15:22

by Devdas Bhagat

Subject: Re: Spam, bogofilter, etc

Linus Torvalds <torvalds <at> osdl.org> writes:

<snip>
> I'm sorry, but spam-filtering is simply harder than the bayesian
> word-count weenies think it is. I even used to _know_ something about

Spam stopping is harder than anyone thinks it is. Spam is about consent, not
content, and we have no really reliable way yet of knowing consent (except a
pure whitelist).

> If you want a yes/no kind of thing, do it on real hard issues, like not
> accepting email from machines that aren't registered MX gateways. Sure,

Uhm, MX is for receiving mail, not sending it. Plenty of organisations have
different hosts for MX MTAs and outbound MTAs. I work in that field, so just a
warning note for anyone who wants to take Linus' advice.

Devdas Bhagat

2006-10-03 11:10:49

by Gordon Cormack

Subject: Re: Spam, bogofilter, etc

Linus Torvalds <torvalds <at> osdl.org> writes:

> I'm sorry, but spam-filtering is simply harder than the bayesian
> word-count weenies think it is. I even used to _know_ something about
> bayesian filtering, since it was one of the projects I worked on at uni,
> and dammit, it's not a good approach, as shown by the fact that it's
> trivial to get around.

Linus, I've seen no evidence that statistical filters are trivial
to beat. Can you provide some?

> I don't know why people got so excited about the whole bayesian thing.
> It's fine as _one_ small clause in a bigger framework of deciding spam,
> but it's totally inappropriate for a "yes/no" kind of decision on its own.

Why is that? Statistical filters (so-called 'Bayesian') have lower
false positive and false negative rates than many other approaches.
Bogofilter is one of the better ones, although it is not particularly
Bayesian.

> If you want a yes/no kind of thing, do it on real hard issues, like not
> accepting email from machines that aren't registered MX gateways. Sure,
> that will mean that people who just set up their local sendmail thing and
> connect directly to port 25 will just not be able to email, but let's face
> it, that's why we have ISP's and DNS in the first place.

You are saying that this sort of false positive is acceptable to
you. With no corresponding claim as to the corresponding false
negative rate.

So-called yes/no values are simply tests with their own failure
rates. As such, they have strictly less information than
scores or probability estimates that offer a confidence
estimate as well. The trick is in combining several sources
of evidence, and 'Bayesian' is but one method of combining this
evidence.
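
As an illustration of combining several yes/no tests into a single
probability, here is a small naive-Bayes sketch in Python. It assumes the
tests are independent of each other, and the rates passed in are whatever the
caller has measured; it is not a description of how Bogofilter or
SpamAssassin actually compute their scores.

```python
import math


def combine_naive_bayes(prior_spam, evidence):
    """Combine yes/no tests into a spam probability, assuming independence.

    evidence is a list of (p_fire_given_spam, p_fire_given_ham, fired) tuples,
    with both rates strictly between 0 and 1.
    """
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for p_spam, p_ham, fired in evidence:
        if fired:
            log_odds += math.log(p_spam / p_ham)
        else:
            # A test that stays silent is evidence too, usually pointing to ham.
            log_odds += math.log((1.0 - p_spam) / (1.0 - p_ham))
    return 1.0 / (1.0 + math.exp(-log_odds))
```

Each test shifts the estimate by an amount that depends on its own error
rates, which is the sense in which a bare yes/no answer carries less
information than a score.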
>
> If you want to do word analysis, use it like SpamAssassin does it - with
> some Bayesian rule perhaps adding a few points to the score. That's
> entirely appropriate. But running bogo-filter _instead_ of spamassassin is
> just asinine.

SpamAssassin performs quite poorly with the default weight
given to its statistical filter. It works much better
if you increase the weight. Many tests show that it works
better still if you simply discard the ad hoc rules and
rely on the 'Bayesian' filter alone. I have found that
almost all of the false positives I've encountered in
the last 3 years have been due to SpamAssassin's ad hoc
rules, not its statistical filter.

References

http://plg.uwaterloo.ca/~gvcormac/trecspamtrack05
http://plg.uwaterloo.ca/~gvcormac/spamassassin.html
http://www.ceas.cc/2006/listabs.html#12.pdf

Gordon Cormack
University of Waterloo

2006-10-03 12:51:05

by Valdis Klētnieks

Subject: Re: Spam, bogofilter, etc

On Mon, 02 Oct 2006 23:08:29 PDT, Paul Zimmerman said:

> That's ridiculous! Run a spell-checker on the title,
> and then filter it. How hard could that be?

Subject: iounmap supplicant usx2y racy

(Picking from the last 10 or so e-mails I see here).

let me know what your spell checker says. :)

2006-10-03 16:40:52

by Mariusz Kozlowski

Subject: Re: Spam, bogofilter, etc

Hi,

> Yes, the thing is NOT 100% perfect.
> Especially very short spams are prone to leak thru it,
> and those hams that do get block do tend to be longish, and
> never before seen. (It all comes from Bayes Statistics.)

If I can suggest something: please run the latest p0f and match the output against
vger incoming traffic (or even only against the messages that leak through the
filters used now). I bet you'll see an obvious and very descriptive pattern.
Now what you will or will not do with that knowledge is another story.

Mariusz

2006-10-03 17:30:31

by Mariusz Kozlowski

Subject: Re: Spam, bogofilter, etc

> every good spammer reruns their message through spamassassin adding random
> text till they get a good score *then* they spew it out.

That's a flaw in the whole idea of having pre-defined (by humans) separate
rules catching misc obvious (to us) spam indicators. If you had a filter that
you just feed with raw data from many sources and that does pattern
recognition and learns on its own, there (probably) would be no way to get
around it. At least it wouldn't be easy. In fact, when e.g. an ANN is used as
the classifier, the rules created after training are hidden and have no obvious
representation to us, so one would have no idea what to change to get the
desired filter output.

Mariusz

2006-10-04 22:41:57

by Adrian Bunk

Subject: Re: Spam, bogofilter, etc

On Mon, Oct 02, 2006 at 11:02:36AM -0700, Linus Torvalds wrote:
>
>
> On Mon, 2 Oct 2006, Alan Cox wrote:
> >
> > On Mon, 2006-10-02 at 09:40 -0700, Linus Torvalds wrote:
> > > If you want a yes/no kind of thing, do it on real hard issues, like not
> > > accepting email from machines that aren't registered MX gateways. Sure,
> > > that will mean that people who just set up their local sendmail thing and
> > > connect directly to port 25 will just not be able to email, but let's face
> > > it, that's why we have ISP's and DNS in the first place.
> >
> > Except most of the ISPs are incompetent and many people have to run
> > their own mail system in order to get mail that actually *works*. I've
> > had that experience several times, although thankfully I now have a sane
> > ISP.
>
> Sure. I kind of agree - I'm just saying that if you have a _hard_
> decision, you should base it on _hard_ data.
>...

My personal hard data is:
- if you are sending emails to me, the fourth-last mail server in the
path (the one that actually receives the emails from the Internet)
does greylisting, IOW much spam that can be trivially determined is
already eliminated when bogofilter gets the emails
- much spam I'm getting comes through lists like linux-kernel that
have already filtered out the easy to determine spam
- despite these points, bogofilter catches 90% of the arriving spam
- one false positive every 1-2 years (sic)
- I can (and do) train bogofilter myself

It might have its weaknesses and might therefore not work well forever,
but at least during the last years bogofilter served me well.

> Linus

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2006-10-27 22:30:37

by Oleg Verych

Subject: Re: Spam, bogofilter, etc

On 2006-09-29, Lee Revell wrote:
> What ever happened with bogofilter on vger? The spam problem is
> considerably worse in the past few weeks.
>
> Lee
>
+--
|Original-To: <[email protected]>
|Original-Sender: [email protected]
|Precedence: bulk
|X-Mailing-List: [email protected]
|X-Spam-Report: 9.8 points; * 0.3 J_CHICKENPOX_43 BODY: {4}Letter - dot - {3}Let
*^^^^^*
|Xref: news.gmane.org gmane.linux.kernel:461120 gmane.spam.detected:1953699
|Archived-At: <http://permalink.gmane.org/gmane.linux.kernel/461120>
|
|
|Tranny Outdoor Gets...
+--

From one recent message. What's the problem?