Date: Fri, 23 Jan 2004 20:38:58 +0100
From: Pavel Machek <pavel@ucw.cz>
To: Linus Torvalds <torvalds@osdl.org>
Cc: jw schultz <jw@pegasys.ws>,
       Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [OT] Confirmation Spam Blocking was: List 'linux-dvb' closed to public posts
Message-ID: <20040123193858.GB1355@elf.ucw.cz>
References: <20040121211550.GK9327@redhat.com> <20040121213027.GN23765@srv-lnx2600.matchmail.com> <pan.2004.01.21.23.40.00.181984@dungeon.inka.de> <1074731162.25704.10.camel@localhost.localdomain> <yq0hdyo15gt.fsf@wildopensource.com> <401000C1.9010901@blue-labs.org> <Pine.LNX.4.58.0401221034090.4548@dlang.diginsite.com> <40101B1E.3030908@blue-labs.org> <20040122221802.GD12666@pegasys.ws> <Pine.LNX.4.58.0401221441500.2998@home.osdl.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <Pine.LNX.4.58.0401221441500.2998@home.osdl.org>
User-Agent: Mutt/1.5.4i
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 1872
Lines: 46

Hi!

> > Bayes is the wrong approach for those random words from the
> > dictionary blocks.
> 
> Bayes is not wrong per se, but doing bayes on pure word statistics is
> wrong. It always was. People knew how it could be broken. The current rash
> of spams is just the obvious way to do it.

You want to do it on trigrams (groups of three words). Anything longer
than trigrams is not likely to be effective.
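
A minimal sketch of the trigram idea, for illustration only. The
whitespace tokenization, the add-one smoothing, and the function names
(trigrams, train, spam_score) are all assumptions, not taken from any
existing filter:

```python
import math
from collections import Counter


def trigrams(text):
    """Return the word trigrams of text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def train(messages):
    """Count how often each trigram occurs across a list of messages."""
    counts = Counter()
    for msg in messages:
        counts.update(trigrams(msg))
    return counts


def spam_score(text, spam_counts, ham_counts):
    """Sum of per-trigram log-likelihood ratios; > 0 leans towards spam.

    Add-one smoothing (an assumption of this sketch) keeps unseen
    trigrams from producing a zero probability.
    """
    vocab = len(set(spam_counts) | set(ham_counts)) + 1
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    score = 0.0
    for tri in trigrams(text):
        p_spam = (spam_counts[tri] + 1) / (spam_total + vocab)
        p_ham = (ham_counts[tri] + 1) / (ham_total + vocab)
        score += math.log(p_spam / p_ham)
    return score
```

The only change from a word-based filter is the feature: the counts are
keyed on three consecutive words instead of on single words, so a block
of unrelated dictionary words matches almost nothing seen in training.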

> > What we need is a bounty on these scum.  $1000 fine per
> > reported recipient with half going to the reporter would be
> > nice.
> 
> What you should aim for, and which should be much harder to break, is to 
> realize that random words that make no sense give a really unlikely 
> score when you build up a markov chain of them.
> 
> So to avoid the random words problem, do Bayes on the _chain_ of words
> instead.
> 
> Now, you can try to overcome this by spamming with something that makes
> "sense" from the markov chain standpoint, but by then that spam is going
> to be hilarious. Once I start getting spams that are generated by markov
> generators and read like "real" email, I might stop filtering them, just
> because they are bound to be a lot of fun to read.

Even if you get 100 of them per day?

I'm doing language modeling in school, and generating text that "looks
meaningful" to everything but a human is too easy.
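
To show how little it takes, here is a rough sketch of a first-order
Markov text generator. The function names (build_chain, generate), the
whitespace tokenization, and the fixed seed are assumptions made for
the example:

```python
import random
from collections import defaultdict


def build_chain(corpus):
    """Map each word to the list of words that follow it in corpus."""
    words = corpus.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, seed=0):
    """Walk the chain from start, emitting at most length words."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    out = [start]
    while len(out) < length:
        followers = chain.get(out[-1])
        if not followers:
            break  # dead end: this word never had a successor
        out.append(rng.choice(followers))
    return " ".join(out)
```

Every adjacent word pair in the output has been seen in the training
text, so the output looks locally plausible to a filter that only
checks word pairs, even though the whole rarely means anything to a
person.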

[Take your favourite voice-recognition software, and speak in another
language to it. It will generate plausible-looking sentences at the
output. This could be easily automated.]
								Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]