Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S266461AbUAVW7F (ORCPT ); Thu, 22 Jan 2004 17:59:05 -0500
Received: (majordomo@vger.kernel.org) by vger.kernel.org
        id S266464AbUAVW7E (ORCPT ); Thu, 22 Jan 2004 17:59:04 -0500
Received: from fw.osdl.org ([65.172.181.6]:37031 "EHLO mail.osdl.org")
        by vger.kernel.org with ESMTP id S266461AbUAVW67 (ORCPT );
        Thu, 22 Jan 2004 17:58:59 -0500
Date: Thu, 22 Jan 2004 14:58:54 -0800 (PST)
From: Linus Torvalds
To: jw schultz
cc: Linux Kernel Mailing List
Subject: Re: [OT] Confirmation Spam Blocking was: List 'linux-dvb' closed to public posts
In-Reply-To: <20040122221802.GD12666@pegasys.ws>
Message-ID:
References: <1074717499.18964.9.camel@localhost.localdomain>
        <20040121211550.GK9327@redhat.com>
        <20040121213027.GN23765@srv-lnx2600.matchmail.com>
        <1074731162.25704.10.camel@localhost.localdomain>
        <401000C1.9010901@blue-labs.org>
        <40101B1E.3030908@blue-labs.org>
        <20040122221802.GD12666@pegasys.ws>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 2424
Lines: 58

On Thu, 22 Jan 2004, jw schultz wrote:
>
> Beyes is the wrong aproach for those random words from the
> dictionary blocks.

Bayes is not wrong per se, but doing bayes on pure word statistics is
wrong. It always was. People knew how it could be broken. The current
rash of spams is just the obvious way to do it.

> Those i've seen seem to be a long string of words all longer
> than 4 characters. A rule that gave a score of based on the
> number of consecutive words longer than some number or
> characters would catch those fairly easily. If i get
> annoyed enough i may figure out how to write such a rule.

Don't. That's easily broken too, as you realized yourself.

> What we need is a bounty on these scum. $1000 fine per
> reported recipient with half going to the reporter would be
> nice.
What you should aim for, and which should be much harder to break, is to
realize that random words that make no sense give a really unlikely score
when you build up a markov chain of them. So to avoid the random words
problem, do Bayes on the _chain_ of words instead.

Now, you can try to overcome this by spamming with something that makes
"sense" from the markov chain standpoint, but by then that spam is going
to be hilarious. Once I start getting spams that are generated by markov
generators and read like "real" email, I might stop filtering them, just
because they are bound to be a lot of fun to read.

Have you played with Markov chains? What happens is that you don't just
build up a list of words and their likelihood of being spam or ham, you
build up a list of word _combinations_ and the likelihood of one
particular word following another one.

That's how a lot of the "random phrase" generators on the web work. They
can be absolutely hilarious, exactly because the sentences they generate
actually _almost_ make sense. Sometimes you get an almost readable story,
but one that reads like somebody having a bad trip and his reality just
shifted 90 degrees.

(Usually the best stories come if the training material is coherent, which
email sadly usually isn't).

Do a google search for "Mark V Shaney", and you should get some idea about
this.

                Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
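
For illustration only, here is a rough sketch of the word-chain idea from
the message above: count which word follows which, then score a piece of
text by how likely its word-to-word transitions are. Nothing in it comes
from the original mail. The function names, the "<s>" start marker, the
add-alpha smoothing and the tiny training texts are all assumptions made
for the example. A real filter would presumably train one chain on spam
and one on ham and compare the two scores.

```python
import math
import random
from collections import Counter, defaultdict

START = "<s>"  # hypothetical start-of-text marker, not from the mail


def train_bigrams(texts):
    """Count how often each word follows another across the texts."""
    follows = defaultdict(Counter)
    for text in texts:
        words = [START] + text.lower().split()
        for prev, cur in zip(words, words[1:]):
            follows[prev][cur] += 1
    return follows


def avg_log_prob(follows, text, vocab_size, alpha=1.0):
    """Average log-probability per transition, with add-alpha smoothing
    (an assumption here) so unseen word pairs are unlikely, not impossible."""
    words = [START] + text.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    total = 0.0
    for prev, cur in pairs:
        counts = follows.get(prev, {})
        seen = sum(counts.values())
        total += math.log((counts.get(cur, 0) + alpha)
                          / (seen + alpha * vocab_size))
    return total / len(pairs)


def generate(follows, length=12, seed=None):
    """Random-phrase generator: walk the chain, picking each next word
    in proportion to how often it followed the current one."""
    rng = random.Random(seed)
    word, out = START, []
    for _ in range(length):
        nxt = follows.get(word)
        if not nxt:
            break  # no known successor, stop here
        word = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        out.append(word)
    return " ".join(out)


# Made-up training material, purely for the example.
texts = [
    "the patch fixes the build on sparc",
    "please test the patch on your machine",
    "the build fails on my machine",
]
model = train_bigrams(texts)
vocab_size = len({w for t in texts for w in t.lower().split()})

sensible = avg_log_prob(model, "the patch fixes the build", vocab_size)
scrambled = avg_log_prob(model, "build the fixes patch the", vocab_size)
```

Both test strings use exactly the same words, so a pure per-word filter
cannot tell them apart. The chain can: every transition in the scrambled
version is unseen, so `scrambled` comes out lower than `sensible`.
Calling `generate(model)` walks the same table the other way round and
produces the "almost makes sense" kind of text.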