Date: Fri, 23 Jan 2004 20:38:58 +0100
From: Pavel Machek
To: Linus Torvalds
Cc: jw schultz, Linux Kernel Mailing List
Subject: Re: [OT] Confirmation Spam Blocking was: List 'linux-dvb' closed to public posts
Message-ID: <20040123193858.GB1355@elf.ucw.cz>

Hi!

> > Bayes is the wrong approach for those random words from the
> > dictionary blocks.
>
> Bayes is not wrong per se, but doing Bayes on pure word statistics is
> wrong. It always was. People knew how it could be broken. The current rash
> of spams is just the obvious way to do it.

You want to do it on trigrams (groups of three words). Anything longer
than trigrams is not likely to be effective.

> > What we need is a bounty on these scum. $1000 fine per
> > reported recipient with half going to the reporter would be
> > nice.
>
> What you should aim for, and which should be much harder to break, is to
> realize that random words that make no sense give a really unlikely
> score when you build up a Markov chain of them.
>
> So to avoid the random words problem, do Bayes on the _chain_ of words
> instead.
>
> Now, you can try to overcome this by spamming with something that makes
> "sense" from the Markov chain standpoint, but by then that spam is going
> to be hilarious. Once I start getting spams that are generated by Markov
> generators and read like "real" email, I might stop filtering them, just
> because they are bound to be a lot of fun to read.

Even if you get 100 of them per day?

I'm doing language modeling in school, and generating text that looks
meaningful to everything but a human is too easy. [Take your favourite
voice-recognition software and speak to it in another language. It will
generate plausible-looking sentences at the output. This could be
easily automated.]

								Pavel
-- 
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]
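
For anyone who wants to play with the idea in this thread, here is a minimal
sketch of scoring a message by its word trigrams instead of by single words.
It is not taken from any real filter: the class name TrigramModel, the
add-one smoothing, the "<s>" start markers and the tiny training corpus are
all assumptions made for illustration. It only shows that a shuffled bag of
the same words, which a pure word-statistics filter cannot tell apart from
the original, gets a much worse score from the chain.

```python
import math
from collections import Counter


def trigrams(words):
    """Return one (w1, w2, w3) tuple per word, padded with start markers."""
    padded = ["<s>", "<s>"] + words
    return [tuple(padded[i:i + 3]) for i in range(len(words))]


class TrigramModel:
    """Hypothetical word-trigram Markov model with add-one smoothing."""

    def __init__(self):
        self.tri = Counter()   # counts of (w1, w2, w3)
        self.ctx = Counter()   # counts of the (w1, w2) context
        self.vocab = set()

    def train(self, text):
        words = text.lower().split()
        self.vocab.update(words)
        for t in trigrams(words):
            self.tri[t] += 1
            self.ctx[t[:2]] += 1

    def avg_logprob(self, text):
        """Average log-probability per word; higher means more 'chain-like'."""
        words = text.lower().split()
        if not words:
            return 0.0
        v = len(self.vocab) + 1  # one extra slot for unseen words
        total = 0.0
        for t in trigrams(words):
            total += math.log((self.tri[t] + 1) / (self.ctx[t[:2]] + v))
        return total / len(words)


# Toy corpus standing in for mail the user considers legitimate (assumed).
model = TrigramModel()
for line in [
    "please review the attached patch for the scheduler",
    "the patch fixes a race in the scheduler",
    "i will review the patch tomorrow",
]:
    model.train(line)

coherent = "the patch fixes a race in the scheduler"
shuffled = "scheduler race the in a fixes patch the"  # same words, random order

print(model.avg_logprob(coherent))
print(model.avg_logprob(shuffled))
```

The two test strings contain exactly the same words, so any per-word score is
identical for both; only the chain score separates them. As noted above, text
produced by a generator that follows the chain would still get through, so
this is an illustration of the argument and not a complete defence.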