From: Joakim Tjernlund
Subject: Re: [PATCH v3] crc32c: Implement CRC32c with slicing-by-8 algorithm
Date: Mon, 3 Oct 2011 22:13:18 +0200
To: djwong@us.ibm.com
Cc: linux-crypto, linux-kernel
In-Reply-To: <20111003160036.GX11984@tux1.beaverton.ibm.com>
References: <20110930161223.GW11984@tux1.beaverton.ibm.com> <20111003160036.GX11984@tux1.beaverton.ibm.com>

"Darrick J. Wong" wrote on 2011/10/03 18:00:36:
>
> On Sat, Oct 01, 2011 at 03:52:00PM +0200, Joakim Tjernlund wrote:
> > "Darrick J. Wong" wrote on 2011/09/30 18:12:23:
> > >
> > > [putting mailing lists on cc]

[SNIP]

> > >
> > > I suppose I could make CRC32C_BITS configurable. What is the hardware
> > > profile of your ppc32 processor? How much L1D/L2 cache? slice-by-8 does have
> > > a big cache footprint. On the other hand it's faster than the slice-by-4
> > > (crc32) and Sarwate (crc32c) code in the kernel, even on old slow 32-bit x86
> > > processors (PII, PIII, P4).
> >
> > It is a low end embedded 333 MHz CPU with only L1 cache. How much faster
> > is slice by 8 than slice by 4 on these old x86 machines?
>
> How much L1 cache? Or, if you'd rather not give away specifics, has the CPU
> more than 8KB L1 cache? I'm willing to concede that with little cache the
> added memory pressure could be painful.
>
> As for the old x86 machines, please have a look at:
> http://djwong.org/docs/ext4_metadata_checksums.html#Benchmarking
>
> ~15% faster on a 2GHz Via C7
> ~20% faster on a 2.7GHz P4
> ~25% faster on a 500MHz P3
>
> I vaguely recall it was ~20% faster on a 400MHz P2, but all the kernel.org
> wikis are still down. :(
>
> So I suspect the key factor here is memory hierarchy, since all of those systems
> have at least 16K of L1 cache. Slice by 8 might actually suck on a Pentium
> Pro or earlier. Unfortunately I don't have anything older than a PII...

This CPU has 16KB of L1 cache. I don't know why it was so much slower. It could
be a gcc thing, as gcc does a fairly lame job of optimizing crc32.

I still think making this configurable is a good thing, at least until the
verdict is in from other CPUs.

 Jocke
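
For reference, the cache-footprint trade-off discussed above can be sketched
in a few lines of C. This is only an illustration, not the code from the
patch: CRC_SLICES is a hypothetical compile-time knob (it is not claimed to
match the semantics of CRC32C_BITS), and the function names are made up for
the example. With one table (Sarwate) the lookup data is 1 KB; with eight
tables (slice-by-8) it is 8 KB, which is presumably where the "more than 8KB
L1 cache" question comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical knob for this sketch: 1 = Sarwate (1 KB), 8 = slice-by-8 (8 KB). */
#ifndef CRC_SLICES
#define CRC_SLICES 8
#endif
#if CRC_SLICES != 1 && CRC_SLICES != 8
#error "this sketch only handles CRC_SLICES of 1 or 8"
#endif

#define CRC32C_POLY_LE 0x82F63B78u /* reflected Castagnoli polynomial */

static uint32_t crc_tab[CRC_SLICES][256];

/* Fill table 0 bit by bit, then derive each further table from the previous one. */
void crc32c_init(void)
{
    for (uint32_t i = 0; i < 256; i++) {
        uint32_t c = i;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ CRC32C_POLY_LE : c >> 1;
        crc_tab[0][i] = c;
    }
    for (int t = 1; t < CRC_SLICES; t++)
        for (uint32_t i = 0; i < 256; i++) {
            uint32_t c = crc_tab[t - 1][i];
            crc_tab[t][i] = (c >> 8) ^ crc_tab[0][c & 0xff];
        }
}

/* Load 4 bytes as a little-endian word, independent of host endianness. */
static uint32_t load_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Pass 0 as crc for a new message, or a previous result to continue one. */
uint32_t crc32c(uint32_t crc, const unsigned char *p, size_t len)
{
    crc = ~crc;
#if CRC_SLICES == 8
    /* Consume 8 input bytes per iteration using 8 independent lookups. */
    while (len >= 8) {
        uint32_t lo = crc ^ load_le32(p);
        uint32_t hi = load_le32(p + 4);
        crc = crc_tab[7][lo & 0xff] ^ crc_tab[6][(lo >> 8) & 0xff] ^
              crc_tab[5][(lo >> 16) & 0xff] ^ crc_tab[4][lo >> 24] ^
              crc_tab[3][hi & 0xff] ^ crc_tab[2][(hi >> 8) & 0xff] ^
              crc_tab[1][(hi >> 16) & 0xff] ^ crc_tab[0][hi >> 24];
        p += 8;
        len -= 8;
    }
#endif
    /* Byte-at-a-time loop: the whole input for Sarwate, the tail for slice-by-8. */
    while (len--)
        crc = (crc >> 8) ^ crc_tab[0][(crc ^ *p++) & 0xff];
    return ~crc;
}

/* Size of the lookup data that competes for L1 data cache. */
size_t crc32c_table_bytes(void)
{
    return sizeof(crc_tab);
}

/* Checks the standard CRC32c check value for the ASCII string "123456789". */
void crc32c_selfcheck(void)
{
    crc32c_init();
    assert(crc32c(0, (const unsigned char *)"123456789", 9) == 0xE3069283u);
}
```

Building with -DCRC_SLICES=1 gives the single-table variant, so the same
source can be timed both ways on a small-cache CPU. The sketch says nothing
about which one wins on a given machine; that still has to be measured.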