The driver crc32c-intel match CPUs supporting X86_FEATURE_XMM4_2.
On platforms with Zhaoxin CPUs supporting this X86 feature, When
crc32c-intel and crc32c-generic are both registered, system will
use crc32c-intel because its .cra_priority is greater than
crc32c-generic. This case expect to use crc32c-generic driver for
some Zhaoxin CPUs to get performance gain, So remove these Zhaoxin
CPUs support from crc32c-intel.
Signed-off-by: Tony W Wang-oc <[email protected]>
---
arch/x86/crypto/crc32c-intel_glue.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/crypto/crc32c-intel_glue.c b/arch/x86/crypto/crc32c-intel_glue.c
index feccb52..6dafdae 100644
--- a/arch/x86/crypto/crc32c-intel_glue.c
+++ b/arch/x86/crypto/crc32c-intel_glue.c
@@ -222,8 +222,16 @@ MODULE_DEVICE_TABLE(x86cpu, crc32c_cpu_id);
static int __init crc32c_intel_mod_init(void)
{
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+
if (!x86_match_cpu(crc32c_cpu_id))
return -ENODEV;
+
+ if (c->x86_vendor == X86_VENDOR_ZHAOXIN || c->x86_vendor == X86_VENDOR_CENTAUR) {
+ if (c->x86 == 0x6 || (c->x86 == 0x7 && c->x86_model <= 0x3b))
+ return -ENODEV;
+ }
+
#ifdef CONFIG_X86_64
if (boot_cpu_has(X86_FEATURE_PCLMULQDQ)) {
alg.update = crc32c_pcl_intel_update;
--
2.7.4
On Fri, Dec 11, 2020 at 07:29:04PM +0800, Tony W Wang-oc wrote:
> The driver crc32c-intel match CPUs supporting X86_FEATURE_XMM4_2.
> On platforms with Zhaoxin CPUs supporting this X86 feature, When
> crc32c-intel and crc32c-generic are both registered, system will
> use crc32c-intel because its .cra_priority is greater than
> crc32c-generic. This case expect to use crc32c-generic driver for
> some Zhaoxin CPUs to get performance gain, So remove these Zhaoxin
> CPUs support from crc32c-intel.
>
> Signed-off-by: Tony W Wang-oc <[email protected]>
> ---
> arch/x86/crypto/crc32c-intel_glue.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/x86/crypto/crc32c-intel_glue.c b/arch/x86/crypto/crc32c-intel_glue.c
> index feccb52..6dafdae 100644
> --- a/arch/x86/crypto/crc32c-intel_glue.c
> +++ b/arch/x86/crypto/crc32c-intel_glue.c
> @@ -222,8 +222,16 @@ MODULE_DEVICE_TABLE(x86cpu, crc32c_cpu_id);
>
> static int __init crc32c_intel_mod_init(void)
> {
> + struct cpuinfo_x86 *c = &boot_cpu_data;
> +
> if (!x86_match_cpu(crc32c_cpu_id))
> return -ENODEV;
> +
> + if (c->x86_vendor == X86_VENDOR_ZHAOXIN || c->x86_vendor == X86_VENDOR_CENTAUR) {
> + if (c->x86 == 0x6 || (c->x86 == 0x7 && c->x86_model <= 0x3b))
> + return -ENODEV;
> + }
Egads, why can't you use that x86_match_cpu() above, and also this
really wants a comment on why you're excluding these chips. Also, since
(IIRC) ZHAOXIN is basically AND, shouldn't they also be listed?
That is; write it like:
m = x86_match_cpu(crc32_cpu_id);
if (!m || !m->data)
return -ENODEV;
That way you can have positive and negative matches in the array
(obviously the existing FEATURE test would need data=1 and be last).
On Fri, Dec 11, 2020 at 07:29:04PM +0800, Tony W Wang-oc wrote:
> The driver crc32c-intel match CPUs supporting X86_FEATURE_XMM4_2.
> On platforms with Zhaoxin CPUs supporting this X86 feature, When
> crc32c-intel and crc32c-generic are both registered, system will
> use crc32c-intel because its .cra_priority is greater than
> crc32c-generic. This case expect to use crc32c-generic driver for
> some Zhaoxin CPUs to get performance gain, So remove these Zhaoxin
> CPUs support from crc32c-intel.
>
> Signed-off-by: Tony W Wang-oc <[email protected]>
Does this mean that the performance of the crc32c instruction on those CPUs is
actually slower than a regular C implementation? That's very weird.
- Eric
On 11/12/2020 21:00, Peter Zijlstra wrote:
> On Fri, Dec 11, 2020 at 07:29:04PM +0800, Tony W Wang-oc wrote:
>> The driver crc32c-intel match CPUs supporting X86_FEATURE_XMM4_2.
>> On platforms with Zhaoxin CPUs supporting this X86 feature, When
>> crc32c-intel and crc32c-generic are both registered, system will
>> use crc32c-intel because its .cra_priority is greater than
>> crc32c-generic. This case expect to use crc32c-generic driver for
>> some Zhaoxin CPUs to get performance gain, So remove these Zhaoxin
>> CPUs support from crc32c-intel.
>>
>> Signed-off-by: Tony W Wang-oc <[email protected]>
>> ---
>> arch/x86/crypto/crc32c-intel_glue.c | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/arch/x86/crypto/crc32c-intel_glue.c b/arch/x86/crypto/crc32c-intel_glue.c
>> index feccb52..6dafdae 100644
>> --- a/arch/x86/crypto/crc32c-intel_glue.c
>> +++ b/arch/x86/crypto/crc32c-intel_glue.c
>> @@ -222,8 +222,16 @@ MODULE_DEVICE_TABLE(x86cpu, crc32c_cpu_id);
>>
>> static int __init crc32c_intel_mod_init(void)
>> {
>> + struct cpuinfo_x86 *c = &boot_cpu_data;
>> +
>> if (!x86_match_cpu(crc32c_cpu_id))
>> return -ENODEV;
>> +
>> + if (c->x86_vendor == X86_VENDOR_ZHAOXIN || c->x86_vendor == X86_VENDOR_CENTAUR) {
>> + if (c->x86 == 0x6 || (c->x86 == 0x7 && c->x86_model <= 0x3b))
>> + return -ENODEV;
>> + }
>
> Egads, why can't you use that x86_match_cpu() above, and also this
> really wants a comment on why you're excluding these chips.
When doing lmbench3 Create and Delete file test on partitions with ext4
enabling metadata checksum, found using crc32c-generic driver could get
about 20% performance gain than using the driver crc32c-intel on these
chips.
Also, since
> (IIRC) ZHAOXIN is basically AND, shouldn't they also be listed?
>
> That is; write it like:
>
> m = x86_match_cpu(crc32_cpu_id);
> if (!m || !m->data)
> return -ENODEV;
>
> That way you can have positive and negative matches in the array
> (obviously the existing FEATURE test would need data=1 and be last).
> .
>
Lot thanks for you suggestion, will list these chips in crc32c_cpu_id
and use x86_match_cpu:
static const struct x86_cpu_id crc32c_cpu_id[] = {
+ X86_MATCH_VENDOR_FAM_FEATURE(ZHAOXIN, 0x6, X86_FEATURE_XMM4_2, 1),
+ X86_MATCH_VENDOR_FAM_MODEL_FEATURE(ZHAOXIN, 0x7, 0x1b,
X86_FEATURE_XMM4_2, 1),
+ X86_MATCH_VENDOR_FAM_MODEL_FEATURE(ZHAOXIN, 0x7, 0x3b,
X86_FEATURE_XMM4_2, 1),
+ X86_MATCH_VENDOR_FAM_FEATURE(CENTAUR, 0x6, X86_FEATURE_XMM4_2, 1),
+ X86_MATCH_VENDOR_FAM_MODEL_FEATURE(CENTAUR, 0x7, 0x1b,
X86_FEATURE_XMM4_2, 1),
+ X86_MATCH_VENDOR_FAM_MODEL_FEATURE(CENTAUR, 0x7, 0x3b,
X86_FEATURE_XMM4_2, 1),
X86_MATCH_FEATURE(X86_FEATURE_XMM4_2, NULL),
{}
};
@@ -228,8 +234,10 @@ MODULE_DEVICE_TABLE(x86cpu, crc32c_cpu_id);
static int __init crc32c_intel_mod_init(void)
{
- if (!x86_match_cpu(crc32c_cpu_id))
+ const struct x86_cpu_id *m = x86_match_cpu(crc32c_cpu_id);
+ if (!m || m->driver_data)
return -ENODEV;
sincerely
TonyWWangoc
On Fri, 11 Dec 2020 at 20:07, Eric Biggers <[email protected]> wrote:
>
> On Fri, Dec 11, 2020 at 07:29:04PM +0800, Tony W Wang-oc wrote:
> > The driver crc32c-intel match CPUs supporting X86_FEATURE_XMM4_2.
> > On platforms with Zhaoxin CPUs supporting this X86 feature, When
> > crc32c-intel and crc32c-generic are both registered, system will
> > use crc32c-intel because its .cra_priority is greater than
> > crc32c-generic. This case expect to use crc32c-generic driver for
> > some Zhaoxin CPUs to get performance gain, So remove these Zhaoxin
> > CPUs support from crc32c-intel.
> >
> > Signed-off-by: Tony W Wang-oc <[email protected]>
>
> Does this mean that the performance of the crc32c instruction on those CPUs is
> actually slower than a regular C implementation? That's very weird.
>
This driver does not use CRC instructions, but carryless
multiplication and aggregation. So I suppose the pclmulqdq instruction
triggers some pathological performance limitation here.
That means the crct10dif driver probably needs the same treatment.
On Sat, 12 Dec 2020 at 10:36, Ard Biesheuvel <[email protected]> wrote:
>
> On Fri, 11 Dec 2020 at 20:07, Eric Biggers <[email protected]> wrote:
> >
> > On Fri, Dec 11, 2020 at 07:29:04PM +0800, Tony W Wang-oc wrote:
> > > The driver crc32c-intel match CPUs supporting X86_FEATURE_XMM4_2.
> > > On platforms with Zhaoxin CPUs supporting this X86 feature, When
> > > crc32c-intel and crc32c-generic are both registered, system will
> > > use crc32c-intel because its .cra_priority is greater than
> > > crc32c-generic. This case expect to use crc32c-generic driver for
> > > some Zhaoxin CPUs to get performance gain, So remove these Zhaoxin
> > > CPUs support from crc32c-intel.
> > >
> > > Signed-off-by: Tony W Wang-oc <[email protected]>
> >
> > Does this mean that the performance of the crc32c instruction on those CPUs is
> > actually slower than a regular C implementation? That's very weird.
> >
>
> This driver does not use CRC instructions, but carryless
> multiplication and aggregation. So I suppose the pclmulqdq instruction
> triggers some pathological performance limitation here.
>
Just noticed it uses both crc instructions and pclmulqdq instructions.
Sorry for the noise.
> That means the crct10dif driver probably needs the same treatment.
Tony, can you confirm that the problem is in the CRC instructions and
not in the PCLMULQDQ code path that supersedes it when available?
On 12/12/2020 01:43, Eric Biggers wrote:
> On Fri, Dec 11, 2020 at 07:29:04PM +0800, Tony W Wang-oc wrote:
>> The driver crc32c-intel match CPUs supporting X86_FEATURE_XMM4_2.
>> On platforms with Zhaoxin CPUs supporting this X86 feature, When
>> crc32c-intel and crc32c-generic are both registered, system will
>> use crc32c-intel because its .cra_priority is greater than
>> crc32c-generic. This case expect to use crc32c-generic driver for
>> some Zhaoxin CPUs to get performance gain, So remove these Zhaoxin
>> CPUs support from crc32c-intel.
>>
>> Signed-off-by: Tony W Wang-oc <[email protected]>
>
> Does this mean that the performance of the crc32c instruction on those CPUs is
> actually slower than a regular C implementation? That's very weird.
>
From the lmbench3 Create and Delete file test on those chips, I think yes.
sincerely
Tony
On 12/12/2020 18:54, Ard Biesheuvel wrote:
> On Sat, 12 Dec 2020 at 10:36, Ard Biesheuvel <[email protected]> wrote:
>>
>> On Fri, 11 Dec 2020 at 20:07, Eric Biggers <[email protected]> wrote:
>>>
>>> On Fri, Dec 11, 2020 at 07:29:04PM +0800, Tony W Wang-oc wrote:
>>>> The driver crc32c-intel match CPUs supporting X86_FEATURE_XMM4_2.
>>>> On platforms with Zhaoxin CPUs supporting this X86 feature, When
>>>> crc32c-intel and crc32c-generic are both registered, system will
>>>> use crc32c-intel because its .cra_priority is greater than
>>>> crc32c-generic. This case expect to use crc32c-generic driver for
>>>> some Zhaoxin CPUs to get performance gain, So remove these Zhaoxin
>>>> CPUs support from crc32c-intel.
>>>>
>>>> Signed-off-by: Tony W Wang-oc <[email protected]>
>>>
>>> Does this mean that the performance of the crc32c instruction on those CPUs is
>>> actually slower than a regular C implementation? That's very weird.
>>>
>>
>> This driver does not use CRC instructions, but carryless
>> multiplication and aggregation. So I suppose the pclmulqdq instruction
>> triggers some pathological performance limitation here.
>>
>
> Just noticed it uses both crc instructions and pclmulqdq instructions.
> Sorry for the noise.
>
>> That means the crct10dif driver probably needs the same treatment.
>
> Tony, can you confirm that the problem is in the CRC instructions and
> not in the PCLMULQDQ code path that supersedes it when available?
CRC instructions.
sincerely
Tony
On Mon, Dec 14, 2020 at 10:28:19AM +0800, Tony W Wang-oc wrote:
> On 12/12/2020 01:43, Eric Biggers wrote:
> > On Fri, Dec 11, 2020 at 07:29:04PM +0800, Tony W Wang-oc wrote:
> >> The driver crc32c-intel match CPUs supporting X86_FEATURE_XMM4_2.
> >> On platforms with Zhaoxin CPUs supporting this X86 feature, When
> >> crc32c-intel and crc32c-generic are both registered, system will
> >> use crc32c-intel because its .cra_priority is greater than
> >> crc32c-generic. This case expect to use crc32c-generic driver for
> >> some Zhaoxin CPUs to get performance gain, So remove these Zhaoxin
> >> CPUs support from crc32c-intel.
> >>
> >> Signed-off-by: Tony W Wang-oc <[email protected]>
> >
> > Does this mean that the performance of the crc32c instruction on those CPUs is
> > actually slower than a regular C implementation? That's very weird.
> >
>
> From the lmbench3 Create and Delete file test on those chips, I think yes.
>
Did you try measuring the performance of the hashing itself, and not some
higher-level filesystem operations?
- Eric
On 15/12/2020 04:41, Eric Biggers wrote:
> On Mon, Dec 14, 2020 at 10:28:19AM +0800, Tony W Wang-oc wrote:
>> On 12/12/2020 01:43, Eric Biggers wrote:
>>> On Fri, Dec 11, 2020 at 07:29:04PM +0800, Tony W Wang-oc wrote:
>>>> The driver crc32c-intel match CPUs supporting X86_FEATURE_XMM4_2.
>>>> On platforms with Zhaoxin CPUs supporting this X86 feature, When
>>>> crc32c-intel and crc32c-generic are both registered, system will
>>>> use crc32c-intel because its .cra_priority is greater than
>>>> crc32c-generic. This case expect to use crc32c-generic driver for
>>>> some Zhaoxin CPUs to get performance gain, So remove these Zhaoxin
>>>> CPUs support from crc32c-intel.
>>>>
>>>> Signed-off-by: Tony W Wang-oc <[email protected]>
>>>
>>> Does this mean that the performance of the crc32c instruction on those CPUs is
>>> actually slower than a regular C implementation? That's very weird.
>>>
>>
>> From the lmbench3 Create and Delete file test on those chips, I think yes.
>>
>
> Did you try measuring the performance of the hashing itself, and not some
> higher-level filesystem operations?
>
Yes. Was testing on these Zhaoxin CPUs, the result is that with the same
input value the generic C implementation takes fewer time than the
crc32c instruction implementation.
Sincerely
Tony
On Tue, Dec 15, 2020 at 10:15:29AM +0800, Tony W Wang-oc wrote:
>
> On 15/12/2020 04:41, Eric Biggers wrote:
> > On Mon, Dec 14, 2020 at 10:28:19AM +0800, Tony W Wang-oc wrote:
> >> On 12/12/2020 01:43, Eric Biggers wrote:
> >>> On Fri, Dec 11, 2020 at 07:29:04PM +0800, Tony W Wang-oc wrote:
> >>>> The driver crc32c-intel match CPUs supporting X86_FEATURE_XMM4_2.
> >>>> On platforms with Zhaoxin CPUs supporting this X86 feature, When
> >>>> crc32c-intel and crc32c-generic are both registered, system will
> >>>> use crc32c-intel because its .cra_priority is greater than
> >>>> crc32c-generic. This case expect to use crc32c-generic driver for
> >>>> some Zhaoxin CPUs to get performance gain, So remove these Zhaoxin
> >>>> CPUs support from crc32c-intel.
> >>>>
> >>>> Signed-off-by: Tony W Wang-oc <[email protected]>
> >>>
> >>> Does this mean that the performance of the crc32c instruction on those CPUs is
> >>> actually slower than a regular C implementation? That's very weird.
> >>>
> >>
> >> From the lmbench3 Create and Delete file test on those chips, I think yes.
> >>
> >
> > Did you try measuring the performance of the hashing itself, and not some
> > higher-level filesystem operations?
> >
>
> Yes. Was testing on these Zhaoxin CPUs, the result is that with the same
> input value the generic C implementation takes fewer time than the
> crc32c instruction implementation.
>
And that is really "working as intended"? Why do these CPUs even declare that
they support the crc32c instruction, when it is so slow?
Are there any other instruction sets (AES-NI, PCLMUL, SSE, SSE2, AVX, etc.) that
these CPUs similarly declare support for but they are uselessly slow?
- Eric
On December 16, 2020 1:56:45 AM GMT+08:00, Eric Biggers <[email protected]> wrote:
>On Tue, Dec 15, 2020 at 10:15:29AM +0800, Tony W Wang-oc wrote:
>>
>> On 15/12/2020 04:41, Eric Biggers wrote:
>> > On Mon, Dec 14, 2020 at 10:28:19AM +0800, Tony W Wang-oc wrote:
>> >> On 12/12/2020 01:43, Eric Biggers wrote:
>> >>> On Fri, Dec 11, 2020 at 07:29:04PM +0800, Tony W Wang-oc wrote:
>> >>>> The driver crc32c-intel match CPUs supporting
>X86_FEATURE_XMM4_2.
>> >>>> On platforms with Zhaoxin CPUs supporting this X86 feature, When
>> >>>> crc32c-intel and crc32c-generic are both registered, system will
>> >>>> use crc32c-intel because its .cra_priority is greater than
>> >>>> crc32c-generic. This case expect to use crc32c-generic driver
>for
>> >>>> some Zhaoxin CPUs to get performance gain, So remove these
>Zhaoxin
>> >>>> CPUs support from crc32c-intel.
>> >>>>
>> >>>> Signed-off-by: Tony W Wang-oc <[email protected]>
>> >>>
>> >>> Does this mean that the performance of the crc32c instruction on
>those CPUs is
>> >>> actually slower than a regular C implementation? That's very
>weird.
>> >>>
>> >>
>> >> From the lmbench3 Create and Delete file test on those chips, I
>think yes.
>> >>
>> >
>> > Did you try measuring the performance of the hashing itself, and
>not some
>> > higher-level filesystem operations?
>> >
>>
>> Yes. Was testing on these Zhaoxin CPUs, the result is that with the
>same
>> input value the generic C implementation takes fewer time than the
>> crc32c instruction implementation.
>>
>
>And that is really "working as intended"?
These CPU's crc32c instruction is not working as intended.
Why do these CPUs even
>declare that
>they support the crc32c instruction, when it is so slow?
>
The presence of crc32c and some other instructions supports are enumerated by CPUID.01:ECX[SSE4.2] = 1, other instructions are ok except the crc32c instruction.
>Are there any other instruction sets (AES-NI, PCLMUL, SSE, SSE2, AVX,
>etc.) that
>these CPUs similarly declare support for but they are uselessly slow?
No.
Sincerely
Tonyw
>
>- Eric