From: David Laight
To: 'Samuel Neves', "x86@kernel.org", "ak@linux.intel.com", "linux-kernel@vger.kernel.org"
Subject: RE: [PATCH] x86/usercopy: speed up 64-bit __clear_user() with stos{b,q}
Date: Mon, 24 May 2021 08:29:18 +0000
Message-ID: <58be4e3df1954458890d69c2684fb748@AcuMS.aculab.com>
References:
<20210523180423.108087-1-sneves@dei.uc.pt>
In-Reply-To: <20210523180423.108087-1-sneves@dei.uc.pt>
X-Mailing-List: linux-kernel@vger.kernel.org

From: Samuel Neves
> Sent: 23 May 2021 19:04
>
> The current 64-bit implementation of __clear_user consists of a simple loop
> writing an 8-byte register per iteration. On typical x86_64 chips, this will
> result in a rate of ~8 bytes per cycle.
>
> On those same typical chips, much better is often possible, ranging from 16
> to 32 to 64 bytes per cycle. Here we want to avoid bringing vector
> instructions for this, but we can still achieve something close to those fill
> rates using `rep stos{b,q}`. This is actually how it is already done in
> usercopy_32.c.
>
> This patch does precisely this. But because `rep stosb` can be slower for
> short fills, I've retained the old loop for sizes below 256 bytes. This is a
> somewhat arbitrary threshold; some documents say that `rep stosb` should be
> faster after 128 bytes, whereas glibc puts the threshold at 2048 bytes (but
> there it is competing against vector instructions). My measurements on
> various (but not an exhaustive variety of) machines suggest this is a
> reasonable threshold, but I could be mistaken.
>
> It should also be mentioned that the existent code contains a bug.
> In the loop
>
>     "0: movq $0,(%[dst])\n"
>     " addq $8,%[dst]\n"
>     " decl %%ecx ; jnz 0b\n"
>
> The `decl %%ecx` instruction truncates the register containing `size/8` to
> 32 bits, which means that calling __clear_user on a buffer longer than 32 GiB
> would leave part of it unzeroed.
>
> This change is noticeable from userspace. That is in fact how I spotted it; in
> a hashing benchmark that read from /dev/zero, around 10-15% of the CPU time
> was spent in __clear_user. After this patch, on a Skylake CPU, these are the
> before/after figures:
>
> $ dd if=/dev/zero of=/dev/null bs=1024k status=progress
> 94402248704 bytes (94 GB, 88 GiB) copied, 6 s, 15.7 GB/s
>
> $ dd if=/dev/zero of=/dev/null bs=1024k status=progress
> 446476320768 bytes (446 GB, 416 GiB) copied, 15 s, 29.8 GB/s
>
> The difference decreases when reading in smaller increments, but I have
> observed no slowdowns.
>
> Signed-off-by: Samuel Neves
> ---
>  arch/x86/lib/usercopy_64.c | 59 +++++++++++++++++++++++++-------------
>  1 file changed, 39 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/lib/usercopy_64.c b/arch/x86/lib/usercopy_64.c
> index 508c81e97..af0f3089a 100644
> --- a/arch/x86/lib/usercopy_64.c
> +++ b/arch/x86/lib/usercopy_64.c
> @@ -9,6 +9,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  /*
>   * Zero Userspace
> @@ -16,33 +17,51 @@
>
>  unsigned long __clear_user(void __user *addr, unsigned long size)
>  {
> -       long __d0;
> +       long __d0, __d1;
>         might_fault();
>         /* no memory constraint because it doesn't change any memory gcc knows
>            about */
>         stac();
>         asm volatile(
> -               " testq %[size8],%[size8]\n"
> -               " jz 4f\n"
> -               " .align 16\n"
> -               "0: movq $0,(%[dst])\n"
> -               " addq $8,%[dst]\n"
> -               " decl %%ecx ; jnz 0b\n"
> -               "4: movq %[size1],%%rcx\n"
> -               " testl %%ecx,%%ecx\n"
> -               " jz 2f\n"
> -               "1: movb $0,(%[dst])\n"
> -               " incq %[dst]\n"
> -               " decl %%ecx ; jnz 1b\n"
> -               "2:\n"
> +               " cmp $256, %[size]\n"
> +               " jae 3f\n" /* size >= 256 */
> +               " mov %k[size], %k[aux]\n"
> +               " and $7, %k[aux]\n"
> +               " shr $3, %[size]\n"
> +               " jz 1f\n" /* size < 8 */
> +               ".align 16\n"
> +               "0: movq %%rax,(%[dst])\n"
> +               " add $8,%[dst]\n"
> +               " dec %[size]; jnz 0b\n"

No need for a loop, just write zeros to the end of the buffer.
It may be worth doing that even if the size is a multiple of 8
and the last 'block zero' clears the same bytes.

> +               "1: mov %k[aux],%k[size]\n"
> +               " test %k[aux], %k[aux]\n"
> +               " jz 6f\n"
> +               "2: movb %%al,(%[dst])\n"
> +               " inc %[dst]\n"
> +               " dec %k[size]; jnz 2b\n"
> +               " jmp 6f\n"
> +               "3: \n"
> +               ALTERNATIVE(
> +                       "mov %k[size], %k[aux]\n"
> +                       "shr $3, %[size]\n"
> +                       "and $7, %k[aux]\n"
> +                       "4: rep stosq\n"
> +                       "mov %k[aux], %k[size]\n",

You really don't want to use 'rep stosb' here.
There is a large class of x86 CPUs where it is really horrid.
IIRC there is one small set (just before the ERMS ones) where short
'rep movsb' isn't too bad.

        David

> +                       "",
> +                       X86_FEATURE_ERMS
> +               )
> +               "5: rep stosb\n"
> +               "6: \n"
>                 ".section .fixup,\"ax\"\n"
> -               "3: lea 0(%[size1],%[size8],8),%[size8]\n"
> -               " jmp 2b\n"
> +               "7: lea 0(%[aux],%[size],8),%[size]\n"
> +               " jmp 6b\n"
>                 ".previous\n"
> -               _ASM_EXTABLE_UA(0b, 3b)
> -               _ASM_EXTABLE_UA(1b, 2b)
> -               : [size8] "=&c"(size), [dst] "=&D" (__d0)
> -               : [size1] "r"(size & 7), "[size8]" (size / 8), "[dst]"(addr));
> +               _ASM_EXTABLE_UA(0b, 7b)
> +               _ASM_EXTABLE_UA(2b, 6b)
> +               _ASM_EXTABLE_UA(4b, 7b)
> +               _ASM_EXTABLE_UA(5b, 6b)
> +               : [size] "=&c"(size), [dst] "=&D" (__d0), [aux] "=&r"(__d1)
> +               : "[size]" (size), "[dst]"(addr), "a"(0));
>         clac();
>         return size;
>  }
> --
> 2.31.1
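
For anyone who wants to see the "write zeros to the end of the buffer" remark outside inline asm, here is a rough C sketch of one reading of it. It assumes an ordinary in-memory buffer of at least 8 bytes. The names clear_with_tail_store() and store_zero64() are hypothetical and appear nowhere in the patch or the kernel, and a real __clear_user() would still need the user-access fault handling that this leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store one 8-byte zero word at p; memcpy keeps unaligned stores well defined. */
static void store_zero64(unsigned char *p)
{
    const uint64_t zero = 0;

    memcpy(p, &zero, sizeof(zero));
}

/*
 * Hypothetical helper, not taken from the patch: zero len bytes, len >= 8.
 * Whole 8-byte words are written front to back, then one more word is
 * written so that it ends exactly at buf + len. That last store may
 * overlap bytes already cleared, which is harmless and replaces the
 * byte-at-a-time tail loop.
 */
static void clear_with_tail_store(unsigned char *buf, size_t len)
{
    size_t off;

    assert(len >= 8);               /* the tail store needs 8 bytes of room */

    for (off = 0; off + 8 <= len; off += 8)
        store_zero64(buf + off);    /* full words */

    store_zero64(buf + len - 8);    /* overlapping final word */
}
```

With len = 13 the loop writes bytes 0-7 and the final store rewrites bytes 5-12, so three bytes are zeroed twice and no byte loop runs. When len is a multiple of 8 the final store simply repeats the last word, which is the case the remark says may still be worth doing unconditionally.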