Subject: Re: [PATCH 2/3] arm64: lib: improve copy performance when size is ge 128 bytes
From: Robin Murphy
To: David Laight, Yang Yingliang, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
Cc: catalin.marinas@arm.com, will@kernel.org, guohanjun@huawei.com
Date: Wed, 24 Mar 2021 19:36:56 +0000
Message-ID: <7c38b9e3-f7ba-4d51-1c84-9c47dad07189@arm.com>
In-Reply-To: <62602598e7b742d09c581f3fc988e487@AcuMS.aculab.com>
References: <20210323073432.3422227-1-yangyingliang@huawei.com> <20210323073432.3422227-3-yangyingliang@huawei.com> <03ac41af-c433-cd66-8195-afbf9c49554c@arm.com> <62602598e7b742d09c581f3fc988e487@AcuMS.aculab.com>
On 2021-03-24 16:38, David Laight wrote:
> From: Robin Murphy
>> Sent: 23 March 2021 12:09
>>
>> On 2021-03-23 07:34, Yang Yingliang wrote:
>>> When copying more than 128 bytes, src/dst are incremented after
>>> each ldp/stp instruction, which costs more time. To improve this,
>>> only increment src/dst after every 64 bytes loaded or stored.
>>
>> This breaks the required behaviour for copy_*_user(), since the fault
>> handler expects the base address to be up-to-date at all times. Say
>> you're copying 128 bytes and fault on the 4th store: it should return 80
>> bytes not copied; the code below would return 128 bytes not copied, even
>> though 48 bytes have actually been written to the destination.
>
> Are there any non-superscalar arm64 CPUs (that anyone cares about)?
>
> If the CPU can execute multiple instructions in one clock,
> then it is usually possible to get the loop control (almost) free.
>
> You might need to unroll once to interleave read/write,
> but any more may be pointless.

Nah, the whole point is that using post-increment addressing is crap in
the first place because it introduces register dependencies between each
access that could be avoided entirely if we could use offset addressing
(and it is especially crap when we don't even *have* a post-index
addressing mode for the unprivileged load/store instructions used in
copy_*_user(), and have to simulate it with extra instructions that
throw off the code alignment). We already have code that's tuned to work
well across our microarchitectures[1]; the issue is that butchering it
to satisfy the additional requirements of copy_*_user() with a common
template has hobbled regular memcpy() performance. I intend to have a
crack at fixing that properly tomorrow ;)

Robin.
[1] https://github.com/ARM-software/optimized-routines

> So something like:
>
> 	a = *src++;
> 	do {
> 		b = *src++;
> 		*dst++ = a;
> 		a = *src++;
> 		*dst++ = b;
> 	} while (src != lim);
> 	*dst++ = a;	/* the element loaded last is still in a */
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)