Received: by 2002:a05:6a10:16a7:0:0:0:0 with SMTP id gp39csp3827366pxb; Tue, 17 Nov 2020 04:47:17 -0800 (PST) X-Google-Smtp-Source: ABdhPJywU7XXe88HWRKWcbO4ZksNj/4DiUp0g2kfJcY6URpKq7tb4h7boBuNo4qInVUbdyLzMTGq X-Received: by 2002:a50:fe14:: with SMTP id f20mr19600501edt.61.1605617237401; Tue, 17 Nov 2020 04:47:17 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1605617237; cv=none; d=google.com; s=arc-20160816; b=fkl1hu0ppBRiVxt4Uf3N22O5BjHi6OVziFsgXJy0noIILkupg5ZBFlka12jARc+bM5 r/rzIEHO6eZgtlCdCeWWiGNjPqpJAZyEdocfRSgRhBPqB+CbKOQJ1zsTuQVSksD3vS9d JDUUvpgALFASW5Rmx4un2WpQobXwn6PTrlhGmCrvFURNbN+OOc4ElMSHJLTKi2pN4hHw nvBxobmL+FyuWOrie/lQN33rcjfi06yx0vphwM+jqS6exKzIA1Ktfn+KqfNLqBNLAv+5 WZYHYSj5Lnpmxjy3gm2DzyIHHN2klMDsvZ7SyCe4CqCLWYGn9FMHHWjxcS70pEcDDdiF zvWg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to :mime-version:user-agent:date:message-id:from:references:cc:to :subject; bh=rDr8sA7ZXJGkOTQxE+bLSbKwa8NG21xbLgvfAHL3nVY=; b=MqWgkrtpHjhn55obEyY7/1d7ewM322knZh1ub8R0wnqKPR1uG4KZNu/w/8pLpO9mK6 HP+tT2YrR3V6vEawTGbm36wllT62Ns5TejgkK811PszfY+/7sYVPFsHzhWWn4V46CV5L 7ibaoacCKDh49JMQa/XC9wxgkxjgE0+5GIgwbSXUS74RIRUsF2hCTxHE2DpL3mqDqQFM hWWV3ExYAEMGYaDUp3bLDoQf2mkwwHywT5e4GBh1BqEfUiY06XlJ39TuMUY0nyx8f/jq R7DX4JCTjFMiREEXOm4H29FOmjNCQY4YMwFe9tpRPUZk7l8YsiTNj07lnYKkNWFX/VMs Ih3g== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-crypto-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-crypto-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id a3si14360354edq.430.2020.11.17.04.46.45; Tue, 17 Nov 2020 04:47:17 -0800 (PST) Received-SPF: pass (google.com: domain of linux-crypto-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-crypto-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-crypto-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727267AbgKQMqI (ORCPT + 99 others); Tue, 17 Nov 2020 07:46:08 -0500 Received: from szxga04-in.huawei.com ([45.249.212.190]:7636 "EHLO szxga04-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725355AbgKQMqI (ORCPT ); Tue, 17 Nov 2020 07:46:08 -0500 Received: from DGGEMS414-HUB.china.huawei.com (unknown [172.30.72.60]) by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4Cb5Jw45DWz15MbQ; Tue, 17 Nov 2020 20:45:52 +0800 (CST) Received: from [10.110.54.32] (10.110.54.32) by DGGEMS414-HUB.china.huawei.com (10.3.19.214) with Microsoft SMTP Server id 14.3.487.0; Tue, 17 Nov 2020 20:45:55 +0800 Subject: Re: [PATCH 0/1] arm64: Accelerate Adler32 using arm64 SVE instructions. To: Dave Martin CC: , , , , , , , , , , , References: <20201103121506.1533-1-liqiang64@huawei.com> <20201105165301.GH6882@arm.com> <20201110104629.GJ6882@arm.com> <89a9bdcc-b96e-2f2d-6c52-ca44e0e3472c@huawei.com> <20201110160708.GL6882@arm.com> <484ad2c8-3905-fc98-237c-f7eb4045edbc@huawei.com> <20201112111745.GS6882@arm.com> <72514954-ea04-6aa3-73d8-bb0fc39b6de2@huawei.com> <20201116155636.GZ6882@arm.com> From: Li Qiang Message-ID: Date: Tue, 17 Nov 2020 20:45:55 +0800 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Thunderbird/78.4.0 MIME-Version: 1.0 In-Reply-To: <20201116155636.GZ6882@arm.com> Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 8bit X-Originating-IP: [10.110.54.32] X-CFilter-Loop: Reflected Precedence: bulk List-ID: X-Mailing-List: linux-crypto@vger.kernel.org 在 2020/11/16 23:56, Dave Martin 写道: >> --8<-- >> ... >> adler_A .req x10 >> adler_B .req x11 >> >> .macro adler32_core >> ld1b zX.h, p0/z, [x1] // load bytes >> inch x1 >> >> uaddv d0, p0, zX.h >> mul zX.h, p0/m, zX.h, zJ.h // Sum [j=0 .. v-1] j*X[j+n] >> mov x9, v0.d[0] >> uaddv d1, p0, zX.h >> add adler_A, adler_A, x9 // A[n+v] = An + Sum [j=0 ... v-1] X[j] >> mov x9, v1.d[0] >> madd adler_B, x7, adler_A, adler_B // Bn + v*A[n+v] >> sub adler_B, adler_B, x9 // B[n+v] = Bn + v*A[n+v] - Sum [j=0 .. v-1] j*X[j+n] >> .endm > If this has best performance, I find that quite surprising. Those uaddv > instructions will stop the vector lanes flowing independently inside the > loop, so if an individual element load is slow arriuaddving then everything > will have to wait. I don't know much about this problem, do you mean that the uaddv instruction used in the loop has a great impact on performance? > > A decent hardware prefetcher may tend to hide that issue for sequential > memory access, though: i.e., if the hardware does a decent job of > fetching data before the actual loads are issued, the data may appear to > arrive with minimal delay. > > The effect might be a lot worse for algorithms that have less > predictable memory access patterns. > > Possibly you do win some additional performance due to processing twice > as many elements at once, here. I think so. Compared to loading bytes into zX.h, if you load them directly into zX.b, and then use uunpklo/uunpkhi for register expansion, the performance will be more better(20% faster). This may be the reason. -- Best regards, Li Qiang