Subject: [PATCH v3 1/1] riscv: __asm_copy_to-from_user: Optimize unaligned memory access and pipeline stall
From: Akira Tsukamoto
To: Paul Walmsley, Palmer Dabbelt, Albert Ou, linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org
Date: Wed, 23 Jun 2021 21:40:39 +0900
Message-ID: <60c1f087-1e8b-8f22-7d25-86f5f3dcee3f@gmail.com>
In-Reply-To: <3e1dbea4-3b0f-de32-5447-2e23c6d4652a@gmail.com>

This patch dramatically reduces CPU usage in kernel space, especially for applications that make system calls with large buffers, such as network applications.
The main reason is that, on hardware that does not handle misaligned accesses natively, every unaligned memory access raises an exception that is handled in M-mode, so each access switches between S-mode and M-mode and incurs a large overhead.

First, copy bytes until the destination address reaches a word-aligned boundary. This prepares for the bulk aligned word copy.

The destination address is now aligned, but the source address often is not. To avoid unaligned memory accesses, the source is read at aligned word boundaries, which leaves the data offset by the source misalignment. Each destination word is then assembled from the word loaded in the previous iteration and the newly loaded one, shifting out the offset before the store. Most of the speedup comes from this shift copy.

In the lucky case where both the source and destination addresses are aligned, the data is copied with register-sized loads and stores. This loop is unrolled: without unrolling, each store would use the register written by the immediately preceding load and stall the pipeline.

Finally, copy the remainder one byte at a time.
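For readers following the steps above, here is a rough C model of the same phases. It is a sketch under stated assumptions, not the kernel code: it assumes a little-endian host and a 64-bit register (SZREG = 8, as on RV64), it leaves out the SUM toggling and the fault fixups, and the names copy_sketch, load_word and store_word are hypothetical. Like the assembly, the misaligned path reads whole aligned source words, so it may touch a few bytes just before or after the source range, always inside an aligned word that also holds bytes being copied; in C this is only well defined when the surrounding buffer covers those words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models RV64: one register holds SZREG = 8 bytes. Assumes a little-endian host. */
#define SZREG sizeof(uint64_t)
#define WORD_MASK ((uintptr_t)(SZREG - 1))

/* Hypothetical helpers: aligned word load/store. memcpy sidesteps
 * strict-aliasing issues and compiles to one load/store when aligned. */
static uint64_t load_word(const unsigned char *p)
{
    uint64_t w;
    memcpy(&w, p, SZREG);
    return w;
}

static void store_word(unsigned char *p, uint64_t w)
{
    memcpy(p, &w, SZREG);
}

/* Hypothetical C model of the copy loop; no fault handling, no SUM toggling. */
static void copy_sketch(unsigned char *dst, const unsigned char *src, size_t size)
{
    unsigned char *end = dst + size;

    if (size >= 8 * SZREG) {
        /* 1. Byte copy until dst reaches a word boundary. */
        while ((uintptr_t)dst & WORD_MASK)
            *dst++ = *src++;

        uintptr_t offset = (uintptr_t)src & WORD_MASK;
        if (offset == 0) {
            /* 2a. Both aligned: eight loads, then eight stores, per iteration. */
            while ((size_t)(end - dst) >= 8 * SZREG) {
                uint64_t w[8];
                for (size_t i = 0; i < 8; i++)
                    w[i] = load_word(src + i * SZREG);
                for (size_t i = 0; i < 8; i++)
                    store_word(dst + i * SZREG, w[i]);
                src += 8 * SZREG;
                dst += 8 * SZREG;
            }
        } else {
            /* 2b. src misaligned: only aligned loads; each stored word is
             * merged from the previous and the next source word. */
            const unsigned char *asrc =
                (const unsigned char *)((uintptr_t)src & ~WORD_MASK);
            unsigned prev_shift = (unsigned)offset * 8;          /* bytes -> bits */
            unsigned cur_shift = (unsigned)(8 * SZREG) - prev_shift;
            unsigned char *wend = (unsigned char *)((uintptr_t)end & ~WORD_MASK);
            uint64_t prev = load_word(asrc);

            assert(prev_shift > 0 && cur_shift > 0);  /* both shifts stay below 64 */
            while (dst < wend) {
                uint64_t next = load_word(asrc + SZREG);
                store_word(dst, (prev >> prev_shift) | (next << cur_shift));
                prev = next;
                asrc += SZREG;
                dst += SZREG;
            }
            src = asrc + offset;  /* back to the unaligned source position */
        }
    }

    /* 3. Byte copy whatever is left. */
    while (dst < end)
        *dst++ = *src++;
}
```

Unlike the assembly, the word loops in this sketch test the remaining length before each iteration, so the size threshold here only selects which path is taken and does not affect correctness.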
Signed-off-by: Akira Tsukamoto
---
 arch/riscv/lib/uaccess.S | 181 +++++++++++++++++++++++++++++++--------
 1 file changed, 146 insertions(+), 35 deletions(-)

diff --git a/arch/riscv/lib/uaccess.S b/arch/riscv/lib/uaccess.S
index fceaeb18cc64..bceb0629e440 100644
--- a/arch/riscv/lib/uaccess.S
+++ b/arch/riscv/lib/uaccess.S
@@ -19,50 +19,161 @@ ENTRY(__asm_copy_from_user)
 	li t6, SR_SUM
 	csrs CSR_STATUS, t6
 
-	add a3, a1, a2
-	/* Use word-oriented copy only if low-order bits match */
-	andi t0, a0, SZREG-1
-	andi t1, a1, SZREG-1
-	bne t0, t1, 2f
+	/* Save for return value */
+	mv	t5, a2
 
-	addi t0, a1, SZREG-1
-	andi t1, a3, ~(SZREG-1)
-	andi t0, t0, ~(SZREG-1)
 	/*
-	 * a3: terminal address of source region
-	 * t0: lowest XLEN-aligned address in source
-	 * t1: highest XLEN-aligned address in source
+	 * Register allocation for code below:
+	 * a0 - start of uncopied dst
+	 * a1 - start of uncopied src
+	 * a2 - size
+	 * t0 - end of uncopied dst
 	 */
-	bgeu t0, t1, 2f
-	bltu a1, t0, 4f
+	add	t0, a0, a2
+	bgtu	a0, t0, 5f
+
+	/*
+	 * Use byte copy only if too small.
+	 */
+	li	a3, 9*SZREG /* dst alignment (< SZREG) plus one word_copy iteration (8*SZREG) */
+	bltu	a2, a3, .Lbyte_copy_tail
+
+	/*
+	 * Copy first bytes until dst is aligned to word boundary.
+	 * a0 - start of dst
+	 * t1 - start of aligned dst
+	 */
+	addi	t1, a0, SZREG-1
+	andi	t1, t1, ~(SZREG-1)
+	/* dst is already aligned, skip */
+	beq	a0, t1, .Lskip_first_bytes
 1:
-	fixup REG_L, t2, (a1), 10f
-	fixup REG_S, t2, (a0), 10f
-	addi a1, a1, SZREG
-	addi a0, a0, SZREG
-	bltu a1, t1, 1b
+	/* a5 - one byte for copying data */
+	fixup lb	a5, 0(a1), 10f
+	addi	a1, a1, 1	/* src */
+	fixup sb	a5, 0(a0), 10f
+	addi	a0, a0, 1	/* dst */
+	bltu	a0, t1, 1b	/* t1 - start of aligned dst */
+
+.Lskip_first_bytes:
+	/*
+	 * Now dst is aligned.
+	 * Use shift-copy if src is misaligned.
+	 * Use word-copy if both src and dst are aligned, because
+	 * shift-copy cannot be used when no shifting is required.
+	 */
+	/* a1 - start of src */
+	andi	a3, a1, SZREG-1
+	bnez	a3, .Lshift_copy
+
+.Lword_copy:
+	/*
+	 * Both src and dst are aligned, unrolled word copy
+	 *
+	 * a0 - start of aligned dst
+	 * a1 - start of aligned src
+	 * a3 - a1 & mask:(SZREG-1)
+	 * t0 - end of aligned dst
+	 */
+	addi	t0, t0, -(8*SZREG-1) /* avoid overrun */
 2:
-	bltu a1, a3, 5f
+	fixup REG_L	a4, 0(a1), 10f
+	fixup REG_L	a5, SZREG(a1), 10f
+	fixup REG_L	a6, 2*SZREG(a1), 10f
+	fixup REG_L	a7, 3*SZREG(a1), 10f
+	fixup REG_L	t1, 4*SZREG(a1), 10f
+	fixup REG_L	t2, 5*SZREG(a1), 10f
+	fixup REG_L	t3, 6*SZREG(a1), 10f
+	fixup REG_L	t4, 7*SZREG(a1), 10f
+	fixup REG_S	a4, 0(a0), 10f
+	fixup REG_S	a5, SZREG(a0), 10f
+	fixup REG_S	a6, 2*SZREG(a0), 10f
+	fixup REG_S	a7, 3*SZREG(a0), 10f
+	fixup REG_S	t1, 4*SZREG(a0), 10f
+	fixup REG_S	t2, 5*SZREG(a0), 10f
+	fixup REG_S	t3, 6*SZREG(a0), 10f
+	fixup REG_S	t4, 7*SZREG(a0), 10f
+	addi	a0, a0, 8*SZREG
+	addi	a1, a1, 8*SZREG
+	bltu	a0, t0, 2b
+
+	addi	t0, t0, 8*SZREG-1 /* revert to original value */
+	j	.Lbyte_copy_tail
+
+.Lshift_copy:
+
+	/*
+	 * Word copy with shifting.
+	 * For misaligned copy we still perform aligned word copy, but
+	 * we need to use the value fetched from the previous iteration and
+	 * do some shifts.
+	 * This is safe because the loads are aligned: any extra bytes read
+	 * lie in the same aligned words as the bytes being copied.
+	 *
+	 * a0 - start of aligned dst
+	 * a1 - start of src
+	 * a3 - a1 & mask:(SZREG-1)
+	 * t0 - end of uncopied dst
+	 * t1 - end of aligned dst
+	 */
+	/* calculating aligned word boundary for dst */
+	andi	t1, t0, ~(SZREG-1)
+	/* Converting unaligned src to aligned src */
+	andi	a1, a1, ~(SZREG-1)
+
+	/*
+	 * Calculate shifts
+	 * t3 - prev shift
+	 * t4 - current shift
+	 */
+	slli	t3, a3, 3 /* convert byte offset in a3 to bits */
+	li	a5, SZREG*8
+	sub	t4, a5, t3
+
+	/* Load the first word to combine with second word */
+	fixup REG_L	a5, 0(a1), 10f
 
 3:
+	/* Main shifting copy
+	 *
+	 * a0 - start of aligned dst
+	 * a1 - start of aligned src
+	 * t1 - end of aligned dst
+	 */
+
+	/* At least one iteration will be executed */
+	srl	a4, a5, t3
+	fixup REG_L	a5, SZREG(a1), 10f
+	addi	a1, a1, SZREG
+	sll	a2, a5, t4
+	or	a2, a2, a4
+	fixup REG_S	a2, 0(a0), 10f
+	addi	a0, a0, SZREG
+	bltu	a0, t1, 3b
+
+	/* Revert src to original unaligned value */
+	add	a1, a1, a3
+
+.Lbyte_copy_tail:
+	/*
+	 * Byte copy anything left.
+	 *
+	 * a0 - start of remaining dst
+	 * a1 - start of remaining src
+	 * t0 - end of remaining dst
+	 */
+	bgeu	a0, t0, 5f
+4:
+	fixup lb	a5, 0(a1), 10f
+	addi	a1, a1, 1	/* src */
+	fixup sb	a5, 0(a0), 10f
+	addi	a0, a0, 1	/* dst */
+	bltu	a0, t0, 4b	/* t0 - end of dst */
+
+5:
 	/* Disable access to user memory */
 	csrc CSR_STATUS, t6
-	li a0, 0
+	li	a0, 0
 	ret
-4: /* Edge case: unalignment */
-	fixup lbu, t2, (a1), 10f
-	fixup sb, t2, (a0), 10f
-	addi a1, a1, 1
-	addi a0, a0, 1
-	bltu a1, t0, 4b
-	j 1b
-5: /* Edge case: remainder */
-	fixup lbu, t2, (a1), 10f
-	fixup sb, t2, (a0), 10f
-	addi a1, a1, 1
-	addi a0, a0, 1
-	bltu a1, a3, 5b
-	j 3b
 ENDPROC(__asm_copy_to_user)
 ENDPROC(__asm_copy_from_user)
 EXPORT_SYMBOL(__asm_copy_to_user)
@@ -117,7 +228,7 @@ EXPORT_SYMBOL(__clear_user)
 10:
 	/* Disable access to user memory */
 	csrs CSR_STATUS, t6
-	mv a0, a2
+	mv	a0, t5
 	ret
 11:
 	csrs CSR_STATUS, t6
-- 
2.17.1
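The shift amounts in the shift-copy loop are bit counts: t3 is the source byte offset (a3) times 8, and t4 is the register width in bits minus t3. The sketch below models one loop iteration for a 32-bit register as on RV32, assuming little-endian byte order; shift_amounts and merge32 are hypothetical helper names, not kernel functions. With a byte offset of 1 on RV32 the shifts must be 8 and 24. Scaling the offset by 4 instead, which is what a left shift by log2(SZREG) = 2 does on RV32, selects the wrong bits; since LGREG in the kernel's RISC-V asm.h is that log2 value (3 on RV64, 2 on RV32), converting the offset with LGREG only happens to be correct on RV64.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: computes the shift amounts the shift-copy loop
 * keeps in t3 ("prev shift") and t4 ("current shift").
 * szreg is the register size in bytes (4 on RV32, 8 on RV64) and
 * offset is the source misalignment in bytes (a3 in the patch).
 */
static void shift_amounts(unsigned szreg, unsigned offset,
                          unsigned *prev_shift, unsigned *cur_shift)
{
    /* Only a misaligned source takes the shift-copy path. */
    assert(offset > 0 && offset < szreg);
    *prev_shift = offset * 8;              /* bytes -> bits on any XLEN */
    *cur_shift = szreg * 8 - *prev_shift;  /* li a5, SZREG*8; sub t4, a5, t3 */
}

/*
 * Hypothetical model of one loop iteration with a 32-bit register
 * (little-endian byte order):
 *   srl a4, a5, t3  -> prev >> prev_shift
 *   sll a2, a5, t4  -> next << cur_shift
 *   or  a2, a2, a4  -> merged word that is stored to dst
 */
static uint32_t merge32(uint32_t prev, uint32_t next,
                        unsigned prev_shift, unsigned cur_shift)
{
    return (prev >> prev_shift) | (next << cur_shift);
}
```

For example, with memory bytes 00 01 02 03 | 04 05 06 07 and the copy starting at byte 01, merge32(0x03020100, 0x07060504, 8, 24) yields 0x04030201, i.e. bytes 01 02 03 04 in memory order.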