Date: Fri, 23 Apr 2021 16:37:02 +0100
From: Catalin Marinas
To: Kai Shen
Cc: will@kernel.org, linux-arm-kernel@lists.infradead.org, LKML,
 xuwei5@hisilicon.com, hewenliang4@huawei.com, wuxu.wu@huawei.com
Subject: Re: [PATCH] arm64:align function __arch_clear_user
Message-ID: <20210423153701.GP18757@arm.com>
References: <58fecb22-f932-cb6e-d996-ca75fe26a75d@huawei.com>
 <20210414104144.GB8320@arm.com>
 <6829062c-a2d4-57da-4037-269fb7508993@huawei.com>
In-Reply-To: <6829062c-a2d4-57da-4037-269fb7508993@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Apr 19, 2021 at 10:05:16AM +0800, Kai Shen wrote:
> On 2021/4/14 18:41, Catalin Marinas wrote:
> > On Wed, Apr 14, 2021 at 05:25:43PM +0800, Kai Shen wrote:
> > > Performance decreases happen in __arch_clear_user when this
> > > function is not correctly aligned on HISI-HIP08 arm64 SOC which
> > > fetches
> > > 32 bytes (8 instructions) from icache with a 32-bytes
> > > aligned end address. As a result, if the hot loop is not 32-bytes
> > > aligned, it may take more icache fetches which leads to decrease
> > > in performance.
> > > Dump of assembler code for function __arch_clear_user:
> > >        0xffff0000809e3f10:  nop
> > >        0xffff0000809e3f14:  mov   x2, x1
> > >        0xffff0000809e3f18:  subs  x1, x1, #0x8
> > >        0xffff0000809e3f1c:  b.mi  0xffff0000809e3f30 <__arch_clear_user+3
> > > -----  0xffff0000809e3f20:  str   xzr, [x0],#8
> > > hot    0xffff0000809e3f24:  nop
> > > loop   0xffff0000809e3f28:  subs  x1, x1, #0x8
> > > -----  0xffff0000809e3f2c:  b.pl  0xffff0000809e3f20 <__arch_clear_user+1
> > > The hot loop above takes one icache fetch as the code is in one
> > > 32-bytes aligned area and the loop takes one more icache fetch
> > > when it is not aligned like below.
> > >        0xffff0000809e4178:  str   xzr, [x0],#8
> > >        0xffff0000809e417c:  nop
> > >        0xffff0000809e4180:  subs  x1, x1, #0x8
> > >        0xffff0000809e4184:  b.pl  0xffff0000809e4178 <__arch_clear_user+
> > > Data collected by perf:
> > >                          aligned     not aligned
> > > instructions             57733790    57739065
> > > L1-dcache-store          14938070    13718242
> > > L1-dcache-store-misses   349280      349869
> > > L1-icache-loads          15380895    28500665
> > > As we can see, L1-icache-loads almost double when the loop is not
> > > aligned.
> > > This problem is found in linux 4.19 on HISI-HIP08 arm64 SOC.
> > > Not sure what the case is on other arm64 SOC, but it should do
> > > no harm.
> > > Signed-off-by: Kai Shen
> >
> > Do you have a real world workload that's affected by this function?
> >
> > I'm against adding alignments and nops for specific hardware
> > implementations. What about lots of other loops that the compiler may
> > generate or that we wrote in asm?
>
> The benchmark we used which suffer performance decrease:
> https://github.com/redhat-performance/libMicro
> pread $OPTS -N "pread_z1k" -s 1k -I 300 -f /dev/zero
> pread $OPTS -N "pread_z10k" -s 10k -I 1000 -f /dev/zero
> pread $OPTS -N "pread_z100k" -s 100k -I 2000 -f /dev/zero

Is there any real world use-case that would benefit from this
optimisation? Reading /dev/zero in a loop hardly counts as a practical
workload.

-- 
Catalin
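
For reference, the fetch-window arithmetic behind the quoted patch description can be sketched as below. This is a simplified model, not a description of the actual HIP08 front end: it assumes the icache is read in naturally aligned 32-byte windows and that each instruction is 4 bytes wide. The helper name fetch_windows is hypothetical; the two addresses are the loop start addresses from the quoted disassembly.

```python
FETCH_BYTES = 32  # fetch width stated in the quoted patch description
INSN_BYTES = 4    # fixed AArch64 instruction width


def fetch_windows(start: int, n_insns: int, fetch_bytes: int = FETCH_BYTES) -> int:
    """Count the aligned fetch windows touched by n_insns instructions at start.

    Assumption (not from the thread): windows are naturally aligned to
    fetch_bytes, so a loop that straddles a boundary needs one extra fetch.
    """
    first = start // fetch_bytes
    last = (start + n_insns * INSN_BYTES - 1) // fetch_bytes
    return last - first + 1


# The four-instruction hot loop at each address from the quoted dumps.
aligned = fetch_windows(0xFFFF0000809E3F20, 4)
unaligned = fetch_windows(0xFFFF0000809E4178, 4)
print(aligned, unaligned)
```

Under this model the aligned loop fits in one window while the loop at 0x...4178 crosses the boundary at 0x...4180 and needs two, which matches the direction of the L1-icache-loads figures in the quoted perf data.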