Date: Mon, 13 Feb 2023 16:18:50 +0800
From: Leo Yan
To: Feng Tang
Cc: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Alexander Shishkin, Jiri Olsa, Namhyung Kim, Joe Mario,
	linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
	Andi Kleen, Kan Liang, Xing Zhengjun
Subject: Re: [PATCH] perf c2c: Add report option to show false sharing in adjacent cachelines
References: <20230213031733.236485-1-feng.tang@intel.com>
In-Reply-To: <20230213031733.236485-1-feng.tang@intel.com>

On Mon, Feb 13, 2023 at 11:17:33AM +0800, Feng Tang wrote:
> Many platforms have feature of adjacent cachelines prefetch, when it
> is enabled, for data in RAM of 2 cachelines (2N and 2N+1) granularity,
> if one is fetched to cache, the other one could likely be fetched too,
> which sort of extends the cacheline size to double, thus the false
> sharing could happens in adjacent cachelines.
>
> 0Day has captured performance changed related with this [1], and some
> commercial software explicitly makes its hot global variables 128 bytes
> aligned (2 cache lines) to avoid this kind of extended false sharing.
>
> So add an option "-a" or "--double-cl" for c2c report to show false
> sharing in double cache line granularity, which acts just like the
> cacheline size is doubled. There is no change to c2c record. The
> hardware HITM events are still per cacheline. The option just changes
> the granularity of how events are grouped and displayed.
>
> In the c2c report below (will-it-scale's pagefault2 case on old kernel):
>
>   ----------------------------------------------------------------------
>       26       31        2        0        0        0  0xffff888103ec6000
>   ----------------------------------------------------------------------
>    35.48%   50.00%    0.00%    0.00%    0.00%     0x10     0     1  0xffffffff8133148b  1153    66   971  3748    74  [k] get_mem_cgroup_from_mm
>     6.45%    0.00%    0.00%    0.00%    0.00%     0x10     0     1  0xffffffff813396e4   570     0  1531   879    75  [k] mem_cgroup_charge
>    25.81%   50.00%    0.00%    0.00%    0.00%     0x54     0     1  0xffffffff81331472   949    70   593  3359    74  [k] get_mem_cgroup_from_mm
>    19.35%    0.00%    0.00%    0.00%    0.00%     0x54     0     1  0xffffffff81339686  1352     0  1073  1022    74  [k] mem_cgroup_charge
>     9.68%    0.00%    0.00%    0.00%    0.00%     0x54     0     1  0xffffffff813396d6  1401     0   863   768    74  [k] mem_cgroup_charge
>     3.23%    0.00%    0.00%    0.00%    0.00%     0x54     0     1  0xffffffff81333106   618     0   804    11     9  [k] uncharge_batch
>
> The offset 0x10 and 0x54 used to displayed in 2 groups, and now they
> are listed together to give users a hint.
>
> [1]. https://lore.kernel.org/lkml/20201102091543.GM31092@shao2-debian/
>
> Signed-off-by: Feng Tang
> Reviewed-by: Andi Kleen
> ---
>  tools/perf/Documentation/perf-c2c.txt |  6 ++++++
>  tools/perf/builtin-c2c.c              | 22 +++++++++++++---------
>  tools/perf/util/cacheline.h           | 25 ++++++++++++++++++++-----
>  tools/perf/util/sort.c                | 13 ++++++++++---
>  tools/perf/util/sort.h                |  1 +
>  5 files changed, 50 insertions(+), 17 deletions(-)
>
> diff --git a/tools/perf/Documentation/perf-c2c.txt b/tools/perf/Documentation/perf-c2c.txt
> index 5c5eb2def83e..a8e1e40d270e 100644
> --- a/tools/perf/Documentation/perf-c2c.txt
> +++ b/tools/perf/Documentation/perf-c2c.txt
> @@ -126,6 +126,12 @@ REPORT OPTIONS
>  	The known limitations include exception handing such as
>  	setjmp/longjmp will have calls/returns not match.
>
> +-a::
> +--double-cl::
> +	Group HITM events in double cacheline granularity. Some architecture
> +	has Adjacent Cacheline Prefetch feature, which behaves like the
> +	cachline size is doubled.

Sorry I gave my review tags too quickly... Just found several typos,
please fix them.
s/Some architecture has/Some architectures have

This is just a literal fix; besides x86, I don't know if any other
architecture has the Adjacent Cacheline Prefetch feature :)

s/cachline/cacheline

> +
>  C2C RECORD
>  ----------
>  The perf c2c record command setup options related to HITM cacheline analysis
> diff --git a/tools/perf/builtin-c2c.c b/tools/perf/builtin-c2c.c
> index 52d94c7dd836..7d495db7e5a2 100644
> --- a/tools/perf/builtin-c2c.c
> +++ b/tools/perf/builtin-c2c.c
> @@ -524,7 +524,7 @@ static int dcacheline_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
>  	char buf[20];
>
>  	if (he->mem_info)
> -		addr = cl_address(he->mem_info->daddr.addr);
> +		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
>
>  	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
>  }
> @@ -562,7 +562,7 @@ static int offset_entry(struct perf_hpp_fmt *fmt, struct perf_hpp *hpp,
>  	char buf[20];
>
>  	if (he->mem_info)
> -		addr = cl_offset(he->mem_info->daddr.al_addr);
> +		addr = cl_offset(he->mem_info->daddr.al_addr, chk_double_cl);
>
>  	return scnprintf(hpp->buf, hpp->size, "%*s", width, HEX_STR(buf, addr));
>  }
> @@ -574,9 +574,10 @@ offset_cmp(struct perf_hpp_fmt *fmt __maybe_unused,
>  	uint64_t l = 0, r = 0;
>
>  	if (left->mem_info)
> -		l = cl_offset(left->mem_info->daddr.addr);
> +		l = cl_offset(left->mem_info->daddr.addr, chk_double_cl);
> +
>  	if (right->mem_info)
> -		r = cl_offset(right->mem_info->daddr.addr);
> +		r = cl_offset(right->mem_info->daddr.addr, chk_double_cl);
>
>  	return (int64_t)(r - l);
>  }
> @@ -2590,7 +2591,7 @@ perf_c2c_cacheline_browser__title(struct hist_browser *browser,
>  	he = cl_browser->he;
>
>  	if (he->mem_info)
> -		addr = cl_address(he->mem_info->daddr.addr);
> +		addr = cl_address(he->mem_info->daddr.addr, chk_double_cl);
>
>  	scnprintf(bf, size, "Cacheline 0x%lx", addr);
>  	return 0;
> @@ -2788,15 +2789,16 @@ static int ui_quirks(void)
>  	if (!c2c.use_stdio) {
>  		dim_offset.width = 5;
>  		dim_offset.header = header_offset_tui;
> -		nodestr = "CL";
> +		nodestr = chk_double_cl ? "Double-CL" : "CL";
>  	}
>
>  	dim_percent_costly_snoop.header = percent_costly_snoop_header[c2c.display];
>
>  	/* Fix the zero line for dcacheline column. */
> -	buf = fill_line("Cacheline", dim_dcacheline.width +
> -			dim_dcacheline_node.width +
> -			dim_dcacheline_count.width + 4);
> +	buf = fill_line(chk_double_cl ? "Double-Cacheline" : "Cacheline",
> +			dim_dcacheline.width +
> +			dim_dcacheline_node.width +
> +			dim_dcacheline_count.width + 4);
>  	if (!buf)
>  		return -ENOMEM;
>
> @@ -3037,6 +3039,8 @@ static int perf_c2c__report(int argc, const char **argv)
>  	OPT_BOOLEAN('f', "force", &symbol_conf.force, "don't complain, do it"),
>  	OPT_BOOLEAN(0, "stitch-lbr", &c2c.stitch_lbr,
>  		    "Enable LBR callgraph stitching approach"),
> +	OPT_BOOLEAN('a', "double-cl", &chk_double_cl,
> +		    "Check adjacent cachline false sharing"),

I personally think the word "Detect" is better than "Check".

s/cachline/cacheline

>  	OPT_PARENT(c2c_options),
>  	OPT_END()
>  	};
> diff --git a/tools/perf/util/cacheline.h b/tools/perf/util/cacheline.h
> index dec8c0fb1f4a..630d16731b4f 100644
> --- a/tools/perf/util/cacheline.h
> +++ b/tools/perf/util/cacheline.h
> @@ -6,16 +6,31 @@
>
>  int __pure cacheline_size(void);
>
> -static inline u64 cl_address(u64 address)
> +
> +/*
> + * Some architecture has 'Adjacent Cacheline Prefetch' feature,

s/Some architecture has/Some architectures have

> + * which performs like the cacheline size being doubled.
> + */
> +static inline u64 cl_address(u64 address, bool double_cl)
>  {
> +	u64 size = cacheline_size();
> +
> +	if (double_cl)
> +		size *= 2;
> +
>  	/* return the cacheline of the address */
> -	return (address & ~(cacheline_size() - 1));
> +	return (address & ~(size - 1));
>  }
>
> -static inline u64 cl_offset(u64 address)
> +static inline u64 cl_offset(u64 address, bool double_cl)
>  {
> -	/* return the cacheline of the address */
> -	return (address & (cacheline_size() - 1));
> +	u64 size = cacheline_size();
> +
> +	if (double_cl)
> +		size *= 2;
> +
> +	/* return the offset inside cachline */

s/cachline/cacheline

> +	return (address & (size - 1));
>  }
>
>  #endif // PERF_CACHELINE_H
> diff --git a/tools/perf/util/sort.c b/tools/perf/util/sort.c
> index e188f74698dd..148b28f0a7e2 100644
> --- a/tools/perf/util/sort.c
> +++ b/tools/perf/util/sort.c
> @@ -52,6 +52,13 @@ enum sort_mode sort__mode = SORT_MODE__NORMAL;
>  static const char *const dynamic_headers[] = {"local_ins_lat", "ins_lat", "local_p_stage_cyc", "p_stage_cyc"};
>  static const char *const arch_specific_sort_keys[] = {"local_p_stage_cyc", "p_stage_cyc"};
>
> +/*
> + * Some architecture has Adjacent Cacheline Prefetch feature, which behaves
> + * like the cachline size is doubled. Enable this flag to check things in

s/Some architecture has/Some architectures have

s/cachline/cacheline

With these typo fixes, you could add my reviewed and tested tags.

Thanks,
Leo

> + * double cacheline granularity.
> + */
> +bool chk_double_cl;
> +
>  /*
>   * Replaces all occurrences of a char used with the:
>   *
> @@ -1499,8 +1506,8 @@ sort__dcacheline_cmp(struct hist_entry *left, struct hist_entry *right)
>
>  addr:
>  	/* al_addr does all the right addr - start + offset calculations */
> -	l = cl_address(left->mem_info->daddr.al_addr);
> -	r = cl_address(right->mem_info->daddr.al_addr);
> +	l = cl_address(left->mem_info->daddr.al_addr, chk_double_cl);
> +	r = cl_address(right->mem_info->daddr.al_addr, chk_double_cl);
>
>  	if (l > r) return -1;
>  	if (l < r) return 1;
> @@ -1519,7 +1526,7 @@ static int hist_entry__dcacheline_snprintf(struct hist_entry *he, char *bf,
>  	if (he->mem_info) {
>  		struct map *map = he->mem_info->daddr.ms.map;
>
> -		addr = cl_address(he->mem_info->daddr.al_addr);
> +		addr = cl_address(he->mem_info->daddr.al_addr, chk_double_cl);
>  		ms = &he->mem_info->daddr.ms;
>
>  		/* print [s] for shared data mmaps */
> diff --git a/tools/perf/util/sort.h b/tools/perf/util/sort.h
> index 921715e6aec4..04f0a6dc7381 100644
> --- a/tools/perf/util/sort.h
> +++ b/tools/perf/util/sort.h
> @@ -35,6 +35,7 @@ extern struct sort_entry sort_sym_from;
>  extern struct sort_entry sort_sym_to;
>  extern struct sort_entry sort_srcline;
>  extern const char default_mem_sort_order[];
> +extern bool chk_double_cl;
>
>  struct res_sample {
>  	u64 time;
> --
> 2.30.2
>