From: Peter Newman
Date: Mon, 15 May 2023 16:42:01 +0200
Subject: Re: [PATCH v1 3/9] x86/resctrl: Add resctrl_mbm_flush_cpu() to collect CPUs' MBM events
To: Reinette Chatre
Cc: Fenghua Yu, Babu Moger, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86@kernel.org, "H. Peter Anvin", Stephane Eranian, James Morse, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org

Hi Reinette,

On Fri, May 12, 2023 at 5:26 PM Reinette Chatre wrote:
> On 5/12/2023 6:25 AM, Peter Newman wrote:
> > On Thu, May 11, 2023 at 11:37 PM Reinette Chatre wrote:
> >> On 4/21/2023 7:17 AM, Peter Newman wrote:
> >>> Implement resctrl_mbm_flush_cpu(), which collects a domain's current MBM
> >>> event counts into its current software RMID. The delta for each CPU is
> >>> determined by tracking the previous event counts in per-CPU data. The
> >>> software byte counts reside in the arch-independent mbm_state
> >>> structures.
> >>
> >> Could you elaborate why the arch-independent mbm_state was chosen?
> >
> > It largely had to do with how many soft RMIDs to implement. For our
> > own needs, we were mainly concerned with getting back to the number of
> > monitoring groups the hardware claimed to support, so there wasn't
> > much internal motivation to support an unbounded number of soft RMIDs.
>
> Apologies for not being explicit, I was actually curious why the
> arch-independent mbm_state, as opposed to the arch-dependent state, was
> chosen.
>
> I think the lines are getting a bit blurry here with the software RMID
> feature added as a resctrl filesystem feature (and thus non-architectural),
> but it is specific to AMD architecture.
The soft RMID solution applies conceptually to any system where the
number of hardware counters is smaller than the number of desired
monitoring groups, but at least as large as the number of CPUs. It's a
solution we may need to rely on more in the future, as it's easier for
monitoring hardware to scale to the number of CPUs than to
(CPUs * mbm_domains). I believed the counts in bytes would apply to the
user interface universally.

However, I did recently rebase these changes onto one of James's MPAM
snapshot branches, and __mbm_flush() did end up fitting better on the
arch-dependent side, so I was forced to move the counters over to
arch_mbm_state because on the snapshot branch the arch-dependent code
cannot see the arch-independent mbm_state structure. I then created
resctrl_arch_*() helpers for __mon_event_count() to read the counts
from the arch_mbm_state.

In hindsight, despite generic-looking code being able to retrieve the
CPU counts with resctrl_arch_rmid_read(), the permanent assignment of a
HW RMID to a CPU is an implementation detail specific to the RDT/PQoS
interface and would probably not align with a theoretical MPAM
implementation.

> > However, breaking this artificial connection between supported HW and
> > SW RMIDs to support arbitrarily-many monitoring groups could make the
> > implementation conceptually cleaner. If you agree, I would be happy
> > to give it a try in the next series.
>
> I have not actually considered this. At first glance I think this would
> add more tentacles into the core where currently the number of RMIDs
> supported are queried from the device and supporting an arbitrary number
> would impact that. At this time the RMID state is also pre-allocated
> and thus not possible to support an "arbitrarily-many".

Yes, this was the part that made me want to just leave the RMID count alone.
>
> >>> +/*
> >>> + * Called from context switch code __resctrl_sched_in() when the current soft
> >>> + * RMID is changing or before reporting event counts to user space.
> >>> + */
> >>> +void resctrl_mbm_flush_cpu(void)
> >>> +{
> >>> +	struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> >>> +	int cpu = smp_processor_id();
> >>> +	struct rdt_domain *d;
> >>> +
> >>> +	d = get_domain_from_cpu(cpu, r);
> >>> +	if (!d)
> >>> +		return;
> >>> +
> >>> +	if (is_mbm_local_enabled())
> >>> +		__mbm_flush(QOS_L3_MBM_LOCAL_EVENT_ID, r, d);
> >>> +	if (is_mbm_total_enabled())
> >>> +		__mbm_flush(QOS_L3_MBM_TOTAL_EVENT_ID, r, d);
> >>> +}
> >>
> >> This (potentially) adds two MSR writes and two MSR reads to what could possibly
> >> be quite slow MSRs if it was not designed to be used in context switch. Do you
> >> perhaps have data on how long these MSR reads/writes take on these systems to get
> >> an idea about the impact on context switch? I think this data should feature
> >> prominently in the changelog.
> >
> > I can probably use ftrace to determine the cost of an __rmid_read()
> > call on a few implementations.
>
> On a lower level I think it may be interesting to measure more closely
> just how long a wrmsr and rdmsr take on these registers. It may be interesting
> if you, for example, use rdtsc_ordered() before and after these calls, and then
> compare it to how long it takes to write the PQR register that has been
> designed to be used in context switch.
>
> > To understand the overall impact to context switch, I can put together
> > a scenario where I can control whether the context switches being
> > measured result in change of soft RMID to prevent the data from being
> > diluted by non-flushing switches.
>
> This sounds great. Thank you very much.
I used a simple parent-child pipe loop benchmark with the parent in one
monitoring group and the child in another to trigger 2M context switches
on the same CPU, and compared the sample-based profiles on an AMD and an
Intel implementation. I used perf diff to compare the samples between
hard and soft RMID switches.

Intel(R) Xeon(R) Platinum 8173M CPU @ 2.00GHz:

             +44.80%  [kernel.kallsyms]  [k] __rmid_read
    10.43%   -9.52%  [kernel.kallsyms]  [k] __switch_to

AMD EPYC 7B12 64-Core Processor:

             +28.27%  [kernel.kallsyms]  [k] __rmid_read
    13.45%  -13.44%  [kernel.kallsyms]  [k] __switch_to

Note that a soft RMID switch that doesn't change the CLOSID skips the
PQR_ASSOC write completely, so from this data I can roughly say that
__rmid_read() takes a little over 2x as long as a PQR_ASSOC write that
changes the current RMID on the AMD implementation, and about 4.5x
longer on Intel.

Let me know if this clarifies the cost enough, or if you'd like to also
see instrumented measurements of the individual WRMSR/RDMSR
instructions.

Thanks!
-Peter