2023-10-24 18:16:25

by Luck, Tony

[permalink] [raw]
Subject: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

On Intel the various resource director technology (RDT) features are all
orthogonal and independently enumerated. Thus it is possible to have
a system that provides "total" memory bandwidth measurements without
providing "local" bandwidth measurements.

If local bandwidth measurement is not available, do not give up on
providing the "mba_MBps" feedback option completely, make the code fall
back to using total bandwidth.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/monitor.c | 34 ++++++++++++++++----------
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
2 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..3b9531cce807 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -418,6 +418,14 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
return 0;
}

+static struct mbm_state *get_mbm_data(struct rdt_domain *dom_mbm, int rmid)
+{
+ if (is_mbm_local_enabled())
+ return &dom_mbm->mbm_local[rmid];
+
+ return &dom_mbm->mbm_total[rmid];
+}
+
/*
* mbm_bw_count() - Update bw count from values previously read by
* __mon_event_count().
@@ -431,7 +439,7 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
*/
static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
{
- struct mbm_state *m = &rr->d->mbm_local[rmid];
+ struct mbm_state *m = get_mbm_data(rr->d, rmid);
u64 cur_bw, bytes, cur_bytes;

cur_bytes = rr->val;
@@ -526,14 +534,14 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
struct list_head *head;
struct rdtgroup *entry;

- if (!is_mbm_local_enabled())
+ if (!is_mbm_enabled())
return;

r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

closid = rgrp->closid;
rmid = rgrp->mon.rmid;
- pmbm_data = &dom_mbm->mbm_local[rmid];
+ pmbm_data = get_mbm_data(dom_mbm, rmid);

dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
if (!dom_mba) {
@@ -553,7 +561,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
*/
head = &rgrp->mon.crdtgrp_list;
list_for_each_entry(entry, head, mon.crdtgrp_list) {
- cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
+ cmbm_data = get_mbm_data(dom_mbm, entry->mon.rmid);
cur_bw += cmbm_data->prev_bw;
delta_bw += cmbm_data->delta_bw;
}
@@ -595,7 +603,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
*/
pmbm_data->delta_comp = true;
list_for_each_entry(entry, head, mon.crdtgrp_list) {
- cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
+ cmbm_data = get_mbm_data(dom_mbm, entry->mon.rmid);
cmbm_data->delta_comp = true;
}
}
@@ -621,15 +629,15 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
rr.val = 0;
__mon_event_count(rmid, &rr);
-
- /*
- * Call the MBA software controller only for the
- * control groups and when user has enabled
- * the software controller explicitly.
- */
- if (is_mba_sc(NULL))
- mbm_bw_count(rmid, &rr);
}
+
+ /*
+ * Call the MBA software controller only for the
+ * control groups and when user has enabled
+ * the software controller explicitly.
+ */
+ if (is_mba_sc(NULL))
+ mbm_bw_count(rmid, &rr);
}

/*
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..0c4f8a1b8df0 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2294,7 +2294,7 @@ static bool supports_mba_mbps(void)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

- return (is_mbm_local_enabled() &&
+ return (is_mbm_enabled() &&
r->alloc_capable && is_mba_linear());
}

--
2.41.0


2023-10-24 18:25:14

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

> If local bandwidth measurement is not available, do not give up on
> providing the "mba_MBps" feedback option completely, make the code fall
> back to using total bandwidth.

N.B. I don't have a system yet that does this. Tested using boot command line
"clearcpuid=cqm_mbm_local" argument to the kernel to fake the non-prescence
of local bandwidth monitoring.

-Tony

2023-10-24 23:20:46

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

Hi Tony,

> -----Original Message-----
> From: Tony Luck <[email protected]>
> Sent: Tuesday, October 24, 2023 1:16 PM
> To: Fenghua Yu <[email protected]>; Reinette Chatre
> <[email protected]>; Peter Newman <[email protected]>;
> Jonathan Corbet <[email protected]>; Shuah Khan <[email protected]>;
> [email protected]
> Cc: Shaopeng Tan <[email protected]>; James Morse
> <[email protected]>; Jamie Iles <[email protected]>; Moger, Babu
> <[email protected]>; Randy Dunlap <[email protected]>; linux-
> [email protected]; [email protected]; [email protected];
> Tony Luck <[email protected]>
> Subject: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w
> unavailable
>
> On Intel the various resource director technology (RDT) features are all
> orthogonal and independently enumerated. Thus it is possible to have a system
> that provides "total" memory bandwidth measurements without providing
> "local" bandwidth measurements.
>
> If local bandwidth measurement is not available, do not give up on providing
> the "mba_MBps" feedback option completely, make the code fall back to using
> total bandwidth.


Is this customer requirement ?
What do you mean by " If local bandwidth measurement is not available" ?
Is the hardware supports only total bandwidth and not local?

It can get real ugly if we try to handle one special case.

thanks
Babu

>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/monitor.c | 34 ++++++++++++++++----------
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
> 2 files changed, 22 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c
> b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f136ac046851..3b9531cce807 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -418,6 +418,14 @@ static int __mon_event_count(u32 rmid, struct
> rmid_read *rr)
> return 0;
> }
>
> +static struct mbm_state *get_mbm_data(struct rdt_domain *dom_mbm, int
> +rmid) {
> + if (is_mbm_local_enabled())
> + return &dom_mbm->mbm_local[rmid];
> +
> + return &dom_mbm->mbm_total[rmid];
> +}
> +
> /*
> * mbm_bw_count() - Update bw count from values previously read by
> * __mon_event_count().
> @@ -431,7 +439,7 @@ static int __mon_event_count(u32 rmid, struct
> rmid_read *rr)
> */
> static void mbm_bw_count(u32 rmid, struct rmid_read *rr) {
> - struct mbm_state *m = &rr->d->mbm_local[rmid];
> + struct mbm_state *m = get_mbm_data(rr->d, rmid);
> u64 cur_bw, bytes, cur_bytes;
>
> cur_bytes = rr->val;
> @@ -526,14 +534,14 @@ static void update_mba_bw(struct rdtgroup *rgrp,
> struct rdt_domain *dom_mbm)
> struct list_head *head;
> struct rdtgroup *entry;
>
> - if (!is_mbm_local_enabled())
> + if (!is_mbm_enabled())
> return;
>
> r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>
> closid = rgrp->closid;
> rmid = rgrp->mon.rmid;
> - pmbm_data = &dom_mbm->mbm_local[rmid];
> + pmbm_data = get_mbm_data(dom_mbm, rmid);
>
> dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
> if (!dom_mba) {
> @@ -553,7 +561,7 @@ static void update_mba_bw(struct rdtgroup *rgrp,
> struct rdt_domain *dom_mbm)
> */
> head = &rgrp->mon.crdtgrp_list;
> list_for_each_entry(entry, head, mon.crdtgrp_list) {
> - cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
> + cmbm_data = get_mbm_data(dom_mbm, entry->mon.rmid);
> cur_bw += cmbm_data->prev_bw;
> delta_bw += cmbm_data->delta_bw;
> }
> @@ -595,7 +603,7 @@ static void update_mba_bw(struct rdtgroup *rgrp,
> struct rdt_domain *dom_mbm)
> */
> pmbm_data->delta_comp = true;
> list_for_each_entry(entry, head, mon.crdtgrp_list) {
> - cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
> + cmbm_data = get_mbm_data(dom_mbm, entry->mon.rmid);
> cmbm_data->delta_comp = true;
> }
> }
> @@ -621,15 +629,15 @@ static void mbm_update(struct rdt_resource *r, struct
> rdt_domain *d, int rmid)
> rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
> rr.val = 0;
> __mon_event_count(rmid, &rr);
> -
> - /*
> - * Call the MBA software controller only for the
> - * control groups and when user has enabled
> - * the software controller explicitly.
> - */
> - if (is_mba_sc(NULL))
> - mbm_bw_count(rmid, &rr);
> }
> +
> + /*
> + * Call the MBA software controller only for the
> + * control groups and when user has enabled
> + * the software controller explicitly.
> + */
> + if (is_mba_sc(NULL))
> + mbm_bw_count(rmid, &rr);
> }
>
> /*
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 69a1de92384a..0c4f8a1b8df0 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2294,7 +2294,7 @@ static bool supports_mba_mbps(void) {
> struct rdt_resource *r =
> &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>
> - return (is_mbm_local_enabled() &&
> + return (is_mbm_enabled() &&
> r->alloc_capable && is_mba_linear()); }
>
> --
> 2.41.0

2023-10-24 23:43:41

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

> Is this customer requirement ?

Any customer using the mba_MBps feedback mount option will need this
on platforms that don't support local bandwidth measurement.

> What do you mean by " If local bandwidth measurement is not available" ?
> Is the hardware supports only total bandwidth and not local?

There's going to be an Intel CPU that will only provide "total" bandwidth.

The CPUID enumeration in (CPUID.(EAX=0FH, ECX=1H) ).EDX{2}
will be "0" indicating that the local mbm monitor event is not supported.

> It can get real ugly if we try to handle one special case.

Hard to predict the future (I didn't see this coming, or I'd have had Vikas
implement the fallback in the original mba_MBps code). But I don't believe
this will be a one-off special case.

I'm also wondering why this feedback loop picked "local" rather than "total".
I dug into the e-mail archives, and I don't see any discussion. There's just
an RFC series, and then the v2 series was applied with a few small suggestions
from Thomas to make things cleaner..

-Tony

2023-10-25 12:47:22

by Peter Newman

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

Hi Tony,

On Tue, Oct 24, 2023 at 8:16 PM Tony Luck <[email protected]> wrote:
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -418,6 +418,14 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
> return 0;
> }
>
> +static struct mbm_state *get_mbm_data(struct rdt_domain *dom_mbm, int rmid)
> +{
> + if (is_mbm_local_enabled())
> + return &dom_mbm->mbm_local[rmid];
> +
> + return &dom_mbm->mbm_total[rmid];
> +}

That looks very similar to the get_mbm_state() function I added to
this same file recently:

https://lore.kernel.org/all/20221220164132.443083-2-peternewman%40google.com

I think the name you picked is misleadingly general. "local if
available, otherwise total" seems to be a choice specific to the mbps
controller. I think these functions should be reconciled a little
better.

Other than that, looks good.

Reviewed-by: Peter Newman <[email protected]>

2023-10-25 16:02:08

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

Hi Tony,

On 10/24/23 18:43, Luck, Tony wrote:
>> Is this customer requirement ?
>
> Any customer using the mba_MBps feedback mount option will need this
> on platforms that don't support local bandwidth measurement.
>
>> What do you mean by " If local bandwidth measurement is not available" ?
>> Is the hardware supports only total bandwidth and not local?
>
> There's going to be an Intel CPU that will only provide "total" bandwidth.

ok.

Why dont you use get_mbm_state which is already available instead of
writing another function(get_mbm_data).

You can pass evtid, rmid, domain information. Decide the evtid based on
what is available. I think that will make code simpler.

>
> The CPUID enumeration in (CPUID.(EAX=0FH, ECX=1H) ).EDX{2}
> will be "0" indicating that the local mbm monitor event is not supported.
>
>> It can get real ugly if we try to handle one special case.
>
> Hard to predict the future (I didn't see this coming, or I'd have had Vikas
> implement the fallback in the original mba_MBps code). But I don't believe
> this will be a one-off special case.
>
> I'm also wondering why this feedback loop picked "local" rather than "total".
> I dug into the e-mail archives, and I don't see any discussion. There's just
> an RFC series, and then the v2 series was applied with a few small suggestions
> from Thomas to make things cleaner..

May be MSR write which feedback loop does only has local effect. This will
be interesting to know.
--
Thanks
Babu Moger

2023-10-25 19:39:12

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

On Wed, Oct 25, 2023 at 02:46:53PM +0200, Peter Newman wrote:
> Hi Tony,
>
> On Tue, Oct 24, 2023 at 8:16 PM Tony Luck <[email protected]> wrote:
> > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > @@ -418,6 +418,14 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
> > return 0;
> > }
> >
> > +static struct mbm_state *get_mbm_data(struct rdt_domain *dom_mbm, int rmid)
> > +{
> > + if (is_mbm_local_enabled())
> > + return &dom_mbm->mbm_local[rmid];
> > +
> > + return &dom_mbm->mbm_total[rmid];
> > +}
>
> That looks very similar to the get_mbm_state() function I added to
> this same file recently:
>
> https://lore.kernel.org/all/20221220164132.443083-2-peternewman%40google.com
>
> I think the name you picked is misleadingly general. "local if
> available, otherwise total" seems to be a choice specific to the mbps
> controller. I think these functions should be reconciled a little
> better.
>

Peter (and Babu, who made the same point about get_mbm_state().

Do you want to see your function extended to do the "pick an MBM event?"

I could add a s/w defined "event" to the enum resctrl_event_id and
extend get_mbm_state() like this:


static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
enum resctrl_event_id evtid)
{
switch (evtid) {
case QOS_L3_MBM_TOTAL_EVENT_ID:
return &d->mbm_total[rmid];
case QOS_L3_MBM_LOCAL_EVENT_ID:
return &d->mbm_local[rmid];
+ case QOS_L3_MBM_LOCAL_OR_TOTAL_EVENT_ID:
+ if (is_mbm_local_enabled())
+ return &d->mbm_local[rmid];
+ if (is_mbm_total_enabled())
+ return &d->mbm_total[rmid];
+ fallthrough;
default:
return NULL;
}
}

Is this the direction you are thinking of?

Callers then look like:

static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
{
struct mbm_state *m = get_mbm_state(rr->d, rmid, QOS_L3_MBM_LOCAL_OR_TOTAL_EVENT_ID);
u64 cur_bw, bytes, cur_bytes;

similar for the other three places where this is needed.

Any suggestions on how "QOS_L3_MBM_LOCAL_OR_TOTAL_EVENT_ID" could be
abbreviated, or just have some different, but descriptive, name?

-Tony

2023-10-25 20:39:54

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

Hi Tony,

On 10/25/23 14:38, Tony Luck wrote:
> On Wed, Oct 25, 2023 at 02:46:53PM +0200, Peter Newman wrote:
>> Hi Tony,
>>
>> On Tue, Oct 24, 2023 at 8:16 PM Tony Luck <[email protected]> wrote:
>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> @@ -418,6 +418,14 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
>>> return 0;
>>> }
>>>
>>> +static struct mbm_state *get_mbm_data(struct rdt_domain *dom_mbm, int rmid)
>>> +{
>>> + if (is_mbm_local_enabled())
>>> + return &dom_mbm->mbm_local[rmid];
>>> +
>>> + return &dom_mbm->mbm_total[rmid];
>>> +}
>>
>> That looks very similar to the get_mbm_state() function I added to
>> this same file recently:
>>
>> https://lore.kernel.org/all/20221220164132.443083-2-peternewman%40google.com
>>
>> I think the name you picked is misleadingly general. "local if
>> available, otherwise total" seems to be a choice specific to the mbps
>> controller. I think these functions should be reconciled a little
>> better.
>>
>
> Peter (and Babu, who made the same point about get_mbm_state().
>
> Do you want to see your function extended to do the "pick an MBM event?"
>
> I could add a s/w defined "event" to the enum resctrl_event_id and
> extend get_mbm_state() like this:
>
>
> static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
> enum resctrl_event_id evtid)
> {
> switch (evtid) {
> case QOS_L3_MBM_TOTAL_EVENT_ID:
> return &d->mbm_total[rmid];
> case QOS_L3_MBM_LOCAL_EVENT_ID:
> return &d->mbm_local[rmid];
> + case QOS_L3_MBM_LOCAL_OR_TOTAL_EVENT_ID:
> + if (is_mbm_local_enabled())
> + return &d->mbm_local[rmid];
> + if (is_mbm_total_enabled())
> + return &d->mbm_total[rmid];
> + fallthrough;
> default:
> return NULL;
> }
> }
>
> Is this the direction you are thinking of?

No. I was not thinking bit different.

You need these changes in only two functions, mbm_bw_count and
update_mba_bw. You decide which event you want to use based on availability,

Something like this. I updated mbm_bw_count.

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c
b/arch/x86/kernel/cpu/resctrl/monitor.c
index 0ad23475fe16..302993e4fbc3 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -436,8 +436,16 @@ static int __mon_event_count(u32 rmid, struct
rmid_read *rr)
*/
static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
{
- struct mbm_state *m = &rr->d->mbm_local[rmid];
u64 cur_bw, bytes, cur_bytes;
+ struct mbm_state *m;
+ int evtid;
+
+ if (is_mbm_local_enabled())
+ evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
+ else
+ evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
+
+ m = get_mbm_state(rr->d, rmid, evtid);

cur_bytes = rr->val;
bytes = cur_bytes - m->prev_bw_bytes;


Will this work?

Thanks
Babu


>
> Callers then look like:
>
> static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
> {
> struct mbm_state *m = get_mbm_state(rr->d, rmid, QOS_L3_MBM_LOCAL_OR_TOTAL_EVENT_ID);
> u64 cur_bw, bytes, cur_bytes;
>
> similar for the other three places where this is needed.
>
> Any suggestions on how "QOS_L3_MBM_LOCAL_OR_TOTAL_EVENT_ID" could be
> abbreviated, or just have some different, but descriptive, name?
>
> -Tony

--
Thanks
Babu Moger

2023-10-25 20:44:26

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

Tony,

On 10/25/23 15:39, Moger, Babu wrote:
> Hi Tony,
>
> On 10/25/23 14:38, Tony Luck wrote:
>> On Wed, Oct 25, 2023 at 02:46:53PM +0200, Peter Newman wrote:
>>> Hi Tony,
>>>
>>> On Tue, Oct 24, 2023 at 8:16 PM Tony Luck <[email protected]> wrote:
>>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>>> @@ -418,6 +418,14 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
>>>> return 0;
>>>> }
>>>>
>>>> +static struct mbm_state *get_mbm_data(struct rdt_domain *dom_mbm, int rmid)
>>>> +{
>>>> + if (is_mbm_local_enabled())
>>>> + return &dom_mbm->mbm_local[rmid];
>>>> +
>>>> + return &dom_mbm->mbm_total[rmid];
>>>> +}
>>>
>>> That looks very similar to the get_mbm_state() function I added to
>>> this same file recently:
>>>
>>> https://lore.kernel.org/all/20221220164132.443083-2-peternewman%40google.com
>>>
>>> I think the name you picked is misleadingly general. "local if
>>> available, otherwise total" seems to be a choice specific to the mbps
>>> controller. I think these functions should be reconciled a little
>>> better.
>>>
>>
>> Peter (and Babu, who made the same point about get_mbm_state().
>>
>> Do you want to see your function extended to do the "pick an MBM event?"
>>
>> I could add a s/w defined "event" to the enum resctrl_event_id and
>> extend get_mbm_state() like this:
>>
>>
>> static struct mbm_state *get_mbm_state(struct rdt_domain *d, u32 rmid,
>> enum resctrl_event_id evtid)
>> {
>> switch (evtid) {
>> case QOS_L3_MBM_TOTAL_EVENT_ID:
>> return &d->mbm_total[rmid];
>> case QOS_L3_MBM_LOCAL_EVENT_ID:
>> return &d->mbm_local[rmid];
>> + case QOS_L3_MBM_LOCAL_OR_TOTAL_EVENT_ID:
>> + if (is_mbm_local_enabled())
>> + return &d->mbm_local[rmid];
>> + if (is_mbm_total_enabled())
>> + return &d->mbm_total[rmid];
>> + fallthrough;
>> default:
>> return NULL;
>> }
>> }
>>
>> Is this the direction you are thinking of?
>
> No. I was not thinking bit different.

I meant, I was thinking bit different.

>
> You need these changes in only two functions, mbm_bw_count and
> update_mba_bw. You decide which event you want to use based on availability,
>
> Something like this. I updated mbm_bw_count.
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c
> b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 0ad23475fe16..302993e4fbc3 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -436,8 +436,16 @@ static int __mon_event_count(u32 rmid, struct
> rmid_read *rr)
> */
> static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
> {
> - struct mbm_state *m = &rr->d->mbm_local[rmid];
> u64 cur_bw, bytes, cur_bytes;
> + struct mbm_state *m;
> + int evtid;
> +
> + if (is_mbm_local_enabled())
> + evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
> + else
> + evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
> +
> + m = get_mbm_state(rr->d, rmid, evtid);
>
> cur_bytes = rr->val;
> bytes = cur_bytes - m->prev_bw_bytes;
>
>
> Will this work?
>
> Thanks
> Babu
>
>
>>
>> Callers then look like:
>>
>> static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
>> {
>> struct mbm_state *m = get_mbm_state(rr->d, rmid, QOS_L3_MBM_LOCAL_OR_TOTAL_EVENT_ID);
>> u64 cur_bw, bytes, cur_bytes;
>>
>> similar for the other three places where this is needed.
>>
>> Any suggestions on how "QOS_L3_MBM_LOCAL_OR_TOTAL_EVENT_ID" could be
>> abbreviated, or just have some different, but descriptive, name?
>>
>> -Tony
>

--
Thanks
Babu Moger

2023-10-25 20:55:47

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

On Wed, Oct 25, 2023 at 03:42:14PM -0500, Moger, Babu wrote:
> I meant, I was thinking bit different.
>
> >
> > You need these changes in only two functions, mbm_bw_count and
> > update_mba_bw. You decide which event you want to use based on availability,
> >
> > Something like this. I updated mbm_bw_count.
> >
> > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c
> > b/arch/x86/kernel/cpu/resctrl/monitor.c
> > index 0ad23475fe16..302993e4fbc3 100644
> > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > @@ -436,8 +436,16 @@ static int __mon_event_count(u32 rmid, struct
> > rmid_read *rr)
> > */
> > static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
> > {
> > - struct mbm_state *m = &rr->d->mbm_local[rmid];
> > u64 cur_bw, bytes, cur_bytes;
> > + struct mbm_state *m;
> > + int evtid;
> > +
> > + if (is_mbm_local_enabled())
> > + evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
> > + else
> > + evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
> > +
> > + m = get_mbm_state(rr->d, rmid, evtid);

Ok. Yes. That seems simpler.

Maybe I should just set a global "mbm_evtid" at mount
time. No need to check every time to see if is_mbm_local_enabled()
somehow changed and local b/w measurements were suddenly
available!

-Tony

2023-10-25 21:07:00

by Peter Newman

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

Hi Tony,

On Wed, Oct 25, 2023, 21:38 Tony Luck <[email protected]> wrote:
>
> On Wed, Oct 25, 2023 at 02:46:53PM +0200, Peter Newman wrote:
>
> > > +static struct mbm_state *get_mbm_data(struct rdt_domain *dom_mbm, int rmid)
> > > +{
> > > + if (is_mbm_local_enabled())
> > > + return &dom_mbm->mbm_local[rmid];
> > > +
> > > + return &dom_mbm->mbm_total[rmid];
> > > +}
> >
> > That looks very similar to the get_mbm_state() function I added to
> > this same file recently:
> >
> > https://lore.kernel.org/all/20221220164132.443083-2-peternewman%40google.com
> >
> > I think the name you picked is misleadingly general. "local if
> > available, otherwise total" seems to be a choice specific to the mbps
> > controller. I think these functions should be reconciled a little
> > better.
> >
>
> Peter (and Babu, who made the same point about get_mbm_state().
>
> Do you want to see your function extended to do the "pick an MBM event?"

What I meant was I think it would be enough to just give the function
you added a name that's more specific to the Mbps controller use case.
For example, get_mba_sc_mbm_state().

It's only problematic that you added a function with an equivalent
name to an existing function that does something different.

-Peter

2023-10-25 23:42:16

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

Hi Tony,

On 10/25/2023 3:52 PM, Tony Luck wrote:
> On Wed, Oct 25, 2023 at 03:42:14PM -0500, Moger, Babu wrote:
>> I meant, I was thinking bit different.
>>
>>> You need these changes in only two functions, mbm_bw_count and
>>> update_mba_bw. You decide which event you want to use based on availability,
>>>
>>> Something like this. I updated mbm_bw_count.
>>>
>>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c
>>> b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> index 0ad23475fe16..302993e4fbc3 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>>> @@ -436,8 +436,16 @@ static int __mon_event_count(u32 rmid, struct
>>> rmid_read *rr)
>>> */
>>> static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
>>> {
>>> - struct mbm_state *m = &rr->d->mbm_local[rmid];
>>> u64 cur_bw, bytes, cur_bytes;
>>> + struct mbm_state *m;
>>> + int evtid;
>>> +
>>> + if (is_mbm_local_enabled())
>>> + evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
>>> + else
>>> + evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
>>> +
>>> + m = get_mbm_state(rr->d, rmid, evtid);
> Ok. Yes. That seems simpler.
>
> Maybe I should just set a global "mbm_evtid" at mount

Lets not make it global yet. This is only affecting couple of functions
when mba_MPps is enabled.

> time. No need to check every time to see if is_mbm_local_enabled()
> somehow changed and local b/w measurements were suddenly
> available!

What changed suddenly? Can you please elaborate.

Thanks

Babu

2023-10-25 23:52:20

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v2] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

On Intel the various resource director technology (RDT) features are all
orthogonal and independently enumerated. Thus it is possible to have
a system that provides "total" memory bandwidth measurements without
providing "local" bandwidth measurements.

If local bandwidth measurement is not available, do not give up on
providing the "mba_MBps" feedback option completely, make the code fall
back to using total bandwidth.

Signed-off-by: Tony Luck <[email protected]>
---

Changes since v1:

+ Both Peter and Babu didn't like my get_mbm_data() function. Peter
thought it just needed a more descriptive name. But Babu explained
how to get rid of it completley. I went with a modified version of
Babu's suggestion (saving the event to use for mba_MBps during
initialization, instead of checking every time).

arch/x86/kernel/cpu/resctrl/monitor.c | 41 ++++++++++++++++----------
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
2 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..6c3536af024e 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -73,6 +73,13 @@ unsigned int resctrl_rmid_realloc_threshold;
*/
unsigned int resctrl_rmid_realloc_limit;

+/*
+ * MBM monitor event to use for the mba_MBps mount option.
+ * Preference is QOS_L3_MBM_LOCAL_EVENT_ID, but fall
+ * back to QOS_L3_MBM_TOTAL_EVENT_ID.
+ */
+static enum resctrl_event_id mba_mbps_evt_id;
+
#define CF(cf) ((unsigned long)(1048576 * (cf) + 0.5))

/*
@@ -431,7 +438,7 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
*/
static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
{
- struct mbm_state *m = &rr->d->mbm_local[rmid];
+ struct mbm_state *m = get_mbm_state(rr->d, rmid, mba_mbps_evt_id);
u64 cur_bw, bytes, cur_bytes;

cur_bytes = rr->val;
@@ -526,14 +533,14 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
struct list_head *head;
struct rdtgroup *entry;

- if (!is_mbm_local_enabled())
+ if (!is_mbm_enabled())
return;

r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

closid = rgrp->closid;
rmid = rgrp->mon.rmid;
- pmbm_data = &dom_mbm->mbm_local[rmid];
+ pmbm_data = get_mbm_state(dom_mbm, rmid, mba_mbps_evt_id);

dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
if (!dom_mba) {
@@ -553,7 +560,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
*/
head = &rgrp->mon.crdtgrp_list;
list_for_each_entry(entry, head, mon.crdtgrp_list) {
- cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
+ cmbm_data = get_mbm_state(dom_mbm, entry->mon.rmid, mba_mbps_evt_id);
cur_bw += cmbm_data->prev_bw;
delta_bw += cmbm_data->delta_bw;
}
@@ -595,7 +602,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
*/
pmbm_data->delta_comp = true;
list_for_each_entry(entry, head, mon.crdtgrp_list) {
- cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
+ cmbm_data = get_mbm_state(dom_mbm, entry->mon.rmid, mba_mbps_evt_id);
cmbm_data->delta_comp = true;
}
}
@@ -621,15 +628,15 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
rr.val = 0;
__mon_event_count(rmid, &rr);
-
- /*
- * Call the MBA software controller only for the
- * control groups and when user has enabled
- * the software controller explicitly.
- */
- if (is_mba_sc(NULL))
- mbm_bw_count(rmid, &rr);
}
+
+ /*
+ * Call the MBA software controller only for the
+ * control groups and when user has enabled
+ * the software controller explicitly.
+ */
+ if (is_mba_sc(NULL))
+ mbm_bw_count(rmid, &rr);
}

/*
@@ -769,10 +776,14 @@ static void l3_mon_evt_init(struct rdt_resource *r)

if (is_llc_occupancy_enabled())
list_add_tail(&llc_occupancy_event.list, &r->evt_list);
- if (is_mbm_total_enabled())
+ if (is_mbm_total_enabled()) {
list_add_tail(&mbm_total_event.list, &r->evt_list);
- if (is_mbm_local_enabled())
+ mba_mbps_evt_id = QOS_L3_MBM_TOTAL_EVENT_ID;
+ }
+ if (is_mbm_local_enabled()) {
list_add_tail(&mbm_local_event.list, &r->evt_list);
+ mba_mbps_evt_id = QOS_L3_MBM_LOCAL_EVENT_ID;
+ }
}

int __init rdt_get_mon_l3_config(struct rdt_resource *r)
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..0c4f8a1b8df0 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2294,7 +2294,7 @@ static bool supports_mba_mbps(void)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

- return (is_mbm_local_enabled() &&
+ return (is_mbm_enabled() &&
r->alloc_capable && is_mba_linear());
}

--
2.41.0

2023-10-26 00:08:39

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

> Lets not make it global yet. This is only affecting couple of functions
> when mba_MPps is enabled.

See v2 (just posted). I made it "static" in monitor.c since both the
functions that need it are in that file.

>> time. No need to check every time to see if is_mbm_local_enabled()
>> somehow changed and local b/w measurements were suddenly
>> available!
>
> What changed suddenly? Can you please elaborate.

Nothing is going to change. If the system doesn't support local memory
bandwidth reporting when it booted, it isn't going to start doing so.
Same in reverse. If you have local bandwidth reporting, it won't
go away.

So at resctrl init time when discovering which of local & total are supported,
just save the event id to use for mba_MBps at that point.

-Tony

2023-10-26 13:55:55

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

Hi Tony,

On 10/25/23 16:06, Peter Newman wrote:
> Hi Tony,
>
> On Wed, Oct 25, 2023, 21:38 Tony Luck <[email protected]> wrote:
>>
>> On Wed, Oct 25, 2023 at 02:46:53PM +0200, Peter Newman wrote:
>>
>>>> +static struct mbm_state *get_mbm_data(struct rdt_domain *dom_mbm, int rmid)
>>>> +{
>>>> + if (is_mbm_local_enabled())
>>>> + return &dom_mbm->mbm_local[rmid];
>>>> +
>>>> + return &dom_mbm->mbm_total[rmid];
>>>> +}
>>>
>>> That looks very similar to the get_mbm_state() function I added to
>>> this same file recently:
>>>
>>> https://lore.kernel.org/all/20221220164132.443083-2-peternewman%40google.com
>>>
>>> I think the name you picked is misleadingly general. "local if
>>> available, otherwise total" seems to be a choice specific to the mbps
>>> controller. I think these functions should be reconciled a little
>>> better.
>>>
>>
>> Peter (and Babu, who made the same point about get_mbm_state().
>>
>> Do you want to see your function extended to do the "pick an MBM event?"
>
> What I meant was I think it would be enough to just give the function
> you added a name that's more specific to the Mbps controller use case.
> For example, get_mba_sc_mbm_state().

I actually liked this idea. Add a new function get_mba_sc_mbm_state. That
way we exactly know why this function is used. I see you already sent a v2
making the event global. Making it global may not be good idea. Can you
please update the patch and resend. Also please add the comment about why
you are adding that function.
--
Thanks
Babu Moger

2023-10-26 16:11:01

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

> > What I meant was I think it would be enough to just give the function
> > you added a name that's more specific to the Mbps controller use case.
> > For example, get_mba_sc_mbm_state().
>
> I actually liked this idea. Add a new function get_mba_sc_mbm_state. That
> way we exactly know why this function is used. I see you already sent a v2
> making the event global. Making it global may not be good idea. Can you
> please update the patch and resend. Also please add the comment about why
> you are adding that function.

Can you explain why you don't like the global? If there is a better name for it,
or a better comment for what it does, or you think the code that sets the value
could be clearer, then I'm happy to make changes there.

Which events are supported by a system is a static property. Figuring out once
at "init" time which event to use for mba_MBps seems a better choice than
re-checking for each of possibly hundreds of RMIDs every second. Even though
the check is cheap, it is utterly pointless.

-Tony

2023-10-26 17:19:26

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

Hi Tony,

On 10/26/23 11:09, Luck, Tony wrote:
>>> What I meant was I think it would be enough to just give the function
>>> you added a name that's more specific to the Mbps controller use case.
>>> For example, get_mba_sc_mbm_state().
>>
>> I actually liked this idea. Add a new function get_mba_sc_mbm_state. That
>> way we exactly know why this function is used. I see you already sent a v2
>> making the event global. Making it global may not be good idea. Can you
>> please update the patch and resend. Also please add the comment about why
>> you are adding that function.
>
> Can you explain why you don't like the global? If there is a better name for it,
> or a better comment for what it does, or you think the code that sets the value
> could be clearer, then I'm happy to make changes there.

My theory is always try to localize the changes and avoid global variables
when there are other ways to do the same thing. It may not be strong argument.
>
> Which events are supported by a system is a static property. Figuring out once
> at "init" time which event to use for mba_MBps seems a better choice than
> re-checking for each of possibly hundreds of RMIDs every second. Even though
> the check is cheap, it is utterly pointless.

mbm_update happens here only to the active group (not on all the available
rmids).
Also, I am not clear about weather this is going fix your problem.
You are setting the MSR limit based on total bandwidth. The MSR you are
writing may only have the local socket effect. In cases where all the
memory is allocated from remote socket then writing the MSR may not have
any effect.
Also you said you don't have the hardware to verify. Its always good to
verify if is really fixing the problem. my 02 cents.
Thanks
Babu Moger

2023-10-26 19:54:20

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

On Thu, Oct 26, 2023 at 12:19:14PM -0500, Moger, Babu wrote:
> Hi Tony,
>
> On 10/26/23 11:09, Luck, Tony wrote:
> >>> What I meant was I think it would be enough to just give the function
> >>> you added a name that's more specific to the Mbps controller use case.
> >>> For example, get_mba_sc_mbm_state().
> >>
> >> I actually liked this idea. Add a new function get_mba_sc_mbm_state. That
> >> way we exactly know why this function is used. I see you already sent a v2
> >> making the event global. Making it global may not be good idea. Can you
> >> please update the patch and resend. Also please add the comment about why
> >> you are adding that function.
> >
> > Can you explain why you don't like the global? If there is a better name for it,
> > or a better comment for what it does, or you think the code that sets the value
> > could be clearer, then I'm happy to make changes there.
>
> My theory is always try to localize the changes and avoid global variables
> when there are other ways to do the same thing. It may not be strong argument.

A good theory. I do this too. But it seems I'm more likely to go with
global variables if the cost of avoiding them is high. But "cost" is
a very subjective thing.

> > Which events are supported by a system is a static property. Figuring out once
> > at "init" time which event to use for mba_MBps seems a better choice than
> > re-checking for each of possibly hundreds of RMIDs every second. Even though
> > the check is cheap, it is utterly pointless.
>
> mbm_update happens here only to the active group (not on all the available
> rmids).

mbaMBps needs to get data from all active RMIDs to provide input to
the feedback loop. That might be a lot of RMIDs if many jobs are being
monitored independently (which I believe is a common mode of operation).

> Also, I am not clear about weather this is going fix your problem.
> You are setting the MSR limit based on total bandwidth. The MSR you are
> writing may only have the local socket effect. In cases where all the
> memory is allocated from remote socket then writing the MSR may not have
> any effect.

Intel MBA controls operate on all memory operations that miss the L3
cache (whether they are going to a local memory controller, or across
a UPI link to a memory controller on another socket).

> Also you said you don't have the hardware to verify. Its always good to
> verify if is really fixing the problem. my 02 cents.

I don't have hardare that enforces this. But Linux does have a boot
option clearcpuid=cqm_mbm_local to tell Linux that the system doesn't
provide a local counter. I've been using that for all my testing.

-Tony

2023-10-26 20:02:56

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v3] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

On Intel the various resource director technology (RDT) features are all
orthogonal and independently enumerated. Thus it is possible to have
a system that provides "total" memory bandwidth measurements without
providing "local" bandwidth measurements.

If local bandwidth measurement is not available, do not give up on
providing the "mba_MBps" feedback option completely, make the code fall
back to using total bandwidth.

Signed-off-by: Tony Luck <[email protected]>
---
Change since v2:

Babu doesn't like the global variable. So here's a version without it.

Note that my preference is still the v2 version. But as I tell newbies
to Linux "Your job isn't to get YOUR patch upstream. You job is to get
the problem fixed.". So taking my own advice I don't really mind
whether v2 or v3 is applied.

arch/x86/kernel/cpu/resctrl/monitor.c | 43 ++++++++++++++++++--------
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
2 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..29e86310677d 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -418,6 +418,20 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
return 0;
}

+/*
+ * For legacy compatibility use the local memory bandwidth to drive
+ * the mba_MBps feedback control loop. But on platforms that do not
+ * provide the local event fall back to use the total bandwidth event
+ * instead.
+ */
+static enum resctrl_event_id pick_mba_mbps_event(void)
+{
+ if (is_mbm_local_enabled())
+ return QOS_L3_MBM_LOCAL_EVENT_ID;
+
+ return QOS_L3_MBM_TOTAL_EVENT_ID;
+}
+
/*
* mbm_bw_count() - Update bw count from values previously read by
* __mon_event_count().
@@ -431,9 +445,11 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
*/
static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
{
- struct mbm_state *m = &rr->d->mbm_local[rmid];
+ enum resctrl_event_id mba_mbps_evt_id = pick_mba_mbps_event();
u64 cur_bw, bytes, cur_bytes;
+ struct mbm_state *m;

+ m = get_mbm_state(rr->d, rmid, mba_mbps_evt_id);
cur_bytes = rr->val;
bytes = cur_bytes - m->prev_bw_bytes;
m->prev_bw_bytes = cur_bytes;
@@ -518,6 +534,7 @@ void mon_event_count(void *info)
*/
static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
{
+ enum resctrl_event_id mba_mbps_evt_id = pick_mba_mbps_event();
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
u32 cur_bw, delta_bw, user_bw;
@@ -526,14 +543,14 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
struct list_head *head;
struct rdtgroup *entry;

- if (!is_mbm_local_enabled())
+ if (!is_mbm_enabled())
return;

r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

closid = rgrp->closid;
rmid = rgrp->mon.rmid;
- pmbm_data = &dom_mbm->mbm_local[rmid];
+ pmbm_data = get_mbm_state(dom_mbm, rmid, mba_mbps_evt_id);

dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
if (!dom_mba) {
@@ -553,7 +570,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
*/
head = &rgrp->mon.crdtgrp_list;
list_for_each_entry(entry, head, mon.crdtgrp_list) {
- cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
+ cmbm_data = get_mbm_state(dom_mbm, entry->mon.rmid, mba_mbps_evt_id);
cur_bw += cmbm_data->prev_bw;
delta_bw += cmbm_data->delta_bw;
}
@@ -595,7 +612,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
*/
pmbm_data->delta_comp = true;
list_for_each_entry(entry, head, mon.crdtgrp_list) {
- cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
+ cmbm_data = get_mbm_state(dom_mbm, entry->mon.rmid, mba_mbps_evt_id);
cmbm_data->delta_comp = true;
}
}
@@ -621,15 +638,15 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
rr.val = 0;
__mon_event_count(rmid, &rr);
-
- /*
- * Call the MBA software controller only for the
- * control groups and when user has enabled
- * the software controller explicitly.
- */
- if (is_mba_sc(NULL))
- mbm_bw_count(rmid, &rr);
}
+
+ /*
+ * Call the MBA software controller only for the
+ * control groups and when user has enabled
+ * the software controller explicitly.
+ */
+ if (is_mba_sc(NULL))
+ mbm_bw_count(rmid, &rr);
}

/*
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..0c4f8a1b8df0 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2294,7 +2294,7 @@ static bool supports_mba_mbps(void)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

- return (is_mbm_local_enabled() &&
+ return (is_mbm_enabled() &&
r->alloc_capable && is_mba_linear());
}

--
2.41.0

2023-10-26 22:41:17

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v3] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

Hi Tony,

On 10/26/2023 3:02 PM, Tony Luck wrote:
> On Intel the various resource director technology (RDT) features are all
> orthogonal and independently enumerated. Thus it is possible to have
> a system that provides "total" memory bandwidth measurements without
> providing "local" bandwidth measurements.
>
> If local bandwidth measurement is not available, do not give up on
> providing the "mba_MBps" feedback option completely, make the code fall
> back to using total bandwidth.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> Change since v2:
>
> Babu doesn't like the global variable. So here's a version without it.
>
> Note that my preference is still the v2 version. But as I tell newbies
> to Linux "Your job isn't to get YOUR patch upstream. You job is to get
> the problem fixed.". So taking my own advice I don't really mind
> whether v2 or v3 is applied.
Hmm. I like v3 better. Few minor comments below.
>
> arch/x86/kernel/cpu/resctrl/monitor.c | 43 ++++++++++++++++++--------
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
> 2 files changed, 31 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f136ac046851..29e86310677d 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -418,6 +418,20 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
> return 0;
> }
>
> +/*
> + * For legacy compatibility use the local memory bandwidth to drive
> + * the mba_MBps feedback control loop. But on platforms that do not
> + * provide the local event fall back to use the total bandwidth event
> + * instead.
> + */
> +static enum resctrl_event_id pick_mba_mbps_event(void)
> +{
> + if (is_mbm_local_enabled())
> + return QOS_L3_MBM_LOCAL_EVENT_ID;
> +
> + return QOS_L3_MBM_TOTAL_EVENT_ID;
> +}
> +
> /*
> * mbm_bw_count() - Update bw count from values previously read by
> * __mon_event_count().
> @@ -431,9 +445,11 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
> */
> static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
> {
> - struct mbm_state *m = &rr->d->mbm_local[rmid];
> + enum resctrl_event_id mba_mbps_evt_id = pick_mba_mbps_event();

How about evt_id instead of mba_mbps_evt_id? It seems pretty mouthful
for temp variable.

> u64 cur_bw, bytes, cur_bytes;
> + struct mbm_state *m;
>
> + m = get_mbm_state(rr->d, rmid, mba_mbps_evt_id);
> cur_bytes = rr->val;
> bytes = cur_bytes - m->prev_bw_bytes;
> m->prev_bw_bytes = cur_bytes;
> @@ -518,6 +534,7 @@ void mon_event_count(void *info)
> */
> static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> {
> + enum resctrl_event_id mba_mbps_evt_id = pick_mba_mbps_event();
same comment as above.
> u32 closid, rmid, cur_msr_val, new_msr_val;
> struct mbm_state *pmbm_data, *cmbm_data;
> u32 cur_bw, delta_bw, user_bw;
> @@ -526,14 +543,14 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> struct list_head *head;
> struct rdtgroup *entry;
>
> - if (!is_mbm_local_enabled())
> + if (!is_mbm_enabled())
> return;
>
> r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>
> closid = rgrp->closid;
> rmid = rgrp->mon.rmid;
> - pmbm_data = &dom_mbm->mbm_local[rmid];
> + pmbm_data = get_mbm_state(dom_mbm, rmid, mba_mbps_evt_id);
>
> dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
> if (!dom_mba) {
> @@ -553,7 +570,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> */
> head = &rgrp->mon.crdtgrp_list;
> list_for_each_entry(entry, head, mon.crdtgrp_list) {
> - cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
> + cmbm_data = get_mbm_state(dom_mbm, entry->mon.rmid, mba_mbps_evt_id);
> cur_bw += cmbm_data->prev_bw;
> delta_bw += cmbm_data->delta_bw;
> }
> @@ -595,7 +612,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> */
> pmbm_data->delta_comp = true;
> list_for_each_entry(entry, head, mon.crdtgrp_list) {
> - cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
> + cmbm_data = get_mbm_state(dom_mbm, entry->mon.rmid, mba_mbps_evt_id);
> cmbm_data->delta_comp = true;
> }
> }
> @@ -621,15 +638,15 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
> rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
> rr.val = 0;
> __mon_event_count(rmid, &rr);
> -
> - /*
> - * Call the MBA software controller only for the
> - * control groups and when user has enabled
> - * the software controller explicitly.
> - */
> - if (is_mba_sc(NULL))
> - mbm_bw_count(rmid, &rr);
> }
> +
> + /*
> + * Call the MBA software controller only for the
> + * control groups and when user has enabled
> + * the software controller explicitly.
> + */
> + if (is_mba_sc(NULL))
> + mbm_bw_count(rmid, &rr);
> }
>
> /*
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 69a1de92384a..0c4f8a1b8df0 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2294,7 +2294,7 @@ static bool supports_mba_mbps(void)
> {
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>
> - return (is_mbm_local_enabled() &&
> + return (is_mbm_enabled() &&
> r->alloc_capable && is_mba_linear());
> }
>

Otherwise looks good to me.

Thanks

Babu

2023-10-26 22:59:53

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v3] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

>> static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
>> {
>> - struct mbm_state *m = &rr->d->mbm_local[rmid];
>> + enum resctrl_event_id mba_mbps_evt_id = pick_mba_mbps_event();
>
> How about evt_id instead of mba_mbps_evt_id? It seems pretty mouthful
> for temp variable.

Sure. The long name made sense as a globa. But as you say, a temp
variable doesn't need to be so verbose. Context from being initialized
by the pick_mba_mbps_event() function also helps reader to see what
it is used for.

I'll make that change if others vote for the v3 approach over v2.

-Tony



2023-11-03 21:43:39

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v3] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

Hi Tony,

On 10/26/2023 1:02 PM, Tony Luck wrote:
> On Intel the various resource director technology (RDT) features are all
> orthogonal and independently enumerated. Thus it is possible to have
> a system that provides "total" memory bandwidth measurements without
> providing "local" bandwidth measurements.

This motivation is written in support of Intel systems but from what I
can tell the changes impact Intel as well as AMD.

>
> If local bandwidth measurement is not available, do not give up on
> providing the "mba_MBps" feedback option completely, make the code fall
> back to using total bandwidth.

It is interesting to me that the "fall back" is essentially a drop-in
replacement without any adjustments to the data/algorithm.

Can these measurements be considered equivalent? Could a user now perhaps
want to experiment by disabling local bandwidth measurement to explore if
system behaves differently when using total memory bandwidth? What
would have a user choose one over the other (apart from when user
is forced by system ability)?

>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> Change since v2:
>
> Babu doesn't like the global variable. So here's a version without it.
>
> Note that my preference is still the v2 version. But as I tell newbies
> to Linux "Your job isn't to get YOUR patch upstream. You job is to get
> the problem fixed.". So taking my own advice I don't really mind
> whether v2 or v3 is applied.
>
> arch/x86/kernel/cpu/resctrl/monitor.c | 43 ++++++++++++++++++--------
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 +-
> 2 files changed, 31 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f136ac046851..29e86310677d 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -418,6 +418,20 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
> return 0;
> }
>
> +/*
> + * For legacy compatibility use the local memory bandwidth to drive
> + * the mba_MBps feedback control loop. But on platforms that do not
> + * provide the local event fall back to use the total bandwidth event
> + * instead.
> + */
> +static enum resctrl_event_id pick_mba_mbps_event(void)
> +{
> + if (is_mbm_local_enabled())
> + return QOS_L3_MBM_LOCAL_EVENT_ID;
> +
> + return QOS_L3_MBM_TOTAL_EVENT_ID;
> +}

Can there be a WARN here to catch the unlikely event that
!is_mbm_total_enabled()?
This may mean the caller (in update_mba_bw()) needs to move
to code protected by is_mbm_enabled().

One option to consider is to have a single "get_mba_mbps_state()"
call (similar to V1) that determines the eventid as above and
then calls get_mbm_state() to return a pointer to mbm_state in one
call. Starting to seem like nitpicking but I'd thought I'd mention it
since it seemed a way to have V1 solution with request to use
get_mbm_state() addressed.

> +
> /*
> * mbm_bw_count() - Update bw count from values previously read by
> * __mon_event_count().
> @@ -431,9 +445,11 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
> */
> static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
> {
> - struct mbm_state *m = &rr->d->mbm_local[rmid];
> + enum resctrl_event_id mba_mbps_evt_id = pick_mba_mbps_event();
> u64 cur_bw, bytes, cur_bytes;
> + struct mbm_state *m;
>
> + m = get_mbm_state(rr->d, rmid, mba_mbps_evt_id);
> cur_bytes = rr->val;
> bytes = cur_bytes - m->prev_bw_bytes;
> m->prev_bw_bytes = cur_bytes;

It should not be necessary to pick the event id again. It is available
within the struct rmid_read parameter.

Reinette

2023-11-03 21:50:42

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v3] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable



On 11/3/2023 2:43 PM, Reinette Chatre wrote:
> On 10/26/2023 1:02 PM, Tony Luck wrote:
>> On Intel the various resource director technology (RDT) features are all
>> orthogonal and independently enumerated. Thus it is possible to have
>> a system that provides "total" memory bandwidth measurements without
>> providing "local" bandwidth measurements.
>
> This motivation is written in support of Intel systems but from what I
> can tell the changes impact Intel as well as AMD.

No wait. Sorry. AMD does not enable delay_linear.

Reinette

2023-11-15 16:09:49

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v3] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

Hi Tony,

On 11/9/2023 1:27 PM, Luck, Tony wrote:
>>> Maybe additional an mount option "mba_MBps_total" so the user can pick
>>> total instead of local?
>>
>> Is this something for which a remount is required? Can it not perhaps be
>> changed at runtime?
>
> In theory, yes. But I've been playing with a patch that adds a writable info/
> file to allow runtime switch:
>
> # ls -l /sys/fs/resctrl/info/MB/mba_MBps_control
> -rw-r--r--. 1 root root 0 Nov 9 10:57 /sys/fs/resctrl/info/MB/mba_MBps_control
> ]# cat /sys/fs/resctrl/info/MB/mba_MBps_control
> total
>
> and found that it's a bit tricky to switch out the MBM event from the
> state machine driving the feedback loop. I think the problem is in the
> code that tries to stop the control loop from switching between two
> throttling levels every second:
>
> if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
> new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
> } else if (cur_msr_val < MAX_MBA_BW &&
> (user_bw > (cur_bw + delta_bw))) {
> new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
> } else {
> return;
> }
>
> The code drops down one percentage step if current bandwidth is above
> the desired target. But stepping back up checks to see if "cur_bw + delta_bw"
> is below the target.
>
> Where does "delta_bw" come from? Code uses the Boolean flag "pmbm_data->delta_comp"
> to request the once-per-second polling compute the change in bandwidth on the
> next poll after adjusting throttling MSRs.
>
> All of these values are in the "struct mbm_state" which is a per-event-id structure.
>
> Picking an event at boot time works fine. Likely also fine at mount time. But
> switching at run-time seems to frequently end up with a very large value in
> "delta_bw" (as it compares current & previous for this event ... and it looks
> like things changed from zero). Net effect is that throttling is increased when
> processes go over their target for the resctrl group, but throttling is never decreased.

This is not clear to me. Would the state not also start from zero at boot and mount
time? From what I understand the state is also reset to zero on monitor group creation.

>
> The whole heuristic seems a bit fragile. It works well for test processes that have
> constant memory bandwidth. But I could see it failing in scenarios like this:
>
> 1) Process is over MB limit
> 2) Linux increases throttling, and sets flag to compute delta_bw on next poll
> 3) Process blocks on some event and uses no bandwidth in next one second
> 4) Next poll. Linux computes delta_bw as abs(cur_bw - m->prev_bw). cur_bw is zero,
> so delta_bw is set to full value of bandwidth that process used when over budget
> 5) Process resumes running
> 6) Linux sees process using less than target, but cur_bw + delta_bw is above target,
> so Linux doesn't adjust throttling
>
> I think the goal was to avoid relaxing throttling and letting a resctrl group go back over
> target bandwidth. But that doesn't work either for groups with highly variable bandwidth
> requirements.
>
> 1) Group is over budget
> 2) Linux increases throttling, and sets flag to compute delta_bw on next poll
> 3) Group forks additional processes. New bandwidth from those offsets the reduction due to throttling
> 4) Next poll. Linux sees bandwidth is unchanged. Sets delta_bw = 0.
> 5) Next poll. Groups aggregate bandwidth is fractionally below target. Because delta_bw=0, Linux
> reduces throttling.
> 6) Group goes over target.
>

I'll defer to you for the history about this algorithm. I am not familiar with how
broadly this feature is used but I have not heard about issues with it. It does
seem as though there is some opportunity for investigation here.

Reinette

2023-11-15 21:55:12

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v3] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

On Wed, Nov 15, 2023 at 08:09:13AM -0800, Reinette Chatre wrote:
> Hi Tony,
>
> On 11/9/2023 1:27 PM, Luck, Tony wrote:
> >>> Maybe additional an mount option "mba_MBps_total" so the user can pick
> >>> total instead of local?
> >>
> >> Is this something for which a remount is required? Can it not perhaps be
> >> changed at runtime?
> >
> > In theory, yes. But I've been playing with a patch that adds a writable info/
> > file to allow runtime switch:
> >
> > # ls -l /sys/fs/resctrl/info/MB/mba_MBps_control
> > -rw-r--r--. 1 root root 0 Nov 9 10:57 /sys/fs/resctrl/info/MB/mba_MBps_control
> > ]# cat /sys/fs/resctrl/info/MB/mba_MBps_control
> > total
> >
> > and found that it's a bit tricky to switch out the MBM event from the
> > state machine driving the feedback loop. I think the problem is in the
> > code that tries to stop the control loop from switching between two
> > throttling levels every second:
> >
> > if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
> > new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
> > } else if (cur_msr_val < MAX_MBA_BW &&
> > (user_bw > (cur_bw + delta_bw))) {
> > new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
> > } else {
> > return;
> > }
> >
> > The code drops down one percentage step if current bandwidth is above
> > the desired target. But stepping back up checks to see if "cur_bw + delta_bw"
> > is below the target.
> >
> > Where does "delta_bw" come from? Code uses the Boolean flag "pmbm_data->delta_comp"
> > to request the once-per-second polling compute the change in bandwidth on the
> > next poll after adjusting throttling MSRs.
> >
> > All of these values are in the "struct mbm_state" which is a per-event-id structure.
> >
> > Picking an event at boot time works fine. Likely also fine at mount time. But
> > switching at run-time seems to frequently end up with a very large value in
> > "delta_bw" (as it compares current & previous for this event ... and it looks
> > like things changed from zero). Net effect is that throttling is increased when
> > processes go over their target for the resctrl group, but throttling is never decreased.
>
> This is not clear to me. Would the state not also start from zero at boot and mount
> time? From what I understand the state is also reset to zero on monitor group creation.

Yes. All of boot, mount, mkdir start a group in a well defined state
with no throttling applied (schemata shows bandwitdh limit as 2^32
MBytes/sec). If the user sets some realistic limit, and the group
MBM measurement exceeds that limit, then the MBA MSR for the group
is dropped from 100% to 90% and the delta_comp flag set to record
the delta_bw on the next 1-second poll.

The value of delta_bw is only used when looking to reduce throttling.
To be in that state this group must have been in a state where
throttling was increased ... which would result in delta_bw being
set up.

Now look at what happens when switching from local to total for the
first time. delta_bw is zero in the structures recording total bandwidth
information. But more importanly so is prev_bw. If the code above
changes throttling value and requests an updated calulation of delta_bw,
that will be done using a value of prev_bw==0. I.e. delta_bw will be
set to the current bandwidth. That high value will likely block attempts
to reduce throttling.

Maybe when switching MBM source events the prev_bw value should be
copied from old source structures to new source structures as a rough
guide to avoid crazy actions. But that could also be wrong when
switching from total to local for a group that has poor NUMA
localization and total bandwidth is far higher than local.

> > The whole heuristic seems a bit fragile. It works well for test processes that have
> > constant memory bandwidth. But I could see it failing in scenarios like this:
> >
> > 1) Process is over MB limit
> > 2) Linux increases throttling, and sets flag to compute delta_bw on next poll
> > 3) Process blocks on some event and uses no bandwidth in next one second
> > 4) Next poll. Linux computes delta_bw as abs(cur_bw - m->prev_bw). cur_bw is zero,
> > so delta_bw is set to full value of bandwidth that process used when over budget
> > 5) Process resumes running
> > 6) Linux sees process using less than target, but cur_bw + delta_bw is above target,
> > so Linux doesn't adjust throttling
> >
> > I think the goal was to avoid relaxing throttling and letting a resctrl group go back over
> > target bandwidth. But that doesn't work either for groups with highly variable bandwidth
> > requirements.
> >
> > 1) Group is over budget
> > 2) Linux increases throttling, and sets flag to compute delta_bw on next poll
> > 3) Group forks additional processes. New bandwidth from those offsets the reduction due to throttling
> > 4) Next poll. Linux sees bandwidth is unchanged. Sets delta_bw = 0.
> > 5) Next poll. Groups aggregate bandwidth is fractionally below target. Because delta_bw=0, Linux
> > reduces throttling.
> > 6) Group goes over target.
> >
>
> I'll defer to you for the history about this algorithm. I am not familiar with how
> broadly this feature is used but I have not heard about issues with it. It does
> seem as though there is some opportunity for investigation here.

I sure I could construct an artificial test case to force this scenario.
But maybe:
1) It never happens in real life
2) It happens, but nobody noticed
3) People figured out the workaround (set schemata to a really big
MBytes/sec value for a second, and then back to desired value).
4) Few people use this option

I dug again into the lore.kernel.org archives. Thomas complained
that is wasn't "calibration" (as Vikas had descibed in in V1) but
seems to have otherwise been OK with it as a heuristic.

https://lore.kernel.org/all/[email protected]/


I coded up and tested the below patch as a possible replacement heuristic.
But I also wonder whether just letting the feedback loop flip throttling
up and down between throttling values above/below the target bandwidth
would really be so bad. It's just one MSR write that can be done from
the current CPU and would result in average bandwidth closer to the
user requested target.

-Tony


From 2d6d08c8ebeb62830b6d8dea36f5a8d5a53b75eb Mon Sep 17 00:00:00 2001
From: Tony Luck <[email protected]>
Date: Mon, 13 Nov 2023 13:29:29 -0800
Subject: [PATCH] x86/resctrl: Implement new MBA_mbps throttling heuristic

The MBA_mbps feedback loop increases throttling when a group is using
more bandwidth than the target set by the user in the schemata file, and
decreases throttling when below target.

To avoid possibly stepping throttling up and down on every poll a
flag "delta_comp" is set whenever throttling is changed to indicate
that the actual change in bandwidth should be recorded on the next
poll in "delta_bw". Throttling is only reduced if the current bandwidth
plus delta_bw is below the user target.

This algorithm works well if the workload has steady bandwidth needs.
But it can go badly wrong if the workload moves to a different phase
just as the throttling level changed. E.g. if the workload becomes
essentially idle right as throttling level is increased, the value
calculated for delta_bw will be more or less the old bandwidth level.
If the workload then resumes, Linux may never reduce throttling because
current bandwidth plu delta_bw is above the target set by the user.

Implement a simpler heuristic by assuming that in the worst case the
currently measured bandwidth is being controlled by the current level of
throttling. Compute how much it may increase if throttling is relaxed to
the next higher level. If that is still below the user target, then it
is ok to reduce the amount of throttling.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/internal.h | 4 ---
arch/x86/kernel/cpu/resctrl/monitor.c | 42 ++++++--------------------
2 files changed, 10 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..71bbd2245cc7 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -296,14 +296,10 @@ struct rftype {
* struct mbm_state - status for each MBM counter in each domain
* @prev_bw_bytes: Previous bytes value read for bandwidth calculation
* @prev_bw: The most recent bandwidth in MBps
- * @delta_bw: Difference between the current and previous bandwidth
- * @delta_comp: Indicates whether to compute the delta_bw
*/
struct mbm_state {
u64 prev_bw_bytes;
u32 prev_bw;
- u32 delta_bw;
- bool delta_comp;
};

/**
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..1961823b555b 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -440,9 +440,6 @@ static void mbm_bw_count(u32 rmid, struct rmid_read *rr)

cur_bw = bytes / SZ_1M;

- if (m->delta_comp)
- m->delta_bw = abs(cur_bw - m->prev_bw);
- m->delta_comp = false;
m->prev_bw = cur_bw;
}

@@ -520,7 +517,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
- u32 cur_bw, delta_bw, user_bw;
+ u32 cur_bw, user_bw;
struct rdt_resource *r_mba;
struct rdt_domain *dom_mba;
struct list_head *head;
@@ -543,7 +540,6 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)

cur_bw = pmbm_data->prev_bw;
user_bw = dom_mba->mbps_val[closid];
- delta_bw = pmbm_data->delta_bw;

/* MBA resource doesn't support CDP */
cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE);
@@ -555,49 +551,31 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
list_for_each_entry(entry, head, mon.crdtgrp_list) {
cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
cur_bw += cmbm_data->prev_bw;
- delta_bw += cmbm_data->delta_bw;
}

/*
* Scale up/down the bandwidth linearly for the ctrl group. The
* bandwidth step is the bandwidth granularity specified by the
* hardware.
- *
- * The delta_bw is used when increasing the bandwidth so that we
- * dont alternately increase and decrease the control values
- * continuously.
- *
- * For ex: consider cur_bw = 90MBps, user_bw = 100MBps and if
- * bandwidth step is 20MBps(> user_bw - cur_bw), we would keep
- * switching between 90 and 110 continuously if we only check
- * cur_bw < user_bw.
+ * Always increase throttling if current bandwidth is above the
+ * target set by user.
+ * But avoid thrashing up and down on every poll by checking
+ * whether a decrease in throttling is likely to push the group
+ * back over target. E.g. if currently throttling to 30% of bandwidth
+ * on a system with 10% granularity steps, check whether moving to
+ * 40% would go past the limit by multiplying current bandwidth by
+ * "(30 + 10) / 30".
*/
if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
} else if (cur_msr_val < MAX_MBA_BW &&
- (user_bw > (cur_bw + delta_bw))) {
+ (user_bw > (cur_bw * (cur_msr_val + r_mba->membw.min_bw) / cur_msr_val))) {
new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
} else {
return;
}

resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
-
- /*
- * Delta values are updated dynamically package wise for each
- * rdtgrp every time the throttle MSR changes value.
- *
- * This is because (1)the increase in bandwidth is not perfectly
- * linear and only "approximately" linear even when the hardware
- * says it is linear.(2)Also since MBA is a core specific
- * mechanism, the delta values vary based on number of cores used
- * by the rdtgrp.
- */
- pmbm_data->delta_comp = true;
- list_for_each_entry(entry, head, mon.crdtgrp_list) {
- cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
- cmbm_data->delta_comp = true;
- }
}

static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
--
2.41.0

2023-11-16 19:49:11

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v3] x86/resctrl: mba_MBps: Fall back to total b/w if local b/w unavailable

Hi Tony,

On 11/15/2023 1:54 PM, Tony Luck wrote:
> On Wed, Nov 15, 2023 at 08:09:13AM -0800, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 11/9/2023 1:27 PM, Luck, Tony wrote:
>>>>> Maybe additional an mount option "mba_MBps_total" so the user can pick
>>>>> total instead of local?
>>>>
>>>> Is this something for which a remount is required? Can it not perhaps be
>>>> changed at runtime?
>>>
>>> In theory, yes. But I've been playing with a patch that adds a writable info/
>>> file to allow runtime switch:
>>>
>>> # ls -l /sys/fs/resctrl/info/MB/mba_MBps_control
>>> -rw-r--r--. 1 root root 0 Nov 9 10:57 /sys/fs/resctrl/info/MB/mba_MBps_control
>>> ]# cat /sys/fs/resctrl/info/MB/mba_MBps_control
>>> total
>>>
>>> and found that it's a bit tricky to switch out the MBM event from the
>>> state machine driving the feedback loop. I think the problem is in the
>>> code that tries to stop the control loop from switching between two
>>> throttling levels every second:
>>>
>>> if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
>>> new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
>>> } else if (cur_msr_val < MAX_MBA_BW &&
>>> (user_bw > (cur_bw + delta_bw))) {
>>> new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
>>> } else {
>>> return;
>>> }
>>>
>>> The code drops down one percentage step if current bandwidth is above
>>> the desired target. But stepping back up checks to see if "cur_bw + delta_bw"
>>> is below the target.
>>>
>>> Where does "delta_bw" come from? Code uses the Boolean flag "pmbm_data->delta_comp"
>>> to request the once-per-second polling compute the change in bandwidth on the
>>> next poll after adjusting throttling MSRs.
>>>
>>> All of these values are in the "struct mbm_state" which is a per-event-id structure.
>>>
>>> Picking an event at boot time works fine. Likely also fine at mount time. But
>>> switching at run-time seems to frequently end up with a very large value in
>>> "delta_bw" (as it compares current & previous for this event ... and it looks
>>> like things changed from zero). Net effect is that throttling is increased when
>>> processes go over their target for the resctrl group, but throttling is never decreased.
>>
>> This is not clear to me. Would the state not also start from zero at boot and mount
>> time? From what I understand the state is also reset to zero on monitor group creation.
>
> Yes. All of boot, mount, mkdir start a group in a well defined state
> with no throttling applied (schemata shows bandwitdh limit as 2^32
> MBytes/sec). If the user sets some realistic limit, and the group
> MBM measurement exceeds that limit, then the MBA MSR for the group
> is dropped from 100% to 90% and the delta_comp flag set to record
> the delta_bw on the next 1-second poll.
>
> The value of delta_bw is only used when looking to reduce throttling.
> To be in that state this group must have been in a state where
> throttling was increased ... which would result in delta_bw being
> set up.
>
> Now look at what happens when switching from local to total for the
> first time. delta_bw is zero in the structures recording total bandwidth
> information. But more importanly so is prev_bw. If the code above
> changes throttling value and requests an updated calulation of delta_bw,
> that will be done using a value of prev_bw==0. I.e. delta_bw will be
> set to the current bandwidth. That high value will likely block attempts
> to reduce throttling.

Thank you for the detailed explanation. I think there are ways in which
to make this transition smoother, for example to not compute delta_bw
if there is no history (no "prev_bw_bytes"). But that would just fix
the existing algorithm without addressing the other issues you raised
with this algorithm.

>
> Maybe when switching MBM source events the prev_bw value should be
> copied from old source structures to new source structures as a rough
> guide to avoid crazy actions. But that could also be wrong when
> switching from total to local for a group that has poor NUMA
> localization and total bandwidth is far higher than local.
>
>>> The whole heuristic seems a bit fragile. It works well for test processes that have
>>> constant memory bandwidth. But I could see it failing in scenarios like this:
>>>
>>> 1) Process is over MB limit
>>> 2) Linux increases throttling, and sets flag to compute delta_bw on next poll
>>> 3) Process blocks on some event and uses no bandwidth in next one second
>>> 4) Next poll. Linux computes delta_bw as abs(cur_bw - m->prev_bw). cur_bw is zero,
>>> so delta_bw is set to full value of bandwidth that process used when over budget
>>> 5) Process resumes running
>>> 6) Linux sees process using less than target, but cur_bw + delta_bw is above target,
>>> so Linux doesn't adjust throttling
>>>
>>> I think the goal was to avoid relaxing throttling and letting a resctrl group go back over
>>> target bandwidth. But that doesn't work either for groups with highly variable bandwidth
>>> requirements.
>>>
>>> 1) Group is over budget
>>> 2) Linux increases throttling, and sets flag to compute delta_bw on next poll
>>> 3) Group forks additional processes. New bandwidth from those offsets the reduction due to throttling
>>> 4) Next poll. Linux sees bandwidth is unchanged. Sets delta_bw = 0.
>>> 5) Next poll. Groups aggregate bandwidth is fractionally below target. Because delta_bw=0, Linux
>>> reduces throttling.
>>> 6) Group goes over target.
>>>
>>
>> I'll defer to you for the history about this algorithm. I am not familiar with how
>> broadly this feature is used but I have not heard about issues with it. It does
>> seem as though there is some opportunity for investigation here.
>
> I sure I could construct an artificial test case to force this scenario.
> But maybe:
> 1) It never happens in real life
> 2) It happens, but nobody noticed
> 3) People figured out the workaround (set schemata to a really big
> MBytes/sec value for a second, and then back to desired value).
> 4) Few people use this option
>
> I dug again into the lore.kernel.org archives. Thomas complained
> that is wasn't "calibration" (as Vikas had descibed in in V1) but
> seems to have otherwise been OK with it as a heuristic.
>
> https://lore.kernel.org/all/[email protected]/
>
>
> I coded up and tested the below patch as a possible replacement heuristic.
> But I also wonder whether just letting the feedback loop flip throttling
> up and down between throttling values above/below the target bandwidth
> would really be so bad. It's just one MSR write that can be done from
> the current CPU and would result in average bandwidth closer to the
> user requested target.

The proposed heuristic seem to assume that the bandwidth used has
a linear relationship to the throttling percentage. It seems to set
aside the reasons that motivated this "delta_bw" in the first place:

> - * This is because (1)the increase in bandwidth is not perfectly
> - * linear and only "approximately" linear even when the hardware
> - * says it is linear.(2)Also since MBA is a core specific
> - * mechanism, the delta values vary based on number of cores used
> - * by the rdtgrp.

From the above I understand that reducing throttling by 10% does not
imply that bandwidth consumed will increase by 10%. A new heuristic like
this may thus decide not to relax throttling expecting that doing so would
cause bandwidth to go over limit while the non-linear increase may result
in bandwidth consumed not going over limit when throttling is relaxed.

I am also curious if only using target bandwidth would be bad.

I looked through the spec and was not able to find any information
guiding to the cost of adjusting the allocation once per second
(per resource group per domain). The closest I could find was
the discussion of a need of a "fine-grained software controller" where
it is not clear if "once per second" can be considered "fine grained".

Reinette

2023-11-28 23:15:02

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v4] x86/resctrl: Add mount option to pick total MBM event

Add a "total" mount option to be used in conjunction with "mba_MBps"
to request use of the total memory bandwidth event as the feedback
input to the control loop.

Also fall back to using the total event if the local event is not
supported by the CPU.

Update the once-per-second polling code to use the event (local
or total memory bandwidth).

Signed-off-by: Tony Luck <[email protected]>
---

Changes since v3:

Reinette suggested that users might like the option to use the total
memory bandwidth event. I tried out some code to make the event runtime
selectable via a r/w file in the resctrl/info directories. But that
got complicated because of the amount of state that needs to be updated
when switching events. Since there isn't a firm use case for user
selectable event, this latest version falls back to the far simpler
case of using a mount option.

Documentation/arch/x86/resctrl.rst | 3 +++
arch/x86/kernel/cpu/resctrl/internal.h | 3 +++
arch/x86/kernel/cpu/resctrl/monitor.c | 20 +++++++++-----------
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 15 ++++++++++++++-
4 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..29c3e7137eb8 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -46,6 +46,9 @@ mount options are:
"mba_MBps":
Enable the MBA Software Controller(mba_sc) to specify MBA
bandwidth in MBps
+"total":
+ Use total instead of local memory bandwidth to drive the
+ MBA Software Controller
"debug":
Make debug files accessible. Available debug files are annotated with
"Available only with debug option".
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..f98fc9adc2da 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -59,6 +59,7 @@ struct rdt_fs_context {
bool enable_cdpl2;
bool enable_cdpl3;
bool enable_mba_mbps;
+ bool use_mbm_total;
bool enable_debug;
};

@@ -428,6 +429,8 @@ extern struct rdt_hw_resource rdt_resources_all[];
extern struct rdtgroup rdtgroup_default;
DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);

+extern enum resctrl_event_id mba_mbps_evt_id;
+
extern struct dentry *debugfs_resctrl;

enum resctrl_res_level {
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..230297603836 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -431,9 +431,10 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
*/
static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
{
- struct mbm_state *m = &rr->d->mbm_local[rmid];
u64 cur_bw, bytes, cur_bytes;
+ struct mbm_state *m;

+ m = get_mbm_state(rr->d, rmid, rr->evtid);
cur_bytes = rr->val;
bytes = cur_bytes - m->prev_bw_bytes;
m->prev_bw_bytes = cur_bytes;
@@ -518,6 +519,7 @@ void mon_event_count(void *info)
*/
static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
{
+ enum resctrl_event_id evt_id = mba_mbps_evt_id;
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
u32 cur_bw, delta_bw, user_bw;
@@ -526,14 +528,14 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
struct list_head *head;
struct rdtgroup *entry;

- if (!is_mbm_local_enabled())
+ if (!is_mbm_enabled())
return;

r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

closid = rgrp->closid;
rmid = rgrp->mon.rmid;
- pmbm_data = &dom_mbm->mbm_local[rmid];
+ pmbm_data = get_mbm_state(dom_mbm, rmid, evt_id);

dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
if (!dom_mba) {
@@ -553,7 +555,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
*/
head = &rgrp->mon.crdtgrp_list;
list_for_each_entry(entry, head, mon.crdtgrp_list) {
- cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
+ cmbm_data = get_mbm_state(dom_mbm, entry->mon.rmid, evt_id);
cur_bw += cmbm_data->prev_bw;
delta_bw += cmbm_data->delta_bw;
}
@@ -616,18 +618,14 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
rr.val = 0;
__mon_event_count(rmid, &rr);
+ if (is_mba_sc(NULL) && rr.evtid == mba_mbps_evt_id)
+ mbm_bw_count(rmid, &rr);
}
if (is_mbm_local_enabled()) {
rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
rr.val = 0;
__mon_event_count(rmid, &rr);
-
- /*
- * Call the MBA software controller only for the
- * control groups and when user has enabled
- * the software controller explicitly.
- */
- if (is_mba_sc(NULL))
+ if (is_mba_sc(NULL) && rr.evtid == mba_mbps_evt_id)
mbm_bw_count(rmid, &rr);
}
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..39a5b73af4ef 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -57,6 +57,8 @@ static char last_cmd_status_buf[512];
static int rdtgroup_setup_root(struct rdt_fs_context *ctx);
static void rdtgroup_destroy_root(void);

+enum resctrl_event_id mba_mbps_evt_id;
+
struct dentry *debugfs_resctrl;

static bool resctrl_debug;
@@ -2294,7 +2296,7 @@ static bool supports_mba_mbps(void)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

- return (is_mbm_local_enabled() &&
+ return (is_mbm_enabled() &&
r->alloc_capable && is_mba_linear());
}

@@ -2470,6 +2472,10 @@ static int rdt_enable_ctx(struct rdt_fs_context *ctx)
ret = set_mba_sc(true);
if (ret)
goto out_cdpl3;
+ if (ctx->use_mbm_total || !is_mbm_local_enabled())
+ mba_mbps_evt_id = QOS_L3_MBM_TOTAL_EVENT_ID;
+ else
+ mba_mbps_evt_id = QOS_L3_MBM_LOCAL_EVENT_ID;
}

if (ctx->enable_debug)
@@ -2683,6 +2689,7 @@ enum rdt_param {
Opt_cdp,
Opt_cdpl2,
Opt_mba_mbps,
+ Opt_mba_mbps_total,
Opt_debug,
nr__rdt_params
};
@@ -2691,6 +2698,7 @@ static const struct fs_parameter_spec rdt_fs_parameters[] = {
fsparam_flag("cdp", Opt_cdp),
fsparam_flag("cdpl2", Opt_cdpl2),
fsparam_flag("mba_MBps", Opt_mba_mbps),
+ fsparam_flag("total", Opt_mba_mbps_total),
fsparam_flag("debug", Opt_debug),
{}
};
@@ -2717,6 +2725,11 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
return -EINVAL;
ctx->enable_mba_mbps = true;
return 0;
+ case Opt_mba_mbps_total:
+ if (!is_mbm_total_enabled())
+ return -EINVAL;
+ ctx->use_mbm_total = true;
+ return 0;
case Opt_debug:
ctx->enable_debug = true;
return 0;
--
2.41.0

2023-11-29 23:49:01

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4] x86/resctrl: Add mount option to pick total MBM event

Hi Tony,

On 11/28/2023 3:14 PM, Tony Luck wrote:
> Add a "total" mount option to be used in conjunction with "mba_MBps"
> to request use of the total memory bandwidth event as the feedback
> input to the control loop.

"total" is very generic. It is also not clear to me why users
would need to use two mount options. What if the new mount option
is "mba_MBps_total" instead, without user needing to also provide
"mba_MBps"?

>
> Also fall back to using the total event if the local event is not
> supported by the CPU.
>
> Update the once-per-second polling code to use the event (local
> or total memory bandwidth).

Please take care to describe why this change is needed, not just
what it does. This is required by x86. For confirmation:
https://lore.kernel.org/lkml/20231009172517.GRZSQ3fT05LGgpcW35@fat_crate.local/

>
> Signed-off-by: Tony Luck <[email protected]>
> ---
>
> Changes since v3:
>
> Reinette suggested that users might like the option to use the total
> memory bandwidth event. I tried out some code to make the event runtime
> selectable via a r/w file in the resctrl/info directories. But that
> got complicated because of the amount of state that needs to be updated
> when switching events. Since there isn't a firm use case for user
> selectable event, this latest version falls back to the far simpler
> case of using a mount option.

(I did not realize that that discussion was over.)

>
> Documentation/arch/x86/resctrl.rst | 3 +++
> arch/x86/kernel/cpu/resctrl/internal.h | 3 +++
> arch/x86/kernel/cpu/resctrl/monitor.c | 20 +++++++++-----------
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 15 ++++++++++++++-
> 4 files changed, 29 insertions(+), 12 deletions(-)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index a6279df64a9d..29c3e7137eb8 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -46,6 +46,9 @@ mount options are:
> "mba_MBps":
> Enable the MBA Software Controller(mba_sc) to specify MBA
> bandwidth in MBps
> +"total":
> + Use total instead of local memory bandwidth to drive the
> + MBA Software Controller
> "debug":
> Make debug files accessible. Available debug files are annotated with
> "Available only with debug option".
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a4f1aa15f0a2..f98fc9adc2da 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -59,6 +59,7 @@ struct rdt_fs_context {
> bool enable_cdpl2;
> bool enable_cdpl3;
> bool enable_mba_mbps;
> + bool use_mbm_total;
> bool enable_debug;
> };

Why did you choose new member to not follow existing custom of having
an enable_ prefix?

>
> @@ -428,6 +429,8 @@ extern struct rdt_hw_resource rdt_resources_all[];
> extern struct rdtgroup rdtgroup_default;
> DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
>
> +extern enum resctrl_event_id mba_mbps_evt_id;
> +

This global seems unnecessary. struct resctrl_membw.mba_sc indicates if
the software controller is enabled. Creating this global fragments
related information.

One option could be to change the type of struct resctrl_membw.mba_sc to
enum resctrl_event_id. I assume that 0 would never be a valid event ID and
can thus be used to know if the software controller is disabled. If this
is done then enum resctrl_event_id's documentation should be updated
with this assumption/usage.

Reinette

2023-12-01 20:46:04

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v4] x86/resctrl: Add mount option to pick total MBM event

On Wed, Nov 29, 2023 at 03:48:37PM -0800, Reinette Chatre wrote:
> Hi Tony,
>
> On 11/28/2023 3:14 PM, Tony Luck wrote:
> > Add a "total" mount option to be used in conjunction with "mba_MBps"
> > to request use of the total memory bandwidth event as the feedback
> > input to the control loop.
>
> "total" is very generic. It is also not clear to me why users
> would need to use two mount options. What if the new mount option
> is "mba_MBps_total" instead, without user needing to also provide
> "mba_MBps"?

I wasn't attached to "total". I tried to change the type of the existing
"mba_MBps" option to fsparam_string_emtpy() in the hope that existing users
would be able to keep using "mba_MBps", and users who want to use total
bandwidth could use "mba_MBps=total". But that type requires the "="
in the string provided by the user for the "empty" case.

So now I'm experimenting with adding a new option:

fsparam_string("mba_MBps_event", Opt_mba_mbps_event)

so a user would specify "mba_MBps_event=total" instead of "mba_MBps".
My code also allow for "mba_MBps_event=local" if a user want to ensure
they use the local bandwidth event (failing the mount if it is not
available).

Existing code using the legacy "mba_MBps" option will get local by
default, failing over to total if needed.

> >
> > Also fall back to using the total event if the local event is not
> > supported by the CPU.
> >
> > Update the once-per-second polling code to use the event (local
> > or total memory bandwidth).
>
> Please take care to describe why this change is needed, not just
> what it does. This is required by x86. For confirmation:
> https://lore.kernel.org/lkml/20231009172517.GRZSQ3fT05LGgpcW35@fat_crate.local/

Yes. I had some justification in earlier versions, but lost it along the
way. I will include in next version.

> >
> > Signed-off-by: Tony Luck <[email protected]>
> > ---
> >
> > Changes since v3:
> >
> > Reinette suggested that users might like the option to use the total
> > memory bandwidth event. I tried out some code to make the event runtime
> > selectable via a r/w file in the resctrl/info directories. But that
> > got complicated because of the amount of state that needs to be updated
> > when switching events. Since there isn't a firm use case for user
> > selectable event, this latest version falls back to the far simpler
> > case of using a mount option.
>
> (I did not realize that that discussion was over.)

I'd like to avoid too much feature creep in this series. I'd like to
finish solving the original problem (systems without local bandwidth
monitoring should have a way to use total bandwidth) before tackling
additional issues in a separate patch series. Taking on a simple way
for users to request total bandwidth isn't much extra code. Making
it possible to switch events at runtime isn't. Fixing the corner cases
where the feedback loop may get stuck is also a separate issue.

> >
> > Documentation/arch/x86/resctrl.rst | 3 +++
> > arch/x86/kernel/cpu/resctrl/internal.h | 3 +++
> > arch/x86/kernel/cpu/resctrl/monitor.c | 20 +++++++++-----------
> > arch/x86/kernel/cpu/resctrl/rdtgroup.c | 15 ++++++++++++++-
> > 4 files changed, 29 insertions(+), 12 deletions(-)
> >
> > diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> > index a6279df64a9d..29c3e7137eb8 100644
> > --- a/Documentation/arch/x86/resctrl.rst
> > +++ b/Documentation/arch/x86/resctrl.rst
> > @@ -46,6 +46,9 @@ mount options are:
> > "mba_MBps":
> > Enable the MBA Software Controller(mba_sc) to specify MBA
> > bandwidth in MBps
> > +"total":
> > + Use total instead of local memory bandwidth to drive the
> > + MBA Software Controller
> > "debug":
> > Make debug files accessible. Available debug files are annotated with
> > "Available only with debug option".
> > diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> > index a4f1aa15f0a2..f98fc9adc2da 100644
> > --- a/arch/x86/kernel/cpu/resctrl/internal.h
> > +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> > @@ -59,6 +59,7 @@ struct rdt_fs_context {
> > bool enable_cdpl2;
> > bool enable_cdpl3;
> > bool enable_mba_mbps;
> > + bool use_mbm_total;
> > bool enable_debug;
> > };
>
> Why did you choose new member to not follow existing custom of having
> an enable_ prefix?

That does look awful. Next version will do this:

- bool enable_mba_mbps;
+ bool enable_mba_mbps_local;
+ bool enable_mba_mbps_total;

>
> >
> > @@ -428,6 +429,8 @@ extern struct rdt_hw_resource rdt_resources_all[];
> > extern struct rdtgroup rdtgroup_default;
> > DECLARE_STATIC_KEY_FALSE(rdt_alloc_enable_key);
> >
> > +extern enum resctrl_event_id mba_mbps_evt_id;
> > +
>
> This global seems unnecessary. struct resctrl_membw.mba_sc indicates if
> the software controller is enabled. Creating this global fragments
> related information.

Global is now gone.

> One option could be to change the type of struct resctrl_membw.mba_sc to
> enum resctrl_event_id. I assume that 0 would never be a valid event ID and
> can thus be used to know if the software controller is disabled. If this
> is done then enum resctrl_event_id's documentation should be updated
> with this assumption/usage.

But I'm not fond of this overloading the meaning of resctrl_membw.mba_sc
by making that assumption that "0" will never be a valid event.

I've left the mba_sc field as a boolean that indicates enabled and added
a new field for the event:

@@ -138,6 +139,7 @@ struct resctrl_membw {
bool arch_needs_linear;
enum membw_throttle_mode throttle_mode;
bool mba_sc;
+ enum resctrl_event_id mba_mbps_event;
u32 *mb_map;
};

New version will be posted soon.

-Tony

2023-12-01 21:47:58

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v5] x86/resctrl: Add event choices for mba_MBps

The MBA Software Controller(mba_sc) is a feedback loop that uses
measurements of local memory bandwidth to adjust MBA throttling levels to
keep workloads in a resctrl group within a target bandwidth set in the
schemata file.

But on Intel systems the memory bandwidth monitoring events are
independently enumerated. It is possible for a system to support
total memory bandwidth monitoring, but not support local bandwidth
monitoring. On such a system a user could not enable mba_sc mode.
Users will see this highly unhelpful error message from mount:

# mount -t resctrl -o mba_MBps resctrl /sys/fs/resctrl
mount: /sys/fs/resctrl: wrong fs type, bad option, bad superblock on
resctrl, missing codepage or helper program, or other error.
dmesg(1) may have more information after failed mount system call.

dmesg(1) does not provide any additional information.

Add a new mount option "mba_MBps_event=[local|total]" that allows
a user to specify which monitoring event to use. Also modify the
existing "mba_MBps" option to switch to total bandwidth monitoring
if local monitoring is not available.

Update the once-per-second polling code to use the chosen event (local
or total memory bandwidth).

Signed-off-by: Tony Luck <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 7 +++-
include/linux/resctrl.h | 2 ++
arch/x86/kernel/cpu/resctrl/internal.h | 3 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 21 ++++++-----
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 48 ++++++++++++++++++++------
5 files changed, 58 insertions(+), 23 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..f06cb189911a 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -45,7 +45,12 @@ mount options are:
Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
Enable the MBA Software Controller(mba_sc) to specify MBA
- bandwidth in MBps
+ bandwidth in MBps. Defaults to using MBM local bandwidth,
+ but will use total bandwidth on systems that do not support
+ local bandwidth monitoring.
+"mba_MBps_event=[local|total]":
+ Enable the MBA Software Controller(mba_sc) with a specific
+ MBM event as input to the feedback loop.
"debug":
Make debug files accessible. Available debug files are annotated with
"Available only with debug option".
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66942d7fba7f..1feb3b2e64fa 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -129,6 +129,7 @@ enum membw_throttle_mode {
* @throttle_mode: Bandwidth throttling mode when threads request
* different memory bandwidths
* @mba_sc: True if MBA software controller(mba_sc) is enabled
+ * @mba_mbps_event: Event (local or total) for mba_sc
* @mb_map: Mapping of memory B/W percentage to memory B/W delay
*/
struct resctrl_membw {
@@ -138,6 +139,7 @@ struct resctrl_membw {
bool arch_needs_linear;
enum membw_throttle_mode throttle_mode;
bool mba_sc;
+ enum resctrl_event_id mba_mbps_event;
u32 *mb_map;
};

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..8b9b8f664324 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -58,7 +58,8 @@ struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
bool enable_cdpl3;
- bool enable_mba_mbps;
+ bool enable_mba_mbps_local;
+ bool enable_mba_mbps_total;
bool enable_debug;
};

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..d9e590f1cbc3 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -431,9 +431,10 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
*/
static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
{
- struct mbm_state *m = &rr->d->mbm_local[rmid];
u64 cur_bw, bytes, cur_bytes;
+ struct mbm_state *m;

+ m = get_mbm_state(rr->d, rmid, rr->evtid);
cur_bytes = rr->val;
bytes = cur_bytes - m->prev_bw_bytes;
m->prev_bw_bytes = cur_bytes;
@@ -521,19 +522,21 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
u32 cur_bw, delta_bw, user_bw;
+ enum resctrl_event_id evt_id;
struct rdt_resource *r_mba;
struct rdt_domain *dom_mba;
struct list_head *head;
struct rdtgroup *entry;

- if (!is_mbm_local_enabled())
+ if (!is_mbm_enabled())
return;

r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
+ evt_id = r_mba->membw.mba_mbps_event;

closid = rgrp->closid;
rmid = rgrp->mon.rmid;
- pmbm_data = &dom_mbm->mbm_local[rmid];
+ pmbm_data = get_mbm_state(dom_mbm, rmid, evt_id);

dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
if (!dom_mba) {
@@ -553,7 +556,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
*/
head = &rgrp->mon.crdtgrp_list;
list_for_each_entry(entry, head, mon.crdtgrp_list) {
- cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
+ cmbm_data = get_mbm_state(dom_mbm, entry->mon.rmid, evt_id);
cur_bw += cmbm_data->prev_bw;
delta_bw += cmbm_data->delta_bw;
}
@@ -616,18 +619,14 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
rr.val = 0;
__mon_event_count(rmid, &rr);
+ if (is_mba_sc(NULL) && rr.evtid == r->membw.mba_mbps_event)
+ mbm_bw_count(rmid, &rr);
}
if (is_mbm_local_enabled()) {
rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
rr.val = 0;
__mon_event_count(rmid, &rr);
-
- /*
- * Call the MBA software controller only for the
- * control groups and when user has enabled
- * the software controller explicitly.
- */
- if (is_mba_sc(NULL))
+ if (is_mba_sc(NULL) && rr.evtid == r->membw.mba_mbps_event)
mbm_bw_count(rmid, &rr);
}
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..79141d33d5b4 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2294,7 +2294,7 @@ static bool supports_mba_mbps(void)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

- return (is_mbm_local_enabled() &&
+ return (is_mbm_enabled() &&
r->alloc_capable && is_mba_linear());
}

@@ -2302,7 +2302,7 @@ static bool supports_mba_mbps(void)
* Enable or disable the MBA software controller
* which helps user specify bandwidth in MBps.
*/
-static int set_mba_sc(bool mba_sc)
+static int set_mba_sc(bool mba_sc, enum resctrl_event_id mba_mbps_event)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
u32 num_closid = resctrl_arch_get_num_closid(r);
@@ -2313,6 +2313,7 @@ static int set_mba_sc(bool mba_sc)
return -EINVAL;

r->membw.mba_sc = mba_sc;
+ r->membw.mba_mbps_event = mba_mbps_event;

list_for_each_entry(d, &r->domains, list) {
for (i = 0; i < num_closid; i++)
@@ -2445,13 +2446,14 @@ static void rdt_disable_ctx(void)
{
resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false);
resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false);
- set_mba_sc(false);
+ set_mba_sc(false, QOS_L3_MBM_LOCAL_EVENT_ID);

resctrl_debug = false;
}

static int rdt_enable_ctx(struct rdt_fs_context *ctx)
{
+ enum resctrl_event_id mba_mbps_event;
int ret = 0;

if (ctx->enable_cdpl2) {
@@ -2466,8 +2468,12 @@ static int rdt_enable_ctx(struct rdt_fs_context *ctx)
goto out_cdpl2;
}

- if (ctx->enable_mba_mbps) {
- ret = set_mba_sc(true);
+ if (ctx->enable_mba_mbps_local || ctx->enable_mba_mbps_total) {
+ if (ctx->enable_mba_mbps_total)
+ mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
+ else
+ mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
+ ret = set_mba_sc(true, mba_mbps_event);
if (ret)
goto out_cdpl3;
}
@@ -2683,15 +2689,17 @@ enum rdt_param {
Opt_cdp,
Opt_cdpl2,
Opt_mba_mbps,
+ Opt_mba_mbps_event,
Opt_debug,
nr__rdt_params
};

static const struct fs_parameter_spec rdt_fs_parameters[] = {
- fsparam_flag("cdp", Opt_cdp),
- fsparam_flag("cdpl2", Opt_cdpl2),
- fsparam_flag("mba_MBps", Opt_mba_mbps),
- fsparam_flag("debug", Opt_debug),
+ fsparam_flag("cdp", Opt_cdp),
+ fsparam_flag("cdpl2", Opt_cdpl2),
+ fsparam_flag("mba_MBps", Opt_mba_mbps),
+ fsparam_string("mba_MBps_event", Opt_mba_mbps_event),
+ fsparam_flag("debug", Opt_debug),
{}
};

@@ -2715,7 +2723,27 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
case Opt_mba_mbps:
if (!supports_mba_mbps())
return -EINVAL;
- ctx->enable_mba_mbps = true;
+ if (is_mbm_local_enabled())
+ ctx->enable_mba_mbps_local = true;
+ else if (is_mbm_total_enabled())
+ ctx->enable_mba_mbps_total = true;
+ else
+ return -EINVAL;
+ return 0;
+ case Opt_mba_mbps_event:
+ if (!supports_mba_mbps())
+ return -EINVAL;
+ if (!strcmp("local", param->string)) {
+ if (!is_mbm_local_enabled())
+ return -EINVAL;
+ ctx->enable_mba_mbps_local = true;
+ } else if (!strcmp("total", param->string)) {
+ if (!is_mbm_total_enabled())
+ return -EINVAL;
+ ctx->enable_mba_mbps_total = true;
+ } else {
+ return -EINVAL;
+ }
return 0;
case Opt_debug:
ctx->enable_debug = true;
--
2.41.0

2023-12-04 16:25:57

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v5] x86/resctrl: Add event choices for mba_MBps

Hi Tony,

You are intending to achieve two things at once here.
1. Adding new mount option
2. Changing behaviour for the current option.
I think you need to split this patch into two. Few comments below.

On 12/1/23 15:47, Tony Luck wrote:
> The MBA Software Controller(mba_sc) is a feedback loop that uses
> measurements of local memory bandwidth to adjust MBA throttling levels to
> keep workloads in a resctrl group within a target bandwidth set in the
> schemata file.
>
> But on Intel systems the memory bandwidth monitoring events are
> independently enumerated. It is possible for a system to support
> total memory bandwidth monitoring, but not support local bandwidth
> monitoring. On such a system a user could not enable mba_sc mode.
> Users will see this highly unhelpful error message from mount:
>
> # mount -t resctrl -o mba_MBps resctrl /sys/fs/resctrl
> mount: /sys/fs/resctrl: wrong fs type, bad option, bad superblock on
> resctrl, missing codepage or helper program, or other error.
> dmesg(1) may have more information after failed mount system call.
>
> dmesg(1) does not provide any additional information.
>
> Add a new mount option "mba_MBps_event=[local|total]" that allows
> a user to specify which monitoring event to use. Also modify the
> existing "mba_MBps" option to switch to total bandwidth monitoring
> if local monitoring is not available.

I am not sure why you need both these options. I feel you just need one of
these options.

Thanks
Babu

>
> Update the once-per-second polling code to use the chosen event (local
> or total memory bandwidth).
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> Documentation/arch/x86/resctrl.rst | 7 +++-
> include/linux/resctrl.h | 2 ++
> arch/x86/kernel/cpu/resctrl/internal.h | 3 +-
> arch/x86/kernel/cpu/resctrl/monitor.c | 21 ++++++-----
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 48 ++++++++++++++++++++------
> 5 files changed, 58 insertions(+), 23 deletions(-)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index a6279df64a9d..f06cb189911a 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -45,7 +45,12 @@ mount options are:
> Enable code/data prioritization in L2 cache allocations.
> "mba_MBps":
> Enable the MBA Software Controller(mba_sc) to specify MBA
> - bandwidth in MBps
> + bandwidth in MBps. Defaults to using MBM local bandwidth,
> + but will use total bandwidth on systems that do not support
> + local bandwidth monitoring.
> +"mba_MBps_event=[local|total]":
> + Enable the MBA Software Controller(mba_sc) with a specific
> + MBM event as input to the feedback loop.
> "debug":
> Make debug files accessible. Available debug files are annotated with
> "Available only with debug option".
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 66942d7fba7f..1feb3b2e64fa 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -129,6 +129,7 @@ enum membw_throttle_mode {
> * @throttle_mode: Bandwidth throttling mode when threads request
> * different memory bandwidths
> * @mba_sc: True if MBA software controller(mba_sc) is enabled
> + * @mba_mbps_event: Event (local or total) for mba_sc
> * @mb_map: Mapping of memory B/W percentage to memory B/W delay
> */
> struct resctrl_membw {
> @@ -138,6 +139,7 @@ struct resctrl_membw {
> bool arch_needs_linear;
> enum membw_throttle_mode throttle_mode;
> bool mba_sc;
> + enum resctrl_event_id mba_mbps_event;
> u32 *mb_map;
> };
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a4f1aa15f0a2..8b9b8f664324 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -58,7 +58,8 @@ struct rdt_fs_context {
> struct kernfs_fs_context kfc;
> bool enable_cdpl2;
> bool enable_cdpl3;
> - bool enable_mba_mbps;
> + bool enable_mba_mbps_local;
> + bool enable_mba_mbps_total;
> bool enable_debug;
> };
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f136ac046851..d9e590f1cbc3 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -431,9 +431,10 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
> */
> static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
> {
> - struct mbm_state *m = &rr->d->mbm_local[rmid];
> u64 cur_bw, bytes, cur_bytes;
> + struct mbm_state *m;
>
> + m = get_mbm_state(rr->d, rmid, rr->evtid);
> cur_bytes = rr->val;
> bytes = cur_bytes - m->prev_bw_bytes;
> m->prev_bw_bytes = cur_bytes;
> @@ -521,19 +522,21 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> u32 closid, rmid, cur_msr_val, new_msr_val;
> struct mbm_state *pmbm_data, *cmbm_data;
> u32 cur_bw, delta_bw, user_bw;
> + enum resctrl_event_id evt_id;
> struct rdt_resource *r_mba;
> struct rdt_domain *dom_mba;
> struct list_head *head;
> struct rdtgroup *entry;
>
> - if (!is_mbm_local_enabled())
> + if (!is_mbm_enabled())
> return;
>
> r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
> + evt_id = r_mba->membw.mba_mbps_event;
>
> closid = rgrp->closid;
> rmid = rgrp->mon.rmid;
> - pmbm_data = &dom_mbm->mbm_local[rmid];
> + pmbm_data = get_mbm_state(dom_mbm, rmid, evt_id);
>
> dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
> if (!dom_mba) {
> @@ -553,7 +556,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> */
> head = &rgrp->mon.crdtgrp_list;
> list_for_each_entry(entry, head, mon.crdtgrp_list) {
> - cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
> + cmbm_data = get_mbm_state(dom_mbm, entry->mon.rmid, evt_id);
> cur_bw += cmbm_data->prev_bw;
> delta_bw += cmbm_data->delta_bw;
> }
> @@ -616,18 +619,14 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
> rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
> rr.val = 0;
> __mon_event_count(rmid, &rr);
> + if (is_mba_sc(NULL) && rr.evtid == r->membw.mba_mbps_event)
> + mbm_bw_count(rmid, &rr);
> }
> if (is_mbm_local_enabled()) {
> rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
> rr.val = 0;
> __mon_event_count(rmid, &rr);
> -
> - /*
> - * Call the MBA software controller only for the
> - * control groups and when user has enabled
> - * the software controller explicitly.
> - */
> - if (is_mba_sc(NULL))
> + if (is_mba_sc(NULL) && rr.evtid == r->membw.mba_mbps_event)
> mbm_bw_count(rmid, &rr);
> }
> }
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 69a1de92384a..79141d33d5b4 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2294,7 +2294,7 @@ static bool supports_mba_mbps(void)
> {
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>
> - return (is_mbm_local_enabled() &&
> + return (is_mbm_enabled() &&
> r->alloc_capable && is_mba_linear());
> }
>
> @@ -2302,7 +2302,7 @@ static bool supports_mba_mbps(void)
> * Enable or disable the MBA software controller
> * which helps user specify bandwidth in MBps.
> */
> -static int set_mba_sc(bool mba_sc)
> +static int set_mba_sc(bool mba_sc, enum resctrl_event_id mba_mbps_event)
> {
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
> u32 num_closid = resctrl_arch_get_num_closid(r);
> @@ -2313,6 +2313,7 @@ static int set_mba_sc(bool mba_sc)
> return -EINVAL;
>
> r->membw.mba_sc = mba_sc;
> + r->membw.mba_mbps_event = mba_mbps_event;
>
> list_for_each_entry(d, &r->domains, list) {
> for (i = 0; i < num_closid; i++)
> @@ -2445,13 +2446,14 @@ static void rdt_disable_ctx(void)
> {
> resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false);
> resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false);
> - set_mba_sc(false);
> + set_mba_sc(false, QOS_L3_MBM_LOCAL_EVENT_ID);
>
> resctrl_debug = false;
> }
>
> static int rdt_enable_ctx(struct rdt_fs_context *ctx)
> {
> + enum resctrl_event_id mba_mbps_event;
> int ret = 0;
>
> if (ctx->enable_cdpl2) {
> @@ -2466,8 +2468,12 @@ static int rdt_enable_ctx(struct rdt_fs_context *ctx)
> goto out_cdpl2;
> }
>
> - if (ctx->enable_mba_mbps) {
> - ret = set_mba_sc(true);
> + if (ctx->enable_mba_mbps_local || ctx->enable_mba_mbps_total) {
> + if (ctx->enable_mba_mbps_total)
> + mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
> + else
> + mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
> + ret = set_mba_sc(true, mba_mbps_event);
> if (ret)
> goto out_cdpl3;
> }
> @@ -2683,15 +2689,17 @@ enum rdt_param {
> Opt_cdp,
> Opt_cdpl2,
> Opt_mba_mbps,
> + Opt_mba_mbps_event,
> Opt_debug,
> nr__rdt_params
> };
>
> static const struct fs_parameter_spec rdt_fs_parameters[] = {
> - fsparam_flag("cdp", Opt_cdp),
> - fsparam_flag("cdpl2", Opt_cdpl2),
> - fsparam_flag("mba_MBps", Opt_mba_mbps),
> - fsparam_flag("debug", Opt_debug),
> + fsparam_flag("cdp", Opt_cdp),
> + fsparam_flag("cdpl2", Opt_cdpl2),
> + fsparam_flag("mba_MBps", Opt_mba_mbps),
> + fsparam_string("mba_MBps_event", Opt_mba_mbps_event),
> + fsparam_flag("debug", Opt_debug),
> {}
> };
>
> @@ -2715,7 +2723,27 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
> case Opt_mba_mbps:
> if (!supports_mba_mbps())
> return -EINVAL;
> - ctx->enable_mba_mbps = true;
> + if (is_mbm_local_enabled())
> + ctx->enable_mba_mbps_local = true;
> + else if (is_mbm_total_enabled())
> + ctx->enable_mba_mbps_total = true;
> + else
> + return -EINVAL;
> + return 0;
> + case Opt_mba_mbps_event:
> + if (!supports_mba_mbps())
> + return -EINVAL;
> + if (!strcmp("local", param->string)) {
> + if (!is_mbm_local_enabled())
> + return -EINVAL;
> + ctx->enable_mba_mbps_local = true;
> + } else if (!strcmp("total", param->string)) {
> + if (!is_mbm_total_enabled())
> + return -EINVAL;
> + ctx->enable_mba_mbps_total = true;
> + } else {
> + return -EINVAL;
> + }
> return 0;
> case Opt_debug:
> ctx->enable_debug = true;

--
Thanks
Babu Moger

2023-12-04 18:17:00

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v5] x86/resctrl: Add event choices for mba_MBps

On Mon, Dec 04, 2023 at 10:24:58AM -0600, Moger, Babu wrote:
> Hi Tony,
>
> You are intending to achieve two things at once here.
> 1. Adding new mount option
> 2. Changing behaviour for the current option.
> I think you need to split this patch into two. Few comments below.

Hi Babu,

Thanks for looking at this patch.

You are right. I will split the patch into two as you suggest.

> On 12/1/23 15:47, Tony Luck wrote:
> > The MBA Software Controller(mba_sc) is a feedback loop that uses
> > measurements of local memory bandwidth to adjust MBA throttling levels to
> > keep workloads in a resctrl group within a target bandwidth set in the
> > schemata file.
> >
> > But on Intel systems the memory bandwidth monitoring events are
> > independently enumerated. It is possible for a system to support
> > total memory bandwidth monitoring, but not support local bandwidth
> > monitoring. On such a system a user could not enable mba_sc mode.
> > Users will see this highly unhelpful error message from mount:
> >
> > # mount -t resctrl -o mba_MBps resctrl /sys/fs/resctrl
> > mount: /sys/fs/resctrl: wrong fs type, bad option, bad superblock on
> > resctrl, missing codepage or helper program, or other error.
> > dmesg(1) may have more information after failed mount system call.
> >
> > dmesg(1) does not provide any additional information.
> >
> > Add a new mount option "mba_MBps_event=[local|total]" that allows
> > a user to specify which monitoring event to use. Also modify the
> > existing "mba_MBps" option to switch to total bandwidth monitoring
> > if local monitoring is not available.
>
> I am not sure why you need both these options. I feel you just need one of
> these options.

I should have included "changes since v4" in with this message, and
pasted in some parts of this earlier messge from the discussion about
v4:

https://lore.kernel.org/all/ZWpF5m4mIeZdK8kv@agluck-desk3/

Having the option take "local" would give a way for a user to
avoid the failover to using "total" if they really didn't want
that to happen.

Not in that message, because I didn't think of it until later, it
opens the door for different events in the future.

But I'm also open to other suggestions on naming and function of
mount options here.

Thanks

-Tony

2023-12-04 19:05:43

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v5] x86/resctrl: Add event choices for mba_MBps

Hi Tony,

On 12/4/23 12:16, Tony Luck wrote:
> On Mon, Dec 04, 2023 at 10:24:58AM -0600, Moger, Babu wrote:
>> Hi Tony,
>>
>> You are intending to achieve two things at once here.
>> 1. Adding new mount option
>> 2. Changing behaviour for the current option.
>> I think you need to split this patch into two. Few comments below.
>
> Hi Babu,
>
> Thanks for looking at this patch.
>
> You are right. I will split the patch into two as you suggest.
>
>> On 12/1/23 15:47, Tony Luck wrote:
>>> The MBA Software Controller(mba_sc) is a feedback loop that uses
>>> measurements of local memory bandwidth to adjust MBA throttling levels to
>>> keep workloads in a resctrl group within a target bandwidth set in the
>>> schemata file.
>>>
>>> But on Intel systems the memory bandwidth monitoring events are
>>> independently enumerated. It is possible for a system to support
>>> total memory bandwidth monitoring, but not support local bandwidth
>>> monitoring. On such a system a user could not enable mba_sc mode.
>>> Users will see this highly unhelpful error message from mount:
>>>
>>> # mount -t resctrl -o mba_MBps resctrl /sys/fs/resctrl
>>> mount: /sys/fs/resctrl: wrong fs type, bad option, bad superblock on
>>> resctrl, missing codepage or helper program, or other error.
>>> dmesg(1) may have more information after failed mount system call.
>>>
>>> dmesg(1) does not provide any additional information.
>>>
>>> Add a new mount option "mba_MBps_event=[local|total]" that allows
>>> a user to specify which monitoring event to use. Also modify the
>>> existing "mba_MBps" option to switch to total bandwidth monitoring
>>> if local monitoring is not available.
>>
>> I am not sure why you need both these options. I feel you just need one of
>> these options.
>
> I should have included "changes since v4" in with this message, and
> pasted in some parts of this earlier messge from the discussion about
> v4:
>
> https://lore.kernel.org/all/ZWpF5m4mIeZdK8kv@agluck-desk3/
>
> Having the option take "local" would give a way for a user to
> avoid the failover to using "total" if they really didn't want
> that to happen.

Yes. I saw the thread. Even then I feel having two similar options can
cause confusion. I feel it is enough just to solve the original problem.
Giving more options to a corner cases is a overkill in my opinion.

Thanks
Babu


>
> Not in that message, because I didn't think of it until later, it
> opens the door for different events in the future.
>
> But I'm also open to other suggestions on naming and function of
> mount options here.
>
> Thanks
>
> -Tony

--
Thanks
Babu Moger

2023-12-04 19:46:10

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v5] x86/resctrl: Add event choices for mba_MBps

> Yes. I saw the thread. Even then I feel having two similar options can
> cause confusion. I feel it is enough just to solve the original problem.
> Giving more options to a corner cases is a overkill in my opinion.

The "original" problem was systems without "local" bandwidth event. I
wanted to give a way for users of mba_MBps to still have some way to
use it (assuming that "total" bandwidth event was present).

Reinette suggested that some people might want to use "total", even
on systems that support "local". I firmly agree with that. It is easy to
construct scenarios where most bandwidth is to a remote node. using
"local" event will do nothing to throttle in these case. I'm not at all sure
why "local" event was picked. There's nothing in the LKML threads to
give clues.

I proposed a mount option "total" as a modifier to be used in conjunction
with "mba_MBps". Reinette said it was too generic. Her suggestion was
to add "mba_MBps_total" to be used instead of "mba_MBps".

Is that where I should have gone, instead of "mba_MBps={local|total}"?

-Tony

2023-12-04 20:03:53

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v5] x86/resctrl: Add event choices for mba_MBps

Hi Tony,

On 12/4/2023 11:45 AM, Luck, Tony wrote:
>> Yes. I saw the thread. Even then I feel having two similar options can
>> cause confusion. I feel it is enough just to solve the original problem.
>> Giving more options to a corner cases is a overkill in my opinion.
>
> The "original" problem was systems without "local" bandwidth event. I
> wanted to give a way for users of mba_MBps to still have some way to
> use it (assuming that "total" bandwidth event was present).
>
> Reinette suggested that some people might want to use "total", even
> on systems that support "local". I firmly agree with that. It is easy to
> construct scenarios where most bandwidth is to a remote node. using
> "local" event will do nothing to throttle in these case. I'm not at all sure
> why "local" event was picked. There's nothing in the LKML threads to
> give clues.
>
> I proposed a mount option "total" as a modifier to be used in conjunction
> with "mba_MBps". Reinette said it was too generic. Her suggestion was
> to add "mba_MBps_total" to be used instead of "mba_MBps".

No, it cannot be used instead of "mba_MBps". My intention was for it to be
in addition to existing "mba_MBps" since taking "mba_MBps" away would be
considered breaking user space ABI.

Even so ...

>
> Is that where I should have gone, instead of "mba_MBps={local|total}"?

While I did propose "mba_MBps_total" (in addition to "mba_MBps") I do
recognize your comment that a new key of mba_MBps_event does give more
flexibility if different events become available in future. Emphasis is
on "different" since I do not believe the parsing can support multiple
events and thus mba_MBps_event cannot be treated as a general bucket for all
mba_sc options, just different events guiding the feedback loop.

"mba_MBps" must be kept and having it continue to use local bw as default,
but total bw on systems that do not support local bw seems appropriate,
(which is what this patch does).

Reinette

2023-12-04 21:08:09

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v5] x86/resctrl: Add event choices for mba_MBps

On Mon, Dec 04, 2023 at 12:03:23PM -0800, Reinette Chatre wrote:
> Hi Tony,
>
> On 12/4/2023 11:45 AM, Luck, Tony wrote:
> >> Yes. I saw the thread. Even then I feel having two similar options can
> >> cause confusion. I feel it is enough just to solve the original problem.
> >> Giving more options to a corner cases is a overkill in my opinion.
> >
> > The "original" problem was systems without "local" bandwidth event. I
> > wanted to give a way for users of mba_MBps to still have some way to
> > use it (assuming that "total" bandwidth event was present).
> >
> > Reinette suggested that some people might want to use "total", even
> > on systems that support "local". I firmly agree with that. It is easy to
> > construct scenarios where most bandwidth is to a remote node. using
> > "local" event will do nothing to throttle in these case. I'm not at all sure
> > why "local" event was picked. There's nothing in the LKML threads to
> > give clues.
> >
> > I proposed a mount option "total" as a modifier to be used in conjunction
> > with "mba_MBps". Reinette said it was too generic. Her suggestion was
> > to add "mba_MBps_total" to be used instead of "mba_MBps".
>
> No, it cannot be used instead of "mba_MBps". My intention was for it to be
> in addition to existing "mba_MBps" since taking "mba_MBps" away would be
> considered breaking user space ABI.

I was unclear. The mba_MBps option must remain as legacy ABI. My
"instead of" was intended to convey that a user wanting total bandwidth
would use:

# mount -t resctrl -o mba_MBps_total resctrl /sys/fs/resctrl

rather than the new option being a modifier and requiring both
the legacy option and the modifier like this:

# mount -t resctrl -o mba_MBps,mba_MBps_total resctrl /sys/fs/resctrl

which seems overly verbose.

>
> Even so ...
>
> >
> > Is that where I should have gone, instead of "mba_MBps={local|total}"?
>
> While I did propose "mba_MBps_total" (in addition to "mba_MBps") I do
> recognize your comment that a new key of mba_MBps_event does give more
> flexibility if different events become available in future. Emphasis is
> on "different" since I do not believe the parsing can support multiple
> events and thus mba_MBps_event cannot be treated as a general bucket for all
> mba_sc options, just different events guiding the feedback loop.
>
> "mba_MBps" must be kept and having it continue to use local bw as default,
> but total bw on systems that do not support local bw seems appropriate,
> (which is what this patch does).

So we defintely have:

"mba_MBps" - defaults to local, on systems without local may switch to
total if that is available. Should this switch get a pr_info()? Or just happen
silently (as I've done in patches so far).

and need to come to agreement on which of these to implement:

A) "mba_MBps_total" - forces use of total. Fails the mount if total is not
available.

B) "mba_MBps={local|total)" forces use of chosen event, fails if event
is unavailable.

C) Something else.

D) Don't provide any way to force use of total event.

>
> Reinette

-Tony

2023-12-04 22:15:53

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v5] x86/resctrl: Add event choices for mba_MBps

Hi Tony,

On 12/4/2023 1:08 PM, Tony Luck wrote:
> On Mon, Dec 04, 2023 at 12:03:23PM -0800, Reinette Chatre wrote:
>> Hi Tony,
>>
>> On 12/4/2023 11:45 AM, Luck, Tony wrote:
>>>> Yes. I saw the thread. Even then I feel having two similar options can
>>>> cause confusion. I feel it is enough just to solve the original problem.
>>>> Giving more options to a corner cases is a overkill in my opinion.
>>>
>>> The "original" problem was systems without "local" bandwidth event. I
>>> wanted to give a way for users of mba_MBps to still have some way to
>>> use it (assuming that "total" bandwidth event was present).
>>>
>>> Reinette suggested that some people might want to use "total", even
>>> on systems that support "local". I firmly agree with that. It is easy to
>>> construct scenarios where most bandwidth is to a remote node. using
>>> "local" event will do nothing to throttle in these case. I'm not at all sure
>>> why "local" event was picked. There's nothing in the LKML threads to
>>> give clues.
>>>
>>> I proposed a mount option "total" as a modifier to be used in conjunction
>>> with "mba_MBps". Reinette said it was too generic. Her suggestion was
>>> to add "mba_MBps_total" to be used instead of "mba_MBps".
>>
>> No, it cannot be used instead of "mba_MBps". My intention was for it to be
>> in addition to existing "mba_MBps" since taking "mba_MBps" away would be
>> considered breaking user space ABI.
>
> I was unclear. The mba_MBps option must remain as legacy ABI. My
> "instead of" was intended to convey that a user wanting total bandwidth
> would use:
>
> # mount -t resctrl -o mba_MBps_total resctrl /sys/fs/resctrl
>
> rather than the new option being a modifier and requiring both
> the legacy option and the modifier like this:
>
> # mount -t resctrl -o mba_MBps,mba_MBps_total resctrl /sys/fs/resctrl
>
> which seems overly verbose.
>

I misunderstood your comment. Thank you for clarifying.

>>
>> Even so ...
>>
>>>
>>> Is that where I should have gone, instead of "mba_MBps={local|total}"?
>>
>> While I did propose "mba_MBps_total" (in addition to "mba_MBps") I do
>> recognize your comment that a new key of mba_MBps_event does give more
>> flexibility if different events become available in future. Emphasis is
>> on "different" since I do not believe the parsing can support multiple
>> events and thus mba_MBps_event cannot be treated as a general bucket for all
>> mba_sc options, just different events guiding the feedback loop.
>>
>> "mba_MBps" must be kept and having it continue to use local bw as default,
>> but total bw on systems that do not support local bw seems appropriate,
>> (which is what this patch does).
>
> So we defintely have:
>
> "mba_MBps" - defaults to local, on systems without local may switch to
> total if that is available. Should this switch get a pr_info()? Or just happen
> silently (as I've done in patches so far).

I do not think a message is required ... I cannot see how it provides information
that user space does not already know. It surely does no harm and I would not
object if it is added. Even so, I do not think a kernel message should be what
is relied on to share with user space what guides the feedback loop. To that end
I think the mount options combined with the system capabilities (learned via
resctrl) provide that information.

Please do note that if the solution of this version is maintained then
rdtgroup_show_options() needs to be updated. With that done, user space should
be able to determine at any time during runtime (no matter if kernel log has been
cleared) which event is being used.

> and need to come to agreement on which of these to implement:
>
> A) "mba_MBps_total" - forces use of total. Fails the mount if total is not
> available.

The "cons" of this is that (a) user is not able to prevent the failover,
and (b) harder to support future events (which are unknown and difficult
to prepare for anyway).

>
> B) "mba_MBps={local|total)" forces use of chosen event, fails if event
> is unavailable.

I assume you mean "mba_MBps_event={local|total}". This is my preference but
I would like to learn more about Babu's "overkill" argument. I do believe that
(B) is well motivated in [1] and has no impact on AMD.
My uncertainty here (considering one motivation is to future proof it against
adding more events) is the generic "local" and "total" names. Even so, all
I could come up with are "local_bw" and "total_bw", which I do not think are
improvements.

>
> C) Something else.
>
> D) Don't provide any way to force use of total event.

(D) is what would be needed at minimum.

In support of Babu's comment I can see how having mba_MBps as well as
mba_MBps_event can cause confusion. To help with this the documentation could
be expanded with more detailed content and also examples.

Reinette

[1] https://lore.kernel.org/lkml/SJ1PR11MB60839F92B1C15A659CD32478FC86A@SJ1PR11MB6083.namprd11.prod.outlook.com/

2023-12-04 22:52:25

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v5] x86/resctrl: Add event choices for mba_MBps



On 12/4/2023 2:15 PM, Reinette Chatre wrote:
> On 12/4/2023 1:08 PM, Tony Luck wrote:

>>
>> B) "mba_MBps={local|total)" forces use of chosen event, fails if event
>> is unavailable.
>
> I assume you mean "mba_MBps_event={local|total}". This is my preference but
> I would like to learn more about Babu's "overkill" argument. I do believe that
> (B) is well motivated in [1] and has no impact on AMD.
> My uncertainty here (considering one motivation is to future proof it against
> adding more events) is the generic "local" and "total" names. Even so, all
> I could come up with are "local_bw" and "total_bw", which I do not think are
> improvements.

Thinking about this more the existing names in resctrl may help here
... "mbm_total_bytes" and "mbm_local_bytes"? Any new event would need to
integrate with these names.

Reinette

2023-12-07 19:56:44

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v6 0/3] x86/resctrl: mba_MBps enhancements

Two changes relating to the MBA Software Controller(mba_sc):

1) Add a new mount option so the user can choose which memory
bandwidth monitoring event to use as the input to the feedback
loop.

2) Update the "mba_MBps" mount option to make use of total memory
bandwidth event on systems that do not support local bandwidth
event.

Signed-off-by: Tony Luck <[email protected]>

---
Changes since v5:

Babu: Split into separate patches for the new mount option and the
update to the exising mount option. Since this is now a series, I
also moved the Documentation change to its own patch.

Reinette: Use "mbm_local_bytes", "mbm_total_bytes" as the strings for
the new "mba_MBps_event" option. This sets a precedent that new events
should follow the naming convention for the monitor files for the event.

Reinette: Update rdtgroup_show_options(). I'd completely missed this in
the earlier versions. Note that when the legacy "mba_MBps" mount option
is used, this will show in /proc as if the new option was used:

# mount -t resctrl -o mba_MBps resctrl /sys/fs/resctrl/
# grep resctrl /proc/mounts
resctrl /sys/fs/resctrl resctrl rw,relatime,mba_MBps_event=mbm_local_bytes 0 0

Changing this to exacly match what the user typed would lose the detail
of which event is being used.

Reinette: More documentation needed. I added some text on why using
total instead of local might be useful to some users.

Tony Luck (3):
x86/resctrl: Add mount option "mba_MBps_event"
x86/resctrl: Use total bandwidth for mba_MBps option when local isn't
present
x86/resctrl: Add new "mba_MBps_event" mount option to documentation

Documentation/arch/x86/resctrl.rst | 16 ++++++-
include/linux/resctrl.h | 2 +
arch/x86/kernel/cpu/resctrl/internal.h | 3 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 21 ++++-----
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 63 +++++++++++++++++++++-----
5 files changed, 79 insertions(+), 26 deletions(-)


base-commit: 33cc938e65a98f1d29d0a18403dbbee050dcad9a
--
2.41.0

2023-12-07 19:56:47

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v6 2/3] x86/resctrl: Use total bandwidth for mba_MBps option when local isn't present

On Intel systems the memory bandwidth monitoring events are
independently enumerated. It is possible for a system to support
total memory bandwidth monitoring, but not support local bandwidth
monitoring. On such a system a user could not enable mba_sc mode.
Users will see this highly unhelpful error message from mount:

# mount -t resctrl -o mba_MBps resctrl /sys/fs/resctrl
mount: /sys/fs/resctrl: wrong fs type, bad option, bad superblock on
resctrl, missing codepage or helper program, or other error.
dmesg(1) may have more information after failed mount system call.

dmesg(1) does not provide any additional information.

Modify the existing "mba_MBps" mount option to switch to total bandwidth
monitoring if local monitoring is not available.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 5f64a0b2597c..7410513db45a 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2725,6 +2725,8 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
return -EINVAL;
if (is_mbm_local_enabled())
ctx->enable_mba_mbps_local = true;
+ else if (is_mbm_total_enabled())
+ ctx->enable_mba_mbps_total = true;
else
return -EINVAL;
return 0;
--
2.41.0

2023-12-07 19:56:52

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v6 3/3] x86/resctrl: Add new "mba_MBps_event" mount option to documentation

New mount option may be used to choose a specific memory bandwidth
monitoring event to feed the MBA Software Controller(mba_sc) feedback
loop.

Signed-off-by: Tony Luck <[email protected]>
---
Documentation/arch/x86/resctrl.rst | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
index a6279df64a9d..a0c521db6786 100644
--- a/Documentation/arch/x86/resctrl.rst
+++ b/Documentation/arch/x86/resctrl.rst
@@ -35,7 +35,8 @@ about the feature from resctrl's info directory.

To use the feature mount the file system::

- # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl
+ # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps] \
+ [,mba_MBps_event=[mbm_local_bytes|mbm_total_bytes]][,debug]] /sys/fs/resctrl

mount options are:

@@ -45,7 +46,12 @@ mount options are:
Enable code/data prioritization in L2 cache allocations.
"mba_MBps":
Enable the MBA Software Controller(mba_sc) to specify MBA
- bandwidth in MBps
+ bandwidth in MBps. Defaults to using MBM local bandwidth,
+ but will use total bandwidth on systems that do not support
+ local bandwidth monitoring.
+"mba_MBps_event=[mbm_local_bytes|mbm_total_bytes]":
+ Enable the MBA Software Controller(mba_sc) with a specific
+ MBM event as input to the feedback loop.
"debug":
Make debug files accessible. Available debug files are annotated with
"Available only with debug option".
@@ -538,6 +544,12 @@ where as user can switch to the "MBA software controller" mode using
a mount option 'mba_MBps'. The schemata format is specified in the below
sections.

+By default the software feedback mechanism uses measurement of local
+memory bandwidth to make adjustments to throttling levels. If a system
+is running applications with poor NUMA locality users may want to use
+the "mba_MBps_event=mbm_total_bytes" mount option which will use total
+memory bandwidth measurements instead of local.
+
L3 schemata file details (code and data prioritization disabled)
----------------------------------------------------------------
With CDP disabled the L3 schemata format is::
--
2.41.0

2023-12-07 19:57:00

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v6 1/3] x86/resctrl: Add mount option "mba_MBps_event"

The MBA Software Controller(mba_sc) is a feedback loop that uses
measurements of local memory bandwidth to adjust MBA throttling levels
to keep workloads in a resctrl group within a target bandwidth set in
the schemata file.

Users may want to use total memory bandwidth instead of local to handle
workloads that have poor NUMA localization.

Add a new mount option "mba_MBps_event={event_name}" where event_name
is one of "mbm_Local_bytes" or "mbm_total_bytes" that allows a user to
specify which monitoring event to use.

Update the once-per-second polling code to use the chosen event (local
or total memory bandwidth).

Signed-off-by: Tony Luck <[email protected]>
---
include/linux/resctrl.h | 2 +
arch/x86/kernel/cpu/resctrl/internal.h | 3 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 21 +++++----
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 61 +++++++++++++++++++++-----
4 files changed, 63 insertions(+), 24 deletions(-)

diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index 66942d7fba7f..1feb3b2e64fa 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -129,6 +129,7 @@ enum membw_throttle_mode {
* @throttle_mode: Bandwidth throttling mode when threads request
* different memory bandwidths
* @mba_sc: True if MBA software controller(mba_sc) is enabled
+ * @mba_mbps_event: Event (local or total) for mba_sc
* @mb_map: Mapping of memory B/W percentage to memory B/W delay
*/
struct resctrl_membw {
@@ -138,6 +139,7 @@ struct resctrl_membw {
bool arch_needs_linear;
enum membw_throttle_mode throttle_mode;
bool mba_sc;
+ enum resctrl_event_id mba_mbps_event;
u32 *mb_map;
};

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..8b9b8f664324 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -58,7 +58,8 @@ struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
bool enable_cdpl3;
- bool enable_mba_mbps;
+ bool enable_mba_mbps_local;
+ bool enable_mba_mbps_total;
bool enable_debug;
};

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..d9e590f1cbc3 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -431,9 +431,10 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
*/
static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
{
- struct mbm_state *m = &rr->d->mbm_local[rmid];
u64 cur_bw, bytes, cur_bytes;
+ struct mbm_state *m;

+ m = get_mbm_state(rr->d, rmid, rr->evtid);
cur_bytes = rr->val;
bytes = cur_bytes - m->prev_bw_bytes;
m->prev_bw_bytes = cur_bytes;
@@ -521,19 +522,21 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
u32 cur_bw, delta_bw, user_bw;
+ enum resctrl_event_id evt_id;
struct rdt_resource *r_mba;
struct rdt_domain *dom_mba;
struct list_head *head;
struct rdtgroup *entry;

- if (!is_mbm_local_enabled())
+ if (!is_mbm_enabled())
return;

r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
+ evt_id = r_mba->membw.mba_mbps_event;

closid = rgrp->closid;
rmid = rgrp->mon.rmid;
- pmbm_data = &dom_mbm->mbm_local[rmid];
+ pmbm_data = get_mbm_state(dom_mbm, rmid, evt_id);

dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
if (!dom_mba) {
@@ -553,7 +556,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
*/
head = &rgrp->mon.crdtgrp_list;
list_for_each_entry(entry, head, mon.crdtgrp_list) {
- cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
+ cmbm_data = get_mbm_state(dom_mbm, entry->mon.rmid, evt_id);
cur_bw += cmbm_data->prev_bw;
delta_bw += cmbm_data->delta_bw;
}
@@ -616,18 +619,14 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
rr.val = 0;
__mon_event_count(rmid, &rr);
+ if (is_mba_sc(NULL) && rr.evtid == r->membw.mba_mbps_event)
+ mbm_bw_count(rmid, &rr);
}
if (is_mbm_local_enabled()) {
rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
rr.val = 0;
__mon_event_count(rmid, &rr);
-
- /*
- * Call the MBA software controller only for the
- * control groups and when user has enabled
- * the software controller explicitly.
- */
- if (is_mba_sc(NULL))
+ if (is_mba_sc(NULL) && rr.evtid == r->membw.mba_mbps_event)
mbm_bw_count(rmid, &rr);
}
}
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 69a1de92384a..5f64a0b2597c 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2294,7 +2294,7 @@ static bool supports_mba_mbps(void)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;

- return (is_mbm_local_enabled() &&
+ return (is_mbm_enabled() &&
r->alloc_capable && is_mba_linear());
}

@@ -2302,7 +2302,7 @@ static bool supports_mba_mbps(void)
* Enable or disable the MBA software controller
* which helps user specify bandwidth in MBps.
*/
-static int set_mba_sc(bool mba_sc)
+static int set_mba_sc(bool mba_sc, enum resctrl_event_id mba_mbps_event)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
u32 num_closid = resctrl_arch_get_num_closid(r);
@@ -2313,6 +2313,7 @@ static int set_mba_sc(bool mba_sc)
return -EINVAL;

r->membw.mba_sc = mba_sc;
+ r->membw.mba_mbps_event = mba_mbps_event;

list_for_each_entry(d, &r->domains, list) {
for (i = 0; i < num_closid; i++)
@@ -2445,13 +2446,14 @@ static void rdt_disable_ctx(void)
{
resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false);
resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false);
- set_mba_sc(false);
+ set_mba_sc(false, QOS_L3_MBM_LOCAL_EVENT_ID);

resctrl_debug = false;
}

static int rdt_enable_ctx(struct rdt_fs_context *ctx)
{
+ enum resctrl_event_id mba_mbps_event;
int ret = 0;

if (ctx->enable_cdpl2) {
@@ -2466,8 +2468,12 @@ static int rdt_enable_ctx(struct rdt_fs_context *ctx)
goto out_cdpl2;
}

- if (ctx->enable_mba_mbps) {
- ret = set_mba_sc(true);
+ if (ctx->enable_mba_mbps_local || ctx->enable_mba_mbps_total) {
+ if (ctx->enable_mba_mbps_total)
+ mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
+ else
+ mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
+ ret = set_mba_sc(true, mba_mbps_event);
if (ret)
goto out_cdpl3;
}
@@ -2683,15 +2689,17 @@ enum rdt_param {
Opt_cdp,
Opt_cdpl2,
Opt_mba_mbps,
+ Opt_mba_mbps_event,
Opt_debug,
nr__rdt_params
};

static const struct fs_parameter_spec rdt_fs_parameters[] = {
- fsparam_flag("cdp", Opt_cdp),
- fsparam_flag("cdpl2", Opt_cdpl2),
- fsparam_flag("mba_MBps", Opt_mba_mbps),
- fsparam_flag("debug", Opt_debug),
+ fsparam_flag("cdp", Opt_cdp),
+ fsparam_flag("cdpl2", Opt_cdpl2),
+ fsparam_flag("mba_MBps", Opt_mba_mbps),
+ fsparam_string("mba_MBps_event", Opt_mba_mbps_event),
+ fsparam_flag("debug", Opt_debug),
{}
};

@@ -2715,7 +2723,25 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
case Opt_mba_mbps:
if (!supports_mba_mbps())
return -EINVAL;
- ctx->enable_mba_mbps = true;
+ if (is_mbm_local_enabled())
+ ctx->enable_mba_mbps_local = true;
+ else
+ return -EINVAL;
+ return 0;
+ case Opt_mba_mbps_event:
+ if (!supports_mba_mbps())
+ return -EINVAL;
+ if (!strcmp("mbm_local_bytes", param->string)) {
+ if (!is_mbm_local_enabled())
+ return -EINVAL;
+ ctx->enable_mba_mbps_local = true;
+ } else if (!strcmp("mbm_total_bytes", param->string)) {
+ if (!is_mbm_total_enabled())
+ return -EINVAL;
+ ctx->enable_mba_mbps_total = true;
+ } else {
+ return -EINVAL;
+ }
return 0;
case Opt_debug:
ctx->enable_debug = true;
@@ -3780,16 +3806,27 @@ static int rdtgroup_rename(struct kernfs_node *kn,
return ret;
}

+static char *mba_sc_event_opt_name(struct rdt_resource *r)
+{
+ if (r->membw.mba_mbps_event == QOS_L3_MBM_LOCAL_EVENT_ID)
+ return ",mba_MBps_event=mbm_local_bytes";
+ else if (r->membw.mba_mbps_event == QOS_L3_MBM_TOTAL_EVENT_ID)
+ return ",mba_MBps_event=mbm_total_bytes";
+ return "";
+}
+
static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf)
{
+ struct rdt_resource *r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
+
if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3))
seq_puts(seq, ",cdp");

if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2))
seq_puts(seq, ",cdpl2");

- if (is_mba_sc(&rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl))
- seq_puts(seq, ",mba_MBps");
+ if (is_mba_sc(r_mba))
+ seq_puts(seq, mba_sc_event_opt_name(r_mba));

if (resctrl_debug)
seq_puts(seq, ",debug");
--
2.41.0

2023-12-08 18:17:33

by Peter Newman

[permalink] [raw]
Subject: Re: [PATCH v6 1/3] x86/resctrl: Add mount option "mba_MBps_event"

Hi Tony,

On Thu, Dec 7, 2023 at 11:56 AM Tony Luck <[email protected]> wrote:
>
> The MBA Software Controller(mba_sc) is a feedback loop that uses
> measurements of local memory bandwidth to adjust MBA throttling levels
> to keep workloads in a resctrl group within a target bandwidth set in
> the schemata file.
>
> Users may want to use total memory bandwidth instead of local to handle
> workloads that have poor NUMA localization.
>
> Add a new mount option "mba_MBps_event={event_name}" where event_name
> is one of "mbm_Local_bytes" or "mbm_total_bytes" that allows a user to

It's "mbm_local_bytes" in the matching logic later on.


> specify which monitoring event to use.
>
> Update the once-per-second polling code to use the chosen event (local
> or total memory bandwidth).
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> include/linux/resctrl.h | 2 +
> arch/x86/kernel/cpu/resctrl/internal.h | 3 +-
> arch/x86/kernel/cpu/resctrl/monitor.c | 21 +++++----
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 61 +++++++++++++++++++++-----
> 4 files changed, 63 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 66942d7fba7f..1feb3b2e64fa 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -129,6 +129,7 @@ enum membw_throttle_mode {
> * @throttle_mode: Bandwidth throttling mode when threads request
> * different memory bandwidths
> * @mba_sc: True if MBA software controller(mba_sc) is enabled
> + * @mba_mbps_event: Event (local or total) for mba_sc
> * @mb_map: Mapping of memory B/W percentage to memory B/W delay
> */
> struct resctrl_membw {
> @@ -138,6 +139,7 @@ struct resctrl_membw {
> bool arch_needs_linear;
> enum membw_throttle_mode throttle_mode;
> bool mba_sc;
> + enum resctrl_event_id mba_mbps_event;
> u32 *mb_map;
> };
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a4f1aa15f0a2..8b9b8f664324 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -58,7 +58,8 @@ struct rdt_fs_context {
> struct kernfs_fs_context kfc;
> bool enable_cdpl2;
> bool enable_cdpl3;
> - bool enable_mba_mbps;
> + bool enable_mba_mbps_local;
> + bool enable_mba_mbps_total;
> bool enable_debug;
> };
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f136ac046851..d9e590f1cbc3 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -431,9 +431,10 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
> */
> static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
> {
> - struct mbm_state *m = &rr->d->mbm_local[rmid];
> u64 cur_bw, bytes, cur_bytes;
> + struct mbm_state *m;
>
> + m = get_mbm_state(rr->d, rmid, rr->evtid);

WARN_ON(m == NULL) since we assume the caller has confirmed rr->evtid
is an MBM event?

> cur_bytes = rr->val;
> bytes = cur_bytes - m->prev_bw_bytes;
> m->prev_bw_bytes = cur_bytes;
> @@ -521,19 +522,21 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> u32 closid, rmid, cur_msr_val, new_msr_val;
> struct mbm_state *pmbm_data, *cmbm_data;
> u32 cur_bw, delta_bw, user_bw;
> + enum resctrl_event_id evt_id;
> struct rdt_resource *r_mba;
> struct rdt_domain *dom_mba;
> struct list_head *head;
> struct rdtgroup *entry;
>
> - if (!is_mbm_local_enabled())
> + if (!is_mbm_enabled())
> return;
>
> r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
> + evt_id = r_mba->membw.mba_mbps_event;
>
> closid = rgrp->closid;
> rmid = rgrp->mon.rmid;
> - pmbm_data = &dom_mbm->mbm_local[rmid];
> + pmbm_data = get_mbm_state(dom_mbm, rmid, evt_id);

One defensive WARN_ON((!pmbm_data) for this function to ensure evt_id
is valid for this call and the ones in the loop below?

>
> dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
> if (!dom_mba) {
> @@ -553,7 +556,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> */
> head = &rgrp->mon.crdtgrp_list;
> list_for_each_entry(entry, head, mon.crdtgrp_list) {
> - cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
> + cmbm_data = get_mbm_state(dom_mbm, entry->mon.rmid, evt_id);
> cur_bw += cmbm_data->prev_bw;
> delta_bw += cmbm_data->delta_bw;
> }
> @@ -616,18 +619,14 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
> rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
> rr.val = 0;
> __mon_event_count(rmid, &rr);
> + if (is_mba_sc(NULL) && rr.evtid == r->membw.mba_mbps_event)
> + mbm_bw_count(rmid, &rr);
> }
> if (is_mbm_local_enabled()) {
> rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
> rr.val = 0;
> __mon_event_count(rmid, &rr);
> -
> - /*
> - * Call the MBA software controller only for the
> - * control groups and when user has enabled
> - * the software controller explicitly.
> - */
> - if (is_mba_sc(NULL))
> + if (is_mba_sc(NULL) && rr.evtid == r->membw.mba_mbps_event)
> mbm_bw_count(rmid, &rr);
> }
> }
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 69a1de92384a..5f64a0b2597c 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2294,7 +2294,7 @@ static bool supports_mba_mbps(void)
> {
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>
> - return (is_mbm_local_enabled() &&
> + return (is_mbm_enabled() &&
> r->alloc_capable && is_mba_linear());
> }
>
> @@ -2302,7 +2302,7 @@ static bool supports_mba_mbps(void)
> * Enable or disable the MBA software controller
> * which helps user specify bandwidth in MBps.
> */
> -static int set_mba_sc(bool mba_sc)
> +static int set_mba_sc(bool mba_sc, enum resctrl_event_id mba_mbps_event)
> {
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
> u32 num_closid = resctrl_arch_get_num_closid(r);
> @@ -2313,6 +2313,7 @@ static int set_mba_sc(bool mba_sc)
> return -EINVAL;
>
> r->membw.mba_sc = mba_sc;
> + r->membw.mba_mbps_event = mba_mbps_event;
>
> list_for_each_entry(d, &r->domains, list) {
> for (i = 0; i < num_closid; i++)
> @@ -2445,13 +2446,14 @@ static void rdt_disable_ctx(void)
> {
> resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false);
> resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false);
> - set_mba_sc(false);
> + set_mba_sc(false, QOS_L3_MBM_LOCAL_EVENT_ID);
>
> resctrl_debug = false;
> }
>
> static int rdt_enable_ctx(struct rdt_fs_context *ctx)
> {
> + enum resctrl_event_id mba_mbps_event;
> int ret = 0;
>
> if (ctx->enable_cdpl2) {
> @@ -2466,8 +2468,12 @@ static int rdt_enable_ctx(struct rdt_fs_context *ctx)
> goto out_cdpl2;
> }
>
> - if (ctx->enable_mba_mbps) {
> - ret = set_mba_sc(true);
> + if (ctx->enable_mba_mbps_local || ctx->enable_mba_mbps_total) {
> + if (ctx->enable_mba_mbps_total)
> + mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
> + else
> + mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;

Total takes precedence over local when the user picks both.

> + ret = set_mba_sc(true, mba_mbps_event);
> if (ret)
> goto out_cdpl3;
> }
> @@ -2683,15 +2689,17 @@ enum rdt_param {
> Opt_cdp,
> Opt_cdpl2,
> Opt_mba_mbps,
> + Opt_mba_mbps_event,
> Opt_debug,
> nr__rdt_params
> };
>
> static const struct fs_parameter_spec rdt_fs_parameters[] = {
> - fsparam_flag("cdp", Opt_cdp),
> - fsparam_flag("cdpl2", Opt_cdpl2),
> - fsparam_flag("mba_MBps", Opt_mba_mbps),
> - fsparam_flag("debug", Opt_debug),
> + fsparam_flag("cdp", Opt_cdp),
> + fsparam_flag("cdpl2", Opt_cdpl2),
> + fsparam_flag("mba_MBps", Opt_mba_mbps),
> + fsparam_string("mba_MBps_event", Opt_mba_mbps_event),
> + fsparam_flag("debug", Opt_debug),
> {}
> };
>
> @@ -2715,7 +2723,25 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
> case Opt_mba_mbps:
> if (!supports_mba_mbps())
> return -EINVAL;
> - ctx->enable_mba_mbps = true;
> + if (is_mbm_local_enabled())
> + ctx->enable_mba_mbps_local = true;
> + else
> + return -EINVAL;
> + return 0;
> + case Opt_mba_mbps_event:
> + if (!supports_mba_mbps())
> + return -EINVAL;
> + if (!strcmp("mbm_local_bytes", param->string)) {
> + if (!is_mbm_local_enabled())
> + return -EINVAL;
> + ctx->enable_mba_mbps_local = true;
> + } else if (!strcmp("mbm_total_bytes", param->string)) {
> + if (!is_mbm_total_enabled())
> + return -EINVAL;
> + ctx->enable_mba_mbps_total = true;
> + } else {
> + return -EINVAL;

It looks like if I pass
"mba_MBps_event=mbm_total_bytes,mba_MBps_event=mbm_local_bytes" I can
set both flags true.

> + }
> return 0;
> case Opt_debug:
> ctx->enable_debug = true;
> @@ -3780,16 +3806,27 @@ static int rdtgroup_rename(struct kernfs_node *kn,
> return ret;
> }
>
> +static char *mba_sc_event_opt_name(struct rdt_resource *r)
> +{
> + if (r->membw.mba_mbps_event == QOS_L3_MBM_LOCAL_EVENT_ID)
> + return ",mba_MBps_event=mbm_local_bytes";
> + else if (r->membw.mba_mbps_event == QOS_L3_MBM_TOTAL_EVENT_ID)
> + return ",mba_MBps_event=mbm_total_bytes";
> + return "";
> +}
> +
> static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf)
> {
> + struct rdt_resource *r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
> +
> if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3))
> seq_puts(seq, ",cdp");
>
> if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2))
> seq_puts(seq, ",cdpl2");
>
> - if (is_mba_sc(&rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl))
> - seq_puts(seq, ",mba_MBps");
> + if (is_mba_sc(r_mba))
> + seq_puts(seq, mba_sc_event_opt_name(r_mba));
>
> if (resctrl_debug)
> seq_puts(seq, ",debug");
> --
> 2.41.0
>

Consider the setting-both-events quirk and a little bit of defensive
programming for get_mbm_data() returning NULL.

Assuming the case of "Local" is fixed in the commit message:

Reviewed-by: Peter Newman <[email protected]>

Thanks!

2023-12-08 18:26:32

by Peter Newman

[permalink] [raw]
Subject: Re: [PATCH v6 2/3] x86/resctrl: Use total bandwidth for mba_MBps option when local isn't present

On Thu, Dec 7, 2023 at 11:56 AM Tony Luck <[email protected]> wrote:
>
> On Intel systems the memory bandwidth monitoring events are
> independently enumerated. It is possible for a system to support
> total memory bandwidth monitoring, but not support local bandwidth
> monitoring. On such a system a user could not enable mba_sc mode.
> Users will see this highly unhelpful error message from mount:
>
> # mount -t resctrl -o mba_MBps resctrl /sys/fs/resctrl
> mount: /sys/fs/resctrl: wrong fs type, bad option, bad superblock on
> resctrl, missing codepage or helper program, or other error.
> dmesg(1) may have more information after failed mount system call.
>
> dmesg(1) does not provide any additional information.
>
> Modify the existing "mba_MBps" mount option to switch to total bandwidth
> monitoring if local monitoring is not available.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 5f64a0b2597c..7410513db45a 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2725,6 +2725,8 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
> return -EINVAL;
> if (is_mbm_local_enabled())
> ctx->enable_mba_mbps_local = true;
> + else if (is_mbm_total_enabled())
> + ctx->enable_mba_mbps_total = true;
> else
> return -EINVAL;
> return 0;
> --
> 2.41.0
>

Reviewed-by: Peter Newman <[email protected]>

2023-12-08 18:29:38

by Moger, Babu

[permalink] [raw]
Subject: Re: [PATCH v6 1/3] x86/resctrl: Add mount option "mba_MBps_event"

Hi Tony,

On 12/7/2023 1:56 PM, Tony Luck wrote:
> The MBA Software Controller(mba_sc) is a feedback loop that uses
> measurements of local memory bandwidth to adjust MBA throttling levels
> to keep workloads in a resctrl group within a target bandwidth set in
> the schemata file.
>
> Users may want to use total memory bandwidth instead of local to handle
> workloads that have poor NUMA localization.
>
> Add a new mount option "mba_MBps_event={event_name}" where event_name
> is one of "mbm_Local_bytes" or "mbm_total_bytes" that allows a user to
> specify which monitoring event to use.
>
> Update the once-per-second polling code to use the chosen event (local
> or total memory bandwidth).
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> include/linux/resctrl.h | 2 +
> arch/x86/kernel/cpu/resctrl/internal.h | 3 +-
> arch/x86/kernel/cpu/resctrl/monitor.c | 21 +++++----
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 61 +++++++++++++++++++++-----
> 4 files changed, 63 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 66942d7fba7f..1feb3b2e64fa 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -129,6 +129,7 @@ enum membw_throttle_mode {
> * @throttle_mode: Bandwidth throttling mode when threads request
> * different memory bandwidths
> * @mba_sc: True if MBA software controller(mba_sc) is enabled
> + * @mba_mbps_event: Event (local or total) for mba_sc
> * @mb_map: Mapping of memory B/W percentage to memory B/W delay
> */
> struct resctrl_membw {
> @@ -138,6 +139,7 @@ struct resctrl_membw {
> bool arch_needs_linear;
> enum membw_throttle_mode throttle_mode;
> bool mba_sc;
> + enum resctrl_event_id mba_mbps_event;
> u32 *mb_map;
> };
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a4f1aa15f0a2..8b9b8f664324 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -58,7 +58,8 @@ struct rdt_fs_context {
> struct kernfs_fs_context kfc;
> bool enable_cdpl2;
> bool enable_cdpl3;
> - bool enable_mba_mbps;
> + bool enable_mba_mbps_local;
> + bool enable_mba_mbps_total;
> bool enable_debug;
> };
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f136ac046851..d9e590f1cbc3 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -431,9 +431,10 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
> */
> static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
> {
> - struct mbm_state *m = &rr->d->mbm_local[rmid];
> u64 cur_bw, bytes, cur_bytes;
> + struct mbm_state *m;
>
> + m = get_mbm_state(rr->d, rmid, rr->evtid);
> cur_bytes = rr->val;
> bytes = cur_bytes - m->prev_bw_bytes;
> m->prev_bw_bytes = cur_bytes;
> @@ -521,19 +522,21 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> u32 closid, rmid, cur_msr_val, new_msr_val;
> struct mbm_state *pmbm_data, *cmbm_data;
> u32 cur_bw, delta_bw, user_bw;
> + enum resctrl_event_id evt_id;
> struct rdt_resource *r_mba;
> struct rdt_domain *dom_mba;
> struct list_head *head;
> struct rdtgroup *entry;
>
> - if (!is_mbm_local_enabled())
> + if (!is_mbm_enabled())
> return;
>
> r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
> + evt_id = r_mba->membw.mba_mbps_event;
>
> closid = rgrp->closid;
> rmid = rgrp->mon.rmid;
> - pmbm_data = &dom_mbm->mbm_local[rmid];
> + pmbm_data = get_mbm_state(dom_mbm, rmid, evt_id);
>
> dom_mba = get_domain_from_cpu(smp_processor_id(), r_mba);
> if (!dom_mba) {
> @@ -553,7 +556,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> */
> head = &rgrp->mon.crdtgrp_list;
> list_for_each_entry(entry, head, mon.crdtgrp_list) {
> - cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
> + cmbm_data = get_mbm_state(dom_mbm, entry->mon.rmid, evt_id);
> cur_bw += cmbm_data->prev_bw;
> delta_bw += cmbm_data->delta_bw;
> }
> @@ -616,18 +619,14 @@ static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
> rr.evtid = QOS_L3_MBM_TOTAL_EVENT_ID;
> rr.val = 0;
> __mon_event_count(rmid, &rr);
> + if (is_mba_sc(NULL) && rr.evtid == r->membw.mba_mbps_event)
> + mbm_bw_count(rmid, &rr);
> }
> if (is_mbm_local_enabled()) {
> rr.evtid = QOS_L3_MBM_LOCAL_EVENT_ID;
> rr.val = 0;
> __mon_event_count(rmid, &rr);
> -
> - /*
> - * Call the MBA software controller only for the
> - * control groups and when user has enabled
> - * the software controller explicitly.
> - */
> - if (is_mba_sc(NULL))
> + if (is_mba_sc(NULL) && rr.evtid == r->membw.mba_mbps_event)
> mbm_bw_count(rmid, &rr);
> }
> }
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 69a1de92384a..5f64a0b2597c 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2294,7 +2294,7 @@ static bool supports_mba_mbps(void)
> {
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
>
> - return (is_mbm_local_enabled() &&
> + return (is_mbm_enabled() &&
> r->alloc_capable && is_mba_linear());
> }
>
> @@ -2302,7 +2302,7 @@ static bool supports_mba_mbps(void)
> * Enable or disable the MBA software controller
> * which helps user specify bandwidth in MBps.
> */
> -static int set_mba_sc(bool mba_sc)
> +static int set_mba_sc(bool mba_sc, enum resctrl_event_id mba_mbps_event)
> {
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
> u32 num_closid = resctrl_arch_get_num_closid(r);
> @@ -2313,6 +2313,7 @@ static int set_mba_sc(bool mba_sc)
> return -EINVAL;
>
> r->membw.mba_sc = mba_sc;
> + r->membw.mba_mbps_event = mba_mbps_event;
>
> list_for_each_entry(d, &r->domains, list) {
> for (i = 0; i < num_closid; i++)
> @@ -2445,13 +2446,14 @@ static void rdt_disable_ctx(void)
> {
> resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false);
> resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false);
> - set_mba_sc(false);
> + set_mba_sc(false, QOS_L3_MBM_LOCAL_EVENT_ID);

This is kind of miss leading. Why do you pass
"QOS_L3_MBM_LOCAL_EVENT_ID" here?

If you move the following initialization to rdt_enable_ctx, then you
don't need to pass the second argument.

r->membw.mba_mbps_event = mba_mbps_event;

Thanks
Babu

>
> resctrl_debug = false;
> }
>
> static int rdt_enable_ctx(struct rdt_fs_context *ctx)
> {
> + enum resctrl_event_id mba_mbps_event;
> int ret = 0;
>
> if (ctx->enable_cdpl2) {
> @@ -2466,8 +2468,12 @@ static int rdt_enable_ctx(struct rdt_fs_context *ctx)
> goto out_cdpl2;
> }
>
> - if (ctx->enable_mba_mbps) {
> - ret = set_mba_sc(true);
> + if (ctx->enable_mba_mbps_local || ctx->enable_mba_mbps_total) {
> + if (ctx->enable_mba_mbps_total)
> + mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
> + else
> + mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
> + ret = set_mba_sc(true, mba_mbps_event);
> if (ret)
> goto out_cdpl3;
> }
> @@ -2683,15 +2689,17 @@ enum rdt_param {
> Opt_cdp,
> Opt_cdpl2,
> Opt_mba_mbps,
> + Opt_mba_mbps_event,
> Opt_debug,
> nr__rdt_params
> };
>
> static const struct fs_parameter_spec rdt_fs_parameters[] = {
> - fsparam_flag("cdp", Opt_cdp),
> - fsparam_flag("cdpl2", Opt_cdpl2),
> - fsparam_flag("mba_MBps", Opt_mba_mbps),
> - fsparam_flag("debug", Opt_debug),
> + fsparam_flag("cdp", Opt_cdp),
> + fsparam_flag("cdpl2", Opt_cdpl2),
> + fsparam_flag("mba_MBps", Opt_mba_mbps),
> + fsparam_string("mba_MBps_event", Opt_mba_mbps_event),
> + fsparam_flag("debug", Opt_debug),
> {}
> };
>
> @@ -2715,7 +2723,25 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
> case Opt_mba_mbps:
> if (!supports_mba_mbps())
> return -EINVAL;
> - ctx->enable_mba_mbps = true;
> + if (is_mbm_local_enabled())
> + ctx->enable_mba_mbps_local = true;
> + else
> + return -EINVAL;
> + return 0;
> + case Opt_mba_mbps_event:
> + if (!supports_mba_mbps())
> + return -EINVAL;
> + if (!strcmp("mbm_local_bytes", param->string)) {
> + if (!is_mbm_local_enabled())
> + return -EINVAL;
> + ctx->enable_mba_mbps_local = true;
> + } else if (!strcmp("mbm_total_bytes", param->string)) {
> + if (!is_mbm_total_enabled())
> + return -EINVAL;
> + ctx->enable_mba_mbps_total = true;
> + } else {
> + return -EINVAL;
> + }
> return 0;
> case Opt_debug:
> ctx->enable_debug = true;
> @@ -3780,16 +3806,27 @@ static int rdtgroup_rename(struct kernfs_node *kn,
> return ret;
> }
>
> +static char *mba_sc_event_opt_name(struct rdt_resource *r)
> +{
> + if (r->membw.mba_mbps_event == QOS_L3_MBM_LOCAL_EVENT_ID)
> + return ",mba_MBps_event=mbm_local_bytes";
> + else if (r->membw.mba_mbps_event == QOS_L3_MBM_TOTAL_EVENT_ID)
> + return ",mba_MBps_event=mbm_total_bytes";
> + return "";
> +}
> +
> static int rdtgroup_show_options(struct seq_file *seq, struct kernfs_root *kf)
> {
> + struct rdt_resource *r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
> +
> if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L3))
> seq_puts(seq, ",cdp");
>
> if (resctrl_arch_get_cdp_enabled(RDT_RESOURCE_L2))
> seq_puts(seq, ",cdpl2");
>
> - if (is_mba_sc(&rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl))
> - seq_puts(seq, ",mba_MBps");
> + if (is_mba_sc(r_mba))
> + seq_puts(seq, mba_sc_event_opt_name(r_mba));
>
> if (resctrl_debug)
> seq_puts(seq, ",debug");

2023-12-08 19:22:59

by Peter Newman

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] x86/resctrl: Add new "mba_MBps_event" mount option to documentation

On Thu, Dec 7, 2023 at 11:56 AM Tony Luck <[email protected]> wrote:
>
> New mount option may be used to choose a specific memory bandwidth
> monitoring event to feed the MBA Software Controller(mba_sc) feedback
> loop.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> Documentation/arch/x86/resctrl.rst | 16 ++++++++++++++--
> 1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index a6279df64a9d..a0c521db6786 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -35,7 +35,8 @@ about the feature from resctrl's info directory.
>
> To use the feature mount the file system::
>
> - # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl
> + # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps] \
> + [,mba_MBps_event=[mbm_local_bytes|mbm_total_bytes]][,debug]] /sys/fs/resctrl
>
> mount options are:
>
> @@ -45,7 +46,12 @@ mount options are:
> Enable code/data prioritization in L2 cache allocations.
> "mba_MBps":
> Enable the MBA Software Controller(mba_sc) to specify MBA
> - bandwidth in MBps
> + bandwidth in MBps. Defaults to using MBM local bandwidth,
> + but will use total bandwidth on systems that do not support
> + local bandwidth monitoring.
> +"mba_MBps_event=[mbm_local_bytes|mbm_total_bytes]":
> + Enable the MBA Software Controller(mba_sc) with a specific
> + MBM event as input to the feedback loop.
> "debug":
> Make debug files accessible. Available debug files are annotated with
> "Available only with debug option".
> @@ -538,6 +544,12 @@ where as user can switch to the "MBA software controller" mode using
> a mount option 'mba_MBps'. The schemata format is specified in the below
> sections.
>
> +By default the software feedback mechanism uses measurement of local
> +memory bandwidth to make adjustments to throttling levels. If a system
> +is running applications with poor NUMA locality users may want to use
> +the "mba_MBps_event=mbm_total_bytes" mount option which will use total
> +memory bandwidth measurements instead of local.
> +
> L3 schemata file details (code and data prioritization disabled)
> ----------------------------------------------------------------
> With CDP disabled the L3 schemata format is::
> --
> 2.41.0
>

for content:

Reviewed-by: Peter Newman <[email protected]>

2023-12-08 21:50:52

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v6 1/3] x86/resctrl: Add mount option "mba_MBps_event"

On Fri, Dec 08, 2023 at 12:29:18PM -0600, Moger, Babu wrote:
> Hi Tony,
>
> On 12/7/2023 1:56 PM, Tony Luck wrote:
> > -static int set_mba_sc(bool mba_sc)
> > +static int set_mba_sc(bool mba_sc, enum resctrl_event_id mba_mbps_event)
> > {
> > struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
> > u32 num_closid = resctrl_arch_get_num_closid(r);
> > @@ -2313,6 +2313,7 @@ static int set_mba_sc(bool mba_sc)
> > return -EINVAL;
> > r->membw.mba_sc = mba_sc;
> > + r->membw.mba_mbps_event = mba_mbps_event;
> > list_for_each_entry(d, &r->domains, list) {
> > for (i = 0; i < num_closid; i++)
> > @@ -2445,13 +2446,14 @@ static void rdt_disable_ctx(void)
> > {
> > resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L3, false);
> > resctrl_arch_set_cdp_enabled(RDT_RESOURCE_L2, false);
> > - set_mba_sc(false);
> > + set_mba_sc(false, QOS_L3_MBM_LOCAL_EVENT_ID);
>
> This is kind of miss leading. Why do you pass "QOS_L3_MBM_LOCAL_EVENT_ID"
> here?
>
> If you move the following initialization to rdt_enable_ctx, then you don't
> need to pass the second argument.
>
> r->membw.mba_mbps_event = mba_mbps_event;

Babu,

Yes. That was funky. I will drop the second argumen to set_mba_sc() and
move the initialization to rdt_enable_ctx()

Thnaks for the review.

-Tony

2023-12-08 21:58:46

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v6 1/3] x86/resctrl: Add mount option "mba_MBps_event"

On Fri, Dec 08, 2023 at 10:17:08AM -0800, Peter Newman wrote:
> Hi Tony,
>
> On Thu, Dec 7, 2023 at 11:56 AM Tony Luck <[email protected]> wrote:
> >
> > The MBA Software Controller(mba_sc) is a feedback loop that uses
> > measurements of local memory bandwidth to adjust MBA throttling levels
> > to keep workloads in a resctrl group within a target bandwidth set in
> > the schemata file.
> >
> > Users may want to use total memory bandwidth instead of local to handle
> > workloads that have poor NUMA localization.
> >
> > Add a new mount option "mba_MBps_event={event_name}" where event_name
> > is one of "mbm_Local_bytes" or "mbm_total_bytes" that allows a user to
>
> It's "mbm_local_bytes" in the matching logic later on.

Clearly my left hand operating the shift key is not well synchronized
with my right hand moving from "_" to "l".

Will fix.

> > diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> > index a4f1aa15f0a2..8b9b8f664324 100644
> > --- a/arch/x86/kernel/cpu/resctrl/internal.h
> > +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> > @@ -58,7 +58,8 @@ struct rdt_fs_context {
> > struct kernfs_fs_context kfc;
> > bool enable_cdpl2;
> > bool enable_cdpl3;
> > - bool enable_mba_mbps;
> > + bool enable_mba_mbps_local;
> > + bool enable_mba_mbps_total;
> > bool enable_debug;
> > };
> >
> > diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> > index f136ac046851..d9e590f1cbc3 100644
> > --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> > +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> > @@ -431,9 +431,10 @@ static int __mon_event_count(u32 rmid, struct rmid_read *rr)
> > */
> > static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
> > {
> > - struct mbm_state *m = &rr->d->mbm_local[rmid];
> > u64 cur_bw, bytes, cur_bytes;
> > + struct mbm_state *m;
> >
> > + m = get_mbm_state(rr->d, rmid, rr->evtid);
>
> WARN_ON(m == NULL) since we assume the caller has confirmed rr->evtid
> is an MBM event?

Will add this WARN_ON (though I'll write "WARN_ON(!m);" rather than "== NULL").
>
> > cur_bytes = rr->val;
> > bytes = cur_bytes - m->prev_bw_bytes;
> > m->prev_bw_bytes = cur_bytes;
> > @@ -521,19 +522,21 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> > u32 closid, rmid, cur_msr_val, new_msr_val;
> > struct mbm_state *pmbm_data, *cmbm_data;
> > u32 cur_bw, delta_bw, user_bw;
> > + enum resctrl_event_id evt_id;
> > struct rdt_resource *r_mba;
> > struct rdt_domain *dom_mba;
> > struct list_head *head;
> > struct rdtgroup *entry;
> >
> > - if (!is_mbm_local_enabled())
> > + if (!is_mbm_enabled())
> > return;
> >
> > r_mba = &rdt_resources_all[RDT_RESOURCE_MBA].r_resctrl;
> > + evt_id = r_mba->membw.mba_mbps_event;
> >
> > closid = rgrp->closid;
> > rmid = rgrp->mon.rmid;
> > - pmbm_data = &dom_mbm->mbm_local[rmid];
> > + pmbm_data = get_mbm_state(dom_mbm, rmid, evt_id);
>
> One defensive WARN_ON((!pmbm_data) for this function to ensure evt_id
> is valid for this call and the ones in the loop below?

Will add this. And the WARN_ON(!cmbm_data); in the loop.

> > @@ -2466,8 +2468,12 @@ static int rdt_enable_ctx(struct rdt_fs_context *ctx)
> > goto out_cdpl2;
> > }
> >
> > - if (ctx->enable_mba_mbps) {
> > - ret = set_mba_sc(true);
> > + if (ctx->enable_mba_mbps_local || ctx->enable_mba_mbps_total) {
> > + if (ctx->enable_mba_mbps_total)
> > + mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
> > + else
> > + mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
>
> Total takes precedence over local when the user picks both.

Harmless ... but see below.

> > + ret = set_mba_sc(true, mba_mbps_event);
> > if (ret)
> > goto out_cdpl3;
> > }
> > @@ -2683,15 +2689,17 @@ enum rdt_param {
> > Opt_cdp,
> > Opt_cdpl2,
> > Opt_mba_mbps,
> > + Opt_mba_mbps_event,
> > Opt_debug,
> > nr__rdt_params
> > };
> >
> > static const struct fs_parameter_spec rdt_fs_parameters[] = {
> > - fsparam_flag("cdp", Opt_cdp),
> > - fsparam_flag("cdpl2", Opt_cdpl2),
> > - fsparam_flag("mba_MBps", Opt_mba_mbps),
> > - fsparam_flag("debug", Opt_debug),
> > + fsparam_flag("cdp", Opt_cdp),
> > + fsparam_flag("cdpl2", Opt_cdpl2),
> > + fsparam_flag("mba_MBps", Opt_mba_mbps),
> > + fsparam_string("mba_MBps_event", Opt_mba_mbps_event),
> > + fsparam_flag("debug", Opt_debug),
> > {}
> > };
> >
> > @@ -2715,7 +2723,25 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
> > case Opt_mba_mbps:
> > if (!supports_mba_mbps())
> > return -EINVAL;
> > - ctx->enable_mba_mbps = true;
> > + if (is_mbm_local_enabled())
> > + ctx->enable_mba_mbps_local = true;
> > + else
> > + return -EINVAL;
> > + return 0;
> > + case Opt_mba_mbps_event:
> > + if (!supports_mba_mbps())
> > + return -EINVAL;
> > + if (!strcmp("mbm_local_bytes", param->string)) {
> > + if (!is_mbm_local_enabled())
> > + return -EINVAL;
> > + ctx->enable_mba_mbps_local = true;
> > + } else if (!strcmp("mbm_total_bytes", param->string)) {
> > + if (!is_mbm_total_enabled())
> > + return -EINVAL;
> > + ctx->enable_mba_mbps_total = true;
> > + } else {
> > + return -EINVAL;
>
> It looks like if I pass
> "mba_MBps_event=mbm_total_bytes,mba_MBps_event=mbm_local_bytes" I can
> set both flags true.

That's going to be confusing. I'll add code to stop the user from
passing both options.

> > --
> > 2.41.0
> >
>
> Consider the setting-both-events quirk and a little bit of defensive
> programming for get_mbm_data() returning NULL.
>
> Assuming the case of "Local" is fixed in the commit message:
>
> Reviewed-by: Peter Newman <[email protected]>

Thanks for reviewing, and for the tags for parts 2 & 3.

-Tony

2023-12-08 22:10:10

by Peter Newman

[permalink] [raw]
Subject: Re: [PATCH v6 1/3] x86/resctrl: Add mount option "mba_MBps_event"

On Fri, Dec 8, 2023 at 1:57 PM Tony Luck <[email protected]> wrote:
>
> On Fri, Dec 08, 2023 at 10:17:08AM -0800, Peter Newman wrote:
> > Hi Tony,
> >
> > On Thu, Dec 7, 2023 at 11:56 AM Tony Luck <[email protected]> wrote:
> > > @@ -2715,7 +2723,25 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
> > > case Opt_mba_mbps:
> > > if (!supports_mba_mbps())
> > > return -EINVAL;
> > > - ctx->enable_mba_mbps = true;
> > > + if (is_mbm_local_enabled())
> > > + ctx->enable_mba_mbps_local = true;
> > > + else
> > > + return -EINVAL;
> > > + return 0;
> > > + case Opt_mba_mbps_event:
> > > + if (!supports_mba_mbps())
> > > + return -EINVAL;
> > > + if (!strcmp("mbm_local_bytes", param->string)) {
> > > + if (!is_mbm_local_enabled())
> > > + return -EINVAL;
> > > + ctx->enable_mba_mbps_local = true;
> > > + } else if (!strcmp("mbm_total_bytes", param->string)) {
> > > + if (!is_mbm_total_enabled())
> > > + return -EINVAL;
> > > + ctx->enable_mba_mbps_total = true;
> > > + } else {
> > > + return -EINVAL;
> >
> > It looks like if I pass
> > "mba_MBps_event=mbm_total_bytes,mba_MBps_event=mbm_local_bytes" I can
> > set both flags true.
>
> That's going to be confusing. I'll add code to stop the user from
> passing both options.

Also kind of confusing, after reading the second patch, I realized
"mba_MBps_event=mbm_total_bytes,mba_MBps" also results in both being
set. If you're able to fail the mount operation if both flags somehow
get set, that would address this one too.

-Peter

2023-12-08 22:37:51

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v6 1/3] x86/resctrl: Add mount option "mba_MBps_event"

> > That's going to be confusing. I'll add code to stop the user from
> > passing both options.
>
> Also kind of confusing, after reading the second patch, I realized
> "mba_MBps_event=mbm_total_bytes,mba_MBps" also results in both being
> set. If you're able to fail the mount operation if both flags somehow
> get set, that would address this one too.

Peter,

Yes. That's possible too. I'll cover that case in the next version. I'll wait
until this gets to the top of Reinette's review queue before posting
again.

-Tony

2023-12-12 17:54:55

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v6 1/3] x86/resctrl: Add mount option "mba_MBps_event"


On 12/8/2023 2:09 PM, Peter Newman wrote:
> On Fri, Dec 8, 2023 at 1:57 PM Tony Luck <[email protected]> wrote:
>>
>> On Fri, Dec 08, 2023 at 10:17:08AM -0800, Peter Newman wrote:
>>> Hi Tony,
>>>
>>> On Thu, Dec 7, 2023 at 11:56 AM Tony Luck <[email protected]> wrote:
>>>> @@ -2715,7 +2723,25 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
>>>> case Opt_mba_mbps:
>>>> if (!supports_mba_mbps())
>>>> return -EINVAL;
>>>> - ctx->enable_mba_mbps = true;
>>>> + if (is_mbm_local_enabled())
>>>> + ctx->enable_mba_mbps_local = true;
>>>> + else
>>>> + return -EINVAL;
>>>> + return 0;
>>>> + case Opt_mba_mbps_event:
>>>> + if (!supports_mba_mbps())
>>>> + return -EINVAL;
>>>> + if (!strcmp("mbm_local_bytes", param->string)) {
>>>> + if (!is_mbm_local_enabled())
>>>> + return -EINVAL;
>>>> + ctx->enable_mba_mbps_local = true;
>>>> + } else if (!strcmp("mbm_total_bytes", param->string)) {
>>>> + if (!is_mbm_total_enabled())
>>>> + return -EINVAL;
>>>> + ctx->enable_mba_mbps_total = true;
>>>> + } else {
>>>> + return -EINVAL;
>>>
>>> It looks like if I pass
>>> "mba_MBps_event=mbm_total_bytes,mba_MBps_event=mbm_local_bytes" I can
>>> set both flags true.
>>
>> That's going to be confusing. I'll add code to stop the user from
>> passing both options.
>
> Also kind of confusing, after reading the second patch, I realized
> "mba_MBps_event=mbm_total_bytes,mba_MBps" also results in both being
> set. If you're able to fail the mount operation if both flags somehow
> get set, that would address this one too.

Are two separate flags required? All existing options within struct rdt_fs_context
are of type bool but that does not imply that it is the required type for
all.

Reinette

2023-12-12 18:59:43

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v6 1/3] x86/resctrl: Add mount option "mba_MBps_event"

Hi Tony,

I'm just adding a few minor comments to the bigger items already
mentioned ...

On 12/7/2023 11:56 AM, Tony Luck wrote:

...
> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index 66942d7fba7f..1feb3b2e64fa 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -129,6 +129,7 @@ enum membw_throttle_mode {
> * @throttle_mode: Bandwidth throttling mode when threads request
> * different memory bandwidths
> * @mba_sc: True if MBA software controller(mba_sc) is enabled
> + * @mba_mbps_event: Event (local or total) for mba_sc

This is quite cryptic and does not add more than the variable name and type.
One example could be:
"Monitoring event guiding feedback loop when @mba_sc is true."

Please feel free to improve.

...

> @@ -3780,16 +3806,27 @@ static int rdtgroup_rename(struct kernfs_node *kn,
> return ret;
> }
>
> +static char *mba_sc_event_opt_name(struct rdt_resource *r)

This can be "const char *".

Reinette

2023-12-12 19:00:06

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v6 3/3] x86/resctrl: Add new "mba_MBps_event" mount option to documentation

Hi Tony,

On 12/7/2023 11:56 AM, Tony Luck wrote:
> New mount option may be used to choose a specific memory bandwidth
> monitoring event to feed the MBA Software Controller(mba_sc) feedback
> loop.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> Documentation/arch/x86/resctrl.rst | 16 ++++++++++++++--
> 1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/arch/x86/resctrl.rst b/Documentation/arch/x86/resctrl.rst
> index a6279df64a9d..a0c521db6786 100644
> --- a/Documentation/arch/x86/resctrl.rst
> +++ b/Documentation/arch/x86/resctrl.rst
> @@ -35,7 +35,8 @@ about the feature from resctrl's info directory.
>
> To use the feature mount the file system::
>
> - # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps][,debug]] /sys/fs/resctrl
> + # mount -t resctrl resctrl [-o cdp[,cdpl2][,mba_MBps] \
> + [,mba_MBps_event=[mbm_local_bytes|mbm_total_bytes]][,debug]] /sys/fs/resctrl
>
> mount options are:
>
> @@ -45,7 +46,12 @@ mount options are:
> Enable code/data prioritization in L2 cache allocations.
> "mba_MBps":
> Enable the MBA Software Controller(mba_sc) to specify MBA
> - bandwidth in MBps
> + bandwidth in MBps. Defaults to using MBM local bandwidth,
> + but will use total bandwidth on systems that do not support
> + local bandwidth monitoring.

This document is intended as user documentation. With this perspective
I find the above change to have potential for confusion. The first
(original) sentence is clearly aimed at the user, informing that when this
option is enabled bandwidth allocation can be specified in MBps. The
second sentence jumps into kernel internals mentioning bandwidth monitoring
(user may wonder - what does this have to do with allocation?) without any
transition or context.

> +"mba_MBps_event=[mbm_local_bytes|mbm_total_bytes]":
> + Enable the MBA Software Controller(mba_sc) with a specific
> + MBM event as input to the feedback loop.

I think it should be made obvious what the relationship between the
mba_MBps and mba_MBps_event mount options are. As it is written the
user may believe that both options are needed. The user is also
left needing to interpret cryptic information regarding kernel internals
- for example, note how "feedback loop" does not occur in this document.

> "debug":
> Make debug files accessible. Available debug files are annotated with
> "Available only with debug option".
> @@ -538,6 +544,12 @@ where as user can switch to the "MBA software controller" mode using
> a mount option 'mba_MBps'. The schemata format is specified in the below
> sections.
>
> +By default the software feedback mechanism uses measurement of local
> +memory bandwidth to make adjustments to throttling levels. If a system
> +is running applications with poor NUMA locality users may want to use
> +the "mba_MBps_event=mbm_total_bytes" mount option which will use total
> +memory bandwidth measurements instead of local.

The paragraph is just appended to existing MBps documentation instead of
integrated with existing content. User is left to parse and connect the
different sections attempting to make sense of it all.

Reinette

2023-12-12 20:02:23

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v6 1/3] x86/resctrl: Add mount option "mba_MBps_event"

On Tue, Dec 12, 2023 at 09:54:38AM -0800, Reinette Chatre wrote:
>
> On 12/8/2023 2:09 PM, Peter Newman wrote:
> > On Fri, Dec 8, 2023 at 1:57 PM Tony Luck <[email protected]> wrote:
> >>
> >> On Fri, Dec 08, 2023 at 10:17:08AM -0800, Peter Newman wrote:
> >>> Hi Tony,
> >>>
> >>> On Thu, Dec 7, 2023 at 11:56 AM Tony Luck <[email protected]> wrote:
> >>>> @@ -2715,7 +2723,25 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
> >>>> case Opt_mba_mbps:
> >>>> if (!supports_mba_mbps())
> >>>> return -EINVAL;
> >>>> - ctx->enable_mba_mbps = true;
> >>>> + if (is_mbm_local_enabled())
> >>>> + ctx->enable_mba_mbps_local = true;
> >>>> + else
> >>>> + return -EINVAL;
> >>>> + return 0;
> >>>> + case Opt_mba_mbps_event:
> >>>> + if (!supports_mba_mbps())
> >>>> + return -EINVAL;
> >>>> + if (!strcmp("mbm_local_bytes", param->string)) {
> >>>> + if (!is_mbm_local_enabled())
> >>>> + return -EINVAL;
> >>>> + ctx->enable_mba_mbps_local = true;
> >>>> + } else if (!strcmp("mbm_total_bytes", param->string)) {
> >>>> + if (!is_mbm_total_enabled())
> >>>> + return -EINVAL;
> >>>> + ctx->enable_mba_mbps_total = true;
> >>>> + } else {
> >>>> + return -EINVAL;
> >>>
> >>> It looks like if I pass
> >>> "mba_MBps_event=mbm_total_bytes,mba_MBps_event=mbm_local_bytes" I can
> >>> set both flags true.
> >>
> >> That's going to be confusing. I'll add code to stop the user from
> >> passing both options.
> >
> > Also kind of confusing, after reading the second patch, I realized
> > "mba_MBps_event=mbm_total_bytes,mba_MBps" also results in both being
> > set. If you're able to fail the mount operation if both flags somehow
> > get set, that would address this one too.
>
> Are two separate flags required? All existing options within struct rdt_fs_context
> are of type bool but that does not imply that it is the required type for
> all.

Reinette,

Maybe a flag and a value? The structure becomes:

struct rdt_fs_context {
struct kernfs_fs_context kfc;
bool enable_cdpl2;
bool enable_cdpl3;
bool enable_mba_mbps;
enum resctrl_event_id mba_mbps_event;
bool enable_debug;
};

Mount option parsing (including blocking user from setting the options
multiple times):

case Opt_mba_mbps:
if (!supports_mba_mbps() || ctx->enable_mba_mbps)
return -EINVAL;
if (is_mbm_local_enabled())
ctx->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
else if (is_mbm_total_enabled())
ctx->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
else
return -EINVAL;
ctx->enable_mba_mbps = true;
return 0;
case Opt_mba_mbps_event:
if (!supports_mba_mbps() || ctx->enable_mba_mbps)
return -EINVAL;
if (!strcmp("mbm_local_bytes", param->string))
ctx->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
else if (!strcmp("mbm_total_bytes", param->string))
ctx->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
else
return -EINVAL;
ctx->enable_mba_mbps = true;
return 0;


and use of the options to enable the feature:

if (ctx->enable_mba_mbps) {
r->membw.mba_mbps_event = ctx->mba_mbps_event;
ret = set_mba_sc(true);
if (ret)
goto out_cdpl3;
}

-Tony

2023-12-12 21:44:07

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v6 1/3] x86/resctrl: Add mount option "mba_MBps_event"

Hi Tony,

On 12/12/2023 12:02 PM, Tony Luck wrote:
> On Tue, Dec 12, 2023 at 09:54:38AM -0800, Reinette Chatre wrote:
>>
>> On 12/8/2023 2:09 PM, Peter Newman wrote:
>>> On Fri, Dec 8, 2023 at 1:57 PM Tony Luck <[email protected]> wrote:
>>>>
>>>> On Fri, Dec 08, 2023 at 10:17:08AM -0800, Peter Newman wrote:
>>>>> Hi Tony,
>>>>>
>>>>> On Thu, Dec 7, 2023 at 11:56 AM Tony Luck <[email protected]> wrote:
>>>>>> @@ -2715,7 +2723,25 @@ static int rdt_parse_param(struct fs_context *fc, struct fs_parameter *param)
>>>>>> case Opt_mba_mbps:
>>>>>> if (!supports_mba_mbps())
>>>>>> return -EINVAL;
>>>>>> - ctx->enable_mba_mbps = true;
>>>>>> + if (is_mbm_local_enabled())
>>>>>> + ctx->enable_mba_mbps_local = true;
>>>>>> + else
>>>>>> + return -EINVAL;
>>>>>> + return 0;
>>>>>> + case Opt_mba_mbps_event:
>>>>>> + if (!supports_mba_mbps())
>>>>>> + return -EINVAL;
>>>>>> + if (!strcmp("mbm_local_bytes", param->string)) {
>>>>>> + if (!is_mbm_local_enabled())
>>>>>> + return -EINVAL;
>>>>>> + ctx->enable_mba_mbps_local = true;
>>>>>> + } else if (!strcmp("mbm_total_bytes", param->string)) {
>>>>>> + if (!is_mbm_total_enabled())
>>>>>> + return -EINVAL;
>>>>>> + ctx->enable_mba_mbps_total = true;
>>>>>> + } else {
>>>>>> + return -EINVAL;
>>>>>
>>>>> It looks like if I pass
>>>>> "mba_MBps_event=mbm_total_bytes,mba_MBps_event=mbm_local_bytes" I can
>>>>> set both flags true.
>>>>
>>>> That's going to be confusing. I'll add code to stop the user from
>>>> passing both options.
>>>
>>> Also kind of confusing, after reading the second patch, I realized
>>> "mba_MBps_event=mbm_total_bytes,mba_MBps" also results in both being
>>> set. If you're able to fail the mount operation if both flags somehow
>>> get set, that would address this one too.
>>
>> Are two separate flags required? All existing options within struct rdt_fs_context
>> are of type bool but that does not imply that it is the required type for
>> all.
>
> Reinette,
>
> Maybe a flag and a value? The structure becomes:
>
> struct rdt_fs_context {
> struct kernfs_fs_context kfc;
> bool enable_cdpl2;
> bool enable_cdpl3;
> bool enable_mba_mbps;
> enum resctrl_event_id mba_mbps_event;
> bool enable_debug;
> };

A flag and value would work. This brings the implementation close
to the resource properties. Something that is confusing to me with
this change is the inconsistent naming:

struct rdt_fs_context:
bool enable_mba_mbps
enum resctrl event_id mba_mbps_event

struct resctrl_membw:
bool mba_sc
enum resctrl_event_id mba_mbps_event


The intention with the above naming is not obvious to me. How are
these intended to be viewed?

One option could be to view these as separately representing user
space (struct rdt_fs_context) and kernel space (struct resctrl_membw).
If this is the case then the following naming may be more intuitive:

struct rdt_fs_context:
bool enable_mba_mbps
enum resctrl event_id mba_mbps_event

struct resctrl_membw:
bool mba_sc
enum resctrl_event_id mba_sc_event



>
> Mount option parsing (including blocking user from setting the options
> multiple times):
>
> case Opt_mba_mbps:
> if (!supports_mba_mbps() || ctx->enable_mba_mbps)
> return -EINVAL;

I am not familiar with the API but it seems that invalfc() is available
to communicate a more useful message to user space than the default one
shown in changelog of patch #2.

> if (is_mbm_local_enabled())
> ctx->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
> else if (is_mbm_total_enabled())
> ctx->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
> else
> return -EINVAL;
> ctx->enable_mba_mbps = true;
> return 0;
> case Opt_mba_mbps_event:
> if (!supports_mba_mbps() || ctx->enable_mba_mbps)
> return -EINVAL;
> if (!strcmp("mbm_local_bytes", param->string))
> ctx->mba_mbps_event = QOS_L3_MBM_LOCAL_EVENT_ID;
> else if (!strcmp("mbm_total_bytes", param->string))
> ctx->mba_mbps_event = QOS_L3_MBM_TOTAL_EVENT_ID;
> else
> return -EINVAL;
> ctx->enable_mba_mbps = true;
> return 0;
>
>
> and use of the options to enable the feature:
>
> if (ctx->enable_mba_mbps) {
> r->membw.mba_mbps_event = ctx->mba_mbps_event;
> ret = set_mba_sc(true);
> if (ret)
> goto out_cdpl3;
> }

Since 0 will not be used for an unset/invalid value I expect mba_mbps_event
will not (cannot) be cleared by rdt_disable_ctx(). If this is the case I think
future changes can be supported by expanding the kerneldoc of struct resctrl_membw
to document that "@mba_mbps_event (or @mba_sc_event?) is invalid if @mba_sc
is false".

Reinette

2023-12-13 01:07:47

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v6 1/3] x86/resctrl: Add mount option "mba_MBps_event"

>> case Opt_mba_mbps:
>> if (!supports_mba_mbps() || ctx->enable_mba_mbps)
>> return -EINVAL;
>
> I am not familiar with the API but it seems that invalfc() is available
> to communicate a more useful message to user space than the default one
> shown in changelog of patch #2.

I experimented with invalfc(). It seems to be the answer to this part of the
mount error message:

dmesg(1) may have more information after failed mount system call.

because dmesg(1) does indeed include whatever message that is provided
by the invalfc() call (including automatically adding the prefix "resctrl: ").

I'll work on incorporating your other feedback, but I'm unlikely to get it
all done and tested before the end of this week. I'll be on vacation
the last two weeks of the year. So v7 of this series will have to wait
until 2024.

Thanks

-Tony


2024-01-09 22:00:33

by Luck, Tony

[permalink] [raw]
Subject: [PATCH] x86/resctrl: Implement new MBA_mbps throttling heuristic

The MBA_mbps feedback loop increases throttling when a group is using
more bandwidth than the target set by the user in the schemata file, and
decreases throttling when below target.

To avoid possibly stepping throttling up and down on every poll a
flag "delta_comp" is set whenever throttling is changed to indicate
that the actual change in bandwidth should be recorded on the next
poll in "delta_bw". Throttling is only reduced if the current bandwidth
plus delta_bw is below the user target.

This algorithm works well if the workload has steady bandwidth needs.
But it can go badly wrong if the workload moves to a different phase
just as the throttling level changed. E.g. if the workload becomes
essentially idle right as throttling level is increased, the value
calculated for delta_bw will be more or less the old bandwidth level.
If the workload then resumes, Linux may never reduce throttling because
current bandwidth plu delta_bw is above the target set by the user.

Implement a simpler heuristic by assuming that in the worst case the
currently measured bandwidth is being controlled by the current level of
throttling. Compute how much it may increase if throttling is relaxed to
the next higher level. If that is still below the user target, then it
is ok to reduce the amount of throttling.

Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
Reported-by: Xiaochen Shen <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
---

This patch was previously posted in:

https://lore.kernel.org/lkml/ZVU+L92LRBbJXgXn@agluck-desk3/#t

as part of a proposal to allow the mba_MBps mount option to base its
feedback loop input on total bandwidth instead of local bandwidth.
Sending it now as a standalone patch because Xiaochen reported that
real systems have experienced problems when delta_bw is incorrectly
calculated.

arch/x86/kernel/cpu/resctrl/internal.h | 4 ---
arch/x86/kernel/cpu/resctrl/monitor.c | 42 ++++++--------------------
2 files changed, 10 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..71bbd2245cc7 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -296,14 +296,10 @@ struct rftype {
* struct mbm_state - status for each MBM counter in each domain
* @prev_bw_bytes: Previous bytes value read for bandwidth calculation
* @prev_bw: The most recent bandwidth in MBps
- * @delta_bw: Difference between the current and previous bandwidth
- * @delta_comp: Indicates whether to compute the delta_bw
*/
struct mbm_state {
u64 prev_bw_bytes;
u32 prev_bw;
- u32 delta_bw;
- bool delta_comp;
};

/**
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..1961823b555b 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -440,9 +440,6 @@ static void mbm_bw_count(u32 rmid, struct rmid_read *rr)

cur_bw = bytes / SZ_1M;

- if (m->delta_comp)
- m->delta_bw = abs(cur_bw - m->prev_bw);
- m->delta_comp = false;
m->prev_bw = cur_bw;
}

@@ -520,7 +517,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
- u32 cur_bw, delta_bw, user_bw;
+ u32 cur_bw, user_bw;
struct rdt_resource *r_mba;
struct rdt_domain *dom_mba;
struct list_head *head;
@@ -543,7 +540,6 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)

cur_bw = pmbm_data->prev_bw;
user_bw = dom_mba->mbps_val[closid];
- delta_bw = pmbm_data->delta_bw;

/* MBA resource doesn't support CDP */
cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE);
@@ -555,49 +551,31 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
list_for_each_entry(entry, head, mon.crdtgrp_list) {
cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
cur_bw += cmbm_data->prev_bw;
- delta_bw += cmbm_data->delta_bw;
}

/*
* Scale up/down the bandwidth linearly for the ctrl group. The
* bandwidth step is the bandwidth granularity specified by the
* hardware.
- *
- * The delta_bw is used when increasing the bandwidth so that we
- * dont alternately increase and decrease the control values
- * continuously.
- *
- * For ex: consider cur_bw = 90MBps, user_bw = 100MBps and if
- * bandwidth step is 20MBps(> user_bw - cur_bw), we would keep
- * switching between 90 and 110 continuously if we only check
- * cur_bw < user_bw.
+ * Always increase throttling if current bandwidth is above the
+ * target set by user.
+ * But avoid thrashing up and down on every poll by checking
+ * whether a decrease in throttling is likely to push the group
+ * back over target. E.g. if currently throttling to 30% of bandwidth
+ * on a system with 10% granularity steps, check whether moving to
+ * 40% would go past the limit by multiplying current bandwidth by
+ * "(30 + 10) / 30".
*/
if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
} else if (cur_msr_val < MAX_MBA_BW &&
- (user_bw > (cur_bw + delta_bw))) {
+ (user_bw > (cur_bw * (cur_msr_val + r_mba->membw.min_bw) / cur_msr_val))) {
new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
} else {
return;
}

resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
-
- /*
- * Delta values are updated dynamically package wise for each
- * rdtgrp every time the throttle MSR changes value.
- *
- * This is because (1)the increase in bandwidth is not perfectly
- * linear and only "approximately" linear even when the hardware
- * says it is linear.(2)Also since MBA is a core specific
- * mechanism, the delta values vary based on number of cores used
- * by the rdtgrp.
- */
- pmbm_data->delta_comp = true;
- list_for_each_entry(entry, head, mon.crdtgrp_list) {
- cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
- cmbm_data->delta_comp = true;
- }
}

static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)

base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a
--
2.43.0


2024-01-16 21:01:23

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: Implement new MBA_mbps throttling heuristic

Hi Xiaochen,

On 1/9/2024 2:00 PM, Tony Luck wrote:
> The MBA_mbps feedback loop increases throttling when a group is using
> more bandwidth than the target set by the user in the schemata file, and
> decreases throttling when below target.
>
> To avoid possibly stepping throttling up and down on every poll a
> flag "delta_comp" is set whenever throttling is changed to indicate
> that the actual change in bandwidth should be recorded on the next
> poll in "delta_bw". Throttling is only reduced if the current bandwidth
> plus delta_bw is below the user target.
>
> This algorithm works well if the workload has steady bandwidth needs.
> But it can go badly wrong if the workload moves to a different phase
> just as the throttling level changed. E.g. if the workload becomes
> essentially idle right as throttling level is increased, the value
> calculated for delta_bw will be more or less the old bandwidth level.
> If the workload then resumes, Linux may never reduce throttling because
> current bandwidth plu delta_bw is above the target set by the user.
>
> Implement a simpler heuristic by assuming that in the worst case the
> currently measured bandwidth is being controlled by the current level of
> throttling. Compute how much it may increase if throttling is relaxed to
> the next higher level. If that is still below the user target, then it
> is ok to reduce the amount of throttling.
>
> Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
> Reported-by: Xiaochen Shen <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> ---
>
> This patch was previously posted in:
>
> https://lore.kernel.org/lkml/ZVU+L92LRBbJXgXn@agluck-desk3/#t
>
> as part of a proposal to allow the mba_MBps mount option to base its
> feedback loop input on total bandwidth instead of local bandwidth.
> Sending it now as a standalone patch because Xiaochen reported that
> real systems have experienced problems when delta_bw is incorrectly
> calculated.
>

Does this new heuristic fix the problems you observe? If so, would you be
willing to provide a "Tested-by" tag?

Thank you.

Reinette

2024-01-17 03:37:15

by Xiaochen Shen

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: Implement new MBA_mbps throttling heuristic

Hi Reinette and Tony,

On 1/17/2024 3:55, Reinette Chatre wrote:
> Hi Xiaochen,
>
> On 1/9/2024 2:00 PM, Tony Luck wrote:
>> The MBA_mbps feedback loop increases throttling when a group is using
>> more bandwidth than the target set by the user in the schemata file, and
>> decreases throttling when below target.
>>
>> To avoid possibly stepping throttling up and down on every poll a
>> flag "delta_comp" is set whenever throttling is changed to indicate
>> that the actual change in bandwidth should be recorded on the next
>> poll in "delta_bw". Throttling is only reduced if the current bandwidth
>> plus delta_bw is below the user target.
>>
>> This algorithm works well if the workload has steady bandwidth needs.
>> But it can go badly wrong if the workload moves to a different phase
>> just as the throttling level changed. E.g. if the workload becomes
>> essentially idle right as throttling level is increased, the value
>> calculated for delta_bw will be more or less the old bandwidth level.
>> If the workload then resumes, Linux may never reduce throttling because
>> current bandwidth plu delta_bw is above the target set by the user.
>>
>> Implement a simpler heuristic by assuming that in the worst case the
>> currently measured bandwidth is being controlled by the current level of
>> throttling. Compute how much it may increase if throttling is relaxed to
>> the next higher level. If that is still below the user target, then it
>> is ok to reduce the amount of throttling.
>>
>> Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
>> Reported-by: Xiaochen Shen <[email protected]>
>> Signed-off-by: Tony Luck <[email protected]>
>> ---
>>
>> This patch was previously posted in:
>>
>> https://lore.kernel.org/lkml/ZVU+L92LRBbJXgXn@agluck-desk3/#t
>>
>> as part of a proposal to allow the mba_MBps mount option to base its
>> feedback loop input on total bandwidth instead of local bandwidth.
>> Sending it now as a standalone patch because Xiaochen reported that
>> real systems have experienced problems when delta_bw is incorrectly
>> calculated.
>>
> Does this new heuristic fix the problems you observe? If so, would you be
> willing to provide a "Tested-by" tag?

Yes. The patch fixed the problem. I will reply to the original thread to
add "Tested-by" tag.
Thank you very much for help!

>
> Thank you.
>
> Reinette

Best regards,
Xiaochen


2024-01-17 03:40:31

by Xiaochen Shen

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: Implement new MBA_mbps throttling heuristic

Tested-by: Xiaochen Shen <[email protected]>

On 1/10/2024 6:00, Tony Luck wrote:
> The MBA_mbps feedback loop increases throttling when a group is using
> more bandwidth than the target set by the user in the schemata file, and
> decreases throttling when below target.
>
> To avoid possibly stepping throttling up and down on every poll a
> flag "delta_comp" is set whenever throttling is changed to indicate
> that the actual change in bandwidth should be recorded on the next
> poll in "delta_bw". Throttling is only reduced if the current bandwidth
> plus delta_bw is below the user target.
>
> This algorithm works well if the workload has steady bandwidth needs.
> But it can go badly wrong if the workload moves to a different phase
> just as the throttling level changed. E.g. if the workload becomes
> essentially idle right as throttling level is increased, the value
> calculated for delta_bw will be more or less the old bandwidth level.
> If the workload then resumes, Linux may never reduce throttling because
> current bandwidth plu delta_bw is above the target set by the user.
>
> Implement a simpler heuristic by assuming that in the worst case the
> currently measured bandwidth is being controlled by the current level of
> throttling. Compute how much it may increase if throttling is relaxed to
> the next higher level. If that is still below the user target, then it
> is ok to reduce the amount of throttling.
>
> Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
> Reported-by: Xiaochen Shen <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> ---
>
> This patch was previously posted in:
>
> https://lore.kernel.org/lkml/ZVU+L92LRBbJXgXn@agluck-desk3/#t
>
> as part of a proposal to allow the mba_MBps mount option to base its
> feedback loop input on total bandwidth instead of local bandwidth.
> Sending it now as a standalone patch because Xiaochen reported that
> real systems have experienced problems when delta_bw is incorrectly
> calculated.
>
> arch/x86/kernel/cpu/resctrl/internal.h | 4 ---
> arch/x86/kernel/cpu/resctrl/monitor.c | 42 ++++++--------------------
> 2 files changed, 10 insertions(+), 36 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
> index a4f1aa15f0a2..71bbd2245cc7 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -296,14 +296,10 @@ struct rftype {
> * struct mbm_state - status for each MBM counter in each domain
> * @prev_bw_bytes: Previous bytes value read for bandwidth calculation
> * @prev_bw: The most recent bandwidth in MBps
> - * @delta_bw: Difference between the current and previous bandwidth
> - * @delta_comp: Indicates whether to compute the delta_bw
> */
> struct mbm_state {
> u64 prev_bw_bytes;
> u32 prev_bw;
> - u32 delta_bw;
> - bool delta_comp;
> };
>
> /**
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index f136ac046851..1961823b555b 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -440,9 +440,6 @@ static void mbm_bw_count(u32 rmid, struct rmid_read *rr)
>
> cur_bw = bytes / SZ_1M;
>
> - if (m->delta_comp)
> - m->delta_bw = abs(cur_bw - m->prev_bw);
> - m->delta_comp = false;
> m->prev_bw = cur_bw;
> }
>
> @@ -520,7 +517,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> {
> u32 closid, rmid, cur_msr_val, new_msr_val;
> struct mbm_state *pmbm_data, *cmbm_data;
> - u32 cur_bw, delta_bw, user_bw;
> + u32 cur_bw, user_bw;
> struct rdt_resource *r_mba;
> struct rdt_domain *dom_mba;
> struct list_head *head;
> @@ -543,7 +540,6 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
>
> cur_bw = pmbm_data->prev_bw;
> user_bw = dom_mba->mbps_val[closid];
> - delta_bw = pmbm_data->delta_bw;
>
> /* MBA resource doesn't support CDP */
> cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE);
> @@ -555,49 +551,31 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> list_for_each_entry(entry, head, mon.crdtgrp_list) {
> cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
> cur_bw += cmbm_data->prev_bw;
> - delta_bw += cmbm_data->delta_bw;
> }
>
> /*
> * Scale up/down the bandwidth linearly for the ctrl group. The
> * bandwidth step is the bandwidth granularity specified by the
> * hardware.
> - *
> - * The delta_bw is used when increasing the bandwidth so that we
> - * dont alternately increase and decrease the control values
> - * continuously.
> - *
> - * For ex: consider cur_bw = 90MBps, user_bw = 100MBps and if
> - * bandwidth step is 20MBps(> user_bw - cur_bw), we would keep
> - * switching between 90 and 110 continuously if we only check
> - * cur_bw < user_bw.
> + * Always increase throttling if current bandwidth is above the
> + * target set by user.
> + * But avoid thrashing up and down on every poll by checking
> + * whether a decrease in throttling is likely to push the group
> + * back over target. E.g. if currently throttling to 30% of bandwidth
> + * on a system with 10% granularity steps, check whether moving to
> + * 40% would go past the limit by multiplying current bandwidth by
> + * "(30 + 10) / 30".
> */
> if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
> new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
> } else if (cur_msr_val < MAX_MBA_BW &&
> - (user_bw > (cur_bw + delta_bw))) {
> + (user_bw > (cur_bw * (cur_msr_val + r_mba->membw.min_bw) / cur_msr_val))) {
> new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
> } else {
> return;
> }
>
> resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
> -
> - /*
> - * Delta values are updated dynamically package wise for each
> - * rdtgrp every time the throttle MSR changes value.
> - *
> - * This is because (1)the increase in bandwidth is not perfectly
> - * linear and only "approximately" linear even when the hardware
> - * says it is linear.(2)Also since MBA is a core specific
> - * mechanism, the delta values vary based on number of cores used
> - * by the rdtgrp.
> - */
> - pmbm_data->delta_comp = true;
> - list_for_each_entry(entry, head, mon.crdtgrp_list) {
> - cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
> - cmbm_data->delta_comp = true;
> - }
> }
>
> static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)
>
> base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a

Best regards,
Xiaochen


2024-01-18 00:26:24

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH] x86/resctrl: Implement new MBA_mbps throttling heuristic

Hi Tony,

Please note in the subject and changelog that the mount option
is mba_MBps. The subject and first sentence of changelog seems
to fuse the mount option with the software controller enabled by it,
but I do not think doing so dilutes the message.

On 1/9/2024 2:00 PM, Tony Luck wrote:
> The MBA_mbps feedback loop increases throttling when a group is using
> more bandwidth than the target set by the user in the schemata file, and
> decreases throttling when below target.
>
> To avoid possibly stepping throttling up and down on every poll a
> flag "delta_comp" is set whenever throttling is changed to indicate
> that the actual change in bandwidth should be recorded on the next
> poll in "delta_bw". Throttling is only reduced if the current bandwidth
> plus delta_bw is below the user target.
>
> This algorithm works well if the workload has steady bandwidth needs.
> But it can go badly wrong if the workload moves to a different phase
> just as the throttling level changed. E.g. if the workload becomes
> essentially idle right as throttling level is increased, the value
> calculated for delta_bw will be more or less the old bandwidth level.
> If the workload then resumes, Linux may never reduce throttling because
> current bandwidth plu delta_bw is above the target set by the user.

"plu" -> "plus"

>
> Implement a simpler heuristic by assuming that in the worst case the
> currently measured bandwidth is being controlled by the current level of
> throttling. Compute how much it may increase if throttling is relaxed to
> the next higher level. If that is still below the user target, then it
> is ok to reduce the amount of throttling.
>
> Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
> Reported-by: Xiaochen Shen <[email protected]>
> Signed-off-by: Tony Luck <[email protected]>
> ---

Thank you very much for catching and fixing this.

With the issues ("MBA_mbps" -> "mba_MBps", "plu" -> "plus") addressed:

Reviewed-by: Reinette Chatre <[email protected]>

Reinette

2024-01-18 21:43:03

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v2] x86/resctrl: Implement new mba_MBps throttling heuristic

The mba_MBps feedback loop increases throttling when a group is using
more bandwidth than the target set by the user in the schemata file, and
decreases throttling when below target.

To avoid possibly stepping throttling up and down on every poll a
flag "delta_comp" is set whenever throttling is changed to indicate
that the actual change in bandwidth should be recorded on the next
poll in "delta_bw". Throttling is only reduced if the current bandwidth
plus delta_bw is below the user target.

This algorithm works well if the workload has steady bandwidth needs.
But it can go badly wrong if the workload moves to a different phase
just as the throttling level changed. E.g. if the workload becomes
essentially idle right as throttling level is increased, the value
calculated for delta_bw will be more or less the old bandwidth level.
If the workload then resumes, Linux may never reduce throttling because
current bandwidth plus delta_bw is above the target set by the user.

Implement a simpler heuristic by assuming that in the worst case the
currently measured bandwidth is being controlled by the current level of
throttling. Compute how much it may increase if throttling is relaxed to
the next higher level. If that is still below the user target, then it
is ok to reduce the amount of throttling.

Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
Reported-by: Xiaochen Shen <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Tested-by: Xiaochen Shen <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
---

Changes since v1:
Reinette: Subject & commit comment: ("MBA_mbps" -> "mba_MBps", "plu" -> "plus")

Added Xiaochen's Tested-by
Added Reinette's Reviewed-by

arch/x86/kernel/cpu/resctrl/internal.h | 4 ---
arch/x86/kernel/cpu/resctrl/monitor.c | 42 ++++++--------------------
2 files changed, 10 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..71bbd2245cc7 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -296,14 +296,10 @@ struct rftype {
* struct mbm_state - status for each MBM counter in each domain
* @prev_bw_bytes: Previous bytes value read for bandwidth calculation
* @prev_bw: The most recent bandwidth in MBps
- * @delta_bw: Difference between the current and previous bandwidth
- * @delta_comp: Indicates whether to compute the delta_bw
*/
struct mbm_state {
u64 prev_bw_bytes;
u32 prev_bw;
- u32 delta_bw;
- bool delta_comp;
};

/**
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..1961823b555b 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -440,9 +440,6 @@ static void mbm_bw_count(u32 rmid, struct rmid_read *rr)

cur_bw = bytes / SZ_1M;

- if (m->delta_comp)
- m->delta_bw = abs(cur_bw - m->prev_bw);
- m->delta_comp = false;
m->prev_bw = cur_bw;
}

@@ -520,7 +517,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
- u32 cur_bw, delta_bw, user_bw;
+ u32 cur_bw, user_bw;
struct rdt_resource *r_mba;
struct rdt_domain *dom_mba;
struct list_head *head;
@@ -543,7 +540,6 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)

cur_bw = pmbm_data->prev_bw;
user_bw = dom_mba->mbps_val[closid];
- delta_bw = pmbm_data->delta_bw;

/* MBA resource doesn't support CDP */
cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE);
@@ -555,49 +551,31 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
list_for_each_entry(entry, head, mon.crdtgrp_list) {
cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
cur_bw += cmbm_data->prev_bw;
- delta_bw += cmbm_data->delta_bw;
}

/*
* Scale up/down the bandwidth linearly for the ctrl group. The
* bandwidth step is the bandwidth granularity specified by the
* hardware.
- *
- * The delta_bw is used when increasing the bandwidth so that we
- * dont alternately increase and decrease the control values
- * continuously.
- *
- * For ex: consider cur_bw = 90MBps, user_bw = 100MBps and if
- * bandwidth step is 20MBps(> user_bw - cur_bw), we would keep
- * switching between 90 and 110 continuously if we only check
- * cur_bw < user_bw.
+ * Always increase throttling if current bandwidth is above the
+ * target set by user.
+ * But avoid thrashing up and down on every poll by checking
+ * whether a decrease in throttling is likely to push the group
+ * back over target. E.g. if currently throttling to 30% of bandwidth
+ * on a system with 10% granularity steps, check whether moving to
+ * 40% would go past the limit by multiplying current bandwidth by
+ * "(30 + 10) / 30".
*/
if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
} else if (cur_msr_val < MAX_MBA_BW &&
- (user_bw > (cur_bw + delta_bw))) {
+ (user_bw > (cur_bw * (cur_msr_val + r_mba->membw.min_bw) / cur_msr_val))) {
new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
} else {
return;
}

resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
-
- /*
- * Delta values are updated dynamically package wise for each
- * rdtgrp every time the throttle MSR changes value.
- *
- * This is because (1)the increase in bandwidth is not perfectly
- * linear and only "approximately" linear even when the hardware
- * says it is linear.(2)Also since MBA is a core specific
- * mechanism, the delta values vary based on number of cores used
- * by the rdtgrp.
- */
- pmbm_data->delta_comp = true;
- list_for_each_entry(entry, head, mon.crdtgrp_list) {
- cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
- cmbm_data->delta_comp = true;
- }
}

static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)

base-commit: 0dd3ee31125508cd67f7e7172247f05b7fd1753a
--
2.43.0


2024-01-22 18:10:22

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v2] x86/resctrl: Implement new mba_MBps throttling heuristic

Hi Tony,

On 1/18/2024 1:42 PM, Tony Luck wrote:

> @@ -520,7 +517,7 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
> {
> u32 closid, rmid, cur_msr_val, new_msr_val;
> struct mbm_state *pmbm_data, *cmbm_data;
> - u32 cur_bw, delta_bw, user_bw;
> + u32 cur_bw, user_bw;
> struct rdt_resource *r_mba;
> struct rdt_domain *dom_mba;
> struct list_head *head;

I am sorry Tony, this reverse-fir breakage slipped through. I only noticed it after
trying to see how best to present all the pending work to x86 maintainers. Could
you please send a fixup with this addressed? After that I am planning to propose its
inclusion among all the other pending resctrl fixes followed by the SNC work. The SNC
work will need a rebase starting with this snippet, unless you see an easier way forward?

Reinette

2024-01-22 18:48:30

by Luck, Tony

[permalink] [raw]
Subject: [PATCH v3] x86/resctrl: Implement new mba_MBps throttling heuristic

The mba_MBps feedback loop increases throttling when a group is using
more bandwidth than the target set by the user in the schemata file, and
decreases throttling when below target.

To avoid possibly stepping throttling up and down on every poll a
flag "delta_comp" is set whenever throttling is changed to indicate
that the actual change in bandwidth should be recorded on the next
poll in "delta_bw". Throttling is only reduced if the current bandwidth
plus delta_bw is below the user target.

This algorithm works well if the workload has steady bandwidth needs.
But it can go badly wrong if the workload moves to a different phase
just as the throttling level changed. E.g. if the workload becomes
essentially idle right as throttling level is increased, the value
calculated for delta_bw will be more or less the old bandwidth level.
If the workload then resumes, Linux may never reduce throttling because
current bandwidth plus delta_bw is above the target set by the user.

Implement a simpler heuristic by assuming that in the worst case the
currently measured bandwidth is being controlled by the current level of
throttling. Compute how much it may increase if throttling is relaxed to
the next higher level. If that is still below the user target, then it
is ok to reduce the amount of throttling.

Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
Reported-by: Xiaochen Shen <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Tested-by: Xiaochen Shen <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
---

v2->v3 changes:

Reinette noticed a reverse fir tree breakage in update_mba_bw()

arch/x86/kernel/cpu/resctrl/internal.h | 4 ---
arch/x86/kernel/cpu/resctrl/monitor.c | 42 ++++++--------------------
2 files changed, 10 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index a4f1aa15f0a2..71bbd2245cc7 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -296,14 +296,10 @@ struct rftype {
* struct mbm_state - status for each MBM counter in each domain
* @prev_bw_bytes: Previous bytes value read for bandwidth calculation
* @prev_bw: The most recent bandwidth in MBps
- * @delta_bw: Difference between the current and previous bandwidth
- * @delta_comp: Indicates whether to compute the delta_bw
*/
struct mbm_state {
u64 prev_bw_bytes;
u32 prev_bw;
- u32 delta_bw;
- bool delta_comp;
};

/**
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index f136ac046851..6c2d4e97e5f1 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -440,9 +440,6 @@ static void mbm_bw_count(u32 rmid, struct rmid_read *rr)

cur_bw = bytes / SZ_1M;

- if (m->delta_comp)
- m->delta_bw = abs(cur_bw - m->prev_bw);
- m->delta_comp = false;
m->prev_bw = cur_bw;
}

@@ -520,11 +517,11 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
- u32 cur_bw, delta_bw, user_bw;
struct rdt_resource *r_mba;
struct rdt_domain *dom_mba;
struct list_head *head;
struct rdtgroup *entry;
+ u32 cur_bw, user_bw;

if (!is_mbm_local_enabled())
return;
@@ -543,7 +540,6 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)

cur_bw = pmbm_data->prev_bw;
user_bw = dom_mba->mbps_val[closid];
- delta_bw = pmbm_data->delta_bw;

/* MBA resource doesn't support CDP */
cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE);
@@ -555,49 +551,31 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
list_for_each_entry(entry, head, mon.crdtgrp_list) {
cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
cur_bw += cmbm_data->prev_bw;
- delta_bw += cmbm_data->delta_bw;
}

/*
* Scale up/down the bandwidth linearly for the ctrl group. The
* bandwidth step is the bandwidth granularity specified by the
* hardware.
- *
- * The delta_bw is used when increasing the bandwidth so that we
- * dont alternately increase and decrease the control values
- * continuously.
- *
- * For ex: consider cur_bw = 90MBps, user_bw = 100MBps and if
- * bandwidth step is 20MBps(> user_bw - cur_bw), we would keep
- * switching between 90 and 110 continuously if we only check
- * cur_bw < user_bw.
+ * Always increase throttling if current bandwidth is above the
+ * target set by user.
+ * But avoid thrashing up and down on every poll by checking
+ * whether a decrease in throttling is likely to push the group
+ * back over target. E.g. if currently throttling to 30% of bandwidth
+ * on a system with 10% granularity steps, check whether moving to
+ * 40% would go past the limit by multiplying current bandwidth by
+ * "(30 + 10) / 30".
*/
if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
} else if (cur_msr_val < MAX_MBA_BW &&
- (user_bw > (cur_bw + delta_bw))) {
+ (user_bw > (cur_bw * (cur_msr_val + r_mba->membw.min_bw) / cur_msr_val))) {
new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
} else {
return;
}

resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
-
- /*
- * Delta values are updated dynamically package wise for each
- * rdtgrp every time the throttle MSR changes value.
- *
- * This is because (1)the increase in bandwidth is not perfectly
- * linear and only "approximately" linear even when the hardware
- * says it is linear.(2)Also since MBA is a core specific
- * mechanism, the delta values vary based on number of cores used
- * by the rdtgrp.
- */
- pmbm_data->delta_comp = true;
- list_for_each_entry(entry, head, mon.crdtgrp_list) {
- cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
- cmbm_data->delta_comp = true;
- }
}

static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)

base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
--
2.43.0


2024-01-22 18:59:26

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v2] x86/resctrl: Implement new mba_MBps throttling heuristic

Hi Tony,

On 1/22/2024 10:07 AM, Luck, Tony wrote:
>> I am sorry Tony, this reverse-fir breakage slipped through. I only noticed it after
>> trying to see how best to present all the pending work to x86 maintainers. Could
>> you please send a fixup with this addressed? After that I am planning to propose its
>> inclusion among all the other pending resctrl fixes followed by the SNC work. The SNC
>> work will need a rebase starting with this snippet, unless you see an easier way forward?
>
> Reinette,
>
> Thanks for catching this. I can send v3 of this patch right away (just fixing fir tree order only
> needs a compile test, not extensive regression tests).

Thank you very much Tony.

>
> What are "all the other pending resctrl" fixes? Since the SNC series is invasive to so many
> files & functions, I'll need to re-base SNC on top of everything else that's happening this
> cycle.

The fixes I have as pending are:

[PATCH v2] x86/resctrl: Fix unused variable warning in cache_alloc_hsw_probe()
https://lore.kernel.org/lkml/ZULCd%2FTGJL9Dmncf@agluck-desk3/

[PATCH v5 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit
https://lore.kernel.org/lkml/c26a8ca79d399ed076cf8bf2e9fbc58048808289.1705359148.git.babu.moger@amd.com/

[PATCH v5 2/2] x86/resctrl: Read supported bandwidth sources using CPUID command
https://lore.kernel.org/lkml/669896fa512c7451319fa5ca2fdb6f7e015b5635.1705359148.git.babu.moger@amd.com/

[PATCH v3] x86/resctrl: Implement new mba_MBps throttling heuristic
https://lore.kernel.org/lkml/[email protected]/


Reinette




2024-01-22 19:05:59

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v2] x86/resctrl: Implement new mba_MBps throttling heuristic

> I am sorry Tony, this reverse-fir breakage slipped through. I only noticed it after
> trying to see how best to present all the pending work to x86 maintainers. Could
> you please send a fixup with this addressed? After that I am planning to propose its
> inclusion among all the other pending resctrl fixes followed by the SNC work. The SNC
> work will need a rebase starting with this snippet, unless you see an easier way forward?

Reinette,

Thanks for catching this. I can send v3 of this patch right away (just fixing fir tree order only
needs a compile test, not extensive regression tests).

What are "all the other pending resctrl" fixes? Since the SNC series is invasive to so many
files & functions, I'll need to re-base SNC on top of everything else that's happening this
cycle.

-Tony

2024-01-22 19:06:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2] x86/resctrl: Implement new mba_MBps throttling heuristic

On Mon, Jan 22, 2024 at 10:18:01AM -0800, Reinette Chatre wrote:
> The fixes I have as pending are:
>
> [PATCH v2] x86/resctrl: Fix unused variable warning in cache_alloc_hsw_probe()
> https://lore.kernel.org/lkml/ZULCd%2FTGJL9Dmncf@agluck-desk3/
>
> [PATCH v5 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit
> https://lore.kernel.org/lkml/c26a8ca79d399ed076cf8bf2e9fbc58048808289.1705359148.git.babu.moger@amd.com/
>
> [PATCH v5 2/2] x86/resctrl: Read supported bandwidth sources using CPUID command
> https://lore.kernel.org/lkml/669896fa512c7451319fa5ca2fdb6f7e015b5635.1705359148.git.babu.moger@amd.com/
>
> [PATCH v3] x86/resctrl: Implement new mba_MBps throttling heuristic
> https://lore.kernel.org/lkml/[email protected]/

Is that the order I should start queueing them in?

If not, pls send me a note and I can start picking up stuff.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-01-22 19:27:40

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v2] x86/resctrl: Implement new mba_MBps throttling heuristic

Hi Boris,

On 1/22/2024 10:21 AM, Borislav Petkov wrote:
> On Mon, Jan 22, 2024 at 10:18:01AM -0800, Reinette Chatre wrote:
>> The fixes I have as pending are:
>>
>> [PATCH v2] x86/resctrl: Fix unused variable warning in cache_alloc_hsw_probe()
>> https://lore.kernel.org/lkml/ZULCd%2FTGJL9Dmncf@agluck-desk3/
>>
>> [PATCH v5 1/2] x86/resctrl: Remove hard-coded memory bandwidth limit
>> https://lore.kernel.org/lkml/c26a8ca79d399ed076cf8bf2e9fbc58048808289.1705359148.git.babu.moger@amd.com/
>>
>> [PATCH v5 2/2] x86/resctrl: Read supported bandwidth sources using CPUID command
>> https://lore.kernel.org/lkml/669896fa512c7451319fa5ca2fdb6f7e015b5635.1705359148.git.babu.moger@amd.com/
>>
>> [PATCH v3] x86/resctrl: Implement new mba_MBps throttling heuristic
>> https://lore.kernel.org/lkml/[email protected]/
>
> Is that the order I should start queueing them in?

Thank you for noting this exchange. Yes, this order applies cleanly.
This is the list of pending resctrl fixes. There is also one
pending resctrl feature, the SNC work [1], that Tony plans to rebase
on these fixes.

As a side note: I did have a conversation with James about the
MPAM work and his expectation is that the SNC work goes first.

> If not, pls send me a note and I can start picking up stuff.

I was working on a separate note to you but we can continue discussing
the pending resctrl work here.

Reinette

[1] https://lore.kernel.org/lkml/[email protected]/


2024-01-22 19:48:09

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2] x86/resctrl: Implement new mba_MBps throttling heuristic

On Mon, Jan 22, 2024 at 10:41:01AM -0800, Reinette Chatre wrote:
> Thank you for noting this exchange. Yes, this order applies cleanly.
> This is the list of pending resctrl fixes.

Ok, if some of the fixes need to go to Linus now, lemme know.

> There is also one pending resctrl feature, the SNC work [1], that Tony
> plans to rebase on these fixes.

Ack.

> As a side note: I did have a conversation with James about the
> MPAM work and his expectation is that the SNC work goes first.

Ok.

> I was working on a separate note to you but we can continue discussing
> the pending resctrl work here.

Separate note is fine too - however you prefer. I'll start going through
those in the coming days and we can take it from there.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-01-22 20:59:03

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v2] x86/resctrl: Implement new mba_MBps throttling heuristic

>> There is also one pending resctrl feature, the SNC work [1], that Tony
>> plans to rebase on these fixes.
>
> Ack.

It looks like there is just one change in part 0004/0008 that doesn't apply automatically from git am.
Super easy to resolve.

I need to grab an SNC system to re-check that everything still works when re-based.
But right now, this looks like adding the SNC series will be easy (famous last words!).

-Tony

2024-01-23 12:12:58

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v2] x86/resctrl: Implement new mba_MBps throttling heuristic

Hi Tony,

On 22/01/2024 20:58, Luck, Tony wrote:
>>> There is also one pending resctrl feature, the SNC work [1], that Tony
>>> plans to rebase on these fixes.
>>
>> Ack.
>
> It looks like there is just one change in part 0004/0008 that doesn't apply automatically from git am.
> Super easy to resolve.
>
> I need to grab an SNC system to re-check that everything still works when re-based.
> But right now, this looks like adding the SNC series will be easy (famous last words!).

Once you're ready - can you point me at something I can use as a stable branch to rebase onto?


Thanks,

James

2024-01-23 17:08:28

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCH v2] x86/resctrl: Implement new mba_MBps throttling heuristic

>> I need to grab an SNC system to re-check that everything still works when re-based.
>> But right now, this looks like adding the SNC series will be easy (famous last words!).
>
> Once you're ready - can you point me at something I can use as a stable branch to rebase onto?

I'll be running the tests this afternoon. But I'm somewhat confident that they will pass.

Tree is here: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git branch "hopeful_snc"

Commits are:

793635f10aeb x86/resctrl: Update documentation with Sub-NUMA cluster changes
dca7ba785d6f x86/resctrl: Sub NUMA Cluster detection and enable
e41bd88101c8 x86/resctrl: Introduce snc_nodes_per_l3_cache
ccbab7875197 x86/resctrl: Add node-scope to the options for feature scope
2f0db8c7072b x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
22697627cc5f x86/resctrl: Prepare for different scope for control/monitor operations
f3ff831a9042 x86/resctrl: Prepare to split rdt_domain structure
c0679868ee78 x86/resctrl: Prepare for new domain scope
327b4394f309 x86/resctrl: Implement new mba_MBps throttling heuristic
5b14817cf87e x86/resctrl: Read supported bandwidth sources using CPUID command
5699cd082e1f x86/resctrl: Remove hard-coded memory bandwidth limit
1b908debf53f x86/resctrl: Fix unused variable warning in cache_alloc_hsw_probe()

Only the last of these (1b908debf53f) had been accepted by Boris into TIP x86/cache
when I cut this tree yesterday.

Two more applied by Boris today without any changes from what is in my tree, but the
commit IDs in TIP are obviously different from in my tree.

54e35eb8611c x86/resctrl: Read supported bandwidth sources from CPUID
0976783bb123 x86/resctrl: Remove hard-coded memory bandwidth limit

All the others have the normal risk that Boris will find some way to make them better.

-Tony

2024-01-24 00:30:00

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v2] x86/resctrl: Implement new mba_MBps throttling heuristic

On Tue, Jan 23, 2024 at 05:07:50PM +0000, Luck, Tony wrote:
> >> I need to grab an SNC system to re-check that everything still works when re-based.
> >> But right now, this looks like adding the SNC series will be easy (famous last words!).
> >
> > Once you're ready - can you point me at something I can use as a stable branch to rebase onto?
>
> I'll be running the tests this afternoon. But I'm somewhat confident that they will pass.
>
> Tree is here: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git branch "hopeful_snc"
>
> Commits are:
>
> 793635f10aeb x86/resctrl: Update documentation with Sub-NUMA cluster changes
> dca7ba785d6f x86/resctrl: Sub NUMA Cluster detection and enable
> e41bd88101c8 x86/resctrl: Introduce snc_nodes_per_l3_cache
> ccbab7875197 x86/resctrl: Add node-scope to the options for feature scope
> 2f0db8c7072b x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
> 22697627cc5f x86/resctrl: Prepare for different scope for control/monitor operations
> f3ff831a9042 x86/resctrl: Prepare to split rdt_domain structure
> c0679868ee78 x86/resctrl: Prepare for new domain scope
> 327b4394f309 x86/resctrl: Implement new mba_MBps throttling heuristic
> 5b14817cf87e x86/resctrl: Read supported bandwidth sources using CPUID command
> 5699cd082e1f x86/resctrl: Remove hard-coded memory bandwidth limit
> 1b908debf53f x86/resctrl: Fix unused variable warning in cache_alloc_hsw_probe()
>
> Only the last of these (1b908debf53f) had been accepted by Boris into TIP x86/cache
> when I cut this tree yesterday.
>
> Two more applied by Boris today without any changes from what is in my tree, but the
> commit IDs in TIP are obviously different from in my tree.
>
> 54e35eb8611c x86/resctrl: Read supported bandwidth sources from CPUID
> 0976783bb123 x86/resctrl: Remove hard-coded memory bandwidth limit
>
> All the others have the normal risk that Boris will find some way to make them better.

All my SNC tests pass. So this tree is as good as I can make it.

Boris: After you take/reject "x86/resctrl: Implement new mba_MBps throttling heuristic"
in TIP x86/cache I will post v14 of the SNC series rebased to that TIP
branch.

-Tony

Subject: [tip: x86/cache] x86/resctrl: Implement new mba_MBps throttling heuristic

The following commit has been merged into the x86/cache branch of tip:

Commit-ID: c2427e70c1630d98966375fffc2b713ab9768a94
Gitweb: https://git.kernel.org/tip/c2427e70c1630d98966375fffc2b713ab9768a94
Author: Tony Luck <[email protected]>
AuthorDate: Mon, 22 Jan 2024 10:08:07 -08:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Wed, 24 Jan 2024 11:32:01 +01:00

x86/resctrl: Implement new mba_MBps throttling heuristic

The mba_MBps feedback loop increases throttling when a group is using
more bandwidth than the target set by the user in the schemata file, and
decreases throttling when below target.

To avoid possibly stepping throttling up and down on every poll a flag
"delta_comp" is set whenever throttling is changed to indicate that the
actual change in bandwidth should be recorded on the next poll in
"delta_bw". Throttling is only reduced if the current bandwidth plus
delta_bw is below the user target.

This algorithm works well if the workload has steady bandwidth needs.
But it can go badly wrong if the workload moves to a different phase
just as the throttling level changed. E.g. if the workload becomes
essentially idle right as throttling level is increased, the value
calculated for delta_bw will be more or less the old bandwidth level.
If the workload then resumes, Linux may never reduce throttling because
current bandwidth plus delta_bw is above the target set by the user.

Implement a simpler heuristic by assuming that in the worst case the
currently measured bandwidth is being controlled by the current level of
throttling. Compute how much it may increase if throttling is relaxed to
the next higher level. If that is still below the user target, then it
is ok to reduce the amount of throttling.

Fixes: ba0f26d8529c ("x86/intel_rdt/mba_sc: Prepare for feedback loop")
Reported-by: Xiaochen Shen <[email protected]>
Signed-off-by: Tony Luck <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Reviewed-by: Reinette Chatre <[email protected]>
Tested-by: Xiaochen Shen <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/cpu/resctrl/internal.h | 4 +--
arch/x86/kernel/cpu/resctrl/monitor.c | 42 +++++--------------------
2 files changed, 10 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index e3dc35a..52e7e7d 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -295,14 +295,10 @@ struct rftype {
* struct mbm_state - status for each MBM counter in each domain
* @prev_bw_bytes: Previous bytes value read for bandwidth calculation
* @prev_bw: The most recent bandwidth in MBps
- * @delta_bw: Difference between the current and previous bandwidth
- * @delta_comp: Indicates whether to compute the delta_bw
*/
struct mbm_state {
u64 prev_bw_bytes;
u32 prev_bw;
- u32 delta_bw;
- bool delta_comp;
};

/**
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index acca577..3a6c069 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -440,9 +440,6 @@ static void mbm_bw_count(u32 rmid, struct rmid_read *rr)

cur_bw = bytes / SZ_1M;

- if (m->delta_comp)
- m->delta_bw = abs(cur_bw - m->prev_bw);
- m->delta_comp = false;
m->prev_bw = cur_bw;
}

@@ -520,11 +517,11 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
{
u32 closid, rmid, cur_msr_val, new_msr_val;
struct mbm_state *pmbm_data, *cmbm_data;
- u32 cur_bw, delta_bw, user_bw;
struct rdt_resource *r_mba;
struct rdt_domain *dom_mba;
struct list_head *head;
struct rdtgroup *entry;
+ u32 cur_bw, user_bw;

if (!is_mbm_local_enabled())
return;
@@ -543,7 +540,6 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)

cur_bw = pmbm_data->prev_bw;
user_bw = dom_mba->mbps_val[closid];
- delta_bw = pmbm_data->delta_bw;

/* MBA resource doesn't support CDP */
cur_msr_val = resctrl_arch_get_config(r_mba, dom_mba, closid, CDP_NONE);
@@ -555,49 +551,31 @@ static void update_mba_bw(struct rdtgroup *rgrp, struct rdt_domain *dom_mbm)
list_for_each_entry(entry, head, mon.crdtgrp_list) {
cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
cur_bw += cmbm_data->prev_bw;
- delta_bw += cmbm_data->delta_bw;
}

/*
* Scale up/down the bandwidth linearly for the ctrl group. The
* bandwidth step is the bandwidth granularity specified by the
* hardware.
- *
- * The delta_bw is used when increasing the bandwidth so that we
- * dont alternately increase and decrease the control values
- * continuously.
- *
- * For ex: consider cur_bw = 90MBps, user_bw = 100MBps and if
- * bandwidth step is 20MBps(> user_bw - cur_bw), we would keep
- * switching between 90 and 110 continuously if we only check
- * cur_bw < user_bw.
+ * Always increase throttling if current bandwidth is above the
+ * target set by user.
+ * But avoid thrashing up and down on every poll by checking
+ * whether a decrease in throttling is likely to push the group
+ * back over target. E.g. if currently throttling to 30% of bandwidth
+ * on a system with 10% granularity steps, check whether moving to
+ * 40% would go past the limit by multiplying current bandwidth by
+ * "(30 + 10) / 30".
*/
if (cur_msr_val > r_mba->membw.min_bw && user_bw < cur_bw) {
new_msr_val = cur_msr_val - r_mba->membw.bw_gran;
} else if (cur_msr_val < MAX_MBA_BW &&
- (user_bw > (cur_bw + delta_bw))) {
+ (user_bw > (cur_bw * (cur_msr_val + r_mba->membw.min_bw) / cur_msr_val))) {
new_msr_val = cur_msr_val + r_mba->membw.bw_gran;
} else {
return;
}

resctrl_arch_update_one(r_mba, dom_mba, closid, CDP_NONE, new_msr_val);
-
- /*
- * Delta values are updated dynamically package wise for each
- * rdtgrp every time the throttle MSR changes value.
- *
- * This is because (1)the increase in bandwidth is not perfectly
- * linear and only "approximately" linear even when the hardware
- * says it is linear.(2)Also since MBA is a core specific
- * mechanism, the delta values vary based on number of cores used
- * by the rdtgrp.
- */
- pmbm_data->delta_comp = true;
- list_for_each_entry(entry, head, mon.crdtgrp_list) {
- cmbm_data = &dom_mbm->mbm_local[entry->mon.rmid];
- cmbm_data->delta_comp = true;
- }
}

static void mbm_update(struct rdt_resource *r, struct rdt_domain *d, int rmid)

2024-01-25 17:29:56

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCH v2] x86/resctrl: Implement new mba_MBps throttling heuristic

On Tue, Jan 23, 2024 at 04:29:42PM -0800, Tony Luck wrote:
> On Tue, Jan 23, 2024 at 05:07:50PM +0000, Luck, Tony wrote:
> > >> I need to grab an SNC system to re-check that everything still works when re-based.
> > >> But right now, this looks like adding the SNC series will be easy (famous last words!).
> > >
> > > Once you're ready - can you point me at something I can use as a stable branch to rebase onto?
> >
> > I'll be running the tests this afternoon. But I'm somewhat confident that they will pass.
> >
> > Tree is here: git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux.git branch "hopeful_snc"
> >
> > Commits are:
> >
> > 793635f10aeb x86/resctrl: Update documentation with Sub-NUMA cluster changes
> > dca7ba785d6f x86/resctrl: Sub NUMA Cluster detection and enable
> > e41bd88101c8 x86/resctrl: Introduce snc_nodes_per_l3_cache
> > ccbab7875197 x86/resctrl: Add node-scope to the options for feature scope
> > 2f0db8c7072b x86/resctrl: Split the rdt_domain and rdt_hw_domain structures
> > 22697627cc5f x86/resctrl: Prepare for different scope for control/monitor operations
> > f3ff831a9042 x86/resctrl: Prepare to split rdt_domain structure
> > c0679868ee78 x86/resctrl: Prepare for new domain scope
> > 327b4394f309 x86/resctrl: Implement new mba_MBps throttling heuristic
> > 5b14817cf87e x86/resctrl: Read supported bandwidth sources using CPUID command
> > 5699cd082e1f x86/resctrl: Remove hard-coded memory bandwidth limit
> > 1b908debf53f x86/resctrl: Fix unused variable warning in cache_alloc_hsw_probe()
> >
> > Only the last of these (1b908debf53f) had been accepted by Boris into TIP x86/cache
> > when I cut this tree yesterday.
> >
> > Two more applied by Boris today without any changes from what is in my tree, but the
> > commit IDs in TIP are obviously different from in my tree.
> >
> > 54e35eb8611c x86/resctrl: Read supported bandwidth sources from CPUID
> > 0976783bb123 x86/resctrl: Remove hard-coded memory bandwidth limit
> >
> > All the others have the normal risk that Boris will find some way to make them better.
>
> All my SNC tests pass. So this tree is as good as I can make it.
>
> Boris: After you take/reject "x86/resctrl: Implement new mba_MBps throttling heuristic"
> in TIP x86/cache I will post v14 of the SNC series rebased to that TIP
> branch.

Boris,

I got the tip bot message that you have applied:

c2427e70c163 ("x86/resctrl: Implement new mba_MBps throttling heuristic")

I see that you also applied:

fc747eebef73 ("x86/resctrl: Remove redundant variable in mbm_config_write_domain()")

Does that empty your resctrl queue except for the SNC series? If it
does, then I'll rebase SNC patches onto TIP x86/cache.

-Tony