2023-01-13 18:05:11

by James Morse

Subject: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI

x86 is blessed with an abundance of monitors, one per RMID, that can be
read from any CPU in the domain. MPAM's monitors reside in the MMIO MSC,
and the number implemented is up to the manufacturer. This means when there
are fewer monitors than needed, they need to be allocated and freed.

Worse, the domain may be broken up into slices, and the MMIO accesses
for each slice may need performing from different CPUs.

These two details mean MPAM's monitor code needs to be able to sleep, and
IPI another CPU in the domain to read from a resource that has been sliced.

mon_event_read() already invokes mon_event_count() via IPI, which means
this isn't possible.

Change mon_event_read() to schedule mon_event_count() on a remote CPU and
wait, instead of sending an IPI. This function is only used in response to
a user-space filesystem request (not the timing sensitive overflow code).

This allows MPAM to hide the slice behaviour from resctrl, and to keep
the monitor-allocation in monitor.c.

Tested-by: Shaopeng Tan <[email protected]>
Signed-off-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 7 +++++--
arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++--
3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
index 1df0e3262bca..4ee3da6dced7 100644
--- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
+++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
@@ -532,8 +532,11 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain *d, struct rdtgroup *rdtgrp,
int evtid, int first)
{
+ /* When picking a cpu from cpu_mask, ensure it can't race with cpuhp */
+ lockdep_assert_held(&rdtgroup_mutex);
+
/*
- * setup the parameters to send to the IPI to read the data.
+ * setup the parameters to pass to mon_event_count() to read the data.
*/
rr->rgrp = rdtgrp;
rr->evtid = evtid;
@@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
rr->val = 0;
rr->first = first;

- smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
+ smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr, false);
}

int rdtgroup_mondata_show(struct seq_file *m, void *arg)
diff --git a/arch/x86/kernel/cpu/resctrl/internal.h b/arch/x86/kernel/cpu/resctrl/internal.h
index adae6231324f..1f90a10b75a1 100644
--- a/arch/x86/kernel/cpu/resctrl/internal.h
+++ b/arch/x86/kernel/cpu/resctrl/internal.h
@@ -514,7 +514,7 @@ void closid_free(int closid);
int alloc_rmid(u32 closid);
void free_rmid(u32 closid, u32 rmid);
int rdt_get_mon_l3_config(struct rdt_resource *r);
-void mon_event_count(void *info);
+int mon_event_count(void *info);
int rdtgroup_mondata_show(struct seq_file *m, void *arg);
void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
struct rdt_domain *d, struct rdtgroup *rdtgrp,
diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 190ac183139e..d309b830aeb2 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -509,10 +509,10 @@ static void mbm_bw_count(u32 closid, u32 rmid, struct rmid_read *rr)
}

/*
- * This is called via IPI to read the CQM/MBM counters
+ * This is scheduled by mon_event_read() to read the CQM/MBM counters
* on a domain.
*/
-void mon_event_count(void *info)
+int mon_event_count(void *info)
{
struct rdtgroup *rdtgrp, *entry;
struct rmid_read *rr = info;
@@ -545,6 +545,8 @@ void mon_event_count(void *info)
*/
if (ret == 0)
rr->err = 0;
+
+ return 0;
}

/*
--
2.30.2


2023-01-17 19:41:46

by Fenghua Yu

Subject: RE: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI

Hi, James,

> x86 is blessed with an abundance of monitors, one per RMID, that can be read
> from any CPU in the domain. MPAMs monitors reside in the MMIO MSC, the
> number implemented is up to the manufacturer. This means when there are
> fewer monitors than needed, they need to be allocated and freed.
>
> Worse, the domain may be broken up into slices, and the MMIO accesses for
> each slice may need performing from different CPUs.
>
> These two details mean MPAMs monitor code needs to be able to sleep, and IPI
> another CPU in the domain to read from a resource that has been sliced.
>
> mon_event_read() already invokes mon_event_count() via IPI, which means this
> isn't possible.
>
> Change mon_event_read() to schedule mon_event_count() on a remote CPU
> and wait, instead of sending an IPI. This function is only used in response to a
> user-space filesystem request (not the timing sensitive overflow code).

But mkdir mon group needs mon_event_count() to reset RMID state.
If mon_event_count() is called much later, the RMID state may be used
before it's reset. E.g. prev_msr might be non-0 value. That will cause
overflow code failure.

Seems this may happen on both x86 and arm64. At least need to make sure
RMID state reset happens before it's used.

>
> This allows MPAM to hide the slice behaviour from resctrl, and to keep the
> monitor-allocation in monitor.c.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 7 +++++--
> arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
> arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++--
> 3 files changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 1df0e3262bca..4ee3da6dced7 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -532,8 +532,11 @@ void mon_event_read(struct rmid_read *rr, struct
> rdt_resource *r,
> struct rdt_domain *d, struct rdtgroup *rdtgrp,
> int evtid, int first)
> {
> + /* When picking a cpu from cpu_mask, ensure it can't race with cpuhp
> */
> + lockdep_assert_held(&rdtgroup_mutex);
> +
> /*
> - * setup the parameters to send to the IPI to read the data.
> + * setup the parameters to pass to mon_event_count() to read the data.
> */
> rr->rgrp = rdtgrp;
> rr->evtid = evtid;
> @@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct
> rdt_resource *r,
> rr->val = 0;
> rr->first = first;
>
> - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
> + smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr,
> +false);
> }
>
> int rdtgroup_mondata_show(struct seq_file *m, void *arg) diff --git
> a/arch/x86/kernel/cpu/resctrl/internal.h
> b/arch/x86/kernel/cpu/resctrl/internal.h
> index adae6231324f..1f90a10b75a1 100644
> --- a/arch/x86/kernel/cpu/resctrl/internal.h
> +++ b/arch/x86/kernel/cpu/resctrl/internal.h
> @@ -514,7 +514,7 @@ void closid_free(int closid); int alloc_rmid(u32 closid);
> void free_rmid(u32 closid, u32 rmid); int rdt_get_mon_l3_config(struct
> rdt_resource *r); -void mon_event_count(void *info);
> +int mon_event_count(void *info);
> int rdtgroup_mondata_show(struct seq_file *m, void *arg); void
> mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> struct rdt_domain *d, struct rdtgroup *rdtgrp, diff --git
> a/arch/x86/kernel/cpu/resctrl/monitor.c
> b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 190ac183139e..d309b830aeb2 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -509,10 +509,10 @@ static void mbm_bw_count(u32 closid, u32 rmid,
> struct rmid_read *rr) }
>
> /*
> - * This is called via IPI to read the CQM/MBM counters
> + * This is scheduled by mon_event_read() to read the CQM/MBM counters
> * on a domain.
> */
> -void mon_event_count(void *info)
> +int mon_event_count(void *info)
> {
> struct rdtgroup *rdtgrp, *entry;
> struct rmid_read *rr = info;
> @@ -545,6 +545,8 @@ void mon_event_count(void *info)
> */
> if (ret == 0)
> rr->err = 0;
> +
> + return 0;
> }
>
> /*
> --
> 2.30.2

Thanks.

-Fenghua

2023-02-02 23:48:44

by Reinette Chatre

Subject: Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI

Hi James,

On 1/13/2023 9:54 AM, James Morse wrote:
> x86 is blessed with an abundance of monitors, one per RMID, that can be
> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
> the number implemented is up to the manufacturer. This means when there are
> fewer monitors than needed, they need to be allocated and freed.
>
> Worse, the domain may be broken up into slices, and the MMIO accesses
> for each slice may need performing from different CPUs.
>
> These two details mean MPAMs monitor code needs to be able to sleep, and
> IPI another CPU in the domain to read from a resource that has been sliced.
>
> mon_event_read() already invokes mon_event_count() via IPI, which means
> this isn't possible.
>
> Change mon_event_read() to schedule mon_event_count() on a remote CPU and
> wait, instead of sending an IPI. This function is only used in response to
> a user-space filesystem request (not the timing sensitive overflow code).
>
> This allows MPAM to hide the slice behaviour from resctrl, and to keep
> the monitor-allocation in monitor.c.
>
> Tested-by: Shaopeng Tan <[email protected]>
> Signed-off-by: James Morse <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/ctrlmondata.c | 7 +++++--
> arch/x86/kernel/cpu/resctrl/internal.h | 2 +-
> arch/x86/kernel/cpu/resctrl/monitor.c | 6 ++++--
> 3 files changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> index 1df0e3262bca..4ee3da6dced7 100644
> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
> @@ -532,8 +532,11 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> struct rdt_domain *d, struct rdtgroup *rdtgrp,
> int evtid, int first)
> {
> + /* When picking a cpu from cpu_mask, ensure it can't race with cpuhp */

Please consistently use CPU instead of cpu.

> + lockdep_assert_held(&rdtgroup_mutex);
> +
> /*
> - * setup the parameters to send to the IPI to read the data.
> + * setup the parameters to pass to mon_event_count() to read the data.
> */
> rr->rgrp = rdtgrp;
> rr->evtid = evtid;
> @@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
> rr->val = 0;
> rr->first = first;
>
> - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
> + smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr, false);

This would be problematic for the use cases where single tasks are run on
adaptive-tick CPUs. If an adaptive-tick CPU is chosen to run the function then
it may never run. Real-time environments are a target usage of resctrl (with
examples in the documentation).

Reinette

2023-03-06 11:33:07

by James Morse

Subject: Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI

Hi Fenghua,

On 17/01/2023 18:29, Yu, Fenghua wrote:
>> x86 is blessed with an abundance of monitors, one per RMID, that can be read
>> from any CPU in the domain. MPAMs monitors reside in the MMIO MSC, the
>> number implemented is up to the manufacturer. This means when there are
>> fewer monitors than needed, they need to be allocated and freed.
>>
>> Worse, the domain may be broken up into slices, and the MMIO accesses for
>> each slice may need performing from different CPUs.
>>
>> These two details mean MPAMs monitor code needs to be able to sleep, and IPI
>> another CPU in the domain to read from a resource that has been sliced.
>>
>> mon_event_read() already invokes mon_event_count() via IPI, which means this
>> isn't possible.
>>
>> Change mon_event_read() to schedule mon_event_count() on a remote CPU
>> and wait, instead of sending an IPI. This function is only used in response to a
>> user-space filesystem request (not the timing sensitive overflow code).

> But mkdir mon group needs mon_event_count() to reset RMID state.
> If mon_event_count() is called much later, the RMID state may be used
> before it's reset. E.g. prev_msr might be non-0 value. That will cause
> overflow code failure.
>
> Seems this may happen on both x86 and arm64. At least need to make sure
> RMID state reset happens before it's used.

On the architecture side, there is a patch from Peter that records the MSR value
when an RMID is reset/re-allocated: 2a81160d29d6 ("x86/resctrl: Fix
event counts regression in reused RMIDs")

For the filesystem, the 'first' value is passed through and handled by the CPU that reads
the MSR. I don't see what problem any extra delay due to scheduling would cause.


Thanks,

James

2023-03-06 11:33:24

by James Morse

Subject: Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI

Hi Reinette,

On 02/02/2023 23:47, Reinette Chatre wrote:
> Hi James,
>
> On 1/13/2023 9:54 AM, James Morse wrote:
>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
>> the number implemented is up to the manufacturer. This means when there are
>> fewer monitors than needed, they need to be allocated and freed.
>>
>> Worse, the domain may be broken up into slices, and the MMIO accesses
>> for each slice may need performing from different CPUs.
>>
>> These two details mean MPAMs monitor code needs to be able to sleep, and
>> IPI another CPU in the domain to read from a resource that has been sliced.
>>
>> mon_event_read() already invokes mon_event_count() via IPI, which means
>> this isn't possible.
>>
>> Change mon_event_read() to schedule mon_event_count() on a remote CPU and
>> wait, instead of sending an IPI. This function is only used in response to
>> a user-space filesystem request (not the timing sensitive overflow code).
>>
>> This allows MPAM to hide the slice behaviour from resctrl, and to keep
>> the monitor-allocation in monitor.c.

>> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>> index 1df0e3262bca..4ee3da6dced7 100644
>> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>> @@ -532,8 +532,11 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>> struct rdt_domain *d, struct rdtgroup *rdtgrp,
>> int evtid, int first)
>> {
>> + /* When picking a cpu from cpu_mask, ensure it can't race with cpuhp */
>
> Please consistently use CPU instead of cpu.

(done)


>> + lockdep_assert_held(&rdtgroup_mutex);
>> +
>> /*
>> - * setup the parameters to send to the IPI to read the data.
>> + * setup the parameters to pass to mon_event_count() to read the data.
>> */
>> rr->rgrp = rdtgrp;
>> rr->evtid = evtid;
>> @@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>> rr->val = 0;
>> rr->first = first;
>>
>> - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
>> + smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr, false);

> This would be problematic for the use cases where single tasks are run on
> adaptive-tick CPUs. If an adaptive-tick CPU is chosen to run the function then
> it may never run. Real-time environments are target usage of resctrl (with examples
> in the documentation).

Interesting. I can't find an IPI wakeup under smp_call_on_cpu() ... I wonder what else
this breaks!

Resctrl doesn't consider the nohz-cpus when doing any of this work, or when setting up the
limbo or overflow timer work.

I think the right thing to do here is add some cpumask_any_housekeeping() helper to avoid
nohz-full CPUs where possible, and fall back to an IPI if all the CPUs in a domain are
nohz-full.

Ideally cpumask_any() would do this but it isn't possible without allocating memory.
If I can reproduce this problem, I'll propose adding the behaviour to
smp_call_function_any(), probably overloading 'wait' to return an error if all the target
CPUs are nohz-full.


This means those MPAM systems that need this can't use nohz_full, or at least need a
housekeeping CPU that can access each MSC. The driver already considers whether interrupts
are masked for this stuff because of the prototype perf support, which those machines
wouldn't be able to support either.

(I'll look at hooking this up to the limbo/overflow code too)


Thanks,

James

untested and incomplete hunk of code:
------------%<------------
	cpu = get_cpu();
	if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
		smp_call_function_single(cpu, smp_mon_event_count, rr, true);
		put_cpu();
	} else {
		put_cpu();

		/*
		 * If all the CPUs available use nohz_full, fall back to an IPI.
		 * Some MPAM systems will be unable to use their counters in this case.
		 */
		cpu = cpumask_any_housekeeping(&d->cpu_mask);
		if (tick_nohz_full_cpu(cpu))
			smp_call_function_single(cpu, smp_mon_event_count, rr, true);
		else
			smp_call_on_cpu(cpu, mon_event_count, rr, false);
	}
------------%<------------

2023-03-08 16:10:51

by James Morse

Subject: Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI

Hi Reinette,

On 06/03/2023 11:33, James Morse wrote:
> On 02/02/2023 23:47, Reinette Chatre wrote:
>> On 1/13/2023 9:54 AM, James Morse wrote:
>>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>>> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
>>> the number implemented is up to the manufacturer. This means when there are
>>> fewer monitors than needed, they need to be allocated and freed.
>>>
>>> Worse, the domain may be broken up into slices, and the MMIO accesses
>>> for each slice may need performing from different CPUs.
>>>
>>> These two details mean MPAMs monitor code needs to be able to sleep, and
>>> IPI another CPU in the domain to read from a resource that has been sliced.
>>>
>>> mon_event_read() already invokes mon_event_count() via IPI, which means
>>> this isn't possible.
>>>
>>> Change mon_event_read() to schedule mon_event_count() on a remote CPU and
>>> wait, instead of sending an IPI. This function is only used in response to
>>> a user-space filesystem request (not the timing sensitive overflow code).
>>>
>>> This allows MPAM to hide the slice behaviour from resctrl, and to keep
>>> the monitor-allocation in monitor.c.

>>> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>> index 1df0e3262bca..4ee3da6dced7 100644
>>> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>> @@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>>> rr->val = 0;
>>> rr->first = first;
>>>
>>> - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
>>> + smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr, false);
>
>> This would be problematic for the use cases where single tasks are run on
>> adaptive-tick CPUs. If an adaptive-tick CPU is chosen to run the function then
>> it may never run. Real-time environments are target usage of resctrl (with examples
>> in the documentation).
>
> Interesting. I can't find an IPI wakeup under smp_call_on_cpu() ... I wonder what else
> this breaks!
>
> Resctrl doesn't consider the nohz-cpus when doing any of this work, or when setting up the
> limbo or overflow timer work.
>
> I think the right thing to do here is add some cpumask_any_housekeeping() helper to avoid
> nohz-full CPUs where possible, and fall back to an IPI if all the CPUs in a domain are
> nohz-full.
>
> Ideally cpumask_any() would do this but it isn't possible without allocating memory.
> If I can reproduce this problem, ...

... I haven't been able to reproduce this.

With "nohz_full=1 isolcpus=nohz,domain,1" on the command-line I can still
smp_call_on_cpu() on cpu-1 even when it's running a SCHED_FIFO task that spins in
user-space as much as possible.

This looks to be down to "sched: RT throttling activated", which seems to exist to prevent
RT CPU hogs from blocking kernel work. From Peter's comments at [0], it looks like running
tasks 100% in user-space isn't a realistic use-case.

Given that, I think resctrl should use smp_call_on_cpu() to avoid interrupting nohz_full
CPUs, and the limbo/overflow code should equally avoid these CPUs. If work does get
scheduled on those CPUs, it is expected to run eventually.


Thanks,

James

[0] https://lore.kernel.org/all/[email protected]/

2023-03-10 20:01:00

by Reinette Chatre

Subject: Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI

Hi James and Fenghua,

On 3/6/2023 3:32 AM, James Morse wrote:
> On 17/01/2023 18:29, Yu, Fenghua wrote:
>>> x86 is blessed with an abundance of monitors, one per RMID, that can be read
>>> from any CPU in the domain. MPAMs monitors reside in the MMIO MSC, the
>>> number implemented is up to the manufacturer. This means when there are
>>> fewer monitors than needed, they need to be allocated and freed.
>>>
>>> Worse, the domain may be broken up into slices, and the MMIO accesses for
>>> each slice may need performing from different CPUs.
>>>
>>> These two details mean MPAMs monitor code needs to be able to sleep, and IPI
>>> another CPU in the domain to read from a resource that has been sliced.
>>>
>>> mon_event_read() already invokes mon_event_count() via IPI, which means this
>>> isn't possible.
>>>
>>> Change mon_event_read() to schedule mon_event_count() on a remote CPU
>>> and wait, instead of sending an IPI. This function is only used in response to a
>>> user-space filesystem request (not the timing sensitive overflow code).
>
>> But mkdir mon group needs mon_event_count() to reset RMID state.
>> If mon_event_count() is called much later, the RMID state may be used
>> before it's reset. E.g. prev_msr might be non-0 value. That will cause
>> overflow code failure.
>>
>> Seems this may happen on both x86 and arm64. At least need to make sure
>> RMID state reset happens before it's used.
>
> On the architecture side, there is a patch from Peter that records the MSR value on the
> architecture side when an RMID is reset/re-allocated. 2a81160d29d6 ("x86/resctrl: Fix
> event counts regression in reused RMIDs")
>
> For the filesystem, the 'first' value is passed through and handled by the CPU that reads
> the MSR. I don't see what problem any extra delay due to scheduling would cause.
>

Both the monitor directory creation and the overflow handler rely on the
rdtgroup mutex so I do not see how an overflow code failure could sneak in here.

Reinette

2023-03-10 20:06:42

by Reinette Chatre

Subject: Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI

Hi James,

On 3/8/2023 8:09 AM, James Morse wrote:
> Hi Reinette,
>
> On 06/03/2023 11:33, James Morse wrote:
>> On 02/02/2023 23:47, Reinette Chatre wrote:
>>> On 1/13/2023 9:54 AM, James Morse wrote:
>>>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>>>> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
>>>> the number implemented is up to the manufacturer. This means when there are
>>>> fewer monitors than needed, they need to be allocated and freed.
>>>>
>>>> Worse, the domain may be broken up into slices, and the MMIO accesses
>>>> for each slice may need performing from different CPUs.
>>>>
>>>> These two details mean MPAMs monitor code needs to be able to sleep, and
>>>> IPI another CPU in the domain to read from a resource that has been sliced.
>>>>
>>>> mon_event_read() already invokes mon_event_count() via IPI, which means
>>>> this isn't possible.
>>>>
>>>> Change mon_event_read() to schedule mon_event_count() on a remote CPU and
>>>> wait, instead of sending an IPI. This function is only used in response to
>>>> a user-space filesystem request (not the timing sensitive overflow code).
>>>>
>>>> This allows MPAM to hide the slice behaviour from resctrl, and to keep
>>>> the monitor-allocation in monitor.c.
>
>>>> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>>> index 1df0e3262bca..4ee3da6dced7 100644
>>>> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>>> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>>> @@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>>>> rr->val = 0;
>>>> rr->first = first;
>>>>
>>>> - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
>>>> + smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr, false);
>>
>>> This would be problematic for the use cases where single tasks are run on
>>> adaptive-tick CPUs. If an adaptive-tick CPU is chosen to run the function then
>>> it may never run. Real-time environments are target usage of resctrl (with examples
>>> in the documentation).
>>
>> Interesting. I can't find an IPI wakeup under smp_call_on_cpu() ... I wonder what else
>> this breaks!
>>
>> Resctrl doesn't consider the nohz-cpus when doing any of this work, or when setting up the
>> limbo or overflow timer work.
>>
>> I think the right thing to do here is add some cpumask_any_housekeeping() helper to avoid
>> nohz-full CPUs where possible, and fall back to an IPI if all the CPUs in a domain are
>> nohz-full.
>>
>> Ideally cpumask_any() would do this but it isn't possible without allocating memory.
>> If I can reproduce this problem, ...
>
> ... I haven't been able to reproduce this.
>
> With "nohz_full=1 isolcpus=nohz,domain,1" on the command-line I can still
> smp_call_on_cpu() on cpu-1 even when its running a SCHED_FIFO task that spins in
> user-space as much as possible.
>
> This looks to be down to "sched: RT throttling activated", which seems to be to prevent RT
> CPU hogs from blocking kernel work. From Peter's comments at [0], it looks like running
> tasks 100% in user-space isn't a realistic use-case.
>
> Given that, I think resctrl should use smp_call_on_cpu() to avoid interrupting a nohz_full
> CPUs, and the limbo/overflow code should equally avoid these CPUs. If work does get
> scheduled on those CPUs, it is expected to run eventually.

From what I understand, the email you point to, and I assume your testing,
used the system defaults (SCHED_FIFO gets 0.95s out of every 1s).

Users are not constrained by these defaults. Please see
Documentation/scheduler/sched-rt-group.rst

It is thus possible for a tightly controlled task to have a CPU dedicated to
it for great lengths or even forever, ideally written in a way to manage power
and thermal constraints.

In the current behavior, users can use resctrl to read the data at any time
and expect to understand the consequences of such action.

It seems to me that there may be scenarios under which this change could
result in a read of data never returning? As you indicated it is expected
to run eventually, but that would be dictated by the RT scheduling period,
which can be up to about 35 minutes (or "no limit", prompting me to make the
"never return" statement).

I do not see at this time that limbo/overflow should avoid these CPUs. Limbo
could be avoided from user space. I have not heard about overflow impacting
such workloads negatively.

Reinette

2023-03-20 17:18:01

by James Morse

Subject: Re: [PATCH v2 08/18] x86/resctrl: Queue mon_event_read() instead of sending an IPI

Hi Reinette,

On 10/03/2023 20:06, Reinette Chatre wrote:
> On 3/8/2023 8:09 AM, James Morse wrote:
>> On 06/03/2023 11:33, James Morse wrote:
>>> On 02/02/2023 23:47, Reinette Chatre wrote:
>>>> On 1/13/2023 9:54 AM, James Morse wrote:
>>>>> x86 is blessed with an abundance of monitors, one per RMID, that can be
>>>>> read from any CPU in the domain. MPAMs monitors reside in the MMIO MSC,
>>>>> the number implemented is up to the manufacturer. This means when there are
>>>>> fewer monitors than needed, they need to be allocated and freed.
>>>>>
>>>>> Worse, the domain may be broken up into slices, and the MMIO accesses
>>>>> for each slice may need performing from different CPUs.
>>>>>
>>>>> These two details mean MPAMs monitor code needs to be able to sleep, and
>>>>> IPI another CPU in the domain to read from a resource that has been sliced.
>>>>>
>>>>> mon_event_read() already invokes mon_event_count() via IPI, which means
>>>>> this isn't possible.
>>>>>
>>>>> Change mon_event_read() to schedule mon_event_count() on a remote CPU and
>>>>> wait, instead of sending an IPI. This function is only used in response to
>>>>> a user-space filesystem request (not the timing sensitive overflow code).
>>>>>
>>>>> This allows MPAM to hide the slice behaviour from resctrl, and to keep
>>>>> the monitor-allocation in monitor.c.
>>
>>>>> diff --git a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>>>> index 1df0e3262bca..4ee3da6dced7 100644
>>>>> --- a/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>>>> +++ b/arch/x86/kernel/cpu/resctrl/ctrlmondata.c
>>>>> @@ -542,7 +545,7 @@ void mon_event_read(struct rmid_read *rr, struct rdt_resource *r,
>>>>> rr->val = 0;
>>>>> rr->first = first;
>>>>>
>>>>> - smp_call_function_any(&d->cpu_mask, mon_event_count, rr, 1);
>>>>> + smp_call_on_cpu(cpumask_any(&d->cpu_mask), mon_event_count, rr, false);
>>>
>>>> This would be problematic for the use cases where single tasks are run on
>>>> adaptive-tick CPUs. If an adaptive-tick CPU is chosen to run the function then
>>>> it may never run. Real-time environments are target usage of resctrl (with examples
>>>> in the documentation).
>>>
>>> Interesting. I can't find an IPI wakeup under smp_call_on_cpu() ... I wonder what else
>>> this breaks!
>>>
>>> Resctrl doesn't consider the nohz-cpus when doing any of this work, or when setting up the
>>> limbo or overflow timer work.
>>>
>>> I think the right thing to do here is add some cpumask_any_housekeeping() helper to avoid
>>> nohz-full CPUs where possible, and fall back to an IPI if all the CPUs in a domain are
>>> nohz-full.
>>>
>>> Ideally cpumask_any() would do this but it isn't possible without allocating memory.
>>> If I can reproduce this problem, ...
>>
>> ... I haven't been able to reproduce this.
>>
>> With "nohz_full=1 isolcpus=nohz,domain,1" on the command-line I can still
>> smp_call_on_cpu() on cpu-1 even when its running a SCHED_FIFO task that spins in
>> user-space as much as possible.
>>
>> This looks to be down to "sched: RT throttling activated", which seems to be to prevent RT
>> CPU hogs from blocking kernel work. From Peter's comments at [0], it looks like running
>> tasks 100% in user-space isn't a realistic use-case.
>>
>> Given that, I think resctrl should use smp_call_on_cpu() to avoid interrupting a nohz_full
>> CPUs, and the limbo/overflow code should equally avoid these CPUs. If work does get
>> scheduled on those CPUs, it is expected to run eventually.

> From what I understand the email you point to, and I assume your testing,
> used the system defaults (SCHED_FIFO gets 0.95s out of 1s).
>
> Users are not constrained by these defaults. Please see
> Documentation/scheduler/sched-rt-group.rst

Aha, I didn't know these could be changed. But I note most things down here say:
| Fiddling with these settings can result in an unstable system, the knobs are
| root only and assumes root knows what he is doing.


> It is thus possible for tightly controlled task to have a CPU dedicated to
> it for great lengths or even forever. Ideally written in a way to manage power
> and thermal constraints.
>
> In the current behavior, users can use resctrl to read the data at any time
> and expect to understand consequences of such action.

Those consequences are that resctrl might pick that CPU as the victim of an IPI, so the
time taken to read the counters goes missing from user-space.


> It seems to me that there may be scenarios under which this change could
> result in a read of data to never return? As you indicated it is expected
> to run eventually, but that would be dictated by the RT scheduling period
> that can be up to about 35 minutes (or "no limit" prompting me to make the
> "never return" statement).

Surely not interrupting an RT task is a better state of affairs. User-space can't know
which CPU resctrl is going to IPI.
If this means resctrl doesn't work properly, I'd file that under 'can result in an
unstable system' as quoted above.

I think the best solution here is for resctrl to assume there is one housekeeping CPU per
domain (e.g. for processing offloaded RCU callbacks), and that it should prefer that CPU
for all cross-call work. This avoids ever interrupting RT tasks.

If you feel strongly that all CPUs in a domain could be dedicated 100% to user-space work,
resctrl can always use an IPI when nohz_full is in use (and hope for the best on the CPU
choice). This will prevent a class of MPAM platforms from using their monitors with
nohz_full, which I'm fine with as the conditions are detectable.


> I do not see at this time that limbo/overflow should avoid these CPUs. Limbo
> could be avoided from user space. I have not hear about overflow impacting
> such workloads negatively.

It's got all the same properties. The limbo/overflow work picks a CPU to run on, and it may
pick a nohz_full CPU. I suspect no-one has complained because this 100%-in-userspace
approach is a niche sport.

Again, I think the best solution here is for the limbo/overflow code to prefer
housekeeping CPUs for all their work. This is what I've done for v3.


Thanks,

James