2023-05-25 18:10:33

by James Morse

[permalink] [raw]
Subject: [PATCH v4 14/24] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep

MPAM's cache occupancy counters can take a little while to settle once
the monitor has been configured. The maximum settling time is described
to the driver via a firmware table. The value could be large enough
that it makes sense to sleep. To avoid exposing this to resctrl, it
should be hidden behind MPAM's resctrl_arch_rmid_read().

resctrl_arch_rmid_read() may be called via IPI meaning it is unable
to sleep. In this case resctrl_arch_rmid_read() should return an error
if it needs to sleep. This will only affect MPAM platforms where
the cache occupancy counter isn't available immediately, nohz_full is
in use, and there are there are no housekeeping CPUs in the necessary
domain.

There are three callers of resctrl_arch_rmid_read():
__mon_event_count() and __check_limbo() are both called from a
non-migrateable context. mon_event_read() invokes __mon_event_count()
using smp_call_on_cpu(), which adds work to the target CPUs workqueue.
rdtgroup_mutex() is held, meaning this cannot race with the resctrl
cpuhp callback. __check_limbo() is invoked via schedule_delayed_work_on()
also adds work to a per-cpu workqueue.

The remaining call is add_rmid_to_limbo() which is called in response
to a user-space syscall that frees an rmid. This opportunistically
reads the llc occupancy counter on the current domain to see if the
RMID is over the dirty threshold. This has to disable preemption to
avoid reading the wrong domain's value. Disabling pre-emption here
prevents resctrl_arch_rmid_read() from sleeping.

add_rmid_to_limbo() walks each domain, but only reads the counter
on one domain. If the system has more than one domain, the RMID will
always be added to the limbo list. If the RMIDs usage was not over the
threshold, it will be removed from the list when __check_limbo() runs.
Make this the default behaviour. Free RMIDs are always added to the
limbo list for each domain.

The user visible effect of this is that a clean RMID is not available
for re-allocation immediately after 'rmdir()' completes, this behaviour
was never portable as it never happened on a machine with multiple
domains.

Removing this path allows resctrl_arch_rmid_read() to sleep if its called
with interrupts unmasked. Document this is the expected behaviour, and
add a might_sleep() annotation to catch changes that won't work on arm64.

Signed-off-by: James Morse <[email protected]>
---
The previous version allowed resctrl_arch_rmid_read() to be called on the
wrong CPUs, but now that this needs to take nohz_full and housekeeping into
account, its too complex.

Changes since v3:
* Removed error handling for smp_call_function_any(), this can't race
with the cpuhp callbacks as both hold rdtgroup_mutex.
* Switched to the alternative of removing the counter read, this simplifies
things dramatically.
---
arch/x86/kernel/cpu/resctrl/monitor.c | 15 ++-------------
include/linux/resctrl.h | 18 +++++++++++++++++-
2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
index 6ba40495589a..fb33100e172b 100644
--- a/arch/x86/kernel/cpu/resctrl/monitor.c
+++ b/arch/x86/kernel/cpu/resctrl/monitor.c
@@ -272,6 +272,8 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
struct arch_mbm_state *am;
int ret = 0;

+ resctrl_arch_rmid_read_context_check();
+
if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
return -EINVAL;

@@ -462,8 +464,6 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
{
struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
struct rdt_domain *d;
- int cpu, err;
- u64 val = 0;
u32 idx;

lockdep_assert_held(&rdtgroup_mutex);
@@ -471,17 +471,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);

entry->busy = 0;
- cpu = get_cpu();
list_for_each_entry(d, &r->domains, list) {
- if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
- err = resctrl_arch_rmid_read(r, d, entry->closid,
- entry->rmid,
- QOS_L3_OCCUP_EVENT_ID,
- &val);
- if (err || val <= resctrl_rmid_realloc_threshold)
- continue;
- }
-
/*
* For the first limbo RMID in the domain,
* setup up the limbo worker.
@@ -491,7 +481,6 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
set_bit(idx, d->rmid_busy_llc);
entry->busy++;
}
- put_cpu();

if (entry->busy)
rmid_limbo_count++;
diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
index ff7452f644e4..b961936decfa 100644
--- a/include/linux/resctrl.h
+++ b/include/linux/resctrl.h
@@ -234,7 +234,12 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
* @eventid: eventid to read, e.g. L3 occupancy.
* @val: result of the counter read in bytes.
*
- * Call from process context on a CPU that belongs to domain @d.
+ * Some architectures need to sleep when first programming some of the counters.
+ * (specifically: arm64's MPAM cache occupancy counters can return 'not ready'
+ * for a short period of time). Call from a non-migrateable process context on
+ * a CPU that belongs to domain @d. e.g. use smp_call_on_cpu() or
+ * schedule_work_on(). This function can be called with interrupts masked,
+ * e.g. using smp_call_function_any(), but may concistently return an error.
*
* Return:
* 0 on success, or -EIO, -EINVAL etc on error.
@@ -243,6 +248,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
u32 closid, u32 rmid, enum resctrl_event_id eventid,
u64 *val);

+/**
+ * resctrl_arch_rmid_read_context_check() - warn about invalid contexts
+ *
+ * When built with CONFIG_DEBUG_ATOMIC_SLEEP, this function will generate a
+ * warning when resctrl_arch_rmid_read() is called from an invalid context.
+ */
+static inline void resctrl_arch_rmid_read_context_check(void)
+{
+ if (!irqs_disabled())
+ might_sleep();
+}

/**
* resctrl_arch_reset_rmid() - Reset any private state associated with rmid
--
2.39.2



2023-06-15 22:23:29

by Reinette Chatre

[permalink] [raw]
Subject: Re: [PATCH v4 14/24] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep

Hi James,

On 5/25/2023 11:01 AM, James Morse wrote:
> MPAM's cache occupancy counters can take a little while to settle once
> the monitor has been configured. The maximum settling time is described
> to the driver via a firmware table. The value could be large enough
> that it makes sense to sleep. To avoid exposing this to resctrl, it
> should be hidden behind MPAM's resctrl_arch_rmid_read().
>
> resctrl_arch_rmid_read() may be called via IPI meaning it is unable
> to sleep. In this case resctrl_arch_rmid_read() should return an error
> if it needs to sleep. This will only affect MPAM platforms where
> the cache occupancy counter isn't available immediately, nohz_full is
> in use, and there are there are no housekeeping CPUs in the necessary
> domain.
>
> There are three callers of resctrl_arch_rmid_read():
> __mon_event_count() and __check_limbo() are both called from a
> non-migrateable context. mon_event_read() invokes __mon_event_count()
> using smp_call_on_cpu(), which adds work to the target CPUs workqueue.
> rdtgroup_mutex() is held, meaning this cannot race with the resctrl
> cpuhp callback. __check_limbo() is invoked via schedule_delayed_work_on()
> also adds work to a per-cpu workqueue.
>
> The remaining call is add_rmid_to_limbo() which is called in response
> to a user-space syscall that frees an rmid. This opportunistically

rmid -> RMID

> reads the llc occupancy counter on the current domain to see if the

llc -> LLC

> RMID is over the dirty threshold. This has to disable preemption to
> avoid reading the wrong domain's value. Disabling pre-emption here
> prevents resctrl_arch_rmid_read() from sleeping.
>
> add_rmid_to_limbo() walks each domain, but only reads the counter
> on one domain. If the system has more than one domain, the RMID will
> always be added to the limbo list. If the RMIDs usage was not over the
> threshold, it will be removed from the list when __check_limbo() runs.
> Make this the default behaviour. Free RMIDs are always added to the
> limbo list for each domain.
>
> The user visible effect of this is that a clean RMID is not available
> for re-allocation immediately after 'rmdir()' completes, this behaviour
> was never portable as it never happened on a machine with multiple
> domains.
>
> Removing this path allows resctrl_arch_rmid_read() to sleep if its called
> with interrupts unmasked. Document this is the expected behaviour, and
> add a might_sleep() annotation to catch changes that won't work on arm64.
>
> Signed-off-by: James Morse <[email protected]>
> ---
> The previous version allowed resctrl_arch_rmid_read() to be called on the
> wrong CPUs, but now that this needs to take nohz_full and housekeeping into
> account, its too complex.
>
> Changes since v3:
> * Removed error handling for smp_call_function_any(), this can't race
> with the cpuhp callbacks as both hold rdtgroup_mutex.
> * Switched to the alternative of removing the counter read, this simplifies
> things dramatically.
> ---
> arch/x86/kernel/cpu/resctrl/monitor.c | 15 ++-------------
> include/linux/resctrl.h | 18 +++++++++++++++++-
> 2 files changed, 19 insertions(+), 14 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
> index 6ba40495589a..fb33100e172b 100644
> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
> @@ -272,6 +272,8 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> struct arch_mbm_state *am;
> int ret = 0;
>
> + resctrl_arch_rmid_read_context_check();
> +
> if (!cpumask_test_cpu(smp_processor_id(), &d->cpu_mask))
> return -EINVAL;
>
> @@ -462,8 +464,6 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
> {
> struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3].r_resctrl;
> struct rdt_domain *d;
> - int cpu, err;
> - u64 val = 0;
> u32 idx;
>
> lockdep_assert_held(&rdtgroup_mutex);
> @@ -471,17 +471,7 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
> idx = resctrl_arch_rmid_idx_encode(entry->closid, entry->rmid);
>
> entry->busy = 0;
> - cpu = get_cpu();
> list_for_each_entry(d, &r->domains, list) {
> - if (cpumask_test_cpu(cpu, &d->cpu_mask)) {
> - err = resctrl_arch_rmid_read(r, d, entry->closid,
> - entry->rmid,
> - QOS_L3_OCCUP_EVENT_ID,
> - &val);
> - if (err || val <= resctrl_rmid_realloc_threshold)
> - continue;
> - }
> -
> /*
> * For the first limbo RMID in the domain,
> * setup up the limbo worker.
> @@ -491,7 +481,6 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
> set_bit(idx, d->rmid_busy_llc);
> entry->busy++;
> }
> - put_cpu();
>
> if (entry->busy)
> rmid_limbo_count++;

Would entry->busy ever be 0 after this change?

> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
> index ff7452f644e4..b961936decfa 100644
> --- a/include/linux/resctrl.h
> +++ b/include/linux/resctrl.h
> @@ -234,7 +234,12 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
> * @eventid: eventid to read, e.g. L3 occupancy.
> * @val: result of the counter read in bytes.
> *
> - * Call from process context on a CPU that belongs to domain @d.
> + * Some architectures need to sleep when first programming some of the counters.
> + * (specifically: arm64's MPAM cache occupancy counters can return 'not ready'
> + * for a short period of time). Call from a non-migrateable process context on
> + * a CPU that belongs to domain @d. e.g. use smp_call_on_cpu() or
> + * schedule_work_on(). This function can be called with interrupts masked,
> + * e.g. using smp_call_function_any(), but may concistently return an error.
> *

concistently -> consistently?

> * Return:
> * 0 on success, or -EIO, -EINVAL etc on error.
> @@ -243,6 +248,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
> u32 closid, u32 rmid, enum resctrl_event_id eventid,
> u64 *val);
>
> +/**
> + * resctrl_arch_rmid_read_context_check() - warn about invalid contexts
> + *
> + * When built with CONFIG_DEBUG_ATOMIC_SLEEP, this function will generate a
> + * warning when resctrl_arch_rmid_read() is called from an invalid context.

No need to say "this function"
Could this comment be more specific about which contexts are invalid instead of
just referring to it as "invalid context". This seems to expect the reader to
already know what an invalid context would be.

> + */
> +static inline void resctrl_arch_rmid_read_context_check(void)
> +{
> + if (!irqs_disabled())
> + might_sleep();
> +}

Could you please elaborate why the "!irqs_disabled()" is needed?

>
> /**
> * resctrl_arch_reset_rmid() - Reset any private state associated with rmid


Reinette

2023-07-28 17:05:24

by James Morse

[permalink] [raw]
Subject: Re: [PATCH v4 14/24] x86/resctrl: Allow resctrl_arch_rmid_read() to sleep

Hi Reinette,

On 15/06/2023 23:13, Reinette Chatre wrote:
> On 5/25/2023 11:01 AM, James Morse wrote:
>> MPAM's cache occupancy counters can take a little while to settle once
>> the monitor has been configured. The maximum settling time is described
>> to the driver via a firmware table. The value could be large enough
>> that it makes sense to sleep. To avoid exposing this to resctrl, it
>> should be hidden behind MPAM's resctrl_arch_rmid_read().
>>
>> resctrl_arch_rmid_read() may be called via IPI meaning it is unable
>> to sleep. In this case resctrl_arch_rmid_read() should return an error
>> if it needs to sleep. This will only affect MPAM platforms where
>> the cache occupancy counter isn't available immediately, nohz_full is
>> in use, and there are there are no housekeeping CPUs in the necessary
>> domain.
>>
>> There are three callers of resctrl_arch_rmid_read():
>> __mon_event_count() and __check_limbo() are both called from a
>> non-migrateable context. mon_event_read() invokes __mon_event_count()
>> using smp_call_on_cpu(), which adds work to the target CPUs workqueue.
>> rdtgroup_mutex() is held, meaning this cannot race with the resctrl
>> cpuhp callback. __check_limbo() is invoked via schedule_delayed_work_on()
>> also adds work to a per-cpu workqueue.
>>
>> The remaining call is add_rmid_to_limbo() which is called in response
>> to a user-space syscall that frees an rmid. This opportunistically
>> reads the llc occupancy counter on the current domain to see if the
>> RMID is over the dirty threshold. This has to disable preemption to
>> avoid reading the wrong domain's value. Disabling pre-emption here
>> prevents resctrl_arch_rmid_read() from sleeping.
>>
>> add_rmid_to_limbo() walks each domain, but only reads the counter
>> on one domain. If the system has more than one domain, the RMID will
>> always be added to the limbo list. If the RMIDs usage was not over the
>> threshold, it will be removed from the list when __check_limbo() runs.
>> Make this the default behaviour. Free RMIDs are always added to the
>> limbo list for each domain.
>>
>> The user visible effect of this is that a clean RMID is not available
>> for re-allocation immediately after 'rmdir()' completes, this behaviour
>> was never portable as it never happened on a machine with multiple
>> domains.
>>
>> Removing this path allows resctrl_arch_rmid_read() to sleep if its called
>> with interrupts unmasked. Document this is the expected behaviour, and
>> add a might_sleep() annotation to catch changes that won't work on arm64.


>> diff --git a/arch/x86/kernel/cpu/resctrl/monitor.c b/arch/x86/kernel/cpu/resctrl/monitor.c
>> index 6ba40495589a..fb33100e172b 100644
>> --- a/arch/x86/kernel/cpu/resctrl/monitor.c
>> +++ b/arch/x86/kernel/cpu/resctrl/monitor.c
>> @@ -491,7 +481,6 @@ static void add_rmid_to_limbo(struct rmid_entry *entry)
>> set_bit(idx, d->rmid_busy_llc);
>> entry->busy++;
>> }
>> - put_cpu();
>>
>> if (entry->busy)
>> rmid_limbo_count++;
>
> Would entry->busy ever be 0 after this change?

Good point - the domains list can never be empty!


>> diff --git a/include/linux/resctrl.h b/include/linux/resctrl.h
>> index ff7452f644e4..b961936decfa 100644
>> --- a/include/linux/resctrl.h
>> +++ b/include/linux/resctrl.h
>> @@ -234,7 +234,12 @@ void resctrl_offline_domain(struct rdt_resource *r, struct rdt_domain *d);
>> * @eventid: eventid to read, e.g. L3 occupancy.
>> * @val: result of the counter read in bytes.
>> *
>> - * Call from process context on a CPU that belongs to domain @d.
>> + * Some architectures need to sleep when first programming some of the counters.
>> + * (specifically: arm64's MPAM cache occupancy counters can return 'not ready'
>> + * for a short period of time). Call from a non-migrateable process context on
>> + * a CPU that belongs to domain @d. e.g. use smp_call_on_cpu() or
>> + * schedule_work_on(). This function can be called with interrupts masked,
>> + * e.g. using smp_call_function_any(), but may concistently return an error.
>> *
>
> concistently -> consistently?
>
>> * Return:
>> * 0 on success, or -EIO, -EINVAL etc on error.
>> @@ -243,6 +248,17 @@ int resctrl_arch_rmid_read(struct rdt_resource *r, struct rdt_domain *d,
>> u32 closid, u32 rmid, enum resctrl_event_id eventid,
>> u64 *val);
>>
>> +/**
>> + * resctrl_arch_rmid_read_context_check() - warn about invalid contexts
>> + *
>> + * When built with CONFIG_DEBUG_ATOMIC_SLEEP, this function will generate a
>> + * warning when resctrl_arch_rmid_read() is called from an invalid context.
>
> No need to say "this function"
> Could this comment be more specific about which contexts are invalid instead of
> just referring to it as "invalid context". This seems to expect the reader to
> already know what an invalid context would be.

I thought the implementation was enough - the comment was intended to say this does
nothing unless you select that build option. I've changed it to:
| * When built with CONFIG_DEBUG_ATOMIC_SLEEP generate a warning when
| * resctrl_arch_rmid_read() is called with preemption disabled.


>> + */
>> +static inline void resctrl_arch_rmid_read_context_check(void)
>> +{
>> + if (!irqs_disabled())
>> + might_sleep();
>> +}

> Could you please elaborate why the "!irqs_disabled()" is needed?

It's a question of what the interface should support.
It needs to support being called via IPI because that is going to happen on NOHZ_FULL
platforms. In this case the platforms where the MPAM code needs to sleep will have to
return an error. For folk that see this in practice: they need at least one housekeeping
CPU per domain.

Outside that the function shouldn't be called with pre-emption disabled, as that would
prevent it from sleeping, but the lockdep annotations that check for pre-emption also go
off when interrupts are masked, hence the slightly funny check.

It's documented against resctrl_arch_rmid_read():
| * Some architectures need to sleep when first programming some of the counters.
| * (specifically: arm64's MPAM cache occupancy counters can return 'not ready'
| * for a short period of time). Call from a non-migrateable process context on
| * a CPU that belongs to domain @d. e.g. use smp_call_on_cpu() or
| * schedule_work_on(). This function can be called with interrupts masked,
| * e.g. using smp_call_function_any(), but may consistently return an error.


Thanks,

James