LinuxLists.cc - [PATCH v3 0/2] x86/resctrl: fix task CLOSID update race

2022-11-15 14:27:43

Subject: [PATCH v3 0/2] x86/resctrl: fix task CLOSID update race

Hi Reinette, Fenghua,

I've reorganized the patches for clarity, following James's guidance.

The patch series addresses the IPI race we discussed in the container
move RFD thread[1].

The first patch changes group-wide CLOSID/RMID updates to IPI all CPUs.
Now that the synchronization cost of correctly updating a single task is
more than originally thought, we believe that it's cheaper to IPI all
CPUs than forming a more precise CPU mask by synchronizing with all
tasks in an rdtgroup, especially when there is a large number of tasks
in the group. It's possible that this update could upset users who
frequently delete groups with few tasks. If anyone is aware of a use
case that frequently deletes groups, we can consider mitigations.

The second one uses the new task_call_func() interface to serialize
updating closid and rmid with any context switch of the task. AFAICT,
the implementation of this function acts like a mutex with context
switch, but I'm not certain whether it is intended to be one. If this is
not how task_call_func() is meant to be used, I will instead move the
code performing the update under sched/ where it can be done holding the
task_rq_lock() explicitly, as Reinette has suggested before[2].

Updates in v3:
- Split the handling of multi-task and single-task operations into
separate patches, now that they're handled differently.
- Clarify justification in the commit message, including moving some of
it out of inline code comment.
Updates in v2:
- Following Reinette's suggestion: use task_call_func() for single
task, IPI broadcast for group movements.
- Rebased to v6.1-rc4

v1: https://lore.kernel.org/lkml/[email protected]/
v2: https://lore.kernel.org/lkml/[email protected]/

Thanks!
-Peter

[1] https://lore.kernel.org/all/CALPaoCg2-9ARbK+MEgdvdcjJtSy_2H6YeRkLrT97zgy8Aro3Vg@mail.gmail.com/
[2] https://lore.kernel.org/lkml/[email protected]/

Peter Newman (2):
x86/resctrl: IPI all CPUs for group updates
x86/resctrl: update task closid/rmid with task_call_func()

arch/x86/kernel/cpu/resctrl/rdtgroup.c | 128 +++++++++++--------------
1 file changed, 58 insertions(+), 70 deletions(-)

--
2.38.1.493.g58b659f92b-goog

2022-11-15 14:28:12

by Peter Newman

[permalink] [raw]

Subject: [PATCH v3 2/2] x86/resctrl: update task closid/rmid with task_call_func()

When determining whether running tasks need to be interrupted due to a
closid/rmid change, it was possible for the task in question to migrate
or wake up concurrently without observing the updated values.

In particular, __rdtgroup_move_task() assumed that a CPU migrating
implied that it observed the updated closid/rmid. This assumption is
broken by the following reorderings, both of which are allowed on x86:

1. In __rdtgroup_move_task(), stores updating the closid and rmid in
the task structure could reorder with the loads in task_curr() and
task_cpu().
2. In resctrl_sched_in(), the earlier stores to the fields read by
task_curr() could be delayed until after the loads from
t->{closid,rmid}.

Preventing this reordering with barriers would have required an smp_mb()
in all context switches whenever resctrlfs is mounted. Instead, when
moving a single task, use task_call_func() to serialize updates to the
closid and rmid fields in the task_struct with context switch.

Signed-off-by: Peter Newman <[email protected]>
Reviewed-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 76 +++++++++++++++-----------
1 file changed, 45 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 049971efea2f..511b7cea143f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -538,10 +538,47 @@ static void _update_task_closid_rmid(void *task)
resctrl_sched_in();
}

-static void update_task_closid_rmid(struct task_struct *t)
+static int update_locked_task_closid_rmid(struct task_struct *t, void *arg)
{
- if (IS_ENABLED(CONFIG_SMP) && task_curr(t))
- smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1);
+ struct rdtgroup *rdtgrp = arg;
+
+ /*
+ * We assume task_call_func() has provided the necessary serialization
+ * with resctrl_sched_in().
+ */
+ if (rdtgrp->type == RDTCTRL_GROUP) {
+ t->closid = rdtgrp->closid;
+ t->rmid = rdtgrp->mon.rmid;
+ } else if (rdtgrp->type == RDTMON_GROUP) {
+ t->rmid = rdtgrp->mon.rmid;
+ }
+
+ /*
+ * If the task is current on a CPU, the PQR_ASSOC MSR needs to be
+ * updated to make the resource group go into effect. If the task is not
+ * current, the MSR will be updated when the task is scheduled in.
+ */
+ return task_curr(t);
+}
+
+static void update_task_closid_rmid(struct task_struct *t,
+ struct rdtgroup *rdtgrp)
+{
+ /*
+ * Serialize the closid and rmid update with context switch. If this
+ * function indicates that the task was running, then it needs to be
+ * interrupted to install the new closid and rmid.
+ */
+ if (task_call_func(t, update_locked_task_closid_rmid, rdtgrp) &&
+ IS_ENABLED(CONFIG_SMP))
+ /*
+ * If the task has migrated away from the CPU indicated by
+ * task_cpu() below, then it has already switched in on the
+ * new CPU using the updated closid and rmid and the call below
+ * unnecessary, but harmless.
+ */
+ smp_call_function_single(task_cpu(t),
+ _update_task_closid_rmid, t, 1);
else
_update_task_closid_rmid(t);
}
@@ -557,39 +594,16 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
return 0;

/*
- * Set the task's closid/rmid before the PQR_ASSOC MSR can be
- * updated by them.
- *
- * For ctrl_mon groups, move both closid and rmid.
* For monitor groups, can move the tasks only from
* their parent CTRL group.
*/
-
- if (rdtgrp->type == RDTCTRL_GROUP) {
- WRITE_ONCE(tsk->closid, rdtgrp->closid);
- WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
- } else if (rdtgrp->type == RDTMON_GROUP) {
- if (rdtgrp->mon.parent->closid == tsk->closid) {
- WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
- } else {
- rdt_last_cmd_puts("Can't move task to different control group\n");
- return -EINVAL;
- }
+ if (rdtgrp->type == RDTMON_GROUP &&
+ rdtgrp->mon.parent->closid != tsk->closid) {
+ rdt_last_cmd_puts("Can't move task to different control group\n");
+ return -EINVAL;
}

- /*
- * Ensure the task's closid and rmid are written before determining if
- * the task is current that will decide if it will be interrupted.
- */
- barrier();
-
- /*
- * By now, the task's closid and rmid are set. If the task is current
- * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
- * group go into effect. If the task is not current, the MSR will be
- * updated when the task is scheduled in.
- */
- update_task_closid_rmid(tsk);
+ update_task_closid_rmid(tsk, rdtgrp);

return 0;
}
--
2.38.1.493.g58b659f92b-goog

2022-11-15 14:37:54

by Peter Newman

[permalink] [raw]

Subject: [PATCH v3 1/2] x86/resctrl: IPI all CPUs for group updates

To rule out needing to update a CPU when deleting an rdtgroup, we must
search the entire tasklist for group members which could be running on
that CPU. This needs to be done while blocking updates to the tasklist
to avoid leaving newly-created child tasks assigned to the old
CLOSID/RMID.

The cost of reliably propagating a CLOSID or RMID update to a single
task is higher than originally thought. The present understanding is
that we must obtain the task_rq_lock() on each task to ensure that it
observes CLOSID/RMID updates in the case that it migrates away from its
current CPU before the update IPI reaches it.

For now, just notify all the CPUs after updating the closid/rmid fields
in impacted tasks task_structs rather than paying the cost of obtaining
a more precise cpu mask.

Signed-off-by: Peter Newman <[email protected]>
Reviewed-by: James Morse <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 52 +++++++-------------------
1 file changed, 13 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index e5a48f05e787..049971efea2f 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2385,12 +2385,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
* Move tasks from one to the other group. If @from is NULL, then all tasks
* in the systems are moved unconditionally (used for teardown).
*
- * If @mask is not NULL the cpus on which moved tasks are running are set
- * in that mask so the update smp function call is restricted to affected
- * cpus.
+ * Following this operation, the caller is required to update the MSRs on all
+ * CPUs.
*/
-static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
- struct cpumask *mask)
+static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to)
{
struct task_struct *p, *t;

@@ -2400,16 +2398,6 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
is_rmid_match(t, from)) {
WRITE_ONCE(t->closid, to->closid);
WRITE_ONCE(t->rmid, to->mon.rmid);
-
- /*
- * If the task is on a CPU, set the CPU in the mask.
- * The detection is inaccurate as tasks might move or
- * schedule before the smp function call takes place.
- * In such a case the function call is pointless, but
- * there is no other side effect.
- */
- if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
- cpumask_set_cpu(task_cpu(t), mask);
}
}
read_unlock(&tasklist_lock);
@@ -2440,7 +2428,7 @@ static void rmdir_all_sub(void)
struct rdtgroup *rdtgrp, *tmp;

/* Move all tasks to the default resource group */
- rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);
+ rdt_move_group_tasks(NULL, &rdtgroup_default);

list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
/* Free any child rmids */
@@ -3099,23 +3087,19 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
return -EPERM;
}

-static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
+static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp)
{
struct rdtgroup *prdtgrp = rdtgrp->mon.parent;
int cpu;

/* Give any tasks back to the parent group */
- rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask);
+ rdt_move_group_tasks(rdtgrp, prdtgrp);

/* Update per cpu rmid of the moved CPUs first */
for_each_cpu(cpu, &rdtgrp->cpu_mask)
per_cpu(pqr_state.default_rmid, cpu) = prdtgrp->mon.rmid;
- /*
- * Update the MSR on moved CPUs and CPUs which have moved
- * task running on them.
- */
- cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
- update_closid_rmid(tmpmask, NULL);
+
+ update_closid_rmid(cpu_online_mask, NULL);

rdtgrp->flags = RDT_DELETED;
free_rmid(rdtgrp->mon.rmid);
@@ -3140,12 +3124,12 @@ static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp)
return 0;
}

-static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
+static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp)
{
int cpu;

/* Give any tasks back to the default group */
- rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask);
+ rdt_move_group_tasks(rdtgrp, &rdtgroup_default);

/* Give any CPUs back to the default group */
cpumask_or(&rdtgroup_default.cpu_mask,
@@ -3157,12 +3141,7 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
per_cpu(pqr_state.default_rmid, cpu) = rdtgroup_default.mon.rmid;
}

- /*
- * Update the MSR on moved CPUs and CPUs which have moved
- * task running on them.
- */
- cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
- update_closid_rmid(tmpmask, NULL);
+ update_closid_rmid(cpu_online_mask, NULL);

closid_free(rdtgrp->closid);
free_rmid(rdtgrp->mon.rmid);
@@ -3181,12 +3160,8 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
{
struct kernfs_node *parent_kn = kn->parent;
struct rdtgroup *rdtgrp;
- cpumask_var_t tmpmask;
int ret = 0;

- if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
- return -ENOMEM;
-
rdtgrp = rdtgroup_kn_lock_live(kn);
if (!rdtgrp) {
ret = -EPERM;
@@ -3206,18 +3181,17 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
ret = rdtgroup_ctrl_remove(rdtgrp);
} else {
- ret = rdtgroup_rmdir_ctrl(rdtgrp, tmpmask);
+ ret = rdtgroup_rmdir_ctrl(rdtgrp);
}
} else if (rdtgrp->type == RDTMON_GROUP &&
is_mon_groups(parent_kn, kn->name)) {
- ret = rdtgroup_rmdir_mon(rdtgrp, tmpmask);
+ ret = rdtgroup_rmdir_mon(rdtgrp);
} else {
ret = -EPERM;
}

out:
rdtgroup_kn_unlock(kn);
- free_cpumask_var(tmpmask);
return ret;
}

--
2.38.1.493.g58b659f92b-goog

2022-11-21 22:11:09

by Reinette Chatre

[permalink] [raw]

Subject: Re: [PATCH v3 2/2] x86/resctrl: update task closid/rmid with task_call_func()

Hi Peter,

Patch description in subject should start with upper case.

On 11/15/2022 6:19 AM, Peter Newman wrote:
> When determining whether running tasks need to be interrupted due to a
> closid/rmid change, it was possible for the task in question to migrate
> or wake up concurrently without observing the updated values.

Mixing tenses can quickly become confusing. Please stick to imperative tone.

Also, please start with the context of this work before jumping to
the problem description.

For example (not a requirement to use - feel free to change):
"A task is moved to a resource group when its task id is written to the
destination resource group's "tasks" file. Moving a task to a new
resource group involves updating the task's closid and rmid (found
in its task_struct) and updating the PQR_ASSOC MSR if the task
is current on a CPU.

It is possible for the task to migrate or wake up while it is moved
to a new resource group. In this scenario the task starts running
with the old closid and rmid values but it may not be considered
as running and thus continue running with the old values until it is
rescheduled."

>
> In particular, __rdtgroup_move_task() assumed that a CPU migrating
> implied that it observed the updated closid/rmid. This assumption is
> broken by the following reorderings, both of which are allowed on x86:
>
> 1. In __rdtgroup_move_task(), stores updating the closid and rmid in
> the task structure could reorder with the loads in task_curr() and
> task_cpu().
> 2. In resctrl_sched_in(), the earlier stores to the fields read by
> task_curr() could be delayed until after the loads from
> t->{closid,rmid}.
>
> Preventing this reordering with barriers would have required an smp_mb()
> in all context switches whenever resctrlfs is mounted. Instead, when
> moving a single task, use task_call_func() to serialize updates to the
> closid and rmid fields in the task_struct with context switch.

Please adjust the above to imperative tone.

>
> Signed-off-by: Peter Newman <[email protected]>
> Reviewed-by: James Morse <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 76 +++++++++++++++-----------
> 1 file changed, 45 insertions(+), 31 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index 049971efea2f..511b7cea143f 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -538,10 +538,47 @@ static void _update_task_closid_rmid(void *task)
> resctrl_sched_in();
> }
>
> -static void update_task_closid_rmid(struct task_struct *t)
> +static int update_locked_task_closid_rmid(struct task_struct *t, void *arg)
> {
> - if (IS_ENABLED(CONFIG_SMP) && task_curr(t))
> - smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1);
> + struct rdtgroup *rdtgrp = arg;
> +
> + /*
> + * We assume task_call_func() has provided the necessary serialization
> + * with resctrl_sched_in().

Please no "we".

Also, either task_call_func() provides serialization or it does not. Wording
it as "assume" creates uncertainty about this change.

> + */
> + if (rdtgrp->type == RDTCTRL_GROUP) {
> + t->closid = rdtgrp->closid;
> + t->rmid = rdtgrp->mon.rmid;
> + } else if (rdtgrp->type == RDTMON_GROUP) {
> + t->rmid = rdtgrp->mon.rmid;
> + }

wrt the READ_ONCE() in __resctrl_sched_in() ... memory_barriers.txt tells me
that "When dealing with CPU-CPU interactions, certain types of memory barrier
should always be paired. A lack of appropriate pairing is almost certainly
an error."

> +
> + /*
> + * If the task is current on a CPU, the PQR_ASSOC MSR needs to be
> + * updated to make the resource group go into effect. If the task is not
> + * current, the MSR will be updated when the task is scheduled in.
> + */
> + return task_curr(t);
> +}
> +
> +static void update_task_closid_rmid(struct task_struct *t,
> + struct rdtgroup *rdtgrp)
> +{
> + /*
> + * Serialize the closid and rmid update with context switch. If this
> + * function indicates that the task was running, then it needs to be

What does "this function" refer to? Please replace with function name to be
specific since there are a few functions below.

/was running/is running/?

> + * interrupted to install the new closid and rmid.
> + */
> + if (task_call_func(t, update_locked_task_closid_rmid, rdtgrp) &&
> + IS_ENABLED(CONFIG_SMP))
> + /*
> + * If the task has migrated away from the CPU indicated by
> + * task_cpu() below, then it has already switched in on the
> + * new CPU using the updated closid and rmid and the call below
> + * unnecessary, but harmless.

is unnecessary ?

> + */
> + smp_call_function_single(task_cpu(t),
> + _update_task_closid_rmid, t, 1);
> else
> _update_task_closid_rmid(t);
> }

Could you please keep update_task_closid_rmid() and
_update_task_closid_rmid() together?

> @@ -557,39 +594,16 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
> return 0;
>
> /*
> - * Set the task's closid/rmid before the PQR_ASSOC MSR can be
> - * updated by them.
> - *
> - * For ctrl_mon groups, move both closid and rmid.
> * For monitor groups, can move the tasks only from
> * their parent CTRL group.
> */
> -
> - if (rdtgrp->type == RDTCTRL_GROUP) {
> - WRITE_ONCE(tsk->closid, rdtgrp->closid);
> - WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
> - } else if (rdtgrp->type == RDTMON_GROUP) {
> - if (rdtgrp->mon.parent->closid == tsk->closid) {
> - WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
> - } else {
> - rdt_last_cmd_puts("Can't move task to different control group\n");
> - return -EINVAL;
> - }
> + if (rdtgrp->type == RDTMON_GROUP &&
> + rdtgrp->mon.parent->closid != tsk->closid) {
> + rdt_last_cmd_puts("Can't move task to different control group\n");
> + return -EINVAL;
> }
>
> - /*
> - * Ensure the task's closid and rmid are written before determining if
> - * the task is current that will decide if it will be interrupted.
> - */
> - barrier();
> -
> - /*
> - * By now, the task's closid and rmid are set. If the task is current
> - * on a CPU, the PQR_ASSOC MSR needs to be updated to make the resource
> - * group go into effect. If the task is not current, the MSR will be
> - * updated when the task is scheduled in.
> - */
> - update_task_closid_rmid(tsk);
> + update_task_closid_rmid(tsk, rdtgrp);
>
> return 0;
> }

Reinette

2022-11-21 22:52:40

by Reinette Chatre

[permalink] [raw]

Subject: Re: [PATCH v3 1/2] x86/resctrl: IPI all CPUs for group updates

Hi Peter,

On 11/15/2022 6:19 AM, Peter Newman wrote:
> To rule out needing to update a CPU when deleting an rdtgroup, we must

Please do not impersonate code in changelog and comments (do not
use "we"). This is required for resctrl changes to be considered for
inclusion because resctrl patches are routed via the "tip" repo
and is thus required to follow the "tip tree handbook"
(Documentation/process/maintainer-tip.rst). Please also
stick to a clear "context-problem-solution" changelog as is the custom
in this area.

> search the entire tasklist for group members which could be running on
> that CPU. This needs to be done while blocking updates to the tasklist
> to avoid leaving newly-created child tasks assigned to the old
> CLOSID/RMID.

This is not clear to me. rdt_move_group_tasks() obtains a read lock,
read_lock(&tasklist_lock), so concurrent modifications to the tasklist
are indeed possible. Should this perhaps be write_lock() instead?
It sounds like the scenario you are describing may be a concern. That is,
if a task belonging to a group that is being removed happens to
call fork()/clone() during the move then the child may end up being
created with old closid.

>
> The cost of reliably propagating a CLOSID or RMID update to a single
> task is higher than originally thought. The present understanding is
> that we must obtain the task_rq_lock() on each task to ensure that it
> observes CLOSID/RMID updates in the case that it migrates away from its
> current CPU before the update IPI reaches it.

I find this confusing since it describes why a potential solution does
not solve a problem, neither problem nor solution is well described at this
point.

What if you switch the order of the two patches? Patch #2 provides
the potential solution mentioned here so that may be helpful to have as
reference in this changelog.

> For now, just notify all the CPUs after updating the closid/rmid fields

For now? If you anticipate changes then there should be a plan for that,
otherwise this is the fix without further speculation.

> in impacted tasks task_structs rather than paying the cost of obtaining
> a more precise cpu mask.

s/cpu/CPU/
It may be helpful to add that an accurate CPU mask cannot be guaranteed and
the more tasks moved the less accurate it could be (if I understand correctly).

>
> Signed-off-by: Peter Newman <[email protected]>
> Reviewed-by: James Morse <[email protected]>
> ---
> arch/x86/kernel/cpu/resctrl/rdtgroup.c | 52 +++++++-------------------
> 1 file changed, 13 insertions(+), 39 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index e5a48f05e787..049971efea2f 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -2385,12 +2385,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
> * Move tasks from one to the other group. If @from is NULL, then all tasks
> * in the systems are moved unconditionally (used for teardown).
> *
> - * If @mask is not NULL the cpus on which moved tasks are running are set
> - * in that mask so the update smp function call is restricted to affected
> - * cpus.
> + * Following this operation, the caller is required to update the MSRs on all
> + * CPUs.
> */

On x86 only one MSR needs updating, the PQR_ASSOC MSR. The above could be
summarized as:
"Caller should update per CPU storage and PQR_ASSOC."

> -static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
> - struct cpumask *mask)
> +static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to)
> {
> struct task_struct *p, *t;
>
> @@ -2400,16 +2398,6 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
> is_rmid_match(t, from)) {
> WRITE_ONCE(t->closid, to->closid);
> WRITE_ONCE(t->rmid, to->mon.rmid);
> -
> - /*
> - * If the task is on a CPU, set the CPU in the mask.
> - * The detection is inaccurate as tasks might move or
> - * schedule before the smp function call takes place.
> - * In such a case the function call is pointless, but
> - * there is no other side effect.
> - */
> - if (IS_ENABLED(CONFIG_SMP) && mask && task_curr(t))
> - cpumask_set_cpu(task_cpu(t), mask);
> }
> }
> read_unlock(&tasklist_lock);
> @@ -2440,7 +2428,7 @@ static void rmdir_all_sub(void)
> struct rdtgroup *rdtgrp, *tmp;
>
> /* Move all tasks to the default resource group */
> - rdt_move_group_tasks(NULL, &rdtgroup_default, NULL);
> + rdt_move_group_tasks(NULL, &rdtgroup_default);
>
> list_for_each_entry_safe(rdtgrp, tmp, &rdt_all_groups, rdtgroup_list) {
> /* Free any child rmids */
> @@ -3099,23 +3087,19 @@ static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
> return -EPERM;
> }
>
> -static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
> +static int rdtgroup_rmdir_mon(struct rdtgroup *rdtgrp)
> {
> struct rdtgroup *prdtgrp = rdtgrp->mon.parent;
> int cpu;
>
> /* Give any tasks back to the parent group */
> - rdt_move_group_tasks(rdtgrp, prdtgrp, tmpmask);
> + rdt_move_group_tasks(rdtgrp, prdtgrp);
>
> /* Update per cpu rmid of the moved CPUs first */
> for_each_cpu(cpu, &rdtgrp->cpu_mask)
> per_cpu(pqr_state.default_rmid, cpu) = prdtgrp->mon.rmid;
> - /*
> - * Update the MSR on moved CPUs and CPUs which have moved
> - * task running on them.
> - */
> - cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
> - update_closid_rmid(tmpmask, NULL);
> +
> + update_closid_rmid(cpu_online_mask, NULL);
>
> rdtgrp->flags = RDT_DELETED;
> free_rmid(rdtgrp->mon.rmid);
> @@ -3140,12 +3124,12 @@ static int rdtgroup_ctrl_remove(struct rdtgroup *rdtgrp)
> return 0;
> }
>
> -static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
> +static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp)
> {
> int cpu;
>
> /* Give any tasks back to the default group */
> - rdt_move_group_tasks(rdtgrp, &rdtgroup_default, tmpmask);
> + rdt_move_group_tasks(rdtgrp, &rdtgroup_default);
>
> /* Give any CPUs back to the default group */
> cpumask_or(&rdtgroup_default.cpu_mask,
> @@ -3157,12 +3141,7 @@ static int rdtgroup_rmdir_ctrl(struct rdtgroup *rdtgrp, cpumask_var_t tmpmask)
> per_cpu(pqr_state.default_rmid, cpu) = rdtgroup_default.mon.rmid;
> }
>
> - /*
> - * Update the MSR on moved CPUs and CPUs which have moved
> - * task running on them.
> - */
> - cpumask_or(tmpmask, tmpmask, &rdtgrp->cpu_mask);
> - update_closid_rmid(tmpmask, NULL);
> + update_closid_rmid(cpu_online_mask, NULL);
>
> closid_free(rdtgrp->closid);
> free_rmid(rdtgrp->mon.rmid);
> @@ -3181,12 +3160,8 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
> {
> struct kernfs_node *parent_kn = kn->parent;
> struct rdtgroup *rdtgrp;
> - cpumask_var_t tmpmask;
> int ret = 0;
>
> - if (!zalloc_cpumask_var(&tmpmask, GFP_KERNEL))
> - return -ENOMEM;
> -
> rdtgrp = rdtgroup_kn_lock_live(kn);
> if (!rdtgrp) {
> ret = -EPERM;
> @@ -3206,18 +3181,17 @@ static int rdtgroup_rmdir(struct kernfs_node *kn)
> rdtgrp->mode == RDT_MODE_PSEUDO_LOCKED) {
> ret = rdtgroup_ctrl_remove(rdtgrp);
> } else {
> - ret = rdtgroup_rmdir_ctrl(rdtgrp, tmpmask);
> + ret = rdtgroup_rmdir_ctrl(rdtgrp);
> }
> } else if (rdtgrp->type == RDTMON_GROUP &&
> is_mon_groups(parent_kn, kn->name)) {
> - ret = rdtgroup_rmdir_mon(rdtgrp, tmpmask);
> + ret = rdtgroup_rmdir_mon(rdtgrp);
> } else {
> ret = -EPERM;
> }
>
> out:
> rdtgroup_kn_unlock(kn);
> - free_cpumask_var(tmpmask);
> return ret;
> }
>

The fix looks good to me. I do think its motivation and description
needs to improve to make it palatable to folks not familiar with this area.

Reinette

2022-11-22 15:57:07

by Peter Newman

[permalink] [raw]

Subject: Re: [PATCH v3 2/2] x86/resctrl: update task closid/rmid with task_call_func()

Hi Reinette,

On Mon, Nov 21, 2022 at 10:59 PM Reinette Chatre
<[email protected]> wrote:
> Patch description in subject should start with upper case.
>
> On 11/15/2022 6:19 AM, Peter Newman wrote:
> > When determining whether running tasks need to be interrupted due to a
> > closid/rmid change, it was possible for the task in question to migrate
> > or wake up concurrently without observing the updated values.
>
> Mixing tenses can quickly become confusing. Please stick to imperative tone.

Sorry about this. Looking at how other bug fix commit messages are
worded, they describe the bug as the present behavior of the kernel
rather than how the kernel used to behave before the patch was applied.
It seems I've been consistently wording problem statements as past
behavior.

> Also, please start with the context of this work before jumping to
> the problem description.
>
> For example (not a requirement to use - feel free to change):
> "A task is moved to a resource group when its task id is written to the
> destination resource group's "tasks" file. Moving a task to a new
> resource group involves updating the task's closid and rmid (found
> in its task_struct) and updating the PQR_ASSOC MSR if the task
> is current on a CPU.
>
> It is possible for the task to migrate or wake up while it is moved
> to a new resource group. In this scenario the task starts running
> with the old closid and rmid values but it may not be considered
> as running and thus continue running with the old values until it is
> rescheduled."

Thanks, I'll use your intro as a starting point.

> > In particular, __rdtgroup_move_task() assumed that a CPU migrating
> > implied that it observed the updated closid/rmid. This assumption is
> > broken by the following reorderings, both of which are allowed on x86:
> >
> > 1. In __rdtgroup_move_task(), stores updating the closid and rmid in
> > the task structure could reorder with the loads in task_curr() and
> > task_cpu().
> > 2. In resctrl_sched_in(), the earlier stores to the fields read by
> > task_curr() could be delayed until after the loads from
> > t->{closid,rmid}.
> >
> > Preventing this reordering with barriers would have required an smp_mb()
> > in all context switches whenever resctrlfs is mounted. Instead, when
> > moving a single task, use task_call_func() to serialize updates to the
> > closid and rmid fields in the task_struct with context switch.
>
> Please adjust the above to imperative tone.

Ok.

> > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > index 049971efea2f..511b7cea143f 100644
> > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > @@ -538,10 +538,47 @@ static void _update_task_closid_rmid(void *task)
> > resctrl_sched_in();
> > }
> >
> > -static void update_task_closid_rmid(struct task_struct *t)
> > +static int update_locked_task_closid_rmid(struct task_struct *t, void *arg)
> > {
> > - if (IS_ENABLED(CONFIG_SMP) && task_curr(t))
> > - smp_call_function_single(task_cpu(t), _update_task_closid_rmid, t, 1);
> > + struct rdtgroup *rdtgrp = arg;
> > +
> > + /*
> > + * We assume task_call_func() has provided the necessary serialization
> > + * with resctrl_sched_in().
>
> Please no "we".

Ok.

>
> Also, either task_call_func() provides serialization or it does not. Wording
> it as "assume" creates uncertainty about this change.
>
> > + */
> > + if (rdtgrp->type == RDTCTRL_GROUP) {
> > + t->closid = rdtgrp->closid;
> > + t->rmid = rdtgrp->mon.rmid;
> > + } else if (rdtgrp->type == RDTMON_GROUP) {
> > + t->rmid = rdtgrp->mon.rmid;
> > + }
>
> wrt the READ_ONCE() in __resctrl_sched_in() ... memory_barriers.txt tells me
> that "When dealing with CPU-CPU interactions, certain types of memory barrier
> should always be paired. A lack of appropriate pairing is almost certainly
> an error."

Consequently the statement earlier about task_call_func() (maybe)
providing serialization has been incorrect since I stopped using it for
group updates in the v2 series, where I had to put the READ_ONCE() calls
back in __resctrl_sched_in(), so the READ_ONCE() is indeed necessary
again.

>
> > +
> > + /*
> > + * If the task is current on a CPU, the PQR_ASSOC MSR needs to be
> > + * updated to make the resource group go into effect. If the task is not
> > + * current, the MSR will be updated when the task is scheduled in.
> > + */
> > + return task_curr(t);
> > +}
> > +
> > +static void update_task_closid_rmid(struct task_struct *t,
> > + struct rdtgroup *rdtgrp)
> > +{
> > + /*
> > + * Serialize the closid and rmid update with context switch. If this
> > + * function indicates that the task was running, then it needs to be
>
> What does "this function" refer to? Please replace with function name to be
> specific since there are a few functions below.

Ok.

>
> /was running/is running/?

The task is no longer locked here, so it doesn't seem correct to say "is
running" when it could have already stopped running.

Also, maybe related, I was considering moving the task_curr() call out
of update_locked_task_closid_rmid() to avoid giving the misconception
that the result is current, because the value is used after releasing
the task's pi/rq lock.

>
> > + * interrupted to install the new closid and rmid.
> > + */
> > + if (task_call_func(t, update_locked_task_closid_rmid, rdtgrp) &&
> > + IS_ENABLED(CONFIG_SMP))
> > + /*
> > + * If the task has migrated away from the CPU indicated by
> > + * task_cpu() below, then it has already switched in on the
> > + * new CPU using the updated closid and rmid and the call below
> > + * unnecessary, but harmless.
>
> is unnecessary ?

Yes, thanks.

>
> > + */
> > + smp_call_function_single(task_cpu(t),
> > + _update_task_closid_rmid, t, 1);
> > else
> > _update_task_closid_rmid(t);
> > }
>
> Could you please keep update_task_closid_rmid() and
> _update_task_closid_rmid() together?

Yes.

> [...]
>
> Reinette

Thanks for your feedback, Reinette. I'll fix the wording throughout the
patch.

I'll read over the wording in the other patch series I sent out as well
to make sure it's consistent.

-Peter

2022-11-22 21:25:54

by Reinette Chatre

[permalink] [raw]

Subject: Re: [PATCH v3 2/2] x86/resctrl: update task closid/rmid with task_call_func()

Hi Peter,

On 11/22/2022 7:17 AM, Peter Newman wrote:
> Hi Reinette,
>
> On Mon, Nov 21, 2022 at 10:59 PM Reinette Chatre
> <[email protected]> wrote:
>> Patch description in subject should start with upper case.
>>
>> On 11/15/2022 6:19 AM, Peter Newman wrote:
>>> When determining whether running tasks need to be interrupted due to a
>>> closid/rmid change, it was possible for the task in question to migrate
>>> or wake up concurrently without observing the updated values.
>>
>> Mixing tenses can quickly become confusing. Please stick to imperative tone.
>
> Sorry about this. Looking at how other bug fix commit messages are
> worded, they describe the bug as the present behavior of the kernel
> rather than how the kernel used to behave before the patch was applied.
> It seems I've been consistently wording problem statements as past
> behavior.

No problem. There is a good section in the tip handbook about changelogs.

>>> +static void update_task_closid_rmid(struct task_struct *t,
>>> + struct rdtgroup *rdtgrp)
>>> +{
>>> + /*
>>> + * Serialize the closid and rmid update with context switch. If this
>>> + * function indicates that the task was running, then it needs to be
>>
>> What does "this function" refer to? Please replace with function name to be
>> specific since there are a few functions below.
>
> Ok.
>
>>
>> /was running/is running/?
>
> The task is no longer locked here, so it doesn't seem correct to say "is
> running" when it could have already stopped running.

ok, how about "was running during its closid and rmid change" or similar?

>
> Also, maybe related, I was considering moving the task_curr() call out
> of update_locked_task_closid_rmid() to avoid giving the misconception
> that the result is current, because the value is used after releasing
> the task's pi/rq lock.

I think task_curr() should stay within update_locked_task_closid_rmid(). It
is an appropriate usage per description of task_call_func() and from what I
understand it is the most accurate way to answer the question "was task
running while its closid/rmid changed?". It cannot be guaranteed that the
task is still running when the MSR changing code gets its change to run but
_update_task_closid_rmid() as well as resctrl_sched_in() have that covered.

Reinette

2022-11-23 12:23:39

by Peter Newman

[permalink] [raw]

Subject: Re: [PATCH v3 1/2] x86/resctrl: IPI all CPUs for group updates

Hi Reinette,

On Mon, Nov 21, 2022 at 10:53 PM Reinette Chatre
<[email protected]> wrote:
> On 11/15/2022 6:19 AM, Peter Newman wrote:
> > To rule out needing to update a CPU when deleting an rdtgroup, we must
>
> Please do not impersonate code in changelog and comments (do not
> use "we"). This is required for resctrl changes to be considered for
> inclusion because resctrl patches are routed via the "tip" repo
> and is thus required to follow the "tip tree handbook"
> (Documentation/process/maintainer-tip.rst). Please also
> stick to a clear "context-problem-solution" changelog as is the custom
> in this area.

Thanks, I forgot about the "tip tree handbook" and was trying to
remember where the pointers about wording from the reply to the other
patch came from. Unfortunately I didn't read your replies in FIFO
order.

>
> > search the entire tasklist for group members which could be running on
> > that CPU. This needs to be done while blocking updates to the tasklist
> > to avoid leaving newly-created child tasks assigned to the old
> > CLOSID/RMID.
>
> This is not clear to me. rdt_move_group_tasks() obtains a read lock,
> read_lock(&tasklist_lock), so concurrent modifications to the tasklist
> are indeed possible. Should this perhaps be write_lock() instead?
> It sounds like the scenario you are describing may be a concern. That is,
> if a task belonging to a group that is being removed happens to
> call fork()/clone() during the move then the child may end up being
> created with old closid.

Shouldn't read_lock(&tasklist_lock) cause write_lock(&tasklist_lock) to
block?

Maybe I paraphrased too much in the explanation.

> > The cost of reliably propagating a CLOSID or RMID update to a single
> > task is higher than originally thought. The present understanding is
> > that we must obtain the task_rq_lock() on each task to ensure that it
> > observes CLOSID/RMID updates in the case that it migrates away from its
> > current CPU before the update IPI reaches it.
>
> I find this confusing since it describes why a potential solution does
> not solve a problem, neither problem nor solution is well described at this
> point.
>
> What if you switch the order of the two patches? Patch #2 provides
> the potential solution mentioned here so that may be helpful to have as
> reference in this changelog.

Yes, I will try that. The single-task and multi-task cases should be
independent enough that I can handle them in either order.

> > For now, just notify all the CPUs after updating the closid/rmid fields
>
> For now? If you anticipate changes then there should be a plan for that,
> otherwise this is the fix without further speculation.

It seems like I have to either do something to acknowledge the tradeoff,
or make a case that the negatively affected usage isn't important.
Should I assert that deleting groups while a realtime, core-isolated
application is running isn't a legitimate use case?

>
> > in impacted tasks task_structs rather than paying the cost of obtaining
> > a more precise cpu mask.
>
> s/cpu/CPU/
> It may be helpful to add that an accurate CPU mask cannot be guaranteed and
> the more tasks moved the less accurate it could be (if I understand correctly).

Ok.

> > diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > index e5a48f05e787..049971efea2f 100644
> > --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> > @@ -2385,12 +2385,10 @@ static int reset_all_ctrls(struct rdt_resource *r)
> > * Move tasks from one to the other group. If @from is NULL, then all tasks
> > * in the systems are moved unconditionally (used for teardown).
> > *
> > - * If @mask is not NULL the cpus on which moved tasks are running are set
> > - * in that mask so the update smp function call is restricted to affected
> > - * cpus.
> > + * Following this operation, the caller is required to update the MSRs on all
> > + * CPUs.
> > */
>
> On x86 only one MSR needs updating, the PQR_ASSOC MSR. The above could be
> summarized as:
> "Caller should update per CPU storage and PQR_ASSOC."

Sounds good.

> [...]
>
> The fix looks good to me. I do think its motivation and description
> needs to improve to make it palatable to folks not familiar with this area.
>
> Reinette

Once again, thanks for your review. I will work on clarifying the
comments. The explanation is usually most of the work with changes like
these, which is tough to do without some feedback from an actual reader.
I've been looking at this section of code too long to be able to judge
my own explanation anymore.

-Peter

2022-11-23 17:25:12

by Reinette Chatre

[permalink] [raw]

Subject: Re: [PATCH v3 1/2] x86/resctrl: IPI all CPUs for group updates

Hi Peter,

On 11/23/2022 3:09 AM, Peter Newman wrote:
> On Mon, Nov 21, 2022 at 10:53 PM Reinette Chatre
> <[email protected]> wrote:
>> On 11/15/2022 6:19 AM, Peter Newman wrote:

...

>>> search the entire tasklist for group members which could be running on
>>> that CPU. This needs to be done while blocking updates to the tasklist
>>> to avoid leaving newly-created child tasks assigned to the old
>>> CLOSID/RMID.
>>
>> This is not clear to me. rdt_move_group_tasks() obtains a read lock,
>> read_lock(&tasklist_lock), so concurrent modifications to the tasklist
>> are indeed possible. Should this perhaps be write_lock() instead?
>> It sounds like the scenario you are describing may be a concern. That is,
>> if a task belonging to a group that is being removed happens to
>> call fork()/clone() during the move then the child may end up being
>> created with old closid.
>
> Shouldn't read_lock(&tasklist_lock) cause write_lock(&tasklist_lock) to
> block?

Apologies, yes, you are right. I was focusing too much on the detail of
task_struct::tasks being a RCU list and lost sight of the actual lock
obtained (rcu_read_lock() vs read_lock()) in this instance.

Reinette