2020-11-18 18:04:27

by Valentin Schneider

Subject: [PATCH 0/2] x86/intel_rdt: task_work vs task_struct rmid/closid write race

Hi folks,

This is a small cleanup + a fix for a race I stumbled upon while staring at
resctrl stuff.

Briefly tested on a Xeon Gold 5120 (m2.xlarge.x86 on Equinix) by bouncing a few
tasks around control groups.

Cheers,
Valentin

Valentin Schneider (2):
x86/intel_rdt: Check monitor group vs control group membership earlier
x86/intel_rdt: Plug task_work vs task_struct {rmid,closid} update race

arch/x86/include/asm/resctrl.h | 11 ++--
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 71 ++++++++++++++++----------
2 files changed, 50 insertions(+), 32 deletions(-)

--
2.27.0


2020-11-18 18:05:47

by Valentin Schneider

Subject: [PATCH 1/2] x86/intel_rdt: Check monitor group vs control group membership earlier

A task can only be moved between monitor groups if both groups belong to
the same control group. This is checked fairly late, however: by that time
we have already queued a task_work callback.

Check the validity of the move before getting anywhere near task_work
callbacks.

Signed-off-by: Valentin Schneider <[email protected]>
---
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index af323e2e3100..b6b5b95df833 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -581,12 +581,7 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
tsk->closid = rdtgrp->closid;
tsk->rmid = rdtgrp->mon.rmid;
} else if (rdtgrp->type == RDTMON_GROUP) {
- if (rdtgrp->mon.parent->closid == tsk->closid) {
- tsk->rmid = rdtgrp->mon.rmid;
- } else {
- rdt_last_cmd_puts("Can't move task to different control group\n");
- ret = -EINVAL;
- }
+ tsk->rmid = rdtgrp->mon.rmid;
}
}
return ret;
@@ -673,9 +668,19 @@ static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp,
rcu_read_unlock();

ret = rdtgroup_task_write_permission(tsk, of);
- if (!ret)
- ret = __rdtgroup_move_task(tsk, rdtgrp);
+ if (ret)
+ goto out;

+ if (rdtgrp->type == RDTMON_GROUP &&
+ rdtgrp->mon.parent->closid != tsk->closid) {
+ rdt_last_cmd_puts("Can't move task to different control group\n");
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ret = __rdtgroup_move_task(tsk, rdtgrp);
+
+out:
put_task_struct(tsk);
return ret;
}
--
2.27.0

2020-11-18 18:06:41

by Valentin Schneider

Subject: [PATCH 2/2] x86/intel_rdt: Plug task_work vs task_struct {rmid,closid} update race

Upon moving a task to a new control / monitor group, said task's {closid,
rmid} fields are updated *after* triggering the move_myself() task_work
callback. This can cause said callback to miss the update, e.g. if the
triggering thread got preempted before fiddling with task_struct, or if the
targeted task was already on its way to return to userspace.

Update the task_struct's {closid, rmid} tuple *before* invoking
task_work_add(). As such writes can race with concurrent reads, wrap {closid, rmid}
accesses with READ_ONCE() and WRITE_ONCE(). Highlight the required ordering
with a pair of comments.

Fixes: e02737d5b826 ("x86/intel_rdt: Add tasks files")
Signed-off-by: Valentin Schneider <[email protected]>
---
arch/x86/include/asm/resctrl.h | 11 ++++--
arch/x86/kernel/cpu/resctrl/rdtgroup.c | 54 +++++++++++++++-----------
2 files changed, 39 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/resctrl.h b/arch/x86/include/asm/resctrl.h
index 07603064df8f..d60ed0668a59 100644
--- a/arch/x86/include/asm/resctrl.h
+++ b/arch/x86/include/asm/resctrl.h
@@ -56,19 +56,22 @@ static void __resctrl_sched_in(void)
struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
u32 closid = state->default_closid;
u32 rmid = state->default_rmid;
+ u32 tmp;

/*
* If this task has a closid/rmid assigned, use it.
* Else use the closid/rmid assigned to this cpu.
*/
if (static_branch_likely(&rdt_alloc_enable_key)) {
- if (current->closid)
- closid = current->closid;
+ tmp = READ_ONCE(current->closid);
+ if (tmp)
+ closid = tmp;
}

if (static_branch_likely(&rdt_mon_enable_key)) {
- if (current->rmid)
- rmid = current->rmid;
+ tmp = READ_ONCE(current->rmid);
+ if (tmp)
+ rmid = tmp;
}

if (closid != state->cur_closid || rmid != state->cur_rmid) {
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index b6b5b95df833..135a51529f70 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -524,11 +524,13 @@ static void move_myself(struct callback_head *head)
* If resource group was deleted before this task work callback
* was invoked, then assign the task to root group and free the
* resource group.
+ *
+ * See pairing atomic_inc() in __rdtgroup_move_task()
*/
if (atomic_dec_and_test(&rdtgrp->waitcount) &&
(rdtgrp->flags & RDT_DELETED)) {
- current->closid = 0;
- current->rmid = 0;
+ WRITE_ONCE(current->closid, 0);
+ WRITE_ONCE(current->rmid, 0);
kfree(rdtgrp);
}

@@ -553,14 +555,32 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
callback = kzalloc(sizeof(*callback), GFP_KERNEL);
if (!callback)
return -ENOMEM;
- callback->work.func = move_myself;
+
+ init_task_work(&callback->work, move_myself);
callback->rdtgrp = rdtgrp;

+ /*
+ * For ctrl_mon groups move both closid and rmid.
+ * For monitor groups, can move the tasks only from
+ * their parent CTRL group.
+ */
+ if (rdtgrp->type == RDTCTRL_GROUP)
+ WRITE_ONCE(tsk->closid, rdtgrp->closid);
+ WRITE_ONCE(tsk->rmid, rdtgrp->mon.rmid);
+
/*
* Take a refcount, so rdtgrp cannot be freed before the
* callback has been invoked.
+ *
+ * Also ensures above {closid, rmid} writes are observed by
+ * move_myself(), as it can run immediately after task_work_add().
+ * Otherwise old values may be loaded, and the move will only actually
+ * happen at the next context switch.
+ *
+ * Pairs with atomic_dec() in move_myself().
*/
atomic_inc(&rdtgrp->waitcount);
+
ret = task_work_add(tsk, &callback->work, TWA_RESUME);
if (ret) {
/*
@@ -571,18 +591,6 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
atomic_dec(&rdtgrp->waitcount);
kfree(callback);
rdt_last_cmd_puts("Task exited\n");
- } else {
- /*
- * For ctrl_mon groups move both closid and rmid.
- * For monitor groups, can move the tasks only from
- * their parent CTRL group.
- */
- if (rdtgrp->type == RDTCTRL_GROUP) {
- tsk->closid = rdtgrp->closid;
- tsk->rmid = rdtgrp->mon.rmid;
- } else if (rdtgrp->type == RDTMON_GROUP) {
- tsk->rmid = rdtgrp->mon.rmid;
- }
}
return ret;
}
@@ -590,13 +598,15 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
static bool is_closid_match(struct task_struct *t, struct rdtgroup *r)
{
return (rdt_alloc_capable &&
- (r->type == RDTCTRL_GROUP) && (t->closid == r->closid));
+ (r->type == RDTCTRL_GROUP) &&
+ (READ_ONCE(t->closid) == r->closid));
}

static bool is_rmid_match(struct task_struct *t, struct rdtgroup *r)
{
return (rdt_mon_capable &&
- (r->type == RDTMON_GROUP) && (t->rmid == r->mon.rmid));
+ (r->type == RDTMON_GROUP) &&
+ (READ_ONCE(t->rmid) == r->mon.rmid));
}

/**
@@ -672,7 +682,7 @@ static int rdtgroup_move_task(pid_t pid, struct rdtgroup *rdtgrp,
goto out;

if (rdtgrp->type == RDTMON_GROUP &&
- rdtgrp->mon.parent->closid != tsk->closid) {
+ rdtgrp->mon.parent->closid != READ_ONCE(tsk->closid)) {
rdt_last_cmd_puts("Can't move task to different control group\n");
ret = -EINVAL;
goto out;
@@ -802,7 +812,7 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
rdtg->mode != RDT_MODE_EXCLUSIVE)
continue;

- if (rdtg->closid != tsk->closid)
+ if (rdtg->closid != READ_ONCE(tsk->closid))
continue;

seq_printf(s, "res:%s%s\n", (rdtg == &rdtgroup_default) ? "/" : "",
@@ -810,7 +820,7 @@ int proc_resctrl_show(struct seq_file *s, struct pid_namespace *ns,
seq_puts(s, "mon:");
list_for_each_entry(crg, &rdtg->mon.crdtgrp_list,
mon.crdtgrp_list) {
- if (tsk->rmid != crg->mon.rmid)
+ if (READ_ONCE(tsk->rmid) != crg->mon.rmid)
continue;
seq_printf(s, "%s", crg->kn->name);
break;
@@ -2328,8 +2338,8 @@ static void rdt_move_group_tasks(struct rdtgroup *from, struct rdtgroup *to,
for_each_process_thread(p, t) {
if (!from || is_closid_match(t, from) ||
is_rmid_match(t, from)) {
- t->closid = to->closid;
- t->rmid = to->mon.rmid;
+ WRITE_ONCE(t->closid, to->closid);
+ WRITE_ONCE(t->rmid, to->mon.rmid);

#ifdef CONFIG_SMP
/*
--
2.27.0

2020-11-20 14:56:32

by James Morse

Subject: Re: [PATCH 2/2] x86/intel_rdt: Plug task_work vs task_struct {rmid,closid} update race

Hi Valentin,

On 18/11/2020 18:00, Valentin Schneider wrote:
> Upon moving a task to a new control / monitor group, said task's {closid,
> rmid} fields are updated *after* triggering the move_myself() task_work
> callback. This can cause said callback to miss the update, e.g. if the
> triggering thread got preempted before fiddling with task_struct, or if the
> targeted task was already on its way to return to userspace.

So, if move_myself() runs after task_work_add() but before tsk is written to.
Sounds fun!
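
IIUC, something like:

	  mover					tsk
	  ----					---
	  task_work_add(tsk, ...)
						returns to userspace
						move_myself()
						  resctrl_sched_in() /* loads the *old* closid/rmid */
	  tsk->closid = rdtgrp->closid;		/* too late, only takes effect */
	  tsk->rmid = rdtgrp->mon.rmid;		/* at the next context switch  */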


> Update the task_struct's {closid, rmid} tuple *before* invoking
> task_work_add(). As such writes can race with concurrent reads, wrap {closid, rmid}
> accesses with READ_ONCE() and WRITE_ONCE(). Highlight the required ordering
> with a pair of comments.

... and this one is if move_myself() or __resctrl_sched_in() runs while tsk is being
written to on another CPU. It might get torn values, or multiple reads might return different values.

The READ_ONCE/WRITE_ONCEry would have been easier to read as a separate patch as you touch
all sites, and move/change some of them.

Regardless:
Reviewed-by: James Morse <[email protected]>


I don't 'get' memory-ordering, so one curiosity below:

> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> index b6b5b95df833..135a51529f70 100644
> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
> @@ -524,11 +524,13 @@ static void move_myself(struct callback_head *head)
> * If resource group was deleted before this task work callback
> * was invoked, then assign the task to root group and free the
> * resource group.
> + *
> + * See pairing atomic_inc() in __rdtgroup_move_task()
> */
> if (atomic_dec_and_test(&rdtgrp->waitcount) &&
> (rdtgrp->flags & RDT_DELETED)) {
> - current->closid = 0;
> - current->rmid = 0;
> + WRITE_ONCE(current->closid, 0);
> + WRITE_ONCE(current->rmid, 0);
> kfree(rdtgrp);
> }
>
> @@ -553,14 +555,32 @@ static int __rdtgroup_move_task(struct task_struct *tsk,

> /*
> * Take a refcount, so rdtgrp cannot be freed before the
> * callback has been invoked.
> + *
> + * Also ensures above {closid, rmid} writes are observed by
> + * move_myself(), as it can run immediately after task_work_add().
> + * Otherwise old values may be loaded, and the move will only actually
> + * happen at the next context switch.

But __resctrl_sched_in() can still occur at any time and READ_ONCE() a pair of values that
don't go together?
I don't think this is a problem for RDT as with old-rmid the task was a member of that
monitor-group previously, and 'freed' rmid are kept in limbo for a while after.
(old-closid is the same as the task having not schedule()d since the change, which is fine).

For MPAM, this is more annoying as changing just the closid may put the task in a
monitoring group that never existed, meaning it shows up as surprise dirty later.

If this all makes sense, I guess the fix (for much later) is to union closid/rmid, and
WRITE_ONCE() them together where necessary.
(I've made a note for when I next pass that part of the MPAM tree)
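
Something like the below, maybe - field names entirely made up, and it wants
more thought on 32-bit where a 64-bit store isn't single-copy atomic:

	/* hypothetical replacement for the separate task_struct closid/rmid fields */
	union rdt_ids {
		struct {
			u32	closid;
			u32	rmid;
		};
		u64	val;
	};

	/* writer */
	union rdt_ids ids;

	ids.closid = rdtgrp->closid;
	ids.rmid   = rdtgrp->mon.rmid;
	WRITE_ONCE(tsk->rdt_ids.val, ids.val);

	/* reader, e.g. __resctrl_sched_in() */
	union rdt_ids ids;

	ids.val = READ_ONCE(current->rdt_ids.val);
	/* ids.closid / ids.rmid now form a consistent pair */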


> + *
> + * Pairs with atomic_dec() in move_myself().
> */
> atomic_inc(&rdtgrp->waitcount);
> +
> ret = task_work_add(tsk, &callback->work, TWA_RESUME);
> if (ret) {
> /*


Thanks!

James

2020-11-20 14:58:05

by James Morse

Subject: Re: [PATCH 1/2] x86/intel_rdt: Check monitor group vs control group membership earlier

Hi Valentin,

On 18/11/2020 18:00, Valentin Schneider wrote:
> A task can only be moved between monitor groups if both groups belong to
> the same control group. This is checked fairly late, however: by that time
> we have already queued a task_work callback.

(is that a problem? It's needed to do the kfree())


> Check the validity of the move before getting anywhere near task_work
> callbacks.

This saves the kzalloc()/task_work_add() when it isn't going to be necessary.

Reviewed-by: James Morse <[email protected]>


Thanks,

James

2020-11-20 15:56:39

by Valentin Schneider

Subject: Re: [PATCH 1/2] x86/intel_rdt: Check monitor group vs control group membership earlier


Hi James,

On 20/11/20 14:53, James Morse wrote:
> Hi Valentin,
>
> On 18/11/2020 18:00, Valentin Schneider wrote:
>> A task can only be moved between monitor groups if both groups belong to
>> the same control group. This is checked fairly late, however: by that time
>> we have already queued a task_work callback.
>
> (is that a problem? It's needed to do the kfree())
>
>
>> Check the validity of the move before getting anywhere near task_work
>> callbacks.
>
> This saves the kzalloc()/task_work_add() when it isn't going to be necessary.
>

Right; to spell it out a bit more clearly:

In such cases (invalid move), __rdtgroup_move_task() would trigger a
move_myself() task_work callback without updating {closid, rmid}.
Given nothing changed (barring concurrent updates), move_myself() won't do
any useful work.

The task_work_add() and associated alloc could thus be entirely avoided.
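
Roughly, with this patch the flow becomes:

	rdtgroup_move_task()
	  rdtgroup_task_write_permission()
	  /* mon group vs ctrl group membership now checked here */
	  __rdtgroup_move_task()
	    kzalloc() + task_work_add()	/* only reached for moves that can succeed */

whereas previously the membership check sat at the tail end of
__rdtgroup_move_task(), after the callback had already been queued.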

> Reviewed-by: James Morse <[email protected]>
>

Thanks!

>
> Thanks,
>
> James

2020-11-20 15:58:03

by Valentin Schneider

Subject: Re: [PATCH 2/2] x86/intel_rdt: Plug task_work vs task_struct {rmid,closid} update race


Hi James,

On 20/11/20 14:53, James Morse wrote:
> Hi Valentin,
>
> On 18/11/2020 18:00, Valentin Schneider wrote:
>> Upon moving a task to a new control / monitor group, said task's {closid,
>> rmid} fields are updated *after* triggering the move_myself() task_work
>> callback. This can cause said callback to miss the update, e.g. if the
>> triggering thread got preempted before fiddling with task_struct, or if the
>> targeted task was already on its way to return to userspace.
>
> So, if move_myself() runs after task_work_add() but before tsk is written to.
> Sounds fun!
>
>
>> Update the task_struct's {closid, rmid} tuple *before* invoking
>> task_work_add(). As such writes can race with concurrent reads, wrap {closid, rmid}
>> accesses with READ_ONCE() and WRITE_ONCE(). Highlight the required ordering
>> with a pair of comments.
>
> ... and this one is if move_myself() or __resctrl_sched_in() runs while tsk is being
> written to on another CPU. It might get torn values, or multiple reads might return different values.
>
> The READ_ONCE/WRITE_ONCEry would have been easier to read as a separate patch as you touch
> all sites, and move/change some of them.
>

True, I initially only fixed up the reads/writes involved with
__rdtgroup_move_task(), but ended up coccinelle'ing the whole lot - which I
should have then moved to a dedicated patch. Thanks for powering through
it, I'll send a v2 with a neater split.

> Regardless:
> Reviewed-by: James Morse <[email protected]>
>

Thanks!

>
> I don't 'get' memory-ordering, so one curiosity below:
>
>> diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> index b6b5b95df833..135a51529f70 100644
>> --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
>> @@ -524,11 +524,13 @@ static void move_myself(struct callback_head *head)
>> * If resource group was deleted before this task work callback
>> * was invoked, then assign the task to root group and free the
>> * resource group.
>> + *
>> + * See pairing atomic_inc() in __rdtgroup_move_task()
>> */
>> if (atomic_dec_and_test(&rdtgrp->waitcount) &&
>> (rdtgrp->flags & RDT_DELETED)) {
>> - current->closid = 0;
>> - current->rmid = 0;
>> + WRITE_ONCE(current->closid, 0);
>> + WRITE_ONCE(current->rmid, 0);
>> kfree(rdtgrp);
>> }
>>
>> @@ -553,14 +555,32 @@ static int __rdtgroup_move_task(struct task_struct *tsk,
>
>> /*
>> * Take a refcount, so rdtgrp cannot be freed before the
>> * callback has been invoked.
>> + *
>> + * Also ensures above {closid, rmid} writes are observed by
>> + * move_myself(), as it can run immediately after task_work_add().
>> + * Otherwise old values may be loaded, and the move will only actually
>> + * happen at the next context switch.
>
> But __resctrl_sched_in() can still occur at any time and READ_ONCE() a pair of values that
> don't go together?

Yes, the thought did cross my mind...
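
e.g. with a move from {closid0, rmid0} to {closid1, rmid1}, something like:

	  mover					tsk (on another CPU)
	  WRITE_ONCE(tsk->closid, closid1);
						__resctrl_sched_in()
						  closid = READ_ONCE(current->closid); /* closid1 */
						  rmid   = READ_ONCE(current->rmid);   /* rmid0, mismatched pair */
	  WRITE_ONCE(tsk->rmid, rmid1);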

> I don't think this is a problem for RDT as with old-rmid the task was a member of that
> monitor-group previously, and 'freed' rmid are kept in limbo for a while after.
> (old-closid is the same as the task having not schedule()d since the change, which is fine).
>
> For MPAM, this is more annoying as changing just the closid may put the task in a
> monitoring group that never existed, meaning it shows up as surprise dirty later.
>
> If this all makes sense, I guess the fix (for much later) is to union closid/rmid, and
> WRITE_ONCE() them together where necessary.
> (I've made a note for when I next pass that part of the MPAM tree)
>

It does make sense to me - one more question back to you: can RDT exist on
an X86_32 system? It shouldn't be a stopper, but would be an inconvenience.

FWIW kernel/sched/fair.c uses two synced u64's for this; see

struct cfs_rq { .min_vruntime, .min_vruntime_copy }

and

kernel/sched/fair.c:update_min_vruntime()
kernel/sched/fair.c:migrate_task_rq_fair()
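
i.e., from memory, roughly:

	/* writer, update_min_vruntime() */
	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
#ifndef CONFIG_64BIT
	smp_wmb();
	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif

	/* remote reader, migrate_task_rq_fair() */
#ifndef CONFIG_64BIT
	do {
		min_vruntime_copy = cfs_rq->min_vruntime_copy;
		smp_rmb();
		min_vruntime = cfs_rq->min_vruntime;
	} while (min_vruntime != min_vruntime_copy);
#else
	min_vruntime = cfs_rq->min_vruntime;
#endif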
>
>> + *
>> + * Pairs with atomic_dec() in move_myself().
>> */
>> atomic_inc(&rdtgrp->waitcount);
>> +
>> ret = task_work_add(tsk, &callback->work, TWA_RESUME);
>> if (ret) {
>> /*
>
>
> Thanks!
>
> James