Unlocked access to dst_rq->curr in task_numa_compare() is racy.
If curr task is exiting this may be a reason of use-after-free:
task_numa_compare() do_exit()
rcu_read_lock() schedule()
cur = ACCESS_ONCE(dst_rq->curr) ...
... rq->curr = next;
... context_switch()
... finish_task_switch()
... put_task_struct()
... __put_task_struct()
... free_task_struct()
task_numa_assign() ...
get_task_struct() ...
As noted by Oleg:
<<The lockless get_task_struct(tsk) is only safe if tsk == current
and didn't pass exit_notify(), or if this tsk was found on a rcu
protected list (say, for_each_process() or find_task_by_vpid()).
IOW, it is only safe if release_task() was not called before we
take rcu_read_lock(), in this case we can rely on the fact that
delayed_put_pid() can not drop the (potentially) last reference
until rcu_read_unlock().
And as Kirill pointed out task_numa_compare()->task_numa_assign()
path does get_task_struct(dst_rq->curr) and this is not safe. The
task_struct itself can't go away, but rcu_read_lock() can't save
us from the final put_task_struct() in finish_task_switch(); this
reference goes away without rcu gp>>
The patch adds SLAB_DESTROY_BY_RCU flag to task_struct allocation
cache options. This guarantees that dst_rq->curr memory can't become
unmapped during RCU gp, and we may safely directly read it.
Also it adds rq_curr_if_not_exiting() function, which returns dst->curr
(at time of call) only if delayed_put_task_struct() callback hasn't
been called for its task_struct yet. This means the returned memory
is still a task while we are under RCU lock (and its task_struct::usage
is not zero), so we can safely use {get,put}_task_struct() to manipulate
with it.
Signed-off-by: Kirill Tkhai <[email protected]>
Suggested-by: Oleg Nesterov <[email protected]>
---
kernel/fork.c | 9 +++++++--
kernel/sched/fair.c | 38 ++++++++++++++++++++++++++++++++++++--
2 files changed, 43 insertions(+), 4 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 9b7d746..72b5e73 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -259,10 +259,15 @@ void __init fork_init(unsigned long mempages)
#ifndef ARCH_MIN_TASKALIGN
#define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
#endif
- /* create a slab on which task_structs can be allocated */
+ /*
+ * Create a slab on which task_structs can be allocated.
+ * Note, we need SLAB_DESTROY_BY_RCU flag, when we access
+ * rq::curr under RCU read lock. See scheduler code.
+ */
task_struct_cachep =
kmem_cache_create("task_struct", sizeof(struct task_struct),
- ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL);
+ ARCH_MIN_TASKALIGN,
+ SLAB_PANIC | SLAB_NOTRACK | SLAB_DESTROY_BY_RCU, NULL);
#endif
/* do the arch specific task caches init */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b069bf..d2d1625 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1147,6 +1147,39 @@ static bool load_too_imbalanced(long src_load, long dst_load,
}
/*
+ * Return rq->curr if delayed_put_task_struct() callback hasn't
+ * been called for its task_struct yet).
+ *
+ * If result is not NULL, it is safe to use it like it'd be
+ * picked from RCU-protected list (use get_task_struct() etc).
+ */
+static struct task_struct *rq_curr_if_not_put(struct rq *rq)
+{
+ struct task_struct *cur = ACCESS_ONCE(rq->curr);
+
+ rcu_lockdep_assert(rcu_read_lock_held(), "RCU lock must be held");
+
+ if (cur->flags & PF_EXITING)
+ return NULL;
+
+ smp_rmb(); /* Pairs with smp_mb() in do_exit() */
+
+ /*
+ * We've reached here. Three situations are possible:
+ * 1)cur has gone, and dst_rq->curr is pointing to other memory.
+ * 2)cur is pointing to a new task, which is using the memory of
+ * just gone and freed cur (and it is new dst_rq->curr). It is
+ * OK, because we've locked RCU even before the new task has been
+ * created (so delayed_put_task_struct() hasn't been called yet);
+ * 3)we've taken a not exiting task (likely case). No need to worry.
+ */
+ if (cur != ACCESS_ONCE(rq->curr))
+ cur = NULL;
+
+ return cur;
+}
+
+/*
* This checks if the overall compute and NUMA accesses of the system would
* be improved if the source tasks was migrated to the target dst_cpu taking
* into account that it might be best if task running on the dst_cpu should
@@ -1164,8 +1197,9 @@ static void task_numa_compare(struct task_numa_env *env,
long moveimp = imp;
rcu_read_lock();
- cur = ACCESS_ONCE(dst_rq->curr);
- if (cur->pid == 0) /* idle */
+ cur = rq_curr_if_not_put(dst_rq);
+
+ if (cur && is_idle_task(cur))
cur = NULL;
/*
Kirill,
I leave this to you and Peter, but...
On 10/20, Kirill Tkhai wrote:
>
> @@ -259,10 +259,15 @@ void __init fork_init(unsigned long mempages)
> #ifndef ARCH_MIN_TASKALIGN
> #define ARCH_MIN_TASKALIGN L1_CACHE_BYTES
> #endif
> - /* create a slab on which task_structs can be allocated */
> + /*
> + * Create a slab on which task_structs can be allocated.
> + * Note, we need SLAB_DESTROY_BY_RCU flag, when we access
> + * rq::curr under RCU read lock. See scheduler code.
> + */
> task_struct_cachep =
> kmem_cache_create("task_struct", sizeof(struct task_struct),
> - ARCH_MIN_TASKALIGN, SLAB_PANIC | SLAB_NOTRACK, NULL);
> + ARCH_MIN_TASKALIGN,
> + SLAB_PANIC | SLAB_NOTRACK | SLAB_DESTROY_BY_RCU, NULL);
to me this change needs more justification.
Again, perhaps we will need to change the lifetime rules for task_struct
anyway, if we have more problems like this. But until then this looks like
an overkill to me. Plus rq_curr_if_not_put() looks too subtle, and it is
not generic.
May be we should start with something simple and stupid?
(it seems we can remove rcu_read_lock() with this patch, but I am not
sure).
Oleg.
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1087,9 +1087,6 @@ static void task_numa_assign(struct task_numa_env *env,
{
if (env->best_task)
put_task_struct(env->best_task);
- if (p)
- get_task_struct(p);
-
env->best_task = p;
env->best_imp = imp;
env->best_cpu = env->dst_cpu;
@@ -1139,6 +1136,18 @@ static bool load_too_imbalanced(long src_load, long dst_load,
return (imb > old_imb);
}
+struct task_struct *get_rq_curr(struct rq *rq)
+{
+ struct task_struct *curr;
+
+ raw_spin_lock_irq(&rq->lock);
+ curr = rq->curr;
+ get_task_struct(curr);
+ raw_spin_unlock_irq(&rq->lock);
+
+ return curr;
+}
+
/*
* This checks if the overall compute and NUMA accesses of the system would
* be improved if the source tasks was migrated to the target dst_cpu taking
@@ -1156,11 +1165,9 @@ static void task_numa_compare(struct task_numa_env *env,
long imp = env->p->numa_group ? groupimp : taskimp;
long moveimp = imp;
- rcu_read_lock();
- cur = ACCESS_ONCE(dst_rq->curr);
- if (cur->pid == 0) /* idle */
- cur = NULL;
+ cur = get_rq_curr(dst_rq);
+ rcu_read_lock();
/*
* "imp" is the fault differential for the source task between the
* source and destination node. Calculate the total differential for
@@ -1235,6 +1242,7 @@ balance:
*/
if (!load_too_imbalanced(src_load, dst_load, env)) {
imp = moveimp - 1;
+ put_task_struct(cur);
cur = NULL;
goto assign;
}
@@ -1254,8 +1262,12 @@ balance:
assign:
task_numa_assign(env, cur, imp);
+ cur = NULL;
unlock:
rcu_read_unlock();
+
+ if (cur)
+ put_task_struct(cur);
}
static void task_numa_find_cpu(struct task_numa_env *env,
On 10/20, Oleg Nesterov wrote:
>
> Again, perhaps we will need to change the lifetime rules for task_struct
> anyway, if we have more problems like this. But until then this looks like
> an overkill to me. Plus rq_curr_if_not_put() looks too subtle, and it is
> not generic.
Yes... otoh, perhaps we can do something more generic? Something like
struct task_struct *xxx(struct task_struct **ptask)
{
struct task_struct *task;
void *sighand;
retry:
task = ACCESS_ONCE(*ptask);
if (!task)
return NULL;
if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
if (probe_kernel_read(&sighand, &task->sighand, sizeof(sighand)))
goto retry;
} else {
sighand = task->sighand;
}
if (!sighand)
return NULL;
/*
* Pairs with atomic_dec_and_test() in put_task_struct(task).
* If we have read the freed/reused memory, we must see that
* the pointer was updated.
*/
smp_rmb();
if (task != ACCESS_ONCE(*ptask))
goto retry;
return task;
}
task_numa_compare() can do cur = xxx(&rc->curr), but this helper can work
with any "task_struct *" pointer assuming that somehow this pointer is
cleared or changed before the final put_task_struct().
What do you think? Peter?
Oleg.
On 10/20, Oleg Nesterov wrote:
>
> On 10/20, Oleg Nesterov wrote:
> >
> > Again, perhaps we will need to change the lifetime rules for task_struct
> > anyway, if we have more problems like this. But until then this looks like
> > an overkill to me. Plus rq_curr_if_not_put() looks too subtle, and it is
> > not generic.
>
> Yes... otoh, perhaps we can do something more generic? Something like
>
> struct task_struct *xxx(struct task_struct **ptask)
> {
> struct task_struct *task;
> void *sighand;
> retry:
> task = ACCESS_ONCE(*ptask);
> if (!task)
> return NULL;
>
> if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
> if (probe_kernel_read(&sighand, &task->sighand, sizeof(sighand)))
> goto retry;
> } else {
> sighand = task->sighand;
> }
>
> if (!sighand)
> return NULL;
> /*
> * Pairs with atomic_dec_and_test() in put_task_struct(task).
> * If we have read the freed/reused memory, we must see that
> * the pointer was updated.
> */
> smp_rmb();
> if (task != ACCESS_ONCE(*ptask))
> goto retry;
>
> return task;
> }
>
> task_numa_compare() can do cur = xxx(&rc->curr), but this helper can work
> with any "task_struct *" pointer assuming that somehow this pointer is
> cleared or changed before the final put_task_struct().
>
> What do you think? Peter?
And if we introduce this helper, it would better to check "sighand != NULL"
after "task != *ptask":
struct task_struct *xxx(struct task_struct **ptask)
{
struct task_struct *task;
struct sighand_struct *sighand;
retry:
task = ACCESS_ONCE(*ptask);
if (!task)
return task;
if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
if (probe_kernel_read(&sighand, &task->sighand, sizeof(sighand)))
goto retry;
} else {
sighand = task->sighand;
}
/*
* Pairs with atomic_dec_and_test() in put_task_struct(task).
* If we have read the freed/reused memory, we must see that
* the pointer was updated.
*/
smp_rmb();
if (unlikely(task != ACCESS_ONCE(*ptask)))
goto retry;
/*
* release_task(task) was already called, potentially before
* the caller took rcu_read_lock() and in this case it can be
* freed before rcu_read_unlock().
*/
if (!sighand)
return NULL;
return task;
}
Of course, task_numa_compare() do not really need "retry", and task == NULL
is not possible. But this way the new helper can (probably) have more users,
and this just looks better imo.
Oleg.
Hi, Oleg,
On 20.10.2014 22:27, Oleg Nesterov wrote:
> On 10/20, Oleg Nesterov wrote:
>>
>> On 10/20, Oleg Nesterov wrote:
>>>
>>> Again, perhaps we will need to change the lifetime rules for task_struct
>>> anyway, if we have more problems like this. But until then this looks like
>>> an overkill to me. Plus rq_curr_if_not_put() looks too subtle, and it is
>>> not generic.
>>
>> Yes... otoh, perhaps we can do something more generic? Something like
>>
>> struct task_struct *xxx(struct task_struct **ptask)
>> {
>> struct task_struct *task;
>> void *sighand;
>> retry:
>> task = ACCESS_ONCE(*ptask);
>> if (!task)
>> return NULL;
>>
>> if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
>> if (probe_kernel_read(&sighand, &task->sighand, sizeof(sighand)))
>> goto retry;
>> } else {
>> sighand = task->sighand;
>> }
>>
>> if (!sighand)
>> return NULL;
>> /*
>> * Pairs with atomic_dec_and_test() in put_task_struct(task).
>> * If we have read the freed/reused memory, we must see that
>> * the pointer was updated.
>> */
>> smp_rmb();
>> if (task != ACCESS_ONCE(*ptask))
>> goto retry;
>>
>> return task;
>> }
>>
>> task_numa_compare() can do cur = xxx(&rc->curr), but this helper can work
>> with any "task_struct *" pointer assuming that somehow this pointer is
>> cleared or changed before the final put_task_struct().
>>
>> What do you think? Peter?
>
> And if we introduce this helper, it would better to check "sighand != NULL"
> after "task != *ptask":
>
> struct task_struct *xxx(struct task_struct **ptask)
> {
> struct task_struct *task;
> struct sighand_struct *sighand;
>
> retry:
> task = ACCESS_ONCE(*ptask);
> if (!task)
> return task;
>
> if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
> if (probe_kernel_read(&sighand, &task->sighand, sizeof(sighand)))
> goto retry;
> } else {
> sighand = task->sighand;
> }
> /*
> * Pairs with atomic_dec_and_test() in put_task_struct(task).
> * If we have read the freed/reused memory, we must see that
> * the pointer was updated.
> */
> smp_rmb();
> if (unlikely(task != ACCESS_ONCE(*ptask)))
> goto retry;
> /*
> * release_task(task) was already called, potentially before
> * the caller took rcu_read_lock() and in this case it can be
> * freed before rcu_read_unlock().
> */
> if (!sighand)
> return NULL;
> return task;
> }
>
> Of course, task_numa_compare() do not really need "retry", and task == NULL
> is not possible. But this way the new helper can (probably) have more users,
> and this just looks better imo.
I think generic helper is a good idea. The prototype looks OK.
But I'm a little doubt about retry loop. If this helper is generic and
one day it may move to ./include directory, isn't there a probability
people will use it wrong? This loop may bring delays or something bad.
Nothing specific, just suspicion...
And since we still depends on RCU, I'd suggest to add its lockdep assert.
Kirill
On 10/21, Kirill Tkhai wrote:
>
> I think generic helper is a good idea. The prototype looks OK.
>
> But I'm a little doubt about retry loop. If this helper is generic and
> one day it may move to ./include directory,
Well, if we add a generic helper I think it should be exported even if
it has a single caller. But I agree this probably needs a justification
too.
> isn't there a probability
> people will use it wrong? This loop may bring delays or something bad.
Yes, I thought about livelock too. OK, we can remove it, just
s/goto retry/return NULL/. Or, perhaps better, return ERR_PTR(-EAGAIN)
so that the caller can know that retry is possible.
I do not really mind, and we can reconsider this later. And I will not
argue if you prefer to add the rq->curr specific hack (like your patch
does).
> And since we still depends on RCU, I'd suggest to add its lockdep assert.
Agreed.
Let me explain what I personally dislike in v3:
- I think that we do not have enough reasons for
SLAB_DESTROY_BY_RCU. This is the serious change.
probe_kernel_read() looks better to me, and hopefully
IS_ENABLED(DEBUG_PAGEALLOC) can make it conditional.
- PF_EXITING was fine in task_numa_compare(), but if we
move this logic into a helper (even if it is not exported)
then I think we need a more specific check. sighand == NULL
looks better to me because it clearly connects to
release_task() which makes this task_struct "rcu-unsafe".
- Again, perhaps we should start we a simple and stupid fix.
We can do get_task_struct() under rq->lock or, if nothing
else, just
raw_spin_lock_irq(&rq->lock);
cur = rq->curr;
if (is_idle_task(cur) || (cur->flags & PF_EXITING))
cur = NULL;
raw_spin_unlock_irq(&rq->lock);
Either way, I hope you will send v4 ;)
But probably you should wait for for Peter's opinion first.
Oleg.
On 21.10.2014 00:50, Oleg Nesterov wrote:
> On 10/21, Kirill Tkhai wrote:
>>
>> I think generic helper is a good idea. The prototype looks OK.
>>
>> But I'm a little doubt about retry loop. If this helper is generic and
>> one day it may move to ./include directory,
>
> Well, if we add a generic helper I think it should be exported even if
> it has a single caller. But I agree this probably needs a justification
> too.
>
>> isn't there a probability
>> people will use it wrong? This loop may bring delays or something bad.
>
> Yes, I thought about livelock too. OK, we can remove it, just
> s/goto retry/return NULL/. Or, perhaps better, return ERR_PTR(-EAGAIN)
> so that the caller can know that retry is possible.
>
> I do not really mind, and we can reconsider this later. And I will not
> argue if you prefer to add the rq->curr specific hack (like your patch
> does).
>
>> And since we still depends on RCU, I'd suggest to add its lockdep assert.
>
> Agreed.
>
> Let me explain what I personally dislike in v3:
>
> - I think that we do not have enough reasons for
> SLAB_DESTROY_BY_RCU. This is the serious change.
>
> probe_kernel_read() looks better to me, and hopefully
> IS_ENABLED(DEBUG_PAGEALLOC) can make it conditional.
>
> - PF_EXITING was fine in task_numa_compare(), but if we
> move this logic into a helper (even if it is not exported)
> then I think we need a more specific check. sighand == NULL
> looks better to me because it clearly connects to
> release_task() which makes this task_struct "rcu-unsafe".
>
> - Again, perhaps we should start we a simple and stupid fix.
> We can do get_task_struct() under rq->lock or, if nothing
> else, just
>
> raw_spin_lock_irq(&rq->lock);
> cur = rq->curr;
> if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> cur = NULL;
> raw_spin_unlock_irq(&rq->lock);
>
> Either way, I hope you will send v4 ;)
No, I won't send. Please do this. Your idea and your patch is almost
ready. Thanks :)
> But probably you should wait for for Peter's opinion first.
Yeah.
Kirill
Kirill,
You should be as rude as you can if you send an email to lkml! You
are breaking the well-known rules, and now I have to mimic your wrong
behaviour.
On 10/21, Kirill Tkhai wrote:
>
> No, I won't send. Please do this. Your idea and your patch is almost
> ready. Thanks :)
Come on. You have found the nontrivial problem. This is much more
important than a simple fix.
And if we prefer a less-trivial fix which plays with ->curr recheck,
then it will be based on your idea, not mine.
So please make a patch after we know what Peter thinks.
Oleg.
On Mon, Oct 20, 2014 at 11:34:29PM +0200, Oleg Nesterov wrote:
> So please make a patch after we know what Peter thinks.
Peter thinks its about time to go sleep ;-) Sorry I was hiding today, I
wanted to get that speculative fault crap posted. I'll go look at this
when I wake up again. Zzzz...
On Mon, Oct 20, 2014 at 10:50:06PM +0200, Oleg Nesterov wrote:
> Let me explain what I personally dislike in v3:
>
> - I think that we do not have enough reasons for
> SLAB_DESTROY_BY_RCU. This is the serious change.
What exactly would the downsides be? SDBR has very limited space
overhead iirc.
> - Again, perhaps we should start we a simple and stupid fix.
> We can do get_task_struct() under rq->lock or, if nothing
> else, just
>
> raw_spin_lock_irq(&rq->lock);
> cur = rq->curr;
> if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> cur = NULL;
> raw_spin_unlock_irq(&rq->lock);
I think I agree with you, this is the simple safe option and is
something we can easily backport. After that we can add creative bits on
top.
I think I prefer the SLAB_DESTROY_BY_RCU thing over the probe_kernel
thing -- but we can take our time once we've fixed the immediate issue
with the simple option.
On 10/21, Peter Zijlstra wrote:
>
> On Mon, Oct 20, 2014 at 10:50:06PM +0200, Oleg Nesterov wrote:
> > Let me explain what I personally dislike in v3:
> >
> > - I think that we do not have enough reasons for
> > SLAB_DESTROY_BY_RCU. This is the serious change.
>
> What exactly would the downsides be? SDBR has very limited space
> overhead iirc.
Yes, SDBR is nice (and it could probably have more users), but my
concern is not overhead. Please see below.
> > - Again, perhaps we should start we a simple and stupid fix.
> > We can do get_task_struct() under rq->lock or, if nothing
> > else, just
> >
> > raw_spin_lock_irq(&rq->lock);
> > cur = rq->curr;
> > if (is_idle_task(cur) || (cur->flags & PF_EXITING))
> > cur = NULL;
> > raw_spin_unlock_irq(&rq->lock);
>
> I think I agree with you, this is the simple safe option and is
> something we can easily backport. After that we can add creative bits on
> top.
Agreed.
Kirill, could you please make a patch?
> I think I prefer the SLAB_DESTROY_BY_RCU thing over the probe_kernel
> thing
I won't really insist, but let me try to explain why I dislike it in
this particular case.
- It is not clear who else (except task_numa_compare) will need it.
And it looks at bit strange to make task_struct SLAB_DESTROY_BY_RCU
just to read a word in task_numa_compare().
- In some sense, the usage of SDBR looks simply "wrong" in this case.
IOW, I agree that probe_kernel_read() is ugly, but in this case
SDBR acts exactly the same way as probe_kernel_read().
SDBR does not make the object rcu-safe, it only protects the object
type plus ensures that this memory can't go away. It was designed
for the case when you read the stable members initialized in ctor
(usually a lock) and verify/lock the object.
But in this case we can not detect that the object is still alive
without the additional trick, so when you read ->sighand or ->flags,
the fact that this memory is still "struct task_struct" doesn't help
and doesn't matter at all. Only the subsequent "task == rq->curr"
check proves that yes, this is task_struct.
OTOH, (afaics) we only need probe_kernel_read() if CONFIG_DEBUG_SLAB.
And in fact I think that "read the valid but potentially freed kernel
pointer" deserves another helper, it can have more users. For example,
please look at get_freepointer_safe().
What if we add get_kernel(x, ptr) macro to factor out the IS_ENABLED()
of ifdef hack? Or inline function... This way the new xxx() helper we
discussed won't look that bad.
But again, I agree that this subjective, I won't really argue.
Oleg.
On 21.10.2014 23:03, Oleg Nesterov wrote:
> On 10/21, Peter Zijlstra wrote:
>>
>> On Mon, Oct 20, 2014 at 10:50:06PM +0200, Oleg Nesterov wrote:
>>> Let me explain what I personally dislike in v3:
>>>
>>> - I think that we do not have enough reasons for
>>> SLAB_DESTROY_BY_RCU. This is the serious change.
>>
>> What exactly would the downsides be? SDBR has very limited space
>> overhead iirc.
>
> Yes, SDBR is nice (and it could probably have more users), but my
> concern is not overhead. Please see below.
>
>>> - Again, perhaps we should start we a simple and stupid fix.
>>> We can do get_task_struct() under rq->lock or, if nothing
>>> else, just
>>>
>>> raw_spin_lock_irq(&rq->lock);
>>> cur = rq->curr;
>>> if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>>> cur = NULL;
>>> raw_spin_unlock_irq(&rq->lock);
>>
>> I think I agree with you, this is the simple safe option and is
>> something we can easily backport. After that we can add creative bits on
>> top.
>
> Agreed.
>
> Kirill, could you please make a patch?
Yeah, I'll send it tomorrow.
>> I think I prefer the SLAB_DESTROY_BY_RCU thing over the probe_kernel
>> thing
>
> I won't really insist, but let me try to explain why I dislike it in
> this particular case.
>
> - It is not clear who else (except task_numa_compare) will need it.
> And it looks at bit strange to make task_struct SLAB_DESTROY_BY_RCU
> just to read a word in task_numa_compare().
>
> - In some sense, the usage of SDBR looks simply "wrong" in this case.
> IOW, I agree that probe_kernel_read() is ugly, but in this case
> SDBR acts exactly the same way as probe_kernel_read().
>
> SDBR does not make the object rcu-safe, it only protects the object
> type plus ensures that this memory can't go away. It was designed
> for the case when you read the stable members initialized in ctor
> (usually a lock) and verify/lock the object.
>
> But in this case we can not detect that the object is still alive
> without the additional trick, so when you read ->sighand or ->flags,
> the fact that this memory is still "struct task_struct" doesn't help
> and doesn't matter at all. Only the subsequent "task == rq->curr"
> check proves that yes, this is task_struct.
>
> OTOH, (afaics) we only need probe_kernel_read() if CONFIG_DEBUG_SLAB.
> And in fact I think that "read the valid but potentially freed kernel
> pointer" deserves another helper, it can have more users. For example,
> please look at get_freepointer_safe().
>
> What if we add get_kernel(x, ptr) macro to factor out the IS_ENABLED()
> of ifdef hack? Or inline function... This way the new xxx() helper we
> discussed won't look that bad.
>
> But again, I agree that this subjective, I won't really argue.
So this patch we fix task_numa_compare(). We need remember to fix
remaining later:
$ git grep ACCESS_ONCE kernel/sched/ | grep "\->curr"
kernel/sched/deadline.c: curr = ACCESS_ONCE(rq->curr); /*
kernel/sched/fair.c: cur = ACCESS_ONCE(dst_rq->curr);
kernel/sched/fair.c: tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);
kernel/sched/rt.c: curr = ACCESS_ONCE(rq->curr); /* unlocked
Kirill
On 22.10.2014 00:03, Kirill Tkhai wrote:
> On 21.10.2014 23:03, Oleg Nesterov wrote:
>> On 10/21, Peter Zijlstra wrote:
>>>
>>> On Mon, Oct 20, 2014 at 10:50:06PM +0200, Oleg Nesterov wrote:
>>>> Let me explain what I personally dislike in v3:
>>>>
>>>> - I think that we do not have enough reasons for
>>>> SLAB_DESTROY_BY_RCU. This is the serious change.
>>>
>>> What exactly would the downsides be? SDBR has very limited space
>>> overhead iirc.
>>
>> Yes, SDBR is nice (and it could probably have more users), but my
>> concern is not overhead. Please see below.
>>
>>>> - Again, perhaps we should start we a simple and stupid fix.
>>>> We can do get_task_struct() under rq->lock or, if nothing
>>>> else, just
>>>>
>>>> raw_spin_lock_irq(&rq->lock);
>>>> cur = rq->curr;
>>>> if (is_idle_task(cur) || (cur->flags & PF_EXITING))
>>>> cur = NULL;
>>>> raw_spin_unlock_irq(&rq->lock);
>>>
>>> I think I agree with you, this is the simple safe option and is
>>> something we can easily backport. After that we can add creative bits on
>>> top.
>>
>> Agreed.
>>
>> Kirill, could you please make a patch?
>
> Yeah, I'll send it tomorrow.
>
>>> I think I prefer the SLAB_DESTROY_BY_RCU thing over the probe_kernel
>>> thing
>>
>> I won't really insist, but let me try to explain why I dislike it in
>> this particular case.
>>
>> - It is not clear who else (except task_numa_compare) will need it.
>> And it looks at bit strange to make task_struct SLAB_DESTROY_BY_RCU
>> just to read a word in task_numa_compare().
>>
>> - In some sense, the usage of SDBR looks simply "wrong" in this case.
>> IOW, I agree that probe_kernel_read() is ugly, but in this case
>> SDBR acts exactly the same way as probe_kernel_read().
>>
>> SDBR does not make the object rcu-safe, it only protects the object
>> type plus ensures that this memory can't go away. It was designed
>> for the case when you read the stable members initialized in ctor
>> (usually a lock) and verify/lock the object.
>>
>> But in this case we can not detect that the object is still alive
>> without the additional trick, so when you read ->sighand or ->flags,
>> the fact that this memory is still "struct task_struct" doesn't help
>> and doesn't matter at all. Only the subsequent "task == rq->curr"
>> check proves that yes, this is task_struct.
>>
>> OTOH, (afaics) we only need probe_kernel_read() if CONFIG_DEBUG_SLAB.
>> And in fact I think that "read the valid but potentially freed kernel
>> pointer" deserves another helper, it can have more users. For example,
>> please look at get_freepointer_safe().
>>
>> What if we add get_kernel(x, ptr) macro to factor out the IS_ENABLED()
>> of ifdef hack? Or inline function... This way the new xxx() helper we
>> discussed won't look that bad.
>>
>> But again, I agree that this subjective, I won't really argue.
>
> So this patch we fix task_numa_compare(). We need remember to fix
> remaining later:
>
> $ git grep ACCESS_ONCE kernel/sched/ | grep "\->curr"
> kernel/sched/deadline.c: curr = ACCESS_ONCE(rq->curr); /*
> kernel/sched/fair.c: cur = ACCESS_ONCE(dst_rq->curr);
> kernel/sched/fair.c: tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);
> kernel/sched/rt.c: curr = ACCESS_ONCE(rq->curr); /* unlocked
*)in other places we use task_struct R/O.
On Tue, Oct 21, 2014 at 09:03:35PM +0200, Oleg Nesterov wrote:
> > I think I prefer the SLAB_DESTROY_BY_RCU thing over the probe_kernel
> > thing
>
> I won't really insist, but let me try to explain why I dislike it in
> this particular case.
>
> - It is not clear who else (except task_numa_compare) will need it.
> And it looks at bit strange to make task_struct SLAB_DESTROY_BY_RCU
> just to read a word in task_numa_compare().
>
> - In some sense, the usage of SDBR looks simply "wrong" in this case.
> IOW, I agree that probe_kernel_read() is ugly, but in this case
> SDBR acts exactly the same way as probe_kernel_read().
>
> SDBR does not make the object rcu-safe, it only protects the object
> type plus ensures that this memory can't go away. It was designed
> for the case when you read the stable members initialized in ctor
> (usually a lock) and verify/lock the object.
>
> But in this case we can not detect that the object is still alive
> without the additional trick, so when you read ->sighand or ->flags,
> the fact that this memory is still "struct task_struct" doesn't help
> and doesn't matter at all. Only the subsequent "task == rq->curr"
> check proves that yes, this is task_struct.
>
> OTOH, (afaics) we only need probe_kernel_read() if CONFIG_DEBUG_SLAB.
> And in fact I think that "read the valid but potentially freed kernel
> pointer" deserves another helper, it can have more users. For example,
> please look at get_freepointer_safe().
>
> What if we add get_kernel(x, ptr) macro to factor out the IS_ENABLED()
> of ifdef hack? Or inline function... This way the new xxx() helper we
> discussed won't look that bad.
>
> But again, I agree that this subjective, I won't really argue.
So I worry about cache aliasing (not an issue on x86), so by touching
'random' pages that might be freed and reissued to back userspace, we
could be accessing the one page through multiple virtual mappings which
therefore result in aliases.
SDBR avoids this issue by guaranteeing the page is not reissued for
another purpose.
I'm not sure I can convince myself SLUB is correct here. How do we avoid
cache aliasing.
On 10/22, Peter Zijlstra wrote:
>
> So I worry about cache aliasing (not an issue on x86), so by touching
> 'random' pages that might be freed and reissued to back userspace, we
> could be accessing the one page through multiple virtual mappings which
> therefore result in aliases.
Or this page can be vmalloc'ed. Yes, but we do not care. Although this
was one of the reasons why the 2nd version of xxx() checks ->sighand at
the end, even if this is not needed correctness-wise.
Let's look at the code again,
struct task_struct *xxx(struct task_struct **ptask)
{
struct task_struct *task;
struct sighand_struct *sighand;
retry:
task = ACCESS_ONCE(*ptask);
if (!task)
return task;
if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
if (probe_kernel_read(&sighand, &task->sighand, sizeof(sighand)))
goto retry;
} else {
sighand = task->sighand;
}
(this if/else should go into a separare helper)
/*
* Pairs with atomic_dec_and_test() in put_task_struct(task).
* If we have read the freed/reused memory, we must see that
* the pointer was updated.
*/
smp_rmb();
if (unlikely(task != ACCESS_ONCE(*ptask)))
goto retry;
At this point we know that task_struct was not freed. Otherwise, since
this function assumes that "*ptask" must be cleared or updated before
the final put_task_struct(), we must notice that *ptask differs.
This means that we have read the correct value of ->sighand and the check
below is correct too. Even if ->sighand is not stable and can be already
NULL right after probe_kernel_read(), this doesn't matter.
And this also means that aliasing is not a problem. If it was freed we
could read the random value, but in this case we are not even going to
look at result.
/*
* release_task(task) was already called; potentially before
* the caller took rcu_read_lock() and in this case it can be
* freed before rcu_read_unlock().
*/
if (!sighand)
return NULL;
return task;
}
> SDBR avoids this issue by guaranteeing the page is not reissued for
> another purpose.
Yes, this is true.
> I'm not sure I can convince myself SLUB is correct here. How do we avoid
> cache aliasing.
Hmm. so perhaps I misunderstood your concern...
Do you mean that on !x86 a plain LOAD can "corrupt" the memory as it seen
from another vaddr?
If yes, this is another argument for a helper which reads the potentially
freed freed slab memory. get_freepointer_safe() can use it too and it can
be reimplemented in arch/xxx/include if necessary.
Or I missed your point completely?
Oleg.
On Wed, Oct 22, 2014 at 06:14:50PM +0200, Oleg Nesterov wrote:
> Hmm. so perhaps I misunderstood your concern...
>
> Do you mean that on !x86 a plain LOAD can "corrupt" the memory as it seen
> from another vaddr?
I'm not sure. Stores for sure, loads I'm not sure about.
I suspect loads are OK, the aliasing cacheline will be !modified and
therefore later eviction should discard (not write back). But like said,
I'm not at all sure.
I would hesitate to put such assumptions into generic code -- although
it appears we already have.
Add cc's.
On 10/22, Peter Zijlstra wrote:
>
> On Wed, Oct 22, 2014 at 06:14:50PM +0200, Oleg Nesterov wrote:
> > Hmm. so perhaps I misunderstood your concern...
> >
> > Do you mean that on !x86 a plain LOAD can "corrupt" the memory as it seen
> > from another vaddr?
>
> I'm not sure. Stores for sure, loads I'm not sure about.
>
> I suspect loads are OK, the aliasing cacheline will be !modified and
> therefore later eviction should discard (not write back). But like said,
> I'm not at all sure.
>
> I would hesitate to put such assumptions into generic code -- although
> it appears we already have.
So perhaps something like this makes sense?
If some arch has problems with D-cache aliasing (because the freed page
can be already mapped by user-space or vmalloc'ed), it can redefine this
helper.
Do you think we can use it to access rq->curr? (although let me repeat
that I won't really argue with SLAB_DESTROY_BY_RCU).
Oleg.
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index ecd3319..eb8494c 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -84,6 +84,22 @@ static inline unsigned long __copy_from_user_nocache(void *to,
})
/*
+ * @addr is the valid kernel pointer but this memory can be freed.
+ */
+#ifndef probe_slab_address
+#ifdef CONFIG_DEBUG_PAGEALLOC
+#define probe_slab_address(addr, retval) \
+ probe_kernel_address(addr, retval)
+#else
+#define probe_slab_address(addr, retval) \
+ ({ \
+ (retval) = *(typeof(retval) *)(addr); \
+ 0; \
+ })
+#endif
+#endif
+
+/*
* probe_kernel_read(): safely attempt to read from a location
* @dst: pointer to the buffer that shall take the data
* @src: address to read from
diff --git a/mm/slub.c b/mm/slub.c
index 3e8afcc..0467d22 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -265,11 +265,7 @@ static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
{
void *p;
-#ifdef CONFIG_DEBUG_PAGEALLOC
- probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p));
-#else
- p = get_freepointer(s, object);
-#endif
+ probe_slab_address(object + s->offset, p);
return p;
}
From: Oleg Nesterov <[email protected]>
Date: Wed, 22 Oct 2014 20:14:12 +0200
> So perhaps something like this makes sense?
>
> If some arch has problems with D-cache aliasing (because the freed page
> can be already mapped by user-space or vmalloc'ed), it can redefine this
> helper.
>
> Do you think we can use it to access rq->curr? (although let me repeat
> that I won't really argue with SLAB_DESTROY_BY_RCU).
Do we really need this?
We fully initialize and read from the area using the same virtual
address, there is no possiblity for corruption from SLAB's
perspective.
It's when you store at vaddr X then read at vaddr Y and expect to see
what you wrote at X that you have problems.
That is very much not what is happening here.
The lifetime is contained to SLAB's usage at one single virtual
address.
On 10/22, David Miller wrote:
>
> From: Oleg Nesterov <[email protected]>
> Date: Wed, 22 Oct 2014 20:14:12 +0200
>
> > So perhaps something like this makes sense?
> >
> > If some arch has problems with D-cache aliasing (because the freed page
> > can be already mapped by user-space or vmalloc'ed), it can redefine this
> > helper.
> >
> > Do you think we can use it to access rq->curr? (although let me repeat
> > that I won't really argue with SLAB_DESTROY_BY_RCU).
>
> Do we really need this?
Sorry, I was not clear. let me first explain why do we want this helper,
(although we can avoid it if we make task_struct_cachep SLAB_DESTROY_BY_RCU).
To simplify the discussion, lets ignore the actuall problem we are
trying to solve. Suppose that we want a trivial function, say,
int pid_of_curr_task_on_cpu(int cpu)
{
struct rq *rq = cpu_rq(cpu);
int pid;
rcu_read_lock();
pid = rq->curr->pid;
rcu_read_unlock();
}
and we do not really care if it returns the wrong/random result.
The problem is that rcu_read_lock() can not help (unless we add
SLAB_DESTROY_BY_RCU), rq->curr is not protected by RCU. So rq->curr can
be freed. Again, we do not care, but CONFIG_DEBUG_PAGEALLOC can ummap
this page.
So we can use ifdef(CONFIG_DEBUG_PAGEALLOC) but it would be better to
have a helper to hide this ugliness, and (at least) slub can use it too.
Now the question: is this LOAD is safe in case when this (freed) page
already has another mapping? This is black magic to me, I do not know.
And Peter has some concerns.
And, say, copy_from_user_page() on sparc does
flush_cache_page();
memcpy();
flush_ptrace_access();
Again, I simply do not know if this is relevant or not, probably this
is because "illegal alias" was created by kmap() which can use the
same vaddr to access different pages.
> We fully initialize and read from the area using the same virtual
> address, there is no possiblity for corruption from SLAB's
> perspective.
>
> It's when you store at vaddr X then read at vaddr Y and expect to see
> what you wrote at X that you have problems.
>
> That is very much not what is happening here.
>
> The lifetime is contained to SLAB's usage at one single virtual
> address.
OK, so iiuc this should be safe.
Thanks!
Oleg.
From: Oleg Nesterov <[email protected]>
Date: Wed, 22 Oct 2014 21:42:28 +0200
> Now the question: is this LOAD is safe in case when this (freed) page
> already has another mapping? This is black magic to me, I do not know.
> And Peter has some concerns.
It is immatieral, because you can read garbage and it's "don't care"
in this context.
And later if it is used again at this virtual address, it will be
initialized with stores at that virtual address first.
So no problem.
> And, say, copy_from_user_page() on sparc does
>
> flush_cache_page();
> memcpy();
> flush_ptrace_access();
In this case, as I tried to explain, it matters because the physical
address is being accessed from two virtual address at the same time
"for the same usage".
That's what distinguishes this from the SLAB and RCU cases you cite.
Illegal aliases only matter within a specific usage of a piece of
memory.
So if a page is mapped into userspace, and we touch the kernel side
PAGE_OFFSET alias mapping of it, we can have problems. And that is
what is happening in the ptrace case above.
But if we are using the page later in SLAB, the previous userspace
mapping will not cause problems wrt. aliasing. It cannot, because
once SLAB gets the page it will initialize it at the new virtual
address.
On 10/22, David Miller wrote:
>
> From: Oleg Nesterov <[email protected]>
> Date: Wed, 22 Oct 2014 21:42:28 +0200
>
> > Now the question: is this LOAD is safe in case when this (freed) page
> > already has another mapping? This is black magic to me, I do not know.
> > And Peter has some concerns.
>
> It is immatieral, because you can read garbage and it's "don't care"
> in this context.
>
> And later if it is used again at this virtual address, it will be
> initialized with stores at that virtual address first.
>
> So no problem.
Great, thanks.
> > And, say, copy_from_user_page() on sparc does
> >
> > flush_cache_page();
> > memcpy();
> > flush_ptrace_access();
>
> In this case, as I tried to explain, it matters because the physical
> address is being accessed from two virtual address at the same time
> "for the same usage".
>
> That's what distinguishes this from the SLAB and RCU cases you cite.
Yes, this was my (vague) understanding, but thanks for another
explanation anyway.
Oleg.
On Wed, Oct 22, 2014 at 02:59:50PM -0400, David Miller wrote:
> It's when you store at vaddr X then read at vaddr Y and expect to see
> what you wrote at X that you have problems.
OK, so there are no known archs that have problems with this? I didn't
expect problems, but was somewhat uncomfortable with it.
> That is very much not what is happening here.
Agreed.