Unlocked access to dst_rq->curr in task_numa_compare() is racy.
If the curr task is exiting, this may result in a use-after-free:
task_numa_compare()                          do_exit()
    ...                                          current->flags |= PF_EXITING;
    ...                                          release_task()
    ...                                              ~~delayed_put_task_struct()~~
    ...                                          schedule()
    rcu_read_lock()                              ...
    cur = ACCESS_ONCE(dst_rq->curr)              ...
        ...                                      rq->curr = next;
        ...                                          context_switch()
        ...                                              finish_task_switch()
        ...                                                  put_task_struct()
        ...                                                      __put_task_struct()
        ...                                                          free_task_struct()
    task_numa_assign()                           ...
        get_task_struct()                        ...
As noted by Oleg:
<<The lockless get_task_struct(tsk) is only safe if tsk == current
and didn't pass exit_notify(), or if this tsk was found on a rcu
protected list (say, for_each_process() or find_task_by_vpid()).
IOW, it is only safe if release_task() was not called before we
take rcu_read_lock(), in this case we can rely on the fact that
delayed_put_pid() can not drop the (potentially) last reference
until rcu_read_unlock().
And as Kirill pointed out task_numa_compare()->task_numa_assign()
path does get_task_struct(dst_rq->curr) and this is not safe. The
task_struct itself can't go away, but rcu_read_lock() can't save
us from the final put_task_struct() in finish_task_switch(); this
reference goes away without rcu gp>>
The patch adds a simple check of the PF_EXITING flag. If it's not set,
this guarantees that call_rcu() of the delayed_put_task_struct() callback
hasn't happened yet, so we can safely do get_task_struct() in
task_numa_assign().
Holding dst_rq->lock protects against concurrency with the last schedule();
without it, cur's memory may be reused or unmapped.
Signed-off-by: Kirill Tkhai <[email protected]>
Suggested-by: Oleg Nesterov <[email protected]>
---
kernel/sched/fair.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b069bf..fbc0b82 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1164,9 +1164,19 @@ static void task_numa_compare(struct task_numa_env *env,
long moveimp = imp;
rcu_read_lock();
- cur = ACCESS_ONCE(dst_rq->curr);
- if (cur->pid == 0) /* idle */
+
+ raw_spin_lock_irq(&dst_rq->lock);
+ cur = dst_rq->curr;
+ /*
+ * No need to move the exiting task, and this ensures that ->curr
+ * wasn't reaped and thus get_task_struct() in task_numa_assign()
+ * is safe under RCU read lock.
+ * Note that rcu_read_lock() itself can't protect from the final
+ * put_task_struct() after the last schedule().
+ */
+ if ((cur->flags & PF_EXITING) || is_idle_task(cur))
cur = NULL;
+ raw_spin_unlock_irq(&dst_rq->lock);
/*
* "imp" is the fault differential for the source task between the
On 10/22, Kirill Tkhai wrote:
>
> Unlocked access to dst_rq->curr in task_numa_compare() is racy.
> If curr task is exiting this may be a reason of use-after-free:
Thanks.
And as you pointed out, there are other examples of unlocked
foreign_rq->curr usage.
So, Kirill, Peter, do you think that the patch below can help? Can
we change task_numa_group() and ->select_task_rq() to do nothing if
rq_curr_rcu_safe() returns NULL? It seems we can...
task_numa_compare() can use it too, we can make another patch on
top of this one.
- Obviously just for the early review. Lacks the changelog
and the comments (at least).
- Once again, I won't insist on probe_slab_address(). We can
add SDBR and change task_rcu_dereference() to simply read
->sighand.
- Also, I won't argue if you think that we do not need a
generic helper. In this case we can move this logic into
rq_curr_rcu_safe() and it will be a bit simpler.
- OTOH, I am not sure we need rq_curr_rcu_safe(). The callers
can just use task_rcu_dereference() and check IS_ERR_OR_NULL,
I guess retry doesn't buy too much in this case.
Or do you think we need something else?
Oleg.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..0ba420e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2300,6 +2300,7 @@ extern void block_all_signals(int (*notifier)(void *priv), void *priv,
sigset_t *mask);
extern void unblock_all_signals(void);
extern void release_task(struct task_struct * p);
+extern struct task_struct *task_rcu_dereference(struct task_struct **ptask);
extern int send_sig_info(int, struct siginfo *, struct task_struct *);
extern int force_sigsegv(int, struct task_struct *);
extern int force_sig_info(int, struct siginfo *, struct task_struct *);
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..4aa00c7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -213,6 +213,37 @@ repeat:
goto repeat;
}
+struct task_struct *task_rcu_dereference(struct task_struct **ptask)
+{
+ struct task_struct *task;
+ struct sighand_struct *sighand;
+
+ task = rcu_dereference(*ptask);
+ if (!task)
+ return NULL;
+
+ /* If it fails the check below must fail too */
+ probe_slab_address(&task->sighand, sighand);
+ /*
+ * Pairs with atomic_dec_and_test() in put_task_struct(task).
+ * If we have read the freed/reused memory, we must see that
+ * the pointer was updated. The caller might want to retry in
+ * this case.
+ */
+ smp_rmb();
+ if (unlikely(task != ACCESS_ONCE(*ptask)))
+ return ERR_PTR(-EAGAIN);
+
+ /*
+ * release_task(task) was already called; potentially before
+ * the caller took rcu_read_lock() and in this case it can be
+ * freed before rcu_read_unlock().
+ */
+ if (!sighand)
+ return ERR_PTR(-EINVAL);
+ return task;
+}
+
/*
* This checks not only the pgrp, but falls back on the pid if no
* satisfactory pgrp is found. I dunno - gdb doesn't work correctly
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 579712f..249c0c1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -655,6 +655,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() (&__raw_get_cpu_var(runqueues))
+static inline struct task_struct *rq_curr_rcu_safe(struct rq *rq)
+{
+ for (;;) {
+ struct task_struct *curr = task_rcu_dereference(&rq->curr);
+ /* NULL is not possible */
+ if (likely(!IS_ERR(curr)))
+ return curr;
+ if (PTR_ERR(curr) != -EAGAIN)
+ return NULL;
+ }
+}
+
static inline u64 rq_clock(struct rq *rq)
{
return rq->clock;
Damn.
On 10/22, Oleg Nesterov wrote:
>
> +struct task_struct *task_rcu_dereference(struct task_struct **ptask)
> +{
> + struct task_struct *task;
> + struct sighand_struct *sighand;
> +
> + task = rcu_dereference(*ptask);
> + if (!task)
> + return NULL;
> +
> + /* If it fails the check below must fail too */
> + probe_slab_address(&task->sighand, sighand);
> + /*
> + * Pairs with atomic_dec_and_test() in put_task_struct(task).
> + * If we have read the freed/reused memory, we must see that
> + * the pointer was updated. The caller might want to retry in
> + * this case.
> + */
> + smp_rmb();
> + if (unlikely(task != ACCESS_ONCE(*ptask)))
> + return ERR_PTR(-EAGAIN);
This is not exactly right. task == *ptask can be a false positive.
It can be freed, then reused (so that sighand != NULL can be a false
positive), then freed again, and then reused again as a task_struct.
This is not that bad, we can still safely use this task_struct, but
the comment should be updated. Plus the -EINVAL below can be wrong in
this case, although this is minor.
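To make that interleaving concrete, here is a sketch in the same style as the
changelog's race diagram; the address A and the second task T' are illustrative
names, not taken from the patch:

    task_rcu_dereference()                    another CPU
      task = rcu_dereference(*ptask)            // reads task T at address A
      ...                                       T exits, *ptask is updated, T is freed
      ...                                       A is reused as some other object
      probe_slab_address(&task->sighand)        // reads non-NULL garbage from A
      ...                                       that object is freed again
      ...                                       A is reused as a new task_struct T',
      ...                                       and *ptask = T' (same address A)
      smp_rmb()
      task == ACCESS_ONCE(*ptask)               // true: same address, but sighand
                                                // was read from unrelated memory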
Yeees, SLAB_DESTROY_BY_RCU closes this race. Not sure why I'd like
to avoid it, but I do ;)
Oleg.
On Wed, 22/10/2014 at 23:30 +0200, Oleg Nesterov wrote:
> On 10/22, Kirill Tkhai wrote:
> >
> > Unlocked access to dst_rq->curr in task_numa_compare() is racy.
> > If curr task is exiting this may be a reason of use-after-free:
>
> Thanks.
>
> And as you pointed out, there are other examples of unlocked
> foreign_rq->curr usage.
>
> So, Kirill, Peter, do you think that the patch below can help? Can
> we change task_numa_group() and ->select_task_rq() to do nothing if
> rq_curr_rcu_safe() returns NULL? It seems we can...
>
> task_numa_compare() can use it too, we can make another patch on
> top of this one.
>
> - Obviously just for the early review. Lacks the changelog
> and the comments (at least).
>
> - Once again, I won't insist on probe_slab_address(). We can
> add SDBR and change task_rcu_dereference() to simply read
> ->sighand.
>
> - Also, I won't argue if you think that we do not need a
> generic helper. In this case we can move this logic into
> rq_curr_rcu_safe() and it will be a bit simpler.
>
> - OTOH, I am not sure we need rq_curr_rcu_safe(). The callers
> can just use task_rcu_dereference() and check IS_ERR_OR_NULL,
> I guess retry doesn't buy too much in this case.
>
> Or do you think we need something else?
>
> Oleg.
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 857ba40..0ba420e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2300,6 +2300,7 @@ extern void block_all_signals(int (*notifier)(void *priv), void *priv,
> sigset_t *mask);
> extern void unblock_all_signals(void);
> extern void release_task(struct task_struct * p);
> +extern struct task_struct *task_rcu_dereference(struct task_struct **ptask);
> extern int send_sig_info(int, struct siginfo *, struct task_struct *);
> extern int force_sigsegv(int, struct task_struct *);
> extern int force_sig_info(int, struct siginfo *, struct task_struct *);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 32c58f7..4aa00c7 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -213,6 +213,37 @@ repeat:
> goto repeat;
> }
>
> +struct task_struct *task_rcu_dereference(struct task_struct **ptask)
> +{
> + struct task_struct *task;
> + struct sighand_struct *sighand;
> +
> + task = rcu_dereference(*ptask);
> + if (!task)
> + return NULL;
> +
> + /* If it fails the check below must fail too */
> + probe_slab_address(&task->sighand, sighand);
> + /*
> + * Pairs with atomic_dec_and_test() in put_task_struct(task).
> + * If we have read the freed/reused memory, we must see that
> + * the pointer was updated. The caller might want to retry in
> + * this case.
> + */
> + smp_rmb();
> + if (unlikely(task != ACCESS_ONCE(*ptask)))
> + return ERR_PTR(-EAGAIN);
> +
> + /*
> + * release_task(task) was already called; potentially before
> + * the caller took rcu_read_lock() and in this case it can be
> + * freed before rcu_read_unlock().
> + */
> + if (!sighand)
> + return ERR_PTR(-EINVAL);
> + return task;
> +}
> +
> /*
> * This checks not only the pgrp, but falls back on the pid if no
> * satisfactory pgrp is found. I dunno - gdb doesn't work correctly
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 579712f..249c0c1 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -655,6 +655,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
> #define cpu_curr(cpu) (cpu_rq(cpu)->curr)
> #define raw_rq() (&__raw_get_cpu_var(runqueues))
>
> +static inline struct task_struct *rq_curr_rcu_safe(struct rq *rq)
> +{
> + for (;;) {
> + struct task_struct *curr = task_rcu_dereference(&rq->curr);
> + /* NULL is not possible */
> + if (likely(!IS_ERR(curr)))
> + return curr;
> + if (PTR_ERR(curr) != -EAGAIN)
> + return NULL;
> + }
> +}
> +
> static inline u64 rq_clock(struct rq *rq)
> {
> return rq->clock;
>
I agree that a generic helper is better. But probe_slab_address() makes sense
if we know that SDBR is worse in this area. Less code is
easier to support :) probe_slab_address() is not trivial logic.
Also, using mm primitives increases kernel modularity.
Kirill
On 10/23, Oleg Nesterov wrote:
>
> Damn.
Yes.
> On 10/22, Oleg Nesterov wrote:
> >
> > +struct task_struct *task_rcu_dereference(struct task_struct **ptask)
> > +{
> > + struct task_struct *task;
> > + struct sighand_struct *sighand;
> > +
> > + task = rcu_dereference(*ptask);
> > + if (!task)
> > + return NULL;
> > +
> > + /* If it fails the check below must fail too */
> > + probe_slab_address(&task->sighand, sighand);
> > + /*
> > + * Pairs with atomic_dec_and_test() in put_task_struct(task).
> > + * If we have read the freed/reused memory, we must see that
> > + * the pointer was updated. The caller might want to retry in
> > + * this case.
> > + */
> > + smp_rmb();
> > + if (unlikely(task != ACCESS_ONCE(*ptask)))
> > + return ERR_PTR(-EAGAIN);
>
> This is not exactly right. task == *ptask can be false positive.
>
> It can be freed, then resused (so that sighand != NULL can be false
> positive), then freed again, and then reused again as task_struct.
>
> This is not that bad, we still can safely use this task_struct, but
> the comment should be updated. Plus -EINVAL below can be wrong in
> this case although this minor.
Yes.
> Yeees, SLAB_DESTTROY_BY_RCU closes this race. Not sure why I'd like
> to avoid it, but I do ;)
Argh. I only meant that SLAB_DESTROY_BY_RCU can make the comments
simpler. "closes this race" applies to "the check below must fail too"
as well. Sorry if I confused you.
"task == *ptask can be a false positive" is true with or without
SLAB_DESTROY_BY_RCU, and this needs a good comment. Yes, it can't
be reused twice, but still we can't 100% trust the "sighand != NULL"
check.
So let me repeat, SDBR can only turn probe_slab_address() into a plain
load.
But I can't think properly today, will try to recheck tomorrow and send
v2.
Oleg.
On 10/23, Kirill Tkhai wrote:
>
> I'm agree generic helper is better. But probe_slab_address() has a sence
> if we know that SDBR is worse in our subject area.
And I still think it is worse.
> Less of code is
> easier to support :)
Sure, but ignoring the comments, SDBR needs the same code in
task_rcu_dereference() ? Except, of course
- probe_slab_address(&task->sighand, sighand);
+ sighand = task->sighand;
or how do you think we can simplify it?
> probe_slab_address() it's not a trivial logic.
But it already has a user. And probably it can have more.
To me the usage of SDBR is not trivial (and confusing) in this case.
Once again, ignoring the CONFIG_DEBUG_PAGEALLOC problems it does not
help at all.
With or without SDBR rq->curr can be reused and we need to avoid this
race. The fact that with SDBR it can be reused only as another instance
of task_struct is absolutely immaterial imo.
Not to mention that SDBR still adds some overhead while probe_slab()
is free unless CONFIG_DEBUG_PAGEALLOC, but this option adds a large
slowdown anyway.
But again, I can't really work today, perhaps I missed something.
Perhaps you can show a better code which relies on SDBR?
Oleg.
On Thu, 23/10/2014 at 20:18 +0200, Oleg Nesterov wrote:
> On 10/23, Kirill Tkhai wrote:
> >
> > I'm agree generic helper is better. But probe_slab_address() has a sence
> > if we know that SDBR is worse in our subject area.
>
> And I still think it is worse.
>
> > Less of code is
> > easier to support :)
>
> Sure, but ignoring the comments, SDBR needs the same code in
> task_rcu_dereference() ? Except, of course
>
> - probe_slab_address(&task->sighand, sighand);
> + sighand = task->sighand;
>
> or how do you think we can simplify it?
OK, really, there is no big simplification there. Your variant is good.
> > probe_slab_address() it's not a trivial logic.
>
> But it already has a user. And probably it can have more.
>
> To me the usage of SDBR is not trivial (and confusing) in this case.
> Once again, ignoring the CONFIG_DEBUG_PAGEALLOC problems it does not
> help at all.
>
> With or without SDBR rq->curr can be reused and we need to avoid this
> race. The fact that with SDBR it can be reused only as another instance
> of task_struct is absolutely immaterial imo.
>
> Not to mention that SDBR still adds some overhead while probe_slab()
> is free unless CONFIG_DEBUG_PAGEALLOC, but this option adds a large
> slowdown anyway.
>
>
> But again, I can't really work today, perhaps I missed something.
> Perhaps you can show a better code which relies on SDBR?
No, it would be the same except for probe_slab_address(). So, let's stay
with the probe_slab_address() variant and fix the bug this way.
Kirill
Peter, let me repeat once again, if you still prefer to avoid
probe_slab_address() and use SLAB_DESTROY_BY_RCU I won't argue.
I do not like SLAB_DESTROY_BY_RCU in this particular case. With
or without SDBR rq->curr can be reused and we need to avoid this
race. The fact that with SDBR it can be reused only as another
instance of task_struct is absolutely immaterial imo.
Not to mention that SDBR still adds some overhead while probe_slab()
is free unless CONFIG_DEBUG_PAGEALLOC, but this option adds a large
slowdown anyway.
However, my arguments against SDBR are not strictly technical, and
I think that this falls into "maintainer is always right" category.
So please tell me if you prefer v2 with SDBR. In this case 2/3 is
not needed, and 3/3 can simply read ->sighand. Otherwise the code
(and even the comments) will be the same.
Compared to the draft patch I sent before
- update the comments
- do not use ERR_PTR(), just return the task or NULL, so
kernel/sched/ doesn't need another helper.
This means that task_rcu_dereference() does the retry itself.
We can add __task_rcu_dereference() if we get another caller
which does not need/want to retry.
Oleg.
include/linux/sched.h | 1 +
include/linux/uaccess.h | 30 +++++++++++++++-------------
kernel/exit.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++
mm/slub.c | 6 +----
4 files changed, 67 insertions(+), 19 deletions(-)
Extract the ifdef(CONFIG_DEBUG_PAGEALLOC) code from get_freepointer_safe()
into the new generic helper, probe_slab_address(). The next patch will add
another user.
Signed-off-by: Oleg Nesterov <[email protected]>
---
include/linux/uaccess.h | 15 +++++++++++++++
mm/slub.c | 6 +-----
2 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index effb637..3367396 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -71,6 +71,21 @@ static inline unsigned long __copy_from_user_nocache(void *to,
__probe_kernel_read(&(retval), (__force void *)(addr), sizeof(retval))
/*
+ * Same as probe_kernel_address(), but @addr must be the valid pointer
+ * to a slab object, potentially freed/reused/unmapped.
+ */
+#ifdef CONFIG_DEBUG_PAGEALLOC
+#define probe_slab_address(addr, retval) \
+ probe_kernel_address(addr, retval)
+#else
+#define probe_slab_address(addr, retval) \
+ ({ \
+ (retval) = *(typeof(retval) *)(addr); \
+ 0; \
+ })
+#endif
+
+/*
* probe_kernel_read(): safely attempt to read from a location
* @dst: pointer to the buffer that shall take the data
* @src: address to read from
diff --git a/mm/slub.c b/mm/slub.c
index 3e8afcc..0467d22 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -265,11 +265,7 @@ static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
{
void *p;
-#ifdef CONFIG_DEBUG_PAGEALLOC
- probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p));
-#else
- p = get_freepointer(s, object);
-#endif
+ probe_slab_address(object + s->offset, p);
return p;
}
--
1.5.5.1
task_struct is only protected by RCU if it was found on a RCU protected
list (say, for_each_process() or find_task_by_vpid()).
And as Kirill pointed out rq->curr isn't protected by RCU, the scheduler
drops the (potentially) last reference without RCU gp, this means that we
need to fix the code which uses foreign_rq->curr under rcu_read_lock().
Add a new helper which can be used to dereference rq->curr or any other
pointer to task_struct assuming that it should be cleared or updated
before the final put_task_struct(). It returns non-NULL only if this
task can't go away before rcu_read_unlock().
Suggested-by: Kirill Tkhai <[email protected]>
Signed-off-by: Oleg Nesterov <[email protected]>
---
include/linux/sched.h | 1 +
kernel/exit.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+), 0 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 857ba40..0ba420e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2300,6 +2300,7 @@ extern void block_all_signals(int (*notifier)(void *priv), void *priv,
sigset_t *mask);
extern void unblock_all_signals(void);
extern void release_task(struct task_struct * p);
+extern struct task_struct *task_rcu_dereference(struct task_struct **ptask);
extern int send_sig_info(int, struct siginfo *, struct task_struct *);
extern int force_sigsegv(int, struct task_struct *);
extern int force_sig_info(int, struct siginfo *, struct task_struct *);
diff --git a/kernel/exit.c b/kernel/exit.c
index 32c58f7..d8b95c2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -213,6 +213,55 @@ repeat:
goto repeat;
}
+struct task_struct *task_rcu_dereference(struct task_struct **ptask)
+{
+ struct task_struct *task;
+ struct sighand_struct *sighand;
+
+ /*
+ * We need to verify that release_task() was not called and thus
+ * delayed_put_task_struct() can't run and drop the last reference
+ * before rcu_read_unlock(). We check task->sighand != NULL, but
+ * we can read the already freed and reused memory.
+ */
+ retry:
+ task = rcu_dereference(*ptask);
+ if (!task)
+ return NULL;
+
+ probe_slab_address(&task->sighand, sighand);
+ /*
+ * Pairs with atomic_dec_and_test(usage) in put_task_struct(task).
+ * If this task was already freed we can not miss the preceding
+ * update of this pointer.
+ */
+ smp_rmb();
+ if (unlikely(task != ACCESS_ONCE(*ptask)))
+ goto retry;
+
+ /*
+ * Either this is the same task and we can trust sighand != NULL, or
+ * its memory was re-instantiated as another instance of task_struct.
+ * In the latter case the new task can not go away until another rcu
+ * gp pass, so the only problem is that sighand == NULL can be false
+ * positive but we can pretend we got this NULL before it was freed.
+ */
+ if (sighand)
+ return task;
+
+ /*
+ * We could even eliminate the false positive mentioned above:
+ *
+ * probe_slab_address(&task->sighand, sighand);
+ * if (sighand)
+ * goto retry;
+ *
+ * if sighand != NULL because we read the freed memory we should
+ * see the new pointer, otherwise we will likely return this task.
+ */
+ return NULL;
+}
+
/*
* This checks not only the pgrp, but falls back on the pid if no
* satisfactory pgrp is found. I dunno - gdb doesn't work correctly
--
1.5.5.1
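For illustration only, a hedged sketch of how a foreign rq->curr user (e.g. the
task_numa_compare() path) might be converted to the new helper. The caller and
its name are assumptions for this sketch, not a patch from this series:

	/*
	 * Hypothetical caller sketch, not part of this series: pin a foreign
	 * rq->curr via task_rcu_dereference() instead of a raw read.
	 */
	static struct task_struct *grab_foreign_curr(struct rq *dst_rq)
	{
		struct task_struct *cur;

		rcu_read_lock();
		cur = task_rcu_dereference(&dst_rq->curr);
		/*
		 * A non-NULL result means this task can not go away before
		 * rcu_read_unlock(), so taking a reference here is safe.
		 */
		if (cur && !is_idle_task(cur) && !(cur->flags & PF_EXITING))
			get_task_struct(cur);
		else
			cur = NULL;
		rcu_read_unlock();

		return cur;
	}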
probe_kernel_address() can just do __probe_kernel_read(sizeof(retval)),
no need to open-code its implementation.
Signed-off-by: Oleg Nesterov <[email protected]>
---
include/linux/uaccess.h | 17 ++---------------
1 files changed, 2 insertions(+), 15 deletions(-)
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index ecd3319..effb637 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -66,22 +66,9 @@ static inline unsigned long __copy_from_user_nocache(void *to,
* already holds mmap_sem, or other locks which nest inside mmap_sem.
* This must be a macro because __get_user() needs to know the types of the
* args.
- *
- * We don't include enough header files to be able to do the set_fs(). We
- * require that the probe_kernel_address() caller will do that.
*/
-#define probe_kernel_address(addr, retval) \
- ({ \
- long ret; \
- mm_segment_t old_fs = get_fs(); \
- \
- set_fs(KERNEL_DS); \
- pagefault_disable(); \
- ret = __copy_from_user_inatomic(&(retval), (__force typeof(retval) __user *)(addr), sizeof(retval)); \
- pagefault_enable(); \
- set_fs(old_fs); \
- ret; \
- })
+#define probe_kernel_address(addr, retval) \
+ __probe_kernel_read(&(retval), (__force void *)(addr), sizeof(retval))
/*
* probe_kernel_read(): safely attempt to read from a location
--
1.5.5.1
On Mon, 27 Oct 2014, Oleg Nesterov wrote:
> Extract the ifdef(CONFIG_DEBUG_PAGEALLOC) code from get_freepointer_safe()
> into the new generic helper, probe_slab_address(). The next patch will add
> another user.
Acked-by: Christoph Lameter <[email protected]>
On Mon, 27/10/2014 at 20:54 +0100, Oleg Nesterov wrote:
> Extract the ifdef(CONFIG_DEBUG_PAGEALLOC) code from get_freepointer_safe()
> into the new generic helper, probe_slab_address(). The next patch will add
> another user.
>
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
> include/linux/uaccess.h | 15 +++++++++++++++
> mm/slub.c | 6 +-----
> 2 files changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index effb637..3367396 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -71,6 +71,21 @@ static inline unsigned long __copy_from_user_nocache(void *to,
> __probe_kernel_read(&(retval), (__force void *)(addr), sizeof(retval))
>
> /*
> + * Same as probe_kernel_address(), but @addr must be the valid pointer
> + * to a slab object, potentially freed/reused/unmapped.
> + */
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> +#define probe_slab_address(addr, retval) \
> + probe_kernel_address(addr, retval)
> +#else
> +#define probe_slab_address(addr, retval) \
> + ({ \
> + (retval) = *(typeof(retval) *)(addr); \
> + 0; \
> + })
> +#endif
> +
> +/*
> * probe_kernel_read(): safely attempt to read from a location
> * @dst: pointer to the buffer that shall take the data
> * @src: address to read from
> diff --git a/mm/slub.c b/mm/slub.c
> index 3e8afcc..0467d22 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -265,11 +265,7 @@ static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
> {
> void *p;
>
> -#ifdef CONFIG_DEBUG_PAGEALLOC
> - probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p));
> -#else
> - p = get_freepointer(s, object);
> -#endif
> + probe_slab_address(object + s->offset, p);
> return p;
> }
>
probe_kernel_read() was arch-dependent on three platforms:
arch/blackfin/mm/maccess.c
arch/parisc/lib/memcpy.c
arch/um/kernel/maccess.c
But now we skip these arch-dependent implementations. Isn't there a problem?
Kirill
On Tue, 28/10/2014 at 08:44 +0300, Kirill Tkhai wrote:
> On Mon, 27/10/2014 at 20:54 +0100, Oleg Nesterov wrote:
> > Extract the ifdef(CONFIG_DEBUG_PAGEALLOC) code from get_freepointer_safe()
> > into the new generic helper, probe_slab_address(). The next patch will add
> > another user.
> >
> > Signed-off-by: Oleg Nesterov <[email protected]>
> > ---
> > include/linux/uaccess.h | 15 +++++++++++++++
> > mm/slub.c | 6 +-----
> > 2 files changed, 16 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> > index effb637..3367396 100644
> > --- a/include/linux/uaccess.h
> > +++ b/include/linux/uaccess.h
> > @@ -71,6 +71,21 @@ static inline unsigned long __copy_from_user_nocache(void *to,
> > __probe_kernel_read(&(retval), (__force void *)(addr), sizeof(retval))
> >
> > /*
> > + * Same as probe_kernel_address(), but @addr must be the valid pointer
> > + * to a slab object, potentially freed/reused/unmapped.
> > + */
> > +#ifdef CONFIG_DEBUG_PAGEALLOC
> > +#define probe_slab_address(addr, retval) \
> > + probe_kernel_address(addr, retval)
> > +#else
> > +#define probe_slab_address(addr, retval) \
> > + ({ \
> > + (retval) = *(typeof(retval) *)(addr); \
> > + 0; \
> > + })
> > +#endif
> > +
> > +/*
> > * probe_kernel_read(): safely attempt to read from a location
> > * @dst: pointer to the buffer that shall take the data
> > * @src: address to read from
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 3e8afcc..0467d22 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -265,11 +265,7 @@ static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
> > {
> > void *p;
> >
> > -#ifdef CONFIG_DEBUG_PAGEALLOC
> > - probe_kernel_read(&p, (void **)(object + s->offset), sizeof(p));
> > -#else
> > - p = get_freepointer(s, object);
> > -#endif
> > + probe_slab_address(object + s->offset, p);
> > return p;
> > }
> >
>
> probe_kernel_read() was arch-dependent on tree platforms:
Of course, I mean get_freepointer_safe() used to use arch-dependent
probe_kernel_read() on blackfin, parisc and um.
> arch/blackfin/mm/maccess.c
> arch/parisc/lib/memcpy.c
> arch/um/kernel/maccess.c
>
> But now we skip these arch-dependent implementations. Is there no a problem?
On Mon, 27/10/2014 at 20:54 +0100, Oleg Nesterov wrote:
> task_struct is only protected by RCU if it was found on a RCU protected
> list (say, for_each_process() or find_task_by_vpid()).
>
> And as Kirill pointed out rq->curr isn't protected by RCU, the scheduler
> drops the (potentially) last reference without RCU gp, this means that we
> need to fix the code which uses foreign_rq->curr under rcu_read_lock().
>
> Add a new helper which can be used to dereferene rq->curr or any other
> pointer to task_struct assuming that it should be cleared or updated
> before the final put_task_struct(). It returns non-NULL only if this
> task can't go away before rcu_read_unlock().
>
> Suggested-by: Kirill Tkhai <[email protected]>
> Signed-off-by: Oleg Nesterov <[email protected]>
> ---
> include/linux/sched.h | 1 +
> kernel/exit.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 50 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 857ba40..0ba420e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2300,6 +2300,7 @@ extern void block_all_signals(int (*notifier)(void *priv), void *priv,
> sigset_t *mask);
> extern void unblock_all_signals(void);
> extern void release_task(struct task_struct * p);
> +extern struct task_struct *task_rcu_dereference(struct task_struct **ptask);
> extern int send_sig_info(int, struct siginfo *, struct task_struct *);
> extern int force_sigsegv(int, struct task_struct *);
> extern int force_sig_info(int, struct siginfo *, struct task_struct *);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 32c58f7..d8b95c2 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -213,6 +213,55 @@ repeat:
> goto repeat;
> }
>
> +struct task_struct *task_rcu_dereference(struct task_struct **ptask)
> +{
> + struct task_struct *task;
> + struct sighand_struct *sighand;
> +
> + /*
> + * We need to verify that release_task() was not called and thus
> + * delayed_put_task_struct() can't run and drop the last reference
> + * before rcu_read_unlock(). We check task->sighand != NULL, but
> + * we can read the already freed and reused memory.
> + */
> + retry:
> + task = rcu_dereference(*ptask);
> + if (!task)
> + return NULL;
> +
> + probe_slab_address(&task->sighand, sighand);
> + /*
> + * Pairs with atomic_dec_and_test(usage) in put_task_struct(task).
> + * If this task was already freed we can not miss the preceding
> + * update of this pointer.
> + */
> + smp_rmb();
> + if (unlikely(task != ACCESS_ONCE(*ptask)))
> + goto retry;
> +
> + /*
> + * Either this is the same task and we can trust sighand != NULL, or
> + * its memory was re-instantiated as another instance of task_struct.
> + * In the latter case the new task can not go away until another rcu
> + * gp pass, so the only problem is that sighand == NULL can be false
> + * positive but we can pretend we got this NULL before it was freed.
> + */
> + if (sighand)
> + return task;
> +
> + /*
> + * We could even eliminate the false positive mentioned above:
> + *
> + * probe_slab_address(&task->sighand, sighand);
> + * if (sighand)
> + * goto retry;
> + *
> + * if sighand != NULL because we read the freed memory we should
> + * see the new pointer, otherwise we will likely return this task.
> + */
> + return NULL;
> +}
> +
> /*
> * This checks not only the pgrp, but falls back on the pid if no
> * satisfactory pgrp is found. I dunno - gdb doesn't work correctly
Reviewed-by: Kirill Tkhai <[email protected]>
Commit-ID: 1effd9f19324efb05fccc7421530e11a52db0278
Gitweb: http://git.kernel.org/tip/1effd9f19324efb05fccc7421530e11a52db0278
Author: Kirill Tkhai <[email protected]>
AuthorDate: Wed, 22 Oct 2014 11:17:11 +0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 28 Oct 2014 10:46:02 +0100
sched/numa: Fix unsafe get_task_struct() in task_numa_assign()
Unlocked access to dst_rq->curr in task_numa_compare() is racy.
If the curr task is exiting, this may result in a use-after-free:
task_numa_compare()                          do_exit()
    ...                                          current->flags |= PF_EXITING;
    ...                                          release_task()
    ...                                              ~~delayed_put_task_struct()~~
    ...                                          schedule()
    rcu_read_lock()                              ...
    cur = ACCESS_ONCE(dst_rq->curr)              ...
        ...                                      rq->curr = next;
        ...                                          context_switch()
        ...                                              finish_task_switch()
        ...                                                  put_task_struct()
        ...                                                      __put_task_struct()
        ...                                                          free_task_struct()
    task_numa_assign()                           ...
        get_task_struct()                        ...
As noted by Oleg:
<<The lockless get_task_struct(tsk) is only safe if tsk == current
and didn't pass exit_notify(), or if this tsk was found on a rcu
protected list (say, for_each_process() or find_task_by_vpid()).
IOW, it is only safe if release_task() was not called before we
take rcu_read_lock(), in this case we can rely on the fact that
delayed_put_pid() can not drop the (potentially) last reference
until rcu_read_unlock().
And as Kirill pointed out task_numa_compare()->task_numa_assign()
path does get_task_struct(dst_rq->curr) and this is not safe. The
task_struct itself can't go away, but rcu_read_lock() can't save
us from the final put_task_struct() in finish_task_switch(); this
reference goes away without rcu gp>>
The patch adds a simple check of the PF_EXITING flag. If it's not set,
this guarantees that call_rcu() of the delayed_put_task_struct() callback
hasn't happened yet, so we can safely do get_task_struct() in
task_numa_assign().
Holding dst_rq->lock protects against concurrency with the last schedule();
without it, cur's memory may be reused or unmapped.
Suggested-by: Oleg Nesterov <[email protected]>
Signed-off-by: Kirill Tkhai <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 14 ++++++++++++--
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b069bf..fbc0b82 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1164,9 +1164,19 @@ static void task_numa_compare(struct task_numa_env *env,
long moveimp = imp;
rcu_read_lock();
- cur = ACCESS_ONCE(dst_rq->curr);
- if (cur->pid == 0) /* idle */
+
+ raw_spin_lock_irq(&dst_rq->lock);
+ cur = dst_rq->curr;
+ /*
+ * No need to move the exiting task, and this ensures that ->curr
+ * wasn't reaped and thus get_task_struct() in task_numa_assign()
+ * is safe under RCU read lock.
+ * Note that rcu_read_lock() itself can't protect from the final
+ * put_task_struct() after the last schedule().
+ */
+ if ((cur->flags & PF_EXITING) || is_idle_task(cur))
cur = NULL;
+ raw_spin_unlock_irq(&dst_rq->lock);
/*
* "imp" is the fault differential for the source task between the
On Tue, Oct 28, 2014 at 08:44:51AM +0300, Kirill Tkhai wrote:
> On Mon, 27/10/2014 at 20:54 +0100, Oleg Nesterov wrote:
> > +#define probe_slab_address(addr, retval) \
> > + probe_kernel_address(addr, retval)
>
> probe_kernel_read() was arch-dependent on tree platforms:
>
> arch/blackfin/mm/maccess.c
> arch/parisc/lib/memcpy.c
> arch/um/kernel/maccess.c
>
> But now we skip these arch-dependent implementations. Is there no a problem?
Nope, see the first patch, it makes probe_kernel_address use
__probe_kernel_read().
On 28.10.2014 18:01, Peter Zijlstra wrote:
> On Tue, Oct 28, 2014 at 08:44:51AM +0300, Kirill Tkhai wrote:
>> On Mon, 27/10/2014 at 20:54 +0100, Oleg Nesterov wrote:
>
>>> +#define probe_slab_address(addr, retval) \
>>> + probe_kernel_address(addr, retval)
>>
>> probe_kernel_read() was arch-dependent on tree platforms:
>>
>> arch/blackfin/mm/maccess.c
>> arch/parisc/lib/memcpy.c
>> arch/um/kernel/maccess.c
>>
>> But now we skip these arch-dependent implementations. Is there no a problem?
>
> Nope, see the first patch, it makes probe_kernel_address use
> __probe_kernel_read().
>
Yes, probe_kernel_read() is in [1/3], but it's not the same as
__probe_kernel_read() for blackfin, for example.
It's defined as
long __weak probe_kernel_read(void *dst, const void *src, size_t size)
__attribute__((alias("__probe_kernel_read")));
But blackfin's probe_kernel_read() redefines this __weak function,
doesn't it? Didn't get_freepointer_safe() use to call the architecture's
probe_kernel_read() before?
I don't see how it is called now...
On 28.10.2014 20:56, Kirill Tkhai wrote:
> On 28.10.2014 18:01, Peter Zijlstra wrote:
>> On Tue, Oct 28, 2014 at 08:44:51AM +0300, Kirill Tkhai wrote:
>>> On Mon, 27/10/2014 at 20:54 +0100, Oleg Nesterov wrote:
>>
>>>> +#define probe_slab_address(addr, retval) \
>>>> + probe_kernel_address(addr, retval)
>>>
>>> probe_kernel_read() was arch-dependent on tree platforms:
>>>
>>> arch/blackfin/mm/maccess.c
>>> arch/parisc/lib/memcpy.c
>>> arch/um/kernel/maccess.c
>>>
>>> But now we skip these arch-dependent implementations. Is there no a problem?
>>
>> Nope, see the first patch, it makes probe_kernel_address use
>> __probe_kernel_read().
>>
>
> Yes, probe_kernel_read() is in [1/3], but it's not the same as
> __probe_kernel_read() for blackfin, for example.
Vice versa, I mean __probe_kernel_read() is in [1/3].
> It's defined as
>
> long __weak probe_kernel_read(void *dst, const void *src, size_t size)
> __attribute__((alias("__probe_kernel_read")));
>
> But blackfin's probe_kernel_read() redefines this __weak function,
> isn't it? Didn't get_freepointer_safe() use to call architecture's
> probe_kernel_read() before?
>
> I don't see how it is called now...
>
On 10/28, Kirill Tkhai wrote:
>
> Yes, probe_kernel_read() is in [1/3], but it's not the same as
> __probe_kernel_read() for blackfin, for example.
>
> It's defined as
>
> long __weak probe_kernel_read(void *dst, const void *src, size_t size)
> __attribute__((alias("__probe_kernel_read")));
>
> But blackfin's probe_kernel_read() redefines this __weak function,
> isn't it? Didn't get_freepointer_safe() use to call architecture's
> probe_kernel_read() before?
I _think_ that __probe_kernel_read(slab_ddr) should be fine.
Yes, an architecture may want to reimplement probe_kernel_read() to
allow to safely access the special areas, or special addresses.
But again, in this case we know that this address points to the
"normal" kernel memory, __copy_from_user_inatomic() should work fine.
Oleg.
On 10/28, Oleg Nesterov wrote:
>
> On 10/28, Kirill Tkhai wrote:
> >
> > Yes, probe_kernel_read() is in [1/3], but it's not the same as
> > __probe_kernel_read() for blackfin, for example.
> >
> > It's defined as
> >
> > long __weak probe_kernel_read(void *dst, const void *src, size_t size)
> > __attribute__((alias("__probe_kernel_read")));
> >
> > But blackfin's probe_kernel_read() redefines this __weak function,
> > isn't it? Didn't get_freepointer_safe() use to call architecture's
> > probe_kernel_read() before?
>
> I _think_ that __probe_kernel_read(slab_ddr) should be fine.
>
> Yes, an architecture may want to reimplement probe_kernel_read() to
> allow to safely access the special areas, or special addresses.
>
> But again, in this case we know that this address points to the
> "normal" kernel memory, __copy_from_user_inatomic() should work fine.
OTOH, perhaps probe_kernel_address() should use probe_kernel_read(), not
__probe_kernel_read(). But currently it just calls __copy_inatomic() so
1/3 follows this logic.
Oleg.
On Tue, 28/10/2014 at 21:12 +0100, Oleg Nesterov wrote:
> On 10/28, Oleg Nesterov wrote:
> >
> > On 10/28, Kirill Tkhai wrote:
> > >
> > > Yes, probe_kernel_read() is in [1/3], but it's not the same as
> > > __probe_kernel_read() for blackfin, for example.
> > >
> > > It's defined as
> > >
> > > long __weak probe_kernel_read(void *dst, const void *src, size_t size)
> > > __attribute__((alias("__probe_kernel_read")));
> > >
> > > But blackfin's probe_kernel_read() redefines this __weak function,
> > > isn't it? Didn't get_freepointer_safe() use to call architecture's
> > > probe_kernel_read() before?
> >
> > I _think_ that __probe_kernel_read(slab_ddr) should be fine.
> >
> > Yes, an architecture may want to reimplement probe_kernel_read() to
> > allow to safely access the special areas, or special addresses.
> >
> > But again, in this case we know that this address points to the
> > "normal" kernel memory, __copy_from_user_inatomic() should work fine.
>
> OTOH, perhaps probe_kernel_address() should use probe_kernel_read(), not
> __probe_kernel_read(). But currently it just calls __copy_inatomic() so
> 1/3 follows this logic.
Ok, thanks for the explanation, Oleg.
On 10/22/2014 03:17 AM, Kirill Tkhai wrote:
> Unlocked access to dst_rq->curr in task_numa_compare() is racy.
> If curr task is exiting this may be a reason of use-after-free:
[...]
I've complained about an unrelated issue in that part of the code
a while back (https://lkml.org/lkml/2014/5/12/508) which PeterZ
ended up fixing (https://lkml.org/lkml/2014/5/21/428) but it seems
that both of us forgot to follow up on that and the fix never got
upstream.
Ever since this patch made it upstream, Peter's patch which I was
carrying in my tree stopped applying and I've started seeing:
[ 829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
[ 829.539203] lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
[ 829.539212] CPU: 10 PID: 11067 Comm: trinity-c594 Not tainted 3.18.0-rc3-next-20141106-sasha-00054-g09b7ccf-dirty #1448
[ 829.539226] 0000000000000000 0000000000000000 ffff880053acb000 ffff88032b71f828
[ 829.539235] ffffffffa009fb5a 0000000000000057 ffff880631dd6b80 ffff88032b71f868
[ 829.539243] ffffffff963f0c57 ffff880053acbd80 ffff880053acbdb0 ffff88032b71f858
[ 829.539246] Call Trace:
[ 829.539265] dump_stack (lib/dump_stack.c:52)
[ 829.539277] spin_dump (kernel/locking/spinlock_debug.c:68 (discriminator 8))
[ 829.539282] spin_bug (kernel/locking/spinlock_debug.c:76)
[ 829.539288] do_raw_spin_lock (kernel/locking/spinlock_debug.c:84 kernel/locking/spinlock_debug.c:135)
[ 829.539304] ? __schedule (kernel/sched/core.c:2803)
[ 829.539313] _raw_spin_lock_irq (include/linux/spinlock_api_smp.h:129 kernel/locking/spinlock.c:167)
[ 829.539321] ? task_numa_find_cpu (kernel/sched/fair.c:1258 kernel/sched/fair.c:1385)
[ 829.539330] ? rcu_is_watching (./arch/x86/include/asm/preempt.h:95 kernel/rcu/tree.c:827)
[ 829.539336] task_numa_find_cpu (kernel/sched/fair.c:1258 kernel/sched/fair.c:1385)
[ 829.539342] ? task_numa_find_cpu (kernel/sched/fair.c:1253 kernel/sched/fair.c:1385)
[ 829.539352] ? preempt_count_sub (kernel/sched/core.c:2644)
[ 829.539358] task_numa_migrate (kernel/sched/fair.c:1452)
[ 829.539364] ? task_numa_migrate (kernel/sched/fair.c:1391)
[ 829.539376] ? kvm_clock_read (./arch/x86/include/asm/preempt.h:87 arch/x86/kernel/kvmclock.c:85)
[ 829.539386] ? sched_clock (./arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:304)
[ 829.539392] ? sched_clock_local (kernel/sched/clock.c:202)
[ 829.539399] numa_migrate_preferred (kernel/sched/fair.c:1539)
[ 829.539404] ? sched_clock_local (kernel/sched/clock.c:202)
[ 829.539411] task_numa_fault (kernel/sched/fair.c:2073)
[ 829.539417] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 829.539429] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
[ 829.539438] ? get_lock_stats (kernel/locking/lockdep.c:249)
[ 829.539446] ? get_parent_ip (kernel/sched/core.c:2588)
[ 829.539461] handle_mm_fault (mm/memory.c:3187 mm/memory.c:3233 mm/memory.c:3346 mm/memory.c:3375)
[ 829.539466] ? find_vma (mm/mmap.c:2048)
[ 829.539477] __do_page_fault (arch/x86/mm/fault.c:1246)
[ 829.539485] ? context_tracking_user_exit (kernel/context_tracking.c:144)
[ 829.539491] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 829.539498] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2640 (discriminator 8))
[ 829.539505] trace_do_page_fault (arch/x86/mm/fault.c:1329 include/linux/jump_label.h:114 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1330)
[ 829.539510] do_async_page_fault (arch/x86/kernel/kvm.c:280)
[ 829.539516] async_page_fault (arch/x86/kernel/entry_64.S:1301)
Thanks,
Sasha
Hi,
On Fri, 07/11/2014 at 22:48 -0500, Sasha Levin wrote:
> On 10/22/2014 03:17 AM, Kirill Tkhai wrote:
> > Unlocked access to dst_rq->curr in task_numa_compare() is racy.
> > If curr task is exiting this may be a reason of use-after-free:
> [...]
>
> I've complained about an unrelated issue in that part of the code
> a while back (https://lkml.org/lkml/2014/5/12/508) which PeterZ
> ended up fixing (https://lkml.org/lkml/2014/5/21/428) but it seems
> that both of us forgot to follow up on that and the fix never got
> upstream.
>
> Ever since this patch made it upstream, Peter's patch which I was
> carrying in my tree stopped applying and I've started seeing:
>
> [ 829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> [ 829.539203] lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
> [ 829.539212] CPU: 10 PID: 11067 Comm: trinity-c594 Not tainted 3.18.0-rc3-next-20141106-sasha-00054-g09b7ccf-dirty #1448
> [ 829.539226] 0000000000000000 0000000000000000 ffff880053acb000 ffff88032b71f828
> [ 829.539235] ffffffffa009fb5a 0000000000000057 ffff880631dd6b80 ffff88032b71f868
> [ 829.539243] ffffffff963f0c57 ffff880053acbd80 ffff880053acbdb0 ffff88032b71f858
> [ 829.539246] Call Trace:
> [ 829.539265] dump_stack (lib/dump_stack.c:52)
> [ 829.539277] spin_dump (kernel/locking/spinlock_debug.c:68 (discriminator 8))
> [ 829.539282] spin_bug (kernel/locking/spinlock_debug.c:76)
> [ 829.539288] do_raw_spin_lock (kernel/locking/spinlock_debug.c:84 kernel/locking/spinlock_debug.c:135)
> [ 829.539304] ? __schedule (kernel/sched/core.c:2803)
> [ 829.539313] _raw_spin_lock_irq (include/linux/spinlock_api_smp.h:129 kernel/locking/spinlock.c:167)
> [ 829.539321] ? task_numa_find_cpu (kernel/sched/fair.c:1258 kernel/sched/fair.c:1385)
> [ 829.539330] ? rcu_is_watching (./arch/x86/include/asm/preempt.h:95 kernel/rcu/tree.c:827)
> [ 829.539336] task_numa_find_cpu (kernel/sched/fair.c:1258 kernel/sched/fair.c:1385)
> [ 829.539342] ? task_numa_find_cpu (kernel/sched/fair.c:1253 kernel/sched/fair.c:1385)
> [ 829.539352] ? preempt_count_sub (kernel/sched/core.c:2644)
> [ 829.539358] task_numa_migrate (kernel/sched/fair.c:1452)
> [ 829.539364] ? task_numa_migrate (kernel/sched/fair.c:1391)
> [ 829.539376] ? kvm_clock_read (./arch/x86/include/asm/preempt.h:87 arch/x86/kernel/kvmclock.c:85)
> [ 829.539386] ? sched_clock (./arch/x86/include/asm/paravirt.h:192 arch/x86/kernel/tsc.c:304)
> [ 829.539392] ? sched_clock_local (kernel/sched/clock.c:202)
> [ 829.539399] numa_migrate_preferred (kernel/sched/fair.c:1539)
> [ 829.539404] ? sched_clock_local (kernel/sched/clock.c:202)
> [ 829.539411] task_numa_fault (kernel/sched/fair.c:2073)
> [ 829.539417] ? sched_clock_cpu (kernel/sched/clock.c:311)
> [ 829.539429] ? debug_smp_processor_id (lib/smp_processor_id.c:57)
> [ 829.539438] ? get_lock_stats (kernel/locking/lockdep.c:249)
> [ 829.539446] ? get_parent_ip (kernel/sched/core.c:2588)
> [ 829.539461] handle_mm_fault (mm/memory.c:3187 mm/memory.c:3233 mm/memory.c:3346 mm/memory.c:3375)
> [ 829.539466] ? find_vma (mm/mmap.c:2048)
> [ 829.539477] __do_page_fault (arch/x86/mm/fault.c:1246)
> [ 829.539485] ? context_tracking_user_exit (kernel/context_tracking.c:144)
> [ 829.539491] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
> [ 829.539498] ? trace_hardirqs_off_caller (kernel/locking/lockdep.c:2640 (discriminator 8))
> [ 829.539505] trace_do_page_fault (arch/x86/mm/fault.c:1329 include/linux/jump_label.h:114 include/linux/context_tracking_state.h:27 include/linux/context_tracking.h:45 arch/x86/mm/fault.c:1330)
> [ 829.539510] do_async_page_fault (arch/x86/kernel/kvm.c:280)
> [ 829.539516] async_page_fault (arch/x86/kernel/entry_64.S:1301)
The below is equivalent to the patch suggested by Peter.
The only thing I doubt is the comparison of cpus instead of nids.
Should task_numa_compare() be callable with src_nid == dst_nid,
as may happen now?! Maybe it's better to change task_numa_migrate()
and check for env.dst_nid != env.src_nid.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 826fdf3..c18129e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1380,6 +1380,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
/* Skip this CPU if the source task cannot migrate */
if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
continue;
+ if (cpu == env->src_cpu)
+ continue;
env->dst_cpu = cpu;
task_numa_compare(env, taskimp, groupimp);
On Sun, Nov 09, 2014 at 05:07:00PM +0300, Kirill Tkhai wrote:
> > I've complained about an unrelated issue in that part of the code
> > a while back (https://lkml.org/lkml/2014/5/12/508) which PeterZ
> > ended up fixing (https://lkml.org/lkml/2014/5/21/428) but it seems
> > that both of us forgot to follow up on that and the fix never got
> > upstream.
Argh, sorry for that!
> The bellow is equal to the patch suggested by Peter.
>
> The only thing, I'm doubt, is about the comparison of cpus instead of nids.
> Should task_numa_compare() be able to be called with src_nid == dst_nid
> like this may happens now?! Maybe better, we should change task_numa_migrate()
> and check for env.dst_nid != env.src.nid.
>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 826fdf3..c18129e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1380,6 +1380,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> /* Skip this CPU if the source task cannot migrate */
> if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
> continue;
> + if (cpu == env->src_cpu)
> + continue;
>
> env->dst_cpu = cpu;
> task_numa_compare(env, taskimp, groupimp);
Not quite the same, current can still get migrated to another cpu than
env->src_cpu, at which point we can still end up selecting ourselves.
How about the below instead, that is pretty much the old patch, but with
a nice comment.
---
Subject: sched, numa: Avoid selecting oneself as swap target
From: Peter Zijlstra <[email protected]>
Date: Mon Nov 10 10:54:35 CET 2014
Because the whole numa task selection stuff runs with preemption
enabled (it's long and expensive) we can end up migrating and selecting
ourselves as a swap target. This doesn't really work out well -- we end
up trying to acquire the same lock twice for the swap migrate -- so
avoid this.
Cc: Rik van Riel <[email protected]>
Tested-by: Sasha Levin <[email protected]>
Reported-by: Sasha Levin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1268,6 +1268,13 @@ static void task_numa_compare(struct tas
raw_spin_unlock_irq(&dst_rq->lock);
/*
+ * Because we have preemption enabled we can get migrated around and
+ * end try selecting ourselves (current == env->p) as a swap candidate.
+ */
+ if (cur == env->p)
+ goto unlock;
+
+ /*
* "imp" is the fault differential for the source task between the
* source and destination node. Calculate the total differential for
* the source task and potential destination task. The more negative
On 11/10/2014 05:03 AM, Peter Zijlstra wrote:
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1268,6 +1268,13 @@ static void task_numa_compare(struct tas
> raw_spin_unlock_irq(&dst_rq->lock);
>
> /*
> + * Because we have preemption enabled we can get migrated around and
> + * end try selecting ourselves (current == env->p) as a swap candidate.
> + */
> + if (cur == env->p)
> + goto unlock;
This is too late though, because currently the lockup happens a couple of lines
above that at:
raw_spin_lock_irq(&dst_rq->lock); <=== here
cur = dst_rq->curr;
Which got me a bit stuck trying to use your old patch since we can't access '->curr'
without locking dst_rq, but locking dst_rq is causing a lockup.
Thanks,
Sasha
On Mon, Nov 10, 2014 at 10:48:28AM -0500, Sasha Levin wrote:
> On 11/10/2014 05:03 AM, Peter Zijlstra wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1268,6 +1268,13 @@ static void task_numa_compare(struct tas
> > raw_spin_unlock_irq(&dst_rq->lock);
> >
> > /*
> > + * Because we have preemption enabled we can get migrated around and
> > + * end try selecting ourselves (current == env->p) as a swap candidate.
> > + */
> > + if (cur == env->p)
> > + goto unlock;
>
> This is too late though, because currently the lockup happens couple of lines
> above that at:
>
> raw_spin_lock_irq(&dst_rq->lock); <=== here
> cur = dst_rq->curr;
>
> Which got me a bit stuck trying to use your old patch since we can't access '->curr'
> without locking dst_rq, but locking dst_rq is causing a lockup.
Confused... how can we lock up there? We should not be holding _any_
lock there.
That's a different problem than the originally reported one.
On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> [ 829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> [ 829.539203] lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
One of those again :/
On 11/10/2014 11:03 AM, Peter Zijlstra wrote:
> On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
>> [ 829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
>> [ 829.539203] lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
>
> Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
>
> One of those again :/
Hum. I missed that one.
Hold on, but the magic here is fine and the owner pointer is fine, why would just the owner cpu
be wrong?
Thanks,
Sasha
On Mon, 10/11/2014 at 17:03 +0100, Peter Zijlstra wrote:
> On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> > [ 829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> > [ 829.539203] lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
>
> Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
>
> One of those again :/
We do not initialise task_struct::numa_preferred_nid for INIT_TASK.
Isn't there a problem?
On Mon, Nov 10, 2014 at 11:09:29AM -0500, Sasha Levin wrote:
> On 11/10/2014 11:03 AM, Peter Zijlstra wrote:
> > On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> >> [ 829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> >> [ 829.539203] lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
> >
> > Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
> >
> > One of those again :/
>
> Hum. I missed that one.
>
> Hold on, but the magic here is fine and the owner pointer is fine, why would just the owner cpu
> be wrong?
I've no clue, but I've seen them before. Complete mystery that.
On Mon, 10/11/2014 at 19:10 +0300, Kirill Tkhai wrote:
> On Mon, 10/11/2014 at 17:03 +0100, Peter Zijlstra wrote:
> > On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
> > > [ 829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
> > > [ 829.539203] lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
> >
> > Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
> >
> > One of those again :/
>
> We do not initialyse task_struct::numa_preferred_nid for INIT_TASK.
> It there no a problem?
>
I mean task_numa_find_cpu(). If there is garbage in cpumask_of_node(env->dst_nid)
and the cpu is bigger than the mask, the check
cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p))
may be true.
So, we dereference a wrong rq in task_numa_compare(). It's not an rq at all.
The strange cpu may come from here; it's just an int number in the wrong memory.
A hypothesis: the patch below may help:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 826fdf3..a2b4a8a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1376,6 +1376,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
{
int cpu;
+ if (!node_online(env->dst_nid))
+ return;
+
for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
/* Skip this CPU if the source task cannot migrate */
if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
On 11/10/2014 11:36 AM, Kirill Tkhai wrote:
> I mean task_numa_find_cpu(). If a garbage is in cpumask_of_node(env->dst_nid)
> and cpu is bigger than mask, the check
>
> cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)
>
> may be true.
>
> So, we dereference wrong rq in task_numa_compare(). It's not rq at all.
> Strange cpu may be from here. It's just a int number in a wrong memory.
But the odds of the spinlock magic and owner pointer matching up are slim
to none in that case. The memory is also likely to be valid since KASAN didn't
complain about the access, so I don't believe it to be an access to freed memory.
>
> A hypothesis that below may help:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 826fdf3..a2b4a8a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1376,6 +1376,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> {
> int cpu;
>
> + if (!node_online(env->dst_nid))
> + return;
I've changed that to BUG_ON(!node_online(env->dst_nid)) and will run it for a
bit.
Thanks,
Sasha
10.11.2014, 19:45, "Sasha Levin" <[email protected]>:
> On 11/10/2014 11:36 AM, Kirill Tkhai wrote:
>> I mean task_numa_find_cpu(). If there is garbage in cpumask_of_node(env->dst_nid)
>> and the cpu it yields is bigger than the mask allows, the check
>>
>> cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p))
>>
>> may still be true.
>>
>> So we dereference the wrong rq in task_numa_compare(); it's not an rq at all.
>> The strange cpu may come from here: it's just an int read from the wrong memory.
>
> But the odds of the spinlock magic and owner pointer matching up are slim
> to none in that case. The memory is also likely to be valid since KASAN didn't
> complain about the access, so I don't believe it to be an access to freed memory.
I'm not good with lockdep checks, so I can't judge right now...
Just a hypothesis.
>> A hypothesis: the patch below may help:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 826fdf3..a2b4a8a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1376,6 +1376,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>>  {
>>  	int cpu;
>>
>> +	if (!node_online(env->dst_nid))
>> +		return;
>
> I've changed that to BUG_ON(!node_online(env->dst_nid)) and will run it for a
> bit.
I've looked one more time, and it looks like it's better to check for
BUG_ON(env->dst_nid > MAX_NUMNODES). node_online() may not work for
insane nids.
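(For context on why node_online() alone can't catch this: it is essentially just a
bit test against a fixed-size nodemask, roughly as in the sketch below, simplified
from the generic nodemask helpers, so a nid beyond MAX_NUMNODES indexes past the end
of the bitmap instead of returning false.)

/* Simplified sketch of what node_online(nid) boils down to; note there is
 * no range check on nid before the bit test. */
static inline int node_online_sketch(int nid)
{
	return test_bit(nid, node_states[N_ONLINE].bits);
}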
Anyway, even if it's not connected, we need to initialize numa_preferred_nid
of init_task, because it's inherited by kernel_init() (and /sbin/init too).
I'll send the patch tomorrow.
Kirill
On Mon, 10/11/2014 at 23:01 +0300, Kirill Tkhai wrote:
>
> 10.11.2014, 19:45, "Sasha Levin" <[email protected]>:
> > On 11/10/2014 11:36 AM, Kirill Tkhai wrote:
> >> I mean task_numa_find_cpu(). If there is garbage in cpumask_of_node(env->dst_nid)
> >> and the cpu it yields is bigger than the mask allows, the check
> >>
> >> cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p))
> >>
> >> may still be true.
> >>
> >> So we dereference the wrong rq in task_numa_compare(); it's not an rq at all.
> >> The strange cpu may come from here: it's just an int read from the wrong memory.
> >
> > But the odds of the spinlock magic and owner pointer matching up are slim
> > to none in that case. The memory is also likely to be valid since KASAN didn't
> > complain about the access, so I don't believe it to be an access to freed memory.
>
> I'm not good with lockdep checks, so I can't judge right now...
> Just a hypothesis.
>
> >> A hypothesis: the patch below may help:
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 826fdf3..a2b4a8a 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -1376,6 +1376,9 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> >> {
> >> int cpu;
> >>
> >> + if (!node_online(env->dst_nid))
> >> + return;
> >
> > I've changed that to BUG_ON(!node_online(env->dst_nid)) and will run it for a
> > bit.
>
> I've looked one more time, and it looks like it's better to check for
> BUG_ON(env->dst_nid > MAX_NUMNODES). node_online() may not work for
> insane nids.
>
> Anyway, even if it's not connected, we need to initialize numa_preferred_nid
> of init_task, because it's inherited by kernel_init() (and /sbin/init too).
> I'll send the patch tomorrow.
We can also check the cpu. The problem may be in how the nodes are built, etc.:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 826fdf3..8f5c316 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1381,6 +1381,14 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
 			continue;
+		/*
+		 * Is cpumask_of_node() broken? In sched_init() we
+		 * initialize only possible RQs (their rq->lock etc).
+		 */
+		BUG_ON(cpu >= NR_CPUS || !cpu_possible(cpu));
+		/* Insane node id? */
+		BUG_ON(env->dst_nid >= MAX_NUMNODES || !node_online(env->dst_nid));
+
 		env->dst_cpu = cpu;
 		task_numa_compare(env, taskimp, groupimp);
 	}
On 11/10/2014 11:03 AM, Peter Zijlstra wrote:
> On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
>> > [ 829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
>> > [ 829.539203] lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
> Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
>
> One of those again :/
This actually reproduces pretty easily. The stack trace seems to be different
but the end result is the same as above. Anything we can do to debug it? I'm
really not sure how just the owner_cpu can be different here.
Thanks,
Sasha
Commit-ID: 7af683350cb0ddd0e9d3819b4eb7abe9e2d3e709
Gitweb: http://git.kernel.org/tip/7af683350cb0ddd0e9d3819b4eb7abe9e2d3e709
Author: Peter Zijlstra <[email protected]>
AuthorDate: Mon, 10 Nov 2014 10:54:35 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 16 Nov 2014 10:04:17 +0100
sched/numa: Avoid selecting oneself as swap target
Because the whole numa task selection stuff runs with preemption
enabled (it's long and expensive) we can end up migrating and selecting
oneself as a swap target. This doesn't really work out well -- we end
up trying to acquire the same lock twice for the swap migrate -- so
avoid this.
Reported-and-Tested-by: Sasha Levin <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 34baa60..3af3d1e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1180,6 +1180,13 @@ static void task_numa_compare(struct task_numa_env *env,
 	raw_spin_unlock_irq(&dst_rq->lock);
 	/*
+	 * Because we have preemption enabled we can get migrated around and
+	 * end try selecting ourselves (current == env->p) as a swap candidate.
+	 */
+	if (cur == env->p)
+		goto unlock;
+
+	/*
 	 * "imp" is the fault differential for the source task between the
 	 * source and destination node. Calculate the total differential for
 	 * the source task and potential destination task. The more negative
On 11/14/2014 09:38 PM, Sasha Levin wrote:
> On 11/10/2014 11:03 AM, Peter Zijlstra wrote:
>> > On Fri, Nov 07, 2014 at 10:48:27PM -0500, Sasha Levin wrote:
>>>> >> > [ 829.539183] BUG: spinlock recursion on CPU#10, trinity-c594/11067
>>>> >> > [ 829.539203] lock: 0xffff880631dd6b80, .magic: dead4ead, .owner: trinity-c594/11067, .owner_cpu: 13
>> > Ooh, look at that. CPU#10 vs .owner_cpu: 13 on the _same_ task.
>> >
>> > One of those again :/
> This actually reproduces pretty easily. The stack trace seems to be different
> but the end result is the same as above. Anything we can do to debug it? I'm
> really not sure how just the owner_cpu can be different here.
I've added code to save stack traces on raw spinlock lock/unlock. I had to keep it at
8 entries, but that's enough to get the gist of it. The trace I'm seeing is:
[ 1239.788220] BUG: spinlock recursion on CPU#0, trinity-c767/32076
[ 1239.788270] irq event stamp: 9135954
[ 1239.788455] hardirqs last enabled at (9135953): free_pages_prepare (mm/page_alloc.c:770)
[ 1239.788573] hardirqs last disabled at (9135954): _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:109 kernel/locking/spinlock.c:159)
[ 1239.788687] softirqs last enabled at (9134166): __do_softirq (./arch/x86/include/asm/preempt.h:22 kernel/softirq.c:296)
[ 1239.788796] softirqs last disabled at (9134161): irq_exit (kernel/softirq.c:346 kernel/softirq.c:387)
[ 1239.791152] lock: 0xffff8805c3dd9100, .magic: dead4ead, .owner: trinity-c767/32076, .owner_cpu: -1
[ 1239.791152] lock trace:
[ 1239.791152] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
[ 1239.791152] do_raw_spin_trylock (kernel/locking/spinlock_debug.c:171)
[ 1239.791152] _raw_spin_lock_irq (include/linux/spinlock_api_smp.h:129 kernel/locking/spinlock.c:167)
[ 1239.791152] __schedule (kernel/sched/core.c:2803)
[ 1239.791152] schedule (kernel/sched/core.c:2870)
[ 1239.791152] p9_client_rpc (net/9p/client.c:756 (discriminator 13))
[ 1239.791152] p9_client_read (net/9p/client.c:1582)
[ 1239.791152] v9fs_fid_readn (fs/9p/vfs_file.c:386)
[ 1239.791152] unlock trace:
[ 1239.791152] save_stack_trace (arch/x86/kernel/stacktrace.c:64)
[ 1239.791152] do_raw_spin_unlock (kernel/locking/spinlock_debug.c:120 include/linux/jump_label.h:114 ./arch/x86/include/asm/spinlock.h:149 kernel/locking/spinlock_debug.c:176)
[ 1239.791152] _raw_spin_unlock_irq (include/linux/spinlock_api_smp.h:168 kernel/locking/spinlock.c:199)
[ 1239.791152] finish_task_switch (./arch/x86/include/asm/current.h:14 kernel/sched/core.c:2251)
[ 1239.791152] __schedule (kernel/sched/core.c:2358 kernel/sched/core.c:2840)
[ 1239.791152] schedule_preempt_disabled (kernel/sched/core.c:2897)
[ 1239.791152] cpu_startup_entry (kernel/sched/idle.c:252 kernel/sched/idle.c:274)
[ 1239.791152] start_secondary (arch/x86/kernel/smpboot.c:238)
[ 1239.791152] CPU: 0 PID: 32076 Comm: trinity-c767 Not tainted 3.18.0-rc4-next-20141117-sasha-00046-g481e3e8-dirty #1471
[ 1239.791152] 0000000000000000 0000000000000000 ffff88089a90b000 ffff880867223728
[ 1239.791152] ffffffff920d1ec1 0000000000000030 ffff8805c3dd9100 ffff880867223768
[ 1239.791152] ffffffff815d0617 ffffffff9eec7730 ffff88089a90bf58 ffff8805c3dd9100
[ 1239.791152] Call Trace:
[ 1239.791152] dump_stack (lib/dump_stack.c:52)
[ 1239.791152] spin_dump (kernel/locking/spinlock_debug.c:81 (discriminator 8))
[ 1239.791152] spin_bug (kernel/locking/spinlock_debug.c:89)
[ 1239.791152] do_raw_spin_lock (kernel/locking/spinlock_debug.c:97 kernel/locking/spinlock_debug.c:152)
[ 1239.791152] ? load_balance (kernel/sched/fair.c:5582 kernel/sched/fair.c:6884)
[ 1239.791152] _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:117 kernel/locking/spinlock.c:159)
[ 1239.791152] ? load_balance (kernel/sched/fair.c:5582 kernel/sched/fair.c:6884)
[ 1239.791152] load_balance (kernel/sched/fair.c:5582 kernel/sched/fair.c:6884)
[ 1239.791152] pick_next_task_fair (kernel/sched/fair.c:7154 kernel/sched/fair.c:5093)
[ 1239.791152] ? pick_next_task_fair (kernel/sched/fair.c:7131 kernel/sched/fair.c:5093)
[ 1239.791152] __schedule (kernel/sched/core.c:2716 kernel/sched/core.c:2830)
[ 1239.791152] ? preempt_count_sub (kernel/sched/core.c:2644)
[ 1239.791152] schedule (kernel/sched/core.c:2870)
[ 1239.791152] p9_client_rpc (net/9p/client.c:756 (discriminator 13))
[ 1239.791152] ? __init_waitqueue_head (kernel/sched/wait.c:292)
[ 1239.791152] ? p9_idpool_put (net/9p/util.c:127)
[ 1239.791152] p9_client_read (net/9p/client.c:1582)
[ 1239.791152] ? p9_free_req (net/9p/client.c:410)
[ 1239.791152] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3734)
[ 1239.791152] ? might_fault (mm/memory.c:3734)
[ 1239.791152] v9fs_fid_readn (fs/9p/vfs_file.c:386)
[ 1239.791152] v9fs_file_read (fs/9p/vfs_file.c:447)
[ 1239.791152] do_loop_readv_writev (fs/read_write.c:724)
[ 1239.791152] ? v9fs_fid_readn (fs/9p/vfs_file.c:433)
[ 1239.791152] ? v9fs_fid_readn (fs/9p/vfs_file.c:433)
[ 1239.791152] do_readv_writev (fs/read_write.c:856)
[ 1239.791152] ? save_stack_trace (arch/x86/kernel/stacktrace.c:64)
[ 1239.791152] vfs_readv (fs/read_write.c:882)
[ 1239.791152] SyS_readv (fs/read_write.c:908 fs/read_write.c:899)
[ 1239.791152] tracesys_phase2 (arch/x86/kernel/entry_64.S:529)
So I still can't explain why owner_cpu is "-1" rather than something that makes sense,
but the trace *kinda* makes sense.
We locked the rq once in __schedule(), then found the same rq as the busiest
one in load_balance() and locked it yet again.
The odd thing here is that load_balance() has a check just for that:
BUG_ON(busiest == env.dst_rq);
But that check doesn't get hit, even though the lock itself is only taken a bit later:
more_balance:
raw_spin_lock_irqsave(&busiest->lock, flags);
And there's no check before the actual lock there, so I've added another BUG_ON
there and will update with results.
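A minimal sketch of the added check being described (an assumption about its exact
form, not the actual debugging patch):

 more_balance:
+		/* Catch the self-lock before we deadlock on it. */
+		BUG_ON(busiest == env.dst_rq);
 		raw_spin_lock_irqsave(&busiest->lock, flags);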
Thanks,
Sasha