2018-01-19 09:24:51

by Pavankumar Kondeti

Subject: Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

Hi Steven,

> /* Called from hardirq context */
> -static void try_to_push_tasks(void *arg)
> +void rto_push_irq_work_func(struct irq_work *work)
> {
> - struct rt_rq *rt_rq = arg;
> - struct rq *rq, *src_rq;
> - int this_cpu;
> + struct rq *rq;
> int cpu;
>
> - this_cpu = rt_rq->push_cpu;
> + rq = this_rq();
>
> - /* Paranoid check */
> - BUG_ON(this_cpu != smp_processor_id());
> -
> - rq = cpu_rq(this_cpu);
> - src_rq = rq_of_rt_rq(rt_rq);
> -
> -again:
> + /*
> + * We do not need to grab the lock to check for has_pushable_tasks.
> + * When it gets updated, a check is made if a push is possible.
> + */
> if (has_pushable_tasks(rq)) {
> raw_spin_lock(&rq->lock);
> - push_rt_task(rq);
> + push_rt_tasks(rq);
> raw_spin_unlock(&rq->lock);
> }
>
> - /* Pass the IPI to the next rt overloaded queue */
> - raw_spin_lock(&rt_rq->push_lock);
> - /*
> - * If the source queue changed since the IPI went out,
> - * we need to restart the search from that CPU again.
> - */
> - if (rt_rq->push_flags & RT_PUSH_IPI_RESTART) {
> - rt_rq->push_flags &= ~RT_PUSH_IPI_RESTART;
> - rt_rq->push_cpu = src_rq->cpu;
> - }
> + raw_spin_lock(&rq->rd->rto_lock);
>
> - cpu = find_next_push_cpu(src_rq);
> + /* Pass the IPI to the next rt overloaded queue */
> + cpu = rto_next_cpu(rq);
>
> - if (cpu >= nr_cpu_ids)
> - rt_rq->push_flags &= ~RT_PUSH_IPI_EXECUTING;
> - raw_spin_unlock(&rt_rq->push_lock);
> + raw_spin_unlock(&rq->rd->rto_lock);
>
> - if (cpu >= nr_cpu_ids)
> + if (cpu < 0)
> return;

I am seeing a "spinlock already unlocked" BUG for rd->rto_lock on a 4.9
stable kernel based system. This issue is observed only after the
inclusion of this patch. It appears to me that rq->rd can change
between the time the spinlock is acquired and released in the
rto_push_irq_work_func() IRQ work if hotplug is in progress. It was
reported only a couple of times during long stress testing. The issue
can be easily reproduced if an artificial delay is introduced between
the lock and unlock of rto_lock. The rq->rd is changed under rq->lock,
so we can protect against this race with rq->lock. The below patch
solved the problem. We are already taking rq->lock in
pull_rt_task()->tell_cpu_to_push(), so I extended the same here.
Please let me know your thoughts on this.

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index d863d39..478192b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
 		raw_spin_unlock(&rq->lock);
 	}
 
+	raw_spin_lock(&rq->lock);
 	raw_spin_lock(&rq->rd->rto_lock);
 
 	/* Pass the IPI to the next rt overloaded queue */
@@ -2291,11 +2292,10 @@ void rto_push_irq_work_func(struct irq_work *work)
 
 	raw_spin_unlock(&rq->rd->rto_lock);
 
-	if (cpu < 0)
-		return;
-
 	/* Try the next RT overloaded CPU */
-	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	if (cpu >= 0)
+		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	raw_spin_unlock(&rq->lock);
 }
 #endif /* HAVE_RT_PUSH_IPI */
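
For reference, a sketch of the artificial delay mentioned above
(assuming a debug-only mdelay(); any busy-wait between the lock and
unlock of rto_lock widens the window enough for a concurrent
rq_attach_root() to swap rq->rd underneath us):

	raw_spin_lock(&rq->rd->rto_lock);

	/* Pass the IPI to the next rt overloaded queue */
	cpu = rto_next_cpu(rq);

	/*
	 * DEBUG ONLY: if rq->rd is replaced while we spin here, the
	 * unlock below operates on the new rd's rto_lock, which was
	 * never taken: "spinlock already unlocked".
	 */
	mdelay(10);

	raw_spin_unlock(&rq->rd->rto_lock);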


Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project


2018-01-19 15:04:52

by Steven Rostedt

Subject: Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

On Fri, 19 Jan 2018 14:53:05 +0530
Pavan Kondeti <[email protected]> wrote:

> I am seeing a "spinlock already unlocked" BUG for rd->rto_lock on a 4.9
> stable kernel based system. This issue is observed only after the
> inclusion of this patch. It appears to me that rq->rd can change
> between the time the spinlock is acquired and released in the
> rto_push_irq_work_func() IRQ work if hotplug is in progress. It was
> reported only a couple of times during long stress testing. The issue
> can be easily reproduced if an artificial delay is introduced between
> the lock and unlock of rto_lock. The rq->rd is changed under rq->lock,
> so we can protect against this race with rq->lock. The below patch
> solved the problem. We are already taking rq->lock in
> pull_rt_task()->tell_cpu_to_push(), so I extended the same here.
> Please let me know your thoughts on this.

Ah, so rq->rd can change. Interesting.

>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index d863d39..478192b 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
> raw_spin_unlock(&rq->lock);
> }
>
> + raw_spin_lock(&rq->lock);


What about just saving the rd then?

struct root_domain *rd;

rd = READ_ONCE(rq->rd);

then use that. Then we don't need to worry about it changing.
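
A minimal sketch of how that would look in rto_push_irq_work_func(),
untested (note that rto_next_cpu(rq) would still dereference rq->rd
internally, so it would want the same treatment):

	rd = READ_ONCE(rq->rd);

	raw_spin_lock(&rd->rto_lock);

	/* Pass the IPI to the next rt overloaded queue */
	cpu = rto_next_cpu(rq);

	raw_spin_unlock(&rd->rto_lock);

	if (cpu < 0)
		return;

	/* Try the next RT overloaded CPU */
	irq_work_queue_on(&rd->rto_push_work, cpu);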

-- Steve


> raw_spin_lock(&rq->rd->rto_lock);
>
> /* Pass the IPI to the next rt overloaded queue */
> @@ -2291,11 +2292,10 @@ void rto_push_irq_work_func(struct irq_work *work)
>
> raw_spin_unlock(&rq->rd->rto_lock);
>
> - if (cpu < 0)
> - return;
> -
> /* Try the next RT overloaded CPU */
> - irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> + if (cpu >= 0)
> + irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> + raw_spin_unlock(&rq->lock);
> }
> #endif /* HAVE_RT_PUSH_IPI */
>
>
> Thanks,
> Pavan
>


2018-01-19 15:45:29

by Pavankumar Kondeti

Subject: Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

Hi Steven,

On Fri, Jan 19, 2018 at 10:03:53AM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 14:53:05 +0530
> Pavan Kondeti <[email protected]> wrote:
>
> > I am seeing a "spinlock already unlocked" BUG for rd->rto_lock on a 4.9
> > stable kernel based system. This issue is observed only after the
> > inclusion of this patch. It appears to me that rq->rd can change
> > between the time the spinlock is acquired and released in the
> > rto_push_irq_work_func() IRQ work if hotplug is in progress. It was
> > reported only a couple of times during long stress testing. The issue
> > can be easily reproduced if an artificial delay is introduced between
> > the lock and unlock of rto_lock. The rq->rd is changed under rq->lock,
> > so we can protect against this race with rq->lock. The below patch
> > solved the problem. We are already taking rq->lock in
> > pull_rt_task()->tell_cpu_to_push(), so I extended the same here.
> > Please let me know your thoughts on this.
>
> Ah, so rq->rd can change. Interesting.
>
> >
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index d863d39..478192b 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
> > raw_spin_unlock(&rq->lock);
> > }
> >
> > + raw_spin_lock(&rq->lock);
>
>
> What about just saving the rd then?
>
> struct root_domain *rd;
>
> rd = READ_ONCE(rq->rd);
>
> then use that. Then we don't need to worry about it changing.
>

Yeah, it should work. I will give it a try and send the patch
for review.

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


2018-01-19 15:55:10

by Steven Rostedt

Subject: Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

On Fri, 19 Jan 2018 21:14:30 +0530
Pavan Kondeti <[email protected]> wrote:


> Yeah, it should work. I will give it a try and send the patch
> for review.

Add a comment above the assignment saying something to the effect of:


/* Due to CPU hotplug, rq->rd could possibly change */

-- Steve


2018-01-19 17:48:18

by Pavankumar Kondeti

Subject: Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

On Fri, Jan 19, 2018 at 10:03:53AM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 14:53:05 +0530
> Pavan Kondeti <[email protected]> wrote:
>
> > I am seeing a "spinlock already unlocked" BUG for rd->rto_lock on a 4.9
> > stable kernel based system. This issue is observed only after the
> > inclusion of this patch. It appears to me that rq->rd can change
> > between the time the spinlock is acquired and released in the
> > rto_push_irq_work_func() IRQ work if hotplug is in progress. It was
> > reported only a couple of times during long stress testing. The issue
> > can be easily reproduced if an artificial delay is introduced between
> > the lock and unlock of rto_lock. The rq->rd is changed under rq->lock,
> > so we can protect against this race with rq->lock. The below patch
> > solved the problem. We are already taking rq->lock in
> > pull_rt_task()->tell_cpu_to_push(), so I extended the same here.
> > Please let me know your thoughts on this.
>
> Ah, so rq->rd can change. Interesting.
>
> >
> > diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> > index d863d39..478192b 100644
> > --- a/kernel/sched/rt.c
> > +++ b/kernel/sched/rt.c
> > @@ -2284,6 +2284,7 @@ void rto_push_irq_work_func(struct irq_work *work)
> > raw_spin_unlock(&rq->lock);
> > }
> >
> > + raw_spin_lock(&rq->lock);
>
>
> What about just saving the rd then?
>
> struct root_domain *rd;
>
> rd = READ_ONCE(rq->rd);
>
> then use that. Then we don't need to worry about it changing.
>

I am thinking of another problem because of the race between
rto_push_irq_work_func() and rq_attach_root(), where rq->rd is modified.

Let's say we cache the rq->rd here and queue the IRQ work on a remote
CPU. In the meantime, rq_attach_root() might drop all the references
to this cached (old) rd and want to free it. The rq->rd is freed in an
RCU-sched callback. If that remote CPU is in an RCU quiescent state,
the rq->rd can get freed before the IRQ work is executed. This results
in corruption of the remote CPU's IRQ work list. Right?

Taking rq->lock in rto_push_irq_work_func() also does not help here.
Probably we have to wait for the IRQ work to finish before freeing the
older root domain in the RCU-sched callback.
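
Roughly, the interleaving I am worried about (a sketch of the 4.9 code
paths; CPU X runs the IRQ work chain, CPU Y does the hotplug):

	CPU X (IRQ work chain)		CPU Y (hotplug)
	----------------------		---------------
	rd = rq->rd;	/* cached */
	irq_work_queue_on(&rd->rto_push_work,
			  cpu);
					rq_attach_root():
					  rq->rd = new_rd;
					  /* drops the last reference */
					  call_rcu_sched(&old_rd->rcu,
							 free_rootdomain);
	/* grace period ends before the
	 * target CPU runs the queued work */
					free_rootdomain(old_rd);
	/* rto_push_work is embedded in the
	 * freed rd but still sits on the
	 * target CPU's IRQ work list */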

Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


2018-01-19 18:13:27

by Steven Rostedt

Subject: Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

On Fri, 19 Jan 2018 23:16:17 +0530
Pavan Kondeti <[email protected]> wrote:

> I am thinking of another problem because of the race between
> rto_push_irq_work_func() and rq_attach_root(), where rq->rd is modified.
>
> Let's say we cache the rq->rd here and queue the IRQ work on a remote
> CPU. In the meantime, rq_attach_root() might drop all the references
> to this cached (old) rd and want to free it. The rq->rd is freed in an
> RCU-sched callback. If that remote CPU is in an RCU quiescent state,
> the rq->rd can get freed before the IRQ work is executed. This results
> in corruption of the remote CPU's IRQ work list. Right?
>
> Taking rq->lock in rto_push_irq_work_func() also does not help here.
> Probably we have to wait for the IRQ work to finish before freeing the
> older root domain in the RCU-sched callback.

I was wondering about this too. Yeah, it would require an RCU-like
update. Once the rd is unreferenced, it would need to wait for the
irq works to finish before freeing it.

The easy way to do this is to simply up the refcount when sending the
domain. Something like this:

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 862a513adca3..89a086ed2b16 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1907,9 +1907,8 @@ static void push_rt_tasks(struct rq *rq)
  * the rt_loop_next will cause the iterator to perform another scan.
  *
  */
-static int rto_next_cpu(struct rq *rq)
+static int rto_next_cpu(struct root_domain *rd)
 {
-	struct root_domain *rd = rq->rd;
 	int next;
 	int cpu;
 
@@ -1985,19 +1984,24 @@ static void tell_cpu_to_push(struct rq *rq)
 	 * Otherwise it is finishing up and an ipi needs to be sent.
 	 */
 	if (rq->rd->rto_cpu < 0)
-		cpu = rto_next_cpu(rq);
+		cpu = rto_next_cpu(rq->rd);
 
 	raw_spin_unlock(&rq->rd->rto_lock);
 
 	rto_start_unlock(&rq->rd->rto_loop_start);
 
-	if (cpu >= 0)
+	if (cpu >= 0) {
+		/* Make sure the rd does not get freed while pushing */
+		sched_get_rd(rq->rd);
 		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	}
 }
 
 /* Called from hardirq context */
 void rto_push_irq_work_func(struct irq_work *work)
 {
+	struct root_domain *rd =
+		container_of(work, struct root_domain, rto_push_work);
 	struct rq *rq;
 	int cpu;
 
@@ -2013,18 +2017,20 @@ void rto_push_irq_work_func(struct irq_work *work)
 		raw_spin_unlock(&rq->lock);
 	}
 
-	raw_spin_lock(&rq->rd->rto_lock);
+	raw_spin_lock(&rd->rto_lock);
 
 	/* Pass the IPI to the next rt overloaded queue */
-	cpu = rto_next_cpu(rq);
+	cpu = rto_next_cpu(rd);
 
-	raw_spin_unlock(&rq->rd->rto_lock);
+	raw_spin_unlock(&rd->rto_lock);
 
-	if (cpu < 0)
+	if (cpu < 0) {
+		sched_put_rd(rd);
 		return;
+	}
 
 	/* Try the next RT overloaded CPU */
-	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	irq_work_queue_on(&rd->rto_push_work, cpu);
 }
 #endif /* HAVE_RT_PUSH_IPI */
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e95505e23c6..fb5fc458547f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -691,6 +691,8 @@ extern struct mutex sched_domains_mutex;
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
+extern void sched_get_rd(struct root_domain *rd);
+extern void sched_put_rd(struct root_domain *rd);
 
 #ifdef HAVE_RT_PUSH_IPI
 extern void rto_push_irq_work_func(struct irq_work *work);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 034cbed7f88b..519b024f4e94 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -259,6 +259,19 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	call_rcu_sched(&old_rd->rcu, free_rootdomain);
 }
 
+void sched_get_rd(struct root_domain *rd)
+{
+	atomic_inc(&rd->refcount);
+}
+
+void sched_put_rd(struct root_domain *rd)
+{
+	if (!atomic_dec_and_test(&rd->refcount))
+		return;
+
+	call_rcu_sched(&rd->rcu, free_rootdomain);
+}
+
 static int init_rootdomain(struct root_domain *rd)
 {
 	if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))

-- Steve

2018-01-19 18:14:26

by Steven Rostedt

Subject: Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

On Fri, 19 Jan 2018 13:11:21 -0500
Steven Rostedt <[email protected]> wrote:

> void rto_push_irq_work_func(struct irq_work *work)
> {
> + struct root_domain *rd =
> + container_of(work, struct root_domain, rto_push_work);
> struct rq *rq;

Notice that I also remove the dependency on rq from getting the rd.

-- Steve

2018-01-19 18:56:03

by Pavankumar Kondeti

Subject: Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

On Fri, Jan 19, 2018 at 01:11:21PM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 23:16:17 +0530
> Pavan Kondeti <[email protected]> wrote:
>
> > I am thinking of another problem because of the race between
> > rto_push_irq_work_func() and rq_attach_root(), where rq->rd is modified.
> >
> > Let's say we cache the rq->rd here and queue the IRQ work on a remote
> > CPU. In the meantime, rq_attach_root() might drop all the references
> > to this cached (old) rd and want to free it. The rq->rd is freed in an
> > RCU-sched callback. If that remote CPU is in an RCU quiescent state,
> > the rq->rd can get freed before the IRQ work is executed. This results
> > in corruption of the remote CPU's IRQ work list. Right?
> >
> > Taking rq->lock in rto_push_irq_work_func() also does not help here.
> > Probably we have to wait for the IRQ work to finish before freeing the
> > older root domain in the RCU-sched callback.
>
> I was wondering about this too. Yeah, it would require an RCU-like
> update. Once the rd is unreferenced, it would need to wait for the
> irq works to finish before freeing it.
>
> The easy way to do this is to simply up the refcount when sending the
> domain. Something like this:
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 862a513adca3..89a086ed2b16 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1907,9 +1907,8 @@ static void push_rt_tasks(struct rq *rq)
> * the rt_loop_next will cause the iterator to perform another scan.
> *
> */
> -static int rto_next_cpu(struct rq *rq)
> +static int rto_next_cpu(struct root_domain *rd)
> {
> - struct root_domain *rd = rq->rd;
> int next;
> int cpu;
>
> @@ -1985,19 +1984,24 @@ static void tell_cpu_to_push(struct rq *rq)
> * Otherwise it is finishing up and an ipi needs to be sent.
> */
> if (rq->rd->rto_cpu < 0)
> - cpu = rto_next_cpu(rq);
> + cpu = rto_next_cpu(rq->rd);
>
> raw_spin_unlock(&rq->rd->rto_lock);
>
> rto_start_unlock(&rq->rd->rto_loop_start);
>
> - if (cpu >= 0)
> + if (cpu >= 0) {
> + /* Make sure the rd does not get freed while pushing */
> + sched_get_rd(rq->rd);
> irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> + }
> }

Since this is covered by rq->lock, it is guaranteed that we increment
the refcount on the older rd before the RCU-sched callback is queued
in rq_attach_root(). Either we keep the older rd alive or we use the
updated rd.

We are good here, I think.
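
For reference, the ordering that makes this work (abbreviated from
rq_attach_root() in kernel/sched/topology.c; the elisions are mine):

	void rq_attach_root(struct rq *rq, struct root_domain *rd)
	{
		struct root_domain *old_rd = NULL;
		unsigned long flags;

		raw_spin_lock_irqsave(&rq->lock, flags);

		if (rq->rd) {
			old_rd = rq->rd;
			...
			/* old_rd stays non-NULL only on the last put */
			if (!atomic_dec_and_test(&old_rd->refcount))
				old_rd = NULL;
		}

		atomic_inc(&rd->refcount);
		rq->rd = rd;
		...
		raw_spin_unlock_irqrestore(&rq->lock, flags);

		if (old_rd)
			call_rcu_sched(&old_rd->rcu, free_rootdomain);
	}

Since tell_cpu_to_push() runs with rq->lock held, its sched_get_rd()
either happens before the swap above (bumping old_rd->refcount so the
last-put check fails) or after it (seeing the new rd).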

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


2018-01-19 18:59:44

by Pavankumar Kondeti

Subject: Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

Hi Steve,

Thanks for the patch.

On Fri, Jan 19, 2018 at 01:12:54PM -0500, Steven Rostedt wrote:
> On Fri, 19 Jan 2018 13:11:21 -0500
> Steven Rostedt <[email protected]> wrote:
>
> > void rto_push_irq_work_func(struct irq_work *work)
> > {
> > + struct root_domain *rd =
> > + container_of(work, struct root_domain, rto_push_work);
> > struct rq *rq;
>
> Notice that I also remove the dependency on rq from getting the rd.
>

Nice. This snippet by itself solves the original problem I reported.
I will test your patch and let you know the results.

Thanks,
Pavan

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


2018-01-19 19:52:24

by Steven Rostedt

Subject: Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

On Sat, 20 Jan 2018 00:27:56 +0530
Pavan Kondeti <[email protected]> wrote:

> Hi Steve,
>
> Thanks for the patch.
>
> On Fri, Jan 19, 2018 at 01:12:54PM -0500, Steven Rostedt wrote:
> > On Fri, 19 Jan 2018 13:11:21 -0500
> > Steven Rostedt <[email protected]> wrote:
> >
> > > void rto_push_irq_work_func(struct irq_work *work)
> > > {
> > > + struct root_domain *rd =
> > > + container_of(work, struct root_domain, rto_push_work);
> > > struct rq *rq;
> >
> > Notice that I also remove the dependency on rq from getting the rd.
> >
>
> Nice. This snippet by itself solves the original problem I reported.
> I will test your patch and let you know the results.
>
>

I'll break the patch up into two then. One with this snippet, and the
other with the rd ref counting.

-- Steve

2018-01-20 05:03:59

by Pavankumar Kondeti

Subject: Re: [tip:sched/core] sched/rt: Simplify the IPI based RT balancing logic

Hi Steve,

On Fri, Jan 19, 2018 at 02:51:15PM -0500, Steven Rostedt wrote:
> On Sat, 20 Jan 2018 00:27:56 +0530
> Pavan Kondeti <[email protected]> wrote:
>
> > Hi Steve,
> >
> > Thanks for the patch.
> >
> > On Fri, Jan 19, 2018 at 01:12:54PM -0500, Steven Rostedt wrote:
> > > On Fri, 19 Jan 2018 13:11:21 -0500
> > > Steven Rostedt <[email protected]> wrote:
> > >
> > > > void rto_push_irq_work_func(struct irq_work *work)
> > > > {
> > > > + struct root_domain *rd =
> > > > + container_of(work, struct root_domain, rto_push_work);
> > > > struct rq *rq;
> > >
> > > Notice that I also remove the dependency on rq from getting the rd.
> > >
> >
> > Nice. This snippet by itself solves the original problem I reported.
> > I will test your patch and let you know the results.
> >
> >
>
> I'll break the patch up into two then. One with this snippet, and the
> other with the rd ref counting.
>

Yeah, this snippet fixed the original problem.

I have not seen a "use after free" of the rd in my testing, but I did
see us operating on an rd whose refcount was already 0. After applying
your refcount patch, that never happened. I also verified that we free
the rd via the IRQ work dropping the last reference.

Thanks for your help with the patches. Please copy linux-stable for
the first patch.

Feel free to use
Tested-by: Pavankumar Kondeti <[email protected]>

Thanks,
Pavan
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.


Subject: [tip:sched/urgent] sched/rt: Use container_of() to get root domain in rto_push_irq_work_func()

Commit-ID: ad0f1d9d65938aec72a698116cd73a980916895e
Gitweb: https://git.kernel.org/tip/ad0f1d9d65938aec72a698116cd73a980916895e
Author: Steven Rostedt (VMware) <[email protected]>
AuthorDate: Tue, 23 Jan 2018 20:45:37 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 6 Feb 2018 10:20:33 +0100

sched/rt: Use container_of() to get root domain in rto_push_irq_work_func()

When the rto_push_irq_work_func() is called, it looks at the RT overloaded
bitmask in the root domain via the runqueue (rq->rd). The problem is that
during CPU up and down, nothing here stops rq->rd from changing between
taking the rq->rd->rto_lock and releasing it. That means the lock that is
released is not the same lock that was taken.

Instead of using this_rq()->rd to get the root domain, as the irq work is
part of the root domain, we can simply get the root domain from the irq work
that is passed to the routine:

container_of(work, struct root_domain, rto_push_work)

This keeps the root domain consistent.

Reported-by: Pavan Kondeti <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/rt.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 862a513..2fb627d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1907,9 +1907,8 @@ static void push_rt_tasks(struct rq *rq)
  * the rt_loop_next will cause the iterator to perform another scan.
  *
  */
-static int rto_next_cpu(struct rq *rq)
+static int rto_next_cpu(struct root_domain *rd)
 {
-	struct root_domain *rd = rq->rd;
 	int next;
 	int cpu;
 
@@ -1985,7 +1984,7 @@ static void tell_cpu_to_push(struct rq *rq)
 	 * Otherwise it is finishing up and an ipi needs to be sent.
 	 */
 	if (rq->rd->rto_cpu < 0)
-		cpu = rto_next_cpu(rq);
+		cpu = rto_next_cpu(rq->rd);
 
 	raw_spin_unlock(&rq->rd->rto_lock);
 
@@ -1998,6 +1997,8 @@ static void tell_cpu_to_push(struct rq *rq)
 /* Called from hardirq context */
 void rto_push_irq_work_func(struct irq_work *work)
 {
+	struct root_domain *rd =
+		container_of(work, struct root_domain, rto_push_work);
 	struct rq *rq;
 	int cpu;
 
@@ -2013,18 +2014,18 @@ void rto_push_irq_work_func(struct irq_work *work)
 		raw_spin_unlock(&rq->lock);
 	}
 
-	raw_spin_lock(&rq->rd->rto_lock);
+	raw_spin_lock(&rd->rto_lock);
 
 	/* Pass the IPI to the next rt overloaded queue */
-	cpu = rto_next_cpu(rq);
+	cpu = rto_next_cpu(rd);
 
-	raw_spin_unlock(&rq->rd->rto_lock);
+	raw_spin_unlock(&rd->rto_lock);
 
 	if (cpu < 0)
 		return;
 
 	/* Try the next RT overloaded CPU */
-	irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	irq_work_queue_on(&rd->rto_push_work, cpu);
 }
 #endif /* HAVE_RT_PUSH_IPI */


Subject: [tip:sched/urgent] sched/rt: Up the root domain ref count when passing it around via IPIs

Commit-ID: 364f56653708ba8bcdefd4f0da2a42904baa8eeb
Gitweb: https://git.kernel.org/tip/364f56653708ba8bcdefd4f0da2a42904baa8eeb
Author: Steven Rostedt (VMware) <[email protected]>
AuthorDate: Tue, 23 Jan 2018 20:45:38 -0500
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 6 Feb 2018 10:20:33 +0100

sched/rt: Up the root domain ref count when passing it around via IPIs

When issuing an IPI RT push, where an IPI is sent to each CPU that has
more than one RT task scheduled on it, it references the root domain's
rto_mask, which contains all the CPUs within the root domain that have
more than one RT task in the runnable state. The problem is, after the
IPIs are initiated, the rq->lock is released. This means that the root
domain that is associated with the run queue could be freed while the
IPIs are going around.

Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
the root domain's ref count respectively. This way when initiating the IPIs,
the scheduler will up the root domain's ref count before releasing the
rq->lock, ensuring that the root domain does not go away until the IPI round
is complete.

Reported-by: Pavan Kondeti <[email protected]>
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/rt.c | 9 +++++++--
kernel/sched/sched.h | 2 ++
kernel/sched/topology.c | 13 +++++++++++++
3 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 2fb627d..89a086e 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1990,8 +1990,11 @@ static void tell_cpu_to_push(struct rq *rq)
 
 	rto_start_unlock(&rq->rd->rto_loop_start);
 
-	if (cpu >= 0)
+	if (cpu >= 0) {
+		/* Make sure the rd does not get freed while pushing */
+		sched_get_rd(rq->rd);
 		irq_work_queue_on(&rq->rd->rto_push_work, cpu);
+	}
 }
 
 /* Called from hardirq context */
@@ -2021,8 +2024,10 @@ void rto_push_irq_work_func(struct irq_work *work)
 
 	raw_spin_unlock(&rd->rto_lock);
 
-	if (cpu < 0)
+	if (cpu < 0) {
+		sched_put_rd(rd);
 		return;
+	}
 
 	/* Try the next RT overloaded CPU */
 	irq_work_queue_on(&rd->rto_push_work, cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e95505..fb5fc45 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -691,6 +691,8 @@ extern struct mutex sched_domains_mutex;
 extern void init_defrootdomain(void);
 extern int sched_init_domains(const struct cpumask *cpu_map);
 extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
+extern void sched_get_rd(struct root_domain *rd);
+extern void sched_put_rd(struct root_domain *rd);
 
 #ifdef HAVE_RT_PUSH_IPI
 extern void rto_push_irq_work_func(struct irq_work *work);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 034cbed..519b024 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -259,6 +259,19 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
 	call_rcu_sched(&old_rd->rcu, free_rootdomain);
 }
 
+void sched_get_rd(struct root_domain *rd)
+{
+	atomic_inc(&rd->refcount);
+}
+
+void sched_put_rd(struct root_domain *rd)
+{
+	if (!atomic_dec_and_test(&rd->refcount))
+		return;
+
+	call_rcu_sched(&rd->rcu, free_rootdomain);
+}
+
 static int init_rootdomain(struct root_domain *rd)
 {
 	if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))

2018-02-07 04:15:40

by Steven Rostedt

Subject: Re: [tip:sched/urgent] sched/rt: Use container_of() to get root domain in rto_push_irq_work_func()


I see this was just applied to Linus's tree. It probably should be
tagged for stable as well.

-- Steve


On Tue, 6 Feb 2018 03:54:16 -0800
"tip-bot for Steven Rostedt (VMware)" <[email protected]> wrote:

> Commit-ID: ad0f1d9d65938aec72a698116cd73a980916895e
> Gitweb: https://git.kernel.org/tip/ad0f1d9d65938aec72a698116cd73a980916895e
> Author: Steven Rostedt (VMware) <[email protected]>
> AuthorDate: Tue, 23 Jan 2018 20:45:37 -0500
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Tue, 6 Feb 2018 10:20:33 +0100
>
> sched/rt: Use container_of() to get root domain in rto_push_irq_work_func()
>
> When the rto_push_irq_work_func() is called, it looks at the RT overloaded
> bitmask in the root domain via the runqueue (rq->rd). The problem is that
> during CPU up and down, nothing here stops rq->rd from changing between
> taking the rq->rd->rto_lock and releasing it. That means the lock that is
> released is not the same lock that was taken.
>
> Instead of using this_rq()->rd to get the root domain, as the irq work is
> part of the root domain, we can simply get the root domain from the irq work
> that is passed to the routine:
>
> container_of(work, struct root_domain, rto_push_work)
>
> This keeps the root domain consistent.
>
> Reported-by: Pavan Kondeti <[email protected]>
> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
> Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> kernel/sched/rt.c | 15 ++++++++-------
> 1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 862a513..2fb627d 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1907,9 +1907,8 @@ static void push_rt_tasks(struct rq *rq)
> * the rt_loop_next will cause the iterator to perform another scan.
> *
> */
> -static int rto_next_cpu(struct rq *rq)
> +static int rto_next_cpu(struct root_domain *rd)
> {
> - struct root_domain *rd = rq->rd;
> int next;
> int cpu;
>
> @@ -1985,7 +1984,7 @@ static void tell_cpu_to_push(struct rq *rq)
> * Otherwise it is finishing up and an ipi needs to be sent.
> */
> if (rq->rd->rto_cpu < 0)
> - cpu = rto_next_cpu(rq);
> + cpu = rto_next_cpu(rq->rd);
>
> raw_spin_unlock(&rq->rd->rto_lock);
>
> @@ -1998,6 +1997,8 @@ static void tell_cpu_to_push(struct rq *rq)
> /* Called from hardirq context */
> void rto_push_irq_work_func(struct irq_work *work)
> {
> + struct root_domain *rd =
> + container_of(work, struct root_domain, rto_push_work);
> struct rq *rq;
> int cpu;
>
> @@ -2013,18 +2014,18 @@ void rto_push_irq_work_func(struct irq_work *work)
> raw_spin_unlock(&rq->lock);
> }
>
> - raw_spin_lock(&rq->rd->rto_lock);
> + raw_spin_lock(&rd->rto_lock);
>
> /* Pass the IPI to the next rt overloaded queue */
> - cpu = rto_next_cpu(rq);
> + cpu = rto_next_cpu(rd);
>
> - raw_spin_unlock(&rq->rd->rto_lock);
> + raw_spin_unlock(&rd->rto_lock);
>
> if (cpu < 0)
> return;
>
> /* Try the next RT overloaded CPU */
> - irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> + irq_work_queue_on(&rd->rto_push_work, cpu);
> }
> #endif /* HAVE_RT_PUSH_IPI */
>


2018-02-07 04:17:11

by Steven Rostedt

Subject: Re: [tip:sched/urgent] sched/rt: Up the root domain ref count when passing it around via IPIs


I see this was just applied to Linus's tree. This one should probably
be tagged for stable as well.

-- Steve


On Tue, 6 Feb 2018 03:54:42 -0800
"tip-bot for Steven Rostedt (VMware)" <[email protected]> wrote:

> Commit-ID: 364f56653708ba8bcdefd4f0da2a42904baa8eeb
> Gitweb: https://git.kernel.org/tip/364f56653708ba8bcdefd4f0da2a42904baa8eeb
> Author: Steven Rostedt (VMware) <[email protected]>
> AuthorDate: Tue, 23 Jan 2018 20:45:38 -0500
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Tue, 6 Feb 2018 10:20:33 +0100
>
> sched/rt: Up the root domain ref count when passing it around via IPIs
>
> When issuing an IPI RT push, where an IPI is sent to each CPU that has
> more than one RT task scheduled on it, it references the root domain's
> rto_mask, which contains all the CPUs within the root domain that have
> more than one RT task in the runnable state. The problem is, after the
> IPIs are initiated, the rq->lock is released. This means that the root
> domain that is associated with the run queue could be freed while the
> IPIs are going around.
>
> Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
> the root domain's ref count respectively. This way when initiating the IPIs,
> the scheduler will up the root domain's ref count before releasing the
> rq->lock, ensuring that the root domain does not go away until the IPI round
> is complete.
>
> Reported-by: Pavan Kondeti <[email protected]>
> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
> Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> kernel/sched/rt.c | 9 +++++++--
> kernel/sched/sched.h | 2 ++
> kernel/sched/topology.c | 13 +++++++++++++
> 3 files changed, 22 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index 2fb627d..89a086e 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -1990,8 +1990,11 @@ static void tell_cpu_to_push(struct rq *rq)
>
> rto_start_unlock(&rq->rd->rto_loop_start);
>
> - if (cpu >= 0)
> + if (cpu >= 0) {
> + /* Make sure the rd does not get freed while pushing */
> + sched_get_rd(rq->rd);
> irq_work_queue_on(&rq->rd->rto_push_work, cpu);
> + }
> }
>
> /* Called from hardirq context */
> @@ -2021,8 +2024,10 @@ void rto_push_irq_work_func(struct irq_work *work)
>
> raw_spin_unlock(&rd->rto_lock);
>
> - if (cpu < 0)
> + if (cpu < 0) {
> + sched_put_rd(rd);
> return;
> + }
>
> /* Try the next RT overloaded CPU */
> irq_work_queue_on(&rd->rto_push_work, cpu);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 2e95505..fb5fc45 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -691,6 +691,8 @@ extern struct mutex sched_domains_mutex;
> extern void init_defrootdomain(void);
> extern int sched_init_domains(const struct cpumask *cpu_map);
> extern void rq_attach_root(struct rq *rq, struct root_domain *rd);
> +extern void sched_get_rd(struct root_domain *rd);
> +extern void sched_put_rd(struct root_domain *rd);
>
> #ifdef HAVE_RT_PUSH_IPI
> extern void rto_push_irq_work_func(struct irq_work *work);
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 034cbed..519b024 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -259,6 +259,19 @@ void rq_attach_root(struct rq *rq, struct root_domain *rd)
> call_rcu_sched(&old_rd->rcu, free_rootdomain);
> }
>
> +void sched_get_rd(struct root_domain *rd)
> +{
> + atomic_inc(&rd->refcount);
> +}
> +
> +void sched_put_rd(struct root_domain *rd)
> +{
> + if (!atomic_dec_and_test(&rd->refcount))
> + return;
> +
> + call_rcu_sched(&rd->rcu, free_rootdomain);
> +}
> +
> static int init_rootdomain(struct root_domain *rd)
> {
> if (!zalloc_cpumask_var(&rd->span, GFP_KERNEL))