2018-05-23 17:57:11

by Tejun Heo

Subject: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue

From 0aa2e9b921d6db71150633ff290199554f0842a8 Mon Sep 17 00:00:00 2001
From: Tejun Heo <[email protected]>
Date: Wed, 23 May 2018 10:29:00 -0700

cgwb_release() punts the actual release to cgwb_release_workfn() on
system_wq. Depending on the number of cgroups or block devices, there
can be a lot of cgwb_release_workfn() in flight at the same time.

We're periodically seeing close to 256 kworkers getting stuck with the
following stack trace and over time the entire system gets stuck.

[<ffffffff810ee40c>] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
[<ffffffff810ee634>] synchronize_rcu_expedited+0x24/0x30
[<ffffffff811ccf23>] bdi_unregister+0x53/0x290
[<ffffffff811cd1e9>] release_bdi+0x89/0xc0
[<ffffffff811cd645>] wb_exit+0x85/0xa0
[<ffffffff811cdc84>] cgwb_release_workfn+0x54/0xb0
[<ffffffff810a68d0>] process_one_work+0x150/0x410
[<ffffffff810a71fd>] worker_thread+0x6d/0x520
[<ffffffff810ad3dc>] kthread+0x12c/0x160
[<ffffffff81969019>] ret_from_fork+0x29/0x40
[<ffffffffffffffff>] 0xffffffffffffffff

The events leading to the lockup are...

1. A lot of cgwb_release_workfn() is queued at the same time and all
system_wq kworkers are assigned to execute them.

2. They all end up calling synchronize_rcu_expedited(). One of them
wins and tries to perform the expedited synchronization.

3. However, that involves queueing rcu_exp_work to system_wq and
waiting for it. Because #1 is holding all available kworkers on
system_wq, rcu_exp_work can't be executed. cgwb_release_workfn()
is waiting for synchronize_rcu_expedited() which in turn is waiting
for cgwb_release_workfn() to free up some of the kworkers.
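
To illustrate (this sketch is not part of the patch), the cycle in #1-#3
is simply: work items on system_wq block in synchronize_rcu_expedited(),
which in the affected kernels (before RCU got its own workqueues) queues
rcu_exp_work onto that same system_wq. A minimal sketch of that shape:

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/rcupdate.h>

/* Illustrative only -- mirrors the shape of the deadlock, not the bdi code. */
static void release_workfn(struct work_struct *work)
{
	/*
	 * Blocks until the expedited grace period completes.  In the
	 * affected kernels the grace-period machinery queues rcu_exp_work
	 * on system_wq, so completing this call needs a free worker there.
	 */
	synchronize_rcu_expedited();
	/* ... actual teardown would go here ... */
}

static struct work_struct release_works[256];

static void queue_release_storm(void)
{
	int i;

	/*
	 * Queue enough items to pin every system_wq worker; rcu_exp_work
	 * then has nowhere to run -- an indirect A-A deadlock.
	 */
	for (i = 0; i < ARRAY_SIZE(release_works); i++) {
		INIT_WORK(&release_works[i], release_workfn);
		schedule_work(&release_works[i]);	/* queue_work(system_wq, ...) */
	}
}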

We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
same time. There's nothing to be gained from that. This patch
updates cgwb release path to use a dedicated percpu workqueue with
@max_active of 1.

While this resolves the problem at hand, it might be a good idea to
isolate rcu_exp_work to its own workqueue too as it can be used from
various paths and is prone to this sort of indirect A-A deadlocks.

Signed-off-by: Tejun Heo <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: [email protected]
---
mm/backing-dev.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7441bd9..8fe3ebd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -412,6 +412,7 @@ static void wb_exit(struct bdi_writeback *wb)
* protected.
*/
static DEFINE_SPINLOCK(cgwb_lock);
+static struct workqueue_struct *cgwb_release_wq;

/**
* wb_congested_get_create - get or create a wb_congested
@@ -522,7 +523,7 @@ static void cgwb_release(struct percpu_ref *refcnt)
{
struct bdi_writeback *wb = container_of(refcnt, struct bdi_writeback,
refcnt);
- schedule_work(&wb->release_work);
+ queue_work(cgwb_release_wq, &wb->release_work);
}

static void cgwb_kill(struct bdi_writeback *wb)
@@ -784,6 +785,21 @@ static void cgwb_bdi_register(struct backing_dev_info *bdi)
spin_unlock_irq(&cgwb_lock);
}

+static int __init cgwb_init(void)
+{
+ /*
+ * There can be many concurrent release work items overwhelming
+ * system_wq. Put them in a separate wq and limit concurrency.
+ * There's no point in executing many of these in parallel.
+ */
+ cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 1);
+ if (!cgwb_release_wq)
+ return -ENOMEM;
+
+ return 0;
+}
+subsys_initcall(cgwb_init);
+
#else /* CONFIG_CGROUP_WRITEBACK */

static int cgwb_bdi_init(struct backing_dev_info *bdi)
--
2.9.5



2018-05-23 18:49:24

by Paul E. McKenney

Subject: Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue

On Wed, May 23, 2018 at 10:56:32AM -0700, Tejun Heo wrote:
> >From 0aa2e9b921d6db71150633ff290199554f0842a8 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <[email protected]>
> Date: Wed, 23 May 2018 10:29:00 -0700
>
> cgwb_release() punts the actual release to cgwb_release_workfn() on
> system_wq. Depending on the number of cgroups or block devices, there
> can be a lot of cgwb_release_workfn() in flight at the same time.
>
> We're periodically seeing close to 256 kworkers getting stuck with the
> following stack trace and over time the entire system gets stuck.
>
> [<ffffffff810ee40c>] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
> [<ffffffff810ee634>] synchronize_rcu_expedited+0x24/0x30
> [<ffffffff811ccf23>] bdi_unregister+0x53/0x290
> [<ffffffff811cd1e9>] release_bdi+0x89/0xc0
> [<ffffffff811cd645>] wb_exit+0x85/0xa0
> [<ffffffff811cdc84>] cgwb_release_workfn+0x54/0xb0
> [<ffffffff810a68d0>] process_one_work+0x150/0x410
> [<ffffffff810a71fd>] worker_thread+0x6d/0x520
> [<ffffffff810ad3dc>] kthread+0x12c/0x160
> [<ffffffff81969019>] ret_from_fork+0x29/0x40
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> The events leading to the lockup are...
>
> 1. A lot of cgwb_release_workfn() is queued at the same time and all
> system_wq kworkers are assigned to execute them.
>
> 2. They all end up calling synchronize_rcu_expedited(). One of them
> wins and tries to perform the expedited synchronization.
>
> 3. However, that involves queueing rcu_exp_work to system_wq and
> waiting for it. Because #1 is holding all available kworkers on
> system_wq, rcu_exp_work can't be executed. cgwb_release_workfn()
> is waiting for synchronize_rcu_expedited() which in turn is waiting
> for cgwb_release_workfn() to free up some of the kworkers.
>
> We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
> same time. There's nothing to be gained from that. This patch
> updates cgwb release path to use a dedicated percpu workqueue with
> @max_active of 1.
>
> While this resolves the problem at hand, it might be a good idea to
> isolate rcu_exp_work to its own workqueue too as it can be used from
> various paths and is prone to this sort of indirect A-A deadlocks.

Commit ad7c946b35ad4 ("rcu: Create RCU-specific workqueues with rescuers")
was accepted into mainline this past merge window. Does that do what
you want, or are you looking for something else?

Thanx, Paul


2018-05-23 18:53:19

by Tejun Heo

Subject: Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue

On Wed, May 23, 2018 at 11:39:07AM -0700, Paul E. McKenney wrote:
> > While this resolves the problem at hand, it might be a good idea to
> > isolate rcu_exp_work to its own workqueue too as it can be used from
> > various paths and is prone to this sort of indirect A-A deadlocks.
>
> Commit ad7c946b35ad4 ("rcu: Create RCU-specific workqueues with rescuers")
> was accepted into mainline this past merge window. Does that do what
> you want, or are you looking for something else?

Ah, that does it. Sorry, was looking at an older kernel.

Thanks.

--
tejun

2018-05-23 19:21:51

by Paul E. McKenney

Subject: Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue

On Wed, May 23, 2018 at 11:51:43AM -0700, Tejun Heo wrote:
> On Wed, May 23, 2018 at 11:39:07AM -0700, Paul E. McKenney wrote:
> > > While this resolves the problem at hand, it might be a good idea to
> > > isolate rcu_exp_work to its own workqueue too as it can be used from
> > > various paths and is prone to this sort of indirect A-A deadlocks.
> >
> > Commit ad7c946b35ad4 ("rcu: Create RCU-specific workqueues with rescuers")
> > was accepted into mainline this past merge window. Does that do what
> > you want, or are you looking for something else?
>
> Ah, that does it. Sorry, was looking at an older kernel.

No worries -- after all, I must confess that I was a bit slow in getting
to this.

Thanx, Paul


2018-05-23 21:29:58

by Jens Axboe

Subject: Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue

On 5/23/18 11:56 AM, Tejun Heo wrote:
> From 0aa2e9b921d6db71150633ff290199554f0842a8 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <[email protected]>
> Date: Wed, 23 May 2018 10:29:00 -0700
>
> cgwb_release() punts the actual release to cgwb_release_workfn() on
> system_wq. Depending on the number of cgroups or block devices, there
> can be a lot of cgwb_release_workfn() in flight at the same time.
>
> We're periodically seeing close to 256 kworkers getting stuck with the
> following stack trace and over time the entire system gets stuck.
>
> [<ffffffff810ee40c>] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
> [<ffffffff810ee634>] synchronize_rcu_expedited+0x24/0x30
> [<ffffffff811ccf23>] bdi_unregister+0x53/0x290
> [<ffffffff811cd1e9>] release_bdi+0x89/0xc0
> [<ffffffff811cd645>] wb_exit+0x85/0xa0
> [<ffffffff811cdc84>] cgwb_release_workfn+0x54/0xb0
> [<ffffffff810a68d0>] process_one_work+0x150/0x410
> [<ffffffff810a71fd>] worker_thread+0x6d/0x520
> [<ffffffff810ad3dc>] kthread+0x12c/0x160
> [<ffffffff81969019>] ret_from_fork+0x29/0x40
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> The events leading to the lockup are...
>
> 1. A lot of cgwb_release_workfn() is queued at the same time and all
> system_wq kworkers are assigned to execute them.
>
> 2. They all end up calling synchronize_rcu_expedited(). One of them
> wins and tries to perform the expedited synchronization.
>
> 3. However, that involves queueing rcu_exp_work to system_wq and
> waiting for it. Because #1 is holding all available kworkers on
> system_wq, rcu_exp_work can't be executed. cgwb_release_workfn()
> is waiting for synchronize_rcu_expedited() which in turn is waiting
> for cgwb_release_workfn() to free up some of the kworkers.
>
> We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
> same time. There's nothing to be gained from that. This patch
> updates cgwb release path to use a dedicated percpu workqueue with
> @max_active of 1.
>
> While this resolves the problem at hand, it might be a good idea to
> isolate rcu_exp_work to its own workqueue too as it can be used from
> various paths and is prone to this sort of indirect A-A deadlocks.

Applied, thanks.

--
Jens Axboe


2018-05-23 22:05:40

by Rik van Riel

Subject: Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue

On Wed, 2018-05-23 at 10:56 -0700, Tejun Heo wrote:

> The events leading to the lockup are...
>
> 1. A lot of cgwb_release_workfn() is queued at the same time and all
> system_wq kworkers are assigned to execute them.
>
> 2. They all end up calling synchronize_rcu_expedited(). One of them
> wins and tries to perform the expedited synchronization.
>
> 3. However, that involves queueing rcu_exp_work to system_wq and
> waiting for it. Because #1 is holding all available kworkers on
> system_wq, rcu_exp_work can't be executed. cgwb_release_workfn()
> is waiting for synchronize_rcu_expedited() which in turn is
> waiting
> for cgwb_release_workfn() to free up some of the kworkers.
>
> We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
> same time. There's nothing to be gained from that. This patch
> updates cgwb release path to use a dedicated percpu workqueue with
> @max_active of 1.

Dumb question. Does setting max_active to 1 mean
that every cgwb_release_workfn() ends up forcing
another RCU grace period on the whole system, while
today you might have a bunch of them waiting on the
same RCU grace period advance?

Would it be faster to have some number (up to 16?)
push RCU once, at the same time, instead of having
each of them push RCU into a next grace period one
after another?

I may be overlooking something fundamental here,
but I thought I'd at least ask the question, just
in case :)

--
All Rights Reversed.



2018-05-23 23:18:07

by Tejun Heo

Subject: Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue

Hello, Rik.

On Wed, May 23, 2018 at 06:03:15PM -0400, Rik van Riel wrote:
> Dumb question. Does setting max_active to 1 mean
> that every cgwb_release_workfn() ends up forcing
> another RCU grace period on the whole system, while
> today you might have a bunch of them waiting on the
> same RCU grace period advance?
>
> Would it be faster to have some number (up to 16?)
> push RCU once, at the same time, instead of having
> each of them push RCU into a next grace period one
> after another?

Oh yeah, you're absolutely right. This would end up doing a lot of
back-to-back synchronize_rcu_expedited() calls which can't be good.
I'll send a patch to push it up to 16.

Thanks.

--
tejun

2018-05-23 23:27:15

by Tejun Heo

Subject: [PATCH] bdi: Increase the concurrency level of cgwb_release_wq

cgwb_release_wq was added to isolate and limit the concurrency level
of cgwb release work items. The max concurrency was set to 1 per CPU
but we do want some level of concurrency as otherwise it ends up
calling, through bdi_unregister(), synchronize_rcu_expedited()
back-to-back for each release work item.

Let's allow them to bunch up a bit.
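
As a rough, purely illustrative calculation (not from measurements):
callers of synchronize_rcu_expedited() that overlap in time can share an
expedited grace period, so with @max_active of 16 a burst of N release
work items needs on the order of N/16 expedited grace periods rather
than N back-to-back ones, e.g. roughly 16 instead of roughly 256 for the
burst sizes seen in the original report.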

Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Rik van Riel <[email protected]>
---
mm/backing-dev.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8fe3ebd..f9a2268 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -790,9 +790,10 @@ static int __init cgwb_init(void)
/*
* There can be many concurrent release work items overwhelming
* system_wq. Put them in a separate wq and limit concurrency.
- * There's no point in executing many of these in parallel.
+ * Allow some level of concurrency so that we can batch on
+ * synchronize_rcu_expedited() calls in bdi_unregister().
*/
- cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 1);
+ cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 16);
if (!cgwb_release_wq)
return -ENOMEM;


2018-05-24 10:20:10

by Jan Kara

Subject: Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue

On Wed 23-05-18 10:56:32, Tejun Heo wrote:
> From 0aa2e9b921d6db71150633ff290199554f0842a8 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <[email protected]>
> Date: Wed, 23 May 2018 10:29:00 -0700
>
> cgwb_release() punts the actual release to cgwb_release_workfn() on
> system_wq. Depending on the number of cgroups or block devices, there
> can be a lot of cgwb_release_workfn() in flight at the same time.
>
> We're periodically seeing close to 256 kworkers getting stuck with the
> following stack trace and over time the entire system gets stuck.

OK, but that means that you have to have 256 block devices, don't you? As
we have a bdi per device and we call synchronize_rcu_expedited() only when
unregistering bdi (and the corresponding request queue must be gone at that
point as well, since it otherwise holds a reference). Am I understanding
the situation correctly?

> [<ffffffff810ee40c>] _synchronize_rcu_expedited.constprop.72+0x2fc/0x330
> [<ffffffff810ee634>] synchronize_rcu_expedited+0x24/0x30
> [<ffffffff811ccf23>] bdi_unregister+0x53/0x290
> [<ffffffff811cd1e9>] release_bdi+0x89/0xc0
> [<ffffffff811cd645>] wb_exit+0x85/0xa0
> [<ffffffff811cdc84>] cgwb_release_workfn+0x54/0xb0
> [<ffffffff810a68d0>] process_one_work+0x150/0x410
> [<ffffffff810a71fd>] worker_thread+0x6d/0x520
> [<ffffffff810ad3dc>] kthread+0x12c/0x160
> [<ffffffff81969019>] ret_from_fork+0x29/0x40
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> The events leading to the lockup are...
>
> 1. A lot of cgwb_release_workfn() is queued at the same time and all
> system_wq kworkers are assigned to execute them.
>
> 2. They all end up calling synchronize_rcu_expedited(). One of them
> wins and tries to perform the expedited synchronization.
>
> 3. However, that involves queueing rcu_exp_work to system_wq and
> waiting for it. Because #1 is holding all available kworkers on
> system_wq, rcu_exp_work can't be executed. cgwb_release_workfn()
> is waiting for synchronize_rcu_expedited() which in turn is waiting
> for cgwb_release_workfn() to free up some of the kworkers.
>
> We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
> same time. There's nothing to be gained from that. This patch
> updates cgwb release path to use a dedicated percpu workqueue with
> @max_active of 1.

As Rik wrote, some parallelism is good to reduce the number of forced grace
periods so raising this to 16 is good. I was thinking whether we could not
batch rcu grace periods in some explicit way but that would be difficult to
do.

But thinking a bit more about this, if we made bdi RCU freed, we could just
avoid the synchronize_rcu_expedited() in bdi_remove_from_list() altogether.
The uses of bdi list are pretty limited and everybody ends up testing
WB_registered bit before doing anything anyway... What do you think?
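
Roughly, the idea would look something like the sketch below (the rcu
field and the helper names are hypothetical, not from any tree):

/*
 * Hypothetical addition to struct backing_dev_info:
 *
 *	struct rcu_head rcu;
 */

static void bdi_remove_from_list(struct backing_dev_info *bdi)
{
	spin_lock_bh(&bdi_lock);
	list_del_rcu(&bdi->bdi_list);
	spin_unlock_bh(&bdi_lock);
	/* no synchronize_rcu_expedited() needed any more */
}

static void bdi_free_rcu(struct backing_dev_info *bdi)
{
	kfree_rcu(bdi, rcu);	/* actual freeing deferred past all readers */
}

/* Readers walk the list under RCU and skip unregistered bdis: */
static void walk_bdi_list(void)
{
	struct backing_dev_info *bdi;

	rcu_read_lock();
	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
		if (!test_bit(WB_registered, &bdi->wb.state))
			continue;
		/* ... use the still-registered bdi ... */
	}
	rcu_read_unlock();
}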

Other than that you can add:

Reviewed-by: Jan Kara <[email protected]>

to this patch when updated to increase concurrency as that's a good short
term solution (for stable kernels) anyway.

Honza

> While this resolves the problem at hand, it might be a good idea to
> isolate rcu_exp_work to its own workqueue too as it can be used from
> various paths and is prone to this sort of indirect A-A deadlocks.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: "Paul E. McKenney" <[email protected]>
> Cc: [email protected]



> ---
> mm/backing-dev.c | 18 +++++++++++++++++-
> 1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 7441bd9..8fe3ebd 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -412,6 +412,7 @@ static void wb_exit(struct bdi_writeback *wb)
> * protected.
> */
> static DEFINE_SPINLOCK(cgwb_lock);
> +static struct workqueue_struct *cgwb_release_wq;
>
> /**
> * wb_congested_get_create - get or create a wb_congested
> @@ -522,7 +523,7 @@ static void cgwb_release(struct percpu_ref *refcnt)
> {
> struct bdi_writeback *wb = container_of(refcnt, struct bdi_writeback,
> refcnt);
> - schedule_work(&wb->release_work);
> + queue_work(cgwb_release_wq, &wb->release_work);
> }
>
> static void cgwb_kill(struct bdi_writeback *wb)
> @@ -784,6 +785,21 @@ static void cgwb_bdi_register(struct backing_dev_info *bdi)
> spin_unlock_irq(&cgwb_lock);
> }
>
> +static int __init cgwb_init(void)
> +{
> + /*
> + * There can be many concurrent release work items overwhelming
> + * system_wq. Put them in a separate wq and limit concurrency.
> + * There's no point in executing many of these in parallel.
> + */
> + cgwb_release_wq = alloc_workqueue("cgwb_release", 0, 1);
> + if (!cgwb_release_wq)
> + return -ENOMEM;
> +
> + return 0;
> +}
> +subsys_initcall(cgwb_init);
> +
> #else /* CONFIG_CGROUP_WRITEBACK */
>
> static int cgwb_bdi_init(struct backing_dev_info *bdi)
> --
> 2.9.5
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-05-25 00:13:24

by Tejun Heo

Subject: Re: [PATCH] bdi: Move cgroup bdi_writeback to a dedicated low concurrency workqueue

Hello, Jan.

On Thu, May 24, 2018 at 12:19:00PM +0200, Jan Kara wrote:
> > We're periodically seeing close to 256 kworkers getting stuck with the
> > following stack trace and over time the entire system gets stuck.
>
> OK, but that means that you have to have 256 block devices, don't you? As

It'd be a product of the number of cgroups being deleted at the same
time and the number of block devices they were accessing. With
multiple loop devices and multiple cgroup removals, it doesn't seem
too difficult to hit. It's rare but we do see it happening repeatedly
in one of the setups.
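
Purely as an illustration with made-up numbers: tearing down 32 cgroups
that each had writeback going to 8 loop devices would put 32 * 8 = 256
cgwb release work items in flight at once, which is in the same ballpark
as the ~256 stuck kworkers in the report.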

> we have a bdi per device and we call synchronize_rcu_expedited() only when
> unregistering bdi (and the corresponding request queue must be gone at that
> point as well, since it otherwise holds a reference). Am I understanding
> the situation correctly?

Yeah, I think so.

> > We shouldn't be scheduling hundreds of cgwb_release_workfn() at the
> > same time. There's nothing to be gained from that. This patch
> > updates cgwb release path to use a dedicated percpu workqueue with
> > @max_active of 1.
>
> As Rik wrote, some parallelism is good to reduce the number of forced grace
> periods so raising this to 16 is good. I was thinking whether we could not
> batch rcu grace periods in some explicit way but that would be difficult to
> do.

Agreed, already sent a patch to increase @max_active to 16.

> But thinking a bit more about this, if we made bdi RCU freed, we could just
> avoid the synchronize_rcu_expedited() in bdi_remove_from_list() altogether.
> The uses of bdi list are pretty limited and everybody ends up testing
> WB_registered bit before doing anything anyway... What do you think?
>
> Other than that you can add:
>
> Reviewed-by: Jan Kara <[email protected]>

Thanks.

--
tejun