[Adding Tejun and lkml to the cc list]
On Mon, Jun 23, 2014 at 01:05:54PM -0700, Austin Schuh wrote:
> I found 1 bug in XFS which I fixed, and I've uncovered something else that
> I'm not completely sure how to fix.
>
> In xfs_bmapi_allocate, you create a completion, and use that to wait until
> the work has finished. Occasionally, I'm seeing a case where I get a
> context switch after the completion has been completed, but before the
> workqueue has finished doing its internal bookkeeping. This results in
> the work being deleted before the workqueue is done using it, corrupting
> the internal data structures. I fixed it by waiting using flush_work and
> removing the completion entirely.
>
> --- a/fs/xfs/xfs_bmap_util.c 2014-06-23 12:59:10.008678410 -0700
> +++ b/fs/xfs/xfs_bmap_util.c 2014-06-23 12:59:14.116678239 -0700
> @@ -263,7 +263,6 @@
> current_set_flags_nested(&pflags, PF_FSTRANS);
>
> args->result = __xfs_bmapi_allocate(args);
> - complete(args->done);
>
> current_restore_flags_nested(&pflags, PF_FSTRANS);
> }
> @@ -277,16 +276,13 @@
> xfs_bmapi_allocate(
> struct xfs_bmalloca *args)
> {
> - DECLARE_COMPLETION_ONSTACK(done);
> -
> if (!args->stack_switch)
> return __xfs_bmapi_allocate(args);
>
>
> - args->done = &done;
> INIT_WORK_ONSTACK(&args->work, xfs_bmapi_allocate_worker);
> queue_work(xfs_alloc_wq, &args->work);
> - wait_for_completion(&done);
> + flush_work(&args->work);
> destroy_work_on_stack(&args->work);
> return args->result;
> }
> --- a/fs/xfs/xfs_bmap_util.h 2014-06-23 12:59:10.008678410 -0700
> +++ b/fs/xfs/xfs_bmap_util.h 2014-06-23 12:59:11.708678340 -0700
> @@ -57,7 +57,6 @@
> char conv; /* overwriting unwritten extents */
> char stack_switch;
> int flags;
> - struct completion *done;
> struct work_struct work;
> int result;
> };
Ok, that's a surprise. However, I can't see how using flush_work()
solves that underlying context switch problem, because it's
implemented the same way:
bool flush_work(struct work_struct *work)
{
	struct wq_barrier barr;

	lock_map_acquire(&work->lockdep_map);
	lock_map_release(&work->lockdep_map);

	if (start_flush_work(work, &barr)) {
		wait_for_completion(&barr.done);
		destroy_work_on_stack(&barr.work);
		return true;
	} else {
		return false;
	}
}
start_flush_work() is effectively a special queue_work()
implementation, so if it's not safe to call complete() from the
workqueue as the above patch implies, then this code has the same
problem.
Tejun - is this "do it yourself completion" a known issue w.r.t.
workqueues? I can't find any documentation that says "don't do
that" so...?
A quick grep also shows up the same queue_work/wait_for_completion
pattern in arch/x86/kernel/hpet.c, drivers/md/dm-thin.c,
fs/fs-writeback.c, drivers/block/drbd/drbd_main.c....
> I enabled event tracing (and added a new event which lists the number of
> workers running in a queue whenever that is changed).
>
> To me, it looks like work is scheduled from irq/44-ahci-273 that will
> acquire an inode lock. scp-3986 then acquires the lock, and then goes and
> schedules work. That work is then scheduled behind the work from
> irq/44-ahci-273 in the same pool. The first work blocks waiting on the
> lock, and scp-3986 won't finish and release that lock until the second work
> gets run.
IOWs, scp takes an inode lock and queues work to the xfs_alloc_wq,
then schedules. Then a kworker runs an xfs-data work item, which
tries to take the inode lock and blocks.
As I understand it, what then happens is that the workqueue code
grabs another kworker thread and runs the next work item in its
queue. IOWs, work items can block, but doing that does not prevent
execution of other work items queued on other work queues or even on
the same work queue. Tejun, did I get that correct?
Hence the work on the xfs-data queue will block until another
kworker processes the item on the xfs-alloc-wq which means progress
is made and the inode gets unlocked. Then the kworker for the work
on the xfs-data queue will get the lock, complete its work and
everything has resolved itself.
Cheers,
Dave.
--
Dave Chinner
[email protected]
Hello,
On Tue, Jun 24, 2014 at 01:02:40PM +1000, Dave Chinner wrote:
> start_flush_work() is effectively a special queue_work()
> implementation, so if it's not safe to call complete() from the
> workqueue as the above patch implies, then this code has the same
> problem.
>
> Tejun - is this "do it yourself completion" a known issue w.r.t.
> workqueues? I can't find any documentation that says "don't do
> that" so...?
It's more complex than using flush_work() but there's nothing
fundamentally wrong with it. A work item is completely unlinked
before its execution starts. It's safe to free the work item once its
work function has started, whether through kfree() or by returning.
One difference between flush_work() and manual completion would be
that if the work item gets requeued, flush_work() would wait for the
queued one to finish but given the work item is one-shot this doesn't
make any difference.
I can see no reason why manual completion would behave differently
from flush_work() in this case.
> As I understand it, what then happens is that the workqueue code
> grabs another kworker thread and runs the next work item in its
> queue. IOWs, work items can block, but doing that does not prevent
> execution of other work items queued on other work queues or even on
> the same work queue. Tejun, did I get that correct?
Yes, as long as the workqueue is under its @max_active limit and has
access to an existing kworker or can create a new one, it'll start
executing the next work item immediately; however, the guaranteed
level of concurrency is 1 even for WQ_MEM_RECLAIM workqueues. IOW, the
work items queued on a workqueue must be able to make forward progress
with a single work item if the work items are being depended upon for
memory reclaim.
> Hence the work on the xfs-data queue will block until another
> kworker processes the item on the xfs-alloc-wq which means progress
> is made and the inode gets unlocked. Then the kworker for the work
> on the xfs-data queue will get the lock, complete its work and
> everything has resolved itself.
As long as a WQ_MEM_RECLAIM workqueue doesn't depend upon itself,
forward-progress is guaranteed.
Thanks.
--
tejun
[Adding tglx to the cc. Sorry for any double sends]
On Mon, Jun 23, 2014 at 8:25 PM, Tejun Heo <[email protected]> wrote:
> Hello,
>
> On Tue, Jun 24, 2014 at 01:02:40PM +1000, Dave Chinner wrote:
>> start_flush_work() is effectively a special queue_work()
>> implementation, so if it's not safe to call complete() from the
>> workqueue as the above patch implies, then this code has the same
>> problem.
>>
>> Tejun - is this "do it yourself completion" a known issue w.r.t.
>> workqueues? I can't find any documentation that says "don't do
>> that" so...?
>
> It's more complex than using flush_work() but there's nothing
> fundamentally wrong with it. A work item is completely unlinked
> before its execution starts. It's safe to free the work item once its
> work function has started, whether through kfree() or by returning.
>
> One difference between flush_work() and manual completion would be
> that if the work item gets requeued, flush_work() would wait for the
> queued one to finish but given the work item is one-shot this doesn't
> make any difference.
>
> I can see no reason why manual completion would behave differently
> from flush_work() in this case.
I went looking for a short trace in my original log to show the
problem, and instead found evidence of the second problem. I still
like the shorter flush_work call, but that's not my call.
I did find this comment in the process_one_work function. Sounds like
this could be better documented.
	/*
	 * It is permissible to free the struct work_struct from
	 * inside the function that is called from it, this we need to
	 * take into account for lockdep too.  To avoid bogus "held
	 * lock freed" warnings as well as problems when looking into
	 * work->lockdep_map, make a copy and use that here.
	 */
>> As I understand it, what then happens is that the workqueue code
>> grabs another kworker thread and runs the next work item in its
>> queue. IOWs, work items can block, but doing that does not prevent
>> execution of other work items queued on other work queues or even on
>> the same work queue. Tejun, did I get that correct?
>
> Yes, as long as the workqueue is under its @max_active limit and has
> access to an existing kworker or can create a new one, it'll start
> executing the next work item immediately; however, the guaranteed
> level of concurrency is 1 even for WQ_MEM_RECLAIM workqueues. IOW, the
> work items queued on a workqueue must be able to make forward progress
> with a single work item if the work items are being depended upon for
> memory reclaim.
I mentioned this to Dave when I initially started this thread, but I'm
running an RT-patched kernel. I don't see forward progress. The
workqueue is only used in one spot (xfs_alloc_wq) and has
WQ_MEM_RECLAIM set. I started with a 10,000,000 line trace and pulled
out the entries which referenced the workqueue and pool leading up to
the lockup.
scp-4110 [001] ....1.3 101.184640: workqueue_queue_work: work struct=ffff8803c782d900 function=xfs_bmapi_allocate_worker workqueue=ffff88040c9f5a00 pool=ffff88042dae3fc0 req_cpu=512 cpu=1
scp-4110 [001] ....1.3 101.184641: workqueue_activate_work: work struct ffff8803c782d900
kworker/1:1-89 [001] ....1.1 101.184883: workqueue_nr_running: pool=ffff88042dae3fc0 nr_running=1
kworker/1:1-89 [001] ....1.. 101.184885: workqueue_execute_start: work struct ffff8803c782d900: function xfs_bmapi_allocate_worker
irq/44-ahci-275 [001] ....1.5 101.185086: workqueue_queue_work: work struct=ffff8800ae3f01e0 function=xfs_end_io workqueue=ffff88040b459a00 pool=ffff88042dae3fc0 req_cpu=512 cpu=1
irq/44-ahci-275 [001] ....1.5 101.185088: workqueue_activate_work: work struct ffff8800ae3f01e0
scp-4110 [001] ....1.. 101.187911: xfs_ilock: dev 8:5 ino 0xf9e flags ILOCK_EXCL caller xfs_iomap_write_allocate
scp-4110 [001] ....1.3 101.187914: workqueue_queue_work: work struct=ffff8803c782d900 function=xfs_bmapi_allocate_worker workqueue=ffff88040c9f5a00 pool=ffff88042dae3fc0 req_cpu=512 cpu=1
scp-4110 [001] ....1.3 101.187915: workqueue_activate_work: work struct ffff8803c782d900
scp-4110 [001] ....1.4 101.187926: workqueue_queue_work: work struct=ffff88040a6a01c8 function=blk_delay_work workqueue=ffff88040c9f4a00 pool=ffff88042dae44c0 req_cpu=512 cpu=1
scp-4110 [001] ....1.4 101.187926: workqueue_activate_work: work struct ffff88040a6a01c8
kworker/1:1-89 [001] ....1.. 101.187998: workqueue_execute_end: work struct ffff8803c782d900
kworker/1:1-89 [001] ....1.. 101.188000: workqueue_execute_start: work struct ffff8800ae3f01e0: function xfs_end_io
kworker/1:1-89 [001] ....1.. 101.188001: xfs_ilock: dev 8:5 ino 0xf9e flags ILOCK_EXCL caller xfs_setfilesize
The last work never runs. Hangcheck triggers shortly after.
I spent some more time debugging, and I am seeing that
tsk_is_pi_blocked is returning 1 in sched_submit_work
(kernel/sched/core.c). It looks like sched_submit_work is not
detecting that the worker task is blocked on a mutex.
This looks very RT related right now. I see 2 problems from my
reading (and experimentation). The first is that the second worker
isn't getting started because tsk_is_pi_blocked is reporting that the
task isn't blocked on a mutex. The second is that even if another
worker needs to be scheduled because the original worker is blocked on
a mutex, we need the pool lock to schedule another worker. The pool
lock can be acquired by any CPU, and is a spin_lock. If we end up on
the slow path for the pool lock, we hit
BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on)) in
task_blocks_on_rt_mutex in rtmutex.c. I'm not sure how to deal with
either problem. Thomas, any ideas?
Hopefully I've got all my facts right... Debugging kernel code is a
whole new world from userspace code.
Thanks!
Austin
On Mon, Jun 23, 2014 at 11:25:21PM -0400, Tejun Heo wrote:
> Hello,
>
> On Tue, Jun 24, 2014 at 01:02:40PM +1000, Dave Chinner wrote:
> > As I understand it, what then happens is that the workqueue code
> > grabs another kworker thread and runs the next work item in its
> > queue. IOWs, work items can block, but doing that does not prevent
> > execution of other work items queued on other work queues or even on
> > the same work queue. Tejun, did I get that correct?
>
> Yes, as long as the workqueue is under its @max_active limit and has
> access to an existing kworker or can create a new one, it'll start
> executing the next work item immediately; however, the guaranteed
> level of concurrency is 1 even for WQ_MEM_RECLAIM workqueues. IOW, the
> work items queued on a workqueue must be able to make forward progress
> with a single work item if the work items are being depended upon for
> memory reclaim.
Hmmm - that's different from my understanding of what the original
behaviour WQ_MEM_RECLAIM gave us. i.e. that WQ_MEM_RECLAIM
workqueues had a rescuer thread created to guarantee that the
*workqueue* could make forward progress executing work in a
reclaim context.
The concept that the *work being executed* needs to guarantee
forwards progress is something I've never heard stated before.
That worries me a lot, especially with all the memory reclaim
problems that have surfaced in the past couple of months....
> As long as a WQ_MEM_RECLAIM workqueue doesn't depend upon itself,
> forward-progress is guaranteed.
I can't find any documentation that actually defines what
WQ_MEM_RECLAIM means, so I can't tell when or how this requirement
came about. If it's true, then I suspect most of the WQ_MEM_RECLAIM
workqueues in filesystems violate it. Can you point me at
documentation/commits/code describing the constraints of
WQ_MEM_RECLAIM and the reasons for it?
Cheers,
Dave.
--
Dave Chinner
[email protected]
Hello,
On Tue, Jun 24, 2014 at 08:05:07PM -0700, Austin Schuh wrote:
> > I can see no reason why manual completion would behave differently
> > from flush_work() in this case.
>
> I went looking for a short trace in my original log to show the problem,
> and instead found evidence of the second problem. I still like the shorter
> flush_work call, but that's not my call.
So, are you saying that the original issue you reported isn't actually
a problem? But didn't you imply that changing the waiting mechanism
fixed a deadlock or was that a false positive?
> I did find this comment in the process_one_work function. Sounds like this
> could be better documented.
Yeah, we prolly should beef up Documentation/workqueue.txt with
information on general usage.
> I spent some more time debugging, and I am seeing that tsk_is_pi_blocked is
> returning 1 in sched_submit_work (kernel/sched/core.c). It looks
> like sched_submit_work is not detecting that the worker task is blocked on
> a mutex.
The function unplugs the block layer and doesn't have much to do with
workqueues, although it has "_work" in its name.
> This looks very RT related right now. I see 2 problems from my reading
> (and experimentation). The first is that the second worker isn't getting
> started because tsk_is_pi_blocked is reporting that the task isn't blocked
> on a mutex. The second is that even if another worker needs to be
> scheduled because the original worker is blocked on a mutex, we need the
> pool lock to schedule another worker. The pool lock can be acquired by any
> CPU, and is a spin_lock. If we end up on the slow path for the pool lock,
> we hit BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on))
> in task_blocks_on_rt_mutex in rtmutex.c. I'm not sure how to deal with
> either problem.
>
> Hopefully I've got all my facts right... Debugging kernel code is a whole
> new world from userspace code.
I don't have much idea how the RT kernel works either. Can you reproduce
the issues that you see on mainline?
Thanks.
--
tejun
Hello, Dave.
On Wed, Jun 25, 2014 at 03:56:41PM +1000, Dave Chinner wrote:
> Hmmm - that's different from my understanding of what the original
> behaviour WQ_MEM_RECLAIM gave us. i.e. that WQ_MEM_RECLAIM
> workqueues had a rescuer thread created to guarantee that the
> *workqueue* could make forward progress executing work in a
> reclaim context.
From Documentation/workqueue.txt:

  WQ_MEM_RECLAIM

	All wq which might be used in the memory reclaim paths _MUST_
	have this flag set.  The wq is guaranteed to have at least one
	execution context regardless of memory pressure.
So, all that's guaranteed is that the workqueue has at least one
worker executing its work items. If that one worker is serving a work
item which can't make forward progress, the workqueue is not
guaranteed to make forward progress.
> The concept that the *work being executed* needs to guarantee
> forwards progress is something I've never heard stated before.
> That worries me a lot, especially with all the memory reclaim
> problems that have surfaced in the past couple of months....
I'd love to provide that but guaranteeing that at least one work item
is always being executed requires unlimited task allocation (the ones
which get blocked gotta store their context somewhere).
> > As long as a WQ_MEM_RECLAIM workqueue doesn't depend upon itself,
> > forward-progress is guaranteed.
>
> I can't find any documentation that actually defines what
> WQ_MEM_RECLAIM means, so I can't tell when or how this requirement
> came about. If it's true, then I suspect most of the WQ_MEM_RECLAIM
> workqueues in filesystems violate it. Can you point me at
> documentation/commits/code describing the constraints of
> WQ_MEM_RECLAIM and the reasons for it?
Documentation/workqueue.txt should be it but maybe we should be more
explicit. The behavior maintains what the
pre-concurrency-management workqueue provided with static
per-workqueue workers. Each workqueue reserved its workers (either
one per cpu or one globally) and only supported a single level of
concurrency on each CPU. WQ_MEM_RECLAIM provides an equivalent
forward-progress guarantee, so all the existing users
shouldn't have issues on this front. If we have grown incorrect
usages since then, we need to fix them.
Thanks.
--
tejun
On Wed, Jun 25, 2014 at 7:00 AM, Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Tue, Jun 24, 2014 at 08:05:07PM -0700, Austin Schuh wrote:
> > > I can see no reason why manual completion would behave differently
> > > from flush_work() in this case.
> >
> > I went looking for a short trace in my original log to show the problem,
> > and instead found evidence of the second problem. I still like the shorter
> > flush_work call, but that's not my call.
>
> So, are you saying that the original issue you reported isn't actually
> a problem? But didn't you imply that changing the waiting mechanism
> fixed a deadlock or was that a false positive?
Correct, that was a false positive. Sorry for the noise.
> > I spent some more time debugging, and I am seeing that tsk_is_pi_blocked is
> > returning 1 in sched_submit_work (kernel/sched/core.c). It looks
> > like sched_submit_work is not detecting that the worker task is blocked on
> > a mutex.
>
> The function unplugs the block layer and doesn't have much to do with
> workqueues, although it has "_work" in its name.
Thomas moved

+	if (tsk->flags & PF_WQ_WORKER)
+		wq_worker_sleeping(tsk);

into sched_submit_work() as part of the RT patchset.
> > This looks very RT related right now. I see 2 problems from my reading
> > (and experimentation). The first is that the second worker isn't getting
> > started because tsk_is_pi_blocked is reporting that the task isn't blocked
> > on a mutex. The second is that even if another worker needs to be
> > scheduled because the original worker is blocked on a mutex, we need the
> > pool lock to schedule another worker. The pool lock can be acquired by any
> > CPU, and is a spin_lock. If we end up on the slow path for the pool lock,
> > we hit BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on))
> > in task_blocks_on_rt_mutex in rtmutex.c. I'm not sure how to deal with
> > either problem.
> >
> > Hopefully I've got all my facts right... Debugging kernel code is a whole
> > new world from userspace code.
>
> I don't have much idea how the RT kernel works either. Can you reproduce
> the issues that you see on mainline?
>
> Thanks.
>
> --
> tejun
I'll see what I can do.
Thanks!
Austin
On Wed, Jun 25, 2014 at 10:18:36AM -0400, Tejun Heo wrote:
> Hello, Dave.
>
> On Wed, Jun 25, 2014 at 03:56:41PM +1000, Dave Chinner wrote:
> > Hmmm - that's different from my understanding of what the original
> > behaviour WQ_MEM_RECLAIM gave us. i.e. that WQ_MEM_RECLAIM
> > workqueues had a rescuer thread created to guarantee that the
> > *workqueue* could make forward progress executing work in a
> > reclaim context.
>
> From Documentation/workqueue.txt
>
> WQ_MEM_RECLAIM
>
> All wq which might be used in the memory reclaim paths _MUST_
> have this flag set. The wq is guaranteed to have at least one
> execution context regardless of memory pressure.
>
> So, all that's guaranteed is that the workqueue has at least one
> worker executing its work items. If that one worker is serving a work
> item which can't make forward progress, the workqueue is not
> guaranteed to make forward progress.
Adding that to the docco might be useful ;)
> > > As long as a WQ_MEM_RECLAIM workqueue doesn't depend upon itself,
> > > forward-progress is guaranteed.
> >
> > I can't find any documentation that actually defines what
> > WQ_MEM_RECLAIM means, so I can't tell when or how this requirement
> > came about. If it's true, then I suspect most of the WQ_MEM_RECLAIM
> > workqueues in filesystems violate it. Can you point me at
> > documentation/commits/code describing the constraints of
> > WQ_MEM_RECLAIM and the reasons for it?
>
> Documentation/workqueue.txt should be it but maybe we should be more
> explicit. The behavior maintains what the
> pre-concurrency-management workqueue provided with static
> per-workqueue workers. Each workqueue reserved its workers (either
> one per cpu or one globally) and only supported a single level of
> concurrency on each CPU. WQ_MEM_RECLAIM provides an equivalent
> forward-progress guarantee, so all the existing users
> shouldn't have issues on this front. If we have grown incorrect
> usages since then, we need to fix them.
Ok, so it hasn't changed. We're still using them like we used the
original workqueues, and we never, ever provided a guarantee of
forwards progress for them, either. So if the workqueues haven't
changed, and we haven't changed how we use workqueues, then
something else is causing all our recent problems....
Thanks for the clarification, Tejun!
Cheers,
Dave.
--
Dave Chinner
[email protected]