Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751389AbaFXDDa (ORCPT ); Mon, 23 Jun 2014 23:03:30 -0400 Received: from ipmail06.adl6.internode.on.net ([150.101.137.145]:63426 "EHLO ipmail06.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750827AbaFXDD3 (ORCPT ); Mon, 23 Jun 2014 23:03:29 -0400 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: AsBSALLpqFN5LC2vPGdsb2JhbABagw2DSKgiBpk1AYEKFwQBAQEBODWDfAcBAQQBJxMcIwULCBsJJQ8FJQMHGhOFSYJxB8QTFxaFTYkZB4MtgRYEmkuXOCs Date: Tue, 24 Jun 2014 13:02:40 +1000 From: Dave Chinner To: Austin Schuh Cc: xfs , linux-kernel@vger.kernel.org, Tejun Heo Subject: On-stack work item completion race? (was Re: XFS crash?) Message-ID: <20140624030240.GB9508@dastard> References: <20140305233551.GK6851@dastard> <20140513034647.GA5421@dastard> <20140513063943.GQ26353@dastard> <20140513090321.GR26353@dastard> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org [Adding Tejun and lkml to the cc list] On Mon, Jun 23, 2014 at 01:05:54PM -0700, Austin Schuh wrote: > I found 1 bug in XFS which I fixed, and I've uncovered something else that > I'm not completely sure how to fix. > > In xfs_bmapi_allocate, you create a completion, and use that to wait until > the work has finished. Occasionally, I'm seeing a case where I get a > context switch after the completion has been completed, but before the > workqueue has finished doing it's internal book-keeping. This results in > the work being deleted before the workqueue is done using it, corrupting > the internal data structures. I fixed it by waiting using flush_work and > removing the completion entirely. > > --- a/fs/xfs/xfs_bmap_util.c 2014-06-23 12:59:10.008678410 -0700 > +++ b/fs/xfs/xfs_bmap_util.c 2014-06-23 12:59:14.116678239 -0700 > @@ -263,7 +263,6 @@ > current_set_flags_nested(&pflags, PF_FSTRANS); > > args->result = __xfs_bmapi_allocate(args); > - complete(args->done); > > current_restore_flags_nested(&pflags, PF_FSTRANS); > } > @@ -277,16 +276,13 @@ > xfs_bmapi_allocate( > struct xfs_bmalloca *args) > { > - DECLARE_COMPLETION_ONSTACK(done); > - > if (!args->stack_switch) > return __xfs_bmapi_allocate(args); > > > - args->done = &done; > INIT_WORK_ONSTACK(&args->work, xfs_bmapi_allocate_worker); > queue_work(xfs_alloc_wq, &args->work); > - wait_for_completion(&done); > + flush_work(&args->work); > destroy_work_on_stack(&args->work); > return args->result; > } > --- a/fs/xfs/xfs_bmap_util.h 2014-06-23 12:59:10.008678410 -0700 > +++ b/fs/xfs/xfs_bmap_util.h 2014-06-23 12:59:11.708678340 -0700 > @@ -57,7 +57,6 @@ > char conv; /* overwriting unwritten extents */ > char stack_switch; > int flags; > - struct completion *done; > struct work_struct work; > int result; > }; Ok, that's a surprise. However, I can't see how using flush_work() solves that underlying context switch problem, because it's implemented the same way: bool flush_work(struct work_struct *work) { struct wq_barrier barr; lock_map_acquire(&work->lockdep_map); lock_map_release(&work->lockdep_map); if (start_flush_work(work, &barr)) { wait_for_completion(&barr.done); destroy_work_on_stack(&barr.work); return true; } else { return false; } } start_flush_work() is effectively a special queue_work() implementation, so if if it's not safe to call complete() from the workqueue as the above patch implies then this code has the same problem. Tejun - is this "do it yourself completion" a known issue w.r.t. workqueues? I can't find any documentation that says "don't do that" so...? A quick grep also shows up the same queue_work/wait_for_completion pattern in arch/x86/kernel/hpet.c, drivers/md/dm-thin.c, fs/fs-writeback.c, drivers/block/drbd/drbd_main.c.... > I enabled event tracing (and added a new event which lists the number of > workers running in a queue whenever that is changed). > > To me, it looks like work is scheduled from irq/44-ahci-273 that will > acquire an inode lock. scp-3986 then acquires the lock, and then goes and > schedules work. That work is then scheduled behind the work from > irq/44-ahci-273 in the same pool. The first work blocks waiting on the > lock, and scp-3986 won't finish and release that lock until the second work > gets run. IOWs, scp takes an inode lock and queues work to the xfs_alloc_wq, then schedules. Then a kworker runs an xfs-data work item, which tries to take the inode lock and blocks. As I understand it, what then happens is that the workqueue code grabs another kworker thread and runs the next work item in it's queue. IOWs, work items can block, but doing that does not prevent execution of other work items queued on other work queues or even on the same work queue. Tejun, did I get that correct? Hence the work on the xfs-data queue will block until another kworker processes the item on the xfs-alloc-wq which means progress is made and the inode gets unlocked. Then the kworker for the work on the xfs-data queue will get the lock, complete it's work and everything has resolved itself. Cheers, Dave. -- Dave Chinner david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/