Date: Fri, 12 Mar 2004 18:12:14 -0800
From: Andrew Morton
To: Trond Myklebust
Cc: thockin@Sun.COM, linux-kernel@vger.kernel.org
Subject: Re: calling flush_scheduled_work()
Message-Id: <20040312181214.15316729.akpm@osdl.org>
In-Reply-To: <1079140670.3166.87.camel@lade.trondhjem.org>
References: <20040312205814.GY1333@sun.com>
	<20040312152747.0b3f74d3.akpm@osdl.org>
	<1079140670.3166.87.camel@lade.trondhjem.org>

Trond Myklebust wrote:
>
> On Fri, 12/03/2004 at 18:27, Andrew Morton wrote:
> > Seems simple enough to fix the workqueue code to handle this situation.
> >
> > Wanna test this for me?
>
> What if one of the workloads on the queue has another call to
> flush_scheduled_work()? (which again finds itself calling another
> instance, ...)

I fixed that with the old solution-via-changelog:

  Because keventd is a resource which is shared between unrelated parts
  of the kernel, it is possible for one person's workqueue handler to
  accidentally call another person's flush_scheduled_work().

  thockin managed it by calling mntput() from a workqueue handler.  It
  deadlocks.

  It's simple enough to fix: teach flush_scheduled_work() to go direct
  when it discovers that the calling thread is the one which should be
  running the work.

  Note that this can cause recursion.  The depth of that recursion is
  equal to the number of currently-queued works which themselves want to
  call flush_scheduled_work().  If this ever exceeds three I'll eat my
  hat.  Really, it's so unlikely it's hardly worth bothering about.
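Roughly, the go-direct logic in the per-CPU flush path ends up looking
like this (a sketch of the idea only, not the patch itself; the helper
name flush_cpu_workqueue() and the wait-loop details are illustrative):

	static void flush_cpu_workqueue(struct cpu_workqueue_struct *cwq)
	{
		if (cwq->thread == current) {
			/*
			 * A work handler running on this workqueue's own
			 * thread is flushing it.  Sleeping on work_done
			 * would deadlock (the thread cannot run the works
			 * while it sleeps), so drain the queue by hand.
			 * Nested flushes recurse here.
			 */
			run_workqueue(cwq);
		} else {
			/* Ordinary caller: wait for the queue to drain. */
			DEFINE_WAIT(wait);
			long sequence_needed;

			spin_lock_irq(&cwq->lock);
			sequence_needed = cwq->insert_sequence;
			while (sequence_needed - cwq->remove_sequence > 0) {
				prepare_to_wait(&cwq->work_done, &wait,
						TASK_UNINTERRUPTIBLE);
				spin_unlock_irq(&cwq->lock);
				schedule();
				spin_lock_irq(&cwq->lock);
			}
			finish_wait(&cwq->work_done, &wait);
			spin_unlock_irq(&cwq->lock);
		}
	}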
> This fix doesn't really resolve the deadlock issue. It just reduces the
> possibilities of it having any consequences...
>
> Could you please at least include a dump_stack() before the call to
> run_workqueue() so that we can catch the bug in the act, and attempt to
> fix it.

How's this?

Add a debug check for workqueues nested more than three deep via the
direct-run-workqueue() path.

---

 25-akpm/kernel/workqueue.c |   10 ++++++++++
 1 files changed, 10 insertions(+)

diff -puN kernel/workqueue.c~flush_scheduled_work-recursion-detect kernel/workqueue.c
--- 25/kernel/workqueue.c~flush_scheduled_work-recursion-detect	2004-03-12 17:54:32.859424968 -0800
+++ 25-akpm/kernel/workqueue.c	2004-03-12 18:11:09.939845704 -0800
@@ -47,6 +47,7 @@ struct cpu_workqueue_struct {
 
 	struct workqueue_struct *wq;
 	task_t *thread;
+	int run_depth;		/* Detect run_workqueue() recursion depth */
 } ____cacheline_aligned;
 
 /*
@@ -140,6 +141,13 @@ static inline void run_workqueue(struct
 	 * done.
 	 */
 	spin_lock_irqsave(&cwq->lock, flags);
+	cwq->run_depth++;
+	if (cwq->run_depth > 3) {
+		/* morton gets to eat his hat */
+		printk("%s: recursion depth exceeded: %d\n",
+			__FUNCTION__, cwq->run_depth);
+		dump_stack();
+	}
 	while (!list_empty(&cwq->worklist)) {
 		struct work_struct *work = list_entry(cwq->worklist.next,
 						struct work_struct, entry);
@@ -157,6 +165,7 @@ static inline void run_workqueue(struct
 		cwq->remove_sequence++;
 		wake_up(&cwq->work_done);
 	}
+	cwq->run_depth--;
 	spin_unlock_irqrestore(&cwq->lock, flags);
 }
 
@@ -286,6 +295,7 @@ struct workqueue_struct *create_workqueu
 	wq = kmalloc(sizeof(*wq), GFP_KERNEL);
 	if (!wq)
 		return NULL;
+	memset(wq, 0, sizeof(*wq));
 
 	for (cpu = 0; cpu < NR_CPUS; cpu++) {
 		if (!cpu_online(cpu))
_
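The sort of nesting the check is aimed at is the mntput()-from-a-handler
case above: a work handler which itself ends up in flush_scheduled_work().
A hypothetical illustration (the handler names are made up, not taken from
the report):

	#include <linux/workqueue.h>

	/*
	 * keventd's worker_thread() enters run_workqueue() at depth 1; when
	 * release_handler() calls flush_scheduled_work() from inside that,
	 * the flush goes direct and re-enters run_workqueue() at depth 2.
	 */
	static void release_handler(void *data)
	{
		/* cleanup which itself wants all pending keventd work done */
		flush_scheduled_work();
	}

	static DECLARE_WORK(release_work, release_handler, NULL);

	void queue_release(void)
	{
		schedule_work(&release_work);	/* runs later in keventd */
	}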