Date: Fri, 18 Mar 2005 03:07:31 -0800
From: Andrew Morton <akpm@osdl.org>
To: rostedt@goodmis.org
Cc: mingo@elte.hu, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] remove lame schedule in journal inverted_lock (was: Re:
 [patch 0/3] j_state_lock, j_list_lock, remove-bitlocks)
Message-Id: <20050318030731.224aa95f.akpm@osdl.org>
In-Reply-To: <Pine.LNX.4.58.0503180531580.21994@localhost.localdomain>
References: <Pine.LNX.4.58.0503141024530.697@localhost.localdomain>
	<Pine.LNX.4.58.0503150746110.6456@localhost.localdomain>
	<20050315133540.GB4686@elte.hu>
	<Pine.LNX.4.58.0503151150170.6456@localhost.localdomain>
	<20050316085029.GA11414@elte.hu>
	<20050316011510.2a3bdfdb.akpm@osdl.org>
	<20050316095155.GA15080@elte.hu>
	<20050316020408.434cc620.akpm@osdl.org>
	<20050316101906.GA17328@elte.hu>
	<20050316024022.6d5c4706.akpm@osdl.org>
	<Pine.LNX.4.58.0503160600200.11824@localhost.localdomain>
	<20050316031909.08e6cab7.akpm@osdl.org>
	<Pine.LNX.4.58.0503160853360.11824@localhost.localdomain>
	<Pine.LNX.4.58.0503161141001.14087@localhost.localdomain>
	<Pine.LNX.4.58.0503161234350.14460@localhost.localdomain>
	<20050316131521.48b1354e.akpm@osdl.org>
	<Pine.LNX.4.58.0503170406500.17019@localhost.localdomain>
	<Pine.LNX.4.58.0503180415370.21994@localhost.localdomain>
	<20050318013251.330e4d78.akpm@osdl.org>
	<Pine.LNX.4.58.0503180531580.21994@localhost.localdomain>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3113
Lines: 82

Steven Rostedt <rostedt@goodmis.org> wrote:
>
> >
>  > I really should knock up a script to send out an email when I add a patch
>  > to -mm.
>  >
> 
>  I thought you might have had something like that already, which was
>  another reason I thought you might have skipped this.
>

I do now..

> 
>  > > I'd figure that you may have
>  > > missed this, since the subject didn't change.  So I changed the subject to
>  > > get your attention, and I've resent this. Here's the patch to get rid of
>  > > the the lame schedule that was in fs/jbd/commit.c.   Let me know if this
>  > > patch is appropriate.
>  >
>  > I'm rather aghast at all the ifdeffery and complexity in this one.  But I
>  > haven't looked at it closely yet.
>  >
> 
>  I wanted to keep the wait logic out when it wasn't a problem. Basically,
>  the problem only occurs when bit_spin_trylock is defined as an actual
>  trylock. So I put in a define there to enable the wait queues.  I didn't
>  want to waste cycles checking the wait queue in jbd_unlock_bh_state when
>  there would never be anything on it.  Heck, I figured why even have the
>  wait queue wasting memory if it wasn't needed.  So that added the
>  ifdeffery complexity.

No, that code's just a problem.  For ranking reasons it's essentially doing
this:

repeat:
	cond_resched();
	spin_lock(j_list_lock);
	....
	if (!bit_spin_trylock(bh)) {
		spin_unlock(j_list_lock);
		schedule();
		goto repeat;
	}

Now imagine that some other CPU holds the bit_spin_lock and is spinning,
trying to get the spin_lock().  The above code assumes that the schedule()
and cond_resched() will take "long enough" for the other CPU to get the
spinlock, do its business then release the locks.

So all the schedule() is really doing is "blow a few cycles so the other
CPU can get in and grab the spinlock".  That'll work OK on normal SMP but I
suspect that on NUMA setups with really big latencies we could end up
starving the other CPU: this CPU would keep on grabbing the lock.  It
depends on how the interconnect cache and all that goop works.

So what to do?

One approach would be to spin on the bit_spin_trylock after having dropped
j_list_lock.  That'll tell us when the other CPU has moved on.

Another approach would be to sleep on a waitqueue somewhere.  But that
means that jbd_unlock_bh_state() needs to do wakeups all the time - costly.

Another approach would be to simply whack an msleep(1) in there.  That
might be OK - it should be very rare.

Probably the first approach would be the one to use.  That's for mainline. 
I don't know what the super-duper-RT fix would be.  Why did we start
discussing this anyway?

Oh, SCHED_FIFO.  kjournald doesn't run SCHED_FIFO, but someone may decide
to make it do so.  But even then I don't see a problem for the mainline
kernel, because this CPU's SCHED_FIFO doesn't stop the other CPU from
running.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/