Subject: Re: 2.6.30-rc8 Oops whilst booting
From: James Bottomley
To: Linus Torvalds
Cc: Chris Clayton, Jaswinder Singh Rajput, NeilBrown, linux-kernel@vger.kernel.org, scsi, Tejun Heo, Arjan van de Ven
Date: Mon, 08 Jun 2009 17:38:16 +0000
Message-Id: <1244482696.4079.345.camel@mulgrave.site>

On Mon, 2009-06-08 at 10:21 -0700, Linus Torvalds wrote:
>
> On Mon, 8 Jun 2009, James Bottomley wrote:
> >
> > The root cause is a reordering of the devices caused by the async code.
>
> That's NULL information.
>
> OF COURSE the root cause is the async code. We know that. We're looking
> for the specifics.
>
> In particular, before that commit, at most you will wait for too _much_.
> In other words, it's a "good" wait.
>
> Your commit caused it to wait for less, and that then showed a bug. Not
> all that surprising - it's now not waiting enough.

right ... my question was whether this exposed an existing bug that was
hidden by the waiting too much.  Actually, I audited all the async code
and that's impossible: we don't actually have any async domains at all
(except for the spurious superblock s_async_list, which never gets
anything added to its runqueue), so it must be a bug in the code.

> You tried to avoid a deadlock situation of waiting for too much, but you
> avoided the deadlock by now waiting for too little.
>
> I also think that your code is simply buggy. As far as I can tell, in the
> case of having both running and pending events, you'll always pick the
> pending cookie. But it's the _running_ cookie that has the lower event
> number, isn't it?

Yes, see later fix.  Assuming we get confirmation from the reporter, we
should be good to go.

> I dunno. It all looks very fishy to me.

Well, the other option is to revert the fix ... since there is no other
separated domain, there's nothing really to fix ... the original code
that showed the problem was a SCSI feature tree conversion of our
current async scanning code to the async infrastructure which used a
separate domain.

James
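
For illustration only, the cookie selection questioned above can be
sketched in userspace C. This is not the kernel's code and not the
"later fix" itself: cookie_t, pick_pending_first, pick_lowest, the
array-based queues and the idle value are all hypothetical stand-ins
chosen for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's cookie type. */
typedef unsigned long long cookie_t;

/*
 * Each queue is modelled as an array sorted in ascending cookie order,
 * so element 0 is the lowest cookie in that queue. `idle` is the value
 * returned when nothing is outstanding (an assumption of this sketch).
 */

/* The behaviour questioned in the thread: a pending entry always wins. */
static cookie_t pick_pending_first(const cookie_t *running, size_t nr_running,
                                   const cookie_t *pending, size_t nr_pending,
                                   cookie_t idle)
{
    if (nr_pending > 0)
        return pending[0];
    if (nr_running > 0)
        return running[0];
    return idle;
}

/* Alternative: report the lower of the two queue heads. */
static cookie_t pick_lowest(const cookie_t *running, size_t nr_running,
                            const cookie_t *pending, size_t nr_pending,
                            cookie_t idle)
{
    cookie_t lowest = idle;
    int found = 0;

    if (nr_running > 0) {
        lowest = running[0];
        found = 1;
    }
    if (nr_pending > 0 && (!found || pending[0] < lowest))
        lowest = pending[0];
    return lowest;
}

/* Self-check: cookie 3 is still running while 5 and 6 are pending. */
static void check_example(void)
{
    const cookie_t running[] = { 3 };
    const cookie_t pending[] = { 5, 6 };

    /* Pending-first reports 5 and so overlooks the running 3. */
    assert(pick_pending_first(running, 1, pending, 2, 7) == 5);
    /* Taking the minimum reports 3. */
    assert(pick_lowest(running, 1, pending, 2, 7) == 3);
}
```

With cookie 3 running and 5 pending, the pending-first picker reports 5,
so a waiter that trusts it would treat 3 as already finished. Under the
assumptions of this sketch, that corresponds to the "waiting for too
little" described above.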