Message-ID: <4DA57CC0.8010400@fusionio.com>
Date: Wed, 13 Apr 2011 12:36:48 +0200
From: Jens Axboe
To: Richard Kennedy
CC: Tejun Heo, Rob Landley, Pete Clements, linux-kernel, "linux-ide@vger.kernel.org"
Subject: Re: Commit 7eaceaccab5f40 causing boot hang.
In-Reply-To: <1302690300.1993.7.camel@castor.rsk>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2011-04-13 12:25, Richard Kennedy wrote:
> On Mon, 2011-04-04 at 14:47 +0100, Richard Kennedy wrote:
>> On Thu, 2011-03-31 at 15:49 +0100, Richard Kennedy wrote:
>>> On Thu, 2011-03-31 at 15:33 +0200, Jens Axboe wrote:
>>> [...]
>>>>>>> Hi Jens,
>>>>>>>
>>>>>>> I'm seeing a problem with fio never completing when writing to 2 disks
>>>>>>> simultaneously. In my test case I'm writing 2GB to both an LVM volume & a
>>>>>>> pata drive on x86_64 on an AMD X2. Could this be a related issue?
>>>>>>>
>>>>>>> I'm not getting anything reported in the log, lockup detection doesn't
>>>>>>> report anything either. The write seems to have finished (the disk light
>>>>>>> activity has stopped) and the cpu cores are both below 10% usage, but
>>>>>>> fio never returns. The test does complete sometimes, but it seems to be
>>>>>>> 1 in 4.
>>>>>>
>>>>>> So when you say PATA, it's /dev/hdaX something as well?
>>>>>>
>>>>>>> I'm going to try tracing it and see if I can spot where it's stuck.
>>>>>>
>>>>>> Thanks, that would be nice.
>>>>>>
>>>>> The second drive is /dev/sdb1 mounted on /opt, both file systems are
>>>>> ext4.
>>>>
>>>> So probably not related. What does the fio job look like?
>>>>
>>> fio job file --
>>> [global]
>>> pre_read=1
>>> ioengine=mmap
>>>
>>> [f1]
>>> size=2g
>>> rw=write
>>> directory=/home/tests
>>>
>>> [f2]
>>> size=2g
>>> rw=write
>>> directory=/opt/tests
>>>
>>> Fio gets run from a script that also collects stats but it's been
>>> running without any problems up until 2.6.39-rc1.
>>>
>> Hi Jens
>> I've upgraded to the latest fio version in the git repo 1.51 and I'm
>> still seeing this problem.
>>
>> Fio gets stuck after it writes the 100% complete message and strace on
>> the processes shows this.
>>
>> the controlling fio process :-
>> ...
>> [pid 8439] wait4(8442, 0x7fff848203ac, WNOHANG, NULL) = 0
>> [pid 8439] nanosleep({0, 10000000}, NULL) = 0
>> [pid 8439] wait4(8441, 0x7fff848203ac, WNOHANG, NULL) = 0
>> [pid 8439] wait4(8442, 0x7fff848203ac, WNOHANG, NULL) = 0
>> [pid 8439] nanosleep({0, 10000000}
>>
>> & the 2 workers are both stopped here, strace shows only the one line
>> for each process.
>>
>> Process 8441 attached - interrupt to quit
>> futex(0x7f9db76a802c, FUTEX_WAIT_PRIVATE, 2, NULL
>>
>>
>> Process 8442 attached - interrupt to quit
>> futex(0x7f9db76a802c, FUTEX_WAIT_PRIVATE, 2, NULL
>>
>> How do I find out which futex it's waiting for?
>> Any ideas where I should look next?
>>
>> I can run the same test successfully on 2.6.38 so is it worth trying to
>> bisect this?
>>
>> thanks
>> Richard
>>
> My problem has gone away in v2.6.39-rc3.
> I've just finished bisecting it down to 6de9843dab3f, & that got
> reverted in rc3, so no problem ;)
>
> (The data corruption caused by that faulty commit was zeroing out the
> shared mutexes in fio & the worker threads were getting stuck on the
> writeout_mutex.)

Great, that's one less regression to worry about :-)

-- 
Jens Axboe