Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S934151AbbLWARV (ORCPT ); Tue, 22 Dec 2015 19:17:21 -0500
Received: from mx0b-00082601.pphosted.com ([67.231.153.30]:48875 "EHLO
        mx0b-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S932594AbbLWART (ORCPT );
        Tue, 22 Dec 2015 19:17:19 -0500
Subject: Re: [BUG] File system corruption with 4.4-rc3 and beyond
To: Steven Rostedt , LKML
References: <20151222190908.658e7fb7@gandalf.local.home>
CC: Linus Torvalds , Andrew Morton , Michael Ellerman , Mark Salter ,
        Laurent Dufour , Ming Lei ,
From: Jens Axboe
Message-ID: <5679E7FB.3080505@fb.com>
Date: Tue, 22 Dec 2015 17:16:59 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
        Thunderbird/38.4.0
MIME-Version: 1.0
In-Reply-To: <20151222190908.658e7fb7@gandalf.local.home>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
X-Originating-IP: [192.168.54.13]
X-Proofpoint-Spam-Reason: safe
X-FB-Internal: Safe
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10432:,,
        definitions=2015-12-22_17:,, signatures=0
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 4539
Lines: 97

On 12/22/2015 05:09 PM, Steven Rostedt wrote:
> OK, I started with 4.4-rc4 to add some urgent ftrace patches and
> started testing. My tests started to fail, and then I noticed they
> failed with v4.4-rc4 as well. I got strange errors.
> Finally, I noticed that I was constantly getting messages like this:
>
> ata2.00: exception Emask 0x60 SAct 0x7800000 SErr 0x800 action 0x6 frozen
> ata2.00: irq_stat 0x20000000, host bus error
> ata2: SError: { HostInt }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/00:b8:f3:f2:2e/08:00:0e:00:00/40 tag 23 ncq 1048576 out
>          res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/00:c0:f3:fa:2e/08:00:0e:00:00/40 tag 24 ncq 1048576 out
>          res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/00:c8:f3:02:2f/08:00:0e:00:00/40 tag 25 ncq 1048576 out
>          res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/b8:d0:f3:0a:2f/08:00:0e:00:00/40 tag 26 ncq 1142784 out
>          res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2: hard resetting link
> ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> ata2.00: configured for UDMA/100
> ata2: EH complete
>
>
> The test box has a relatively new mobo and such, but I know the HD was
> old. So I thought that the HD was simply failing. I installed a new HD
> and spent lots of time since last Thursday trying to set it up to work
> with my testing scripts. Unfortunately, I installed a newer Fedora that
> no longer supported the older grub1 and I wasted lots of time trying to
> get grub2 to do what I wanted. I finally gave up and used
> syslinux/extlinux and got it working again. Unfortunately, I still got
> these ata2 errors! I started thinking that the mobo may be bad.
>
> But then I decided to try an older kernel, and the errors never showed
> up. I booted back and forth several times and the errors were very
> reliable.
> I have multiple OSes on this box so every time I got an
> error, I would boot into one of the other OSes and do fsck on the
> filesystems. Because the longer I ran my tests with this bug, it would
> eventually start corrupting the ext4 filesystem.
>
> Since it seemed very reliable, I started my bisect. It came down to this
> patch:
>
> From 578270bfbd2803dc7b0b03fbc2ac119efbc73195 Mon Sep 17 00:00:00 2001
> From: Ming Lei
> Date: Tue, 24 Nov 2015 10:35:29 +0800
> Subject: [PATCH] block: fix segment split
>
>
> I thought this strange, because I don't see anything wrong with this
> patch. But if I removed it, the problem went away, and when I added it
> back, the problem would show up easily.
>
> I checkout v4.4-rc6 and tested again, thinking something else may be
> wrong and has since been fixed. Nope, the error still showed up. I then
> removed this commit and tried again. Sure enough, the problem went away!

Probably the other way around, I think, it uncovered an issue with the
segment counting for certain cases.

> My guess is that there's another bug lurking around somewhere, and the
> bug that this patch fixed hid the problem. Now that this patch fixed a
> bug that would hide the issue, the issue is showing up.
>
> I'll pass this along to the block experts and see what you can think of
> it. I attached my config, and the test was a script that stress
> trace-cmd filters.
>
> Oh, and I ran this on my i386 kernel and OS. I haven't tried testing
> much on x86_64 as my tests start with i386. It originally had issues in
> x86_64 but that may be because the i386 test corrupted the filesystem
> which is shared.
>
> There may be a 32bit vs 64bit issue somewhere?

I'm guessing it's the same issue that was recently diagnosed, which would
make sense if you hit this on 32-bit with highmem.
Patch is pending. If you feel inclined, it'd be great if you could add
this patch and retry:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=23688bf4f830a89866fd0ed3501e342a7360fe4f

--
Jens Axboe
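
For anyone reading the quoted ata2 log, the "cmd" lines can be decoded to
see which requests failed. The sketch below is not part of the original
exchange. It assumes the usual libata field order
(command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device)
and the NCQ convention that the sector count travels in the feature
registers and the tag in bits 7:3 of nsect. The helper name decode_ncq and
the 512-byte sector size are assumptions made for illustration.

```python
import re

# Matches the taskfile dump plus the "tag N ncq BYTES" suffix of a cmd line.
CMD_RE = re.compile(
    r"cmd\s+((?:[0-9a-f]{2}[/:])+[0-9a-f]{2})\s+tag\s+(\d+)\s+ncq\s+(\d+)"
)

SECTOR_SIZE = 512  # assumed logical sector size


def decode_ncq(line):
    """Decode one 'cmd ...' line from the log (hypothetical helper)."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("not an NCQ cmd line")
    regs = [int(x, 16) for x in re.split(r"[/:]", m.group(1))]
    if len(regs) != 12:
        raise ValueError("expected 12 taskfile bytes")
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah, device) = regs
    # For NCQ commands the sector count is split across feature/hob_feature.
    sectors = (hob_feature << 8) | feature
    # 48-bit LBA: high-order bytes come from the hob_* registers.
    lba = ((hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
           | (lbah << 16) | (lbam << 8) | lbal)
    return {
        "command": command,
        "tag": nsect >> 3,              # NCQ tag lives in bits 7:3 of nsect
        "sectors": sectors,
        "bytes": sectors * SECTOR_SIZE,
        "lba": lba,
        "logged_tag": int(m.group(2)),    # as printed by the kernel
        "logged_bytes": int(m.group(3)),  # as printed by the kernel
    }


if __name__ == "__main__":
    first = decode_ncq(
        "ata2.00: cmd 61/00:b8:f3:f2:2e/08:00:0e:00:00/40 "
        "tag 23 ncq 1048576 out"
    )
    print(first)
```

Under those assumptions the decoded tag and byte count agree with the
values the kernel printed on each line, and the four failed writes cover
one contiguous LBA range: three 1 MiB requests followed by a slightly
larger one.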