Date: Wed, 23 Dec 2015 09:14:28 +0800
From: Ming Lei
To: Jens Axboe
Cc: Steven Rostedt, LKML, Linus Torvalds, Andrew Morton, Michael Ellerman, Mark Salter, Laurent Dufour, linux-block@vger.kernel.org
Subject: Re: [BUG] File system corruption with 4.4-rc3 and beyond
In-Reply-To: <5679E7FB.3080505@fb.com>
References: <20151222190908.658e7fb7@gandalf.local.home> <5679E7FB.3080505@fb.com>

On Wed, Dec 23, 2015 at 8:16 AM, Jens Axboe wrote:
> On 12/22/2015 05:09 PM, Steven Rostedt wrote:
>>
>> OK, I started with 4.4-rc4 to add some urgent ftrace patches and
>> started testing. My tests started to fail, and then I noticed they
>> failed with v4.4-rc4 as well. I got strange errors.
>> Finally, I noticed that I was constantly getting messages like this:
>>
>> ata2.00: exception Emask 0x60 SAct 0x7800000 SErr 0x800 action 0x6 frozen
>> ata2.00: irq_stat 0x20000000, host bus error
>> ata2: SError: { HostInt }
>> ata2.00: failed command: WRITE FPDMA QUEUED
>> ata2.00: cmd 61/00:b8:f3:f2:2e/08:00:0e:00:00/40 tag 23 ncq 1048576 out
>>          res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
>> ata2.00: status: { DRDY }
>> ata2.00: failed command: WRITE FPDMA QUEUED
>> ata2.00: cmd 61/00:c0:f3:fa:2e/08:00:0e:00:00/40 tag 24 ncq 1048576 out
>>          res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
>> ata2.00: status: { DRDY }
>> ata2.00: failed command: WRITE FPDMA QUEUED
>> ata2.00: cmd 61/00:c8:f3:02:2f/08:00:0e:00:00/40 tag 25 ncq 1048576 out
>>          res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
>> ata2.00: status: { DRDY }
>> ata2.00: failed command: WRITE FPDMA QUEUED
>> ata2.00: cmd 61/b8:d0:f3:0a:2f/08:00:0e:00:00/40 tag 26 ncq 1142784 out
>>          res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
>> ata2.00: status: { DRDY }
>> ata2: hard resetting link
>> ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
>> ata2.00: configured for UDMA/100
>> ata2: EH complete
>>
>>
>> The test box has a relatively new mobo and such, but I know the HD was
>> old. So I thought that the HD was simply failing. I installed a new HD
>> and spent lots of time since last Thursday trying to set it up to work
>> with my testing scripts. Unfortunately, I installed a newer Fedora that
>> no longer supported the older grub1 and I wasted lots of time trying to
>> get grub2 to do what I wanted. I finally gave up and used
>> syslinux/extlinux and got it working again. Unfortunately, I still got
>> these ata2 errors! I started thinking that the mobo may be bad.
>>
>> But then I decided to try an older kernel, and the errors never showed
>> up.
>> I booted back and forth several times and the errors were very
>> reliable. I have multiple OSes on this box so every time I got an
>> error, I would boot into one of the other OSes and do fsck on the
>> filesystems. Because the longer I ran my tests with this bug, it would
>> eventually start corrupting the ext4 filesystem.
>>
>> Since it seemed very reliable, I started my bisect. It came down to this
>> patch:
>>
>> From 578270bfbd2803dc7b0b03fbc2ac119efbc73195 Mon Sep 17 00:00:00 2001
>> From: Ming Lei
>> Date: Tue, 24 Nov 2015 10:35:29 +0800
>> Subject: [PATCH] block: fix segment split
>>
>>
>> I thought this strange, because I don't see anything wrong with this
>> patch. But if I removed it, the problem went away, and when I added it
>> back, the problem would show up easily.
>>
>> I checkout v4.4-rc6 and tested again, thinking something else may be
>> wrong and has since been fixed. Nope, the error still showed up. I then
>> removed this commit and tried again. Sure enough, the problem went away!
>
>
> Probably the other way around, I think, it uncovered an issue with the
> segment counting for certain cases.

Diethard said the same case can be fixed by the patch 'block: ensure to
split after potentially bouncing a bio', so please just test it.

It also looks like it would be helpful to add a warning for a split bio
in bio_for_each_segment_all().

>
>> My guess is that there's another bug lurking around somewhere, and the
>> bug that this patch fixed hid the problem. Now that this patch fixed a
>> bug that would hide the issue, the issue is showing up.
>>
>> I'll pass this along to the block experts and see what you can think of
>> it. I attached my config, and the test was a script that stress
>> trace-cmd filters.
>>
>> Oh, and I ran this on my i386 kernel and OS. I haven't tried testing
>> much on x86_64 as my tests start with i386. It originally had issues in
>> x86_64 but that may be because the i386 test corrupted the filesystem
>> which is shared.
>>
>> There may be a 32bit vs 64bit issue somewhere?
>
>
> I'm guessing it's the same issue that was recently diagnosed, which would
> make sense if you hit this on 32-bit with highmem. Patch is pending, if you
> feel inclined, it'd be great if you could add this patch and retry:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=23688bf4f830a89866fd0ed3501e342a7360fe4f
>
> --
> Jens Axboe
>