Subject: Re: [BUG] File system corruption with 4.4-rc3 and beyond
To: Steven Rostedt <rostedt@goodmis.org>, LKML <linux-kernel@vger.kernel.org>
References: <20151222190908.658e7fb7@gandalf.local.home>
CC: Linus Torvalds <torvalds@linux-foundation.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Michael Ellerman <mpe@ellerman.id.au>,
        Mark Salter <msalter@redhat.com>,
        Laurent Dufour <ldufour@linux.vnet.ibm.com>,
        Ming Lei <ming.lei@canonical.com>, <linux-block@vger.kernel.org>
From: Jens Axboe <axboe@fb.com>
Message-ID: <5679E7FB.3080505@fb.com>
Date: Tue, 22 Dec 2015 17:16:59 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101
 Thunderbird/38.4.0
MIME-Version: 1.0
In-Reply-To: <20151222190908.658e7fb7@gandalf.local.home>
Content-Type: text/plain; charset="windows-1252"; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4539
Lines: 97

On 12/22/2015 05:09 PM, Steven Rostedt wrote:
> OK, I started with 4.4-rc4 to add some urgent ftrace patches and
> started testing. My tests started to fail, and then I noticed they
> failed with v4.4-rc4 as well. I got strange errors. Finally, I noticed
> that I was constantly getting messages like this:
>
> ata2.00: exception Emask 0x60 SAct 0x7800000 SErr 0x800 action 0x6 frozen
> ata2.00: irq_stat 0x20000000, host bus error
> ata2: SError: { HostInt }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/00:b8:f3:f2:2e/08:00:0e:00:00/40 tag 23 ncq 1048576 out
>           res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/00:c0:f3:fa:2e/08:00:0e:00:00/40 tag 24 ncq 1048576 out
>           res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/00:c8:f3:02:2f/08:00:0e:00:00/40 tag 25 ncq 1048576 out
>           res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/b8:d0:f3:0a:2f/08:00:0e:00:00/40 tag 26 ncq 1142784 out
>           res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2: hard resetting link
> ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> ata2.00: configured for UDMA/100
> ata2: EH complete
>
>
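
As an aside on reading the log above: the "cmd" lines can be decoded by
hand to see what each failed write covered. Below is a minimal sketch,
not taken from the kernel. It assumes the dump is laid out as
command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:
hob_lbam:hob_lbah/device, that for NCQ commands the sector count sits in
the feature registers and the tag in bits 7:3 of the count register, and
that logical sectors are 512 bytes. The names decode_ncq_cmd and
decode_sact are made up for illustration.

```python
import re

# Matches the taskfile dump printed for a failed queued command, e.g.
#   cmd 61/00:b8:f3:f2:2e/08:00:0e:00:00/40 tag 23 ncq 1048576 out
CMD_RE = re.compile(
    r"cmd (?P<tf>[0-9a-f]{2}(?:[/:][0-9a-f]{2}){11}) "
    r"tag (?P<tag>\d+) ncq (?P<nbytes>\d+)"
)

SECTOR_SIZE = 512  # assumption: 512-byte logical sectors


def decode_ncq_cmd(line):
    """Decode one 'cmd ... tag N ncq BYTES' line (assumed field layout)."""
    m = CMD_RE.search(line)
    if m is None:
        raise ValueError("no NCQ cmd taskfile found in line")
    (command, feature, nsect, lbal, lbam, lbah,
     hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah,
     device) = (int(b, 16) for b in re.split(r"[/:]", m.group("tf")))
    # For NCQ, the 16-bit sector count is split across the feature registers.
    sectors = (hob_feature << 8) | feature
    # The 48-bit LBA is assembled from the six LBA bytes, low byte first.
    lba = (lbal | lbam << 8 | lbah << 16
           | hob_lbal << 24 | hob_lbam << 32 | hob_lbah << 40)
    return {
        "command": command,
        "tag": nsect >> 3,                 # tag encoded in count bits 7:3
        "logged_tag": int(m.group("tag")),
        "sectors": sectors,
        "lba": lba,
        "bytes": sectors * SECTOR_SIZE,
        "logged_bytes": int(m.group("nbytes")),
    }


def decode_sact(mask):
    """Return the NCQ tags whose bits are set in an SAct bitmask."""
    return [bit for bit in range(32) if mask >> bit & 1]
```

Run against the quoted log, the first command decodes to tag 23, 2048
sectors starting at LBA 0x0e2ef2f3, which is 1048576 bytes and matches
the logged size. SAct 0x7800000 decodes to tags 23 through 26, the four
commands listed, and under the same assumptions each of them starts
where the previous one ends.
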
> The test box has a relatively new mobo and such, but I know the HD was
> old. So I thought that the HD was simply failing. I installed a new HD
> and spent lots of time since last Thursday trying to set it up to work
> with my testing scripts. Unfortunately, I installed a newer Fedora that
> no longer supported the older grub1 and I wasted lots of time trying to
> get grub2 to do what I wanted. I finally gave up and used
> syslinux/extlinux and got it working again. Unfortunately, I still got
> these ata2 errors! I started thinking that the mobo may be bad.
>
> But then I decided to try an older kernel, and the errors never showed
> up. I booted back and forth several times and the errors were very
> reliable. I have multiple OSes on this box so every time I got an
> error, I would boot into one of the other OSes and do fsck on the
> filesystems. Because the longer I ran my tests with this bug, it would
> eventually start corrupting the ext4 filesystem.
>
> Since it seemed very reliable, I started my bisect. It came down to this
> patch:
>
>  From 578270bfbd2803dc7b0b03fbc2ac119efbc73195 Mon Sep 17 00:00:00 2001
> From: Ming Lei <ming.lei@canonical.com>
> Date: Tue, 24 Nov 2015 10:35:29 +0800
> Subject: [PATCH] block: fix segment split
>
>
> I thought this strange, because I don't see anything wrong with this
> patch. But if I removed it, the problem went away, and when I added it
> back, the problem would show up easily.
>
> I checkout v4.4-rc6 and tested again, thinking something else may be
> wrong and has since been fixed. Nope, the error still showed up. I then
> removed this commit and tried again. Sure enough, the problem went away!

Probably the other way around, I think: the patch uncovered an existing
issue with the segment counting in certain cases.

> My guess is that there's another bug lurking around somewhere, and the
> bug that this patch fixed hid the problem. Now that this patch fixed a
> bug that would hide the issue, the issue is showing up.
>
> I'll pass this along to the block experts and see what you can think of
> it. I attached my config, and the test was a script that stress
> trace-cmd filters.
>
> Oh, and I ran this on my i386 kernel and OS. I haven't tried testing
> much on x86_64 as my tests start with i386. It originally had issues in
> x86_64 but that may be because the i386 test corrupted the filesystem
> which is shared.
>
> There may be a 32bit vs 64bit issue somewhere?

I'm guessing it's the same issue that was recently diagnosed, which
would make sense if you hit this on 32-bit with highmem. A patch is
pending; if you feel inclined, it'd be great if you could apply it
and retry:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=23688bf4f830a89866fd0ed3501e342a7360fe4f

-- 
Jens Axboe
