LinuxLists.cc - [BUG] File system corruption with 4.4-rc3 and beyond

2015-12-23 00:09:26

Subject: [BUG] File system corruption with 4.4-rc3 and beyond

OK, I started with 4.4-rc4 to add some urgent ftrace patches and
started testing. My tests started to fail, and then I noticed they
failed with v4.4-rc4 as well. I got strange errors. Finally, I noticed
that I was constantly getting messages like this:

ata2.00: exception Emask 0x60 SAct 0x7800000 SErr 0x800 action 0x6 frozen
ata2.00: irq_stat 0x20000000, host bus error
ata2: SError: { HostInt }
ata2.00: failed command: WRITE FPDMA QUEUED
ata2.00: cmd 61/00:b8:f3:f2:2e/08:00:0e:00:00/40 tag 23 ncq 1048576 out
res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
ata2.00: status: { DRDY }
ata2.00: failed command: WRITE FPDMA QUEUED
ata2.00: cmd 61/00:c0:f3:fa:2e/08:00:0e:00:00/40 tag 24 ncq 1048576 out
res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
ata2.00: status: { DRDY }
ata2.00: failed command: WRITE FPDMA QUEUED
ata2.00: cmd 61/00:c8:f3:02:2f/08:00:0e:00:00/40 tag 25 ncq 1048576 out
res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
ata2.00: status: { DRDY }
ata2.00: failed command: WRITE FPDMA QUEUED
ata2.00: cmd 61/b8:d0:f3:0a:2f/08:00:0e:00:00/40 tag 26 ncq 1142784 out
res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
ata2.00: status: { DRDY }
ata2: hard resetting link
ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata2.00: configured for UDMA/100
ata2: EH complete

The test box has a relatively new mobo and such, but I know the HD was
old. So I thought that the HD was simply failing. I installed a new HD
and spent lots of time since last Thursday trying to set it up to work
with my testing scripts. Unfortunately, I installed a newer Fedora that
no longer supported the older grub1 and I wasted lots of time trying to
get grub2 to do what I wanted. I finally gave up and used
syslinux/extlinux and got it working again. Unfortunately, I still got
these ata2 errors! I started thinking that the mobo may be bad.

But then I decided to try an older kernel, and the errors never showed
up. I booted back and forth several times and the errors were very
reliable. I have multiple OSes on this box so every time I got an
error, I would boot into one of the other OSes and do fsck on the
filesystems. Because the longer I ran my tests with this bug, it would
eventually start corrupting the ext4 filesystem.

Since it seemed very reliable, I started my bisect. It came down to this
patch:

From 578270bfbd2803dc7b0b03fbc2ac119efbc73195 Mon Sep 17 00:00:00 2001
From: Ming Lei <[email protected]>
Date: Tue, 24 Nov 2015 10:35:29 +0800
Subject: [PATCH] block: fix segment split

I thought this strange, because I don't see anything wrong with this
patch. But if I removed it, the problem went away, and when I added it
back, the problem would show up easily.

I checkout v4.4-rc6 and tested again, thinking something else may be
wrong and has since been fixed. Nope, the error still showed up. I then
removed this commit and tried again. Sure enough, the problem went away!

My guess is that there's another bug lurking around somewhere, and the
bug that this patch fixed hid the problem. Now that this patch fixed a
bug that would hide the issue, the issue is showing up.

I'll pass this along to the block experts and see what you can think of
it. I attached my config, and the test was a script that stress
trace-cmd filters.

Oh, and I ran this on my i386 kernel and OS. I haven't tried testing
much on x86_64 as my tests start with i386. It originally had issues in
x86_64 but that may be because the i386 test corrupted the filesystem
which is shared.

There may be a 32bit vs 64bit issue somewhere?

I spent way too much time on this. I'll try testing x86_64 after the
new year, if needed.

Merry X-mas and happy holidays!

-- Steve

Attachments:

(No filename) (3.70 kB)
config-bad (118.15 kB)
Download all attachments

2015-12-23 00:17:21

by Jens Axboe

[permalink] [raw]

Subject: Re: [BUG] File system corruption with 4.4-rc3 and beyond

On 12/22/2015 05:09 PM, Steven Rostedt wrote:
> OK, I started with 4.4-rc4 to add some urgent ftrace patches and
> started testing. My tests started to fail, and then I noticed they
> failed with v4.4-rc4 as well. I got strange errors. Finally, I noticed
> that I was constantly getting messages like this:
>
> ata2.00: exception Emask 0x60 SAct 0x7800000 SErr 0x800 action 0x6 frozen
> ata2.00: irq_stat 0x20000000, host bus error
> ata2: SError: { HostInt }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/00:b8:f3:f2:2e/08:00:0e:00:00/40 tag 23 ncq 1048576 out
> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/00:c0:f3:fa:2e/08:00:0e:00:00/40 tag 24 ncq 1048576 out
> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/00:c8:f3:02:2f/08:00:0e:00:00/40 tag 25 ncq 1048576 out
> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2.00: failed command: WRITE FPDMA QUEUED
> ata2.00: cmd 61/b8:d0:f3:0a:2f/08:00:0e:00:00/40 tag 26 ncq 1142784 out
> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus error)
> ata2.00: status: { DRDY }
> ata2: hard resetting link
> ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> ata2.00: configured for UDMA/100
> ata2: EH complete
>
>
> The test box has a relatively new mobo and such, but I know the HD was
> old. So I thought that the HD was simply failing. I installed a new HD
> and spent lots of time since last Thursday trying to set it up to work
> with my testing scripts. Unfortunately, I installed a newer Fedora that
> no longer supported the older grub1 and I wasted lots of time trying to
> get grub2 to do what I wanted. I finally gave up and used
> syslinux/extlinux and got it working again. Unfortunately, I still got
> these ata2 errors! I started thinking that the mobo may be bad.
>
> But then I decided to try an older kernel, and the errors never showed
> up. I booted back and forth several times and the errors were very
> reliable. I have multiple OSes on this box so every time I got an
> error, I would boot into one of the other OSes and do fsck on the
> filesystems. Because the longer I ran my tests with this bug, it would
> eventually start corrupting the ext4 filesystem.
>
> Since it seemed very reliable, I started my bisect. It came down to this
> patch:
>
> From 578270bfbd2803dc7b0b03fbc2ac119efbc73195 Mon Sep 17 00:00:00 2001
> From: Ming Lei <[email protected]>
> Date: Tue, 24 Nov 2015 10:35:29 +0800
> Subject: [PATCH] block: fix segment split
>
>
> I thought this strange, because I don't see anything wrong with this
> patch. But if I removed it, the problem went away, and when I added it
> back, the problem would show up easily.
>
> I checkout v4.4-rc6 and tested again, thinking something else may be
> wrong and has since been fixed. Nope, the error still showed up. I then
> removed this commit and tried again. Sure enough, the problem went away!

Probably the other way around, I think, it uncovered an issue with the
segment counting for certain cases.

> My guess is that there's another bug lurking around somewhere, and the
> bug that this patch fixed hid the problem. Now that this patch fixed a
> bug that would hide the issue, the issue is showing up.
>
> I'll pass this along to the block experts and see what you can think of
> it. I attached my config, and the test was a script that stress
> trace-cmd filters.
>
> Oh, and I ran this on my i386 kernel and OS. I haven't tried testing
> much on x86_64 as my tests start with i386. It originally had issues in
> x86_64 but that may be because the i386 test corrupted the filesystem
> which is shared.
>
> There may be a 32bit vs 64bit issue somewhere?

I'm guessing it's the same issue that was recently diagnosed, which
would make sense if you hit this on 32-bit with highmem. Patch is
pending, if you feel inclined, it'd be great if you could add this patch
and retry:

http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=23688bf4f830a89866fd0ed3501e342a7360fe4f

--
Jens Axboe

2015-12-23 00:17:28

by Linus Torvalds

[permalink] [raw]

Subject: Re: [BUG] File system corruption with 4.4-rc3 and beyond

On Tue, Dec 22, 2015 at 4:09 PM, Steven Rostedt <[email protected]> wrote:
>
> Since it seemed very reliable, I started my bisect. It came down to this
> patch:
>
> From 578270bfbd2803dc7b0b03fbc2ac119efbc73195 Mon Sep 17 00:00:00 2001
> From: Ming Lei <[email protected]>
> Date: Tue, 24 Nov 2015 10:35:29 +0800
> Subject: [PATCH] block: fix segment split

We *just* fixed this. It was pretty subtle, and the debugging went on
for several days. But the fixed got applied to the block tree earlier
today, and I just pulled it only 15 minutes ago and pushed it out just
now.

So check current -git, it should work for you.

(And if you care about the particular fix: it's commit 23688bf4f830:
"block: ensure to split after potentially bouncing a bio").

And yes, the way to trigger it seems to be to run a 32-bit kernel and have PAE.

Linus

2015-12-23 01:09:35

by Steven Rostedt

[permalink] [raw]

Subject: Re: [BUG] File system corruption with 4.4-rc3 and beyond

On Tue, 22 Dec 2015 17:16:59 -0700
Jens Axboe <[email protected]> wrote:

> I'm guessing it's the same issue that was recently diagnosed, which
> would make sense if you hit this on 32-bit with highmem. Patch is
> pending, if you feel inclined, it'd be great if you could add this patch
> and retry:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=23688bf4f830a89866fd0ed3501e342a7360fe4f
>

Linus says this is in his tree already. I just did a fresh pull, and
verified that it is. I'm kicking of my test now.

-- Steve

2015-12-23 01:14:32

by Ming Lei

[permalink] [raw]

Subject: Re: [BUG] File system corruption with 4.4-rc3 and beyond

On Wed, Dec 23, 2015 at 8:16 AM, Jens Axboe <[email protected]> wrote:
> On 12/22/2015 05:09 PM, Steven Rostedt wrote:
>>
>> OK, I started with 4.4-rc4 to add some urgent ftrace patches and
>> started testing. My tests started to fail, and then I noticed they
>> failed with v4.4-rc4 as well. I got strange errors. Finally, I noticed
>> that I was constantly getting messages like this:
>>
>> ata2.00: exception Emask 0x60 SAct 0x7800000 SErr 0x800 action 0x6 frozen
>> ata2.00: irq_stat 0x20000000, host bus error
>> ata2: SError: { HostInt }
>> ata2.00: failed command: WRITE FPDMA QUEUED
>> ata2.00: cmd 61/00:b8:f3:f2:2e/08:00:0e:00:00/40 tag 23 ncq 1048576 out
>> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus
>> error)
>> ata2.00: status: { DRDY }
>> ata2.00: failed command: WRITE FPDMA QUEUED
>> ata2.00: cmd 61/00:c0:f3:fa:2e/08:00:0e:00:00/40 tag 24 ncq 1048576 out
>> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus
>> error)
>> ata2.00: status: { DRDY }
>> ata2.00: failed command: WRITE FPDMA QUEUED
>> ata2.00: cmd 61/00:c8:f3:02:2f/08:00:0e:00:00/40 tag 25 ncq 1048576 out
>> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus
>> error)
>> ata2.00: status: { DRDY }
>> ata2.00: failed command: WRITE FPDMA QUEUED
>> ata2.00: cmd 61/b8:d0:f3:0a:2f/08:00:0e:00:00/40 tag 26 ncq 1142784 out
>> res 40/00:d4:f3:0a:2f/00:00:0e:00:00/40 Emask 0x60 (host bus
>> error)
>> ata2.00: status: { DRDY }
>> ata2: hard resetting link
>> ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
>> ata2.00: configured for UDMA/100
>> ata2: EH complete
>>
>>
>> The test box has a relatively new mobo and such, but I know the HD was
>> old. So I thought that the HD was simply failing. I installed a new HD
>> and spent lots of time since last Thursday trying to set it up to work
>> with my testing scripts. Unfortunately, I installed a newer Fedora that
>> no longer supported the older grub1 and I wasted lots of time trying to
>> get grub2 to do what I wanted. I finally gave up and used
>> syslinux/extlinux and got it working again. Unfortunately, I still got
>> these ata2 errors! I started thinking that the mobo may be bad.
>>
>> But then I decided to try an older kernel, and the errors never showed
>> up. I booted back and forth several times and the errors were very
>> reliable. I have multiple OSes on this box so every time I got an
>> error, I would boot into one of the other OSes and do fsck on the
>> filesystems. Because the longer I ran my tests with this bug, it would
>> eventually start corrupting the ext4 filesystem.
>>
>> Since it seemed very reliable, I started my bisect. It came down to this
>> patch:
>>
>> From 578270bfbd2803dc7b0b03fbc2ac119efbc73195 Mon Sep 17 00:00:00 2001
>> From: Ming Lei <[email protected]>
>> Date: Tue, 24 Nov 2015 10:35:29 +0800
>> Subject: [PATCH] block: fix segment split
>>
>>
>> I thought this strange, because I don't see anything wrong with this
>> patch. But if I removed it, the problem went away, and when I added it
>> back, the problem would show up easily.
>>
>> I checkout v4.4-rc6 and tested again, thinking something else may be
>> wrong and has since been fixed. Nope, the error still showed up. I then
>> removed this commit and tried again. Sure enough, the problem went away!
>
>
> Probably the other way around, I think, it uncovered an issue with the
> segment counting for certain cases.

Diethard said the same case can be fixed by the patch 'block:
ensure to split after potentially bouncing a bio', so please just test it.

Also looks it is helpful to add a warning for the splitted bio in
bio_for_each_segment_all().

>
>> My guess is that there's another bug lurking around somewhere, and the
>> bug that this patch fixed hid the problem. Now that this patch fixed a
>> bug that would hide the issue, the issue is showing up.
>>
>> I'll pass this along to the block experts and see what you can think of
>> it. I attached my config, and the test was a script that stress
>> trace-cmd filters.
>>
>> Oh, and I ran this on my i386 kernel and OS. I haven't tried testing
>> much on x86_64 as my tests start with i386. It originally had issues in
>> x86_64 but that may be because the i386 test corrupted the filesystem
>> which is shared.
>>
>> There may be a 32bit vs 64bit issue somewhere?
>
>
> I'm guessing it's the same issue that was recently diagnosed, which would
> make sense if you hit this on 32-bit with highmem. Patch is pending, if you
> feel inclined, it'd be great if you could add this patch and retry:
>
> http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=23688bf4f830a89866fd0ed3501e342a7360fe4f
>
> --
> Jens Axboe
>

2015-12-23 01:20:06

by Steven Rostedt

[permalink] [raw]

Subject: Re: [BUG] File system corruption with 4.4-rc3 and beyond

On Tue, 22 Dec 2015 20:09:29 -0500
Steven Rostedt <[email protected]> wrote:

> On Tue, 22 Dec 2015 17:16:59 -0700
> Jens Axboe <[email protected]> wrote:
>
> > I'm guessing it's the same issue that was recently diagnosed, which
> > would make sense if you hit this on 32-bit with highmem. Patch is
> > pending, if you feel inclined, it'd be great if you could add this patch
> > and retry:
> >
> > http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=23688bf4f830a89866fd0ed3501e342a7360fe4f
> >
>
> Linus says this is in his tree already. I just did a fresh pull, and
> verified that it is. I'm kicking of my test now.
>

OK, it got passed the point where it usually starts spitting out the
errors.

I'm going to add this patch to my "fixes" patches that get added to my
test kernels to fix issues that cause my tests to fail that were bugs
outside my code ;-)

This test takes over 8 hours (although the i386 portion is less than an
hour). If it finishes with no problems other than my own, I'll let you
know.

Thanks!

-- Steve

2015-12-23 01:23:26

by Steven Rostedt

[permalink] [raw]

Subject: Re: [BUG] File system corruption with 4.4-rc3 and beyond

On Tue, 22 Dec 2015 16:17:25 -0800
Linus Torvalds <[email protected]> wrote:

> We *just* fixed this. It was pretty subtle, and the debugging went on
> for several days. But the fixed got applied to the block tree earlier
> today, and I just pulled it only 15 minutes ago and pushed it out just
> now.

Just my luck. I've spent probably 5 days or more on this thinking it
was a hardware issue. :-/

>
> So check current -git, it should work for you.
>
> (And if you care about the particular fix: it's commit 23688bf4f830:
> "block: ensure to split after potentially bouncing a bio").

Thanks, I do care. I'll be backporting this commit to the kernel I test
for my own code.

>
> And yes, the way to trigger it seems to be to run a 32-bit kernel and have PAE.

Yep, that's exactly the environment that I saw this in.

-- Steve

2015-12-23 02:44:52

by Steven Rostedt

[permalink] [raw]

Subject: Re: [BUG] File system corruption with 4.4-rc3 and beyond

On Tue, 22 Dec 2015 20:20:00 -0500
Steven Rostedt <[email protected]> wrote:

> This test takes over 8 hours (although the i386 portion is less than an
> hour). If it finishes with no problems other than my own, I'll let you
> know.

The i386 portion just finished, and it showed no indication of any
errors. So it looks to me that this fixed the issue I was having.
Although its too late...

Tested-by: Steven Rostedt <[email protected]>

-- Steve