2015-05-20 20:19:06

by Holger Kiehl

Subject: Filesystem corruption MD (imsm) Raid0 via 2 SSD's + discard

Hello,

I had a terrible weekend recovering my home system. Whenever files
were deleted, some data got corrupted. At first I did not notice it,
but when I rebooted, the system would not come up again: systemd crashed
with SIGSEGV and that was it. Booting from a USB stick, I saw that
some glibc lib had a different size from that in the original RPM. So
all I did was reinstall that lib from the USB stick, and everything was
fine after rebooting from Raid 0. But I then wanted to make sure that
no other files were corrupted, so I checked and found more. So again I
reinstalled those RPMs and rebooted. To my big surprise the system was
again broken and failed to boot. I again tried to recover my system
from the USB stick, but this time did not manage to recover it. So I
decided to reinstall the system completely from DVD. Everything looked good
until the moment I activated the discard option in /etc/fstab.
After doing some more work (adding and removing things) I rebooted and
again the system failed to boot. Booting from the USB stick, I saw that
/etc/fstab was completely filled with NUL bytes. This gave me the clue that
there must be some problem with discard (trim). My system is using
a software raid 0 IMSM (Intel 'fake' raid) on two Samsung SSD 840 Pro.
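
A quick way to look for other files that ended up like /etc/fstab is to
scan a tree for files made up entirely of NUL bytes. What follows is only
a minimal Python sketch and was not used in this thread: the function names
is_all_nul and find_nul_files are made up for illustration, and a file that
legitimately contains nothing but zeros would be reported too.

```python
import os


def is_all_nul(path, chunk_size=1 << 20):
    """Return True if the file is non-empty and every byte in it is 0x00."""
    if os.path.getsize(path) == 0:
        return False  # an empty file is not "filled with NULs"
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                return True  # reached EOF without seeing a non-NUL byte
            if chunk.count(0) != len(chunk):
                return False


def find_nul_files(root):
    """Walk `root` and yield regular files that contain only NUL bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks, sockets, device nodes and similar.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                if is_all_nul(path):
                    yield path
            except OSError:
                continue  # unreadable file: ignore it in this sketch
```

Calling list(find_nul_files("/etc")) from a rescue system, with the path
adjusted to wherever the damaged filesystem is mounted, gives a first list
of suspects. Files that were only partially overwritten will not show up,
so a comparison against the RPM database is still needed.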

A Windows system on the same disks (that is why I am using IMSM raid)
was not affected by this problem. I have checked the RAM with memtest86
and everything is ok. The kernel I was running when I discovered the
problem was 4.0.2 from kernel.org. However, after reinstalling from DVD
I updated to Fedora's latest kernel, which was 3.19.? (I do not remember
the last numbers). So that kernel also seems affected, but I assume it
contains many 'fixes' from 4.0.x. As filesystem I use ext4, distribution
is Fedora 21 and hardware is: Xeon E3-1275, 16GB ECC RAM.

My system seems to have been running stable for some days now with kernel.org
kernel 4.0.3 and with discard DISABLED. But I am still unsure what could
be the real cause.
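
To confirm that discard really is off, the options column of /proc/mounts
can be inspected. The following is a minimal Python sketch under some
assumptions: the function name mounts_with_discard is hypothetical, and it
only looks for the plain "discard" option in the way ext4 reports it.

```python
def mounts_with_discard(mounts_text):
    """Return (device, mountpoint, fstype) for entries using 'discard'.

    `mounts_text` is text in the /proc/mounts layout:
    device, mountpoint, fstype, comma-separated options, ...
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip blank lines, short lines and fstab-style comments.
        if len(fields) < 4 or fields[0].startswith("#"):
            continue
        device, mountpoint, fstype, options = fields[:4]
        # Exact match, so that e.g. "nodiscard" is not counted.
        if "discard" in options.split(","):
            hits.append((device, mountpoint, fstype))
    return hits
```

On a running system the input would be open("/proc/mounts").read(). Since
/etc/fstab uses the same layout for its first four fields, the same
function can be pointed at that file to see what will be mounted with
discard after the next reboot.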

Regards,
Holger


2015-05-20 20:41:48

by Roman Mamedov

Subject: Re: Filesystem corruption MD (imsm) Raid0 via 2 SSD's + discard

On Wed, 20 May 2015 20:12:31 +0000 (UTC)
Holger Kiehl wrote:

> The kernel I was running when I discovered the
> problem was 4.0.2 from kernel.org. However, after reinstalling from DVD
> I updated to Fedora's latest kernel, which was 3.19.? (I do not remember
> the last numbers). So that kernel also seems affected, but I assume it
> contains many 'fixes' from 4.0.x. As filesystem I use ext4, distribution
> is Fedora 21 and hardware is: Xeon E3-1275, 16GB ECC RAM.
>
> My system seems to have been running stable for some days now with kernel.org
> kernel 4.0.3 and with discard DISABLED. But I am still unsure what could
> be the real cause.

It is a bug in the 4.0.2 kernel, fixed in 4.0.3.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785672
https://bbs.archlinux.org/viewtopic.php?id=197400
https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/d2dc317d564a46dfc683978a2e5a4f91434e9711


--
With respect,
Roman



2015-05-20 23:10:23

by NeilBrown

Subject: Re: Filesystem corruption MD (imsm) Raid0 via 2 SSD's + discard

On Thu, 21 May 2015 01:32:13 +0500 Roman Mamedov wrote:

> It is a bug in the 4.0.2 kernel, fixed in 4.0.3.
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785672
> https://bbs.archlinux.org/viewtopic.php?id=197400
> https://kernel.googlesource.com/pub/scm/linux/kernel/git/stable/linux-stable/+/d2dc317d564a46dfc683978a2e5a4f91434e9711
>
>

I suspect that is a different bug.
I think this one is
https://bugzilla.kernel.org/show_bug.cgi?id=98501

NeilBrown



2015-05-21 06:44:34

by Holger Kiehl

Subject: Re: Filesystem corruption MD (imsm) Raid0 via 2 SSD's + discard

On Thu, 21 May 2015, NeilBrown wrote:

> I suspect that is a different bug.
> I think this one is
> https://bugzilla.kernel.org/show_bug.cgi?id=98501
>
Should there not be a big fat warning going around telling users to disable
discard on Raid 0 until this is fixed? This breaks the filesystem completely
and I believe there is absolutely no way one can get the data back.

Is this fixed in 4.0.4? And which kernels are affected? There could be many
people running systems who have not noticed this and don't know what a
dangerous situation they are in when they delete data.
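
For anyone who wants to check whether a machine is exposed, one rough
approach is to combine /proc/mdstat with /proc/mounts and list filesystems
that sit directly on an active raid0 array and are mounted with discard.
This is a minimal Python sketch under stated assumptions, not an official
tool: the function names are hypothetical, and filesystems stacked on LVM
or dm-crypt on top of the array are not detected.

```python
import re


def raid0_arrays(mdstat_text):
    """Return names of active raid0 arrays found in /proc/mdstat text."""
    arrays = []
    for line in mdstat_text.splitlines():
        # Array lines look like: "md126 : active raid0 sda[1] sdb[0]"
        name, sep, rest = line.partition(" : ")
        if not sep or not name.startswith("md"):
            continue
        tokens = rest.split()
        if tokens and tokens[0] == "active" and "raid0" in tokens:
            arrays.append(name)
    return arrays


def discard_mounts_on_raid0(mdstat_text, mounts_text):
    """Return (device, mountpoint) pairs mounted with discard on raid0."""
    arrays = raid0_arrays(mdstat_text)
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mountpoint, _fstype, options = fields[:4]
        if "discard" not in options.split(","):
            continue
        base = device.rsplit("/", 1)[-1]
        # Match the array itself (md126) or one of its partitions (md126p3).
        for md in arrays:
            if re.fullmatch(re.escape(md) + r"(p\d+)?", base):
                hits.append((device, mountpoint))
                break
    return hits
```

An empty result does not prove the system is safe. It only means that no
directly mounted filesystem on a raid0 array currently has discard set;
periodic trimming via fstrim is not covered by this check at all.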

Regards,
Holger

2015-05-21 07:14:40

by NeilBrown

Subject: Re: Filesystem corruption MD (imsm) Raid0 via 2 SSD's + discard

On Thu, 21 May 2015 06:44:27 +0000 (UTC) Holger Kiehl wrote:

> Should there not be a big fat warning going around telling users to disable
> discard on Raid 0 until this is fixed? This breaks the filesystem completely
> and I believe there is absolutely no way one can get the data back.

Probably. Would you like to do that?

>
> Is this fixed in 4.0.4? And which kernels are affected? There could be many
> people running systems who have not noticed this and don't know what a
> dangerous situation they are in when they delete data.

The patch was only added to my tree today. I will send to Linus tomorrow so
it should appear in the next -rc.
Any -stable kernel released since mid-April probably has the bug. It was
caused by
commit 47d68979cc968535cb87f3e5f2e6a3533ea48fbd

Once the fix gets into Linus' tree, it should get into subsequent -stable releases.

The fix is here:

http://git.neil.brown.name/?p=md.git;a=commitdiff;h=a81157768a00e8cf8a7b43b5ea5cac931262374f

commit id should remain unchanged.

Thanks,
NeilBrown



2015-05-22 18:17:41

by Holger Kiehl

Subject: Re: Filesystem corruption MD (imsm) Raid0 via 2 SSD's + discard



On Thu, 21 May 2015, NeilBrown wrote:

> The patch was only added to my tree today. I will send to Linus tomorrow so
> it should appear in the next -rc.
> Any -stable kernel released since mid-April probably has the bug. It was
> caused by
> commit 47d68979cc968535cb87f3e5f2e6a3533ea48fbd
>
> Once the fix gets into Linus' tree, it should get into subsequent -stable releases.
>
> The fix is here:
>
> http://git.neil.brown.name/?p=md.git;a=commitdiff;h=a81157768a00e8cf8a7b43b5ea5cac931262374f
>
> commit id should remain unchanged.
>
I would like to confirm that with this patch and discard enabled, I no longer
see any corruption.

Many thanks for the quick fix!

Regards,
Holger