2011-02-04 22:40:50

by Matt

Subject: Re: ext4: Fix data corruption with multi-block writepages support

>Thanks, added to the ext4 patch queue.

>I modified the commit description slightly to give credit to Jon
>Nelson, who reported the bug and really helped by devising a
>reproduceable test case. Many thanks, Jon!!

> - Ted

So that means that the file-corruption which existed until 2.6.37-rc6
and got triggered (for me) more easily via "dm crypt: scale to
multiple CPUs"
is fixed now ?

That should give ext4 a nice speedup for >=2.6.38 :)

Could you also please add the following?

Reported-by: Matthias Bayer <jackdachef <at> gmail <dot> com >

I mainly found it through testing with the mentioned dm-crypt scaling
patch and >=2.6.36-git*

Thanks & Regards !

Matt


2011-02-07 18:13:04

by Theodore Ts'o

Subject: Re: ext4: Fix data corruption with multi-block writepages support

On Fri, Feb 04, 2011 at 10:40:47PM +0000, Matt wrote:
>
> So that means that the file-corruption which existed until 2.6.37-rc6
> and got triggered (for me) more easily via "dm crypt: scale to
> multiple CPUs"
> is fixed now ?

Well, a patch exists for it that will be merged into 2.6.38.

> That should give ext4 a nice speedup for >=2.6.38 :)

I'm not going to make it be the default for 2.6.38, since it's fairly
late in the -rc features. People who want it can explicitly enable it
using the mount option mblk_io_submit, though. (And let me know your
success stories! :-) I will be enabling it as the default in
2.6.39-rc1.

> Reported-by: Matthias Bayer <jackdachef <at> gmail <dot> com >

Sure!

- Ted

2011-02-07 18:29:33

by Milan Broz

Subject: Re: ext4: Fix data corruption with multi-block writepages support

On 02/07/2011 06:45 PM, Ted Ts'o wrote:
> On Fri, Feb 04, 2011 at 10:40:47PM +0000, Matt wrote:
>>
>> So that means that the file-corruption which existed until 2.6.37-rc6
>> and got triggered (for me) more easily via "dm crypt: scale to
>> multiple CPUs"
>> is fixed now ?
>
> Well, a patch exists for it that will be merged into 2.6.38.
>
>> That should give ext4 a nice speedup for >=2.6.38 :)
>
> I'm not going to make it be the default for 2.6.38, since it's fairly
> late in the -rc features. People who want it can explicitly enable it
> using the mount option mblk_io_submit, though. (And let me know your
> success stories! :-) I will be enabling it as the default in
> 2.6.39-rc1.

So it was ext4 only bug in ext4_end_bio(),
dm-crypt per-cpu code was just trigger here, right?

Milan

2011-02-07 18:44:17

by Matt

Subject: Re: ext4: Fix data corruption with multi-block writepages support

On Mon, Feb 7, 2011 at 6:29 PM, Milan Broz wrote:
> On 02/07/2011 06:45 PM, Ted Ts'o wrote:
>> On Fri, Feb 04, 2011 at 10:40:47PM +0000, Matt wrote:
>>>
>>> So that means that the file-corruption which existed until 2.6.37-rc6
>>> and got triggered (for me) more easily via "dm crypt: scale to
>>> multiple CPUs"
>>> is fixed now ?
>>
>> Well, a patch exists for it that will be merged into 2.6.38.
>>
>>> That should give ext4 a nice speedup for >=2.6.38 :)
>>
>> I'm not going to make it be the default for 2.6.38, since it's fairly
>> late in the -rc features. People who want it can explicitly enable it
>> using the mount option mblk_io_submit, though. (And let me know your
>> success stories! :-) I will be enabling it as the default in
>> 2.6.39-rc1.
>
> So it was ext4 only bug in ext4_end_bio(),
> dm-crypt per-cpu code was just trigger here, right?
>
> Milan
>

Hi Milan,

Well, that was at least my experience:

ext4: after Ted had disabled support for multiple page-io submission,
I observed no more data corruption (it had only appeared on the
system partition; on /home, where ext4 is also used, and on my backup
partitions there was no problem as far as I can tell)

XFS: no corruption observed

reiserfs: I can't say for sure since I'm only using it on my /boot partition :P

for other filesystems I can't say anything - I didn't use additional
ones at that time

Regards

Matt

2011-02-07 18:56:32

by Matt

Subject: Re: ext4: Fix data corruption with multi-block writepages support

On Mon, Feb 7, 2011 at 5:45 PM, Ted Ts'o wrote:
> On Fri, Feb 04, 2011 at 10:40:47PM +0000, Matt wrote:
>>
>> So that means that the file-corruption which existed until 2.6.37-rc6
>> and got triggered (for me) more easily via "dm crypt: scale to
>> multiple CPUs"
>> is fixed now ?
>
> Well, a patch exists for it that will be merged into 2.6.38.
>
>> That should give ext4 a nice speedup for >=2.6.38 :)
>
> I'm not going to make it be the default for 2.6.38, since it's fairly
> late in the -rc features. People who want it can explicitly enable it
> using the mount option mblk_io_submit, though. (And let me know your
> success stories! :-) I will be enabling it as the default in
> 2.6.39-rc1.
>

Hi Ted,

I guess it should be safe to enable it with 2.6.37, the dm-crypt
multi-cpu patch and the following patch?

"ext4: Fix data corruption with multi-block writepages support"
(of course that's the minimum - it would be better to pull in the ext4
changes for 2.6.38)


For a short time I had it activated (via an additional mblk_io_submit
mount option) on my portage-partition (where the portage-ball of my
Gentoo system is).
I was curious to see what messages I would get and wondered why there
was nothing about mballoc mentioned

If I recall correctly there were always messages in the past, like:

EXT4-fs: delayed allocation enabled
EXT4-fs: file extents enabled
EXT4-fs: mballoc enabled

these are from 2.6.28 -

I'm only getting:

EXT4-fs (dm-3): mounted filesystem with ordered data mode.

or

EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: commit=60,barrier=1

(I like to set the barriers / flushes explicitly).

Sorry if I didn't follow development, but have these messages been
silenced more and more over time?


Thanks !


>> Reported-by: Matthias Bayer <jackdachef <at> gmail <dot> com >
>
> Sure!

Thanks !


>
> - Ted
>

Regards

Matt

2011-02-07 20:44:26

by Theodore Ts'o

Subject: Re: ext4: Fix data corruption with multi-block writepages support

On Mon, Feb 07, 2011 at 07:29:26PM +0100, Milan Broz wrote:
>
> So it was ext4 only bug in ext4_end_bio(),
> dm-crypt per-cpu code was just trigger here, right?

There appeared to be two bugs that people were discussing on that
particular dm_crypt mail thread. Some people were complaining about
issues with dm_crypt even when ext4 was not involved.

So I think it's fair to say that there was definitely _a_ ext4 bug
which was most easily seen when dm_crypt was in play, but which was
definitely not dm_crypt specific (it was possible to see it on an
hdd-only system, but the workload was much more severe). In any case,
as soon as the problem was found, we disabled the ext4 optimization
in 2.6.37-rc5.

So the fact that we found and fixed an ext4 bug that was triggered by
dm_crypt should not be taken as a statement (one way or the other)
that dm_crypt is Bug-Free(tm). :-)

- Ted

2011-02-07 20:51:43

by Milan Broz

Subject: Re: ext4: Fix data corruption with multi-block writepages support

On 02/07/2011 09:44 PM, Ted Ts'o wrote:
> So the fact that we found and fixed an ext4 bug that was triggered by
> dm_crypt should not be taken as a statement (one way or the other)
> that dm_crypt is Bug-Free(tm). :-)

Really? Sigh. ;-)

(There is a rule that if dm-crypt+XFS bug appears, the problem
is always in dm-crypt. So I am quite surprised that this time there
was NO bug in dm-crypt... yet :-)

Anyway, I would like to know if some problem still remains...

Thanks,
Milan