2023-12-05 17:55:32

by Guenter Roeck

[permalink] [raw]
Subject: Re: ext4 data corruption in 6.1 stable tree (was Re: [PATCH 5.15 000/297] 5.15.140-rc1 review)

On Tue, Dec 05, 2023 at 01:21:22PM +0100, Jan Kara wrote:
[ ... ]
>
> So I've got back to this and the failure is a subtle interaction between
> iomap code and ext4 code. In particular that fact that commit 936e114a245b6
> ("iomap: update ki_pos a little later in iomap_dio_complete") is not in
> stable causes that file position is not updated after direct IO write and
> thus we direct IO writes are ending in wrong locations effectively
> corrupting data. The subtle detail is that before this commit if ->end_io
> handler returns non-zero value (which the new ext4 ->end_io handler does),
> file pos doesn't get updated, after this commit it doesn't get updated only
> if the return value is < 0.
>
> The commit got merged in 6.5-rc1 so all stable kernels that have
> 91562895f803 ("ext4: properly sync file size update after O_SYNC direct
> IO") before 6.5 are corrupting data - I've noticed at least 6.1 is still
> carrying the problematic commit. Greg, please take out the commit from all
> stable kernels before 6.5 as soon as possible, we'll figure out proper
> backport once user data are not being corrupted anymore. Thanks!
>

Thanks a lot for the update.

Turns out this is causing a regression in chromeos-6.1, and reverting the
offending patch fixes the problem. I suspect anyone running v6.1.64+ may
have a problem.

Guenter


2023-12-05 17:57:30

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: ext4 data corruption in 6.1 stable tree (was Re: [PATCH 5.15 000/297] 5.15.140-rc1 review)

On Tue, Dec 05, 2023 at 09:55:08AM -0800, Guenter Roeck wrote:
> On Tue, Dec 05, 2023 at 01:21:22PM +0100, Jan Kara wrote:
> [ ... ]
> >
> > So I've got back to this and the failure is a subtle interaction between
> > iomap code and ext4 code. In particular that fact that commit 936e114a245b6
> > ("iomap: update ki_pos a little later in iomap_dio_complete") is not in
> > stable causes that file position is not updated after direct IO write and
> > thus we direct IO writes are ending in wrong locations effectively
> > corrupting data. The subtle detail is that before this commit if ->end_io
> > handler returns non-zero value (which the new ext4 ->end_io handler does),
> > file pos doesn't get updated, after this commit it doesn't get updated only
> > if the return value is < 0.
> >
> > The commit got merged in 6.5-rc1 so all stable kernels that have
> > 91562895f803 ("ext4: properly sync file size update after O_SYNC direct
> > IO") before 6.5 are corrupting data - I've noticed at least 6.1 is still
> > carrying the problematic commit. Greg, please take out the commit from all
> > stable kernels before 6.5 as soon as possible, we'll figure out proper
> > backport once user data are not being corrupted anymore. Thanks!
> >
>
> Thanks a lot for the update.
>
> Turns out this is causing a regression in chromeos-6.1, and reverting the
> offending patch fixes the problem. I suspect anyone running v6.1.64+ may
> have a problem.

Jan, thanks for the report, and Guenter, thanks for letting me know as
well. I'll go queue up the fix now and push out new -rc releases.

greg k-h

2023-12-11 08:28:54

by Pavel Machek

[permalink] [raw]
Subject: Re: ext4 data corruption in 6.1 stable tree (was Re: [PATCH 5.15 000/297] 5.15.140-rc1 review)

Hi!

> > > So I've got back to this and the failure is a subtle interaction between
> > > iomap code and ext4 code. In particular that fact that commit 936e114a245b6
> > > ("iomap: update ki_pos a little later in iomap_dio_complete") is not in
> > > stable causes that file position is not updated after direct IO write and
> > > thus we direct IO writes are ending in wrong locations effectively
> > > corrupting data. The subtle detail is that before this commit if ->end_io
> > > handler returns non-zero value (which the new ext4 ->end_io handler does),
> > > file pos doesn't get updated, after this commit it doesn't get updated only
> > > if the return value is < 0.
> > >
> > > The commit got merged in 6.5-rc1 so all stable kernels that have
> > > 91562895f803 ("ext4: properly sync file size update after O_SYNC direct
> > > IO") before 6.5 are corrupting data - I've noticed at least 6.1 is still
> > > carrying the problematic commit. Greg, please take out the commit from all
> > > stable kernels before 6.5 as soon as possible, we'll figure out proper
> > > backport once user data are not being corrupted anymore. Thanks!
> > >
> >
> > Thanks a lot for the update.
> >
> > Turns out this is causing a regression in chromeos-6.1, and reverting the
> > offending patch fixes the problem. I suspect anyone running v6.1.64+ may
> > have a problem.
>
> Jan, thanks for the report, and Guenter, thanks for letting me know as
> well. I'll go queue up the fix now and push out new -rc releases.

Would someone have a brief summary here? I see 6.1.66 is out but I
don't see any "Fixes: 91562895f803" tags.

Plus, what is the severity of this? It is "data being corrupted when
using O_SYNC|O_DIRECT" or does metadata somehow get corrupted, too?

Thanks and best regards,
Pavel
--
DENX Software Engineering GmbH, Managing Director: Erika Unter
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany


Attachments:
(No filename) (1.93 kB)
signature.asc (201.00 B)
Download all attachments

2023-12-11 11:58:52

by Jan Kara

[permalink] [raw]
Subject: Re: ext4 data corruption in 6.1 stable tree (was Re: [PATCH 5.15 000/297] 5.15.140-rc1 review)

On Mon 11-12-23 09:28:40, Pavel Machek wrote:
> > > > So I've got back to this and the failure is a subtle interaction between
> > > > iomap code and ext4 code. In particular that fact that commit 936e114a245b6
> > > > ("iomap: update ki_pos a little later in iomap_dio_complete") is not in
> > > > stable causes that file position is not updated after direct IO write and
> > > > thus we direct IO writes are ending in wrong locations effectively
> > > > corrupting data. The subtle detail is that before this commit if ->end_io
> > > > handler returns non-zero value (which the new ext4 ->end_io handler does),
> > > > file pos doesn't get updated, after this commit it doesn't get updated only
> > > > if the return value is < 0.
> > > >
> > > > The commit got merged in 6.5-rc1 so all stable kernels that have
> > > > 91562895f803 ("ext4: properly sync file size update after O_SYNC direct
> > > > IO") before 6.5 are corrupting data - I've noticed at least 6.1 is still
> > > > carrying the problematic commit. Greg, please take out the commit from all
> > > > stable kernels before 6.5 as soon as possible, we'll figure out proper
> > > > backport once user data are not being corrupted anymore. Thanks!
> > > >
> > >
> > > Thanks a lot for the update.
> > >
> > > Turns out this is causing a regression in chromeos-6.1, and reverting the
> > > offending patch fixes the problem. I suspect anyone running v6.1.64+ may
> > > have a problem.
> >
> > Jan, thanks for the report, and Guenter, thanks for letting me know as
> > well. I'll go queue up the fix now and push out new -rc releases.
>
> Would someone have a brief summary here? I see 6.1.66 is out but I
> don't see any "Fixes: 91562895f803" tags.
>
> Plus, what is the severity of this? It is "data being corrupted when
> using O_SYNC|O_DIRECT" or does metadata somehow get corrupted, too?

It is pure data corruption happening for ext4 direct IO writes because they
do not properly update current file position after the write.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR