2023-11-22 18:14:56

by Jan Kara

Subject: [PATCH] ext4: Fix warning in ext4_dio_write_end_io()

The syzbot has reported that it can hit the warning in
ext4_dio_write_end_io() because i_size < i_disksize. Indeed the
reproducer creates a race between DIO IO completion and truncate
expanding the file and thus ext4_dio_write_end_io() sees an inconsistent
inode state where i_disksize is already updated but i_size is not
updated yet. Since we are careful when setting up DIO write and consider
it extending (and thus performing the IO synchronously with i_rwsem held
exclusively) whenever it goes past either of i_size or i_disksize, we
can use the same test during IO completion without risking entering
ext4_handle_inode_extension() without i_rwsem held. This way we make it
obvious both i_size and i_disksize are large enough when we report DIO
completion without relying on unreliable WARN_ON.

Reported-by: [email protected]
Fixes: 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO")
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/file.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0166bb9ca160..ba497aabdd1e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -386,10 +386,11 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
* blocks. But the code in ext4_iomap_alloc() is careful to use
* zeroed/unwritten extents if this is possible; thus we won't leave
* uninitialized blocks in a file even if we didn't succeed in writing
- * as much as we intended.
+ * as much as we intended. Also we can race with truncate or write
+ * expanding the file so we have to be a bit careful here.
*/
- WARN_ON_ONCE(i_size_read(inode) < READ_ONCE(EXT4_I(inode)->i_disksize));
- if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize))
+ if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize) &&
+ pos + size <= i_size_read(inode))
return size;
return ext4_handle_inode_extension(inode, pos, size);
}
--
2.35.3
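
For readers following the reasoning in the commit message, here is a minimal userspace sketch of the two size tests it describes. It is not kernel code: struct model_inode and the model_* helpers are hypothetical names introduced only for illustration, and the sketch ignores locking, READ_ONCE() and i_size_read() entirely. It only shows that a non-extending write still takes the fast path at completion when a racing truncate has raised i_disksize but not yet i_size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two sizes ext4 tracks per inode. */
struct model_inode {
	int64_t i_size;     /* in-memory file size */
	int64_t i_disksize; /* size recorded for the on-disk inode */
};

/*
 * Submission-side test as described in the commit message: the write
 * is treated as extending if it goes past either i_size or i_disksize.
 */
static bool model_write_is_extending(const struct model_inode *inode,
				     int64_t pos, int64_t size)
{
	assert(pos >= 0 && size >= 0);
	return pos + size > inode->i_size || pos + size > inode->i_disksize;
}

/*
 * Completion-side test after the patch: report completion directly
 * only if the write lies within both sizes.
 */
static bool model_end_io_fast_path(const struct model_inode *inode,
				   int64_t pos, int64_t size)
{
	assert(pos >= 0 && size >= 0);
	return pos + size <= inode->i_disksize && pos + size <= inode->i_size;
}

/* Condition checked by the WARN_ON_ONCE() that the patch removes. */
static bool model_old_warn_fires(const struct model_inode *inode)
{
	return inode->i_size < inode->i_disksize;
}
```

With i_size = 4096 and i_disksize = 8192 (the transient state seen during the racing truncate), model_old_warn_fires() is true, yet a write of 4096 bytes at offset 0 still passes model_end_io_fast_path(). The two tests are exact complements of each other, which is why a write that was non-extending at submission cannot reach the extension path at completion if the sizes have only grown in the meantime.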



2023-11-23 07:07:21

by Ritesh Harjani (IBM)

Subject: Re: [PATCH] ext4: Fix warning in ext4_dio_write_end_io()

Jan Kara <[email protected]> writes:

> The syzbot has reported that it can hit the warning in
> ext4_dio_write_end_io() because i_size < i_disksize. Indeed the
> reproducer creates a race between DIO IO completion and truncate
> expanding the file and thus ext4_dio_write_end_io() sees an inconsistent
> inode state where i_disksize is already updated but i_size is not
> updated yet. Since we are careful when setting up DIO write and consider
> it extending (and thus performing the IO synchronously with i_rwsem held
> exclusively) whenever it goes past either of i_size or i_disksize, we
> can use the same test during IO completion without risking entering
> ext4_handle_inode_extension() without i_rwsem held. This way we make it
> obvious both i_size and i_disksize are large enough when we report DIO
> completion without relying on unreliable WARN_ON.

Does it make sense to add this in ext4_handle_inode_extension()?
WARN_ON_ONCE(!inode_is_locked(inode));
Ohk, we already have "lockdep_assert_held_write(&inode->i_rwsem)" so
hopefully it can catch via lockdep.


So, IIUC, the WARN happened when we were doing a non-extending
AIO-DIO write which was racing with truncate trying to expand the file
size. Because only then the DIO completion will not have i_rwsem held
which can race with truncate. Truncate since it is expanding the file
size, will not use inode_dio_wait() (since no block allocations).

Is this understanding correct?

>
> Reported-by: [email protected]
> Fixes: 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO")
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/file.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 0166bb9ca160..ba497aabdd1e 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -386,10 +386,11 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> * blocks. But the code in ext4_iomap_alloc() is careful to use
> * zeroed/unwritten extents if this is possible; thus we won't leave
> * uninitialized blocks in a file even if we didn't succeed in writing
> - * as much as we intended.
> + * as much as we intended. Also we can race with truncate or write
> + * expanding the file so we have to be a bit careful here.
> */
> - WARN_ON_ONCE(i_size_read(inode) < READ_ONCE(EXT4_I(inode)->i_disksize));
> - if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize))
> + if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize) &&
> + pos + size <= i_size_read(inode))
> return size;
> return ext4_handle_inode_extension(inode, pos, size);
> }
> --
> 2.35.3

2023-11-23 08:50:05

by Jan Kara

Subject: Re: [PATCH] ext4: Fix warning in ext4_dio_write_end_io()

On Thu 23-11-23 12:37:03, Ritesh Harjani wrote:
> Jan Kara <[email protected]> writes:
>
> > The syzbot has reported that it can hit the warning in
> > ext4_dio_write_end_io() because i_size < i_disksize. Indeed the
> > reproducer creates a race between DIO IO completion and truncate
> > expanding the file and thus ext4_dio_write_end_io() sees an inconsistent
> > inode state where i_disksize is already updated but i_size is not
> > updated yet. Since we are careful when setting up DIO write and consider
> > it extending (and thus performing the IO synchronously with i_rwsem held
> > exclusively) whenever it goes past either of i_size or i_disksize, we
> > can use the same test during IO completion without risking entering
> > ext4_handle_inode_extension() without i_rwsem held. This way we make it
> > obvious both i_size and i_disksize are large enough when we report DIO
> > completion without relying on unreliable WARN_ON.
>
> Does it make sense to add this in ext4_handle_inode_extension()?
> WARN_ON_ONCE(!inode_is_locked(inode));
> Ohk, we already have "lockdep_assert_held_write(&inode->i_rwsem)" so
> hopefully it can catch via lockdep.

Exactly.

> So, IIUC, the WARN happened when we were doing a non-extending
> AIO-DIO write which was racing with truncate trying to expand the file
> size. Because only then the DIO completion will not have i_rwsem held
> which can race with truncate. Truncate since it is expanding the file
> size, will not use inode_dio_wait() (since no block allocations).
>
> Is this understanding correct?

Yes, correct.

Honza

>
> >
> > Reported-by: [email protected]
> > Fixes: 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO")
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/ext4/file.c | 7 ++++---
> > 1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > index 0166bb9ca160..ba497aabdd1e 100644
> > --- a/fs/ext4/file.c
> > +++ b/fs/ext4/file.c
> > @@ -386,10 +386,11 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> > * blocks. But the code in ext4_iomap_alloc() is careful to use
> > * zeroed/unwritten extents if this is possible; thus we won't leave
> > * uninitialized blocks in a file even if we didn't succeed in writing
> > - * as much as we intended.
> > + * as much as we intended. Also we can race with truncate or write
> > + * expanding the file so we have to be a bit careful here.
> > */
> > - WARN_ON_ONCE(i_size_read(inode) < READ_ONCE(EXT4_I(inode)->i_disksize));
> > - if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize))
> > + if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize) &&
> > + pos + size <= i_size_read(inode))
> > return size;
> > return ext4_handle_inode_extension(inode, pos, size);
> > }
> > --
> > 2.35.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-11-23 09:47:48

by Ritesh Harjani (IBM)

Subject: Re: [PATCH] ext4: Fix warning in ext4_dio_write_end_io()

Jan Kara <[email protected]> writes:

> On Thu 23-11-23 12:37:03, Ritesh Harjani wrote:
>> Jan Kara <[email protected]> writes:
>>
>> > The syzbot has reported that it can hit the warning in
>> > ext4_dio_write_end_io() because i_size < i_disksize. Indeed the
>> > reproducer creates a race between DIO IO completion and truncate
>> > expanding the file and thus ext4_dio_write_end_io() sees an inconsistent
>> > inode state where i_disksize is already updated but i_size is not
>> > updated yet. Since we are careful when setting up DIO write and consider
>> > it extending (and thus performing the IO synchronously with i_rwsem held
>> > exclusively) whenever it goes past either of i_size or i_disksize, we
>> > can use the same test during IO completion without risking entering
>> > ext4_handle_inode_extension() without i_rwsem held. This way we make it
>> > obvious both i_size and i_disksize are large enough when we report DIO
>> > completion without relying on unreliable WARN_ON.
>>
>> Does it make sense to add this in ext4_handle_inode_extension()?
>> WARN_ON_ONCE(!inode_is_locked(inode));
>> Ohk, we already have "lockdep_assert_held_write(&inode->i_rwsem)" so
>> hopefully it can catch via lockdep.
>
> Exactly.
>
>> So, IIUC, the WARN happened when we were doing a non-extending
>> AIO-DIO write which was racing with truncate trying to expand the file
>> size. Because only then the DIO completion will not have i_rwsem held
>> which can race with truncate. Truncate since it is expanding the file
>> size, will not use inode_dio_wait() (since no block allocations).
>>
>> Is this understanding correct?
>
> Yes, correct.

Thanks Jan,

Also, ext4_inode_extension_cleanup() can take care of deleting the
inode from the orphan list in case there was a race with a truncate
which extended both i_disksize and inode->i_size, so that the DIO
completion couldn't call ext4_handle_inode_extension(), right?

In that case, does it make sense to update a comment here too?

@@ -350,7 +350,10 @@ static void ext4_inode_extension_cleanup(struct inode *inode, ssize_t count)
}
/*
* If i_disksize got extended due to writeback of delalloc blocks while
- * the DIO was running we could fail to cleanup the orphan list in
+ * the DIO was running, or
+ * If i_disksize and inode->i_size both got extended during truncate
+ * which raced with DIO completion,
+ * In both such cases, we could fail to cleanup the orphan list in
* ext4_handle_inode_extension(). Do it now.
*/
if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {


-ritesh

>
> Honza
>
>>
>> >
>> > Reported-by: [email protected]
>> > Fixes: 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO")
>> > Signed-off-by: Jan Kara <[email protected]>
>> > ---
>> > fs/ext4/file.c | 7 ++++---
>> > 1 file changed, 4 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> > index 0166bb9ca160..ba497aabdd1e 100644
>> > --- a/fs/ext4/file.c
>> > +++ b/fs/ext4/file.c
>> > @@ -386,10 +386,11 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
>> > * blocks. But the code in ext4_iomap_alloc() is careful to use
>> > * zeroed/unwritten extents if this is possible; thus we won't leave
>> > * uninitialized blocks in a file even if we didn't succeed in writing
>> > - * as much as we intended.
>> > + * as much as we intended. Also we can race with truncate or write
>> > + * expanding the file so we have to be a bit careful here.
>> > */
>> > - WARN_ON_ONCE(i_size_read(inode) < READ_ONCE(EXT4_I(inode)->i_disksize));
>> > - if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize))
>> > + if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize) &&
>> > + pos + size <= i_size_read(inode))
>> > return size;
>> > return ext4_handle_inode_extension(inode, pos, size);
>> > }
>> > --
>> > 2.35.3
>>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2023-11-30 09:55:44

by Jan Kara

Subject: Re: [PATCH] ext4: Fix warning in ext4_dio_write_end_io()

On Thu 23-11-23 15:17:28, Ritesh Harjani wrote:
> Jan Kara <[email protected]> writes:
>
> > On Thu 23-11-23 12:37:03, Ritesh Harjani wrote:
> >> Jan Kara <[email protected]> writes:
> >>
> >> > The syzbot has reported that it can hit the warning in
> >> > ext4_dio_write_end_io() because i_size < i_disksize. Indeed the
> >> > reproducer creates a race between DIO IO completion and truncate
> >> > expanding the file and thus ext4_dio_write_end_io() sees an inconsistent
> >> > inode state where i_disksize is already updated but i_size is not
> >> > updated yet. Since we are careful when setting up DIO write and consider
> >> > it extending (and thus performing the IO synchronously with i_rwsem held
> >> > exclusively) whenever it goes past either of i_size or i_disksize, we
> >> > can use the same test during IO completion without risking entering
> >> > ext4_handle_inode_extension() without i_rwsem held. This way we make it
> >> > obvious both i_size and i_disksize are large enough when we report DIO
> >> > completion without relying on unreliable WARN_ON.
> >>
> >> Does it make sense to add this in ext4_handle_inode_extension()?
> >> WARN_ON_ONCE(!inode_is_locked(inode));
> >> Ohk, we already have "lockdep_assert_held_write(&inode->i_rwsem)" so
> >> hopefully it can catch via lockdep.
> >
> > Exactly.
> >
> >> So, IIUC, the WARN happened when we were doing a non-extending
> >> AIO-DIO write which was racing with truncate trying to expand the file
> >> size. Because only then the DIO completion will not have i_rwsem held
> >> which can race with truncate. Truncate since it is expanding the file
> >> size, will not use inode_dio_wait() (since no block allocations).
> >>
> >> Is this understanding correct?
> >
> > Yes, correct.
>
> Thanks Jan,
>
> Also, ext4_inode_extension_cleanup() can take care of deleting the
> inode from the orphan list in case there was a race with a truncate
> which extended both i_disksize and inode->i_size, so that the DIO
> completion couldn't call ext4_handle_inode_extension(), right?
>
> In that case, does it make sense to update a comment here too?
>
> @@ -350,7 +350,10 @@ static void ext4_inode_extension_cleanup(struct inode *inode, ssize_t count)
> }
> /*
> * If i_disksize got extended due to writeback of delalloc blocks while
> - * the DIO was running we could fail to cleanup the orphan list in
> + * the DIO was running, or
> + * If i_disksize and inode->i_size both got extended during truncate
> + * which raced with DIO completion,
> + * In both such cases, we could fail to cleanup the orphan list in
> * ext4_handle_inode_extension(). Do it now.
> */
> if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {

Good point. Expanded comment in this way. I'll send v2 shortly.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR