2019-11-22 09:01:46

by Javier González

[permalink] [raw]
Subject: [PATCH] f2fs: disble physical prealloc in LSF mount

From: Javier González <[email protected]>

Fix file system corruption when using LFS mount (e.g., in zoned
devices). Seems like the fallback into buffered I/O creates an
inconsistency if the application is assuming both read and write DIO. I
can easily reproduce a corruption with a simple RocksDB test.

Might be that the f2fs_forced_buffered_io path brings some problems too,
but I have not seen other failures besides this one.

Problem reproducible without a zoned block device, simply by forcing
LFS mount:

$ sudo mkfs.f2fs -f -m /dev/nvme0n1
$ sudo mount /dev/nvme0n1 /mnt/f2fs
$ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0
--use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
--db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
--block_size=65536

Note that the options that cause the problem are:
--use_direct_reads=true --use_direct_io_for_flush_and_compaction=true

Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")

Signed-off-by: Javier González <[email protected]>
---
fs/f2fs/data.c | 3 ---
1 file changed, 3 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 5755e897a5f0..b045dd6ab632 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
return err;
}

- if (direct_io && allow_outplace_dio(inode, iocb, from))
- return 0;
-
if (is_inode_flag_set(inode, FI_NO_PREALLOC))
return 0;

--
2.17.1


2019-11-25 00:50:28

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH] f2fs: disble physical prealloc in LSF mount

On 2019/11/22 18:00, Javier Gonz?lez wrote:
> From: Javier Gonz?lez <[email protected]>
>
> Fix file system corruption when using LFS mount (e.g., in zoned
> devices). Seems like the fallback into buffered I/O creates an
> inconsistency if the application is assuming both read and write DIO. I
> can easily reproduce a corruption with a simple RocksDB test.
>
> Might be that the f2fs_forced_buffered_io path brings some problems too,
> but I have not seen other failures besides this one.
>
> Problem reproducible without a zoned block device, simply by forcing
> LFS mount:
>
> $ sudo mkfs.f2fs -f -m /dev/nvme0n1
> $ sudo mount /dev/nvme0n1 /mnt/f2fs
> $ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0
> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
> --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
> --block_size=65536
>
> Note that the options that cause the problem are:
> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>
> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")
>
> Signed-off-by: Javier Gonz?lez <[email protected]>
> ---
> fs/f2fs/data.c | 3 ---
> 1 file changed, 3 deletions(-)
>
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 5755e897a5f0..b045dd6ab632 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
> return err;
> }
>
> - if (direct_io && allow_outplace_dio(inode, iocb, from))
> - return 0;

Since for LFS mode, all DIOs can end up out of place, I think that it
may be better to change allow_outplace_dio() to always return true in
the case of LFS mode. So may be something like:

static inline int allow_outplace_dio(struct inode *inode,
struct kiocb *iocb, struct iov_iter *iter)
{
struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
int rw = iov_iter_rw(iter);

return test_opt(sbi, LFS) ||
(rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
}

instead of the original:

static inline int allow_outplace_dio(struct inode *inode,
struct kiocb *iocb, struct iov_iter *iter)
{
struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
int rw = iov_iter_rw(iter);

return (test_opt(sbi, LFS) && (rw == WRITE) &&
!block_unaligned_IO(inode, iocb, iter));
}

Thoughts ?

> -
> if (is_inode_flag_set(inode, FI_NO_PREALLOC))
> return 0;
>
>


--
Damien Le Moal
Western Digital Research

2019-11-25 19:05:21

by Javier González

[permalink] [raw]
Subject: Re: [PATCH] f2fs: disble physical prealloc in LSF mount

On 25.11.2019 00:48, Damien Le Moal wrote:
>On 2019/11/22 18:00, Javier González wrote:
>> From: Javier González <[email protected]>
>>
>> Fix file system corruption when using LFS mount (e.g., in zoned
>> devices). Seems like the fallback into buffered I/O creates an
>> inconsistency if the application is assuming both read and write DIO. I
>> can easily reproduce a corruption with a simple RocksDB test.
>>
>> Might be that the f2fs_forced_buffered_io path brings some problems too,
>> but I have not seen other failures besides this one.
>>
>> Problem reproducible without a zoned block device, simply by forcing
>> LFS mount:
>>
>> $ sudo mkfs.f2fs -f -m /dev/nvme0n1
>> $ sudo mount /dev/nvme0n1 /mnt/f2fs
>> $ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0
>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>> --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
>> --block_size=65536
>>
>> Note that the options that cause the problem are:
>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>
>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")
>>
>> Signed-off-by: Javier González <[email protected]>
>> ---
>> fs/f2fs/data.c | 3 ---
>> 1 file changed, 3 deletions(-)
>>
>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
>> index 5755e897a5f0..b045dd6ab632 100644
>> --- a/fs/f2fs/data.c
>> +++ b/fs/f2fs/data.c
>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>> return err;
>> }
>>
>> - if (direct_io && allow_outplace_dio(inode, iocb, from))
>> - return 0;
>
>Since for LFS mode, all DIOs can end up out of place, I think that it
>may be better to change allow_outplace_dio() to always return true in
>the case of LFS mode. So may be something like:
>
>static inline int allow_outplace_dio(struct inode *inode,
> struct kiocb *iocb, struct iov_iter *iter)
>{
> struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
> int rw = iov_iter_rw(iter);
>
> return test_opt(sbi, LFS) ||
> (rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
>}
>
>instead of the original:
>
>static inline int allow_outplace_dio(struct inode *inode,
> struct kiocb *iocb, struct iov_iter *iter)
>{
> struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
> int rw = iov_iter_rw(iter);
>
> return (test_opt(sbi, LFS) && (rw == WRITE) &&
> !block_unaligned_IO(inode, iocb, iter));
>}
>
>Thoughts ?
>

I see what you mean and it makes sense. However, the problem I am seeing
occurs when allow_outplace_dio() returns true, as this is what creates
the inconsistency between the write being buffered and the read being
DIO.

I did test the code you propose and, as expected, it still triggered the
corruption.

>> -
>> if (is_inode_flag_set(inode, FI_NO_PREALLOC))
>> return 0;
>>
>>
>
>
>--
>Damien Le Moal
>Western Digital Research

Thanks,
Javier

2019-11-26 02:16:14

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH] f2fs: disble physical prealloc in LSF mount

On 2019/11/26 4:03, Javier Gonz?lez wrote:
> On 25.11.2019 00:48, Damien Le Moal wrote:
>> On 2019/11/22 18:00, Javier Gonz?lez wrote:
>>> From: Javier Gonz?lez <[email protected]>
>>>
>>> Fix file system corruption when using LFS mount (e.g., in zoned
>>> devices). Seems like the fallback into buffered I/O creates an
>>> inconsistency if the application is assuming both read and write DIO. I
>>> can easily reproduce a corruption with a simple RocksDB test.
>>>
>>> Might be that the f2fs_forced_buffered_io path brings some problems too,
>>> but I have not seen other failures besides this one.
>>>
>>> Problem reproducible without a zoned block device, simply by forcing
>>> LFS mount:
>>>
>>> $ sudo mkfs.f2fs -f -m /dev/nvme0n1
>>> $ sudo mount /dev/nvme0n1 /mnt/f2fs
>>> $ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0
>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>> --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
>>> --block_size=65536
>>>
>>> Note that the options that cause the problem are:
>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>
>>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")
>>>
>>> Signed-off-by: Javier Gonz?lez <[email protected]>
>>> ---
>>> fs/f2fs/data.c | 3 ---
>>> 1 file changed, 3 deletions(-)
>>>
>>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
>>> index 5755e897a5f0..b045dd6ab632 100644
>>> --- a/fs/f2fs/data.c
>>> +++ b/fs/f2fs/data.c
>>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>>> return err;
>>> }
>>>
>>> - if (direct_io && allow_outplace_dio(inode, iocb, from))
>>> - return 0;
>>
>> Since for LFS mode, all DIOs can end up out of place, I think that it
>> may be better to change allow_outplace_dio() to always return true in
>> the case of LFS mode. So may be something like:
>>
>> static inline int allow_outplace_dio(struct inode *inode,
>> struct kiocb *iocb, struct iov_iter *iter)
>> {
>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>> int rw = iov_iter_rw(iter);
>>
>> return test_opt(sbi, LFS) ||
>> (rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
>> }
>>
>> instead of the original:
>>
>> static inline int allow_outplace_dio(struct inode *inode,
>> struct kiocb *iocb, struct iov_iter *iter)
>> {
>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>> int rw = iov_iter_rw(iter);
>>
>> return (test_opt(sbi, LFS) && (rw == WRITE) &&
>> !block_unaligned_IO(inode, iocb, iter));
>> }
>>
>> Thoughts ?
>>
>
> I see what you mean and it makes sense. However, the problem I am seeing
> occurs when allow_outplace_dio() returns true, as this is what creates
> the inconsistency between the write being buffered and the read being
> DIO.

But if the write is switched to buffered, the DIO read should use the
buffered path too, no ? Since this is all happening under VFS, the
generic DIO read path will not ensure that the buffered writes are
flushed to disk before issuing the direct read, I think. So that would
explain your data corruption, i.e. you are reading stale data on the
device before the buffered writes make it to the media.

>
> I did test the code you propose and, as expected, it still triggered the
> corruption.
>
>>> -
>>> if (is_inode_flag_set(inode, FI_NO_PREALLOC))
>>> return 0;
>>>
>>>
>>
>>
>> --
>> Damien Le Moal
>> Western Digital Research
>
> Thanks,
> Javier
>


--
Damien Le Moal
Western Digital Research

2019-11-26 04:08:57

by Javier González

[permalink] [raw]
Subject: Re: [PATCH] f2fs: disble physical prealloc in LSF mount

On 26.11.2019 02:06, Damien Le Moal wrote:
>On 2019/11/26 4:03, Javier González wrote:
>> On 25.11.2019 00:48, Damien Le Moal wrote:
>>> On 2019/11/22 18:00, Javier González wrote:
>>>> From: Javier González <[email protected]>
>>>>
>>>> Fix file system corruption when using LFS mount (e.g., in zoned
>>>> devices). Seems like the fallback into buffered I/O creates an
>>>> inconsistency if the application is assuming both read and write DIO. I
>>>> can easily reproduce a corruption with a simple RocksDB test.
>>>>
>>>> Might be that the f2fs_forced_buffered_io path brings some problems too,
>>>> but I have not seen other failures besides this one.
>>>>
>>>> Problem reproducible without a zoned block device, simply by forcing
>>>> LFS mount:
>>>>
>>>> $ sudo mkfs.f2fs -f -m /dev/nvme0n1
>>>> $ sudo mount /dev/nvme0n1 /mnt/f2fs
>>>> $ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0
>>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>> --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
>>>> --block_size=65536
>>>>
>>>> Note that the options that cause the problem are:
>>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>>
>>>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")
>>>>
>>>> Signed-off-by: Javier González <[email protected]>
>>>> ---
>>>> fs/f2fs/data.c | 3 ---
>>>> 1 file changed, 3 deletions(-)
>>>>
>>>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
>>>> index 5755e897a5f0..b045dd6ab632 100644
>>>> --- a/fs/f2fs/data.c
>>>> +++ b/fs/f2fs/data.c
>>>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>>>> return err;
>>>> }
>>>>
>>>> - if (direct_io && allow_outplace_dio(inode, iocb, from))
>>>> - return 0;
>>>
>>> Since for LFS mode, all DIOs can end up out of place, I think that it
>>> may be better to change allow_outplace_dio() to always return true in
>>> the case of LFS mode. So may be something like:
>>>
>>> static inline int allow_outplace_dio(struct inode *inode,
>>> struct kiocb *iocb, struct iov_iter *iter)
>>> {
>>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>> int rw = iov_iter_rw(iter);
>>>
>>> return test_opt(sbi, LFS) ||
>>> (rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
>>> }
>>>
>>> instead of the original:
>>>
>>> static inline int allow_outplace_dio(struct inode *inode,
>>> struct kiocb *iocb, struct iov_iter *iter)
>>> {
>>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>> int rw = iov_iter_rw(iter);
>>>
>>> return (test_opt(sbi, LFS) && (rw == WRITE) &&
>>> !block_unaligned_IO(inode, iocb, iter));
>>> }
>>>
>>> Thoughts ?
>>>
>>
>> I see what you mean and it makes sense. However, the problem I am seeing
>> occurs when allow_outplace_dio() returns true, as this is what creates
>> the inconsistency between the write being buffered and the read being
>> DIO.
>
>But if the write is switched to buffered, the DIO read should use the
>buffered path too, no ? Since this is all happening under VFS, the
>generic DIO read path will not ensure that the buffered writes are
>flushed to disk before issuing the direct read, I think. So that would
>explain your data corruption, i.e. you are reading stale data on the
>device before the buffered writes make it to the media.
>

As far as I can see, the read is always sent DIO, so yes, I also believe
that we are reading stale data. This is why the corruption is not seen
if preventing allow_outplace_dio() from sending the write to the
buffered path.

What surprises me is that this is very easy to trigger (see commit), so
I assume you must have seen this with SMR in the past.

Does it make sense to leave the LFS check out of the
allow_outplace_dio()? Or in other words, is there a hard requirement for
writes to take this path on a zoned device that I am not seeing?
Something like:

static inline int allow_outplace_dio(struct inode *inode,
struct kiocb *iocb, struct iov_iter *iter)
{
struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
int rw = iov_iter_rw(iter);

return (rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
}

Thanks,
Javier

2019-11-26 06:31:05

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH] f2fs: disble physical prealloc in LSF mount

On 2019/11/26 12:58, Javier Gonz?lez wrote:
> On 26.11.2019 02:06, Damien Le Moal wrote:
>> On 2019/11/26 4:03, Javier Gonz?lez wrote:
>>> On 25.11.2019 00:48, Damien Le Moal wrote:
>>>> On 2019/11/22 18:00, Javier Gonz?lez wrote:
>>>>> From: Javier Gonz?lez <[email protected]>
>>>>>
>>>>> Fix file system corruption when using LFS mount (e.g., in zoned
>>>>> devices). Seems like the fallback into buffered I/O creates an
>>>>> inconsistency if the application is assuming both read and write DIO. I
>>>>> can easily reproduce a corruption with a simple RocksDB test.
>>>>>
>>>>> Might be that the f2fs_forced_buffered_io path brings some problems too,
>>>>> but I have not seen other failures besides this one.
>>>>>
>>>>> Problem reproducible without a zoned block device, simply by forcing
>>>>> LFS mount:
>>>>>
>>>>> $ sudo mkfs.f2fs -f -m /dev/nvme0n1
>>>>> $ sudo mount /dev/nvme0n1 /mnt/f2fs
>>>>> $ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0
>>>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>>> --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
>>>>> --block_size=65536
>>>>>
>>>>> Note that the options that cause the problem are:
>>>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>>>
>>>>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")
>>>>>
>>>>> Signed-off-by: Javier Gonz?lez <[email protected]>
>>>>> ---
>>>>> fs/f2fs/data.c | 3 ---
>>>>> 1 file changed, 3 deletions(-)
>>>>>
>>>>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
>>>>> index 5755e897a5f0..b045dd6ab632 100644
>>>>> --- a/fs/f2fs/data.c
>>>>> +++ b/fs/f2fs/data.c
>>>>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>>>>> return err;
>>>>> }
>>>>>
>>>>> - if (direct_io && allow_outplace_dio(inode, iocb, from))
>>>>> - return 0;
>>>>
>>>> Since for LFS mode, all DIOs can end up out of place, I think that it
>>>> may be better to change allow_outplace_dio() to always return true in
>>>> the case of LFS mode. So may be something like:
>>>>
>>>> static inline int allow_outplace_dio(struct inode *inode,
>>>> struct kiocb *iocb, struct iov_iter *iter)
>>>> {
>>>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>>> int rw = iov_iter_rw(iter);
>>>>
>>>> return test_opt(sbi, LFS) ||
>>>> (rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
>>>> }
>>>>
>>>> instead of the original:
>>>>
>>>> static inline int allow_outplace_dio(struct inode *inode,
>>>> struct kiocb *iocb, struct iov_iter *iter)
>>>> {
>>>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>>> int rw = iov_iter_rw(iter);
>>>>
>>>> return (test_opt(sbi, LFS) && (rw == WRITE) &&
>>>> !block_unaligned_IO(inode, iocb, iter));
>>>> }
>>>>
>>>> Thoughts ?
>>>>
>>>
>>> I see what you mean and it makes sense. However, the problem I am seeing
>>> occurs when allow_outplace_dio() returns true, as this is what creates
>>> the inconsistency between the write being buffered and the read being
>>> DIO.
>>
>> But if the write is switched to buffered, the DIO read should use the
>> buffered path too, no ? Since this is all happening under VFS, the
>> generic DIO read path will not ensure that the buffered writes are
>> flushed to disk before issuing the direct read, I think. So that would
>> explain your data corruption, i.e. you are reading stale data on the
>> device before the buffered writes make it to the media.
>>
>
> As far as I can see, the read is always sent DIO, so yes, I also believe
> that we are reading stale data. This is why the corruption is not seen
> if preventing allow_outplace_dio() from sending the write to the
> buffered path.
>
> What surprises me is that this is very easy to trigger (see commit), so
> I assume you must have seen this with SMR in the past.

We just did. Shin'Ichiro in my team finally succeeded in recreating the
problem. The cause seems to be:

bool direct_io = iocb->ki_flags & IOCB_DIRECT;

being true on entry of f2fs_preallocate_blocks() whereas
f2fs_direct_IO() forces buffered IO path for DIO on zoned devices with:

if (f2fs_force_buffered_io(inode, iocb, iter))
return 0;

which has:

if (f2fs_sb_has_blkzoned(sbi))
return true;

So the top DIO code says "do buffered IOs", but lower in the write path,
the IO is still assumed to be a DIO because of the iocb flag... That's
inconsistent.

Note that for the non-zoned device LFS case, f2fs_force_buffered_io()
returns true only for unaligned write DIOs... But that will still trip
on the iocb flag test. So the proper fix is likely something like:

int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
struct f2fs_map_blocks map;
int flag;
int err = 0;
- bool direct_io = iocb->ki_flags & IOCB_DIRECT;
+ bool direct_io = (iocb->ki_flags & IOCB_DIRECT) &&
+ !2fs_force_buffered_io(inode, iocb, iter);

/* convert inline data for Direct I/O*/
if (direct_io) {
err = f2fs_convert_inline_inode(inode);
if (err)
return err;
}

Shin'Ichiro tried this on SMR disks and the failure is gone...

Cheers.


>
> Does it make sense to leave the LFS check out of the
> allow_outplace_dio()? Or in other words, is there a hard requirement for
> writes to take this path on a zoned device that I am not seeing?
> Something like:
>
> static inline int allow_outplace_dio(struct inode *inode,
> struct kiocb *iocb, struct iov_iter *iter)
> {
> struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
> int rw = iov_iter_rw(iter);
>
> return (rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
> }
>
> Thanks,
> Javier
>


--
Damien Le Moal
Western Digital Research

2019-11-26 06:31:12

by Damien Le Moal

[permalink] [raw]
Subject: Re: [PATCH] f2fs: disble physical prealloc in LSF mount

+ Shin'Ichiro

On 2019/11/26 15:19, Damien Le Moal wrote:
> On 2019/11/26 12:58, Javier Gonz?lez wrote:
>> On 26.11.2019 02:06, Damien Le Moal wrote:
>>> On 2019/11/26 4:03, Javier Gonz?lez wrote:
>>>> On 25.11.2019 00:48, Damien Le Moal wrote:
>>>>> On 2019/11/22 18:00, Javier Gonz?lez wrote:
>>>>>> From: Javier Gonz?lez <[email protected]>
>>>>>>
>>>>>> Fix file system corruption when using LFS mount (e.g., in zoned
>>>>>> devices). Seems like the fallback into buffered I/O creates an
>>>>>> inconsistency if the application is assuming both read and write DIO. I
>>>>>> can easily reproduce a corruption with a simple RocksDB test.
>>>>>>
>>>>>> Might be that the f2fs_forced_buffered_io path brings some problems too,
>>>>>> but I have not seen other failures besides this one.
>>>>>>
>>>>>> Problem reproducible without a zoned block device, simply by forcing
>>>>>> LFS mount:
>>>>>>
>>>>>> $ sudo mkfs.f2fs -f -m /dev/nvme0n1
>>>>>> $ sudo mount /dev/nvme0n1 /mnt/f2fs
>>>>>> $ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0
>>>>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>>>> --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
>>>>>> --block_size=65536
>>>>>>
>>>>>> Note that the options that cause the problem are:
>>>>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>>>>
>>>>>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")
>>>>>>
>>>>>> Signed-off-by: Javier Gonz?lez <[email protected]>
>>>>>> ---
>>>>>> fs/f2fs/data.c | 3 ---
>>>>>> 1 file changed, 3 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
>>>>>> index 5755e897a5f0..b045dd6ab632 100644
>>>>>> --- a/fs/f2fs/data.c
>>>>>> +++ b/fs/f2fs/data.c
>>>>>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>>>>>> return err;
>>>>>> }
>>>>>>
>>>>>> - if (direct_io && allow_outplace_dio(inode, iocb, from))
>>>>>> - return 0;
>>>>>
>>>>> Since for LFS mode, all DIOs can end up out of place, I think that it
>>>>> may be better to change allow_outplace_dio() to always return true in
>>>>> the case of LFS mode. So may be something like:
>>>>>
>>>>> static inline int allow_outplace_dio(struct inode *inode,
>>>>> struct kiocb *iocb, struct iov_iter *iter)
>>>>> {
>>>>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>>>> int rw = iov_iter_rw(iter);
>>>>>
>>>>> return test_opt(sbi, LFS) ||
>>>>> (rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
>>>>> }
>>>>>
>>>>> instead of the original:
>>>>>
>>>>> static inline int allow_outplace_dio(struct inode *inode,
>>>>> struct kiocb *iocb, struct iov_iter *iter)
>>>>> {
>>>>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>>>> int rw = iov_iter_rw(iter);
>>>>>
>>>>> return (test_opt(sbi, LFS) && (rw == WRITE) &&
>>>>> !block_unaligned_IO(inode, iocb, iter));
>>>>> }
>>>>>
>>>>> Thoughts ?
>>>>>
>>>>
>>>> I see what you mean and it makes sense. However, the problem I am seeing
>>>> occurs when allow_outplace_dio() returns true, as this is what creates
>>>> the inconsistency between the write being buffered and the read being
>>>> DIO.
>>>
>>> But if the write is switched to buffered, the DIO read should use the
>>> buffered path too, no ? Since this is all happening under VFS, the
>>> generic DIO read path will not ensure that the buffered writes are
>>> flushed to disk before issuing the direct read, I think. So that would
>>> explain your data corruption, i.e. you are reading stale data on the
>>> device before the buffered writes make it to the media.
>>>
>>
>> As far as I can see, the read is always sent DIO, so yes, I also believe
>> that we are reading stale data. This is why the corruption is not seen
>> if preventing allow_outplace_dio() from sending the write to the
>> buffered path.
>>
>> What surprises me is that this is very easy to trigger (see commit), so
>> I assume you must have seen this with SMR in the past.
>
> We just did. Shin'Ichiro in my team finally succeeded in recreating the
> problem. The cause seems to be:
>
> bool direct_io = iocb->ki_flags & IOCB_DIRECT;
>
> being true on entry of f2fs_preallocate_blocks() whereas
> f2fs_direct_IO() forces buffered IO path for DIO on zoned devices with:
>
> if (f2fs_force_buffered_io(inode, iocb, iter))
> return 0;
>
> which has:
>
> if (f2fs_sb_has_blkzoned(sbi))
> return true;
>
> So the top DIO code says "do buffered IOs", but lower in the write path,
> the IO is still assumed to be a DIO because of the iocb flag... That's
> inconsistent.
>
> Note that for the non-zoned device LFS case, f2fs_force_buffered_io()
> returns true only for unaligned write DIOs... But that will still trip
> on the iocb flag test. So the proper fix is likely something like:
>
> int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
> {
> struct inode *inode = file_inode(iocb->ki_filp);
> struct f2fs_map_blocks map;
> int flag;
> int err = 0;
> - bool direct_io = iocb->ki_flags & IOCB_DIRECT;
> + bool direct_io = (iocb->ki_flags & IOCB_DIRECT) &&
> + !2fs_force_buffered_io(inode, iocb, iter);
>
> /* convert inline data for Direct I/O*/
> if (direct_io) {
> err = f2fs_convert_inline_inode(inode);
> if (err)
> return err;
> }
>
> Shin'Ichiro tried this on SMR disks and the failure is gone...
>
> Cheers.
>
>
>>
>> Does it make sense to leave the LFS check out of the
>> allow_outplace_dio()? Or in other words, is there a hard requirement for
>> writes to take this path on a zoned device that I am not seeing?
>> Something like:
>>
>> static inline int allow_outplace_dio(struct inode *inode,
>> struct kiocb *iocb, struct iov_iter *iter)
>> {
>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>> int rw = iov_iter_rw(iter);
>>
>> return (rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
>> }
>>
>> Thanks,
>> Javier
>>
>
>


--
Damien Le Moal
Western Digital Research

2019-11-26 07:41:57

by Javier González

[permalink] [raw]
Subject: Re: [PATCH] f2fs: disble physical prealloc in LSF mount

On 26.11.2019 06:20, Damien Le Moal wrote:
>+ Shin'Ichiro
>
>On 2019/11/26 15:19, Damien Le Moal wrote:
>> On 2019/11/26 12:58, Javier González wrote:
>>> On 26.11.2019 02:06, Damien Le Moal wrote:
>>>> On 2019/11/26 4:03, Javier González wrote:
>>>>> On 25.11.2019 00:48, Damien Le Moal wrote:
>>>>>> On 2019/11/22 18:00, Javier González wrote:
>>>>>>> From: Javier González <[email protected]>
>>>>>>>
>>>>>>> Fix file system corruption when using LFS mount (e.g., in zoned
>>>>>>> devices). Seems like the fallback into buffered I/O creates an
>>>>>>> inconsistency if the application is assuming both read and write DIO. I
>>>>>>> can easily reproduce a corruption with a simple RocksDB test.
>>>>>>>
>>>>>>> Might be that the f2fs_forced_buffered_io path brings some problems too,
>>>>>>> but I have not seen other failures besides this one.
>>>>>>>
>>>>>>> Problem reproducible without a zoned block device, simply by forcing
>>>>>>> LFS mount:
>>>>>>>
>>>>>>> $ sudo mkfs.f2fs -f -m /dev/nvme0n1
>>>>>>> $ sudo mount /dev/nvme0n1 /mnt/f2fs
>>>>>>> $ sudo /opt/rocksdb/db_bench --benchmarks=fillseq --use_existing_db=0
>>>>>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>>>>> --db=/mnt/f2fs --num=5000 --value_size=1048576 --verify_checksum=1
>>>>>>> --block_size=65536
>>>>>>>
>>>>>>> Note that the options that cause the problem are:
>>>>>>> --use_direct_reads=true --use_direct_io_for_flush_and_compaction=true
>>>>>>>
>>>>>>> Fixes: f9d6d0597698 ("f2fs: fix out-place-update DIO write")
>>>>>>>
>>>>>>> Signed-off-by: Javier González <[email protected]>
>>>>>>> ---
>>>>>>> fs/f2fs/data.c | 3 ---
>>>>>>> 1 file changed, 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
>>>>>>> index 5755e897a5f0..b045dd6ab632 100644
>>>>>>> --- a/fs/f2fs/data.c
>>>>>>> +++ b/fs/f2fs/data.c
>>>>>>> @@ -1081,9 +1081,6 @@ int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>>>>>>> return err;
>>>>>>> }
>>>>>>>
>>>>>>> - if (direct_io && allow_outplace_dio(inode, iocb, from))
>>>>>>> - return 0;
>>>>>>
>>>>>> Since for LFS mode, all DIOs can end up out of place, I think that it
>>>>>> may be better to change allow_outplace_dio() to always return true in
>>>>>> the case of LFS mode. So may be something like:
>>>>>>
>>>>>> static inline int allow_outplace_dio(struct inode *inode,
>>>>>> struct kiocb *iocb, struct iov_iter *iter)
>>>>>> {
>>>>>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>>>>> int rw = iov_iter_rw(iter);
>>>>>>
>>>>>> return test_opt(sbi, LFS) ||
>>>>>> (rw == WRITE && !block_unaligned_IO(inode, iocb, iter));
>>>>>> }
>>>>>>
>>>>>> instead of the original:
>>>>>>
>>>>>> static inline int allow_outplace_dio(struct inode *inode,
>>>>>> struct kiocb *iocb, struct iov_iter *iter)
>>>>>> {
>>>>>> struct f2fs_sb_info *sbi = F2FS_I_SB(inode);
>>>>>> int rw = iov_iter_rw(iter);
>>>>>>
>>>>>> return (test_opt(sbi, LFS) && (rw == WRITE) &&
>>>>>> !block_unaligned_IO(inode, iocb, iter));
>>>>>> }
>>>>>>
>>>>>> Thoughts ?
>>>>>>
>>>>>
>>>>> I see what you mean and it makes sense. However, the problem I am seeing
>>>>> occurs when allow_outplace_dio() returns true, as this is what creates
>>>>> the inconsistency between the write being buffered and the read being
>>>>> DIO.
>>>>
>>>> But if the write is switched to buffered, the DIO read should use the
>>>> buffered path too, no ? Since this is all happening under VFS, the
>>>> generic DIO read path will not ensure that the buffered writes are
>>>> flushed to disk before issuing the direct read, I think. So that would
>>>> explain your data corruption, i.e. you are reading stale data on the
>>>> device before the buffered writes make it to the media.
>>>>
>>>
>>> As far as I can see, the read is always sent DIO, so yes, I also believe
>>> that we are reading stale data. This is why the corruption is not seen
>>> if preventing allow_outplace_dio() from sending the write to the
>>> buffered path.
>>>
>>> What surprises me is that this is very easy to trigger (see commit), so
>>> I assume you must have seen this with SMR in the past.
>>
>> We just did. Shin'Ichiro in my team finally succeeded in recreating the
>> problem. The cause seems to be:
>>
>> bool direct_io = iocb->ki_flags & IOCB_DIRECT;
>>
>> being true on entry of f2fs_preallocate_blocks() whereas
>> f2fs_direct_IO() forces buffered IO path for DIO on zoned devices with:
>>
>> if (f2fs_force_buffered_io(inode, iocb, iter))
>> return 0;
>>
>> which has:
>>
>> if (f2fs_sb_has_blkzoned(sbi))
>> return true;
>>
>> So the top DIO code says "do buffered IOs", but lower in the write path,
>> the IO is still assumed to be a DIO because of the iocb flag... That's
>> inconsistent.
>>
>> Note that for the non-zoned device LFS case, f2fs_force_buffered_io()
>> returns true only for unaligned write DIOs... But that will still trip
>> on the iocb flag test. So the proper fix is likely something like:
>>
>> int f2fs_preallocate_blocks(struct kiocb *iocb, struct iov_iter *from)
>> {
>> struct inode *inode = file_inode(iocb->ki_filp);
>> struct f2fs_map_blocks map;
>> int flag;
>> int err = 0;
>> - bool direct_io = iocb->ki_flags & IOCB_DIRECT;
>> + bool direct_io = (iocb->ki_flags & IOCB_DIRECT) &&
>> + !2fs_force_buffered_io(inode, iocb, iter);
>>
>> /* convert inline data for Direct I/O*/
>> if (direct_io) {
>> err = f2fs_convert_inline_inode(inode);
>> if (err)
>> return err;
>> }
>>
>> Shin'Ichiro tried this on SMR disks and the failure is gone...
>>
>> Cheers.
>>

Yes! This is it. I originally though that the problem was on
f2fs_force_buffered_io(), but could not hit the problem there. Thanks
for the analysis; it makes sense now.

Just tested your patch on our drives and the problem is gone too. Guess
you can send a new patch an ignore this one. You can set my reviewed-by
on it.

Thanks Damien!
Javier