2016-05-07 18:31:49

by Eryu Guan

[permalink] [raw]
Subject: [PATCH v2 1/2] direct-io: cleanup get_more_blocks()

Save one level of indention by returning error early.

Introduce some local variables to make the code easier to read a bit,
and do preparation for next patch.

Signed-off-by: Eryu Guan <[email protected]>
---

v2: split the cleanup code into a seperate patch to make the review easier

fs/direct-io.c | 69 ++++++++++++++++++++++++++++++----------------------------
1 file changed, 36 insertions(+), 33 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 4720377..9d5aff9 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -607,52 +607,55 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
int ret;
sector_t fs_startblk; /* Into file, in filesystem-sized blocks */
sector_t fs_endblk; /* Into file, in filesystem-sized blocks */
+ sector_t block_in_file = sdio->block_in_file;
unsigned long fs_count; /* Number of filesystem-sized blocks */
int create;
- unsigned int i_blkbits = sdio->blkbits + sdio->blkfactor;
+ unsigned int blkbits = sdio->blkbits;
+ unsigned int blkfactor = sdio->blkfactor;
+ unsigned int i_blkbits = blkbits + blkfactor;
+ struct inode *inode = dio->inode;

/*
* If there was a memory error and we've overwritten all the
* mapped blocks then we can now return that memory error
*/
ret = dio->page_errors;
- if (ret == 0) {
- BUG_ON(sdio->block_in_file >= sdio->final_block_in_request);
- fs_startblk = sdio->block_in_file >> sdio->blkfactor;
- fs_endblk = (sdio->final_block_in_request - 1) >>
- sdio->blkfactor;
- fs_count = fs_endblk - fs_startblk + 1;
+ if (ret)
+ return ret;

- map_bh->b_state = 0;
- map_bh->b_size = fs_count << i_blkbits;
+ BUG_ON(block_in_file >= sdio->final_block_in_request);
+ fs_startblk = block_in_file >> blkfactor;
+ fs_endblk = (sdio->final_block_in_request - 1) >> blkfactor;
+ fs_count = fs_endblk - fs_startblk + 1;

- /*
- * For writes inside i_size on a DIO_SKIP_HOLES filesystem we
- * forbid block creations: only overwrites are permitted.
- * We will return early to the caller once we see an
- * unmapped buffer head returned, and the caller will fall
- * back to buffered I/O.
- *
- * Otherwise the decision is left to the get_blocks method,
- * which may decide to handle it or also return an unmapped
- * buffer head.
- */
- create = dio->rw & WRITE;
- if (dio->flags & DIO_SKIP_HOLES) {
- if (sdio->block_in_file < (i_size_read(dio->inode) >>
- sdio->blkbits))
- create = 0;
- }
+ map_bh->b_state = 0;
+ map_bh->b_size = fs_count << i_blkbits;
+
+ /*
+ * For writes inside i_size on a DIO_SKIP_HOLES filesystem we
+ * forbid block creations: only overwrites are permitted.
+ * We will return early to the caller once we see an
+ * unmapped buffer head returned, and the caller will fall
+ * back to buffered I/O.
+ *
+ * Otherwise the decision is left to the get_blocks method,
+ * which may decide to handle it or also return an unmapped
+ * buffer head.
+ */
+ create = dio->rw & WRITE;
+ if (dio->flags & DIO_SKIP_HOLES) {
+ if (block_in_file < (i_size_read(inode) >> blkbits))
+ create = 0;
+ }

- ret = (*sdio->get_block)(dio->inode, fs_startblk,
- map_bh, create);
+ ret = (*sdio->get_block)(inode, fs_startblk, map_bh, create);

- /* Store for completion */
- dio->private = map_bh->b_private;
+ /* Store for completion */
+ dio->private = map_bh->b_private;
+
+ if (ret == 0 && buffer_defer_completion(map_bh))
+ ret = dio_set_defer_completion(dio);

- if (ret == 0 && buffer_defer_completion(map_bh))
- ret = dio_set_defer_completion(dio);
- }
return ret;
}

--
2.5.5



2016-05-07 18:31:50

by Eryu Guan

[permalink] [raw]
Subject: [PATCH v2 2/2] direct-io: fix stale data exposure from concurrent buffered read

Currently direct writes inside i_size on a DIO_SKIP_HOLES filesystem are
not allowed to allocate blocks(get_more_blocks() sets 'create' to 0
before calling get_block() callback), if it's a sparse file, direct
writes fall back to buffered writes to avoid stale data exposure from
concurrent buffered read. But there're two cases that can result in
stale data exposure are not correctly detected.

1. The detection for "writing inside i_size" is not sufficient, writes
can be treated as "extending writes" wrongly. For example, direct write
1FSB to a 1FSB sparse file on ext2/3/4, starting from offset 0, in this
case it's writing inside i_size, but 'create' is non-zero, because
'block_in_file' and '(i_size_read(inode) >> blkbits' are both zero.

2. Direct writes starting from or beyong i_size (not inside i_size) also
could trigger block allocation and expose stale data. For example,
consider a sparse file with i_size of 2k, and a write to offset 2k or 3k
into the file, with a filesystem block size of 4k. (Thanks to Jeff Moyer
for pointing this case out.)

The first problem can be demostrated by running ltp-aiodio test ADSP045
many times. When testing on extN filesystems, I see test failures
occasionally, buffered read could read non-zero (stale) data.

ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 1

dio_sparse 0 TINFO : Dirtying free blocks
dio_sparse 0 TINFO : Starting I/O tests
non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
non-zero read at offset 0
dio_sparse 0 TINFO : Killing childrens(s)
dio_sparse 1 TFAIL : dio_sparse.c:191: 1 children(s) exited abnormally

The second problem can also be reproduced easily by a hacked dio_sparse
program, which accepts an option to specify the write offset.

What we should really do is to disable block allocation for writes that
could result in filling holes inside i_size.

Signed-off-by: Eryu Guan <[email protected]>
---

v2:
- Fix the case Jeff pointed out as well
- Update commit log

fs/direct-io.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 9d5aff9..5c13bbf 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -632,8 +632,10 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
map_bh->b_size = fs_count << i_blkbits;

/*
- * For writes inside i_size on a DIO_SKIP_HOLES filesystem we
- * forbid block creations: only overwrites are permitted.
+ * For writes that could fill holes inside i_size on a
+ * DIO_SKIP_HOLES filesystem we forbid block creations: only
+ * overwrites are permitted.
+ *
* We will return early to the caller once we see an
* unmapped buffer head returned, and the caller will fall
* back to buffered I/O.
@@ -644,7 +646,7 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
*/
create = dio->rw & WRITE;
if (dio->flags & DIO_SKIP_HOLES) {
- if (block_in_file < (i_size_read(inode) >> blkbits))
+ if (fs_startblk <= ((i_size_read(inode) - 1) >> i_blkbits))
create = 0;
}

--
2.5.5


2016-05-10 17:38:05

by Jeff Moyer

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] direct-io: cleanup get_more_blocks()

Eryu Guan <[email protected]> writes:

> Save one level of indention by returning error early.
>
> Introduce some local variables to make the code easier to read a bit,
> and do preparation for next patch.
>
> Signed-off-by: Eryu Guan <[email protected]>

Hi, Eryu,

I don't think you have a full appreciation of the amount of optimization
that goes into this code. I don't see anything wrong with what you've
done, but I also don't want to introduce all these local variables and
change a branch in order to find out several months down the line that
we introduced some TPC-C regression of .5%.

Look, I think this is all you need for the full fix:

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 4720377..f66754e 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -639,8 +639,7 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
*/
create = dio->rw & WRITE;
if (dio->flags & DIO_SKIP_HOLES) {
- if (sdio->block_in_file < (i_size_read(dio->inode) >>
- sdio->blkbits))
+ if (fs_startblk < fs_count)
create = 0;
}


Can you just test that?

Thanks,
Jeff

2016-05-11 13:23:12

by Eryu Guan

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] direct-io: cleanup get_more_blocks()

On Tue, May 10, 2016 at 01:38:05PM -0400, Jeff Moyer wrote:
> Eryu Guan <[email protected]> writes:
>
> > Save one level of indention by returning error early.
> >
> > Introduce some local variables to make the code easier to read a bit,
> > and do preparation for next patch.
> >
> > Signed-off-by: Eryu Guan <[email protected]>
>
> Hi, Eryu,
>
> I don't think you have a full appreciation of the amount of optimization
> that goes into this code. I don't see anything wrong with what you've
> done, but I also don't want to introduce all these local variables and
> change a branch in order to find out several months down the line that
> we introduced some TPC-C regression of .5%.

Agreed, I overdid it, v2 fix doesn't need the cleanup, I realized it
after I sent it out.

>
> Look, I think this is all you need for the full fix:
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 4720377..f66754e 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -639,8 +639,7 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
> */
> create = dio->rw & WRITE;
> if (dio->flags & DIO_SKIP_HOLES) {
> - if (sdio->block_in_file < (i_size_read(dio->inode) >>
> - sdio->blkbits))
> + if (fs_startblk < fs_count)
> create = 0;
> }
>
>
> Can you just test that?

I tested it and it did fix both of the issues for me. But it seems that
it's a bit overkilled, in certain case block allocation should be
allowed, but it still sets 'create' to 0.

For example, append writing 8k to a 4k sparse file (so offset is also
4k), on a 4k block size filesystem, fs_startblk(1) is smaller than
fs_count(2), so it still sets 'create' to 0. But block allocation should
be allowed in this case, and both the original code and my patch do so.

So I simplified my real fix to this (updates for comments not included):

diff --git a/fs/direct-io.c b/fs/direct-io.c
index 4720377..0cace3e 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -639,8 +639,8 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
*/
create = dio->rw & WRITE;
if (dio->flags & DIO_SKIP_HOLES) {
- if (sdio->block_in_file < (i_size_read(dio->inode) >>
- sdio->blkbits))
+ if (fs_startblk <= ((i_size_read(dio->inode) - 1) >>
+ i_blkbits))
create = 0;
}


Do you think it's a proper fix?

Thanks again for your time!

Eryu

2016-05-11 17:05:56

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] direct-io: cleanup get_more_blocks()

On Wed 11-05-16 21:23:12, Eryu Guan wrote:
> On Tue, May 10, 2016 at 01:38:05PM -0400, Jeff Moyer wrote:
> > Look, I think this is all you need for the full fix:
> >
> > diff --git a/fs/direct-io.c b/fs/direct-io.c
> > index 4720377..f66754e 100644
> > --- a/fs/direct-io.c
> > +++ b/fs/direct-io.c
> > @@ -639,8 +639,7 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
> > */
> > create = dio->rw & WRITE;
> > if (dio->flags & DIO_SKIP_HOLES) {
> > - if (sdio->block_in_file < (i_size_read(dio->inode) >>
> > - sdio->blkbits))
> > + if (fs_startblk < fs_count)

fs_count is number of blocks in the request so that is not correct...

> > create = 0;
> > }
> >
> >
> > Can you just test that?
>
> I tested it and it did fix both of the issues for me. But it seems that
> it's a bit overkilled, in certain case block allocation should be
> allowed, but it still sets 'create' to 0.
>
> For example, append writing 8k to a 4k sparse file (so offset is also
> 4k), on a 4k block size filesystem, fs_startblk(1) is smaller than
> fs_count(2), so it still sets 'create' to 0. But block allocation should
> be allowed in this case, and both the original code and my patch do so.
>
> So I simplified my real fix to this (updates for comments not included):
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 4720377..0cace3e 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -639,8 +639,8 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
> */
> create = dio->rw & WRITE;
> if (dio->flags & DIO_SKIP_HOLES) {
> - if (sdio->block_in_file < (i_size_read(dio->inode) >>
> - sdio->blkbits))
> + if (fs_startblk <= ((i_size_read(dio->inode) - 1) >>
> + i_blkbits))
> create = 0;

Yes, this is correct as far as I can tell.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-05-11 17:08:00

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] direct-io: fix stale data exposure from concurrent buffered read

On Sun 08-05-16 02:31:50, Eryu Guan wrote:
> Currently direct writes inside i_size on a DIO_SKIP_HOLES filesystem are
> not allowed to allocate blocks(get_more_blocks() sets 'create' to 0
> before calling get_block() callback), if it's a sparse file, direct
> writes fall back to buffered writes to avoid stale data exposure from
> concurrent buffered read. But there're two cases that can result in
> stale data exposure are not correctly detected.
>
> 1. The detection for "writing inside i_size" is not sufficient, writes
> can be treated as "extending writes" wrongly. For example, direct write
> 1FSB to a 1FSB sparse file on ext2/3/4, starting from offset 0, in this
> case it's writing inside i_size, but 'create' is non-zero, because
> 'block_in_file' and '(i_size_read(inode) >> blkbits' are both zero.
>
> 2. Direct writes starting from or beyong i_size (not inside i_size) also
> could trigger block allocation and expose stale data. For example,
> consider a sparse file with i_size of 2k, and a write to offset 2k or 3k
> into the file, with a filesystem block size of 4k. (Thanks to Jeff Moyer
> for pointing this case out.)
>
> The first problem can be demostrated by running ltp-aiodio test ADSP045
> many times. When testing on extN filesystems, I see test failures
> occasionally, buffered read could read non-zero (stale) data.
>
> ADSP045: dio_sparse -a 4k -w 4k -s 2k -n 1
>
> dio_sparse 0 TINFO : Dirtying free blocks
> dio_sparse 0 TINFO : Starting I/O tests
> non zero buffer at buf[0] => 0xffffffaa,ffffffaa,ffffffaa,ffffffaa
> non-zero read at offset 0
> dio_sparse 0 TINFO : Killing childrens(s)
> dio_sparse 1 TFAIL : dio_sparse.c:191: 1 children(s) exited abnormally
>
> The second problem can also be reproduced easily by a hacked dio_sparse
> program, which accepts an option to specify the write offset.
>
> What we should really do is to disable block allocation for writes that
> could result in filling holes inside i_size.
>
> Signed-off-by: Eryu Guan <[email protected]>

The patch looks good to me. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
>
> v2:
> - Fix the case Jeff pointed out as well
> - Update commit log
>
> fs/direct-io.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index 9d5aff9..5c13bbf 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -632,8 +632,10 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
> map_bh->b_size = fs_count << i_blkbits;
>
> /*
> - * For writes inside i_size on a DIO_SKIP_HOLES filesystem we
> - * forbid block creations: only overwrites are permitted.
> + * For writes that could fill holes inside i_size on a
> + * DIO_SKIP_HOLES filesystem we forbid block creations: only
> + * overwrites are permitted.
> + *
> * We will return early to the caller once we see an
> * unmapped buffer head returned, and the caller will fall
> * back to buffered I/O.
> @@ -644,7 +646,7 @@ static int get_more_blocks(struct dio *dio, struct dio_submit *sdio,
> */
> create = dio->rw & WRITE;
> if (dio->flags & DIO_SKIP_HOLES) {
> - if (block_in_file < (i_size_read(inode) >> blkbits))
> + if (fs_startblk <= ((i_size_read(inode) - 1) >> i_blkbits))
> create = 0;
> }
>
> --
> 2.5.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR