2005-03-07 10:01:10

by Sébastien Dugué

[permalink] [raw]
Subject: [PATCH] 2.6.10 - direct-io async short read bug

Hi,

When reading a file in async mode (using kernel AIO), and the file
size is lower than the requested size (short read), the direct IO
layer reports an incorrect number of bytes read (transferred).

That case is handled for the sync path in 'direct_io_worker'
(fs/direct-io.c) where a check is made against the file size.

This patch does the same thing for the async path. It applies to
a vanilla 2.6.10 kernel.

Please CC: me any comments and/or suggestions.

S?bastien.

------------------------------------------------------

S?bastien Dugu? BULL/FREC:B1-247
phone: (+33) 476 29 77 70 Bullcom: 229-7770

mailto:[email protected]

Linux POSIX AIO: http://www.bullopensource.org/posix

------------------------------------------------------


Attachments:
directio_async_bug.patch (1.14 kB)

2005-03-08 06:44:48

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

S?bastien Dugu? <[email protected]> wrote:
>
> When reading a file in async mode (using kernel AIO), and the file
> size is lower than the requested size (short read), the direct IO
> layer reports an incorrect number of bytes read (transferred).
>
> That case is handled for the sync path in 'direct_io_worker'
> (fs/direct-io.c) where a check is made against the file size.
>
> This patch does the same thing for the async path.

It looks sane to me. It needs a couple of fixes, below. One of them is
horrid and isn't really a fix at all, but it improves things.

Would Suparna and Badari have time to check the logic of these two patches
please?




- i_size is 64 bit, ssize_t is 32-bit

- whitespace tweaks.

- i_size_read() in interrupt context is a no-no.

Signed-off-by: Andrew Morton <[email protected]>
---

25-akpm/fs/direct-io.c | 14 +++++++++++---
1 files changed, 11 insertions(+), 3 deletions(-)

diff -puN fs/direct-io.c~direct-io-async-short-read-fix-fix fs/direct-io.c
--- 25/fs/direct-io.c~direct-io-async-short-read-fix-fix 2005-03-07 22:28:52.000000000 -0800
+++ 25-akpm/fs/direct-io.c 2005-03-07 22:37:18.000000000 -0800
@@ -231,7 +231,7 @@ static void finished_one_bio(struct dio
if (dio->bio_count == 1) {
if (dio->is_async) {
ssize_t transferred;
- ssize_t i_size;
+ loff_t i_size;
loff_t offset;

/*
@@ -241,11 +241,19 @@ static void finished_one_bio(struct dio
spin_unlock_irqrestore(&dio->bio_lock, flags);

/* Check for short read case */
+
+ /*
+ * We should use i_size_read() here. But we're called
+ * in interrupt context. If this CPU happened to be
+ * in the middle of i_size_write() when the interrupt
+ * occurred, i_size_read() would lock up.
+ * So we just risk getting a wrong result instead :(
+ */
+ i_size = dio->inode->i_size;
transferred = dio->result;
- i_size = i_size_read (dio->inode);
offset = dio->iocb->ki_pos;

- if ((dio->rw == READ) && ((offset + transferred) > i_size))
+ if ((dio->rw == READ) && (offset+transferred > i_size))
transferred = i_size - offset;

dio_complete(dio, offset, transferred);
_

2005-03-08 09:01:00

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

On Mon, Mar 07, 2005 at 10:39:17PM -0800, Andrew Morton wrote:
> 6
> Lines: 76
>
> S?bastien Dugu? <[email protected]> wrote:
> >
> > When reading a file in async mode (using kernel AIO), and the file
> > size is lower than the requested size (short read), the direct IO
> > layer reports an incorrect number of bytes read (transferred).
> >
> > That case is handled for the sync path in 'direct_io_worker'
> > (fs/direct-io.c) where a check is made against the file size.
> >
> > This patch does the same thing for the async path.
>
> It looks sane to me. It needs a couple of fixes, below. One of them is
> horrid and isn't really a fix at all, but it improves things.
>
> Would Suparna and Badari have time to check the logic of these two patches
> please?
>

Bugs in this area seem never-ending don't they - plug one, open up
another - hard to be confident/verify :( - someday we'll have to
rewrite a part of this code.

Hmm, shouldn't dio->result ideally have been adjusted to be within
i_size at the time of io submission, so we don't have to deal with
this during completion ? We are creating bios with the right size
after all.

We have this:
if (!buffer_mapped(map_bh)) {
....
if (dio->block_in_file >=
i_size_read(dio->inode)>>blkbits) {
/* We hit eof */
page_cache_release(page);
goto out;
}

and
dio->result += iov[seg].iov_len -
((dio->final_block_in_request - dio->block_in_file) <<
blkbits);


can you spot what is going wrong here that we have to try and
workaround this later ?

Regards
Suparna

>
>
>
> - i_size is 64 bit, ssize_t is 32-bit
>
> - whitespace tweaks.
>
> - i_size_read() in interrupt context is a no-no.
>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> 25-akpm/fs/direct-io.c | 14 +++++++++++---
> 1 files changed, 11 insertions(+), 3 deletions(-)
>
> diff -puN fs/direct-io.c~direct-io-async-short-read-fix-fix fs/direct-io.c
> --- 25/fs/direct-io.c~direct-io-async-short-read-fix-fix 2005-03-07 22:28:52.000000000 -0800
> +++ 25-akpm/fs/direct-io.c 2005-03-07 22:37:18.000000000 -0800
> @@ -231,7 +231,7 @@ static void finished_one_bio(struct dio
> if (dio->bio_count == 1) {
> if (dio->is_async) {
> ssize_t transferred;
> - ssize_t i_size;
> + loff_t i_size;
> loff_t offset;
>
> /*
> @@ -241,11 +241,19 @@ static void finished_one_bio(struct dio
> spin_unlock_irqrestore(&dio->bio_lock, flags);
>
> /* Check for short read case */
> +
> + /*
> + * We should use i_size_read() here. But we're called
> + * in interrupt context. If this CPU happened to be
> + * in the middle of i_size_write() when the interrupt
> + * occurred, i_size_read() would lock up.
> + * So we just risk getting a wrong result instead :(
> + */
> + i_size = dio->inode->i_size;
> transferred = dio->result;
> - i_size = i_size_read (dio->inode);
> offset = dio->iocb->ki_pos;
>
> - if ((dio->rw == READ) && ((offset + transferred) > i_size))
> + if ((dio->rw == READ) && (offset+transferred > i_size))
> transferred = i_size - offset;
>
> dio_complete(dio, offset, transferred);
> _
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"[email protected]">aart@k
--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2005-03-08 09:19:31

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

Suparna Bhattacharya <[email protected]> wrote:
>
> ...
>
> Bugs in this area seem never-ending don't they - plug one, open up
> another - hard to be confident/verify :( - someday we'll have to
> rewrite a part of this code.

It's solving a complex problem. Any rewrite would probably end up just as
hairy once all the new bugs and corner cases are fixed. Maybe.


> Hmm, shouldn't dio->result ideally have been adjusted to be within
> i_size at the time of io submission, so we don't have to deal with
> this during completion ? We are creating bios with the right size
> after all.
>
> We have this:
> if (!buffer_mapped(map_bh)) {
> ....
> if (dio->block_in_file >=
> i_size_read(dio->inode)>>blkbits) {
> /* We hit eof */
> page_cache_release(page);
> goto out;
> }
>
> and
> dio->result += iov[seg].iov_len -
> ((dio->final_block_in_request - dio->block_in_file) <<
> blkbits);
>
>
> can you spot what is going wrong here that we have to try and
> workaround this later ?

Good question. Do we have the i_sem coverage to prevent a concurrent
truncate?

But from Sebastien's description it doesn't soound as if a concurrent
truncate is involved.

2005-03-08 09:32:26

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

On Tue, Mar 08, 2005 at 01:18:14AM -0800, Andrew Morton wrote:
> Suparna Bhattacharya <[email protected]> wrote:
> >
> > ...
> >
> > Bugs in this area seem never-ending don't they - plug one, open up
> > another - hard to be confident/verify :( - someday we'll have to
> > rewrite a part of this code.
>
> It's solving a complex problem. Any rewrite would probably end up just as
> hairy once all the new bugs and corner cases are fixed. Maybe.
>
>
> > Hmm, shouldn't dio->result ideally have been adjusted to be within
> > i_size at the time of io submission, so we don't have to deal with
> > this during completion ? We are creating bios with the right size
> > after all.
> >
> > We have this:
> > if (!buffer_mapped(map_bh)) {
> > ....
> > if (dio->block_in_file >=
> > i_size_read(dio->inode)>>blkbits) {
> > /* We hit eof */
> > page_cache_release(page);
> > goto out;
> > }
> >
> > and
> > dio->result += iov[seg].iov_len -
> > ((dio->final_block_in_request - dio->block_in_file) <<
> > blkbits);
> >
> >
> > can you spot what is going wrong here that we have to try and
> > workaround this later ?
>
> Good question. Do we have the i_sem coverage to prevent a concurrent
> truncate?
>
> But from Sebastien's description it doesn't soound as if a concurrent
> truncate is involved.

Daniel McNeil has a testcase that reproduces the problem - seemed
like a single thread thing - that's what puzzles me.

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2005-03-08 09:42:59

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

Suparna Bhattacharya <[email protected]> wrote:
>
> > > can you spot what is going wrong here that we have to try and
> > > workaround this later ?
> >
> > Good question. Do we have the i_sem coverage to prevent a concurrent
> > truncate?
> >
> > But from Sebastien's description it doesn't soound as if a concurrent
> > truncate is involved.
>
> Daniel McNeil has a testcase that reproduces the problem - seemed
> like a single thread thing - that's what puzzles me.

OK. It'd be nice if we could find a solution which gets around that i_size
access in the ISR, if someone has the time to look into it?

2005-03-08 09:48:03

by Sébastien Dugué

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

Le mardi 08 mars 2005 ? 01:18 -0800, Andrew Morton a ?crit :
> Suparna Bhattacharya <[email protected]> wrote:
> >
> > ...
> >
> > Bugs in this area seem never-ending don't they - plug one, open up
> > another - hard to be confident/verify :( - someday we'll have to
> > rewrite a part of this code.
>
> It's solving a complex problem. Any rewrite would probably end up
just as
> hairy once all the new bugs and corner cases are fixed. Maybe.
>
>
> > Hmm, shouldn't dio->result ideally have been adjusted to be within
> > i_size at the time of io submission, so we don't have to deal with
> > this during completion ? We are creating bios with the right size
> > after all.
> >
> > We have this:
> > if (!buffer_mapped(map_bh)) {
> > ....
> > if (dio->block_in_file >=
> >
i_size_read(dio->inode)>>blkbits) {
> > /* We hit eof */
> > page_cache_release(page);
> > goto out;
> > }
> >
> > and
> > dio->result += iov[seg].iov_len -
> > ((dio->final_block_in_request -
dio->block_in_file) <<
> > blkbits);
> >
> >
> > can you spot what is going wrong here that we have to try and
> > workaround this later ?
>
> Good question. Do we have the i_sem coverage to prevent a concurrent
> truncate?
>
> But from Sebastien's description it doesn't soound as if a concurrent
> truncate is involved.

Yes, you're right, there's no concurrent truncate here. My test case
as well as Daniel's is single threaded and is the only thread accessing
the file.

Regards,

S?bastien.

--
------------------------------------------------------

S?bastien Dugu? BULL/FREC:B1-247
phone: (+33) 476 29 77 70 Bullcom: 229-7770

mailto:[email protected]

Linux POSIX AIO: http://www.bullopensource.org/posix

------------------------------------------------------

2005-03-08 10:17:26

by Sébastien Dugué

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

Le mardi 08 mars 2005 ? 01:18 -0800, Andrew Morton a ?crit :
> Suparna Bhattacharya <[email protected]> wrote:
> >
> > ...
> >
> > Bugs in this area seem never-ending don't they - plug one, open up
> > another - hard to be confident/verify :( - someday we'll have to
> > rewrite a part of this code.
>
> It's solving a complex problem. Any rewrite would probably end up
just as
> hairy once all the new bugs and corner cases are fixed. Maybe.
>
>
> > Hmm, shouldn't dio->result ideally have been adjusted to be within
> > i_size at the time of io submission, so we don't have to deal with
> > this during completion ? We are creating bios with the right size
> > after all.
> >
> > We have this:
> > if (!buffer_mapped(map_bh)) {
> > ....
> > if (dio->block_in_file >=
> >
i_size_read(dio->inode)>>blkbits) {
> > /* We hit eof */
> > page_cache_release(page);
> > goto out;
> > }
> >
> > and
> > dio->result += iov[seg].iov_len -
> > ((dio->final_block_in_request -
dio->block_in_file) <<
> > blkbits);
> >
> >
> > can you spot what is going wrong here that we have to try and
> > workaround this later ?
>
> Good question. Do we have the i_sem coverage to prevent a concurrent
> truncate?
>
> But from Sebastien's description it doesn't soound as if a concurrent
> truncate is involved.

Yes, you're right, there's no concurrent truncate here. My test case
as well as Daniel's is single threaded and is the only thread accessing
the file.

Regards,

S?bastien.

--
------------------------------------------------------

S?bastien Dugu? BULL/FREC:B1-247
phone: (+33) 476 29 77 70 Bullcom: 229-7770

mailto:[email protected]

Linux POSIX AIO: http://www.bullopensource.org/posix

------------------------------------------------------

2005-03-08 10:20:10

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

On Tue, Mar 08, 2005 at 01:41:41AM -0800, Andrew Morton wrote:
> Suparna Bhattacharya <[email protected]> wrote:
> >
> > > > can you spot what is going wrong here that we have to try and
> > > > workaround this later ?
> > >
> > > Good question. Do we have the i_sem coverage to prevent a concurrent
> > > truncate?
> > >
> > > But from Sebastien's description it doesn't soound as if a concurrent
> > > truncate is involved.
> >
> > Daniel McNeil has a testcase that reproduces the problem - seemed
> > like a single thread thing - that's what puzzles me.
>
> OK. It'd be nice if we could find a solution which gets around that i_size
> access in the ISR, if someone has the time to look into it?

We hold i_alloc_sem right until dio_complete, don't we ? Is it still
a problem ?

But, I'm wondering if this solution just covers up the real reason
for the single thread problem ? We shouldn't even be issuing IOs
beyond i_size into the user-space buffer !

BTW, part of what I'd have liked in a rewrite would be to unify more of the
AIO and sync io paths, consider doing all this in task context (like in
the retry code) etc. Would reduce the number of cases to watch out for.
But that's a different matter, not for now. I agree that the buffered
vs DIO stuff in combo with AIO is what makes this really complex, so
it is a hard problem.

Regards
Suparna

>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"[email protected]">[email protected]</a>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2005-03-08 10:28:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

Suparna Bhattacharya <[email protected]> wrote:
>
> But, I'm wondering if this solution just covers up the real reason
> for the single thread problem ? We shouldn't even be issuing IOs
> beyond i_size into the user-space buffer !

That's true.

2005-03-08 17:27:31

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

On Tue, 2005-03-08 at 01:09, Suparna Bhattacharya wrote:

>
> Hmm, shouldn't dio->result ideally have been adjusted to be within
> i_size at the time of io submission, so we don't have to deal with
> this during completion ? We are creating bios with the right size
> after all.
>
> We have this:
> if (!buffer_mapped(map_bh)) {
> ....
> if (dio->block_in_file >=
> i_size_read(dio->inode)>>blkbits) {
> /* We hit eof */
> page_cache_release(page);
> goto out;
> }
>

This check will catch only if there is no block on the disk. In the
current test case, the filesize = 3K and filesystem blocksize=4K.
So, we have a block (beyond filesize). The test tries to read 4K.
None of the checks catch this case and it submits 4K IO.

> and
> dio->result += iov[seg].iov_len -
> ((dio->final_block_in_request - dio->block_in_file) <<
> blkbits);
>
>
> can you spot what is going wrong here that we have to try and
> workaround this later ?


Andrew, please don't apply the original patch. We shouldn't even attempt
to submit IO beyond the filesize. We should truncate the IO request to
filesize. I will send a patch today to fix this.

Thanks,
Badari

2005-03-08 19:29:12

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug


> Andrew, please don't apply the original patch. We shouldn't even attempt
> to submit IO beyond the filesize. We should truncate the IO request to
> filesize. I will send a patch today to fix this.
>

Well, spoke too soon. This is an ugly corner case :( But I have
a ugly hack to fix it :)

Let me ask you a basic question: Do we support DIO reads on a file
which is not blocksize multiple in size ? (say 12K - 10 bytes) ?

What about the ones which are not 4K but 512 byte multiple ? (say 7K) ?

I need answer to those, to figure out how hard I should try to fix this.

Anyway, here is ugly version of the patch - which will limit the IO
size to filesize and uses lower blocksizes to read the file (since
the filesize is only 3K, it would go down to 512 byte blocksize).

Thanks,
Badari


Attachments:
aio-dio-short-read.patch (921.00 B)

2005-03-08 23:59:56

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

On Tue, 2005-03-08 at 15:27, Daniel McNeil wrote:
> On Tue, 2005-03-08 at 11:18, Badari Pulavarty wrote:
> > > Andrew, please don't apply the original patch. We shouldn't even attempt
> > > to submit IO beyond the filesize. We should truncate the IO request to
> > > filesize. I will send a patch today to fix this.
> > >
> >
> > Well, spoke too soon. This is an ugly corner case :( But I have
> > a ugly hack to fix it :)
> >
> > Let me ask you a basic question: Do we support DIO reads on a file
> > which is not blocksize multiple in size ? (say 12K - 10 bytes) ?
>
> > What about the ones which are not 4K but 512 byte multiple ? (say 7K) ?
> >
> > I need answer to those, to figure out how hard I should try to fix this.
> >
> > Anyway, here is ugly version of the patch - which will limit the IO
> > size to filesize and uses lower blocksizes to read the file (since
> > the filesize is only 3K, it would go down to 512 byte blocksize).
>
> Badari,
>
> Just to be clear, are you just asking about what we should support on a
> DIO read that spans EOF?
>
> For DIO reads past EOF the options are:
>
> 1. return EINVAL if the DIO goes past EOF.
>
> 2. truncate the request to file size (which is what your patch does)
> and if it works, it works.
>
> 3. truncate the request to a size that actually works - like a multiple
> of 512.
>
> 4. Do the full i/o since the user buffer is big enough, truncate the
> result returned to file size (and clear out the user buffer where it
> read past EOF).
>
> Number 4 would make it easy on the user-level code, but AIO DIO might be
> a bit tricky and might be a security hole since the data would be dma'ed
> there and then cleared. I need to look at the code some more.
>
> 1 and 2 both seem valid and are better than returning the wrong result.
>
> 3 seems like it would confuse applications.
>
> Thoughts?

Well, my basic questions were

1) Should we support DIO on files with sizes that are not multiple of
filesystem blocksize (but multiple of 512 bytes) == say 6K.
I think so, and we should kick it to use 512byte blocksize in the
DIO code.

2) Should we support DIO on files with sizes that are not multiple of
filesystem blocksize and also multiple of 512 bytes == say 8190 bytes.
I think NOT.

The problem is, we check alignment restrictions for buffer, offset
and IO sizes. We don't care about the filesize - only issue comes
is if the user tries to go beyond EOF.

So on a 8190 byte file, with DIO if the user does
lseek(fd, 4096, 0);
read(fd, buf, 4096);

what should read() return ? read() is expected to return 4094 bytes or
EINVAL ?


Thanks,
Badari

2005-03-09 00:56:01

by Daniel McNeil

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

On Tue, 2005-03-08 at 11:18, Badari Pulavarty wrote:
> > Andrew, please don't apply the original patch. We shouldn't even attempt
> > to submit IO beyond the filesize. We should truncate the IO request to
> > filesize. I will send a patch today to fix this.
> >
>
> Well, spoke too soon. This is an ugly corner case :( But I have
> a ugly hack to fix it :)
>
> Let me ask you a basic question: Do we support DIO reads on a file
> which is not blocksize multiple in size ? (say 12K - 10 bytes) ?

> What about the ones which are not 4K but 512 byte multiple ? (say 7K) ?
>
> I need answer to those, to figure out how hard I should try to fix this.
>
> Anyway, here is ugly version of the patch - which will limit the IO
> size to filesize and uses lower blocksizes to read the file (since
> the filesize is only 3K, it would go down to 512 byte blocksize).

Badari,

Just to be clear, are you just asking about what we should support on a
DIO read that spans EOF?

For DIO reads past EOF the options are:

1. return EINVAL if the DIO goes past EOF.

2. truncate the request to file size (which is what your patch does)
and if it works, it works.

3. truncate the request to a size that actually works - like a multiple
of 512.

4. Do the full i/o since the user buffer is big enough, truncate the
result returned to file size (and clear out the user buffer where it
read past EOF).

Number 4 would make it easy on the user-level code, but AIO DIO might be
a bit tricky and might be a security hole since the data would be dma'ed
there and then cleared. I need to look at the code some more.

1 and 2 both seem valid and are better than returning the wrong result.

3 seems like it would confuse applications.

Thoughts?

Daniel





2005-03-09 02:22:21

by Daniel McNeil

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

On Tue, 2005-03-08 at 11:18, Badari Pulavarty wrote:
> > Andrew, please don't apply the original patch. We shouldn't even attempt
> > to submit IO beyond the filesize. We should truncate the IO request to
> > filesize. I will send a patch today to fix this.
> >
>
> Well, spoke too soon. This is an ugly corner case :( But I have
> a ugly hack to fix it :)
>
> Let me ask you a basic question: Do we support DIO reads on a file
> which is not blocksize multiple in size ? (say 12K - 10 bytes) ?
>
> What about the ones which are not 4K but 512 byte multiple ? (say 7K) ?
>
> I need answer to those, to figure out how hard I should try to fix this.
>
> Anyway, here is ugly version of the patch - which will limit the IO
> size to filesize and uses lower blocksizes to read the file (since
> the filesize is only 3K, it would go down to 512 byte blocksize).
>

BTW, I got a compile error because the 'iov' parameter is
declared with 'const', so your patch is changing a read-only
value. :(

Daniel

2005-03-09 04:18:07

by Joel Becker

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

On Tue, Mar 08, 2005 at 03:54:04PM -0800, Badari Pulavarty wrote:
> > 1. return EINVAL if the DIO goes past EOF.
> >
> > 2. truncate the request to file size (which is what your patch does)
> > and if it works, it works.
> >
> > 3. truncate the request to a size that actually works - like a multiple
> > of 512.
> >
> > 4. Do the full i/o since the user buffer is big enough, truncate the
> > result returned to file size (and clear out the user buffer where it
> > read past EOF).
> >
> > Number 4 would make it easy on the user-level code, but AIO DIO might be
> > a bit tricky and might be a security hole since the data would be dma'ed
> > there and then cleared. I need to look at the code some more.

Solaris, which does forcedirectio as a mount option, actually
will do buffered I/O on the trailing part. Consider it like a bounce
buffer. That way they don't DMA the trailing data and succeed the I/O.
The I/O returns actual bytes till EOF, just like read(2) is supposed to.
Either this or a fully DMA'd number 4 is really what we should
do. If security can only be solved via a bounce buffer, who cares? If
the user created themselves a non-aligned file to open O_DIRECT, that's
their problem if the last part-sector is negligably slower.

Joel

--

Life's Little Instruction Book #3

"Watch a sunrise at least once a year."

Joel Becker
Senior Member of Technical Staff
Oracle
E-mail: [email protected]
Phone: (650) 506-8127

2005-03-09 09:51:50

by Sébastien Dugué

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

Le mardi 08 mars 2005 ? 20:07 -0800, Joel Becker a ?crit :
> On Tue, Mar 08, 2005 at 03:54:04PM -0800, Badari Pulavarty wrote:
> > > 1. return EINVAL if the DIO goes past EOF.
> > >
> > > 2. truncate the request to file size (which is what your patch does)
> > > and if it works, it works.
> > >
> > > 3. truncate the request to a size that actually works - like a multiple
> > > of 512.
> > >
> > > 4. Do the full i/o since the user buffer is big enough, truncate the
> > > result returned to file size (and clear out the user buffer where it
> > > read past EOF).
> > >
> > > Number 4 would make it easy on the user-level code, but AIO DIO might be
> > > a bit tricky and might be a security hole since the data would be dma'ed
> > > there and then cleared. I need to look at the code some more.
>
> Solaris, which does forcedirectio as a mount option, actually
> will do buffered I/O on the trailing part. Consider it like a bounce
> buffer. That way they don't DMA the trailing data and succeed the I/O.
> The I/O returns actual bytes till EOF, just like read(2) is supposed to.
> Either this or a fully DMA'd number 4 is really what we should
> do. If security can only be solved via a bounce buffer, who cares? If
> the user created themselves a non-aligned file to open O_DIRECT, that's
> their problem if the last part-sector is negligably slower.
>
> Joel
>

From a user's perspective, I tend to agree with Joel. A read going
past EOF should IMHO return the data till EOF and reflect it in the
return value. The penalty incured by the trailing part going buffered
is the price to pay for not having a fully aligned file.

I think doing otherwise would break read(2) semantics.

S?bastien.

------------------------------------------------------

S?bastien Dugu? BULL/FREC:B1-247
phone: (+33) 476 29 77 70 Bullcom: 229-7770

mailto:[email protected]

Linux POSIX AIO: http://www.bullopensource.org/posix

------------------------------------------------------

2005-03-09 15:11:23

by Suparna Bhattacharya

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

On Tue, Mar 08, 2005 at 08:07:57PM -0800, Joel Becker wrote:
> On Tue, Mar 08, 2005 at 03:54:04PM -0800, Badari Pulavarty wrote:
> > > 1. return EINVAL if the DIO goes past EOF.
> > >
> > > 2. truncate the request to file size (which is what your patch does)
> > > and if it works, it works.
> > >
> > > 3. truncate the request to a size that actually works - like a multiple
> > > of 512.
> > >
> > > 4. Do the full i/o since the user buffer is big enough, truncate the
> > > result returned to file size (and clear out the user buffer where it
> > > read past EOF).
> > >
> > > Number 4 would make it easy on the user-level code, but AIO DIO might be
> > > a bit tricky and might be a security hole since the data would be dma'ed
> > > there and then cleared. I need to look at the code some more.
>
> Solaris, which does forcedirectio as a mount option, actually
> will do buffered I/O on the trailing part. Consider it like a bounce
> buffer. That way they don't DMA the trailing data and succeed the I/O.
> The I/O returns actual bytes till EOF, just like read(2) is supposed to.
> Either this or a fully DMA'd number 4 is really what we should
> do. If security can only be solved via a bounce buffer, who cares? If
> the user created themselves a non-aligned file to open O_DIRECT, that's
> their problem if the last part-sector is negligably slower.

If writes/truncates take care of zeroing out the rest of the sector
on disk, might we still be OK without having to do the bounce buffer
thing ?

Regards
Suparna

>
> --
>
> Life's Little Instruction Book #3
>
> "Watch a sunrise at least once a year."
>
> Joel Becker
> Senior Member of Technical Staff
> Oracle
> E-mail: [email protected]
> Phone: (650) 506-8127
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: <a href=mailto:"[email protected]">[email protected]</a>

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Lab, India

2005-03-09 16:05:13

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

On Tue, 2005-03-08 at 17:44, Daniel McNeil wrote:
> On Tue, 2005-03-08 at 11:18, Badari Pulavarty wrote:
> > > Andrew, please don't apply the original patch. We shouldn't even attempt
> > > to submit IO beyond the filesize. We should truncate the IO request to
> > > filesize. I will send a patch today to fix this.
> > >
> >
> > Well, spoke too soon. This is an ugly corner case :( But I have
> > a ugly hack to fix it :)
> >
> > Let me ask you a basic question: Do we support DIO reads on a file
> > which is not blocksize multiple in size ? (say 12K - 10 bytes) ?
> >
> > What about the ones which are not 4K but 512 byte multiple ? (say 7K) ?
> >
> > I need answer to those, to figure out how hard I should try to fix this.
> >
> > Anyway, here is ugly version of the patch - which will limit the IO
> > size to filesize and uses lower blocksizes to read the file (since
> > the filesize is only 3K, it would go down to 512 byte blocksize).
> >
>
> BTW, I got a compile error because the 'iov' parameter is
> declared with 'const', so your patch is changing a read-only
> value. :(
>
> Daniel
>

Told you, it was ugly and I noted in the comments too :)

BTW, I don't think that is the correct solution. I guess,
we need to do the IO beyond EOF and zero out the data.
And also, this can happen only if the EOF is in the middle
of the block - so its an issue only for the last block.

What do you think ?

Thanks,
Badari
>

2005-03-09 20:01:19

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

Suparna Bhattacharya <[email protected]> wrote:
>
> > Solaris, which does forcedirectio as a mount option, actually
> > will do buffered I/O on the trailing part. Consider it like a bounce
> > buffer. That way they don't DMA the trailing data and succeed the I/O.
> > The I/O returns actual bytes till EOF, just like read(2) is supposed to.
> > Either this or a fully DMA'd number 4 is really what we should
> > do. If security can only be solved via a bounce buffer, who cares? If
> > the user created themselves a non-aligned file to open O_DIRECT, that's
> > their problem if the last part-sector is negligably slower.
>
> If writes/truncates take care of zeroing out the rest of the sector
> on disk, might we still be OK without having to do the bounce buffer
> thing ?

We can probably rely on the rest of the sector outside i_size being zeroed
anyway. Because if it contains non-zero gunk then the fs already has a
problem, and the user can get at that gunk with an expanding truncate and
mmap() anyway.


2005-03-09 21:19:49

by Joel Becker

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

On Wed, Mar 09, 2005 at 11:53:48AM -0800, Andrew Morton wrote:
> Suparna Bhattacharya <[email protected]> wrote:
> > If writes/truncates take care of zeroing out the rest of the sector
> > on disk, might we still be OK without having to do the bounce buffer
> > thing ?
>
> We can probably rely on the rest of the sector outside i_size being zeroed
> anyway. Because if it contains non-zero gunk then the fs already has a
> problem, and the user can get at that gunk with an expanding truncate and
> mmap() anyway.

Actually, yeah, even today we rely on block_prepare_write and
friends to handle that trail zeroing. That all happens after the sector
has been read from disk. So this should be analogous.

Joel

--

Life's Little Instruction Book #396

"Never give anyone a fruitcake."

http://www.jlbec.org/
[email protected]

2005-03-09 21:37:30

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

On Wed, 2005-03-09 at 11:53, Andrew Morton wrote:
> Suparna Bhattacharya <[email protected]> wrote:
> >
> > > Solaris, which does forcedirectio as a mount option, actually
> > > will do buffered I/O on the trailing part. Consider it like a bounce
> > > buffer. That way they don't DMA the trailing data and succeed the I/O.
> > > The I/O returns actual bytes till EOF, just like read(2) is supposed to.
> > > Either this or a fully DMA'd number 4 is really what we should
> > > do. If security can only be solved via a bounce buffer, who cares? If
> > > the user created themselves a non-aligned file to open O_DIRECT, that's
> > > their problem if the last part-sector is negligably slower.
> >
> > If writes/truncates take care of zeroing out the rest of the sector
> > on disk, might we still be OK without having to do the bounce buffer
> > thing ?
>
> We can probably rely on the rest of the sector outside i_size being zeroed
> anyway. Because if it contains non-zero gunk then the fs already has a
> problem, and the user can get at that gunk with an expanding truncate and
> mmap() anyway.
>

Rest of the sector or rest of the block ? Are you implying that, we
already do this, so there is no problem reading beyond EOF to user
buffer ? Or we need to zero out the userbuffer beyond EOF ?


Thanks,
Badari


2005-03-09 23:54:57

by Badari Pulavarty

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

On Wed, 2005-03-09 at 14:39, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> >
> > On Wed, 2005-03-09 at 11:53, Andrew Morton wrote:
> > > Suparna Bhattacharya <[email protected]> wrote:
> > > >
> > > > > Solaris, which does forcedirectio as a mount option, actually
> > > > > will do buffered I/O on the trailing part. Consider it like a bounce
> > > > > buffer. That way they don't DMA the trailing data and succeed the I/O.
> > > > > The I/O returns actual bytes till EOF, just like read(2) is supposed to.
> > > > > Either this or a fully DMA'd number 4 is really what we should
> > > > > do. If security can only be solved via a bounce buffer, who cares? If
> > > > > the user created themselves a non-aligned file to open O_DIRECT, that's
> > > > > their problem if the last part-sector is negligably slower.
> > > >
> > > > If writes/truncates take care of zeroing out the rest of the sector
> > > > on disk, might we still be OK without having to do the bounce buffer
> > > > thing ?
> > >
> > > We can probably rely on the rest of the sector outside i_size being zeroed
> > > anyway. Because if it contains non-zero gunk then the fs already has a
> > > problem, and the user can get at that gunk with an expanding truncate and
> > > mmap() anyway.
> > >
> >
> > Rest of the sector or rest of the block ?
>
> The filesystem-sized block (1<<i_blkbits) which straddles i_size should
> have zeroes outside i_size.
>
> There's one situation where it might not be zeroed out, and that's when the
> final page is mapped MAP_SHARED and the application modifies that page
> outside i_size while writeout is actually in flight. We can't do much about
> that.
>
> > Are you implying that, we
> > already do this, so there is no problem reading beyond EOF to user
> > buffer ? Or we need to zero out the userbuffer beyond EOF ?
>
> It should be acceptable to assume that the final (1<<i_blkbits) block of
> the file contains zeroes outside i_size.
>
> And if it doesn't contain those zeroes, well, applications are able to read
> that data already. Although I wouldn't count that as a security hole: that
> data is something which an application wrote there while writing the file,
> rather than being left-over uncontrolled stuff.


Well, in that case - the original patch sent out is good enough to fix
the problem. All the original patch did was after completing the IO,
truncated the size to filesize. The problem is only with the last
block in the file. If the file ends in the middle of the block, we
go ahead and read till the end of the block. I was trying to address
that issue. But, if the block is already zeroed, just truncating the
size after the IO is complete should be good enough.


Thanks,
Badari

2005-03-10 04:29:51

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] 2.6.10 - direct-io async short read bug

Badari Pulavarty <[email protected]> wrote:
>
> On Wed, 2005-03-09 at 11:53, Andrew Morton wrote:
> > Suparna Bhattacharya <[email protected]> wrote:
> > >
> > > > Solaris, which does forcedirectio as a mount option, actually
> > > > will do buffered I/O on the trailing part. Consider it like a bounce
> > > > buffer. That way they don't DMA the trailing data and succeed the I/O.
> > > > The I/O returns actual bytes till EOF, just like read(2) is supposed to.
> > > > Either this or a fully DMA'd number 4 is really what we should
> > > > do. If security can only be solved via a bounce buffer, who cares? If
> > > > the user created themselves a non-aligned file to open O_DIRECT, that's
> > > > their problem if the last part-sector is negligably slower.
> > >
> > > If writes/truncates take care of zeroing out the rest of the sector
> > > on disk, might we still be OK without having to do the bounce buffer
> > > thing ?
> >
> > We can probably rely on the rest of the sector outside i_size being zeroed
> > anyway. Because if it contains non-zero gunk then the fs already has a
> > problem, and the user can get at that gunk with an expanding truncate and
> > mmap() anyway.
> >
>
> Rest of the sector or rest of the block ?

The filesystem-sized block (1<<i_blkbits) which straddles i_size should
have zeroes outside i_size.

There's one situation where it might not be zeroed out, and that's when the
final page is mapped MAP_SHARED and the application modifies that page
outside i_size while writeout is actually in flight. We can't do much about
that.

> Are you implying that, we
> already do this, so there is no problem reading beyond EOF to user
> buffer ? Or we need to zero out the userbuffer beyond EOF ?

It should be acceptable to assume that the final (1<<i_blkbits) block of
the file contains zeroes outside i_size.

And if it doesn't contain those zeroes, well, applications are able to read
that data already. Although I wouldn't count that as a security hole: that
data is something which an application wrote there while writing the file,
rather than being left-over uncontrolled stuff.