This patch makes an attempt at supporting the O_NONBLOCK flag for regular
files. It's pretty straight-forward. One limitation is that we still call
into the readahead code, which I believe can block. However, if we don't
do this, then an application which only uses non-blocking reads may never
get it's data.
Comments welcome.
-Jeff
--- linux-2.6.8/mm/filemap.c.orig 2004-09-30 16:33:46.881129560 -0400
+++ linux-2.6.8/mm/filemap.c 2004-09-30 16:34:12.109294296 -0400
@@ -720,7 +720,7 @@ void do_generic_mapping_read(struct addr
unsigned long index, end_index, offset;
loff_t isize;
struct page *cached_page;
- int error;
+ int error, nonblock = filp->f_flags & O_NONBLOCK;
struct file_ra_state ra = *_ra;
cached_page = NULL;
@@ -755,10 +755,20 @@ find_page:
page = find_get_page(mapping, index);
if (unlikely(page == NULL)) {
handle_ra_miss(mapping, &ra, index);
+ if (nonblock) {
+ desc->error = -EWOULDBLOCK;
+ break;
+ }
goto no_cached_page;
}
- if (!PageUptodate(page))
+ if (!PageUptodate(page)) {
+ if (nonblock) {
+ page_cache_release(page);
+ desc->error = -EWOULDBLOCK;
+ break;
+ }
goto page_not_up_to_date;
+ }
page_ok:
/* If users can be writing to this page using arbitrary
@@ -1004,7 +1014,7 @@ __generic_file_aio_read(struct kiocb *io
desc.error = 0;
do_generic_file_read(filp,ppos,&desc,file_read_actor);
retval += desc.written;
- if (!retval) {
+ if (!retval || desc.error) {
retval = desc.error;
break;
}
Hi!
> This patch makes an attempt at supporting the O_NONBLOCK flag for regular
> files. It's pretty straight-forward. One limitation is that we still call
> into the readahead code, which I believe can block. However, if we don't
> do this, then an application which only uses non-blocking reads may never
> get it's data.
This looks very nice. Does it mean that aio and friends are instantly obsolete?
Does it have comparable performance to aio?
Pavel
--
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms
On Fri, Oct 01, 2004 at 04:57:50PM -0400, Jeff Moyer wrote:
> This patch makes an attempt at supporting the O_NONBLOCK flag for regular
> files. It's pretty straight-forward. One limitation is that we still call
> into the readahead code, which I believe can block. However, if we don't
> do this, then an application which only uses non-blocking reads may never
> get it's data.
>
> Comments welcome.
Hi Jeff,
Curiosity: Is this defined in any UNIX standard?
Adv. Programming in the UNIX environment says:
12.2 Nonblocking I/O
"Nonblocking I/O lets us issue an I/O operation, such as open, read,
or write and not have it block forever. If the operation cannot be
completed, return is made immediately with an error noting that
the operation would have blocked."
He is talking about read's on descriptors (pipe's, devices, etc) which
block in case of no data present, not about filesystem IO.
But here we create our own semantics of O_NONBLOCK on read() of
fs IO. As you say page_cache_readahead can block for one
trying to allocate pages, possibly while submitting IO too.
The patch is cool - might be nice to check if SuS or someone else
specificies behaviour and try to match if so?
>
> -Jeff
>
> --- linux-2.6.8/mm/filemap.c.orig 2004-09-30 16:33:46.881129560 -0400
> +++ linux-2.6.8/mm/filemap.c 2004-09-30 16:34:12.109294296 -0400
> @@ -720,7 +720,7 @@ void do_generic_mapping_read(struct addr
> unsigned long index, end_index, offset;
> loff_t isize;
> struct page *cached_page;
> - int error;
> + int error, nonblock = filp->f_flags & O_NONBLOCK;
> struct file_ra_state ra = *_ra;
>
> cached_page = NULL;
> @@ -755,10 +755,20 @@ find_page:
> page = find_get_page(mapping, index);
> if (unlikely(page == NULL)) {
> handle_ra_miss(mapping, &ra, index);
> + if (nonblock) {
> + desc->error = -EWOULDBLOCK;
> + break;
> + }
> goto no_cached_page;
> }
> - if (!PageUptodate(page))
> + if (!PageUptodate(page)) {
> + if (nonblock) {
> + page_cache_release(page);
> + desc->error = -EWOULDBLOCK;
> + break;
> + }
> goto page_not_up_to_date;
> + }
> page_ok:
>
> /* If users can be writing to this page using arbitrary
> @@ -1004,7 +1014,7 @@ __generic_file_aio_read(struct kiocb *io
> desc.error = 0;
> do_generic_file_read(filp,ppos,&desc,file_read_actor);
> retval += desc.written;
> - if (!retval) {
> + if (!retval || desc.error) {
> retval = desc.error;
> break;
> }
Marcelo wrote:
> Curiosity: Is this defined in any UNIX standard?
No. See
http://www.opengroup.org/onlinepubs/009695399/functions/open.html
which leaves it undefined.
http://www.pasc.org/interps/unofficial/db/p1003.1/pasc-1003.1-71.html
says implementations have to allow setting O_NONBLOCK even if they
ignore it.
http://www.ussg.iu.edu/hypermail/linux/kernel/9911.3/0530.html
claims other Unixes and NT implement it.
There's a thread that discusses this in a bit of detail, and
suggests that older Solaris might implement it:
http://lists.freebsd.org/pipermail/freebsd-arch/2003-April/000132.html
http://lists.freebsd.org/pipermail/freebsd-arch/2003-April/000134.html
Googling for O_NONBLOCK disk seems to be good. I'd google more
but my baby is calling :-)
- Dan
--
My technical stuff: http://kegel.com
My politics: see http://www.misleader.org for examples of why I'm for regime change
On Fri, 1 Oct 2004, Jeff Moyer wrote:
> This patch makes an attempt at supporting the O_NONBLOCK flag for
> regular files. It's pretty straight-forward. One limitation is that we
> still call into the readahead code, which I believe can block.
> However, if we don't do this, then an application which only uses
> non-blocking reads may never get it's data.
I like it, though for programs which are real serious about
O_NONBLOCK we might want to hand off readahead to a helper
thread instead of starting readahead in the same context...
Not sure if we would always want this, though ;)
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
==> Regarding Re: [patch rfc] towards supporting O_NONBLOCK on regular files; Marcelo Tosatti <[email protected]> adds:
marcelo.tosatti> On Fri, Oct 01, 2004 at 04:57:50PM -0400, Jeff Moyer
marcelo.tosatti> wrote:
>> This patch makes an attempt at supporting the O_NONBLOCK flag for
>> regular files. It's pretty straight-forward. One limitation is that we
>> still call into the readahead code, which I believe can block. However,
>> if we don't do this, then an application which only uses non-blocking
>> reads may never get it's data.
>>
>> Comments welcome.
marcelo.tosatti> Hi Jeff,
marcelo.tosatti> Curiosity: Is this defined in any UNIX standard?
marcelo.tosatti> Adv. Programming in the UNIX environment says:
marcelo.tosatti> 12.2 Nonblocking I/O
marcelo.tosatti> "Nonblocking I/O lets us issue an I/O operation, such as
marcelo.tosatti> open, read, or write and not have it block forever. If the
marcelo.tosatti> operation cannot be completed, return is made immediately
marcelo.tosatti> with an error noting that the operation would have
marcelo.tosatti> blocked."
marcelo.tosatti> He is talking about read's on descriptors (pipe's,
marcelo.tosatti> devices, etc) which block in case of no data present, not
marcelo.tosatti> about filesystem IO.
marcelo.tosatti> But here we create our own semantics of O_NONBLOCK on
marcelo.tosatti> read() of fs IO. As you say page_cache_readahead can block
marcelo.tosatti> for one trying to allocate pages, possibly while
marcelo.tosatti> submitting IO too.
marcelo.tosatti> The patch is cool - might be nice to check if SuS or
marcelo.tosatti> someone else specificies behaviour and try to match if so?
Hi, Marcelo,
The text below was taken from the following standard:
IEEE Std 1003.1, 2004 Edition
The Open Group Technical Standard
Base Sepcifications, Issue 6
Includes IEEE Std 1003.1-2001, IEEE Std 1003.1-2001/Cor 1-2002
open()
O_NONBLOCK When opening a FIFO with O_RDONLY or O_WRONLY set:
o If O_NONBLOCK is set, an open( ) for reading-only shall return without
delay. An open( ) for writing-only shall return an error if no process
currently has the file open for reading.
o If O_NONBLOCK is clear, an open( ) for reading-only shall block the
calling thread until a thread opens the file for writing. An open( ) for
writing-only shall block the calling thread until a thread opens the file
for reading.
When opening a block special or character special file that supports non-
blocking opens:
o If O_NONBLOCK is set, the open( ) function shall return without blocking
for the device to be ready or available. Subsequent behavior of the device
is device-specific.
o If O_NONBLOCK is clear, the open( ) function shall block the calling
thread until the device is ready or available before returning.
Otherwise, the behavior of O_NONBLOCK is unspecified.
read()
When attempting to read a file (other than a pipe or FIFO) that supports
non-blocking reads and has no data currently available:
o If O_NONBLOCK is set, read( ) shall return -1 and set errno to [EAGAIN].
o If O_NONBLOCK is clear, read( ) shall block the calling thread until
some data becomes available.
o The use of the O_NONBLOCK flag has no effect if there is some data
available.
write()
When attempting to write to a file descriptor (other than a pipe or FIFO)
that supports non- blocking writes and cannot accept the data immediately:
o If the O_NONBLOCK flag is clear, write( ) shall block the calling
thread until the data can be accepted.
o If the O_NONBLOCK flag is set, write( ) shall not block the thread. If
some data can be written without blocking the thread, write( ) shall
write what it can and return the number of bytes written. Otherwise, it
shall return -1 and set errno to [EAGAIN].
Thanks!
Jeff
On Wed, Oct 06, 2004 at 09:13:38AM -0400, Jeff Moyer wrote:
> ==> Regarding Re: [patch rfc] towards supporting O_NONBLOCK on regular files; Marcelo Tosatti <[email protected]> adds:
Hi Jeff!
> Hi, Marcelo,
>
> The text below was taken from the following standard:
>
> IEEE Std 1003.1, 2004 Edition
> The Open Group Technical Standard
> Base Sepcifications, Issue 6
> Includes IEEE Std 1003.1-2001, IEEE Std 1003.1-2001/Cor 1-2002
>
>
> open()
>
> O_NONBLOCK When opening a FIFO with O_RDONLY or O_WRONLY set:
> o If O_NONBLOCK is set, an open( ) for reading-only shall return without
> delay. An open( ) for writing-only shall return an error if no process
> currently has the file open for reading.
> o If O_NONBLOCK is clear, an open( ) for reading-only shall block the
> calling thread until a thread opens the file for writing. An open( ) for
> writing-only shall block the calling thread until a thread opens the file
> for reading.
>
> When opening a block special or character special file that supports non-
> blocking opens:
> o If O_NONBLOCK is set, the open( ) function shall return without blocking
> for the device to be ready or available. Subsequent behavior of the device
> is device-specific.
> o If O_NONBLOCK is clear, the open( ) function shall block the calling
> thread until the device is ready or available before returning.
>
> Otherwise, the behavior of O_NONBLOCK is unspecified.
>
> read()
>
> When attempting to read a file (other than a pipe or FIFO) that supports
> non-blocking reads and has no data currently available:
> o If O_NONBLOCK is set, read( ) shall return -1 and set errno to [EAGAIN].
This implies read(O_NONBLOCK) should never block.
Maybe your code should pass down __GFP_FAIL in the gfp_mask
to the page_cache_alloc() to avoid blocking reclaiming pages,
and possibly pass info down to the block layer
"if this is going to block, fail".
We have a bdi_congested() check before doing the readahead (dont RA
if the queue is full). Check if that really works - I'm afraid
it can block because the check is not done at each page
submitted, but rather at each readahead operation (which
can be several pages large).
> o If O_NONBLOCK is clear, read( ) shall block the calling thread until
> some data becomes available.
> o The use of the O_NONBLOCK flag has no effect if there is some data
> available.
>
> write()
>
> When attempting to write to a file descriptor (other than a pipe or FIFO)
> that supports non- blocking writes and cannot accept the data immediately:
> o If the O_NONBLOCK flag is clear, write( ) shall block the calling
> thread until the data can be accepted.
> o If the O_NONBLOCK flag is set, write( ) shall not block the thread. If
> some data can be written without blocking the thread, write( ) shall
> write what it can and return the number of bytes written. Otherwise, it
> shall return -1 and set errno to [EAGAIN].
>
>
> Thanks!
>
> Jeff
Hi,
On Wed, 2004-10-06 at 09:01 -0300, Marcelo Tosatti wrote:
> > When attempting to read a file (other than a pipe or FIFO) that supports
> > non-blocking reads and has no data currently available:
> > o If O_NONBLOCK is set, read( ) shall return -1 and set errno to [EAGAIN].
>
> This implies read(O_NONBLOCK) should never block.
The spec is usually pretty careful never to come straight out and
require that in all cases, even for true AIO.
> Maybe your code should pass down __GFP_FAIL in the gfp_mask
> to the page_cache_alloc() to avoid blocking reclaiming pages,
> and possibly pass info down to the block layer
> "if this is going to block, fail".
It's not just the page allocation that can block, though. Readahead
requires us to map the buffers being read before we submit the async
read, so we can still block reading indirect blocks. If we want to
avoid submitting that extra synchronous IO, then either O_NONBLOCK needs
to avoid readahead entirely for non-present pages, or the readahead
itself needs to know that it's a O_NONBLOCK IO and fail cleanly if the
metadata is not in cache.
Cheers,
Stephen
On Thu, Oct 07, 2004 at 04:31:35AM +0100, Stephen C. Tweedie wrote:
> Hi,
>
> On Wed, 2004-10-06 at 09:01 -0300, Marcelo Tosatti wrote:
> > > o If O_NONBLOCK is set, read( ) shall return -1 and set errno to [EAGAIN].
> >
> > This implies read(O_NONBLOCK) should never block.
>
> The spec is usually pretty careful never to come straight out and
> require that in all cases, even for true AIO.
>
> > Maybe your code should pass down __GFP_FAIL in the gfp_mask
> > to the page_cache_alloc() to avoid blocking reclaiming pages,
> > and possibly pass info down to the block layer
> > "if this is going to block, fail".
>
> It's not just the page allocation that can block, though. Readahead
> requires us to map the buffers being read before we submit the async
> read, so we can still block reading indirect blocks. If we want to
> avoid submitting that extra synchronous IO, then either O_NONBLOCK needs
> to avoid readahead entirely for non-present pages, or the readahead
> itself needs to know that it's a O_NONBLOCK IO and fail cleanly if the
> metadata is not in cache.
Hi Stephen!
Oh yes, theres also the indirect blocks which we might need to read from
disk.
Now the question is, how strict should the O_NONBLOCK implementation be
in reference to "not blocking" ?
Maybe Jeff's currently implementation is just fine avoiding the
potential block at !PageUptodate.
> Now the question is, how strict should the O_NONBLOCK implementation be
> in reference to "not blocking" ?
almost any allocation all over the kernel can in theory block ;)
I'd say be pragmatic in this and avoid the obvious pagecache blocking,
but ignore the rest, it'll be rare and if it's there, of short duration.
Userland can get rescheduled anyway at any time for brief periods
Hi,
On Thu, 2004-10-07 at 11:12, Marcelo Tosatti wrote:
> Oh yes, theres also the indirect blocks which we might need to read from
> disk.
Right.
> Now the question is, how strict should the O_NONBLOCK implementation be
> in reference to "not blocking" ?
Well, I suspect that depends on the application. But if you've got an
app that really wants to make sure its hot path is as fast as possible
(eg. a high-throughput server multiplexing disk IO and networking
through a single event loop), then ideally the app would want to punt
any blocking disk IO to another thread.
So it's a matter of significant extra programing for a small extra
reduction in app blocking potential.
I think it's worth getting this right in the long term, though. Getting
readahead of indirect blocks right has other benefits too --- eg. we may
be able to fix the situation where we end up trying to read indirect
blocks before we've even submitted the IO for the previous data blocks,
breaking the IO pipeline ordering.
--Stephen
==> Regarding Re: [patch rfc] towards supporting O_NONBLOCK on regular files; "Stephen C. Tweedie" <[email protected]> adds:
sct> Hi, On Thu, 2004-10-07 at 11:12, Marcelo Tosatti wrote:
>> Oh yes, theres also the indirect blocks which we might need to read from
>> disk.
sct> Right.
>> Now the question is, how strict should the O_NONBLOCK implementation be
>> in reference to "not blocking" ?
sct> Well, I suspect that depends on the application. But if you've got an
sct> app that really wants to make sure its hot path is as fast as possible
sct> (eg. a high-throughput server multiplexing disk IO and networking
sct> through a single event loop), then ideally the app would want to punt
sct> any blocking disk IO to another thread.
sct> So it's a matter of significant extra programing for a small extra
sct> reduction in app blocking potential.
sct> I think it's worth getting this right in the long term, though.
sct> Getting readahead of indirect blocks right has other benefits too ---
sct> eg. we may be able to fix the situation where we end up trying to read
sct> indirect blocks before we've even submitted the IO for the previous
sct> data blocks, breaking the IO pipeline ordering.
So for the short term, are you an advocate of the patch posted?
-Jeff
Hi,
On Mon, 2004-10-11 at 19:58, Jeff Moyer wrote:
> sct> I think it's worth getting this right in the long term, though.
> sct> Getting readahead of indirect blocks right has other benefits too ---
> sct> eg. we may be able to fix the situation where we end up trying to read
> sct> indirect blocks before we've even submitted the IO for the previous
> sct> data blocks, breaking the IO pipeline ordering.
>
> So for the short term, are you an advocate of the patch posted?
In the short term, can't we just disable readahead for O_NONBLOCK? That
has true non-blocking semantics --- if the data is already available we
return it, but if not, it's up to somebody else to retrieve it.
That's exactly what you want if you're genuinely trying to avoid
blocking at all costs on a really hot event loop, and the semantics seem
to make sense to me. It's not that different from the networking case
where no amount of read() on a non-blocking fd will get you more data
unless there's another process somewhere filling the stream.
--Stephen
==> Regarding Re: [patch rfc] towards supporting O_NONBLOCK on regular files; "Stephen C. Tweedie" <[email protected]> adds:
sct> Hi, On Mon, 2004-10-11 at 19:58, Jeff Moyer wrote:
sct> I think it's worth getting this right in the long term, though.
sct> Getting readahead of indirect blocks right has other benefits too ---
sct> eg. we may be able to fix the situation where we end up trying to read
sct> indirect blocks before we've even submitted the IO for the previous
sct> data blocks, breaking the IO pipeline ordering.
>> So for the short term, are you an advocate of the patch posted?
sct> In the short term, can't we just disable readahead for O_NONBLOCK?
sct> That has true non-blocking semantics --- if the data is already
sct> available we return it, but if not, it's up to somebody else to
sct> retrieve it.
sct> That's exactly what you want if you're genuinely trying to avoid
sct> blocking at all costs on a really hot event loop, and the semantics
sct> seem to make sense to me. It's not that different from the networking
sct> case where no amount of read() on a non-blocking fd will get you more
sct> data unless there's another process somewhere filling the stream.
Yes, that sounds like a fine idea. Here is a patch which does this.
Andrew, I know you only want bug fixes, but I'd like to get this into your
queue for post 2.6.9, if possible.
Thanks,
Jeff
--- linux-2.6.9-rc4-mm1/mm/filemap.c~ 2004-10-12 12:40:01.000000000 -0400
+++ linux-2.6.9-rc4-mm1/mm/filemap.c 2004-10-12 12:53:13.202396656 -0400
@@ -692,7 +692,7 @@ void do_generic_mapping_read(struct addr
unsigned long index, end_index, offset;
loff_t isize;
struct page *cached_page;
- int error;
+ int error, nonblock = filp->f_flags & O_NONBLOCK;
struct file_ra_state ra = *_ra;
cached_page = NULL;
@@ -721,16 +721,27 @@ void do_generic_mapping_read(struct addr
nr = nr - offset;
cond_resched();
- page_cache_readahead(mapping, &ra, filp, index);
+ if (!nonblock)
+ page_cache_readahead(mapping, &ra, filp, index);
find_page:
page = find_get_page(mapping, index);
if (unlikely(page == NULL)) {
+ if (nonblock) {
+ desc->error = -EWOULDBLOCK;
+ break;
+ }
handle_ra_miss(mapping, &ra, index);
goto no_cached_page;
}
- if (!PageUptodate(page))
+ if (!PageUptodate(page)) {
+ if (nonblock) {
+ page_cache_release(page);
+ desc->error = -EWOULDBLOCK;
+ break;
+ }
goto page_not_up_to_date;
+ }
page_ok:
/* If users can be writing to this page using arbitrary
@@ -976,7 +987,7 @@ __generic_file_aio_read(struct kiocb *io
desc.error = 0;
do_generic_file_read(filp,ppos,&desc,file_read_actor);
retval += desc.written;
- if (!retval) {
+ if (!retval || desc.error) {
retval = desc.error;
break;
}
==> Regarding Re: [patch rfc] towards supporting O_NONBLOCK on regular files; Pavel Machek <[email protected]> adds:
pavel> Hi!
>> This patch makes an attempt at supporting the O_NONBLOCK flag for
>> regular files. It's pretty straight-forward. One limitation is that we
>> still call into the readahead code, which I believe can block. However,
>> if we don't do this, then an application which only uses non-blocking
>> reads may never get it's data.
pavel> This looks very nice. Does it mean that aio and friends are
pavel> instantly obsolete?
I dont' think so. This only addresses the read() path, for one. Plus, in
it's current form, it will not perform any I/O if the data is not present.
So, you will need another thread/process to do kick off the I/O.
pavel> Does it have comparable performance to aio?
I haven't run any tests. One advantage this has to current aio is that it
can operate without O_DIRECT.
-Jeff
Hi!
> pavel> Does it have comparable performance to aio?
>
> I haven't run any tests. One advantage this has to current aio is that it
> can operate without O_DIRECT.
Well, second advantage is that it is much nicer interface...
Pavel
--
People were complaining that M$ turns users into beta-testers...
...jr ghea gurz vagb qrirybcref, naq gurl frrz gb yvxr vg gung jnl!
--- linux-2.6.9-rc4-mm1/mm/filemap.c.orig 2004-10-15 10:33:24.986209880 -0400
+++ linux-2.6.9-rc4-mm1/mm/filemap.c 2004-10-15 11:38:50.869384920 -0400
@@ -692,7 +692,7 @@ void do_generic_mapping_read(struct addr
unsigned long index, end_index, offset;
loff_t isize;
struct page *cached_page;
- int error;
+ int error, nonblock = filp->f_flags & O_NONBLOCK;
struct file_ra_state ra = *_ra;
cached_page = NULL;
@@ -721,16 +721,27 @@ void do_generic_mapping_read(struct addr
nr = nr - offset;
cond_resched();
- page_cache_readahead(mapping, &ra, filp, index);
+ if (!nonblock)
+ page_cache_readahead(mapping, &ra, filp, index);
find_page:
page = find_get_page(mapping, index);
if (unlikely(page == NULL)) {
+ if (nonblock) {
+ desc->error = -EWOULDBLOCK;
+ break;
+ }
handle_ra_miss(mapping, &ra, index);
goto no_cached_page;
}
- if (!PageUptodate(page))
+ if (!PageUptodate(page)) {
+ if (nonblock) {
+ page_cache_release(page);
+ desc->error = -EWOULDBLOCK;
+ break;
+ }
goto page_not_up_to_date;
+ }
page_ok:
/* If users can be writing to this page using arbitrary
@@ -976,10 +987,10 @@ __generic_file_aio_read(struct kiocb *io
desc.error = 0;
do_generic_file_read(filp,ppos,&desc,file_read_actor);
retval += desc.written;
- if (!retval) {
+ if (!retval)
retval = desc.error;
+ if (desc.written != iov[seg].iov_len)
break;
- }
}
}
out:
Hi,
On Fri, 2004-10-15 at 16:44, Jeff Moyer wrote:
> I got the partial read case wrong in the last patch. In fact, it looks
> like this code path would perform infinite retries before. This should
> address that by returning upon the first partial read. Attached is a new
> version of the patch.
...
> --- linux-2.6.9-rc4-mm1/mm/filemap.c.orig 2004-10-15 10:33:24.986209880 -0400
> +++ linux-2.6.9-rc4-mm1/mm/filemap.c 2004-10-15 11:38:50.869384920 -0400
> @@ -976,10 +987,10 @@ __generic_file_aio_read(struct kiocb *io
> desc.error = 0;
> do_generic_file_read(filp,ppos,&desc,file_read_actor);
> retval += desc.written;
> - if (!retval) {
> + if (!retval)
> retval = desc.error;
> + if (desc.written != iov[seg].iov_len)
> break;
> - }
> }
Yep; this chunk used to break out only on !retval, so at worst it's a
data corrupter for multi-segment iovecs. If one do_generic_file_read()
returned a short read we'd update retval but then move on to the next
segment of the iovec; if the next one succeeded we'd be reading into the
next segment but ppos wouldn't be correctly advanced.
The fix looks correct --- we still only return an error if no data at
all has been returned, but we break out of the IO completely at the
first short read.
--Stephen
On Oct 15, 2004, Jeff Moyer <[email protected]> wrote:
jmoyer> Yes, that sounds like a fine idea. Here is a patch which does
jmoyer> this. Andrew, I know you only want bug fixes, but I'd like to get
jmoyer> this into your queue for post 2.6.9, if possible.
> I got the partial read case wrong in the last patch. In fact, it looks
> like this code path would perform infinite retries before. This should
> address that by returning upon the first partial read. Attached is a new
> version of the patch.
Don't you have to adjust poll/select as well, to test whether read has
any data to return immediately? Squid is broken with the latest
FCdevel kernel, and this patch is in it.
The reason Squid breaks is that poll (or is it select? I forget) says
there's data to be read from cache files (as well as from error
message files read during start up), but then read fails with -EAGAIN.
If I bring the error files into memory with cat
/etc/squid/errors/ERR*, then squid will successfully start up, and
then, in order for it to not eat all the available CPU polling data
files and attempting to read from them, I need to start a command line
this:
pid=`pidof '(squid)'
while :; do
lsof -p $pid |
sed -n 's,.*\(/var/spool/squid/.*/.*/.*\),\1,p' |
xargs cat /dev/null > /dev/null;
sleep 5
done
--
Alexandre Oliva http://www.ic.unicamp.br/~oliva/
Red Hat Compiler Engineer aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist oliva@{lsd.ic.unicamp.br, gnu.org}
On Sun, 17 Oct 2004, Alexandre Oliva wrote:
> The reason Squid breaks is that poll (or is it select? I forget) says
> there's data to be read from cache files (as well as from error message
> files read during start up), but then read fails with -EAGAIN. If I
> bring the error files into memory with cat /etc/squid/errors/ERR*, then
> squid will successfully start up, and then, in order for it to not eat
> all the available CPU polling data files and attempting to read from
> them, I need to start a command line this:
the problem is that upon read() we dont 'kick' any IO if there's no data
available. I.e. the readahead-kicking is necessary after all, because
squid apparently assumes that re-trying a read will eventually succeed.
(Tux on the other hand does not assume this and uses helper threads to
kick IO.)
Ingo
On Oct 17, 2004, Ingo Molnar <[email protected]> wrote:
> I.e. the readahead-kicking is necessary after all, because
> squid apparently assumes that re-trying a read will eventually succeed.
I'm not sure it assumes that. It definitely expects read to succeed
if poll says there is data available from the file, though, and having
poll return that there is data, and then having read fail because
there isn't anything there, so that you go back to poll, is a recipe
for wasting CPU. I do think read should kick in readahead, yes, but
so should poll, if the process says it wants to read from the file,
and poll should not return (or not say data is available) unless an
immediate, atomic call to read would actually return some data. Of
course, if the data happens to be elicited from memory between the
poll and read calls, it's legitimate for read to fail with -EAGAIN,
but this shouldn't happen very often.
--
Alexandre Oliva http://www.ic.unicamp.br/~oliva/
Red Hat Compiler Engineer aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist oliva@{lsd.ic.unicamp.br, gnu.org}
==> Regarding Re: [patch rfc] towards supporting O_NONBLOCK on regular files; Alexandre Oliva <[email protected]> adds:
aoliva> On Oct 17, 2004, Ingo Molnar <[email protected]> wrote:
>> I.e. the readahead-kicking is necessary after all, because squid
>> apparently assumes that re-trying a read will eventually succeed.
aoliva> I'm not sure it assumes that. It definitely expects read to
aoliva> succeed if poll says there is data available from the file, though,
aoliva> and having poll return that there is data, and then having read
aoliva> fail because there isn't anything there, so that you go back to
aoliva> poll, is a recipe for wasting CPU. I do think read should kick in
aoliva> readahead, yes, but so should poll, if the process says it wants to
aoliva> read from the file, and poll should not return (or not say data is
aoliva> available) unless an immediate, atomic call to read would actually
aoliva> return some data. Of course, if the data happens to be elicited
aoliva> from memory between the poll and read calls, it's legitimate for
aoliva> read to fail with -EAGAIN, but this shouldn't happen very often.
Select, pselect, and poll will always return data ready on a regular file.
As such, I would argue that squid's behaviour is broken. Additionally, I
don't think it's a good idea to modify any polling mechanism to kick off
I/O, if simply because I'm not sure how much data to request!
Since it wasn't supported before, I see no need to extend select/poll to
change behaviour for regular files. After all, the POSIX spec explicitly
says that select and friends will always return data ready for regular
files. To make O_NONBLOCK on regular files a workable implementation, Uli
suggestsed extending epoll. I will look into this.
I am in favor of kicking off I/O for reads that would block.
-Jeff
On Oct 18, 2004, Jeff Moyer <[email protected]> wrote:
> Select, pselect, and poll will always return data ready on a regular file.
> As such, I would argue that squid's behaviour is broken. Additionally, I
> don't think it's a good idea to modify any polling mechanism to kick off
> I/O, if simply because I'm not sure how much data to request!
You could request a single block (whatever that means), and then the
subsequent non-blocking read would stand a chance of eventually making
progress. . Or you could request nothing, and have the actual read
start a readahead, such that next time it hopefully will have
something to get to immediately. If the read comes in too quickly,
before the poll-initiated readahead completes, you'll probably get
them merged anyway, so it's not like it could hurt, methinks.
And then, you might arrange for select/poll to not return immediately
for these file descriptors, but rather return as soon as one of them
has some data available to read, the time-out expired, or a very short
time-out set for the case of non-blocking file descriptors select()ed
for read expired. The latter time-out should be short enough to be
hardly distinguishable from an immediate return, and the return value
should be exactly what a POSIX-compliant application expects.
This doesn't quite fix the problem with the existing standard
interfaces, that don't quite enable anyone to do non-blocking reads
without explicit readahead advice and busy-waiting for data. The
short time-out above should at least reduce the syscall explosion that
we get with the current behavior, but if read kicks in a readahead to
guarantee we'd eventually get out of the select/read loop without
external help to bring the data into memory, I guess we could live
without the select/poll changes.
--
Alexandre Oliva http://www.ic.unicamp.br/~oliva/
Red Hat Compiler Engineer aoliva@{redhat.com, gcc.gnu.org}
Free Software Evangelist oliva@{lsd.ic.unicamp.br, gnu.org}
Jeff Moyer <[email protected]> writes:
> Select, pselect, and poll will always return data ready on a regular file.
> As such, I would argue that squid's behaviour is broken.
I would argue that this isn't true. Yes, poll() would always return
ready, but read would always "block" when you called it on the
files. And using the same method you use for network IO saves having
to have a separate path through the event loop just for reading files
from disk.
For another view of the argument, see:
http://www.and.org/vstr/examples/ex_cat.c.html
...this does non-blocking IO from stdin to stdout. You seem to be
arguing that it should have to know when stdin is from a pipe and when
it is a redirected file.
> I am in favor of kicking off I/O for reads that would block.
This would solve the problem, but I'd still argue that poll()
shouldn't lie ... having the event loop do the right thing when the
disk is busy or the file is on a down NFS server would be a huge free
win.
--
James Antill e: <[email protected]>
Red Hat Support Engineering Group