2001-12-16 00:48:11

by Suresh Gopalakrishnan

[permalink] [raw]
Subject: O_DIRECT wierd behavior..


I tried this small piece of code from an old post in the archive:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define O_DIRECT 040000 /* direct disk access hint */

int main()
{
char buf[16384];
int fd;
char *p;

p = (char *)((((unsigned long)buf) + 8191) & ~8191L);
fd = open("/tmp/blah", O_CREAT | O_RDWR | O_DIRECT);

printf("write returns %i\n", write(fd, buf, 8192));
printf("write returns %i\n", write(fd, p, 1));

return 0;
}

Output is:

write returns -1
Filesize limit exceeded (core dumped)

$ ls -l /tmp/blah
---------- 1 gsuresh users 4294967274 Dec 15 19:15 /tmp/blah

The kernel is 2.4.16 and /tmp is ext2. (It runs fine on 2.4.2).

Any idea why this happens and how to fix this?

Thanks
--suresh


2001-12-16 06:00:23

by Andrew Morton

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..

Suresh Gopalakrishnan wrote:
>
> I tried this small piece of code from an old post in the archive:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <fcntl.h>
> #include <sys/stat.h>
> #include <sys/types.h>
> #include <unistd.h>
>
> #define O_DIRECT 040000 /* direct disk access hint */
>
> int main()
> {
> char buf[16384];
> int fd;
> char *p;
>
> p = (char *)((((unsigned long)buf) + 8191) & ~8191L);
> fd = open("/tmp/blah", O_CREAT | O_RDWR | O_DIRECT);
>
> printf("write returns %i\n", write(fd, buf, 8192));
> printf("write returns %i\n", write(fd, p, 1));
>
> return 0;
> }
>

The app has a bug in it (I think); but the kernel has four.
Your first write fails because `buf' is not page-aligned.

Then the kernel screws up the error handling, and ends up
setting the file size to -EINVAL (ie: rather large).



1: We're testing `written >= 0', but it is unsigned (!). In two
places.

This one, IMO is a gcc shortcoming. The compiler is capable of warning
about expressions which always evaluate to true or false in `if' statements,
but turning this on also enables lots of things you don't want it to warn about.
gcc needs to provide finer control of its warning capabilities. I patched
gcc-2.7.2.3 to do this ages back and it was very useful.

2: If generic_osync_inode() returns an error, we fail to report it. In
two places.

Here's a quick fix. It needs a review.



--- linux-2.4.17-rc1/mm/filemap.c Thu Dec 13 14:07:55 2001
+++ linux-akpm/mm/filemap.c Sat Dec 15 21:52:06 2001
@@ -3038,8 +3038,11 @@ unlock:
/* For now, when the user asks for O_SYNC, we'll actually
* provide O_DSYNC. */
if (status >= 0) {
- if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
+ if ((file->f_flags & O_SYNC) || IS_SYNC(inode)) {
status = generic_osync_inode(inode, OSYNC_METADATA|OSYNC_DATA);
+ if (status < 0)
+ written = 0; /* Return the right thing */
+ }
}

out_status:
@@ -3054,7 +3057,8 @@ fail_write:

o_direct:
written = generic_file_direct_IO(WRITE, file, (char *) buf, count, pos);
- if (written > 0) {
+ status = written;
+ if (status > 0) {
loff_t end = pos + written;
if (end > inode->i_size && !S_ISBLK(inode->i_mode)) {
inode->i_size = end;
@@ -3067,8 +3071,11 @@ o_direct:
* Sync the fs metadata but not the minor inode changes and
* of course not the data as we did direct DMA for the IO.
*/
- if (written >= 0 && file->f_flags & O_SYNC)
+ if (status >= 0 && file->f_flags & O_SYNC) {
status = generic_osync_inode(inode, OSYNC_METADATA);
+ if (status < 0)
+ written = 0; /* Return the right thing */
+ }
goto out_status;
}

2001-12-16 06:30:16

by GOTO Masanori

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..

Hi,

At Sat, 15 Dec 2001 19:47:46 -0500 (EST),
Suresh Gopalakrishnan wrote:
> I tried this small piece of code from an old post in the archive:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <fcntl.h>
> #include <sys/stat.h>
> #include <sys/types.h>
> #include <unistd.h>
>
> #define O_DIRECT 040000 /* direct disk access hint */
>
> int main()
> {
> char buf[16384];
> int fd;
> char *p;
>
> p = (char *)((((unsigned long)buf) + 8191) & ~8191L);
> fd = open("/tmp/blah", O_CREAT | O_RDWR | O_DIRECT);
>
> printf("write returns %i\n", write(fd, buf, 8192));
> printf("write returns %i\n", write(fd, p, 1));
>
> return 0;
> }
>
> Output is:
>
> write returns -1
> Filesize limit exceeded (core dumped)
>
> $ ls -l /tmp/blah
> ---------- 1 gsuresh users 4294967274 Dec 15 19:15 /tmp/blah
>
> The kernel is 2.4.16 and /tmp is ext2. (It runs fine on 2.4.2).
>
> Any idea why this happens and how to fix this?

Hmm, kernel 2.4.17-rc1 also has this problem.

The reason of this problem is that written is defined as unsigned long
currently, so if generic_file_direct_IO returns -EINVAL, then written
is translated as (unsigned long)(-EINVAL) = (0x100000000-EINVAL). Thus
the file size changed such a big value!

This patch fixes it.

--- linux-2.4.17-rc1.vanilla/mm/filemap.c Sun Dec 16 14:57:42 2001
+++ linux-2.4.17-rc1/mm/filemap.c Sun Dec 16 15:02:10 2001
@@ -2854,7 +2854,7 @@
unsigned long limit = current->rlim[RLIMIT_FSIZE].rlim_cur;
loff_t pos;
struct page *page, *cached_page;
- unsigned long written;
+ ssize_t written;
long status = 0;
int err;
unsigned bytes;


This problem breaks file inode size with direct IO... we lost the
correct file inode size forever. It's serious. Linus and Marcelo,
please consider to apply this patch?

However, your program does not work yet. Try my test program. It
works well because valloc()/memalign() allocates aligned page
boundary.

But, my interest is that why we cannot access with stack memory or
malloc()-ed memory. I heard Suresh's program works fine on 2.4.2, but
it does not work on 2.4.17-rc1. Should we modify the kernel to work
well with user stack/heap memory when we use direct IO for the file ?
Or, this behavior is absolutely correct ?


#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define O_DIRECT 040000 /* direct disk access hint */

int main()
{
char *buf;
int fd;

buf = valloc(16384);
fd = open("/tmp/blah", O_CREAT | O_RDWR | O_DIRECT, 0755 );

printf("write returns %i\n", write(fd, buf, 4096));

return 0;
}


-- gotom


2001-12-16 08:17:59

by GOTO Masanori

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..

At Sat, 15 Dec 2001 21:59:06 -0800,
Andrew Morton wrote:
> Then the kernel screws up the error handling, and ends up
> setting the file size to -EINVAL (ie: rather large).
>
> 1: We're testing `written >= 0', but it is unsigned (!). In two
> places.
>
> This one, IMO is a gcc shortcoming. The compiler is capable of warning
> about expressions which always evaluate to true or false in `if' statements,
> but turning this on also enables lots of things you don't want it to warn about.
> gcc needs to provide finer control of its warning capabilities. I patched
> gcc-2.7.2.3 to do this ages back and it was very useful.
>
> 2: If generic_osync_inode() returns an error, we fail to report it. In
> two places.
>
> Here's a quick fix. It needs a review.
>
> --- linux-2.4.17-rc1/mm/filemap.c Thu Dec 13 14:07:55 2001
> +++ linux-akpm/mm/filemap.c Sat Dec 15 21:52:06 2001
> @@ -3038,8 +3038,11 @@ unlock:
> /* For now, when the user asks for O_SYNC, we'll actually
> * provide O_DSYNC. */
> if (status >= 0) {
> - if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
> + if ((file->f_flags & O_SYNC) || IS_SYNC(inode)) {
> status = generic_osync_inode(inode, OSYNC_METADATA|OSYNC_DATA);
> + if (status < 0)
> + written = 0; /* Return the right thing */
> + }
> }

Right. If generic_osync_inode returns error, it must be needed.
This patch seems ok than my patch...

-- gotom

2001-12-16 08:47:46

by Andrew Morton

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..

GOTO Masanori wrote:
>
> At Sat, 15 Dec 2001 21:59:06 -0800,
> Andrew Morton wrote:
> ...
> > --- linux-2.4.17-rc1/mm/filemap.c Thu Dec 13 14:07:55 2001
> > +++ linux-akpm/mm/filemap.c Sat Dec 15 21:52:06 2001
> > @@ -3038,8 +3038,11 @@ unlock:
> > /* For now, when the user asks for O_SYNC, we'll actually
> > * provide O_DSYNC. */
> > if (status >= 0) {
> > - if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
> > + if ((file->f_flags & O_SYNC) || IS_SYNC(inode)) {
> > status = generic_osync_inode(inode, OSYNC_METADATA|OSYNC_DATA);
> > + if (status < 0)
> > + written = 0; /* Return the right thing */
> > + }
> > }
>
> Right. If generic_osync_inode returns error, it must be needed.
> This patch seems ok than my patch...

Actually, I preferred your approach :)

Also, note how if ->commit_write() or ->prepare_write() return an
error, and we have already written some data, the function returns
the number of bytes written and no indication that there was an error.
According to the write(2) manpage, that's wrong.

Probably it is sufficient to make `written' a signed quantity
and to do:

out_status:
- err = written ? written : status;
+ err = status ? status : written;

I think that fixes the five or six bugs we've found so far in
this function. err.. make that six or seven. What is it trying
do if ->prepare_write() returns a non-zero, positive value?

It needs a big spring-clean. I'm afraid I don't have time to
do that for several days.

-

2001-12-16 09:20:41

by Suresh Gopalakrishnan

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..


On Sun, 16 Dec 2001, Andrew Morton wrote:
> GOTO Masanori wrote:
> > At Sat, 15 Dec 2001 21:59:06 -0800,
> > Andrew Morton wrote:
> > ...
> > > --- linux-2.4.17-rc1/mm/filemap.c Thu Dec 13 14:07:55 2001
> > > +++ linux-akpm/mm/filemap.c Sat Dec 15 21:52:06 2001
> > > @@ -3038,8 +3038,11 @@ unlock:
> > > /* For now, when the user asks for O_SYNC, we'll actually
> > > * provide O_DSYNC. */
> > > if (status >= 0) {
> > > - if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
> > > + if ((file->f_flags & O_SYNC) || IS_SYNC(inode)) {
> > > status = generic_osync_inode(inode, OSYNC_METADATA|OSYNC_DATA);
> > > + if (status < 0)
> > > + written = 0; /* Return the right thing */
> > > + }
> > > }
> >
> > Right. If generic_osync_inode returns error, it must be needed.
> > This patch seems ok than my patch...
>
> Actually, I preferred your approach :)
>
> Also, note how if ->commit_write() or ->prepare_write() return an
> error, and we have already written some data, the function returns the
> number of bytes written and no indication that there was an error.
> According to the write(2) manpage, that's wrong.
>
> Probably it is sufficient to make `written' a signed quantity and to
> do:
>
> out_status:
> - err = written ? written : status;
> + err = status ? status : written;
>
> I think that fixes the five or six bugs we've found so far in this
> function. err.. make that six or seven. What is it trying do if
> ->prepare_write() returns a non-zero, positive value?
>
> It needs a big spring-clean. I'm afraid I don't have time to do that
> for several days.

Thanks for the patches. There seems to be one more fix required: the test
program below works in 2.4.16 only if the write size is a multiple of 4K.
(Why) are all writes expected to be page size, in addition to being page
aligned? (It works fine on 2.4.2 for all sizes). Any quick fixes? :)

--suresh

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define O_DIRECT 040000 /* direct disk access hint */

int main()
{
char *buf;
int fd;

buf = valloc(16384);
fd = open("/tmp/blah", O_CREAT | O_RDWR | O_DIRECT, 0755 );

printf("write returns %i\n", write(fd, buf, 4096));

return 0;
}


2001-12-16 13:58:38

by Terje Eggestad

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..


The problem is that the kernel that don't support O_DIRECT has erronous
handling of the O_DIRECT flag. Meaning they happily accept it.

In order to figure out ifthe running kernel support O_DIRECT you MUST
attempt an unaligned read/write, if it succed the kernel DON'T
support O_DIRECT.

TJ


On Sun, 16 Dec 2001, Suresh Gopalakrishnan wrote:

>
> On Sun, 16 Dec 2001, Andrew Morton wrote:
> > GOTO Masanori wrote:
> > > At Sat, 15 Dec 2001 21:59:06 -0800,
> > > Andrew Morton wrote:
> > > ...
> > > > --- linux-2.4.17-rc1/mm/filemap.c Thu Dec 13 14:07:55 2001
> > > > +++ linux-akpm/mm/filemap.c Sat Dec 15 21:52:06 2001
> > > > @@ -3038,8 +3038,11 @@ unlock:
> > > > /* For now, when the user asks for O_SYNC, we'll actually
> > > > * provide O_DSYNC. */
> > > > if (status >= 0) {
> > > > - if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
> > > > + if ((file->f_flags & O_SYNC) || IS_SYNC(inode)) {
> > > > status = generic_osync_inode(inode, OSYNC_METADATA|OSYNC_DATA);
> > > > + if (status < 0)
> > > > + written = 0; /* Return the right thing */
> > > > + }
> > > > }
> > >
> > > Right. If generic_osync_inode returns error, it must be needed.
> > > This patch seems ok than my patch...
> >
> > Actually, I preferred your approach :)
> >
> > Also, note how if ->commit_write() or ->prepare_write() return an
> > error, and we have already written some data, the function returns the
> > number of bytes written and no indication that there was an error.
> > According to the write(2) manpage, that's wrong.
> >
> > Probably it is sufficient to make `written' a signed quantity and to
> > do:
> >
> > out_status:
> > - err = written ? written : status;
> > + err = status ? status : written;
> >
> > I think that fixes the five or six bugs we've found so far in this
> > function. err.. make that six or seven. What is it trying do if
> > ->prepare_write() returns a non-zero, positive value?
> >
> > It needs a big spring-clean. I'm afraid I don't have time to do that
> > for several days.
>
> Thanks for the patches. There seems to be one more fix required: the test
> program below works in 2.4.16 only if the write size is a multiple of 4K.
> (Why) are all writes expected to be page size, in addition to being page
> aligned? (It works fine on 2.4.2 for all sizes). Any quick fixes? :)
>
> --suresh
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <fcntl.h>
> #include <sys/stat.h>
> #include <sys/types.h>
> #include <unistd.h>
>
> #define O_DIRECT 040000 /* direct disk access hint */
>
> int main()
> {
> char *buf;
> int fd;
>
> buf = valloc(16384);
> fd = open("/tmp/blah", O_CREAT | O_RDWR | O_DIRECT, 0755 );
>
> printf("write returns %i\n", write(fd, buf, 4096));
>
> return 0;
> }
>
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

--
_________________________________________________________________________

Terje Eggestad [email protected]
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 70 Bogerud +47 975 31 574 (MOBILE)
N-0621 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________

2001-12-16 17:44:02

by Suresh Gopalakrishnan

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..


On Sun, 16 Dec 2001, Terje Eggestad wrote:
> The problem is that the kernel that don't support O_DIRECT has
> erronous handling of the O_DIRECT flag. Meaning they happily accept
> it. In order to figure out ifthe running kernel support O_DIRECT you
> MUST attempt an unaligned read/write, if it succed the kernel DON'T
> support O_DIRECT. TJ

You are right! It went through on 2.4.2 even with an unaligned buffer.

So direct i/o has to be multiple of page size blocks, from page aligned
buffer, and apparently into page aligned offset in the file! Is this the
expected behavior?

--suresh

> > Thanks for the patches. There seems to be one more fix required: the test
> > program below works in 2.4.16 only if the write size is a multiple of 4K.
> > (Why) are all writes expected to be page size, in addition to being page
> > aligned? (It works fine on 2.4.2 for all sizes). Any quick fixes? :)
> > --suresh


2001-12-17 09:04:33

by Terje Eggestad

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..

Someone, wonder if it wasn't Andrea, pointet out to me that it should be
device block size, not page size for alignment and length.
Appearently it was just simpler to do page alignment, don't expect a
patch for that in any immediate future.

Other than that; you've got it.

TJ


s?n, 2001-12-16 kl. 18:43 skrev Suresh Gopalakrishnan:
>
> On Sun, 16 Dec 2001, Terje Eggestad wrote:
> > The problem is that the kernel that don't support O_DIRECT has
> > erronous handling of the O_DIRECT flag. Meaning they happily accept
> > it. In order to figure out ifthe running kernel support O_DIRECT you
> > MUST attempt an unaligned read/write, if it succed the kernel DON'T
> > support O_DIRECT. TJ
>
> You are right! It went through on 2.4.2 even with an unaligned buffer.
>
> So direct i/o has to be multiple of page size blocks, from page aligned
> buffer, and apparently into page aligned offset in the file! Is this the
> expected behavior?
>
> --suresh
>
> > > Thanks for the patches. There seems to be one more fix required: the test
> > > program below works in 2.4.16 only if the write size is a multiple of 4K.
> > > (Why) are all writes expected to be page size, in addition to being page
> > > aligned? (It works fine on 2.4.2 for all sizes). Any quick fixes? :)
> > > --suresh
>
--
_________________________________________________________________________

Terje Eggestad [email protected]
Scali Scalable Linux Systems http://www.scali.com

Olaf Helsets Vei 6 tel: +47 22 62 89 61 (OFFICE)
P.O.Box 70 Bogerud +47 975 31 574 (MOBILE)
N-0621 Oslo fax: +47 22 62 89 51
NORWAY
_________________________________________________________________________

2001-12-17 17:19:56

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..

On Sun, Dec 16, 2001 at 05:17:33PM +0900, GOTO Masanori wrote:
> At Sat, 15 Dec 2001 21:59:06 -0800,
> Andrew Morton wrote:
> > Then the kernel screws up the error handling, and ends up
> > setting the file size to -EINVAL (ie: rather large).
> >
> > 1: We're testing `written >= 0', but it is unsigned (!). In two
> > places.
> >
> > This one, IMO is a gcc shortcoming. The compiler is capable of warning
> > about expressions which always evaluate to true or false in `if' statements,
> > but turning this on also enables lots of things you don't want it to warn about.
> > gcc needs to provide finer control of its warning capabilities. I patched
> > gcc-2.7.2.3 to do this ages back and it was very useful.
> >
> > 2: If generic_osync_inode() returns an error, we fail to report it. In
> > two places.
> >
> > Here's a quick fix. It needs a review.
> >
> > --- linux-2.4.17-rc1/mm/filemap.c Thu Dec 13 14:07:55 2001
> > +++ linux-akpm/mm/filemap.c Sat Dec 15 21:52:06 2001
> > @@ -3038,8 +3038,11 @@ unlock:
> > /* For now, when the user asks for O_SYNC, we'll actually
> > * provide O_DSYNC. */
> > if (status >= 0) {
> > - if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
> > + if ((file->f_flags & O_SYNC) || IS_SYNC(inode)) {
> > status = generic_osync_inode(inode, OSYNC_METADATA|OSYNC_DATA);
> > + if (status < 0)
> > + written = 0; /* Return the right thing */
> > + }
> > }
>
> Right. If generic_osync_inode returns error, it must be needed.
> This patch seems ok than my patch...

The above have nothing with the O_DIRECT changes, the above was present
in 2.4.9 too.

my worry is that failing with -EIO or whatever if we written something can
screwup the app, the app will think the pos is still at the start of our
writes. the "ingore" of the osync failure (that can be generated only by
an I/O error) was on the lines of the ignore of a failure in
prepare_write/commit_write if we just written something. So to me it
looked quite intentional, not just a thinko. In those cases if we wrote
something we report "written" with a short-write (infact a short write
from kernel just indicates something is not strightforward) otherwise
only if nothing was written yet, we report -EIO. So the app, will know
something is been written and the "pos" of the fd is been updated, then
it will try again to write the remaining part and it will get the -EIO
next time.

But I see with common sense that a failing O_SYNC should be somehow
reported even if we just written something, or it could be silenty
ignored, the app at the very least should try again or to notify a
failure rather than losing the data journaling due the I/O errors in the
data/metadata flushing. At least this osync failure is something that
shouldn't happen in production. If an osync fails it means there's a bad
sector or at the very least some other unrelated software bug.

I'm unsure (it's basically a matter of API, not something a kernel
developer can choose liberally), and the SuSv2 is not saying anything about
O_SYNC failures in the write(2) manapge, but I guess it would be at
least saner to put the "pos" backwards if we fail osync but we just
written something (so if we previously advanced pos).

Comments? Andrew?

Andrea

2001-12-17 18:06:37

by Hugh Dickins

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..

On Mon, 17 Dec 2001, Andrea Arcangeli wrote:
>
> I'm unsure (it's basically a matter of API, not something a kernel
> developer can choose liberally), and the SuSv2 is not saying anything about
> O_SYNC failures in the write(2) manapge, but I guess it would be at
> least saner to put the "pos" backwards if we fail osync but we just
> written something (so if we previously advanced pos).

I don't have references to back me up, don't take my word for it:
but I'm sure that the correct behaviour for a partially successful
read or write in any UNIX is that it return the count done, O_SYNC
or not, and file position should match that count; only when none
has been done is -1 returned with errno set. Most implementations will
get this wrong in one corner or another, but that's how it should be.

Hugh

2001-12-17 18:13:57

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..

On Mon, Dec 17, 2001 at 06:07:47PM +0000, Hugh Dickins wrote:
> On Mon, 17 Dec 2001, Andrea Arcangeli wrote:
> >
> > I'm unsure (it's basically a matter of API, not something a kernel
> > developer can choose liberally), and the SuSv2 is not saying anything about
> > O_SYNC failures in the write(2) manapge, but I guess it would be at
> > least saner to put the "pos" backwards if we fail osync but we just
> > written something (so if we previously advanced pos).
>
> I don't have references to back me up, don't take my word for it:
> but I'm sure that the correct behaviour for a partially successful
> read or write in any UNIX is that it return the count done, O_SYNC
> or not, and file position should match that count; only when none
> has been done is -1 returned with errno set. Most implementations will
> get this wrong in one corner or another, but that's how it should be.

that's how linux handles it and as said incidentally it really looked
intentional to me as well (and Andrew's patch to return error even if
something was written without at least updating "pos" looked wrong). but
without any change ala Andrew, we cannot get errors back from O_SYNC if
at least one byte was written successfully to the cache (not to disk).
So I also see Andrew's point in doing those changes...

Andrea

2001-12-17 18:58:48

by Andrew Morton

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..

Hugh Dickins wrote:
>
> On Mon, 17 Dec 2001, Andrea Arcangeli wrote:
> >
> > I'm unsure (it's basically a matter of API, not something a kernel
> > developer can choose liberally), and the SuSv2 is not saying anything about
> > O_SYNC failures in the write(2) manapge, but I guess it would be at
> > least saner to put the "pos" backwards if we fail osync but we just
> > written something (so if we previously advanced pos).
>
> I don't have references to back me up, don't take my word for it:
> but I'm sure that the correct behaviour for a partially successful
> read or write in any UNIX is that it return the count done, O_SYNC
> or not, and file position should match that count; only when none
> has been done is -1 returned with errno set. Most implementations will
> get this wrong in one corner or another, but that's how it should be.
>

SUS says: ( http://www.opengroup.org/onlinepubs/007908799/xsh/write.html )

RETURN VALUE

Upon successful completion, write() and pwrite() will return the number of bytes
actually written to the file associated with fildes. This number will never be greater
than nbyte. Otherwise, -1 is returned and errno is set to indicate the error.

I take that to mean that if an error occurs, we return that
error regardless of how much was written.

Which makes sense. Consider this code:

open(file)
write(100k)
close(fd)

if the write gets an IO error halfway through, it looks like
the caller never gets to hear about it at present. Except via
the short return value from the write. But from my reading of SUS,
a short return value from write implicitly means ENOSPC. If we
give a short return for EIO, the calling app has no way to distinguish
this from ENOSPC.

Regarding ENOSPC, SUS says:

If a write() requests that more bytes be written than there is room for (for example, the
ulimit or the physical end of a medium), only as many bytes as there is room for will be
written. For example, suppose there is space for 20 bytes more in a file before reaching
a limit. A write of 512 bytes will return 20. The next write of a non-zero number of
bytes will give a failure return (except as noted below) and the implementation will
generate a SIGXFSZ signal for the thread.

(We don't do the SIGXFSZ in this case either).

Note that I'm not talking about the O_SYNC case here. Just bog-standard
write(), if ->prepare_write() fails.


Blah. Hard. Our behaviour at present seems to be mostly correct
for ENOSPC, and probably incorrect (and undesirable) for EIO.
I'd vote for leaving it as-is for the while. Getting this
right is a medium-sized project. There's also the matter of getting
the file pointer in the correct place on error.

-

2001-12-17 19:27:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..

In article <[email protected]>,
Andrew Morton <[email protected]> wrote:
>
>SUS says: ( http://www.opengroup.org/onlinepubs/007908799/xsh/write.html )
>
> RETURN VALUE
>
> Upon successful completion, write() and pwrite() will return the number of bytes
> actually written to the file associated with fildes. This number will never be greater
> than nbyte. Otherwise, -1 is returned and errno is set to indicate the error.
>
>I take that to mean that if an error occurs, we return that
>error regardless of how much was written.

I disagree.

Note that writing 15 characters out of 30 is also a "successful write" -
it's just a _partial_ write.

So it is acceptable to return an intermediate value.

In particular, returning "error" when 15 bytes were written loses
information that the application _cannot_ recover from.

Which is why Linux does what Linux does now:
- if you get an error half-way, we return the number of bytes
successfully written.
- a well-written app will handle partial writes correctly (otherwise it
can never handle things like sockets or pipes), so it will try to
write the remaining chunk later,
- if, at that later date, the error persists, and we cannot write any
data, the application gets the error message.

this means:
- good applications can recover gracefully from errors
- you never lose "information" about what has happened

In contrast, if you wrote and committed 15 bytes, and an error occurs
when writing the 16th byte, and you return an error, the application is
now hosed. It has no way of knowing whether _any_ of the write was
successful or not.

>Which makes sense. Consider this code:
>
> open(file)
> write(100k)
> close(fd)
>
>if the write gets an IO error halfway through, it looks like
>the caller never gets to hear about it at present.

No, the caller _does_ get to hear about it. If the caller cares about
robust handling, it will notice "Hmm, I tried to write 100k bytes, but
the system only write 50k, what's up"?

Note that the caller _has_ to do this anyway, or it wouldn't be able to
handle things like interruptible NFS mounts, sockets, pipes, out-of-disk
errors etc etc.

And if the caller does _not_ care about robustness, then who cares?
It's going to ignore whatever we return anyway.

> Except via
>the short return value from the write. But from my reading of SUS,
>a short return value from write implicitly means ENOSPC.

I disagree. A short write is _normal_ for a lot of file descriptors.

Yes, ENOSPC implies short write. But short write does not imply ENOSPC.

Linus

2001-12-17 19:53:43

by Joel Becker

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..

On Mon, Dec 17, 2001 at 07:26:05PM +0000, Linus Torvalds wrote:
> Andrew Morton <[email protected]> wrote:
> >I take that to mean that if an error occurs, we return that
> >error regardless of how much was written.
>
> I disagree.
>
> Note that writing 15 characters out of 30 is also a "successful write" -
> it's just a _partial_ write.
>
> >Which makes sense. Consider this code:
> >
> > open(file)
> > write(100k)
> > close(fd)
> >
> >if the write gets an IO error halfway through, it looks like
> >the caller never gets to hear about it at present.

IIRC, SUS states that if a fatal error occurred causing the
partial write, that error will be returned on the next write or upon
close(). Thus:

/* Smart program handles partial writes */
write(100k); = 50k
write(remaining 50k); = -1/ENOSPC|EIO|etc

or:

/* Dumb program doesn't handle partial write */
write(100k); = 50k
close(fd); = -1/EIO

Joel

--

Life's Little Instruction Book #444

"Never underestimate the power of a kind word or deed."

http://www.jlbec.org/
[email protected]

2001-12-17 20:01:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..


On Mon, 17 Dec 2001, Joel Becker wrote:
> IIRC, SUS states that if a fatal error occurred causing the
> partial write, that error will be returned on the next write or upon
> close(). Thus:
>
> /* Smart program handles partial writes */
> write(100k); = 50k
> write(remaining 50k); = -1/ENOSPC|EIO|etc

We do this, if the error is "hard". And "fatal" implies hardness, so we're
ok here.

> /* Dumb program doesn't handle partial write */
> write(100k); = 50k
> close(fd); = -1/EIO

But we're not doing this.

We'd have to save the error into the "struct file" to do the close thing.

Which is what NFS actually already does for other reasons (ie the
asynchronous nature of writes and thus IO errors), so generalizing it
might actually clean some stuff up (sockets have some of the same issues,
but socket semantics may be different enough that we do not want to have
common "error on next access" handling).

Linus

2001-12-17 20:21:19

by Joel Becker

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..

On Mon, Dec 17, 2001 at 11:59:56AM -0800, Linus Torvalds wrote:
> On Mon, 17 Dec 2001, Joel Becker wrote:
> > /* Smart program handles partial writes */
> > write(100k); = 50k
> > write(remaining 50k); = -1/ENOSPC|EIO|etc
>
> We do this, if the error is "hard". And "fatal" implies hardness, so we're
> ok here.

Right. "hard" is also synonymous with "non-transient".

> > /* Dumb program doesn't handle partial write */
> > write(100k); = 50k
> > close(fd); = -1/EIO
>
> But we're not doing this.

IMHO we should be, and not just to comply with the letter of
SUS/Unix98. SUS specifies this behavior because a synchronous write()
can return after copying data to the buffer cache. However, the EIO can
happen later when the buffer cache is trying to flush to disk. The only
way for an application to see this error is to either run O_SYNC or
receive it upon close().

Joel

--

"Every day I get up and look through the Forbes list of the richest
people in America. If I'm not there, I go to work."
- Robert Orben

http://www.jlbec.org/
[email protected]

2001-12-17 20:44:31

by Andre Hedrick

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..


You are asking for something that Linux is not capable of doing.
There is no means to send and error back from the disk flush to the
fs/appilcation period. The current 2.4 can not even find the partition
when send an error back up to the block layer. 2.5 has a chance, but
currently none have the ablitity to notify or flush disk cache and recover
of there is a flushcache error.

Therefore it is potential a preferred model to preserve the entire request
for a retry than to do a partial validation of an incomplete attempt.

On Mon, 17 Dec 2001, Joel Becker wrote:

> On Mon, Dec 17, 2001 at 11:59:56AM -0800, Linus Torvalds wrote:
> > On Mon, 17 Dec 2001, Joel Becker wrote:
> > > /* Smart program handles partial writes */
> > > write(100k); = 50k
> > > write(remaining 50k); = -1/ENOSPC|EIO|etc
> >
> > We do this, if the error is "hard". And "fatal" implies hardness, so we're
> > ok here.
>
> Right. "hard" is also synonymous with "non-transient".
>
> > > /* Dumb program doesn't handle partial write */
> > > write(100k); = 50k
> > > close(fd); = -1/EIO
> >
> > But we're not doing this.
>
> IMHO we should be, and not just to comply with the letter of
> SUS/Unix98. SUS specifies this behavior because a synchronous write()
> can return after copying data to the buffer cache. However, the EIO can
> happen later when the buffer cache is trying to flush to disk. The only
> way for an application to see this error is to either run O_SYNC or
> receive it upon close().


Andre Hedrick
CEO/President, LAD Storage Consulting Group
Linux ATA Development
Linux Disk Certification Project


2001-12-26 18:11:24

by Riley Williams

[permalink] [raw]
Subject: Re: O_DIRECT wierd behavior..

Hi Andrew, Linus.

>> RETURN VALUE
>>
>> Upon successful completion, write() and pwrite() will return the
>> number of bytes actually written to the file associated with
>> fildes. This number will never be greater than nbyte. Otherwise,
>> -1 is returned and errno is set to indicate the error.
>>
>> I take that to mean that if an error occurs, we return that
>> error regardless of how much was written.

> I disagree.
>
> Note that writing 15 characters out of 30 is also a "successful write" -
> it's just a _partial_ write.
>
> So it is acceptable to return an intermediate value.

In the "Linux Device Drivers" book, the author writes quite a lot of
drivers for virtual devices that actually rely on the ability to return
short writes in many cases. Many of the standard UNIX utilities rely on
just the same behaviour as well.

>> Which makes sense. Consider this code...
>>
>> open(file)
>> write(100k)
>> close(fd)

...which ought to look like this code...

open(file)
ptr = fillbuffer(100k)
endptr = ptr + 100k
while (ptr < endptr) {
result += write(ptr, endptr - ptr)
if (result > 0)
ptr += result
else
ERROR
}
close(file)

...as anything else is just plain buggy and can't be otherwise.

>> if the write gets an IO error halfway through, it looks like
>> the caller never gets to hear about it at present.

> No, the caller _does_ get to hear about it. If the caller cares
> about robust handling, it will notice "Hmm, I tried to write 100k
> bytes, but the system only wrote 50k, what's up"?
>
> Note that the caller _has_ to do this anyway, or it wouldn't be able
> to handle things like interruptible NFS mounts, sockets, pipes,
> out-of-disk errors etc etc.

It's also not exactly hard to do the right thing - there's little extra
in the revised code snippet to the original one.

> And if the caller does _not_ care about robustness, then who cares?
> It's going to ignore whatever we return anyway.

Precicely why that sort of code is so essential.

Best wishes from Riley.

2002-01-20 04:16:41

by Suresh Gopalakrishnan

[permalink] [raw]
Subject: multithreaded RPC handling


Am not sure if these are kernel issues or a library issues..

Is it ok to have multiple threads call svc_run() and then let each thread
handle the request it gets? (In other words, does select allow multiple
threads to block on the same set of fds, and correctly wake up only one?)

I was looking at the (old) user level NFS server, and wonder why it forks
multiple servers rather than have threads. Are there any RPC issues
involved? Or is it just to avoid synchronization of the fd/filehandle
caches? (Or maybe the thread support was poor/absent then?)

Thanks
--suresh