2003-06-12 11:00:54

by Matti Aarnio

Subject: open(.. O_DIRECT ..) difference in between Linux and FreeBSD ..

I have been debugging long and hard a thing where IO is done
with O_DIRECT flag applied to open(2).

Unlike Linux, FreeBSD (where this flag originates, apparently) does
_not_ require that read()/write() happens from page aligned memory
areas, and/or be of page-size multiples in size.

This needs at least wording in open(2) man-page, possibly code
changes in the kernel to support alike behaviour.

/Matti Aarnio


2003-06-12 11:13:19

by Andi Kleen

Subject: Re: open(.. O_DIRECT ..) difference in between Linux and FreeBSD ..

Matti Aarnio writes:

> Unlike Linux, FreeBSD (where this flag originates, apparently) does

It doesn't. It originates from Irix. AFAIK Irix has similar restrictions.

-Andi

2003-06-12 11:10:36

by Christoph Hellwig

Subject: Re: open(.. O_DIRECT ..) difference in between Linux and FreeBSD ..

On Thu, Jun 12, 2003 at 02:14:37PM +0300, Matti Aarnio wrote:
> I have been debugging long and hard a thing where IO is done
> with O_DIRECT flag applied to open(2).
>
> Unlike Linux, FreeBSD (where this flag originates, apparently)

O_DIRECT comes from IRIX.

2003-06-12 13:03:21

by Andries Brouwer

Subject: Re: open(.. O_DIRECT ..) difference in between Linux and FreeBSD ..

On Thu, Jun 12, 2003 at 02:14:37PM +0300, Matti Aarnio wrote:

> I have been debugging long and hard a thing where IO is done
> with O_DIRECT flag applied to open(2).
>
> Unlike Linux, FreeBSD (where this flag originates, apparently) does
> _not_ require that read()/write() happens from page aligned memory
> areas, and/or be of page-size multiples in size.
>
> This needs at least wording in open(2) man-page

Ha Matti, I was going to suggest you to send a patch to the man page
maintainer, but maybe the wording you ask for is there already and
you just have some outdated version of the manpages?

Andries

O_DIRECT
    Try to minimize cache effects of the I/O to and from this file.
    In general this will degrade performance, but it is useful in
    special situations, such as when applications do their own caching.
    File I/O is done directly to/from user space buffers. The I/O is
    synchronous, i.e., at the completion of the read(2) or write(2)
    system call, data is guaranteed to have been transferred.
    Transfer sizes, and the alignment of user buffer and file offset
    must all be multiples of the logical block size of the file system.
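
The rule quoted from the man page (transfer size, user buffer address and
file offset all multiples of one block size) can be checked in user space
before a request is issued. The following is a minimal sketch, not code from
this thread: the helper names dio_round_up, dio_request_ok and dio_alloc are
hypothetical, and the block size is passed in by the caller because which
value applies (filesystem block or device sector) is exactly what the thread
is arguing about.

```c
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign() */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: round len up to the next multiple of blk.
 * Assumes blk is a non-zero power of two (512, 4096, ...). */
static size_t dio_round_up(size_t len, size_t blk)
{
    assert(blk != 0 && (blk & (blk - 1)) == 0);
    return (len + blk - 1) & ~(blk - 1);
}

/* Hypothetical helper: returns 1 when buffer address, file offset and
 * transfer length are all multiples of blk, 0 otherwise. */
static int dio_request_ok(const void *buf, off_t offset, size_t len, size_t blk)
{
    if (blk == 0 || len == 0 || offset < 0)
        return 0;
    return (uintptr_t)buf % blk == 0 &&
           (uint64_t)offset % blk == 0 &&
           len % blk == 0;
}

/* Hypothetical helper: allocate a buffer aligned to blk whose size is
 * len rounded up to a multiple of blk. Release it with free(). */
static void *dio_alloc(size_t len, size_t blk)
{
    void *p = NULL;

    if (posix_memalign(&p, blk, dio_round_up(len, blk)) != 0)
        return NULL;
    return p;
}
```

A caller would allocate with dio_alloc(), then refuse (or pad) any request
for which dio_request_ok() returns 0 instead of waiting for the kernel to
reject it.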

2003-06-12 14:44:42

by Dave Jones

Subject: Re: open(.. O_DIRECT ..) difference in between Linux and FreeBSD ..

On Thu, Jun 12, 2003 at 03:17:04PM +0200, Andries Brouwer wrote:
> Transfer sizes, and the alignment of user buffer and file offset
> must all be multiples of the logical block size of the file system.

Just to confirm something that I wrote in the post-halloween-2.5 doc,
that doesn't tally with this..

- The size and alignment of O_DIRECT file IO requests now matches that
of the device, not the filesystem. Typically this means that
you can perform O_DIRECT IO with 512-byte granularity rather than 4k.

Is this a case of the man pages not following 2.5 yet, or is this
incorrect ?

Dave

2003-06-12 14:55:27

by Matti Aarnio

Subject: Re: open(.. O_DIRECT ..) difference in between Linux and FreeBSD ..

On Thu, Jun 12, 2003 at 03:58:14PM +0100, Dave Jones wrote:
> > Transfer sizes, and the alignment of user buffer and file offset
> > must all be multiples of the logical block size of the file system.
>
> Just to confirm something that I wrote in the post-halloween-2.5 doc,
> that doesn't tally with this..
>
> - The size and alignment of O_DIRECT file IO requests now matches that
> of the device, not the filesystem. Typically this means that
> you can perform O_DIRECT IO with 512-byte granularity rather than 4k.
>
> Is this a case of the man pages not following 2.5 yet, or is this
> incorrect ?

I think of three things:
- 2.4 defines rules in most confusing manner
- 2.5 continues that
- We need more complete IRIX's O_DIRECT API:

from open(2):
O_DIRECT
If set, all reads and writes on the resulting file descriptor will
be performed directly to or from the user program buffer, provided
appropriate size and alignment restrictions are met. Refer to the
F_SETFL and F_DIOINFO commands in the fcntl(2) manual entry for
information about how to determine the alignment constraints.
O_DIRECT is a Silicon Graphics extension and is only supported on
local EFS and XFS file systems, and remote BDS file systems.


from fcntl(2):
F_SETFL Set file status flags to the third argument, ....

Flags not understood for a particular descriptor are silently
ignored except for FDIRECT. FDIRECT will return EINVAL if used
on other than an EFS, XFS or BDS file system file.

F_DIOINFO Get information required to perform direct I/O on the specified
fildes. Direct I/O is performed directly to and from a user's
data buffer. Since the kernel's buffer cache is no longer
between the two, the user's data buffer must conform to the
same type of constraints as required for accessing a raw disk
partition. The third argument, arg, points to a data type
struct dioattr which is defined in the <fcntl.h> header file
and contains the following members: d_mem is the memory
alignment requirement of the user's data buffer. d_miniosz
specifies block size, minimum I/O request size, and I/O
alignment. The size of all I/O requests must be a multiple of
this amount and the value of the seek pointer at the time of
the I/O request must also be an integer multiple of this
amount. d_maxiosz is the maximum I/O request size which can be
performed on the fildes. If an I/O request does not meet these
constraints, the read(2) or write(2) will return with EINVAL.
All I/O requests are kept consistent with any data brought into
the cache with an access through a non-direct I/O file
descriptor. See also F_SETFL above and open(2).

> Dave

/Matti Aarnio
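
The F_DIOINFO text above describes three limits that a direct I/O request
has to satisfy. A minimal sketch of that check follows. It is an
illustration only: struct dioattr_sketch is a local stand-in that mirrors
the three members named in the IRIX text (it is not the real <fcntl.h>
definition), dio_check is a hypothetical name, and rejecting a zero-length
request is an assumption the man page does not spell out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for the structure described in the IRIX fcntl(2) text.
 * Member names follow the man page; the types are an assumption. */
struct dioattr_sketch {
    unsigned int d_mem;      /* memory alignment of the user's buffer */
    unsigned int d_miniosz;  /* block size, minimum I/O size, I/O alignment */
    unsigned int d_maxiosz;  /* maximum I/O request size */
};

/* Hypothetical check: returns 0 when the request meets the constraints,
 * EINVAL otherwise (the error the man page says read/write would return). */
static int dio_check(const struct dioattr_sketch *da, const void *buf,
                     uint64_t seek_pos, size_t len)
{
    if (da->d_mem == 0 || da->d_miniosz == 0)
        return EINVAL;                       /* unusable limits */
    if ((uintptr_t)buf % da->d_mem != 0)
        return EINVAL;                       /* buffer not aligned */
    if (len == 0 || len % da->d_miniosz != 0)
        return EINVAL;                       /* size not a block multiple */
    if (seek_pos % da->d_miniosz != 0)
        return EINVAL;                       /* seek pointer not aligned */
    if (len > da->d_maxiosz)
        return EINVAL;                       /* request too large */
    return 0;
}
```

The point of the IRIX design is that the three numbers come from the kernel
per file descriptor, so the application does not have to guess whether the
filesystem block size or the device sector size applies.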

2003-06-12 22:59:25

by Rob van Nieuwkerk

Subject: Re: open(.. O_DIRECT ..) difference in between Linux and FreeBSD ..


Andries Brouwer wrote:
> O_DIRECT
>     Try to minimize cache effects of the I/O to and from this file.
>     In general this will degrade performance, but it is useful in
>     special situations, such as when applications do their own caching.
>     File I/O is done directly to/from user space buffers. The I/O is
>     synchronous, i.e., at the completion of the read(2) or write(2)
>     system call, data is guaranteed to have been transferred.
>     Transfer sizes, and the alignment of user buffer and file offset
>     must all be multiples of the logical block size of the file system.

FYI:
It appears that somewhere between RH kernels 2.4.18-27.7.x and 2.4.20-18.9
something has changed so that my application needs a O_SYNC too besides
the O_DIRECT to make sure that writes will be synchronous. If I leave
the O_SYNC out with 2.4.20-18.9 the write will happen physically 35
seconds after the write() was done.

Haven't tested with a vanilla 2.4.* kernel yet but will try.
(All modern 2.4.2? kernels I tried will hang for > 30s during boot while
probing the CompactFlash and are because of that kind of useless: my
application needs a 5s boot-time ..)

greetings,
Rob van Nieuwkerk
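
Rob's workaround (asking for O_SYNC in addition to O_DIRECT) can be
sketched as below. This is an assumption-laden illustration, not his
application's code: open_for_sync_log is a hypothetical name, the 0600 mode
is arbitrary, and the fallback to plain O_SYNC when the open fails with
EINVAL is a design choice of this sketch. The function cannot tell whether
the kernel actually honors O_DIRECT; it only makes sure O_SYNC is always
requested.

```c
#define _GNU_SOURCE   /* exposes O_DIRECT in <fcntl.h> on glibc */
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open path for writing with O_SYNC always set,
 * and O_DIRECT as well when the platform defines it and accepts it.
 * Returns a file descriptor, or -1 with errno set. */
static int open_for_sync_log(const char *path)
{
    int flags = O_WRONLY | O_CREAT | O_SYNC;

#ifdef O_DIRECT
    int fd = open(path, flags | O_DIRECT, 0600);
    if (fd >= 0 || errno != EINVAL)
        return fd;    /* success, or a failure unrelated to O_DIRECT */
#endif
    /* O_DIRECT unavailable or rejected: keep the synchronous part. */
    return open(path, flags, 0600);
}
```

The caller closes the descriptor with close() as usual. Writes through it
still have to respect whatever alignment rules the running kernel enforces
for O_DIRECT.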

2003-06-12 23:08:09

by Nuno Silva

Subject: Re: open(.. O_DIRECT ..) difference in between Linux and FreeBSD ..

Hi!

OARS, anybody knows a patch that implements O_DIRECT in 2.4 the same way
that it's implemented in 2.5?

2.5's O_DIRECT is much less restrictive than 2.4's. OTOH 2.5 is still
not recommended for production use. Any way of having the best of both
worlds? :)

Thanks,
Nuno Silva

Matti Aarnio wrote:
> On Thu, Jun 12, 2003 at 03:58:14PM +0100, Dave Jones wrote:
>
>> > Transfer sizes, and the alignment of user buffer and file offset
>> > must all be multiples of the logical block size of the file system.
>>
>>Just to confirm something that I wrote in the post-halloween-2.5 doc,
>>that doesn't tally with this..
>>
>>- The size and alignment of O_DIRECT file IO requests now matches that
>> of the device, not the filesystem. Typically this means that
>> you can perform O_DIRECT IO with 512-byte granularity rather than 4k.
>>
>>Is this a case of the man pages not following 2.5 yet, or is this
>>incorrect ?
>
>
> I think of three things:
> - 2.4 defines rules in most confusing manner
> - 2.5 continues that
> - We need more complete IRIX's O_DIRECT API:
>
> from open(2):
> O_DIRECT
> If set, all reads and writes on the resulting file descriptor will
> be performed directly to or from the user program buffer, provided
> appropriate size and alignment restrictions are met. Refer to the
> F_SETFL and F_DIOINFO commands in the fcntl(2) manual entry for
> information about how to determine the alignment constraints.
> O_DIRECT is a Silicon Graphics extension and is only supported on
> local EFS and XFS file systems, and remote BDS file systems.
>
>
> from fcntl(2):
> F_SETFL Set file status flags to the third argument, ....
>
> Flags not understood for a particular descriptor are silently
> ignored except for FDIRECT. FDIRECT will return EINVAL if used
> on other than an EFS, XFS or BDS file system file.
>
> F_DIOINFO Get information required to perform direct I/O on the specified
> fildes. Direct I/O is performed directly to and from a user's
> data buffer. Since the kernel's buffer cache is no longer
> between the two, the user's data buffer must conform to the
> same type of constraints as required for accessing a raw disk
> partition. The third argument, arg, points to a data type
> struct dioattr which is defined in the <fcntl.h> header file
> and contains the following members: d_mem is the memory
> alignment requirement of the user's data buffer. d_miniosz
> specifies block size, minimum I/O request size, and I/O
> alignment. The size of all I/O requests must be a multiple of
> this amount and the value of the seek pointer at the time of
> the I/O request must also be an integer multiple of this
> amount. d_maxiosz is the maximum I/O request size which can be
> performed on the fildes. If an I/O request does not meet these
> constraints, the read(2) or write(2) will return with EINVAL.
> All I/O requests are kept consistent with any data brought into
> the cache with an access through a non-direct I/O file
> descriptor. See also F_SETFL above and open(2).
>
>
>> Dave
>
>
> /Matti Aarnio

2003-06-13 07:33:20

by Arjan van de Ven

Subject: Re: open(.. O_DIRECT ..) difference in between Linux and FreeBSD ..

On Fri, Jun 13, 2003 at 01:12:57AM +0200, Rob van Nieuwkerk wrote:
> FYI:
> It appears that somewhere between RH kernels 2.4.18-27.7.x and 2.4.20-18.9
> something has changed so that my application needs a O_SYNC too besides
> the O_DIRECT to make sure that writes will be synchronous. If I leave
> the O_SYNC out with 2.4.20-18.9 the write will happen physically 35
> seconds after the write() was done.

O_DIRECT is nothing but a hint and the 2.4.20-18.9 kernel decides to not
honor it

2003-06-13 08:15:53

by Arjan van de Ven

Subject: Re: open(.. O_DIRECT ..) difference in between Linux and FreeBSD ..

On Fri, Jun 13, 2003 at 10:27:52AM +0200, Rob van Nieuwkerk wrote:
>
> Arjan van de Ven wrote:
> > On Fri, Jun 13, 2003 at 01:12:57AM +0200, Rob van Nieuwkerk wrote:
> > > FYI:
> > > It appears that somewhere between RH kernels 2.4.18-27.7.x and 2.4.20-18.9
> > > something has changed so that my application needs a O_SYNC too besides
> > > the O_DIRECT to make sure that writes will be synchronous. If I leave
> > > the O_SYNC out with 2.4.20-18.9 the write will happen physically 35
> > > seconds after the write() was done.
> >
> > O_DIRECT is nothing but a hint and the 2.4.20-18.9 kernel decides to not
> > honor it
>
> Hi Arjan,
>
> Do you mean that the 2.4.20-18.9 kernel always ignores the O_DIRECT flag ?

yes.

2003-06-13 08:14:14

by Rob van Nieuwkerk

Subject: Re: open(.. O_DIRECT ..) difference in between Linux and FreeBSD ..


Arjan van de Ven wrote:
> On Fri, Jun 13, 2003 at 01:12:57AM +0200, Rob van Nieuwkerk wrote:
> > FYI:
> > It appears that somewhere between RH kernels 2.4.18-27.7.x and 2.4.20-18.9
> > something has changed so that my application needs a O_SYNC too besides
> > the O_DIRECT to make sure that writes will be synchronous. If I leave
> > the O_SYNC out with 2.4.20-18.9 the write will happen physically 35
> > seconds after the write() was done.
>
> O_DIRECT is nothing but a hint and the 2.4.20-18.9 kernel decides to not
> honor it

Hi Arjan,

Do you mean that the 2.4.20-18.9 kernel always ignores the O_DIRECT flag ?

greetings,
Rob van Nieuwkerk

2003-06-13 08:48:46

by Rob van Nieuwkerk

Subject: Re: open(.. O_DIRECT ..) difference in between Linux and FreeBSD ..


Arjan van de Ven wrote:
> On Fri, Jun 13, 2003 at 10:27:52AM +0200, Rob van Nieuwkerk wrote:
> >
> > Arjan van de Ven wrote:
> > > On Fri, Jun 13, 2003 at 01:12:57AM +0200, Rob van Nieuwkerk wrote:
> > > > FYI:
> > > > It appears that somewhere between RH kernels 2.4.18-27.7.x and 2.4.20-18.9
> > > > something has changed so that my application needs a O_SYNC too besides
> > > > the O_DIRECT to make sure that writes will be synchronous. If I leave
> > > > the O_SYNC out with 2.4.20-18.9 the write will happen physically 35
> > > > seconds after the write() was done.
> > >
> > > O_DIRECT is nothing but a hint and the 2.4.20-18.9 kernel decides to not
> > > honor it
> >
> > Hi Arjan,
> >
> > Do you mean that the 2.4.20-18.9 kernel always ignores the O_DIRECT flag ?
>
> yes.

Hi Arjan,

OK, that would explain why I see an old problem (*) re-appear in my
application that was solved/worked-around by using O_DIRECT when using
2.4.20-18.9.

Just to make sure I understand it correctly, is it like this: ?
"Kernel 2.4.20-18.9 completely ignores the O_DIRECT flag. Not only the
"synchronous writes part" but also you will get read-ahead despite
using O_DIRECT. The 2.4.20-18.9 with O_DIRECT behaviour is similar to
the 2.4.18-27.7.x without O_DIRECT (concerning synchronity of write()
and the number of physical media reads & writes)."

Just curious: what is the reason for ignoring O_DIRECT in 2.4.20-18.9 ?
Interactivity behaviour ?

Greetings,
Rob van Nieuwkerk


(*) I have an application that runs from CompactFlash that uses a Philips
webcam (pwc driver). It turned out that too much CompactFlash access
(in PIO mode) causes the camera(driver?) to stall and never wake up
again :-( I only log 2048 byte records to a raw partition. With
O_DIRECT and proper data aligning I could reduce the CF-access to
exactly 4 512 byte sector writes. This was enough to never trigger
the problem.
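
The arithmetic in the footnote (a 2048-byte record, properly aligned,
costs exactly four 512-byte sector writes) can be worked out with a small
helper. This is a sketch with a hypothetical function name,
sectors_touched; it only counts the sectors a byte range overlaps and says
nothing about extra reads or writes a cached I/O path might add.

```c
#include <stdint.h>

/* Hypothetical helper: number of sectors overlapped by the byte range
 * [offset, offset + len). Returns 0 for an empty range or sector == 0. */
static uint64_t sectors_touched(uint64_t offset, uint64_t len, uint64_t sector)
{
    if (len == 0 || sector == 0)
        return 0;

    uint64_t first = offset / sector;              /* first sector hit */
    uint64_t last = (offset + len - 1) / sector;   /* last sector hit */

    return last - first + 1;
}
```

With 512-byte sectors, a 2048-byte record at offset 0 or 2048 overlaps 4
sectors, while the same record starting at offset 100 overlaps 5, which is
why the data alignment mattered for keeping CompactFlash access to a
minimum.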

2003-06-13 20:51:15

by Andries Brouwer

Subject: Re: open(.. O_DIRECT ..) difference in between Linux and FreeBSD ..

On Thu, Jun 12, 2003 at 06:09:09PM +0300, Matti Aarnio wrote:
> On Thu, Jun 12, 2003 at 03:58:14PM +0100, Dave Jones wrote:

[all clipped - later]

I was reminded of the following quote:

"The thing that has always disturbed me about O_DIRECT is that the whole
interface is just stupid, and was probably designed by a deranged monkey
on some serious mind-controlling substances."

I'll add that to the BUGS section.