Has anyone ever done an madvise(2)-type syscall for file descriptors?
(or does the capability exist and I'm missing it?)
I was thinking, in playing around with stuff like cp(1) I've found that
standard read(2) and write(2) of a 4-8K buffer is the fastest solution
overall, in addition to providing the useful side effect of better error
reporting, such as ENOSPC report. Better error reporting than the
alternative I see anyway, mmap(2).
So... we have madvise, why not fadvise? I would love the capability for
applications to provide hints to the OS like madvise, but for file
descriptors...
Jeff
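For reference, the read(2)/write(2) loop being described can be sketched roughly as follows. This is a minimal illustration, not the actual cp(1) source: the name copy_fd and the fixed 8K buffer are assumptions. The point is that a failed write(2) returns at once with errno set (ENOSPC on a full disk), whereas a store through an mmap(2) mapping has no return value to check.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: copy everything from src_fd to dst_fd through an
 * 8K buffer. Returns 0 on success, -1 on error with errno left as set by
 * the failing read(2) or write(2), e.g. ENOSPC. */
static int copy_fd(int src_fd, int dst_fd)
{
    char buf[8192];
    ssize_t nread;

    assert(src_fd >= 0 && dst_fd >= 0);
    while ((nread = read(src_fd, buf, sizeof buf)) != 0) {
        if (nread < 0) {
            if (errno == EINTR)
                continue;          /* interrupted, retry the read */
            return -1;
        }
        char *p = buf;
        while (nread > 0) {        /* write(2) may be partial */
            ssize_t nwritten = write(dst_fd, p, (size_t)nread);
            if (nwritten < 0) {
                if (errno == EINTR)
                    continue;      /* interrupted, retry the write */
                return -1;         /* error is reported right here */
            }
            p += nwritten;
            nread -= nwritten;
        }
    }
    return 0;
}
```

A caller would check the return value and report strerror(errno) on failure.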
Jeff Garzik wrote:
>
> Has anyone ever done an madvise(2)-type syscall for file descriptors?
> (or does the capability exist and I'm missing it?)
Well, question is: is madvise() any use? :)
> I was thinking, in playing around with stuff like cp(1) I've found that
> standard read(2) and write(2) of a 4-8K buffer is the fastest solution
> overall, in addition to providing the useful side effect of better error
> reporting, such as ENOSPC report. Better error reporting than the
> alternative I see anyway, mmap(2).
4k to 8k is best on x86 at least. And if you're actually going to *use*
each byte in the file, the zero-copy characteristics of mmap aren't
worth much at all.
> So... we have madvise, why not fadvise? I would love the capability for
> applications to provide hints to the OS like madvise, but for file
> descriptors...
The one hint which I can think of which would be beneficial would
be an equivalent to MADV_SEQUENTIAL. Something which says "this
is a big streaming read/write - don't go and evict other stuff because
of it". O_STREAMING perhaps. Or working dropbehind heuristics,
although I suspect that explicit controls will always do better.
For MADV_RANDOM, readahead window scaling should get that right.
What else were you thinking of?
Andrew Morton wrote:
>Jeff Garzik wrote:
>
>>Has anyone ever done an madvise(2)-type syscall for file descriptors?
>>(or does the capability exist and I'm missing it?)
>>
>
>Well, question is: is madvise() any use? :)
>
:)
>>I was thinking, in playing around with stuff like cp(1) I've found that
>>standard read(2) and write(2) of a 4-8K buffer is the fastest solution
>>overall, in addition to providing the useful side effect of better error
>>reporting, such as ENOSPC report. Better error reporting than the
>>alternative I see anyway, mmap(2).
>>
>
>4k to 8k is best on x86 at least. And if you're actually going to *use*
>each byte in the file, the zero-copy characteristics of mmap aren't
>worth much at all.
>
That's exactly what I found through experimentation.
>>So... we have madvise, why not fadvise? I would love the capability for
>>applications to provide hints to the OS like madvise, but for file
>>descriptors...
>>
>
>The one hint which I can think of which would be beneficial would
>be an equivalent to MADV_SEQUENTIAL. Something which says "this
>is a big streaming read/write - don't go and evict other stuff because
>of it". O_STREAMING perhaps. Or working dropbehind heuristics,
>although I suspect that explicit controls will always do better.
>
>For MADV_RANDOM, readahead window scaling should get that right.
>
>What else were you thinking of?
>
Hints for,
* sequential read
* sequential write
* sequential write, where the application considers the data it's
writing to be unlikely to be read again any time soon (hopefully
implying to the page cache that these pages have low value as cacheable
objects)
* some sort of streaming hints, implying that the application cares a
lot about maintaining some minimum i/o rate. note I said hint, not
requirement. -not- guaranteed-rate-IO.
I might even go so far as to advocate identifying common usage patterns,
and creating hint constants for them, even if we don't support them in
the kernel immediately (if ever). Makes the interface much more
future-proof, at the expense of a few integers in a 32-bit numberspace,
and a few more bytes in the C compiler's symbol table.
Jeff
At 09:10 17/03/02, Jeff Garzik wrote:
>Andrew Morton wrote:
>>Jeff Garzik wrote:
>>>So... we have madvise, why not fadvise? I would love the capability for
>>>applications to provide hints to the OS like madvise, but for file
>>>descriptors...
>>
>>The one hint which I can think of which would be beneficial would
>>be an equivalent to MADV_SEQUENTIAL. Something which says "this
>>is a big streaming read/write - don't go and evict other stuff because
>>of it". O_STREAMING perhaps. Or working dropbehind heuristics,
>>although I suspect that explicit controls will always do better.
>>
>>For MADV_RANDOM, readahead window scaling should get that right.
>>
>>What else were you thinking of?
>
>Hints for,
>* sequential read
>* sequential write
>* sequential write, where the application considers the data it's writing
>to be unlikely to be read again any time soon (hopefully implying to the
>page cache that these pages have low value as cacheable objects)
>* some sort of streaming hints, implying that the application cares a lot
>about maintaining some minimum i/o rate. note I said hint, not
>requirement. -not- guaranteed-rate-IO.
>
>I might even go so far as to advocate identifying common usage patterns,
>and creating hint constants for them, even if we don't support them in the
>kernel immediately (if ever). Makes the interface much more future-proof,
>at the expense of a few integers in a 32-bit numberspace, and a few more
>bytes in the C compiler's symbol table.
We don't need fadvise IMHO. That is what open(2) is for. The streaming
request you are asking for is just a normal open(2). It will do read ahead
which is perfect for streaming (of data size << RAM size in its current form).
When you want large data streaming, i.e. you start getting worried about
memory pressure, then you want open(2) + O_DIRECT. No caching done. Perfect
for large data streams and we have that already. I agree that you may want
some form of asynchronous read ahead with passed pages being dropped from
the cache but that could be just a open(2) + O_SEQUENTIAL (doesn't exist yet).
All of what you are asking for exists in Windows and all the semantics are
implemented through a very powerful open(2) equivalent. I don't see why we
shouldn't do the same. It makes more sense to me than inventing yet another
system call...
The Windows NT/2k/XP CreateFile() call is documented at the URL below. Search
for FILE_FLAG_* and there is a nice big table with all the possible access
method hints one can give when opening or creating a file. Many of those
make perfect sense to have in the Linux kernel too, and in fact with
O_DIRECT we already have some of the functionality Windows offers (there it
would be FILE_FLAG_NO_BUFFERING):
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/fileio/filesio_7wmd.asp
Best regards,
Anton
--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/
On Sun, 17 Mar 2002, Anton Altaparmakov wrote:
> All of what you are asking for exists in Windows and all the semantics are
> implemented through a very powerful open(2) equivalent. I don't see why we
> shouldn't do the same. It makes more sense to me than inventing yet another
> system call...
It is easier for application writers to code:
[...]
#ifdef HAVE_FADVISE
(void)fadvise(fd, FADV_STREAMING);
#endif
[...]
Than to have a forest of #ifdefs to determine which O_* flags are
supported. After all, we still want our programs to run under Solaris. :-)
Simon
--
GPG public key available from http://phobos.fs.tum.de/pgp/Simon.Richter.asc
Fingerprint: 040E B5F7 84F1 4FBC CEAD ADC6 18A0 CC8D 5706 A4B4
Hi! I'm a .signature virus! Copy me into your ~/.signature to help me spread!
> It is easier for application writers to code:
>
> [...]
> #ifdef HAVE_FADVISE
> (void)fadvise(fd, FADV_STREAMING);
> #endif
> [...]
>
> Than to have a forest of #ifdefs to determine which O_* flags are
> supported. After all, we still want our programs to run under Solaris. :-)
#ifndef O_STREAMING
#define O_STREAMING 0
#endif
(and then just use the flag in open)
is still better - it can be done in a header somewhere, once for all opens.
--------------------------------------------------------------------------------
- Jan Hudec `Bulb' <[email protected]>
At 14:31 17/03/02, Simon Richter wrote:
>On Sun, 17 Mar 2002, Anton Altaparmakov wrote:
>
> > All of what you are asking for exists in Windows and all the semantics are
> > implemented through a very powerful open(2) equivalent. I don't see why we
> > shouldn't do the same. It makes more sense to me than inventing yet another
> > system call...
>
>It is easier for application writers to code:
>
>[...]
>#ifdef HAVE_FADVISE
> (void)fadvise(fd, FADV_STREAMING);
>#endif
>[...]
>
>Than to have a forest of #ifdefs to determine which O_* flags are
>supported. After all, we still want our programs to run under Solaris. :-)
Ugh. Both of your suggestions look ugly. Using the O_* flags, you just need
to have a compatibility header file which contains:
#ifndef HAVE_O_SEQUENTIAL
# define O_SEQUENTIAL 0
#endif
Then in the code you just use O_SEQUENTIAL and if the system doesn't know
about it it is optimised away at compile time.
Note how nicely this fits in with autoconf/automake where the ./configure
script can test for O_SEQUENTIAL and if it is not there automatically
define it to 0. That then means your code is completely free from these
ugly #ifdefs.
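A sketch of what such a compatibility header and its use might look like. O_SEQUENTIAL is the hypothetical flag discussed in this thread (it does not exist on Linux), HAVE_O_SEQUENTIAL is assumed to be defined by ./configure when the system provides the flag, and open_for_streaming is an illustrative name only.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* --- would live in a compat header, included once --- */
#if !defined(HAVE_O_SEQUENTIAL) && !defined(O_SEQUENTIAL)
# define O_SEQUENTIAL 0   /* hypothetical flag: OR-ing in 0 is a no-op */
#endif

/* --- application code, free of #ifdefs --- */
static int open_for_streaming(const char *path)
{
    assert(path != NULL);
    /* Where the flag is unknown this reduces to a plain O_RDONLY open. */
    return open(path, O_RDONLY | O_SEQUENTIAL);
}
```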
Thanks for making your point as that is ANOTHER argument for using open(2)
instead of fadvise() [1]. (-;
Cheers,
Anton
[1] Yeah, I know, one could also define fadvise() to nothing in the compat
header file...
There is a posix_fadvise() syscall in the POSIX Advanced Realtime
specification
http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
I don't know if this has been mentioned on linux-kernel before, but in
January, the Open Group, in cooperation with IEEE, added the POSIX
functionality to their specification and made it available online for free.
It's at
http://www.opengroup.org/onlinepubs/007904975/toc.htm
There are some useful tables at
http://www.unix-systems.org/version3/online.html and they ask that you
register there so that they know how many people are using the
specification.
They don't have a downloadable version of this specification, but they do
for the previous versions:
http://www.opengroup.org/onlinepubs/007908799/download/
Ken Hirsch
At 15:13 17/03/02, Ken Hirsch wrote:
>There is a posix_fadvise() syscall in the POSIX Advanced Realtime
>specification
>http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
Posix or not I still don't see why one would want that. You know what you
are going to be using a file for at open time and you are not going to be
changing your mind later. If you can show me a single _real_world_ example
where one would genuinely want to change from one access pattern to another
without closing/reopening a particular file I would agree that fadvise is a
good idea but otherwise I think open(2) is the superior approach.
In addition, open(2) allows you to do cool things like O_TEMP which could
create a file that would never get written to disk at all and on close
would just disappear again (just an idea, I can see good uses for such
things, although in a way we already have similar semantics when one
creates such files on a tmpfs mount).
Best regards,
Anton
>I don't know if this has been mentioned on linux-kernel before, but in
>January, the Open Group, in cooperation with IEEE, added the POSIX
>functionality to their specification and made it available online for free.
>It's at
>http://www.opengroup.org/onlinepubs/007904975/toc.htm
>
>There are some useful tables at
>http://www.unix-systems.org/version3/online.html and they ask that you
>register there so that they know how many people are using the
>specification.
>
>They don't have a downloadable version of this specification, but they do
>for the previous versions:
>http://www.opengroup.org/onlinepubs/007908799/download/
>
>Ken Hirsch
On Sun, Mar 17, 2002 at 05:14:20PM +0000, Anton Altaparmakov wrote:
> At 15:13 17/03/02, Ken Hirsch wrote:
> >There is a posix_fadvise() syscall in the POSIX Advanced Realtime
> >specification
> >http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html
> Posix or not I still don't see why one would want that. You know what you
> are going to be using a file for at open time and you are not going to be
> changing your mind later. If you can show me a single _real_world_ example
> where one would genuinely want to change from one access pattern to another
> without closing/reopening a particular file I would agree that fadvise is a
> good idea but otherwise I think open(2) is the superior approach.
Also, at least in theory, open() can begin loading pages the moment it
completes (if the system is sufficiently idle). Calling madvise() "at
some later point" would allow a window during which the kernel could
already be loading the wrong pages, before it is *then* told "oh btw, I
really want *these* pages." As an example (assuming open() doesn't do this
already) I would be pleasantly surprised if open(O_RDONLY | O_SEQUENTIAL)
began loading at least the first page in the file the moment open() was
successful. Then, when we get control back to actually do a read() (we
may have been interrupted during open()) the page is already there.
mark
--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...
http://mark.mielke.cc/
Anton Altaparmakov writes
> Posix or not I still don't see why one would want that. You know what you
> are going to be using a file for at open time and you are not going to be
> changing your mind later. If you can show me a single _real_world_ example
> where one would genuinely want to change from one access pattern to another
> without closing/reopening a particular file I would agree that fadvise is a
> good idea but otherwise I think open(2) is the superior approach.
>
Sure, a database manager can change the access pattern on every query. If
there's an index and not too many records are expected to match, it will use
a random pattern, otherwise it will use sequential access.
At 18:35 17/03/02, Ken Hirsch wrote:
>Anton Altaparmakov writes
> > Posix or not I still don't see why one would want that. You know what you
> > are going to be using a file for at open time and you are not going to be
> > changing your mind later. If you can show me a single _real_world_ example
> > where one would genuinely want to change from one access pattern to another
> > without closing/reopening a particular file I would agree that fadvise is a
> > good idea but otherwise I think open(2) is the superior approach.
> >
>
>Sure, a database manager can change the access pattern on every query. If
>there's an index and not too many records are expected to match, it will use
>a random pattern, otherwise it will use sequential access.
Last time I heard serious databases use their own memory
management/caching in combination with O_DIRECT, i.e. they bypass the
kernel's buffering system completely. Hence I would deem them irrelevant to
the problem at hand...
If a database were not to use O_DIRECT I would think it would be using mmap
so it would have madvise already... but I am not a database expert so take
this with a pinch of salt...
Best regards,
Anton
On Sun, Mar 17, 2002 at 01:41:37PM +0000, Anton Altaparmakov wrote:
> When you want large data streaming, i.e. you start getting worried about
> memory pressure, then you want open(2) + O_DIRECT. No caching done. Perfect
> for large data streams and we have that already. I agree that you may want
> some form of asynchronous read ahead with passed pages being dropped from
> the cache but that could be just a open(2) + O_SEQUENTIAL (doesn't exist yet).
O_DIRECT isn't the right thing for large streaming. You want
readahead and dropbehind. O_DIRECT takes substantial penalties for its
lack of copy/caching. This works fine in certain circumstances
(applications that keep their own caching), but for something like a
video or mp3, you'll win with working dropbehind easily.
Joel
--
Life's Little Instruction Book #444
"Never underestimate the power of a kind word or deed."
http://www.jlbec.org/
[email protected]
Jeff Garzik writes:
> Andrew Morton wrote:
>
> >Jeff Garzik wrote:
> >>So... we have madvise, why not fadvise? I would love the capability for
> >>applications to provide hints to the OS like madvise, but for file
> >>descriptors...
> >>
> >
> >The one hint which I can think of which would be beneficial would
> >be an equivalent to MADV_SEQUENTIAL. Something which says "this
> >is a big streaming read/write - don't go and evict other stuff because
> >of it". O_STREAMING perhaps. Or working dropbehind heuristics,
> >although I suspect that explicit controls will always do better.
> >
> >For MADV_RANDOM, readahead window scaling should get that right.
> >
> >What else were you thinking of?
> >
>
> Hints for,
> * sequential read
> * sequential write
> * sequential write, where the application considers the data it's
> writing to be unlikely to be read again any time soon (hopefully
> implying to the page cache that these pages have low value as cacheable
> objects)
> * some sort of streaming hints, implying that the application cares a
> lot about maintaining some minimum i/o rate. note I said hint, not
> requirement. -not- guaranteed-rate-IO.
>
> I might even go so far as to advocate identifying common usage
> patterns, and creating hint constants for them, even if we don't
> support them in the kernel immediately (if ever). Makes the
> interface much more future-proof, at the expense of a few integers
> in a 32-bit numberspace, and a few more bytes in the C compiler's
> symbol table.
Here's one that I'd like (came up recently with these 21600x21600x3
images from NASA:-): MADV_REVERSE_SEQUENTIAL. When converting images
from stupid formats which have the origin in the top-left, to formats
which have the origin in the bottom-left (the way god intended), you
can avoid a massive malloc(3) if you read the input file backwards
(basically through llseek(2) steps).
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]
Anton Altaparmakov writes:
> Last time I heard serious databases use their own memory
> management/caching in combination with O_DIRECT, i.e. they bypass the
> kernel's buffering system completely. Hence I would deem them irrelevant to
> the problem at hand...
>
> If a database were not to use O_DIRECT I would think it would be using mmap
> so it would have madvise already... but I am not a database expert so take
> this with a pinch of salt...
>
I don't think that either MySQL or PostgreSQL use O_DIRECT; I just grepped
the source and didn't find it. They can't use mmap() because it uses up too
much process address space.
It's true that commercial databases mostly do their own scheduling and
caching, and if they are the only thing running on your system and you tune
them right, that works. But it's not necessarily a good thing. If there
are other processes on your system, there would be a benefit if the DBMS
could inform the operating system of its intentions.
A posix_fadvise() call would be a start, but you could potentially go beyond
that. For some interesting ideas, see
Seltzer, M., Small, C., Smith, K., "The Case for Extensible Operating
Systems",
Harvard University Center for Research in Computing Technology TR16 -95
(July 1995).
http://citeseer.nj.nec.com/article/seltzer95case.html
Ken Hirsch
At 19:20 17/03/02, Joel Becker wrote:
>On Sun, Mar 17, 2002 at 01:41:37PM +0000, Anton Altaparmakov wrote:
> > When you want large data streaming, i.e. you start getting worried about
> > memory pressure, then you want open(2) + O_DIRECT. No caching done. Perfect
> > for large data streams and we have that already. I agree that you may want
> > some form of asynchronous read ahead with passed pages being dropped from
> > the cache but that could be just a open(2) + O_SEQUENTIAL (doesn't exist yet).
>
> O_DIRECT isn't the right thing for large streaming. You want
>readahead and dropbehind. O_DIRECT takes substantial penalties for its
>lack of copy/caching. This works fine in certain circumstances
>(applications that keep their own caching), but for something like a
>video or mp3, you'll win with working dropbehind easily.
Oh absolutely. For mp3s, dvds, etc. Note I wrote O_SEQUENTIAL... Perhaps I
didn't emphasize it enough. In multimedia applications you very well know
in advance what you want so you can specify it at open(2) time.
Best regards,
Anton
Anton Altaparmakov wrote:
> We don't need fadvise IMHO. That is what open(2) is for. The streaming
> request you are asking for is just a normal open(2). It will do read
> ahead which is perfect for streaming (of data size << RAM size in its
> current form).
>
> When you want large data streaming, i.e. you start getting worried
> about memory pressure, then you want open(2) + O_DIRECT. No caching
> done. Perfect for large data streams and we have that already. I agree
> that you may want some form of asynchronous read ahead with passed
> pages being dropped from the cache but that could be just a open(2) +
> O_SEQUENTIAL (doesn't exist yet).
>
> All of what you are asking for exists in Windows and all the semantics
> are implemented through a very powerful open(2) equivalent. I don't
> see why we shouldn't do the same. It makes more sense to me than
> inventing yet another system call...
I disagree, and here's the main reasons:
* fadvise(2) usefulness extends past open(2). It may be useful to call
it at various points during runtime.
* I think putting hints in open(2) is the wrong direction to go. Hints
have a potential to be very flexible. open(2) O_xxx bits are not to be
squandered lightly, while I see a lot more value in being a little more
loose and free with the bit assignment for an "fadvise mask" (just a
list of hint bits). IMO it should be easier to introduce and retire
hints, far easier than O_xxx flags.
Jeff
Jeff Garzik wrote:
>
> * fadvise(2) usefulness extends past open(2). It may be useful to call
> it at various points during runtime.
>
> * I think putting hints in open(2) is the wrong direction to go. Hints
> have a potential to be very flexible. open(2) O_xxx bits are not to be
> squandered lightly, while I see a lot more value in being a little more
> loose and free with the bit assignment for an "fadvise mask" (just a
> list of hint bits). IMO it should be easier to introduce and retire
> hints, far easier than O_xxx flags.
>
Yup.
posix_fadvise() looks to be a fine interface:
int posix_fadvise(int fd, off_t offset, size_t len, int advice);
DESCRIPTION
The posix_fadvise() function shall advise the implementation on
the expected behavior of the application with respect to the data in
the file associated with the open file descriptor, fd, starting at offset
and continuing for len bytes. The specified range need not currently
exist in the file. If len is zero, all data following offset is specified.
The implementation may use this information to optimize handling
of the specified data. The posix_fadvise() function shall have no
effect on the semantics of other operations on the specified data,
although it may affect the performance of other operations.
The advice to be applied to the data is specified by the advice
parameter and may be one of the following values:
POSIX_FADV_NORMAL
Specifies that the application has no advice to give on its
behavior with respect to the specified data. It is the default
characteristic if no advice is given for an open file.
POSIX_FADV_SEQUENTIAL
Specifies that the application expects to access the specified
data sequentially from lower offsets to higher offsets.
POSIX_FADV_RANDOM
Specifies that the application expects to access the specified
data in a random order.
POSIX_FADV_WILLNEED
Specifies that the application expects to access the specified
data in the near future.
POSIX_FADV_DONTNEED
Specifies that the application expects that it will not access
the specified data in the near future.
POSIX_FADV_NOREUSE
Specifies that the application expects to access the specified
data once and then not reuse it thereafter.
We can usefully implement all of these. FADV_WILLNEED obsoletes
sys_readahead().
We'll need to cheat a bit on the offset/len thing for NORMAL and
SEQUENTIAL - just apply it to the whole file - we don't want to have to
attach an arbitrary number of silly range objects to each file for this.
(We already cheat a bit this way with msync).
Note that it applies to a file descriptor. If posix_fadvise(FADV_DONTNEED) is
called against a file descriptor, and someone else has an fd open
against the same file, that other user gets their foot shot off. That's
OK.
Given this, I don't see a persuasive need to implement a non-standard
interface. It takes an off_t, so posix_fadvise64() is also needed.
The presence of this interface doesn't imply that we don't need
good dropbehind heuristics for streaming reads and writes. We
do need those.
I wouldn't suggest that anyone rush out and implement this stuff for 2.5.
There's some decrudding needed in filemap.c first, and many of these
hints need to interact with the 2.6 VM. Whatever that will be.
A 2.4 implementation could be done any time. If anyone decides to
do this, please let me know...
On Sun, Mar 17, 2002 at 01:41:37PM +0000, Anton Altaparmakov wrote:
> We don't need fadvise IMHO. That is what open(2) is for. The streaming
> request you are asking for is just a normal open(2). It will do read ahead
> which is perfect for streaming (of data size << RAM size in its current form).
A quick real world example of where fadvise can work well.
Imagine a database application that doesn't use O_DIRECT (for whatever
reason, could even be that they don't trust the linux implementation yet
:-). So, this database gets a query. That query requires a full table
scan, so it calls fadvise(fd, F_SEQUENTIAL). Then another query does
row-specific access, and caching helps. So it wants to turn off
F_SEQUENTIAL.
Other applications can use this sort of stuff. DBM could, for
instance. So might GIMP. Etc. Dynamic hints have real world
applications.
Joel
--
print STDOUT q
Just another Perl hacker,
unless $spring
-Larry Wall
http://www.jlbec.org/
[email protected]
Andrew Morton wrote:
>posix_fadvise() looks to be a fine interface:
>
>We'll need to cheat a bit on the offset/len thing for NORMAL and
>SEQUENTIAL - just apply it to the whole file - we don't want to have to
>attach an arbitrary number of silly range objects to each file for this.
>(We already cheat a bit this way with msync).
>
yep
>Given this, I don't see a persuasive need to implement a non-standard
>interface. It takes an off_t, so posix_fadvise64() is also needed.
>
agreed WRT non-standard.
Are we required to have both foo and foo64 variants? If I had my
druthers, I would just do the foo64 version.
>
>A 2.4 implementation could be done any time. If anyone decides to
>do this, please let me know...
>
count me down as interested after my current project... If someone else
does it, more power to them...
Jeff
Joel Becker wrote:
>Other applications can use this sort of stuff. DBM could, for
>instance. So might GIMP. Etc. Dynamic hints have real world
>applications.
>
to be fair, fcntl(2) could be used in conjunction with open(2), to do
dynamic hints.
I prefer to separate the hints from other O_xxx flags, though, so
posix_fadvise seems to be applicable...
Jeff
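For comparison, the fcntl(2) route mentioned above would look something like the following sketch. No hint flag for open(2) exists yet in this discussion, so the usage example toggles the real O_APPEND status flag as a stand-in; set_fd_flag is a hypothetical helper name.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: turn one file status flag on or off on an
 * already-open descriptor. Returns 0 on success, -1 on error. */
static int set_fd_flag(int fd, int flag, int enable)
{
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    flags = enable ? (flags | flag) : (flags & ~flag);
    return fcntl(fd, F_SETFL, flags) < 0 ? -1 : 0;
}
```

A hint flag, if one were added, could be switched the same way, e.g. set_fd_flag(fd, O_APPEND, 1) here with O_APPEND replaced by the new flag. The catch is that such a flag would share the O_xxx bit space with flags that change semantics.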
Joel Becker wrote:
>
> On Sun, Mar 17, 2002 at 01:41:37PM +0000, Anton Altaparmakov wrote:
> > We don't need fadvise IMHO. That is what open(2) is for. The streaming
> > request you are asking for is just a normal open(2). It will do read ahead
> > which is perfect for streaming (of data size << RAM size in its current form).
>
> A quick real world example of where fadvise can work well.
> Imagine a database appliction that doesn't use O_DIRECT (for whatever
> reason, could even be that they don't trust the linux implementation yet
> :-).
O_DIRECT is broken against RAID0 (at least) in 2.5 at present. The
RAID driver gets sent BIOs which straddle two or more chunks and RAID
spits out lots of unpleasant warnings. Neil has been informed...
> So, this database gets a query. That query requires a full table
> scan, so it calls fadvise(fd, F_SEQUENTIAL). Then another query does
> row-specific access, and caching helps. So it wants to turn off
> F_SEQUENTIAL.
It'd probably be smarter for the application to hold two fds against
the same file for this sort of access pattern.
Jeff Garzik wrote:
>
> ...
> >Given this, I don't see a persuasive need to implement a non-standard
> >interface. It takes an off_t, so posix_fadvise64() is also needed.
> >
> agreed WRT non-standard.
>
> Are we required to have both foo and foo64 variants? If I had my
> druthers, I would just do the foo64 version.
That would be good. I can't see a reason why
#define posix_fadvise posix_fadvise64
would not suffice. That doesn't mean there isn't one :)
On Mon, Mar 18, 2002 at 03:10:03AM -0500, Jeff Garzik wrote:
> to be fair, fcntl(2) could be used in conjunction with open(2), to do
> dynamic hints.
I wasn't speaking to the exact interface, just to the real world
usefulness of hints after open(2). But yes, surely :-)
Joel
--
"Baby, even the losers
Get luck sometimes.
Even the losers
Keep a little bit of pride."
http://www.jlbec.org/
[email protected]
> Followup to: <[email protected]>
> By author: Anton Altaparmakov <[email protected]>
> In newsgroup: linux.dev.fs.devel
> >
> > Ok, so basically we want both fadvise() and open(2) semantics, with the
> > open(2) being a superset of the fadvise() capabilities (some things no
> > longer make sense to be specified once the file is open). They can of
> > course both be calling the same common helpers inside the kernel...
> >
>
> If they're open() flags, they should probably be controlled with
> fcntl() rather than with a new system call.
Then posix_fadvise interface can be implemented in libc using fcntl.
--------------------------------------------------------------------------------
- Jan Hudec `Bulb' <[email protected]>
Jan Hudec wrote:
>>Followup to: <[email protected]>
>>By author: Anton Altaparmakov <[email protected]>
>>In newsgroup: linux.dev.fs.devel
>>
>>>Ok, so basically we want both fadvise() and open(2) semantics, with the
>>>open(2) being a superset of the fadvise() capabilities (some things no
>>>longer make sense to be specified once the file is open). They can of
>>>course both be calling the same common helpers inside the kernel...
>>>
>>If they're open() flags, they should probably be controlled with
>>fcntl() rather than with a new system call.
>>
>
>Then posix_fadvise interface can be implemented in libc using fcntl.
>
Indeed it can be... but it is less flexible that way, unless you want to
add another level of indirection.
It is far better for future-proofing the interface IMO if fadvise is
implemented directly. Hints are less important than open O_xxx flags
or F_xxx flags, because an implementation can safely ignore 100% of the
fadvise hints, if it so chooses. One cannot say the same thing for
open/fcntl flags.
So, different class of fd flags deserves a different syscall, IMO...
Jeff
>>>>> "Andrew" == Andrew Morton <[email protected]> writes:
Andrew> O_DIRECT is broken against RAID0 (at least) in 2.5 at present.
Andrew> The RAID driver gets sent BIOs which straddle two or more
Andrew> chunks and RAID spits out lots of unpleasant warnings. Neil
Andrew> has been informed...
Yep. I've been porting my original kiobuf based request splitter to
biobufs. It's almost there, I've just been extremely busy with
something else for a while.
It's not only when you straddle chunks. The current code does not
handle requests straddling RAID zones either.
--
Martin K. Petersen, Principal Linux Consultant, Linuxcare, Inc.
[email protected], http://www.linuxcare.com/
SGI XFS for Linux Developer, http://oss.sgi.com/projects/xfs/
Andrew Morton writes:
> Note that it applies to a file descriptor. If
> posix_fadvise(FADV_DONTNEED) is called against a file descriptor,
> and someone else has an fd open against the same file, that other
> user gets their foot shot off. That's OK.
Let me verify that I understand what you're saying. Process A and B
independently open the file. The file is already in the cache (because
other processes regularly read this file). Process A is slowly reading
stuff. Process B does FADV_DONTNEED on the whole file. The pages are
dropped.
You're saying this is OK? How about this DoS attack:
    int fd = open ("/lib/libc.so", O_RDONLY, 0);
    while (1) {
        posix_fadvise (fd, 0, 0, POSIX_FADV_DONTNEED);
        sleep (1);
    }
Let me see that disc head move! Wheeee!
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]
On Mon, Mar 18, 2002 at 05:08:02AM -0500, Jeff Garzik wrote:
> Jan Hudec wrote:
> >Then posix_fadvise interface can be implemented in libc using fcntl.
> It is far better for future-proofing the interface IMO if fadvise is
> implemented directly. Hints are less important than open O_xxx flags
> or F_xxx flags, because an implementation can safely ignore 100% of the
> fadvise hints, if it so chooses. One cannot say the same thing for
> open/fcntl flags.
There is nothing to say that fadvise(...) shouldn't call fcntl(F_ADVISE, ...).
If it fits in with open(), then it might just fit in with F_GETFL /
F_SETFL as well.
I prefer generalization, especially for non-critical functions that should
not be called 1,000,000 times a second, such as fadvise().
mark
--
[email protected]/[email protected]/[email protected] __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada
One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...
http://mark.mielke.cc/
Richard Gooch wrote:
>
> Andrew Morton writes:
> > Note that it applies to a file descriptor. If
> > posix_fadvise(FADV_DONTNEED) is called against a file descriptor,
> > and someone else has an fd open against the same file, that other
> > user gets their foot shot off. That's OK.
>
> Let me verify that I understand what you're saying. Process A and B
> independently open the file. The file is already in the cache (because
> other processes regularly read this file). Process A is slowly reading
> stuff. Process B does FADV_DONTNEED on the whole file. The pages are
> dropped.
>
> You're saying this is OK? How about this DoS attack:
>     int fd = open ("/lib/libc.so", O_RDONLY, 0);
>     while (1) {
>         posix_fadvise (fd, 0, 0, POSIX_FADV_DONTNEED);
>         sleep (1);
>     }
>
> Let me see that disc head move! Wheeee!
>
POSIX_FADV_DONTNEED could only unmap pages from the caller's
VMA's, so the problem would only affect other processes which
share the same mm - CLONE_MM threads.
If some other process has a reference on the pages then they
wouldn't get unmapped as a result of this. It's the same
as madvise(MADV_DONTNEED).
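
The madvise() side of that comparison can be tried from userspace. The sketch below (helper name dontneed_roundtrip is made up) maps a file, drops the pages from the calling process's own mapping with MADV_DONTNEED, and reads the data again. It only shows that the caller's mapping is repopulated from the file on the next access; it says nothing about how a POSIX_FADV_DONTNEED implementation would behave.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* expose madvise() on glibc */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map `path` read-only, read the first byte, discard the pages from this
 * process's mapping, then read the first byte again.
 * Returns 1 if both reads agree, 0 if they differ, -1 on any error
 * (including an empty file, which cannot be mapped).
 */
int dontneed_roundtrip(const char *path)
{
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping keeps the file referenced */
    if (p == MAP_FAILED)
        return -1;

    unsigned char before = p[0];
    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }
    unsigned char after = p[0]; /* faults the page back in from the file */

    munmap(p, len);
    return before == after;
}
```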
-
Andrew Morton writes:
> Richard Gooch wrote:
> >
> > Andrew Morton writes:
> > > Note that it applies to a file descriptor. If
> > > posix_fadvise(FADV_DONTNEED) is called against a file descriptor,
> > > and someone else has an fd open against the same file, that other
> > > user gets their foot shot off. That's OK.
> >
> > Let me verify that I understand what you're saying. Process A and B
> > independently open the file. The file is already in the cache (because
> > other processes regularly read this file). Process A is slowly reading
> > stuff. Process B does FADV_DONTNEED on the whole file. The pages are
> > dropped.
> >
> > You're saying this is OK? How about this DoS attack:
> > >     int fd = open ("/lib/libc.so", O_RDONLY, 0);
> > >     while (1) {
> > >         posix_fadvise (fd, 0, 0, POSIX_FADV_DONTNEED);
> > >         sleep (1);
> > >     }
> >
> > Let me see that disc head move! Wheeee!
> >
>
> POSIX_FADV_DONTNEED could only unmap pages from the caller's
> VMA's, so the problem would only affect other processes which
> share the same mm - CLONE_MM threads.
>
> If some other process has a reference on the pages then they
> wouldn't get unmapped as a result of this. It's the same
> as madvise(MADV_DONTNEED).
OK, I misparsed what you had said. Good.
Regards,
Richard....
Permanent: [email protected]
Current: [email protected]
"Martin K. Petersen" wrote:
>
> >>>>> "Andrew" == Andrew Morton <[email protected]> writes:
>
> Andrew> O_DIRECT is broken against RAID0 (at least) in 2.5 at present.
> Andrew> The RAID driver gets sent BIOs which straddle two or more
> Andrew> chunks and RAID spits out lots of unpleasant warnings. Neil
> Andrew> has been informed...
>
> Yep. I've been porting my original kiobuf based request splitter to
> biobufs. It's almost there, I've just been extremely busy with
> something else for a while.
>
> It's not only when you straddle chunks. The current code does not
> handle requests straddling RAID zones either.
google fails me - where does your kiobuf-based splitter live?
I'm curious to know how this will all work. Will it take a
large BIO and split it into a number of smaller, newly allocated
BIOs? That would be kinda sad, IMO - the current bio-per-bh
allocations in the normal I/O path are really expensive, and
it seems wrong to take a large BIO, split it into lots of
teeny ones and then reassemble all the way down at the driver
level.
If that's really the only way in which we can solve this problem,
would it not be better to pass information up to the higher layer,
telling it when the BIO which is currently under assembly cannot
be grown further? Say, blk_can_i_add_more_stuff_to_this_bio()?
Anyway. I'm interested. O_DIRECT is a bit of a weird curiosity,
but I'm working on making these big-BIO code paths *the* way in which
data gets to and from disk. It needs to be efficient ;)
-
>>>>> "Andrew" == Andrew Morton <[email protected]> writes:
Andrew> google fails me - where does your kiobuf-based splitter live?
It's in the kiobuf XFS patches.
Andrew> I'm curious to know how this will all work. Will it take a
Andrew> large BIO and split it into a number of smaller, newly
Andrew> allocated BIOs?
For kiobufs I walked the request, cloned a new one every time I crossed a
stripe/device boundary and sent it off. I had my own completion
function with an atomic counter that would call the parent kiobuf's
end_io function when all clones had completed.
So I didn't chop the request into page sized chunks or something like
that.
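
A userspace sketch of the scheme just described, to make the bookkeeping concrete. Nothing here is the real kiobuf or bio code: struct parent_io, struct clone_io and the function names are hypothetical stand-ins, and the "I/O" is only offset/length arithmetic. The request is cut at stripe boundaries, and the last clone to complete calls the parent's end_io.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the parent request (not a kernel struct). */
struct parent_io {
    atomic_int remaining;                   /* clones still in flight */
    void (*end_io)(struct parent_io *p);    /* runs once, after the last clone */
    int completed;                          /* set by mark_completed() below */
};

/* One clone covers [offset, offset + length) of the parent's range. */
struct clone_io {
    struct parent_io *parent;
    unsigned long offset;
    unsigned long length;
};

/*
 * Cut [offset, offset + length) at every multiple of `stripe`.
 * Returns the number of clones written to `out`, or 0 if `stripe` is 0
 * or `out` has too few slots.
 */
size_t split_at_stripes(struct parent_io *parent, unsigned long offset,
                        unsigned long length, unsigned long stripe,
                        struct clone_io *out, size_t max_out)
{
    size_t n = 0;

    if (stripe == 0)
        return 0;
    while (length > 0) {
        if (n == max_out)
            return 0;
        unsigned long room = stripe - (offset % stripe);
        unsigned long take = length < room ? length : room;

        out[n].parent = parent;
        out[n].offset = offset;
        out[n].length = take;
        n++;
        offset += take;
        length -= take;
    }
    atomic_store(&parent->remaining, (int)n);
    return n;
}

/* Completion of one clone: whoever drops the counter to zero ends the parent. */
void clone_end_io(struct clone_io *c)
{
    if (atomic_fetch_sub(&c->parent->remaining, 1) == 1)
        c->parent->end_io(c->parent);
}

/* Sample parent completion handler. */
void mark_completed(struct parent_io *p)
{
    p->completed = 1;
}
```

A 200000-byte request starting at offset 4096 on 64 KiB stripes yields four clones here, not one per page, which is the point being made above.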
Andrew> If that's really the only way in which we can solve this
Andrew> problem, would it not be better to pass information up to the
Andrew> higher layer, telling it when the BIO which is currently under
Andrew> assembly cannot be grown further? Say,
Andrew> blk_can_i_add_more_stuff_to_this_bio()?
We tried different approaches. One of them was to be able to signal
to upper layers that your I/O was too big and please submit smaller
chunks. Running with that, however, the I/O size converged against
small requests because you'd often start an I/O - say 4K - from a
stripe boundary. And that would kill it right off.
So unless the filesystem knows about stripe/device boundaries it's
really hard to get the size signalling right. And then what happens
when you stack LVM and MD?
In the end, cloning the kiobuf from the above and adjusting
offset/length in the children turned out to be the best approach.
And I suspect that's why Jens kept the clone facility around for bio
bufs :)
Andrew> Anyway. I'm interested. O_DIRECT is a bit of a weird
Andrew> curiosity, but I'm working on making these big-BIO code paths
Andrew> *the* way in which data gets to and from disk. It needs to be
Andrew> efficient ;)
*nod*
I'll try and poke at this again tonight. Will shoot you the patch
once I get the zoning evil sorted out.
--
Martin K. Petersen, Principal Linux Consultant, Linuxcare, Inc.
[email protected], http://www.linuxcare.com/
SGI XFS for Linux Developer, http://oss.sgi.com/projects/xfs/
"Martin K. Petersen" <[email protected]> writes:
> >>>>> "Andrew" == Andrew Morton <[email protected]> writes:
>
> Andrew> If that's really the only way in which we can solve this
> Andrew> problem, would it not be better to pass information up to the
> Andrew> higher layer, telling it when the BIO which is currently under
> Andrew> assembly cannot be grown further? Say,
> Andrew> blk_can_i_add_more_stuff_to_this_bio()?
Please let's extend BIOs and not break them up.
> We tried different approaches. One of them was to be able to signal
> to upper layers that your I/O was too big and please submit smaller
> chunks. Running with that, however, the I/O size converged against
> small requests because you'd often start an I/O - say 4K - from a
> stripe boundary. And that would kill it right off.
>
> So unless the filesystem knows about stripe/device boundaries it's
> really hard to get the size signalling right. And then what happens
> when you stack LVM and MD?
>
> In the end, cloning the kiobuf from the above and adjusting
> offset/length in the children turned out to be the best approach.
Unless I am mistaken this interacts very badly with the writing data
out to disk to free up memory, because you must allocate memory to
split the bio. Which is the last place you want to allocate memory
if you can avoid it.
It's been a while but I believe there was a similar thread about
splitting requests to disk and the idea was shot down for similar
reasons.
Eric
>>>>> "Eric" == Eric W Biederman <[email protected]> writes:
>> In the end, cloning the kiobuf from the above and adjusting
>> offset/length in the children turned out to be the best approach.
Eric> Unless I am mistaken this interacts very badly with the writing
Eric> data out to disk to free up memory, because you must allocate
Eric> memory to split the bio. Which is the last place you want to
Eric> allocate memory if you can avoid it.
Well. We have several places in the I/O path already where we need to
allocate memory in order to fulfill an I/O.
Think RAID1 where you need to turn one request from the filesystem
into several - one for each mirror. Or RAID5 where a write may cause
several reads/writes so you can mask and write the checksum out.
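
To illustrate why a small RAID5 write fans out into extra I/O, here is a sketch of XOR parity in plain C. It is not the MD driver's code and the function names are made up; it only shows the standard identity that updating one data chunk needs the old data and the old parity, both of which have to be read (and buffered) first.

```c
#include <stddef.h>
#include <stdint.h>

/* Full-stripe parity: the XOR of all data chunks, byte by byte. */
void parity_full(const uint8_t *const *chunks, size_t nchunks, size_t len,
                 uint8_t *parity)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t x = 0;

        for (size_t c = 0; c < nchunks; c++)
            x ^= chunks[c][i];
        parity[i] = x;
    }
}

/*
 * Read-modify-write update for one chunk:
 *   new_parity = old_parity ^ old_data ^ new_data
 * `parity` holds the old parity on entry and the new parity on return.
 */
void parity_update(uint8_t *parity, const uint8_t *old_data,
                   const uint8_t *new_data, size_t len)
{
    for (size_t i = 0; i < len; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}
```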
Also, with journaling filesystems you may very well be in a situation
where pushing a file to disk involves writing transactions to the log
before you can actually free up buffers.
In this case the clones come from the bio slab cache and are thus no
different from any other I/Os. Furthermore, the clones share the bulk
of their data with the parent, so the overhead isn't that big.
--
Martin K. Petersen, Principal Linux Consultant, Linuxcare, Inc.
[email protected], http://www.linuxcare.com/
SGI XFS for Linux Developer, http://oss.sgi.com/projects/xfs/
Hi!
> > We don't need fadvise IMHO. That is what open(2) is for. The streaming
> > request you are asking for is just a normal open(2). It will do read
> > ahead which is perfect for streaming (of data size << RAM size in its
> > current form).
> >
> > When you want large data streaming, i.e. you start getting worried
> > about memory pressure, then you want open(2) + O_DIRECT. No caching
> > done. Perfect for large data streams and we have that already. I agree
> > that you may want some form of asynchronous read ahead with passed
> > pages being dropped from the cache but that could be just a open(2) +
> > O_SEQUENTIAL (doesn't exist yet).
> >
> > All of what you are asking for exists in Windows and all the semantics
> > are implemented through a very powerful open(2) equivalent. I don't
> > see why we shouldn't do the same. It makes more sense to me than
> > inventing yet another system call...
>
>
>
> I disagree, and here's the main reasons:
>
> * fadvise(2) usefulness extends past open(2). It may be useful to call
> it at various points during runtime.
open(/proc/self/fd/0, O_NEW_FLAGS)?
> * I think putting hints in open(2) is the wrong direction to go. Hints
> have a potential to be very flexible. open(2) O_xxx bits are not to be
> squandered lightly, while I see a lot more value in being a little more
> loose and free with the bit assignment for an "fadvise mask" (just a
> list of hint bits). IMO it should be easier to introduce and retire
> hints, far easier than O_xxx flags.
I don't like the idea of a new syscall when open works just fine. First
prove O_X hints are useful, then extend them.
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
At 04:05 PM 3/22/2002 +0000, Pavel Machek wrote:
>>
>>
>> I disagree, and here's the main reasons:
>>
>> * fadvise(2) usefulness extends past open(2). It may be useful to call
>> it at various points during runtime.
>
>open(/proc/self/fd/0, O_NEW_FLAGS)?
So to use fadvise(), the system must have /proc mounted?
Not everybody mounts /proc -- it provides a lot of potential information
to anybody who can access it ("hmm... they have a QZ48257 ethernet
chipset [cat /proc/pci] -- let's see, sending this specific sequence of
bytes in a TCP packet will lock up the receiver...").
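
For reference, the reopen trick under discussion looks roughly like this with flags that exist today (O_NEW_FLAGS is hypothetical, so the caller passes ordinary open(2) flags; the helper name reopen_with_flags is made up). It fails cleanly when /proc is not mounted, which is exactly the dependency being questioned.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Reopen an already-open descriptor with different open(2) flags by
 * going through /proc/self/fd/<n>.  Returns a new descriptor, or -1 if
 * /proc is not mounted, `fd` is not open, or the open is refused.
 */
int reopen_with_flags(int fd, int flags)
{
    char path[64];

    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    return open(path, flags);
}
```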
--
Stevie-O
Real programmers use COPY CON PROGRAM.EXE
Hi!
> >> I disagree, and here's the main reasons:
> >>
> >> * fadvise(2) usefulness extends past open(2). It may be useful to call
> >> it at various points during runtime.
> >
> >open(/proc/self/fd/0, O_NEW_FLAGS)?
>
> So to use fadvise(), the system must have /proc mounted?
I think it is way more feasible than adding new syscall.
Pavel
--
Casualties in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
At 11:24 24/03/02, Pavel Machek wrote:
>Hi!
>
> > >> I disagree, and here's the main reasons:
> > >>
> > >> * fadvise(2) usefulness extends past open(2). It may be useful to call
> > >> it at various points during runtime.
> > >
> > >open(/proc/self/fd/0, O_NEW_FLAGS)?
> >
> > So to use fadvise(), the system must have /proc mounted?
>
>I think it is way more feasible than adding new syscall.
Sorry but it is silly. (-; What's wrong with open("filename", O_FLAGS);
followed by fcntl(); if you want to modify them after opening. That is a
lot cleaner than going via proc in such a way...
posix_fadvise() can then be implemented in userspace and that can go via
fcntl(). That way we have the best of both worlds.
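
A small sketch of the open-then-fcntl pattern, using only calls that exist today. The hint flags discussed in this thread (O_SEQUENTIAL and friends) are not real, so O_APPEND stands in as a flag that F_SETFL is allowed to change; the helper name set_status_flag is made up.

```c
#include <fcntl.h>

/*
 * Turn one file status flag on or off on an already-open descriptor.
 * Returns 0 on success, -1 on failure (errno is set by fcntl()).
 */
int set_status_flag(int fd, int flag, int enable)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags < 0)
        return -1;
    flags = enable ? (flags | flag) : (flags & ~flag);
    return fcntl(fd, F_SETFL, flags) < 0 ? -1 : 0;
}
```

A userspace posix_fadvise() built this way would translate each piece of advice into such a flag change, assuming the kernel exposed the hints as status flags in the first place.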
Best regards,
Anton
--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/
Hi!
> >> >> I disagree, and here's the main reasons:
> >> >>
> >> >> * fadvise(2) usefulness extends past open(2). It may be useful to call
> >> >> it at various points during runtime.
> >> >
> >> >open(/proc/self/fd/0, O_NEW_FLAGS)?
> >>
> >> So to use fadvise(), the system must have /proc mounted?
> >
> >I think it is way more feasible than adding new syscall.
>
> Sorry but it is silly. (-; What's wrong with open("filename", O_FLAGS);
> followed by fcntl(); if you want to modify them after opening. That is a
> lot cleaner than going via proc in such a way...
>
> posix_fadvise() can then be implemented in userspace and that can go via
> fcntl(). That way we have the best of both worlds.
Agreed, this is better than my proposal.
Pavel
--
Casualties in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.
At 20:19 17/03/02, Ken Hirsch wrote:
>Anton Altaparmakov writes:
> > Last time I heard serious databases use their own memory
> > management/caching in combination with O_DIRECT, i.e. they bypass the
> > kernel's buffering system completely. Hence I would deem them irrelevant to
> > the problem at hand...
> >
> > If a database were not to use O_DIRECT I would think it would be using mmap
> > so it would have madvise already... but I am not a database expert so take
> > this with a pinch of salt...
>
>I don't think that either MySQL or PostgreSQL use O_DIRECT; I just grepped
>the source and didn't find it. They can't use mmap() because it uses up too
>much process address space.
<flame bait>So you consider these two to be serious databases?</flame bait>
(-; [1]
>It's true that commercial databases mostly do their own scheduling and
>caching, and if they are the only thing running on your system and you tune
>them right, that works. But it's not necessarily a good thing. If there
>are other processes on your system, there would be a benefit if the DBMS
>could inform the operating system of its intentions.
>
>A posix_fadvise() call would be a start, but you could potentially go beyond
>that.
Ok, so basically we want both fadvise() and open(2) semantics, with the
open(2) being a superset of the fadvise() capabilities (some things no
longer make sense to be specified once the file is open). They can of
course both be calling the same common helpers inside the kernel...
fadvise() would probably only be used by databases while open(2) would be
used by the rest of the world. (-;
Best regards,
Anton
[1] Sorry about the flame bait, couldn't resist... I know they are both
very respectable databases and they are free software which is great.
--
"I've not lost my mind. It's backed up on tape somewhere." - Unknown
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/