2009-12-25 00:23:09

by Quentin Barnes

[permalink] [raw]
Subject: [RFC][PATCH] Disabling read-ahead makes I/O of large reads small

In porting some application code to Linux, its performance over
NFSv3 on Linux is terrible. I'm posting this note to LKML since
the problem was actually tracked back to the VFS layer.

The app has a simple database that's accessed over NFS. It always
does random I/O, so any read-ahead is a waste. The app uses
O_DIRECT which has the side-effect of disabling read-ahead.

On Linux accessing an O_DIRECT opened file over NFS is much akin to
disabling its attribute cache causing its attributes to be refetched
from the server before each NFS operation. After some thought,
given the Linux behavior of O_DIRECT on regular hard disk files to
ensure file cache consistency, frustratingly, that's probably the
more correct answer to emulate this file system behavior for NFS.
At this point, rather than expecting Linux to somehow change to
avoid the unnecessary flood of GETATTRs, I thought it best for the
app not to just use the O_DIRECT flag on Linux. So I changed the
app code and then added a posix_fadvise(2) call to keep read-ahead
disabled. When I did that, I ran into an unexpected problem.

Adding the posix_fadvise(..., POSIX_FADV_RANDOM) call sets
ra_pages=0. This has a very odd side-effect in the kernel. Once
read-ahead is disabled, subsequent calls to read(2) are now done in
the kernel via ->readpage() callback doing I/O one page at a time!

Pouring through the code in mm/filemap.c I see that the kernel has
commingled read-ahead and plain read implementations. The algorithms
have much in common, so I can see why it was done, but it left this
anomaly of severely pimping read(2) calls on file descriptors with
read-ahead disabled.


For example, with a read(2) of 98K bytes of a file opened with
O_DIRECT accessed over NFSv3 with rsize=32768, I see:
=========
V3 ACCESS Call (Reply In 249), FH:0xf3a8e519
V3 ACCESS Reply (Call In 248)
V3 READ Call (Reply In 321), FH:0xf3a8e519 Offset:0 Len:32768
V3 READ Call (Reply In 287), FH:0xf3a8e519 Offset:32768 Len:32768
V3 READ Call (Reply In 356), FH:0xf3a8e519 Offset:65536 Len:32768
V3 READ Reply (Call In 251) Len:32768
V3 READ Reply (Call In 250) Len:32768
V3 READ Reply (Call In 252) Len:32768
=========

I would expect three READs issued of size 32K, and that's exactly
what I see.


For the same file without O_DIRECT but with read-ahead disabled
(its ra_pages=0), I see:
=========
V3 ACCESS Call (Reply In 167), FH:0xf3a8e519
V3 ACCESS Reply (Call In 166)
V3 READ Call (Reply In 172), FH:0xf3a8e519 Offset:0 Len:4096
V3 READ Reply (Call In 168) Len:4096
V3 READ Call (Reply In 177), FH:0xf3a8e519 Offset:4096 Len:4096
V3 READ Reply (Call In 173) Len:4096
V3 READ Call (Reply In 182), FH:0xf3a8e519 Offset:8192 Len:4096
V3 READ Reply (Call In 178) Len:4096
[... READ Call/Reply pairs repeated another 21 times ...]
=========

Now I see 24 READ calls of 4K each!


A workaround for this kernel problem is to hack the app to do a
readahead(2) call prior to the read(2), however, I would think a
better approach would be to fix the kernel. I came up with the
included patch that once applied restores the expected read(2)
behavior. For the latter test case above of a file with read-ahead
disabled but now with the patch below applied, I now see:
=========
V3 ACCESS Call (Reply In 1350), FH:0xf3a8e519
V3 ACCESS Reply (Call In 1349)
V3 READ Call (Reply In 1387), FH:0xf3a8e519 Offset:0 Len:32768
V3 READ Call (Reply In 1421), FH:0xf3a8e519 Offset:32768 Len:32768
V3 READ Call (Reply In 1456), FH:0xf3a8e519 Offset:65536 Len:32768
V3 READ Reply (Call In 1351) Len:32768
V3 READ Reply (Call In 1352) Len:32768
V3 READ Reply (Call In 1353) Len:32768
=========

Which is what I would expect -- back to just three 32K READs.

After this change, the overall performance of the application
increased by 313%!


I have no idea if my patch is the appropriate fix. I'm well out of
my area in this part of the kernel. It solves this one problem, but
I have no idea how many boundary cases it doesn't cover or even if
it is the right way to go about addressing this issue.

Is this behavior of shorting I/O of read(2) considered a bug? And
is this approach for a fix approriate?

Quentin


--- linux-2.6.32.2/mm/filemap.c 2009-12-18 16:27:07.000000000 -0600
+++ linux-2.6.32.2-rapatch/mm/filemap.c 2009-12-24 13:07:07.000000000 -0600
@@ -1012,9 +1012,13 @@ static void do_generic_file_read(struct
find_page:
page = find_get_page(mapping, index);
if (!page) {
- page_cache_sync_readahead(mapping,
- ra, filp,
- index, last_index - index);
+ if (ra->ra_pages)
+ page_cache_sync_readahead(mapping,
+ ra, filp,
+ index, last_index - index);
+ else
+ force_page_cache_readahead(mapping, filp,
+ index, last_index - index);
page = find_get_page(mapping, index);
if (unlikely(page == NULL))
goto no_cached_page;



My test program used to gather the network traces above:
=========
#define _GNU_SOURCE 1
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int
main(int argc, char **argv)
{
char scratch[32768*3];
int lgfd;
int cnt;

//if ( (lgfd = open(argv[1], O_RDWR|O_DIRECT)) == -1 ) {
if ( (lgfd = open(argv[1], O_RDWR)) == -1 ) {
fprintf(stderr, "Cannot open '%s'.\n", argv[1]);
return 1;
}

posix_fadvise(lgfd, 0, 0, POSIX_FADV_RANDOM);
//readahead(lgfd, 0, sizeof(scratch));
cnt = read(lgfd, scratch, sizeof(scratch));
printf("Read %d bytes.\n", cnt);
close(lgfd);

return 0;
}
=========


2009-12-29 18:04:49

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC][PATCH] Disabling read-ahead makes I/O of large reads small

Quentin Barnes <[email protected]> writes:

cc fengguang who is Mr.Readahead. The full description+patch
is in the archives.

> In porting some application code to Linux, its performance over
> NFSv3 on Linux is terrible. I'm posting this note to LKML since
> the problem was actually tracked back to the VFS layer.
[...]
> I have no idea if my patch is the appropriate fix. I'm well out of
> my area in this part of the kernel. It solves this one problem, but
> I have no idea how many boundary cases it doesn't cover or even if
> it is the right way to go about addressing this issue.
>
> Is this behavior of shorting I/O of read(2) considered a bug? And
> is this approach for a fix approriate?

It sounds like a (performance) bug to me.

>From a quick look your fix looks reasonable to me.

-Andi


--
[email protected] -- Speaking for myself only.

2009-12-30 05:15:45

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH] Disabling read-ahead makes I/O of large reads small

Andi,

On Wed, Dec 30, 2009 at 02:04:43AM +0800, Andi Kleen wrote:
> Quentin Barnes <[email protected]> writes:
>
> cc fengguang who is Mr.Readahead. The full description+patch
> is in the archives.

Thank you for the CC.

> > In porting some application code to Linux, its performance over
> > NFSv3 on Linux is terrible. I'm posting this note to LKML since
> > the problem was actually tracked back to the VFS layer.
> [...]
> > I have no idea if my patch is the appropriate fix. I'm well out of
> > my area in this part of the kernel. It solves this one problem, but
> > I have no idea how many boundary cases it doesn't cover or even if
> > it is the right way to go about addressing this issue.
> >
> > Is this behavior of shorting I/O of read(2) considered a bug? And
> > is this approach for a fix approriate?
>
> It sounds like a (performance) bug to me.

Yes it's a bug. It hit my mind in some early days.. I should be blamed
to lose track of it.

> >From a quick look your fix looks reasonable to me.

Yes, it's reasonable to directly call force_page_cache_readahead() in
this case.

However the ra_pages=0 trick in fadvise also asks for fix. We'd better
let it set a readahead flag, because ra_pages=0 is used in many other
places to really disable the (heuristic|force) readahead. See the
second patch's description for more details.

Thanks,
Fengguang

2009-12-30 05:17:52

by Fengguang Wu

[permalink] [raw]
Subject: [RFC][PATCH 1/2] readahead: replace ra->mmap_miss with ra->flags

Introduce a readahead flags field and embed the existing mmap_miss in it
(to save space).

It will be possible to lose the flags in race conditions, however the
impact should be limited.

CC: Nick Piggin <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Steven Whitehouse <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
include/linux/fs.h | 30 +++++++++++++++++++++++++++++-
mm/filemap.c | 7 ++-----
2 files changed, 31 insertions(+), 6 deletions(-)

--- linux.orig/include/linux/fs.h 2009-12-30 11:07:04.000000000 +0800
+++ linux/include/linux/fs.h 2009-12-30 12:59:28.000000000 +0800
@@ -889,10 +889,38 @@ struct file_ra_state {
there are only # of pages ahead */

unsigned int ra_pages; /* Maximum readahead window */
- unsigned int mmap_miss; /* Cache miss stat for mmap accesses */
+ unsigned int flags;
loff_t prev_pos; /* Cache last read() position */
};

+/* low 16 bits: cache miss stat for mmap accesses */
+#define RA_FLAG_MMAP_MISS 0x0000ffff
+
+/*
+ * Don't do flags++ directly to avoid possible overflow:
+ * the ra fields can be accessed concurrently in a racy way.
+ */
+static inline unsigned int ra_mmap_miss_inc(struct file_ra_state *ra)
+{
+ unsigned int miss = ra->flags & RA_FLAG_MMAP_MISS;
+
+ if (miss < RA_FLAG_MMAP_MISS) {
+ miss++;
+ ra->flags = miss | (ra->flags &~ RA_FLAG_MMAP_MISS);
+ }
+ return miss;
+}
+
+static inline void ra_mmap_miss_dec(struct file_ra_state *ra)
+{
+ unsigned int miss = ra->flags & RA_FLAG_MMAP_MISS;
+
+ if (miss) {
+ miss--;
+ ra->flags = miss | (ra->flags &~ RA_FLAG_MMAP_MISS);
+ }
+}
+
/*
* Check if @index falls in the readahead windows.
*/
--- linux.orig/mm/filemap.c 2009-12-30 11:07:04.000000000 +0800
+++ linux/mm/filemap.c 2009-12-30 11:07:22.000000000 +0800
@@ -1418,14 +1418,12 @@ static void do_sync_mmap_readahead(struc
return;
}

- if (ra->mmap_miss < INT_MAX)
- ra->mmap_miss++;

/*
* Do we miss much more than hit in this file? If so,
* stop bothering with read-ahead. It will only hurt.
*/
- if (ra->mmap_miss > MMAP_LOTSAMISS)
+ if (ra_mmap_miss_inc(ra) > MMAP_LOTSAMISS)
return;

/*
@@ -1455,8 +1453,7 @@ static void do_async_mmap_readahead(stru
/* If we don't want any read-ahead, don't bother */
if (VM_RandomReadHint(vma))
return;
- if (ra->mmap_miss > 0)
- ra->mmap_miss--;
+ ra_mmap_miss_dec(ra);
if (PageReadahead(page))
page_cache_async_readahead(mapping, ra, file,
page, offset, ra->ra_pages);

2009-12-30 05:24:09

by Fengguang Wu

[permalink] [raw]
Subject: [RFC][PATCH 2/2] readahead: avoid page-by-page reads on POSIX_FADV_RANDOM

POSIX_FADV_RANDOM sets ra_pages=0, which leads to poor performance:
a 16K read will be carried out in 4 _sync_ 1-page reads.

In other places, ra_pages==0 means
- it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
- some IO error happened
where multi-page read IO won't help or should be avoided.

POSIX_FADV_RANDOM actually want a different semantics: to disable the
*heuristic* readahead algorithm, and to use a dumb one which faithfully
submit read IO for whatever application requests.

So introduce a readahead flag RA_FLAG_RANDOM for POSIX_FADV_RANDOM.

Note that the random hint is not likely to help random reads performance
noticeably. And it may be too permissive on huge request size (its IO
size is not limited by read_ahead_kb).

An implementation caveat: the set of RA_FLAG_RANDOM can be lost if the
same file descriptor is being concurrent read in another thread. However
it's rare event and the possible performance hit is expected to be low.

In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
(NFS read) performance of the application increased by 313%!

CC: Nick Piggin <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Steven Whitehouse <[email protected]>
Reported-by: Quentin Barnes <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
include/linux/fs.h | 2 ++
mm/fadvise.c | 4 +++-
mm/readahead.c | 6 ++++++
3 files changed, 11 insertions(+), 1 deletion(-)

--- linux.orig/include/linux/fs.h 2009-12-30 13:15:53.000000000 +0800
+++ linux/include/linux/fs.h 2009-12-30 13:17:59.000000000 +0800
@@ -896,6 +896,8 @@ struct file_ra_state {
/* low 16 bits: cache miss stat for mmap accesses */
#define RA_FLAG_MMAP_MISS 0x0000ffff

+#define RA_FLAG_RANDOM 0x00010000 /* POSIX_FADV_RANDOM */
+
/*
* Don't do flags++ directly to avoid possible overflow:
* the ra fields can be accessed concurrently in a racy way.
--- linux.orig/mm/fadvise.c 2009-12-30 13:02:03.000000000 +0800
+++ linux/mm/fadvise.c 2009-12-30 13:23:05.000000000 +0800
@@ -77,12 +77,14 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
switch (advice) {
case POSIX_FADV_NORMAL:
file->f_ra.ra_pages = bdi->ra_pages;
+ file->f_ra.flags &= ~RA_FLAG_RANDOM;
break;
case POSIX_FADV_RANDOM:
- file->f_ra.ra_pages = 0;
+ file->f_ra.flags |= RA_FLAG_RANDOM;
break;
case POSIX_FADV_SEQUENTIAL:
file->f_ra.ra_pages = bdi->ra_pages * 2;
+ file->f_ra.flags &= ~RA_FLAG_RANDOM;
break;
case POSIX_FADV_WILLNEED:
if (!mapping->a_ops->readpage) {
--- linux.orig/mm/readahead.c 2009-12-30 13:02:03.000000000 +0800
+++ linux/mm/readahead.c 2009-12-30 13:17:59.000000000 +0800
@@ -501,6 +501,12 @@ void page_cache_sync_readahead(struct ad
if (!ra->ra_pages)
return;

+ /* be dumb */
+ if (ra->flags & RA_FLAG_RANDOM) {
+ force_page_cache_readahead(mapping, filp, offset, req_size);
+ return;
+ }
+
/* do read-ahead */
ondemand_readahead(mapping, ra, filp, false, offset, req_size);
}

2009-12-30 18:02:45

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] readahead: avoid page-by-page reads on POSIX_FADV_RANDOM

Wu Fengguang <[email protected]> writes:
> * the ra fields can be accessed concurrently in a racy way.
> --- linux.orig/mm/fadvise.c 2009-12-30 13:02:03.000000000 +0800
> +++ linux/mm/fadvise.c 2009-12-30 13:23:05.000000000 +0800
> @@ -77,12 +77,14 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
> switch (advice) {
> case POSIX_FADV_NORMAL:
> file->f_ra.ra_pages = bdi->ra_pages;
> + file->f_ra.flags &= ~RA_FLAG_RANDOM;
> break;
> case POSIX_FADV_RANDOM:
> - file->f_ra.ra_pages = 0;
> + file->f_ra.flags |= RA_FLAG_RANDOM;

What prevents this from racing with a parallel readahead
state modification, losing the bits?

-Andi

--
[email protected] -- Speaking for myself only.

2009-12-31 01:39:40

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] readahead: avoid page-by-page reads on POSIX_FADV_RANDOM

On Thu, Dec 31, 2009 at 02:02:38AM +0800, Andi Kleen wrote:
> Wu Fengguang <[email protected]> writes:
> > * the ra fields can be accessed concurrently in a racy way.
> > --- linux.orig/mm/fadvise.c 2009-12-30 13:02:03.000000000 +0800
> > +++ linux/mm/fadvise.c 2009-12-30 13:23:05.000000000 +0800
> > @@ -77,12 +77,14 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
> > switch (advice) {
> > case POSIX_FADV_NORMAL:
> > file->f_ra.ra_pages = bdi->ra_pages;
> > + file->f_ra.flags &= ~RA_FLAG_RANDOM;
> > break;
> > case POSIX_FADV_RANDOM:
> > - file->f_ra.ra_pages = 0;
> > + file->f_ra.flags |= RA_FLAG_RANDOM;
>
> What prevents this from racing with a parallel readahead
> state modification, losing the bits?

Oh I pretended that the problem don't exist..

To be serious, the race only exist inside a mutithread application,
where one single fd is shared between two threads, one is doing
fadvise, another doing readahead.

A sane application won't do fadvise(POSIX_FADV_RANDOM) while active
reads are going one concurrently: this leads to indeterminate behavior
by itself -- from which request the random hint takes effect?

fadvise() shall always be in the same streamline with all reads.

In real workloads, 1% applications may do POSIX_FADV_RANDOM, among
which 1% applications may be broken. And if the race does happen, the
impact is very small. So I choose to just ignore the race and use
non-atomic operations..

Thanks,
Fengguang

2009-12-31 02:53:47

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] readahead: avoid page-by-page reads on POSIX_FADV_RANDOM

On Thu, Dec 31, 2009 at 09:39:36AM +0800, Wu Fengguang wrote:
> On Thu, Dec 31, 2009 at 02:02:38AM +0800, Andi Kleen wrote:
> > Wu Fengguang <[email protected]> writes:
> > > * the ra fields can be accessed concurrently in a racy way.
> > > --- linux.orig/mm/fadvise.c 2009-12-30 13:02:03.000000000 +0800
> > > +++ linux/mm/fadvise.c 2009-12-30 13:23:05.000000000 +0800
> > > @@ -77,12 +77,14 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
> > > switch (advice) {
> > > case POSIX_FADV_NORMAL:
> > > file->f_ra.ra_pages = bdi->ra_pages;
> > > + file->f_ra.flags &= ~RA_FLAG_RANDOM;
> > > break;
> > > case POSIX_FADV_RANDOM:
> > > - file->f_ra.ra_pages = 0;
> > > + file->f_ra.flags |= RA_FLAG_RANDOM;
> >
> > What prevents this from racing with a parallel readahead
> > state modification, losing the bits?
>
> Oh I pretended that the problem don't exist..
>
> To be serious, the race only exist inside a mutithread application,
> where one single fd is shared between two threads, one is doing
> fadvise, another doing readahead.
>
> A sane application won't do fadvise(POSIX_FADV_RANDOM) while active
> reads are going one concurrently: this leads to indeterminate behavior
> by itself -- from which request the random hint takes effect?
>
> fadvise() shall always be in the same streamline with all reads.
>
> In real workloads, 1% applications may do POSIX_FADV_RANDOM, among
> which 1% applications may be broken. And if the race does happen, the
> impact is very small. So I choose to just ignore the race and use
> non-atomic operations..

OK, when updating the manpages as follows, I feel ashamed to add a
sentence like "make sure there are no concurrent reads on the same file
descriptor...otherwise your advice will be lost".

So how about add a FMODE_HINT_RANDOM_READ bit to file->f_mode?
Modifying it at fadvise() time at least won't disturb the existing
open-time-modify-only f_mode bits..

Thanks,
Fengguang

---
man2/posix_fadvise.2 | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

--- manpages-3.23.orig/man2/posix_fadvise.2 2009-12-31 09:46:13.000000000 +0800
+++ manpages-3.23/man2/posix_fadvise.2 2009-12-31 10:28:58.000000000 +0800
@@ -104,7 +104,8 @@ in POSIX.1-2003 TC1.
.SH NOTES
Under Linux, \fBPOSIX_FADV_NORMAL\fP sets the readahead window to the
default size for the backing device; \fBPOSIX_FADV_SEQUENTIAL\fP doubles
-this size, and \fBPOSIX_FADV_RANDOM\fP disables file readahead entirely.
+this size. \fBPOSIX_FADV_RANDOM\fP ignores the readahead size, and will
+submit IO for the read requests as-is.
These changes affect the entire file, not just the specified region
(but other open file handles to the same file are unaffected).

2009-12-31 03:03:16

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/2] readahead: avoid page-by-page reads on POSIX_FADV_RANDOM

On Thu, Dec 31, 2009 at 10:53:43AM +0800, Wu Fengguang wrote:

> So how about add a FMODE_HINT_RANDOM_READ bit to file->f_mode?
> Modifying it at fadvise() time at least won't disturb the existing
> open-time-modify-only f_mode bits..

Ah it seems file->f_flags (modified under f_lock) is ideal candidate :)

Thanks,
Fengguang

2009-12-31 04:31:22

by Fengguang Wu

[permalink] [raw]
Subject: [RFC][PATCH v2] readahead: introduce O_RANDOM_READ for POSIX_FADV_RANDOM

[an updated patch for the previous two patches in this thread]

This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.

POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor
performance: a 16K read will be carried out in 4 _sync_ 1-page reads.

In other places, ra_pages==0 means
- it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
- some IO error happened
where multi-page read IO won't help or should be avoided.

POSIX_FADV_RANDOM actually want a different semantics: to disable the
*heuristic* readahead algorithm, and to use a dumb one which faithfully
submit read IO for whatever application requests.

So introduce a flag O_RANDOM_READ for POSIX_FADV_RANDOM.
It will be visible to fcntl(F_GETFL).

Note that the random hint is not likely to help random reads performance
noticeably. And it may be too permissive on huge request size (its IO
size is not limited by read_ahead_kb).

In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
(NFS read) performance of the application increased by 313%!

CC: Nick Piggin <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Steven Whitehouse <[email protected]>
CC: David Howells <[email protected]>
CC: Al Viro <[email protected]>
CC: Jonathan Corbet <[email protected]>
CC: Christoph Hellwig <[email protected]>
Reported-by: Quentin Barnes <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
include/asm-generic/fcntl.h | 4 ++++
mm/fadvise.c | 10 +++++++++-
mm/readahead.c | 6 ++++++
3 files changed, 19 insertions(+), 1 deletion(-)

--- linux.orig/include/asm-generic/fcntl.h 2009-12-31 11:23:33.000000000 +0800
+++ linux/include/asm-generic/fcntl.h 2009-12-31 11:24:45.000000000 +0800
@@ -80,6 +80,10 @@
#define O_NDELAY O_NONBLOCK
#endif

+#ifndef O_RANDOM_READ
+#define O_RANDOM_READ 010000000
+#endif
+
#define F_DUPFD 0 /* dup */
#define F_GETFD 1 /* get close_on_exec */
#define F_SETFD 2 /* set/clear close_on_exec */
--- linux.orig/mm/fadvise.c 2009-12-31 11:25:05.000000000 +0800
+++ linux/mm/fadvise.c 2009-12-31 12:16:20.000000000 +0800
@@ -77,12 +77,20 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
switch (advice) {
case POSIX_FADV_NORMAL:
file->f_ra.ra_pages = bdi->ra_pages;
+ spin_lock(&file->f_lock);
+ file->f_flags &= ~O_RANDOM_READ;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_RANDOM:
- file->f_ra.ra_pages = 0;
+ spin_lock(&file->f_lock);
+ file->f_flags |= O_RANDOM_READ;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_SEQUENTIAL:
file->f_ra.ra_pages = bdi->ra_pages * 2;
+ spin_lock(&file->f_lock);
+ file->f_flags &= ~O_RANDOM_READ;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_WILLNEED:
if (!mapping->a_ops->readpage) {
--- linux.orig/mm/readahead.c 2009-12-31 11:27:15.000000000 +0800
+++ linux/mm/readahead.c 2009-12-31 11:28:21.000000000 +0800
@@ -501,6 +501,12 @@ void page_cache_sync_readahead(struct ad
if (!ra->ra_pages)
return;

+ /* be dumb */
+ if (filp->f_flags & O_RANDOM_READ) {
+ force_page_cache_readahead(mapping, filp, offset, req_size);
+ return;
+ }
+
/* do read-ahead */
ondemand_readahead(mapping, ra, filp, false, offset, req_size);
}

2010-01-04 05:05:55

by Fengguang Wu

[permalink] [raw]
Subject: [RFC][PATCH v3] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.

POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor
performance: a 16K read will be carried out in 4 _sync_ 1-page reads.

In other places, ra_pages==0 means
- it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
- some IO error happened
where multi-page read IO won't help or should be avoided.

POSIX_FADV_RANDOM actually want a different semantics: to disable the
*heuristic* readahead algorithm, and to use a dumb one which faithfully
submit read IO for whatever application requests.

So introduce a flag O_RANDOM for POSIX_FADV_RANDOM.
It will be visible to fcntl(F_GETFL).

Note that the random hint is not likely to help random reads performance
noticeably. And it may be too permissive on huge request size (its IO
size is not limited by read_ahead_kb).

In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
(NFS read) performance of the application increased by 313%!

v3: use O_RANDOM to indicate both read/write access pattern as in
posix_fadvise(), although it only takes effect for read() now
(proposed by Quentin)
v2: use O_RANDOM_READ to avoid race conditions (pointed out by Andi)

CC: Nick Piggin <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Steven Whitehouse <[email protected]>
CC: David Howells <[email protected]>
CC: Al Viro <[email protected]>
CC: Jonathan Corbet <[email protected]>
CC: Christoph Hellwig <[email protected]>
Tested-by: Quentin Barnes <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
include/asm-generic/fcntl.h | 4 ++++
mm/fadvise.c | 10 +++++++++-
mm/readahead.c | 6 ++++++
3 files changed, 19 insertions(+), 1 deletion(-)

--- linux.orig/include/asm-generic/fcntl.h 2010-01-04 12:39:29.000000000 +0800
+++ linux/include/asm-generic/fcntl.h 2010-01-04 12:40:11.000000000 +0800
@@ -80,6 +80,10 @@
#define O_NDELAY O_NONBLOCK
#endif

+#ifndef O_RANDOM
+#define O_RANDOM 010000000 /* random access pattern hint */
+#endif
+
#define F_DUPFD 0 /* dup */
#define F_GETFD 1 /* get close_on_exec */
#define F_SETFD 2 /* set/clear close_on_exec */
--- linux.orig/mm/fadvise.c 2010-01-04 12:39:29.000000000 +0800
+++ linux/mm/fadvise.c 2010-01-04 12:39:30.000000000 +0800
@@ -77,12 +77,20 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
switch (advice) {
case POSIX_FADV_NORMAL:
file->f_ra.ra_pages = bdi->ra_pages;
+ spin_lock(&file->f_lock);
+ file->f_flags &= ~O_RANDOM;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_RANDOM:
- file->f_ra.ra_pages = 0;
+ spin_lock(&file->f_lock);
+ file->f_flags |= O_RANDOM;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_SEQUENTIAL:
file->f_ra.ra_pages = bdi->ra_pages * 2;
+ spin_lock(&file->f_lock);
+ file->f_flags &= ~O_RANDOM;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_WILLNEED:
if (!mapping->a_ops->readpage) {
--- linux.orig/mm/readahead.c 2010-01-04 12:39:29.000000000 +0800
+++ linux/mm/readahead.c 2010-01-04 12:39:30.000000000 +0800
@@ -501,6 +501,12 @@ void page_cache_sync_readahead(struct ad
if (!ra->ra_pages)
return;

+ /* be dumb */
+ if (filp->f_flags & O_RANDOM) {
+ force_page_cache_readahead(mapping, filp, offset, req_size);
+ return;
+ }
+
/* do read-ahead */
ondemand_readahead(mapping, ra, filp, false, offset, req_size);
}

2010-01-04 05:17:31

by Stephen Rothwell

[permalink] [raw]
Subject: Re: [RFC][PATCH v3] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

Hi Wu,

On Mon, 4 Jan 2010 12:50:20 +0800 Wu Fengguang <[email protected]> wrote:
>
> --- linux.orig/include/asm-generic/fcntl.h 2010-01-04 12:39:29.000000000 +0800
> +++ linux/include/asm-generic/fcntl.h 2010-01-04 12:40:11.000000000 +0800
> @@ -80,6 +80,10 @@
> #define O_NDELAY O_NONBLOCK
> #endif
>
> +#ifndef O_RANDOM
> +#define O_RANDOM 010000000 /* random access pattern hint */
> +#endif

This value conflicts with O_CLOEXEC on alpha and parisc and O_NOATIME on
sparc.

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/


Attachments:
(No filename) (590.00 B)
(No filename) (198.00 B)
Download all attachments

2010-01-04 05:20:53

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC][PATCH v3] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

Hi, Wu.

On Mon, Jan 4, 2010 at 1:50 PM, Wu Fengguang <[email protected]> wrote:
> This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.
>
> POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor
> performance: a 16K read will be carried out in 4 _sync_ 1-page reads.
>
> In other places, ra_pages==0 means
> - it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
> - some IO error happened
> where multi-page read IO won't help or should be avoided.
>
> POSIX_FADV_RANDOM actually want a different semantics: to disable the
> *heuristic* readahead algorithm, and to use a dumb one which faithfully
> submit read IO for whatever application requests.
>
> So introduce a flag O_RANDOM for POSIX_FADV_RANDOM.
> It will be visible to fcntl(F_GETFL).
>
> Note that the random hint is not likely to help random reads performance
> noticeably. And it may be too permissive on huge request size (its IO
> size is not limited by read_ahead_kb).
>
> In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
> (NFS read) performance of the application increased by 313%!
>
> v3: use O_RANDOM to indicate both read/write access pattern as in
>    posix_fadvise(), although it only takes effect for read() now
>    (proposed by Quentin)
> v2: use O_RANDOM_READ to avoid race conditions (pointed out by Andi)
>
> CC: Nick Piggin <[email protected]>
> CC: Andi Kleen <[email protected]>
> CC: Steven Whitehouse <[email protected]>
> CC: David Howells <[email protected]>
> CC: Al Viro <[email protected]>
> CC: Jonathan Corbet <[email protected]>
> CC: Christoph Hellwig <[email protected]>
> Tested-by: Quentin Barnes <[email protected]>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---
>  include/asm-generic/fcntl.h |    4 ++++
>  mm/fadvise.c                |   10 +++++++++-
>  mm/readahead.c              |    6 ++++++
>  3 files changed, 19 insertions(+), 1 deletion(-)
>
> --- linux.orig/include/asm-generic/fcntl.h      2010-01-04 12:39:29.000000000 +0800
> +++ linux/include/asm-generic/fcntl.h   2010-01-04 12:40:11.000000000 +0800
> @@ -80,6 +80,10 @@
>  #define O_NDELAY       O_NONBLOCK
>  #endif
>
> +#ifndef O_RANDOM
> +#define O_RANDOM       010000000       /* random access pattern hint */
> +#endif
> +
>  #define F_DUPFD                0       /* dup */
>  #define F_GETFD                1       /* get close_on_exec */
>  #define F_SETFD                2       /* set/clear close_on_exec */
> --- linux.orig/mm/fadvise.c     2010-01-04 12:39:29.000000000 +0800
> +++ linux/mm/fadvise.c  2010-01-04 12:39:30.000000000 +0800
> @@ -77,12 +77,20 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
>        switch (advice) {
>        case POSIX_FADV_NORMAL:
>                file->f_ra.ra_pages = bdi->ra_pages;
> +               spin_lock(&file->f_lock);
> +               file->f_flags &= ~O_RANDOM;
> +               spin_unlock(&file->f_lock);
>                break;
>        case POSIX_FADV_RANDOM:
> -               file->f_ra.ra_pages = 0;
> +               spin_lock(&file->f_lock);
> +               file->f_flags |= O_RANDOM;
> +               spin_unlock(&file->f_lock);
>                break;
>        case POSIX_FADV_SEQUENTIAL:
>                file->f_ra.ra_pages = bdi->ra_pages * 2;
> +               spin_lock(&file->f_lock);
> +               file->f_flags &= ~O_RANDOM;
> +               spin_unlock(&file->f_lock);
>                break;
>        case POSIX_FADV_WILLNEED:
>                if (!mapping->a_ops->readpage) {
> --- linux.orig/mm/readahead.c   2010-01-04 12:39:29.000000000 +0800
> +++ linux/mm/readahead.c        2010-01-04 12:39:30.000000000 +0800
> @@ -501,6 +501,12 @@ void page_cache_sync_readahead(struct ad
>        if (!ra->ra_pages)
>                return;
>
> +       /* be dumb */
> +       if (filp->f_flags & O_RANDOM) {
> +               force_page_cache_readahead(mapping, filp, offset, req_size);
> +               return;
> +       }
> +

Let me have a dumb question. :)

How about testing O_RANDOM in front of ra_pages testing?

My intention is that although we turn off ra, it would be better to read
contiguous block all at once than readpage() callback doing I/O
one page at a time.

Is it break some semantics or happen some problem in ondemand readahead?

>        /* do read-ahead */
>        ondemand_readahead(mapping, ra, filp, false, offset, req_size);
>  }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>



--
Kind regards,
Minchan Kim

2010-01-04 07:33:42

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH v3] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

On Mon, Jan 04, 2010 at 04:17:19PM +1100, Stephen Rothwell wrote:
> > @@ -80,6 +80,10 @@
> > #define O_NDELAY O_NONBLOCK
> > #endif
> >
> > +#ifndef O_RANDOM
> > +#define O_RANDOM 010000000 /* random access pattern hint */
> > +#endif
>
> This value conflicts with O_CLOEXEC on alpha and parisc and O_NOATIME on
> sparc.

Also when I tried to use this value for O_RSYNC and tested it I could
not actually see it getting propagated by the open code.

Eitherway I don't think an O_ value is a good idea for a simple access
pattern hint.

2010-01-04 16:55:10

by Quentin Barnes

[permalink] [raw]
Subject: Re: [RFC][PATCH v3] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

On Sun, Jan 03, 2010 at 11:33:28PM -0800, Christoph Hellwig wrote:
> On Mon, Jan 04, 2010 at 04:17:19PM +1100, Stephen Rothwell wrote:
> > > @@ -80,6 +80,10 @@
> > > #define O_NDELAY O_NONBLOCK
> > > #endif
> > >
> > > +#ifndef O_RANDOM
> > > +#define O_RANDOM 010000000 /* random access pattern hint */
> > > +#endif
> >
> > This value conflicts with O_CLOEXEC on alpha and parisc and O_NOATIME on
> > sparc.
>
> Also when I tried to use this value for O_RSYNC and tested it I could
> not actually see it getting propagated by the open code.
>
> Eitherway I don't think an O_ value is a good idea for a simple access
> pattern hint.

Could you expand on that?

I was surprised by Wu's O_RANDOM approach, but after thinking about
it, I liked it. I'm used to seeing (on non-UNIX OSes) a parameter
as part of the open syscall that announces to the OS what the app's
access strategy through that file descriptor will be for that file.
An issue with the current fadvise(2) approach is for random access
files it necessitates two syscalls (open plus fadvise) for what
could be or should be only one syscall (open).

My guess on your issue is that open(2) should take only flags that
are necessary for the open state itself and therefore can't be
implemented as a separate and later syscall. I would generally
agree with that. There is however already at least two exceptions
to that principle, the O_SYNC and O_DIRECT flags. They are access
states though. I guess the question is whether to think of the
O_RANDOM flag as a "hint" or as an "access strategy".

Quentin

2010-01-04 18:57:25

by Andreas Dilger

[permalink] [raw]
Subject: Re: [RFC][PATCH v3] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

On 2010-01-04, at 09:50, Quentin Barnes wrote:
> On Sun, Jan 03, 2010 at 11:33:28PM -0800, Christoph Hellwig wrote:
>> On Mon, Jan 04, 2010 at 04:17:19PM +1100, Stephen Rothwell wrote:
>>>> @@ -80,6 +80,10 @@
>>>> #define O_NDELAY O_NONBLOCK
>>>> #endif
>>>>
>>>> +#ifndef O_RANDOM
>>>> +#define O_RANDOM 010000000 /* random access pattern hint */
>>>> +#endif
>>>
>>> This value conflicts with O_CLOEXEC on alpha and parisc and
>>> O_NOATIME on
>>> sparc.
>>
>> Also when I tried to use this value for O_RSYNC and tested it I could
>> not actually see it getting propagated by the open code.
>>
>> Eitherway I don't think an O_ value is a good idea for a simple
>> access
>> pattern hint.
>
> I was surprised by Wu's O_RANDOM approach, but after thinking about
> it, I liked it. I'm used to seeing (on non-UNIX OSes) a parameter
> as part of the open syscall that announces to the OS what the app's
> access strategy through that file descriptor will be for that file.
> An issue with the current fadvise(2) approach is for random access
> files it necessitates two syscalls (open plus fadvise) for what
> could be or should be only one syscall (open).


Given that syscall overhead is very minimal, especially since fadvise
is only setting some in-memory state and doesn't have to flush cache
or anything, I don't see that as a significant reason to consume an O_
flag. Those flags are essentially limited to 32-bit values, or 32-bit
applications wouldn't be able to use all of the flags, and we are
already into the mid-20's of bits.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

2010-01-05 01:12:52

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH v3] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

Hi Minchan,

On Mon, Jan 04, 2010 at 01:20:49PM +0800, Minchan Kim wrote:
> > --- linux.orig/mm/readahead.c   2010-01-04 12:39:29.000000000 +0800
> > +++ linux/mm/readahead.c        2010-01-04 12:39:30.000000000 +0800
> > @@ -501,6 +501,12 @@ void page_cache_sync_readahead(struct ad
> >        if (!ra->ra_pages)
> >                return;
> >
> > +       /* be dumb */
> > +       if (filp->f_flags & O_RANDOM) {
> > +               force_page_cache_readahead(mapping, filp, offset, req_size);
> > +               return;
> > +       }
> > +
>
> Let me have a dumb question. :)
>
> How about testing O_RANDOM in front of ra_pages testing?
>
> My intention is that although we turn off ra, it would be better to read
> contiguous block all at once than readpage() callback doing I/O
> one page at a time.
>
> Is it break some semantics or happen some problem in ondemand readahead?

Yes it will have some problem with shrink_readahead_size_eio(), which
want to disable readahead and use ->readpage() when ra_pages==0.

Do you have specific use case in mind? The file systems that set
ra_pages=0 seems to don't need readahead, too.

Thanks,
Fengguang

2010-01-05 01:13:08

by Fengguang Wu

[permalink] [raw]
Subject: [RFC][PATCH v4] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

On Mon, Jan 04, 2010 at 03:33:28PM +0800, Christoph Hellwig wrote:
> On Mon, Jan 04, 2010 at 04:17:19PM +1100, Stephen Rothwell wrote:
> > > @@ -80,6 +80,10 @@
> > > #define O_NDELAY O_NONBLOCK
> > > #endif
> > >
> > > +#ifndef O_RANDOM
> > > +#define O_RANDOM 010000000 /* random access pattern hint */
> > > +#endif
> >
> > This value conflicts with O_CLOEXEC on alpha and parisc and O_NOATIME on
> > sparc.

Ah thanks!

For now I'll use bit 040000000, which happen to equal to
FMODE_NONOTIFY and will be masked out by __dentry_open()
in the very beginning.

So there's no way to call open(O_RANDOM).

> Also when I tried to use this value for O_RSYNC and tested it I could
> not actually see it getting propagated by the open code.
>
> Eitherway I don't think an O_ value is a good idea for a simple access
> pattern hint.

Hmm, I don't really care adding open(O_RANDOM) support.

O_RANDOM is used in the code mainly because the file_ra_state fields
are racy, so the f_lock protected file->f_flags is convenient for
saving the random hint bit.

If necessary, we can further mask out O_RANDOM for fcntl(F_GETFL).

Thanks,
Fengguang
---
readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.

POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor
performance: a 16K read will be carried out in 4 _sync_ 1-page reads.

In other places, ra_pages==0 means
- it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
- some IO error happened
where multi-page read IO won't help or should be avoided.

POSIX_FADV_RANDOM actually want a different semantics: to disable the
*heuristic* readahead algorithm, and to use a dumb one which faithfully
submit read IO for whatever application requests.

So introduce a flag O_RANDOM for POSIX_FADV_RANDOM. It will be
visible to fcntl(F_GETFL), but open(O_RANDOM) is disabled for now.

Note that the random hint is not likely to help random reads performance
noticeably. And it may be too permissive on huge request size (its IO
size is not limited by read_ahead_kb).

In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
(NFS read) performance of the application increased by 313%!

v4: resolve bit conflicts with sparc and parisc;
use bit 040000000(=FMODE_NONOTIFY), which will be masked out by
__dentry_open(), so that open(O_RANDOM) is disabled
(Stephen Rothwell and Christoph Hellwig)
v3: use O_RANDOM to indicate both read/write access pattern as in
posix_fadvise(), although it only takes effect for read() now
(proposed by Quentin)
v2: use O_RANDOM_READ to avoid race conditions (pointed out by Andi)

CC: Nick Piggin <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Steven Whitehouse <[email protected]>
CC: David Howells <[email protected]>
CC: Al Viro <[email protected]>
CC: Jonathan Corbet <[email protected]>
CC: Christoph Hellwig <[email protected]>
Tested-by: Quentin Barnes <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
include/asm-generic/fcntl.h | 8 ++++++++
mm/fadvise.c | 10 +++++++++-
mm/readahead.c | 6 ++++++
3 files changed, 23 insertions(+), 1 deletion(-)

--- linux.orig/include/asm-generic/fcntl.h 2010-01-04 12:39:29.000000000 +0800
+++ linux/include/asm-generic/fcntl.h 2010-01-04 20:50:41.000000000 +0800
@@ -80,6 +80,14 @@
#define O_NDELAY O_NONBLOCK
#endif

+/*
+ * don't support open(O_RANDOM) for now:
+ * __dentry_open() will nuke this bit implicitly
+ */
+#ifndef O_RANDOM
+#define O_RANDOM 040000000 /* random access pattern hint */
+#endif
+
#define F_DUPFD 0 /* dup */
#define F_GETFD 1 /* get close_on_exec */
#define F_SETFD 2 /* set/clear close_on_exec */
--- linux.orig/mm/fadvise.c 2010-01-04 12:39:29.000000000 +0800
+++ linux/mm/fadvise.c 2010-01-04 12:39:30.000000000 +0800
@@ -77,12 +77,20 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
switch (advice) {
case POSIX_FADV_NORMAL:
file->f_ra.ra_pages = bdi->ra_pages;
+ spin_lock(&file->f_lock);
+ file->f_flags &= ~O_RANDOM;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_RANDOM:
- file->f_ra.ra_pages = 0;
+ spin_lock(&file->f_lock);
+ file->f_flags |= O_RANDOM;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_SEQUENTIAL:
file->f_ra.ra_pages = bdi->ra_pages * 2;
+ spin_lock(&file->f_lock);
+ file->f_flags &= ~O_RANDOM;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_WILLNEED:
if (!mapping->a_ops->readpage) {
--- linux.orig/mm/readahead.c 2010-01-04 12:39:29.000000000 +0800
+++ linux/mm/readahead.c 2010-01-04 12:39:30.000000000 +0800
@@ -501,6 +501,12 @@ void page_cache_sync_readahead(struct ad
if (!ra->ra_pages)
return;

+ /* be dumb */
+ if (filp->f_flags & O_RANDOM) {
+ force_page_cache_readahead(mapping, filp, offset, req_size);
+ return;
+ }
+
/* do read-ahead */
ondemand_readahead(mapping, ra, filp, false, offset, req_size);
}

2010-01-05 01:46:14

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC][PATCH v3] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

On Mon, Jan 4, 2010 at 9:16 PM, Wu Fengguang <[email protected]> wrote:
> Hi Minchan,
>
> On Mon, Jan 04, 2010 at 01:20:49PM +0800, Minchan Kim wrote:
>> > --- linux.orig/mm/readahead.c   2010-01-04 12:39:29.000000000 +0800
>> > +++ linux/mm/readahead.c        2010-01-04 12:39:30.000000000 +0800
>> > @@ -501,6 +501,12 @@ void page_cache_sync_readahead(struct ad
>> >        if (!ra->ra_pages)
>> >                return;
>> >
>> > +       /* be dumb */
>> > +       if (filp->f_flags & O_RANDOM) {
>> > +               force_page_cache_readahead(mapping, filp, offset, req_size);
>> > +               return;
>> > +       }
>> > +
>>
>> Let me have a dumb question. :)
>>
>> How about testing O_RANDOM in front of ra_pages testing?
>>
>> My intention is that although we turn off ra, it would be better to read
>> contiguous block all at once than readpage() callback doing I/O
>> one page at a time.
>>
>> Is it break some semantics or happen some problem in ondemand readahead?
>
> Yes it will have some problem with shrink_readahead_size_eio(), which
> want to disable readahead and use ->readpage() when ra_pages==0.
>
> Do you have specific use case in mind? The file systems that set
> ra_pages=0 seems to don't need readahead, too.

Never mind. It's just out of curiosity. :)

I thought although user disable readahead, we could enhance file I/O
with one readpages not multiple readpage if we know the user want to
read big contiguous blocks.

But I though it break current readahead off semantics. right?

Thanks for reply about my dumb question, Wu. :)

>
> Thanks,
> Fengguang
>



--
Kind regards,
Minchan Kim

2010-01-05 02:03:36

by Stephen Rothwell

[permalink] [raw]
Subject: Re: [RFC][PATCH v4] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

On Mon, 4 Jan 2010 20:56:20 +0800 Wu Fengguang <[email protected]> wrote:
>
> For now I'll use bit 040000000, which happen to equal to
> FMODE_NONOTIFY and will be masked out by __dentry_open()
> in the very beginning.

But also clashes with __O_SYNC on sparc, sorry. I am not sure how we are
meant to reasonably choose these values any more ...

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/


Attachments:
(No filename) (456.00 B)
(No filename) (198.00 B)
Download all attachments

2010-01-05 02:17:13

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH v3] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

On Tue, Jan 05, 2010 at 09:46:09AM +0800, Minchan Kim wrote:
> On Mon, Jan 4, 2010 at 9:16 PM, Wu Fengguang <[email protected]> wrote:
> > Hi Minchan,
> >
> > On Mon, Jan 04, 2010 at 01:20:49PM +0800, Minchan Kim wrote:
> >> > --- linux.orig/mm/readahead.c   2010-01-04 12:39:29.000000000 +0800
> >> > +++ linux/mm/readahead.c        2010-01-04 12:39:30.000000000 +0800
> >> > @@ -501,6 +501,12 @@ void page_cache_sync_readahead(struct ad
> >> >        if (!ra->ra_pages)
> >> >                return;
> >> >
> >> > +       /* be dumb */
> >> > +       if (filp->f_flags & O_RANDOM) {
> >> > +               force_page_cache_readahead(mapping, filp, offset, req_size);
> >> > +               return;
> >> > +       }
> >> > +
> >>
> >> Let me have a dumb question. :)
> >>
> >> How about testing O_RANDOM in front of ra_pages testing?
> >>
> >> My intention is that although we turn off ra, it would be better to read
> >> contiguous block all at once than readpage() callback doing I/O
> >> one page at a time.
> >>
> >> Is it break some semantics or happen some problem in ondemand readahead?
> >
> > Yes it will have some problem with shrink_readahead_size_eio(), which
> > want to disable readahead and use ->readpage() when ra_pages==0.
> >
> > Do you have specific use case in mind? The file systems that set
> > ra_pages=0 seems to don't need readahead, too.
>
> Never mind. It's just out of curiosity. :)
>
> I thought although user disable readahead, we could enhance file I/O
> with one readpages not multiple readpage if we know the user want to
> read big contiguous blocks.

Yes, not-break-large-read-into-pages would be good for HD/SSD drives
when readahead is disabled.

Currently, ->ra_pages is somehow overloaded in its ==0 case. As you
said, it's in fact possible to disable readahead while still limiting
read IO size to a non-zero ->ra_pages.

> But I though it break current readahead off semantics. right?

It can be done by applying the ->ra_pages limit to O_RANDOM. This also
makes O_RANDOM safer to use:

@@ -497,6 +497,13 @@ void page_cache_sync_readahead(struct ad
struct file_ra_state *ra, struct file *filp,
pgoff_t offset, unsigned long req_size)
{
+ /* be dumb */
+ if (filp->f_flags & O_RANDOM) {
+ req_size = clamp_t(unsigned long, req_size, 1, ra->ra_pages);
+ force_page_cache_readahead(mapping, filp, offset, req_size);
+ return;
+ }
+
/* no read-ahead */
if (!ra->ra_pages)
return;

To make real change, we need an interface for the user to disable
whole-partition readahead by setting O_RANDOM instead of ra_pages=0.
That would be a hard sell..

> Thanks for reply about my dumb question, Wu. :)

You are welcome :)

Thanks,
Fengguang

2010-01-05 02:27:00

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH v4] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

On Tue, Jan 05, 2010 at 10:03:22AM +0800, Stephen Rothwell wrote:
> On Mon, 4 Jan 2010 20:56:20 +0800 Wu Fengguang <[email protected]> wrote:
> >
> > For now I'll use bit 040000000, which happen to equal to
> > FMODE_NONOTIFY and will be masked out by __dentry_open()
> > in the very beginning.
>
> But also clashes with __O_SYNC on sparc, sorry. I am not sure how we are
> meant to reasonably choose these values any more ...

Where is it? I cannot grep find one in arch/.
There is one defined in include/asm-generic/fcntl.h:

#ifndef O_SYNC
#define __O_SYNC 04000000
#define O_SYNC (__O_SYNC|O_DSYNC)
#endif

However it has one less '0' :)

Thanks,
Fengguang

2010-01-05 02:28:34

by Stephen Rothwell

[permalink] [raw]
Subject: Re: [RFC][PATCH v4] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

Hi Fengguang,

On Tue, 5 Jan 2010 10:26:50 +0800 Wu Fengguang <[email protected]> wrote:
>
> Where is it? I cannot grep find one in arch/.
> There is one defined in include/asm-generic/fcntl.h:
>
> #ifndef O_SYNC
> #define __O_SYNC 04000000
> #define O_SYNC (__O_SYNC|O_DSYNC)
> #endif
>
> However it has one less '0' :)

Search for 0x800000 ...

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/


Attachments:
(No filename) (473.00 B)
(No filename) (198.00 B)
Download all attachments

2010-01-05 02:45:40

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH v4] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

On Tue, Jan 05, 2010 at 10:28:24AM +0800, Stephen Rothwell wrote:
> Hi Fengguang,
>
> On Tue, 5 Jan 2010 10:26:50 +0800 Wu Fengguang <[email protected]> wrote:
> >
> > Where is it? I cannot grep find one in arch/.
> > There is one defined in include/asm-generic/fcntl.h:
> >
> > #ifndef O_SYNC
> > #define __O_SYNC 04000000
> > #define O_SYNC (__O_SYNC|O_DSYNC)
> > #endif
> >
> > However it has one less '0' :)
>
> Search for 0x800000 ...

Thanks, I got it..

Does that mean __O_SYNC would be _accidentally_ masked out in
__dentry_open() by the newly introduced fanotify bit?

f->f_flags = (flags & ~(FMODE_EXEC | FMODE_NONOTIFY));

The above line is added in commit f67cee7b50b004357be383,
and can be fixed by the below patch.

Eric, what do you think?

Thanks,
Fengguang
---

include/asm-generic/fcntl.h | 2 +-
include/linux/fs.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

--- linux.orig/include/asm-generic/fcntl.h 2010-01-05 10:42:36.000000000 +0800
+++ linux/include/asm-generic/fcntl.h 2010-01-05 10:42:57.000000000 +0800
@@ -5,7 +5,7 @@

/*
* FMODE_EXEC is 0x20
- * FMODE_NONOTIFY is 0x800000
+ * FMODE_NONOTIFY is 0x1000000
* These cannot be used by userspace O_* until internal and external open
* flags are split.
* -Eric Paris
--- linux.orig/include/linux/fs.h 2010-01-05 10:40:33.000000000 +0800
+++ linux/include/linux/fs.h 2010-01-05 10:42:07.000000000 +0800
@@ -88,7 +88,7 @@ struct inodes_stat_t {
#define FMODE_NOCMTIME ((__force fmode_t)2048)

/* File was opened by fanotify and shouldn't generate fanotify events */
-#define FMODE_NONOTIFY ((__force fmode_t)8388608)
+#define FMODE_NONOTIFY ((__force fmode_t)0x1000000)

/*
* The below are the various read and write types that we support. Some of

2010-01-05 03:18:10

by Fengguang Wu

[permalink] [raw]
Subject: [RFC][PATCH v5] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.

POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor
performance: a 16K read will be carried out in 4 _sync_ 1-page reads.

In other places, ra_pages==0 means
- it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
- some IO error happened
where multi-page read IO won't help or should be avoided.

POSIX_FADV_RANDOM actually want a different semantics: to disable the
*heuristic* readahead algorithm, and to use a dumb one which faithfully
submit read IO for whatever application requests.

So introduce a flag O_RANDOM for POSIX_FADV_RANDOM. It will be
visible to fcntl(F_GETFL), but open(O_RANDOM) is disabled for now.

Note that the random hint is not likely to help random reads performance
noticeably. And it may be too permissive on huge request size (its IO
size is not limited by read_ahead_kb).

In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
(NFS read) performance of the application increased by 313%!

v5: use bit 0200000000; explicitly nuke the O_RANDOM bit in __dentry_open()
(Stephen Rothwell)
v4: resolve bit conflicts with sparc and parisc;
use bit 040000000(=FMODE_NONOTIFY), which will be masked out by
__dentry_open(), so that open(O_RANDOM) is disabled
(Stephen Rothwell and Christoph Hellwig)
v3: use O_RANDOM to indicate both read/write access pattern as in
posix_fadvise(), although it only takes effect for read() now
(proposed by Quentin)
v2: use O_RANDOM_READ to avoid race conditions (pointed out by Andi)

CC: Nick Piggin <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Steven Whitehouse <[email protected]>
CC: David Howells <[email protected]>
CC: Al Viro <[email protected]>
CC: Jonathan Corbet <[email protected]>
CC: Christoph Hellwig <[email protected]>
Tested-by: Quentin Barnes <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
fs/open.c | 2 +-
include/asm-generic/fcntl.h | 7 +++++++
mm/fadvise.c | 10 +++++++++-
mm/readahead.c | 6 ++++++
4 files changed, 23 insertions(+), 2 deletions(-)

--- linux.orig/include/asm-generic/fcntl.h 2010-01-04 12:39:29.000000000 +0800
+++ linux/include/asm-generic/fcntl.h 2010-01-05 11:13:17.000000000 +0800
@@ -80,6 +80,13 @@
#define O_NDELAY O_NONBLOCK
#endif

+/*
+ * usage of open(O_RANDOM) is disabled: __dentry_open() will nuke this bit
+ */
+#ifndef O_RANDOM
+#define O_RANDOM 0200000000 /* random access pattern hint */
+#endif
+
#define F_DUPFD 0 /* dup */
#define F_GETFD 1 /* get close_on_exec */
#define F_SETFD 2 /* set/clear close_on_exec */
--- linux.orig/mm/fadvise.c 2010-01-04 12:39:29.000000000 +0800
+++ linux/mm/fadvise.c 2010-01-04 12:39:30.000000000 +0800
@@ -77,12 +77,20 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
switch (advice) {
case POSIX_FADV_NORMAL:
file->f_ra.ra_pages = bdi->ra_pages;
+ spin_lock(&file->f_lock);
+ file->f_flags &= ~O_RANDOM;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_RANDOM:
- file->f_ra.ra_pages = 0;
+ spin_lock(&file->f_lock);
+ file->f_flags |= O_RANDOM;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_SEQUENTIAL:
file->f_ra.ra_pages = bdi->ra_pages * 2;
+ spin_lock(&file->f_lock);
+ file->f_flags &= ~O_RANDOM;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_WILLNEED:
if (!mapping->a_ops->readpage) {
--- linux.orig/mm/readahead.c 2010-01-04 12:39:29.000000000 +0800
+++ linux/mm/readahead.c 2010-01-05 10:20:29.000000000 +0800
@@ -501,6 +501,12 @@ void page_cache_sync_readahead(struct ad
if (!ra->ra_pages)
return;

+ /* be dumb */
+ if (filp->f_flags & O_RANDOM) {
+ force_page_cache_readahead(mapping, filp, offset, req_size);
+ return;
+ }
+
/* do read-ahead */
ondemand_readahead(mapping, ra, filp, false, offset, req_size);
}
--- linux.orig/fs/open.c 2010-01-05 09:41:50.000000000 +0800
+++ linux/fs/open.c 2010-01-05 10:34:40.000000000 +0800
@@ -862,7 +862,7 @@ static struct file *__dentry_open(struct
}
ima_counts_get(f);

- f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
+ f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | O_RANDOM);

file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);

2010-01-05 03:27:21

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH v5] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

On Tue, Jan 05, 2010 at 11:18:01AM +0800, Wu Fengguang wrote:
>
> +/*
> + * usage of open(O_RANDOM) is disabled: __dentry_open() will nuke this bit
> + */
> +#ifndef O_RANDOM
> +#define O_RANDOM 0200000000 /* random access pattern hint */
> +#endif
> +

Tested with this program, output is

flags=0x9001 O_SYNC=0x1000 O_RANDOM=0

My user space O_SYNC is still the old O_DSYNC. But should be OK to
test the sparc O_SYNC.

------------------------------------------------------------------------------
#include <fcntl.h>
#include <stdio.h>

#define O_RANDOM 040000000 /* random access pattern hint */

int main (void)
{
int fd, flags;

fd = open("/tmp/fcntl-test",
O_WRONLY | O_CREAT | O_SYNC | O_RANDOM, 0666);
flags = fcntl(fd, F_GETFL, 0);
printf("flags=%#lx O_SYNC=%#lx O_RANDOM=%#lx\n",
flags, flags & O_SYNC, flags & O_RANDOM);
}
------------------------------------------------------------------------------

Thanks,
Fengguang

2010-01-05 03:41:06

by Minchan Kim

[permalink] [raw]
Subject: Re: [RFC][PATCH v3] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

On Tue, Jan 5, 2010 at 11:16 AM, Wu Fengguang <[email protected]> wrote:
> On Tue, Jan 05, 2010 at 09:46:09AM +0800, Minchan Kim wrote:
>> On Mon, Jan 4, 2010 at 9:16 PM, Wu Fengguang <[email protected]> wrote:
>> > Hi Minchan,
>> >
>> > On Mon, Jan 04, 2010 at 01:20:49PM +0800, Minchan Kim wrote:
>> >> > --- linux.orig/mm/readahead.c   2010-01-04 12:39:29.000000000 +0800
>> >> > +++ linux/mm/readahead.c        2010-01-04 12:39:30.000000000 +0800
>> >> > @@ -501,6 +501,12 @@ void page_cache_sync_readahead(struct ad
>> >> >        if (!ra->ra_pages)
>> >> >                return;
>> >> >
>> >> > +       /* be dumb */
>> >> > +       if (filp->f_flags & O_RANDOM) {
>> >> > +               force_page_cache_readahead(mapping, filp, offset, req_size);
>> >> > +               return;
>> >> > +       }
>> >> > +
>> >>
>> >> Let me have a dumb question. :)
>> >>
>> >> How about testing O_RANDOM in front of ra_pages testing?
>> >>
>> >> My intention is that although we turn off ra, it would be better to read
>> >> contiguous block all at once than readpage() callback doing I/O
>> >> one page at a time.
>> >>
>> >> Is it break some semantics or happen some problem in ondemand readahead?
>> >
>> > Yes it will have some problem with shrink_readahead_size_eio(), which
>> > want to disable readahead and use ->readpage() when ra_pages==0.
>> >
>> > Do you have specific use case in mind? The file systems that set
>> > ra_pages=0 seems to don't need readahead, too.
>>
>> Never mind. It's just out of curiosity. :)
>>
>> I thought although user disable readahead, we could enhance file I/O
>> with one readpages not multiple readpage if we know the user want to
>> read big contiguous blocks.
>
> Yes, not-break-large-read-into-pages would be good for HD/SSD drives
> when readahead is disabled.
>
> Currently, ->ra_pages is somehow overloaded in its ==0 case. As you
> said, it's in fact possible to disable readahead while still limiting
> read IO size to a non-zero ->ra_pages.
>
>> But I though it break current readahead off semantics. right?
>
> It can be done by applying the ->ra_pages limit to O_RANDOM. This also
> makes O_RANDOM safer to use:
>
> @@ -497,6 +497,13 @@ void page_cache_sync_readahead(struct ad
>                               struct file_ra_state *ra, struct file *filp,
>                               pgoff_t offset, unsigned long req_size)
>  {
> +       /* be dumb */
> +       if (filp->f_flags & O_RANDOM) {
> +               req_size = clamp_t(unsigned long, req_size, 1, ra->ra_pages);
> +               force_page_cache_readahead(mapping, filp, offset, req_size);
> +               return;
> +       }
> +
>        /* no read-ahead */
>        if (!ra->ra_pages)
>                return;
>
> To make real change, we need an interface for the user to disable
> whole-partition readahead by setting O_RANDOM instead of ra_pages=0.
> That would be a hard sell..

Okay. I understand.
Thanks, Wu.

--
Kind regards,
Minchan Kim

2010-01-05 05:22:31

by Eric Paris

[permalink] [raw]
Subject: Re: [RFC][PATCH v4] readahead: introduce O_RANDOM for POSIX_FADV_RANDOM

On Tue, 2010-01-05 at 10:45 +0800, Wu Fengguang wrote:
> On Tue, Jan 05, 2010 at 10:28:24AM +0800, Stephen Rothwell wrote:
> > Hi Fengguang,
> >
> > On Tue, 5 Jan 2010 10:26:50 +0800 Wu Fengguang <[email protected]> wrote:
> > >
> > > Where is it? I cannot grep find one in arch/.
> > > There is one defined in include/asm-generic/fcntl.h:
> > >
> > > #ifndef O_SYNC
> > > #define __O_SYNC 04000000
> > > #define O_SYNC (__O_SYNC|O_DSYNC)
> > > #endif
> > >
> > > However it has one less '0' :)
> >
> > Search for 0x800000 ...
>
> Thanks, I got it..
>
> Does that mean __O_SYNC would be _accidentally_ masked out in
> __dentry_open() by the newly introduced fanotify bit?
>
> f->f_flags = (flags & ~(FMODE_EXEC | FMODE_NONOTIFY));
>
> The above line is added in commit f67cee7b50b004357be383,
> and can be fixed by the below patch.
>
> Eric, what do you think?

I'll try to get that patch added tonight/tomorrow (although I'm
basically without internet until the 7th).

Al mentioned to me (for a completely unrelated reason) he'd rather see
internal and external open flags split, so we don't have that stupidity
of masking off FMODE_EXEC and FMODE_NONOTIFY. I'll probably try to work
on something like that, after I'm back working. Maybe Al or Christoph
would like to actually review this work and tell me everything else they
don't like?

-Eric

2010-01-08 13:08:36

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH v2] readahead: introduce O_RANDOM_READ for POSIX_FADV_RANDOM

Just thinking about this again, why don't you put the flag into
file->f_mode and the FMODE_* namespace given that we don't want it to be
settable from open?

2010-01-09 13:59:23

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH v2] readahead: introduce O_RANDOM_READ for POSIX_FADV_RANDOM

On Fri, Jan 08, 2010 at 09:08:28PM +0800, Christoph Hellwig wrote:
> Just thinking about this again, why don't you put the flag into
> file->f_mode and the FMODE_* namespace given that we don't want it to be
> settable from open?

Good idea. To do that without race I would like to add ->f_lock to
f_mode modifications at non-open time, like this. What do you think?

---
vfs: take f_lock on modifying f_mode after open time

Signed-off-by: Wu Fengguang <[email protected]>
---
fs/file_table.c | 2 ++
fs/nfsd/nfs4state.c | 2 ++
2 files changed, 4 insertions(+)

--- linux.orig/fs/file_table.c 2010-01-09 10:50:24.000000000 +0800
+++ linux/fs/file_table.c 2010-01-09 10:51:06.000000000 +0800
@@ -394,7 +394,9 @@ retry:
continue;
if (!(f->f_mode & FMODE_WRITE))
continue;
+ spin_lock(&f->f_lock);
f->f_mode &= ~FMODE_WRITE;
+ spin_unlock(&f->f_lock);
if (file_check_writeable(f) != 0)
continue;
file_release_write(f);
--- linux.orig/fs/nfsd/nfs4state.c 2010-01-09 10:52:54.000000000 +0800
+++ linux/fs/nfsd/nfs4state.c 2010-01-09 10:52:56.000000000 +0800
@@ -1998,7 +1998,9 @@ nfs4_file_downgrade(struct file *filp, u
{
if (share_access & NFS4_SHARE_ACCESS_WRITE) {
drop_file_write_access(filp);
+ spin_lock(&filp->f_lock);
filp->f_mode = (filp->f_mode | FMODE_READ) & ~FMODE_WRITE;
+ spin_unlock(&filp->f_lock);
}
}

2010-01-09 14:01:26

by Fengguang Wu

[permalink] [raw]
Subject: Re: [RFC][PATCH v2] readahead: introduce O_RANDOM_READ for POSIX_FADV_RANDOM

On Sat, Jan 09, 2010 at 09:59:04PM +0800, Wu Fengguang wrote:
> On Fri, Jan 08, 2010 at 09:08:28PM +0800, Christoph Hellwig wrote:
> > Just thinking about this again, why don't you put the flag into
> > file->f_mode and the FMODE_* namespace given that we don't want it to be
> > settable from open?
>
> Good idea. To do that without race I would like to add ->f_lock to
> f_mode modifications at non-open time, like this. What do you think?
>
> ---
> vfs: take f_lock on modifying f_mode after open time
>
> Signed-off-by: Wu Fengguang <[email protected]>
> ---

And the FMODE_RANDOM patch.

Thanks,
Fengguang
---

readahead: introduce FMODE_RANDOM for POSIX_FADV_RANDOM

This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.

POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor
performance: a 16K read will be carried out in 4 _sync_ 1-page reads.

In other places, ra_pages==0 means
- it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
- some IO error happened
where multi-page read IO won't help or should be avoided.

POSIX_FADV_RANDOM actually want a different semantics: to disable the
*heuristic* readahead algorithm, and to use a dumb one which faithfully
submit read IO for whatever application requests.

So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.

Note that the random hint is not likely to help random reads performance
noticeably. And it may be too permissive on huge request size (its IO
size is not limited by read_ahead_kb).

In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
(NFS read) performance of the application increased by 313%!

v6: use FMODE_RANDOM (proposed by Christoph Hellwig)
v5: use bit 0200000000; explicitly nuke the O_RANDOM bit in __dentry_open()
(Stephen Rothwell)
v4: resolve bit conflicts with sparc and parisc;
use bit 040000000(=FMODE_NONOTIFY), which will be masked out by
__dentry_open(), so that open(O_RANDOM) is disabled
(Stephen Rothwell and Christoph Hellwig)
v3: use O_RANDOM to indicate both read/write access pattern as in
posix_fadvise(), although it only takes effect for read() now
(proposed by Quentin)
v2: use O_RANDOM_READ to avoid race conditions (pointed out by Andi)

CC: Nick Piggin <[email protected]>
CC: Andi Kleen <[email protected]>
CC: Steven Whitehouse <[email protected]>
CC: David Howells <[email protected]>
CC: Al Viro <[email protected]>
CC: Jonathan Corbet <[email protected]>
CC: Christoph Hellwig <[email protected]>
Tested-by: Quentin Barnes <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
include/linux/fs.h | 3 +++
mm/fadvise.c | 10 +++++++++-
mm/readahead.c | 6 ++++++
3 files changed, 18 insertions(+), 1 deletion(-)

--- linux.orig/mm/fadvise.c 2010-01-09 11:00:11.000000000 +0800
+++ linux/mm/fadvise.c 2010-01-09 12:25:01.000000000 +0800
@@ -77,12 +77,20 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, lof
switch (advice) {
case POSIX_FADV_NORMAL:
file->f_ra.ra_pages = bdi->ra_pages;
+ spin_lock(&file->f_lock);
+ file->f_flags &= ~FMODE_RANDOM;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_RANDOM:
- file->f_ra.ra_pages = 0;
+ spin_lock(&file->f_lock);
+ file->f_flags |= FMODE_RANDOM;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_SEQUENTIAL:
file->f_ra.ra_pages = bdi->ra_pages * 2;
+ spin_lock(&file->f_lock);
+ file->f_flags &= ~FMODE_RANDOM;
+ spin_unlock(&file->f_lock);
break;
case POSIX_FADV_WILLNEED:
if (!mapping->a_ops->readpage) {
--- linux.orig/mm/readahead.c 2010-01-09 11:00:11.000000000 +0800
+++ linux/mm/readahead.c 2010-01-09 12:25:01.000000000 +0800
@@ -501,6 +501,12 @@ void page_cache_sync_readahead(struct ad
if (!ra->ra_pages)
return;

+ /* be dumb */
+ if (filp->f_mode & FMODE_RANDOM) {
+ force_page_cache_readahead(mapping, filp, offset, req_size);
+ return;
+ }
+
/* do read-ahead */
ondemand_readahead(mapping, ra, filp, false, offset, req_size);
}
--- linux.orig/include/linux/fs.h 2010-01-09 11:00:29.000000000 +0800
+++ linux/include/linux/fs.h 2010-01-09 12:26:33.000000000 +0800
@@ -90,6 +90,9 @@ struct inodes_stat_t {
/* File was opened by fanotify and shouldn't generate fanotify events */
#define FMODE_NONOTIFY ((__force fmode_t)0x1000000)

+/* Expect random access pattern */
+#define FMODE_RANDOM ((__force fmode_t)0x1000)
+
/*
* The below are the various read and write types that we support. Some of
* them include behavioral modifiers that send information down to the