2009-12-26 20:59:19

by Quentin Barnes

Subject: Random I/O over NFS has horrible performance due to small I/O transfers

On the 24th I posted this note on LKML since it was a problem in the
VFS layer.  However, since NFS is the filesystem mainly affected by
this problem, I thought I'd bring it up here for discussion as well
for those who don't follow LKML.  At the time I posted it, I didn't
set it up as a cross-posted note.

Has this interaction between random I/O and NFS been noted before?
I searched back through the archive and didn't turn up anything.

Quentin

--

In porting some application code to Linux, I found its performance
over NFSv3 to be terrible.  I'm posting this note to LKML since
the problem was actually tracked back to the VFS layer.

The app has a simple database that's accessed over NFS. It always
does random I/O, so any read-ahead is a waste. The app uses
O_DIRECT which has the side-effect of disabling read-ahead.

On Linux, accessing a file opened with O_DIRECT over NFS is much akin
to disabling its attribute cache, causing its attributes to be
refetched from the server before each NFS operation.  After some
thought, given that O_DIRECT on regular hard disk files exists to
ensure file cache consistency, that is, frustratingly, probably the
more correct behavior for NFS to emulate.  At this point, rather than
expecting Linux to somehow change to avoid the unnecessary flood of
GETATTRs, I thought it best for the app simply not to use the
O_DIRECT flag on Linux.  So I changed the app code and then added a
posix_fadvise(2) call to keep read-ahead disabled.  When I did that,
I ran into an unexpected problem.

Adding the posix_fadvise(..., POSIX_FADV_RANDOM) call sets
ra_pages=0.  This has a very odd side-effect in the kernel.  Once
read-ahead is disabled, subsequent calls to read(2) are now done in
the kernel via the ->readpage() callback, doing I/O one page at a time!
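
For reference, the POSIX_FADV_RANDOM handling in mm/fadvise.c on
2.6.32 boils down to this (paraphrased, not a verbatim excerpt):

    case POSIX_FADV_RANDOM:
            file->f_ra.ra_pages = 0;    /* read-ahead window goes to zero */
            break;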

Poring through the code in mm/filemap.c, I see that the kernel has
commingled the read-ahead and plain read implementations.  The
algorithms have much in common, so I can see why it was done, but it
left this anomaly of severely pimping read(2) calls on file
descriptors with read-ahead disabled.


For example, with a read(2) of 98K bytes of a file opened with
O_DIRECT accessed over NFSv3 with rsize=32768, I see:
=========
V3 ACCESS Call (Reply In 249), FH:0xf3a8e519
V3 ACCESS Reply (Call In 248)
V3 READ Call (Reply In 321), FH:0xf3a8e519 Offset:0 Len:32768
V3 READ Call (Reply In 287), FH:0xf3a8e519 Offset:32768 Len:32768
V3 READ Call (Reply In 356), FH:0xf3a8e519 Offset:65536 Len:32768
V3 READ Reply (Call In 251) Len:32768
V3 READ Reply (Call In 250) Len:32768
V3 READ Reply (Call In 252) Len:32768
=========

I would expect three READs issued of size 32K, and that's exactly
what I see.


For the same file without O_DIRECT but with read-ahead disabled
(its ra_pages=0), I see:
=========
V3 ACCESS Call (Reply In 167), FH:0xf3a8e519
V3 ACCESS Reply (Call In 166)
V3 READ Call (Reply In 172), FH:0xf3a8e519 Offset:0 Len:4096
V3 READ Reply (Call In 168) Len:4096
V3 READ Call (Reply In 177), FH:0xf3a8e519 Offset:4096 Len:4096
V3 READ Reply (Call In 173) Len:4096
V3 READ Call (Reply In 182), FH:0xf3a8e519 Offset:8192 Len:4096
V3 READ Reply (Call In 178) Len:4096
[... READ Call/Reply pairs repeated another 21 times ...]
=========

Now I see 24 READ calls of 4K each!


A workaround for this kernel problem is to hack the app to do a
readahead(2) call prior to the read(2); however, I would think a
better approach would be to fix the kernel.  I came up with the
included patch that, once applied, restores the expected read(2)
behavior.  For the latter test case above of a file with read-ahead
disabled, but now with the patch below applied, I see:
=========
V3 ACCESS Call (Reply In 1350), FH:0xf3a8e519
V3 ACCESS Reply (Call In 1349)
V3 READ Call (Reply In 1387), FH:0xf3a8e519 Offset:0 Len:32768
V3 READ Call (Reply In 1421), FH:0xf3a8e519 Offset:32768 Len:32768
V3 READ Call (Reply In 1456), FH:0xf3a8e519 Offset:65536 Len:32768
V3 READ Reply (Call In 1351) Len:32768
V3 READ Reply (Call In 1352) Len:32768
V3 READ Reply (Call In 1353) Len:32768
=========

Which is what I would expect -- back to just three 32K READs.

After this change, the overall performance of the application
increased by 313%!


I have no idea if my patch is the appropriate fix. I'm well out of
my area in this part of the kernel. It solves this one problem, but
I have no idea how many boundary cases it doesn't cover or even if
it is the right way to go about addressing this issue.

Is this behavior of shorting the I/O of read(2) considered a bug?  And
is this approach for a fix appropriate?

Quentin


--- linux-2.6.32.2/mm/filemap.c 2009-12-18 16:27:07.000000000 -0600
+++ linux-2.6.32.2-rapatch/mm/filemap.c 2009-12-24 13:07:07.000000000 -0600
@@ -1012,9 +1012,13 @@ static void do_generic_file_read(struct
 find_page:
         page = find_get_page(mapping, index);
         if (!page) {
-                page_cache_sync_readahead(mapping,
-                                ra, filp,
-                                index, last_index - index);
+                if (ra->ra_pages)
+                        page_cache_sync_readahead(mapping,
+                                        ra, filp,
+                                        index, last_index - index);
+                else
+                        force_page_cache_readahead(mapping, filp,
+                                        index, last_index - index);
         page = find_get_page(mapping, index);
         if (unlikely(page == NULL))
                 goto no_cached_page;



My test program used to gather the network traces above:
=========
#define _GNU_SOURCE 1
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int
main(int argc, char **argv)
{
        char scratch[32768*3];
        int lgfd;
        int cnt;

        //if ( (lgfd = open(argv[1], O_RDWR|O_DIRECT)) == -1 ) {
        if ( (lgfd = open(argv[1], O_RDWR)) == -1 ) {
                fprintf(stderr, "Cannot open '%s'.\n", argv[1]);
                return 1;
        }

        posix_fadvise(lgfd, 0, 0, POSIX_FADV_RANDOM);
        //readahead(lgfd, 0, sizeof(scratch));
        cnt = read(lgfd, scratch, sizeof(scratch));
        printf("Read %d bytes.\n", cnt);
        close(lgfd);

        return 0;
}
=========


2009-12-29 17:12:55

by Chuck Lever

Subject: Re: Random I/O over NFS has horrible performance due to small I/O transfers


On Dec 26, 2009, at 3:45 PM, Quentin Barnes wrote:

> On the 24th I posted this note on LKML since it was a problem in the
> VFS layer. However, since NFS is mainly affected by this problem,
> I'd bring it up here for discussion as well for those that don't
> follow LKML. At the time I posted it, I didn't set it up as a
> cross-posted note.
>
> Has this interaction between random I/O and NFS been noted before?
> I searched back through the archive and didn't turn up anything.
>
> Quentin
>
> --
>
> In porting some application code to Linux, its performance over
> NFSv3 on Linux is terrible. I'm posting this note to LKML since
> the problem was actually tracked back to the VFS layer.
>
> The app has a simple database that's accessed over NFS. It always
> does random I/O, so any read-ahead is a waste. The app uses
> O_DIRECT which has the side-effect of disabling read-ahead.
>
> On Linux accessing an O_DIRECT opened file over NFS is much akin to
> disabling its attribute cache causing its attributes to be refetched
> from the server before each NFS operation.

NFS O_DIRECT is designed so that attribute refetching is avoided.
Take a look at nfs_file_read() -- right at the top it skips to the
direct read code. Do you perhaps have the actimeo=0 or noac mount
options specified?

> After some thought,
> given the Linux behavior of O_DIRECT on regular hard disk files to
> ensure file cache consistency, frustratingly, that's probably the
> more correct answer to emulate this file system behavior for NFS.
> At this point, rather than expecting Linux to somehow change to
> avoid the unnecessary flood of GETATTRs, I thought it best for the
> app not to just use the O_DIRECT flag on Linux. So I changed the
> app code and then added a posix_fadvise(2) call to keep read-ahead
> disabled. When I did that, I ran into an unexpected problem.
>
> Adding the posix_fadvise(..., POSIX_FADV_RANDOM) call sets
> ra_pages=0. This has a very odd side-effect in the kernel. Once
> read-ahead is disabled, subsequent calls to read(2) are now done in
> the kernel via ->readpage() callback doing I/O one page at a time!

Your application could always use posix_fadvise(...,
POSIX_FADV_WILLNEED). POSIX_FADV_RANDOM here means the application
will perform I/O requests in random offset order, and requests will be
smaller than a page.
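
For example, if your app knows the offset and length of its next
read, it could issue something like the following (names here are
illustrative):

    /* hint the kernel to start fetching exactly the range we need,
     * then read it */
    posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
    nread = pread(fd, buf, len, offset);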

> Poring through the code in mm/filemap.c I see that the kernel has
> commingled read-ahead and plain read implementations. The algorithms
> have much in common, so I can see why it was done, but it left this
> anomaly of severely pimping read(2) calls on file descriptors with
> read-ahead disabled.

The problem is that do_generic_file_read() conflates read-ahead and
read coalescing, which are really two different things (and this use
case highlights that difference).

Above you said that "any readahead is a waste." That's only true if
your database is significantly larger than available physical memory.
Otherwise, you are simply populating the local page cache faster than
if your app read exactly what was needed each time. On fast modern
networks there is little latency difference between reading a single
page and reading 16 pages in a single NFS read request. The cost is a
larger page cache footprint.

Caching is only really harmful if your database file is shared between
more than one NFS client. In fact, I think O_DIRECT will be more of a
hindrance if your simple database doesn't do its own caching, since
your app will generate more NFS reads in the O_DIRECT case, meaning it
will wait more often. You're almost always better off letting the O/S
handle data caching.

If you leave read ahead enabled, theoretically, the read-ahead context
should adjust itself over time to read the average number of pages in
each application read request. Have you seen any real performance
problems when using normal cached I/O with read-ahead enabled?

> For example, with a read(2) of 98K bytes of a file opened with
> O_DIRECT accessed over NFSv3 with rsize=32768, I see:
> =========
> V3 ACCESS Call (Reply In 249), FH:0xf3a8e519
> V3 ACCESS Reply (Call In 248)
> V3 READ Call (Reply In 321), FH:0xf3a8e519 Offset:0 Len:32768
> V3 READ Call (Reply In 287), FH:0xf3a8e519 Offset:32768 Len:32768
> V3 READ Call (Reply In 356), FH:0xf3a8e519 Offset:65536 Len:32768
> V3 READ Reply (Call In 251) Len:32768
> V3 READ Reply (Call In 250) Len:32768
> V3 READ Reply (Call In 252) Len:32768
> =========
>
> I would expect three READs issued of size 32K, and that's exactly
> what I see.
>
>
> For the same file without O_DIRECT but with read-ahead disabled
> (its ra_pages=0), I see:
> =========
> V3 ACCESS Call (Reply In 167), FH:0xf3a8e519
> V3 ACCESS Reply (Call In 166)
> V3 READ Call (Reply In 172), FH:0xf3a8e519 Offset:0 Len:4096
> V3 READ Reply (Call In 168) Len:4096
> V3 READ Call (Reply In 177), FH:0xf3a8e519 Offset:4096 Len:4096
> V3 READ Reply (Call In 173) Len:4096
> V3 READ Call (Reply In 182), FH:0xf3a8e519 Offset:8192 Len:4096
> V3 READ Reply (Call In 178) Len:4096
> [... READ Call/Reply pairs repeated another 21 times ...]
> =========
>
> Now I see 24 READ calls of 4K each!
>
>
> A workaround for this kernel problem is to hack the app to do a
> readahead(2) call prior to the read(2), however, I would think a
> better approach would be to fix the kernel. I came up with the
> included patch that once applied restores the expected read(2)
> behavior. For the latter test case above of a file with read-ahead
> disabled but now with the patch below applied, I now see:
> =========
> V3 ACCESS Call (Reply In 1350), FH:0xf3a8e519
> V3 ACCESS Reply (Call In 1349)
> V3 READ Call (Reply In 1387), FH:0xf3a8e519 Offset:0 Len:32768
> V3 READ Call (Reply In 1421), FH:0xf3a8e519 Offset:32768 Len:32768
> V3 READ Call (Reply In 1456), FH:0xf3a8e519 Offset:65536 Len:32768
> V3 READ Reply (Call In 1351) Len:32768
> V3 READ Reply (Call In 1352) Len:32768
> V3 READ Reply (Call In 1353) Len:32768
> =========
>
> Which is what I would expect -- back to just three 32K READs.
>
> After this change, the overall performance of the application
> increased by 313%!
>
>
> I have no idea if my patch is the appropriate fix. I'm well out of
> my area in this part of the kernel. It solves this one problem, but
> I have no idea how many boundary cases it doesn't cover or even if
> it is the right way to go about addressing this issue.
>
> Is this behavior of shorting I/O of read(2) considered a bug? And
> is this approach for a fix appropriate?
>
> Quentin
>
>
> --- linux-2.6.32.2/mm/filemap.c 2009-12-18 16:27:07.000000000 -0600
> +++ linux-2.6.32.2-rapatch/mm/filemap.c 2009-12-24 13:07:07.000000000 -0600
> @@ -1012,9 +1012,13 @@ static void do_generic_file_read(struct
>  find_page:
>          page = find_get_page(mapping, index);
>          if (!page) {
> -                page_cache_sync_readahead(mapping,
> -                                ra, filp,
> -                                index, last_index - index);
> +                if (ra->ra_pages)
> +                        page_cache_sync_readahead(mapping,
> +                                        ra, filp,
> +                                        index, last_index - index);
> +                else
> +                        force_page_cache_readahead(mapping, filp,
> +                                        index, last_index - index);
>          page = find_get_page(mapping, index);
>          if (unlikely(page == NULL))
>                  goto no_cached_page;
>
>
>
> My test program used to gather the network traces above:
> =========
> #define _GNU_SOURCE 1
> #include <stdio.h>
> #include <unistd.h>
> #include <fcntl.h>
>
> int
> main(int argc, char **argv)
> {
>         char scratch[32768*3];
>         int lgfd;
>         int cnt;
>
>         //if ( (lgfd = open(argv[1], O_RDWR|O_DIRECT)) == -1 ) {
>         if ( (lgfd = open(argv[1], O_RDWR)) == -1 ) {
>                 fprintf(stderr, "Cannot open '%s'.\n", argv[1]);
>                 return 1;
>         }
>
>         posix_fadvise(lgfd, 0, 0, POSIX_FADV_RANDOM);
>         //readahead(lgfd, 0, sizeof(scratch));
>         cnt = read(lgfd, scratch, sizeof(scratch));
>         printf("Read %d bytes.\n", cnt);
>         close(lgfd);
>
>         return 0;
> }
> =========

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2010-01-24 18:48:56

by Quentin Barnes

Subject: Re: Random I/O over NFS has horrible performance due to small I/O transfers

> > Sorry I've been slow in responding. I had a recent death in my
> > family which has been occupying all my time for the last three
> > weeks.
>
> My condolences.

Thank you.

> > I'm sure I didn't have actimeo=0 or noac. What I was referring to
> > is the code in nfs_revalidate_file_size() which forces revalidation
> > with O_DIRECT files. According to the comments this is done to
> > minimize the window (race) with other clients writing to the file.
> > I saw this behavior as well in wireshark/tcpdump traces I collected.
> > With O_DIRECT, the attributes would often be refetched from the
> > server prior to each file operation. (Might have been just for
> > write and lseek file operations.) I could dig up traces if you
> > like.
>
> nfs_revalidate_file_size() is not invoked in the O_DIRECT read path.
> You were complaining about read-ahead. So I'd say this problem is
> independent of the issues you reported earlier with read-ahead.

Sorry for the confusion in the segue.  To summarize, the app
on another OS was originally designed to use O_DIRECT partly for
its side-effect of disabling read-ahead.  However, when ported to
Linux, the O_DIRECT flag on NFS files triggers a new GETATTR every
time the app does an lseek(2), write(2), or close(2).  As an
alternative, I experimented with having the app on Linux not use
O_DIRECT but call posix_fadvise(...,POSIX_FADV_RANDOM).  That got rid
of the extra GETATTRs and the read-aheads, but then it caused the
larger read(2)s to run very inefficiently with dozens of 4K
page-sized NFS READs.

> > Aside from O_DIRECT not using cached file attributes before file
> > I/O, this also has an odd side-effect on closing a file. After
> > a write(2) is done by the app, the following close(2) triggers a
> > refetch of the attributes. I don't care what the file attributes
> > are -- just let the file close already! For example, here in user
> > space I'm doing a:
> > fd = open(..., O_RDWR|O_DIRECT);
> > write(fd, ...);
> > sleep(3);
> > close(fd);
> >
> > Which results in:
> > 4.191210 NFS V3 ACCESS Call, FH:0x0308031e
> > 4.191391 NFS V3 ACCESS Reply
> > 4.191431 NFS V3 LOOKUP Call, DH:0x0308031e/scr2
> > 4.191613 NFS V3 LOOKUP Reply, FH:0x29f0b5d0
> > 4.191645 NFS V3 ACCESS Call, FH:0x29f0b5d0
> > 4.191812 NFS V3 ACCESS Reply
> > 4.191852 NFS V3 WRITE Call, FH:0x29f0b5d0 Offset:0 Len:300 FILE_SYNC
> > 4.192095 NFS V3 WRITE Reply Len:300 FILE_SYNC
> > 7.193535 NFS V3 GETATTR Call, FH:0x29f0b5d0
> > 7.193724 NFS V3 GETATTR Reply Regular File mode:0644 uid:28238 gid:
> > 100
> >
> > As you can see by the first column time index that the GETATTR is done
> > after the sleep(3) as the file is being closed. (This was collected
> > on a 2.6.32.2 kernel.)
> >
> > Is there any actual need for doing that GETATTR on close that I don't
> > understand, or is it just a goof?
>
> This GETATTR is required generally for cached I/O and close-to-open
> cache coherency. The Linux NFS FAQ at nfs.sourceforge.net has more
> information on close-to-open.

I know CTO well. In my version of nfs.ko, I've added a O_NFS_NOCTO
flag to the open(2) syscall. We need the ability to fine tune
specifically which files have "no CTO" feature active. The "nocto"
mount flag is too sweeping. Has a per-file "nocto" feature been
discussed before?

> For close-to-open to work, a close(2) call must flush any pending
> changes, and the next open(2) call on that file needs to check that
> the file's attributes haven't changed since the file was last accessed
> on this client. The mtime, ctime, and size are compared between the
> two to determine if the client's copy of the file's data is stale.

Yes, that's the client's way to validate its cached CTO data.

> The flush done by a close(2) call after a write(2) may cause the
> server to update the mtime, ctime, and size of the file. So, after
> the flush, the client has to grab the latest copy of the file's
> attributes from the server (the server, not the client, maintains the
> values of mtime, ctime, and size).

When referring to "flush" above, are you referring to the NFS flush
call (nfs_file_flush) or the action of flushing cached data from
client to server?

If the flush call has no data to flush, the GETATTR is pointless.
The file's cached attribute information is already as valid as can
ever be hoped for with NFS, and its lifetime is limited by the normal
attribute timeout.  On the other hand, if the flush has cached data
to write out, I would expect the WRITE to return the updated
attributes in its post-op attribute status, again making the
GETATTR on close pointless.  Can you explain what I'm not following?

However, I tore into the code to better understand what was triggering
the GETATTR on close.  A close(2) does two things: a flush
(nfs_file_flush) and a release (nfs_file_release).  I had thought the
GETATTR was happening as part of the nfs_file_flush().  It's not.  The
GETATTR is triggered by the nfs_file_release().  As part of its
put_nfs_open_context() -> nfs_close_context() -> nfs_revalidate_inode()
path, it triggers the NFS_PROTO(inode)->getattr()!

I'm not sure, but I suspect that O_DIRECT files take the
put_nfs_open_context() path that results in the extraneous GETATTR
on close(2) because filp->private_data is non-NULL where for regular
files it's NULL. Is that right? If so, can this problem be easily
fixed?

> Otherwise, the client would have
> cached file attributes that were valid _before_ the flush, but not
> afterwards. The next open(2) would be spoofed into thinking that the
> file had been changed by some other client, when it was its own
> activity that caused the mtime/ctime/size change.

I just don't see how that can happen. The attributes should be
automatically updated by the post-op reply, right?

> But again, you would only see this for normal cached accesses, or for
> llseek(SEEK_END). The O_DIRECT path splits off well before that
> nfs_revalidate_file_size() call in nfs_file_write().
>
> For llseek(SEEK_END), this is the preferred way to get the size of a
> file through an O_DIRECT file descriptor. This is precisely because
> O_DIRECT does not guarantee that the client's copy of the file's
> attributes are up to date.

Yes, that makes sense, and it's what the code comments helped me
understand.

> I see that the WRITE from your trace is a FILE_SYNC write. In this
> case, perhaps the GETATTR is really not required for close-to-open.
> Especially if the server has returned post-op attributes in the WRITE
> reply, the client would already have up-to-date file attributes
> available to it.

Yes, FILE_SYNC is normal for O_DIRECT WRITEs on NFSv3 and v4.

> >> Above you said that "any readahead is a waste." That's only true if
> >> your database is significantly larger than available physical memory.
> >
> > It is. It's waaaay larger than all available physical memory on
> > a given client machine. (Think of tens of millions of users' email
> > accounts.)
>
> If the accesses on any given client are localized in the file (eg.
> there are only a few e-mail users on that client) this should be
> handily dealt with by normal O/S caching behavior, even with an
> enormous database file. It really depends on the file's resident set
> on each client.

As to how the incoming requests are assigned to specific or random
clients, I am unsure. I'm not on their team. I've just been assisting
with the backend issues (NFS client performance).

In this particular use case, there isn't a single enormous database.
Each user has their own file which itself is their private database.
(Better for privacy and security.)

> >> Otherwise, you are simply populating the local page cache faster than
> >> if your app read exactly what was needed each time.
> >
> > It's a multithreaded app running across many clients accessing many
> > servers. Any excess network traffic at all to the database is a
> > very bad idea being detrimental to both to the particular client's
> > throughput but all other clients wanting to access files on the
> > burdened NFS servers.
>
> Which is why you might be better off relying on client-side caches in
> this case. Efficient client caching is absolutely required for good
> network and server scalability with such workloads.

It is a very unusual workload due to the sheer number of files
and their access patterns. Data cache hits are typically very
low because of the large amount of data and the bimodal access
patterns between MTAs and users, and users are pretty arbitrary too.
Attribute cache hits are typically low too, but due to their small
size, the attributes are more likely to stay in their cache.

> If all of this data is contained in a single large file, your
> application is relying on a single set of file attributes to determine
> whether the client's cache for all the file data is stale. So
> basically, read ahead is pulling a bunch of data into the client's
> page cache, then someone changes one byte in the file, and all that
> data is invalidated in one swell foop. In this case, it's not
> necessarily read-ahead that's killing your performance, it's excessive
> client data cache invalidations.

That's not the general case here since we're dealing with tens of
millions of files on one server, but I didn't know that all of a
file's cached data gets invalidated like that.  I would expect only
the affected page (4K) to be marked, not the whole set.

> >> On fast modern
> >> networks there is little latency difference between reading a single
> >> page and reading 16 pages in a single NFS read request. The cost
> >> is a
> >> larger page cache footprint.
> >
> > Believe me, the extra file accesses do make a huge difference.
>
> If your rsize is big enough, the read-ahead traffic usually won't
> increase the number of NFS READs on the wire; it increases the size of
> each request.

rsize is 32K. That's generally true (especially after Wu's fix),
but that extra network traffic is overburdening the NFS servers.

> Client read coalescing will attempt to bundle the
> additional requested data into a minimal number of wire READs. A
> closer examination of the on-the-wire READ count vs. the amount of
> data read might be interesting. It might also be useful to see how
> often the same client reads the same page in the file repeatedly.

Because the app was designed with O_DIRECT in mind, the app does its
own file buffering in user space.

If for the Linux port we can move away from O_DIRECT, it would be
interesting to see if that extra user space buffering could be
disabled and let the kernel do its job.

> > The kernel can't adjust its strategy over time. There is no history
> > maintained because the app opens a given file, updates it, closes
> > it, then moves on to the next file. The file descriptor is not kept
> > open beyond just one or two read or write operations. Also, the
> > chance of the same file needing to be updated by the same client
> > within any reasonable time frame is very small.
>
> Yes, read-ahead context is abandoned when a file descriptor is
> closed. That immediately suggests that file descriptors should be
> left open, but that's only as practical as your application allows.

Not in this case with tens of millions of files.

> > Hope all this helps understand the problems I'm dealing with.
>
> Yes, this is more clear, thanks.

Glad it helped!

> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>
>
>

Quentin

2010-01-25 16:44:11

by Chuck Lever

Subject: Re: Random I/O over NFS has horrible performance due to small I/O transfers

On Jan 24, 2010, at 1:46 PM, Quentin Barnes wrote:
>>> I'm sure I didn't have actimeo=0 or noac. What I was referring to
>>> is the code in nfs_revalidate_file_size() which forces revalidation
>>> with O_DIRECT files. According to the comments this is done to
>>> minimize the window (race) with other clients writing to the file.
>>> I saw this behavior as well in wireshark/tcpdump traces I collected.
>>> With O_DIRECT, the attributes would often be refetched from the
>>> server prior to each file operation. (Might have been just for
>>> write and lseek file operations.) I could dig up traces if you
>>> like.
>>
>> nfs_revalidate_file_size() is not invoked in the O_DIRECT read path.
>> You were complaining about read-ahead. So I'd say this problem is
>> independent of the issues you reported earlier with read-ahead.
>
> Sorry for the confusion in the segue. To summarize, the app
> on another OS was originally designed to use O_DIRECT as a
> side-effect to disable read-ahead. However, when ported to Linux,
> the O_DIRECT flag with NFS files triggers a new GETATTR every time
> the app did an lseek(2), write(2), or close(2). As an alternative,
> I experimented with having the app on Linux not use O_DIRECT but
> call posix_fadvise(...,POSIX_FADV_RANDOM). That got rid of the
> extra GETATTRs and the read-aheads, but then that caused the larger
> read(2)s to run very inefficiently with the dozens of 4K page-sized
> NFS READs.
>
>>> Aside from O_DIRECT not using cached file attributes before file
>>> I/O, this also has an odd side-effect on closing a file. After
>>> a write(2) is done by the app, the following close(2) triggers a
>>> refetch of the attributes. I don't care what the file attributes
>>> are -- just let the file close already! For example, here in user
>>> space I'm doing a:
>>> fd = open(..., O_RDWR|O_DIRECT);
>>> write(fd, ...);
>>> sleep(3);
>>> close(fd);
>>>
>>> Which results in:
>>> 4.191210 NFS V3 ACCESS Call, FH:0x0308031e
>>> 4.191391 NFS V3 ACCESS Reply
>>> 4.191431 NFS V3 LOOKUP Call, DH:0x0308031e/scr2
>>> 4.191613 NFS V3 LOOKUP Reply, FH:0x29f0b5d0
>>> 4.191645 NFS V3 ACCESS Call, FH:0x29f0b5d0
>>> 4.191812 NFS V3 ACCESS Reply
>>> 4.191852 NFS V3 WRITE Call, FH:0x29f0b5d0 Offset:0 Len:300
>>> FILE_SYNC
>>> 4.192095 NFS V3 WRITE Reply Len:300 FILE_SYNC
>>> 7.193535 NFS V3 GETATTR Call, FH:0x29f0b5d0
>>> 7.193724 NFS V3 GETATTR Reply Regular File mode:0644 uid:28238 gid:
>>> 100
>>>
>>> As you can see by the first column time index that the GETATTR is
>>> done
>>> after the sleep(3) as the file is being closed. (This was collected
>>> on a 2.6.32.2 kernel.)
>>>
>>> Is there any actual need for doing that GETATTR on close that I
>>> don't
>>> understand, or is it just a goof?
>>
>> This GETATTR is required generally for cached I/O and close-to-open
>> cache coherency. The Linux NFS FAQ at nfs.sourceforge.net has more
>> information on close-to-open.
>
> I know CTO well. In my version of nfs.ko, I've added a O_NFS_NOCTO
> flag to the open(2) syscall. We need the ability to fine tune
> specifically which files have "no CTO" feature active. The "nocto"
> mount flag is too sweeping. Has a per-file "nocto" feature been
> discussed before?

Probably, but that's for another thread (preferably on [email protected]
only).

>> For close-to-open to work, a close(2) call must flush any pending
>> changes, and the next open(2) call on that file needs to check that
>> the file's attributes haven't changed since the file was last
>> accessed
>> on this client. The mtime, ctime, and size are compared between the
>> two to determine if the client's copy of the file's data is stale.
>
> Yes, that's the client's way to validate its cached CTO data.
>
>> The flush done by a close(2) call after a write(2) may cause the
>> server to update the mtime, ctime, and size of the file. So, after
>> the flush, the client has to grab the latest copy of the file's
>> attributes from the server (the server, not the client, maintains the
>> values of mtime, ctime, and size).
>
> When referring to "flush" above, are you referring to the NFS flush
> call (nfs_file_flush) or the action of flushing cached data from
> client to server?

close(2) uses nfs_file_flush() to flush dirty data.

> If the flush call has no data to flush, the GETATTR is pointless.
> The file's cached attribute information is already as valid as can
> ever be hoped for with NFS and its life is limited by the normal
> attribute timeout. On the other hand, if the flush has cached data
> to write out, I would expect the WRITE will return the updated
> attributes with the post-op attribute status, again making the
> GETATTR on close pointless. Can you explain what I'm not following?

The Linux NFS client does not use the post-op attributes to update the
cached attributes for the file. Because there is no guarantee that
the WRITE replies will return in the same order the WRITEs were sent,
it's simply not reliable. If the last reply to be received was for an
older write, then the mtime and ctime (and possibly the size) would be
stale, and would trigger a false data cache invalidation the next time
a full inode validation is done.

So, post-op attributes are used to detect the need for an attribute
cache update. At some later point, the client will perform the update
by sending a GETATTR, and that will update the cached attributes.
That's what NFS_INO_INVALID_ATTR is for.

The goal is to use a few extra operations on the wire to prevent
spurious data cache invalidations. For large files, this could mean
significantly fewer READs on the wire.
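
Roughly, the pattern looks like this (a paraphrase, not verbatim
kernel code; the helper on the first line is made up for
illustration, the other names are the real ones):

    /* WRITE reply handling (async context): don't touch the cached
     * attributes, just note that they may be out of date */
    if (post_op_attrs_changed(inode, fattr))
            nfsi->cache_validity |= NFS_INO_INVALID_ATTR;

    /* later, from a synchronous context, a GETATTR brings the
     * attribute cache back up to date */
    if (nfsi->cache_validity & NFS_INO_INVALID_ATTR)
            nfs_revalidate_inode(NFS_SERVER(inode), inode);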

> However, I tore into the code to better understand what was triggering
> the GETATTR on close. A close(2) does two things, a flush
> (nfs_file_flush) and a release (nfs_file_release). I had thought the
> GETATTR was happening as part of the nfs_file_flush(). It's not. The
> GETATTR is triggered by the nfs_file_release(). As part of it doing a
> put_nfs_open_context() -> nfs_close_context() ->
> nfs_revalidate_inode(),
> that triggers the NFS_PROTO(inode)->getattr()!
>
> I'm not sure, but I suspect that O_DIRECT files take the
> put_nfs_open_context() path that results in the extraneous GETATTR
> on close(2) because filp->private_data is non-NULL where for regular
> files it's NULL. Is that right? If so, can this problem be easily
> fixed?

I'm testing a patch to use an asynchronous close for O_DIRECT files.
This will skip the GETATTR for NFSv2/v3 O_DIRECT files, and avoid
waiting for the CLOSE for NFSv4 O_DIRECT files.

>> If all of this data is contained in a single large file, your
>> application is relying on a single set of file attributes to
>> determine
>> whether the client's cache for all the file data is stale. So
>> basically, read ahead is pulling a bunch of data into the client's
>> page cache, then someone changes one byte in the file, and all that
>> data is invalidated in one swell foop. In this case, it's not
>> necessarily read-ahead that's killing your performance, it's
>> excessive
>> client data cache invalidations.
>
> That's not the general case here since we're dealing with tens of
> millions of files on one server, but I didn't know that all the data
> of a file pulled gets invalidated like that. I would expect only
> the page (4K) to be marked, not the whole set.
>
>>>> On fast modern
>>>> networks there is little latency difference between reading a
>>>> single
>>>> page and reading 16 pages in a single NFS read request. The cost
>>>> is a
>>>> larger page cache footprint.
>>>
>>> Believe me, the extra file accesses do make a huge difference.
>>
>> If your rsize is big enough, the read-ahead traffic usually won't
>> increase the number of NFS READs on the wire; it increases the size
>> of
>> each request.
>
> rsize is 32K. That's generally true (especially after Wu's fix),
> but that extra network traffic is overburdening the NFS servers.
>
>> Client read coalescing will attempt to bundle the
>> additional requested data into a minimal number of wire READs. A
>> closer examination of the on-the-wire READ count vs. the amount of
>> data read might be interesting. It might also be useful to see how
>> often the same client reads the same page in the file repeatedly.
>
> Because the app was designed with O_DIRECT in mind, the app does its
> own file buffering in user space.
>
> If for the Linux port we can move away from O_DIRECT, it would be
> interesting to see if that extra user space buffering could be
> disabled and let the kernel do its job.

You mentioned in an earlier e-mail that the application has its own
high-level data cache coherency protocol, and it appears that your
application is using NFS to do what is more or less real-time data
exchange between MTAs and users, with permanent storage as a side-
effect. In that case, the application should manage its own cache,
since it can better optimize the number of READs required to keep its
local data caches up to date. That would address the "too many data
cache invalidations" problem.

In terms of maintaining the Linux port of your application, you
probably want to stay as close to the original as possible, yes?
Given all this, we'd be better off getting O_DIRECT to perform better.

You say above that llseek(2), write(2), and close(2) cause excess
GETATTR traffic.

llseek(SEEK_END) on an O_DIRECT file pretty much has to do a GETATTR,
since the client can't trust its own attribute cache in this case.
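
For what it's worth, the idiom is just:

    /* SEEK_END makes the client revalidate the file size, so the
     * result can be trusted even on an O_DIRECT descriptor */
    off_t size = lseek(fd, 0, SEEK_END);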

I think we can get rid of the GETATTR at close(2) time on O_DIRECT
files. On open(2), I think the GETATTR is delayed until the first
access that uses the client's data cache. So we shouldn't have an
extra GETATTR on open(2) of an O_DIRECT file.

You haven't previously mentioned a specific case where an O_DIRECT
write(2) generates too many GETATTRs, unless I missed it.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2010-01-29 17:58:51

by Chuck Lever

Subject: Re: Random I/O over NFS has horrible performance due to small I/O transfers

[BTW: every time I reply to you, the e-mail to your address bounces.
I assume you are able to see my replies through the two reflectors
that are cc'd].

On Jan 29, 2010, at 11:57 AM, Quentin Barnes wrote:
>>> If the flush call has no data to flush, the GETATTR is pointless.
>>> The file's cached attribute information is already as valid as can
>>> ever be hoped for with NFS and its life is limited by the normal
>>> attribute timeout. On the other hand, if the flush has cached data
>>> to write out, I would expect the WRITE will return the updated
>>> attributes with the post-op attribute status, again making the
>>> GETATTR on close pointless. Can you explain what I'm not following?
>>
>> The Linux NFS client does not use the post-op attributes to update
>> the
>> cached attributes for the file. Because there is no guarantee that
>> the WRITE replies will return in the same order the WRITEs were sent,
>> it's simply not reliable. If the last reply to be received was for
>> an
>> older write, then the mtime and ctime (and possibly the size) would
>> be
>> stale, and would trigger a false data cache invalidation the next
>> time
>> a full inode validation is done.
>
> Ah, yes, the out of order WRITE replies problem. I knew I was
> forgetting something.
>
> This may be a stupid question, but why not use the post-op attribute
> information to update the inode whenever the fattr mtime exceeds the
> inode mtime, and just ignore the post-op update all the other
> times, since that would indicate an out-of-order arrival?

I seem to recall that older versions of our client used to do that,
and we may still in certain cases. Take a look at the post-op
attribute handling near nfs_update_inode() in fs/nfs/inode.c.

One problem is that WRITE replies are entirely asynchronous with
application writes, and are handled in a different kernel context
(soft IRQ? I can't remember). Serializing updates to the attribute
cache between different contexts is difficult. The solution used
today means that attributes are updated only in synchronous contexts,
so we can get a handle on the many race conditions without causing
deadlocks.

For instance, post-op attributes can indicate that the client has to
invalidate the page cache for a file. That's tricky to do correctly
in a context that can't sleep, since invalidating a page needs to take
the page lock. Setting NFS_INO_INVALID_ATTR is one way to preserve
that indication until the client is running in a context where a data
cache invalidation is safe to do.

> (Of course
> other fields would want to be checked to see if the file suddenly
> changed other state information warranting general invalidation.)
> I would assume that there are other out of order arrivals for other
> op replies that prevent such a trivial algorithm?
>
> This out-of-order post-op attribute data invalidating cache sounds
> like a well-known problem that people have been trying to solve for
> a long time or have proved that it can't be solved. If there's a
> white paper you can point me to that discuss the problem at length,
> I'd like to read it.

I don't know of one.

>>> However, I tore into the code to better understand what was
>>> triggering
>>> the GETATTR on close. A close(2) does two things, a flush
>>> (nfs_file_flush) and a release (nfs_file_release). I had thought
>>> the
>>> GETATTR was happening as part of the nfs_file_flush(). It's not.
>>> The
>>> GETATTR is triggered by the nfs_file_release(). As part of it
>>> doing a
>>> put_nfs_open_context() -> nfs_close_context() ->
>>> nfs_revalidate_inode(),
>>> that triggers the NFS_PROTO(inode)->getattr()!
>>>
>>> I'm not sure, but I suspect that O_DIRECT files take the
>>> put_nfs_open_context() path that results in the extraneous GETATTR
>>> on close(2) because filp->private_data is non-NULL where for regular
>>> files it's NULL. Is that right? If so, can this problem be easily
>>> fixed?
>>
>> I'm testing a patch to use an asynchronous close for O_DIRECT files.
>> This will skip the GETATTR for NFSv2/v3 O_DIRECT files, and avoid
>> waiting for the CLOSE for NFSv4 O_DIRECT files.
>
> When you're ready for external testing, will you be publishing it here
> on the NFS mailing list? Any guess when it might be ready?

I have a pair of patches in my kernel git repo at git.linux-nfs.org
(cel). One fixes close, the other attempts to address open. I'm
still working on the open part. I'm hoping to get these into 2.6.34.
I'm sure these are not working quite right yet, but you might want to
review the work, as it probably looks very similar to what you've
already done internally.

I've also noticed that our client still sends a lot of ACCESS requests
in the simple open-write-close use case. Too many ACCESS requests
seem to be a perennial problem. I'm going to look at that next.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2010-01-21 01:14:54

by Quentin Barnes

Subject: Re: Random I/O over NFS has horrible performance due to small I/O transfers

On Tue, Dec 29, 2009 at 09:10:52AM -0800, Chuck Lever wrote:
> On Dec 26, 2009, at 3:45 PM, Quentin Barnes wrote:
[...]
> > In porting some application code to Linux, its performance over
> > NFSv3 on Linux is terrible. I'm posting this note to LKML since
> > the problem was actually tracked back to the VFS layer.
> >
> > The app has a simple database that's accessed over NFS. It always
> > does random I/O, so any read-ahead is a waste. The app uses
> > O_DIRECT which has the side-effect of disabling read-ahead.
> >
> > On Linux accessing an O_DIRECT opened file over NFS is much akin to
> > disabling its attribute cache causing its attributes to be refetched
> > from the server before each NFS operation.
>
> NFS O_DIRECT is designed so that attribute refetching is avoided.
> Take a look at nfs_file_read() -- right at the top it skips to the
> direct read code. Do you perhaps have the actimeo=0 or noac mount
> options specified?

Sorry I've been slow in responding. I had a recent death in my
family which has been occupying all my time for the last three
weeks.

I'm sure I didn't have actimeo=0 or noac. What I was referring to
is the code in nfs_revalidate_file_size() which forces revalidation
with O_DIRECT files. According to the comments this is done to
minimize the window (race) with other clients writing to the file.
I saw this behavior as well in wireshark/tcpdump traces I collected.
With O_DIRECT, the attributes would often be refetched from the
server prior to each file operation. (Might have been just for
write and lseek file operations.) I could dig up traces if you
like.

Aside from O_DIRECT not using cached file attributes before file
I/O, this also has an odd side-effect on closing a file. After
a write(2) is done by the app, the following close(2) triggers a
refetch of the attributes. I don't care what the file attributes
are -- just let the file close already! For example, here in user
space I'm doing a:
        fd = open(..., O_RDWR|O_DIRECT);
        write(fd, ...);
        sleep(3);
        close(fd);

Which results in:
4.191210 NFS V3 ACCESS Call, FH:0x0308031e
4.191391 NFS V3 ACCESS Reply
4.191431 NFS V3 LOOKUP Call, DH:0x0308031e/scr2
4.191613 NFS V3 LOOKUP Reply, FH:0x29f0b5d0
4.191645 NFS V3 ACCESS Call, FH:0x29f0b5d0
4.191812 NFS V3 ACCESS Reply
4.191852 NFS V3 WRITE Call, FH:0x29f0b5d0 Offset:0 Len:300 FILE_SYNC
4.192095 NFS V3 WRITE Reply Len:300 FILE_SYNC
7.193535 NFS V3 GETATTR Call, FH:0x29f0b5d0
7.193724 NFS V3 GETATTR Reply Regular File mode:0644 uid:28238 gid:100

As you can see from the first-column time index, the GETATTR is done
after the sleep(3), as the file is being closed.  (This was collected
on a 2.6.32.2 kernel.)

Is there any actual need for doing that GETATTR on close that I don't
understand, or is it just a goof?

> > After some thought,
> > given the Linux behavior of O_DIRECT on regular hard disk files to
> > ensure file cache consistency, frustratingly, that's probably the
> > more correct answer to emulate this file system behavior for NFS.
> > At this point, rather than expecting Linux to somehow change to
> > avoid the unnecessary flood of GETATTRs, I thought it best for the
> > app not to just use the O_DIRECT flag on Linux. So I changed the
> > app code and then added a posix_fadvise(2) call to keep read-ahead
> > disabled. When I did that, I ran into an unexpected problem.
> >
> > Adding the posix_fadvise(..., POSIX_FADV_RANDOM) call sets
> > ra_pages=0. This has a very odd side-effect in the kernel. Once
> > read-ahead is disabled, subsequent calls to read(2) are now done in
> > the kernel via ->readpage() callback doing I/O one page at a time!
>
> Your application could always use posix_fadvise(...,
> POSIX_FADV_WILLNEED). POSIX_FADV_RANDOM here means the application
> will perform I/O requests in random offset order, and requests will be
> smaller than a page.

I agree with your first assertion, but I disagree with your second.
Nothing implies that a POSIX_FADV_RANDOM transaction must be a page
in size or smaller.

Anyways, this whole problem was corrected by Wu Fengguang in his
fix to the readahead code that my patch prompted over in LKML.

> > Poring through the code in mm/filemap.c I see that the kernel has
> > commingled read-ahead and plain read implementations. The algorithms
> > have much in common, so I can see why it was done, but it left this
> > anomaly of severely pimping read(2) calls on file descriptors with
> > read-ahead disabled.
>
> The problem is that do_generic_file_read() conflates read-ahead and
> read coalescing, which are really two different things (and this use
> case highlights that difference).
>
> Above you said that "any readahead is a waste." That's only true if
> your database is significantly larger than available physical memory.

It is. It's waaaay larger than all available physical memory on
a given client machine. (Think of tens of millions of users' email
accounts.)

> Otherwise, you are simply populating the local page cache faster than
> if your app read exactly what was needed each time.

It's a multithreaded app running across many clients accessing many
servers.  Any excess network traffic at all to the database is a
very bad idea, being detrimental both to the particular client's
throughput and to all the other clients wanting to access files on
the burdened NFS servers.

> On fast modern
> networks there is little latency difference between reading a single
> page and reading 16 pages in a single NFS read request. The cost is a
> larger page cache footprint.

Believe me, the extra file accesses do make a huge difference.

> Caching is only really harmful if your database file is shared between
> more than one NFS client.

It is.  Many clients.  But as far as the usual caching problems go, I
don't think those exist.  I think there are high-level protocols
in place to prevent multiple clients from stepping on each other's
work, but I'm not positive.  It's something I need to verify.

> In fact, I think O_DIRECT will be more of a
> hindrance if your simple database doesn't do its own caching, since
> your app will generate more NFS reads in the O_DIRECT case, meaning it
> will wait more often. You're almost always better off letting the O/S
> handle data caching.

Maybe. That's what I'm trying to determine. I think O_DIRECT was
more or less used simply to keep the apps from doing any readahead
rather than truly wanting to disable file data caching. It's a
tradeoff I'm currently analyzing.

> If you leave read ahead enabled, theoretically, the read-ahead context
> should adjust itself over time to read the average number of pages in
> each application read request. Have you seen any real performance
> problems when using normal cached I/O with read-ahead enabled?

Yes, HUGE problems.  As measured under load, we're talking about an
order of magnitude slower throughput, and then some.

The kernel can't adjust its strategy over time. There is no history
maintained because the app opens a given file, updates it, closes
it, then moves on to the next file. The file descriptor is not kept
open beyond just one or two read or write operations. Also, the
chance of the same file needing to be updated by the same client
within any reasonable time frame is very small.

Hope all this helps understand the problems I'm dealing with.

[...]
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
>

Quentin

2010-01-21 17:05:08

by Chuck Lever

Subject: Re: Random I/O over NFS has horrible performance due to small I/O transfers

On Jan 20, 2010, at 8:12 PM, Quentin Barnes wrote:
> On Tue, Dec 29, 2009 at 09:10:52AM -0800, Chuck Lever wrote:
>> On Dec 26, 2009, at 3:45 PM, Quentin Barnes wrote:
> [...]
>>> In porting some application code to Linux, its performance over
>>> NFSv3 on Linux is terrible. I'm posting this note to LKML since
>>> the problem was actually tracked back to the VFS layer.
>>>
>>> The app has a simple database that's accessed over NFS. It always
>>> does random I/O, so any read-ahead is a waste. The app uses
>>> O_DIRECT which has the side-effect of disabling read-ahead.
>>>
>>> On Linux accessing an O_DIRECT opened file over NFS is much akin to
>>> disabling its attribute cache causing its attributes to be refetched
>>> from the server before each NFS operation.
>>
>> NFS O_DIRECT is designed so that attribute refetching is avoided.
>> Take a look at nfs_file_read() -- right at the top it skips to the
>> direct read code. Do you perhaps have the actimeo=0 or noac mount
>> options specified?
>
> Sorry I've been slow in responding. I had a recent death in my
> family which has been occupying all my time for the last three
> weeks.

My condolences.

> I'm sure I didn't have actimeo=0 or noac. What I was referring to
> is the code in nfs_revalidate_file_size() which forces revalidation
> with O_DIRECT files. According to the comments this is done to
> minimize the window (race) with other clients writing to the file.
> I saw this behavior as well in wireshark/tcpdump traces I collected.
> With O_DIRECT, the attributes would often be refetched from the
> server prior to each file operation. (Might have been just for
> write and lseek file operations.) I could dig up traces if you
> like.

nfs_revalidate_file_size() is not invoked in the O_DIRECT read path.
You were complaining about read-ahead. So I'd say this problem is
independent of the issues you reported earlier with read-ahead.

> Aside from O_DIRECT not using cached file attributes before file
> I/O, this also has an odd side-effect on closing a file. After
> a write(2) is done by the app, the following close(2) triggers a
> refetch of the attributes. I don't care what the file attributes
> are -- just let the file close already! For example, here in user
> space I'm doing a:
> fd = open(..., O_RDWR|O_DIRECT);
> write(fd, ...);
> sleep(3);
> close(fd);
>
> Which results in:
> 4.191210 NFS V3 ACCESS Call, FH:0x0308031e
> 4.191391 NFS V3 ACCESS Reply
> 4.191431 NFS V3 LOOKUP Call, DH:0x0308031e/scr2
> 4.191613 NFS V3 LOOKUP Reply, FH:0x29f0b5d0
> 4.191645 NFS V3 ACCESS Call, FH:0x29f0b5d0
> 4.191812 NFS V3 ACCESS Reply
> 4.191852 NFS V3 WRITE Call, FH:0x29f0b5d0 Offset:0 Len:300 FILE_SYNC
> 4.192095 NFS V3 WRITE Reply Len:300 FILE_SYNC
> 7.193535 NFS V3 GETATTR Call, FH:0x29f0b5d0
> 7.193724 NFS V3 GETATTR Reply Regular File mode:0644 uid:28238 gid:
> 100
>
> As you can see by the first column time index that the GETATTR is done
> after the sleep(3) as the file is being closed. (This was collected
> on
> a 2.6.32.2 kernel.)
>
> Is there any actual need for doing that GETATTR on close that I don't
> understand, or is it just a goof?

This GETATTR is required generally for cached I/O and close-to-open
cache coherency. The Linux NFS FAQ at nfs.sourceforge.net has more
information on close-to-open.

For close-to-open to work, a close(2) call must flush any pending
changes, and the next open(2) call on that file needs to check that
the file's attributes haven't changed since the file was last accessed
on this client. The mtime, ctime, and size are compared between the
two to determine if the client's copy of the file's data is stale.

The flush done by a close(2) call after a write(2) may cause the
server to update the mtime, ctime, and size of the file. So, after
the flush, the client has to grab the latest copy of the file's
attributes from the server (the server, not the client, maintains the
values of mtime, ctime, and size). Otherwise, the client would have
cached file attributes that were valid _before_ the flush, but not
afterwards. The next open(2) would be spoofed into thinking that the
file had been changed by some other client, when it was its own
activity that caused the mtime/ctime/size change.

But again, you would only see this for normal cached accesses, or for
llseek(SEEK_END). The O_DIRECT path splits off well before that
nfs_revalidate_file_size() call in nfs_file_write().

For llseek(SEEK_END), this is the preferred way to get the size of a
file through an O_DIRECT file descriptor.  This is precisely because
O_DIRECT does not guarantee that the client's copy of the file's
attributes is up to date.

I see that the WRITE from your trace is a FILE_SYNC write. In this
case, perhaps the GETATTR is really not required for close-to-open.
Especially if the server has returned post-op attributes in the WRITE
reply, the client would already have up-to-date file attributes
available to it.

>>> After some thought,
>>> given the Linux behavior of O_DIRECT on regular hard disk files to
>>> ensure file cache consistency, frustratingly, that's probably the
>>> more correct answer to emulate this file system behavior for NFS.
>>> At this point, rather than expecting Linux to somehow change to
>>> avoid the unnecessary flood of GETATTRs, I thought it best for the
>>> app not to just use the O_DIRECT flag on Linux. So I changed the
>>> app code and then added a posix_fadvise(2) call to keep read-ahead
>>> disabled. When I did that, I ran into an unexpected problem.
>>>
>>> Adding the posix_fadvise(..., POSIX_FADV_RANDOM) call sets
>>> ra_pages=0. This has a very odd side-effect in the kernel. Once
>>> read-ahead is disabled, subsequent calls to read(2) are now done in
>>> the kernel via ->readpage() callback doing I/O one page at a time!
>>
>> Your application could always use posix_fadvise(...,
>> POSIX_FADV_WILLNEED). POSIX_FADV_RANDOM here means the application
>> will perform I/O requests in random offset order, and requests will
>> be
>> smaller than a page.
>
> I agree with your first assertion, but I disagree with your second.
> There's nothing to imply about the size of a POSIX_FADV_RANDOM
> transaction being a page size or smaller.

My second assertion is true on Linux. Certainly POSIX does not
require the request size limitation.

> Anyways, this whole problem was corrected by Wu Fengguang in his
> fix to the readahead code that my patch prompted over in LKML.

I was pleased to see that fix.

>>> Poring through the code in mm/filemap.c I see that the kernel has
>>> commingled read-ahead and plain read implementations. The
>>> algorithms
>>> have much in common, so I can see why it was done, but it left this
>>> anomaly of severely pimping read(2) calls on file descriptors with
>>> read-ahead disabled.
>>
>> The problem is that do_generic_file_read() conflates read-ahead and
>> read coalescing, which are really two different things (and this use
>> case highlights that difference).
>>
>> Above you said that "any readahead is a waste." That's only true if
>> your database is significantly larger than available physical memory.
>
> It is. It's waaaay larger than all available physical memory on
> a given client machine. (Think of tens of millions of users' email
> accounts.)

If the accesses on any given client are localized in the file (e.g.
there are only a few e-mail users on that client), this should be
handily dealt with by normal O/S caching behavior, even with an
enormous database file.  It really depends on the file's resident set
on each client.

>> Otherwise, you are simply populating the local page cache faster than
>> if your app read exactly what was needed each time.
>
> It's a multithreaded app running across many clients accessing many
> servers. Any excess network traffic at all to the database is a
> very bad idea being detrimental to both to the particular client's
> throughput but all other clients wanting to access files on the
> burdened NFS servers.

Which is why you might be better off relying on client-side caches in
this case. Efficient client caching is absolutely required for good
network and server scalability with such workloads.

If all of this data is contained in a single large file, your
application is relying on a single set of file attributes to determine
whether the client's cache of all the file data is stale. So
basically, read-ahead is pulling a bunch of data into the client's
page cache, then someone changes one byte in the file, and all that
data is invalidated in one swell foop. In this case, it's not
necessarily read-ahead that's killing your performance; it's excessive
client data cache invalidations.

>> On fast modern
>> networks there is little latency difference between reading a single
>> page and reading 16 pages in a single NFS read request. The cost
>> is a
>> larger page cache footprint.
>
> Believe me, the extra file accesses do make a huge difference.

If your rsize is big enough, the read-ahead traffic usually won't
increase the number of NFS READs on the wire; it increases the size of
each request. Client read coalescing will attempt to bundle the
additional requested data into a minimal number of wire READs. A
closer examination of the on-the-wire READ count vs. the amount of
data read might be interesting. It might also be useful to see how
often the same client reads the same page in the file repeatedly.
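
One cheap way to do that comparison is to look at the per-mount
counters the client already keeps in /proc/self/mountstats. A rough
sketch follows; the exact fields on the READ: line vary by kernel
version, so interpret the raw numbers against your kernel (or use the
mountstats helper script, if your nfs-utils has it):

=========
#include <stdio.h>
#include <string.h>

int main(void)
{
        FILE *f = fopen("/proc/self/mountstats", "r");
        char line[1024];

        if (!f) {
                perror("fopen /proc/self/mountstats");
                return 1;
        }

        while (fgets(line, sizeof(line), f)) {
                /* Print each mount's "device ..." header for context
                 * and its per-op READ statistics line. */
                if (strncmp(line, "device ", 7) == 0 ||
                    strstr(line, "READ:"))
                        fputs(line, stdout);
        }

        fclose(f);
        return 0;
}
=========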

>> If you leave read ahead enabled, theoretically, the read-ahead
>> context
>> should adjust itself over time to read the average number of pages in
>> each application read request. Have you seen any real performance
>> problems when using normal cached I/O with read-ahead enabled?
>
> Yes, HUGE problems. As measured under load, we're talking an order
> of magnitude slower throughput, and then some.
>
> The kernel can't adjust its strategy over time. There is no history
> maintained because the app opens a given file, updates it, closes
> it, then moves on to the next file. The file descriptor is not kept
> open beyond just one or two read or write operations. Also, the
> chance of the same file needing to be updated by the same client
> within any reasonable time frame is very small.

Yes, read-ahead context is abandoned when a file descriptor is
closed. That immediately suggests that file descriptors should be
left open, but that's only as practical as your application allows.
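
Just to illustrate the difference (a purely hypothetical sketch, not
your application's code):

=========
#define _XOPEN_SOURCE 600          /* for pread() */
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Pattern A: open/read/close per record.  Every open() starts with a
 * fresh, empty read-ahead context, so nothing learned from earlier
 * requests carries over. */
static ssize_t read_record_oneshot(const char *path, off_t off,
                                   void *buf, size_t len)
{
        int fd = open(path, O_RDONLY);
        ssize_t n;

        if (fd < 0)
                return -1;
        n = pread(fd, buf, len, off);
        close(fd);              /* read-ahead history discarded here */
        return n;
}

/* Pattern B: keep the descriptor open across records.  The read-ahead
 * context (or the lack of read-ahead after POSIX_FADV_RANDOM) persists
 * for the life of the descriptor and can adapt to the request sizes it
 * actually sees. */
static ssize_t read_record_cached(int fd, off_t off, void *buf,
                                  size_t len)
{
        return pread(fd, buf, len, off);
}

int main(int argc, char **argv)
{
        char buf[4096];

        if (argc < 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        read_record_oneshot(argv[1], 0, buf, sizeof(buf));

        int fd = open(argv[1], O_RDONLY);
        if (fd >= 0) {
                read_record_cached(fd, 0, buf, sizeof(buf));
                read_record_cached(fd, 8192, buf, sizeof(buf));
                close(fd);
        }
        return 0;
}
=========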

> Hope all this helps understand the problems I'm dealing with.

Yes, this is more clear, thanks.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2010-01-29 17:01:10

by Quentin Barnes

[permalink] [raw]
Subject: Re: Random I/O over NFS has horrible performance due to small I/O transfers

> > If the flush call has no data to flush, the GETATTR is pointless.
> > The file's cached attribute information is already as valid as can
> > ever be hoped for with NFS and its life is limited by the normal
> > attribute timeout. On the other hand, if the flush has cached data
> > to write out, I would expect the WRITE will return the updated
> > attributes with the post-op attribute status, again making the
> > GETATTR on close pointless. Can you explain what I'm not following?
>
> The Linux NFS client does not use the post-op attributes to update the
> cached attributes for the file. Because there is no guarantee that
> the WRITE replies will return in the same order the WRITEs were sent,
> it's simply not reliable. If the last reply to be received was for an
> older write, then the mtime and ctime (and possibly the size) would be
> stale, and would trigger a false data cache invalidation the next time
> a full inode validation is done.

Ah, yes, the out-of-order WRITE replies problem. I knew I was
forgetting something.

This may be a stupid question, but why not use the post-op attribute
information to update the inode whenever the fattr mtime exceeds the
cached inode mtime, and simply ignore the post-op attributes the rest
of the time, since that would indicate an out-of-order arrival? (Of
course, other fields would also need to be checked to see whether the
file suddenly changed other state warranting a general invalidation.)
I assume there are other out-of-order reply cases that prevent such a
trivial algorithm?
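
To make the comparison I have in mind concrete, here's a rough
userspace sketch of the idea (just the heuristic; it bears no
resemblance to the real nfs_update_inode() code paths, and the names
are made up):

=========
#include <stdbool.h>
#include <time.h>

struct cached_attrs {
        struct timespec mtime;
        struct timespec ctime;
        unsigned long long size;
};

/* Return true if 'a' is strictly newer than 'b'. */
static bool ts_after(const struct timespec *a, const struct timespec *b)
{
        if (a->tv_sec != b->tv_sec)
                return a->tv_sec > b->tv_sec;
        return a->tv_nsec > b->tv_nsec;
}

/*
 * Apply the post-op attributes from a WRITE reply only when they look
 * newer than what's already cached; anything older is presumed to be
 * an out-of-order reply and ignored.  Returns true if the cache was
 * updated.
 */
bool maybe_apply_postop(struct cached_attrs *cache,
                        const struct cached_attrs *postop)
{
        if (!ts_after(&postop->mtime, &cache->mtime))
                return false;   /* stale or out-of-order reply: ignore */

        *cache = *postop;
        return true;
}
=========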

Out-of-order post-op attributes invalidating the data cache sounds
like a well-known problem that people have either been trying to
solve for a long time or have proved can't be solved. If there's a
white paper you can point me to that discusses the problem at length,
I'd like to read it.

> So, post-op attributes are used to detect the need for an attribute
> cache update. At some later point, the client will perform the update
> by sending a GETATTR, and that will update the cached attributes.
> That's what NFS_INO_INVALID_ATTR is for.

NFS_INO_INVALID_ATTR is just the tip of the iceberg. I'm still trying
to absorb all the NFS inode state information that's tracked, how and
under what conditions it is updated, and when it is marked invalid.
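
For my own notes, here's how I currently picture that "mark now,
refetch later" pattern, as a loose userspace analogue (emphatically
not the kernel's actual data structures or code):

=========
#include <stdbool.h>
#include <time.h>

struct file_attrs {
        struct timespec mtime;
        unsigned long long size;
};

struct attr_cache {
        struct file_attrs attrs;
        bool invalid;        /* rough analogue of NFS_INO_INVALID_ATTR */
};

/* Stand-in for the real over-the-wire GETATTR. */
static int fetch_attrs_from_server(struct file_attrs *out)
{
        out->mtime.tv_sec = 0;
        out->mtime.tv_nsec = 0;
        out->size = 0;
        return 0;
}

/* Post-op attributes disagreed with the cache: don't apply them
 * directly, just note that the cached attributes can no longer be
 * trusted. */
void note_postop_mismatch(struct attr_cache *c)
{
        c->invalid = true;
}

/* The next caller that needs trustworthy attributes pays for the
 * refetch. */
int get_attrs(struct attr_cache *c, struct file_attrs *out)
{
        if (c->invalid) {
                int err = fetch_attrs_from_server(&c->attrs);

                if (err)
                        return err;
                c->invalid = false;
        }
        *out = c->attrs;
        return 0;
}
=========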

> The goal is to use a few extra operations on the wire to prevent
> spurious data cache invalidations. For large files, this could mean
> significantly fewer READs on the wire.

Yes, it's definitely better to reread attribute info when needed than
to cause a data cache flush.

> > However, I tore into the code to better understand what was triggering
> > the GETATTR on close. A close(2) does two things, a flush
> > (nfs_file_flush) and a release (nfs_file_release). I had thought the
> > GETATTR was happening as part of the nfs_file_flush(). It's not. The
> > GETATTR is triggered by the nfs_file_release(). As part of it doing a
> > put_nfs_open_context() -> nfs_close_context() ->
> > nfs_revalidate_inode(),
> > that triggers the NFS_PROTO(inode)->getattr()!
> >
> > I'm not sure, but I suspect that O_DIRECT files take the
> > put_nfs_open_context() path that results in the extraneous GETATTR
> > on close(2) because filp->private_data is non-NULL where for regular
> > files it's NULL. Is that right? If so, can this problem be easily
> > fixed?
>
> I'm testing a patch to use an asynchronous close for O_DIRECT files.
> This will skip the GETATTR for NFSv2/v3 O_DIRECT files, and avoid
> waiting for the CLOSE for NFSv4 O_DIRECT files.

When you're ready for external testing, will you be publishing it here
on the NFS mailing list? Any guess when it might be ready?

> > Because the app was designed with O_DIRECT in mind, the app does its
> > own file buffering in user space.
> >
> > If for the Linux port we can move away from O_DIRECT, it would be
> > interesting to see if that extra user space buffering could be
> > disabled and let the kernel do its job.
>
> You mentioned in an earlier e-mail that the application has its own
> high-level data cache coherency protocol, and it appears that your
> application is using NFS to do what is more or less real-time data
> exchange between MTAs and users, with permanent storage as a side-
> effect. In that case, the application should manage its own cache,
> since it can better optimize the number of READs required to keep its
> local data caches up to date. That would address the "too many data
> cache invalidations" problem.

True.

But we also have many other internal groups using NFS with many
different use cases. Though I'm primarily focused on this one use
case I recently mentioned, I may jump around at times to other ones
without being clear that I hopped. I'll try to watch for that. :-)

> In terms of maintaining the Linux port of your application, you
> probably want to stay as close to the original as possible, yes?
> Given all this, we'd be better off getting O_DIRECT to perform better.
>
> You say above that llseek(2), write(2), and close(2) cause excess
> GETATTR traffic.
>
> llseek(SEEK_END) on an O_DIRECT file pretty much has to do a GETATTR,
> since the client can't trust its own attribute cache in this case.
>
> I think we can get rid of the GETATTR at close(2) time on O_DIRECT
> files. On open(2), I think the GETATTR is delayed until the first
> access that uses the client's data cache. So we shouldn't have an
> extra GETATTR on open(2) of an O_DIRECT file.
>
> You haven't previously mentioned a specific case where an O_DIRECT
> write(2) generates too many GETATTRs, unless I missed it.

I've been evaluating this NFS use case with kernel builds from
2.6.{9,18,21,24,26,30,31,32,33-rc5}. After reading your note, I went
back and looked at the tcpdumps from the latest 2.6.32 runs, and I no
longer see the GETATTRs before the write(2)s. I'm going to assume
that problem came from an older kernel I was testing and that I
jumbled the results together in my head. I should have double-checked
on the current kernels instead of relying on my faulty memory.


Quentin