Hi,
I'm presently investigating why writing to an NFS-exported Lustre filesystem is
rather slow. Reading from Lustre over NFS runs at about 200-300 MB/s, but writing
to it over NFS is only 20-50 MB/s (both with IPoIB). Accessing this Lustre
cluster directly gives about 600-700 MB/s, both reading and writing. Well, 200-300
MB/s over NFS per client would be acceptable.
After several dozen printks, systemtaps, etc. I think it's not the fault of
Lustre, but a generic nfsd and/or VFS problem.
In nfs3svc_decode_writeargs() all the received data are split into PAGE_SIZE
chunks, except for the very first page, which only gets
PAGE_SIZE - header_length. So far no problem, but when the pages are written in
generic_file_buffered_write(), that function tries to write PAGE_SIZE at a time.
So it takes the first NFS page, which holds PAGE_SIZE - header_length.
To fill up to PAGE_SIZE it takes header_length bytes from the second page. Of
course, that leaves only PAGE_SIZE - header_length of the second NFS page.
It continues this way until the last page is written. I don't know why this
doesn't show a big effect on other filesystems. Well, maybe it does, but
nobody noticed it before?
Well, I have no idea whether generic_file_buffered_write() really has to do what
it presently does. But let's first stay with NFS: is it really necessary to
split the data up into pages this early?
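To illustrate the resulting copy pattern, here is a standalone userspace sketch
(not kernel code; the header length and number of pages are made-up example
values):

#include <stdio.h>

#define PAGE_SIZE 4096

int main(void)
{
        const unsigned int hdr = 200;   /* assumed RPC header length */
        const unsigned int npages = 4;  /* assumed number of data pages */

        /* The receive buffer is presented as iovec segments of
         * PAGE_SIZE - hdr, PAGE_SIZE, ..., hdr bytes, while the buffered
         * write path fills page-cache pages PAGE_SIZE bytes at a time, so
         * every destination page is assembled from two source segments. */
        for (unsigned int page = 0; page < npages; page++)
                printf("dest page %u: %u bytes from segment %u (offset %u)"
                       " + %u bytes from segment %u (offset 0)\n",
                       page, PAGE_SIZE - hdr, page, page ? hdr : 0,
                       hdr, page + 1);
        return 0;
}

Every page-sized copy has to be assembled from two pieces instead of one.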
Using this patch I get a write speed of about 200 MB/s, even with kernel
debugging enabled and several left-over printks:
--- nfs3xdr.c.bak 2007-07-09 01:32:17.000000000 +0200
+++ nfs3xdr.c 2007-08-31 19:29:31.000000000 +0200
@@ -405,16 +405,8 @@ nfs3svc_decode_writeargs(struct svc_rqst
                 len = args->len = max_blocksize;
         }
         rqstp->rq_vec[0].iov_base = (void*)p;
-        rqstp->rq_vec[0].iov_len = rqstp->rq_arg.head[0].iov_len - hdr;
-        v = 0;
-        while (len > rqstp->rq_vec[v].iov_len) {
-                len -= rqstp->rq_vec[v].iov_len;
-                v++;
-                rqstp->rq_vec[v].iov_base = page_address(rqstp->rq_pages[v]);
-                rqstp->rq_vec[v].iov_len = PAGE_SIZE;
-        }
-        rqstp->rq_vec[v].iov_len = len;
-        args->vlen = v + 1;
+        rqstp->rq_vec[0].iov_len = len;
+        args->vlen = 1;
         return 1;
 }
Cheers,
Bernd
--
Bernd Schubert
Q-Leap Networks GmbH
On Fri, Aug 31, 2007 at 08:03:30PM +0200, Bernd Schubert wrote:
> I'm presently investigating why writing to an NFS-exported Lustre filesystem is
> rather slow. Reading from Lustre over NFS runs at about 200-300 MB/s, but writing
> to it over NFS is only 20-50 MB/s (both with IPoIB). Accessing this Lustre
> cluster directly gives about 600-700 MB/s, both reading and writing. Well, 200-300
> MB/s over NFS per client would be acceptable.
>
> After several dozen printks, systemtaps, etc. I think it's not the fault of
> Lustre, but a generic nfsd and/or VFS problem.
Thanks for looking into this!
> In nfs3svc_decode_writeargs() all the received data are split into PAGE_SIZE
> chunks, except for the very first page, which only gets
> PAGE_SIZE - header_length. So far no problem, but when the pages are written in
> generic_file_buffered_write(), that function tries to write PAGE_SIZE at a time.
> So it takes the first NFS page, which holds PAGE_SIZE - header_length.
> To fill up to PAGE_SIZE it takes header_length bytes from the second page. Of
> course, that leaves only PAGE_SIZE - header_length of the second NFS page.
> It continues this way until the last page is written. I don't know why this
> doesn't show a big effect on other filesystems. Well, maybe it does, but
> nobody noticed it before?
Hm. Any chance this is the same problem?:
http://marc.info/?l=linux-nfs&m=112289652218095&w=2
> Using this patch I get a write speed of about 200 MB/s, even with kernel
> debugging enabled and several left-over printks:
At too high a cost, unfortunately:
> --- nfs3xdr.c.bak 2007-07-09 01:32:17.000000000 +0200
> rqstp->rq_vec[0].iov_base = (void*)p;
...
> + rqstp->rq_vec[0].iov_len = len;
> + args->vlen = 1;
There's no guarantee the later pages in the rq_pages array are
contiguous in memory after the first one, so the rest of that iovec
probably has random data in it.
(You might want to add to your tests some checks that the right data
still gets to the file afterwards.)
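A minimal userspace sketch of such a check (the path and size are just example
values; to catch a server-side problem, re-read after a remount or from a second
client so the client's page cache doesn't hide it):

#include <stdio.h>

int main(void)
{
        const char *path = "/mnt/nfs/pattern-test";   /* assumed mount point */
        const size_t nvals = 4 * 1024 * 1024;         /* 16 MB of 4-byte values */
        FILE *f = fopen(path, "w");
        if (!f) { perror("fopen"); return 1; }
        for (size_t i = 0; i < nvals; i++) {
                unsigned int v = (unsigned int)i;
                if (fwrite(&v, sizeof(v), 1, f) != 1) { perror("fwrite"); return 1; }
        }
        fclose(f);

        /* Read the pattern back and compare against what was written. */
        f = fopen(path, "r");
        if (!f) { perror("fopen"); return 1; }
        for (size_t i = 0; i < nvals; i++) {
                unsigned int v;
                if (fread(&v, sizeof(v), 1, f) != 1 || v != (unsigned int)i) {
                        fprintf(stderr, "mismatch at value %zu\n", i);
                        return 1;
                }
        }
        fclose(f);
        puts("pattern verified");
        return 0;
}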
--b.
On Fri, 2007-08-31 at 14:45 -0400, J. Bruce Fields wrote:
>
> Hm. Any chance this is the same problem?:
>
> http://marc.info/?l=linux-nfs&m=112289652218095&w=2
Did this ever land anywhere?
b.
--
A day in the yard with my son is just like a day at work. He goes
hunting around for stuff and brings it back to me and says: "Hey Dad,
look what I found. The money is for me and the screw is for you."
On Fri, Aug 31, 2007 at 02:52:15PM -0400, Brian J. Murrell wrote:
> On Fri, 2007-08-31 at 14:45 -0400, J. Bruce Fields wrote:
> >
> > Hm. Any chance this is the same problem?:
> >
> > http://marc.info/?l=linux-nfs&m=112289652218095&w=2
>
> Did this ever land anywhere?
No--I think there was some discussion of problems in the follow-up
posts.
--b.
Hello Bruce,
thanks for your help!
On Friday 31 August 2007, J. Bruce Fields wrote:
> On Fri, Aug 31, 2007 at 08:03:30PM +0200, Bernd Schubert wrote:
> > I'm presently investigating why writing to an NFS-exported Lustre
> > filesystem is rather slow. Reading from Lustre over NFS runs at about
> > 200-300 MB/s, but writing to it over NFS is only 20-50 MB/s (both with
> > IPoIB). Accessing this Lustre cluster directly gives about 600-700 MB/s,
> > both reading and writing. Well, 200-300 MB/s over NFS per client would
> > be acceptable.
> >
> > After several dozen printks, systemtaps, etc. I think it's not the
> > fault of Lustre, but a generic nfsd and/or VFS problem.
>
> Thanks for looking into this!
I will pass these thanks on to my boss, who is paying me for this work :)
>
> > In nfs3svc_decode_writeargs() all the received data are split into
> > PAGE_SIZE chunks, except for the very first page, which only gets
> > PAGE_SIZE - header_length. So far no problem, but when the pages are
> > written in generic_file_buffered_write(), that function tries to write
> > PAGE_SIZE at a time. So it takes the first NFS page, which holds
> > PAGE_SIZE - header_length.
> > To fill up to PAGE_SIZE it takes header_length bytes from the second
> > page. Of course, that leaves only PAGE_SIZE - header_length of the
> > second NFS page.
> > It continues this way until the last page is written. I don't know why
> > this doesn't show a big effect on other filesystems. Well, maybe it
> > does, but nobody noticed it before?
>
> Hm. Any chance this is the same problem?:
>
> http://marc.info/?l=linux-nfs&m=112289652218095&w=2
Looks similar.
+        if (vec[0].iov_len + vec[vlen-1].iov_len != PAGE_CACHE_SIZE)
+                return 0;
+        for (i = 1; i < vlen - 1; ++i) {
+                if (vec[i].iov_len != PAGE_CACHE_SIZE)
+                        return 0;
+        }
As I tried to say in my last mail:
vec[0].iov_len = PAGE_SIZE - header_length
vec[1 ... n-1].iov_len = PAGE_SIZE
vec[n].iov_len = header_length
And this looks like it needs quite a few CPU cycles:
+        memmove(this_page + chunk0, this_page, chunk1);
+        memcpy(this_page, prev_page + chunk1, chunk0);
I will test the patch tomorrow.
>
> > Using this patch I get a write speed of about 200 MB/s, even with kernel
> > debugging enabled and several left-over printks:
>
> At too high a cost, unfortunately:
> > --- nfs3xdr.c.bak 2007-07-09 01:32:17.000000000 +0200
> > rqstp->rq_vec[0].iov_base = (void*)p;
>
> ...
>
> > + rqstp->rq_vec[0].iov_len = len;
> > + args->vlen = 1;
>
> There's no guarantee the later pages in the rq_pages array are
> contiguous in memory after the first one, so the rest of that iovec
> probably has random data in it.
Hmm, it's been some time since I last read RFC 1813, but I can't remember anything
like "data are sent in pages and the pages may arrive in random order". So I guess
some kind of multi-threading is filling in the data the client is sending?
Given the performance impact this has, maybe single-threading per client
request would be better?
Can you point me to the corresponding function?
>
> (You might want to add to your tests some checks that the right data
> still gets to the file afterwards.)
Hmm, I would need to put the data on a ram-disk. All RAID boxes sufficiently fast
for this operation are in use for Lustre storage.
Thanks again,
Bernd
On Fri, Aug 31, 2007 at 11:34:49PM +0200, Bernd Schubert wrote:
> On Friday 31 August 2007, J. Bruce Fields wrote:
> > There's no guarantee the later pages in the rq_pages array are
> > contiguous in memory after the first one, so the rest of that iovec
> > probably has random data in it.
>
> Hmm, it's been some time since I last read RFC 1813, but I can't remember anything
> like "data are sent in pages and the pages may arrive in random order". So I guess
> some kind of multi-threading is filling in the data the client is sending?
The data all arrives in one big chunk, in order. But then we have to
put it some place. The kernel almost never tries to allocate more than
one contiguous page of memory--memory fragmentation can make it
difficult to do that reliably--so we just ask for a bunch of pages to
put the data in, which may represent memory from all over the place,
store those pages into an array, and receive the data into those pages
in the order they're listed in the array.
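A rough userspace analogue of that (purely illustrative, not the actual sunrpc
code): page-sized buffers allocated separately, listed in order in an iovec,
and readv() scattering one in-order byte stream across them.

#include <stdlib.h>
#include <sys/uio.h>

#define PAGE_SZ 4096
#define NPAGES  8

/* Receive up to NPAGES * PAGE_SZ bytes from fd into NPAGES separately
 * allocated buffers; nothing guarantees the buffers are adjacent in memory,
 * only that the iovec lists them in the order the data should be placed. */
ssize_t receive_into_pages(int fd, void *pages[NPAGES])
{
        struct iovec vec[NPAGES];

        for (int i = 0; i < NPAGES; i++) {
                pages[i] = malloc(PAGE_SZ);
                if (!pages[i])
                        return -1;
                vec[i].iov_base = pages[i];
                vec[i].iov_len  = PAGE_SZ;
        }
        return readv(fd, vec, NPAGES);
}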
--b.
On Friday 31 August 2007, J. Bruce Fields wrote:
> On Fri, Aug 31, 2007 at 11:34:49PM +0200, Bernd Schubert wrote:
> > On Friday 31 August 2007, J. Bruce Fields wrote:
> > > There's no guarantee the later pages in the rq_pages array are
> > > contiguous in memory after the first one, so the rest of that iovec
> > > probably has random data in it.
> >
> > Hmm, it's been some time since I last read RFC 1813, but I can't remember
> > anything like "data are sent in pages and the pages may arrive in random
> > order". So I guess some kind of multi-threading is filling in the data
> > the client is sending?
>
> The data all arrives in one big chunk, in order. But then we have to
> put it some place. The kernel almost never tries to allocate more than
> one contiguous page of memory--memory fragmentation can make it
> difficult to do that reliably--so we just ask for a bunch of pages to
> put the data in, which may represent memory from all over the place,
> store those pages into an array, and receive the data into those pages
> in the order they're listed in the array.
Ah, now I understand, thanks! I'm still used to userspace programming (*)
Thanks,
Bernd
PS: (*) I don't know if I will ever really like kernel programming - it
reminds me of metal-organic chemistry, where everything is a hundred times more
difficult than usual; even simply weighing out 100 g of a substance might take a
few hours.
On Friday 31 August 2007, J. Bruce Fields wrote:
> On Fri, Aug 31, 2007 at 02:52:15PM -0400, Brian J. Murrell wrote:
> > On Fri, 2007-08-31 at 14:45 -0400, J. Bruce Fields wrote:
> > > Hm. Any chance this is the same problem?:
> > >
> > > http://marc.info/?l=linux-nfs&m=112289652218095&w=2
> >
> > Did this ever land anywhere?
>
> No--I think there was some discussion of problems in the follow-up
> posts.
To sum up this discussion:
There are two choices for moving the data:
1.) To pages 2 ... n - this will overwrite the NFSv4 data at the end of page n.
2.) To pages 1 ... n-1 - this will overwrite the header, so all pointers into
that memory will now point to the wrong data.
It seems both approaches are troublesome and nobody bothered to implement them.
Not that I much like the idea of moving data around at all, but what about
3.) When allocating the pages, allocate one page more than required. After
filling in page 1, skip page 2 and proceed with page 3.
Now we would have space to properly move the data later on, thus:
memcpy (page2, page1 + hdr_len, PAGE_SIZE - hdr_len);
memcpy (page2 + PAGE_SIZE - hdr_len, page3, hdr_len);
memmove(page3, page3 + hdr_len, PAGE_SIZE - hdr_len);
[...]
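Generalized over all pages, that would be something like the following rough
userspace sketch (made-up names, 0-based indices instead of the 1-based page
numbering above, and ignoring a possibly short last page):

#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* pages[0] holds the RPC header plus the first chunk of data, pages[1] is
 * the extra, still empty page, and pages[2..npages-1] hold the rest of the
 * data as received. */
static void realign_write_data(unsigned char **pages, size_t npages,
                               size_t hdr_len)
{
        size_t data0 = PAGE_SIZE - hdr_len;

        /* The data that followed the header in page 0 becomes the start of
         * the spare page. */
        memcpy(pages[1], pages[0] + hdr_len, data0);

        /* Every following page donates its first hdr_len bytes to the
         * previous page and shifts the remainder to its own start. */
        for (size_t i = 2; i < npages; i++) {
                memcpy(pages[i - 1] + data0, pages[i], hdr_len);
                memmove(pages[i], pages[i] + hdr_len, data0);
        }
}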
Can you point me to the function that assigns the data block from the network to
the page vector?
Thanks,
Bernd