Hello All,
The attached program demonstrates a problem I am having writing to
files on an NFS file system. It works by creating a file, and then
seeking through the file to update it. The problem I am seeing is that
the seek/update stage takes more than 10X as long as the amount of time
required to initially create the file. And it's not even seeking in
some strange pattern.
I am running this with a 2.4.20 (red hat patched) kernel. I have not
tried it with 2.6. I have played with various mount options, but they
do not seem to make much difference. Here is one example that I used:
sledge:/l0 /mnt/v3_tcp_8k nfs rw,v3,rsize=8192,wsize=8192,hard,intr,tcp,lock,addr=sledge 0 0
Anyone have any ideas or comments?
Thanks,
Jim
----------------------------------------------------------------------
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

char buff[4096];

/* Elapsed wall-clock time, in seconds, between two gettimeofday() samples. */
double dt(struct timeval *a, struct timeval *b)
{
	double sec = b->tv_usec - a->tv_usec;
	sec /= 1e6;
	sec += b->tv_sec - a->tv_sec;
	return sec;
}

int main()
{
	struct timeval a, b;
	int i;
	FILE *fp = fopen("testfile", "w");

	/* Phase 1: create a 400MB file with sequential 4k writes. */
	printf("Creating file: ");
	fflush(stdout);
	gettimeofday(&a, 0);
	for(i=0; i<100*1024; i++)
		fwrite(buff, 4096, 1, fp);
	fflush(fp);
	gettimeofday(&b, 0);
	printf("%.3f seconds\n", dt(&a, &b));

	/* Phase 2: seek through the same file, updating one int every 5000 bytes. */
	printf("Updating file: ");
	fflush(stdout);
	gettimeofday(&a, 0);
	for(i=0; i<100*1024*sizeof(buff); i += 5000) {
		fseek(fp, i, SEEK_SET);
		fwrite(&i, sizeof(i), 1, fp);
	}
	gettimeofday(&b, 0);
	printf("%.3f seconds\n", dt(&a, &b));
	return 0;
}
On Mon, 19/01/2004 at 16:16, [email protected] wrote:
> Hello All,
> The attached program demonstrates a problem I am having writing to
> files on an NFS file system. It works by creating a file, and then
> seeking through the file to update it. The problem I am seeing is that
> the seek/update stage takes more than 10X as long as the amount of time
> required to initially create the file. And it's not even seeking in
> some strange pattern.
>
> I am running this with a 2.4.20 (red hat patched) kernel. I have not
> tried it with 2.6. I have played with various mount options, but they
> do not seem to make much difference. Here is one example that I used:
So you are surprised that writing the same dataset by putting one
integer onto each kernel page takes much more time than placing the
entire dataset onto just a few kernel pages? 'cos I'm not...
Have a look at the nfsstat output and you'll see what I mean. Doing the
former will necessarily end up generating *a lot* more NFS write
requests.
Cheers,
Trond
On Mon, Jan 19, 2004 at 04:53:46PM -0500, Trond Myklebust wrote:
> So you are surprised that writing the same dataset by putting one
> integer onto each kernel page takes much more time than placing the
> entire dataset onto just a few kernel pages? 'cos I'm not...
I must admit that I am. I could see that it would take somewhat longer
because a logical way for the kernel to implement this would be as a
read-modify-write operation. So a 2X slowdown would not surprise me.
But the slowdown is more than 10X, and that does.
Also, for what it's worth, Solaris performs like this:
flame> ./a.out
Creating file: 3.886 seconds
Updating file: 1.259 seconds
While Linux performs like this:
jesse> ./a.out
Creating file: 43.042 seconds
Updating file: 555.796 seconds
Both machines are writing to the same directory (not at the same time) on
an x86_64 Linux server running a kernel which identifies itself as
2.4.21-4.ELsmp. The network connection between the Solaris machine and
the NFS server actually seems to be slightly slower than that between
the Linux machine and the NFS server:
sledge> ping flame
PING flame.synopsys.com (10.48.2.191) 56(84) bytes of data.
64 bytes from flame.synopsys.com (10.48.2.191): icmp_seq=0 ttl=255 time=0.174 ms
64 bytes from flame.synopsys.com (10.48.2.191): icmp_seq=1 ttl=255 time=0.171 ms
64 bytes from flame.synopsys.com (10.48.2.191): icmp_seq=2 ttl=255 time=0.168 ms
sledge> ping jesse
PING jesse.synopsys.com (10.48.2.120) 56(84) bytes of data.
64 bytes from jesse.synopsys.com (10.48.2.120): icmp_seq=0 ttl=64 time=0.115 ms
64 bytes from jesse.synopsys.com (10.48.2.120): icmp_seq=1 ttl=64 time=0.113 ms
64 bytes from jesse.synopsys.com (10.48.2.120): icmp_seq=2 ttl=64 time=0.104 ms
so I don't think the discrepancy is due to network problems.
I should note that Solaris takes 135 seconds to close the file while Linux
takes almost no time. This probably means that Solaris has postponed some
of the work it needs to do until close time. However, even with the 135
seconds factored in, Solaris is still killing us.
I would like to try this with a 2.6 kernel, but that is difficult. I'll
see what I can put together though.
> Have a look at the nfsstat output and you'll see what I mean. Doing the
> former will necessarily end up generating *a lot* more NFS write
> requests.
Thanks for the info. I didn't know about nfsstat. It looks like a useful
utility.
Thanks,
Jim
On Tue, 20/01/2004 at 08:28, [email protected] wrote:
> I must admit that I am. I could see that it would take somewhat longer
> because a logical way for the kernel to implement this would be as a
> read-modify-write operation. So a 2X slowdown would not surprise me.
> But the slowdown is more than 10X, and that does.
No... Do the math: your program is literally writing to
5000*4096/PAGE_SIZE times as many pages in the fseek()+fwrite() case.
You are creating 5000 times as many non-contiguous NFS write requests
that need to be flushed out somehow.
You will have the same problem even when writing to a local filesystem.
Just try adding a fflush()+fsync() before the gettimeofday(&b, 0) and
you'll see an equivalent slowdown there too.
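Applied to the update phase of the test program, that suggestion would look
something like the sketch below (illustrative only, assuming <unistd.h> is
also included for fsync(); it is not necessarily the exact code behind the
numbers quoted later in the thread):

	/* Update phase, re-timed so the measurement includes flushing the
	 * dirty data instead of just queueing it in stdio / the page cache. */
	printf("Updating file: ");
	fflush(stdout);
	gettimeofday(&a, 0);
	for(i=0; i<100*1024*sizeof(buff); i += 5000) {
		fseek(fp, i, SEEK_SET);
		fwrite(&i, sizeof(i), 1, fp);
	}
	fflush(fp);            /* push stdio's buffer into the kernel */
	fsync(fileno(fp));     /* force the kernel to write it out to the server */
	gettimeofday(&b, 0);
	printf("%.3f seconds\n", dt(&a, &b));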
For NFS, we could possibly "optimize" the process of flushing them out a
little bit by doing read-modify-write and then coalescing two 4byte
requests into a single 8k write, but that offers no reductions at all
here 'cos you first have to read in said 8k request in a single 8k read.
Even with a 32k r/wsize, the reduction in the number of generated
requests would at best be a factor of 8. That's still more than a factor
500 short.
> Also, for what it's worth, Solaris performs like this:
>
> flame> ./a.out
> Creating file: 3.886 seconds
> Updating file: 1.259 seconds
>
> While Linux performs like this:
>
> jesse> ./a.out
> Creating file: 43.042 seconds
> Updating file: 555.796 seconds
>
>
> Both machines are writing to the same directory (not at the same time) on
> an x86_64 Linux server running a kernel which identifies itself as
> 2.4.21-4.ELsmp. The network connection between the Solaris machine and
> the NFS server actually seems to be slightly slower than that between
> the Linux machine and the NFS server:
Oh. It's an x86_64? You didn't say originally, so I assumed an ia32. OK,
I believe my modified math above is correct.
If your kernel was compiled using the default larger page size
(isn't that 16K?), then the explanation is simple: Linux generates
synchronous writes in all cases where the r/wsize is less than the
actual page size. Chuck Lever is currently working on a set of patches
that will fix that, but his work when completed will first have to go
into 2.6.
The other 2.4 limitation that will slow you down (even if you compile
with 4k pages) is that the NFS client is not allowed to cache more than
256 pages at a time. This hard limit exists in 2.4.x because there is no
mechanism for the MM to otherwise notify NFS if there is a low memory
situation that would require it to flush out all cached writes.
This limitation has already been removed in 2.6.x: there the number of
pages you can cache is only limited by the available free memory.
Cheers,
Trond
Trond Myklebust <[email protected]> writes:
>
> Oh. It's an x86_64? You didn't say originally, so I assumed an ia32. OK,
> I believe my modified math above is correct.
>
> If your kernel was compiled using the default larger page size
> (isn't that 16K?), then the explanation is simple: Linux generates
x86-64 uses 4K pages, just like i386. It doesn't support bigger pages.
-Andi
Trond Myklebust wrote:
>
> [email protected]:
> > I must admit that I am. I could see that it would take somewhat longer
> > because a logical way for the kernel to implement this would be as a
> > read-modify-write operation. So a 2X slowdown would not surprise me.
> > But the slowdown is more than 10X, and that does.
>
> No... Do the math: your program is literally writing to
> 5000*4096/PAGE_SIZE times as many pages in the fseek()+fwrite() case.
Sorry. I believe it may be you who needs to check their math.
Case 1: for(i=0; i<100*1024; i++) {
fwrite(buff, 4096, 1, fp);
}
Case 2: /* sizeof(buff) = 4096. sizeof(i) = 4 */
for(i=0; i<100*1024*sizeof(buff); i += 5000) {
fseek(fp, i, SEEK_SET);
fwrite(&i, sizeof(i), 1, fp);
}
In the first case, he's writing 102,400 chunks of 4096 bytes each (for a
grand total of 419,430,400 bytes...or 400MB).
In the second case, he's performing 83,886 seeks and writing 83,886
4-byte ints (only 1 int per seek), OVER THE SAME 400MB REGION. As such,
he cannot be writing to "more" ("times as many") pages in the second case
as he does in the first. In fact, due to the spacing of the seeks, he
can occasionally skip over some pages. He's simply writing over the
same region in a more sparse, incremental-but-non-contiguous, manner.
> You are creating 5000 times as many non-contiguous NFS write requests
> that need to be flushed out somehow.
Based on the previous logic, he is actually creating 83,886 non-contiguous
write requests(!). :^)
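To double-check those numbers, a small stand-alone helper (purely
illustrative, not part of the original test, and assuming 4k pages) can count
the writes and the distinct starting pages in each phase:

	/* Illustrative only: count the writes and the distinct 4k pages
	 * (by starting offset) touched by each phase of the test program. */
	#include <stdio.h>

	int main(void)
	{
		const long total = 100L * 1024 * 4096;	/* bytes written in phase 1 */
		long i, writes = 0, pages = 0, last_page = -1;

		for (i = 0; i < total; i += 5000) {
			writes++;
			if (i / 4096 != last_page) {	/* count only the page holding the
							   first byte of each 4-byte write */
				pages++;
				last_page = i / 4096;
			}
		}
		printf("phase 1: %ld bytes over %ld sequential pages\n",
		       total, total / 4096);
		printf("phase 2: %ld writes touching roughly %ld distinct pages\n",
		       writes, pages);
		return 0;
	}

Because the 5000-byte stride is larger than a page, every update lands on a
different page, so the second phase still generates tens of thousands of
separate, non-contiguous NFS writes even though it only moves a few hundred
kilobytes of data.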
> > Also, for what it's worth, Solaris performs like this:
> >
> > flame> ./a.out
> > Creating file: 3.886 seconds
> > Updating file: 1.259 seconds
> >
> > While Linux performs like this:
> >
> > jesse> ./a.out
> > Creating file: 43.042 seconds
> > Updating file: 555.796 seconds
A-HA! Now we're getting somewhere. This is the crux of the issue.
The bottom line question is, to put it bluntly:
Why does Solaris perform so well, and Linux suck so bad?
Why does Solaris take only 1.3 seconds to do what Linux takes ~10 minutes
to do?!?
Ironically, the "update file" section on Solaris takes LESS time than it
does to create it...which in some ways makes sense: The app is only writing
335,544 bytes in the second phase compared to 400M in the first (creation)
phase.
> > Both machines are writing to the same directory (not at the same time) on
> > an x86_64 Linux server running a kernel which identifies itself as
> > 2.4.21-4.ELsmp. The network connection between the Solaris machine and
> > the NFS server actually seems to be slightly slower than that between
> > the Linux machine and the NFS server:
>
> Oh. It's an x86_64? You didn't say originally, so I assumed an ia32. OK,
> I believe my modified math above is correct.
To clarify: The test program (a.out) is NOT running on x86_64, rather the
remote file server is x86_64 in both (flame and jesse) testcases. The
testfile created by code (from a previous post) resides on a remote
machine (sledge). 'sledge' is x86_64; 'flame' is sparc64; and 'jesse' is
ia32.
-chris
-----------------------------------------------------------------
Chris M. Petersen [email protected]
Sr. R&D Engineer
Synopsys, Inc. o: 919.425.7342
1101 Slater Road, Suite 300 c: 919.349.6393
Durham, NC 27703 f: 919.425.7320
-----------------------------------------------------------------
[email protected] wrote:
> On Mon, Jan 19, 2004 at 04:53:46PM -0500, Trond Myklebust wrote:
>
>
>>So you are surprised that writing the same dataset by putting one
>>integer onto each kernel page takes much more time than placing the
>>entire dataset onto just a few kernel pages? 'cos I'm not...
>
>
> I must admit that I am. I could see that it would take somewhat longer
> because a logical way for the kernel to implement this would be as a
> read-modify-write operation. So a 2X slowdown would not surprise me.
> But the slowdown is more than 10X, and that does.
>
> Also, for what it's worth, Solaris performs like this:
>
> flame> ./a.out
> Creating file: 3.886 seconds
> Updating file: 1.259 seconds
>
> While Linux performs like this:
>
> jesse> ./a.out
> Creating file: 43.042 seconds
> Updating file: 555.796 seconds
My client (2.6.1-mm2) against my server (2.6.1-bk6) is significantly faster:
./nfstest
Creating file: 11.607 seconds
Updating file: 35.885 seconds
Note the slowdown is only a factor of 3 instead of 10.
nfs mount options:
rw,hard,intr,rsize=8192,wsize=8192,timeo=30,retrans=10,tcp
Thanks,
Jan
On Tue, 20/01/2004 at 15:03, Chris Petersen wrote:
> In the second case, he's performing 83,886 seeks and writing 83,886
> 4-byte ints (only 1 int per seek), OVER THE SAME 400MB REGION. As such,
> he cannot be writing to "more" ("times as many") pages in the second case
> as he does in the first. In fact, due to the spacing of the seeks, he
> can occasionally skip over some pages. He's simply writing over the
> same region in a more sparse, incremental-but-non-contiguous, manner.
Duh... You're right. Sorry...
> Why does Solaris take only 1.3 seconds to do what Linux takes ~10 minutes
> to do?!?
> Ironically, the "update file" section on Solaris takes LESS time than it
> does to create it...which in some ways makes sense: The app is only writing
> 335,544 bytes in the second phase compared to 400M in the first (creation)
> phase.
2 comments:
1) your testcase is missing an fflush() call before the gettimeofday()
in case 2
2) In any case, fflush() does not actually ensure the data has been
written out to disk: it just schedules the writes (man 3 fflush). If you
actually want to measure write performance, you need to add fsync()
after the fflush().
So here's the difference:
-------------------------
On a Solaris machine:
Without fflush+fsync:
bash-2.05$ ./a.out
Creating file: 9.933 seconds
Updating file: 4.382 seconds
With fflush, but no fsync:
bash-2.05$ ./a.out
Creating file: 9.852 seconds
Updating file: 1.459 seconds
With fflush+fsync:
bash-2.05$ ./a.out
Creating file: 9.915 seconds
Updating file: 10.639 seconds
------------------------
Now on Linux 2.6.1-mm3:
Without fflush+fsync:
[trondmy@hammerfest gnurr]$ ./a.out
Creating file: 3.475 seconds
Updating file: 0.561 seconds
With fflush, but no fsync:
[trondmy@hammerfest gnurr]$ ./a.out
Creating file: 3.275 seconds
Updating file: 1.642 seconds
With fflush+fsync:
[trondmy@hammerfest gnurr]$ ./a.out
Creating file: 3.780 seconds
Updating file: 4.618 seconds
-----------------------------
Surprise! Linux and Solaris show exactly the same behaviour.
As I said before in 2.4.x we have a hard limit of 256 outstanding
requests before we have to wait on requests to complete. Remove that
limit, and all is well...
Cheers,
Trond
On Tue, Jan 20, 2004 at 04:50:44PM -0500, Trond Myklebust wrote:
> Surprise! Linux and Solaris show exactly the same behaviour.
>
> As I said before in 2.4.x we have a hard limit of 256 outstanding
> requests before we have to wait on requests to complete. Remove that
> limit, and all is well...
Nice job with the 2.6 performance! I'm looking forward to the day when
distributors start shipping it.
Thanks,
Jim