2001-10-16 09:11:07

by Robert Cohen

Subject: [Bench] New benchmark showing fileserver problem in 2.4.12

I have recently been reporting on problems with file server performance
in recent 2.4 kernels.
Since the setup I was using is difficult for most people to reproduce
(it involved 5 Mac clients), I have taken the time to find a benchmark
that more or less reproduces the problems in a more accessible manner.

The original benchmark involved a number of file server clients writing
to the server.

The new benchmark involves two programs, "send" and "receive". Send
generates data on standard output.
Receive takes data from stdin and writes it to a file. They are set up to
do this for a number of repetitions.
When "receive" reaches the end of the file it seeks back to the
beginning and rewrites the file.
I think it may be significant that the file is not truncated; it is
overwritten.
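
The core of "receive" is essentially the following loop. This is only a
sketch (the real source is linked below), so the file name, buffer size
and argument handling here are illustrative:

/* Sketch only: the behaviour described above, not the actual receive.c. */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BUFSIZE 8192

int main(int argc, char **argv)
{
        long megs, total, done;
        int reps, r, n, fd;
        char buf[BUFSIZE];

        if (argc < 3)
                return 1;
        megs = atoi(argv[1]);                   /* file size in megabytes */
        reps = atoi(argv[2]);                   /* number of passes */
        total = megs * 1024 * 1024;

        /* note: no O_TRUNC -- after the first pass the file is rewritten
         * in place rather than truncated */
        fd = open("testfile", O_RDWR | O_CREAT, 0644);

        for (r = 0; r < reps; r++) {
                done = 0;
                while (done < total) {
                        n = read(0, buf, BUFSIZE);   /* data from stdin (the rsh pipe) */
                        if (n <= 0)
                                return 0;
                        write(fd, buf, n);           /* writes whatever read() returned */
                        done += n;
                }
                fsync(fd);                           /* flush to disk each pass */
                lseek(fd, 0, SEEK_SET);              /* seek back; don't truncate */
        }
        close(fd);
        return 0;
}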

Send and Receive are designed to run over an rsh pipe. The programs take
two parameters, "file_size" and the number of repetitions; the same
parameters should be given to each program.

To duplicate the activity of the original benchmark, I run 5 copies each
using files of 30 Megs:
./send 30 10 | rsh server ./receive 30 10 &

Since it's a networked benchmark you need at least 2 Linux machines on a
100 Mbit (or faster) network.
Originally I thought I might need to run the "send" programs on separate
machines, but testing indicates that I get the same problems running all
the "send"s on one machine and all the "receive"s on another.
I have to admit I used a Solaris box to run the sends on since I don't
have 2 Linux machines here, but I can't see why that would make any
difference.


The source code for send is at http://tltsu.anu.edu.au/~robert/send.c
Receive is at http://tltsu.anu.edu.au/~robert/receive.c

In order to produce the problem, the collective filesize has to be
bigger than the memory in the server.
In this example the collective filesize is 5*30=150 Megs.

You can see the problems most clearly by running vmstat while the
program runs.

So if I run it against a server with 256 Megs of memory, there are no
problems. The run takes about 6 minutes to complete.
A vmstat output is available at
http://tltsu.anu.edu.au/~robert/linux_logs/sr-256

If I run it against a server with 128 Megs of memory, the throughput as
shown by the "bo" stat starts out fine but the page cache usage rises
while the files are written. When the page cache tops out, the "bo"
figure drops sharply. At this point we get reads happening as shown by
"bi" even though the program does no reads. I presume that pages evicted
from page cache need to be read back into page cache before they can be
modified by writes.

With 128 Megs of memory, the benchmark takes about 30 minutes to run, so
it's 5 times slower than with 256 Megs. Given that the system isn't
actually getting any benefit out of the page cache, since the files are
never read back in, I would have hoped there wouldn't be much difference.
A vmstat output for a 128 Meg run is at
http://tltsu.anu.edu.au/~robert/linux_logs/sr-128.


I can reproduce the problems with 256 Megs of memory by running 5
clients with 60 Meg files instead of 30 Meg files.

I get similar results with the following kernels:

2.4.10-ac11 with Rik's Hog patch.
2.4.12-ac3
2.4.11-pre6

With an aa kernel, 2.4.13pre2-aa1, once the page cache fills up we
start getting order-0 allocation failures. The OOM killer kicks in and
kills one of the receives (even though it only allocates 8k of memory
:-( ). The remaining clients then show similar throughput problems.

The problem does not occur when the sends and receives are run on the
same machine connected by pipes. This seems to indicate that it's an
interaction between the memory usage of the page cache and the memory
usage of the network subsystem.

Also the problem is not as pronounced if I test with 1 client accessing
150 Megs rather than 5 clients accessing 30 Megs each.

--
Robert Cohen
Unix Support
TLTSU
Australian National University
Ph: 612 58389


2001-10-17 13:10:05

by Robert Cohen

Subject: Re: [Bench] New benchmark showing fileserver problem in 2.4.12

I have had a chance to do some more testing with the test program I
posted yesterday. I have been able to try various combinations of
parameters and variations of the programs.

I now have a pretty good idea of what specific activities will see the
performance problems I was seeing. But since I'm not a kernel guru, I
have no idea as to why the problem exists or how to fix it.

I am interested in reports from people who can run the test. I would
like to confirm my findings (or simply confirm that I'm crazy :-).

The problems appear to only happen in a very specific set of
circumstances. It's an incredible coincidence that my original
lantest/netatalk testing happened to hit that specific combination of
factors.
So it looks like I haven't actually found a generic performance problem
with Linux as such. But I would still like to get to the bottom of this.

The factors that cause these problems probably won't occur very often in
real usage, but they are things that are not obviously silly. So it does
indicate a problem with some dark corner of the Linux kernel that
probably should be investigated.

I have identified 4 specific factors that contribute to the problem. All
4 have to be present before there is a performance problem.


Summary of the factors involved
===============================

Factor 1: the performance problems only occur when you are rewriting an
existing file in place. That is, writing to an existing file which is
opened without O_TRUNC. Equivalently, if you have written a file and
then seek'ed back to the beginning and started writing again. I admit
this is something that not many real programs (except databases) do. But
it still shouldn't cause any problems.

Factor 2: the performance problems only occur when the part of the file
you are rewriting is not already present in the page cache. This tends
to happen when you are doing I/O to files larger than memory, or if you
are rewriting an existing file which has just been opened.

Factor 3: the performance problems only happen for I/O that is due to
network traffic, not I/O that was generated locally. I realise this is
extremely strange and I have no idea how it knows that I/O is due to
network traffic, let alone why it cares. But I can assure you that it
does make a difference.

Factor 4: the performance problem is only evident with small writes, eg
write calls with an 8k buffer. Actually, the performance hit is there
with larger writes, just not significant enough to be an issue. It's
tempting to say "well, just use larger buffers". But this isn't always
possible and anyway, 8k buffers should still work adequately, just not
optimally.



Experimental evidence
=====================


Factor 1: the performance problems only occur when you are rewriting an
existing file in place. That is, writing to an existing file which is
opened without O_TRUNC. Equivalently, if you have written a file and
then seek'ed back to the beginning and started writing again.

Evidence: in the report I posted yesterday, the test I was using
involved 5 clients rewriting 30 Meg files on a 128 Meg machine. The
symptom was that after about 10 seconds, the throughput as shown by
vmstat "bo" drops sharply and we start getting reads occurring, as shown
by the "bi" figure. However, with that test the page cache fills up
after 10 seconds. This is only just before the end of the files is
reached and we start rewriting the files. So it's difficult to see which
of those two is causing the problem. Yesterday, I attributed the
problems to the page cache filling up, but I was apparently wrong.

The new test I am using is 5 copies of

./send 200 2 | rsh server ./receive 200 2

Here we have 5 clients each rewriting a 200 Meg file.
With this test, the page cache fills up after about 10 seconds, but
since we are writing a total of 1 Gig of files, the end of the files is
not reached for 2 minutes or so. It is at this point that we start
rewriting the files.

When the page cache fills up, there is no drop in performance. However,
when the end of the file is reached and we start to rewrite, the
throughput drops and we get the reads happening. So the problems are
obviously due to the rewriting of an existing file, not due to the page
cache filling.

It doesn't make any difference whether the test seeks back to the start
to rewrite or if it closes the file and reopens it without O_TRUNC.
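
In code terms, the two equivalent cases look something like this (sketch
only, not the actual test-program code; the file name is illustrative):

#include <fcntl.h>
#include <unistd.h>

void rewrite_by_reopening(void)
{
        /* (a) reopen an existing file without O_TRUNC: writes from
         * offset 0 overwrite the old data instead of truncating it */
        int fd = open("testfile", O_WRONLY);
        /* ... write the whole file ... */
        close(fd);
}

void rewrite_by_seeking(int fd)
{
        /* (b) after writing the file once, seek back to the start and
         * write it all again */
        lseek(fd, 0, SEEK_SET);
        /* ... write the whole file again ... */
}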



Factor 2: the performance problems only occur when the part of the file
that is being rewritten is not already present in the page cache.

Evidence: I modified the "receive" test program to write to a named file
and to not delete the file after the run, so I could rewrite existing
files with only one pass.

On a machine with 128 Megs of memory

I created 5 large test files.
I purged these files from the page cache by writing another file larger
than memory and deleting it.
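
In outline, that purging step amounts to something like the following
(only a sketch; the file name and size are illustrative):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char buf[8192];
        long written = 0;
        long target = 256L * 1024 * 1024;    /* anything comfortably larger than RAM */
        int fd = open("junkfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        memset(buf, 0, sizeof(buf));
        /* writing more data than fits in memory pushes the cached test
         * files out of the page cache */
        while (written < target) {
                write(fd, buf, sizeof(buf));
                written += sizeof(buf);
        }
        fsync(fd);
        close(fd);
        unlink("junkfile");                  /* then drop the junk file's own pages */
        return 0;
}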

I did a run of 5 copies of ./send 18 1 | rsh server ./receive 18 1
(each one on a different file).
I did a second run of ./send 18 1 | rsh server ./receive 18 1

With the first run, the files were not present in the page cache and the
performance problems were seen. This run took about 95 seconds. Since
the total size of the 5 files is smaller than the available page cache,
they were all still present after the first run.

The second run took about 20 seconds. So the presence of data in the
cache makes a significant difference.

It seems natural to say "of course the cache sped things up, that's what
caches are for". However, the cache should not have sped this operation
up. Only writes are being done, no reads. So there is no reason why the
presence of data in the cache which is going to be overwritten anyway
should speed things up.
Also, the cache shouldn't speed writes up, since the program does an fsync
to flush the cache on write. And even if the cache does speed writes up, it
should have the same effect on both runs.

I had originally thought the problem occurred when the page cache was
full. I assumed it was due to the extra work to purge something from the
page cache to make space for a new write. However, with this test I
observed that the performance was bad even when the page cache did not
fill memory and there was plenty of free memory. So it seems that the
performance problem is purely due to rewriting something which is not
present in the page cache. It has nothing to do with the amount of free
memory or whether the page cache is filling memory.

In this kind of test, if the collective size of the files is greater
than the amount of memory available for page cache, then problems can be
observed even with the second run. For example, suppose you are writing to
120 Megs of files and there is 100 Megs of page cache. On the second run,
even though 100 Megs of the files are present in the page cache, you get
no benefit because each portion of the file will be flushed to make way
for new writes before you get around to rewriting that portion. This is
the standard LRU performance wall when the working set is bigger than
available memory.



Factor 3: the problems only happen for I/O that is due to network
traffic.
Evidence: The problem does occur when you have a second machine
"rsh"ing into the Linux server.
However, if you run the test entirely on the Linux server with any of
the following:

./send 30 10 | ./receive 30 10
./send 30 10 | rsh localhost ./receive 30 10
./send 30 10 | rsh server ./receive 30 10

then the problem does not occur. Strangely, we also don't see any reads
showing up in the vmstat output in these cases.
It seems the page cache is able to rewrite existing files without doing
any reading first under these conditions.

This is the really strange issue. I have no idea why it would make a
difference whether the receive program is taking its standard input from
a local source or from an rsh over the network. Why would the behaviour
of the page cache differ in these circumstances? If any gurus can clue
me in, I would appreciate it.



Factor 4: the performance problem only occurs with small writes.
Evidence: the test programs I posted yesterday were doing IO with 8K
buffers (set by a define) because that was what the original benchmark I
was emulating did. If I modify "receive" to use a 64k buffer, I get
adequate throughput.
The anomalous reads are still happening, but don't seem to impact
performance too much. The throughput ramps smoothly between 8k and 64k
buffers.

One possible response is a variation on the old joke: if you experience
problems when you do 8k writes, then don't do 8k writes.
However, I would like to understand why we are seeing a problem with 8k
writes. It's not as if 8k is *that* small. At worst, small writes should
just chew CPU time, but we get lots of CPU idle time during the
benchmark, just poor throughput. The evidence suggests some kind of
constant overhead for each write.

Modifying the buffer size in send simply reduces the amount of CPU that
send uses, which is as you would expect. Doing this doesn't have much
effect on the overall throughput.


--
Robert Cohen
Unix Support
TLTSU
Australian National University
Ph: 612 58389

2001-10-17 15:17:53

by John Stoffel

Subject: Re: [Bench] New benchmark showing fileserver problem in 2.4.12


[ lots of wonderful problem tracking deleted. ]

Robert> Factor 4: the performance problem only occurs with small
Robert> writes. Evidence: the test programs I posted yesterday were
Robert> doing IO with 8K buffers (set by a define) because that was
Robert> what the original benchmark I was emulating did. If I modify
Robert> "receive" to use a 64k buffer, I get adequate throughput. The
Robert> anomalous reads are still happening, but don't seem to impact
Robert> performance too much. The throughput ramps smoothly between 8k
Robert> and 64k buffers.

I'm not a kernel hacker either, but I wonder what happens when you
scale your buffers above and below your ranges. I.e. what happens
with 1k, 2k, 4k, 128k, 256k buffers? Do you get a linear (or at least
a smooth curve) change between these values?

I also wonder about whether using TCP vs. UDP packets over sockets
makes any difference in your testing. More tests for you to write and
do, but it might help narrow down where the bad interaction is really
happening.

Good luck,
John
John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
[email protected] - http://www.lucent.com - 978-952-7548

2001-10-17 15:12:33

by M. Edward Borasky

Subject: RE: [Bench] New benchmark showing fileserver problem in 2.4.12

Have you looked at CPU utilization? Is it abnormally high when the system
slows down?

--
M. Edward (Ed) Borasky, Chief Scientist, Borasky Research
http://www.borasky-research.net
mailto:[email protected]
http://groups.yahoo.com/group/BoraskyResearchJournal

Q: How do you tell when a pineapple is ready to eat?
A: It picks up its knife and fork.

> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]]On Behalf Of Robert Cohen
> Sent: Wednesday, October 17, 2001 6:07 AM
> To: [email protected]
> Subject: Re: [Bench] New benchmark showing fileserver problem in 2.4.12
>
> [ Robert's full message, quoted verbatim above in the thread, trimmed here. ]

2001-10-17 15:34:24

by Marcelo Tosatti

Subject: Re: [Bench] New benchmark showing fileserver problem in 2.4.12


Robert,

I appreciated your report. Thanks.

The network issue is what I'm concerned with: the kernel core methods are
the _same_ with and without networking.

If you're able to reproduce the problem locally I'll try to track down the
thing. Three factors (involving network, which I know almost nothing
about) are too much for me right now :)

On Wed, 17 Oct 2001, Robert Cohen wrote:

> I have had a chance to do some more testing with the test program I
> posted yesterday. I have been able to try various combinations of
> parameters and variations of the programs.

2001-10-17 15:47:35

by Andreas Dilger

Subject: Re: [Bench] New benchmark showing fileserver problem in 2.4.12

On Oct 17, 2001 23:06 +1000, Robert Cohen wrote:
> Factor 1: the performance problems only occur when you are rewriting an
> existing file in place. That is writing to an existing file which is
> opened without O_TRUNC. Equivalently, if you have written a file and
> then seek'ed back to the beginning and started writing again.
>
> Evidence: in the report I posted yesterday, the test I was using
> involved 5 clients rewriting 30 Meg files on a 128 Meg machine. The
> symptom was that after about 10 seconds, the throughput as shown by
> vmstat "bo" drops sharply and we start getting reads occuring as shown
> by the "bi" figure.

Just a guess - if you are getting reads that are about the same as writes,
then it would indicate that the code is doing "read-modify-write" for the
existing file data rather than just "write". This would be caused by not
writing only full-sized aligned blocks to the files.

As to why this is happening only over the network - it may be that you
are unable to send an even multiple of the blocksize over the network (MTU)
and this is causing fragmented writes. Try using a smaller block size like
4k or so to see if it makes a difference.

Another possibility is that with 8k chunks you are needing order-1
allocations to receive the data and this is causing a lot of searching for
buffers to free.

Cheers, Andreas
--
Andreas Dilger \ "If a man ate a pound of pasta and a pound of antipasto,
\ would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/ -- Dogbert

2001-10-17 16:45:39

by Linus Torvalds

Subject: Re: [Bench] New benchmark showing fileserver problem in 2.4.12

In article <[email protected]>,
Robert Cohen <[email protected]> wrote:
>
>Factor 3: the performance problems only happens for I/O that is due to
>network traffic, not I/O that was generated locally. I realise this is
>extremely strange and I have no idea how it knows that I/O is due to
>network traffic let alone why it cares. But I can assure you that it
>does make a difference.

I'll bet you $5 AUD that this happens because you don't block your
output into nicely aligned chunks.

When you have an existing file, and you write 1500 bytes to the middle
of it, performance will degrade _horribly_ compared to the case of
writing a full block, or writing to a position that hasn't been written
yet.

Your benchmark probably just does the equivalent of

for (;;) {
        int bytes = read(in, buf, BUFSIZE);
        if (bytes <= 0)
                break;
        write(out, buf, bytes);
}

am I right? The above is obvious code, but it happens to be bad code.

Now, when you read from the network, you will NOT get reads that are a
nice multiple of BUFSIZE, you'll get reads that are a multiple of the
packet data load (~1460 bytes on TCP over ethernet), and you'll end up
doing unaligned writes that require a read-modify-write cycle and thus
end up doing twice as much IO.

And not only does it do twice as much IO (and potentially more with
read-ahead), the read will obviously be _synchronous_, so the slowdown
is more than twice as much.

In contrast, when the source is a local file (or a pipe that ends up
chunking stuff up in 4kB chunks instead of 1500-byte packets), you'll
have nice write patterns that fill the whole buffer and make the read
unnecessary. Which gets you nice streaming writes to disk.

With most fast disks, this is not unlikely to be a performance difference
of an order of magnitude.

And there is _nothing_ the kernel can do about it. Your benchmark is
bad, and has different behaviour depending on the source.

In short, fix your program. Change the loop to be something like

unsigned int so_far = 0;
for (;;) {
        int bytes = read(in, buf+so_far, BUFSIZE-so_far);
        if (bytes <= 0)
                break;
        so_far += bytes;
        if (so_far < BUFSIZE)
                continue;
        write(out, buf, BUFSIZE);
        so_far = 0;
}
if (so_far)
        write(out, buf, so_far);

which will act the same for partial and full reads, and I bet you'll see
the same difference for local and networking I/O (modulo the speed
difference in the _source_, of course).

Oh, and I bet you that once you do something like the above, you won't
see much difference between a 8kB buffer and a 256kB buffer. The
smaller buffer will generate more system calls, but it won't much matter
(and sometimes the smaller buffer performs better due to better data
cache locality and better overlapping IO - system calls under Linux
aren't slow, other factors can easily dominate).

Linus

2001-10-18 02:00:52

by Leo Mauro

Subject: Re: [Bench] New benchmark showing fileserver problem in 2.4.12

Small fix to Linus's sample code

unsigned int so_far = 0;
for (;;) {
        int bytes = read(in, buf+so_far, BUFSIZE-so_far);
        if (bytes <= 0)
                break;
        so_far += bytes;
        if (so_far < BUFSIZE)
                continue;
        write(out, buf, BUFSIZE);
-       so_far = 0;
+       so_far -= BUFSIZE;
}
if (so_far)
        write(out, buf, so_far);

to avoid losing data.

2001-10-18 04:54:32

by Robert Cohen

Subject: Re: [Bench] New benchmark showing fileserver problem in 2.4.12

Linus Torvalds wrote:
>
> In article <[email protected]>,
> Robert Cohen <[email protected]> wrote:
> >
> >Factor 3: the performance problems only happens for I/O that is due to
> >network traffic, not I/O that was generated locally. I realise this is
> >extremely strange and I have no idea how it knows that I/O is due to
> >network traffic let alone why it cares. But I can assure you that it
> >does make a difference.
>
> I'll bet you $5 AUD that this happens because you don't block your
> output into nicely aligned chunks.
>
> When you have an existing file, and you write 1500 bytes to the middle
> of it, performance will degrade _horribly_ compared to the case of
> writing a full block, or writing to a position that hasn't been written
> yet.
>
>
> Now, when you read from the network, you will NOT get reads that are a
> nice multiple of BUFSIZE, you'll get reads that are a multiple of the
> packet data load (~1460 bytes on TCP over ethernet), and you'll end up
> doing unaligned writes that require a read-modify-write cycle and thus
> end up doing twice as much IO.
>
> And not only does it do twice as much IO (and potentially more with
> read-ahead), the read will obviously be _synchronous_, so the slowdown
> is more than twice as much.
>
> In contrast, when the source is a local file (or a pipe that ends up
> chunking stuff up in 4kB chunks instead of 1500-byte packets), you'll
> have nice write patterns that fill the whole buffer and make the read
> unnecessary. Which gets you nice streaming writes to disk.
>
> With most fast disks, this is not unlikely to be performance difference
> on the order of a magnitude.
>
> And there is _nothing_ the kernel can do about it. Your benchmark is
> bad, and has different behaviour depending on the source.
>
>


This is almost certainly correct; I will be modifying the benchmark to
use aligned writes.

However, I was curious about the magnitude of the impact of misaligned
writes. I have been seeing performance differences of about a factor of 5.

I have written a trivial test program to explore the issue, which just
writes and then rewrites a file with a given buffer size. By using an
odd buffer size we get misaligned writes. You have to use it on files
that are bigger than memory so that the file will not still be in the
page cache during the rewrite.
The source of the program is at
http://tltsu.anu.edu.au/~robert/aligntest.c
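
In outline, aligntest does something like the following (only a sketch;
the real source is at the URL above, and the names, timing code and
output details here are illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static double now(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(int argc, char **argv)
{
        long megs, total, done;
        int bufsize, pass, fd;
        char *buf;

        if (argc < 3)
                return 1;
        megs = atoi(argv[1]);                /* file size in megabytes */
        bufsize = atoi(argv[2]);             /* an odd size gives misaligned writes */
        total = megs * 1024 * 1024;
        buf = malloc(bufsize);
        memset(buf, 'x', bufsize);
        fd = open("aligntest.dat", O_RDWR | O_CREAT, 0644);

        printf("writing to file of size %ld Megs with buffers of %d bytes\n",
               megs, bufsize);
        for (pass = 0; pass < 2; pass++) {   /* pass 0: write, pass 1: rewrite */
                double t0 = now(), secs;

                lseek(fd, 0, SEEK_SET);      /* rewrite in place, no truncation */
                for (done = 0; done < total; done += bufsize)
                        write(fd, buf, bufsize);
                fsync(fd);
                secs = now() - t0;
                printf("%s elapsed time=%.2f seconds, %s_speed=%.2f\n",
                       pass ? "rewrite" : "write", secs,
                       pass ? "rewrite" : "write", megs / secs);
        }
        close(fd);
        free(buf);
        return 0;
}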

Here are some results under Linux.

Here's a baseline run with aligned buffers:

writing to file of size 300 Megs with buffers of 8192 bytes
write elapsed time=41.00 seconds, write_speed=7.32
rewrite elapsed time=38.26 seconds, rewrite_speed=7.84


As expected there is no penalty for rewrite.

Here's a run with misaligned buffers:

writing to file of size 300 Megs with buffers of 5000 bytes
write elapsed time=37.55 seconds, write_speed=7.99
rewrite elapsed time=112.75 seconds, rewrite_speed=2.66


There is a bit more than a factor of 2 between write and rewrite speed.
Fair enough: if you do stupid things, you pay the penalty.

However, look what happens if I run 5 copies at once.

writing to file of size 60 Megs with buffers of 5000 bytes
writing to file of size 60 Megs with buffers of 5000 bytes
writing to file of size 60 Megs with buffers of 5000 bytes
writing to file of size 60 Megs with buffers of 5000 bytes
writing to file of size 60 Megs with buffers of 5000 bytes
write elapsed time=33.96 seconds, write_speed=1.77
write elapsed time=37.43 seconds, write_speed=1.60
write elapsed time=37.74 seconds, write_speed=1.59
write elapsed time=37.93 seconds, write_speed=1.58
write elapsed time=40.74 seconds, write_speed=1.47
rewrite elapsed time=512.44 seconds, rewrite_speed=0.12
rewrite elapsed time=518.59 seconds, rewrite_speed=0.12
rewrite elapsed time=518.05 seconds, rewrite_speed=0.12
rewrite elapsed time=518.96 seconds, rewrite_speed=0.12
rewrite elapsed time=517.08 seconds, rewrite_speed=0.12


Here we see a factor of about 15 between write speed and rewrite speed.
That seems a little extreme.
From the amount of seeking happening, I believe that all the reads are
being done as separate single-page reads. Surely there should be some
readahead happening.


I tested the same program under Solaris and I get about a factor of 2
difference regardless of whether it's one copy or 5 copies.

I believe that this is an odd situation and, sure, it only happens for
badly written programs. I can see that it would be stupid to optimise for
this situation. But do we really need to do this badly in this case?


--
Robert Cohen
Unix Support
TLTSU
Australian National University
Ph: 612 58389

2001-10-18 08:30:58

by James A Sutherland

Subject: Re: [Bench] New benchmark showing fileserver problem in 2.4.12

On Wed, 17 Oct 2001, Leo Mauro wrote:

> Small fix to Linus's sample code
>
> unsigned int so_far = 0;
> for (;;) {
> int bytes = read(in, buf+so_far, BUFSIZE-so_far);
> if (bytes <= 0)
> break;
> so_far += bytes;
> if (so_far < BUFSIZE)
> continue;
> write(out, buf, BUFSIZE);
> - so_far = 0;
> + so_far -= BUFSIZE;
> }
> if (so_far)
> write(out, buf, so_far);
>
> to avoid losing data.

Checking the return from write() for errors might be a nice idea too;
otherwise you carry on reading, and trying to append, even if the target
device is full (or whatever).


James.
--
"Our attitude with TCP/IP is, `Hey, we'll do it, but don't make a big
system, because we can't fix it if it breaks -- nobody can.'"

"TCP/IP is OK if you've got a little informal club, and it doesn't make
any difference if it takes a while to fix it."
-- Ken Olson, in Digital News, 1988

2001-10-18 22:23:27

by Roger Larsson

Subject: Re: [Bench] New benchmark showing fileserver problem in 2.4.12

On Thursday 18 October 2001 04:01, Leo Mauro wrote:
> Small fix to Linus's sample code
>
> unsigned int so_far = 0;
> for (;;) {
> int bytes = read(in, buf+so_far, BUFSIZE-so_far);
> if (bytes <= 0)
> break;
> so_far += bytes;
> if (so_far < BUFSIZE)
> continue;
> write(out, buf, BUFSIZE);
> - so_far = 0;
> + so_far -= BUFSIZE;
> }
> if (so_far)
> write(out, buf, so_far);
>
> to avoid losing data.

You too...

I was close to pressing the send button but noticed the "BUFSIZE-so_far"
in the read call, just in time(TM).

If it had not been there, you would have needed to copy data from the
end of buf (from above BUFSIZE) to the beginning of buf too...
(the required size of buf would have been 2*BUFSIZE)

/RogerL

--
Roger Larsson
Skellefteå
Sweden

2001-10-19 02:53:50

by George Greer

Subject: Re: [Bench] New benchmark showing fileserver problem in 2.4.12

On Thu, 18 Oct 2001, Roger Larsson wrote:

>On Thursday 18 October 2001 04:01, Leo Mauro wrote:
>> Small fix to Linus's sample code
>>
>> unsigned int so_far = 0;
>> for (;;) {
>> int bytes = read(in, buf+so_far, BUFSIZE-so_far);
>> if (bytes <= 0)
>> break;
>> so_far += bytes;
>> if (so_far < BUFSIZE)
>> continue;
>> write(out, buf, BUFSIZE);
>> - so_far = 0;
>> + so_far -= BUFSIZE;
>> }
>> if (so_far)
>> write(out, buf, so_far);
>>
>> to avoid losing data.
>
>I was close to press the send button but noticed the "BUFSIZE-so_far"
>in the read call, just in time(TM).
>
>If it had not been there you would have needed to copy data from the
>end of buf (from above BUFSIZE) to the beginning of buf too...
>(the required size of buf would have been 2*BUFSIZE)

Since you only ever have BUFSIZE bytes when you write, aren't:

so_far -= BUFSIZE;

and

so_far = 0;

identical? I'd say the assignment to 0 would be faster.

--
George Greer, [email protected]
http://www.m-l.org/~greerga/

2001-10-19 06:13:55

by Roger Larsson

Subject: Re: [Bench] New benchmark showing fileserver problem in 2.4.12

On Friday 19 October 2001 04:53, you wrote:
> On Thu, 18 Oct 2001, Roger Larsson wrote:
> >On Thursday 18 October 2001 04:01, Leo Mauro wrote:
> >> Small fix to Linus's sample code
> >>
> >> unsigned int so_far = 0;
> >> for (;;) {
> >> int bytes = read(in, buf+so_far, BUFSIZE-so_far);
> >> if (bytes <= 0)
> >> break;
> >> so_far += bytes;
> >> if (so_far < BUFSIZE)
> >> continue;
> >> write(out, buf, BUFSIZE);
> >> - so_far = 0;
> >> + so_far -= BUFSIZE;
> >> }
> >> if (so_far)
> >> write(out, buf, so_far);
> >>
> >> to avoid losing data.
> >
> >I was close to press the send button but noticed the "BUFSIZE-so_far"
> >in the read call, just in time(TM).
> >
> >If it had not been there you would have needed to copy data from the
> >end of buf (from above BUFSIZE) to the beginning of buf too...
> >(the required size of buf would have been 2*BUFSIZE)
>
> Since you only ever have BUFSIZE bytes when you write, aren't:
>
> so_far -= BUFSIZE;
>
> and
>
> so_far = 0;
>
> identical? I'd say the assignment to 0 would be faster.

I was not specific enough. I intended to say that Linus's code was OK,
and that if so_far -= BUFSIZE ever came to something different from
zero, you would need to move the read bytes too...

This code, not using continue, is probably easier to read...
(+ error checking...)

unsigned int so_far = 0;
for (;;) {
        int bytes = read(in, buf+so_far, BUFSIZE-so_far);
        if (bytes <= 0)
                break;
        so_far += bytes;
        if (so_far == BUFSIZE) {
                write(out, buf, BUFSIZE);
                so_far = 0;
        }
}
if (so_far)
        write(out, buf, so_far);


/RogerL

--
Roger Larsson
Skellefteå
Sweden