From: "M. Edward Borasky"
To: "Robert Cohen", linux-kernel@vger.kernel.org
Subject: RE: [Bench] New benchmark showing fileserver problem in 2.4.12
Date: Wed, 17 Oct 2001 08:12:44 -0700
In-Reply-To: <3BCD8269.B4E003E5@anu.edu.au>

Have you looked at CPU utilization? Is it abnormally high when the
system slows down?

--
M. Edward (Ed) Borasky, Chief Scientist, Borasky Research
http://www.borasky-research.net
mailto:znmeb@borasky-research.net
http://groups.yahoo.com/group/BoraskyResearchJournal

Q: How do you tell when a pineapple is ready to eat?
A: It picks up its knife and fork.

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org
> [mailto:linux-kernel-owner@vger.kernel.org] On Behalf Of Robert Cohen
> Sent: Wednesday, October 17, 2001 6:07 AM
> To: linux-kernel@vger.kernel.org
> Subject: Re: [Bench] New benchmark showing fileserver problem in 2.4.12
>
> I have had a chance to do some more testing with the test program I
> posted yesterday. I have been able to try various combinations of
> parameters and variations of the programs.
>
> I now have a pretty good idea of what specific activities will see the
> performance problems I was seeing. But since I'm not a kernel guru, I
> have no idea as to why the problem exists or how to fix it.
>
> I am interested in reports from people who can run the test. I would
> like to confirm my findings (or simply confirm that I'm crazy :-).
>
> The problems appear to only happen in a very specific set of
> circumstances. It's an incredible coincidence that my original
> lantest/netatalk testing happened to hit that specific combination of
> factors. So it looks like I haven't actually found a generic
> performance problem with Linux as such. But I would still like to get
> to the bottom of this.
>
> The factors that cause these problems probably won't occur very often
> in real usage, but they are things that are not obviously silly. So it
> does indicate a problem with some dark corner of the Linux kernel that
> probably should be investigated.
>
> I have identified 4 specific factors that contribute to the problem.
> All 4 have to be present before there is a performance problem.
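>
> For reference, the core of the "receive" side of the test is
> essentially the loop below. This is only a simplified sketch, not the
> exact program I posted: the real one takes the size in megabytes and
> the number of passes as arguments, uses an 8k buffer set by a define,
> and normally deletes its file after the run. The file name and error
> handling here are placeholders.
>
>   /* receive: read data from stdin and (re)write it into a file in place */
>   #include <stdlib.h>
>   #include <unistd.h>
>   #include <fcntl.h>
>
>   #define BUFSIZE 8192                         /* the 8k write size */
>
>   int main(int argc, char **argv)
>   {
>       static char buf[BUFSIZE];
>       long megs, passes, nbufs, p, i;
>       int fd;
>
>       if (argc < 3)
>           return 1;
>       megs = atol(argv[1]);                    /* file size in megabytes */
>       passes = atol(argv[2]);                  /* how many times to write it */
>       nbufs = megs * 1024L * 1024L / BUFSIZE;
>
>       fd = open("testfile", O_RDWR | O_CREAT, 0644);  /* note: no O_TRUNC */
>       if (fd < 0)
>           return 1;
>       for (p = 0; p < passes; p++) {
>           lseek(fd, 0, SEEK_SET);              /* pass 2 and later rewrite in place */
>           for (i = 0; i < nbufs; i++) {
>               if (read(0, buf, BUFSIZE) <= 0)  /* stdin: local pipe or rsh */
>                   break;                       /* (short reads glossed over) */
>               write(fd, buf, BUFSIZE);
>           }
>           fsync(fd);                           /* force the dirty pages to disk */
>       }
>       close(fd);
>       return 0;
>   }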
>
>
> Summary of the factors involved
> ===============================
>
> Factor 1: the performance problems only occur when you are rewriting
> an existing file in place. That is, writing to an existing file which
> is opened without O_TRUNC. Equivalently, if you have written a file
> and then seeked back to the beginning and started writing again. I
> admit this is something that not many real programs (except databases)
> do. But it still shouldn't cause any problems.
>
> Factor 2: the performance problems only occur when the part of the
> file you are rewriting is not already present in the page cache. This
> tends to happen when you are doing I/O to files larger than memory, or
> if you are rewriting an existing file which has just been opened.
>
> Factor 3: the performance problems only happen for I/O that is due to
> network traffic, not I/O that was generated locally. I realise this is
> extremely strange and I have no idea how it knows that I/O is due to
> network traffic, let alone why it cares. But I can assure you that it
> does make a difference.
>
> Factor 4: the performance problem is only evident with small writes,
> e.g. write calls with an 8k buffer. Actually, the performance hit is
> there with larger writes, just not significant enough to be an issue.
> It's tempting to say "well, just use larger buffers". But this isn't
> always possible, and anyway, 8k buffers should still work adequately,
> just not optimally.
>
>
> Experimental evidence
> =====================
>
> Factor 1: the performance problems only occur when you are rewriting
> an existing file in place. That is, writing to an existing file which
> is opened without O_TRUNC. Equivalently, if you have written a file
> and then seeked back to the beginning and started writing again.
>
> Evidence: in the report I posted yesterday, the test I was using
> involved 5 clients rewriting 30 Meg files on a 128 Meg machine. The
> symptom was that after about 10 seconds, the throughput as shown by
> vmstat "bo" drops sharply and we start getting reads occurring, as
> shown by the "bi" figure. However, with that test the page cache fills
> up after 10 seconds, which is only just before the end of the files is
> reached and we start rewriting the files. So it's difficult to see
> which of those two is causing the problem. Yesterday, I attributed the
> problems to the page cache filling up, but I was apparently wrong.
>
> The new test I am using is 5 copies of
>
> ./send 200 2 | rsh server ./receive 200 2
>
> Here we have 5 clients rewriting 200 Meg files.
> With this test, the page cache fills up after about 10 seconds, but
> since we are writing a total of 1 Gig of files, the end of the files
> is not reached for 2 minutes or so. It is at this point that we start
> rewriting the files.
>
> When the page cache fills up, there is no drop in performance.
> However, when the end of the files is reached and we start to rewrite,
> the throughput drops and we get the reads happening. So the problems
> are obviously due to the rewriting of an existing file, not due to the
> page cache filling.
>
> It doesn't make any difference whether the test seeks back to the
> start to rewrite, or whether it closes the file and reopens it without
> O_TRUNC.
>
>
> Factor 2: the performance problems only occur when the part of the
> file that is being rewritten is not already present in the page cache.
>
> Evidence: I modified the "receive" test program to write to a named
> file and to not delete the file after the run, so I could rewrite
> existing files with only one pass.
>
> On a machine with 128 Megs of memory, I created 5 large test files. I
> purged these files from the page cache by writing another file larger
> than memory and then deleting it.
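>
> (Nothing fancy is needed for that purging step; a throwaway program
> along the lines of the sketch below, or an equivalent dd to a scratch
> file followed by rm, does the job. The file name and size here are
> arbitrary, as long as the file is comfortably larger than RAM.)
>
>   /* purge: evict cached data by writing a file larger than memory,
>    * syncing it and then deleting it. */
>   #include <unistd.h>
>   #include <fcntl.h>
>
>   int main(void)
>   {
>       static char buf[1024 * 1024];     /* 1 Meg of zeroes per write */
>       long megs = 256;                  /* comfortably more than 128 Megs of RAM */
>       long i;
>       int fd;
>
>       fd = open("purgefile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
>       if (fd < 0)
>           return 1;
>       for (i = 0; i < megs; i++)
>           write(fd, buf, sizeof(buf));  /* pushes older cached pages out */
>       fsync(fd);
>       close(fd);
>       unlink("purgefile");              /* throw the scratch file away */
>       return 0;
>   }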
> I did a run of 5 copies of ./send 18 1 | rsh server ./receive 18 1
> (each one on a different file).
> I did a second run of ./send 18 1 | rsh server ./receive 18 1
>
> With the first run, the files were not present in the page cache and
> the performance problems were seen. This run took about 95 seconds.
> Since the total size of the 5 files is smaller than the available page
> cache, they were all still present after the first run.
>
> The second run took about 20 seconds. So the presence of the data in
> the cache makes a significant difference.
>
> It seems natural to say "of course the cache sped things up, that's
> what caches are for". However, the cache should not have sped this
> operation up. Only writes are being done, no reads. So there is no
> reason why the presence of data in the cache which is going to be
> overwritten anyway should speed things up.
> Also, the cache shouldn't speed up writes, since the program does an
> fsync to flush the cache after writing. And even if the cache does
> speed up writes, it should have the same effect on both runs.
>
> I had originally thought the problem occurred when the page cache was
> full. I assumed it was due to the extra work needed to purge something
> from the page cache to make space for a new write. However, with this
> test I observed that the performance was bad even when the page cache
> did not fill memory and there was plenty of free memory. So it seems
> that the performance problem is purely due to rewriting something
> which is not present in the page cache. It has nothing to do with the
> amount of free memory or whether the page cache is filling memory.
>
> In this kind of test, if the collective size of the files is greater
> than the amount of memory available for page cache, then problems can
> be observed even with the second run. For example, suppose you are
> writing to 120 Megs of files and there is 100 Megs of page cache. On
> the second run, even though 100 Megs of the files are present in the
> page cache, you get no benefit, because each portion of the file will
> be flushed to make way for new writes before you get around to
> rewriting that portion. This is the standard LRU performance wall when
> the working set is bigger than available memory.
>
>
> Factor 3: the performance problems only happen for I/O that is due to
> network traffic.
>
> Evidence: the problem does occur when you have a second machine
> "rsh"ing into the Linux server.
> However, if you run the test entirely on the Linux server with any of
> the following
>
> ./send 30 10 | ./receive 30 10
> ./send 30 10 | rsh localhost ./receive 30 10
> ./send 30 10 | rsh server ./receive 30 10
>
> then the problem does not occur. Strangely, we also don't see any
> reads showing up in the vmstat output in these cases.
> It seems the page cache is able to rewrite existing files without
> doing any reading first under these conditions.
>
> This is the really strange issue. I have no idea why it would make a
> difference whether the receive program is taking its standard input
> from a local source or from an rsh over the network. Why would the
> behaviour of the page cache differ in these circumstances? If any
> gurus can clue me in, I would appreciate it.
>
>
> Factor 4: the performance problem only occurs with small writes.
>
> Evidence: the test programs I posted yesterday were doing I/O with 8k
> buffers (set by a define), because that was what the original
> benchmark I was emulating did. If I modify "receive" to use a 64k
> buffer, I get adequate throughput.
> The anomalous reads are still happening, but don't seem to impact
> performance too much. The throughput ramps smoothly between 8k and 64k
> buffers.
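>
> For completeness, the send side is roughly the mirror image of receive
> (again only a simplified sketch, not the exact program): it just pumps
> buffers out of its standard output, with its own buffer-size define.
> As noted below, changing send's buffer size mainly changes how much
> CPU send itself uses.
>
>   /* send: generate <megabytes> of data on stdout, <passes> times over.
>    * The payload contents are irrelevant to the test. */
>   #include <stdlib.h>
>   #include <unistd.h>
>
>   #define BUFSIZE 8192                /* send's buffer-size define */
>
>   int main(int argc, char **argv)
>   {
>       static char buf[BUFSIZE];       /* zero-filled payload */
>       long megs, passes, nbufs, p, i;
>
>       if (argc < 3)
>           return 1;
>       megs = atol(argv[1]);
>       passes = atol(argv[2]);
>       nbufs = megs * 1024L * 1024L / BUFSIZE;
>
>       for (p = 0; p < passes; p++)
>           for (i = 0; i < nbufs; i++)
>               write(1, buf, BUFSIZE); /* stdout feeds the pipe or rsh */
>       return 0;
>   }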
>
> One possible response is a variation on the old joke: if you
> experience problems when you do 8k writes, then don't do 8k writes.
> However, I would like to understand why we are seeing a problem with
> 8k writes. It's not as if 8k is *that* small. At worst, small writes
> should just chew CPU time, but we get lots of CPU idle time during the
> benchmark, just poor throughput. The evidence suggests some kind of
> constant overhead for each write.
>
> Modifying the buffer size in send simply reduces the amount of CPU
> that send uses, which is as you would expect. Doing this doesn't have
> much effect on the overall throughput.
>
> --
> Robert Cohen
> Unix Support
> TLTSU
> Australian National University
> Ph: 612 58389

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/