Date: Sat, 16 Jan 2010 17:27:21 +0100
Message-ID: <4e5e476b1001160827n6dc73b35vee8b46e541134c2@mail.gmail.com>
In-Reply-To: <4e5e476b1001041028v1f204834r1fa97e732a094210@mail.gmail.com>
References: <1262250960.1819.68.camel@localhost>
	 <4e5e476b0912310234mf9ccaadm771c637a3d107d18@mail.gmail.com>
	 <1262340730.19773.47.camel@localhost>
	 <4e5e476b1001010832o24f6a0efudbfc36598bfc7c5e@mail.gmail.com>
	 <1262435612.19773.80.camel@localhost>
	 <4e5e476b1001021052u51a90a91qb2fbb4089498a3ca@mail.gmail.com>
	 <1262593090.29897.14.camel@localhost>
	 <4e5e476b1001041028v1f204834r1fa97e732a094210@mail.gmail.com>
Subject: Re: fio mmap randread 64k more than 40% regression with 2.6.33-rc1
From: Corrado Zoccolo
To: "Zhang, Yanmin", "jmoyer@redhat.com"
Cc: Jens Axboe, Shaohua Li, LKML

Hi Yanmin,
On Mon, Jan 4, 2010 at 7:28 PM, Corrado Zoccolo wrote:
> Hi Yanmin,
> On Mon, Jan 4, 2010 at 9:18 AM, Zhang, Yanmin wrote:
>> On Sat, 2010-01-02 at 19:52 +0100, Corrado Zoccolo wrote:
>>> Hi
>>> On Sat, Jan 2, 2010 at 1:33 PM, Zhang, Yanmin wrote:
>>> > On Fri, 2010-01-01 at 17:32 +0100, Corrado Zoccolo wrote:
>>> >> Hi Yanmin,
>>> >> On Fri, Jan 1, 2010 at 11:12 AM, Zhang, Yanmin wrote:
>>> >> > On Thu, 2009-12-31 at 11:34 +0100, Corrado Zoccolo wrote:
>>> >> >> Hi Yanmin,
>>> >> >> On Thu, Dec 31, 2009 at 10:16 AM, Zhang, Yanmin wrote:
>>> >> >> > Compared with kernel 2.6.32, fio mmap randread 64k has a more than
>>> >> >> > 40% regression with 2.6.33-rc1.
>>> >> >>
>>> >> > Thanks for your timely reply. Some comments inlined below.
>>> >> >
>>> >> >> Can you compare the performance also with 2.6.31?
>>> >> > We did. We run the Linux Kernel Performance Tracking project and run
>>> >> > many benchmarks when an RC kernel is released.
>>> >> >
>>> >> > The result of 2.6.31 is quite similar to that of 2.6.32, but 2.6.30 is
>>> >> > about 8% better than 2.6.31.
>>> >> >
>>> >> >> I think I understand what causes your problem.
>>> >> >> 2.6.32, with default settings, handled even random readers as
>>> >> >> sequential ones to provide fairness. This has benefits on single disks
>>> >> >> and JBODs, but causes harm on raids.
>>> >> > I didn't test RAID, as that machine with the hardware RAID HBA has
>>> >> > crashed. But if we turn on hardware RAID in the HBA, we mostly use the
>>> >> > noop I/O scheduler.
>>> >> I think you should start testing cfq with them, too. From 2.6.33, we
>>> >> have some big improvements in this area.
>>> > Great! I once compared cfq and noop against non-raid and raid0.
>>> > One interesting finding from sequential read testing is that when there
>>> > are fewer processes reading files on the raid0 JBOD, noop on raid0 is
>>> > pretty good, but when there are lots of processes doing so on a non-raid
>>> > JBOD, cfq is much better. I planned to investigate it, but have been too
>>> > busy with other issues.
>>> >
>>> >> >
>>> >> >> For 2.6.33, we changed the way in which this is handled, restoring
>>> >> >> enable_idle = 0 for seeky queues as it was in 2.6.31:
>>> >> >> @@ -2218,13 +2352,10 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>>> >> >>        enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>>> >> >>
>>> >> >>        if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>>> >> >> -           (!cfqd->cfq_latency && cfqd->hw_tag && CFQQ_SEEKY(cfqq)))
>>> >> >> +           (sample_valid(cfqq->seek_samples) && CFQQ_SEEKY(cfqq)))
>>> >> >>                enable_idle = 0;
>>> >> >> (compare with 2.6.31:
>>> >> >>         if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>>> >> >>             (cfqd->hw_tag && CIC_SEEKY(cic)))
>>> >> >>                 enable_idle = 0;
>>> >> >> excluding the sample_valid check, it should be equivalent for you (I
>>> >> >> assume you have NCQ disks)),
>>> >> >> and we provide fairness for them by servicing all seeky queues
>>> >> >> together, and then idling before switching to other ones.
>>> >> > As for the function cfq_update_idle_window, you are right. But since
>>> >> > 2.6.32, CFQ has merged many patches, and those patches affect each other.
>>> >> >
>>> >> >>
>>> >> >> The mmap 64k randreader will have a large seek_mean, resulting in
>>> >> >> being marked seeky, but will send 16 * 4k sequential requests one
>>> >> >> after the other, so alternating between those seeky queues will cause
>>> >> >> harm.
>>> >> >>
>>> >> >> I'm working on a new way to compute seekiness of queues, which should
>>> >> >> fix your issue, correctly identifying those queues as non-seeky (for
>>> >> >> me, a queue should be considered seeky only if it submits more than 1
>>> >> >> seeky request per 8 sequential ones).
>>> >> >>
>>> >> >> >
>>> >> >> > The test scenario: 1 JBOD has 12 disks and every disk has 2 partitions.
>>> >> >> > Create 8 1-GB files per partition and start 8 processes to do random
>>> >> >> > reads on the 8 files per partition. There are 8*24 processes in total.
>>> >> >> > The randread block size is 64K.
>>> >> >> >
>>> >> >> > We found the regression on 2 machines. One machine has 8GB memory
>>> >> >> > and the other has 6GB.
>>> >> >> >
>>> >> >> > Bisection is very unstable. Several patches are related, not just one.
>>> >> >> >
>>> >> >> >
>>> >> >> > 1) commit 8e550632cccae34e265cb066691945515eaa7fb5
>>> >> >> > Author: Corrado Zoccolo
>>> >> >> > Date:   Thu Nov 26 10:02:58 2009 +0100
>>> >> >> >
>>> >> >> >    cfq-iosched: fix corner cases in idling logic
>>> >> >> >
>>> >> >> >
>>> >> >> > This patch introduces slightly less than a 20% regression. I just
>>> >> >> > reverted the section below and this part of the regression disappears.
>>> >> >> > It shows this regression is stable and not impacted by other patches.
>>> >> >> >
>>> >> >> > @@ -1253,9 +1254,9 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
>>> >> >> >                return;
>>> >> >> >
>>> >> >> >        /*
>>> >> >> > -        * still requests with the driver, don't idle
>>> >> >> > +        * still active requests from this queue, don't idle
>>> >> >> >         */
>>> >> >> > -       if (rq_in_driver(cfqd))
>>> >> >> > +       if (cfqq->dispatched)
>>> >> >> >                return;
>>> >> > Although 5 patches are related to the regression, the above line is quite
>>> >> > independent. Reverting it consistently improves the result by about 20%.
>>> >> I've looked at your fio script, and it is quite complex,
>>> > As we have about 40 fio sub-cases, we have a script to create the fio job
>>> > file from a specific parameter list. So there are some superfluous parameters.
>>> >
>>> My point is that there is so much going on that it is more difficult to
>>> analyse the issues.
>>> I prefer looking at one problem at a time, so (initially) removing the
>>> possibility of queue merging, which Shaohua already investigated, can
>>> help in spotting the still not-well-understood problem.
>> Sounds reasonable.
>>
>>> Could you generate the same script, but with each process accessing
>>> only one of the files, instead of choosing it at random?
>> Ok. The new test starts 8 processes per partition and every process works
>> on just one file.
> Great, thanks.
>>
>>>
>>> > Another point is that we need stable results.
>>> >
>>> >> with a lot of
>>> >> things going on.
>>> >> Let's keep this for last.
>>> > Ok. But changes like yours mostly reduce the regression.
>>> >
>>> >> I've created a smaller test that already shows some regression:
>>> >> [global]
>>> >> direct=0
>>> >> ioengine=mmap
>>> >> size=8G
>>> >> bs=64k
>>> >> numjobs=1
>>> >> loops=5
>>> >> runtime=60
>>> >> #group_reporting
>>> >> invalidate=0
>>> >> directory=/media/hd/cfq-tests
>>> >>
>>> >> [job0]
>>> >> startdelay=0
>>> >> rw=randread
>>> >> filename=testfile1
>>> >>
>>> >> [job1]
>>> >> startdelay=0
>>> >> rw=randread
>>> >> filename=testfile2
>>> >>
>>> >> [job2]
>>> >> startdelay=0
>>> >> rw=randread
>>> >> filename=testfile3
>>> >>
>>> >> [job3]
>>> >> startdelay=0
>>> >> rw=randread
>>> >> filename=testfile4
>>> >>
>>> >> The attached patches, in particular 0005 (which applies on top of the
>>> >> for-linus branch of Jens' tree,
>>> >> git://git.kernel.dk/linux-2.6-block.git), fix the regression on this
>>> >> simplified workload.
>>> > I didn't download the tree. I tested the 3 attached patches against
>>> > 2.6.33-rc1. The regression isn't resolved.
>>> Can you quantify if there is an improvement, though?
>>
>> Ok. Because of company policy, I can only post percentages instead of raw numbers.
> Sure, that is fine.
>>
>>> Please, also include Shaohua's patches.
>>> I'd like to see the comparison between (always with low_latency set to 0):
>>> plain 2.6.33
>>> plain 2.6.33 + Shaohua's
>>> plain 2.6.33 + Shaohua's + my patch
>>> plain 2.6.33 + Shaohua's + my patch + rq_in_driver vs dispatched patch.
>>
>> 1) low_latency=0
>> 2.6.32 kernel                                   0
>> 2.6.33-rc1                                      -0.33
>> 2.6.33-rc1_shaohua                              -0.33
>> 2.6.33-rc1+corrado                              0.03
>> 2.6.33-rc1_corrado+shaohua                      0.02
>> 2.6.33-rc1_corrado+shaohua+rq_in_driver         0.01
>>
> So my patch fixes the situation for low_latency = 0, as I expected.
> I'll send it to Jens with a proper changelog.
>
>> 2) low_latency=1
>> 2.6.32 kernel                                   0
>> 2.6.33-rc1                                      -0.45
>> 2.6.33-rc1+corrado                              -0.24
>> 2.6.33-rc1_corrado+shaohua                      -0.23
>> 2.6.33-rc1_corrado+shaohua+rq_in_driver         -0.23
> The results are as expected. With each process working on a separate
> file, Shaohua's patches do not influence the result appreciably.
> Interestingly, even rq_in_driver doesn't improve things in this case, so
> maybe its effect is somehow connected to queue merging.
> The remaining -23% is due to timeslice shrinking, which is done to
> reduce max latency when there are too many processes doing I/O, at the
> expense of throughput. It is a documented change, and the suggested
> approach if you favor throughput over latency is to set low_latency = 0.
>
>>
>>
>> When low_latency=1, we get the biggest number with kernel 2.6.32.
>> Compared with the low_latency=0 result, the former is about 4% better.
> Ok, so 2.6.33 + corrado (with low_latency = 0) is comparable with the
> fastest 2.6.32, so we can consider the first part of the problem
> solved.
> I think we can return now to your full script with queue merging.

I'm wondering if (in arm_slice_timer):
-       if (cfqq->dispatched)
+       if (cfqq->dispatched || (cfqq->new_cfqq && rq_in_driver(cfqd)))
                return;
gives the same improvement you were seeing by just reverting to rq_in_driver.

We saw that cfqq->dispatched worked fine when there was no queue
merging happening, so it must be something concerning merging:
probably dispatched is not accurate when we have set up for a merge,
but the merge has not yet been done.

Thanks,
Corrado