Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751533AbdHAHhe (ORCPT ); Tue, 1 Aug 2017 03:37:34 -0400
Received: from [140.206.112.106] ([140.206.112.106]:14484 "EHLO
        mail2012.asrmicro.com" rhost-flags-FAIL-FAIL-OK-FAIL) by vger.kernel.org
        with ESMTP id S1751012AbdHAHhc (ORCPT ); Tue, 1 Aug 2017 03:37:32 -0400
Subject: Re: [Question]: try to fix contention between expire_timers and
        try_to_del_timer_sync
From: qiaozhou
To: Vikram Mulukutla, Will Deacon
CC: Thomas Gleixner, John Stultz, LKML, Wang Wilbur, Marc Zyngier,
        Peter Zijlstra, Zhou Qiao
References: <3d2459c7-defd-a47e-6cea-007c10cecaac@asrmicro.com>
        <20170728092831.GA24839@arm.com>
        <2aa9684cf9c889ee9fdc8550b4388af6@codeaurora.org>
        <1c951f01-c450-d21b-13e9-6a32eb509d94@asrmicro.com>
Date: Tue, 1 Aug 2017 15:37:01 +0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
        Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: <1c951f01-c450-d21b-13e9-6a32eb509d94@asrmicro.com>
Content-Type: text/plain; charset="utf-8"; format=flowed
Content-Language: en-US
Content-Transfer-Encoding: 8bit
X-Originating-IP: [10.1.50.16]
X-ClientProxiedBy: mail2012.asrmicro.com (10.1.24.123) To
        mail2012.asrmicro.com (10.1.24.123)
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 9663
Lines: 205

On 2017-07-31 19:20, qiaozhou wrote:
>
>
> On 2017-07-29 03:09, Vikram Mulukutla wrote:
>> On 2017-07-28 02:28, Will Deacon wrote:
>>> On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
>>
>>
>>>
>>> Does bodging cpu_relax to back-off to wfe after a while help? The event
>>> stream will wake it up if nothing else does. Nasty patch below, but
>>> I'd be interested to know whether or not it helps.
>>>
>>> Will
>>>
> The patch also helps a lot on my platform. (Though it does cause
> deadlock(related with udelay) in uart driver in early boot, and not sure
> it's uart driver issue.
> Just workaround it firstly)
>
> Platform: 4 a53(832MHz) + 4 a73(1.8GHz)
> Test condition #1:
> a. core2: a53, while loop (spinlock, spin_unlock)
> b. core7: a73, while loop (spinlock, spin_unlock, cpu_relax)
>
> Test result: recording the lock acquire times(a53, a73), max lock
> acquired time(a53), in 20 seconds
>
> Without cpu_relax bodging patch:
> ===============================================================
> |a53 locked times | a73 locked times | a53 max locked time(us)|
> ==================|==================|========================|
>                182|          38371616|               1,951,954|
>                202|          38427652|               2,261,319|
>                210|          38477427|              15,309,597|
>                207|          38494479|               6,656,453|
>                220|          38422283|               2,064,155|
> ===============================================================
>
> With cpu_relax bodging patch:
> ===============================================================
> |a53 locked times | a73 locked times | a53 max locked time(us)|
> ==================|==================|========================|
>            1849898|          37799379|                 131,255|
>            1574172|          38557653|                  38,410|
>            1924777|          37831725|                  42,999|
>            1477665|          38723741|                  52,087|
>            1865793|          38007741|                 783,965|
> ===============================================================
>
> Also add some workload to the whole system to check the result.
> Test condition #2: based on #1
> c. core6: a73, 1.8GHz, run "while(1);" loop
>
> With cpu_relax bodging patch:
> ===============================================================
> |a53 locked times | a73 locked times | a53 max locked time(us)|
> ==================|==================|========================|
>                 20|          42563981|               2,317,070|
>                 10|          42652793|               4,210,944|
>                  9|          42651075|               5,691,834|
>                 28|          42652591|               4,539,555|
>                 10|          42652801|               5,850,639|
> ===============================================================
>
> Also hotplug out other cores.
> Test condition #3: based on #1
> d.
>    hotplug out core1/3/4/5/6, keep core0 for scheduling
>
> With cpu_relax bodging patch:
> ===============================================================
> |a53 locked times | a73 locked times | a53 max locked time(us)|
> ==================|==================|========================|
>                447|          42652450|                 309,549|
>                515|          42650382|                 337,661|
>                415|          42646669|                 628,525|
>                431|          42651137|                 365,862|
>                464|          42648916|                 379,934|
> ===============================================================
>
> The last two tests are the actual cases where the hard-lockup is
> triggered on my platform. So I gathered some data, and it shows that a53
> needs much longer time to acquire the lock.
>
> All tests are done in android, black screen with USB cable attached. The
> data is not so pretty as Vikram's. It might be related with cpu
> topology, core numbers, CCI frequency etc. (I'll do another test with
> both a53 and a73 running at 1.2GHz, to check whether it's the core
> frequency which leads to the major difference.)
>

I also tested the contention with the a53 and a73 cores running at the
same frequency.

Platform: 4 a53(1248MHz) + 4 a73(1248MHz)
Test condition #4:
a. core2: a53, while loop (spinlock, spin_unlock)
b. core7: a73, while loop (spinlock, spin_unlock)

===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
          12945632|          13021576|                      14|
          12934181|          13059230|                      16|
          12987186|          13059016|                      49|
          12958583|          13038884|                      24|
          14637546|          14672522|                      14|
===============================================================

The lock acquisition counts are almost the same, and the max time to
acquire the lock on the a53 also drops sharply. On my platform, core
frequency seems to be the key factor.

>> This does seem to help. Here's some data after 5 runs with and without
>> the patch.
>>
>> time = max time taken to acquire lock
>> counter = number of times lock acquired
>>
>> cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
>> Without the cpu_relax() bodging patch:
>> =====================================================
>> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
>> ==========|==============|===========|==============|
>>   117893us|       2349144|        2us|       6748236|
>>   571260us|       2125651|        2us|       7643264|
>>    19780us|       2392770|        2us|       5987203|
>>    19948us|       2395413|        2us|       5977286|
>>    19822us|       2429619|        2us|       5768252|
>>    19888us|       2444940|        2us|       5675657|
>> =====================================================
>>
>> cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
>> With the cpu_relax() bodging patch:
>> =====================================================
>> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
>> ==========|==============|===========|==============|
>>        3us|       2737438|        2us|       6907147|
>>        2us|       2742478|        2us|       6902241|
>>      132us|       2745636|        2us|       6876485|
>>        3us|       2744554|        2us|       6898048|
>>        3us|       2741391|        2us|       6882901|
>> =====================================================
>
>> The patch also seems to have helped with fairness in general
>> allowing more work to be done if the CPU frequencies are more
>> closely matched (I don't know if this translates to real world
>> performance - probably not). The counter values are higher
>> with the patch.
>>
>> time = max time taken to acquire lock
>> counter = number of times lock acquired
>>
>> cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
>> Without the cpu_relax() bodging patch:
>> =====================================================
>> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
>> ==========|==============|===========|==============|
>>        2us|       5240654|        1us|       5339009|
>>        2us|       5287797|       97us|       5327073|
>>        2us|       5237634|        1us|       5334694|
>>        2us|       5236676|       88us|       5333582|
>>       84us|       5285880|       84us|       5329489|
>> =====================================================
>>
>> cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
>> With the cpu_relax() bodging patch:
>> =====================================================
>> cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
>> ==========|==============|===========|==============|
>>      140us|      10449121|        1us|      11154596|
>>        1us|      10757081|        1us|      11479395|
>>       83us|      10237109|        1us|      10902557|
>>        2us|       9871101|        1us|      10514313|
>>        2us|       9758763|        1us|      10391849|
>> =====================================================
>>

I also applied Vikram's patch and ran the test.

cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
Without cpu_relax bodging patch:
=====================================================
cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
==========|==============|===========|==============|
     16505|          5243|          2|      12487322|
     16494|          5619|          1|      12013291|
     16498|          5276|          2|      11706824|
     16494|          7123|          1|      12532355|
     16470|          7208|          2|      11784617|
=====================================================

cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
With cpu_relax bodging patch:
=====================================================
cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
==========|==============|===========|==============|
      3991|        140714|          1|      11430528|
      4018|        144371|          1|      11430528|
      4034|        143250|          1|      11427011|
      4330|        147345|          1|      11423583|
      4752|        138273|          1|      11433241|
=====================================================

It brings some improvement, but the results are not as good as Vikram's data.
The big core still has a much better chance of acquiring the lock.

>>
>> Thanks,
>> Vikram
>>
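
To compare the runs at a glance, fairness can be expressed as the ratio of
a73 to a53 lock acquisitions. Below is a minimal Python sketch that sums the
"locked times" columns from test condition #1 (with and without the
cpu_relax bodging patch) and from test condition #4. The names `runs` and
`acquire_ratio` are only illustrative; they are not part of any test module
in this thread.

```python
# Lock acquisition counts copied from the tables in this mail.
# Each label maps to (a53 counts, a73 counts), five runs each.
runs = {
    "#1 without patch": (
        [182, 202, 210, 207, 220],
        [38371616, 38427652, 38477427, 38494479, 38422283],
    ),
    "#1 with patch": (
        [1849898, 1574172, 1924777, 1477665, 1865793],
        [37799379, 38557653, 37831725, 38723741, 38007741],
    ),
    "#4 same frequency": (
        [12945632, 12934181, 12987186, 12958583, 14637546],
        [13021576, 13059230, 13059016, 13038884, 14672522],
    ),
}


def acquire_ratio(little, big):
    """Total big-core acquisitions divided by total little-core acquisitions."""
    return sum(big) / sum(little)


# Hypothetical summary: one fairness ratio per test condition.
ratios = {label: acquire_ratio(a53, a73) for label, (a53, a73) in runs.items()}

for label, ratio in ratios.items():
    print(f"{label}: a73/a53 = {ratio:.2f}")
```

A ratio near 1 means both cores take the lock about equally often. On these
numbers the ratio falls by several orders of magnitude with the patch, and it
is close to 1 when both clusters run at 1248MHz.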