Date: Mon, 28 Aug 2017 16:12:01 -0700
From: Vikram Mulukutla
To: Will Deacon
Cc: qiaozhou, Thomas Gleixner, John Stultz, sboyd@codeaurora.org, LKML, Wang Wilbur, Marc Zyngier, Peter Zijlstra, linux-kernel-owner@vger.kernel.org, sudeep.holla@arm.com
Subject: Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync
In-Reply-To: <9f86bd426bbaede9de6d38cb047bd6fa@codeaurora.org>
References: <3d2459c7-defd-a47e-6cea-007c10cecaac@asrmicro.com> <20170728092831.GA24839@arm.com> <2aa9684cf9c889ee9fdc8550b4388af6@codeaurora.org> <20170731131321.GB1737@arm.com> <20170815184039.GE10801@arm.com> <9f86bd426bbaede9de6d38cb047bd6fa@codeaurora.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Will,

On 2017-08-25 12:48, Vikram Mulukutla wrote:
> Hi Will,
>
> On 2017-08-15 11:40, Will Deacon wrote:
>> Hi Vikram,
>>
>> On Thu, Aug 03, 2017 at 04:25:12PM -0700, Vikram Mulukutla wrote:
>>> On 2017-07-31 06:13, Will Deacon wrote:
>>> >On Fri, Jul 28, 2017 at 12:09:38PM -0700, Vikram Mulukutla wrote:
>>> >>On 2017-07-28 02:28, Will Deacon wrote:
>>> >>>On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:
>>> >>>
>>>
>>> >>This does seem to help. Here's some data after 5 runs with and without
>>> >>the patch.
>>> >
>>> >Blimey, that does seem to make a difference. Shame it's so ugly!
>>> >Would you be able to experiment with other values for
>>> >CPU_RELAX_WFE_THRESHOLD? I had it set to 10000 in the diff I posted,
>>> >but that might be higher than optimal.
>>> >It would be interested to see if it correlates with num_possible_cpus()
>>> >for the highly contended case.
>>> >
>>> >Will
>>>
>>> Sorry for the late response - I should hopefully have some more data
>>> with different thresholds before the week is finished or on Monday.
>>
>> Did you get anywhere with the threshold heuristic?
>>
>> Will
>
> Here's some data from experiments that I finally got to today. I decided
> to recompile for every value of the threshold. Was doing a binary search
> of sorts and then started reducing by orders of magnitude. There pairs
> of rows here:
>

Well here's something interesting. I tried a different platform and found
that the workaround doesn't help much at all, similar to Qiao's observation
on his b.L chipset. Something to do with the WFE implementation or
event-stream?

I modified your patch to use a __delay(1) in place of the WFEs and this was
the result (still with the 10k threshold). The worst-case lock time for
cpu0 drastically improves. Given that cpu0 re-enables interrupts between
each lock attempt in my test case, I think the lock count matters less
here.

cpu_relax() patch with WFEs (original workaround):
(pairs of rows, first row is with c0 at 300MHz, second with c0 at 1.9GHz.
Both rows have cpu4 at 2.3GHz. max time is in microseconds)
------------------------------------------------------|
c0 max time| c0 lock count| c4 max time| c4 lock count|
------------------------------------------------------|
     999843|            25|           2|      12988498| -> c0/cpu0 at 300MHz
          0|       8421132|           1|       9152979| -> c0/cpu0 at 1.9GHz
------------------------------------------------------|
     999860|           160|           2|      12963487|
          1|       8418492|           1|       9158001|
------------------------------------------------------|
     999381|           734|           2|      12988636|
          1|       8387562|           1|       9128056|
------------------------------------------------------|
     989800|           750|           3|      12996473|
          1|       8389091|           1|       9112444|
------------------------------------------------------|

cpu_relax() patch with __delay(1):
(pairs of rows, first row is with c0 at 300MHz, second with c0 at 1.9GHz.
Both rows have cpu4 at 2.3GHz. max time is in microseconds)
------------------------------------------------------|
c0 max time| c0 lock count| c4 max time| c4 lock count|
------------------------------------------------------|
       7703|          1532|           2|      13035203| -> c0/cpu0 at 300MHz
          1|       8511686|           1|       8550411| -> c0/cpu0 at 1.9GHz
------------------------------------------------------|
       7801|          1561|           2|      13040188|
          1|       8553985|           1|       8609853|
------------------------------------------------------|
       3953|          1576|           2|      13049991|
          1|       8576370|           1|       8611533|
------------------------------------------------------|
       3953|          1557|           2|      13030553|
          1|       8509020|           1|       8543883|
------------------------------------------------------|

I should also note that my earlier kernel was 4.9-stable based and the one
above was on a 4.4-stable based kernel.

Thanks,
Vikram

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project
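
To put a rough number on "drastically improves", here is a small sketch that
compares the c0-at-300MHz rows of the two tables above. The figures are
copied from those tables; the helper names (worst_case_us, mean_lock_count)
are made up for this illustration and are not part of any patch in this
thread. With only four runs per configuration, treat the result as
indicative only.

```python
# c0 rows with cpu0 at 300MHz, copied from the two tables above.
# Each tuple is (max lock time in microseconds, lock count).
wfe_runs = [(999843, 25), (999860, 160), (999381, 734), (989800, 750)]
delay_runs = [(7703, 1532), (7801, 1561), (3953, 1576), (3953, 1557)]


def worst_case_us(runs):
    """Return the largest max-lock-time observed across the runs."""
    return max(time_us for time_us, _ in runs)


def mean_lock_count(runs):
    """Return the average number of lock acquisitions per run."""
    return sum(count for _, count in runs) / len(runs)


wfe_worst = worst_case_us(wfe_runs)
delay_worst = worst_case_us(delay_runs)

# Ratio of worst-case lock times: how many times shorter the worst wait
# is with __delay(1) than with the WFE-based workaround.
improvement = wfe_worst / delay_worst

print(f"worst case with WFE:        {wfe_worst} us")
print(f"worst case with __delay(1): {delay_worst} us")
print(f"ratio:                      {improvement:.1f}x")
print(f"mean c0 lock count, WFE:        {mean_lock_count(wfe_runs)}")
print(f"mean c0 lock count, __delay(1): {mean_lock_count(delay_runs)}")
```

The worst-case wait on cpu0 drops from roughly one second to under 8ms in
these runs, which is the effect described above; the lock counts are printed
only for completeness, since they matter less in this test case.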