Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C3D94C433F5 for ; Wed, 15 Dec 2021 19:43:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232209AbhLOTnC (ORCPT ); Wed, 15 Dec 2021 14:43:02 -0500 Received: from out01.mta.xmission.com ([166.70.13.231]:37404 "EHLO out01.mta.xmission.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229492AbhLOTnB (ORCPT ); Wed, 15 Dec 2021 14:43:01 -0500 Received: from in01.mta.xmission.com ([166.70.13.51]:51974) by out01.mta.xmission.com with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.93) (envelope-from ) id 1mxaAz-00C2tz-3k; Wed, 15 Dec 2021 12:42:57 -0700 Received: from ip68-227-160-95.om.om.cox.net ([68.227.160.95]:48176 helo=email.froward.int.ebiederm.org.xmission.com) by in01.mta.xmission.com with esmtpsa (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.93) (envelope-from ) id 1mxaAx-00FWKo-Lr; Wed, 15 Dec 2021 12:42:56 -0700 From: ebiederm@xmission.com (Eric W. Biederman) To: Barret Rhoden Cc: Christian Brauner , Andrew Morton , Alexey Gladkov , William Cohen , Viresh Kumar , Alexey Dobriyan , Chris Hyser , Peter Collingbourne , Xiaofeng Cao , David Hildenbrand , Cyrill Gorcunov , linux-kernel@vger.kernel.org References: <20211213220401.1039578-1-brho@google.com> <8735mww2w3.fsf@email.froward.int.ebiederm.org> <456a056e-453e-71b0-0f9e-03511b9f56b1@google.com> Date: Wed, 15 Dec 2021 13:42:32 -0600 In-Reply-To: <456a056e-453e-71b0-0f9e-03511b9f56b1@google.com> (Barret Rhoden's message of "Wed, 15 Dec 2021 14:00:11 -0500") Message-ID: <87zgp1psd3.fsf@email.froward.int.ebiederm.org> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/26.1 (gnu/linux) MIME-Version: 1.0 Content-Type: text/plain X-XM-SPF: eid=1mxaAx-00FWKo-Lr;;;mid=<87zgp1psd3.fsf@email.froward.int.ebiederm.org>;;;hst=in01.mta.xmission.com;;;ip=68.227.160.95;;;frm=ebiederm@xmission.com;;;spf=neutral X-XM-AID: U2FsdGVkX1+/H5bM6vXl/B8OAznoBP7d9IHqE6Sa2M8= X-SA-Exim-Connect-IP: 68.227.160.95 X-SA-Exim-Mail-From: ebiederm@xmission.com Subject: Re: [PATCH] rlimits: do not grab tasklist_lock for do_prlimit on current X-SA-Exim-Version: 4.2.1 (built Sat, 08 Feb 2020 21:53:50 +0000) X-SA-Exim-Scanned: Yes (on in01.mta.xmission.com) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Barret Rhoden writes: > Hi - > > On 12/13/21 5:34 PM, Eric W. Biederman wrote: >> Do you have any numbers? As the entire point of this change is >> performance it would be good to see how the performance changes. >> >> Especially as a read_lock should not be too bad as it allows sharing, >> nor do I expect reading or writing the rlimit values to be particularly >> frequent. So some insight into what kinds of userspace patterns make >> this a problem would be nice. > > This was motivated by slowdowns we observed on a few machines running > tests in a cluster. AFAIK, there were a lot of small tests, many of > which mucked with process management syscalls while fork/joining other > tasks. > > Based on a cycles profile, it looked like ~87% of the time was spent > in the kernel, ~42% of which was just trying to get *some* spinlock > (queued_spin_lock_slowpath, not necessarily the tasklist_lock). > > The big offenders (with rough percentages in cycles of the overall trace): > > - do_wait 11% > - setpriority 8% (potential future patch) > - kill 8% > - do_exit 5% > - clone 3% > - prlimit64 2% (this patch) > - getrlimit 1% (this patch) > > Even though do_prlimit was using a read_lock, it was still contending > on the internal queued_spin_lock. > > The prlimit was only 3% of the total. This patch was more of a "oh, > this doesn't *need* the tasklist_lock for p == current" - can we fix > that? I actually don't even know often those prlimit64 calls had p == > current. > > setpriority was a bigger one too - is the tasklist lock only needed > for the PGRP ops? (I thought so based on where the tasklist_lock is > write locked and the comment on task_pgrp()). If so, I could do that > in another patch. That is my understanding. For setpriority to change everything atomically it must hold the tasklist lock when dealing with more than one process at a time. >> This change is a bit scary as it makes taking a lock conditional and >> increases the probability of causing a locking mistake. > > I definitely see how making the code more brittle might not be worth > the small win. If this is more "damage" than "cleanup", then I can > drop it. > >> If you are going to make this change I would say that do_prlimit should >> become static and taking the tasklist_lock should move into prlimit64. >> >> >> Looking a little closer it looks like that update_rlimit_cpu should use >> lock_task_sighand, and once lock_task_sighand is used there is actually >> no need for the tasklist_lock at all. As holding the reference to tsk >> guarantees that tsk->signal remains valid. > > Maybe do both? unconditionally grab lock_task_sighand (instead of > tasklist_lock) in prlimit64. In update_rlimit_cpu use lock_task_sighand instead of unconditionally grabbing sighand->siglock (because without tasklist_lock sighand might be NULL). Then do_prlimit64 can drop the tasklist lock and the test for sighand == NULL. This will address every prlimit64 case instead of just when updating current. With prlimit64 in your profile I expect some of those are non-current. >> So I completely agree there are cleanups that can happen in this area. >> Please make those and show numbers in how they improve things, instead >> of making the code worse with a conditional lock. > > Unfortunately, I can't easily get a "before and after" on this > change. The motivating issue popped up sporadically, but getting it to > happen in a setup under *my* control is organizationally a pain. So I > understand if you wouldn't want the patch for that reason. Ideally, > the changes would make the code easier to follow and clearer about why > we're locking. Even a microbenchmark that stresses the lock and can show the performance impact of the change you are making is useful. Simply reorganizing and removing an unnecessary lock lowers the bar because the code is then both simpler and pretty much by definition faster. Although sometimes another lock is then hit and the contention moves. > If you're OK with two patches that 1) grab lock_task_sighand in > prlimit64 and 2) moving the read_lock in {set,get}priority into the > PGRP cases (assuming I was correct on that), I can send them out. I think getpriority can only use the rcu_read_lock. I don't think it has any atomicity guarantees. For setpriority the single process case should be safe just use rcu. I haven't read through all of what set_one_prio is doing to confirm it is safe, but in principle it should be safe. > If it's too much of a risk/ugliness for not clear enough gain (in code > quality or performance), I'm fine with dropping it. Removing the tasklist_lock where we can is definitely a clear gain. Simply shoving tasklist_lock aside and making the code more complicated is much less clear. Plus anything you can benchmark (even microbenchmark) and show the benefit of is welcome. Especially when you have indications that it makes a difference in a larger context. Eric