Received: by 2002:a25:6193:0:0:0:0:0 with SMTP id v141csp4734395ybb; Tue, 24 Mar 2020 04:25:25 -0700 (PDT) X-Google-Smtp-Source: ADFU+vuWPbp+GTdGVvvb7OIDTNxWCb/QFB2LNy2hvnK3ize1E6g37eAZKZJGQ2pVxuZblhh14peh X-Received: by 2002:a4a:3ed7:: with SMTP id t206mr1797254oot.61.1585049125073; Tue, 24 Mar 2020 04:25:25 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1585049125; cv=none; d=google.com; s=arc-20160816; b=TCrsp2wtEbNBOyJAYkZFgOI6FKZ9AuehgvlGfNiHLPkjZtSEdFNb1wUWeS4XIqsLjl xyrZ8W9WSOm5Gj42w0JzaeSwTW+NOUtTEpKZbc4o93jR6etH95upb3POTrMk8o8qEmtx PryBluBzATWDIeW1Pj4nYPmZJWSSDAe9hEWipHTcLlMI30SVVHwSMNf9wc8xEodcf2K3 rohVca9nM0SjlZCOsgTrO8LGS/2U9ovOjpq1m7GNhnv+lgcHKT6bghpIdXpDpSl+EhhW ZfeRLrKM/cbSUtgNIUvxaMmc7NBeRSM4r2k305nL74QF6PUkZZ5m2OfdGMgQ4KL10fPt 7xpw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=+CSSdMsVaIyquVyuTuWUPxM7DhF2C3Wr0DObey+9Q6I=; b=VPSRoYK0ky1nYLIO9tUEgomWXE5mboKijZby0tKeP2CQYHmquexrH7Q0rBCZZJtGGG lTzc4FpzWwgjSAvu3Tz3mZHmRIIib22vfNuDNWPISyDehGGOixE3r/cLnhkHvkw9cxrW IkFAZif3PdEuNbMseJ2rFHpuvfqzfoZd/uR1KcPGLdcAK56PWwJhtzIBs2Nf9cEYnLzg 8yDMxywyZCgPzgO75l0I6LqZISDPoQOib60m1kEG3eUFutWZr/Uu9dmdJ/BovXp+l14U kpeu1+CuJtJ5sBpu92Qko4tvqrNe1syqRLkk6fX8/OfWstRlIn0kxTEWEDm3PuF0925A m3aQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id s200si8098212oie.188.2020.03.24.04.25.11; Tue, 24 Mar 2020 04:25:25 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727282AbgCXLXr (ORCPT + 99 others); Tue, 24 Mar 2020 07:23:47 -0400 Received: from outbound-smtp39.blacknight.com ([46.22.139.222]:58603 "EHLO outbound-smtp39.blacknight.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727124AbgCXLXr (ORCPT ); Tue, 24 Mar 2020 07:23:47 -0400 Received: from mail.blacknight.com (pemlinmail04.blacknight.ie [81.17.254.17]) by outbound-smtp39.blacknight.com (Postfix) with ESMTPS id F2A84173A for ; Tue, 24 Mar 2020 11:23:44 +0000 (GMT) Received: (qmail 25315 invoked from network); 24 Mar 2020 11:23:44 -0000 Received: from unknown (HELO techsingularity.net) (mgorman@techsingularity.net@[84.203.18.57]) by 81.17.254.9 with ESMTPSA (AES256-SHA encrypted, authenticated); 24 Mar 2020 11:23:44 -0000 Date: Tue, 24 Mar 2020 11:23:42 +0000 From: Mel Gorman To: Vincent Guittot Cc: Ingo Molnar , Peter Zijlstra , Valentin Schneider , Phil Auld , LKML Subject: Re: [PATCH 4/4] sched/fair: Track possibly overloaded domains and abort a scan if necessary Message-ID: <20200324112342.GH3818@techsingularity.net> References: <20200320151245.21152-1-mgorman@techsingularity.net> <20200320151245.21152-5-mgorman@techsingularity.net> <20200320164432.GE3818@techsingularity.net> <20200320174304.GF3818@techsingularity.net> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-15 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.10.1 (2018-07-13) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Mar 24, 2020 at 11:35:08AM +0100, Vincent Guittot wrote: > > > > It's connected to nohz balancing and I didn't see how I could use that > > > > for detecting overload. Also, I don't think it ever can be larger than > > > > the sd weight and overload is based on the number of running tasks being > > > > greater than the number of available CPUs. Did I miss something obvious? > > > > > > IIUC you try to estimate if there is a chance to find an idle cpu > > > before starting the loop and scanning the domain and abort early if > > > the possibility is low. > > > > > > if nr_busy_cpus equals to sd->span_weight it means that there is no > > > free cpu so there is no need to scan > > > > > > > Ok, I see what you are getting at but I worry there are multiple > > problems there. First, the nr_busy_cpus is decremented only when a CPU > > is entering idle with the tick stopped. If nohz is disabled then this > > breaks, no? Secondly, a CPU can be idle but the tick not stopped if > > But this can be changed if that make the statistic useful > Hmm, for all cases to track number of running tasks, I think that would end up being too costly because of the shared cache line. > > __tick_nohz_idle_stop_tick knows there is an event in the near future > > so using busy_cpus, we potentially miss a sibling that was adequate > > for running a task. Finally, the threshold for cutting off the search > > entirely seems low. The patch marks a domain as overloaded if there are > > twice as many running tasks as runqueues scanned. In that scenario, even > > if tasks are rapidly switching between busy/idle, it's still unlikely > > the task will go idle. When cutting off at just the fully-busy mark, we > > could miss a CPU that is going idle, almost idle or is running SCHED_IDLE > > tasks where are acceptable target candidates for select_idle_sibling. I > > think there are too many cases where nr_busy_cpus are problematic to > > make it a good alternative. > > I don't really like this patch because it adds yet another metrics and > yet another feature which is set true by default. Also the current > proposal seems a bit fragile because it uses an arbitrary ratio of 2 > on an arbitrary number of CPUs. This threshold probably works in your > case and your system but probably not for others and the threshold > really looks like a heuristic that works for you but without any real > meaning. > I have to admit that is a possibility. The really interesting case for other people is the transition from almost-fully-busy -> fully-busy -> overloaded because the optimal amount to search changes at those points. It's a minefield of hitting a regression somewhere whether you search too much or too little. It also can be somewhat problematic when there are multiple small llc caches per numa node. In that case a "full" search is still a small number of CPUs and premature cutoff can be hurtful. It eventually showed up when enough tests ran for long enough. > Then, the update is done at each and every task wake up and by all > CPUs in the LLC. I avoided that by only doing the write when there is a state transition. set_sd_overloaded() does not write if it's already marked overloaded. > It means that the same variable is updated > simultaneously by all CPUs: one CPU can set it and the next one might > clear it immediately because they haven't scanned the same CPUs. That is possible. > At > the end, 2 threads waking up simultaneously on different CPUS, might > end up using 2 different policy without any other reason than a random > ordering. > True, but I did not feel the race was extremely damaging because it happens once per transition for up to sd_weight-1 unlucky tasks. The same is true for the test for idle cores where multiple tasks can search for an idle core unnecessarily. Now the search for an idle core can actually cache one idle candidate it finds. It's a straight-forward patch but it hurts the fast path when the domain has spare capacity and similar to the overloaded state, we cannot detect spare capacity in advance. > I agree that the concept of detecting that a LLC domain is overloaded > can be useful to decide to skip searching for an idle cpu but this > proposal seems to be not really generic > Ok, I can dump the patch and think of something else. Thanks for the review! -- Mel Gorman SUSE Labs