From: "Rafael J. Wysocki"
To: Peter Zijlstra
Cc: Mike Galbraith, Andy Lutomirski, Andy Lutomirski, Ingo Molnar, "linux-kernel@vger.kernel.org", Tejun Heo, Thomas Gleixner
Subject: Re: [PATCH] sched/cpuset/pm: Fix cpuset vs suspend-resume
Date: Thu, 07 Sep 2017 12:54:22 +0200
Message-ID: <1983421.hjxvCve29b@aspire.rjw.lan>
In-Reply-To: <20170907092616.thsuyqklit4463wj@hirez.programming.kicks-ass.net>
References: <20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net> <20170907092616.thsuyqklit4463wj@hirez.programming.kicks-ass.net>

On Thursday, September 7, 2017 11:26:16 AM CEST Peter Zijlstra wrote:
> On Thu, Sep 07, 2017 at 11:13:38AM +0200, Peter Zijlstra wrote:
> > Subject: sched/cpuset/pm: Fix cpuset vs suspend-resume
> >
> > Cpusets vs suspend-resume is _completely_ broken. And it got noticed
> > because it now resulted in non-cpuset usage breaking too.
> >
> > On suspend cpuset_cpu_inactive() doesn't call into
> > cpuset_update_active_cpus() because it doesn't want to move tasks about,
> > there is no need, all tasks are frozen and won't run again until after
> > we've resumed everything.
> >
> > But this means that when we finally do call into
> > cpuset_update_active_cpus() after resuming the last frozen cpu in
> > cpuset_cpu_active(), the top_cpuset will not have any difference with
> > the cpu_active_mask and thus it will not in fact do _anything_.
> >
> > So the cpuset configuration will not be restored. This was largely
> > hidden because we would unconditionally create identity domains and
> > mobile users would not in fact use cpusets much. And servers that do use
> > cpusets tend to not suspend-resume much.
> >
> > An additional problem is that we'd not in fact wait for the cpuset work to
> > finish before resuming the tasks, allowing spurious migrations outside
> > of the specified domains.
> >
> > Fix the rebuild by introducing cpuset_force_rebuild() and fix the
> > ordering with cpuset_wait_for_hotplug().
> >
> > Cc: tj@kernel.org
> > Cc: rjw@rjwysocki.net
> > Cc: efault@gmx.de
> > Reported-by: Andy Lutomirski
> > Signed-off-by: Peter Zijlstra (Intel)
>
> TJ, I _think_ it was commit:
>
>   deb7aa308ea2 ("cpuset: reorganize CPU / memory hotplug handling")
>
> That wrecked things, but there have been so many changes in this area it is
> really hard to tell. Note how before that commit it would
> unconditionally rebuild the domains, and you 'optimized' that ;-)
>
> That commit also introduced the work to do the async rebuild and failed
> to do that flush on resume.
>
> In any case, I think we should put a fixes tag on this commit such that
> it gets picked up into stable kernels. Not sure anybody will try and
> backport it into 4 year old kernels, but who knows.
>

Many thanks for fixing this!