Subject: Re: [RFC PATCH 06/16] sched: Add core wide task selection and scheduling.
To: Joel Fernandes, Vineeth Remanan Pillai, Peter Zijlstra
Cc: Nishanth Aravamudan, Julien Desfossez, tglx@linutronix.de, pjt@google.com, torvalds@linux-foundation.org, linux-kernel@vger.kernel.org, subhra.mazumdar@oracle.com, mingo@kernel.org, fweisbec@gmail.com, keescook@chromium.org, kerrnel@google.com, Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman, Pawan Gupta, Paolo Bonzini, vineethrp@gmail.com, Chen Yu, Christian Brauner, Aaron Lu, paulmck@kernel.org
References: <20200701232847.GA439212@google.com>
From: Tim Chen
Message-ID: <200c81ef-c961-dcd5-1221-84897c459b05@linux.intel.com>
Date: Wed, 1 Jul 2020 17:54:11 -0700
In-Reply-To: <20200701232847.GA439212@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 7/1/20 4:28 PM, Joel Fernandes wrote:
> On Tue, Jun 30, 2020 at 09:32:27PM +0000, Vineeth Remanan Pillai wrote:
>> From: Peter Zijlstra
>>
>> Instead of only selecting a local task, select a task for all SMT
>> siblings for every reschedule on the core (irrespective which logical
>> CPU does the reschedule).
>>
>> There could be races in core scheduler where a CPU is trying to pick
>> a task for its sibling in core scheduler, when that CPU has just been
>> offlined. We should not schedule any tasks on the CPU in this case.
>> Return an idle task in pick_next_task for this situation.
>>
>> NOTE: there is still potential for siblings rivalry.
>> NOTE: this is far too complicated; but thus far I've failed to
>> simplify it further.
>>
>> Signed-off-by: Peter Zijlstra (Intel)
>> Signed-off-by: Julien Desfossez
>> Signed-off-by: Vineeth Remanan Pillai
>> Signed-off-by: Aaron Lu
>> Signed-off-by: Tim Chen
>
> Hi Peter, Tim, all, the below patch fixes the hotplug issue described in the
> below patch's Link tag. Patch description below describes the issues fixed
> and it applies on top of this patch.
>
> ------8<----------
>
> From: "Joel Fernandes (Google)"
> Subject: [PATCH] sched: Fix CPU hotplug causing crashes in task selection logic
>
> The selection logic does not run correctly if the current CPU is not in the
> cpu_smt_mask (which it is not because the CPU is offlined when the stopper
> finishes running and needs to switch to idle). There are also other issues
> fixed by the patch I think such as: if some other sibling set core_pick to
> something, however the selection logic on current cpu resets it before
> selecting. In this case, we need to run the task selection logic again to
> make sure it picks something if there is something to run. It might end up
> picking the wrong task. Yet another issue was, if the stopper thread is an
> unconstrained pick, then rq->core_pick is set. The next time task selection
> logic runs when stopper needs to switch to idle, the current CPU is not in
> the smt_mask. This causes the previous ->core_pick to be picked again which
> happens to be the unconstrained task! so the stopper keeps getting selected
> forever.
>
> That and there are a few more safe guards and checks around checking/setting
> rq->core_pick. To test it, I ran rcutorture and made it tag all torture
> threads. Then ran it in hotplug mode (hotplugging every 200ms) and it hit the
> issue. Now it runs for an hour or so without issue. (Torture testing debug
> changes: https://bit.ly/38htfqK ).
>
> Various fixes were tried causing varying degrees of crashes. Finally I found
> that it is easiest to just add current CPU to the smt_mask's copy always.
> This is so that task selection logic always runs on the current CPU which
> called schedule().

It looks good to me. Thanks.

Tim
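
For readers following the thread without the diff at hand, the idea in the last quoted paragraph (work on a copy of the SMT sibling mask and always add the calling CPU to it) can be sketched as a small userspace toy. This is not the patch and not kernel code: toy_cpumask_t, toy_selection_mask() and toy_mask_has_cpu() are hypothetical names invented for illustration, and a 64-bit integer stands in for the kernel's cpumask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a cpumask: bit n set means CPU n is in the mask.
 * Hypothetical type for illustration, not the kernel's struct cpumask. */
typedef uint64_t toy_cpumask_t;

/* Build the set of CPUs a selection loop would visit.
 * The sibling mask is passed by value, so the caller's mask is never
 * modified; only the local copy gains the calling CPU's bit. */
static toy_cpumask_t toy_selection_mask(toy_cpumask_t smt_mask,
                                        unsigned int this_cpu)
{
    assert(this_cpu < 64);

    toy_cpumask_t select_mask = smt_mask;           /* the copy */
    select_mask |= (toy_cpumask_t)1 << this_cpu;    /* always include caller */
    return select_mask;
}

/* Return true if the given CPU's bit is set in the mask. */
static bool toy_mask_has_cpu(toy_cpumask_t mask, unsigned int cpu)
{
    assert(cpu < 64);
    return ((mask >> cpu) & 1u) != 0;
}
```

In this toy model, if CPU 0 has just been offlined and the sibling mask only contains CPU 1 (0x2), toy_selection_mask(0x2, 0) still contains CPU 0, so a loop over the returned mask would visit the CPU that called schedule(), while the original sibling mask is left untouched.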