Message-ID: <1524116324.21378.1.camel@gmx.de>
Subject: Re: cpu stopper threads and load balancing leads to deadlock
From: Mike Galbraith
To: Matt Fleming, Peter Zijlstra, Ingo Molnar
Cc: linux-kernel@vger.kernel.org, Michal Hocko
Date: Thu, 19 Apr 2018 07:38:44 +0200
In-Reply-To: <1524030475.5645.2.camel@gmx.de>
References: <20180417142119.GA4511@codeblueprint.co.uk> <1524030475.5645.2.camel@gmx.de>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 2018-04-18 at 07:47 +0200, Mike Galbraith wrote:
> On Tue, 2018-04-17 at 15:21 +0100, Matt Fleming wrote:
> > Hi guys,
> >
> > We've seen a bug in one of our SLE kernels where the cpu stopper
> > thread ("migration/15") is entering idle balance. This then triggers
> > active load balance.
> >
> > At the same time, a task on another CPU triggers a page fault and NUMA
> > balancing kicks in to try and migrate the task closer to the NUMA node
> > for that page (we're inside stop_two_cpus()). This faulting task is
> > spinning in try_to_wake_up() (inside smp_cond_load_acquire(&p->on_cpu,
> > !VAL)), waiting for "migration/15" to context switch.
> >
> > Unfortunately, because "migration/15" is doing active load balance
> > it's spinning waiting for the NUMA-page-faulting CPU's stopper lock,
> > which is already held (since it's inside stop_two_cpus()).
> >
> > Deadlock ensues.
> >
> > This seems like a situation that should be prohibited, but I cannot
> > find any code to prevent it. Is it OK for stopper threads to load
> > balance?
> > Is there something that should prevent this situation from
> > happening?
>
> I don't see anything to stop the deadlock either, would exclude stop
> class from playing idle balancer entirely...

Bah, insufficient: __do_softirq() -> rebalance_domains() still bites.
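
To make the circular wait from Matt's report concrete, here is a minimal
userspace sketch in C. It is not kernel code: the actor names, the NOBODY
marker and the has_wait_cycle() helper are all hypothetical, introduced
only for illustration. It models each party as waiting on at most one
other party and checks whether following those waits ever loops back.
The second scenario assumes the stopper thread never starts load
balancing, which is the idea floated above; it says nothing about the
__do_softirq() -> rebalance_domains() path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model for illustration only, not kernel code. */
#define NOBODY (-1)

enum actor {
	STOPPER_15,	/* "migration/15", doing active load balance */
	NUMA_TASK,	/* faulting task, inside stop_two_cpus() */
	NR_ACTORS,
};

/*
 * waits_on[i] is the actor that i is blocked on, or NOBODY.
 * Each actor has at most one outgoing edge, so if we can take n
 * steps from some start without reaching NOBODY, we have visited
 * n + 1 actors out of n and must be going around a cycle.
 */
bool has_wait_cycle(const int *waits_on, size_t n)
{
	for (size_t start = 0; start < n; start++) {
		int cur = (int)start;
		size_t steps = 0;

		while (cur != NOBODY && steps < n) {
			assert(cur >= 0 && (size_t)cur < n);
			cur = waits_on[cur];
			steps++;
		}
		if (cur != NOBODY)
			return true;
	}
	return false;
}

/* The reported situation: each side spins on the other. */
bool reported_scenario_deadlocks(void)
{
	int waits_on[NR_ACTORS] = {
		/* needs the stopper lock held by the faulting CPU */
		[STOPPER_15] = NUMA_TASK,
		/* spins in try_to_wake_up() until migration/15 switches out */
		[NUMA_TASK] = STOPPER_15,
	};

	return has_wait_cycle(waits_on, NR_ACTORS);
}

/* Assumption: the stopper thread never enters load balancing. */
bool without_stopper_balancing_deadlocks(void)
{
	int waits_on[NR_ACTORS] = {
		[STOPPER_15] = NOBODY,
		[NUMA_TASK] = STOPPER_15,
	};

	return has_wait_cycle(waits_on, NR_ACTORS);
}
```

In this toy model the first scenario reports a cycle and the second does
not, because removing the stopper's wait on the lock breaks the loop. It
only captures the two waits described in the report and makes no claim
about other paths into the balancer.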