Subject: Re: cpu stopper threads and load balancing leads to deadlock
From: Mike Galbraith
To: Matt Fleming, Peter Zijlstra, Ingo Molnar
Cc: linux-kernel@vger.kernel.org, Michal Hocko
Date: Wed, 18 Apr 2018 07:47:55 +0200
Message-ID: <1524030475.5645.2.camel@gmx.de>
In-Reply-To: <20180417142119.GA4511@codeblueprint.co.uk>

On Tue, 2018-04-17 at 15:21 +0100, Matt Fleming wrote:
> Hi guys,
>
> We've seen a bug in one of our SLE kernels where the cpu stopper
> thread ("migration/15") is entering idle balance. This then triggers
> active load balance.
>
> At the same time, a task on another CPU triggers a page fault and NUMA
> balancing kicks in to try and migrate the task closer to the NUMA node
> for that page (we're inside stop_two_cpus()). This faulting task is
> spinning in try_to_wake_up() (inside smp_cond_load_acquire(&p->on_cpu,
> !VAL)), waiting for "migration/15" to context switch.
>
> Unfortunately, because "migration/15" is doing active load balance
> it's spinning waiting for the NUMA-page-faulting CPU's stopper lock,
> which is already held (since it's inside stop_two_cpus()).
>
> Deadlock ensues.
>
> This seems like a situation that should be prohibited, but I cannot
> find any code to prevent it. Is it OK for stopper threads to load
> balance? Is there something that should prevent this situation from
> happening?

I don't see anything to stop the deadlock either. I would exclude the
stop class from playing idle balancer entirely, though I suppose you
could check for the caller being stop class in need_active_balance().
I don't think any RT class playing idle balancer is particularly
wonderful.

	-Mike
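
For illustration only, the narrower of the two options (a stop-class check
in need_active_balance()) might look roughly like the user-space toy model
below. This is a sketch under stated assumptions, not kernel code and not
a tested patch: every toy_* type, field, and function name is hypothetical
and only mirrors the kernel names mentioned in this thread, and the usual
active-balance heuristics are collapsed into a single precomputed flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's scheduling classes. */
enum toy_sched_class { TOY_STOP, TOY_RT, TOY_FAIR, TOY_IDLE };

/* Hypothetical, minimal stand-in for the task running the balance pass. */
struct toy_task {
	const char *comm;
	enum toy_sched_class sched_class;
};

/* Hypothetical, minimal stand-in for the load-balance environment. */
struct toy_lb_env {
	const struct toy_task *caller;	/* task doing the idle balance */
	bool wants_active_balance;	/* usual heuristics, precomputed */
};

/*
 * Toy version of the suggested check: a stop-class caller never asks
 * for active balance, because (per the report) that path ends up
 * waiting on another CPU's stopper lock.
 */
static bool toy_need_active_balance(const struct toy_lb_env *env)
{
	assert(env != NULL && env->caller != NULL);

	if (env->caller->sched_class == TOY_STOP)
		return false;

	return env->wants_active_balance;
}
```

In this model a caller such as "migration/15" with TOY_STOP always gets
false, while a TOY_FAIR caller simply gets back whatever the heuristics
decided. Excluding the stop class from idle balance entirely would sit
earlier in the path and is not modelled here.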