Subject: Re: NULL pointer dereference in pick_next_task_fair
To: Ram Muthiah, Quentin Perret
Cc: Peter Zijlstra, linux-kernel@vger.kernel.org, aaron.lwe@gmail.com, mingo@kernel.org, pauld@redhat.com, jdesfossez@digitalocean.com, naravamudan@digitalocean.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, juri.lelli@redhat.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, kernel-team@android.com, john.stultz@linaro.org
References: <20191028174603.GA246917@google.com> <20191029113411.GP4643@worktop.programming.kicks-ass.net> <20191029115000.GA11194@google.com>
From: Valentin Schneider
Message-ID: <75e99374-0bd6-a7d7-581e-9360a1f90103@arm.com>
Date: Thu, 31 Oct 2019 02:33:13 +0100

On 30/10/2019 23:50, Ram Muthiah wrote:
>
> Quentin and I were able to create a setup which reproduces the issue.
>
> Given this, I tried Peter's proposed fix and was still able to reproduce the
> issue unfortunately. Current patch is located here -
> https://android-review.googlesource.com/c/kernel/common/+/1153487
>
> Our mitigation for this issue on the android-mainline branch has been to
> revert 67692435c411 ("sched: Rework pick_next_task() slow-path").
> https://android-review.googlesource.com/c/kernel/common/+/1152564
>
> I'll spend some time detailing repro steps next. I should be able to
> provide an update on those details early next week.
>
> We appreciate the help so far.
> Thanks,
> Ram
>

The splat Quentin posted happens at secondary startup; is that always the case?

I'm trying to think of what could make rq.cfs_rq.nr_running non-zero at secondary bringup time. It might not explain the NULL pointer, but I'm still curious as to how we can get something there this early, as it could point us in the right direction.

Be warned, I might bring up stuff I know nothing about, but this looks "fun", so I can't help myself :)

Sched domains are only set up in sched_init_smp(), which runs after smp_init(), i.e. after we've booted all secondaries. This should take load balancing out of the picture.

For wakeups, select_task_rq_fair() can only ever pick prev_cpu or this_cpu, since there are no sched domains yet. I don't see many candidates that could wake up on a secondary CPU (i.e. with this_cpu != 0) this early. Perhaps the smpboot threads, but from a quick look they are first created *after* sched_init_smp(), so they couldn't exist during (boot-time) secondary bringup.
The same seems to apply to IRQ threads (and they're switched to SCHED_FIFO anyway).

So now I'm even more curious as to what CFS task could be enqueued on a secondary CPU's rq before sched_init_smp(). Have you been sending stuff to space without any shielding lately?
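The wakeup-placement argument above boils down to a reachability check: with no sched domains attached, a CFS wakeup can only land on the task's previous CPU or on the waker's CPU, so a task can only show up on a secondary if it already ran there or if something running there woke it. Below is a toy user-space sketch of that reasoning, under that assumption. It is not the kernel's select_task_rq_fair(); toy_select_cpu(), toy_can_land_on() and the use_waker_cpu flag are hypothetical names, and the real choice between the two CPUs (plus affinity checks and fallback paths) is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model, NOT kernel code: with no sched domains attached yet, a CFS
 * wakeup is assumed to end up on either the task's previous CPU or the
 * waker's CPU. 'use_waker_cpu' is a hypothetical stand-in for whichever of
 * the two the real code would pick; that decision logic is not modelled.
 */
int toy_select_cpu(int prev_cpu, int this_cpu, bool use_waker_cpu)
{
	assert(prev_cpu >= 0 && this_cpu >= 0);
	return use_waker_cpu ? this_cpu : prev_cpu;
}

/*
 * Could any wakeup of a task that last ran on prev_cpu, woken from
 * this_cpu, land on 'target'? Tries both possible outcomes of the model.
 */
bool toy_can_land_on(int target, int prev_cpu, int this_cpu)
{
	return toy_select_cpu(prev_cpu, this_cpu, false) == target ||
	       toy_select_cpu(prev_cpu, this_cpu, true) == target;
}
```

For instance, toy_can_land_on(1, 0, 0) is false (the task last ran on CPU 0 and is woken from CPU 0), whereas toy_can_land_on(1, 0, 1) and toy_can_land_on(1, 1, 0) are true. The open question is which CFS task could satisfy one of those two conditions for a secondary CPU before sched_init_smp().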