Subject: Re: [PATCH v1] kthread/smpboot: Serialize kthread parking against wakeup
To: Peter Zijlstra
Cc: tglx@linutronix.de, mpe@ellerman.id.au, mingo@kernel.org,
 bigeasy@linutronix.de, linux-kernel@vger.kernel.org,
 linux-arm-msm@vger.kernel.org, Neeraj Upadhyay, Will Deacon, Oleg Nesterov
References: <1524645199-5596-1-git-send-email-gkohli@codeaurora.org>
 <20180425200917.GZ4082@hirez.programming.kicks-ass.net>
 <20180426084131.GV4129@hirez.programming.kicks-ass.net>
 <20180426085719.GW4129@hirez.programming.kicks-ass.net>
 <4d3f68f8-e599-6b27-a2e8-9e96b401d57a@codeaurora.org>
 <20180430111744.GE4082@hirez.programming.kicks-ass.net>
 <3af3365b-4e3f-e388-8e90-45a3bd4120fd@codeaurora.org>
 <20180501101845.GE12217@hirez.programming.kicks-ass.net>
From: "Kohli, Gaurav"
Date: Tue, 1 May 2018 16:10:53 +0530
In-Reply-To: <20180501101845.GE12217@hirez.programming.kicks-ass.net>
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List:
linux-kernel@vger.kernel.org

On 5/1/2018 3:48 PM, Peter Zijlstra wrote:
> On Tue, May 01, 2018 at 01:20:26PM +0530, Kohli, Gaurav wrote:
>> But In our older case, where we have seen failure below is the wake up path
>> and ftraces, Wakeup occurred and completed before schedule call only.
>>
>> So final state of CPUHP is running not parked. I have also pasted debug
>> ftraces that we got during issue reproduction.
>>
>> Here wakeup for cpuhp is below:
>>
>> takedown_cpu-> kthread_park-> wake_up_process
>>
>>
>> 39,034,311,742,395 apps (10240) Trace Printk cpuhp/0 (16) [000]
>> 39015.625000: __kthread_parkme state=512 task=ffffffcc7458e680
>> flags: 0x5 -> state 5 -> state is parked inside parkme function
>>
>> 39,034,311,846,510 apps (10240) Trace Printk cpuhp/0 (16) [000]
>> 39015.625000: before schedule __kthread_parkme state=0
>> task=ffffffcc7458e680 flags: 0xd -> just before schedule call, state is
>> running
>>
>> static void __kthread_parkme(struct kthread *self)
>> {
>> 	__set_current_state(TASK_PARKED);
>> 	while (test_bit(KTHREAD_SHOULD_PARK, &self->flags)) {
>> 		if (!test_and_set_bit(KTHREAD_IS_PARKED, &self->flags))
>> 			complete(&self->parked);
>> 		schedule();
>> 		__set_current_state(TASK_PARKED);
>> 	}
>> 	clear_bit(KTHREAD_IS_PARKED, &self->flags);
>> 	__set_current_state(TASK_RUNNING);
>> }
>>
>> So my point is here also, if it is reschedule then it can set TASK_PARKED,
>> but it seems after takedown_cpu call this thread never get a chance to run,
>> So final state is TASK_RUNNING.
>>
>> In our current fix also can't we observe same scenario where final state is
>> TASK_RUNNING.
>
> I'm not sure I understand your concern. Losing the TASK_PARKED store
> with the above code is obviously bad. But with the loop as proposed I
> don't see a problem.

Yes, with the loop it will set TASK_PARKED again, but that is not what
happens in the dumps we have seen.
Here the state just before schedule() is RUNNING. cpuhp was then migrated
to some other core but never got a chance to run, so its state stays
RUNNING.

>
> takedown_cpu() can proceed beyond smpboot_park_threads() and kill the
> CPU before any of the threads are parked -- per having the complete()
> before hitting schedule().
>
> And, afaict, that is harmless. When we go offline, sched_cpu_dying() ->
> migrate_tasks() will migrate any still runnable threads off the cpu.
> But because at this point the thread must be in the PARKED wait-loop, it
> will hit schedule() and go to sleep eventually.
>
> Also note that kthread_unpark() does __kthread_bind() to rebind the
> threads.
>
> Aaaah... I think I've spotted a problem there. We clear SHOULD_PARK
> before we rebind, so if the thread lost the first PARKED store,
> does the completion, gets migrated, cycles through the loop and now
> observes !SHOULD_PARK and bails the wait-loop, then __kthread_bind()
> will forever wait.
>

So during the next unpark the path is:

__kthread_unpark -> __kthread_bind -> wait_task_inactive

wait_task_inactive() fails because the current state is RUNNING, so it
returns 0 from the loop below:

	while (task_running(rq, p)) {
		if (match_state && unlikely(p->state != match_state))
			return 0;
		cpu_relax();
	}

and that gives the warning:

	if (!wait_task_inactive(p, state)) {
		WARN_ON(1);
		return;	-> we return from here, so the binding code
			   after this point never runs
	}

Finally we hit the BUG_ON here, because we failed to rebind the hotplug
thread to our core:

			}
			kthread_parkme();
			/* We might have been woken for stop */
			continue;
		}

		BUG_ON(td->cpu != smp_processor_id());

and a panic occurred.

So it seems the thread always has to end up in the PARKED state; we cannot
afford to miss even a single instance.

> Is that what you had in mind?
>

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
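
For reference, the failing sequence described in this mail can be replayed
with a small user-space model. This is only a sketch under stated
assumptions: it is not kernel code, every model_* name is hypothetical, and
the scheduler is reduced to two booleans (whether the wakeup from
kthread_park() lands between the TASK_PARKED store and schedule(), and
whether the thread gets to run again before the unpark). The value 512 for
the parked state is taken from the trace quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical user-space model of the park/unpark sequence discussed
 * in this thread. None of this is kernel code; all model_* names are
 * invented for illustration.
 */
enum model_state { MODEL_TASK_RUNNING = 0, MODEL_TASK_PARKED = 512 };

struct model_kthread {
	enum model_state state;
	bool should_park;
	bool is_parked;
};

/* Stand-in for the match_state check inside wait_task_inactive(). */
static bool model_wait_task_inactive(const struct model_kthread *t,
				     enum model_state match_state)
{
	return t->state == match_state;
}

/*
 * Replays one park/unpark cycle and reports whether the rebind step
 * would go ahead.
 *   wakeup_races:    a wakeup lands after the TASK_PARKED store but
 *                    before schedule()
 *   task_runs_again: the thread gets CPU time again before the unpark
 */
static bool model_rebind_succeeds(bool wakeup_races, bool task_runs_again)
{
	struct model_kthread t = {
		.state = MODEL_TASK_RUNNING,
		.should_park = true,
		.is_parked = false,
	};

	t.state = MODEL_TASK_PARKED;		/* __set_current_state(TASK_PARKED) */
	if (wakeup_races)
		t.state = MODEL_TASK_RUNNING;	/* wake_up_process() via kthread_park() */

	t.is_parked = true;			/* test_and_set_bit() + complete() */
	assert(t.should_park && t.is_parked);

	/*
	 * schedule(): a RUNNING task stays runnable. Only if it really
	 * runs again does the loop redo the TASK_PARKED store.
	 */
	if (t.state == MODEL_TASK_RUNNING && task_runs_again)
		t.state = MODEL_TASK_PARKED;

	/* Unpark: the bind step first waits for the PARKED state. */
	return model_wait_task_inactive(&t, MODEL_TASK_PARKED);
}
```

In this model, model_rebind_succeeds(true, false) corresponds to the dumps
above: the wakeup wins the race, the thread never runs again, and the
state check before the rebind fails. With task_runs_again set to true the
loop restores the PARKED state and the check passes.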