Date: Wed, 23 Nov 2022 15:37:58 +0100
From: Frederic Weisbecker
To: Pengfei Xu, Lai Jiangshan, "Paul E. McKenney", Neeraj Upadhyay, Christian Brauner, "Eric W. Biederman"
Cc: linux-kernel@vger.kernel.org, heng.su@intel.com, rcu@vger.kernel.org
Subject: PID_NS unshare VS synchronize_rcu_tasks() (was: Re: [Syzkaller & bisect] There is task hung in "synchronize_rcu" in v6.1-rc5 kernel)
Message-ID: <20221123143758.GA1387380@lothringen>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Nov 21, 2022 at 01:37:06PM +0800, Pengfei Xu wrote:
> Hi Frederic Weisbecker and kernel developers,
>
> Greeting!
> There is task hung in "synchronize_rcu" in v6.1-rc5 kernel.
>
> Bisected the issue on Raptor and server(No atom small core, big core only),
> both platforms bisected results show that:
> first bad commit is c597bfddc9e9e8a63817252b67c3ca0e544ace26:
> "sched: Provide Kconfig support for default dynamic preempt mode"
>
> [ 300.097166] INFO: task rcu_tasks_kthre:11 blocked for more than 147 seconds.
> [ 300.097455] Not tainted 6.1.0-rc5-094226ad94f4 #1
> [ 300.097641] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 300.097922] task:rcu_tasks_kthre state:D stack:0 pid:11 ppid:2 flags:0x00004000
> [ 300.098230] Call Trace:
> [ 300.098325]
> [ 300.098410] __schedule+0x2de/0x8f0
> [ 300.098562] schedule+0x5b/0xe0
> [ 300.098693] schedule_timeout+0x3f1/0x4b0
> [ 300.098849] ? __sanitizer_cov_trace_pc+0x25/0x60
> [ 300.099032] ? queue_delayed_work_on+0x82/0xc0
> [ 300.099206] wait_for_completion+0x81/0x140
> [ 300.099373] __synchronize_srcu.part.23+0x83/0xb0
> [ 300.099558] ? __bpf_trace_rcu_stall_warning+0x20/0x20
> [ 300.099757] synchronize_srcu+0xd6/0x100
> [ 300.099913] rcu_tasks_postscan+0x19/0x20
> [ 300.100070] rcu_tasks_wait_gp+0x108/0x290
> [ 300.100230] ? _raw_spin_unlock+0x1d/0x40
> [ 300.100389] rcu_tasks_one_gp+0x27f/0x370
> [ 300.100546] ? rcu_tasks_postscan+0x20/0x20
> [ 300.100709] rcu_tasks_kthread+0x37/0x50
> [ 300.100863] kthread+0x14d/0x190
> [ 300.100998] ? kthread_complete_and_exit+0x40/0x40
> [ 300.101199] ret_from_fork+0x1f/0x30
> [ 300.101347]
>

Thanks for reporting this. Fortunately I managed to reproduce and debug.
It took me a few days to understand the complicated circular dependency
involved. So here is a summary:

1) TASK A calls unshare(CLONE_NEWPID). This creates a new PID namespace
   that every subsequent child of TASK A will belong to. But TASK A
   doesn't itself belong to that new PID namespace.

2) TASK A forks and creates TASK B (it is a new thread group, so TASK B
   is a thread group leader). TASK A stays attached to its PID namespace
   (let's say PID_NS1) and TASK B is the first task belonging to the new
   PID namespace created by unshare() (let's call it PID_NS2).

3) Since TASK B is the first task attached to PID_NS2, it becomes the
   PID_NS2 child reaper.

4) TASK A forks again and creates TASK C, which gets attached to PID_NS2.
   Note how TASK C has TASK A as a parent (belonging to PID_NS1) but has
   TASK B (belonging to PID_NS2) as a pid_namespace child_reaper.

5) TASK B exits and, since it is the child reaper for PID_NS2, it has to
   kill all other tasks attached to PID_NS2 and wait for all of them to
   die before reaping itself (zap_pid_ns_processes()). Note that it seems
   to make a mistaken assumption here, trusting that all tasks in PID_NS2
   get reaped either by a parent belonging to the same namespace or by
   TASK B. It is also confident that, since it deactivated the SIGCHLD
   handler, all the remaining tasks ultimately autoreap, and it waits for
   that to happen. However TASK C escapes that rule because it will get
   reaped by its parent TASK A, which belongs to PID_NS1.

6) TASK A calls synchronize_rcu_tasks(), which leads to
   synchronize_srcu(&tasks_rcu_exit_srcu).

7) TASK B is waiting for TASK C to get reaped (wrongly assuming it
   autoreaps). But TASK B is under a tasks_rcu_exit_srcu SRCU critical
   section (exit_notify() is between exit_tasks_rcu_start() and
   exit_tasks_rcu_finish()), blocking TASK A.

8) TASK C exits and, since TASK A is its parent, waits for TASK A to
   reap it. But TASK A can't, because it waits for TASK B, which waits
   for TASK C.

So there is a circular dependency:

_ TASK A waits for TASK B to get out of the tasks_rcu_exit_srcu SRCU
  critical section
_ TASK B waits for TASK C to get reaped
_ TASK C waits for TASK A to reap it.

I have no idea how to solve the situation without violating the
pid_namespace rules and unshare() semantics (although I wish
unshare(CLONE_NEWPID) had a less error-prone behaviour than allowing the
creation of more than one task belonging to the same namespace).

So having an SRCU read side critical section within exit_notify() is
probably not a good idea. Is there a solution to work around that for
RCU tasks?

Thanks.
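
For illustration only, the circular dependency described above can be
modelled as a small wait-for graph. This is a userspace sketch, not
kernel code: the array layout, NO_WAIT, fill_reported_scenario() and
task_is_on_wait_cycle() are all hypothetical names made up for this
example, and it assumes each task waits for at most one other task.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_TASKS 3
#define NO_WAIT  (-1)

/* Hypothetical labels for the three tasks in the scenario. */
enum { TASK_A, TASK_B, TASK_C };

/*
 * waits_for[t] holds the task that t is blocked on, or NO_WAIT.
 * This fills in the three edges of the reported hang.
 */
void fill_reported_scenario(int waits_for[NR_TASKS])
{
	/* A: synchronize_rcu_tasks() waits for B to leave tasks_rcu_exit_srcu */
	waits_for[TASK_A] = TASK_B;
	/* B: zap_pid_ns_processes() waits for C to get reaped */
	waits_for[TASK_B] = TASK_C;
	/* C: can only be reaped by its parent A */
	waits_for[TASK_C] = TASK_A;
}

/*
 * Return true if following the wait-for chain from 'start' leads back
 * to 'start'. At most nr_tasks hops are needed to decide that.
 */
bool task_is_on_wait_cycle(const int waits_for[], int nr_tasks, int start)
{
	int cur = start;
	int hops;

	assert(start >= 0 && start < nr_tasks);

	for (hops = 0; hops < nr_tasks; hops++) {
		cur = waits_for[cur];
		if (cur == NO_WAIT)
			return false;	/* chain ends: no deadlock through start */
		if (cur == start)
			return true;	/* came back to where we began */
	}
	return false;
}
```

With the three edges filled in, every task is reported as being on a
cycle. Setting waits_for[TASK_A] to NO_WAIT (that is, assuming TASK A no
longer has to wait for TASK B's SRCU section) breaks the cycle in this
model, which is the property any fix would need.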