Date: Fri, 5 Nov 2021 15:58:03 +0100
From: Peter Zijlstra
To: Mathias Krause
Cc: Ingo Molnar, Juri Lelli, Vincent Guittot, Michal Koutný, Dietmar Eggemann,
    Steven Rostedt, Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
    Valentin Schneider, linux-kernel@vger.kernel.org, Odin Ugedal,
    Kevin Tanguy, Brad Spengler
Subject: Re: [PATCH] sched/fair: Prevent dead task groups from regaining cfs_rq's
References: <20211103190613.3595047-1-minipli@grsecurity.net>
In-Reply-To: <20211103190613.3595047-1-minipli@grsecurity.net>

TJ, long email below, but TL;DR: is it OK to do synchronize_rcu() from a
css_released callback? We can't use the css_offline callback because of
commit 2f5177f0fd7e ("sched/cgroup: Fix/cleanup cgroup teardown/init"),
but we do need a second GP, as per the below.

On Wed, Nov 03, 2021 at 08:06:13PM +0100, Mathias Krause wrote:
> Kevin is reporting crashes which point to a use-after-free of a cfs_rq
> in update_blocked_averages(). Initial debugging revealed that we have
> live cfs_rq's (on_list=1) in an about-to-be-kfree()'d task group in
> free_fair_sched_group(). However, it was unclear how that can happen.
> 
> His kernel config happened to lead to a layout of struct sched_entity
> that put the 'my_q' member directly into the middle of the object which
> makes it incidentally overlap with SLUB's freelist pointer. That, in
> combination with SLAB_FREELIST_HARDENED's freelist pointer mangling,
> leads to a reliable access violation in form of a #GP which made the UAF
> fail fast, e.g. like this:
> 
> general protection fault, probably for non-canonical address 0xff80f68888f69107: 0000 [#1] SMP
> CPU: 5 PID: 0 Comm: swapper/5 Not tainted 5.14.13-custom+ #3
> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
> RIP: 0010:[] update_blocked_averages+0x428/0xb90
> Code: 28 01 00 4c 8b 4c 24 10 41 8b 97 c4 00 00 00 85 d2 75 32 4c 89 fe 4c 89 cf e8 74 2c 01 00 49 8b 96 80 00 00 00 48 85 d2 74 0e <48> 83 ba 08 01 00 00 00 0f 85 45 01 00 00 85 c0 0f 84 34 fe ff ff
> RSP: 0018:ffffc9000002bf08 EFLAGS: 00010086
> RAX: 0000000000000001 RBX: ffff8880202eba00 RCX: 000000000000b686
> RDX: ff80f68888f68fff RSI: 0000000000000000 RDI: 000000000000000c
> RBP: ffff888006808a00 R08: ffff888006808a00 R09: 0000000000000000
> R10: 0000000000000008 R11: 0000000000000000 R12: ffff8880202ebb40
> R13: 0000000000000000 R14: ffff8880087e7f00 R15: ffff888006808a00
> FS:  0000000000000000(0000) GS:ffff888237d40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000315b147b8000 CR3: 0000000002d70000 CR4: 00000000001606f0 shadow CR4: 00000000001606f0
> Stack:
>  0100008237d58b80 0000000000000286 000003ae017023d5 00000000000000f0
>  ffff888237d5d240 0000000000000028 ffff888237d5c980 ffff888237d5c900
>  ffff888237d5c900 0000000000000001 0000000000000100 0000000000000007
> Call Trace:
> 
>  [] run_rebalance_domains+0x3a/0x60
>  [] __do_softirq+0xbf/0x1fb
>  [] irq_exit_rcu+0x7f/0x90
>  [] sysvec_apic_timer_interrupt+0x6e/0x90
> 
>  [] asm_sysvec_apic_timer_interrupt+0x12/0x20
> RIP: 0010:[] acpi_idle_do_entry+0x49/0x50
> Code: 8b 15 2f b3 75 01 ed c3 0f 0b e9 52 fd ff ff 65 48 8b 04 25 40 1c 01 00 48 8b 00 a8 08 75 e8 eb 07 0f 00 2d d9 e2 1d 00 fb f4 c3 0f 0b cc cc cc 41 55 41 89 d5 41 54 49 89 f4 55 53 48 89 fb
> RSP: 0000:ffffc900000bbe68 EFLAGS: 00000246
> RAX: 0000000000004000 RBX: 0000000000000001 RCX: ffff888237d40000
> RDX: 0000000000000001 RSI: ffffffff82cdd4c0 RDI: ffff888100b7bc64
> RBP: ffff888101c07000 R08: ffff888100b7bc00 R09: 00000000000000ac
> R10: 0000000000000005 R11: ffff888237d5b824 R12: 0000000000000001
> R13: ffffffff82cdd4c0 R14: ffffffff82cdd540 R15: 0000000000000000
>  [] ? sched_clock_cpu+0x9/0xa0
>  [] acpi_idle_enter+0x48/0xb0
>  [] cpuidle_enter_state+0x7c/0x2c0
>  [] cpuidle_enter+0x24/0x40
>  [] do_idle+0x1c4/0x210
>  [] cpu_startup_entry+0x1e/0x20
>  [] start_secondary+0x11a/0x120
>  [] secondary_startup_64_no_verify+0xae/0xbb
> ---[ end trace aac4ad8b95ba31e5 ]---
> 
> Michal seems to have run into the same issue[1]. He already correctly
> diagnosed that commit a7b359fc6a37 ("sched/fair: Correctly insert
> cfs_rq's to list on unthrottle") is causing the preconditions for the
> UAF to happen by re-adding cfs_rq's also to task groups that have no
> more running tasks, i.e. also to dead ones. His analysis, however,
> misses the real root cause and it cannot be seen from the crash
> backtrace only, as the real offender is tg_unthrottle_up() getting
> called via sched_cfs_period_timer() via the timer interrupt at an
> inconvenient time.
> 
> When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
> task group, it doesn't protect itself from getting interrupted. If the
> timer interrupt triggers while we iterate over all CPUs or after
> unregister_fair_sched_group() has finished but prior to unlinking the
> task group, sched_cfs_period_timer() will execute and walk the list of
> task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
> dying task group.
> These will later -- in free_fair_sched_group() -- be
> kfree()'ed while still being linked, leading to the fireworks Kevin and
> Michal are seeing.
> 
> To fix this race, ensure the dying task group gets unlinked first.
> However, simply switching the order of unregistering and unlinking the
> task group isn't sufficient, as concurrent RCU walkers might still see
> it, as can be seen below:
> 
>     CPU1:                                    CPU2:
>       :                                        timer IRQ:
>       :                                          do_sched_cfs_period_timer():
>       :                                            :
>       :                                            distribute_cfs_runtime():
>       :                                              rcu_read_lock();
>       :                                              :
>       :                                              unthrottle_cfs_rq():
>     sched_offline_group():                             :
>       :                                                walk_tg_tree_from(…,tg_unthrottle_up,…):
>       list_del_rcu(&tg->list);                         :
> (1)   :                                                list_for_each_entry_rcu(child, &parent->children, siblings)
>       :                                                  :
> (2)   list_del_rcu(&tg->siblings);                       :
>       :                                                  tg_unthrottle_up():
>     unregister_fair_sched_group():                         struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>       :                                                    :
>       list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);               :
>       :                                                    :
>       :                                                    if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
> (3)   :                                                      list_add_leaf_cfs_rq(cfs_rq);
>       :                                                    :
>       :                                                  :
>       :                                                :
>       :                                              :
>       :                                            :
> (4)   :                                            rcu_read_unlock();
> 
> CPU 2 walks the task group list in parallel to sched_offline_group(),
> specifically, it'll read the soon to be unlinked task group entry at
> (1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
> still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink all
> cfs_rq's via list_del_leaf_cfs_rq() in unregister_fair_sched_group().
> Meanwhile CPU 2 will re-add some of these at (3), which is the cause of
> the UAF later on.
> 
> To prevent this additional race from happening, we need to wait until
> walk_tg_tree_from() has finished traversing the task groups, i.e. after
> the RCU read critical section ends in (4). Afterwards we're safe to call
> unregister_fair_sched_group(), as each new walk won't see the dying task
> group any more.
> 
> Using synchronize_rcu() might be seen as a too heavy hammer to nail this
> problem.
> However, the overall teardown sequence (e.g., as documented in
> css_free_rwork_fn()) already relies on quite a few assumptions regarding
> execution context and RCU grace periods having passed. Looking at the
> autogroup code, which calls sched_destroy_group() directly after
> sched_offline_group(), and the apparent need to have at least one RCU
> grace period expire after unlinking the task group, prior to calling
> unregister_fair_sched_group(), there seems to be no better alternative.
> Calling unregister_fair_sched_group() via call_rcu() will only lead to
> trouble in sched_offline_group() which also relies on (yet another)
> expired RCU grace period.
> 
> This patch survives Michal's reproducer[2] for 8h+ now, which used to
> trigger within minutes before.
> 
> [1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/
> [2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/
> 
> Fixes: a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
> Cc: Odin Ugedal
> Cc: Michal Koutný
> Reported-by: Kevin Tanguy
> Suggested-by: Brad Spengler
> Signed-off-by: Mathias Krause
> ---
>  kernel/sched/core.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 978460f891a1..60125a6c9d1b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9506,13 +9506,25 @@ void sched_offline_group(struct task_group *tg)
>  {
>  	unsigned long flags;
>  
> -	/* End participation in shares distribution: */
> -	unregister_fair_sched_group(tg);
> -
> +	/*
> +	 * Unlink first, to avoid walk_tg_tree_from() from finding us (via
> +	 * sched_cfs_period_timer()).
> +	 */
>  	spin_lock_irqsave(&task_group_lock, flags);
>  	list_del_rcu(&tg->list);
>  	list_del_rcu(&tg->siblings);
>  	spin_unlock_irqrestore(&task_group_lock, flags);
> +
> +	/*
> +	 * Wait for all pending users of this task group to leave their RCU
> +	 * critical section to ensure no new user will see our dying task
> +	 * group any more. Specifically ensure that tg_unthrottle_up() won't
> +	 * add decayed cfs_rq's to it.
> +	 */
> +	synchronize_rcu();
> +
> +	/* End participation in shares distribution: */
> +	unregister_fair_sched_group(tg);
>  }
>  
>  static void sched_change_group(struct task_struct *tsk, int type)
> -- 
> 2.30.2