Date: Mon, 3 Apr 2023 15:58:24 -0700
From: "Paul E. McKenney" <paulmck@kernel.org>
To: Ziwei Dai <ziwei.dai@unisoc.com>
Cc: urezki@gmail.com, frederic@kernel.org, quic_neeraju@quicinc.com,
    josh@joshtriplett.org, rostedt@goodmis.org,
    mathieu.desnoyers@efficios.com, jiangshanlai@gmail.com,
    joel@joelfernandes.org, rcu@vger.kernel.org,
    linux-kernel@vger.kernel.org, shuang.wang@unisoc.com,
    yifan.xin@unisoc.com, ke.wang@unisoc.com, xuewen.yan@unisoc.com,
    zhiguo.niu@unisoc.com, zhaoyang.huang@unisoc.com
Subject: Re: [PATCH V2] rcu: Make sure new krcp free business is handled
 after the wanted rcu grace period.
Message-ID: <105d5e6e-d0b7-46c6-8f8e-574091e45eb1@paulmck-laptop>
Reply-To: paulmck@kernel.org
References: <1680266529-28429-1-git-send-email-ziwei.dai@unisoc.com>
In-Reply-To: <1680266529-28429-1-git-send-email-ziwei.dai@unisoc.com>

On Fri, Mar 31, 2023 at 08:42:09PM +0800, Ziwei Dai wrote:
> In kfree_rcu_monitor(), new free business at krcp is attached to any free
> channel at krwp.  kfree_rcu_monitor() is responsible for making sure that
> the new free business is handled after the rcu grace period.  But if there
> is already any non-free channel at krwp, that means there is an on-going
> rcu work, which will cause the kvfree_call_rcu()-triggered free business
> to be done before the wanted rcu grace period ends.
>
> This commit makes kfree_rcu_monitor() ignore any krwp that has a non-free
> channel, to fix the issue that kvfree_call_rcu() loses effectiveness.
>
> Below is the css_set object "from_cset" use-after-free case caused by
> kvfree_call_rcu() losing effectiveness.
> CPU 0 calls rcu_read_lock(), then uses "from_cset", then a hard irq comes
> and the task is scheduled out.
> CPU 1 calls kfree_rcu(cset, rcu_head), intending to free "from_cset" after
> a new gp.  But "from_cset" is freed right after the current gp ends, and
> is then reallocated.  CPU 0's task comes back and references "from_cset"'s
> member, which causes a crash.
>
> CPU 0                                   CPU 1
> count_memcg_event_mm()
> |rcu_read_lock()  <---
> |mem_cgroup_from_task()
> |// css_set_ptr is the "from_cset" mentioned on CPU 1
> |css_set_ptr = rcu_dereference((task)->cgroups)
> |// Hard irq comes, current task is scheduled out.
>
>                                         cgroup_attach_task()
>                                         |cgroup_migrate()
>                                         |cgroup_migrate_execute()
>                                         |css_set_move_task(task, from_cset, to_cset, true)
>                                         |cgroup_move_task(task, to_cset)
>                                         |rcu_assign_pointer(.., to_cset)
>                                         |...
>                                         |cgroup_migrate_finish()
>                                         |put_css_set_locked(from_cset)
>                                         |from_cset->refcount return 0
>                                         |kfree_rcu(cset, rcu_head) // means to free from_cset after new gp
>                                         |add_ptr_to_bulk_krc_lock()
>                                         |schedule_delayed_work(&krcp->monitor_work, ..)
>
>                                         kfree_rcu_monitor()
>                                         |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
>                                         |queue_rcu_work(system_wq, &krwp->rcu_work)
>                                         |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
>                                         |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp
>
>                                         // There is a previous call_rcu(.., rcu_work_rcufn).
>                                         // That gp ends, so rcu_work_rcufn() is called.
>                                         rcu_work_rcufn()
>                                         |__queue_work(.., rwork->wq, &rwork->work);
>
>                                         |kfree_rcu_work()
>                                         |krwp->bulk_head_free[0] bulk is freed before the new gp ends!!!
>                                         |The "from_cset" is freed before the new gp ends.
>
> // The task is scheduled in after many ms.
> |css_set_ptr->subsys[(subsys_id)]  <--- Causes a kernel crash, because css_set_ptr is freed.
>
> v2: Use helper function instead of inserted code block at kfree_rcu_monitor().
>
> Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work")
> Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>

Good catch, thank you!!!  How difficult was this to trigger?  If it can
be triggered easily, this of course needs to go into mainline sooner
rather than later.

Longer term, would it make sense to run the three channels through RCU
separately, in order to avoid one channel refraining from starting a
grace period just because some other channel has callbacks waiting for
a grace period to complete?  One argument against might be energy
efficiency, but perhaps the ->gp_snap field could be used to get the
best of both worlds.
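For concreteness, one shape the ->gp_snap idea could take is the existing
polling grace-period API: record a cookie when a channel's pages are
detached, then wait only if that grace period has not already elapsed.
The following is a sketch only, with hypothetical structure and helper
names, not the actual kernel/rcu/tree.c code:

        #include <linux/list.h>
        #include <linux/rcupdate.h>

        struct channel_batch {
                struct list_head pages;         /* detached bulk pages */
                unsigned long gp_snap;          /* grace-period cookie */
        };

        static void free_channel_pages(struct list_head *pages); /* hypothetical */

        /* Detach pending pages and record the current grace-period state. */
        static void channel_detach(struct channel_batch *cb, struct list_head *src)
        {
                list_replace_init(src, &cb->pages);
                cb->gp_snap = get_state_synchronize_rcu();
        }

        /* Reclaim one channel, sleeping only if its grace period is still pending. */
        static void channel_reclaim(struct channel_batch *cb)
        {
                cond_synchronize_rcu(cb->gp_snap); /* no-op if GP already elapsed */
                free_channel_pages(&cb->pages);
        }

With a per-channel cookie like this, a channel whose grace period has
already elapsed could be reclaimed immediately, without being held up by
(or holding up) the other channels.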
Either way, this fixes only one bug of two.  The second bug is in the
kfree_rcu() tests, which should have caught this bug.  Thoughts on a
good fix for those tests?  (One possible shape for such a check is
sketched at the end of this message.)

I have applied Uladzislau's and Mukesh's tags, and done the usual
wordsmithing as shown at the end of this message.  Please let me know
if I messed anything up.

> ---
>  kernel/rcu/tree.c | 27 +++++++++++++++++++--------
>  1 file changed, 19 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8e880c0..7b95ee9 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3024,6 +3024,18 @@ static void kfree_rcu_work(struct work_struct *work)
>  	return !!READ_ONCE(krcp->head);
>  }
>
> +static bool
> +need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp)
> +{
> +	int i;
> +
> +	for (i = 0; i < FREE_N_CHANNELS; i++)
> +		if (!list_empty(&krwp->bulk_head_free[i]))
> +			return true;
> +
> +	return !!krwp->head_free;

This is fixed from v1, good!

> +}
> +
>  static int krc_count(struct kfree_rcu_cpu *krcp)
>  {
>  	int sum = atomic_read(&krcp->head_count);
> @@ -3107,15 +3119,14 @@ static void kfree_rcu_monitor(struct work_struct *work)
>  	for (i = 0; i < KFREE_N_BATCHES; i++) {
>  		struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
>
> -		// Try to detach bulk_head or head and attach it over any
> -		// available corresponding free channel. It can be that
> -		// a previous RCU batch is in progress, it means that
> -		// immediately to queue another one is not possible so
> -		// in that case the monitor work is rearmed.
> -		if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
> -		    (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
> -		    (READ_ONCE(krcp->head) && !krwp->head_free)) {
> +		// Try to detach bulk_head or head and attach it, only when
> +		// all channels are free. Any channel is not free means at krwp
> +		// there is on-going rcu work to handle krwp's free business.
> +		if (need_wait_for_krwp_work(krwp))
> +			continue;
>
> +		// kvfree_rcu_drain_ready() might handle this krcp, if so give up.
> +		if (need_offload_krc(krcp)) {
>  			// Channel 1 corresponds to the SLAB-pointer bulk path.
>  			// Channel 2 corresponds to vmalloc-pointer bulk path.
>  			for (j = 0; j < FREE_N_CHANNELS; j++) {
> --
> 1.9.1
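For readers not steeped in this code: the structures the patch touches
relate to each other roughly as follows.  This is an abridged sketch
covering only a subset of the fields (mostly those named in the patch);
the full definitions live in kernel/rcu/tree.c:

        #include <linux/list.h>
        #include <linux/rcupdate.h>
        #include <linux/workqueue.h>

        #define FREE_N_CHANNELS 2   /* as in tree.c: kfree() and vmalloc() bulk paths */
        #define KFREE_N_BATCHES 2   /* as in tree.c: in-flight batches per CPU */

        /* One batch already handed to RCU (one per krw_arr[] slot). */
        struct kfree_rcu_cpu_work {
                struct rcu_work rcu_work;       /* queued behind a grace period */
                struct rcu_head *head_free;     /* channel 3: emergency linked list */
                struct list_head bulk_head_free[FREE_N_CHANNELS]; /* channels 1-2 */
        };

        /* Per-CPU staging area for memory not yet handed to RCU. */
        struct kfree_rcu_cpu {
                struct rcu_head *head;          /* channel 3 staging */
                struct list_head bulk_head[FREE_N_CHANNELS]; /* channels 1-2 staging */
                struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
                struct delayed_work monitor_work; /* drives kfree_rcu_monitor() */
        };

The point of need_wait_for_krwp_work() is that a krw_arr[] slot may be
reused only when all three of its channels are empty, not merely the one
channel that happens to have new memory waiting.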
------------------------------------------------------------------------

commit e222f9a512539c3f4093a55d16624d9da614800b
Author: Ziwei Dai <ziwei.dai@unisoc.com>
Date:   Fri Mar 31 20:42:09 2023 +0800

    rcu: Avoid freeing new kfree_rcu() memory after old grace period

    Memory passed to kvfree_rcu() that is to be freed is tracked by a
    per-CPU kfree_rcu_cpu structure, which in turn contains pointers
    to kvfree_rcu_bulk_data structures that contain pointers to memory
    that has not yet been handed to RCU, along with a kfree_rcu_cpu_work
    structure that tracks the memory that has already been handed to RCU.
    These structures track three categories of memory: (1) Memory for
    kfree(), (2) Memory for kvfree(), and (3) Memory for both that arrived
    during an OOM episode.

    The first two categories are tracked in a cache-friendly manner
    involving a dynamically allocated page of pointers (the aforementioned
    kvfree_rcu_bulk_data structures), while the third uses a simple (but
    decidedly cache-unfriendly) linked list through the rcu_head
    structures in each block of memory.

    On a given CPU, these three categories are handled as a unit, with
    that CPU's kfree_rcu_cpu_work structure having one pointer for each
    of the three categories.  Clearly, new memory for a given category
    cannot be placed in the corresponding kfree_rcu_cpu_work structure
    until any old memory has had its grace period elapse and thus has
    been removed.  And the kfree_rcu_monitor() function does in fact
    check for this.

    Except that the kfree_rcu_monitor() function checks these pointers one
    at a time.  This means that if the previous kfree_rcu() memory passed
    to RCU had only category 1 and the current one has only category 2,
    the kfree_rcu_monitor() function will send that current category-2
    memory along immediately.  This can result in memory being freed too
    soon, that is, out from under unsuspecting RCU readers.

    To see this, consider the following sequence of events, in which:

    o	Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset",
    	then is preempted.

    o	CPU 1 calls kfree_rcu(cset, rcu_head) in order to free
    	"from_cset" after a later grace period.  Except that
    	"from_cset" is freed right after the previous grace period
    	ended, so that "from_cset" is immediately freed.  Task A
    	resumes and references "from_cset"'s member, after which
    	nothing good happens.

    In full detail:

    CPU 0                                   CPU 1
    ----------------------                  ----------------------
    count_memcg_event_mm()
    |rcu_read_lock()  <---
    |mem_cgroup_from_task()
    |// css_set_ptr is the "from_cset" mentioned on CPU 1
    |css_set_ptr = rcu_dereference((task)->cgroups)
    |// Hard irq comes, current task is scheduled out.

                                            cgroup_attach_task()
                                            |cgroup_migrate()
                                            |cgroup_migrate_execute()
                                            |css_set_move_task(task, from_cset, to_cset, true)
                                            |cgroup_move_task(task, to_cset)
                                            |rcu_assign_pointer(.., to_cset)
                                            |...
                                            |cgroup_migrate_finish()
                                            |put_css_set_locked(from_cset)
                                            |from_cset->refcount return 0
                                            |kfree_rcu(cset, rcu_head) // free from_cset after new gp
                                            |add_ptr_to_bulk_krc_lock()
                                            |schedule_delayed_work(&krcp->monitor_work, ..)

                                            kfree_rcu_monitor()
                                            |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
                                            |queue_rcu_work(system_wq, &krwp->rcu_work)
                                            |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
                                            |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp

                                            // There is a previous call_rcu(.., rcu_work_rcufn).
                                            // That gp ends, so rcu_work_rcufn() is called.
                                            rcu_work_rcufn()
                                            |__queue_work(.., rwork->wq, &rwork->work);

                                            |kfree_rcu_work()
                                            |krwp->bulk_head_free[0] bulk is freed before the new gp ends!!!
                                            |The "from_cset" is freed before the new gp ends.

    // The task resumes some time later.
    |css_set_ptr->subsys[(subsys_id)]  <--- Causes kernel crash, because css_set_ptr is freed.

    This commit therefore causes kfree_rcu_monitor() to refrain from
    moving kfree_rcu() memory to the kfree_rcu_cpu_work structure until
    the RCU grace period has completed for all three categories.

    v2: Use helper function instead of inserted code block at
    kfree_rcu_monitor().

    Fixes: c014efeef76a ("rcu: Add multiple in-flight batches of kfree_rcu() work")
    Reported-by: Mukesh Ojha
    Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
    Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 859ee02f6614..e2dbea6cee4b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3051,6 +3051,18 @@ need_offload_krc(struct kfree_rcu_cpu *krcp)
 	return !!READ_ONCE(krcp->head);
 }
 
+static bool
+need_wait_for_krwp_work(struct kfree_rcu_cpu_work *krwp)
+{
+	int i;
+
+	for (i = 0; i < FREE_N_CHANNELS; i++)
+		if (!list_empty(&krwp->bulk_head_free[i]))
+			return true;
+
+	return !!krwp->head_free;
+}
+
 static int krc_count(struct kfree_rcu_cpu *krcp)
 {
 	int sum = atomic_read(&krcp->head_count);
@@ -3134,15 +3146,14 @@ static void kfree_rcu_monitor(struct work_struct *work)
 	for (i = 0; i < KFREE_N_BATCHES; i++) {
 		struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
 
-		// Try to detach bulk_head or head and attach it over any
-		// available corresponding free channel. It can be that
-		// a previous RCU batch is in progress, it means that
-		// immediately to queue another one is not possible so
-		// in that case the monitor work is rearmed.
-		if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
-		    (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
-		    (READ_ONCE(krcp->head) && !krwp->head_free)) {
+		// Try to detach bulk_head or head and attach it, only when
+		// all channels are free. Any channel is not free means at krwp
+		// there is on-going rcu work to handle krwp's free business.
+		if (need_wait_for_krwp_work(krwp))
+			continue;
 
+		// kvfree_rcu_drain_ready() might handle this krcp, if so give up.
+		if (need_offload_krc(krcp)) {
 			// Channel 1 corresponds to the SLAB-pointer bulk path.
 			// Channel 2 corresponds to vmalloc-pointer bulk path.
 			for (j = 0; j < FREE_N_CHANNELS; j++) {
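As for the kfree_rcu() test question above: one possibility would be a
debug-time check that records a grace-period cookie when a block is
queued and complains if the block is freed before that grace period has
elapsed.  Again a sketch only, with hypothetical names; a real version
would presumably sit behind a debug Kconfig option:

        #include <linux/bug.h>
        #include <linux/rcupdate.h>

        struct kvfree_dbg {
                unsigned long gp_snap;  /* cookie taken at queue time */
        };

        /* Call when the block is handed to kvfree_call_rcu(). */
        static void kvfree_dbg_on_queue(struct kvfree_dbg *kd)
        {
                kd->gp_snap = get_state_synchronize_rcu();
        }

        /* Call just before the block is actually freed. */
        static void kvfree_dbg_on_free(struct kvfree_dbg *kd)
        {
                /* A full grace period must have elapsed since queue time. */
                WARN_ONCE(!poll_state_synchronize_rcu(kd->gp_snap),
                          "kvfree_rcu: block freed before grace period elapsed");
        }

Such a check would have flagged the case above, where a category-2 batch
rode an already-started grace period instead of waiting for a new one.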