From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Sebastian Andrzej Siewior,
    "Uladzislau Rezki (Sony)", "Paul E. McKenney", Sasha Levin
McKenney" , Sasha Levin Subject: [PATCH 5.10 169/717] rcu/tree: Defer kvfree_rcu() allocation to a clean context Date: Mon, 28 Dec 2020 13:42:47 +0100 Message-Id: <20201228125029.055753459@linuxfoundation.org> X-Mailer: git-send-email 2.29.2 In-Reply-To: <20201228125020.963311703@linuxfoundation.org> References: <20201228125020.963311703@linuxfoundation.org> User-Agent: quilt/0.66 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Uladzislau Rezki (Sony) [ Upstream commit 56292e8609e39537297a7468dda4d87b9bd81d6a ] The current memmory-allocation interface causes the following difficulties for kvfree_rcu(): a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, the lockdep will complain about violation of the nesting rules, as in "BUG: Invalid wait context". This Kconfig option checks for proper raw_spinlock vs. spinlock nesting, in particular, it is not legal to acquire a spinlock_t while holding a raw_spinlock_t. This is a problem because kfree_rcu() uses raw_spinlock_t whereas the "page allocator" internally deals with spinlock_t to access to its zones. The code also can be broken from higher level of view: raw_spin_lock(&some_lock); kfree_rcu(some_pointer, some_field_offset); b) If built with CONFIG_PREEMPT_RT, spinlock_t is converted into sleeplock. This means that invoking the page allocator from atomic contexts results in "BUG: scheduling while atomic". c) Please note that call_rcu() is already invoked from raw atomic context, so it is only reasonable to expaect that kfree_rcu() and kvfree_rcu() will also be called from atomic raw context. This commit therefore defers page allocation to a clean context using the combination of an hrtimer and a workqueue. The hrtimer stage is required in order to avoid deadlocks with the scheduler. This deferred allocation is required only when kvfree_rcu()'s per-CPU page cache is empty. Link: https://lore.kernel.org/lkml/20200630164543.4mdcf6zb4zfclhln@linutronix.de/ Fixes: 3042f83f19be ("rcu: Support reclaim for head-less object") Reported-by: Sebastian Andrzej Siewior Signed-off-by: Uladzislau Rezki (Sony) Signed-off-by: Paul E. McKenney Signed-off-by: Sasha Levin --- kernel/rcu/tree.c | 109 ++++++++++++++++++++++++++++------------------ 1 file changed, 66 insertions(+), 43 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 585bf112ee08d..593df7edfe97f 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -177,7 +177,7 @@ module_param(rcu_unlock_delay, int, 0444); * per-CPU. Object size is equal to one page. This value * can be changed at boot time. */ -static int rcu_min_cached_objs = 2; +static int rcu_min_cached_objs = 5; module_param(rcu_min_cached_objs, int, 0444); /* Retrieve RCU kthreads priority for rcutorture */ @@ -3087,6 +3087,9 @@ struct kfree_rcu_cpu_work { * In order to save some per-cpu space the list is singular. * Even though it is lockless an access has to be protected by the * per-cpu lock. + * @page_cache_work: A work to refill the cache when it is empty + * @work_in_progress: Indicates that page_cache_work is running + * @hrtimer: A hrtimer for scheduling a page_cache_work * @nr_bkv_objs: number of allocated objects at @bkvcache. * * This is a per-CPU structure. 
 kernel/rcu/tree.c | 109 ++++++++++++++++++++++++++++------------------
 1 file changed, 66 insertions(+), 43 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 585bf112ee08d..593df7edfe97f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -177,7 +177,7 @@ module_param(rcu_unlock_delay, int, 0444);
  * per-CPU. Object size is equal to one page. This value
  * can be changed at boot time.
  */
-static int rcu_min_cached_objs = 2;
+static int rcu_min_cached_objs = 5;
 module_param(rcu_min_cached_objs, int, 0444);
 
 /* Retrieve RCU kthreads priority for rcutorture */
@@ -3087,6 +3087,9 @@ struct kfree_rcu_cpu_work {
  *	In order to save some per-cpu space the list is singular.
  *	Even though it is lockless an access has to be protected by the
  *	per-cpu lock.
+ * @page_cache_work: A work to refill the cache when it is empty
+ * @work_in_progress: Indicates that page_cache_work is running
+ * @hrtimer: A hrtimer for scheduling a page_cache_work
  * @nr_bkv_objs: number of allocated objects at @bkvcache.
  *
  * This is a per-CPU structure.  The reason that it is not included in
@@ -3103,6 +3106,11 @@ struct kfree_rcu_cpu {
 	bool monitor_todo;
 	bool initialized;
 	int count;
+
+	struct work_struct page_cache_work;
+	atomic_t work_in_progress;
+	struct hrtimer hrtimer;
+
 	struct llist_head bkvcache;
 	int nr_bkv_objs;
 };
@@ -3220,10 +3228,10 @@ static void kfree_rcu_work(struct work_struct *work)
 			}
 			rcu_lock_release(&rcu_callback_map);
 
-			krcp = krc_this_cpu_lock(&flags);
+			raw_spin_lock_irqsave(&krcp->lock, flags);
 			if (put_cached_bnode(krcp, bkvhead[i]))
 				bkvhead[i] = NULL;
-			krc_this_cpu_unlock(krcp, flags);
+			raw_spin_unlock_irqrestore(&krcp->lock, flags);
 
 			if (bkvhead[i])
 				free_page((unsigned long) bkvhead[i]);
@@ -3350,6 +3358,57 @@ static void kfree_rcu_monitor(struct work_struct *work)
 		raw_spin_unlock_irqrestore(&krcp->lock, flags);
 }
 
+static enum hrtimer_restart
+schedule_page_work_fn(struct hrtimer *t)
+{
+	struct kfree_rcu_cpu *krcp =
+		container_of(t, struct kfree_rcu_cpu, hrtimer);
+
+	queue_work(system_highpri_wq, &krcp->page_cache_work);
+	return HRTIMER_NORESTART;
+}
+
+static void fill_page_cache_func(struct work_struct *work)
+{
+	struct kvfree_rcu_bulk_data *bnode;
+	struct kfree_rcu_cpu *krcp =
+		container_of(work, struct kfree_rcu_cpu,
+			page_cache_work);
+	unsigned long flags;
+	bool pushed;
+	int i;
+
+	for (i = 0; i < rcu_min_cached_objs; i++) {
+		bnode = (struct kvfree_rcu_bulk_data *)
+			__get_free_page(GFP_KERNEL | __GFP_NOWARN);
+
+		if (bnode) {
+			raw_spin_lock_irqsave(&krcp->lock, flags);
+			pushed = put_cached_bnode(krcp, bnode);
+			raw_spin_unlock_irqrestore(&krcp->lock, flags);
+
+			if (!pushed) {
+				free_page((unsigned long) bnode);
+				break;
+			}
+		}
+	}
+
+	atomic_set(&krcp->work_in_progress, 0);
+}
+
+static void
+run_page_cache_worker(struct kfree_rcu_cpu *krcp)
+{
+	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
+			!atomic_xchg(&krcp->work_in_progress, 1)) {
+		hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC,
+			HRTIMER_MODE_REL);
+		krcp->hrtimer.function = schedule_page_work_fn;
+		hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
+	}
+}
+
 static inline bool
 kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
 {
@@ -3366,32 +3425,8 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
 	if (!krcp->bkvhead[idx] ||
 			krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
 		bnode = get_cached_bnode(krcp);
-		if (!bnode) {
-			/*
-			 * To keep this path working on raw non-preemptible
-			 * sections, prevent the optional entry into the
-			 * allocator as it uses sleeping locks. In fact, even
-			 * if the caller of kfree_rcu() is preemptible, this
-			 * path still is not, as krcp->lock is a raw spinlock.
-			 * With additional page pre-allocation in the works,
-			 * hitting this return is going to be much less likely.
-			 */
-			if (IS_ENABLED(CONFIG_PREEMPT_RT))
-				return false;
-
-			/*
-			 * NOTE: For one argument of kvfree_rcu() we can
-			 * drop the lock and get the page in sleepable
-			 * context. That would allow to maintain an array
-			 * for the CONFIG_PREEMPT_RT as well if no cached
-			 * pages are available.
-			 */
-			bnode = (struct kvfree_rcu_bulk_data *)
-				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
-		}
-
 		/* Switch to emergency path. */
-		if (unlikely(!bnode))
+		if (!bnode)
 			return false;
 
 		/* Initialize the new block. */
@@ -3455,12 +3490,10 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 		goto unlock_return;
 	}
 
-	/*
-	 * Under high memory pressure GFP_NOWAIT can fail,
-	 * in that case the emergency path is maintained.
-	 */
 	success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
 	if (!success) {
+		run_page_cache_worker(krcp);
+
 		if (head == NULL)
 			// Inline if kvfree_rcu(one_arg) call.
 			goto unlock_return;
@@ -4452,24 +4485,14 @@ static void __init kfree_rcu_batch_init(void)
 
 	for_each_possible_cpu(cpu) {
 		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
-		struct kvfree_rcu_bulk_data *bnode;
 
 		for (i = 0; i < KFREE_N_BATCHES; i++) {
 			INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
 			krcp->krw_arr[i].krcp = krcp;
 		}
 
-		for (i = 0; i < rcu_min_cached_objs; i++) {
-			bnode = (struct kvfree_rcu_bulk_data *)
-				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
-
-			if (bnode)
-				put_cached_bnode(krcp, bnode);
-			else
-				pr_err("Failed to preallocate for %d CPU!\n", cpu);
-		}
-
 		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
+		INIT_WORK(&krcp->page_cache_work, fill_page_cache_func);
 		krcp->initialized = true;
 	}
 	if (register_shrinker(&kfree_rcu_shrinker))
-- 
2.27.0