Date: Tue, 3 Nov 2020 10:47:23 -0500
From: Joel Fernandes
To: "Uladzislau Rezki (Sony)"
Cc: LKML, RCU, "Paul E. McKenney", Andrew Morton, Peter Zijlstra,
    Michal Hocko, Thomas Gleixner, "Theodore Y. Ts'o",
    Sebastian Andrzej Siewior, Oleksiy Avramchenko, willy@infradead.org
Subject: Re: [PATCH 01/16] rcu/tree: Add a work to allocate pages from regular context
Message-ID: <20201103154723.GA1310511@google.com>
References: <20201029165019.14218-1-urezki@gmail.com>
In-Reply-To: <20201029165019.14218-1-urezki@gmail.com>

On Thu, Oct 29, 2020 at 05:50:04PM +0100, Uladzislau Rezki (Sony) wrote:
> The current memory-allocation interface presents the following
> difficulties that this patch is designed to overcome:
>
> a) If built with CONFIG_PROVE_RAW_LOCK_NESTING, lockdep will
>    complain about a violation ("BUG: Invalid wait context") of the
>    nesting rules. It does the raw_spinlock vs. spinlock nesting
>    checks, i.e. it is not legal to acquire a spinlock_t while
>    holding a raw_spinlock_t.
>
>    Internally kfree_rcu() uses raw_spinlock_t, whereas the
>    "page allocator" internally deals with spinlock_t to access
>    its zones. The code can also be broken from a higher level
>    of view:
>
>        raw_spin_lock(&some_lock);
>        kfree_rcu(some_pointer, some_field_offset);
>
> b) If built with CONFIG_PREEMPT_RT. Please note, in that case spinlock_t
>    is converted into a sleepable variant. Invoking the page allocator from
>    atomic contexts then leads to "BUG: scheduling while atomic".
>
> c) call_rcu() is invoked from raw atomic context, and kfree_rcu()
>    and kvfree_rcu() are expected to be callable from raw atomic context
>    as well.
>
> Move the page allocation out of the contexts which trigger kvfree_rcu()
> and into a separate worker. When the k[v]free_rcu() per-cpu page
> cache is empty, a fallback mechanism is used and a special job is
> scheduled to refill the per-cpu cache.

Looks good, still reviewing here. BTW, just for my education, I was
wondering about Thomas's email:
https://lkml.org/lkml/2020/8/11/939

If slab allocations in pure raw-atomic context on RT are not allowed or
recommended, should kfree_rcu() be allowed? slab can have the same issue,
right? If the per-cpu cache is drained, it has to allocate a page from the
buddy allocator, and there is no GFP flag to tell it about the context the
allocation is happening from. Or are we saying that we want to support
kfree on RT from raw atomic context, even though kmalloc is not supported?
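(To make the nesting problem in (a) above concrete: expanding the two-line
example from the changelog into a compilable, module-style sketch -- the lock
names and the function are made up for illustration, this is not code from the
patch -- gives something like the following, which PROVE_RAW_LOCK_NESTING
flags as "BUG: Invalid wait context":

    #include <linux/spinlock.h>

    static DEFINE_RAW_SPINLOCK(outer_raw_lock); /* raw: stays a spinning lock on RT */
    static DEFINE_SPINLOCK(inner_lock);         /* becomes a sleeping lock on PREEMPT_RT */

    /*
     * Illustration only: taking a spinlock_t while a raw_spinlock_t is held
     * violates the raw/non-raw nesting rules. This mirrors kfree_rcu()
     * (krcp->lock is raw) calling into the page allocator (zone locks are
     * spinlock_t).
     */
    static void invalid_nesting_example(void)
    {
            unsigned long flags;

            raw_spin_lock_irqsave(&outer_raw_lock, flags);
            spin_lock(&inner_lock);   /* invalid: spinlock_t inside raw_spinlock_t */
            spin_unlock(&inner_lock);
            raw_spin_unlock_irqrestore(&outer_raw_lock, flags);
    }

On !PREEMPT_RT this nesting happens to work at runtime and only lockdep
complains; on PREEMPT_RT the inner spinlock_t can sleep, which is why calling
into the page allocator from kfree_rcu()'s raw-locked section is a real
problem and not just a lockdep formality.)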
I hate to bring up this elephant in the room, but since I am one of the
people maintaining this code, I believe I would rather set some rules than
support unsupported usages. :-\ (Once I know what is supported and what
isn't, that is.) If raw atomic kfree_rcu() is indeed a bogus use case because
of -RT, then we ought to put up a giant warning rather than support it :-(.
But I don't stand in the way of this patch if all agree on it.

thanks,

- Joel

> Link: https://lore.kernel.org/lkml/20200630164543.4mdcf6zb4zfclhln@linutronix.de/
> Fixes: 3042f83f19be ("rcu: Support reclaim for head-less object")
> Reported-by: Sebastian Andrzej Siewior
> Signed-off-by: Uladzislau Rezki (Sony)
> ---
>  kernel/rcu/tree.c | 109 ++++++++++++++++++++++++++++------------------
>  1 file changed, 66 insertions(+), 43 deletions(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 06895ef85d69..f2da2a1cc716 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -177,7 +177,7 @@ module_param(rcu_unlock_delay, int, 0444);
>   * per-CPU. Object size is equal to one page. This value
>   * can be changed at boot time.
>   */
> -static int rcu_min_cached_objs = 2;
> +static int rcu_min_cached_objs = 5;
>  module_param(rcu_min_cached_objs, int, 0444);
>
>  /* Retrieve RCU kthreads priority for rcutorture */
> @@ -3084,6 +3084,9 @@ struct kfree_rcu_cpu_work {
>   * In order to save some per-cpu space the list is singular.
>   * Even though it is lockless an access has to be protected by the
>   * per-cpu lock.
> + * @page_cache_work: A work to refill the cache when it is empty
> + * @work_in_progress: Indicates that page_cache_work is running
> + * @hrtimer: A hrtimer for scheduling a page_cache_work
>   * @nr_bkv_objs: number of allocated objects at @bkvcache.
>   *
>   * This is a per-CPU structure. The reason that it is not included in
> @@ -3100,6 +3103,11 @@ struct kfree_rcu_cpu {
>  	bool monitor_todo;
>  	bool initialized;
>  	int count;
> +
> +	struct work_struct page_cache_work;
> +	atomic_t work_in_progress;
> +	struct hrtimer hrtimer;
> +
>  	struct llist_head bkvcache;
>  	int nr_bkv_objs;
>  };
> @@ -3217,10 +3225,10 @@ static void kfree_rcu_work(struct work_struct *work)
>  		}
>  		rcu_lock_release(&rcu_callback_map);
>
> -		krcp = krc_this_cpu_lock(&flags);
> +		raw_spin_lock_irqsave(&krcp->lock, flags);
>  		if (put_cached_bnode(krcp, bkvhead[i]))
>  			bkvhead[i] = NULL;
> -		krc_this_cpu_unlock(krcp, flags);
> +		raw_spin_unlock_irqrestore(&krcp->lock, flags);
>
>  		if (bkvhead[i])
>  			free_page((unsigned long) bkvhead[i]);
> @@ -3347,6 +3355,57 @@ static void kfree_rcu_monitor(struct work_struct *work)
>  	raw_spin_unlock_irqrestore(&krcp->lock, flags);
>  }
>
> +static enum hrtimer_restart
> +schedule_page_work_fn(struct hrtimer *t)
> +{
> +	struct kfree_rcu_cpu *krcp =
> +		container_of(t, struct kfree_rcu_cpu, hrtimer);
> +
> +	queue_work(system_highpri_wq, &krcp->page_cache_work);
> +	return HRTIMER_NORESTART;
> +}
> +
> +static void fill_page_cache_func(struct work_struct *work)
> +{
> +	struct kvfree_rcu_bulk_data *bnode;
> +	struct kfree_rcu_cpu *krcp =
> +		container_of(work, struct kfree_rcu_cpu,
> +			page_cache_work);
> +	unsigned long flags;
> +	bool pushed;
> +	int i;
> +
> +	for (i = 0; i < rcu_min_cached_objs; i++) {
> +		bnode = (struct kvfree_rcu_bulk_data *)
> +			__get_free_page(GFP_KERNEL | __GFP_NOWARN);
> +
> +		if (bnode) {
> +			raw_spin_lock_irqsave(&krcp->lock, flags);
> +			pushed = put_cached_bnode(krcp, bnode);
> +			raw_spin_unlock_irqrestore(&krcp->lock, flags);
> +
> +			if (!pushed) {
> +				free_page((unsigned long) bnode);
> +				break;
> +			}
> +		}
> +	}
> +
> +	atomic_set(&krcp->work_in_progress, 0);
> +}
> +
> +static void
> +run_page_cache_worker(struct kfree_rcu_cpu *krcp)
> +{
> +	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
> +			!atomic_xchg(&krcp->work_in_progress, 1)) {
> +		hrtimer_init(&krcp->hrtimer, CLOCK_MONOTONIC,
> +			HRTIMER_MODE_REL);
> +		krcp->hrtimer.function = schedule_page_work_fn;
> +		hrtimer_start(&krcp->hrtimer, 0, HRTIMER_MODE_REL);
> +	}
> +}
> +
>  static inline bool
>  kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
>  {
> @@ -3363,32 +3422,8 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
>  	if (!krcp->bkvhead[idx] ||
>  	    krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
>  		bnode = get_cached_bnode(krcp);
> -		if (!bnode) {
> -			/*
> -			 * To keep this path working on raw non-preemptible
> -			 * sections, prevent the optional entry into the
> -			 * allocator as it uses sleeping locks. In fact, even
> -			 * if the caller of kfree_rcu() is preemptible, this
> -			 * path still is not, as krcp->lock is a raw spinlock.
> -			 * With additional page pre-allocation in the works,
> -			 * hitting this return is going to be much less likely.
> -			 */
> -			if (IS_ENABLED(CONFIG_PREEMPT_RT))
> -				return false;
> -
> -			/*
> -			 * NOTE: For one argument of kvfree_rcu() we can
> -			 * drop the lock and get the page in sleepable
> -			 * context. That would allow to maintain an array
> -			 * for the CONFIG_PREEMPT_RT as well if no cached
> -			 * pages are available.
> -			 */
> -			bnode = (struct kvfree_rcu_bulk_data *)
> -				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> -		}
> -
>  		/* Switch to emergency path. */
> -		if (unlikely(!bnode))
> +		if (!bnode)
>  			return false;
>
>  		/* Initialize the new block. */
> @@ -3452,12 +3487,10 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
>  		goto unlock_return;
>  	}
>
> -	/*
> -	 * Under high memory pressure GFP_NOWAIT can fail,
> -	 * in that case the emergency path is maintained.
> -	 */
>  	success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
>  	if (!success) {
> +		run_page_cache_worker(krcp);
> +
>  		if (head == NULL)
>  			// Inline if kvfree_rcu(one_arg) call.
>  			goto unlock_return;
> @@ -4449,24 +4482,14 @@ static void __init kfree_rcu_batch_init(void)
>
>  	for_each_possible_cpu(cpu) {
>  		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
> -		struct kvfree_rcu_bulk_data *bnode;
>
>  		for (i = 0; i < KFREE_N_BATCHES; i++) {
>  			INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
>  			krcp->krw_arr[i].krcp = krcp;
>  		}
>
> -		for (i = 0; i < rcu_min_cached_objs; i++) {
> -			bnode = (struct kvfree_rcu_bulk_data *)
> -				__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> -
> -			if (bnode)
> -				put_cached_bnode(krcp, bnode);
> -			else
> -				pr_err("Failed to preallocate for %d CPU!\n", cpu);
> -		}
> -
>  		INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
> +		INIT_WORK(&krcp->page_cache_work, fill_page_cache_func);
>  		krcp->initialized = true;
>  	}
>  	if (register_shrinker(&kfree_rcu_shrinker))
> --
> 2.20.1
>
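As a closing aside for anyone skimming the thread: the fallback flow the
patch introduces boils down to roughly the following self-contained,
module-style sketch. It is an illustration only -- the struct and function
names are invented, and it calls queue_work() directly where the patch first
bounces through an hrtimer and also checks rcu_scheduler_active -- but it
shows the pattern: the raw-atomic caller only ever consumes cached pages and
sets a flag, while a workqueue refills the cache from sleepable context where
GFP_KERNEL is legal.

    #include <linux/workqueue.h>
    #include <linux/llist.h>
    #include <linux/gfp.h>
    #include <linux/atomic.h>
    #include <linux/spinlock.h>

    struct page_cache {
            raw_spinlock_t lock;            /* serializes consumers of @pages */
            struct llist_head pages;        /* cached free pages */
            int nr_pages;
            atomic_t refill_running;        /* plays the role of work_in_progress */
            struct work_struct refill_work;
    };

    /* Process context: sleeping allocations (GFP_KERNEL) are fine here. */
    static void refill_work_fn(struct work_struct *work)
    {
            struct page_cache *pc = container_of(work, struct page_cache, refill_work);
            unsigned long flags, addr;
            int i;

            for (i = 0; i < 5; i++) {
                    addr = __get_free_page(GFP_KERNEL | __GFP_NOWARN);
                    if (!addr)
                            break;
                    raw_spin_lock_irqsave(&pc->lock, flags);
                    llist_add((struct llist_node *)addr, &pc->pages);
                    pc->nr_pages++;
                    raw_spin_unlock_irqrestore(&pc->lock, flags);
            }
            atomic_set(&pc->refill_running, 0);
    }

    /* Called with pc->lock held, possibly in raw atomic context: never allocates. */
    static void *get_cached_page_or_kick_refill(struct page_cache *pc)
    {
            struct llist_node *node = llist_del_first(&pc->pages);

            if (node) {
                    pc->nr_pages--;
                    return node;
            }
            /* Cache is empty: start the refill worker at most once, then bail out. */
            if (!atomic_xchg(&pc->refill_running, 1))
                    queue_work(system_highpri_wq, &pc->refill_work);
            return NULL;
    }

    static void page_cache_init(struct page_cache *pc)
    {
            raw_spin_lock_init(&pc->lock);
            init_llist_head(&pc->pages);
            pc->nr_pages = 0;
            atomic_set(&pc->refill_running, 0);
            INIT_WORK(&pc->refill_work, refill_work_fn);
    }

The important property is that __get_free_page() is only ever called from
refill_work_fn(), i.e. from process context, so neither lockdep's raw/non-raw
nesting rules nor PREEMPT_RT's sleeping spinlocks are violated on the
kvfree_rcu()-style fast path; when the cache is empty the caller simply falls
back to its emergency path for that invocation.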