Hello!
This series contains updates for kvfree_rcu(), perhaps most notably
the removal of single-argument k{,v}free_rcu() in favor of the new-ish
k{,v}free_rcu_mightsleep().
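For those who have not yet converted, the surviving forms look roughly
like this (a minimal sketch; the struct and variable names are
illustrative, not from the series):

	struct foo {
		int a;
		struct rcu_head rh;
	};

	struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

	// Either the two-argument form, which never sleeps and is
	// therefore safe from atomic context:
	kvfree_rcu(p, rh);

	// Or the single-argument replacement, which can block (for
	// example, waiting for a grace period when memory is tight),
	// and thus may be used only from sleepable context:
	kvfree_rcu_mightsleep(p);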
1. rcu/kvfree: Eliminate k[v]free_rcu() single argument macro,
courtesy of "Uladzislau Rezki (Sony)".
2. rcu/kvfree: Add debug to check grace periods.
3. rcu/kvfree: Add debug check for GP complete for kfree_rcu_cpu
list, courtesy of "Uladzislau Rezki (Sony)".
4. rcu/kvfree: Invoke debug_rcu_bhead_unqueue() after checking
bnode->gp_snap, courtesy of Zqiang.
5. rcu/kvfree: Use consistent krcp when growing kfree_rcu() page
cache, courtesy of Zqiang.
6. rcu/kvfree: Do not run a page work if a cache is disabled,
courtesy of "Uladzislau Rezki (Sony)".
7. rcu/kvfree: Make fill page cache start from krcp->nr_bkv_objs,
courtesy of Zqiang.
8. rcu/kvfree: Make drain_page_cache() take early return if cache
is disabled, courtesy of Zqiang.
Thanx, Paul
------------------------------------------------------------------------
b/include/linux/rcupdate.h | 29 ++++++++---------------------
b/kernel/rcu/tree.c | 37 +++++++++++++++++++------------------
kernel/rcu/tree.c | 21 +++++++++++++++++----
3 files changed, 44 insertions(+), 43 deletions(-)
This commit adds debugging checks to verify that the required RCU
grace period has elapsed for each kvfree_rcu_bulk_data structure that
arrives at the kvfree_rcu_bulk() function. These checks make use
of that structure's ->gp_snap field, which has been upgraded from an
unsigned long to an rcu_gp_oldstate structure. This upgrade reduces
the chances of false positives to nearly zero, even on 32-bit systems,
for which this structure carries 64 bits of state.
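For those unfamiliar with the full-state polling API, its usage pattern
is roughly as follows (a minimal sketch, not code from this patch):

	struct rcu_gp_oldstate gps;

	// Take a full-sized snapshot of the current grace-period state.
	get_state_synchronize_rcu_full(&gps);

	// ...time passes...

	// Returns true only if a full grace period has elapsed since the
	// snapshot was taken, so all pre-existing readers have finished.
	if (poll_state_synchronize_rcu_full(&gps)) {
		// Safe to free the memory those readers were using.
	}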
Cc: Ziwei Dai <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 37 +++++++++++++++++++------------------
1 file changed, 19 insertions(+), 18 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f52ff7241041..91d75fd6c579 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2756,7 +2756,7 @@ EXPORT_SYMBOL_GPL(call_rcu);
  */
 struct kvfree_rcu_bulk_data {
 	struct list_head list;
-	unsigned long gp_snap;
+	struct rcu_gp_oldstate gp_snap;
 	unsigned long nr_records;
 	void *records[];
 };
@@ -2921,23 +2921,24 @@ kvfree_rcu_bulk(struct kfree_rcu_cpu *krcp,
 	int i;
 
 	debug_rcu_bhead_unqueue(bnode);
-
-	rcu_lock_acquire(&rcu_callback_map);
-	if (idx == 0) { // kmalloc() / kfree().
-		trace_rcu_invoke_kfree_bulk_callback(
-			rcu_state.name, bnode->nr_records,
-			bnode->records);
-
-		kfree_bulk(bnode->nr_records, bnode->records);
-	} else { // vmalloc() / vfree().
-		for (i = 0; i < bnode->nr_records; i++) {
-			trace_rcu_invoke_kvfree_callback(
-				rcu_state.name, bnode->records[i], 0);
-
-			vfree(bnode->records[i]);
+	if (!WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&bnode->gp_snap))) {
+		rcu_lock_acquire(&rcu_callback_map);
+		if (idx == 0) { // kmalloc() / kfree().
+			trace_rcu_invoke_kfree_bulk_callback(
+				rcu_state.name, bnode->nr_records,
+				bnode->records);
+
+			kfree_bulk(bnode->nr_records, bnode->records);
+		} else { // vmalloc() / vfree().
+			for (i = 0; i < bnode->nr_records; i++) {
+				trace_rcu_invoke_kvfree_callback(
+					rcu_state.name, bnode->records[i], 0);
+
+				vfree(bnode->records[i]);
+			}
 		}
+		rcu_lock_release(&rcu_callback_map);
 	}
-	rcu_lock_release(&rcu_callback_map);
 
 	raw_spin_lock_irqsave(&krcp->lock, flags);
 	if (put_cached_bnode(krcp, bnode))
@@ -3081,7 +3082,7 @@ kvfree_rcu_drain_ready(struct kfree_rcu_cpu *krcp)
 		INIT_LIST_HEAD(&bulk_ready[i]);
 
 		list_for_each_entry_safe_reverse(bnode, n, &krcp->bulk_head[i], list) {
-			if (!poll_state_synchronize_rcu(bnode->gp_snap))
+			if (!poll_state_synchronize_rcu_full(&bnode->gp_snap))
 				break;
 
 			atomic_sub(bnode->nr_records, &krcp->bulk_count[i]);
@@ -3285,7 +3286,7 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
 
 	// Finally insert and update the GP for this page.
 	bnode->records[bnode->nr_records++] = ptr;
-	bnode->gp_snap = get_state_synchronize_rcu();
+	get_state_synchronize_rcu_full(&bnode->gp_snap);
 	atomic_inc(&(*krcp)->bulk_count[idx]);
 
 	return true;
--
2.40.1
From: Zqiang <[email protected]>
If kvfree_rcu_bulk() sees that the required grace period has failed to
elapse, it leaks the memory because readers might still be using it.
But in that case, the debug-objects subsystem still marks the relevant
structures as having been freed, even though they are instead being
leaked.
This commit fixes this mismatch by invoking debug_rcu_bhead_unqueue()
only when we are actually going to free the objects.
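For reference, debug_rcu_bhead_unqueue() amounts to roughly the
following (a simplified sketch of the helper, which is a no-op unless
CONFIG_DEBUG_OBJECTS_RCU_HEAD is enabled):

	static void debug_rcu_bhead_unqueue(struct kvfree_rcu_bulk_data *bhead)
	{
	#ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD
		int i;

		// Tell debug-objects that each block is leaving RCU's
		// custody, which is a lie if the blocks are in fact
		// being leaked rather than freed.
		for (i = 0; i < bhead->nr_records; i++)
			debug_rcu_head_unqueue((struct rcu_head *)(bhead->records[i]));
	#endif
	}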
Signed-off-by: Zqiang <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7452ba97ba34..426f1f3bb5f2 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2922,8 +2922,8 @@ kvfree_rcu_bulk(struct kfree_rcu_cpu *krcp,
 	unsigned long flags;
 	int i;
 
-	debug_rcu_bhead_unqueue(bnode);
 	if (!WARN_ON_ONCE(!poll_state_synchronize_rcu_full(&bnode->gp_snap))) {
+		debug_rcu_bhead_unqueue(bnode);
 		rcu_lock_acquire(&rcu_callback_map);
 		if (idx == 0) { // kmalloc() / kfree().
 			trace_rcu_invoke_kfree_bulk_callback(
--
2.40.1
From: Zqiang <[email protected]>
The add_ptr_to_bulk_krc_lock() function is invoked to allocate a new
kfree_rcu() page, also known as a kvfree_rcu_bulk_data structure.
The kfree_rcu_cpu structure's lock is used to protect this operation,
except that this lock must be momentarily dropped when allocating memory.
It is clearly important that the lock that is reacquired be the same
lock that was acquired initially via krc_this_cpu_lock().
Unfortunately, this same krc_this_cpu_lock() function is used to
re-acquire this lock, and if the task migrated to some other CPU during
the memory allocation, this will result in the kvfree_rcu_bulk_data
structure being added to the wrong CPU's kfree_rcu_cpu structure.
This commit therefore replaces that second call to krc_this_cpu_lock()
with raw_spin_lock_irqsave() in order to explicitly acquire the lock on
the correct kfree_rcu_cpu structure, thus keeping things straight even
when the task migrates.
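To illustrate, the problematic flow went roughly like this (a sketch of
the sequence of events, not the literal code):

	*krcp = krc_this_cpu_lock(flags);	// Locks, say, CPU 0's krcp.
	// No cached page, so momentarily drop the lock to allocate one:
	krc_this_cpu_unlock(*krcp, *flags);
	bnode = (struct kvfree_rcu_bulk_data *)
		__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
	*krcp = krc_this_cpu_lock(flags);	// If the task migrated, this
						// locks and returns CPU 1's
						// krcp, not CPU 0's.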
Signed-off-by: Zqiang <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 426f1f3bb5f2..51d84eabf645 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3279,7 +3279,7 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
 			// scenarios.
 			bnode = (struct kvfree_rcu_bulk_data *)
 				__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
-			*krcp = krc_this_cpu_lock(flags);
+			raw_spin_lock_irqsave(&(*krcp)->lock, *flags);
 		}
 
 		if (!bnode)
--
2.40.1
From: Zqiang <[email protected]>
If the rcutree.rcu_min_cached_objs kernel boot parameter is set to zero,
then krcp->page_cache_work will never be triggered to fill the page
cache. In addition, put_cached_bnode() will not fill the page cache.
As a result, krcp->bkvcache will always be empty, so there is no need
to acquire krcp->lock to get pages from krcp->bkvcache. This commit
therefore makes drain_page_cache() return immediately if
rcu_min_cached_objs is zero.
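(For example, booting with "rcutree.rcu_min_cached_objs=0" on the
kernel command line produces exactly this cache-disabled configuration.)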
Signed-off-by: Zqiang <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 98f2e833e217..00ed45ddc6ca 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2902,6 +2902,9 @@ drain_page_cache(struct kfree_rcu_cpu *krcp)
 	struct llist_node *page_list, *pos, *n;
 	int freed = 0;
 
+	if (!rcu_min_cached_objs)
+		return 0;
+
 	raw_spin_lock_irqsave(&krcp->lock, flags);
 	page_list = llist_del_all(&krcp->bkvcache);
 	WRITE_ONCE(krcp->nr_bkv_objs, 0);
--
2.40.1
From: "Uladzislau Rezki (Sony)" <[email protected]>
By default the cache size is 5 pages per CPU, but it can be disabled
at boot time by setting rcu_min_cached_objs to zero. When that happens,
the current code will uselessly set an hrtimer to schedule refilling
this cache with zero pages. This commit therefore streamlines this
process by simply refusing to set the hrtimer when rcu_min_cached_objs
is zero.
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 51d84eabf645..18f592bf6dc6 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3225,6 +3225,10 @@ static void fill_page_cache_func(struct work_struct *work)
 static void
 run_page_cache_worker(struct kfree_rcu_cpu *krcp)
 {
+	// If cache disabled, bail out.
+	if (!rcu_min_cached_objs)
+		return;
+
 	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
 	    !atomic_xchg(&krcp->work_in_progress, 1)) {
 		if (atomic_read(&krcp->backoff_page_cache_fill)) {
--
2.40.1
From: Zqiang <[email protected]>
When the fill_page_cache_func() function is invoked, it assumes that
the cache of pages is completely empty. However, there can be some time
between triggering execution of this function and its actual invocation.
During this time, kfree_rcu_work() might run, and might fill in part or
all of this cache of pages, thus invalidating the fill_page_cache_func()
function's assumption.
This will not overfill the cache because put_cached_bnode() will reject
the extra page. However, it will result in a needless allocation and
freeing of one extra page, which might not be helpful under lowish-memory
conditions.
This commit therefore causes fill_page_cache_func() to explicitly
account for pages that have been placed into the cache shortly before
it starts running.
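For reference, the rejection in put_cached_bnode() amounts to something
like the following (simplified from the actual function):

	// Reject the page if the cache already holds rcu_min_cached_objs
	// pages, in which case the caller frees the extra page itself.
	if (krcp->nr_bkv_objs >= rcu_min_cached_objs)
		return false;

	llist_add((struct llist_node *) bnode, &krcp->bkvcache);
	WRITE_ONCE(krcp->nr_bkv_objs, krcp->nr_bkv_objs + 1);
	return true;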
Signed-off-by: Zqiang <[email protected]>
Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 18f592bf6dc6..98f2e833e217 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3201,7 +3201,7 @@ static void fill_page_cache_func(struct work_struct *work)
 	nr_pages = atomic_read(&krcp->backoff_page_cache_fill) ?
 		1 : rcu_min_cached_objs;
 
-	for (i = 0; i < nr_pages; i++) {
+	for (i = READ_ONCE(krcp->nr_bkv_objs); i < nr_pages; i++) {
 		bnode = (struct kvfree_rcu_bulk_data *)
 			__get_free_page(GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
 
--
2.40.1