Hello!
This series contains kfree_rcu() updates. Unless otherwise noted,
these are all courtesy of Uladzislau Rezki.
1. Fix a kernel-doc warning for "count", courtesy of Mauro Carvalho
Chehab.
2. rcu/tree: Keep kfree_rcu() awake during lock contention, courtesy
of Joel Fernandes.
3. rcu/tree: Skip entry into the page allocator for PREEMPT_RT,
courtesy of Joel Fernandes.
4. rcu/tree: Repeat the monitor if any free channel is busy.
5. rcu/tree: Make debug_objects logic independent of rcu_head,
courtesy of Joel Fernandes.
6. rcu/tree: Simplify KFREE_BULK_MAX_ENTR macro.
7. rcu/tree: Move kfree_rcu_cpu locking/unlocking to separate
functions.
8. rcu/tree: Use static initializer for krc.lock, courtesy of
Sebastian Andrzej Siewior.
9. rcu/tree: cache specified number of objects.
10. rcu/tree: Maintain separate array for vmalloc ptrs.
11. rcu/tiny: support vmalloc in tiny-RCU.
12. Rename *_kfree_callback/*_kfree_rcu_offset/kfree_call_*.
13. mm/list_lru.c: Rename kvfree_rcu() to local variant.
14. Introduce 2 arg kvfree_rcu() interface.
15. Support reclaim for head-less object.
16. Introduce single argument kvfree_rcu() interface.
17. lib/test_vmalloc.c: Add test cases for kvfree_rcu().
Thanx, Paul
------------------------------------------------------------------------
Documentation/admin-guide/kernel-parameters.txt | 8
include/linux/rcupdate.h | 61 ++-
include/linux/rcutiny.h | 20 -
include/linux/rcutree.h | 2
include/trace/events/rcu.h | 8
kernel/rcu/tiny.c | 7
kernel/rcu/tree.c | 443 ++++++++++++++++--------
lib/test_vmalloc.c | 103 +++++
mm/list_lru.c | 6
9 files changed, 481 insertions(+), 177 deletions(-)
From: "Uladzislau Rezki (Sony)" <[email protected]>
Introduce helpers to lock and unlock per-cpu "kfree_rcu_cpu"
structures. This will make kfree_call_rcu() more readable
and help prevent programming errors.
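For illustration, a caller is then expected to pair the helpers as in
the following sketch (mirroring the kfree_call_rcu() hunk further down;
the variable names are only for this example):

	unsigned long flags;
	struct kfree_rcu_cpu *krcp;

	krcp = krc_this_cpu_lock(&flags);
	/* ... queue the object under krcp->lock ... */
	krc_this_cpu_unlock(krcp, flags);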
Reviewed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 31 +++++++++++++++++++++++--------
1 file changed, 23 insertions(+), 8 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 519a3f5..edc512e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3017,6 +3017,27 @@ debug_rcu_bhead_unqueue(struct kfree_rcu_bulk_data *bhead)
#endif
}
+static inline struct kfree_rcu_cpu *
+krc_this_cpu_lock(unsigned long *flags)
+{
+ struct kfree_rcu_cpu *krcp;
+
+ local_irq_save(*flags); // For safely calling this_cpu_ptr().
+ krcp = this_cpu_ptr(&krc);
+ if (likely(krcp->initialized))
+ raw_spin_lock(&krcp->lock);
+
+ return krcp;
+}
+
+static inline void
+krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
+{
+ if (likely(krcp->initialized))
+ raw_spin_unlock(&krcp->lock);
+ local_irq_restore(flags);
+}
+
/*
* This function is invoked in workqueue context after a grace period.
* It frees all the objects queued on ->bhead_free or ->head_free.
@@ -3242,11 +3263,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
struct kfree_rcu_cpu *krcp;
void *ptr;
- local_irq_save(flags); // For safely calling this_cpu_ptr().
- krcp = this_cpu_ptr(&krc);
- if (krcp->initialized)
- raw_spin_lock(&krcp->lock);
-
+ krcp = krc_this_cpu_lock(&flags);
ptr = (void *)head - (unsigned long)func;
// Queue the object but don't yet schedule the batch.
@@ -3277,9 +3294,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
}
unlock_return:
- if (krcp->initialized)
- raw_spin_unlock(&krcp->lock);
- local_irq_restore(flags);
+ krc_this_cpu_unlock(krcp, flags);
}
EXPORT_SYMBOL_GPL(kfree_call_rcu);
--
2.9.5
From: "Uladzislau Rezki (Sony)" <[email protected]>
In order to reduce the dynamic need for pages in kfree_rcu(),
pre-allocate a configurable number of pages per CPU and link
them in a list. When kfree_rcu() reclaims objects, the object's
container page is cached into a list instead of being released
to the low-level page allocator.
Such an approach provides O(1) access to free pages while also
reducing the number of requests to the page allocator. It also
ensures that the kfree_rcu() code has free pages available under
low-memory conditions.
A read-only sysfs parameter (rcu_min_cached_objs) reflects the
minimum number of allowed cached pages per CPU.
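As a usage sketch (the value below is purely illustrative, not a
recommendation), the cache size can be changed at boot time via the
kernel command line:

	rcutree.rcu_min_cached_objs=4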
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 8 +++
kernel/rcu/tree.c | 66 +++++++++++++++++++++++--
2 files changed, 70 insertions(+), 4 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index fb95fad..befaa63 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4038,6 +4038,14 @@
latencies, which will choose a value aligned
with the appropriate hardware boundaries.
+ rcutree.rcu_min_cached_objs= [KNL]
+ Minimum number of objects which are cached and
+ maintained per one CPU. Object size is equal
+ to PAGE_SIZE. The cache allows to reduce the
+ pressure to page allocator, also it makes the
+ whole algorithm to behave better in low memory
+ condition.
+
rcutree.jiffies_till_first_fqs= [KNL]
Set delay from grace-period initialization to
first attempt to force quiescent states.
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 4fba1e9..f767a37 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -175,6 +175,15 @@ module_param(gp_init_delay, int, 0444);
static int gp_cleanup_delay;
module_param(gp_cleanup_delay, int, 0444);
+/*
+ * This rcu parameter is runtime-read-only. It reflects
+ * a minimum allowed number of objects which can be cached
+ * per-CPU. Object size is equal to one page. This value
+ * can be changed at boot time.
+ */
+static int rcu_min_cached_objs = 2;
+module_param(rcu_min_cached_objs, int, 0444);
+
/* Retrieve RCU kthreads priority for rcutorture */
int rcu_get_gp_kthreads_prio(void)
{
@@ -2979,7 +2988,6 @@ struct kfree_rcu_cpu_work {
* struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace period
* @head: List of kfree_rcu() objects not yet waiting for a grace period
* @bhead: Bulk-List of kfree_rcu() objects not yet waiting for a grace period
- * @bcached: Keeps at most one object for later reuse when build chain blocks
* @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
* @lock: Synchronize access to this structure
* @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
@@ -2995,13 +3003,22 @@ struct kfree_rcu_cpu_work {
struct kfree_rcu_cpu {
struct rcu_head *head;
struct kfree_rcu_bulk_data *bhead;
- struct kfree_rcu_bulk_data *bcached;
struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
raw_spinlock_t lock;
struct delayed_work monitor_work;
bool monitor_todo;
bool initialized;
int count;
+
+ /*
+ * A simple cache list that contains objects for
+ * reuse purpose. In order to save some per-cpu
+ * space the list is singular. Even though it is
+ * lockless an access has to be protected by the
+ * per-cpu lock.
+ */
+ struct llist_head bkvcache;
+ int nr_bkv_objs;
};
static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
@@ -3038,6 +3055,31 @@ krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
local_irq_restore(flags);
}
+static inline struct kfree_rcu_bulk_data *
+get_cached_bnode(struct kfree_rcu_cpu *krcp)
+{
+ if (!krcp->nr_bkv_objs)
+ return NULL;
+
+ krcp->nr_bkv_objs--;
+ return (struct kfree_rcu_bulk_data *)
+ llist_del_first(&krcp->bkvcache);
+}
+
+static inline bool
+put_cached_bnode(struct kfree_rcu_cpu *krcp,
+ struct kfree_rcu_bulk_data *bnode)
+{
+ // Check the limit.
+ if (krcp->nr_bkv_objs >= rcu_min_cached_objs)
+ return false;
+
+ llist_add((struct llist_node *) bnode, &krcp->bkvcache);
+ krcp->nr_bkv_objs++;
+ return true;
+
+}
+
/*
* This function is invoked in workqueue context after a grace period.
* It frees all the objects queued on ->bhead_free or ->head_free.
@@ -3073,7 +3115,12 @@ static void kfree_rcu_work(struct work_struct *work)
kfree_bulk(bhead->nr_records, bhead->records);
rcu_lock_release(&rcu_callback_map);
- if (cmpxchg(&krcp->bcached, NULL, bhead))
+ krcp = krc_this_cpu_lock(&flags);
+ if (put_cached_bnode(krcp, bhead))
+ bhead = NULL;
+ krc_this_cpu_unlock(krcp, flags);
+
+ if (bhead)
free_page((unsigned long) bhead);
cond_resched_tasks_rcu_qs();
@@ -3206,7 +3253,7 @@ kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
/* Check if a new block is required. */
if (!krcp->bhead ||
krcp->bhead->nr_records == KFREE_BULK_MAX_ENTR) {
- bnode = xchg(&krcp->bcached, NULL);
+ bnode = get_cached_bnode(krcp);
if (!bnode) {
WARN_ON_ONCE(sizeof(struct kfree_rcu_bulk_data) > PAGE_SIZE);
@@ -4259,12 +4306,23 @@ static void __init kfree_rcu_batch_init(void)
for_each_possible_cpu(cpu) {
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
+ struct kfree_rcu_bulk_data *bnode;
for (i = 0; i < KFREE_N_BATCHES; i++) {
INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
krcp->krw_arr[i].krcp = krcp;
}
+ for (i = 0; i < rcu_min_cached_objs; i++) {
+ bnode = (struct kfree_rcu_bulk_data *)
+ __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
+
+ if (bnode)
+ put_cached_bnode(krcp, bnode);
+ else
+ pr_err("Failed to preallocate for %d CPU!\n", cpu);
+ }
+
INIT_DELAYED_WORK(&krcp->monitor_work, kfree_rcu_monitor);
krcp->initialized = true;
}
--
2.9.5
From: "Uladzislau Rezki (Sony)" <[email protected]>
To keep vmalloc pointers separate from slab pointers, use an array
of kvfree_rcu_bulk_data structures. It consists of two elements:
- index number 0 corresponds to slab pointers.
- index number 1 corresponds to vmalloc pointers.
Keeping vmalloc pointers separated from slab pointers makes
it possible to invoke the right freeing API for the right
kind of pointer.
It also prepares us for future headless support for vmalloc
and SLAB objects. Such objects cannot be queued on a linked
list and are instead placed directly into an array.
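Conceptually, channel selection reduces to the following sketch, taken
from the kvfree_call_rcu_add_ptr_to_bulk() change in the diff below:

	int idx = !!is_vmalloc_addr(ptr);	/* 0: slab, 1: vmalloc */

	krcp->bkvhead[idx]->records
		[krcp->bkvhead[idx]->nr_records++] = ptr;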
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Reviewed-by: Joel Fernandes (Google) <[email protected]>
Co-developed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 173 +++++++++++++++++++++++++++++++-----------------------
1 file changed, 100 insertions(+), 73 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f767a37..659f67b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -57,6 +57,8 @@
#include <linux/slab.h>
#include <linux/sched/isolation.h>
#include <linux/sched/clock.h>
+#include <linux/vmalloc.h>
+#include <linux/mm.h>
#include "../time/tick-internal.h"
#include "tree.h"
@@ -2948,46 +2950,47 @@ EXPORT_SYMBOL_GPL(call_rcu);
/* Maximum number of jiffies to wait before draining a batch. */
#define KFREE_DRAIN_JIFFIES (HZ / 50)
#define KFREE_N_BATCHES 2
+#define FREE_N_CHANNELS 2
/**
- * struct kfree_rcu_bulk_data - single block to store kfree_rcu() pointers
+ * struct kvfree_rcu_bulk_data - single block to store kvfree_rcu() pointers
* @nr_records: Number of active pointers in the array
- * @records: Array of the kfree_rcu() pointers
* @next: Next bulk object in the block chain
+ * @records: Array of the kvfree_rcu() pointers
*/
-struct kfree_rcu_bulk_data {
+struct kvfree_rcu_bulk_data {
unsigned long nr_records;
- struct kfree_rcu_bulk_data *next;
+ struct kvfree_rcu_bulk_data *next;
void *records[];
};
/*
* This macro defines how many entries the "records" array
* will contain. It is based on the fact that the size of
- * kfree_rcu_bulk_data structure becomes exactly one page.
+ * kvfree_rcu_bulk_data structure becomes exactly one page.
*/
-#define KFREE_BULK_MAX_ENTR \
- ((PAGE_SIZE - sizeof(struct kfree_rcu_bulk_data)) / sizeof(void *))
+#define KVFREE_BULK_MAX_ENTR \
+ ((PAGE_SIZE - sizeof(struct kvfree_rcu_bulk_data)) / sizeof(void *))
/**
* struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests
* @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period
* @head_free: List of kfree_rcu() objects waiting for a grace period
- * @bhead_free: Bulk-List of kfree_rcu() objects waiting for a grace period
+ * @bkvhead_free: Bulk-List of kvfree_rcu() objects waiting for a grace period
* @krcp: Pointer to @kfree_rcu_cpu structure
*/
struct kfree_rcu_cpu_work {
struct rcu_work rcu_work;
struct rcu_head *head_free;
- struct kfree_rcu_bulk_data *bhead_free;
+ struct kvfree_rcu_bulk_data *bkvhead_free[FREE_N_CHANNELS];
struct kfree_rcu_cpu *krcp;
};
/**
* struct kfree_rcu_cpu - batch up kfree_rcu() requests for RCU grace period
* @head: List of kfree_rcu() objects not yet waiting for a grace period
- * @bhead: Bulk-List of kfree_rcu() objects not yet waiting for a grace period
+ * @bkvhead: Bulk-List of kvfree_rcu() objects not yet waiting for a grace period
* @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
* @lock: Synchronize access to this structure
* @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
@@ -3002,7 +3005,7 @@ struct kfree_rcu_cpu_work {
*/
struct kfree_rcu_cpu {
struct rcu_head *head;
- struct kfree_rcu_bulk_data *bhead;
+ struct kvfree_rcu_bulk_data *bkvhead[FREE_N_CHANNELS];
struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
raw_spinlock_t lock;
struct delayed_work monitor_work;
@@ -3026,7 +3029,7 @@ static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
};
static __always_inline void
-debug_rcu_bhead_unqueue(struct kfree_rcu_bulk_data *bhead)
+debug_rcu_bhead_unqueue(struct kvfree_rcu_bulk_data *bhead)
{
#ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD
int i;
@@ -3055,20 +3058,20 @@ krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
local_irq_restore(flags);
}
-static inline struct kfree_rcu_bulk_data *
+static inline struct kvfree_rcu_bulk_data *
get_cached_bnode(struct kfree_rcu_cpu *krcp)
{
if (!krcp->nr_bkv_objs)
return NULL;
krcp->nr_bkv_objs--;
- return (struct kfree_rcu_bulk_data *)
+ return (struct kvfree_rcu_bulk_data *)
llist_del_first(&krcp->bkvcache);
}
static inline bool
put_cached_bnode(struct kfree_rcu_cpu *krcp,
- struct kfree_rcu_bulk_data *bnode)
+ struct kvfree_rcu_bulk_data *bnode)
{
// Check the limit.
if (krcp->nr_bkv_objs >= rcu_min_cached_objs)
@@ -3087,43 +3090,63 @@ put_cached_bnode(struct kfree_rcu_cpu *krcp,
static void kfree_rcu_work(struct work_struct *work)
{
unsigned long flags;
+ struct kvfree_rcu_bulk_data *bkvhead[FREE_N_CHANNELS], *bnext;
struct rcu_head *head, *next;
- struct kfree_rcu_bulk_data *bhead, *bnext;
struct kfree_rcu_cpu *krcp;
struct kfree_rcu_cpu_work *krwp;
+ int i, j;
krwp = container_of(to_rcu_work(work),
struct kfree_rcu_cpu_work, rcu_work);
krcp = krwp->krcp;
+
raw_spin_lock_irqsave(&krcp->lock, flags);
+ // Channels 1 and 2.
+ for (i = 0; i < FREE_N_CHANNELS; i++) {
+ bkvhead[i] = krwp->bkvhead_free[i];
+ krwp->bkvhead_free[i] = NULL;
+ }
+
+ // Channel 3.
head = krwp->head_free;
krwp->head_free = NULL;
- bhead = krwp->bhead_free;
- krwp->bhead_free = NULL;
raw_spin_unlock_irqrestore(&krcp->lock, flags);
- /* "bhead" is now private, so traverse locklessly. */
- for (; bhead; bhead = bnext) {
- bnext = bhead->next;
-
- debug_rcu_bhead_unqueue(bhead);
-
- rcu_lock_acquire(&rcu_callback_map);
- trace_rcu_invoke_kfree_bulk_callback(rcu_state.name,
- bhead->nr_records, bhead->records);
-
- kfree_bulk(bhead->nr_records, bhead->records);
- rcu_lock_release(&rcu_callback_map);
+ // Handle two first channels.
+ for (i = 0; i < FREE_N_CHANNELS; i++) {
+ for (; bkvhead[i]; bkvhead[i] = bnext) {
+ bnext = bkvhead[i]->next;
+ debug_rcu_bhead_unqueue(bkvhead[i]);
+
+ rcu_lock_acquire(&rcu_callback_map);
+ if (i == 0) { // kmalloc() / kfree().
+ trace_rcu_invoke_kfree_bulk_callback(
+ rcu_state.name, bkvhead[i]->nr_records,
+ bkvhead[i]->records);
+
+ kfree_bulk(bkvhead[i]->nr_records,
+ bkvhead[i]->records);
+ } else { // vmalloc() / vfree().
+ for (j = 0; j < bkvhead[i]->nr_records; j++) {
+ trace_rcu_invoke_kfree_callback(
+ rcu_state.name,
+ bkvhead[i]->records[j], 0);
+
+ vfree(bkvhead[i]->records[j]);
+ }
+ }
+ rcu_lock_release(&rcu_callback_map);
- krcp = krc_this_cpu_lock(&flags);
- if (put_cached_bnode(krcp, bhead))
- bhead = NULL;
- krc_this_cpu_unlock(krcp, flags);
+ krcp = krc_this_cpu_lock(&flags);
+ if (put_cached_bnode(krcp, bkvhead[i]))
+ bkvhead[i] = NULL;
+ krc_this_cpu_unlock(krcp, flags);
- if (bhead)
- free_page((unsigned long) bhead);
+ if (bkvhead[i])
+ free_page((unsigned long) bkvhead[i]);
- cond_resched_tasks_rcu_qs();
+ cond_resched_tasks_rcu_qs();
+ }
}
/*
@@ -3141,7 +3164,7 @@ static void kfree_rcu_work(struct work_struct *work)
trace_rcu_invoke_kfree_callback(rcu_state.name, head, offset);
if (!WARN_ON_ONCE(!__is_kfree_rcu_offset(offset)))
- kfree(ptr);
+ kvfree(ptr);
rcu_lock_release(&rcu_callback_map);
cond_resched_tasks_rcu_qs();
@@ -3158,7 +3181,7 @@ static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
{
struct kfree_rcu_cpu_work *krwp;
bool repeat = false;
- int i;
+ int i, j;
lockdep_assert_held(&krcp->lock);
@@ -3166,21 +3189,25 @@ static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
krwp = &(krcp->krw_arr[i]);
/*
- * Try to detach bhead or head and attach it over any
+ * Try to detach bkvhead or head and attach it over any
* available corresponding free channel. It can be that
* a previous RCU batch is in progress, it means that
* immediately to queue another one is not possible so
* return false to tell caller to retry.
*/
- if ((krcp->bhead && !krwp->bhead_free) ||
+ if ((krcp->bkvhead[0] && !krwp->bkvhead_free[0]) ||
+ (krcp->bkvhead[1] && !krwp->bkvhead_free[1]) ||
(krcp->head && !krwp->head_free)) {
- /* Channel 1. */
- if (!krwp->bhead_free) {
- krwp->bhead_free = krcp->bhead;
- krcp->bhead = NULL;
+ // Channel 1 corresponds to SLAB ptrs.
+ // Channel 2 corresponds to vmalloc ptrs.
+ for (j = 0; j < FREE_N_CHANNELS; j++) {
+ if (!krwp->bkvhead_free[j]) {
+ krwp->bkvhead_free[j] = krcp->bkvhead[j];
+ krcp->bkvhead[j] = NULL;
+ }
}
- /* Channel 2. */
+ // Channel 3 corresponds to emergency path.
if (!krwp->head_free) {
krwp->head_free = krcp->head;
krcp->head = NULL;
@@ -3189,16 +3216,17 @@ static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
WRITE_ONCE(krcp->count, 0);
/*
- * One work is per one batch, so there are two "free channels",
- * "bhead_free" and "head_free" the batch can handle. It can be
- * that the work is in the pending state when two channels have
- * been detached following each other, one by one.
+ * One work is per one batch, so there are three
+ * "free channels", the batch can handle. It can
+ * be that the work is in the pending state when
+ * channels have been detached following by each
+ * other.
*/
queue_rcu_work(system_wq, &krwp->rcu_work);
}
- /* Repeat if any "free" corresponding channel is still busy. */
- if (krcp->bhead || krcp->head)
+ // Repeat if any "free" corresponding channel is still busy.
+ if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head)
repeat = true;
}
@@ -3240,23 +3268,22 @@ static void kfree_rcu_monitor(struct work_struct *work)
}
static inline bool
-kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
- struct rcu_head *head, rcu_callback_t func)
+kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
{
- struct kfree_rcu_bulk_data *bnode;
+ struct kvfree_rcu_bulk_data *bnode;
+ int idx;
if (unlikely(!krcp->initialized))
return false;
lockdep_assert_held(&krcp->lock);
+ idx = !!is_vmalloc_addr(ptr);
/* Check if a new block is required. */
- if (!krcp->bhead ||
- krcp->bhead->nr_records == KFREE_BULK_MAX_ENTR) {
+ if (!krcp->bkvhead[idx] ||
+ krcp->bkvhead[idx]->nr_records == KVFREE_BULK_MAX_ENTR) {
bnode = get_cached_bnode(krcp);
if (!bnode) {
- WARN_ON_ONCE(sizeof(struct kfree_rcu_bulk_data) > PAGE_SIZE);
-
/*
* To keep this path working on raw non-preemptible
* sections, prevent the optional entry into the
@@ -3269,7 +3296,7 @@ kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
if (IS_ENABLED(CONFIG_PREEMPT_RT))
return false;
- bnode = (struct kfree_rcu_bulk_data *)
+ bnode = (struct kvfree_rcu_bulk_data *)
__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
}
@@ -3279,30 +3306,30 @@ kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
/* Initialize the new block. */
bnode->nr_records = 0;
- bnode->next = krcp->bhead;
+ bnode->next = krcp->bkvhead[idx];
/* Attach it to the head. */
- krcp->bhead = bnode;
+ krcp->bkvhead[idx] = bnode;
}
/* Finally insert. */
- krcp->bhead->records[krcp->bhead->nr_records++] =
- (void *) head - (unsigned long) func;
+ krcp->bkvhead[idx]->records
+ [krcp->bkvhead[idx]->nr_records++] = ptr;
return true;
}
/*
- * Queue a request for lazy invocation of kfree_bulk()/kfree() after a grace
- * period. Please note there are two paths are maintained, one is the main one
- * that uses kfree_bulk() interface and second one is emergency one, that is
- * used only when the main path can not be maintained temporary, due to memory
- * pressure.
+ * Queue a request for lazy invocation of appropriate free routine after a
+ * grace period. Please note there are three paths are maintained, two are the
+ * main ones that use array of pointers interface and third one is emergency
+ * one, that is used only when the main path can not be maintained temporary,
+ * due to memory pressure.
*
* Each kfree_call_rcu() request is added to a batch. The batch will be drained
* every KFREE_DRAIN_JIFFIES number of jiffies. All the objects in the batch will
* be free'd in workqueue context. This allows us to: batch requests together to
- * reduce the number of grace periods during heavy kfree_rcu() load.
+ * reduce the number of grace periods during heavy kfree_rcu()/kvfree_rcu() load.
*/
void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
{
@@ -3325,7 +3352,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
* Under high memory pressure GFP_NOWAIT can fail,
* in that case the emergency path is maintained.
*/
- if (unlikely(!kfree_call_rcu_add_ptr_to_bulk(krcp, head, func))) {
+ if (unlikely(!kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr))) {
head->func = func;
head->next = krcp->head;
krcp->head = head;
@@ -4306,7 +4333,7 @@ static void __init kfree_rcu_batch_init(void)
for_each_possible_cpu(cpu) {
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
- struct kfree_rcu_bulk_data *bnode;
+ struct kvfree_rcu_bulk_data *bnode;
for (i = 0; i < KFREE_N_BATCHES; i++) {
INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
@@ -4314,7 +4341,7 @@ static void __init kfree_rcu_batch_init(void)
}
for (i = 0; i < rcu_min_cached_objs; i++) {
- bnode = (struct kfree_rcu_bulk_data *)
+ bnode = (struct kvfree_rcu_bulk_data *)
__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
if (bnode)
--
2.9.5
From: "Uladzislau Rezki (Sony)" <[email protected]>
Rename the kvfree_rcu() function to kvfree_rcu_local() in order
to prevent a conflict between two identically named functions:
kvfree_rcu() will become globally visible, which would otherwise
lead to a build error. No functional change.
Cc: [email protected]
Cc: [email protected]
Cc: Andrew Morton <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Reviewed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
mm/list_lru.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 9222910..e825804 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -373,14 +373,14 @@ static void memcg_destroy_list_lru_node(struct list_lru_node *nlru)
struct list_lru_memcg *memcg_lrus;
/*
* This is called when shrinker has already been unregistered,
- * and nobody can use it. So, there is no need to use kvfree_rcu().
+ * and nobody can use it. So, there is no need to use kvfree_rcu_local().
*/
memcg_lrus = rcu_dereference_protected(nlru->memcg_lrus, true);
__memcg_destroy_list_lru_node(memcg_lrus, 0, memcg_nr_cache_ids);
kvfree(memcg_lrus);
}
-static void kvfree_rcu(struct rcu_head *head)
+static void kvfree_rcu_local(struct rcu_head *head)
{
struct list_lru_memcg *mlru;
@@ -419,7 +419,7 @@ static int memcg_update_list_lru_node(struct list_lru_node *nlru,
rcu_assign_pointer(nlru->memcg_lrus, new);
spin_unlock_irq(&nlru->lock);
- call_rcu(&old->rcu, kvfree_rcu);
+ call_rcu(&old->rcu, kvfree_rcu_local);
return 0;
}
--
2.9.5
From: "Joel Fernandes (Google)" <[email protected]>
On PREEMPT_RT kernels, the krcp spinlock gets converted to an rt-mutex
and causes kfree_rcu() callers to sleep. This makes it unusable for
callers in purely atomic sections such as non-threaded IRQ handlers and
raw spinlock sections. Fix it by converting the spinlock to a raw
spinlock.
Vetting all code paths, there is no reason to believe that the raw
spinlock will hurt RT latencies as it is not held for a long time.
Cc: [email protected]
Cc: Uladzislau Rezki <[email protected]>
Reviewed-by: Uladzislau Rezki <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 30 +++++++++++++++---------------
1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 912d466..64592b4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2998,7 +2998,7 @@ struct kfree_rcu_cpu {
struct kfree_rcu_bulk_data *bhead;
struct kfree_rcu_bulk_data *bcached;
struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
- spinlock_t lock;
+ raw_spinlock_t lock;
struct delayed_work monitor_work;
bool monitor_todo;
bool initialized;
@@ -3031,12 +3031,12 @@ static void kfree_rcu_work(struct work_struct *work)
krwp = container_of(to_rcu_work(work),
struct kfree_rcu_cpu_work, rcu_work);
krcp = krwp->krcp;
- spin_lock_irqsave(&krcp->lock, flags);
+ raw_spin_lock_irqsave(&krcp->lock, flags);
head = krwp->head_free;
krwp->head_free = NULL;
bhead = krwp->bhead_free;
krwp->bhead_free = NULL;
- spin_unlock_irqrestore(&krcp->lock, flags);
+ raw_spin_unlock_irqrestore(&krcp->lock, flags);
/* "bhead" is now private, so traverse locklessly. */
for (; bhead; bhead = bnext) {
@@ -3139,14 +3139,14 @@ static inline void kfree_rcu_drain_unlock(struct kfree_rcu_cpu *krcp,
krcp->monitor_todo = false;
if (queue_kfree_rcu_work(krcp)) {
// Success! Our job is done here.
- spin_unlock_irqrestore(&krcp->lock, flags);
+ raw_spin_unlock_irqrestore(&krcp->lock, flags);
return;
}
// Previous RCU batch still in progress, try again later.
krcp->monitor_todo = true;
schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
- spin_unlock_irqrestore(&krcp->lock, flags);
+ raw_spin_unlock_irqrestore(&krcp->lock, flags);
}
/*
@@ -3159,11 +3159,11 @@ static void kfree_rcu_monitor(struct work_struct *work)
struct kfree_rcu_cpu *krcp = container_of(work, struct kfree_rcu_cpu,
monitor_work.work);
- spin_lock_irqsave(&krcp->lock, flags);
+ raw_spin_lock_irqsave(&krcp->lock, flags);
if (krcp->monitor_todo)
kfree_rcu_drain_unlock(krcp, flags);
else
- spin_unlock_irqrestore(&krcp->lock, flags);
+ raw_spin_unlock_irqrestore(&krcp->lock, flags);
}
static inline bool
@@ -3234,7 +3234,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
local_irq_save(flags); // For safely calling this_cpu_ptr().
krcp = this_cpu_ptr(&krc);
if (krcp->initialized)
- spin_lock(&krcp->lock);
+ raw_spin_lock(&krcp->lock);
// Queue the object but don't yet schedule the batch.
if (debug_rcu_head_queue(head)) {
@@ -3265,7 +3265,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
unlock_return:
if (krcp->initialized)
- spin_unlock(&krcp->lock);
+ raw_spin_unlock(&krcp->lock);
local_irq_restore(flags);
}
EXPORT_SYMBOL_GPL(kfree_call_rcu);
@@ -3297,11 +3297,11 @@ kfree_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
count = krcp->count;
- spin_lock_irqsave(&krcp->lock, flags);
+ raw_spin_lock_irqsave(&krcp->lock, flags);
if (krcp->monitor_todo)
kfree_rcu_drain_unlock(krcp, flags);
else
- spin_unlock_irqrestore(&krcp->lock, flags);
+ raw_spin_unlock_irqrestore(&krcp->lock, flags);
sc->nr_to_scan -= count;
freed += count;
@@ -3328,15 +3328,15 @@ void __init kfree_rcu_scheduler_running(void)
for_each_online_cpu(cpu) {
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
- spin_lock_irqsave(&krcp->lock, flags);
+ raw_spin_lock_irqsave(&krcp->lock, flags);
if (!krcp->head || krcp->monitor_todo) {
- spin_unlock_irqrestore(&krcp->lock, flags);
+ raw_spin_unlock_irqrestore(&krcp->lock, flags);
continue;
}
krcp->monitor_todo = true;
schedule_delayed_work_on(cpu, &krcp->monitor_work,
KFREE_DRAIN_JIFFIES);
- spin_unlock_irqrestore(&krcp->lock, flags);
+ raw_spin_unlock_irqrestore(&krcp->lock, flags);
}
}
@@ -4232,7 +4232,7 @@ static void __init kfree_rcu_batch_init(void)
for_each_possible_cpu(cpu) {
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
- spin_lock_init(&krcp->lock);
+ raw_spin_lock_init(&krcp->lock);
for (i = 0; i < KFREE_N_BATCHES; i++) {
INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
krcp->krw_arr[i].krcp = krcp;
--
2.9.5
From: "Uladzislau Rezki (Sony)" <[email protected]>
Update the kvfree_call_rcu() function with head-less support.
This allows RCU to reclaim objects without an embedded rcu_head.
tree-RCU:
We introduce two chains of arrays, one for SLAB-backed pointers and
one for vmalloc pointers. Storage in either of these arrays does not
require embedding an rcu_head within the object.
Maintaining the arrays may become impossible due to high memory
pressure. For such cases there is an emergency path: objects with an
rcu_head inside are simply queued on a backup rcu_head list, which is
drained later on. For the head-less variant, since the current
context is allowed to sleep, the following emergency measures are applied:
a) Synchronously wait until a grace period has elapsed.
b) Call kvfree().
tiny-RCU:
For double-argument calls, there is no change in behavior. For
single-argument calls, kvfree() is invoked directly on the current
stack after a synchronize_rcu() call. Note that for tiny-RCU, any
call to synchronize_rcu() is itself a quiescent state, therefore
it does nothing.
Reviewed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Co-developed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcutiny.h | 18 +++++++++++++++++-
kernel/rcu/tree.c | 45 +++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 60 insertions(+), 3 deletions(-)
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index fb2eb39..5cc9637 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -34,9 +34,25 @@ static inline void synchronize_rcu_expedited(void)
synchronize_rcu();
}
+/*
+ * Add one more declaration of kvfree() here. It is
+ * not so straight forward to just include <linux/mm.h>
+ * where it is defined due to getting many compile
+ * errors caused by that include.
+ */
+extern void kvfree(const void *addr);
+
static inline void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
{
- call_rcu(head, func);
+ if (head) {
+ call_rcu(head, func);
+ return;
+ }
+
+ // kvfree_rcu(one_arg) call.
+ might_sleep();
+ synchronize_rcu();
+ kvfree((void *) func);
}
void rcu_qs(void);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 07fb5a36..38fb3ce 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3296,6 +3296,13 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
if (IS_ENABLED(CONFIG_PREEMPT_RT))
return false;
+ /*
+ * NOTE: For one argument of kvfree_rcu() we can
+ * drop the lock and get the page in sleepable
+ * context. That would allow to maintain an array
+ * for the CONFIG_PREEMPT_RT as well if no cached
+ * pages are available.
+ */
bnode = (struct kvfree_rcu_bulk_data *)
__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
}
@@ -3335,16 +3342,33 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
{
unsigned long flags;
struct kfree_rcu_cpu *krcp;
+ bool success;
void *ptr;
+ if (head) {
+ ptr = (void *) head - (unsigned long) func;
+ } else {
+ /*
+ * Please note there is a limitation for the head-less
+ * variant, that is why there is a clear rule for such
+ * objects: it can be used from might_sleep() context
+ * only. For other places please embed an rcu_head to
+ * your data.
+ */
+ might_sleep();
+ ptr = (unsigned long *) func;
+ }
+
krcp = krc_this_cpu_lock(&flags);
- ptr = (void *)head - (unsigned long)func;
// Queue the object but don't yet schedule the batch.
if (debug_rcu_head_queue(ptr)) {
// Probable double kfree_rcu(), just leak.
WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
__func__, head);
+
+ // Mark as success and leave.
+ success = true;
goto unlock_return;
}
@@ -3352,10 +3376,16 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
* Under high memory pressure GFP_NOWAIT can fail,
* in that case the emergency path is maintained.
*/
- if (unlikely(!kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr))) {
+ success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
+ if (!success) {
+ if (head == NULL)
+ // Inline if kvfree_rcu(one_arg) call.
+ goto unlock_return;
+
head->func = func;
head->next = krcp->head;
krcp->head = head;
+ success = true;
}
WRITE_ONCE(krcp->count, krcp->count + 1);
@@ -3369,6 +3399,17 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
unlock_return:
krc_this_cpu_unlock(krcp, flags);
+
+ /*
+ * Inline kvfree() after synchronize_rcu(). We can do
+ * it from might_sleep() context only, so the current
+ * CPU can pass the QS state.
+ */
+ if (!success) {
+ debug_rcu_head_unqueue((struct rcu_head *) ptr);
+ synchronize_rcu();
+ kvfree(ptr);
+ }
}
EXPORT_SYMBOL_GPL(kvfree_call_rcu);
--
2.9.5
From: "Uladzislau Rezki (Sony)" <[email protected]>
Introduce four new test cases for testing the kvfree_rcu()
interface. Two of them exercise the single-argument functionality
and the other two the two-argument functionality.
The aim is to stress kvfree_rcu() and check how it behaves under
different load and memory conditions, and to analyze its
throughput.
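As one possible invocation (module and parameter names as used in the
diff below; the mask value is illustrative), both vmalloc-backed
kvfree_rcu() tests can be selected like this:

	modprobe test_vmalloc run_test_mask=768   # 256 + 512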
Reviewed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
lib/test_vmalloc.c | 103 ++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 95 insertions(+), 8 deletions(-)
diff --git a/lib/test_vmalloc.c b/lib/test_vmalloc.c
index ddc9685..5cf2fe9 100644
--- a/lib/test_vmalloc.c
+++ b/lib/test_vmalloc.c
@@ -15,6 +15,8 @@
#include <linux/delay.h>
#include <linux/rwsem.h>
#include <linux/mm.h>
+#include <linux/rcupdate.h>
+#include <linux/slab.h>
#define __param(type, name, init, msg) \
static type name = init; \
@@ -35,14 +37,18 @@ __param(int, test_loop_count, 1000000,
__param(int, run_test_mask, INT_MAX,
"Set tests specified in the mask.\n\n"
- "\t\tid: 1, name: fix_size_alloc_test\n"
- "\t\tid: 2, name: full_fit_alloc_test\n"
- "\t\tid: 4, name: long_busy_list_alloc_test\n"
- "\t\tid: 8, name: random_size_alloc_test\n"
- "\t\tid: 16, name: fix_align_alloc_test\n"
- "\t\tid: 32, name: random_size_align_alloc_test\n"
- "\t\tid: 64, name: align_shift_alloc_test\n"
- "\t\tid: 128, name: pcpu_alloc_test\n"
+ "\t\tid: 1, name: fix_size_alloc_test\n"
+ "\t\tid: 2, name: full_fit_alloc_test\n"
+ "\t\tid: 4, name: long_busy_list_alloc_test\n"
+ "\t\tid: 8, name: random_size_alloc_test\n"
+ "\t\tid: 16, name: fix_align_alloc_test\n"
+ "\t\tid: 32, name: random_size_align_alloc_test\n"
+ "\t\tid: 64, name: align_shift_alloc_test\n"
+ "\t\tid: 128, name: pcpu_alloc_test\n"
+ "\t\tid: 256, name: kvfree_rcu_1_arg_vmalloc_test\n"
+ "\t\tid: 512, name: kvfree_rcu_2_arg_vmalloc_test\n"
+ "\t\tid: 1024, name: kvfree_rcu_1_arg_slab_test\n"
+ "\t\tid: 2048, name: kvfree_rcu_2_arg_slab_test\n"
/* Add a new test case description here. */
);
@@ -316,6 +322,83 @@ pcpu_alloc_test(void)
return rv;
}
+struct test_kvfree_rcu {
+ struct rcu_head rcu;
+ unsigned char array[20];
+};
+
+static int
+kvfree_rcu_1_arg_vmalloc_test(void)
+{
+ struct test_kvfree_rcu *p;
+ int i;
+
+ for (i = 0; i < test_loop_count; i++) {
+ p = vmalloc(1 * PAGE_SIZE);
+ if (!p)
+ return -1;
+
+ p->array[0] = 'a';
+ kvfree_rcu(p);
+ }
+
+ return 0;
+}
+
+static int
+kvfree_rcu_2_arg_vmalloc_test(void)
+{
+ struct test_kvfree_rcu *p;
+ int i;
+
+ for (i = 0; i < test_loop_count; i++) {
+ p = vmalloc(1 * PAGE_SIZE);
+ if (!p)
+ return -1;
+
+ p->array[0] = 'a';
+ kvfree_rcu(p, rcu);
+ }
+
+ return 0;
+}
+
+static int
+kvfree_rcu_1_arg_slab_test(void)
+{
+ struct test_kvfree_rcu *p;
+ int i;
+
+ for (i = 0; i < test_loop_count; i++) {
+ p = kmalloc(sizeof(*p), GFP_KERNEL);
+ if (!p)
+ return -1;
+
+ p->array[0] = 'a';
+ kvfree_rcu(p);
+ }
+
+ return 0;
+}
+
+static int
+kvfree_rcu_2_arg_slab_test(void)
+{
+ struct test_kvfree_rcu *p;
+ int i;
+
+ for (i = 0; i < test_loop_count; i++) {
+ p = kmalloc(sizeof(*p), GFP_KERNEL);
+ if (!p)
+ return -1;
+
+ p->array[0] = 'a';
+ kvfree_rcu(p, rcu);
+ }
+
+ return 0;
+}
+
struct test_case_desc {
const char *test_name;
int (*test_func)(void);
@@ -330,6 +413,10 @@ static struct test_case_desc test_case_array[] = {
{ "random_size_align_alloc_test", random_size_align_alloc_test },
{ "align_shift_alloc_test", align_shift_alloc_test },
{ "pcpu_alloc_test", pcpu_alloc_test },
+ { "kvfree_rcu_1_arg_vmalloc_test", kvfree_rcu_1_arg_vmalloc_test },
+ { "kvfree_rcu_2_arg_vmalloc_test", kvfree_rcu_2_arg_vmalloc_test },
+ { "kvfree_rcu_1_arg_slab_test", kvfree_rcu_1_arg_slab_test },
+ { "kvfree_rcu_2_arg_slab_test", kvfree_rcu_2_arg_slab_test },
/* Add a new test case here. */
};
--
2.9.5
From: "Joel Fernandes (Google)" <[email protected]>
To keep the kfree_rcu() code working in purely atomic sections on RT,
such as non-threaded IRQ handlers and raw spinlock sections, avoid
calling into the page allocator which uses sleeping locks on RT.
In fact, even if the caller is preemptible, the kfree_rcu() code is
not, as the krcp->lock is a raw spinlock.
Calling into the page allocator is optional and avoiding it should be
OK, especially with the page pre-allocation support in future patches.
Such pre-allocation would further avoid the need for a dynamically
allocated page in the first place.
Cc: Sebastian Andrzej Siewior <[email protected]>
Reviewed-by: Uladzislau Rezki <[email protected]>
Co-developed-by: Uladzislau Rezki <[email protected]>
Signed-off-by: Uladzislau Rezki <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 64592b4..dbdd509 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3184,6 +3184,18 @@ kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
if (!bnode) {
WARN_ON_ONCE(sizeof(struct kfree_rcu_bulk_data) > PAGE_SIZE);
+ /*
+ * To keep this path working on raw non-preemptible
+ * sections, prevent the optional entry into the
+ * allocator as it uses sleeping locks. In fact, even
+ * if the caller of kfree_rcu() is preemptible, this
+ * path still is not, as krcp->lock is a raw spinlock.
+ * With additional page pre-allocation in the works,
+ * hitting this return is going to be much less likely.
+ */
+ if (IS_ENABLED(CONFIG_PREEMPT_RT))
+ return false;
+
bnode = (struct kfree_rcu_bulk_data *)
__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
}
--
2.9.5
From: "Uladzislau Rezki (Sony)" <[email protected]>
The following changes are introduced:
1. Rename rcu_invoke_kfree_callback() to rcu_invoke_kvfree_callback(),
as well as the associated trace events, so that rcu_kfree_callback()
becomes rcu_kvfree_callback(). The reason is to align with the kvfree()
notation.
2. Rename __is_kfree_rcu_offset to __is_kvfree_rcu_offset. All RCU
paths now use kvfree() instead of kfree(), hence the rename.
3. Rename kfree_call_rcu() to kvfree_call_rcu(), since it is now
capable of freeing vmalloc() memory. Do the same with the
__kfree_rcu() macro, which becomes __kvfree_rcu(), for the same
reason.
Reviewed-by: Joel Fernandes (Google) <[email protected]>
Co-developed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcupdate.h | 14 +++++++-------
include/linux/rcutiny.h | 2 +-
include/linux/rcutree.h | 2 +-
include/trace/events/rcu.h | 8 ++++----
kernel/rcu/tiny.c | 4 ++--
kernel/rcu/tree.c | 16 ++++++++--------
6 files changed, 23 insertions(+), 23 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 659cbfa..b344fc8 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -828,17 +828,17 @@ static inline notrace void rcu_read_unlock_sched_notrace(void)
/*
* Does the specified offset indicate that the corresponding rcu_head
- * structure can be handled by kfree_rcu()?
+ * structure can be handled by kvfree_rcu()?
*/
-#define __is_kfree_rcu_offset(offset) ((offset) < 4096)
+#define __is_kvfree_rcu_offset(offset) ((offset) < 4096)
/*
* Helper macro for kfree_rcu() to prevent argument-expansion eyestrain.
*/
-#define __kfree_rcu(head, offset) \
+#define __kvfree_rcu(head, offset) \
do { \
- BUILD_BUG_ON(!__is_kfree_rcu_offset(offset)); \
- kfree_call_rcu(head, (rcu_callback_t)(unsigned long)(offset)); \
+ BUILD_BUG_ON(!__is_kvfree_rcu_offset(offset)); \
+ kvfree_call_rcu(head, (rcu_callback_t)(unsigned long)(offset)); \
} while (0)
/**
@@ -857,7 +857,7 @@ static inline notrace void rcu_read_unlock_sched_notrace(void)
* Because the functions are not allowed in the low-order 4096 bytes of
* kernel virtual memory, offsets up to 4095 bytes can be accommodated.
* If the offset is larger than 4095 bytes, a compile-time error will
- * be generated in __kfree_rcu(). If this error is triggered, you can
+ * be generated in __kvfree_rcu(). If this error is triggered, you can
* either fall back to use of call_rcu() or rearrange the structure to
* position the rcu_head structure into the first 4096 bytes.
*
@@ -872,7 +872,7 @@ do { \
typeof (ptr) ___p = (ptr); \
\
if (___p) \
- __kfree_rcu(&((___p)->rhf), offsetof(typeof(*(ptr)), rhf)); \
+ __kvfree_rcu(&((___p)->rhf), offsetof(typeof(*(ptr)), rhf)); \
} while (0)
/*
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 8512cae..fb2eb39 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -34,7 +34,7 @@ static inline void synchronize_rcu_expedited(void)
synchronize_rcu();
}
-static inline void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
+static inline void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
{
call_rcu(head, func);
}
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index d5cc9d67..d2f4064 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -33,7 +33,7 @@ static inline void rcu_virt_note_context_switch(int cpu)
}
void synchronize_rcu_expedited(void);
-void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func);
+void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func);
void rcu_barrier(void);
bool rcu_eqs_special_set(int cpu);
diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index f9a7811..0ee93d0 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -506,13 +506,13 @@ TRACE_EVENT_RCU(rcu_callback,
/*
* Tracepoint for the registration of a single RCU callback of the special
- * kfree() form. The first argument is the RCU type, the second argument
+ * kvfree() form. The first argument is the RCU type, the second argument
* is a pointer to the RCU callback, the third argument is the offset
* of the callback within the enclosing RCU-protected data structure,
* the fourth argument is the number of lazy callbacks queued, and the
* fifth argument is the total number of callbacks queued.
*/
-TRACE_EVENT_RCU(rcu_kfree_callback,
+TRACE_EVENT_RCU(rcu_kvfree_callback,
TP_PROTO(const char *rcuname, struct rcu_head *rhp, unsigned long offset,
long qlen),
@@ -596,12 +596,12 @@ TRACE_EVENT_RCU(rcu_invoke_callback,
/*
* Tracepoint for the invocation of a single RCU callback of the special
- * kfree() form. The first argument is the RCU flavor, the second
+ * kvfree() form. The first argument is the RCU flavor, the second
* argument is a pointer to the RCU callback, and the third argument
* is the offset of the callback within the enclosing RCU-protected
* data structure.
*/
-TRACE_EVENT_RCU(rcu_invoke_kfree_callback,
+TRACE_EVENT_RCU(rcu_invoke_kvfree_callback,
TP_PROTO(const char *rcuname, struct rcu_head *rhp, unsigned long offset),
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index 4b99f7b8..aa897c3 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -85,8 +85,8 @@ static inline bool rcu_reclaim_tiny(struct rcu_head *head)
unsigned long offset = (unsigned long)head->func;
rcu_lock_acquire(&rcu_callback_map);
- if (__is_kfree_rcu_offset(offset)) {
- trace_rcu_invoke_kfree_callback("", head, offset);
+ if (__is_kvfree_rcu_offset(offset)) {
+ trace_rcu_invoke_kvfree_callback("", head, offset);
kvfree((void *)head - offset);
rcu_lock_release(&rcu_callback_map);
return true;
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 659f67b..07fb5a36 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2887,8 +2887,8 @@ __call_rcu(struct rcu_head *head, rcu_callback_t func)
return; // Enqueued onto ->nocb_bypass, so just leave.
// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
rcu_segcblist_enqueue(&rdp->cblist, head);
- if (__is_kfree_rcu_offset((unsigned long)func))
- trace_rcu_kfree_callback(rcu_state.name, head,
+ if (__is_kvfree_rcu_offset((unsigned long)func))
+ trace_rcu_kvfree_callback(rcu_state.name, head,
(unsigned long)func,
rcu_segcblist_n_cbs(&rdp->cblist));
else
@@ -3128,7 +3128,7 @@ static void kfree_rcu_work(struct work_struct *work)
bkvhead[i]->records);
} else { // vmalloc() / vfree().
for (j = 0; j < bkvhead[i]->nr_records; j++) {
- trace_rcu_invoke_kfree_callback(
+ trace_rcu_invoke_kvfree_callback(
rcu_state.name,
bkvhead[i]->records[j], 0);
@@ -3161,9 +3161,9 @@ static void kfree_rcu_work(struct work_struct *work)
next = head->next;
debug_rcu_head_unqueue((struct rcu_head *)ptr);
rcu_lock_acquire(&rcu_callback_map);
- trace_rcu_invoke_kfree_callback(rcu_state.name, head, offset);
+ trace_rcu_invoke_kvfree_callback(rcu_state.name, head, offset);
- if (!WARN_ON_ONCE(!__is_kfree_rcu_offset(offset)))
+ if (!WARN_ON_ONCE(!__is_kvfree_rcu_offset(offset)))
kvfree(ptr);
rcu_lock_release(&rcu_callback_map);
@@ -3326,12 +3326,12 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
* one, that is used only when the main path can not be maintained temporary,
* due to memory pressure.
*
- * Each kfree_call_rcu() request is added to a batch. The batch will be drained
+ * Each kvfree_call_rcu() request is added to a batch. The batch will be drained
* every KFREE_DRAIN_JIFFIES number of jiffies. All the objects in the batch will
* be free'd in workqueue context. This allows us to: batch requests together to
* reduce the number of grace periods during heavy kfree_rcu()/kvfree_rcu() load.
*/
-void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
+void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
{
unsigned long flags;
struct kfree_rcu_cpu *krcp;
@@ -3370,7 +3370,7 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
unlock_return:
krc_this_cpu_unlock(krcp, flags);
}
-EXPORT_SYMBOL_GPL(kfree_call_rcu);
+EXPORT_SYMBOL_GPL(kvfree_call_rcu);
static unsigned long
kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
--
2.9.5
From: "Uladzislau Rezki (Sony)" <[email protected]>
We can simplify the KFREE_BULK_MAX_ENTR macro and get rid of
the magic numbers which were used to make the structure exactly
one page in size.
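For example, on a typical 64-bit system with 4 KiB pages, the header of
the structure (nr_records plus next) occupies 16 bytes, so the new macro
evaluates to (4096 - 16) / 8 = 510 record slots per page, with no magic
constants involved.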
Suggested-by: Boqun Feng <[email protected]>
Reviewed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index dc5369a..519a3f5 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2940,13 +2940,6 @@ EXPORT_SYMBOL_GPL(call_rcu);
#define KFREE_DRAIN_JIFFIES (HZ / 50)
#define KFREE_N_BATCHES 2
-/*
- * This macro defines how many entries the "records" array
- * will contain. It is based on the fact that the size of
- * kfree_rcu_bulk_data structure becomes exactly one page.
- */
-#define KFREE_BULK_MAX_ENTR ((PAGE_SIZE / sizeof(void *)) - 3)
-
/**
* struct kfree_rcu_bulk_data - single block to store kfree_rcu() pointers
* @nr_records: Number of active pointers in the array
@@ -2955,10 +2948,18 @@ EXPORT_SYMBOL_GPL(call_rcu);
*/
struct kfree_rcu_bulk_data {
unsigned long nr_records;
- void *records[KFREE_BULK_MAX_ENTR];
struct kfree_rcu_bulk_data *next;
+ void *records[];
};
+/*
+ * This macro defines how many entries the "records" array
+ * will contain. It is based on the fact that the size of
+ * kfree_rcu_bulk_data structure becomes exactly one page.
+ */
+#define KFREE_BULK_MAX_ENTR \
+ ((PAGE_SIZE - sizeof(struct kfree_rcu_bulk_data)) / sizeof(void *))
+
/**
* struct kfree_rcu_cpu_work - single batch of kfree_rcu() requests
* @rcu_work: Let queue_rcu_work() invoke workqueue handler after grace period
--
2.9.5
From: "Joel Fernandes (Google)" <[email protected]>
kfree_rcu()'s debug_objects logic uses the address of the object's
embedded rcu_head to queue/unqueue. Instead of this, make use of the
object's address itself as preparation for future headless kfree_rcu()
support.
Reviewed-by: Uladzislau Rezki <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 29 +++++++++++++----------------
1 file changed, 13 insertions(+), 16 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 31d3b2c..dc5369a 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2952,13 +2952,11 @@ EXPORT_SYMBOL_GPL(call_rcu);
* @nr_records: Number of active pointers in the array
* @records: Array of the kfree_rcu() pointers
* @next: Next bulk object in the block chain
- * @head_free_debug: For debug, when CONFIG_DEBUG_OBJECTS_RCU_HEAD is set
*/
struct kfree_rcu_bulk_data {
unsigned long nr_records;
void *records[KFREE_BULK_MAX_ENTR];
struct kfree_rcu_bulk_data *next;
- struct rcu_head *head_free_debug;
};
/**
@@ -3008,11 +3006,13 @@ struct kfree_rcu_cpu {
static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc);
static __always_inline void
-debug_rcu_head_unqueue_bulk(struct rcu_head *head)
+debug_rcu_bhead_unqueue(struct kfree_rcu_bulk_data *bhead)
{
#ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD
- for (; head; head = head->next)
- debug_rcu_head_unqueue(head);
+ int i;
+
+ for (i = 0; i < bhead->nr_records; i++)
+ debug_rcu_head_unqueue((struct rcu_head *)(bhead->records[i]));
#endif
}
@@ -3042,7 +3042,7 @@ static void kfree_rcu_work(struct work_struct *work)
for (; bhead; bhead = bnext) {
bnext = bhead->next;
- debug_rcu_head_unqueue_bulk(bhead->head_free_debug);
+ debug_rcu_bhead_unqueue(bhead);
rcu_lock_acquire(&rcu_callback_map);
trace_rcu_invoke_kfree_bulk_callback(rcu_state.name,
@@ -3064,14 +3064,15 @@ static void kfree_rcu_work(struct work_struct *work)
*/
for (; head; head = next) {
unsigned long offset = (unsigned long)head->func;
+ void *ptr = (void *)head - offset;
next = head->next;
- debug_rcu_head_unqueue(head);
+ debug_rcu_head_unqueue((struct rcu_head *)ptr);
rcu_lock_acquire(&rcu_callback_map);
trace_rcu_invoke_kfree_callback(rcu_state.name, head, offset);
if (!WARN_ON_ONCE(!__is_kfree_rcu_offset(offset)))
- kfree((void *)head - offset);
+ kfree(ptr);
rcu_lock_release(&rcu_callback_map);
cond_resched_tasks_rcu_qs();
@@ -3210,18 +3211,11 @@ kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
/* Initialize the new block. */
bnode->nr_records = 0;
bnode->next = krcp->bhead;
- bnode->head_free_debug = NULL;
/* Attach it to the head. */
krcp->bhead = bnode;
}
-#ifdef CONFIG_DEBUG_OBJECTS_RCU_HEAD
- head->func = func;
- head->next = krcp->bhead->head_free_debug;
- krcp->bhead->head_free_debug = head;
-#endif
-
/* Finally insert. */
krcp->bhead->records[krcp->bhead->nr_records++] =
(void *) head - (unsigned long) func;
@@ -3245,14 +3239,17 @@ void kfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
{
unsigned long flags;
struct kfree_rcu_cpu *krcp;
+ void *ptr;
local_irq_save(flags); // For safely calling this_cpu_ptr().
krcp = this_cpu_ptr(&krc);
if (krcp->initialized)
raw_spin_lock(&krcp->lock);
+ ptr = (void *)head - (unsigned long)func;
+
// Queue the object but don't yet schedule the batch.
- if (debug_rcu_head_queue(head)) {
+ if (debug_rcu_head_queue(ptr)) {
// Probable double kfree_rcu(), just leak.
WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
__func__, head);
--
2.9.5
From: "Uladzislau Rezki (Sony)" <[email protected]>
Make kvfree_rcu() capable of freeing objects that will not
embed an rcu_head within them. This saves storage overhead in
such objects. Reclaiming headless objects this way requires
only a single argument (pointer to the object).
After this patch, there are two ways to use kvfree_rcu():
a) kvfree_rcu(ptr, rhf);

   struct X {
           struct rcu_head rhf;
           unsigned char data[100];
   };

   struct X *ptr = kvmalloc(sizeof(*ptr), GFP_KERNEL);
   if (ptr)
           kvfree_rcu(ptr, rhf);

b) kvfree_rcu(ptr);

   void *ptr = kvmalloc(some_bytes, GFP_KERNEL);
   if (ptr)
           kvfree_rcu(ptr);

Note that the headless usage (example b) can only be used from code
that is allowed to sleep. This is enforced by the
CONFIG_DEBUG_ATOMIC_SLEEP option.
Co-developed-by: Joel Fernandes (Google) <[email protected]>
Reviewed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcupdate.h | 38 ++++++++++++++++++++++++++++++++++----
1 file changed, 34 insertions(+), 4 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 51b26ab..d15d46d 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -877,12 +877,42 @@ do { \
/**
* kvfree_rcu() - kvfree an object after a grace period.
- * @ptr: pointer to kvfree
- * @rhf: the name of the struct rcu_head within the type of @ptr.
*
- * Same as kfree_rcu(), just simple alias.
+ * This macro consists of one or two arguments and it is
+ * based on whether an object is head-less or not. If it
+ * has a head then a semantic stays the same as it used
+ * to be before:
+ *
+ * kvfree_rcu(ptr, rhf);
+ *
+ * where @ptr is a pointer to kvfree(), @rhf is the name
+ * of the rcu_head structure within the type of @ptr.
+ *
+ * When it comes to head-less variant, only one argument
+ * is passed and that is just a pointer which has to be
+ * freed after a grace period. Therefore the semantic is
+ *
+ * kvfree_rcu(ptr);
+ *
+ * where @ptr is a pointer to kvfree().
+ *
+ * Please note, head-less way of freeing is permitted to
+ * use from a context that has to follow might_sleep()
+ * annotation. Otherwise, please switch and embed the
+ * rcu_head structure within the type of @ptr.
*/
-#define kvfree_rcu(ptr, rhf) kfree_rcu(ptr, rhf)
+#define kvfree_rcu(...) KVFREE_GET_MACRO(__VA_ARGS__, \
+ kvfree_rcu_arg_2, kvfree_rcu_arg_1)(__VA_ARGS__)
+
+#define KVFREE_GET_MACRO(_1, _2, NAME, ...) NAME
+#define kvfree_rcu_arg_2(ptr, rhf) kfree_rcu(ptr, rhf)
+#define kvfree_rcu_arg_1(ptr) \
+do { \
+ typeof(ptr) ___p = (ptr); \
+ \
+ if (___p) \
+ kvfree_call_rcu(NULL, (rcu_callback_t) (___p)); \
+} while (0)
/*
* Place this after a lock-acquisition primitive to guarantee that
--
2.9.5
From: "Uladzislau Rezki (Sony)" <[email protected]>
kvmalloc() can allocate two types of objects: SLAB-backed
and vmalloc-backed. How it behaves depends on the requested
object's size and on memory pressure.
Add a kvfree_rcu() interface that can free memory allocated
via kvmalloc(). It is a simple alias to kfree_rcu() which
can now handle either type of object.
<snip>
   struct test_kvfree_rcu {
           struct rcu_head rcu;
           unsigned char array[100];
   };

   struct test_kvfree_rcu *p;

   p = kvmalloc(10 * PAGE_SIZE, GFP_KERNEL);
   if (p)
           kvfree_rcu(p, rcu);
<snip>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Co-developed-by: Joel Fernandes (Google) <[email protected]>
Reviewed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rcupdate.h | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index b344fc8..51b26ab 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -875,6 +875,15 @@ do { \
__kvfree_rcu(&((___p)->rhf), offsetof(typeof(*(ptr)), rhf)); \
} while (0)
+/**
+ * kvfree_rcu() - kvfree an object after a grace period.
+ * @ptr: pointer to kvfree
+ * @rhf: the name of the struct rcu_head within the type of @ptr.
+ *
+ * Same as kfree_rcu(), just simple alias.
+ */
+#define kvfree_rcu(ptr, rhf) kfree_rcu(ptr, rhf)
+
/*
* Place this after a lock-acquisition primitive to guarantee that
* an UNLOCK+LOCK pair acts as a full barrier. This guarantee applies
--
2.9.5
From: Sebastian Andrzej Siewior <[email protected]>
The per-CPU variable is initialized at runtime in
kfree_rcu_batch_init(). This function is invoked before
'rcu_scheduler_active' is set to 'RCU_SCHEDULER_RUNNING'.
After the initialisation, '->initialized' is set to true.
The raw_spin_lock is only acquired if '->initialized' is
set to true. The workqueue item is only used if 'rcu_scheduler_active'
is set to RCU_SCHEDULER_RUNNING, which happens after initialisation.
Use a static initializer for krc.lock and remove the runtime
initialisation of the lock. Since the lock can now always be
acquired, remove the '->initialized' check.
Cc: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index edc512e..4fba1e9 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2984,7 +2984,7 @@ struct kfree_rcu_cpu_work {
* @lock: Synchronize access to this structure
* @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
* @monitor_todo: Tracks whether a @monitor_work delayed work is pending
- * @initialized: The @lock and @rcu_work fields have been initialized
+ * @initialized: The @rcu_work fields have been initialized
* @count: Number of objects for which GP not started
*
* This is a per-CPU structure. The reason that it is not included in
@@ -3004,7 +3004,9 @@ struct kfree_rcu_cpu {
int count;
};
-static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc);
+static DEFINE_PER_CPU(struct kfree_rcu_cpu, krc) = {
+ .lock = __RAW_SPIN_LOCK_UNLOCKED(krc.lock),
+};
static __always_inline void
debug_rcu_bhead_unqueue(struct kfree_rcu_bulk_data *bhead)
@@ -3024,8 +3026,7 @@ krc_this_cpu_lock(unsigned long *flags)
local_irq_save(*flags); // For safely calling this_cpu_ptr().
krcp = this_cpu_ptr(&krc);
- if (likely(krcp->initialized))
- raw_spin_lock(&krcp->lock);
+ raw_spin_lock(&krcp->lock);
return krcp;
}
@@ -3033,8 +3034,7 @@ krc_this_cpu_lock(unsigned long *flags)
static inline void
krc_this_cpu_unlock(struct kfree_rcu_cpu *krcp, unsigned long flags)
{
- if (likely(krcp->initialized))
- raw_spin_unlock(&krcp->lock);
+ raw_spin_unlock(&krcp->lock);
local_irq_restore(flags);
}
@@ -4260,7 +4260,6 @@ static void __init kfree_rcu_batch_init(void)
for_each_possible_cpu(cpu) {
struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
- raw_spin_lock_init(&krcp->lock);
for (i = 0; i < KFREE_N_BATCHES; i++) {
INIT_RCU_WORK(&krcp->krw_arr[i].rcu_work, kfree_rcu_work);
krcp->krw_arr[i].krcp = krcp;
--
2.9.5
From: "Uladzislau Rezki (Sony)" <[email protected]>
Replace kfree() with kvfree() in rcu_reclaim_tiny().
This makes it possible to release either SLAB or vmalloc
objects after a GP.
Reviewed-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tiny.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index dd572ce..4b99f7b8 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -23,6 +23,7 @@
#include <linux/cpu.h>
#include <linux/prefetch.h>
#include <linux/slab.h>
+#include <linux/mm.h>
#include "rcu.h"
@@ -86,7 +87,7 @@ static inline bool rcu_reclaim_tiny(struct rcu_head *head)
rcu_lock_acquire(&rcu_callback_map);
if (__is_kfree_rcu_offset(offset)) {
trace_rcu_invoke_kfree_callback("", head, offset);
- kfree((void *)head - offset);
+ kvfree((void *)head - offset);
rcu_lock_release(&rcu_callback_map);
return true;
}
--
2.9.5
From: "Uladzislau Rezki (Sony)" <[email protected]>
It is possible that one of the channels cannot be detached
because its corresponding free channel is still busy with
previously queued data that has not been processed yet. At the
same time, another channel can be detached successfully, which
causes the monitor work to stop even though data is still pending.
Prevent that by rescheduling the monitor work if any channel is
still in the pending state after a detach attempt.
Fixes: 34c881745549e ("rcu: Support kfree_bulk() interface in kfree_rcu()")
Acked-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index dbdd509..31d3b2c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3087,7 +3087,7 @@ static void kfree_rcu_work(struct work_struct *work)
static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
{
struct kfree_rcu_cpu_work *krwp;
- bool queued = false;
+ bool repeat = false;
int i;
lockdep_assert_held(&krcp->lock);
@@ -3125,11 +3125,14 @@ static inline bool queue_kfree_rcu_work(struct kfree_rcu_cpu *krcp)
* been detached following each other, one by one.
*/
queue_rcu_work(system_wq, &krwp->rcu_work);
- queued = true;
}
+
+ /* Repeat if any "free" corresponding channel is still busy. */
+ if (krcp->bhead || krcp->head)
+ repeat = true;
}
- return queued;
+ return !repeat;
}
static inline void kfree_rcu_drain_unlock(struct kfree_rcu_cpu *krcp,
--
2.9.5
From: Mauro Carvalho Chehab <[email protected]>
There are some kernel-doc warnings:
./kernel/rcu/tree.c:2915: warning: Function parameter or member 'count' not described in 'kfree_rcu_cpu'
This commit therefore moves the comment for "count" to the kernel-doc
markup.
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c716ead..912d466 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2986,6 +2986,7 @@ struct kfree_rcu_cpu_work {
* @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
* @monitor_todo: Tracks whether a @monitor_work delayed work is pending
* @initialized: The @lock and @rcu_work fields have been initialized
+ * @count: Number of objects for which GP not started
*
* This is a per-CPU structure. The reason that it is not included in
* the rcu_data structure is to permit this code to be extracted from
@@ -3001,7 +3002,6 @@ struct kfree_rcu_cpu {
struct delayed_work monitor_work;
bool monitor_todo;
bool initialized;
- // Number of objects for which GP not started
int count;
};
--
2.9.5
On 2020-06-24 13:12:12 [-0700], [email protected] wrote:
> From: "Joel Fernandes (Google)" <[email protected]>
>
> To keep the kfree_rcu() code working in purely atomic sections on RT,
> such as non-threaded IRQ handlers and raw spinlock sections, avoid
> calling into the page allocator which uses sleeping locks on RT.
>
> In fact, even if the caller is preemptible, the kfree_rcu() code is
> not, as the krcp->lock is a raw spinlock.
>
> Calling into the page allocator is optional and avoiding it should be
> Ok, especially with the page pre-allocation support in future patches.
> Such pre-allocation would further avoid the a need for a dynamically
> allocated page in the first place.
>
> Cc: Sebastian Andrzej Siewior <[email protected]>
> Reviewed-by: Uladzislau Rezki <[email protected]>
> Co-developed-by: Uladzislau Rezki <[email protected]>
> Signed-off-by: Uladzislau Rezki <[email protected]>
> Signed-off-by: Joel Fernandes (Google) <[email protected]>
> Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> kernel/rcu/tree.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 64592b4..dbdd509 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3184,6 +3184,18 @@ kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
> if (!bnode) {
> WARN_ON_ONCE(sizeof(struct kfree_rcu_bulk_data) > PAGE_SIZE);
>
> + /*
> + * To keep this path working on raw non-preemptible
> + * sections, prevent the optional entry into the
> + * allocator as it uses sleeping locks. In fact, even
> + * if the caller of kfree_rcu() is preemptible, this
> + * path still is not, as krcp->lock is a raw spinlock.
> + * With additional page pre-allocation in the works,
> + * hitting this return is going to be much less likely.
> + */
> + if (IS_ENABLED(CONFIG_PREEMPT_RT))
> + return false;
This is not going to work together with the "wait context validator"
(CONFIG_PROVE_RAW_LOCK_NESTING). As of -rc3 it should complain about
printk() which is why it is still disabled by default.
So assume that this is fixed and enabled; then, on !PREEMPT_RT, it will
complain that you have a raw_spinlock_t acquired (the one from patch
02/17) while attempting to acquire a spinlock_t in the memory allocator.
> bnode = (struct kfree_rcu_bulk_data *)
> __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> }
Sebastian
On Tue, Jun 30, 2020 at 06:45:43PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-06-24 13:12:12 [-0700], [email protected] wrote:
> > From: "Joel Fernandes (Google)" <[email protected]>
> >
> > To keep the kfree_rcu() code working in purely atomic sections on RT,
> > such as non-threaded IRQ handlers and raw spinlock sections, avoid
> > calling into the page allocator which uses sleeping locks on RT.
> >
> > In fact, even if the caller is preemptible, the kfree_rcu() code is
> > not, as the krcp->lock is a raw spinlock.
> >
> > Calling into the page allocator is optional and avoiding it should be
> > Ok, especially with the page pre-allocation support in future patches.
> > Such pre-allocation would further avoid the a need for a dynamically
> > allocated page in the first place.
> >
> > Cc: Sebastian Andrzej Siewior <[email protected]>
> > Reviewed-by: Uladzislau Rezki <[email protected]>
> > Co-developed-by: Uladzislau Rezki <[email protected]>
> > Signed-off-by: Uladzislau Rezki <[email protected]>
> > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > kernel/rcu/tree.c | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 64592b4..dbdd509 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3184,6 +3184,18 @@ kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
> > if (!bnode) {
> > WARN_ON_ONCE(sizeof(struct kfree_rcu_bulk_data) > PAGE_SIZE);
> >
> > + /*
> > + * To keep this path working on raw non-preemptible
> > + * sections, prevent the optional entry into the
> > + * allocator as it uses sleeping locks. In fact, even
> > + * if the caller of kfree_rcu() is preemptible, this
> > + * path still is not, as krcp->lock is a raw spinlock.
> > + * With additional page pre-allocation in the works,
> > + * hitting this return is going to be much less likely.
> > + */
> > + if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > + return false;
>
> This is not going to work together with the "wait context validator"
> (CONFIG_PROVE_RAW_LOCK_NESTING). As of -rc3 it should complain about
> printk() which is why it is still disabled by default.
Fixing that should be "interesting". In particular, RCU CPU stall
warnings rely on the raw spin lock to reduce false positives due
to race conditions. Some thought will be required here.
> So assume that this is fixed and enabled then on !PREEMPT_RT it will
> complain that you have a raw_spinlock_t acquired (the one from patch
> 02/17) and attempt to acquire a spinlock_t in the memory allocator.
Given that the slab allocator doesn't acquire any locks until it gets
a fair way in, wouldn't it make sense to allow a "shallow" allocation
while a raw spinlock is held? This would require yet another GFP_ flag,
but that won't make all that much of a difference in the total. ;-)
Thanx, Paul
> > bnode = (struct kfree_rcu_bulk_data *)
> > __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > }
>
> Sebastian
On 2020-06-30 11:35:34 [-0700], Paul E. McKenney wrote:
> > This is not going to work together with the "wait context validator"
> > (CONFIG_PROVE_RAW_LOCK_NESTING). As of -rc3 it should complain about
> > printk() which is why it is still disabled by default.
>
> Fixing that should be "interesting". In particular, RCU CPU stall
> warnings rely on the raw spin lock to reduce false positives due
> to race conditions. Some thought will be required here.
I don't get this part. Can you explain/give me an example where to look
at?
> > So assume that this is fixed and enabled then on !PREEMPT_RT it will
> > complain that you have a raw_spinlock_t acquired (the one from patch
> > 02/17) and attempt to acquire a spinlock_t in the memory allocator.
>
> Given that the slab allocator doesn't acquire any locks until it gets
> a fair way in, wouldn't it make sense to allow a "shallow" allocation
> while a raw spinlock is held? This would require yet another GFP_ flag,
> but that won't make all that much of a difference in the total. ;-)
That would be one way of dealing with it. But we could go back to
spinlock_t and keep the memory allocation even for RT as is. I don't see
a downside of this. And we would worry about kfree_rcu() from a real
IRQ-off region once we get to it.
> Thanx, Paul
>
> > > bnode = (struct kfree_rcu_bulk_data *)
> > > __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > > }
Sebastian
On Thu, Jul 02, 2020 at 04:12:16PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-06-30 11:35:34 [-0700], Paul E. McKenney wrote:
> > > This is not going to work together with the "wait context validator"
> > > (CONFIG_PROVE_RAW_LOCK_NESTING). As of -rc3 it should complain about
> > > printk() which is why it is still disabled by default.
> >
> > Fixing that should be "interesting". In particular, RCU CPU stall
> > warnings rely on the raw spin lock to reduce false positives due
> > to race conditions. Some thought will be required here.
>
> I don't get this part. Can you explain/give me an example where to look
> at?
Starting from the scheduler-clock interrupt's call into RCU,
we have rcu_sched_clock_irq() which calls rcu_pending() which
calls check_cpu_stall() which calls either print_cpu_stall() or
print_other_cpu_stall(), depending on whether the stall is happening on
the current CPU or on some other CPU, respectively.
Both of these last functions acquire the rcu_node structure's raw ->lock
and expect to do printk()s while holding it.
Not that the lock matters all that much given that all of this code is
in hardirq context. Which is necessary if RCU CPU stall warnings are
to be at all reliable. Yeah, yeah, they would be even more reliable if
invoked in NMI context but that would prevent them from acquiring the
->lock that is used to resolve races between multiple concurrent stall
warnings and between a stall warning and the end of that stall. ;-)
I could imagine accumulating the data in hardirq context and then
using a workqueue or some such to print it all out, but that
requires that workqueues be fully operational for RCU CPU stall
warnings to appear. :-(
Thoughts?
> > > So assume that this is fixed and enabled then on !PREEMPT_RT it will
> > > complain that you have a raw_spinlock_t acquired (the one from patch
> > > 02/17) and attempt to acquire a spinlock_t in the memory allocator.
> >
> > Given that the slab allocator doesn't acquire any locks until it gets
> > a fair way in, wouldn't it make sense to allow a "shallow" allocation
> > while a raw spinlock is held? This would require yet another GFP_ flag,
> > but that won't make all that much of a difference in the total. ;-)
>
> That would be one way of dealing with. But we could go back to
> spinlock_t and keep the memory allocation even for RT as is. I don't see
> a downside of this. And we would worry about kfree_rcu() from real
> IRQ-off region once we get to it.
Once we get to it, your thought would be to do per-CPU queuing of
memory from IRQ-off kfree_rcu(), and have IRQ work or some such clean
up after it? Or did you have some other trick in mind?
Thanx, Paul
> > > > bnode = (struct kfree_rcu_bulk_data *)
> > > > __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> > > > }
>
> Sebastian
On Thu, Jul 02, 2020 at 04:12:16PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-06-30 11:35:34 [-0700], Paul E. McKenney wrote:
> > > This is not going to work together with the "wait context validator"
> > > (CONFIG_PROVE_RAW_LOCK_NESTING). As of -rc3 it should complain about
> > > printk() which is why it is still disabled by default.
> >
> > Fixing that should be "interesting". In particular, RCU CPU stall
> > warnings rely on the raw spin lock to reduce false positives due
> > to race conditions. Some thought will be required here.
>
> I don't get this part. Can you explain/give me an example where to look
> at?
>
> > > So assume that this is fixed and enabled then on !PREEMPT_RT it will
> > > complain that you have a raw_spinlock_t acquired (the one from patch
> > > 02/17) and attempt to acquire a spinlock_t in the memory allocator.
> >
> > Given that the slab allocator doesn't acquire any locks until it gets
> > a fair way in, wouldn't it make sense to allow a "shallow" allocation
> > while a raw spinlock is held? This would require yet another GFP_ flag,
> > but that won't make all that much of a difference in the total. ;-)
>
> That would be one way of dealing with. But we could go back to
> spinlock_t and keep the memory allocation even for RT as is. I don't see
> a downside of this. And we would worry about kfree_rcu() from real
> IRQ-off region once we get to it.
>
Another way of fixing it is just dropping the lock and letting the page
allocator do an allocation without our "upper/local" lock held. I made a
proposal like that once upon a time, so maybe it is time to highlight
it again:
<snip>
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 21c2fa5bd8c3..249f10a89bb9 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3278,9 +3278,11 @@ static void kfree_rcu_monitor(struct work_struct *work)
}
static inline bool
-kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
+kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
+ void *ptr, unsigned long *flags)
{
struct kvfree_rcu_bulk_data *bnode;
+ struct kfree_rcu_cpu *tmp;
int idx;
if (unlikely(!krcp->initialized))
@@ -3306,6 +3308,9 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
if (IS_ENABLED(CONFIG_PREEMPT_RT))
return false;
+ migrate_disable();
+ krc_this_cpu_unlock(krcp, *flags);
+
/*
* NOTE: For one argument of kvfree_rcu() we can
* drop the lock and get the page in sleepable
@@ -3315,6 +3320,12 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
*/
bnode = (struct kvfree_rcu_bulk_data *)
__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
+
+ tmp = krc_this_cpu_lock(flags);
+ migrate_enable();
+
+ /* Sanity check, just in case. */
+ WARN_ON(tmp != krcp);
}
/* Switch to emergency path. */
@@ -3386,7 +3397,7 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
* Under high memory pressure GFP_NOWAIT can fail,
* in that case the emergency path is maintained.
*/
- success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
+ success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr, &flags);
if (!success) {
if (head == NULL)
// Inline if kvfree_rcu(one_arg) call.
<snip>
--
Vlad Rezki
On 2020-07-02 09:48:26 [-0700], Paul E. McKenney wrote:
> On Thu, Jul 02, 2020 at 04:12:16PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2020-06-30 11:35:34 [-0700], Paul E. McKenney wrote:
> > > > This is not going to work together with the "wait context validator"
> > > > (CONFIG_PROVE_RAW_LOCK_NESTING). As of -rc3 it should complain about
> > > > printk() which is why it is still disabled by default.
> > >
> > > Fixing that should be "interesting". In particular, RCU CPU stall
> > > warnings rely on the raw spin lock to reduce false positives due
> > > to race conditions. Some thought will be required here.
> >
> > I don't get this part. Can you explain/give me an example where to look
> > at?
>
> Starting from the scheduler-clock interrupt's call into RCU,
> we have rcu_sched_clock_irq() which calls rcu_pending() which
> calls check_cpu_stall() which calls either print_cpu_stall() or
> print_other_cpu_stall(), depending on whether the stall is happening on
> the current CPU or on some other CPU, respectively.
>
> Both of these last functions acquire the rcu_node structure's raw ->lock
> and expect to do printk()s while holding it.
…
> Thoughts?
Okay. So in the RT queue there is a printk() rewrite which fixes this
kind of thing. Upstream, the printk() interface is still broken in this
regard and therefore CONFIG_PROVE_RAW_LOCK_NESTING is disabled.
[Earlier the workqueue would also trigger a warning but this has been
fixed as of v5.8-rc1.]
This was just me explaining why this is bad, what debug function would
report it and why it is not enabled by default.
> > > > So assume that this is fixed and enabled then on !PREEMPT_RT it will
> > > > complain that you have a raw_spinlock_t acquired (the one from patch
> > > > 02/17) and attempt to acquire a spinlock_t in the memory allocator.
> > >
> > > Given that the slab allocator doesn't acquire any locks until it gets
> > > a fair way in, wouldn't it make sense to allow a "shallow" allocation
> > > while a raw spinlock is held? This would require yet another GFP_ flag,
> > > but that won't make all that much of a difference in the total. ;-)
> >
> > That would be one way of dealing with. But we could go back to
> > spinlock_t and keep the memory allocation even for RT as is. I don't see
> > a downside of this. And we would worry about kfree_rcu() from real
> > IRQ-off region once we get to it.
>
> Once we get to it, your thought would be to do per-CPU queuing of
> memory from IRQ-off kfree_rcu(), and have IRQ work or some such clean
> up after it? Or did you have some other trick in mind?
So for now I would very much like to revert the raw_spinlock_t back to
the spinlock_t and add a migrate_disable() just to avoid the tiny
possible window for migration between obtaining the CPU-ptr and acquiring
the lock (I think Joel was afraid of a performance hit).
Should we get to a *real* use case where someone must invoke kfree_rcu()
from a hard-IRQ-off region then we can think about what makes sense. Per-CPU
queues and IRQ-work would be one way of dealing with it.
> Thanx, Paul
>
Sebastian
On Thu, Jul 02, 2020 at 09:45:06PM +0200, Uladzislau Rezki wrote:
> On Thu, Jul 02, 2020 at 04:12:16PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2020-06-30 11:35:34 [-0700], Paul E. McKenney wrote:
> > > > This is not going to work together with the "wait context validator"
> > > > (CONFIG_PROVE_RAW_LOCK_NESTING). As of -rc3 it should complain about
> > > > printk() which is why it is still disabled by default.
> > >
> > > Fixing that should be "interesting". In particular, RCU CPU stall
> > > warnings rely on the raw spin lock to reduce false positives due
> > > to race conditions. Some thought will be required here.
> >
> > I don't get this part. Can you explain/give me an example where to look
> > at?
> >
> > > > So assume that this is fixed and enabled then on !PREEMPT_RT it will
> > > > complain that you have a raw_spinlock_t acquired (the one from patch
> > > > 02/17) and attempt to acquire a spinlock_t in the memory allocator.
> > >
> > > Given that the slab allocator doesn't acquire any locks until it gets
> > > a fair way in, wouldn't it make sense to allow a "shallow" allocation
> > > while a raw spinlock is held? This would require yet another GFP_ flag,
> > > but that won't make all that much of a difference in the total. ;-)
> >
> > That would be one way of dealing with. But we could go back to
> > spinlock_t and keep the memory allocation even for RT as is. I don't see
> > a downside of this. And we would worry about kfree_rcu() from real
> > IRQ-off region once we get to it.
> >
Sorry for my late reply as the day job and family demanded a lot last week...
> Another way of fixing it is just dropping the lock letting the page
> allocator to do an allocation without our "upper/local" lock. I did a
> proposal like that once upon a time, so maybe it is a time to highlight
> it again:
> <snip>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 21c2fa5bd8c3..249f10a89bb9 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3278,9 +3278,11 @@ static void kfree_rcu_monitor(struct work_struct *work)
> }
>
> static inline bool
> -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> +kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
> + void *ptr, unsigned long *flags)
> {
> struct kvfree_rcu_bulk_data *bnode;
> + struct kfree_rcu_cpu *tmp;
> int idx;
>
> if (unlikely(!krcp->initialized))
> @@ -3306,6 +3308,9 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> if (IS_ENABLED(CONFIG_PREEMPT_RT))
> return false;
>
> + migrate_disable();
> + krc_this_cpu_unlock(krcp, *flags);
If I remember, the issue here is that migrate_disable is not implemented on a
non-RT kernel due to issues with starvation.
What about using mempools? If we can allocate from one without entering
the page allocator, that would be one way. Not sure if there are
other issues, such as the mempool allocation itself acquiring regular spinlocks.
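To make that concrete, a rough and completely untested sketch of what I
mean (the pool name and sizing are made up; note that mempool_alloc()
takes a regular spinlock internally, which may already rule this out on RT):
<snip>
#include <linux/mempool.h>
#include <linux/mm.h>

/* A small reserve of order-0 pages, set up once at init time,
 * e.g. from kfree_rcu_batch_init(). */
static mempool_t *krc_page_pool;

static int __init krc_page_pool_init(void)
{
	/* Keep 16 pages in reserve, number picked out of thin air. */
	krc_page_pool = mempool_create_page_pool(16, 0);
	return krc_page_pool ? 0 : -ENOMEM;
}

/* Would replace __get_free_page() in the atomic kfree_rcu() path. */
static void *krc_get_page(void)
{
	struct page *page;

	/* Falls back to the reserved pages when GFP_NOWAIT fails. */
	page = mempool_alloc(krc_page_pool, GFP_NOWAIT | __GFP_NOWARN);
	return page ? page_address(page) : NULL;
}
<snip>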
We could also just do what Sebastian is suggesting, which is to revert to
regular spinlocks and allow this code to sleep, while worrying about
atomic callers later. For RT, could such atomic callers perhaps do their
allocations differently, such as allocating in advance or later on in process
context?
thanks,
- Joel
> +
> /*
> * NOTE: For one argument of kvfree_rcu() we can
> * drop the lock and get the page in sleepable
> @@ -3315,6 +3320,12 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> */
> bnode = (struct kvfree_rcu_bulk_data *)
> __get_free_page(GFP_NOWAIT | __GFP_NOWARN);
> +
> + tmp = krc_this_cpu_lock(flags);
> + migrate_enable();
> +
> + /* Sanity check, just in case. */
> + WARN_ON(tmp != krcp);
> }
>
> /* Switch to emergency path. */
> @@ -3386,7 +3397,7 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> * Under high memory pressure GFP_NOWAIT can fail,
> * in that case the emergency path is maintained.
> */
> - success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
> + success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr, &flags);
> if (!success) {
> if (head == NULL)
> // Inline if kvfree_rcu(one_arg) call.
> <snip>
>
> --
> Vlad Rezki
On Mon, Jul 06, 2020 at 03:42:32PM -0400, Joel Fernandes wrote:
> On Thu, Jul 02, 2020 at 09:45:06PM +0200, Uladzislau Rezki wrote:
> > On Thu, Jul 02, 2020 at 04:12:16PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2020-06-30 11:35:34 [-0700], Paul E. McKenney wrote:
> > > > > This is not going to work together with the "wait context validator"
> > > > > (CONFIG_PROVE_RAW_LOCK_NESTING). As of -rc3 it should complain about
> > > > > printk() which is why it is still disabled by default.
> > > >
> > > > Fixing that should be "interesting". In particular, RCU CPU stall
> > > > warnings rely on the raw spin lock to reduce false positives due
> > > > to race conditions. Some thought will be required here.
> > >
> > > I don't get this part. Can you explain/give me an example where to look
> > > at?
> > >
> > > > > So assume that this is fixed and enabled then on !PREEMPT_RT it will
> > > > > complain that you have a raw_spinlock_t acquired (the one from patch
> > > > > 02/17) and attempt to acquire a spinlock_t in the memory allocator.
> > > >
> > > > Given that the slab allocator doesn't acquire any locks until it gets
> > > > a fair way in, wouldn't it make sense to allow a "shallow" allocation
> > > > while a raw spinlock is held? This would require yet another GFP_ flag,
> > > > but that won't make all that much of a difference in the total. ;-)
> > >
> > > That would be one way of dealing with. But we could go back to
> > > spinlock_t and keep the memory allocation even for RT as is. I don't see
> > > a downside of this. And we would worry about kfree_rcu() from real
> > > IRQ-off region once we get to it.
> > >
>
> Sorry for my late reply as the day job and family demanded a lot last week...
>
> > Another way of fixing it is just dropping the lock letting the page
> > allocator to do an allocation without our "upper/local" lock. I did a
> > proposal like that once upon a time, so maybe it is a time to highlight
> > it again:
> > <snip>
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 21c2fa5bd8c3..249f10a89bb9 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3278,9 +3278,11 @@ static void kfree_rcu_monitor(struct work_struct *work)
> > }
> >
> > static inline bool
> > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > +kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
> > + void *ptr, unsigned long *flags)
> > {
> > struct kvfree_rcu_bulk_data *bnode;
> > + struct kfree_rcu_cpu *tmp;
> > int idx;
> >
> > if (unlikely(!krcp->initialized))
> > @@ -3306,6 +3308,9 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > return false;
> >
> > + migrate_disable();
> > + krc_this_cpu_unlock(krcp, *flags);
>
> If I remember, the issue here is that migrate_disable is not implemented on a
> non-RT kernel due to issues with starvation.
>
It is implemented. Please have a look at linux/preempt.h for the regular kernel.
--
Vlad Rezki
On Mon, Jul 06, 2020 at 09:55:57PM +0200, Uladzislau Rezki wrote:
[...]
> > > Another way of fixing it is just dropping the lock letting the page
> > > allocator to do an allocation without our "upper/local" lock. I did a
> > > proposal like that once upon a time, so maybe it is a time to highlight
> > > it again:
> > > <snip>
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 21c2fa5bd8c3..249f10a89bb9 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3278,9 +3278,11 @@ static void kfree_rcu_monitor(struct work_struct *work)
> > > }
> > >
> > > static inline bool
> > > -kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > > +kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
> > > + void *ptr, unsigned long *flags)
> > > {
> > > struct kvfree_rcu_bulk_data *bnode;
> > > + struct kfree_rcu_cpu *tmp;
> > > int idx;
> > >
> > > if (unlikely(!krcp->initialized))
> > > @@ -3306,6 +3308,9 @@ kvfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp, void *ptr)
> > > if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > > return false;
> > >
> > > + migrate_disable();
> > > + krc_this_cpu_unlock(krcp, *flags);
> >
> > If I remember, the issue here is that migrate_disable is not implemented on a
> > non-RT kernel due to issues with starvation.
> >
> It is implemented. Please have a look linux/preempt.h for regular kernel.
Yeah sorry, I meant it is not implemented in the right way, in the sense
that it disables migration with the preempt-disable hammer.
Anyway, I think as we discussed on IRC, your proposal will work for both
kernels while allowing us to keep the internal lock as raw.
thanks,
- Joel
On Thu, Jul 02, 2020 at 10:19:08PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-07-02 09:48:26 [-0700], Paul E. McKenney wrote:
> > On Thu, Jul 02, 2020 at 04:12:16PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2020-06-30 11:35:34 [-0700], Paul E. McKenney wrote:
> > > > > This is not going to work together with the "wait context validator"
> > > > > (CONFIG_PROVE_RAW_LOCK_NESTING). As of -rc3 it should complain about
> > > > > printk() which is why it is still disabled by default.
> > > >
> > > > Fixing that should be "interesting". In particular, RCU CPU stall
> > > > warnings rely on the raw spin lock to reduce false positives due
> > > > to race conditions. Some thought will be required here.
> > >
> > > I don't get this part. Can you explain/give me an example where to look
> > > at?
> >
> > Starting from the scheduler-clock interrupt's call into RCU,
> > we have rcu_sched_clock_irq() which calls rcu_pending() which
> > calls check_cpu_stall() which calls either print_cpu_stall() or
> > print_other_cpu_stall(), depending on whether the stall is happening on
> > the current CPU or on some other CPU, respectively.
> >
> > Both of these last functions acquire the rcu_node structure's raw ->lock
> > and expect to do printk()s while holding it.
>
> …
> > Thoughts?
>
> Okay. So in the RT queue there is a printk() rewrite which fixes this
> kind of things. Upstream the printk() interface is still broken in this
> regard and therefore CONFIG_PROVE_RAW_LOCK_NESTING is disabled.
> [Earlier the workqueue would also trigger a warning but this has been
> fixed as of v5.8-rc1.]
> This was just me explaining why this bad, what debug function would
> report it and why it is not enabled by default.
Whew!!! ;-)
> > > > > So assume that this is fixed and enabled then on !PREEMPT_RT it will
> > > > > complain that you have a raw_spinlock_t acquired (the one from patch
> > > > > 02/17) and attempt to acquire a spinlock_t in the memory allocator.
> > > >
> > > > Given that the slab allocator doesn't acquire any locks until it gets
> > > > a fair way in, wouldn't it make sense to allow a "shallow" allocation
> > > > while a raw spinlock is held? This would require yet another GFP_ flag,
> > > > but that won't make all that much of a difference in the total. ;-)
> > >
> > > That would be one way of dealing with. But we could go back to
> > > spinlock_t and keep the memory allocation even for RT as is. I don't see
> > > a downside of this. And we would worry about kfree_rcu() from real
> > > IRQ-off region once we get to it.
> >
> > Once we get to it, your thought would be to do per-CPU queuing of
> > memory from IRQ-off kfree_rcu(), and have IRQ work or some such clean
> > up after it? Or did you have some other trick in mind?
>
> So for now I would very much like to revert the raw_spinlock_t back to
> the spinlock_t and add a migrate_disable() just avoid the tiny
> possible migration between obtaining the CPU-ptr and acquiring the lock
> (I think Joel was afraid of performance hit).
Performance is indeed a concern here.
> Should we get to a *real* use case where someone must invoke kfree_rcu()
> from a hard-IRQ-off region then we can think what makes sense. per-CPU
> queues and IRQ-work would be one way of dealing with it.
It looks like workqueues can also be used, at least in their current
form. And timers.
Vlad, Joel, thoughts?
Thanx, Paul
On 2020-07-06 16:29:49 [-0400], Joel Fernandes wrote:
> Anyway, I think as we discussed on IRC, your proposal will work for both
> kernels while allowing us to keep the internal lock as raw.
Why can't we use the migrate_disable() and spinlock_t?
> thanks,
>
> - Joel
Sebastian
On Mon, Jul 06, 2020 at 02:06:45PM -0700, Paul E. McKenney wrote:
> On Thu, Jul 02, 2020 at 10:19:08PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2020-07-02 09:48:26 [-0700], Paul E. McKenney wrote:
> > > On Thu, Jul 02, 2020 at 04:12:16PM +0200, Sebastian Andrzej Siewior wrote:
> > > > On 2020-06-30 11:35:34 [-0700], Paul E. McKenney wrote:
> > > > > > This is not going to work together with the "wait context validator"
> > > > > > (CONFIG_PROVE_RAW_LOCK_NESTING). As of -rc3 it should complain about
> > > > > > printk() which is why it is still disabled by default.
> > > > >
> > > > > Fixing that should be "interesting". In particular, RCU CPU stall
> > > > > warnings rely on the raw spin lock to reduce false positives due
> > > > > to race conditions. Some thought will be required here.
> > > >
> > > > I don't get this part. Can you explain/give me an example where to look
> > > > at?
> > >
> > > Starting from the scheduler-clock interrupt's call into RCU,
> > > we have rcu_sched_clock_irq() which calls rcu_pending() which
> > > calls check_cpu_stall() which calls either print_cpu_stall() or
> > > print_other_cpu_stall(), depending on whether the stall is happening on
> > > the current CPU or on some other CPU, respectively.
> > >
> > > Both of these last functions acquire the rcu_node structure's raw ->lock
> > > and expect to do printk()s while holding it.
> >
> > …
> > > Thoughts?
> >
> > Okay. So in the RT queue there is a printk() rewrite which fixes this
> > kind of things. Upstream the printk() interface is still broken in this
> > regard and therefore CONFIG_PROVE_RAW_LOCK_NESTING is disabled.
> > [Earlier the workqueue would also trigger a warning but this has been
> > fixed as of v5.8-rc1.]
> > This was just me explaining why this bad, what debug function would
> > report it and why it is not enabled by default.
>
> Whew!!! ;-)
>
> > > > > > So assume that this is fixed and enabled then on !PREEMPT_RT it will
> > > > > > complain that you have a raw_spinlock_t acquired (the one from patch
> > > > > > 02/17) and attempt to acquire a spinlock_t in the memory allocator.
> > > > >
> > > > > Given that the slab allocator doesn't acquire any locks until it gets
> > > > > a fair way in, wouldn't it make sense to allow a "shallow" allocation
> > > > > while a raw spinlock is held? This would require yet another GFP_ flag,
> > > > > but that won't make all that much of a difference in the total. ;-)
> > > >
> > > > That would be one way of dealing with. But we could go back to
> > > > spinlock_t and keep the memory allocation even for RT as is. I don't see
> > > > a downside of this. And we would worry about kfree_rcu() from real
> > > > IRQ-off region once we get to it.
> > >
> > > Once we get to it, your thought would be to do per-CPU queuing of
> > > memory from IRQ-off kfree_rcu(), and have IRQ work or some such clean
> > > up after it? Or did you have some other trick in mind?
> >
> > So for now I would very much like to revert the raw_spinlock_t back to
> > the spinlock_t and add a migrate_disable() just avoid the tiny
> > possible migration between obtaining the CPU-ptr and acquiring the lock
> > (I think Joel was afraid of performance hit).
>
> Performance is indeed a concern here.
>
> > Should we get to a *real* use case where someone must invoke kfree_rcu()
> > from a hard-IRQ-off region then we can think what makes sense. per-CPU
> > queues and IRQ-work would be one way of dealing with it.
>
> It looks like workqueues can also be used, at least in their current
> form. And timers.
>
> Vlad, Joel, thoughts?
>
Some high level thoughts:
Currently everything is done in workqueue context, which means all
freeing happens there. For an RT kernel we could invoke the page
allocator only for the single-argument kfree_rcu() case (though we skip
it). For the double-argument case it is impossible, which is why a
simple path is used instead: the rcu_heads are linked to each other for
further reclaim in workqueue context. So as of now, for RT, everything
is already deferred.
If we revert to spinlock_t then calling kfree_rcu() from hard-IRQ
context becomes broken, even though we think that for an RT kernel it
will never happen. Therefore I do not see a clear motivation or benefit
for reverting to spinlock_t.
IMHO, if we can avoid such a drawback I would go that way, i.e. I would
rather not have to think about what to do once it becomes broken.
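Just to illustrate what I mean by the two paths (grossly simplified,
not the actual tree.c code):
<snip>
	// Emergency handling when no bulk page is available:
	success = kvfree_call_rcu_add_ptr_to_bulk(krcp, ptr);
	if (!success) {
		if (head) {
			// Double-argument case: no allocation is needed,
			// just chain the rcu_head for later reclaim in
			// workqueue context.
			head->func = func;
			head->next = krcp->head;
			krcp->head = head;
		} else {
			// Single-argument case: the caller must be able
			// to sleep, so (after dropping the local lock)
			// free inline after a grace period.
			synchronize_rcu();
			kvfree(ptr);
		}
	}
<snip>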
Thanks!
--
Vlad Rezki
On Tue, Jul 07, 2020 at 07:34:41PM +0200, Uladzislau Rezki wrote:
> On Mon, Jul 06, 2020 at 02:06:45PM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 02, 2020 at 10:19:08PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2020-07-02 09:48:26 [-0700], Paul E. McKenney wrote:
> > > > On Thu, Jul 02, 2020 at 04:12:16PM +0200, Sebastian Andrzej Siewior wrote:
> > > > > On 2020-06-30 11:35:34 [-0700], Paul E. McKenney wrote:
> > > > > > > This is not going to work together with the "wait context validator"
> > > > > > > (CONFIG_PROVE_RAW_LOCK_NESTING). As of -rc3 it should complain about
> > > > > > > printk() which is why it is still disabled by default.
> > > > > >
> > > > > > Fixing that should be "interesting". In particular, RCU CPU stall
> > > > > > warnings rely on the raw spin lock to reduce false positives due
> > > > > > to race conditions. Some thought will be required here.
> > > > >
> > > > > I don't get this part. Can you explain/give me an example where to look
> > > > > at?
> > > >
> > > > Starting from the scheduler-clock interrupt's call into RCU,
> > > > we have rcu_sched_clock_irq() which calls rcu_pending() which
> > > > calls check_cpu_stall() which calls either print_cpu_stall() or
> > > > print_other_cpu_stall(), depending on whether the stall is happening on
> > > > the current CPU or on some other CPU, respectively.
> > > >
> > > > Both of these last functions acquire the rcu_node structure's raw ->lock
> > > > and expect to do printk()s while holding it.
> > >
> > > …
> > > > Thoughts?
> > >
> > > Okay. So in the RT queue there is a printk() rewrite which fixes this
> > > kind of things. Upstream the printk() interface is still broken in this
> > > regard and therefore CONFIG_PROVE_RAW_LOCK_NESTING is disabled.
> > > [Earlier the workqueue would also trigger a warning but this has been
> > > fixed as of v5.8-rc1.]
> > > This was just me explaining why this bad, what debug function would
> > > report it and why it is not enabled by default.
> >
> > Whew!!! ;-)
> >
> > > > > > > So assume that this is fixed and enabled then on !PREEMPT_RT it will
> > > > > > > complain that you have a raw_spinlock_t acquired (the one from patch
> > > > > > > 02/17) and attempt to acquire a spinlock_t in the memory allocator.
> > > > > >
> > > > > > Given that the slab allocator doesn't acquire any locks until it gets
> > > > > > a fair way in, wouldn't it make sense to allow a "shallow" allocation
> > > > > > while a raw spinlock is held? This would require yet another GFP_ flag,
> > > > > > but that won't make all that much of a difference in the total. ;-)
> > > > >
> > > > > That would be one way of dealing with. But we could go back to
> > > > > spinlock_t and keep the memory allocation even for RT as is. I don't see
> > > > > a downside of this. And we would worry about kfree_rcu() from real
> > > > > IRQ-off region once we get to it.
> > > >
> > > > Once we get to it, your thought would be to do per-CPU queuing of
> > > > memory from IRQ-off kfree_rcu(), and have IRQ work or some such clean
> > > > up after it? Or did you have some other trick in mind?
> > >
> > > So for now I would very much like to revert the raw_spinlock_t back to
> > > the spinlock_t and add a migrate_disable() just avoid the tiny
> > > possible migration between obtaining the CPU-ptr and acquiring the lock
> > > (I think Joel was afraid of performance hit).
> >
> > Performance is indeed a concern here.
> >
> > > Should we get to a *real* use case where someone must invoke kfree_rcu()
> > > from a hard-IRQ-off region then we can think what makes sense. per-CPU
> > > queues and IRQ-work would be one way of dealing with it.
> >
> > It looks like workqueues can also be used, at least in their current
> > form. And timers.
> >
> > Vlad, Joel, thoughts?
> >
> Some high level thoughts:
>
> Currently everything is done in workqueue context, it means all freeing
> happens there. For RT kernel we can invoke a page allocator only for single
> kfree_rcu() argument(though we skip it). As for double one, it is impossible,
> that is why a simple path is used by linking rcu_head among each other for
> further reclaim in wq context. As of now, for RT, everything is already
> deferred.
>
> If we revert to spinlock_t then calling of kfree_rcu() from hard IRQ
> context is broken, even though we think that for RT kernel it will
> never happen. Therefore i do not see a clear motivation and benefits
> why we should revert to spinlock_t.
>
> IMHO, if we can avoid of such drawback i would go with that way, i.e.
> i would not like to think what to do with that when it becomes broken.
I am also of Vlad's opinion. It seems to me that doing the migrate_disable()
before dropping the raw spinlock should suffice.
Note that in the future, we may have other users of this path, such as
a potential kmem_cache_free_rcu(). It seems burdensome to make these all
callable only from sleepable context.
Is there a drawback to doing migrate_disable() before dropping the raw
internal lock that I'm missing?
thanks,
- Joel
On Tue, Jul 07, 2020 at 02:45:31PM -0400, Joel Fernandes wrote:
> On Tue, Jul 07, 2020 at 07:34:41PM +0200, Uladzislau Rezki wrote:
> > On Mon, Jul 06, 2020 at 02:06:45PM -0700, Paul E. McKenney wrote:
> > > On Thu, Jul 02, 2020 at 10:19:08PM +0200, Sebastian Andrzej Siewior wrote:
> > > > On 2020-07-02 09:48:26 [-0700], Paul E. McKenney wrote:
> > > > > On Thu, Jul 02, 2020 at 04:12:16PM +0200, Sebastian Andrzej Siewior wrote:
> > > > > > On 2020-06-30 11:35:34 [-0700], Paul E. McKenney wrote:
> > > > > > > > This is not going to work together with the "wait context validator"
> > > > > > > > (CONFIG_PROVE_RAW_LOCK_NESTING). As of -rc3 it should complain about
> > > > > > > > printk() which is why it is still disabled by default.
> > > > > > >
> > > > > > > Fixing that should be "interesting". In particular, RCU CPU stall
> > > > > > > warnings rely on the raw spin lock to reduce false positives due
> > > > > > > to race conditions. Some thought will be required here.
> > > > > >
> > > > > > I don't get this part. Can you explain/give me an example where to look
> > > > > > at?
> > > > >
> > > > > Starting from the scheduler-clock interrupt's call into RCU,
> > > > > we have rcu_sched_clock_irq() which calls rcu_pending() which
> > > > > calls check_cpu_stall() which calls either print_cpu_stall() or
> > > > > print_other_cpu_stall(), depending on whether the stall is happening on
> > > > > the current CPU or on some other CPU, respectively.
> > > > >
> > > > > Both of these last functions acquire the rcu_node structure's raw ->lock
> > > > > and expect to do printk()s while holding it.
> > > >
> > > > …
> > > > > Thoughts?
> > > >
> > > > Okay. So in the RT queue there is a printk() rewrite which fixes this
> > > > kind of things. Upstream the printk() interface is still broken in this
> > > > regard and therefore CONFIG_PROVE_RAW_LOCK_NESTING is disabled.
> > > > [Earlier the workqueue would also trigger a warning but this has been
> > > > fixed as of v5.8-rc1.]
> > > > This was just me explaining why this bad, what debug function would
> > > > report it and why it is not enabled by default.
> > >
> > > Whew!!! ;-)
> > >
> > > > > > > > So assume that this is fixed and enabled then on !PREEMPT_RT it will
> > > > > > > > complain that you have a raw_spinlock_t acquired (the one from patch
> > > > > > > > 02/17) and attempt to acquire a spinlock_t in the memory allocator.
> > > > > > >
> > > > > > > Given that the slab allocator doesn't acquire any locks until it gets
> > > > > > > a fair way in, wouldn't it make sense to allow a "shallow" allocation
> > > > > > > while a raw spinlock is held? This would require yet another GFP_ flag,
> > > > > > > but that won't make all that much of a difference in the total. ;-)
> > > > > >
> > > > > > That would be one way of dealing with. But we could go back to
> > > > > > spinlock_t and keep the memory allocation even for RT as is. I don't see
> > > > > > a downside of this. And we would worry about kfree_rcu() from real
> > > > > > IRQ-off region once we get to it.
> > > > >
> > > > > Once we get to it, your thought would be to do per-CPU queuing of
> > > > > memory from IRQ-off kfree_rcu(), and have IRQ work or some such clean
> > > > > up after it? Or did you have some other trick in mind?
> > > >
> > > > So for now I would very much like to revert the raw_spinlock_t back to
> > > > the spinlock_t and add a migrate_disable() just avoid the tiny
> > > > possible migration between obtaining the CPU-ptr and acquiring the lock
> > > > (I think Joel was afraid of performance hit).
> > >
> > > Performance is indeed a concern here.
> > >
> > > > Should we get to a *real* use case where someone must invoke kfree_rcu()
> > > > from a hard-IRQ-off region then we can think what makes sense. per-CPU
> > > > queues and IRQ-work would be one way of dealing with it.
> > >
> > > It looks like workqueues can also be used, at least in their current
> > > form. And timers.
> > >
> > > Vlad, Joel, thoughts?
> > >
> > Some high level thoughts:
> >
> > Currently everything is done in workqueue context, it means all freeing
> > happens there. For RT kernel we can invoke a page allocator only for single
> > kfree_rcu() argument(though we skip it). As for double one, it is impossible,
> > that is why a simple path is used by linking rcu_head among each other for
> > further reclaim in wq context. As of now, for RT, everything is already
> > deferred.
> >
> > If we revert to spinlock_t then calling of kfree_rcu() from hard IRQ
> > context is broken, even though we think that for RT kernel it will
> > never happen. Therefore i do not see a clear motivation and benefits
> > why we should revert to spinlock_t.
> >
> > IMHO, if we can avoid of such drawback i would go with that way, i.e.
> > i would not like to think what to do with that when it becomes broken.
>
> I am also of Vlad's opinion. It seems to me that doing the migrate_disable()
> before dropping the raw spinlock should suffice.
>
> Note that in the future, we may have other users of this path such as
> a potential kmem_cache_free_rcu(). It seems burdensome to make these all
> callable only from sleepable-context.
>
> Is there a drawback of doing migrate_disable before dropping the raw internal
> lock, that I'm missing?
>
I used migrate_disable() just to underline that we want to stay on the
same CPU. We can go with preempt_disable()/preempt_enable(), which is
exactly what migrate_disable() boils down to on a regular kernel.
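I.e. the same dance as in my earlier diff, just with preempt_disable()
standing in for migrate_disable() (illustrative only, reusing
krc_this_cpu_lock()/krc_this_cpu_unlock() from this series):
<snip>
	preempt_disable();
	krc_this_cpu_unlock(krcp, *flags);

	/*
	 * The local krc lock is dropped here, so on a regular kernel
	 * the page allocator is entered from preemption-disabled
	 * context with GFP_NOWAIT, which is fine.
	 */
	bnode = (struct kvfree_rcu_bulk_data *)
		__get_free_page(GFP_NOWAIT | __GFP_NOWARN);

	tmp = krc_this_cpu_lock(flags);
	preempt_enable();

	/* Preemption was disabled, so we are still on the same CPU. */
	WARN_ON(tmp != krcp);
<snip>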
--
Vlad Rezki
On Tue, Jun 30, 2020 at 06:45:43PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-06-24 13:12:12 [-0700], [email protected] wrote:
> > From: "Joel Fernandes (Google)" <[email protected]>
> >
> > To keep the kfree_rcu() code working in purely atomic sections on RT,
> > such as non-threaded IRQ handlers and raw spinlock sections, avoid
> > calling into the page allocator which uses sleeping locks on RT.
> >
> > In fact, even if the caller is preemptible, the kfree_rcu() code is
> > not, as the krcp->lock is a raw spinlock.
> >
> > Calling into the page allocator is optional and avoiding it should be
> > Ok, especially with the page pre-allocation support in future patches.
> > Such pre-allocation would further avoid the a need for a dynamically
> > allocated page in the first place.
> >
> > Cc: Sebastian Andrzej Siewior <[email protected]>
> > Reviewed-by: Uladzislau Rezki <[email protected]>
> > Co-developed-by: Uladzislau Rezki <[email protected]>
> > Signed-off-by: Uladzislau Rezki <[email protected]>
> > Signed-off-by: Joel Fernandes (Google) <[email protected]>
> > Signed-off-by: Uladzislau Rezki (Sony) <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > kernel/rcu/tree.c | 12 ++++++++++++
> > 1 file changed, 12 insertions(+)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 64592b4..dbdd509 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3184,6 +3184,18 @@ kfree_call_rcu_add_ptr_to_bulk(struct kfree_rcu_cpu *krcp,
> > if (!bnode) {
> > WARN_ON_ONCE(sizeof(struct kfree_rcu_bulk_data) > PAGE_SIZE);
> >
> > + /*
> > + * To keep this path working on raw non-preemptible
> > + * sections, prevent the optional entry into the
> > + * allocator as it uses sleeping locks. In fact, even
> > + * if the caller of kfree_rcu() is preemptible, this
> > + * path still is not, as krcp->lock is a raw spinlock.
> > + * With additional page pre-allocation in the works,
> > + * hitting this return is going to be much less likely.
> > + */
> > + if (IS_ENABLED(CONFIG_PREEMPT_RT))
> > + return false;
>
> This is not going to work together with the "wait context validator"
> (CONFIG_PROVE_RAW_LOCK_NESTING).
>
> As of -rc3 it should complain about printk() which is why it is still disabled by default.
>
Have you tried to trigger the "complaint" you are talking about?
I would expect to get some trace dump with CONFIG_PROVE_RAW_LOCK_NESTING=y.
Thank you.
--
Vlad Rezki
On Wed, Jul 15, 2020 at 04:16:22PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-07-15 15:38:08 [+0200], Uladzislau Rezki wrote:
> > > As of -rc3 it should complain about printk() which is why it is still disabled by default.
> > >
> > Have you tried to trigger a "complain" you are talking about?
>
> No, but it is wrong because a raw_spinlock_t is acquired followed by a
> spinlock_t.
>
Right. According to the documentation, CONFIG_PROVE_RAW_LOCK_NESTING is used
to detect raw_spinlock vs. spinlock nesting usage.
>
> > I suspect to get some trace dump when CONFIG_PROVE_RAW_LOCK_NESTING=y.
>
> You should get one if you haven't received any splat earlier (like from
> printk code because it only triggers once).
>
Got it.
Thanks!
--
Vlad Rezki
On 2020-07-15 15:38:08 [+0200], Uladzislau Rezki wrote:
> > As of -rc3 it should complain about printk() which is why it is still disabled by default.
> >
> Have you tried to trigger a "complain" you are talking about?
No, but it is wrong because a raw_spinlock_t is acquired followed by a
spinlock_t.
> I suspect to get some trace dump when CONFIG_PROVE_RAW_LOCK_NESTING=y.
You should get one if you haven't received any splat earlier (like from
printk code because it only triggers once).
> Thank you.
Sebastian