2020-06-23 00:23:53

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 0/26] Miscellaneous fixes for v5.9

Hello!

This series provides miscellaneous fixes:

1. Initialize and destroy rcu_synchronize only when necessary,
courtesy of Wei Yang.

2. mm/mmap.c: Add cond_resched() for exit_mmap() CPU stalls.

3. Simplify the calculation of rcu_state.ncpus, courtesy of Wei Yang.

4. Add callbacks-invoked counters.

5. Add comment documenting rcu_callback_map's purpose.

6. events: rcu: Change description of rcu_dyntick trace event,
courtesy of Madhuparna Bhowmik.

7. Grace-period-kthread related sleeps to idle priority.

8. Priority-boost-related sleeps to idle priority.

9. No-CBs-related sleeps to idle priority.

10. Expedited grace-period sleeps to idle priority.

11. fs/btrfs: Add cond_resched() for try_release_extent_mapping()
stalls.

12. Update comment from rsp->rcu_gp_seq to rsp->gp_seq, courtesy
of Lihao Liang.

13. tick/nohz: Narrow down noise while setting current task's tick
dependency, courtesy of Frederic Weisbecker.

14. Fix some kernel-doc warnings, courtesy of Mauro Carvalho Chehab.

15. Remove initialized but unused rnp from check_slow_task().

16. Mark rcu_nmi_enter() call to rcu_cleanup_after_idle() noinstr.

17. Complain only once about RCU in extended quiescent state.

18. Replace 1 with true, courtesy of Jules Irenge.

19. Stop shrinker loop, courtesy of Peter Enderborg.

20. gp_max is protected by root rcu_node's lock, courtesy of Wei Yang.

21. grplo/grphi just records CPU number, courtesy of Wei Yang.

22. grpnum just records group number, courtesy of Wei Yang.

23. kernel/rcu/tree.c: Fix kernel-doc warnings, courtesy of Randy
Dunlap.

24. Fix some kernel-doc warnings, courtesy of Mauro Carvalho Chehab.

25. Remove KCSAN stubs.

26. Remove KCSAN stubs from update.c.

Thanx, Paul

------------------------------------------------------------------------

fs/btrfs/extent_io.c | 2 ++
include/linux/rculist.h | 2 +-
include/trace/events/rcu.h | 11 ++++++-----
kernel/locking/lockdep.c | 4 +---
kernel/rcu/tree.c | 39 +++++++++++++--------------------------
kernel/rcu/tree.h | 15 ++++++++-------
kernel/rcu/tree_exp.h | 2 +-
kernel/rcu/tree_plugin.h | 4 ++--
kernel/rcu/tree_stall.h | 5 +++--
kernel/rcu/update.c | 28 +++++++++-------------------
kernel/time/tick-sched.c | 22 +++++++++++++++-------
mm/mmap.c | 1 +
12 files changed, 62 insertions(+), 73 deletions(-)


2020-06-23 00:23:58

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 02/26] mm/mmap.c: Add cond_resched() for exit_mmap() CPU stalls

From: "Paul E. McKenney" <[email protected]>

A large process running on a heavily loaded system can encounter the
following RCU CPU stall warning:

rcu: INFO: rcu_sched self-detected stall on CPU
rcu: 3-....: (20998 ticks this GP) idle=4ea/1/0x4000000000000002 softirq=556558/556558 fqs=5190
 (t=21013 jiffies g=1005461 q=132576)
NMI backtrace for cpu 3
CPU: 3 PID: 501900 Comm: aio-free-ring-w Kdump: loaded Not tainted 5.2.9-108_fbk12_rc3_3858_gb83b75af7909 #1
Hardware name: Wiwynn HoneyBadger/PantherPlus, BIOS HBM6.71 02/03/2016
Call Trace:
<IRQ>
dump_stack+0x46/0x60
nmi_cpu_backtrace.cold.3+0x13/0x50
? lapic_can_unplug_cpu.cold.27+0x34/0x34
nmi_trigger_cpumask_backtrace+0xba/0xca
rcu_dump_cpu_stacks+0x99/0xc7
rcu_sched_clock_irq.cold.87+0x1aa/0x397
? tick_sched_do_timer+0x60/0x60
update_process_times+0x28/0x60
tick_sched_timer+0x37/0x70
__hrtimer_run_queues+0xfe/0x270
hrtimer_interrupt+0xf4/0x210
smp_apic_timer_interrupt+0x5e/0x120
apic_timer_interrupt+0xf/0x20
</IRQ>
RIP: 0010:kmem_cache_free+0x223/0x300
Code: 88 00 00 00 0f 85 ca 00 00 00 41 8b 55 18 31 f6 f7 da 41 f6 45 0a 02 40 0f 94 c6 83 c6 05 9c 41 5e fa e8 a0 a7 01 00 41 56 9d <49> 8b 47 08 a8 03 0f 85 87 00 00 00 65 48 ff 08 e9 3d fe ff ff 65
RSP: 0018:ffffc9000e8e3da8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
RAX: 0000000000020000 RBX: ffff88861b9de960 RCX: 0000000000000030
RDX: fffffffffffe41e8 RSI: 000060777fe3a100 RDI: 000000000001be18
RBP: ffffea00186e7780 R08: ffffffffffffffff R09: ffffffffffffffff
R10: ffff88861b9dea28 R11: ffff88887ffde000 R12: ffffffff81230a1f
R13: ffff888854684dc0 R14: 0000000000000206 R15: ffff8888547dbc00
? remove_vma+0x4f/0x60
remove_vma+0x4f/0x60
exit_mmap+0xd6/0x160
mmput+0x4a/0x110
do_exit+0x278/0xae0
? syscall_trace_enter+0x1d3/0x2b0
? handle_mm_fault+0xaa/0x1c0
do_group_exit+0x3a/0xa0
__x64_sys_exit_group+0x14/0x20
do_syscall_64+0x42/0x100
entry_SYSCALL_64_after_hwframe+0x44/0xa9

And on a PREEMPT=n kernel, the "while (vma)" loop in exit_mmap() can run
for a very long time given a large process. This commit therefore adds
a cond_resched() to this loop, providing RCU any needed quiescent states.

Cc: Andrew Morton <[email protected]>
Cc: <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
mm/mmap.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 59a4682..972f839 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3159,6 +3159,7 @@ void exit_mmap(struct mm_struct *mm)
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
vma = remove_vma(vma);
+ cond_resched();
}
vm_unacct_memory(nr_accounted);
}
--
2.9.5

2020-06-23 00:24:11

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 05/26] rcu: Add comment documenting rcu_callback_map's purpose

From: "Paul E. McKenney" <[email protected]>

The rcu_callback_map lockdep_map structure was added back in 2013, but
its purpose has become obscure. This commit therefore documents that the
purpose of rcu_callback_map is, in the words of commit 24ef659a857 ("rcu:
Provide better diagnostics for blocking in RCU callback functions"),
to help lockdep to tie an "inappropriate voluntary context switch back
to the fact that the function is being invoked from within a callback."

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/update.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index f5a82e1..ca17b77 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -279,6 +279,7 @@ struct lockdep_map rcu_sched_lock_map = {
};
EXPORT_SYMBOL_GPL(rcu_sched_lock_map);

+// Tell lockdep when RCU callbacks are being invoked.
static struct lock_class_key rcu_callback_key;
struct lockdep_map rcu_callback_map =
STATIC_LOCKDEP_MAP_INIT("rcu_callback", &rcu_callback_key);
--
2.9.5
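
For context, the way this lockdep_map is consumed is that callback
invocation brackets each callback with the map, so that lockdep can tie
any inappropriate voluntary context switch back to callback context. A
rough sketch of that pattern, simplified from rcu_do_batch() and not
part of this patch:

    rcu_lock_acquire(&rcu_callback_map);   /* Entering RCU callback context. */
    trace_rcu_invoke_callback(rcu_state.name, rhp);
    f = rhp->func;
    f(rhp);                                /* Blocking in here is what lockdep flags. */
    rcu_lock_release(&rcu_callback_map);   /* Leaving RCU callback context. */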

2020-06-23 00:24:15

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 09/26] rcu: No-CBs-related sleeps to idle priority

From: "Paul E. McKenney" <[email protected]>

This commit converts the schedule_timeout_interruptible() call used by
RCU's no-CBs grace-period kthreads to schedule_timeout_idle(). This
conversion avoids polluting the load-average with RCU-related sleeping.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 25296c1..982fc5b 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -2005,7 +2005,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
/* Polling, so trace if first poll in the series. */
if (gotcbs)
trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("Poll"));
- schedule_timeout_interruptible(1);
+ schedule_timeout_idle(1);
} else if (!needwait_gp) {
/* Wait for callbacks to appear. */
trace_rcu_nocb_wake(rcu_state.name, cpu, TPS("Sleep"));
--
2.9.5

2020-06-23 00:24:22

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 24/26] rcu: fix some kernel-doc warnings

From: Mauro Carvalho Chehab <[email protected]>

There are some kernel-doc warnings:

./kernel/rcu/tree.c:2915: warning: Function parameter or member 'count' not described in 'kfree_rcu_cpu'
./include/linux/rculist.h:517: warning: bad line: [@right ][node2 ... ]
./include/linux/rculist.h:2: WARNING: Unexpected indentation.

Move the comment for "count" to the kernel-doc markup and add
a missing "*" on one kernel-doc continuation line.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/linux/rculist.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index df587d1..7eed65b 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -512,7 +512,7 @@ static inline void hlist_replace_rcu(struct hlist_node *old,
* @right: The hlist head on the right
*
* The lists start out as [@left ][node1 ... ] and
- [@right ][node2 ... ]
+ * [@right ][node2 ... ]
* The lists end up as [@left ][node2 ... ]
* [@right ][node1 ... ]
*/
--
2.9.5

2020-06-23 00:24:38

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 03/26] rcu: Simplify the calculation of rcu_state.ncpus

From: Wei Yang <[email protected]>

There is only 1 bit set in mask, which means that the only difference
between oldmask and the new one will be at the position where the bit is
set in mask. This commit therefore updates rcu_state.ncpus by checking
whether the bit in mask is already set in rnp->expmaskinitnext.

Signed-off-by: Wei Yang <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c716ead..9caaeee 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3824,10 +3824,9 @@ void rcu_cpu_starting(unsigned int cpu)
{
unsigned long flags;
unsigned long mask;
- int nbits;
- unsigned long oldmask;
struct rcu_data *rdp;
struct rcu_node *rnp;
+ bool newcpu;

if (per_cpu(rcu_cpu_started, cpu))
return;
@@ -3839,12 +3838,10 @@ void rcu_cpu_starting(unsigned int cpu)
mask = rdp->grpmask;
raw_spin_lock_irqsave_rcu_node(rnp, flags);
WRITE_ONCE(rnp->qsmaskinitnext, rnp->qsmaskinitnext | mask);
- oldmask = rnp->expmaskinitnext;
+ newcpu = !(rnp->expmaskinitnext & mask);
rnp->expmaskinitnext |= mask;
- oldmask ^= rnp->expmaskinitnext;
- nbits = bitmap_weight(&oldmask, BITS_PER_LONG);
/* Allow lockless access for expedited grace periods. */
- smp_store_release(&rcu_state.ncpus, rcu_state.ncpus + nbits); /* ^^^ */
+ smp_store_release(&rcu_state.ncpus, rcu_state.ncpus + newcpu); /* ^^^ */
ASSERT_EXCLUSIVE_WRITER(rcu_state.ncpus);
rcu_gpnum_ovf(rnp, rdp); /* Offline-induced counter wrap? */
rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq);
--
2.9.5
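
To see why the old and new computations agree, note that with exactly
one bit set in mask, bitmap_weight(oldmask ^ (oldmask | mask)) is 1
precisely when that bit was previously clear, which is what
!(rnp->expmaskinitnext & mask) tests. A standalone user-space sketch
(illustrative only, not kernel code):

    #include <stdio.h>

    int main(void)
    {
        unsigned long old = 0x5, mask = 0x2;  /* exactly one bit set in mask */
        unsigned long new = old | mask;
        int nbits = __builtin_popcountl(old ^ new);  /* old method */
        int newcpu = !(old & mask);                  /* new method */

        printf("nbits=%d newcpu=%d\n", nbits, newcpu);  /* both print 1 here */
        return 0;
    }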

2020-06-23 00:24:50

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 23/26] kernel/rcu/tree.c: Fix kernel-doc warnings

From: Randy Dunlap <[email protected]>

Fix kernel-doc warning:

../kernel/rcu/tree.c:959: warning: Excess function parameter 'irq' description in 'rcu_nmi_enter'

Fixes: cf7614e13c8f ("rcu: Refactor rcu_{nmi,irq}_{enter,exit}()")
Signed-off-by: Randy Dunlap <[email protected]>
Cc: Byungchul Park <[email protected]>
Cc: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0806762..e7161e0 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -944,7 +944,6 @@ void __rcu_irq_enter_check_tick(void)

/**
* rcu_nmi_enter - inform RCU of entry to NMI context
- * @irq: Is this call from rcu_irq_enter?
*
* If the CPU was idle from RCU's viewpoint, update rdp->dynticks and
* rdp->dynticks_nmi_nesting to let the RCU grace-period handling know
--
2.9.5

2020-06-23 00:25:01

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 19/26] rcu: Stop shrinker loop

From: Peter Enderborg <[email protected]>

The count and scan can be separated in time, and there is a fair chance
that all work is already done when the scan starts, which might in turn
result in a needless retry. This commit therefore avoids this retry by
returning SHRINK_STOP.

Reviewed-by: Uladzislau Rezki (Sony) <[email protected]>
Signed-off-by: Peter Enderborg <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 67912ad..0806762 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3314,7 +3314,7 @@ kfree_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
break;
}

- return freed;
+ return freed == 0 ? SHRINK_STOP : freed;
}

static struct shrinker kfree_rcu_shrinker = {
--
2.9.5
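
For context, the shrinker API splits work into a count callback and a
scan callback, and scan_objects() may return SHRINK_STOP to tell the
reclaim code to stop calling it for this pass. A minimal sketch of that
shape (illustrative only; the demo_* names are made up, only the
shrinker API itself is real), to be registered with register_shrinker():

    #include <linux/shrinker.h>

    static unsigned long demo_count(struct shrinker *sh, struct shrink_control *sc)
    {
        return demo_nr_objects();  /* hypothetical helper */
    }

    static unsigned long demo_scan(struct shrinker *sh, struct shrink_control *sc)
    {
        unsigned long freed = demo_free_some(sc->nr_to_scan);  /* hypothetical helper */

        /* Nothing was freed, so tell reclaim not to bother retrying. */
        return freed == 0 ? SHRINK_STOP : freed;
    }

    static struct shrinker demo_shrinker = {
        .count_objects = demo_count,
        .scan_objects  = demo_scan,
        .seeks         = DEFAULT_SEEKS,
    };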

2020-06-23 00:25:03

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 18/26] rcu: Replace 1 with true

From: Jules Irenge <[email protected]>

Coccinelle reports a warning

WARNING: Assignment of 0/1 to bool variable

The root cause is that the variable rcu_boot_ended is a bool, but is
initialised with integer 1. This commit therefore replaces that 1 with
true.

Signed-off-by: Jules Irenge <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/update.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index ca17b77..a0ba885 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -207,7 +207,7 @@ void rcu_end_inkernel_boot(void)
rcu_unexpedite_gp();
if (rcu_normal_after_boot)
WRITE_ONCE(rcu_normal, 1);
- rcu_boot_ended = 1;
+ rcu_boot_ended = true;
}

/*
--
2.9.5

2020-06-23 00:25:08

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 20/26] rcu: gp_max is protected by root rcu_node's lock

From: Wei Yang <[email protected]>

Since gp_max is protected by the root rcu_node's lock, move its
definition to the place that the comments indicate.

Signed-off-by: Wei Yang <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 575745f..09ec93b 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -302,6 +302,8 @@ struct rcu_state {
u8 boost ____cacheline_internodealigned_in_smp;
/* Subject to priority boost. */
unsigned long gp_seq; /* Grace-period sequence #. */
+ unsigned long gp_max; /* Maximum GP duration in */
+ /* jiffies. */
struct task_struct *gp_kthread; /* Task for grace periods. */
struct swait_queue_head gp_wq; /* Where GP task waits. */
short gp_flags; /* Commands for GP task. */
@@ -347,8 +349,6 @@ struct rcu_state {
/* a reluctant CPU. */
unsigned long n_force_qs_gpstart; /* Snapshot of n_force_qs at */
/* GP start. */
- unsigned long gp_max; /* Maximum GP duration in */
- /* jiffies. */
const char *name; /* Name of structure. */
char abbr; /* Abbreviated name. */

--
2.9.5

2020-06-23 00:25:11

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 21/26] rcu: grplo/grphi just records CPU number

From: Wei Yang <[email protected]>

The grplo and grphi fields record the lowest and highest CPU numbers
belonging to an rcu_node; they are not group numbers.

Signed-off-by: Wei Yang <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 09ec93b..9f903f5 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -73,8 +73,8 @@ struct rcu_node {
unsigned long ffmask; /* Fully functional CPUs. */
unsigned long grpmask; /* Mask to apply to parent qsmask. */
/* Only one bit will be set in this mask. */
- int grplo; /* lowest-numbered CPU or group here. */
- int grphi; /* highest-numbered CPU or group here. */
+ int grplo; /* lowest-numbered CPU here. */
+ int grphi; /* highest-numbered CPU here. */
u8 grpnum; /* CPU/group number for next level up. */
u8 level; /* root is at level 0. */
bool wait_blkd_tasks;/* Necessary to wait for blocked tasks to */
--
2.9.5

2020-06-23 00:25:50

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 13/26] tick/nohz: Narrow down noise while setting current task's tick dependency

From: Frederic Weisbecker <[email protected]>

Setting a tick dependency on any task, including the case where a task
sets that dependency on itself, triggers an IPI to all CPUs. That is
of course suboptimal but it had previously not been an issue because it
was only used by POSIX CPU timers on nohz_full, which apparently never
occurs in latency-sensitive workloads in production. (Or users of such
systems are suffering in silence on the one hand or venting their ire
on the wrong people on the other.)

But RCU now sets a task tick dependency on the current task in order
to fix stall issues that can occur during RCU callback processing.
Thus, RCU callback processing triggers frequent system-wide IPIs from
nohz_full CPUs. This is quite counter-productive: after all, avoiding
IPIs is what nohz_full is supposed to be all about.

This commit therefore optimizes tasks' self-setting of a task tick
dependency by using tick_nohz_full_kick() to avoid the system-wide IPI.
Instead, only the execution of the one task is disturbed, which is
acceptable given that this disturbance is well down into the noise
compared to the degree to which the RCU callback processing itself
disturbs execution.

Fixes: 6a949b7af82d ("rcu: Force on tick when invoking lots of callbacks")
Reported-by: Matt Fleming <[email protected]>
Signed-off-by: Frederic Weisbecker <[email protected]>
Cc: [email protected]
Cc: Paul E. McKenney <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/time/tick-sched.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 3e2dc9b..f0199a4 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -351,16 +351,24 @@ void tick_nohz_dep_clear_cpu(int cpu, enum tick_dep_bits bit)
EXPORT_SYMBOL_GPL(tick_nohz_dep_clear_cpu);

/*
- * Set a per-task tick dependency. Posix CPU timers need this in order to elapse
- * per task timers.
+ * Set a per-task tick dependency. RCU need this. Also posix CPU timers
+ * in order to elapse per task timers.
*/
void tick_nohz_dep_set_task(struct task_struct *tsk, enum tick_dep_bits bit)
{
- /*
- * We could optimize this with just kicking the target running the task
- * if that noise matters for nohz full users.
- */
- tick_nohz_dep_set_all(&tsk->tick_dep_mask, bit);
+ if (!atomic_fetch_or(BIT(bit), &tsk->tick_dep_mask)) {
+ if (tsk == current) {
+ preempt_disable();
+ tick_nohz_full_kick();
+ preempt_enable();
+ } else {
+ /*
+ * Some future tick_nohz_full_kick_task()
+ * should optimize this.
+ */
+ tick_nohz_full_kick_all();
+ }
+ }
}
EXPORT_SYMBOL_GPL(tick_nohz_dep_set_task);

--
2.9.5
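
The property relied on above is that atomic_fetch_or() returns the value
the mask held before the OR, so the (potentially IPI-generating) kick is
issued only on the transition from "no tick-dependency bits set" to "some
bit set". A standalone illustration using C11 atomics (note that the
kernel's atomic_fetch_or() takes the operand first, unlike the C11 form):

    #include <stdatomic.h>
    #include <stdio.h>

    int main(void)
    {
        atomic_uint dep_mask = 0;
        unsigned int bit = 1U << 3;

        if (!atomic_fetch_or(&dep_mask, bit))
            printf("first dependency set: kick\n");  /* printed */
        if (!atomic_fetch_or(&dep_mask, bit))
            printf("already set: no kick\n");        /* not printed */
        return 0;
    }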

2020-06-23 00:25:57

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 12/26] rcu: Update comment from rsp->rcu_gp_seq to rsp->gp_seq

From: Lihao Liang <[email protected]>

Signed-off-by: Lihao Liang <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 9c6f734..575745f 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -41,7 +41,7 @@ struct rcu_node {
raw_spinlock_t __private lock; /* Root rcu_node's lock protects */
/* some rcu_state fields as well as */
/* following. */
- unsigned long gp_seq; /* Track rsp->rcu_gp_seq. */
+ unsigned long gp_seq; /* Track rsp->gp_seq. */
unsigned long gp_seq_needed; /* Track furthest future GP request. */
unsigned long completedqs; /* All QSes done for this node. */
unsigned long qsmask; /* CPUs or groups that need to switch in */
@@ -149,7 +149,7 @@ union rcu_noqs {
/* Per-CPU data for read-copy update. */
struct rcu_data {
/* 1) quiescent-state and grace-period handling : */
- unsigned long gp_seq; /* Track rsp->rcu_gp_seq counter. */
+ unsigned long gp_seq; /* Track rsp->gp_seq counter. */
unsigned long gp_seq_needed; /* Track furthest future GP request. */
union rcu_noqs cpu_no_qs; /* No QSes yet for this CPU. */
bool core_needs_qs; /* Core waits for quiesc state. */
--
2.9.5

2020-06-23 00:26:18

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 07/26] rcu: Grace-period-kthread related sleeps to idle priority

From: "Paul E. McKenney" <[email protected]>

This commit converts the long-standing schedule_timeout_interruptible()
and schedule_timeout_uninterruptible() calls used by RCU's grace-period
kthread to schedule_timeout_idle(). This conversion avoids polluting
the load-average with RCU-related sleeping.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index db17ffe..48ae673 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1620,7 +1620,7 @@ static void rcu_gp_slow(int delay)
if (delay > 0 &&
!(rcu_seq_ctr(rcu_state.gp_seq) %
(rcu_num_nodes * PER_RCU_NODE_PERIOD * delay)))
- schedule_timeout_uninterruptible(delay);
+ schedule_timeout_idle(delay);
}

static unsigned long sleep_duration;
@@ -1643,7 +1643,7 @@ static void rcu_gp_torture_wait(void)
duration = xchg(&sleep_duration, 0UL);
if (duration > 0) {
pr_alert("%s: Waiting %lu jiffies\n", __func__, duration);
- schedule_timeout_uninterruptible(duration);
+ schedule_timeout_idle(duration);
pr_alert("%s: Wait complete\n", __func__);
}
}
@@ -2709,7 +2709,7 @@ static void rcu_cpu_kthread(unsigned int cpu)
}
*statusp = RCU_KTHREAD_YIELDING;
trace_rcu_utilization(TPS("Start CPU kthread@rcu_yield"));
- schedule_timeout_interruptible(2);
+ schedule_timeout_idle(2);
trace_rcu_utilization(TPS("End CPU kthread@rcu_yield"));
*statusp = RCU_KTHREAD_WAITING;
}
--
2.9.5
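
For reference, the load average counts TASK_UNINTERRUPTIBLE sleepers, and
TASK_IDLE is defined as TASK_UNINTERRUPTIBLE | TASK_NOLOAD, so the converted
sleeps remain immune to signals without showing up in the load average. A
rough sketch of schedule_timeout_idle() (quoted from memory of
kernel/time/timer.c, not part of this patch):

    signed long __sched schedule_timeout_idle(signed long timeout)
    {
        __set_current_state(TASK_IDLE);  /* TASK_UNINTERRUPTIBLE | TASK_NOLOAD */
        return schedule_timeout(timeout);
    }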

2020-06-23 00:26:29

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 04/26] rcu: Add callbacks-invoked counters

From: "Paul E. McKenney" <[email protected]>

This commit adds a count of the callbacks invoked to the per-CPU rcu_data
structure. This count is printed by show_rcu_gp_kthreads(), which is
invoked by rcutorture and the RCU CPU stall-warning code. It is also
intended for use by drgn.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 1 +
kernel/rcu/tree.h | 1 +
kernel/rcu/tree_stall.h | 3 +++
3 files changed, 5 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9caaeee..db17ffe 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2425,6 +2425,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
local_irq_save(flags);
rcu_nocb_lock(rdp);
count = -rcl.len;
+ rdp->n_cbs_invoked += count;
trace_rcu_batch_end(rcu_state.name, count, !!rcl.head, need_resched(),
is_idle_task(current), rcu_is_callbacks_kthread());

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 43991a4..9c6f734 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -171,6 +171,7 @@ struct rcu_data {
/* different grace periods. */
long qlen_last_fqs_check;
/* qlen at last check for QS forcing */
+ unsigned long n_cbs_invoked; /* # callbacks invoked since boot. */
unsigned long n_force_qs_snap;
/* did other CPU force QS recently? */
long blimit; /* Upper limit on a processed batch */
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 54a6dba..2768ce6 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -649,6 +649,7 @@ static void check_cpu_stall(struct rcu_data *rdp)
*/
void show_rcu_gp_kthreads(void)
{
+ unsigned long cbs = 0;
int cpu;
unsigned long j;
unsigned long ja;
@@ -690,9 +691,11 @@ void show_rcu_gp_kthreads(void)
}
for_each_possible_cpu(cpu) {
rdp = per_cpu_ptr(&rcu_data, cpu);
+ cbs += data_race(rdp->n_cbs_invoked);
if (rcu_segcblist_is_offloaded(&rdp->cblist))
show_rcu_nocb_state(rdp);
}
+ pr_info("RCU callbacks invoked since boot: %lu\n", cbs);
show_rcu_tasks_gp_kthreads();
}
EXPORT_SYMBOL_GPL(show_rcu_gp_kthreads);
--
2.9.5

2020-06-23 00:26:34

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 22/26] rcu: grpnum just records group number

From: Wei Yang <[email protected]>

The grpnum field in rcu_node records the group's position within its
parent, which is not a CPU number, even at the leaf level.

Signed-off-by: Wei Yang <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 9f903f5..c96ae35 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -75,7 +75,7 @@ struct rcu_node {
/* Only one bit will be set in this mask. */
int grplo; /* lowest-numbered CPU here. */
int grphi; /* highest-numbered CPU here. */
- u8 grpnum; /* CPU/group number for next level up. */
+ u8 grpnum; /* group number for next level up. */
u8 level; /* root is at level 0. */
bool wait_blkd_tasks;/* Necessary to wait for blocked tasks to */
/* exit RCU read-side critical sections */
--
2.9.5

2020-06-23 00:27:03

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 15/26] rcu: Remove initialized but unused rnp from check_slow_task()

From: "Paul E. McKenney" <[email protected]>

This commit removes the variable rnp from check_slow_task(), which
is defined, assigned to, but not otherwise used.

Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_stall.h | 2 --
1 file changed, 2 deletions(-)

diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 2768ce6..d203f82 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -237,14 +237,12 @@ struct rcu_stall_chk_rdr {
*/
static bool check_slow_task(struct task_struct *t, void *arg)
{
- struct rcu_node *rnp;
struct rcu_stall_chk_rdr *rscrp = arg;

if (task_curr(t))
return false; // It is running, so decline to inspect it.
rscrp->nesting = t->rcu_read_lock_nesting;
rscrp->rs = t->rcu_read_unlock_special;
- rnp = t->rcu_blocked_node;
rscrp->on_blkd_list = !list_empty(&t->rcu_node_entry);
return true;
}
--
2.9.5

2020-06-23 00:27:27

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 06/26] trace: events: rcu: Change description of rcu_dyntick trace event

From: Madhuparna Bhowmik <[email protected]>

The strings used to describe the polarity are "Start", "End", and
"StillNonIdle". Since "StillIdle" is not used in any rcu_dyntick
tracepoint, remove it from the description; "StillNonIdle", which is
used in a few rcu_dyntick tracepoints, is added instead.

Similarly, "USER", "IDLE", and "IRQ" are used to describe the context
in the rcu_dyntick tracepoints. Since "KERNEL" is not used in any of
them, remove it from the description.

Signed-off-by: Madhuparna Bhowmik <[email protected]>
Signed-off-by: Joel Fernandes (Google) <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
include/trace/events/rcu.h | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index f9a7811..af274d1 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -435,11 +435,12 @@ TRACE_EVENT_RCU(rcu_fqs,
#endif /* #if defined(CONFIG_TREE_RCU) */

/*
- * Tracepoint for dyntick-idle entry/exit events. These take a string
- * as argument: "Start" for entering dyntick-idle mode, "Startirq" for
- * entering it from irq/NMI, "End" for leaving it, "Endirq" for leaving it
- * to irq/NMI, "--=" for events moving towards idle, and "++=" for events
- * moving away from idle.
+ * Tracepoint for dyntick-idle entry/exit events. These take 2 strings
+ * as argument:
+ * polarity: "Start", "End", "StillNonIdle" for entering, exiting or still not
+ * being in dyntick-idle mode.
+ * context: "USER" or "IDLE" or "IRQ".
+ * NMIs nested in IRQs are inferred with dynticks_nesting > 1 in IRQ context.
*
* These events also take a pair of numbers, which indicate the nesting
* depth before and after the event of interest, and a third number that is
--
2.9.5

2020-06-23 00:27:57

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 14/26] rcu: fix some kernel-doc warnings

From: Mauro Carvalho Chehab <[email protected]>

There are some kernel-doc warnings:

./kernel/rcu/tree.c:2915: warning: Function parameter or member 'count' not described in 'kfree_rcu_cpu'

This commit therefore moves the comment for "count" to the kernel-doc
markup.

Signed-off-by: Mauro Carvalho Chehab <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 48ae673..08e3648 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2987,6 +2987,7 @@ struct kfree_rcu_cpu_work {
* @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
* @monitor_todo: Tracks whether a @monitor_work delayed work is pending
* @initialized: The @lock and @rcu_work fields have been initialized
+ * @count: Number of objects for which GP not started
*
* This is a per-CPU structure. The reason that it is not included in
* the rcu_data structure is to permit this code to be extracted from
@@ -3002,7 +3003,6 @@ struct kfree_rcu_cpu {
struct delayed_work monitor_work;
bool monitor_todo;
bool initialized;
- // Number of objects for which GP not started
int count;
};

--
2.9.5

2020-06-23 00:28:05

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 11/26] fs/btrfs: Add cond_resched() for try_release_extent_mapping() stalls

From: "Paul E. McKenney" <[email protected]>

Very large I/Os can cause the following RCU CPU stall warning:

RIP: 0010:rb_prev+0x8/0x50
Code: 49 89 c0 49 89 d1 48 89 c2 48 89 f8 e9 e5 fd ff ff 4c 89 48 10 c3 4c 89 06 c3 4c 89 40 10 c3 0f 1f 00 48 8b 0f 48 39 cf 74 38 <48> 8b 47 10 48 85 c0 74 22 48 8b 50 08 48 85 d2 74 0c 48 89 d0 48
RSP: 0018:ffffc9002212bab0 EFLAGS: 00000287 ORIG_RAX: ffffffffffffff13
RAX: ffff888821f93630 RBX: ffff888821f93630 RCX: ffff888821f937e0
RDX: 0000000000000000 RSI: 0000000000102000 RDI: ffff888821f93630
RBP: 0000000000103000 R08: 000000000006c000 R09: 0000000000000238
R10: 0000000000102fff R11: ffffc9002212bac8 R12: 0000000000000001
R13: ffffffffffffffff R14: 0000000000102000 R15: ffff888821f937e0
__lookup_extent_mapping+0xa0/0x110
try_release_extent_mapping+0xdc/0x220
btrfs_releasepage+0x45/0x70
shrink_page_list+0xa39/0xb30
shrink_inactive_list+0x18f/0x3b0
shrink_lruvec+0x38e/0x6b0
shrink_node+0x14d/0x690
do_try_to_free_pages+0xc6/0x3e0
try_to_free_mem_cgroup_pages+0xe6/0x1e0
reclaim_high.constprop.73+0x87/0xc0
mem_cgroup_handle_over_high+0x66/0x150
exit_to_usermode_loop+0x82/0xd0
do_syscall_64+0xd4/0x100
entry_SYSCALL_64_after_hwframe+0x44/0xa9

On a PREEMPT=n kernel, the try_release_extent_mapping() function's
"while" loop might run for a very long time on a large I/O. This commit
therefore adds a cond_resched() to this loop, providing RCU any needed
quiescent states.

Signed-off-by: Paul E. McKenney <[email protected]>
---
fs/btrfs/extent_io.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 68c9605..7042395 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4515,6 +4515,8 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)

/* once for us */
free_extent_map(em);
+
+ cond_resched(); /* Allow large-extent preemption. */
}
}
return try_release_extent_state(tree, page, mask);
--
2.9.5

2020-06-23 00:28:26

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 01/26] rcu: Initialize and destroy rcu_synchronize only when necessary

From: Wei Yang <[email protected]>

The __wait_rcu_gp() function unconditionally initializes and cleans up
each element of rs_array[], whether used or not. This is slightly
wasteful and rather confusing, so this commit skips both initialization
and cleanup for duplicate callback functions.

Signed-off-by: Wei Yang <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/update.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 84843ad..f5a82e1 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -390,13 +390,14 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
might_sleep();
continue;
}
- init_rcu_head_on_stack(&rs_array[i].head);
- init_completion(&rs_array[i].completion);
for (j = 0; j < i; j++)
if (crcu_array[j] == crcu_array[i])
break;
- if (j == i)
+ if (j == i) {
+ init_rcu_head_on_stack(&rs_array[i].head);
+ init_completion(&rs_array[i].completion);
(crcu_array[i])(&rs_array[i].head, wakeme_after_rcu);
+ }
}

/* Wait for all callbacks to be invoked. */
@@ -407,9 +408,10 @@ void __wait_rcu_gp(bool checktiny, int n, call_rcu_func_t *crcu_array,
for (j = 0; j < i; j++)
if (crcu_array[j] == crcu_array[i])
break;
- if (j == i)
+ if (j == i) {
wait_for_completion(&rs_array[i].completion);
- destroy_rcu_head_on_stack(&rs_array[i].head);
+ destroy_rcu_head_on_stack(&rs_array[i].head);
+ }
}
}
EXPORT_SYMBOL_GPL(__wait_rcu_gp);
--
2.9.5
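
For context, duplicate callback functions can reach __wait_rcu_gp() via the
synchronize_rcu_mult() wrapper, which forwards its argument list unchanged.
A purely hypothetical caller for illustration:

    /*
     * Only the first occurrence of call_rcu gets its rcu_synchronize
     * entry initialized, posted, and waited on; the duplicate is
     * skipped by the j == i checks in the patch above.
     */
    synchronize_rcu_mult(call_rcu, call_rcu_tasks, call_rcu);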

2020-06-23 00:50:42

by Shakeel Butt

Subject: Re: [PATCH tip/core/rcu 02/26] mm/mmap.c: Add cond_resched() for exit_mmap() CPU stalls

On Mon, Jun 22, 2020 at 5:22 PM <[email protected]> wrote:
>
> From: "Paul E. McKenney" <[email protected]>
>
> A large process running on a heavily loaded system can encounter the
> following RCU CPU stall warning:
>
> rcu: INFO: rcu_sched self-detected stall on CPU
> rcu: 3-....: (20998 ticks this GP) idle=4ea/1/0x4000000000000002 softirq=556558/556558 fqs=5190
> (t=21013 jiffies g=1005461 q=132576)
> NMI backtrace for cpu 3
> CPU: 3 PID: 501900 Comm: aio-free-ring-w Kdump: loaded Not tainted 5.2.9-108_fbk12_rc3_3858_gb83b75af7909 #1
> Hardware name: Wiwynn HoneyBadger/PantherPlus, BIOS HBM6.71 02/03/2016
> Call Trace:
> <IRQ>
> dump_stack+0x46/0x60
> nmi_cpu_backtrace.cold.3+0x13/0x50
> ? lapic_can_unplug_cpu.cold.27+0x34/0x34
> nmi_trigger_cpumask_backtrace+0xba/0xca
> rcu_dump_cpu_stacks+0x99/0xc7
> rcu_sched_clock_irq.cold.87+0x1aa/0x397
> ? tick_sched_do_timer+0x60/0x60
> update_process_times+0x28/0x60
> tick_sched_timer+0x37/0x70
> __hrtimer_run_queues+0xfe/0x270
> hrtimer_interrupt+0xf4/0x210
> smp_apic_timer_interrupt+0x5e/0x120
> apic_timer_interrupt+0xf/0x20
> </IRQ>
> RIP: 0010:kmem_cache_free+0x223/0x300
> Code: 88 00 00 00 0f 85 ca 00 00 00 41 8b 55 18 31 f6 f7 da 41 f6 45 0a 02 40 0f 94 c6 83 c6 05 9c 41 5e fa e8 a0 a7 01 00 41 56 9d <49> 8b 47 08 a8 03 0f 85 87 00 00 00 65 48 ff 08 e9 3d fe ff ff 65
> RSP: 0018:ffffc9000e8e3da8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
> RAX: 0000000000020000 RBX: ffff88861b9de960 RCX: 0000000000000030
> RDX: fffffffffffe41e8 RSI: 000060777fe3a100 RDI: 000000000001be18
> RBP: ffffea00186e7780 R08: ffffffffffffffff R09: ffffffffffffffff
> R10: ffff88861b9dea28 R11: ffff88887ffde000 R12: ffffffff81230a1f
> R13: ffff888854684dc0 R14: 0000000000000206 R15: ffff8888547dbc00
> ? remove_vma+0x4f/0x60
> remove_vma+0x4f/0x60
> exit_mmap+0xd6/0x160
> mmput+0x4a/0x110
> do_exit+0x278/0xae0
> ? syscall_trace_enter+0x1d3/0x2b0
> ? handle_mm_fault+0xaa/0x1c0
> do_group_exit+0x3a/0xa0
> __x64_sys_exit_group+0x14/0x20
> do_syscall_64+0x42/0x100
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> And on a PREEMPT=n kernel, the "while (vma)" loop in exit_mmap() can run
> for a very long time given a large process. This commit therefore adds
> a cond_resched() to this loop, providing RCU any needed quiescent states.
>
> Cc: Andrew Morton <[email protected]>
> Cc: <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>

We have had exactly the same change in our internal kernel since 2018. We
mostly observed the need_resched warnings on processes mapping hugetlbfs.

Reviewed-by: Shakeel Butt <[email protected]>

> ---
> mm/mmap.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 59a4682..972f839 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3159,6 +3159,7 @@ void exit_mmap(struct mm_struct *mm)
> if (vma->vm_flags & VM_ACCOUNT)
> nr_accounted += vma_pages(vma);
> vma = remove_vma(vma);
> + cond_resched();
> }
> vm_unacct_memory(nr_accounted);
> }
> --
> 2.9.5
>

2020-06-23 01:02:10

by Paul E. McKenney

Subject: Re: [PATCH tip/core/rcu 02/26] mm/mmap.c: Add cond_resched() for exit_mmap() CPU stalls

On Mon, Jun 22, 2020 at 05:47:19PM -0700, Shakeel Butt wrote:
> On Mon, Jun 22, 2020 at 5:22 PM <[email protected]> wrote:
> >
> > From: "Paul E. McKenney" <[email protected]>
> >
> > A large process running on a heavily loaded system can encounter the
> > following RCU CPU stall warning:
> >
> > rcu: INFO: rcu_sched self-detected stall on CPU
> > rcu: 3-....: (20998 ticks this GP) idle=4ea/1/0x4000000000000002 softirq=556558/556558 fqs=5190
> > (t=21013 jiffies g=1005461 q=132576)
> > NMI backtrace for cpu 3
> > CPU: 3 PID: 501900 Comm: aio-free-ring-w Kdump: loaded Not tainted 5.2.9-108_fbk12_rc3_3858_gb83b75af7909 #1
> > Hardware name: Wiwynn HoneyBadger/PantherPlus, BIOS HBM6.71 02/03/2016
> > Call Trace:
> > <IRQ>
> > dump_stack+0x46/0x60
> > nmi_cpu_backtrace.cold.3+0x13/0x50
> > ? lapic_can_unplug_cpu.cold.27+0x34/0x34
> > nmi_trigger_cpumask_backtrace+0xba/0xca
> > rcu_dump_cpu_stacks+0x99/0xc7
> > rcu_sched_clock_irq.cold.87+0x1aa/0x397
> > ? tick_sched_do_timer+0x60/0x60
> > update_process_times+0x28/0x60
> > tick_sched_timer+0x37/0x70
> > __hrtimer_run_queues+0xfe/0x270
> > hrtimer_interrupt+0xf4/0x210
> > smp_apic_timer_interrupt+0x5e/0x120
> > apic_timer_interrupt+0xf/0x20
> > </IRQ>
> > RIP: 0010:kmem_cache_free+0x223/0x300
> > Code: 88 00 00 00 0f 85 ca 00 00 00 41 8b 55 18 31 f6 f7 da 41 f6 45 0a 02 40 0f 94 c6 83 c6 05 9c 41 5e fa e8 a0 a7 01 00 41 56 9d <49> 8b 47 08 a8 03 0f 85 87 00 00 00 65 48 ff 08 e9 3d fe ff ff 65
> > RSP: 0018:ffffc9000e8e3da8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
> > RAX: 0000000000020000 RBX: ffff88861b9de960 RCX: 0000000000000030
> > RDX: fffffffffffe41e8 RSI: 000060777fe3a100 RDI: 000000000001be18
> > RBP: ffffea00186e7780 R08: ffffffffffffffff R09: ffffffffffffffff
> > R10: ffff88861b9dea28 R11: ffff88887ffde000 R12: ffffffff81230a1f
> > R13: ffff888854684dc0 R14: 0000000000000206 R15: ffff8888547dbc00
> > ? remove_vma+0x4f/0x60
> > remove_vma+0x4f/0x60
> > exit_mmap+0xd6/0x160
> > mmput+0x4a/0x110
> > do_exit+0x278/0xae0
> > ? syscall_trace_enter+0x1d3/0x2b0
> > ? handle_mm_fault+0xaa/0x1c0
> > do_group_exit+0x3a/0xa0
> > __x64_sys_exit_group+0x14/0x20
> > do_syscall_64+0x42/0x100
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > And on a PREEMPT=n kernel, the "while (vma)" loop in exit_mmap() can run
> > for a very long time given a large process. This commit therefore adds
> > a cond_resched() to this loop, providing RCU any needed quiescent states.
> >
> > Cc: Andrew Morton <[email protected]>
> > Cc: <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
>
> We have exactly the same change in our internal kernel since 2018. We
> mostly observed the need_resched warnings on the processes mapping the
> hugetlbfs.
>
> Reviewed-by: Shakeel Butt <[email protected]>

Thank you very much, I will apply your Reviewed-by on the next rebase.

Any other patches we should know about? ;-)

Thanx, Paul

> > ---
> > mm/mmap.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 59a4682..972f839 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3159,6 +3159,7 @@ void exit_mmap(struct mm_struct *mm)
> > if (vma->vm_flags & VM_ACCOUNT)
> > nr_accounted += vma_pages(vma);
> > vma = remove_vma(vma);
> > + cond_resched();
> > }
> > vm_unacct_memory(nr_accounted);
> > }
> > --
> > 2.9.5
> >

2020-06-23 02:01:51

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 10/26] rcu: Expedited grace-period sleeps to idle priority

From: "Paul E. McKenney" <[email protected]>

This commit converts the schedule_timeout_uninterruptible() call used
by RCU's expedited grace-period processing to schedule_timeout_idle().
This conversion avoids polluting the load-average with RCU-related
sleeping.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_exp.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 72952ed..1888c0e 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -403,7 +403,7 @@ static void sync_rcu_exp_select_node_cpus(struct work_struct *wp)
/* Online, so delay for a bit and try again. */
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("selectofl"));
- schedule_timeout_uninterruptible(1);
+ schedule_timeout_idle(1);
goto retry_ipi;
}
/* CPU really is offline, so we must report its QS. */
--
2.9.5

2020-06-23 02:01:55

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 26/26] rcu: Remove KCSAN stubs from update.c

From: "Paul E. McKenney" <[email protected]>

KCSAN is now in mainline, so this commit removes the stubs for the
data_race(), ASSERT_EXCLUSIVE_WRITER(), and ASSERT_EXCLUSIVE_ACCESS()
macros.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/update.c | 13 -------------
1 file changed, 13 deletions(-)

diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index a0ba885..f37cebf 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -52,19 +52,6 @@
#endif
#define MODULE_PARAM_PREFIX "rcupdate."

-#ifndef data_race
-#define data_race(expr) \
- ({ \
- expr; \
- })
-#endif
-#ifndef ASSERT_EXCLUSIVE_WRITER
-#define ASSERT_EXCLUSIVE_WRITER(var) do { } while (0)
-#endif
-#ifndef ASSERT_EXCLUSIVE_ACCESS
-#define ASSERT_EXCLUSIVE_ACCESS(var) do { } while (0)
-#endif
-
#ifndef CONFIG_TINY_RCU
module_param(rcu_expedited, int, 0);
module_param(rcu_normal, int, 0);
--
2.9.5

2020-06-23 02:02:02

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 25/26] rcu: Remove KCSAN stubs

From: "Paul E. McKenney" <[email protected]>

KCSAN is now in mainline, so this commit removes the stubs for the
data_race(), ASSERT_EXCLUSIVE_WRITER(), and ASSERT_EXCLUSIVE_ACCESS()
macros.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 13 -------------
1 file changed, 13 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e7161e0..6422870 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -67,19 +67,6 @@
#endif
#define MODULE_PARAM_PREFIX "rcutree."

-#ifndef data_race
-#define data_race(expr) \
- ({ \
- expr; \
- })
-#endif
-#ifndef ASSERT_EXCLUSIVE_WRITER
-#define ASSERT_EXCLUSIVE_WRITER(var) do { } while (0)
-#endif
-#ifndef ASSERT_EXCLUSIVE_ACCESS
-#define ASSERT_EXCLUSIVE_ACCESS(var) do { } while (0)
-#endif
-
/* Data structures. */

/*
--
2.9.5

2020-06-23 02:02:07

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 16/26] rcu: Mark rcu_nmi_enter() call to rcu_cleanup_after_idle() noinstr

From: "Paul E. McKenney" <[email protected]>

The objtool complains about the call to rcu_cleanup_after_idle() from
rcu_nmi_enter(), so this commit adds instrumentation_begin() before that
call and instrumentation_end() after it.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 08e3648..67912ad 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -980,8 +980,11 @@ noinstr void rcu_nmi_enter(void)
rcu_dynticks_eqs_exit();
// ... but is watching here.

- if (!in_nmi())
+ if (!in_nmi()) {
+ instrumentation_begin();
rcu_cleanup_after_idle();
+ instrumentation_end();
+ }

incby = 1;
} else if (!in_nmi()) {
--
2.9.5

2020-06-23 02:02:08

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 17/26] lockdep: Complain only once about RCU in extended quiescent state

From: "Paul E. McKenney" <[email protected]>

Currently, lockdep_rcu_suspicious() complains twice about RCU read-side
critical sections being invoked from within extended quiescent states,
for example:

RCU used illegally from idle CPU!
rcu_scheduler_active = 2, debug_locks = 1
RCU used illegally from extended quiescent state!

This commit therefore saves a couple lines of code and one line of
console-log output by eliminating the first of these two complaints.

Link: https://lore.kernel.org/lkml/[email protected]
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Will Deacon <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/locking/lockdep.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 29a8de4..0a7549d 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -5851,9 +5851,7 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
pr_warn("\n%srcu_scheduler_active = %d, debug_locks = %d\n",
!rcu_lockdep_current_cpu_online()
? "RCU used illegally from offline CPU!\n"
- : !rcu_is_watching()
- ? "RCU used illegally from idle CPU!\n"
- : "",
+ : "",
rcu_scheduler_active, debug_locks);

/*
--
2.9.5

2020-06-23 02:02:17

by Paul E. McKenney

Subject: [PATCH tip/core/rcu 08/26] rcu: Priority-boost-related sleeps to idle priority

From: "Paul E. McKenney" <[email protected]>

This commit converts the long-standing schedule_timeout_interruptible()
call used by RCU's priority-boosting kthreads to schedule_timeout_idle().
This conversion avoids polluting the load-average with RCU-related
sleeping.

Signed-off-by: Paul E. McKenney <[email protected]>
---
kernel/rcu/tree_plugin.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 3522236..25296c1 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -1033,7 +1033,7 @@ static int rcu_boost_kthread(void *arg)
if (spincnt > 10) {
WRITE_ONCE(rnp->boost_kthread_status, RCU_KTHREAD_YIELDING);
trace_rcu_utilization(TPS("End boost kthread@rcu_yield"));
- schedule_timeout_interruptible(2);
+ schedule_timeout_idle(2);
trace_rcu_utilization(TPS("Start boost kthread@rcu_yield"));
spincnt = 0;
}
--
2.9.5

2020-06-23 17:08:16

by Peter Zijlstra

Subject: Re: [PATCH tip/core/rcu 16/26] rcu: Mark rcu_nmi_enter() call to rcu_cleanup_after_idle() noinstr

On Mon, Jun 22, 2020 at 05:21:37PM -0700, [email protected] wrote:
> From: "Paul E. McKenney" <[email protected]>
>
> The objtool complains about the call to rcu_cleanup_after_idle() from
> rcu_nmi_enter(), so this commit adds instrumentation_begin() before that
> call and instrumentation_end() after it.

Hmm, I've not seen this one. Still,

Acked-by: Peter Zijlstra (Intel) <[email protected]>

> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> kernel/rcu/tree.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 08e3648..67912ad 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -980,8 +980,11 @@ noinstr void rcu_nmi_enter(void)
> rcu_dynticks_eqs_exit();
> // ... but is watching here.
>
> - if (!in_nmi())
> + if (!in_nmi()) {
> + instrumentation_begin();
> rcu_cleanup_after_idle();
> + instrumentation_end();
> + }
>
> incby = 1;
> } else if (!in_nmi()) {
> --
> 2.9.5
>

2020-06-23 17:51:58

by Paul E. McKenney

Subject: Re: [PATCH tip/core/rcu 16/26] rcu: Mark rcu_nmi_enter() call to rcu_cleanup_after_idle() noinstr

On Tue, Jun 23, 2020 at 07:04:25PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 22, 2020 at 05:21:37PM -0700, [email protected] wrote:
> > From: "Paul E. McKenney" <[email protected]>
> >
> > The objtool complains about the call to rcu_cleanup_after_idle() from
> > rcu_nmi_enter(), so this commit adds instrumentation_begin() before that
> > call and instrumentation_end() after it.
>
> Hmm, I've not seen this one. Still,

I am still based off of v5.8-rc1, so I might be missing some commits.
Not seeing any that would affect this, but that doesn't mean that
there aren't any. ;-)

> Acked-by: Peter Zijlstra (Intel) <[email protected]>

I will apply this on my next rebase, thank you!

Thanx, Paul

> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > kernel/rcu/tree.c | 5 ++++-
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 08e3648..67912ad 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -980,8 +980,11 @@ noinstr void rcu_nmi_enter(void)
> > rcu_dynticks_eqs_exit();
> > // ... but is watching here.
> >
> > - if (!in_nmi())
> > + if (!in_nmi()) {
> > + instrumentation_begin();
> > rcu_cleanup_after_idle();
> > + instrumentation_end();
> > + }
> >
> > incby = 1;
> > } else if (!in_nmi()) {
> > --
> > 2.9.5
> >

2020-06-23 19:37:50

by Joel Fernandes

Subject: Re: [PATCH tip/core/rcu 02/26] mm/mmap.c: Add cond_resched() for exit_mmap() CPU stalls

On Mon, Jun 22, 2020 at 05:21:23PM -0700, [email protected] wrote:
> From: "Paul E. McKenney" <[email protected]>
>
> A large process running on a heavily loaded system can encounter the
> following RCU CPU stall warning:
>
> rcu: INFO: rcu_sched self-detected stall on CPU
> rcu: 3-....: (20998 ticks this GP) idle=4ea/1/0x4000000000000002 softirq=556558/556558 fqs=5190
> (t=21013 jiffies g=1005461 q=132576)
> NMI backtrace for cpu 3
> CPU: 3 PID: 501900 Comm: aio-free-ring-w Kdump: loaded Not tainted 5.2.9-108_fbk12_rc3_3858_gb83b75af7909 #1
> Hardware name: Wiwynn HoneyBadger/PantherPlus, BIOS HBM6.71 02/03/2016
> Call Trace:
> <IRQ>
> dump_stack+0x46/0x60
> nmi_cpu_backtrace.cold.3+0x13/0x50
> ? lapic_can_unplug_cpu.cold.27+0x34/0x34
> nmi_trigger_cpumask_backtrace+0xba/0xca
> rcu_dump_cpu_stacks+0x99/0xc7
> rcu_sched_clock_irq.cold.87+0x1aa/0x397
> ? tick_sched_do_timer+0x60/0x60
> update_process_times+0x28/0x60
> tick_sched_timer+0x37/0x70
> __hrtimer_run_queues+0xfe/0x270
> hrtimer_interrupt+0xf4/0x210
> smp_apic_timer_interrupt+0x5e/0x120
> apic_timer_interrupt+0xf/0x20
> </IRQ>
> RIP: 0010:kmem_cache_free+0x223/0x300
> Code: 88 00 00 00 0f 85 ca 00 00 00 41 8b 55 18 31 f6 f7 da 41 f6 45 0a 02 40 0f 94 c6 83 c6 05 9c 41 5e fa e8 a0 a7 01 00 41 56 9d <49> 8b 47 08 a8 03 0f 85 87 00 00 00 65 48 ff 08 e9 3d fe ff ff 65
> RSP: 0018:ffffc9000e8e3da8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
> RAX: 0000000000020000 RBX: ffff88861b9de960 RCX: 0000000000000030
> RDX: fffffffffffe41e8 RSI: 000060777fe3a100 RDI: 000000000001be18
> RBP: ffffea00186e7780 R08: ffffffffffffffff R09: ffffffffffffffff
> R10: ffff88861b9dea28 R11: ffff88887ffde000 R12: ffffffff81230a1f
> R13: ffff888854684dc0 R14: 0000000000000206 R15: ffff8888547dbc00
> ? remove_vma+0x4f/0x60
> remove_vma+0x4f/0x60
> exit_mmap+0xd6/0x160
> mmput+0x4a/0x110
> do_exit+0x278/0xae0
> ? syscall_trace_enter+0x1d3/0x2b0
> ? handle_mm_fault+0xaa/0x1c0
> do_group_exit+0x3a/0xa0
> __x64_sys_exit_group+0x14/0x20
> do_syscall_64+0x42/0x100
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> And on a PREEMPT=n kernel, the "while (vma)" loop in exit_mmap() can run
> for a very long time given a large process. This commit therefore adds
> a cond_resched() to this loop, providing RCU any needed quiescent states.
>
> Cc: Andrew Morton <[email protected]>
> Cc: <[email protected]>
> Signed-off-by: Paul E. McKenney <[email protected]>
> ---
> mm/mmap.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 59a4682..972f839 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3159,6 +3159,7 @@ void exit_mmap(struct mm_struct *mm)
> if (vma->vm_flags & VM_ACCOUNT)
> nr_accounted += vma_pages(vma);
> vma = remove_vma(vma);
> + cond_resched();

Reviewed-by: Joel Fernandes (Google) <[email protected]>

Just for my understanding, cond_resched_tasks_rcu_qs() may not help here
because preemption is not disabled right? Still I see no harm in using it
here either as it may give a slight speed up for tasks-RCU.

thanks,

- Joel

> }
> vm_unacct_memory(nr_accounted);
> }
> --
> 2.9.5
>

2020-06-23 20:56:49

by Paul E. McKenney

Subject: Re: [PATCH tip/core/rcu 02/26] mm/mmap.c: Add cond_resched() for exit_mmap() CPU stalls

On Tue, Jun 23, 2020 at 03:34:31PM -0400, Joel Fernandes wrote:
> On Mon, Jun 22, 2020 at 05:21:23PM -0700, [email protected] wrote:
> > From: "Paul E. McKenney" <[email protected]>
> >
> > A large process running on a heavily loaded system can encounter the
> > following RCU CPU stall warning:
> >
> > rcu: INFO: rcu_sched self-detected stall on CPU
> > rcu: 3-....: (20998 ticks this GP) idle=4ea/1/0x4000000000000002 softirq=556558/556558 fqs=5190
> > (t=21013 jiffies g=1005461 q=132576)
> > NMI backtrace for cpu 3
> > CPU: 3 PID: 501900 Comm: aio-free-ring-w Kdump: loaded Not tainted 5.2.9-108_fbk12_rc3_3858_gb83b75af7909 #1
> > Hardware name: Wiwynn HoneyBadger/PantherPlus, BIOS HBM6.71 02/03/2016
> > Call Trace:
> > <IRQ>
> > dump_stack+0x46/0x60
> > nmi_cpu_backtrace.cold.3+0x13/0x50
> > ? lapic_can_unplug_cpu.cold.27+0x34/0x34
> > nmi_trigger_cpumask_backtrace+0xba/0xca
> > rcu_dump_cpu_stacks+0x99/0xc7
> > rcu_sched_clock_irq.cold.87+0x1aa/0x397
> > ? tick_sched_do_timer+0x60/0x60
> > update_process_times+0x28/0x60
> > tick_sched_timer+0x37/0x70
> > __hrtimer_run_queues+0xfe/0x270
> > hrtimer_interrupt+0xf4/0x210
> > smp_apic_timer_interrupt+0x5e/0x120
> > apic_timer_interrupt+0xf/0x20
> > </IRQ>
> > RIP: 0010:kmem_cache_free+0x223/0x300
> > Code: 88 00 00 00 0f 85 ca 00 00 00 41 8b 55 18 31 f6 f7 da 41 f6 45 0a 02 40 0f 94 c6 83 c6 05 9c 41 5e fa e8 a0 a7 01 00 41 56 9d <49> 8b 47 08 a8 03 0f 85 87 00 00 00 65 48 ff 08 e9 3d fe ff ff 65
> > RSP: 0018:ffffc9000e8e3da8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
> > RAX: 0000000000020000 RBX: ffff88861b9de960 RCX: 0000000000000030
> > RDX: fffffffffffe41e8 RSI: 000060777fe3a100 RDI: 000000000001be18
> > RBP: ffffea00186e7780 R08: ffffffffffffffff R09: ffffffffffffffff
> > R10: ffff88861b9dea28 R11: ffff88887ffde000 R12: ffffffff81230a1f
> > R13: ffff888854684dc0 R14: 0000000000000206 R15: ffff8888547dbc00
> > ? remove_vma+0x4f/0x60
> > remove_vma+0x4f/0x60
> > exit_mmap+0xd6/0x160
> > mmput+0x4a/0x110
> > do_exit+0x278/0xae0
> > ? syscall_trace_enter+0x1d3/0x2b0
> > ? handle_mm_fault+0xaa/0x1c0
> > do_group_exit+0x3a/0xa0
> > __x64_sys_exit_group+0x14/0x20
> > do_syscall_64+0x42/0x100
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > And on a PREEMPT=n kernel, the "while (vma)" loop in exit_mmap() can run
> > for a very long time given a large process. This commit therefore adds
> > a cond_resched() to this loop, providing RCU any needed quiescent states.
> >
> > Cc: Andrew Morton <[email protected]>
> > Cc: <[email protected]>
> > Signed-off-by: Paul E. McKenney <[email protected]>
> > ---
> > mm/mmap.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 59a4682..972f839 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3159,6 +3159,7 @@ void exit_mmap(struct mm_struct *mm)
> > if (vma->vm_flags & VM_ACCOUNT)
> > nr_accounted += vma_pages(vma);
> > vma = remove_vma(vma);
> > + cond_resched();
>
> Reviewed-by: Joel Fernandes (Google) <[email protected]>

Thank you! I will apply this on my next rebase.

> Just for my understanding, cond_resched_tasks_rcu_qs() may not help here
> because preemption is not disabled right? Still I see no harm in using it
> here either as it may give a slight speed up for tasks-RCU.

The RCU-tasks stall-warning interval is ten minutes, and I have not yet
seen evidence that we are getting close to that. If we do, then yes,
a cond_resched_tasks_rcu_qs() might be in this code's future. But it
does add overhead, so we need to see the evidence first.

Thanx, Paul

> thanks,
>
> - Joel
>
> > }
> > vm_unacct_memory(nr_accounted);
> > }
> > --
> > 2.9.5
> >

2020-06-23 21:03:16

by Joel Fernandes

Subject: Re: [PATCH tip/core/rcu 02/26] mm/mmap.c: Add cond_resched() for exit_mmap() CPU stalls

On Tue, Jun 23, 2020 at 01:55:08PM -0700, Paul E. McKenney wrote:
[..]
> > Just for my understanding, cond_resched_tasks_rcu_qs() may not help here
> > because preemption is not disabled right? Still I see no harm in using it
> > here either as it may give a slight speed up for tasks-RCU.
>
> The RCU-tasks stall-warning interval is ten minutes, and I have not yet
> seen evidence that we are getting close to that. If we do, then yes,
> a cond_resched_tasks_rcu_qs() might be in this code's future. But it
> does add overhead, so we need to see the evidence first.

Yes, true about that overhead. Ok, this is fine with me too, thanks :)

- Joel