Received-SPF: pass (google.com: domain of linux-kernel+bounces-69589-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.199.223 as permitted sender) client-ip=147.75.199.223;
Date: Fri, 16 Feb 2024 16:55:59 -0800
From: "Paul E. McKenney" <paulmck@kernel.org>
To: Ankur Arora <ankur.a.arora@oracle.com>
Cc: linux-kernel@vger.kernel.org, tglx@linutronix.de, peterz@infradead.org,
	torvalds@linux-foundation.org, akpm@linux-foundation.org,
	luto@kernel.org, bp@alien8.de, dave.hansen@linux.intel.com,
	hpa@zytor.com, mingo@redhat.com, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, willy@infradead.org, mgorman@suse.de,
	jpoimboe@kernel.org, mark.rutland@arm.com, jgross@suse.com,
	andrew.cooper3@citrix.com, bristot@kernel.org,
	mathieu.desnoyers@efficios.com, glaubitz@physik.fu-berlin.de,
	anton.ivanov@cambridgegreys.com, mattst88@gmail.com,
	krypton@ulrich-teichert.org, rostedt@goodmis.org,
	David.Laight@aculab.com, richard@nod.at, jon.grimm@amd.com,
	bharata@amd.com, boris.ostrovsky@oracle.com, konrad.wilk@oracle.com
Subject: Re: [PATCH 00/30] PREEMPT_AUTO: support lazy rescheduling
Message-ID: <0be4df28-99be-41a3-9e24-2b7cfc740b4a@paulmck-laptop>
Reply-To: paulmck@kernel.org
References: <20240213055554.1802415-1-ankur.a.arora@oracle.com>
 <a7e785f8-d4c3-4b1b-9abe-36ac0b971e44@paulmck-laptop>
 <87le7mpjpr.fsf@oracle.com>
 <4e070ae0-29dc-41ee-aee6-0d3670304825@paulmck-laptop>
 <0d4a4eec-ce91-48da-91b6-1708a97edaeb@paulmck-laptop>
 <871q9dmndg.fsf@oracle.com>
 <9916c73f-510c-47a6-a9b4-ea6b438e82c0@paulmck-laptop>
 <87le7lkzj6.fsf@oracle.com>
 <4bc4ea06-e3e9-4d22-bacf-71cae0ba673d@paulmck-laptop>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4bc4ea06-e3e9-4d22-bacf-71cae0ba673d@paulmck-laptop>

On Thu, Feb 15, 2024 at 06:59:25PM -0800, Paul E. McKenney wrote:
> On Thu, Feb 15, 2024 at 04:45:17PM -0800, Ankur Arora wrote:
> > 
> > Paul E. McKenney <paulmck@kernel.org> writes:
> > 
> > > On Thu, Feb 15, 2024 at 01:24:59PM -0800, Ankur Arora wrote:
> > >>
> > >> Paul E. McKenney <paulmck@kernel.org> writes:
> > >>
> > >> > On Wed, Feb 14, 2024 at 07:45:18PM -0800, Paul E. McKenney wrote:
> > >> >> On Wed, Feb 14, 2024 at 06:03:28PM -0800, Ankur Arora wrote:
> > >> >> >
> > >> >> > Paul E. McKenney <paulmck@kernel.org> writes:
> > >> >> >
> > >> >> > > On Mon, Feb 12, 2024 at 09:55:24PM -0800, Ankur Arora wrote:
> > >> >> > >> Hi,
> > >> >> > >>
> > >> >> > >> This series adds a new scheduling model PREEMPT_AUTO, which like
> > >> >> > >> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> > >> >> > >> preemption model. However, unlike PREEMPT_DYNAMIC, it doesn't depend
> > >> >> > >> on explicit preemption points for the voluntary models.
> > >> >> > >>
> > >> >> > >> The series is based on Thomas' original proposal which he outlined
> > >> >> > >> in [1], [2] and in his PoC [3].
> > >> >> > >>
> > >> >> > >> An earlier RFC version is at [4].
> > >> >> > >
> > >> >> > > This uncovered a couple of latent bugs in RCU due to its having been
> > >> >> > > a good long time since anyone built a !SMP preemptible kernel with
> > >> >> > > non-preemptible RCU.  I have a couple of fixes queued on -rcu [1], most
> > >> >> > > likely for the merge window after next, but let me know if you need
> > >> >> > > them sooner.
> > >> >> >
> > >> >> > Thanks. As you can probably tell, I skipped out on !SMP in my testing.
> > >> >> > But, the attached diff should tide me over until the fixes are in.
> > >> >>
> > >> >> That was indeed my guess.  ;-)
> > >> >>
> > >> >> > > I am also seeing OOM conditions during rcutorture testing of callback
> > >> >> > > flooding, but I am still looking into this.
> > >> >> >
> > >> >> > That's on the PREEMPT_AUTO && PREEMPT_VOLUNTARY configuration?
> > >> >>
> > >> >> On two of the PREEMPT_AUTO && PREEMPT_NONE configurations, but only on
> > >> >> two of them thus far.  I am running a longer test to see if this might
> > >> >> be just luck.  If not, I look to see what rcutorture scenarios TREE10
> > >> >> and TRACE01 have in common.
> > >> >
> > >> > And still TRACE01 and TREE10 are hitting OOMs, still not seeing what
> > >> > sets them apart.  I also hit a grace-period hang in TREE04, which does
> > >> > CONFIG_PREEMPT_VOLUNTARY=y along with CONFIG_PREEMPT_AUTO=y.  Something
> > >> > to dig into more.
> > >>
> > >> So, the only PREEMPT_VOLUNTARY=y configuration is TREE04. I wonder
> > >> if you would continue to hit the TREE04 hang with CONFIG_PREEMTP_NONE=y
> > >> as well?
> > >> (Just in the interest of minimizing configurations.)
> > >
> > > I would be happy to, but in the spirit of full disclosure...
> > >
> > > First, I have seen that failure only once, which is not enough to
> > > conclude that it has much to do with TREE04.  It might simply be low
> > > probability, so that TREE04 simply was unlucky enough to hit it first.
> > > In contrast, I have sufficient data to be reasonably confident that the
> > > callback-flooding OOMs really do have something to do with the TRACE01 and
> > > TREE10 scenarios, even though I am not yet seeing what these two scenarios
> > > have in common that they don't also have in common with other scenarios.
> > > But what is life without a bit of mystery?  ;-)
> > 
> > :).
> > 
> > > Second, please see the attached tarball, which contains .csv files showing
> > > Kconfig options and kernel boot parameters for the various torture tests.
> > > The portions of the filenames preceding the "config.csv" correspond to
> > > the directories in tools/testing/selftests/rcutorture/configs.
> > 
> > So, at least some of the HZ_FULL=y tests don't run into problems.
> > 
> > > Third, there are additional scenarios hand-crafted by the script at
> > > tools/testing/selftests/rcutorture/bin/torture.sh.  Thus far, none of
> > > them have triggered, other than via the newly increased difficulty
> > > of configurating a tracing-free kernel with which to test, but they
> > > can still be useful in ruling out particular Kconfig options or kernel
> > > boot parameters being related to a given issue.
> > >
> > > But please do take a look at the .csv files and let me know what
> > > adjustments would be appropriate given the failure information.
> > 
> > Nothing stands out just yet. Let me start a run here and see if
> > that gives me some ideas.
> 
> Sounds good, thank you!
> 
> > I'm guessing the splats don't give any useful information or
> > you would have attached them ;).
> 
> My plan is to extract what can be extracted from the overnight run
> that I just started.  Just in case the fixes have any effect on things,
> unlikely though that might be given those fixes and the runs that failed.

And I only got no failures from either TREE10 or TRACE01 on last night's
run.  I merged your series on top of v6.8-rc4 with the -rcu tree's
dev branch, the latter to get the RCU fixes.  But this means that last
night's results are not really comparable to earlier results.

I did get a few TREE09 failures, but I get those anyway.  I took it
apart below for you because I got confused and thought that it was a
TREE10 failure.  So just in case you were curious what one of these
looks like and because I am too lazy to delete it.  ;-)

So from the viewpoint of moderate rcutorture testing, this series
looks good.  Woo hoo!!!

We did uncover a separate issue with Tasks RCU, which I will report on
in more detail separately.  However, this issue does not (repeat, *not*)
affect lazy preemption as such, but instead any attempt to remove all
of the cond_resched() invocations.

My next step is to try this on bare metal on a system configured as
is the fleet.  But good progress for a week!!!

							Thanx, Paul

------------------------------------------------------------------------

[ 3458.875819] rcu_torture_fwd_prog: Starting forward-progress test 0
[ 3458.877155] rcu_torture_fwd_prog_cr: Starting forward-progress test 0

	This says that rcutorture is starting a callback-flood
	forward-progress test.

[ 3459.252546] rcu-torture: rtc: 00000000ec445089 ver: 298757 tfle: 0 rta: 298758 rtaf: 0 rtf: 298747 rtmbe: 0 rtmbkf: 0/0 rtbe: 0 rtbke: 0 rtbf: 0 rtb: 0 nt: 895741 barrier: 27715/27716:0 read-exits: 3984 nocb-toggles: 0:0
[ 3459.261545] rcu-torture: Reader Pipe:  363878289 159521 0 0 0 0 0 0 0 0 0
[ 3459.263883] rcu-torture: Reader Batch:  363126419 911391 0 0 0 0 0 0 0 0 0
[ 3459.267544] rcu-torture: Free-Block Circulation:  298757 298756 298754 298753 298752 298751 298750 298749 298748 298747 0

	These lines are just statistics that rcutorture prints out
	periodically.  Thus far, nothing bad happened.	This is one of a
	few SMP scenarios that does not do CPU hotplug.  But the TRACE01
	scenario does do CPU hotplug, so not likely a determining factor.
	Another difference is that TREE10 is the only scenario with more
	than 16 CPUs, but then again, TRACE01 has only five.

[ 3459.733109] ------------[ cut here ]------------
[ 3459.734165] rcutorture_oom_notify invoked upon OOM during forward-progress testing.
[ 3459.735828] WARNING: CPU: 0 PID: 43 at kernel/rcu/rcutorture.c:2874 rcutorture_oom_notify+0x3e/0x1d0

	Now something bad happened.  RCU was unable to keep up with the
	callback flood.  Given that users can create callback floods
	with close(open()) loops, this is not good.

[ 3459.737761] Modules linked in:
[ 3459.738408] CPU: 0 PID: 43 Comm: rcu_torture_fwd Not tainted 6.8.0-rc4-00096-g40c2642e6f24 #8252
[ 3459.740295] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 3459.742651] RIP: 0010:rcutorture_oom_notify+0x3e/0x1d0
[ 3459.743821] Code: e8 37 48 c2 00 48 8b 1d f8 b4 dc 01 48 85 db 0f 84 80 01 00 00 90 48 c7 c6 40 f5 e0 92 48 c7 c7 10 52 23 93 e8 d3 b9 f9 ff 90 <0f> 0b 90 90 8b 35 f8 c0 66 01 85 f6 7e 40 45 31 ed 4d 63 e5 41 83
[ 3459.747935] RSP: 0018:ffffabbb0015bb30 EFLAGS: 00010282
[ 3459.749061] RAX: 0000000000000000 RBX: ffff9485812ae000 RCX: 00000000ffffdfff
[ 3459.750601] RDX: 0000000000000000 RSI: 00000000ffffffea RDI: 0000000000000001
[ 3459.752026] RBP: ffffabbb0015bb98 R08: ffffffff93539388 R09: 00000000ffffdfff
[ 3459.753616] R10: ffffffff934593a0 R11: ffffffff935093a0 R12: 0000000000000000
[ 3459.755134] R13: ffffabbb0015bb98 R14: ffffffff93547da0 R15: 00000000ffffffff
[ 3459.756695] FS:  0000000000000000(0000) GS:ffffffff9344f000(0000) knlGS:0000000000000000
[ 3459.758443] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3459.759672] CR2: 0000000000600298 CR3: 0000000001958000 CR4: 00000000000006f0
[ 3459.761253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3459.762748] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3459.764472] Call Trace:
[ 3459.765003]  <TASK>
[ 3459.765483]  ? __warn+0x61/0xe0
[ 3459.766176]  ? rcutorture_oom_notify+0x3e/0x1d0
[ 3459.767154]  ? report_bug+0x174/0x180
[ 3459.767942]  ? handle_bug+0x3d/0x70
[ 3459.768715]  ? exc_invalid_op+0x18/0x70
[ 3459.769561]  ? asm_exc_invalid_op+0x1a/0x20
[ 3459.770494]  ? rcutorture_oom_notify+0x3e/0x1d0
[ 3459.771501]  blocking_notifier_call_chain+0x5c/0x80
[ 3459.772553]  out_of_memory+0x236/0x4b0
[ 3459.773365]  __alloc_pages+0x9ca/0xb10
[ 3459.774233]  ? set_next_entity+0x8b/0x150
[ 3459.775107]  new_slab+0x382/0x430
[ 3459.776454]  ___slab_alloc+0x23c/0x8c0
[ 3459.777315]  ? preempt_schedule_irq+0x32/0x50
[ 3459.778319]  ? rcu_torture_fwd_prog+0x6bf/0x970
[ 3459.779291]  ? rcu_torture_fwd_prog+0x6bf/0x970
[ 3459.780264]  ? rcu_torture_fwd_prog+0x6bf/0x970
[ 3459.781244]  kmalloc_trace+0x179/0x1a0
[ 3459.784651]  rcu_torture_fwd_prog+0x6bf/0x970
[ 3459.785529]  ? __pfx_rcu_torture_fwd_prog+0x10/0x10
[ 3459.786617]  ? kthread+0xc3/0xf0
[ 3459.787352]  ? __pfx_rcu_torture_fwd_prog+0x10/0x10
[ 3459.788417]  kthread+0xc3/0xf0
[ 3459.789101]  ? __pfx_kthread+0x10/0x10
[ 3459.789906]  ret_from_fork+0x2f/0x50
[ 3459.790708]  ? __pfx_kthread+0x10/0x10
[ 3459.791523]  ret_from_fork_asm+0x1a/0x30
[ 3459.792359]  </TASK>
[ 3459.792835] ---[ end trace 0000000000000000 ]---

	Standard rcutorture stack trace for this failure mode.

[ 3459.793849] rcu_torture_fwd_cb_hist: Callback-invocation histogram 0 (duration 913 jiffies): 1s/10: 0:1 2s/10: 719677:32517 3s/10: 646965:0

	So the whole thing lasted less than a second (913 jiffies).
	Each element of the histogram is 100 milliseconds worth.  Nothing
	came through during the first 100 ms (not surprising), and one
	grace period elapsed (also not surprising).  A lot of callbacks
	came through in the second 100 ms (also not surprising), but there
	were some tens of thousand grace periods (extremely surprising).
	The third 100 ms got a lot of callbacks but no grace periods
	(not surprising).

	Huh.  This started at t=3458.877155 and we got the OOM at
	t=3459.733109, which roughly matches the reported time.

[ 3459.796413] rcu: rcu_fwd_progress_check: GP age 737 jiffies

	The callback flood does seem to have stalled grace periods,
	though not by all *that* much.

[ 3459.799402] rcu: rcu_preempt: wait state: RCU_GP_WAIT_FQS(5) ->state: 0x402 ->rt_priority 0 delta ->gp_start 740 ->gp_activity 0 ->gp_req_activity 747 ->gp_wake_time 68 ->gp_wake_seq 5535689 ->gp_seq 5535689 ->gp_seq_needed 5535696 ->gp_max 713 ->gp_flags 0x0

	The RCU grace-period kthread is in its loop looking for
	quiescent states, and is executing normally ("->gp_activity 0",
	as opposed to some huge number indicating that the kthread was
	never awakened).

[ 3459.804267] rcu:     rcu_node 0:0 ->gp_seq 5535689 ->gp_seq_needed 5535696 ->qsmask 0x0 ...G ->n_boosts 0

	The "->qsmask 0x0" says that all CPUs have provided a quiescent
	state, but the "G" indicates that the normal grace period is
	blocked by some task preempted within an RCU read-side critical
	section.  This output is strange because a 56-CPU scenario should
	have considerably more output.

	Plus this means that this cannot possibly be TREE10 because that
	scenario is non-preemptible, so there cannot be grace periods
	waiting for quiescent states on anything but CPUs.

	This happens from time to time because TREE09 runs on a single
	CPU, and has preemption enabled, but not RCU priority boosting.
	And this output is quite appropriate for a single-CPU scenario.

	I probably should enable RCU priority boosting on this scenario.
	I would also need it to be pretty fast off the mark, because we
	OOMed about 700 milliseconds into the grace period.

	But that has nothing to do with lazy preemption!

[ 3459.806271] rcu:     cpu 0 ->gp_seq_needed 5535692
[ 3459.807139] rcu: RCU callbacks invoked since boot: 65398010
[ 3459.808374] rcu: rcu_fwd_progress_check: callbacks 0: 7484262
[ 3459.809640] rcutorture_oom_notify: Freed 1 RCU callbacks.
[ 3460.616268] rcutorture_oom_notify: Freed 7484253 RCU callbacks.
[ 3460.619551] rcutorture_oom_notify: Freed 0 RCU callbacks.
[ 3460.620740] rcutorture_oom_notify returning after OOM processing.
[ 3460.622719] rcu_torture_fwd_prog_cr: Waiting for CBs: rcu_barrier+0x0/0x2c0() 0
[ 3461.678556] rcu_torture_fwd_prog_nr: Starting forward-progress test 0
[ 3461.684546] rcu_torture_fwd_prog_nr: Waiting for CBs: rcu_barrier+0x0/0x2c0() 0