2024-05-28 00:36:14

by Ankur Arora

Subject: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling

Hi,

This series adds a new scheduling model PREEMPT_AUTO, which, like
PREEMPT_DYNAMIC, allows dynamic switching between the none/voluntary/full
preemption models. Unlike PREEMPT_DYNAMIC, it doesn't depend
on explicit preemption points for the voluntary models.

The series is based on Thomas' original proposal which he outlined
in [1], [2] and in his PoC [3].

v2 is mostly a rework of v1, one of the main changes being less
noisy need-resched-lazy related interfaces.
More details in the changelog below.

The v1 of the series is at [4] and the RFC at [5].

Design
==

PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
PREEMPT_COUNT). This means that the scheduler can always safely
preempt. (This is identical to CONFIG_PREEMPT.)

Having that, the next step is to make the rescheduling policy dependent
on the chosen scheduling model. Currently, the scheduler uses a single
need-resched bit (TIF_NEED_RESCHED) to signal that a reschedule is
needed.
PREEMPT_AUTO extends this by adding a second need-resched bit
(TIF_NEED_RESCHED_LAZY) which, together with TIF_NEED_RESCHED, allows
the scheduler to express two kinds of rescheduling intent: schedule at
the earliest opportunity (TIF_NEED_RESCHED), or express a need for
rescheduling while allowing the task on the runqueue to run to
timeslice completion (TIF_NEED_RESCHED_LAZY).

The scheduler decides which need-resched bit to set based on the
preemption model in use:

                 TIF_NEED_RESCHED       TIF_NEED_RESCHED_LAZY

none             never                  always [*]
voluntary        higher sched class     other tasks [*]
full             always                 never

[*] some details elided.
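
As a rough sketch of the table above (illustrative only, not the
series' implementation: resched_policy() is a made-up helper name and
the conditions are simplified; resched_t, RESCHED_NOW, RESCHED_LAZY and
the preempt_model_*()/sched_class_above() helpers are real):

    /*
     * Sketch: choose which need-resched flavour to set for a task p
     * that should preempt rq->curr.
     */
    static resched_t resched_policy(struct rq *rq, struct task_struct *p)
    {
            /* preempt=full: always reschedule eagerly. */
            if (preempt_model_preemptible())
                    return RESCHED_NOW;

            /* The idle task is always preempted eagerly. */
            if (is_idle_task(rq->curr))
                    return RESCHED_NOW;

            /* preempt=voluntary: only a higher scheduling class is eager. */
            if (preempt_model_voluntary() &&
                sched_class_above(p->sched_class, rq->curr->sched_class))
                    return RESCHED_NOW;

            /* preempt=none (and everything else): run out the timeslice. */
            return RESCHED_LAZY;
    }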

The last part of the puzzle is when preemption happens or, stated
differently, when the need-resched bits are checked:

                       exit-to-user     ret-to-kernel     preempt_count()

NEED_RESCHED_LAZY           Y                 N                  N
NEED_RESCHED                Y                 Y                  Y

Using NEED_RESCHED_LAZY allows for run-to-completion semantics when the
none/voluntary preemption policies are in effect, and eager semantics
under full preemption.
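
In code, the table above boils down to checks like these (lifted, in
simplified form, from the user-exit and irqentry-exit patches below;
the rest of the work loops is elided):

    /* exit-to-user: either bit gets us to schedule() */
    if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
            schedule();

    /* ret-to-kernel (and preempt_enable()): only the eager bit matters */
    if (__tif_need_resched(RESCHED_NOW))
            preempt_schedule_irq();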

In addition, since this is driven purely by the scheduler (not
depending on cond_resched() placement and the like), there is enough
flexibility in the scheduler to cope with edge cases -- e.g. a kernel
task not relinquishing the CPU under NEED_RESCHED_LAZY can be handled
by simply upgrading to a full NEED_RESCHED, which can use more coercive
instruments like a resched IPI to induce a context-switch.
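
A sketch of that escalation (illustrative only: the helper name is made
up, and the series actually drives this from the tick via
update_curr()/entity_tick(); resched_curr() here is the lazy-aware
version from the resched_curr() patch below):

    /* Caller holds rq->lock, e.g. from the tick. */
    static void escalate_lazy_resched(struct rq *rq)
    {
            struct task_struct *curr = rq->curr;

            /* Already marked for eager rescheduling: nothing to do. */
            if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED))
                    return;

            /*
             * Marked lazily, and the caller has decided the task has run
             * long enough: upgrade to TIF_NEED_RESCHED so the next
             * interrupt return (or a resched IPI) forces the switch.
             */
            if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED_LAZY))
                    resched_curr(rq);
    }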

Performance
==
The performance in the basic tests (perf bench sched messaging, kernbench,
cyclictest) matches or improves on what we see under PREEMPT_DYNAMIC.
(See patches
"sched: support preempt=none under PREEMPT_AUTO"
"sched: support preempt=full under PREEMPT_AUTO"
"sched: handle preempt=voluntary under PREEMPT_AUTO")

For a macro test, a colleague in Oracle's Exadata team tried two
OLTP benchmarks (on a 5.4.17-based Oracle kernel, with the v1 series
backported).

In both tests the data was cached on remote nodes (cells), and the
database nodes (compute) served client queries, with clients being
local in the first test and remote in the second.

Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs


                             PREEMPT_VOLUNTARY                PREEMPT_AUTO
                                                           (preempt=voluntary)
                    ==============================   ==============================
          clients    throughput    cpu-usage          throughput    cpu-usage        Gain
                     (tx/min)    (utime %/stime %)    (tx/min)    (utime %/stime %)
          -------    ----------  -----------------    ----------  -----------------  -----

OLTP                  384    9,315,653    25/ 6         9,253,252    25/ 6          -0.7%
benchmark            1536   13,177,565    50/10        13,657,306    50/10          +3.6%
(local clients)      3456   14,063,017    63/12        14,179,706    64/12          +0.8%

OLTP                   96    8,973,985    17/ 2         8,924,926    17/ 2          -0.5%
benchmark             384   22,577,254    60/ 8        22,211,419    59/ 8          -1.6%
(remote clients,     2304   25,882,857    82/11        25,536,100    82/11          -1.3%
 90/10 RW ratio)


(Both sets of tests have a fair amount of network traffic since the
query tables etc. are cached on the cells. Additionally, the first set,
given the local clients, stresses the scheduler a bit more than the
second.)

The comparative performance for both tests is fairly close, more or
less within the margin of error.

Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):

"
a) Base kernel (6.7),
b) v1, PREEMPT_AUTO, preempt=voluntary
c) v1, PREEMPT_DYNAMIC, preempt=voluntary
d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y

Workloads I tested and their %gain,
                 case b     case c     case d
NAS              +2.7%      +1.9%      +2.1%
Hashjoin         +0.0%      +0.0%      +0.0%
Graph500         -6.0%      +0.0%      +0.0%
XSBench          +1.7%      +0.0%      +1.2%

(Note about the Graph500 numbers at [8].)

Did kernbench etc test from Mel's mmtests suite also. Did not notice
much difference.
"

One case where there is a significant performance drop is on powerpc,
seen when running hackbench on a 320-core system (a test on a smaller
system is fine). In theory there's no reason for this to happen only on
powerpc since most of the code is common, but I haven't been able to
reproduce it on x86 so far.

All in all, I think the tests above show that this scheduling model has legs.
However, the none/voluntary models under PREEMPT_AUTO are conceptually
different enough from the current none/voluntary models that there
likely are workloads where performance would be subpar. That needs more
extensive testing to figure out the weak points.


Series layout
==

Patches 1-2,
"sched/core: Move preempt_model_*() helpers from sched.h to preempt.h"
"sched/core: Drop spinlocks on contention iff kernel is preemptible"
condition spin_needbreak() on the dynamic preempt_model_*().
Not strictly required, but a useful bugfix for PREEMPT_DYNAMIC and PREEMPT_AUTO.

Patch 3
"sched: make test_*_tsk_thread_flag() return bool"
is a minor cleanup.

Patch 4,
"preempt: introduce CONFIG_PREEMPT_AUTO"
introduces the new scheduling model.

Patches 5-7,
"thread_info: selector for TIF_NEED_RESCHED[_LAZY]"
"thread_info: define __tif_need_resched(resched_t)"
"sched: define *_tsk_need_resched_lazy() helpers"

introduce new thread_info/task helper interfaces or make changes to
pre-existing ones that will be used in the rest of the series.
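
For reference, the shapes of these new interfaces, as used later in the
series (signatures inferred from their use in the thread_info and
resched_curr() patches; the bodies shown are illustrative sketches, not
the patches themselves):

    /* Flag selector (patch 5): RESCHED_NOW/RESCHED_LAZY -> TIF_* bit. */
    static __always_inline int tif_resched(resched_t rs);

    /* Check current's thread_info for the chosen flavour (patch 6). */
    static __always_inline bool __tif_need_resched(resched_t rs);

    /* Per-task set/test helpers (patch 7), roughly: */
    static inline void __set_tsk_need_resched(struct task_struct *tsk, resched_t rs)
    {
            set_tsk_thread_flag(tsk, tif_resched(rs));
    }

    static inline bool __test_tsk_need_resched(struct task_struct *tsk, resched_t rs)
    {
            return test_tsk_thread_flag(tsk, tif_resched(rs));
    }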

Patches 8-11,
"entry: handle lazy rescheduling at user-exit"
"entry/kvm: handle lazy rescheduling at guest-entry"
"entry: irqentry_exit only preempts for TIF_NEED_RESCHED"
"sched: __schedule_loop() doesn't need to check for need_resched_lazy()"

make changes/document the rescheduling points.

Patches 12-13,
"sched: separate PREEMPT_DYNAMIC config logic"
"sched: allow runtime config for PREEMPT_AUTO"

reuse the PREEMPT_DYNAMIC runtime configuration logic.

Patches 14-18,

"rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO"
"rcu: fix header guard for rcu_all_qs()"
"preempt,rcu: warn on PREEMPT_RCU=n, preempt=full"
"rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y"
"rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y"

add changes needed for RCU.

Patches 19-20,
"x86/thread_info: define TIF_NEED_RESCHED_LAZY"
"powerpc: add support for PREEMPT_AUTO"

add x86 and powerpc support.

Patches 21-24,
"sched: prepare for lazy rescheduling in resched_curr()"
"sched: default preemption policy for PREEMPT_AUTO"
"sched: handle idle preemption for PREEMPT_AUTO"
"sched: schedule eagerly in resched_cpu()"

are preparatory patches for adding PREEMPT_AUTO. Among other things
they add the default need-resched policy for !PREEMPT_AUTO,
PREEMPT_AUTO, and the idle task.

Patches 25-26,
"sched/fair: refactor update_curr(), entity_tick()",
"sched/fair: handle tick expiry under lazy preemption"

handle the 'hog' problem, where a kernel task does not voluntarily
schedule out.

Patches 27-29,
"sched: support preempt=none under PREEMPT_AUTO"
"sched: support preempt=full under PREEMPT_AUTO"
"sched: handle preempt=voluntary under PREEMPT_AUTO"

add support for the three preemption models.

Patches 30-33,
"sched: latency warn for TIF_NEED_RESCHED_LAZY",
"tracing: support lazy resched",
"Documentation: tracing: add TIF_NEED_RESCHED_LAZY",
"osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y"

handle the remaining bits and pieces to do with TIF_NEED_RESCHED_LAZY.

And, finally, patches 34-35,

"kconfig: decompose ARCH_NO_PREEMPT"
"arch: decompose ARCH_NO_PREEMPT"

decompose ARCH_NO_PREEMPT, which might make it easier to support
CONFIG_PREEMPTION on some architectures.


Changelog
==
v2: rebased to v6.9, addresses review comments, folds some other patches.

- the lazy interfaces are less noisy now: the current interfaces stay
unchanged so non-scheduler code doesn't need to change.
This also means that lazy preemption becomes a scheduler detail,
which works well with the core idea of lazy scheduling.
(Mark Rutland, Thomas Gleixner)

- preempt=none model now respects the leftmost deadline (Juri Lelli)
- Add need-resched flag combination state in tracing headers (Steven Rostedt)
- Decompose ARCH_NO_PREEMPT
- Changes for RCU (and TASKS_RCU) will go in separately [6]

- spin_needbreak() should be conditioned on preempt_model_*() at
runtime (patches from Sean Christopherson [7])
- powerpc support from Shrikanth Hegde

v1:
- Addresses review comments and is generally a more focused
version of the RFC.
- Lots of code reorganization.
- Bugfixes all over.
- need_resched() now only checks for TIF_NEED_RESCHED instead
of TIF_NEED_RESCHED|TIF_NEED_RESCHED_LAZY.
- set_nr_if_polling() now does not check for TIF_NEED_RESCHED_LAZY.
- Tighten idle related checks.
- RCU changes to force context-switches when a quiescent state is
urgently needed.
- Does not break live-patching anymore

Also at: github.com/terminus/linux preempt-v2

Please review.

Thanks
Ankur

Cc: Thomas Gleixner <[email protected]>
Cc: Raghavendra K T <[email protected]>
Cc: Shrikanth Hegde <[email protected]>

[1] https://lore.kernel.org/lkml/87cyyfxd4k.ffs@tglx/
[2] https://lore.kernel.org/lkml/87led2wdj0.ffs@tglx/
[3] https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
[4] https://lore.kernel.org/lkml/[email protected]/
[5] https://lore.kernel.org/lkml/[email protected]/
[6] https://lore.kernel.org/lkml/[email protected]/
[7] https://lore.kernel.org/lkml/[email protected]/
[8] https://lore.kernel.org/lkml/[email protected]/
[9] https://lore.kernel.org/lkml/[email protected]/

Ankur Arora (32):
sched: make test_*_tsk_thread_flag() return bool
preempt: introduce CONFIG_PREEMPT_AUTO
thread_info: selector for TIF_NEED_RESCHED[_LAZY]
thread_info: define __tif_need_resched(resched_t)
sched: define *_tsk_need_resched_lazy() helpers
entry: handle lazy rescheduling at user-exit
entry/kvm: handle lazy rescheduling at guest-entry
entry: irqentry_exit only preempts for TIF_NEED_RESCHED
sched: __schedule_loop() doesn't need to check for need_resched_lazy()
sched: separate PREEMPT_DYNAMIC config logic
sched: allow runtime config for PREEMPT_AUTO
rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO
rcu: fix header guard for rcu_all_qs()
preempt,rcu: warn on PREEMPT_RCU=n, preempt=full
rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y
rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y
x86/thread_info: define TIF_NEED_RESCHED_LAZY
sched: prepare for lazy rescheduling in resched_curr()
sched: default preemption policy for PREEMPT_AUTO
sched: handle idle preemption for PREEMPT_AUTO
sched: schedule eagerly in resched_cpu()
sched/fair: refactor update_curr(), entity_tick()
sched/fair: handle tick expiry under lazy preemption
sched: support preempt=none under PREEMPT_AUTO
sched: support preempt=full under PREEMPT_AUTO
sched: handle preempt=voluntary under PREEMPT_AUTO
sched: latency warn for TIF_NEED_RESCHED_LAZY
tracing: support lazy resched
Documentation: tracing: add TIF_NEED_RESCHED_LAZY
osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y
kconfig: decompose ARCH_NO_PREEMPT
arch: decompose ARCH_NO_PREEMPT

Sean Christopherson (2):
sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
sched/core: Drop spinlocks on contention iff kernel is preemptible

Shrikanth Hegde (1):
powerpc: add support for PREEMPT_AUTO

.../admin-guide/kernel-parameters.txt | 5 +-
Documentation/trace/ftrace.rst | 6 +-
arch/Kconfig | 7 +
arch/alpha/Kconfig | 3 +-
arch/hexagon/Kconfig | 3 +-
arch/m68k/Kconfig | 3 +-
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/thread_info.h | 5 +-
arch/powerpc/kernel/interrupt.c | 5 +-
arch/um/Kconfig | 3 +-
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 6 +-
include/linux/entry-common.h | 2 +-
include/linux/entry-kvm.h | 2 +-
include/linux/preempt.h | 43 ++-
include/linux/rcutree.h | 2 +-
include/linux/sched.h | 101 +++---
include/linux/spinlock.h | 14 +-
include/linux/thread_info.h | 71 +++-
include/linux/trace_events.h | 6 +-
init/Makefile | 1 +
kernel/Kconfig.preempt | 37 ++-
kernel/entry/common.c | 16 +-
kernel/entry/kvm.c | 4 +-
kernel/rcu/Kconfig | 2 +-
kernel/rcu/tree.c | 13 +-
kernel/rcu/tree_plugin.h | 11 +-
kernel/sched/core.c | 311 ++++++++++++------
kernel/sched/deadline.c | 9 +-
kernel/sched/debug.c | 13 +-
kernel/sched/fair.c | 56 ++--
kernel/sched/rt.c | 6 +-
kernel/sched/sched.h | 27 +-
kernel/trace/trace.c | 30 +-
kernel/trace/trace_osnoise.c | 22 +-
kernel/trace/trace_output.c | 16 +-
36 files changed, 598 insertions(+), 265 deletions(-)

--
2.31.1



2024-05-28 00:36:27

by Ankur Arora

Subject: [PATCH v2 03/35] sched: make test_*_tsk_thread_flag() return bool

All users of test_*_tsk_thread_flag() treat the result as boolean.
This is also true for the underlying test_and_*_bit() operations.

Change the return type to bool.

Cc: Peter Zijlstra <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
Acked-by: Mark Rutland <[email protected]>
---
include/linux/sched.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 73a3402843c6..4808e5dd4f69 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1937,17 +1937,17 @@ static inline void update_tsk_thread_flag(struct task_struct *tsk, int flag,
update_ti_thread_flag(task_thread_info(tsk), flag, value);
}

-static inline int test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_set_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_and_set_ti_thread_flag(task_thread_info(tsk), flag);
}

-static inline int test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_and_clear_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_and_clear_ti_thread_flag(task_thread_info(tsk), flag);
}

-static inline int test_tsk_thread_flag(struct task_struct *tsk, int flag)
+static inline bool test_tsk_thread_flag(struct task_struct *tsk, int flag)
{
return test_ti_thread_flag(task_thread_info(tsk), flag);
}
@@ -1962,7 +1962,7 @@ static inline void clear_tsk_need_resched(struct task_struct *tsk)
clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}

-static inline int test_tsk_need_resched(struct task_struct *tsk)
+static inline bool test_tsk_need_resched(struct task_struct *tsk)
{
return unlikely(test_tsk_thread_flag(tsk,TIF_NEED_RESCHED));
}
--
2.31.1


2024-05-28 00:36:28

by Ankur Arora

Subject: [PATCH v2 05/35] thread_info: selector for TIF_NEED_RESCHED[_LAZY]

Define tif_resched() to serve as a selector for the specific
need-resched flag: tif_resched() maps to TIF_NEED_RESCHED
or to TIF_NEED_RESCHED_LAZY.

For !CONFIG_PREEMPT_AUTO, tif_resched() always evaluates
to TIF_NEED_RESCHED, preserving the existing scheduling behaviour.

Cc: Peter Zijlstra <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/thread_info.h | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 06e13e7acbe2..65e5beedc915 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -71,6 +71,31 @@ enum syscall_work_bit {
#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
#endif

+typedef enum {
+ RESCHED_NOW = 0,
+ RESCHED_LAZY = 1,
+} resched_t;
+
+/*
+ * tif_resched(r) maps to TIF_NEED_RESCHED[_LAZY] with CONFIG_PREEMPT_AUTO.
+ *
+ * For !CONFIG_PREEMPT_AUTO, both tif_resched(RESCHED_NOW) and
+ * tif_resched(RESCHED_LAZY) reduce to the same value (TIF_NEED_RESCHED)
+ * leaving any scheduling behaviour unchanged.
+ */
+static __always_inline int tif_resched(resched_t rs)
+{
+ if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
+ return (rs == RESCHED_NOW) ? TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY;
+ else
+ return TIF_NEED_RESCHED;
+}
+
+static __always_inline int _tif_resched(resched_t rs)
+{
+ return 1 << tif_resched(rs);
+}
+
#ifdef __KERNEL__

#ifndef arch_set_restart_data
--
2.31.1


2024-05-28 00:37:07

by Ankur Arora

Subject: [PATCH v2 06/35] thread_info: define __tif_need_resched(resched_t)

Define __tif_need_resched() which takes a resched_t parameter to
decide the immediacy of the need-resched.

Update need_resched() and should_resched() so they both check for
__tif_need_resched(RESCHED_NOW), which keeps the current semantics.

Non-scheduling code -- which only cares about any immediately required
preemption -- can continue unchanged since the commonly used interfaces
(need_resched(), should_resched(), tif_need_resched()) stay the same.

This also allows lazy preemption to just be a scheduler detail.

Cc: Arnd Bergmann <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Cc: Rafael J. Wysocki" <[email protected]>
Cc: Steven Rostedt <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/preempt.h | 2 +-
include/linux/sched.h | 7 ++++++-
include/linux/thread_info.h | 34 ++++++++++++++++++++++++++++------
kernel/trace/trace.c | 2 +-
4 files changed, 36 insertions(+), 9 deletions(-)

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index ce76f1a45722..d453f5e34390 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -312,7 +312,7 @@ do { \
} while (0)
#define preempt_fold_need_resched() \
do { \
- if (tif_need_resched()) \
+ if (__tif_need_resched(RESCHED_NOW)) \
set_preempt_need_resched(); \
} while (0)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4808e5dd4f69..37a51115b691 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2062,7 +2062,12 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);

static __always_inline bool need_resched(void)
{
- return unlikely(tif_need_resched());
+ return unlikely(__tif_need_resched(RESCHED_NOW));
+}
+
+static __always_inline bool need_resched_lazy(void)
+{
+ return unlikely(__tif_need_resched(RESCHED_LAZY));
}

/*
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 65e5beedc915..e246b01553a5 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -216,22 +216,44 @@ static __always_inline unsigned long read_ti_thread_flags(struct thread_info *ti

#ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H

-static __always_inline bool tif_need_resched(void)
+static __always_inline bool __tif_need_resched_bitop(int nr_flag)
{
- return arch_test_bit(TIF_NEED_RESCHED,
- (unsigned long *)(&current_thread_info()->flags));
+ return arch_test_bit(nr_flag,
+ (unsigned long *)(&current_thread_info()->flags));
}

#else

-static __always_inline bool tif_need_resched(void)
+static __always_inline bool __tif_need_resched_bitop(int nr_flag)
{
- return test_bit(TIF_NEED_RESCHED,
- (unsigned long *)(&current_thread_info()->flags));
+ return test_bit(nr_flag,
+ (unsigned long *)(&current_thread_info()->flags));
}

#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */

+static __always_inline bool __tif_need_resched(resched_t rs)
+{
+ /*
+ * With !PREEMPT_AUTO, this check is only meaningful if we
+ * are checking if tif_resched(RESCHED_NOW) is set.
+ */
+ if (IS_ENABLED(CONFIG_PREEMPT_AUTO) || rs == RESCHED_NOW)
+ return __tif_need_resched_bitop(tif_resched(rs));
+ else
+ return false;
+}
+
+static __always_inline bool tif_need_resched(void)
+{
+ return __tif_need_resched(RESCHED_NOW);
+}
+
+static __always_inline bool tif_need_resched_lazy(void)
+{
+ return __tif_need_resched(RESCHED_LAZY);
+}
+
#ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
static inline int arch_within_stack_frames(const void * const stack,
const void * const stackend,
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 233d1af39fff..ed229527be05 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2511,7 +2511,7 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status)
if (softirq_count() >> (SOFTIRQ_SHIFT + 1))
trace_flags |= TRACE_FLAG_BH_OFF;

- if (tif_need_resched())
+ if (__tif_need_resched(RESCHED_NOW))
trace_flags |= TRACE_FLAG_NEED_RESCHED;
if (test_preempt_need_resched())
trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
--
2.31.1


2024-05-28 00:37:22

by Ankur Arora

Subject: [PATCH v2 08/35] entry: handle lazy rescheduling at user-exit

The scheduling policy for TIF_NEED_RESCHED_LAZY is to allow the
running task to voluntarily schedule out, letting it run to
timeslice completion.

For archs with GENERIC_ENTRY, do this by adding a check in
exit_to_user_mode_loop().

Cc: Peter Zijlstra <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/entry-common.h | 2 +-
kernel/entry/common.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index b0fb775a600d..f5bb19369973 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -65,7 +65,7 @@
#define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
- ARCH_EXIT_TO_USER_MODE_WORK)
+ _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)

/**
* arch_enter_from_user_mode - Architecture specific sanity check for user mode regs
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 90843cc38588..bcb23c866425 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -98,7 +98,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,

local_irq_enable_exit_to_user(ti_work);

- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();

if (ti_work & _TIF_UPROBE)
--
2.31.1


2024-05-28 00:37:23

by Ankur Arora

Subject: [PATCH v2 02/35] sched/core: Drop spinlocks on contention iff kernel is preemptible

From: Sean Christopherson <[email protected]>

Use preempt_model_preemptible() to detect a preemptible kernel when
deciding whether or not to reschedule in order to drop a contended
spinlock or rwlock. Because PREEMPT_DYNAMIC selects PREEMPTION, kernels
built with PREEMPT_DYNAMIC=y will yield contended locks even if the live
preemption model is "none" or "voluntary". In short, make kernels with
dynamically selected models behave the same as kernels with statically
selected models.

Somewhat counter-intuitively, NOT yielding a lock can provide better
latency for the relevant tasks/processes. E.g. KVM x86's mmu_lock, a
rwlock, is often contended between an invalidation event (takes mmu_lock
for write) and a vCPU servicing a guest page fault (takes mmu_lock for
read). For _some_ setups, letting the invalidation task complete even
if there is mmu_lock contention provides lower latency for *all* tasks,
i.e. the invalidation completes sooner *and* the vCPU services the guest
page fault sooner.

But even KVM's mmu_lock behavior isn't uniform, e.g. the "best" behavior
can vary depending on the host VMM, the guest workload, the number of
vCPUs, the number of pCPUs in the host, why there is lock contention, etc.

In other words, simply deleting the CONFIG_PREEMPTION guard (or doing the
opposite and removing contention yielding entirely) needs to come with a
big pile of data proving that changing the status quo is a net positive.

Opportunistically document this side effect of preempt=full, as yielding
contended spinlocks can have significant, user-visible impact.

Fixes: c597bfddc9e9 ("sched: Provide Kconfig support for default dynamic preempt mode")
Link: https://lore.kernel.org/kvm/[email protected]
Cc: Valentin Schneider <[email protected]>
Cc: "Peter Zijlstra (Intel)" <[email protected]>
Cc: Marco Elver <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: David Matlack <[email protected]>
Cc: Friedrich Weber <[email protected]>
Cc: Ankur Arora <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Sean Christopherson <[email protected]>
Reviewed-by: Ankur Arora <[email protected]>
Reviewed-by: Chen Yu <[email protected]>
---
Documentation/admin-guide/kernel-parameters.txt | 4 +++-
include/linux/spinlock.h | 14 ++++++--------
2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 396137ee018d..2d693300ab57 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4722,7 +4722,9 @@
none - Limited to cond_resched() calls
voluntary - Limited to cond_resched() and might_sleep() calls
full - Any section that isn't explicitly preempt disabled
- can be preempted anytime.
+ can be preempted anytime. Tasks will also yield
+ contended spinlocks (if the critical section isn't
+ explicitly preempt disabled beyond the lock itself).

print-fatal-signals=
[KNL] debug: print fatal signals
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 3fcd20de6ca8..63dd8cf3c3c2 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -462,11 +462,10 @@ static __always_inline int spin_is_contended(spinlock_t *lock)
*/
static inline int spin_needbreak(spinlock_t *lock)
{
-#ifdef CONFIG_PREEMPTION
+ if (!preempt_model_preemptible())
+ return 0;
+
return spin_is_contended(lock);
-#else
- return 0;
-#endif
}

/*
@@ -479,11 +478,10 @@ static inline int spin_needbreak(spinlock_t *lock)
*/
static inline int rwlock_needbreak(rwlock_t *lock)
{
-#ifdef CONFIG_PREEMPTION
+ if (!preempt_model_preemptible())
+ return 0;
+
return rwlock_is_contended(lock);
-#else
- return 0;
-#endif
}

/*
--
2.31.1


2024-05-28 00:37:53

by Ankur Arora

Subject: [PATCH v2 09/35] entry/kvm: handle lazy rescheduling at guest-entry

Archs defining CONFIG_KVM_XFER_TO_GUEST_WORK call
xfer_to_guest_mode_handle_work() from various KVM vcpu-run
loops to check for any task work including rescheduling.

Handle TIF_NEED_RESCHED_LAZY alongside TIF_NEED_RESCHED.

Also, while at it, remove the explicit check for need_resched() in
the exit condition as that is already covered in the loop condition.

Cc: Paolo Bonzini <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/entry-kvm.h | 2 +-
kernel/entry/kvm.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 6813171afccb..674a622c91be 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -18,7 +18,7 @@

#define XFER_TO_GUEST_MODE_WORK \
(_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
- _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+ _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)

struct kvm_vcpu;

diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 2e0f75bcb7fd..8485f63863af 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
return -EINTR;
}

- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();

if (ti_work & _TIF_NOTIFY_RESUME)
@@ -24,7 +24,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
return ret;

ti_work = read_thread_flags();
- } while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched());
+ } while (ti_work & XFER_TO_GUEST_MODE_WORK);
return 0;
}

--
2.31.1


2024-05-28 00:38:03

by Ankur Arora

Subject: [PATCH v2 10/35] entry: irqentry_exit only preempts for TIF_NEED_RESCHED

Use __tif_need_resched(RESCHED_NOW) instead of need_resched() to be
explicit that this path only reschedules if it is needed imminently.

Also, add a comment about why we need a need-resched check here at
all, given that the top level conditional has already checked the
preempt_count().

Cc: Peter Zijlstra <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/entry/common.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index bcb23c866425..c684385921de 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -307,7 +307,16 @@ void raw_irqentry_exit_cond_resched(void)
rcu_irq_exit_check_preempt();
if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
WARN_ON_ONCE(!on_thread_stack());
- if (need_resched())
+
+ /*
+ * Check if we need to preempt eagerly.
+ *
+ * Note: we need an explicit check here because some
+ * architectures don't fold TIF_NEED_RESCHED in the
+ * preempt_count. For archs that do, this is already covered
+ * in the conditional above.
+ */
+ if (__tif_need_resched(RESCHED_NOW))
preempt_schedule_irq();
}
}
--
2.31.1


2024-05-28 00:38:15

by Ankur Arora

Subject: [PATCH v2 11/35] sched: __schedule_loop() doesn't need to check for need_resched_lazy()

Various scheduling loops recheck need_resched() to avoid a missed
scheduling opportunity.

Explicitly note that we don't need to check for need_resched_lazy()
since that only needs to be handled at exit-to-user.

Also update the comment above __schedule() to describe
TIF_NEED_RESCHED_LAZY semantics.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d00d7b45303e..0c26b60c1101 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6582,20 +6582,23 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
*
* 1. Explicit blocking: mutex, semaphore, waitqueue, etc.
*
- * 2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
- * paths. For example, see arch/x86/entry_64.S.
+ * 2. TIF_NEED_RESCHED flag is checked on interrupt and TIF_NEED_RESCHED[_LAZY]
+ * flags on userspace return paths. For example, see kernel/entry/common.c
*
- * To drive preemption between tasks, the scheduler sets the flag in timer
- * interrupt handler scheduler_tick().
+ * To drive preemption between tasks, the scheduler sets one of the need-
+ * resched flags in the timer interrupt handler scheduler_tick():
+ * - !CONFIG_PREEMPT_AUTO: TIF_NEED_RESCHED.
+ * - CONFIG_PREEMPT_AUTO: TIF_NEED_RESCHED or TIF_NEED_RESCHED_LAZY
+ * depending on the preemption model.
*
* 3. Wakeups don't really cause entry into schedule(). They add a
* task to the run-queue and that's it.
*
* Now, if the new task added to the run-queue preempts the current
- * task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
- * called on the nearest possible occasion:
+ * task, then the wakeup sets TIF_NEED_RESCHED[_LAZY] and schedule()
+ * gets called on the nearest possible occasion:
*
- * - If the kernel is preemptible (CONFIG_PREEMPTION=y):
+ * - If the kernel is running under preempt_model_preemptible():
*
* - in syscall or exception context, at the next outmost
* preempt_enable(). (this might be as soon as the wake_up()'s
@@ -6604,8 +6607,8 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
* - in IRQ context, return from interrupt-handler to
* preemptible context
*
- * - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
- * then at the next:
+ * - If the kernel is running under preempt_model_none(), or
+ * preempt_model_voluntary(), then at the next:
*
* - cond_resched() call
* - explicit schedule() call
@@ -6823,6 +6826,11 @@ static __always_inline void __schedule_loop(unsigned int sched_mode)
preempt_disable();
__schedule(sched_mode);
sched_preempt_enable_no_resched();
+
+ /*
+ * We don't check for need_resched_lazy() here, since it is
+ * always handled at exit-to-user.
+ */
} while (need_resched());
}

@@ -6928,7 +6936,7 @@ static void __sched notrace preempt_schedule_common(void)
preempt_enable_no_resched_notrace();

/*
- * Check again in case we missed a preemption opportunity
+ * Check again in case we missed an eager preemption opportunity
* between schedule and now.
*/
} while (need_resched());
--
2.31.1


2024-05-28 00:38:30

by Ankur Arora

Subject: [PATCH v2 12/35] sched: separate PREEMPT_DYNAMIC config logic

Pull out the PREEMPT_DYNAMIC setup logic to allow other preemption
models to dynamically configure preemption.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 165 +++++++++++++++++++++++---------------------
1 file changed, 86 insertions(+), 79 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0c26b60c1101..349f6257fdcd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8713,6 +8713,89 @@ int __cond_resched_rwlock_write(rwlock_t *lock)
}
EXPORT_SYMBOL(__cond_resched_rwlock_write);

+#if defined(CONFIG_PREEMPT_DYNAMIC)
+
+#define PREEMPT_MODE "Dynamic Preempt"
+
+enum {
+ preempt_dynamic_undefined = -1,
+ preempt_dynamic_none,
+ preempt_dynamic_voluntary,
+ preempt_dynamic_full,
+};
+
+int preempt_dynamic_mode = preempt_dynamic_undefined;
+static DEFINE_MUTEX(sched_dynamic_mutex);
+
+int sched_dynamic_mode(const char *str)
+{
+ if (!strcmp(str, "none"))
+ return preempt_dynamic_none;
+
+ if (!strcmp(str, "voluntary"))
+ return preempt_dynamic_voluntary;
+
+ if (!strcmp(str, "full"))
+ return preempt_dynamic_full;
+
+ return -EINVAL;
+}
+
+static void __sched_dynamic_update(int mode);
+void sched_dynamic_update(int mode)
+{
+ mutex_lock(&sched_dynamic_mutex);
+ __sched_dynamic_update(mode);
+ mutex_unlock(&sched_dynamic_mutex);
+}
+
+static void __init preempt_dynamic_init(void)
+{
+ if (preempt_dynamic_mode == preempt_dynamic_undefined) {
+ if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
+ sched_dynamic_update(preempt_dynamic_none);
+ } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
+ sched_dynamic_update(preempt_dynamic_voluntary);
+ } else {
+ /* Default static call setting, nothing to do */
+ WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
+ preempt_dynamic_mode = preempt_dynamic_full;
+ pr_info("%s: full\n", PREEMPT_MODE);
+ }
+ }
+}
+
+static int __init setup_preempt_mode(char *str)
+{
+ int mode = sched_dynamic_mode(str);
+ if (mode < 0) {
+ pr_warn("%s: unsupported mode: %s\n", PREEMPT_MODE, str);
+ return 0;
+ }
+
+ sched_dynamic_update(mode);
+ return 1;
+}
+__setup("preempt=", setup_preempt_mode);
+
+#define PREEMPT_MODEL_ACCESSOR(mode) \
+ bool preempt_model_##mode(void) \
+ { \
+ WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
+ return preempt_dynamic_mode == preempt_dynamic_##mode; \
+ } \
+ EXPORT_SYMBOL_GPL(preempt_model_##mode)
+
+PREEMPT_MODEL_ACCESSOR(none);
+PREEMPT_MODEL_ACCESSOR(voluntary);
+PREEMPT_MODEL_ACCESSOR(full);
+
+#else /* !CONFIG_PREEMPT_DYNAMIC */
+
+static inline void preempt_dynamic_init(void) { }
+
+#endif /* !CONFIG_PREEMPT_DYNAMIC */
+
#ifdef CONFIG_PREEMPT_DYNAMIC

#ifdef CONFIG_GENERIC_ENTRY
@@ -8749,29 +8832,6 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* irqentry_exit_cond_resched <- irqentry_exit_cond_resched
*/

-enum {
- preempt_dynamic_undefined = -1,
- preempt_dynamic_none,
- preempt_dynamic_voluntary,
- preempt_dynamic_full,
-};
-
-int preempt_dynamic_mode = preempt_dynamic_undefined;
-
-int sched_dynamic_mode(const char *str)
-{
- if (!strcmp(str, "none"))
- return preempt_dynamic_none;
-
- if (!strcmp(str, "voluntary"))
- return preempt_dynamic_voluntary;
-
- if (!strcmp(str, "full"))
- return preempt_dynamic_full;
-
- return -EINVAL;
-}
-
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
#define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
@@ -8782,7 +8842,6 @@ int sched_dynamic_mode(const char *str)
#error "Unsupported PREEMPT_DYNAMIC mechanism"
#endif

-static DEFINE_MUTEX(sched_dynamic_mutex);
static bool klp_override;

static void __sched_dynamic_update(int mode)
@@ -8807,7 +8866,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
if (mode != preempt_dynamic_mode)
- pr_info("Dynamic Preempt: none\n");
+ pr_info("%s: none\n", PREEMPT_MODE);
break;

case preempt_dynamic_voluntary:
@@ -8818,7 +8877,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
if (mode != preempt_dynamic_mode)
- pr_info("Dynamic Preempt: voluntary\n");
+ pr_info("%s: voluntary\n", PREEMPT_MODE);
break;

case preempt_dynamic_full:
@@ -8829,20 +8888,13 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched);
if (mode != preempt_dynamic_mode)
- pr_info("Dynamic Preempt: full\n");
+ pr_info("%s: full\n", PREEMPT_MODE);
break;
}

preempt_dynamic_mode = mode;
}

-void sched_dynamic_update(int mode)
-{
- mutex_lock(&sched_dynamic_mutex);
- __sched_dynamic_update(mode);
- mutex_unlock(&sched_dynamic_mutex);
-}
-
#ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL

static int klp_cond_resched(void)
@@ -8873,51 +8925,6 @@ void sched_dynamic_klp_disable(void)

#endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */

-static int __init setup_preempt_mode(char *str)
-{
- int mode = sched_dynamic_mode(str);
- if (mode < 0) {
- pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
- return 0;
- }
-
- sched_dynamic_update(mode);
- return 1;
-}
-__setup("preempt=", setup_preempt_mode);
-
-static void __init preempt_dynamic_init(void)
-{
- if (preempt_dynamic_mode == preempt_dynamic_undefined) {
- if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
- sched_dynamic_update(preempt_dynamic_none);
- } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
- sched_dynamic_update(preempt_dynamic_voluntary);
- } else {
- /* Default static call setting, nothing to do */
- WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
- preempt_dynamic_mode = preempt_dynamic_full;
- pr_info("Dynamic Preempt: full\n");
- }
- }
-}
-
-#define PREEMPT_MODEL_ACCESSOR(mode) \
- bool preempt_model_##mode(void) \
- { \
- WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
- return preempt_dynamic_mode == preempt_dynamic_##mode; \
- } \
- EXPORT_SYMBOL_GPL(preempt_model_##mode)
-
-PREEMPT_MODEL_ACCESSOR(none);
-PREEMPT_MODEL_ACCESSOR(voluntary);
-PREEMPT_MODEL_ACCESSOR(full);
-
-#else /* !CONFIG_PREEMPT_DYNAMIC */
-
-static inline void preempt_dynamic_init(void) { }
-
#endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */

/**
--
2.31.1


2024-05-28 00:38:44

by Ankur Arora

Subject: [PATCH v2 14/35] rcu: limit PREEMPT_RCU to full preemption under PREEMPT_AUTO

Under PREEMPT_AUTO, CONFIG_PREEMPTION is enabled, and much like
PREEMPT_DYNAMIC, PREEMPT_AUTO also allows for dynamic switching
of preemption models.

The RCU model, however, is fixed at compile time.

Now, RCU typically selects PREEMPT_RCU if CONFIG_PREEMPTION is enabled.
Given the trade-offs between PREEMPT_RCU=y and PREEMPT_RCU=n, some
configurations might prefer the stronger forward-progress guarantees
of PREEMPT_RCU=n.

Accordingly, default to PREEMPT_RCU=y only for configurations that
explicitly choose a preemptible kernel at compile time (PREEMPT,
PREEMPT_DYNAMIC, PREEMPT_RT), instead of keying it off CONFIG_PREEMPTION.

Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/rcu/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index e7d2dd267593..9dedb70ac2e6 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -18,7 +18,7 @@ config TREE_RCU

config PREEMPT_RCU
bool
- default y if PREEMPTION
+ default y if (PREEMPT || PREEMPT_DYNAMIC || PREEMPT_RT)
select TREE_RCU
help
This option selects the RCU implementation that is
--
2.31.1


2024-05-28 00:38:47

by Ankur Arora

Subject: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO

Reuse sched_dynamic_update() and related logic to enable choosing
the preemption model at boot or runtime for PREEMPT_AUTO.

The interface is identical to PREEMPT_DYNAMIC.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>

Changelog:
change title
---
include/linux/preempt.h | 2 +-
kernel/sched/core.c | 31 +++++++++++++++++++++++++++----
kernel/sched/debug.c | 6 +++---
kernel/sched/sched.h | 2 +-
4 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index d453f5e34390..d4f568606eda 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -481,7 +481,7 @@ DEFINE_LOCK_GUARD_0(preempt, preempt_disable(), preempt_enable())
DEFINE_LOCK_GUARD_0(preempt_notrace, preempt_disable_notrace(), preempt_enable_notrace())
DEFINE_LOCK_GUARD_0(migrate, migrate_disable(), migrate_enable())

-#ifdef CONFIG_PREEMPT_DYNAMIC
+#if defined(CONFIG_PREEMPT_DYNAMIC) || defined(CONFIG_PREEMPT_AUTO)

extern bool preempt_model_none(void);
extern bool preempt_model_voluntary(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 349f6257fdcd..d7804e29182d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8713,9 +8713,13 @@ int __cond_resched_rwlock_write(rwlock_t *lock)
}
EXPORT_SYMBOL(__cond_resched_rwlock_write);

-#if defined(CONFIG_PREEMPT_DYNAMIC)
+#if defined(CONFIG_PREEMPT_DYNAMIC) || defined(CONFIG_PREEMPT_AUTO)

+#ifdef CONFIG_PREEMPT_DYNAMIC
#define PREEMPT_MODE "Dynamic Preempt"
+#else
+#define PREEMPT_MODE "Preempt Auto"
+#endif

enum {
preempt_dynamic_undefined = -1,
@@ -8790,11 +8794,11 @@ PREEMPT_MODEL_ACCESSOR(none);
PREEMPT_MODEL_ACCESSOR(voluntary);
PREEMPT_MODEL_ACCESSOR(full);

-#else /* !CONFIG_PREEMPT_DYNAMIC */
+#else /* !CONFIG_PREEMPT_DYNAMIC && !CONFIG_PREEMPT_AUTO */

static inline void preempt_dynamic_init(void) { }

-#endif /* !CONFIG_PREEMPT_DYNAMIC */
+#endif /* !CONFIG_PREEMPT_DYNAMIC && !CONFIG_PREEMPT_AUTO */

#ifdef CONFIG_PREEMPT_DYNAMIC

@@ -8925,7 +8929,26 @@ void sched_dynamic_klp_disable(void)

#endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */

-#endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */
+#elif defined(CONFIG_PREEMPT_AUTO)
+
+static void __sched_dynamic_update(int mode)
+{
+ switch (mode) {
+ case preempt_dynamic_none:
+ preempt_dynamic_mode = preempt_dynamic_undefined;
+ break;
+
+ case preempt_dynamic_voluntary:
+ preempt_dynamic_mode = preempt_dynamic_undefined;
+ break;
+
+ case preempt_dynamic_full:
+ preempt_dynamic_mode = preempt_dynamic_undefined;
+ break;
+ }
+}
+
+#endif /* CONFIG_PREEMPT_AUTO */

/**
* yield - yield the current processor to other threads.
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 8d5d98a5834d..e53f1b73bf4a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -216,7 +216,7 @@ static const struct file_operations sched_scaling_fops = {

#endif /* SMP */

-#ifdef CONFIG_PREEMPT_DYNAMIC
+#if defined(CONFIG_PREEMPT_DYNAMIC) || defined(CONFIG_PREEMPT_AUTO)

static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
size_t cnt, loff_t *ppos)
@@ -276,7 +276,7 @@ static const struct file_operations sched_dynamic_fops = {
.release = single_release,
};

-#endif /* CONFIG_PREEMPT_DYNAMIC */
+#endif /* CONFIG_PREEMPT_DYNAMIC || CONFIG_PREEMPT_AUTO */

__read_mostly bool sched_debug_verbose;

@@ -343,7 +343,7 @@ static __init int sched_init_debug(void)

debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
debugfs_create_file_unsafe("verbose", 0644, debugfs_sched, &sched_debug_verbose, &sched_verbose_fops);
-#ifdef CONFIG_PREEMPT_DYNAMIC
+#if defined(CONFIG_PREEMPT_DYNAMIC) || defined(CONFIG_PREEMPT_AUTO)
debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
#endif

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ae50f212775e..c9239c0b0095 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3231,7 +3231,7 @@ extern void __prepare_to_swait(struct swait_queue_head *q, struct swait_queue *w

extern int try_to_wake_up(struct task_struct *tsk, unsigned int state, int wake_flags);

-#ifdef CONFIG_PREEMPT_DYNAMIC
+#if defined(CONFIG_PREEMPT_DYNAMIC) || defined(CONFIG_PREEMPT_AUTO)
extern int preempt_dynamic_mode;
extern int sched_dynamic_mode(const char *str);
extern void sched_dynamic_update(int mode);
--
2.31.1


2024-05-28 00:39:02

by Ankur Arora

Subject: [PATCH v2 15/35] rcu: fix header guard for rcu_all_qs()

rcu_all_qs() is defined for !CONFIG_PREEMPT_RCU, but its declaration
is guarded by #ifndef CONFIG_PREEMPTION.

With CONFIG_PREEMPT_AUTO, you can have configurations where
CONFIG_PREEMPTION is enabled without also enabling CONFIG_PREEMPT_RCU,
leaving rcu_all_qs() defined but undeclared.

Decouple the two by conditioning the declaration on CONFIG_PREEMPT_RCU.

Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>

Changelog:
Might be going away
---
include/linux/rcutree.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index 254244202ea9..be2b77c81a6d 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -103,7 +103,7 @@ extern int rcu_scheduler_active;
void rcu_end_inkernel_boot(void);
bool rcu_inkernel_boot_has_ended(void);
bool rcu_is_watching(void);
-#ifndef CONFIG_PREEMPTION
+#ifndef CONFIG_PREEMPT_RCU
void rcu_all_qs(void);
#endif

--
2.31.1


2024-05-28 00:39:18

by Ankur Arora

Subject: [PATCH v2 17/35] rcu: handle quiescent states for PREEMPT_RCU=n, PREEMPT_COUNT=y

With PREEMPT_RCU=n, cond_resched() provides urgently needed quiescent
states for read-side critical sections via rcu_all_qs().
One reason why this was necessary: lacking preempt-count, the tick
handler has no way of knowing whether it is executing in a read-side
critical section or not.

With PREEMPT_AUTO=y, there can be configurations with (PREEMPT_COUNT=y,
PREEMPT_RCU=n). This means that cond_resched() is a stub which does
not provide for quiescent states via rcu_all_qs().

So, use the availability of preempt_count() to report quiescent states
in rcu_flavor_sched_clock_irq().

Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/rcu/tree_plugin.h | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 36a8b5dbf5b5..741476c841a1 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -963,13 +963,16 @@ static void rcu_preempt_check_blocked_tasks(struct rcu_node *rnp)
*/
static void rcu_flavor_sched_clock_irq(int user)
{
- if (user || rcu_is_cpu_rrupt_from_idle()) {
+ if (user || rcu_is_cpu_rrupt_from_idle() ||
+ (IS_ENABLED(CONFIG_PREEMPT_COUNT) &&
+ !(preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {

/*
* Get here if this CPU took its interrupt from user
- * mode or from the idle loop, and if this is not a
- * nested interrupt. In this case, the CPU is in
- * a quiescent state, so note it.
+ * mode, from the idle loop without this being a nested
+ * interrupt, or while not holding a preempt count.
+ * In this case, the CPU is in a quiescent state, so note
+ * it.
*
* No memory barrier is required here because rcu_qs()
* references only CPU-local variables that other CPUs
--
2.31.1


2024-05-28 00:39:24

by Ankur Arora

Subject: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full

The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
works at cross purposes: the RCU read side critical sections disable
preemption, while preempt=full schedules eagerly to minimize
latency.

Warn if the user is switching to full preemption with PREEMPT_RCU=n.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Suggested-by: Paul E. McKenney <[email protected]>
Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d7804e29182d..df8e333f2d8b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
break;

case preempt_dynamic_full:
+ if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
+ pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
+ PREEMPT_MODE);
+
preempt_dynamic_mode = preempt_dynamic_undefined;
break;
}
--
2.31.1


2024-05-28 00:39:32

by Ankur Arora

Subject: [PATCH v2 18/35] rcu: force context-switch for PREEMPT_RCU=n, PREEMPT_COUNT=y

With (PREEMPT_RCU=n, PREEMPT_COUNT=y), rcu_flavor_sched_clock_irq()
registers urgently needed quiescent states when preempt_count() is
available and no task or softirq is in a non-preemptible section.

This, however, does nothing for long running loops where preemption
is only temporarily enabled, since the tick is unlikely to neatly fall
in the preemptible() section.

Handle that by forcing a context-switch when we require a quiescent
state urgently but are holding a preempt_count().

Cc: Paul E. McKenney <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/rcu/tree.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index d9642dd06c25..3a0e1d0b939c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2286,8 +2286,17 @@ void rcu_sched_clock_irq(int user)
raw_cpu_inc(rcu_data.ticks_this_gp);
/* The load-acquire pairs with the store-release setting to true. */
if (smp_load_acquire(this_cpu_ptr(&rcu_data.rcu_urgent_qs))) {
- /* Idle and userspace execution already are quiescent states. */
- if (!rcu_is_cpu_rrupt_from_idle() && !user) {
+ /*
+ * Idle and userspace execution already are quiescent states.
+ * If, however, we came here from a nested interrupt in the
+ * kernel, or if we have PREEMPT_RCU=n but are holding a
+ * preempt_count() (say, with CONFIG_PREEMPT_AUTO=y), then
+ * force a context switch.
+ */
+ if ((!rcu_is_cpu_rrupt_from_idle() && !user) ||
+ ((!IS_ENABLED(CONFIG_PREEMPT_RCU) &&
+ IS_ENABLED(CONFIG_PREEMPT_COUNT)) &&
+ (preempt_count() & (PREEMPT_MASK | SOFTIRQ_MASK)))) {
set_tsk_need_resched(current);
set_preempt_need_resched();
}
--
2.31.1


2024-05-28 00:39:52

by Ankur Arora

Subject: [PATCH v2 20/35] powerpc: add support for PREEMPT_AUTO

From: Shrikanth Hegde <[email protected]>

Add powerpc arch support for PREEMPT_AUTO by defining the LAZY bits.

Since powerpc doesn't use the generic entry/exit code, add the
NEED_RESCHED_LAZY check to the exit-to-user path, and key the
exit-to-kernel-from-interrupt preemption check off CONFIG_PREEMPTION.

Signed-off-by: Shrikanth Hegde <[email protected]>
[ Changed TIF_NEED_RESCHED_LAZY to now be defined unconditionally. ]
Signed-off-by: Ankur Arora <[email protected]>
---
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/thread_info.h | 5 ++++-
arch/powerpc/kernel/interrupt.c | 5 +++--
3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1c4be3373686..11e7008f5dd3 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -268,6 +268,7 @@ config PPC
select HAVE_PERF_EVENTS_NMI if PPC64
select HAVE_PERF_REGS
select HAVE_PERF_USER_STACK_DUMP
+ select HAVE_PREEMPT_AUTO
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE
select HAVE_RSEQ
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index 15c5691dd218..0d170e2be2b6 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -117,11 +117,14 @@ void arch_setup_new_exec(void);
#endif
#define TIF_POLLING_NRFLAG 19 /* true if poll_idle() is polling TIF_NEED_RESCHED */
#define TIF_32BIT 20 /* 32 bit binary */
+#define TIF_NEED_RESCHED_LAZY 21 /* Lazy rescheduling */

/* as above, but as bit values */
#define _TIF_SYSCALL_TRACE (1<<TIF_SYSCALL_TRACE)
#define _TIF_SIGPENDING (1<<TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1<<TIF_NEED_RESCHED)
+#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
+
#define _TIF_NOTIFY_SIGNAL (1<<TIF_NOTIFY_SIGNAL)
#define _TIF_POLLING_NRFLAG (1<<TIF_POLLING_NRFLAG)
#define _TIF_32BIT (1<<TIF_32BIT)
@@ -144,7 +147,7 @@ void arch_setup_new_exec(void);
#define _TIF_USER_WORK_MASK (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
_TIF_NOTIFY_RESUME | _TIF_UPROBE | \
_TIF_RESTORE_TM | _TIF_PATCH_PENDING | \
- _TIF_NOTIFY_SIGNAL)
+ _TIF_NOTIFY_SIGNAL | _TIF_NEED_RESCHED_LAZY)
#define _TIF_PERSYSCALL_MASK (_TIF_RESTOREALL|_TIF_NOERROR)

/* Bits in local_flags */
diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index eca293794a1e..0b97cdd4b94e 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -185,7 +185,7 @@ interrupt_exit_user_prepare_main(unsigned long ret, struct pt_regs *regs)
ti_flags = read_thread_flags();
while (unlikely(ti_flags & (_TIF_USER_WORK_MASK & ~_TIF_RESTORE_TM))) {
local_irq_enable();
- if (ti_flags & _TIF_NEED_RESCHED) {
+ if (ti_flags & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY)) {
schedule();
} else {
/*
@@ -396,7 +396,8 @@ notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs)
/* Returning to a kernel context with local irqs enabled. */
WARN_ON_ONCE(!(regs->msr & MSR_EE));
again:
- if (IS_ENABLED(CONFIG_PREEMPT)) {
+
+ if (IS_ENABLED(CONFIG_PREEMPTION)) {
/* Return to preemptible kernel context */
if (unlikely(read_thread_flags() & _TIF_NEED_RESCHED)) {
if (preempt_count() == 0)
--
2.31.1


2024-05-28 00:39:59

by Ankur Arora

Subject: [PATCH v2 21/35] sched: prepare for lazy rescheduling in resched_curr()

Handle RESCHED_LAZY in resched_curr(), by registering an intent to
reschedule at exit-to-user.
Given that the rescheduling is not imminent, skip the preempt folding
and the resched IPI.

Also, update set_nr_and_not_polling() to handle RESCHED_LAZY. Note that
there are no changes to set_nr_if_polling(), since lazy rescheduling
is not meaningful for idle.

And finally, now that there are two need-resched bits, enforce a
priority order while setting them.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 35 +++++++++++++++++++++++------------
1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index df8e333f2d8b..27b908cc9134 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -899,14 +899,14 @@ static inline void hrtick_rq_init(struct rq *rq)

#if defined(CONFIG_SMP) && defined(TIF_POLLING_NRFLAG)
/*
- * Atomically set TIF_NEED_RESCHED and test for TIF_POLLING_NRFLAG,
+ * Atomically set TIF_NEED_RESCHED[_LAZY] and test for TIF_POLLING_NRFLAG,
* this avoids any races wrt polling state changes and thereby avoids
* spurious IPIs.
*/
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, resched_t rs)
{
struct thread_info *ti = task_thread_info(p);
- return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
+ return !(fetch_or(&ti->flags, _tif_resched(rs)) & _TIF_POLLING_NRFLAG);
}

/*
@@ -931,9 +931,9 @@ static bool set_nr_if_polling(struct task_struct *p)
}

#else
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct task_struct *p, resched_t rs)
{
- __set_tsk_need_resched(p, RESCHED_NOW);
+ __set_tsk_need_resched(p, rs);
return true;
}

@@ -1041,25 +1041,34 @@ void wake_up_q(struct wake_q_head *head)
void resched_curr(struct rq *rq)
{
struct task_struct *curr = rq->curr;
+ resched_t rs = RESCHED_NOW;
int cpu;

lockdep_assert_rq_held(rq);

- if (__test_tsk_need_resched(curr, RESCHED_NOW))
+ /*
+ * TIF_NEED_RESCHED is the higher priority bit, so if it is already
+ * set, nothing more to be done.
+ */
+ if (__test_tsk_need_resched(curr, RESCHED_NOW) ||
+ (rs == RESCHED_LAZY && __test_tsk_need_resched(curr, RESCHED_LAZY)))
return;

cpu = cpu_of(rq);

if (cpu == smp_processor_id()) {
- __set_tsk_need_resched(curr, RESCHED_NOW);
- set_preempt_need_resched();
+ __set_tsk_need_resched(curr, rs);
+ if (rs == RESCHED_NOW)
+ set_preempt_need_resched();
return;
}

- if (set_nr_and_not_polling(curr))
- smp_send_reschedule(cpu);
- else
+ if (set_nr_and_not_polling(curr, rs)) {
+ if (rs == RESCHED_NOW)
+ smp_send_reschedule(cpu);
+ } else {
trace_sched_wake_idle_without_ipi(cpu);
+ }
}

void resched_cpu(int cpu)
@@ -1154,7 +1163,7 @@ static void wake_up_idle_cpu(int cpu)
* and testing of the above solutions didn't appear to report
* much benefits.
*/
- if (set_nr_and_not_polling(rq->idle))
+ if (set_nr_and_not_polling(rq->idle, RESCHED_NOW))
smp_send_reschedule(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
@@ -6704,6 +6713,8 @@ static void __sched notrace __schedule(unsigned int sched_mode)
}

next = pick_next_task(rq, prev, &rf);
+
+ /* Clear both TIF_NEED_RESCHED, TIF_NEED_RESCHED_LAZY */
clear_tsk_need_resched(prev);
clear_preempt_need_resched();
#ifdef CONFIG_SCHED_DEBUG
--
2.31.1


2024-05-28 00:41:05

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 19/35] x86/thread_info: define TIF_NEED_RESCHED_LAZY

Define TIF_NEED_RESCHED_LAZY which, together with TIF_NEED_RESCHED,
provides the scheduler with two kinds of rescheduling intent:
TIF_NEED_RESCHED, for the usual rescheduling at the next safe
preemption point; and TIF_NEED_RESCHED_LAZY, expressing an intent to
reschedule at some point in the future while allowing the current
task to run to completion.
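
For reference, a minimal standalone sketch of how the two bits are
meant to be consumed (bit numbers mirror the hunk below; the helpers
are made up for illustration):

#include <stdbool.h>
#include <stdio.h>

#define TIF_NEED_RESCHED        3
#define TIF_NEED_RESCHED_LAZY   4

#define _TIF_NEED_RESCHED       (1u << TIF_NEED_RESCHED)
#define _TIF_NEED_RESCHED_LAZY  (1u << TIF_NEED_RESCHED_LAZY)

/* Preempt at the next safe preemption point. */
static bool wants_resched_now(unsigned int flags)
{
    return flags & _TIF_NEED_RESCHED;
}

/* Either bit is enough to reschedule at exit-to-user. */
static bool wants_resched_eventually(unsigned int flags)
{
    return flags & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY);
}

int main(void)
{
    unsigned int flags = _TIF_NEED_RESCHED_LAZY;

    printf("now: %d, eventually: %d\n",
           wants_resched_now(flags), wants_resched_eventually(flags));
    return 0;
}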

Cc: Ingo Molnar <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Dave Hansen <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 6 ++++--
2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 928820e61cb5..5cd83b12f6fd 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -277,6 +277,7 @@ config X86
select HAVE_STATIC_CALL
select HAVE_STATIC_CALL_INLINE if HAVE_OBJTOOL
select HAVE_PREEMPT_DYNAMIC_CALL
+ select HAVE_PREEMPT_AUTO
select HAVE_RSEQ
select HAVE_RUST if X86_64
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 12da7dfd5ef1..6862bbbb98ab 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -87,8 +87,9 @@ struct thread_info {
#define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
#define TIF_SIGPENDING 2 /* signal pending */
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
-#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
-#define TIF_SSBD 5 /* Speculative store bypass disable */
+#define TIF_NEED_RESCHED_LAZY 4 /* Lazy rescheduling */
+#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
+#define TIF_SSBD 6 /* Speculative store bypass disable */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
@@ -110,6 +111,7 @@ struct thread_info {
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
+#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
#define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
#define _TIF_SSBD (1 << TIF_SSBD)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
--
2.31.1


2024-05-28 00:41:16

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 25/35] sched/fair: refactor update_curr(), entity_tick()

When updating the task's runtime statistics via update_curr()
or entity_tick(), we call resched_curr() to reschedule if needed.

Refactor update_curr() and entity_tick() to only update the stats,
deferring any rescheduling needed to task_tick_fair() or
update_curr().
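
The shape of the refactor, sketched as a standalone toy (hypothetical
names; the real split is __update_curr()/update_curr() and
entity_tick()/task_tick_fair() in the diff below):

#include <stdbool.h>
#include <stdio.h>

/* Stats-only helper: reports whether a reschedule is warranted. */
static bool update_stats(int *budget, int used)
{
    *budget -= used;
    return *budget <= 0;        /* the caller decides what to do about it */
}

/* Single decision point, analogous to task_tick_fair(). */
static void tick(int *budgets, int n)
{
    bool resched = false;
    int i;

    for (i = 0; i < n; i++)
        resched |= update_stats(&budgets[i], 1);

    if (resched)
        printf("resched_curr()\n");     /* one call, after all updates */
}

int main(void)
{
    int budgets[2] = { 2, 1 };

    tick(budgets, 2);   /* budgets[1] hits 0 -> resched */
    return 0;
}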

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/fair.c | 54 ++++++++++++++++++++++-----------------------
1 file changed, 27 insertions(+), 27 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c5171c247466..dd34709f294c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -981,10 +981,10 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se);
* XXX: strictly: vd_i += N*r_i/w_i such that: vd_i > ve_i
* this is probably good enough.
*/
-static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
if ((s64)(se->vruntime - se->deadline) < 0)
- return;
+ return false;

/*
* For EEVDF the virtual time slope is determined by w_i (iow.
@@ -1002,9 +1002,11 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
* The task has consumed its request, reschedule.
*/
if (cfs_rq->nr_running > 1) {
- resched_curr(rq_of(cfs_rq));
clear_buddies(cfs_rq, se);
+ return true;
}
+
+ return false;
}

#include "pelt.h"
@@ -1159,26 +1161,35 @@ s64 update_curr_common(struct rq *rq)
/*
* Update the current task's runtime statistics.
*/
-static void update_curr(struct cfs_rq *cfs_rq)
+static bool __update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
s64 delta_exec;
+ bool resched;

if (unlikely(!curr))
- return;
+ return false;

delta_exec = update_curr_se(rq_of(cfs_rq), curr);
if (unlikely(delta_exec <= 0))
- return;
+ return false;

curr->vruntime += calc_delta_fair(delta_exec, curr);
- update_deadline(cfs_rq, curr);
+ resched = update_deadline(cfs_rq, curr);
update_min_vruntime(cfs_rq);

if (entity_is_task(curr))
update_curr_task(task_of(curr), delta_exec);

account_cfs_rq_runtime(cfs_rq, delta_exec);
+
+ return resched;
+}
+
+static void update_curr(struct cfs_rq *cfs_rq)
+{
+ if (__update_curr(cfs_rq))
+ resched_curr(rq_of(cfs_rq));
}

static void update_curr_fair(struct rq *rq)
@@ -5499,13 +5510,13 @@ static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
cfs_rq->curr = NULL;
}

-static void
-entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
+static bool
+entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
/*
* Update run-time statistics of the 'current'.
*/
- update_curr(cfs_rq);
+ bool resched = __update_curr(cfs_rq);

/*
* Ensure that runnable average is periodically updated.
@@ -5513,22 +5524,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
update_load_avg(cfs_rq, curr, UPDATE_TG);
update_cfs_group(curr);

-#ifdef CONFIG_SCHED_HRTICK
- /*
- * queued ticks are scheduled to match the slice, so don't bother
- * validating it and just reschedule.
- */
- if (queued) {
- resched_curr(rq_of(cfs_rq));
- return;
- }
- /*
- * don't let the period tick interfere with the hrtick preemption
- */
- if (!sched_feat(DOUBLE_TICK) &&
- hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
- return;
-#endif
+ return resched;
}


@@ -12611,12 +12607,16 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
+ bool resched = false;

for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
- entity_tick(cfs_rq, se, queued);
+ resched |= entity_tick(cfs_rq, se);
}

+ if (resched)
+ resched_curr(rq);
+
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);

--
2.31.1


2024-05-28 00:41:30

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 26/35] sched/fair: handle tick expiry under lazy preemption

The default policy for lazy scheduling is to schedule at exit-to-user,
so do that for all but deadline tasks. For deadline tasks, once the
task is no longer leftmost, force it to be scheduled away.

Always scheduling lazily, however, runs into the 'hog' problem -- the
target task might be running in the kernel and might not relinquish
the CPU on its own.

Handle that by upgrading the ignored tif_resched(RESCHED_LAZY) bit to
tif_resched(RESCHED_NOW) at the next tick.
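
A standalone sketch of the tick-time upgrade (illustrative only,
made-up names; the real check lives in resched_opt_translate() below
and tests the task's TIF bits):

#include <stdbool.h>
#include <stdio.h>

typedef enum { RESCHED_NOW, RESCHED_LAZY } resched_t;
enum resched_opt { RESCHED_DEFAULT, RESCHED_TICK };

struct task { bool lazy_pending; };

static resched_t translate(struct task *t, enum resched_opt opt)
{
    /*
     * Lazy bit still set when the next tick arrives: the task has not
     * rescheduled on its own, so force it away.
     */
    if (opt == RESCHED_TICK && t->lazy_pending)
        return RESCHED_NOW;

    return RESCHED_LAZY;
}

int main(void)
{
    struct task hog = { .lazy_pending = false };

    /* first tick: only a lazy request is registered */
    if (translate(&hog, RESCHED_TICK) == RESCHED_LAZY)
        hog.lazy_pending = true;

    /* second tick: the lazy bit is still pending, upgrade */
    printf("second tick: %s\n",
           translate(&hog, RESCHED_TICK) == RESCHED_NOW ?
           "RESCHED_NOW" : "RESCHED_LAZY");
    return 0;
}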

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 8 ++++++++
kernel/sched/deadline.c | 5 ++++-
kernel/sched/fair.c | 2 +-
kernel/sched/rt.c | 2 +-
kernel/sched/sched.h | 6 ++++++
5 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e838328d93d1..2bc7f636267d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1051,6 +1051,14 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (is_idle_task(curr))
return RESCHED_NOW;

+ if (opt == RESCHED_TICK &&
+ unlikely(__test_tsk_need_resched(curr, RESCHED_LAZY)))
+ /*
+ * If the task hasn't switched away by the second tick,
+ * force it away by upgrading to TIF_NEED_RESCHED.
+ */
+ return RESCHED_NOW;
+
return RESCHED_LAZY;
}

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d24d6bfee293..cb0dd77508b1 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1378,8 +1378,11 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
enqueue_task_dl(rq, dl_task_of(dl_se), ENQUEUE_REPLENISH);
}

+ /*
+ * We are not leftmost anymore. Reschedule straight away.
+ */
if (!is_leftmost(dl_se, &rq->dl))
- resched_curr(rq);
+ __resched_curr(rq, RESCHED_FORCE);
}

/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dd34709f294c..faa6afe0af0d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12615,7 +12615,7 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
}

if (resched)
- resched_curr(rq);
+ resched_curr_tick(rq);

if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f0a6c9bb890b..4713783bbdef 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1023,7 +1023,7 @@ static void update_curr_rt(struct rq *rq)
rt_rq->rt_time += delta_exec;
exceeded = sched_rt_runtime_exceeded(rt_rq);
if (exceeded)
- resched_curr(rq);
+ resched_curr_tick(rq);
raw_spin_unlock(&rt_rq->rt_runtime_lock);
if (exceeded)
do_start_rt_bandwidth(sched_rt_bandwidth(rt_rq));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e5e4747fbef2..107c5fc2b7bb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2467,6 +2467,7 @@ extern void reweight_task(struct task_struct *p, int prio);
enum resched_opt {
RESCHED_DEFAULT,
RESCHED_FORCE,
+ RESCHED_TICK,
};

extern void __resched_curr(struct rq *rq, enum resched_opt opt);
@@ -2476,6 +2477,11 @@ static inline void resched_curr(struct rq *rq)
__resched_curr(rq, RESCHED_DEFAULT);
}

+static inline void resched_curr_tick(struct rq *rq)
+{
+ __resched_curr(rq, RESCHED_TICK);
+}
+
extern void resched_cpu(int cpu);

extern struct rt_bandwidth def_rt_bandwidth;
--
2.31.1


2024-05-28 00:41:55

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 29/35] sched: handle preempt=voluntary under PREEMPT_AUTO

The default preemption policy for voluntary preemption under
PREEMPT_AUTO is to schedule eagerly for tasks of a higher scheduling
class, and lazily for well-behaved, non-idle tasks.

This is the same policy as preempt=none, with the addition of eager
handling of higher-priority scheduling classes.
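
Sketched outside the kernel, the voluntary policy roughly reduces to
the following (made-up helper; the real logic is in the
resched_opt_translate() and wakeup_preempt() hunks below):

#include <stdbool.h>
#include <stdio.h>

typedef enum { RESCHED_NOW, RESCHED_LAZY } resched_t;
enum resched_opt { RESCHED_DEFAULT, RESCHED_PRIORITY };

/* preempt=voluntary: eager only for higher sched class wakeups or idle. */
static resched_t voluntary_policy(bool curr_is_idle, enum resched_opt opt)
{
    if (opt == RESCHED_PRIORITY)    /* woken task is of a higher class */
        return RESCHED_NOW;

    if (curr_is_idle)
        return RESCHED_NOW;

    return RESCHED_LAZY;            /* everything else runs out its slice */
}

int main(void)
{
    printf("%d %d %d\n",
           voluntary_policy(false, RESCHED_DEFAULT),   /* 1 == RESCHED_LAZY */
           voluntary_policy(false, RESCHED_PRIORITY),  /* 0 == RESCHED_NOW */
           voluntary_policy(true,  RESCHED_DEFAULT));  /* 0 == RESCHED_NOW */
    return 0;
}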

Comparing a cyclictest workload with a background kernel load of
'stress-ng --mmap', shows that both the average and the maximum
latencies improve:

# stress-ng --mmap 0 &
# cyclictest --mlockall --smp --priority=80 --interval=200 --distance=0 -q -D 300

Min ( %stdev ) Act ( %stdev ) Avg ( %stdev ) Max ( %stdev )

PREEMPT_AUTO, preempt=voluntary 1.73 ( +- 25.43% ) 62.16 ( +- 303.39% ) 14.92 ( +- 17.96% ) 2778.22 ( +- 15.04% )
PREEMPT_DYNAMIC, preempt=voluntary 1.83 ( +- 20.76% ) 253.45 ( +- 233.21% ) 18.70 ( +- 15.88% ) 2992.45 ( +- 15.95% )

The table above shows the aggregated latencies across all CPUs.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 12 ++++++++----
kernel/sched/sched.h | 6 ++++++
2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c25cccc09b65..2bc3ae21a9d0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1052,6 +1052,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (preempt_model_preemptible())
return RESCHED_NOW;

+ if (preempt_model_voluntary() && opt == RESCHED_PRIORITY)
+ return RESCHED_NOW;
+
if (is_idle_task(curr))
return RESCHED_NOW;

@@ -2289,7 +2292,7 @@ void wakeup_preempt(struct rq *rq, struct task_struct *p, int flags)
if (p->sched_class == rq->curr->sched_class)
rq->curr->sched_class->wakeup_preempt(rq, p, flags);
else if (sched_class_above(p->sched_class, rq->curr->sched_class))
- resched_curr(rq);
+ resched_curr_priority(rq);

/*
* A queue event has occurred, and we're going to schedule. In
@@ -8989,11 +8992,11 @@ static void __sched_dynamic_update(int mode)
case preempt_dynamic_none:
if (mode != preempt_dynamic_mode)
pr_info("%s: none\n", PREEMPT_MODE);
- preempt_dynamic_mode = mode;
break;

case preempt_dynamic_voluntary:
- preempt_dynamic_mode = preempt_dynamic_undefined;
+ if (mode != preempt_dynamic_mode)
+ pr_info("%s: voluntary\n", PREEMPT_MODE);
break;

case preempt_dynamic_full:
@@ -9003,9 +9006,10 @@ static void __sched_dynamic_update(int mode)

if (mode != preempt_dynamic_mode)
pr_info("%s: full\n", PREEMPT_MODE);
- preempt_dynamic_mode = mode;
break;
}
+
+ preempt_dynamic_mode = mode;
}

#endif /* CONFIG_PREEMPT_AUTO */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 107c5fc2b7bb..ee8e99a9a677 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2468,6 +2468,7 @@ enum resched_opt {
RESCHED_DEFAULT,
RESCHED_FORCE,
RESCHED_TICK,
+ RESCHED_PRIORITY,
};

extern void __resched_curr(struct rq *rq, enum resched_opt opt);
@@ -2482,6 +2483,11 @@ static inline void resched_curr_tick(struct rq *rq)
__resched_curr(rq, RESCHED_TICK);
}

+static inline void resched_curr_priority(struct rq *rq)
+{
+ __resched_curr(rq, RESCHED_PRIORITY);
+}
+
extern void resched_cpu(int cpu);

extern struct rt_bandwidth def_rt_bandwidth;
--
2.31.1


2024-05-28 00:42:16

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 28/35] sched: support preempt=full under PREEMPT_AUTO

The default preemption policy for preempt-full under PREEMPT_AUTO is
to minimize latency, and thus to always schedule eagerly. This is
identical to CONFIG_PREEMPT, and so should result in similar
performance.

Comparing scheduling/IPC workload:

# perf stat -a -e cs --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000

PREEMPT_AUTO, preempt=full

3,080,508 context-switches ( +- 0.64% )
3.65171 +- 0.00654 seconds time elapsed ( +- 0.18% )

PREEMPT_DYNAMIC, preempt=full

3,087,527 context-switches ( +- 0.33% )
3.60163 +- 0.00633 seconds time elapsed ( +- 0.18% )

Looking at the breakup between voluntary and involuntary
context-switches, we see almost identical behaviour as well.

PREEMPT_AUTO, preempt=full

2087910.00 +- 34720.95 voluntary context-switches ( +- 1.660% )
784437.60 +- 19827.79 involuntary context-switches ( +- 2.520% )

PREEMPT_DYNAMIC, preempt=full

2102879.60 +- 22767.11 voluntary context-switches ( +- 1.080% )
801189.90 +- 21324.18 involuntary context-switches ( +- 2.660% )

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c3ba33c77053..c25cccc09b65 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1035,9 +1035,10 @@ void wake_up_q(struct wake_q_head *head)
* For preemption models other than PREEMPT_AUTO: always schedule
* eagerly.
*
- * For PREEMPT_AUTO: schedule idle threads eagerly, allow everything else
- * to finish its time quanta, and mark for rescheduling at the next exit
- * to user.
+ * For PREEMPT_AUTO: schedule idle threads eagerly, and under full
+ * preemption, all tasks eagerly. Otherwise, allow everything else
+ * to finish its time quanta, and mark for rescheduling at the next
+ * exit to user.
*/
static resched_t resched_opt_translate(struct task_struct *curr,
enum resched_opt opt)
@@ -1048,6 +1049,9 @@ static resched_t resched_opt_translate(struct task_struct *curr,
if (opt == RESCHED_FORCE)
return RESCHED_NOW;

+ if (preempt_model_preemptible())
+ return RESCHED_NOW;
+
if (is_idle_task(curr))
return RESCHED_NOW;

@@ -8997,7 +9001,9 @@ static void __sched_dynamic_update(int mode)
pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
PREEMPT_MODE);

- preempt_dynamic_mode = preempt_dynamic_undefined;
+ if (mode != preempt_dynamic_mode)
+ pr_info("%s: full\n", PREEMPT_MODE);
+ preempt_dynamic_mode = mode;
break;
}
}
--
2.31.1


2024-05-28 00:42:40

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 30/35] sched: latency warn for TIF_NEED_RESCHED_LAZY

resched_latency_warn() now also warns if TIF_NEED_RESCHED_LAZY is set
without rescheduling for more than the latency_warn_ms period.

Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Juri Lelli <[email protected]>
Cc: Vincent Guittot <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/sched/core.c | 2 +-
kernel/sched/debug.c | 7 +++++--
2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2bc3ae21a9d0..4f0ac90b7d47 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5673,7 +5673,7 @@ static u64 cpu_resched_latency(struct rq *rq)
if (sysctl_resched_latency_warn_once && warned_once)
return 0;

- if (!need_resched() || !latency_warn_ms)
+ if ((!need_resched() && !need_resched_lazy()) || !latency_warn_ms)
return 0;

if (system_state == SYSTEM_BOOTING)
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e53f1b73bf4a..a1be40101844 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -1114,9 +1114,12 @@ void proc_sched_set_task(struct task_struct *p)
void resched_latency_warn(int cpu, u64 latency)
{
static DEFINE_RATELIMIT_STATE(latency_check_ratelimit, 60 * 60 * HZ, 1);
+ char *nr;
+
+ nr = __tif_need_resched(RESCHED_NOW) ? "need_resched" : "need_resched_lazy";

WARN(__ratelimit(&latency_check_ratelimit),
- "sched: CPU %d need_resched set for > %llu ns (%d ticks) "
+ "sched: CPU %d %s set for > %llu ns (%d ticks) "
"without schedule\n",
- cpu, latency, cpu_rq(cpu)->ticks_without_resched);
+ cpu, nr, latency, cpu_rq(cpu)->ticks_without_resched);
}
--
2.31.1


2024-05-28 00:48:12

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 31/35] tracing: support lazy resched

trace_entry::flags is full, so reuse the TRACE_FLAG_IRQS_NOSUPPORT
bit for this. The flag is safe to reuse since it is only used in
old archs that don't support lockdep irq tracing.

Also, now that we have a variety of need-resched combinations, document
these in the tracing headers.
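
For reference, a standalone decoder for the resulting need-resched
column, mirroring the switch added to trace_output.c below (the
TRACE_FLAG_PREEMPT_RESCHED value is taken from the existing enum):

#include <stdio.h>

#define TRACE_FLAG_NEED_RESCHED_LAZY    0x02
#define TRACE_FLAG_NEED_RESCHED         0x04
#define TRACE_FLAG_PREEMPT_RESCHED      0x20

static char need_resched_char(unsigned int flags)
{
    unsigned int now  = flags & TRACE_FLAG_NEED_RESCHED;
    unsigned int lazy = flags & TRACE_FLAG_NEED_RESCHED_LAZY;
    unsigned int prr  = flags & TRACE_FLAG_PREEMPT_RESCHED;

    if (now && lazy && prr) return 'B';
    if (now && prr)         return 'N';
    if (lazy && prr)        return 'L';
    if (now && lazy)        return 'b';
    if (now)                return 'n';
    if (lazy)               return 'l';
    if (prr)                return 'p';
    return '.';
}

int main(void)
{
    printf("%c %c %c\n",
           need_resched_char(TRACE_FLAG_NEED_RESCHED_LAZY),                          /* l */
           need_resched_char(TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED),  /* N */
           need_resched_char(0));                                                    /* . */
    return 0;
}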

Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
include/linux/trace_events.h | 6 +++---
kernel/trace/trace.c | 28 ++++++++++++++++++----------
kernel/trace/trace_output.c | 16 ++++++++++++++--
3 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 6f9bdfb09d1d..329002785b4d 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -184,7 +184,7 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status);

enum trace_flag_type {
TRACE_FLAG_IRQS_OFF = 0x01,
- TRACE_FLAG_IRQS_NOSUPPORT = 0x02,
+ TRACE_FLAG_NEED_RESCHED_LAZY = 0x02,
TRACE_FLAG_NEED_RESCHED = 0x04,
TRACE_FLAG_HARDIRQ = 0x08,
TRACE_FLAG_SOFTIRQ = 0x10,
@@ -211,11 +211,11 @@ static inline unsigned int tracing_gen_ctx(void)

static inline unsigned int tracing_gen_ctx_flags(unsigned long irqflags)
{
- return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+ return tracing_gen_ctx_irq_test(0);
}
static inline unsigned int tracing_gen_ctx(void)
{
- return tracing_gen_ctx_irq_test(TRACE_FLAG_IRQS_NOSUPPORT);
+ return tracing_gen_ctx_irq_test(0);
}
#endif

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index ed229527be05..7941e9ec979a 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2513,6 +2513,8 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status)

if (__tif_need_resched(RESCHED_NOW))
trace_flags |= TRACE_FLAG_NEED_RESCHED;
+ if (__tif_need_resched(RESCHED_LAZY))
+ trace_flags |= TRACE_FLAG_NEED_RESCHED_LAZY;
if (test_preempt_need_resched())
trace_flags |= TRACE_FLAG_PREEMPT_RESCHED;
return (trace_flags << 16) | (min_t(unsigned int, pc & 0xff, 0xf)) |
@@ -4096,17 +4098,23 @@ unsigned long trace_total_entries(struct trace_array *tr)
return entries;
}

+#ifdef CONFIG_PREEMPT_AUTO
+#define NR_LEGEND "l: lazy, n: now, p: preempt, b: l|n, L: l|p, N: n|p, B: l|n|p"
+#else
+#define NR_LEGEND "n: now, p: preempt, N: n|p"
+#endif
+
static void print_lat_help_header(struct seq_file *m)
{
- seq_puts(m, "# _------=> CPU# \n"
- "# / _-----=> irqs-off/BH-disabled\n"
- "# | / _----=> need-resched \n"
- "# || / _---=> hardirq/softirq \n"
- "# ||| / _--=> preempt-depth \n"
- "# |||| / _-=> migrate-disable \n"
- "# ||||| / delay \n"
- "# cmd pid |||||| time | caller \n"
- "# \\ / |||||| \\ | / \n");
+ seq_printf(m, "# _------=> CPU# \n"
+ "# / _-----=> irqs-off/BH-disabled\n"
+ "# | / _----=> need-resched ( %s ) \n"
+ "# || / _---=> hardirq/softirq \n"
+ "# ||| / _--=> preempt-depth \n"
+ "# |||| / _-=> migrate-disable \n"
+ "# ||||| / delay \n"
+ "# cmd pid |||||| time | caller \n"
+ "# \\ / |||||| \\ | / \n", NR_LEGEND);
}

static void print_event_info(struct array_buffer *buf, struct seq_file *m)
@@ -4141,7 +4149,7 @@ static void print_func_help_header_irq(struct array_buffer *buf, struct seq_file
print_event_info(buf, m);

seq_printf(m, "# %.*s _-----=> irqs-off/BH-disabled\n", prec, space);
- seq_printf(m, "# %.*s / _----=> need-resched\n", prec, space);
+ seq_printf(m, "# %.*s / _----=> need-resched ( %s )\n", prec, space, NR_LEGEND);
seq_printf(m, "# %.*s| / _---=> hardirq/softirq\n", prec, space);
seq_printf(m, "# %.*s|| / _--=> preempt-depth\n", prec, space);
seq_printf(m, "# %.*s||| / _-=> migrate-disable\n", prec, space);
diff --git a/kernel/trace/trace_output.c b/kernel/trace/trace_output.c
index d8b302d01083..4f58a196e14c 100644
--- a/kernel/trace/trace_output.c
+++ b/kernel/trace/trace_output.c
@@ -460,17 +460,29 @@ int trace_print_lat_fmt(struct trace_seq *s, struct trace_entry *entry)
(entry->flags & TRACE_FLAG_IRQS_OFF && bh_off) ? 'D' :
(entry->flags & TRACE_FLAG_IRQS_OFF) ? 'd' :
bh_off ? 'b' :
- (entry->flags & TRACE_FLAG_IRQS_NOSUPPORT) ? 'X' :
+ !IS_ENABLED(CONFIG_TRACE_IRQFLAGS_SUPPORT) ? 'X' :
'.';

- switch (entry->flags & (TRACE_FLAG_NEED_RESCHED |
+ switch (entry->flags & (TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY |
TRACE_FLAG_PREEMPT_RESCHED)) {
+ case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+ need_resched = 'B';
+ break;
case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_PREEMPT_RESCHED:
need_resched = 'N';
break;
+ case TRACE_FLAG_NEED_RESCHED_LAZY | TRACE_FLAG_PREEMPT_RESCHED:
+ need_resched = 'L';
+ break;
+ case TRACE_FLAG_NEED_RESCHED | TRACE_FLAG_NEED_RESCHED_LAZY:
+ need_resched = 'b';
+ break;
case TRACE_FLAG_NEED_RESCHED:
need_resched = 'n';
break;
+ case TRACE_FLAG_NEED_RESCHED_LAZY:
+ need_resched = 'l';
+ break;
case TRACE_FLAG_PREEMPT_RESCHED:
need_resched = 'p';
break;
--
2.31.1


2024-05-28 00:48:37

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 32/35] Documentation: tracing: add TIF_NEED_RESCHED_LAZY

Document various combinations of resched flags.

Cc: Steven Rostedt <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Originally-by: Thomas Gleixner <[email protected]>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <[email protected]>
---
Documentation/trace/ftrace.rst | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 7e7b8ec17934..7f20c0bae009 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -1036,8 +1036,12 @@ explains which is which.
be printed here.

need-resched:
- - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
+ - 'B' all three, TIF_NEED_RESCHED, TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
+ - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED are set,
+ - 'L' both TIF_NEED_RESCHED_LAZY and PREEMPT_NEED_RESCHED are set,
+ - 'b' both TIF_NEED_RESCHED and TIF_NEED_RESCHED_LAZY are set,
- 'n' only TIF_NEED_RESCHED is set,
+ - 'l' only TIF_NEED_RESCHED_LAZY is set,
- 'p' only PREEMPT_NEED_RESCHED is set,
- '.' otherwise.

--
2.31.1


2024-05-28 00:53:19

by Ankur Arora

[permalink] [raw]
Subject: [PATCH v2 33/35] osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y

To reduce RCU noise for nohz_full configurations, osnoise depends
on cond_resched() providing quiescent states for PREEMPT_RCU=n
configurations. For PREEMPT_RCU=y configurations, it does this by
directly calling rcu_momentary_dyntick_idle().

With PREEMPT_AUTO=y, however, we can have configurations with
(PREEMPTION=y, PREEMPT_RCU=n), which means neither of the above can
help.

Handle that by falling back to explicit quiescent states via
rcu_momentary_dyntick_idle().
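
A throwaway sketch of which quiescent-state mechanism applies per
configuration (plain booleans stand in for the Kconfig symbols;
illustration only):

#include <stdbool.h>
#include <stdio.h>

static const char *qs_mechanism(bool preempt_rcu, bool preemption)
{
    /* matches: PREEMPT_RCU || (!PREEMPT_RCU && PREEMPTION) */
    if (preempt_rcu || preemption)
        return "rcu_momentary_dyntick_idle()";

    return "cond_resched()";    /* non-preemptible PREEMPT_RCU=n kernels */
}

int main(void)
{
    printf("PREEMPT_RCU=y               -> %s\n", qs_mechanism(true,  true));
    printf("PREEMPT_RCU=n, PREEMPTION=y -> %s\n", qs_mechanism(false, true));
    printf("PREEMPT_RCU=n, PREEMPTION=n -> %s\n", qs_mechanism(false, false));
    return 0;
}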

Cc: Paul E. McKenney <[email protected]>
Cc: Daniel Bristot de Oliveira <[email protected]>
Cc: Steven Rostedt <[email protected]>
Suggested-by: Paul E. McKenney <[email protected]>
Signed-off-by: Ankur Arora <[email protected]>
---
kernel/trace/trace_osnoise.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/trace_osnoise.c b/kernel/trace/trace_osnoise.c
index a8e28f9b9271..88d2cd2593c4 100644
--- a/kernel/trace/trace_osnoise.c
+++ b/kernel/trace/trace_osnoise.c
@@ -1532,18 +1532,20 @@ static int run_osnoise(void)
/*
* In some cases, notably when running on a nohz_full CPU with
* a stopped tick PREEMPT_RCU has no way to account for QSs.
- * This will eventually cause unwarranted noise as PREEMPT_RCU
- * will force preemption as the means of ending the current
- * grace period. We avoid this problem by calling
- * rcu_momentary_dyntick_idle(), which performs a zero duration
- * EQS allowing PREEMPT_RCU to end the current grace period.
- * This call shouldn't be wrapped inside an RCU critical
- * section.
+ * This will eventually cause unwarranted noise as RCU forces
+ * preemption as the means of ending the current grace period.
+ * We avoid this by calling rcu_momentary_dyntick_idle(),
+ * which performs a zero duration EQS allowing RCU to end the
+ * current grace period. This call shouldn't be wrapped inside
+ * an RCU critical section.
*
- * Note that in non PREEMPT_RCU kernels QSs are handled through
- * cond_resched()
+ * For non-PREEMPT_RCU kernels with cond_resched() (non-
+ * PREEMPT_AUTO configurations), QSs are handled through
+ * cond_resched(). For PREEMPT_AUTO kernels, we fallback to the
+ * zero duration QS via rcu_momentary_dyntick_idle().
*/
- if (IS_ENABLED(CONFIG_PREEMPT_RCU)) {
+ if (IS_ENABLED(CONFIG_PREEMPT_RCU) ||
+ (!IS_ENABLED(CONFIG_PREEMPT_RCU) && IS_ENABLED(CONFIG_PREEMPTION))) {
if (!disable_irq)
local_irq_disable();

--
2.31.1


Subject: Re: [PATCH v2 33/35] osnoise: handle quiescent states for PREEMPT_RCU=n, PREEMPTION=y

On 5/28/24 02:35, Ankur Arora wrote:
> To reduce RCU noise for nohz_full configurations, osnoise depends
> on cond_resched() providing quiescent states for PREEMPT_RCU=n
> configurations. And, for PREEMPT_RCU=y configurations does this
> by directly calling rcu_momentary_dyntick_idle().
>
> With PREEMPT_AUTO=y, however, we can have configurations with
> (PREEMPTION=y, PREEMPT_RCU=n), which means neither of the above can
> help.
>
> Handle that by fallback to the explicit quiescent states via
> rcu_momentary_dyntick_idle().
>
> Cc: Paul E. McKenney <[email protected]>
> Cc: Daniel Bristot de Oliveira <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Suggested-by: Paul E. McKenney <[email protected]>
> Signed-off-by: Ankur Arora <[email protected]>

Acked-by: Daniel Bristot de Oliveira <[email protected]>

Thanks
-- Daniel

2024-05-28 16:03:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 06/35] thread_info: define __tif_need_resched(resched_t)

On Mon, May 27, 2024 at 05:34:52PM -0700, Ankur Arora wrote:

> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index 65e5beedc915..e246b01553a5 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -216,22 +216,44 @@ static __always_inline unsigned long read_ti_thread_flags(struct thread_info *ti
>
> #ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
>
> -static __always_inline bool tif_need_resched(void)
> +static __always_inline bool __tif_need_resched_bitop(int nr_flag)
> {
> - return arch_test_bit(TIF_NEED_RESCHED,
> - (unsigned long *)(&current_thread_info()->flags));
> + return arch_test_bit(nr_flag,
> + (unsigned long *)(&current_thread_info()->flags));
> }
>
> #else
>
> -static __always_inline bool tif_need_resched(void)
> +static __always_inline bool __tif_need_resched_bitop(int nr_flag)
> {
> - return test_bit(TIF_NEED_RESCHED,
> - (unsigned long *)(&current_thread_info()->flags));
> + return test_bit(nr_flag,
> + (unsigned long *)(&current_thread_info()->flags));
> }

:se cino=(0:0

That is, you're wrecking the indentation here.

>
> #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
>
> +static __always_inline bool __tif_need_resched(resched_t rs)
> +{
> + /*
> + * With !PREEMPT_AUTO, this check is only meaningful if we
> + * are checking if tif_resched(RESCHED_NOW) is set.
> + */
> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO) || rs == RESCHED_NOW)
> + return __tif_need_resched_bitop(tif_resched(rs));
> + else
> + return false;
> +}

if (!IS_ENABLED(CONFIG_PREEMPT_AUTO) && rs == RESCHED_LAZY)
return false;

return __tif_need_resched_bitop(tif_resched(rs));


> +
> +static __always_inline bool tif_need_resched(void)
> +{
> + return __tif_need_resched(RESCHED_NOW);
> +}
> +
> +static __always_inline bool tif_need_resched_lazy(void)
> +{
> + return __tif_need_resched(RESCHED_LAZY);
> +}
> +
> #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
> static inline int arch_within_stack_frames(const void * const stack,
> const void * const stackend,
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index 233d1af39fff..ed229527be05 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -2511,7 +2511,7 @@ unsigned int tracing_gen_ctx_irq_test(unsigned int irqs_status)
> if (softirq_count() >> (SOFTIRQ_SHIFT + 1))
> trace_flags |= TRACE_FLAG_BH_OFF;
>
> - if (tif_need_resched())
> + if (__tif_need_resched(RESCHED_NOW))
> trace_flags |= TRACE_FLAG_NEED_RESCHED;

Per the above this is a NO-OP.

2024-05-28 16:26:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 12/35] sched: separate PREEMPT_DYNAMIC config logic

On Mon, May 27, 2024 at 05:34:58PM -0700, Ankur Arora wrote:
> Pull out the PREEMPT_DYNAMIC setup logic to allow other preemption
> models to dynamically configure preemption.

Uh what ?!? What's the point of creating back-to-back #ifdef sections ?

> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> kernel/sched/core.c | 165 +++++++++++++++++++++++---------------------
> 1 file changed, 86 insertions(+), 79 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0c26b60c1101..349f6257fdcd 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8713,6 +8713,89 @@ int __cond_resched_rwlock_write(rwlock_t *lock)
> }
> EXPORT_SYMBOL(__cond_resched_rwlock_write);
>
> +#if defined(CONFIG_PREEMPT_DYNAMIC)
> +
> +#define PREEMPT_MODE "Dynamic Preempt"
> +
> +enum {
> + preempt_dynamic_undefined = -1,
> + preempt_dynamic_none,
> + preempt_dynamic_voluntary,
> + preempt_dynamic_full,
> +};
> +
> +int preempt_dynamic_mode = preempt_dynamic_undefined;
> +static DEFINE_MUTEX(sched_dynamic_mutex);
> +
> +int sched_dynamic_mode(const char *str)
> +{
> + if (!strcmp(str, "none"))
> + return preempt_dynamic_none;
> +
> + if (!strcmp(str, "voluntary"))
> + return preempt_dynamic_voluntary;
> +
> + if (!strcmp(str, "full"))
> + return preempt_dynamic_full;
> +
> + return -EINVAL;
> +}
> +
> +static void __sched_dynamic_update(int mode);
> +void sched_dynamic_update(int mode)
> +{
> + mutex_lock(&sched_dynamic_mutex);
> + __sched_dynamic_update(mode);
> + mutex_unlock(&sched_dynamic_mutex);
> +}
> +
> +static void __init preempt_dynamic_init(void)
> +{
> + if (preempt_dynamic_mode == preempt_dynamic_undefined) {
> + if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
> + sched_dynamic_update(preempt_dynamic_none);
> + } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
> + sched_dynamic_update(preempt_dynamic_voluntary);
> + } else {
> + /* Default static call setting, nothing to do */
> + WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
> + preempt_dynamic_mode = preempt_dynamic_full;
> + pr_info("%s: full\n", PREEMPT_MODE);
> + }
> + }
> +}
> +
> +static int __init setup_preempt_mode(char *str)
> +{
> + int mode = sched_dynamic_mode(str);
> + if (mode < 0) {
> + pr_warn("%s: unsupported mode: %s\n", PREEMPT_MODE, str);
> + return 0;
> + }
> +
> + sched_dynamic_update(mode);
> + return 1;
> +}
> +__setup("preempt=", setup_preempt_mode);
> +
> +#define PREEMPT_MODEL_ACCESSOR(mode) \
> + bool preempt_model_##mode(void) \
> + { \
> + WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
> + return preempt_dynamic_mode == preempt_dynamic_##mode; \
> + } \
> + EXPORT_SYMBOL_GPL(preempt_model_##mode)
> +
> +PREEMPT_MODEL_ACCESSOR(none);
> +PREEMPT_MODEL_ACCESSOR(voluntary);
> +PREEMPT_MODEL_ACCESSOR(full);
> +
> +#else /* !CONFIG_PREEMPT_DYNAMIC */
> +
> +static inline void preempt_dynamic_init(void) { }
> +
> +#endif /* !CONFIG_PREEMPT_DYNAMIC */
> +
> #ifdef CONFIG_PREEMPT_DYNAMIC
>
> #ifdef CONFIG_GENERIC_ENTRY
> @@ -8749,29 +8832,6 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
> * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> */
>
> -enum {
> - preempt_dynamic_undefined = -1,
> - preempt_dynamic_none,
> - preempt_dynamic_voluntary,
> - preempt_dynamic_full,
> -};
> -
> -int preempt_dynamic_mode = preempt_dynamic_undefined;
> -
> -int sched_dynamic_mode(const char *str)
> -{
> - if (!strcmp(str, "none"))
> - return preempt_dynamic_none;
> -
> - if (!strcmp(str, "voluntary"))
> - return preempt_dynamic_voluntary;
> -
> - if (!strcmp(str, "full"))
> - return preempt_dynamic_full;
> -
> - return -EINVAL;
> -}
> -
> #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> #define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
> #define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
> @@ -8782,7 +8842,6 @@ int sched_dynamic_mode(const char *str)
> #error "Unsupported PREEMPT_DYNAMIC mechanism"
> #endif
>
> -static DEFINE_MUTEX(sched_dynamic_mutex);
> static bool klp_override;
>
> static void __sched_dynamic_update(int mode)
> @@ -8807,7 +8866,7 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_disable(preempt_schedule_notrace);
> preempt_dynamic_disable(irqentry_exit_cond_resched);
> if (mode != preempt_dynamic_mode)
> - pr_info("Dynamic Preempt: none\n");
> + pr_info("%s: none\n", PREEMPT_MODE);
> break;
>
> case preempt_dynamic_voluntary:
> @@ -8818,7 +8877,7 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_disable(preempt_schedule_notrace);
> preempt_dynamic_disable(irqentry_exit_cond_resched);
> if (mode != preempt_dynamic_mode)
> - pr_info("Dynamic Preempt: voluntary\n");
> + pr_info("%s: voluntary\n", PREEMPT_MODE);
> break;
>
> case preempt_dynamic_full:
> @@ -8829,20 +8888,13 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_enable(preempt_schedule_notrace);
> preempt_dynamic_enable(irqentry_exit_cond_resched);
> if (mode != preempt_dynamic_mode)
> - pr_info("Dynamic Preempt: full\n");
> + pr_info("%s: full\n", PREEMPT_MODE);
> break;
> }
>
> preempt_dynamic_mode = mode;
> }
>
> -void sched_dynamic_update(int mode)
> -{
> - mutex_lock(&sched_dynamic_mutex);
> - __sched_dynamic_update(mode);
> - mutex_unlock(&sched_dynamic_mutex);
> -}
> -
> #ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL
>
> static int klp_cond_resched(void)
> @@ -8873,51 +8925,6 @@ void sched_dynamic_klp_disable(void)
>
> #endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>
> -static int __init setup_preempt_mode(char *str)
> -{
> - int mode = sched_dynamic_mode(str);
> - if (mode < 0) {
> - pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
> - return 0;
> - }
> -
> - sched_dynamic_update(mode);
> - return 1;
> -}
> -__setup("preempt=", setup_preempt_mode);
> -
> -static void __init preempt_dynamic_init(void)
> -{
> - if (preempt_dynamic_mode == preempt_dynamic_undefined) {
> - if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
> - sched_dynamic_update(preempt_dynamic_none);
> - } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
> - sched_dynamic_update(preempt_dynamic_voluntary);
> - } else {
> - /* Default static call setting, nothing to do */
> - WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
> - preempt_dynamic_mode = preempt_dynamic_full;
> - pr_info("Dynamic Preempt: full\n");
> - }
> - }
> -}
> -
> -#define PREEMPT_MODEL_ACCESSOR(mode) \
> - bool preempt_model_##mode(void) \
> - { \
> - WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
> - return preempt_dynamic_mode == preempt_dynamic_##mode; \
> - } \
> - EXPORT_SYMBOL_GPL(preempt_model_##mode)
> -
> -PREEMPT_MODEL_ACCESSOR(none);
> -PREEMPT_MODEL_ACCESSOR(voluntary);
> -PREEMPT_MODEL_ACCESSOR(full);
> -
> -#else /* !CONFIG_PREEMPT_DYNAMIC */
> -
> -static inline void preempt_dynamic_init(void) { }
> -
> #endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */
>
> /**
> --
> 2.31.1
>

2024-05-28 17:01:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 05/35] thread_info: selector for TIF_NEED_RESCHED[_LAZY]

On Mon, May 27, 2024 at 05:34:51PM -0700, Ankur Arora wrote:
> Define tif_resched() to serve as selector for the specific
> need-resched flag: with tif_resched() mapping to TIF_NEED_RESCHED
> or to TIF_NEED_RESCHED_LAZY.
>
> For !CONFIG_PREEMPT_AUTO, tif_resched() always evaluates
> to TIF_NEED_RESCHED, preserving existing scheduling behaviour.
>
> Cc: Peter Ziljstra <[email protected]>
> Originally-by: Thomas Gleixner <[email protected]>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> include/linux/thread_info.h | 25 +++++++++++++++++++++++++
> 1 file changed, 25 insertions(+)
>
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index 06e13e7acbe2..65e5beedc915 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -71,6 +71,31 @@ enum syscall_work_bit {
> #define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
> #endif
>
> +typedef enum {
> + RESCHED_NOW = 0,
> + RESCHED_LAZY = 1,
> +} resched_t;
> +
> +/*
> + * tif_resched(r) maps to TIF_NEED_RESCHED[_LAZY] with CONFIG_PREEMPT_AUTO.
> + *
> + * For !CONFIG_PREEMPT_AUTO, both tif_resched(RESCHED_NOW) and
> + * tif_resched(RESCHED_LAZY) reduce to the same value (TIF_NEED_RESCHED)
> + * leaving any scheduling behaviour unchanged.
> + */
> +static __always_inline int tif_resched(resched_t rs)
> +{
> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
> + return (rs == RESCHED_NOW) ? TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY;
> + else
> + return TIF_NEED_RESCHED;
> +}

Perhaps:

if (IS_ENABLED(CONFIG_PREEMPT_AUTO) && rs == RESCHED_LAZY)
return TIF_NEED_RESCHED_LAZY;

return TIF_NEED_RESCHED;

hmm?

> +
> +static __always_inline int _tif_resched(resched_t rs)
> +{
> + return 1 << tif_resched(rs);
> +}
> +
> #ifdef __KERNEL__
>
> #ifndef arch_set_restart_data
> --
> 2.31.1
>

2024-05-28 17:26:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 08/35] entry: handle lazy rescheduling at user-exit

On Mon, May 27, 2024 at 05:34:54PM -0700, Ankur Arora wrote:
> The scheduling policy for TIF_NEED_RESCHED_LAZY is to allow the
> running task to voluntarily schedule out, running it to completion.
>
> For archs with GENERIC_ENTRY, do this by adding a check in
> exit_to_user_mode_loop().
>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Originally-by: Thomas Gleixner <[email protected]>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> include/linux/entry-common.h | 2 +-
> kernel/entry/common.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index b0fb775a600d..f5bb19369973 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -65,7 +65,7 @@
> #define EXIT_TO_USER_MODE_WORK \
> (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
> _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
> - ARCH_EXIT_TO_USER_MODE_WORK)
> + _TIF_NEED_RESCHED_LAZY | ARCH_EXIT_TO_USER_MODE_WORK)

Should we be wanting both TIF_NEED_RESCHED flags side-by-side?

2024-05-28 17:32:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 09/35] entry/kvm: handle lazy rescheduling at guest-entry

On Mon, May 27, 2024 at 05:34:55PM -0700, Ankur Arora wrote:
> Archs defining CONFIG_KVM_XFER_TO_GUEST_WORK call
> xfer_to_guest_mode_handle_work() from various KVM vcpu-run
> loops to check for any task work including rescheduling.
>
> Handle TIF_NEED_RESCHED_LAZY alongside TIF_NEED_RESCHED.
>
> Also, while at it, remove the explicit check for need_resched() in
> the exit condition as that is already covered in the loop condition.
>
> Cc: Paolo Bonzini <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Originally-by: Thomas Gleixner <[email protected]>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> include/linux/entry-kvm.h | 2 +-
> kernel/entry/kvm.c | 4 ++--
> 2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
> index 6813171afccb..674a622c91be 100644
> --- a/include/linux/entry-kvm.h
> +++ b/include/linux/entry-kvm.h
> @@ -18,7 +18,7 @@
>
> #define XFER_TO_GUEST_MODE_WORK \
> (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
> - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
> + _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)

Same as last patch, it seems weird to have both RESCHED flags so far
apart.

2024-05-28 19:27:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 10/35] entry: irqentry_exit only preempts for TIF_NEED_RESCHED

On Mon, May 27, 2024 at 05:34:56PM -0700, Ankur Arora wrote:
> Use __tif_need_resched(RESCHED_NOW) instead of need_resched() to be
> explicit that this path only reschedules if it is needed imminently.
>
> Also, add a comment about why we need a need-resched check here at
> all, given that the top level conditional has already checked the
> preempt_count().
>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Originally-by: Thomas Gleixner <[email protected]>
> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> kernel/entry/common.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index bcb23c866425..c684385921de 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -307,7 +307,16 @@ void raw_irqentry_exit_cond_resched(void)
> rcu_irq_exit_check_preempt();
> if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
> WARN_ON_ONCE(!on_thread_stack());
> - if (need_resched())
> +
> + /*
> + * Check if we need to preempt eagerly.
> + *
> + * Note: we need an explicit check here because some
> + * architectures don't fold TIF_NEED_RESCHED in the
> + * preempt_count. For archs that do, this is already covered
> + * in the conditional above.
> + */
> + if (__tif_need_resched(RESCHED_NOW))
> preempt_schedule_irq();

Seeing how you introduced need_resched_lazy() and kept need_resched() to
be the NOW thing, I really don't see the point of using the long form
here?


2024-05-28 21:36:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO

On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
> Reuse sched_dynamic_update() and related logic to enable choosing
> the preemption model at boot or runtime for PREEMPT_AUTO.
>
> The interface is identical to PREEMPT_DYNAMIC.

Colour me confused, why?!? What are you doing and why aren't you just
adding AUTO to the existing DYNAMIC thing?

2024-05-29 06:19:06

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling



On 5/28/24 6:04 AM, Ankur Arora wrote:
> Hi,
>
> This series adds a new scheduling model PREEMPT_AUTO, which like
> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
> on explicit preemption points for the voluntary models.
>
> The series is based on Thomas' original proposal which he outlined
> in [1], [2] and in his PoC [3].
>
> v2 mostly reworks v1, with one of the main changes having less
> noisy need-resched-lazy related interfaces.
> More details in the changelog below.
>

Hi Ankur. Thanks for the series.

nit: had to manually apply patches 11, 12, 13 since they didn't apply cleanly
on tip/master and tip/sched/core, mostly due to some word differences in the change.

tip/master was at:
commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
Merge: 5d145493a139 47ff30cc1be7
Author: Ingo Molnar <[email protected]>
Date: Tue May 28 12:44:26 2024 +0200

Merge branch into tip/master: 'x86/percpu'



> The v1 of the series is at [4] and the RFC at [5].
>
> Design
> ==
>
> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
> PREEMPT_COUNT). This means that the scheduler can always safely
> preempt. (This is identical to CONFIG_PREEMPT.)
>
> Having that, the next step is to make the rescheduling policy dependent
> on the chosen scheduling model. Currently, the scheduler uses a single
> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a
> reschedule is needed.
> PREEMPT_AUTO extends this by adding an additional need-resched bit
> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the
> scheduler to express two kinds of rescheduling intent: schedule at
> the earliest opportunity (TIF_NEED_RESCHED), or express a need for
> rescheduling while allowing the task on the runqueue to run to
> timeslice completion (TIF_NEED_RESCHED_LAZY).
>
> The scheduler decides which need-resched bits are chosen based on
> the preemption model in use:
>
> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY
>
> none never always [*]
> voluntary higher sched class other tasks [*]
> full always never
>
> [*] some details elided.
>
> The last part of the puzzle is, when does preemption happen, or
> alternately stated, when are the need-resched bits checked:
>
> exit-to-user ret-to-kernel preempt_count()
>
> NEED_RESCHED_LAZY Y N N
> NEED_RESCHED Y Y Y
>
> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
> none/voluntary preemption policies are in effect. And eager semantics
> under full preemption.
>
> In addition, since this is driven purely by the scheduler (not
> depending on cond_resched() placement and the like), there is enough
> flexibility in the scheduler to cope with edge cases -- ex. a kernel
> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by
> simply upgrading to a full NEED_RESCHED which can use more coercive
> instruments like resched IPI to induce a context-switch.
>
> Performance
> ==
> The performance in the basic tests (perf bench sched messaging, kernbench,
> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC.
> (See patches
> "sched: support preempt=none under PREEMPT_AUTO"
> "sched: support preempt=full under PREEMPT_AUTO"
> "sched: handle preempt=voluntary under PREEMPT_AUTO")
>
> For a macro test, a colleague in Oracle's Exadata team tried two
> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series
> backported.)
>
> In both tests the data was cached on remote nodes (cells), and the
> database nodes (compute) served client queries, with clients being
> local in the first test and remote in the second.
>
> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
>
>
> PREEMPT_VOLUNTARY PREEMPT_AUTO
> (preempt=voluntary)
> ============================== =============================
> clients throughput cpu-usage throughput cpu-usage Gain
> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %)
> ------- ---------- ----------------- ---------- ----------------- -------
>
>
> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7%
> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6%
> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8%
>
>
> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5%
> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6%
> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3%
> 90/10 RW ratio)
>
>
> (Both sets of tests have a fair amount of NW traffic since the query
> tables etc are cached on the cells. Additionally, the first set,
> given the local clients, stress the scheduler a bit more than the
> second.)
>
> The comparative performance for both the tests is fairly close,
> more or less within a margin of error.
>
> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):
>
> "
> a) Base kernel (6.7),
> b) v1, PREEMPT_AUTO, preempt=voluntary
> c) v1, PREEMPT_DYNAMIC, preempt=voluntary
> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
>
> Workloads I tested and their %gain,
> case b case c case d
> NAS +2.7% +1.9% +2.1%
> Hashjoin, +0.0% +0.0% +0.0%
> Graph500, -6.0% +0.0% +0.0%
> XSBench +1.7% +0.0% +1.2%
>
> (Note about the Graph500 numbers at [8].)
>
> Did kernbench etc test from Mel's mmtests suite also. Did not notice
> much difference.
> "
>
> One case where there is a significant performance drop is on powerpc,
> seen running hackbench on a 320 core system (a test on a smaller system is
> fine.) In theory there's no reason for this to only happen on powerpc
> since most of the code is common, but I haven't been able to reproduce
> it on x86 so far.
>
> All in all, I think the tests above show that this scheduling model has legs.
> However, the none/voluntary models under PREEMPT_AUTO are conceptually
> different enough from the current none/voluntary models that there
> likely are workloads where performance would be subpar. That needs more
> extensive testing to figure out the weak points.
>
>
>
Did test it again on PowerPC. Unfortunately the numbers show there is still a
regression compared to 6.10-rc1. This is with preempt=none. I tried again on the
smaller system too to confirm. For now I have done the comparison for hackbench,
where the highest regression was seen in v1.

perf stat collected over 20 iterations shows more context switches and more
migrations. Could it be that the LAZY bit is causing more context switches? Or
could it be something else? Could it be that more exit-to-user transitions
happen on PowerPC? Will continue to debug.

Meanwhile, will run more tests with other micro-benchmarks and post the results.


More details below.
CONFIG_HZ = 100
/hackbench -pipe 60 process 100000 loops

====================================================================================
On the larger system. (40 Cores, 320CPUS)
====================================================================================
6.10-rc1 +preempt_auto
preempt=none preempt=none
20 iterations avg value
hackbench pipe(60) 26.403 32.368 ( -31.1%)

++++++++++++++++++
baseline 6.10-rc1:
++++++++++++++++++
Performance counter stats for 'system wide' (20 runs):
168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% )
6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% )
246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% )
1,759,232 page-faults # 19.716 /sec ( +- 6.61% )
577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% )
226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% )
37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% )
166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% )

26.403 +- 0.166 seconds time elapsed ( +- 0.63% )

++++++++++++
preempt auto
++++++++++++
Performance counter stats for 'system wide' (20 runs):
207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% )
9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% )
631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% )
1,756,583 page-faults # 16.112 /sec ( +- 6.59% )
700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% )
254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% )
42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% )
231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% )

32.368 +- 0.200 seconds time elapsed ( +- 0.62% )


============================================================================================
Smaller system ( 12Cores, 96CPUS)
============================================================================================
6.10-rc1 +preempt_auto
preempt=none preempt=none
20 iterations avg value
hackbench pipe(60) 55.930 65.75 ( -17.6%)

++++++++++++++++++
baseline 6.10-rc1:
++++++++++++++++++
Performance counter stats for 'system wide' (20 runs):
107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% )
1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% )
44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% )
1,698,710 page-faults # 30.010 /sec ( +- 6.58% )
412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% )
192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% )
30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% )
99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% )

55.930 +- 0.509 seconds time elapsed ( +- 0.91% )


+++++++++++++++++
v2_preempt_auto
+++++++++++++++++
Performance counter stats for 'system wide' (20 runs):
126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% )
2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% )
147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% )
1,710,637 page-faults # 25.593 /sec ( +- 6.55% )
483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% )
210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% )
33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% )
134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% )

65.75 +- 1.06 seconds time elapsed ( +- 1.61% )


2024-05-29 10:50:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 21/35] sched: prepare for lazy rescheduling in resched_curr()

On Mon, May 27, 2024 at 05:35:07PM -0700, Ankur Arora wrote:

> @@ -1041,25 +1041,34 @@ void wake_up_q(struct wake_q_head *head)
> void resched_curr(struct rq *rq)
> {
> struct task_struct *curr = rq->curr;
> + resched_t rs = RESCHED_NOW;
> int cpu;
>
> lockdep_assert_rq_held(rq);
>
> - if (__test_tsk_need_resched(curr, RESCHED_NOW))
> + /*
> + * TIF_NEED_RESCHED is the higher priority bit, so if it is already
> + * set, nothing more to be done.
> + */
> + if (__test_tsk_need_resched(curr, RESCHED_NOW) ||
> + (rs == RESCHED_LAZY && __test_tsk_need_resched(curr, RESCHED_LAZY)))
> return;
>
> cpu = cpu_of(rq);
>
> if (cpu == smp_processor_id()) {
> - __set_tsk_need_resched(curr, RESCHED_NOW);
> - set_preempt_need_resched();
> + __set_tsk_need_resched(curr, rs);
> + if (rs == RESCHED_NOW)
> + set_preempt_need_resched();
> return;
> }
>
> - if (set_nr_and_not_polling(curr))
> - smp_send_reschedule(cpu);
> - else
> + if (set_nr_and_not_polling(curr, rs)) {
> + if (rs == RESCHED_NOW)
> + smp_send_reschedule(cpu);

I'm thinking this wants at least something like:

WARN_ON_ONCE(rs == RESCHED_LAZY && is_idle_task(curr));


> + } else {
> trace_sched_wake_idle_without_ipi(cpu);
> + }
> }
>
> void resched_cpu(int cpu)
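
(That is, one possible placement, sketched against the hunks above and reusing
the resched_t plumbing from the patch -- illustrative only:)

	cpu = cpu_of(rq);

	/* Lazily rescheduling the idle task makes no sense; it must be kicked eagerly. */
	WARN_ON_ONCE(rs == RESCHED_LAZY && is_idle_task(curr));

	if (cpu == smp_processor_id()) {
		__set_tsk_need_resched(curr, rs);
		if (rs == RESCHED_NOW)
			set_preempt_need_resched();
		return;
	}

	if (set_nr_and_not_polling(curr, rs)) {
		if (rs == RESCHED_NOW)
			smp_send_reschedule(cpu);
	} else {
		trace_sched_wake_idle_without_ipi(cpu);
	}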

2024-05-29 11:50:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full

On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
> The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
> works at cross purposes: the RCU read side critical sections disable
> preemption, while preempt=full schedules eagerly to minimize
> latency.
>
> Warn if the user is switching to full preemption with PREEMPT_RCU=n.
>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Juri Lelli <[email protected]>
> Cc: Vincent Guittot <[email protected]>
> Suggested-by: Paul E. McKenney <[email protected]>
> Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
> Signed-off-by: Ankur Arora <[email protected]>
> ---
> kernel/sched/core.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d7804e29182d..df8e333f2d8b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
> break;
>
> case preempt_dynamic_full:
> + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
> + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
> + PREEMPT_MODE);
> +

Yeah, so I don't believe this is a viable strategy.

Firstly, none of these RCU patches are actually about the whole LAZY
preempt scheme, they apply equally well (arguably better) to the
existing PREEMPT_DYNAMIC thing.

Secondly, esp. with the LAZY thing, you are effectively running FULL at
all times. It's just that some of the preemptions, typically those of
the normal scheduling class are somewhat delayed. However RT/DL classes
are still insta preempt.

Meaning that if you run anything in the realtime classes you're running
a fully preemptible kernel. As such, RCU had better be able to deal with
it.

So no, I don't believe this is right.

2024-05-30 09:05:19

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 10/35] entry: irqentry_exit only preempts for TIF_NEED_RESCHED


Peter Zijlstra <[email protected]> writes:

> On Mon, May 27, 2024 at 05:34:56PM -0700, Ankur Arora wrote:
>> Use __tif_need_resched(RESCHED_NOW) instead of need_resched() to be
>> explicit that this path only reschedules if it is needed imminently.
>>
>> Also, add a comment about why we need a need-resched check here at
>> all, given that the top level conditional has already checked the
>> preempt_count().
>>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Originally-by: Thomas Gleixner <[email protected]>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> kernel/entry/common.c | 11 ++++++++++-
>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
>> index bcb23c866425..c684385921de 100644
>> --- a/kernel/entry/common.c
>> +++ b/kernel/entry/common.c
>> @@ -307,7 +307,16 @@ void raw_irqentry_exit_cond_resched(void)
>> rcu_irq_exit_check_preempt();
>> if (IS_ENABLED(CONFIG_DEBUG_ENTRY))
>> WARN_ON_ONCE(!on_thread_stack());
>> - if (need_resched())
>> +
>> + /*
>> + * Check if we need to preempt eagerly.
>> + *
>> + * Note: we need an explicit check here because some
>> + * architectures don't fold TIF_NEED_RESCHED in the
>> + * preempt_count. For archs that do, this is already covered
>> + * in the conditional above.
>> + */
>> + if (__tif_need_resched(RESCHED_NOW))
>> preempt_schedule_irq();
>
> Seeing how you introduced need_resched_lazy() and kept need_resched() to
> be the NOW thing, I really don't see the point of using the long form
> here?

So, the reason I used the lower level interface here (and in the scheduler)
was to spell out exactly what is happening here.

Basically, keep need_resched()/need_resched_lazy() for the non-core code.
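
Roughly, the split I have in mind (a sketch, not the exact definitions from
the series):

	/* core code spells out which bit it is checking ... */
	if (__tif_need_resched(RESCHED_NOW))
		preempt_schedule_irq();

	/* ... while non-core code keeps the familiar wrappers */
	static __always_inline bool need_resched(void)
	{
		return unlikely(__tif_need_resched(RESCHED_NOW));
	}

	static __always_inline bool need_resched_lazy(void)
	{
		return unlikely(__tif_need_resched(RESCHED_LAZY));
	}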

--
ankur

2024-05-30 09:05:37

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 09/35] entry/kvm: handle lazy rescheduling at guest-entry


Peter Zijlstra <[email protected]> writes:

> On Mon, May 27, 2024 at 05:34:55PM -0700, Ankur Arora wrote:
>> Archs defining CONFIG_KVM_XFER_TO_GUEST_WORK call
>> xfer_to_guest_mode_handle_work() from various KVM vcpu-run
>> loops to check for any task work including rescheduling.
>>
>> Handle TIF_NEED_RESCHED_LAZY alongside TIF_NEED_RESCHED.
>>
>> Also, while at it, remove the explicit check for need_resched() in
>> the exit condition as that is already covered in the loop condition.
>>
>> Cc: Paolo Bonzini <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Originally-by: Thomas Gleixner <[email protected]>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> include/linux/entry-kvm.h | 2 +-
>> kernel/entry/kvm.c | 4 ++--
>> 2 files changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
>> index 6813171afccb..674a622c91be 100644
>> --- a/include/linux/entry-kvm.h
>> +++ b/include/linux/entry-kvm.h
>> @@ -18,7 +18,7 @@
>>
>> #define XFER_TO_GUEST_MODE_WORK \
>> (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
>> - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
>> + _TIF_NOTIFY_RESUME | _TIF_NEED_RESCHED_LAZY | ARCH_XFER_TO_GUEST_MODE_WORK)
>
> Same as last patch, it seems weird to have both RESCHED flags so far
> apart.

True. Will fix this and the other.
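
That is, keep both resched bits adjacent, roughly (sketch only, not the exact
hunk):

	#define XFER_TO_GUEST_MODE_WORK					\
		(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY |		\
		 _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL |			\
		 _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)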

--
ankur

2024-05-30 09:19:21

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 05/35] thread_info: selector for TIF_NEED_RESCHED[_LAZY]


Peter Zijlstra <[email protected]> writes:

> On Mon, May 27, 2024 at 05:34:51PM -0700, Ankur Arora wrote:
>> Define tif_resched() to serve as selector for the specific
>> need-resched flag: with tif_resched() mapping to TIF_NEED_RESCHED
>> or to TIF_NEED_RESCHED_LAZY.
>>
>> For !CONFIG_PREEMPT_AUTO, tif_resched() always evaluates
>> to TIF_NEED_RESCHED, preserving existing scheduling behaviour.
>>
>> Cc: Peter Ziljstra <[email protected]>
>> Originally-by: Thomas Gleixner <[email protected]>
>> Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> include/linux/thread_info.h | 25 +++++++++++++++++++++++++
>> 1 file changed, 25 insertions(+)
>>
>> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
>> index 06e13e7acbe2..65e5beedc915 100644
>> --- a/include/linux/thread_info.h
>> +++ b/include/linux/thread_info.h
>> @@ -71,6 +71,31 @@ enum syscall_work_bit {
>> #define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
>> #endif
>>
>> +typedef enum {
>> + RESCHED_NOW = 0,
>> + RESCHED_LAZY = 1,
>> +} resched_t;
>> +
>> +/*
>> + * tif_resched(r) maps to TIF_NEED_RESCHED[_LAZY] with CONFIG_PREEMPT_AUTO.
>> + *
>> + * For !CONFIG_PREEMPT_AUTO, both tif_resched(RESCHED_NOW) and
>> + * tif_resched(RESCHED_LAZY) reduce to the same value (TIF_NEED_RESCHED)
>> + * leaving any scheduling behaviour unchanged.
>> + */
>> +static __always_inline int tif_resched(resched_t rs)
>> +{
>> + if (IS_ENABLED(CONFIG_PREEMPT_AUTO))
>> + return (rs == RESCHED_NOW) ? TIF_NEED_RESCHED : TIF_NEED_RESCHED_LAZY;
>> + else
>> + return TIF_NEED_RESCHED;
>> +}
>
> Perhaps:
>
> if (IS_ENABLED(CONFIG_PREEMPT_AUTO) && rs == RESCHED_LAZY)
> return TIF_NEED_RESCHED_LAZY;
>
> return TIF_NEED_RESCHED;

This and similar other interface changes make it much clearer.
Thanks. Will fix.
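
(So tif_resched() would collapse to something like the sketch below:)

	static __always_inline int tif_resched(resched_t rs)
	{
		if (IS_ENABLED(CONFIG_PREEMPT_AUTO) && rs == RESCHED_LAZY)
			return TIF_NEED_RESCHED_LAZY;

		return TIF_NEED_RESCHED;
	}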


--
ankur

2024-05-30 09:31:34

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO


Peter Zijlstra <[email protected]> writes:

> On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
>> Reuse sched_dynamic_update() and related logic to enable choosing
>> the preemption model at boot or runtime for PREEMPT_AUTO.
>>
>> The interface is identical to PREEMPT_DYNAMIC.
>
> Colour me confused, why?!? What are you doing and why aren't just just
> adding AUTO to the existing DYNAMIC thing?

You mean have a single __sched_dynamic_update()? AUTO doesn't use any
of the static_call/static_key stuff so I'm not sure how that would work.

Or am I missing the point of what you are saying?

--
ankur

2024-05-30 09:38:42

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 12/35] sched: separate PREEMPT_DYNAMIC config logic


Peter Zijlstra <[email protected]> writes:

> On Mon, May 27, 2024 at 05:34:58PM -0700, Ankur Arora wrote:
>> Pull out the PREEMPT_DYNAMIC setup logic to allow other preemption
>> models to dynamically configure preemption.
>
> Uh what ?!? What's the point of creating back-to-back #ifdef sections ?

Now that you mention it, it does seem quite odd.

Assuming I keep the separation, maybe it makes sense to make the runtime
configuration its own configuration option, say CONFIG_PREEMPT_RUNTIME.

And then PREEMPT_AUTO and PREEMPT_DYNAMIC could both select it?


>> Cc: Ingo Molnar <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Juri Lelli <[email protected]>
>> Cc: Vincent Guittot <[email protected]>
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> kernel/sched/core.c | 165 +++++++++++++++++++++++---------------------
>> 1 file changed, 86 insertions(+), 79 deletions(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 0c26b60c1101..349f6257fdcd 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -8713,6 +8713,89 @@ int __cond_resched_rwlock_write(rwlock_t *lock)
>> }
>> EXPORT_SYMBOL(__cond_resched_rwlock_write);
>>
>> +#if defined(CONFIG_PREEMPT_DYNAMIC)
>> +
>> +#define PREEMPT_MODE "Dynamic Preempt"
>> +
>> +enum {
>> + preempt_dynamic_undefined = -1,
>> + preempt_dynamic_none,
>> + preempt_dynamic_voluntary,
>> + preempt_dynamic_full,
>> +};
>> +
>> +int preempt_dynamic_mode = preempt_dynamic_undefined;
>> +static DEFINE_MUTEX(sched_dynamic_mutex);
>> +
>> +int sched_dynamic_mode(const char *str)
>> +{
>> + if (!strcmp(str, "none"))
>> + return preempt_dynamic_none;
>> +
>> + if (!strcmp(str, "voluntary"))
>> + return preempt_dynamic_voluntary;
>> +
>> + if (!strcmp(str, "full"))
>> + return preempt_dynamic_full;
>> +
>> + return -EINVAL;
>> +}
>> +
>> +static void __sched_dynamic_update(int mode);
>> +void sched_dynamic_update(int mode)
>> +{
>> + mutex_lock(&sched_dynamic_mutex);
>> + __sched_dynamic_update(mode);
>> + mutex_unlock(&sched_dynamic_mutex);
>> +}
>> +
>> +static void __init preempt_dynamic_init(void)
>> +{
>> + if (preempt_dynamic_mode == preempt_dynamic_undefined) {
>> + if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
>> + sched_dynamic_update(preempt_dynamic_none);
>> + } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
>> + sched_dynamic_update(preempt_dynamic_voluntary);
>> + } else {
>> + /* Default static call setting, nothing to do */
>> + WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
>> + preempt_dynamic_mode = preempt_dynamic_full;
>> + pr_info("%s: full\n", PREEMPT_MODE);
>> + }
>> + }
>> +}
>> +
>> +static int __init setup_preempt_mode(char *str)
>> +{
>> + int mode = sched_dynamic_mode(str);
>> + if (mode < 0) {
>> + pr_warn("%s: unsupported mode: %s\n", PREEMPT_MODE, str);
>> + return 0;
>> + }
>> +
>> + sched_dynamic_update(mode);
>> + return 1;
>> +}
>> +__setup("preempt=", setup_preempt_mode);
>> +
>> +#define PREEMPT_MODEL_ACCESSOR(mode) \
>> + bool preempt_model_##mode(void) \
>> + { \
>> + WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
>> + return preempt_dynamic_mode == preempt_dynamic_##mode; \
>> + } \
>> + EXPORT_SYMBOL_GPL(preempt_model_##mode)
>> +
>> +PREEMPT_MODEL_ACCESSOR(none);
>> +PREEMPT_MODEL_ACCESSOR(voluntary);
>> +PREEMPT_MODEL_ACCESSOR(full);
>> +
>> +#else /* !CONFIG_PREEMPT_DYNAMIC */
>> +
>> +static inline void preempt_dynamic_init(void) { }
>> +
>> +#endif /* !CONFIG_PREEMPT_DYNAMIC */
>> +
>> #ifdef CONFIG_PREEMPT_DYNAMIC
>>
>> #ifdef CONFIG_GENERIC_ENTRY
>> @@ -8749,29 +8832,6 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
>> * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
>> */
>>
>> -enum {
>> - preempt_dynamic_undefined = -1,
>> - preempt_dynamic_none,
>> - preempt_dynamic_voluntary,
>> - preempt_dynamic_full,
>> -};
>> -
>> -int preempt_dynamic_mode = preempt_dynamic_undefined;
>> -
>> -int sched_dynamic_mode(const char *str)
>> -{
>> - if (!strcmp(str, "none"))
>> - return preempt_dynamic_none;
>> -
>> - if (!strcmp(str, "voluntary"))
>> - return preempt_dynamic_voluntary;
>> -
>> - if (!strcmp(str, "full"))
>> - return preempt_dynamic_full;
>> -
>> - return -EINVAL;
>> -}
>> -
>> #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
>> #define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
>> #define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
>> @@ -8782,7 +8842,6 @@ int sched_dynamic_mode(const char *str)
>> #error "Unsupported PREEMPT_DYNAMIC mechanism"
>> #endif
>>
>> -static DEFINE_MUTEX(sched_dynamic_mutex);
>> static bool klp_override;
>>
>> static void __sched_dynamic_update(int mode)
>> @@ -8807,7 +8866,7 @@ static void __sched_dynamic_update(int mode)
>> preempt_dynamic_disable(preempt_schedule_notrace);
>> preempt_dynamic_disable(irqentry_exit_cond_resched);
>> if (mode != preempt_dynamic_mode)
>> - pr_info("Dynamic Preempt: none\n");
>> + pr_info("%s: none\n", PREEMPT_MODE);
>> break;
>>
>> case preempt_dynamic_voluntary:
>> @@ -8818,7 +8877,7 @@ static void __sched_dynamic_update(int mode)
>> preempt_dynamic_disable(preempt_schedule_notrace);
>> preempt_dynamic_disable(irqentry_exit_cond_resched);
>> if (mode != preempt_dynamic_mode)
>> - pr_info("Dynamic Preempt: voluntary\n");
>> + pr_info("%s: voluntary\n", PREEMPT_MODE);
>> break;
>>
>> case preempt_dynamic_full:
>> @@ -8829,20 +8888,13 @@ static void __sched_dynamic_update(int mode)
>> preempt_dynamic_enable(preempt_schedule_notrace);
>> preempt_dynamic_enable(irqentry_exit_cond_resched);
>> if (mode != preempt_dynamic_mode)
>> - pr_info("Dynamic Preempt: full\n");
>> + pr_info("%s: full\n", PREEMPT_MODE);
>> break;
>> }
>>
>> preempt_dynamic_mode = mode;
>> }
>>
>> -void sched_dynamic_update(int mode)
>> -{
>> - mutex_lock(&sched_dynamic_mutex);
>> - __sched_dynamic_update(mode);
>> - mutex_unlock(&sched_dynamic_mutex);
>> -}
>> -
>> #ifdef CONFIG_HAVE_PREEMPT_DYNAMIC_CALL
>>
>> static int klp_cond_resched(void)
>> @@ -8873,51 +8925,6 @@ void sched_dynamic_klp_disable(void)
>>
>> #endif /* CONFIG_HAVE_PREEMPT_DYNAMIC_CALL */
>>
>> -static int __init setup_preempt_mode(char *str)
>> -{
>> - int mode = sched_dynamic_mode(str);
>> - if (mode < 0) {
>> - pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
>> - return 0;
>> - }
>> -
>> - sched_dynamic_update(mode);
>> - return 1;
>> -}
>> -__setup("preempt=", setup_preempt_mode);
>> -
>> -static void __init preempt_dynamic_init(void)
>> -{
>> - if (preempt_dynamic_mode == preempt_dynamic_undefined) {
>> - if (IS_ENABLED(CONFIG_PREEMPT_NONE)) {
>> - sched_dynamic_update(preempt_dynamic_none);
>> - } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
>> - sched_dynamic_update(preempt_dynamic_voluntary);
>> - } else {
>> - /* Default static call setting, nothing to do */
>> - WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
>> - preempt_dynamic_mode = preempt_dynamic_full;
>> - pr_info("Dynamic Preempt: full\n");
>> - }
>> - }
>> -}
>> -
>> -#define PREEMPT_MODEL_ACCESSOR(mode) \
>> - bool preempt_model_##mode(void) \
>> - { \
>> - WARN_ON_ONCE(preempt_dynamic_mode == preempt_dynamic_undefined); \
>> - return preempt_dynamic_mode == preempt_dynamic_##mode; \
>> - } \
>> - EXPORT_SYMBOL_GPL(preempt_model_##mode)
>> -
>> -PREEMPT_MODEL_ACCESSOR(none);
>> -PREEMPT_MODEL_ACCESSOR(voluntary);
>> -PREEMPT_MODEL_ACCESSOR(full);
>> -
>> -#else /* !CONFIG_PREEMPT_DYNAMIC */
>> -
>> -static inline void preempt_dynamic_init(void) { }
>> -
>> #endif /* #ifdef CONFIG_PREEMPT_DYNAMIC */
>>
>> /**
>> --
>> 2.31.1
>>


--
ankur

2024-05-30 18:33:03

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full

On Wed, May 29, 2024 at 10:14:04AM +0200, Peter Zijlstra wrote:
> On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
> > The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
> > works at cross purposes: the RCU read side critical sections disable
> > preemption, while preempt=full schedules eagerly to minimize
> > latency.
> >
> > Warn if the user is switching to full preemption with PREEMPT_RCU=n.
> >
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Peter Zijlstra <[email protected]>
> > Cc: Juri Lelli <[email protected]>
> > Cc: Vincent Guittot <[email protected]>
> > Suggested-by: Paul E. McKenney <[email protected]>
> > Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
> > Signed-off-by: Ankur Arora <[email protected]>
> > ---
> > kernel/sched/core.c | 4 ++++
> > 1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index d7804e29182d..df8e333f2d8b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
> > break;
> >
> > case preempt_dynamic_full:
> > + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
> > + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
> > + PREEMPT_MODE);
> > +
>
> Yeah, so I don't believe this is a viable strategy.
>
> Firstly, none of these RCU patches are actually about the whole LAZY
> preempt scheme, they apply equally well (arguably better) to the
> existing PREEMPT_DYNAMIC thing.
>
> Secondly, esp. with the LAZY thing, you are effectively running FULL at
> all times. It's just that some of the preemptions, typically those of
> the normal scheduling class are somewhat delayed. However RT/DL classes
> are still insta preempt.
>
> Meaning that if you run anything in the realtime classes you're running
> a fully preemptible kernel. As such, RCU had better be able to deal with
> it.
>
> So no, I don't believe this is right.

At one point, lazy preemption selected PREEMPT_COUNT (which I am
not seeing in this version, perhaps due to blindness on my part).
Of course, selecting PREEMPT_COUNT would result in !PREEMPT_RCU kernel's
rcu_read_lock() explicitly disabling preemption, thus avoiding preemption
(including lazy preemption) in RCU read-side critical sections.

Ankur, what am I missing here?

Thanx, Paul

2024-05-30 23:06:35

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full


Paul E. McKenney <[email protected]> writes:

> On Wed, May 29, 2024 at 10:14:04AM +0200, Peter Zijlstra wrote:
>> On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
>> > The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
>> > works at cross purposes: the RCU read side critical sections disable
>> > preemption, while preempt=full schedules eagerly to minimize
>> > latency.
>> >
>> > Warn if the user is switching to full preemption with PREEMPT_RCU=n.
>> >
>> > Cc: Ingo Molnar <[email protected]>
>> > Cc: Peter Zijlstra <[email protected]>
>> > Cc: Juri Lelli <[email protected]>
>> > Cc: Vincent Guittot <[email protected]>
>> > Suggested-by: Paul E. McKenney <[email protected]>
>> > Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
>> > Signed-off-by: Ankur Arora <[email protected]>
>> > ---
>> > kernel/sched/core.c | 4 ++++
>> > 1 file changed, 4 insertions(+)
>> >
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index d7804e29182d..df8e333f2d8b 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>> > @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
>> > break;
>> >
>> > case preempt_dynamic_full:
>> > + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
>> > + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
>> > + PREEMPT_MODE);
>> > +
>>
>> Yeah, so I don't believe this is a viable strategy.
>>
>> Firstly, none of these RCU patches are actually about the whole LAZY
>> preempt scheme, they apply equally well (arguably better) to the
>> existing PREEMPT_DYNAMIC thing.
>>
>> Secondly, esp. with the LAZY thing, you are effectively running FULL at
>> all times. It's just that some of the preemptions, typically those of
>> the normal scheduling class are somewhat delayed. However RT/DL classes
>> are still insta preempt.
>>
>> Meaning that if you run anything in the realtime classes you're running
>> a fully preemptible kernel. As such, RCU had better be able to deal with
>> it.
>>
>> So no, I don't believe this is right.
>
> At one point, lazy preemption selected PREEMPT_COUNT (which I am
> not seeing in this version, perhaps due to blindness on my part).
> Of course, selecting PREEMPT_COUNT would result in !PREEMPT_RCU kernel's
> rcu_read_lock() explicitly disabling preemption, thus avoiding preemption
> (including lazy preemption) in RCU read-side critical sections.

That should be still happening, just transitively. PREEMPT_AUTO selects
PREEMPT_BUILD, which selects PREEMPTION, and that in turn selects
PREEMPT_COUNT.


--
ankur

2024-05-30 23:12:23

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full


Peter Zijlstra <[email protected]> writes:

> On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
>> The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
>> works at cross purposes: the RCU read side critical sections disable
>> preemption, while preempt=full schedules eagerly to minimize
>> latency.
>>
>> Warn if the user is switching to full preemption with PREEMPT_RCU=n.
>>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Juri Lelli <[email protected]>
>> Cc: Vincent Guittot <[email protected]>
>> Suggested-by: Paul E. McKenney <[email protected]>
>> Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
>> Signed-off-by: Ankur Arora <[email protected]>
>> ---
>> kernel/sched/core.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index d7804e29182d..df8e333f2d8b 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
>> break;
>>
>> case preempt_dynamic_full:
>> + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
>> + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
>> + PREEMPT_MODE);
>> +
>
> Yeah, so I don't believe this is a viable strategy.
>
> Firstly, none of these RCU patches are actually about the whole LAZY
> preempt scheme, they apply equally well (arguably better) to the
> existing PREEMPT_DYNAMIC thing.

Agreed.

> Secondly, esp. with the LAZY thing, you are effectively running FULL at
> all times. It's just that some of the preemptions, typically those of
> the normal scheduling class are somewhat delayed. However RT/DL classes
> are still insta preempt.

Also, agreed.

> Meaning that if you run anything in the realtime classes you're running
> a fully preemptible kernel. As such, RCU had better be able to deal with
> it.

So, RCU can deal with (PREEMPT_RCU=y, PREEMPT_AUTO=y, preempt=none/voluntary/full),
since that's basically what PREEMPT_DYNAMIC already works with.

The other combination, (PREEMPT_RCU=n, PREEMPT_AUTO,
preempt=none/voluntary) would generally be business as usual, except, as
you say, it is really PREEMPT_RCU=n, preempt=full in disguise.

However, as Paul says, __rcu_read_lock() for PREEMPT_RCU=n is defined as:

	static inline void __rcu_read_lock(void)
	{
		preempt_disable();
	}

So, this combination -- though non standard -- should also work.

The reason for adding the warning was that Paul had warned in earlier
discussions (see here for instance:
https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/)
that PREEMPT_FULL=y with PREEMPT_RCU=n is basically useless. But at
least in my understanding that's primarily a performance concern, not a
correctness concern. Paul can probably speak to that more.

"PREEMPT_FULL=y plus PREEMPT_RCU=n appears to be a useless
combination. All of the gains from PREEMPT_FULL=y are more than lost
due to PREEMPT_RCU=n, especially when the kernel decides to do something
like walk a long task list under RCU protection. We should not waste
people's time getting burned by this combination, nor should we waste
cycles testing it."


--
ankur

2024-05-30 23:15:29

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full

On Thu, May 30, 2024 at 04:05:26PM -0700, Ankur Arora wrote:
>
> Paul E. McKenney <[email protected]> writes:
>
> > On Wed, May 29, 2024 at 10:14:04AM +0200, Peter Zijlstra wrote:
> >> On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
> >> > The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
> >> > works at cross purposes: the RCU read side critical sections disable
> >> > preemption, while preempt=full schedules eagerly to minimize
> >> > latency.
> >> >
> >> > Warn if the user is switching to full preemption with PREEMPT_RCU=n.
> >> >
> >> > Cc: Ingo Molnar <[email protected]>
> >> > Cc: Peter Zijlstra <[email protected]>
> >> > Cc: Juri Lelli <[email protected]>
> >> > Cc: Vincent Guittot <[email protected]>
> >> > Suggested-by: Paul E. McKenney <[email protected]>
> >> > Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
> >> > Signed-off-by: Ankur Arora <[email protected]>
> >> > ---
> >> > kernel/sched/core.c | 4 ++++
> >> > 1 file changed, 4 insertions(+)
> >> >
> >> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> > index d7804e29182d..df8e333f2d8b 100644
> >> > --- a/kernel/sched/core.c
> >> > +++ b/kernel/sched/core.c
> >> > @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
> >> > break;
> >> >
> >> > case preempt_dynamic_full:
> >> > + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
> >> > + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
> >> > + PREEMPT_MODE);
> >> > +
> >>
> >> Yeah, so I don't believe this is a viable strategy.
> >>
> >> Firstly, none of these RCU patches are actually about the whole LAZY
> >> preempt scheme, they apply equally well (arguably better) to the
> >> existing PREEMPT_DYNAMIC thing.
> >>
> >> Secondly, esp. with the LAZY thing, you are effectively running FULL at
> >> all times. It's just that some of the preemptions, typically those of
> >> the normal scheduling class are somewhat delayed. However RT/DL classes
> >> are still insta preempt.
> >>
> >> Meaning that if you run anything in the realtime classes you're running
> >> a fully preemptible kernel. As such, RCU had better be able to deal with
> >> it.
> >>
> >> So no, I don't believe this is right.
> >
> > At one point, lazy preemption selected PREEMPT_COUNT (which I am
> > not seeing in this version, perhaps due to blindness on my part).
> > Of course, selecting PREEMPT_COUNT would result in !PREEMPT_RCU kernel's
> > rcu_read_lock() explicitly disabling preemption, thus avoiding preemption
> > (including lazy preemption) in RCU read-side critical sections.
>
> That should be still happening, just transitively. PREEMPT_AUTO selects
> PREEMPT_BUILD, which selects PREEMPTION, and that in turn selects
> PREEMPT_COUNT.

Ah, I gave up too soon. Thank you!

Thanx, Paul

2024-05-30 23:29:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full

On Thu, May 30, 2024 at 04:04:41PM -0700, Ankur Arora wrote:
>
> Peter Zijlstra <[email protected]> writes:
>
> > On Mon, May 27, 2024 at 05:35:02PM -0700, Ankur Arora wrote:
> >> The combination of PREEMPT_RCU=n and (PREEMPT_AUTO=y, preempt=full)
> >> works at cross purposes: the RCU read side critical sections disable
> >> preemption, while preempt=full schedules eagerly to minimize
> >> latency.
> >>
> >> Warn if the user is switching to full preemption with PREEMPT_RCU=n.
> >>
> >> Cc: Ingo Molnar <[email protected]>
> >> Cc: Peter Zijlstra <[email protected]>
> >> Cc: Juri Lelli <[email protected]>
> >> Cc: Vincent Guittot <[email protected]>
> >> Suggested-by: Paul E. McKenney <[email protected]>
> >> Link: https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/
> >> Signed-off-by: Ankur Arora <[email protected]>
> >> ---
> >> kernel/sched/core.c | 4 ++++
> >> 1 file changed, 4 insertions(+)
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index d7804e29182d..df8e333f2d8b 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -8943,6 +8943,10 @@ static void __sched_dynamic_update(int mode)
> >> break;
> >>
> >> case preempt_dynamic_full:
> >> + if (!IS_ENABLED(CONFIG_PREEMPT_RCU))
> >> + pr_warn("%s: preempt=full is not recommended with CONFIG_PREEMPT_RCU=n",
> >> + PREEMPT_MODE);
> >> +
> >
> > Yeah, so I don't believe this is a viable strategy.
> >
> > Firstly, none of these RCU patches are actually about the whole LAZY
> > preempt scheme, they apply equally well (arguably better) to the
> > existing PREEMPT_DYNAMIC thing.
>
> Agreed.
>
> > Secondly, esp. with the LAZY thing, you are effectively running FULL at
> > all times. It's just that some of the preemptions, typically those of
> > the normal scheduling class are somewhat delayed. However RT/DL classes
> > are still insta preempt.
>
> Also, agreed.
>
> > Meaning that if you run anything in the realtime classes you're running
> > a fully preemptible kernel. As such, RCU had better be able to deal with
> > it.
>
> So, RCU can deal with (PREEMPT_RCU=y, PREEMPT_AUTO=y, preempt=none/voluntary/full).
> Since that's basically what PREEMPT_DYNAMIC already works with.
>
> The other combination, (PREEMPT_RCU=n, PREEMPT_AUTO,
> preempt=none/voluntary) would generally be business as usual, except, as
> you say, it is really PREEMPT_RCU=n, preempt=full in disguise.
>
> However, as Paul says __rcu_read_lock(), for PREEMPT_RCU=n is defined as:
>
> static inline void __rcu_read_lock(void)
> {
> preempt_disable();
> }
>
> So, this combination -- though non standard -- should also work.
>
> The reason for adding the warning was because Paul had warned in
> discussions earlier (see here for instance:
> https://lore.kernel.org/lkml/842f589e-5ea3-4c2b-9376-d718c14fabf5@paulmck-laptop/)
>
> that the PREEMPT_FULL=y and PREEMPT_RCU=n is basically useless. But at
> least in my understanding that's primarily a performance concern not a
> correctness concern. But, Paul can probably speak to that more.
>
> "PREEMPT_FULL=y plus PREEMPT_RCU=n appears to be a useless
> combination. All of the gains from PREEMPT_FULL=y are more than lost
> due to PREEMPT_RCU=n, especially when the kernel decides to do something
> like walk a long task list under RCU protection. We should not waste
> people's time getting burned by this combination, nor should we waste
> cycles testing it."

My selfish motivation here is to avoid testing this combination unless
and until someone actually has a good use for it. I do not think that
anyone will ever need it, but perhaps I am suffering from a failure
of imagination. If so, they hit that WARN, complain and explain their
use case, and at that point I start testing it (and fixing whatever bugs
have accumulated in the meantime). But until that time, I save time by
avoiding testing it.

Thanx, Paul

2024-06-01 11:48:24

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling


Shrikanth Hegde <[email protected]> writes:

> On 5/28/24 6:04 AM, Ankur Arora wrote:
>> Hi,
>>
>> This series adds a new scheduling model PREEMPT_AUTO, which like
>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
>> on explicit preemption points for the voluntary models.
>>
>> The series is based on Thomas' original proposal which he outlined
>> in [1], [2] and in his PoC [3].
>>
>> v2 mostly reworks v1, with one of the main changes having less
>> noisy need-resched-lazy related interfaces.
>> More details in the changelog below.
>>
>
> Hi Ankur. Thanks for the series.
>
> nit: had to manually patch 11,12,13 since it didnt apply cleanly on
> tip/master and tip/sched/core. Mostly due some word differences in the change.
>
> tip/master was at:
> commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
> Merge: 5d145493a139 47ff30cc1be7
> Author: Ingo Molnar <[email protected]>
> Date: Tue May 28 12:44:26 2024 +0200
>
> Merge branch into tip/master: 'x86/percpu'
>
>
>
>> The v1 of the series is at [4] and the RFC at [5].
>>
>> Design
>> ==
>>
>> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
>> PREEMPT_COUNT). This means that the scheduler can always safely
>> preempt. (This is identical to CONFIG_PREEMPT.)
>>
>> Having that, the next step is to make the rescheduling policy dependent
>> on the chosen scheduling model. Currently, the scheduler uses a single
>> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a
>> reschedule is needed.
>> PREEMPT_AUTO extends this by adding an additional need-resched bit
>> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the
>> scheduler to express two kinds of rescheduling intent: schedule at
>> the earliest opportunity (TIF_NEED_RESCHED), or express a need for
>> rescheduling while allowing the task on the runqueue to run to
>> timeslice completion (TIF_NEED_RESCHED_LAZY).
>>
>> The scheduler decides which need-resched bits are chosen based on
>> the preemption model in use:
>>
>> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY
>>
>> none never always [*]
>> voluntary higher sched class other tasks [*]
>> full always never
>>
>> [*] some details elided.
>>
>> The last part of the puzzle is, when does preemption happen, or
>> alternately stated, when are the need-resched bits checked:
>>
>> exit-to-user ret-to-kernel preempt_count()
>>
>> NEED_RESCHED_LAZY Y N N
>> NEED_RESCHED Y Y Y
>>
>> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
>> none/voluntary preemption policies are in effect. And eager semantics
>> under full preemption.
>>
>> In addition, since this is driven purely by the scheduler (not
>> depending on cond_resched() placement and the like), there is enough
>> flexibility in the scheduler to cope with edge cases -- ex. a kernel
>> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by
>> simply upgrading to a full NEED_RESCHED which can use more coercive
>> instruments like resched IPI to induce a context-switch.
>>
>> Performance
>> ==
>> The performance in the basic tests (perf bench sched messaging, kernbench,
>> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC.
>> (See patches
>> "sched: support preempt=none under PREEMPT_AUTO"
>> "sched: support preempt=full under PREEMPT_AUTO"
>> "sched: handle preempt=voluntary under PREEMPT_AUTO")
>>
>> For a macro test, a colleague in Oracle's Exadata team tried two
>> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series
>> backported.)
>>
>> In both tests the data was cached on remote nodes (cells), and the
>> database nodes (compute) served client queries, with clients being
>> local in the first test and remote in the second.
>>
>> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
>> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
>>
>>
>> PREEMPT_VOLUNTARY PREEMPT_AUTO
>> (preempt=voluntary)
>> ============================== =============================
>> clients throughput cpu-usage throughput cpu-usage Gain
>> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %)
>> ------- ---------- ----------------- ---------- ----------------- -------
>>
>>
>> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7%
>> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6%
>> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8%
>>
>>
>> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5%
>> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6%
>> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3%
>> 90/10 RW ratio)
>>
>>
>> (Both sets of tests have a fair amount of NW traffic since the query
>> tables etc are cached on the cells. Additionally, the first set,
>> given the local clients, stress the scheduler a bit more than the
>> second.)
>>
>> The comparative performance for both the tests is fairly close,
>> more or less within a margin of error.
>>
>> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):
>>
>> "
>> a) Base kernel (6.7),
>> b) v1, PREEMPT_AUTO, preempt=voluntary
>> c) v1, PREEMPT_DYNAMIC, preempt=voluntary
>> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
>>
>> Workloads I tested and their %gain,
>> case b case c case d
>> NAS +2.7% +1.9% +2.1%
>> Hashjoin, +0.0% +0.0% +0.0%
>> Graph500, -6.0% +0.0% +0.0%
>> XSBench +1.7% +0.0% +1.2%
>>
>> (Note about the Graph500 numbers at [8].)
>>
>> Did kernbench etc test from Mel's mmtests suite also. Did not notice
>> much difference.
>> "
>>
>> One case where there is a significant performance drop is on powerpc,
>> seen running hackbench on a 320 core system (a test on a smaller system is
>> fine.) In theory there's no reason for this to only happen on powerpc
>> since most of the code is common, but I haven't been able to reproduce
>> it on x86 so far.
>>
>> All in all, I think the tests above show that this scheduling model has legs.
>> However, the none/voluntary models under PREEMPT_AUTO are conceptually
>> different enough from the current none/voluntary models that there
>> likely are workloads where performance would be subpar. That needs more
>> extensive testing to figure out the weak points.
>>
>>
>>
> Did test it again on PowerPC. Unfortunately numbers shows there is regression
> still compared to 6.10-rc1. This is done with preempt=none. I tried again on the
> smaller system too to confirm. For now I have done the comparison for the hackbench
> where highest regression was seen in v1.
>
> perf stat collected for 20 iterations show higher context switch and higher migrations.
> Could it be that LAZY bit is causing more context switches? or could it be something
> else? Could it be that more exit-to-user happens in PowerPC? will continue to debug.

Thanks for trying it out.

As you point out, context-switches and migrations are significantly higher.

Definitely unexpected. I ran the same test on an x86 box
(Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.

6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% )
6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% )
6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% )

6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% )
6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% )
6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% )

Clearly there's something different going on on powerpc. I'm travelling
right now, but will dig deeper into this once I get back.

Meanwhile, can you check whether the increased context-switches are voluntary or
involuntary (or what the split is)?


Thanks
Ankur

> Meanwhile, will do more test with other micro-benchmarks and post the results.
>
>
> More details below.
> CONFIG_HZ = 100
> ./hackbench -pipe 60 process 100000 loops
>
> ====================================================================================
> On the larger system. (40 Cores, 320CPUS)
> ====================================================================================
> 6.10-rc1 +preempt_auto
> preempt=none preempt=none
> 20 iterations avg value
> hackbench pipe(60) 26.403 32.368 ( -31.1%)
>
> ++++++++++++++++++
> baseline 6.10-rc1:
> ++++++++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% )
> 6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% )
> 246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% )
> 1,759,232 page-faults # 19.716 /sec ( +- 6.61% )
> 577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% )
> 226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% )
> 37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% )
> 166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% )
>
> 26.403 +- 0.166 seconds time elapsed ( +- 0.63% )
>
> ++++++++++++
> preempt auto
> ++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% )
> 9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% )
> 631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% )
> 1,756,583 page-faults # 16.112 /sec ( +- 6.59% )
> 700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% )
> 254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% )
> 42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% )
> 231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% )
>
> 32.368 +- 0.200 seconds time elapsed ( +- 0.62% )
>
>
> ============================================================================================
> Smaller system ( 12Cores, 96CPUS)
> ============================================================================================
> 6.10-rc1 +preempt_auto
> preempt=none preempt=none
> 20 iterations avg value
> hackbench pipe(60) 55.930 65.75 ( -17.6%)
>
> ++++++++++++++++++
> baseline 6.10-rc1:
> ++++++++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% )
> 1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% )
> 44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% )
> 1,698,710 page-faults # 30.010 /sec ( +- 6.58% )
> 412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% )
> 192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% )
> 30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% )
> 99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% )
>
> 55.930 +- 0.509 seconds time elapsed ( +- 0.91% )
>
>
> +++++++++++++++++
> v2_preempt_auto
> +++++++++++++++++
> Performance counter stats for 'system wide' (20 runs):
> 126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% )
> 2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% )
> 147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% )
> 1,710,637 page-faults # 25.593 /sec ( +- 6.55% )
> 483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% )
> 210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% )
> 33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% )
> 134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% )
>
> 65.75 +- 1.06 seconds time elapsed ( +- 1.61% )

So, the context-switches are meaningfully higher.

--
ankur

2024-06-04 07:33:50

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling



On 6/1/24 5:17 PM, Ankur Arora wrote:
>
> Shrikanth Hegde <[email protected]> writes:
>
>> On 5/28/24 6:04 AM, Ankur Arora wrote:
>>> Hi,
>>>
>>> This series adds a new scheduling model PREEMPT_AUTO, which like
>>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>>> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
>>> on explicit preemption points for the voluntary models.
>>>
>>> The series is based on Thomas' original proposal which he outlined
>>> in [1], [2] and in his PoC [3].
>>>
>>> v2 mostly reworks v1, with one of the main changes having less
>>> noisy need-resched-lazy related interfaces.
>>> More details in the changelog below.
>>>
>>
>> Hi Ankur. Thanks for the series.
>>
>> nit: had to manually patch 11,12,13 since it didnt apply cleanly on
>> tip/master and tip/sched/core. Mostly due some word differences in the change.
>>
>> tip/master was at:
>> commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
>> Merge: 5d145493a139 47ff30cc1be7
>> Author: Ingo Molnar <[email protected]>
>> Date: Tue May 28 12:44:26 2024 +0200
>>
>> Merge branch into tip/master: 'x86/percpu'
>>
>>
>>
>>> The v1 of the series is at [4] and the RFC at [5].
>>>
>>> Design
>>> ==
>>>
>>> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
>>> PREEMPT_COUNT). This means that the scheduler can always safely
>>> preempt. (This is identical to CONFIG_PREEMPT.)
>>>
>>> Having that, the next step is to make the rescheduling policy dependent
>>> on the chosen scheduling model. Currently, the scheduler uses a single
>>> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a
>>> reschedule is needed.
>>> PREEMPT_AUTO extends this by adding an additional need-resched bit
>>> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the
>>> scheduler to express two kinds of rescheduling intent: schedule at
>>> the earliest opportunity (TIF_NEED_RESCHED), or express a need for
>>> rescheduling while allowing the task on the runqueue to run to
>>> timeslice completion (TIF_NEED_RESCHED_LAZY).
>>>
>>> The scheduler decides which need-resched bits are chosen based on
>>> the preemption model in use:
>>>
>>> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY
>>>
>>> none never always [*]
>>> voluntary higher sched class other tasks [*]
>>> full always never
>>>
>>> [*] some details elided.
>>>
>>> The last part of the puzzle is, when does preemption happen, or
>>> alternately stated, when are the need-resched bits checked:
>>>
>>> exit-to-user ret-to-kernel preempt_count()
>>>
>>> NEED_RESCHED_LAZY Y N N
>>> NEED_RESCHED Y Y Y
>>>
>>> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
>>> none/voluntary preemption policies are in effect. And eager semantics
>>> under full preemption.
>>>
>>> In addition, since this is driven purely by the scheduler (not
>>> depending on cond_resched() placement and the like), there is enough
>>> flexibility in the scheduler to cope with edge cases -- ex. a kernel
>>> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by
>>> simply upgrading to a full NEED_RESCHED which can use more coercive
>>> instruments like resched IPI to induce a context-switch.
>>>
>>> Performance
>>> ==
>>> The performance in the basic tests (perf bench sched messaging, kernbench,
>>> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC.
>>> (See patches
>>> "sched: support preempt=none under PREEMPT_AUTO"
>>> "sched: support preempt=full under PREEMPT_AUTO"
>>> "sched: handle preempt=voluntary under PREEMPT_AUTO")
>>>
>>> For a macro test, a colleague in Oracle's Exadata team tried two
>>> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series
>>> backported.)
>>>
>>> In both tests the data was cached on remote nodes (cells), and the
>>> database nodes (compute) served client queries, with clients being
>>> local in the first test and remote in the second.
>>>
>>> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
>>> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
>>>
>>>
>>> PREEMPT_VOLUNTARY PREEMPT_AUTO
>>> (preempt=voluntary)
>>> ============================== =============================
>>> clients throughput cpu-usage throughput cpu-usage Gain
>>> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %)
>>> ------- ---------- ----------------- ---------- ----------------- -------
>>>
>>>
>>> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7%
>>> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6%
>>> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8%
>>>
>>>
>>> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5%
>>> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6%
>>> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3%
>>> 90/10 RW ratio)
>>>
>>>
>>> (Both sets of tests have a fair amount of NW traffic since the query
>>> tables etc are cached on the cells. Additionally, the first set,
>>> given the local clients, stress the scheduler a bit more than the
>>> second.)
>>>
>>> The comparative performance for both the tests is fairly close,
>>> more or less within a margin of error.
>>>
>>> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):
>>>
>>> "
>>> a) Base kernel (6.7),
>>> b) v1, PREEMPT_AUTO, preempt=voluntary
>>> c) v1, PREEMPT_DYNAMIC, preempt=voluntary
>>> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
>>>
>>> Workloads I tested and their %gain,
>>> case b case c case d
>>> NAS +2.7% +1.9% +2.1%
>>> Hashjoin, +0.0% +0.0% +0.0%
>>> Graph500, -6.0% +0.0% +0.0%
>>> XSBench +1.7% +0.0% +1.2%
>>>
>>> (Note about the Graph500 numbers at [8].)
>>>
>>> Did kernbench etc test from Mel's mmtests suite also. Did not notice
>>> much difference.
>>> "
>>>
>>> One case where there is a significant performance drop is on powerpc,
>>> seen running hackbench on a 320 core system (a test on a smaller system is
>>> fine.) In theory there's no reason for this to only happen on powerpc
>>> since most of the code is common, but I haven't been able to reproduce
>>> it on x86 so far.
>>>
>>> All in all, I think the tests above show that this scheduling model has legs.
>>> However, the none/voluntary models under PREEMPT_AUTO are conceptually
>>> different enough from the current none/voluntary models that there
>>> likely are workloads where performance would be subpar. That needs more
>>> extensive testing to figure out the weak points.
>>>
>>>
>>>
>> Did test it again on PowerPC. Unfortunately numbers shows there is regression
>> still compared to 6.10-rc1. This is done with preempt=none. I tried again on the
>> smaller system too to confirm. For now I have done the comparison for the hackbench
>> where highest regression was seen in v1.
>>
>> perf stat collected for 20 iterations show higher context switch and higher migrations.
>> Could it be that LAZY bit is causing more context switches? or could it be something
>> else? Could it be that more exit-to-user happens in PowerPC? will continue to debug.
>
> Thanks for trying it out.
>
> As you point out, context-switches and migrations are signficantly higher.
>
> Definitely unexpected. I ran the same test on an x86 box
> (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.
>
> 6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% )
> 6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% )
> 6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% )
>
> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% )
> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% )
> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% )
>
> Clearly there's something different going on powerpc. I'm travelling
> right now, but will dig deeper into this once I get back.
>
> Meanwhile can you check if the increased context-switches are voluntary or
> involuntary (or what the division is)?


Used "pidstat -w -p ALL 1 10" to capture 10 seconds data at 1 second interval for
context switches per second while running "hackbench -pipe 60 process 100000 loops"


(preempt=none)                       6.10            preempt_auto
=============================================================================
voluntary context switches           7632166.19      9391636.34  (+23%)
involuntary context switches         2305544.07      3527293.94  (+53%)

Numbers vary between runs, but the trend seems similar: both kinds of context switches
increase, with involuntary switches increasing at a higher rate.


BTW, ran Unixbench as well; it shows a slight regression. stress-ng numbers didn't seem conclusive.
schbench (old) showed slightly lower latency when the number of threads was low, and higher tail
latency at higher thread counts, but those numbers don't seem very convincing either.
All of these were done under preempt=none on both 6.10 and preempt_auto.


Unixbench                                  6.10        preempt_auto (% change)
=====================================================================
1 X Execl Throughput : 5345.70, 5109.68(-4.42)
4 X Execl Throughput : 15610.54, 15087.92(-3.35)
1 X Pipe-based Context Switching : 183172.30, 177069.52(-3.33)
4 X Pipe-based Context Switching : 615471.66, 602773.74(-2.06)
1 X Process Creation : 10778.92, 10443.76(-3.11)
4 X Process Creation : 24327.06, 25150.42(+3.38)
1 X Shell Scripts (1 concurrent) : 10416.76, 10222.28(-1.87)
4 X Shell Scripts (1 concurrent) : 36051.00, 35206.90(-2.34)
1 X Shell Scripts (8 concurrent) : 5004.22, 4907.32(-1.94)
4 X Shell Scripts (8 concurrent) : 12676.08, 12418.18(-2.03)


>
>
> Thanks
> Ankur
>
>> Meanwhile, will do more test with other micro-benchmarks and post the results.
>>
>>
>> More details below.
>> CONFIG_HZ = 100
>> ./hackbench -pipe 60 process 100000 loops
>>
>> ====================================================================================
>> On the larger system. (40 Cores, 320CPUS)
>> ====================================================================================
>> 6.10-rc1 +preempt_auto
>> preempt=none preempt=none
>> 20 iterations avg value
>> hackbench pipe(60) 26.403 32.368 ( -31.1%)
>>
>> ++++++++++++++++++
>> baseline 6.10-rc1:
>> ++++++++++++++++++
>> Performance counter stats for 'system wide' (20 runs):
>> 168,980,939.76 msec cpu-clock # 6400.026 CPUs utilized ( +- 6.59% )
>> 6,299,247,371 context-switches # 70.596 K/sec ( +- 6.60% )
>> 246,646,236 cpu-migrations # 2.764 K/sec ( +- 6.57% )
>> 1,759,232 page-faults # 19.716 /sec ( +- 6.61% )
>> 577,719,907,794,874 cycles # 6.475 GHz ( +- 6.60% )
>> 226,392,778,622,410 instructions # 0.74 insn per cycle ( +- 6.61% )
>> 37,280,192,946,445 branches # 417.801 M/sec ( +- 6.61% )
>> 166,456,311,053 branch-misses # 0.85% of all branches ( +- 6.60% )
>>
>> 26.403 +- 0.166 seconds time elapsed ( +- 0.63% )
>>
>> ++++++++++++
>> preempt auto
>> ++++++++++++
>> Performance counter stats for 'system wide' (20 runs):
>> 207,154,235.95 msec cpu-clock # 6400.009 CPUs utilized ( +- 6.64% )
>> 9,337,462,696 context-switches # 85.645 K/sec ( +- 6.68% )
>> 631,276,554 cpu-migrations # 5.790 K/sec ( +- 6.79% )
>> 1,756,583 page-faults # 16.112 /sec ( +- 6.59% )
>> 700,281,729,230,103 cycles # 6.423 GHz ( +- 6.64% )
>> 254,713,123,656,485 instructions # 0.69 insn per cycle ( +- 6.63% )
>> 42,275,061,484,512 branches # 387.756 M/sec ( +- 6.63% )
>> 231,944,216,106 branch-misses # 1.04% of all branches ( +- 6.64% )
>>
>> 32.368 +- 0.200 seconds time elapsed ( +- 0.62% )
>>
>>
>> ============================================================================================
>> Smaller system ( 12Cores, 96CPUS)
>> ============================================================================================
>> 6.10-rc1 +preempt_auto
>> preempt=none preempt=none
>> 20 iterations avg value
>> hackbench pipe(60) 55.930 65.75 ( -17.6%)
>>
>> ++++++++++++++++++
>> baseline 6.10-rc1:
>> ++++++++++++++++++
>> Performance counter stats for 'system wide' (20 runs):
>> 107,386,299.19 msec cpu-clock # 1920.003 CPUs utilized ( +- 6.55% )
>> 1,388,830,542 context-switches # 24.536 K/sec ( +- 6.19% )
>> 44,538,641 cpu-migrations # 786.840 /sec ( +- 6.23% )
>> 1,698,710 page-faults # 30.010 /sec ( +- 6.58% )
>> 412,401,110,929,055 cycles # 7.286 GHz ( +- 6.54% )
>> 192,380,094,075,743 instructions # 0.88 insn per cycle ( +- 6.59% )
>> 30,328,724,557,878 branches # 535.801 M/sec ( +- 6.58% )
>> 99,642,840,901 branch-misses # 0.63% of all branches ( +- 6.57% )
>>
>> 55.930 +- 0.509 seconds time elapsed ( +- 0.91% )
>>
>>
>> +++++++++++++++++
>> v2_preempt_auto
>> +++++++++++++++++
>> Performance counter stats for 'system wide' (20 runs):
>> 126,244,029.04 msec cpu-clock # 1920.005 CPUs utilized ( +- 6.51% )
>> 2,563,720,294 context-switches # 38.356 K/sec ( +- 6.10% )
>> 147,445,392 cpu-migrations # 2.206 K/sec ( +- 6.37% )
>> 1,710,637 page-faults # 25.593 /sec ( +- 6.55% )
>> 483,419,889,144,017 cycles # 7.232 GHz ( +- 6.51% )
>> 210,788,030,476,548 instructions # 0.82 insn per cycle ( +- 6.57% )
>> 33,851,562,301,187 branches # 506.454 M/sec ( +- 6.56% )
>> 134,059,721,699 branch-misses # 0.75% of all branches ( +- 6.45% )
>>
>> 65.75 +- 1.06 seconds time elapsed ( +- 1.61% )
>
> So, the context-switches are meaningfully higher.
>
> --
> ankur

2024-06-05 15:45:00

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling

On Mon, May 27, 2024, Ankur Arora wrote:
> Patches 1,2
> "sched/core: Move preempt_model_*() helpers from sched.h to preempt.h"
> "sched/core: Drop spinlocks on contention iff kernel is preemptible"
> condition spin_needbreak() on the dynamic preempt_model_*().

...

> Not really required but a useful bugfix for PREEMPT_DYNAMIC and PREEMPT_AUTO.
> Sean Christopherson (2):
> sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
> sched/core: Drop spinlocks on contention iff kernel is preemptible

Peter and/or Thomas, would it be possible to get these applied to tip-tree sooner
than later? They fix a real bug that affects KVM to varying degrees.

2024-06-05 17:45:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling

On Wed, Jun 05, 2024 at 08:44:50AM -0700, Sean Christopherson wrote:
> On Mon, May 27, 2024, Ankur Arora wrote:
> > Patches 1,2
> > "sched/core: Move preempt_model_*() helpers from sched.h to preempt.h"
> > "sched/core: Drop spinlocks on contention iff kernel is preemptible"
> > condition spin_needbreak() on the dynamic preempt_model_*().
>
> ...
>
> > Not really required but a useful bugfix for PREEMPT_DYNAMIC and PREEMPT_AUTO.
> > Sean Christopherson (2):
> > sched/core: Move preempt_model_*() helpers from sched.h to preempt.h
> > sched/core: Drop spinlocks on contention iff kernel is preemptible
>
> Peter and/or Thomas, would it be possible to get these applied to tip-tree sooner
> than later? They fix a real bug that affects KVM to varying degrees.

It so happens I've queued them for sched/core earlier today (see
queue/sched/core). If the robot comes back happy, I'll push them into
tip.

Thanks!

2024-06-06 11:51:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO

On Thu, May 30, 2024 at 02:29:45AM -0700, Ankur Arora wrote:
>
> Peter Zijlstra <[email protected]> writes:
>
> > On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
> >> Reuse sched_dynamic_update() and related logic to enable choosing
> >> the preemption model at boot or runtime for PREEMPT_AUTO.
> >>
> >> The interface is identical to PREEMPT_DYNAMIC.
> >
> > Colour me confused, why?!? What are you doing and why aren't just just
> > adding AUTO to the existing DYNAMIC thing?
>
> You mean have a single __sched_dynamic_update()? AUTO doesn't use any
> of the static_call/static_key stuff so I'm not sure how that would work.

*sigh*... see the below, seems to work.

---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/thread_info.h | 6 +-
include/linux/entry-common.h | 3 +-
include/linux/entry-kvm.h | 5 +-
include/linux/sched.h | 10 +++-
include/linux/thread_info.h | 21 +++++--
kernel/Kconfig.preempt | 11 ++++
kernel/entry/common.c | 2 +-
kernel/entry/kvm.c | 4 +-
kernel/sched/core.c | 110 ++++++++++++++++++++++++++++++++-----
kernel/sched/debug.c | 2 +-
kernel/sched/fair.c | 4 +-
kernel/sched/sched.h | 1 +
13 files changed, 148 insertions(+), 32 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e8837116704ce..61f86b69524d7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -91,6 +91,7 @@ config X86
select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PMEM_API if X86_64
+ select ARCH_HAS_PREEMPT_LAZY
select ARCH_HAS_PTE_DEVMAP if X86_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_HW_PTE_YOUNG
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 12da7dfd5ef13..75bb390f7baf5 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -87,8 +87,9 @@ struct thread_info {
#define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
#define TIF_SIGPENDING 2 /* signal pending */
#define TIF_NEED_RESCHED 3 /* rescheduling necessary */
-#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
-#define TIF_SSBD 5 /* Speculative store bypass disable */
+#define TIF_NEED_RESCHED_LAZY 4 /* rescheduling necessary */
+#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
+#define TIF_SSBD 6 /* Speculative store bypass disable */
#define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
#define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
#define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
@@ -110,6 +111,7 @@ struct thread_info {
#define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
#define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
#define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
+#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
#define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
#define _TIF_SSBD (1 << TIF_SSBD)
#define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index b0fb775a600d9..e66c8a7c113f4 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -64,7 +64,8 @@

#define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
- _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
+ _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \
+ _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
ARCH_EXIT_TO_USER_MODE_WORK)

/**
diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
index 6813171afccb2..16149f6625e48 100644
--- a/include/linux/entry-kvm.h
+++ b/include/linux/entry-kvm.h
@@ -17,8 +17,9 @@
#endif

#define XFER_TO_GUEST_MODE_WORK \
- (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
- _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
+ (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | _TIF_SIGPENDING | \
+ _TIF_NOTIFY_SIGNAL | _TIF_NOTIFY_RESUME | \
+ ARCH_XFER_TO_GUEST_MODE_WORK)

struct kvm_vcpu;

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7635045b2395c..5900d84e08b3c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1968,7 +1968,8 @@ static inline void set_tsk_need_resched(struct task_struct *tsk)

static inline void clear_tsk_need_resched(struct task_struct *tsk)
{
- clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
+ atomic_long_andnot(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY,
+ (atomic_long_t *)&task_thread_info(tsk)->flags);
}

static inline int test_tsk_need_resched(struct task_struct *tsk)
@@ -2074,6 +2075,7 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
extern bool preempt_model_none(void);
extern bool preempt_model_voluntary(void);
extern bool preempt_model_full(void);
+extern bool preempt_model_lazy(void);

#else

@@ -2089,6 +2091,10 @@ static inline bool preempt_model_full(void)
{
return IS_ENABLED(CONFIG_PREEMPT);
}
+static inline bool preempt_model_lazy(void)
+{
+ return IS_ENABLED(CONFIG_PREEMPT_LAZY);
+}

#endif

@@ -2107,7 +2113,7 @@ static inline bool preempt_model_rt(void)
*/
static inline bool preempt_model_preemptible(void)
{
- return preempt_model_full() || preempt_model_rt();
+ return preempt_model_full() || preempt_model_lazy() || preempt_model_rt();
}

static __always_inline bool need_resched(void)
diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index 9ea0b28068f49..cf2446c9c30d4 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -59,6 +59,14 @@ enum syscall_work_bit {

#include <asm/thread_info.h>

+#ifndef TIF_NEED_RESCHED_LAZY
+#ifdef CONFIG_ARCH_HAS_PREEMPT_LAZY
+#error Inconsistent PREEMPT_LAZY
+#endif
+#define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
+#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
+#endif
+
#ifdef __KERNEL__

#ifndef arch_set_restart_data
@@ -179,22 +187,27 @@ static __always_inline unsigned long read_ti_thread_flags(struct thread_info *ti

#ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H

-static __always_inline bool tif_need_resched(void)
+static __always_inline bool tif_test_bit(int bit)
{
- return arch_test_bit(TIF_NEED_RESCHED,
+ return arch_test_bit(bit,
(unsigned long *)(&current_thread_info()->flags));
}

#else

-static __always_inline bool tif_need_resched(void)
+static __always_inline bool tif_test_bit(int bit)
{
- return test_bit(TIF_NEED_RESCHED,
+ return test_bit(bit,
(unsigned long *)(&current_thread_info()->flags));
}

#endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */

+static __always_inline bool tif_need_resched(void)
+{
+ return tif_test_bit(TIF_NEED_RESCHED);
+}
+
#ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
static inline int arch_within_stack_frames(const void * const stack,
const void * const stackend,
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index c2f1fd95a8214..1a2e3849e3e5f 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -11,6 +11,9 @@ config PREEMPT_BUILD
select PREEMPTION
select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK

+config ARCH_HAS_PREEMPT_LAZY
+ bool
+
choice
prompt "Preemption Model"
default PREEMPT_NONE
@@ -67,6 +70,14 @@ config PREEMPT
embedded system with latency requirements in the milliseconds
range.

+config PREEMPT_LAZY
+ bool "Scheduler controlled preemption model"
+ depends on !ARCH_NO_PREEMPT
+ depends on ARCH_HAS_PREEMPT_LAZY
+ select PREEMPT_BUILD
+ help
+ Hamsters in your brain...
+
config PREEMPT_RT
bool "Fully Preemptible Kernel (Real-Time)"
depends on EXPERT && ARCH_SUPPORTS_RT
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 90843cc385880..bcb23c866425e 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -98,7 +98,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,

local_irq_enable_exit_to_user(ti_work);

- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();

if (ti_work & _TIF_UPROBE)
diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
index 2e0f75bcb7fd1..8485f63863afc 100644
--- a/kernel/entry/kvm.c
+++ b/kernel/entry/kvm.c
@@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
return -EINTR;
}

- if (ti_work & _TIF_NEED_RESCHED)
+ if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
schedule();

if (ti_work & _TIF_NOTIFY_RESUME)
@@ -24,7 +24,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
return ret;

ti_work = read_thread_flags();
- } while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched());
+ } while (ti_work & XFER_TO_GUEST_MODE_WORK);
return 0;
}

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 965e6464e68e9..c32de809283cf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -904,10 +904,9 @@ static inline void hrtick_rq_init(struct rq *rq)
* this avoids any races wrt polling state changes and thereby avoids
* spurious IPIs.
*/
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
{
- struct thread_info *ti = task_thread_info(p);
- return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
+ return !(fetch_or(&ti->flags, 1 << tif) & _TIF_POLLING_NRFLAG);
}

/*
@@ -932,9 +931,9 @@ static bool set_nr_if_polling(struct task_struct *p)
}

#else
-static inline bool set_nr_and_not_polling(struct task_struct *p)
+static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
{
- set_tsk_need_resched(p);
+ atomic_long_or(1 << tif, (atomic_long_t *)&ti->flags);
return true;
}

@@ -1039,28 +1038,66 @@ void wake_up_q(struct wake_q_head *head)
* might also involve a cross-CPU call to trigger the scheduler on
* the target CPU.
*/
-void resched_curr(struct rq *rq)
+static void __resched_curr(struct rq *rq, int tif)
{
struct task_struct *curr = rq->curr;
+ struct thread_info *cti = task_thread_info(curr);
int cpu;

lockdep_assert_rq_held(rq);

- if (test_tsk_need_resched(curr))
+ if (is_idle_task(curr) && tif == TIF_NEED_RESCHED_LAZY)
+ tif = TIF_NEED_RESCHED;
+
+ if (cti->flags & ((1 << tif) | _TIF_NEED_RESCHED))
return;

cpu = cpu_of(rq);

if (cpu == smp_processor_id()) {
- set_tsk_need_resched(curr);
- set_preempt_need_resched();
+ set_ti_thread_flag(cti, tif);
+ if (tif == TIF_NEED_RESCHED)
+ set_preempt_need_resched();
return;
}

- if (set_nr_and_not_polling(curr))
- smp_send_reschedule(cpu);
- else
+ if (set_nr_and_not_polling(cti, tif)) {
+ if (tif == TIF_NEED_RESCHED)
+ smp_send_reschedule(cpu);
+ } else {
trace_sched_wake_idle_without_ipi(cpu);
+ }
+}
+
+void resched_curr(struct rq *rq)
+{
+ __resched_curr(rq, TIF_NEED_RESCHED);
+}
+
+#ifdef CONFIG_PREEMPT_DYNAMIC
+static DEFINE_STATIC_KEY_FALSE(sk_dynamic_preempt_lazy);
+static __always_inline bool dynamic_preempt_lazy(void)
+{
+ return static_branch_unlikely(&sk_dynamic_preempt_lazy);
+}
+#else
+static __always_inline bool dynamic_preempt_lazy(void)
+{
+ return IS_ENABLED(PREEMPT_LAZY);
+}
+#endif
+
+static __always_inline int tif_need_resched_lazy(void)
+{
+ if (dynamic_preempt_lazy())
+ return TIF_NEED_RESCHED_LAZY;
+
+ return TIF_NEED_RESCHED;
+}
+
+void resched_curr_lazy(struct rq *rq)
+{
+ __resched_curr(rq, tif_need_resched_lazy());
}

void resched_cpu(int cpu)
@@ -1155,7 +1192,7 @@ static void wake_up_idle_cpu(int cpu)
* and testing of the above solutions didn't appear to report
* much benefits.
*/
- if (set_nr_and_not_polling(rq->idle))
+ if (set_nr_and_not_polling(task_thread_info(rq->idle), TIF_NEED_RESCHED))
smp_send_reschedule(cpu);
else
trace_sched_wake_idle_without_ipi(cpu);
@@ -5537,6 +5574,10 @@ void sched_tick(void)
update_rq_clock(rq);
hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
+
+ if (dynamic_preempt_lazy() && tif_test_bit(TIF_NEED_RESCHED_LAZY))
+ resched_curr(rq);
+
curr->sched_class->task_tick(rq, curr, 0);
if (sched_feat(LATENCY_WARN))
resched_latency = cpu_resched_latency(rq);
@@ -7245,6 +7286,7 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* preempt_schedule <- NOP
* preempt_schedule_notrace <- NOP
* irqentry_exit_cond_resched <- NOP
+ * dynamic_preempt_lazy <- false
*
* VOLUNTARY:
* cond_resched <- __cond_resched
@@ -7252,6 +7294,7 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* preempt_schedule <- NOP
* preempt_schedule_notrace <- NOP
* irqentry_exit_cond_resched <- NOP
+ * dynamic_preempt_lazy <- false
*
* FULL:
* cond_resched <- RET0
@@ -7259,6 +7302,15 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
* preempt_schedule <- preempt_schedule
* preempt_schedule_notrace <- preempt_schedule_notrace
* irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ * dynamic_preempt_lazy <- false
+ *
+ * LAZY:
+ * cond_resched <- RET0
+ * might_resched <- RET0
+ * preempt_schedule <- preempt_schedule
+ * preempt_schedule_notrace <- preempt_schedule_notrace
+ * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
+ * dynamic_preempt_lazy <- true
*/

enum {
@@ -7266,6 +7318,7 @@ enum {
preempt_dynamic_none,
preempt_dynamic_voluntary,
preempt_dynamic_full,
+ preempt_dynamic_lazy,
};

int preempt_dynamic_mode = preempt_dynamic_undefined;
@@ -7281,15 +7334,23 @@ int sched_dynamic_mode(const char *str)
if (!strcmp(str, "full"))
return preempt_dynamic_full;

+#ifdef CONFIG_ARCH_HAS_PREEMPT_LAZY
+ if (!strcmp(str, "lazy"))
+ return preempt_dynamic_lazy;
+#endif
+
return -EINVAL;
}

+#define preempt_dynamic_key_enable(f) static_key_enable(&sk_dynamic_##f.key)
+#define preempt_dynamic_key_disable(f) static_key_disable(&sk_dynamic_##f.key)
+
#if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
#define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
#define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
#elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
-#define preempt_dynamic_enable(f) static_key_enable(&sk_dynamic_##f.key)
-#define preempt_dynamic_disable(f) static_key_disable(&sk_dynamic_##f.key)
+#define preempt_dynamic_enable(f) preempt_dynamic_key_enable(f)
+#define preempt_dynamic_disable(f) preempt_dynamic_key_disable(f)
#else
#error "Unsupported PREEMPT_DYNAMIC mechanism"
#endif
@@ -7309,6 +7370,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched);
+ preempt_dynamic_key_disable(preempt_lazy);

switch (mode) {
case preempt_dynamic_none:
@@ -7318,6 +7380,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_disable(preempt_schedule);
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
+ preempt_dynamic_key_disable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: none\n");
break;
@@ -7329,6 +7392,7 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_disable(preempt_schedule);
preempt_dynamic_disable(preempt_schedule_notrace);
preempt_dynamic_disable(irqentry_exit_cond_resched);
+ preempt_dynamic_key_disable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: voluntary\n");
break;
@@ -7340,9 +7404,22 @@ static void __sched_dynamic_update(int mode)
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched);
+ preempt_dynamic_key_disable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: full\n");
break;
+
+ case preempt_dynamic_lazy:
+ if (!klp_override)
+ preempt_dynamic_disable(cond_resched);
+ preempt_dynamic_disable(might_resched);
+ preempt_dynamic_enable(preempt_schedule);
+ preempt_dynamic_enable(preempt_schedule_notrace);
+ preempt_dynamic_enable(irqentry_exit_cond_resched);
+ preempt_dynamic_key_enable(preempt_lazy);
+ if (mode != preempt_dynamic_mode)
+ pr_info("Dynamic Preempt: lazy\n");
+ break;
}

preempt_dynamic_mode = mode;
@@ -7405,6 +7482,8 @@ static void __init preempt_dynamic_init(void)
sched_dynamic_update(preempt_dynamic_none);
} else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
sched_dynamic_update(preempt_dynamic_voluntary);
+ } else if (IS_ENABLED(CONFIG_PREEMPT_LAZY)) {
+ sched_dynamic_update(preempt_dynamic_lazy);
} else {
/* Default static call setting, nothing to do */
WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
@@ -7425,6 +7504,7 @@ static void __init preempt_dynamic_init(void)
PREEMPT_MODEL_ACCESSOR(none);
PREEMPT_MODEL_ACCESSOR(voluntary);
PREEMPT_MODEL_ACCESSOR(full);
+PREEMPT_MODEL_ACCESSOR(lazy);

#else /* !CONFIG_PREEMPT_DYNAMIC: */

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1bc24410ae501..87309cf247c68 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -245,7 +245,7 @@ static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
static int sched_dynamic_show(struct seq_file *m, void *v)
{
static const char * preempt_modes[] = {
- "none", "voluntary", "full"
+ "none", "voluntary", "full", "lazy",
};
int i;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b5d50dbc79dc..71b4112cadde0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1007,7 +1007,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
* The task has consumed its request, reschedule.
*/
if (cfs_rq->nr_running > 1) {
- resched_curr(rq_of(cfs_rq));
+ resched_curr_lazy(rq_of(cfs_rq));
clear_buddies(cfs_rq, se);
}
}
@@ -8615,7 +8615,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
return;

preempt:
- resched_curr(rq);
+ resched_curr_lazy(rq);
}

static struct task_struct *pick_task_fair(struct rq *rq)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 041d8e00a1568..48a4617a5b28b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2494,6 +2494,7 @@ extern void init_sched_fair_class(void);
extern void reweight_task(struct task_struct *p, int prio);

extern void resched_curr(struct rq *rq);
+extern void resched_curr_lazy(struct rq *rq);
extern void resched_cpu(int cpu);

extern struct rt_bandwidth def_rt_bandwidth;
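
For context on the interface being reused here: PREEMPT_DYNAMIC already exposes the
preemption model via a debugfs knob and the preempt= boot parameter, and with the patch
above "lazy" would presumably become a fourth selectable mode. A rough sketch of the
runtime usage, assuming debugfs is mounted at /sys/kernel/debug (the output shown is
illustrative):

# list the available models; the active one is shown in parentheses
cat /sys/kernel/debug/sched/preempt
none voluntary (full) lazy

# switch to the lazy model at runtime, or boot with preempt=lazy
echo lazy > /sys/kernel/debug/sched/preempt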

2024-06-06 11:53:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full

On Thu, May 30, 2024 at 04:20:26PM -0700, Paul E. McKenney wrote:

> My selfish motivation here is to avoid testing this combination unless
> and until someone actually has a good use for it.

That doesn't make sense, the whole LAZY thing is fundamentally identical
to FULL, except it sometimes delays the preemption a wee bit. But all
the preemption scenarios from FULL are possible.

As such, it makes far more sense to only test FULL.

2024-06-06 13:39:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 16/35] preempt,rcu: warn on PREEMPT_RCU=n, preempt=full

On Thu, Jun 06, 2024 at 01:53:25PM +0200, Peter Zijlstra wrote:
> On Thu, May 30, 2024 at 04:20:26PM -0700, Paul E. McKenney wrote:
>
> > My selfish motivation here is to avoid testing this combination unless
> > and until someone actually has a good use for it.
>
> That doesn't make sense, the whole LAZY thing is fundamentally identical
> to FULL, except it sometimes delays the preemption a wee bit. But all
> the preemption scenarios from FULL are possible.

As noted earlier in this thread, this is not the case for non-preemptible
RCU, which disables preemption across its read-side critical sections.
In addition, from a performance/throughput viewpoint, it is not just
the possibility of preemption that matters, but also the probability.

> As such, it makes far more sense to only test FULL.

You have considerable work left to do in order to convince me of this one.

Thanx, Paul

2024-06-06 15:33:10

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO


Peter Zijlstra <[email protected]> writes:

> On Thu, May 30, 2024 at 02:29:45AM -0700, Ankur Arora wrote:
>>
>> Peter Zijlstra <[email protected]> writes:
>>
>> > On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
>> >> Reuse sched_dynamic_update() and related logic to enable choosing
>> >> the preemption model at boot or runtime for PREEMPT_AUTO.
>> >>
>> >> The interface is identical to PREEMPT_DYNAMIC.
>> >
>> > Colour me confused, why?!? What are you doing and why aren't just just
>> > adding AUTO to the existing DYNAMIC thing?
>>
>> You mean have a single __sched_dynamic_update()? AUTO doesn't use any
>> of the static_call/static_key stuff so I'm not sure how that would work.
>
> *sigh*... see the below, seems to work.

Sorry, didn't mean for you to have to do all that work to prove the
point.

I phrased it badly. I do understand how lazy can be folded in as
you do here:

> + case preempt_dynamic_lazy:
> + if (!klp_override)
> + preempt_dynamic_disable(cond_resched);
> + preempt_dynamic_disable(might_resched);
> + preempt_dynamic_enable(preempt_schedule);
> + preempt_dynamic_enable(preempt_schedule_notrace);
> + preempt_dynamic_enable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_enable(preempt_lazy);
> + if (mode != preempt_dynamic_mode)
> + pr_info("Dynamic Preempt: lazy\n");
> + break;
> }

But, if the long term goal (at least as I understand it) is to get rid
of cond_resched() -- to allow optimizations that needing to call cond_resched()
makes impossible -- does it make sense to pull all of these together?

Say, eventually preempt_dynamic_lazy and preempt_dynamic_full are the
only two models left. Then we will have (modulo figuring out how to
switch over klp from cond_resched() to a different unwinding technique):

static void __sched_dynamic_update(int mode)
{
preempt_dynamic_enable(preempt_schedule);
preempt_dynamic_enable(preempt_schedule_notrace);
preempt_dynamic_enable(irqentry_exit_cond_resched);

switch (mode) {
case preempt_dynamic_full:
preempt_dynamic_key_disable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("%s: full\n", PREEMPT_MODE);
break;

case preempt_dynamic_lazy:
preempt_dynamic_key_enable(preempt_lazy);
if (mode != preempt_dynamic_mode)
pr_info("Dynamic Preempt: lazy\n");
break;
}

preempt_dynamic_mode = mode;
}

Which is pretty similar to what the PREEMPT_AUTO code was doing.

Thanks
Ankur

> ---
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/thread_info.h | 6 +-
> include/linux/entry-common.h | 3 +-
> include/linux/entry-kvm.h | 5 +-
> include/linux/sched.h | 10 +++-
> include/linux/thread_info.h | 21 +++++--
> kernel/Kconfig.preempt | 11 ++++
> kernel/entry/common.c | 2 +-
> kernel/entry/kvm.c | 4 +-
> kernel/sched/core.c | 110 ++++++++++++++++++++++++++++++++-----
> kernel/sched/debug.c | 2 +-
> kernel/sched/fair.c | 4 +-
> kernel/sched/sched.h | 1 +
> 13 files changed, 148 insertions(+), 32 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index e8837116704ce..61f86b69524d7 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -91,6 +91,7 @@ config X86
> select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
> select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
> select ARCH_HAS_PMEM_API if X86_64
> + select ARCH_HAS_PREEMPT_LAZY
> select ARCH_HAS_PTE_DEVMAP if X86_64
> select ARCH_HAS_PTE_SPECIAL
> select ARCH_HAS_HW_PTE_YOUNG
> diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
> index 12da7dfd5ef13..75bb390f7baf5 100644
> --- a/arch/x86/include/asm/thread_info.h
> +++ b/arch/x86/include/asm/thread_info.h
> @@ -87,8 +87,9 @@ struct thread_info {
> #define TIF_NOTIFY_RESUME 1 /* callback before returning to user */
> #define TIF_SIGPENDING 2 /* signal pending */
> #define TIF_NEED_RESCHED 3 /* rescheduling necessary */
> -#define TIF_SINGLESTEP 4 /* reenable singlestep on user return*/
> -#define TIF_SSBD 5 /* Speculative store bypass disable */
> +#define TIF_NEED_RESCHED_LAZY 4 /* rescheduling necessary */
> +#define TIF_SINGLESTEP 5 /* reenable singlestep on user return*/
> +#define TIF_SSBD 6 /* Speculative store bypass disable */
> #define TIF_SPEC_IB 9 /* Indirect branch speculation mitigation */
> #define TIF_SPEC_L1D_FLUSH 10 /* Flush L1D on mm switches (processes) */
> #define TIF_USER_RETURN_NOTIFY 11 /* notify kernel of userspace return */
> @@ -110,6 +111,7 @@ struct thread_info {
> #define _TIF_NOTIFY_RESUME (1 << TIF_NOTIFY_RESUME)
> #define _TIF_SIGPENDING (1 << TIF_SIGPENDING)
> #define _TIF_NEED_RESCHED (1 << TIF_NEED_RESCHED)
> +#define _TIF_NEED_RESCHED_LAZY (1 << TIF_NEED_RESCHED_LAZY)
> #define _TIF_SINGLESTEP (1 << TIF_SINGLESTEP)
> #define _TIF_SSBD (1 << TIF_SSBD)
> #define _TIF_SPEC_IB (1 << TIF_SPEC_IB)
> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
> index b0fb775a600d9..e66c8a7c113f4 100644
> --- a/include/linux/entry-common.h
> +++ b/include/linux/entry-common.h
> @@ -64,7 +64,8 @@
>
> #define EXIT_TO_USER_MODE_WORK \
> (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
> - _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
> + _TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | \
> + _TIF_PATCH_PENDING | _TIF_NOTIFY_SIGNAL | \
> ARCH_EXIT_TO_USER_MODE_WORK)
>
> /**
> diff --git a/include/linux/entry-kvm.h b/include/linux/entry-kvm.h
> index 6813171afccb2..16149f6625e48 100644
> --- a/include/linux/entry-kvm.h
> +++ b/include/linux/entry-kvm.h
> @@ -17,8 +17,9 @@
> #endif
>
> #define XFER_TO_GUEST_MODE_WORK \
> - (_TIF_NEED_RESCHED | _TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL | \
> - _TIF_NOTIFY_RESUME | ARCH_XFER_TO_GUEST_MODE_WORK)
> + (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY | _TIF_SIGPENDING | \
> + _TIF_NOTIFY_SIGNAL | _TIF_NOTIFY_RESUME | \
> + ARCH_XFER_TO_GUEST_MODE_WORK)
>
> struct kvm_vcpu;
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 7635045b2395c..5900d84e08b3c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1968,7 +1968,8 @@ static inline void set_tsk_need_resched(struct task_struct *tsk)
>
> static inline void clear_tsk_need_resched(struct task_struct *tsk)
> {
> - clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
> + atomic_long_andnot(_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY,
> + (atomic_long_t *)&task_thread_info(tsk)->flags);
> }
>
> static inline int test_tsk_need_resched(struct task_struct *tsk)
> @@ -2074,6 +2075,7 @@ extern int __cond_resched_rwlock_write(rwlock_t *lock);
> extern bool preempt_model_none(void);
> extern bool preempt_model_voluntary(void);
> extern bool preempt_model_full(void);
> +extern bool preempt_model_lazy(void);
>
> #else
>
> @@ -2089,6 +2091,10 @@ static inline bool preempt_model_full(void)
> {
> return IS_ENABLED(CONFIG_PREEMPT);
> }
> +static inline bool preempt_model_lazy(void)
> +{
> + return IS_ENABLED(CONFIG_PREEMPT_LAZY);
> +}
>
> #endif
>
> @@ -2107,7 +2113,7 @@ static inline bool preempt_model_rt(void)
> */
> static inline bool preempt_model_preemptible(void)
> {
> - return preempt_model_full() || preempt_model_rt();
> + return preempt_model_full() || preempt_model_lazy() || preempt_model_rt();
> }
>
> static __always_inline bool need_resched(void)
> diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
> index 9ea0b28068f49..cf2446c9c30d4 100644
> --- a/include/linux/thread_info.h
> +++ b/include/linux/thread_info.h
> @@ -59,6 +59,14 @@ enum syscall_work_bit {
>
> #include <asm/thread_info.h>
>
> +#ifndef TIF_NEED_RESCHED_LAZY
> +#ifdef CONFIG_ARCH_HAS_PREEMPT_LAZY
> +#error Inconsistent PREEMPT_LAZY
> +#endif
> +#define TIF_NEED_RESCHED_LAZY TIF_NEED_RESCHED
> +#define _TIF_NEED_RESCHED_LAZY _TIF_NEED_RESCHED
> +#endif
> +
> #ifdef __KERNEL__
>
> #ifndef arch_set_restart_data
> @@ -179,22 +187,27 @@ static __always_inline unsigned long read_ti_thread_flags(struct thread_info *ti
>
> #ifdef _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H
>
> -static __always_inline bool tif_need_resched(void)
> +static __always_inline bool tif_test_bit(int bit)
> {
> - return arch_test_bit(TIF_NEED_RESCHED,
> + return arch_test_bit(bit,
> (unsigned long *)(&current_thread_info()->flags));
> }
>
> #else
>
> -static __always_inline bool tif_need_resched(void)
> +static __always_inline bool tif_test_bit(int bit)
> {
> - return test_bit(TIF_NEED_RESCHED,
> + return test_bit(bit,
> (unsigned long *)(&current_thread_info()->flags));
> }
>
> #endif /* _ASM_GENERIC_BITOPS_INSTRUMENTED_NON_ATOMIC_H */
>
> +static __always_inline bool tif_need_resched(void)
> +{
> + return tif_test_bit(TIF_NEED_RESCHED);
> +}
> +
> #ifndef CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES
> static inline int arch_within_stack_frames(const void * const stack,
> const void * const stackend,
> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
> index c2f1fd95a8214..1a2e3849e3e5f 100644
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -11,6 +11,9 @@ config PREEMPT_BUILD
> select PREEMPTION
> select UNINLINE_SPIN_UNLOCK if !ARCH_INLINE_SPIN_UNLOCK
>
> +config ARCH_HAS_PREEMPT_LAZY
> + bool
> +
> choice
> prompt "Preemption Model"
> default PREEMPT_NONE
> @@ -67,6 +70,14 @@ config PREEMPT
> embedded system with latency requirements in the milliseconds
> range.
>
> +config PREEMPT_LAZY
> + bool "Scheduler controlled preemption model"
> + depends on !ARCH_NO_PREEMPT
> + depends on ARCH_HAS_PREEMPT_LAZY
> + select PREEMPT_BUILD
> + help
> + Hamsters in your brain...
> +
> config PREEMPT_RT
> bool "Fully Preemptible Kernel (Real-Time)"
> depends on EXPERT && ARCH_SUPPORTS_RT
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index 90843cc385880..bcb23c866425e 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -98,7 +98,7 @@ __always_inline unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
>
> local_irq_enable_exit_to_user(ti_work);
>
> - if (ti_work & _TIF_NEED_RESCHED)
> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> schedule();
>
> if (ti_work & _TIF_UPROBE)
> diff --git a/kernel/entry/kvm.c b/kernel/entry/kvm.c
> index 2e0f75bcb7fd1..8485f63863afc 100644
> --- a/kernel/entry/kvm.c
> +++ b/kernel/entry/kvm.c
> @@ -13,7 +13,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
> return -EINTR;
> }
>
> - if (ti_work & _TIF_NEED_RESCHED)
> + if (ti_work & (_TIF_NEED_RESCHED | _TIF_NEED_RESCHED_LAZY))
> schedule();
>
> if (ti_work & _TIF_NOTIFY_RESUME)
> @@ -24,7 +24,7 @@ static int xfer_to_guest_mode_work(struct kvm_vcpu *vcpu, unsigned long ti_work)
> return ret;
>
> ti_work = read_thread_flags();
> - } while (ti_work & XFER_TO_GUEST_MODE_WORK || need_resched());
> + } while (ti_work & XFER_TO_GUEST_MODE_WORK);
> return 0;
> }
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 965e6464e68e9..c32de809283cf 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -904,10 +904,9 @@ static inline void hrtick_rq_init(struct rq *rq)
> * this avoids any races wrt polling state changes and thereby avoids
> * spurious IPIs.
> */
> -static inline bool set_nr_and_not_polling(struct task_struct *p)
> +static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
> {
> - struct thread_info *ti = task_thread_info(p);
> - return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
> + return !(fetch_or(&ti->flags, 1 << tif) & _TIF_POLLING_NRFLAG);
> }
>
> /*
> @@ -932,9 +931,9 @@ static bool set_nr_if_polling(struct task_struct *p)
> }
>
> #else
> -static inline bool set_nr_and_not_polling(struct task_struct *p)
> +static inline bool set_nr_and_not_polling(struct thread_info *ti, int tif)
> {
> - set_tsk_need_resched(p);
> + atomic_long_or(1 << tif, (atomic_long_t *)&ti->flags);
> return true;
> }
>
> @@ -1039,28 +1038,66 @@ void wake_up_q(struct wake_q_head *head)
> * might also involve a cross-CPU call to trigger the scheduler on
> * the target CPU.
> */
> -void resched_curr(struct rq *rq)
> +static void __resched_curr(struct rq *rq, int tif)
> {
> struct task_struct *curr = rq->curr;
> + struct thread_info *cti = task_thread_info(curr);
> int cpu;
>
> lockdep_assert_rq_held(rq);
>
> - if (test_tsk_need_resched(curr))
> + if (is_idle_task(curr) && tif == TIF_NEED_RESCHED_LAZY)
> + tif = TIF_NEED_RESCHED;
> +
> + if (cti->flags & ((1 << tif) | _TIF_NEED_RESCHED))
> return;
>
> cpu = cpu_of(rq);
>
> if (cpu == smp_processor_id()) {
> - set_tsk_need_resched(curr);
> - set_preempt_need_resched();
> + set_ti_thread_flag(cti, tif);
> + if (tif == TIF_NEED_RESCHED)
> + set_preempt_need_resched();
> return;
> }
>
> - if (set_nr_and_not_polling(curr))
> - smp_send_reschedule(cpu);
> - else
> + if (set_nr_and_not_polling(cti, tif)) {
> + if (tif == TIF_NEED_RESCHED)
> + smp_send_reschedule(cpu);
> + } else {
> trace_sched_wake_idle_without_ipi(cpu);
> + }
> +}
> +
> +void resched_curr(struct rq *rq)
> +{
> + __resched_curr(rq, TIF_NEED_RESCHED);
> +}
> +
> +#ifdef CONFIG_PREEMPT_DYNAMIC
> +static DEFINE_STATIC_KEY_FALSE(sk_dynamic_preempt_lazy);
> +static __always_inline bool dynamic_preempt_lazy(void)
> +{
> + return static_branch_unlikely(&sk_dynamic_preempt_lazy);
> +}
> +#else
> +static __always_inline bool dynamic_preempt_lazy(void)
> +{
> + return IS_ENABLED(PREEMPT_LAZY);
> +}
> +#endif
> +
> +static __always_inline int tif_need_resched_lazy(void)
> +{
> + if (dynamic_preempt_lazy())
> + return TIF_NEED_RESCHED_LAZY;
> +
> + return TIF_NEED_RESCHED;
> +}
> +
> +void resched_curr_lazy(struct rq *rq)
> +{
> + __resched_curr(rq, tif_need_resched_lazy());
> }
>
> void resched_cpu(int cpu)
> @@ -1155,7 +1192,7 @@ static void wake_up_idle_cpu(int cpu)
> * and testing of the above solutions didn't appear to report
> * much benefits.
> */
> - if (set_nr_and_not_polling(rq->idle))
> + if (set_nr_and_not_polling(task_thread_info(rq->idle), TIF_NEED_RESCHED))
> smp_send_reschedule(cpu);
> else
> trace_sched_wake_idle_without_ipi(cpu);
> @@ -5537,6 +5574,10 @@ void sched_tick(void)
> update_rq_clock(rq);
> hw_pressure = arch_scale_hw_pressure(cpu_of(rq));
> update_hw_load_avg(rq_clock_task(rq), rq, hw_pressure);
> +
> + if (dynamic_preempt_lazy() && tif_test_bit(TIF_NEED_RESCHED_LAZY))
> + resched_curr(rq);
> +
> curr->sched_class->task_tick(rq, curr, 0);
> if (sched_feat(LATENCY_WARN))
> resched_latency = cpu_resched_latency(rq);
> @@ -7245,6 +7286,7 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
> * preempt_schedule <- NOP
> * preempt_schedule_notrace <- NOP
> * irqentry_exit_cond_resched <- NOP
> + * dynamic_preempt_lazy <- false
> *
> * VOLUNTARY:
> * cond_resched <- __cond_resched
> @@ -7252,6 +7294,7 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
> * preempt_schedule <- NOP
> * preempt_schedule_notrace <- NOP
> * irqentry_exit_cond_resched <- NOP
> + * dynamic_preempt_lazy <- false
> *
> * FULL:
> * cond_resched <- RET0
> @@ -7259,6 +7302,15 @@ EXPORT_SYMBOL(__cond_resched_rwlock_write);
> * preempt_schedule <- preempt_schedule
> * preempt_schedule_notrace <- preempt_schedule_notrace
> * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> + * dynamic_preempt_lazy <- false
> + *
> + * LAZY:
> + * cond_resched <- RET0
> + * might_resched <- RET0
> + * preempt_schedule <- preempt_schedule
> + * preempt_schedule_notrace <- preempt_schedule_notrace
> + * irqentry_exit_cond_resched <- irqentry_exit_cond_resched
> + * dynamic_preempt_lazy <- true
> */
>
> enum {
> @@ -7266,6 +7318,7 @@ enum {
> preempt_dynamic_none,
> preempt_dynamic_voluntary,
> preempt_dynamic_full,
> + preempt_dynamic_lazy,
> };
>
> int preempt_dynamic_mode = preempt_dynamic_undefined;
> @@ -7281,15 +7334,23 @@ int sched_dynamic_mode(const char *str)
> if (!strcmp(str, "full"))
> return preempt_dynamic_full;
>
> +#ifdef CONFIG_ARCH_HAS_PREEMPT_LAZY
> + if (!strcmp(str, "lazy"))
> + return preempt_dynamic_lazy;
> +#endif
> +
> return -EINVAL;
> }
>
> +#define preempt_dynamic_key_enable(f) static_key_enable(&sk_dynamic_##f.key)
> +#define preempt_dynamic_key_disable(f) static_key_disable(&sk_dynamic_##f.key)
> +
> #if defined(CONFIG_HAVE_PREEMPT_DYNAMIC_CALL)
> #define preempt_dynamic_enable(f) static_call_update(f, f##_dynamic_enabled)
> #define preempt_dynamic_disable(f) static_call_update(f, f##_dynamic_disabled)
> #elif defined(CONFIG_HAVE_PREEMPT_DYNAMIC_KEY)
> -#define preempt_dynamic_enable(f) static_key_enable(&sk_dynamic_##f.key)
> -#define preempt_dynamic_disable(f) static_key_disable(&sk_dynamic_##f.key)
> +#define preempt_dynamic_enable(f) preempt_dynamic_key_enable(f)
> +#define preempt_dynamic_disable(f) preempt_dynamic_key_disable(f)
> #else
> #error "Unsupported PREEMPT_DYNAMIC mechanism"
> #endif
> @@ -7309,6 +7370,7 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_enable(preempt_schedule);
> preempt_dynamic_enable(preempt_schedule_notrace);
> preempt_dynamic_enable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_disable(preempt_lazy);
>
> switch (mode) {
> case preempt_dynamic_none:
> @@ -7318,6 +7380,7 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_disable(preempt_schedule);
> preempt_dynamic_disable(preempt_schedule_notrace);
> preempt_dynamic_disable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_disable(preempt_lazy);
> if (mode != preempt_dynamic_mode)
> pr_info("Dynamic Preempt: none\n");
> break;
> @@ -7329,6 +7392,7 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_disable(preempt_schedule);
> preempt_dynamic_disable(preempt_schedule_notrace);
> preempt_dynamic_disable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_disable(preempt_lazy);
> if (mode != preempt_dynamic_mode)
> pr_info("Dynamic Preempt: voluntary\n");
> break;
> @@ -7340,9 +7404,22 @@ static void __sched_dynamic_update(int mode)
> preempt_dynamic_enable(preempt_schedule);
> preempt_dynamic_enable(preempt_schedule_notrace);
> preempt_dynamic_enable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_disable(preempt_lazy);
> if (mode != preempt_dynamic_mode)
> pr_info("Dynamic Preempt: full\n");
> break;
> +
> + case preempt_dynamic_lazy:
> + if (!klp_override)
> + preempt_dynamic_disable(cond_resched);
> + preempt_dynamic_disable(might_resched);
> + preempt_dynamic_enable(preempt_schedule);
> + preempt_dynamic_enable(preempt_schedule_notrace);
> + preempt_dynamic_enable(irqentry_exit_cond_resched);
> + preempt_dynamic_key_enable(preempt_lazy);
> + if (mode != preempt_dynamic_mode)
> + pr_info("Dynamic Preempt: lazy\n");
> + break;
> }
>
> preempt_dynamic_mode = mode;
> @@ -7405,6 +7482,8 @@ static void __init preempt_dynamic_init(void)
> sched_dynamic_update(preempt_dynamic_none);
> } else if (IS_ENABLED(CONFIG_PREEMPT_VOLUNTARY)) {
> sched_dynamic_update(preempt_dynamic_voluntary);
> + } else if (IS_ENABLED(CONFIG_PREEMPT_LAZY)) {
> + sched_dynamic_update(preempt_dynamic_lazy);
> } else {
> /* Default static call setting, nothing to do */
> WARN_ON_ONCE(!IS_ENABLED(CONFIG_PREEMPT));
> @@ -7425,6 +7504,7 @@ static void __init preempt_dynamic_init(void)
> PREEMPT_MODEL_ACCESSOR(none);
> PREEMPT_MODEL_ACCESSOR(voluntary);
> PREEMPT_MODEL_ACCESSOR(full);
> +PREEMPT_MODEL_ACCESSOR(lazy);
>
> #else /* !CONFIG_PREEMPT_DYNAMIC: */
>
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 1bc24410ae501..87309cf247c68 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -245,7 +245,7 @@ static ssize_t sched_dynamic_write(struct file *filp, const char __user *ubuf,
> static int sched_dynamic_show(struct seq_file *m, void *v)
> {
> static const char * preempt_modes[] = {
> - "none", "voluntary", "full"
> + "none", "voluntary", "full", "lazy",
> };
> int i;
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5b5d50dbc79dc..71b4112cadde0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1007,7 +1007,7 @@ static void update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> * The task has consumed its request, reschedule.
> */
> if (cfs_rq->nr_running > 1) {
> - resched_curr(rq_of(cfs_rq));
> + resched_curr_lazy(rq_of(cfs_rq));
> clear_buddies(cfs_rq, se);
> }
> }
> @@ -8615,7 +8615,7 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
> return;
>
> preempt:
> - resched_curr(rq);
> + resched_curr_lazy(rq);
> }
>
> static struct task_struct *pick_task_fair(struct rq *rq)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 041d8e00a1568..48a4617a5b28b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2494,6 +2494,7 @@ extern void init_sched_fair_class(void);
> extern void reweight_task(struct task_struct *p, int prio);
>
> extern void resched_curr(struct rq *rq);
> +extern void resched_curr_lazy(struct rq *rq);
> extern void resched_cpu(int cpu);
>
> extern struct rt_bandwidth def_rt_bandwidth;


--
ankur

2024-06-06 17:33:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO

On Thu, Jun 06, 2024 at 08:11:41AM -0700, Ankur Arora wrote:
>
> Peter Zijlstra <[email protected]> writes:
>
> > On Thu, May 30, 2024 at 02:29:45AM -0700, Ankur Arora wrote:
> >>
> >> Peter Zijlstra <[email protected]> writes:
> >>
> >> > On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
> >> >> Reuse sched_dynamic_update() and related logic to enable choosing
> >> >> the preemption model at boot or runtime for PREEMPT_AUTO.
> >> >>
> >> >> The interface is identical to PREEMPT_DYNAMIC.
> >> >
> >> > Colour me confused, why?!? What are you doing and why aren't just just
> >> > adding AUTO to the existing DYNAMIC thing?
> >>
> >> You mean have a single __sched_dynamic_update()? AUTO doesn't use any
> >> of the static_call/static_key stuff so I'm not sure how that would work.
> >
> > *sigh*... see the below, seems to work.
>
> Sorry, didn't mean for you to have to do all that work to prove the
> point.

Well, for a large part it was needed for me to figure out what your
patches were actually doing anyway. Peel away all the layers and this is
what remains.

> I phrased it badly. I do understand how lazy can be folded in as
> you do here:
>
> > + case preempt_dynamic_lazy:
> > + if (!klp_override)
> > + preempt_dynamic_disable(cond_resched);
> > + preempt_dynamic_disable(might_resched);
> > + preempt_dynamic_enable(preempt_schedule);
> > + preempt_dynamic_enable(preempt_schedule_notrace);
> > + preempt_dynamic_enable(irqentry_exit_cond_resched);
> > + preempt_dynamic_key_enable(preempt_lazy);
> > + if (mode != preempt_dynamic_mode)
> > + pr_info("Dynamic Preempt: lazy\n");
> > + break;
> > }
>
> But, if the long term goal (at least as I understand it) is to get rid
> of cond_resched() -- to allow optimizations that needing to call cond_resched()
> makes impossible -- does it make sense to pull all of these together?

It certainly doesn't make sense to add yet another configurable thing. We
have one, so yes add it here.

> Say, eventually preempt_dynamic_lazy and preempt_dynamic_full are the
> only two models left. Then we will have (modulo figuring out how to
> switch over klp from cond_resched() to a different unwinding technique):
>
> static void __sched_dynamic_update(int mode)
> {
> preempt_dynamic_enable(preempt_schedule);
> preempt_dynamic_enable(preempt_schedule_notrace);
> preempt_dynamic_enable(irqentry_exit_cond_resched);
>
> switch (mode) {
> case preempt_dynamic_full:
> preempt_dynamic_key_disable(preempt_lazy);
> if (mode != preempt_dynamic_mode)
> pr_info("%s: full\n", PREEMPT_MODE);
> break;
>
> case preempt_dynamic_lazy:
> preempt_dynamic_key_enable(preempt_lazy);
> if (mode != preempt_dynamic_mode)
> pr_info("Dynamic Preempt: lazy\n");
> break;
> }
>
> preempt_dynamic_mode = mode;
> }
>
> Which is pretty similar to what the PREEMPT_AUTO code was doing.

Right, but without duplicating all that stuff in the interim.

2024-06-07 16:49:30

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling



On 6/4/24 1:02 PM, Shrikanth Hegde wrote:
>
>
> On 6/1/24 5:17 PM, Ankur Arora wrote:
>>
>> Shrikanth Hegde <[email protected]> writes:
>>
>>> On 5/28/24 6:04 AM, Ankur Arora wrote:
>>>> Hi,
>>>>
>>>> This series adds a new scheduling model PREEMPT_AUTO, which like
>>>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>>>> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
>>>> on explicit preemption points for the voluntary models.
>>>>
>>>> The series is based on Thomas' original proposal which he outlined
>>>> in [1], [2] and in his PoC [3].
>>>>
>>>> v2 mostly reworks v1, with one of the main changes having less
>>>> noisy need-resched-lazy related interfaces.
>>>> More details in the changelog below.
>>>>
>>>
>>> Hi Ankur. Thanks for the series.
>>>
>>> nit: had to manually patch 11,12,13 since it didnt apply cleanly on
>>> tip/master and tip/sched/core. Mostly due some word differences in the change.
>>>
>>> tip/master was at:
>>> commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
>>> Merge: 5d145493a139 47ff30cc1be7
>>> Author: Ingo Molnar <[email protected]>
>>> Date: Tue May 28 12:44:26 2024 +0200
>>>
>>> Merge branch into tip/master: 'x86/percpu'
>>>
>>>
>>>
>>>> The v1 of the series is at [4] and the RFC at [5].
>>>>
>>>> Design
>>>> ==
>>>>
>>>> PREEMPT_AUTO works by always enabling CONFIG_PREEMPTION (and thus
>>>> PREEMPT_COUNT). This means that the scheduler can always safely
>>>> preempt. (This is identical to CONFIG_PREEMPT.)
>>>>
>>>> Having that, the next step is to make the rescheduling policy dependent
>>>> on the chosen scheduling model. Currently, the scheduler uses a single
>>>> need-resched bit (TIF_NEED_RESCHED) which it uses to state that a
>>>> reschedule is needed.
>>>> PREEMPT_AUTO extends this by adding an additional need-resched bit
>>>> (TIF_NEED_RESCHED_LAZY) which, with TIF_NEED_RESCHED now allows the
>>>> scheduler to express two kinds of rescheduling intent: schedule at
>>>> the earliest opportunity (TIF_NEED_RESCHED), or express a need for
>>>> rescheduling while allowing the task on the runqueue to run to
>>>> timeslice completion (TIF_NEED_RESCHED_LAZY).
>>>>
>>>> The scheduler decides which need-resched bits are chosen based on
>>>> the preemption model in use:
>>>>
>>>> TIF_NEED_RESCHED TIF_NEED_RESCHED_LAZY
>>>>
>>>> none never always [*]
>>>> voluntary higher sched class other tasks [*]
>>>> full always never
>>>>
>>>> [*] some details elided.
>>>>
>>>> The last part of the puzzle is, when does preemption happen, or
>>>> alternately stated, when are the need-resched bits checked:
>>>>
>>>> exit-to-user ret-to-kernel preempt_count()
>>>>
>>>> NEED_RESCHED_LAZY Y N N
>>>> NEED_RESCHED Y Y Y
>>>>
>>>> Using NEED_RESCHED_LAZY allows for run-to-completion semantics when
>>>> none/voluntary preemption policies are in effect. And eager semantics
>>>> under full preemption.
>>>>
>>>> In addition, since this is driven purely by the scheduler (not
>>>> depending on cond_resched() placement and the like), there is enough
>>>> flexibility in the scheduler to cope with edge cases -- ex. a kernel
>>>> task not relinquishing CPU under NEED_RESCHED_LAZY can be handled by
>>>> simply upgrading to a full NEED_RESCHED which can use more coercive
>>>> instruments like resched IPI to induce a context-switch.
>>>>
>>>> Performance
>>>> ==
>>>> The performance in the basic tests (perf bench sched messaging, kernbench,
>>>> cyclictest) matches or improves what we see under PREEMPT_DYNAMIC.
>>>> (See patches
>>>> "sched: support preempt=none under PREEMPT_AUTO"
>>>> "sched: support preempt=full under PREEMPT_AUTO"
>>>> "sched: handle preempt=voluntary under PREEMPT_AUTO")
>>>>
>>>> For a macro test, a colleague in Oracle's Exadata team tried two
>>>> OLTP benchmarks (on a 5.4.17 based Oracle kernel, with the v1 series
>>>> backported.)
>>>>
>>>> In both tests the data was cached on remote nodes (cells), and the
>>>> database nodes (compute) served client queries, with clients being
>>>> local in the first test and remote in the second.
>>>>
>>>> Compute node: Oracle E5, dual socket AMD EPYC 9J14, KVM guest (380 CPUs)
>>>> Cells (11 nodes): Oracle E5, dual socket AMD EPYC 9334, 128 CPUs
>>>>
>>>>
>>>> PREEMPT_VOLUNTARY PREEMPT_AUTO
>>>> (preempt=voluntary)
>>>> ============================== =============================
>>>> clients throughput cpu-usage throughput cpu-usage Gain
>>>> (tx/min) (utime %/stime %) (tx/min) (utime %/stime %)
>>>> ------- ---------- ----------------- ---------- ----------------- -------
>>>>
>>>>
>>>> OLTP 384 9,315,653 25/ 6 9,253,252 25/ 6 -0.7%
>>>> benchmark 1536 13,177,565 50/10 13,657,306 50/10 +3.6%
>>>> (local clients) 3456 14,063,017 63/12 14,179,706 64/12 +0.8%
>>>>
>>>>
>>>> OLTP 96 8,973,985 17/ 2 8,924,926 17/ 2 -0.5%
>>>> benchmark 384 22,577,254 60/ 8 22,211,419 59/ 8 -1.6%
>>>> (remote clients, 2304 25,882,857 82/11 25,536,100 82/11 -1.3%
>>>> 90/10 RW ratio)
>>>>
>>>>
>>>> (Both sets of tests have a fair amount of NW traffic since the query
>>>> tables etc are cached on the cells. Additionally, the first set,
>>>> given the local clients, stress the scheduler a bit more than the
>>>> second.)
>>>>
>>>> The comparative performance for both the tests is fairly close,
>>>> more or less within a margin of error.
>>>>
>>>> Raghu KT also tested v1 on an AMD Milan (2 node, 256 cpu, 512GB RAM):
>>>>
>>>> "
>>>> a) Base kernel (6.7),
>>>> b) v1, PREEMPT_AUTO, preempt=voluntary
>>>> c) v1, PREEMPT_DYNAMIC, preempt=voluntary
>>>> d) v1, PREEMPT_AUTO=y, preempt=voluntary, PREEMPT_RCU = y
>>>>
>>>> Workloads I tested and their %gain,
>>>> case b case c case d
>>>> NAS +2.7% +1.9% +2.1%
>>>> Hashjoin, +0.0% +0.0% +0.0%
>>>> Graph500, -6.0% +0.0% +0.0%
>>>> XSBench +1.7% +0.0% +1.2%
>>>>
>>>> (Note about the Graph500 numbers at [8].)
>>>>
>>>> Did kernbench etc test from Mel's mmtests suite also. Did not notice
>>>> much difference.
>>>> "
>>>>
>>>> One case where there is a significant performance drop is on powerpc,
>>>> seen running hackbench on a 320 core system (a test on a smaller system is
>>>> fine.) In theory there's no reason for this to only happen on powerpc
>>>> since most of the code is common, but I haven't been able to reproduce
>>>> it on x86 so far.
>>>>
>>>> All in all, I think the tests above show that this scheduling model has legs.
>>>> However, the none/voluntary models under PREEMPT_AUTO are conceptually
>>>> different enough from the current none/voluntary models that there
>>>> likely are workloads where performance would be subpar. That needs more
>>>> extensive testing to figure out the weak points.
>>>>
>>>>
>>>>
>>> Did test it again on PowerPC. Unfortunately numbers shows there is regression
>>> still compared to 6.10-rc1. This is done with preempt=none. I tried again on the
>>> smaller system too to confirm. For now I have done the comparison for the hackbench
>>> where highest regression was seen in v1.
>>>
>>> perf stat collected for 20 iterations show higher context switch and higher migrations.
>>> Could it be that LAZY bit is causing more context switches? or could it be something
>>> else? Could it be that more exit-to-user happens in PowerPC? will continue to debug.
>>
>> Thanks for trying it out.
>>
>> As you point out, context-switches and migrations are signficantly higher.
>>
>> Definitely unexpected. I ran the same test on an x86 box
>> (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.
>>
>> 6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% )
>> 6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% )
>> 6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% )
>>
>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% )
>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% )
>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% )
>>
>> Clearly there's something different going on powerpc. I'm travelling
>> right now, but will dig deeper into this once I get back.
>>
>> Meanwhile can you check if the increased context-switches are voluntary or
>> involuntary (or what the division is)?
>
>
> Used "pidstat -w -p ALL 1 10" to capture 10 seconds data at 1 second interval for
> context switches per second while running "hackbench -pipe 60 process 100000 loops"
>
>
> preempt=none 6.10 preempt_auto
> =============================================================================
> voluntary context switches 7632166.19 9391636.34(+23%)
> involuntary context switches 2305544.07 3527293.94(+53%)
>
> Numbers vary between multiple runs. But trend seems to be similar. Both the context switches increase
> involuntary seems to increase at higher rate.
>
>


Continued data on the hackbench regression; preempt=none in both cases.
From mpstat, I see slightly higher idle time and more irq time with preempt_auto.

6.10-rc1:
=========
10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
09:45:23 AM all 4.14 0.00 77.57 0.00 16.92 0.00 0.00 0.00 0.00 1.37
09:45:24 AM all 4.42 0.00 77.62 0.00 16.76 0.00 0.00 0.00 0.00 1.20
09:45:25 AM all 4.43 0.00 77.45 0.00 16.94 0.00 0.00 0.00 0.00 1.18
09:45:26 AM all 4.45 0.00 77.87 0.00 16.68 0.00 0.00 0.00 0.00 0.99

PREEMPT_AUTO:
===========
10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
10:09:56 AM all 3.11 0.00 72.59 0.00 21.34 0.00 0.00 0.00 0.00 2.96
10:09:57 AM all 3.31 0.00 73.10 0.00 20.99 0.00 0.00 0.00 0.00 2.60
10:09:58 AM all 3.40 0.00 72.83 0.00 20.85 0.00 0.00 0.00 0.00 2.92
10:10:00 AM all 3.21 0.00 72.87 0.00 21.19 0.00 0.00 0.00 0.00 2.73
10:10:01 AM all 3.02 0.00 72.18 0.00 21.08 0.00 0.00 0.00 0.00 3.71

Used the bcc tools hardirq and softirq to see if irqs are increasing. The softirq data implies there are
more timer and sched softirqs. Numbers vary between different samples, but the trend seems to be similar.

6.10-rc1:
=========
SOFTIRQ TOTAL_usecs
tasklet 71
block 145
net_rx 7914
rcu 136988
timer 304357
sched 1404497



PREEMPT_AUTO:
===========
SOFTIRQ TOTAL_usecs
tasklet 80
block 139
net_rx 6907
rcu 223508
timer 492767
sched 1794441
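
For reference, the per-vector totals above look like the default output of the
bcc softirqs tool; something along these lines (install path may differ by
distro, so treat this as a sketch) should reproduce them:

  # total time spent in each softirq vector over a 10 second window, in usecs
  /usr/share/bcc/tools/softirqs 10 1
  # likewise for hardirq handlers
  /usr/share/bcc/tools/hardirqs 10 1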


Would any specific setting of RCU matter for this?
This is what I have in config.

# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_TREE_SRCU=y
CONFIG_NEED_SRCU_NMI_SAFE=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_NEED_TASKS_RCU=y
CONFIG_TASKS_RCU=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
# CONFIG_RCU_LAZY is not set
# end of RCU Subsystem


# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_CONTEXT_TRACKING_USER=y
# CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem


2024-06-09 00:47:28

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO


Peter Zijlstra <[email protected]> writes:

> On Thu, Jun 06, 2024 at 08:11:41AM -0700, Ankur Arora wrote:
>>
>> Peter Zijlstra <[email protected]> writes:
>>
>> > On Thu, May 30, 2024 at 02:29:45AM -0700, Ankur Arora wrote:
>> >>
>> >> Peter Zijlstra <[email protected]> writes:
>> >>
>> >> > On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
>> >> >> Reuse sched_dynamic_update() and related logic to enable choosing
>> >> >> the preemption model at boot or runtime for PREEMPT_AUTO.
>> >> >>
>> >> >> The interface is identical to PREEMPT_DYNAMIC.
>> >> >
>> >> > Colour me confused, why?!? What are you doing and why aren't you just
>> >> > adding AUTO to the existing DYNAMIC thing?
>> >>
>> >> You mean have a single __sched_dynamic_update()? AUTO doesn't use any
>> >> of the static_call/static_key stuff so I'm not sure how that would work.
>> >
>> > *sigh*... see the below, seems to work.
>>
>> Sorry, didn't mean for you to have to do all that work to prove the
>> point.
>
> Well, for a large part it was needed for me to figure out what your
> patches were actually doing anyway. Peel away all the layers and this is
> what remains.
>
>> I phrased it badly. I do understand how lazy can be folded in as
>> you do here:
>>
>> > + case preempt_dynamic_lazy:
>> > + if (!klp_override)
>> > + preempt_dynamic_disable(cond_resched);
>> > + preempt_dynamic_disable(might_resched);
>> > + preempt_dynamic_enable(preempt_schedule);
>> > + preempt_dynamic_enable(preempt_schedule_notrace);
>> > + preempt_dynamic_enable(irqentry_exit_cond_resched);
>> > + preempt_dynamic_key_enable(preempt_lazy);
>> > + if (mode != preempt_dynamic_mode)
>> > + pr_info("Dynamic Preempt: lazy\n");
>> > + break;
>> > }
>>
>> But, if the long term goal (at least as I understand it) is to get rid
>> of cond_resched() -- to allow optimizations that needing to call cond_resched()
>> makes impossible -- does it make sense to pull all of these together?
>
> It certainly doesn't make sense to add yet another configurable thing. We
> have one, so yes add it here.
>
>> Say, eventually preempt_dynamic_lazy and preempt_dynamic_full are the
>> only two models left. Then we will have (modulo figuring out how to
>> switch over klp from cond_resched() to a different unwinding technique):
>>
>> static void __sched_dynamic_update(int mode)
>> {
>> preempt_dynamic_enable(preempt_schedule);
>> preempt_dynamic_enable(preempt_schedule_notrace);
>> preempt_dynamic_enable(irqentry_exit_cond_resched);
>>
>> switch (mode) {
>> case preempt_dynamic_full:
>> preempt_dynamic_key_disable(preempt_lazy);
>> if (mode != preempt_dynamic_mode)
>> pr_info("%s: full\n", PREEMPT_MODE);
>> break;
>>
>> case preempt_dynamic_lazy:
>> preempt_dynamic_key_enable(preempt_lazy);
>> if (mode != preempt_dynamic_mode)
>> pr_info("Dynamic Preempt: lazy\n");
>> break;
>> }
>>
>> preempt_dynamic_mode = mode;
>> }
>>
>> Which is pretty similar to what the PREEMPT_AUTO code was doing.
>
> Right, but without duplicating all that stuff in the interim.

Yeah, that makes sense. Joel had suggested something on these lines
earlier [1], to which I was resistant.

However, the duplication (and the fact that the voluntary model
was quite thin) should have told me that (AUTO, preempt=voluntary)
should just be folded under PREEMPT_DYNAMIC.

I'll rework the series to do that.

That should also simplify the RCU related choices, which I think Paul will
like. Given that the lazy model is meant to eventually replace
none/voluntary, the PREEMPT_RCU configuration can just be:

--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -18,7 +18,7 @@ config TREE_RCU

config PREEMPT_RCU
bool
- default y if PREEMPTION
+ default y if PREEMPTION && !PREEMPT_LAZY


Or, maybe we should instead have this:

--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -18,7 +18,7 @@ config TREE_RCU

config PREEMPT_RCU
bool
- default y if PREEMPTION
+ default y if PREEMPT || PREEMPT_RT
select TREE_RCU

Though this would be a change in behaviour for current PREEMPT_DYNAMIC
users.

[1] https://lore.kernel.org/lkml/[email protected]/

Thanks
--
ankur

2024-06-10 07:24:57

by Ankur Arora

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling


Shrikanth Hegde <[email protected]> writes:

> On 6/4/24 1:02 PM, Shrikanth Hegde wrote:
>>
>>
>> On 6/1/24 5:17 PM, Ankur Arora wrote:
>>>
>>> Shrikanth Hegde <[email protected]> writes:
>>>
>>>> On 5/28/24 6:04 AM, Ankur Arora wrote:
>>>>> Hi,
>>>>>
>>>>> This series adds a new scheduling model PREEMPT_AUTO, which like
>>>>> PREEMPT_DYNAMIC allows dynamic switching between a none/voluntary/full
>>>>> preemption model. Unlike, PREEMPT_DYNAMIC, it doesn't depend
>>>>> on explicit preemption points for the voluntary models.
>>>>>
>>>>> The series is based on Thomas' original proposal which he outlined
>>>>> in [1], [2] and in his PoC [3].
>>>>>
>>>>> v2 mostly reworks v1, with one of the main changes having less
>>>>> noisy need-resched-lazy related interfaces.
>>>>> More details in the changelog below.
>>>>>
>>>>
>>>> Hi Ankur. Thanks for the series.
>>>>
>>>> nit: had to manually apply patches 11, 12, 13 since they didn't apply cleanly on
>>>> tip/master and tip/sched/core. Mostly due to some word differences in the change.
>>>>
>>>> tip/master was at:
>>>> commit e874df84d4a5f3ce50b04662b62b91e55b0760fc (HEAD -> master, origin/master, origin/HEAD)
>>>> Merge: 5d145493a139 47ff30cc1be7
>>>> Author: Ingo Molnar <[email protected]>
>>>> Date: Tue May 28 12:44:26 2024 +0200
>>>>
>>>> Merge branch into tip/master: 'x86/percpu'
>>>>
>>>>
>>>>
>>>>> [...]
>>>>>
>>>>> One case where there is a significant performance drop is on powerpc,
>>>>> seen running hackbench on a 320 core system (a test on a smaller system is
>>>>> fine.) In theory there's no reason for this to only happen on powerpc
>>>>> since most of the code is common, but I haven't been able to reproduce
>>>>> it on x86 so far.
>>>>>
>>>>> All in all, I think the tests above show that this scheduling model has legs.
>>>>> However, the none/voluntary models under PREEMPT_AUTO are conceptually
>>>>> different enough from the current none/voluntary models that there
>>>>> likely are workloads where performance would be subpar. That needs more
>>>>> extensive testing to figure out the weak points.
>>>>>
>>>>>
>>>>>
>>>> Did test it again on PowerPC. Unfortunately the numbers show there is still a
>>>> regression compared to 6.10-rc1. This is done with preempt=none. I tried again on the
>>>> smaller system too to confirm. For now I have done the comparison for hackbench,
>>>> where the highest regression was seen in v1.
>>>>
>>>> perf stat collected for 20 iterations shows higher context switches and higher migrations.
>>>> Could it be that the LAZY bit is causing more context switches? Or could it be something
>>>> else? Could it be that more exit-to-user transitions happen on PowerPC? Will continue to debug.
>>>
>>> Thanks for trying it out.
>>>
>>> As you point out, context-switches and migrations are significantly higher.
>>>
>>> Definitely unexpected. I ran the same test on an x86 box
>>> (Milan, 2x64 cores, 256 threads) and there I see no more than a ~4% difference.
>>>
>>> 6.9.0/none.process.pipe.60: 170,719,761 context-switches # 0.022 M/sec ( +- 0.19% )
>>> 6.9.0/none.process.pipe.60: 16,871,449 cpu-migrations # 0.002 M/sec ( +- 0.16% )
>>> 6.9.0/none.process.pipe.60: 30.833112186 seconds time elapsed ( +- 0.11% )
>>>
>>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 177,889,639 context-switches # 0.023 M/sec ( +- 0.21% )
>>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 17,426,670 cpu-migrations # 0.002 M/sec ( +- 0.41% )
>>> 6.9.0-00035-gc90017e055a6/none.process.pipe.60: 30.731126312 seconds time elapsed ( +- 0.07% )
>>>
>>> Clearly there's something different going on on powerpc. I'm travelling
>>> right now, but will dig deeper into this once I get back.
>>>
>>> Meanwhile can you check if the increased context-switches are voluntary or
>>> involuntary (or what the division is)?
>>
>>
>> Used "pidstat -w -p ALL 1 10" to capture 10 seconds data at 1 second interval for
>> context switches per second while running "hackbench -pipe 60 process 100000 loops"
>>
>>
>> preempt=none 6.10 preempt_auto
>> =============================================================================
>> voluntary context switches 7632166.19 9391636.34(+23%)
>> involuntary context switches 2305544.07 3527293.94(+53%)
>>
>> Numbers vary between multiple runs, but the trend seems to be similar: both kinds of context
>> switches increase, with involuntary switches increasing at a higher rate.
>>
>>
>
>
> Continued data from the hackbench regression. preempt=none in both cases.
> From mpstat, I see slightly higher idle time and more irq time with preempt_auto.
>
> 6.10-rc1:
> =========
> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 09:45:23 AM all 4.14 0.00 77.57 0.00 16.92 0.00 0.00 0.00 0.00 1.37
> 09:45:24 AM all 4.42 0.00 77.62 0.00 16.76 0.00 0.00 0.00 0.00 1.20
> 09:45:25 AM all 4.43 0.00 77.45 0.00 16.94 0.00 0.00 0.00 0.00 1.18
> 09:45:26 AM all 4.45 0.00 77.87 0.00 16.68 0.00 0.00 0.00 0.00 0.99
>
> PREEMPT_AUTO:
> ===========
> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
> 10:09:56 AM all 3.11 0.00 72.59 0.00 21.34 0.00 0.00 0.00 0.00 2.96
> 10:09:57 AM all 3.31 0.00 73.10 0.00 20.99 0.00 0.00 0.00 0.00 2.60
> 10:09:58 AM all 3.40 0.00 72.83 0.00 20.85 0.00 0.00 0.00 0.00 2.92
> 10:10:00 AM all 3.21 0.00 72.87 0.00 21.19 0.00 0.00 0.00 0.00 2.73
> 10:10:01 AM all 3.02 0.00 72.18 0.00 21.08 0.00 0.00 0.00 0.00 3.71
>
> Used the bcc tools hardirq and softirq to see if irqs are increasing. The softirq data implies there are
> more timer and sched softirqs. Numbers vary between different samples, but the trend seems to be similar.

Yeah, the %sys is lower and the %irq higher. Can you also see where the
increased %irq is coming from? For instance, are the resched IPI counts greater?
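
A crude way to check that -- assuming the relevant IPI types show up as
separate lines in /proc/interrupts on this platform -- would be to diff
snapshots taken around the run, e.g.:

  cat /proc/interrupts > /tmp/irqs.before
  hackbench -pipe 60 process 100000 loops
  cat /proc/interrupts > /tmp/irqs.after
  diff /tmp/irqs.before /tmp/irqs.after   # look at the resched/call-function IPI lines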

> 6.10-rc1:
> =========
> SOFTIRQ TOTAL_usecs
> tasklet 71
> block 145
> net_rx 7914
> rcu 136988
> timer 304357
> sched 1404497
>
>
>
> PREEMPT_AUTO:
> ===========
> SOFTIRQ TOTAL_usecs
> tasklet 80
> block 139
> net_rx 6907
> rcu 223508
> timer 492767
> sched 1794441
>
>
> Would any specific setting of RCU matter for this?
> This is what I have in config.

Don't see how it could matter unless the RCU settings are changing
between the two tests? In my testing I'm also using TREE_RCU=y,
PREEMPT_RCU=n.

Let me see if I can find a test that shows a similar trend to what you
are seeing. And then maybe see if tracing sched-switch might point to
an interesting difference between x86 and powerpc.
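
Roughly what I have in mind (a sketch; loop count reduced here only to keep
the trace manageable, adjust as needed):

  perf sched record -- hackbench -pipe 60 process 10000 loops
  perf sched latency          # per-task runtime, switch counts, wakeup latencies
  perf script | less          # raw sched_switch events; prev_state=R marks an
                              # involuntary (preemption) switch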


Thanks for all the detail.

Ankur

> # RCU Subsystem
> #
> CONFIG_TREE_RCU=y
> # CONFIG_RCU_EXPERT is not set
> CONFIG_TREE_SRCU=y
> CONFIG_NEED_SRCU_NMI_SAFE=y
> CONFIG_TASKS_RCU_GENERIC=y
> CONFIG_NEED_TASKS_RCU=y
> CONFIG_TASKS_RCU=y
> CONFIG_TASKS_RUDE_RCU=y
> CONFIG_TASKS_TRACE_RCU=y
> CONFIG_RCU_STALL_COMMON=y
> CONFIG_RCU_NEED_SEGCBLIST=y
> CONFIG_RCU_NOCB_CPU=y
> # CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
> # CONFIG_RCU_LAZY is not set
> # end of RCU Subsystem
>
>
> # Timers subsystem
> #
> CONFIG_TICK_ONESHOT=y
> CONFIG_NO_HZ_COMMON=y
> # CONFIG_HZ_PERIODIC is not set
> # CONFIG_NO_HZ_IDLE is not set
> CONFIG_NO_HZ_FULL=y
> CONFIG_CONTEXT_TRACKING_USER=y
> # CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
> CONFIG_NO_HZ=y
> CONFIG_HIGH_RES_TIMERS=y
> # end of Timers subsystem


--
ankur

2024-06-12 18:11:20

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v2 13/35] sched: allow runtime config for PREEMPT_AUTO

On Sat, Jun 08, 2024 at 05:46:26PM -0700, Ankur Arora wrote:
> Peter Zijlstra <[email protected]> writes:
> > On Thu, Jun 06, 2024 at 08:11:41AM -0700, Ankur Arora wrote:
> >> Peter Zijlstra <[email protected]> writes:
> >> > On Thu, May 30, 2024 at 02:29:45AM -0700, Ankur Arora wrote:
> >> >> Peter Zijlstra <[email protected]> writes:
> >> >> > On Mon, May 27, 2024 at 05:34:59PM -0700, Ankur Arora wrote:
> >> >> >> Reuse sched_dynamic_update() and related logic to enable choosing
> >> >> >> the preemption model at boot or runtime for PREEMPT_AUTO.
> >> >> >>
> >> >> >> The interface is identical to PREEMPT_DYNAMIC.
> >> >> >
> >> >> > Colour me confused, why?!? What are you doing and why aren't you just
> >> >> > adding AUTO to the existing DYNAMIC thing?
> >> >>
> >> >> You mean have a single __sched_dynamic_update()? AUTO doesn't use any
> >> >> of the static_call/static_key stuff so I'm not sure how that would work.
> >> >
> >> > *sigh*... see the below, seems to work.
> >>
> >> Sorry, didn't mean for you to have to do all that work to prove the
> >> point.
> >
> > Well, for a large part it was needed for me to figure out what your
> > patches were actually doing anyway. Peel away all the layers and this is
> > what remains.
> >
> >> I phrased it badly. I do understand how lazy can be folded in as
> >> you do here:
> >>
> >> > + case preempt_dynamic_lazy:
> >> > + if (!klp_override)
> >> > + preempt_dynamic_disable(cond_resched);
> >> > + preempt_dynamic_disable(might_resched);
> >> > + preempt_dynamic_enable(preempt_schedule);
> >> > + preempt_dynamic_enable(preempt_schedule_notrace);
> >> > + preempt_dynamic_enable(irqentry_exit_cond_resched);
> >> > + preempt_dynamic_key_enable(preempt_lazy);
> >> > + if (mode != preempt_dynamic_mode)
> >> > + pr_info("Dynamic Preempt: lazy\n");
> >> > + break;
> >> > }
> >>
> >> But, if the long term goal (at least as I understand it) is to get rid
> >> of cond_resched() -- to allow optimizations that needing to call cond_resched()
> >> makes impossible -- does it make sense to pull all of these together?
> >
> > It certainly doesn't make sense to add yet another configurable thing. We
> > have one, so yes add it here.
> >
> >> Say, eventually preempt_dynamic_lazy and preempt_dynamic_full are the
> >> only two models left. Then we will have (modulo figuring out how to
> >> switch over klp from cond_resched() to a different unwinding technique):
> >>
> >> static void __sched_dynamic_update(int mode)
> >> {
> >> preempt_dynamic_enable(preempt_schedule);
> >> preempt_dynamic_enable(preempt_schedule_notrace);
> >> preempt_dynamic_enable(irqentry_exit_cond_resched);
> >>
> >> switch (mode) {
> >> case preempt_dynamic_full:
> >> preempt_dynamic_key_disable(preempt_lazy);
> >> if (mode != preempt_dynamic_mode)
> >> pr_info("%s: full\n", PREEMPT_MODE);
> >> break;
> >>
> >> case preempt_dynamic_lazy:
> >> preempt_dynamic_key_enable(preempt_lazy);
> >> if (mode != preempt_dynamic_mode)
> >> pr_info("Dynamic Preempt: lazy\n");
> >> break;
> >> }
> >>
> >> preempt_dynamic_mode = mode;
> >> }
> >>
> >> Which is pretty similar to what the PREEMPT_AUTO code was doing.
> >
> > Right, but without duplicating all that stuff in the interim.
>
> Yeah, that makes sense. Joel had suggested something on these lines
> earlier [1], to which I was resistant.
>
> However, the duplication (and the fact that the voluntary model
> was quite thin) should have told me that (AUTO, preempt=voluntary)
> should just be folded under PREEMPT_DYNAMIC.
>
> I'll rework the series to do that.
>
> That should also simplify the RCU related choices, which I think Paul will
> like. Given that the lazy model is meant to eventually replace
> none/voluntary, the PREEMPT_RCU configuration can just be:
>
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -18,7 +18,7 @@ config TREE_RCU
>
> config PREEMPT_RCU
> bool
> - default y if PREEMPTION
> + default y if PREEMPTION && !PREEMPT_LAZY

Given that PREEMPT_DYNAMIC selects PREEMPT_BUILD which in turn selects
PREEMPTION, this should work.

> Or, maybe we should instead have this:
>
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -18,7 +18,7 @@ config TREE_RCU
>
> config PREEMPT_RCU
> bool
> - default y if PREEMPTION
> + default y if PREEMPT || PREEMPT_RT
> select TREE_RCU
>
> Though this would be a change in behaviour for current PREEMPT_DYNAMIC
> users.

Which I believe to be a no-go. I believe that PREEMPT_DYNAMIC users
really need their preemptible kernels to include preemptible RCU.

If PREEMPT_LAZY causes PREEMPT_DYNAMIC non-preemptible kernels to become
lazily preemptible, that is a topic to discuss with PREEMPT_DYNAMIC users.
On the other hand, if PREEMPT_LAZY does not cause PREEMPT_DYNAMIC
kernels to become lazily preemptible, then I would expect there to be
hard questions about removing cond_resched() and might_sleep(), or,
for that matter changing their semantics. Which I again must leave
to PREEMPT_DYNAMIC users.

Thanx, Paul

> [1] https://lore.kernel.org/lkml/[email protected]/
>
> Thanks
> --
> ankur

2024-06-15 15:05:33

by Shrikanth Hegde

[permalink] [raw]
Subject: Re: [PATCH v2 00/35] PREEMPT_AUTO: support lazy rescheduling



On 6/10/24 12:53 PM, Ankur Arora wrote:
>
>> From mpstat, I see slightly higher idle time and more irq time with preempt_auto.
>>
>> 6.10-rc1:
>> =========
>> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>> 09:45:23 AM all 4.14 0.00 77.57 0.00 16.92 0.00 0.00 0.00 0.00 1.37
>> 09:45:24 AM all 4.42 0.00 77.62 0.00 16.76 0.00 0.00 0.00 0.00 1.20
>> 09:45:25 AM all 4.43 0.00 77.45 0.00 16.94 0.00 0.00 0.00 0.00 1.18
>> 09:45:26 AM all 4.45 0.00 77.87 0.00 16.68 0.00 0.00 0.00 0.00 0.99
>>
>> PREEMPT_AUTO:
>> ===========
>> 10:09:50 AM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
>> 10:09:56 AM all 3.11 0.00 72.59 0.00 21.34 0.00 0.00 0.00 0.00 2.96
>> 10:09:57 AM all 3.31 0.00 73.10 0.00 20.99 0.00 0.00 0.00 0.00 2.60
>> 10:09:58 AM all 3.40 0.00 72.83 0.00 20.85 0.00 0.00 0.00 0.00 2.92
>> 10:10:00 AM all 3.21 0.00 72.87 0.00 21.19 0.00 0.00 0.00 0.00 2.73
>> 10:10:01 AM all 3.02 0.00 72.18 0.00 21.08 0.00 0.00 0.00 0.00 3.71
>>
>> Used the bcc tools hardirq and softirq to see if irqs are increasing. The softirq data implies there are
>> more timer and sched softirqs. Numbers vary between different samples, but the trend seems to be similar.
>
> Yeah, the %sys is lower and the %irq higher. Can you also see where the
> increased %irq is coming from? For instance, are the resched IPI counts greater?

Hi Ankur,


Used mpstat -I ALL to capture this info for 20 seconds.

HARDIRQ per second:
===================
6.10:
===================
18 19 22 23 48 49 50 51 LOC BCT LOC2 SPU PMI MCE NMI WDG DBL
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
417956.86 1114642.30 1712683.65 2058664.99 0.00 0.00 18.30 0.39 31978.37 0.00 0.35 351.98 0.00 0.00 0.00 6405.54 329189.45

Preempt_auto:
===================
18 19 22 23 48 49 50 51 LOC BCT LOC2 SPU PMI MCE NMI WDG DBL
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
609509.69 1910413.99 1923503.52 2061876.33 0.00 0.00 19.14 0.30 31916.59 0.00 0.45 497.88 0.00 0.00 0.00 6825.49 88247.85

18, 19, 22, 23 are XIVE interrupts; these are IPIs. I am not sure which type of IPI these are. Will have to see why they are increasing.
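
One way to attribute them might be the generic IPI tracepoints, assuming
ipi:ipi_send_cpu / ipi:ipi_send_cpumask are present and wired up on this
kernel and platform -- the callchains should separate resched IPIs from
call-function IPIs:

  # while hackbench runs in another shell
  perf record -a -g -e ipi:ipi_send_cpu,ipi:ipi_send_cpumask -- sleep 10
  perf report --stdio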


SOFTIRQ per second:
===================
6.10:
===================
HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
0.00 3966.47 0.00 18.25 0.59 0.00 0.34 12811.00 0.00 9693.95

Preempt_auto:
===================
HI TIMER NET_TX NET_RX BLOCK IRQ_POLL TASKLET SCHED HRTIMER RCU
0.00 4871.67 0.00 18.94 0.40 0.00 0.25 13518.66 0.00 15732.77

Note: RCU softirq seems to increase significantly. Not sure what triggers it; still trying to figure out why.
It may be irqs triggering softirqs, or softirqs causing more IPIs.
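
A sketch of one way to see who raises the RCU softirq, assuming bpftrace is
available (vec 9 is RCU_SOFTIRQ in the softirq enum):

  bpftrace -e 'tracepoint:irq:softirq_raise /args->vec == 9/ { @[kstack] = count(); }'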



Also, noticed the config difference below, where these options get dropped with preempt auto. This happens because PREEMPTION forces them to N. Made changes in kernel/Kconfig.locks to get them
enabled again, but I still see the same regression in hackbench. These configs may still need attention?

6.10 | preempt auto
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y | CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK=y | ----------------------------------------------------------------------------
CONFIG_INLINE_READ_UNLOCK_IRQ=y | ----------------------------------------------------------------------------
CONFIG_INLINE_WRITE_UNLOCK=y | ----------------------------------------------------------------------------
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y | ----------------------------------------------------------------------------
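
As an aside, the kernel's scripts/diffconfig helper is a convenient way to
spot such differences; a sketch, with placeholder names for the two saved
configs:

  # run from the kernel source tree
  scripts/diffconfig config-6.10 config-preempt-auto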


>
>> 6.10-rc1:
>> =========
>> SOFTIRQ TOTAL_usecs
>> tasklet 71
>> block 145
>> net_rx 7914
>> rcu 136988
>> timer 304357
>> sched 1404497
>>
>>
>>
>> PREEMPT_AUTO:
>> ===========
>> SOFTIRQ TOTAL_usecs
>> tasklet 80
>> block 139
>> net_rx 6907
>> rcu 223508
>> timer 492767
>> sched 1794441
>>
>>
>> Would any specific setting of RCU matter for this?
>> This is what I have in config.
>
> Don't see how it could matter unless the RCU settings are changing
> between the two tests? In my testing I'm also using TREE_RCU=y,
> PREEMPT_RCU=n.
>
> Let me see if I can find a test that shows a similar trend to what you
> are seeing. And then maybe see if tracing sched-switch might point to
> an interesting difference between x86 and powerpc.
>
>
> Thanks for all the detail.
>
> Ankur
>
>> # RCU Subsystem
>> #
>> CONFIG_TREE_RCU=y
>> # CONFIG_RCU_EXPERT is not set
>> CONFIG_TREE_SRCU=y
>> CONFIG_NEED_SRCU_NMI_SAFE=y
>> CONFIG_TASKS_RCU_GENERIC=y
>> CONFIG_NEED_TASKS_RCU=y
>> CONFIG_TASKS_RCU=y
>> CONFIG_TASKS_RUDE_RCU=y
>> CONFIG_TASKS_TRACE_RCU=y
>> CONFIG_RCU_STALL_COMMON=y
>> CONFIG_RCU_NEED_SEGCBLIST=y
>> CONFIG_RCU_NOCB_CPU=y
>> # CONFIG_RCU_NOCB_CPU_DEFAULT_ALL is not set
>> # CONFIG_RCU_LAZY is not set
>> # end of RCU Subsystem
>>
>>
>> # Timers subsystem
>> #
>> CONFIG_TICK_ONESHOT=y
>> CONFIG_NO_HZ_COMMON=y
>> # CONFIG_HZ_PERIODIC is not set
>> # CONFIG_NO_HZ_IDLE is not set
>> CONFIG_NO_HZ_FULL=y
>> CONFIG_CONTEXT_TRACKING_USER=y
>> # CONFIG_CONTEXT_TRACKING_USER_FORCE is not set
>> CONFIG_NO_HZ=y
>> CONFIG_HIGH_RES_TIMERS=y
>> # end of Timers subsystem
>
>
> --
> ankur