Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18;
Date:   Tue, 28 Sep 2021 09:35:46 +0100
From:   Mark Rutland <mark.rutland@arm.com>
To:     "Paul E. McKenney" <paulmck@kernel.org>
Cc:     Pingfan Liu <kernelfans@gmail.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        linux-arm-kernel@lists.infradead.org,
        Catalin Marinas <catalin.marinas@arm.com>,
        Will Deacon <will@kernel.org>, Marc Zyngier <maz@kernel.org>,
        Joey Gouly <joey.gouly@arm.com>,
        Sami Tolvanen <samitolvanen@google.com>,
        Julien Thierry <julien.thierry@arm.com>,
        Yuichi Ito <ito-yuichi@fujitsu.com>,
        linux-kernel@vger.kernel.org, Sven Schnelle <svens@linux.ibm.com>,
        Vasily Gorbik <gor@linux.ibm.com>
Subject: Re: [PATCHv2 0/5] arm64/irqentry: remove duplicate housekeeping of
Message-ID: <20210928083546.GB1924@C02TD0UTHF1T.local>
References: <20210924132837.45994-1-kernelfans@gmail.com>
 <20210924173615.GA42068@C02TD0UTHF1T.local>
 <20210924225954.GN880162@paulmck-ThinkPad-P17-Gen-1>
 <20210927092303.GC1131@C02TD0UTHF1T.local>
 <20210928000922.GY880162@paulmck-ThinkPad-P17-Gen-1>
 <20210928083222.GA1924@C02TD0UTHF1T.local>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20210928083222.GA1924@C02TD0UTHF1T.local>
Precedence: bulk

On Tue, Sep 28, 2021 at 09:32:22AM +0100, Mark Rutland wrote:
> On Mon, Sep 27, 2021 at 05:09:22PM -0700, Paul E. McKenney wrote:
> > On Mon, Sep 27, 2021 at 10:23:18AM +0100, Mark Rutland wrote:
> > > On Fri, Sep 24, 2021 at 03:59:54PM -0700, Paul E. McKenney wrote:
> > > > On Fri, Sep 24, 2021 at 06:36:15PM +0100, Mark Rutland wrote:
> > > > > [Adding Paul for RCU, s390 folk for entry code RCU semantics]
> > > > > 
> > > > > On Fri, Sep 24, 2021 at 09:28:32PM +0800, Pingfan Liu wrote:
> > > > > > After introducing arm64/kernel/entry_common.c which is akin to
> > > > > > kernel/entry/common.c , the housekeeping of rcu/trace are done twice as
> > > > > > the following:
> > > > > >     enter_from_kernel_mode()->rcu_irq_enter().
> > > > > > And
> > > > > >     gic_handle_irq()->...->handle_domain_irq()->irq_enter()->rcu_irq_enter()
> > > > > >
> > > > > > Besides redundance, based on code analysis, the redundance also raise
> > > > > > some mistake, e.g.  rcu_data->dynticks_nmi_nesting inc 2, which causes
> > > > > > rcu_is_cpu_rrupt_from_idle() unexpected.
> > > > > 
> > > > > Hmmm...
> > > > > 
> > > > > The fundamental questionss are:
> > > > > 
> > > > > 1) Who is supposed to be responsible for doing the rcu entry/exit?
> > > > > 
> > > > > 2) Is it supposed to matter if this happens multiple times?
> > > > > 
> > > > > For (1), I'd generally expect that this is supposed to happen in the
> > > > > arch/common entry code, since that itself (or the irqchip driver) could
> > > > > depend on RCU, and if that's the case thatn handle_domain_irq()
> > > > > shouldn't need to call rcu_irq_enter(). That would be consistent with
> > > > > the way we handle all other exceptions.
> > > > > 
> > > > > For (2) I don't know whether the level of nesting is suppoosed to
> > > > > matter. I was under the impression it wasn't meant to matter in general,
> > > > > so I'm a little surprised that rcu_is_cpu_rrupt_from_idle() depends on a
> > > > > specific level of nesting.
> > > > > 
> > > > > >From a glance it looks like this would cause rcu_sched_clock_irq() to
> > > > > skip setting TIF_NEED_RESCHED, and to not call invoke_rcu_core(), which
> > > > > doesn't sound right, at least...
> > > > > 
> > > > > Thomas, Paul, thoughts?
> > > > 
> > > > It is absolutely required that rcu_irq_enter() and rcu_irq_exit() calls
> > > > be balanced.  Normally, this is taken care of by the fact that irq_enter()
> > > > invokes rcu_irq_enter() and irq_exit() invokes rcu_irq_exit().  Similarly,
> > > > nmi_enter() invokes rcu_nmi_enter() and nmi_exit() invokes rcu_nmi_exit().
> > > 
> > > Sure; I didn't mean to suggest those weren't balanced! The problem here
> > > is *nesting*. Due to the structure of our entry code and the core IRQ
> > > code, when handling an IRQ we have a sequence:
> > > 
> > > 	irq_enter() // arch code
> > > 	irq_enter() // irq code
> > > 
> > > 	< irq handler here >
> > > 
> > > 	irq_exit() // irq code
> > > 	irq_exit() // arch code
> > > 
> > > ... and if we use something like rcu_is_cpu_rrupt_from_idle() in the
> > > middle (e.g. as part of rcu_sched_clock_irq()), this will not give the
> > > expected result because of the additional nesting, since
> > > rcu_is_cpu_rrupt_from_idle() seems to expect that dynticks_nmi_nesting
> > > is only incremented once per exception entry, when it does:
> > > 
> > > 	/* Are we at first interrupt nesting level? */
> > > 	nesting = __this_cpu_read(rcu_data.dynticks_nmi_nesting);
> > > 	if (nesting > 1)
> > > 		return false;
> > > 
> > > What I'm trying to figure out is whether that expectation is legitimate,
> > > and assuming so, where the entry/exit should happen.
> > 
> > Oooh...
> > 
> > The penalty for fooling rcu_is_cpu_rrupt_from_idle() is that RCU will
> > be unable to detect a userspace quiescent state for a non-nohz_full
> > CPU.  That could result in RCU CPU stall warnings if a user task runs
> > continuously on a given CPU for more than 21 seconds (60 seconds in
> > some distros).  And this can easily happen if the user has a CPU-bound
> > thread that is the only runnable task on that CPU.
> > 
> > So, yes, this does need some sort of resolution.
> > 
> > The traditional approach is (as you surmise) to have only a single call
> > to irq_enter() on exception entry and only a single call to irq_exit()
> > on exception exit.  If this is feasible, it is highly recommended.
> 
> Cool; that's roughly what I was expecting / hoping to hear!
> 
> > In theory, we could have that "1" in "nesting > 1" be a constant supplied
> > by the architecture (you would want "3" if I remember correctly) but
> > in practice could we please avoid this?  For one thing, if there is
> > some other path into the kernel for your architecture that does only a
> > single irq_enter(), then rcu_is_cpu_rrupt_from_idle() just doesn't stand
> > a chance.  It would need to compare against a different value depending
> > on what exception showed up.  Even if that cannot happen, it would be
> > better if your architecture could remain in blissful ignorance of the
> > colorful details of ->dynticks_nmi_nesting manipulations.
> 
> I completely agree. I think it's much harder to keep that in check than
> to enforce a "once per architectural exception" policy in the arch code.
> 
> > Another approach would be for the arch code to supply RCU a function that
> > it calls.  If there is such a function (or perhaps better, if some new
> > Kconfig option is enabled), RCU invokes it.  Otherwise, it compares to
> > "1" as it does now.  But you break it, you buy it!  ;-)
> 
> I guess we could look at the exception regs and inspect the original
> context, but it sounds overkill...
> 
> I think the cleanest thing is to leave this to arch code, and have the
> common IRQ code stay well clear. Unfortunately most architectures
> (including arch/arm) still need the common IRQ code to handle this, so
> we'll have to make that conditional on Kconfig, something like the below
> (build+boot tested only).
> 
> If there are no objections, I'll go check who else needs the same
> treatment (IIUC at least s390 will), and spin that as a real
> patch/series.

Ah, looking again this is basically Pinfan's patch 2, so ignore the
below, and I'll review Pingfan's patch instead.

Thanks,
Mark.