From: Anup Patel
Date: Tue, 29 Nov 2022 19:45:24 +0530
Subject: Re: [PATCH v12 3/7] genirq: Add mechanism to multiplex a single HW IPI
To: Marc Zyngier
Cc: Anup Patel, Palmer Dabbelt, Paul Walmsley, Thomas Gleixner,
    Daniel Lezcano, Atish Patra, Alistair Francis,
    linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org

On Mon, Nov 28, 2022 at 5:00 PM Marc Zyngier wrote:
>
> On Mon, 28 Nov 2022 11:13:30 +0000,
> Anup Patel wrote:
> >
> > On Mon, Nov 28, 2022 at 4:04 PM Marc Zyngier wrote:
> > >
> > > On Sat, 26 Nov 2022 17:34:49 +0000,
> > > Anup Patel wrote:
> > > >
> > > > +static void ipi_mux_send_mask(struct irq_data *d, const struct cpumask *mask)
> > > > +{
> > > > +	u32 ibit = BIT(irqd_to_hwirq(d));
> > > > +	struct ipi_mux_cpu *icpu = this_cpu_ptr(ipi_mux_pcpu);
> > > > +	struct cpumask *send_mask = &icpu->send_mask;
> > > > +	unsigned long flags;
> > > > +	int cpu;
> > > > +
> > > > +	/*
> > > > +	 * We use send_mask as a per-CPU variable so disable local
> > > > +	 * interrupts to avoid being preempted.
> > > > +	 */
> > > > +	local_irq_save(flags);
> > >
> > > The correct way to avoid preemption is to use preempt_disable(), which
> > > is a lot cheaper than disabling interrupts on most architectures.
> >
> > Okay, I will update.
> >
> > >
> > > > +
> > > > +	cpumask_clear(send_mask);
> > >
> > > This thing is likely to be unnecessarily expensive on very large
> > > systems, as it is proportional to the number of CPUs.
> > >
> > > > +
> > > > +	for_each_cpu(cpu, mask) {
> > > > +		icpu = per_cpu_ptr(ipi_mux_pcpu, cpu);
> > > > +		atomic_or(ibit, &icpu->bits);
> > >
> > > The original code had an atomic_fetch_or_release() to allow eliding
> > > the IPI if the target interrupt was already pending. Why is that code
> > > gone? This is a pretty cheap and efficient optimisation.
> >
> > That optimization was causing RCU stalls on the QEMU RISC-V virt
> > machine with a large number of CPUs.
>
> Then there is a bug somewhere, either in the implementation of the
> atomic operations or in QEMU. Or maybe even in the original code
> (though this looks unlikely given how heavily this is used on actual
> HW - I'm typing this email from one of these machines, and I'd be
> pretty annoyed if I was missing IPIs).
>
> In any case, please don't paper over this.

I was trying to defer the optimization to a later stage until this
issue was fixed for RISC-V. Anyway, I have found the root cause: a
missing broadcast timer initialization in time_init() for RISC-V.
Removing the optimization here was simply hiding that issue.

I will bring back the optimization in the next patch revision.

> >
> > >
> > > > +
> > > > +	/*
> > > > +	 * The atomic_or() above must complete before
> > > > +	 * the atomic_read() below to avoid racing with
> > > > +	 * ipi_mux_unmask().
> > > > +	 */
> > > > +	smp_mb__after_atomic();
> > > > +
> > > > +	if (atomic_read(&icpu->enable) & ibit)
> > > > +		cpumask_set_cpu(cpu, send_mask);
> > > > +	}
> > > > +
> > > > +	/* Trigger the parent IPI */
> > > > +	ipi_mux_send(send_mask);
> > >
> > > IPIs are very rarely made pending on more than a single CPU at a
> > > time. The overwhelming majority of them are targeting a single CPU. So
> > > accumulating bits to avoid doing two or more "send" actions only
> > > penalises the generic case.
> > >
> > > My conclusion is that this "send_mask" can probably be removed,
> > > together with the preemption fiddling.
> >
> > So, we should call ipi_mux_send() for one target CPU at a time?
>
> I think so, as it matches my measurements from a few years ago. It
> also simplifies things significantly, leading to better performance
> for the common case. Add some instrumentation and see whether this is
> still the case though.

I did not see any difference in hackbench results on QEMU RISC-V.
I will simplify ipi_mux_send() as you suggested.
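For clarity, the shape I have in mind for the next revision is roughly
the following (untested sketch; it assumes ipi_mux_send() is reworked
to take a single target CPU, and it folds the atomic_fetch_or_release()
optimization back in):

  static void ipi_mux_send_mask(struct irq_data *d, const struct cpumask *mask)
  {
  	u32 ibit = BIT(irqd_to_hwirq(d));
  	struct ipi_mux_cpu *icpu;
  	unsigned long pending;
  	int cpu;

  	for_each_cpu(cpu, mask) {
  		icpu = per_cpu_ptr(ipi_mux_pcpu, cpu);

  		/*
  		 * Mark the mux IPI pending and fetch the previous
  		 * state. Release semantics order this after the
  		 * sender's preceding shared memory accesses.
  		 */
  		pending = atomic_fetch_or_release(ibit, &icpu->bits);

  		/*
  		 * The atomic_fetch_or_release() above must complete
  		 * before the atomic_read() below to avoid racing with
  		 * ipi_mux_unmask().
  		 */
  		smp_mb__after_atomic();

  		/*
  		 * Elide the parent IPI if the mux IPI was already
  		 * pending; only send when it is currently enabled
  		 * on the target CPU.
  		 */
  		if (!(pending & ibit) && (atomic_read(&icpu->enable) & ibit))
  			ipi_mux_send(cpu);
  	}
  }

With send_mask gone there is no per-CPU scratch state left to protect,
so the local_irq_save()/preempt fiddling disappears as well.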
> >
> > >
> > > > +
> > > > +	local_irq_restore(flags);
> > > > +}
> > > > +
> > > > +static const struct irq_chip ipi_mux_chip = {
> > > > +	.name		= "IPI Mux",
> > > > +	.irq_mask	= ipi_mux_mask,
> > > > +	.irq_unmask	= ipi_mux_unmask,
> > > > +	.ipi_send_mask	= ipi_mux_send_mask,
> > > > +};
> > >
> > > OK, you have now dropped the superfluous pre/post handlers. But the
> > > need still exists. Case in point, the aic_handle_ipi() prologue and
> > > epilogue to the interrupt handling. I suggested last time that the
> > > driver could provide the actual struct irq_chip in order to provide
> > > the callbacks it requires.
> >
> > The aic_handle_ipi() can simply call ipi_mux_process() between
> > the prologue and epilogue.
>
> Hmm. OK. That's not what I had in mind, but fair enough.
>
> 	M.
>
> --
> Without deviation from the norm, progress is not possible.

Regards,
Anup
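P.S. To illustrate the aic_handle_ipi() point above, the pattern would
roughly be the following (sketch only; aic_ipi_prologue() and
aic_ipi_epilogue() are placeholder names standing in for the
driver-specific AIC register accesses, not real functions):

  static void aic_handle_ipi(struct pt_regs *regs)
  {
  	/* Prologue: acknowledge the hardware IPI (driver-specific) */
  	aic_ipi_prologue();

  	/* Demultiplex and handle all pending mux IPIs */
  	ipi_mux_process();

  	/* Epilogue: re-enable the hardware IPI (driver-specific) */
  	aic_ipi_epilogue();
  }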