dsb(ishst) should be enough here as we only need to guarantee the
visibility of data to other CPUs in smp inner domain before we
send the ipi.
Signed-off-by: Barry Song <[email protected]>
---
drivers/irqchip/irq-gic-v3.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 5e935d97207d..0efe1a9a9f3b 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
* Ensure that stores to Normal memory are visible to the
* other CPUs before issuing the IPI.
*/
- wmb();
+ dsb(ishst);
for_each_cpu(cpu, mask) {
u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
--
2.25.1
On Sat, 19 Feb 2022 at 10:57, Marc Zyngier <[email protected]> wrote:
>
> On 2022-02-18 21:55, Barry Song wrote:
> > dsb(ishst) should be enough here as we only need to guarantee the
> > visibility of data to other CPUs in smp inner domain before we
> > send the ipi.
> >
> > Signed-off-by: Barry Song <[email protected]>
> > ---
> > drivers/irqchip/irq-gic-v3.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/irqchip/irq-gic-v3.c
> > b/drivers/irqchip/irq-gic-v3.c
> > index 5e935d97207d..0efe1a9a9f3b 100644
> > --- a/drivers/irqchip/irq-gic-v3.c
> > +++ b/drivers/irqchip/irq-gic-v3.c
> > @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data
> > *d, const struct cpumask *mask)
> > * Ensure that stores to Normal memory are visible to the
> > * other CPUs before issuing the IPI.
> > */
> > - wmb();
> > + dsb(ishst);
> >
> > for_each_cpu(cpu, mask) {
> > u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
>
> I'm not opposed to that change, but I'm pretty curious whether this
> makes
> any visible difference in practice. Could you measure the effect of this
> change
> for any sort of IPI heavy workload?
>
Does this have to be a DSB ?
>> + dsb(ishst);
>>
>> for_each_cpu(cpu, mask) {
>> u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
>
> I'm not opposed to that change, but I'm pretty curious whether this
> makes
> any visible difference in practice. Could you measure the effect of this
> change
> for any sort of IPI heavy workload?
>
> Thanks,
>
> M.
In practice, at least I don't see much difference on the hardware I am
using. So the result probably depends on the implementaion of the real
hardwares.
I wrote a micro benchmark to measure the latency w/ and w/o the patch on
kunpeng920 with 96 cores(2 socket, each socket has 2dies, each die has
24 cores, cpu0-cpu47 belong to socket0, cpu48-95 belong to socket1) by
sending IPI to cpu0-cpu95 1000 times from an specified cpu:
#include <linux/module.h>
#include <linux/timekeeping.h>
static void ipi_latency_func(void *val)
{
}
static int __init ipi_latency_init(void)
{
ktime_t stime, etime, delta;
int cpu, i;
int start = smp_processor_id();
stime = ktime_get();
for ( i = 0; i < 1000; i++)
for (cpu = 0; cpu < 96; cpu++)
smp_call_function_single(cpu, ipi_latency_func, NULL, 1);
etime = ktime_get();
delta = ktime_sub(etime, stime);
printk("%s ipi from cpu%d to cpu0-95 delta of 1000times:%lld\n",
__func__, start, delta);
return 0;
}
module_init(ipi_latency_init);
static void ipi_latency_exit(void)
{
}
module_exit(ipi_latency_exit);
MODULE_DESCRIPTION("IPI benchmark");
MODULE_LICENSE("GPL");
do the below 10times:
# taskset -c 0 insmod test.ko
# rmmod test
and the below 10times:
# taskset -c 48 insmod test.ko
# rmmod test
by taskset -c, I can change the source cpu sending IPI.
The result is as below:
vanilla kernel:
[ 103.391684] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122237009
[ 103.537256] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121364329
[ 103.681276] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121420160
[ 103.826254] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122392403
[ 103.970209] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122371262
[ 104.113879] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122041254
[ 104.257444] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121594453
[ 104.402432] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122592556
[ 104.561434] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121601214
[ 104.705561] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121732767
[ 124.592944] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147048939
[ 124.779280] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147467842
[ 124.958162] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146448676
[ 125.129253] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:141537482
[ 125.298848] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147161504
[ 125.471531] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147833787
[ 125.643133] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147438445
[ 125.814530] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146806172
[ 125.989677] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145971002
[ 126.159497] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147780655
patched kernel:
[ 428.828167] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122195849
[ 428.970822] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122361042
[ 429.111058] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122528494
[ 429.257704] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121155045
[ 429.410186] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121608565
[ 429.570171] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121613673
[ 429.718181] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121593737
[ 429.862615] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121953875
[ 430.002796] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122102741
[ 430.142741] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122005473
[ 516.642812] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145610926
[ 516.817002] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145878266
[ 517.004665] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145602966
[ 517.188758] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145658672
[ 517.372409] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:141329497
[ 517.557313] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146323829
[ 517.733107] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146015196
[ 517.921491] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146439231
[ 518.093129] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146106916
[ 518.264162] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145097868
So there is no much difference between vanilla and patched kernel.
What really makes me worried about my hardware is that IPI sent from the second socket
always shows worse performance than the first socket. This seems to be a problem
worth investigation.
Thanks
Barry
> So there is no much difference between vanilla and patched kernel.
Sorry, let me correct it.
I realize I should write some data before sending IPI. So I have changed the module
to be as below:
#include <linux/module.h>
#include <linux/timekeeping.h>
volatile int data0 ____cacheline_aligned;
volatile int data1 ____cacheline_aligned;
volatile int data2 ____cacheline_aligned;
volatile int data3 ____cacheline_aligned;
volatile int data4 ____cacheline_aligned;
volatile int data5 ____cacheline_aligned;
volatile int data6 ____cacheline_aligned;
static void ipi_latency_func(void *val)
{
}
static int __init ipi_latency_init(void)
{
ktime_t stime, etime, delta;
int cpu, i;
int start = smp_processor_id();
stime = ktime_get();
for ( i = 0; i < 1000; i++)
for (cpu = 0; cpu < 96; cpu++) {
data0 = data1 = data2 = data3 = data4 = data5 = data6 = cpu;
smp_call_function_single(cpu, ipi_latency_func, NULL, 1);
}
etime = ktime_get();
delta = ktime_sub(etime, stime);
printk("%s ipi from cpu%d to cpu0-95 delta of 1000times:%lld\n",
__func__, start, delta);
return 0;
}
module_init(ipi_latency_init);
static void ipi_latency_exit(void)
{
}
module_exit(ipi_latency_exit);
MODULE_DESCRIPTION("IPI benchmark");
MODULE_LICENSE("GPL");
after that, I can see ~1% difference between patched kernel and vanilla:
vanilla:
[ 375.220131] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126757449
[ 375.382596] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126784249
[ 375.537975] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126177703
[ 375.686823] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:127022281
[ 375.849967] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126184883
[ 375.999173] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:127374585
[ 376.149565] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:125778089
[ 376.298743] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126974441
[ 376.451125] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:127357625
[ 376.606006] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126228184
[ 371.405378] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151851181
[ 371.591642] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151568608
[ 371.767906] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151853441
[ 371.944031] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:152065453
[ 372.114085] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146122093
[ 372.291345] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151379636
[ 372.459812] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151854411
[ 372.629708] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145750720
[ 372.807574] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151629448
[ 372.994979] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151050253
patched kernel:
[ 105.598815] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:124467401
[ 105.748368] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123474209
[ 105.900400] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123558497
[ 106.043890] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122993951
[ 106.191845] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122984223
[ 106.348215] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123323609
[ 106.501448] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:124507583
[ 106.656358] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123386963
[ 106.804367] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123340664
[ 106.956331] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123285324
[ 108.930802] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:143616067
[ 109.094750] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:148969821
[ 109.267428] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:149648418
[ 109.443274] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:149448903
[ 109.621760] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147882917
[ 109.794611] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:148700282
[ 109.975197] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:149050595
[ 110.141543] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:143566604
[ 110.315213] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:149202898
[ 110.491008] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:148958261
as you can see, while cpu0 is the source, vanilla takes 125xxxxxx-127xxxxxx ns, patched
kernel takes 122xxxxxx-124xxxxxx ns.
Thanks
Barry
On 2022-02-20 13:30, Ard Biesheuvel wrote:
> On Sat, 19 Feb 2022 at 10:57, Marc Zyngier <[email protected]> wrote:
>>
>> On 2022-02-18 21:55, Barry Song wrote:
>> > dsb(ishst) should be enough here as we only need to guarantee the
>> > visibility of data to other CPUs in smp inner domain before we
>> > send the ipi.
>> >
>> > Signed-off-by: Barry Song <[email protected]>
>> > ---
>> > drivers/irqchip/irq-gic-v3.c | 2 +-
>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/irqchip/irq-gic-v3.c
>> > b/drivers/irqchip/irq-gic-v3.c
>> > index 5e935d97207d..0efe1a9a9f3b 100644
>> > --- a/drivers/irqchip/irq-gic-v3.c
>> > +++ b/drivers/irqchip/irq-gic-v3.c
>> > @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data
>> > *d, const struct cpumask *mask)
>> > * Ensure that stores to Normal memory are visible to the
>> > * other CPUs before issuing the IPI.
>> > */
>> > - wmb();
>> > + dsb(ishst);
>> >
>> > for_each_cpu(cpu, mask) {
>> > u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
>>
>> I'm not opposed to that change, but I'm pretty curious whether this
>> makes
>> any visible difference in practice. Could you measure the effect of
>> this
>> change
>> for any sort of IPI heavy workload?
>>
>
> Does this have to be a DSB ?
We can use a DMB ISHST for the other interrupt controllers that use a
MMIO access to signal the IPI, as we need to order the previous writes
with
the MMIO access (and DMB fits that bill).
For GICv3 when ICC_SRE_EL1.SRE==1, we need to order a set of writes with
a system register access with side effects, and the only barrier that
allows us to do that is DSB.
It is a bit unfortunate that the relative lightweight nature of the
sysreg
CPU interface is encumbered by fairly heavy barriers (the most horrible
one
being the DSB SY on each read of IAR to be able to synchronise the CPU
interface
and the redistributor).
Thanks,
M.
--
Jazz is not dead. It just smells funny...
On 2022-02-18 21:55, Barry Song wrote:
> dsb(ishst) should be enough here as we only need to guarantee the
> visibility of data to other CPUs in smp inner domain before we
> send the ipi.
>
> Signed-off-by: Barry Song <[email protected]>
> ---
> drivers/irqchip/irq-gic-v3.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3.c
> b/drivers/irqchip/irq-gic-v3.c
> index 5e935d97207d..0efe1a9a9f3b 100644
> --- a/drivers/irqchip/irq-gic-v3.c
> +++ b/drivers/irqchip/irq-gic-v3.c
> @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data
> *d, const struct cpumask *mask)
> * Ensure that stores to Normal memory are visible to the
> * other CPUs before issuing the IPI.
> */
> - wmb();
> + dsb(ishst);
>
> for_each_cpu(cpu, mask) {
> u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
I'm not opposed to that change, but I'm pretty curious whether this
makes
any visible difference in practice. Could you measure the effect of this
change
for any sort of IPI heavy workload?
Thanks,
M.
--
Jazz is not dead. It just smells funny...
On 2022-02-20 15:04, Russell King (Oracle) wrote:
> On Sat, Feb 19, 2022 at 05:55:49AM +0800, Barry Song wrote:
>> dsb(ishst) should be enough here as we only need to guarantee the
>> visibility of data to other CPUs in smp inner domain before we
>> send the ipi.
>>
>> Signed-off-by: Barry Song <[email protected]>
>> ---
>> drivers/irqchip/irq-gic-v3.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/irqchip/irq-gic-v3.c
>> b/drivers/irqchip/irq-gic-v3.c
>> index 5e935d97207d..0efe1a9a9f3b 100644
>> --- a/drivers/irqchip/irq-gic-v3.c
>> +++ b/drivers/irqchip/irq-gic-v3.c
>> @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data
>> *d, const struct cpumask *mask)
>> * Ensure that stores to Normal memory are visible to the
>> * other CPUs before issuing the IPI.
>> */
>> - wmb();
>> + dsb(ishst);
>
> On ARM, wmb() is a dsb(st) followed by other operations which may
> include a sync operation at the L2 cache, and SoC specific barriers
> for the bus. Hopefully, nothing will break if the sledge hammer is
> replaced by the tack hammer.
The saving grace is that ARMv8 forbids (as per D4.4.11) these
SW-visible,
non architected caches (something like PL310 simply isn't allowed).
Given
that GICv3 requires ARMv8 the first place, we should be OK.
As for SoC-specific bus barriers, I don't know of any that would affect
an ARMv8 based SoC. But I'm always prepared to be badly surprised...
M.
--
Jazz is not dead. It just smells funny...
On Mon, Feb 21, 2022 at 4:05 AM Russell King (Oracle)
<[email protected]> wrote:
>
> On Sun, Feb 20, 2022 at 02:30:24PM +0100, Ard Biesheuvel wrote:
> > On Sat, 19 Feb 2022 at 10:57, Marc Zyngier <[email protected]> wrote:
> > >
> > > On 2022-02-18 21:55, Barry Song wrote:
> > > > dsb(ishst) should be enough here as we only need to guarantee the
> > > > visibility of data to other CPUs in smp inner domain before we
> > > > send the ipi.
> > > >
> > > > Signed-off-by: Barry Song <[email protected]>
> > > > ---
> > > > drivers/irqchip/irq-gic-v3.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/irqchip/irq-gic-v3.c
> > > > b/drivers/irqchip/irq-gic-v3.c
> > > > index 5e935d97207d..0efe1a9a9f3b 100644
> > > > --- a/drivers/irqchip/irq-gic-v3.c
> > > > +++ b/drivers/irqchip/irq-gic-v3.c
> > > > @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data
> > > > *d, const struct cpumask *mask)
> > > > * Ensure that stores to Normal memory are visible to the
> > > > * other CPUs before issuing the IPI.
> > > > */
> > > > - wmb();
> > > > + dsb(ishst);
> > > >
> > > > for_each_cpu(cpu, mask) {
> > > > u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
> > >
> > > I'm not opposed to that change, but I'm pretty curious whether this
> > > makes
> > > any visible difference in practice. Could you measure the effect of this
> > > change
> > > for any sort of IPI heavy workload?
> > >
> >
> > Does this have to be a DSB ?
>
> Are you suggesting that smp_wmb() may suffice (which is a dmb(ishst)) ?
usually smp_wmb() is ok, for example drivers/irqchip/irq-bcm2836.c:
static void bcm2836_arm_irqchip_ipi_send_mask(struct irq_data *d,
const struct cpumask *mask)
{
int cpu;
void __iomem *mailbox0_base = intc.base + LOCAL_MAILBOX0_SET0;
/*
* Ensure that stores to normal memory are visible to the
* other CPUs before issuing the IPI.
*/
smp_wmb();
for_each_cpu(cpu, mask)
writel_relaxed(BIT(d->hwirq), mailbox0_base + 16 * cpu);
}
but the instruction to send ipi for irq-gic-v3.c isn't a store, so
this driver has been
changed by this commit from dmb(ish) to dsb(sy):
commit 21ec30c0ef5234fb1039cc7c7737d885bf875a9e
Author: Shanker Donthineni <[email protected]>
irqchip/gic-v3: Use wmb() instead of smb_wmb() in gic_raise_softirq()
A DMB instruction can be used to ensure the relative order of only
memory accesses before and after the barrier. Since writes to system
registers are not memory operations, barrier DMB is not sufficient
for observability of memory accesses that occur before ICC_SGI1R_EL1
writes.
A DSB instruction ensures that no instructions that appear in program
order after the DSB instruction, can execute until the DSB instruction
has completed.
...
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index d71be9a1f9d2..d99cc07903ec 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -688,7 +688,7 @@ static void gic_raise_softirq(const struct cpumask
*mask, unsigned int irq)
* Ensure that stores to Normal memory are visible to the
* other CPUs before issuing the IPI.
*/
- smp_wmb();
+ wmb();
for_each_cpu(cpu, mask) {
u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
my understanding is that dtb(sy) is too strong as this case we are
asking data to
be observed by other CPUs in smp just as smp_wmb is doing that in other drivers
by dmb(isb). we are not requiring data is observed by everyone in the system.
>
> --
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
Thanks
Barry
On Mon, Feb 21, 2022 at 4:21 AM Marc Zyngier <[email protected]> wrote:
>
> On 2022-02-20 15:04, Russell King (Oracle) wrote:
> > On Sat, Feb 19, 2022 at 05:55:49AM +0800, Barry Song wrote:
> >> dsb(ishst) should be enough here as we only need to guarantee the
> >> visibility of data to other CPUs in smp inner domain before we
> >> send the ipi.
> >>
> >> Signed-off-by: Barry Song <[email protected]>
> >> ---
> >> drivers/irqchip/irq-gic-v3.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/irqchip/irq-gic-v3.c
> >> b/drivers/irqchip/irq-gic-v3.c
> >> index 5e935d97207d..0efe1a9a9f3b 100644
> >> --- a/drivers/irqchip/irq-gic-v3.c
> >> +++ b/drivers/irqchip/irq-gic-v3.c
> >> @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data
> >> *d, const struct cpumask *mask)
> >> * Ensure that stores to Normal memory are visible to the
> >> * other CPUs before issuing the IPI.
> >> */
> >> - wmb();
> >> + dsb(ishst);
> >
> > On ARM, wmb() is a dsb(st) followed by other operations which may
> > include a sync operation at the L2 cache, and SoC specific barriers
> > for the bus. Hopefully, nothing will break if the sledge hammer is
> > replaced by the tack hammer.
>
> The saving grace is that ARMv8 forbids (as per D4.4.11) these
> SW-visible,
> non architected caches (something like PL310 simply isn't allowed).
> Given
> that GICv3 requires ARMv8 the first place, we should be OK.
>
> As for SoC-specific bus barriers, I don't know of any that would affect
> an ARMv8 based SoC. But I'm always prepared to be badly surprised...2
i assume we have been safe since dsb(ish) has been widely used for
smp data visibility in arm64:
arch/arm64/include/asm/assembler.h: dsb ish
arch/arm64/include/asm/cacheflush.h: dsb(ish);
arch/arm64/include/asm/pgtable.h: dsb(ishst);
arch/arm64/include/asm/pgtable.h: dsb(ishst);
arch/arm64/include/asm/pgtable.h: dsb(ishst);
arch/arm64/include/asm/pgtable.h: dsb(ishst);
arch/arm64/include/asm/pgtable.h: * is doing a dsb(ishst), but that
penalises the fastpath.
arch/arm64/include/asm/smp.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: "dsb ish\n tlbi " #op, \
arch/arm64/include/asm/tlbflush.h: "dsb ish\n tlbi " #op ", %0", \
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/include/asm/tlbflush.h: dsb(ishst);
arch/arm64/include/asm/tlbflush.h: dsb(ish);
arch/arm64/kernel/alternative.c: dsb(ish);
arch/arm64/kernel/entry.S: dsb ish
arch/arm64/kernel/head.S: dsb ishst // Make zero page visible to PTW
arch/arm64/kernel/hibernate-asm.S: dsb ish /* wait for PoU cleaning to finish */
arch/arm64/kernel/hibernate-asm.S: dsb ish
arch/arm64/kernel/mte.c: dsb(ish);
arch/arm64/kernel/process.c: dsb(ish);
arch/arm64/kernel/sys_compat.c: dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/nvhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c: dsb(ish);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c: dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ish);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c: dsb(ish);
arch/arm64/mm/cache.S: dsb ishst
arch/arm64/mm/cache.S: dsb ishst
arch/arm64/mm/kasan_init.c: dsb(ishst);
arch/arm64/mm/mmu.c: * We need dsb(ishst) here to ensure the
page-table-walker sees
arch/arm64/mm/mmu.c: dsb(ishst);
arch/arm64/mm/proc.S: dsb ish
drivers/irqchip/irq-gic-v3-its.c: dsb(ishst);
drivers/irqchip/irq-gic-v3-its.c: dsb(ishst);
Plus, is it even a problem to arm since arm only requires soc-level barrier
for system-level memory barrier rather than smp-level barrier?
#if defined(CONFIG_ARM_DMA_MEM_BUFFERABLE) || defined(CONFIG_SMP)
#define mb() __arm_heavy_mb()
#define rmb() dsb()
#define wmb() __arm_heavy_mb(st)
#define dma_rmb() dmb(osh)
#define dma_wmb() dmb(oshst)
#else
#define mb() barrier()
#define rmb() barrier()
#define wmb() barrier()
#define dma_rmb() barrier()
#define dma_wmb() barrier()
#endif
#define __smp_mb() dmb(ish)
#define __smp_rmb() __smp_mb()
#define __smp_wmb() dmb(ishst)
I am not seeing arm requires soc-level mb for smp-level barrier though
the mandatory
barriers are using heavy_mb which has soc-level mb.
In this particular case, we are asking the data visibility for
smp-level not system-level. I am
not quite sure Russell's concern is relevant.
>
> M.
> --
> Jazz is not dead. It just smells funny...
Thanks
Barry
On Sat, Feb 19, 2022 at 05:55:49AM +0800, Barry Song wrote:
> dsb(ishst) should be enough here as we only need to guarantee the
> visibility of data to other CPUs in smp inner domain before we
> send the ipi.
>
> Signed-off-by: Barry Song <[email protected]>
> ---
> drivers/irqchip/irq-gic-v3.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> index 5e935d97207d..0efe1a9a9f3b 100644
> --- a/drivers/irqchip/irq-gic-v3.c
> +++ b/drivers/irqchip/irq-gic-v3.c
> @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
> * Ensure that stores to Normal memory are visible to the
> * other CPUs before issuing the IPI.
> */
> - wmb();
> + dsb(ishst);
On ARM, wmb() is a dsb(st) followed by other operations which may
include a sync operation at the L2 cache, and SoC specific barriers
for the bus. Hopefully, nothing will break if the sledge hammer is
replaced by the tack hammer.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
On Sun, Feb 20, 2022 at 02:30:24PM +0100, Ard Biesheuvel wrote:
> On Sat, 19 Feb 2022 at 10:57, Marc Zyngier <[email protected]> wrote:
> >
> > On 2022-02-18 21:55, Barry Song wrote:
> > > dsb(ishst) should be enough here as we only need to guarantee the
> > > visibility of data to other CPUs in smp inner domain before we
> > > send the ipi.
> > >
> > > Signed-off-by: Barry Song <[email protected]>
> > > ---
> > > drivers/irqchip/irq-gic-v3.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/irqchip/irq-gic-v3.c
> > > b/drivers/irqchip/irq-gic-v3.c
> > > index 5e935d97207d..0efe1a9a9f3b 100644
> > > --- a/drivers/irqchip/irq-gic-v3.c
> > > +++ b/drivers/irqchip/irq-gic-v3.c
> > > @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data
> > > *d, const struct cpumask *mask)
> > > * Ensure that stores to Normal memory are visible to the
> > > * other CPUs before issuing the IPI.
> > > */
> > > - wmb();
> > > + dsb(ishst);
> > >
> > > for_each_cpu(cpu, mask) {
> > > u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
> >
> > I'm not opposed to that change, but I'm pretty curious whether this
> > makes
> > any visible difference in practice. Could you measure the effect of this
> > change
> > for any sort of IPI heavy workload?
> >
>
> Does this have to be a DSB ?
Are you suggesting that smp_wmb() may suffice (which is a dmb(ishst)) ?
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!