2015-01-26 22:27:00

by Peter Oh

[permalink] [raw]
Subject: [PATCH] ath10k: Replace ioread with wmb for data sync

Using ioread() to perform data sync is excessive.
Use compact API, wmb(), that intended to be used for the case.
It reduces total 14 CPU clocks per interrupt.

Signed-off-by: Peter Oh <[email protected]>
---
drivers/net/wireless/ath/ath10k/pci.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/net/wireless/ath/ath10k/pci.c b/drivers/net/wireless/ath/ath10k/pci.c
index 3b40a86..c353a2c 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -346,10 +346,8 @@ static void ath10k_pci_disable_and_clear_legacy_irq(struct ath10k *ar)
ath10k_pci_write32(ar, SOC_CORE_BASE_ADDRESS + PCIE_INTR_CLR_ADDRESS,
PCIE_INTR_FIRMWARE_MASK | PCIE_INTR_CE_MASK_ALL);

- /* IMPORTANT: this extra read transaction is required to
- * flush the posted write buffer. */
- (void)ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS +
- PCIE_INTR_ENABLE_ADDRESS);
+ /* invoke data sync barrier */
+ wmb();
}

static void ath10k_pci_enable_legacy_irq(struct ath10k *ar)
@@ -358,10 +356,8 @@ static void ath10k_pci_enable_legacy_irq(struct ath10k *ar)
PCIE_INTR_ENABLE_ADDRESS,
PCIE_INTR_FIRMWARE_MASK | PCIE_INTR_CE_MASK_ALL);

- /* IMPORTANT: this extra read transaction is required to
- * flush the posted write buffer. */
- (void)ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS +
- PCIE_INTR_ENABLE_ADDRESS);
+ /* invoke data sync barrier */
+ wmb();
}

static inline const char *ath10k_pci_get_irq_method(struct ath10k *ar)
--
1.9.1



2015-01-31 02:03:29

by Sujith Manoharan

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync

Peter Oh wrote:
> Please refer the email thread that I mentioned about other architectures.
> (dsb is for ARM and other platforms have the equivalent instruction such
> as sfence, sync, mf, and dcs).

Ok.

> Also the patch is updated with 2nd patch set replacing wmb to mb.

Would be good to test this on a MIPS platform...

Sujith

2015-01-28 05:40:59

by Peter Oh

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync


On 01/27/2015 08:30 PM, Bob Copeland wrote:
> On Tue, Jan 27, 2015 at 03:53:00PM -0800, Peter Oh wrote:
>>>> - /* IMPORTANT: this extra read transaction is required to
>>>> - * flush the posted write buffer. */
>>>> - (void)ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS +
>>>> - PCIE_INTR_ENABLE_ADDRESS);
>>>> + /* invoke data sync barrier */
>>>> + wmb();
>>>> }
>>> I am no expert in arcane PCI matters, but that looks suspicious to me. I seem
>>> to recall wmb() only enforced ordering, and maybe not even memory-IO ordering
>>> on all platforms. If you want to disable an irq, it really seems like you
>>> would want to flush posted writes so you know the hardware has seen it.
>> enforced ordering is happened by flush write buffer and wmb is
>> commonly used to flush write buffer.
>> so that wmb guarantees ordering by flush write buffer. That's why
>> it's called a memory barrier.
> Ok, sure, but I/O ordering two different writes, and ensuring device
> has seen a posted write, are related but different things, no?
yes, they are different and wmb guarantees both.
> So if
> you are disabling an interrupt and want to be really sure interrupts
> are off when function returns, I think you still want the read.
the mechanism that read is using to make sure write work done is the
same mechanism that wmb is using.
> Anyway, it's your driver, just pointing out that the "IMPORTANT"
> read might still be important to someone.
I really appreciate your opinion.
>
Regards,
Peter

2015-01-30 22:53:50

by Peter Oh

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync

Hi,

On 01/27/2015 11:37 PM, Johannes Berg wrote:
> On Tue, 2015-01-27 at 21:39 -0800, Peter Oh wrote:
>
>>> Ok, sure, but I/O ordering two different writes, and ensuring device
>>> has seen a posted write, are related but different things, no?
>> yes, they are different and wmb guarantees both.
> No, wmb() doesn't. I'd be very surprised if it had any side effect on
> the PCI bus even on ARM. Read Documentation/memory-barriers.txt.
>
> And to understand (PCI) posted writes, wikipedia helps:
> http://en.wikipedia.org/wiki/Posted_write
I admit that I/O ordering and posted write are looked different in
theory and at glance since posted write could be related cache invalidate.
But wmb are still related to both.
As I addressed wmb uses dsb (in arm arch) and here is the description of
arm architecture.

* DSB drains write buffer.
* DSB is architecturally defined to include all cache, TLB and branch
prediction maintenance operations as well as explicit memory operations

These are the reasons why I mentioned wmb does both.

* captured from ARMv7 Architecture Manual
--- Notes ---
Historically, this operation was referred to as Drain Write Buffer or
Data Write Barrier (DWB). From ARMv6, these
names and the use of DWB were deprecated in favor of the new Data
Synchronization Barrier name and DSB
abbreviation. DSB better reflects the functionality provided from ARMv6,
because DSB is architecturally defined
to include all cache, TLB and branch prediction maintenance operations
as well as explicit memory operations

--- A DSB completes when: ---
? all explicit memory accesses that are observed by Pe before the DSB is
executed, are of the required access
types, and are from observers in the same required shareability domain
as Pe, are complete for the set of
observers in the required shareability domain.
? if the required accesses types of the DSB is reads and writes, all
cache and branch predictor maintenance
operations issued by Pe before the DSB are complete for the required
shareability domain.
? if the required accesses types of the DSB is reads and writes, all TLB
maintenance operations issued by Pe
before the DSB are complete for the required shareability domain.
--------------

Furthermore this is the comparison of the compiled assembly code between
ath10k_pci_read32 and wmb.

ath10k_pci_read32()
bac: e5932008 ldr r2, [r3, #8]
bb0: f57ff04f dsb sy
bb4: e2883d52 add r3, r8, #5248 ; 0x1480
bb8: e283303c add r3, r3, #60 ; 0x3c
bbc: e593300c ldr r3, [r3, #12]
bc0: e2833a09 add r3, r3, #36864 ; 0x9000

wmb();
b9c: f57ff04e dsb st

ath10k_pci_read32 does register operation except dsb and there is no
cache invalidate related commands.

So that if wmb is not enough for the purpose then ath10k_pci_read32 is
also not enough for that.

Also refer the section "ACQUIRES VS I/O ACCESSES" in memory-barriers.txt.

It gives an example with PCI bridge and introduces readl as an
alternative method to mmiowb which weaker form of wmb.

Please give your opinion.
> johannes
>
>
> _______________________________________________
> ath10k mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/ath10k
Regards,
Peter

2015-01-31 01:56:29

by Peter Oh

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync


On 01/30/2015 05:16 PM, Sujith Manoharan wrote:
> Peter Oh wrote:
>> As I addressed wmb uses dsb (in arm arch) and here is the description of
>> arm architecture.
> What about other architectures ?
Please refer the email thread that I mentioned about other architectures.
(dsb is for ARM and other platforms have the equivalent instruction such
as sfence, sync, mf, and dcs).
Also the patch is updated with 2nd patch set replacing wmb to mb.
> Sujith
>
> _______________________________________________
> ath10k mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/ath10k


2015-01-27 21:34:20

by Bob Copeland

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync

On Mon, Jan 26, 2015 at 02:25:18PM -0800, Peter Oh wrote:
> Using ioread() to perform data sync is excessive.
> Use compact API, wmb(), that intended to be used for the case.
> It reduces total 14 CPU clocks per interrupt.

Hi,

> ath10k_pci_write32(ar, SOC_CORE_BASE_ADDRESS + PCIE_INTR_CLR_ADDRESS,
> PCIE_INTR_FIRMWARE_MASK | PCIE_INTR_CE_MASK_ALL);
>
> - /* IMPORTANT: this extra read transaction is required to
> - * flush the posted write buffer. */
> - (void)ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS +
> - PCIE_INTR_ENABLE_ADDRESS);
> + /* invoke data sync barrier */
> + wmb();
> }

I am no expert in arcane PCI matters, but that looks suspicious to me. I seem
to recall wmb() only enforced ordering, and maybe not even memory-IO ordering
on all platforms. If you want to disable an irq, it really seems like you
would want to flush posted writes so you know the hardware has seen it.

--
Bob Copeland %% http://bobcopeland.com/

2015-01-31 01:13:19

by Sujith Manoharan

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync

Peter Oh wrote:
> As I addressed wmb uses dsb (in arm arch) and here is the description of
> arm architecture.

What about other architectures ?

Sujith

2015-01-28 04:30:37

by Bob Copeland

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync

On Tue, Jan 27, 2015 at 03:53:00PM -0800, Peter Oh wrote:
> >>- /* IMPORTANT: this extra read transaction is required to
> >>- * flush the posted write buffer. */
> >>- (void)ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS +
> >>- PCIE_INTR_ENABLE_ADDRESS);
> >>+ /* invoke data sync barrier */
> >>+ wmb();
> >> }
> >I am no expert in arcane PCI matters, but that looks suspicious to me. I seem
> >to recall wmb() only enforced ordering, and maybe not even memory-IO ordering
> >on all platforms. If you want to disable an irq, it really seems like you
> >would want to flush posted writes so you know the hardware has seen it.
> enforced ordering is happened by flush write buffer and wmb is
> commonly used to flush write buffer.
> so that wmb guarantees ordering by flush write buffer. That's why
> it's called a memory barrier.

Ok, sure, but I/O ordering two different writes, and ensuring device
has seen a posted write, are related but different things, no? So if
you are disabling an interrupt and want to be really sure interrupts
are off when function returns, I think you still want the read.

Anyway, it's your driver, just pointing out that the "IMPORTANT"
read might still be important to someone.

--
Bob Copeland %% http://bobcopeland.com/

2015-01-28 20:18:40

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync

On Tue, 2015-01-27 at 21:39 -0800, Peter Oh wrote:

> > Ok, sure, but I/O ordering two different writes, and ensuring device
> > has seen a posted write, are related but different things, no?
> yes, they are different and wmb guarantees both.

No, wmb() doesn't. I'd be very surprised if it had any side effect on
the PCI bus even on ARM. Read Documentation/memory-barriers.txt.

And to understand (PCI) posted writes, wikipedia helps:
http://en.wikipedia.org/wiki/Posted_write

johannes


2015-01-27 23:54:53

by Peter Oh

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync


On 01/27/2015 01:33 PM, Bob Copeland wrote:
> On Mon, Jan 26, 2015 at 02:25:18PM -0800, Peter Oh wrote:
>> Using ioread() to perform data sync is excessive.
>> Use compact API, wmb(), that intended to be used for the case.
>> It reduces total 14 CPU clocks per interrupt.
> Hi,
>
>> ath10k_pci_write32(ar, SOC_CORE_BASE_ADDRESS + PCIE_INTR_CLR_ADDRESS,
>> PCIE_INTR_FIRMWARE_MASK | PCIE_INTR_CE_MASK_ALL);
>>
>> - /* IMPORTANT: this extra read transaction is required to
>> - * flush the posted write buffer. */
>> - (void)ath10k_pci_read32(ar, SOC_CORE_BASE_ADDRESS +
>> - PCIE_INTR_ENABLE_ADDRESS);
>> + /* invoke data sync barrier */
>> + wmb();
>> }
> I am no expert in arcane PCI matters, but that looks suspicious to me. I seem
> to recall wmb() only enforced ordering, and maybe not even memory-IO ordering
> on all platforms. If you want to disable an irq, it really seems like you
> would want to flush posted writes so you know the hardware has seen it.
enforced ordering is happened by flush write buffer and wmb is commonly
used to flush write buffer.
so that wmb guarantees ordering by flush write buffer. That's why it's
called a memory barrier.
when wmb is used, it guarantees all memory accesses complete before wmb
command completes.
for instance dsb is the corresponding command for wmb in arm and arm
instruction guide says
"The DSB instruction completes when all explicit memory accesses before
it complete."
Which means DSB instruction will hold the bus (usually AXI bus) and
never returns until any memory or memory mapped I/O access complete
(dsb is for ARM and other platforms have the equivalent instruction such
as sfence, sync, mf, and dcs).
Thanks,
Peter

2015-02-02 19:22:24

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync


> > You basically have the following sequence:
> >
> > iowrite()
> > ioread()
> >
> > If you look, you'll see that iowrite() is actually done (or should be,
> > or perhaps with appropriate syncs) on an uncached mapping.
> since it's mmio, iowrite will be map to write, not out which is cached
> mapping.
> That's why we address "posted write" here.
> If it's un-cached mapping which is volatile, we don't even need ioread.

No, this isn't true - "posted write" in the context of this discussion
is about the PCIe bus. Memory writes that go through cache aren't
referred to as "posted writes", those are just (cached) memory writes.

> > As a result,
> > the only thing you care about here is the PCIe bus, not the CPU cache
> > flush. And from there on that's just a question of PCIe bus semantics.
> So how does ioread guarantee PCIe bus transaction done?

That's how PCIe works, operations are serialized, and read() has to wait
for a response from the device (but write doesn't - which is "posted
write")

johannes


2015-02-02 23:25:31

by Florian Fainelli

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync

On 02/02/15 14:06, Peter Oh wrote:
>
> On 02/02/2015 11:47 AM, Johannes Berg wrote:
>> On Mon, 2015-02-02 at 11:36 -0800, Peter Oh wrote:
>>> On 02/02/2015 11:22 AM, Johannes Berg wrote:
>>>>>> You basically have the following sequence:
>>>>>>
>>>>>> iowrite()
>>>>>> ioread()
>>>>>>
>>>>>> If you look, you'll see that iowrite() is actually done (or should
>> be,
>>>>>> or perhaps with appropriate syncs) on an uncached mapping.
>>>>> since it's mmio, iowrite will be map to write, not out which is
>> cached
>>>>> mapping.
>>>>> That's why we address "posted write" here.
>>>>> If it's un-cached mapping which is volatile, we don't even need
>> ioread.
>>>> No, this isn't true - "posted write" in the context of this discussion
>>>> is about the PCIe bus. Memory writes that go through cache aren't
>>>> referred to as "posted writes", those are just (cached) memory writes.
>>>>
>>>>>> As a result,
>>>>>> the only thing you care about here is the PCIe bus, not the CPU
>> cache
>>>>>> flush. And from there on that's just a question of PCIe bus
>> semantics.
>>>>> So how does ioread guarantee PCIe bus transaction done?
>>>> That's how PCIe works, operations are serialized, and read() has to
>> wait
>>>> for a response from the device
>>> do you know which mechanism or which instruction set makes read() wait
>>> for a response from the device?
>> I have no idea. I assume it's just like a DRAM read, the CPU stalls
>> while there's no response.
> My explanation in this thread is all about how read() guarantees the
> wait for a response from the device, therefore why mb() - replace from
> wmb at patch set 2 - is compatible to read().
> Briefly speaking,
> read() -> dsb 'st' -> cpu (actually axi master in cpu) holding axi bus
> -> cpu post write buffer on axi bus -> axi bus (axi slave which is PCIe
> device) signals write completion when write transactions completed in
> write response channel -> cpu release axi bus -> cpu program counter
> (pc) proceeds the next to read.
>
> the exact same routines happen with mb().
> mb() -> dsb 'st' -> cpu (actually axi master in cpu) holding axi bus ->
> cpu post write buffer on axi bus -> axi bus (axi slave which is PCIe
> device) signals write completion when write transactions completed in
> write response channel -> cpu release axi bus -> cpu program counter
> (pc) proceeds the next to read.
>
> Since axi bus master is waiting (blocking) for write completion signal
> from axi slave (PCIe device), this is how read() and mb() guarantee
> write command reaches to the device.

PCIe writes are posted, so the only guarantee you can have by inserting
such barriers is that writes from CPU to the PCIe RC (targeting PCIe
device) is non-posted (as far as the busing between CPU and the PCIe RC
is concerned), but past the PCIe RC, there is no such guarantee, because
the PCIe specification allows for that and there is flow control, PCIe
switches or other things that can alter the way your PCIe device ends-up
being written to.

The only way to make a "portable" synchronization barrier is to do a
PCIe read from the same register you just wrote to, because then, the
PCIe RC needs to guarantee the transaction ordering on the PCIe bus itself.

You might just be lucky and/or have very good HW which ensures that the
ARM synchronization barriers are propagated to the memory region where
your PCIe device BARs are mapped from the CPU perspective, but you
definitively cannot rely on such assumptions, as there will be bogus HW
there, for which only a subsequent ioread32() will work.
--
Florian

2015-02-02 23:50:01

by Peter Oh

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync

Hi Florian,

Very appreciate your explanation in detail.

Regards,
Peter
On 02/02/2015 03:25 PM, Florian Fainelli wrote:
> On 02/02/15 14:06, Peter Oh wrote:
>> On 02/02/2015 11:47 AM, Johannes Berg wrote:
>>> On Mon, 2015-02-02 at 11:36 -0800, Peter Oh wrote:
>>>> On 02/02/2015 11:22 AM, Johannes Berg wrote:
>>>>>>> You basically have the following sequence:
>>>>>>>
>>>>>>> iowrite()
>>>>>>> ioread()
>>>>>>>
>>>>>>> If you look, you'll see that iowrite() is actually done (or should
>>> be,
>>>>>>> or perhaps with appropriate syncs) on an uncached mapping.
>>>>>> since it's mmio, iowrite will be map to write, not out which is
>>> cached
>>>>>> mapping.
>>>>>> That's why we address "posted write" here.
>>>>>> If it's un-cached mapping which is volatile, we don't even need
>>> ioread.
>>>>> No, this isn't true - "posted write" in the context of this discussion
>>>>> is about the PCIe bus. Memory writes that go through cache aren't
>>>>> referred to as "posted writes", those are just (cached) memory writes.
>>>>>
>>>>>>> As a result,
>>>>>>> the only thing you care about here is the PCIe bus, not the CPU
>>> cache
>>>>>>> flush. And from there on that's just a question of PCIe bus
>>> semantics.
>>>>>> So how does ioread guarantee PCIe bus transaction done?
>>>>> That's how PCIe works, operations are serialized, and read() has to
>>> wait
>>>>> for a response from the device
>>>> do you know which mechanism or which instruction set makes read() wait
>>>> for a response from the device?
>>> I have no idea. I assume it's just like a DRAM read, the CPU stalls
>>> while there's no response.
>> My explanation in this thread is all about how read() guarantees the
>> wait for a response from the device, therefore why mb() - replace from
>> wmb at patch set 2 - is compatible to read().
>> Briefly speaking,
>> read() -> dsb 'st' -> cpu (actually axi master in cpu) holding axi bus
>> -> cpu post write buffer on axi bus -> axi bus (axi slave which is PCIe
>> device) signals write completion when write transactions completed in
>> write response channel -> cpu release axi bus -> cpu program counter
>> (pc) proceeds the next to read.
>>
>> the exact same routines happen with mb().
>> mb() -> dsb 'st' -> cpu (actually axi master in cpu) holding axi bus ->
>> cpu post write buffer on axi bus -> axi bus (axi slave which is PCIe
>> device) signals write completion when write transactions completed in
>> write response channel -> cpu release axi bus -> cpu program counter
>> (pc) proceeds the next to read.
>>
>> Since axi bus master is waiting (blocking) for write completion signal
>> from axi slave (PCIe device), this is how read() and mb() guarantee
>> write command reaches to the device.
> PCIe writes are posted, so the only guarantee you can have by inserting
> such barriers is that writes from CPU to the PCIe RC (targeting PCIe
> device) is non-posted (as far as the busing between CPU and the PCIe RC
> is concerned), but past the PCIe RC, there is no such guarantee, because
> the PCIe specification allows for that and there is flow control, PCIe
> switches or other things that can alter the way your PCIe device ends-up
> being written to.
>
> The only way to make a "portable" synchronization barrier is to do a
> PCIe read from the same register you just wrote to, because then, the
> PCIe RC needs to guarantee the transaction ordering on the PCIe bus itself.
>
> You might just be lucky and/or have very good HW which ensures that the
> ARM synchronization barriers are propagated to the memory region where
> your PCIe device BARs are mapped from the CPU perspective, but you
> definitively cannot rely on such assumptions, as there will be bogus HW
> there, for which only a subsequent ioread32() will work.


2015-02-02 19:15:56

by Peter Oh

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync


On 02/02/2015 10:54 AM, Johannes Berg wrote:
> On Mon, 2015-02-02 at 09:33 -0800, Peter Oh wrote:
>
>>> The code (as it is before your patch) implies that it's trying to make
>>> sure that before it continues, any previous writes to the PCIe
> device's
>>> registers are posted. The only way to ensure that is to do a read to
> the
>>> registers, as the code does now.
>> Do you know how the read ensure that although the read code does not
>> check the return value?
>> Can you explain how a read ensures that posted write reaches PCIe
> device?
>
> You basically have the following sequence:
>
> iowrite()
> ioread()
>
> If you look, you'll see that iowrite() is actually done (or should be,
> or perhaps with appropriate syncs) on an uncached mapping.
since it's mmio, iowrite will be map to write, not out which is cached
mapping.
That's why we address "posted write" here.
If it's un-cached mapping which is volatile, we don't even need ioread.
> As a result,
> the only thing you care about here is the PCIe bus, not the CPU cache
> flush. And from there on that's just a question of PCIe bus semantics.
So how does ioread guarantee PCIe bus transaction done?
>
> johannes
>
>
> _______________________________________________
> ath10k mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/ath10k
Regards,
Peter

2015-02-02 19:37:18

by Peter Oh

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync


On 02/02/2015 11:22 AM, Johannes Berg wrote:
>>> You basically have the following sequence:
>>>
>>> iowrite()
>>> ioread()
>>>
>>> If you look, you'll see that iowrite() is actually done (or should be,
>>> or perhaps with appropriate syncs) on an uncached mapping.
>> since it's mmio, iowrite will be map to write, not out which is cached
>> mapping.
>> That's why we address "posted write" here.
>> If it's un-cached mapping which is volatile, we don't even need ioread.
> No, this isn't true - "posted write" in the context of this discussion
> is about the PCIe bus. Memory writes that go through cache aren't
> referred to as "posted writes", those are just (cached) memory writes.
>
>>> As a result,
>>> the only thing you care about here is the PCIe bus, not the CPU cache
>>> flush. And from there on that's just a question of PCIe bus semantics.
>> So how does ioread guarantee PCIe bus transaction done?
> That's how PCIe works, operations are serialized, and read() has to wait
> for a response from the device
do you know which mechanism or which instruction set makes read() wait
for a response from the device?
> (but write doesn't - which is "posted
> write")
>
> johannes
>


2015-02-02 13:02:41

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync

On Fri, 2015-01-30 at 14:53 -0800, Peter Oh wrote:

> I admit that I/O ordering and posted write are looked different in
> theory and at glance since posted write could be related cache invalidate.
> But wmb are still related to both.
> As I addressed wmb uses dsb (in arm arch) and here is the description of
> arm architecture.
>
> * DSB drains write buffer.
> * DSB is architecturally defined to include all cache, TLB and branch
> prediction maintenance operations as well as explicit memory operations
>
> These are the reasons why I mentioned wmb does both.
>
> * captured from ARMv7 Architecture Manual
> --- Notes ---
> Historically, this operation was referred to as Drain Write Buffer or
> Data Write Barrier (DWB). From ARMv6, these
> names and the use of DWB were deprecated in favor of the new Data
> Synchronization Barrier name and DSB
> abbreviation. DSB better reflects the functionality provided from ARMv6,
> because DSB is architecturally defined
> to include all cache, TLB and branch prediction maintenance operations
> as well as explicit memory operations
>
> --- A DSB completes when: ---
> ? all explicit memory accesses that are observed by Pe before the DSB is
> executed, are of the required access
> types, and are from observers in the same required shareability domain
> as Pe, are complete for the set of
> observers in the required shareability domain.
> ? if the required accesses types of the DSB is reads and writes, all
> cache and branch predictor maintenance
> operations issued by Pe before the DSB are complete for the required
> shareability domain.
> ? if the required accesses types of the DSB is reads and writes, all TLB
> maintenance operations issued by Pe
> before the DSB are complete for the required shareability domain.
> --------------

I cannot read from this in any way that it can post writes to the PCIe
bus. In fact, architecturally, I cannot think of any reason how it even
could do that from the CPU.

> Furthermore this is the comparison of the compiled assembly code between
> ath10k_pci_read32 and wmb.
>
> ath10k_pci_read32()
> bac: e5932008 ldr r2, [r3, #8]
> bb0: f57ff04f dsb sy
> bb4: e2883d52 add r3, r8, #5248 ; 0x1480
> bb8: e283303c add r3, r3, #60 ; 0x3c
> bbc: e593300c ldr r3, [r3, #12]
> bc0: e2833a09 add r3, r3, #36864 ; 0x9000
>
> wmb();
> b9c: f57ff04e dsb st
>
> ath10k_pci_read32 does register operation except dsb and there is no
> cache invalidate related commands.

I don't think this is relevant. The question is "what are you trying to
achieve".

> So that if wmb is not enough for the purpose then ath10k_pci_read32 is
> also not enough for that.
>
> Also refer the section "ACQUIRES VS I/O ACCESSES" in memory-barriers.txt.
>
> It gives an example with PCI bridge and introduces readl as an
> alternative method to mmiowb which weaker form of wmb.
>
> Please give your opinion.

Again - the question is - what are you trying to achieve?

The code (as it is before your patch) implies that it's trying to make
sure that before it continues, any previous writes to the PCIe device's
registers are posted. The only way to ensure that is to do a read to the
registers, as the code does now.

What you're describing is something else entirely - you're describing a
way to make sure that some data was flushed out to DRAM from the CPU
caches.

These two things are not related in any way.

In an interrupt routine, it would make sense to ensure that the write
was posted (e.g. to mask interrupts, or to acknowledge them, or similar,
before the routine can be re-invoked.)

To me, flushing memory writes to DRAM makes less sense in an interrupt
handlers unless the device was somehow using DMA to coordinate
interrupts [1], which seems unlikely but I haven't checked.

Anyway - I have no particular interest in this discussion, I was merely
trying to help you out with this :) You can make whatever change you
want, of course :P

johannes

[1] incidentally, our device [iwlwifi] does in fact do something like
that, but it's read-only for the driver so no need for such a thing
either


2015-02-02 18:54:43

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync

On Mon, 2015-02-02 at 09:33 -0800, Peter Oh wrote:

> > The code (as it is before your patch) implies that it's trying to make
> > sure that before it continues, any previous writes to the PCIe device's
> > registers are posted. The only way to ensure that is to do a read to the
> > registers, as the code does now.

> Do you know how the read ensure that although the read code does not
> check the return value?
> Can you explain how a read ensures that posted write reaches PCIe device?

You basically have the following sequence:

iowrite()
ioread()

If you look, you'll see that iowrite() is actually done (or should be,
or perhaps with appropriate syncs) on an uncached mapping. As a result,
the only thing you care about here is the PCIe bus, not the CPU cache
flush. And from there on that's just a question of PCIe bus semantics.

johannes


2015-02-02 22:26:37

by Adrian Chadd

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync

On 30 January 2015 at 18:06, Sujith Manoharan <[email protected]> wrote:
> Peter Oh wrote:
>> Please refer the email thread that I mentioned about other architectures.
>> (dsb is for ARM and other platforms have the equivalent instruction such
>> as sfence, sync, mf, and dcs).
>
> Ok.
>
>> Also the patch is updated with 2nd patch set replacing wmb to mb.
>
> Would be good to test this on a MIPS platform...
>

The Atheros mips74k stuff I have here does /not/ flush all the writes
out to the device and guarantee the device has seen everything with a
memory barrier. Just saying. Various drivers ended up needing
ioread()s in my experiments; mips sync operations weren't enough.

So I'd suggest abstracting it out like the linux dri i915 code has -
they define a "posting read" macro which they use whenever they need
to ensure it's definitely made it all the way out to the hardware and
through internal FIFOs so internal hardware has seen the state change.
Then you can redefine that to your hearts content based on platform.



-adrian

2015-02-02 17:25:29

by Peter Oh

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync


On 01/30/2015 06:06 PM, Sujith Manoharan wrote:
> Peter Oh wrote:
>> Please refer the email thread that I mentioned about other
> architectures.
>> (dsb is for ARM and other platforms have the equivalent instruction such
>> as sfence, sync, mf, and dcs).
> Ok.
>
>> Also the patch is updated with 2nd patch set replacing wmb to mb.
> Would be good to test this on a MIPS platform...
Agree. I'll do the investigation and test.
> Sujith
>
> _______________________________________________
> ath10k mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/ath10k
Thanks,
Peter

2015-02-02 17:34:09

by Peter Oh

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync


On 02/02/2015 05:02 AM, Johannes Berg wrote:
> On Fri, 2015-01-30 at 14:53 -0800, Peter Oh wrote:
>
>> I admit that I/O ordering and posted write are looked different in
>> theory and at glance since posted write could be related cache
> invalidate.
>> But wmb are still related to both.
>> As I addressed wmb uses dsb (in arm arch) and here is the description of
>> arm architecture.
>>
>> * DSB drains write buffer.
>> * DSB is architecturally defined to include all cache, TLB and branch
>> prediction maintenance operations as well as explicit memory operations
>>
>> These are the reasons why I mentioned wmb does both.
>>
>> * captured from ARMv7 Architecture Manual
>> --- Notes ---
>> Historically, this operation was referred to as Drain Write Buffer or
>> Data Write Barrier (DWB). From ARMv6, these
>> names and the use of DWB were deprecated in favor of the new Data
>> Synchronization Barrier name and DSB
>> abbreviation. DSB better reflects the functionality provided from ARMv6,
>> because DSB is architecturally defined
>> to include all cache, TLB and branch prediction maintenance operations
>> as well as explicit memory operations
>>
>> --- A DSB completes when: ---
>> ? all explicit memory accesses that are observed by Pe before the DSB is
>> executed, are of the required access
>> types, and are from observers in the same required shareability domain
>> as Pe, are complete for the set of
>> observers in the required shareability domain.
>> ? if the required accesses types of the DSB is reads and writes, all
>> cache and branch predictor maintenance
>> operations issued by Pe before the DSB are complete for the required
>> shareability domain.
>> ? if the required accesses types of the DSB is reads and writes, all TLB
>> maintenance operations issued by Pe
>> before the DSB are complete for the required shareability domain.
>> --------------
> I cannot read from this in any way that it can post writes to the PCIe
> bus. In fact, architecturally, I cannot think of any reason how it even
> could do that from the CPU.
>
>> Furthermore this is the comparison of the compiled assembly code between
>> ath10k_pci_read32 and wmb.
>>
>> ath10k_pci_read32()
>> bac: e5932008 ldr r2, [r3, #8]
>> bb0: f57ff04f dsb sy
>> bb4: e2883d52 add r3, r8, #5248 ; 0x1480
>> bb8: e283303c add r3, r3, #60 ; 0x3c
>> bbc: e593300c ldr r3, [r3, #12]
>> bc0: e2833a09 add r3, r3, #36864 ; 0x9000
>>
>> wmb();
>> b9c: f57ff04e dsb st
>>
>> ath10k_pci_read32 does register operation except dsb and there is no
>> cache invalidate related commands.
> I don't think this is relevant. The question is "what are you trying to
> achieve".
>
>> So that if wmb is not enough for the purpose then ath10k_pci_read32 is
>> also not enough for that.
>>
>> Also refer the section "ACQUIRES VS I/O ACCESSES" in
> memory-barriers.txt.
>> It gives an example with PCI bridge and introduces readl as an
>> alternative method to mmiowb which weaker form of wmb.
>>
>> Please give your opinion.
> Again - the question is - what are you trying to achieve?
>
> The code (as it is before your patch) implies that it's trying to make
> sure that before it continues, any previous writes to the PCIe device's
> registers are posted. The only way to ensure that is to do a read to the
> registers, as the code does now.
Do you know how the read ensure that although the read code does not
check the return value?
Can you explain how a read ensures that posted write reaches PCIe device?
> What you're describing is something else entirely - you're describing a
> way to make sure that some data was flushed out to DRAM from the CPU
> caches.
>
> These two things are not related in any way.
>
> In an interrupt routine, it would make sense to ensure that the write
> was posted (e.g. to mask interrupts, or to acknowledge them, or similar,
> before the routine can be re-invoked.)
>
> To me, flushing memory writes to DRAM makes less sense in an interrupt
> handlers unless the device was somehow using DMA to coordinate
> interrupts [1], which seems unlikely but I haven't checked.
>
> Anyway - I have no particular interest in this discussion, I was merely
> trying to help you out with this :) You can make whatever change you
> want, of course :P
>
> johannes
>
> [1] incidentally, our device [iwlwifi] does in fact do something like
> that, but it's read-only for the driver so no need for such a thing
> either
>
>
> _______________________________________________
> ath10k mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/ath10k
Thanks,
Peter

2015-02-02 22:06:54

by Peter Oh

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync


On 02/02/2015 11:47 AM, Johannes Berg wrote:
> On Mon, 2015-02-02 at 11:36 -0800, Peter Oh wrote:
>> On 02/02/2015 11:22 AM, Johannes Berg wrote:
>>>>> You basically have the following sequence:
>>>>>
>>>>> iowrite()
>>>>> ioread()
>>>>>
>>>>> If you look, you'll see that iowrite() is actually done (or should
> be,
>>>>> or perhaps with appropriate syncs) on an uncached mapping.
>>>> since it's mmio, iowrite will be map to write, not out which is
> cached
>>>> mapping.
>>>> That's why we address "posted write" here.
>>>> If it's un-cached mapping which is volatile, we don't even need
> ioread.
>>> No, this isn't true - "posted write" in the context of this discussion
>>> is about the PCIe bus. Memory writes that go through cache aren't
>>> referred to as "posted writes", those are just (cached) memory writes.
>>>
>>>>> As a result,
>>>>> the only thing you care about here is the PCIe bus, not the CPU
> cache
>>>>> flush. And from there on that's just a question of PCIe bus
> semantics.
>>>> So how does ioread guarantee PCIe bus transaction done?
>>> That's how PCIe works, operations are serialized, and read() has to
> wait
>>> for a response from the device
>> do you know which mechanism or which instruction set makes read() wait
>> for a response from the device?
> I have no idea. I assume it's just like a DRAM read, the CPU stalls
> while there's no response.
My explanation in this thread is all about how read() guarantees the
wait for a response from the device, therefore why mb() - replace from
wmb at patch set 2 - is compatible to read().
Briefly speaking,
read() -> dsb 'st' -> cpu (actually axi master in cpu) holding axi bus
-> cpu post write buffer on axi bus -> axi bus (axi slave which is PCIe
device) signals write completion when write transactions completed in
write response channel -> cpu release axi bus -> cpu program counter
(pc) proceeds the next to read.

the exact same routines happen with mb().
mb() -> dsb 'st' -> cpu (actually axi master in cpu) holding axi bus ->
cpu post write buffer on axi bus -> axi bus (axi slave which is PCIe
device) signals write completion when write transactions completed in
write response channel -> cpu release axi bus -> cpu program counter
(pc) proceeds the next to read.

Since axi bus master is waiting (blocking) for write completion signal
from axi slave (PCIe device), this is how read() and mb() guarantee
write command reaches to the device.
> johannes
>
>
> _______________________________________________
> ath10k mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/ath10k
Regards,
Peter

2015-02-02 19:47:32

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync

On Mon, 2015-02-02 at 11:36 -0800, Peter Oh wrote:
> On 02/02/2015 11:22 AM, Johannes Berg wrote:
> >>> You basically have the following sequence:
> >>>
> >>> iowrite()
> >>> ioread()
> >>>
> >>> If you look, you'll see that iowrite() is actually done (or should be,
> >>> or perhaps with appropriate syncs) on an uncached mapping.
> >> since it's mmio, iowrite will be map to write, not out which is cached
> >> mapping.
> >> That's why we address "posted write" here.
> >> If it's un-cached mapping which is volatile, we don't even need ioread.
> > No, this isn't true - "posted write" in the context of this discussion
> > is about the PCIe bus. Memory writes that go through cache aren't
> > referred to as "posted writes", those are just (cached) memory writes.
> >
> >>> As a result,
> >>> the only thing you care about here is the PCIe bus, not the CPU cache
> >>> flush. And from there on that's just a question of PCIe bus semantics.
> >> So how does ioread guarantee PCIe bus transaction done?
> > That's how PCIe works, operations are serialized, and read() has to wait
> > for a response from the device
> do you know which mechanism or which instruction set makes read() wait
> for a response from the device?

I have no idea. I assume it's just like a DRAM read, the CPU stalls
while there's no response.

johannes


2015-02-02 23:05:16

by Peter Oh

[permalink] [raw]
Subject: Re: [PATCH] ath10k: Replace ioread with wmb for data sync


On 02/02/2015 02:26 PM, Adrian Chadd wrote:
> On 30 January 2015 at 18:06, Sujith Manoharan <[email protected]> wrote:
>> Peter Oh wrote:
>>> Please refer the email thread that I mentioned about other architectures.
>>> (dsb is for ARM and other platforms have the equivalent instruction such
>>> as sfence, sync, mf, and dcs).
>> Ok.
>>
>>> Also the patch is updated with 2nd patch set replacing wmb to mb.
>> Would be good to test this on a MIPS platform...
>>
> The Atheros mips74k stuff I have here does /not/ flush all the writes
> out to the device and guarantee the device has seen everything with a
> memory barrier. Just saying. Various drivers ended up needing
> ioread()s in my experiments; mips sync operations weren't enough.
>
> So I'd suggest abstracting it out like the linux dri i915 code has -
> they define a "posting read" macro which they use whenever they need
> to ensure it's definitely made it all the way out to the hardware and
> through internal FIFOs so internal hardware has seen the state change.
> Then you can redefine that to your hearts content based on platform.
Thank you Adrian to head up the concern and suggestion.
The other people also concerned about other architectures like mips, so
I was going to analysis it.
But since you've already experienced the defects, let me hold the change
back until I find better for all.
>
>
> -adrian
Regards,
Peter