2014-12-05 09:22:28

by Arend van Spriel

[permalink] [raw]
Subject: using DMA-API on ARM

Hi Russell,

For our brcm80211 development we are working on getting brcmfmac driver
up and running on a Broadcom ARM-based platform. The wireless device is
a PCIe device, which is hooked up to the system behind a PCIe host
bridge, and we transfer information between host and device using a
descriptor ring buffer allocated using dma_alloc_coherent(). We mostly
tested on x86 and seen no issue. However, on this ARM platform
(single-core A9) we detect occasionally that the descriptor content is
invalid. When this occurs we do a dma_sync_single_for_cpu() and this is
retried a number of times if the problem persists. Actually, found out
that someone made a mistake by using virt_to_dma(va) to get the
dma_handle parameter. So probably we only provided a delay in the retry
loop. After fixing that a single call to dma_sync_single_for_cpu() is
sufficient. The DMA-API-HOWTO clearly states that:

"""
the hardware should guarantee that the device and the CPU can access the
data in parallel and will see updates made by each other without any
explicit software flushing.
"""

So it seems incorrect that we would need to do a dma_sync for this
memory. That we do need it seems like this memory can end up in
cache(?), or whatever happens, in some rare condition. Is there anyway
to investigate this situation either through DMA-API or some low-level
ARM specific functions.

Regards,
Arend


2014-12-09 10:19:50

by Arend van Spriel

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On 12/08/14 18:01, Arend van Spriel wrote:
> On 12/08/14 17:03, Catalin Marinas wrote:
>> On Mon, Dec 08, 2014 at 03:01:32PM +0000, Arnd Bergmann wrote:
>>> [ 0.000000] PL310 OF: cache setting yield illegal associativity
>>> [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal
>>> [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9
>>> [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9
>>> [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
>>> [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB
>>> [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e130001
>>
>> If the above value is correct, they should make sure bit 22 is set in
>> AUX_CTRL.
>
> Hante applied the patch and it now says:
>
> [ 0.000000] PL310 OF: cache setting yield illegal associativity
> [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal
> [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9
> [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9
> [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
> [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB
> [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e530001
>
> He started running a test overnight. So will see if it hits the failure
> with this L2 cache configuration.

The issue did not trigger overnight so it seems setting bit 22 <Shared
Attribute _Override_ Enable> solves the issue over here. Now the
question is how to move forward with this. As I understood from Catalin
this patch was not included as it was not considered responsibility of
the linux kernel.

Regards,
Arend

2014-12-05 14:47:10

by Arend van Spriel

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On 12/05/14 15:20, Hante Meuleman wrote:
> Ok, I'll add the necessary debug to get all the information out,
> but it will take some time to get it done, so I won't have anything
> before Monday.
>
> -----Original Message-----
> From: Russell King - ARM Linux [mailto:[email protected]]
> Sent: vrijdag 5 december 2014 14:24
> To: Hante Meuleman
> Cc: Will Deacon; Arend Van Spriel; Marek Szyprowski; [email protected]; David Miller; [email protected]; brcm80211-dev-list; linux-wireless
> Subject: Re: using DMA-API on ARM
>
> Please wrap your message - replying to a message which looks like this in
> my editor is far from easy, and gives me much more work to /manually/
> reformat it before I can reply to it:

That's what happens with corporate IT forcing to use Outlook. We can
workaround that using Thunderbird on Citrix. I will enlighten Hante
about that option :-)

Regards,
Arend

> On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote:
>> The problem is with data coming from device, so DMA from device to host. The $
>>
>> However: this indicates that dma_alloc_coherent on an ARM target may result i$
>>
>> Regards,
>> Hante
>
> Thanks.
>
> On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote:
>> However: this indicates that dma_alloc_coherent on an ARM target may
>> result in a memory buffer which can be cached which conflicts with
>> the API of this function.
>
> If the memory has an alias which is cacheable, it is possible for cache
> lines to get allocated via that alias, even if the alias has no explicit
> accesses to it.
>
> This is something which I've been going on for quite literally /years/ -
> mismatched cache attributes can cause unpredictable behaviour. I've had
> a lot of push back from people who are of the opinion that "if it works
> for me, then there isn't a problem" and I eventually gave up fighting
> the battle, especially as the ARM architecture people weakened my
> reasoning behind it by publishing a relaxation of the "no differing
> attributes" issue. This was particularly true of those who wanted to
> use ioremap() on system memory - and cases such as
> dma_init_coherent_memory().
>
> So, I never fixed this problem in the original DMA allocator code; I
> basically gave up with it. It's a latent bug which did need to be fixed,
> and is still present today in the non-CMA case.
>
> The symptoms which you are reporting sound very much like this kind of
> problem - the virtual address for the memory returned by
> dma_alloc_coherent() will not be cacheable memory - it will have been
> remapped using map_vm_area(). However, there could very well be a fully
> cacheable lowmem mapping of that memory, which if a read (speculative or
> otherwise) will bring a cache line in, and because the caches are VIPT
> or PIPT, that cache line can be hit via the non-cacheable mapping too.
>
> What I /really/ need is more evidence of this to tell those disbelievers
> where to stick their flawed arguments. :)
>


2014-12-05 19:50:58

by Arend van Spriel

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On 12/05/14 19:53, Catalin Marinas wrote:
> On Fri, Dec 05, 2014 at 06:39:45PM +0000, Catalin Marinas wrote:
>> On Fri, Dec 05, 2014 at 09:22:22AM +0000, Arend van Spriel wrote:
>>> For our brcm80211 development we are working on getting brcmfmac driver
>>> up and running on a Broadcom ARM-based platform. The wireless device is
>>> a PCIe device, which is hooked up to the system behind a PCIe host
>>> bridge, and we transfer information between host and device using a
>>> descriptor ring buffer allocated using dma_alloc_coherent(). We mostly
>>> tested on x86 and seen no issue. However, on this ARM platform
>>> (single-core A9) we detect occasionally that the descriptor content is
>>> invalid. When this occurs we do a dma_sync_single_for_cpu() and this is
>>> retried a number of times if the problem persists. Actually, found out
>>> that someone made a mistake by using virt_to_dma(va) to get the
>>> dma_handle parameter. So probably we only provided a delay in the retry
>>> loop. After fixing that a single call to dma_sync_single_for_cpu() is
>>> sufficient. The DMA-API-HOWTO clearly states that:
>>
>> Does your system have an L2 cache? What's the SoC topology, can PCIe see
>> such L2 cache (or snoop the L1 caches)?
>
> BTW, if you really have a PL310-like L2 cache, have a look at some
> patches (I've seen similar symptoms) and make sure your configuration is
> correct:
>
> http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6395/1
>
> http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6529/1
>
> The first one is vexpress specific. The second one was eventually
> discarded by Russell (I don't remember the reason, I guess it's because
> SoC code is supposed to set the right bits in there anyway). In your
> case, such bits may be set up by firmware, so Linux cannot fix anything
> up.

I guess by firmware you mean to bootloader. This one boots with CFE
bootloader which Broadcom maintains itself so could look into that.

Regards,
Arend


2014-12-05 12:43:05

by Arend van Spriel

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On 05-12-14 10:45, Russell King - ARM Linux wrote:
> On Fri, Dec 05, 2014 at 10:22:22AM +0100, Arend van Spriel wrote:
>> For our brcm80211 development we are working on getting brcmfmac driver
>> up and running on a Broadcom ARM-based platform. The wireless device is
>> a PCIe device, which is hooked up to the system behind a PCIe host
>> bridge, and we transfer information between host and device using a
>> descriptor ring buffer allocated using dma_alloc_coherent(). We mostly
>> tested on x86 and seen no issue. However, on this ARM platform
>> (single-core A9) we detect occasionally that the descriptor content is
>> invalid. When this occurs we do a dma_sync_single_for_cpu() and this is
>> retried a number of times if the problem persists. Actually, found out
>> that someone made a mistake by using virt_to_dma(va) to get the
>> dma_handle parameter. So probably we only provided a delay in the retry
>> loop. After fixing that a single call to dma_sync_single_for_cpu() is
>> sufficient. The DMA-API-HOWTO clearly states that:
>>
>> """
>> the hardware should guarantee that the device and the CPU can access the
>> data in parallel and will see updates made by each other without any
>> explicit software flushing.
>> """
>>
>> So it seems incorrect that we would need to do a dma_sync for this
>> memory. That we do need it seems like this memory can end up in
>> cache(?), or whatever happens, in some rare condition. Is there anyway
>> to investigate this situation either through DMA-API or some low-level
>> ARM specific functions.
>
> It's been a long while since I looked at the code, and the code for
> dma_alloc_coherent() has completely changed since then with the
> addition of CMA. I'm afraid that anything I would say about it would
> not be accurate without research into the possible paths through that
> code - it's no longer just a simple allocator.

I know. On this particular platform we are not using CMA.

> What you say is correct however: the memory should not have any cache
> lines associated with it, if it does, there's a bug somewhere.
>
> Also, the memory will be weakly ordered, which means that writes to such
> memory can be reordered. If ordering matters, barriers should be used.
> rmb() and wmb() can be used for this.

Ok. You already had a peek in our code checking the memory barriers,
which does not have the dma_sync_single_for_cpu() "workaround" yet. So
here some more background. The problem is in DMA_FROM_DEVICE direction.
Because of the possible reordering issue we first tried using rmb() in
the retry loop but that did not solve it. Another experiment was to
ignore the failed ring descriptor entry and proceed. So we get interrupt
from device and access the ring descriptor entry. This should contain
expected value X, however we get X-1 back. When proceeding everything
works find until hitting the same ring descriptor entry again reading
X-1 when X+1 would be valid. This lead us to the assumption that somehow
this entry ended up in cache lines. The issue goes away using the
dma_sync_single_for_cpu() with DMA_FROM_DEVICE in direction parameter.
We are not longer using virt_to_dma() so that is no longer an issue. So
is there any function interface to verify cache status.

Regards,
Arend

> (Added Marek for comment on dma_alloc_coherent(), Will for comment on
> barrier stuff.)
>


2014-12-05 17:38:49

by Catalin Marinas

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Fri, Dec 05, 2014 at 11:11:14AM +0000, Russell King - ARM Linux wrote:
> In any case, wouldn't using a u64 type for "address" be better - isn't
> "long long" 128-bit on 64-bit architectures?

No, it's still 64-bit. There is no 128-bit integer in the C standard.

--
Catalin

2014-12-05 18:53:13

by Catalin Marinas

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Fri, Dec 05, 2014 at 06:39:45PM +0000, Catalin Marinas wrote:
> On Fri, Dec 05, 2014 at 09:22:22AM +0000, Arend van Spriel wrote:
> > For our brcm80211 development we are working on getting brcmfmac driver
> > up and running on a Broadcom ARM-based platform. The wireless device is
> > a PCIe device, which is hooked up to the system behind a PCIe host
> > bridge, and we transfer information between host and device using a
> > descriptor ring buffer allocated using dma_alloc_coherent(). We mostly
> > tested on x86 and seen no issue. However, on this ARM platform
> > (single-core A9) we detect occasionally that the descriptor content is
> > invalid. When this occurs we do a dma_sync_single_for_cpu() and this is
> > retried a number of times if the problem persists. Actually, found out
> > that someone made a mistake by using virt_to_dma(va) to get the
> > dma_handle parameter. So probably we only provided a delay in the retry
> > loop. After fixing that a single call to dma_sync_single_for_cpu() is
> > sufficient. The DMA-API-HOWTO clearly states that:
>
> Does your system have an L2 cache? What's the SoC topology, can PCIe see
> such L2 cache (or snoop the L1 caches)?

BTW, if you really have a PL310-like L2 cache, have a look at some
patches (I've seen similar symptoms) and make sure your configuration is
correct:

http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6395/1

http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6529/1

The first one is vexpress specific. The second one was eventually
discarded by Russell (I don't remember the reason, I guess it's because
SoC code is supposed to set the right bits in there anyway). In your
case, such bits may be set up by firmware, so Linux cannot fix anything
up.

--
Catalin

2014-12-05 18:31:25

by Catalin Marinas

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Fri, Dec 05, 2014 at 06:24:43PM +0000, Russell King - ARM Linux wrote:
> On Fri, Dec 05, 2014 at 05:38:39PM +0000, Catalin Marinas wrote:
> > On Fri, Dec 05, 2014 at 11:11:14AM +0000, Russell King - ARM Linux wrote:
> > > In any case, wouldn't using a u64 type for "address" be better - isn't
> > > "long long" 128-bit on 64-bit architectures?
> >
> > No, it's still 64-bit. There is no 128-bit integer in the C standard.
>
> Actually, that's a fallicy.
>
> The C99 standard (like previous versions) does not define exactly the
> number of bits in each type. It defines ranks of type, and says that
> lower ranks are a subrange of integers with higher ranks (for the same
> signed-ness.) See section 6.2.5.
>
> So, it merely states that:
>
> range(char) <= range(short) <= range(int) <= range(long) <= range(long long)

You are probably right, I haven't checked. But the ABI we use in Linux
for 64-bit, LP64, defines long long as 64-bit. Gcc has a int128_t type
but it's specific to this toolchain.

--
Catalin

2014-12-09 11:07:31

by Arend van Spriel

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On 12/09/14 11:29, Russell King - ARM Linux wrote:
> On Tue, Dec 09, 2014 at 11:19:40AM +0100, Arend van Spriel wrote:
>> The issue did not trigger overnight so it seems setting bit 22<Shared
>> Attribute _Override_ Enable> solves the issue over here. Now the question is
>> how to move forward with this. As I understood from Catalin this patch was
>> not included as it was not considered responsibility of the linux kernel.
>
> It is preferable for firmware to configure the L2 cache appropriately,
> which includes things like the prefetch offsets as well as feature bits
> like bit 22.
>
> I think what I'll do is queue up a patch which adds a warning if bit 22
> is not set, suggesting that firmware is updated to set this bit.

I was thinking in the same direction. Thanks to you all for looking into
this. It did not feel right to use the dma_sync_single_for_cpu() for
memory allocated with dma_alloc_coherent() and I am glad this got
cleared up.

Regards,
Arend


2014-12-05 15:06:59

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: using DMA-API on ARM

I've been doing more digging into the current DMA code, and I'm dismayed
to see that there's new bugs in it...

commit 513510ddba9650fc7da456eefeb0ead7632324f6
Author: Laura Abbott <[email protected]>
Date: Thu Oct 9 15:26:40 2014 -0700

common: dma-mapping: introduce common remapping functions

This uses map_vm_area() to achieve the remapping of pages allocated inside
dma_alloc_coherent(). dma_alloc_coherent() is documented in a rather
round-about way in Documentation/DMA-API.txt:

| Part Ia - Using large DMA-coherent buffers
| ------------------------------------------
|
| void *
| dma_alloc_coherent(struct device *dev, size_t size,
| dma_addr_t *dma_handle, gfp_t flag)
|
| void
| dma_free_coherent(struct device *dev, size_t size, void *cpu_addr,
| dma_addr_t dma_handle)
|
| Free a region of consistent memory you previously allocated. dev,
| size and dma_handle must all be the same as those passed into
| dma_alloc_coherent(). cpu_addr must be the virtual address returned by
| the dma_alloc_coherent().
|
| Note that unlike their sibling allocation calls, these routines
| may only be called with IRQs enabled.

Note that very last paragraph. What this says is that it is explicitly
permitted to call dma_alloc_coherent() with IRQs disabled.

Now, the question is: is it safe to call map_vm_area() with IRQs disabled?
Well, map_vm_area() calls pud_alloc(), pmd_alloc(), and pte_alloc_kernel().
These functions all call into the kernel memory allocator *without*
GFP_ATOMIC - in other words, these allocations are permitted to sleep.
Except, IRQs are off, so it's a bug to call these functions from
dma_alloc_coherent().

Now, if we look at the previous code, it used ioremap_page_range(). This
has the same problem: it needs to allocate page tables, and it can only
do it via functions which may sleep.

If we go back even further, we find that the use of ioremap_page_range()
in dma_alloc_coherent() was introduced by:

commit e9da6e9905e639b0f842a244bc770b48ad0523e9
Author: Marek Szyprowski <[email protected]>
Date: Mon Jul 30 09:11:33 2012 +0200

ARM: dma-mapping: remove custom consistent dma region

which is the commit which removed my pre-allocated page tables for the
DMA re-mapping region - code which I explicitly had to specifically
avoid this issue.

Obviously, this isn't a big problem, because people haven't reported
that they've hit any of the might_sleep() checks in the memory
allocators, which I think is our only saving grace - but it's still
wrong to the specified calling conditions of the DMA API.

If the problem which you (Broadcom) are suffering from is down to the
issue I suspect (that being having mappings with different cache
attributes) then I'm not sure that there's anything we can realistically
do about that. There's a number of issues which make it hard to see a
way forward.

One example is that if we allocate memory, we need to be able to change
(or remove) the cacheable mappings associated with that memory. We'd
need to touch the L1 page table, either to change the attributes of the
section mapping, or to convert the section mapping to a L2 page table
pointer. We need to change the attributes in a break-flush-make sequence
to avoid TLB conflicts.

However, those mappings may be shared between other CPUs in a SMP system.
So, we would need to flush the TLBs on other CPUs before we could proceed
to create replacement mappings. That means something like stop_machine()
or sending (and waiting for completion) of an IPI to the other CPUs. That
is totally impractical due to dma_alloc_coherent() being allowed to be
called with IRQs off.

I'll continue to think about it, but I don't see many possibilities to
satisfy dma_alloc_coherent()'s documented requirements other than by
pre-allocating a chunk of memory at boot time to be served out as
DMA-able memory for these horrid cases.

I don't see much point in keeping the map_vm_area() approach on ARM
even if we did fallback - if we're re-establishing mappings for
the surrounding pages in lowmem, we might as well insert appropriately
attributed mappings for the DMA memory as well.

On the face of it, it would be better to allocate one section at a
time, but my unfortunate experience is that 3.x kernels are a /lot/
more trigger happy with the OOM killer, and the chances of being able
to allocate 1MB of memory at a go after the system has been running
for a while is near-on impossible. So I don't think that's a reality.

Even if we did break up section mappings in this way, it would also
mean that over time, we'd end up with much of lowmem mapped using 4K
page table entries, which would place significant pressure on the MMU
TLBs.

So, we might just be far better off pre-allocating enough "DMA
coherent" memory at boot time and be done with it. Those who want
it dynamic can use CMA instead.

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-12-08 16:55:06

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Mon, Dec 08, 2014 at 04:50:43PM +0000, Catalin Marinas wrote:
> On Mon, Dec 08, 2014 at 04:38:57PM +0000, Arnd Bergmann wrote:
> > On Monday 08 December 2014 17:22:44 Arend van Spriel wrote:
> > > >> The log: first the ring allocation info is printed. Starting at
> > > >> 16.124847, ring 2, 3 and 4 are rings used for device to host. In this
> > > >> log the failure is on a read of ring 3. Ring 3 is 1024 entries of each
> > > >> 16 bytes. The next thing printed is the kernel page tables. Then some
> > > >> OpenWRT info and the logging of part of the connection setup. Then at
> > > >> 1780.130752 the logging of the failure starts. The sequence number is
> > > >> modulo 253 with ring size of 1024 matches an "old" entry (read 40,
> > > >> expected 52). Then the different pointers are printed followed by
> > > >> the kernel page table. The code does then a cache invalidate on the
> > > >> dma_handle and the next read the sequence number is correct.
> > > >
> > > > How do you invalidate the cache? A dma_handle is of type dma_addr_t
> > > > and we don't define an operation for that, nor does it make sense
> > > > on an allocation from dma_alloc_coherent(). What happens if you
> > > > take out the invalidate?
> > >
> > > dma_sync_single_for_cpu(, DMA_FROM_DEVICE) which ends up invalidating
> > > the cache (or that is our suspicion).
> >
> > I'm not sure about that:
> >
> > static void arm_dma_sync_single_for_cpu(struct device *dev,
> > dma_addr_t handle, size_t size, enum dma_data_direction dir)
> > {
> > unsigned int offset = handle & (PAGE_SIZE - 1);
> > struct page *page = pfn_to_page(dma_to_pfn(dev, handle-offset));
> > __dma_page_dev_to_cpu(page, offset, size, dir);
> > }
> >
> > Assuming a noncoherent linear (no IOMMU, no swiotlb, no dmabounce) mapping,
> > dma_to_pfn will return the correct pfn here, but pfn_to_page will return a
> > page pointer into the kernel linear mapping,
>
> Or a highmem page, both should be handled by dma_cache_maint_page().

A valid point, but one which is irrelevant to this thread, because we're
talking about a platform with only 128MB, and a PAGE_OFFSET of 2GB (hence
no highmem):

Memory: 125936K/131072K available (2682K kernel code, 103K rwdata,
744K rodata, 164K init, 188K bss, 5136K reserved)

Can we stay on-point to getting this problem solved, rather than drifting
off topic please?

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-12-05 18:40:01

by Catalin Marinas

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Fri, Dec 05, 2014 at 09:22:22AM +0000, Arend van Spriel wrote:
> For our brcm80211 development we are working on getting brcmfmac driver
> up and running on a Broadcom ARM-based platform. The wireless device is
> a PCIe device, which is hooked up to the system behind a PCIe host
> bridge, and we transfer information between host and device using a
> descriptor ring buffer allocated using dma_alloc_coherent(). We mostly
> tested on x86 and seen no issue. However, on this ARM platform
> (single-core A9) we detect occasionally that the descriptor content is
> invalid. When this occurs we do a dma_sync_single_for_cpu() and this is
> retried a number of times if the problem persists. Actually, found out
> that someone made a mistake by using virt_to_dma(va) to get the
> dma_handle parameter. So probably we only provided a delay in the retry
> loop. After fixing that a single call to dma_sync_single_for_cpu() is
> sufficient. The DMA-API-HOWTO clearly states that:

Does your system have an L2 cache? What's the SoC topology, can PCIe see
such L2 cache (or snoop the L1 caches)?

Also, are you certain that dma_alloc_coherent() ends up creating a
non-cacheable mapping in Linux (this call translates to a function
pointer call which may or may not create non-cacheable memory, depending
on the "dma-coherent" property passed via DT).

--
Catalin

2014-12-08 16:39:15

by Arnd Bergmann

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Monday 08 December 2014 17:22:44 Arend van Spriel wrote:
> >> The log: first the ring allocation info is printed. Starting at
> >> 16.124847, ring 2, 3 and 4 are rings used for device to host. In this
> >> log the failure is on a read of ring 3. Ring 3 is 1024 entries of each
> >> 16 bytes. The next thing printed is the kernel page tables. Then some
> >> OpenWRT info and the logging of part of the connection setup. Then at
> >> 1780.130752 the logging of the failure starts. The sequence number is
> >> modulo 253 with ring size of 1024 matches an "old" entry (read 40,
> >> expected 52). Then the different pointers are printed followed by
> >> the kernel page table. The code does then a cache invalidate on the
> >> dma_handle and the next read the sequence number is correct.
> >
> > How do you invalidate the cache? A dma_handle is of type dma_addr_t
> > and we don't define an operation for that, nor does it make sense
> > on an allocation from dma_alloc_coherent(). What happens if you
> > take out the invalidate?
>
> dma_sync_single_for_cpu(, DMA_FROM_DEVICE) which ends up invalidating
> the cache (or that is our suspicion).

I'm not sure about that:

static void arm_dma_sync_single_for_cpu(struct device *dev,
dma_addr_t handle, size_t size, enum dma_data_direction dir)
{
unsigned int offset = handle & (PAGE_SIZE - 1);
struct page *page = pfn_to_page(dma_to_pfn(dev, handle-offset));
__dma_page_dev_to_cpu(page, offset, size, dir);
}

Assuming a noncoherent linear (no IOMMU, no swiotlb, no dmabounce) mapping,
dma_to_pfn will return the correct pfn here, but pfn_to_page will return a
page pointer into the kernel linear mapping, which is not the same
as the pointer you get from __alloc_remap_buffer(). The pointer that
was returned from dma_alloc_coherent is a) non-cachable, and b) not the
same that you flush here.

Arnd

2014-12-08 16:50:53

by Catalin Marinas

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Mon, Dec 08, 2014 at 04:38:57PM +0000, Arnd Bergmann wrote:
> On Monday 08 December 2014 17:22:44 Arend van Spriel wrote:
> > >> The log: first the ring allocation info is printed. Starting at
> > >> 16.124847, ring 2, 3 and 4 are rings used for device to host. In this
> > >> log the failure is on a read of ring 3. Ring 3 is 1024 entries of each
> > >> 16 bytes. The next thing printed is the kernel page tables. Then some
> > >> OpenWRT info and the logging of part of the connection setup. Then at
> > >> 1780.130752 the logging of the failure starts. The sequence number is
> > >> modulo 253 with ring size of 1024 matches an "old" entry (read 40,
> > >> expected 52). Then the different pointers are printed followed by
> > >> the kernel page table. The code does then a cache invalidate on the
> > >> dma_handle and the next read the sequence number is correct.
> > >
> > > How do you invalidate the cache? A dma_handle is of type dma_addr_t
> > > and we don't define an operation for that, nor does it make sense
> > > on an allocation from dma_alloc_coherent(). What happens if you
> > > take out the invalidate?
> >
> > dma_sync_single_for_cpu(, DMA_FROM_DEVICE) which ends up invalidating
> > the cache (or that is our suspicion).
>
> I'm not sure about that:
>
> static void arm_dma_sync_single_for_cpu(struct device *dev,
> dma_addr_t handle, size_t size, enum dma_data_direction dir)
> {
> unsigned int offset = handle & (PAGE_SIZE - 1);
> struct page *page = pfn_to_page(dma_to_pfn(dev, handle-offset));
> __dma_page_dev_to_cpu(page, offset, size, dir);
> }
>
> Assuming a noncoherent linear (no IOMMU, no swiotlb, no dmabounce) mapping,
> dma_to_pfn will return the correct pfn here, but pfn_to_page will return a
> page pointer into the kernel linear mapping,

Or a highmem page, both should be handled by dma_cache_maint_page().

> which is not the same
> as the pointer you get from __alloc_remap_buffer(). The pointer that
> was returned from dma_alloc_coherent is a) non-cachable, and b) not the
> same that you flush here.

Correct. But apart from the fact that you don't need to flush buffers
allocated with dma_alloc_coherent(), the above sync_single would work on
ARMv7 where the D-cache is PIPT, so the virtual address doesn't matter
much as long as it maps the same physical address.

--
Catalin

2014-12-05 09:45:31

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Fri, Dec 05, 2014 at 10:22:22AM +0100, Arend van Spriel wrote:
> For our brcm80211 development we are working on getting brcmfmac driver
> up and running on a Broadcom ARM-based platform. The wireless device is
> a PCIe device, which is hooked up to the system behind a PCIe host
> bridge, and we transfer information between host and device using a
> descriptor ring buffer allocated using dma_alloc_coherent(). We mostly
> tested on x86 and seen no issue. However, on this ARM platform
> (single-core A9) we detect occasionally that the descriptor content is
> invalid. When this occurs we do a dma_sync_single_for_cpu() and this is
> retried a number of times if the problem persists. Actually, found out
> that someone made a mistake by using virt_to_dma(va) to get the
> dma_handle parameter. So probably we only provided a delay in the retry
> loop. After fixing that a single call to dma_sync_single_for_cpu() is
> sufficient. The DMA-API-HOWTO clearly states that:
>
> """
> the hardware should guarantee that the device and the CPU can access the
> data in parallel and will see updates made by each other without any
> explicit software flushing.
> """
>
> So it seems incorrect that we would need to do a dma_sync for this
> memory. That we do need it seems like this memory can end up in
> cache(?), or whatever happens, in some rare condition. Is there anyway
> to investigate this situation either through DMA-API or some low-level
> ARM specific functions.

It's been a long while since I looked at the code, and the code for
dma_alloc_coherent() has completely changed since then with the
addition of CMA. I'm afraid that anything I would say about it would
not be accurate without research into the possible paths through that
code - it's no longer just a simple allocator.

What you say is correct however: the memory should not have any cache
lines associated with it, if it does, there's a bug somewhere.

Also, the memory will be weakly ordered, which means that writes to such
memory can be reordered. If ordering matters, barriers should be used.
rmb() and wmb() can be used for this.

(Added Marek for comment on dma_alloc_coherent(), Will for comment on
barrier stuff.)

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-12-08 16:50:32

by Johannes Stezenbach

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Mon, Dec 08, 2014 at 03:55:57PM +0000, Catalin Marinas wrote:
> On Mon, Dec 08, 2014 at 12:55:38PM +0000, Johannes Stezenbach wrote:
> > On Fri, Dec 05, 2014 at 06:53:03PM +0000, Catalin Marinas wrote:
> > >
> > > BTW, if you really have a PL310-like L2 cache, have a look at some
> > > patches (I've seen similar symptoms) and make sure your configuration is
> > > correct:
> > >
> > > http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6395/1
> > >
> > > http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6529/1
> > >
> > > The first one is vexpress specific. The second one was eventually
> > > discarded by Russell (I don't remember the reason, I guess it's because
> > > SoC code is supposed to set the right bits in there anyway). In your
> > > case, such bits may be set up by firmware, so Linux cannot fix anything
> > > up.
> >
> > How do you avoid the unpredictable behavior mentioned in the
> > PL310 TRM when the Shared Attribute Invalidate Enable bit is set?
> > http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246h/Ceggcfcj.html
>
> So you talk about "Shared Attribute _Invalidate_ Enable" (bit 13) while
> I talk about "Shared Attribute _Override_ Enable" (bit 22).
>
> In addition, Shared _Invalidate_ behaviour can only be enabled if Shared
> Attribute _Override_ Enable bit is not set.

Yeah, I got confused, sorry for the noise.

Thanks,
Johannes

2014-12-05 19:25:54

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Fri, Dec 05, 2014 at 08:22:05PM +0100, Arend van Spriel wrote:
> On 12/05/14 19:28, Catalin Marinas wrote:
> >This is solved by using a pre-allocated, pre-mapped atomic_pool which
> >avoids any further mapping. __dma_alloc() calls __alloc_from_pool() when
> >!__GFP_WAIT.
>
> So we are actually calling dma_alloc_coherent() with GFP_KERNEL during
> device probe. That last paragraph Russell pointed out seems to suggest this
> is not allowed.

device probe is a schedulable, sleepable context, so dma_alloc_coherent()
is fine there. As Catalin points out, and as I realised after sending
them ail, it does check for __GFP_WAIT and uses a smaller atomic pool
for those allocations. This explains why no one has hit any warnings in
map_vm_area.

So, it's safe from atomic contexts after all.

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-12-08 16:47:38

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Mon, Dec 08, 2014 at 05:38:57PM +0100, Arnd Bergmann wrote:
> On Monday 08 December 2014 17:22:44 Arend van Spriel wrote:
> > >> The log: first the ring allocation info is printed. Starting at
> > >> 16.124847, ring 2, 3 and 4 are rings used for device to host. In this
> > >> log the failure is on a read of ring 3. Ring 3 is 1024 entries of each
> > >> 16 bytes. The next thing printed is the kernel page tables. Then some
> > >> OpenWRT info and the logging of part of the connection setup. Then at
> > >> 1780.130752 the logging of the failure starts. The sequence number is
> > >> modulo 253 with ring size of 1024 matches an "old" entry (read 40,
> > >> expected 52). Then the different pointers are printed followed by
> > >> the kernel page table. The code does then a cache invalidate on the
> > >> dma_handle and the next read the sequence number is correct.
> > >
> > > How do you invalidate the cache? A dma_handle is of type dma_addr_t
> > > and we don't define an operation for that, nor does it make sense
> > > on an allocation from dma_alloc_coherent(). What happens if you
> > > take out the invalidate?
> >
> > dma_sync_single_for_cpu(, DMA_FROM_DEVICE) which ends up invalidating
> > the cache (or that is our suspicion).
>
> I'm not sure about that:
>
> static void arm_dma_sync_single_for_cpu(struct device *dev,
> dma_addr_t handle, size_t size, enum dma_data_direction dir)
> {
> unsigned int offset = handle & (PAGE_SIZE - 1);
> struct page *page = pfn_to_page(dma_to_pfn(dev, handle-offset));
> __dma_page_dev_to_cpu(page, offset, size, dir);
> }
>
> Assuming a noncoherent linear (no IOMMU, no swiotlb, no dmabounce) mapping,
> dma_to_pfn will return the correct pfn here, but pfn_to_page will return a
> page pointer into the kernel linear mapping, which is not the same
> as the pointer you get from __alloc_remap_buffer(). The pointer that
> was returned from dma_alloc_coherent is a) non-cachable, and b) not the
> same that you flush here.

Having looked up the details of the Cortex CPU TRMs:

1. The caches are PIPT.
2. A non-cacheable mapping will not hit L1 cache lines which may be
allocated against the same physical address. (This is implementation
specific.)

So, the problem can't be the L1 cache, it has to be the L2 cache.

The L2 cache only deals with physical addresses, so it doesn't really
matter which mapping gets flushed - the result will be the same as far
as the L2 cache is concerned.

If bit 22 is not set in the auxcr, then a non-cacheable access can hit
a cache line which may be allocated in the L2 cache (which may have
been allocated via a speculative prefetch via the cacheable mapping.)

In the case which has been supplied, the physical address does indeed
have two mappings: it has a lowmem mapping which is cacheable, and it
has the DMA mapping which is marked as non-cacheable. Accesses via
the non-cacheable mapping will not hit L1 (that's an implementation
specific behaviour.) However, they may hit L2 if bit 22 is clear.

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-12-08 15:02:43

by Arnd Bergmann

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Monday 08 December 2014 13:47:38 Hante Meuleman wrote:
> Still using outlook, but will limit the line length, I hope that works for the
> moment. Attached is a log with the requested information, it is a little
> bit non-standard though. The dump code from the mm was copied in
> the driver and called from there, mapping the prints back to our local
> printf, but it should produce the same. I did this because I didn't realize
> the table is static.
>
> Some background on the test setup: I'm using a Broadcom reference
> design AP platform with an BRCM 4708 host SOC.

I think you are using the wrong dtb file, the log says this is
a "Buffalo WZR-1750DHP", not the reference design.

> For the AP router
> platform the opensource packet OpenWRT was used. Some small
> modifications were made to get it to work on our HW. Only one core
> is enabled for the moment (no time to figure out how to enable the
> other one). Openwrt was configured to use kernel 3.18-rc2 and
> the brcmfmac of the compat-wireless code was updated with our
> latest code (minor patches, which have been submitted already).
> The device used is 43602 pcie device. Some modifications to the build
> system were made to enable PCIE. The test is to connect with a
> client to the AP and run iperf (TCP). The test can run for many hours
> without a problem, but sometimes fails very quickly.

The bcm4708 platform is maintained by Hauke Mehrtens, adding him to Cc.

In your log, I see this message:

[ 0.000000] PL310 OF: cache setting yield illegal associativity
[ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal
[ 0.000000] L2C-310 enabling early BRESP for Cortex-A9
[ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9
[ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
[ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB
[ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e130001

Evidently the cache controller information in DT is incorrect and
the setup may be wrong as a consequence, which may explain cache
coherency problems.

Can you verify that the AUX_CTRL value is the same one you see
in a working kernel?

> The log: first the ring allocation info is printed. Starting at
> 16.124847, ring 2, 3 and 4 are rings used for device to host. In this
> log the failure is on a read of ring 3. Ring 3 is 1024 entries of each
> 16 bytes. The next thing printed is the kernel page tables. Then some
> OpenWRT info and the logging of part of the connection setup. Then at
> 1780.130752 the logging of the failure starts. The sequence number is
> modulo 253 with ring size of 1024 matches an "old" entry (read 40,
> expected 52). Then the different pointers are printed followed by
> the kernel page table. The code does then a cache invalidate on the
> dma_handle and the next read the sequence number is correct.

How do you invalidate the cache? A dma_handle is of type dma_addr_t
and we don't define an operation for that, nor does it make sense
on an allocation from dma_alloc_coherent(). What happens if you
take out the invalidate?

Can you post the patch that you use (both platform and driver) relative
to the snapshot of the the mainline kernel you are basing on?

Arnd


2014-12-08 15:22:51

by Arnd Bergmann

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Monday 08 December 2014 15:17:33 Russell King - ARM Linux wrote:
> On Mon, Dec 08, 2014 at 04:01:32PM +0100, Arnd Bergmann wrote:
> > In your log, I see this message:
> >
> > [ 0.000000] PL310 OF: cache setting yield illegal associativity
> > [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal
> > [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9
> > [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9
> > [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
> > [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB
> > [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e130001
> >
> > Evidently the cache controller information in DT is incorrect and
> > the setup may be wrong as a consequence, which may explain cache
> > coherency problems.
>
> No. See d0b92845e54 (ARM: 8182/1: l2c: Make l2x0_cache_size_of_parse() return 'int')
> which was merged in rc3, post-dating this kernel. The original commit
> was buggy, and produces those harmless "illegal associativity" messages.
> They occur if DT *doesn't* specify the cache parameters, in which case
> we use the hardware-set value (which /should/ be correct.)
>
> We made these optional in DT as hardware really should set these
> correctly in the first place.

Ok, good.

Arnd

2014-12-05 12:59:49

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Fri, Dec 05, 2014 at 01:43:01PM +0100, Arend van Spriel wrote:
> Ok. You already had a peek in our code checking the memory barriers, which
> does not have the dma_sync_single_for_cpu() "workaround" yet. So here some
> more background. The problem is in DMA_FROM_DEVICE direction. Because of the
> possible reordering issue we first tried using rmb() in the retry loop but
> that did not solve it. Another experiment was to ignore the failed ring
> descriptor entry and proceed. So we get interrupt from device and access the
> ring descriptor entry. This should contain expected value X, however we get
> X-1 back. When proceeding everything works find until hitting the same ring
> descriptor entry again reading X-1 when X+1 would be valid. This lead us to
> the assumption that somehow this entry ended up in cache lines. The issue
> goes away using the dma_sync_single_for_cpu() with DMA_FROM_DEVICE in
> direction parameter.

Can you give some further detail - I think it would help understanding
if you could give:

- the initial numerical state of the descriptor (presumably setup by
msgbuf.c calling brcmf_commonring_reserve_for_write(), and then
writing the contents into the ring buffer, followed by
brcmf_commonring_write_complete().

- time passes, the hardware processes the entry

- the numerical state of the descriptor (which is in error) which you
read back

- the expected numerical state of the descriptor

> So is there any function interface to verify cache status.

There isn't, but if you dump the virtual address, and you have debugfs
enabled, along with CONFIG_ARM_PTDUMP, you should be able to find the
mapping in /sys/kernel/debug/kernel_page_tables, which will tell you
the attributes that it's mapped using.

What it won't tell you is whether there's an alias of the mapping with
differing attributes. If you use dma_to_pfn() to convert the DMA handle
into a PFN, we can use that to see whether there could be another mapping
from the kernel page table dump (by checking whether the PFN would be a
lowmem PFN, and therefore whether it's already mapped at it's lowmem
address.)

If you'd like to mail me (in addition to the ring contents above):

- the kernel_page_tables dump
- virtual address of the ring buffer
- dma_to_pfn() converted DMA handle of the ring buffer
- PHYS_PFN_OFFSET for your platform

then I can see whether there is.

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-12-05 09:52:19

by Arnd Bergmann

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Friday 05 December 2014 10:22:22 Arend van Spriel wrote:
> Hi Russell,
>
> For our brcm80211 development we are working on getting brcmfmac driver
> up and running on a Broadcom ARM-based platform. The wireless device is
> a PCIe device, which is hooked up to the system behind a PCIe host
> bridge, and we transfer information between host and device using a
> descriptor ring buffer allocated using dma_alloc_coherent(). We mostly
> tested on x86 and seen no issue. However, on this ARM platform
> (single-core A9) we detect occasionally that the descriptor content is
> invalid. When this occurs we do a dma_sync_single_for_cpu() and this is
> retried a number of times if the problem persists. Actually, found out
> that someone made a mistake by using virt_to_dma(va) to get the
> dma_handle parameter. So probably we only provided a delay in the retry
> loop. After fixing that a single call to dma_sync_single_for_cpu() is
> sufficient. The DMA-API-HOWTO clearly states that:
>
> """
> the hardware should guarantee that the device and the CPU can access the
> data in parallel and will see updates made by each other without any
> explicit software flushing.
> """
>
> So it seems incorrect that we would need to do a dma_sync for this
> memory. That we do need it seems like this memory can end up in
> cache(?), or whatever happens, in some rare condition. Is there anyway
> to investigate this situation either through DMA-API or some low-level
> ARM specific functions.

I think the problem comes down to not following the advice from this
comment in asm/dma-mapping.h:

/*
* dma_to_pfn/pfn_to_dma/dma_to_virt/virt_to_dma are architecture private
* functions used internally by the DMA-mapping API to provide DMA
* addresses. They must not be used by drivers.
*/

The previous behavior of the driver is clearly wrong and cannot work
on any architecture that has noncoherent PCI DMA or uses swiotlb, and
that includes some older 64-bit x86 machines (Pentium D and similar).

I'm still puzzled why you'd need a single dma_sync_single_for_cpu()
after dma_alloc_coherent though, you should not need any. Is it possible
that the driver accidentally uses __raw_readl() instead of readl()
in some places and you are just lacking an appropriate barrier?

Arnd

2014-12-05 12:24:25

by Will Deacon

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Fri, Dec 05, 2014 at 09:45:08AM +0000, Russell King - ARM Linux wrote:
> On Fri, Dec 05, 2014 at 10:22:22AM +0100, Arend van Spriel wrote:
> > For our brcm80211 development we are working on getting brcmfmac driver
> > up and running on a Broadcom ARM-based platform. The wireless device is
> > a PCIe device, which is hooked up to the system behind a PCIe host
> > bridge, and we transfer information between host and device using a
> > descriptor ring buffer allocated using dma_alloc_coherent(). We mostly
> > tested on x86 and seen no issue. However, on this ARM platform
> > (single-core A9) we detect occasionally that the descriptor content is
> > invalid. When this occurs we do a dma_sync_single_for_cpu() and this is
> > retried a number of times if the problem persists. Actually, found out
> > that someone made a mistake by using virt_to_dma(va) to get the
> > dma_handle parameter. So probably we only provided a delay in the retry
> > loop. After fixing that a single call to dma_sync_single_for_cpu() is
> > sufficient. The DMA-API-HOWTO clearly states that:
> >
> > """
> > the hardware should guarantee that the device and the CPU can access the
> > data in parallel and will see updates made by each other without any
> > explicit software flushing.
> > """
> >
> > So it seems incorrect that we would need to do a dma_sync for this
> > memory. That we do need it seems like this memory can end up in
> > cache(?), or whatever happens, in some rare condition. Is there anyway
> > to investigate this situation either through DMA-API or some low-level
> > ARM specific functions.
>
> It's been a long while since I looked at the code, and the code for
> dma_alloc_coherent() has completely changed since then with the
> addition of CMA. I'm afraid that anything I would say about it would
> not be accurate without research into the possible paths through that
> code - it's no longer just a simple allocator.
>
> What you say is correct however: the memory should not have any cache
> lines associated with it, if it does, there's a bug somewhere.
>
> Also, the memory will be weakly ordered, which means that writes to such
> memory can be reordered. If ordering matters, barriers should be used.
> rmb() and wmb() can be used for this.
>
> (Added Marek for comment on dma_alloc_coherent(), Will for comment on
> barrier stuff.)

I'm not quite clear on the issue being seen here: is this on write from
the CPU to the descriptor ring, or the other way around (or both?).

Either way, you need barriers on the CPU side to ensure ordering of
accesses to the buffer. rmb/wmb will work, but are heavier than what you
need (relaxed versions have been proposed on LKML recently).

Will

2014-12-05 12:56:48

by Hante Meuleman

[permalink] [raw]
Subject: RE: using DMA-API on ARM

The problem is with data coming from device, so DMA from device to host. The DMA takes place from device local memory to host memory, where the host memory is allocated with dma_alloc_coherent, which we thought should not be cached. The host is an ARM (as is the device). The data being DMA'ed ends up in a ring buffer. This ring is only being read by host when it is a d2h ring (device to host). Each entry in the ring is 32 bytes, and contains a sequence number. The sequence number is a modulo 253 and the ring has 256 entries. At some point we read a sequence number which was "old". Then we loop to see if the sequence number changes. The loop is 1024 times and uses an rmb() call. This does not help. After looping 1024 times it is still reading the same value for sequence number. Now it can happen that 256 entries further we are still reading this old sequence (so iso reading a seqnum which is off by 3, it is off by 6). This was an indication that it was cached. So instead of using rmb() we used dma_sync_single_for_cpu. When using that call the problem was fixed. Whenever an old sequence number was read a single call to dma_sync_single_for_cpu would flush the cache and the next read would be correct.

However: this indicates that dma_alloc_coherent on an ARM target may result in a memory buffer which can be cached which conflicts with the API of this function. This problem has sofar not been observed on x86 hosts.

Regards,
Hante

-----Original Message-----
From: Will Deacon [mailto:[email protected]]
Sent: vrijdag 5 december 2014 13:24
To: Russell King - ARM Linux
Cc: Arend Van Spriel; Marek Szyprowski; [email protected]; David Miller; [email protected]; brcm80211-dev-list; linux-wireless
Subject: Re: using DMA-API on ARM

On Fri, Dec 05, 2014 at 09:45:08AM +0000, Russell King - ARM Linux wrote:
> On Fri, Dec 05, 2014 at 10:22:22AM +0100, Arend van Spriel wrote:
> > For our brcm80211 development we are working on getting brcmfmac driver
> > up and running on a Broadcom ARM-based platform. The wireless device is
> > a PCIe device, which is hooked up to the system behind a PCIe host
> > bridge, and we transfer information between host and device using a
> > descriptor ring buffer allocated using dma_alloc_coherent(). We mostly
> > tested on x86 and seen no issue. However, on this ARM platform
> > (single-core A9) we detect occasionally that the descriptor content is
> > invalid. When this occurs we do a dma_sync_single_for_cpu() and this is
> > retried a number of times if the problem persists. Actually, found out
> > that someone made a mistake by using virt_to_dma(va) to get the
> > dma_handle parameter. So probably we only provided a delay in the retry
> > loop. After fixing that a single call to dma_sync_single_for_cpu() is
> > sufficient. The DMA-API-HOWTO clearly states that:
> >
> > """
> > the hardware should guarantee that the device and the CPU can access the
> > data in parallel and will see updates made by each other without any
> > explicit software flushing.
> > """
> >
> > So it seems incorrect that we would need to do a dma_sync for this
> > memory. That we do need it seems like this memory can end up in
> > cache(?), or whatever happens, in some rare condition. Is there anyway
> > to investigate this situation either through DMA-API or some low-level
> > ARM specific functions.
>
> It's been a long while since I looked at the code, and the code for
> dma_alloc_coherent() has completely changed since then with the
> addition of CMA. I'm afraid that anything I would say about it would
> not be accurate without research into the possible paths through that
> code - it's no longer just a simple allocator.
>
> What you say is correct however: the memory should not have any cache
> lines associated with it, if it does, there's a bug somewhere.
>
> Also, the memory will be weakly ordered, which means that writes to such
> memory can be reordered. If ordering matters, barriers should be used.
> rmb() and wmb() can be used for this.
>
> (Added Marek for comment on dma_alloc_coherent(), Will for comment on
> barrier stuff.)

I'm not quite clear on the issue being seen here: is this on write from
the CPU to the descriptor ring, or the other way around (or both?).

Either way, you need barriers on the CPU side to ensure ordering of
accesses to the buffer. rmb/wmb will work, but are heavier than what you
need (relaxed versions have been proposed on LKML recently).

Will

2014-12-09 11:54:35

by Catalin Marinas

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Tue, Dec 09, 2014 at 10:29:05AM +0000, Russell King - ARM Linux wrote:
> On Tue, Dec 09, 2014 at 11:19:40AM +0100, Arend van Spriel wrote:
> > The issue did not trigger overnight so it seems setting bit 22 <Shared
> > Attribute _Override_ Enable> solves the issue over here. Now the question is
> > how to move forward with this. As I understood from Catalin this patch was
> > not included as it was not considered responsibility of the linux kernel.
>
> It is preferable for firmware to configure the L2 cache appropriately,
> which includes things like the prefetch offsets as well as feature bits
> like bit 22.
>
> I think what I'll do is queue up a patch which adds a warning if bit 22
> is not set, suggesting that firmware is updated to set this bit.

I'm fine with a (big) warning on this bit. But when you boot in secure
mode on 32-bit, do we still have a read/modify/write sequence for
L2X0_AUX_CTRL? A quick look at l2c310_enable() didn't reveal this (the
code has changed since I proposed the bit 22 setting patch).

--
Catalin

2014-12-05 18:29:08

by Catalin Marinas

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Fri, Dec 05, 2014 at 03:06:48PM +0000, Russell King - ARM Linux wrote:
> I've been doing more digging into the current DMA code, and I'm dismayed
> to see that there's new bugs in it...
>
> commit 513510ddba9650fc7da456eefeb0ead7632324f6
> Author: Laura Abbott <[email protected]>
> Date: Thu Oct 9 15:26:40 2014 -0700
>
> common: dma-mapping: introduce common remapping functions
>
> This uses map_vm_area() to achieve the remapping of pages allocated inside
> dma_alloc_coherent(). dma_alloc_coherent() is documented in a rather
> round-about way in Documentation/DMA-API.txt:
>
> | Part Ia - Using large DMA-coherent buffers
> | ------------------------------------------
> |
> | void *
> | dma_alloc_coherent(struct device *dev, size_t size,
> | dma_addr_t *dma_handle, gfp_t flag)
> |
> | void
> | dma_free_coherent(struct device *dev, size_t size, void *cpu_addr,
> | dma_addr_t dma_handle)
> |
> | Free a region of consistent memory you previously allocated. dev,
> | size and dma_handle must all be the same as those passed into
> | dma_alloc_coherent(). cpu_addr must be the virtual address returned by
> | the dma_alloc_coherent().
> |
> | Note that unlike their sibling allocation calls, these routines
> | may only be called with IRQs enabled.
>
> Note that very last paragraph. What this says is that it is explicitly
> permitted to call dma_alloc_coherent() with IRQs disabled.

This is solved by using a pre-allocated, pre-mapped atomic_pool which
avoids any further mapping. __dma_alloc() calls __alloc_from_pool() when
!__GFP_WAIT.

This code got pretty complex and we may find bugs. It can be simplified
by a pre-allocated non-cacheable region that is safe in atomic context
(how big you allocate this is hard to say).

> If the problem which you (Broadcom) are suffering from is down to the
> issue I suspect (that being having mappings with different cache
> attributes) then I'm not sure that there's anything we can realistically
> do about that. There's a number of issues which make it hard to see a
> way forward.

I'm still puzzled by this problem, so I don't have any suggestion yet. I
wouldn't blame the mismatched attributes yet as I haven't seen such
problem in practice (but you never know).

How does the DT describe this device? Could it have some dma-coherent
property in there that causes dma_alloc_coherent() to create a cacheable
memory?

The reverse could also cause problems: the device is coherent but the
CPU creates a non-cacheable mapping.

--
Catalin

2014-12-08 16:22:49

by Arend van Spriel

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On 12/08/14 16:01, Arnd Bergmann wrote:
> On Monday 08 December 2014 13:47:38 Hante Meuleman wrote:
>> Still using outlook, but will limit the line length, I hope that works for the
>> moment. Attached is a log with the requested information, it is a little
>> bit non-standard though. The dump code from the mm was copied in
>> the driver and called from there, mapping the prints back to our local
>> printf, but it should produce the same. I did this because I didn't realize
>> the table is static.
>>
>> Some background on the test setup: I'm using a Broadcom reference
>> design AP platform with an BRCM 4708 host SOC.
>
> I think you are using the wrong dtb file, the log says this is
> a "Buffalo WZR-1750DHP", not the reference design.

That router is close enough to the reference design.

>> For the AP router
>> platform the opensource packet OpenWRT was used. Some small
>> modifications were made to get it to work on our HW. Only one core
>> is enabled for the moment (no time to figure out how to enable the
>> other one). Openwrt was configured to use kernel 3.18-rc2 and
>> the brcmfmac of the compat-wireless code was updated with our
>> latest code (minor patches, which have been submitted already).
>> The device used is 43602 pcie device. Some modifications to the build
>> system were made to enable PCIE. The test is to connect with a
>> client to the AP and run iperf (TCP). The test can run for many hours
>> without a problem, but sometimes fails very quickly.
>
> The bcm4708 platform is maintained by Hauke Mehrtens, adding him to Cc.

Thanks. While going through the DTS files I intended to add him as well ;-)

> In your log, I see this message:
>
> [ 0.000000] PL310 OF: cache setting yield illegal associativity
> [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal
> [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9
> [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9
> [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
> [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB
> [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e130001
>
> Evidently the cache controller information in DT is incorrect and
> the setup may be wrong as a consequence, which may explain cache
> coherency problems.

While staring at the DTS files I suspect there are some parts still
missing. I have attached them for reference. Catalin pointed us to a
patch in the l2 cache [1]. We have not tried that yet.

> Can you verify that the AUX_CTRL value is the same one you see
> in a working kernel?
>
>> The log: first the ring allocation info is printed. Starting at
>> 16.124847, ring 2, 3 and 4 are rings used for device to host. In this
>> log the failure is on a read of ring 3. Ring 3 is 1024 entries of each
>> 16 bytes. The next thing printed is the kernel page tables. Then some
>> OpenWRT info and the logging of part of the connection setup. Then at
>> 1780.130752 the logging of the failure starts. The sequence number is
>> modulo 253 with ring size of 1024 matches an "old" entry (read 40,
>> expected 52). Then the different pointers are printed followed by
>> the kernel page table. The code does then a cache invalidate on the
>> dma_handle and the next read the sequence number is correct.
>
> How do you invalidate the cache? A dma_handle is of type dma_addr_t
> and we don't define an operation for that, nor does it make sense
> on an allocation from dma_alloc_coherent(). What happens if you
> take out the invalidate?

dma_sync_single_for_cpu(, DMA_FROM_DEVICE) which ends up invalidating
the cache (or that is our suspicion).

> Can you post the patch that you use (both platform and driver) relative
> to the snapshot of the the mainline kernel you are basing on?
>
> Arnd
>

Regards,
Arend

[1] http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6529/1


Attachments:
bcm-dt-files.tar.bz2 (2.09 kB)

2014-12-08 15:17:54

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Mon, Dec 08, 2014 at 04:01:32PM +0100, Arnd Bergmann wrote:
> In your log, I see this message:
>
> [ 0.000000] PL310 OF: cache setting yield illegal associativity
> [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal
> [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9
> [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9
> [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
> [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB
> [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e130001
>
> Evidently the cache controller information in DT is incorrect and
> the setup may be wrong as a consequence, which may explain cache
> coherency problems.

No. See d0b92845e54 (ARM: 8182/1: l2c: Make l2x0_cache_size_of_parse() return 'int')
which was merged in rc3, post-dating this kernel. The original commit
was buggy, and produces those harmless "illegal associativity" messages.
They occur if DT *doesn't* specify the cache parameters, in which case
we use the hardware-set value (which /should/ be correct.)

We made these optional in DT as hardware really should set these
correctly in the first place.

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-12-05 11:49:50

by Hante Meuleman

[permalink] [raw]
Subject: RE: using DMA-API on ARM

Thank you for investigating. Your analysis matches what was intended to be done. Regarding the cast, you're right, I'll improve it. My idea was that long long is always 64bit, but when casting directly to long long I get a compiler (when using C=2) warning: "warning: right shift by bigger than source value". Probably I concluded wrongly that I should cast it to long first. On the targets used the high address was always 0, so I got away with it thusfar, but indeed it should and shall be fixed.

Regards,
Hante

-----Original Message-----
From: Russell King - ARM Linux [mailto:[email protected]]
Sent: vrijdag 5 december 2014 12:11
To: Arnd Bergmann
Cc: [email protected]; Arend Van Spriel; linux-wireless; brcm80211-dev-list; David Miller; [email protected]
Subject: Re: using DMA-API on ARM

On Fri, Dec 05, 2014 at 10:52:02AM +0100, Arnd Bergmann wrote:
> I'm still puzzled why you'd need a single dma_sync_single_for_cpu()
> after dma_alloc_coherent though, you should not need any. Is it possible
> that the driver accidentally uses __raw_readl() instead of readl()
> in some places and you are just lacking an appropriate barrier?

Digging into the driver, it looks like individual DMA buffers are
allocated (via brcmf_pcie_init_dmabuffer_for_device) and registered
into a "commonring" layer.

Whenever the buffer is written to, space is first allocated via a call
to brcmf_commonring_reserve_for_write() or
brcmf_commonring_reserve_for_write_multiple(), data written to the
buffer, followed by a call to brcmf_commonring_write_complete().

brcmf_commonring_write_complete() calls two methods at that point:
cr_write_wptr() and cr_ring_bell(), which will be
brcmf_pcie_ring_mb_write_wptr() and brcmf_pcie_ring_mb_ring_bell().

The first calls brcmf_pcie_write_tcm16(), which uses iowrite16(),
which contains the appropriate barrier. The bell ringing functions
also use ioread*/iowrite*().

So, on the write side, it looks fine from the barrier perspective.

On the read side, brcmf_commonring_get_read_ptr() is used before
a read access to the ring - which calls the cr_update_wptr() method,
which in turn uses an ioread16() call. After the CPU has read data
from the ring, brcmf_commonring_read_complete() is used, which uses
iowrite16().

So, I don't see a barrier problem on the read side.

However, I did trip over this:

static void *
brcmf_pcie_init_dmabuffer_for_device(struct brcmf_pciedev_info *devinfo,
u32 size, u32 tcm_dma_phys_addr,
dma_addr_t *dma_handle)
{
void *ring;
long long address;

ring = dma_alloc_coherent(&devinfo->pdev->dev, size, dma_handle,
GFP_KERNEL);
if (!ring)
return NULL;

address = (long long)(long)*dma_handle;

Casting to (long) will truncate the DMA handle to 32-bits on a 32-bit
architecture, even if it supports 64-bit DMA addresses. There's a couple
of other places where this same truncation occurs:

address = (long long)(long)devinfo->shared.scratch_dmahandle;

and

address = (long long)(long)devinfo->shared.ringupd_dmahandle;

In any case, wouldn't using a u64 type for "address" be better - isn't
"long long" 128-bit on 64-bit architectures?

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-12-05 11:11:33

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Fri, Dec 05, 2014 at 10:52:02AM +0100, Arnd Bergmann wrote:
> I'm still puzzled why you'd need a single dma_sync_single_for_cpu()
> after dma_alloc_coherent though, you should not need any. Is it possible
> that the driver accidentally uses __raw_readl() instead of readl()
> in some places and you are just lacking an appropriate barrier?

Digging into the driver, it looks like individual DMA buffers are
allocated (via brcmf_pcie_init_dmabuffer_for_device) and registered
into a "commonring" layer.

Whenever the buffer is written to, space is first allocated via a call
to brcmf_commonring_reserve_for_write() or
brcmf_commonring_reserve_for_write_multiple(), data written to the
buffer, followed by a call to brcmf_commonring_write_complete().

brcmf_commonring_write_complete() calls two methods at that point:
cr_write_wptr() and cr_ring_bell(), which will be
brcmf_pcie_ring_mb_write_wptr() and brcmf_pcie_ring_mb_ring_bell().

The first calls brcmf_pcie_write_tcm16(), which uses iowrite16(),
which contains the appropriate barrier. The bell ringing functions
also use ioread*/iowrite*().

So, on the write side, it looks fine from the barrier perspective.

On the read side, brcmf_commonring_get_read_ptr() is used before
a read access to the ring - which calls the cr_update_wptr() method,
which in turn uses an ioread16() call. After the CPU has read data
from the ring, brcmf_commonring_read_complete() is used, which uses
iowrite16().

So, I don't see a barrier problem on the read side.

However, I did trip over this:

static void *
brcmf_pcie_init_dmabuffer_for_device(struct brcmf_pciedev_info *devinfo,
u32 size, u32 tcm_dma_phys_addr,
dma_addr_t *dma_handle)
{
void *ring;
long long address;

ring = dma_alloc_coherent(&devinfo->pdev->dev, size, dma_handle,
GFP_KERNEL);
if (!ring)
return NULL;

address = (long long)(long)*dma_handle;

Casting to (long) will truncate the DMA handle to 32-bits on a 32-bit
architecture, even if it supports 64-bit DMA addresses. There's a couple
of other places where this same truncation occurs:

address = (long long)(long)devinfo->shared.scratch_dmahandle;

and

address = (long long)(long)devinfo->shared.ringupd_dmahandle;

In any case, wouldn't using a u64 type for "address" be better - isn't
"long long" 128-bit on 64-bit architectures?

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-12-09 10:29:27

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Tue, Dec 09, 2014 at 11:19:40AM +0100, Arend van Spriel wrote:
> The issue did not trigger overnight so it seems setting bit 22 <Shared
> Attribute _Override_ Enable> solves the issue over here. Now the question is
> how to move forward with this. As I understood from Catalin this patch was
> not included as it was not considered responsibility of the linux kernel.

It is preferable for firmware to configure the L2 cache appropriately,
which includes things like the prefetch offsets as well as feature bits
like bit 22.

I think what I'll do is queue up a patch which adds a warning if bit 22
is not set, suggesting that firmware is updated to set this bit.

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-12-08 17:01:53

by Arend van Spriel

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On 12/08/14 17:03, Catalin Marinas wrote:
> On Mon, Dec 08, 2014 at 03:01:32PM +0000, Arnd Bergmann wrote:
>> [ 0.000000] PL310 OF: cache setting yield illegal associativity
>> [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal
>> [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9
>> [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9
>> [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
>> [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB
>> [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e130001
>
> If the above value is correct, they should make sure bit 22 is set in
> AUX_CTRL.

Hante applied the patch and it now says:

[ 0.000000] PL310 OF: cache setting yield illegal associativity
[ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal
[ 0.000000] L2C-310 enabling early BRESP for Cortex-A9
[ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9
[ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
[ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB
[ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e530001

He started running a test overnight. So will see if it hits the failure
with this L2 cache configuration.

Regards,
Arend

2014-12-05 13:23:54

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: using DMA-API on ARM

Please wrap your message - replying to a message which looks like this in
my editor is far from easy, and gives me much more work to /manually/
reformat it before I can reply to it:

On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote:
> The problem is with data coming from device, so DMA from device to host. The $
>
> However: this indicates that dma_alloc_coherent on an ARM target may result i$
>
> Regards,
> Hante

Thanks.

On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote:
> However: this indicates that dma_alloc_coherent on an ARM target may
> result in a memory buffer which can be cached which conflicts with
> the API of this function.

If the memory has an alias which is cacheable, it is possible for cache
lines to get allocated via that alias, even if the alias has no explicit
accesses to it.

This is something which I've been going on for quite literally /years/ -
mismatched cache attributes can cause unpredictable behaviour. I've had
a lot of push back from people who are of the opinion that "if it works
for me, then there isn't a problem" and I eventually gave up fighting
the battle, especially as the ARM architecture people weakened my
reasoning behind it by publishing a relaxation of the "no differing
attributes" issue. This was particularly true of those who wanted to
use ioremap() on system memory - and cases such as
dma_init_coherent_memory().

So, I never fixed this problem in the original DMA allocator code; I
basically gave up with it. It's a latent bug which did need to be fixed,
and is still present today in the non-CMA case.

The symptoms which you are reporting sound very much like this kind of
problem - the virtual address for the memory returned by
dma_alloc_coherent() will not be cacheable memory - it will have been
remapped using map_vm_area(). However, there could very well be a fully
cacheable lowmem mapping of that memory, which if a read (speculative or
otherwise) will bring a cache line in, and because the caches are VIPT
or PIPT, that cache line can be hit via the non-cacheable mapping too.

What I /really/ need is more evidence of this to tell those disbelievers
where to stick their flawed arguments. :)

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-12-08 15:56:07

by Catalin Marinas

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Mon, Dec 08, 2014 at 12:55:38PM +0000, Johannes Stezenbach wrote:
> On Fri, Dec 05, 2014 at 06:53:03PM +0000, Catalin Marinas wrote:
> > On Fri, Dec 05, 2014 at 06:39:45PM +0000, Catalin Marinas wrote:
> > >
> > > Does your system have an L2 cache? What's the SoC topology, can PCIe see
> > > such L2 cache (or snoop the L1 caches)?
> >
> > BTW, if you really have a PL310-like L2 cache, have a look at some
> > patches (I've seen similar symptoms) and make sure your configuration is
> > correct:
> >
> > http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6395/1
> >
> > http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6529/1
> >
> > The first one is vexpress specific. The second one was eventually
> > discarded by Russell (I don't remember the reason, I guess it's because
> > SoC code is supposed to set the right bits in there anyway). In your
> > case, such bits may be set up by firmware, so Linux cannot fix anything
> > up.
>
> How do you avoid the unpredictable behavior mentioned in the
> PL310 TRM when the Shared Attribute Invalidate Enable bit is set?
> http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246h/Ceggcfcj.html

So you talk about "Shared Attribute _Invalidate_ Enable" (bit 13) while
I talk about "Shared Attribute _Override_ Enable" (bit 22).

In addition, Shared _Invalidate_ behaviour can only be enabled if Shared
Attribute _Override_ Enable bit is not set.

--
Catalin

2014-12-05 18:24:52

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Fri, Dec 05, 2014 at 05:38:39PM +0000, Catalin Marinas wrote:
> On Fri, Dec 05, 2014 at 11:11:14AM +0000, Russell King - ARM Linux wrote:
> > In any case, wouldn't using a u64 type for "address" be better - isn't
> > "long long" 128-bit on 64-bit architectures?
>
> No, it's still 64-bit. There is no 128-bit integer in the C standard.

Actually, that's a fallicy.

The C99 standard (like previous versions) does not define exactly the
number of bits in each type. It defines ranks of type, and says that
lower ranks are a subrange of integers with higher ranks (for the same
signed-ness.) See section 6.2.5.

So, it merely states that:

range(char) <= range(short) <= range(int) <= range(long) <= range(long long)

So, an implementation could have:

char: 8 short: 16 int: 16 long: 32 long long: 64
char: 8 short: 16 int: 32 long: 32 long long: 64
char: 8 short: 16 int: 32 long: 64 long long: 64
char: 8 short: 16 int: 64 long: 64 long long: 64

or even:

char: 8 short: 16 int: 32 long: 64 long long: 128

and that would still be compliant with C99, since it continues to meet
the criteria about the required data types specified in the standard.

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-12-08 16:03:32

by Catalin Marinas

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Mon, Dec 08, 2014 at 03:01:32PM +0000, Arnd Bergmann wrote:
> [ 0.000000] PL310 OF: cache setting yield illegal associativity
> [ 0.000000] PL310 OF: -1069781724 calculated, only 8 and 16 legal
> [ 0.000000] L2C-310 enabling early BRESP for Cortex-A9
> [ 0.000000] L2C-310 full line of zeros enabled for Cortex-A9
> [ 0.000000] L2C-310 dynamic clock gating enabled, standby mode enabled
> [ 0.000000] L2C-310 cache controller enabled, 16 ways, 256 kB
> [ 0.000000] L2C-310: CACHE_ID 0x410000c8, AUX_CTRL 0x4e130001

If the above value is correct, they should make sure bit 22 is set in
AUX_CTRL.

--
Catalin

2014-12-05 19:22:09

by Arend van Spriel

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On 12/05/14 19:28, Catalin Marinas wrote:
> On Fri, Dec 05, 2014 at 03:06:48PM +0000, Russell King - ARM Linux wrote:
>> I've been doing more digging into the current DMA code, and I'm dismayed
>> to see that there's new bugs in it...
>>
>> commit 513510ddba9650fc7da456eefeb0ead7632324f6
>> Author: Laura Abbott<[email protected]>
>> Date: Thu Oct 9 15:26:40 2014 -0700
>>
>> common: dma-mapping: introduce common remapping functions
>>
>> This uses map_vm_area() to achieve the remapping of pages allocated inside
>> dma_alloc_coherent(). dma_alloc_coherent() is documented in a rather
>> round-about way in Documentation/DMA-API.txt:
>>
>> | Part Ia - Using large DMA-coherent buffers
>> | ------------------------------------------
>> |
>> | void *
>> | dma_alloc_coherent(struct device *dev, size_t size,
>> | dma_addr_t *dma_handle, gfp_t flag)
>> |
>> | void
>> | dma_free_coherent(struct device *dev, size_t size, void *cpu_addr,
>> | dma_addr_t dma_handle)
>> |
>> | Free a region of consistent memory you previously allocated. dev,
>> | size and dma_handle must all be the same as those passed into
>> | dma_alloc_coherent(). cpu_addr must be the virtual address returned by
>> | the dma_alloc_coherent().
>> |
>> | Note that unlike their sibling allocation calls, these routines
>> | may only be called with IRQs enabled.
>>
>> Note that very last paragraph. What this says is that it is explicitly
>> permitted to call dma_alloc_coherent() with IRQs disabled.
>
> This is solved by using a pre-allocated, pre-mapped atomic_pool which
> avoids any further mapping. __dma_alloc() calls __alloc_from_pool() when
> !__GFP_WAIT.

So we are actually calling dma_alloc_coherent() with GFP_KERNEL during
device probe. That last paragraph Russell pointed out seems to suggest
this is not allowed.

> This code got pretty complex and we may find bugs. It can be simplified
> by a pre-allocated non-cacheable region that is safe in atomic context
> (how big you allocate this is hard to say).
>
>> If the problem which you (Broadcom) are suffering from is down to the
>> issue I suspect (that being having mappings with different cache
>> attributes) then I'm not sure that there's anything we can realistically
>> do about that. There's a number of issues which make it hard to see a
>> way forward.
>
> I'm still puzzled by this problem, so I don't have any suggestion yet. I
> wouldn't blame the mismatched attributes yet as I haven't seen such
> problem in practice (but you never know).
>
> How does the DT describe this device? Could it have some dma-coherent
> property in there that causes dma_alloc_coherent() to create a cacheable
> memory?

Ok. Will add it to our todo list: check DTS files for dma-coherent property.

Thanks,
Arend

> The reverse could also cause problems: the device is coherent but the
> CPU creates a non-cacheable mapping.
>


2014-12-05 14:20:32

by Hante Meuleman

[permalink] [raw]
Subject: RE: using DMA-API on ARM

Ok, I'll add the necessary debug to get all the information out,
but it will take some time to get it done, so I won't have anything
before Monday.

-----Original Message-----
From: Russell King - ARM Linux [mailto:[email protected]]
Sent: vrijdag 5 december 2014 14:24
To: Hante Meuleman
Cc: Will Deacon; Arend Van Spriel; Marek Szyprowski; [email protected]; David Miller; [email protected]; brcm80211-dev-list; linux-wireless
Subject: Re: using DMA-API on ARM

Please wrap your message - replying to a message which looks like this in
my editor is far from easy, and gives me much more work to /manually/
reformat it before I can reply to it:

On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote:
> The problem is with data coming from device, so DMA from device to host. The $
>
> However: this indicates that dma_alloc_coherent on an ARM target may result i$
>
> Regards,
> Hante

Thanks.

On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote:
> However: this indicates that dma_alloc_coherent on an ARM target may
> result in a memory buffer which can be cached which conflicts with
> the API of this function.

If the memory has an alias which is cacheable, it is possible for cache
lines to get allocated via that alias, even if the alias has no explicit
accesses to it.

This is something which I've been going on for quite literally /years/ -
mismatched cache attributes can cause unpredictable behaviour. I've had
a lot of push back from people who are of the opinion that "if it works
for me, then there isn't a problem" and I eventually gave up fighting
the battle, especially as the ARM architecture people weakened my
reasoning behind it by publishing a relaxation of the "no differing
attributes" issue. This was particularly true of those who wanted to
use ioremap() on system memory - and cases such as
dma_init_coherent_memory().

So, I never fixed this problem in the original DMA allocator code; I
basically gave up with it. It's a latent bug which did need to be fixed,
and is still present today in the non-CMA case.

The symptoms which you are reporting sound very much like this kind of
problem - the virtual address for the memory returned by
dma_alloc_coherent() will not be cacheable memory - it will have been
remapped using map_vm_area(). However, there could very well be a fully
cacheable lowmem mapping of that memory, which if a read (speculative or
otherwise) will bring a cache line in, and because the caches are VIPT
or PIPT, that cache line can be hit via the non-cacheable mapping too.

What I /really/ need is more evidence of this to tell those disbelievers
where to stick their flawed arguments. :)

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.

2014-12-08 13:47:41

by Hante Meuleman

[permalink] [raw]
Subject: RE: using DMA-API on ARM

Still using outlook, but will limit the line length, I hope that works for the
moment. Attached is a log with the requested information, it is a little
bit non-standard though. The dump code from the mm was copied in
the driver and called from there, mapping the prints back to our local
printf, but it should produce the same. I did this because I didn't realize
the table is static.

Some background on the test setup: I'm using a Broadcom reference
design AP platform with an BRCM 4708 host SOC. For the AP router
platform the opensource packet OpenWRT was used. Some small
modifications were made to get it to work on our HW. Only one core
is enabled for the moment (no time to figure out how to enable the
other one). Openwrt was configured to use kernel 3.18-rc2 and
the brcmfmac of the compat-wireless code was updated with our
latest code (minor patches, which have been submitted already).
The device used is 43602 pcie device. Some modifications to the build
system were made to enable PCIE. The test is to connect with a
client to the AP and run iperf (TCP). The test can run for many hours
without a problem, but sometimes fails very quickly.

The log: first the ring allocation info is printed. Starting at
16.124847, ring 2, 3 and 4 are rings used for device to host. In this
log the failure is on a read of ring 3. Ring 3 is 1024 entries of each
16 bytes. The next thing printed is the kernel page tables. Then some
OpenWRT info and the logging of part of the connection setup. Then at
1780.130752 the logging of the failure starts. The sequence number is
modulo 253 with ring size of 1024 matches an "old" entry (read 40,
expected 52). Then the different pointers are printed followed by
the kernel page table. The code does then a cache invalidate on the
dma_handle and the next read the sequence number is correct.

Regards,
Hante



Please wrap your message - replying to a message which looks like this in
my editor is far from easy, and gives me much more work to /manually/
reformat it before I can reply to it:

On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote:
> The problem is with data coming from device, so DMA from device to host. The $
>
> However: this indicates that dma_alloc_coherent on an ARM target may result i$
>
> Regards,
> Hante

Thanks.

On Fri, Dec 05, 2014 at 12:56:45PM +0000, Hante Meuleman wrote:
> However: this indicates that dma_alloc_coherent on an ARM target may
> result in a memory buffer which can be cached which conflicts with
> the API of this function.

If the memory has an alias which is cacheable, it is possible for cache
lines to get allocated via that alias, even if the alias has no explicit
accesses to it.

This is something which I've been going on for quite literally /years/ -
mismatched cache attributes can cause unpredictable behaviour. I've had
a lot of push back from people who are of the opinion that "if it works
for me, then there isn't a problem" and I eventually gave up fighting
the battle, especially as the ARM architecture people weakened my
reasoning behind it by publishing a relaxation of the "no differing
attributes" issue. This was particularly true of those who wanted to
use ioremap() on system memory - and cases such as
dma_init_coherent_memory().

So, I never fixed this problem in the original DMA allocator code; I
basically gave up with it. It's a latent bug which did need to be fixed,
and is still present today in the non-CMA case.

The symptoms which you are reporting sound very much like this kind of
problem - the virtual address for the memory returned by
dma_alloc_coherent() will not be cacheable memory - it will have been
remapped using map_vm_area(). However, there could very well be a fully
cacheable lowmem mapping of that memory, which if a read (speculative or
otherwise) will bring a cache line in, and because the caches are VIPT
or PIPT, that cache line can be hit via the non-cacheable mapping too.

What I /really/ need is more evidence of this to tell those disbelievers
where to stick their flawed arguments. :)

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.


Attachments:
cache_fail_dmesg.txt (34.76 kB)
cache_fail_dmesg.txt

2014-12-08 13:38:01

by Johannes Stezenbach

[permalink] [raw]
Subject: Re: using DMA-API on ARM

On Fri, Dec 05, 2014 at 06:53:03PM +0000, Catalin Marinas wrote:
> On Fri, Dec 05, 2014 at 06:39:45PM +0000, Catalin Marinas wrote:
> >
> > Does your system have an L2 cache? What's the SoC topology, can PCIe see
> > such L2 cache (or snoop the L1 caches)?
>
> BTW, if you really have a PL310-like L2 cache, have a look at some
> patches (I've seen similar symptoms) and make sure your configuration is
> correct:
>
> http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6395/1
>
> http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6529/1
>
> The first one is vexpress specific. The second one was eventually
> discarded by Russell (I don't remember the reason, I guess it's because
> SoC code is supposed to set the right bits in there anyway). In your
> case, such bits may be set up by firmware, so Linux cannot fix anything
> up.

How do you avoid the unpredictable behavior mentioned in the
PL310 TRM when the Shared Attribute Invalidate Enable bit is set?
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0246h/Ceggcfcj.html

I think this bit does not do what you seem to think it does, it only
changes behaviour for "writes targeting a full cache line, for example
4x64-bit bursts with all strobes active", which then cause the
cacheline to be invalidated. "Other cases are identical to the default
shared behavior", which is "cacheable no allocate for reads"
and "write through no write allocate for writes".

If the problem is really speculative reads via the cachable
alias mapping, it seems this bit cannot solve the problem, right?


Johannes

2015-01-20 15:23:01

by Fabio Estevam

[permalink] [raw]
Subject: Re: using DMA-API on ARM

Hi Russell,

On Tue, Dec 9, 2014 at 8:29 AM, Russell King - ARM Linux
<[email protected]> wrote:
> On Tue, Dec 09, 2014 at 11:19:40AM +0100, Arend van Spriel wrote:
>> The issue did not trigger overnight so it seems setting bit 22 <Shared
>> Attribute _Override_ Enable> solves the issue over here. Now the question is
>> how to move forward with this. As I understood from Catalin this patch was
>> not included as it was not considered responsibility of the linux kernel.
>
> It is preferable for firmware to configure the L2 cache appropriately,
> which includes things like the prefetch offsets as well as feature bits
> like bit 22.
>
> I think what I'll do is queue up a patch which adds a warning if bit 22
> is not set, suggesting that firmware is updated to set this bit.

Do you mean something like this?

--- a/arch/arm/mm/cache-l2x0.c
+++ b/arch/arm/mm/cache-l2x0.c
@@ -943,6 +943,10 @@ static int __init __l2c_init(const struct
l2c_init_data *data,
pr_info("%s: CACHE_ID 0x%08x, AUX_CTRL 0x%08x\n",
data->type, cache_id, aux);

+ if (!(aux & L2C_AUX_CTRL_SHARED_OVERRIDE))
+ pr_warn("%s: L2C_AUX_CTRL_SHARED_OVERRIDE needs to be set by
the bootloader\n",
+ data->type);
+
return 0;
}