Hi
The DMA mapping scheme introduced in commit cbc8e55f6fda ('igb: Map entire
page and sync half instead of mapping and unmapping half pages') back in
2012, and used up to now, can probably cause breakage of unrelated code
on archs with non-coherent caches.
With this scheme, the page used for Rx is completely dma_map()ed at
allocation time, split into two buffers, and an individual buffer is
sync_to_cpu()ed AND PASSED TO THE NETWORK STACK via skb_add_rx_frag() -
while the driver still uses the other buffer. Later, when the driver
decides to no longer use this page, it dma_unmap()s it completely - which
on archs with non-coherent caches means cache invalidation. This cache
invalidation will include the area that has already been passed elsewhere.
If external code has performed any writes to that area and those writes
are still only in the cache, the invalidation will cause them to be lost.
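In driver terms, the flow is roughly the following (a condensed fragment;
variable names and error handling trimmed relative to the actual igb code):

	/* allocation time: the whole page is mapped once */
	dma = dma_map_page(rx_ring->dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);

	/* receive completion: only the half-page the NIC used is synced... */
	dma_sync_single_range_for_cpu(rx_ring->dev, dma, page_offset,
				      size, DMA_FROM_DEVICE);
	/* ...and that same half is handed to the network stack */
	skb_add_rx_frag(skb, 0, page, page_offset, size, truesize);

	/* much later, when the driver stops reusing the page */
	dma_unmap_page(rx_ring->dev, dma, PAGE_SIZE, DMA_FROM_DEVICE);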
I'm not sure if this breakage is indeed possible. I did not face it, I
just found it while checking how things work.
The code in question has been in the kernel for 4 years already. However,
since (1) igb is mostly used on x86 where caches are coherent, and (2) Rx
buffers are normally not written to, it could have stayed unnoticed all
that time.
Could somebody please comment on this?
Nikita Yushchenko
From: Nikita Yushchenko <[email protected]>
Date: Mon, 10 Oct 2016 11:52:06 +0300
> With this scheme, the page used for Rx is completely dma_map()ed at
> allocation time, split into two buffers, and an individual buffer is
> sync_to_cpu()ed AND PASSED TO THE NETWORK STACK via skb_add_rx_frag() -
> while the driver still uses the other buffer. Later, when the driver
> decides to no longer use this page, it dma_unmap()s it completely - which
> on archs with non-coherent caches means cache invalidation. This cache
> invalidation will include the area that has already been passed elsewhere.
This should happen only if the device wrote into that piece of the
memory, which it absolutely should not.
When the dma sync occurs, no dirty data should be in the caches for
the portion of the page any more, and therefore nothing should be
written back unless the device illegally wrote to that part of the
page again.
If something is causing data to be written back even if the device
doesn't write into that area again, it's a bug.
And FWIW the swiotlb code has this bug. It should never (re-)copy
back into the mapped area after a sync unless the device wrote into
that area in the time between the sync and the unmap.
>> With this scheme, the page used for Rx is completely dma_map()ed at
>> allocation time, split into two buffers, and an individual buffer is
>> sync_to_cpu()ed AND PASSED TO THE NETWORK STACK via skb_add_rx_frag() -
>> while the driver still uses the other buffer. Later, when the driver
>> decides to no longer use this page, it dma_unmap()s it completely - which
>> on archs with non-coherent caches means cache invalidation. This cache
>> invalidation will include the area that has already been passed elsewhere.
>
> This should happen only if the device wrote into that piece of the
> memory, which it absolutely should not.
Hmm... I'm not talking about the device writing to memory.
Sequence in igb driver is:
dma_map(full_page)
<device writes here>
sync_to_cpu(half_page);
skb_add_rx_frag(skb, half_page);
napi_gro_receive(skb);
...
dma_unmap(full_page)
What I'm concerned about is that the same area is first passed up to the
network stack, and only _later_ dma_unmap()ed. Is this indeed safe?
Nikita
From: Nikita Yushchenko <[email protected]>
Date: Mon, 10 Oct 2016 12:51:28 +0300
> Hmm... I'm not talking about the device writing to memory.
This absolutely is about whether the device wrote into the
area or not.
> Sequence in igb driver is:
>
> dma_map(full_page)
> <device writes here>
> sync_to_cpu(half_page);
> skb_add_rx_frag(skb, half_page);
> napi_gro_receive(skb);
> ...
> dma_unmap(full_page)
>
> What I'm concerned about is that the same area is first passed up to the
> network stack, and only _later_ dma_unmap()ed. Is this indeed safe?
dma_unmap() should never write anything unless the device has
meanwhile written to that chunk of memory.
If the device made no intervening writes into the area, dma_unmap()
should not cause any data to be written to that area, period.
In your example above, consider the case where the device never
writes into the memory area after sync_to_cpu(). In that case
there is nothing that dma_unmap() can possibly write. All the
data has been synced, and no device writes into the memory area
have occurred.
>> Hmm... I'm not talking about the device writing to memory.
>
> This absolutely is about whether the device wrote into the
> area or not.
Not only.
>> Sequence in igb driver is:
>>
>> dma_map(full_page)
>> <device writes here>
>> sync_to_cpu(half_page);
>> skb_add_rx_frag(skb, half_page);
>> napi_gro_receive(skb);
>> ...
>> dma_unmap(full_page)
>>
>> What I'm concerned about is that the same area is first passed up to the
>> network stack, and only _later_ dma_unmap()ed. Is this indeed safe?
>
> dma_unmap() should never write anything unless the device has
> meanwhile written to that chunk of memory.
dma_unmap() for DMA_FROM_DEVICE never writes anything to memory,
regardless of what the device did.
dma_unmap() for DMA_FROM_DEVICE ensures that data written to memory
by the device (if any) is visible to the CPU. The cache may contain stale
data for that memory region. To drop that from the cache, dma_unmap() for
DMA_FROM_DEVICE does cache invalidation.
static void arm_dma_unmap_page(struct device *dev, dma_addr_t handle,
		size_t size, enum dma_data_direction dir, unsigned long attrs)
{
	if ((attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0)
		__dma_page_dev_to_cpu(pfn_to_page(dma_to_pfn(dev, handle)),
				      handle & ~PAGE_MASK, size, dir);
}

static void __dma_page_dev_to_cpu(struct page *page, unsigned long off,
		size_t size, enum dma_data_direction dir)
{
	...
	if (dir != DMA_TO_DEVICE) {
		outer_inv_range(paddr, paddr + size);
		dma_cache_maint_page(page, off, size, dir, dmac_unmap_area);
	}
	...
}
> If the device made no intervening writes into the area, dma_unmap()
> should not cause any data to be written to that area, period.
I'm not talking about writing.
I'm talking about just the opposite - dropping not-yet-written-back data
from the cache:
- napi_gro_receive(skb) passes the area to upper layers of the network stack,
- something in those layers - perhaps packet mangling or such - writes
to the area,
- this write enters the cache but does not reach memory immediately,
- at this moment, igb does dma_unmap(),
- the write that was in cache but not yet in memory gets lost.
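As a timeline (an illustration of the steps above, not from the thread):

   CPU / network stack                  igb / ARM DMA code
   -------------------                  ------------------
   write into the rx frag
     -> dirty line in the cache
                                        dma_unmap_page(full page)
                                          -> __dma_page_dev_to_cpu()
                                          -> invalidates the entire page
   the dirty line is discarded before
   any writeback: the write is lost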
> In your example above, consider the case where the device never
> writes into the memory area after sync_to_cpu(). In that case
> there is nothing that dma_unmap() can possibly write.
> All the data has been synced
The non-synced data is the write done by the CPU executing the upper
layers of the network stack.
Nikita
>> All the data has been synced
>
> The non-synced data is the write done by the CPU executing the upper layers of the network stack.
Upper layers should never get an area that is still dma_map()ed and will
be dma_unmap()ed in the future. But with igb, this is exactly what happens.
On Mon, Oct 10, 2016 at 5:27 AM, Nikita Yushchenko
<[email protected]> wrote:
>>> Hmm... I'm not talking about the device writing to memory.
>>
>> This absolutely is about whether the device wrote into the
>> area or not.
>
> Not only.
>
>>> Sequence in igb driver is:
>>>
>>> dma_map(full_page)
>>> <device writes here>
>>> sync_to_cpu(half_page);
>>> skb_add_rx_frag(skb, half_page);
>>> napi_gro_receive(skb);
>>> ...
>>> dma_unmap(full_page)
>>>
>>> What I'm concerned about is that the same area is first passed up to the
>>> network stack, and only _later_ dma_unmap()ed. Is this indeed safe?
>>
>> dma_unmap() should never write anything unless the device has
>> meanwhile written to that chunk of memory.
>
> dma_unmap() for DMA_FROM_DEVICE never writes anything to memory,
> regardless of what the device did.
>
> dma_unmap() for DMA_FROM_DEVICE ensures that data written to memory
> by the device (if any) is visible to the CPU. The cache may contain stale
> data for that memory region. To drop that from the cache, dma_unmap() for
> DMA_FROM_DEVICE does cache invalidation.
>
> static void arm_dma_unmap_page(struct device *dev, dma_addr_t handle,
> 		size_t size, enum dma_data_direction dir, unsigned long attrs)
> {
> 	if ((attrs & DMA_ATTR_SKIP_CPU_SYNC) == 0)
> 		__dma_page_dev_to_cpu(pfn_to_page(dma_to_pfn(dev, handle)),
> 				      handle & ~PAGE_MASK, size, dir);
> }
>
> static void __dma_page_dev_to_cpu(struct page *page, unsigned long off,
> 		size_t size, enum dma_data_direction dir)
> {
> 	...
> 	if (dir != DMA_TO_DEVICE) {
> 		outer_inv_range(paddr, paddr + size);
> 		dma_cache_maint_page(page, off, size, dir, dmac_unmap_area);
> 	}
> 	...
> }
>
>
>> If the device made no intervening writes into the area, dma_unmap()
>> should not cause any data to be written to that area, period.
>
> I'm not talking about writing.
>
> I'm talking about just the opposite - dropping not-yet-written-back data
> from the cache:
>
> - napi_gro_receive(skb) passes the area to upper layers of the network stack,
> - something in those layers - perhaps packet mangling or such - writes
> to the area,
> - this write enters the cache but does not reach memory immediately,
> - at this moment, igb does dma_unmap(),
> - the write that was in cache but not yet in memory gets lost.
>
>
>> In your example above, consider the case where the device never
>> writes into the memory area after sync_to_cpu(). In that case
>> there is nothing that dma_unmap() can possibly write.
>> All the data has been synced
>
> The non-synced data is the write done by the CPU executing the upper layers of the network stack.
The main reason why this isn't a concern for the igb driver is because
we currently pass the page up as read-only. We don't allow the stack
to write into the page by keeping the page count greater than 1 which
means that the page is shared. It isn't until we unmap the page that
the page count is allowed to drop to 1 indicating that it is writable.
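For illustration, the trick looks roughly like this, condensed from the
v4.x igb driver (details and the large-page variant trimmed):

	static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
					  struct page *page)
	{
		/* if anyone else holds a reference, the page may be written
		 * to elsewhere and must not be recycled */
		if (unlikely(page_count(page) != 1))
			return false;
		/* flip to the other half-page buffer */
		rx_buffer->page_offset ^= IGB_RX_BUFSZ;
		/* keep the refcount above 1 so the stack sees the page as
		 * shared, i.e. read-only */
		page_ref_inc(page);
		return true;
	}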
That being said, we are hoping to make the pages writable but in order
to do so I was thinking of adding a new DMA API call that would
destroy the mapping without performing any cache invalidation or sync.
The general idea would be to create the mapping using the map call, to
sync the contents using sync_for_cpu, and then to destroy the mapping
using a destroy call instead of an unmap call. It would allow us to
use streaming mappings without having to worry about possibly
invalidating writes by other holders of the page.
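A sketch of that map/sync/destroy pattern (the "destroy" call did not exist
at the time; dma_unmap_page_attrs() with DMA_ATTR_SKIP_CPU_SYNC, merged
later, is the closest real shape):

	/* create the mapping once, at page allocation time */
	dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
	...
	/* make the received bytes visible to the CPU */
	dma_sync_single_range_for_cpu(dev, dma, offset, len, DMA_FROM_DEVICE);
	...
	/* tear the mapping down without touching the caches, so writes by
	 * later holders of the page cannot be invalidated */
	dma_unmap_page_attrs(dev, dma, PAGE_SIZE, DMA_FROM_DEVICE,
			     DMA_ATTR_SKIP_CPU_SYNC);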
- Alex
Hi Alexander
Thanks for your explanation.
> The main reason why this isn't a concern for the igb driver is because
> we currently pass the page up as read-only. We don't allow the stack
> to write into the page by keeping the page count greater than 1 which
> means that the page is shared. It isn't until we unmap the page that
> the page count is allowed to drop to 1 indicating that it is writable.
Doesn't that mean that sync_to_device() in igb_reuse_rx_page() can be
avoided? If page is read only for entire world, then it can't be dirty
in cache and thus device can safely write to it without preparation step.
Nikita
P.S.
We are observing a strange performance anomaly with igb on an imx6q board.
The test is a simple iperf UDP receive. The other host runs
iperf -c X.X.X.X -u -b xxxM -t 300 -i 3
The imx6q board can run iperf -s -u, or it can run nothing - the result is
the same.
While the generated traffic (controlled via xxx) is low, the softirq thread
on the imx6 board takes near-zero cpu time. With increasing xxx, it stays
near zero - up to some point around 700 Mbps. At that moment softirqd cpu
usage suddenly jumps to almost 100%, with nothing in between: it is
near-zero with slightly smaller traffic, and it is immediately
>99% with slightly larger traffic.
Profiling this situation (>99% in softirqd) with perf gives up to 50% of
hits inside cache invalidation loops. That's why we originally thought
cache invalidation was slow. But the above dependency between traffic and
softirq cpu usage (where the napi code runs) can't be explained by slow
cache invalidation.
Also there are additional factors:
- if the UDP traffic is dropped - via iptables, or via forcing error paths
at different points of the network stack - softirqd cpu usage drops back to
near-zero, although it still does all the same cache invalidations,
- I tried to modify the igb driver to disallow page reuse (made
igb_add_rx_frag() always return false). The result was that the "border
traffic" where softirq cpu usage goes from zero to 100% changed from
~700 Mbps to ~400 Mbps.
Any ideas what can happen there, and how to debug it?
TIA
On Mon, Oct 10, 2016 at 10:00 AM, Nikita Yushchenko
<[email protected]> wrote:
> Hi Alexander
>
> Thanks for your explanation.
>
>> The main reason why this isn't a concern for the igb driver is because
>> we currently pass the page up as read-only. We don't allow the stack
>> to write into the page by keeping the page count greater than 1 which
>> means that the page is shared. It isn't until we unmap the page that
>> the page count is allowed to drop to 1 indicating that it is writable.
>
> Doesn't that mean that sync_to_device() in igb_reuse_rx_page() can be
> avoided? If page is read only for entire world, then it can't be dirty
> in cache and thus device can safely write to it without preparation step.
For the sake of correctness we were adding the
dma_sync_single_range_for_device. Since it is a DMA_FROM_DEVICE
mapping, calling it should really have no effect for most DMA mapping
interfaces.
Also you may want to try updating to the 4.8 version of the driver.
It reduces the size of the dma_sync_single_range_for_cpu loops by
reducing the sync size down to the size that was DMAed into the
buffer.
> Nikita
>
>
> P.S.
>
> We are observing a strange performance anomaly with igb on an imx6q board.
>
> The test is a simple iperf UDP receive. The other host runs
> iperf -c X.X.X.X -u -b xxxM -t 300 -i 3
>
> The imx6q board can run iperf -s -u, or it can run nothing - the result is
> the same.
>
> While the generated traffic (controlled via xxx) is low, the softirq thread
> on the imx6 board takes near-zero cpu time. With increasing xxx, it stays
> near zero - up to some point around 700 Mbps. At that moment softirqd cpu
> usage suddenly jumps to almost 100%, with nothing in between: it is
> near-zero with slightly smaller traffic, and it is immediately
> >99% with slightly larger traffic.
>
> Profiling this situation (>99% in softirqd) with perf gives up to 50% of
> hits inside cache invalidation loops. That's why we originally thought
> cache invalidation was slow. But the above dependency between traffic and
> softirq cpu usage (where the napi code runs) can't be explained by slow
> cache invalidation.
>
> Also there are additional factors:
> - if the UDP traffic is dropped - via iptables, or via forcing error paths
> at different points of the network stack - softirqd cpu usage drops back to
> near-zero, although it still does all the same cache invalidations,
> - I tried to modify the igb driver to disallow page reuse (made
> igb_add_rx_frag() always return false). The result was that the "border
> traffic" where softirq cpu usage goes from zero to 100% changed from
> ~700 Mbps to ~400 Mbps.
>
> Any ideas what can happen there, and how to debug it?
>
> TIA
I'm adding Eric Dumazet as he is more of an expert on all things NAPI
than I am, but it is my understanding that there are known issues in
regards to how the softirq traffic is handled. Specifically I believe
the 0->100% accounting problem is due to the way this is all tracked.
You may want to try pulling the most recent net-next kernel and
testing that to see if you still see the same behavior as Eric has
recently added a fix that is meant to allow for better sharing between
softirq polling and applications when dealing with stuff like UDP
traffic.
As far as identifying the problem areas your best bet would be to push
the CPU to 100% and then identify the hot spots.
- Alex
>>> The main reason why this isn't a concern for the igb driver is because
>>> we currently pass the page up as read-only. We don't allow the stack
>>> to write into the page by keeping the page count greater than 1 which
>>> means that the page is shared. It isn't until we unmap the page that
>>> the page count is allowed to drop to 1 indicating that it is writable.
>>
>> Doesn't that mean that sync_to_device() in igb_reuse_rx_page() can be
>> avoided? If page is read only for entire world, then it can't be dirty
>> in cache and thus device can safely write to it without preparation step.
>
> For the sake of correctness we were adding the
> dma_sync_single_range_for_device.
Could you please elaborate on this "for the sake of correctness"?
If by "correctness" you mean ensuring that the buffer gets the frame DMAed
by the device and that this is not broken by cache activity, then:
- on first use of this buffer after page allocation, sync_for_device()
is not needed due to the previous dma_map_page() call,
- on later uses of the same buffer, sync_for_device() is not needed due
to the buffer being read-only since the dma_map_page() call, thus it can't
be dirty in the cache and thus no writebacks of this area are possible.
If by "correctness" you mean strict following "ownership" concept - i.e.
memory area is "owned" either by cpu or by device, and "ownersip" must
be passed to device before DMA and back to cpu after DMA - then, igb
driver already breaks these rules anyway:
- igb calls dma_map_page() at page allocation time, thus entire page
becomes "owned" by device,
- and then, on first use of second buffer inside the page, igb calls
sync_for_device() for buffer area, despite of that area is already
"owned" by device,
- and later, if a buffer within page gets reused, igb calls
sync_for_device() for entire buffer, despite of only part of buffer was
sync_for_cpu()'ed at time of completing receive of previous frame into
this buffer,
- and later, igb calls dma_unmap_page(), despite of that part of page
was sync_for_cpu()'ed and thus is "owned" by CPU.
Given all that, not calling sync_for_device() before reusing a buffer
won't make the picture much worse :)
> Since it is a DMA_FROM_DEVICE
> mapping, calling it should really have no effect for most DMA mapping
> interfaces.
Unfortunately dma_sync_single_range_for_device() *is* slow on imx6q - it
does cache invalidation. I don't really understand why invalidating the
cache can be slow - it only removes data from the cache, it should not
access slow outer memory - but cache invalidation *is* at the top of perf
profiles.
To get some throughput improvement, I propose removal of that
sync_for_device() before reusing buffer. Will you accept such a patch ;)
> Also you may want to try updating to the 4.8 version of the driver.
> It reduces the size of the dma_sync_single_range_for_cpu loops by
> reducing the sync size down to the size that was DMAed into the
> buffer.
Actually that patch came out of the work I'm currently participating in
;). Sure I have it.
> Specifically I believe
> the 0->100% accounting problem is due to the way this is all tracked.
Thanks for this hint - shame on me not realizing this earlier...
> You may want to try pulling the most recent net-next kernel and
> testing that to see if you still see the same behavior as Eric has
> recently added a fix that is meant to allow for better sharing between
> softirq polling and applications when dealing with stuff like UDP
> traffic.
>
> As far as identifying the problem areas your best bet would be to push
> the CPU to 100% and then identify the hot spots.
Thanks for hints
Nikita
On Tue, Oct 11, 2016 at 11:55 PM, Nikita Yushchenko
<[email protected]> wrote:
>>>> The main reason why this isn't a concern for the igb driver is because
>>>> we currently pass the page up as read-only. We don't allow the stack
>>>> to write into the page by keeping the page count greater than 1 which
>>>> means that the page is shared. It isn't until we unmap the page that
>>>> the page count is allowed to drop to 1 indicating that it is writable.
>>>
>>> Doesn't that mean that sync_to_device() in igb_reuse_rx_page() can be
>>> avoided? If page is read only for entire world, then it can't be dirty
>>> in cache and thus device can safely write to it without preparation step.
>>
>> For the sake of correctness we were adding the
>> dma_sync_single_range_for_device.
>
> Could you please elaborate on this "for the sake of correctness"?
>
> If by "correctness" you mean ensuring that the buffer gets the frame DMAed
> by the device and that this is not broken by cache activity, then:
> - on first use of this buffer after page allocation, sync_for_device()
> is not needed due to the previous dma_map_page() call,
> - on later uses of the same buffer, sync_for_device() is not needed due
> to the buffer being read-only since the dma_map_page() call, thus it can't
> be dirty in the cache and thus no writebacks of this area are possible.
>
> If by "correctness" you mean strict following "ownership" concept - i.e.
> memory area is "owned" either by cpu or by device, and "ownersip" must
> be passed to device before DMA and back to cpu after DMA - then, igb
> driver already breaks these rules anyway:
Sort of. Keep in mind that syncing only what the device has DMAed into
is a recent change and was provided by a third party.
> - igb calls dma_map_page() at page allocation time, thus the entire page
> becomes "owned" by the device,
> - and then, on first use of the second buffer inside the page, igb calls
> sync_for_device() for the buffer area, despite that area already being
> "owned" by the device,
Right. However there is nothing wrong with assigning a buffer to the
device twice especially since we are just starting out. If we wanted
to be more correct we would probably be allocating and deallocating
the pages with the DMA_ATTR_SKIP_CPU_SYNC attribute and then just do
the sync_for_device before reassigning the page back to the device.
> - and later, if a buffer within the page gets reused, igb calls
> sync_for_device() for the entire buffer, despite only part of the buffer
> having been sync_for_cpu()'ed when completing receive of the previous frame
> into this buffer,
This is going to come into play with future changes we have planned.
If we update things to use build_skb we are going to be writing into
other parts of the page as we will be using that for shared info.
> - and later, igb calls dma_unmap_page(), despite part of the page having
> been sync_for_cpu()'ed and thus being "owned" by the CPU.
Right we are looking into finding a fix for that.
> Given all that, not calling sync_for_device() before reusing a buffer
> won't make the picture much worse :)
Actually it will, since you are suggesting taking things in a different
direction than the rest of the community. What we plan to do is
weaken the dma_map/dma_unmap semantics by likely using
DMA_ATTR_SKIP_CPU_SYNC on the map and unmap calls.
>> Since it is a DMA_FROM_DEVICE
>> mapping, calling it should really have no effect for most DMA mapping
>> interfaces.
>
> Unfortunately dma_sync_single_range_for_device() *is* slow on imx6q - it
> does cache invalidation. I don't really understand why invalidating the
> cache can be slow - it only removes data from the cache, it should not
> access slow outer memory - but cache invalidation *is* at the top of perf
> profiles.
>
> To get some throughput improvement, I propose removal of that
> sync_for_device() before reusing buffer. Will you accept such a patch ;)
Not one that gets rid of sync_for_device() in the driver. From what I
can tell there are some DMA APIs that use that to perform the
invalidation on the region of memory so that it can be DMAed into.
Without that we run the risk of having a race between something the
CPU might have placed in the cache and something the device wrote into
memory. The sync_for_device() call is meant to invalidate the cache
for the region so that when the device writes into memory there is no
risk of that race.
What you may want to do is look at the DMA API you are using and
determine if it is functioning correctly. Most DMA APIs I am familiar
with will sync Rx data on either the sync_for_cpu() or the
sync_for_device(), but they should not sync on both. The fact that it is
syncing on both makes me wonder if the API was modified to work around
a buggy driver that didn't follow the proper semantics for buffer
ownership instead of just fixing the buggy driver.
>> Also you may want to try updating to the 4.8 version of the driver.
>> It reduces the size of the dma_sync_single_range_for_cpu loops by
>> reducing the sync size down to the size that was DMAed into the
>> buffer.
>
> Actually that patch came out of the work I'm currently participating in
> ;). Sure I have it.
>
>> Specifically I believe
>> the 0->100% accounting problem is due to the way this is all tracked.
>
> Thanks for this hint - shame on me not realizing this earlier...
>
>> You may want to try pulling the most recent net-next kernel and
>> testing that to see if you still see the same behavior as Eric has
>> recently added a fix that is meant to allow for better sharing between
>> softirq polling and applications when dealing with stuff like UDP
>> traffic.
>>
>> As far as identifying the problem areas your best bet would be to push
>> the CPU to 100% and then identify the hot spots.
>
> Thanks for hints
>
>
> Nikita
From: Alexander Duyck
> Sent: 12 October 2016 16:33
...
> > To get some throughput improvement, I propose removal of that
> > sync_for_device() before reusing buffer. Will you accept such a patch ;)
>
> Not one that gets rid of sync_for_device() in the driver. From what I
> can tell there are some DMA APIs that use that to perform the
> invalidation on the region of memory so that it can be DMAed into.
> Without that we run the risk of having a race between something the
> CPU might have placed in the cache and something the device wrote into
> memory. The sync_for_device() call is meant to invalidate the cache
> for the region so that when the device writes into memory there is no
> risk of that race.
I'm not an expert, but some thoughts...
Just remember that the cpu can do speculative memory and cache line reads.
So you need to ensure there are no dirty cache lines when the receive
buffer is set up, and no cache lines at all before looking at the frame.
So unless you know the exact rules for these speculative cache line reads,
you have to invalidate the cache after the buffer is written to by the
hardware, even if it was invalidated when the buffer was set up.
If you can 100% guarantee the cpu hasn't dirtied the cache then I
think the invalidate prior to reusing the buffer can be skipped.
But I wouldn't want to debug that going wrong.
Might be provably safe in the 'copybreak' path.
David
>>> To get some throughput improvement, I propose removal of that
>>> sync_for_device() before reusing buffer. Will you accept such a patch ;)
>>
>> Not one that gets rid of sync_for_device() in the driver. From what I
>> can tell there are some DMA APIs that use that to perform the
>> invalidation on the region of memory so that it can be DMAed into.
>> Without that we run the risk of having a race between something the
>> CPU might have placed in the cache and something the device wrote into
>> memory. The sync_for_device() call is meant to invalidate the cache
>> for the region so that when the device writes into memory there is no
>> risk of that race.
>
> I'm not an expert, but some thoughts...
>
> Just remember that the cpu can do speculative memory and cache line reads.
> So you need to ensure there are no dirty cache lines when the receive
> buffer is set up, and no cache lines at all before looking at the frame.
Exactly.
And because of that, arm does cache invalidation both in sync_for_cpu()
and sync_for_device().
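For reference, the matching map/sync_for_device path on ARM (shape as in
the v4.x arch/arm/mm/dma-mapping.c; dmac_map_area invalidates L1 for
DMA_FROM_DEVICE and cleans it for other directions):

	static void __dma_page_cpu_to_dev(struct page *page, unsigned long off,
			size_t size, enum dma_data_direction dir)
	{
		phys_addr_t paddr;

		dma_cache_maint_page(page, off, size, dir, dmac_map_area);

		paddr = page_to_phys(page) + off;
		if (dir == DMA_FROM_DEVICE) {
			outer_inv_range(paddr, paddr + size);
		} else {
			outer_clean_range(paddr, paddr + size);
		}
		/* FIXME: non-speculating: flush on bidirectional mappings? */
	}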
> If you can 100% guarantee the cpu hasn't dirtied the cache then I
> think the invalidate prior to reusing the buffer can be skipped.
> But I wouldn't want to debug that going wrong.
I've written earlier in this thread why this is the case for igb - as long
as Alexander's statement that igb's buffers are read-only for the stack
is true. If Alexander's statement is wrong, then igb is vulnerable to the
issue I've described in the first message of this thread.
Btw, we have been unable to get any breakage with that sync_to_device()
removed for a day already. And - our tests run with software checksumming,
because on our hardware the i210 is wired to a DSA switch, and in this
config no i210 offloading works (the i210 hardware gets frames with
eDSA headers that it can't parse). Thus any breakage should be
discovered immediately.
And throughput measured by iperf rises from 650 to 850 Mbps.
Nikita
On Wed, Oct 12, 2016 at 9:11 AM, Nikita Yushchenko
<[email protected]> wrote:
>>>> To get some throughput improvement, I propose removal of that
>>>> sync_for_device() before reusing buffer. Will you accept such a patch ;)
>>>
>>> Not one that gets rid of sync_for_device() in the driver. From what I
>>> can tell there are some DMA APIs that use that to perform the
>>> invalidation on the region of memory so that it can be DMAed into.
>>> Without that we run the risk of having a race between something the
>>> CPU might have placed in the cache and something the device wrote into
>>> memory. The sync_for_device() call is meant to invalidate the cache
>>> for the region so that when the device writes into memory there is no
>>> risk of that race.
>>
>> I'm not an expert, but some thoughts...
>>
>> Just remember that the cpu can do speculative memory and cache line reads.
>> So you need to ensure there are no dirty cache lines when the receive
>> buffer is set up, and no cache lines at all before looking at the frame.
>
> Exactly.
>
> And because of that, arm does cache invalidation both in sync_for_cpu()
> and sync_for_device().
>
>> If you can 100% guarantee the cpu hasn't dirtied the cache then I
>> think the invalidate prior to reusing the buffer can be skipped.
>> But I wouldn't want to debug that going wrong.
>
> I've written earlier in this thread why this is the case for igb - as long
> as Alexander's statement that igb's buffers are read-only for the stack
> is true. If Alexander's statement is wrong, then igb is vulnerable to the
> issue I've described in the first message of this thread.
>
> Btw, we have been unable to get any breakage with that sync_to_device()
> removed for a day already. And - our tests run with software checksumming,
> because on our hardware the i210 is wired to a DSA switch, and in this
> config no i210 offloading works (the i210 hardware gets frames with
> eDSA headers that it can't parse). Thus any breakage should be
> discovered immediately.
>
> And throughput measured by iperf rises from 650 to 850 Mbps.
>
> Nikita
The point I was trying to make is that you are invalidating the cache
in both the sync_for_device and sync_for_cpu. Do you really need that
for ARM or do you need to perform the invalidation on sync_for_device
if that may be pushed out anyway? If you aren't running with
speculative look-ups do you even need the invalidation in
sync_for_cpu? The underlying problem lives in the code for
__dma_page_cpu_to_dev and __dma_page_dev_to_cpu. It seems like they
are trying to solve for both speculative and non-speculative setups
and having "FIXME" comments in the code related to this kind of points
to your problems being there.
Changing the driver code for this won't necessarily work on all
architectures, and on top of it we have some changes planned which
will end up making the pages writable in the future to support the
ongoing XDP effort. That is one of the reasons why I would not be
okay with changing the driver to make this work.
It would make more sense to update the DMA API for
__dma_page_cpu_to_dev on ARM so that you don't invalidate the cache if
the direction is DMA_FROM_DEVICE.
- Alex
> It would make more sense to update the DMA API for
> __dma_page_cpu_to_dev on ARM so that you don't invalidate the cache if
> the direction is DMA_FROM_DEVICE.
No, in generic case it's unsafe.
If CPU issued a write to a location, and sometime later that location is
used as DMA buffer, there is danger that write is still in cache only,
and writeback is pending. Later this writeback can overwrite data
written to memory via DMA, causing corruption.
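As a timeline (an illustration):

   CPU: write to location X        -> dirty line sits in the cache
   ... X is later handed out as an rx DMA buffer ...
   device: DMA-writes a frame to X -> fresh data lands in RAM
   cache: evicts the dirty line    -> stale CPU write overwrites the frame

The invalidation done at map / sync_for_device time discards the dirty
line up front, which is why it can't simply be skipped for DMA_FROM_DEVICE.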
> The point I was trying to make is that you are invalidating the cache
> in both the sync_for_device and sync_for_cpu. Do you really need that
> for ARM or do you need to perform the invalidation on sync_for_device
> if that may be pushed out anyway? If you aren't running with
> speculative look-ups do you even need the invalidation in
> sync_for_cpu?
I'm not a lowlevel arm guru and I don't know for sure. Probably another
CPU core can be accessing locations in a neighboring page, causing
speculative load of locations in the DMA page.
> Changing the driver code for this won't necessarily work on all
> architectures, and on top of it we have some changes planned which
> will end up making the pages writable in the future to support the
> ongoing XDP effort. That is one of the reasons why I would not be
> okay with changing the driver to make this work.
Well I was not really serious about removing that sync_for_device() in
mainline :) Although >20% throughput win that this provides is
impressive...
But what about doing something safer, e.g. adding a bit of tracking and
only sync_for_device() what was previously sync_for_cpu()ed? Will you
accept that?
Nikita
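One possible shape of the tracking Nikita proposes (a hypothetical sketch;
'synced_len' is an invented field, not in the driver):

	struct igb_rx_buffer {
		dma_addr_t dma;
		struct page *page;
		unsigned int page_offset;
		unsigned int synced_len;	/* bytes made visible to the CPU */
	};

	/* on receive completion: */
	dma_sync_single_range_for_cpu(dev, bi->dma, bi->page_offset,
				      size, DMA_FROM_DEVICE);
	bi->synced_len = size;

	/* on buffer reuse: hand back to the device only what the CPU saw */
	if (bi->synced_len) {
		dma_sync_single_range_for_device(dev, bi->dma, bi->page_offset,
						 bi->synced_len,
						 DMA_FROM_DEVICE);
		bi->synced_len = 0;
	}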
On Wed, Oct 12, 2016 at 11:12 AM, Nikita Yushchenko
<[email protected]> wrote:
>> It would make more sense to update the DMA API for
>> __dma_page_cpu_to_dev on ARM so that you don't invalidate the cache if
>> the direction is DMA_FROM_DEVICE.
>
> No, in generic case it's unsafe.
>
> If CPU issued a write to a location, and sometime later that location is
> used as DMA buffer, there is danger that write is still in cache only,
> and writeback is pending. Later this writeback can overwrite data
> written to memory via DMA, causing corruption.
Okay so if I understand it correctly then the invalidation in
sync_for_device is to force any writes to be flushed out, and the
invalidation in sync_for_cpu is to flush out any speculative reads.
So without speculative reads then the sync_for_cpu portion is not
needed. You might want to determine if the core you are using
supports the speculative reads, if not you might be able to get away
without having to do the sync_for_cpu at all.
>> The point I was trying to make is that you are invalidating the cache
>> in both the sync_for_device and sync_for_cpu. Do you really need that
>> for ARM or do you need to perform the invalidation on sync_for_device
>> if that may be pushed out anyway? If you aren't running with
>> speculative look-ups do you even need the invalidation in
>> sync_for_cpu?
>
> I'm not a lowlevel arm guru and I don't know for sure. Probably another
> CPU core can be accessing locations in a neighboring page, causing
> speculative load of locations in the DMA page.
>
>
>> Changing the driver code for this won't necessarily work on all
>> architectures, and on top of it we have some changes planned which
>> will end up making the pages writable in the future to support the
>> ongoing XDP effort. That is one of the reasons why I would not be
>> okay with changing the driver to make this work.
>
> Well I was not really serious about removing that sync_for_device() in
> mainline :) Although >20% throughput win that this provides is
> impressive...
I agree the improvement is pretty impressive. The thing is there are
similar gains we can generate on x86 by stripping out bits and pieces
that are needed for other architectures. I'm just wanting to make
certain we aren't optimizing for one architecture to the detriment of
others.
> But what about doing something safer, e.g. adding a bit of tracking and
> only sync_for_device() what was previously sync_for_cpu()ed? Will you
> accept that?
>
> Nikita
The problem is that as we move things over for XDP we will be looking
at having the CPU potentially write to any spot in the region that was
mapped as we could append headers to the front or pad data onto the
end of the frame. It is probably safest for us to invalidate the
entire region just to make sure we don't have a collision with
something that is writing to the page.
So for example in the near future I am planning to expand out the
DMA_ATTR_SKIP_CPU_SYNC DMA attribute beyond just the ARM architecture
to see if I can expand it for use with SWIOTLB. If we can use this on
unmap we might be able to solve some of the existing problems that
required us to make the page read-only since we could unmap the page
without invalidating any existing writes on the page.
- Alex
From: Alexander Duyck
> Sent: 12 October 2016 19:30
> On Wed, Oct 12, 2016 at 11:12 AM, Nikita Yushchenko
> <[email protected]> wrote:
> >> It would make more sense to update the DMA API for
> >> __dma_page_cpu_to_dev on ARM so that you don't invalidate the cache if
> >> the direction is DMA_FROM_DEVICE.
> >
> > No, in generic case it's unsafe.
> >
> > If CPU issued a write to a location, and sometime later that location is
> > used as DMA buffer, there is danger that write is still in cache only,
> > and writeback is pending. Later this writeback can overwrite data
> > written to memory via DMA, causing corruption.
>
> Okay so if I understand it correctly then the invalidation in
> sync_for_device is to force any writes to be flushed out, and the
> invalidation in sync_for_cpu is to flush out any speculative reads.
> So without speculative reads then the sync_for_cpu portion is not
> needed. You might want to determine if the core you are using
> supports the speculative reads, if not you might be able to get away
> without having to do the sync_for_cpu at all.
For reads the sync_for_device invalidates (ie discards the contents of)
the cache lines to ensure the cpu won't write back dirty lines.
The sync_for_cpu invalidates the cache to ensure the cpu actually
reads from memory.
For writes you need to flush (write back) cache lines prior to the DMA.
I don't think anything is needed when the transfer finishes.
If headers/trailers might be added the sync_for_device must cover
the entire area the frame will be written to.
Clearly the sync_for_cpu need only cover the valid frame data.
I suspect you can safely skip the sync_for_device in the 'copybreak'
path but not if the frame is passed up the protocol stack.
David
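Summarizing these rules for a typical non-coherent ARM setup (per the ARM
code quoted earlier in the thread):

  operation               DMA_FROM_DEVICE (rx)     DMA_TO_DEVICE (tx)
  map / sync_for_device   invalidate caches        clean (write back)
  sync_for_cpu / unmap    invalidate caches        no-op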
>>> It would make more sense to update the DMA API for
>>> __dma_page_cpu_to_dev on ARM so that you don't invalidate the cache if
>>> the direction is DMA_FROM_DEVICE.
>>
>> No, in generic case it's unsafe.
>>
>> If CPU issued a write to a location, and sometime later that location is
>> used as DMA buffer, there is danger that write is still in cache only,
>> and writeback is pending. Later this writeback can overwrite data
>> written to memory via DMA, causing corruption.
>
> Okay so if I understand it correctly then the invalidation in
> sync_for_device is to force any writes to be flushed out, and the
> invalidation in sync_for_cpu is to flush out any speculative reads.
> So without speculative reads then the sync_for_cpu portion is not
> needed. You might want to determine if the core you are using
> supports the speculative reads, if not you might be able to get away
> without having to do the sync_for_cpu at all.
pl310 L2 cache controller does support prefetching - and I think most
arm systems use pl310.
>>> Changing the driver code for this won't necessarily work on all
>>> architectures, and on top of it we have some changes planned which
>>> will end up making the pages writable in the future to support the
>>> ongoing XDP effort. That is one of the reasons why I would not be
>>> okay with changing the driver to make this work.
>>
>> Well I was not really serious about removing that sync_for_device() in
>> mainline :) Although >20% throughput win that this provides is
>> impressive...
>
> I agree the improvement is pretty impressive. The thing is there are
> similar gains we can generate on x86 by stripping out bits and pieces
> that are needed for other architectures. I'm just wanting to make
> certain we aren't optimizing for one architecture to the detriment of
> others.
Well, in an ideal world the needs of other architectures should not limit
x86 - and if they do then that's a bug and should be fixed - by designing
a proper set of abstractions :)
Perhaps the issue is that the "do whatever is needed for the device to
perform DMA correctly" semantics of dma_map() / sync_for_device() - and
symmetrically for dma_unmap() / sync_for_cpu() - are too abstract, and
that hurts performance. In particular, "setup i/o" and "sync caches" are
different activities with conflicting performance properties: for better
performance, one wants to set up i/o for larger blocks, but sync caches
for smaller blocks. Probably separating these activities into different
calls is the way to better performance.
>> But what about doing something safer, e.g. adding a bit of tracking and
>> only sync_for_device() what was previously sync_for_cpu()ed? Will you
>> accept that?
>
> The problem is that as we move things over for XDP we will be looking
> at having the CPU potentially write to any spot in the region that was
> mapped as we could append headers to the front or pad data onto the
> end of the frame. It is probably safest for us to invalidate the
> entire region just to make sure we don't have a collision with
> something that is writing to the page.
Can't comment without knowing the particular access patterns that XDP will
cause. Still, the rule is simple - "invalidate more" does hurt performance,
thus one needs to invalidate the minimal required area. To avoid this
invalidation thing hurting performance on x86, which does not need
invalidation at all, a good idea is to use some compile-time magic - just
to compile out unneeded things completely.
Still, XDP is a future question; currently igb does not use it. Why not
improve the sync_for_cpu() / sync_for_device() pairing in the current code?
I can prepare such a patch. If XDP makes it irrelevant in the future,
perhaps it could just be undone (and if this causes performance
degradation, then it will be something to work on)
> So for example in the near future I am planning to expand out the
> DMA_ATTR_SKIP_CPU_SYNC DMA attribute beyond just the ARM architecture
> to see if I can expand it for use with SWIOTLB. If we can use this on
> unmap we might be able to solve some of the existing problems that
> required us to make the page read-only since we could unmap the page
> without invalidating any existing writes on the page.
Ack. Actually it is the same decoupling between "setup i/o" and "sync
caches" I've mentioned above.
Nikita
On Thu, Oct 13, 2016 at 4:00 AM, Nikita Yushchenko
<[email protected]> wrote:
>>>> It would make more sense to update the DMA API for
>>>> __dma_page_cpu_to_dev on ARM so that you don't invalidate the cache if
>>>> the direction is DMA_FROM_DEVICE.
>>>
>>> No, in generic case it's unsafe.
>>>
>>> If CPU issued a write to a location, and sometime later that location is
>>> used as DMA buffer, there is danger that write is still in cache only,
>>> and writeback is pending. Later this writeback can overwrite data
>>> written to memory via DMA, causing corruption.
>>
>> Okay so if I understand it correctly then the invalidation in
>> sync_for_device is to force any writes to be flushed out, and the
>> invalidation in sync_for_cpu is to flush out any speculative reads.
>> So without speculative reads then the sync_for_cpu portion is not
>> needed. You might want to determine if the core you are using
>> supports the speculative reads, if not you might be able to get away
>> without having to do the sync_for_cpu at all.
>
> pl310 L2 cache controller does support prefetching - and I think most
> arm systems use pl310.
>
>
>>>> Changing the driver code for this won't necessarily work on all
>>>> architectures, and on top of it we have some changes planned which
>>>> will end up making the pages writable in the future to support the
>>>> ongoing XDP effort. That is one of the reasons why I would not be
>>>> okay with changing the driver to make this work.
>>>
>>> Well I was not really serious about removing that sync_for_device() in
>>> mainline :) Although >20% throughput win that this provides is
>>> impressive...
>>
>> I agree the improvement is pretty impressive. The thing is there are
>> similar gains we can generate on x86 by stripping out bits and pieces
>> that are needed for other architectures. I'm just wanting to make
>> certain we aren't optimizing for one architecture to the detriment of
>> others.
>
> Well, in an ideal world the needs of other architectures should not limit
> x86 - and if they do then that's a bug and should be fixed - by designing
> a proper set of abstractions :)
>
> Perhaps the issue is that the "do whatever is needed for the device to
> perform DMA correctly" semantics of dma_map() / sync_for_device() - and
> symmetrically for dma_unmap() / sync_for_cpu() - are too abstract, and
> that hurts performance. In particular, "setup i/o" and "sync caches" are
> different activities with conflicting performance properties: for better
> performance, one wants to set up i/o for larger blocks, but sync caches
> for smaller blocks. Probably separating these activities into different
> calls is the way to better performance.
>
>
>>> But what about doing something safer, e.g. adding a bit of tracking and
>>> only sync_for_device() what was previously sync_for_cpu()ed? Will you
>>> accept that?
>>
>> The problem is that as we move things over for XDP we will be looking
>> at having the CPU potentially write to any spot in the region that was
>> mapped as we could append headers to the front or pad data onto the
>> end of the frame. It is probably safest for us to invalidate the
>> entire region just to make sure we don't have a collision with
>> something that is writing to the page.
>
> Can't comment without knowing the particular access patterns that XDP will
> cause. Still, the rule is simple - "invalidate more" does hurt performance,
> thus one needs to invalidate the minimal required area. To avoid this
> invalidation thing hurting performance on x86, which does not need
> invalidation at all, a good idea is to use some compile-time magic - just
> to compile out unneeded things completely.
On x86 we don't perform any invalidation. We just have the function
pointers usually referencing functions that don't do anything unless
we can't handle the DMA address for some reason as we usually default
to swiotlb setup on x86_64.
> Still, XDP is a future question; currently igb does not use it. Why not
> improve the sync_for_cpu() / sync_for_device() pairing in the current code?
> I can prepare such a patch. If XDP makes it irrelevant in the future,
> perhaps it could just be undone (and if this causes performance
> degradation, then it will be something to work on)
So the plan is to update the code path in the near future. I'm
working on rewriting the DMA APIs for the drivers now, will update the
page count handling for the pages, and in the near future we should be
able to support XDP and the use of build_skb. Once we get to that
point there should be a modest bump in performance since we can drop
the memcpy for headers.
>> So for example in the near future I am planning to expand out the
>> DMA_ATTR_SKIP_CPU_SYNC DMA attribute beyond just the ARM architecture
>> to see if I can expand it for use with SWIOTLB. If we can use this on
>> unmap we might be able to solve some of the existing problems that
>> required us to make the page read-only since we could unmap the page
>> without invalidating any existing writes on the page.
>
> Ack. Actually it is the same decoupling between "setup i/o" and "sync
> caches" I've mentioned above.
Right. That is kind of what I was thinking as well. We need to be
able to create and destroy DMA mappable pages without corrupting
things once we hand the pages up the stack. If we can get away from
having to worry about unmap invalidating any writes from the CPU then
we can get away with using things like build_skb which should give us
a pretty good increase in throughput.
- Alex