2023-06-29 15:51:21

by Alexander Lobakin

Subject: [PATCH RFC net-next 2/4] net: page_pool: avoid calling no-op externals when possible

It turns out that page_pool_put{,_full}_page() can burn quite a few
cycles even on DMA-coherent platforms (like x86) with no active IOMMU
or swiotlb, just for the call ladder.
Indeed, it's

page_pool_put_page()
page_pool_put_defragged_page() <- external
__page_pool_put_page()
page_pool_dma_sync_for_device() <- non-inline
dma_sync_single_range_for_device()
dma_sync_single_for_device() <- external
dma_direct_sync_single_for_device()
dev_is_dma_coherent() <- exit

For the inline functions, there are no guarantees the compiler won't
uninline them (they're clearly not one-liners, and sometimes compilers
uninline even 2 + 2). The first external call is necessary, but the
remaining two or more are made for nothing each time, plus a bunch of
checks here and there.
Since Page Pool mappings are long-term and dma_need_sync() will always
return the same value for one "device + addr" pair (basically, whether
it belongs to an swiotlb pool), addresses can be tested once, right
after they're obtained, and the result can be reused until the page is
unmapped.
Define a new PP flag meaning "do DMA syncs for the device, but only
when needed" and turn it on by default when the driver asks to sync
pages. When a page is mapped, check whether it needs syncs and, if so,
switch that "sync when needed" back to "always do syncs" globally for
the whole pool (better safe than sorry). As long as a pool has no pages
requiring DMA syncs, this cuts off a good chunk of calls and checks.
On my x86_64, this gives a 2% to 5% performance benefit, with no
negative impact in cases where the IOMMU is on and the shortcut can't
be used.

Signed-off-by: Alexander Lobakin <[email protected]>
---
include/net/page_pool.h | 3 +++
net/core/page_pool.c | 10 ++++++++++
2 files changed, 13 insertions(+)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 829dc1f8ba6b..ff3772fab707 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -23,6 +23,9 @@
* Please note DMA-sync-for-CPU is still
* device driver responsibility
*/
+#define PP_FLAG_DMA_MAYBE_SYNC BIT(2) /* Internal, should not be used in
+ * drivers
+ */
#define PP_FLAG_ALL (PP_FLAG_DMA_MAP |\
PP_FLAG_DMA_SYNC_DEV)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index dff0b4fa2316..498e058140b3 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -197,6 +197,10 @@ static int page_pool_init(struct page_pool *pool,
/* pool->p.offset has to be set according to the address
* offset used by the DMA engine to start copying rx data
*/
+
+ /* Try to avoid calling no-op syncs */
+ pool->p.flags |= PP_FLAG_DMA_MAYBE_SYNC;
+ pool->p.flags &= ~PP_FLAG_DMA_SYNC_DEV;
}

#ifdef CONFIG_PAGE_POOL_STATS
@@ -341,6 +345,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)

page_pool_set_dma_addr(page, dma);

+ if ((pool->p.flags & PP_FLAG_DMA_MAYBE_SYNC) &&
+ dma_need_sync(pool->p.dev, dma)) {
+ pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;
+ pool->p.flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
+ }
+
if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
page_pool_dma_sync_for_device(pool, page, pool->p.max_len);

--
2.41.0



2023-06-29 17:28:07

by Alexander Duyck

Subject: Re: [PATCH RFC net-next 2/4] net: page_pool: avoid calling no-op externals when possible

On Thu, 2023-06-29 at 17:23 +0200, Alexander Lobakin wrote:
> It turns out that page_pool_put{,_full}_page() can burn quite a few
> cycles even on DMA-coherent platforms (like x86) with no active IOMMU
> or swiotlb, just for the call ladder.
> Indeed, it's
>
> page_pool_put_page()
> page_pool_put_defragged_page() <- external
> __page_pool_put_page()
> page_pool_dma_sync_for_device() <- non-inline
> dma_sync_single_range_for_device()
> dma_sync_single_for_device() <- external
> dma_direct_sync_single_for_device()
> dev_is_dma_coherent() <- exit
>
> For the inline functions, there are no guarantees the compiler won't
> uninline them (they're clearly not one-liners, and sometimes compilers
> uninline even 2 + 2). The first external call is necessary, but the
> remaining two or more are made for nothing each time, plus a bunch of
> checks here and there.
> Since Page Pool mappings are long-term and dma_need_sync() will always
> return the same value for one "device + addr" pair (basically, whether
> it belongs to an swiotlb pool), addresses can be tested once, right
> after they're obtained, and the result can be reused until the page is
> unmapped.
> Define a new PP flag meaning "do DMA syncs for the device, but only
> when needed" and turn it on by default when the driver asks to sync
> pages. When a page is mapped, check whether it needs syncs and, if so,
> switch that "sync when needed" back to "always do syncs" globally for
> the whole pool (better safe than sorry). As long as a pool has no pages
> requiring DMA syncs, this cuts off a good chunk of calls and checks.
> On my x86_64, this gives a 2% to 5% performance benefit, with no
> negative impact in cases where the IOMMU is on and the shortcut can't
> be used.
>
> Signed-off-by: Alexander Lobakin <[email protected]>
> ---
> include/net/page_pool.h | 3 +++
> net/core/page_pool.c | 10 ++++++++++
> 2 files changed, 13 insertions(+)
>
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 829dc1f8ba6b..ff3772fab707 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -23,6 +23,9 @@
> * Please note DMA-sync-for-CPU is still
> * device driver responsibility
> */
> +#define PP_FLAG_DMA_MAYBE_SYNC BIT(2) /* Internal, should not be used in
> + * drivers
> + */
> #define PP_FLAG_ALL (PP_FLAG_DMA_MAP |\
> PP_FLAG_DMA_SYNC_DEV)
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index dff0b4fa2316..498e058140b3 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -197,6 +197,10 @@ static int page_pool_init(struct page_pool *pool,
> /* pool->p.offset has to be set according to the address
> * offset used by the DMA engine to start copying rx data
> */
> +
> + /* Try to avoid calling no-op syncs */
> + pool->p.flags |= PP_FLAG_DMA_MAYBE_SYNC;
> + pool->p.flags &= ~PP_FLAG_DMA_SYNC_DEV;
> }
>
> #ifdef CONFIG_PAGE_POOL_STATS
> @@ -341,6 +345,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>
> page_pool_set_dma_addr(page, dma);
>
> + if ((pool->p.flags & PP_FLAG_DMA_MAYBE_SYNC) &&
> + dma_need_sync(pool->p.dev, dma)) {
> + pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;
> + pool->p.flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
> + }
> +
> if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
>

I am pretty sure the logic is flawed here. The problem is
dma_need_sync() depends on the DMA address being used. In the worst-case
scenario we could have a device that has something like a 32-bit DMA
address space on a system with over 4GB of memory. In such a case the
higher addresses would need to be synced because they will go off to a
swiotlb bounce buffer while the lower addresses wouldn't.

If you were to store a flag like this it would have to be generated per
page.

2023-06-30 13:08:03

by Alexander Lobakin

Subject: Re: [PATCH RFC net-next 2/4] net: page_pool: avoid calling no-op externals when possible

From: Alexander H Duyck <[email protected]>
Date: Thu, 29 Jun 2023 09:45:26 -0700

> On Thu, 2023-06-29 at 17:23 +0200, Alexander Lobakin wrote:
>> It turns out that page_pool_put{,_full}_page() can burn quite a few
>> cycles even on DMA-coherent platforms (like x86) with no active IOMMU
>> or swiotlb, just for the call ladder.
>> Indeed, it's

[...]

>> @@ -341,6 +345,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>
>> page_pool_set_dma_addr(page, dma);
>>
>> + if ((pool->p.flags & PP_FLAG_DMA_MAYBE_SYNC) &&
>> + dma_need_sync(pool->p.dev, dma)) {
>> + pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;
>> + pool->p.flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
>> + }
>> +
>> if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>> page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
>>
>
> I am pretty sure the logic is flawed here. The problem is
> dma_need_sync() depends on the DMA address being used. In the worst-case
> scenario we could have a device that has something like a 32-bit DMA
> address space on a system with over 4GB of memory. In such a case the
> higher addresses would need to be synced because they will go off to a
> swiotlb bounce buffer while the lower addresses wouldn't.
>
> If you were to store a flag like this it would have to be generated per
> page.

I know when DMA might need syncing :D That's the point of this shortcut:
if at least one page needs syncing, I disable it for the whole pool.
It's a "better safe than sorry".
Using a per-page flag involves more changes and might hurt some
scenarios/setups. For example, non-coherent systems, where you always
need to do syncs. The idea was to give some improvement when possible,
otherwise just fall back to what we have today.

Thanks,
Olek

2023-06-30 15:15:44

by Alexander Duyck

Subject: Re: [PATCH RFC net-next 2/4] net: page_pool: avoid calling no-op externals when possible

On Fri, Jun 30, 2023 at 5:30 AM Alexander Lobakin
<[email protected]> wrote:
>
> From: Alexander H Duyck <[email protected]>
> Date: Thu, 29 Jun 2023 09:45:26 -0700
>
> > On Thu, 2023-06-29 at 17:23 +0200, Alexander Lobakin wrote:
> >> It turns out that page_pool_put{,_full}_page() can burn quite a few
> >> cycles even on DMA-coherent platforms (like x86) with no active IOMMU
> >> or swiotlb, just for the call ladder.
> >> Indeed, it's
>
> [...]
>
> >> @@ -341,6 +345,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >>
> >> page_pool_set_dma_addr(page, dma);
> >>
> >> + if ((pool->p.flags & PP_FLAG_DMA_MAYBE_SYNC) &&
> >> + dma_need_sync(pool->p.dev, dma)) {
> >> + pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;
> >> + pool->p.flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
> >> + }
> >> +
> >> if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >> page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> >>
> >
> > I am pretty sure the logic is flawed here. The problem is
> > dma_need_sync() depends on the DMA address being used. In the worst-case
> > scenario we could have a device that has something like a 32-bit DMA
> > address space on a system with over 4GB of memory. In such a case the
> > higher addresses would need to be synced because they will go off to a
> > swiotlb bounce buffer while the lower addresses wouldn't.
> >
> > If you were to store a flag like this it would have to be generated per
> > page.
>
> I know when DMA might need syncing :D That's the point of this shortcut:
> if at least one page needs syncing, I disable it for the whole pool.
> It's a "better safe than sorry".
> Using a per-page flag involves more changes and might hurt some
> scenarios/setups. For example, non-coherent systems, where you always
> need to do syncs. The idea was to give some improvement when possible,
> otherwise just fall back to what we have today.

I am not a fan of having the page pool force the syncing either. Last
I knew, the PP_FLAG_DMA_SYNC_DEV was meant to be set by the
driver, not by the page pool API itself. The big reason for that being
that the driver in many cases will have to take care of the DMA sync
itself instead of letting the allocator take care of it.

Basically we are just trading off the dma_need_sync cost versus the
page_pool_dma_sync_for_device cost. If we think it is a win to call
dma_need_sync for every frame then maybe we should look at folding it
into page_pool_dma_sync_for_device itself since that is the only
consumer of it, or just fold it into the PP_FLAG_DMA_SYNC_DEV if
statement after the flag check rather than adding yet another flag
that will likely always be true for most devices. Otherwise you are
just adding overhead for the non-exception case and devices that don't
bother setting PP_FLAG_DMA_SYNC_DEV.
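
For illustration, a rough sketch of that fold, assuming the current
body of page_pool_dma_sync_for_device() (untested, just to show the
idea):

static void page_pool_dma_sync_for_device(struct page_pool *pool,
					  struct page *page,
					  unsigned int dma_sync_size)
{
	dma_addr_t dma_addr = page_pool_get_dma_addr(page);

	/* Bail out early for coherent "device + addr" pairs. This is
	 * still one external call per frame, which is exactly the
	 * trade-off discussed above.
	 */
	if (!dma_need_sync(pool->p.dev, dma_addr))
		return;

	dma_sync_size = min(dma_sync_size, pool->p.max_len);
	dma_sync_single_range_for_device(pool->p.dev, dma_addr,
					 pool->p.offset, dma_sync_size,
					 pool->p.dma_dir);
}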

2023-06-30 15:48:40

by Alexander Lobakin

Subject: Re: [PATCH RFC net-next 2/4] net: page_pool: avoid calling no-op externals when possible

From: Alexander Duyck <[email protected]>
Date: Fri, 30 Jun 2023 07:44:45 -0700

> On Fri, Jun 30, 2023 at 5:30 AM Alexander Lobakin
> <[email protected]> wrote:
>>
>> From: Alexander H Duyck <[email protected]>
>> Date: Thu, 29 Jun 2023 09:45:26 -0700
>>
>>> On Thu, 2023-06-29 at 17:23 +0200, Alexander Lobakin wrote:
>>>> It turns out that page_pool_put{,_full}_page() can burn quite a few
>>>> cycles even on DMA-coherent platforms (like x86) with no active IOMMU
>>>> or swiotlb, just for the call ladder.
>>>> Indeed, it's
>>
>> [...]
>>
>>>> @@ -341,6 +345,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
>>>>
>>>> page_pool_set_dma_addr(page, dma);
>>>>
>>>> + if ((pool->p.flags & PP_FLAG_DMA_MAYBE_SYNC) &&
>>>> + dma_need_sync(pool->p.dev, dma)) {
>>>> + pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;
>>>> + pool->p.flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
>>>> + }
>>>> +
>>>> if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
>>>> page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
>>>>
>>>
>>> I am pretty sure the logic is flawed here. The problem is
>>> dma_need_sync() depends on the DMA address being used. In the worst-case
>>> scenario we could have a device that has something like a 32-bit DMA
>>> address space on a system with over 4GB of memory. In such a case the
>>> higher addresses would need to be synced because they will go off to a
>>> swiotlb bounce buffer while the lower addresses wouldn't.
>>>
>>> If you were to store a flag like this it would have to be generated per
>>> page.
>>
>> I know when DMA might need syncing :D That's the point of this shortcut:
>> if at least one page needs syncing, I disable it for the whole pool.
>> It's a "better safe than sorry".
>> Using a per-page flag involves more changes and might hurt some
>> scenarios/setups. For example, non-coherent systems, where you always
>> need to do syncs. The idea was to give some improvement when possible,
>> otherwise just fall back to what we have today.
>
> I am not a fan of having the page pool force the syncing either. Last
> I knew, the PP_FLAG_DMA_SYNC_DEV was meant to be set by the

Please follow the logic of the patch.

1. The driver sets DMA_SYNC_DEV.
2. PP tries to shortcut and replaces it with MAYBE_SYNC.
3. If dma_need_sync() returns true for some page, it gets replaced back
to DMA_SYNC_DEV, no further dma_need_sync() calls for that pool.

OR

1. The driver doesn't set DMA_SYNC_DEV.
2. PP doesn't turn on MAYBE_SYNC.
3. No dma_need_sync() tests.

Where does PP force syncs for drivers which don't need them?

> driver, not by the page pool API itself. The big reason for that being
> that the driver in many cases will have to take care of the DMA sync
> itself instead of letting the allocator take care of it.
>
> Basically we are just trading off the dma_need_sync cost versus the
> page_pool_dma_sync_for_device cost. If we think it is a win to call

dma_need_sync() is called once per newly allocated and mapped page.
page_pool_dma_sync_for_device() is called each time you ask for a page.

On my setup and with patch #4, I have literally 0 allocations once a
ring is filled. This means dma_need_sync() is not called at all during
Rx, while sync_for_device() would be called all the time.
When pages go through ptr_ring, sometimes new allocations happen, but
still the number of times dma_need_sync() is called is thousands of times
lower.

> dma_need_sync for every frame then maybe we should look at folding it
> into page_pool_dma_sync_for_device itself since that is the only
> consumer of it, or just fold it into the PP_FLAG_DMA_SYNC_DEV if
> statement after the flag check rather than adding yet another flag
> that will likely always be true for most devices. Otherwise you are

What you suggest is either calling dma_need_sync() each time a page is
requested or introducing a flag to store it somewhere in struct page to
allow some optimization for the really uncommon cases when dma_need_sync()
might return different values due to swiotlb etc. Did I get it right?

> just adding overhead for the non-exception case and devices that don't
> bother setting PP_FLAG_DMA_SYNC_DEV.

Thanks,
Olek

2023-06-30 18:47:19

by Alexander Duyck

Subject: Re: [PATCH RFC net-next 2/4] net: page_pool: avoid calling no-op externals when possible

On Fri, Jun 30, 2023 at 8:34 AM Alexander Lobakin
<[email protected]> wrote:
>
> From: Alexander Duyck <[email protected]>
> Date: Fri, 30 Jun 2023 07:44:45 -0700
>
> > On Fri, Jun 30, 2023 at 5:30 AM Alexander Lobakin
> > <[email protected]> wrote:
> >>
> >> From: Alexander H Duyck <[email protected]>
> >> Date: Thu, 29 Jun 2023 09:45:26 -0700
> >>
> >>> On Thu, 2023-06-29 at 17:23 +0200, Alexander Lobakin wrote:
> >>>> It turns out that page_pool_put{,_full}_page() can burn quite a few
> >>>> cycles even on DMA-coherent platforms (like x86) with no active IOMMU
> >>>> or swiotlb, just for the call ladder.
> >>>> Indeed, it's
> >>
> >> [...]
> >>
> >>>> @@ -341,6 +345,12 @@ static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
> >>>>
> >>>> page_pool_set_dma_addr(page, dma);
> >>>>
> >>>> + if ((pool->p.flags & PP_FLAG_DMA_MAYBE_SYNC) &&
> >>>> + dma_need_sync(pool->p.dev, dma)) {
> >>>> + pool->p.flags |= PP_FLAG_DMA_SYNC_DEV;
> >>>> + pool->p.flags &= ~PP_FLAG_DMA_MAYBE_SYNC;
> >>>> + }
> >>>> +
> >>>> if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> >>>> page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> >>>>
> >>>
> >>> I am pretty sure the logic is flawed here. The problem is
> >>> dma_need_sync() depends on the DMA address being used. In the worst-case
> >>> scenario we could have a device that has something like a 32-bit DMA
> >>> address space on a system with over 4GB of memory. In such a case the
> >>> higher addresses would need to be synced because they will go off to a
> >>> swiotlb bounce buffer while the lower addresses wouldn't.
> >>>
> >>> If you were to store a flag like this it would have to be generated per
> >>> page.
> >>
> >> I know when DMA might need syncing :D That's the point of this shortcut:
> >> if at least one page needs syncing, I disable it for the whole pool.
> >> It's a "better safe than sorry".
> >> Using a per-page flag involves more changes and might hurt some
> >> scenarios/setups. For example, non-coherent systems, where you always
> >> need to do syncs. The idea was to give some improvement when possible,
> >> otherwise just fall back to what we have today.
> >
> > I am not a fan of having the page pool force the syncing either. Last
> > I knew, the PP_FLAG_DMA_SYNC_DEV was meant to be set by the
>
> Please follow the logic of the patch.
>
> 1. The driver sets DMA_SYNC_DEV.
> 2. PP tries to shortcut and replaces it with MAYBE_SYNC.
> 3. If dma_need_sync() returns true for some page, it gets replaced back
> to DMA_SYNC_DEV, no further dma_need_sync() calls for that pool.
>
> OR
>
> 1. The driver doesn't set DMA_SYNC_DEV.
> 2. PP doesn't turn on MAYBE_SYNC.
> 3. No dma_need_sync() tests.
>
> Where does PP force syncs for drivers which don't need them?

You are right. I was looking at it out of context.

> > driver, not by the page pool API itself. The big reason for that being
> > that the driver in many cases will have to take care of the DMA sync
> > itself instead of letting the allocator take care of it.
> >
> > Basically we are just trading off the dma_need_sync cost versus the
> > page_pool_dma_sync_for_device cost. If we think it is a win to call
>
> dma_need_sync() is called once per newly allocated and mapped page.
> page_pool_dma_sync_for_device() is called each time you ask for a page.
>
> On my setup and with patch #4, I have literally 0 allocations once a
> ring is filled. This means dma_need_sync() is not called at all during
> Rx, while sync_for_device() would be called all the time.
> When pages go through ptr_ring, sometimes new allocations happen, but
> still the number of times dma_need_sync() is called is thousands of times
> lower.

I see, so you are using it as a screener for pages as they are added
to the pool. However, the first time somebody trips dma_need_sync(),
everybody in the pool is going to get hit with the sync code.

> > dma_need_sync for every frame then maybe we should look at folding it
> > into page_pool_dma_sync_for_device itself since that is the only
> > consumer of it, or just fold it into the PP_FLAG_DMA_SYNC_DEV if
> > statement after the flag check rather than adding yet another flag
> > that will likely always be true for most devices. Otherwise you are
>
> What you suggest is either calling dma_need_sync() each time a page is
> requested or introducing a flag to store it somewhere in struct page to
> > allow some optimization for the really uncommon cases when dma_need_sync()
> might return different values due to swiotlb etc. Did I get it right?

Yeah, my thought would be to have a flag in the page to indicate if it
will need the sync bits or not. Then you could look at exposing that
to the drivers as well so that they could cut down on their own
overhead. We could probably look at just embedding a flag in the lower
bits of the DMA address stored in the page since I suspect at a
minimum the resultant DMA address for a page would always be at least
aligned to a long if not a full page.
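
Something like this, perhaps (an untested sketch; the helper names are
made up, and the real page_pool_{get,set}_dma_addr() pair would have
to mask the bit out, with more care for the 32-bit dma_addr_t-in-two-
words layout):

/* Sketch: stash the dma_need_sync() result in bit 0 of the DMA
 * address stored in struct page. Assumes pool DMA mappings are at
 * least 2-byte aligned, so bit 0 is otherwise always clear.
 */
#define PP_DMA_ADDR_NEED_SYNC	BIT(0)

static inline void page_pool_set_dma_addr_need_sync(struct page *page,
						    dma_addr_t addr,
						    bool need_sync)
{
	page->dma_addr = addr | (need_sync ? PP_DMA_ADDR_NEED_SYNC : 0);
}

static inline bool page_pool_page_need_sync(const struct page *page)
{
	/* cheap per-page test, could be exposed to drivers as well */
	return page->dma_addr & PP_DMA_ADDR_NEED_SYNC;
}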

2023-07-03 13:55:09

by Alexander Lobakin

Subject: Re: [PATCH RFC net-next 2/4] net: page_pool: avoid calling no-op externals when possible

From: Alexander Duyck <[email protected]>
Date: Fri, 30 Jun 2023 11:28:30 -0700

> On Fri, Jun 30, 2023 at 8:34 AM Alexander Lobakin
> <[email protected]> wrote:

[...]

>> On my setup and with patch #4, I have literally 0 allocations once a
>> ring is filled. This means dma_need_sync() is not called at all during
>> Rx, while sync_for_device() would be called all the time.
>> When pages go through ptr_ring, sometimes new allocations happen, but
>> still the number of times dma_need_sync() is called is thousands of times
>> lower.
>
> I see, so you are using it as a screener for pages as they are added
> to the pool. However, the first time somebody trips dma_need_sync(),
> everybody in the pool is going to get hit with the sync code.

Right. "Better safe than sorry". If at least one page needs sync, let's
drop the shortcut. It won't make it worse than the mainline anyway.

>
>>> dma_need_sync for every frame then maybe we should look at folding it
>>> into page_pool_dma_sync_for_device itself since that is the only
>>> consumer of it, or just fold it into the PP_FLAG_DMA_SYNC_DEV if
>>> statement after the flag check rather than adding yet another flag
>>> that will likely always be true for most devices. Otherwise you are
>>
>> What you suggest is either calling dma_need_sync() each time a page is
>> requested or introducing a flag to store it somewhere in struct page to
>> allow some optimization for the really uncommon cases when dma_need_sync()
>> might return different values due to swiotlb etc. Did I get it right?
>
> Yeah, my thought would be to have a flag in the page to indicate if it
> will need the sync bits or not. Then you could look at exposing that
> to the drivers as well so that they could cut down on their own
> overhead. We could probably look at just embedding a flag in the lower
> bits of the DMA address stored in the page since I suspect at a
> minimum the resultant DMA address for a page would always be at least
> aligned to a long if not a full page.

As for drivers, I could add a wrapper like page_pool_need_sync() to test
for DMA_SYNC_DEV. I'm not sure it's worth checking on a per-page basis.
Also, having that bit in struct page forces us to always fetch it into a
cacheline. Right now, if the sync is skipped, we also avoid touching
struct page, or at least postpone it. page_address() (for &xdp_buff or
build_skb()) doesn't touch it.
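
The wrapper could be as simple as this (a sketch; the helper doesn't
exist yet and the name is only a suggestion):

static inline bool page_pool_need_sync(const struct page_pool *pool)
{
	/* With this patch, the flag is only set once at least one
	 * mapped page actually needed a sync, see page_pool_dma_map().
	 */
	return !!(pool->p.flags & PP_FLAG_DMA_SYNC_DEV);
}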

As for a possible implementation, I also thought of the lowest bit of
the DMA address. The alignment can probably be lower than %PAGE_SIZE in
some cases (at least outside PP), but never down to 1-byte granularity.
But again, we need a group decision on whether it's worth doing on a
per-page basis :)

Thanks,
Olek

2023-07-03 20:36:40

by Jakub Kicinski

Subject: Re: [PATCH RFC net-next 2/4] net: page_pool: avoid calling no-op externals when possible

On Fri, 30 Jun 2023 17:34:02 +0200 Alexander Lobakin wrote:
> > I am not a fan of having the page pool force the syncing either. Last
> > I knew, the PP_FLAG_DMA_SYNC_DEV was meant to be set by the
>
> Please follow the logic of the patch.
>
> 1. The driver sets DMA_SYNC_DEV.
> 2. PP tries to shortcut and replaces it with MAYBE_SYNC.
> 3. If dma_need_sync() returns true for some page, it gets replaced back
> to DMA_SYNC_DEV, no further dma_need_sync() calls for that pool.
>
> OR
>
> 1. The driver doesn't set DMA_SYNC_DEV.
> 2. PP doesn't turn on MAYBE_SYNC.
> 3. No dma_need_sync() tests.
>
> Where does PP force syncs for drivers which don't need them?

I think both Alex and I got confused about what's going on here.

Could you reshuffle the code somehow to make it more obvious?
Rename the flag, perhaps put it in a different field than
the driver-set PP flags?

2023-07-05 14:51:20

by Alexander Lobakin

Subject: Re: [PATCH RFC net-next 2/4] net: page_pool: avoid calling no-op externals when possible

From: Jakub Kicinski <[email protected]>
Date: Mon, 3 Jul 2023 13:32:07 -0700

> On Fri, 30 Jun 2023 17:34:02 +0200 Alexander Lobakin wrote:
>>> I am not a fan of having the page pool force the syncing either. Last
>>> I knew, the PP_FLAG_DMA_SYNC_DEV was meant to be set by the
>>
>> Please follow the logic of the patch.
>>
>> 1. The driver sets DMA_SYNC_DEV.
>> 2. PP tries to shortcut and replaces it with MAYBE_SYNC.
>> 3. If dma_need_sync() returns true for some page, it gets replaced back
>> to DMA_SYNC_DEV, no further dma_need_sync() calls for that pool.
>>
>> OR
>>
>> 1. The driver doesn't set DMA_SYNC_DEV.
>> 2. PP doesn't turn on MAYBE_SYNC.
>> 3. No dma_need_sync() tests.
>>
>> Where does PP force syncs for drivers which don't need them?
>
> I think both Alex and I got confused about what's going on here.
>
> Could you reshuffle the code somehow to make it more obvious?
> Rename the flag, perhaps put it in a different field than
> the driver-set PP flags?

PP currently doesn't have a field for internal flags or the like, so I
reused the existing one :s But you're probably right, that would make it
more obvious.

1. Driver sets PP_SYNC_DEV.
2. PP doesn't set its internal one until dma_need_sync() returns true.
3. PP-sync-for-dev checks for the internal flag.

Although that needs more lines to be changed :D
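
Roughly like this (a sketch only; the field name is arbitrary, and the
internal bit here means "do syncs", set once dma_need_sync() first
returns true):

/* internal state, kept out of pool->p.flags which the driver owns */
struct page_pool {
	struct page_pool_params p;
	bool dma_sync;	/* do DMA syncs for device */
	/* ... */
};

/* page_pool_init(): no syncs until a page proves they're needed */
pool->dma_sync = false;

/* page_pool_dma_map(): */
if ((pool->p.flags & PP_FLAG_DMA_SYNC_DEV) &&
    dma_need_sync(pool->p.dev, dma))
	pool->dma_sync = true;

/* hot path, e.g. __page_pool_put_page(): */
if (pool->dma_sync)
	page_pool_dma_sync_for_device(pool, page, pool->p.max_len);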

Thanks,
Olek