+CC Michal, Peter, Greg and Bjorn
Because there has been discussion before about where and how NUMA_NO_NODE
should be handled.
On 2019/12/12 5:24, Saeed Mahameed wrote:
> On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer wrote:
>> On Sat, 7 Dec 2019 03:52:41 +0000
>> Saeed Mahameed <[email protected]> wrote:
>>
>>> I don't think it is correct to check that the page nid is the same as
>>> numa_mem_id() if the pool is NUMA_NO_NODE. In such a case we should
>>> allow all pages to recycle, because you can't assume where pages are
>>> allocated from and where they are being handled.
>>
>> I agree, using numa_mem_id() is not valid, because it takes the numa
>> node id from the executing CPU and the call to __page_pool_put_page()
>> can happen on a remote CPU (e.g. cpumap redirect, and in future
>> SKBs).
>>
>>
>>> I suggest the following:
>>>
>>> return !page_pfmemalloc() &&
>>>        (page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE);
>>
>> Above code doesn't generate optimal ASM code, I suggest:
>>
>> static bool pool_page_reusable(struct page_pool *pool, struct page *page)
>> {
>>        return !page_is_pfmemalloc(page) &&
>>               pool->p.nid != NUMA_NO_NODE &&
>>               page_to_nid(page) == pool->p.nid;
>> }
>>
>
> this is not equivalent to the above. Here in case pool->p.nid is
> NUMA_NO_NODE, pool_page_reusable() will always be false.
>
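To make the non-equivalence concrete, here is a minimal user-space sketch (the `struct page`/`struct page_pool` types and helpers below are stubbed stand-ins for illustration, not the kernel API) showing that the two predicates disagree exactly when pool->p.nid == NUMA_NO_NODE:

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE (-1)

/* Stand-ins for the kernel structures, for illustration only. */
struct page { int nid; bool pfmemalloc; };
struct page_pool { struct { int nid; } p; };

static int page_to_nid(const struct page *page) { return page->nid; }
static bool page_is_pfmemalloc(const struct page *page) { return page->pfmemalloc; }

/* Saeed's variant: NUMA_NO_NODE means "recycle from any node". */
static bool reusable_any_node(struct page_pool *pool, struct page *page)
{
	return !page_is_pfmemalloc(page) &&
	       (page_to_nid(page) == pool->p.nid ||
		pool->p.nid == NUMA_NO_NODE);
}

/* Jesper's variant: NUMA_NO_NODE means the nid test always fails,
 * so such a pool never recycles. */
static bool reusable_strict(struct page_pool *pool, struct page *page)
{
	return !page_is_pfmemalloc(page) &&
	       pool->p.nid != NUMA_NO_NODE &&
	       page_to_nid(page) == pool->p.nid;
}
```

For any pool with a concrete nid the two agree; only the NUMA_NO_NODE case flips from "always recycle" to "never recycle".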
> We can avoid the extra check in the data path.
> How about avoiding NUMA_NO_NODE in page_pool altogether, and forcing
> numa_mem_id() as pool->p.nid when the user requests NUMA_NO_NODE at page
> pool init, as already done in alloc_pages_node().
That means we will not support page reuse mitigation for NUMA_NO_NODE,
which is not the same semantics as alloc_pages_node() handling NUMA_NO_NODE,
because alloc_pages_node() will allocate the page based on the node
of the currently running cpu.
Also, there seems to be wild guessing of the node id here, which has
been discussed before without reaching an agreement yet.
>
> which will imply recycling without adding any extra condition to the
> data path.
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a6aefe989043..00c99282a306 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool,
>
> memcpy(&pool->p, params, sizeof(pool->p));
>
> + /* overwrite to allow recycling.. */
> + if (pool->p.nid == NUMA_NO_NODE)
> + pool->p.nid = numa_mem_id();
> +
>
> After a quick look, I don't see any reason to keep NUMA_NO_NODE in
> pool->p.nid..
>
>
>> I have compiled different variants and looked at the ASM code
>> generated by GCC. This seems to give the best result.
>>
>>
>>> 1) never recycle emergency pages, regardless of pool nid.
>>> 2) always recycle if pool is NUMA_NO_NODE.
>>
>> Yes, this defines the semantics, that a page_pool configured with
>> NUMA_NO_NODE means skip NUMA checks. I think that sounds okay...
>>
>>
>>> the above change should not add any overhead, a modest branch
>>> predictor will handle this with no effort.
>>
>> It still annoys me that we keep adding instructions to this code
>> hot-path (I counted 34 bytes and 11 instructions in my proposed
>> function).
>>
>> I think that it might be possible to move these NUMA checks to
>> alloc-side (instead of return/recycles side as today), and perhaps
>> only
>> on slow-path when dequeuing from ptr_ring (as recycles that call
>> __page_pool_recycle_direct() will be pinned during NAPI). But let's
>> focus on a smaller fix for the immediate issue...
>>
>
> I know. It annoys me too, but we need recycling to work in production,
> where rings/napi can migrate and numa nodes can be NUMA_NO_NODE :-(.
>
>
On Thu, 12 Dec 2019 09:34:14 +0800
Yunsheng Lin <[email protected]> wrote:
> +CC Michal, Peter, Greg and Bjorn
> Because there has been discussion before about where and how NUMA_NO_NODE
> should be handled.
>
> On 2019/12/12 5:24, Saeed Mahameed wrote:
> > On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer wrote:
> >> On Sat, 7 Dec 2019 03:52:41 +0000
> >> Saeed Mahameed <[email protected]> wrote:
> >>
> >>> I don't think it is correct to check that the page nid is same as
> >>> numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow
> >>> all pages to recycle, because you can't assume where pages are
> >>> allocated from and where they are being handled.
> >>
> >> I agree, using numa_mem_id() is not valid, because it takes the numa
> >> node id from the executing CPU and the call to __page_pool_put_page()
> >> can happen on a remote CPU (e.g. cpumap redirect, and in future
> >> SKBs).
> >>
> >>
> >>> I suggest the following:
> >>>
> >>> return !page_pfmemalloc() &&
> >>> ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE );
> >>
> >> Above code doesn't generate optimal ASM code, I suggest:
> >>
> >> static bool pool_page_reusable(struct page_pool *pool, struct page *page)
> >> {
> >> return !page_is_pfmemalloc(page) &&
> >> pool->p.nid != NUMA_NO_NODE &&
> >> page_to_nid(page) == pool->p.nid;
> >> }
> >>
> >
> > this is not equivalent to the above. Here in case pool->p.nid is
> > NUMA_NO_NODE, pool_page_reusable() will always be false.
> >
> > We can avoid the extra check in data path.
> > How about avoiding NUMA_NO_NODE in page_pool altogether, and force
> > numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at page
> > pool init, as already done in alloc_pages_node().
>
> That means we will not support page reuse mitigation for NUMA_NO_NODE,
> which is not the same semantics as alloc_pages_node() handling NUMA_NO_NODE,
> because alloc_pages_node() will allocate the page based on the node
> of the currently running cpu.
True, as I wrote (below), my code defines the semantics as: a page_pool
configured with NUMA_NO_NODE means skip NUMA checks, and allow recycling
regardless of which NUMA node a page belongs to. It seems that you want
different semantics.
I'm open to other semantics. My main concern is performance. The
page_pool fast-path for the driver-recycling use-case of XDP_DROP has
extreme performance requirements, as it needs to compete with drivers'
local recycle tricks (else we cannot use page_pool to simplify drivers).
The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns, and
in practice I'm measuring 25Mpps = 40ns with the mlx5 driver (single queue),
and Bjørn is showing 30Mpps = 33.3ns with i40e. At this level every
cycle/instruction counts.
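As a sanity check on these numbers, here is a tiny user-space sketch (not kernel code): at 100Gbit/s a minimum-size Ethernet frame occupies 84 bytes on the wire (64-byte frame plus 20 bytes of preamble and inter-frame gap), which works out to roughly 148.8Mpps and a ~6.72ns per-packet budget:

```c
#include <assert.h>

/* Per-packet time budget in ns for a given packet rate (pps). */
static double ns_per_packet(double pps)
{
	return 1e9 / pps;
}

/* Packet rate for minimum-size Ethernet frames on a given link speed.
 * 64-byte frame + 20 bytes preamble/IFG overhead = 84 bytes on the wire. */
static double min_frame_pps(double gbit_per_s)
{
	return gbit_per_s * 1e9 / (84.0 * 8.0);
}
```

The 25Mpps and 30Mpps measurements map to 40ns and ~33.3ns budgets via the same 1e9/pps relation.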
> Also, there seems to be wild guessing of the node id here, which has
> been discussed before without reaching an agreement yet.
>
> >
> > which will imply recycling without adding any extra condition to the
> > data path.
I love code that moves things out of our fast-path.
> >
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index a6aefe989043..00c99282a306 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool,
> >
> > memcpy(&pool->p, params, sizeof(pool->p));
> >
> > + /* overwrite to allow recycling.. */
> > + if (pool->p.nid == NUMA_NO_NODE)
> > + pool->p.nid = numa_mem_id();
> > +
The problem is that page_pool_init() can be called from a random
CPU, first at driver setup/bringup, and later at other queue changes
that can be started via ethtool or XDP attach. (numa_mem_id() picks
the node of the running CPU.)
As Yunsheng mentioned elsewhere, there is also a dev_to_node() function.
Isn't that what we want in a place like this?
One issue with dev_to_node() is that in case of !CONFIG_NUMA it returns
NUMA_NO_NODE (-1). (And device_initialize() also sets it to -1.) Thus,
in that case we set pool->p.nid = 0, as page_to_nid() will also return
zero in that case (as far as I follow the code).
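The fallback policy described above can be sketched like this (a user-space illustration: `dev_node` stands in for the result of dev_to_node(dev), and `pool_pick_nid` is a hypothetical helper, not the kernel API):

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)

/* Sketch of the init-time policy discussed above: prefer the device's
 * NUMA node, and fall back to node 0 when dev_to_node() reports
 * NUMA_NO_NODE (e.g. with !CONFIG_NUMA, where page_to_nid() also
 * returns 0, so the comparison still matches). */
static int pool_pick_nid(int dev_node)
{
	return (dev_node == NUMA_NO_NODE) ? 0 : dev_node;
}
```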
> > After a quick look, i don't see any reason why to keep NUMA_NO_NODE in
> > pool->p.nid..
> >
> >
> >> I have compiled different variants and looked at the ASM code
> >> generated by GCC. This seems to give the best result.
> >>
> >>
> >>> 1) never recycle emergency pages, regardless of pool nid.
> >>> 2) always recycle if pool is NUMA_NO_NODE.
> >>
> >> Yes, this defines the semantics, that a page_pool configured with
> >> NUMA_NO_NODE means skip NUMA checks. I think that sounds okay...
> >>
> >>
> >>> the above change should not add any overhead, a modest branch
> >>> predictor will handle this with no effort.
> >>
> >> It still annoys me that we keep adding instructions to this code
> >> hot-path (I counted 34 bytes and 11 instructions in my proposed
> >> function).
> >>
> >> I think that it might be possible to move these NUMA checks to
> >> alloc-side (instead of return/recycles side as today), and perhaps
> >> only on slow-path when dequeuing from ptr_ring (as recycles that
> >> call __page_pool_recycle_direct() will be pinned during NAPI).
> >> But lets focus on a smaller fix for the immediate issue...
> >>
> >
> > I know. It annoys me too, but we need recycling to work in
> > production, where rings/napi can migrate and numa nodes can be
> > NUMA_NO_NODE :-(.
> >
> >
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
On 2019/12/12 18:18, Jesper Dangaard Brouer wrote:
> On Thu, 12 Dec 2019 09:34:14 +0800
> Yunsheng Lin <[email protected]> wrote:
>
>> +CC Michal, Peter, Greg and Bjorn
>> Because there has been discussion before about where and how NUMA_NO_NODE
>> should be handled.
>>
>> On 2019/12/12 5:24, Saeed Mahameed wrote:
>>> On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer wrote:
>>>> On Sat, 7 Dec 2019 03:52:41 +0000
>>>> Saeed Mahameed <[email protected]> wrote:
>>>>
>>>>> I don't think it is correct to check that the page nid is same as
>>>>> numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow
>>>>> all pages to recycle, because you can't assume where pages are
>>>>> allocated from and where they are being handled.
>>>>
>>>> I agree, using numa_mem_id() is not valid, because it takes the numa
>>>> node id from the executing CPU and the call to __page_pool_put_page()
>>>> can happen on a remote CPU (e.g. cpumap redirect, and in future
>>>> SKBs).
>>>>
>>>>
>>>>> I suggest the following:
>>>>>
>>>>> return !page_pfmemalloc() &&
>>>>> ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE );
>>>>
>>>> Above code doesn't generate optimal ASM code, I suggest:
>>>>
>>>> static bool pool_page_reusable(struct page_pool *pool, struct page *page)
>>>> {
>>>> return !page_is_pfmemalloc(page) &&
>>>> pool->p.nid != NUMA_NO_NODE &&
>>>> page_to_nid(page) == pool->p.nid;
>>>> }
>>>>
>>>
>>> this is not equivalent to the above. Here in case pool->p.nid is
>>> NUMA_NO_NODE, pool_page_reusable() will always be false.
>>>
>>> We can avoid the extra check in data path.
>>> How about avoiding NUMA_NO_NODE in page_pool altogether, and force
>>> numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at page
>>> pool init, as already done in alloc_pages_node().
>>
>> That means we will not support page reuse mitigation for NUMA_NO_NODE,
>> which is not the same semantics as alloc_pages_node() handling NUMA_NO_NODE,
>> because alloc_pages_node() will allocate the page based on the node
>> of the currently running cpu.
>
> True, as I wrote (below), my code defines the semantics as: a page_pool
> configured with NUMA_NO_NODE means skip NUMA checks, and allow recycling
> regardless of which NUMA node a page belongs to. It seems that you want
> different semantics.
For drivers that do not have page pool support yet, the semantics seem
to be: always allocate and recycle local pages (except pfmemalloc ones). Page
reuse moves when the rx interrupt affinity moves (the NAPI polling
context moves accordingly), regardless of dev_to_node().
It would be good to maintain the above semantics. And the rx data page seems
to be close to the cpu doing the rx cleaning, which means the cpu
can process the buffer quicker?
>
> I'm open to other semantics. My main concern is performance. The
> page_pool fast-path for the driver-recycling use-case of XDP_DROP has
> extreme performance requirements, as it needs to compete with drivers'
> local recycle tricks (else we cannot use page_pool to simplify drivers).
> The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns, and
> in practice I'm measuring 25Mpps = 40ns with the mlx5 driver (single queue),
> and Bjørn is showing 30Mpps = 33.3ns with i40e. At this level every
> cycle/instruction counts.
Yes, the performance is a concern too.
But if the rx page is closer to the cpu, maybe the time taken to process
the buffer can be reduced?
It is good to allocate the rx page close to both the cpu and the device, but
if both goals cannot be reached, maybe we choose to allocate pages close
to the cpu?
>
>
>> Also, there seems to be wild guessing of the node id here, which has
>> been discussed before without reaching an agreement yet.
>>
>>>
>>> which will imply recycling without adding any extra condition to the
>>> data path.
>
> I love code that moves things out of our fast-path.
>
>>>
>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>> index a6aefe989043..00c99282a306 100644
>>> --- a/net/core/page_pool.c
>>> +++ b/net/core/page_pool.c
>>> @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool,
>>>
>>> memcpy(&pool->p, params, sizeof(pool->p));
>>>
>>> + /* overwrite to allow recycling.. */
>>> + if (pool->p.nid == NUMA_NO_NODE)
>>> + pool->p.nid = numa_mem_id();
>>> +
>
> The problem is that page_pool_init() can be called from a random
> CPU, first at driver setup/bringup, and later at other queue changes
> that can be started via ethtool or XDP attach. (numa_mem_id() picks
> the node of the running CPU.)
Yes, changing the ring number or ring depth releases and allocates rx data
pages, so using NUMA_NO_NODE to allocate pages and always recycling those
pages may have a performance difference noticeable to the user.
>
> As Yunsheng mentioned elsewhere, there is also a dev_to_node() function.
> Isn't that what we want in a place like this?
>
>
> One issue with dev_to_node() is that in case of !CONFIG_NUMA it returns
> NUMA_NO_NODE (-1). (And device_initialize() also sets it to -1.) Thus,
> in that case we set pool->p.nid = 0, as page_to_nid() will also return
> zero in that case (as far as I follow the code).
>
>
>>> After a quick look, i don't see any reason why to keep NUMA_NO_NODE in
>>> pool->p.nid..
>>>
>>>
>>>> I have compiled different variants and looked at the ASM code
>>>> generated by GCC. This seems to give the best result.
>>>>
>>>>
>>>>> 1) never recycle emergency pages, regardless of pool nid.
>>>>> 2) always recycle if pool is NUMA_NO_NODE.
>>>>
>>>> Yes, this defines the semantics, that a page_pool configured with
>>>> NUMA_NO_NODE means skip NUMA checks. I think that sounds okay...
>>>>
>>>>
>>>>> the above change should not add any overhead, a modest branch
>>>>> predictor will handle this with no effort.
>>>>
>>>> It still annoys me that we keep adding instructions to this code
>>>> hot-path (I counted 34 bytes and 11 instructions in my proposed
>>>> function).
>>>>
>>>> I think that it might be possible to move these NUMA checks to
>>>> alloc-side (instead of return/recycles side as today), and perhaps
>>>> only on slow-path when dequeuing from ptr_ring (as recycles that
>>>> call __page_pool_recycle_direct() will be pinned during NAPI).
>>>> But lets focus on a smaller fix for the immediate issue...
>>>>
>>>
>>> I know. It annoys me too, but we need recycling to work in
>>> production, where rings/napi can migrate and numa nodes can be
>>> NUMA_NO_NODE :-(.
>>>
>>>
>
>
> It is good to allocate the rx page close to both the cpu and the device,
> but if both goals cannot be reached, maybe we choose to allocate pages
> close to the cpu?
>
I think it is true.
If it is true, we can remove pool->p.nid, replace alloc_pages_node() with
alloc_pages() in __page_pool_alloc_pages_slow(), and change pool_page_reusable()
so that page_to_nid(page) is checked against numa_mem_id(),
since alloc_pages() hints to use pages from the current node, and
__page_pool_alloc_pages_slow() will often be called in NAPI polling if recycling
failed; after some cycles, the pages will be from the local memory node.
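A user-space sketch of that proposed predicate (stubbed types and helpers for illustration; `stub_numa_mem_id()` stands in for numa_mem_id() and simply returns a fixed node):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the kernel helpers, for illustration only. */
struct page { int nid; bool pfmemalloc; };

static int stub_numa_mem_id(void) { return 0; } /* node of the running CPU */
static int page_to_nid(const struct page *page) { return page->nid; }
static bool page_is_pfmemalloc(const struct page *page) { return page->pfmemalloc; }

/* The proposal: drop pool->p.nid and compare the page's node against
 * the node of the CPU doing the recycling. Note the caveat raised later
 * in the thread: recycling can also happen on a remote CPU. */
static bool pool_page_reusable_local(const struct page *page)
{
	return !page_is_pfmemalloc(page) &&
	       page_to_nid(page) == stub_numa_mem_id();
}
```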
-Li
On 2019/12/13 14:27, Li,Rongqing wrote:
>>
>> It is good to allocate the rx page close to both the cpu and the device,
>> but if both goals cannot be reached, maybe we choose to allocate pages
>> close to the cpu?
>>
> I think it is true.
>
> If it is true, we can remove pool->p.nid, replace alloc_pages_node() with
> alloc_pages() in __page_pool_alloc_pages_slow(), and change pool_page_reusable()
> so that page_to_nid(page) is checked against numa_mem_id(),
>
> since alloc_pages() hints to use pages from the current node, and
> __page_pool_alloc_pages_slow() will often be called in NAPI polling if
> recycling failed; after some cycles, the pages will be from the local
> memory node.
Yes, if allocation and recycling are in the same NAPI polling context.
As pointed out by Saeed and Ilias, the allocation and recycling may
not be happening in the same NAPI polling context, see:
"In the current code base if they are only called under NAPI this might be true.
On the page_pool skb recycling patches though (yes we'll eventually send those
:)) this is called from kfree_skb()."
So this may need some additional attention.
>
> -Li
>
On Fri, 13 Dec 2019 14:53:37 +0800
Yunsheng Lin <[email protected]> wrote:
> On 2019/12/13 14:27, Li,Rongqing wrote:
> >>
> >> It is good to allocate the rx page close to both the cpu and the
> >> device, but if both goals cannot be reached, maybe we choose to
> >> allocate pages close to the cpu?
> >>
> > I think it is true
> >
> > If it is true, we can remove pool->p.nid, and replace
> > alloc_pages_node() with alloc_pages() in __page_pool_alloc_pages_slow(),
> > and change pool_page_reusable() so that page_to_nid(page) is checked
> > against numa_mem_id().
No, as I explained before, you cannot use numa_mem_id() in pool_page_reusable(),
because the recycle call can happen from/on a remote CPU (and numa_mem_id()
uses the local CPU to determine the NUMA id).
Today we already have XDP cpumap redirect. And we are working on
extending this to SKBs, which will increase the likelihood even more.
I don't think we want to not-recycle if a remote NUMA node CPU happened
to touch the packet?
(Based on the optimizations done for Facebook (the reason we added this))
What seems to matter is the NUMA node of the CPU that runs RX NAPI-polling;
this is the first CPU to touch the packet memory and the struct-page
memory. For these drivers, it is also the same "RX-CPU" that does the
allocation of new pages (to refill the RX-ring), and these "new" pages
can come from the page_pool.
With this in mind, it does seem strange that we are doing the check on
the "free"/recycle code path, which can run on remote CPUs...
> > since alloc_pages() hints to use pages from the current node, and
> > __page_pool_alloc_pages_slow() will often be called in NAPI polling
> > if recycling failed; after some cycles, the pages will be from the
> > local memory node.
You are basically saying that the NUMA check should be moved to
allocation time, as that runs on the RX-CPU (NAPI). And eventually
after some time the pages will come from the correct NUMA node.
I think we can do that, and only affect the semi-fast-path.
We just need to handle that recycled pages in the ptr_ring can
be from the wrong NUMA node. In __page_pool_get_cached(), when
consuming pages from the ptr_ring (__ptr_ring_consume_batched), we
can then evict pages from the wrong NUMA node.
For the pool->alloc.cache we either accept that it will eventually
be emptied after some time (it is only in a 100% XDP_DROP workload that
it will continue to reuse the same pages), or we simply clear the
pool->alloc.cache when calling page_pool_update_nid().
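A user-space sketch of that eviction idea (`refill_keep_local()` and the flat arrays are hypothetical stand-ins for the ptr_ring consume path, not the real page_pool code): while refilling the cache from the ring, keep only pages on the preferred node and hand the rest back to the caller for release.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in; the real code walks the page_pool ptr_ring. */
struct page { int nid; };

/* Keep pages that match the preferred node in 'cache'; collect the
 * rest in 'evicted' so the caller can return them to the page
 * allocator. Returns the number of pages kept. */
static size_t refill_keep_local(struct page **ring, size_t n,
				struct page **cache, int pref_nid,
				struct page **evicted, size_t *n_evicted)
{
	size_t kept = 0;

	*n_evicted = 0;
	for (size_t i = 0; i < n; i++) {
		if (ring[i]->nid == pref_nid)
			cache[kept++] = ring[i];
		else
			evicted[(*n_evicted)++] = ring[i];
	}
	return kept;
}
```

This keeps the NUMA check entirely on the (semi-fast) alloc path, so the recycle path stays branch-free.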
> Yes, if allocation and recycling are in the same NAPI polling context.
Which is a false assumption.
> As pointed out by Saeed and Ilias, the allocation and recycling may
> not be happening in the same NAPI polling context, see:
>
> "In the current code base if they are only called under NAPI this
> might be true. On the page_pool skb recycling patches though (yes
> we'll eventually send those :)) this is called from kfree_skb()."
>
> So this may need some additional attention.
Yes, as explained above.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer