2019-12-19 14:22:28

by Jesper Dangaard Brouer

Subject: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition

The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.

The goal of the NUMA changes in commit d5394610b1ba ("page_pool: Don't
recycle non-reusable pages") was to have RX-pages that belong to the
same NUMA node as the CPU processing RX-packets during softirq/NAPI, as
illustrated by the performance measurements.

This patch moves the NUMA check out of the fast-path, and at the same
time solves the NUMA_NO_NODE issue.

First, realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
will look up the current CPU nid (NUMA id) via numa_mem_id(), which is
used as the preferred nid. It is only in rare situations, e.g. when the
NUMA zone runs dry, that the page doesn't get allocated from the
preferred nid. The page_pool API allows drivers to control the nid
themselves via controlling pool->p.nid.

This patch moves the NUMA check to when the alloc cache is refilled,
via dequeuing/consuming pages from the ptr_ring. Thus, we can allow
placing pages from a remote NUMA node into the ptr_ring, as the
dequeue/consume step will check the NUMA node. All current drivers
using page_pool will alloc/refill the RX-ring from the same CPU that
runs the softirq/NAPI process.

Drivers that control the nid explicitly must also use
page_pool_update_nid() when changing the nid at runtime. To speed up
the transition to the new nid, the alloc cache is now flushed on nid
changes. This forces pages to come from the ptr_ring, which does the
appropriate nid check.

For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
node, we accept that transitioning the alloc cache doesn't happen
immediately. The preferred nid changes at runtime by consulting
numa_mem_id(), based on the CPU processing RX-packets.

Notice that, to avoid stressing the page buddy allocator and to avoid
doing too much work under softirq with preempt disabled, the NUMA check
at ptr_ring dequeue will break the refill cycle when detecting a NUMA
mismatch. This will cause a slower transition, but it is done on
purpose.

Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
Reported-by: Li RongQing <[email protected]>
Reported-by: Yunsheng Lin <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
---
net/core/page_pool.c | 80 ++++++++++++++++++++++++++++++++++++++------------
1 file changed, 61 insertions(+), 19 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a6aefe989043..748f0d36f4be 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -96,10 +96,60 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
 }
 EXPORT_SYMBOL(page_pool_create);

+static void __page_pool_return_page(struct page_pool *pool, struct page *page);
+
+noinline
+static struct page *page_pool_refill_alloc_cache(struct page_pool *pool,
+						 bool refill)
+{
+	struct ptr_ring *r = &pool->ring;
+	struct page *page;
+	int pref_nid; /* preferred NUMA node */
+
+	/* Quicker fallback, avoid locks when ring is empty */
+	if (__ptr_ring_empty(r))
+		return NULL;
+
+	/* Softirq guarantee CPU and thus NUMA node is stable. This,
+	 * assumes CPU refilling driver RX-ring will also run RX-NAPI.
+	 */
+	pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;
+
+	/* Slower-path: Get pages from locked ring queue */
+	spin_lock(&r->consumer_lock);
+
+	/* Refill alloc array, but only if NUMA match */
+	do {
+		page = __ptr_ring_consume(r);
+		if (unlikely(!page))
+			break;
+
+		if (likely(page_to_nid(page) == pref_nid)) {
+			pool->alloc.cache[pool->alloc.count++] = page;
+		} else {
+			/* NUMA mismatch;
+			 * (1) release 1 page to page-allocator and
+			 * (2) break out and fall through to alloc_pages_node.
+			 * This limits stress on the page buddy allocator.
+			 */
+			__page_pool_return_page(pool, page);
+			page = NULL;
+			break;
+		}
+	} while (pool->alloc.count < PP_ALLOC_CACHE_REFILL &&
+		 refill);
+
+	/* Return last page */
+	if (likely(pool->alloc.count > 0))
+		page = pool->alloc.cache[--pool->alloc.count];
+
+	spin_unlock(&r->consumer_lock);
+	return page;
+}
+
 /* fast path */
 static struct page *__page_pool_get_cached(struct page_pool *pool)
 {
-	struct ptr_ring *r = &pool->ring;
 	bool refill = false;
 	struct page *page;

@@ -113,20 +163,7 @@ static struct page *__page_pool_get_cached(struct page_pool *pool)
 		refill = true;
 	}

-	/* Quicker fallback, avoid locks when ring is empty */
-	if (__ptr_ring_empty(r))
-		return NULL;
-
-	/* Slow-path: Get page from locked ring queue,
-	 * refill alloc array if requested.
-	 */
-	spin_lock(&r->consumer_lock);
-	page = __ptr_ring_consume(r);
-	if (refill)
-		pool->alloc.count = __ptr_ring_consume_batched(r,
-							pool->alloc.cache,
-							PP_ALLOC_CACHE_REFILL);
-	spin_unlock(&r->consumer_lock);
+	page = page_pool_refill_alloc_cache(pool, refill);
 	return page;
 }

@@ -311,13 +348,10 @@ static bool __page_pool_recycle_direct(struct page *page,

 /* page is NOT reusable when:
  * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
- * 2) belongs to a different NUMA node than pool->p.nid.
- *
- * To update pool->p.nid users must call page_pool_update_nid.
  */
 static bool pool_page_reusable(struct page_pool *pool, struct page *page)
 {
-	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
+	return !page_is_pfmemalloc(page);
 }

 void __page_pool_put_page(struct page_pool *pool, struct page *page,
@@ -484,7 +518,15 @@ EXPORT_SYMBOL(page_pool_destroy);
 /* Caller must provide appropriate safe context, e.g. NAPI. */
 void page_pool_update_nid(struct page_pool *pool, int new_nid)
 {
+	struct page *page;
+
 	trace_page_pool_update_nid(pool, new_nid);
 	pool->p.nid = new_nid;
+
+	/* Flush pool alloc cache, as refill will check NUMA node */
+	while (pool->alloc.count) {
+		page = pool->alloc.cache[--pool->alloc.count];
+		__page_pool_return_page(pool, page);
+	}
 }
 EXPORT_SYMBOL(page_pool_update_nid);



2019-12-20 10:24:31

by Ilias Apalodimas

Subject: Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition

Hi Jesper,

I like the overall approach since this moves the check out of the hotpath.
@Saeed, since i got no hardware to test this on, would it be possible to check
that it still works fine for mlx5?

[...]
> + struct ptr_ring *r = &pool->ring;
> + struct page *page;
> + int pref_nid; /* preferred NUMA node */
> +
> + /* Quicker fallback, avoid locks when ring is empty */
> + if (__ptr_ring_empty(r))
> + return NULL;
> +
> + /* Softirq guarantee CPU and thus NUMA node is stable. This,
> + * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> + */
> + pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;

One of the use cases for this is that during the allocation we are not
guaranteed to pick up the correct NUMA node.
This will get automatically fixed once the driver starts recycling packets.

I don't feel strongly about this, since i don't usually like hiding value
changes from the user but, would it make sense to move this into
__page_pool_alloc_pages_slow() and change the pool->p.nid?

Since alloc_pages_node() will replace NUMA_NO_NODE with numa_mem_id()
regardless, why not store the actual node in our page pool information?
You can then skip this and check pool->p.nid == numa_mem_id(), regardless of
what's configured.
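
Roughly what i have in mind, just to illustrate (untested sketch, reusing the
names from the current code):

	/* in __page_pool_alloc_pages_slow(): resolve NUMA_NO_NODE once,
	 * so pool->p.nid always holds a real node id afterwards
	 */
	if (pool->p.nid == NUMA_NO_NODE)
		pool->p.nid = numa_mem_id();
	page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);

	/* ... and pool_page_reusable() can stay a plain comparison */
	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;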

Thanks
/Ilias

2019-12-20 10:42:20

by Jesper Dangaard Brouer

Subject: Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition

On Fri, 20 Dec 2019 12:23:14 +0200
Ilias Apalodimas <[email protected]> wrote:

> Hi Jesper,
>
> I like the overall approach since this moves the check out of the hotpath.
> @Saeed, since i got no hardware to test this on, would it be possible to check
> that it still works fine for mlx5?
>
> [...]
> > + struct ptr_ring *r = &pool->ring;
> > + struct page *page;
> > + int pref_nid; /* preferred NUMA node */
> > +
> > + /* Quicker fallback, avoid locks when ring is empty */
> > + if (__ptr_ring_empty(r))
> > + return NULL;
> > +
> > + /* Softirq guarantee CPU and thus NUMA node is stable. This,
> > + * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> > + */
> > + pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;
>
> One of the use cases for this is that during the allocation we are not
> guaranteed to pick up the correct NUMA node.
> This will get automatically fixed once the driver starts recycling packets.
>
> I don't feel strongly about this, since i don't usually like hiding value
> changes from the user but, would it make sense to move this into
> __page_pool_alloc_pages_slow() and change the pool->p.nid?
>
> Since alloc_pages_node() will replace NUMA_NO_NODE with numa_mem_id()
> regardless, why not store the actual node in our page pool information?
> You can then skip this and check pool->p.nid == numa_mem_id(), regardless of
> what's configured.

This single line of code helps support drivers controlling the nid
themselves. This is a feature that is only used by mlx5 AFAIK.

I do think it is useful to allow the driver to "control" the nid, as
pinning/preferring the pages to come from the NUMA node where the
PCI-e controller hardware is installed does have benefits.
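
E.g. a driver that wants this can simply pin the node when creating the pool,
something along these lines (illustrative sketch only, the values are made up
and 'dev' stands for the NIC's struct device):

	struct page_pool_params pp_params = {
		.order		= 0,
		.flags		= 0,
		.pool_size	= 1024,
		.nid		= dev_to_node(dev),	/* NIC's NUMA node */
		.dev		= dev,
		.dma_dir	= DMA_FROM_DEVICE,
	};
	struct page_pool *pool = page_pool_create(&pp_params);
	/* error handling (IS_ERR(pool)) omitted in this sketch */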

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2019-12-20 10:51:09

by Ilias Apalodimas

Subject: Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition

On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard Brouer wrote:
> On Fri, 20 Dec 2019 12:23:14 +0200
> Ilias Apalodimas <[email protected]> wrote:
>
> > Hi Jesper,
> >
> > I like the overall approach since this moves the check out of the hotpath.
> > @Saeed, since i got no hardware to test this on, would it be possible to check
> > that it still works fine for mlx5?
> >
> > [...]
> > > + struct ptr_ring *r = &pool->ring;
> > > + struct page *page;
> > > + int pref_nid; /* preferred NUMA node */
> > > +
> > > + /* Quicker fallback, avoid locks when ring is empty */
> > > + if (__ptr_ring_empty(r))
> > > + return NULL;
> > > +
> > > + /* Softirq guarantee CPU and thus NUMA node is stable. This,
> > > + * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> > > + */
> > > + pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;
> >
> > One of the use cases for this is that during the allocation we are not
> > guaranteed to pick up the correct NUMA node.
> > This will get automatically fixed once the driver starts recycling packets.
> >
> > I don't feel strongly about this, since i don't usually like hiding value
> > changes from the user but, would it make sense to move this into
> > __page_pool_alloc_pages_slow() and change the pool->p.nid?
> >
> > Since alloc_pages_node() will replace NUMA_NO_NODE with numa_mem_id()
> > regardless, why not store the actual node in our page pool information?
> > You can then skip this and check pool->p.nid == numa_mem_id(), regardless of
> > what's configured.
>
> This single code line helps support that drivers can control the nid
> themselves. This is a feature that is only used my mlx5 AFAIK.
>
> I do think that is useful to allow the driver to "control" the nid, as
> pinning/preferring the pages to come from the NUMA node that matches
> the PCI-e controller hardware is installed in does have benefits.

Sure you can keep the if statement as-is, it won't break anything.
Would we want to store the actual numa id in pool->p.nid if the user selects
'NUMA_NO_NODE'?


Thanks
/Ilias
>
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer
>

2019-12-20 15:24:19

by Jesper Dangaard Brouer

Subject: Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition

On Fri, 20 Dec 2019 12:49:37 +0200
Ilias Apalodimas <[email protected]> wrote:

> On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard Brouer wrote:
> > On Fri, 20 Dec 2019 12:23:14 +0200
> > Ilias Apalodimas <[email protected]> wrote:
> >
> > > Hi Jesper,
> > >
> > > I like the overall approach since this moves the check out of the hotpath.
> > > @Saeed, since i got no hardware to test this on, would it be possible to check
> > > that it still works fine for mlx5?
> > >
> > > [...]
> > > > + struct ptr_ring *r = &pool->ring;
> > > > + struct page *page;
> > > > + int pref_nid; /* preferred NUMA node */
> > > > +
> > > > + /* Quicker fallback, avoid locks when ring is empty */
> > > > + if (__ptr_ring_empty(r))
> > > > + return NULL;
> > > > +
> > > > + /* Softirq guarantee CPU and thus NUMA node is stable. This,
> > > > + * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> > > > + */
> > > > + pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;
> > >
> > > One of the use cases for this is that during the allocation we are not
> > > guaranteed to pick up the correct NUMA node.
> > > This will get automatically fixed once the driver starts recycling packets.
> > >
> > > I don't feel strongly about this, since i don't usually like hiding value
> > > changes from the user but, would it make sense to move this into
> > > __page_pool_alloc_pages_slow() and change the pool->p.nid?
> > >
> > > Since alloc_pages_node() will replace NUMA_NO_NODE with numa_mem_id()
> > > regardless, why not store the actual node in our page pool information?
> > > You can then skip this and check pool->p.nid == numa_mem_id(), regardless of
> > > what's configured.
> >
> > This single code line helps support that drivers can control the nid
> > themselves. This is a feature that is only used my mlx5 AFAIK.
> >
> > I do think that is useful to allow the driver to "control" the nid, as
> > pinning/preferring the pages to come from the NUMA node that matches
> > the PCI-e controller hardware is installed in does have benefits.
>
> Sure you can keep the if statement as-is, it won't break anything.
> Would we want to store the actual numa id in pool->p.nid if the user
> selects 'NUMA_NO_NODE'?

No. pool->p.nid should stay as NUMA_NO_NODE, because that makes it
dynamic. If someone moves an RX IRQ to another CPU on another NUMA
node, then this 'NUMA_NO_NODE' setting makes pages transition over
automatically.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2019-12-20 16:09:50

by Ilias Apalodimas

Subject: Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition

On Fri, Dec 20, 2019 at 04:22:54PM +0100, Jesper Dangaard Brouer wrote:
> On Fri, 20 Dec 2019 12:49:37 +0200
> Ilias Apalodimas <[email protected]> wrote:
>
> > On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard Brouer wrote:
> > > On Fri, 20 Dec 2019 12:23:14 +0200
> > > Ilias Apalodimas <[email protected]> wrote:
> > >
> > > > Hi Jesper,
> > > >
> > > > I like the overall approach since this moves the check out of the hotpath.
> > > > @Saeed, since i got no hardware to test this on, would it be possible to check
> > > > that it still works fine for mlx5?
> > > >
> > > > [...]
> > > > > + struct ptr_ring *r = &pool->ring;
> > > > > + struct page *page;
> > > > > + int pref_nid; /* preferred NUMA node */
> > > > > +
> > > > > + /* Quicker fallback, avoid locks when ring is empty */
> > > > > + if (__ptr_ring_empty(r))
> > > > > + return NULL;
> > > > > +
> > > > > + /* Softirq guarantee CPU and thus NUMA node is stable. This,
> > > > > + * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> > > > > + */
> > > > > + pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;
> > > >
> > > > One of the use cases for this is that during the allocation we are not
> > > > guaranteed to pick up the correct NUMA node.
> > > > This will get automatically fixed once the driver starts recycling packets.
> > > >
> > > > I don't feel strongly about this, since i don't usually like hiding value
> > > > changes from the user but, would it make sense to move this into
> > > > __page_pool_alloc_pages_slow() and change the pool->p.nid?
> > > >
> > > > Since alloc_pages_node() will replace NUMA_NO_NODE with numa_mem_id()
> > > > regardless, why not store the actual node in our page pool information?
> > > > You can then skip this and check pool->p.nid == numa_mem_id(), regardless of
> > > > what's configured.
> > >
> > > This single code line helps support that drivers can control the nid
> > > themselves. This is a feature that is only used my mlx5 AFAIK.
> > >
> > > I do think that is useful to allow the driver to "control" the nid, as
> > > pinning/preferring the pages to come from the NUMA node that matches
> > > the PCI-e controller hardware is installed in does have benefits.
> >
> > Sure you can keep the if statement as-is, it won't break anything.
> > Would we want to store the actual numa id in pool->p.nid if the user
> > selects 'NUMA_NO_NODE'?
>
> No. pool->p.nid should stay as NUMA_NO_NODE, because that makes it
> dynamic. If someone moves an RX IRQ to another CPU on another NUMA
> node, then this 'NUMA_NO_NODE' setting makes pages transitioned
> automatically.
Ok this assumed that drivers were going to use page_pool_nid_changed(), but with
the current code we don't have to force them to do that. Let's keep this as-is.
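
(For reference, the driver-side hook i was thinking of is the existing
page_pool_nid_changed() helper, roughly as below -- sketch from memory, and
the rq->page_pool part is just an example of a driver-private pointer:

	/* include/net/page_pool.h */
	static inline void page_pool_nid_changed(struct page_pool *pool,
						 int new_nid)
	{
		if (unlikely(pool->p.nid != new_nid))
			page_pool_update_nid(pool, new_nid);
	}

	/* called from a safe NAPI/RX-refill context in the driver, e.g.: */
	page_pool_nid_changed(rq->page_pool, numa_mem_id());
)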

I'll be running a few more tests, and will wait in case Saeed gets a chance to
test it, and then send my Reviewed-by.

Cheers
/Ilias
>
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer
>

2019-12-20 21:28:55

by Saeed Mahameed

Subject: Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition

On Fri, 2019-12-20 at 12:23 +0200, Ilias Apalodimas wrote:
> Hi Jesper,
>
> I like the overall approach since this moves the check out of the
> hotpath.
> @Saeed, since i got no hardware to test this on, would it be possible
> to check
> that it still works fine for mlx5?

The idea seems reasonable,
I will need a day or two to test and review this.

The only thing we need to be careful about is how heavy the flush
operation is on numa changes: holding a spin lock and releasing all
pages at once.
Prior to this patch, page releasing was done per packet, so there
should be an improvement here from the bulk page flush, but on the
other hand we will be holding a spin lock. I am not worried about spin
lock contention though, just about the potential cpu spikes.

Thanks,
Saeed.

>
> [...]
> > + struct ptr_ring *r = &pool->ring;
> > + struct page *page;
> > + int pref_nid; /* preferred NUMA node */
> > +
> > + /* Quicker fallback, avoid locks when ring is empty */
> > + if (__ptr_ring_empty(r))
> > + return NULL;
> > +
> > + /* Softirq guarantee CPU and thus NUMA node is stable. This,
> > + * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> > + */
> > + pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() :
> > pool->p.nid;
>
> One of the use cases for this is that during the allocation we are
> not
> guaranteed to pick up the correct NUMA node.
> This will get automatically fixed once the driver starts recycling
> packets.
>
> I don't feel strongly about this, since i don't usually like hiding
> value
> changes from the user but, would it make sense to move this into
> __page_pool_alloc_pages_slow() and change the pool->p.nid?
>
> Since alloc_pages_node() will replace NUMA_NO_NODE with numa_mem_id()
> regardless, why not store the actual node in our page pool
> information?
> You can then skip this and check pool->p.nid == numa_mem_id(),
> regardless of
> what's configured.
>
> Thanks
> /Ilias

2019-12-23 07:57:58

by Ilias Apalodimas

Subject: Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition

Hi Jesper,

Looking at the overall path again, i still think we need to reconsider
pool->p.nid semantics.

As i said, i like the patch and the whole functionality, and the code seems fine,
but here's the current situation.
If a user sets pool->p.nid == NUMA_NO_NODE and wants to use
page_pool_update_nid(), the whole behavior feels a little odd.
page_pool_update_nid()'s first check will always be true (since .nid ==
NUMA_NO_NODE). Then we'll update this to a real nid. So we end up overwriting
what the user initially coded in.
This is close to what i proposed in the previous mails on this thread: always
store a real nid, even if the user explicitly requests NUMA_NO_NODE.

So the semantics are still a problem. I'll stick to what we initially suggested:
1. We either *always* store a real nid,
or
2. If NUMA_NO_NODE is present, ignore every other check and recycle the memory
blindly (rough sketch below).
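
For clarity, option 2 would amount to keeping a check along these lines
(untested sketch, shown against the old pool_page_reusable() for simplicity):

	static bool pool_page_reusable(struct page_pool *pool, struct page *page)
	{
		return !page_is_pfmemalloc(page) &&
		       (pool->p.nid == NUMA_NO_NODE ||
			page_to_nid(page) == pool->p.nid);
	}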

Regards
/Ilias

On Fri, Dec 20, 2019 at 06:06:49PM +0200, Ilias Apalodimas wrote:
> On Fri, Dec 20, 2019 at 04:22:54PM +0100, Jesper Dangaard Brouer wrote:
> > On Fri, 20 Dec 2019 12:49:37 +0200
> > Ilias Apalodimas <[email protected]> wrote:
> >
> > > On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard Brouer wrote:
> > > > On Fri, 20 Dec 2019 12:23:14 +0200
> > > > Ilias Apalodimas <[email protected]> wrote:
> > > >
> > > > > Hi Jesper,
> > > > >
> > > > > I like the overall approach since this moves the check out of the hotpath.
> > > > > @Saeed, since i got no hardware to test this on, would it be possible to check
> > > > > that it still works fine for mlx5?
> > > > >
> > > > > [...]
> > > > > > + struct ptr_ring *r = &pool->ring;
> > > > > > + struct page *page;
> > > > > > + int pref_nid; /* preferred NUMA node */
> > > > > > +
> > > > > > + /* Quicker fallback, avoid locks when ring is empty */
> > > > > > + if (__ptr_ring_empty(r))
> > > > > > + return NULL;
> > > > > > +
> > > > > > + /* Softirq guarantee CPU and thus NUMA node is stable. This,
> > > > > > + * assumes CPU refilling driver RX-ring will also run RX-NAPI.
> > > > > > + */
> > > > > > + pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;
> > > > >
> > > > > One of the use cases for this is that during the allocation we are not
> > > > > guaranteed to pick up the correct NUMA node.
> > > > > This will get automatically fixed once the driver starts recycling packets.
> > > > >
> > > > > I don't feel strongly about this, since i don't usually like hiding value
> > > > > changes from the user but, would it make sense to move this into
> > > > > __page_pool_alloc_pages_slow() and change the pool->p.nid?
> > > > >
> > > > > Since alloc_pages_node() will replace NUMA_NO_NODE with numa_mem_id()
> > > > > regardless, why not store the actual node in our page pool information?
> > > > > You can then skip this and check pool->p.nid == numa_mem_id(), regardless of
> > > > > what's configured.
> > > >
> > > > This single code line helps support that drivers can control the nid
> > > > themselves. This is a feature that is only used my mlx5 AFAIK.
> > > >
> > > > I do think that is useful to allow the driver to "control" the nid, as
> > > > pinning/preferring the pages to come from the NUMA node that matches
> > > > the PCI-e controller hardware is installed in does have benefits.
> > >
> > > Sure you can keep the if statement as-is, it won't break anything.
> > > Would we want to store the actual numa id in pool->p.nid if the user
> > > selects 'NUMA_NO_NODE'?
> >
> > No. pool->p.nid should stay as NUMA_NO_NODE, because that makes it
> > dynamic. If someone moves an RX IRQ to another CPU on another NUMA
> > node, then this 'NUMA_NO_NODE' setting makes pages transitioned
> > automatically.
> Ok this assumed that drivers were going to use page_pool_nid_changed(), but with
> the current code we don't have to force them to do that. Let's keep this as-is.
>
> I'll be running a few more tests and wait in case Saeed gets a chance to test
> it and send my reviewed-by
>
> Cheers
> /Ilias
> >
> > --
> > Best regards,
> > Jesper Dangaard Brouer
> > MSc.CS, Principal Kernel Engineer at Red Hat
> > LinkedIn: http://www.linkedin.com/in/brouer
> >

2019-12-23 16:56:16

by Jesper Dangaard Brouer

Subject: Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition

On Mon, 23 Dec 2019 09:57:00 +0200
Ilias Apalodimas <[email protected]> wrote:

> Hi Jesper,
>
> Looking at the overall path again, i still need we need to reconsider
> pool->p.nid semantics.
>
> As i said i like the patch and the whole functionality and code seems fine,
> but here's the current situation.

> If a user sets pool->p.nid == NUMA_NO_NODE and wants to use
> page_pool_update_nid() the whole behavior feels a liitle odd.

As soon as a driver uses page_pool_update_nid(), that means it wants to
control the NUMA placement explicitly. As soon as that happens, it is
the driver's responsibility and choice, and the page_pool API must
respect that (and not automatically change it behind the driver's back).


> page_pool_update_nid() first check will always be true since .nid =
> NUMA_NO_NODE). Then we'll update this to a real nid. So we end up
> overwriting what the user initially coded in.
>
> This is close to what i proposed in the previous mails on this
> thread. Always store a real nid even if the user explicitly requests
> NUMA_NO_NODE.
>
> So semantics is still a problem. I'll stick to what we initially
> suggested.
> 1. We either *always* store a real nid
> or
> 2. If NUMA_NO_NODE is present ignore every other check and recycle
> the memory blindly.
>

Hmm... I actually disagree with both 1 and 2.

My semantics proposal:
If a driver configures page_pool with NUMA_NO_NODE, then page_pool tries
to help get the best default performance. (Which, according to the
performance measurements, is to have RX-pages belong to the NUMA node
that RX-processing runs on.)

The reason I want this behavior is that during driver init/boot, it can
easily happen that a driver allocates RX-pages from the wrong NUMA node.
This will cause a performance slowdown that normally doesn't happen,
because without a cache (like page_pool) RX-pages would fairly quickly
transition over to the RX NUMA node (instead we keep recycling these,
as in your case #2, where you suggest recycling blindly in case of
NUMA_NO_NODE). IMHO page_pool should hide this border-line case from
driver developers.

--Jesper


> On Fri, Dec 20, 2019 at 06:06:49PM +0200, Ilias Apalodimas wrote:
> > On Fri, Dec 20, 2019 at 04:22:54PM +0100, Jesper Dangaard Brouer
> > wrote:
> > > On Fri, 20 Dec 2019 12:49:37 +0200
> > > Ilias Apalodimas <[email protected]> wrote:
> > >
> > > > On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard
> > > > Brouer wrote:
> > > > > On Fri, 20 Dec 2019 12:23:14 +0200
> > > > > Ilias Apalodimas <[email protected]> wrote:
> > > > >
> > > > > > Hi Jesper,
> > > > > >
> > > > > > I like the overall approach since this moves the check out
> > > > > > of the hotpath. @Saeed, since i got no hardware to test
> > > > > > this on, would it be possible to check that it still works
> > > > > > fine for mlx5?
> > > > > >
> > > > > > [...]
> > > > > > > + struct ptr_ring *r = &pool->ring;
> > > > > > > + struct page *page;
> > > > > > > + int pref_nid; /* preferred NUMA node */
> > > > > > > +
> > > > > > > + /* Quicker fallback, avoid locks when ring is
> > > > > > > empty */
> > > > > > > + if (__ptr_ring_empty(r))
> > > > > > > + return NULL;
> > > > > > > +
> > > > > > > + /* Softirq guarantee CPU and thus NUMA node is
> > > > > > > stable. This,
> > > > > > > + * assumes CPU refilling driver RX-ring will
> > > > > > > also run RX-NAPI.
> > > > > > > + */
> > > > > > > + pref_nid = (pool->p.nid == NUMA_NO_NODE) ?
> > > > > > > numa_mem_id() : pool->p.nid;
> > > > > >
> > > > > > One of the use cases for this is that during the allocation
> > > > > > we are not guaranteed to pick up the correct NUMA node.
> > > > > > This will get automatically fixed once the driver starts
> > > > > > recycling packets.
> > > > > >
> > > > > > I don't feel strongly about this, since i don't usually
> > > > > > like hiding value changes from the user but, would it make
> > > > > > sense to move this into __page_pool_alloc_pages_slow() and
> > > > > > change the pool->p.nid?
> > > > > >
> > > > > > Since alloc_pages_node() will replace NUMA_NO_NODE with
> > > > > > numa_mem_id() regardless, why not store the actual node in
> > > > > > our page pool information? You can then skip this and check
> > > > > > pool->p.nid == numa_mem_id(), regardless of what's
> > > > > > configured.
> > > > >
> > > > > This single code line helps support that drivers can control
> > > > > the nid themselves. This is a feature that is only used my
> > > > > mlx5 AFAIK.
> > > > >
> > > > > I do think that is useful to allow the driver to "control"
> > > > > the nid, as pinning/preferring the pages to come from the
> > > > > NUMA node that matches the PCI-e controller hardware is
> > > > > installed in does have benefits.
> > > >
> > > > Sure you can keep the if statement as-is, it won't break
> > > > anything. Would we want to store the actual numa id in
> > > > pool->p.nid if the user selects 'NUMA_NO_NODE'?
> > >
> > > No. pool->p.nid should stay as NUMA_NO_NODE, because that makes it
> > > dynamic. If someone moves an RX IRQ to another CPU on another
> > > NUMA node, then this 'NUMA_NO_NODE' setting makes pages
> > > transitioned automatically.
> > Ok this assumed that drivers were going to use
> > page_pool_nid_changed(), but with the current code we don't have to
> > force them to do that. Let's keep this as-is.
> >
> > I'll be running a few more tests and wait in case Saeed gets a
> > chance to test it and send my reviewed-by


--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2019-12-23 22:11:09

by Saeed Mahameed

Subject: Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition

On Mon, 2019-12-23 at 17:52 +0100, Jesper Dangaard Brouer wrote:
> On Mon, 23 Dec 2019 09:57:00 +0200
> Ilias Apalodimas <[email protected]> wrote:
>
> > Hi Jesper,
> >
> > Looking at the overall path again, i still need we need to
> > reconsider
> > pool->p.nid semantics.
> >
> > As i said i like the patch and the whole functionality and code
> > seems fine,
> > but here's the current situation.
> > If a user sets pool->p.nid == NUMA_NO_NODE and wants to use
> > page_pool_update_nid() the whole behavior feels a liitle odd.
>
> As soon as driver uses page_pool_update_nid() than means they want to
> control the NUMA placement explicitly. As soon as that happens, it
> is
> the drivers responsibility and choice, and page_pool API must respect
> that (and not automatically change that behind drivers back).
>
>
> > page_pool_update_nid() first check will always be true since .nid =
> > NUMA_NO_NODE). Then we'll update this to a real nid. So we end up
> > overwriting what the user initially coded in.
> >
> > This is close to what i proposed in the previous mails on this
> > thread. Always store a real nid even if the user explicitly
> > requests
> > NUMA_NO_NODE.
> >
> > So semantics is still a problem. I'll stick to what we initially
> > suggested.
> > 1. We either *always* store a real nid
> > or
> > 2. If NUMA_NO_NODE is present ignore every other check and recycle
> > the memory blindly.
> >
>
> Hmm... I actually disagree with both 1 and 2.
>
> My semantics proposal:
> If driver configures page_pool with NUMA_NO_NODE, then page_pool
> tried
> to help get the best default performance. (Which according to
> performance measurements is to have RX-pages belong to the NUMA node
> RX-processing runs on).
>

which is the MSI-X affinity; the pool has no knowledge of that at
initialization.

> The reason I want this behavior is that during driver init/boot, it
> can
> easily happen that a driver allocates RX-pages from wrong NUMA node.
> This will cause a performance slowdown, that normally doesn't happen,
> because without a cache (like page_pool) RX-pages would fairly
> quickly
> transition over to the RX NUMA node (instead we keep recycling these,
> in your case #2, where you suggest recycle blindly in case of
> NUMA_NO_NODE). IMHO page_pool should hide this border-line case from
> driver developers.
>

So, Ilias's #1 suggestion makes sense, to always store a valid nid
value.
The question is which value to store on initialization if the user
provided NUMA_NO_NODE? I don't think the pool is capable of choosing
the right value, so let's just use numa node 0.

If the developer cares, he would have picked the right affinity on
initialization, or he can just call pool_update_nid() when the affinity
is determined, and everything will be fine after that point.

My 2 cents is that you just can't provide perfect performance when the
user uses NUMA_NO_NODE, so just pick any default concrete node id and
avoid dealing with NUMA_NO_NODE in the pool fast path, or even the slow
path.
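
Something like this at pool init time would be enough (rough sketch, not
tested; page_pool_init() is the internal init helper in net/core/page_pool.c):

	/* hypothetical: resolve NUMA_NO_NODE once, at pool creation */
	if (pool->p.nid == NUMA_NO_NODE)
		pool->p.nid = 0;	/* or numa_mem_id() of the init CPU */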

> --Jesper
>
>
> > On Fri, Dec 20, 2019 at 06:06:49PM +0200, Ilias Apalodimas wrote:
> > > On Fri, Dec 20, 2019 at 04:22:54PM +0100, Jesper Dangaard Brouer
> > > wrote:
> > > > On Fri, 20 Dec 2019 12:49:37 +0200
> > > > Ilias Apalodimas <[email protected]> wrote:
> > > >
> > > > > On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard
> > > > > Brouer wrote:
> > > > > > On Fri, 20 Dec 2019 12:23:14 +0200
> > > > > > Ilias Apalodimas <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Jesper,
> > > > > > >
> > > > > > > I like the overall approach since this moves the check
> > > > > > > out
> > > > > > > of the hotpath. @Saeed, since i got no hardware to test
> > > > > > > this on, would it be possible to check that it still
> > > > > > > works
> > > > > > > fine for mlx5?
> > > > > > >
> > > > > > > [...]
> > > > > > > > + struct ptr_ring *r = &pool->ring;
> > > > > > > > + struct page *page;
> > > > > > > > + int pref_nid; /* preferred NUMA node */
> > > > > > > > +
> > > > > > > > + /* Quicker fallback, avoid locks when ring is
> > > > > > > > empty */
> > > > > > > > + if (__ptr_ring_empty(r))
> > > > > > > > + return NULL;
> > > > > > > > +
> > > > > > > > + /* Softirq guarantee CPU and thus NUMA node is
> > > > > > > > stable. This,
> > > > > > > > + * assumes CPU refilling driver RX-ring will
> > > > > > > > also run RX-NAPI.
> > > > > > > > + */
> > > > > > > > + pref_nid = (pool->p.nid == NUMA_NO_NODE) ?
> > > > > > > > numa_mem_id() : pool->p.nid;
> > > > > > >
> > > > > > > One of the use cases for this is that during the
> > > > > > > allocation
> > > > > > > we are not guaranteed to pick up the correct NUMA node.
> > > > > > > This will get automatically fixed once the driver starts
> > > > > > > recycling packets.
> > > > > > >
> > > > > > > I don't feel strongly about this, since i don't usually
> > > > > > > like hiding value changes from the user but, would it
> > > > > > > make
> > > > > > > sense to move this into __page_pool_alloc_pages_slow()
> > > > > > > and
> > > > > > > change the pool->p.nid?
> > > > > > >
> > > > > > > Since alloc_pages_node() will replace NUMA_NO_NODE with
> > > > > > > numa_mem_id() regardless, why not store the actual node
> > > > > > > in
> > > > > > > our page pool information? You can then skip this and
> > > > > > > check
> > > > > > > pool->p.nid == numa_mem_id(), regardless of what's
> > > > > > > configured.
> > > > > >
> > > > > > This single code line helps support that drivers can
> > > > > > control
> > > > > > the nid themselves. This is a feature that is only used my
> > > > > > mlx5 AFAIK.
> > > > > >
> > > > > > I do think that is useful to allow the driver to "control"
> > > > > > the nid, as pinning/preferring the pages to come from the
> > > > > > NUMA node that matches the PCI-e controller hardware is
> > > > > > installed in does have benefits.
> > > > >
> > > > > Sure you can keep the if statement as-is, it won't break
> > > > > anything. Would we want to store the actual numa id in
> > > > > pool->p.nid if the user selects 'NUMA_NO_NODE'?
> > > >
> > > > No. pool->p.nid should stay as NUMA_NO_NODE, because that makes
> > > > it
> > > > dynamic. If someone moves an RX IRQ to another CPU on another
> > > > NUMA node, then this 'NUMA_NO_NODE' setting makes pages
> > > > transitioned automatically.
> > > Ok this assumed that drivers were going to use
> > > page_pool_nid_changed(), but with the current code we don't have
> > > to
> > > force them to do that. Let's keep this as-is.
> > >
> > > I'll be running a few more tests and wait in case Saeed gets a
> > > chance to test it and send my reviewed-by
>
>

2019-12-24 07:42:19

by Ilias Apalodimas

Subject: Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition

Hi Jesper,

On Mon, Dec 23, 2019 at 05:52:57PM +0100, Jesper Dangaard Brouer wrote:
> On Mon, 23 Dec 2019 09:57:00 +0200
> Ilias Apalodimas <[email protected]> wrote:
>
> > Hi Jesper,
> >
> > Looking at the overall path again, i still need we need to reconsider
> > pool->p.nid semantics.
> >
> > As i said i like the patch and the whole functionality and code seems fine,
> > but here's the current situation.
>
> > If a user sets pool->p.nid == NUMA_NO_NODE and wants to use
> > page_pool_update_nid() the whole behavior feels a liitle odd.
>
> As soon as driver uses page_pool_update_nid() than means they want to
> control the NUMA placement explicitly. As soon as that happens, it is
> the drivers responsibility and choice, and page_pool API must respect
> that (and not automatically change that behind drivers back).
>
>
> > page_pool_update_nid() first check will always be true since .nid =
> > NUMA_NO_NODE). Then we'll update this to a real nid. So we end up
> > overwriting what the user initially coded in.
> >
> > This is close to what i proposed in the previous mails on this
> > thread. Always store a real nid even if the user explicitly requests
> > NUMA_NO_NODE.
> >
> > So semantics is still a problem. I'll stick to what we initially
> > suggested.
> > 1. We either *always* store a real nid
> > or
> > 2. If NUMA_NO_NODE is present ignore every other check and recycle
> > the memory blindly.
> >
>
> Hmm... I actually disagree with both 1 and 2.
>
> My semantics proposal:
> If driver configures page_pool with NUMA_NO_NODE, then page_pool tried
> to help get the best default performance. (Which according to
> performance measurements is to have RX-pages belong to the NUMA node
> RX-processing runs on).
>
> The reason I want this behavior is that during driver init/boot, it can
> easily happen that a driver allocates RX-pages from wrong NUMA node.
> This will cause a performance slowdown, that normally doesn't happen,
> because without a cache (like page_pool) RX-pages would fairly quickly
> transition over to the RX NUMA node (instead we keep recycling these,
> in your case #2, where you suggest recycle blindly in case of
> NUMA_NO_NODE). IMHO page_pool should hide this border-line case from
> driver developers.

Yeah, #2 has different semantics than the one you propose. If he chooses
NUMA_NO_NODE, i'd expect the machine(s) the driver sits on are not NUMA-aware.
Think specific SoCs; i'd never expect PCI cards to use that.
As i said, i don't feel strongly about this anyway; it's just another case i had
under consideration, but i like what you propose more. I'll try to add
documentation on the page_pool API and describe the semantics you have in mind.


Thanks
/Ilias

>
> --Jesper
>
>
> > On Fri, Dec 20, 2019 at 06:06:49PM +0200, Ilias Apalodimas wrote:
> > > On Fri, Dec 20, 2019 at 04:22:54PM +0100, Jesper Dangaard Brouer
> > > wrote:
> > > > On Fri, 20 Dec 2019 12:49:37 +0200
> > > > Ilias Apalodimas <[email protected]> wrote:
> > > >
> > > > > On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard
> > > > > Brouer wrote:
> > > > > > On Fri, 20 Dec 2019 12:23:14 +0200
> > > > > > Ilias Apalodimas <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Jesper,
> > > > > > >
> > > > > > > I like the overall approach since this moves the check out
> > > > > > > of the hotpath. @Saeed, since i got no hardware to test
> > > > > > > this on, would it be possible to check that it still works
> > > > > > > fine for mlx5?
> > > > > > >
> > > > > > > [...]
> > > > > > > > + struct ptr_ring *r = &pool->ring;
> > > > > > > > + struct page *page;
> > > > > > > > + int pref_nid; /* preferred NUMA node */
> > > > > > > > +
> > > > > > > > + /* Quicker fallback, avoid locks when ring is
> > > > > > > > empty */
> > > > > > > > + if (__ptr_ring_empty(r))
> > > > > > > > + return NULL;
> > > > > > > > +
> > > > > > > > + /* Softirq guarantee CPU and thus NUMA node is
> > > > > > > > stable. This,
> > > > > > > > + * assumes CPU refilling driver RX-ring will
> > > > > > > > also run RX-NAPI.
> > > > > > > > + */
> > > > > > > > + pref_nid = (pool->p.nid == NUMA_NO_NODE) ?
> > > > > > > > numa_mem_id() : pool->p.nid;
> > > > > > >
> > > > > > > One of the use cases for this is that during the allocation
> > > > > > > we are not guaranteed to pick up the correct NUMA node.
> > > > > > > This will get automatically fixed once the driver starts
> > > > > > > recycling packets.
> > > > > > >
> > > > > > > I don't feel strongly about this, since i don't usually
> > > > > > > like hiding value changes from the user but, would it make
> > > > > > > sense to move this into __page_pool_alloc_pages_slow() and
> > > > > > > change the pool->p.nid?
> > > > > > >
> > > > > > > Since alloc_pages_node() will replace NUMA_NO_NODE with
> > > > > > > numa_mem_id() regardless, why not store the actual node in
> > > > > > > our page pool information? You can then skip this and check
> > > > > > > pool->p.nid == numa_mem_id(), regardless of what's
> > > > > > > configured.
> > > > > >
> > > > > > This single code line helps support that drivers can control
> > > > > > the nid themselves. This is a feature that is only used my
> > > > > > mlx5 AFAIK.
> > > > > >
> > > > > > I do think that is useful to allow the driver to "control"
> > > > > > the nid, as pinning/preferring the pages to come from the
> > > > > > NUMA node that matches the PCI-e controller hardware is
> > > > > > installed in does have benefits.
> > > > >
> > > > > Sure you can keep the if statement as-is, it won't break
> > > > > anything. Would we want to store the actual numa id in
> > > > > pool->p.nid if the user selects 'NUMA_NO_NODE'?
> > > >
> > > > No. pool->p.nid should stay as NUMA_NO_NODE, because that makes it
> > > > dynamic. If someone moves an RX IRQ to another CPU on another
> > > > NUMA node, then this 'NUMA_NO_NODE' setting makes pages
> > > > transitioned automatically.
> > > Ok this assumed that drivers were going to use
> > > page_pool_nid_changed(), but with the current code we don't have to
> > > force them to do that. Let's keep this as-is.
> > >
> > > I'll be running a few more tests and wait in case Saeed gets a
> > > chance to test it and send my reviewed-by
>
>
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer
>

2019-12-24 09:35:08

by Ilias Apalodimas

Subject: Re: [net-next v5 PATCH] page_pool: handle page recycle for NUMA_NO_NODE condition

Hi Saeed,
> which is the msix affinity.. the pool has no knowledge of that on
> initialization.
>
> > The reason I want this behavior is that during driver init/boot, it
> > can
> > easily happen that a driver allocates RX-pages from wrong NUMA node.
> > This will cause a performance slowdown, that normally doesn't happen,
> > because without a cache (like page_pool) RX-pages would fairly
> > quickly
> > transition over to the RX NUMA node (instead we keep recycling these,
> > in your case #2, where you suggest recycle blindly in case of
> > NUMA_NO_NODE). IMHO page_pool should hide this border-line case from
> > driver developers.
> >
>
> So, Ilias's #1 suggestion make sense, to always store a valid nid
> value.
> the question is which value to store on initialization if the user
> provided NUMA_NO_NODE ? I don't think the pool is capable of choosing
> the right value, so let's just use numa node 0.

Again, i don't mind using the current solution. We could use 0 or whatever
numa node is chosen by alloc_pages_node().

>
> If the developer cares, he would have picked the right affinity on
> initialization, or he can just call pool_update_nid() when the affinity
> is determined and every thing will be fine after that point.
>
> My 2cent is that you just can't provide the perfect performance when
> the user uses NUMA_NO_NODE, so just pick any default concrete node id
> and avoid dealing with NUMA_NO_NODE in the pool fast or even slow
> path..

I don't have a strong preference for any of those. I just prefer the homogeneous
approach of always storing a normal, usable memory id. Either way, the rest of
the code seems fine, so i'll approve this once you manage to test it on your setup.

I did test it on my netsec card using NUMA_NO_NODE. On that machine though it
doesn't make any difference, since page_to_nid(page) and numa_mem_id() always
return 0 there. So the allocation is already 'correct'; the only thing that
changes once i call page_pool_update_nid() is pool->p.nid.
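
(For anyone who wants to double check this on their own setup, a trivial
printk in the recycle path is enough, e.g.:

	pr_info("page nid=%d cpu nid=%d pool nid=%d\n",
		page_to_nid(page), numa_mem_id(), pool->p.nid);
)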

Thanks
/Ilias
>
> > --Jesper
> >
> >
> > > On Fri, Dec 20, 2019 at 06:06:49PM +0200, Ilias Apalodimas wrote:
> > > > On Fri, Dec 20, 2019 at 04:22:54PM +0100, Jesper Dangaard Brouer
> > > > wrote:
> > > > > On Fri, 20 Dec 2019 12:49:37 +0200
> > > > > Ilias Apalodimas <[email protected]> wrote:
> > > > >
> > > > > > On Fri, Dec 20, 2019 at 11:41:16AM +0100, Jesper Dangaard
> > > > > > Brouer wrote:
> > > > > > > On Fri, 20 Dec 2019 12:23:14 +0200
> > > > > > > Ilias Apalodimas <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hi Jesper,
> > > > > > > >
> > > > > > > > I like the overall approach since this moves the check
> > > > > > > > out
> > > > > > > > of the hotpath. @Saeed, since i got no hardware to test
> > > > > > > > this on, would it be possible to check that it still
> > > > > > > > works
> > > > > > > > fine for mlx5?
> > > > > > > >
> > > > > > > > [...]
> > > > > > > > > + struct ptr_ring *r = &pool->ring;
> > > > > > > > > + struct page *page;
> > > > > > > > > + int pref_nid; /* preferred NUMA node */
> > > > > > > > > +
> > > > > > > > > + /* Quicker fallback, avoid locks when ring is
> > > > > > > > > empty */
> > > > > > > > > + if (__ptr_ring_empty(r))
> > > > > > > > > + return NULL;
> > > > > > > > > +
> > > > > > > > > + /* Softirq guarantee CPU and thus NUMA node is
> > > > > > > > > stable. This,
> > > > > > > > > + * assumes CPU refilling driver RX-ring will
> > > > > > > > > also run RX-NAPI.
> > > > > > > > > + */
> > > > > > > > > + pref_nid = (pool->p.nid == NUMA_NO_NODE) ?
> > > > > > > > > numa_mem_id() : pool->p.nid;
> > > > > > > >
> > > > > > > > One of the use cases for this is that during the
> > > > > > > > allocation
> > > > > > > > we are not guaranteed to pick up the correct NUMA node.
> > > > > > > > This will get automatically fixed once the driver starts
> > > > > > > > recycling packets.
> > > > > > > >
> > > > > > > > I don't feel strongly about this, since i don't usually
> > > > > > > > like hiding value changes from the user but, would it
> > > > > > > > make
> > > > > > > > sense to move this into __page_pool_alloc_pages_slow()
> > > > > > > > and
> > > > > > > > change the pool->p.nid?
> > > > > > > >
> > > > > > > > Since alloc_pages_node() will replace NUMA_NO_NODE with
> > > > > > > > numa_mem_id() regardless, why not store the actual node
> > > > > > > > in
> > > > > > > > our page pool information? You can then skip this and
> > > > > > > > check
> > > > > > > > pool->p.nid == numa_mem_id(), regardless of what's
> > > > > > > > configured.
> > > > > > >
> > > > > > > This single code line helps support that drivers can
> > > > > > > control
> > > > > > > the nid themselves. This is a feature that is only used my
> > > > > > > mlx5 AFAIK.
> > > > > > >
> > > > > > > I do think that is useful to allow the driver to "control"
> > > > > > > the nid, as pinning/preferring the pages to come from the
> > > > > > > NUMA node that matches the PCI-e controller hardware is
> > > > > > > installed in does have benefits.
> > > > > >
> > > > > > Sure you can keep the if statement as-is, it won't break
> > > > > > anything. Would we want to store the actual numa id in
> > > > > > pool->p.nid if the user selects 'NUMA_NO_NODE'?
> > > > >
> > > > > No. pool->p.nid should stay as NUMA_NO_NODE, because that makes
> > > > > it
> > > > > dynamic. If someone moves an RX IRQ to another CPU on another
> > > > > NUMA node, then this 'NUMA_NO_NODE' setting makes pages
> > > > > transitioned automatically.
> > > > Ok this assumed that drivers were going to use
> > > > page_pool_nid_changed(), but with the current code we don't have
> > > > to
> > > > force them to do that. Let's keep this as-is.
> > > >
> > > > I'll be running a few more tests and wait in case Saeed gets a
> > > > chance to test it and send my reviewed-by
> >
> >