2023-04-23 08:08:09

by Kal Cutter Conley

[permalink] [raw]
Subject: [PATCH] xsk: Use pool->dma_pages to check for DMA

Compare pool->dma_pages instead of pool->dma_pages_cnt to check for an
active DMA mapping. pool->dma_pages needs to be read anyway to access
the map so this compiles to more efficient code.

Signed-off-by: Kal Conley <[email protected]>
Acked-by: Magnus Karlsson <[email protected]>
---
include/net/xsk_buff_pool.h | 2 +-
net/xdp/xsk_buff_pool.c | 7 ++++---
2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index d318c769b445..a8d7b8a3688a 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -180,7 +180,7 @@ static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
if (likely(!cross_pg))
return false;

- return pool->dma_pages_cnt &&
+ return pool->dma_pages &&
!(pool->dma_pages[addr >> PAGE_SHIFT] & XSK_NEXT_PG_CONTIG_MASK);
}

diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index b2df1e0f8153..26f6d304451e 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -350,7 +350,7 @@ void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs)
{
struct xsk_dma_map *dma_map;

- if (pool->dma_pages_cnt == 0)
+ if (!pool->dma_pages)
return;

dma_map = xp_find_dma_map(pool);
@@ -364,6 +364,7 @@ void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs)

__xp_dma_unmap(dma_map, attrs);
kvfree(pool->dma_pages);
+ pool->dma_pages = NULL;
pool->dma_pages_cnt = 0;
pool->dev = NULL;
}
@@ -503,7 +504,7 @@ static struct xdp_buff_xsk *__xp_alloc(struct xsk_buff_pool *pool)
if (pool->unaligned) {
xskb = pool->free_heads[--pool->free_heads_cnt];
xp_init_xskb_addr(xskb, pool, addr);
- if (pool->dma_pages_cnt)
+ if (pool->dma_pages)
xp_init_xskb_dma(xskb, pool, pool->dma_pages, addr);
} else {
xskb = &pool->heads[xp_aligned_extract_idx(pool, addr)];
@@ -569,7 +570,7 @@ static u32 xp_alloc_new_from_fq(struct xsk_buff_pool *pool, struct xdp_buff **xd
if (pool->unaligned) {
xskb = pool->free_heads[--pool->free_heads_cnt];
xp_init_xskb_addr(xskb, pool, addr);
- if (pool->dma_pages_cnt)
+ if (pool->dma_pages)
xp_init_xskb_dma(xskb, pool, pool->dma_pages, addr);
} else {
xskb = &pool->heads[xp_aligned_extract_idx(pool, addr)];
--
2.39.2


2023-04-23 19:15:38

by Kal Cutter Conley

[permalink] [raw]
Subject: Re: [PATCH] xsk: Use pool->dma_pages to check for DMA

> Compare pool->dma_pages instead of pool->dma_pages_cnt to check for an
> active DMA mapping. pool->dma_pages needs to be read anyway to access
> the map so this compiles to more efficient code.
>
> Signed-off-by: Kal Conley <[email protected]>
> Acked-by: Magnus Karlsson <[email protected]>

I forgot to specify the target tree. This patch should be applied to bpf-next.

2023-04-24 19:20:26

by John Fastabend

[permalink] [raw]
Subject: RE: [PATCH] xsk: Use pool->dma_pages to check for DMA

Kal Conley wrote:
> Compare pool->dma_pages instead of pool->dma_pages_cnt to check for an
> active DMA mapping. pool->dma_pages needs to be read anyway to access
> the map so this compiles to more efficient code.

Was it noticable in some sort of performance test?

>
> Signed-off-by: Kal Conley <[email protected]>
> Acked-by: Magnus Karlsson <[email protected]>
> ---
> include/net/xsk_buff_pool.h | 2 +-
> net/xdp/xsk_buff_pool.c | 7 ++++---
> 2 files changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> index d318c769b445..a8d7b8a3688a 100644
> --- a/include/net/xsk_buff_pool.h
> +++ b/include/net/xsk_buff_pool.h
> @@ -180,7 +180,7 @@ static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
> if (likely(!cross_pg))
> return false;
>
> - return pool->dma_pages_cnt &&
> + return pool->dma_pages &&
> !(pool->dma_pages[addr >> PAGE_SHIFT] & XSK_NEXT_PG_CONTIG_MASK);
> }
>
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index b2df1e0f8153..26f6d304451e 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -350,7 +350,7 @@ void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs)
> {
> struct xsk_dma_map *dma_map;
>
> - if (pool->dma_pages_cnt == 0)
> + if (!pool->dma_pages)
> return;

This seems to be used in the setup/tear-down paths so your optimizing
a control side. Is there a fast path with this code? I walked the
ice driver. If its just setup code we should do whatever is more
readable.

>
> dma_map = xp_find_dma_map(pool);
> @@ -364,6 +364,7 @@ void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs)
>
> __xp_dma_unmap(dma_map, attrs);
> kvfree(pool->dma_pages);
> + pool->dma_pages = NULL;
> pool->dma_pages_cnt = 0;
> pool->dev = NULL;
> }
> @@ -503,7 +504,7 @@ static struct xdp_buff_xsk *__xp_alloc(struct xsk_buff_pool *pool)
> if (pool->unaligned) {
> xskb = pool->free_heads[--pool->free_heads_cnt];
> xp_init_xskb_addr(xskb, pool, addr);
> - if (pool->dma_pages_cnt)
> + if (pool->dma_pages)
> xp_init_xskb_dma(xskb, pool, pool->dma_pages, addr);
> } else {
> xskb = &pool->heads[xp_aligned_extract_idx(pool, addr)];
> @@ -569,7 +570,7 @@ static u32 xp_alloc_new_from_fq(struct xsk_buff_pool *pool, struct xdp_buff **xd
> if (pool->unaligned) {
> xskb = pool->free_heads[--pool->free_heads_cnt];
> xp_init_xskb_addr(xskb, pool, addr);
> - if (pool->dma_pages_cnt)
> + if (pool->dma_pages)
> xp_init_xskb_dma(xskb, pool, pool->dma_pages, addr);

Both the _alloc_ cases read neighboring free_heads_cnt so your saving a load I guess?
This is so deep into micro-optimizing I'm curious if you could measure it?

> } else {
> xskb = &pool->heads[xp_aligned_extract_idx(pool, addr)];

I'm not actually against optimizing but maybe another idea. Why do we have to
check at all? Seems if the DMA has been disabled/unmapped the driver shouldn't
be trying to call xsk_buff_alloc_batch? Then you can just drop the 'if' check.

It feels to me the drivers shouldn't even be calling this after unmapping
the dma. WDYT?

2023-04-24 23:03:03

by Kal Cutter Conley

[permalink] [raw]
Subject: Re: [PATCH] xsk: Use pool->dma_pages to check for DMA

> > Compare pool->dma_pages instead of pool->dma_pages_cnt to check for an
> > active DMA mapping. pool->dma_pages needs to be read anyway to access
> > the map so this compiles to more efficient code.
>
> Was it noticable in some sort of performance test?

This patch is part of the patchset found at
https://lore.kernel.org/all/[email protected]/
which is being actively discussed and needs to be resubmitted anyway
because of a conflict. While the discussion continues, I am submitting
this patch by itself because I think it's an improvement on its own
(regardless of what happens with the rest of the linked patchset). On
one system, I measured a performance regression of 2-3% with xdpsock
and the linked changes without the current patch. With the current
patch, the performance regression was no longer observed.

> > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > index d318c769b445..a8d7b8a3688a 100644
> > --- a/include/net/xsk_buff_pool.h
> > +++ b/include/net/xsk_buff_pool.h
> > @@ -180,7 +180,7 @@ static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
> > if (likely(!cross_pg))
> > return false;
> >
> > - return pool->dma_pages_cnt &&
> > + return pool->dma_pages &&
> > !(pool->dma_pages[addr >> PAGE_SHIFT] & XSK_NEXT_PG_CONTIG_MASK);
> > }

I would consider the above code part of the "fast path". It may be
executed approximately once per frame in unaligned mode.

> This seems to be used in the setup/tear-down paths so your optimizing
> a control side. Is there a fast path with this code? I walked the
> ice driver. If its just setup code we should do whatever is more
> readable.

It is not only used in setup/tear-down paths (see above).
Additionally, I believe the code is also _more_ readable with this
patch applied. In particular, this patch reduces cognitive complexity
since people (and compilers) reading the code don't need to
additionally think about pool->dma_pages_cnt.

> Both the _alloc_ cases read neighboring free_heads_cnt so your saving a load I guess?
> This is so deep into micro-optimizing I'm curious if you could measure it?

It is saving a load which also reduces code size. This will affect
other decisions such as what to inline. Also in the linked patchset,
dma_pages and dma_pages_cnt do not share a cache line (on x86_64).

>
> > } else {
> > xskb = &pool->heads[xp_aligned_extract_idx(pool, addr)];
>
> I'm not actually against optimizing but maybe another idea. Why do we have to
> check at all? Seems if the DMA has been disabled/unmapped the driver shouldn't
> be trying to call xsk_buff_alloc_batch? Then you can just drop the 'if' check.
>
> It feels to me the drivers shouldn't even be calling this after unmapping
> the dma. WDYT?

Many of these code paths are used both for ZC and copy modes. You
might be right that this particular case is only used with DMA.

2023-04-25 10:01:13

by Maciej Fijalkowski

[permalink] [raw]
Subject: Re: [PATCH] xsk: Use pool->dma_pages to check for DMA

On Tue, Apr 25, 2023 at 12:52:53AM +0200, Kal Cutter Conley wrote:

Hey again!

> > > Compare pool->dma_pages instead of pool->dma_pages_cnt to check for an
> > > active DMA mapping. pool->dma_pages needs to be read anyway to access
> > > the map so this compiles to more efficient code.
> >
> > Was it noticable in some sort of performance test?
>
> This patch is part of the patchset found at
> https://lore.kernel.org/all/[email protected]/
> which is being actively discussed and needs to be resubmitted anyway
> because of a conflict. While the discussion continues, I am submitting
> this patch by itself because I think it's an improvement on its own
> (regardless of what happens with the rest of the linked patchset). On
> one system, I measured a performance regression of 2-3% with xdpsock
> and the linked changes without the current patch. With the current
> patch, the performance regression was no longer observed.

Okay, 2-3% but with what settings? rxdrop for unaligned mode? what chunk
size etc? We need this kind of info, "compiles to more efficient code"
from original commit message is too generic and speculative to me. 2-3% of
perf diff against specific xdpsock setup is real improvement and is a
strong argument for getting this patch as-is, by its own.

>
> > > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > > index d318c769b445..a8d7b8a3688a 100644
> > > --- a/include/net/xsk_buff_pool.h
> > > +++ b/include/net/xsk_buff_pool.h
> > > @@ -180,7 +180,7 @@ static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
> > > if (likely(!cross_pg))
> > > return false;
> > >
> > > - return pool->dma_pages_cnt &&
> > > + return pool->dma_pages &&
> > > !(pool->dma_pages[addr >> PAGE_SHIFT] & XSK_NEXT_PG_CONTIG_MASK);
> > > }
>
> I would consider the above code part of the "fast path". It may be
> executed approximately once per frame in unaligned mode.
>
> > This seems to be used in the setup/tear-down paths so your optimizing
> > a control side. Is there a fast path with this code? I walked the
> > ice driver. If its just setup code we should do whatever is more
> > readable.
>
> It is not only used in setup/tear-down paths (see above).
> Additionally, I believe the code is also _more_ readable with this
> patch applied. In particular, this patch reduces cognitive complexity
> since people (and compilers) reading the code don't need to
> additionally think about pool->dma_pages_cnt.

John was referring to xp_dma_unmap() with the comment above which indeed
is a teardown path, so probably this doesn't matter from performance
perspective and you could avoid this chunk from your patch.

>
> > Both the _alloc_ cases read neighboring free_heads_cnt so your saving a load I guess?
> > This is so deep into micro-optimizing I'm curious if you could measure it?
>
> It is saving a load which also reduces code size. This will affect
> other decisions such as what to inline. Also in the linked patchset,
> dma_pages and dma_pages_cnt do not share a cache line (on x86_64).

Yes I believe that the with your patch on unaligned mode by touching the
dma_pages you're warming the relevant cache line for your setup.

>
> >
> > > } else {
> > > xskb = &pool->heads[xp_aligned_extract_idx(pool, addr)];
> >
> > I'm not actually against optimizing but maybe another idea. Why do we have to
> > check at all? Seems if the DMA has been disabled/unmapped the driver shouldn't
> > be trying to call xsk_buff_alloc_batch? Then you can just drop the 'if' check.
> >
> > It feels to me the drivers shouldn't even be calling this after unmapping
> > the dma. WDYT?
>
> Many of these code paths are used both for ZC and copy modes. You
> might be right that this particular case is only used with DMA.

Map/unmap is when you do ZC, copy/skb mode work without these internal dma
mappings.

2023-04-25 10:48:10

by Kal Cutter Conley

[permalink] [raw]
Subject: Re: [PATCH] xsk: Use pool->dma_pages to check for DMA

> Okay, 2-3% but with what settings? rxdrop for unaligned mode? what chunk
> size etc? We need this kind of info, "compiles to more efficient code"
> from original commit message is too generic and speculative to me. 2-3% of
> perf diff against specific xdpsock setup is real improvement and is a
> strong argument for getting this patch as-is, by its own.

I don't have the exact numbers anymore. I measured a performance
difference (up to 2-3%) with different settings including rxdrop and
unaligned mode. The exact settings you have in the commit message from
the linked patch. I didn't go into details in the commit message
because I thought this change would be a slam dunk. I don't think
there is any reason to believe the code is slower with this patch. If
anything, it should generally be faster. At the very least it will
lead to more efficient code in terms of size since dma_pages_cnt is no
longer read. Also I think the code is more readable with this patch.

>
> >
> > > > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > > > index d318c769b445..a8d7b8a3688a 100644
> > > > --- a/include/net/xsk_buff_pool.h
> > > > +++ b/include/net/xsk_buff_pool.h
> > > > @@ -180,7 +180,7 @@ static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
> > > > if (likely(!cross_pg))
> > > > return false;
> > > >
> > > > - return pool->dma_pages_cnt &&
> > > > + return pool->dma_pages &&
> > > > !(pool->dma_pages[addr >> PAGE_SHIFT] & XSK_NEXT_PG_CONTIG_MASK);
> > > > }
> >
> > I would consider the above code part of the "fast path". It may be
> > executed approximately once per frame in unaligned mode.
> >
> > > This seems to be used in the setup/tear-down paths so your optimizing
> > > a control side. Is there a fast path with this code? I walked the
> > > ice driver. If its just setup code we should do whatever is more
> > > readable.
> >
> > It is not only used in setup/tear-down paths (see above).
> > Additionally, I believe the code is also _more_ readable with this
> > patch applied. In particular, this patch reduces cognitive complexity
> > since people (and compilers) reading the code don't need to
> > additionally think about pool->dma_pages_cnt.
>
> John was referring to xp_dma_unmap() with the comment above which indeed
> is a teardown path, so probably this doesn't matter from performance
> perspective and you could avoid this chunk from your patch.

The setup/tear-down lines were also changed to keep the code
consistent. It doesn't make sense to sometimes check dma_pages_cnt and
other times dma_pages.

>
> >
> > > Both the _alloc_ cases read neighboring free_heads_cnt so your saving a load I guess?
> > > This is so deep into micro-optimizing I'm curious if you could measure it?
> >
> > It is saving a load which also reduces code size. This will affect
> > other decisions such as what to inline. Also in the linked patchset,
> > dma_pages and dma_pages_cnt do not share a cache line (on x86_64).
>
> Yes I believe that the with your patch on unaligned mode by touching the
> dma_pages you're warming the relevant cache line for your setup.

dma_pages is touched anyway right after. That is the point of this
patch: since dma_pages already needs to be loaded into a register,
just check that instead of loading an additional field possibly from a
different cache line.

2023-04-25 20:30:25

by John Fastabend

[permalink] [raw]
Subject: Re: [PATCH] xsk: Use pool->dma_pages to check for DMA

Kal Cutter Conley wrote:
> > > Compare pool->dma_pages instead of pool->dma_pages_cnt to check for an
> > > active DMA mapping. pool->dma_pages needs to be read anyway to access
> > > the map so this compiles to more efficient code.
> >
> > Was it noticable in some sort of performance test?
>
> This patch is part of the patchset found at
> https://lore.kernel.org/all/[email protected]/
> which is being actively discussed and needs to be resubmitted anyway
> because of a conflict. While the discussion continues, I am submitting
> this patch by itself because I think it's an improvement on its own
> (regardless of what happens with the rest of the linked patchset). On
> one system, I measured a performance regression of 2-3% with xdpsock
> and the linked changes without the current patch. With the current
> patch, the performance regression was no longer observed.

Would be nice to have in commit message so reader has an idea the
perf numbers are in fact better.

>
> > > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > > index d318c769b445..a8d7b8a3688a 100644
> > > --- a/include/net/xsk_buff_pool.h
> > > +++ b/include/net/xsk_buff_pool.h
> > > @@ -180,7 +180,7 @@ static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
> > > if (likely(!cross_pg))
> > > return false;
> > >
> > > - return pool->dma_pages_cnt &&
> > > + return pool->dma_pages &&
> > > !(pool->dma_pages[addr >> PAGE_SHIFT] & XSK_NEXT_PG_CONTIG_MASK);
> > > }
>
> I would consider the above code part of the "fast path". It may be
> executed approximately once per frame in unaligned mode.

In the unlikely case though is my reading. So really shouldn't
be called for every packet or we have other perf issues by that
likely() there.

I assume the above is where the perf is being gained because below
two things are in setup/tear down. But then we are benchmarking
an unlikely() path?

>
> > This seems to be used in the setup/tear-down paths so your optimizing
> > a control side. Is there a fast path with this code? I walked the
> > ice driver. If its just setup code we should do whatever is more
> > readable.
>
> It is not only used in setup/tear-down paths (see above).
> Additionally, I believe the code is also _more_ readable with this
> patch applied. In particular, this patch reduces cognitive complexity
> since people (and compilers) reading the code don't need to
> additionally think about pool->dma_pages_cnt.
>
> > Both the _alloc_ cases read neighboring free_heads_cnt so your saving a load I guess?
> > This is so deep into micro-optimizing I'm curious if you could measure it?
>
> It is saving a load which also reduces code size. This will affect
> other decisions such as what to inline. Also in the linked patchset,
> dma_pages and dma_pages_cnt do not share a cache line (on x86_64).

But again buried in an unlikely path. Sure but removing the conditional
altogether would be even better.

>
> >
> > > } else {
> > > xskb = &pool->heads[xp_aligned_extract_idx(pool, addr)];
> >
> > I'm not actually against optimizing but maybe another idea. Why do we have to
> > check at all? Seems if the DMA has been disabled/unmapped the driver shouldn't
> > be trying to call xsk_buff_alloc_batch? Then you can just drop the 'if' check.
> >
> > It feels to me the drivers shouldn't even be calling this after unmapping
> > the dma. WDYT?
>
> Many of these code paths are used both for ZC and copy modes. You
> might be right that this particular case is only used with DMA.

So my understanding is ZC is preferred and default mode and copy modes
are primarily fall back modes. So we are punishing the good case here
for a fallback to copy mode. I think overall refactoring the code to
avoid burdoning the fast case with a fallback slow case would be ideal
solution.

However, I agree just on readability the patch is fine and good. No
objection on my side. But I think if we are making performance
arguments for 2-3% here the better thing to do is remove the check
and unlikely() and we would see better benchmarks when using the
ZC mode which as I understand it is what performance aware folks should
be doing.

Just $0.02 here.

2023-04-26 13:05:06

by Kal Cutter Conley

[permalink] [raw]
Subject: Re: [PATCH] xsk: Use pool->dma_pages to check for DMA

> > > Was it noticable in some sort of performance test?
> >
> > This patch is part of the patchset found at
> > https://lore.kernel.org/all/[email protected]/
> > which is being actively discussed and needs to be resubmitted anyway
> > because of a conflict. While the discussion continues, I am submitting
> > this patch by itself because I think it's an improvement on its own
> > (regardless of what happens with the rest of the linked patchset). On
> > one system, I measured a performance regression of 2-3% with xdpsock
> > and the linked changes without the current patch. With the current
> > patch, the performance regression was no longer observed.
>
> Would be nice to have in commit message so reader has an idea the
> perf numbers are in fact better.

When I measured this patch by itself (on bpf-next), I didn't measure
any statistically significant performance gains. However, it did allow
me to avoid a regression when combined with the other linked patch (as
mentioned). I don't know if it makes sense to mention that other
change which is not even applied to any tree. I was mainly submitting
this patch from the perspective of the code being better not
contingent on any provable performance gains.

>
> >
> > > > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > > > index d318c769b445..a8d7b8a3688a 100644
> > > > --- a/include/net/xsk_buff_pool.h
> > > > +++ b/include/net/xsk_buff_pool.h
> > > > @@ -180,7 +180,7 @@ static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
> > > > if (likely(!cross_pg))
> > > > return false;
> > > >
> > > > - return pool->dma_pages_cnt &&
> > > > + return pool->dma_pages &&
> > > > !(pool->dma_pages[addr >> PAGE_SHIFT] & XSK_NEXT_PG_CONTIG_MASK);
> > > > }
> >
> > I would consider the above code part of the "fast path". It may be
> > executed approximately once per frame in unaligned mode.
>
> In the unlikely case though is my reading. So really shouldn't
> be called for every packet or we have other perf issues by that
> likely() there.
>
> I assume the above is where the perf is being gained because below
> two things are in setup/tear down. But then we are benchmarking
> an unlikely() path?

I was testing with large chunk sizes in unaligned mode (4000-4096
bytes) with ZC. For chunk sizes nearly as large as PAGE_SIZE the
unlikely path is actually the main path.

> >
> > > This seems to be used in the setup/tear-down paths so your optimizing
> > > a control side. Is there a fast path with this code? I walked the
> > > ice driver. If its just setup code we should do whatever is more
> > > readable.
> >
> > It is not only used in setup/tear-down paths (see above).
> > Additionally, I believe the code is also _more_ readable with this
> > patch applied. In particular, this patch reduces cognitive complexity
> > since people (and compilers) reading the code don't need to
> > additionally think about pool->dma_pages_cnt.
> >
> > > Both the _alloc_ cases read neighboring free_heads_cnt so your saving a load I guess?
> > > This is so deep into micro-optimizing I'm curious if you could measure it?
> >
> > It is saving a load which also reduces code size. This will affect
> > other decisions such as what to inline. Also in the linked patchset,
> > dma_pages and dma_pages_cnt do not share a cache line (on x86_64).
>
> But again buried in an unlikely path. Sure but removing the conditional
> altogether would be even better.

Yeah, I think that is another improvement to consider.

> So my understanding is ZC is preferred and default mode and copy modes
> are primarily fall back modes. So we are punishing the good case here
> for a fallback to copy mode. I think overall refactoring the code to
> avoid burdoning the fast case with a fallback slow case would be ideal
> solution.

I agree that ZC is preferred and this patch is aimed at improving the
ZC path. The performance gain I observed was for ZC.

> However, I agree just on readability the patch is fine and good. No
> objection on my side. But I think if we are making performance
> arguments for 2-3% here the better thing to do is remove the check
> and unlikely() and we would see better benchmarks when using the
> ZC mode which as I understand it is what performance aware folks should
> be doing.

I totally agree that other better improvements exist but I don't think
they make this patch any less desirable. This change is only meant as
a small incremental improvement.