2023-11-15 12:57:24

by Peter Zijlstra

Subject: Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler

On Sat, Nov 11, 2023 at 08:16:39PM -0800, Jacob Pan wrote:

> +static __always_inline inline void handle_pending_pir(struct pi_desc *pid, struct pt_regs *regs)
> +{

__always_inline means that... (A)

> + int i, vec = FIRST_EXTERNAL_VECTOR;
> + u64 pir_copy[4];
> +
> + /*
> + * Make a copy of PIR which contains IRQ pending bits for vectors,
> + * then invoke IRQ handlers for each pending vector.
> + * If any new interrupts were posted while we are processing, will
> + * do again before allowing new notifications. The idea is to
> + * minimize the number of the expensive notifications if IRQs come
> + * in a high frequency burst.
> + */
> + for (i = 0; i < 4; i++)
> + pir_copy[i] = raw_atomic64_xchg((atomic64_t *)&pid->pir_l[i], 0);
> +
> + /*
> + * Ideally, we should start from the high order bits set in the PIR
> + * since each bit represents a vector. Higher order bit position means
> + * the vector has higher priority. But external vectors are allocated
> + * based on availability not priority.
> + *
> + * EOI is included in the IRQ handlers call to apic_ack_irq, which
> + * allows higher priority system interrupt to get in between.
> + */
> + for_each_set_bit_from(vec, (unsigned long *)&pir_copy[0], 256)
> + call_irq_handler(vec, regs);
> +
> +}
> +
> +/*
> + * Performance data shows that 3 is good enough to harvest 90+% of the benefit
> + * on high IRQ rate workload.
> + * Alternatively, could make this tunable, use 3 as default.
> + */
> +#define MAX_POSTED_MSI_COALESCING_LOOP 3
> +
> +/*
> + * For MSIs that are delivered as posted interrupts, the CPU notifications
> + * can be coalesced if the MSIs arrive in high frequency bursts.
> + */
> +DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
> +{
> + struct pt_regs *old_regs = set_irq_regs(regs);
> + struct pi_desc *pid;
> + int i = 0;
> +
> + pid = this_cpu_ptr(&posted_interrupt_desc);
> +
> + inc_irq_stat(posted_msi_notification_count);
> + irq_enter();
> +
> + while (i++ < MAX_POSTED_MSI_COALESCING_LOOP) {
> + handle_pending_pir(pid, regs);
> +
> + /*
> + * If there are new interrupts posted in PIR, do again. If
> + * nothing pending, no need to wait for more interrupts.
> + */
> + if (is_pir_pending(pid))

So this reads those same 4 words we xchg in handle_pending_pir(), right?

> + continue;
> + else
> + break;
> + }
> +
> + /*
> + * Clear outstanding notification bit to allow new IRQ notifications,
> + * do this last to maximize the window of interrupt coalescing.
> + */
> + pi_clear_on(pid);
> +
> + /*
> + * There could be a race of PI notification and the clearing of ON bit,
> + * process PIR bits one last time such that handling the new interrupts
> + * are not delayed until the next IRQ.
> + */
> + if (unlikely(is_pir_pending(pid)))
> + handle_pending_pir(pid, regs);

(A) ... we get _two_ copies of that thing in this function. Does that
make sense ?
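As an illustration of the point about __always_inline (a stand-alone sketch with made-up names, not the kernel code): the attribute forces the compiler to expand the helper at every call site, so a function that calls the helper twice ends up carrying two full copies of its body in the generated code. Compile with gcc -O2 -c and inspect the output to see the duplication.

#define __always_inline inline __attribute__((__always_inline__))

/* Stands in for handle_pending_pir(): a non-trivial body. */
static __always_inline void big_helper(unsigned long *words)
{
	int i;

	for (i = 0; i < 4; i++)
		words[i] = 0;
}

/* Stands in for the sysvec handler: both calls get fully inlined,
 * duplicating big_helper()'s body. */
void demo(unsigned long *words)
{
	big_helper(words);	/* first inlined copy */
	big_helper(words);	/* second inlined copy */
}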

> +
> + apic_eoi();
> + irq_exit();
> + set_irq_regs(old_regs);
> +}
> #endif /* X86_POSTED_MSI */

Would it not make more sense to write things something like:

bool handle_pending_pir()
{
	bool handled = false;
	u64 pir_copy[4];

	for (i = 0; i < 4; i++) {
		if (!pid->pir_l[i]) {
			pir_copy[i] = 0;
			continue;
		}

		pir_copy[i] = arch_xchg(&pid->pir_l[i], 0);
		handled |= true;
	}

	if (!handled)
		return handled;

	for_each_set_bit()
		....

	return handled;
}

sysvec_posted_blah_blah()
{
	bool done = false;
	bool handled;
	int loops = 0;

	for (;;) {
		handled = handle_pending_pir();
		if (done)
			break;
		if (!handled || ++loops > MAX_LOOPS) {
			pi_clear_on(pid);
			/* once more after clear_on */
			done = true;
		}
	}
}


Hmm?


2023-11-15 20:01:54

by Jacob Pan

Subject: Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler

Hi Peter,

On Wed, 15 Nov 2023 13:56:24 +0100, Peter Zijlstra <[email protected]>
wrote:

> On Sat, Nov 11, 2023 at 08:16:39PM -0800, Jacob Pan wrote:
>
> > +static __always_inline inline void handle_pending_pir(struct pi_desc
> > *pid, struct pt_regs *regs) +{
>
> __always_inline means that... (A)
>
> > + int i, vec = FIRST_EXTERNAL_VECTOR;
> > + u64 pir_copy[4];
> > +
> > + /*
> > + * Make a copy of PIR which contains IRQ pending bits for
> > vectors,
> > + * then invoke IRQ handlers for each pending vector.
> > + * If any new interrupts were posted while we are processing,
> > will
> > + * do again before allowing new notifications. The idea is to
> > + * minimize the number of the expensive notifications if IRQs
> > come
> > + * in a high frequency burst.
> > + */
> > + for (i = 0; i < 4; i++)
> > + pir_copy[i] = raw_atomic64_xchg((atomic64_t
> > *)&pid->pir_l[i], 0); +
> > + /*
> > + * Ideally, we should start from the high order bits set in
> > the PIR
> > + * since each bit represents a vector. Higher order bit
> > position means
> > + * the vector has higher priority. But external vectors are
> > allocated
> > + * based on availability not priority.
> > + *
> > + * EOI is included in the IRQ handlers call to apic_ack_irq,
> > which
> > + * allows higher priority system interrupt to get in between.
> > + */
> > + for_each_set_bit_from(vec, (unsigned long *)&pir_copy[0], 256)
> > + call_irq_handler(vec, regs);
> > +
> > +}
> > +
> > +/*
> > + * Performance data shows that 3 is good enough to harvest 90+% of the
> > benefit
> > + * on high IRQ rate workload.
> > + * Alternatively, could make this tunable, use 3 as default.
> > + */
> > +#define MAX_POSTED_MSI_COALESCING_LOOP 3
> > +
> > +/*
> > + * For MSIs that are delivered as posted interrupts, the CPU
> > notifications
> > + * can be coalesced if the MSIs arrive in high frequency bursts.
> > + */
> > +DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
> > +{
> > + struct pt_regs *old_regs = set_irq_regs(regs);
> > + struct pi_desc *pid;
> > + int i = 0;
> > +
> > + pid = this_cpu_ptr(&posted_interrupt_desc);
> > +
> > + inc_irq_stat(posted_msi_notification_count);
> > + irq_enter();
> > +
> > + while (i++ < MAX_POSTED_MSI_COALESCING_LOOP) {
> > + handle_pending_pir(pid, regs);
> > +
> > + /*
> > + * If there are new interrupts posted in PIR, do
> > again. If
> > + * nothing pending, no need to wait for more
> > interrupts.
> > + */
> > + if (is_pir_pending(pid))
>
> So this reads those same 4 words we xchg in handle_pending_pir(), right?
>
> > + continue;
> > + else
> > + break;
> > + }
> > +
> > + /*
> > + * Clear outstanding notification bit to allow new IRQ
> > notifications,
> > + * do this last to maximize the window of interrupt coalescing.
> > + */
> > + pi_clear_on(pid);
> > +
> > + /*
> > + * There could be a race of PI notification and the clearing
> > of ON bit,
> > + * process PIR bits one last time such that handling the new
> > interrupts
> > + * are not delayed until the next IRQ.
> > + */
> > + if (unlikely(is_pir_pending(pid)))
> > + handle_pending_pir(pid, regs);
>
> (A) ... we get _two_ copies of that thing in this function. Does that
> make sense ?
>
> > +
> > + apic_eoi();
> > + irq_exit();
> > + set_irq_regs(old_regs);
> > +}
> > #endif /* X86_POSTED_MSI */
>
> Would it not make more sense to write things something like:
>
It is a great idea: we can save the expensive xchg when pir[i] is 0. But I had
to tweak it a little to make it perform better.

> bool handle_pending_pir()
> {
> bool handled = false;
> u64 pir_copy[4];
>
> for (i = 0; i < 4; i++) {
> if (!pid-pir_l[i]) {
> pir_copy[i] = 0;
> continue;
> }
>
> pir_copy[i] = arch_xchg(&pir->pir_l[i], 0);
Here we are interleaving cache-line reads and xchg, so I changed it to:

	for (i = 0; i < 4; i++) {
		pir_copy[i] = pid->pir_l[i];
	}

	for (i = 0; i < 4; i++) {
		if (pir_copy[i]) {
			pir_copy[i] = arch_xchg(&pid->pir_l[i], 0);
			handled = true;
		}
	}

With the DSA MEMFILL test (just one queue, one MSI), we save 3 xchg operations per loop.
Here is the performance comparison in IRQ rate:

Original RFC:               9.29 M/sec
Optimized as in your email: 8.82 M/sec
Tweaked as above:           9.54 M/sec

I still need to test with more MSI vectors spread out across all 4 u64 words. I
suspect the benefit will decrease, since we then need to do both the read and the
xchg for the non-zero entries.

> handled |= true;
> }


>
> if (!handled)
> return handled;
>
> for_each_set_bit()
> ....
>
> return handled.
> }
>
> sysvec_posted_blah_blah()
> {
> bool done = false;
> bool handled;
>
> for (;;) {
> handled = handle_pending_pir();
> if (done)
> break;
> if (!handled || ++loops > MAX_LOOPS) {
> pi_clear_on(pid);
> /* once more after clear_on */
> done = true;
> }
> }
> }
>
>
> Hmm?




Thanks,

Jacob

2023-11-15 20:26:40

by Peter Zijlstra

Subject: Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler

On Wed, Nov 15, 2023 at 12:04:01PM -0800, Jacob Pan wrote:

> we are interleaving cacheline read and xchg. So made it to

Hmm, I wasn't expecting that to be a problem, but sure.

> for (i = 0; i < 4; i++) {
> pir_copy[i] = pid->pir_l[i];
> }
>
> for (i = 0; i < 4; i++) {
> if (pir_copy[i]) {
> pir_copy[i] = arch_xchg(&pid->pir_l[i], 0);
> handled = true;
> }
> }
>
> With DSA MEMFILL test just one queue one MSI, we are saving 3 xchg per loop.
> Here is the performance comparison in IRQ rate:
>
> Original RFC 9.29 m/sec,
> Optimized in your email 8.82m/sec,
> Tweaked above: 9.54m/s
>
> I need to test with more MSI vectors spreading out to all 4 u64. I suspect
> the benefit will decrease since we need to do both read and xchg for
> non-zero entries.

Ah, but performance was not the reason I suggested this. Code
compactness and clarity was.

Possibly using less xchg is just a bonus :-)

2023-12-06 19:50:56

by Thomas Gleixner

Subject: Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler

On Wed, Nov 15 2023 at 13:56, Peter Zijlstra wrote:
>
> Would it not make more sense to write things something like:
>
> bool handle_pending_pir()
> {
> bool handled = false;
> u64 pir_copy[4];
>
> for (i = 0; i < 4; i++) {
> if (!pid-pir_l[i]) {
> pir_copy[i] = 0;
> continue;
> }
>
> pir_copy[i] = arch_xchg(&pir->pir_l[i], 0);
> handled |= true;
> }
>
> if (!handled)
> return handled;
>
> for_each_set_bit()
> ....
>
> return handled.
> }

I don't understand what the whole copy business is about. It's
absolutely not required.

static bool handle_pending_pir(unsigned long *pir)
{
	unsigned int idx, vec;
	bool handled = false;
	unsigned long pend;

	for (idx = 0; idx < 4; idx++) {
		if (!pir[idx])
			continue;
		pend = arch_xchg(pir + idx, 0);
		for_each_set_bit(vec, &pend, 64)
			call_irq_handler(vec + idx * 64, NULL);
		handled = true;
	}
	return handled;
}

No?

> sysvec_posted_blah_blah()
> {
> bool done = false;
> bool handled;
>
> for (;;) {
> handled = handle_pending_pir();
> if (done)
> break;
> if (!handled || ++loops > MAX_LOOPS) {

That does one loop too many. Should be ++loops == MAX_LOOPS. No?

> pi_clear_on(pid);
> /* once more after clear_on */
> done = true;
> }
> }
> }
>
>
> Hmm?

I think that can be done less convoluted.

{
	struct pi_desc *pid = this_cpu_ptr(&posted_interrupt_desc);
	struct pt_regs *old_regs = set_irq_regs(regs);
	int loops;

	for (loops = 0;;) {
		bool handled = handle_pending_pir((unsigned long *)pid->pir);

		if (++loops > MAX_LOOPS)
			break;

		if (!handled || loops == MAX_LOOPS) {
			pi_clear_on(pid);
			/* Break the loop after handle_pending_pir()! */
			loops = MAX_LOOPS;
		}
	}

	...
	set_irq_regs(old_regs);
}

Hmm? :)
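To sanity-check the iteration count of the loop above, here is a small stand-alone sketch (purely illustrative, not part of the patch: MAX_LOOPS hard-coded to 3, handle_pending_pir() replaced by a counter, and "handled" assumed to stay true). It shows the handler body running MAX_LOOPS times in the coalescing phase plus exactly one final pass after the point where pi_clear_on() would be called:

#include <stdbool.h>
#include <stdio.h>

#define MAX_LOOPS 3

int main(void)
{
	int loops, calls = 0;

	for (loops = 0;;) {
		bool handled = true;	/* assume interrupts keep arriving */

		calls++;		/* stands in for handle_pending_pir() */

		if (++loops > MAX_LOOPS)
			break;

		if (!handled || loops == MAX_LOOPS) {
			/* pi_clear_on(pid) would go here */
			loops = MAX_LOOPS;	/* force exactly one more pass */
		}
	}

	printf("handle_pending_pir() would run %d times\n", calls);	/* prints 4 */
	return 0;
}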

2023-12-08 04:41:29

by Jacob Pan

Subject: Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler

Hi Thomas,

On Wed, 06 Dec 2023 20:50:24 +0100, Thomas Gleixner <[email protected]>
wrote:

> On Wed, Nov 15 2023 at 13:56, Peter Zijlstra wrote:
> >
> > Would it not make more sense to write things something like:
> >
> > bool handle_pending_pir()
> > {
> > bool handled = false;
> > u64 pir_copy[4];
> >
> > for (i = 0; i < 4; i++) {
> > if (!pid-pir_l[i]) {
> > pir_copy[i] = 0;
> > continue;
> > }
> >
> > pir_copy[i] = arch_xchg(&pir->pir_l[i], 0);
> > handled |= true;
> > }
> >
> > if (!handled)
> > return handled;
> >
> > for_each_set_bit()
> > ....
> >
> > return handled.
> > }
>
> I don't understand what the whole copy business is about. It's
> absolutely not required.
>
> static bool handle_pending_pir(unsigned long *pir)
> {
> unsigned int idx, vec;
> bool handled = false;
> unsigned long pend;
>
> for (idx = 0; offs < 4; idx++) {
> if (!pir[idx])
> continue;
> pend = arch_xchg(pir + idx, 0);
> for_each_set_bit(vec, &pend, 64)
> call_irq_handler(vec + idx * 64, NULL);
> handled = true;
> }
> return handled;
> }
>
My thinking is the following:
The PIR cache line is contended between the CPU and the IOMMU, and the CPU can
access the PIR much faster. However, when the IOMMU does an atomic swap of the
PID (PIR included), the CPU's L1 cache line gets evicted. Subsequent CPU reads or
xchg operations then have to deal with an invalidated, cold cache line.

By making a copy of the PIR as quickly as possible and then clearing the PIR with
xchg, we minimize the chance that the IOMMU does an atomic swap in the middle,
and therefore take fewer L1D misses.

The code above does the read, the xchg, and call_irq_handler() in one loop,
handling one 64-bit PIR word at a time. The IOMMU then has a greater chance to do
its atomic xchg on the PIR cache line while we are inside call_irq_handler(),
which causes more L1D misses.

I might be missing something?

I tested the two versions below with my DSA memory fill test and measured
DMA bandwidth and perf cache misses:

#ifdef NO_PIR_COPY
static __always_inline inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
{
	int i, vec;
	bool handled = false;
	unsigned long pending;

	for (i = 0; i < 4; i++) {
		if (!pir[i])
			continue;

		pending = arch_xchg(pir + i, 0);
		for_each_set_bit(vec, &pending, 64)
			call_irq_handler(i * 64 + vec, regs);
		handled = true;
	}

	return handled;
}
#else
static __always_inline inline bool handle_pending_pir(u64 *pir, struct pt_regs *regs)
{
	int i, vec = FIRST_EXTERNAL_VECTOR;
	bool handled = false;
	unsigned long pir_copy[4];

	for (i = 0; i < 4; i++)
		pir_copy[i] = pir[i];

	for (i = 0; i < 4; i++) {
		if (!pir_copy[i])
			continue;

		pir_copy[i] = arch_xchg(&pir[i], 0);
		handled = true;
	}

	if (handled) {
		for_each_set_bit_from(vec, pir_copy, FIRST_SYSTEM_VECTOR)
			call_irq_handler(vec, regs);
	}

	return handled;
}
#endif

DEFINE_IDTENTRY_SYSVEC(sysvec_posted_msi_notification)
{
	struct pt_regs *old_regs = set_irq_regs(regs);
	struct pi_desc *pid;
	int i = 0;

	pid = this_cpu_ptr(&posted_interrupt_desc);

	inc_irq_stat(posted_msi_notification_count);
	irq_enter();

	while (i++ < MAX_POSTED_MSI_COALESCING_LOOP) {
		if (!handle_pending_pir(pid->pir64, regs))
			break;
	}

	/*
	 * Clear outstanding notification bit to allow new IRQ notifications,
	 * do this last to maximize the window of interrupt coalescing.
	 */
	pi_clear_on(pid);

	/*
	 * There could be a race of PI notification and the clearing of ON bit,
	 * process PIR bits one last time such that handling the new interrupts
	 * are not delayed until the next IRQ.
	 */
	handle_pending_pir(pid->pir64, regs);

	apic_eoi();
	irq_exit();
	set_irq_regs(old_regs);
}

Without PIR copy:

DMA memfill bandwidth: 4.944 Gbps
Performance counter stats for './run_intr.sh 512 30':

77,313,298,506 L1-dcache-loads (79.98%)
8,279,458 L1-dcache-load-misses # 0.01% of all L1-dcache accesses (80.03%)
41,654,221,245 L1-dcache-stores (80.01%)
10,476 LLC-load-misses # 0.31% of all LL-cache accesses (79.99%)
3,332,748 LLC-loads (80.00%)

30.212055434 seconds time elapsed

0.002149000 seconds user
30.183292000 seconds sys


With PIR copy:
DMA memfill bandwidth: 5.029 Gbps
Performance counter stats for './run_intr.sh 512 30':

78,327,247,423 L1-dcache-loads (80.01%)
7,762,311 L1-dcache-load-misses # 0.01% of all L1-dcache accesses (80.01%)
42,203,221,466 L1-dcache-stores (79.99%)
23,691 LLC-load-misses # 0.67% of all LL-cache accesses (80.01%)
3,561,890 LLC-loads (80.00%)

30.201065706 seconds time elapsed

0.005950000 seconds user
30.167885000 seconds sys


> No?
>
> > sysvec_posted_blah_blah()
> > {
> > bool done = false;
> > bool handled;
> >
> > for (;;) {
> > handled = handle_pending_pir();
> > if (done)
> > break;
> > if (!handled || ++loops > MAX_LOOPS) {
>
> That does one loop too many. Should be ++loops == MAX_LOOPS. No?
>
> > pi_clear_on(pid);
> > /* once more after clear_on */
> > done = true;
> > }
> > }
> > }
> >
> >
> > Hmm?
>
> I think that can be done less convoluted.
>
> {
> struct pi_desc *pid = this_cpu_ptr(&posted_interrupt_desc);
> struct pt_regs *old_regs = set_irq_regs(regs);
> int loops;
>
> for (loops = 0;;) {
> bool handled = handle_pending_pir((unsigned
> long)pid->pir);
>
> if (++loops > MAX_LOOPS)
> break;
>
> if (!handled || loops == MAX_LOOPS) {
> pi_clear_on(pid);
> /* Break the loop after handle_pending_pir()! */
> loops = MAX_LOOPS;
> }
> }
>
> ...
> set_irq_regs(old_regs);
> }
>
> Hmm? :)


Thanks,

Jacob

2023-12-08 11:53:08

by Thomas Gleixner

Subject: Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler

On Thu, Dec 07 2023 at 20:46, Jacob Pan wrote:
> On Wed, 06 Dec 2023 20:50:24 +0100, Thomas Gleixner <[email protected]>
> wrote:
>> I don't understand what the whole copy business is about. It's
>> absolutely not required.
>
> My thinking is the following:
> The PIR cache line is contended by between CPU and IOMMU, where CPU can
> access PIR much faster. Nevertheless, when IOMMU does atomic swap of the
> PID (PIR included), L1 cache gets evicted. Subsequent CPU read or xchg will
> deal with invalid cold cache.
>
> By making a copy of PIR as quickly as possible and clearing PIR with xchg,
> we minimized the chance that IOMMU does atomic swap in the middle.
> Therefore, having less L1D misses.
>
> In the code above, it does read, xchg, and call_irq_handler() in a loop
> to handle the 4 64bit PIR bits at a time. IOMMU has a greater chance to do
> atomic xchg on the PIR cache line while doing call_irq_handler(). Therefore,
> it causes more L1D misses.

That makes sense and if we go there it wants to be documented.

> Without PIR copy:
>
> DMA memfill bandwidth: 4.944 Gbps
> Performance counter stats for './run_intr.sh 512 30':
>
> 77,313,298,506 L1-dcache-loads (79.98%)
> 8,279,458 L1-dcache-load-misses # 0.01% of all L1-dcache accesses (80.03%)
> 41,654,221,245 L1-dcache-stores (80.01%)
> 10,476 LLC-load-misses # 0.31% of all LL-cache accesses (79.99%)
> 3,332,748 LLC-loads (80.00%)
>
> 30.212055434 seconds time elapsed
>
> 0.002149000 seconds user
> 30.183292000 seconds sys
>
>
> With PIR copy:
> DMA memfill bandwidth: 5.029 Gbps
> Performance counter stats for './run_intr.sh 512 30':
>
> 78,327,247,423 L1-dcache-loads (80.01%)
> 7,762,311 L1-dcache-load-misses # 0.01% of all L1-dcache accesses (80.01%)
> 42,203,221,466 L1-dcache-stores (79.99%)
> 23,691 LLC-load-misses # 0.67% of all LL-cache accesses (80.01%)
> 3,561,890 LLC-loads (80.00%)
>
> 30.201065706 seconds time elapsed
>
> 0.005950000 seconds user
> 30.167885000 seconds sys

Interesting, though I'm not really convinced that this DMA memfill
microbenchmark resembles real work loads.

Did you test with something realistic, e.g. storage or networking, too?

Thanks,

tglx

2023-12-08 19:58:07

by Jacob Pan

Subject: Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler

Hi Thomas,

On Fri, 08 Dec 2023 12:52:49 +0100, Thomas Gleixner <[email protected]>
wrote:

> On Thu, Dec 07 2023 at 20:46, Jacob Pan wrote:
> > On Wed, 06 Dec 2023 20:50:24 +0100, Thomas Gleixner <[email protected]>
> > wrote:
> >> I don't understand what the whole copy business is about. It's
> >> absolutely not required.
> >
> > My thinking is the following:
> > The PIR cache line is contended by between CPU and IOMMU, where CPU can
> > access PIR much faster. Nevertheless, when IOMMU does atomic swap of the
> > PID (PIR included), L1 cache gets evicted. Subsequent CPU read or xchg
> > will deal with invalid cold cache.
> >
> > By making a copy of PIR as quickly as possible and clearing PIR with
> > xchg, we minimized the chance that IOMMU does atomic swap in the middle.
> > Therefore, having less L1D misses.
> >
> > In the code above, it does read, xchg, and call_irq_handler() in a loop
> > to handle the 4 64bit PIR bits at a time. IOMMU has a greater chance to
> > do atomic xchg on the PIR cache line while doing call_irq_handler().
> > Therefore, it causes more L1D misses.
>
> That makes sense and if we go there it wants to be documented.
Will do. How about this explanation:
"
The posted interrupt descriptor (PID) fits in a cache line that is frequently
accessed by both the CPU and the IOMMU.

During posted MSI processing, the CPU needs to do 64-bit reads and xchgs to check
and clear the posted interrupt requests (PIR), a 256-bit field within the PID. On
the other side, the IOMMU does atomic swaps of the entire PID cache line when
posting interrupts. The CPU can access the cache line much faster than the IOMMU.

The cache line states after each operation are as follows:

 CPU            IOMMU              PID cache line state
 -------------------------------------------------------
 read64                            exclusive
 lock xchg64                       modified
                post/atomic swap   invalid
 -------------------------------------------------------

Note that the PID cache line is evicted after each IOMMU interrupt posting.

The posted MSI demuxing loop is written to optimize cache performance based on
two considerations around the PID cache line:

1. Reduce L1 data cache misses by avoiding contention with the IOMMU's interrupt
posting/atomic swap: a copy of the PIR is used to dispatch the interrupt handlers.

2. Keep the cache line state as consistent as possible, e.g. when making a copy
and clearing the PIR (assuming non-zero PIR bits are present in the entire PIR),
do:
	read, read, read, read, xchg, xchg, xchg, xchg
instead of:
	read, xchg, read, xchg, read, xchg, read, xchg
"

>
> > Without PIR copy:
> >
> > DMA memfill bandwidth: 4.944 Gbps
> > Performance counter stats for './run_intr.sh 512 30':
> >
> > 77,313,298,506 L1-dcache-loads (79.98%)
> > 8,279,458 L1-dcache-load-misses # 0.01% of all L1-dcache accesses (80.03%)
> > 41,654,221,245 L1-dcache-stores (80.01%)
> > 10,476 LLC-load-misses # 0.31% of all LL-cache accesses (79.99%)
> > 3,332,748 LLC-loads (80.00%)
> >
> > 30.212055434 seconds time elapsed
> >
> > 0.002149000 seconds user
> > 30.183292000 seconds sys
> >
> >
> > With PIR copy:
> > DMA memfill bandwidth: 5.029 Gbps
> > Performance counter stats for './run_intr.sh 512 30':
> >
> > 78,327,247,423 L1-dcache-loads (80.01%)
> > 7,762,311 L1-dcache-load-misses # 0.01% of all L1-dcache accesses (80.01%)
> > 42,203,221,466 L1-dcache-stores (79.99%)
> > 23,691 LLC-load-misses # 0.67% of all LL-cache accesses (80.01%)
> > 3,561,890 LLC-loads (80.00%)
> >
> > 30.201065706 seconds time elapsed
> >
> > 0.005950000 seconds user
> > 30.167885000 seconds sys
>
> Interesting, though I'm not really convinced that this DMA memfill
> microbenchmark resembles real work loads.
>
It is just a tool to get some quick experiments done, not a realistic workload.
Though I am adding various knobs to make it more useful, e.g. an adjustable
interrupt rate and delays in the idxd hardirq handler.

> Did you test with something realistic, e.g. storage or networking, too?
>
Not yet for this particular code; I am working on testing with FIO on Samsung
Gen5 NVMe disks, with help from the people who have the setup.


Thanks,

Jacob

2024-01-27 05:42:02

by Jacob Pan

Subject: Re: [PATCH RFC 09/13] x86/irq: Install posted MSI notification handler

Hi Thomas,

On Fri, 08 Dec 2023 12:52:49 +0100, Thomas Gleixner <[email protected]>
wrote:

> > Without PIR copy:
> >
> > DMA memfill bandwidth: 4.944 Gbps
> > Performance counter stats for './run_intr.sh 512 30':
> >
> > 77,313,298,506 L1-dcache-loads (79.98%)
> > 8,279,458 L1-dcache-load-misses # 0.01% of all L1-dcache accesses (80.03%)
> > 41,654,221,245 L1-dcache-stores (80.01%)
> > 10,476 LLC-load-misses # 0.31% of all LL-cache accesses (79.99%)
> > 3,332,748 LLC-loads (80.00%)
> >
> > 30.212055434 seconds time elapsed
> >
> > 0.002149000 seconds user
> > 30.183292000 seconds sys
> >
> >
> > With PIR copy:
> > DMA memfill bandwidth: 5.029 Gbps
> > Performance counter stats for './run_intr.sh 512 30':
> >
> > 78,327,247,423 L1-dcache-loads (80.01%)
> > 7,762,311 L1-dcache-load-misses # 0.01% of all L1-dcache accesses (80.01%)
> > 42,203,221,466 L1-dcache-stores (79.99%)
> > 23,691 LLC-load-misses # 0.67% of all LL-cache accesses (80.01%)
> > 3,561,890 LLC-loads (80.00%)
> >
> > 30.201065706 seconds time elapsed
> >
> > 0.005950000 seconds user
> > 30.167885000 seconds sys
>
> Interesting, though I'm not really convinced that this DMA memfill
> microbenchmark resembles real work loads.
>
> Did you test with something realistic, e.g. storage or networking, too?
I have done the following FIO test on NVMe drives and am not seeing any
meaningful difference in IOPS between the two implementations.

Here are my setup and results for 4 NVMe drives connected to an x16 PCIe slot:

+-[0000:62]-
| +-01.0-[63]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM174X
| +-03.0-[64]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM174X
| +-05.0-[65]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM174X
| \-07.0-[66]----00.0 Samsung Electronics Co Ltd NVMe SSD Controller PM174X


libaio, no PIR_COPY
======================================
fio-3.35
Starting 512 processes
Jobs: 512 (f=512): [r(512)][100.0%][r=32.2GiB/s][r=8445k IOPS][eta 00m:00s]
disk_nvme6n1_thread_1: (groupid=0, jobs=512): err= 0: pid=31559: Mon Jan 8 21:49:22 2024
read: IOPS=8419k, BW=32.1GiB/s (34.5GB/s)(964GiB/30006msec)
slat (nsec): min=1325, max=115807k, avg=42368.34, stdev=1517031.57
clat (usec): min=2, max=499085, avg=15139.97, stdev=25682.25
lat (usec): min=68, max=499089, avg=15182.33, stdev=25709.81
clat percentiles (usec):
| 1.00th=[ 734], 5.00th=[ 783], 10.00th=[ 816], 20.00th=[ 857],
| 30.00th=[ 906], 40.00th=[ 971], 50.00th=[ 1074], 60.00th=[ 1369],
| 70.00th=[ 13042], 80.00th=[ 19792], 90.00th=[ 76022], 95.00th=[ 76022],
| 99.00th=[ 77071], 99.50th=[ 81265], 99.90th=[ 85459], 99.95th=[ 91751],
| 99.99th=[200279]
bw ( MiB/s): min=18109, max=51859, per=100.00%, avg=32965.98, stdev=16.88, samples=14839
iops : min=4633413, max=13281470, avg=8439278.47, stdev=4324.70, samples=14839
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
lat (usec) : 250=0.01%, 500=0.01%, 750=1.84%, 1000=41.96%
lat (msec) : 2=18.37%, 4=0.20%, 10=3.88%, 20=13.95%, 50=5.42%
lat (msec) : 100=14.33%, 250=0.02%, 500=0.01%
cpu : usr=1.16%, sys=3.54%, ctx=4932752, majf=0, minf=192764
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=252616589,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
READ: bw=32.1GiB/s (34.5GB/s), 32.1GiB/s-32.1GiB/s (34.5GB/s-34.5GB/s), io=964GiB (1035GB), run=30006-30006msec

Disk stats (read/write):
nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=96.31%
nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=97.15%
nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=98.06%
nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=98.94%

Performance counter stats for 'system wide':

22,985,903,515 L1-dcache-load-misses (42.86%)
22,989,992,126 L1-dcache-load-misses (57.14%)
751,228,710,993 L1-dcache-stores (57.14%)
465,033,820 LLC-load-misses # 18.27% of all LL-cache accesses (57.15%)
2,545,570,669 LLC-loads (57.14%)
1,058,582,881 LLC-stores (28.57%)
326,135,823 LLC-store-misses (28.57%)

32.045718194 seconds time elapsed
-------------------------------------------
libaio with PIR_COPY
-------------------------------------------
fio-3.35
Starting 512 processes
Jobs: 512 (f=512): [r(512)][100.0%][r=32.2GiB/s][r=8445k IOPS][eta 00m:00s]
disk_nvme6n1_thread_1: (groupid=0, jobs=512): err= 0: pid=5103: Mon Jan 8 23:12:12 2024
read: IOPS=8420k, BW=32.1GiB/s (34.5GB/s)(964GiB/30011msec)
slat (nsec): min=1339, max=97021k, avg=42447.84, stdev=1442726.09
clat (usec): min=2, max=369410, avg=14820.01, stdev=24112.59
lat (usec): min=69, max=369412, avg=14862.46, stdev=24139.33
clat percentiles (usec):
| 1.00th=[ 717], 5.00th=[ 783], 10.00th=[ 824], 20.00th=[ 873],
| 30.00th=[ 930], 40.00th=[ 1012], 50.00th=[ 1172], 60.00th=[ 8094],
| 70.00th=[ 14222], 80.00th=[ 18744], 90.00th=[ 76022], 95.00th=[ 76022],
| 99.00th=[ 76022], 99.50th=[ 78119], 99.90th=[ 81265], 99.95th=[ 81265],
| 99.99th=[135267]
bw ( MiB/s): min=19552, max=62819, per=100.00%, avg=33774.56, stdev=31.02, samples=14540
iops : min=5005807, max=16089892, avg=8646500.17, stdev=7944.42, samples=14540
lat (usec) : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
lat (usec) : 250=0.01%, 500=0.01%, 750=2.50%, 1000=36.41%
lat (msec) : 2=17.39%, 4=0.27%, 10=5.83%, 20=18.94%, 50=5.59%
lat (msec) : 100=13.06%, 250=0.01%, 500=0.01%
cpu : usr=1.20%, sys=3.74%, ctx=6758326, majf=0, minf=193128
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=252677827,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=256

Run status group 0 (all jobs):
READ: bw=32.1GiB/s (34.5GB/s), 32.1GiB/s-32.1GiB/s (34.5GB/s-34.5GB/s), io=964GiB (1035GB), run=30011-30011msec

Disk stats (read/write):
nvme6n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=96.36%
nvme5n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=97.18%
nvme4n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=98.08%
nvme3n1: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=98.96%

Performance counter stats for 'system wide':

24,762,800,042 L1-dcache-load-misses (42.86%)
24,764,415,765 L1-dcache-load-misses (57.14%)
756,096,467,595 L1-dcache-stores (57.14%)
483,611,270 LLC-load-misses # 16.21% of all LL-cache accesses (57.14%)
2,982,610,898 LLC-loads (57.14%)
1,283,077,818 LLC-stores (28.57%)
313,253,711 LLC-store-misses (28.57%)

32.059810215 seconds time elapsed



Thanks,

Jacob