2006-05-04 21:29:58

by Roland Dreier

[permalink] [raw]
Subject: Re: [openib-general] [PATCH 07/16] ehca: interrupt handling routines

> +void ehca_queue_comp_task(struct ehca_comp_pool *pool, struct ehca_cq *__cq)
> +{
> +	int cpu;
> +	int cpu_id;
> +	struct ehca_cpu_comp_task *cct;
> +	unsigned long flags_cct;
> +	unsigned long flags_cq;
> +
> +	cpu = get_cpu();
> +	cpu_id = find_next_online_cpu(pool);
> +
> +	EDEB_EN(7, "pool=%p cq=%p cq_nr=%x CPU=%x:%x:%x:%x",
> +		pool, __cq, __cq->cq_number,
> +		cpu, cpu_id, num_online_cpus(), num_possible_cpus());
> +
> +	BUG_ON(!cpu_online(cpu_id));
> +
> +	cct = per_cpu_ptr(pool->cpu_comp_tasks, cpu_id);
> +
> +	spin_lock_irqsave(&cct->task_lock, flags_cct);
> +	spin_lock_irqsave(&__cq->task_lock, flags_cq);
> +
> +	if (__cq->nr_callbacks == 0) {
> +		__cq->nr_callbacks++;
> +		list_add_tail(&__cq->entry, &cct->cq_list);
> +		wake_up(&cct->wait_queue);
> +	}
> +	else
> +		__cq->nr_callbacks++;
> +
> +	spin_unlock_irqrestore(&__cq->task_lock, flags_cq);
> +	spin_unlock_irqrestore(&cct->task_lock, flags_cct);
> +
> +	put_cpu();
> +
> +	EDEB_EX(7, "cct=%p", cct);
> +
> +	return;
> +}

I never read the ehca completion event handling code very carefully
until now. But I was motivated by Shirley's work on IPoIB to take a
closer look.

It seems that you are deferring completion event dispatch into threads
spread across all the CPUs. This seems like a very strange thing to
me -- you are adding latency and possibly causing cacheline pingpong.

It may help throughput in some cases to spread the work across
multiple CPUs but it seems strange to me to do this in the low-level
driver. My intuition would be that it would be better to do this in
the higher levels, and leave open the possibility for protocols that
want the lowest possible latency to be called directly from the
interrupt handler.

What was the thinking that led to this design?

- R.


2006-05-05 13:05:29

by Heiko J Schick

[permalink] [raw]
Subject: Re: [openib-general] [PATCH 07/16] ehca: interrupt handling routines

Hello Roland,

Roland Dreier wrote:
> It seems that you are deferring completion event dispatch into threads
> spread across all the CPUs. This seems like a very strange thing to
> me -- you are adding latency and possibly causing cacheline pingpong.
>
> It may help throughput in some cases to spread the work across
> multiple CPUs but it seems strange to me to do this in the low-level
> driver. My intuition would be that it would be better to do this in
> the higher levels, and leave open the possibility for protocols that
> want the lowest possible latency to be called directly from the
> interrupt handler.

We've implemented this "spread CQ callbacks across multiple CPUs"
functionality to get better throughput on an SMP system, as you have
seen.

Originally, we had the same idea as you mentioned, that it would be better
to do this in the higher levels. The point is that so far we can't see
any simple possibility of how this could be done in the OpenIB stack, the
TCP/IP network layer or somewhere else in the Linux kernel.

For example:
For IPoIB we get the best throughput when we do the CQ callbacks on
different CPUs rather than staying on the same CPU.

In other papers and slides (see [1]) you can see similar approaches.

I think such an implementation or functionality could require more
or less non-trivial changes. This could also be relevant to other
I/O traffic.

[1]: Speeding up Networking, Van Jacobson and Bob Felderman,
http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf

Regards,
Heiko

2006-05-05 14:49:15

by Roland Dreier

[permalink] [raw]
Subject: Re: [openib-general] [PATCH 07/16] ehca: interrupt handling routines

Heiko> Originally, we had the same idea as you mentioned, that it
Heiko> would be better to do this in the higher levels. The point
Heiko> is that so far we can't see any simple possibility of how
Heiko> this could be done in the OpenIB stack, the TCP/IP network
Heiko> layer or somewhere else in the Linux kernel.

Heiko> For example: For IPoIB we get the best throughput when we
Heiko> do the CQ callbacks on different CPUs rather than staying
Heiko> on the same CPU.

So why not do it in IPoIB then? This approach is not optimal
globally. For example, uverbs event dispatch is just going to queue
an event and wake up the process waiting for events, and doing this on
some random CPU unrelated to where the process will run is
clearly the worst possible way to dispatch the event.

Heiko> In other papers and slides (see [1]) you can see similar
Heiko> approaches.

Heiko> [1]: Speeding up Networking, Van Jacobson and Bob
Heiko> Felderman,
Heiko> http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf

I think you've misunderstood this paper. It's about maximizing CPU
locality and pushing processing directly into the consumer. In the
context of slide 9, what you've done is sort of like adding another
control loop inside the kernel, since you dispatch from interrupt
handler to driver thread to final consumer. So I would argue that
your approach is exactly the opposite of what VJ is advocating.

- R.

2006-05-09 12:26:16

by Heiko J Schick

[permalink] [raw]
Subject: Re: [openib-general] [PATCH 07/16] ehca: interrupt handling routines

Roland Dreier wrote:
> Heiko> Originally, we had the same idea as you mentioned, that it
> Heiko> would be better to do this in the higher levels. The point
> Heiko> is that so far we can't see any simple possibility of how
> Heiko> this could be done in the OpenIB stack, the TCP/IP network
> Heiko> layer or somewhere else in the Linux kernel.
>
> Heiko> For example: For IPoIB we get the best throughput when we
> Heiko> do the CQ callbacks on different CPUs rather than staying
> Heiko> on the same CPU.
>
> So why not do it in IPoIB then? This approach is not optimal
> globally. For example, uverbs event dispatch is just going to queue
> an event and wake up the process waiting for events, and doing this on
> some random CPU not related to the where the process will run is
> clearly the worst possible way to dispatch the event.

Yes, I agree. It would not be an optimal solution, because other upper
level protocols (e.g. SDP, SRP, etc.) or userspace verbs would not
benefit from these changes. Nevertheless, what would an improved
"scaling" or "SMP" version of IPoIB look like? How could it be implemented?

> Heiko> In other papers and slides (see [1]) you can see similar
> Heiko> approaches.
>
> Heiko> [1]: Speeding up Networking, Van Jacobson and Bob
> Heiko> Felderman,
> Heiko> http://www.lemis.com/grog/Documentation/vj/lca06vj.pdf

> I think you've misunderstood this paper. It's about maximizing CPU
> locality and pushing processing directly into the consumer. In the
> context of slide 9, what you've done is sort of like adding another
> control loop inside the kernel, since you dispatch from interrupt
> handler to driver thread to final consumer. So I would argue that
> your approach is exactly the opposite of what VJ is advocating.

Sorry, my intention was not to present the *.pdf file as a guide for how
it should be implemented. I only wanted to show that other people are
also thinking about how TCP/IP performance could be increased and where
the bottlenecks (e.g. SOFTIRQs) are. :)

Regards,
Heiko

2006-05-09 16:23:58

by Roland Dreier

[permalink] [raw]
Subject: Re: [openib-general] [PATCH 07/16] ehca: interrupt handling routines

Heiko> Yes, I agree. It would not be an optimal solution, because
Heiko> other upper level protocols (e.g. SDP, SRP, etc.) or
Heiko> userspace verbs would not benefit from these
Heiko> changes. Nevertheless, what would an improved "scaling" or
Heiko> "SMP" version of IPoIB look like? How could it be
Heiko> implemented?

The trivial way to do it would be to use the same idea as the current
ehca driver: just create a thread for receive CQ events and a thread
for send CQ events, and defer CQ polling into those two threads.

Something even better may be possible by specializing to IPoIB of course.

- R.

2006-05-09 16:48:38

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH 07/16] ehca: interrupt handling routines

Quoting r. Roland Dreier <[email protected]>:
> The trivial way to do it would be to use the same idea as the current
> ehca driver: just create a thread for receive CQ events and a thread
> for send CQ events, and defer CQ polling into those two threads.

For RX, isn't this basically what NAPI is doing?
Only NAPI seems better, avoiding interrupts completely and avoiding the
latency hit by only getting triggered under high load ...

--
MST

2006-05-09 18:54:52

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: Re: [PATCH 07/16] ehca: interrupt handling routines

Quoting r. Shirley Ma <[email protected]>:
> No, CPU utilization wasn't reduced. When you use a single CQ, NAPI polls on both RX/TX.

I think NAPI's point is to reduce the interrupt rate.
Wouldn't this reduce CPU load?

> netperf, iperf, mpstat, netpipe, oprofiling, what's your suggestion?

netperf has -C, which gives CPU load; that's handy.
Running vmstat in another window also works reasonably well.

--
MST

2006-05-09 18:57:27

by Heiko J Schick

[permalink] [raw]
Subject: Re: [openib-general] Re: [PATCH 07/16] ehca: interrupt handling routines

On 09.05.2006, at 18:49, Michael S. Tsirkin wrote:

>> The trivial way to do it would be to use the same idea as the current
>> ehca driver: just create a thread for receive CQ events and a thread
>> for send CQ events, and defer CQ polling into those two threads.
>
> For RX, isn't this basically what NAPI is doing?
> Only NAPI seems better, avoiding interrupts completely and avoiding
> the latency hit by only getting triggered under high load ...

Does NAPI schedule CQ callbacks to different CPUs, or does the callback
(handling of data, etc.) stay on the same CPU where the interrupt came in?

Regards,
Heiko

2006-05-09 23:36:08

by Segher Boessenkool

[permalink] [raw]
Subject: Re: [openib-general] [PATCH 07/16] ehca: interrupt handling routines

> Heiko> Yes, I agree. It would not be an optimal solution, because
> Heiko> other upper level protocols (e.g. SDP, SRP, etc.) or
> Heiko> userspace verbs would not benefit from these
> Heiko> changes. Nevertheless, what would an improved "scaling" or
> Heiko> "SMP" version of IPoIB look like? How could it be
> Heiko> implemented?
>
> The trivial way to do it would be to use the same idea as the current
> ehca driver: just create a thread for receive CQ events and a thread
> for send CQ events, and defer CQ polling into those two threads.
>
> Something even better may be possible by specializing to IPoIB of
> course.

The hardware IRQ should go to some CPU close to the hardware itself.
The softirq (or whatever else) should go to the same CPU that is
handling the user-level task for that message. Or a CPU close to it,
at least.


Segher