I've been investigating why lttng destroys full nohz mode, and the
root cause is that lttng uses timers for flushing trace buffers. So
I'm planning on moving the timers to the ticking CPU, so that any CPU
using full nohz mode can continue to do so even though it might have
tracepoints.
I can see that kernel/sched/core.c has the function
get_nohz_timer_target() which tries to find an idle CPU to allocate
for a timer that has not specified a CPU to be pinned to.
My question here is: For full nohz mode, should this still be "only"
an idle CPU, or should it be translated to a CPU not running in full
nohz mode? I'd think this could make it a lot easier for
applications to make full use of full nohz.
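
As a strawman, this is roughly what I had in mind (an untested sketch
only, not a patch; tick_nohz_full_cpu() is from <linux/tick.h> and the
function name is made up):

/*
 * Sketch: pick a timer target that is busy and not in full nohz mode.
 * Assumes the caller has preemption disabled, like the callers of
 * get_nohz_timer_target() do.
 */
static int get_non_nohz_timer_target(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		if (tick_nohz_full_cpu(cpu))
			continue;	/* don't perturb full nohz CPUs */
		if (!idle_cpu(cpu))
			return cpu;	/* busy, ticking CPU */
	}

	/* Fall back to the current CPU. */
	return smp_processor_id();
}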
/Mats
* Mats Liljegren ([email protected]) wrote:
> I've been investigating why lttng destroys full nohz mode, and the
> root cause is that lttng uses timers for flushing trace buffers. So
> I'm planning on moving the timers to the ticking CPU, so that any CPU
> using full nohz mode can continue to do so even though it might have
> tracepoints.
>
> I can see that kernel/sched/core.c has the function
> get_nohz_timer_target() which tries to find an idle CPU to allocate
> for a timer that has not specified a CPU to be pinned to.
>
> My question here is: For full nohz mode, should this still be "only"
> an idle CPU, or should it be translated to a CPU not running in full
> nohz mode? I'd think this could make it a lot easier for
> applications to make full use of full nohz.
One thing to be aware of wrt LTTng ring buffer: if you look at
lttng-ring-buffer-client.h, you will notice that we use
.sync = RING_BUFFER_SYNC_PER_CPU,
as the ring buffer synchronization. This means we need to issue event
writes and sub-buffer switches from the CPU owning the buffer, or, in
very specific cases, if the CPU owning the buffer is offline, we can
touch it from a remote CPU, but only one (e.g. the CPU hotplug code).
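
For reference, the relevant part of the client config looks roughly
like this (excerpt from memory, other fields elided):

static const struct lib_ring_buffer_config client_config = {
	/* ... callbacks and other fields elided ... */
	.alloc = RING_BUFFER_ALLOC_PER_CPU,
	.sync = RING_BUFFER_SYNC_PER_CPU,	/* writes only from the owner CPU */
	.wakeup = RING_BUFFER_WAKEUP_BY_TIMER,	/* uses the read_timer, see below */
};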
For the LTTng ring buffer, there are two timers to take into account:
switch_timer and read_timer.
The switch_timer is not enabled by default. When it is enabled by the
end user, it periodically flushes the lttng buffers. If you want to make
this timer execute from a single timer handler and apply to all buffers
(without IPIs), you will need to use
.sync = RING_BUFFER_SYNC_GLOBAL,
to allow concurrent updates to a ring buffer from remote CPUs.
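
A single handler covering all CPUs could then look roughly like this
(sketch only; chan_get_buf() is a placeholder for however you look up
the per-CPU buffer, and this is only safe with the GLOBAL sync above):

static void switch_timer_all_cpus(unsigned long data)
{
	struct channel *chan = (struct channel *)data;
	int cpu;

	for_each_possible_cpu(cpu) {
		struct lib_ring_buffer *buf = chan_get_buf(chan, cpu);	/* placeholder */

		/* Remote flush, only valid with RING_BUFFER_SYNC_GLOBAL. */
		lib_ring_buffer_switch_slow(buf, SWITCH_ACTIVE);
	}
}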
The other timer requires fewer modifications: the read_timer periodically
checks whether poll() waiters need to be woken up. It just reads the
producer offset position and compares it to the current consumer
position. This one can be moved to a single timer handler that covers
all CPUs without any change to the "sync" choice.
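
Again only a sketch (chan_get_buf() and buf_data_pending() stand in for
the existing per-CPU buffer lookup and the producer/consumer offset
comparison):

static void read_timer_all_cpus(unsigned long data)
{
	struct channel *chan = (struct channel *)data;
	int cpu;

	for_each_possible_cpu(cpu) {
		struct lib_ring_buffer *buf = chan_get_buf(chan, cpu);	/* placeholder */

		/* Only reads offsets, so SYNC_PER_CPU stays valid. */
		if (buf_data_pending(buf))	/* placeholder comparison */
			wake_up_interruptible(&buf->read_wait);
	}
}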
Please note that the read_timer is currently used by default. It can be
entirely removed if you choose
.wakeup = RING_BUFFER_WAKEUP_BY_WRITER,
instead of RING_BUFFER_WAKEUP_BY_TIMER. However, if you choose
wakeup by writer, the tracer will discard events coming from NMI
handlers, because some locks need to be taken by the tracing site in
this mode.
If we care about performance and scalability (we really should), the
right approach would be to keep RING_BUFFER_SYNC_PER_CPU though, and
keep the per-CPU timers for the periodic flush (switch_timer). We might
want to hook into the full nohz entry/exit hooks (hopefully they exist)
to move the per-CPU timers off the full nohz CPUs, and enable a new flag
on these ring buffers that would allow dynamically switching between
RING_BUFFER_SYNC_PER_CPU and RING_BUFFER_SYNC_GLOBAL for a given ring
buffer.
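
Roughly the direction I have in mind, assuming such a hook exists
(everything here is hypothetical: the hook itself, chan_get_buf(),
lib_ring_buffer_set_remote_sync(), the choice of target CPU, and the
exact field names):

/* Hypothetical: called when @cpu enters full nohz mode. */
static void lttng_nohz_full_enter(struct channel *chan, int cpu)
{
	struct lib_ring_buffer *buf = chan_get_buf(chan, cpu);	/* placeholder */
	int target = 0;	/* placeholder: pick a CPU that keeps its tick */

	/* Hypothetical flag: switch this buffer from PER_CPU to GLOBAL sync. */
	lib_ring_buffer_set_remote_sync(buf, true);

	/* Move the periodic flush off the full nohz CPU. */
	del_timer_sync(&buf->switch_timer);
	buf->switch_timer.expires = jiffies + chan->switch_timer_interval;
	add_timer_on(&buf->switch_timer, target);
}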
Thoughts ?
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com