Hi folks,
I've heard about issues with process cpu timers for a while (~years) but only
recently found the time to look into them. I'm starting this thread in an
attempt to get directional opinions on how to resolve them (I'm happy to do
the work itself).
Let's take setitimer(2). The man page says that "Under very heavy loading, an
ITIMER_REAL timer may expire before the signal from a previous expiration has
been delivered." This is true but incomplete - the same issue plagues
ITIMER_PROF and ITIMER_VIRTUAL as well. I'll call this property
"completeness", i.e. that all accrued process CPU time should be accounted
for by the signals delivered to the process.
A second issue is proportionality. Specifically for setitimer, there appears to
be an expectation in userspace that the number of signals received per thread
is proportional to that thread's CPU time. I'm not sure where this belief
comes from, but my guess is that people assumed multi-threadedness preserved
the "sample a stack trace on every SIGPROF" methodology from single-threaded
setitimer usage. I don't know if it was ever possible, but you cannot currently
implement this strategy and get good data out of it. Yet there's software
like gperftools that assumes you can. (Did this ever work well?)
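For concreteness, this is roughly the usage these tools assume - an
untested sketch, with the actual stack sampling elided to a comment:

  #define _GNU_SOURCE
  #include <signal.h>
  #include <sys/syscall.h>
  #include <sys/time.h>
  #include <unistd.h>

  static void sigprof_handler(int sig)
  {
      /* A real profiler would sample the current thread's stack here,
       * assuming SIGPROF lands on threads in proportion to their CPU
       * time. */
      pid_t tid = syscall(SYS_gettid);
      (void)tid;
      (void)sig;
  }

  int main(void)
  {
      struct sigaction sa = { .sa_handler = sigprof_handler };
      struct itimerval it = {
          /* one signal per 10ms of process CPU time */
          .it_interval = { .tv_usec = 10000 },
          .it_value    = { .tv_usec = 10000 },
      };

      sigaction(SIGPROF, &sa, NULL);
      setitimer(ITIMER_PROF, &it, NULL);
      /* ... spawn worker threads and do work ... */
      for (;;)
          pause();
  }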
1. Completeness
The crux of the completeness issue is that process CPU time can easily be
accrued faster than signals on a shared queue can be dequeued. Relatively
large time intervals like 10ms can trivially drop signals on a 12-core
24-thread system, but in my tests, 2-core 4-thread systems behave just as
poorly under
enough load.
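The effect is easy to observe with something like this untested sketch
(error handling elided): arm a 10ms ITIMER_PROF, spin up enough threads,
and compare the number of signals delivered against the CPU time that
getrusage() reports:

  #include <pthread.h>
  #include <signal.h>
  #include <stdio.h>
  #include <sys/resource.h>
  #include <sys/time.h>
  #include <unistd.h>

  /* Racy when handlers run concurrently on different threads, but good
   * enough for a demonstration. */
  static volatile sig_atomic_t nsignals;

  static void on_sigprof(int sig) { (void)sig; nsignals++; }

  static void *spin(void *arg) { (void)arg; for (;;); }

  int main(void)
  {
      struct sigaction sa = { .sa_handler = on_sigprof };
      struct itimerval it = {
          .it_interval = { .tv_usec = 10000 },  /* 10ms */
          .it_value    = { .tv_usec = 10000 },
      };
      struct rusage ru;
      pthread_t t;

      sigaction(SIGPROF, &sa, NULL);
      setitimer(ITIMER_PROF, &it, NULL);
      for (int i = 0; i < 24; i++)  /* plenty of runnable threads */
          pthread_create(&t, NULL, spin, NULL);

      for (unsigned left = 10; (left = sleep(left)); )
          ;  /* ride out the EINTRs from SIGPROF */

      getrusage(RUSAGE_SELF, &ru);
      /* accounted: nsignals * 10ms; accrued: ru_utime + ru_stime */
      printf("signals=%d (~%d ms accounted), utime=%ld s\n",
             (int)nsignals, (int)nsignals * 10, (long)ru.ru_utime.tv_sec);
      return 0;
  }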
There are a few possible improvements to alleviate or fix this.
a. Instead of delivering the signal to the shared queue, we can deliver it to
the task that won the "process cpu timers" race. This improves the situation
by effectively sharding the collision space by the number of runnable threads.
b. An alternative solution would be to search through the threads for one that
doesn't have the signal queued and deliver to it. This leads to more overhead
but better signal delivery guarantees. However, it also has worse behavior
w.r.t. waking up idle threads.
c. A third solution may be to treat SIGPROF and SIGVTALRM as rt-signals when
delivered due to an itimer expiring. I'm not convinced this is necessary but
it's the most complete solution.
2. Proportionality
The issue of proportionality is really the issue of "can you use signals for
multi-threaded profiling at all." As it stands, there's no mechanism that's
ensuring proportionality, so the distribution across threads is meaningless.
The only way I can think of to actually enforce this property is to keep
snapshots of per-thread cpu time and diff them from one SIGPROF to the next to
determine the target thread (by doing a weighted random choice). It's not _a
lot_ of work but it's certainly a little more overhead and a fair bit of
complexity. With POSIX_CPU_TIMERS_TASK_WORK=y, this extra overhead shouldn't
impact things too much.
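In userspace terms, the selection logic would look roughly like this
(illustrative only - the real thing would live in the kernel, the thread
registration here is hand-waved, and drand48() in signal context is not
async-signal-safe):

  #include <pthread.h>
  #include <stdlib.h>
  #include <time.h>

  #define MAX_THREADS 64

  static pthread_t threads[MAX_THREADS];     /* registered by hand */
  static struct timespec prev[MAX_THREADS];  /* previous snapshots */
  static int nthreads;

  static long long delta_ns(struct timespec a, struct timespec b)
  {
      return (b.tv_sec - a.tv_sec) * 1000000000LL +
             (b.tv_nsec - a.tv_nsec);
  }

  /* On each SIGPROF, diff the per-thread CPU clocks against the last
   * snapshot and pick a target with probability proportional to the CPU
   * time it accrued since then. The first call effectively just seeds
   * the snapshots. */
  static int pick_target(void)
  {
      long long delta[MAX_THREADS], total = 0;

      for (int i = 0; i < nthreads; i++) {
          struct timespec now;
          clockid_t clk;

          pthread_getcpuclockid(threads[i], &clk);
          clock_gettime(clk, &now);
          delta[i] = delta_ns(prev[i], now);
          prev[i] = now;
          total += delta[i];
      }
      if (total <= 0)
          return -1;

      long long r = (long long)(drand48() * total);
      for (int i = 0; i < nthreads; i++) {
          r -= delta[i];
          if (r < 0)
              return i;
      }
      return nthreads - 1;
  }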
Note that proportionality is orthogonal to completeness - while you can
configure posix timers to use rt-signals with timer_create (which fixes
completeness), they still have the same distribution issues.
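(For reference, the rt-signal configuration I mean is something like this
untested sketch:)

  #include <signal.h>
  #include <time.h>

  static timer_t arm_rt_cpu_timer(void)
  {
      struct sigevent sev = {
          .sigev_notify = SIGEV_SIGNAL,
          .sigev_signo  = SIGRTMIN,  /* a queueable rt-signal */
      };
      struct itimerspec its = {
          .it_interval = { .tv_nsec = 10 * 1000 * 1000 },  /* 10ms */
          .it_value    = { .tv_nsec = 10 * 1000 * 1000 },
      };
      timer_t timerid;

      timer_create(CLOCK_PROCESS_CPUTIME_ID, &sev, &timerid);
      timer_settime(timerid, 0, &its, NULL);
      return timerid;
  }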
Overall, I'd love to hear opinions on 1) whether either or both of these
concerns are worth fixing (I can expand on why I think they are) and 2) the
direction the work should take.
Thanks for reading all this,
-- Delyan
Delyan!
On Sat, Feb 10 2024 at 17:30, Delyan Kratunov wrote:
> I've heard about issues with process cpu timers for a while (~years) but only
> recently found the time to look into them. I'm starting this thread in an
> attempt to get directional opinions on how to resolve them (I'm happy to do
> the work itself).
>
> Let's take setitimer(2). The man page says that "Under very heavy
> loading, an ITIMER_REAL timer may expire before the signal from a
> previous expiration has been delivered." This is true but incomplete -
> the same issue plagues ITIMER_PROF and ITIMER_VIRTUAL as well. I'll
> call this property "completeness", i.e. that all accrued process CPU
> time should be accounted for by the signals delivered to the process.
That's wishful thinking and there is no way to ensure that.
  timer_fires()
    queue_signal()
  T1  wakeup_process(target)
  T2  target: consume_signal()
  T3  target: signal_handler_in_user_space()

There is no guarantee that

  T2 - T1 < interval

and there is no guarantee that

  T3 - T1 < interval
So whatever you propose to "fix" that will eventually improve the
situation slightly for a few corner cases, but it won't ever fix it
completely.
Just for the record: setitimer() has been marked obsolescent in the
POSIX standard issue 7 in 2018. The replacement is timer_settime(), which
has a few interesting properties vs. the overrun handling.
> A second issue is proportionality. Specifically for setitimer, there appears to
> be an expectation in userspace that the number of signals received per thread
> is proportional to that thread's CPU time. I'm not sure where this belief
> comes from, but my guess is that people assumed multi-threadedness preserved
> the "sample a stack trace on every SIGPROF" methodology from single-threaded
> setitimer usage. I don't know if it was ever possible, but you cannot currently
> implement this strategy and get good data out of it. Yet there's software
> like gperftools that assumes you can. (Did this ever work well?)
I don't know, and those assumptions were already clearly wrong when
the tool was written.
> 1. Completeness
>
> The crux of the completeness issue is that process CPU time can easily be
> accrued faster than signals on a shared queue can be dequeued. Relatively
> large time intervals like 10ms can trivially drop signals on a 12-core
> 24-thread system, but in my tests, 2-core 4-thread systems behave just as
> poorly under
> enough load.
It does not drop signals. The pending signal subsumes all subsequent
ones until it is delivered.
The setitimer() specification is silent about it, but SIGALRM is
definitely not queueable. SIGPROF/SIGVTALRM are not queued by the kernel,
which is standard compliant, as the decision whether to queue a legacy
signal more than once is implementation-defined.
The timer_settime() specification says clearly:
"Only a single signal shall be queued to the process for a given timer
at any point in time. When a timer for which a signal is still pending
expires, no signal shall be queued, and a timer overrun shall
occur. When a timer expiration signal is delivered to or accepted by a
process, the timer_getoverrun() function shall return the timer
expiration overrun count for the specified timer. The overrun count
returned contains the number of extra timer expirations that occurred
between the time the signal was generated (queued) and when it was
delivered or accepted, up to but not including an
implementation-defined maximum of {DELAYTIMER_MAX}. If the number of
such extra expirations is greater than or equal to {DELAYTIMER_MAX},
then the overrun count shall be set to {DELAYTIMER_MAX}. The value
returned by timer_getoverrun() shall apply to the most recent
expiration signal delivery or acceptance for the timer. If no
expiration signal has been delivered for the timer, the return value
of timer_getoverrun() is unspecified."
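Sketched in user space terms, that overrun handling looks like this
(error handling omitted):

  #include <signal.h>
  #include <time.h>

  static timer_t timerid;
  static volatile long expirations;

  static void handler(int sig, siginfo_t *si, void *ctx)
  {
      (void)sig; (void)si; (void)ctx;
      /* One delivered signal plus all the expirations it subsumed.
       * timer_getoverrun() is async-signal-safe. */
      expirations += 1 + timer_getoverrun(timerid);
  }

  static void setup(void)
  {
      struct sigaction sa = {
          .sa_sigaction = handler,
          .sa_flags     = SA_SIGINFO,
      };
      struct sigevent sev = {
          .sigev_notify = SIGEV_SIGNAL,
          .sigev_signo  = SIGPROF,
      };

      sigaction(SIGPROF, &sa, NULL);
      timer_create(CLOCK_PROCESS_CPUTIME_ID, &sev, &timerid);
  }

So the accounting stays complete even though only one signal is ever
queued per timer.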
> There are a few possible improvements to alleviate or fix this.
>
> a. Instead of delivering the signal to the shared queue, we can deliver it to
> the task that won the "process cpu timers" race. This improves the situation
> by effectively sharding the collision space by the number of runnable
> threads.
I have no idea how you define "won the race", but you can't deliver
process-wide signals targeted to a single thread. That thread could be
just in the process of blocking the signal, so the signal would get lost
in the worst case.
Also if you want to do that then you suddenly change the signal
semantics to allow queueing the signal multiple times, which is a user
space visible change breaking existing applications.
> b. An alternative solution would be to search through the threads for one that
> doesn't have the signal queued and deliver to it. This leads to more overhead
> but better signal delivery guarantees. However, it also has worse behavior
> w.r.t. waking up idle threads.
No. You cannot queue those signals more than once without changing the
current behaviour, which is a user space visible change.
> c. A third solution may be to treat SIGPROF and SIGVTALRM as rt-signals when
> delivered due to an itimer expiring. I'm not convinced this is necessary but
> it's the most complete solution.
No. They are not RT signals.
> 2. Proportionality
>
> The issue of proportionality is really the issue of "can you use signals for
> multi-threaded profiling at all." As it stands, there's no mechanism that's
> ensuring proportionality, so the distribution across threads is meaningless.
>
> The only way I can think of to actually enforce this property is to keep
> snapshots of per-thread cpu time and diff them from one SIGPROF to the next to
> determine the target thread (by doing a weighted random choice). It's not _a
> lot_ of work but it's certainly a little more overhead and a fair bit of
> complexity. With POSIX_CPU_TIMERS_TASK_WORK=y, this extra overhead shouldn't
> impact things too much.
>
> Note that proportionality is orthogonal to completeness - while you can
> configure posix timers to use rt-signals with timer_create (which fixes
> completeness),
No. There is still only a single signal per timer queued. See the spec
quote above.
The advantage of POSIX timers over the legacy itimers is that they
provide overrun information.
> they still have the same distribution issues.
CLOCK_THREAD_CPUTIME_ID exists for a reason and user space can correlate
the thread data nicely.
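E.g. (sketch):

  #include <pthread.h>
  #include <time.h>

  /* Each thread can sample its own CPU clock directly ... */
  static void sample_self(struct timespec *ts)
  {
      clock_gettime(CLOCK_THREAD_CPUTIME_ID, ts);
  }

  /* ... or a profiler thread can read another thread's clock. */
  static void sample_thread(pthread_t t, struct timespec *ts)
  {
      clockid_t clk;

      pthread_getcpuclockid(t, &clk);
      clock_gettime(clk, ts);
  }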
Aside of that, there are PMUs and perf, which solve all the problems you
are trying to solve in one go.
Thanks,
tglx
Thanks for your detailed response, Thomas, I appreciate you taking the time
with my random side quest!
> [...]
>
> That's wishful thinking and there is no way to ensure that.
> Just for the record: setitimer() has been marked obsolescent in the
> POSIX standard issue 7 in 2018. The replacement is timer_settime() which
> has a few interesting properties vs. the overrun handling.
This is a great point and I think it overrides anything I have to say about
setitimer. Overall, I have nothing to rehash on the process signal delivery
point - I understand the situation now, thanks to your thorough explanation!
> [...]
> I don't know, and those assumptions were already clearly wrong when
> the tool was written.
That was my impression as well, thanks for confirming. (I've found at least 3
tools with this same incorrect belief)
> [...]
> > they still have the same distribution issues.
>
> CLOCK_THREAD_CPUTIME_ID exists for a reason and user space can correlate
> the thread data nicely.
>
> Aside of that there are PMUs and perf which solve all the problems you
> are trying to solve in one go.
Absolutely, the ability to write a profiler with perf_event_open is not in
question at all. However, not every situation allows for PMU or
perf_event_open access. Timers could form a nice middle ground, in exactly the
way people have tried to use them.
I'd like to push back a little on the "CLOCK_THREAD_CPUTIME_ID fixes things"
point, though. From an application and library point of view, the per-thread
clocks are harder to use - you need to either orchestrate every thread to
participate voluntarily or poll the thread ids and create timers from another
thread. In perf_event_open, this is solved via the .inherit/.inherit_thread
bits.
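Concretely, "every thread participates voluntarily" means each thread has
to run something like this at startup (untested sketch):

  #include <pthread.h>
  #include <time.h>

  static timer_t arm_own_cpu_timer(void)
  {
      clockid_t clk;
      timer_t timerid;
      struct itimerspec its = {
          .it_interval = { .tv_nsec = 10 * 1000 * 1000 },  /* 10ms */
          .it_value    = { .tv_nsec = 10 * 1000 * 1000 },
      };

      pthread_getcpuclockid(pthread_self(), &clk);
      /* NULL sigevent: the default is SIGALRM, and the expiry signal
       * is still process-directed. */
      timer_create(clk, NULL, &timerid);
      timer_settime(timerid, 0, &its, NULL);
      return timerid;
  }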
More importantly, they don't work for all workloads. If I have 10 threads that
each run for 5ms, a 10ms process timer would fire 5 times, while per-thread
10ms timers would never fire. You can easily imagine an application that
accrues all its cpu time in a way that doesn't generate a single signal (in
the extreme, threads only living a single tick).
Overall, what I want to establish is whether there's a path to achieve the
_assumed_ interface that these tools expect - process-wide cpu signals that
correlate with where cpu time is spent - through any existing or extended
timer API. This interface would be eminently useful, as people have clearly,
albeit misguidedly, demonstrated.
If the answer is definitely "no," I'd like to at least add some notes to the
man pages.
-- Delyan
Delyan!
On Mon, Feb 26 2024 at 16:29, Delyan Kratunov wrote:
>> I don't know, and those assumptions were already clearly wrong when
>> the tool was written.
>
> That was my impression as well, thanks for confirming. (I've found at least 3
> tools with this same incorrect belief)
The wonders of error proliferation by mindless copy & pasta and/or design
borrowing.
> Absolutely, the ability to write a profiler with perf_event_open is not in
> question at all. However, not every situation allows for PMU or
> perf_event_open access. Timers could form a nice middle ground, in exactly the
> way people have tried to use them.
>
> I'd like to push back a little on the "CLOCK_THREAD_CPUTIME_ID fixes things"
> point, though. From an application and library point of view, the per-thread
> clocks are harder to use - you need to either orchestrate every thread to
> participate voluntarily or poll the thread ids and create timers from another
> thread. In perf_event_open, this is solved via the .inherit/.inherit_thread
> bits.
I did not say it's easy and fixes all problems magically :)
As accessing a different thread/process requires ptrace permissions, this
might be solvable via ptrace, which might turn out to be too heavyweight.
Though it would certainly be possible to implement inheritance for those
timers and let the kernel set them up for all existing and future threads.
That's a bit tricky vs. accounting on behalf of and association to the
profiler thread in the (v)fork() case, and it also needs some thought about
how the profiler thread gets informed of the newly associated timer_id,
but I think it's doable.
> More importantly, they don't work for all workloads. If I have 10 threads that
> each run for 5ms, a 10ms process timer would fire 5 times, while per-thread
> 10ms timers would never fire. You can easily imagine an application that
> accrues all its cpu time in a way that doesn't generate a single signal (in
> the extreme, threads only living a single tick).
That's true, but you have to look at the lifetime rules of those
timers.
A CLOCK_THREAD_CPUTIME_ID timer is owned by the thread which creates it,
no matter what the monitored target thread is. So when the monitored
thread exits, the timer is disarmed, but the timer itself stays
accessible to the owner. That means the owner can still query the timer.
As of today, timer_gettime() on such a timer after the monitored
thread exited results in { .it_value = 0, .it_interval = 0 }.
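I.e. (sketch):

  #include <time.h>

  static void query_after_thread_exit(timer_t timerid)
  {
      struct itimerspec its;

      /* The monitored thread has exited; the timer is disarmed but
       * still accessible to its owner. */
      timer_gettime(timerid, &its);
      /* its.it_value == its.it_interval == { 0, 0 } as of today */
  }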
We can't change that in general, but if we go and do the inheritance mode,
then the timer would be owned by the profiler thread. Even without
inheritance mode, we could support a special flag for timer_create() to
denote that this is a magic timer :)
So that magic flag would preserve the accumulated runtime in the timer in
some way when the thread exits, and either return that in timer_gettime()
along with some magic to denote that the monitored thread is gone, or add
a new timer_get_foo() syscall for it.
Whether the profiler then polls the timers periodically or acts on an
exit signal is a user space implementation detail.
Thanks,
tglx