Received-SPF: pass (google.com: domain of linux-kernel+bounces-64069-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) client-ip=139.178.88.99;
From: Thomas Gleixner <tglx@linutronix.de>
To: Delyan Kratunov <delyan@delyan.me>, linux-kernel@vger.kernel.org
Subject: Re: Posix process cpu timer inaccuracies
In-Reply-To: <2635838.Lt9SDvczpP@discovery>
Date: Tue, 13 Feb 2024 19:20:32 +0100
Message-ID: <87cyt0gr9r.ffs@tglx>
Precedence: bulk
MIME-Version: 1.0
Content-Type: text/plain

Delyan!

On Sat, Feb 10 2024 at 17:30, Delyan Kratunov wrote:
> I've heard about issues with process cpu timers for a while (~years) but only 
> recently found the time to look into them. I'm starting this thread in an 
> attempt to get directional opinions on how to resolve them (I'm happy to do 
> the work itself).
>
> Let's take setitimer(2). The man page says that "Under very heavy
> loading, an ITIMER_REAL timer may expire before the signal from a
> previous expiration has been delivered." This is true but incomplete -
> the same issue plagues ITIMER_PROF and ITIMER_VIRTUAL as well. I'll
> call this property "completeness" i.e. that all accrued process CPU
> time should be accounted by the signals delivered to the process.

That's wishful thinking and there is no way to ensure that.

       timer_fires()
         queue_signal()
T1         wakeup_process(target)

T2     target: consume_signal()

T3     target: signal_handler_in_user_space()

There is no guarantee that

      T2 - T1 < interval

and there is no guarantee that

      T3 - T1 < interval

So whatever you propose to "fix" that, will eventually improve the
situation slightly for a few corner cases, but it won't ever fix it
completely.

Just for the record: setitimer() has been marked obsolescent in the
POSIX standard issue 7 in 2018. The replacement is timer_settime() which
has a few interesting properties vs. the overrun handling.

> A second issue is proportionality. Specifically for setitimer, there appears to 
> be an expectation in userspace that the number of signals received per thread 
> is proportional to that thread's CPU time. I'm not sure where this belief is 
> coming from but my guess is that people assumed multi-threadedness preserved 
> the "sample a stack trace on every SIGPROF" methodology from single-threaded 
> setitimer usage. I don't know if it was ever possible but you cannot currently 
> implement this strategy and get good data out of it. Yet, there's software 
> like gperftools that assumes you can. (Did this ever work well?)

I don't know and those assumptions have been clearly wrong at the point
where the tool was written.

> 1. Completeness
>
> The crux of the completeness issue is that process CPU time can easily be 
> accrued faster than signals on a shared queue can be dequeued. Relatively 
> large time intervals like 10ms can trivially drop signals on 12-core 24-thread 
> system but in my tests, 2-core 4-thread systems behave just as poorly under 
> enough load.

It does not drop signals. The pending signal subsumes all subsequent
ones until it is delivered.

The setitimer() specification is silent about it, but SIGALRM is
definitely not queueable. SIGPROF/SIGVTIME are not queued by the kernel
which is standard compliant as the decision whether to queue a legacy
signal more than once is implementation-defined.

The timer_settimer() specification says clearly:

  "Only a single signal shall be queued to the process for a given timer
   at any point in time. When a timer for which a signal is still pending
   expires, no signal shall be queued, and a timer overrun shall
   occur. When a timer expiration signal is delivered to or accepted by a
   process, the timer_getoverrun() function shall return the timer
   expiration overrun count for the specified timer. The overrun count
   returned contains the number of extra timer expirations that occurred
   between the time the signal was generated (queued) and when it was
   delivered or accepted, up to but not including an
   implementation-defined maximum of {DELAYTIMER_MAX}. If the number of
   such extra expirations is greater than or equal to {DELAYTIMER_MAX},
   then the overrun count shall be set to {DELAYTIMER_MAX}. The value
   returned by timer_getoverrun() shall apply to the most recent
   expiration signal delivery or acceptance for the timer. If no
   expiration signal has been delivered for the timer, the return value
   of timer_getoverrun() is unspecified."

> There's a few possible improvements to alleviate or fix this.
>
> a. Instead of delivering the signal to the shared queue, we can deliver it to 
> the task that won the "process cpu timers" race. This improves the situation 
> by effectively sharding the collision space by the number of runnable
> threads.

I have no idea how you define "won the race", but you can't deliver
process wide signals targeted to a single thread. That thread could be
just in the process of blocking the signal, so the signal would get lost
in the worst case.

Also if you want to do that then you suddenly change the signal
semantics to allow queueing the signal multiple times, which is a user
space visible change breaking existing applications.

> b. An alternative solution would be to search through the threads for one that 
> doesn't have the signal queued and deliver to it. This leads to more overhead 
> but better signal delivery guarantees. However, it also has worse behavior 
> w.r.t. waking up idle threads.

No. You cannot queue those signals more than once without changing the
current behaviour which is a user space visible change.

> c. A third solution may be to treat SIGPROF and SIGVTALRM as rt-signals when 
> delivered due to an itimer expiring. I'm not convinced this is necessary but 
> it's the most complete solution.

No. They are not RT signals.

> 2. Proportionally
>
> The issue of proportionality is really the issue of "can you use signals for 
> multi-threaded profiling at all." As it stands, there's no mechanism that's 
> ensuring proportionality, so the distribution across threads is meaningless. 
>
> The only way I can think of to actually enforce this property is to keep 
> snapshots of per-thread cpu time and diff them from one SIGPROF to the next to 
> determine the target thread (by doing a weighted random choice). It's not _a 
> lot_ of work but it's certainly a little more overhead and a fair bit of 
> complexity. With POSIX_CPU_TIMERS_TASK_WORK=y, this extra overhead shouldn't 
> impact things too much.
>
> Note that proportionality is orthogonal to completeness - while you can 
> configure posix timers to use rt-signals with timer_create (which fixes 
> completeness),

No. There is still only a single signal per timer queued. See the spec
quote above.

The advantage of POSIX timers over the legacy itimers is that they
provide overrun information.

> they still have the same distribution issues.

CLOCK_THREAD_CPUTIME_ID exists for a reason and user space can correlate
the thread data nicely.

Aside of that there are PMUs and perf which solve all the problems you
are trying to solve in one go.

Thanks,

        tglx