In CONFIG_PREEMPT_RT kernel, ktimer threads need to be excluded as well as
ksoftirqd when accounting CPUTIME_SOFTIRQ in irqtime_account_irq.
Also add this_cpu_ktimers to keep consistency with this_cpu_ksoftirqd.
Signed-off-by: tiozhang <[email protected]>
---
include/linux/interrupt.h | 5 +++++
kernel/sched/cputime.c | 5 +++++
2 files changed, 10 insertions(+)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a5091ac97fc6..a88646acaf3f 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -621,6 +621,11 @@ static inline unsigned int local_pending_timers(void)
return __this_cpu_read(pending_timer_softirq);
}
+static inline struct task_struct *this_cpu_ktimers(void)
+{
+ return this_cpu_read(timersd);
+}
+
#else
static inline void raise_timer_softirq(void)
{
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index af7952f12e6c..0fac0109d151 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -73,7 +73,12 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
*/
if (pc & HARDIRQ_MASK)
irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
+#ifdef CONFIG_PREEMPT_RT
+ else if ((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd()
+ && curr != this_cpu_ktimers())
+#else
else if ((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd())
+#endif
irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
}
--
2.17.1
On Fri, 24 Nov 2023 14:34:50 +0800
tiozhang <[email protected]> wrote:
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index af7952f12e6c..0fac0109d151 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -73,7 +73,12 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
> */
> if (pc & HARDIRQ_MASK)
> irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
> +#ifdef CONFIG_PREEMPT_RT
> + else if ((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd()
> + && curr != this_cpu_ktimers())
> +#else
> else if ((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd())
> +#endif
> irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
> }
If the above is necessary, it would look nicer as:
else if (((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd() &&
(!IS_ENABLED(CONFIG_PREEMPT_RT) || curr != this_cpu_ktimers()))
-- Steve
In CONFIG_PREEMPT_RT kernel, ktimers also calls __do_softirq,
so when accounting CPUTIME_SOFTIRQ, ktimers need to be excluded
as well as ksoftirqd.
Also add this_cpu_ktimers to keep consistency with this_cpu_ksoftirqd.
Signed-off-by: tiozhang <[email protected]>
---
include/linux/interrupt.h | 5 +++++
kernel/sched/cputime.c | 3 ++-
2 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a5091ac97fc6..a88646acaf3f 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -621,6 +621,11 @@ static inline unsigned int local_pending_timers(void)
return __this_cpu_read(pending_timer_softirq);
}
+static inline struct task_struct *this_cpu_ktimers(void)
+{
+ return this_cpu_read(timersd);
+}
+
#else
static inline void raise_timer_softirq(void)
{
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index af7952f12e6c..fd3610353e12 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -73,7 +73,8 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
*/
if (pc & HARDIRQ_MASK)
irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
- else if ((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd())
+ else if (((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd() &&
+ (!IS_ENABLED(CONFIG_PREEMPT_RT) || curr != this_cpu_ktimers()))
irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
}
--
2.17.1
On 2023-11-30 17:41:47 [+0800], tiozhang wrote:
> In CONFIG_PREEMPT_RT kernel, ktimers also calls __do_softirq,
> so when accounting CPUTIME_SOFTIRQ, ktimers need to be excluded
> as well as ksoftirqd.
> Also add this_cpu_ktimers to keep consistency with this_cpu_ksoftirqd.
I'm still not sure what the benefit here is. It says align with
ksoftirqd but why? Why don't we account softirq time for ksoftirqd (and
should continue to do so for ktimersd)?
ktimers runs almost all the time in softirq context. So does every
force-threaded interrupt. Should we exclude them, too?
Sebastian
Hi Sebastian, thanks for your reply.
Sebastian Andrzej Siewior <[email protected]> 于2023年11月30日周四 20:00写道:
>
> On 2023-11-30 17:41:47 [+0800], tiozhang wrote:
> > In CONFIG_PREEMPT_RT kernel, ktimers also calls __do_softirq,
> > so when accounting CPUTIME_SOFTIRQ, ktimers need to be excluded
> > as well as ksoftirqd.
> > Also add this_cpu_ktimers to keep consistency with this_cpu_ksoftirqd.
>
> I'm still not sure what the benefit here is. It says align with
> ksoftirqd but why? Why don't we account softirq time for ksoftirqd (and
> should continue to do so for ktimersd)?
That is my miss. When CONFIG_IRQ_TIME_ACCOUNTING is enabled,
ksoftirqd is counted elsewhere (where ktimers should also be aligned).
Please review my later patch v3.
>
> ktimers runs almost all the time in softirq context. So does every
> force-threaded interrupt. Should we exclude them, too?
For force-threaded interrupt, it counts on CPUTIME_SYSTEM
instead of CPUTIME_IRQ nor CPUTIME_SOFTIRQ.
To me, it does not quite make sense, I'm also thinking of sending a patch
of this, but IMHO it should not be considered in this patch...
>
> Sebastian
On 2023-12-01 15:27:39 [+0800], Yuanhan Zhang wrote:
> Hi Sebastian, thanks for your reply.
Hi,
> Sebastian Andrzej Siewior <[email protected]> 于2023年11月30日周四 20:00写道:
> >
> > On 2023-11-30 17:41:47 [+0800], tiozhang wrote:
> > > In CONFIG_PREEMPT_RT kernel, ktimers also calls __do_softirq,
> > > so when accounting CPUTIME_SOFTIRQ, ktimers need to be excluded
> > > as well as ksoftirqd.
> > > Also add this_cpu_ktimers to keep consistency with this_cpu_ksoftirqd.
> >
> > I'm still not sure what the benefit here is. It says align with
> > ksoftirqd but why? Why don't we account softirq time for ksoftirqd (and
> > should continue to do so for ktimersd)?
>
> That is my miss. When CONFIG_IRQ_TIME_ACCOUNTING is enabled,
> ksoftirqd is counted elsewhere (where ktimers should also be aligned).
> Please review my later patch v3.
What is elsewhere?
> > ktimers runs almost all the time in softirq context. So does every
> > force-threaded interrupt. Should we exclude them, too?
>
> For force-threaded interrupt, it counts on CPUTIME_SYSTEM
> instead of CPUTIME_IRQ nor CPUTIME_SOFTIRQ.
> To me, it does not quite make sense, I'm also thinking of sending a patch
> of this, but IMHO it should not be considered in this patch...
Sure. I was just curious what is different with ktimers/ksoftirqd which
run mostly in softirq vs threaded interrupts which do the same.
Sebastian
In CONFIG_PREEMPT_RT kernel, ktimers also calls __do_softirq,
so when accounting CPUTIME_SOFTIRQ, ktimers need to be accounted the same
as ksoftirqd.
Signed-off-by: tiozhang <[email protected]>
---
include/linux/interrupt.h | 5 +++++
kernel/sched/cputime.c | 6 ++++--
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index a5091ac97fc6..a88646acaf3f 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -621,6 +621,11 @@ static inline unsigned int local_pending_timers(void)
return __this_cpu_read(pending_timer_softirq);
}
+static inline struct task_struct *this_cpu_ktimers(void)
+{
+ return this_cpu_read(timersd);
+}
+
#else
static inline void raise_timer_softirq(void)
{
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index af7952f12e6c..2393c533314f 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -73,7 +73,8 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
*/
if (pc & HARDIRQ_MASK)
irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
- else if ((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd())
+ else if (((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd() &&
+ (!IS_ENABLED(CONFIG_PREEMPT_RT) || curr != this_cpu_ktimers()))
irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
}
@@ -391,7 +392,8 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
cputime -= other;
- if (this_cpu_ksoftirqd() == p) {
+ if (this_cpu_ksoftirqd() == p ||
+ (IS_ENABLED(CONFIG_PREEMPT_RT) && this_cpu_ktimers() == p)) {
/*
* ksoftirqd time do not get accounted in cpu_softirq_time.
* So, we have to handle it separately here.
--
2.17.1
On 2023-12-01 16:05:41 [+0800], tiozhang wrote:
> In CONFIG_PREEMPT_RT kernel, ktimers also calls __do_softirq,
> so when accounting CPUTIME_SOFTIRQ, ktimers need to be accounted the same
> as ksoftirqd.
I still don't understand why this is a good thing and why want to align
it with ksoftirqd and what breaks if we don't.
This "skip ksoftirqd for accounting" has been added in commit
b52bfee445d31 ("sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time")
At this point (v2.6.37) it had no accounting of time spent in ksoftirqd as
SOFTIRQ time. This was then fixed/ added by commit
414bee9ba613a ("softirqs: Account ksoftirqd time as cpustat softirq")
which went in v2.6.39. It started accounting it when it was noticed by
the tick. So it is less accurate. The "benefit" seems to be that this
accounting pops up in /proc/stat. As per-CPU or overall.
I *guess* this was to align the softirqs which occur at the end of an
interrupt with those which were outsourced to ksoftirqd because they
took too long. This would patch the wording
… wanted to see more complete solution in not accounting irq
processing time to tasks at all.
https://lore.kernel.org/all/[email protected]/
Or it tried to preserve the current status.
A different account occurs for SOFTIRQs if they occur as port of
hardirq and are maybe deferred to ksoftirqd vs a task raising softirqs
on their own like packet over loopback.
Don't see the benefit that but this is my interpretation based on what
it does.
This was v2.6.39. Since then we got threaded interrupts (also v2.6.39)
or threaded-NAPI which utilise mostly the same mechanism as ksoftirqd
but are treated differently. I don't see why ktimers should align with
ksoftirqd and honestly and I don't understand why ksoftirqd had to be
excluded to in the first place.
Sorry, but I would need to go on than this.
> Signed-off-by: tiozhang <[email protected]>
Sebastian
Hi,
Thanks for your kindly interpretation.
Sebastian Andrzej Siewior <[email protected]> 于2023年12月1日周五 11:16写道:
>
> On 2023-12-01 16:05:41 [+0800], tiozhang wrote:
> > In CONFIG_PREEMPT_RT kernel, ktimers also calls __do_softirq,
> > so when accounting CPUTIME_SOFTIRQ, ktimers need to be accounted the same
> > as ksoftirqd.
>
> I still don't understand why this is a good thing and why want to align
> it with ksoftirqd and what breaks if we don't.
My motivation of doing this is to keep CPUTIME_SOFTIRQ in /proc/stat
remaining more accurate in PREEPT_RT kernel.
If we dont align, ktimers' cpu time is added to CPUTIME_SYSTEM when
CONFIG_IRQ_TIME_ACCOUNTING is enabled, make our stats less accurate..
>
> This "skip ksoftirqd for accounting" has been added in commit
> b52bfee445d31 ("sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time")
>
> At this point (v2.6.37) it had no accounting of time spent in ksoftirqd as
> SOFTIRQ time. This was then fixed/ added by commit
> 414bee9ba613a ("softirqs: Account ksoftirqd time as cpustat softirq")
>
> which went in v2.6.39. It started accounting it when it was noticed by
> the tick. So it is less accurate. The "benefit" seems to be that this
> accounting pops up in /proc/stat. As per-CPU or overall.
>
> I *guess* this was to align the softirqs which occur at the end of an
> interrupt with those which were outsourced to ksoftirqd because they
> took too long. This would patch the wording
> … wanted to see more complete solution in not accounting irq
> processing time to tasks at all.
> https://lore.kernel.org/all/[email protected]/
>
> Or it tried to preserve the current status.
>
> A different account occurs for SOFTIRQs if they occur as port of
> hardirq and are maybe deferred to ksoftirqd vs a task raising softirqs
> on their own like packet over loopback.
> Don't see the benefit that but this is my interpretation based on what
> it does.
>
> This was v2.6.39. Since then we got threaded interrupts (also v2.6.39)
> or threaded-NAPI which utilise mostly the same mechanism as ksoftirqd
> but are treated differently. I don't see why ktimers should align with
> ksoftirqd and honestly and I don't understand why ksoftirqd had to be
> excluded to in the first place.
The main diff between ksoftirqd and force-threaded interrupt is that ksoftirq
is in SOFTIRQ_OFFSET (serving softirqs) while force-threaded is in
SOFTIRQ_DISABLE_OFFSET (by using local_disable_bh).
CPUTIME_SOFTIRQ serves for time in SOFTIRQ_OFFSET (processing softirqs).
See
https://lore.kernel.org/all/[email protected]/
So this leads to ksoftirqd is counted into CPUTIME_SOFTIRQ but irq-threads
into CPUTIME_SYSTEM.
Since ktimers is also in SOFTIRQ_OFFSET, align it with ksoftirq will
put it into CPUTIME_SOFTIRQ, making /proc/stat more accurate.
>
> Sorry, but I would need to go on than this.
Thanks for your time.
>
> > Signed-off-by: tiozhang <[email protected]>
>
> Sebastian
On 2023-12-02 05:28:15 [-0500], Yuanhan Zhang wrote:
> Hi,
Hi,
> Sebastian Andrzej Siewior <[email protected]> 于2023年12月1日周五 11:16写道:
> >
> > On 2023-12-01 16:05:41 [+0800], tiozhang wrote:
> > > In CONFIG_PREEMPT_RT kernel, ktimers also calls __do_softirq,
> > > so when accounting CPUTIME_SOFTIRQ, ktimers need to be accounted the same
> > > as ksoftirqd.
> >
> > I still don't understand why this is a good thing and why want to align
> > it with ksoftirqd and what breaks if we don't.
>
> My motivation of doing this is to keep CPUTIME_SOFTIRQ in /proc/stat
> remaining more accurate in PREEPT_RT kernel.
>
> If we dont align, ktimers' cpu time is added to CPUTIME_SYSTEM when
> CONFIG_IRQ_TIME_ACCOUNTING is enabled, make our stats less accurate..
So it is only SYSTEM? There is no additional SOFTIRQ that is used?
> The main diff between ksoftirqd and force-threaded interrupt is that ksoftirq
> is in SOFTIRQ_OFFSET (serving softirqs) while force-threaded is in
> SOFTIRQ_DISABLE_OFFSET (by using local_disable_bh).
This depends. If you look at a network driver, SOFTIRQ_DISABLE_OFFSET is
used during the driver routine doing almost only schedule a softirq.
Then the main part happens during SOFTIRQ_OFFSET where the driver pulls
the packets and passes them to the network stack.
> CPUTIME_SOFTIRQ serves for time in SOFTIRQ_OFFSET (processing softirqs).
> See
> https://lore.kernel.org/all/[email protected]/
Let me look at this later…
> So this leads to ksoftirqd is counted into CPUTIME_SOFTIRQ but irq-threads
> into CPUTIME_SYSTEM.
>
> Since ktimers is also in SOFTIRQ_OFFSET, align it with ksoftirq will
> put it into CPUTIME_SOFTIRQ, making /proc/stat more accurate.
But this only works if it is observed during the TICK interrupt, right?
Sebastian
Hi,
Sebastian Andrzej Siewior <[email protected]> 于2023年12月5日周二 23:31写道:
>
> On 2023-12-02 05:28:15 [-0500], Yuanhan Zhang wrote:
> > Hi,
> Hi,
>
> > Sebastian Andrzej Siewior <[email protected]> 于2023年12月1日周五 11:16写道:
> > >
> > > On 2023-12-01 16:05:41 [+0800], tiozhang wrote:
> > > > In CONFIG_PREEMPT_RT kernel, ktimers also calls __do_softirq,
> > > > so when accounting CPUTIME_SOFTIRQ, ktimers need to be accounted the same
> > > > as ksoftirqd.
> > >
> > > I still don't understand why this is a good thing and why want to align
> > > it with ksoftirqd and what breaks if we don't.
> >
> > My motivation of doing this is to keep CPUTIME_SOFTIRQ in /proc/stat
> > remaining more accurate in PREEPT_RT kernel.
> >
> > If we dont align, ktimers' cpu time is added to CPUTIME_SYSTEM when
> > CONFIG_IRQ_TIME_ACCOUNTING is enabled, make our stats less accurate..
>
> So it is only SYSTEM? There is no additional SOFTIRQ that is used?
Yes, there is some SOFTIRQ, but less than expected.
Here is my little test to explain this, in a reverse way:
I patch my kernel with ksoftirqd not-excluded, like this:
/**/
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0796f938c4f0..6de9eccd094b 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -67,7 +67,7 @@ void irqtime_account_irq(struct task_struct *curr)
*/
if (hardirq_count())
irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
- else if (in_serving_softirq() && curr != this_cpu_ksoftirqd())
+ else if (in_serving_softirq())
irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
}
EXPORT_SYMBOL_GPL(irqtime_account_irq);
@@ -371,14 +371,7 @@ static void irqtime_account_process_tick(struct
task_struct *p, int user_tick,
cputime -= other;
- if (this_cpu_ksoftirqd() == p) {
- /*
- * ksoftirqd time do not get accounted in cpu_softirq_time.
- * So, we have to handle it separately here.
- * Also, p->stime needs to be updated for ksoftirqd.
- */
- account_system_index_time(p, cputime, CPUTIME_SOFTIRQ);
- } else if (user_tick) {
+ if (user_tick) {
account_user_time(p, cputime);
} else if (p == rq->idle) {
account_idle_time(cputime);
/**/
Then I do the following test between patched and unpatched kernels on
the same machine:
1. set loopback rps&xps on cpu0
2. test it with iperf like: `iperf -c 127.0.0.1 -P 1 -t 999 -i 1`, making some
softirqs happening on [ksoftirqd/0].
3. observe delta per second on /proc/stat
observation on *not-excluded ksoftirq patched* kernel:
------------------------------------------------------------------------
// SOFTIRQ stats
root@ubuntu-1804-live-x86-tiozhang:~#
while true;do sleep 1;curr=`cat /proc/stat | grep cpu0 | awk '{print
$8}'`;echo $((curr-last));last=$curr;done
...
13
14
13
13
13
13
14
13
13
13
...
// SYSTEM stats
root@ubuntu-1804-live-x86-tiozhang:~#
while true;do sleep 1;curr=`cat /proc/stat | grep cpu0 | awk '{print
$4}'`;echo $((curr-last));last=$curr;done
...
75
78
76
79
77
77
78
76
76
77
78
77
...
===================================================================================
observation on *normal* kernel:
------------------------------------------------------------------------
// SOFTIRQ stats
root@ubuntu-1804-live-x86-tiozhang-2:~#
while true;do sleep 1;curr=`cat /proc/stat | grep cpu0 | awk '{print
$8}'`;echo $((curr-last));last=$curr;done
...
18
20
22
17
23
18
18
20
17
20
18
18
22
18
22
...
// SYSTEM stats
root@ubuntu-1804-live-x86-tiozhang-2:~#
while true;do sleep 1;curr=`cat /proc/stat | grep cpu0 | awk '{print
$4}'`;echo $((curr-last));last=$curr;done
...
72
74
73
74
72
69
74
76
73
73
74
73
72
73
73
...
It results in if we do not handle ksoftirqd like this, we will have a
bigger SYSTEM and less SOFTIRQ.
So my point is if we do not align ktimers, ktimers would act like
**observation on *not-excluded ksoftirq patched* kernel** part in the
above example,
and this might make SOFTIRQ less than expected, /proc/stat less accurate.
>
> > The main diff between ksoftirqd and force-threaded interrupt is that ksoftirq
> > is in SOFTIRQ_OFFSET (serving softirqs) while force-threaded is in
> > SOFTIRQ_DISABLE_OFFSET (by using local_disable_bh).
>
> This depends. If you look at a network driver, SOFTIRQ_DISABLE_OFFSET is
> used during the driver routine doing almost only schedule a softirq.
> Then the main part happens during SOFTIRQ_OFFSET where the driver pulls
> the packets and passes them to the network stack.
>
> > CPUTIME_SOFTIRQ serves for time in SOFTIRQ_OFFSET (processing softirqs).
> > See
> > https://lore.kernel.org/all/[email protected]/
>
> Let me look at this later…
>
> > So this leads to ksoftirqd is counted into CPUTIME_SOFTIRQ but irq-threads
> > into CPUTIME_SYSTEM.
> >
> > Since ktimers is also in SOFTIRQ_OFFSET, align it with ksoftirq will
> > put it into CPUTIME_SOFTIRQ, making /proc/stat more accurate.
>
> But this only works if it is observed during the TICK interrupt, right?
This may take a long time to read, thanks again for your time and patience.
>
> Sebastian
On Thu, 7 Dec 2023 18:43:47 +0800
Yuanhan Zhang <[email protected]> wrote:
> It results in if we do not handle ksoftirqd like this, we will have a
> bigger SYSTEM and less SOFTIRQ.
And honestly that's what we want. Interrupts and softirqs that execute in
interrupts and softirq context take away from the system. That is, if they
are not explicitly blocked (local_irq_disable/local_bh_disable) they
interrupt the current task and take up the time of the current task. We
need to differentiate this because this context has no "task" context to
measure.
We do not want to add ksoftirqd or threaded interrupt handlers / softirqs
to this measurement. Sure, they are handling interrupt and softirq code,
but they have their own context that can be measured like any other task.
If we blur this with real irqs and softirqs, then we will not know what
those real irqs and softirqs are measuring.
> So my point is if we do not align ktimers, ktimers would act like
> **observation on *not-excluded ksoftirq patched* kernel** part in the
> above example,
> and this might make SOFTIRQ less than expected, /proc/stat less accurate.
No it does not. When a softirq kicks off it's work to a thread (ksoftirq,
threaded softirqd, or simply a workqueue), it's no longer running in
softirq context, and should not be measured as such.
The measurement is not about how much work the softirq is doing (otherwise
we need to add workqueues started by softirqs too), it's about measuring
the actual irq and softirq context. In PREEMPT_RT, we try to eliminate that
context as much as possible.
So seeing less is a feature not a bug!
-- Steve
Hi,
Steven Rostedt <[email protected]> 于2023年12月7日周四 10:35写道:
>
> On Thu, 7 Dec 2023 18:43:47 +0800
> Yuanhan Zhang <[email protected]> wrote:
>
> > It results in if we do not handle ksoftirqd like this, we will have a
> > bigger SYSTEM and less SOFTIRQ.
>
> And honestly that's what we want. Interrupts and softirqs that execute in
> interrupts and softirq context take away from the system. That is, if they
> are not explicitly blocked (local_irq_disable/local_bh_disable) they
> interrupt the current task and take up the time of the current task. We
> need to differentiate this because this context has no "task" context to
> measure.
>
> We do not want to add ksoftirqd or threaded interrupt handlers / softirqs
> to this measurement. Sure, they are handling interrupt and softirq code,
> but they have their own context that can be measured like any other task.
>
> If we blur this with real irqs and softirqs, then we will not know what
> those real irqs and softirqs are measuring.
Yes you say clearly enough and it makes some sense to me!
So my understanding is that in PREEMPT_RT, it is better to put ksoftirqd's time
into SYSTEM since they are just in their "task" context.
If my understanding is right, how about we just exclude ksoftirqd
like what I do in the last email?
(something like
else if (in_serving_softirq() &&
(IS_ENABLED(CONFIG_PREEMPT_RT) || curr != this_cpu_ksoftirqd()))
If this looks okay to you, I'm happy to send a new patch for this :)
>
> > So my point is if we do not align ktimers, ktimers would act like
> > **observation on *not-excluded ksoftirq patched* kernel** part in the
> > above example,
> > and this might make SOFTIRQ less than expected, /proc/stat less accurate.
>
> No it does not. When a softirq kicks off it's work to a thread (ksoftirq,
> threaded softirqd, or simply a workqueue), it's no longer running in
> softirq context, and should not be measured as such.
>
> The measurement is not about how much work the softirq is doing (otherwise
> we need to add workqueues started by softirqs too), it's about measuring
> the actual irq and softirq context. In PREEMPT_RT, we try to eliminate that
> context as much as possible.
Thanks anyway, I think I begin to learn a bit more about PREEMPT_RT...
>
> So seeing less is a feature not a bug!
>
> -- Steve
On Thu, 7 Dec 2023 12:19:28 -0500
Yuanhan Zhang <[email protected]> wrote:
> Hi,
>
> Steven Rostedt <[email protected]> 于2023年12月7日周四 10:35写道:
> >
> > On Thu, 7 Dec 2023 18:43:47 +0800
> > Yuanhan Zhang <[email protected]> wrote:
> >
> > > It results in if we do not handle ksoftirqd like this, we will have a
> > > bigger SYSTEM and less SOFTIRQ.
> >
> > And honestly that's what we want. Interrupts and softirqs that execute in
> > interrupts and softirq context take away from the system. That is, if they
> > are not explicitly blocked (local_irq_disable/local_bh_disable) they
> > interrupt the current task and take up the time of the current task. We
> > need to differentiate this because this context has no "task" context to
> > measure.
> >
> > We do not want to add ksoftirqd or threaded interrupt handlers / softirqs
> > to this measurement. Sure, they are handling interrupt and softirq code,
> > but they have their own context that can be measured like any other task.
> >
> > If we blur this with real irqs and softirqs, then we will not know what
> > those real irqs and softirqs are measuring.
>
> Yes you say clearly enough and it makes some sense to me!
> So my understanding is that in PREEMPT_RT, it is better to put ksoftirqd's time
> into SYSTEM since they are just in their "task" context.
>
> If my understanding is right, how about we just exclude ksoftirqd
> like what I do in the last email?
> (something like
> else if (in_serving_softirq() &&
> (IS_ENABLED(CONFIG_PREEMPT_RT) || curr != this_cpu_ksoftirqd()))
>
> If this looks okay to you, I'm happy to send a new patch for this :)
Hmm, looking at the code in mainline, I may have misspoken. Although,
honestly, I don't know why we want this.
In irqtime_account_process_tick() there's:
if (this_cpu_ksoftirqd() == p) {
/*
* ksoftirqd time do not get accounted in cpu_softirq_time.
* So, we have to handle it separately here.
* Also, p->stime needs to be updated for ksoftirqd.
*/
account_system_index_time(p, cputime, CPUTIME_SOFTIRQ);
Which to me looks like it is counting ksoftirqd for SOFTIRQ time. But
honestly, why do we care about that? What's the difference if ksoftirqd
were to run or softirqd were to pass work off to a workqueue?
ksoftirqd runs in vanilla Linux as SCHED_OTHER. The work it does doesn't
interrupt processes any more than any other kernel thread. I don't know why
we make it "special".
I guess the better question I need to ask is, what is this information used
for? I thought it was how much time was take away from tasks. As current
would be a task, and we do care if a real softirq is running, as we do not
want to add that to the current task accounting.
But for ksoftirqd, that's not the case, and I don't really care if it's
running a softirq or not. As that time isn't interrupting actual tasks. Not
to mention, one could simply look at the ksoftirqd tasks to see how much
time they take up.
>
> >
> > > So my point is if we do not align ktimers, ktimers would act like
> > > **observation on *not-excluded ksoftirq patched* kernel** part in the
> > > above example,
> > > and this might make SOFTIRQ less than expected, /proc/stat less accurate.
> >
> > No it does not. When a softirq kicks off it's work to a thread (ksoftirq,
> > threaded softirqd, or simply a workqueue), it's no longer running in
> > softirq context, and should not be measured as such.
> >
> > The measurement is not about how much work the softirq is doing (otherwise
> > we need to add workqueues started by softirqs too), it's about measuring
> > the actual irq and softirq context. In PREEMPT_RT, we try to eliminate that
> > context as much as possible.
>
> Thanks anyway, I think I begin to learn a bit more about PREEMPT_RT...
Well, I may have been wrong in this information.
I would like to know who is using this information and for what purpose.
Ideally, I would love to make ksoftirqd *not* report its time as handling
SOFTIRQ.
-- Steve
On 2023-12-07 13:18:11 [-0500], Steven Rostedt wrote:
> On Thu, 7 Dec 2023 12:19:28 -0500
> Yuanhan Zhang <[email protected]> wrote:
>
> In irqtime_account_process_tick() there's:
>
> if (this_cpu_ksoftirqd() == p) {
> /*
> * ksoftirqd time do not get accounted in cpu_softirq_time.
> * So, we have to handle it separately here.
> * Also, p->stime needs to be updated for ksoftirqd.
> */
> account_system_index_time(p, cputime, CPUTIME_SOFTIRQ);
>
> Which to me looks like it is counting ksoftirqd for SOFTIRQ time. But
> honestly, why do we care about that? What's the difference if ksoftirqd
> were to run or softirqd were to pass work off to a workqueue?
>
> ksoftirqd runs in vanilla Linux as SCHED_OTHER. The work it does doesn't
> interrupt processes any more than any other kernel thread. I don't know why
> we make it "special".
The special part is that it runs with disabled preemption the whole time
and the scheduler can't do a thing about it. This is different on
PREEMPT_RT where the softirq is preemptible and scheduler can replace it
with another task if suited. It still runs as SCHED_OTHER. The ktimers/
thread runs as SCHED_FIFO 1. So accounting it (incl. ksoftirqd) on
SYSTEM is fine IMHO.
> I guess the better question I need to ask is, what is this information used
> for? I thought it was how much time was take away from tasks. As current
> would be a task, and we do care if a real softirq is running, as we do not
> want to add that to the current task accounting.
>
> But for ksoftirqd, that's not the case, and I don't really care if it's
> running a softirq or not. As that time isn't interrupting actual tasks. Not
> to mention, one could simply look at the ksoftirqd tasks to see how much
> time they take up.
The original argument was to have the softirq counters right in
/proc/stat. This is what I remember from the trip to the museum.
> -- Steve
Sebastian
In PREEMPT_RT kernel, we dont want ksoftirqd's time accounting on SOFTIRQ
since it is available to the scheduler (while it is unpreemptable in
mainline). So we put it into SYSTEM like any other task running in SYSTEM.
With this patch, when ksoftirqd is taking CPU's time, we observe SYSTEM
in /proc/stat would be bigger than before while SOFTIRQ would be less,
which behaves in contract to mainline, but more suitable for PREEMPT_RT.
Signed-off-by: Tio Zhang <[email protected]>
---
kernel/sched/cputime.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index af7952f12e6c..6685bb46805d 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -73,7 +73,8 @@ void irqtime_account_irq(struct task_struct *curr, unsigned int offset)
*/
if (pc & HARDIRQ_MASK)
irqtime_account_delta(irqtime, delta, CPUTIME_IRQ);
- else if ((pc & SOFTIRQ_OFFSET) && curr != this_cpu_ksoftirqd())
+ else if ((pc & SOFTIRQ_OFFSET) &&
+ (IS_ENABLED(CONFIG_PREEMPT_RT) || curr != this_cpu_ksoftirqd()))
irqtime_account_delta(irqtime, delta, CPUTIME_SOFTIRQ);
}
@@ -391,7 +392,7 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
cputime -= other;
- if (this_cpu_ksoftirqd() == p) {
+ if (!IS_ENABLED(CONFIG_PREEMPT_RT) && this_cpu_ksoftirqd() == p) {
/*
* ksoftirqd time do not get accounted in cpu_softirq_time.
* So, we have to handle it separately here.
--
2.17.1
Hi,
Sebastian Andrzej Siewior <[email protected]> 于2023年12月8日周五 04:26写道:
>
> On 2023-12-07 13:18:11 [-0500], Steven Rostedt wrote:
> > On Thu, 7 Dec 2023 12:19:28 -0500
> > Yuanhan Zhang <[email protected]> wrote:
> >
> > In irqtime_account_process_tick() there's:
> >
> > if (this_cpu_ksoftirqd() == p) {
> > /*
> > * ksoftirqd time do not get accounted in cpu_softirq_time.
> > * So, we have to handle it separately here.
> > * Also, p->stime needs to be updated for ksoftirqd.
> > */
> > account_system_index_time(p, cputime, CPUTIME_SOFTIRQ);
> >
> > Which to me looks like it is counting ksoftirqd for SOFTIRQ time. But
> > honestly, why do we care about that? What's the difference if ksoftirqd
> > were to run or softirqd were to pass work off to a workqueue?
> >
> > ksoftirqd runs in vanilla Linux as SCHED_OTHER. The work it does doesn't
> > interrupt processes any more than any other kernel thread. I don't know why
> > we make it "special".
>
> The special part is that it runs with disabled preemption the whole time
> and the scheduler can't do a thing about it. This is different on
> PREEMPT_RT where the softirq is preemptible and scheduler can replace it
> with another task if suited. It still runs as SCHED_OTHER. The ktimers/
> thread runs as SCHED_FIFO 1. So accounting it (incl. ksoftirqd) on
> SYSTEM is fine IMHO.
I send a [PATCH v4]
https://lore.kernel.org/lkml/20231211120209.GA25877@didi-ThinkCentre-M930t-N000/
for making ksoftirqd not special in PREEMPT_RT. It makes SYSTEM more
while SOFTIRQ less if we run softirqs on ksoftirqd (eg NET_*).
If you have further comments please let me know. Thanks a lot!
>
> > I guess the better question I need to ask is, what is this information used
> > for? I thought it was how much time was take away from tasks. As current
> > would be a task, and we do care if a real softirq is running, as we do not
> > want to add that to the current task accounting.
> >
> > But for ksoftirqd, that's not the case, and I don't really care if it's
> > running a softirq or not. As that time isn't interrupting actual tasks. Not
> > to mention, one could simply look at the ksoftirqd tasks to see how much
> > time they take up.
>
> The original argument was to have the softirq counters right in
> /proc/stat. This is what I remember from the trip to the museum.
>
> > -- Steve
>
> Sebastian