2024-02-26 16:16:56

by Jens Axboe

Subject: [PATCH] sched/core: split iowait state into two states

iowait is a bogus metric, but it's helpful in the sense that it allows
short waits to not enter sleep states that have a higher exit latency
than we would've picked for iowait'ing tasks. However, it's harmful in
that lots of applications and monitoring tools assume that iowait is busy
time, or otherwise use it as a health metric. Particularly for async
IO it's entirely nonsensical.

Split the iowait part into two parts - one that tracks whether we need
boosting for short waits, and one that says we need to account the task
as such. ->in_iowait_acct nests inside of ->in_iowait, both for
efficiency reasons, but also so that the relationship between the two
is clear. A waiter may set ->in_iowait alone and not care about the
accounting.

Existing users of nr_iowait() for accounting purposes are switched to
use nr_iowait_acct(), which leaves the governor using nr_iowait() as it
only cares about iowaiters, not the accounting side.

io_schedule_prepare() is changed to return, and io_schedule_finish() to
take, a simple mask of two state bits, as we now have more than one state
to manage. Outside of that, no further changes are needed to support this
generically.

Signed-off-by: Jens Axboe <[email protected]>

---

arch/s390/appldata/appldata_os.c | 2 +-
fs/proc/stat.c | 2 +-
include/linux/sched.h | 6 ++++++
include/linux/sched/stat.h | 10 +++++++--
kernel/sched/core.c | 37 +++++++++++++++++++++++++-------
kernel/sched/sched.h | 1 +
kernel/time/tick-sched.c | 6 +++---
7 files changed, 49 insertions(+), 15 deletions(-)

diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index a363d30ce739..fa4b278aca6c 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -100,7 +100,7 @@ static void appldata_get_os_data(void *data)

os_data->nr_threads = nr_threads;
os_data->nr_running = nr_running();
- os_data->nr_iowait = nr_iowait();
+ os_data->nr_iowait = nr_iowait_acct();
os_data->avenrun[0] = avenrun[0] + (FIXED_1/200);
os_data->avenrun[1] = avenrun[1] + (FIXED_1/200);
os_data->avenrun[2] = avenrun[2] + (FIXED_1/200);
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index da60956b2915..149be7a884fb 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -180,7 +180,7 @@ static int show_stat(struct seq_file *p, void *v)
(unsigned long long)boottime.tv_sec,
total_forks,
nr_running(),
- nr_iowait());
+ nr_iowait_acct());

seq_put_decimal_ull(p, "softirq ", (unsigned long long)sum_softirq);

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffe8f618ab86..1e198e268df1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -922,7 +922,13 @@ struct task_struct {

/* Bit to tell TOMOYO we're in execve(): */
unsigned in_execve:1;
+ /* task is in iowait */
unsigned in_iowait:1;
+ /*
+ * task is in iowait and should be accounted as such. can only be set
+ * if ->in_iowait is also set.
+ */
+ unsigned in_iowait_acct:1;
#ifndef TIF_RESTORE_SIGMASK
unsigned restore_sigmask:1;
#endif
diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
index 0108a38bb64d..7c48a35f98ee 100644
--- a/include/linux/sched/stat.h
+++ b/include/linux/sched/stat.h
@@ -19,8 +19,14 @@ DECLARE_PER_CPU(unsigned long, process_counts);
extern int nr_processes(void);
extern unsigned int nr_running(void);
extern bool single_task_running(void);
-extern unsigned int nr_iowait(void);
-extern unsigned int nr_iowait_cpu(int cpu);
+extern unsigned int nr_iowait_acct(void);
+extern unsigned int nr_iowait_acct_cpu(int cpu);
+unsigned int nr_iowait_cpu(int cpu);
+
+enum {
+ TASK_IOWAIT = 1,
+ TASK_IOWAIT_ACCT = 2,
+};

static inline int sched_info_on(void)
{
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9116bcc90346..c643d44e38e7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3790,6 +3790,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
if (p->in_iowait) {
delayacct_blkio_end(p);
atomic_dec(&task_rq(p)->nr_iowait);
+ if (p->in_iowait_acct)
+ atomic_dec(&task_rq(p)->nr_iowait_acct);
}

activate_task(rq, p, en_flags);
@@ -4356,6 +4358,8 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
if (p->in_iowait) {
delayacct_blkio_end(p);
atomic_dec(&task_rq(p)->nr_iowait);
+ if (p->in_iowait_acct)
+ atomic_dec(&task_rq(p)->nr_iowait_acct);
}

wake_flags |= WF_MIGRATED;
@@ -5461,9 +5465,9 @@ unsigned long long nr_context_switches(void)
* it does become runnable.
*/

-unsigned int nr_iowait_cpu(int cpu)
+unsigned int nr_iowait_acct_cpu(int cpu)
{
- return atomic_read(&cpu_rq(cpu)->nr_iowait);
+ return atomic_read(&cpu_rq(cpu)->nr_iowait_acct);
}

/*
@@ -5496,16 +5500,21 @@ unsigned int nr_iowait_cpu(int cpu)
* Task CPU affinities can make all that even more 'interesting'.
*/

-unsigned int nr_iowait(void)
+unsigned int nr_iowait_acct(void)
{
unsigned int i, sum = 0;

for_each_possible_cpu(i)
- sum += nr_iowait_cpu(i);
+ sum += nr_iowait_acct_cpu(i);

return sum;
}

+unsigned int nr_iowait_cpu(int cpu)
+{
+ return atomic_read(&cpu_rq(cpu)->nr_iowait);
+}
+
#ifdef CONFIG_SMP

/*
@@ -6682,6 +6691,8 @@ static void __sched notrace __schedule(unsigned int sched_mode)

if (prev->in_iowait) {
atomic_inc(&rq->nr_iowait);
+ if (prev->in_iowait_acct)
+ atomic_inc(&rq->nr_iowait_acct);
delayacct_blkio_start();
}
}
@@ -8986,16 +8997,25 @@ EXPORT_SYMBOL_GPL(yield_to);

int io_schedule_prepare(void)
{
- int old_iowait = current->in_iowait;
+ int old_wait_flags = 0;
+
+ if (current->in_iowait)
+ old_wait_flags |= TASK_IOWAIT;
+ if (current->in_iowait_acct)
+ old_wait_flags |= TASK_IOWAIT_ACCT;

current->in_iowait = 1;
+ current->in_iowait_acct = 1;
blk_flush_plug(current->plug, true);
- return old_iowait;
+ return old_wait_flags;
}

-void io_schedule_finish(int token)
+void io_schedule_finish(int old_wait_flags)
{
- current->in_iowait = token;
+ if (!(old_wait_flags & TASK_IOWAIT))
+ current->in_iowait = 0;
+ if (!(old_wait_flags & TASK_IOWAIT_ACCT))
+ current->in_iowait_acct = 0;
}

/*
@@ -10029,6 +10049,7 @@ void __init sched_init(void)
#endif
#endif /* CONFIG_SMP */
hrtick_rq_init(rq);
+ atomic_set(&rq->nr_iowait_acct, 0);
atomic_set(&rq->nr_iowait, 0);

#ifdef CONFIG_SCHED_CORE
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 001fe047bd5d..9006335b01c8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1049,6 +1049,7 @@ struct rq {
u64 clock_idle_copy;
#endif

+ atomic_t nr_iowait_acct;
atomic_t nr_iowait;

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 01fb50c1b17e..f6709d543dac 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -669,7 +669,7 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
delta = ktime_sub(now, ts->idle_entrytime);

write_seqcount_begin(&ts->idle_sleeptime_seq);
- if (nr_iowait_cpu(smp_processor_id()) > 0)
+ if (nr_iowait_acct_cpu(smp_processor_id()) > 0)
ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
else
ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
@@ -742,7 +742,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);

return get_cpu_sleep_time_us(ts, &ts->idle_sleeptime,
- !nr_iowait_cpu(cpu), last_update_time);
+ !nr_iowait_acct_cpu(cpu), last_update_time);
}
EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);

@@ -768,7 +768,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);

return get_cpu_sleep_time_us(ts, &ts->iowait_sleeptime,
- nr_iowait_cpu(cpu), last_update_time);
+ nr_iowait_acct_cpu(cpu), last_update_time);
}
EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);

--
Jens Axboe
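
As a rough illustration of the nesting described in the commit message, the
following userspace model mirrors the flag and counter handling from the hunks
above. The model_* names are hypothetical stand-ins for task_struct, struct rq
and the scheduler paths; only the TASK_IOWAIT and TASK_IOWAIT_ACCT values are
taken from the patch, and blk_flush_plug() and the delayacct calls are left
out.

```c
#include <assert.h>

/* Flag values as added to include/linux/sched/stat.h by the patch. */
enum {
	TASK_IOWAIT		= 1,
	TASK_IOWAIT_ACCT	= 2,
};

/* Hypothetical userspace stand-in for the two task_struct bits. */
struct model_task {
	unsigned in_iowait:1;
	unsigned in_iowait_acct:1;
};

/* Hypothetical stand-in for the two per-runqueue counters. */
struct model_rq {
	int nr_iowait;		/* every iowaiter: feeds the idle/boost heuristics */
	int nr_iowait_acct;	/* subset that is also reported as iowait time */
};

/* Mirrors the __schedule() hunk: a task blocks. */
static void model_block(struct model_rq *rq, const struct model_task *t)
{
	/* The accounting bit is only valid together with the iowait bit. */
	assert(!t->in_iowait_acct || t->in_iowait);

	if (t->in_iowait) {
		rq->nr_iowait++;
		if (t->in_iowait_acct)
			rq->nr_iowait_acct++;
	}
}

/* Mirrors the two wakeup hunks: the task becomes runnable again. */
static void model_wake(struct model_rq *rq, const struct model_task *t)
{
	if (t->in_iowait) {
		rq->nr_iowait--;
		if (t->in_iowait_acct)
			rq->nr_iowait_acct--;
	}
}

/* Mirrors io_schedule_prepare(): save the old bits, then set both. */
static int model_io_schedule_prepare(struct model_task *t)
{
	int old_wait_flags = 0;

	if (t->in_iowait)
		old_wait_flags |= TASK_IOWAIT;
	if (t->in_iowait_acct)
		old_wait_flags |= TASK_IOWAIT_ACCT;

	t->in_iowait = 1;
	t->in_iowait_acct = 1;
	return old_wait_flags;
}

/* Mirrors io_schedule_finish(): clear only the bits that were not set before. */
static void model_io_schedule_finish(struct model_task *t, int old_wait_flags)
{
	if (!(old_wait_flags & TASK_IOWAIT))
		t->in_iowait = 0;
	if (!(old_wait_flags & TASK_IOWAIT_ACCT))
		t->in_iowait_acct = 0;
}
```

In this model, a waiter that sets only in_iowait bumps nr_iowait but not
nr_iowait_acct, while a task that goes through the prepare/finish pair is
counted in both.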



2024-02-27 11:04:48

by Christian Loehle

Subject: Re: [PATCH] sched/core: split iowait state into two states

Hi Jens,

On 26/02/2024 16:15, Jens Axboe wrote:
> iowait is a bogus metric, but it's helpful in the sense that it allows
> short waits to not enter sleep states that have a higher exit latency
> than we would've picked for iowait'ing tasks. However, it's harmful in
> that lots of applications and monitoring tools assume that iowait is busy
> time, or otherwise use it as a health metric. Particularly for async
> IO it's entirely nonsensical.
>
> Split the iowait part into two parts - one that tracks whether we need
> boosting for short waits, and one that says we need to account the task
> as such. ->in_iowait_acct nests inside of ->in_iowait, both for
> efficiency reasons, but also so that the relationship between the two
> is clear. A waiter may set ->in_iowait alone and not care about the
> accounting.
>
> Existing users of nr_iowait() for accounting purposes are switched to
> use nr_iowait_acct(), which leaves the governor using nr_iowait() as it
> only cares about iowaiters, not the accounting side.
>
> io_schedule_prepare() is changed to return, and io_schedule_finish() to
> take, a simple mask of two state bits, as we now have more than one state
> to manage. Outside of that, no further changes are needed to support this
> generically.
> [snip]

Actually there are probably three uses of the in_iowait flag:
1. The (original) accounting use
2. The sleep state heuristic based on nr_iowaiters in cpuidle/governors/menu.c
3. The CPU frequency boost when in_iowait tasks wake up implemented in both
intel_pstate.c and cpufreq_schedutil.c cpufreq governors.

2 & 3 have just been piggybacked onto 1 because they work somewhat, but as
your patch also shows they really don't.
I have been working on a hopefully better approach for 3., I'll use your patch
as a chance to reintroduce the problem. I was going to ask for your thoughts
on the patch anyway.

The piggybacking of 2 and 3 has (IMHO) more dire consequences than just the
fact that you have to accept being accounted for as busy (until now) if you
wanted to make use of 2 and 3.

I assume the intention of your patch is to remove this link for the io_uring
case in particular, given that AFAICT it's the only occurrence actually affected
by your patch (sets in_iowait directly and not the helper functions which will
set both in_iowait and in_iowait_acct).

I think that is the right direction, but if we touch this stuff, can we also
consider reworking it entirely?
Let's take io_uring as an example, not because it's the worst, but because its
overhead is so low it shows the biggest problem (or room for improvement).

The iowait boosting of the CPU frequency will currently lead to e.g.
io_uring NR_CPUS threads with high enough iodepth (let's say 128) to possibly run
all CPUs on the highest, or at least one of the higher OPPs (frequency and
therefore power consumption).
(fio --rw=randread --bs=4k --ioengine=io_uring --iodepth=128 --numjobs=$(nproc) for 12 CPUs)
if we're using e.g. cpufreq_schedutil.c on all of them.
This is an issue as on many systems even running them on the lowest OPP suffices
to saturate the storage device (using cpufreq governor powersave on all).
The frequency boost based on iowait is therefore incredibly wasteful here
and destroys the incredibly low overhead of io_uring and the impact it could have
on energy being spent by the CPU.
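
To make the cost argument concrete, here is a minimal sketch of a
doubling/halving iowait boost. It is only loosely modeled on how
schedutil-style boosting is usually described: the constants (a capacity scale
of 1024 and a starting boost of one eighth of it) and all model_* names are
assumptions for illustration, not code from either governor.

```c
#include <assert.h>

/* Assumed constants: full capacity is 1024, the boost starts at 1/8 of it. */
#define MODEL_CAPACITY_SCALE	1024u
#define MODEL_BOOST_MIN		(MODEL_CAPACITY_SCALE / 8)

/* Hypothetical per-CPU boost state; 0 means "no boost". */
struct model_boost {
	unsigned int boost;
};

/*
 * Called once per frequency update. woke_from_iowait says whether the update
 * was triggered by a task that had in_iowait set when it was woken.
 */
static unsigned int model_boost_update(struct model_boost *b, int woke_from_iowait)
{
	if (woke_from_iowait) {
		/* Start at the minimum, then double on every further wakeup. */
		if (!b->boost)
			b->boost = MODEL_BOOST_MIN;
		else
			b->boost *= 2;
		if (b->boost > MODEL_CAPACITY_SCALE)
			b->boost = MODEL_CAPACITY_SCALE;
	} else {
		/* Decay by halving; drop the boost once it falls below the minimum. */
		b->boost /= 2;
		if (b->boost < MODEL_BOOST_MIN)
			b->boost = 0;
	}

	assert(b->boost <= MODEL_CAPACITY_SCALE);
	return b->boost;
}
```

Under these assumptions four back-to-back iowait wakeups are enough to request
full capacity, which is why a steady stream of completions at high iodepth can
pin a CPU at a high OPP regardless of whether the device needs it.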

Looking at git grep io_schedule and mutex_lock_io, iowait currently means
anything from actual block IO, through sending CXL transactions, to waiting
for DMA fences in the i915 GPU driver.
These things are clearly very different and deserve distinct handling.

Even if we remain in the realm of block io we have, as you already put it
nicely, "for async IO it's entirely nonsensical", but it doesn't stop there.
Writes in general have a similar problem: for some SSDs we just boost the
CPU frequency to land a tiny bit earlier in the SSD's DRAM cache, where it
will be flushed to flash at its convenience (or necessity).
Again, the boosting is entirely wasteful.
Boosting is of course also applied on periodic page cache writebacks for usually
no good reason at all.

I have a patch for 3 that (among other changes) tracks if the boost actually
improved throughput (measured in the only way we currently can, iowait wakeups
per time interval). I think it's an improvement over the current situation, but
it's far from perfect.
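
The patch itself is not shown in this thread, so the following is only a
hypothetical sketch of the general idea (keep boosting only while iowait
wakeups per interval keep rising), not the actual implementation. The interval
handling, the constants and every model_* name are assumptions.

```c
#include <assert.h>

/* Assumed bounds for the boost value. */
#define MODEL_BOOST_MAX	1024u
#define MODEL_BOOST_MIN	128u

/* Hypothetical per-CPU tracker. */
struct model_tracker {
	unsigned int wakeups_now;	/* iowait wakeups in the current interval */
	unsigned int wakeups_prev;	/* iowait wakeups in the previous interval */
	unsigned int boost;		/* current boost, 0 means none */
};

/* Record one wakeup of a task that was in iowait. */
static void model_note_iowait_wakeup(struct model_tracker *t)
{
	t->wakeups_now++;
}

/*
 * Close the current interval. If throughput (wakeups per interval) went up,
 * assume the boost helped and raise it; otherwise back off by halving.
 */
static unsigned int model_interval_end(struct model_tracker *t)
{
	if (t->wakeups_now > t->wakeups_prev) {
		t->boost = t->boost ? t->boost * 2 : MODEL_BOOST_MIN;
		if (t->boost > MODEL_BOOST_MAX)
			t->boost = MODEL_BOOST_MAX;
	} else {
		t->boost /= 2;
	}

	t->wakeups_prev = t->wakeups_now;
	t->wakeups_now = 0;

	assert(t->boost <= MODEL_BOOST_MAX);
	return t->boost;
}
```

A workload whose wakeup rate stays flat while boosted sees the boost decay in
this sketch, which is the behavior wanted for a device that is already
saturated at a low OPP.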

Ideally we would get the three different signals as distinct:
1. iowait_acct
2. iowait_short_sleep or something: we expect to wake up pretty soon due to some IO
(in the case of the block layer, maybe there would even be some estimate of when?)
3. iowait_util_boost to signal we are in a scenario where the time between iowaits
(that the task is potentially using the CPU), is critical to IO throughput and
therefore running it as quickly as possible is worth the energy spending of boosting.
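
One way to picture the three signals as distinct is a small flag set where
each consumer looks only at its own bit. This is a hypothetical sketch: the
MODEL_* names and helper functions are invented here for illustration and do
not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits, one per signal from the list above. */
enum model_iowait_signal {
	MODEL_IOWAIT_ACCT	 = 1 << 0,	/* 1. report the wait as iowait time */
	MODEL_IOWAIT_SHORT_SLEEP = 1 << 1,	/* 2. wakeup expected soon */
	MODEL_IOWAIT_UTIL_BOOST	 = 1 << 2,	/* 3. CPU speed limits IO throughput */
};

#define MODEL_IOWAIT_ALL \
	(MODEL_IOWAIT_ACCT | MODEL_IOWAIT_SHORT_SLEEP | MODEL_IOWAIT_UTIL_BOOST)

/* Accounting side (for example the iowait column of /proc/stat). */
static bool model_counts_as_iowait(unsigned int flags)
{
	assert(!(flags & ~MODEL_IOWAIT_ALL));
	return flags & MODEL_IOWAIT_ACCT;
}

/* Idle governor side: avoid idle states with a high exit latency. */
static bool model_prefers_shallow_idle(unsigned int flags)
{
	assert(!(flags & ~MODEL_IOWAIT_ALL));
	return flags & MODEL_IOWAIT_SHORT_SLEEP;
}

/* Frequency governor side: boost when the task wakes up. */
static bool model_wants_freq_boost(unsigned int flags)
{
	assert(!(flags & ~MODEL_IOWAIT_ALL));
	return flags & MODEL_IOWAIT_UTIL_BOOST;
}
```

With such a split, a waiter could ask for shallow idle states without being
accounted as iowait and without requesting a frequency boost.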

Ideally we (the sched folks) would move away from these iowait-piggybacked
heuristics and try to get as much information as possible from e.g. the
block layer and act accordingly.
At least for the iowait boosting of frequency I would claim the heuristics are
wrong more often than not.

Would love to hear your thoughts and thanks for the patch (and apologies for
this scope-explosion, but I think the discussion is worth having).

Kind Regards,
Christian

2024-02-27 12:54:28

by Jens Axboe

Subject: Re: [PATCH] sched/core: split iowait state into two states

On 2/27/24 3:50 AM, Christian Loehle wrote:
> Hi Jens,
>
> On 26/02/2024 16:15, Jens Axboe wrote:
>> iowait is a bogus metric, but it's helpful in the sense that it allows
>> short waits to not enter sleep states that have a higher exit latency
>> than we would've picked for iowait'ing tasks. However, it's harmful in
>> that lots of applications and monitoring tools assume that iowait is busy
>> time, or otherwise use it as a health metric. Particularly for async
>> IO it's entirely nonsensical.
>>
>> Split the iowait part into two parts - one that tracks whether we need
>> boosting for short waits, and one that says we need to account the task
>> as such. ->in_iowait_acct nests inside of ->in_iowait, both for
>> efficiency reasons, but also so that the relationship between the two
>> is clear. A waiter may set ->in_iowait alone and not care about the
>> accounting.
>>
>> Existing users of nr_iowait() for accounting purposes are switched to
>> use nr_iowait_acct(), which leaves the governor using nr_iowait() as it
>> only cares about iowaiters, not the accounting side.
>>
>> io_schedule_prepare() is changed to return, and io_schedule_finish() to
>> take, a simple mask of two state bits, as we now have more than one state
>> to manage. Outside of that, no further changes are needed to support this
>> generically.
>> [snip]
>
> Actually there are probably three uses of the in_iowait flag
> 1. The (original) accounting use
> 2. The sleep state heuristic based on nr_iowaiters in cpuidle/governors/menu.c
> 3. The CPU frequency boost when in_iowait tasks wake up implemented in both
> intel_pstate.c and cpufreq_schedutil.c cpufreq governors.
>
> 2 & 3 have just been piggybacked onto 1 because they work somewhat, but as
> your patch also shows they really don't.

Right, I did collapse 2 & 3 into cpufreq related sleep/wakeup latencies.

> I have been working on a hopefully better approach for 3., I'll use
> your patch as a chance to reintroduce the problem. I was going to ask
> for your thoughts on the patch anyway.
>
> The piggybacking of 2 and 3 has (IMHO) more dire consequences than
> just the fact that you have to accept being accounted for as busy
> (until now) if you wanted to make use of 2 and 3.
>
> I assume the intention of your patch is to remove this link for the
> io_uring case in particular, given that AFAICT it's the only occurrence
> actually affected by your patch (sets in_iowait directly and not the
> helper functions which will set both in_iowait and in_iowait_acct).

Right. It doesn't matter too much for storage as people kind of expect
iowait on that side, but for high frequency network IO (or just
networked IO in general), adding iowait to the mix tends to confuse
application owners. And since stat is mostly garbage anyway, I can
either spend time arguing with people that it's a useless metric, or I
can do something about it and just eliminate it on my side for good.
BTW, reading your email you seem to equate io_uring with storage, this
is very much not the case. Just wanted to clarify that this is in no way
storage specific.

> I think that is the right direction, but if we touch this stuff, can
> we also consider reworking it entirely? Let's take io_uring as an
> example, not because it's the worst, but because its overhead is so
> low it shows the biggest problem (or room for improvement).

Sure, I have no objections to that, though I do want to fix the
immediate problem of just getting rid of iowait accounting. As I don't
think the next step is immediately obvious, I'd prefer if we can at
least fix the immediate issue and defer a rework to a step 2.

> The iowait boosting of the CPU frequency will currently lead to e.g.
> io_uring NR_CPUS threads with high enough iodepth (let's say 128) to
> possibly run all CPUs on the highest, or at least one of the higher
> OPPs (frequency and therefore power consumption). (fio --rw=randread
> --bs=4k --ioengine=io_uring --iodepth=128 --numjobs=$(nproc) for 12
> CPUs) if we're using e.g. cpufreq_schedutil.c on all of them. This is
> an issue as on many systems even running them on the lowest OPP
> suffices to saturate the storage device (using cpufreq governor
> powersave on all). The frequency boost based on iowait is therefore
> incredibly wasteful here and destroys the incredibly low overhead of
> io_uring and the impact it could have on energy being spent by the
> CPU.

To be honest, I think it's hard to generalize on that. For the above
example, it completely depends on what you're driving. If this is 12
CPUs doing IO to N devices, what kind of devices are these? Are they
doing millions of IOPS each, or is it 100k each? A storage device is
many things, and you can easily have a storage device that it would take
more than one CPU to fully saturate. Or you can have 12 of them that one
can easily saturate.

> Looking at git grep io_schedule and mutex_lock_io, iowait currently
> means anything from actual block IO, through sending CXL transactions, to
> waiting for DMA fences in the i915 GPU driver. These things are clearly
> very different and deserve distinct handling.
>
> Even if we remain in the realm of block io we have, as you already put
> it nicely, "for async IO it's entirely nonsensical", but it doesn't
> stop there. Writes in general have a similar problem: for some SSDs we
> just boost the CPU frequency to land a tiny bit earlier in the SSD's
> DRAM cache, where it will be flushed to flash at its convenience (or
> necessity). Again, the boosting is entirely wasteful. Boosting is of
> course also applied on periodic page cache writebacks for usually no
> good reason at all.

I don't think that is true at all. We don't boost to have data land in
the drive cache earlier, we boost so that:

1) prepare IO to device
2) submit IO to device
3) wait on IO completion, task goes to sleep
4) IO completes, wake task
5) task wakes up, gets completion

the last two steps here aren't burdened by latencies that are higher
than they need to be, IOW uses 2 & 3 from your list above. This is why I'm
bundling them into one, as they really are the same thing from that
perspective.
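
The wakeup latency in steps 4 and 5 depends in part on how deep an idle state
the CPU entered in step 3. Below is a minimal sketch of how a per-CPU iowaiter
count can steer that choice. It is loosely modeled on the menu governor's
performance multiplier as commonly described (roughly a 10x tighter latency
limit per iowaiter); the state table, its latencies and all model_* names are
assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical idle states, ordered from shallowest to deepest. */
struct model_idle_state {
	const char *name;
	unsigned int exit_latency_us;
};

static const struct model_idle_state model_states[] = {
	{ "shallow", 2 },
	{ "medium", 100 },
	{ "deep", 1000 },
};

/*
 * Tighten the tolerated exit latency as the number of iowaiters on this CPU
 * grows. The "1 + 10 * n" scaling is an assumption for this sketch.
 */
static unsigned int model_latency_limit(unsigned int predicted_us,
					unsigned int nr_iowaiters)
{
	return predicted_us / (1 + 10 * nr_iowaiters);
}

/* Pick the deepest state whose exit latency still fits under the limit. */
static size_t model_pick_state(unsigned int predicted_us,
			       unsigned int nr_iowaiters)
{
	unsigned int limit = model_latency_limit(predicted_us, nr_iowaiters);
	size_t nr_states = sizeof(model_states) / sizeof(model_states[0]);
	size_t pick = 0;	/* fall back to the shallowest state */

	for (size_t i = 0; i < nr_states; i++) {
		if (model_states[i].exit_latency_us <= limit)
			pick = i;
	}

	assert(pick < nr_states);
	return pick;
}
```

With a predicted sleep of 2000us, this sketch picks the deep state with no
iowaiters, the medium one with a single iowaiter, and the shallow one with ten.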

> I have a patch for 3 that (among other changes) tracks if the boost
> actually improved throughput (measured in the only way we currently
> can, iowait wakeups per time interval). I think it's an improvement
> over the current situation, but it's far from perfect.
>
> Ideally we would get the three different signals as distinct:
> 1. iowait_acct
> 2. iowait_short_sleep or something: we expect to wake up pretty soon
> due to some IO (in the case of the block layer, maybe there would
> even be some estimate of when?)
> 3. iowait_util_boost to signal we are in a scenario where the time
> between iowaits (that the task is potentially using the CPU), is
> critical to IO throughput and therefore running it as quickly as
> possible is worth the energy spending of boosting.
>
>
> Ideally we (the sched folks) would move away from these
> iowait-piggybacked heuristics and try to get as much information as
> possible from e.g. the block layer and act accordingly. At least
> for the iowait boosting of frequency I would claim the heuristics are
> wrong more often than not.
>
> Would love to hear your thoughts and thanks for the patch (and
> apologies for this scope-explosion, but I think the discussion is
> worth having).

As mentioned higher up, I do agree that there's room for improvement for
the heuristics in general, and I'll be more than happy to help test and
help with the block layer or io_uring side of things too. If we can get
good latencies when we need them and be cognizant of power at the same
time, that's certainly a win all around.

However, I would greatly prefer to sort out the mixing up of iowait
accounting and boosting first as it's a much simpler problem and
deserves fixing separately, and is not one that will inevitably get
complicated as it needs to coordinate across layers. Outside of that,
any change in heuristics for that will need considerable testing,
whereas the existing one is not encumbered by that.

I'll respin this version to try and avoid the atomics here, as that was
a comment that Peter had. If we can improve the existing nr_iowait
accounting and logic with that as well, then I think that's an exercise
that's worthwhile separately.

--
Jens Axboe