2024-02-28 19:24:09

by Jens Axboe

Subject: [PATCHSET v3 0/2] Split iowait into two states

Hi,

This is v3 of the patchset where the current in_iowait state is split
into two parts:

1) The "task is sleeping waiting on IO", and would like cpufreq goodness
in terms of sleep and wakeup latencies.
2) The above, and also accounted as such in the iowait stats.

The current ->in_iowait covers both. With this series we have ->in_iowait
for step 1 above, and ->in_iowait_acct for step 2. You cannot be
->in_iowait_acct without also having ->in_iowait set.
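
As a rough sketch of what the split buys a caller (hypothetical usage, not
part of this series), a waiter that wants the cpufreq/idle boost but should
not show up in the iowait stats would set only the first flag:

	/* boost-only wait: cpufreq sees an iowait'er, the stats do not */
	set_current_state(TASK_INTERRUPTIBLE);
	current->in_iowait = 1;		/* ->in_iowait_acct stays 0 */
	schedule();
	current->in_iowait = 0;

A "classic" iowait sleep, e.g. via io_schedule(), sets both flags.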

Patch 1 is a prep patch that turns rq->nr_iowait into an int rather than
an atomic_t. Reasons are given in that patch. It can stand alone, as it
should not have any functional changes outside of improving the handling
of iowait a bit.

Patch 2 adds the ->in_iowait_acct stage inside the current ->in_iowait
setting.

I wasn't super happy with needing to grab the rq lock for the one case
where we still need it, and woke up with a better idea of how to do this.
It's all in patch 1. The nice part here is that we get rid of atomics for
3 out of the 4 iowait inc/dec spots. Patch 2 just follows the same principle.

Comments welcome! Peter, CC'ing you since I did on the previous posting,
feel free to ignore.

Since v2:
- Drop need for rq lock for the remote case by turning both nr_iowait
and nr_iowait_acct into the difference between a remote atomic count
and the local non-atomic one.

--
Jens Axboe



2024-02-28 19:24:18

by Jens Axboe

Subject: [PATCH 1/2] sched/core: switch struct rq->nr_iowait to a normal int

In 3 of the 4 spots where we modify rq->nr_iowait we already hold the
rq lock, and hence don't need atomics to modify the current per-rq
iowait count. In the 4th case, where we are scheduling in on a different
CPU than the task was previously on, we do not hold the previous rq lock,
and hence still need to use an atomic to increment the iowait count.

Rename the existing nr_iowait to nr_iowait_remote, and use that for the
4th case. The other three cases can simply inc/dec in a non-atomic
fashion under the held rq lock.

The per-rq iowait now becomes the difference between the two, the local
count minus the remote count.
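
A quick worked example of that bookkeeping (numbers purely illustrative):

	/*
	 * Two tasks on CPU3 block in iowait; one of them is later woken
	 * up from CPU7:
	 *
	 *   __schedule() on CPU3, twice:   rq->nr_iowait        = 2
	 *   try_to_wake_up() on CPU7:      rq->nr_iowait_remote = 1
	 *
	 *   nr_iowait_cpu(3) = 2 - 1 = 1 task still waiting on IO
	 */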

Signed-off-by: Jens Axboe <[email protected]>
---
kernel/sched/core.c | 15 ++++++++++-----
kernel/sched/cputime.c | 3 +--
kernel/sched/sched.h | 8 +++++++-
3 files changed, 18 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9116bcc90346..48d15529a777 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3789,7 +3789,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
#endif
if (p->in_iowait) {
delayacct_blkio_end(p);
- atomic_dec(&task_rq(p)->nr_iowait);
+ task_rq(p)->nr_iowait--;
}

activate_task(rq, p, en_flags);
@@ -4354,8 +4354,10 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
cpu = select_task_rq(p, p->wake_cpu, wake_flags | WF_TTWU);
if (task_cpu(p) != cpu) {
if (p->in_iowait) {
+ struct rq *__rq = task_rq(p);
+
delayacct_blkio_end(p);
- atomic_dec(&task_rq(p)->nr_iowait);
+ atomic_inc(&__rq->nr_iowait_remote);
}

wake_flags |= WF_MIGRATED;
@@ -5463,7 +5465,9 @@ unsigned long long nr_context_switches(void)

unsigned int nr_iowait_cpu(int cpu)
{
- return atomic_read(&cpu_rq(cpu)->nr_iowait);
+ struct rq *rq = cpu_rq(cpu);
+
+ return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);
}

/*
@@ -6681,7 +6685,7 @@ static void __sched notrace __schedule(unsigned int sched_mode)
deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);

if (prev->in_iowait) {
- atomic_inc(&rq->nr_iowait);
+ rq->nr_iowait++;
delayacct_blkio_start();
}
}
@@ -10029,7 +10033,8 @@ void __init sched_init(void)
#endif
#endif /* CONFIG_SMP */
hrtick_rq_init(rq);
- atomic_set(&rq->nr_iowait, 0);
+ rq->nr_iowait = 0;
+ atomic_set(&rq->nr_iowait_remote, 0);

#ifdef CONFIG_SCHED_CORE
rq->core = rq;
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index af7952f12e6c..0ed81c2d3c3b 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -222,9 +222,8 @@ void account_steal_time(u64 cputime)
void account_idle_time(u64 cputime)
{
u64 *cpustat = kcpustat_this_cpu->cpustat;
- struct rq *rq = this_rq();

- if (atomic_read(&rq->nr_iowait) > 0)
+ if (nr_iowait_cpu(smp_processor_id()) > 0)
cpustat[CPUTIME_IOWAIT] += cputime;
else
cpustat[CPUTIME_IDLE] += cputime;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 001fe047bd5d..91fa5b4d45ed 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1049,7 +1049,13 @@ struct rq {
u64 clock_idle_copy;
#endif

- atomic_t nr_iowait;
+ /*
+ * Total per-cpu iowait is the difference of the two below. One is
+ * modified under the rq lock (nr_iowait), and if we don't have the rq
+ * lock, then nr_iowait_remote is used.
+ */
+ unsigned int nr_iowait;
+ atomic_t nr_iowait_remote;

#ifdef CONFIG_SCHED_DEBUG
u64 last_seen_need_resched_ns;
--
2.43.0


2024-02-28 19:24:28

by Jens Axboe

Subject: [PATCH 2/2] sched/core: split iowait state into two states

iowait is a bogus metric, but it's helpful in the sense that it allows
short waits to not enter sleep states that have a higher exit latency
than we would've picked for iowait'ing tasks. However, it's harmful in
that lots of applications and monitoring assume that iowait is busy
time, or otherwise use it as a health metric. Particularly for async
IO it's entirely nonsensical.

Split the iowait part into two parts - one that tracks whether we need
boosting for short waits, and one that says we need to account the task
as such. ->in_iowait_acct nests inside of ->in_iowait, both for
efficiency reasons, but also so that the relationship between the two
is clear. A waiter may set ->in_iowait alone and not care about the
accounting.

Existing users of nr_iowait() for accounting purposes are switched to
use nr_iowait_acct(), which leaves the governor using nr_iowait() as it
only cares about iowaiters, not the accounting side.

io_schedule_prepare() and io_schedule_finish() are changed to pass around
a simple mask of the two state bits, as we now have more than one state to
manage. Outside of that, no further changes are needed to support this
generically.
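
A sketch of how a caller uses the returned mask; do_blocking_io() is just a
stand-in for the actual wait:

	int tok;

	tok = io_schedule_prepare();	/* saves the current iowait bits, sets both */
	do_blocking_io();		/* placeholder for the blocking operation */
	io_schedule_finish(tok);	/* restores the saved TASK_IOWAIT/TASK_IOWAIT_ACCT state */

This mirrors how helpers like mutex_lock_io() already wrap their sleep in
prepare/finish.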

Signed-off-by: Jens Axboe <[email protected]>
---
arch/s390/appldata/appldata_base.c | 2 +-
arch/s390/appldata/appldata_os.c | 2 +-
fs/proc/stat.c | 2 +-
include/linux/sched.h | 6 ++++
include/linux/sched/stat.h | 10 +++++--
kernel/sched/core.c | 45 ++++++++++++++++++++++++------
kernel/sched/sched.h | 2 ++
kernel/time/tick-sched.c | 6 ++--
8 files changed, 59 insertions(+), 16 deletions(-)

diff --git a/arch/s390/appldata/appldata_base.c b/arch/s390/appldata/appldata_base.c
index c2978cb03b36..6844b5294a8b 100644
--- a/arch/s390/appldata/appldata_base.c
+++ b/arch/s390/appldata/appldata_base.c
@@ -423,4 +423,4 @@ EXPORT_SYMBOL_GPL(si_swapinfo);
#endif
EXPORT_SYMBOL_GPL(nr_threads);
EXPORT_SYMBOL_GPL(nr_running);
-EXPORT_SYMBOL_GPL(nr_iowait);
+EXPORT_SYMBOL_GPL(nr_iowait_acct);
diff --git a/arch/s390/appldata/appldata_os.c b/arch/s390/appldata/appldata_os.c
index a363d30ce739..fa4b278aca6c 100644
--- a/arch/s390/appldata/appldata_os.c
+++ b/arch/s390/appldata/appldata_os.c
@@ -100,7 +100,7 @@ static void appldata_get_os_data(void *data)

os_data->nr_threads = nr_threads;
os_data->nr_running = nr_running();
- os_data->nr_iowait = nr_iowait();
+ os_data->nr_iowait = nr_iowait_acct();
os_data->avenrun[0] = avenrun[0] + (FIXED_1/200);
os_data->avenrun[1] = avenrun[1] + (FIXED_1/200);
os_data->avenrun[2] = avenrun[2] + (FIXED_1/200);
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index da60956b2915..149be7a884fb 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -180,7 +180,7 @@ static int show_stat(struct seq_file *p, void *v)
(unsigned long long)boottime.tv_sec,
total_forks,
nr_running(),
- nr_iowait());
+ nr_iowait_acct());

seq_put_decimal_ull(p, "softirq ", (unsigned long long)sum_softirq);

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ffe8f618ab86..1e198e268df1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -922,7 +922,13 @@ struct task_struct {

/* Bit to tell TOMOYO we're in execve(): */
unsigned in_execve:1;
+ /* task is in iowait */
unsigned in_iowait:1;
+ /*
+ * task is in iowait and should be accounted as such. can only be set
+ * if ->in_iowait is also set.
+ */
+ unsigned in_iowait_acct:1;
#ifndef TIF_RESTORE_SIGMASK
unsigned restore_sigmask:1;
#endif
diff --git a/include/linux/sched/stat.h b/include/linux/sched/stat.h
index 0108a38bb64d..7c48a35f98ee 100644
--- a/include/linux/sched/stat.h
+++ b/include/linux/sched/stat.h
@@ -19,8 +19,14 @@ DECLARE_PER_CPU(unsigned long, process_counts);
extern int nr_processes(void);
extern unsigned int nr_running(void);
extern bool single_task_running(void);
-extern unsigned int nr_iowait(void);
-extern unsigned int nr_iowait_cpu(int cpu);
+extern unsigned int nr_iowait_acct(void);
+extern unsigned int nr_iowait_acct_cpu(int cpu);
+unsigned int nr_iowait_cpu(int cpu);
+
+enum {
+ TASK_IOWAIT = 1,
+ TASK_IOWAIT_ACCT = 2,
+};

static inline int sched_info_on(void)
{
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 48d15529a777..66a3654aab5d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3790,6 +3790,8 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
if (p->in_iowait) {
delayacct_blkio_end(p);
task_rq(p)->nr_iowait--;
+ if (p->in_iowait_acct)
+ task_rq(p)->nr_iowait_acct--;
}

activate_task(rq, p, en_flags);
@@ -4358,6 +4360,8 @@ int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)

delayacct_blkio_end(p);
atomic_inc(&__rq->nr_iowait_remote);
+ if (p->in_iowait_acct)
+ atomic_inc(&__rq->nr_iowait_acct_remote);
}

wake_flags |= WF_MIGRATED;
@@ -5463,11 +5467,11 @@ unsigned long long nr_context_switches(void)
* it does become runnable.
*/

-unsigned int nr_iowait_cpu(int cpu)
+unsigned int nr_iowait_acct_cpu(int cpu)
{
struct rq *rq = cpu_rq(cpu);

- return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);
+ return rq->nr_iowait_acct - atomic_read(&rq->nr_iowait_acct_remote);
}

/*
@@ -5500,16 +5504,23 @@ unsigned int nr_iowait_cpu(int cpu)
* Task CPU affinities can make all that even more 'interesting'.
*/

-unsigned int nr_iowait(void)
+unsigned int nr_iowait_acct(void)
{
unsigned int i, sum = 0;

for_each_possible_cpu(i)
- sum += nr_iowait_cpu(i);
+ sum += nr_iowait_acct_cpu(i);

return sum;
}

+unsigned int nr_iowait_cpu(int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+
+ return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);
+}
+
#ifdef CONFIG_SMP

/*
@@ -6686,6 +6697,8 @@ static void __sched notrace __schedule(unsigned int sched_mode)

if (prev->in_iowait) {
rq->nr_iowait++;
+ if (prev->in_iowait_acct)
+ rq->nr_iowait_acct++;
delayacct_blkio_start();
}
}
@@ -8988,18 +9001,32 @@ int __sched yield_to(struct task_struct *p, bool preempt)
}
EXPORT_SYMBOL_GPL(yield_to);

+/*
+ * Returns a token which is comprised of the two bits of iowait wait state -
+ * one is whether we're marking ourselves as in iowait for cpufreq reasons,
+ * and the other is if the task should be accounted as such.
+ */
int io_schedule_prepare(void)
{
- int old_iowait = current->in_iowait;
+ int old_wait_flags = 0;
+
+ if (current->in_iowait)
+ old_wait_flags |= TASK_IOWAIT;
+ if (current->in_iowait_acct)
+ old_wait_flags |= TASK_IOWAIT_ACCT;

current->in_iowait = 1;
+ current->in_iowait_acct = 1;
blk_flush_plug(current->plug, true);
- return old_iowait;
+ return old_wait_flags;
}

-void io_schedule_finish(int token)
+void io_schedule_finish(int old_wait_flags)
{
- current->in_iowait = token;
+ if (!(old_wait_flags & TASK_IOWAIT))
+ current->in_iowait = 0;
+ if (!(old_wait_flags & TASK_IOWAIT_ACCT))
+ current->in_iowait_acct = 0;
}

/*
@@ -10033,6 +10060,8 @@ void __init sched_init(void)
#endif
#endif /* CONFIG_SMP */
hrtick_rq_init(rq);
+ rq->nr_iowait_acct = 0;
+ atomic_set(&rq->nr_iowait_acct_remote, 0);
rq->nr_iowait = 0;
atomic_set(&rq->nr_iowait_remote, 0);

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 91fa5b4d45ed..abd7a938bc99 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1054,6 +1054,8 @@ struct rq {
* modified under the rq lock (nr_iowait), and if we don't have the rq
* lock, then nr_iowait_remote is used.
*/
+ unsigned int nr_iowait_acct;
+ atomic_t nr_iowait_acct_remote;
unsigned int nr_iowait;
atomic_t nr_iowait_remote;

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 01fb50c1b17e..f6709d543dac 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -669,7 +669,7 @@ static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
delta = ktime_sub(now, ts->idle_entrytime);

write_seqcount_begin(&ts->idle_sleeptime_seq);
- if (nr_iowait_cpu(smp_processor_id()) > 0)
+ if (nr_iowait_acct_cpu(smp_processor_id()) > 0)
ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
else
ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
@@ -742,7 +742,7 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);

return get_cpu_sleep_time_us(ts, &ts->idle_sleeptime,
- !nr_iowait_cpu(cpu), last_update_time);
+ !nr_iowait_acct_cpu(cpu), last_update_time);
}
EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);

@@ -768,7 +768,7 @@ u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);

return get_cpu_sleep_time_us(ts, &ts->iowait_sleeptime,
- nr_iowait_cpu(cpu), last_update_time);
+ nr_iowait_acct_cpu(cpu), last_update_time);
}
EXPORT_SYMBOL_GPL(get_cpu_iowait_time_us);

--
2.43.0


2024-02-29 16:53:50

by Thomas Gleixner

Subject: Re: [PATCH 1/2] sched/core: switch struct rq->nr_iowait to a normal int

On Wed, Feb 28 2024 at 12:16, Jens Axboe wrote:
> In 3 of the 4 spots where we modify rq->nr_iowait we already hold the

We modify something and hold locks? It's documented that changelogs
should not impersonate code. It simply does not make any sense.

> rq lock, and hence don't need atomics to modify the current per-rq
> iowait count. In the 4th case, where we are scheduling in on a different
> CPU than the task was previously on, we do not hold the previous rq lock,
> and hence still need to use an atomic to increment the iowait count.
>
> Rename the existing nr_iowait to nr_iowait_remote, and use that for the
> 4th case. The other three cases can simply inc/dec in a non-atomic
> fashion under the held rq lock.
>
> The per-rq iowait now becomes the difference between the two, the local
> count minus the remote count.
>
> Signed-off-by: Jens Axboe <[email protected]>

Other than that:

Reviewed-by: Thomas Gleixner <[email protected]>

2024-02-29 17:19:49

by Jens Axboe

Subject: Re: [PATCH 1/2] sched/core: switch struct rq->nr_iowait to a normal int

On 2/29/24 9:53 AM, Thomas Gleixner wrote:
> On Wed, Feb 28 2024 at 12:16, Jens Axboe wrote:
>> In 3 of the 4 spots where we modify rq->nr_iowait we already hold the
>
> We modify something and hold locks? It's documented that changelogs
> should not impersonate code. It simply does not make any sense.

Agree it doesn't read that well... It's meant to say that we already
hold the rq lock in 3 of the 4 spots, hence using atomic_inc/dec is
pointless for those cases.

> Other than that:
>
> Reviewed-by: Thomas Gleixner <[email protected]>

Thanks for the review!

--
Jens Axboe


2024-02-29 17:31:19

by Thomas Gleixner

Subject: Re: [PATCH 2/2] sched/core: split iowait state into two states

On Wed, Feb 28 2024 at 12:16, Jens Axboe wrote:
> iowait is a bogus metric, but it's helpful in the sense that it allows
> short waits to not enter sleep states that have a higher exit latency
> than we would've picked for iowait'ing tasks. However, it's harmful in
> that lots of applications and monitoring assume that iowait is busy
> time, or otherwise use it as a health metric. Particularly for async
> IO it's entirely nonsensical.
>
> Split the iowait part into two parts - one that tracks whether we need
> boosting for short waits, and one that says we need to account the
> task

We :)

> as such. ->in_iowait_acct nests inside of ->in_iowait, both for
> efficiency reasons, but also so that the relationship between the two
> is clear. A waiter may set ->in_iowait alone and not care about the
> accounting.

> +/*
> + * Returns a token which is comprised of the two bits of iowait wait state -
> + * one is whether we're marking ourselves as in iowait for cpufreq reasons,
> + * and the other is if the task should be accounted as such.
> + */
> int io_schedule_prepare(void)
> {
> - int old_iowait = current->in_iowait;
> + int old_wait_flags = 0;
> +
> + if (current->in_iowait)
> + old_wait_flags |= TASK_IOWAIT;
> + if (current->in_iowait_acct)
> + old_wait_flags |= TASK_IOWAIT_ACCT;
>
> current->in_iowait = 1;
> + current->in_iowait_acct = 1;
> blk_flush_plug(current->plug, true);
> - return old_iowait;
> + return old_wait_flags;
> }
>
> -void io_schedule_finish(int token)
> +void io_schedule_finish(int old_wait_flags)
> {
> - current->in_iowait = token;
> + if (!(old_wait_flags & TASK_IOWAIT))
> + current->in_iowait = 0;
> + if (!(old_wait_flags & TASK_IOWAIT_ACCT))
> + current->in_iowait_acct = 0;

Why? TASK_IOWAIT_ACCT requires TASK_IOWAIT, right? So if TASK_IOWAIT was
not set then TASK_IOWAIT_ACCT must have been clear too, no?

Thanks,

tglx

2024-02-29 17:42:59

by Thomas Gleixner

Subject: Re: [PATCH 1/2] sched/core: switch struct rq->nr_iowait to a normal int

On Thu, Feb 29 2024 at 10:19, Jens Axboe wrote:
> On 2/29/24 9:53 AM, Thomas Gleixner wrote:
>> On Wed, Feb 28 2024 at 12:16, Jens Axboe wrote:
>>> In 3 of the 4 spots where we modify rq->nr_iowait we already hold the
>>
>> We modify something and hold locks? It's documented that changelogs
>> should not impersonate code. It simply does not make any sense.
>
> Agree it doesn't read that well... It's meant to say that we already
> hold the rq lock in 3 of the 4 spots, hence using atomic_inc/dec is
> pointless for those cases.

That and the 'we'. Write it neutral.

The accounting of rq::nr_iowait is using an atomic_t but 3 out of 4
places hold runqueue lock already. ....

So but I just noticed that there is actually an issue with this:

> unsigned int nr_iowait_cpu(int cpu)
> {
> - return atomic_read(&cpu_rq(cpu)->nr_iowait);
> + struct rq *rq = cpu_rq(cpu);
> +
> + return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);

The access to rq->nr_iowait is not protected by the runqueue lock and
therefore a data race when @cpu is not the current CPU.

This needs to be properly annotated and explained why it does not
matter.

So s/Reviewed-by/Un-Reviewed-by/

Though thinking about it some more. Is this split a real benefit over
always using the atomic? Do you have numbers to show?

Thanks,

tglx




2024-02-29 17:46:03

by Jens Axboe

Subject: Re: [PATCH 2/2] sched/core: split iowait state into two states

On 2/29/24 10:31 AM, Thomas Gleixner wrote:
> On Wed, Feb 28 2024 at 12:16, Jens Axboe wrote:
>> iowait is a bogus metric, but it's helpful in the sense that it allows
>> short waits to not enter sleep states that have a higher exit latency
>> than we would've picked for iowait'ing tasks. However, it's harmful in
>> that lots of applications and monitoring assume that iowait is busy
>> time, or otherwise use it as a health metric. Particularly for async
>> IO it's entirely nonsensical.
>>
>> Split the iowait part into two parts - one that tracks whether we need
>> boosting for short waits, and one that says we need to account the
>> task
>
> We :)

I appreciate the commit message police :-)

I'll rewrite it.

>> +/*
>> + * Returns a token which is comprised of the two bits of iowait wait state -
>> + * one is whether we're marking ourselves as in iowait for cpufreq reasons,
>> + * and the other is if the task should be accounted as such.
>> + */
>> int io_schedule_prepare(void)
>> {
>> - int old_iowait = current->in_iowait;
>> + int old_wait_flags = 0;
>> +
>> + if (current->in_iowait)
>> + old_wait_flags |= TASK_IOWAIT;
>> + if (current->in_iowait_acct)
>> + old_wait_flags |= TASK_IOWAIT_ACCT;
>>
>> current->in_iowait = 1;
>> + current->in_iowait_acct = 1;
>> blk_flush_plug(current->plug, true);
>> - return old_iowait;
>> + return old_wait_flags;
>> }
>>
>> -void io_schedule_finish(int token)
>> +void io_schedule_finish(int old_wait_flags)
>> {
>> - current->in_iowait = token;
>> + if (!(old_wait_flags & TASK_IOWAIT))
>> + current->in_iowait = 0;
>> + if (!(old_wait_flags & TASK_IOWAIT_ACCT))
>> + current->in_iowait_acct = 0;
>
> Why? TASK_IOWAIT_ACCT requires TASK_IOWAIT, right? So if TASK_IOWAIT was
> not set then TASK_IOWAIT_ACCT must have been clear too, no?

It does, IOWAIT_ACCT always nests inside IOWAIT. I guess it would be
more explanatory as:

/*
* If TASK_IOWAIT isn't set, then TASK_IOWAIT_ACCT cannot have
* been set either as it nests inside TASK_IOWAIT.
*/
if (!(old_wait_flags & TASK_IOWAIT))
current->in_iowait = 0;
else if (!(old_wait_flags & TASK_IOWAIT_ACCT))
current->in_iowait_acct = 0;

?

--
Jens Axboe


2024-02-29 17:49:55

by Jens Axboe

Subject: Re: [PATCH 1/2] sched/core: switch struct rq->nr_iowait to a normal int

On 2/29/24 10:42 AM, Thomas Gleixner wrote:
> On Thu, Feb 29 2024 at 10:19, Jens Axboe wrote:
>> On 2/29/24 9:53 AM, Thomas Gleixner wrote:
>>> On Wed, Feb 28 2024 at 12:16, Jens Axboe wrote:
>>>> In 3 of the 4 spots where we modify rq->nr_iowait we already hold the
>>>
>>> We modify something and hold locks? It's documented that changelogs
>>> should not impersonate code. It simply does not make any sense.
>>
>> Agree it doesn't read that well... It's meant to say that we already
>> hold the rq lock in 3 of the 4 spots, hence using atomic_inc/dec is
>> pointless for those cases.
>
> That and the 'we'. Write it neutral.
>
> The accounting of rq::nr_iowait is using an atomic_t but 3 out of 4
> places hold runqueue lock already. ....

Will do

> So but I just noticed that there is actually an issue with this:
>
>> unsigned int nr_iowait_cpu(int cpu)
>> {
>> - return atomic_read(&cpu_rq(cpu)->nr_iowait);
>> + struct rq *rq = cpu_rq(cpu);
>> +
>> + return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);
>
> The access to rq->nr_iowait is not protected by the runqueue lock and
> therefore a data race when @cpu is not the current CPU.
>
> This needs to be properly annotated and explained why it does not
> matter.

But that was always racy before as well, if someone else is inc/dec'ing
->nr_iowait while it's being read, you could get either the before or
after value. This doesn't really change that. I could've sworn I
mentioned that in the commit message, but I did not.

> So s/Reviewed-by/Un-Reviewed-by/
>
> Though thinking about it some more. Is this split a real benefit over
> always using the atomic? Do you have numbers to show?

It was more on Peter's complaint that now we're trading a single atomic
for two, hence I got to thinking about nr_iowait in general. I don't
have numbers showing it matters, as mentioned in another email the most
costly part about this seems to be fetching task->in_iowait and not the
actual atomic.

--
Jens Axboe


2024-02-29 19:53:59

by Thomas Gleixner

Subject: Re: [PATCH 1/2] sched/core: switch struct rq->nr_iowait to a normal int

On Thu, Feb 29 2024 at 10:49, Jens Axboe wrote:
> On 2/29/24 10:42 AM, Thomas Gleixner wrote:
>> So but I just noticed that there is actually an issue with this:
>>
>>> unsigned int nr_iowait_cpu(int cpu)
>>> {
>>> - return atomic_read(&cpu_rq(cpu)->nr_iowait);
>>> + struct rq *rq = cpu_rq(cpu);
>>> +
>>> + return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);
>>
>> The access to rq->nr_iowait is not protected by the runqueue lock and
>> therefore a data race when @cpu is not the current CPU.
>>
>> This needs to be properly annotated and explained why it does not
>> matter.
>
> But that was always racy before as well, if someone else is inc/dec'ing
> ->nr_iowait while it's being read, you could get either the before or
> after value. This doesn't really change that. I could've sworn I
> mentioned that in the commit message, but I did not.

There are actually two issues here:

1) atomic_read() vs. atomic_inc/dec() guarantees that the read value
is consistent in itself.

Non-atomic inc/dec is not guaranteeing that the concurrent read is a
consistent value as the compiler is free to do store/load
tearing. Unlikely but not guaranteed to never happen.

KCSAN will complain about it sooner than later and then someone has
to go and do the analysis and the annotation. I rather let you do
the reasoning now than chasing you down later :)

2) What's worse is that the result can be completely bogus:

i.e.

   CPU0                                 CPU1                 CPU2
   a = rq(CPU1)->nr_iowait;        // 0
                                        rq->nr_iowait++;
                                                             rq(CPU1)->nr_iowait_remote++;
   b = rq(CPU1)->nr_iowait_remote; // 1

   r = a - b;                      // -1
   return (unsigned int) r;        // UINT_MAX

The consumers of this interface might be upset. :)

While with a single atomic_t it's guaranteed that the result is
always greater or equal zero.
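
A sketch of the kind of annotation point 1 above asks for: mark the lockless
read and the lock-protected updates so the (intentional) race is visible to
KCSAN. READ_ONCE()/WRITE_ONCE() are the standard markers for this; whether
that alone would be enough, given point 2, is a separate question.

	unsigned int nr_iowait_cpu(int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		/* lockless, possibly remote read of the local counter */
		return READ_ONCE(rq->nr_iowait) - atomic_read(&rq->nr_iowait_remote);
	}

	/* and on the update side, under the rq lock: */
	WRITE_ONCE(rq->nr_iowait, rq->nr_iowait + 1);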

>> So s/Reviewed-by/Un-Reviewed-by/
>>
>> Though thinking about it some more. Is this split a real benefit over
>> always using the atomic? Do you have numbers to show?
>
> It was more on Peter's complaint that now we're trading a single atomic
> for two, hence I got to thinking about nr_iowait in general. I don't
> have numbers showing it matters, as mentioned in another email the most
> costly part about this seems to be fetching task->in_iowait and not the
> actual atomic.

On the write side (except for the remote case) the cache line is already
dirty on the current CPU and I doubt that the atomic will be
noticable. If there is concurrent remote access to the runqueue then the
cache line is bouncing no matter what.

On the read side there is always an atomic operation required, so it's
not really different.

I assume Peter's complaint was about the extra nr_iowait_acct part. I
think that's solvable without the extra atomic_t member and with a
single atomic_add()/sub(). atomic_t is 32bit wide, so what about
splitting the thing and adding/subtracting both in one go?

While sketching this I noticed that prepare/finish can be written w/o
any conditionals.

int io_schedule_prepare(void)
{
int flags = current->in_iowait + current->in_iowait_acct << 16;

current->in_iowait = 1;
current->in_iowait_acct = 1;
blk_flush_plug(current->plug, true);
return flags;
}

void io_schedule_finish(int old_wait_flags)
{
current->in_iowait = flags & 0x01;
current->in_iowait_acct = flags >> 16;
}

Now __schedule():

if (prev->in_iowait) {
int x = 1 + current->in_iowait_acct << 16;

atomic_add(x, rq->nr_iowait);
delayacct_blkio_start();
}

and ttwu_do_activate():

if (p->in_iowait) {
int x = 1 + current->in_iowait_acct << 16;

delayacct_blkio_end(p);
atomic_sub(x, task_rq(p)->nr_iowait);
}


and try_to_wake_up():

delayacct_blkio_end(p);

int x = 1 + current->in_iowait_acct << 16;

atomic_add(x, task_rq(p)->nr_iowait);

nr_iowait_acct_cpu() becomes:

return atomic_read(&cpu_rq(cpu)->nr_iowait) >> 16;

and nr_iowait_cpu():

return atomic_read(&cpu_rq(cpu)->nr_iowait) & ((1 << 16) - 1);

Obviously written with proper inline wrappers and defines, but you get
the idea.

Hmm?

Thanks,

tglx
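
For reference, a more self-contained version of the packing idea above,
written out with explicit parentheses and helpers. The names (iowait_pack(),
IOWAIT_ACCT_SHIFT, IOWAIT_MASK) are placeholders rather than existing kernel
API, and rq->nr_iowait is assumed to be a single atomic_t again, as in the
sketch:

	#define IOWAIT_ACCT_SHIFT	16
	#define IOWAIT_MASK		((1U << IOWAIT_ACCT_SHIFT) - 1)

	/* value to atomic_add()/atomic_sub() on rq->nr_iowait for task p */
	static inline unsigned int iowait_pack(struct task_struct *p)
	{
		/* low 16 bits: boosted waiters, high 16 bits: accounted waiters */
		return p->in_iowait + (p->in_iowait_acct << IOWAIT_ACCT_SHIFT);
	}

	static inline unsigned int nr_iowait_acct_cpu(int cpu)
	{
		return atomic_read(&cpu_rq(cpu)->nr_iowait) >> IOWAIT_ACCT_SHIFT;
	}

	static inline unsigned int nr_iowait_cpu(int cpu)
	{
		return atomic_read(&cpu_rq(cpu)->nr_iowait) & IOWAIT_MASK;
	}

__schedule() would then atomic_add(iowait_pack(prev), &rq->nr_iowait) and the
wakeup paths atomic_sub() the same value, so both counts move together under
one atomic operation.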

2024-02-29 22:32:04

by Jens Axboe

Subject: Re: [PATCH 1/2] sched/core: switch struct rq->nr_iowait to a normal int

On 2/29/24 12:52 PM, Thomas Gleixner wrote:
> On Thu, Feb 29 2024 at 10:49, Jens Axboe wrote:
>> On 2/29/24 10:42 AM, Thomas Gleixner wrote:
>>> So but I just noticed that there is actually an issue with this:
>>>
>>>> unsigned int nr_iowait_cpu(int cpu)
>>>> {
>>>> - return atomic_read(&cpu_rq(cpu)->nr_iowait);
>>>> + struct rq *rq = cpu_rq(cpu);
>>>> +
>>>> + return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);
>>>
>>> The access to rq->nr_iowait is not protected by the runqueue lock and
>>> therefore a data race when @cpu is not the current CPU.
>>>
>>> This needs to be properly annotated and explained why it does not
>>> matter.
>>
>> But that was always racy before as well, if someone else is inc/dec'ing
>> ->nr_iowait while it's being read, you could get either the before or
>> after value. This doesn't really change that. I could've sworn I
>> mentioned that in the commit message, but I did not.
>
> There are actually two issues here:
>
> 1) atomic_read() vs. atomic_inc/dec() guarantees that the read value
> is consistent in itself.
>
> Non-atomic inc/dec is not guaranteeing that the concurrent read is a
> consistent value as the compiler is free to do store/load
> tearing. Unlikely but not guaranteed to never happen.
>
> KCSAN will complain about it sooner than later and then someone has
> to go and do the analysis and the annotation. I rather let you do
> the reasoning now than chasing you down later :)

Fair enough.

> 2) What's worse is that the result can be completely bogus:
>
> i.e.
>
>    CPU0                                 CPU1                 CPU2
>    a = rq(CPU1)->nr_iowait;        // 0
>                                         rq->nr_iowait++;
>                                                              rq(CPU1)->nr_iowait_remote++;
>    b = rq(CPU1)->nr_iowait_remote; // 1
>
>    r = a - b;                      // -1
>    return (unsigned int) r;        // UINT_MAX
>
> The consumers of this interface might be upset. :)
>
> While with a single atomic_t it's guaranteed that the result is
> always greater or equal zero.

Yeah OK, this is a real problem...

>>> So s/Reviewed-by/Un-Reviewed-by/
>>>
>>> Though thinking about it some more. Is this split a real benefit over
>>> always using the atomic? Do you have numbers to show?
>>
>> It was more on Peter's complaint that now we're trading a single atomic
>> for two, hence I got to thinking about nr_iowait in general. I don't
>> have numbers showing it matters, as mentioned in another email the most
>> costly part about this seems to be fetching task->in_iowait and not the
>> actual atomic.
>
> On the write side (except for the remote case) the cache line is already
> dirty on the current CPU and I doubt that the atomic will be
> noticable. If there is concurrent remote access to the runqueue then the
> cache line is bouncing no matter what.

That was my exact thinking too, same cacheline and back-to-back atomics
don't really matter vs a single atomic on it.

> On the read side there is always an atomic operation required, so it's
> not really different.
>
> I assume Peter's complaint was about the extra nr_iowait_acct part. I
> think that's solvable without the extra atomic_t member and with a
> single atomic_add()/sub(). atomic_t is 32bit wide, so what about
> splitting the thing and adding/subtracting both in one go?
>
> While sketching this I noticed that prepare/finish can be written w/o
> any conditionals.
>
> int io_schedule_prepare(void)
> {
> int flags = current->in_iowait + current->in_iowait_acct << 16;
>
> current->in_iowait = 1;
> current->in_iowait_acct = 1;
> blk_flush_plug(current->plug, true);
> return flags;
> }
>
> void io_schedule_finish(int old_wait_flags)
> {
> current->in_iowait = flags & 0x01;
> current->in_iowait_acct = flags >> 16;
> }
>
> Now __schedule():
>
> if (prev->in_iowait) {
> int x = 1 + current->in_iowait_acct << 16;
>
> atomic_add(x, rq->nr_iowait);
> delayacct_blkio_start();
> }
>
> and ttwu_do_activate():
>
> if (p->in_iowait) {
> int x = 1 + current->in_iowait_acct << 16;
>
> delayacct_blkio_end(p);
> atomic_sub(x, task_rq(p)->nr_iowait);
> }
>
>
> and try_to_wake_up():
>
> delayacct_blkio_end(p);
>
> int x = 1 + current->in_iowait_acct << 16;
>
> atomic_add(x, task_rq(p)->nr_iowait);
>
> nr_iowait_acct_cpu() becomes:
>
> return atomic_read(&cpu_rq(cpu)->nr_iowait) >> 16;
>
> and nr_iowait_cpu():
>
> return atomic_read(&cpu_rq(cpu)->nr_iowait) & ((1 << 16) - 1);
>
> Obviously written with proper inline wrappers and defines, but you get
> the idea.

I'll play with this a bit, but do we want to switch to an atomic_long_t
for this? 2^16 in iowait seems extreme, but it definitely seems possible
to overflow it.

--
Jens Axboe


2024-03-01 00:02:34

by Thomas Gleixner

Subject: Re: [PATCH 1/2] sched/core: switch struct rq->nr_iowait to a normal int

On Thu, Feb 29 2024 at 15:30, Jens Axboe wrote:
> On 2/29/24 12:52 PM, Thomas Gleixner wrote:
>> return atomic_read(&cpu_rq(cpu)->nr_iowait) & ((1 << 16) - 1);
>>
>> Obviously written with proper inline wrappers and defines, but you get
>> the idea.
>
> I'll play with this a bit, but do we want to switch to an atomic_long_t
> for this? 2^16 in iowait seems extreme, but it definitely seems possible
> to overflow it.

Indeed. 32bit has PID_MAX_LIMIT == 0x8000, which obviously fits into 16
bits, while 64bit lifts that limit and relies on memory exhaustion to
limit the number of concurrent threads on the machine, which obviously
can exceed 16 bits.

Whether more than 2^16 tasks sit in iowait concurrently on a single CPU
is a different question, and probably a more academic one. :)

Though as this will touch all nr_iowait places anyway, changing it to
atomic_long_t in a preparatory patch first makes a lot of sense.

Thanks,

tglx
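
With the packed approach, the atomic_long_t variant discussed here would just
widen the split; a sketch, assuming a 64-bit long and the same placeholder
names as above:

	#define IOWAIT_ACCT_SHIFT	32
	#define IOWAIT_MASK		((1UL << IOWAIT_ACCT_SHIFT) - 1)

	static inline unsigned long nr_iowait_acct_cpu(int cpu)
	{
		return atomic_long_read(&cpu_rq(cpu)->nr_iowait) >> IOWAIT_ACCT_SHIFT;
	}

	static inline unsigned long nr_iowait_cpu(int cpu)
	{
		return atomic_long_read(&cpu_rq(cpu)->nr_iowait) & IOWAIT_MASK;
	}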