Subject: [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption

During a context switch the scheduler invokes wq_worker_sleeping() with
preemption disabled. Disabling preemption is needed because it protects
access to `worker->sleeping'. As an optimisation it avoids invoking
schedule() within the schedule path as part of a possible wake up (thus
the preempt_enable_no_resched() afterwards).

The io-wq has been added to the mix in the same section with preemption
disabled. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
acquires a spinlock_t. Also, within schedule() the spinlock_t must be
acquired after the tsk_is_pi_blocked() check, otherwise it will block on
the sleeping lock again while scheduling out.

While playing with `io_uring-bench' I didn't notice a significant
latency spike after converting io_wqe::lock to a raw_spinlock_t. The
latency was more or less the same.

I don't see a significant reason why this lock should become a
raw_spinlock_t, so I suggest moving it after the
tsk_is_pi_blocked() check.
The io_worker::flags are usually modified under the lock, except in the
scheduler path. Ideally the lock is always acquired, since the
IO_WORKER_F_UP flag is set early during startup and IO_WORKER_F_RUNNING
should be set unless the task loops within schedule(). I *think* ::flags
requires the same protection as workqueue's ::sleeping, and therefore I
move the check inside the locked section.

Any feedback on this vs raw_spinlock_t?

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
fs/io-wq.c | 8 ++++----
kernel/sched/core.c | 10 +++++-----
2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index e92c4724480ca..a7e07b3ac5b95 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -623,15 +623,15 @@ void io_wq_worker_sleeping(struct task_struct *tsk)
struct io_worker *worker = kthread_data(tsk);
struct io_wqe *wqe = worker->wqe;

+ spin_lock_irq(&wqe->lock);
if (!(worker->flags & IO_WORKER_F_UP))
- return;
+ goto out;
if (!(worker->flags & IO_WORKER_F_RUNNING))
- return;
+ goto out;

worker->flags &= ~IO_WORKER_F_RUNNING;
-
- spin_lock_irq(&wqe->lock);
io_wqe_dec_running(wqe, worker);
+out:
spin_unlock_irq(&wqe->lock);
}

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3bbb60b97c73c..b76c0f27bd95e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4694,18 +4694,18 @@ static inline void sched_submit_work(struct task_struct *tsk)
* in the possible wakeup of a kworker and because wq_worker_sleeping()
* requires it.
*/
- if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+ if (tsk->flags & PF_WQ_WORKER) {
preempt_disable();
- if (tsk->flags & PF_WQ_WORKER)
- wq_worker_sleeping(tsk);
- else
- io_wq_worker_sleeping(tsk);
+ wq_worker_sleeping(tsk);
preempt_enable_no_resched();
}

if (tsk_is_pi_blocked(tsk))
return;

+ if (tsk->flags & PF_IO_WORKER)
+ io_wq_worker_sleeping(tsk);
+
/*
* If we are going to sleep and we have plugged IO queued,
* make sure to submit it to avoid deadlocks.
--
2.28.0


2020-08-19 13:26:21

by Jens Axboe

Subject: Re: [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption

On 8/19/20 6:15 AM, [email protected] wrote:
> On Wed, Aug 19, 2020 at 02:37:58PM +0200, Sebastian Andrzej Siewior wrote:
>
>> I don't see a significant reason why this lock should become a
>> raw_spinlock_t, so I suggest moving it after the
>> tsk_is_pi_blocked() check.
>
>> Any feedback on this vs raw_spinlock_t?
>>
>> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
>> ---
>> fs/io-wq.c | 8 ++++----
>> kernel/sched/core.c | 10 +++++-----
>> 2 files changed, 9 insertions(+), 9 deletions(-)
>>
>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 3bbb60b97c73c..b76c0f27bd95e 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -4694,18 +4694,18 @@ static inline void sched_submit_work(struct task_struct *tsk)
>> * in the possible wakeup of a kworker and because wq_worker_sleeping()
>> * requires it.
>> */
>> - if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
>> + if (tsk->flags & PF_WQ_WORKER) {
>> preempt_disable();
>> - if (tsk->flags & PF_WQ_WORKER)
>> - wq_worker_sleeping(tsk);
>> - else
>> - io_wq_worker_sleeping(tsk);
>> + wq_worker_sleeping(tsk);
>> preempt_enable_no_resched();
>> }
>>
>> if (tsk_is_pi_blocked(tsk))
>> return;
>>
>> + if (tsk->flags & PF_IO_WORKER)
>> + io_wq_worker_sleeping(tsk);
>> +
>
> Urgh, so this adds a branch in what is normally considered a fairly hot
> path.
>
>
> I'm thinking that the raw_spinlock_t option would permit leaving that
> single:
>
> if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER))
>
> branch intact?

Yes, the raw spinlock would do it, and leave the single branch intact
in the hot path. I'd be fine with going that route for io-wq.


--
Jens Axboe

2020-08-19 13:29:11

by Peter Zijlstra

Subject: Re: [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption

On Wed, Aug 19, 2020 at 02:37:58PM +0200, Sebastian Andrzej Siewior wrote:

> I don't see a significant reason why this lock should become a
> raw_spinlock_t, so I suggest moving it after the
> tsk_is_pi_blocked() check.

> Any feedback on this vs raw_spinlock_t?
>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> ---
> fs/io-wq.c | 8 ++++----
> kernel/sched/core.c | 10 +++++-----
> 2 files changed, 9 insertions(+), 9 deletions(-)
>

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 3bbb60b97c73c..b76c0f27bd95e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4694,18 +4694,18 @@ static inline void sched_submit_work(struct task_struct *tsk)
> * in the possible wakeup of a kworker and because wq_worker_sleeping()
> * requires it.
> */
> - if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> + if (tsk->flags & PF_WQ_WORKER) {
> preempt_disable();
> - if (tsk->flags & PF_WQ_WORKER)
> - wq_worker_sleeping(tsk);
> - else
> - io_wq_worker_sleeping(tsk);
> + wq_worker_sleeping(tsk);
> preempt_enable_no_resched();
> }
>
> if (tsk_is_pi_blocked(tsk))
> return;
>
> + if (tsk->flags & PF_IO_WORKER)
> + io_wq_worker_sleeping(tsk);
> +

Urgh, so this adds a branch in what is normally considered a fairly hot
path.

I'm thinking that the raw_spinlock_t option would permit leaving that
single:

if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER))

branch intact?

Subject: Re: [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption

On 2020-08-19 15:15:07 [+0200], [email protected] wrote:

> > - if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> > + if (tsk->flags & PF_WQ_WORKER) {
> > preempt_disable();
> > - if (tsk->flags & PF_WQ_WORKER)
> > - wq_worker_sleeping(tsk);
> > - else
> > - io_wq_worker_sleeping(tsk);
> > + wq_worker_sleeping(tsk);
> > preempt_enable_no_resched();
> > }
> >
> > if (tsk_is_pi_blocked(tsk))
> > return;
> >
> > + if (tsk->flags & PF_IO_WORKER)
> > + io_wq_worker_sleeping(tsk);
> > +
>
> Urgh, so this adds a branch in what is normally considered a fairly hot
> path.
>
> I'm thinking that the raw_spinlock_t option would permit leaving that
> single:
>
> if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER))
>
> branch intact?

The compiler generates code to test for both flags at once. If neither
of the two flags is set then there is one branch (get out and go to
tsk_is_pi…).
And yes, with raw_spinlock_t we could keep that one branch.

If you want to optimize further, we could move PF_IO_WORKER to a lower
bit. x86 can test for both via
(gcc-10)
| testl $536870944, 44(%rbp) #, _11->flags
| jne .L1635 #,

(clang-9)
| testl $536870944, 44(%rbx) # imm = 0x20000020
| je .LBB112_6


but ARM can't and does
| ldr r1, [r5, #16] @ tsk_3->flags, tsk_3->flags
| mov r2, #32 @ tmp157,
| movt r2, 8192 @ tmp157,
| tst r2, r1 @ tmp157, tsk_3->flags
| beq .L998 @,

same ARM64
| ldr w0, [x20, 60] //, _11->flags
| and w0, w0, 1073741792 // tmp117, _11->flags,
| and w0, w0, -536870849 // tmp117, tmp117,
| cbnz w0, .L453 // tmp117,

using 0x10 for PF_IO_WORKER instead will turn this into:
| ldr w0, [x20, 60] //, _11->flags
| tst w0, 48 // _11->flags,
| bne .L453 //,

ARM:
| ldr r2, [r5, #16] @ tsk_3->flags, tsk_3->flags
| tst r2, #48 @ tsk_3->flags,
| beq .L998 @,

Sebastian
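
For reference, a minimal userspace sketch of the flag test under
discussion; the values below are assumptions mirroring the proposed
layout (both worker bits adjacent in the low byte), and compiled at -O2
for arm64 it should come down to the same single tst-with-immediate
shown above:

#include <stdbool.h>

#define PF_IO_WORKER	0x00000010	/* assumed value after the move */
#define PF_WQ_WORKER	0x00000020

struct task_struct {	/* reduced stand-in, not the kernel struct */
	unsigned int flags;
};

static bool is_worker(const struct task_struct *tsk)
{
	/* 0x30 fits a single test/tst immediate on x86, arm and arm64 */
	return tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER);
}

int main(void)
{
	struct task_struct t = { .flags = PF_IO_WORKER };

	return is_worker(&t) ? 0 : 1;
}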

2020-08-19 14:24:05

by Peter Zijlstra

Subject: Re: [RFC PATCH] sched: Invoke io_wq_worker_sleeping() with enabled preemption

On Wed, Aug 19, 2020 at 03:33:20PM +0200, Sebastian Andrzej Siewior wrote:
> On 2020-08-19 15:15:07 [+0200], [email protected] wrote:

> If you want to optimize further, we could move PF_IO_WORKER to a lower
> bit. x86 can test for both via
> (gcc-10)
> | testl $536870944, 44(%rbp) #, _11->flags
> | jne .L1635 #,
>
> (clang-9)
> | testl $536870944, 44(%rbx) # imm = 0x20000020
> | je .LBB112_6
>
>
> but ARM can't and does
> | ldr r1, [r5, #16] @ tsk_3->flags, tsk_3->flags
> | mov r2, #32 @ tmp157,
> | movt r2, 8192 @ tmp157,
> | tst r2, r1 @ tmp157, tsk_3->flags
> | beq .L998 @,
>
> same ARM64
> | ldr w0, [x20, 60] //, _11->flags
> | and w0, w0, 1073741792 // tmp117, _11->flags,
> | and w0, w0, -536870849 // tmp117, tmp117,
> | cbnz w0, .L453 // tmp117,
>
> using 0x10 for PF_IO_WORKER instead will turn this into:
> | ldr w0, [x20, 60] //, _11->flags
> | tst w0, 48 // _11->flags,
> | bne .L453 //,
>
> ARM:
> | ldr r2, [r5, #16] @ tsk_3->flags, tsk_3->flags
> | tst r2, #48 @ tsk_3->flags,
> | beq .L998 @,

Good point, AFAICT there's a number of low bits still open (and we can
shuffle if we have to), so sure put a patch in to that effect while
you're at it.


Subject: [PATCH] io_wq: Make io_wqe::lock a raw_spinlock_t

During a context switch the scheduler invokes wq_worker_sleeping() with
disabled preemption. Disabling preemption is needed because it protects
access to `worker->sleeping'. As an optimisation it avoids invoking
schedule() within the schedule path as part of possible wake up (thus
preempt_enable_no_resched() afterwards).

The io-wq has been added to the mix in the same section with disabled
preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
acquires a spinlock_t. Also within the schedule() the spinlock_t must be
acquired after tsk_is_pi_blocked() otherwise it will block on the
sleeping lock again while scheduling out.

While playing with `io_uring-bench' I didn't notice a significant
latency spike after converting io_wqe::lock to a raw_spinlock_t. The
latency was more or less the same.

In order to keep the spinlock_t, the io_wq_worker_sleeping() call would
have to be moved after the tsk_is_pi_blocked() check, which would
introduce a branch instruction into the hot path.

The lock is used to maintain the `work_list' and at most one task is
woken up under it.
Should io_wqe_cancel_pending_work() cause latency spikes while searching
for a specific item, it would need to drop the lock during the
iteration.
revert_creds() is also invoked under the lock. According to debugging,
cred::non_rcu is 0. Otherwise it would have to be moved outside of the
locked section because put_cred_rcu()->free_uid() acquires a sleeping
lock.

Convert io_wqe::lock to a raw_spinlock_t.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
fs/io-wq.c | 52 ++++++++++++++++++++++++++--------------------------
1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/fs/io-wq.c b/fs/io-wq.c
index e92c4724480ca..16dd3a5534a06 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -87,7 +87,7 @@ enum {
*/
struct io_wqe {
struct {
- spinlock_t lock;
+ raw_spinlock_t lock;
struct io_wq_work_list work_list;
unsigned long hash_map;
unsigned flags;
@@ -148,7 +148,7 @@ static bool __io_worker_unuse(struct io_wqe *wqe, struct io_worker *worker)

if (current->files != worker->restore_files) {
__acquire(&wqe->lock);
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
dropped_lock = true;

task_lock(current);
@@ -166,7 +166,7 @@ static bool __io_worker_unuse(struct io_wqe *wqe, struct io_worker *worker)
if (worker->mm) {
if (!dropped_lock) {
__acquire(&wqe->lock);
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
dropped_lock = true;
}
__set_current_state(TASK_RUNNING);
@@ -220,17 +220,17 @@ static void io_worker_exit(struct io_worker *worker)
worker->flags = 0;
preempt_enable();

- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
hlist_nulls_del_rcu(&worker->nulls_node);
list_del_rcu(&worker->all_list);
if (__io_worker_unuse(wqe, worker)) {
__release(&wqe->lock);
- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
}
acct->nr_workers--;
nr_workers = wqe->acct[IO_WQ_ACCT_BOUND].nr_workers +
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers;
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);

/* all workers gone, wq exit can proceed */
if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
@@ -504,7 +504,7 @@ static void io_worker_handle_work(struct io_worker *worker)
else if (!wq_list_empty(&wqe->work_list))
wqe->flags |= IO_WQE_FLAG_STALLED;

- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
if (!work)
break;
io_assign_current_work(worker, work);
@@ -538,17 +538,17 @@ static void io_worker_handle_work(struct io_worker *worker)
io_wqe_enqueue(wqe, linked);

if (hash != -1U && !next_hashed) {
- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
wqe->hash_map &= ~BIT_ULL(hash);
wqe->flags &= ~IO_WQE_FLAG_STALLED;
/* skip unnecessary unlock-lock wqe->lock */
if (!work)
goto get_next;
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
}
} while (work);

- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
} while (1);
}

@@ -563,7 +563,7 @@ static int io_wqe_worker(void *data)
while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
set_current_state(TASK_INTERRUPTIBLE);
loop:
- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
if (io_wqe_run_queue(wqe)) {
__set_current_state(TASK_RUNNING);
io_worker_handle_work(worker);
@@ -574,7 +574,7 @@ static int io_wqe_worker(void *data)
__release(&wqe->lock);
goto loop;
}
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
if (signal_pending(current))
flush_signals(current);
if (schedule_timeout(WORKER_IDLE_TIMEOUT))
@@ -586,11 +586,11 @@ static int io_wqe_worker(void *data)
}

if (test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
if (!wq_list_empty(&wqe->work_list))
io_worker_handle_work(worker);
else
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
}

io_worker_exit(worker);
@@ -630,9 +630,9 @@ void io_wq_worker_sleeping(struct task_struct *tsk)

worker->flags &= ~IO_WORKER_F_RUNNING;

- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
io_wqe_dec_running(wqe, worker);
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
}

static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
@@ -656,7 +656,7 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
return false;
}

- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list);
list_add_tail_rcu(&worker->all_list, &wqe->all_list);
worker->flags |= IO_WORKER_F_FREE;
@@ -665,7 +665,7 @@ static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
if (!acct->nr_workers && (worker->flags & IO_WORKER_F_BOUND))
worker->flags |= IO_WORKER_F_FIXED;
acct->nr_workers++;
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);

if (index == IO_WQ_ACCT_UNBOUND)
atomic_inc(&wq->user->processes);
@@ -720,12 +720,12 @@ static int io_wq_manager(void *data)
if (!node_online(node))
continue;

- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
if (io_wqe_need_worker(wqe, IO_WQ_ACCT_BOUND))
fork_worker[IO_WQ_ACCT_BOUND] = true;
if (io_wqe_need_worker(wqe, IO_WQ_ACCT_UNBOUND))
fork_worker[IO_WQ_ACCT_UNBOUND] = true;
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
if (fork_worker[IO_WQ_ACCT_BOUND])
create_io_worker(wq, wqe, IO_WQ_ACCT_BOUND);
if (fork_worker[IO_WQ_ACCT_UNBOUND])
@@ -821,10 +821,10 @@ static void io_wqe_enqueue(struct io_wqe *wqe, struct io_wq_work *work)
}

work_flags = work->flags;
- spin_lock_irqsave(&wqe->lock, flags);
+ raw_spin_lock_irqsave(&wqe->lock, flags);
io_wqe_insert_work(wqe, work);
wqe->flags &= ~IO_WQE_FLAG_STALLED;
- spin_unlock_irqrestore(&wqe->lock, flags);
+ raw_spin_unlock_irqrestore(&wqe->lock, flags);

if ((work_flags & IO_WQ_WORK_CONCURRENT) ||
!atomic_read(&acct->nr_running))
@@ -933,14 +933,14 @@ static void io_wqe_cancel_pending_work(struct io_wqe *wqe,
unsigned long flags;

retry:
- spin_lock_irqsave(&wqe->lock, flags);
+ raw_spin_lock_irqsave(&wqe->lock, flags);
wq_list_for_each(node, prev, &wqe->work_list) {
work = container_of(node, struct io_wq_work, list);
if (!match->fn(work, match->data))
continue;

wq_list_del(&wqe->work_list, node, prev);
- spin_unlock_irqrestore(&wqe->lock, flags);
+ raw_spin_unlock_irqrestore(&wqe->lock, flags);
io_run_cancel(work, wqe);
match->nr_pending++;
if (!match->cancel_all)
@@ -949,7 +949,7 @@ static void io_wqe_cancel_pending_work(struct io_wqe *wqe,
/* not safe to continue after unlock */
goto retry;
}
- spin_unlock_irqrestore(&wqe->lock, flags);
+ raw_spin_unlock_irqrestore(&wqe->lock, flags);
}

static void io_wqe_cancel_running_work(struct io_wqe *wqe,
@@ -1057,7 +1057,7 @@ struct io_wq *io_wq_create(unsigned bounded, struct io_wq_data *data)
}
atomic_set(&wqe->acct[IO_WQ_ACCT_UNBOUND].nr_running, 0);
wqe->wq = wq;
- spin_lock_init(&wqe->lock);
+ raw_spin_lock_init(&wqe->lock);
INIT_WQ_LIST(&wqe->work_list);
INIT_HLIST_NULLS_HEAD(&wqe->free_list, 0);
INIT_LIST_HEAD(&wqe->all_list);
--
2.28.0

Subject: [PATCH 1/2] sched: Bring the PF_IO_WORKER and PF_WQ_WORKER bits closer together

The bits PF_IO_WORKER and PF_WQ_WORKER are tested together in
sched_submit_work() which is considered to be a hot path.
If the two bits cross the 8 or 16 bit boundary then most architectures
require multiple load instructions in order to create the constant
value. Also, such a value cannot be encoded within the compare opcode.

By moving the bit definition within the same block, the compiler can
create/use one immediate value.

For some reason gcc-10 on ARM64 requires both bits to be next to each
other in order to issue "tst reg, val; bne label". Otherwise the result
is "mov reg1, val; tst reg, reg1; bne label".

Move PF_VCPU out of the way so that PF_IO_WORKER can be next to
PF_WQ_WORKER.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---

Could someone from the ARM64 camp please verify if this is a gcc "bug"
or an opcode/arch limitation? With PF_IO_WORKER as 1 (without the
PF_VCPU swap) I get for ARM:

| tst r2, #33 @ task_flags,
| beq .L998 @,

however ARM64 does here:
| mov w0, 33 // tmp117,
| tst w19, w0 // task_flags, tmp117
| bne .L453 //,

the extra mov operation. Moving PF_IO_WORKER next to PF_WQ_WORKER, as
this patch does, gives me:
| tst w19, 48 // task_flags,
| bne .L453 //,


include/linux/sched.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 93ecd930efd31..2bf0af19a62a4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1489,9 +1489,10 @@ extern struct pid *cad_pid;
/*
* Per process flags
*/
+#define PF_VCPU 0x00000001 /* I'm a virtual CPU */
#define PF_IDLE 0x00000002 /* I am an IDLE thread */
#define PF_EXITING 0x00000004 /* Getting shut down */
-#define PF_VCPU 0x00000010 /* I'm a virtual CPU */
+#define PF_IO_WORKER 0x00000010 /* Task is an IO worker */
#define PF_WQ_WORKER 0x00000020 /* I'm a workqueue worker */
#define PF_FORKNOEXEC 0x00000040 /* Forked but didn't exec */
#define PF_MCE_PROCESS 0x00000080 /* Process policy on mce errors */
@@ -1515,7 +1516,6 @@ extern struct pid *cad_pid;
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMALLOC_NOCMA 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
-#define PF_IO_WORKER 0x20000000 /* Task is an IO worker */
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */

--
2.28.0

Subject: [PATCH 2/2] sched: Cache task_struct::flags in sched_submit_work()

sched_submit_work() is considered to be a hot path. The preempt_disable()
instruction is a compiler barrier and forces the compiler to load
task_struct::flags for the second comparison.
By using a local variable, the compiler can load the value once and keep it in
a register for the second comparison.

Verified on x86-64 with gcc-10.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---

Optimisation at the molecular level, part two. Drop this in case this
branch isn't considered *that* hot and the cache-hot value can be loaded
again. But then the value is around and can be speculated on early :)

kernel/sched/core.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8471a0f7eb322..c36dc1ae58beb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4551,9 +4551,12 @@ void __noreturn do_task_dead(void)

static inline void sched_submit_work(struct task_struct *tsk)
{
+ unsigned int task_flags;
+
if (!tsk->state)
return;

+ task_flags = tsk->flags;
/*
* If a worker went to sleep, notify and ask workqueue whether
* it wants to wake up a task to maintain concurrency.
@@ -4562,9 +4565,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
* in the possible wakeup of a kworker and because wq_worker_sleeping()
* requires it.
*/
- if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+ if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
preempt_disable();
- if (tsk->flags & PF_WQ_WORKER)
+ if (task_flags & PF_WQ_WORKER)
wq_worker_sleeping(tsk);
else
io_wq_worker_sleeping(tsk);
--
2.28.0

2020-08-19 20:13:38

by Peter Zijlstra

Subject: Re: [PATCH 2/2] sched: Cache task_struct::flags in sched_submit_work()

On Wed, Aug 19, 2020 at 10:00:25PM +0200, Sebastian Andrzej Siewior wrote:
> sched_submit_work() is considered to be a hot path. The preempt_disable()
> instruction is a compiler barrier and forces the compiler to load
> task_struct::flags for the second comparison.
> By using a local variable, the compiler can load the value once and keep it in
> a register for the second comparison.
>
> Verified on x86-64 with gcc-10.
>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> ---
>
> Optimisation at the molecular level, part two. Drop this in case this
> branch isn't considered *that* hot and the cache-hot value can be
> loaded again. But then the value is around and can be speculated on
> early :)

That's fine, task->flags can only be changed by current.

Patches look good to me, I'll stuff them in tomorrow. Thanks!
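
As an aside, a small userspace sketch of the effect described in the
patch; barrier() stands in for the compiler-barrier side of
preempt_disable(), the flag values are assumptions, and the point is
that the second test reuses the local copy instead of re-loading
tsk->flags after the barrier (which is safe because only current
modifies its own flags):

#include <stdio.h>

#define PF_IO_WORKER	0x00000010
#define PF_WQ_WORKER	0x00000020

/* compiler barrier only; models the barrier in preempt_disable() */
#define barrier()	__asm__ __volatile__("" ::: "memory")

struct task_struct {	/* reduced stand-in, not the kernel struct */
	unsigned int flags;
};

static void submit_work(struct task_struct *tsk)
{
	unsigned int task_flags = tsk->flags;	/* loaded once */

	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
		barrier();
		/* task_flags sits in a register; tsk->flags is not re-read */
		if (task_flags & PF_WQ_WORKER)
			puts("wq_worker_sleeping()");
		else
			puts("io_wq_worker_sleeping()");
		barrier();
	}
}

int main(void)
{
	struct task_struct t = { .flags = PF_IO_WORKER };

	submit_work(&t);
	return 0;
}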

Subject: [tip: sched/core] sched: Cache task_struct::flags in sched_submit_work()

The following commit has been merged into the sched/core branch of tip:

Commit-ID: c1cecf884ad748f63f9139d5a18ee265ee2f70fb
Gitweb: https://git.kernel.org/tip/c1cecf884ad748f63f9139d5a18ee265ee2f70fb
Author: Sebastian Andrzej Siewior <[email protected]>
AuthorDate: Wed, 19 Aug 2020 22:00:25 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 26 Aug 2020 12:41:58 +02:00

sched: Cache task_struct::flags in sched_submit_work()

sched_submit_work() is considered to be a hot path. The preempt_disable()
instruction is a compiler barrier and forces the compiler to load
task_struct::flags for the second comparison.
By using a local variable, the compiler can load the value once and keep it in
a register for the second comparison.

Verified on x86-64 with gcc-10.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
kernel/sched/core.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8471a0f..c36dc1a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4551,9 +4551,12 @@ void __noreturn do_task_dead(void)

static inline void sched_submit_work(struct task_struct *tsk)
{
+ unsigned int task_flags;
+
if (!tsk->state)
return;

+ task_flags = tsk->flags;
/*
* If a worker went to sleep, notify and ask workqueue whether
* it wants to wake up a task to maintain concurrency.
@@ -4562,9 +4565,9 @@ static inline void sched_submit_work(struct task_struct *tsk)
* in the possible wakeup of a kworker and because wq_worker_sleeping()
* requires it.
*/
- if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
+ if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
preempt_disable();
- if (tsk->flags & PF_WQ_WORKER)
+ if (task_flags & PF_WQ_WORKER)
wq_worker_sleeping(tsk);
else
io_wq_worker_sleeping(tsk);

Subject: [tip: sched/core] sched: Bring the PF_IO_WORKER and PF_WQ_WORKER bits closer together

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 01ccf592362a984534371b3596d4c953da6a7bb2
Gitweb: https://git.kernel.org/tip/01ccf592362a984534371b3596d4c953da6a7bb2
Author: Sebastian Andrzej Siewior <[email protected]>
AuthorDate: Wed, 19 Aug 2020 21:55:05 +02:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Wed, 26 Aug 2020 12:41:58 +02:00

sched: Bring the PF_IO_WORKER and PF_WQ_WORKER bits closer together

The bits PF_IO_WORKER and PF_WQ_WORKER are tested together in
sched_submit_work() which is considered to be a hot path.
If the two bits cross the 8 or 16 bit boundary then most architectures
require multiple load instructions in order to create the constant
value. Also, such a value cannot be encoded within the compare opcode.

By moving the bit definition within the same block, the compiler can
create/use one immediate value.

For some reason gcc-10 on ARM64 requires both bits to be next to each
other in order to issue "tst reg, val; bne label". Otherwise the result
is "mov reg1, val; tst reg, reg1; bne label".

Move PF_VCPU out of the way so that PF_IO_WORKER can be next to
PF_WQ_WORKER.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
include/linux/sched.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 93ecd93..2bf0af1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1489,9 +1489,10 @@ extern struct pid *cad_pid;
/*
* Per process flags
*/
+#define PF_VCPU 0x00000001 /* I'm a virtual CPU */
#define PF_IDLE 0x00000002 /* I am an IDLE thread */
#define PF_EXITING 0x00000004 /* Getting shut down */
-#define PF_VCPU 0x00000010 /* I'm a virtual CPU */
+#define PF_IO_WORKER 0x00000010 /* Task is an IO worker */
#define PF_WQ_WORKER 0x00000020 /* I'm a workqueue worker */
#define PF_FORKNOEXEC 0x00000040 /* Forked but didn't exec */
#define PF_MCE_PROCESS 0x00000080 /* Process policy on mce errors */
@@ -1515,7 +1516,6 @@ extern struct pid *cad_pid;
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMALLOC_NOCMA 0x10000000 /* All allocation request will have _GFP_MOVABLE cleared */
-#define PF_IO_WORKER 0x20000000 /* Task is an IO worker */
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK 0x80000000 /* This thread called freeze_processes() and should not be frozen */

Subject: [PATCH v2] io_wq: Make io_wqe::lock a raw_spinlock_t

During a context switch the scheduler invokes wq_worker_sleeping() with
disabled preemption. Disabling preemption is needed because it protects
access to `worker->sleeping'. As an optimisation it avoids invoking
schedule() within the schedule path as part of possible wake up (thus
preempt_enable_no_resched() afterwards).

The io-wq has been added to the mix in the same section with disabled
preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
acquires a spinlock_t. Also within the schedule() the spinlock_t must be
acquired after tsk_is_pi_blocked() otherwise it will block on the
sleeping lock again while scheduling out.

While playing with `io_uring-bench' I didn't notice a significant
latency spike after converting io_wqe::lock to a raw_spinlock_t. The
latency was more or less the same.

In order to keep the spinlock_t, the io_wq_worker_sleeping() call would
have to be moved after the tsk_is_pi_blocked() check, which would
introduce a branch instruction into the hot path.

The lock is used to maintain the `work_list' and at most one task is
woken up under it.
Should io_wqe_cancel_pending_work() cause latency spikes while searching
for a specific item, it would need to drop the lock during the
iteration.
revert_creds() is also invoked under the lock. According to debugging,
cred::non_rcu is 0. Otherwise it would have to be moved outside of the
locked section because put_cred_rcu()->free_uid() acquires a sleeping
lock.

Convert io_wqe::lock to a raw_spinlock_t.

Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
v1…v2: Rebase to -rc3

fs/io-wq.c | 52 ++++++++++++++++++++++++++--------------------------
1 file changed, 26 insertions(+), 26 deletions(-)

--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -87,7 +87,7 @@ enum {
*/
struct io_wqe {
struct {
- spinlock_t lock;
+ raw_spinlock_t lock;
struct io_wq_work_list work_list;
unsigned long hash_map;
unsigned flags;
@@ -148,7 +148,7 @@ static bool __io_worker_unuse(struct io_

if (current->files != worker->restore_files) {
__acquire(&wqe->lock);
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
dropped_lock = true;

task_lock(current);
@@ -166,7 +166,7 @@ static bool __io_worker_unuse(struct io_
if (worker->mm) {
if (!dropped_lock) {
__acquire(&wqe->lock);
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
dropped_lock = true;
}
__set_current_state(TASK_RUNNING);
@@ -220,17 +220,17 @@ static void io_worker_exit(struct io_wor
worker->flags = 0;
preempt_enable();

- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
hlist_nulls_del_rcu(&worker->nulls_node);
list_del_rcu(&worker->all_list);
if (__io_worker_unuse(wqe, worker)) {
__release(&wqe->lock);
- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
}
acct->nr_workers--;
nr_workers = wqe->acct[IO_WQ_ACCT_BOUND].nr_workers +
wqe->acct[IO_WQ_ACCT_UNBOUND].nr_workers;
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);

/* all workers gone, wq exit can proceed */
if (!nr_workers && refcount_dec_and_test(&wqe->wq->refs))
@@ -504,7 +504,7 @@ static void io_worker_handle_work(struct
else if (!wq_list_empty(&wqe->work_list))
wqe->flags |= IO_WQE_FLAG_STALLED;

- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
if (!work)
break;
io_assign_current_work(worker, work);
@@ -538,17 +538,17 @@ static void io_worker_handle_work(struct
io_wqe_enqueue(wqe, linked);

if (hash != -1U && !next_hashed) {
- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
wqe->hash_map &= ~BIT_ULL(hash);
wqe->flags &= ~IO_WQE_FLAG_STALLED;
/* skip unnecessary unlock-lock wqe->lock */
if (!work)
goto get_next;
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
}
} while (work);

- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
} while (1);
}

@@ -563,7 +563,7 @@ static int io_wqe_worker(void *data)
while (!test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
set_current_state(TASK_INTERRUPTIBLE);
loop:
- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
if (io_wqe_run_queue(wqe)) {
__set_current_state(TASK_RUNNING);
io_worker_handle_work(worker);
@@ -574,7 +574,7 @@ static int io_wqe_worker(void *data)
__release(&wqe->lock);
goto loop;
}
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
if (signal_pending(current))
flush_signals(current);
if (schedule_timeout(WORKER_IDLE_TIMEOUT))
@@ -586,11 +586,11 @@ static int io_wqe_worker(void *data)
}

if (test_bit(IO_WQ_BIT_EXIT, &wq->state)) {
- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
if (!wq_list_empty(&wqe->work_list))
io_worker_handle_work(worker);
else
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
}

io_worker_exit(worker);
@@ -630,9 +630,9 @@ void io_wq_worker_sleeping(struct task_s

worker->flags &= ~IO_WORKER_F_RUNNING;

- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
io_wqe_dec_running(wqe, worker);
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
}

static bool create_io_worker(struct io_wq *wq, struct io_wqe *wqe, int index)
@@ -656,7 +656,7 @@ static bool create_io_worker(struct io_w
return false;
}

- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
hlist_nulls_add_head_rcu(&worker->nulls_node, &wqe->free_list);
list_add_tail_rcu(&worker->all_list, &wqe->all_list);
worker->flags |= IO_WORKER_F_FREE;
@@ -665,7 +665,7 @@ static bool create_io_worker(struct io_w
if (!acct->nr_workers && (worker->flags & IO_WORKER_F_BOUND))
worker->flags |= IO_WORKER_F_FIXED;
acct->nr_workers++;
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);

if (index == IO_WQ_ACCT_UNBOUND)
atomic_inc(&wq->user->processes);
@@ -720,12 +720,12 @@ static int io_wq_manager(void *data)
if (!node_online(node))
continue;

- spin_lock_irq(&wqe->lock);
+ raw_spin_lock_irq(&wqe->lock);
if (io_wqe_need_worker(wqe, IO_WQ_ACCT_BOUND))
fork_worker[IO_WQ_ACCT_BOUND] = true;
if (io_wqe_need_worker(wqe, IO_WQ_ACCT_UNBOUND))
fork_worker[IO_WQ_ACCT_UNBOUND] = true;
- spin_unlock_irq(&wqe->lock);
+ raw_spin_unlock_irq(&wqe->lock);
if (fork_worker[IO_WQ_ACCT_BOUND])
create_io_worker(wq, wqe, IO_WQ_ACCT_BOUND);
if (fork_worker[IO_WQ_ACCT_UNBOUND])
@@ -821,10 +821,10 @@ static void io_wqe_enqueue(struct io_wqe
}

work_flags = work->flags;
- spin_lock_irqsave(&wqe->lock, flags);
+ raw_spin_lock_irqsave(&wqe->lock, flags);
io_wqe_insert_work(wqe, work);
wqe->flags &= ~IO_WQE_FLAG_STALLED;
- spin_unlock_irqrestore(&wqe->lock, flags);
+ raw_spin_unlock_irqrestore(&wqe->lock, flags);

if ((work_flags & IO_WQ_WORK_CONCURRENT) ||
!atomic_read(&acct->nr_running))
@@ -951,13 +951,13 @@ static void io_wqe_cancel_pending_work(s
unsigned long flags;

retry:
- spin_lock_irqsave(&wqe->lock, flags);
+ raw_spin_lock_irqsave(&wqe->lock, flags);
wq_list_for_each(node, prev, &wqe->work_list) {
work = container_of(node, struct io_wq_work, list);
if (!match->fn(work, match->data))
continue;
io_wqe_remove_pending(wqe, work, prev);
- spin_unlock_irqrestore(&wqe->lock, flags);
+ raw_spin_unlock_irqrestore(&wqe->lock, flags);
io_run_cancel(work, wqe);
match->nr_pending++;
if (!match->cancel_all)
@@ -966,7 +966,7 @@ static void io_wqe_cancel_pending_work(s
/* not safe to continue after unlock */
goto retry;
}
- spin_unlock_irqrestore(&wqe->lock, flags);
+ raw_spin_unlock_irqrestore(&wqe->lock, flags);
}

static void io_wqe_cancel_running_work(struct io_wqe *wqe,
@@ -1074,7 +1074,7 @@ struct io_wq *io_wq_create(unsigned boun
}
atomic_set(&wqe->acct[IO_WQ_ACCT_UNBOUND].nr_running, 0);
wqe->wq = wq;
- spin_lock_init(&wqe->lock);
+ raw_spin_lock_init(&wqe->lock);
INIT_WQ_LIST(&wqe->work_list);
INIT_HLIST_NULLS_HEAD(&wqe->free_list, 0);
INIT_LIST_HEAD(&wqe->all_list);

2020-09-01 14:40:42

by Jens Axboe

Subject: Re: [PATCH v2] io_wq: Make io_wqe::lock a raw_spinlock_t

On 9/1/20 2:41 AM, Sebastian Andrzej Siewior wrote:
> During a context switch the scheduler invokes wq_worker_sleeping() with
> disabled preemption. Disabling preemption is needed because it protects
> access to `worker->sleeping'. As an optimisation it avoids invoking
> schedule() within the schedule path as part of possible wake up (thus
> preempt_enable_no_resched() afterwards).
>
> The io-wq has been added to the mix in the same section with disabled
> preemption. This breaks on PREEMPT_RT because io_wq_worker_sleeping()
> acquires a spinlock_t. Also within the schedule() the spinlock_t must be
> acquired after tsk_is_pi_blocked() otherwise it will block on the
> sleeping lock again while scheduling out.
>
> While playing with `io_uring-bench' I didn't notice a significant
> latency spike after converting io_wqe::lock to a raw_spinlock_t. The
> latency was more or less the same.
>
> In order to keep the spinlock_t, the io_wq_worker_sleeping() call
> would have to be moved after the tsk_is_pi_blocked() check, which
> would introduce a branch instruction into the hot path.
>
> The lock is used to maintain the `work_list' and at most one task is
> woken up under it.
> Should io_wqe_cancel_pending_work() cause latency spikes while
> searching for a specific item, it would need to drop the lock during
> the iteration.
> revert_creds() is also invoked under the lock. According to debugging,
> cred::non_rcu is 0. Otherwise it would have to be moved outside of the
> locked section because put_cred_rcu()->free_uid() acquires a sleeping
> lock.
>
> Convert io_wqe::lock to a raw_spinlock_t.

Thanks, I've applied this for 5.10.

--
Jens Axboe

2020-09-07 13:01:38

by Will Deacon

Subject: Re: [PATCH 1/2] sched: Bring the PF_IO_WORKER and PF_WQ_WORKER bits closer together

On Wed, Aug 19, 2020 at 09:55:05PM +0200, Sebastian Andrzej Siewior wrote:
> The bits PF_IO_WORKER and PF_WQ_WORKER are tested together in
> sched_submit_work() which is considered to be a hot path.
> If the two bits cross the 8 or 16 bit boundary then most architectures
> require multiple load instructions in order to create the constant
> value. Also, such a value cannot be encoded within the compare opcode.
>
> By moving the bit definition within the same block, the compiler can
> create/use one immediate value.
>
> For some reason gcc-10 on ARM64 requires both bits to be next to each
> other in order to issue "tst reg, val; bne label". Otherwise the result
> is "mov reg1, val; tst reg, reg1; bne label".
>
> Move PF_VCPU out of the way so that PF_IO_WORKER can be next to
> PF_WQ_WORKER.
>
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
> ---
>
> Could someone from the ARM64 camp please verify if this is a gcc "bug"
> or an opcode/arch limitation? With PF_IO_WORKER as 1 (without the
> PF_VCPU swap) I get for ARM:
>
> | tst r2, #33 @ task_flags,
> | beq .L998 @,
>
> however ARM64 does here:
> | mov w0, 33 // tmp117,
> | tst w19, w0 // task_flags, tmp117
> | bne .L453 //,
>
> the extra mov operation. Moving PF_IO_WORKER next to PF_WQ_WORKER, as
> this patch does, gives me:
> | tst w19, 48 // task_flags,
> | bne .L453 //,

Moving an immediate into a register really shouldn't be a performance
issue, so I don't think this is a problem. However, the reason GCC does
this is because of the slightly weird way in which immediates are encoded,
meaning that '33' can't be packed into the 'tst' alias. You can try to
decipher the "DecodeBitMasks()" pseudocode in the Arm ARM if you're
interested.

Will
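
A rough userspace approximation of the constraint Will describes: the
check below is deliberately simplified (the real DecodeBitMasks() also
accepts wrap-around rotations and patterns replicated in smaller
elements, which this ignores), but it is enough to show why 48 fits the
tst alias while 33 does not:

#include <stdbool.h>
#include <stdio.h>

/*
 * Simplified stand-in for the bitmask-immediate check: treat a value as
 * encodable when its set bits form one contiguous run. All-zeros and
 * all-ones are never encodable as logical immediates.
 */
static bool simple_bitmask_imm(unsigned int v)
{
	if (v == 0 || v == ~0u)
		return false;
	/* adding the lowest set bit clears a contiguous run in one carry */
	return (v & (v + (v & -v))) == 0;
}

int main(void)
{
	printf("33 (0b100001) encodable: %d\n", simple_bitmask_imm(33));	/* 0 */
	printf("48 (0b110000) encodable: %d\n", simple_bitmask_imm(48));	/* 1 */
	return 0;
}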