2023-03-24 19:57:00

by Bernd Schubert

[permalink] [raw]
Subject: fuse uring / wake_up on the same core

Ingo, Peter,

I would like to ask how to wake up a task from a waitq on the same core.
I have tried __wake_up_sync()/WF_SYNC, but I do not see any effect.
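
For reference, what I tried is roughly replacing the plain wake_up() in
fuse_request_end() with the sync variant (sketch; the _sync variant
passes WF_SYNC down to try_to_wake_up()):

	/* wake the waiter sleeping in request_wait_answer(), hinting
	 * that the waker is about to sleep
	 */
	__wake_up_sync(&req->waitq, TASK_NORMAL);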

I'm currently working on fuse/uring communication patches; besides uring
communication there is also a queue per core. Basic bonnie++ benchmarks
with a zero file size, to just create/read(0)/delete, show a ~3x IOPS
difference between CPU-bound and unbound bonnie++ - i.e. with these
patches it is _not_ the fuse daemon that needs to be bound, but the
application doing IO to the file system. We basically have

bonnie -> vfs                                (app/vfs)
  fuse_req                                   (app/fuse.ko)
  qid = task_cpu(current)                    (app/fuse.ko)
  ring(qid) / SQE completion                 (app/fuse.ko/uring)
  wait_event(req->waitq, ...)                (app/fuse.ko)
  [app wait]
      daemon ring / handle CQE               (daemon)
      send-back result as SQE                (daemon/uring)
      fuse_request_end                       (daemon/uring/fuse.ko)
      wake_up() ---> random core             (daemon/uring/fuse.ko)
  [app wakeup/fuse/vfs/syscall return]
bonnie ==> different core


1) bound

[root@imesrv1 ~]# numactl --localalloc --physcpubind=0 bonnie++ -q -x 1
-s0 -d /scratch/dest/ -n 20:1:1:20 -r 0 -u 0 | bon_csv2txt
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
      files:max:min /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
imesrv1   20:1:1:20 6229  28 11289  41 12785  24  6615  28  7769  40 10020  25
Latency               411us     824us     816us     298us   10473us     200ms


2) not bound

[root@imesrv1 ~]# bonnie++ -q -x 1 -s0 -d /scratch/dest/ -n 20:1:1:20
-r 0 -u 0 | bon_csv2txt
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
      files:max:min /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
imesrv1   20:1:1:20 2064  33  2923  43  4556  28  2061  33  2186  42  4245  30
Latency               850us    3914us    2496us     738us     758us    6469us


With fewer files the difference becomes a bit smaller, but is still very
visible. Besides cache line bouncing, I'm sure that CPU frequency and
C-states matter - I could tune that in the lab, but in the end I want to
test what users actually run (I had recently checked with a large HPC
center - Forschungszentrum Juelich - their HPC compute nodes are not
tuned up, in order to save energy).
Also, in order to really tune down latencies, I want to add a
struct file_operations::uring_cmd_iopoll thread, which will spin for a
short time and avoid most of the kernel/userspace communication. If
applications (with n-threads < n-cores) then get scheduled on different
cores, different rings will be used, resulting in
n-threads-spinning > n-threads-application.
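
To illustrate the spinning idea only - a rough sketch, where the function
name, the spin budget FUSE_URING_SPIN_NS and the return convention are
all made up:

	/* spin briefly for the completion before falling back to
	 * sleeping on the waitq
	 */
	static bool fuse_uring_spin_for_completion(struct fuse_req *req)
	{
		ktime_t end = ktime_add_ns(ktime_get(), FUSE_URING_SPIN_NS);

		while (!test_bit(FR_FINISHED, &req->flags)) {
			if (ktime_after(ktime_get(), end))
				return false;	/* budget used up, sleep instead */
			cpu_relax();
		}
		return true;
	}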


There was already a related thread about fuse before

https://lore.kernel.org/lkml/[email protected]/

With the fuse-uring patches that part is basically solved - the waitq
that thread is about is not used anymore. But as per above, what remains
is the waitq of the incoming workq (not mentioned in the thread above).
As I wrote, I have tried __wake_up_sync((x), TASK_NORMAL), but it does
not make a difference for me - similar to Miklos' testing before. I have
also tried struct completion / swait - neither makes a difference.
I can see task_struct has wake_cpu, but there doesn't seem to be a good
interface to set it.
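
(The closest I could come up with would be poking the field directly -
hypothetical and almost certainly wrong, since the scheduler owns it:

	p->wake_cpu = task_cpu(current);

so something wake-flag based would be much preferred.)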

Any ideas?


Thanks,
Bernd







2023-03-24 23:02:17

by Bernd Schubert

[permalink] [raw]
Subject: Re: fuse uring / wake_up on the same core

On 3/24/23 20:50, Bernd Schubert wrote:
> Ingo, Peter,
>
> I would like to ask how to wake up a task from a waitq on the same
> core. I have tried __wake_up_sync()/WF_SYNC, but I do not see any effect.
>
> I'm currently working on fuse/uring communication patches; besides uring
> communication there is also a queue per core. Basic bonnie++ benchmarks
> with a zero file size, to just create/read(0)/delete, show a ~3x IOPS
> difference between CPU-bound and unbound bonnie++ - i.e. with these
> patches it is _not_ the fuse daemon that needs to be bound, but the
> application doing IO to the file system. We basically have
>

[...]

> With fewer files the difference becomes a bit smaller, but is still very
> visible. Besides cache line bouncing, I'm sure that CPU frequency and
> C-states matter - I could tune that in the lab, but in the end I want to
> test what users actually run (I had recently checked with a large HPC
> center - Forschungszentrum Juelich - their HPC compute nodes are not
> tuned up, in order to save energy).
> Also, in order to really tune down latencies, I want to add a
> struct file_operations::uring_cmd_iopoll thread, which will spin for a
> short time and avoid most of the kernel/userspace communication. If
> applications (with n-threads < n-cores) then get scheduled on different
> cores, different rings will be used, resulting in
> n-threads-spinning > n-threads-application
>
>
> There was already a related thread about fuse before
>
> https://lore.kernel.org/lkml/[email protected]/
>
> With the fuse-uring patches that part is basically solved - the waitq
> that thread is about is not used anymore. But as per above, what remains
> is the waitq of the incoming workq (not mentioned in the thread above).
> As I wrote, I have tried __wake_up_sync((x), TASK_NORMAL), but it does
> not make a difference for me - similar to Miklos' testing before. I have
> also tried struct completion / swait - neither makes a difference.
> I can see task_struct has wake_cpu, but there doesn't seem to be a good
> interface to set it.
>
> Any ideas?
>

How much of a hack is this patch?

[RFC] fuse: wake on the same core / disable migrate before wait

From: Bernd Schubert <[email protected]>

Avoid bouncing between cores on wake; with uring everything is core
affine and bouncing badly decreases performance. Bouncing is not good
with read/write(/dev/fuse) either - this needs to be tested for
negative impacts.
---
fs/fuse/dev.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index e82db13da8f6..d47b6a492434 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -372,12 +372,17 @@ static void request_wait_answer(struct fuse_req *req)
 	struct fuse_iqueue *fiq = &fc->iq;
 	int err;
 
+	/* avoid bouncing between cores on wake */
+	pr_devel("task=%p before wait on core: %u wake_cpu: %u\n",
+		 current, task_cpu(current), current->wake_cpu);
+	migrate_disable();
+
 	if (!fc->no_interrupt) {
 		/* Any signal may interrupt this */
 		err = wait_event_interruptible(req->waitq,
					test_bit(FR_FINISHED, &req->flags));
 		if (!err)
-			return;
+			goto out;
 
 		set_bit(FR_INTERRUPTED, &req->flags);
 		/* matches barrier in fuse_dev_do_read() */
@@ -391,7 +396,7 @@ static void request_wait_answer(struct fuse_req *req)
 		err = wait_event_killable(req->waitq,
					test_bit(FR_FINISHED, &req->flags));
 		if (!err)
-			return;
+			goto out;
 
 		spin_lock(&fiq->lock);
 		/* Request is not yet in userspace, bail out */
@@ -400,7 +405,7 @@ static void request_wait_answer(struct fuse_req *req)
 			spin_unlock(&fiq->lock);
 			__fuse_put_request(req);
 			req->out.h.error = -EINTR;
-			return;
+			goto out;
 		}
 		spin_unlock(&fiq->lock);
 	}
@@ -410,6 +415,11 @@ static void request_wait_answer(struct fuse_req *req)
 	 * Wait it out.
 	 */
 	wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
+
+out:
+	migrate_enable();
+	pr_devel("task=%p after wait on core: %u\n", current, task_cpu(current));
+
 }
 
 static void __fuse_request_send(struct fuse_req *req)

2023-03-25 07:25:24

by K Prateek Nayak

[permalink] [raw]
Subject: Re: fuse uring / wake_up on the same core

Hello Hillf,

On 3/25/2023 5:58 AM, Hillf Danton wrote:
> On 24 Mar 2023 22:44:16 +0000 Bernd Schubert <[email protected]>
>> How much of a hack is this patch?
>
> It adds churn to my mind, and then another RFC [1] rises.
>
> Feel free to make it work for you and resend it.
>
> [1] Subject: [RFC PATCH 0/5] sched: Userspace Hinting for Task Placement
> https://lore.kernel.org/lkml/[email protected]/

Thank you for pointing to my series.

Another possible way to tackle this is with a strategy Andrei is using in
his "seccomp: add the synchronous mode for seccomp_unotify" series
(https://lore.kernel.org/lkml/[email protected]/)

In patch 2, Andrei adds a WF_CURRENT_CPU that allows the task to always
wake on the CPU where the waker is running.
(https://lore.kernel.org/lkml/[email protected]/)

I believe Bernd's requirement calls for a WF_PREV_CPU flag which always
wakes up the task on the CPU where it previously ran. I believe you can
modify fuse_request_end() (in fs/fuse/dev.c) to use the WF_PREV_CPU flag
when waking the tasks on req->waitq.

(P.S. I'm not familiar with the fuse subsystem but the comment
"Wake up waiter sleeping in request_wait_answer()" in fuse_request_end()
seems to suggest that is the right place to add this flag.)
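
Something along these lines - an untested sketch; neither WF_PREV_CPU nor
a __wake_up_on_prev_cpu() helper exists yet, both would need plumbing
analogous to Andrei's WF_CURRENT_CPU patch:

	/* in fuse_request_end(): wake the waiter on the CPU where it
	 * previously ran instead of letting select_task_rq() pick one
	 */
	__wake_up_on_prev_cpu(&req->waitq, TASK_NORMAL, NULL);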

Peter favored the wake-flag strategy of fixing the wakeup target in a
reply to an earlier version of Andrei's series
(https://lore.kernel.org/lkml/[email protected]/),
but I'll let Peter respond with what he thinks is the right way to tackle
this.

>
>>
>> [RFC] fuse: wake on the same core / disable migrate before wait
>>
>> From: Bernd Schubert <[email protected]>
>>
>> Avoid bouncing between cores on wake; with uring everything is core
>> affine and bouncing badly decreases performance. Bouncing is not good
>> with read/write(/dev/fuse) either - this needs to be tested for
>> negative impacts.
>> ---
>> fs/fuse/dev.c | 16 +++++++++++++---
>> 1 file changed, 13 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index e82db13da8f6..d47b6a492434 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -372,12 +372,17 @@ static void request_wait_answer(struct fuse_req *req)
>> struct fuse_iqueue *fiq = &fc->iq;
>> int err;
>>
>> + /* avoid bouncing between cores on wake */
>> + pr_devel("task=%p before wait on core: %u wake_cpu: %u\n",
>> + current, task_cpu(current), current->wake_cpu);
>> + migrate_disable();
>> +
>> if (!fc->no_interrupt) {
>> /* Any signal may interrupt this */
>> err = wait_event_interruptible(req->waitq,
>> test_bit(FR_FINISHED, &req->flags));
>> if (!err)
>> - return;
>> + goto out;
>>
>> set_bit(FR_INTERRUPTED, &req->flags);
>> /* matches barrier in fuse_dev_do_read() */
>> @@ -391,7 +396,7 @@ static void request_wait_answer(struct fuse_req *req)
>> err = wait_event_killable(req->waitq,
>> test_bit(FR_FINISHED, &req->flags));
>> if (!err)
>> - return;
>> + goto out;
>>
>> spin_lock(&fiq->lock);
>> /* Request is not yet in userspace, bail out */
>> @@ -400,7 +405,7 @@ static void request_wait_answer(struct fuse_req *req)
>> spin_unlock(&fiq->lock);
>> __fuse_put_request(req);
>> req->out.h.error = -EINTR;
>> - return;
>> + goto out;
>> }
>> spin_unlock(&fiq->lock);
>> }
>> @@ -410,6 +415,11 @@ static void request_wait_answer(struct fuse_req *req)
>> * Wait it out.
>> */
>> wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
>> +
>> +out:
>> + migrate_enable();
>> + pr_devel("task=%p after wait on core: %u\n", current, task_cpu(current));
>> +
>> }
>>
>> static void __fuse_request_send(struct fuse_req *req)
>>
--
Thanks and Regards,
Prateek

2023-03-27 10:30:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: fuse uring / wake_up on the same core

On Fri, Mar 24, 2023 at 07:50:12PM +0000, Bernd Schubert wrote:

> With the fuse-uring patches that part is basically solved - the waitq
> that thread is about is not used anymore. But as per above, what remains
> is the waitq of the incoming workq (not mentioned in the thread above).
> As I wrote, I have tried __wake_up_sync((x), TASK_NORMAL), but it does
> not make a difference for me - similar to Miklos' testing before. I have
> also tried struct completion / swait - neither makes a difference.
> I can see task_struct has wake_cpu, but there doesn't seem to be a good
> interface to set it.
>
> Any ideas?

Does the stuff from:

https://lkml.kernel.org/r/[email protected]

work for you?

2023-03-27 14:45:52

by Bernd Schubert

[permalink] [raw]
Subject: Re: fuse uring / wake_up on the same core

On 3/25/23 08:08, K Prateek Nayak wrote:
>
> Hello Hillf,
>
> On 3/25/2023 5:58 AM, Hillf Danton wrote:
>> On 24 Mar 2023 22:44:16 +0000 Bernd Schubert <[email protected]>
>>> How much of a hack is this patch?
>>
>> It adds churn to my mind, and then another RFC [1] rises.
>>
>> Feel free to make it work for you and resend it.
>>
>> [1] Subject: [RFC PATCH 0/5] sched: Userspace Hinting for Task Placement
>> https://lore.kernel.org/lkml/[email protected]/
>
> Thank you for pointing to my series.
>
> Another possible way to tackle this is with a strategy Andrei is using in
> his "seccomp: add the synchronous mode for seccomp_unotify" series
> (https://lore.kernel.org/lkml/[email protected]/)
>
> In patch 2, Andrei adds a WF_CURRENT_CPU that allows the task to always
> wake on the CPU where the waker is running.
> (https://lore.kernel.org/lkml/[email protected]/)
>
> I believe Bernd's requirement calls for a WF_PREV_CPU flag which always
> wakes up the task on the CPU where it previously ran. I believe you can
> modify fuse_request_end() (in fs/fuse/dev.c) to use the WF_PREV_CPU flag
> when waking the tasks on req->waitq.
>
> (P.S. I'm not familiar with the fuse subsystem but the comment
> "Wake up waiter sleeping in request_wait_answer()" in fuse_request_end()
> seems to suggest that is the right place to add this flag.)
>
> Peter favored the wake-flag strategy of fixing the wakeup target in a
> reply to an earlier version of Andrei's series
> (https://lore.kernel.org/lkml/[email protected]/),
> but I'll let Peter respond with what he thinks is the right way to tackle
> this.
>

Thanks Hillf, Prateek and Peter! I'm going right now through Andrei's
patches (will also check Prateek's patches later). At first glance
WF_CURRENT_CPU is exactly what I need. At least for fuse/uring there is
no need for another 'WF_PREV_CPU' flag - the request goes to and comes
back from the ring on the 'current' cpu and only switches on the final
completion - staying on the current cpu is all we need. Will test these
patches later today.


Thanks again,
Bernd

2023-04-26 22:46:19

by Bernd Schubert

[permalink] [raw]
Subject: Re: fuse uring / wake_up on the same core

On 3/27/23 12:28, Peter Zijlstra wrote:
> On Fri, Mar 24, 2023 at 07:50:12PM +0000, Bernd Schubert wrote:
>
>> With the fuse-uring patches that part is basically solved - the waitq
>> that thread is about is not used anymore. But as per above, what remains
>> is the waitq of the incoming workq (not mentioned in the thread above).
>> As I wrote, I have tried __wake_up_sync((x), TASK_NORMAL), but it does
>> not make a difference for me - similar to Miklos' testing before. I have
>> also tried struct completion / swait - neither makes a difference.
>> I can see task_struct has wake_cpu, but there doesn't seem to be a good
>> interface to set it.
>>
>> Any ideas?
>
> Does the stuff from:
>
> https://lkml.kernel.org/r/[email protected]

Thanks Peter, I have already replied in that thread - using
__wake_up_on_current_cpu() helps to avoid cpu migrations. Well, one
update since my last mail in that thread (a few hours ago): more logging
reveals that I still see a few cpu switches, but nothing compared to
what I had before.
My issue now is that these patches are not enough, and contrary to
previous testing, forcefully disabling cpu migration using
migrate_disable() before wait_event_* in fuse's request_wait_answer()
and enabling it after does not help either - my file-creating process
(bonnie++) still migrates to another cpu at some later time.
The only workaround I currently have is to set the userspace ring thread
that processes vfs/fuse events to SCHED_IDLE. In combination with
WF_CURRENT_CPU, performance then goes from ~2200 to ~9000 file creates/s
for a single thread in the latest branch (should be scalable). That is
very close to binding the bonnie++ process to a single core
(~9400 creates/s).
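
For reference, the workaround on the daemon side is nothing more than
this, once per ring thread:

	/* userspace/daemon side: switch the calling ring thread to
	 * SCHED_IDLE; the priority must be 0 for this policy
	 */
	struct sched_param sp = { .sched_priority = 0 };

	if (sched_setscheduler(0, SCHED_IDLE, &sp) == -1)
		perror("sched_setscheduler(SCHED_IDLE)");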

Is there something available to mark ring threads as IO processors, so
that the submitting thread does not get migrated away from them?

* application sends request -> forwards it to the ring and wakes the ring -> waits
* ring thread wakes up (core bound) -> processes request -> sends completion ->
wakes up application -> waits for next request
* application wakes up with the request result

==> I don't understand why the application is moved to another core
at all, once the wake issue is eliminated.

I also see SCHED_IDLE only as a workaround, as it would likely have
side effects if there is anything else running on the system, and it
would consume cpus while another process is doing IO.
Is there a way to trace where and why a process is migrated away?


Thanks,
Bernd

2023-04-27 13:36:49

by Bernd Schubert

[permalink] [raw]
Subject: Re: fuse uring / wake_up on the same core

On 4/27/23 14:24, Hillf Danton wrote:
> On 26 Apr 2023 22:40:32 +0000 Bernd Schubert <[email protected]>
>> My issue now is that these patches are not enough, and contrary to
>> previous testing, forcefully disabling cpu migration using
>> migrate_disable() before wait_event_* in fuse's request_wait_answer()
>> and enabling it after does not help either - my file-creating process
>> (bonnie++) still migrates to another cpu at some later time.
>
> Less than 2 migrates every ten minutes?

The test does not run that long... the task kind of migrates immediately,
I think in less than a second.

>
>> The only workaround I currently have is to set the userspace ring thread
>> that processes vfs/fuse events to SCHED_IDLE. In combination with
>> WF_CURRENT_CPU, performance then goes from ~2200 to ~9000 file creates/s
>> for a single thread in the latest branch (should be scalable). That is
>> very close to binding the bonnie++ process to a single core
>> (~9400 creates/s).
>
> The scheduler is good at dispatching tasks to CPUs at least, and it works
> better with userspace hints, as both Prateek's and Andrei's works propose.
> 9400 shows positive feedback from the kernel, and the question is: is it
> feasible in your production environment to set CPU affinity? If yes, what
> else do you want?

Well, this is the fuse file system - each and every user would need to do that
and get core affinity right. I'm personally not setting core affinity for
any 'cp' or 'rsync' I'm doing.

Btw, a very hackish way to 'solve' the issue is this


diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd7aa679c3ee..dd32effb5010 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -373,6 +373,26 @@ static void request_wait_answer(struct fuse_req *req)
 	int err;
 	int prev_cpu = task_cpu(current);
 
+	/* When running over uring with core-affined userspace threads, we
+	 * do not want to let the request-submitting process migrate away.
+	 * The issue is that even after waking up on the right core, processes
+	 * that have submitted requests might get migrated away, because
+	 * the ring thread is still doing a bit of work or is in the process
+	 * of going to sleep. The assumption here is that processes are
+	 * started on the right core (i.e. idle cores) and can then stay on
+	 * that core when they come and do file system requests.
+	 * An alternative is to set SCHED_IDLE for ring threads, but that
+	 * would be an issue if there are other processes keeping the cpu
+	 * busy.
+	 * SCHED_IDLE or this hack here results in about a factor of 3.5 for
+	 * max meta request performance.
+	 *
+	 * Ideal would be to tell the scheduler that ring threads are not
+	 * disturbing, so that migration away from them should happen very
+	 * rarely.
+	 */
+	if (fc->ring.ready)
+		migrate_disable();
+
 	if (!fc->no_interrupt) {
 		/* Any signal may interrupt this */
 		err = wait_event_interruptible(req->waitq,


So it disables migration and never re-enables it...
I'm still digging for a better way - any hints are very welcome.


Thanks,
Bernd

2023-04-28 22:04:44

by Bernd Schubert

[permalink] [raw]
Subject: Re: fuse uring / wake_up on the same core

On 4/28/23 03:44, Hillf Danton wrote:
> On 27 Apr 2023 13:35:31 +0000 Bernd Schubert <[email protected]>
>> Btw, a very hackish way to 'solve' the issue is this
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index cd7aa679c3ee..dd32effb5010 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -373,6 +373,26 @@ static void request_wait_answer(struct fuse_req *req)
>> int err;
>> int prev_cpu = task_cpu(current);
>>
>> + /* When running over uring with core-affined userspace threads, we
>> + * do not want to let the request-submitting process migrate away.
>> + * The issue is that even after waking up on the right core, processes
>> + * that have submitted requests might get migrated away, because
>> + * the ring thread is still doing a bit of work or is in the process
>> + * of going to sleep. The assumption here is that processes are
>> + * started on the right core (i.e. idle cores) and can then stay on
>> + * that core when they come and do file system requests.
>> + * An alternative is to set SCHED_IDLE for ring threads, but that
>> + * would be an issue if there are other processes keeping the cpu
>> + * busy.
>> + * SCHED_IDLE or this hack here results in about a factor of 3.5 for
>> + * max meta request performance.
>> + *
>> + * Ideal would be to tell the scheduler that ring threads are not
>> + * disturbing, so that migration away from them should happen very
>> + * rarely.
>> + */
>> + if (fc->ring.ready)
>> + migrate_disable();
>> +
>> if (!fc->no_interrupt) {
>> /* Any signal may interrupt this */
>> err = wait_event_interruptible(req->waitq,
>>
> If I understand it correctly, the seesaw workload hint to the scheduler
> looks like the diff below, leaving the scheduler free to pull the two
> players apart across CPUs and to migrate either one.

Thanks a lot, Hillf! I had a day off / family day today; the kernel is
now finally compiling.

>
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -421,6 +421,7 @@ static void __fuse_request_send(struct f
> /* acquire extra reference, since request is still needed
> after fuse_request_end() */
> __fuse_get_request(req);
> + current->seesaw = 1;
> queue_request_and_unlock(fiq, req);
>
> request_wait_answer(req);
> @@ -1229,6 +1230,7 @@ static ssize_t fuse_dev_do_read(struct f
> fc->max_write))
> return -EINVAL;
>
> + current->seesaw = 1;

fuse_dev_do_read is the plain /dev/fuse path (with read/write); there we
don't know on which core these IO threads are running and which of them
to wake up when an application comes with a request.

There is a patch to use __wake_up_sync to wake the IO thread, and reports
that it helps performance, but I don't see that effect and I think Miklos
doesn't either. For direct-io reads I had also already tested disabling
migration - it didn't show any effect - so we better not set
current->seesaw = 1 in fuse_dev_do_read for now.

With my fuse-uring patches things are clearer
(https://lwn.net/Articles/926773/): there is one IO thread per core, and
the libfuse side binds each of these threads to a single core.

nproc   /dev/fuse    /dev/fuse     fuse uring   fuse uring
        migrate on   migrate off   migrate on   migrate off
 1       2023         1652          1151          3998
 2       3375         2805          2221          7950
 4       3823         4193          4540         15022
 8       7796         8161          7846         22591
16       8520         8518         12235         27864
24       8361         8084          9415         27864
32       8361         8084          9124         12971

(in MiB/s)

So core affinity really matters, and with core affinity fuse-uring is
always faster than the existing code.
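
The binding on the libfuse side is plain pthread affinity; roughly, per
queue (sketch, error handling omitted, the thread/qid names here are
illustrative):

	/* bind the ring thread serving queue qid to core qid */
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(qid, &set);
	pthread_setaffinity_np(thread, sizeof(set), &set);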

For single-threaded metadata workloads (file creates/stat/unlink) the
difference between migrate on/off is rather similar. Going to run with
multiple processes during the next days.

For paged (async) IO it behaves a bit differently, as here uring can
show its strength and multiple requests can be combined during CQE
processing - it is better to choose an idle ring thread on another core.
I actually have a question about that as well - later.


> restart:
> for (;;) {
> spin_lock(&fiq->lock);
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -953,6 +953,7 @@ struct task_struct {
> /* delay due to memory thrashing */
> unsigned in_thrashing:1;
> #endif
> + unsigned seesaw:1;
>
> unsigned long atomic_flags; /* Flags requiring atomic access. */
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct *
> if (wake_flags & WF_TTWU) {
> record_wakee(p);
>
> + if (p->seesaw && current->seesaw)
> + return cpu;
> if (sched_energy_enabled()) {
> new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> if (new_cpu >= 0)


Hmm, WF_CURRENT_CPU works rather similarly, except that it also tests
whether the cpu is in cpus_ptr? The combination of both patches results in

	if (p->seesaw && current->seesaw)
		return cpu;

	if ((wake_flags & WF_CURRENT_CPU) &&
	    cpumask_test_cpu(cpu, p->cpus_ptr))
		return cpu;



While writing this mail the kernel compilation finished, but it got
late - will test in the morning.


Thanks again,
Bernd

2023-05-01 21:45:58

by Bernd Schubert

[permalink] [raw]
Subject: Re: fuse uring / wake_up on the same core

On 4/28/23 23:54, Bernd Schubert wrote:
> On 4/28/23 03:44, Hillf Danton wrote:
>>    restart:
>>       for (;;) {
>>           spin_lock(&fiq->lock);
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -953,6 +953,7 @@ struct task_struct {
>>       /* delay due to memory thrashing */
>>       unsigned                        in_thrashing:1;
>>   #endif
>> +    unsigned             seesaw:1;
>>       unsigned long            atomic_flags; /* Flags requiring atomic
>> access. */
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7424,6 +7424,8 @@ select_task_rq_fair(struct task_struct *
>>       if (wake_flags & WF_TTWU) {
>>           record_wakee(p);
>> +        if (p->seesaw && current->seesaw)
>> +            return cpu;
>>           if (sched_energy_enabled()) {
>>               new_cpu = find_energy_efficient_cpu(p, prev_cpu);
>>               if (new_cpu >= 0)
>
>
> Hmm, WF_CURRENT_CPU works rather similar, except that it tests if cpu is
> in cpus_ptr?  The combination of both patches results in
>
>         if (p->seesaw && current->seesaw)
>             return cpu;
>
>         if ((wake_flags & WF_CURRENT_CPU) &&
>             cpumask_test_cpu(cpu, p->cpus_ptr))
>             return cpu;
>
>
>
> While writing this mail the kernel compilation finished, but it got
> late - will test in the morning.

This works wonders! The fuse-uring part is this

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd7aa679c3ee..ec5853ca9646 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -373,6 +373,9 @@ static void request_wait_answer(struct fuse_req *req)
 	int err;
 	int prev_cpu = task_cpu(current);
 
+	if (fc->ring.per_core_queue)
+		current->seesaw = 1;
+
 	if (!fc->no_interrupt) {
 		/* Any signal may interrupt this */
 		err = wait_event_interruptible(req->waitq,
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 7d327699b4c5..715741ed58bf 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -1312,6 +1312,13 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 		/* XXX error injection or test with malicious daemon */
 	}
 
+	/* In combination with the seesaw setting of the requesting
+	 * process (the application, see request_wait_answer), the
+	 * application will stay on the same core.
+	 */
+	if (fc->ring.per_core_queue)
+		current->seesaw = 1;
+
 	ret = fuse_uring_fetch(ring_ent, cmd);
 	break;
 case FUSE_URING_REQ_COMMIT_AND_FETCH:




I'm not familiar at all with scheduler code; given that this works
perfectly, it suggests the same function is also called without an
explicit waitq, when the scheduler preempts a task?

I think there might be side effects - what if multiple applications are
on one core while another core would be available? With this flag they
would stay on the same core? Maybe two flags would be better?

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..07783ddaec5c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -953,6 +953,8 @@ struct task_struct {
 	/* delay due to memory thrashing */
 	unsigned			in_thrashing:1;
 #endif
+	unsigned			seesaw_req:1;
+	unsigned			seesaw_io:1;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9d6ed7585c6..474bf3657ef0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7605,6 +7605,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
+		/* current is handling requests on behalf of the waking process,
+		 * both want to run on the same core in seesaw manner.
+		 */
+		if (p->seesaw_req && current->seesaw_io &&
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
 		if ((wake_flags & WF_CURRENT_CPU) &&
 		    cpumask_test_cpu(cpu, p->cpus_ptr))
 			return cpu;

(not tested yet)


Thanks,
Bernd

2023-05-03 17:07:08

by Bernd Schubert

[permalink] [raw]
Subject: Re: fuse uring / wake_up on the same core

On 5/2/23 02:33, Hillf Danton wrote:
> On 1 May 2023 21:44:48 +0000 Bernd Schubert <[email protected]>
>> I'm not familiar at all with scheduler code; given that this works
>> perfectly, it suggests the same function is also called without an
>> explicit waitq, when the scheduler preempts a task?
>
> Please see comment below in the select_task function.
>>
>> I think there might be side effects - what if multiple applications
>> are on one core while another core would be available? With this flag
>> they would stay on the same core? Maybe two flags would be better?
>
> Gooddd question. Given multiple seesaws,
>
> Player i1 Player j1
> | |
> Seesaw i Seesaw j
> | |
> P i2 P j2
>
> what is ideal is to run Seesaws i and j on different CPU cores.
>
> We can do that by replacing the seesaw flag in the task struct with a
> seesaw id, for instance. But I have no idea how complex it will become
> to set up multiple seesaw workloads on the fuse side, by grouping and
> binding players to different seesaws,
>
> -	unsigned seesaw:1;
> +	unsigned int seesaw;
>
> while the corresponding change to scheduler looks like.
>
> + if (p->seesaw && p->seesaw == current->seesaw &&
> + cpumask_test_cpu(cpu, p->cpus_ptr))
> + return cpu;


Hmm, how is the seesaw id assigned, and assuming two processes landed
on the same core but later another core becomes available, how does it
dynamically change the ID?
My idea with the two bits was that there is a fuse ring thread bound to a
specific core - it is the IO processor and gets the seesaw_proc bit. An
application submitting requests gets the seesaw_req bit set.
Two applications running on the same core won't disturb each other that way.

In addition, if the application is not submitting requests anymore, but
let's say is busy doing computations, we want a timeout that lets it
move away if another core is more suitable. What do you think about
the new patch version at the end of the mail? It uses two bits and jiffies.
Just tested it and it works fine. The exact timeout period is
certainly debatable. I also feel a bit bad about taking so many bits in
struct task_struct. If this approach is acceptable, the jiffies parameter
could probably be a u16.
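
A sketch of the u16 variant - keep only the low 16 bits of jiffies and
compare wrap-safe (seesaw_jiffies16 and SEESAW_TIMEOUT are made up):

	static inline bool seesaw_slot_active(struct task_struct *p)
	{
		u16 age = (u16)jiffies - p->seesaw_jiffies16;

		return age <= SEESAW_TIMEOUT;
	}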

>
> Even after the job is done for fuse, the emerging question is how to
> set up seesaw workloads for crypto, for example, if no more than the
> seesaw id hint to the scheduler is preferred.
>
> And it makes me wonder why Prateek's work has stalled for quite a while,
> as more is needed for userspace hints to the scheduler to work for
> workloads other than seesaw.

Just quickly went over these a bit; assuming seesaw doesn't get accepted
and we need these patches, I think they would need a bit of modification
for fuse:

> + /*
> + * Handle the case where a hint is set and the current CPU
> + * and the previous CPU where task ran don't share caches.
> + */
> + if (wakeup_hint && !cpus_share_cache(cpu, prev_cpu)) {

I'm testing on an older Xeon system (E5-2650) and tried different settings
with numa-binding the application (the fuse ring thread is bound anyway):


                                              governor
                                   conservative (default)   performance
application cpu 0, ring cpu 0      creates/s ~9200          ~9500
application cpu 0, ring cpu 16     creates/s ~4200          ~8000
application cpu 0, ring cpu 1      creates/s ~4200          ~8500


No idea why cpu 1 gives better results in performance mode than cpu 16 -
it might be within measurement accuracy. CPU frequency definitely has the
largest effect here - the cpus_share_cache() condition is not ideal for
fuse. And I guess asking users to enable the cpu performance governor for
fuse is also too much to ask - other file systems don't have that
requirement... So far your seesaw idea works best (the modified version
in combination with wake-on-same-core).

>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 63d242164b1a..07783ddaec5c 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -953,6 +953,8 @@ struct task_struct {
>> /* delay due to memory thrashing */
>> unsigned in_thrashing:1;
>> #endif
>> + unsigned seesaw_req:1;
>> + unsigned seesaw_io:1;
>>
>> unsigned long atomic_flags; /* Flags requiring atomic access. */
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b9d6ed7585c6..474bf3657ef0 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7605,6 +7605,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>> if (wake_flags & WF_TTWU) {
>> record_wakee(p);
>
> Seesaw does not work without WF_TTWU as per define.

What does WF_TTWU actually mean? Something like work flow time to wake unit?

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index cd7aa679c3ee..6da0de4ae9ca 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -411,6 +411,20 @@ static void request_wait_answer(struct fuse_req *req)
 	 * Wait it out.
 	 */
 	wait_event(req->waitq, test_bit(FR_FINISHED, &req->flags));
+
+	/*
+	 * __wake_up_on_current_cpu ensures we wake up on the right core,
+	 * after that we still want to stay on the same core, shared with
+	 * a ring thread to submit the next request to it. The issue without
+	 * seesaw is that while the ring thread is on its way to wait, it
+	 * disturbs the application and the application might get migrated away.
+	 */
+	if (fc->ring.per_core_queue) {
+		current->seesaw_req = 1;
+		current->seesaw_jiffies = jiffies;
+	}
+
+
 out:
 	if (prev_cpu != task_cpu(current))
 		pr_devel("%s cpu switch from=%d to=%d\n",
diff --git a/fs/fuse/dev_uring.c b/fs/fuse/dev_uring.c
index 7d327699b4c5..73adc2b16778 100644
--- a/fs/fuse/dev_uring.c
+++ b/fs/fuse/dev_uring.c
@@ -1312,6 +1312,13 @@ int fuse_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 		/* XXX error injection or test with malicious daemon */
 	}
 
+	/* In combination with the seesaw setting of the requesting
+	 * process (the application, see request_wait_answer), the
+	 * application will stay on the same core.
+	 */
+	if (fc->ring.per_core_queue)
+		current->seesaw_proc = 1;
+
 	ret = fuse_uring_fetch(ring_ent, cmd);
 	break;
 case FUSE_URING_REQ_COMMIT_AND_FETCH:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b1a..53d9c77672b7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -953,6 +953,13 @@ struct task_struct {
 	/* delay due to memory thrashing */
 	unsigned			in_thrashing:1;
 #endif
+	/* requesting task */
+	unsigned			seesaw_req:1;
+	/* request processing task */
+	unsigned			seesaw_proc:1;
+
+	/* limit seesaw time slot */
+	unsigned long			seesaw_jiffies;
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b9d6ed7585c6..a14161e6e456 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7605,6 +7605,17 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
 	if (wake_flags & WF_TTWU) {
 		record_wakee(p);
 
+		/*
+		 * Current is handling requests on behalf of the waking process,
+		 * both want to run on the same core in seesaw manner.
+		 * Typically current is bound to one core and only p is allowed
+		 * to move freely.
+		 */
+		if (p->seesaw_req && current->seesaw_proc &&
+		    time_after(jiffies, p->seesaw_jiffies + 10),
+		    cpumask_test_cpu(cpu, p->cpus_ptr))
+			return cpu;
+
 		if ((wake_flags & WF_CURRENT_CPU) &&
 		    cpumask_test_cpu(cpu, p->cpus_ptr))
 			return cpu;


Thanks,
Bernd

2023-05-05 13:22:45

by Bernd Schubert

[permalink] [raw]
Subject: Re: fuse uring / wake_up on the same core

On 5/4/23 04:16, Hillf Danton wrote:
>
>> + time_after(jiffies, p->seesaw_jiffies + 10),
>> + cpumask_test_cpu(cpu, p->cpus_ptr))
>> + return cpu;

Above is a big typo; I don't even see at first glance how this
compiled at all.

This was supposed to be

	if (p->seesaw_req && current->seesaw_proc &&
	    time_after(jiffies, p->seesaw_jiffies + 10) &&
	    cpumask_test_cpu(cpu, p->cpus_ptr))


Anyway, I now understand that the WF_TTWU flag is related to waitq
wakeups - we don't need the timeout at all. But then, if the main issue
is about waitq migration, I don't understand yet why Andrei's
WF_CURRENT_CPU is not sufficient. I'm going to investigate that next.
It is probably much easier to get that accepted than a rather
fuse-specific seesaw.


Thanks,
Bernd