2021-12-14 20:56:00

by Peter Zijlstra

Subject: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

Hi,

This is actually tested code; but still missing the SMP wake-to-idle machinery.
I still need to think about that.

I'll post my test-hack as a reply, but basically it does co-operative and
preemptive UP-like user scheduling.

Patches go on top of tip/master as they rely on the .fixup removal
recently merged in tip/x86/core.

Also, I still need to audit a bunch of mm code, because I'm not sure things are
actually as well behaved as this code supposes they are.



2021-12-14 21:00:41

by Peter Zijlstra

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Tue, Dec 14, 2021 at 09:44:45PM +0100, Peter Zijlstra wrote:
> I'll post my test-hack as a reply, but basically it does co-operative and
> preemptive UP-like user scheduling.

It's pretty rough, but seems to work. Defaults to co-operative and
switches to preemptive when run with an (any!) argument.

---
// gcc -Itools/include/ -o umcg umcg.c -lpthread

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <time.h>
#include <errno.h>

#ifndef __NR_umcg_ctl
#define __NR_umcg_ctl 450
#define __NR_umcg_wait 451
#define __NR_umcg_kick 452
#endif

#include <linux/list.h>
#include "include/uapi/linux/umcg.h"

/* syscall wrappers */

static inline int
sys_umcg_ctl(u32 flags, struct umcg_task *self, clockid_t which_clock)
{
	return syscall(__NR_umcg_ctl, flags, self, which_clock);
}

static inline int
sys_umcg_wait(u32 flags, u64 timo)
{
	return syscall(__NR_umcg_wait, flags, timo);
}

static inline int
sys_umcg_kick(u32 flags, pid_t tid)
{
	return syscall(__NR_umcg_kick, flags, tid);
}

/* the 'foo' scheduler */

struct foo_task {
	struct umcg_task task;
	struct list_head node;
	pid_t tid;
};

struct foo_server {
	struct umcg_task task;
	struct list_head node;
	pid_t tid;
	struct foo_task *cur;
};

void foo_add(struct foo_server *server, struct umcg_task *t)
{
	struct foo_task *foo = container_of(t, struct foo_task, task);

	t->runnable_workers_ptr = 0ULL;
	list_add_tail(&foo->node, &server->node);
}

struct foo_task *foo_pick_next(struct foo_server *server)
{
	struct foo_task *first = NULL;

	if (list_empty(&server->node))
		return first;

	first = list_first_entry(&server->node, struct foo_task, node);
	list_del(&first->node);
	return first;
}

#define NSEC_PER_SEC 1000000000ULL

u64 foo_time(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (unsigned long long)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
}

void foo_yield(struct umcg_task *self)
{
	self->state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT;
	sys_umcg_wait(0, 0);
}

#define TICK_NSEC NSEC_PER_SEC

volatile bool foo_preemptible = false;

/* our workers */

/* always running worker */
void *worker_fn0(void *arg)
{
	struct foo_server *server = arg;
	struct foo_task task = { };
	unsigned long i = 0;
	int ret;

	task.tid = gettid();
	task.task.server_tid = server->tid;
	task.task.state = UMCG_TASK_BLOCKED;

	printf("A == %d\n", gettid());

	ret = sys_umcg_ctl(UMCG_CTL_REGISTER|UMCG_CTL_WORKER, &task.task, CLOCK_MONOTONIC);
	if (ret) {
		perror("umcg_ctl(A): ");
		exit(-1);
	}

	for (;;) {
		int x = i++;

		if (!(x % 1000000)) {
			putchar('.');
			fflush(stdout);
		}

		/* co-operative or preemptible */
		if (!foo_preemptible && !(x % 10000000))
			foo_yield(&task.task);
	}

	return NULL;
}

/* event driven worker */
void *worker_fn1(void *arg)
{
	struct foo_server *server = arg;
	struct foo_task task = { };
	int ret;

	task.tid = gettid();
	task.task.server_tid = server->tid;
	task.task.state = UMCG_TASK_BLOCKED;

	printf("B == %d\n", gettid());

	ret = sys_umcg_ctl(UMCG_CTL_REGISTER|UMCG_CTL_WORKER, &task.task, CLOCK_MONOTONIC);
	if (ret) {
		perror("umcg_ctl(B): ");
		exit(-1);
	}

	for (;;) {
		printf("B\n");
		fflush(stdout);

		sleep(2);
	}

	return NULL;
}

void *worker_fn2(void *arg)
{
	struct foo_server *server = arg;
	struct foo_task task = { };
	int ret;

	task.tid = gettid();
	task.task.server_tid = server->tid;
	task.task.state = UMCG_TASK_BLOCKED;

	printf("C == %d\n", gettid());

	ret = sys_umcg_ctl(UMCG_CTL_REGISTER|UMCG_CTL_WORKER, &task.task, CLOCK_MONOTONIC);
	if (ret) {
		perror("umcg_ctl(C): ");
		exit(-1);
	}

	for (;;) {
		printf("C\n");
		fflush(stdout);

		sleep(3);
	}

	return NULL;
}

/* the server */

int main(int argc, char **argv)
{
	struct umcg_task *runnable_ptr, *next;
	struct foo_server server = { };
	pthread_t worker[3];
	u64 timeout = 0;
	int ret;

	printf("server == %d\n", gettid());
	fflush(stdout);

	server.tid = gettid();
	INIT_LIST_HEAD(&server.node);
	server.task.server_tid = gettid();
	server.task.state = UMCG_TASK_RUNNING;

	ret = sys_umcg_ctl(UMCG_CTL_REGISTER, &server.task, CLOCK_MONOTONIC);
	if (ret) {
		perror("umcg_ctl: ");
		exit(-1);
	}

	pthread_create(&worker[0], NULL, worker_fn0, &server);
	pthread_create(&worker[1], NULL, worker_fn1, &server);
	pthread_create(&worker[2], NULL, worker_fn2, &server);

	if (argc > 1) {
		foo_preemptible = true;
		/*
		 * setup preemption tick
		 */
		timeout = foo_time() + TICK_NSEC;
	}

	for (;;) {
		/*
		 * Mark the server as runnable first, so we can detect
		 * additions to the runnable list after we read it.
		 */
		server.task.state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT;

		/*
		 * consume the runnable notification list and add
		 * the tasks to our local runqueue.
		 */
		runnable_ptr = (void*)__atomic_exchange_n(&server.task.runnable_workers_ptr,
							  NULL, __ATOMIC_SEQ_CST);
		while (runnable_ptr) {
			next = (void *)runnable_ptr->runnable_workers_ptr;
			foo_add(&server, runnable_ptr);
			runnable_ptr = next;
		}

		/*
		 * If we've got a current running task, the server might have
		 * gotten a 'spurious' wakeup to pick up new runnable tasks.
		 *
		 * In this case, don't pick a new task (possible
		 * wakeup-preemption point, not implemented here).
		 *
		 * Note: even though this RUNNING test is racy, if the task
		 * blocks afterwards we'll get a RUNNABLE notification which
		 * will clear our RUNNABLE state and sys_umcg_wait() will
		 * return -EAGAIN.
		 */
		if (server.cur && server.cur->task.state == UMCG_TASK_RUNNING) {
			/*
			 * Assert ::next_tid is clear, it should have been
			 * consumed.
			 */
			if (server.task.next_tid) {
				printf("current running, but still have next_tid\n");
				exit(-1);
			}

			putchar('x');
			fflush(stdout);
		} else {
			/*
			 * Pick the next task...
			 */
			server.cur = foo_pick_next(&server);
			server.task.next_tid = server.cur ? server.cur->tid : 0;

			printf("pick: %d\n", server.task.next_tid);
			fflush(stdout);
		}

		/*
		 * And switch...
		 */
		ret = sys_umcg_wait(0, timeout);

		/*
		 * If we did set ::next_tid but it hasn't been consumed by the
		 * syscall due to failure, make sure to put the task back on
		 * the runqueue, lest we leak it.
		 */
		if (server.task.next_tid) {
			foo_add(&server, &server.cur->task);
			server.cur = NULL;
			server.task.next_tid = 0;
		}

		if (!ret)
			continue;

		switch (errno) {
		case EAGAIN:
			/*
			 * Got a wakeup, try again.
			 */
			continue;

		case ETIMEDOUT:
			/*
			 * timeout: drive preemption
			 */
			putchar('t');
			fflush(stdout);

			/*
			 * Next tick..
			 */
			timeout += TICK_NSEC;

			/*
			 * If we have a current, cmpxchg set TF_PREEMPT and on
			 * success send it a signal to kick it into the kernel
			 * such that it might re-report itself runnable.
			 */
			if (server.cur) {
				struct foo_task *t = server.cur;
				u32 val = UMCG_TASK_RUNNING;
				u32 new = UMCG_TASK_RUNNING | UMCG_TF_PREEMPT;

				if (__atomic_compare_exchange_n(&t->task.state, &val, new,
								false, __ATOMIC_SEQ_CST,
								__ATOMIC_SEQ_CST)) {
					sys_umcg_kick(0, t->tid);
				}
			}
			/*
			 * Either way around, if the cmpxchg failed the task
			 * will have blocked and we should re-start the loop.
			 */
			continue;

		default:
			printf("errno: %d\n", errno);
			perror("wait:");
			exit(-1);
		}
	}

	return 0;
}


2021-12-15 03:46:39

by Peter Oskolkov

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Tue, Dec 14, 2021 at 12:55 PM Peter Zijlstra <[email protected]> wrote:
>
> Hi,
>
> This is actually tested code; but still missing the SMP wake-to-idle machinery.
> I still need to think about that.

Thanks, Peter!

At a first glance, your main patch does not look much smaller than
mine, and I thought the whole point of re-doing it was to throw away
extra features and make things smaller/simpler...

Anyway, I'll test your patchset over the next week or so and let you
know if anything really needed is missing (other than waking an idle
server if there is one on a worker wakeup; this piece is definitely
needed).

>
> I'll post my test-hack as a reply, but basically it does co-operative and
> preemptive UP-like user scheduling.
>
> Patches go on top of tip/master as they rely on the .fixup removal
> recently merged in tip/x86/core.
>
> Also, I still need to audit a bunch of mm code, because I'm not sure things are
> actually as well behaved as this code supposes they are.
>

2021-12-15 10:06:54

by Peter Zijlstra

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Tue, Dec 14, 2021 at 07:46:25PM -0800, Peter Oskolkov wrote:
> On Tue, Dec 14, 2021 at 12:55 PM Peter Zijlstra <[email protected]> wrote:
> >
> > Hi,
> >
> > This is actually tested code; but still missing the SMP wake-to-idle machinery.
> > I still need to think about that.
>
> Thanks, Peter!
>
> At a first glance, your main patch does not look much smaller than
> mine, and I thought the whole point of re-doing it was to throw away
> extra features and make things smaller/simpler...

Well, simpler was the goal. I didn't really focus on size much. It isn't
really big to begin with.

But yes, it has 5 hooks now, 3 syscalls and lots of comments and all
that under 900 lines, not bad I'd say.

Also I think you wanted something like this? I'm not sure of the LAZY
name, but I can't seem to come up with anything saner atm.


---
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1297,6 +1297,7 @@ struct task_struct {

#ifdef CONFIG_UMCG
/* setup by sys_umcg_ctrl() */
+ u32 umcg_flags;
clockid_t umcg_clock;
struct umcg_task __user *umcg_task;

--- a/include/uapi/linux/umcg.h
+++ b/include/uapi/linux/umcg.h
@@ -133,11 +133,13 @@ struct umcg_task {
* @UMCG_CTL_REGISTER: register the current task as a UMCG task
* @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
* @UMCG_CTL_WORKER: register the current task as a UMCG worker
+ * @UMCG_CTL_LAZY: don't wake server on runnable enqueue
*/
enum umcg_ctl_flag {
UMCG_CTL_REGISTER = 0x00001,
UMCG_CTL_UNREGISTER = 0x00002,
UMCG_CTL_WORKER = 0x10000,
+ UMCG_CTL_LAZY = 0x20000,
};

#endif /* _UAPI_LINUX_UMCG_H */
--- a/kernel/sched/umcg.c
+++ b/kernel/sched/umcg.c
@@ -416,6 +416,27 @@ static int umcg_enqueue_runnable(struct
}

/*
+ * Enqueue tsk to its server's runnable list and wake the server for pickup if
+ * so desired. Notably, LAZY workers will not wake the server and rely on the
+ * server to do pickup whenever it naturally runs next.
+ *
+ * Returns:
+ * 0: success
+ * -EFAULT
+ */
+static int umcg_enqueue_and_wake(struct task_struct *tsk, bool force)
+{
+ int ret = umcg_enqueue_runnable(tsk);
+ if (ret)
+ return ret;
+
+ if (force || !(tsk->umcg_flags & UMCG_CTL_LAZY))
+ ret = umcg_wake_server(tsk);
+
+ return ret;
+}
+
+/*
* umcg_wait: Wait for ->state to become RUNNING
*
* Returns:
@@ -522,12 +543,8 @@ void umcg_sys_exit(struct pt_regs *regs)
if (umcg_update_state(tsk, self, UMCG_TASK_BLOCKED, UMCG_TASK_RUNNABLE))
UMCG_DIE_UNPIN("state");

- if (umcg_enqueue_runnable(tsk))
- UMCG_DIE_UNPIN("enqueue");
-
- /* Server might not be RUNNABLE, means it's already running */
- if (umcg_wake_server(tsk))
- UMCG_DIE_UNPIN("wake-server");
+ if (umcg_enqueue_and_wake(tsk, false))
+ UMCG_DIE_UNPIN("enqueue-and-wake");

umcg_unpin_pages();

@@ -582,15 +599,11 @@ void umcg_notify_resume(struct pt_regs *
UMCG_TASK_RUNNABLE))
UMCG_DIE_UNPIN("state");

- if (umcg_enqueue_runnable(tsk))
- UMCG_DIE_UNPIN("enqueue");
-
/*
- * XXX do we want a preemption consuming ::next_tid ?
- * I'm currently leaning towards no.
+ * Preemption relies on waking the server on enqueue.
*/
- if (umcg_wake_server(tsk))
- UMCG_DIE_UNPIN("wake-server");
+ if (umcg_enqueue_and_wake(tsk, true))
+ UMCG_DIE_UNPIN("enqueue-and-wake");

umcg_unpin_pages();
}
@@ -686,23 +699,15 @@ SYSCALL_DEFINE2(umcg_wait, u32, flags, u
goto unpin;

if (worker) {
- ret = umcg_enqueue_runnable(tsk);
+ ret = umcg_enqueue_and_wake(tsk, !tsk->umcg_next);
if (ret)
goto unpin;
}

- if (worker)
- ret = umcg_wake(tsk);
- else if (tsk->umcg_next)
+ if (tsk->umcg_next) {
ret = umcg_wake_next(tsk);
-
- if (ret) {
- /*
- * XXX already enqueued ourself on ::server_tid; failing now
- * leaves the lot in an inconsistent state since it'll also
- * unblock self in order to return the error. !?!?
- */
- goto unpin;
+ if (ret)
+ goto unpin;
}

umcg_unpin_pages();
@@ -783,7 +788,8 @@ SYSCALL_DEFINE3(umcg_ctl, u32, flags, st

if (flags & ~(UMCG_CTL_REGISTER |
UMCG_CTL_UNREGISTER |
- UMCG_CTL_WORKER))
+ UMCG_CTL_WORKER |
+ UMCG_CTL_LAZY))
return -EINVAL;

if (flags == UMCG_CTL_UNREGISTER) {
@@ -827,7 +833,7 @@ SYSCALL_DEFINE3(umcg_ctl, u32, flags, st
rcu_read_lock();
server = find_task_by_vpid(ut.server_tid);
if (server && server->mm == current->mm) {
- if (flags == UMCG_CTL_WORKER) {
+ if (flags & UMCG_CTL_WORKER) {
if (!server->umcg_task ||
(server->flags & PF_UMCG_WORKER))
server = NULL;
@@ -843,10 +849,11 @@ SYSCALL_DEFINE3(umcg_ctl, u32, flags, st
if (!server)
return -ESRCH;

- if (flags == UMCG_CTL_WORKER) {
+ if (flags & UMCG_CTL_WORKER) {
if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_BLOCKED)
return -EINVAL;

+ current->umcg_flags = flags & UMCG_CTL_LAZY;
WRITE_ONCE(current->umcg_task, self);
current->flags |= PF_UMCG_WORKER; /* hook schedule() */
set_syscall_work(SYSCALL_UMCG); /* hook syscall */
@@ -858,6 +865,7 @@ SYSCALL_DEFINE3(umcg_ctl, u32, flags, st
if ((ut.state & (UMCG_TASK_MASK | UMCG_TF_MASK)) != UMCG_TASK_RUNNING)
return -EINVAL;

+ current->umcg_flags = 0;
WRITE_ONCE(current->umcg_task, self);
set_thread_flag(TIF_UMCG); /* hook return-to-user */
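
For illustration only (not part of the patch): a worker opting in to the LAZY
behaviour would then register much like worker_fn0() in the test program
above, just with the extra flag, something like:

/*
 * Sketch, reusing struct foo_task and the sys_umcg_ctl() wrapper from the
 * test program; UMCG_CTL_LAZY means: don't wake the server on runnable
 * enqueue.
 */
static int register_lazy_worker(struct foo_server *server, struct foo_task *task)
{
	task->tid = gettid();
	task->task.server_tid = server->tid;
	task->task.state = UMCG_TASK_BLOCKED;

	return sys_umcg_ctl(UMCG_CTL_REGISTER | UMCG_CTL_WORKER | UMCG_CTL_LAZY,
			    &task->task, CLOCK_MONOTONIC);
}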


2021-12-15 10:45:18

by Peter Zijlstra

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Tue, Dec 14, 2021 at 07:46:25PM -0800, Peter Oskolkov wrote:

> Anyway, I'll test your patchset over the next week or so and let you
> know if anything really needed is missing (other than waking an idle
> server if there is one on a worker wakeup; this piece is definitely
> needed).

Right, so the problem I'm having is that a single idle server ptr like
before can trivially miss waking another idle server.

Suppose:

umcg::idle_server_tid_ptr

Then the enqueue_and_wake() thing from the last patch would:

idle_server_tid = xchg((pid_t __user *)self->idle_server_tid_ptr, 0);

to consume the tid, and then use that to enqueue and wake. But what if a
second wakeup happens right after that? There might be a second idle
server, but we'll never find it, because userspace hasn't had time to
update the field again.
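
Spelled out as pseudo-code (the xchg() on user memory is shorthand, as in the
snippet above, and umcg_wake_tid() is a made-up helper), that path is:

/*
 * Pseudo-code sketch of the single-pointer variant; not actual patch code.
 */
static int umcg_wake_idle_server(struct umcg_task __user *self)
{
	pid_t idle_server_tid = xchg((pid_t __user *)self->idle_server_tid_ptr, 0);

	if (!idle_server_tid)
		return -EAGAIN;		/* nobody advertised as idle */

	/*
	 * From here until userspace stores a new tid, a concurrent wakeup
	 * reads 0 above and fails, even if more servers are in fact idle.
	 */
	return umcg_wake_tid(idle_server_tid);
}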

Alternatively, we do a linked list of servers, but then every such
wakeup needs to iterate the whole list, looking for one that has
UMCG_TF_IDLE set, or something like that, but that lookup is bad for
performance.

So I'm really not sure what way to go yet.

2021-12-15 13:04:25

by Peter Zijlstra

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Wed, Dec 15, 2021 at 11:06:20AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 14, 2021 at 07:46:25PM -0800, Peter Oskolkov wrote:
> > On Tue, Dec 14, 2021 at 12:55 PM Peter Zijlstra <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > This is actually tested code; but still missing the SMP wake-to-idle machinery.
> > > I still need to think about that.
> >
> > Thanks, Peter!
> >
> > At a first glance, your main patch does not look much smaller than
> > mine, and I thought the whole point of re-doing it was to throw away
> > extra features and make things smaller/simpler...
>
> Well, simpler was the goal. I didn't really focus on size much. It isn't
> really big to begin with.
>
> But yes, it has 5 hooks now, 3 syscalls and lots of comments and all
> that under 900 lines, not bad I'd say.
>
> Also I think you wanted something like this? I'm not sure of the LAZY
> name, but I can't seem to come up with anything saner atm.
>
>
> ---
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1297,6 +1297,7 @@ struct task_struct {
>
> #ifdef CONFIG_UMCG
> /* setup by sys_umcg_ctrl() */
> + u32 umcg_flags;
> clockid_t umcg_clock;
> struct umcg_task __user *umcg_task;
>
> --- a/include/uapi/linux/umcg.h
> +++ b/include/uapi/linux/umcg.h
> @@ -133,11 +133,13 @@ struct umcg_task {
> * @UMCG_CTL_REGISTER: register the current task as a UMCG task
> * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
> * @UMCG_CTL_WORKER: register the current task as a UMCG worker
> + * @UMCG_CTL_LAZY: don't wake server on runnable enqueue
> */
> enum umcg_ctl_flag {
> UMCG_CTL_REGISTER = 0x00001,
> UMCG_CTL_UNREGISTER = 0x00002,
> UMCG_CTL_WORKER = 0x10000,
> + UMCG_CTL_LAZY = 0x20000,
> };
>
> #endif /* _UAPI_LINUX_UMCG_H */
> --- a/kernel/sched/umcg.c
> +++ b/kernel/sched/umcg.c
> @@ -416,6 +416,27 @@ static int umcg_enqueue_runnable(struct
> }
>
> /*
> + * Enqueue tsk to it's server's runnable list and wake the server for pickup if
> + * so desired. Notable LAZY workers will not wake the server and rely on the
> + * server to do pickup whenever it naturally runs next.
> + *
> + * Returns:
> + * 0: success
> + * -EFAULT
> + */
> +static int umcg_enqueue_and_wake(struct task_struct *tsk, bool force)
> +{
> + int ret = umcg_enqueue_runnable(tsk);
> + if (ret)
> + return ret;
> +
> + if (force || !(tsk->umcg_flags & UMCG_CTL_LAZY))
> + ret = umcg_wake_server(tsk);
> +
> + return ret;
> +}

Aah, this has a problem when the server is otherwise idle. I think we
need that TF_IDLE thing for this too. Let me go write a test-case for
all this.

2021-12-15 13:49:56

by Matthew Wilcox

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Wed, Dec 15, 2021 at 11:44:49AM +0100, Peter Zijlstra wrote:
> On Tue, Dec 14, 2021 at 07:46:25PM -0800, Peter Oskolkov wrote:
>
> > Anyway, I'll test your patchset over the next week or so and let you
> > know if anything really needed is missing (other than waking an idle
> > server if there is one on a worker wakeup; this piece is definitely
> > needed).
>
> Right, so the problem I'm having is that a single idle server ptr like
> before can trivially miss waking annother idle server.
>
> Suppose:
>
> umcg::idle_server_tid_ptr
>
> Then the enqueue_and_wake() thing from the last patch would:
>
> idle_server_tid = xchg((pid_t __user *)self->idle_server_tid_ptr, 0);
>
> to consume the tid, and then use that to enqueue and wake. But what if a
> second wakeup happens right after that? There might be a second idle
> server, but we'll never find it, because userspace hasn't had time to
> update the field again.
>
> Alternatively, we do a linked list of servers, but then every such
> wakeup needs to iterate the whole list, looking for one that has
> UMCG_TF_IDLE set, or something like that, but that lookup is bad for
> performance.
>
> So I'm really not sure what way to go yet.

1. Linked lists are fugly and bad for the CPU.

2. I'm not sure how big the 'N' in 'M:N' is supposed to be. Might be
one per hardware thread? So it could be hundreds-to-thousands,
depending on the scale of system.

3. The interface between user-kernel could be an array of idle tids,
maybe 16 entries long (16 * 4 = 64 bytes, just one cacheline). As a
server finishes work, it looks for a 0 tid in the batch and stores
its tid in the slot (cmpxchg, I guess, since the array will be shared
between processes). If there are no free slots in the array, then we
definitely have 16 threads already waiting for work, so it can park itself
in whatever data structure userspace wants to use to manage idle servers.
It's up to userspace to decide when to repopulate the array of available
servers from its data structure of idle servers.
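
A minimal userspace sketch of (3), using the compiler's atomic builtins; the
array layout and helper names are illustrative, not an existing interface:

#include <sys/types.h>
#include <stdint.h>

#define NR_IDLE_SLOTS	16		/* 16 * 4 bytes == one cacheline */

/* Shared with the kernel in the real thing; 0 means "slot free". */
static int32_t idle_tids[NR_IDLE_SLOTS] __attribute__((aligned(64)));

/*
 * Server side: advertise this server as idle by claiming a free slot.
 * Returns -1 if all slots are taken, i.e. at least NR_IDLE_SLOTS servers
 * are already waiting for work and this one can park in the userspace
 * overflow structure instead.
 */
static int idle_slot_claim(pid_t self_tid)
{
	for (int i = 0; i < NR_IDLE_SLOTS; i++) {
		int32_t expected = 0;

		if (__atomic_compare_exchange_n(&idle_tids[i], &expected,
						(int32_t)self_tid, false,
						__ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
			return 0;
	}
	return -1;
}

/*
 * Consumer side (the kernel on a worker wakeup, via the equivalent
 * user-memory accessors): take one advertised tid, or 0 if none.
 */
static pid_t idle_slot_get(void)
{
	for (int i = 0; i < NR_IDLE_SLOTS; i++) {
		int32_t tid = __atomic_exchange_n(&idle_tids[i], 0, __ATOMIC_SEQ_CST);

		if (tid)
			return (pid_t)tid;
	}
	return 0;
}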

2021-12-15 17:55:13

by Peter Zijlstra

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Wed, Dec 15, 2021 at 01:49:28PM +0000, Matthew Wilcox wrote:
> On Wed, Dec 15, 2021 at 11:44:49AM +0100, Peter Zijlstra wrote:
> > On Tue, Dec 14, 2021 at 07:46:25PM -0800, Peter Oskolkov wrote:
> >
> > > Anyway, I'll test your patchset over the next week or so and let you
> > > know if anything really needed is missing (other than waking an idle
> > > server if there is one on a worker wakeup; this piece is definitely
> > > needed).
> >
> > Right, so the problem I'm having is that a single idle server ptr like
> > before can trivially miss waking annother idle server.
> >
> > Suppose:
> >
> > umcg::idle_server_tid_ptr
> >
> > Then the enqueue_and_wake() thing from the last patch would:
> >
> > idle_server_tid = xchg((pid_t __user *)self->idle_server_tid_ptr, 0);
> >
> > to consume the tid, and then use that to enqueue and wake. But what if a
> > second wakeup happens right after that? There might be a second idle
> > server, but we'll never find it, because userspace hasn't had time to
> > update the field again.
> >
> > Alternatively, we do a linked list of servers, but then every such
> > wakeup needs to iterate the whole list, looking for one that has
> > UMCG_TF_IDLE set, or something like that, but that lookup is bad for
> > performance.
> >
> > So I'm really not sure what way to go yet.
>
> 1. Linked lists are fugly and bad for the CPU.

Absolutely.. although a stack might work, except for that ABA issue (and
contention).

> 2. I'm not sure how big the 'N' in 'M:N' is supposed to be. Might be
> one per hardware thread? So it could be hundreds-to-thousands,
> depending on the scale of system.

Typically yes, one server task per hardware thread. Now, I'm also fairly
sure you don't want excessive cross-node traffic for this stuff, so that
puts a limit on things as well.

> 3. The interface between user-kernel could be an array of idle tids,
> maybe 16 entries long (16 * 4 = 64 bytes, just one cacheline). As a
> server finishes work, it looks for a 0 tid in the batch and stores
> its tid in the slot (cmpxchg, I guess, since the array will be shared
> between processes). If there are no free slots in the array, then we
> definitely have 16 threads already waiting for work, so it can park itself
> in whatever data structure userspace wants to use to manage idle servers.
> It's up to userspace to decide when to repopulate the array of available
> servers from its data structure of idle servers.

Right, a tid array might work. Could even have userspace specify the
length, then it can do the trade-offs all on its own. Either a fixed
location for each server and a larger array, or clever things, whatever
they want.

I suppose I'll code up the variable length array, we have space for
that.

2021-12-15 17:56:21

by Peter Oskolkov

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Wed, Dec 15, 2021 at 2:06 AM Peter Zijlstra <[email protected]> wrote:
>
> On Tue, Dec 14, 2021 at 07:46:25PM -0800, Peter Oskolkov wrote:
> > On Tue, Dec 14, 2021 at 12:55 PM Peter Zijlstra <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > This is actually tested code; but still missing the SMP wake-to-idle machinery.
> > > I still need to think about that.
> >
> > Thanks, Peter!
> >
> > At a first glance, your main patch does not look much smaller than
> > mine, and I thought the whole point of re-doing it was to throw away
> > extra features and make things smaller/simpler...
>
> Well, simpler was the goal. I didn't really focus on size much. It isn't
> really big to begin with.
>
> But yes, it has 5 hooks now, 3 syscalls and lots of comments and all
> that under 900 lines, not bad I'd say.

My patchset had three hooks and two syscalls, and fewer new fields
added to task_struct... And similarly around 900 lines on the kernel
side in the main patch. So I am not sure why you believe that your
approach is simpler, unless there was something fundamentally wrong
with my approach. But tglx@ looked into it, and his remarks were more
about comments and the commit message and smaller things at a function
level, like an unneeded goto, than about the overall design...

>
> Also I think you wanted something like this? I'm not sure of the LAZY
> name, but I can't seem to come up with anything saner atm.
>
[...]
> /*
> + * Enqueue tsk to it's server's runnable list and wake the server for pickup if
> + * so desired. Notable LAZY workers will not wake the server and rely on the
> + * server to do pickup whenever it naturally runs next.

No, I never suggested we needed per-server runnable queues: in all my
patchsets I had a single list of idle (runnable) workers.

[...]

From another message:

>> Anyway, I'll test your patchset over the next week or so and let you
>> know if anything really needed is missing (other than waking an idle
>> server if there is one on a worker wakeup; this piece is definitely
> needed).

> Right, so the problem I'm having is that a single idle server ptr like
> before can trivially miss waking annother idle server.

I believe the approach I used in my patchset, suggested by Thierry
Delisle, works.

In short, there is a single idle server ptr for the kernel to work
with. The userspace maintains a list of idle servers. If the ptr is
NULL, the list is empty. When the kernel wakes the idle server it
sees, the server reaps the runnable worker list and wakes another idle
server from the userspace list, if available. This newly woken idle
server repoints the ptr to itself, checks the runnable worker list, to
avoid missing a woken worker, then goes to sleep.

Why do you think this approach is not OK?

2021-12-15 18:19:28

by Peter Zijlstra

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote:

> > Right, so the problem I'm having is that a single idle server ptr like
> > before can trivially miss waking annother idle server.
>
> I believe the approach I used in my patchset, suggested by Thierry
> Delisle, works.
>
> In short, there is a single idle server ptr for the kernel to work
> with. The userspace maintains a list of idle servers. If the ptr is
> NULL, the list is empty. When the kernel wakes the idle server it
> sees, the server reaps the runnable worker list and wakes another idle
> server from the userspace list, if available. This newly woken idle
> server repoints the ptr to itself, checks the runnable worker list, to
> avoid missing a woken worker, then goes to sleep.
>
> Why do you think this approach is not OK?

Suppose at least 4 servers, 2 idle, 2 working.

Now, the first of the working servers (let's call it S0) gets a wakeup
(say Ta), it finds the idle server (say S3) and consumes it, sticking Ta
on S3 and kicking it alive.

Concurrently, and losing the race, the other working server (S1) gets a
wake-up from Tb, like said, it lost, no idle server, so Tb goes on the
queue of S1.

So then S3 wakes, finds Ta and they live happily ever after.

While S2 and Tb fail to meet one another and both linger in sadness.


2021-12-15 18:25:36

by Peter Zijlstra

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote:
> On Wed, Dec 15, 2021 at 2:06 AM Peter Zijlstra <[email protected]> wrote:
> > /*
> > + * Enqueue tsk to it's server's runnable list and wake the server for pickup if
> > + * so desired. Notable LAZY workers will not wake the server and rely on the
> > + * server to do pickup whenever it naturally runs next.
>
> No, I never suggested we needed per-server runnable queues: in all my
> patchsets I had a single list of idle (runnable) workers.

This is not about the idle servers..

So without the LAZY thing on, a previously blocked task hitting sys_exit
will enqueue itself on the runnable list and wake the server for pickup.

IIRC you didn't like the server waking while it was still running
another task, but instead preferred to have it pick up the newly
enqueued task when next it ran.

LAZY enables that.. *however* it does need to wake the server when it is
idle, otherwise they'll all sit there waiting for one another.

2021-12-15 19:50:09

by Peter Oskolkov

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Wed, Dec 15, 2021 at 10:19 AM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote:
>
> > > Right, so the problem I'm having is that a single idle server ptr like
> > > before can trivially miss waking annother idle server.
> >
> > I believe the approach I used in my patchset, suggested by Thierry
> > Delisle, works.
> >
> > In short, there is a single idle server ptr for the kernel to work
> > with. The userspace maintains a list of idle servers. If the ptr is
> > NULL, the list is empty. When the kernel wakes the idle server it
> > sees, the server reaps the runnable worker list and wakes another idle
> > server from the userspace list, if available. This newly woken idle
> > server repoints the ptr to itself, checks the runnable worker list, to
> > avoid missing a woken worker, then goes to sleep.
> >
> > Why do you think this approach is not OK?
>
> Suppose at least 4 servers, 2 idle, 2 working.
>
> Now, the first of the working servers (lets call it S0) gets a wakeup
> (say Ta), it finds the idle server (say S3) and consumes it, sticking Ta
> on S3 and kicking it alive.

TL;DR: our models are different here. In your model a single server
can have a bunch of workers interacting with it; in my model only a
single RUNNING worker is assigned to a server, and the worker wakes its
server when it blocks.

More details:

"Working servers" cannot get wakeups, because a "working server" has a
single RUNNING worker attached to it. When a worker blocks, it wakes
its attached server and becomes a detached blocked worker (same is
true if the worker is "preempted": it blocks and wakes its assigned
server).

Blocked workers upon wakeup do this, in order:

- always add themselves to the runnable worker list (the list is
shared among ALL servers, it is NOT per server);
- wake a server pointed to by idle_server_ptr, if not NULL;
- sleep, waiting for a wakeup from a server;

Server S, upon becoming IDLE (no worker to run, or woken on idle
server list) does this, in order, in userspace (simplified, see
umcg_get_idle_worker() in
https://lore.kernel.org/lkml/[email protected]/):
- take a userspace (spin) lock (so the steps below are all within a
single critical section):
- compare_xchg(idle_server_ptr, NULL, S);
- if failed, there is another server in idle_server_ptr, so S adds
itself to the userspace idle server list, releases the lock, goes to
sleep;
- if succeeded:
- check the runnable worker list;
- if empty, release the lock, sleep;
- if not empty:
- get the list
- xchg(idle_server_ptr, NULL) (either S removes itself, or
a worker in the kernel does it first, does not matter);
- release the lock;
- wake server S1 on idle server list. S1 goes through all
of these steps.
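
In C, the server side of the above looks roughly like this (sketch only;
the parking and worker-list helpers are illustrative, the real code is in
the lib patch linked above):

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct server {
	struct server *next;
	/* per-server state, wait/wake primitive, etc. */
};

/* idle_server_ptr is what the kernel looks at; the list is userspace-only. */
static struct server *idle_server_ptr;		/* NULL: no idle server */
static struct server *idle_list;		/* parked idle servers */
static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;

/* Illustrative helpers. */
extern void server_sleep(struct server *s);
extern void server_wake(struct server *s);
extern bool runnable_workers_pending(void);
extern void reap_runnable_workers(struct server *s);

static void server_becomes_idle(struct server *s)
{
	struct server *expected = NULL, *next;

	pthread_mutex_lock(&sched_lock);

	if (!__atomic_compare_exchange_n(&idle_server_ptr, &expected, s, false,
					 __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
		/* Another server is already advertised to the kernel; park. */
		s->next = idle_list;
		idle_list = s;
		pthread_mutex_unlock(&sched_lock);
		server_sleep(s);
		return;
	}

	if (!runnable_workers_pending()) {
		pthread_mutex_unlock(&sched_lock);
		server_sleep(s);	/* the kernel wakes us via idle_server_ptr */
		return;
	}

	/* Workers already queued: un-advertise (the kernel may race us here). */
	__atomic_exchange_n(&idle_server_ptr, NULL, __ATOMIC_SEQ_CST);
	reap_runnable_workers(s);

	next = idle_list;
	if (next)
		idle_list = next->next;
	pthread_mutex_unlock(&sched_lock);

	if (next)
		server_wake(next);	/* it repeats these same steps */
}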

The protocol above serializes the userspace dealing with the idle
server ptr/list. Wakeups in the kernel will be caught if there are
idle servers. Yes, the protocol in the userspace is complicated (more
complicated than outlined above, as the reaped idle/runnable worker
list from the kernel is added to the userspace idle/runnable worker
list), but the kernel side is very simple. I've tested this
interaction extensively, I'm reasonably sure that no worker wakeups
are lost.

>
> Concurrently and loosing the race the other working server (S1) gets a
> wake-up from Tb, like said, it lost, no idle server, so Tb goes on the
> queue of S1.
>
> So then S3 wakes, finds Ta and they live happily ever after.
>
> While S2 and Tb fail to meet one another and both linger in sadness.
>

2021-12-15 21:04:50

by Peter Oskolkov

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Wed, Dec 15, 2021 at 10:25 AM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote:
> > On Wed, Dec 15, 2021 at 2:06 AM Peter Zijlstra <[email protected]> wrote:
> > > /*
> > > + * Enqueue tsk to it's server's runnable list and wake the server for pickup if
> > > + * so desired. Notable LAZY workers will not wake the server and rely on the
> > > + * server to do pickup whenever it naturally runs next.
> >
> > No, I never suggested we needed per-server runnable queues: in all my
> > patchsets I had a single list of idle (runnable) workers.
>
> This is not about the idle servers..
>
> So without the LAZY thing on, a previously blocked task hitting sys_exit
> will enqueue itself on the runnable list and wake the server for pickup.

How can a blocked task hit sys_exit()? Shouldn't it be RUNNING?

Anyway, servers and workers are supposed to unregister before exiting,
so if they call sys_exit() they break the agreement; in my patch I
just clear all umcg-related state and proceed, without waking the
server: the user broke the protocol, let them figure out what
happened:

+static void umcg_clear_task(struct task_struct *tsk)
+{
+	/*
+	 * This is either called for the current task, or for a newly forked
+	 * task that is not yet running, so we don't need strict atomicity
+	 * below.
+	 */
+	if (tsk->umcg_task) {
+		WRITE_ONCE(tsk->umcg_task, NULL);
+
+		/* These can be simple writes - see the comment above. */
+		tsk->pinned_umcg_worker_page = NULL;
+		tsk->pinned_umcg_server_page = NULL;
+		tsk->flags &= ~PF_UMCG_WORKER;
+	}
+}
+
+/* Called both by normally (unregister) and abnormally exiting workers. */
+void umcg_handle_exiting_worker(void)
+{
+	umcg_unpin_pages();
+	umcg_clear_task(current);
+}


>
> IIRC you didn't like the server waking while it was still running
> another task, but instead preferred to have it pick up the newly
> enqueued task when next it ran.

Yes, this is the model I have, as I outlined in another email. I
understand that having queues per-CPU/per-server is how it is done in
the kernel, both for historical reasons (before multiprocessing there
was a single queue/cpu) and for throughput (per-cpu runqueues are
individually faster than a global one). However, this model is known
to lag in the presence of load spikes (long per-cpu queues while some
CPUs sit idle), and is not really easy to work with given the use cases
this whole userspace scheduling effort is trying to address, namely
multiple priorities and work isolation: these are easy to address directly with
a scheduler that has a global view rather than multiple
per-cpu/per-server schedulers/queues that try to coordinate.

I can even claim (without proof, just a hunch, based on how I would
code this) that strict scheduling policies around priority and
isolation (e.g. never run work item A if work item B becomes runnable,
unless work item A is already running) cannot be enforced without a
global scheduler, so per-cpu/per-server queues do not really fit the
use case here...

>
> LAZY enables that.. *however* it does need to wake the server when it is
> idle, otherwise they'll all sit there waiting for one another.

If all servers are busy running workers, then it is not up to the
kernel to "preempt" them in my model: the userspace can set up another
thread/task to preempt a misbehaving worker, which will wake the
server attached to it. But in practice there are always workers
blocking in the kernel, which wake their servers, which then reap the
woken/runnable workers list, so well-behaving code does not need this.
Yes, sometimes the code does not behave well, e.g. a worker grabs a
spinlock, blocks in the kernel, its server runs another worker that
starts spinning on the spinlock; but this is fixable by making the
spinlock aware of our stuff: either the worker who got the lock is
marked as LOCKED and so does not release its server (one of the
reasons I have this flag), or the lock itself becomes sleepable (e.g.
after spinning a bit it calls into a futex wait).
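
(For reference, the sleepable variant is the usual spin-then-futex
construction, roughly like below; this is generic, not from either patchset:)

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>

/* 0 == unlocked, 1 == locked, 2 == locked, possibly with waiters */
static void hybrid_lock(uint32_t *lock)
{
	/* Spin a bit first... */
	for (int i = 0; i < 100; i++) {
		uint32_t zero = 0;

		if (__atomic_compare_exchange_n(lock, &zero, 1, false,
						__ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
			return;
	}

	/* ...then sleep in the kernel until it is released. */
	while (__atomic_exchange_n(lock, 2, __ATOMIC_ACQUIRE) != 0)
		syscall(SYS_futex, lock, FUTEX_WAIT, 2, NULL, NULL, 0);
}

static void hybrid_unlock(uint32_t *lock)
{
	if (__atomic_exchange_n(lock, 0, __ATOMIC_RELEASE) == 2)
		syscall(SYS_futex, lock, FUTEX_WAKE, 1, NULL, NULL, 0);
}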

And so we need to figure out this high-level thing first: do we go
with the per-server worker queues/lists, or do we go with the approach
I use in my patchset? It seems to me that the kernel-side code in my
patchset is not more complicated than your patchset is shaping up to
be, and some things are actually easier to accomplish, like having a
single idle_server_ptr vs this LAZY and/or server "preemption"
behavior that you have.

Again, I'm OK with having it your way if all needed features are
covered, but I think we should be explicit about why
per-server/per-cpu model is chosen vs the one I proposed, especially
as it seems the kernel side code is not really simpler in the end.

2021-12-15 22:26:01

by Peter Zijlstra

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Wed, Dec 15, 2021 at 11:49:51AM -0800, Peter Oskolkov wrote:

> TL;DR: our models are different here. In your model a single server
> can have a bunch of workers interacting with it; in my model only a
> single RUNNING worker is assigned to a server, which it wakes when it
> blocks.

So part of the problem is that none of that was evident from the code.
It is also completely different from the scheduler code it lives in,
making it doubly confusing.

After having read the code, I still had no clue whatsoever how it was
supposed to be used. Which is where my reverse engineering started :/

> More details:
>
> "Working servers" cannot get wakeups, because a "working server" has a
> single RUNNING worker attached to it. When a worker blocks, it wakes
> its attached server and becomes a detached blocked worker (same is
> true if the worker is "preempted": it blocks and wakes its assigned
> server).

But who would do the preemption if the server isn't allowed to run?

> Blocked workers upon wakeup do this, in order:
>
> - always add themselves to the runnable worker list (the list is
> shared among ALL servers, it is NOT per server);

That seems like a scalability issue. And, as said, it is completely
alien when compared to the way Linux itself does scheduling.

> - wake a server pointed to by idle_server_ptr, if not NULL;
> - sleep, waiting for a wakeup from a server;
>
> Server S, upon becoming IDLE (no worker to run, or woken on idle
> server list) does this, in order, in userspace (simplified, see
> umcg_get_idle_worker() in
> https://lore.kernel.org/lkml/[email protected]/):
> - take a userspace (spin) lock (so the steps below are all within a
> single critical section):

Don't ever suggest userspace spinlocks, they're horrible crap.

> - compare_xchg(idle_server_ptr, NULL, S);
> - if failed, there is another server in idle_server_ptr, so S adds
> itself to the userspace idle server list, releases the lock, goes to
> sleep;
> - if succeeded:
> - check the runnable worker list;
> - if empty, release the lock, sleep;
> - if not empty:
> - get the list
> - xchg(idle_server_ptr, NULL) (either S removes itself, or
> a worker in the kernel does it first, does not matter);
> - release the lock;
> - wake server S1 on idle server list. S1 goes through all
> of these steps.
>
> The protocol above serializes the userspace dealing with the idle
> server ptr/list. Wakeups in the kernel will be caught if there are
> idle servers. Yes, the protocol in the userspace is complicated (more
> complicated than outlined above, as the reaped idle/runnable worker
> list from the kernel is added to the userspace idle/runnable worker
> list), but the kernel side is very simple. I've tested this
> interaction extensively, I'm reasonably sure that no worker wakeups
> are lost.

Sure, but also seems somewhat congestion prone :/

2021-12-15 23:16:51

by Peter Zijlstra

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Wed, Dec 15, 2021 at 01:04:33PM -0800, Peter Oskolkov wrote:
> On Wed, Dec 15, 2021 at 10:25 AM Peter Zijlstra <[email protected]> wrote:
> >
> > On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote:
> > > On Wed, Dec 15, 2021 at 2:06 AM Peter Zijlstra <[email protected]> wrote:
> > > > /*
> > > > + * Enqueue tsk to it's server's runnable list and wake the server for pickup if
> > > > + * so desired. Notable LAZY workers will not wake the server and rely on the
> > > > + * server to do pickup whenever it naturally runs next.
> > >
> > > No, I never suggested we needed per-server runnable queues: in all my
> > > patchsets I had a single list of idle (runnable) workers.
> >
> > This is not about the idle servers..
> >
> > So without the LAZY thing on, a previously blocked task hitting sys_exit
> > will enqueue itself on the runnable list and wake the server for pickup.
>
> How can a blocked task hit sys_exit()? Shouldn't it be RUNNING?

Task was RUNNING, hits schedule() after passing through sys_enter().
This marks it BLOCKED. Task wakes again and proceeds to sys_exit(), at
which point it's marked RUNNABLE and put on the runnable list. After
which it'll kick the server to process said list.

> Anyway, servers and workers are supposed to unregister before exiting,
> so if they call sys_exit() they break the agreement; in my patch I
> just clear all umcg-related state and proceed, without waking the
> server: the user broke the protocol, let them figure out what
> happened:

No violation of anything there. The time between returning from
schedule() and sys_exit() is unmanaged.

sys_exit() is the spot where we regain control.

> > IIRC you didn't like the server waking while it was still running
> > another task, but instead preferred to have it pick up the newly
> > enqueued task when next it ran.
>
> Yes, this is the model I have, as I outlined in another email. I
> understand that having queues per-CPU/per-server is how it is done in
> the kernel, both for historical reasons (before multiprocessing there
> was a single queue/cpu) and for throughput (per-cpu runqueues are
> individually faster than a global one). However, this model is known
> to lag in presence of load spikes (long per-cpu queues with some CPUs
> idle), and is not really easy to work with given the use cases this
> whole userspace scheduling effort is trying to address:

Well, that's *your* use-case. I'm fairly sure there's more people that
want to use this thing.

> multiple
> priorities and work isolation: these are easy to address directly with
> a scheduler that has a global view rather than multiple
> per-cpu/per-server schedulers/queues that try to coordinate.

You can trivially create this, even if the underlying thing is
per-server. Simply have a lock and shared data structure between the
servers.

Even in the kernel, it should be mostly trivial to create a global
policy. The only tricky bit (in the kernel) is the whole affinity muck,
but userspace doesn't *need* to do even that.

> > LAZY enables that.. *however* it does need to wake the server when it is
> > idle, otherwise they'll all sit there waiting for one another.
>
> If all servers are busy running workers, then it is not up to the
> kernel to "preempt" them in my model: the userspace can set up another
> thread/task to preempt a misbehaving worker, which will wake the
> server attached to it.

So the way I'm seeing things is that the server *is* the 'CPU'. A UP
machine cannot rely on another CPU to make preemption happen.

Also, preemption is very much not about misbehaviour. Wakeup can cause a
preemption event if the woken task is deemed higher priority than the
currently running one, for example.

And time based preemption is definitely also a thing wrt resource
distribution.

> But in practice there are always workers
> blocking in the kernel, which wakes their servers, which then reap the
> woken/runnable workers list, so well-behaving code does not need this.

This seems to discount pure computational workloads.

> And so we need to figure out this high-level thing first: do we go
> with the per-server worker queues/lists, or do we go with the approach
> I use in my patchset? It seems to me that the kernel-side code in my
> patchset is not more complicated than your patchset is shaping up to
> be, and some things are actually easier to accomplish, like having a
> single idle_server_ptr vs this LAZY and/or server "preemption"
> behavior that you have.
>
> Again, I'm OK with having it your way if all needed features are
> covered, but I think we should be explicit about why
> per-server/per-cpu model is chosen vs the one I proposed, especially
> as it seems the kernel side code is not really simpler in the end.

So I went with a UP first approach. I made single server preemption
driven scheduling work first (both tick and wakeup-preemption are
supported).

The whole LAZY thing is only meant to suppress some of that (notably
wakeup preemption), but you're right in that it's not really nice. I got
it working, but I'm not particularly happy with it either.

Having the sys_enter/sys_exit hooks also made the page pins short lived,
and signals much simpler to handle. You're destroying signals IIUC.


So I see no fundamental reason why userspace cannot do something like:

struct umcg_task *current = NULL;

for (;;) {
	self->state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT;

	runnable_ptr = (void *)__atomic_exchange_n(&self->runnable_workers_ptr,
						   NULL, __ATOMIC_SEQ_CST);

	pthread_mutex_lock(&global_queue.lock);
	while (runnable_ptr) {
		next = (void *)runnable_ptr->runnable_workers_ptr;
		enqueue_task(&global_queue, runnable_ptr);
		runnable_ptr = next;
	}

	/* complicated bit about current already running goes here */

	current = pick_task(&global_queue);
	self->next_tid = current ? current->tid : 0;
unlock:
	pthread_mutex_unlock(&global_queue.lock);

	ret = sys_umcg_wait(0, 0);

	pthread_mutex_lock(&global_queue.lock);
	/* umcg_wait() didn't switch, make sure to return the task */
	if (self->next_tid) {
		enqueue_task(&global_queue, current);
		current = NULL;
	}
	pthread_mutex_unlock(&global_queue.lock);

	// do something with @ret
}

to get global scheduling and all the contention^Wgoodness related to it.
Except, of course, it's more complicated, but I think the idea's clear
enough.



2021-12-15 23:26:35

by Peter Oskolkov

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Wed, Dec 15, 2021 at 2:25 PM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Dec 15, 2021 at 11:49:51AM -0800, Peter Oskolkov wrote:
>
> > TL;DR: our models are different here. In your model a single server
> > can have a bunch of workers interacting with it; in my model only a
> > single RUNNING worker is assigned to a server, which it wakes when it
> > blocks.
>
> So part of the problem is that none of that was evident from the code.
> It is also completely different from the scheduler code it lives in,
> making it double confusing.
>
> After having read the code, I still had no clue what so ever how it was
> supposed to be used. Which is where my reverse engineering started :/

I posted a doc patch:
https://lore.kernel.org/lkml/[email protected]/
a lib patch with userspace code:
https://lore.kernel.org/lkml/[email protected]/
and a doc patch for the lib/userspace code:
https://lore.kernel.org/lkml/[email protected]/

I spent at least two weeks polishing the lib patch and the docs, much
more if previous patchsets are to be taken into account. Yes, they are
confusing, and most likely answer all of the wrong questions, but I
did try to make my approach as clear as possible... I apologize if
that was not very successful...

>
> > More details:
> >
> > "Working servers" cannot get wakeups, because a "working server" has a
> > single RUNNING worker attached to it. When a worker blocks, it wakes
> > its attached server and becomes a detached blocked worker (same is
> > true if the worker is "preempted": it blocks and wakes its assigned
> > server).
>
> But who would do the preemption if the server isn't allowed to run?
>
> > Blocked workers upon wakeup do this, in order:
> >
> > - always add themselves to the runnable worker list (the list is
> > shared among ALL servers, it is NOT per server);
>
> That seems like a scalability issue. And, as said, it is completely
> alien when compared to the way Linux itself does scheduling.
>
> > - wake a server pointed to by idle_server_ptr, if not NULL;
> > - sleep, waiting for a wakeup from a server;
> >
> > Server S, upon becoming IDLE (no worker to run, or woken on idle
> > server list) does this, in order, in userspace (simplified, see
> > umcg_get_idle_worker() in
> > https://lore.kernel.org/lkml/[email protected]/):
> > - take a userspace (spin) lock (so the steps below are all within a
> > single critical section):
>
> Don't ever suggest userspace spinlocks, they're horrible crap.

This can easily be a mutex, not really important (although for very
short critical sections with only memory reads/writes, like here, spin
locks often perform better, in our experience).

>
> > - compare_xchg(idle_server_ptr, NULL, S);
> > - if failed, there is another server in idle_server_ptr, so S adds
> > itself to the userspace idle server list, releases the lock, goes to
> > sleep;
> > - if succeeded:
> > - check the runnable worker list;
> > - if empty, release the lock, sleep;
> > - if not empty:
> > - get the list
> > - xchg(idle_server_ptr, NULL) (either S removes itself, or
> > a worker in the kernel does it first, does not matter);
> > - release the lock;
> > - wake server S1 on idle server list. S1 goes through all
> > of these steps.
> >
> > The protocol above serializes the userspace dealing with the idle
> > server ptr/list. Wakeups in the kernel will be caught if there are
> > idle servers. Yes, the protocol in the userspace is complicated (more
> > complicated than outlined above, as the reaped idle/runnable worker
> > list from the kernel is added to the userspace idle/runnable worker
> > list), but the kernel side is very simple. I've tested this
> > interaction extensively, I'm reasonably sure that no worker wakeups
> > are lost.
>
> Sure, but also seems somewhat congestion prone :/

The whole critical section under the lock is just several memory
read/write operations, so very short. And workers are removed from the
kernel's list of runnable/woken workers all at once; and the server
processing the runnable worker list knows how many of them are now
available to run, so the appropriate number of idle servers can be
woken (not yet implemented in my lib patch). So yes, this can be a
bottleneck, but there are ways to make it less and less likely (by
making the userspace more complicated; but this is not a concern).
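
Roughly (sketch; runqueue_add() and wake_one_idle_server() are illustrative,
and I'm using the runnable_workers_ptr naming from the test program above for
concreteness):

extern void runqueue_add(struct umcg_task *w);	/* illustrative */
extern int wake_one_idle_server(void);		/* illustrative: 0 if none idle */

static void reap_and_wake_idle_servers(struct umcg_task *self)
{
	struct umcg_task *w, *next;
	int nr = 0;

	/* Take the whole kernel-provided list in one go. */
	w = (void *)__atomic_exchange_n(&self->runnable_workers_ptr, NULL,
					__ATOMIC_SEQ_CST);
	for (; w; w = next) {
		next = (void *)w->runnable_workers_ptr;
		runqueue_add(w);
		nr++;
	}

	/* This server runs one of them; offer the rest to idle servers. */
	while (--nr > 0 && wake_one_idle_server())
		;
}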

2021-12-15 23:31:22

by Peter Oskolkov

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

On Wed, Dec 15, 2021 at 3:16 PM Peter Zijlstra <[email protected]> wrote:
>
> On Wed, Dec 15, 2021 at 01:04:33PM -0800, Peter Oskolkov wrote:
> > On Wed, Dec 15, 2021 at 10:25 AM Peter Zijlstra <[email protected]> wrote:
> > >
> > > On Wed, Dec 15, 2021 at 09:56:06AM -0800, Peter Oskolkov wrote:
> > > > On Wed, Dec 15, 2021 at 2:06 AM Peter Zijlstra <[email protected]> wrote:
> > > > > /*
> > > > > + * Enqueue tsk to it's server's runnable list and wake the server for pickup if
> > > > > + * so desired. Notable LAZY workers will not wake the server and rely on the
> > > > > + * server to do pickup whenever it naturally runs next.
> > > >
> > > > No, I never suggested we needed per-server runnable queues: in all my
> > > > patchsets I had a single list of idle (runnable) workers.
> > >
> > > This is not about the idle servers..
> > >
> > > So without the LAZY thing on, a previously blocked task hitting sys_exit
> > > will enqueue itself on the runnable list and wake the server for pickup.
> >
> > How can a blocked task hit sys_exit()? Shouldn't it be RUNNING?
>
> Task was RUNNING, hits schedule() after passing through sys_enter().
> this marks it BLOCKED. Task wakes again and proceeds to sys_exit(), at
> which point it's marked RUNNABLE and put on the runnable list. After
> which it'll kick the server to process said list.
>

Ah, you are talking about the sys_exit hook; sorry, I thought you talked
about the exit() syscall.

[...]

>
> Well, that's *your* use-case. I'm fairly sure there's more people that
> want to use this thing.
>
> > multiple
> > priorities and work isolation: these are easy to address directly with
> > a scheduler that has a global view rather than multiple
> > per-cpu/per-server schedulers/queues that try to coordinate.
>
> You can trivially create this, even if the underlying thing is
> per-server. Simply have a lock and shared data structure between the
> servers.
>
> Even in the kernel, it should be mostly trivial to create a global
> policy. The only tricky bit (in the kernel) is the whole affinity muck,
> but userspace doesn't *need* to do even that.
>
> > > LAZY enables that.. *however* it does need to wake the server when it is
> > > idle, otherwise they'll all sit there waiting for one another.
> >
> > If all servers are busy running workers, then it is not up to the
> > kernel to "preempt" them in my model: the userspace can set up another
> > thread/task to preempt a misbehaving worker, which will wake the
> > server attached to it.
>
> So the way I'm seeing things is that the server *is* the 'CPU'. A UP
> machine cannot rely on another CPU to make preemption happen.
>
> Also, preemption is very much not about misbehaviour. Wakeup can cause a
> preemption event if the woken task is deemed higher priority than the
> current running one for example.
>
> And time based preemption is definitely also a thing wrt resource
> distribution.
>
> > But in practice there are always workers
> > blocking in the kernel, which wakes their servers, which then reap the
> > woken/runnable workers list, so well-behaving code does not need this.
>
> This seems to discount pure computational workloads.
>
> > And so we need to figure out this high-level thing first: do we go
> > with the per-server worker queues/lists, or do we go with the approach
> > I use in my patchset? It seems to me that the kernel-side code in my
> > patchset is not more complicated than your patchset is shaping up to
> > be, and some things are actually easier to accomplish, like having a
> > single idle_server_ptr vs this LAZY and/or server "preemption"
> > behavior that you have.
> >
> > Again, I'm OK with having it your way if all needed features are
> > covered, but I think we should be explicit about why
> > per-server/per-cpu model is chosen vs the one I proposed, especially
> > as it seems the kernel side code is not really simpler in the end.
>
> So I went with a UP first approach. I made single server preemption
> driven scheduling work first (both tick and wakeup-preemption are
> supported).

I agree that the UP approach is better than the LAZY one if we have
per-server/per-cpu worker queues.

>
> The whole LAZY thing is only meant to supress some of that (notably
> wakeup preemption), but you're right in that it's not really nice. I got
> it working, but I'm not particularly happy with it either.
>
> Having the sys_enter/sys_exit hooks also made the page pins short lived,
> and signals much simpler to handle. You're destroying signals IIUC.
>
>
> So I see no fundamental reason why userspace cannot do something like:
>
> struct umcg_task *current = NULL;
>
> for (;;) {
> self->state = UMCG_TASK_RUNNABLE | UMCG_TF_COND_WAIT;
>
> runnable_ptr = (void *)__atomic_exchange_n(&self->runnable_workers_ptr,
> NULL, __ATOMIC_SEQ_CST);
>
> pthread_mutex_lock(&global_queue.lock);
> while (runnable_ptr) {
> next = (void *)runnable_ptr->runnable_workers_ptr;
> enqueue_task(&global_queue, runnable_ptr);
> runnable_ptr = next;
> }
>
> /* complicated bit about current already running goes here */
>
> current = pick_task(&global_queue);
> self->next_tid = current ? current->tid : 0;
> unlock:
> pthread_mutex_unlock(&global_queue.lock);
>
> ret = sys_umcg_wait(0, 0);
>
> pthread_mutex_lock(&global_queue.lock);
> /* umcg_wait() didn't switch, make sure to return the task */
> if (self->next_tid) {
> enqueue_task(&global_queue, current);
> current = NULL;
> }
> pthread_mutex_unlock(&global_queue.lock);
>
> // do something with @ret
> }
>
> to get global scheduling and all the contention^Wgoodness related to it.
> Except, of course, it's more complicated, but I think the idea's clear
> enough.

Let me spend some time and see if I can make all of this work together
beyond simple tests. With the upcoming holidays and some other things
I am busy with, this may take more than a week, I'm afraid...

2021-12-16 13:23:40

by Thomas Gleixner

Subject: Re: [RFC][PATCH 0/3] sched: User Managed Concurrency Groups

Peter,

On Wed, Dec 15 2021 at 15:26, Peter Oskolkov wrote:
> On Wed, Dec 15, 2021 at 2:25 PM Peter Zijlstra <[email protected]> wrote:
>> > - take a userspace (spin) lock (so the steps below are all within a
>> > single critical section):
>>
>> Don't ever suggest userspace spinlocks, they're horrible crap.
>
> This can easily be a mutex, not really important (although for very
> short critical sections with only memory reads/writes, like here, spin
> locks often perform better, in our experience).

Performance may be better, but user space spinlocks have a fundamental
problem: They are prone to live locks.

That's completely independent of the length of the critical section, it
even can be empty.

There are ways to avoid that, but that needs a very careful design on
the application/library level and at the system configuration level
(priorities, affinities ...). And even then, there are trivial ways to
break that, e.g. via CPU hotplug.

So no, for something of general use, they are a complete NONO. People
who think they know what they are doing have the source and can replace
them if they feel the need to do so.

Thanks,

tglx