2022-07-14 03:42:25

by André Almeida

[permalink] [raw]
Subject: [RFC] futex2: add NUMA awareness

Hi,

futex2 is an ongoing project with the goal to create a new interface for
futex that solves ongoing issues with the current syscall.

One of this problems is the lack of NUMA awareness for futex operations.
This RFC is aimed to gather feedback around the a NUMA interface proposal.

* The problem

futex has a single, global hash table to store information of current
waiters to be queried by wakers. This hash table is stored in a single
node in non-uniform machines. This means that a process running in other
nodes will have some overhead using futex, given that it will need to
access the table in a different node.

* A solution

For NUMA machines, it would be allocated a table per node. Processes
then would be able to use the local table to avoid sharing data with
other nodes.

* The interface

Userspace needs to specify which node would like to use to store/query
the futex table. The common case would be to operate on the current
node, but some cases could required to operate in another one.

Before getting to the NUMA part, a quick recap of the syscalls interface
of futex2:

futex_wait(void *uaddr, unsigned int val, unsigned int flags,
struct timespec *timo)

futex_wake(void *uaddr, unsigned long nr_wake, unsigned int flags)

struct futex_requeue {
void *uaddr;
unsigned int flags;
};

futex_requeue(struct futex_requeue *rq1, struct futex_requeue *rq2,
unsigned int nr_wake, unsigned int nr_requeue,
u64 cmpval, unsigned int flags)


As requeue already has 6 arguments, we can't add an argument for the
node ID, we need to pack it in a struct. So then we have

struct futexX_numa {
__uX value;
__sX hint;
};

Where X can be 8, 16, 32 or 64 (futex2 supports variable sized futexes).
`value` is the futex value and `hint` can be -1 for the current node, or
[0, MAX_NUMA_NODES) to specify a node. Example:

struct futex32_numa f = {.value = 0, hint = -1};

...

futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);

Then &f would be used as the futex address, as expected, and this would
be used for the current node. If an app is expecting to have calls from
different nodes then it should do for instance:

struct futex32_numa f = {.value = 0, hint = 2};

For non-NUMA apps, a call without FUTEX_NUMA flag would just use the
first node as default.

Feedback? Who else should I CC?

Thanks,
André


2022-07-14 11:11:40

by Andrey Semashev

[permalink] [raw]
Subject: Re: [RFC] futex2: add NUMA awareness

On 7/14/22 06:18, André Almeida wrote:
> Hi,
>
> futex2 is an ongoing project with the goal to create a new interface for
> futex that solves ongoing issues with the current syscall.
>
> One of this problems is the lack of NUMA awareness for futex operations.
> This RFC is aimed to gather feedback around the a NUMA interface proposal.
>
> * The problem
>
> futex has a single, global hash table to store information of current
> waiters to be queried by wakers. This hash table is stored in a single
> node in non-uniform machines. This means that a process running in other
> nodes will have some overhead using futex, given that it will need to
> access the table in a different node.
>
> * A solution
>
> For NUMA machines, it would be allocated a table per node. Processes
> then would be able to use the local table to avoid sharing data with
> other nodes.
>
> * The interface
>
> Userspace needs to specify which node would like to use to store/query
> the futex table. The common case would be to operate on the current
> node, but some cases could required to operate in another one.
>
> Before getting to the NUMA part, a quick recap of the syscalls interface
> of futex2:
>
> futex_wait(void *uaddr, unsigned int val, unsigned int flags,
> struct timespec *timo)
>
> futex_wake(void *uaddr, unsigned long nr_wake, unsigned int flags)
>
> struct futex_requeue {
> void *uaddr;
> unsigned int flags;
> };
>
> futex_requeue(struct futex_requeue *rq1, struct futex_requeue *rq2,
> unsigned int nr_wake, unsigned int nr_requeue,
> u64 cmpval, unsigned int flags)
>
>
> As requeue already has 6 arguments, we can't add an argument for the
> node ID, we need to pack it in a struct. So then we have
>
> struct futexX_numa {
> __uX value;
> __sX hint;
> };
>
> Where X can be 8, 16, 32 or 64 (futex2 supports variable sized futexes).
> `value` is the futex value and `hint` can be -1 for the current node, or
> [0, MAX_NUMA_NODES) to specify a node. Example:
>
> struct futex32_numa f = {.value = 0, hint = -1};
>
> ...
>
> futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);
>
> Then &f would be used as the futex address, as expected, and this would
> be used for the current node. If an app is expecting to have calls from
> different nodes then it should do for instance:
>
> struct futex32_numa f = {.value = 0, hint = 2};
>
> For non-NUMA apps, a call without FUTEX_NUMA flag would just use the
> first node as default.
>
> Feedback? Who else should I CC?

Just a few questions:

Do I understand correctly that notifiers won't be able to wake up
waiters unless they know on which node they are waiting?

Is it possible to wait on a futex on different nodes?

Is it possible to wake waiters on a futex on all nodes? When a single
(or N, where N is not "all") waiter is woken, which node is selected? Is
there a rotation of nodes, so that nodes are not skewed in terms of
notified waiters?

2022-07-14 15:13:08

by André Almeida

[permalink] [raw]
Subject: Re: [RFC] futex2: add NUMA awareness

Hi Andrey,

Thanks for the feedback.

Às 08:01 de 14/07/22, Andrey Semashev escreveu:
> On 7/14/22 06:18, André Almeida wrote:
[...]
>>
>> Feedback? Who else should I CC?
>
> Just a few questions:
>
> Do I understand correctly that notifiers won't be able to wake up
> waiters unless they know on which node they are waiting?
>

If userspace is using NUMA_FLAG, yes. Otherwise all futexes would be
located in the default node, and userspace doesn't need to know which
one is the default.

> Is it possible to wait on a futex on different nodes?

Yes, given that you specify `.hint = id` with the proper node id.

>
> Is it possible to wake waiters on a futex on all nodes? When a single
> (or N, where N is not "all") waiter is woken, which node is selected? Is
> there a rotation of nodes, so that nodes are not skewed in terms of
> notified waiters?

Regardless of which node the waiter process is running, what matter is
in which node the futex hash table is. So for instance if we have:

struct futex32_numa f = {.value = 0, hint = 2};

And now we add some waiters for this futex:

Thread 1, running on node 3:

futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);

Thread 2, running on node 0:

futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);

Thread 3, running on node 2:

futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);

And then, Thread 4, running on node 3:

futex_wake(&f, 2, FUTEX_NUMA | FUTEX_32);

Now, two waiter would wake up (e.g. T1 and T3, node 3 and 2) and they
are from different nodes. futex_wake() doesn't provide guarantees of
which waiter will be selected, so I can't say which node would be
selected. There's no policy for fairness/starvation for futex_wake(). Do
you think this would be important for the NUMA case?

Let me know if this clarifies your questions.

2022-07-22 16:50:44

by Andrey Semashev

[permalink] [raw]
Subject: Re: [RFC] futex2: add NUMA awareness

On 7/14/22 18:00, André Almeida wrote:
> Hi Andrey,
>
> Thanks for the feedback.
>
> Às 08:01 de 14/07/22, Andrey Semashev escreveu:
>> On 7/14/22 06:18, André Almeida wrote:
> [...]
>>>
>>> Feedback? Who else should I CC?
>>
>> Just a few questions:
>>
>> Do I understand correctly that notifiers won't be able to wake up
>> waiters unless they know on which node they are waiting?
>>
>
> If userspace is using NUMA_FLAG, yes. Otherwise all futexes would be
> located in the default node, and userspace doesn't need to know which
> one is the default.
>
>> Is it possible to wait on a futex on different nodes?
>
> Yes, given that you specify `.hint = id` with the proper node id.

So any given futex_wake(FUTEX_NUMA) operates only within its node, right?

>> Is it possible to wake waiters on a futex on all nodes? When a single
>> (or N, where N is not "all") waiter is woken, which node is selected? Is
>> there a rotation of nodes, so that nodes are not skewed in terms of
>> notified waiters?
>
> Regardless of which node the waiter process is running, what matter is
> in which node the futex hash table is. So for instance if we have:
>
> struct futex32_numa f = {.value = 0, hint = 2};
>
> And now we add some waiters for this futex:
>
> Thread 1, running on node 3:
>
> futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);
>
> Thread 2, running on node 0:
>
> futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);
>
> Thread 3, running on node 2:
>
> futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);
>
> And then, Thread 4, running on node 3:
>
> futex_wake(&f, 2, FUTEX_NUMA | FUTEX_32);
>
> Now, two waiter would wake up (e.g. T1 and T3, node 3 and 2) and they
> are from different nodes. futex_wake() doesn't provide guarantees of
> which waiter will be selected, so I can't say which node would be
> selected.

In this example, T1, T2 and T3 are all blocking on node 2 (since all of
them presumably specify hint == 2), right? In this sense, it doesn't
matter which node they are running on, what matters is what node they
block on.

What I'm asking is can I wake all threads blocked on all nodes on the
same futex? That is, is the following possible?

// I'm using hint == -1 to indicate the current node
// of the calling thread for waiters and all nodes for notifiers
struct futex32_numa f = {.value = 0, .hint = -1};

Thread 1, running on node 3, blocks on node 3:

futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);

Thread 2, running on node 0, blocks on node 0:

futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);

Thread 3, running on node 2, blocks on node 2:

futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);

And then, Thread 4, running on whatever node:

futex_wake(&f, -1, FUTEX_NUMA | FUTEX_32);

Here, futex_wake would wake T1, T2 and T3. Or:

futex_wake(&f, 1, FUTEX_NUMA | FUTEX_32);

Here, futex_wake would wake any one of T1, T2 or T3.

> There's no policy for fairness/starvation for futex_wake(). Do
> you think this would be important for the NUMA case?

I'm not sure yet. If there isn't a cross-node behavior like in my
example above then, I suppose, it falls to the userspace to ensure fair
rotation of the wakeups on different nodes. If there is functionality
like this, I imagine, some sort of fairness would be desired.

2022-07-27 18:34:14

by André Almeida

[permalink] [raw]
Subject: Re: [RFC] futex2: add NUMA awareness

Às 13:42 de 22/07/22, Andrey Semashev escreveu:
> On 7/14/22 18:00, André Almeida wrote:
>> Hi Andrey,
>>
>> Thanks for the feedback.
>>
>> Às 08:01 de 14/07/22, Andrey Semashev escreveu:
>>> On 7/14/22 06:18, André Almeida wrote:
>> [...]
>>>>
>>>> Feedback? Who else should I CC?
>>>
>>> Just a few questions:
>>>
>>> Do I understand correctly that notifiers won't be able to wake up
>>> waiters unless they know on which node they are waiting?
>>>
>>
>> If userspace is using NUMA_FLAG, yes. Otherwise all futexes would be
>> located in the default node, and userspace doesn't need to know which
>> one is the default.
>>
>>> Is it possible to wait on a futex on different nodes?
>>
>> Yes, given that you specify `.hint = id` with the proper node id.
>
> So any given futex_wake(FUTEX_NUMA) operates only within its node, right?
>
>>> Is it possible to wake waiters on a futex on all nodes? When a single
>>> (or N, where N is not "all") waiter is woken, which node is selected? Is
>>> there a rotation of nodes, so that nodes are not skewed in terms of
>>> notified waiters?
>>
>> Regardless of which node the waiter process is running, what matter is
>> in which node the futex hash table is. So for instance if we have:
>>
>> struct futex32_numa f = {.value = 0, hint = 2};
>>
>> And now we add some waiters for this futex:
>>
>> Thread 1, running on node 3:
>>
>> futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);
>>
>> Thread 2, running on node 0:
>>
>> futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);
>>
>> Thread 3, running on node 2:
>>
>> futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);
>>
>> And then, Thread 4, running on node 3:
>>
>> futex_wake(&f, 2, FUTEX_NUMA | FUTEX_32);
>>
>> Now, two waiter would wake up (e.g. T1 and T3, node 3 and 2) and they
>> are from different nodes. futex_wake() doesn't provide guarantees of
>> which waiter will be selected, so I can't say which node would be
>> selected.
>
> In this example, T1, T2 and T3 are all blocking on node 2 (since all of
> them presumably specify hint == 2), right? In this sense, it doesn't
> matter which node they are running on, what matters is what node they
> block on.

yes

>
> What I'm asking is can I wake all threads blocked on all nodes on the
> same futex? That is, is the following possible?
>
> // I'm using hint == -1 to indicate the current node
> // of the calling thread for waiters and all nodes for notifiers
> struct futex32_numa f = {.value = 0, .hint = -1};
>
> Thread 1, running on node 3, blocks on node 3:
>
> futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);
>
> Thread 2, running on node 0, blocks on node 0:
>
> futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);
>
> Thread 3, running on node 2, blocks on node 2:
>
> futex_wait(&f, 0, FUTEX_NUMA | FUTEX_32, NULL);
>
> And then, Thread 4, running on whatever node:
>
> futex_wake(&f, -1, FUTEX_NUMA | FUTEX_32);

this futex_wake will wake all futexes waiting on the node that called
futex_wake(), waking only one futex in this example. They are __not__
the same futex. If they have different nodes, they would have different
information inside the kernel.

if you want to wake them all with the same futex_wake(), they need to be
waiting on the same node.

>
> Here, futex_wake would wake T1, T2 and T3. Or:
>
> futex_wake(&f, 1, FUTEX_NUMA | FUTEX_32);

this would behave exactly as the futex_wake() above.

>
> Here, futex_wake would wake any one of T1, T2 or T3.
>
>> There's no policy for fairness/starvation for futex_wake(). Do
>> you think this would be important for the NUMA case?
>
> I'm not sure yet. If there isn't a cross-node behavior like in my
> example above then, I suppose, it falls to the userspace to ensure fair
> rotation of the wakeups on different nodes. If there is functionality
> like this, I imagine, some sort of fairness would be desired.