This series addresses a couple of robust/PI futex exit races:
1) The unlock races debugged and fixed by Yi and Yang
These races are really subtle, and I'm still puzzled about how to trigger
them reliably enough to decode them.
The basic issue is that:
A) An unlocking task can be killed between clearing the user space
futex value and calling futex(FUTEX_WAKE).
B) A woken up waiter can be killed before it can acquire the futex
after returning to user space.
In both cases the futex value is 0, and because of that the robust list
exit code refuses to wake up waiters, as the futex is not owned by the
exiting task. As a consequence, all other waiters might be blocked
forever.
2) Oleg provided a test case which causes an infinite loop in the
futex_lock_pi() code.
The problem there is that an exiting task might be preempted by a
waiter while in a state which makes the waiter busy-wait for the exiting
task to complete the robust/PI exit cleanup code.
That completion is obviously impossible when the waiter has a higher
priority than the exiting task and both are pinned on the same CPU,
which results in a live lock.
#1 is a straightforward and simple fix
The solution Yi and Yang provided looks solid and in the worst case
causes a spurious wakeup of a waiter, which is nothing to worry about
as all waiter code has to be prepared for that anyway.
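To make the race window from A) concrete, here is a minimal user space
sketch of a robust-style unlock path, assuming a non-PI futex whose word
is simply cleared on unlock (the helper names are illustrative only, not
the glibc implementation):

/* Hypothetical robust-style unlock path; names are illustrative only. */
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex_wake(atomic_uint *uaddr, int nr)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAKE, nr, NULL, NULL, 0);
}

static void robust_style_unlock(atomic_uint *futex_word)
{
	/* Step 1: clear the owner TID in the futex word. */
	atomic_store(futex_word, 0);

	/*
	 * Window from A): if the task is killed here, the value is already
	 * 0, so the robust list exit code does not consider the futex
	 * owned by the exiting task and skips the wakeup.
	 */

	/* Step 2: wake one waiter. */
	futex_wake(futex_word, 1);
}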
#2 is more complex
In the current implementation there is no way to block until the exiting
task has finished the cleanup.
To fix this, quite a bit of code reshuffling is required, which at the
same time is a valuable cleanup.
The final solution is to guard the futex exit handling with a per-task
mutex and make the waiter block on that mutex until the exiting task has
completed the cleanup.
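As a rough user space model of that handshake (pthread-based, with
made-up names; the actual kernel implementation differs), the exiting
side holds the mutex across the whole cleanup and a waiter which
observes the EXITING state blocks on it instead of busy-waiting:

/* User space model of the per-task exit mutex idea; names are made up. */
#include <pthread.h>
#include <stdatomic.h>

enum futex_exit_state { FUTEX_STATE_OK, FUTEX_STATE_EXITING, FUTEX_STATE_DEAD };

struct task_model {
	pthread_mutex_t futex_exit_mutex;
	atomic_int futex_state;
};

/* Exiting side: serialize the whole cleanup behind the mutex. */
static void futex_exit_cleanup(struct task_model *tsk)
{
	pthread_mutex_lock(&tsk->futex_exit_mutex);
	atomic_store(&tsk->futex_state, FUTEX_STATE_EXITING);
	/* ... robust list / PI state cleanup would run here ... */
	atomic_store(&tsk->futex_state, FUTEX_STATE_DEAD);
	pthread_mutex_unlock(&tsk->futex_exit_mutex);
}

/* Waiter side: block until the cleanup is done instead of busy-waiting. */
static void wait_for_owner_exiting(struct task_model *owner)
{
	if (atomic_load(&owner->futex_state) == FUTEX_STATE_EXITING) {
		pthread_mutex_lock(&owner->futex_exit_mutex);
		pthread_mutex_unlock(&owner->futex_exit_mutex);
	}
	/* The caller then retries the futex lock attempt. */
}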
Details why a simpler solution is not feasible can be found here:
https://lore.kernel.org/r/[email protected]
Ignore my confusion of fork vs. vfork at the beginning of the thread.
Futexes do that to human brains. :)
The following series addresses both issues.
Patch 1 is a slightly polished version of the original Yi and Yang
submission. It is included for completeness' sake and because it
conflicts with the larger surgery which fixes issue #2.
Aside from that, a few more eyeballs on that subtlety are definitely not
a bad thing, especially as this has a user space component to it.
The rest of the series addresses issue #2, which is more or less a
kernel-only problem, but extra eyeballs are appreciated.
I'm certainly not proud of the solution for #2, but it's the best I
could come up with without violating the user/kernel state consistency
constraints.
Rusty Russell was definitely right when he said that futexes are cursed,
but as Peter Zijlstra pointed out he should have named them SNAFUtex
right away.
The series is also available from git:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.locking/futex
Thanks,
tglx
8<-------------
fs/exec.c | 2
include/linux/compat.h | 2
include/linux/futex.h | 38 +++--
include/linux/sched.h | 3
include/linux/sched/mm.h | 6
kernel/exit.c | 30 ----
kernel/fork.c | 40 ++---
kernel/futex.c | 324 ++++++++++++++++++++++++++++++++++++++++-------
8 files changed, 330 insertions(+), 115 deletions(-)
* Thomas Gleixner <[email protected]> wrote:
> The series is also available from git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.locking/futex
Quick testing feedback, doesn't build on 32-bit systems (32-bit defconfig
for example):
kernel/futex.c: In function ‘futex_cleanup’:
kernel/futex.c:3750:19: error: ‘struct task_struct’ has no member named ‘compat_robust_list’
Thanks,
Ingo
* Thomas Gleixner <[email protected]> wrote:
> This series addresses a couple of robust/PI futex exit races:
>
> 1) The unlock races debugged and fixed by Yi and Yang
>
> These races are really subtle and I'm still puzzled how to trigger them
> reliably enough to decode them.
>
> The basic issue is that:
>
> A) An unlocking task can be killed between clearing the user space
> futex value and calling futex(FUTEX_WAKE).
>
> B) A woken up waiter can be killed before it can acquire the futex
> after returning to user space.
>
> In both cases the futex value is 0 and due to that the robust list exit
> code refuses to wake up waiters as the futex is not owned by the
> exiting task. As a consequence all other waiters might be blocked
> forever.
>
> 2) Oleg provided a test case which causes an infinite loop in the
> futex_lock_pi() code.
>
> The problem there is that an exiting task might be preempted by a
> waiter in a state which makes the waiter busy wait for the exiting task
> to complete the robust/PI exit cleanup code.
>
> That's obviously impossible when the waiter has higher priority than
> the exiting task and both are pinned on the same CPU resulting in a
> live lock.
>
> #1 is a straight forward and simple fix
>
> The solution Yi and Yang provided looks solid and in the worst case
> causes a spurious wakeup of a waiter which is nothing to worry about
> as all waiter code has to be prepared for that anyway.
>
> #2 is more complex
>
> In the current implementation there is no way to block until the exiting
> task has finished the cleanup.
>
> To fix this there is quite some code reshuffling required which at the
> same time is a valuable cleanup.
>
> The final solution is to guard the futex exit handling with a per task
> mutex and make the waiter block on that mutex until the exiting task has
> the cleanup completed.
>
> Details why a simpler solution is not feasible can be found here:
>
> https://lore.kernel.org/r/[email protected]
>
> Ignore my confusion of fork vs. vfork at the beginning of the thread.
> Futexes do that to human brains. :)
>
> The following series addresses both issues.
>
> Patch 1 is a slightly polished version of the original Yi and Yang
> submission. It is included for completeness sake and because it
> creates conflicts with the larger surgery which fixes issue #2.
>
> Aside of that a few eyeballs more on that subtlety are definitely not
> a bad thing especially as this has a user space component in it.
>
> The rest of the series addresses issue #2 which is more or less a kernel
> only problem, but extra eyeballs are appreciated.
>
> I'm certainly not proud about the solution for #2 but it's the best I could
> come up with without violating the user/kernel state consistency
> constraints.
I really like the whole series - this is how it should have been
implemented originally, but the exit scenarios 'looked' so simple that
it was just open-coded ... Mea culpa. :-)
As to ->futex_exit_mutex: that's really just a consequence of the ABI,
and a lot cleaner than all the previous pretense that these exit ops are
atomic - which they fundamentally aren't.
Haven't tested the series beyond build coverage, but the high level
principles behind the whole series look very sound to me:
Reviewed-by: Ingo Molnar <[email protected]>
Thanks,
Ingo
On Wed, Nov 06, 2019 at 10:55:34PM +0100, Thomas Gleixner wrote:
> fs/exec.c | 2
> include/linux/compat.h | 2
> include/linux/futex.h | 38 +++--
> include/linux/sched.h | 3
> include/linux/sched/mm.h | 6
> kernel/exit.c | 30 ----
> kernel/fork.c | 40 ++---
> kernel/futex.c | 324 ++++++++++++++++++++++++++++++++++++++++-------
> 8 files changed, 330 insertions(+), 115 deletions(-)
Acked-by: Peter Zijlstra (Intel) <[email protected]>
On 11/06, Thomas Gleixner wrote:
>
> fs/exec.c | 2
> include/linux/compat.h | 2
> include/linux/futex.h | 38 +++--
> include/linux/sched.h | 3
> include/linux/sched/mm.h | 6
> kernel/exit.c | 30 ----
> kernel/fork.c | 40 ++---
> kernel/futex.c | 324 ++++++++++++++++++++++++++++++++++++++++-------
> 8 files changed, 330 insertions(+), 115 deletions(-)
The whole series looks good to me.
But I am just curious, what do you all think about the patch below
instead of 10/12 and 12/12?
Oleg.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9e0de08..ad18433 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -621,6 +621,11 @@ struct wake_q_node {
struct wake_q_node *next;
};
+struct wake_q_head {
+ struct wake_q_node *first;
+ struct wake_q_node **lastp;
+};
+
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
@@ -1055,6 +1060,7 @@ struct task_struct {
struct list_head pi_state_list;
struct futex_pi_state *pi_state_cache;
unsigned int futex_state;
+ struct wake_q_head futex_exit_q;
#endif
#ifdef CONFIG_PERF_EVENTS
struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index 26a2013..62805b5 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -35,11 +35,6 @@
#include <linux/sched.h>
-struct wake_q_head {
- struct wake_q_node *first;
- struct wake_q_node **lastp;
-};
-
#define WAKE_Q_TAIL ((struct wake_q_node *) 0x01)
#define DEFINE_WAKE_Q(name) \
diff --git a/kernel/futex.c b/kernel/futex.c
index 4b36bc8..87763c7 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -1176,6 +1176,24 @@ static int attach_to_pi_state(u32 __user *uaddr, u32 uval,
return ret;
}
+static void wait_for_owner_exiting(int ret)
+{
+ struct wake_q_node *node = &current->wake_q;
+
+ if (ret != -EBUSY) {
+ WARN_ON_ONCE(node->next); // XXX not really correct ...
+ return;
+ }
+
+ for (;;) {
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ if (!READ_ONCE(node->next))
+ break;
+ schedule();
+ }
+ __set_current_state(TASK_RUNNING);
+}
+
static int handle_exit_race(u32 __user *uaddr, u32 uval,
struct task_struct *tsk)
{
@@ -1185,8 +1203,10 @@ static int handle_exit_race(u32 __user *uaddr, u32 uval,
* If the futex exit state is not yet FUTEX_STATE_DEAD, tell the
* caller that the alleged owner is busy.
*/
- if (tsk && tsk->futex_state != FUTEX_STATE_DEAD)
+ if (tsk && tsk->futex_state != FUTEX_STATE_DEAD) {
+ wake_q_add(&tsk->futex_exit_q, current);
return -EBUSY;
+ }
/*
* Reread the user space value to handle the following situation:
@@ -2104,6 +2124,7 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags,
hb_waiters_dec(hb2);
put_futex_key(&key2);
put_futex_key(&key1);
+ wait_for_owner_exiting(ret);
cond_resched();
goto retry;
default:
@@ -2855,6 +2876,7 @@ static int futex_lock_pi(u32 __user *uaddr, unsigned int flags,
queue_unlock(hb);
put_futex_key(&q.key);
cond_resched();
+ wait_for_owner_exiting(ret);
goto retry;
default:
goto out_unlock_put_key;
@@ -3701,6 +3723,7 @@ static void futex_cleanup(struct task_struct *tsk)
void futex_exit_recursive(struct task_struct *tsk)
{
tsk->futex_state = FUTEX_STATE_DEAD;
+ wake_up_q(&tsk->futex_exit_q);
}
static void futex_cleanup_begin(struct task_struct *tsk)
@@ -3718,16 +3741,17 @@ static void futex_cleanup_begin(struct task_struct *tsk)
*/
raw_spin_lock_irq(&tsk->pi_lock);
tsk->futex_state = FUTEX_STATE_EXITING;
+ wake_q_init(&tsk->futex_exit_q);
raw_spin_unlock_irq(&tsk->pi_lock);
}
static void futex_cleanup_end(struct task_struct *tsk, int state)
{
- /*
- * Lockless store. The only side effect is that an observer might
- * take another loop until it becomes visible.
- */
+ raw_spin_lock_irq(&tsk->pi_lock);
tsk->futex_state = state;
+ raw_spin_unlock_irq(&tsk->pi_lock);
+
+ wake_up_q(&tsk->futex_exit_q);
}
void futex_exec_release(struct task_struct *tsk)
* Thomas Gleixner:
> The series is also available from git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.locking/futex
I ran the glibc upstream test suite (which has some robust futex tests)
against b21be7e942b49168ee15a75cbc49fbfdeb1e6a97 on x86-64, both native
and 32-bit/i386 compat mode.
Compat mode seems broken; nptl/tst-thread-affinity-pthread fails. This
is probably *not* due to
<https://bugzilla.kernel.org/show_bug.cgi?id=154011> because the failure
is not sporadic; it reliably fails for thread 253:
info: Detected CPU set size (in bits): 225
info: Maximum test CPU: 255
error: pthread_create for thread 253 failed: Resource temporarily unavailable
I'm running this on a large box as root, so ulimits etc. do not apply.
I did not see this failure with the x86-64 test.
You should be able to reproduce with (assuming you've got a multilib gcc):
git clone git://sourceware.org/git/glibc.git git
mkdir build
cd build
../git/configure --prefix=/usr CC="gcc -m32" CXX="g++ -m32" --build=i686-linux
make -j`nproc`
make test t=nptl/tst-thread-affinity-pthread
Thanks,
Florian
* Florian Weimer:
> * Thomas Gleixner:
>
>> The series is also available from git:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.locking/futex
>
> I ran the glibc upstream test suite (which has some robust futex tests)
> against b21be7e942b49168ee15a75cbc49fbfdeb1e6a97 on x86-64, both native
> and 32-bit/i386 compat mode.
>
> compat mode seems broken, nptl/tst-thread-affinity-pthread fails. This
> is probably *not* due to
> <https://bugzilla.kernel.org/show_bug.cgi?id=154011> because the failure
> is non-sporadic, but reliable fails for thread 253:
>
> info: Detected CPU set size (in bits): 225
> info: Maximum test CPU: 255
> error: pthread_create for thread 253 failed: Resource temporarily unavailable
>
> I'm running this on a large box as root, so ulimits etc. do not apply.
>
> I did not see this failure with the x86-64 test.
>
> You should be able to reproduce with (assuming you've got a multilib gcc):
>
> git clone git://sourceware.org/git/glibc.git git
> mkdir build
> cd build
> ../git/configure --prefix=/usr CC="gcc -m32" CXX="g++ -m32" --build=i686-linux
> make -j`nproc`
> make test t=nptl/tst-thread-affinity-pthread
Sorry, I realized that I didn't actually verify that this is a
regression caused by your patches. Maybe I can do that tomorrow.
Florian
* Florian Weimer:
> * Florian Weimer:
>
>> * Thomas Gleixner:
>>
>>> The series is also available from git:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.locking/futex
>>
>> I ran the glibc upstream test suite (which has some robust futex tests)
>> against b21be7e942b49168ee15a75cbc49fbfdeb1e6a97 on x86-64, both native
>> and 32-bit/i386 compat mode.
>>
>> compat mode seems broken, nptl/tst-thread-affinity-pthread fails. This
>> is probably *not* due to
>> <https://bugzilla.kernel.org/show_bug.cgi?id=154011> because the failure
>> is non-sporadic, but reliable fails for thread 253:
>>
>> info: Detected CPU set size (in bits): 225
>> info: Maximum test CPU: 255
>> error: pthread_create for thread 253 failed: Resource temporarily unavailable
>>
>> I'm running this on a large box as root, so ulimits etc. do not apply.
>>
>> I did not see this failure with the x86-64 test.
>>
>> You should be able to reproduce with (assuming you've got a multilib gcc):
>>
>> git clone git://sourceware.org/git/glibc.git git
>> mkdir build
>> cd build
>> ../git/configure --prefix=/usr CC="gcc -m32" CXX="g++ -m32" --build=i686-linux
>> make -j`nproc`
>> make test t=nptl/tst-thread-affinity-pthread
>
> Sorry, I realized that I didn't actually verify that this is a
> regression caused by your patches. Maybe I can do that tomorrow.
Confirmed as a regression caused by the patches. Depending on the
nature of the bug, you need a machine which has or pretends to have many
CPUs (this one has 256 CPUs).
Please let me know if you need more information.
Thanks,
Florian
On Fri, 8 Nov 2019, Florian Weimer wrote:
> * Florian Weimer:
> > * Florian Weimer:
> >> I ran the glibc upstream test suite (which has some robust futex tests)
> >> against b21be7e942b49168ee15a75cbc49fbfdeb1e6a97 on x86-64, both native
> >> and 32-bit/i386 compat mode.
> >>
> >> compat mode seems broken, nptl/tst-thread-affinity-pthread fails. This
> >> is probably *not* due to
> >> <https://bugzilla.kernel.org/show_bug.cgi?id=154011> because the failure
> >> is non-sporadic, but reliable fails for thread 253:
> >>
> >> info: Detected CPU set size (in bits): 225
> >> info: Maximum test CPU: 255
> >> error: pthread_create for thread 253 failed: Resource temporarily unavailable
> >>
> >> I'm running this on a large box as root, so ulimits etc. do not apply.
> >>
> >> I did not see this failure with the x86-64 test.
> >>
> >> You should be able to reproduce with (assuming you've got a multilib gcc):
> >>
> >> git clone git://sourceware.org/git/glibc.git git
> >> mkdir build
> >> cd build
> >> ../git/configure --prefix=/usr CC="gcc -m32" CXX="g++ -m32" --build=i686-linux
> >> make -j`nproc`
> >> make test t=nptl/tst-thread-affinity-pthread
> >
> > Sorry, I realized that I didn't actually verify that this is a
> > regression caused by your patches. Maybe I can do that tomorrow.
>
> Confirmed as a regression caused by the patches. Depending on the
> nature of the bug, you need a machine which has or pretends to have many
> CPUs (this one has 256 CPUs).
Sure I can do that, but I completely fail to see how that's a
regression.
Unpatched 5.4-rc6:
FAIL: nptl/tst-thread-affinity-pthread
original exit status 1
info: Detected CPU set size (in bits): 225
info: Maximum test CPU: 255
error: pthread_create for thread 253 failed: Resource temporarily unavailable
TBH, the futex changes have absolutely nothing to do with that resource
fail.
Thanks,
tglx
* Thomas Gleixner:
> On Fri, 8 Nov 2019, Florian Weimer wrote:
>> * Florian Weimer:
>> > * Florian Weimer:
>> >> I ran the glibc upstream test suite (which has some robust futex tests)
>> >> against b21be7e942b49168ee15a75cbc49fbfdeb1e6a97 on x86-64, both native
>> >> and 32-bit/i386 compat mode.
>> >>
>> >> compat mode seems broken, nptl/tst-thread-affinity-pthread fails. This
>> >> is probably *not* due to
>> >> <https://bugzilla.kernel.org/show_bug.cgi?id=154011> because the failure
>> >> is non-sporadic, but reliable fails for thread 253:
>> >>
>> >> info: Detected CPU set size (in bits): 225
>> >> info: Maximum test CPU: 255
>> >> error: pthread_create for thread 253 failed: Resource temporarily unavailable
>> >>
>> >> I'm running this on a large box as root, so ulimits etc. do not apply.
>> >>
>> >> I did not see this failure with the x86-64 test.
>> >>
>> >> You should be able to reproduce with (assuming you've got a multilib gcc):
>> >>
>> >> git clone git://sourceware.org/git/glibc.git git
>> >> mkdir build
>> >> cd build
>> >> ../git/configure --prefix=/usr CC="gcc -m32" CXX="g++ -m32" --build=i686-linux
>> >> make -j`nproc`
>> >> make test t=nptl/tst-thread-affinity-pthread
>> >
>> > Sorry, I realized that I didn't actually verify that this is a
>> > regression caused by your patches. Maybe I can do that tomorrow.
>>
>> Confirmed as a regression caused by the patches. Depending on the
>> nature of the bug, you need a machine which has or pretends to have many
>> CPUs (this one has 256 CPUs).
>
> Sure I can do that, but I completely fail to see how that's a
> regression.
>
> Unpatched 5.4-rc6:
>
> FAIL: nptl/tst-thread-affinity-pthread
> original exit status 1
> info: Detected CPU set size (in bits): 225
> info: Maximum test CPU: 255
> error: pthread_create for thread 253 failed: Resource temporarily unavailable
Huh. Reverting your patches (at commit 26bc672134241a080a83b2ab9aa8abede8d30e1c)
fixes the test for me.
> TBH, the futex changes have absolutely nothing to do with that resource
> fail.
I suspect that there are some changes to task exit latency which
trigger the latent resource management bug.
Thanks,
Florian
On Fri, 8 Nov 2019, Florian Weimer wrote:
> > On Fri, 8 Nov 2019, Florian Weimer wrote:
> > Unpatched 5.4-rc6:
> >
> > FAIL: nptl/tst-thread-affinity-pthread
> > original exit status 1
> > info: Detected CPU set size (in bits): 225
> > info: Maximum test CPU: 255
> > error: pthread_create for thread 253 failed: Resource temporarily unavailable
>
> Huh. Reverting your patches (at commit 26bc672134241a080a83b2ab9aa8abede8d30e1c)
> fixes the test for me.
>
> > TBH, the futex changes have absolutely nothing to do with that resource
> > fail.
>
> I suspect that there are some changes to task exit latency, which
> triggers the latent resource management bug.
Right, and depending on which hardware you run on, this changes. On the
big testbox I use, the failure also bounces around between threads 252
and 254.
Thanks,
tglx
On Fri, 8 Nov 2019, Thomas Gleixner wrote:
> On Fri, 8 Nov 2019, Florian Weimer wrote:
> > > On Fri, 8 Nov 2019, Florian Weimer wrote:
> > > Unpatched 5.4-rc6:
> > >
> > > FAIL: nptl/tst-thread-affinity-pthread
> > > original exit status 1
> > > info: Detected CPU set size (in bits): 225
> > > info: Maximum test CPU: 255
> > > error: pthread_create for thread 253 failed: Resource temporarily unavailable
> >
> > Huh. Reverting your patches (at commit 26bc672134241a080a83b2ab9aa8abede8d30e1c)
> > fixes the test for me.
> >
> > > TBH, the futex changes have absolutely nothing to do with that resource
> > > fail.
> >
> > I suspect that there are some changes to task exit latency, which
> > triggers the latent resource management bug.
>
> Right, and depending on which hardware you run, this changes. On the big
> testbox I use the failure is also bouncing around between thread 252 and
> 254.
Which was just an assumption and is completely wrong.
The failure is expected, and the failure output of that test is totally
bonkers:
Tracing shows that clone is not failing at all:
ld-linux.so.2-26694 [060] .... 6477.924785: sys_enter: NR 120 (3d0f00, f7cda424, f7cdaba8, ff819790, f7cdaba8, f7edd000)
ld-linux.so.2-26694 [060] .... 6477.924867: sys_exit: NR 120 = 26695
...
ld-linux.so.2-26694 [191] .... 6477.985139: sys_enter: NR 120 (3d0f00, fef27424, fef27ba8, ff819790, fef27ba8, f7edd000)
ld-linux.so.2-26694 [191] .... 6477.985220: sys_exit: NR 120 = 27203
That's a total of 509 threads created. And then right after that:
ld-linux.so.2-26694 [191] .... 6477.985221: sys_enter: NR 192 (0, 801000, 0, 20022, ffffffff, 0)
ld-linux.so.2-26694 [191] .... 6477.985222: sys_exit: NR 192 = -12
mmap2 fails with ENOMEM, which is not really surprising. The map length
is 0x801000, which means that the already started threads have consumed:
509 * 0x801000 == 4073.99 MB == 3.9785 GB
The next mmap2 fails for a 32-bit process for pretty obvious reasons,
and rightfully so.
pthread_create() returns EAGAIN while the underlying problem is ENOMEM,
which causes this bonkers output:
error: pthread_create for thread 253 failed: Resource temporarily unavailable
There is nothing temporary about it. The process has exhausted its
address space.
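A quick back-of-the-envelope check of that arithmetic, using only the
numbers from the trace above:

/* Sanity-check the address space consumption quoted above. */
#include <stdio.h>

int main(void)
{
	unsigned long long map_len = 0x801000;	/* per-thread mmap2 length */
	unsigned long long threads = 509;	/* threads created before the failure */
	unsigned long long total = threads * map_len;

	printf("%llu bytes = %.2f MB = %.4f GB\n", total,
	       total / (1024.0 * 1024.0),
	       total / (1024.0 * 1024.0 * 1024.0));
	/* ~3.98 GB, i.e. more than a 32-bit process can map, hence ENOMEM. */
	return 0;
}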
That test's output is strange anyway:
info: Detected CPU set size (in bits): 225
info: Maximum test CPU: 255
Interesting how it fits 256 CPUs into a cpuset with a size of 225 bits.
/me goes back to stare into iopl().
Thanks,
tglx
* Thomas Gleixner:
> pthread_create() returns EAGAIN while the underlying problem is ENOMEM
> which causes this bonkers output:
>
> error: pthread_create for thread 253 failed: Resource temporarily unavailable
>
> There is nothing temporarily. The process has its address space exhausted.
Thanks for analyzing the failure. I thought we had already covered
that. I've fixed the test locally and will submit the changes. The
fixed test passes, as expected.
I expected that we had fixed all such occurrences of per-CPU thread
creation, but apparently not. 8-(
> That test's output is anyway strange:
>
> info: Detected CPU set size (in bits): 225
> info: Maximum test CPU: 255
>
> Interesting how it fits 256 CPUs into a cpuset with a size of 225 bits.
That's an unfortunate side effect of how the CPU set allocation works in
userspace. The allocation uses a size measured in bits (which are
rounded up, out of necessity), but the kernel interface is byte-based.
The kernel does not know that some bits are padding, and happily
writes result data there. So we get bits back for which no space had
been allocated explicitly.
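A small sketch of that bit-sized vs. byte-sized mismatch, using the
standard CPU_ALLOC/sched_getaffinity interfaces (the 225 is just the
size from the test output above):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	/* User space asks for 225 bits ... */
	cpu_set_t *set = CPU_ALLOC(225);
	/* ... but the allocation is rounded up to whole bytes/words. */
	size_t bytes = CPU_ALLOC_SIZE(225);

	/*
	 * The kernel fills the buffer byte-wise, so bits 225 up to
	 * (bytes * 8 - 1) can come back set even though the bit-based
	 * API never "allocated" them.
	 */
	if (set && sched_getaffinity(0, bytes, set) == 0)
		printf("allocated for 225 bits, kernel may report up to %zu bits\n",
		       bytes * 8);

	CPU_FREE(set);
	return 0;
}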
I hesitated to clean this up because the story on the kernel side was
equally mystifying. getaffinity requires space in the mask for CPUs
that are currently not present and whose affinity bits are not set. Due
to firmware bugs, this means that we can cross the magic 1024-bit
boundary (corresponding to the 128-byte legacy mask size), and some
applications will refuse to start. 8-( There was considerable
controversy on the kernel side the last time this came up IIRC.
Thanks,
Florian