I am investigating an issue on 4.9.184 in which futex() returns EPERM
intermittently for
    futex(uaddr, FUTEX_WAIT_PRIVATE, val, &timeout, NULL, 0)
The failure affects an application in an AWS lambda; traditional
debugging approaches vary from difficult to impossible. I cannot
reproduce the problem at will, instrument the kernel, install a new
kernel or get an application core dump.
Understanding the circumstances under which EPERM can be returned for
FUTEX_WAIT_PRIVATE would be useful but it is not a documented failure
mode. I have spent some time looking through futex.c but have not
found anything yet. I would be grateful for a hint from someone more
knowledgeable.
Please address/cc me on any reply.
Thanks,
Robert Harris
Confidentiality Notice | This email and any included attachments may be privileged, confidential and/or otherwise protected from disclosure. Access to this email by anyone other than the intended recipient is unauthorized. If you believe you have received this email in error, please contact the sender immediately and delete all copies. If you are not the intended recipient, you are notified that disclosing, copying, distributing or taking any action in reliance on the contents of this information is strictly prohibited.
> On 12 Nov 2019, at 17:40, Harris, Robert <[email protected]> wrote:
>
> I am investigating an issue on 4.9.184 in which futex() returns EPERM
> intermittently for
>
> futex(uaddr, FUTEX_WAIT_PRIVATE, val, &timeout, NULL, 0)
>
> The failure affects an application in an AWS lambda; traditional
> debugging approaches vary from difficult to impossible. I cannot
> reproduce the problem at will, instrument the kernel, install a new
> kernel or get an application core dump.
>
> Understanding the circumstances under which EPERM can be returned for
> FUTEX_WAIT_PRIVATE would be useful but it is not a documented failure
> mode. I have spent some time looking through futex.c but have not
> found anything yet. I would be grateful for a hint from someone more
> knowledgeable.
>
> Please address/cc me on any reply.
To be clear, I do mean that futex() is returning -1 and setting errno
to EPERM.
Robert
On Tue, 12 Nov 2019, Harris, Robert wrote:
> I am investigating an issue on 4.9.184 in which futex() returns EPERM
> intermittently for
>
> futex(uaddr, FUTEX_WAIT_PRIVATE, val, &timeout, NULL, 0)
>
> The failure affects an application in an AWS lambda; traditional
> debugging approaches vary from difficult to impossible. I cannot
> reproduce the problem at will, instrument the kernel, install a new
> kernel or get an application core dump.
>
> Understanding the circumstances under which EPERM can be returned for
> FUTEX_WAIT_PRIVATE would be useful but it is not a documented failure
> mode. I have spent some time looking through futex.c but have not
> found anything yet. I would be grateful for a hint from someone more
> knowledgeable.
sys_futex(FUTEX_WAIT_PRIVATE) does not return -EPERM. Only the PI variants
do that.
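
For contrast, a minimal sketch of a PI operation that does fail with EPERM:
the futex(2) man page documents EPERM for FUTEX_UNLOCK_PI when the caller
does not own the lock. The helper name unlock_pi_not_owner() is hypothetical,
and the sketch assumes a kernel built with PI futex support.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: attempt FUTEX_UNLOCK_PI on a futex word that the
 * calling thread does not own. Returns 0 on success, otherwise errno.
 */
static int unlock_pi_not_owner(void)
{
    uint32_t word = 0;  /* owner TID field is 0, so this thread is not the owner */
    long rc = syscall(SYS_futex, &word, FUTEX_UNLOCK_PI_PRIVATE,
                      0, NULL, NULL, 0);

    return rc == -1 ? errno : 0;
}
```

Calling unlock_pi_not_owner() is expected to yield EPERM on such a kernel,
whereas no comparable ownership check exists on the plain FUTEX_WAIT path.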
Thanks,
tglx
> On 13 Nov 2019, at 09:04, Thomas Gleixner <[email protected]> wrote:
>
> On Tue, 12 Nov 2019, Harris, Robert wrote:
>
>> I am investigating an issue on 4.9.184 in which futex() returns EPERM
>> intermittently for
>>
>> futex(uaddr, FUTEX_WAIT_PRIVATE, val, &timeout, NULL, 0)
>>
>> The failure affects an application in an AWS lambda; traditional
>> debugging approaches vary from difficult to impossible. I cannot
>> reproduce the problem at will, instrument the kernel, install a new
>> kernel or get an application core dump.
>>
>> Understanding the circumstances under which EPERM can be returned for
>> FUTEX_WAIT_PRIVATE would be useful but it is not a documented failure
>> mode. I have spent some time looking through futex.c but have not
>> found anything yet. I would be grateful for a hint from someone more
>> knowledgeable.
>
> sys_futex(FUTEX_WAIT_PRIVATE) does not return -EPERM. Only the PI variants
> do that.
In that case I would appreciate a second pair of eyes. The error I see
(intermittently) is
pthread/ethr_event.c:164: Fatal error in wait__(): Operation not permitted (1)
which comes from
https://github.com/erlang/otp/blob/348e328375fb774b3fa919ffd1c4811367406516/erts/lib_src/pthread/ethr_event.c#L152-L164
> res = ETHR_FUTEX__(&e->futex,
> ETHR_FUTEX_WAIT__,
> ETHR_EVENT_OFF_WAITER__,
> tsp);
> switch (res) {
> case EINTR:
> case ETIMEDOUT:
> return res;
> case 0:
> case EWOULDBLOCK:
> break;
> default:
> ETHR_FATAL_ERROR__(res);
where
https://github.com/erlang/otp/blob/348e328375fb774b3fa919ffd1c4811367406516/erts/include/internal/ethread.h#L259-L260
> #define ETHR_FATAL_ERROR__(ERR) \
> ethr_fatal_error__(__FILE__, __LINE__, __func__, (ERR))
and
https://github.com/erlang/otp/blob/348e328375fb774b3fa919ffd1c4811367406516/erts/lib_src/common/ethr_aux.c#L725-L741
> ETHR_IMPL_NORETURN__ ethr_fatal_error__(const char *file,
> int line,
> const char *func,
> int err)
> {
> char *errstr;
> if (err == ENOTSUP)
> errstr = "Operation not supported";
> else {
> errstr = strerror(err);
> if (!errstr)
> errstr = "Unknown error";
> }
> fprintf(stderr, "%s:%d: Fatal error in %s(): %s (%d)\n",
> file, line, func, errstr, err);
> ethr_abort__();
> }
and
https://github.com/erlang/otp/blob/348e328375fb774b3fa919ffd1c4811367406516/erts/include/internal/pthread/ethr_event.h#L38-L58
> #if defined(FUTEX_WAIT_PRIVATE) && defined(FUTEX_WAKE_PRIVATE)
> # define ETHR_FUTEX_WAIT__ FUTEX_WAIT_PRIVATE
> # define ETHR_FUTEX_WAKE__ FUTEX_WAKE_PRIVATE
> #else
> # define ETHR_FUTEX_WAIT__ FUTEX_WAIT
> # define ETHR_FUTEX_WAKE__ FUTEX_WAKE
> #endif
>
> typedef struct {
> ethr_atomic32_t futex;
> } ethr_event;
>
> #define ETHR_FUTEX__(FTX, OP, VAL, TIMEOUT)\
> (-1 == syscall(__NR_futex,\
> (void *) ethr_atomic32_addr((FTX)),\
> (OP),\
> (int) (VAL),\
> (TIMEOUT),\
> NULL,\
> 0)\
> ? errno : 0)
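
As a rough standalone equivalent of the macro above (a sketch, not the erts
code), the wrapper below maps the syscall result to 0 or errno in the same
way. The name futex_wait_private() is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper mirroring ETHR_FUTEX__ for the wait case:
 * returns 0 if the thread was woken, otherwise the errno value.
 */
static int futex_wait_private(uint32_t *uaddr, uint32_t val,
                              const struct timespec *timeout)
{
    long rc = syscall(SYS_futex, uaddr, FUTEX_WAIT_PRIVATE, val,
                      timeout, NULL, 0);

    return rc == -1 ? errno : 0;
}
```

With this wrapper, a futex word that does not match val gives EAGAIN
(EWOULDBLOCK), and an expired relative timeout gives ETIMEDOUT; those, 0 and
EINTR are the results the switch in wait__() handles without aborting.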
To be sure:
> 0x0000000000687e65 <+325>: mov $0x80,%edx
> 0x0000000000687e6a <+330>: mov $0xca,%edi
> 0x0000000000687e6f <+335>: callq 0x443ab0 <syscall@plt>
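
The two immediates can be checked against the uapi headers. The sketch below
assumes an x86-64 build; the helper names futex_cmd(), futex_cmd_name() and
futex_op_is_private() are hypothetical.

```c
#include <linux/futex.h>
#include <sys/syscall.h>

/* 0x80 in %edx is the futex op; 0xca in %edi is the syscall number. */
_Static_assert(FUTEX_WAIT_PRIVATE == 0x80,
               "FUTEX_WAIT (0) | FUTEX_PRIVATE_FLAG (128)");
#if defined(__x86_64__) && !defined(__ILP32__)
_Static_assert(SYS_futex == 0xca, "futex is syscall 202 on x86-64");
#endif

/* Strip the PRIVATE and CLOCK_REALTIME flags to get the bare command. */
static int futex_cmd(int op)
{
    return op & FUTEX_CMD_MASK;
}

/* Non-zero if the op carries FUTEX_PRIVATE_FLAG. */
static int futex_op_is_private(int op)
{
    return (op & FUTEX_PRIVATE_FLAG) != 0;
}

/* Name the commands relevant to this thread; everything else is "other". */
static const char *futex_cmd_name(int op)
{
    switch (futex_cmd(op)) {
    case FUTEX_WAIT:       return "FUTEX_WAIT";
    case FUTEX_WAKE:       return "FUTEX_WAKE";
    case FUTEX_LOCK_PI:    return "FUTEX_LOCK_PI";
    case FUTEX_UNLOCK_PI:  return "FUTEX_UNLOCK_PI";
    case FUTEX_TRYLOCK_PI: return "FUTEX_TRYLOCK_PI";
    default:               return "other";
    }
}
```

Decoded this way, op 0x80 is a plain FUTEX_WAIT with the private flag set, not
one of the PI commands.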
Thanks,
Robert
On Tue, Nov 12, 2019 at 6:43 PM Harris, Robert
<[email protected]> wrote:
>
> I am investigating an issue on 4.9.184 in which futex() returns EPERM
> intermittently for
>
> futex(uaddr, FUTEX_WAIT_PRIVATE, val, &timeout, NULL, 0)
>
> The failure affects an application in an AWS lambda; traditional
> debugging approaches vary from difficult to impossible. I cannot
> reproduce the problem at will, instrument the kernel, install a new
> kernel or get an application core dump.
>
> Understanding the circumstances under which EPERM can be returned for
> FUTEX_WAIT_PRIVATE would be useful but it is not a documented failure
> mode. I have spent some time looking through futex.c but have not
> found anything yet. I would be grateful for a hint from someone more
> knowledgeable.
I just wanted to add that a colleague of mine reported the exact same
issue to me two days ago: a highly threaded application (the Erlang
VM) running in AWS lambda, futex wait calls occasionally failing with
EPERM. I don't have more specifics than that; I've asked for the kernel
version and the exact parameters of the failed futex call.
(Third attempt, really sorry about the noise, gmail's UI sucks.)
> On 13 Nov 2019, at 13:29, Mikael Pettersson <[email protected]> wrote:
>
> On Tue, Nov 12, 2019 at 6:43 PM Harris, Robert
> <[email protected]> wrote:
>>
>> I am investigating an issue on 4.9.184 in which futex() returns EPERM
>> intermittently for
>>
>> futex(uaddr, FUTEX_WAIT_PRIVATE, val, &timeout, NULL, 0)
>>
>> The failure affects an application in an AWS lambda; traditional
>> debugging approaches vary from difficult to impossible. I cannot
>> reproduce the problem at will, instrument the kernel, install a new
>> kernel or get an application core dump.
>>
>> Understanding the circumstances under which EPERM can be returned for
>> FUTEX_WAIT_PRIVATE would be useful but it is not a documented failure
>> mode. I have spent some time looking through futex.c but have not
>> found anything yet. I would be grateful for a hint from someone more
>> knowledgeable.
>
>
> I just wanted to add that a colleague of mine reported the exact same
> issue to me two days ago: a highly threaded application (the Erlang
> VM) running in AWS lambda, futex wait calls occasionally failing with
> EPERM. I don't have more specifics than that; I've asked for the kernel
> version and the exact parameters of the failed futex call.
Thanks, that's a great data point. One of my outstanding questions had
been "why does this happen to only us?"
When I look at the timings I can say with some confidence that the
problem stopped for us minutes after
    20:17 on 2019-10-23 in us-east-1
    20:30 on 2019-10-24 in eu-west-1
    18:17 on 2019-10-25 in us-west-2
(all times UTC). I've logged a ticket with Amazon to find out what
changed.
Robert
On Wed, 13 Nov 2019, Harris, Robert wrote:
> > On 13 Nov 2019, at 09:04, Thomas Gleixner <[email protected]> wrote:
> > On Tue, 12 Nov 2019, Harris, Robert wrote:
> >> Understanding the circumstances under which EPERM can be returned for
> >> FUTEX_WAIT_PRIVATE would be useful but it is not a documented failure
> >> mode. I have spent some time looking through futex.c but have not
> >> found anything yet. I would be grateful for a hint from someone more
> >> knowledgeable.
> >
> > sys_futex(FUTEX_WAIT_PRIVATE) does not return -EPERM. Only the PI variants
> > do that.
>
> In that case I would appreciate a second pair of eyes. The error I see
> (intermittently) is
The code looks innocent enough. As I don't know whether the kernel version
you mentioned is a vanilla 4.9.184 from the stable tree or some patched-up
frankenkernel which pretends to have this version number, I can't be sure
that this is an issue in that particular kernel.
In the vanilla 4.9.184 I really can't find how that would return EPERM for
regular futexes.
Thanks,
tglx