2008-01-04 21:41:31

by Phil Endecott

[permalink] [raw]
Subject: strace, accept(), ERESTARTSYS and EINTR

Dear Experts,

I have some code like this:

struct sockaddr_in client_addr;
socklen_t client_size=sizeof(client_addr);
int connfd = accept(fd,(struct sockaddr*)(&client_addr),&client_size);
if (connfd==-1) {
// [1]
.....report error and terminate......
}
int rc = fcntl(connfd,F_SETFD,FD_CLOEXEC);


I believe that I should be checking for errno==EINTR at [1] and
retrying the accept(); currently I'm not doing so.

When I strace -f this application - which is multi-threaded - I see this:

[pid 11079] accept(3, <unfinished ...>
[pid 11093] restart_syscall(<... resuming interrupted call ...>
<unfinished ...>
[pid 8799] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid 11079] <... accept resumed> 0xbfdaa73c, [16]) = ? ERESTARTSYS (To
be restarted)
[pid 8799] read(6, <unfinished ...>
[pid 11079] fcntl64(-512, F_SETFD, FD_CLOEXEC) = -1 EBADF (Bad file descriptor)

This shows accept() "returning" ERESTARTSYS; as I understand it this is
an artefact of how strace works, and my code will not have seen accept
return at all at that point. However, the strace output does not show
any other return from the call to accept() before reporting that
thread's call to fcntl(). And the first parameter to fcntl, -512, is
the return value from accept() which should be -1 or >0. What is going
on here???

Google found a couple of related reports:

http://lkml.org/lkml/2001/11/22/65 - Phil Howard reports getting
ERESTARTSYS returned from accept(), not only in the strace output, and
fixed his problem by treating it like EINTR. He looked at errno if
accept() returned <0, not ==-1.

http://lkml.org/lkml/2005/9/20/135 - Peter Duellings reports seeing
accept() return -512 with errno==0.


Some more background: I started stracing the program because it was
already misbehaving. It had created a larger than usual number of
threads (30?). It's quite possible that some process resource limit
had been reached; could this have confused the glibc syscall wrapper,
causing it to return the mysterious -512? Could it have confused the
kernel into returng ERESTARTSYS instead of e.g. E-too-many-sockets-open?

It's also possible that some random memory corruption had occurred.
valgrind is on my to-do list. But even if this had occurred, I would
expect to see strace reporting a legitimate return from accept() before
the call to fcntl(); there was no sign of any global resource limit
that might affect strace occurring.

This is a Debian system running kernel 2.6.21-1 i686, glibc 2.7, and
strace 4.5.14.


Many thanks for any suggestions.

Phil.

(You're welcome to Cc: me in any replies.)



2008-01-04 22:55:23

by Jiri Slaby

[permalink] [raw]
Subject: Re: strace, accept(), ERESTARTSYS and EINTR

On 01/04/2008 10:01 PM, Phil Endecott wrote:
> Dear Experts,
>
> I have some code like this:
>
> struct sockaddr_in client_addr;
> socklen_t client_size=sizeof(client_addr);
> int connfd = accept(fd,(struct sockaddr*)(&client_addr),&client_size);
> if (connfd==-1) {
> // [1]
> .....report error and terminate......
> }
> int rc = fcntl(connfd,F_SETFD,FD_CLOEXEC);

show socket() call please to see what proto and type you have there.

> I believe that I should be checking for errno==EINTR at [1] and retrying
> the accept(); currently I'm not doing so.
>
> When I strace -f this application - which is multi-threaded - I see this:
>
> [pid 11079] accept(3, <unfinished ...>
> [pid 11093] restart_syscall(<... resuming interrupted call ...>
> <unfinished ...>
> [pid 8799] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
> [pid 11079] <... accept resumed> 0xbfdaa73c, [16]) = ? ERESTARTSYS (To
> be restarted)
> [pid 8799] read(6, <unfinished ...>
> [pid 11079] fcntl64(-512, F_SETFD, FD_CLOEXEC) = -1 EBADF (Bad file
> descriptor)
>
> This shows accept() "returning" ERESTARTSYS; as I understand it this is
> an artefact of how strace works, and my code will not have seen accept
> return at all at that point. However, the strace output does not show
> any other return from the call to accept() before reporting that
> thread's call to fcntl(). And the first parameter to fcntl, -512, is
> the return value from accept() which should be -1 or >0. What is going
> on here???
>
> Google found a couple of related reports:
>
> http://lkml.org/lkml/2001/11/22/65 - Phil Howard reports getting
> ERESTARTSYS returned from accept(), not only in the strace output, and
> fixed his problem by treating it like EINTR. He looked at errno if
> accept() returned <0, not ==-1.
>
> http://lkml.org/lkml/2005/9/20/135 - Peter Duellings reports seeing
> accept() return -512 with errno==0.

ERESTARTSYS might be returned from system calls only when signal is pending.
Signal handler will change ERESTARTSYS to proper userspace error, i.e.
ERESTARTSYS (512) must not leak to userspace.

Some fail paths returns ERESTARTSYS even if no signal is pending and that used
to be the point.

2008-01-04 23:52:33

by Phil Endecott

[permalink] [raw]
Subject: Re: strace, accept(), ERESTARTSYS and EINTR

Hi Jiri,

Jiri Slaby wrote:
> On 01/04/2008 10:01 PM, Phil Endecott wrote:
>> Dear Experts,
>>
>> I have some code like this:
>>
>> struct sockaddr_in client_addr;
>> socklen_t client_size=sizeof(client_addr);
>> int connfd = accept(fd,(struct sockaddr*)(&client_addr),&client_size);
>> if (connfd==-1) {
>> // [1]
>> .....report error and terminate......
>> }
>> int rc = fcntl(connfd,F_SETFD,FD_CLOEXEC);
>
> show socket() call please to see what proto and type you have there.

It's a ipv4 tcp socket:

// error handling & other noise removed:
int fd = socket(PF_INET,SOCK_STREAM,0);
struct sockaddr_in server_addr;
memset(&server_addr,0,sizeof(server_addr));
server_addr.sin_family=AF_INET;
server_addr.sin_addr.s_addr=htonl(INADDR_ANY);
server_addr.sin_port=htons(port);
bind(fd,(struct sockaddr*)&server_addr,sizeof(server_addr));
listen(listenfd,128);


>> I believe that I should be checking for errno==EINTR at [1] and retrying
>> the accept(); currently I'm not doing so.
>>
>> When I strace -f this application - which is multi-threaded - I see this:
>>
>> [pid 11079] accept(3, <unfinished ...>
>> [pid 11093] restart_syscall(<... resuming interrupted call ...>
>> <unfinished ...>
>> [pid 8799] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
>> [pid 11079] <... accept resumed> 0xbfdaa73c, [16]) = ? ERESTARTSYS (To
>> be restarted)
>> [pid 8799] read(6, <unfinished ...>
>> [pid 11079] fcntl64(-512, F_SETFD, FD_CLOEXEC) = -1 EBADF (Bad file
>> descriptor)
>>
>> This shows accept() "returning" ERESTARTSYS; as I understand it this is
>> an artefact of how strace works, and my code will not have seen accept
>> return at all at that point. However, the strace output does not show
>> any other return from the call to accept() before reporting that
>> thread's call to fcntl(). And the first parameter to fcntl, -512, is
>> the return value from accept() which should be -1 or >0. What is going
>> on here???
>>
>> Google found a couple of related reports:
>>
>> http://lkml.org/lkml/2001/11/22/65 - Phil Howard reports getting
>> ERESTARTSYS returned from accept(), not only in the strace output, and
>> fixed his problem by treating it like EINTR. He looked at errno if
>> accept() returned <0, not ==-1.
>>
>> http://lkml.org/lkml/2005/9/20/135 - Peter Duellings reports seeing
>> accept() return -512 with errno==0.
>
> ERESTARTSYS might be returned from system calls only when signal is pending.
> Signal handler will change ERESTARTSYS to proper userspace error, i.e.
> ERESTARTSYS (512) must not leak to userspace.
>
> Some fail paths returns ERESTARTSYS even if no signal is pending and that used
> to be the point.

There are two odd things happening:

1. ERESTARTSYS is escaping to user-space, rather than EINTR or
restarting the accept.
2. It gets out of libc into my code in the form ret=-512, not (ret=-1, errno=512).

Very odd; a user-space mess (e.g. stack corruption) shouldn't be able
to change the kernel behaviour, and a kernel problem shouldn't cause
the odd libc behaviour. There must be another explanation....


Phil.



2008-01-05 01:07:02

by Frederik Deweerdt

[permalink] [raw]
Subject: Re: strace, accept(), ERESTARTSYS and EINTR

On Fri, Jan 04, 2008 at 09:01:38PM +0000, Phil Endecott wrote:
> Dear Experts,
>
> I have some code like this:
>
> struct sockaddr_in client_addr;
> socklen_t client_size=sizeof(client_addr);
> int connfd = accept(fd,(struct sockaddr*)(&client_addr),&client_size);
> if (connfd==-1) {
> // [1]
> .....report error and terminate......
Replacing the (connfd == -1) check with (connfd < 0),
could you try printf'ing connfd and errno here?
Just to confirm strace's output.
> }
> int rc = fcntl(connfd,F_SETFD,FD_CLOEXEC);
>
Regards,
Frederik

2008-01-05 01:39:14

by Phil Endecott

[permalink] [raw]
Subject: Re: strace, accept(), ERESTARTSYS and EINTR

Frederik Deweerdt wrote:
> On Fri, Jan 04, 2008 at 09:01:38PM +0000, Phil Endecott wrote:
>> struct sockaddr_in client_addr;
>> socklen_t client_size=sizeof(client_addr);
>> int connfd = accept(fd,(struct sockaddr*)(&client_addr),&client_size);
>> if (connfd==-1) {
>> // [1]
>> .....report error and terminate......
> Replacing the (connfd == -1) check with (connfd < 0),
> could you try printf'ing connfd and errno here?
> Just to confirm strace's output.

Yes, I've already added something like that. Now we just need to wait
for it to crash again, which will happen faster if people in a
different timezone go and play with
http://demos.anyterm.org/bastet/anyterm.html while I sleep....

Phil.

(Your are welcome to Cc: me in any replies)



2008-01-05 10:31:30

by Jiri Slaby

[permalink] [raw]
Subject: Re: strace, accept(), ERESTARTSYS and EINTR

Do not remove CCed people.

On 01/05/2008 02:38 AM, Phil Endecott wrote:
> timezone go and play with http://demos.anyterm.org/bastet/anyterm.html
> while I sleep....

Service Temporarily Unavailable

2008-01-05 11:34:22

by Jiri Slaby

[permalink] [raw]
Subject: Re: strace, accept(), ERESTARTSYS and EINTR

On 01/05/2008 11:31 AM, Jiri Slaby wrote:
> Do not remove CCed people.
>
> On 01/05/2008 02:38 AM, Phil Endecott wrote:
>> timezone go and play with http://demos.anyterm.org/bastet/anyterm.html
>> while I sleep....
>
> Service Temporarily Unavailable

bastet? the tetris in which you always get the worst block you can get at the
moment :)? Do you serve the pages by the code which causes the problems?

Could we see more of the server's code, even with multi-threading, which you
mentioned etc.

2008-01-05 14:24:55

by Phil Endecott

[permalink] [raw]
Subject: Re: strace, accept(), ERESTARTSYS and EINTR

Jiri Slaby wrote:
> On 01/05/2008 11:31 AM, Jiri Slaby wrote:
>> Do not remove CCed people.
>>
>> On 01/05/2008 02:38 AM, Phil Endecott wrote:
>>> timezone go and play with http://demos.anyterm.org/bastet/anyterm.html
>>> while I sleep....
>>
>> Service Temporarily Unavailable

It failed at 0325 GMT thanks to the many of you who played, but
unfortunately it was an assert somewhere unrelated that failed. I'll
start it up again shortly.

> bastet? the tetris in which you always get the worst block you can get at the
> moment :)?

That's the one! At the time it was the only game that worked with my
VT100-to-HTML conversion.

> Do you serve the pages by the code which causes the problems?

Yes.

> Could we see more of the server's code, even with multi-threading, which you
> mentioned etc.

It's all here: https://svn.chezphil.org/anyterm/trunk/daemon/
In particular: https://svn.chezphil.org/libpbe/trunk/src/Daemon.cc
However, there's a lot of code and I know that there are bugs in it. I
just want to focus on the kernel-related issue that the strace fragment
that I posted brings up: even if my user code gets completely screwed
up (corrupts its stack, runs out of FDs/VM/threads etc), I don't think
that I should see in the strace output that accept() has returned
ERESTARTSYS. I should always see accept() returning a legitimate
result before seeing that thread doing the next syscall. Right?


Cheers, Phil.


2008-01-05 15:03:20

by Andi Kleen

[permalink] [raw]
Subject: Re: strace, accept(), ERESTARTSYS and EINTR

"Phil Endecott" <[email protected]> writes:
>
> Many thanks for any suggestions.

It's a long standing bug reported regularly but so far nobody has tracked
it down. That's mostly because most people cannot really reproduce it.

If you can reproduce it reliably: you could do the brute force debugging
method. Simply grep all places in the kernel code that set or pass ERESTART
and add a

printk("%s:%d ERESTART\n", __FILE__, __LINE__);

or perhaps

printk("%s:%d %s:%d ERESTART\n", current->comm, current->pid, __FILE__,
__LINE__);

or perhaps

if (!strcmp(current->comm, "yourprogramwithoutpath"))
printk("%s:%d ERESTART\n", __FILE__, __LINE__);

[the later two might need adding linux/string.h and/or linux/sched.h includes
to the files if they don't compile]

to all of them. Then recompile and boot that kernel.

Now if you can match such a output line to your program
malfunction we would know where the problem is and then it would be likely
reasonable easy to fix.

Note that the above first two might produce a lot of output if you're
unlucky -- definitely don't do it on a production machine and kill
syslog first so that it doesn't all end up on disk.

-Andi

2008-01-05 16:38:47

by Andreas Schwab

[permalink] [raw]
Subject: Re: strace, accept(), ERESTARTSYS and EINTR

"Phil Endecott" <[email protected]> writes:

> However, there's a lot of code and I know that there are bugs in it. I
> just want to focus on the kernel-related issue that the strace fragment
> that I posted brings up: even if my user code gets completely screwed up
> (corrupts its stack, runs out of FDs/VM/threads etc), I don't think that I
> should see in the strace output that accept() has returned
> ERESTARTSYS.

strace always sees the raw return value, before the signal handler is
executed and before the check for syscall restart is done.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux Products GmbH, Maxfeldstra?e 5, 90409 N?rnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2008-01-05 17:02:56

by Jiri Slaby

[permalink] [raw]
Subject: Re: strace, accept(), ERESTARTSYS and EINTR

On 01/05/2008 04:03 PM, Andi Kleen wrote:
> if (!strcmp(current->comm, "yourprogramwithoutpath"))
> printk("%s:%d ERESTART\n", __FILE__, __LINE__);
>
> [the later two might need adding linux/string.h and/or linux/sched.h includes
> to the files if they don't compile]
>
> to all of them. Then recompile and boot that kernel.

If I'm not completely blind, there is no suspicous ERESTARTSYS in
PF_INET:SOCK_STREAM path.

Anyway even if it was returned by kernel syscall, glibc should put it in errno
and return -1 (since it is >= -4095 and < 0). He sees it in the accept retval.
Something rotten, probably broken/racy ?ntpl? implementation of syscalls or
somewhat?

regards,
--
Jiri Slaby
Faculty of Informatics, Masaryk University
Suse Labs

2008-01-05 18:18:40

by Phil Endecott

[permalink] [raw]
Subject: Re: strace, accept(), ERESTARTSYS and EINTR

Andreas Schwab wrote:
> "Phil Endecott" <[email protected]> writes:
>
>> However, there's a lot of code and I know that there are bugs in it. I
>> just want to focus on the kernel-related issue that the strace fragment
>> that I posted brings up: even if my user code gets completely screwed up
>> (corrupts its stack, runs out of FDs/VM/threads etc), I don't think that I
>> should see in the strace output that accept() has returned
>> ERESTARTSYS.
>
> strace always sees the raw return value, before the signal handler is
> executed and before the check for syscall restart is done.

Yes, but I should see the real final return value in another strace
output line before I see that thread doing something else. Correct?
Here's the strace output again. Look at what thread 11079 does:

[pid 11079] accept(3, <unfinished ...>
[pid 11093] restart_syscall(<... resuming interrupted call ...>
<unfinished ...>
[pid 8799] --- SIGSTOP (Stopped (signal)) @ 0 (0) ---
[pid 11079] <... accept resumed> 0xbfdaa73c, [16]) = ? ERESTARTSYS (To
be restarted)
[pid 8799] read(6, <unfinished ...>
[pid 11079] fcntl64(-512, F_SETFD, FD_CLOEXEC) = -1 EBADF (Bad file descriptor)

strace reports accept() returning ERESTARTSYS, and the next thing we
see from that thread is the call to fcntl(), which is the next thing
that my code does.

Phil.



2008-01-05 18:42:36

by Phil Endecott

[permalink] [raw]
Subject: Re: strace, accept(), ERESTARTSYS and EINTR

Andi Kleen wrote:
> "Phil Endecott" <[email protected]> writes:
>>
>> Many thanks for any suggestions.
>
> It's a long standing bug reported regularly

Thanks for the re-assurance.

> but so far nobody has tracked
> it down. That's mostly because most people cannot really reproduce it.

I'm pretty sure that it's triggered by some earlier problem, most
likely reaching a threads-per-process limit (or an associated total VM
per process limit due to large default thread stack sizes). There's a
good chance that I will find and fix that problem, and this one will
then go away unsolved...

> If you can reproduce it reliably
[snip]
> definitely don't do it on a production machine

Unfortunately, it happens on the production machine but I've not been
able to reproduce it otherwise. I'll keep trying. FWIW my assumption
is that the ERESTARTSYS comes from the sock_intr_errno() call from inet_csk_wait_for_connect().


Thanks, Phil.



2008-01-05 19:19:20

by Andreas Schwab

[permalink] [raw]
Subject: Re: strace, accept(), ERESTARTSYS and EINTR

"Phil Endecott" <[email protected]> writes:

> Andreas Schwab wrote:
>> "Phil Endecott" <[email protected]> writes:
>>
>>> However, there's a lot of code and I know that there are bugs in it. I
>>> just want to focus on the kernel-related issue that the strace fragment
>>> that I posted brings up: even if my user code gets completely screwed up
>>> (corrupts its stack, runs out of FDs/VM/threads etc), I don't think that I
>>> should see in the strace output that accept() has returned
>>> ERESTARTSYS.
>>
>> strace always sees the raw return value, before the signal handler is
>> executed and before the check for syscall restart is done.
>
> Yes, but I should see the real final return value in another strace output
> line before I see that thread doing something else. Correct?

No. As far as strace is concerned the syscall has finished. Since it
isn't restarted, you won't see it again in the trace.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux Products GmbH, Maxfeldstra?e 5, 90409 N?rnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2008-01-05 19:35:32

by Phil Endecott

[permalink] [raw]
Subject: Re: strace, accept(), ERESTARTSYS and EINTR

Andreas Schwab wrote:
> "Phil Endecott" <[email protected]> writes:
>> Andreas Schwab wrote:
>>> "Phil Endecott" <[email protected]> writes:
>>>> However, there's a lot of code and I know that there are bugs in it. I
>>>> just want to focus on the kernel-related issue that the strace fragment
>>>> that I posted brings up: even if my user code gets completely screwed up
>>>> (corrupts its stack, runs out of FDs/VM/threads etc), I don't think that I
>>>> should see in the strace output that accept() has returned
>>>> ERESTARTSYS.
>>>
>>> strace always sees the raw return value, before the signal handler is
>>> executed and before the check for syscall restart is done.
>>
>> Yes, but I should see the real final return value in another strace output
>> line before I see that thread doing something else. Correct?
>
> No. As far as strace is concerned the syscall has finished. Since it
> isn't restarted, you won't see it again in the trace.

Sorry for being dense, but what is the implication of your comment
"Since it isn't restarted" ? Are you saying that the kernel isn't
going to restart it and will have converted it to EINTR and returned
that to user-space, and that this modified return value is not reported
by strace?


Phil.



2008-01-05 19:50:52

by Andreas Schwab

[permalink] [raw]
Subject: Re: strace, accept(), ERESTARTSYS and EINTR

"Phil Endecott" <[email protected]> writes:

> Sorry for being dense, but what is the implication of your comment "Since
> it isn't restarted" ? Are you saying that the kernel isn't going to
> restart it and will have converted it to EINTR and returned that to
> user-space, and that this modified return value is not reported by strace?

Yes, assuming there was a signal handler and it wasn't registered with
SA_RESTART.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux Products GmbH, Maxfeldstra?e 5, 90409 N?rnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."