2007-01-11 00:14:49

by Sean Reifschneider

Subject: select() setting ERESTARTNOHAND (514).

I've been looking at an issue in Python where a "time.sleep(1)" will
sporadically raise an IOError exception with errno=514. time.sleep() is
implemented with select(), to get sub-second resolution.

In looking at the Linux code for ERESTARTNOHAND, I see that
include/linux/errno.h says this errno should never make it to the user.
However, in this instance we ARE seeing it. Looking around on google shows
others are seeing it as well, though hits are few.

In looking at the select() code, I see that there are definitely cases
where sys_select() or sys_pselect7() can return -ERESTARTNOHAND. However,
I don't know if this is expected to be caught elsewhere, or if returning it
here would send it back to user-space. Worse, I don't fully understand
what the impact of trapping the ERESTARTNOHAND in the
sys_select/sys_pselect7 functions would be.

Is this something that's intended to be returned to the user, in which
case the message in include/linux/errno.h should be corrected and people
using time.sleep() in python will just have to live with it sometimes
raising an exception? Or is it something that definitely should never
reach the user-space code, and there's some leak?

Just to be clear, this is happening only on one machine out of at least 4
where this has been tested. The machine where it's happening is a dual
processor, dual core Xeon 2GHz 51xx series system. The other systems where
it's not happening are single CPU Celeron or P4 class systems, though one
is a 2-year-old quad CPU Xeon running something <2GHz, IIRC.

More details on my investigation are at:

http://www.tummy.com/journals/entries/jafo_20070110_154659

Thoughts?

Thanks,
Sean
(Not subscribed, I'll use the list archive to follow-up)
--
Electricity travels a foot in a nanosecond.
-- Commodore Grace Murray Hopper
Sean Reifschneider, Member of Technical Staff <[email protected]>
tummy.com, ltd. - Linux Consulting since 1995: Ask me about High Availability


2007-01-11 00:27:49

by David Miller

Subject: Re: select() setting ERESTARTNOHAND (514).

From: Sean Reifschneider <[email protected]>
Date: Wed, 10 Jan 2007 16:42:38 -0700

> In looking at the select() code, I see that there are definitely cases
> where sys_select() or sys_pselect7() can return -ERESTARTNOHAND. However,
> I don't know if this is expected to be caught elsewhere, or if returning it
> here would send it back to user-space. Worse, I don't fully understand
> what the impact of trapping the ERESTARTNOHAND in the
> sys_select/sys_pselect7 functions would be.

It gets caught by the return into userspace code.

Specifically the signal dispatch should repair that return
value to a valid error return code when it tries to dispatch
the signal that select() set in the task struct.

Note that select() only returns these values when signal_pending()
is true.
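
To illustrate that repair from the application's side: on a kernel that
behaves as intended, a select() interrupted by a handled signal is expected
to fail with EINTR, never 514. A minimal sketch, assuming an empty SIGALRM
handler and a 100 ms one-shot timer; the handler, the timer, and the function
name errno_after_interrupted_select() are illustrative choices, not anything
taken from the reported application:

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>

/* Empty handler: its only job is to make SIGALRM a "handled" signal. */
static void on_alarm(int signo)
{
    (void)signo;
}

/*
 * Arm a one-shot 100 ms timer, then block in select() for up to 2 s.
 * Returns the errno observed after select() fails, or 0 if select()
 * ran to its timeout without being interrupted.
 */
int errno_after_interrupted_select(void)
{
    struct sigaction sa;
    struct itimerval it;
    struct timeval t = { 2, 0 };
    int rc;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, NULL) != 0)
        return errno;

    memset(&it, 0, sizeof it);
    it.it_value.tv_usec = 100000;   /* fire once after 100 ms */
    if (setitimer(ITIMER_REAL, &it, NULL) != 0)
        return errno;

    /* Same shape as a select()-based sleep: no descriptors, only a timeout. */
    rc = select(0, NULL, NULL, NULL, &t);
    return rc == -1 ? errno : 0;
}
```

On an unaffected system this should return EINTR; seeing 514 here would
reproduce the leak being discussed.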

2007-01-11 01:05:16

by Sean Reifschneider

Subject: Re: select() setting ERESTARTNOHAND (514).

On Wed, Jan 10, 2007 at 04:27:47PM -0800, David Miller wrote:
>It gets caught by the return into userspace code.

Ok, so somehow it is leaking. I have a system in the lab that is the same
hardware as production, but it currently has no, you know, hard drives in
it, so some assembly is required. I'll see if I can reproduce it in a test
environment and then see if I can get more information on when/where it is
leaking.

>Note that select() only returns these values when signal_pending()
>is true.

Yes, I saw that. I didn't fully understand it, but I saw it.

Thanks,
Sean
--
CChheecckk yyoouurr dduupplleexx sswwiittcchh..
Sean Reifschneider, Member of Technical Staff <[email protected]>
tummy.com, ltd. - Linux Consulting since 1995: Ask me about High Availability
Back off man. I'm a scientist. http://HackingSociety.org/

2007-01-11 01:15:22

by David Miller

Subject: Re: select() setting ERESTARTNOHAND (514).

From: Sean Reifschneider <[email protected]>
Date: Wed, 10 Jan 2007 18:04:29 -0700

> On Wed, Jan 10, 2007 at 04:27:47PM -0800, David Miller wrote:
> >It gets caught by the return into userspace code.
>
> Ok, so somehow it is leaking. I have a system in the lab that is the same
> hardware as production, but it currently has no, you know, hard drives in
> it, so some assembly is required. I'll see if I can reproduce it in a test
> environment and then see if I can get more information on when/where it is
> leaking.

If you're only seeing it in strace, that's expected due to some
unfortunate things in the way that x86 and x86_64 handle signal
return events via ptrace().

On sparc and sparc64 I fixed this long ago such that ptrace() will
update the user registers before ptrace parents are notified, and
therefore you'll never see those kernel internal error codes.

The upshot of this is that you really need to see what value is
making it to the application. What the kernel is probably
doing is looping trying to restart the system call and sending
the signal. If it's doing that the application is being rewound
to call the system call again once the signal handler returns
(if that is even being run at all).

2007-01-11 08:27:13

by Sean Reifschneider

Subject: Re: select() setting ERESTARTNOHAND (514).

On Wed, Jan 10, 2007 at 05:15:20PM -0800, David Miller wrote:
>If you're only seeing it in strace, that's expected due to some

Nope, I haven't looked in strace at all. It's definitely making it to
user-space. The code in question is (abbreviated):

    if (select(0, (fd_set *)0, (fd_set *)0, (fd_set *)0, &t) != 0) {
        PyErr_SetFromErrno(PyExc_IOError);
        return -1;
    }

which causes the Python interpreter to raise an IOError exception, including
the value of errno, which is 514.
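
If the goal is only to keep a select()-based sleep from failing when a signal
arrives, one option is to retry with the remaining time. The following is a
sketch under stated assumptions, not the code quoted above: sleep_retrying()
and KERNEL_ERESTARTNOHAND are hypothetical names, and treating 514 like EINTR
is a workaround for the leak described in this thread rather than documented
behaviour.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <sys/select.h>
#include <time.h>

/* Value quoted in this thread; user-space headers do not export it. */
#define KERNEL_ERESTARTNOHAND 514

/*
 * Sleep for `seconds` using select(), retrying with the remaining time
 * when the call is interrupted. Returns 0 once the deadline has passed,
 * or -1 with errno set on any other failure.
 */
int sleep_retrying(double seconds)
{
    struct timespec now, deadline;
    time_t whole;

    if (seconds < 0) {
        errno = EINVAL;
        return -1;
    }
    if (clock_gettime(CLOCK_MONOTONIC, &deadline) != 0)
        return -1;

    /* deadline = now + seconds, kept normalised (tv_nsec < 1e9). */
    whole = (time_t)seconds;
    deadline.tv_sec += whole;
    deadline.tv_nsec += (long)((seconds - (double)whole) * 1e9);
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    for (;;) {
        struct timeval t;
        long sec, nsec;

        if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
            return -1;
        sec = (long)(deadline.tv_sec - now.tv_sec);
        nsec = deadline.tv_nsec - now.tv_nsec;
        if (nsec < 0) {
            sec -= 1;
            nsec += 1000000000L;
        }
        if (sec < 0)
            return 0;               /* deadline already passed */

        t.tv_sec = sec;
        t.tv_usec = nsec / 1000;
        if (select(0, NULL, NULL, NULL, &t) == 0)
            return 0;               /* timeout expired normally */

        /* Retry only on interruption; report anything else. */
        if (errno != EINTR && errno != KERNEL_ERESTARTNOHAND)
            return -1;
    }
}
```

Recomputing the timeout from a monotonic deadline avoids relying on whether
select() updates the timeval on return, which differs between systems.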

Thanks,
Sean
--
This mountain is PURE SNOW! Do you know what the street value of this
mountain is!?! -- Better Off Dead
Sean Reifschneider, Member of Technical Staff <[email protected]>
tummy.com, ltd. - Linux Consulting since 1995: Ask me about High Availability

2007-01-11 22:22:49

by bert hubert

Subject: Re: select() setting ERESTARTNOHAND (514).

On Thu, Jan 11, 2007 at 01:25:16AM -0700, Sean Reifschneider wrote:

> Nope, I haven't looked in strace at all. It's definitely making it to
> user-space. The code in question is (abbreviated):
>
>     if (select(0, (fd_set *)0, (fd_set *)0, (fd_set *)0, &t) != 0) {
>         PyErr_SetFromErrno(PyExc_IOError);
>         return -1;
>     }

Anything else relevant? Do you know which signal interrupted select? Is this
a single or multithreaded application? And where did the signal come from?

I tried to reproduce your problem in various ways on 2.6.20-rc4, but it
didn't appear.

Thanks.

--
http://www.PowerDNS.com Open source, database driven DNS Software
http://netherlabs.nl Open and Closed source services

2007-01-24 00:56:19

by Sean Reifschneider

Subject: Re: select() setting ERESTARTNOHAND (514).

On Thu, Jan 11, 2007 at 11:22:46PM +0100, bert hubert wrote:
>Anything else relevant? Do you know which signal interrupted select? Is this
>a single or multithreaded application? And where did the signal come from?

It is, AFAIK, a multi-threaded application. I don't have any information
on which signal interrupted the process. I'll ask the person who reported
it to me, Doug, to respond with additional information.

>I tried to reproduce your problem in various ways on 2.6.20-rc4, but it
>didn't appear.

Thanks.

Sean
--
[...] Premature optimization is the root of all evil.
-- Donald Knuth
Sean Reifschneider, Member of Technical Staff <[email protected]>
tummy.com, ltd. - Linux Consulting since 1995: Ask me about High Availability