2023-03-14 02:29:32

by richard clark

Subject: Question about select and poll system call

Hi, (Sorry, I could not find the maintainers for this subsystem, so I
am sending this to the LKML)

I have two questions about these system calls:
1. According to https://pubs.opengroup.org/onlinepubs/7908799/xsh/select.html:
    ERRORS
    [EINVAL]
    The nfds argument is less than 0 or greater than FD_SETSIZE.
But the current implementation in Linux does something like:
    if (nfds > FD_SETSIZE)
        nfds = FD_SETSIZE;
What's the rationale behind this?

2. Can we unify the two different system calls, for example by using
poll(...) to implement the select(...) front end? Is there something
I'm missing about the current implementation? What are the pros and
cons?

Thanks,


2023-03-14 02:32:15

by richard clark

Subject: Re: Question about select and poll system call

Adding [email protected] ... for more possible feedback:)


2023-03-15 03:32:17

by richard clark

Subject: Re: Question about select and poll system call

Adding more people...

I did some homework and found that the FD_SETSIZE question seems to be
related to the following two commits:
1. 4e6fd33b7560 ("enforce RLIMIT_NOFILE in poll()")
"POSIX states that poll() shall fail with EINVAL if nfds > OPEN_MAX.
In this context, POSIX is referring to sysconf(OPEN_MAX), which is the
value of current->signal->rlim[RLIMIT_NOFILE].rlim_cur in the linux
kernel...". IOW, the nfds limit suggested by POSIX is effectively
configurable, so it makes sense for the Linux kernel to tie it to the
rlimit.
2. bbea9f69668a ("fdtable: Make fdarray and fdsets equal in size")
This commit uses fdt->max_fds instead of the FD_SETSIZE suggested by
POSIX, but gives no reason for doing so.

Out of curiosity I ran a test on Linux and macOS with this snippet:

#include <stdio.h>
#include <sys/select.h>

static int test(void)
{
    int err = 0;
    int nfds = FD_SETSIZE;
    fd_set rfds, wfds, efds;

    FD_ZERO(&rfds);
    FD_ZERO(&wfds);
    FD_ZERO(&efds);

    /* nfds is deliberately one past FD_SETSIZE; NULL means no timeout */
    err = select(nfds + 1, &rfds, &wfds, &efds, NULL);
    if (err < 0)
        perror("select failed");
    return err;
}

The test results:
Linux
~~~~~
Blocked in select()

macOS
~~~~~~
select failed: Invalid argument
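
Because the snippet passes a NULL timeout, the Linux run can only show that the call was not rejected; it then waits forever on empty sets. A zero timeout makes the outcome observable without blocking. The following is a minimal sketch under that assumption. The function name probe_oversized_nfds() is made up for illustration, and the buffers are doubled on purpose so that a kernel honouring the oversized nfds still stays inside them.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical probe, not from the thread: returns 0 if select() accepts
 * nfds = FD_SETSIZE + 1, or the errno value if it rejects the call. */
static int probe_oversized_nfds(void)
{
    /* Two fd_set objects back to back per set, so a kernel that honours
     * the oversized nfds still reads and writes inside our buffers. */
    fd_set rfds[2], wfds[2], efds[2];
    struct timeval tv = { 0, 0 };   /* zero timeout: check once, never block */
    int rc;

    memset(rfds, 0, sizeof(rfds));
    memset(wfds, 0, sizeof(wfds));
    memset(efds, 0, sizeof(efds));

    rc = select(FD_SETSIZE + 1, &rfds[0], &wfds[0], &efds[0], &tv);
    assert(rc <= 0);                /* no descriptor was requested */
    return rc < 0 ? errno : 0;
}
```

Going by the results reported above, one would expect 0 on Linux and EINVAL on macOS, but that is an expectation drawn from this thread rather than a guarantee.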

Thanks!


2023-03-15 09:00:19

by David Laight

Subject: RE: Question about select and poll system call

> 2. Can we unify the two different system calls, for example by using
> poll(...) to implement the select(...) front end? Is there something
> I'm missing about the current implementation? What are the pros and
> cons?

The underlying code that implements them is common.

Beware that the glibc select() wrappers have their own limit
on the highest fd.
Exceeding that limit (probably 1024) will cause buffer overruns
in the application (one of the Android apps I use crashes that way).

select() also doesn't scale well for sparse lists of fds.
So it really is best to use poll() and never select().
(Although for very large fd lists epoll() may be a better choice.)
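
As a rough illustration of the poll() recommendation: poll() carries each descriptor as a plain int in a struct pollfd, so the descriptor's numeric value is not bounded by the size of a bitmap. The sketch below is only an example; wait_readable() and demo_pipe() are hypothetical names, not part of any library.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait until fd is readable or timeout_ms elapses.
 * Returns 1 if readable, 0 on timeout, -1 on error or hang-up. */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    assert(fd >= 0);    /* poll() silently ignores negative descriptors */
    rc = poll(&pfd, 1, timeout_ms);
    if (rc <= 0)
        return rc;
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/* Self-check on a pipe: an empty pipe times out, one byte makes it
 * readable. Returns 1 if both observations hold, 0 or -1 otherwise. */
static int demo_pipe(void)
{
    int p[2];
    int before, after;
    ssize_t n;

    if (pipe(p) != 0)
        return -1;
    before = wait_readable(p[0], 0);
    n = write(p[1], "x", 1);
    after = (n == 1) ? wait_readable(p[0], 0) : -1;
    close(p[0]);
    close(p[1]);
    return before == 0 && after == 1;
}
```

Nothing in wait_readable() depends on FD_SETSIZE, which is the property that makes poll() safe for high-numbered descriptors.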

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2023-03-16 00:56:59

by richard clark

Subject: Re: Question about select and poll system call

On Wed, Mar 15, 2023 at 4:59 PM David Laight <[email protected]> wrote:
>
> > 2. Can we unify the two different system calls, for example by using
> > poll(...) to implement the select(...) front end? Is there something
> > I'm missing about the current implementation? What are the pros and
> > cons?
>
> The underlying code that implements them is common.
>
> Beware that the glibc select() wrappers have their own limit
> on the highest fd.
> Exceeding that limit (probably 1024) will cause buffer overruns
> in the application (one of the Android apps I use crashes that way).

Ah, interesting. Judging from the test snippet in my last email,
glibc does not seem to enforce that limit...

>
> select() also doesn't scale well for sparse lists of fds.
> So it really is best to use poll() and never select().
> (Although for very large fd lists epoll() may be a better choice.)
>
> David

2023-03-16 09:34:38

by David Laight

Subject: RE: Question about select and poll system call

From: richard clark
> Sent: 16 March 2023 00:57
>
> On Wed, Mar 15, 2023 at 4:59 PM David Laight <[email protected]> wrote:
> >
> > > 2. Can we unify the two different system calls, for example by using
> > > poll(...) to implement the select(...) front end? Is there something
> > > I'm missing about the current implementation? What are the pros and
> > > cons?
> >
> > The underlying code that implements them is common.
> >
> > Beware that the glibc select() wrappers have their own limit
> > on the highest fd.
> > Exceeding that limit (probably 1024) will cause buffer overruns
> > in the application (one of the Android apps I use crashes that way).
>
> Ah, interesting. Judging from the test snippet in my last email,
> glibc does not seem to enforce that limit...

Look at the FD_SET() macros....
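
To make the hint concrete: FD_SET() is typically a plain bit-set into a fixed-size array of words, with no range check on fd. The sketch below does not call the real macro out of range, since that would be undefined behaviour. It only reproduces the index arithmetic, assuming an implementation that stores the set as an array of unsigned long. All sketch_* names are made up for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helpers: they mirror the word/bit arithmetic a typical
 * FD_SET() performs, assuming fd_set is an array of unsigned long. */
#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_WORDS         (FD_SETSIZE / SKETCH_BITS_PER_WORD)

/* Index of the word that FD_SET(fd, ...) would touch. */
static size_t sketch_word_index(int fd)
{
    assert(fd >= 0);
    return (size_t)fd / SKETCH_BITS_PER_WORD;
}

/* Nonzero if the bit for fd lies inside the fixed-size set. */
static int sketch_fd_fits(int fd)
{
    return fd >= 0 && sketch_word_index(fd) < SKETCH_WORDS;
}
```

Under these assumptions, fd == FD_SETSIZE maps to the word just past the end of the set, so an unchecked FD_SET() on such a descriptor writes into adjacent memory instead of reporting an error.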

David


2023-03-16 18:15:56

by Linus Torvalds

Subject: Re: Question about select and poll system call

On Mon, Mar 13, 2023 at 7:28 PM richard clark
<[email protected]> wrote:
>
> I have two questions about these system calls:
> 1. According to https://pubs.opengroup.org/onlinepubs/7908799/xsh/select.html:
>     ERRORS
>     [EINVAL]
>     The nfds argument is less than 0 or greater than FD_SETSIZE.
> But the current implementation in Linux does something like:
>     if (nfds > FD_SETSIZE)
>         nfds = FD_SETSIZE;
> What's the rationale behind this?

Basically, the value of FD_SETSIZE has changed, and different pieces
of the system have used different values over the years.

The exact value of FD_SETSIZE ends up actually depending on the
compile-time size of the "fd_set" variable, and both the kernel and
glibc (and presumably other C library implementations) have changed
over time.

Just to give you a flavor of that history, 'select()' was implemented
back in early '92 in linux-0.12 (one of the greatest Linux releases of
all time - 0.12 was when Linux actually became *useful* to some
people).

And back then, we had this:

typedef unsigned long fd_set;

which may seem a bit limiting today ("Only 32 bits??!?"), but to put
that in perspective, back then we also had this:

#define NR_OPEN 20

and Linux-0.12 also did the *radical* change of changing NR_INODE from
32 to 64. Whee..

It was a very different time, in other words.

Now, imagine what happens when you increase those kinds of limits (as
we obviously did), and you do the library and kernel maintenance
separately. Some people might use a newer library with an older
kernel, and vice versa.

Doing that

    if (nfds > FD_SETSIZE)
        nfds = FD_SETSIZE;

basically allows you to at least limp along in that situation, where
maybe the library uses a 'fd_set' with thousands of bits, but the
kernel has a smaller limit.

Because you *will* find user programs that basically do

select(FD_SETSIZE, ...)

even if they don't actually use all those bits. Returning an error
because the C library had a different idea of how big the fdset was
compared to the kernel would be bad.
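
The difference between the two behaviours can be put side by side in a small model. This is a sketch only, not the kernel's code: nfds_clamp(), nfds_strict() and the kernel_limit parameter are hypothetical names, with kernel_limit standing in for whatever bound the kernel applies.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of the two policies discussed above. Each function
 * returns the nfds the kernel would go on to use, or -EINVAL. */

/* Truncating policy: an oversized nfds is silently cut down. */
static int nfds_clamp(int nfds, int kernel_limit)
{
    assert(kernel_limit >= 0);
    if (nfds < 0)
        return -EINVAL;
    return nfds > kernel_limit ? kernel_limit : nfds;
}

/* Strict policy, as the POSIX text reads: an oversized nfds is an error. */
static int nfds_strict(int nfds, int kernel_limit)
{
    assert(kernel_limit >= 0);
    if (nfds < 0 || nfds > kernel_limit)
        return -EINVAL;
    return nfds;
}
```

With a limit of 1024, a program built against a library whose fd_set holds 4096 bits and calling select(FD_SETSIZE, ...) keeps working under nfds_clamp() but fails outright under nfds_strict(), even if it only ever sets low-numbered bits.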

Now, the above is the *historical* reason for this all. The kernel
hasn't actually changed FD_SETSIZE in decades. We could say "by now,
if you use FD_SETSIZE larger than 1024, we'll return an error instead
of just truncating it".

But at the same time, while time has passed and we could do those
kinds of decisions, by now the POSIX spec is almost immaterial, and
compatibility with older versions of Linux is more important than
POSIX paper compatibility.

So there just isn't any reason to change any more.

> 2. Can we unify the two different system calls, for example by using
> poll(...) to implement the select(...) front end? Is there something
> I'm missing about the current implementation?

No. select() and poll() are completely different animals. Trying to
unify them means having to convert from an array of fd descriptors to
several arrays of bits. They are just very different interfaces.

Inside the kernel, the low-level implementation as far as individual
file descriptors is concerned is all unified already. Once you just
deal with one single file descriptor, we internally use a "->poll()"
thing. But to *get* to that individual file descriptor, select() and
poll() walk very different data structures.
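
To give a feel for the conversion being described, here is a userspace sketch that turns select()-style bitmaps into a poll()-style array. It is not kernel code; fdsets_to_pollfds() is a hypothetical helper, and mapping the exception set to POLLPRI is a rough approximation chosen for the example.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper: translate up to three bitmaps into one array of
 * struct pollfd. out must have room for nfds entries. Any of rd, wr, ex
 * may be NULL. Returns the number of entries written. */
static size_t fdsets_to_pollfds(int nfds, fd_set *rd, fd_set *wr,
                                fd_set *ex, struct pollfd *out)
{
    size_t count = 0;
    int fd;

    assert(nfds >= 0 && nfds <= FD_SETSIZE);
    for (fd = 0; fd < nfds; fd++) {     /* every bit position is visited */
        short events = 0;

        if (rd && FD_ISSET(fd, rd))
            events |= POLLIN;
        if (wr && FD_ISSET(fd, wr))
            events |= POLLOUT;
        if (ex && FD_ISSET(fd, ex))
            events |= POLLPRI;          /* approximation, see above */
        if (events) {
            out[count].fd = fd;
            out[count].events = events;
            out[count].revents = 0;
            count++;
        }
    }
    return count;
}
```

The loop walks all nfds bit positions even when only a few are set, and the results would have to be scattered back into three bitmaps afterwards. That round trip is the cost a unified front end would pay on every call.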

Linus

2023-03-17 08:29:36

by richard clark

Subject: Re: Question about select and poll system call

I have to confess that, after thinking about this for a long time, I
had reached *almost* the same conclusion before seeing this, so it's
one of the best decisions we can make together. Thank you for the very
nice and patient explanation, and have a happy weekend :). Anyone
watching this thread, please feel free to raise a different opinion...

Anyway, some comments inline...

On Fri, Mar 17, 2023 at 2:15 AM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, Mar 13, 2023 at 7:28 PM richard clark
> <[email protected]> wrote:
> >
> > I have two questions about these system calls:
> > 1. According to https://pubs.opengroup.org/onlinepubs/7908799/xsh/select.html:
> >     ERRORS
> >     [EINVAL]
> >     The nfds argument is less than 0 or greater than FD_SETSIZE.
> > But the current implementation in Linux does something like:
> >     if (nfds > FD_SETSIZE)
> >         nfds = FD_SETSIZE;
> > What's the rationale behind this?
>
> Basically, the value of FD_SETSIZE has changed, and different pieces
> of the system have used different values over the years.
>
> The exact value of FD_SETSIZE ends up actually depending on the
> compile-time size of the "fd_set" variable, and both the kernel and
> glibc (and presumably other C library implementations) have changed
> over time.
>
> Just to give you a flavor of that history, 'select()' was implemented
> back in early '92 in linux-0.12 (one of the greatest Linux releases of
> all time - 0.12 was when Linux actually became *useful* to some
> people).
>
> And back then, we had this:
>
> typedef unsigned long fd_set;
>
> which may seem a bit limiting today ("Only 32 bits??!?"), but to put
> that in perspective, back then we also had this:
>
> #define NR_OPEN 20
>
> and Linux-0.12 also did the *radical* change of changing NR_INODE from
> 32 to 64. Whee..
>
> It was a very different time, in other words.
>
> Now, imagine what happens when you increase those kinds of limits (as
> we obviously did), and you do the library and kernel maintenance
> separately. Some people might use a newer library with an older
> kernel, and vice versa.
>
> Doing that
>
>     if (nfds > FD_SETSIZE)
>         nfds = FD_SETSIZE;
>
> basically allows you to at least limp along in that situation, where
> maybe the library uses a 'fd_set' with thousands of bits, but the
> kernel has a smaller limit.
>
> Because you *will* find user programs that basically do
>
> select(FD_SETSIZE, ...)
>
> even if they don't actually use all those bits. Returning an error
> because the C library had a different idea of how big the fdset was
> compared to the kernel would be bad.
>
> Now, the above is the *historical* reason for this all. The kernel
> hasn't actually changed FD_SETSIZE in decades. We could say "by now,
> if you use FD_SETSIZE larger than 1024, we'll return an error instead
> of just truncating it".
>
> But at the same time, while time has passed and we could do those
> kinds of decisions, by now the POSIX spec is almost immaterial, and
> compatibility with older versions of Linux is more important than
> POSIX paper compatibility.
>
> So there just isn't any reason to change any more.
>
> > 2. Can we unify the two different system calls, for example by using
> > poll(...) to implement the select(...) front end? Is there something
> > I'm missing about the current implementation?
>
> No. select() and poll() are completely different animals. Trying to
> unify them means having to convert from an array of fd descriptors to
> several arrays of bits. They are just very different interfaces.

Technically, this kind of conversion is not as radical as it might
seem (I even think the performance cost can be ignored), and the
benefit is that the maintainer needs to care about only one piece of
code. The unified fd->poll(...) implementation can be seen as evidence
of this: essentially the core is the same, just with a different skin.
At the least, the difference in interfaces seems a weak justification
for the current implementation.

>
> Inside the kernel, the low-level implementation as far as individual
> file descriptors is concerned is all unified already. Once you just
> deal with one single file descriptor, we internally use a "->poll()"
> thing. But to *get* to that individual file descriptor, select() and
> poll() walk very different data structures.
>
> Linus

2023-03-18 12:41:10

by David Laight

Subject: RE: Question about select and poll system call

> On Fri, Mar 17, 2023 at 2:15 AM Linus Torvalds
> <[email protected]> wrote:
> >
...
> > And back then, we had this:
> >
> > typedef unsigned long fd_set;
> >
> > which may seem a bit limiting today ("Only 32 bits??!?"), but to put
> > that in perspective, back then we also had this:
> >
> > #define NR_OPEN 20

That is the historic limit for SYSV (and probably BSD).
I suspect you just copied it.
Quite why it was 20 and not 16 or 32 I don't know.
But 20 open files was assumed to be 'plenty'!

The first SYSV kernel that supported fd >= 20 actually used
a linked list to hold the internal data (mostly a pointer).

So accessing a big fd number was O(fd).
Calling poll() was O(numfd**2), and getting that many open sockets
in a network server process was O(n**3).

Don't even think about how slow a process trying to
use 2000 sockets was!

The 20 fd limit also made them a limited resource.
Daemons couldn't really afford to dup() /dev/null onto 0, 1 and 2;
instead they'd just close the fds.
An accidental printf() that should have been sprintf() then
slowly fills the stdout buffer. When that eventually fills,
the write to fd 1 (which by then may refer to some other file) has
side effects that are rather difficult to debug.
(Someone should have noticed that the tracing was incorrect)

David
