I'm pleased to announce the availability of version 6 of the syslet subsystem.
Ingo and I agreed that I'll handle syslet releases while he's busy with CFS. I
copied the cc: list from Ingo's v5 announcement. If you'd like to be dropped
(or added), please let me know.
The v6 patch series against 2.6.21 can be downloaded from:
http://oss.oracle.com/~zab/syslets/v6/
Example applications and previous syslet releases can be found at:
http://people.redhat.com/~mingo/syslet-patches/
The syslet subsystem aims to provide user-space with an efficient interface for
managing the asynchronous submission and completion of existing system calls.
The only changes since v5 are a few small ones made to support the
experimental aio patch described below.
My syslet subsystem todo list is as follows, in no particular order:
- replace WARN_ON() calls with error handling or avoidance
- split the x86_64-async.patch into more specific patches
- investigate integration with ptrace
- investigate rare ./syslet-test cpu spinning
- provide distro kernel rpms and documentation for developers
- compat design problems, still? http://lkml.org/lkml/2007/3/7/523
Included in this patch series is an experimental patch which reworks fs/aio.c
to reuse the syslet subsystem to process iocb requests from user space. The
intent of this work is to simplify the code and broaden aio functionality.
Many issues need to be addressed before this aio work could be merged:
- support cancellation by sending signals to async_threads
- figure out what to do about signals from handlers, like SIGXFSZ
- verify that heavy loads do not consume excessive cpu or memory
- concurrent dio writes
- cfq gets confused, share io_context amongst threads?
- restrict allowed operations like .aio_{r,w} methods used to
More details on this work in progress can be found in the patch.
Any and all feedback is welcome and encouraged!
- z
On Tue, 29 May 2007, Zach Brown wrote:
>
> Included in this patch series is an experimental patch which reworks fs/aio.c
> to reuse the syslet subsystem to process iocb requests from user space. The
> intent of this work is to simplify the code and broaden aio functionality.
.. so don't keep us in suspense. Do you have any numbers for anything
(like Oracle, to pick a random thing out of thin air ;) that might
actually indicate whether this actually works or not?
Or is it just so experimental that no real program that uses aio can
actually work yet?
Linus
Zach Brown wrote:
> I'm pleased to announce the availability of version 6 of the syslet subsystem.
> [...]
> Included in this patch series is an experimental patch which reworks fs/aio.c
> to reuse the syslet subsystem to process iocb requests from user space. The
> intent of this work is to simplify the code and broaden aio functionality.
> [...]
> Any and all feedback is welcome and encouraged!
You should pick up the kevent work :)
Having async request and response rings would be quite useful, and most
closely match what is going on under the hood in the kernel and hardware.
Jeff
> .. so don't keep us in suspense. Do you have any numbers for anything
> (like Oracle, to pick a random thing out of thin air ;) that might
> actually indicate whether this actually works or not?
I haven't gotten to running Oracle's database against it. It is going
to be Very Cranky if O_DIRECT writes aren't concurrent, and that's going
to take a bit of work in fs/direct-io.c.
I've done initial micro-benchmarking runs for basic sanity testing with
fio. They haven't wildly regressed; that's about as much as can be said
with confidence so far :).
Take a streaming O_DIRECT read. 1meg requests, 64 in flight.
str: (g=0): rw=read, bs=1M-1M/1M-1M, ioengine=libaio, iodepth=64
mainline:
read : io=3,405MiB, bw=97,996KiB/s, iops=93, runt= 36434msec
aio+syslets:
read : io=3,452MiB, bw=99,115KiB/s, iops=94, runt= 36520msec
That's on an old gigabit copper FC array with 10 drives behind a, no
seriously, qla2100.
The real test is the change in memory and cpu consumption, and I haven't
modified fio to take reasonably precise measurements of those yet. Once
I get O_DIRECT writes concurrent that'll be the next step.
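For reference, a fio job description along these lines would reproduce the run above (reconstructed from fio's output header; the target device is a placeholder, not the array used here):

```ini
; streaming O_DIRECT read: 1meg requests, 64 in flight
[str]
rw=read
bs=1M
ioengine=libaio
iodepth=64
direct=1
filename=/dev/sdX   ; placeholder device
```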
I was pleased to see my motivation for the patches, to avoid having to
add specific support for operations to be called from fs/aio.c, work
out.
Take the case of 4k random buffered reads from a block device with a
cold cache:
read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64
mainline:
read : io=16,116KiB, bw=457KiB/s, iops=111, runt= 36047msec
slat (msec): min= 4, max= 629, avg=563.17, stdev=71.92
clat (msec): min= 0, max= 0, avg= 0.00, stdev= 0.00
aio+syslets:
read : io=125MiB, bw=3,634KiB/s, iops=887, runt= 36147msec
slat (msec): min= 0, max= 3, avg= 0.00, stdev= 0.08
clat (msec): min= 2, max= 643, avg=71.59, stdev=74.25
aio+syslets w/o cfq:
read : io=208MiB, bw=6,057KiB/s, iops=1,478, runt= 36071msec
slat (msec): min= 0, max= 15, avg= 0.00, stdev= 0.09
clat (msec): min= 2, max= 758, avg=42.75, stdev=37.33
Everyone step back and thank Jens for writing a tool that gives us
interesting data without us always having to craft some stupid specific
test each and every time. Thanks, Jens!
In the mainline numbers fio clearly shows the buffered read submissions
being handled synchronously. The mainline buffered IO paths don't
know to identify and work with iocbs, so requests are handled in series.
In the +syslet numbers we see the __async_schedule() catching
the blocking buffered read, letting the submission proceed
asynchronously. We get async behaviour without having to touch any of
the buffered IO paths.
Then we turn off cfq and we actually start to saturate the (relatively
ancient) drives :).
I need to mail Jens about that cfq behaviour, but I'm guessing it's
expected behaviour of a sort -- each syslet thread gets its own
io_context instead of inheriting it from its parent.
- z
> You should pick up the kevent work :)
I haven't looked at it in a while but yes, it's "on the radar" :).
> Having async request and response rings would be quite useful, and most
> closely match what is going on under the hood in the kernel and hardware.
Yeah, but I have lots of competing thoughts about this.
For the time being I'm focusing on simplifying the mechanisms that
support the sys_io_*() interface so I never ever have to debug fs/aio.c
(also known as chewing glass to those of us with the scars) again.
That said, I'll gladly work closely with developers who are seriously
considering putting some next gen interface to the test. That todo item
about producing documentation and distro kernels is specifically to bait
Uli into trying to implement posix aio on top of syslets in glibc.
'cause we can go back and forth about potential interfaces for, well,
how long has it been? years? I want non-trivial users who we can
measure so we can *stop* designing and implementing the moment something
is good enough for them.
- z
Zach Brown wrote:
> That todo item
> about producing documentation and distro kernels is specifically to bait
> Uli into trying to implement posix aio on top of syslets in glibc.
Get DaveJ to pick up the code for Fedora kernels and I'll get to it.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
On Tue, May 29, 2007 at 04:20:04PM -0700, Ulrich Drepper wrote:
>
> Zach Brown wrote:
> > That todo item
> > about producing documentation and distro kernels is specifically to bait
> > Uli into trying to implement posix aio on top of syslets in glibc.
>
> Get DaveJ to pick up the code for Fedora kernels and I'll get to it.
With F7 out the door, I'm looking at getting devel/ back in shape again,
so I can get something done there soon-ish. With the usual caveat that if
this isn't upstream by the time we do a release, we'll have to drop it
due to the added syscall. (Maybe we can just get that reserved upstream now?)
Dave
--
http://www.codemonkey.org.uk
* Jeff Garzik <[email protected]> wrote:
> You should pick up the kevent work :)
3 months ago i verified the published kevent vs. epoll benchmark and
found that benchmark to be fatally flawed. When i redid it properly
kevent showed no significant advantage over epoll. Note that i did those
measurements _before_ the recent round of epoll speedups. So unless
someone does believable benchmarks i consider kevent an over-hyped,
mis-benchmarked complication to do something that epoll is perfectly
capable of doing.
Ingo
* Zach Brown <[email protected]> wrote:
> > Having async request and response rings would be quite useful, and
> > most closely match what is going on under the hood in the kernel and
> > hardware.
>
> Yeah, but I have lots of competing thoughts about this.
note that async request and response rings are implemented already in
essence: that's how FIO uses syslets. The linked list of syslet atoms is
the 'request ring' (it's just that 'ring' is not a hard-enforced data
structure - you can use other request formats too), and the completion
ring is the 'response ring'.
Ingo
Ingo Molnar wrote:
> 3 months ago i verified the published kevent vs. epoll benchmark and
> found that benchmark to be fatally flawed. When i redid it properly
> kevent showed no significant advantage over epoll.
I'm not going to judge your tests but saying there are no significant
advantages is too one-sided. There is one huge advantage: the
interface. A memory-based interface is simply the best form. File
descriptors are a resource the runtime cannot transparently consume.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
On Tue, May 29 2007, Zach Brown wrote:
Thanks for picking this up, Zach!
> - cfq gets confused, share io_context amongst threads?
Yeah, it'll confuse CFQ a lot actually. The threads either need to share
an io context (clean approach, however will introduce locking for things
that were previously lockless), or CFQ needs to get better support for
cooperating processes. The problem is that CFQ will wait for a dependent
IO for a given process, which may arrive from a totally unrelated
process.
For the fio testing, we can make some improvements there. Right now you
don't get any concurrency of the io requests if you set e.g. iodepth=32,
as the 32 requests will be submitted as a linked chain of atoms. For io
saturation, that's not really what you want.
I'll take a stab at improving both of the above.
--
Jens Axboe
Hi Ingo, developers.
On Wed, May 30, 2007 at 09:20:55AM +0200, Ingo Molnar ([email protected]) wrote:
>
> * Jeff Garzik <[email protected]> wrote:
>
> > You should pick up the kevent work :)
>
> 3 months ago i verified the published kevent vs. epoll benchmark and
> found that benchmark to be fatally flawed. When i redid it properly
> kevent showed no significant advantage over epoll. Note that i did those
> measurements _before_ the recent round of epoll speedups. So unless
> someone does believable benchmarks i consider kevent an over-hyped,
> mis-benchmarked complication to do something that epoll is perfectly
> capable of doing.
I did not want to start another round of ping-pong insults :), but,
Ingo, you did not show that kevent works worse. I did show that
sometimes it works better. It ranged from a 0 to 30% win in those
tests; in the results Johann Bork presented, kevent and epoll behaved
the same. In the results I posted earlier, I said that sometimes epoll
behaved better, sometimes kevent. What does that say? Just that under
that given workload the result was the one we saw. Nothing more,
nothing less.
It does not show that something is broken, and definitely not that it is:
citation 1:
we're heading to yet-another monolithic interface, we're heading with no
valid reasons given other than some handwaving.
citation 2:
consider kevent an over-hyped, mis-benchmarked complication to do
something that epoll is perfectly capable of doing.
Taking into account the other features kevent has (and what it was
originally designed for: network AIO, which is quite hard, if possible
at all, with files and epoll; I'm not talking about syslets as AIO,
which is a different and likely simpler approach, and even that alone
is already very good), it is not what people said in the citations
above.
It looks like you took some personal offence over this, which I do not
understand. But it has nothing to do with the technical side of the
problem, so let's stop such rhetoric, concentrate on the real problem,
and forget any personal issues that might get raised sometimes :).
Although I closed the kevent and eventfs projects, I would gladly
continue if we can and want to make progress in that area.
Thanks.
> Ingo
--
Evgeniy Polyakov
* Ulrich Drepper <[email protected]> wrote:
> Ingo Molnar wrote:
> > 3 months ago i verified the published kevent vs. epoll benchmark and
> > found that benchmark to be fatally flawed. When i redid it properly
> > kevent showed no significant advantage over epoll.
>
> I'm not going to judge your tests but saying there are no significant
> advantages is too one-sided. There is one huge advantage: the
> interface. A memory-based interface is simply the best form. File
> descriptors are a resource the runtime cannot transparently consume.
yeah - this is a fundamental design question for Linus i guess :-) glibc
(and other infrastructure libraries) have a fundamental problem: they
cannot (and do not) presently use persistent file descriptors to make
use of kernel functionality, due to ABI side-effects. [applications can
dup into an fd used by glibc, applications can close it - shells close
fds blindly for example, etc.] Today glibc simply cannot open a file
descriptor and keep it open while application code is running due to
these problems.
we should perhaps enable glibc to have its separate fd namespace (or
'hidden' file descriptors at the upper end of the fd space) so that it
can transparently listen to netlink events (or do epoll), without
impacting the application fd namespace - instead of ducking to a memory
based API as a workaround.
it is a serious flexibility issue that should not be ignored. The
unified fd space is a blessing on one hand because it's simple and
powerful, but it's also a curse because nested use of the fd space for
libraries is currently not possible. But it should be detached from any
fundamental question of kevent vs. epoll. (By improving library use of
file descriptors we'll improve the utility of all syscalls - by ducking
to a memory based API we only solve that particular event based usage.)
Ingo
* Evgeniy Polyakov <[email protected]> wrote:
> I did not want to start another round of ping-pong insults :),
> but, Ingo, you did not show that kevent works worse. I did show that
> sometimes it works better. It ranged from a 0 to 30% win in those
> tests; in the results Johann Bork presented, kevent and epoll behaved
> the same. In the results I posted earlier, I said that sometimes
> epoll behaved better, sometimes kevent. [...]
let me refresh your recollection:
http://lkml.org/lkml/2007/2/25/116
where you said:
"But note, that on my athlon64 3500 test machine kevent is about 7900
requests per second compared to 4000+ epoll, so expect a challenge."
for a long time you made much fuss about how kevents is so much better
and how epoll cannot perform and scale as well (you said various
arguments why that is supposedly so), and some people bought into the
performance argument and advocated kevent due to its supposed
performance and scalability advantages - while now we are down to "epoll
and kevent are break-even"?
in my book that is way too much of a difference, it is (best-case) a way
too sloppy approach to something as fundamental as Linux's basic event
model and design, and it is also compounded by your continued "nothing
happened, really, lets move on" stance. Losing trust is easy, winning it
back is hard. Let me reuse a phrase of yours: "expect a challenge".
Ingo
On Wed, May 30, 2007 at 10:42:52AM +0200, Ingo Molnar ([email protected]) wrote:
> it is a serious flexibility issue that should not be ignored. The
> unified fd space is a blessing on one hand because it's simple and
> powerful, but it's also a curse because nested use of the fd space for
> libraries is currently not possible. But it should be detached from any
> fundamental question of kevent vs. epoll. (By improving library use of
> file descriptors we'll improve the utility of all syscalls - by ducking
> to a memory based API we only solve that particular event based usage.)
There is another issue with file descriptors - userspace must call into
the kernel each time it wants to get a new set of events, while with a
memory-based approach it has them without doing so. After it has
returned from the kernel and knows that there are some events, the
kernel can add more of them into the ring (if there is room) and
userspace will process them without additional syscalls.
Although syscall overhead is very small, it does exist and should not be
ignored in the design.
>
> Ingo
--
Evgeniy Polyakov
* Evgeniy Polyakov <[email protected]> wrote:
> On Wed, May 30, 2007 at 10:42:52AM +0200, Ingo Molnar ([email protected]) wrote:
> > it is a serious flexibility issue that should not be ignored. The
> > unified fd space is a blessing on one hand because it's simple and
> > powerful, but it's also a curse because nested use of the fd space for
> > libraries is currently not possible. But it should be detached from any
> > fundamental question of kevent vs. epoll. (By improving library use of
> > file descriptors we'll improve the utility of all syscalls - by ducking
> > to a memory based API we only solve that particular event based usage.)
>
> There is another issue with file descriptors - userspace must call
> into the kernel each time it wants to get a new set of events, while
> with a memory-based approach it has them without doing so. After it
> has returned from the kernel and knows that there are some events,
> the kernel can add more of them into the ring (if there is room) and
> userspace will process them without additional syscalls.
Firstly, this is not a fundamental property of epoll. If we wanted to,
it would be possible to extend epoll to fill in a ring of events from
the wakeup handler. It's an incremental add-on to epoll that should not
impact the design. How much info to put into a single event is another
incremental thing - for most of the high-performance cases all the
information we need is the type of the event and the fd it occurred on.
Currently epoll supports that minimal approach.
Secondly, our current syscall overhead is below 0.1 usecs on latest
hardware:
dione:~/l> ./lat_syscall null
Simple syscall: 0.0911 microseconds
so you need millions of events _per cpu_ for the syscall overhead to
show up.
Thirdly, our main problem was not the structure of epoll, our main
problem was that event APIs were not widely available, so applications
couldn't go to a pure event-based design - they always had to handle
certain types of event domains specially, due to lack of coverage. The
latest epoll patches largely address that. This was a huge barrier
against adoption of epoll.
Ingo
Ingo Molnar wrote:
> * Jeff Garzik <[email protected]> wrote:
>
>> You should pick up the kevent work :)
>
> 3 months ago i verified the published kevent vs. epoll benchmark and
> found that benchmark to be fatally flawed. When i redid it properly
> kevent showed no significant advantage over epoll. Note that i did those
> measurements _before_ the recent round of epoll speedups. So unless
> someone does believable benchmarks i consider kevent an over-hyped,
> mis-benchmarked complication to do something that epoll is perfectly
> capable of doing.
You snipped the key part of my response, so I'll say it again:
Event rings (a) most closely match what is going on in the hardware and
(b) often closely match what is going on in multi-socket, event-driven
software application.
To echo Uli and paraphrase an ad, "it's the interface, silly."
This is not something epoll is capable of doing, at the present time.
Jeff
On Wed, May 30, 2007 at 10:54:00AM +0200, Ingo Molnar ([email protected]) wrote:
>
> * Evgeniy Polyakov <[email protected]> wrote:
>
> > I did not want to start another round of ping-pong insults :),
> > but, Ingo, you did not show that kevent works worse. I did show that
> > sometimes it works better. It ranged from a 0 to 30% win in those
> > tests; in the results Johann Bork presented, kevent and epoll behaved
> > the same. In the results I posted earlier, I said that sometimes
> > epoll behaved better, sometimes kevent. [...]
>
> let me refresh your recollection:
>
> http://lkml.org/lkml/2007/2/25/116
>
> where you said:
>
> "But note, that on my athlon64 3500 test machine kevent is about 7900
> requests per second compared to 4000+ epoll, so expect a challenge."
You can also find in those threads that I managed to run the epoll
server on that machine at 9k requests per second, although that was
not reproducible later.
> for a long time you made much fuss about how kevents is so much better
> and how epoll cannot perform and scale as well (you said various
> arguments why that is supposedly so), and some people bought into the
> performance argument and advocated kevent due to its supposed
> performance and scalability advantages - while now we are down to "epoll
> and kevent are break-even"?
You just draw the picture you want to see.
Even on the kevent page I have links to other people's benchmarks,
which show how kevent behaves compared to epoll under their loads.
_My_ tests showed a kevent performance win; you tuned my (possibly
broken) epoll code and the results changed - this is the development
process, where things are not pulled out of thin air.
> in my book that is way too much of a difference, it is (best-case) a way
> too sloppy approach to something as fundamental as Linux's basic event
> model and design, and it is also compounded by your continued "nothing
> happened, really, lets move on" stance. Losing trust is easy, winning it
> back is hard. Let me reuse a phrase of yours: "expect a challenge".
Well, I do not care much about what people think I did wrong or right.
There are obviously bad and good ideas and implementations.
I might be absolutely wrong about something, but that is the process of
solving problems, which I really enjoy.
I just want there to be no personal insults; if I made such things,
shame on me :)
> Ingo
--
Evgeniy Polyakov
* Jeff Garzik <[email protected]> wrote:
> >>You should pick up the kevent work :)
> >
> >3 months ago i verified the published kevent vs. epoll benchmark and
> >found that benchmark to be fatally flawed. When i redid it properly
> >kevent showed no significant advantage over epoll. Note that i did
> >those measurements _before_ the recent round of epoll speedups. So
> >unless someone does believable benchmarks i consider kevent an
> >over-hyped, mis-benchmarked complication to do something that epoll
> >is perfectly capable of doing.
>
> You snipped the key part of my response, so I'll say it again:
>
> Event rings (a) most closely match what is going on in the hardware
> and (b) often closely match what is going on in multi-socket,
> event-driven software application.
event rings are just pure data structures that describe a set of data,
and they have advantages and disadvantages. For the record, we've
already got direct experience with rings as software APIs: they were
used for KAIO and they were an implementation and maintenance
nightmare and nobody used them. Kevent might be better, but you make it
sound as if it were a trivial design choice while it certainly isn't!
Sure, for hardware interfaces like networking cards tx and rx rings are
the best thing but that is apples to oranges: hardware itself is about
_limited_ physical resources, matching a _limited_ data structure like a
ring quite well. But for software APIs, the built-in limit of rings
makes them a baroque data structure with a fair share of disadvantages
in addition to their obvious advantages.
> This is not something epoll is capable of doing, at the present time.
epoll is very much capable of doing it - but why bother if something
more flexible than a ring can be used and the performance difference is
negligible? (Read my other reply in this thread for further points.)
but, for the record, syslets very much use a completion ring, so i'm not
fundamentally opposed to the idea. I just think it's seriously
over-hyped, just like most other bits of the kevent approach. (Nor do we
have to attach this to syslets and threadlets - kevents are an
orthogonal approach not directly related to asynchronous syscalls -
syslets/threadlets can make use of epoll just as much as they can make
use of kevent APIs.)
Ingo
* Ingo Molnar <[email protected]> wrote:
> epoll is very much is capable of doing it - but why bother if
> something more flexible than a ring can be used and the performance
> difference is negligible? (Read my other reply in this thread for
> further points.)
in particular i'd like to (re-)stress this point:
Thirdly, our main problem was not the structure of epoll, our main
problem was that event APIs were not widely available, so applications
couldn't go to a pure event-based design - they always had to handle
certain types of event domains specially, due to lack of coverage. The
latest epoll patches largely address that. This was a huge barrier
against adoption of epoll.
starting with putting limits into the design by going to over-smart data
structures like rings is just stupid. Let's fix, enhance and speed up
what we have now (epoll) so that it becomes ubiquitous, and _then_ we
can extend epoll to maybe fill events into rings. We should have our
priorities right and should stop rewriting the whole world, especially
when it comes to user APIs. Right now we have _no_ event API with
complete coverage, and that's far more of a problem than the actual
micro-structure of the API.
Ingo
On Wed, 30 May 2007, Ingo Molnar wrote:
>
> * Ulrich Drepper <[email protected]> wrote:
> >
> > I'm not going to judge your tests but saying there are no significant
> > advantages is too one-sided. There is one huge advantage: the
> > interface. A memory-based interface is simply the best form. File
> > descriptors are a resource the runtime cannot transparently consume.
>
> yeah - this is a fundamental design question for Linus i guess :-)
Well, quite frankly, to me, the most important part of syslets is that if
they are done right, they introduce _no_ new interfaces at all that people
actually use.
Over the years, we've done lots of nice "extended functionality" stuff.
Nobody ever uses them. The only thing that gets used is the standard stuff
that everybody else does too.
So when it comes to syslets, the most important interface will be the
existing aio_read() etc interfaces _without_ any in-memory stuff at all,
and everything done by the kernel to just make it look exactly like it
used to look. And the biggest advantage is that it simplifies the internal
kernel code, and makes us use the same code for aio and non-aio (and I
think we have a good possibility of improving performance too, if only
because we will get much more natural and fine-grained scheduling points!)
Any extended "direct syslets" use is technically _interesting_, but
ultimately almost totally pointless. Which was why I was pushing really
really hard for a simple interface and not being too clever or exposing
internal designs too much. An in-memory thing tends to be the absolute
_worst_ interface when it comes to abstraction layers and future changes.
> glibc (and other infrastructure libraries) have a fundamental problem:
> they cannot (and do not) presently use persistent file descriptors to
> make use of kernel functionality, due to ABI side-effects.
glibc has a more fundamental problem: the "fun" stuff is generally not
worth it.
For example, any AIO thing that requires glibc to be rewritten is almost
totally uninteresting. It should work with _existing_ binaries, and
_existing_ ABI's to be useful - since 99% of all AIO users are binary-
only and won't recompile for some experimental library.
The whole epoll/kevent flame-wars have ignored a huge issue: almost nobody
uses either. People still use poll and select, to such an _overwhelming_
degree that it almost doesn't even matter if you were to make the
alternatives orders of magnitude faster - it wouldn't even be visible.
> we should perhaps enable glibc to have its separate fd namespace (or
> 'hidden' file descriptors at the upper end of the fd space) so that it
> can transparently listen to netlink events (or do epoll), without
> impacting the application fd namespace - instead of ducking to a memory
> based API as a workaround.
Yeah, I don't think it would be at all wrong to have "private file
descriptors". I'd prefer that over memory-based (for all the abstraction
issues, and because a lot of things really *are* about file descriptors!).
Linus
On Wed, 30 May 2007, Jeff Garzik wrote:
>
> You snipped the key part of my response, so I'll say it again:
>
> Event rings (a) most closely match what is going on in the hardware and (b)
> often closely match what is going on in multi-socket, event-driven software
> application.
I have rather strong counter-arguments:
(a) yes, it's how hardware does it, but if you actually look at hardware,
you quickly realize that every single piece of hardware uses a
*different* ring interface.
This should really tell you something. In fact, it may not be rings
at all, but structures with more complex formats (eg the USB
descriptors).
(b) yes, event-driven software tends to use some data structures that are
sometimes approximated by event rings, but they all use *different*
software structures. There simply *is* no common "event" structure:
each program tends to have its own issues, its own allocation
policies, and its own "ring" structures.
They may not be rings at all. They can be priority queues/heaps or
other much more complex structures.
> To echo Uli and paraphrase an ad, "it's the interface, silly."
THERE IS NO INTERFACE! You're just making that up, and glossing over the
most important part of the whole thing!
If you could actually point to something specific that matches what
everybody needs, and is architecture-neutral, it would be a different
issue. As is, you're just saying "memory-mapped interfaces" without
actually going into enough detail to show HOW MUCH IT SUCKS.
There really are very few programs that would use them. We had a trivial
benchmark, the only function of which was to show usage, and here Ingo and
Evgeniy are (once more) talking about bugs in that one months later.
THAT should tell you something.
Make poll/select/aio/read etc faster. THAT is where the payoffs are.
In fact, if somebody wants to look at a standard interface that could be
speeded up, the prime thing to look at is "readdir()" (aka getdents).
Making _that_ thing go faster and scale better and do read-ahead is likely
to be a lot more important for performance. It was one of the bottle-necks
for samba several years ago, and nobody has really tried to improve it.
And yes, that's because it's hard - people would rather make up new
interfaces that are largely irrelevant even before they are born, than
actually try to improve important existing ones.
Linus
Ingo Molnar wrote:
> we should perhaps enable glibc to have its separate fd namespace (or
> 'hidden' file descriptors at the upper end of the fd space) so that it
> can transparently listen to netlink events (or do epoll),
Something like this would only work reliably if you have actual
protection coming with it. Also, there are still reasons why an
application might want to see, close, handle, whatever these descriptors
in a separate namespace.
I think such namespaces are a broken concept. How many do you want to
introduce? Plus, then you get away from the normal file descriptor
interfaces anyway. If you'd represent these alternative namespace
descriptors with ordinary ints you gain nothing. You'd have to use
tuples (namespace,descriptor) and then you need a whole set of new
interfaces or some sticky namespace selection which will only cause
problems (think signal delivery).
> without
> impacting the application fd namespace - instead of ducking to a memory
> based API as a workaround.
It's not "ducking". Memory mapping is one of the most natural
interfaces. Just because Unix/Linux is built around the concept of file
descriptors does not mean this is the ultimate in usability. File
descriptors are in fact clumsy: if you have a file descriptor to read
and write data, all auxiliary data for that communication must be
transferred out-of-band (e.g., fcntl) or in very magical and hard-to-use
ways (recvmsg, sendmsg). With a memory-based event mechanism this
auxiliary data can be stored in memory along with the event notification.
> it is a serious flexibility issue that should not be ignored. The
> unified fd space is a blessing on one hand because it's simple and
Too simple.
- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
* Linus Torvalds <[email protected]> wrote:
> > To echo Uli and paraphrase an ad, "it's the interface, silly."
>
> THERE IS NO INTERFACE! You're just making that up, and glossing over
> the most important part of the whole thing!
>
> If you could actually point to something specific that matches what
> everybody needs, and is architecture-neutral, it would be a different
> issue. As is, you're just saying "memory-mapped interfaces" without
> actually going into enough detail to show HOW MUCH IT SUCKS.
>
> There really are very few programs that would use them. [...]
looking over the list of our new generic APIs (see further below) i
think there are three important things that are needed for an API to
become widely used:
1) it should solve a real problem (ha ;-), it should be intuitive to
humans and it should fit into existing things naturally.
2) it should be ubiquitous. (if it's about IO it should cover block IO,
network IO, timers, signals and everything) Even if it might look
silly in some of the cases, having complete, utter, no compromises,
100% coverage for everything massively helps the uptake of an API,
because it allows the user-space coder to pick just one paradigm
that is closest to his application and stick to it and only to it.
3) it should be end-to-end supported by glibc.
our failed API attempts so far were:
- sendfile(). This API mainly failed on #2. It partly failed on #1 too.
(couldn't be used in certain types of scenarios so was unintuitive.)
splice() fixes this almost completely.
- KAIO. It fails on #2 and #3.
our more successful new APIs:
- futexes. After some hiccups they form the base of all modern
user-space locking.
- splice. (a bit too early to tell but it's looking good so far. Would
be nice if someone did a brute-force memcpy() based vmsplice to user
memory, just to make usage fully symmetric.)
partially successful, not yet failed new APIs:
- epoll. It currently fails at #2 (v2.6.22 mostly fills the gaps but
not completely). Despite the non-complete coverage of event domains a
good number of apps are using it, and in particular a couple really
'high end' apps with massive amounts of event sources - which apps
would have no chance with poll, select or threads.
- inotify. It's being used quite happily on the desktop, despite some
of its limitations. (Possibly integratable into epoll?)
Ingo
> Yeah, it'll confuse CFQ a lot actually. The threads either need to share
> an io context (clean approach, however will introduce locking for things
> that were previously lockless), or CFQ needs to get better support for
> cooperating processes.
Do let me know if I can be of any help in this.
> For the fio testing, we can make some improvements there. Right now you
> don't get any concurrency of the io requests if you set eg iodepth=32,
> as the 32 requests will be submitted as a linked chain of atoms. For io
> saturation, that's not really what you want.
Just to be clear: I'm currently focusing on supporting sys_io_*() so I'm
using fio's libaio engine. I'm not testing the syslet syscall interface
yet.
- z
> due to the added syscall. (Maybe we can just get that reserved
> upstream now?)
Maybe, but we'd have to agree on the bare syslet interface that is being
supported :).
Personally, I'd like that to be the simplest thing that works for people
and I'm not convinced that the current syslet-specific syscalls are that.
Certainly not the atom interface, anyway.
+asmlinkage __attribute__((weak)) long
+sys_umem_add(unsigned long __user *uptr, unsigned long inc)
+{
+ unsigned long val, new_val;
+
+ if (get_user(val, uptr))
+ return -EFAULT;
+ /*
+ * inc == 0 means 'read memory value':
+ */
+ if (!inc)
+ return val;
+
+ new_val = val + inc;
+ if (__put_user(new_val, uptr))
+ return -EFAULT;
+
+ return new_val;
+}
A syscall for *long addition* strikes me as a bit much, I have to admit.
Where do we stop? (Where's the compat wrapper? :))
Maybe this would be fine for some wildly aggressive optimization some
number of years in the future when we have millions of syslet interface
users complaining about the cycle overhead of their syslet engines, but
it seems like we can do something much less involved in the first pass
without harming the possibility of promising to support this complex
optimization in the future.
- z
On Wed, May 30 2007, Zach Brown wrote:
> > Yeah, it'll confuse CFQ a lot actually. The threads either need to share
> > an io context (clean approach, however will introduce locking for things
> > that were previously lockless), or CFQ needs to get better support for
> > cooperating processes.
>
> Do let me know if I can be of any help in this.
Thanks, it should not be a lot of work though.
> > For the fio testing, we can make some improvements there. Right now you
> > don't get any concurrency of the io requests if you set eg iodepth=32,
> > as the 32 requests will be submitted as a linked chain of atoms. For io
> > saturation, that's not really what you want.
>
> Just to be clear: I'm currently focusing on supporting sys_io_*() so I'm
> using fio's libaio engine. I'm not testing the syslet syscall interface
> yet.
Ah ok, then there's no issue from that end!
--
Jens Axboe
On Wed, May 30 2007, Ingo Molnar wrote:
> - splice. (a bit too early to tell but it's looking good so far. Would
> be nice if someone did a brute-force memcpy() based vmsplice to user
> memory, just to make usage fully symmetric.)
Heh, I actually agree, at least then the interface is complete! We can
always replace it with something more clever, should someone feel so
inclined. Here's a rough patch to do that, it's totally untested (but it
compiles). sparse will warn about the __user removal, though. I'm sure
viro would shoot me dead on the spot, should he see this...
diff --git a/fs/splice.c b/fs/splice.c
index 12f2828..5023c01 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -657,9 +657,9 @@ out_ret:
* key here is the 'actor' worker passed in that actually moves the data
* to the wanted destination. See pipe_to_file/pipe_to_sendpage above.
*/
-ssize_t __splice_from_pipe(struct pipe_inode_info *pipe,
- struct file *out, loff_t *ppos, size_t len,
- unsigned int flags, splice_actor *actor)
+ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, void *actor_priv,
+ loff_t *ppos, size_t len, unsigned int flags,
+ splice_actor *actor)
{
int ret, do_wakeup, err;
struct splice_desc sd;
@@ -669,7 +669,7 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe,
sd.total_len = len;
sd.flags = flags;
- sd.file = out;
+ sd.file = actor_priv;
sd.pos = *ppos;
for (;;) {
@@ -1240,28 +1240,104 @@ static int get_iovec_page_array(const struct iovec __user *iov,
return error;
}
+static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
+ struct splice_desc *sd)
+{
+ int ret;
+
+ ret = buf->ops->pin(pipe, buf);
+ if (!ret) {
+ void __user *dst = sd->userptr;
+ /*
+ * use non-atomic map, can be optimized to map atomically if we
+ * prefault the user memory.
+ */
+ char *src = buf->ops->map(pipe, buf, 0);
+
+ if (copy_to_user(dst, src, sd->len))
+ ret = -EFAULT;
+
+ buf->ops->unmap(pipe, buf, src);
+
+ if (!ret)
+ return sd->len;
+ }
+
+ return ret;
+}
+
+/*
+ * For lack of a better implementation, implement vmsplice() to userspace
+ * as a simple copy of the pipes pages to the user iov.
+ */
+static long vmsplice_to_user(struct file *file, const struct iovec __user *iov,
+ unsigned long nr_segs, unsigned int flags)
+{
+ struct pipe_inode_info *pipe;
+ ssize_t size;
+ int error;
+ long ret;
+
+ pipe = pipe_info(file->f_path.dentry->d_inode);
+ if (!pipe)
+ return -EBADF;
+ if (!nr_segs)
+ return 0;
+
+ if (pipe->inode)
+ mutex_lock(&pipe->inode->i_mutex);
+
+ ret = 0;
+ while (nr_segs) {
+ void __user *base;
+ size_t len;
+
+ /*
+ * Get user address base and length for this iovec.
+ */
+ error = get_user(base, &iov->iov_base);
+ if (unlikely(error))
+ break;
+ error = get_user(len, &iov->iov_len);
+ if (unlikely(error))
+ break;
+
+ /*
+ * Sanity check this iovec. 0 read succeeds.
+ */
+ if (unlikely(!len))
+ break;
+ error = -EFAULT;
+ if (unlikely(!base))
+ break;
+
+ size = __splice_from_pipe(pipe, (void *) base, NULL, len,
+ flags, pipe_to_user);
+ if (size < 0) {
+ if (!ret)
+ ret = size;
+
+ break;
+ }
+
+ nr_segs--;
+ iov++;
+ ret += size;
+ }
+
+ if (pipe->inode)
+ mutex_unlock(&pipe->inode->i_mutex);
+
+ return ret;
+}
+
/*
* vmsplice splices a user address range into a pipe. It can be thought of
* as splice-from-memory, where the regular splice is splice-from-file (or
* to file). In both cases the output is a pipe, naturally.
- *
- * Note that vmsplice only supports splicing _from_ user memory to a pipe,
- * not the other way around. Splicing from user memory is a simple operation
- * that can be supported without any funky alignment restrictions or nasty
- * vm tricks. We simply map in the user memory and fill them into a pipe.
- * The reverse isn't quite as easy, though. There are two possible solutions
- * for that:
- *
- * - memcpy() the data internally, at which point we might as well just
- * do a regular read() on the buffer anyway.
- * - Lots of nasty vm tricks, that are neither fast nor flexible (it
- * has restriction limitations on both ends of the pipe).
- *
- * Alas, it isn't here.
- *
*/
-static long do_vmsplice(struct file *file, const struct iovec __user *iov,
- unsigned long nr_segs, unsigned int flags)
+static long vmsplice_to_pipe(struct file *file, const struct iovec __user *iov,
+ unsigned long nr_segs, unsigned int flags)
{
struct pipe_inode_info *pipe;
struct page *pages[PIPE_BUFFERS];
@@ -1289,6 +1365,22 @@ static long do_vmsplice(struct file *file, const struct iovec __user *iov,
return splice_to_pipe(pipe, &spd);
}
+/*
+ * Note that vmsplice only really supports true splicing _from_ user memory
+ * to a pipe, not the other way around. Splicing from user memory is a simple
+ * operation that can be supported without any funky alignment restrictions
+ * or nasty vm tricks. We simply map in the user memory and fill them into
+ * a pipe. The reverse isn't quite as easy, though. There are two possible
+ * solutions for that:
+ *
+ * - memcpy() the data internally, at which point we might as well just
+ * do a regular read() on the buffer anyway.
+ * - Lots of nasty vm tricks, that are neither fast nor flexible (it
+ * has restriction limitations on both ends of the pipe).
+ *
+ * Currently we punt and implement it as a normal copy, see pipe_to_user().
+ *
+ */
asmlinkage long sys_vmsplice(int fd, const struct iovec __user *iov,
unsigned long nr_segs, unsigned int flags)
{
@@ -1300,7 +1392,9 @@ asmlinkage long sys_vmsplice(int fd, const struct iovec __user *iov,
file = fget_light(fd, &fput);
if (file) {
if (file->f_mode & FMODE_WRITE)
- error = do_vmsplice(file, iov, nr_segs, flags);
+ error = vmsplice_to_pipe(file, iov, nr_segs, flags);
+ else if (file->f_mode & FMODE_READ)
+ error = vmsplice_to_user(file, iov, nr_segs, flags);
fput_light(file, fput);
}
--
Jens Axboe
Ingo Molnar wrote:
>
> - sendfile(). This API mainly failed on #2. It partly failed on #1 too.
> (couldnt be used in certain types of scenarios so was unintuitive.)
> splice() fixes this almost completely.
>
> - KAIO. It fails on #2 and #3.
I wonder how useful it would be to reimplement sendfile()
using splice(), either in glibc or inside the kernel itself?
sendfile() does get used a fair bit, but I really doubt that anyone
outside of a handful of people on this list actually use splice().
Cheers
On Wed, May 30 2007, Mark Lord wrote:
> Ingo Molnar wrote:
> >
> > - sendfile(). This API mainly failed on #2. It partly failed on #1 too.
> > (couldnt be used in certain types of scenarios so was unintuitive.)
> > splice() fixes this almost completely.
> >
> > - KAIO. It fails on #2 and #3.
>
> I wonder how useful it would be to reimplement sendfile()
> using splice(), either in glibc or inside the kernel itself?
>
> sendfile() does get used a fair bit, but I really doubt that anyone
> outside of a handful of people on this list actually use splice().
It's indeed the plan, I even have a git branch for it. Just never took
the time to actually finish it.
http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=splice-sendfile
--
Jens Axboe
On Wed, 30 May 2007, Mark Lord wrote:
>
> I wonder how useful it would be to reimplement sendfile()
> using splice(), either in glibc or inside the kernel itself?
I'd like that, if only because right now we have two separate paths that
kind of do the same thing, and splice really is the only one that is
generic.
I thought Jens even had some experimental patches for it. It might be
worth to "just do it" - there's some internal overhead, but on the other
hand, it's also likely the best way to make sure any issues get sorted
out.
Linus
On Wed, May 30 2007, Linus Torvalds wrote:
>
>
> On Wed, 30 May 2007, Mark Lord wrote:
> >
> > I wonder how useful it would be to reimplement sendfile()
> > using splice(), either in glibc or inside the kernel itself?
>
> I'd like that, if only because right now we have two separate paths that
> kind of do the same thing, and splice really is the only one that is
> generic.
>
> I thought Jens even had some experimental patches for it. It might be
> worth to "just do it" - there's some internal overhead, but on the other
> hand, it's also likely the best way to make sure any issues get sorted
> out.
I do, this is a one year old patch that does that:
http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=f8f550e027fd07ad8d87110178803dc63b544d89
I'll update it, test, and submit for 2.6.23.
--
Jens Axboe
On Wed, 30 May 2007, Ingo Molnar wrote:
> yeah - this is a fundamental design question for Linus i guess :-) glibc
> (and other infrastructure libraries) have a fundamental problem: they
> cannot (and do not) presently use persistent file descriptors to make
> use of kernel functionality, due to ABI side-effects. [applications can
> dup into an fd used by glibc, applications can close it - shells close
> fds blindly for example, etc.] Today glibc simply cannot open a file
> descriptor and keep it open while application code is running due to
> these problems.
Here I think we are forgetting that glibc is userspace and there's no
separation between the application code and glibc code. An application
linking to glibc can break glibc in a thousand ways, fds or no fds.
Like complaining that glibc is broken because printf()
suddenly does not work anymore ;)
#include <stdio.h>
int main(void) {
close(fileno(stdout));
printf("Whiskey Tango Foxtrot?\n");
return 0;
}
- Davide
On Wed, 30 May 2007, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > > To echo Uli and paraphrase an ad, "it's the interface, silly."
> >
> > THERE IS NO INTERFACE! You're just making that up, and glossing over
> > the most important part of the whole thing!
> >
> > If you could actually point to something specific that matches what
> > everybody needs, and is architecture-neutral, it would be a different
> > issue. As is, you're just saying "memory-mapped interfaces" without
> > actually going into enough detail to show HOW MUCH IT SUCKS.
> >
> > There really are very few programs that would use them. [...]
>
> looking over the list of our new generic APIs (see further below) i
> think there are three important things that are needed for an API to
> become widely used:
>
> 1) it should solve a real problem (ha ;-), it should be intuitive to
> humans and it should fit into existing things naturally.
>
> 2) it should be ubiquitous. (if it's about IO it should cover block IO,
> network IO, timers, signals and everything) Even if it might look
> silly in some of the cases, having complete, utter, no compromises,
> 100% coverage for everything massively helps the uptake of an API,
> because it allows the user-space coder to pick just one paradigm
> that is closest to his application and stick to it and only to it.
>
> 3) it should be end-to-end supported by glibc.
>
> [...]
I think, as Linus pointed out (as I did a few months ago), that there's
confusion about the term "Unification" or "Single Interface".
Unification is not about fetching the data coming from all the diverse
sources through a single interface. That is just broken, because each
data source wants a different data structure to be reported.
This is ABI-hell 101. Unification is the ability to uniformly wait for
readiness, and then fetch the data with source-dependent collectors (read(2),
io_getevents(2), ...). That way you have ABI isolation at the single
data source, instead of monster structures trying to blob together the
most diverse data formats.
AFAIK, inotify works with select/poll/epoll as is.
- Davide
Davide Libenzi wrote:
> An application
> linking to glibc can break glibc in thousand ways, indipendently from fds
> or not fds.
It's not (only/mainly) about breaking. File descriptors are a resource
that has to be used under the control of the program. The runtime
cannot just steal some for itself. This indirectly leads to breaking
code. We've seen this many times and I keep repeating the same issue
over and over again: why do we have MAP_ANON instead of keeping a file
descriptor with /dev/null open? Why is mmap made more complicated by
allowing the file descriptor to be closed after the mmap() call is done?
Take a look at a process running your favorite shell. Ever wonder why
there is this stray file descriptor with a high number?
$ cat /proc/3754/cmdline
bash
$ ll /proc/3754/fd/
total 0
lrwx------ 1 drepper drepper 64 2007-05-30 12:50 0 -> /dev/pts/19
lrwx------ 1 drepper drepper 64 2007-05-30 12:50 1 -> /dev/pts/19
lrwx------ 1 drepper drepper 64 2007-05-30 12:49 2 -> /dev/pts/19
lrwx------ 1 drepper drepper 64 2007-05-30 12:50 255 -> /dev/pts/19
File descriptors must be requested explicitly and cannot be implicitly
consumed.
All that and the other problem I mentioned earlier today about auxiliary
data. File descriptors are not the ideal interface. Elegant: yes,
ideal: no. From physics and math you might have learned that not every
result that looks clean and beautiful is correct.
- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
On Wed, 30 May 2007, Davide Libenzi wrote:
>
> Here I think we are forgetting that glibc is userspace and there's no
> separation between the application code and glibc code. An application
> linking to glibc can break glibc in thousand ways, indipendently from fds
> or not fds. Like complaining that glibc is broken because printf()
> suddendly does not work anymore ;)
No, Davide, the problem is that some applications depend on getting
_specific_ file descriptors.
For example, if you do
close(0);
.. something else ..
if (open("myfile", O_RDONLY) < 0)
exit(1);
you can (and should) depend on the open returning zero.
So library routines *must not* open file descriptors in the normal space.
(The same is true of real applications doing the equivalent of
for (i = 0; i < NR_OPEN; i++)
close(i);
to clean up all file descriptors before doing something new. And yes, I
think it was bash that used to *literally* do something like that a long
time ago.
Another example of the same thing: people open file descriptors and know
that they'll be "dense" in the result, and then use "select()" on them.
So it's true that file descriptors can't be used randomly by the standard
libraries - they'd need to have some kind of separate "private space".
Which *could* be something as simple as saying "bit 30 in the file
descriptor specifies a separate fd space" along with some flags to make
open and friends return those separate fd's. That makes them useless for
"select()" (which assumes a flat address space, of course), but would be
useful for just about anything else.
Linus
Linus Torvalds wrote:
>
> On Wed, 30 May 2007, Mark Lord wrote:
>> I wonder how useful it would be to reimplement sendfile()
>> using splice(), either in glibc or inside the kernel itself?
>
> I'd like that, if only because right now we have two separate paths that
> kind of do the same thing, and splice really is the only one that is
> generic.
>
> I thought Jens even had some experimental patches for it. It might be
> worth to "just do it" - there's some internal overhead, but on the other
> hand, it's also likely the best way to make sure any issues get sorted
> out.
>
Last time I played with splice(), I found a bug in the readahead logic, most
probably because nobody but me had tried it before.
(corrected by Fengguang Wu in commit 9ae9d68cbf3fe0ec17c17c9ecaa2188ffb854a66)
So yes, reimplementing sendfile() should help find the last splice() bugs,
and as a bonus it could add non-blocking disk io (O_NONBLOCK on input
file -> socket)
On Wed, 30 May 2007, Linus Torvalds wrote:
> On Wed, 30 May 2007, Davide Libenzi wrote:
> >
> > Here I think we are forgetting that glibc is userspace and there's no
> > separation between the application code and glibc code. An application
> > linking to glibc can break glibc in thousand ways, indipendently from fds
> > or not fds. Like complaining that glibc is broken because printf()
> > suddendly does not work anymore ;)
>
> No, Davide, the problem is that some applications depend on getting
> _specific_ file descriptors.
>
> For example, if you do
>
> close(0);
> .. something else ..
> if (open("myfile", O_RDONLY) < 0)
> exit(1);
>
> you can (and should) depend on the open returning zero.
>
> So library routines *must not* open file descriptors in the normal space.
>
> (The same is true of real applications doing the equivalent of
>
> for (i = 0; i < NR_OPEN; i++)
> close(i);
>
> to clean up all file descriptors before doing something new. And yes, I
> think it was bash that used to *literally* do something like that a long
> time ago.
Right. I misunderstood Uli and Ingo. I thought it was like trying to
protect glibc from intentional application mis-behaviour.
> Another example of the same thing: people open file descriptors and know
> that they'll be "dense" in the result, and then use "select()" on them.
>
> So it's true that file descriptors can't be used randomly by the standard
> libraries - they'd need to have some kind of separate "private space".
>
> Which *could* be something as simple as saying "bit 30 in the file
> descriptor specifies a separate fd space" along with some flags to make
> open and friends return those separate fd's. That makes them useless for
> "select()" (which assumes a flat address space, of course), but would be
> useful for just about anything else.
I think it can be solved in a few ways. Yours or Ingo's (or something
else) can work, to solve the above "legacy" fd space expectations.
- Davide
On Wed, 30 May 2007, Eric Dumazet wrote:
>
> So yes, reimplement sendfile() should help to find last splice() bugs, and as
> a bonus it could add non blocking disk io, (O_NONBLOCK on input file ->
> socket)
Well, to get those kinds of advantages, you'd have to use splice directly,
since sendfile() hasn't supported nonblocking disk IO, and the interface
doesn't really allow for it.
In fact, since nonblocking accesses require also some *polling* method,
and we don't have that for files, I suspect the best option for those
things is to simply mix AIO and splice(). AIO tends to be the right thing
for disk waits (read: short, often cached), and if we can improve AIO
performance for the cached accesses (which is exactly what the threadlets
should hopefully allow us to do), I would seriously suggest going that
route.
But the pure "use splice to _implement_ sendfile()" thing is worth doing
for all the other reasons, even if nonblocking file access is not likely
one of them.
Linus
Linus Torvalds wrote:
>
> On Wed, 30 May 2007, Davide Libenzi wrote:
>> Here I think we are forgetting that glibc is userspace and there's no
>> separation between the application code and glibc code. An application
>> linking to glibc can break glibc in thousand ways, indipendently from fds
>> or not fds. Like complaining that glibc is broken because printf()
>> suddendly does not work anymore ;)
>
> No, Davide, the problem is that some applications depend on getting
> _specific_ file descriptors.
>
Fix the application, instead of adding kernel bloat?
> For example, if you do
>
> close(0);
> .. something else ..
> if (open("myfile", O_RDONLY) < 0)
> exit(1);
>
> you can (and should) depend on the open returning zero.
Then you can also exclude multi-threading, since a thread (even one not
inside glibc) can also use socket()/pipe()/open()/whatever and take the
zero file descriptor as well.
Frankly, I don't buy this fd namespace stuff.
The only hardcoded things in Unix are fds 0, 1 and 2.
People usually take care of these, or should use a Microsoft OS.
POSIX mandates that open() returns the lowest available fd.
But this obviously works only if you don't have another thread messing
with fds, or if you don't call a library function that opens a file.
That's all.
>
> So library routines *must not* open file descriptors in the normal space.
>
> (The same is true of real applications doing the equivalent of
>
> for (i = 0; i < NR_OPEN; i++)
> close(i);
Quite buggy IMHO.
This hack was there to avoid bugs inherited from ancestor applications
forking/execing a shell, at a time when one process could not open more
than 20 files (AT&T Unix, 21 years ago).
Unix has fcntl(fd, F_SETFD, FD_CLOEXEC). A library should use this to make
sure an fd is not propagated at exec() time.
>
> to clean up all file descriptors before doing something new. And yes, I
> think it was bash that used to *literally* do something like that a long
> time ago.
>
> Another example of the same thing: people open file descriptors and know
> that they'll be "dense" in the result, and then use "select()" on them.
poll() is nice. Even AT&T Unix had it 21 years ago :)
>
> So it's true that file descriptors can't be used randomly by the standard
> libraries - they'd need to have some kind of separate "private space".
>
> Which *could* be something as simple as saying "bit 30 in the file
> descriptor specifies a separate fd space" along with some flags to make
> open and friends return those separate fd's. That makes them useless for
> "select()" (which assumes a flat address space, of course), but would be
> useful for just about anything else.
>
Please don't do that. Second-class fds.
Then what about having ten different shared libraries? Third-class fds?
On Wed, 30 May 2007, Eric Dumazet wrote:
> >
> > No, Davide, the problem is that some applications depend on getting
> > _specific_ file descriptors.
>
> Fix the application, instead of adding kernel bloat?
No. The application is _correct_. It's how file descriptors are defined to
work.
> Then you can also exclude multi-threading, since a thread (even not inside
> glibc) can also use socket()/pipe()/open()/whatever and take the zero file
> descriptor as well.
Totally different. That's an application internal issue. It does *not*
mean that we can break existing standards.
> The only hardcoded thing in Unix is 0, 1 and 2 fds.
Wrong. I already gave an example of real code that just didn't bother to
keep track of which fd's it had open, and closed them all. Partly, in
fact, because you can't even _know_ which fd's you have open when somebody
else just execve's you.
You can call it buggy, but the fact is, if you do, you're SIMPLY WRONG.
You cannot just change years and years of coding practice, and years of
standards documentation. The behaviour of file descriptors is a fact. Ignoring
that fact because you don't like it is naïve and simply not realistic.
Linus
Linus Torvalds wrote:
>
> On Wed, 30 May 2007, Eric Dumazet wrote:
>> So yes, reimplement sendfile() should help to find last splice() bugs, and as
>> a bonus it could add non blocking disk io, (O_NONBLOCK on input file ->
>> socket)
>
> Well, to get those kinds of advantages, you'd have to use splice directly,
> since sendfile() hasn't supported nonblocking disk IO, and the interface
> doesn't really allow for it.
The sendfile() interface doesn't allow it, but if you open("somediskfile",
O_RDONLY | O_NONBLOCK), then a splice()-based sendfile() can perform
non-blocking disk I/O (while starting the I/O with readahead).
I actually use this trick myself :)
(splice(disk -> pipe, NONBLOCK), splice(pipe -> worker))
non-blocking disk I/O + zero copy :)
>
> In fact, since nonblocking accesses require also some *polling* method,
> and we don't have that for files, I suspect the best option for those
> things is to simply mix AIO and splice(). AIO tends to be the right thing
> for disk waits (read: short, often cached), and if we can improve AIO
> performance for the cached accesses (which is exactly what the threadlets
> should hopefully allow us to do), I would seriously suggest going that
> route.
>
> But the pure "use splice to _implement_ sendfile()" thing is worth doing
> for all the other reasons, even if nonblocking file access is not likely
> one of them.
>
> Linus
>
>
Linus Torvalds wrote:
> for (i = 0; i < NR_OPEN; i++)
> close(i);
>
> to clean up all file descriptors before doing something new. And yes, I
> think it was bash that used to *literally* do something like that a long
> time ago.
Indeed. It was not only bash, though, I fixed probably a dozen
applications. But even the new and better solution (readdir of
/proc/self/fd) does not prevent the problem of closing descriptors the
system might still need and the application doesn't know about.
> Which *could* be something as simple as saying "bit 30 in the file
> descriptor specifies a separate fd space" along with some flags to make
> open and friends return those separate fd's.
I don't like special cases. For me, things come better in quantities of 0,
1, and unlimited (well, a reasonably high limit). Otherwise, who gets to
use that special namespace? The C library is not the only body of code
which would want to use descriptors.
And then the semantics: should these descriptors show up in
/proc/self/fd? Are there separate directories for each namespace? Do
they count against the rlimit?
This seems to me like a shot from the hip, without thinking about other
possibilities.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
On Wed, 30 May 2007, Ulrich Drepper wrote:
>
> I don't like special cases. For me things better come in quantities 0,
> 1, and unlimited (well, reasonable high limit). Otherwise, who gets to
> use that special namespace? The C library is not the only body of code
> which would want to use descriptors.
Well, don't think of it as a special case at all: think of bit 30 as a
"the user asked for a non-linear fd".
In fact, to make it effective, I'd suggest literally scrambling the low
bits (using, for example, some silly per-boot xor value to actually
generate the "true" index - the equivalent of a really stupid randomizer).
That way you'd have the legacy "linear" space, and a separate "non-linear
space" where people simply *cannot* make assumptions about contiguous fd
allocations. There's no special case there - it's just an extension which
explicitly allows us to say "if you do that, your fd's won't be allocated
the traditional way any more, but you *can* mix the traditional and the
non-linear allocation".
> And then the semantics: should these descriptors show up in
> /proc/self/fd? Are there separate directories for each namespace? Do
> they count against the rlimit?
Oh, absolutely. They'd be real fds in every way. People could use them
100% equivalently (and concurrently) with the traditional ones. The whole,
and the _only_ point, would be that it breaks the legacy guarantees of a
dense fd space.
Most apps don't actually *need* that dense fd space in any case. But by
defaulting to it, we wouldn't break those (few) apps that actually depend
on it.
Linus
On Wed, 30 May 2007, Eric Dumazet wrote:
> > So library routines *must not* open file descriptors in the normal space.
> >
> > (The same is true of real applications doing the equivalent of
> >
> > for (i = 0; i < NR_OPEN; i++)
> > close(i);
>
> Quite buggy IMHO
Looking at it now, I'd agree (although I think I have that somewhere in my
old code too). Consider, though, that such code also appears in reference
books like Richard Stevens' "UNIX Network Programming".
- Davide
Linus Torvalds wrote:
> Which *could* be something as simple as saying "bit 30 in the file
> descriptor specifies a separate fd space" along with some flags to make
> open and friends return those separate fd's. That makes them useless for
> "select()" (which assumes a flat address space, of course), but would be
> useful for just about anything else.
>
Some programs - legitimately, I think - scan /proc/self/fd to close
everything. The question is whether the glibc-private fds should appear
there. And something like a "close-on-fork" flag might be useful,
though I guess glibc can keep track of its own fds closely enough to not
need something like that.
J
Ulrich Drepper wrote:
> I don't like special cases. For me things better come in quantities 0,
> 1, and unlimited (well, reasonable high limit). Otherwise, who gets to
> use that special namespace? The C library is not the only body of code
> which would want to use descriptors.
Valgrind could certainly make use of it. It currently reserves a set of
fds "high enough", and tries hard to hide them from apps, but
/proc/self/fd makes it intractable in general (there was only so much
simulation I was willing to do in Valgrind).
J
On Wed, 30 May 2007, Ulrich Drepper wrote:
>
> Linus Torvalds wrote:
> > for (i = 0; i < NR_OPEN; i++)
> > close(i);
> >
> > to clean up all file descriptors before doing something new. And yes, I
> > think it was bash that used to *literally* do something like that a long
> > time ago.
>
> Indeed. It was not only bash, though, I fixed probably a dozen
> applications. But even the new and better solution (readdir of
> /proc/self/fd) does not prevent the problem of closing descriptors the
> system might still need and the application doesn't know about.
Please, do not drop me out of the Cc list. If you have a valid point, you
should be able to carry it forward regardless, no?
- Davide
On Wed, 30 May 2007, Jeremy Fitzhardinge wrote:
>
> Some programs - legitimately, I think - scan /proc/self/fd to close
> everything. The question is whether the glibc-private fds should appear
> there. And something like a "close-on-fork" flag might be useful,
> though I guess glibc can keep track of its own fds closely enough to not
> need something like that.
Sure. I think there are things we can do (like make the non-linear fd's
appear somewhere else, and make them close-on-exec by default etc).
And it's not like it's necessarily the only way to do things.
I just threw it out as a possible solution - and one that is almost
certainly *superior* to trying to work around the fd thing with some
shared memory area which has tons of much more serious problems of its own
(*).
Linus
(*) Ranging from: specialized-only interfaces, inability to pass it
around, lack of any abstraction interfaces, and almost impossible to
debug. The security implications of kernel and user space sharing
read-write access to some shared area are also legion!
On Wed, 30 May 2007, Linus Torvalds wrote:
> > And then the semantics: should these descriptors show up in
> > /proc/self/fd? Are there separate directories for each namespace? Do
> > they count against the rlimit?
>
> Oh, absolutely. They'd be real fds in every way. People could use them
> 100% equivalently (and concurrently) with the traditional ones. The whole,
> and the _only_ point, would be that it breaks the legacy guarantees of a
> dense fd space.
>
> Most apps don't actually *need* that dense fd space in any case. But by
> defaulting to it, we wouldn't break those (few) apps that actually depend
> on it.
I agree. What would be a good interface to allocate fds in such area? We
don't want to replicate syscalls, so maybe a special new dup function?
- Davide
Linus Torvalds wrote:
> Well, don't think of it as a special case at all: think of bit 30 as a
> "the user asked for a non-linear fd".
This sounds easy but doesn't really solve all the issues. Let me repeat
your example and the solution currently in use:
problem: an application wants to close all file descriptors except a select
few, cleaning up whatever is currently open. It doesn't know all the
descriptors that are open. Maybe all this is in preparation for an exec call.
Today the best method to do this is to readdir() /proc/self/fd and
exclude the descriptors on the whitelist.
If the special, non-sequential descriptors are also listed in that
directory the runtimes still cannot use them since they are visible.
If you go ahead with this, then at the very least add a flag which
causes the descriptor to not show up in /proc/*/fd.
You also have to be aware that open() is just one piece of the puzzle.
What about socket()? I've cursed this interface many times before and
now it's biting you: there is no parameter to pass a flag. What about
transferring file descriptors via Unix domain sockets? How can I decide
whether the transferred descriptor should be in the private namespace?
There are likely many, many more problems and corner cases like this.
--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
On Wed, 30 May 2007, Linus Torvalds wrote:
>
> Sure. I think there are things we can do (like make the non-linear fd's
> appear somewhere else, and make them close-on-exec by default etc).
Side note: it might not even be a "close-on-exec by default" thing: it
might well be an *always* close-on-exec.
That COE scan is pretty horrid to do: we need to scan a bitmap of those things
on each exec. So it might be totally sensible to just declare that the
non-linear fd's would simply always be "local", and never bleed across an
execve().
Linus
Linus Torvalds wrote:
> Side note: it might not even be a "close-on-exec by default" thing: it
> might well be an *always* close-on-exec.
>
> That COE scan is pretty horrid to do: we need to scan a bitmap of those things
> on each exec. So it might be totally sensible to just declare that the
> non-linear fd's would simply always be "local", and never bleed across an
> execve().
Hm, I wouldn't limit the mechanism prematurely. Using Valgrind as an
example of an alternate user of this mechanism, it would be useful to
use a pipe to transmit out-of-band information from an exec-er to an
exec-ee process. At the moment there's a lot of mucking around with
execve() to transmit enough information from the parent valgrind to its
successor.
J
Linus Torvalds wrote:
>
> On Wed, 30 May 2007, Eric Dumazet wrote:
>>> No, Davide, the problem is that some applications depend on getting
>>> _specific_ file descriptors.
>> Fix the application, and not adding kernel bloat ?
>
> No. The application is _correct_. It's how file descriptors are defined to
> work.
>
>> Then you can also exclude multi-threading, since a thread (even not inside
>> glibc) can also use socket()/pipe()/open()/whatever and take the zero file
>> descriptor as well.
>
> Totally different. That's an application internal issue. It does *not*
> mean that we can break existing standards.
>
>> The only hardcoded thing in Unix is 0, 1 and 2 fds.
>
> Wrong. I already gave an example of real code that just didn't bother to
> keep track of which fd's it had open, and closed them all. Partly, in
> fact, because you can't even _know_ which fd's you have open when somebody
> else just execve's you.
If someone really cares, /proc/self/fd can help. But one shouldn't care at all.
Among the things a process can set up before exec()ing another program, file
descriptors outside of 0, 1 and 2 are the most obvious, but you also have
alarm(), or stupid rlimits.
>
> You can call it buggy, but the fact is, if you do, you're SIMPLY WRONG.
>
> You cannot just change years and years of coding practice, and standard
> documentations. The behaviour of file descriptors is a fact. Ignoring that
> fact because you don't like it is na?ve and simply not realistic.
I want to change nothing. The current situation is fine and well documented,
thank you.
If a program does "for (i = 0; i < NR_OPEN; i++) close(i);", this
*will*/*should* work as intended: close all file descriptors from 0 to
NR_OPEN. Big deal.
But you won't find in a program:
FILE *fp = fopen("somefile", "r");
for (i = 0; i < NR_OPEN; i++)
close(i);
while (fgets(buff, sizeof(buff), fp)) {
}
You and/or others want to add fd namespaces and other hacks.
I saw suspicious examples on this thread; I am waiting for a real one
justifying all this stuff.
After file descriptor separation, I guess we'll need memory space separation
as well, signal separation (SIGALRM comes to mind), uid/gid separation, cpu
time separation, and so on... setrlimit() layered for every shared lib.
On Wed, 30 May 2007, Davide Libenzi wrote:
>
> I agree. What would be a good interface to allocate fds in such area? We
> don't want to replicate syscalls, so maybe a special new dup function?
I'd do it with something like "newfd = dup2(fd, NONLINEAR_FD)" or similar,
and just have NONLINEAR_FD be some magic value (for example, make it be
0x40000000 - the bit that says "private, nonlinear" in the first place).
But what's gotten lost in the current discussion is that we probably don't
actually _need_ such a private space. I'm just saying that if the *choice*
is between memory-mapped interfaces and a private fd-space, we should
probably go for the latter. "Everything is a file" is the UNIX way, after
all. But there's little reason to introduce private fd's otherwise.
Linus
On Wed, 30 May 2007, Ulrich Drepper wrote:
> You also have to be aware that open() is just one piece of the puzzle.
> What about socket()? I've cursed this interface many times before and
> now it's biting you: there is parameter to pass a flag. What about
> transferring file descriptors via Unix domain sockets? How can I decide
> the transferred descriptor should be in the private namespace?
Well, we can't just replicate/change every system call that creates a file
descriptor. So I'm for something like:
int sys_fdup(int fd, int flags);
So you basically create your fds with their native/existing system calls,
and then you dup/move them into the preferred fd space.
- Davide
Davide Libenzi wrote:
> On Wed, 30 May 2007, Linus Torvalds wrote:
>
>>> And then the semantics: should these descriptors show up in
>>> /proc/self/fd? Are there separate directories for each namespace? Do
>>> they count against the rlimit?
>> Oh, absolutely. They'd be real fds in every way. People could use them
>> 100% equivalently (and concurrently) with the traditional ones. The whole,
>> and the _only_ point, would be that it breaks the legacy guarantees of a
>> dense fd space.
>>
>> Most apps don't actually *need* that dense fd space in any case. But by
>> defaulting to it, we wouldn't break those (few) apps that actually depend
>> on it.
>
> I agree. What would be a good interface to allocate fds in such area? We
> don't want to replicate syscalls, so maybe a special new dup function?
>
If the deal is to be able to get faster open()/socket()/pipe()/... calls by
not having to find the first 0 bit in a huge bitmap, a better way would be to
have a flag in struct task, reset to 0 at exec time.
A new syscall would say: this process is OK to receive *random* fds.
On Wed, 30 May 2007 14:27:52 -0700 (PDT)
Linus Torvalds <[email protected]> wrote:
> Well, don't think of it as a special case at all: think of bit 30 as
> a "the user asked for a non-linear fd".
If the sole point is to protect an fd from being closed or operated on
outside of a certain context, why not just provide the ability to
"protect" an fd to prevent its use. Maybe a pair of syscalls like
"fdprotect" and "fdunprotect" that take an fd and an integer key.
Protected fds would return EBADF or something if accessed. The same
integer key must be provided to fdunprotect in order to gain access
to it again. Then glibc or valgrind or whatever would just unprotect
the fd before operating on it.
- DML
On Wed, May 30, 2007 at 02:27:52PM -0700, Linus Torvalds wrote:
> Well, don't think of it as a special case at all: think of bit 30 as a
> "the user asked for a non-linear fd".
> In fact, to make it effective, I'd suggest literally scrambling the low
> bits (using, for example, some silly per-boot xor value to actually
> generate the "true" index - the equivalent of a really stupid randomizer).
> That way you'd have the legacy "linear" space, and a separate "non-linear
> space" where people simply *cannot* make assumptions about contiguous fd
> allocations. There's no special case there - it's just an extension which
> explicitly allows us to say "if you do that, your fd's won't be allocated
> the traditional way any more, but you *can* mix the traditional and the
> non-linear allocation".
One could always stuff a seed or per-cpu seeds in the files_struct and
use a PRNG. The only trick would be cacheline bounces and/or space
consumption of seeds. Another possibility would be bitreversed
contiguity or otherwise a bit permutation of some contiguous range,
modulo (of course) the high bit used to tag the randomized range.
With "truly" random/sparse fd numbers it may be meaningful to use a
different data structure from a bitmap to track them in-kernel, though
xor and other easily-computed mappings to/from contiguous ranges won't
need such in earnest.
-- wli
On Wed, May 30, 2007 at 01:00:30PM -0700, Linus Torvalds wrote:
> Which *could* be something as simple as saying "bit 30 in the file
> descriptor specifies a separate fd space" along with some flags to make
> open and friends return those separate fd's. That makes them useless for
> "select()" (which assumes a flat address space, of course), but would be
> useful for just about anything else.
Or.. we could have a method of swizzling in and out an entire FD
array, similar to UML's trick for swizzling MMs.
--
Mathematics is the supreme nostalgia of our time.
On Wed, May 30, 2007 at 01:00:30PM -0700, Linus Torvalds wrote:
>> Which *could* be something as simple as saying "bit 30 in the file
>> descriptor specifies a separate fd space" along with some flags to make
>> open and friends return those separate fd's. That makes them useless for
>> "select()" (which assumes a flat address space, of course), but would be
>> useful for just about anything else.
On Wed, May 30, 2007 at 05:27:15PM -0500, Matt Mackall wrote:
> Or.. we could have a method of swizzling in and out an entire FD
> array, similar to UML's trick for swizzling MMs.
I like that notion even better than randomization. I think it should
happen. I like SKAS, too, of course.
-- wli
* Linus Torvalds <[email protected]> wrote:
> > I agree. What would be a good interface to allocate fds in such
> > area? We don't want to replicate syscalls, so maybe a special new
> > dup function?
>
> I'd do it with something like "newfd = dup2(fd, NONLINEAR_FD)" or
> similar, and just have NONLINEAR_FD be some magic value (for example,
> make it be 0x40000000 - the bit that says "private, nonlinear" in the
> first place).
>
> But what's gotten lost in the current discussion is that we probably
> don't actually _need_ such a private space. I'm just saying that if
> the *choice* is between memory-mapped interfaces and a private
> fd-space, we should probably go for the latter. "Everything is a file"
> is the UNIX way, after all. But there's little reason to introduce
> private fd's otherwise.
it's both a flexibility and a speedup thing as well:
flexibility: for libraries to be able to open files and keep them open
comes up regularly. For example currently glibc is quite wasteful in a
number of common networking related functions (Ulrich, please correct me
if i'm wrong), which could be optimized if glibc could just keep a
netlink channel fd open and could poll() it for changes and cache the
results if there are no changes (or something like that).
speedup: i suggested O_ANY 6 years ago as a speedup to Apache -
non-linear fds are cheaper to allocate/map:
http://www.mail-archive.com/[email protected]/msg23820.html
(i definitely remember having written code for that too, but i cannot
find that in the archives. hm.) In theory we could avoid _all_ fd-bitmap
overhead as well and use a per-process list/pool of struct file buffers
plus a maximum-fd field as the 'non-linear fd allocator' (at the price
of only deallocating them at process exit time).
Ingo
On Thu, 31 May 2007 08:13:03 +0200
Ingo Molnar <[email protected]> wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > > I agree. What would be a good interface to allocate fds in such
> > > area? We don't want to replicate syscalls, so maybe a special new
> > > dup function?
> >
> > I'd do it with something like "newfd = dup2(fd, NONLINEAR_FD)" or
> > similar, and just have NONLINEAR_FD be some magic value (for example,
> > make it be 0x40000000 - the bit that says "private, nonlinear" in the
> > first place).
> >
> > But what's gotten lost in the current discussion is that we probably
> > don't actually _need_ such a private space. I'm just saying that if
> > the *choice* is between memory-mapped interfaces and a private
> > fd-space, we should probably go for the latter. "Everything is a file"
> > is the UNIX way, after all. But there's little reason to introduce
> > private fd's otherwise.
>
> it's both a flexibility and a speedup thing as well:
>
> flexibility: for libraries to be able to open files and keep them open
> comes up regularly. For example currently glibc is quite wasteful in a
> number of common networking related functions (Ulrich, please correct me
> if i'm wrong), which could be optimized if glibc could just keep a
> netlink channel fd open and could poll() it for changes and cache the
> results if there are no changes (or something like that).
>
> speedup: i suggested O_ANY 6 years ago as a speedup to Apache -
> non-linear fds are cheaper to allocate/map:
>
> http://www.mail-archive.com/[email protected]/msg23820.html
>
> (i definitely remember having written code for that too, but i cannot
> find that in the archives. hm.) In theory we could avoid _all_ fd-bitmap
> overhead as well and use a per-process list/pool of struct file buffers
> plus a maximum-fd field as the 'non-linear fd allocator' (at the price
> of only deallocating them at process exit time).
Only very few apps need to open more than 100,000 files.
As these files are likely sockets, O_ANY is not a solution.
A trick is to try to keep the first 64 handles free, so that the kernel won't
consume too much cpu time and cache in get_unused_fd():
http://lkml.org/lkml/2005/9/15/307
This trick is portable (not Linux-centric).
Ingo Molnar writes:
> looking over the list of our new generic APIs (see further below) i
> think there are three important things that are needed for an API to
> become widely used:
>
> 1) it should solve a real problem (ha ;-), it should be intuitive to
> humans and it should fit into existing things naturally.
>
> 2) it should be ubiquitous. (if it's about IO it should cover block IO,
> network IO, timers, signals and everything) Even if it might look
> silly in some of the cases, having complete, utter, no compromises,
> 100% coverage for everything massively helps the uptake of an API,
> because it allows the user-space coder to pick just one paradigm
> that is closest to his application and stick to it and only to it.
>
> 3) it should be end-to-end supported by glibc.
4) At least slightly portable.
Anything supported by any similar OS is already ahead, even if it
isn't the perfect API of our dreams. This means kqueue and doors.
If it's not on any BSD or UNIX, then most app developers won't
touch it. Worse yet, it won't appear in programming books, so even
the Linux-only app programmers won't know about it.
Running ideas by the FreeBSD and OpenSolaris developers wouldn't
be a bad idea. Agreement leads to standardization, which leads to
interfaces getting used.
BTW, wrapper libraries that bury the new API under a layer of
gunk are not helpful. One might as well just use the old API.
* Ingo Molnar <[email protected]> wrote:
> it's both a flexibility and a speedup thing as well:
>
> flexibility: for libraries to be able to open files and keep them open
> comes up regularly. For example currently glibc is quite wasteful in a
> number of common networking related functions (Ulrich, please correct
> me if i'm wrong), which could be optimized if glibc could just keep a
> netlink channel fd open and could poll() it for changes and cache the
> results if there are no changes (or something like that).
>
> speedup: i suggested O_ANY 6 years ago as a speedup to Apache -
> non-linear fds are cheaper to allocate/map:
>
> http://www.mail-archive.com/[email protected]/msg23820.html
>
> (i definitely remember having written code for that too, but i cannot
> find that in the archives. hm.) In theory we could avoid _all_
> fd-bitmap overhead as well and use a per-process list/pool of struct
> file buffers plus a maximum-fd field as the 'non-linear fd allocator'
> (at the price of only deallocating them at process exit time).
to measure this i've written fd-scale-bench.c:
http://redhat.com/~mingo/fd-scale-patches/fd-scale-bench.c
which tests the (cache-hot or cache-cold) cost of open()-ing two fds
while there are N other fds already open: one is from the 'middle' of
the range, one is from the end of it.
Let's check our current 'extreme high end' performance with 1 million
fds (which is not realistic right now, but there certainly are systems
with over a hundred thousand open fds). Results from a fast CPU with 2MB
of cache:
cache-hot:
# ./fd-scale-bench 1000000 0
checking the cache-hot performance of open()-ing 1000000 fds.
num_fds: 1, best cost: 1.40 us, worst cost: 2.00 us
num_fds: 2, best cost: 1.40 us, worst cost: 1.40 us
num_fds: 3, best cost: 1.40 us, worst cost: 2.00 us
num_fds: 4, best cost: 1.40 us, worst cost: 1.40 us
...
num_fds: 77117, best cost: 1.60 us, worst cost: 2.00 us
num_fds: 96397, best cost: 2.00 us, worst cost: 2.20 us
num_fds: 120497, best cost: 2.20 us, worst cost: 2.40 us
num_fds: 150622, best cost: 2.20 us, worst cost: 3.00 us
num_fds: 188278, best cost: 2.60 us, worst cost: 3.00 us
num_fds: 235348, best cost: 2.80 us, worst cost: 3.80 us
num_fds: 294186, best cost: 3.40 us, worst cost: 4.20 us
num_fds: 367733, best cost: 4.00 us, worst cost: 5.00 us
num_fds: 459667, best cost: 4.60 us, worst cost: 6.00 us
num_fds: 574584, best cost: 5.60 us, worst cost: 8.20 us
num_fds: 718231, best cost: 6.40 us, worst cost: 10.00 us
num_fds: 897789, best cost: 7.60 us, worst cost: 11.80 us
num_fds: 1000000, best cost: 8.20 us, worst cost: 9.60 us
cache-cold:
# ./fd-scale-bench 1000000 1
checking the performance of open()-ing 1000000 fds.
num_fds: 1, best cost: 4.60 us, worst cost: 7.00 us
num_fds: 2, best cost: 5.00 us, worst cost: 6.60 us
...
num_fds: 77117, best cost: 5.60 us, worst cost: 7.40 us
num_fds: 96397, best cost: 5.60 us, worst cost: 7.40 us
num_fds: 120497, best cost: 6.20 us, worst cost: 6.80 us
num_fds: 150622, best cost: 6.40 us, worst cost: 7.60 us
num_fds: 188278, best cost: 6.80 us, worst cost: 9.20 us
num_fds: 235348, best cost: 7.20 us, worst cost: 8.80 us
num_fds: 294186, best cost: 8.00 us, worst cost: 9.40 us
num_fds: 367733, best cost: 8.80 us, worst cost: 11.60 us
num_fds: 459667, best cost: 9.20 us, worst cost: 12.20 us
num_fds: 574584, best cost: 10.00 us, worst cost: 12.40 us
num_fds: 718231, best cost: 11.00 us, worst cost: 13.40 us
num_fds: 897789, best cost: 12.80 us, worst cost: 15.80 us
num_fds: 1000000, best cost: 13.60 us, worst cost: 15.40 us
we are pretty good at the moment: the open() cost starts to increase at
around 100K open fds, both in the cache-cold and cache-hot case. (that
roughly corresponds to the fd bitmap falling out of the 32K L1 cache.)
With 1 million fds open in a single process, the fd bitmap alone has a
size of 128K.
so while it's certainly not 'urgent' to improve this, private fds are an
easier target for optimizations in this area, because they don't have the
continuity requirement anymore, so the fd bitmap is not a 'forced'
property of them.
Ingo
* Eric Dumazet <[email protected]> wrote:
> > speedup: i suggested O_ANY 6 years ago as a speedup to Apache -
> > non-linear fds are cheaper to allocate/map:
> >
> > http://www.mail-archive.com/[email protected]/msg23820.html
> >
> > (i definitely remember having written code for that too, but i
> > cannot find that in the archives. hm.) In theory we could avoid
> > _all_ fd-bitmap overhead as well and use a per-process list/pool of
> > struct file buffers plus a maximum-fd field as the 'non-linear fd
> > allocator' (at the price of only deallocating them at process exit
> > time).
>
> Only very few apps need to open more than 100.000 files.
yes. I did not list it as a primary reason for private fds, it's just a
nice side-effect. As long as the other apps are not hurt, i see no
problem in improving the >100K open files case.
> As these files are likely sockets, O_ANY is not a solution.
why not? It would be a natural thing to extend sys_socket() with a
'flags' parameter and pass in O_ANY (along with any other possible fd
parameter like O_NDELAY, which could be inherited over connect()).
> A trick is to try to keep first 64 handles freed, so that kernel wont
> consume too much cpu time and cache in get_unused_fd()
>
> http://lkml.org/lkml/2005/9/15/307
this is basically a user-space front-end cache to fd allocation - which
duplicates data needlessly. I don't see any problem with doing this in
the kernel. (Also, obviously 'first 64 handles' could easily break with
certain types of apps so glibc cannot do this.)
Ingo
* Ingo Molnar <[email protected]> wrote:
> (i definitely remember having written code for that too, but i cannot
> find that in the archives. hm.) In theory we could avoid _all_
> fd-bitmap overhead as well and use a per-process list/pool of struct
> file buffers plus a maximum-fd field as the 'non-linear fd allocator'
> (at the price of only deallocating them at process exit time).
btw., this also allows mostly-lockless fd allocation, which would
probably benefit threaded apps too. (we can just recycle it from a
per-CPU list of cached fds for that process)
Ingo
On Thu, May 31 2007, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > (i definitely remember having written code for that too, but i cannot
> > find that in the archives. hm.) In theory we could avoid _all_
> > fd-bitmap overhead as well and use a per-process list/pool of struct
> > file buffers plus a maximum-fd field as the 'non-linear fd allocator'
> > (at the price of only deallocating them at process exit time).
>
> btw., this also allows mostly-lockless fd allocation, which would
> probably benefit threaded apps too. (we can just recycle it from a
> per-CPU list of cached fds for that process)
See also:
http://lkml.org/lkml/2006/6/16/144
which originates from a much simpler patch I did to fix performance
regressions in this area for the SLES10 kernel.
--
Jens Axboe
* Albert Cahalan <[email protected]> wrote:
> Ingo Molnar writes:
>
> >looking over the list of our new generic APIs (see further below) i
> >think there are three important things that are needed for an API to
> >become widely used:
> >
> > 1) it should solve a real problem (ha ;-), it should be intuitive to
> > humans and it should fit into existing things naturally.
> >
> > 2) it should be ubiquitous. (if it's about IO it should cover block IO,
> > network IO, timers, signals and everything) Even if it might look
> > silly in some of the cases, having complete, utter, no compromises,
> > 100% coverage for everything massively helps the uptake of an API,
> > because it allows the user-space coder to pick just one paradigm
> > that is closest to his application and stick to it and only to it.
> >
> > 3) it should be end-to-end supported by glibc.
>
> 4) At least slightly portable.
>
> Anything supported by any similar OS is already ahead, even if it
> isn't the perfect API of our dreams. [...]
it might have been so a few years ago but it's changing slowly but
surely - BSD is becoming more and more irrelevant. What matters mostly
to app writers these days: "is it in most Linux distros" - and the key
to that is upstream kernel support and glibc support. The days of BSD
(and UNIX) are pretty much numbered. (I'm not against standardizing APIs
in POSIX of course - the BSDs tend to follow the Linux APIs in that area
with a few years lag.)
Ingo
On Thu, 31 May 2007 11:02:52 +0200
Ingo Molnar <[email protected]> wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > it's both a flexibility and a speedup thing as well:
> >
> > flexibility: for libraries to be able to open files and keep them open
> > comes up regularly. For example currently glibc is quite wasteful in a
> > number of common networking related functions (Ulrich, please correct
> > me if i'm wrong), which could be optimized if glibc could just keep a
> > netlink channel fd open and could poll() it for changes and cache the
> > results if there are no changes (or something like that).
> >
> > speedup: i suggested O_ANY 6 years ago as a speedup to Apache -
> > non-linear fds are cheaper to allocate/map:
> >
> > http://www.mail-archive.com/[email protected]/msg23820.html
> >
> > (i definitely remember having written code for that too, but i cannot
> > find that in the archives. hm.) In theory we could avoid _all_
> > fd-bitmap overhead as well and use a per-process list/pool of struct
> > file buffers plus a maximum-fd field as the 'non-linear fd allocator'
> > (at the price of only deallocating them at process exit time).
>
> to measure this i've written fd-scale-bench.c:
>
> http://redhat.com/~mingo/fd-scale-patches/fd-scale-bench.c
>
> which tests the (cache-hot or cache-cold) cost of open()-ing of two fds
> while there are N other fds already open: one is from the 'middle' of
> the range, one is from the end of it.
>
> Lets check our current 'extreme high end' performance with 1 million
> fds. (which is not realistic right now but there certainly are systems
> with over a hundred thousand open fds). Results from a fast CPU with 2MB
> of cache:
>
> cache-hot:
>
> # ./fd-scale-bench 1000000 0
> checking the cache-hot performance of open()-ing 1000000 fds.
> num_fds: 1, best cost: 1.40 us, worst cost: 2.00 us
> num_fds: 2, best cost: 1.40 us, worst cost: 1.40 us
> num_fds: 3, best cost: 1.40 us, worst cost: 2.00 us
> num_fds: 4, best cost: 1.40 us, worst cost: 1.40 us
> ...
> num_fds: 77117, best cost: 1.60 us, worst cost: 2.00 us
> num_fds: 96397, best cost: 2.00 us, worst cost: 2.20 us
> num_fds: 120497, best cost: 2.20 us, worst cost: 2.40 us
> num_fds: 150622, best cost: 2.20 us, worst cost: 3.00 us
> num_fds: 188278, best cost: 2.60 us, worst cost: 3.00 us
> num_fds: 235348, best cost: 2.80 us, worst cost: 3.80 us
> num_fds: 294186, best cost: 3.40 us, worst cost: 4.20 us
> num_fds: 367733, best cost: 4.00 us, worst cost: 5.00 us
> num_fds: 459667, best cost: 4.60 us, worst cost: 6.00 us
> num_fds: 574584, best cost: 5.60 us, worst cost: 8.20 us
> num_fds: 718231, best cost: 6.40 us, worst cost: 10.00 us
> num_fds: 897789, best cost: 7.60 us, worst cost: 11.80 us
> num_fds: 1000000, best cost: 8.20 us, worst cost: 9.60 us
>
> cache-cold:
>
> # ./fd-scale-bench 1000000 1
> checking the performance of open()-ing 1000000 fds.
> num_fds: 1, best cost: 4.60 us, worst cost: 7.00 us
> num_fds: 2, best cost: 5.00 us, worst cost: 6.60 us
> ...
> num_fds: 77117, best cost: 5.60 us, worst cost: 7.40 us
> num_fds: 96397, best cost: 5.60 us, worst cost: 7.40 us
> num_fds: 120497, best cost: 6.20 us, worst cost: 6.80 us
> num_fds: 150622, best cost: 6.40 us, worst cost: 7.60 us
> num_fds: 188278, best cost: 6.80 us, worst cost: 9.20 us
> num_fds: 235348, best cost: 7.20 us, worst cost: 8.80 us
> num_fds: 294186, best cost: 8.00 us, worst cost: 9.40 us
> num_fds: 367733, best cost: 8.80 us, worst cost: 11.60 us
> num_fds: 459667, best cost: 9.20 us, worst cost: 12.20 us
> num_fds: 574584, best cost: 10.00 us, worst cost: 12.40 us
> num_fds: 718231, best cost: 11.00 us, worst cost: 13.40 us
> num_fds: 897789, best cost: 12.80 us, worst cost: 15.80 us
> num_fds: 1000000, best cost: 13.60 us, worst cost: 15.40 us
>
> we are pretty good at the moment: the open() cost starts to increase at
> around 100K open fds, both in the cache-cold and cache-hot case. (that
> roughly corresponds to the fd bitmap falling out of the 32K L1 cache)
> At 1 million open fds in a single process, the fd bitmap alone is 128K.
>
> so while it's certainly not 'urgent' to improve this, private fds are an
> easier target for optimizations in this area, because they dont have the
> continuity requirement anymore, so the fd bitmap is not a 'forced'
> property of them.
Your numbers do not match mine (mine were more than two years old, so I redid a test before replying).
I tried your bench and found two problems:
- You scan half of the bitmap
- You incorrectly divide best_delta and worst_delta by LOOPS (5)
Try closing not a 'middle' fd but a really low one (10, for example), and the latency doubles.
With the corrected bench, cache-cold numbers are > 100 us on this Intel Pentium-M:
num_fds: 1000000, best cost: 120.00 us, worst cost: 131.00 us
On an Opteron x86_64 machine, results are better :)
num_fds: 1000000, best cost: 28.00 us, worst cost: 106.00 us
* Eric Dumazet <[email protected]> wrote:
> I tried your bench and found two problems :
> - You scan half of the bitmap
[...]
> Try to close not a 'middle fd', but a really low one (10 for example),
> and latencie is doubled.
that was intentional. I really didnt want to fabricate a worst-case
result but something more representative: in real apps the bitmap isnt
fully filled all the time and most of the find-bit sequences are short.
Hence the two fds, one of which comes from the middle of the range.
> - You incorrectly divide best_delta and worst_delta by LOOPS (5)
ah, indeed, that's a bug - victim of a last minute edit :) Since the
divisor is constant it doesnt really matter to the validity of the
relative nature of the slowdown (which is what i was interested in), but
you are right - i have fixed the download and have redone the numbers.
Here are the correct results from my box:
# ./fd-scale-bench 1000000 0
checking the cache-hot performance of open()-ing 1000000 fds.
num_fds: 1, best cost: 6.00 us, worst cost: 8.00 us
num_fds: 2, best cost: 6.00 us, worst cost: 7.00 us
...
num_fds: 31586, best cost: 7.00 us, worst cost: 8.00 us
num_fds: 39483, best cost: 8.00 us, worst cost: 8.00 us
num_fds: 49354, best cost: 7.00 us, worst cost: 9.00 us
num_fds: 61693, best cost: 8.00 us, worst cost: 10.00 us
num_fds: 77117, best cost: 8.00 us, worst cost: 13.00 us
num_fds: 96397, best cost: 9.00 us, worst cost: 11.00 us
num_fds: 120497, best cost: 10.00 us, worst cost: 14.00 us
num_fds: 150622, best cost: 11.00 us, worst cost: 13.00 us
num_fds: 188278, best cost: 12.00 us, worst cost: 15.00 us
num_fds: 235348, best cost: 14.00 us, worst cost: 20.00 us
num_fds: 294186, best cost: 16.00 us, worst cost: 22.00 us
num_fds: 367733, best cost: 19.00 us, worst cost: 35.00 us
num_fds: 459667, best cost: 22.00 us, worst cost: 37.00 us
num_fds: 574584, best cost: 26.00 us, worst cost: 40.00 us
num_fds: 718231, best cost: 31.00 us, worst cost: 62.00 us
num_fds: 897789, best cost: 37.00 us, worst cost: 54.00 us
num_fds: 1000000, best cost: 41.00 us, worst cost: 59.00 us
and cache-cold:
# ./fd-scale-bench 1000000 1
checking the cache-cold performance of open()-ing 1000000 fds.
num_fds: 1, best cost: 24.00 us, worst cost: 32.00 us
...
num_fds: 49354, best cost: 26.00 us, worst cost: 28.00 us
num_fds: 61693, best cost: 25.00 us, worst cost: 30.00 us
num_fds: 77117, best cost: 27.00 us, worst cost: 30.00 us
num_fds: 96397, best cost: 27.00 us, worst cost: 31.00 us
num_fds: 120497, best cost: 31.00 us, worst cost: 43.00 us
num_fds: 150622, best cost: 31.00 us, worst cost: 34.00 us
num_fds: 188278, best cost: 33.00 us, worst cost: 36.00 us
num_fds: 235348, best cost: 35.00 us, worst cost: 42.00 us
num_fds: 294186, best cost: 36.00 us, worst cost: 41.00 us
num_fds: 367733, best cost: 40.00 us, worst cost: 43.00 us
num_fds: 459667, best cost: 44.00 us, worst cost: 46.00 us
num_fds: 574584, best cost: 48.00 us, worst cost: 65.00 us
num_fds: 718231, best cost: 54.00 us, worst cost: 59.00 us
num_fds: 897789, best cost: 60.00 us, worst cost: 62.00 us
num_fds: 1000000, best cost: 65.00 us, worst cost: 68.00 us
> With the corrected bench, cache-cold numbers are > 100 us on this Intel
> Pentium-M:
>
> num_fds: 1000000, best cost: 120.00 us, worst cost: 131.00 us
>
> On an Opteron x86_64 machine, results are better :)
>
> num_fds: 1000000, best cost: 28.00 us, worst cost: 106.00 us
yeah. I quoted the full range because i was really more interested in
our current 'limit' range (which is somewhere between 50K and 100K open
fds), where the scanning cost becomes directly measurable, and in the
nature of the slowdown.
Ingo