2007-06-03 18:52:47

by Aaron Wiebe

Subject: slow open() calls and o_nonblock

Greetings all. I'm not on this list, so I apologize if this subject
has been covered before. (Also, please cc me in the response.)

I've spent the last several months trying to work around the lack of a
decent disk AIO interface. I'm starting to wonder if one exists
anywhere. The short version:

I have written a daemon that needs to open several thousand files a
minute and write a small amount of data to each file. After extensive
research, I ended up going with the POSIX AIO kludgy pthreads wrapper
in glibc to handle my writes due to the time constraints of writing my
own pthreads handler into the application.

The problem with this equation is that opens, closes and non-readwrite
operations (fchmod, fcntl, etc) have no interface in posix aio. Now I
was under the assumption that given open and close operations are
comparatively less common than the write operations, this wouldn't be
a huge problem. My tests seemed to reflect that.

I went to production with this yesterday to discover that under
production load, our filesystems (nfs on netapps) were substantially
slower than I was expecting. open() calls are taking upwards of 2
seconds on occasion, and usually ~20ms.

Now, Netapp speed aside, O_NONBLOCK and O_DIRECT seem to make zero
difference to my open times. Example:

open("/somefile", O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 <0.415147>

Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
call with a nonblocking flag return EAGAIN if it's going to take
anywhere near 415ms? Is there a way I can force opens to EAGAIN if
they take more than 10ms?

Thanks for any help you folks can offer.

-Aaron Wiebe

(ps. having come from the socket side of the fence, it's incredibly
frustrating to be unable to poll() or epoll regular file FDs --
especially knowing that the kernel is translating them into a TCP
socket to do NFS anyway. Please add regular files to epoll and give
me a way to do the opens in the same fashion as connects!)


2007-06-03 19:16:31

by Davide Libenzi

Subject: Re: slow open() calls and o_nonblock

On Sun, 3 Jun 2007, Aaron Wiebe wrote:

> (ps. having come from the socket side of the fence, it's incredibly
> frustrating to be unable to poll() or epoll regular file FDs --
> especially knowing that the kernel is translating them into a TCP
> socket to do NFS anyway. Please add regular files to epoll and give
> me a way to do the opens in the same fashion as connects!)

You may want to follow Ingo and Zach work on syslets/threadlets. If that
goes in, you can make *any* syscall asynchronous.
I ended up writing a userspace library, to cover the same exact problem
you have:

http://www.xmailserver.org/guasi.html

I basically host an epoll_wait (containing all my sockets, pipes, etc)
inside a GUASI async request, where other non-pollable async requests are
hosted. So guasi_fetch() becomes my main event collector, and when the
epoll_wait async request shows up, I handle all the events in there.
This is a *very-trivial* HTTP server using such solution (coroutines,
epoll and GUASI):

http://www.xmailserver.org/cghttpd-home.html



- Davide


2007-06-03 23:56:19

by John Stoffel

Subject: Re: slow open() calls and o_nonblock

>>>>> "Aaron" == Aaron Wiebe <[email protected]> writes:

More details on which kernel you're using and which distro would be
helpful. Also, more details on your App and reasons why you're
writing bunches of small files would help as well.

Aaron> Greetings all. I'm not on this list, so I apologize if this subject
Aaron> has been covered before. (Also, please cc me in the response.)

Aaron> I've spent the last several months trying to work around the lack of a
Aaron> decent disk AIO interface. I'm starting to wonder if one exists
Aaron> anywhere. The short version:

Aaron> I have written a daemon that needs to open several thousand
Aaron> files a minute and write a small amount of data to each file.

How large are these files? Are they all in a single directory? How
many files are in the directory?

Ugh. Why don't you just write to a DB instead? It sounds like you're
writing small records, with one record to a file. It can work, but
when you're doing thousands per-minute, the open/close overhead is
starting to dominate. Can you just amortize that overhead across a
bunch of writes instead by writing to a single file which is more
structured for your needs?

Aaron> After extensive research, I ended up going with the POSIX AIO
Aaron> kludgy pthreads wrapper in glibc to handle my writes due to the
Aaron> time constraints of writing my own pthreads handler into the
Aaron> application.

Aaron> The problem with this equation is that opens, closes and
Aaron> non-readwrite operations (fchmod, fcntl, etc) have no interface
Aaron> in posix aio. Now I was under the assumption that given open
Aaron> and close operations are comparatively less common than the
Aaron> write operations, this wouldn't be a huge problem. My tests
Aaron> seemed to reflect that.

Aaron> I went to production with this yesterday to discover that under
Aaron> production load, our filesystems (nfs on netapps) were
Aaron> substantially slower than I was expecting. open() calls are
Aaron> taking upwards of 2 seconds on occasion, and usually ~20ms.

Netapps usually scream for NFS writes and such, so it sounds to me
that you've blown out the NVRAM cache on the box. Can you elaborate
more on your hardware & Network & Netapp setup?

Of course, you could also be using sucky NFS configuration, so we need
to see your mount options as well. You are using TCP and NFSv3,
right? And a large wsize/rsize values too?

Have you also checked your NetApp to make sure you have the following
options turned OFF:

nfs.per_client_stats.enable
nfs.mountd_trace

Seeing your exports file and output of 'options nfs' would help.

Aaron> Now, Netapp speed aside, O_NONBLOCK and O_DIRECT seem to make
Aaron> zero difference to my open times. Example:

Aaron> open("/somefile", O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 <0.415147>

Aaron> Now, I'm a userspace guy so I can be pretty dense, but
Aaron> shouldn't a call with a nonblocking flag return EAGAIN if it's
Aaron> going to take anywhere near 415ms? Is there a way I can force
Aaron> opens to EAGAIN if they take more than 10ms?

The problem is that O_NONBLOCK on a file open doesn't make sense. You
either open it, or you don't. How long it takes to complete isn't part
of the spec.

But in this case, I think you're doing something hokey with your data
design. You should be opening just a handful of files and then
streaming your writes to those files. You'll get much more
performance.

Also, have you tried writing to a local disk instead of via NFS to see
how local disk speed is?

Aaron> (ps. having come from the socket side of the fence, it's
Aaron> incredibly frustrating to be unable to poll() or epoll regular
Aaron> file FDs -- especially knowing that the kernel is translating
Aaron> them into a TCP socket to do NFS anyway. Please add regular
Aaron> files to epoll and give me a way to do the opens in the same
Aaron> fashion as connects!)

epoll isn't going to help you much here, it's the open which is
causing the delay, not the writing to the file itself.

Maybe you need to be caching more of your writes into memory on the
client side, and then streaming them to the NetApp later on when you
know you can write a bunch of data at once.

But honestly, I think you've done a bad job architecting your
application's backend data store and you really need to re-think it
through.

Heck, I'm not even much of a programmer, I'm a SysAdmin who runs
Netapps and talks the users into more sane ways of getting better
performance out of their applications. *grin*.

John

2007-06-04 00:27:18

by David Schwartz

Subject: RE: slow open() calls and o_nonblock


> Now, Netapp speed aside, O_NONBLOCK and O_DIRECT seem to make zero
> difference to my open times. Example:
>
> open("/somefile", O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 <0.415147>

How could they make any difference? I can't think of any conceivable way
they could.

> Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
> call with a nonblocking flag return EAGAIN if it's going to take
> anywhere near 415ms? Is there a way I can force opens to EAGAIN if
> they take more than 10ms?

There is no way you can re-try the request. The open must either succeed or
not return a handle. It is not like a 'read' operation that has an "I didn't
do anything, and you can retry this request" option.

If 'open' returns a file handle, you can't retry it (since it must succeed
in order to do that, failure must not return a handle). If your 'open'
doesn't return a file handle, you can't retry it either (because, without a
handle, there is no way to associate a future request with this one; if it
creates a file, the file must not be created unless you call 'open' again).

The 'open' function must, at minimum, confirm that the file exists (or
doesn't exist and can be created, or whatever). This takes however long it
takes on NFS.

You need either threads or a working asynchronous system call interface.
Short of that, you need your own NFS client code.

DS


2007-06-04 01:05:53

by Al Viro

Subject: Re: slow open() calls and o_nonblock

On Sun, Jun 03, 2007 at 05:27:06PM -0700, David Schwartz wrote:
>
> > Now, Netapp speed aside, O_NONBLOCK and O_DIRECT seem to make zero
> > difference to my open times. Example:
> >
> > open("/somefile", O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 <0.415147>

> The 'open' function must, at minimum, confirm that the file exists (or
> doesn't exist and can be created, or whatever). This takes however long it
> takes on NFS.
>
> You need either threads or a working asynchronous system call interface.
> Short of that, you need your own NFS client code.

BTW, why close these suckers all the time? It's not that kernel would
be unable to hold thousands of open descriptors for your process...
Hash descriptors by pathname and be done with that; don't bother with
close unless you decide that you've got too many of them (e.g. when you
get a hash conflict).

2007-06-04 01:06:09

by Aaron Wiebe

Subject: Re: slow open() calls and o_nonblock

Hi John, thanks for responding. I'm using kernel 2.6.20 on a
home-grown distro.

I've responded to a few specific points inline - but as a whole,
Davide directed me to work that is being done specifically to address
these issues in the kernel, as well as a userspace implementation that
would allow me to sidestep this failing for the time being.


On 6/3/07, John Stoffel <[email protected]> wrote:
>
> How large are these files? Are they all in a single directory? How
> many files are in the directory?
>
> Ugh. Why don't you just write to a DB instead? It sounds like you're
> writing small records, with one record to a file. It can work, but
> when you're doing thousands per-minute, the open/close overhead is
> starting to dominate. Can you just amortize that overhead across a
> bunch of writes instead by writing to a single file which is more
> structured for your needs?

In short, I'm distributing logs in realtime for about 600,000
websites. The sources of the logs (http, ftp, realmedia, etc) are
flexible; however, the base framework was built around a large cluster
of webservers. The output can be to several hundred thousand files
across about two dozen filers for user consumption - some can be very
active, some can be completely inactive.

> Netapps usually scream for NFS writes and such, so it sounds to me
> that you've blown out the NVRAM cache on the box. Can you elaborate
> more on your hardware & Network & Netapp setup?

You're totally correct here - Netapp has told us as much about our
filesystem design; we use too much RAM on the filer itself. It's true
that the application would handle just fine if our filesystem
structure were redesigned - I am approaching this from an application
perspective, though. These units are capable of the raw IO; it's the
simple fact that open calls are taking a while. If I were to thread
off the application (and Davide has been kind enough to provide some
libraries that will make that substantially easier), the problem
wouldn't exist.

> The problem is that O_NONBLOCK on files open doesn't make sense. You
> either open it, or you don't. How long it takes to complete isn't part
> of the spec.

You can certainly open the file, but not block on the call to do it.
What confuses me is why the kernel would "block" for 415ms on an open
call. That's an eternity to suspend a process that has to distribute
data such as this.

> But in this case, I think you're doing something hokey with your data
> design. You should be opening just a handful of files and then
> streaming your writes to those files. You'll get much more
> performance.

Except I can't very well keep 600,000 files open over NFS. :) Pool
and queue, and cycle through the pool. I've managed to achieve a
balance in my production deployment with this method - my email was
more of a rant after months of trying to work around a problem (caused
by a limitation in system calls), only to have it present an order of
magnitude worse than I expected. Sorry for not giving more
information off the line - and thanks for your time.

-Aaron

2007-06-04 01:19:30

by Bernd Eckenfels

Subject: Re: slow open() calls and o_nonblock

In article <[email protected]> you wrote:
> (ps. having come from the socket side of the fence, it's incredibly
> frustrating to be unable to poll() or epoll regular file FDs --
> especially knowing that the kernel is translating them into a TCP
> socket to do NFS anyway. Please add regular files to epoll and give
> me a way to do the opens in the same fashion as connects!)

You might want to use Windows? :)

Gruss
Bernd

2007-06-04 01:21:29

by NeilBrown

Subject: Re: slow open() calls and o_nonblock

On Sunday June 3, [email protected] wrote:
>
> You can certainly open the file, but not block on the call to do it.
> What confuses me is why the kernel would "block" for 415ms on an open
> call. Thats an eternity to suspend a process that has to distribute
> data such as this.

Have you tried the "nocto" mount option for your NFS filesystems?

The cache-coherency rules of NFS require the client to check with the
server at each open. If you are the sole client on this filesystem,
then you don't need the same cache-coherency, and "nocto" will tell
the NFS client not to bother checking with the server if information
is available in cache.

This should speed up the time for open considerably.
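As a concrete example, "nocto" goes in with the client's other mount options; the server name, export path, and transfer sizes below are purely illustrative, not taken from this thread:

```shell
# Illustrative NFSv3-over-TCP mount with close-to-open coherency disabled.
# Only use nocto when no other client needs to see these files coherently.
mount -t nfs -o nocto,tcp,vers=3,rsize=32768,wsize=32768 \
    filer1:/vol/weblogs /mnt/weblogs
```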

NeilBrown

2007-06-04 01:26:14

by Bernd Eckenfels

Subject: Re: slow open() calls and o_nonblock

In article <[email protected]> you wrote:
> In short, I'm distributing logs in realtime for about 600,000
> websites. The sources of the logs (http, ftp, realmedia, etc) are
> flexible, however the base framework was build around a large cluster
> of webservers. The output can be to several hundred thousand files
> across about two dozen filers for user consumption - some can be very
> active, some can be completely inactive.

Assuming you have multiple request log summary files, I would just run
multiple "splitters".

> You can certainly open the file, but not block on the call to do it.
> What confuses me is why the kernel would "block" for 415ms on an open
> call. Thats an eternity to suspend a process that has to distribute
> data such as this.

Because it has to, to return the result with the given API.

But if you had an async interface, the operation would still take that
long, and your throughput would still be limited by the opens/sec your
filers support, no?

> Except I cant very well keep 600,000 files open over NFS. :) Pool
> and queue, and cycle through the pool. I've managed to achieve a
> balance in my production deployment with this method - my email was
> more of a rant after months of trying to work around a problem (caused
> by a limitation in system calls),

I agree that a unified async layer is nice from the programmers POV, but I
disagree that it would help your performance problem which is caused by NFS
and/or NetApp (and I wont blame them).

Gruss
Bernd

2007-06-04 03:57:20

by Albert Cahalan

Subject: RE: slow open() calls and o_nonblock

David Schwartz writes:
> [Aaron Wiebe]

>> open("/somefile", O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 <0.415147>
>
> How could they make any difference? I can't think of any
> conceivable way they could.
>
>> Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
>> call with a nonblocking flag return EAGAIN if it's going to take
>> anywhere near 415ms? Is there a way I can force opens to EAGAIN if
>> they take more than 10ms?
>
> There is no way you can re-try the request. The open must either
> succeed or not return a handle. It is not like a 'read' operation
> that has an "I didn't do anything, and you can retry this request"
> option.
>
> If 'open' returns a file handle, you can't retry it (since it must
> succeed in order to do that, failure must not return a handle).
> If your 'open' doesn't return a file handle, you can't retry it
> (because, without a handle, there is no way to associate a future
> request with this one, if it creates a file, the file must not be
> created if you don't call 'open' again).
>
> The 'open' function must, at minimum, confirm that the file exists
> (or doesn't exist and can be created, or whatever). This takes
> however long it takes on NFS.

This is not the case, though we might need to allocate a new
flag to avoid breaking things.

Let open() with O_UNCHECKED always return a file descriptor,
except perhaps when failure can be identified without doing IO.
The "real" open then proceeds in the background.

From poll() or select(), you can see that the file descriptor
is not ready for anything. Eventually it becomes ready for IO
or reports an error condition. Both select() and poll() are
capable of reporting errors. If the "real" (background) open()
fails, then the only valid operation is close(). Attempts to
do anything else get EBADFD or ESTALE.

You'll also need a background close().

2007-06-04 13:44:35

by Alan

Subject: Re: slow open() calls and o_nonblock

> Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
> call with a nonblocking flag return EAGAIN if it's going to take
> anywhere near 415ms?

Violation of causality. We don't know it will block for 415ms until 415ms
have elapsed.

Alan

2007-06-04 13:59:30

by Aaron Wiebe

Subject: Re: slow open() calls and o_nonblock

On 6/3/07, Neil Brown <[email protected]> wrote:
>
> Have you tried the "nocto" mount option for your NFS filesystems.
>
> The cache-coherency rules of NFS require the client to check with the
> server at each open. If you are the sole client on this filesystem,
> then you don't need the same cache-coherency, and "nocto" will tell
> the NFS client not to bother checking with the server if information
> is available in cache.

No I haven't - I will research this a little further today. While
we're not the only client using these filesystems, this process is
(currently) the only process that writes to these files. Thanks for
the suggestion.

-Aaron

2007-06-04 14:05:00

by Aaron Wiebe

Subject: Re: slow open() calls and o_nonblock

On 6/4/07, Alan Cox <[email protected]> wrote:
>
> > Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
> > call with a nonblocking flag return EAGAIN if it's going to take
> > anywhere near 415ms?
>
> Violation of causality. We don't know it will block for 415ms until 415ms
> have elapsed.

Understood - but what I'm getting at is more the fact that there
really doesn't appear to be any real implementation of nonblocking
open(). On the socket side of the fence, I would consider a regular
file open() to be equivalent to a connect() call - the difference
obviously being that we already have a handle for the socket.

The end result, however, is roughly the same. We have a file
descriptor with the endpoint established. In the socket world, we
assume that a nonblocking request will always return immediately and
the application is expected to come back around and see if the request
has completed. Regular files have no equivalent.

-Aaron

2007-06-04 14:17:58

by John Stoffel

Subject: Re: slow open() calls and o_nonblock

>>>>> "Aaron" == Aaron Wiebe <[email protected]> writes:

Aaron> On 6/4/07, Alan Cox <[email protected]> wrote:
>>
>> > Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
>> > call with a nonblocking flag return EAGAIN if it's going to take
>> > anywhere near 415ms?
>>
>> Violation of causality. We don't know it will block for 415ms until 415ms
>> have elapsed.

Aaron> Understood - but what I'm getting at is more the fact that
Aaron> there really doesn't appear to be any real implementation of
Aaron> nonblocking open(). On the socket side of the fence, I would
Aaron> consider a regular file open() to be equivalent to a connect()
Aaron> call - the difference obviously being that we already have a
Aaron> handle for the socket.

Aaron> The end result, however, is roughly the same. We have a file
Aaron> descriptor with the endpoint established. In the socket world,
Aaron> we assume that a nonblocking request will always return
Aaron> immediately and the application is expected to come back around
Aaron> and see if the request has completed. Regular files have no
Aaron> equivalent.

So how many files are in the directory where you're seeing the delays?
And what's the average size of the files in there?

John

2007-06-04 14:20:16

by Aaron Wiebe

Subject: Re: slow open() calls and o_nonblock

Replying to David Schwartz here.. (David, good to hear from you again
- haven't seen you around since the irc days :))

David Schwartz wrote:
>
> There is no way you can re-try the request. The open must either succeed or
> not return a handle. It is not like a 'read' operation that has an "I didn't
> do anything, and you can retry this request" option.
>
> If 'open' returns a file handle, you can't retry it (since it must succeed
> in order to do that, failure must not return a handle). If your 'open'
> doesn't return a file handle, you can't retry it (because, without a handle,
> there is no way to associate a future request with this one, if it creates a
> file, the file must not be created if you don't call 'open' again).

I understand, but this is exactly the situation that I'm complaining
about. There is no functionality to provide a nonblocking open - no
ability to come back around and retry a given open call.

> You need either threads or a working asynchronous system call interface.
> Short of that, you need your own NFS client code.

This is exactly my point - there is no asynchronous system call to do
this work, to my knowledge. I will likely fix this in my own code
using threads, but I see using threads in this case as working around
that lack of systems interface. Threads, imho, should be limited to
cases where I'm using them to distribute load across multiple
processors, not because the kernel interfaces for IO cannot support
nonblocking calls.

I'm speaking to my ideal world view - but any application I write
should not have to wait for the kernel if I don't want it to. I
should be able to submit my request, and come back to it later as I so
decide.

(And I did actually consider writing my own NFS client for about 5 minutes.)

Thanks for the response!
-Aaron

2007-06-04 14:24:34

by Aaron Wiebe

Subject: Re: slow open() calls and o_nonblock

On 6/4/07, John Stoffel <[email protected]> wrote:
>
> So how many files are in the directory where you're seeing the delays?
> And what's the average size of the files in there?

The directories themselves will have a maximum of 160 files, and the
files are maybe a few megs each - the delays are (as you pointed out
earlier) due to the ram restrictions and our filesystem design of very
deep directory structures that Netapps suck at.

My point is more generic though - I will come up with ways to handle
this problem in my application (probably with threads), but I'm
griping more about the lack of a kernel interface that would have
allowed me to avoid this.

-Aaron

2007-06-04 14:39:29

by Aaron Wiebe

Subject: Re: slow open() calls and o_nonblock

Sorry for the unthreaded responses, I wasn't cc'd here, so I'm
replying to these based on mailing list archives....

Al Viro wrote:

> BTW, why close these suckers all the time? It's not that kernel would
> be unable to hold thousands of open descriptors for your process...
> Hash descriptors by pathname and be done with that; don't bother with
> close unless you decide that you've got too many of them (e.g. when you
> get a hash conflict).

A valid point - I currently keep a pool of 4000 descriptors open and
cycle them out based on inactivity. I hadn't seriously considered
just keeping them all open, because I simply wasn't sure how well
things would go with 100,000 files open. Would my backend storage
keep up... would the kernel mind maintaining 100,000 files open over
NFS?

The majority of the files would simply be idle - I would be keeping
file handles open for no reason. Pooling allows me to substantially
drop the number of opens I require, but I am hesitant to blow the pool
size to substantially higher numbers. Can anyone shed light on any
issues that may come up with a massive pool size, such as 128k?

-Aaron

2007-06-04 15:42:50

by Trond Myklebust

Subject: Re: slow open() calls and o_nonblock

On Mon, 2007-06-04 at 10:20 -0400, Aaron Wiebe wrote:
> I understand, but this is exactly the situation that I'm complaining
> about. There is no functionality to provide a nonblocking open - no
> ability to come back around and retry a given open call.

So exactly how would you expect a nonblocking open to work? Should it be
starting I/O? What if that involves blocking? How would you know when to
try again?

Trond

2007-06-04 15:59:30

by Aaron Wiebe

Subject: Re: slow open() calls and o_nonblock

On 6/4/07, Trond Myklebust <[email protected]> wrote:
>
> So exactly how would you expect a nonblocking open to work? Should it be
> starting I/O? What if that involves blocking? How would you know when to
> try again?

Well, there's a bunch of options - some have been suggested in the
thread already. The idea of an open with O_NONBLOCK (or a different
flag) returning a handle immediately, and subsequent calls returning
EAGAIN if the open is incomplete, or ESTALE if it fails (with some
auxiliary method of getting the reason why it failed) are not too far
a stretch from my perspective.

The other option that comes to mind would be to add an interface that
behaves like sockets - get a handle from one system call, set it
nonblocking using fcntl, and use another call to attach it to a
regular file. This method would make the most sense to me - but its
also because I've worked with sockets in the past far far more than
with regular files.

The one that would take the least amount of work from the application
perspective would be to simply reply to the nonblocking open call with
EAGAIN (or something), and when an open on the same file is performed,
the kernel could have performed its work in the background. I can
understand, given the fact that there is no handle provided to the
application, that this idea could be sloppy.

I'm still getting caught up on some of the other suggestions (I'm
currently reading about the syslets work that Zach and Ingo are
doing), and it sounds like this is a common complaint that is being
addressed through a number of initiatives. I'm looking forward to
seeing where that work goes.

-Aaron

2007-06-04 16:26:25

by Aaron Wiebe

Subject: Re: slow open() calls and o_nonblock

Actually, let's see if I can summarize this more generically... I
realize I'm suggesting something that probably would be a massive
undertaking, but ..

Regular files are the only interface that requires an application to
wait. With any other case, the nonblocking interfaces are fairly
complete and easy to work with. If userspace could treat regular
files in the same fashion as sockets, life would be good.

I admittedly do not understand internal kernel semantics in the
differences between a socket and a regular file. Why couldn't we just
have a different 'socket type' like PF_FILE or something like this?

Abstracting any IO through the existing interfaces provided to sockets
would be ideal from my perspective. The code required to use a file
through these interfaces would be more complex in userspace, but the
abstraction of the current open() itself could simply be an aggregate
of these interfaces without a nonblocking flag.

It would, however, fix problems around issues with event-based
applications handling events from both disk and sockets. I can't
trigger disk read/write events in the same event handlers I use for
sockets (ie, poll or epoll). I end up having two separate event
handlers - one for disk (currently using glibc's aio thread kludge),
and one for sockets.

I'm sure this isn't a new idea. Coming from my own development
background that had little to do with disk, I was actually surprised
when I first discovered that I couldn't edge-trigger disk IO through
poll().

Thoughts, comments?

-Aaron

On 6/4/07, Aaron Wiebe <[email protected]> wrote:
> On 6/4/07, Trond Myklebust <[email protected]> wrote:
> >
> > So exactly how would you expect a nonblocking open to work? Should it be
> > starting I/O? What if that involves blocking? How would you know when to
> > try again?
>
> Well, there's a bunch of options - some have been suggested in the
> thread already. The idea of an open with O_NONBLOCK (or a different
> flag) returning a handle immediately, and subsequent calls returning
> EAGAIN if the open is incomplete, or ESTALE if it fails (with some
> auxiliary method of getting the reason why it failed) are not too far
> a stretch from my perspective.
>
> The other option that comes to mind would be to add an interface that
> behaves like sockets - get a handle from one system call, set it
> nonblocking using fcntl, and use another call to attach it to a
> regular file. This method would make the most sense to me - but its
> also because I've worked with sockets in the past far far more than
> with regular files.
>
> The one that would take the least amount of work from the application
> perspective would be to simply reply to the nonblocking open call with
> EAGAIN (or something), and when an open on the same file is performed,
> the kernel could have performed its work in the background. I can
> understand, given the fact that there is no handle provided to the
> application, that this idea could be sloppy.
>
> I'm still getting caught up on some of the other suggestions (I'm
> currently reading about the syslets work that Zach and Ingo are
> doing), and it sounds like this is a common complaint that is being
> addressed through a number of initiatives. I'm looking forward to
> seeing where that work goes.
>
> -Aaron
>

2007-06-04 19:46:29

by Trond Myklebust

Subject: Re: slow open() calls and o_nonblock

On Mon, 2007-06-04 at 12:26 -0400, Aaron Wiebe wrote:
> Actually, let's see if I can summarize this more generically... I
> realize I'm suggesting something that probably would be a massive
> undertaking, but ..
>
> Regular files are the only interface that requires an application to
> wait. With any other case, the nonblocking interfaces are fairly
> complete and easy to work with. If userspace could treat regular
> files in the same fashion as sockets, life would be good.
>
> I admittedly do not understand the internal kernel differences
> between a socket and a regular file. Why couldn't we just
> have a different 'socket type' like PF_FILE or something like this?
>
> Abstracting any IO through the existing interfaces provided to sockets
> would be ideal from my perspective. The code required to use a file
> through these interfaces would be more complex in userspace, but the
> abstraction of the current open() itself could simply be an aggregate
> of these interfaces without a nonblocking flag.
>
> It would, however, fix the problems event-based applications have in
> handling events from both disk and sockets. I can't
> trigger disk read/write events in the same event handlers I use for
> sockets (i.e., poll or epoll). I end up having two separate event
> handlers - one for disk (currently using glibc's aio thread kludge),
> and one for sockets.
>
> I'm sure this isn't a new idea. Coming from my own development
> background, which had little to do with disk, I was actually surprised
> when I first discovered that I couldn't edge-trigger disk IO through
> poll().
>
> Thoughts, comments?

Unless you're planning on rearchitecting the entire VFS lookup and
permissions code, you would basically have to fall back onto having a
pool of service threads actually perform the I/O. That can just as
easily be done today in userland.

AFAICS, syslets should give you the means to implement a more scalable
scheme, but we'll have to wait and see if/when those are ready for
kernel inclusion.

Cheers
Trond

2007-06-04 20:34:11

by David Schwartz

Subject: RE: slow open() calls and o_nonblock


Aaron Wiebe wrote:


> David Schwartz wrote:

> > There is no way you can re-try the request. The open must
> > either succeed or
> > not return a handle. It is not like a 'read' operation that has
> > an "I didn't
> > do anything, and you can retry this request" option.

> > If 'open' returns a file handle, you can't retry it (since it
> > must succeed
> > in order to do that, failure must not return a handle). If your 'open'
> > doesn't return a file handle, you can't retry it (because,
> > without a handle,
> > there is no way to associate a future request with this one, if
> > it creates a
> > file, the file must not be created if you don't call 'open' again).

> I understand, but this is exactly the situation that I'm complaining
> about. There is no functionality to provide a nonblocking open - no
> ability to come back around and retry a given open call.

I agree. I'm addressing why things can't "just work", not arguing that they
aren't broken or should stay broken. ;)

I think a good solution would be to re-use the 'connect' and 'shutdown'
calls. You would need a new asynchronous flag to 'open' that would mean,
*really* don't block. You would have to follow up with 'connect' to complete
the actual opening -- the 'open' would just assign a file descriptor (unless
it could complete or error immediately, of course).

To asynchronously close such a socket, you simply call 'shutdown'. Once the
'shutdown' completes, 'close' would be guaranteed not to block.

Obviously, being able to 'poll' or 'select' would be a huge plus (while an
'open' or 'close' is in progress, of course, otherwise it would always
return immediate availability).

I think this covers all the bases and the only ugly API change is an extra
'open' flag. (Which I think is unavoidable.)
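As a hypothetical sketch of the proposed flow (the flag and the connect/shutdown semantics on regular files do not exist; this is only pseudocode for the idea above):

```
fd = open(path, O_WRONLY|O_CREAT|O_REALLY_NONBLOCK, 0644);
        /* fd assigned immediately; the open itself is incomplete */

connect(fd, NULL, 0);     /* EAGAIN until the lookup/create finishes */
poll(fd, POLLOUT);        /* readiness signals the open has completed */

/* ... reads and writes as usual ... */

shutdown(fd, SHUT_RDWR);  /* begins an asynchronous close */
poll(fd, POLLOUT);        /* readiness signals the flush is done */
close(fd);                /* now guaranteed not to block */
```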

> I'm speaking to my ideal world view - but any application I write
> should not have to wait for the kernel if I don't want it to. I
> should be able to submit my request, and come back to it later as I so
> decide.

A working generic asynchronous system call interface would be the best
solution, I think. But that may be further off than just an asynchronous
file open/close interface.

> (And I did actually consider writing my own NFS client for about
> 5 minutes.)

Yeah, what a pain that would be. The obvious counter-argument to what I
propose above is that it doesn't handle reads and writes, so why bother with
a complex partial solution?

DS