Non-blocking writes to a tty will block if there is a blocking write
waiting for the atomic_write semaphore.
Peter
--- linux-2.5.70/drivers/char/tty_io.c.orig 2003-06-03 21:34:42.000000000 +0100
+++ linux-2.5.70/drivers/char/tty_io.c 2003-06-04 01:40:33.000000000 +0100
@@ -687,8 +687,13 @@
{
ssize_t ret = 0, written = 0;
- if (down_interruptible(&tty->atomic_write)) {
- return -ERESTARTSYS;
+ if (file->f_flags & O_NONBLOCK) {
+ if (down_trylock(&tty->atomic_write))
+ return -EAGAIN;
+ }
+ else {
+ if (down_interruptible(&tty->atomic_write))
+ return -ERESTARTSYS;
}
if ( test_bit(TTY_NO_WRITE_SPLIT, &tty->flags) ) {
lock_kernel();
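
The decision this hunk makes can be modelled in user space. The following is only a sketch, not kernel code: `model_sem`, `model_down_trylock`, `model_write_lock` and `MODEL_WOULD_SLEEP` are hypothetical names, the model is single-threaded, and the blocking path merely reports that the caller would sleep instead of sleeping.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for tty->atomic_write: just a "held" flag. */
struct model_sem {
	bool held;
};

/* Marker used by this model only; it is not a kernel return code. */
#define MODEL_WOULD_SLEEP 1

/* Like down_trylock(): returns 0 if the lock was taken, nonzero if not. */
static int model_down_trylock(struct model_sem *sem)
{
	if (sem->held)
		return 1;
	sem->held = true;
	return 0;
}

/*
 * Mirrors the patched entry to the write path: a non-blocking caller
 * gets -EAGAIN when the lock is contended, a blocking caller would
 * sleep (reported here as MODEL_WOULD_SLEEP).
 */
static int model_write_lock(struct model_sem *sem, bool nonblock)
{
	if (model_down_trylock(sem) == 0)
		return 0;
	return nonblock ? -EAGAIN : MODEL_WOULD_SLEEP;
}
```

With the lock free, either kind of caller gets 0; once it is held, the two kinds of caller diverge as the patch intends.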
On Wed, Jun 04, 2003 at 01:58:02AM +0100, P. Benie wrote:
> - if (down_interruptible(&tty->atomic_write)) {
> - return -ERESTARTSYS;
> + if (file->f_flags & O_NONBLOCK) {
> + if (down_trylock(&tty->atomic_write))
> + return -EAGAIN;
> + }
> + else {
The else should be on the same line as the closing brace, else
the patch looks fine.
On Wed, 4 Jun 2003, Christoph Hellwig wrote:
>
> The else should be on the same line as the closing brace, else
> the patch looks fine.
No no no, it's wrong.
If you do something like this, then you also have to teach "select()"
about this, otherwise you just get busy looping in applications.
In general, we shouldn't do this, unless somebody can show an application
where it really matters. Taking internal kernel locking into account for
"blockingness" easily gets quite complicated, and there is seldom any real
point to it.
Remember: perfect is the enemy of good. I'll happily apply the patch (if
it also updates the tty poll() functionality), _if_ there is some
real-world situation where it matters.
Linus
On Wed, 4 Jun 2003, Linus Torvalds wrote:
> No no no, it's wrong.
>
> If you do something like this, then you also have to teach "select()"
> about this, otherwise you just get busy looping in applications.
>
> In general, we shouldn't do this, unless somebody can show an application
> where it really matters.
I wrote the patch to solve a real-world problem with wall(1), which
occasionally gets stuck writing to somebody's tty. I think it's reasonable
for wall to assume that non-blocking writes are non-blocking.
I'll think about how to do the patch correctly.
Peter
We ran into this problem here in an embedded environment. It causes
syslogd to hang and when this happens, everybody who talks to syslogd
hangs. Which means you may not even be able to login. In the end we used
exactly the same fix which seems to work.
I am curious to know the correct fix.
> On Wed, 4 Jun 2003, Christoph Hellwig wrote:
> >
> > The else should be on the same line as the closing brace, else
> > the patch looks fine.
>
> No no no, it's wrong.
>
> If you do something like this, then you also have to teach "select()"
> about this, otherwise you just get busy looping in applications.
>
> In general, we shouldn't do this, unless somebody can show an application
> where it really matters. Taking internal kernel locking into account for
> "blockingness" easily gets quite complicated, and there is seldom any real
> point to it.
>
> Remember: perfect is the enemy of good. I'll happily apply the patch (if
> it also updates the tty poll() functionality), _if_ there is some
> real-world situation where it matters.
>
> Linus
>
> -
> To unsubscribe from this list: send the line "unsubscribe
> linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
On Mer, 2003-06-04 at 15:35, Linus Torvalds wrote:
> In general, we shouldn't do this, unless somebody can show an application
> where it really matters. Taking internal kernel locking into account for
> "blockingness" easily gets quite complicated, and there is seldom any real
> point to it.
Hanging shutdown is the obvious one. With 2.0/2.2 we had a similar
problem and fixed it.
On Wed, 4 Jun 2003, Hua Zhong wrote:
>
> We ran into this problem here in an embedded environment. It causes
> syslogd to hang and when this happens, everybody who talks to syslogd
> hangs. Which means you may not even be able to login. In the end we used
> exactly the same fix which seems to work.
>
> I am curious to know the correct fix.
[ First off: your embedded syslog problem is fixed by making sure that
syslog doesn't try to write to a tty that somebody else might be
blocked on. In other words, to me it sounds like a "well, don't do that
then" scenario, rather than a real kernel problem. ]
[ Secondly, you should all realize that O_NONBLOCK has _never_ meant that
the IO can't ever block. Even O_NONBLOCK reads and writes will always
block on things like having page faults on the user buffer, and a lot of
drivers still use the kernel lock and will block on that. O_NONBLOCK is
not an absolute "this is atomic" thing, it's a "don't wait for data if
there is none" thing ]
With that in mind, if you feel strongly about this particular path, then I
can only warn you that the correct fix actually looks fairly hard, as far
as I can tell. Yes, the posted patch is a small part of it, but the more
complex side is how to make poll() agree with the write semantics that the
posted patch changed.
If you have a write() that returns -EAGAIN, and a poll() function that
says "it's ok to write", any select-loop based application will start
busy-looping calling poll/write, and use up 100% CPU time.
Which may be acceptable for some users, of course, but what you're doing
with the simple patch is just replacing one bug with another one. And I
personally think the bug you're introducing is the worse one.
But which bug you "prefer" ends up depending entirely on the machine load
and usage - the current behaviour has clearly not ended up in very many
complaints, and even if the patch fixes it for those few people who didn't
like the historical behaviour, it may well end up breaking a hell of a lot
more distributions that until now were perfectly happy.
For example: what happens when your real-time application starts
busy-looping due to this? Right. The system is totally _dead_, since the
application that is busy writing to the tty will never be scheduled.
And yes, something like syslogd could easily be marked high-priority in
some setup. You do NOT want to make it busy-loop.
As to how to expand the patch to avoid the busy-loop: it's definitely
non-trivial. Semaphores do not have poll() qualities, and I don't see a
good way to get them. Something like
	static unsigned int tty_poll(struct file * filp, poll_table * wait)
	{
		struct tty_struct * tty;
		struct semaphore *sem;
		int retval;

		tty = (struct tty_struct *)filp->private_data;
		if (tty_paranoia_check(tty, filp->f_dentry->d_inode->i_rdev, "tty_poll"))
			return 0;

		/* down_trylock() returns nonzero when the semaphore is busy */
		sem = &tty->atomic_write;
		if (down_trylock(sem)) {
			/* peeks at semaphore internals: queue on its waitqueue */
			poll_wait(filp, &sem->wait, wait);
			if (down_trylock(sem))
				return 0;	/* still held: not writable yet */
		}

		retval = 0;
		if (tty->ldisc.poll)
			retval = tty->ldisc.poll(tty, filp, wait);
		up(sem);
		return retval;
	}
MIGHT work, but as you can see it actually now depends on knowing the
internals of the semaphore implementation, and quite frankly I don't know
if it works at all. As a result, I'm not horribly keen on the idea.
And as I tried to explain, I'm also not horribly keen on having a write()
that doesn't match poll() and can cause busy looping.
Linus
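
The busy-loop described in this message can be illustrated with a small single-threaded model. This is a sketch under stated assumptions, not tty code: `model_tty`, `model_poll_writable`, `model_write` and `model_event_loop` are hypothetical, and "time" is simply the number of calls made while another writer holds the lock.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model: another writer holds the lock for `ticks_held` more calls. */
struct model_tty {
	int ticks_held;
};

/* A poll() that knows nothing about the lock: always reports "writable". */
static bool model_poll_writable(const struct model_tty *tty)
{
	(void)tty;
	return true;
}

/* A write() with the trylock change: -EAGAIN while the lock is held. */
static int model_write(struct model_tty *tty)
{
	if (tty->ticks_held > 0) {
		tty->ticks_held--;	/* time passes for the lock holder */
		return -EAGAIN;
	}
	return 1;			/* one byte written */
}

/* A select-style loop; returns how many poll/write rounds were wasted. */
static int model_event_loop(struct model_tty *tty)
{
	int wasted = 0;

	for (;;) {
		if (!model_poll_writable(tty))
			continue;	/* a real loop would sleep in poll() here */
		if (model_write(tty) == -EAGAIN) {
			wasted++;	/* woke up, wrote nothing, goes round again */
			continue;
		}
		return wasted;
	}
}
```

Because the poll side never reports "not writable", every round while the lock is held is pure CPU burn; that is the mismatch a poll-aware fix would have to remove.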
On Wed, 4 Jun 2003, Hua Zhong wrote:
> We ran into this problem here in an embedded environment. It causes
> syslogd to hang and when this happens, everybody who talks to syslogd
> hangs. Which means you may not even be able to login. In the end we used
> exactly the same fix which seems to work.
I get this problem with writing to a remote syslog server, if the remote
syslog server hangs up or crashes no one can login to the machine that is
writing to the syslog server, even when the syslog server comes back.
Mike
On 4 Jun 2003, Alan Cox wrote:
>
> On Mer, 2003-06-04 at 15:35, Linus Torvalds wrote:
> > In general, we shouldn't do this, unless somebody can show an application
> > where it really matters. Taking internal kernel locking into account for
> > "blockingness" easily gets quite complicated, and there is seldom any real
> > point to it.
>
> Hanging shutdown is the obvious one. With 2.0/2.2 we had a similar
> problem and fixed it.
As I tried to point out, the current patch on the table doesn't actually
"fix" anything, in that it can break things even _worse_ than the current
situation.
A much better fix might well be to actually not allow over-long tty writes
at all, and thus avoid the "block out" thing at the source of the problem,
instead of trying to make programs who play nice be the ones that suffer.
If somebody does a 1MB write to a tty, do we actually have any reason to
try to make it so damn atomic and not return early?
Linus
On Wed, 4 Jun 2003, Hua Zhong wrote:
> This particular patch is in 2.4.20 already. There is another patch in
> 2.4.20 (?) which seems to fix the "main problem" (the n_tty_write_wakeup
> function in n_tty.c), but I didn't verify it.
Yes - that's because I submitted the patch ages ago. All that means is
that the distributions are relying on it, not that the patch is correct!
Peter
> -----Original Message-----
> From: Linus Torvalds [mailto:[email protected]]
> Sent: Wednesday, June 04, 2003 10:42 AM
> To: Hua Zhong
> Cc: 'Christoph Hellwig'; 'P. Benie'; 'Kernel Mailing List'
> Subject: RE: [PATCH] [2.5] Non-blocking write can block
>
>
>
> On Wed, 4 Jun 2003, Hua Zhong wrote:
> >
> > We ran into this problem here in an embedded environment. It causes
> > syslogd to hang and when this happens, everybody who talks to syslogd
> > hangs. Which means you may not even be able to login. In the end we
> > used exactly the same fix which seems to work.
> >
> > I am curious to know the correct fix.
>
> [ First off: your embedded syslog problem is fixed by making sure that
> syslog doesn't try to write to a tty that somebody else might be
> blocked on. In other words, to me it sounds like a "well, don't do that
> then" scenario, rather than a real kernel problem. ]
It's hard. The shell might be printing and you cannot prevent that.
That said, the main problem was somebody could be stuck in waiting for
tty *forever* and thus everyone who tries to write also hangs.
This particular patch is in 2.4.20 already. There is another patch in
2.4.20 (?) which seems to fix the "main problem" (the n_tty_write_wakeup
function in n_tty.c), but I didn't verify it.
Linus Torvalds <[email protected]> writes:
> On Wed, 4 Jun 2003, Hua Zhong wrote:
> >
> > We ran into this problem here in an embedded environment. It causes
> > syslogd to hang and when this happens, everybody who talks to syslogd
> > hangs. Which means you may not even be able to login. In the end we used
> > exactly the same fix which seems to work.
> >
> > I am curious to know the correct fix.
>
> [ First off: your embedded syslog problem is fixed by making sure that
> syslog doesn't try to write to a tty that somebody else might be
> blocked on. In other words, to me it sounds like a "well, don't do that
> then" scenario, rather than a real kernel problem. ]
One day I managed to keep myself from logging in or su'ing or
doing a number of things that needed the log for quite a while
by accidentally hitting Scroll Lock on a console that syslog was
set up to log to. I suppose the answer is "don't do that" but it
was a mysterious problem for several minutes that day.
--
"Let others praise ancient times; I am glad I was born in these."
--Ovid (43 BC-18 AD)
On Wed, 4 Jun 2003, P. Benie wrote:
> On Wed, 4 Jun 2003, Hua Zhong wrote:
> > This particular patch is in 2.4.20 already. There is another patch in
> > 2.4.20 (?) which seems to fix the "main problem" (the n_tty_write_wakeup
> > function in n_tty.c), but I didn't verify it.
>
> Yes - that's because I submitted the patch ages ago. All that means is
> that the distributions are relying on it, not that the patch is correct!
Sorry Hua, I wasn't reading your mail correctly. Please ignore the above
comment.
Peter
On Wed, 4 Jun 2003, Hua Zhong wrote:
>
> That said, the main problem was somebody could be stuck in waiting for
> tty *forever* and thus everyone who tries to write also hangs.
>
> This particular patch is in 2.4.20 already. There is another patch in
> 2.4.20 (?) which seems to fix the "main problem" (the n_tty_write_wakeup
> function in n_tty.c), but I didn't verify it.
Do you have that other patch handy? It sounds like that is the real cause
of the problem, and the patch quoted originally in this thread was a
(broken) work-around..
Linus
> Do you have that other patch handy? It sounds like that is
> the real cause of the problem, and the patch quoted originally
> in this thread was a (broken) work-around..
>
> Linus
>
Something like this:
--- n_tty.c.old 2003-06-04 12:28:36.000000000 -0700
+++ n_tty.c 2003-06-04 12:28:51.000000000 -0700
@@ -711,6 +711,23 @@
return 0;
}
+
+/*
+ * Required for the ptys, serial driver etc. since processes
+ * that attach themselves to the master and rely on ASYNC
+ * IO must be woken up
+ */
+
+static void n_tty_write_wakeup(struct tty_struct *tty)
+{
+ if (tty->fasync)
+ {
+ set_bit(TTY_DO_WRITE_WAKEUP, &tty->flags);
+ kill_fasync(&tty->fasync, SIGIO, POLL_OUT);
+ }
+ return;
+}
+
static void n_tty_receive_buf(struct tty_struct *tty, const unsigned char *cp,
char *fp, int count)
{
@@ -1157,6 +1174,8 @@
while (nr > 0) {
ssize_t num = opost_block(tty, b, nr);
if (num < 0) {
+ if (num == -EAGAIN)
+ break;
retval = num;
goto break_out;
}
@@ -1236,6 +1255,6 @@
normal_poll, /* poll */
n_tty_receive_buf, /* receive_buf */
n_tty_receive_room, /* receive_room */
- 0 /* write_wakeup */
+ n_tty_write_wakeup /* write_wakeup */
};
On Wed, 4 Jun 2003, Linus Torvalds wrote:
>
> A much better fix might well be to actually not allow over-long tty writes
> at all, and thus avoid the "block out" thing at the source of the problem,
> instead of trying to make programs who play nice be the ones that suffer.
>
> If somebody does a 1MB write to a tty, do we actually have any reason to
> try to make it so damn atomic and not return early?
The problem isn't to do with large writes. It's to do with any sequence of
writes that fills up the receive buffer, which is only 4K for N_TTY. If
the receiving program is suspended, the buffer will fill sooner or later.
I am half-tempted by this style of fix, but I can't help but feel that
we'll discover a huge set of programs that assume short writes never
happen if they aren't playing with signals.
It's also not as easy a fix as it sounds: for blocking writes, we've gone
into ldisc.write and then into tty->driver->write before we discover
that we can't write any bytes, by which time we already have the
write semaphore. I suspect that it requires just as much effort to ensure
that this case is handled correctly as it does to stop the non-blocking
write/poll loop.
I compared 2.4.20 and 2.5.70 to see if I could find the patch Hua
referred to. n_tty.c and pty.c look almost the same - I don't think the
patch is in 2.4.20.
Peter
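
The worry about programs that assume short writes never happen comes down to whether callers loop on write(). A minimal sketch of such a loop follows; `write_all` is a hypothetical helper name, and it assumes a blocking descriptor (on a non-blocking one it simply gives up with EAGAIN).

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Write all `len` bytes, retrying after short writes and EINTR.
 * Returns 0 on success, -1 with errno set on any other error
 * (including EAGAIN on a non-blocking descriptor).
 */
static int write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted before writing anything */
			return -1;
		}
		p += n;			/* short write: advance and go round again */
		len -= (size_t)n;
	}
	return 0;
}
```

A program written this way would not care whether the tty layer returned early; one that calls write() once and ignores the return value would silently lose output.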
On Wed, 4 Jun 2003, P. Benie wrote:
>
> The problem isn't to do with large writes. It's to do with any sequence of
> writes that fills up the receive buffer, which is only 4K for N_TTY. If
> the receiving program is suspended, the buffer will fill sooner or later.
Well, even then we could just drop the "write_atomic" lock.
The thing is, I don't know what the tty atomicity guarantees are. I know
what they are for pipes (quite reasonable), but tty's?
Linus
There is a missing piece in the previous mail. The complete patch is as
follows. I just googled it and the author is Sapan Bhatia. Cc-ed.
diff -urN linux-old/drivers/char/CVS/Entries linux/drivers/char/CVS/Entries
--- linux-old/drivers/char/CVS/Entries 2003-06-04 12:57:32.000000000 -0700
+++ linux/drivers/char/CVS/Entries 2003-06-04 13:01:35.000000000 -0700
@@ -194,5 +194,5 @@
D/pcmcia////
D/rio////
/tty_io.c/1.3/Wed Jun 4 19:28:08 2003//
-/pty.c/1.1/Wed Jun 4 19:57:20 2003//T1.1
-/n_tty.c/1.1/Wed Jun 4 19:57:32 2003//T1.1
+/n_tty.c/1.2/Wed Jun 4 20:01:32 2003//
+/pty.c/1.2/Wed Jun 4 20:01:32 2003//
diff -urN linux-old/drivers/char/n_tty.c linux/drivers/char/n_tty.c
--- linux-old/drivers/char/n_tty.c 2003-06-04 13:00:51.000000000 -0700
+++ linux/drivers/char/n_tty.c 2003-06-04 13:01:32.000000000 -0700
@@ -711,6 +711,23 @@
return 0;
}
+
+/*
+ * Required for the ptys, serial driver etc. since processes
+ * that attach themselves to the master and rely on ASYNC
+ * IO must be woken up
+ */
+
+static void n_tty_write_wakeup(struct tty_struct *tty)
+{
+ if (tty->fasync)
+ {
+ set_bit(TTY_DO_WRITE_WAKEUP, &tty->flags);
+ kill_fasync(&tty->fasync, SIGIO, POLL_OUT);
+ }
+ return;
+}
+
static void n_tty_receive_buf(struct tty_struct *tty, const unsigned char *cp,
char *fp, int count)
{
@@ -1157,6 +1174,8 @@
while (nr > 0) {
ssize_t num = opost_block(tty, b, nr);
if (num < 0) {
+ if (num == -EAGAIN)
+ break;
retval = num;
goto break_out;
}
@@ -1236,6 +1255,6 @@
normal_poll, /* poll */
n_tty_receive_buf, /* receive_buf */
n_tty_receive_room, /* receive_room */
- 0 /* write_wakeup */
+ n_tty_write_wakeup /* write_wakeup */
};
diff -urN linux-old/drivers/char/pty.c linux/drivers/char/pty.c
--- linux-old/drivers/char/pty.c 2003-06-04 13:01:04.000000000 -0700
+++ linux/drivers/char/pty.c 2003-06-04 13:01:32.000000000 -0700
@@ -331,6 +331,7 @@
clear_bit(TTY_OTHER_CLOSED, &tty->link->flags);
wake_up_interruptible(&pty->open_wait);
set_bit(TTY_THROTTLED, &tty->flags);
+ set_bit(TTY_DO_WRITE_WAKEUP, &tty->flags);
/* Register a slave for the master */
if (tty->driver.major == PTY_MASTER_MAJOR)
tty_register_devfs(&tty->link->driver,
Oops, yes, that patch is already in 2.5. It got merged in 2.4 sometime
between 2.4.17 and 2.4.20..
> I compared 2.4.20 and 2.5.70 to see if I could find the patch Hua
> referred to. n_tty.c and pty.c look almost the same - I don't think the
> patch is in 2.4.20.
>
> Peter
>
>
On Wed, 4 Jun 2003, Linus Torvalds wrote:
> On Wed, 4 Jun 2003, P. Benie wrote:
> > The problem isn't to do with large writes. It's to do with any sequence of
> > writes that fills up the receive buffer, which is only 4K for N_TTY. If
> > the receiving program is suspended, the buffer will fill sooner or later.
>
> Well, even then we could just drop the "write_atomic" lock.
>
> The thing is, I don't know what the tty atomicity guarantees are. I know
> what they are for pipes (quite reasonable), but tty's?
We don't have a PIPE_BUF-style atomicity guarantee, even though this would
be quite useful. This lock is only used to prevent simultaneous writes
from being interleaved. I've always assumed that writes shouldn't be
interleaved, but I can't quote a source for that.
Peter
On Mer, 2003-06-04 at 18:57, Linus Torvalds wrote:
> A much better fix might well be to actually not allow over-long tty writes
> at all, and thus avoid the "block out" thing at the source of the problem,
> instead of trying to make programs who play nice be the ones that suffer.
>
> If somebody does a 1MB write to a tty, do we actually have any reason to
> try to make it so damn atomic and not return early?
I would be concerned as to what applications rely on the tty write being done
completely before returning. OTOH I can't see any reason we can't drop the
atomicity part without dropping the "a 1MB write will eventually write 1MB"
property. That would not seem to be a problem unless POSIX says otherwise?
On Wed, Jun 04, 2003 at 08:46:51PM +0100, P. Benie wrote:
> The problem isn't to do with large writes. It's to do with any sequence of
> writes that fills up the receive buffer, which is only 4K for N_TTY. If
> the receiving program is suspended, the buffer will fill sooner or later.
If the tty driver's buffer fills, we don't sleep in tty->driver->write,
but we return zero instead. If we are in non-blocking mode, and we
haven't written any characters, we return -EAGAIN. If we have, we
return the number of characters which the tty driver accepted.
However, the problem you are referring to is what happens if you have
a blocking process blocked in write_chan() in n_tty.c, and we have
a non-blocking process trying to write to the same tty.
Reading POSIX, it doesn't seem to be clear about our area of interest,
and I'd even say that it seems to be unspecified.
What are the pipe semantics in this case? According to my reading of
POSIX write(), if you have a blocked blocking writer, a non-blocking
writer should receive EAGAIN. It would seem sensible to apply the
same rules to terminal devices as well as pipes.
--
Russell King ([email protected]) The developer of ARM Linux
http://www.arm.linux.org.uk/personal/aboutme.html
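
The pipe behaviour referred to here can be observed from user space: once a pipe is full, a non-blocking write fails with EAGAIN and poll() agrees that the descriptor is not writable. The sketch below is an illustration only; `pipe_full_write_and_poll_agree` is a hypothetical helper, and the 4096-byte chunk size is an assumption chosen so that the pipe fills in whole units on Linux.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/*
 * Fill a pipe through a non-blocking write end, then ask poll()
 * whether it is writable. Returns 1 if write() failed with EAGAIN
 * and poll() also reported "not writable", 0 if they disagreed,
 * -1 on setup failure.
 */
static int pipe_full_write_and_poll_agree(void)
{
	int fds[2];
	char chunk[4096];
	struct pollfd pfd;
	int flags, saved_errno, ready;

	if (pipe(fds) != 0)
		return -1;
	flags = fcntl(fds[1], F_GETFL);
	if (flags < 0 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) != 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	memset(chunk, 'x', sizeof(chunk));
	while (write(fds[1], chunk, sizeof(chunk)) > 0)
		;	/* keep going until the pipe refuses more data */
	saved_errno = errno;

	pfd.fd = fds[1];
	pfd.events = POLLOUT;
	pfd.revents = 0;
	ready = poll(&pfd, 1, 0);	/* timeout 0: just sample the state */

	close(fds[0]);
	close(fds[1]);
	return (saved_errno == EAGAIN || saved_errno == EWOULDBLOCK) && ready == 0;
}
```

This is the agreement between write() and poll() that the tty path would also need if the trylock change were adopted.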
On Thu, 5 Jun 2003, Russell King wrote:
> On Wed, Jun 04, 2003 at 08:46:51PM +0100, P. Benie wrote:
> > The problem isn't to do with large writes. It's to do with any sequence of
> > writes that fills up the receive buffer, which is only 4K for N_TTY. If
> > the receiving program is suspended, the buffer will fill sooner or later.
>
> If the tty drivers buffer fills, we don't sleep in tty->driver->write,
> but we return zero instead. If we are in non-blocking mode, and we
> haven't written any characters, we return -EAGAIN. If we have, we
> return the number of characters which the tty driver accepted.
>
> However, the problem you are referring to is what happens if you have
> a blocking process blocked in write_chan() in n_tty.c, and we have
> a non-blocking process trying to write to the same tty.
>
> Reading POSIX, it doesn't seem to be clear about our area of interest,
> and I'd even say that it seems to be unspecified.
Given that a problem exists for certain apps, and given that the proposed
fix will *at least* cause existing apps to behave funny, couldn't this be
implemented as a feature of the fd (default off)?
Something like O_REALLYNONBLOCK :)
- Davide
On Wed, June 04, 2003 at 4:47 PM, Davide Libenzi wrote:
>
> On Thu, 5 Jun 2003, Russell King wrote:
>
> > On Wed, Jun 04, 2003 at 08:46:51PM +0100, P. Benie wrote:
> > > The problem isn't to do with large writes. It's to do with any
> > > sequence of writes that fills up the receive buffer, which is only
> > > 4K for N_TTY. If the receiving program is suspended, the buffer
> > > will fill sooner or later.
> >
> > If the tty driver's buffer fills, we don't sleep in tty->driver->write,
> > but we return zero instead. If we are in non-blocking mode, and we
> > haven't written any characters, we return -EAGAIN. If we have, we
> > return the number of characters which the tty driver accepted.
> >
> > However, the problem you are referring to is what happens if you have
> > a blocking process blocked in write_chan() in n_tty.c, and we have
> > a non-blocking process trying to write to the same tty.
> >
> > Reading POSIX, it doesn't seem to be clear about our area of interest,
> > and I'd even say that it seems to be unspecified.
>
> Given that a problem exists for certain apps, and given that the proposed
> fix will *at least* cause existing apps to behave funny, couldn't this be
> implemented as a feature of the fd (default off)?
> Something like O_REALLYNONBLOCK :)
>
Davide,
Do you mean something like the separate O_NDELAY flag under Solar*s? IIRC
they also use return code EWOULDBLOCK to differentiate the "could not get
resource" cases from the "no room for more data" cases when O_NONBLOCK is
used.
Cheers,
Ed
On Wed, 4 Jun 2003, Ed Vance wrote:
> Do you mean something like the separate O_NDELAY flag under Solar*s? IIRC
> they also use return code EWOULDBLOCK to differentiate the "could not get
> resource" cases from the "no room for more data" cases when O_NONBLOCK is
> used.
Besides the stupid name O_REALLYNONBLOCK, it really should be different
from both O_NONBLOCK and O_NDELAY. Currently in Linux they both map to the
same value, so you really need a new value to not break binary compatibility.
- Davide
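
Whether the two flags share a value on a given platform can be checked at compile time from <fcntl.h>. A trivial sketch, with `ndelay_aliases_nonblock` as a hypothetical helper name; the result depends on the platform and architecture, so no particular answer is assumed here.

```c
#include <fcntl.h>

/* Returns 1 if this platform gives O_NDELAY and O_NONBLOCK the same value. */
static int ndelay_aliases_nonblock(void)
{
	return O_NDELAY == O_NONBLOCK;
}
```

Where this returns 1, no program can ask for one behaviour without the other, which is why a distinct semantic would need a distinct flag value.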
On Wed, Jun 04, 2003 at 05:19:05PM -0700, Davide Libenzi wrote:
> Besides the stupid name O_REALLYNONBLOCK, it really should be different
> from both O_NONBLOCK and O_NDELAY. Currently in Linux they both map to the
> same value, so you really need a new value to not break binary compatibility.
Hmm, wouldn't that be source and binary compatibility? If an app used
O_NDELAY and O_NONBLOCK interchangeably, then a change to O_NDELAY would
break source compatibility too.
Also, what do other UNIX OSes do? Do they have separate semantics for
O_NONBLOCK and O_NDELAY? If so, then it would probably be better to change
O_NDELAY to be similar and add another feature at the same time as reducing
platform specific coding in userspace.
On Thu, 5 Jun 2003, Mike Fedyk wrote:
> On Wed, Jun 04, 2003 at 05:19:05PM -0700, Davide Libenzi wrote:
> > Besides the stupid name O_REALLYNONBLOCK, it really should be different
> > from both O_NONBLOCK and O_NDELAY. Currently in Linux they both map to the
> > same value, so you really need a new value to not break binary compatibility.
>
> Hmm, wouldn't that be source and binary compatibility? If an app used
> O_NDELAY and O_NONBLOCK interchangeably, then a change to O_NDELAY would
> break source compatibility too.
>
> Also, what do other UNIX OSes do? Do they have separate semantics for
> O_NONBLOCK and O_NDELAY? If so, then it would probably be better to change
> O_NDELAY to be similar and add another feature at the same time as reducing
> platform specific coding in userspace.
> -
My Sun thinks that O_NDELAY = 0x04 and O_NONBLOCK = 0x80, FWIW.
Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.
On Thu, Jun 05, 2003 at 03:15:19PM -0400, Richard B. Johnson wrote:
> On Thu, 5 Jun 2003, Mike Fedyk wrote:
> > Also, what do other UNIX OSes do? Do they have separate semantics for
> > O_NONBLOCK and O_NDELAY? If so, then it would probably be better to change
> > O_NDELAY to be similar and add another feature at the same time as reducing
> > platform specific coding in userspace.
>
> My Sun thinks that O_NDELAY = 0x04 and O_NONBLOCK = 0x80, FWIW.
Same here on my AT&T System V ES/MP box.
As far as semantics go, the two appear to be identical except that
O_NDELAY always returns 0 on a blocking condition while O_NONBLOCK
usually returns EAGAIN and only occasionally returns 0.
Joe
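
Code that has to cope with both conventions ends up classifying return values by hand. The sketch below assumes the call in question is read(), where a return of 0 under the O_NDELAY convention is indistinguishable from end of file unless the caller knows which mode it asked for; `read_would_block` and its `ndelay_style` parameter are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Classify the result of a read() on a descriptor in non-blocking mode.
 * `ndelay_style` selects the O_NDELAY convention described above, where
 * "nothing available yet" is reported as a return value of 0.
 */
static bool read_would_block(ssize_t ret, int err, bool ndelay_style)
{
	if (ret < 0)
		return err == EAGAIN || err == EWOULDBLOCK;
	if (ret == 0)
		return ndelay_style;	/* otherwise 0 means end of file */
	return false;
}
```

A caller would pass the return value of read() and the errno it saw immediately afterwards.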
[cc list trimmed]
On Thu, 5 Jun 2003, Mike Fedyk wrote:
> On Wed, Jun 04, 2003 at 05:19:05PM -0700, Davide Libenzi wrote:
> > Besides the stupid name O_REALLYNONBLOCK, it really should be different
> > from both O_NONBLOCK and O_NDELAY. Currently in Linux they both map to the
> > same value, so you really need a new value to not break binary compatibility.
>
> Hmm, wouldn't that be source and binary compatibility? If an app used
> O_NDELAY and O_NONBLOCK interchangeably, then a change to O_NDELAY would
> break source compatibility too.
Oh, that's for sure.
> Also, what do other UNIX OSes do? Do they have separate semantics for
> O_NONBLOCK and O_NDELAY? If so, then it would probably be better to change
> O_NDELAY to be similar and add another feature at the same time as reducing
> platform specific coding in userspace.
If I remember it correctly, they differ in the return value that you get
from blocking-candidate functions (0 <-> -1).
- Davide
>>> "Richard B. Johnson" <[email protected]> 06/05/03 03:15PM
>>>
>On Thu, 5 Jun 2003, Mike Fedyk wrote:
> On Wed, Jun 04, 2003 at 05:19:05PM -0700, Davide Libenzi wrote:
> > Besides the stupid name O_REALLYNONBLOCK, it really should be different
> > from both O_NONBLOCK and O_NDELAY. Currently in Linux they both map to the
> > same value, so you really need a new value to not break binary compatibility.
>
> Hmm, wouldn't that be source and binary compatibility? If an app used
> O_NDELAY and O_NONBLOCK interchangeably, then a change to O_NDELAY would
> break source compatibility too.
>
> Also, what do other UNIX OSes do? Do they have separate semantics for
> O_NONBLOCK and O_NDELAY? If so, then it would probably be better to change
> O_NDELAY to be similar and add another feature at the same time as reducing
> platform specific coding in userspace.
> -
>My Sun thinks that O_NDELAY = 0x04 and O_NONBLOCK = 0x80, FWIW.
AIX 4.3.3 O_NDELAY = 0x8000 and O_NONBLOCK = 0x04 FWTW.
Nik
>Cheers,
>Dick Johnson
>Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
>Why is the government concerned about the lunatic fringe? Think about it.
(A quick food-for-thought note here...)
I hate to bring it up, but I am fairly certain that this argument is
part-and-parcel of the original AT&T "STREAMS" (as opposed to "Streams" or
"streams", talk about namespace pollution 8-) interface.
Basically most devices (particularly character and network devices) were
implemented as a pair of queues of messages (one upstream one downstream)
and a writer would compose his entire text into a message (a linked list of
message buffers) which could be atomically placed on the write queue without
any consideration for device internal buffering.
[Between the "head" of the file handle and the actual driver you could
interpose translators and such by forming chains of queues, but that isn't
germane here.]
In such a usage, the writeability of a device is predicated on the
availability of a message buffer from some pool and some (in AT&T's case
some "lame-ass slightly less than Linux's idea of a") kernel thread does
the pumping from the list to the device logic itself.
Without some sort of "above the driver but below the write" dynamic
buffering you almost inevitably get into a squeeze-the-balloon bug shuffling
situation. The three bugs already mentioned here are write-interleaving,
odd blocking, and busy waiting.
Basically the peek-ahead on writeability is the relatively simple test: "is
there a free buffer in the pool of any size?" (because writeability is "can
accept at least one character.") and the anti-interleaving warrant is set at
multiples of "smallest buffer pool entry size".
The important thing is that the only lock contention window takes place for
the predictable time of putting the filled buffer onto or getting it off of
the queue (linked list pointer juggling time).
So why did I bring it up?
It seems to me that this can't really be fixed unilaterally at the driver
level without putting in a heck of a lot of scaffolding (e.g. evil STREAMS
8-).
The specific fix, however, might be easy at a fairly high level (line
discipline level?) with a fairly straight-forward linked list of buffers
thing. With a fixed (or variable sized for that matter) pool of buffers,
the poll() could become a simple look-aside for a buffer well before the
branching internal logic/locking is otherwise referenced. A
false-positive/race on the poll-to-get-buffer branch remains possible, but
unlikely to loop.
Basically you would be buying the correction in several ways:
1) raise the granularity. With the warrant for an atomic write raised to
"one buffer" you increase the likelihood that any one operation will get in
and get out all at once.
2) add an insulating layer. With the buffers rotating in and out via
pointer juggling, a very-short (spinlock) duration locking behavior is
placed between the fast world of the calling processes and the slow world of
serial I/O (and its analogues)
3) dynamic buffering. Since the writes are called with reasonably sleepable
user context (e.g. enough latitude to do a memory allocate) "extra buffers"
are "always available" (though at the far end of this is a nasty DoS attack
8-) if circumstances or tuning makes such allocation desirable.
4) look aside used to determine write space availability. (really a "2a"
thought) the list and pool of buffers approach would net you a look-aside
for the question of "can I write" so there is no contention between what the
driver is actually doing and the applicability of the question itself.
5) POSIX (et al) is (if I recall) silent about the size and nature of the
write buffer for a tty device.
uh... but it's just a thought... 8-)
Rob.
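
The buffer-pool idea in this message can be sketched in user space. Everything below is hypothetical (`msg_pool`, `pool_can_write`, `pool_write`, `pool_drain_one`, and the sizes), it is single-threaded, and the short lock that the message says should protect the pointer juggling is left out for brevity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define POOL_BUFS 4
#define BUF_SIZE  64

struct msg_buf {
	struct msg_buf *next;
	size_t len;
	char data[BUF_SIZE];
};

struct msg_pool {
	struct msg_buf bufs[POOL_BUFS];
	struct msg_buf *free_list;	/* look-aside for "can I write?" */
	struct msg_buf *q_head;		/* filled buffers waiting for the driver */
	struct msg_buf *q_tail;
};

static void pool_init(struct msg_pool *p)
{
	size_t i;

	p->free_list = NULL;
	p->q_head = p->q_tail = NULL;
	for (i = 0; i < POOL_BUFS; i++) {
		p->bufs[i].next = p->free_list;
		p->free_list = &p->bufs[i];
	}
}

/* The poll() side: writable means "a free buffer exists". */
static bool pool_can_write(const struct msg_pool *p)
{
	return p->free_list != NULL;
}

/* The write() side: copy up to one buffer and queue it; 0 means would block. */
static size_t pool_write(struct msg_pool *p, const char *src, size_t len)
{
	struct msg_buf *b = p->free_list;

	if (!b || len == 0)
		return 0;
	p->free_list = b->next;
	b->len = len < BUF_SIZE ? len : BUF_SIZE;
	memcpy(b->data, src, b->len);
	b->next = NULL;
	if (p->q_tail)
		p->q_tail->next = b;
	else
		p->q_head = b;
	p->q_tail = b;
	return b->len;
}

/* The driver side: take the oldest buffer off the queue and recycle it. */
static size_t pool_drain_one(struct msg_pool *p)
{
	struct msg_buf *b = p->q_head;
	size_t len;

	if (!b)
		return 0;
	p->q_head = b->next;
	if (!p->q_head)
		p->q_tail = NULL;
	len = b->len;
	b->next = p->free_list;
	p->free_list = b;
	return len;
}
```

The point of the design is that `pool_can_write` and `pool_write` consult the same free list, so the poll answer and the write outcome cannot disagree for long, and each write is atomic up to one buffer.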
P. Benie wrote:
> On Wed, 4 Jun 2003, Linus Torvalds wrote:
> > On Wed, 4 Jun 2003, P. Benie wrote:
> > > The problem isn't to do with large writes. It's to do with any sequence of
> > > writes that fills up the receive buffer, which is only 4K for N_TTY. If
> > > the receiving program is suspended, the buffer will fill sooner or later.
> >
> > Well, even then we could just drop the "write_atomic" lock.
> >
> > The thing is, I don't know what the tty atomicity guarantees are. I know
> > what they are for pipes (quite reasonable), but tty's?
> We don't have a PIPE_BUF-style atomicity guarantee, even though this would
> be quite useful. This lock is only used to prevent simultaneous writes
> from being interleaved. I've always assumed that writes shouldn't be
> interleaved, but I can't quote a source for that.