2002-07-31 11:49:18

by David Howells

[permalink] [raw]
Subject: manipulating sigmask from filesystems and drivers


Hi Linus, Alan,

Can you confirm that this is A Bad Thing(TM)? I've been poking around in the
OpenAFS filesystem driver, and it tries to achieve uninterruptible I/O waiting
by the following means:

/* CV_WAIT and CV_TIMEDWAIT rely on the fact that the Linux kernel has
* a global lock. Thus we can safely drop our locks before calling the
* kernel sleep services.
*/
static inline int CV_WAIT(afs_kcondvar_t *cv, afs_kmutex_t *l)
{
	int isAFSGlocked = ISAFS_GLOCK();
	sigset_t saved_set;
#ifdef DECLARE_WAITQUEUE
	DECLARE_WAITQUEUE(wait, current);
#else
	struct wait_queue wait = { current, NULL };
#endif

	add_wait_queue((wait_queue_head_t *)cv, &wait);
	set_current_state(TASK_INTERRUPTIBLE);

	if (isAFSGlocked) AFS_GUNLOCK();
	MUTEX_EXIT(l);

	spin_lock_irq(&current->sigmask_lock);
	saved_set = current->blocked;
	sigfillset(&current->blocked);
	recalc_sigpending(current);
	spin_unlock_irq(&current->sigmask_lock);

	schedule();
	remove_wait_queue((wait_queue_head_t *)cv, &wait);

	spin_lock_irq(&current->sigmask_lock);
	current->blocked = saved_set;
	recalc_sigpending(current);
	spin_unlock_irq(&current->sigmask_lock);

	if (isAFSGlocked) AFS_GLOCK();
	MUTEX_ENTER(l);

	return 0;
}

The reason for them doing this is so that they can get the process to appear
in the "S" state and thus avoid increasing the load average.

What I'm concerned about is that they wait for an event to happen by blocking
all signals (by accessing the process's signal masks directly) and then
sitting in TASK_INTERRUPTIBLE (which _mostly_ works, but ptrace(PTRACE_KILL)
can interrupt).

Can you comment on whether a driver is allowed to block signals like this, and
whether they should be waiting in TASK_UNINTERRUPTIBLE?
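For comparison, a minimal sketch of what the conventional version of that
wait would look like (same cv/mutex wrappers assumed); it drops the sigmask
games entirely, at the cost of the task showing up in "D" state and counting
towards the load average:

static inline int CV_WAIT_UNINT(afs_kcondvar_t *cv, afs_kmutex_t *l)
{
	int isAFSGlocked = ISAFS_GLOCK();
	DECLARE_WAITQUEUE(wait, current);

	add_wait_queue((wait_queue_head_t *)cv, &wait);
	set_current_state(TASK_UNINTERRUPTIBLE);

	if (isAFSGlocked) AFS_GUNLOCK();
	MUTEX_EXIT(l);

	schedule();
	remove_wait_queue((wait_queue_head_t *)cv, &wait);

	if (isAFSGlocked) AFS_GLOCK();
	MUTEX_ENTER(l);

	return 0;
}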

Cheers,
David


2002-07-31 11:54:50

by Alan Cox

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

> OpenAFS filesystem driver, and it tries to achieve uninterruptible I/O waiting
> by the following means:

Bletch

> The reason for them doing this is so that they can get the process to appear
> in the "S" state and thus avoid increasing the load average.

That's come up often enough that I wish there was a way to distinguish 'disk wait'
from uninterruptible. It's an old V7 handwaving load-average estimation trick
that has lived too long, IMHO.

> Can you comment on whether a driver is allowed to block signals like this, and
> whether they should be waiting in TASK_UNINTERRUPTIBLE?

NFS does something similar in hard mount mode.

2002-08-01 19:05:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers


On Wed, 31 Jul 2002, David Howells wrote:
>
> Can you comment on whether a driver is allowed to block signals like this, and
> whether they should be waiting in TASK_UNINTERRUPTIBLE?

They should be waiting in TASK_UNINTERRUPTIBLE, and we should add a flag
to distinguish between "increases load average" and "doesn't". So you
could have

TASK_WAKESIGNAL - wake on all signals
TASK_WAKEKILL - wake on signals that are deadly
TASK_NOSIGNAL - don't wake on signals
TASK_LOADAVG - counts toward loadaverage

#define TASK_UNINTERRUPTIBLE (TASK_NOSIGNAL | TASK_LOADAVG)
#define TASK_INTERRUPTIBLE TASK_WAKESIGNAL

and then people who wanted to could use other combinations. The
TASK_WAKEKILL thing is useful - there are many loops that cannot exit
until they have a result, simply because the calling conventions require
that. The most common example is disk wait.

HOWEVER, if they are killed by a signal, the calling convention doesn't
matter, and the read() or whatever could just return 0 (knowing that the
process will never see it), and leave a locked page locked. Things like
generic_file_read() could easily use this, and make processes killable
even when they are waiting for a hung NFS mount - regardless of any soft
mount issues, and without NFS having to have special code.
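To make that concrete, a rough sketch of a killable page wait under such a
scheme (TASK_WAKEKILL, TASK_LOADAVG and the process_killed() test are all
assumptions from the proposal above, not existing interfaces):

static int wait_on_page_killable(struct page *page)
{
	int ret = 0;
	DECLARE_WAITQUEUE(wait, current);

	add_wait_queue(&page->wait, &wait);
	while (PageLocked(page)) {
		set_current_state(TASK_WAKEKILL | TASK_LOADAVG);
		if (!PageLocked(page))
			break;
		schedule();
		if (process_killed(current)) {
			/* caller unwinds; the user never sees the short result */
			ret = -EINTR;
			break;
		}
	}
	set_current_state(TASK_RUNNING);
	remove_wait_queue(&page->wait, &wait);
	return ret;
}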

In the end, I'm too lazy, and I don't care. So I can only tell you how it
_should_ be done, and maybe you can tell somebody else until the sucker who
will actually do it is found.

Linus

2002-08-01 20:07:04

by David Woodhouse

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers


[email protected] said:
> They should be waiting in TASK_UNINTERRUPTIBLE, and we should add a
> flag to distinguish between "increases load average" and "doesn't".

The disadvantage of this approach is that it encourages people to be lazy
and sleep with signals disabled, instead of implementing proper cleanup
code.

I'm more in favour of removing TASK_UNINTERRUPTIBLE entirely, or at least
making people apply for a special licence to be permitted to use it :)

--
dwmw2


2002-08-01 20:18:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers


On Thu, 1 Aug 2002, David Woodhouse wrote:
>
> [email protected] said:
> > They should be waiting in TASK_UNINTERRUPTIBLE, and we should add a
> > flag to distinguish between "increases load average" and "doesn't".
>
> The disadvantage of this approach is that it encourages people to be lazy
> and sleep with signals disabled, instead of implementing proper cleanup
> code.
>
> I'm more in favour of removing TASK_UNINTERRUPTIBLE entirely, or at least
> making people apply for a special licence to be permitted to use it :)

Can't do that.

Easy reason: there are tons of code sequences that _cannot_ take signals.
The only way to make a signal go away is to actually deliver it, and there
are documented interfaces that are guaranteed to complete without
delivering a signal. The trivial case is a disk read: real applications
break if you return partial results in order to handle signals in the
middle.

In short, this is not something that can be discussed. It's a cold fact, a
law of UNIX if you will.

There are enough reasons to discourage people from using uninterruptible
sleep ("this f*cking application won't die when the network goes down")
that I don't think this is an issue. We need to handle both cases, and
while we can expand on the two cases we have now, we can't remove them.

Linus

2002-08-01 20:44:24

by Roman Zippel

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

Hi,

On Thu, 1 Aug 2002, Linus Torvalds wrote:

> Easy reason: there are tons of code sequences that _cannot_ take signals.
> The only way to make a signal go away is to actually deliver it, and there
> are documented interfaces that are guaranteed to complete without
> delivering a signal. The trivial case is a disk read: real applications
> break if you return partial results in order to handle signals in the
> middle.
>
> In short, this is not something that can be discussed. It's a cold fact, a
> law of UNIX if you will.

Any program setting up signal handlers should expect interrupted I/O;
otherwise it's buggy. If a program doesn't have any signal handlers, there
is no signal to deliver, so simple programs don't need to worry.

bye, Roman


2002-08-01 20:47:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers


On Thu, 1 Aug 2002, Roman Zippel wrote:
> >
> > In short, this is not something that can be discussed. It's a cold fact, a
> > law of UNIX if you will.
>
> Any program setting up signal handlers should expect interrupted I/O;
> otherwise it's buggy.

Roman, THAT IS JUST NOT TRUE!

Go read the standards. Some IO is not interruptible. This is not something
I'm making up, and it is not something that can be argued about. The
speed of light in vacuum is 'c', regardless of your own relative speed.
And file reads are not interruptible.

Linus

2002-08-01 21:12:17

by Roman Zippel

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

Hi,

On Thu, 1 Aug 2002, Linus Torvalds wrote:

> Go read the standards. Some IO is not interruptible.

Which standard? Which "some IO"?
Some programs rely on interruptible IO, e.g. to implement timeouts.
/me is confused.

bye, Roman

2002-08-01 21:38:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers


On Thu, 1 Aug 2002, Roman Zippel wrote:

> > Go read the standards. Some IO is not interruptible.
>
> Which standard? Which "some IO"?

Any regular file IO is supposed to give you the full result.

If you write() to a file, and get a partial return value back due to a
signal, there are programs that will assume that the disk is full (there
are also programs that will just lose your data).

This is not "sloppy programming". See the read() system call manual, which
says

Upon successful completion, read(), readv(), and pread() return the number
of bytes actually read and placed in the buffer. The system guarantees to
read the number of bytes requested if the descriptor references a normal
file that has that many bytes left before the end-of-file, but in no other
case.

Note the "The system guarantees to read the number of bytes requested .."
part.

Stop arguing about this. It's a FACT.

Linus

2002-08-01 22:26:12

by David Woodhouse

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers


[email protected] said:
> Any regular file IO is supposed to give you the full result.

read(2) is permitted to return -EINTR. Granted, we shouldn't allow it to be
interrupted and return a partial read after the point we start to
copy_to_user(), but before then it's fair game.

Regular file I/O through the page cache is inherently restartable, anyway,
as long as you're careful about fpos.

There are better examples where you really can't have a cleanup path without
severe pain, even using ERESTARTNOINTR, and I was only joking about removing
TASK_UNINTERRUPTIBLE _entirely_ -- but the point remains that reducing its
usage would be nice.

--
dwmw2


2002-08-01 22:32:21

by Roman Zippel

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

Hi

On Thu, 1 Aug 2002, Linus Torvalds wrote:

> This is not "sloppy programming". See the read() system call manual, which
> says
>
> Upon successful completion, read(), readv(), and pread() return the number
> of bytes actually read and placed in the buffer. The system guarantees to
> read the number of bytes requested if the descriptor references a normal
> file that has that many bytes left before the end-of-file, but in no other
> case.
>
> Note the "The system guarantees to read the number of bytes requested .."
> part.

Relying on the fd always pointing to a normal file is only asking
for trouble.

> Stop arguing about this. It's a FACT.

Linus, it's not that I don't want to believe you, but e.g. the SUS doesn't
make that special exception.
Installing signal handlers and not expecting EINTR _is_ sloppy
programming.

bye, Roman

2002-08-01 22:37:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers


On Thu, 1 Aug 2002, David Woodhouse wrote:
>
> [email protected] said:
> > Any regular file IO is supposed to give you the full result.
>
> read(2) is permitted to return -EINTR.

It is _not_ allowed to do that for regular UNIX filesystems.

It is allowed to return it for things like pipes, sockets, etc, and for
filesystems that do not have UNIX behaviour.

> Regular file I/O through the page cache is inherently restartable, anyway,
> as long as you're careful about fpos.

It's not the kernel side that is not restartable. It's the _user_ side.
There is 30 _years_ of history on this, and there are programs that have
been programmed to follow the existing documentation.

And the existing documentation says that if you return a partial read from
a normal file, that means EOF for that file.

You may not like it, but that doesn't make it less so. Linux has UNIX
semantics for read(). Linux is not a research project where we change
fundamental semantics just because we don't like it. That's final.

Linus

2002-08-01 22:46:50

by David Woodhouse

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers


[email protected] said:
> It's not the kernel side that is not restartable. It's the _user_
> side.

As I said, you can't allow it to be interrupted after you've started the
copy_to_user(). Show me how the userspace program can prove the signal
arrived _after_ the 'int 0x80' had trapped into the kernel rather than
beforehand, and I'll accept that you can't allow read() to be interrupted
even before the copy_to_user() starts.

But I also agree that there are other, better, examples of why
TASK_UNINTERRUPTIBLE has to be used sometimes. Page faults in
copy_{from,to}_user are probably one such.

It's just that it doesn't have to be scattered all over the place just
because people are too lazy to do the cleanup code.

Yeah -- sometimes it's hard. So go shopping.

--
dwmw2


2002-08-01 23:27:24

by Linus Torvalds

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers


[ I'm having a really hard time not shouting "SHUT THE FUCK UP ABOUT THIS
ALREADY!" at the top of my lungs and then running around the office with
a chainsaw, laughing maniacally. But I'll try. This once. ]

On Fri, 2 Aug 2002, Roman Zippel wrote:
>
> Relying on the fd always pointing to a normal file is only asking
> for trouble.

People _do_ rely on regular files working this way. Wake up and smell the
coffee, you cannot change reality by just arguing about it.

There are cases where you absolutely _have_ to rely on this documented
UNIX behaviour. One example is using a log-file (yes, a _file_, not a
socket or a pipe) that you explicitly opened with O_APPEND, just so that
you can guarantee _atomic_ writes that do not get lost or partially
re-ordered in your log.

And yes, these logging programs are mission-critical, and they do have
signals going on, and they rely on well-defined and documented interfaces
that say that doing a write() to a filesystem is _not_ going to return in
the middle just because a signal came in.

These programs know what they are doing. They are explicitly _not_ using
"stdio" to write the log-file, exactly because they cannot afford to have
the write broken up into many parts (and they do not want to have it
buffered either, since getting the logging out in a timely fashion can be
important).

The only _portable_ way to do this under UNIX is to create one single
buffer and write it out with one single write() call. Anything else is
likely to leave the file interspersed with random log fragments instead
of being a nice consecutive list of full log entries.
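Just to spell the pattern out, a minimal userspace sketch (the log file name
and prefix are made up, and error handling is trimmed for brevity):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void log_line(int fd, const char *msg)
{
	char buf[512];

	/* one buffer, one write(): with O_APPEND the entry lands in the
	 * file as a single unbroken record */
	snprintf(buf, sizeof(buf), "myapp: %s\n", msg);
	write(fd, buf, strlen(buf));
}

int main(void)
{
	int fd = open("/var/log/myapp.log", O_WRONLY | O_CREAT | O_APPEND, 0644);

	if (fd >= 0) {
		log_line(fd, "something interesting happened");
		close(fd);
	}
	return 0;
}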

Feel free to change Linux to have your stupid preferred semantics where
everybody is supposed to be able to handle signals at any time, but please
re-name it to "Crapix" when you do. Because that is what it would be.
Crap. Utter braindamage.

If people cannot find this in SuS, then I simply don't _care_. I care
about not having a crap OS, and I also care about not having to repeat
myself and give a million examples of why the current behaviour is
_required_, and why we're not getting rid of it.

[ Did profanity help explain the situation? Do people finally understand
why this _really_ isn't up for discussion? Please don't bother sending
me any more email about this. My co-workers are already eyeing me and
my chainsaw nervously. Thank you for sparing them. ]

Linus

2002-08-01 23:33:54

by Pete Zaitcev

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

>> They should be waiting in TASK_UNINTERRUPTIBLE, and we should add a
>> flag to distinguish between "increases load average" and "doesn't".
>
> The disadvantage of this approach is that it encourages people to be lazy
> and sleep with signals disabled, instead of implementing proper cleanup
> code.
>
> I'm more in favour of removing TASK_UNINTERRUPTIBLE entirely, or at least
> making people apply for a special licence to be permitted to use it :)
>
> --
> dwmw2

Consider this. An application writes to /dev/dsp0, and ymfpci
(for example) starts DMA. Then the user interrupts the app with ^C.
When ymfpci gets the ->release() call, it has to tell the chip
to stop DMA, then wait until that's complete. If it tries to
wait with TASK_INTERRUPTIBLE, schedule() will return immediately,
and in essence it does a busy loop with the CPU pegged at 100%.
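Roughly speaking (the completion test below is made up, but the shape is the
point), the naive interruptible wait turns into this once the ^C is pending:

	/* SIGINT is already pending, so schedule() puts us straight back on
	 * the runqueue instead of sleeping: this spins, it does not wait */
	while (!dma_stopped(chip)) {
		set_current_state(TASK_INTERRUPTIBLE);
		schedule();
	}
	set_current_state(TASK_RUNNING);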

The same thing happens in USB, only there it's worse: a spinning
application locks out khubd and the whole subsystem dies.

-- Pete

2002-08-01 23:43:15

by David Woodhouse

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers


[email protected] said:
> Consider this. An application writes to /dev/dsp0, and ymfpci (for
> example) starts DMA. Then the user interrupts the app with ^C. When ymfpci
> gets the ->release() call, it has to tell the chip to stop DMA, then wait
> until that's complete. If it tries to wait with TASK_INTERRUPTIBLE,
> schedule() will return immediately, and in essence it does a busy loop with
> the CPU pegged at 100%.

Forgive me; it's late here. What's wrong with -ERESTARTNOINTR?

But even if that's not going to work -- as I said later, there _are_ cases
where you can't really get rid of it. And given the debates we've had
recently about the return value from close(), your release() call seems like
one such, if you really can't restart it.

My point was that we should be making TASK_UNINTERRUPTIBLE _less_ attractive
to encourage people not to use it simply because it's easier than thinking
about the cleanup path, rather than making it more attractive as was
originally suggested.

> The same thing happens in USB, only there it's worse: a spinning
> application locks out khubd and the whole subsystem dies.

Spinning is obviously wrong. If after thinking hard about it you can't come
up with a better solution, go pick up a form, find 5 kernel hackers to sign
it saying there really is no better solution, and you can have one of these
licences to use TASK_UNINTERRUPTIBLE that I was talking about :)

--
dwmw2


2002-08-02 00:28:12

by Olivier Galibert

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

On Thu, Aug 01, 2002 at 04:30:40PM -0700, Linus Torvalds wrote:
> And yes, these logging programs are mission-critical, and they do have
> signals going on, and they rely on well-defined and documented interfaces
> that say that doing a write() to a filesystem is _not_ going to return in
> the middle just because a signal came in.

How hard and/or insane would it be to somehow special-case SIGKILL?
It is a tad annoying not to be able to get rid of D state processes,
especially ones blocking unmounts because the filesystem is busy.

Of course, an alternative is a real, brutal, forced unmount that
leaves a clean filesystem and dead/dying processes.

OG.

2002-08-02 07:28:21

by Giuliano Pochini

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers


>> Stop arguing about this. It's a FACT.
>
> Linus, it's not that I don't want to believe you, but e.g. the SUS doesn't
> make that special exception.
> Installing signal handlers and not expecting EINTR _is_ sloppy
> programming.

Linus is right. It's impossible to change default rw semantics
without breaking things. Anyway, we can add another non-standard
file flag (that's likely to remain mostly unused).


Bye.

2002-08-02 09:37:45

by kaih

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

[email protected] (Olivier Galibert) wrote on 01.08.02 in <[email protected]>:

> On Thu, Aug 01, 2002 at 04:30:40PM -0700, Linus Torvalds wrote:
> > And yes, these logging programs are mission-critical, and they do have
> > signals going on, and they rely on well-defined and documented interfaces
> > that say that doing a write() to a filesystem is _not_ going to return in
> > the middle just because a signal came in.
>
> How hard and/or insane would it be to somehow special-case SIGKILL?

Thank you for reading the thread before jumping in.

(Hint: a few messages upthread, Linus explained how to do exactly this.)

MfG Kai

2002-08-02 09:59:05

by Roman Zippel

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

Hi,

I slept on it and thought about it a bit more, but I think it's just starting
to get interesting, so I hope your co-workers had a chance to snatch that
chainsaw away. :-)

On Thu, 1 Aug 2002, Linus Torvalds wrote:

> There are cases where you absolutely _have_ to rely on this documented
> UNIX behaviour. One example is using a log-file (yes, a _file_, not a
> socket or a pipe) that you explicitly opened with O_APPEND, just so that
> you can guarantee _atomic_ writes that do not get lost or partially
> re-ordered in your log.
>
> And yes, these logging programs are mission-critical, and they do have
> signals going on, and they rely on well-defined and documented interfaces
> that say that doing a write() to a filesystem is _not_ going to return in
> the middle just because a signal came in.

If these programs are so mission-critical, they better do some error
checking. Your atomic write will fail on a full disk and if the
information is that important, the program has to handle that.

> These programs know what they are doing. They are explicitly _not_ using
> "stdio" to write the log-file, exactly because they cannot afford to have
> the write broken up into many parts (and they do not want to have it
> buffered either, since getting the logging out in a timely fashion can be
> important).
>
> The only, and the _portable_ way to do this under UNIX is to create one
> single buffer, and write it out with one single write() call. Anything
> else is likely to cause the file to be interspersed by random
> log-fragments, instead of being a nice consecutive list of full log
> entries.

I looked around a bit and it doesn't look that portable. UNIX systems seem
to have this behaviour, but not all POSIX systems. Letting multiple
programs log to the same file is insane (there must be a reason why syslog
was invented). Signals are your least worry, that's easy to deal with:

sigfillset(&s);
sigprocmask(SIG_SETMASK, &s, &os);
write(...);
sigprocmask(SIG_SETMASK, &os, NULL);

But that doesn't deal with an almost full disk.

> If people cannot find this in SuS, then I simply don't _care_. I care
> about not having a crap OS, and I also care about not having to repeat
> myself and give a million examples of why the current behaviour is
> _required_, and why we're not getting rid of it.

I'm not asking for this to be removed, I simply don't agree with the
reason. I still consider any program relying on this behaviour buggy. Your
"atomic" write is an illusion; the OS can certainly try, but in the end
it's the application's responsibility to get this right.
I know that we can't remove this behaviour, but the only reason is to
keep buggy programs working and let them fail instead in other, more subtle
ways.

> [ Did profanity help explain the situation?

Not really. :) I'm not that easily impressed.

bye, Roman

2002-08-02 12:36:33

by Ryan Anderson

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

On Fri, 2 Aug 2002, Roman Zippel wrote:

> I slept on it and thought about it a bit more, but I think it's just starting
> to get interesting, so I hope your co-workers had a chance to snatch that
> chainsaw away. :-)

uh-oh. heh.

> On Thu, 1 Aug 2002, Linus Torvalds wrote:
>
> > There are cases where you absolutely _have_ to rely on this documented
> > UNIX behaviour. One example is using a log-file (yes, a _file_, not a
> > socket or a pipe) that you explicitly opened with O_APPEND, just so that
> > you can guarantee _atomic_ writes that do not get lost or partially
> > re-ordered in your log.
> >
> > And yes, these logging programs are mission-critical, and they do have
> > signals going on, and they rely on well-defined and documented interfaces
> > that say that doing a write() to a filesystem is _not_ going to return in
> > the middle just because a signal came in.
>
> If these programs are so mission-critical, they better do some error
> checking. Your atomic write will fail on a full disk and if the
> information is that important, the program has to handle that.

From personal experience, I have an application that DEPENDS on the
atomicity of writes in O_APPEND mode. In the logging function, we have
made a conscious choice that, if we get a disk-full error, there's not a
damn thing we can do about a "disk full" condition that makes sense in
the application, so we don't handle that error. (That situation is
something the sysadmin is supposed to prevent, anyway.) (Admittedly,
that application is running on AIX - but I still expect the behavior to
remain the same when we port that application over to Linux...
eventually.)


> > These programs know what they are doing. They are explicitly _not_ using
> > "stdio" to write the log-file, exactly because they cannot afford to have
> > the write broken up into many parts (and they do not want to have it
> > buffered either, since getting the logging out in a timely fashion can be
> > important).
> >
> > The only, and the _portable_ way to do this under UNIX is to create one
> > single buffer, and write it out with one single write() call. Anything
> > else is likely to cause the file to be interspersed by random
> > log-fragments, instead of being a nice consecutive list of full log
> > entries.
>
> I looked around a bit and it doesn't look that portable. UNIX systems seem
> to have this behaviour, but not all POSIX systems. Letting multiple
> programs log to the same file is insane (there must be a reason why syslog
> was invented). Signals are your least worry, that's easy to deal with:
>
> sigfillset(&s);
> sigprocmask(SIG_SETMASK, &s, &os);
> write(...);
> sigprocmask(SIG_SETMASK, &os, NULL);
>
> But that doesn't deal with an almost full disk.
>
> > If people cannot find this in SuS, then I simply don't _care_. I care
> > about not having a crap OS, and I also care about not having to repeat
> > myself and give a million examples of why the current behaviour is
> > _required_, and why we're not getting rid of it.

In any case, atomicity is guaranteed by SUSv3 (read open(2) for O_APPEND,
then read write(2) and skip down to the Rationale, first paragraph).

> I'm not asking for removing for this, I simply don't agree with the
> reason. I still consider any program relying on this behaviour buggy. Your
> "atomic" write is an illusion, the os can certainly try, but in the end
> it's the applications responsibility to get this right.
> I know that we can't remove this behaviour, but the only reason is to
> keep buggy programs working and let them instead fail in other more subtle
> ways.

Yes, it's always the application's responsibility to get this right.

The application can expect that when write(2) returns from a request
smaller than {PIPE_BUF} bytes, it either failed or wrote the whole
request, and with that assumption the application can handle the other
errors sanely.

Note that write(2) can never return fewer bytes written than requested if
the requested amount is smaller than {PIPE_BUF} bytes.

Linus is absolutely correct here, changing the semantics of this
scenario would be a huge mess.

--
Ryan Anderson
sometimes Pug Majere

2002-08-02 15:34:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers



On Fri, 2 Aug 2002, Roman Zippel wrote:
>
> If these programs are so mission-critical, they better do some error
> checking. Your atomic write will fail on a full disk and if the
> information is that important, the program has to handle that.

The logging thing is logging. It's critical for performance tuning, it's
critical for later finding of errors, but it's still secondary.

It's also performance-critical in a very real way.

> I looked around a bit and it doesn't look that portable. UNIX systems seem
> to have this behaviour, but not all POSIX systems.

POSIX is a hobbled standard, and does not matter.

We're not making a "POSIX-compliant OS". People have done that before:
see all the RT-OS's out there, and see even the NT POSIX subsystem.

They are uninteresting.

Linux is a _real_ OS, not some "we filled in the paperwork and it is now
standards compliant".

And being a real OS means taking the real world into account.

And the real world says that it's not acceptable to make up your own
semantics, unless you have some _damn_ good reason for doing so.

Let's turn the tables here. I'm not interested in your "but.." arguments
at all. I've refuted every single one, and I've refuted them with cold
hard facts.

The fact is, there is _zero_ reason to change existing functionality.

We already do this right, and there is no reason to _break_ the fact that
we do it right. Can you come up with a _single_ reason for why we should
break existing standardized binary interfaces?

Binary compatibility is important. As is the larger issue of generic UNIX
compatibility. You had better have some really strong arguments for why
you would think I'd be willing to break compatibility. So far you have had
_no_ arguments for the question "Why?".

Linus

2002-08-02 15:59:16

by Victor Yodaiken

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

On Thu, Aug 01, 2002 at 03:40:56PM -0700, Linus Torvalds wrote:
>
> On Thu, 1 Aug 2002, David Woodhouse wrote:
> >
> > [email protected] said:
> > > Any regular file IO is supposed to give you the full result.
> >
> > read(2) is permitted to return -EINTR.
>
> It is _not_ allowed to do that for regular UNIX filesystems.
>

SUSv2 and POSIX seem to have changed the prior standard:

The value returned may be less than nbyte if the number of bytes
left in the file is less than nbyte, if the read() request was
interrupted by a signal, or if the file is a pipe or FIFO or
special file and has fewer than nbyte bytes immediately available
for reading.


The rationale mentions that

    while (read(...) > 0)

must work, but I think Linus is correct that many programs rely on

    while (read(fd, &b, n) == n)

------------------susv2

If a read() is interrupted by a signal before it reads any data, it
will return -1 with errno set to [EINTR].

If a read() is interrupted by a signal after it has successfully
read some data, it will return the number of bytes read.
----------------------



The grim effects of "cancel" spread through the system.


-------------------------------------
The transitive closure of a design error is limited only by the size of
the program.

2002-08-02 15:59:20

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

On Fri, Aug 02, 2002 at 08:39:34AM -0700, Linus Torvalds wrote:
> We already do this right, and there is no reason to _break_ the fact that
> we do it right. Can you come up with a _single_ reason for why we should
> break existing standardized binary interfaces?

Personally, I think that uninterruptible file I/O is good, but there needs
to be an upper limit on the size of the I/O. As it stands today,
someone can do a single multigigabyte read or write that is completely
uninterruptible (even to kill -9), but could take a minute or more to
complete.

-ben
--
"You will be reincarnated as a toad; and you will be much happier."

2002-08-02 16:22:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers



On Fri, 2 Aug 2002, Benjamin LaHaise wrote:
>
> Personally, I think that uninterruptible file io is good, but there needs
> to be an upper limit to the maximum size of the io. As it stands today,
> someone can do a single multigigabyte read or write that is completely
> uninterruptible (even to kill -9), but could take a minute or more to
> complete.

Actually, if you read my original email on this thread, I actually
suggested splitting up the "INTERRUPTIBLE" vs "UNINTERRUPTIBLE" into three
different cases and one extra bit.

Sending somebody a SIGKILL (or any signal that kills the process) is
different (in my opinion) from a signal that interrupts a system call in
order to run a signal handler.

So my original suggestion on this thread was to make

TASK_WAKESIGNAL - wake on all signals
TASK_WAKEKILL - wake on signals that are deadly
TASK_NOSIGNAL - don't wake on signals
TASK_LOADAVG - counts toward loadaverage

#define TASK_UNINTERRUPTIBLE (TASK_NOSIGNAL | TASK_LOADAVG)
#define TASK_INTERRUPTIBLE TASK_WAKESIGNAL

and then let code like generic_file_write() etc use other combinations
than the two existing ones, ie

(TASK_WAKEKILL | TASK_LOADAVG)

results in a process that is woken up by signals that kill it (but not
other signals), and is counted towards the loadaverage. Which is what we
want in generic_file_read() (and _probably_ generic_file_write() as well,
but that's slightly more debatable).

(We'd also have to add a new way to test whether you've been killed, so
that such users could use "process_killed()" instead of the
"signal_pending()" that a INTERRUPTIBLE sleeper uses to test whether it
should exit).
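As a sketch only, such a test might check for a pending SIGKILL -- the real
thing would presumably cover any signal whose disposition is fatal:

static inline int process_killed(struct task_struct *tsk)
{
	/* sketch: a pending SIGKILL means this task is dead anyway */
	return sigismember(&tsk->pending.signal, SIGKILL);
}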

This is the trivial way to get the best of both worlds - you can still
kill a process that is in D wait (if that particular kernel path allows
it), but you don't get process-visible semantic changes.

AND it also allows waiting uninterruptibly without adding to the
loadaverage, for those people who want to do long uninterruptible waits
(which was one of the reasons for starting this whole thread in the first
place - it just got slightly de-railed in the meantime).

Linus

2002-08-02 17:11:25

by Jamie Lokier

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

Linus Torvalds wrote:
> Sending somebody a SIGKILL (or any signal that kills the process) is
> different (in my opinion) from a signal that interrupts a system call in
> order to run a signal handler.

So it's ok to have truncated log entries (or more realistically,
truncated simple database entries) if the logging program is killed?

-- Jamie

2002-08-02 17:25:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers



On Fri, 2 Aug 2002, Jamie Lokier wrote:
>
> Linus Torvalds wrote:
> > Sending somebody a SIGKILL (or any signal that kills the process) is
> > different (in my opinion) from a signal that interrupts a system call in
> > order to run a signal handler.
>
> So it's ok to have truncated log entries (or more realistically,
> truncated simple database entries) if the logging program is killed?

This is why I said

"Which is what we want in generic_file_read() (and _probably_
generic_file_write() as well, but that's slightly more debatable)"

The "slightly more debatable" comes exactly from the thing you mention.

The thing is, "read()" on a file doesn't have any side effects outside the
process that does it, so if you kill the process, doing a partial read is
always ok (yeah, you can come up with thread examples etc where you can
see the state, but I think those are so contrived as to not really merit
much worry, and they certainly pose no issues for existing programs).

With write(), you have to make a judgement call. Unlike read, a truncated
write _is_ visible outside the killed process. But exactly like read()
there _are_ system management reasons why you may really need to kill
writers. So the debatable point comes from whether you want to consider a
killing signal to be "exceptional enough" to warrant the partial write.

I can see both sides. I personally think I'd prefer the "if I kill a
process, I want it dead _now_" approach, but this definitely _is_ up for
discussion (unlike the signal handler case).

Linus

2002-08-02 17:35:29

by Oliver Neukum

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers


> and then let code like generic_file_write() etc use other combinations
> than the two existing ones, ie
>
> (TASK_WAKEKILL | TASK_LOADAVG)
>
> results in a process that is woken up by signals that kill it (but not
> other signals), and is counted towards the loadaverage. Which is what we
> want in generic_file_read() (and _probably_ generic_file_write() as well,
> but that's slightly more debatable).
>
> (We'd also have to add a new way to test whether you've been killed, so
> that such users could use "process_killed()" instead of the
> "signal_pending()" that a INTERRUPTIBLE sleeper uses to test whether it
> should exit).
>
> This is the trivial way to get the best of both worlds - you can still
> kill a process that is in D wait (if that particular kernel path allows
> it), but you don't get process-visible semantic changes.

If you do this to generic_file_write() you change the semantics for the
parent process, which up to now can expect a change to a file
made by a single call to write() to be done either fully or not at all,
if there's no error.

So IMHO it would be better to limit this new kind of waiting to reading.

Regards
Oliver

2002-08-02 17:54:10

by Paul Menage

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

In article <[email protected]>,
you write:
>
>With write(), you have to make a judgement call. Unlike read, a truncated
>write _is_ visible outside the killed process. But exactly like read()
>there _are_ system management reasons why you may really need to kill
>writers. So the debatable point comes from whether you want to consider a
>killing signal to be "exceptional enough" to warrant the partial write.
>

How about a sysctl that lets the user specify the size threshold at
which writes use a killable wait state rather than
TASK_UNINTERRUPTIBLE? (Probably defaulting to never.)
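For instance -- everything here is hypothetical, the ctl number and name
included -- a 2.4-style sysctl table for it might look like:

#include <linux/kernel.h>
#include <linux/sysctl.h>

static int write_killable_threshold = INT_MAX;	/* bytes; INT_MAX == "never" */

static ctl_table write_killable_table[] = {
	/* 1234 is a made-up ctl number; a real one would be added to the
	 * enum in <linux/sysctl.h> */
	{1234, "write-killable-threshold", &write_killable_threshold,
	 sizeof(int), 0644, NULL, &proc_dointvec},
	{0}
};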

Paul

2002-08-02 17:53:59

by Trond Myklebust

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

>>>>> " " == Linus Torvalds <[email protected]> writes:

> "Which is what we want in generic_file_read() (and _probably_
> generic_file_write() as well, but that's slightly more
> debatable)"

A frequent cause of complaints with the NFS client 'intr' mount option
is that grabbing the inode semaphore is also uninterruptible, and hence
even a lookup() can hang.
Would you therefore be planning on making down() interruptible by
SIGKILL?

Cheers,
Trond

2002-08-02 18:06:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers



On 2 Aug 2002, Trond Myklebust wrote:
>
> Would you therefore be planning on making down() interruptible by
> SIGKILL?

Can't do - existing users know that down() cannot fail.

But we already have a "down_interruptible()", so if we introduce the
notion of "non-interruptible but killable", we can also introduce a
"down_killable()".

Linus

2002-08-02 18:21:17

by Jesse Pollard

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

Linus Torvalds <[email protected]>:
>On Fri, 2 Aug 2002, Jamie Lokier wrote:
>>
>> Linus Torvalds wrote:
>> > Sending somebody a SIGKILL (or any signal that kills the process) is
>> > different (in my opinion) from a signal that interrupts a system call in
>> > order to run a signal handler.
>>
>> So it's ok to have truncated log entries (or more realistically,
>> truncated simple database entries) if the logging program is killed?
>
>This is why I said
>
> "Which is what we want in generic_file_read() (and _probably_
> generic_file_write() as well, but that's slightly more debatable)"
>
>The "slightly more debatable" comes exactly from the thing you mention.
>
>The thing is, "read()" on a file doesn't have any side effects outside the
>process that does it, so if you kill the process, doing a partial read is
>always ok (yeah, you can come up with thread examples etc where you can
>see the state, but I think those are so contrived as to not really merit
>much worry, and they certainly pose no issues for existing programs).
>
>With write(), you have to make a judgement call. Unlike read, a truncated
>write _is_ visible outside the killed process. But exactly like read()
>there _are_ system management reasons why you may really need to kill
>writers. So the debatable point comes from whether you want to consider a
>killing signal to be "exceptional enough" to warrant the partial write.
>
>I can see both sides. I personally think I'd prefer the "if I kill a
>process, I want it dead _now_" approach, but this definitely _is_ up for
>discussion (unlike the signal handler case).

There have been cases (and systems) in the past that provided BOTH
interpretations:

1. The current kill -9 action:

   Terminates the process as soon as it returns, or is in the process of
   returning, to user mode. This is normal, and prevents most partial
   writes. This is applicable to things like database servers, log
   servers, and journaling processes.

2. Kill, and abort outstanding I/O.

   This causes partial log writes, corrupts databases (usually), and will
   cause any process to terminate.

When is #2 used:
   a. real-time systems where the device handling MUST be terminated now.
   b. system shutdown for emergencies (this allows filesystems to
      finish flushing, but user processes may be stuck writing to an
      audio/parallel device... the procedure is to use kill -15, wait a
      second or two, kill -9, wait a second or two, KILL UNCONDITIONALLY,
      and then shut down anyway).
   Other uses:
      b1. fire, flood, power failure (act of god)
      b2. system overtemp (loss of AC cooling...)
      b3. disk drive failures (to stop writing to a drive, abort
          DMA actions, controller failure detection - no need to
          propagate errors to a raid...)
      b4. safety-related aborts in time-critical applications

Item b3 allows a system with some pretty catastrophic hardware
failures to actually do something and shut down / clean up as much as possible
without just hanging - which would also introduce partial log writes...

I worked on one system that determined the main disk controller was failing,
and proceeded to request a power cycle on all disk drives attached to that
particular controller to attempt to clear the failure. All user processes
were killed, a detailed diagnostic was provided, then the system shut itself
off.

In realtime underwater survey systems we used such an abort to cancel
expensive operations that were already in progress (expensive if they
finished - setting off remote explosives via an external controller).

-------------------------------------------------------------------------
Jesse I Pollard, II
Email: [email protected]

Any opinions expressed are solely my own.

2002-08-02 19:24:10

by Roman Zippel

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

Hi

On Fri, 2 Aug 2002, Linus Torvalds wrote:

> Binary compatibility is important. As is the larger issue of generic UNIX
> compatibility. You had better have some really strong arguments for why
> you would think I'd be willing to break compatibility. So far you have had
> _no_ arguments for the question "Why?".

I never asked for breaking binary compatibility.
On the other hand, I can give you an example of why I'd like to have a
choice. I expect a good application to recover gracefully from failures,
so if I'm saving a file and the server goes down, I would really, really
like something to happen when I push that stop button or press Esc, and
the application should offer me the possibility of saving the file
somewhere else.

To implement this I would suggest using file flags instead of new task
flags:

O_ATOMIC
O_NONBLOCK
O_SIGNALINT
O_KILLINT
O_DONT_BOTHER_ME

The first one might be useful for aio; it wants something like that
already anyway.
This way applications would have a choice of how read/write should behave
on signals.
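Purely hypothetical usage, since none of these flags exist -- the point is
just that the policy becomes the application's choice:

#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* O_SIGNALINT is one of the proposed (non-existent) flags above */
static ssize_t read_interruptibly(const char *path, char *buf, size_t len)
{
	int fd = open(path, O_RDONLY | O_SIGNALINT);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = read(fd, buf, len);
	if (n < 0 && errno == EINTR) {
		/* we opted in, so we cope: e.g. offer to save the data
		 * somewhere else instead of hanging on a dead server */
	}
	close(fd);
	return n;
}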

bye, Roman


2002-08-02 23:22:03

by Ryan Anderson

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

On Fri, 2 Aug 2002, Paul Menage wrote:

> In article <[email protected]>,
> you write:
> >
> >With write(), you have to make a judgement call. Unlike read, a truncated
> >write _is_ visible outside the killed process. But exactly like read()
> >there _are_ system management reasons why you may really need to kill
> >writers. So the debatable point comes from whether you want to consider a
> >killing signal to be "exceptional enough" to warrant the partial write.
> >
>
> How about a sysctl that lets the user specify the size threshold at
> which writes use a killable wait state rather than
> TASK_UNINTERRUPTIBLE? (Probably defaulting to never.)

/usr/include/linux/limits.h:#define PIPE_BUF 4096 /* # bytes in atomic write to a pipe */

According to SUSv3, this is the maximum size that write is guaranteed to
be atomic for.

The minimum size this can be set to is 512. I *think* this applies to
file I/O - but I'm not up to tracking all the definitions down.

So, you've already got the definitions in place, especially the size
limits. It even reads like changing the behavior for larger writes
would be acceptable.

Just leave the small writes alone and I think you'll avoid causing large
problems for the majority of applications.

--
Ryan Anderson
sometimes Pug Majere

2002-08-02 23:26:48

by Paul Menage

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers

>> How about a sysctl that lets the user specify the size threshold at
>> which writes use a killable wait state rather than
>> TASK_UNINTERRUPTIBLE? (Probably defaulting to never.)
>
>/usr/include/linux/limits.h:#define PIPE_BUF 4096 /* # bytes in atomic write to a pipe */
>
>According to SUSv3, this is the maximum size that write is guaranteed to
>be atomic for.

But that doesn't mean that we can't allow the user to configure higher
limits if they're using some particular software that relies on (and
more importantly, has successfully relied upon in the past) the current
behaviour of no interruptions.

Paul



2002-08-03 18:24:34

by David Woodhouse

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers


[email protected] said:
> So IMHO it would be better to limit this new kind of waiting to
> reading.

You can do it for write() in the case where no data have yet been written,
i.e. in prepare_write() first time round the loop. In fact, you can even
return -ERESTARTNOINTR in that case, just as you can for read() where no
data have yet been copied into userspace. Whether we want to special-case
the first time round the loop just to give better responsiveness in the
common case is debatable though.
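Concretely, something along these lines at the top of the copy loop would do
it (a sketch against the 2.4 generic_file_write() structure; 'written' and
'status' are the local variables that function already uses):

	/* nothing copied yet: restarting the whole syscall is still
	 * invisible to userspace, so it is safe to bail out here */
	if (!written && signal_pending(current)) {
		status = -ERESTARTNOINTR;
		break;
	}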

You can also do it for open() in 2.5. (In 2.4 the read_inode API gave the
filesystem no choice but to return a full real inode or a bad one, which
remained and prevented subsequent lookups of the same inode.)

--
dwmw2


2002-10-17 08:26:36

by David Woodhouse

[permalink] [raw]
Subject: Re: manipulating sigmask from filesystems and drivers


On Fri, 2 Aug 2002, Linus Torvalds wrote:
> Actually, if you read my original email on this thread, I actually
> suggested splitting up the "INTERRUPTIBLE" vs "UNINTERRUPTIBLE" into
> three different cases and one extra bit.
>
> Sending somebody a SIGKILL (or any signal that kills the process) is
> different (in my opinion) from a signal that interrupts a system call
> in order to run a signal handler.
>
> So my original suggestion on this thread was to make
>
> TASK_WAKESIGNAL - wake on all signals
> TASK_WAKEKILL - wake on signals that are deadly
> TASK_NOSIGNAL - don't wake on signals
> TASK_LOADAVG - counts toward loadaverage
>
> #define TASK_UNINTERRUPTIBLE (TASK_NOSIGNAL | TASK_LOADAVG)
> #define TASK_INTERRUPTIBLE TASK_WAKESIGNAL
>
> and then let code like generic_file_write() etc use other combinations
> than the two existing ones, ie
>
> (TASK_WAKEKILL | TASK_LOADAVG)
>
> results in a process that is woken up by signals that kill it (but not
> other signals), and is counted towards the loadaverage. Which is what we
> want in generic_file_read() (and _probably_ generic_file_write() as well,
> but that's slightly more debatable).

I like this, but perhaps with a slight modification. I think that instead of
being a flag in the task state, the 'can we deal with being interrupted'
should instead be a _counter_, and the process in question can be woken
by signals only if it's zero. For the following reason:

In the world you describe above, a function sets the NOSIGNAL flag when
sleeping if it can't deal with being interrupted -- presumably because it
has no simple way to clean up its state in that case.

But we should remain aware that the ability to clean up on being interrupted
is not a function of only the bottom-most function on the call stack; it is
actually a function of the _entire_ call stack, in some cases all the way
back up to the userspace program making the system call. Every function in
the call stack needs to be able to deal with the -EINTR return and clean up
appropriately, not just the bottom-most function.

Consider the case where function A() sleeps, and can perfectly happily deal
with being interrupted. It is called from function B() which can also deal
with an -EINTR return from A(), and in fact _wants_ to do so because it can
take a long time and being uninterruptible would suck. The majority of
callers of function A() are of this nature, and want to be interruptible.

But there exists a function C() which also calls function A(), and which
_cannot_ deal with being interrupted.

What we actually tend to do in this case is make function A()
uninterruptible at all times (or -- ugh -- pass an extra argument all the
way down telling it which type of sleep to use and not to check
signal_pending()).

My suggested alternative is to have the afore-mentioned 'interruptible_count'
in the task struct, and each function that cannot deal with receiving -EINTR
from the functions it calls increases this count accordingly.

So sys_write() would increase current->interruptible_count, and then
nothing called from there would upset the user by returning -EINTR.

Similarly, down() would just increase the count before calling what's
currently down_interruptible(), etc.
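A sketch of the bookkeeping (the field and the helpers are hypothetical):

/* >0 means some caller up-stack cannot cope with -EINTR */
static inline void uninterruptible_begin(void)
{
	current->interruptible_count++;
}

static inline void uninterruptible_end(void)
{
	current->interruptible_count--;
}

/* what the sleeping primitives would test instead of signal_pending() */
static inline int may_abort_sleep(void)
{
	return current->interruptible_count == 0 && signal_pending(current);
}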

--
dwmw2