2002-02-12 23:14:25

by Andrew Morton

Subject: [patch] sys_sync livelock fix

The get_request fairness patch exposed a livelock
in the buffer layer. write_unlocked_buffers() will
not terminate while other tasks are generating write traffic.

The patch simply bales out after writing all the buffers which
were dirty at the time the function was called, rather than
keeping on trying to write buffers until the list is empty.

Given that /bin/sync calls write_unlocked_buffers() three times,
that's good enough. sync still takes aaaaaages, but it terminates.




--- linux-2.4.18-pre9-ac2/fs/buffer.c	Tue Feb 12 12:26:41 2002
+++ ac24/fs/buffer.c	Tue Feb 12 14:39:39 2002
@@ -189,12 +189,13 @@ static void write_locked_buffers(struct
  * return without it!
  */
 #define NRSYNC (32)
-static int write_some_buffers(kdev_t dev)
+static int write_some_buffers(kdev_t dev, signed long *nr_to_write)
 {
 	struct buffer_head *next;
 	struct buffer_head *array[NRSYNC];
 	unsigned int count;
 	int nr;
+	int ret;
 
 	next = lru_list[BUF_DIRTY];
 	nr = nr_buffers_type[BUF_DIRTY];
@@ -213,29 +214,38 @@ static int write_some_buffers(kdev_t dev
 			array[count++] = bh;
 			if (count < NRSYNC)
 				continue;
-
-			spin_unlock(&lru_list_lock);
-			write_locked_buffers(array, count);
-			return -EAGAIN;
+			ret = -EAGAIN;
+			goto writeout;
 		}
 		unlock_buffer(bh);
 		__refile_buffer(bh);
 	}
+	ret = 0;
+writeout:
 	spin_unlock(&lru_list_lock);
-
-	if (count)
+	if (count) {
 		write_locked_buffers(array, count);
-	return 0;
+		if (nr_to_write)
+			*nr_to_write -= count;
+	}
+	return ret;
 }
 
 /*
- * Write out all buffers on the dirty list.
+ * Because we drop the locking during I/O it's not possible
+ * to write out all the buffers. So the only guarantee that
+ * we can make here is that we write out all the buffers which
+ * were dirty at the time write_unlocked_buffers() was called.
+ * fsync_dev() calls in here three times, so we end up writing
+ * many more buffers than ever appear on BUF_DIRTY.
  */
 static void write_unlocked_buffers(kdev_t dev)
 {
+	signed long nr_to_write = nr_buffers_type[BUF_DIRTY] * 2;
+
 	do {
 		spin_lock(&lru_list_lock);
-	} while (write_some_buffers(dev));
+	} while (write_some_buffers(dev, &nr_to_write) && (nr_to_write > 0));
 	run_task_queue(&tq_disk);
 }
 
@@ -1085,7 +1095,7 @@ void balance_dirty(void)
 	 */
 	if (state > 0) {
 		spin_lock(&lru_list_lock);
-		write_some_buffers(NODEV);
+		write_some_buffers(NODEV, NULL);
 		wait_for_some_buffers(NODEV);
 	}
 }
@@ -2846,7 +2856,7 @@ static int sync_old_buffers(void)
 		bh = lru_list[BUF_DIRTY];
 		if (!bh || time_before(jiffies, bh->b_flushtime))
 			break;
-		if (write_some_buffers(NODEV))
+		if (write_some_buffers(NODEV, NULL))
 			continue;
 		return 0;
 	}
@@ -2945,7 +2955,7 @@ int bdflush(void *startup)
 		CHECK_EMERGENCY_SYNC
 
 		spin_lock(&lru_list_lock);
-		if (!write_some_buffers(NODEV) || balance_dirty_state() < 0) {
+		if (!write_some_buffers(NODEV, NULL) || balance_dirty_state() < 0) {
 			wait_for_some_buffers(NODEV);
 			interruptible_sleep_on(&bdflush_wait);
 		}


-


2002-02-12 23:19:47

by Alan

Subject: Re: [patch] sys_sync livelock fix

> Given that /bin/sync calls write_unlocked_buffers() three times,
> that's good enough. sync still takes aaaaaages, but it terminates.

What's wrong with sync not terminating when there is permanently I/O left?
It seems preferable to surprise data loss

2002-02-12 23:25:56

by Andrew Morton

Subject: Re: [patch] sys_sync livelock fix

Alan Cox wrote:
>
> > Given that /bin/sync calls write_unlocked_buffers() three times,
> > that's good enough. sync still takes aaaaaages, but it terminates.
>
> What's wrong with sync not terminating when there is permanently I/O left?
> It seems preferable to surprise data loss

Hard call. What do we *want* sync to do?

SuS doesn't require sync to be synchronous, although in linux
it traditionally has been. SUS says:

The sync() function causes all information in memory that
updates file systems to be scheduled for writing out to all
file systems.

The writing, although scheduled, is not necessarily complete
upon return from sync().

I think the larger problem is that an infinite-duration
/bin/sync can break existing stuff.

If someone calls /bin/sync while there's write activity
going on, and expects that to prevent data loss then
their assumptions are broken anyway. There can be new
write data generated as soon as sync returns.

-

2002-02-12 23:29:36

by Rik van Riel

Subject: Re: [patch] sys_sync livelock fix

On Tue, 12 Feb 2002, Alan Cox wrote:

> > Given that /bin/sync calls write_unlocked_buffers() three times,
> > that's good enough. sync still takes aaaaaages, but it terminates.
>
> What's wrong with sync not terminating when there is permanently I/O
> left?

I've seen it get stuck for half an hour. This was not
during a test, but on a real server which was in a busy
period...

> It seems preferable to surprise data loss

The data isn't lost, it'll simply get written out to
disk later.

cheers,

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-02-13 00:16:26

by Rik van Riel

Subject: Re: [patch] sys_sync livelock fix

On Wed, 13 Feb 2002, Alan Cox wrote:

> > > It seems preferable to surprise data loss
> >
> > The data isn't lost, it'll simply get written out to
> > disk later.
>
> Allow me to introduce you to the off button, and the scripts at
> shutdown which use sync

Those don't protect the system against applications that
write data after sync has exited.

I don't see why it should be different for applications
that write data after sync has started.

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-02-13 00:15:26

by Alan

Subject: Re: [patch] sys_sync livelock fix

> > What's wrong with sync not terminating when there is permanently I/O left?
> > It seems preferable to surprise data loss
>
> Hard call. What do we *want* sync to do?

I'd rather not change the 2.4 behaviour - just in case. For 2.5 I really
have no opinion either way if SuS doesn't mind

2002-02-13 00:23:56

by Alan

Subject: Re: [patch] sys_sync livelock fix

> I don't see why it should be different for applications
> that write data after sync has started.

The guarantee about data written _before_ the sync started is also being
broken unless I misread the code

2002-02-13 00:36:47

by Rik van Riel

Subject: Re: [patch] sys_sync livelock fix

On Wed, 13 Feb 2002, Alan Cox wrote:

> > I don't see why it should be different for applications
> > that write data after sync has started.
>
> The guarantee about data written _before_ the sync started is also
> being broken unless I misread the code

Hmm, I guess we will want to fix that part ;)

Rik
--
"Linux holds advantages over the single-vendor commercial OS"
-- Microsoft's "Competing with Linux" document

http://www.surriel.com/ http://distro.conectiva.com/

2002-02-13 00:41:10

by Andrew Morton

Subject: Re: [patch] sys_sync livelock fix

Alan Cox wrote:
>
> > I don't see why it should be different for applications
> > that write data after sync has started.
>
> The guarantee about data written _before_ the sync started is also being
> broken unless I misread the code

That would be very broken.

The theory is: newly dirtied buffers are added at the "new"
end of the LRU. write_some_buffers() starts at the "old"
end of the LRU.

So if write_unlocked_buffers writes out the "oldest"
nr_buffers_type[BUF_DIRTY] buffers, then it knows
that it has written out everything which was dirty
at the time it was called.

Or did I miss something?

-

2002-02-13 00:58:28

by Alan

Subject: Re: [patch] sys_sync livelock fix

> > It seems preferable to surprise data loss
>
> The data isn't lost, it'll simply get written out to
> disk later.

Allow me to introduce you to the off button, and the scripts at shutdown
which use sync

2002-02-13 01:37:25

by Olaf Dietsche

Subject: What is a livelock? (was: [patch] sys_sync livelock fix)

Andrew Morton <[email protected]> writes:

> The get_request fairness patch exposed a livelock
> in the buffer layer. write_unlocked_buffers() will
> not terminate while other tasks are generating write traffic.

The subject says it: what is a livelock? How is it different
from a deadlock?

Thanks, Olaf.

2002-02-13 01:58:30

by Andrew Morton

Subject: Re: What is a livelock? (was: [patch] sys_sync livelock fix)

Olaf Dietsche wrote:
>
> Andrew Morton <[email protected]> writes:
>
> > The get_request fairness patch exposed a livelock
> > in the buffer layer. write_unlocked_buffers() will
> > not terminate while other tasks are generating write traffic.
>
> The subject says it: what is a livelock? How is it different
> from a deadlock?
>

http://www.huis.hiroshima-u.ac.jp/jargon/LexiconEntries/Livelock.html

livelock

/li:v'lok/ n. A situation in which some critical stage of a task is
unable to finish because its clients perpetually create more work
for it to do after they have been serviced but before it can clear its
queue. Differs from {deadlock} in that the process is not blocked or
waiting for anything, but has a virtually infinite amount of work to
do and can never catch up.


This exactly describes the sync problem. But we also use the
term `livelock' to describe one of the Linux VM's favourite
failure modes: madly spinning on page lists and not finding
any useful work to do. Which is slightly different.

-

2002-02-13 02:32:59

by Rob Landley

Subject: Re: What is a livelock? (was: [patch] sys_sync livelock fix)

On Tuesday 12 February 2002 08:36 pm, Olaf Dietsche wrote:
> Andrew Morton <[email protected]> writes:
> > The get_request fairness patch exposed a livelock
> > in the buffer layer. write_unlocked_buffers() will
> > not terminate while other tasks are generating write traffic.
>
> The subject says it: what is a livelock? How is it different
> from a deadlock?

A deadlock is when two or more processes sleep forever, each waiting for
something the other has to release. In a deadlock situation, the dead
processes are not using CPU time.

In a livelock, at least one of the processes is active, and the other process
can only continue if the active one stops. (Sometimes they're both active,
but can never finish what they're doing.)

The sync thing is a good example: sync tries to flush all dirty blocks to
disk and exits when there are no more dirty blocks. But if the system is a
busy server that's constantly generating dirty blocks as fast as they can be
written to disk and keeping the write buffer full, then sync never manages to
empty the buffer completely, and it can be flushing for hours before getting
a lucky break where it can declare victory and exit. A script that calls
"sync" in the middle can get completely blocked, and something else waiting
for that to finish...

Process priority is another common cause of livelocks. One mainframe at MIT
was decommissioned in the early 70's and they found a "run only when idle"
task that had been started around seven years earlier and still hadn't gotten
any time slices because the server had never been completely idle. (The
Pathfinder Mars probe kept rebooting for a similar reason: a vital system
task was getting starved while it held a semaphore that other stuff needed,
and not scheduling before timeouts caused a reboot. Look up the "priority
inversion" problem...)

This is why even the lowest priority tasks in modern operating systems still
occasionally are guaranteed timeslices and will even pre-empt pretty high
priority stuff if they've been starved long enough. Otherwise, they might
grab some vital resource other higher-priority programs need (semaphore, file
lock, etc) and block with it, and the rest of the system will never be
TOTALLY idle so they can keep important stuff waiting for a very long time.

> Thanks, Olaf.

Rob

2002-02-13 02:30:49

by Olaf Dietsche

Subject: Re: What is a livelock? (was: [patch] sys_sync livelock fix)

Andrew Morton <[email protected]> writes:

> http://www.huis.hiroshima-u.ac.jp/jargon/LexiconEntries/Livelock.html
>
> livelock
>
> /li:v'lok/ n. A situation in which some critical stage of a task is
> unable to finish because its clients perpetually create more work
> for it to do after they have been serviced but before it can clear its
> queue. Differs from {deadlock} in that the process is not blocked or
> waiting for anything, but has a virtually infinite amount of work to
> do and can never catch up.

I still don't get it :-(. When there is more work, this more work
needs to be done. So, how could livelock be considered a bug? It's
just overload. Or is this about the work, which must be done _after_
the queue is empty?

Regards, Olaf.

2002-02-13 02:41:20

by Andrew Morton

Subject: Re: What is a livelock? (was: [patch] sys_sync livelock fix)

Olaf Dietsche wrote:
>
> Andrew Morton <[email protected]> writes:
>
> > http://www.huis.hiroshima-u.ac.jp/jargon/LexiconEntries/Livelock.html
> >
> > livelock
> >
> > /li:v'lok/ n. A situation in which some critical stage of a task is
> > unable to finish because its clients perpetually create more work
> > for it to do after they have been serviced but before it can clear its
> > queue. Differs from {deadlock} in that the process is not blocked or
> > waiting for anything, but has a virtually infinite amount of work to
> > do and can never catch up.
>
> I still don't get it :-(. When there is more work, this more work
> needs to be done. So, how could livelock be considered a bug? It's
> just overload. Or is this about the work, which must be done _after_
> the queue is empty?
>

Yes, it's just overload. Clearly, the CPU can dirty memory
faster than a disk can clean it.

The bug is the expectation in the design of sync() that
it'll ever be able to make all buffers clean.

We can either:

a) spin madly until the thing which is writing stuff stops.
This has some merit, but is of course racy.

b) give up when we see it's not working out or

c) acquire sufficient locking to prevent all new dirtyings,
while we proceed to flush everything to disk. This is
pointless, because as soon as we drop those locks, the
dirtyings start again.

The only reliable way we can do all this is to offline the device
while we flush everything. That happens on the unmount and
remount-ro path. We definitely need that to work.

-

2002-02-13 02:53:20

by William Lee Irwin III

Subject: Re: What is a livelock? (was: [patch] sys_sync livelock fix)

On Wed, Feb 13, 2002 at 03:30:01AM +0100, Olaf Dietsche wrote:
> I still don't get it :-(. When there is more work, this more work
> needs to be done. So, how could livelock be considered a bug? It's
> just overload. Or is this about the work, which must be done _after_
> the queue is empty?

The false assumption made here is that it has exclusive access to the
queue for the duration. It appears to acquire the lock, dequeue one,
drop the lock, do some work, and return to the queue. During the time
between the lock being released and then being reacquired more work
can be generated, ensuring nontermination, as it iterates until the
queue is empty, which can never happen while work is generated at a
faster rate than it can process.


Cheers,
Bill

2002-02-13 03:31:09

by Bill Davidsen

Subject: Re: [patch] sys_sync livelock fix

On Wed, 13 Feb 2002, Alan Cox wrote:

> > > What's wrong with sync not terminating when there is permanently I/O left?
> > > It seems preferable to surprise data loss
> >
> > Hard call. What do we *want* sync to do?
>
> I'd rather not change the 2.4 behaviour - just in case. For 2.5 I really
> have no opinion either way if SuS doesn't mind

Alan, I think you have this one wrong, although SuS seems to have it wrong
as well, and if Linux did what SuS said there would be no problem.

- What SuS seems to say is that all dirty buffers will be queued for physical
write. I think if we did that the livelock would disappear, but data
integrity might suffer.
- sync() could be followed by write() at the very next dispatch, and it
was never intended to be the last call after which no writes would be
done. It is a point in time.
- the most common use of sync() is to flush data written to all files of the
current process. If there was a better way to do it which was portable,
sync() would be called less. I doubt there are processes which assume
that no write will be done after sync() returns.
- since sync() can't promise "no new writes", why try to make it do so? It
should mean "write current dirty buffers", and that's far more than SuS
requires.

I don't think benchmarks are generally important, but in this case the
benchmark reveals that we have been implementing a system call in a way
which not only does more than SuS requires, but more than the user
expects. To leave it trying to do even more than that seems to have no
benefit and a high (possible) cost.

I have seen shutdown hang many times, and I have to wonder if the shutdown
script is waiting for a process which is in some kind of write loop, while
the process ignores KILL signals. Don't know, don't claim I do, but I
see no reason for a sync() to handle more than current dirty blocks.

My opinion, but I hope yours as well. Fewer hangs and better performance is a
compromise I can accept.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-02-13 03:45:29

by Bill Davidsen

Subject: Re: [patch] sys_sync livelock fix

On Tue, 12 Feb 2002, Andrew Morton wrote:

> Alan Cox wrote:
> >
> > > I don't see why it should be different for applications
> > > that write data after sync has started.
> >
> > The guarantee about data written _before_ the sync started is also being
> > broken unless I misread the code
>
> That would be very broken.
>
> The theory is: newly dirtied buffers are added at the "new"
> end of the LRU. write_some_buffers() starts at the "old"
> end of the LRU.
>
> So if write_unlocked_buffers writes out the "oldest"
> nr_buffers_type[BUF_DIRTY] buffers, then it knows
> that it has written out everything which was dirty
> at the time it was called.
>
> Or did I miss something?

Alan is right about the first version of the patch not getting all dirty
buffers; I haven't looked at the latest version, but the change seems to be
correct. Other than that, I agree that "everything which was dirty at the
time it was called" is exactly right, what the user expects and what the
SuS says.

However, after thinking about the SuS, it says (paraphrase) "queued but
not necessarily written." So if I read that right sync() is intended to be
a non-blocking operation. We can do that, but there is one thing which
must be added: with all the various patches which hack the elevator code,
we need a flag which says "do not add anything more to this pass." That
makes the SuS implementation of sync() possible, and makes the completion
of the operation deterministic. When all the dirty blocks are written the
operation is done.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-02-13 03:47:19

by Jeff Garzik

Subject: Re: [patch] sys_sync livelock fix

Bill Davidsen wrote:
>
> On Wed, 13 Feb 2002, Alan Cox wrote:
>
> > > > What's wrong with sync not terminating when there is permanently I/O left?
> > > > It seems preferable to surprise data loss
> > >
> > > Hard call. What do we *want* sync to do?
> >
> > I'd rather not change the 2.4 behaviour - just in case. For 2.5 I really
> > have no opinion either way if SuS doesn't mind
>
> Alan, I think you have this one wrong, although SuS seems to have it wrong
> as well, and if Linux did what SuS said there would be no problem.
>
> - What SuS seems to say is that all dirty buffers will be queued for physical
> write. I think if we did that the livelock would disappear, but data
> integrity might suffer.
> - sync() could be followed by write() at the very next dispatch, and it
> was never intended to be the last call after which no writes would be
> done. It is a point in time.
> - the most common use of sync() is to flush data written to all files of the
> current process. If there was a better way to do it which was portable,
> sync() would be called less. I doubt there are processes which assume
> that no write will be done after sync() returns.
> - since sync() can't promise "no new writes", why try to make it do so? It
> should mean "write current dirty buffers", and that's far more than SuS
> requires.
>
> I don't think benchmarks are generally important, but in this case the
> benchmark reveals that we have been implementing a system call in a way
> which not only does more than SuS requires, but more than the user
> expects. To leave it trying to do even more than that seems to have no
> benefit and a high (possible) cost.

Yow, your message inspired me to re-read SuSv2 and indeed confirm:
sync(2) schedules I/O but can return before completion, while
fsync(2) schedules I/O and waits for completion.

So we need to implement a system call checkpoint(2)? Schedule I/O,
introduce an I/O barrier, then sleep until that I/O barrier, and all
I/O scheduled before it, has completed.

Jeff




--
Jeff Garzik | "I went through my candy like hot oatmeal
Building 1024 | through an internally-buttered weasel."
MandrakeSoft | - goats.com

2002-02-13 03:55:29

by Bill Davidsen

Subject: Re: [patch] sys_sync livelock fix

Alan and/or Linus:

Am I misreading this, or is the Linux implementation of sync() based on
making the shutdown scripts pause until disk I/O is done? Because I don't
think commercial unices work that way; I think they work as SuS
specifies. More reason to rethink this in 2.4 as well as 2.5 and get the
possible livelock out of the kernel.

If I'm missing any portable program which assumes this, or any common
UNIX version which works like Linux, please enlighten everyone. I'm going
to put the patch on my test system tomorrow, but I'm going to look at what
it takes to implement SuS and make it totally non-blocking, so I can see
if that really creates any problem.

If this were only a performance issue I wouldn't push for prompt
implementation, but anything which can hang the system, particularly in
shutdown, is bad.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-02-13 04:01:39

by Jeff Garzik

Subject: Re: [patch] sys_sync livelock fix

Bill Davidsen wrote:
>
> Alan and/or Linus:
>
> Am I misreading this or is the Linux implementation of sync() based on
> making the shutdown scripts pause until disk i/o is done? Because I don't
> think commercial unices work that way, I think they work as SuS
> specifies. More reason to rethink this in 2.4 as well as 2.5 and get the
> possible live lock out of the kernel.



I don't think SuSv2 can be any more clear than:

> The writing, although scheduled, is not necessarily complete
> upon return from sync().

Quoting from http://www.opengroup.org/onlinepubs/007908799/xsh/sync.html

As I mentioned in the other message, IMHO we need some way to introduce
a global system I/O barrier, and then wait for all I/O scheduled before
that barrier to complete. My suggestion for naming was the "checkpoint"
system call.

Jeff


--
Jeff Garzik | "I went through my candy like hot oatmeal
Building 1024 | through an internally-buttered weasel."
MandrakeSoft | - goats.com

2002-02-13 04:31:04

by Andrew Morton

Subject: Re: [patch] sys_sync livelock fix

Bill Davidsen wrote:
>
> Alan and/or Linus:
>
> Am I misreading this or is the Linux implementation of sync() based on
> making the shutdown scripts pause until disk i/o is done? Because I don't
> think commercial unices work that way, I think they work as SuS
> specifies. More reason to rethink this in 2.4 as well as 2.5 and get the
> possible live lock out of the kernel.
>

IMO, the SuS definition sucks. We really do want to do our best to
ensure that pending writes are committed to disk before sys_sync()
returns. As long as that doesn't involve waiting until mid-August.

For example, ext3 users get to enjoy rebooting with `sync ; reboot -f'
to get around all those silly shutdown scripts. This very much relies
upon the sync waiting upon the I/O.


I mean, according to SUS, our sys_sync() implementation could be

asmlinkage void sys_sync(void)
{
	return;
}

Because all I/O is already scheduled, thanks to kupdate.



But we want sync to be useful.


>
> If this were only a performance issue I wouldn't push for prompt
> implementation, but anything which can hang the system, particularly in
> shutdown, is bad.
>

If shutdown hangs, it's probably due to something else.

-

2002-02-13 04:54:49

by Bill Davidsen

Subject: Re: [patch] sys_sync livelock fix

On Tue, 12 Feb 2002, Jeff Garzik wrote:

> Bill Davidsen wrote:
> >
> > Alan and/or Linus:
> >
> > Am I misreading this or is the Linux implementation of sync() based on
> > making the shutdown scripts pause until disk i/o is done? Because I don't
> > think commercial unices work that way, I think they work as SuS
> > specifies. More reason to rethink this in 2.4 as well as 2.5 and get the
> > possible live lock out of the kernel.
>
> I don't think SuSv2 can be any more clear than:
>
> > The writing, although scheduled, is not necessarily complete
> > upon return from sync().
>
> Quoting from http://www.opengroup.org/onlinepubs/007908799/xsh/sync.html

I don't see anything which says we can't implement sync(2) as your
checkpoint, as long as we don't keep the current implementation which
could hang forever in theory, and for hours in practice. I don't think
that violates the standard, and it should be safe.

I said before that we could make sync(2) fast and just put up a barrier
to keep additional I/O from being queued, and I still like that. Pass on
the need for checkpoint, or the portability thereof. I would expect all
current programs to work if sync(2) worked like your checkpoint(2).

Glad you read the SuSv2 the same way!

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-02-13 05:23:44

by Bill Davidsen

Subject: Re: [patch] sys_sync livelock fix

On Tue, 12 Feb 2002, Andrew Morton wrote:

> IMO, the SuS definition sucks. We really do want to do our best to
> ensure that pending writes are committed to disk before sys_sync()
> returns. As long as that doesn't involve waiting until mid-August.

The current behaviour allows the system to hang forever waiting for
sync(2). In practice it does actually wait minutes on a busy system (df
has --no-sync for that reason) when there is no reason for that to happen.
I think that not only sucks worse, it's non-standard as well.

> For example, ext3 users get to enjoy rebooting with `sync ; reboot -f'
> to get around all those silly shutdown scripts. This very much relies
> upon the sync waiting upon the I/O.

Because people count on something broken we should keep the bug? You do
realize that the sync may NEVER finish? That's not just theory: I have news
servers which may wait overnight without finishing a "df" without the
option.

> I mean, according to SUS, our sys_sync() implementation could be
>
> asmlinkage void sys_sync(void)
> {
> return;
> }
>
> Because all I/O is already scheduled, thanks to kupdate.

I think the wording is queued, and I would read that as "on the
elevator." Your example is a good example of bad practice, since even with
ext3 a program creating files quickly would lose data; even though the
directory structure is returned to a known state, without stopping the
writing processes the results are unknown.

> But we want sync to be useful.

No one has proposed otherwise. Unless you think that a possible hang is
useful, the questions becomes adding all dirty buffers to the elevator,
then (a) waiting or (b) returning. Either satisfies SuSv2.

>
> >
> > If this were only a performance issue I wouldn't push for prompt
> > implementation, but anything which can hang the system, particularly in
> > shutdown, is bad.
> >
>
> If shutdown hangs, it's probably due to something else.

If you discount the evidence you can prove anything... or disbelieve
anything.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-02-13 05:36:44

by Andrew Morton

Subject: Re: [patch] sys_sync livelock fix

Bill Davidsen wrote:
>
> > But we want sync to be useful.
>
> No one has proposed otherwise. Unless you think that a possible hang is
> useful, the questions becomes adding all dirty buffers to the elevator,
> then (a) waiting or (b) returning. Either satisfies SuSv2.

errr. Bill. I wrote the patch. Please take this as a sign
that I'm not happy with the current implementation :)

> >
> > >
> > > If this were only a performance issue I wouldn't push for prompt
> > > implementation, but anything which can hang the system, particularly in
> > > shutdown, is bad.
> > >
> >
> > If shutdown hangs, it's probably due to something else.
>
> If you discount the evidence you can prove anything... or disbelieve
> anything.

Thing is, at shutdown all the tasks which are generating write
traffic should have been killed off, and the filesystems unmounted
or set readonly. There's no obvious way in which this shutdown
sequence can be indefinitely stalled by the problem we're discussing here.

If the shutdown scripts are calling sys_sync() *before* killing
everything then yes, the scripts could hang indefinitely. Is
this the case?

If "yes" then the scripts are dumb. Just remove the `sync' call.

If "no" then something else is causing the hang.

-

2002-02-13 09:19:53

by Andries E. Brouwer

Subject: Re: [patch] sys_sync livelock fix

> Yow, your message inspired me to re-read SuSv2 and indeed confirm,

As a side note, these days you should be reading SuSv3,
it is an official standard now. See, for example,

http://www.UNIX-systems.org/version3/
http://www.opengroup.org/onlinepubs/007904975/toc.htm

> sync(2) schedules I/O but can return before completion

Don't forget that this standard does not describe what is
desirable, but describes the minimum guaranteed by all
Unices considered.

Having a sync that returns without having written the data
is not especially useful. Also without the sync this data
would have been written sooner or later.
We changed sync to wait, long ago, because otherwise shutdown
would cause filesystem corruption.

Andries

2002-02-13 14:11:53

by Bill Davidsen

Subject: Re: [patch] sys_sync livelock fix

In article <[email protected]> you write:
| Bill Davidsen wrote:
| >
| > > But we want sync to be useful.
| >
| > No one has proposed otherwise. Unless you think that a possible hang is
| > useful, the questions becomes adding all dirty buffers to the elevator,
| > then (a) waiting or (b) returning. Either satisfies SuSv2.
|
| errr. Bill. I wrote the patch. Please take this as a sign
| that I'm not happy with the current implementation :)

Sorry, I had been sitting at a keyboard for about 16 hours when I typed
that, and didn't look at the sender... Lots of other typos in there as
well, a sign of the need for 3-4 hours' sleep.

But I think sync(2) as a checkpoint (write out all dirty buffers at the
moment of the sync call) is fine and deterministic, and all that.

That serves the shutdown case as well: if there is a process in some
unkillable state but somehow still writing, at least the system will go
down. I'm not sure any process not killable with kill -9 is able to do
anything, but I won't bet on it.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-02-13 15:11:57

by Daniel Phillips

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

On February 13, 2002 04:46 am, Jeff Garzik wrote:
> Bill Davidsen wrote:
> >
> > On Wed, 13 Feb 2002, Alan Cox wrote:
> >
> > > > > What's wrong with sync not terminating when there is permanently I/O left?
> > > > > It seems preferable to surprise data loss
> > > >
> > > > Hard call. What do we *want* sync to do?
> > >
> > > I'd rather not change the 2.4 behaviour - just in case. For 2.5 I really
> > > have no opinion either way if SuS doesn't mind
> >
> > Alan, I think you have this one wrong, although SuS seems to have it wrong
> > as well, and if Linux did what SuS said there would be no problem.
> >
> > - What SuS seems to say is that all dirty buffers will be queued for physical
> > write. I think if we did that the livelock would disappear, but data
> > integrity might suffer.
> > - sync() could be followed by write() at the very next dispatch, and it
> > was never intended to be the last call after which no writes would be
> > done. It is a point in time.
> > - the most common use of sync() is to flush data written to all files of the
> > current process. If there were a better way to do it which was portable,
> > sync() would be called less. I doubt there are processes which assume
> > that no write will be done after sync() returns.
> > - since sync() can't promise "no new writes" why try to make it do so? It
> > should mean "write current dirty buffers" and that's far more than SuS
> > requires.
> >
> > I don't think benchmarks are generally important, but in this case the
> > benchmark reveals that we have been implementing a system call in a way
> > which not only does more than SuS requires, but more than the user
> > expects. To leave it trying to do even more than that seems to have no
> > benefit and a high (possible) cost.
>
> Yow, your message inspired me to re-read SuSv2 and indeed confirm,
> sync(2) schedules I/O but can return before completion,

I think that's just stupid and we have a duty to fix it, it's an anachronism.
The _natural_ expectation for the user is that sync means 'don't come back
until data is on disk' and any other interpretation is just apologizing for
halfway implementations.

Sync should mean: come back after all filesystem transactions submitted before
the sync are recorded, such that if the contents of RAM vanish the data can
still be retrieved.

For dumb filesystems, this can degenerate to 'just try to write all the dirty
blocks', the traditional Linux interpretation, but for journalling filesystems
we can do the job properly.

> while fsync(2) schedules I/O and waits for completion.

Yes, right.

> So we need to implement system call checkpoint(2) ? schedule I/O,
> introduce an I/O barrier, then sleep until that I/O barrier and all I/O
> scheduled before it occurs.

How about adding: sync --old-broken-way

On this topic, it would make a lot of sense from the user's point of view to
have a way of syncing a single volume, how would we express that?

--
Daniel

2002-02-13 15:13:16

by Daniel Phillips

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

On February 13, 2002 05:01 am, Jeff Garzik wrote:
> I don't think SuSv2 can be any more clear than:
> > The writing, although scheduled, is not necessarily complete
> > upon return from sync().

That doesn't mean we can't do more than that, as the naive user would expect.

--
Daniel

2002-02-13 15:25:26

by Daniel Phillips

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

On February 13, 2002 06:21 am, Bill Davidsen wrote:
> On Tue, 12 Feb 2002, Andrew Morton wrote:
>
> > IMO, the SuS definition sucks. We really do want to do our best to
> > ensure that pending writes are committed to disk before sys_sync()
> > returns. As long as that doesn't involve waiting until mid-August.
>
> The current behaviour allows the system to hang forever waiting for
> sync(2). In practice it does actually wait minutes on a busy system (df
> has --no-sync for that reason) when there is no reason for that to happen.
> I think that not only sucks worse, it's non-standard as well.

Nothing sucks worse than losing your data. Let's concentrate on fixing
shutdown, not breaking (linux) sync.

> > For example, ext3 users get to enjoy rebooting with `sync ; reboot -f'
> > to get around all those silly shutdown scripts. This very much relies
> > upon the sync waiting upon the I/O.
>
> Because people count on something broken we should keep the bug? You do
> realize that the sync may NEVER finish?

You do realize that if you lose your data you may NEVER get it back? ;-)

> That's not in theory, I have news
> servers which may wait overnight without finishing a "df" without the
> option.

OK, what you're really saying is, we need a way to kill the sync process
if it runs overtime, no?

> > I mean, according to SUS, our sys_sync() implementation could be
> >
> > asmlinkage void sys_sync(void)
> > {
> > return;
> > }
> >
> > Because all I/O is already scheduled, thanks to kupdate.
>
> I think the wording is queued, and I would read that as "on the
> elevator."

Well now you're adding your own semantics to SuS, welcome to the party.
I vote we keep the existing and-don't-come-back-until-you're-done Linux
semantics.

> Your example is a good example of bad practice, since even with
> ext3 a program creating files quickly would lose data, even though the
> directory structure is returned to a known state, without stopping the
> writing processes the results are unknown.

Huh? You know about journal commit, right?

--
Daniel

2002-02-13 16:19:55

by Olaf Dietsche

[permalink] [raw]
Subject: Re: What is a livelock? (was: [patch] sys_sync livelock fix)

Andrew Morton <[email protected]> writes:

> Olaf Dietsche wrote:
>>
>> I still don't get it :-(. When there is more work, this more work
>> needs to be done. So, how could livelock be considered a bug? It's
>> just overload. Or is this about the work, which must be done _after_
>> the queue is empty?
>>
> ...
> [good explanation, so even I grasped it :-)]

Thanks to all, who tried hard explaining livelock.

Regards, Olaf.

2002-02-13 22:27:52

by Bill Davidsen

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

On Wed, 13 Feb 2002, Daniel Phillips wrote:

> On February 13, 2002 04:46 am, Jeff Garzik wrote:

> > Yow, your message inspired me to re-read SuSv2 and indeed confirm,
> > sync(2) schedules I/O but can return before completion,
>
> I think that's just stupid and we have a duty to fix it, it's an anachronism.
> The _natural_ expectation for the user is that sync means 'don't come back
> until data is on disk' and any other interpretation is just apologizing for
> halfway implementations.

Feel free to join a standards committee. In the meantime, I agree we have
a duty to fix it, since the current implementation can hang forever
without improving the security of the data one bit; therefore sync(2)
should return after all data generated before the sync has been written,
and not wait for all data written by all processes in the system to
complete.

BTW: I think users would expect the system call to work as the standard
specifies, not some better way which would break on non-Linux systems. Of
course now working programs which conform to the standard DO break on
Linux.

> Sync should mean: come back after all filesystem transactions submitted before
> the sync are recorded, such that if the contents of RAM vanishes the data can
> still be retrieved.

We agree on that, and it doesn't violate the standard. What we do now
does.

> For dumb filesystems, this can degenerate to 'just try to write all the dirty
> blocks', the traditional Linux interpretation, but for journalling filesystems
> we can do the job properly.

It doesn't matter: if you write the existing dirty buffers, the filesystem
type is irrelevant. And if you have cache in your controller and/or drives
the data might be there and not on the disk. If you have those IBM drives
discussed a few months ago and a bad sector, the drive may drop the data.
The point I'm making is that doing it really right is harder than it
seems.

Also, there are applications which don't like journals because they create
and delete lots of little files, or update the file information
frequently, resulting in a write to the journal. Sendmail, web servers,
and usenet news do this in many cases. That's why the noatime option was
added.

> > while fsync(2) schedules I/O and waits for completion.
>
> Yes, right.
>
> > So we need to implement system call checkpoint(2) ? schedule I/O,
> > introduce an I/O barrier, then sleep until that I/O barrier and all I/O
> > scheduled before it occurs.
>
> How about adding: sync --old-broken-way

The problem is that the system call should work in a way which doesn't
violate the standard. I think waiting for all existing dirty buffers is
conforming, waiting until hell freezes over isn't, nor does it have any
benefit to the user, since the sync is either an end of the execution
safety net or a checkpoint. In either case the user doesn't expect to have
the program hang after *his/her* data is safe.

> On this topic, it would make a lot of sense from the user's point of view to
> have a way of syncing a single volume, how would we express that?

I have an idea, I'll put it in another message, since only two of us are
likely to be reading at this point.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-02-13 22:42:13

by Mike Fedyk

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

On Wed, Feb 13, 2002 at 05:24:38PM -0500, Bill Davidsen wrote:
> On Wed, 13 Feb 2002, Daniel Phillips wrote:
> > On this topic, it would make a lot of sense from the user's point of view to
> > have a way of syncing a single volume, how would we express that?
>
> I have an idea, I'll put it in another message, since only two of us are
> likely to be reading at this point.
>

Not true...

There's at least me. ;) I have this thread flagged as "interesting".

Syncing a specific volume/fs is quite interesting also. Maybe "sync
[--only-current-fs] [--fs $dev]"

Mike

2002-02-13 22:55:28

by Bill Davidsen

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

On Wed, 13 Feb 2002, Daniel Phillips wrote:

> On February 13, 2002 06:21 am, Bill Davidsen wrote:

> > The current behaviour allows the system to hang forever waiting for
> > sync(2). In practice it does actually wait minutes on a busy system (df
> > has --no-sync for that reason) when there is no reason for that to happen.
> > I think that not only sucks worse, it's non-standard as well.
>
> Nothing sucks worse than losing your data. Let's concentrate on fixing
> shutdown, not breaking (linux) sync.

If you have read my previous posts, you know that the current behaviour is
broken: it allows users to glitch system performance at no benefit to the
issuing process, etc. In other words sync is currently broken in terms of
theory, practice, and standard.

> > > For example, ext3 users get to enjoy rebooting with `sync ; reboot -f'
> > > to get around all those silly shutdown scripts. This very much relies
> > > upon the sync waiting upon the I/O.

Of course between the sync and execution of the reboot the disk buffers
may be totally full again, and if shutdown isn't close to instant perhaps
you have a faulty shutdown and should fix it.

> > Because people count on something broken we should keep the bug? You do
> > realize that the sync may NEVER finish?
>
> You do realize that if you lose your data you may NEVER get it back? ;-)

The sync doesn't protect my data; after my data has been written, why
should I care to wait while all the data in every active program in the
system gets written? This makes checkpoints stop points on a busy system.

> > That's not in theory, I have news
> > servers which may wait overnight without finishing a "df" iwthout the
> > option.
>
> OK, what you're really saying is, we need a way to kill the sync process
> if it runs overtime, no?

If that maps into write the current dirty buffers and get on with life,
yes. I posted a better solution for comment, I won't duplicate that here.

> > > I mean, according to SUS, our sys_sync() implementation could be
> > >
> > > asmlinkage void sys_sync(void)
> > > {
> > > return;
> > > }
> > >
> > > Because all I/O is already scheduled, thanks to kupdate.

I can't see anyone taking that reading or that implementation, so let's
not discuss how badly we can do a standard-conforming kernel. This is
Linux, we do the best at everything once we figure out what that really
means ;-)

> > Your example is a good example of bad practice, since even with
> > ext3 a program creating files quickly would lose data, even though the
> > directory structure is returned to a known state, without stopping the
> > writing processes the results are unknown.
>
> Huh? You know about journal commit, right?

Read or reread my other notes on that: a journal prevents directory
corruption, it doesn't prevent data loss the way a database transaction
does. Returning to a known good state does not include "without losing
any data written to unclosed files."

I leave it to Mr Reiser to clarify that or correct me if data is protected
without using unbuffered writes.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-02-13 23:31:10

by Rob Landley

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

On Wednesday 13 February 2002 10:11 am, Daniel Phillips wrote:

> On this topic, it would make a lot of sense from the user's point of view
> to have a way of syncing a single volume, how would we express that?

If you're talking about sync(1), I'd make it work like df. Typing df with no
arguments lists all volumes, df with a path looks at just that path. (And
"df ." works fine too.)

If you're asking about sync(2) and how it should talk to the kernel, I'm not
going to express an opinion...

Rob

2002-02-14 00:26:47

by Daniel Phillips

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

On February 13, 2002 11:24 pm, Bill Davidsen wrote:
> On Wed, 13 Feb 2002, Daniel Phillips wrote:
>
> > On February 13, 2002 04:46 am, Jeff Garzik wrote:
>
> > > Yow, your message inspired me to re-read SuSv2 and indeed confirm,
> > > sync(2) schedules I/O but can return before completion,
> >
> > I think that's just stupid and we have a duty to fix it, it's an anachronism.
> > The _natural_ expectation for the user is that sync means 'don't come back
> > until data is on disk' and any other interpretation is just apologizing for
> > halfway implementations.
>
> Feel free to join a standards committee.

I did, it's the LCSG (Linux Cabal Standards Group) :-)

> In the mean time, I agree we have
> a duty to fix it, since the current implementation can hang forever
> without improving the security of the data one bit, therefore sync(2)
> should return after all data generated before the sync has been written
> and not wait for all data written by all processes in the system to
> complete.

Yes, absolutely, that's a bug.

> BTW: I think users would expect the system call to work as the standard
> specifies, not some better way which would break on non-Linux systems. Of
> course now working programs which conform to the standard DO break on
> Linux.

No, it should work in the _best_ way, and if the standard got it wrong then
the standard has to change.

> > For dumb filesystems, this can degenerate to 'just try to write all the dirty
> > blocks', the traditional Linux interpretation, but for journalling filesystems
> > we can do the job properly.
>
> It doesn't matter, if you write the existing dirty buffers the filesystem
> type is irrelevant.

Incorrect. The modern crop of filesystems has the concept of consistency
points, and data written after a consistency point is irrelevant except to the
next consistency point. IOW, it's often ok to leave some buffers dirty on a
sync. But for a dumb filesystem you just have to guess at what's needed for
a consistency point, and the best guess is 'whatever's dirty at the time of
sync'.

For metadata-only journalling the issues get more subtle and we need a ruling
from the ext3 guys.

> And if you have cache in your controller and/or drives
> the data might be there and not on the disk.

We're working on that, see Jens's recent series of patches re barriers.

> If you have those IBM drives
> discussed a few months ago and a bad sector, the drive may drop the data.
> The point I'm making is that doing it really right is harder than it
> seems.

That's being worked on too, see Andre Hedrick's linuxdiskcert.org.

> Also, there are applications which don't like journals because they create
> and delete lots of little files, or update the file information
> frequently, resulting in a write to the journal. Sendmail, web servers,
> and usenet news do this in many cases. That's why the noatime option was
> added.

Sorry, I don't see the connection to sync.

> > > while fsync(2) schedules I/O and waits for completion.
> >
> > Yes, right.
> >
> > > So we need to implement system call checkpoint(2) ? schedule I/O,
> > > introduce an I/O barrier, then sleep until that I/O barrier and all I/O
> > > scheduled before it occurs.
> >
> > How about adding: sync --old-broken-way
>
> The problem is that the system call should work in a way which doesn't
> violate the standard.

Waiting until the data is on the platter doesn't violate SuS.

> I think waiting for all existing dirty buffers is
> conforming, waiting until hell freezes over isn't,

Where does it say that in SuS? I'm not arguing in favor of waiting longer
than necessary, mind you.

> nor does it have any
> benefit to the user, since the sync is either an end of the execution
> safety net or a checkpoint. In either case the user doesn't expect to have
> the program hang after *his/her* data is safe.

Have you asked any users about that?

--
Daniel

2002-02-14 00:29:27

by Daniel Phillips

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

On February 13, 2002 11:53 pm, Bill Davidsen wrote:
> On Wed, 13 Feb 2002, Daniel Phillips wrote:
> > > Because people count on something broken we should keep the bug? You do
> > > realize that the sync may NEVER finish?
> >
> > You do realize that if you lose your data you may NEVER get it back? ;-)
>
> The sync doesn't protect my data, after my data has been written why
> should I care to wait while all the data in every active program in the
> system gets written. This makes checkpoints stop points on a busy system.

Sync should not wait for data written after the sync. If it does, it's
broken and needs to be fixed.

> > > Your example is a good example of bad practice, since even with
> > > ext3 a program creating files quickly would lose data, even though the
> > > directory structure is returned to a known state, without stopping the
> > > writing processes the results are unknown.
> >
> > Huh? You know about journal commit, right?
>
> Read or reread my other notes on that, journal prevents directory
> corruption, it doesn't prevent data loss like a database transaction.
> Returning to a known good state does not include "without losing any data
> written to unclosed files."

It's true, we get into grey areas with ordered-data journalling, but it's
black and white with full data journalling.

> I leave it to Mr Reiser to clarify that or correct me if data is protected
> without using unbuffered writes.

--
Daniel

2002-02-14 00:39:10

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

Daniel Phillips wrote:
>
> On February 13, 2002 11:24 pm, Bill Davidsen wrote:
> > ...
> > It doesn't matter, if you write the existing dirty buffers the filesystem
> > type is irrelevant.
>
> Incorrect. The modern crop of filesystems has the concept of consistency
> points, and data written after a consistency point is irrelevant except to the
> next consistency point. IOW, it's often ok to leave some buffers dirty on a
> sync. But for a dumb filesystem you just have to guess at what's needed for
> a consistency point, and the best guess is 'whatever's dirty at the time of
> sync'.
>
> For metadata-only journalling the issues get more subtle and we need a ruling
> from the ext3 guys.

The current implementation of fsync_dev is about as good as
it'll get for journal=writeback mode - write the data,
run a commit, write the data again then wait on it all.

>
> Sorry, I don't see the connection to sync.

I don't understand the whole thread :)

The patch I sent yesterday, which is at
http://www.zip.com.au/~akpm/linux/2.4/2.4.18-pre9/sync_livelock.patch
provides sensible and safe sync semantics, and avoids livelock.

It'd be good if someone else could, like, apply and test it, rather
than sending out all this email and stuff.

-

2002-02-14 00:40:38

by Daniel Phillips

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

On February 14, 2002 12:31 am, Rob Landley wrote:
> On Wednesday 13 February 2002 10:11 am, Daniel Phillips wrote:
>
> > On this topic, it would make a lot of sense from the user's point of view
> > to have a way of syncing a single volume, how would we express that?
>
> If you're talking about sync(1), I'd make it work like df. Typing df with no
> arguments lists all volumes, df with a path looks at just that path. (And
> "df ." works fine too.)

Yes, that's the right interface from the user's point of view.

> If you're asking about sync(2) and how it should talk to the kernel, I'm not
> going to express an opinion...

Patches speak louder than opinions. First we need a vfs super->sync method,
coming soon.

--
Daniel

2002-02-14 00:45:10

by Daniel Phillips

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

On February 14, 2002 01:37 am, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > On February 13, 2002 11:24 pm, Bill Davidsen wrote:
> > > ...
> > > It doesn't matter, if you write the existing dirty buffers the filesystem
> > > type is irrelevant.
> >
> > Incorrect. The modern crop of filesystems has the concept of consistency
> > points, and data written after a consistency point is irrelevant except to the
> > next consistency point. IOW, it's often ok to leave some buffers dirty on a
> > sync. But for a dumb filesystem you just have to guess at what's needed for
> > a consistency point, and the best guess is 'whatever's dirty at the time of
> > sync'.
> >
> > For metadata-only journalling the issues get more subtle and we need a ruling
> > from the ext3 guys.
>
> The current implementation of fsync_dev is about as good as
> it'll get for journal=writeback mode - write the data,
> run a commit, write the data again then wait on it all.

What's the theory behind writing the data both before and after the commit?

> >
> > Sorry, I don't see the connection to sync.
>
> I don't understand the whole thread :)

Dangerous advocacy of the broken SuS semantics for sync, has to be stamped
out before it spreads ;-)

--
Daniel

2002-02-14 00:54:28

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

Daniel Phillips wrote:
>
> What's the theory behind writing the data both before and after the commit?

see fsync_dev(). It starts I/O against existing dirty data, then
does various fs-level syncy things which can produce more dirty
data - this is where ext3 runs its commit, via brilliant reverse
engineering of its calling context :-(. It then again starts I/O
against new dirty data then waits on it again. And then again.

There's quite a lot of overkill there. But that's OK, as long
as it terminates sometime.

-

2002-02-14 00:59:12

by Andries E. Brouwer

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

> it should work in the _best_ way, and if the standard got it wrong
> then the standard has to change.

: BTW: I think users would expect the system call to work as the standard
: specifies, not some better way which would break on non-Linux systems. Of
: course now working programs which conform to the standard DO break on
: Linux.

Let me repeat:
The standard describes a *minimum*.
A system that does not give more than this minimum would be
a very poor system indeed.

That POSIX does not require more than 14 bytes in a filename
and does not promise me more than 6 simultaneous processes
does not prevent us from having something better.

In this particular case (sync) the minimum required is
essentially empty. The proposed semantics, namely making
sure before returning that all writes scheduled at the time
of the call have completed, seems entirely satisfactory.

Andries


(BTW Is your df broken? It has been a very long time since my df
did a sync.)

2002-02-14 01:23:52

by Daniel Phillips

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

On February 14, 2002 01:53 am, Andrew Morton wrote:
> Daniel Phillips wrote:
> >
> > What's the theory behind writing the data both before and after the commit?
>
> see fsync_dev(). It starts I/O against existing dirty data, then
> does various fs-level syncy things which can produce more dirty
> data - this is where ext3 runs its commit, via brilliant reverse
> engineering of its calling context :-(.

OK, so it sounds like cleaning that up with an ext3-specific super->sync would
be cleaner for what it's worth, and save a little cpu.

> It then again starts I/O against new dirty data then waits on it again. And
> then again. There's quite a lot of overkill there. But that's OK, as long
> as it terminates sometime.

/me doesn't comment

--
Daniel

2002-02-14 01:30:41

by Andrew Morton

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

Daniel Phillips wrote:
>
> On February 14, 2002 01:53 am, Andrew Morton wrote:
> > Daniel Phillips wrote:
> > >
> > > What's the theory behind writing the data both before and after the commit?
> >
> > see fsync_dev(). It starts I/O against existing dirty data, then
> > does various fs-level syncy things which can produce more dirty
> > data - this is where ext3 runs its commit, via brilliant reverse
> > engineering of its calling context :-(.
>
> OK, so it sounds like cleaning that up with an ext3-specific super->sync would
> be cleaner for what it's worth, and save a little cpu.

Oh, having a filesystem sync entry point is much more than
a little cleanup. It's quite important. In current kernels
the same code path is used for both sync() and for periodic
kupdate writeback. It's not possible for the filesystem
to know which context it's being called in, and we do want
different behaviour.

We want the sys_sync() path to wait on writeout, but it's
silly to make the kupdate path do that.


> > It then again starts I/O against new dirty data then waits on it again. And
> > then again. There's quite a lot of overkill there. But that's OK, as long
> > as it terminates sometime.
>
> /me doesn't comment

That's odd.

-

:)

2002-02-14 02:00:01

by Mike Fedyk

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

On Thu, Feb 14, 2002 at 01:49:03AM +0100, Daniel Phillips wrote:
> Dangerous advocacy of the broken SuS semantics for sync, has to be stamped
> out before it spreads ;-)

Daniel,

You seem to be taking both sides of this argument.

Do you agree that sync should be changed to a checkpoint, so that it doesn't
block for dirty data created *after* sync was called?

2002-02-14 02:04:01

by Daniel Phillips

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

On February 14, 2002 02:59 am, Mike Fedyk wrote:
> On Thu, Feb 14, 2002 at 01:49:03AM +0100, Daniel Phillips wrote:
> > Dangerous advocacy of the broken SuS semantics for sync, has to be stamped
> > out before it spreads ;-)
>
> Daniel,
>
> You seem to be taking both sides of this argument.
>
> Do you agree that sync should be changed to a checkpoint, so that it doesn't
> block for dirty data created *after* sync was called?

Yes, sorry if I was ambiguous in stating that in any way.

--
Daniel

2002-02-18 02:31:03

by Bill Davidsen

[permalink] [raw]
Subject: Re: [patch] sys_sync livelock fix

On Tue, 12 Feb 2002, Andrew Morton wrote:

> Thing is, at shutdown all the tasks which are generating write
> traffic should have been killed off, and the filesystems unmounted
> or set readonly. There's no obvious way in which this shutdown
> sequence can be indefinitely stalled by the problem we're discussing here.
>
> If the shutdown scripts are calling sys_sync() *before* killing
> everything then yes, the scripts could hang indefinitely. Is
> this the case?
>
> If "yes" then the scripts are dumb. Just remove the `sync' call.
>
> If "no" then something else is causing the hang.

I recently had the Linux version of Cyclone (usenet transit server) get
unhappy and start growing load without bound. At a load average of 60 I
issued a killall of the application (which ran on happily), then a
killall -9, which also didn't kill the processes (there is a problem
there, a process should NOT keep running after kill -9!). Then a "reboot"
which hung, and finally a "reboot -n" which actually rebooted the machine.

Since -n only means "do the sync but don't wait", I have to think that the
problem is not that the script doesn't issue a kill, but that the kill
doesn't work. I think making sync(2) a single-pass write of current
dirty buffers is the only way to avoid stuff like this. Making kill -9
work would be great too; there seems to be a long-term problem with
threaded programs which are hard to kill.

Just another data point, the bug in this case is not in shutdown.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2002-02-18 22:19:48

by David Schwartz

[permalink] [raw]
Subject: Re: What is a livelock? (was: [patch] sys_sync livelock fix)


>I still don't get it :-(. When there is more work, this more work
>needs to be done. So, how could livelock be considered a bug? It's
>just overload. Or is this about the work, which must be done _after_
>the queue is empty?

Livelock situations can differ. One common issue is when you tune your
ability to handle load only at a certain point and the load level is such
that you never reach that point.

Consider:

    int work_count;

    while (1) {
        work_count = 0;
        while (work_is_still_to_be_done()) {
            do_work();
            work_count++;
        }
        if (work_count > 250)
            create_more_threads();
    }

In this case, you may be so busy doing work that you never look at how much
work you did and realize it was too much, so the additional threads never get
created, and so you remain overloaded forever.

There are other livelock scenarios that don't involve load over what the
system can actually take, just over what the system is currently tuned to
take. Any scheme that considers tuning a low priority can get into this kind
of problem.

Here's another case:

    lock(first_lock);
    /* some stuff */
    lock(second_lock);
    while (there_is_work()) {
        unlock(second_lock);
        do_work();
        lock(second_lock);
    }
    unlock(second_lock);
    /* more stuff */
    unlock(first_lock);

In this case, so long as work keeps flowing in and keeps this thread
saturated, the first lock may never be released. This is true even if there
are other threads capable of doing this work without holding the first lock.
Since this thread remains perpetually ready, it may remain perpetually
scheduled. This is probably the most common type of livelock encountered in
kernel code.

DS