2007-06-07 13:44:36

by Mark Lord

Subject: ext3fs: umount+sync not enough to guarantee metadata-on-disk

Andrew / Stephen / Ted,

I have a MythTV PVR box here, which has had intermittent shutdown issues
over the past while.

The main storage for recordings is a 2-drive RAID0 array,
formatted with ext3fs. The system runs a 2.6.17 Ubuntu kernel
that I've tailored/rebuilt for this specific machine.

My observation is that if I delete a couple of 25+ GByte files
and then immediately shut down the system, the disks are still being
written to at the point when the power goes off (halt -f -p).

The overall sequence is something like this:

1. Delete the files in Myth, which uses a "delete slowly" function
to avoid locking up the machine for the 30-60 seconds that this would
otherwise require. Myth appears to open the file, unlink it, and then
sit in a loop doing small ftruncate's until nothing is left.

2. When I trigger the shutdown whilst this is happening, Myth gets
killed off, and so the unlinked file is automatically closed,
and the kernel (filesystem) code begins finishing the delete operation.

3. The shutdown scripts do their thing quickly, so the delete is
*still* underway when the umount commands are issued.
On this system, I use this sequence:

## /var/lib/mythtv is the recording's ext3fs, on /dev/md0 (RAID0):
mount /var/lib/mythtv -oremount,ro
sync
umount /var/lib/mythtv
sync
mount / -oremount,ro
sync
sleep 1
hdparm -W0 /dev/sda /dev/sdb
sync
sleep 2
halt -f -p

4. The hard drive light is on solid throughout, including at the point
when the power goes out.

5. On the next reboot, there is a LONG pause (20-30 seconds) at the
point where /var/lib/mythtv is remounted --> indicating unfinished business
from the journal file that needs to be replayed (eg. the file deletion).

So.. how can I guarantee a quiescent filesystem before doing "halt -f -p" ??
This looks pretty dangerous as-is.

Thanks


2007-06-07 15:42:18

by Andrew Morton

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

On Thu, 07 Jun 2007 09:44:24 -0400 Mark Lord <[email protected]> wrote:

> Andrew / Stephen / Ted,
>
> I have a MythTV PVR box here, which has had intermittent shutdown issues
> over the past while.
>
> The main storage for recordings is a 2-drive RAID0 array,
> formatted with ext3fs. The system runs a 2.6.17 Ubuntu kernel
> that I've tailored/rebuilt for this specific machine.
>
> My observation is, if I delete a couple of 25+ GByte files,
> and then immediately shutdown the system, the disks are still being
> written to at the point when the power goes off (halt -f -p).
>
> The overall sequence is something like this:
>
> 1. Delete the files in Myth, which uses a "delete slowly" function
> to avoid locking up the machine for the 30-60 seconds that this would
> otherwise require. Myth appears to open the file, unlink it, and then
> sit in a loop doing small ftruncate's until nothing is left.

sigh. yup.

> 2. When I trigger the shutdown whilst this is happening, Myth gets
> killed off, and so the unlinked file is automatically closed,
> and the kernel (filesystem) code begins finishing the delete operation.
>
> 3. The shutdown scripts do their thing quickly, so the delete is
> *still* underway when the umount commands are issued.
> On this system, I use this sequence:
>
> ## /var/lib/mythtv is the recording's ext3fs, on /dev/md0 (RAID0):

I assume the application has already been killed at this stage, and it is
blocked in the kernel running the truncate?

> mount /var/lib/mythtv -oremount,ro
> sync
> umount /var/lib/mythtv

Did this succeed? If the application is still truncating that file, the
umount should have failed.

> sync
> mount / -oremount,ro
> sync
> sleep 1
> hdparm -W0 /dev/sda /dev/sdb
> sync
> sleep 2
> halt -f -p
>
> 4. The hard drive light is on solid throughout, including at the point
> when the power goes out.
>
> 5. On the next reboot, there is a LONG pause (20-30 seconds) at the
> point where /var/lib/mythtv is remounted --> indicating unfinished business
> from the journal file that needs to be replayed (eg. the file deletion).

That opened-but-deleted file's inode is on the orphan list.

See, the unlink-then-slowly-truncate trick is done in this fashion so that
if the box crashes during the slow unlink, the orphan list handling on the
reboot will finish off the truncate for us.
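
For illustration, the unlink-then-slowly-truncate pattern looks roughly
like the sketch below. This is not MythTV's code: the function name
slow_delete, the step size and the pause are made up for the example,
and it assumes bash, GNU coreutils (stat, truncate) and a Linux /proc.

```shell
#!/bin/bash
# slow_delete FILE [STEP_BYTES] [PAUSE_SECONDS]
# Hypothetical helper: unlink first, then shrink the open file bit by bit.
# Note: uses fd 3 of the calling shell for the duration of the call.
slow_delete() {
    local file=$1 step=${2:-1048576} pause=${3:-0}
    local size
    size=$(stat -c %s -- "$file") || return 1

    # Hold the file open on fd 3, then remove its name. From here on the
    # inode exists only because of this open descriptor.
    exec 3<>"$file" || return 1
    rm -f -- "$file"

    # Shrink the still-open file a little at a time. truncate inherits
    # fd 3 and reaches the unlinked inode through /proc/self/fd/3.
    while [ "$size" -gt 0 ]; do
        size=$(( size > step ? size - step : 0 ))
        truncate -s "$size" /proc/self/fd/3 || break
        sleep "$pause"
    done

    # Closing the last reference lets the kernel free whatever is left.
    exec 3>&-
}
```

Something like "slow_delete /path/to/bigfile 1048576 0.1" would spread
the block freeing over many small truncates. If the process is killed
partway through, closing the descriptor leaves the kernel to free the
remainder in one go, which is the situation described in this thread.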

> So.. how can I guarantee a quiescent filesystem before doing "halt -f -p" ??
> This looks pretty dangerous as-is.

Wait for the killed-off application to actually exit, perhaps? But
that unmount should have failed.


2007-06-07 16:02:38

by Mark Lord

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

Andrew Morton wrote:
> On Thu, 07 Jun 2007 09:44:24 -0400 Mark Lord <[email protected]> wrote:
..
>> 2. When I trigger the shutdown whilst this is happening, Myth gets
>> killed off, and so the unlinked file is automatically closed,
>> and the kernel (filesystem) code begins finishing the delete operation.
>>
>> 3. The shutdown scripts do their thing quickly, so the delete is
>> *still* underway when the umount commands are issued.
>> On this system, I use this sequence:
>>
>> ## /var/lib/mythtv is the recording's ext3fs, on /dev/md0 (RAID0):
>
> I assume the application has already been killed at this stage, and it is
> blocked in the kernel running the truncate?

Yes, I believe I saw that once.

>> mount /var/lib/mythtv -oremount,ro
>> sync
>> umount /var/lib/mythtv
>
> Did this succeed? If the application is still truncating that file, the
> umount should have failed.

Actually, what I expect to happen is for the remount,ro
to block until the file deletion completes. But it doesn't.

Once a f/s is read-only, there should be NO writing to it. Right?

I don't know if the umount worked or not, but the f/s ought to be
read-only at this point, so why is it still writing to the device?

I'll instrument the shutdown more for next time, to see if the remount
and umount really do succeed or not. Mmm.. do they log anything on failure?
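
Both mount(8) and umount(8) report failure through a non-zero exit
status (and normally a message on stderr), so one way to instrument
the script is to wrap each step. The wrapper name run_step and the
"shutdown:" message prefix below are made up for this sketch.

```shell
#!/bin/bash
# run_step CMD [ARGS...]
# Hypothetical wrapper: run one shutdown step and report its exit status.
run_step() {
    "$@"
    local rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "shutdown: OK: $*"
    else
        echo "shutdown: FAILED (exit $rc): $*"
    fi
    return "$rc"
}

# Possible use in the sequence quoted in this thread:
#   run_step mount /var/lib/mythtv -oremount,ro
#   run_step umount /var/lib/mythtv
```

The wrapper passes the original exit status through, so the caller can
still branch on it.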

>> sync
>> mount / -oremount,ro
>> sync
>> sleep 1
>> hdparm -W0 /dev/sda /dev/sdb
>> sync
>> sleep 2
>> halt -f -p
>>
>> 4. The hard drive light is on solid throughout, including at the point
>> when the power goes out.
>>
>> 5. On the next reboot, there is a LONG pause (20-30 seconds) at the
>> point where /var/lib/mythtv is remounted --> indicating unfinished business
>> from the journal file that needs to be replayed (eg. the file deletion).
>
> That opened-but-deleted file's inode is on the orphan list.
>
> See, the unlink-then-slowly-truncate trick is done in this fashion so that
> if the box crashes during the slow unlink, the orphan list handling on the
> reboot will finish off the truncate for us.

Yes, absolutely.

>> So.. how can I guarantee a quiescent filesystem before doing "halt -f -p" ??
>> This looks pretty dangerous as-is.
>
> Wait for the killed-off application to actually exit, perhaps? But
> that unmount should have failed.

But some applications just "hang" regardless, and so this cannot wait forever.
There must be *some* way to know when a filesystem is really quiescent
and therefore safe to power off?

Cheers

2007-06-07 16:12:54

by Chuck Ebbert

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

On 06/07/2007 11:41 AM, Andrew Morton wrote:
>> mount /var/lib/mythtv -oremount,ro
>> sync
>> umount /var/lib/mythtv
>
> Did this succeed? If the application is still truncating that file, the
> umount should have failed.

Shouldn't sync wait for truncate to finish?

2007-06-07 17:10:08

by Stephen C. Tweedie

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

Hi,

On Thu, 2007-06-07 at 12:01 -0400, Mark Lord wrote:

> >> mount /var/lib/mythtv -oremount,ro
> >> sync
> >> umount /var/lib/mythtv
> >
> > Did this succeed? If the application is still truncating that file, the
> > umount should have failed.
>
> Actually, what I expect to happen is for the remount,ro
> to block until the file deletion completes. But it doesn't.

No -- all that the remount,ro sees is that there's a fd still open for
write. It has no idea if or when that fd is going to get closed, so it
should fail. Returning -EBUSY is the only thing it _can_ do: waiting
for the fs to be remountable might wait forever.

So the fs is still writable after that point.
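
Since a refused remount leaves the filesystem writable, a script can
check what state the mount actually ended up in rather than assume the
remount took effect. A minimal sketch, assuming the usual /proc/mounts
layout (device, mount point, type, options, ...); the helper name
mount_is_ro is made up here.

```shell
#!/bin/bash
# mount_is_ro MOUNTPOINT [MOUNTS_FILE]
# Hypothetical helper. Exit status:
#   0 = mounted read-only, 1 = mounted read-write, 2 = not mounted.
mount_is_ro() {
    awk -v mp="$1" '
        $2 == mp { opts = $4; seen = 1 }    # the last matching entry wins
        END {
            if (!seen) exit 2
            # Look for "ro" as a whole option, not as part of another one.
            exit (index("," opts ",", ",ro,") > 0 ? 0 : 1)
        }
    ' "${2:-/proc/mounts}"
}

# Possible use:
#   mount /var/lib/mythtv -oremount,ro
#   mount_is_ro /var/lib/mythtv || echo "still writable" >&2
```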

--Stephen


2007-06-07 19:45:35

by Andrew Morton

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

On Thu, 07 Jun 2007 12:11:58 -0400
Chuck Ebbert <[email protected]> wrote:

> On 06/07/2007 11:41 AM, Andrew Morton wrote:
> >> mount /var/lib/mythtv -oremount,ro
> >> sync
> >> umount /var/lib/mythtv
> >
> > Did this succeed? If the application is still truncating that file, the
> > umount should have failed.
>
> Shouldn't sync wait for truncate to finish?

I can't think of anything in there at present which would cause that to
happen, and it's not immediately obvious how we _could_ make it happen - we
have an inode which potentially has no dirty pages and which is itself
clean. The truncate can span multiple journal commits, so forcing a
journal commit in sync() won't necessarily block behind the truncate.

I guess we could ask sync to speculatively take and release every inode's
i_mutex or something. But even that would involve quite some hoop-jumping
due to those infuriating spinlock-protected list_heads on the superblock.

hmm.

2007-06-07 21:39:12

by Mark Lord

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

Andrew Morton wrote:
> On Thu, 07 Jun 2007 12:11:58 -0400
> Chuck Ebbert <[email protected]> wrote:
>
>> On 06/07/2007 11:41 AM, Andrew Morton wrote:
>>>> mount /var/lib/mythtv -oremount,ro
>>>> sync
>>>> umount /var/lib/mythtv
>>> Did this succeed? If the application is still truncating that file, the
>>> umount should have failed.
>> Shouldn't sync wait for truncate to finish?
>
> I can't think of anything in there at present which would cause that to
> happen, and it's not immediately obvious how we _could_ make it happen - we
> have an inode which potentially has no dirty pages and which is itself
> clean. The truncate can span multiple journal commits, so forcing a
> journal commit in sync() won't necessarily block behind the truncate.
>
> I guess we could ask sync to speculatively take and release every inode's
> i_mutex or something. But even that would involve quite some hoop-jumping
> due to those infuriating spinlock-protected list_heads on the superblock.
>
> hmm.

Yeah, I really don't know what to do with this either.
We have to have a bound on how long we wait at shutdown,
but there doesn't seem to be an easy way to get notified
once a filesystem becomes idle (?).

I suppose I could have the script loop on /proc/interrupts until
it sees the disk activity has tapered off..
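
A rough sketch of that kind of polling loop follows. It swaps in
/proc/diskstats for /proc/interrupts, because diskstats has a per-device
"sectors written" counter (the 10th whitespace-separated column on
whole-disk lines). The function names and the tries/delay parameters are
invented for the example, and a quiet interval is only a heuristic: it
does not prove the filesystem is consistent.

```shell
#!/bin/bash
# sectors_written DEVICE [STATS_FILE]
# Print the cumulative sectors-written counter for DEVICE (e.g. sda).
sectors_written() {
    awk -v dev="$1" '
        $3 == dev { print $10; found = 1 }
        END { exit !found }
    ' "${2:-/proc/diskstats}"
}

# wait_for_quiet DEVICE MAX_TRIES DELAY [STATS_FILE]
# Hypothetical helper. Returns 0 once two consecutive samples match,
# 1 if the device never went quiet, 2 if the device is not listed.
wait_for_quiet() {
    local dev=$1 tries=$2 delay=$3 stats=${4:-/proc/diskstats}
    local prev cur
    prev=$(sectors_written "$dev" "$stats") || return 2
    while [ "$tries" -gt 0 ]; do
        sleep "$delay"
        cur=$(sectors_written "$dev" "$stats") || return 2
        if [ "$cur" = "$prev" ]; then
            return 0
        fi
        prev=$cur
        tries=$((tries - 1))
    done
    return 1
}

# Possible use, bounded to about 30 seconds per drive:
#   wait_for_quiet sda 30 1; wait_for_quiet sdb 30 1
```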

Cheers

2007-06-07 21:43:47

by Mark Lord

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

Chuck Ebbert wrote:
> On 06/07/2007 11:41 AM, Andrew Morton wrote:
>>> mount /var/lib/mythtv -oremount,ro
>>> sync
>>> umount /var/lib/mythtv
>> Did this succeed? If the application is still truncating that file, the
>> umount should have failed.
>
> Shouldn't sync wait for truncate to finish?

The part that gets me here, and that others might be missing,
is that we are not waiting for ftruncate at this point.

We're waiting for unlink. The application that was doing ftruncate
in tiny little doses has been sent a kill-9 signal, so what should
be happening now (confirmed by disk activity LEDs) is the file should
just be getting deleted the same as if we did "rm bigfile" on it.

And I kind of expected sync or remount,ro to complete *after* the
unlink finishes..

Oh well.

2007-06-07 22:07:55

by Andrew Morton

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

On Thu, 07 Jun 2007 17:38:54 -0400
Mark Lord <[email protected]> wrote:

> Andrew Morton wrote:
> > On Thu, 07 Jun 2007 12:11:58 -0400
> > Chuck Ebbert <[email protected]> wrote:
> >
> >> On 06/07/2007 11:41 AM, Andrew Morton wrote:
> >>>> mount /var/lib/mythtv -oremount,ro
> >>>> sync
> >>>> umount /var/lib/mythtv
> >>> Did this succeed? If the application is still truncating that file, the
> >>> umount should have failed.
> >> Shouldn't sync wait for truncate to finish?
> >
> > I can't think of anything in there at present which would cause that to
> > happen, and it's not immediately obvious how we _could_ make it happen - we
> > have an inode which potentially has no dirty pages and which is itself
> > clean. The truncate can span multiple journal commits, so forcing a
> > journal commit in sync() won't necessarily block behind the truncate.
> >
> > I guess we could ask sync to speculatively take and release every inode's
> > i_mutex or something. But even that would involve quite some hoop-jumping
> > due to those infuriating spinlock-protected list_heads on the superblock.
> >
> > hmm.
>
> Yeah, I really don't know what to do with this either.
> We have to have a bound on how long we wait at shutdown,
> but there doesn't seem to be an easy way to get notified
> once a filesystem becomes idle (?).
>
> I suppose I could have the script loop on /proc/interrupts until
> it sees the disk activity has tapered off..
>

I don't recall clarity on this question: did the umount fail?

Because it should have, in which case your script can poll that.

2007-06-08 14:51:32

by Mark Lord

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

Andrew Morton wrote:
> On Thu, 07 Jun 2007 17:38:54 -0400
> Mark Lord <[email protected]> wrote:
>
>> Andrew Morton wrote:
>>> On Thu, 07 Jun 2007 12:11:58 -0400
>>> Chuck Ebbert <[email protected]> wrote:
>>>
>>>> On 06/07/2007 11:41 AM, Andrew Morton wrote:
>>>>>> mount /var/lib/mythtv -oremount,ro
>>>>>> sync
>>>>>> umount /var/lib/mythtv
>>>>> Did this succeed? If the application is still truncating that file, the
>>>>> umount should have failed.
>>>> Shouldn't sync wait for truncate to finish?
>>> I can't think of anything in there at present which would cause that to
>>> happen, and it's not immediately obvious how we _could_ make it happen - we
>>> have an inode which potentially has no dirty pages and which is itself
>>> clean. The truncate can span multiple journal commits, so forcing a
>>> journal commit in sync() won't necessarily block behind the truncate.
>>>
>>> I guess we could ask sync to speculatively take and release every inode's
>>> i_mutex or something. But even that would involve quite some hoop-jumping
>>> due to those infuriating spinlock-protected list_heads on the superblock.
>>>
>>> hmm.
>> Yeah, I really don't know what to do with this either.
>> We have to have a bound on how long we wait at shutdown,
>> but there doesn't seem to be an easy way to get notified
>> once a filesystem becomes idle (?).
>>
>> I suppose I could have the script loop on /proc/interrupts until
>> it sees the disk activity has tapered off..
>>
>
> I don't recall clarity on this question: did the umount fail?
>
> Because it should have, in which case your script can poll that.

I haven't had the opportunity to instrument/retest that part,
but would it really make any difference?

The process is already a Zombie at this point, existing only
because it was killed during a syscall and seems to have gotten
stuck there. So it won't be closing anything unless that has
already happened during the conversion to Zombie (?).

But yeah, I'll get back here again once I see if the remount
and umount work or not.

Cheers

2007-06-09 02:58:46

by Mark Lord

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

Andrew Morton wrote:
> On Thu, 07 Jun 2007 12:11:58 -0400
> Chuck Ebbert <[email protected]> wrote:
>
>> On 06/07/2007 11:41 AM, Andrew Morton wrote:
>>>> mount /var/lib/mythtv -oremount,ro
>>>> sync
>>>> umount /var/lib/mythtv
>>> Did this succeed? If the application is still truncating that file, the
>>> umount should have failed.
>> Shouldn't sync wait for truncate to finish?
>
> I can't think of anything in there at present which would cause that to
> happen, and it's not immediately obvious how we _could_ make it happen - we
> have an inode which potentially has no dirty pages and which is itself
> clean. The truncate can span multiple journal commits, so forcing a
> journal commit in sync() won't necessarily block behind the truncate.
>
> I guess we could ask sync to speculatively take and release every inode's
> i_mutex or something. But even that would involve quite some hoop-jumping
> due to those infuriating spinlock-protected list_heads on the superblock.
>
> hmm.

Okay, I added more instrumentation and retested today.

Good and Bad.
The umount does indeed fail while the massive unlink is happening,
so I can just loop on that a few times before giving up.

But.. the earlier "remount,ro".. well.. I don't know what it does.
I did get it to lock up solid, though.. hung on the "remount,ro"
when issued during an unlink of a 15GB file. The disk I/O eventually
completes, and drives go idle, but the system remains hung inside
the remount,ro call.

Alt-sysrq-T was functioning, so I have some screen shots (.jpg) here:

http://rtr.ca/remount_ro/

That's definitely a bug.
For now, I'll just not attempt the remount,ro on this system,
and have it loop for a minute attempting umount instead.
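
A bounded retry loop of that sort might look like the following sketch.
The helper name retry and its arguments are made up for the example;
only the "umount, for about a minute" idea comes from this message.

```shell
#!/bin/bash
# retry MAX_TRIES DELAY CMD [ARGS...]
# Hypothetical helper: run CMD until it succeeds, at most MAX_TRIES times,
# sleeping DELAY seconds between attempts. Returns 0 on success, 1 if
# every attempt failed.
retry() {
    local tries=$1 delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Possible use: try once a second for up to a minute, then give up.
#   retry 60 1 umount /var/lib/mythtv || echo "umount still failing" >&2
```

Giving up after a fixed number of attempts keeps the shutdown from
hanging forever if the unmount never succeeds.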

Cheers

2007-06-11 11:06:33

by Jan Kara

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

> Chuck Ebbert wrote:
> >On 06/07/2007 11:41 AM, Andrew Morton wrote:
> >>> mount /var/lib/mythtv -oremount,ro
> >>> sync
> >>> umount /var/lib/mythtv
> >>Did this succeed? If the application is still truncating that file, the
> >>umount should have failed.
> >
> >Shouldn't sync wait for truncate to finish?
>
> The part that gets me here, and that others might be missing,
> is that we are not waiting for ftruncate at this point.
>
> We're waiting for unlink. The application that was doing ftruncate
> in tiny little doses has been sent a kill-9 signal, so what should
> be happening now (confirmed by disk activity LEDs) is the file should
> just be getting deleted the same as if we did "rm bigfile" on it.
But if that app has been waiting in D state, kill -9 does nothing to it
until it wakes up, doesn't it? So fd's are still open and umount fails.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2007-06-11 11:15:29

by Jan Kara

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

> Andrew Morton wrote:
> >On Thu, 07 Jun 2007 12:11:58 -0400
> >Chuck Ebbert <[email protected]> wrote:
> >
> >>On 06/07/2007 11:41 AM, Andrew Morton wrote:
> >>>> mount /var/lib/mythtv -oremount,ro
> >>>> sync
> >>>> umount /var/lib/mythtv
> >>>Did this succeed? If the application is still truncating that file, the
> >>>umount should have failed.
> >>Shouldn't sync wait for truncate to finish?
> >
> >I can't think of anything in there at present which would cause that to
> >happen, and it's not immediately obvious how we _could_ make it happen - we
> >have an inode which potentially has no dirty pages and which is itself
> >clean. The truncate can span multiple journal commits, so forcing a
> >journal commit in sync() won't necessarily block behind the truncate.
> >
> >I guess we could ask sync to speculatively take and release every inode's
> >i_mutex or something. But even that would involve quite some hoop-jumping
> >due to those infuriating spinlock-protected list_heads on the superblock.
> >
> >hmm.
>
> Okay, I added more instrumentation and retested today.
>
> Good and Bad.
> The umount does indeed fail while the massive unlink is happening,
> so I can just loop on that a few times before giving up.
>
> But.. the earlier "remount,ro".. well.. I don't know what it does.
> I did get it to lock up solid, though.. hung on the "remount,ro"
> when issued during an unlink of a 15GB file. The disk I/O eventually
> completes, and drives go idle, but the system remains hung inside
> the remount,ro call.
>
> Alt-sysrq-T was functioning, so I have some screen shots (.jpg) here:
>
> http://rtr.ca/remount_ro/
Thanks for the traces.

> That's definitely a bug.
Yes. We have a nice lock inversion there. ext3_remount() is called
with sb->s_lock held and waits for transaction to finish in
journal_lock_updates(). On the other hand ext3_orphan_del() is called
inside a transaction and tries to do lock_super()... Bad luck.

Honza

--
Jan Kara <[email protected]>
SuSE CR Labs

2007-06-11 22:46:19

by Mark Lord

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

Jan Kara wrote:
>> Chuck Ebbert wrote:
>>> On 06/07/2007 11:41 AM, Andrew Morton wrote:
>>>>> mount /var/lib/mythtv -oremount,ro
>>>>> sync
>>>>> umount /var/lib/mythtv
>>>> Did this succeed? If the application is still truncating that file, the
>>>> umount should have failed.
>>> Shouldn't sync wait for truncate to finish?
>> The part that gets me here, and that others might be missing,
>> is that we are not waiting for ftruncate at this point.
>>
>> We're waiting for unlink. The application that was doing ftruncate
>> in tiny little doses has been sent a kill-9 signal, so what should
>> be happening now (confirmed by disk activity LEDs) is the file should
>> just be getting deleted the same as if we did "rm bigfile" on it.
> But if that app has been waiting in D state, kill -9 does nothing to it
> until it wakes up, doesn't it? So fd's are still open and umount fails.

Yeah, except the task is in Zombie state at this point.
But the umount does fail properly, so I no longer complain about that.

Cheers

2007-06-11 22:47:18

by Mark Lord

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

Jan Kara wrote:
>> Andrew Morton wrote:
>>> On Thu, 07 Jun 2007 12:11:58 -0400
>>> Chuck Ebbert <[email protected]> wrote:
>>>
>>>> On 06/07/2007 11:41 AM, Andrew Morton wrote:
>>>>>> mount /var/lib/mythtv -oremount,ro
>>>>>> sync
>>>>>> umount /var/lib/mythtv
>>>>> Did this succeed? If the application is still truncating that file, the
>>>>> umount should have failed.
>>>> Shouldn't sync wait for truncate to finish?
>>> I can't think of anything in there at present which would cause that to
>>> happen, and it's not immediately obvious how we _could_ make it happen - we
>>> have an inode which potentially has no dirty pages and which is itself
>>> clean. The truncate can span multiple journal commits, so forcing a
>>> journal commit in sync() won't necessarily block behind the truncate.
>>>
>>> I guess we could ask sync to speculatively take and release every inode's
>>> i_mutex or something. But even that would involve quite some hoop-jumping
>>> due to those infuriating spinlock-protected list_heads on the superblock.
>>>
>>> hmm.
>> Okay, I added more instrumentation and retested today.
>>
>> Good and Bad.
>> The umount does indeed fail while the massive unlink is happening,
>> so I can just loop on that a few times before giving up.
>>
>> But.. the earlier "remount,ro".. well.. I don't know what it does.
>> I did get it to lock up solid, though.. hung on the "remount,ro"
>> when issued during an unlink of a 15GB file. The disk I/O eventually
>> completes, and drives go idle, but the system remains hung inside
>> the remount,ro call.
>>
>> Alt-sysrq-T was functioning, so I have some screen shots (.jpg) here:
>>
>> http://rtr.ca/remount_ro/
> Thanks for the traces.
>
>> That's definitely a bug.
> Yes. We have a nice lock inversion there. ext3_remount() is called
> with sb->s_lock held and waits for transaction to finish in
> journal_lock_updates(). On the other hand ext3_orphan_del() is called
> inside a transaction and tries to do lock_super()... Bad luck.
>

Peachy. Do you have enough knowledge here to generate a fix for this?
Maybe just have the remount break out, releasing all locks, and then
loop and retry (or return -EBUSY?) when this happens?

Cheers

2007-06-12 09:48:18

by Jan Kara

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

On Mon 11-06-07 18:47:05, Mark Lord wrote:
> Jan Kara wrote:
> >>Andrew Morton wrote:
> >>>On Thu, 07 Jun 2007 12:11:58 -0400
> >>>Chuck Ebbert <[email protected]> wrote:
> >>>
> >>>>On 06/07/2007 11:41 AM, Andrew Morton wrote:
> >>>>>> mount /var/lib/mythtv -oremount,ro
> >>>>>> sync
> >>>>>> umount /var/lib/mythtv
> >>>>>Did this succeed? If the application is still truncating that file,
> >>>>>the
> >>>>>umount should have failed.
> >>>>Shouldn't sync wait for truncate to finish?
> >>>I can't think of anything in there at present which would cause that to
> >>>happen, and it's not immediately obvious how we _could_ make it happen -
> >>>we
> >>>have an inode which potentially has no dirty pages and which is itself
> >>>clean. The truncate can span multiple journal commits, so forcing a
> >>>journal commit in sync() won't necessarily block behind the truncate.
> >>>
> >>>I guess we could ask sync to speculatively take and release every inode's
> >>>i_mutex or something. But even that would involve quite some
> >>>hoop-jumping
> >>>due to those infuriating spinlock-protected list_heads on the superblock.
> >>>
> >>>hmm.
> >>Okay, I added more instrumentation and retested today.
> >>
> >>Good and Bad.
> >>The umount does indeed fail while the massive unlink is happening,
> >>so I can just loop on that a few times before giving up.
> >>
> >>But.. the earlier "remount,ro".. well.. I don't know what it does.
> >>I did get it to lock up solid, though.. hung on the "remount,ro"
> >>when issued during an unlink of a 15GB file. The disk I/O eventually
> >>completes, and drives go idle, but the system remains hung inside
> >>the remount,ro call.
> >>
> >>Alt-sysrq-T was functioning, so I have some screen shots (.jpg) here:
> >>
> >> http://rtr.ca/remount_ro/
> > Thanks for the traces.
> >
> >>That's definitely a bug.
> > Yes. We have a nice lock inversion there. ext3_remount() is called
> >with sb->s_lock held and waits for transaction to finish in
> >journal_lock_updates(). On the other hand ext3_orphan_del() is called
> >inside a transaction and tries to do lock_super()... Bad luck.
>
> Peachy. Do you have enough knowledge here to generate a fix for this?
> Maybe just have the remount break out, releasing all locks, and then
> loop and retry (or return -EBUSY?) when this happens?
Yes, I'll try to cook up some patch. As I'm looking through the code,
ext3_remount seems to be the only place where we need to start a
transaction under s_lock. So probably we could release sb->s_lock for
the time we have to wait for a transaction...

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2007-06-12 14:41:23

by Pavel Machek

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

Hi!

> >Did this succeed? If the application is still
> >truncating that file, the
> >umount should have failed.
>
> Actually, what I expect to happen is for the remount,ro
> to block until the file deletion completes. But it
> doesn't.
>
> Once a f/s is read-only, there should be NO writing to
> it. Right?

Linux happily writes to filesystems mounted read-only. It will replay
journal on them.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-06-12 14:41:37

by Pavel Machek

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

Hi!

> >>Did this succeed? If the application is still
> >>truncating that file, the
> >>umount should have failed.
> >
> >Shouldn't sync wait for truncate to finish?
>
> The part that gets me here, and that others might be
> missing,
> is that we are not waiting for ftruncate at this point.
>
> We're waiting for unlink. The application that was
> doing ftruncate
> in tiny little doses has been sent a kill-9 signal, so
> what should
> be happening now (confirmed by disk activity LEDs) is
> the file should
> just be getting deleted the same as if we did "rm
> bigfile" on it.

Well, AFAICT kill-9 signal delivery can take time. It still might be
doing ftruncate.

It would be interesting to know when in that sequence mythtv dies...
or perhaps put a loop 'while killall -9 mythtv returns success, sleep 1'
into shutdown scripts?
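
Spelled out, with an upper bound added so it cannot spin forever, that
loop might look like this sketch. It relies on killall (from psmisc)
exiting non-zero once no matching process is left; the helper name
kill_until_gone and the bound are made up, and "mythtv" is simply the
process name used above.

```shell
#!/bin/bash
# kill_until_gone NAME MAX_TRIES [DELAY]
# Hypothetical helper: keep sending SIGKILL to processes called NAME until
# killall reports that none are left. Returns 0 once they are gone, 1 if
# they are still around after MAX_TRIES attempts.
kill_until_gone() {
    local name=$1 tries=$2 delay=${3:-1}
    while [ "$tries" -gt 0 ]; do
        # A failing killall is taken to mean "no such process any more".
        killall -9 "$name" 2>/dev/null || return 0
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# Possible use in a shutdown script:
#   kill_until_gone mythtv 30 1 || echo "mythtv still present" >&2
```
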
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2007-06-12 15:16:22

by Stephen C. Tweedie

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

Hi,

On Sun, 2007-06-10 at 18:27 +0000, Pavel Machek wrote:

> > Once a f/s is read-only, there should be NO writing to
> > it. Right?
>
> Linux happily writes to filesystems mounted read-only. It will replay
> journal on them.

Only at mount time, not on unmount; and it does check whether the
underlying device is truly readonly or not first (assuming
bdev_read_only() is working on the device in question.)

--Stephen


2007-06-14 19:01:16

by Phillip Susi

Subject: Re: ext3fs: umount+sync not enough to guarantee metadata-on-disk

Pavel Machek wrote:
> Hi!
>
>>> Did this succeed? If the application is still
>>> truncating that file, the
>>> umount should have failed.
>> Actually, what I expect to happen is for the remount,ro
>> to block until the file deletion completes. But it
>> doesn't.
>>
>> Once a f/s is read-only, there should be NO writing to
>> it. Right?
>
> Linux happily writes to filesystems mounted read-only. It will replay
> journal on them.

That's a bug and needs to be fixed. Read only means read _only_.

And the question still remains; why is sync() not blocking until the
file has been completely unlinked and the disk is consistent?