2011-08-01 06:41:05

by Toshiyuki Okajima

[permalink] [raw]
Subject: [PATCH] ext3: fix message in ext3_remount for rw-remount case

If there are some inodes in orphan list while a filesystem is being
read-only mounted, we should recommend that pepole umount and then
mount it when they try to remount with read-write. But the current
message/comment recommends that they umount and then remount it.

ext3_remount:
/*
* If we have an unprocessed orphan list hanging
* around from a previously readonly bdev mount,
* require a full umount/remount for now.
^^^^^^^^^^^^^^
*/
if (es->s_last_orphan) {
printk(KERN_WARNING "EXT3-fs: %s: couldn't "
"remount RDWR because of unprocessed "
"orphan inode list. Please "
"umount/remount instead.\n",
^^^^^^^^^^^^^^
sb->s_id);

Signed-off-by: Toshiyuki Okajima <[email protected]>
---
fs/ext3/super.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 7beb69a..d3df0d4 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -2669,13 +2669,13 @@ static int ext3_remount (struct super_block * sb, int * flags, char * data)
/*
* If we have an unprocessed orphan list hanging
* around from a previously readonly bdev mount,
- * require a full umount/remount for now.
+ * require a full umount/mount for now.
*/
if (es->s_last_orphan) {
ext3_msg(sb, KERN_WARNING, "warning: couldn't "
"remount RDWR because of unprocessed "
"orphan inode list. Please "
- "umount/remount instead.");
+ "umount/mount instead.");
err = -EINVAL;
goto restore_opts;
}
--
1.5.5.6


2011-08-01 08:45:28

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext3: fix message in ext3_remount for rw-remount case

On Mon 01-08-11 13:54:51, Toshiyuki Okajima wrote:
> If there are some inodes in orphan list while a filesystem is being
> read-only mounted, we should recommend that pepole umount and then
> mount it when they try to remount with read-write. But the current
> message/comment recommends that they umount and then remount it.
>
> ext3_remount:
> /*
> * If we have an unprocessed orphan list hanging
> * around from a previously readonly bdev mount,
> * require a full umount/remount for now.
> ^^^^^^^^^^^^^^
> */
> if (es->s_last_orphan) {
> printk(KERN_WARNING "EXT3-fs: %s: couldn't "
> "remount RDWR because of unprocessed "
> "orphan inode list. Please "
> "umount/remount instead.\n",
> ^^^^^^^^^^^^^^
> sb->s_id);
OK, so how about using "umount & mount"? The '/' is what would confuse me
the most... BTW, I guess you didn't really see this message in practice, did
you?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-08-01 09:45:20

by Toshiyuki Okajima

[permalink] [raw]
Subject: Re: [PATCH] ext3: fix message in ext3_remount for rw-remount case

Hi.

(2011/08/01 17:45), Jan Kara wrote:
> On Mon 01-08-11 13:54:51, Toshiyuki Okajima wrote:
>> If there are some inodes in orphan list while a filesystem is being
>> read-only mounted, we should recommend that pepole umount and then
>> mount it when they try to remount with read-write. But the current
>> message/comment recommends that they umount and then remount it.
>>
>> ext3_remount:
>> /*
>> * If we have an unprocessed orphan list hanging
>> * around from a previously readonly bdev mount,
>> * require a full umount/remount for now.
>> ^^^^^^^^^^^^^^
>> */
>> if (es->s_last_orphan) {
>> printk(KERN_WARNING "EXT3-fs: %s: couldn't "
>> "remount RDWR because of unprocessed "
>> "orphan inode list. Please "
>> "umount/remount instead.\n",
>> ^^^^^^^^^^^^^^
>> sb->s_id);

> OK, so how about using "umount& mount"? The '/' is what would confuse me
OK. I modify it like your comment.

umount/mount => umount & mount

> the most... BTW, I guess you didn't really see this message in practice, did
> you?
No.
I have seen this message in practice while quotacheck command was repeatedly
executed per an hour.

Thanks,
Toshiyuki Okajima


2011-08-01 09:57:33

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext3: fix message in ext3_remount for rw-remount case

On Mon 01-08-11 18:45:58, Toshiyuki Okajima wrote:
> (2011/08/01 17:45), Jan Kara wrote:
> >On Mon 01-08-11 13:54:51, Toshiyuki Okajima wrote:
> >>If there are some inodes in orphan list while a filesystem is being
> >>read-only mounted, we should recommend that pepole umount and then
> >>mount it when they try to remount with read-write. But the current
> >>message/comment recommends that they umount and then remount it.
> >>
> >>ext3_remount:
> >> /*
> >> * If we have an unprocessed orphan list hanging
> >> * around from a previously readonly bdev mount,
> >> * require a full umount/remount for now.
> >> ^^^^^^^^^^^^^^
> >> */
> >> if (es->s_last_orphan) {
> >> printk(KERN_WARNING "EXT3-fs: %s: couldn't "
> >> "remount RDWR because of unprocessed "
> >> "orphan inode list. Please "
> >> "umount/remount instead.\n",
> >> ^^^^^^^^^^^^^^
> >> sb->s_id);
>
> > OK, so how about using "umount& mount"? The '/' is what would confuse me
> OK. I modify it like your comment.
>
> umount/mount => umount & mount
Thanks.

> >the most... BTW, I guess you didn't really see this message in practice, did
> >you?
> No.
> I have seen this message in practice while quotacheck command was repeatedly
> executed per an hour.
Interesting. Are you able to reproduce this? Quotacheck does remount
read-only + remount read-write but you cannot really remount the filesystem
read-only when it has orphan inodes and so you should not see those when
you remount read-write again. Possibly there's race between remounting and
unlinking...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-08-02 09:13:40

by Toshiyuki Okajima

[permalink] [raw]
Subject: Re: [PATCH] ext3: fix message in ext3_remount for rw-remount case

Hi.

(2011/08/01 18:57), Jan Kara wrote:
> On Mon 01-08-11 18:45:58, Toshiyuki Okajima wrote:
>> (2011/08/01 17:45), Jan Kara wrote:
>>> On Mon 01-08-11 13:54:51, Toshiyuki Okajima wrote:
>>>> If there are some inodes in orphan list while a filesystem is being
>>>> read-only mounted, we should recommend that pepole umount and then
>>>> mount it when they try to remount with read-write. But the current
>>>> message/comment recommends that they umount and then remount it.
>>>>
>>>> ext3_remount:
>>>> /*
>>>> * If we have an unprocessed orphan list hanging
>>>> * around from a previously readonly bdev mount,
>>>> * require a full umount/remount for now.
>>>> ^^^^^^^^^^^^^^
>>>> */
>>>> if (es->s_last_orphan) {
>>>> printk(KERN_WARNING "EXT3-fs: %s: couldn't "
>>>> "remount RDWR because of unprocessed "
>>>> "orphan inode list. Please "
>>>> "umount/remount instead.\n",
>>>> ^^^^^^^^^^^^^^
>>>> sb->s_id);
>>
>>> OK, so how about using "umount& mount"? The '/' is what would confuse me
>> OK. I modify it like your comment.
>>
>> umount/mount => umount& mount
> Thanks.
>

>>> the most... BTW, I guess you didn't really see this message in practice, did
>>> you?
>> No.
>> I have seen this message in practice while quotacheck command was repeatedly
>> executed per an hour.
> Interesting. Are you able to reproduce this? Quotacheck does remount
> read-only + remount read-write but you cannot really remount the filesystem
> read-only when it has orphan inodes and so you should not see those when
> you remount read-write again. Possibly there's race between remounting and
> unlinking...
Yes. I can reproduce it. However, it is not frequently reproduced
by using the original procedure (qutacheck per an hour). So, I made a
reproducer.

It is:
[go.sh]
#!/bin/sh

dd if=/dev/zero of=./img bs=1k count=1 seek=100k > /dev/null 2>&1
/sbin/mkfs.ext3 -Fq ./img
/sbin/tune2fs -c 0 ./img
mkdir -p mnt
LOOP=10000
for ((i=0; i<LOOP; i++));
do
mount -o loop ./img ./mnt
sh ./writer.sh ./mnt &
PID=$!
j=0
while [ 1 ];
do
sleep 1
# remount
if ((j%2 == 0));
then
mount -o loop,remount,ro ./mnt
else
mount -o loop,remount,rw ./mnt
fi
tail -n 1 /var/log/messages| grep "remount RDWR"
if [ $? -eq 0 ];
then
break
fi
j=$((j+1))
done
kill -9 $PID
umount ./mnt
/sbin/e2fsck -pf ./img
done
exit

[writer.sh]
#!/bin/sh

i=0
path=$1
num=0
stride=64
while [ 1 ];
do
for ((j=0;j<stride;j++));
do
num=$((i*stride + j))
dd if=/dev/zero of=${path}/file${num} bs=8k count=1 > /dev/null 2>&1 &
done
for ((j=0;j<stride;j++));
do
num=$((i*stride + j))
rm -f ${path}/file${num} &
done
wait
i=$((i+1))
done

# vi go.sh
# vi writer.sh
# sh go.sh
(It might be reproduced after a while...)

Thanks,
Toshiyuki Okajima


2011-08-03 02:41:22

by Toshiyuki Okajima

[permalink] [raw]
Subject: Re: [PATCH] ext3: fix message in ext3_remount for rw-remount case

Hi.

(2011/08/02 18:14), Toshiyuki Okajima wrote:
> Hi.
>
> (2011/08/01 18:57), Jan Kara wrote:
>> On Mon 01-08-11 18:45:58, Toshiyuki Okajima wrote:
>>> (2011/08/01 17:45), Jan Kara wrote:
>>>> On Mon 01-08-11 13:54:51, Toshiyuki Okajima wrote:
>>>>> If there are some inodes in orphan list while a filesystem is being
>>>>> read-only mounted, we should recommend that pepole umount and then
>>>>> mount it when they try to remount with read-write. But the current
>>>>> message/comment recommends that they umount and then remount it.
<SNIP>
>>>> the most... BTW, I guess you didn't really see this message in practice, did
>>>> you?
>>> No.
>>> I have seen this message in practice while quotacheck command was repeatedly
>>> executed per an hour.
>> Interesting. Are you able to reproduce this? Quotacheck does remount
>> read-only + remount read-write but you cannot really remount the filesystem
>> read-only when it has orphan inodes and so you should not see those when
>> you remount read-write again. Possibly there's race between remounting and
>> unlinking...
> Yes. I can reproduce it. However, it is not frequently reproduced
> by using the original procedure (qutacheck per an hour). So, I made a
> reproducer.
To tell the truth, I think the race creates the message:
-----------------------------------------------------------------------
EXT3-fs: <dev>: couldn't remount RDWR because of
     unprocessed orphan inode list. Please umount/remount instead.
-----------------------------------------------------------------------
which hides a serious problem.

By using my reproducer, I found that it can show another message that
is not the above mentioned message:
-----------------------------------------------------------------------
EXT3-fs error (device <dev>) in start_transaction: Readonly filesystem
-----------------------------------------------------------------------
After I examined the code path which message could display, I found
it can display if the following steps are satisfied:

[[CASE 1]]
( 1) [process A] do_unlinkat
( 2) [process B] do_remount_sb(, RDONLY, )
( 3) [process A] vfs_unlink
( 4) [process A] ext3_unlink
( 5) [process A] ext3_journal_start
( 6) [process B] fs_may_remount_ro (=> return 0)
( 7) [process A] inode->i_nlink-- (i_nlink=0)
( 8) [process A] ext3_orphan_add
( 9) [process A] ext3_journal_stop
(10) [process A] dput
(11) [process A] iput
(12) [process A] ext3_evict_inode
(13) [process B] ext3_remount
(14) [process A] start_transaction
(15) [process B] sb->s_flags |= MS_RDONLY
(16) [process B] ext3_mark_recovery_complete
(17) [process A] start_this_handle (new transaction is created)
(18) [process A] ext3_truncate
(19) [process A] start_transaction (failed => this message is displayed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(20) [process A] ext3_orphan_del
(21) [process A] ext3_journal_stop

* "Process A" deletes a file successfully(21). However the file data is left
because (18) fails. **Furthermore, new transaction can be created after
ext3_mark_recovery_complete finishes.**

[[CASE2]]
( 1) [process A] do_unlinkat
( 2) [process B] do_remount_sb(, RDONLY, )
( 3) [process A] vfs_unlink
( 4) [process A] ext3_unlink
( 5) [process A] ext3_journal_start
( 6) [process B] fs_may_remount_ro (=> return 0)
( 7) [process A] inode->i_nlink-- (i_nlink=0)
( 8) [process A] ext3_orphan_add
( 9) [process A] ext3_journal_stop
(10) [process A] dput
(11) [process A] iput
(12) [process A] ext3_evict_inode
(13) [process B] ext3_remount
(14) [process A] start_transaction
(15) [process B] sb->s_flags |= MS_RDONLY
(17) [process A] start_this_handle (new transaction is created)
(18) [process A] ext3_truncate
(19) [process A] start_transaction (failed => this message is displayed)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(20) [process A] ext3_orphan_del
(21) [process A] ext3_journal_stop
(22) [process B] ext3_mark_recovery_complete

* "Process A" deletes a file successfully(21). However the file data is left
because (18) fails. This transaction can finish before
ext3_mark_recovery_complete finishes.

I will try to fix this problem not to do with fs-error.
Please comment about the fix if I have created one.

Thanks,
Toshiyuki Okajima

2011-08-03 09:58:08

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext3: fix message in ext3_remount for rw-remount case

Hello,

On Wed 03-08-11 11:42:03, Toshiyuki Okajima wrote:
> >(2011/08/01 18:57), Jan Kara wrote:
> >>On Mon 01-08-11 18:45:58, Toshiyuki Okajima wrote:
> >>>(2011/08/01 17:45), Jan Kara wrote:
> >>>>On Mon 01-08-11 13:54:51, Toshiyuki Okajima wrote:
> >>>>>If there are some inodes in orphan list while a filesystem is being
> >>>>>read-only mounted, we should recommend that pepole umount and then
> >>>>>mount it when they try to remount with read-write. But the current
> >>>>>message/comment recommends that they umount and then remount it.
> <SNIP>
> >>>>the most... BTW, I guess you didn't really see this message in practice, did
> >>>>you?
> >>>No.
> >>>I have seen this message in practice while quotacheck command was repeatedly
> >>>executed per an hour.
> >>Interesting. Are you able to reproduce this? Quotacheck does remount
> >>read-only + remount read-write but you cannot really remount the filesystem
> >>read-only when it has orphan inodes and so you should not see those when
> >>you remount read-write again. Possibly there's race between remounting and
> >>unlinking...
> >Yes. I can reproduce it. However, it is not frequently reproduced
> >by using the original procedure (qutacheck per an hour). So, I made a
> >reproducer.
> To tell the truth, I think the race creates the message:
> -----------------------------------------------------------------------
> EXT3-fs: <dev>: couldn't remount RDWR because of
>      unprocessed orphan inode list. Please umount/remount instead.
> -----------------------------------------------------------------------
> which hides a serious problem.
I've inquired about this at linux-fsdevel (I think you were in CC unless
I forgot). It's a race in VFS remount code as you properly analyzed below.
People are working on fixing it but it's not trivial. Filesystem is really
a wrong place to fix such problem. If there is a trivial fix for ext3 to
workaround the issue, I can take it but I'm not willing to push anything
complex - effort should better be spent working on a generic fix.

Honza

> By using my reproducer, I found that it can show another message that
> is not the above mentioned message:
> -----------------------------------------------------------------------
> EXT3-fs error (device <dev>) in start_transaction: Readonly filesystem
> -----------------------------------------------------------------------
> After I examined the code path which message could display, I found
> it can display if the following steps are satisfied:
>
> [[CASE 1]]
> ( 1) [process A] do_unlinkat
> ( 2) [process B] do_remount_sb(, RDONLY, )
> ( 3) [process A] vfs_unlink
> ( 4) [process A] ext3_unlink
> ( 5) [process A] ext3_journal_start
> ( 6) [process B] fs_may_remount_ro (=> return 0)
> ( 7) [process A] inode->i_nlink-- (i_nlink=0)
> ( 8) [process A] ext3_orphan_add
> ( 9) [process A] ext3_journal_stop
> (10) [process A] dput
> (11) [process A] iput
> (12) [process A] ext3_evict_inode
> (13) [process B] ext3_remount
> (14) [process A] start_transaction
> (15) [process B] sb->s_flags |= MS_RDONLY
> (16) [process B] ext3_mark_recovery_complete
> (17) [process A] start_this_handle (new transaction is created)
> (18) [process A] ext3_truncate
> (19) [process A] start_transaction (failed => this message is displayed)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> (20) [process A] ext3_orphan_del
> (21) [process A] ext3_journal_stop
>
> * "Process A" deletes a file successfully(21). However the file data is left
> because (18) fails. **Furthermore, new transaction can be created after
> ext3_mark_recovery_complete finishes.**
>
> [[CASE2]]
> ( 1) [process A] do_unlinkat
> ( 2) [process B] do_remount_sb(, RDONLY, )
> ( 3) [process A] vfs_unlink
> ( 4) [process A] ext3_unlink
> ( 5) [process A] ext3_journal_start
> ( 6) [process B] fs_may_remount_ro (=> return 0)
> ( 7) [process A] inode->i_nlink-- (i_nlink=0)
> ( 8) [process A] ext3_orphan_add
> ( 9) [process A] ext3_journal_stop
> (10) [process A] dput
> (11) [process A] iput
> (12) [process A] ext3_evict_inode
> (13) [process B] ext3_remount
> (14) [process A] start_transaction
> (15) [process B] sb->s_flags |= MS_RDONLY
> (17) [process A] start_this_handle (new transaction is created)
> (18) [process A] ext3_truncate
> (19) [process A] start_transaction (failed => this message is displayed)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> (20) [process A] ext3_orphan_del
> (21) [process A] ext3_journal_stop
> (22) [process B] ext3_mark_recovery_complete
>
> * "Process A" deletes a file successfully(21). However the file data is left
> because (18) fails. This transaction can finish before
> ext3_mark_recovery_complete finishes.
>
> I will try to fix this problem not to do with fs-error.
> Please comment about the fix if I have created one.
>
> Thanks,
> Toshiyuki Okajima
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-08-03 13:27:33

by Toshiyuki Okajima

[permalink] [raw]
Subject: Re: [PATCH] ext3: fix message in ext3_remount for rw-remount case

Hi.

On Wed, 3 Aug 2011 11:57:54 +0200
Jan Kara <[email protected]> wrote:
> Hello,
>
> On Wed 03-08-11 11:42:03, Toshiyuki Okajima wrote:
> > >(2011/08/01 18:57), Jan Kara wrote:
> > >>On Mon 01-08-11 18:45:58, Toshiyuki Okajima wrote:
> > >>>(2011/08/01 17:45), Jan Kara wrote:
> > >>>>On Mon 01-08-11 13:54:51, Toshiyuki Okajima wrote:
> > >>>>>If there are some inodes in orphan list while a filesystem is being
> > >>>>>read-only mounted, we should recommend that pepole umount and then
> > >>>>>mount it when they try to remount with read-write. But the current
> > >>>>>message/comment recommends that they umount and then remount it.
> > <SNIP>
> > >>>>the most... BTW, I guess you didn't really see this message in practice, did
> > >>>>you?
> > >>>No.
> > >>>I have seen this message in practice while quotacheck command was repeatedly
> > >>>executed per an hour.
> > >>Interesting. Are you able to reproduce this? Quotacheck does remount
> > >>read-only + remount read-write but you cannot really remount the filesystem
> > >>read-only when it has orphan inodes and so you should not see those when
> > >>you remount read-write again. Possibly there's race between remounting and
> > >>unlinking...
> > >Yes. I can reproduce it. However, it is not frequently reproduced
> > >by using the original procedure (qutacheck per an hour). So, I made a
> > >reproducer.
> > To tell the truth, I think the race creates the message:
> > -----------------------------------------------------------------------
> > EXT3-fs: <dev>: couldn't remount RDWR because of
> > $B!!!!!!(B $B!!(Bunprocessed orphan inode list. Please umount/remount instead.
> > -----------------------------------------------------------------------
> > which hides a serious problem.
> I've inquired about this at linux-fsdevel (I think you were in CC unless
> I forgot). It's a race in VFS remount code as you properly analyzed below.
> People are working on fixing it but it's not trivial. Filesystem is really
> a wrong place to fix such problem. If there is a trivial fix for ext3 to
> workaround the issue, I can take it but I'm not willing to push anything
> complex - effort should better be spent working on a generic fix.
I also think read-only remount race in VFS layer should be fixed.
However, I think this race depends on ext3/ext4 filesystem
implementation. (Orphan inode list)
So, we should modify ext3/ext4(jbd/jbd2) to fix it.

>
> Honza
>
> > By using my reproducer, I found that it can show another message that
> > is not the above mentioned message:
> > -----------------------------------------------------------------------
> > EXT3-fs error (device <dev>) in start_transaction: Readonly filesystem
> > -----------------------------------------------------------------------
> > After I examined the code path which message could display, I found
> > it can display if the following steps are satisfied:
<SNIP>
> > I will try to fix this problem not to do with fs-error.
> > Please comment about the fix if I have created one.
I created a sample patch to fix above EXT3-fs error. But it is tricky,
I think. :-)

Signed-off-by: Toshiyuki Okajima <[email protected]>
---
fs/ext3/inode.c | 27 ++++++++++++++++++++++-----
fs/ext3/super.c | 15 ++++++++++++++-
fs/jbd/transaction.c | 1 +
include/linux/ext3_fs.h | 3 +++
include/linux/jbd.h | 2 ++
5 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 04da6ac..abfa500 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -201,6 +201,7 @@ void ext3_evict_inode (struct inode *inode)
struct ext3_block_alloc_info *rsv;
handle_t *handle;
int want_delete = 0;
+ int err = 0;

trace_ext3_evict_inode(inode);
if (!inode->i_nlink && !is_bad_inode(inode)) {
@@ -246,6 +247,7 @@ void ext3_evict_inode (struct inode *inode)

handle = start_transaction(inode);
if (IS_ERR(handle)) {
+skip_trans:
/*
* If we're going to skip the normal cleanup, we still need to
* make sure that the in-core orphan linked list is properly
@@ -258,8 +260,13 @@ void ext3_evict_inode (struct inode *inode)
if (IS_SYNC(inode))
handle->h_sync = 1;
inode->i_size = 0;
- if (inode->i_blocks)
- ext3_truncate(inode);
+ if (inode->i_blocks) {
+ err = __ext3_truncate(inode);
+ if (err == -EROFS) {
+ ext3_journal_stop(handle);
+ goto skip_trans;
+ }
+ }
/*
* Kill off the orphan record created when the inode lost the last
* link. Note that ext3_orphan_del() has to be able to cope with the
@@ -2511,7 +2518,7 @@ int ext3_can_truncate(struct inode *inode)
* that's fine - as long as they are linked from the inode, the post-crash
* ext3_truncate() run will find them and release them.
*/
-void ext3_truncate(struct inode *inode)
+int __ext3_truncate(struct inode *inode)
{
handle_t *handle;
struct ext3_inode_info *ei = EXT3_I(inode);
@@ -2524,6 +2531,7 @@ void ext3_truncate(struct inode *inode)
int n;
long last_block;
unsigned blocksize = inode->i_sb->s_blocksize;
+ int err = 0;

trace_ext3_truncate_enter(inode);

@@ -2534,8 +2542,11 @@ void ext3_truncate(struct inode *inode)
ext3_set_inode_state(inode, EXT3_STATE_FLUSH_ON_CLOSE);

handle = start_transaction(inode);
- if (IS_ERR(handle))
+ if (IS_ERR(handle)) {
+ if (PTR_ERR(handle) == -EROFS)
+ err = -EROFS;
goto out_notrans;
+ }

last_block = (inode->i_size + blocksize-1)
>> EXT3_BLOCK_SIZE_BITS(inode->i_sb);
@@ -2654,7 +2665,7 @@ out_stop:

ext3_journal_stop(handle);
trace_ext3_truncate_exit(inode);
- return;
+ return 0;
out_notrans:
/*
* Delete the inode from orphan list so that it doesn't stay there
@@ -2663,6 +2674,12 @@ out_notrans:
if (inode->i_nlink)
ext3_orphan_del(NULL, inode);
trace_ext3_truncate_exit(inode);
+ return err;
+}
+
+void ext3_truncate(struct inode *inode)
+{
+ __ext3_truncate(inode);
}

static ext3_fsblk_t ext3_get_inode_block(struct super_block *sb,
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 7beb69a..22ffb04 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -270,6 +270,10 @@ void __ext3_std_error (struct super_block * sb, const char * function,
if (errno == -EROFS && journal_current_handle() == NULL &&
(sb->s_flags & MS_RDONLY))
return;
+ smp_rmb();
+ if ((sb->s_flags & MS_RDONLY) && errno == -EROFS &&
+ (EXT3_SB(sb)->s_mount_state & EXT3_ROMOUNT))
+ return;

errstr = ext3_decode_error(sb, errno, nbuf);
ext3_msg(sb, KERN_CRIT, "error in %s: %s", function, errstr);
@@ -2642,7 +2646,12 @@ static int ext3_remount (struct super_block * sb, int * flags, char * data)
* First of all, the unconditional stuff we have to do
* to disable replay of the journal when we next remount
*/
+ spin_lock(&sbi->s_journal->j_state_lock);
+ sbi->s_journal->j_flags |= JFS_ROMOUNT;
+ spin_unlock(&sbi->s_journal->j_state_lock);
+ sbi->s_mount_state |= EXT3_ROMOUNT;
sb->s_flags |= MS_RDONLY;
+ smp_wmb();

/*
* OK, test if we are remounting a valid rw partition
@@ -2651,7 +2660,8 @@ static int ext3_remount (struct super_block * sb, int * flags, char * data)
*/
if (!(es->s_state & cpu_to_le16(EXT3_VALID_FS)) &&
(sbi->s_mount_state & EXT3_VALID_FS))
- es->s_state = cpu_to_le16(sbi->s_mount_state);
+ es->s_state =
+ cpu_to_le16(sbi->s_mount_state & ~EXT3_ROMOUNT);

ext3_mark_recovery_complete(sb, es);
} else {
@@ -2686,6 +2696,9 @@ static int ext3_remount (struct super_block * sb, int * flags, char * data)
* been changed by e2fsck since we originally mounted
* the partition.)
*/
+ spin_lock(&sbi->s_journal->j_state_lock);
+ sbi->s_journal->j_flags &= ~JFS_ROMOUNT;
+ spin_unlock(&sbi->s_journal->j_state_lock);
ext3_clear_journal_err(sb, es);
sbi->s_mount_state = le16_to_cpu(es->s_state);
if ((err = ext3_group_extend(sb, es, n_blocks_count)))
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 7e59c6e..26139a3 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -118,6 +118,7 @@ repeat:
spin_lock(&journal->j_state_lock);
repeat_locked:
if (is_journal_aborted(journal) ||
+ journal->j_flags & JFS_ROMOUNT ||
(journal->j_errno != 0 && !(journal->j_flags & JFS_ACK_ERR))) {
spin_unlock(&journal->j_state_lock);
ret = -EROFS;
diff --git a/include/linux/ext3_fs.h b/include/linux/ext3_fs.h
index 67a803a..f02ed52 100644
--- a/include/linux/ext3_fs.h
+++ b/include/linux/ext3_fs.h
@@ -369,6 +369,8 @@ struct ext3_inode {
#define EXT3_VALID_FS 0x0001 /* Unmounted cleanly */
#define EXT3_ERROR_FS 0x0002 /* Errors detected */
#define EXT3_ORPHAN_FS 0x0004 /* Orphans being recovered */
+#define EXT3_ROMOUNT 0x0008 /* FS is ro-mounted
+ * (Internal use only) */

/*
* Misc. filesystem flags
@@ -912,6 +914,7 @@ extern void ext3_dirty_inode(struct inode *, int);
extern int ext3_change_inode_journal_flag(struct inode *, int);
extern int ext3_get_inode_loc(struct inode *, struct ext3_iloc *);
extern int ext3_can_truncate(struct inode *inode);
+extern int __ext3_truncate(struct inode *inode);
extern void ext3_truncate(struct inode *inode);
extern void ext3_set_inode_flags(struct inode *);
extern void ext3_get_inode_flags(struct ext3_inode_info *);
diff --git a/include/linux/jbd.h b/include/linux/jbd.h
index e6a5e34..eae4b62 100644
--- a/include/linux/jbd.h
+++ b/include/linux/jbd.h
@@ -831,6 +831,8 @@ struct journal_s
#define JFS_ABORT_ON_SYNCDATA_ERR 0x040 /* Abort the journal on file
* data write error in ordered
* mode */
+#define JFS_ROMOUNT 0x080 /* Prevent new transaction creating
+ * while ro-mounting. */

/*
* Function declarations for the journaling transaction and buffer
--
1.5.5.6

2011-08-03 16:25:29

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext3: fix message in ext3_remount for rw-remount case

Hello,

On Wed 03-08-11 22:25:48, Toshiyuki Okajima wrote:
> On Wed, 3 Aug 2011 11:57:54 +0200
> Jan Kara <[email protected]> wrote:
> > On Wed 03-08-11 11:42:03, Toshiyuki Okajima wrote:
> > > >(2011/08/01 18:57), Jan Kara wrote:
> > > >>On Mon 01-08-11 18:45:58, Toshiyuki Okajima wrote:
> > > >>>(2011/08/01 17:45), Jan Kara wrote:
> > > >>>>On Mon 01-08-11 13:54:51, Toshiyuki Okajima wrote:
> > > >>>>>If there are some inodes in orphan list while a filesystem is being
> > > >>>>>read-only mounted, we should recommend that pepole umount and then
> > > >>>>>mount it when they try to remount with read-write. But the current
> > > >>>>>message/comment recommends that they umount and then remount it.
> > > <SNIP>
> > > >>>>the most... BTW, I guess you didn't really see this message in practice, did
> > > >>>>you?
> > > >>>No.
> > > >>>I have seen this message in practice while quotacheck command was repeatedly
> > > >>>executed per an hour.
> > > >>Interesting. Are you able to reproduce this? Quotacheck does remount
> > > >>read-only + remount read-write but you cannot really remount the filesystem
> > > >>read-only when it has orphan inodes and so you should not see those when
> > > >>you remount read-write again. Possibly there's race between remounting and
> > > >>unlinking...
> > > >Yes. I can reproduce it. However, it is not frequently reproduced
> > > >by using the original procedure (qutacheck per an hour). So, I made a
> > > >reproducer.
> > > To tell the truth, I think the race creates the message:
> > > -----------------------------------------------------------------------
> > > EXT3-fs: <dev>: couldn't remount RDWR because of
> > >      unprocessed orphan inode list. Please umount/remount instead.
> > > -----------------------------------------------------------------------
> > > which hides a serious problem.
> > I've inquired about this at linux-fsdevel (I think you were in CC unless
> > I forgot). It's a race in VFS remount code as you properly analyzed below.
> > People are working on fixing it but it's not trivial. Filesystem is really
> > a wrong place to fix such problem. If there is a trivial fix for ext3 to
> > workaround the issue, I can take it but I'm not willing to push anything
> > complex - effort should better be spent working on a generic fix.
> I also think read-only remount race in VFS layer should be fixed.
> However, I think this race depends on ext3/ext4 filesystem
> implementation. (Orphan inode list)
> So, we should modify ext3/ext4(jbd/jbd2) to fix it.
Umm, I don't understand here. If VFS makes sure that there are no
files open for writing, no unfinished operations changing the filesystem (e.g.
unlink), and no open-but-unlinked files, what remains for ext3 to check?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-08-08 04:26:25

by Toshiyuki Okajima

[permalink] [raw]
Subject: Re: [PATCH] ext3: fix message in ext3_remount for rw-remount case

Hi.

I'm sorry for my late response. I took vacations till yesterday.

(2011/08/04 1:25), Jan Kara wrote:
> Hello,
>
> On Wed 03-08-11 22:25:48, Toshiyuki Okajima wrote:
>> On Wed, 3 Aug 2011 11:57:54 +0200
>> Jan Kara<[email protected]> wrote:
>>> On Wed 03-08-11 11:42:03, Toshiyuki Okajima wrote:
>>>>> (2011/08/01 18:57), Jan Kara wrote:
>>>>>> On Mon 01-08-11 18:45:58, Toshiyuki Okajima wrote:
>>>>>>> (2011/08/01 17:45), Jan Kara wrote:
>>>>>>>> On Mon 01-08-11 13:54:51, Toshiyuki Okajima wrote:
>>>>>>>>> If there are some inodes in orphan list while a filesystem is being
>>>>>>>>> read-only mounted, we should recommend that pepole umount and then
>>>>>>>>> mount it when they try to remount with read-write. But the current
>>>>>>>>> message/comment recommends that they umount and then remount it.
>>>> <SNIP>
>>>>>>>> the most... BTW, I guess you didn't really see this message in practice, did
>>>>>>>> you?
>>>>>>> No.
>>>>>>> I have seen this message in practice while quotacheck command was repeatedly
>>>>>>> executed per an hour.
>>>>>> Interesting. Are you able to reproduce this? Quotacheck does remount
>>>>>> read-only + remount read-write but you cannot really remount the filesystem
>>>>>> read-only when it has orphan inodes and so you should not see those when
>>>>>> you remount read-write again. Possibly there's race between remounting and
>>>>>> unlinking...
>>>>> Yes. I can reproduce it. However, it is not frequently reproduced
>>>>> by using the original procedure (qutacheck per an hour). So, I made a
>>>>> reproducer.
>>>> To tell the truth, I think the race creates the message:
>>>> -----------------------------------------------------------------------
>>>> EXT3-fs:<dev>: couldn't remount RDWR because of
>>>>      unprocessed orphan inode list. Please umount/remount instead.
>>>> -----------------------------------------------------------------------
>>>> which hides a serious problem.
>>> I've inquired about this at linux-fsdevel (I think you were in CC unless
>>> I forgot). It's a race in VFS remount code as you properly analyzed below.
>>> People are working on fixing it but it's not trivial. Filesystem is really
>>> a wrong place to fix such problem. If there is a trivial fix for ext3 to
>>> workaround the issue, I can take it but I'm not willing to push anything
>>> complex - effort should better be spent working on a generic fix.
>> I also think read-only remount race in VFS layer should be fixed.
>> However, I think this race depends on ext3/ext4 filesystem
>> implementation. (Orphan inode list)
>> So, we should modify ext3/ext4(jbd/jbd2) to fix it.

> Umm, I don't understand here. If VFS makes sure that there are no

After I saw the following messages, I thought we must fix EXT3-fs error
at first. So, I created the fix patch.

(1) kernel: EXT3-fs: <dev>: couldn't remount RDWR because of
     unprocessed orphan inode list. Please umount/remount instead.
(2) kernel: EXT3-fs error (device <dev>) in start_transaction: Readonly filesystem

I wasn't aware that by fixing the race between "ro-remount" and "unlink",
that EXT3-fs error can be also fixed then.

> files open for writing, no unfinished operations changing the filesystem (e.g.
> unlink), and no open-but-unlinked files, what remains for ext3 to check?
OK.
Now, I also think we need not modify ext3 to fix these problems.
If we can prevent to add an inode into the orphan list (to start unlinking)
while ro-remounting, we can also prevent (1) and (2).

However, new mechanism to confirm whether "no open-but-unlinked" files
exist while ro-remounting is required, isn't it?

Thanks,
Toshiyuki Okajima

2011-08-08 12:48:22

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] ext3: fix message in ext3_remount for rw-remount case

Hello,

On Mon 08-08-11 13:27:12, Toshiyuki Okajima wrote:
> >On Wed 03-08-11 22:25:48, Toshiyuki Okajima wrote:
> >>On Wed, 3 Aug 2011 11:57:54 +0200
> >>Jan Kara<[email protected]> wrote:
> >>>>To tell the truth, I think the race creates the message:
> >>>>-----------------------------------------------------------------------
> >>>> EXT3-fs:<dev>: couldn't remount RDWR because of
> >>>>      unprocessed orphan inode list. Please umount/remount instead.
> >>>>-----------------------------------------------------------------------
> >>>>which hides a serious problem.
> >>> I've inquired about this at linux-fsdevel (I think you were in CC unless
> >>>I forgot). It's a race in VFS remount code as you properly analyzed below.
> >>>People are working on fixing it but it's not trivial. Filesystem is really
> >>>a wrong place to fix such problem. If there is a trivial fix for ext3 to
> >>>workaround the issue, I can take it but I'm not willing to push anything
> >>>complex - effort should better be spent working on a generic fix.
> >>I also think read-only remount race in VFS layer should be fixed.
> >>However, I think this race depends on ext3/ext4 filesystem
> >>implementation. (Orphan inode list)
> >>So, we should modify ext3/ext4(jbd/jbd2) to fix it.
>
> > Umm, I don't understand here. If VFS makes sure that there are no
>
> After I saw the following messages, I thought we must fix EXT3-fs error
> at first. So, I created the fix patch.
>
> (1) kernel: EXT3-fs: <dev>: couldn't remount RDWR because of
>      unprocessed orphan inode list. Please umount/remount instead.
> (2) kernel: EXT3-fs error (device <dev>) in start_transaction: Readonly filesystem
>
> I wasn't aware that by fixing the race between "ro-remount" and "unlink",
> that EXT3-fs error can be also fixed then.
>
> >files open for writing, no unfinished operations changing the filesystem (e.g.
> >unlink), and no open-but-unlinked files, what remains for ext3 to check?
> OK.
> Now, I also think we need not modify ext3 to fix these problems.
> If we can prevent to add an inode into the orphan list (to start unlinking)
> while ro-remounting, we can also prevent (1) and (2).
>
> However, new mechanism to confirm whether "no open-but-unlinked" files
> exist while ro-remounting is required, isn't it?
Yes. Actually, VFS already tracks open-but-unlinked files but the check &
remount pair is not atomic so that needs to be fixed and it is not that
simple.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR