LinuxLists.cc - ext2 FS corruption with 2.5.59.

2003-01-23 12:29:40

Subject: ext2 FS corruption with 2.5.59.

Hello!

While doing my usual quick tests I pass all my reiserfs patches through,
I accidentally forgot to mount reiserfs partition and all the tests
were performed on ext2 instead.
So I found that first of all fsx (freebsd nfs testing tool)
breaks if there are parallel writers (no matter if they write to this
same partition or not). i_blocks value becomes negative on truncate
(and then it is written to disk). fsx log attached.
Also subsequent fsck found lots of other corruptions.
(previously fs was checked and no corruptions were found).
Excerpts from e2fsck output are attached.

My test consists of running "fsx -c 1234 testfile", "iozone -a",
"dbench 60", "fsstress -p10 -n1000000 -d ." at the same time on the
tested FS.
fsx usually breaks just when dbench is finished.

Host is dual athlon 1700+, 1G RAM, SMP kernel, Highmem support, no preemt.

I reproduced this behaviour on vanilla 2.5.59. (and It is easily
reproducable)

"..." symbols in fsck output excerpts mean "lots of these".
angband:~ # /sbin/e2fsck -fn /dev/hda1 | wc -l
e2fsck 1.24a (02-Sep-2001)
520

Bye,
Oleg

Attachments:

(No filename) (1.14 kB)
fsck (1.54 kB)
testfile.fsxlog (60.48 kB)
typescript (193.00 B)
Download all attachments

2003-01-23 13:58:36

by Hugh Dickins

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

On Thu, 23 Jan 2003, Oleg Drokin wrote:
>
> My test consists of running "fsx -c 1234 testfile", "iozone -a",
> "dbench 60", "fsstress -p10 -n1000000 -d ." at the same time on the
> tested FS.
> fsx usually breaks just when dbench is finished.

It was "dbench 100" which led me to investigate and post patch below
a couple of days ago: worth trying again with this patch applied.
But I'm no expert on ext2, and fear there may be lots more wrong:
does look rather as if nobody has been stressing ext2 for a while.

For almost a year (since 2.5.4) ext2_new_block has tended to set err
0 instead of -ENOSPC or -EIO. This manifested variously (typically
depends on what's stale in ext2_get_block's chain[4] array): sometimes
__brelse free free buffer backtraces, sometimes release_pages oops,
usually generic_make_request beyond end of device messages, followed
by further ext2 errors.

Hugh

--- 2.5.59/fs/ext2/balloc.c Tue Dec 24 06:23:03 2002
+++ linux/fs/ext2/balloc.c Tue Jan 21 20:14:37 2003
@@ -470,10 +470,10 @@

ext2_debug ("allocating block %d. ", block);

+ *err = 0;
out_release:
group_release_blocks(desc, gdp_bh, group_alloc);
release_blocks(sb, es_alloc);
- *err = 0;
out_unlock:
unlock_super (sb);
DQUOT_FREE_BLOCK(inode, dq_alloc);

2003-01-23 14:17:30

by Oleg Drokin

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

Hello!

On Thu, Jan 23, 2003 at 02:09:12PM +0000, Hugh Dickins wrote:
> > My test consists of running "fsx -c 1234 testfile", "iozone -a",
> > "dbench 60", "fsstress -p10 -n1000000 -d ." at the same time on the
> > tested FS.
> > fsx usually breaks just when dbench is finished.
> It was "dbench 100" which led me to investigate and post patch below
> a couple of days ago: worth trying again with this patch applied.

I thought your patch only fixes stuff with insufficient space and/or IO
errors.
Anyway I applied the patch and still can reproduce the problem.

Also I decided to run same test with ext3 and it deadlocked.
This time it was not absolutely vanilla kernel, so I am going to try it on
vanilla kernel and report back if it will be reproducable there.
(with stacktraces)

Bye,
Oleg

2003-01-23 14:29:56

by Oleg Drokin

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

Hello!

On Thu, Jan 23, 2003 at 05:26:38PM +0300, Oleg Drokin wrote:
> > > My test consists of running "fsx -c 1234 testfile", "iozone -a",
> > > "dbench 60", "fsstress -p10 -n1000000 -d ." at the same time on the
> > > tested FS.
> > > fsx usually breaks just when dbench is finished.
> Also I decided to run same test with ext3 and it deadlocked.
> This time it was not absolutely vanilla kernel, so I am going to try it on
> vanilla kernel and report back if it will be reproducable there.
> (with stacktraces)

Ok, so on vanilla 2.5.59, the above test on ext3 fs crashes my kernel
soon (in several seconds) after dbench is over.

kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on ide0(3,6), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
------------[ cut here ]------------
kernel BUG at fs/buffer.c:2545!
invalid operand: 0000
CPU: 0
EIP: 0060:[<c0151d86>] Not tainted
EFLAGS: 00010246
EIP is at submit_bh+0x26/0x110
eax: 00000405 ebx: efe93180 ecx: f7c5a4b4 edx: f1b928a0
esi: 00000001 edi: 00000001 ebp: e1bc1e2c esp: e1bc1e24
ds: 007b es: 007b ss: 0068
Process kjournald (pid: 204, threadinfo=e1bc0000 task=f7a0c680)
Stack: efe93180 00000000 e1bc1e48 c0151ebd 00000001 efe93180 e1bc1eb0 e0c5e5a0
df65ccc0 e1bc1fb0 c01941bc 00000001 00000001 e1bc1eb0 e1bc0000 f7c5a400
f7c5a45c 00000000 00000000 00000000 f7c5a4b4 f7c5a45c df65cd10 f7c5a43c
Call Trace:
[<c0151ebd>] ll_rw_block+0x4d/0x70
[<c01941bc>] journal_commit_transaction+0x49c/0x1150
[<c0118b3a>] schedule+0x3ea/0x4c0
[<c0196ed6>] kjournald+0x1e6/0x2f0
[<c0196cf0>] kjournald+0x0/0x2f0
[<c0196cd0>] commit_timeout+0x0/0x10
[<c01071b1>] kernel_thread_helper+0x5/0x14

Code: 0f 0b f1 09 7c 31 28 c0 89 f6 83 7b 20 00 75 0a 0f 0b f2 09

Bye,
Oleg

2003-01-24 10:22:46

by Andrew Morton

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

Oleg Drokin <[email protected]> wrote:
>
> Hello!
>
> While doing my usual quick tests I pass all my reiserfs patches through,
> I accidentally forgot to mount reiserfs partition and all the tests
> were performed on ext2 instead.
> So I found that first of all fsx (freebsd nfs testing tool)
> breaks if there are parallel writers (no matter if they write to this
> same partition or not). i_blocks value becomes negative on truncate
> (and then it is written to disk). fsx log attached.
> Also subsequent fsck found lots of other corruptions.
> (previously fs was checked and no corruptions were found).
> Excerpts from e2fsck output are attached.
>
> My test consists of running "fsx -c 1234 testfile", "iozone -a",
> "dbench 60", "fsstress -p10 -n1000000 -d ." at the same time on the
> tested FS.
> fsx usually breaks just when dbench is finished.
>
> Host is dual athlon 1700+, 1G RAM, SMP kernel, Highmem support, no preemt.
>
> I reproduced this behaviour on vanilla 2.5.59. (and It is easily
> reproducable)
>

In 2.5.52 I broke sys_sync in subtle ways. It seems that fsstress hits
sync() hard enough to trigger the failure.

sys_sync() will set mapping->dirtied_when non-zero against a clean inode.
Later, in (say) __iget(), that inode gets moved over to inode_unused or
inode_in_use. But because it has non-zero ->dirtied_when,
__mark_inode_dirty() thinks that the inode must still be on sb->s_dirty.

But it isn't. It's on inode_in_use. It (and its pages) never get written
out and the data gets thrown away on unmount. Christoph has hit the same
thing. I bet he was running fsstress too. Sorry about that.

The below patch should fix it up - please test that.

So that's the umount problem. I don't know why you're getting the fsx-linux
failure - I was unable to hit it in an hour's run of the above workload on
4-way. Against both scsi and IDE. So please look further into that, thanks.

diff -puN fs/fs-writeback.c~sync-fix fs/fs-writeback.c
--- 25/fs/fs-writeback.c~sync-fix 2003-01-24 02:14:23.000000000 -0800
+++ 25-akpm/fs/fs-writeback.c 2003-01-24 02:17:12.000000000 -0800
@@ -67,6 +67,7 @@ void __mark_inode_dirty(struct inode *in

spin_lock(&inode_lock);
if ((inode->i_state & flags) != flags) {
+ const int was_dirty = inode->i_state & I_DIRTY;
struct address_space *mapping = inode->i_mapping;

inode->i_state |= flags;
@@ -90,7 +91,7 @@ void __mark_inode_dirty(struct inode *in
* If the inode was already on s_dirty or s_io, don't
* reposition it (that would break s_dirty time-ordering).
*/
- if (!mapping->dirtied_when) {
+ if (!was_dirty) {
mapping->dirtied_when = jiffies|1; /* 0 is special */
list_move(&inode->i_list, &sb->s_dirty);
}
@@ -280,7 +281,7 @@ sync_sb_inodes(struct super_block *sb, s
__iget(inode);
__writeback_single_inode(inode, really_sync, wbc);
if (wbc->sync_mode == WB_SYNC_HOLD) {
- mapping->dirtied_when = jiffies;
+ mapping->dirtied_when = jiffies|1;
list_move(&inode->i_list, &sb->s_dirty);
}
if (current_is_pdflush())

_

2003-01-24 12:30:24

by Oleg Drokin

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

Hello!

On Fri, Jan 24, 2003 at 02:32:13AM -0800, Andrew Morton wrote:

> The below patch should fix it up - please test that.

Yup. The patch gets rids of those lots of inode problems.
But fsx issue is still there.

> So that's the umount problem. I don't know why you're getting the fsx-linux
> failure - I was unable to hit it in an hour's run of the above workload on
> 4-way. Against both scsi and IDE. So please look further into that, thanks.

Ok, So far simplest way of reproducing for me was this:
mkdir /mnt
fsstress -p 10 -n1000000 -d /mnt &
fsx -c 1234 testfile

Now look at fsx output. When it says about "truncating to largest ever: 0x3ffff",
wait 10 more seconds and if nothing happens, ^C the fsx,
run "rm -rf testfile*"
now run fsx again
and so on until it fails.

Last time it took me three iterations to reproduce.

Bye,
Oleg

2003-01-25 06:43:44

by Andrew Morton

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

Oleg Drokin <[email protected]> wrote:
>
> Ok, So far simplest way of reproducing for me was this:
> mkdir /mnt
> fsstress -p 10 -n1000000 -d /mnt &
> fsx -c 1234 testfile
>
> Now look at fsx output. When it says about "truncating to largest ever: 0x3ffff",
> wait 10 more seconds and if nothing happens, ^C the fsx,
> run "rm -rf testfile*"
> now run fsx again
> and so on until it fails.
>
> Last time it took me three iterations to reproduce.
>

Well I've been running this for a couple of hours:

#!/bin/sh
while true
do
fsstress -p 10 -n1000000 -d . &
fsx-linux -c 1234 testfile &
sleep 180
killall fsx-linux fsstress
sleep 1
done

Chance of close/open is 1 in 1234
seed = 1043574092
truncating to largest ever: 0x13e76
truncating to largest ever: 0x2e52c
truncating to largest ever: 0x3c2c2
truncating to largest ever: 0x3f15f
truncating to largest ever: 0x3fcb9
truncating to largest ever: 0x3fe96
truncating to largest ever: 0x3ff9d
truncating to largest ever: 0x3ffff
skipping zero size read
skipping zero size write
/home/akpm/fsx-test: line 9: 1253 Terminated fsstress -p 10 -n1000000 -d .
signal 15
testcalls = 95103
seed = 1043594552

So I'm going to have to ask you to investigate further. Does it happen on
other machines? Other filesystems? Older kernels? That sort of thing.

Thanks.

2003-01-25 20:05:51

by Oleg Drokin

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

Hello!

On Fri, Jan 24, 2003 at 10:53:20PM -0800, Andrew Morton wrote:
> > Ok, So far simplest way of reproducing for me was this:
> > mkdir /mnt
> > fsstress -p 10 -n1000000 -d /mnt &
> > fsx -c 1234 testfile
> > Now look at fsx output. When it says about "truncating to largest ever: 0x3ffff",
> > wait 10 more seconds and if nothing happens, ^C the fsx,
> > run "rm -rf testfile*"
> > now run fsx again
> > and so on until it fails.
> > Last time it took me three iterations to reproduce.
> Well I've been running this for a couple of hours:
> #!/bin/sh
> while true
> do
> fsstress -p 10 -n1000000 -d . &
> fsx-linux -c 1234 testfile &
> sleep 180
> killall fsx-linux fsstress
> sleep 1
> done

Hm. I do not kill my fsstress.
Here is my version:

#!/bin/bash
mke2fs /dev/hdb2
mount /dev/hdb2 /mnt -t ext2
cd /mnt
/tests/fsstress -p10 -n 1000000 -d . &
while /tests/fsx -c 1234 -N60309 testfile ; do rm -rf testfile*; done

> So I'm going to have to ask you to investigate further. Does it happen on
> other machines? Other filesystems? Older kernels? That sort of thing.

Ok. I am able to reproduce that same thing on dual Pentium4 2.2GHz (HT enabled), 512M RAM.
My testing reveals that 2.5.50, 2.5.52 and 2.5.53 do not have the problem.
2.5.54, 2.5.55, 2.5.59 have the problem.
ext3 and reiserfs do not show this problem.
Looking through 2.5.54, I found that changeset named "[PATCH] quota locking update" forwarded by you
is the cause for the problem.
(also by reviewing the changeset it became clear that in order to succesfully reproduce the
problem one must have CONFIG_QUOTA to be disabled).
This small patch below fixes the problem for me (just reverts back part of quota locking patch in fact).
Also I think that taking BKL just to update some inode accounting stuff is kind of expensive,
so certainly better solution must exist.

===== include/linux/quotaops.h 1.13 vs edited =====
--- 1.13/include/linux/quotaops.h Wed Jan 1 05:44:36 2003
+++ edited/include/linux/quotaops.h Sat Jan 25 14:52:57 2003
@@ -175,7 +175,9 @@
#define DQUOT_TRANSFER(inode, iattr) (0)
extern __inline__ int DQUOT_PREALLOC_SPACE_NODIRTY(struct inode *inode, qsize_t nr)
{
+ lock_kernel();
inode_add_bytes(inode, nr);
+ unlock_kernel();
return 0;
}

@@ -188,7 +190,9 @@

extern __inline__ int DQUOT_ALLOC_SPACE_NODIRTY(struct inode *inode, qsize_t nr)
{
+ lock_kernel();
inode_add_bytes(inode, nr);
+ unlock_kernel();
return 0;
}

@@ -201,7 +205,9 @@

extern __inline__ void DQUOT_FREE_SPACE_NODIRTY(struct inode *inode, qsize_t nr)
{
+ lock_kernel();
inode_sub_bytes(inode, nr);
+ unlock_kernel();
}

extern __inline__ void DQUOT_FREE_SPACE(struct inode *inode, qsize_t nr)

Bye,
Oleg

2003-01-25 23:03:21

by Andrew Morton

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

Oleg Drokin <[email protected]> wrote:
>
> Here is my version:
>
> #!/bin/bash
> mke2fs /dev/hdb2
> mount /dev/hdb2 /mnt -t ext2
> cd /mnt
> /tests/fsstress -p10 -n 1000000 -d . &
> while /tests/fsx -c 1234 -N60309 testfile ; do rm -rf testfile*; done
>
> > So I'm going to have to ask you to investigate further. Does it happen on
> > other machines? Other filesystems? Older kernels? That sort of thing.
>
> Ok. I am able to reproduce that same thing on dual Pentium4 2.2GHz (HT enabled), 512M RAM.
> My testing reveals that 2.5.50, 2.5.52 and 2.5.53 do not have the problem.
> 2.5.54, 2.5.55, 2.5.59 have the problem.
> ext3 and reiserfs do not show this problem.
> Looking through 2.5.54, I found that changeset named "[PATCH] quota locking update" forwarded by you
> is the cause for the problem.
> (also by reviewing the changeset it became clear that in order to succesfully reproduce the
> problem one must have CONFIG_QUOTA to be disabled).
> This small patch below fixes the problem for me (just reverts back part of quota locking patch in fact).
> Also I think that taking BKL just to update some inode accounting stuff is kind of expensive,
> so certainly better solution must exist.

aaargggggghhh. I noticed that when I was reviewing the patch, but promptly
forgot to check it more closely.

However, in reviewing it, I don't see exactly what's going on. Because only
one process is accessing the stat information of the fsx inode anyway?

Yes, we need to rub out that i_bytes/i_blocks thing, replace it with an
atomically updated loff_t, etc. But I'd like to understand what the exact
failure is first. It _seems_ that stat has somehow seen a >4G value of
stat->size, but how can this happen???

I need to breathe life back into my P4 box, see if I can get it to happen
there.

Thanks.

2003-01-26 02:54:28

by Andrew Morton

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

Oleg Drokin <[email protected]> wrote:
>
> Looking through 2.5.54, I found that changeset named "[PATCH] quota locking update" forwarded by you
> is the cause for the problem.

OK, it's writepage-versus-writepage and/or writepage-versus-truncate.
There's no locking between the updaters and i_blocks goes negative.
This is good ;) We have done well.

I went for the new frlocks which I dragged into -mm to fix up the xtime_lock
problems. It's not 100% appropriate, but it'll do just fine.

Stephen, we can use that frlock in the inode to protect access to the 64-bit
i_size on 32-bit platforms as well. However in the case of i_size, the
updaters do not need to take the frlock's spinlock, because writers are
serialised inode->i_sem.

So could you please take a look at adding new writer-side primitives to
frlocks to cover this requirement? See i_size_read/i_size_write from the -aa
kernel:

+static inline void i_size_write(struct inode * inode, loff_t i_size)
+{
+#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
+#ifdef __ARCH_HAS_GET_SET_64BIT
+ set_64bit((unsigned long long *) &inode->i_size, (unsigned long long) i_size);
+#else
+ inode->i_size_version2++;
+ wmb();
+ inode->i_size = i_size;
+ wmb();
+ inode->i_size_version1++;
+ wmb(); /* make it visible ASAP */
+#endif
+#elif BITS_PER_LONG==64 || !defined(CONFIG_SMP)
+ inode->i_size = i_size;
+#endif
+}

Something like this might suffice:

diff -puN include/linux/frlock.h~frlock_write_begin include/linux/frlock.h
--- 25/include/linux/frlock.h~frlock_write_begin 2003-01-25 18:59:27.000000000 -0800
+++ 25-akpm/include/linux/frlock.h 2003-01-25 19:01:27.000000000 -0800
@@ -80,6 +80,22 @@ static inline unsigned fr_read_end(frloc
return rw->pre_sequence;
}

+static inline unsigned fr_write_begin(frlock_t *rw)
+{
+ unsigned ret = rw->pre_sequence++;
+ wmb();
+ return ret;
+}
+
+static inline unsigned fr_write_end(frlock_t *rw)
+{
+ unsgned ret;
+ wmb();
+ ret = ++(rw->post_sequence);
+ wmb();
+ return ret;
+}
+
/*
* Possible sw/hw IRQ protected versions of the interfaces.
*/

Thanks.

Since Jan removed the lock_kernel()s in inode_add_bytes() and
inode_sub_bytes(), these functions have been racy.

One problematic workload has been discovered in which concurrent writepage
and truncate on SMP quickly causes i_blocks to go negative. writepage() does
not take i_sem, and it seems that for ext2, there are no other locks in
force when inode_add_bytes() is called.

Putting the BKL back in there is not acceptable. To fix this race I have
used the new frlocks.

This is actually fairly pointless, because these fields are write-mostly and
read-rarely. But we need a spinlock anyway because of the concurrent
modifiers problem.

And we need this frlock in the inode to solve the problem of the nonatomicity
of i_size. (But some aditional frlock work is needed first: in the case of
i_size we do not need to take the spinlock because of i_sem protection).

The splitting of the used disk space into i_blocks and i_bytes is silly - we
should nuke all that and just have a bare loff_t i_usedbytes. Later.

inode.c | 1 +
stat.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
linux/fs.h | 52 +++++++++++++++++-----------------------------------
ksyms.c | 4 ++++
4 files changed, 68 insertions(+), 35 deletions(-)

diff -puN include/linux/fs.h~inode-accounting-race-fix include/linux/fs.h
--- linux-hype/include/linux/fs.h~inode-accounting-race-fix 2003-01-25 18:36:25.000000000 -0800
+++ linux-hype-akpm/include/linux/fs.h 2003-01-25 18:36:25.000000000 -0800
@@ -18,6 +18,7 @@
#include <linux/stat.h>
#include <linux/cache.h>
#include <linux/radix-tree.h>
+#include <linux/frlock.h>
#include <asm/atomic.h>

struct iovec;
@@ -371,9 +372,19 @@ struct inode {
struct timespec i_ctime;
unsigned int i_blkbits;
unsigned long i_blksize;
- unsigned long i_blocks;
unsigned long i_version;
+
+ /*
+ * i_bytes and i_blocks represent the amount of allocated disk space:
+ * (i_blocks * 512 + i_bytes) (FIXME: just use an loff_t here). They
+ * are protected by i_frlock.
+ */
+ unsigned long i_blocks;
unsigned short i_bytes;
+
+ /* Protects i_blocks, i_bytes and, later, i_size */
+ frlock_t i_frlock;
+
struct semaphore i_sem;
struct inode_operations *i_op;
struct file_operations *i_fop; /* former ->i_op->default_file_ops */
@@ -400,7 +411,7 @@ struct inode {
void *i_security;
__u32 i_generation;
union {
- void *generic_ip;
+ void *generic_ip;
} u;
};

@@ -412,39 +423,6 @@ struct fown_struct {
void *security;
};

-static inline void inode_add_bytes(struct inode *inode, loff_t bytes)
-{
- inode->i_blocks += bytes >> 9;
- bytes &= 511;
- inode->i_bytes += bytes;
- if (inode->i_bytes >= 512) {
- inode->i_blocks++;
- inode->i_bytes -= 512;
- }
-}
-
-static inline void inode_sub_bytes(struct inode *inode, loff_t bytes)
-{
- inode->i_blocks -= bytes >> 9;
- bytes &= 511;
- if (inode->i_bytes < bytes) {
- inode->i_blocks--;
- inode->i_bytes += 512;
- }
- inode->i_bytes -= bytes;
-}
-
-static inline loff_t inode_get_bytes(struct inode *inode)
-{
- return (((loff_t)inode->i_blocks) << 9) + inode->i_bytes;
-}
-
-static inline void inode_set_bytes(struct inode *inode, loff_t bytes)
-{
- inode->i_blocks = bytes >> 9;
- inode->i_bytes = bytes & 511;
-}
-
/*
* Track a single file's readahead state
*/
@@ -1277,6 +1255,10 @@ extern int page_symlink(struct inode *in
extern struct inode_operations page_symlink_inode_operations;
extern void generic_fillattr(struct inode *, struct kstat *);
extern int vfs_getattr(struct vfsmount *, struct dentry *, struct kstat *);
+void inode_add_bytes(struct inode *inode, loff_t bytes);
+void inode_sub_bytes(struct inode *inode, loff_t bytes);
+loff_t inode_get_bytes(struct inode *inode);
+void inode_set_bytes(struct inode *inode, loff_t bytes);

extern int vfs_readdir(struct file *, filldir_t, void *);

diff -puN fs/stat.c~inode-accounting-race-fix fs/stat.c
--- linux-hype/fs/stat.c~inode-accounting-race-fix 2003-01-25 18:36:25.000000000 -0800
+++ linux-hype-akpm/fs/stat.c 2003-01-25 18:36:25.000000000 -0800
@@ -316,3 +316,49 @@ asmlinkage long sys_fstat64(unsigned lon
}

#endif /* LFS-64 */
+
+void inode_add_bytes(struct inode *inode, loff_t bytes)
+{
+ fr_write_lock(&inode->i_frlock);
+ inode->i_blocks += bytes >> 9;
+ bytes &= 511;
+ inode->i_bytes += bytes;
+ if (inode->i_bytes >= 512) {
+ inode->i_blocks++;
+ inode->i_bytes -= 512;
+ }
+ fr_write_unlock(&inode->i_frlock);
+}
+
+void inode_sub_bytes(struct inode *inode, loff_t bytes)
+{
+ fr_write_lock(&inode->i_frlock);
+ inode->i_blocks -= bytes >> 9;
+ bytes &= 511;
+ if (inode->i_bytes < bytes) {
+ inode->i_blocks--;
+ inode->i_bytes += 512;
+ }
+ inode->i_bytes -= bytes;
+ fr_write_unlock(&inode->i_frlock);
+}
+
+loff_t inode_get_bytes(struct inode *inode)
+{
+ unsigned seq;
+ loff_t ret;
+
+ do {
+ seq = fr_read_begin(&inode->i_frlock);
+ ret = (((loff_t)inode->i_blocks) << 9) + inode->i_bytes;
+ } while (seq != fr_read_end(&inode->i_frlock));
+ return ret;
+}
+
+void inode_set_bytes(struct inode *inode, loff_t bytes)
+{
+ fr_write_lock(&inode->i_frlock);
+ inode->i_blocks = bytes >> 9;
+ inode->i_bytes = bytes & 511;
+ fr_write_unlock(&inode->i_frlock);
+}
diff -puN kernel/ksyms.c~inode-accounting-race-fix kernel/ksyms.c
--- linux-hype/kernel/ksyms.c~inode-accounting-race-fix 2003-01-25 18:36:25.000000000 -0800
+++ linux-hype-akpm/kernel/ksyms.c 2003-01-25 18:36:25.000000000 -0800
@@ -272,6 +272,10 @@ EXPORT_SYMBOL(vfs_fstat);
EXPORT_SYMBOL(vfs_stat);
EXPORT_SYMBOL(vfs_lstat);
EXPORT_SYMBOL(vfs_getattr);
+EXPORT_SYMBOL(inode_add_bytes);
+EXPORT_SYMBOL(inode_sub_bytes);
+EXPORT_SYMBOL(inode_get_bytes);
+EXPORT_SYMBOL(inode_set_bytes);
EXPORT_SYMBOL(lock_rename);
EXPORT_SYMBOL(unlock_rename);
EXPORT_SYMBOL(generic_read_dir);
diff -puN fs/inode.c~inode-accounting-race-fix fs/inode.c
--- linux-hype/fs/inode.c~inode-accounting-race-fix 2003-01-25 18:36:25.000000000 -0800
+++ linux-hype-akpm/fs/inode.c 2003-01-25 18:36:25.000000000 -0800
@@ -176,6 +176,7 @@ void inode_init_once(struct inode *inode
spin_lock_init(&inode->i_data.private_lock);
INIT_LIST_HEAD(&inode->i_data.i_mmap);
INIT_LIST_HEAD(&inode->i_data.i_mmap_shared);
+ frlock_init(&inode->i_frlock);
}

static void init_once(void * foo, kmem_cache_t * cachep, unsigned long flags)

_

2003-01-26 03:19:14

by William Lee Irwin III

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

On Sat, Jan 25, 2003 at 07:04:10PM -0800, Andrew Morton wrote:
+static inline unsigned fr_write_begin(frlock_t *rw)
+{
+ unsigned ret = rw->pre_sequence++;
+ wmb();
+ return ret;
+}
+
+static inline unsigned fr_write_end(frlock_t *rw)
+{
+ unsgned ret;
+ wmb();
+ ret = ++(rw->post_sequence);
+ wmb();
+ return ret;
+}

Ticket locks need atomic fetch and increment. These don't look right.

-- wli

2003-01-26 03:37:03

by Andrew Morton

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

William Lee Irwin III <[email protected]> wrote:
>
> On Sat, Jan 25, 2003 at 07:04:10PM -0800, Andrew Morton wrote:
> +static inline unsigned fr_write_begin(frlock_t *rw)
> +{
> + unsigned ret = rw->pre_sequence++;
> + wmb();
> + return ret;
> +}
> +
> +static inline unsigned fr_write_end(frlock_t *rw)
> +{
> + unsgned ret;
> + wmb();
> + ret = ++(rw->post_sequence);
> + wmb();
> + return ret;
> +}
>
> Ticket locks need atomic fetch and increment. These don't look right.

Well look at the reader side:

loff_t i_size_read(struct inode *inode)
{
unsigned seq;
loff_t ret;

do {
seq = fr_write_begin(&inode->i_frlock);
ret = inode->i_size;
} while (seq != fr_write_end(&inode->i_frlock);
return ret;
}

One change which is needed here is to disable preemption in fr_write_begin();
otherwise an frlock could be in the pre!=post state for hundreds of
milliseconds while the writer gets preempted. Other CPUs would just spin for
the duration.

The same would happen if the writer takes an interrupt while pre!=post, but
that's the same for all spinlocks...

2003-01-26 04:05:29

by William Lee Irwin III

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

William Lee Irwin III <[email protected]> wrote:
>> Ticket locks need atomic fetch and increment. These don't look right.

On Sat, Jan 25, 2003 at 07:46:48PM -0800, Andrew Morton wrote:
> Well look at the reader side:
> loff_t i_size_read(struct inode *inode)
> {
> unsigned seq;
> loff_t ret;
>
> do {
> seq = fr_write_begin(&inode->i_frlock);
> ret = inode->i_size;
> } while (seq != fr_write_end(&inode->i_frlock);
> return ret;
> }

This doesn't look particularly reassuring either. We have:

(1) increment ->pre_sequence
(2) wmb()
(3) get inode->i_size
(4) wmb()
(5) increment ->post_sequence
(6) wmb()

Supposing the overall scheme is sound, one of the wmb()'s is unnecessary;
in theory rmb() is all that's needed before (5) to catch writes.

I'd have to go through some kind of state transition fiasco to be sure
this actually recovers from the races where two readers fetch the same
value of ->pre_sequence or ->post_sequence and store the same
incremented value to convince myself this is right. I'll assume you've
either done so yourself or are relying on someone else's verification.

Restarting the read like this is highly unusual; if retrying the
critical section is in fact the basis of this locking algorithm then
it's not a true ticket lock.

On Sat, Jan 25, 2003 at 07:46:48PM -0800, Andrew Morton wrote:
> One change which is needed here is to disable preemption in fr_write_begin();
> otherwise an frlock could be in the pre!=post state for hundreds of
> milliseconds while the writer gets preempted. Other CPUs would just spin for
> the duration.
> The same would happen if the writer takes an interrupt while pre!=post, but
> that's the same for all spinlocks...

Yes, this is a standard requirement for all non-sleeping locks. There
was only enough code in the post to "be sure" that the fetch and
increment bits were missing; I assumed that otherwise they'd be wrapped.

-- wli

2003-01-26 05:00:58

by Andrew Morton

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

William Lee Irwin III <[email protected]> wrote:
>
> William Lee Irwin III <[email protected]> wrote:
> >> Ticket locks need atomic fetch and increment. These don't look right.
>
> On Sat, Jan 25, 2003 at 07:46:48PM -0800, Andrew Morton wrote:
> > Well look at the reader side:
> > loff_t i_size_read(struct inode *inode)
> > {
> > unsigned seq;
> > loff_t ret;
> >
> > do {
> > seq = fr_write_begin(&inode->i_frlock);
> > ret = inode->i_size;
> > } while (seq != fr_write_end(&inode->i_frlock);
> > return ret;
> > }

argh. That should have been:

> > seq = fr_read_begin(&inode->i_frlock);
> > ret = inode->i_size;
> > } while (seq != fr_read_end(&inode->i_frlock);
> > return ret;
> > }

of course.

> This doesn't look particularly reassuring either. We have:
>
> (1) increment ->pre_sequence
> (2) wmb()
> (3) get inode->i_size
> (4) wmb()
> (5) increment ->post_sequence
> (6) wmb()
>
> Supposing the overall scheme is sound, one of the wmb()'s is unnecessary;

Could be.

> I'd have to go through some kind of state transition fiasco to be sure
> this actually recovers from the races where two readers fetch the same
> value of ->pre_sequence or ->post_sequence and store the same
> incremented value to convince myself this is right.

readers do not modify the lock - they simply observe.

The fr_write_begin/fr_write_end pair assumes that there is only a single
writer possible. In the case of i_size, that exclusion is provided by i_sem.
i_size is always modified under i_sem.

> I'll assume you've
> either done so yourself or are relying on someone else's verification.

More the latter ;)

> Restarting the read like this is highly unusual; if retrying the
> critical section is in fact the basis of this locking algorithm then
> it's not a true ticket lock.

Retrying the read is the basis of the locking algorithm.

The frlock stuff needs more work for non-SMP bloat avoidance, but it's simple
and seems sensible.

2003-01-26 09:16:39

by Oleg Drokin

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

Hello!

On Sat, Jan 25, 2003 at 03:13:01PM -0800, Andrew Morton wrote:
> > Also I think that taking BKL just to update some inode accounting stuff is kind of expensive,
> > so certainly better solution must exist.
> However, in reviewing it, I don't see exactly what's going on. Because only
> one process is accessing the stat information of the fsx inode anyway?

Well, I came to conclusion that we have fsx doing truncate racing
with fsstress doing sync->iput->ext2_discard_prealloc()

> Yes, we need to rub out that i_bytes/i_blocks thing, replace it with an
> atomically updated loff_t, etc. But I'd like to understand what the exact
> failure is first. It _seems_ that stat has somehow seen a >4G value of
> stat->size, but how can this happen???

It's just become negative as we discarding prealloc twice.

Bye,
Oleg

2003-01-26 11:03:44

by Anton Blanchard

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

Hi,

> +static inline void i_size_write(struct inode * inode, loff_t i_size)
> +{
> +#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
> +#ifdef __ARCH_HAS_GET_SET_64BIT
> + set_64bit((unsigned long long *) &inode->i_size, (unsigned long long) i_size);
> +#else
> + inode->i_size_version2++;
> + wmb();
> + inode->i_size = i_size;
> + wmb();
> + inode->i_size_version1++;
> + wmb(); /* make it visible ASAP */
> +#endif
> +#elif BITS_PER_LONG==64 || !defined(CONFIG_SMP)
> + inode->i_size = i_size;
> +#endif
> +}

That last wmb is suspect. We dont put an wmb after a spinlock to "make it
visible". If you think of an wmb as an ordering tag that propagates out
through the cpu and storage hierarchy then wmb is not going to help us here.

I guess the store could get reordered (and so delayed) a bit, but an wmb
is relatively expensive on some architectures.

> This is actually fairly pointless, because these fields are write-mostly and
> read-rarely. But we need a spinlock anyway because of the concurrent
> modifiers problem.

It would be interesting to compare a spinlock or rwlock against a frlock
in a write mostly situation. Actually isnt it going to be slower because
we have to take a spinlock to serialise around the frlock write path?

Anton

2003-01-26 11:13:57

by Andrew Morton

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

Anton Blanchard <[email protected]> wrote:
>
>
> Hi,
>
> > +static inline void i_size_write(struct inode * inode, loff_t i_size)
> > +{
> > +#if BITS_PER_LONG==32 && defined(CONFIG_SMP)
> > +#ifdef __ARCH_HAS_GET_SET_64BIT
> > + set_64bit((unsigned long long *) &inode->i_size, (unsigned long long) i_size);
> > +#else
> > + inode->i_size_version2++;
> > + wmb();
> > + inode->i_size = i_size;
> > + wmb();
> > + inode->i_size_version1++;
> > + wmb(); /* make it visible ASAP */
> > +#endif
> > +#elif BITS_PER_LONG==64 || !defined(CONFIG_SMP)
> > + inode->i_size = i_size;
> > +#endif
> > +}
>
> That last wmb is suspect. We dont put an wmb after a spinlock to "make it
> visible". If you think of an wmb as an ordering tag that propagates out
> through the cpu and storage hierarchy then wmb is not going to help us here.

OK.

> I guess the store could get reordered (and so delayed) a bit, but an wmb
> is relatively expensive on some architectures.
>
> > This is actually fairly pointless, because these fields are write-mostly and
> > read-rarely. But we need a spinlock anyway because of the concurrent
> > modifiers problem.
>
> It would be interesting to compare a spinlock or rwlock against a frlock
> in a write mostly situation. Actually isnt it going to be slower because
> we have to take a spinlock to serialise around the frlock write path?

A little bit, if it it very write-mostly. But the gains on the read side
will compensate.

Here is Stephen's implementation:

linux/frlock.h | 100 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
linux/time.h | 3 +
time.c | 31 +++++++++--------
timer.c | 22 ++++++------
4 files changed, 131 insertions(+), 25 deletions(-)

diff -puN /dev/null include/linux/frlock.h
--- /dev/null 2002-08-30 16:31:37.000000000 -0700
+++ linux-hype-akpm/include/linux/frlock.h 2003-01-25 18:35:14.000000000 -0800
@@ -0,0 +1,100 @@
+#ifndef __LINUX_FRLOCK_H
+#define __LINUX_FRLOCK_H
+
+/*
+ * Fast read-write spinlocks.
+ *
+ * Fast reader/writer locks without starving writers. This type of
+ * lock for data where the reader wants a consitent set of information
+ * and is willing to retry if the information changes. Readers never
+ * block but they may have to retry if a writer is in
+ * progress. Writers do not wait for readers.
+ *
+ * Generalization on sequence variables used for gettimeofday on x86-64
+ * by Andrea Arcangeli
+ *
+ * This is not as cache friendly as brlock. Also, this will not work
+ * for data that contains pointers, because any writer could
+ * invalidate a pointer that a reader was following.
+ *
+ *
+ * Expected reader usage:
+ * do {
+ * seq = fr_read_begin();
+ * ...
+ * } while (seq != fr_read_end());
+ *
+ * On non-SMP the spin locks disappear but the writer still needs
+ * to increment the sequence variables because an interrupt routine could
+ * change the state of the data.
+ */
+
+#include <linux/config.h>
+#include <linux/spinlock.h>
+
+typedef struct {
+ spinlock_t lock;
+ unsigned pre_sequence;
+ unsigned post_sequence;
+} frlock_t;
+
+#define FR_LOCK_UNLOCKED (frlock_t) { SPIN_LOCK_UNLOCKED, 0, 0 }
+#define frlock_init(x) do { *(x) = FR_LOCK_UNLOCKED; } while (0)
+
+static inline void fr_write_lock(frlock_t *rw)
+{
+ spin_lock(&rw->lock);
+ rw->pre_sequence++;
+ wmb();
+}
+
+static inline void fr_write_unlock(frlock_t *rw)
+{
+ wmb();
+ rw->post_sequence++;
+ spin_unlock(&rw->lock);
+}
+
+static inline int fr_write_trylock(frlock_t *rw)
+{
+ int ret = spin_trylock(&rw->lock);
+
+ if (ret) {
+ ++rw->pre_sequence;
+ wmb();
+ }
+ return ret;
+}
+
+static inline unsigned fr_read_begin(frlock_t *rw)
+{
+ unsigned ret = rw->post_sequence;
+ rmb();
+ return ret;
+
+}
+
+static inline unsigned fr_read_end(frlock_t *rw)
+{
+ rmb();
+ return rw->pre_sequence;
+}
+
+/*
+ * Possible sw/hw IRQ protected versions of the interfaces.
+ */
+#define fr_write_lock_irqsave(lock, flags) \
+ do { local_irq_save(flags); fr_write_lock(lock); } while (0)
+#define fr_write_lock_irq(lock) \
+ do { local_irq_disable(); fr_write_lock(lock); } while (0)
+#define fr_write_lock_bh(lock) \
+ do { local_bh_disable(); fr_write_lock(lock); } while (0)
+
+#define fr_write_unlock_irqrestore(lock, flags) \
+ do { fr_write_unlock(lock); local_irq_restore(flags); } while(0)
+#define fr_write_unlock_irq(lock) \
+ do { fr_write_unlock(lock); local_irq_enable(); } while(0)
+#define fr_write_unlock_bh(lock) \
+ do { fr_write_unlock(lock); local_bh_enable(); } while(0)
+
+#endif /* __LINUX_FRLOCK_H */
diff -puN include/linux/time.h~frlock-xtime include/linux/time.h
--- linux-hype/include/linux/time.h~frlock-xtime 2003-01-25 18:11:45.000000000 -0800
+++ linux-hype-akpm/include/linux/time.h 2003-01-25 18:11:45.000000000 -0800
@@ -25,6 +25,7 @@ struct timezone {
#ifdef __KERNEL__

#include <linux/spinlock.h>
+#include <linux/frlock.h>

/*
* Change timeval to jiffies, trying to avoid the
@@ -120,7 +121,7 @@ mktime (unsigned int year, unsigned int
}

extern struct timespec xtime;
-extern rwlock_t xtime_lock;
+extern frlock_t xtime_lock;

static inline unsigned long get_seconds(void)
{
diff -puN kernel/time.c~frlock-xtime kernel/time.c
--- linux-hype/kernel/time.c~frlock-xtime 2003-01-25 18:11:45.000000000 -0800
+++ linux-hype-akpm/kernel/time.c 2003-01-25 18:11:45.000000000 -0800
@@ -27,7 +27,6 @@
#include <linux/timex.h>
#include <linux/errno.h>
#include <linux/smp_lock.h>
-
#include <asm/uaccess.h>

/*
@@ -38,7 +37,7 @@ struct timezone sys_tz;

/* The xtime_lock is not only serializing the xtime read/writes but it's also
serializing all accesses to the global NTP variables now. */
-extern rwlock_t xtime_lock;
+extern frlock_t xtime_lock;
extern unsigned long last_time_offset;

#if !defined(__alpha__) && !defined(__ia64__)
@@ -80,7 +79,7 @@ asmlinkage long sys_stime(int * tptr)
return -EPERM;
if (get_user(value, tptr))
return -EFAULT;
- write_lock_irq(&xtime_lock);
+ fr_write_lock_irq(&xtime_lock);
xtime.tv_sec = value;
xtime.tv_nsec = 0;
last_time_offset = 0;
@@ -88,7 +87,7 @@ asmlinkage long sys_stime(int * tptr)
time_status |= STA_UNSYNC;
time_maxerror = NTP_PHASE_LIMIT;
time_esterror = NTP_PHASE_LIMIT;
- write_unlock_irq(&xtime_lock);
+ fr_write_unlock_irq(&xtime_lock);
return 0;
}

@@ -96,13 +95,13 @@ asmlinkage long sys_stime(int * tptr)

asmlinkage long sys_gettimeofday(struct timeval *tv, struct timezone *tz)
{
- if (tv) {
+ if (likely(tv != NULL)) {
struct timeval ktv;
do_gettimeofday(&ktv);
if (copy_to_user(tv, &ktv, sizeof(ktv)))
return -EFAULT;
}
- if (tz) {
+ if (unlikely(tz != NULL)) {
if (copy_to_user(tz, &sys_tz, sizeof(sys_tz)))
return -EFAULT;
}
@@ -127,10 +126,10 @@ asmlinkage long sys_gettimeofday(struct
*/
inline static void warp_clock(void)
{
- write_lock_irq(&xtime_lock);
+ fr_write_lock_irq(&xtime_lock);
xtime.tv_sec += sys_tz.tz_minuteswest * 60;
last_time_offset = 0;
- write_unlock_irq(&xtime_lock);
+ fr_write_unlock_irq(&xtime_lock);
}

/*
@@ -235,7 +234,7 @@ int do_adjtimex(struct timex *txc)
txc->tick > 1100000/USER_HZ)
return -EINVAL;

- write_lock_irq(&xtime_lock);
+ fr_write_lock_irq(&xtime_lock);
result = time_state; /* mostly `TIME_OK' */

/* Save for later - semantics of adjtime is to return old value */
@@ -386,7 +385,7 @@ leave: if ((time_status & (STA_UNSYNC|ST
txc->errcnt = pps_errcnt;
txc->stbcnt = pps_stbcnt;
last_time_offset = 0;
- write_unlock_irq(&xtime_lock);
+ fr_write_unlock_irq(&xtime_lock);
do_gettimeofday(&txc->time);
return(result);
}
@@ -409,9 +408,13 @@ asmlinkage long sys_adjtimex(struct time
struct timespec current_kernel_time(void)
{
struct timespec now;
- unsigned long flags;
- read_lock_irqsave(&xtime_lock,flags);
- now = xtime;
- read_unlock_irqrestore(&xtime_lock,flags);
+ unsigned long seq;
+
+ do {
+ seq = fr_read_begin(&xtime_lock);
+
+ now = xtime;
+ } while (seq != fr_read_end(&xtime_lock));
+
return now;
}
diff -puN kernel/timer.c~frlock-xtime kernel/timer.c
--- linux-hype/kernel/timer.c~frlock-xtime 2003-01-25 18:11:45.000000000 -0800
+++ linux-hype-akpm/kernel/timer.c 2003-01-25 18:11:45.000000000 -0800
@@ -758,7 +758,7 @@ unsigned long wall_jiffies;
* This read-write spinlock protects us from races in SMP while
* playing with xtime and avenrun.
*/
-rwlock_t xtime_lock __cacheline_aligned_in_smp = RW_LOCK_UNLOCKED;
+frlock_t xtime_lock __cacheline_aligned_in_smp = FR_LOCK_UNLOCKED;
unsigned long last_time_offset;

/*
@@ -798,8 +798,7 @@ static inline void update_times(void)
}

/*
- * The 64-bit jiffies value is not atomic - you MUST NOT read it
- * without holding read_lock_irq(&xtime_lock).
+ * The 64-bit jiffies value is not atomic
* jiffies is defined in the linker script...
*/

@@ -1087,18 +1086,21 @@ asmlinkage long sys_sysinfo(struct sysin
struct sysinfo val;
unsigned long mem_total, sav_total;
unsigned int mem_unit, bitcount;
+ unsigned long seq;

memset((char *)&val, 0, sizeof(struct sysinfo));

- read_lock_irq(&xtime_lock);
- val.uptime = jiffies / HZ;
+ do {
+ seq = fr_read_begin(&xtime_lock);

- val.loads[0] = avenrun[0] << (SI_LOAD_SHIFT - FSHIFT);
- val.loads[1] = avenrun[1] << (SI_LOAD_SHIFT - FSHIFT);
- val.loads[2] = avenrun[2] << (SI_LOAD_SHIFT - FSHIFT);
+ val.uptime = jiffies / HZ;

- val.procs = nr_threads;
- read_unlock_irq(&xtime_lock);
+ val.loads[0] = avenrun[0] << (SI_LOAD_SHIFT - FSHIFT);
+ val.loads[1] = avenrun[1] << (SI_LOAD_SHIFT - FSHIFT);
+ val.loads[2] = avenrun[2] << (SI_LOAD_SHIFT - FSHIFT);
+
+ val.procs = nr_threads;
+ } while (seq != fr_read_end(&xtime_lock));

si_meminfo(&val);
si_swapinfo(&val);

_

2003-01-27 22:50:18

by Stephen Hemminger

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

On Sat, 2003-01-25 at 21:10, Andrew Morton wrote:
> William Lee Irwin III <[email protected]> wrote:
> >
> > William Lee Irwin III <[email protected]> wrote:
> > >> Ticket locks need atomic fetch and increment. These don't look right.
> >

Atomic fetch/increment is not necessary since it is assumed that
only a single writer is doing the increment at a time, either with a
lock or a semaphore. The fr_write_lock primitive incorporates the
spinlock and the sequence number.

> > On Sat, Jan 25, 2003 at 07:46:48PM -0800, Andrew Morton wrote:
> > > Well look at the reader side:
> > > loff_t i_size_read(struct inode *inode)
> > > {
> > > unsigned seq;
> > > loff_t ret;
> > >
> > > do {
> > > seq = fr_write_begin(&inode->i_frlock);
> > > ret = inode->i_size;
> > > } while (seq != fr_write_end(&inode->i_frlock);
> > > return ret;
> > > }
>
> argh. That should have been:
>
> > > seq = fr_read_begin(&inode->i_frlock);
> > > ret = inode->i_size;
> > > } while (seq != fr_read_end(&inode->i_frlock);
> > > return ret;
> > > }
>
> of course.
>
> > This doesn't look particularly reassuring either. We have:
> >
> > (1) increment ->pre_sequence
> > (2) wmb()
> > (3) get inode->i_size
> > (4) wmb()
> > (5) increment ->post_sequence
> > (6) wmb()
> >
> > Supposing the overall scheme is sound, one of the wmb()'s is unnecessary;

Each wmb() has a purpose. (2) is to make sure the first increment
happens before the update. (4) makes sure the update happens before the
second increment.

The last wmb is unnecessary. Also on many architectures, the wmb()
disappears since writes are never reordered.

> Could be.
>
> > I'd have to go through some kind of state transition fiasco to be sure
> > this actually recovers from the races where two readers fetch the same
> > value of ->pre_sequence or ->post_sequence and store the same
> > incremented value to convince myself this is right.
>
> readers do not modify the lock - they simply observe.

correct

> The fr_write_begin/fr_write_end pair assumes that there is only a single
> writer possible. In the case of i_size, that exclusion is provided by i_sem.
> i_size is always modified under i_sem.
>
> > I'll assume you've
> > either done so yourself or are relying on someone else's verification.
>
> More the latter ;)
>
> > Restarting the read like this is highly unusual; if retrying the
> > critical section is in fact the basis of this locking algorithm then
> > it's not a true ticket lock.
>
> Retrying the read is the basis of the locking algorithm.
>
> The frlock stuff needs more work for non-SMP bloat avoidance, but it's simple
> and seems sensible.

An updated version is coming out to day. It still needs to have sequence numbers
and barriers even on non-SMP to deal with interrupts.

2003-01-27 23:51:33

by William Lee Irwin III

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

William Lee Irwin III <[email protected]> wrote:
>>>>> Ticket locks need atomic fetch and increment. These don't look right.

On Mon, Jan 27, 2003 at 02:59:21PM -0800, Stephen Hemminger wrote:
> Atomic fetch/increment is not necessary since it is assumed that
> only a single writer is doing the increment at a time, either with a
> lock or a semaphore. The fr_write_lock primitive incorporates the
> spinlock and the sequence number.

Ticket locks still need atomic fetch and increment. You don't because
not only are you not implementing a ticket lock, you've got an outright
spinlock around the fetch and increment.

William Lee Irwin III <[email protected]> wrote:
>>> (1) increment ->pre_sequence
>>> (2) wmb()
>>> (3) get inode->i_size
>>> (4) wmb()
>>> (5) increment ->post_sequence
>>> (6) wmb()
>>> Supposing the overall scheme is sound, one of the wmb()'s is unnecessary;

On Mon, Jan 27, 2003 at 02:59:21PM -0800, Stephen Hemminger wrote:
> Each wmb() has a purpose. (2) is to make sure the first increment
> happens before the update. (4) makes sure the update happens before the
> second increment.
> The last wmb is unnecessary. Also on many architectures, the wmb()
> disappears since writes are never reordered.

This is apparently based on some misunderstanding wrt. thinking the
sequence of events above described a read. Obviously converting (3)
to "modify inode->i_size" makes the (4) wmb() necessary.

-- wli

2003-01-28 13:41:48

by Christoph Hellwig

[permalink] [raw]

Subject: Re: ext2 FS corruption with 2.5.59.

On Fri, Jan 24, 2003 at 02:32:13AM -0800, Andrew Morton wrote:
> But it isn't. It's on inode_in_use. It (and its pages) never get written
> out and the data gets thrown away on unmount. Christoph has hit the same
> thing. I bet he was running fsstress too. Sorry about that.

Yes, I was hitting the bug in XFS's testcase 013 which is based on
fsstress.