Currently preallocated blocks in ext2 are discarded on every call
to iput() (by ext2_put_inode() calling ext2_discard_prealloc()).
An earlier attempt to fix this ("discard ext2 preallocation in last
iput") moved the ext2_discard_prealloc() call to ext2_clear_inode(),
but was found to cause filesystem corruption in a test using fsx.
The problem was that ext2_clear_inode() was writing the inode data
to disk before calling ext2_discard_prealloc(), so the value of
i_blocks on disk included the preallocated blocks.
This patch moves the call to ext2_discard_prealloc() to the new
function ext2_drop_inode(). This should be both efficient (discard
happens on only the last call to iput()) and correct (fixes i_blocks
before writing to disk). Also, as there is now possibly a longer
window during which an open file may have an incorrrect block count
in its on-disk inode, ext2_update_inode adjusts the block count to
account for preallocated blocks.
No corruption has been detected using the fsx test.
Signed-off-by: Ron Yorston <[email protected]>
---
--- linux-2.6.17/fs/ext2/super.c.prealloc 2006-06-18 02:49:35.000000000 +0100
+++ linux-2.6.17/fs/ext2/super.c 2006-08-17 20:16:34.000000000 +0100
@@ -238,7 +238,7 @@ static struct super_operations ext2_sops
.destroy_inode = ext2_destroy_inode,
.read_inode = ext2_read_inode,
.write_inode = ext2_write_inode,
- .put_inode = ext2_put_inode,
+ .drop_inode = ext2_drop_inode,
.delete_inode = ext2_delete_inode,
.put_super = ext2_put_super,
.write_super = ext2_write_super,
--- linux-2.6.17/fs/ext2/inode.c.prealloc 2006-06-18 02:49:35.000000000 +0100
+++ linux-2.6.17/fs/ext2/inode.c 2006-08-17 20:16:34.000000000 +0100
@@ -54,16 +54,18 @@ static inline int ext2_inode_is_fast_sym
}
/*
- * Called at each iput().
+ * Called from iput_final().
*
* The inode may be "bad" if ext2_read_inode() saw an error from
* ext2_get_inode(), so we need to check that to avoid freeing random disk
* blocks.
*/
-void ext2_put_inode(struct inode *inode)
+void ext2_drop_inode(struct inode *inode)
{
if (!is_bad_inode(inode))
ext2_discard_prealloc(inode);
+
+ generic_drop_inode(inode);
}
/*
@@ -1176,6 +1178,7 @@ static int ext2_update_inode(struct inod
ino_t ino = inode->i_ino;
uid_t uid = inode->i_uid;
gid_t gid = inode->i_gid;
+ blkcnt_t blocks = inode->i_blocks;
struct buffer_head * bh;
struct ext2_inode * raw_inode = ext2_get_inode(sb, ino, &bh);
int n;
@@ -1216,7 +1219,8 @@ static int ext2_update_inode(struct inod
raw_inode->i_ctime = cpu_to_le32(inode->i_ctime.tv_sec);
raw_inode->i_mtime = cpu_to_le32(inode->i_mtime.tv_sec);
- raw_inode->i_blocks = cpu_to_le32(inode->i_blocks);
+ blocks -= ei->i_prealloc_count * (inode->i_sb->s_blocksize >> 9);
+ raw_inode->i_blocks = cpu_to_le32(blocks);
raw_inode->i_dtime = cpu_to_le32(ei->i_dtime);
raw_inode->i_flags = cpu_to_le32(ei->i_flags);
raw_inode->i_faddr = cpu_to_le32(ei->i_faddr);
--- linux-2.6.17/fs/ext2/ext2.h.prealloc 2006-06-18 02:49:35.000000000 +0100
+++ linux-2.6.17/fs/ext2/ext2.h 2006-08-17 20:16:34.000000000 +0100
@@ -125,7 +125,7 @@ extern unsigned long ext2_count_free (st
/* inode.c */
extern void ext2_read_inode (struct inode *);
extern int ext2_write_inode (struct inode *, int);
-extern void ext2_put_inode (struct inode *);
+extern void ext2_drop_inode (struct inode *);
extern void ext2_delete_inode (struct inode *);
extern int ext2_sync_inode (struct inode *);
extern void ext2_discard_prealloc (struct inode *);
On Thu, 17 Aug 2006 20:45:36 +0100
Ron Yorston <[email protected]> wrote:
> Currently preallocated blocks in ext2 are discarded on every call
> to iput() (by ext2_put_inode() calling ext2_discard_prealloc()).
>
> An earlier attempt to fix this ("discard ext2 preallocation in last
> iput") moved the ext2_discard_prealloc() call to ext2_clear_inode(),
> but was found to cause filesystem corruption in a test using fsx.
> The problem was that ext2_clear_inode() was writing the inode data
> to disk before calling ext2_discard_prealloc(), so the value of
> i_blocks on disk included the preallocated blocks.
>
> This patch moves the call to ext2_discard_prealloc() to the new
> function ext2_drop_inode(). This should be both efficient (discard
> happens on only the last call to iput()) and correct (fixes i_blocks
> before writing to disk). Also, as there is now possibly a longer
> window during which an open file may have an incorrrect block count
> in its on-disk inode, ext2_update_inode adjusts the block count to
> account for preallocated blocks.
Been there, done that. The problem was that hanging onto the preallocation
will cause separate files to have up-to-seven-block gaps between them. So
if you put a large number of small files in the same directory, the time to
read them all back is quite significantly impacted: they cover a lot more
disk.
Andrew Morton <[email protected]> wrote:
>Been there, done that. The problem was that hanging onto the preallocation
>will cause separate files to have up-to-seven-block gaps between them. So
>if you put a large number of small files in the same directory, the time to
>read them all back is quite significantly impacted: they cover a lot more
>disk.
The preallocation is only held while the file is open, so there will only
be gaps between files that are open simultaneously. If they're created
sequentially there will be no gap.
This issue exists even with the current code.
The patch will have a small effect. With the current code an open file
will lose its preallocation when some other process touches the inode.
In that case a subsequently created file will follow without a gap. As
soon as the open file is written to, though, it gets a new preallocation.
On Sun, 2006-08-20 at 12:48 +0100, Ron Yorston wrote:
> Andrew Morton <[email protected]> wrote:
> >Been there, done that. The problem was that hanging onto the preallocation
> >will cause separate files to have up-to-seven-block gaps between them. So
> >if you put a large number of small files in the same directory, the time to
> >read them all back is quite significantly impacted: they cover a lot more
> >disk.
>
> The preallocation is only held while the file is open, so there will only
> be gaps between files that are open simultaneously. If they're created
> sequentially there will be no gap.
>
> This issue exists even with the current code.
>
> The patch will have a small effect. With the current code an open file
> will lose its preallocation when some other process touches the inode.
> In that case a subsequently created file will follow without a gap. As
> soon as the open file is written to, though, it gets a new preallocation.
> -
maybe porting the reservation code to ext2 (as Val has done) is a nicer
long term solution..
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
On 8/20/06, Arjan van de Ven <[email protected]> wrote:
> On Sun, 2006-08-20 at 12:48 +0100, Ron Yorston wrote:
> > Andrew Morton <[email protected]> wrote:
> > >Been there, done that. The problem was that hanging onto the preallocation
> > >will cause separate files to have up-to-seven-block gaps between them. So
> > >if you put a large number of small files in the same directory, the time to
> > >read them all back is quite significantly impacted: they cover a lot more
> > >disk.
> >
> > The preallocation is only held while the file is open, so there will only
> > be gaps between files that are open simultaneously. If they're created
> > sequentially there will be no gap.
> >
> > This issue exists even with the current code.
> >
> > The patch will have a small effect. With the current code an open file
> > will lose its preallocation when some other process touches the inode.
> > In that case a subsequently created file will follow without a gap. As
> > soon as the open file is written to, though, it gets a new preallocation.
> > -
>
> maybe porting the reservation code to ext2 (as Val has done) is a nicer
> long term solution..
The even nicer solution long term solution is to abstract out the
reservation code as much as possible and share it. But if you want
my (hasty and unlovely) port, you can grab it out of this patch:
http://infohost.nmt.edu/~val/patches/fswide_latest_patch
I'm not sure what benchmark to use to test the performance; our
initial benchmarks (run by Zach Brown) didn't show an improvement (see
the OLS paper).
-VAL
On Sun, Aug 20, 2006 at 08:33:29AM -0700, Val Henson wrote:
> On 8/20/06, Arjan van de Ven <[email protected]> wrote:
> >maybe porting the reservation code to ext2 (as Val has done) is a nicer
> >long term solution..
>
> The even nicer solution long term solution is to abstract out the
> reservation code as much as possible and share it. But if you want
> my (hasty and unlovely) port, you can grab it out of this patch:
>
> http://infohost.nmt.edu/~val/patches/fswide_latest_patch
As it turns out, I already split it out:
http://infohost.nmt.edu/~val/patches/resv_only_patch
I added this to the Easy File System Projects wiki:
http://linuxfs.pbwiki.com/EasyFsProjects
-VAL