Fixes a race between unlink and writeback: on the sys_sync() and
pdflush paths the caller does not have a reference against the inode.
So run __iget prior to dropping inode_lock.
Oleg Drokin reported this and seems to believe that it fixes the
crashes he was observing. But I was never able to reproduce them..
=====================================
--- 2.5.19/fs/fs-writeback.c~sync-race Sat Jun 1 01:18:12 2002
+++ 2.5.19-akpm/fs/fs-writeback.c Sat Jun 1 01:18:12 2002
@@ -245,17 +245,19 @@ static void sync_sb_inodes(struct super_
if ((sync_mode == WB_SYNC_LAST) && (head->prev == head))
really_sync = 1;
+ BUG_ON(inode->i_state & I_FREEING);
+ __iget(inode);
__writeback_single_inode(inode, really_sync, nr_to_write);
-
if (sync_mode == WB_SYNC_HOLD) {
mapping->dirtied_when = jiffies;
list_del(&inode->i_list);
list_add(&inode->i_list, &inode->i_sb->s_dirty);
}
-
if (current_is_pdflush())
writeback_release(bdi);
-
+ spin_unlock(&inode_lock);
+ iput(inode);
+ spin_lock(&inode_lock);
if (nr_to_write && *nr_to_write <= 0)
break;
}
-
On Sat, 1 Jun 2002, Andrew Morton wrote:
>
> So run __iget prior to dropping inode_lock.
This part looks horrible:
+ spin_unlock(&inode_lock);
+ iput(inode);
+ spin_lock(&inode_lock);
Why not just split up the code inside iput(), and then just do
if (atomic_dec(&inode->i_count))
final_iput(inode);
where final_iput() _wants_ the spinlock held already?
That's basically what "iput()" will end up doing, except for that
"put_inode()" thing, which is just a horrible hack anyway.
So get rid of "put_inode()", and replace it with a new one that takes the
place of the
if (!inode->i_nlink) {
... delete ..
} else {
.. free ..
}
and makes that one be a "i_op->drop_inode" thing that defaults to the
current "delete if i_nlink is zero, free it if i_nlink is not zero and
nobody uses it".
The general VFS layer really shouldn't have assigned that strogn a meaning
to "i_nlink" anyway, it's not for the VFS layer to decide (and it only
causes problems for any non-UNIX-on-a-disk filesystems).
Linus
Linus Torvalds wrote:
>
> On Sat, 1 Jun 2002, Andrew Morton wrote:
> >
> > So run __iget prior to dropping inode_lock.
>
> This part looks horrible:
>
> + spin_unlock(&inode_lock);
> + iput(inode);
> + spin_lock(&inode_lock);
Yup. The inode refcounting APIs are really awkward. Note how I recently
added dopey code in ext2_put_inode() to only drop the prealloc window on
the "final" iput().
> Why not just split up the code inside iput(), and then just do
>
> if (atomic_dec(&inode->i_count))
> final_iput(inode);
>
> where final_iput() _wants_ the spinlock held already?
>
> That's basically what "iput()" will end up doing, except for that
> "put_inode()" thing, which is just a horrible hack anyway.
>
> So get rid of "put_inode()", and replace it with a new one that takes the
> place of the
>
> if (!inode->i_nlink) {
> ... delete ..
> } else {
> .. free ..
> }
>
> and makes that one be a "i_op->drop_inode" thing that defaults to the
> current "delete if i_nlink is zero, free it if i_nlink is not zero and
> nobody uses it".
>
> The general VFS layer really shouldn't have assigned that strogn a meaning
> to "i_nlink" anyway, it's not for the VFS layer to decide (and it only
> causes problems for any non-UNIX-on-a-disk filesystems).
>
Yes, I suspect all the inode refcounting, locking, I_FREEING, I_LOCK, etc
could do with a spring clean. Make it a bit more conventional. I'll
discuss with Al when he resurfaces.
-
Linus Torvalds wrote:
>> The general VFS layer really shouldn't have assigned that strogn a meaning
>> to "i_nlink" anyway, it's not for the VFS layer to decide (and it only
>> causes problems for any non-UNIX-on-a-disk filesystems).
On Sat, Jun 01, 2002 at 12:19:36PM -0700, Andrew Morton wrote:
> Yes, I suspect all the inode refcounting, locking, I_FREEING, I_LOCK, etc
> could do with a spring clean. Make it a bit more conventional. I'll
> discuss with Al when he resurfaces.
I'm somewhat concerned about the protection of ->i_size, since that
appears to be accessed in generic_file_read() without any protection
against writers to the field. From a quick glance at current 2.5 (it
looks like 2.4 has this too) it looks like it's written to by
vmtruncate() through notify_change() with the ->i_sem and BKL held at
the moment, but generic_file_read() doesn't take either before reading
it, and there may be still other writers. I also don't see the anything
like read_barrier_depends() for lockless algorithms or any atomic reads.
Even on machines with extremely strong memory consistency models like
i386, as loff_t is long long, it would seem possible to catch a partial
update and see an entirely bogus ->i_size value. It also appears
->i_size is used to provide some protection against reads of truncated
pages, which may be unreliable without some protection of ->i_size.
If these issues are not what I believe them to be I would be more than
happy to have these impressions corrected.
Cheers,
Bill
William Lee Irwin III wrote:
>
> Linus Torvalds wrote:
> >> The general VFS layer really shouldn't have assigned that strogn a meaning
> >> to "i_nlink" anyway, it's not for the VFS layer to decide (and it only
> >> causes problems for any non-UNIX-on-a-disk filesystems).
>
> On Sat, Jun 01, 2002 at 12:19:36PM -0700, Andrew Morton wrote:
> > Yes, I suspect all the inode refcounting, locking, I_FREEING, I_LOCK, etc
> > could do with a spring clean. Make it a bit more conventional. I'll
> > discuss with Al when he resurfaces.
>
> I'm somewhat concerned about the protection of ->i_size, since that
> appears to be accessed in generic_file_read() without any protection
> against writers to the field. From a quick glance at current 2.5 (it
> looks like 2.4 has this too) it looks like it's written to by
> vmtruncate() through notify_change() with the ->i_sem and BKL held at
> the moment, but generic_file_read() doesn't take either before reading
> it, and there may be still other writers.
truncate and write change i_size, under i_sem. The i_size test on
the read path doesn't really need to be there, I suspect. It handles
the window where i_size has been decreased by truncate but the filesystem
hasn't finished truncating the blocks yet. It also optimises reads
outside the end of file - no point in calling into the filesystem
to try to map blocks which aren't there.
> I also don't see the anything
> like read_barrier_depends() for lockless algorithms or any atomic reads.
> Even on machines with extremely strong memory consistency models like
> i386, as loff_t is long long, it would seem possible to catch a partial
> update and see an entirely bogus ->i_size value.
That's true. sys_stat() also could see a confusing intermediate value.
A while back Ingo and Linus were tossing around possible solutions to
this based on x86 compare-and-exchange operations, but nothing conclusive
came out of it. It's a "known bug".
-
On Sat, 1 Jun 2002, Andrew Morton wrote:
>
> > Why not just split up the code inside iput(), and then just do
> >
> > if (atomic_dec(&inode->i_count))
> > final_iput(inode);
>
> Yes, I suspect all the inode refcounting, locking, I_FREEING, I_LOCK, etc
> could do with a spring clean. Make it a bit more conventional. I'll
> discuss with Al when he resurfaces.
This is a first cut at cleaning up "iput()" and getting rid of some of the
magic VFS-level behaviour of the i_nlink field which many filesystems do
not actually want - as shown by the number of "force_delete" users out
there.
It does not change any real behaviour, but it splits up the "iput()"
behaviour into several functions ("common_delete_inode()",
"common_forget_inode()" and "common_drop_inode()"), and adds a place for a
low-level filesystem to hook into the behaviour at inode drop time,
through the "drop_inode" superblock operation.
This part of the diff probably explains it best
+ drop_inode: called when the last access to the inode is dropped,
+ with the inode_lock spinlock held.
+
+ This method should be either NULL (normal unix filesystem
+ semantics) or "common_delete_inode" (for filesystems that do not
+ want to cache inodes - causing "delete_inode" to always be
+ called regardless of the value of i_nlink)
+
+ The "common_delete_inode()" behaviour is equivalent to the
+ old practice of using "force_delete" in the put_inode() case,
+ but does not have the races that the "force_delete()" approach
+ had.
This removes _most_ of the "put_inode()" users, although filesystems can
(and do) still use it for other things - notably to drop pre-allocations
etc.
Comments?
Linus
----
===== Documentation/filesystems/Locking 1.29 vs edited =====
--- 1.29/Documentation/filesystems/Locking Fri May 31 18:18:14 2002
+++ edited/Documentation/filesystems/Locking Sun Jun 2 20:58:05 2002
@@ -88,6 +88,7 @@
void (*read_inode) (struct inode *);
void (*write_inode) (struct inode *, int);
void (*put_inode) (struct inode *);
+ void (*drop_inode) (struct inode *);
void (*delete_inode) (struct inode *);
void (*put_super) (struct super_block *);
void (*write_super) (struct super_block *);
@@ -102,6 +103,7 @@
read_inode: yes (see below)
write_inode: no
put_inode: no
+drop_inode: no !!!inode_lock!!!
delete_inode: no
clear_inode: no
put_super: yes yes maybe (see below)
===== Documentation/filesystems/vfs.txt 1.1 vs edited =====
--- 1.1/Documentation/filesystems/vfs.txt Tue Feb 5 09:40:36 2002
+++ edited/Documentation/filesystems/vfs.txt Sun Jun 2 21:17:23 2002
@@ -178,6 +178,7 @@
void (*read_inode) (struct inode *);
void (*write_inode) (struct inode *, int);
void (*put_inode) (struct inode *);
+ void (*drop_inode) (struct inode *);
void (*delete_inode) (struct inode *);
int (*notify_change) (struct dentry *, struct iattr *);
void (*put_super) (struct super_block *);
@@ -203,6 +204,19 @@
put_inode: called when the VFS inode is removed from the inode
cache. This method is optional
+
+ drop_inode: called when the last access to the inode is dropped,
+ with the inode_lock spinlock held.
+
+ This method should be either NULL (normal unix filesystem
+ semantics) or "common_delete_inode" (for filesystems that do not
+ want to cache inodes - causing "delete_inode" to always be
+ called regardless of the value of i_nlink)
+
+ The "common_delete_inode()" behaviour is equivalent to the
+ old practice of using "force_delete" in the put_inode() case,
+ but does not have the races that the "force_delete()" approach
+ had.
delete_inode: called when the VFS wants to delete an inode
===== drivers/hotplug/pci_hotplug_core.c 1.14 vs edited =====
--- 1.14/drivers/hotplug/pci_hotplug_core.c Thu May 9 13:44:41 2002
+++ edited/drivers/hotplug/pci_hotplug_core.c Sun Jun 2 20:12:36 2002
@@ -290,7 +290,7 @@
static struct super_operations pcihpfs_ops = {
statfs: simple_statfs,
- put_inode: force_delete,
+ drop_inode: common_delete_inode,
};
static int pcihpfs_fill_super(struct super_block *sb, void *data, int silent)
===== drivers/usb/core/inode.c 1.31 vs edited =====
--- 1.31/drivers/usb/core/inode.c Wed May 22 09:25:48 2002
+++ edited/drivers/usb/core/inode.c Sun Jun 2 20:13:58 2002
@@ -298,7 +298,7 @@
static struct super_operations usbfs_ops = {
statfs: simple_statfs,
- put_inode: force_delete,
+ drop_inode: common_delete_inode,
};
static int usbfs_fill_super(struct super_block *sb, void *data, int silent)
===== fs/binfmt_misc.c 1.13 vs edited =====
--- 1.13/fs/binfmt_misc.c Thu May 23 09:08:34 2002
+++ edited/fs/binfmt_misc.c Sun Jun 2 19:48:46 2002
@@ -621,7 +621,7 @@
static struct super_operations s_ops = {
statfs: simple_statfs,
- put_inode: force_delete,
+ drop_inode: common_delete_inode,
clear_inode: bm_clear_inode,
};
===== fs/inode.c 1.60 vs edited =====
--- 1.60/fs/inode.c Fri May 31 18:18:09 2002
+++ edited/fs/inode.c Sun Jun 2 20:11:02 2002
@@ -782,6 +782,97 @@
spin_unlock(&inode_lock);
}
+void common_delete_inode(struct inode *inode)
+{
+ struct super_operations *op = inode->i_sb->s_op;
+
+ list_del(&inode->i_hash);
+ INIT_LIST_HEAD(&inode->i_hash);
+ list_del(&inode->i_list);
+ INIT_LIST_HEAD(&inode->i_list);
+ inode->i_state|=I_FREEING;
+ inodes_stat.nr_inodes--;
+ spin_unlock(&inode_lock);
+
+ if (inode->i_data.nrpages)
+ truncate_inode_pages(&inode->i_data, 0);
+
+ if (op && op->delete_inode) {
+ void (*delete)(struct inode *) = op->delete_inode;
+ if (!is_bad_inode(inode))
+ DQUOT_INIT(inode);
+ /* s_op->delete_inode internally recalls clear_inode() */
+ delete(inode);
+ } else
+ clear_inode(inode);
+ if (inode->i_state != I_CLEAR)
+ BUG();
+ destroy_inode(inode);
+}
+EXPORT_SYMBOL(common_delete_inode);
+
+static void common_forget_inode(struct inode *inode)
+{
+ struct super_block *sb = inode->i_sb;
+
+ if (!list_empty(&inode->i_hash)) {
+ if (!(inode->i_state & (I_DIRTY|I_LOCK))) {
+ list_del(&inode->i_list);
+ list_add(&inode->i_list, &inode_unused);
+ }
+ inodes_stat.nr_unused++;
+ spin_unlock(&inode_lock);
+ if (!sb || (sb->s_flags & MS_ACTIVE))
+ return;
+ write_inode_now(inode, 1);
+ spin_lock(&inode_lock);
+ inodes_stat.nr_unused--;
+ list_del_init(&inode->i_hash);
+ }
+ list_del_init(&inode->i_list);
+ inode->i_state|=I_FREEING;
+ inodes_stat.nr_inodes--;
+ spin_unlock(&inode_lock);
+ if (inode->i_data.nrpages)
+ truncate_inode_pages(&inode->i_data, 0);
+ clear_inode(inode);
+ destroy_inode(inode);
+}
+
+/*
+ * Normal UNIX filesystem behaviour: delete the
+ * inode when the usage count drops to zero, and
+ * i_nlink is zero.
+ */
+static void common_drop_inode(struct inode *inode)
+{
+ if (!inode->i_nlink)
+ common_delete_inode(inode);
+ else
+ common_forget_inode(inode);
+}
+
+/*
+ * Called when we're dropping the last reference
+ * to an inode.
+ *
+ * Call the FS "drop()" function, defaulting to
+ * the legacy UNIX filesystem behaviour..
+ *
+ * NOTE! NOTE! NOTE! We're called with the inode lock
+ * held, and the drop function is supposed to release
+ * the lock!
+ */
+static inline void iput_final(struct inode *inode)
+{
+ struct super_operations *op = inode->i_sb->s_op;
+ void (*drop)(struct inode *) = common_drop_inode;
+
+ if (op && op->drop_inode)
+ drop = op->drop_inode;
+ drop(inode);
+}
+
/**
* iput - put an inode
* @inode: inode to put
@@ -793,77 +884,17 @@
void iput(struct inode *inode)
{
if (inode) {
- struct super_block *sb = inode->i_sb;
- struct super_operations *op = NULL;
+ struct super_operations *op = inode->i_sb->s_op;
if (inode->i_state == I_CLEAR)
BUG();
- if (sb && sb->s_op)
- op = sb->s_op;
if (op && op->put_inode)
op->put_inode(inode);
- if (!atomic_dec_and_lock(&inode->i_count, &inode_lock))
- return;
-
- if (!inode->i_nlink) {
- list_del(&inode->i_hash);
- INIT_LIST_HEAD(&inode->i_hash);
- list_del(&inode->i_list);
- INIT_LIST_HEAD(&inode->i_list);
- inode->i_state|=I_FREEING;
- inodes_stat.nr_inodes--;
- spin_unlock(&inode_lock);
-
- if (inode->i_data.nrpages)
- truncate_inode_pages(&inode->i_data, 0);
-
- if (op && op->delete_inode) {
- void (*delete)(struct inode *) = op->delete_inode;
- if (!is_bad_inode(inode))
- DQUOT_INIT(inode);
- /* s_op->delete_inode internally recalls clear_inode() */
- delete(inode);
- } else
- clear_inode(inode);
- if (inode->i_state != I_CLEAR)
- BUG();
- } else {
- if (!list_empty(&inode->i_hash)) {
- if (!(inode->i_state & (I_DIRTY|I_LOCK))) {
- list_del(&inode->i_list);
- list_add(&inode->i_list, &inode_unused);
- }
- inodes_stat.nr_unused++;
- spin_unlock(&inode_lock);
- if (!sb || (sb->s_flags & MS_ACTIVE))
- return;
- write_inode_now(inode, 1);
- spin_lock(&inode_lock);
- inodes_stat.nr_unused--;
- list_del_init(&inode->i_hash);
- }
- list_del_init(&inode->i_list);
- inode->i_state|=I_FREEING;
- inodes_stat.nr_inodes--;
- spin_unlock(&inode_lock);
- if (inode->i_data.nrpages)
- truncate_inode_pages(&inode->i_data, 0);
- clear_inode(inode);
- }
- destroy_inode(inode);
+ if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
+ iput_final(inode);
}
-}
-
-void force_delete(struct inode *inode)
-{
- /*
- * Kill off unused inodes ... iput() will unhash and
- * delete the inode if we set i_nlink to zero.
- */
- if (atomic_read(&inode->i_count) == 1)
- inode->i_nlink = 0;
}
/**
===== fs/devfs/base.c 1.39 vs edited =====
--- 1.39/fs/devfs/base.c Tue May 14 09:27:48 2002
+++ edited/fs/devfs/base.c Sun Jun 2 20:15:00 2002
@@ -2576,7 +2576,7 @@
static struct super_operations devfs_sops =
{
- put_inode: force_delete,
+ drop_inode: common_delete_inode,
clear_inode: devfs_clear_inode,
statfs: simple_statfs,
};
===== fs/driverfs/inode.c 1.22 vs edited =====
--- 1.22/fs/driverfs/inode.c Fri May 31 18:18:10 2002
+++ edited/fs/driverfs/inode.c Sun Jun 2 19:49:16 2002
@@ -442,7 +442,7 @@
static struct super_operations driverfs_ops = {
statfs: simple_statfs,
- put_inode: force_delete,
+ drop_inode: common_delete_inode,
};
static int driverfs_fill_super(struct super_block *sb, void *data, int silent)
===== fs/ncpfs/inode.c 1.19 vs edited =====
--- 1.19/fs/ncpfs/inode.c Thu May 23 09:08:34 2002
+++ edited/fs/ncpfs/inode.c Sun Jun 2 20:14:34 2002
@@ -84,7 +84,7 @@
{
alloc_inode: ncp_alloc_inode,
destroy_inode: ncp_destroy_inode,
- put_inode: force_delete,
+ drop_inode: common_delete_inode,
delete_inode: ncp_delete_inode,
put_super: ncp_put_super,
statfs: ncp_statfs,
===== fs/proc/inode.c 1.13 vs edited =====
--- 1.13/fs/proc/inode.c Mon May 20 06:38:28 2002
+++ edited/fs/proc/inode.c Sun Jun 2 19:59:25 2002
@@ -121,7 +121,7 @@
alloc_inode: proc_alloc_inode,
destroy_inode: proc_destroy_inode,
read_inode: proc_read_inode,
- put_inode: force_delete,
+ drop_inode: common_delete_inode,
delete_inode: proc_delete_inode,
statfs: simple_statfs,
};
===== fs/ramfs/inode.c 1.19 vs edited =====
--- 1.19/fs/ramfs/inode.c Fri May 31 18:18:10 2002
+++ edited/fs/ramfs/inode.c Sun Jun 2 19:59:56 2002
@@ -277,7 +277,7 @@
static struct super_operations ramfs_ops = {
statfs: simple_statfs,
- put_inode: force_delete,
+ drop_inode: common_delete_inode,
};
static int ramfs_fill_super(struct super_block * sb, void * data, int silent)
===== fs/smbfs/inode.c 1.23 vs edited =====
--- 1.23/fs/smbfs/inode.c Thu May 23 09:08:34 2002
+++ edited/fs/smbfs/inode.c Sun Jun 2 20:14:19 2002
@@ -94,7 +94,7 @@
{
alloc_inode: smb_alloc_inode,
destroy_inode: smb_destroy_inode,
- put_inode: force_delete,
+ drop_inode: common_delete_inode,
delete_inode: smb_delete_inode,
put_super: smb_put_super,
statfs: smb_statfs,
===== include/linux/fs.h 1.131 vs edited =====
--- 1.131/include/linux/fs.h Fri May 31 18:18:14 2002
+++ edited/include/linux/fs.h Sun Jun 2 20:49:29 2002
@@ -800,6 +800,7 @@
void (*dirty_inode) (struct inode *);
void (*write_inode) (struct inode *, int);
void (*put_inode) (struct inode *);
+ void (*drop_inode) (struct inode *);
void (*delete_inode) (struct inode *);
void (*put_super) (struct super_block *);
void (*write_super) (struct super_block *);
@@ -1183,10 +1184,10 @@
extern void inode_init_once(struct inode *);
extern void iput(struct inode *);
-extern void force_delete(struct inode *);
extern struct inode * igrab(struct inode *);
extern ino_t iunique(struct super_block *, ino_t);
extern int inode_needs_sync(struct inode *inode);
+extern void common_delete_inode(struct inode *inode);
extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
extern struct inode * iget_locked(struct super_block *, unsigned long);
===== kernel/ksyms.c 1.95 vs edited =====
--- 1.95/kernel/ksyms.c Fri May 31 18:18:10 2002
+++ edited/kernel/ksyms.c Sun Jun 2 19:39:17 2002
@@ -140,7 +140,6 @@
EXPORT_SYMBOL(iunique);
EXPORT_SYMBOL(iput);
EXPORT_SYMBOL(inode_init_once);
-EXPORT_SYMBOL(force_delete);
EXPORT_SYMBOL(follow_up);
EXPORT_SYMBOL(follow_down);
EXPORT_SYMBOL(lookup_mnt);
===== mm/shmem.c 1.52 vs edited =====
--- 1.52/mm/shmem.c Fri May 31 18:18:13 2002
+++ edited/mm/shmem.c Sun Jun 2 19:46:02 2002
@@ -1483,7 +1483,7 @@
remount_fs: shmem_remount_fs,
#endif
delete_inode: shmem_delete_inode,
- put_inode: force_delete,
+ drop_inode: common_delete_inode,
put_super: shmem_put_super,
};
On Jun 02, 2002 21:27 -0700, Linus Torvalds wrote:
> This is a first cut at cleaning up "iput()" and getting rid of some of the
> magic VFS-level behaviour of the i_nlink field which many filesystems do
> not actually want - as shown by the number of "force_delete" users out
> there.
>
> It does not change any real behaviour, but it splits up the "iput()"
> behaviour into several functions ("common_delete_inode()",
> "common_forget_inode()" and "common_drop_inode()"), and adds a place for a
> low-level filesystem to hook into the behaviour at inode drop time,
> through the "drop_inode" superblock operation.
If I had one minor note it would be to rename "common_*()" to "generic_*()"
to match the other VFS helper routines.
Cheers, Andreas
--
Andreas Dilger
http://www-mddsp.enel.ucalgary.ca/People/adilger/
http://sourceforge.net/projects/ext2resize/
On Mon, 3 Jun 2002, Andreas Dilger wrote:
>
> If I had one minor note it would be to rename "common_*()" to "generic_*()"
> to match the other VFS helper routines.
Good point. Will do.
Linus
On Mon, 2002-06-03 at 00:27, Linus Torvalds wrote:
>
>
> On Sat, 1 Jun 2002, Andrew Morton wrote:
> >
> > > Why not just split up the code inside iput(), and then just do
> > >
> > > if (atomic_dec(&inode->i_count))
> > > final_iput(inode);
> >
> > Yes, I suspect all the inode refcounting, locking, I_FREEING, I_LOCK, etc
> > could do with a spring clean. Make it a bit more conventional. I'll
> > discuss with Al when he resurfaces.
>
> This is a first cut at cleaning up "iput()" and getting rid of some of the
> magic VFS-level behaviour of the i_nlink field which many filesystems do
> not actually want - as shown by the number of "force_delete" users out
> there.
>
> It does not change any real behaviour, but it splits up the "iput()"
> behaviour into several functions ("common_delete_inode()",
> "common_forget_inode()" and "common_drop_inode()"), and adds a place for a
> low-level filesystem to hook into the behaviour at inode drop time,
> through the "drop_inode" superblock operation.
Now that is kinda neat, calling it with the inode lock held lets me move
some things out of reiserfs_file_release which need i_sem, and move them
into a less expensive drop_inode call without grabbing the semaphore.
-chris
On 3 Jun 2002, Chris Mason wrote:
>
> Now that is kinda neat, calling it with the inode lock held lets me move
> some things out of reiserfs_file_release which need i_sem, and move them
> into a less expensive drop_inode call without grabbing the semaphore.
CAREFUL!
If you make real per-FS use of this, and aren't just using the standard
ones, you need to be very very careful. In particular, you get called with
the inode lock held, but you would have to drop the lock yourself after
having removed the inode from the hash chains etc. I'd like people to
avoid playing too many games in this area, the locking and the exact
semantics of "drop_inode" are rather nasty.
Linus
On Mon, 2002-06-03 at 15:34, Linus Torvalds wrote:
>
> On 3 Jun 2002, Chris Mason wrote:
> >
> > Now that is kinda neat, calling it with the inode lock held lets me move
> > some things out of reiserfs_file_release which need i_sem, and move them
> > into a less expensive drop_inode call without grabbing the semaphore.
>
> CAREFUL!
>
> If you make real per-FS use of this, and aren't just using the standard
> ones, you need to be very very careful. In particular, you get called with
> the inode lock held, but you would have to drop the lock yourself after
> having removed the inode from the hash chains etc. I'd like people to
> avoid playing too many games in this area, the locking and the exact
> semantics of "drop_inode" are rather nasty.
Right, I don't want too much in there. There are a few things I need to
do when I know nobody else is messing with the inode, and I'm using
i_sem to provide that now. put_inode doesn't do what I need because
knfsd might iget his way into the mess.
I'm talking a very limited set of operations followed by calling the
generic functions. I might not do it at all if I can't get them safe
when called under the spin lock.
-chris
On 3 Jun 2002, Chris Mason wrote:
>
> I'm talking a very limited set of operations followed by calling the
> generic functions. I might not do it at all if I can't get them safe
> when called under the spin lock.
Ok, that should be reasonably "portable" (ie it won't break horribly and
silently in the future when something changes in inode-land). Just doing a
few ops (knowing you're under the inode lock) and then calling
"generic_drop_inode()" should be fine.
[ Except right now only the "generic_delete_inode()" thing is exported, so
you'd need to export the other generic stuff, but that was kind of my
plan anyway, I just don't wan tto do it until there is some real need. ]
Linus
On Sat, 2002-06-01 at 15:19, Andrew Morton wrote:
> Linus Torvalds wrote:
> >
> > On Sat, 1 Jun 2002, Andrew Morton wrote:
> > >
> > > So run __iget prior to dropping inode_lock.
> >
> > This part looks horrible:
> >
> > + spin_unlock(&inode_lock);
> > + iput(inode);
> > + spin_lock(&inode_lock);
>
> Yup. The inode refcounting APIs are really awkward. Note how I recently
> added dopey code in ext2_put_inode() to only drop the prealloc window on
> the "final" iput().
Hmmm, a quick glance makes the test in ext2_put_inode look unsafe.
iput calls put_inode before decrementing i_count. So, nothing stops 5
iput callers from all deciding i_count > 2 and leaving the preallocation
blocks hanging.
Also, a knfsd triggered iget/iput pair should hit the same race with an
put_inode call.
Or am I missing something?
-chris
On 3 Jun 2002, Chris Mason wrote:
>
> Or am I missing something?
No. I think that in the long run we really would want all of the writeback
preallocation should happen in the "struct file", not in "struct inode".
And they should be released at file close ("release()"), not at iput()
time.
I _think_ that right now nfsd doesn't cache file opens (only inodes), so
this could be a performance issue for nfsd, but it might be possible to
change how nfsd acts. And it would be a _lot_ cleaner to do it at the file
level.
Linus
Linus Torvalds wrote:
>
> On 3 Jun 2002, Chris Mason wrote:
> >
> > Or am I missing something?
>
> No. I think that in the long run we really would want all of the writeback
> preallocation should happen in the "struct file", not in "struct inode".
> And they should be released at file close ("release()"), not at iput()
> time.
That would be a lot nicer.
But why does ext2_put_inode() even exist? We're already throwing
away the prealloc window in ext2_release_file? I guess for
shared mappings over spares files: if all file handles have
closed off, we still need to make allocations against that
inode, yes?
-
On Mon, 2002-06-03 at 18:19, Linus Torvalds wrote:
>
>
> On 3 Jun 2002, Chris Mason wrote:
> >
> > Or am I missing something?
>
> No. I think that in the long run we really would want all of the writeback
> preallocation should happen in the "struct file", not in "struct inode".
> And they should be released at file close ("release()"), not at iput()
> time.
reiserfs does preallocation and tail packing in release() right now, but
do we really need preallocation at all once delayed allocation is
stable?
-chris
Chris Mason wrote:
>
> On Mon, 2002-06-03 at 18:19, Linus Torvalds wrote:
> >
> >
> > On 3 Jun 2002, Chris Mason wrote:
> > >
> > > Or am I missing something?
> >
> > No. I think that in the long run we really would want all of the writeback
> > preallocation should happen in the "struct file", not in "struct inode".
> > And they should be released at file close ("release()"), not at iput()
> > time.
>
> reiserfs does preallocation and tail packing in release() right now, but
> do we really need preallocation at all once delayed allocation is
> stable?
>
well.. Will delayed allocation even exist? Grafting reservations
onto ext2 had a few awkward moments in the area of handling ENOSPC,
and generally a lot of the benefits of that code (ie: avoiding buffers)
have been whittled away by other means.
So right now I don't see a strong reason for adding delayed allocation
support into the core kernel for ext2. If another fs comes up and
creates a demand for core kernel support then fine. (Thinking XFS
here). Probably I should resurrect the ext2 code as something for you
and Steve to poke at.
Plus ext3 needs prealloc as well. Performing that within the live
bitmaps like ext2 is really awkward. I have half-a-patch which performs
area reservations in ext3 via a per-private-inode "start, length"
tuple. These zones never get near being written to disk. That
seems a reasonable approach for ext3, but it does use the inode.
If that info was inside struct file, writepage() would get rather
confused - it doesn't have a file*.
-
On Mon, 3 Jun 2002, Andrew Morton wrote:
>
> But why does ext2_put_inode() even exist? We're already throwing
> away the prealloc window in ext2_release_file? I guess for
> shared mappings over spares files: if all file handles have
> closed off, we still need to make allocations against that
> inode, yes?
Shared mappings still hold the "struct file" open (you have
"vma->vm_file->f_dentry->d_inode"), so you still have the file handle
while the mapping is open.
I assume that the reason is that _any_ block allocation will trigegr
pre-alloc, which means that we have preallocation for things like
directories etc too - which really do not have a "struct file" associated
with them.
Whether that is really worth it is unclear, but it also means that ext2
doesn't have to pass down the "struct file" to the lower levels at all, as
it keeps all pre-alloc stuff in the inode.
On the whole it's probably a mistake, but my point is that it's likely a
mistake that is hard to fix. Which is why I didn't even try to fix ext2 to
not use "put_inode" and the prealloc dropping there..
Linus
Linus Torvalds wrote:
>
> On Mon, 3 Jun 2002, Andrew Morton wrote:
> >
> > But why does ext2_put_inode() even exist? We're already throwing
> > away the prealloc window in ext2_release_file? I guess for
> > shared mappings over spares files: if all file handles have
> > closed off, we still need to make allocations against that
> > inode, yes?
>
> Shared mappings still hold the "struct file" open (you have
> "vma->vm_file->f_dentry->d_inode"), so you still have the file handle
> while the mapping is open.
But after the vma has been destroyed (the application did exit()),
the dirty pagecache data against the sparse mapped file still
hasn't been written, and still doesn't have a disk mapping.
So in this case, we have an inode which still has pending block
allocations which has no struct file's pointing at it. Or
am I wrong?
> I assume that the reason is that _any_ block allocation will trigegr
> pre-alloc, which means that we have preallocation for things like
> directories etc too - which really do not have a "struct file" associated
> with them.
>
> Whether that is really worth it is unclear, but it also means that ext2
> doesn't have to pass down the "struct file" to the lower levels at all, as
> it keeps all pre-alloc stuff in the inode.
The current preallocation will screw up is when there's a
large-and-sparse file which has blocks being allocated against it
at two or more offsets. And those allocations are for a large number
of blocks, and they are proceeding slowly. Something like:
for (i = 0; i < 16; i++) {
pwrite(fd, buf, 4096, offset1 + i * 4096);
pwrite(fd, buf, 4096, offset2 + i * 4096);
}
this would cause preallocation thrashing, and those blocks
would be laid out as ABABABAB. The same would happen if
the prealloc information is per-file as well, of course.
The app would have to be using different file descriptors
to avoid it.
So from a file-layout point of view, I don't see a lot of
benefit in moving the preallocation info into struct file.
(Delayed allocation plus writeback-time sorting would fix it
all up. With delayed allocate, the prealloc code can visit
the bitbucket).
-
On Tue, 4 Jun 2002, Andrew Morton wrote:
>
> But after the vma has been destroyed (the application did exit()),
> the dirty pagecache data against the sparse mapped file still
> hasn't been written, and still doesn't have a disk mapping.
>
> So in this case, we have an inode which still has pending block
> allocations which has no struct file's pointing at it. Or
> am I wrong?
I think you're right..
> The current preallocation will screw up is when there's a
> large-and-sparse file which has blocks being allocated against it
> at two or more offsets. And those allocations are for a large number
> of blocks, and they are proceeding slowly.
Sure. However, I don't think it should come as any surprise to anybody
that trying to write to two different points in the same file is a bad
idea. _regardless_ of whether you do pre-allocation or not, and whether
the pre-allocation is on the inode or file level.
I'd still love to see a "fast and slightly stupid" allocator for both
blocks and inodes, and have some infrastructure to do run-time defragging
in the background.
Linus
Linus Torvalds wrote:
>
> ...
> I'd still love to see a "fast and slightly stupid" allocator for both
> blocks and inodes, and have some infrastructure to do run-time defragging
> in the background.
>
I think runtime defrag could yield really good benefits. In
particular it would allow us to find_group_dir(), and always
put directories in the same blockgroup as their parent (big
speedups for the `untar-a-kernel-tree' workload).
There's a patch at
http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.19-pre10/ext3-reloc-page.patch
which provides a simple `relocate page' ioctl for ext3 files. It
relocates a page's blocks. The operation is fully journalled and
pagecache-coherent. So you can turn off the power in the middle
of a defrag operation and the fs will come back just fine. It doesn't
make any attempt to relocate inodes. If the page relocation attempt fails
then it just returns -EAGAIN and userspace gets to worry about what
to do.
I simply have not had the time to do anything about the userspace
program which drives that ioctl. So if there's anyone out there
who has a little time on their hands...
-
On Tue, 4 Jun 2002, Andrew Morton wrote:
>
> There's a patch at
> http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.19-pre10/ext3-reloc-page.patch
> which provides a simple `relocate page' ioctl for ext3 files.
That's a good start, but before even egtting that far there is some need
for a way to get a picture of the FS layout in a reasonably fs-independent
way.
Sure, bmap() actually does part of this (the "where are my blocks" part),
but right now there is no way to query the FS for the "where can I put
blocks" part.
You can do it with direct disk access and knowledge of the FS internals,
but it should not be all that hard to add some simple interface to get a
"block usage byte array" kind of thing (more efficient than doing bmap on
all files, _and_ can tell about blocks reserved for inodes, superblocks
and other special uses), which together with a user-level interface to
"preallocate" and your "relocate page" should actually make it possible to
make a fairly FS-independent defragmenter.
Add a nice graphical front-end, and you can make it a useful screen-saver.
Linus
On Tue, Jun 04, 2002 at 02:37:26PM -0700, Linus Torvalds wrote:
> That's a good start, but before even egtting that far there is some need
> for a way to get a picture of the FS layout in a reasonably fs-independent
> way.
Al Viro actually came up with a beautiful mechanism for it: present
the filesystem's metadata as a filesystem itself. Blocks within a
particular inode can then be viewed with cat, and suggested replacements
can be made via echo. Such a best could be called ext2metafs and work
cooperatively with the existing filesystem code. It's quite an elegant
design that has utility beyond the simple case of defragmenting.
> Add a nice graphical front-end, and you can make it a useful screen-saver.
Heh, that would be excellent.
-ben
--
"You will be reincarnated as a toad; and you will be much happier."
>Sure. However, I don't think it should come as any surprise to anybody
>that trying to write to two different points in the same file is a bad
>idea.
Databases?
Craig Milo Rogers
On Tue, 4 Jun 2002, Craig Milo Rogers wrote:
>
> >Sure. However, I don't think it should come as any surprise to anybody
> >that trying to write to two different points in the same file is a bad
> >idea.
>
> Databases?
They uniformly (as far as I know) preallocate all the data blocks.
Exactly to avoid the block layout issue - they want the data blocks as
contiguous as possible.
Linus
Linus Torvalds wrote:
>
> On Tue, 4 Jun 2002, Andrew Morton wrote:
> >
> > There's a patch at
> > http://www.zip.com.au/~akpm/linux/patches/2.4/2.4.19-pre10/ext3-reloc-page.patch
> > which provides a simple `relocate page' ioctl for ext3 files.
>
> That's a good start, but before even egtting that far there is some need
> for a way to get a picture of the FS layout in a reasonably fs-independent
> way.
>
> Sure, bmap() actually does part of this (the "where are my blocks" part),
> but right now there is no way to query the FS for the "where can I put
> blocks" part.
Jeff Garzik was working on that a while back - a separate filesystem
which provides a "metadata view" of a real filesytem. So you can
poke around and find all these things out. In theory, different
filesystems should be able to offer the same view.
> You can do it with direct disk access and knowledge of the FS internals,
> but it should not be all that hard to add some simple interface to get a
> "block usage byte array" kind of thing (more efficient than doing bmap on
> all files, _and_ can tell about blocks reserved for inodes, superblocks
> and other special uses), which together with a user-level interface to
> "preallocate" and your "relocate page" should actually make it possible to
> make a fairly FS-independent defragmenter.
The e2fsprogs package includes a `libe2fs' library which offers
APIs for accessing the fs internals. It's exactly what you
say - direct disk access and knowledge of internals. So
that plus the try_to_relocate_page() ioctl is a shortest-path
route to a defragmenter for ext3, and only ext3. I wasn't
aiming very high here ;)
A totally different way of performing defrag could be to
copy the entire fs from one partition to a different one,
with kernel support for providing coherency while the copy
is in progress. It's basically a union/translucent mount
with COW. Swizzle the backing blockdev, drop the disk
mappings from all incore pages, renumber the inode without
breaking stuff... (OK, I've talked myself out of it ;/) It's
not super efficient, and it does require the provisioning of a
bounce disk, but it would use infrastructure which would be
useful for other stuff and it is fs-agnostic.
-
Hi Linus.
> You can do it with direct disk access and knowledge of the FS
> internals, but it should not be all that hard to add some simple
> interface to get a "block usage byte array" kind of thing (more
> efficient than doing bmap on all files, _and_ can tell about blocks
> reserved for inodes, superblocks and other special uses), which
> together with a user-level interface to "preallocate" and your
> "relocate page" should actually make it possible to make a fairly
> FS-independent defragmenter.
Would I be right in thinking that this would be an array with three
possible values for each element...
FIXED Block not movable.
MOVABLE Block in use and movable.
UNUSED Block not in use.
...with the defragmenter basically exchanging MOVABLE and UNUSED blocks
to get the files in an unfragmented state. The FIXED value would handle
things like the superblock, inode blocks and other special uses, and all
other blocks would be either MOVABLE or UNUSED as appropriate.
Another option might be to split MOVABLE into DIRECTORY and FILE values
instead, but whether that would be useful is questionable at best.
> Add a nice graphical front-end, and you can make it a useful
> screen-saver.
As long as it runs on systems without X-windows available rather than
being limited to only run under X... My preference would be for a tool
similar to `make menuconfig` for this sort of utility.
Best wishes from Riley.