Hello!
This patch (originally by Chris Mason) fixes link/unlink races in reiserfs.
The patch is against 2.6.0-test2, please apply.
===== fs/reiserfs/namei.c 1.47 vs edited =====
--- 1.47/fs/reiserfs/namei.c Mon Jun 30 10:49:04 2003
+++ edited/fs/reiserfs/namei.c Thu Jul 31 18:01:55 2003
@@ -822,6 +822,7 @@
int windex ;
struct reiserfs_transaction_handle th ;
int jbegin_count;
+ unsigned long savelink;
inode = dentry->d_inode;
@@ -858,11 +859,20 @@
inode->i_nlink = 1;
}
+ inode->i_nlink--;
+
+ /*
+ * we schedule before doing the add_save_link call, save the link
+ * count so we don't race
+ */
+ savelink = inode->i_nlink;
+
+
retval = reiserfs_cut_from_item (&th, &path, &(de.de_entry_key), dir, NULL, 0);
- if (retval < 0)
+ if (retval < 0) {
+ inode->i_nlink++;
goto end_unlink;
-
- inode->i_nlink--;
+ }
inode->i_ctime = CURRENT_TIME;
reiserfs_update_sd (&th, inode);
@@ -871,7 +881,7 @@
dir->i_ctime = dir->i_mtime = CURRENT_TIME;
reiserfs_update_sd (&th, dir);
- if (!inode->i_nlink)
+ if (!savelink)
/* prevent file from getting lost */
add_save_link (&th, inode, 0/* not truncate */);
@@ -976,6 +986,12 @@
reiserfs_write_unlock(dir->i_sb);
return -EMLINK;
}
+ if (inode->i_nlink == 0) {
+ return -ENOENT;
+ }
+
+ /* inc before scheduling so reiserfs_unlink knows we are here */
+ inode->i_nlink++;
journal_begin(&th, dir->i_sb, jbegin_count) ;
windex = push_journal_writer("reiserfs_link") ;
@@ -988,13 +1004,13 @@
reiserfs_update_inode_transaction(dir) ;
if (retval) {
+ inode->i_nlink--;
pop_journal_writer(windex) ;
journal_end(&th, dir->i_sb, jbegin_count) ;
reiserfs_write_unlock(dir->i_sb);
return retval;
}
- inode->i_nlink++;
inode->i_ctime = CURRENT_TIME;
reiserfs_update_sd (&th, inode);
@@ -1068,6 +1084,7 @@
struct reiserfs_transaction_handle th ;
int jbegin_count ;
umode_t old_inode_mode;
+ unsigned long savelink = 1;
/* two balancings: old name removal, new name insertion or "save" link,
stat data updates: old directory and new directory and maybe block
@@ -1246,6 +1263,7 @@
new_dentry_inode->i_nlink--;
}
new_dentry_inode->i_ctime = new_dir->i_ctime;
+ savelink = new_dentry_inode->i_nlink;
}
if (S_ISDIR(old_inode_mode)) {
@@ -1279,7 +1297,7 @@
reiserfs_update_sd (&th, new_dir);
if (new_dentry_inode) {
- if (new_dentry_inode->i_nlink == 0)
+ if (savelink == 0)
add_save_link (&th, new_dentry_inode, 0/* not truncate */);
reiserfs_update_sd (&th, new_dentry_inode);
}
Oleg Drokin <[email protected]> wrote:
>
> This patch (originally by Chris Mason) fixes link/unlink races in reiserfs.
>
Could you describe the race a little more please? Why is the VFS's hold of
i_sem on the parent directory not sufficient?
> +
> + /*
> + * we schedule before doing the add_save_link call, save the link
> + * count so we don't race
This comment would seem to imply lock_kernel()-based locking, but
lock_kernel() is not held here.
Hello!
On Thu, Jul 31, 2003 at 01:37:08PM -0700, Andrew Morton wrote:
> > This patch (originally by Chris Mason) fixes link/unlink races in reiserfs.
> Could you describe the race a little more please? Why is the VFS's hold of
> i_sem on the parent directory not sufficient?
Well, we do not take i_sem on parent directory of source filename for sys_link, I think.
So we might endup in sys_link() with inode that is already deleted/being deleted (and nlink==0).
Actually, I naturally thought that only i_nlink check to be non zero at reiserfs_link time should be
enough, but Chris is sure that we need entire patch, so may be he may add more comments to that.
BTW, looking at vfs_link, this patch (below the message) seems to be natural thing to do, is not it?
> > +
> > + /*
> > + * we schedule before doing the add_save_link call, save the link
> > + * count so we don't race
> This comment would seem to imply lock_kernel()-based locking, but
> lock_kernel() is not held here.
It is.
This reiserfs_write_lock/unlock stuff are in fact lock_kernel()/unlock_kernel(),
only I invented those macroses to easy conversion to more fine-grained locking later.
Except that Hans now does not want this to happen.
Bye,
Oleg
===== fs/namei.c 1.81 vs edited =====
--- 1.81/fs/namei.c Fri Jul 11 09:23:45 2003
+++ edited/fs/namei.c Fri Aug 1 14:40:29 2003
@@ -1793,17 +1793,17 @@
return -EPERM;
if (!dir->i_op || !dir->i_op->link)
return -EPERM;
- if (S_ISDIR(old_dentry->d_inode->i_mode))
+ if (S_ISDIR(inode->i_mode))
return -EPERM;
error = security_inode_link(old_dentry, dir, new_dentry);
if (error)
return error;
- down(&old_dentry->d_inode->i_sem);
+ down(&inode->i_sem);
DQUOT_INIT(dir);
error = dir->i_op->link(old_dentry, dir, new_dentry);
- up(&old_dentry->d_inode->i_sem);
+ up(&inode->i_sem);
if (!error) {
inode_dir_notify(dir, DN_CREATE);
security_inode_post_link(old_dentry, dir, new_dentry);
On Fri, 2003-08-01 at 07:10, Oleg Drokin wrote:
> Hello!
>
> On Thu, Jul 31, 2003 at 01:37:08PM -0700, Andrew Morton wrote:
>
> > > This patch (originally by Chris Mason) fixes link/unlink races in reiserfs.
> > Could you describe the race a little more please? Why is the VFS's hold of
> > i_sem on the parent directory not sufficient?
>
> Well, we do not take i_sem on parent directory of source filename for sys_link, I think.
> So we might endup in sys_link() with inode that is already deleted/being deleted (and nlink==0).
> Actually, I naturally thought that only i_nlink check to be non zero at reiserfs_link time should be
> enough, but Chris is sure that we need entire patch, so may be he may add more comments to that.
>
Yes, it's basically a problem of making sure reiserfs_link sees any
changes made by reiserfs_unlink and vice versa. Toss in the BKL and a
few funcs that schedule and you get small windows where the object
you're linking to can have a inode->i_nlink of zero.
All of which is dealt with properly at the VFS level. The trick in
reiserfs is that somebody needs to take care of removing the savelink
used to prevent lost files after a crash. If we don't get the i_nlink
timing right, two different procs might try to remove the savelink, or
it might not get removed at all.
> BTW, looking at vfs_link, this patch (below the message) seems to be
> natural thing to do, is not it?
>
Looks fine.
-chris