2006-08-01 23:52:47

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 00/28] Mount writer count and read-only bind mounts (v5)

Tries to incorporate comments from Al:
http://article.gmane.org/gmane.linux.kernel/421029

Al wrote:
> b) figuring out what (if anything) should be done with
> propagation when we have shared subtrees... (not trivial at all)

Talked with Ram: Shared subtrees are about having identical views
into the filesystem. Changing the mount permissions doesn't affect
the view of the filesystem, so it should not be propagated.

The things that probably need the heaviest review in here are the
i_nlink monitoring patch (including the inode state flag patches 03
and 06) and the new MNT_SB_WRITABLE flag (patch 05).

These are against 2.6.18-rc2-mm1. Does anybody have anything against
them having some testing in -mm?

---

The following series implements read-only bind mounts. This feature
allows a read-only view into a read-write filesystem. In the process
of doing that, it also provides infrastructure for keeping track of
the number of writers to any given mount. In this version, if there
are writers on a superblock, the filesystem may not be remounted
r/o. The same goes for MS_BIND mounts, and writers on a vfsmount.

This has a number of uses. It allows chroots to have parts of
filesystems writable. It will be useful for containers in the future
and is intended to replace patches that vserver has had out of the
tree for several years. It allows security enhancement by making
sure that parts of your filesystem read-only, when you don't want
to have entire new filesystems mounted, or when you want atime
selectively updated.

This set makes no attempt to keep the return codes for these
r/o bind mounts the same as for a real r/o filesystem or device.
It would require significantly more code and be quite a bit more
invasive.

Using this feature requires two steps:

mount --bind /source /dest
mount -o remount,ro /dest

Signed-off-by: Dave Hansen <[email protected]>


2006-08-01 23:53:26

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 25/28] elevate mnt writers for vfs_unlink() callers



Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namei.c | 4 ++++
lxc-dave/ipc/mqueue.c | 5 ++++-
2 files changed, 8 insertions(+), 1 deletion(-)

diff -puN fs/namei.c~C-elevate-writers-vfs_unlink fs/namei.c
--- lxc/fs/namei.c~C-elevate-writers-vfs_unlink 2006-08-01 16:35:31.000000000 -0700
+++ lxc-dave/fs/namei.c 2006-08-01 16:35:32.000000000 -0700
@@ -2171,7 +2171,11 @@ static long do_unlinkat(int dfd, const c
inode = dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto exit2;
error = vfs_unlink(nd.dentry->d_inode, dentry);
+ mnt_drop_write(nd.mnt);
exit2:
dput(dentry);
}
diff -puN ipc/mqueue.c~C-elevate-writers-vfs_unlink ipc/mqueue.c
--- lxc/ipc/mqueue.c~C-elevate-writers-vfs_unlink 2006-08-01 16:35:28.000000000 -0700
+++ lxc-dave/ipc/mqueue.c 2006-08-01 16:35:32.000000000 -0700
@@ -747,8 +747,11 @@ asmlinkage long sys_mq_unlink(const char
inode = dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
-
+ err = mnt_want_write(mqueue_mnt);
+ if (err)
+ goto out_err;
err = vfs_unlink(dentry->d_parent->d_inode, dentry);
+ mnt_drop_write(mqueue_mnt);
out_err:
dput(dentry);

_

2006-08-01 23:53:17

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 02/28] r/o bind mount prepwork: move open_namei()'s vfs_create()


The code around vfs_create() in open_namei() is getting a
bit too complex. Right now, there is at least the reference
count on the dentry, and the i_mutex to worry about. Soon,
we'll also have mnt_writecount.

So, break the vfs_create() call out of open_namei(), and
into a helper function. This duplicates the call to
may_open(), but that isn't such a bad thing since the
arguments (acc_mode and flag) were being heavily massaged
anyway.

Later in the series, we'll add the mnt_writecount handling
around this new function call.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namei.c | 30 ++++++++++++++++++++----------
1 files changed, 20 insertions(+), 10 deletions(-)

diff -puN fs/namei.c~B-prepwork-cleanup-open_namei fs/namei.c
--- lxc/fs/namei.c~B-prepwork-cleanup-open_namei 2006-08-01 16:35:13.000000000 -0700
+++ lxc-dave/fs/namei.c 2006-08-01 16:35:14.000000000 -0700
@@ -1587,6 +1587,24 @@ int may_open(struct nameidata *nd, int a
return 0;
}

+static int open_namei_create(struct nameidata *nd, struct path *path,
+ int flag, int mode)
+{
+ int error;
+ struct dentry *dir = nd->dentry;
+
+ if (!IS_POSIXACL(dir->d_inode))
+ mode &= ~current->fs->umask;
+ error = vfs_create(dir->d_inode, path->dentry, mode, nd);
+ mutex_unlock(&dir->d_inode->i_mutex);
+ dput(nd->dentry);
+ nd->dentry = path->dentry;
+ if (error)
+ return error;
+ /* Don't check for write permission, don't truncate */
+ return may_open(nd, 0, flag & ~O_TRUNC);
+}
+
/*
* open_namei()
*
@@ -1668,18 +1686,10 @@ do_last:

/* Negative dentry, just create the file */
if (!path.dentry->d_inode) {
- if (!IS_POSIXACL(dir->d_inode))
- mode &= ~current->fs->umask;
- error = vfs_create(dir->d_inode, path.dentry, mode, nd);
- mutex_unlock(&dir->d_inode->i_mutex);
- dput(nd->dentry);
- nd->dentry = path.dentry;
+ error = open_namei_create(nd, &path, flag, mode);
if (error)
goto exit;
- /* Don't check for write permission, don't truncate */
- acc_mode = 0;
- flag &= ~O_TRUNC;
- goto ok;
+ return 0;
}

/*
_

2006-08-01 23:52:49

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 03/28] unlink: monitor i_nlink


When a filesystem decrements i_nlink to zero, it means that a
write must be performed in order to drop the inode from the
filesystem.

We're shortly going to have keep filesystems from being remounted
r/o between the time that this i_nlink decrement and that write
occurs.

So, add a little helper function to do the decrements. We'll
tie into it in a bit to note when i_nlink hits zero. Should
we also be checking all of the places where i_nlink is explicitly
set to 0 in the unlink paths?

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/drivers/usb/core/inode.c | 7 ++++---
lxc-dave/fs/autofs/root.c | 2 +-
lxc-dave/fs/autofs4/root.c | 2 +-
lxc-dave/fs/bfs/dir.c | 9 +++------
lxc-dave/fs/cifs/inode.c | 10 +++++-----
lxc-dave/fs/coda/dir.c | 4 ++--
lxc-dave/fs/ext2/namei.c | 2 +-
lxc-dave/fs/ext3/namei.c | 14 +++++++-------
lxc-dave/fs/hfs/dir.c | 2 +-
lxc-dave/fs/hfsplus/dir.c | 2 +-
lxc-dave/fs/hpfs/namei.c | 6 +++---
lxc-dave/fs/jffs/inode-v23.c | 3 +--
lxc-dave/fs/jffs2/dir.c | 6 +++---
lxc-dave/fs/jfs/namei.c | 14 ++++++--------
lxc-dave/fs/libfs.c | 10 +++++-----
lxc-dave/fs/minix/namei.c | 2 +-
lxc-dave/fs/msdos/namei.c | 9 ++++-----
lxc-dave/fs/nfs/dir.c | 6 +++---
lxc-dave/fs/ocfs2/namei.c | 4 ++--
lxc-dave/fs/qnx4/namei.c | 6 ++----
lxc-dave/fs/reiserfs/namei.c | 6 +++---
lxc-dave/fs/sysv/namei.c | 2 +-
lxc-dave/fs/udf/namei.c | 18 ++++++------------
lxc-dave/fs/ufs/namei.c | 2 +-
lxc-dave/fs/vfat/namei.c | 9 ++++-----
lxc-dave/include/linux/fs.h | 8 +++++++-
lxc-dave/ipc/mqueue.c | 2 +-
lxc-dave/mm/shmem.c | 10 +++++-----
28 files changed, 84 insertions(+), 93 deletions(-)

diff -puN drivers/usb/core/inode.c~B-unlink-monitor-i_nlink-0 drivers/usb/core/inode.c
--- lxc/drivers/usb/core/inode.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/drivers/usb/core/inode.c 2006-08-01 16:35:15.000000000 -0700
@@ -332,7 +332,7 @@ static int usbfs_unlink (struct inode *d
{
struct inode *inode = dentry->d_inode;
mutex_lock(&inode->i_mutex);
- dentry->d_inode->i_nlink--;
+ inode_drop_nlink(dentry->d_inode);
dput(dentry);
mutex_unlock(&inode->i_mutex);
d_delete(dentry);
@@ -347,10 +347,11 @@ static int usbfs_rmdir(struct inode *dir
mutex_lock(&inode->i_mutex);
dentry_unhash(dentry);
if (usbfs_empty(dentry)) {
- dentry->d_inode->i_nlink -= 2;
+ inode_drop_nlink(dentry->d_inode);
+ inode_drop_nlink(dentry->d_inode);
dput(dentry);
inode->i_flags |= S_DEAD;
- dir->i_nlink--;
+ inode_drop_nlink(dir);
error = 0;
}
mutex_unlock(&inode->i_mutex);
diff -puN fs/autofs/root.c~B-unlink-monitor-i_nlink-0 fs/autofs/root.c
--- lxc/fs/autofs/root.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/autofs/root.c 2006-08-01 16:35:15.000000000 -0700
@@ -414,7 +414,7 @@ static int autofs_root_rmdir(struct inod

dentry->d_time = (unsigned long)(struct autofs_dir_ent *)NULL;
autofs_hash_delete(ent);
- dir->i_nlink--;
+ inode_drop_nlink(dir);
d_drop(dentry);
unlock_kernel();

diff -puN fs/autofs4/root.c~B-unlink-monitor-i_nlink-0 fs/autofs4/root.c
--- lxc/fs/autofs4/root.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/autofs4/root.c 2006-08-01 16:35:15.000000000 -0700
@@ -676,7 +676,7 @@ static int autofs4_dir_rmdir(struct inod
dentry->d_inode->i_nlink = 0;

if (dir->i_nlink)
- dir->i_nlink--;
+ inode_drop_nlink(dir);

return 0;
}
diff -puN fs/bfs/dir.c~B-unlink-monitor-i_nlink-0 fs/bfs/dir.c
--- lxc/fs/bfs/dir.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/bfs/dir.c 2006-08-01 16:35:15.000000000 -0700
@@ -117,8 +117,7 @@ static int bfs_create(struct inode * dir

err = bfs_add_entry(dir, dentry->d_name.name, dentry->d_name.len, inode->i_ino);
if (err) {
- inode->i_nlink--;
- mark_inode_dirty(inode);
+ inode_dec_link_count(inode);
iput(inode);
unlock_kernel();
return err;
@@ -196,9 +195,8 @@ static int bfs_unlink(struct inode * dir
mark_buffer_dirty(bh);
dir->i_ctime = dir->i_mtime = CURRENT_TIME_SEC;
mark_inode_dirty(dir);
- inode->i_nlink--;
inode->i_ctime = dir->i_ctime;
- mark_inode_dirty(inode);
+ inode_dec_link_count(inode);
error = 0;

out_brelse:
@@ -249,9 +247,8 @@ static int bfs_rename(struct inode * old
old_dir->i_ctime = old_dir->i_mtime = CURRENT_TIME_SEC;
mark_inode_dirty(old_dir);
if (new_inode) {
- new_inode->i_nlink--;
new_inode->i_ctime = CURRENT_TIME_SEC;
- mark_inode_dirty(new_inode);
+ inode_dec_link_count(new_inode);
}
mark_buffer_dirty(old_bh);
error = 0;
diff -puN fs/cifs/inode.c~B-unlink-monitor-i_nlink-0 fs/cifs/inode.c
--- lxc/fs/cifs/inode.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/cifs/inode.c 2006-08-01 16:35:15.000000000 -0700
@@ -591,7 +591,7 @@ int cifs_unlink(struct inode *inode, str

if (!rc) {
if (direntry->d_inode)
- direntry->d_inode->i_nlink--;
+ inode_drop_nlink(direntry->d_inode);
} else if (rc == -ENOENT) {
d_drop(direntry);
} else if (rc == -ETXTBSY) {
@@ -610,7 +610,7 @@ int cifs_unlink(struct inode *inode, str
CIFS_MOUNT_MAP_SPECIAL_CHR);
CIFSSMBClose(xid, pTcon, netfid);
if (direntry->d_inode)
- direntry->d_inode->i_nlink--;
+ inode_drop_nlink(direntry->d_inode);
}
} else if (rc == -EACCES) {
/* try only if r/o attribute set in local lookup data? */
@@ -664,7 +664,7 @@ int cifs_unlink(struct inode *inode, str
CIFS_MOUNT_MAP_SPECIAL_CHR);
if (!rc) {
if (direntry->d_inode)
- direntry->d_inode->i_nlink--;
+ inode_drop_nlink(direntry->d_inode);
} else if (rc == -ETXTBSY) {
int oplock = FALSE;
__u16 netfid;
@@ -685,7 +685,7 @@ int cifs_unlink(struct inode *inode, str
CIFS_MOUNT_MAP_SPECIAL_CHR);
CIFSSMBClose(xid, pTcon, netfid);
if (direntry->d_inode)
- direntry->d_inode->i_nlink--;
+ inode_drop_nlink(direntry->d_inode);
}
/* BB if rc = -ETXTBUSY goto the rename logic BB */
}
@@ -817,7 +817,7 @@ int cifs_rmdir(struct inode *inode, stru
cifs_sb->mnt_cifs_flags & CIFS_MOUNT_MAP_SPECIAL_CHR);

if (!rc) {
- inode->i_nlink--;
+ inode_drop_nlink(inode);
i_size_write(direntry->d_inode,0);
direntry->d_inode->i_nlink = 0;
}
diff -puN fs/coda/dir.c~B-unlink-monitor-i_nlink-0 fs/coda/dir.c
--- lxc/fs/coda/dir.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/coda/dir.c 2006-08-01 16:35:15.000000000 -0700
@@ -367,7 +367,7 @@ int coda_unlink(struct inode *dir, struc
}

coda_dir_changed(dir, 0);
- de->d_inode->i_nlink--;
+ inode_drop_nlink(de->d_inode);
unlock_kernel();

return 0;
@@ -394,7 +394,7 @@ int coda_rmdir(struct inode *dir, struct
}

coda_dir_changed(dir, -1);
- de->d_inode->i_nlink--;
+ inode_drop_nlink(de->d_inode);
d_delete(de);
unlock_kernel();

diff -puN fs/ext2/namei.c~B-unlink-monitor-i_nlink-0 fs/ext2/namei.c
--- lxc/fs/ext2/namei.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/ext2/namei.c 2006-08-01 16:35:15.000000000 -0700
@@ -326,7 +326,7 @@ static int ext2_rename (struct inode * o
ext2_set_link(new_dir, new_de, new_page, old_inode);
new_inode->i_ctime = CURRENT_TIME_SEC;
if (dir_de)
- new_inode->i_nlink--;
+ inode_drop_nlink(new_inode);
inode_dec_link_count(new_inode);
} else {
if (dir_de) {
diff -puN fs/ext3/namei.c~B-unlink-monitor-i_nlink-0 fs/ext3/namei.c
--- lxc/fs/ext3/namei.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/ext3/namei.c 2006-08-01 16:35:15.000000000 -0700
@@ -1620,7 +1620,7 @@ static inline void ext3_inc_count(handle

static inline void ext3_dec_count(handle_t *handle, struct inode *inode)
{
- inode->i_nlink--;
+ inode_drop_nlink(inode);
}

static int ext3_add_nondir(handle_t *handle,
@@ -1742,7 +1742,7 @@ retry:
inode->i_size = EXT3_I(inode)->i_disksize = inode->i_sb->s_blocksize;
dir_block = ext3_bread (handle, inode, 0, 1, &err);
if (!dir_block) {
- inode->i_nlink--; /* is this nlink == 0? */
+ inode_drop_nlink(inode); /* is this nlink == 0? */
ext3_mark_inode_dirty(handle, inode);
iput (inode);
goto out_stop;
@@ -2052,7 +2052,7 @@ static int ext3_rmdir (struct inode * di
ext3_orphan_add(handle, inode);
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME_SEC;
ext3_mark_inode_dirty(handle, inode);
- dir->i_nlink--;
+ inode_drop_nlink(dir);
ext3_update_dx_flag(dir);
ext3_mark_inode_dirty(handle, dir);

@@ -2103,7 +2103,7 @@ static int ext3_unlink(struct inode * di
dir->i_ctime = dir->i_mtime = CURRENT_TIME_SEC;
ext3_update_dx_flag(dir);
ext3_mark_inode_dirty(handle, dir);
- inode->i_nlink--;
+ inode_drop_nlink(inode);
if (!inode->i_nlink)
ext3_orphan_add(handle, inode);
inode->i_ctime = dir->i_ctime;
@@ -2325,7 +2325,7 @@ static int ext3_rename (struct inode * o
}

if (new_inode) {
- new_inode->i_nlink--;
+ inode_drop_nlink(new_inode);
new_inode->i_ctime = CURRENT_TIME_SEC;
}
old_dir->i_ctime = old_dir->i_mtime = CURRENT_TIME_SEC;
@@ -2336,9 +2336,9 @@ static int ext3_rename (struct inode * o
PARENT_INO(dir_bh->b_data) = cpu_to_le32(new_dir->i_ino);
BUFFER_TRACE(dir_bh, "call ext3_journal_dirty_metadata");
ext3_journal_dirty_metadata(handle, dir_bh);
- old_dir->i_nlink--;
+ inode_drop_nlink(old_dir);
if (new_inode) {
- new_inode->i_nlink--;
+ inode_drop_nlink(new_inode);
} else {
new_dir->i_nlink++;
ext3_update_dx_flag(new_dir);
diff -puN fs/hfs/dir.c~B-unlink-monitor-i_nlink-0 fs/hfs/dir.c
--- lxc/fs/hfs/dir.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/hfs/dir.c 2006-08-01 16:35:15.000000000 -0700
@@ -246,7 +246,7 @@ static int hfs_unlink(struct inode *dir,
if (res)
return res;

- inode->i_nlink--;
+ inode_drop_nlink(inode);
hfs_delete_inode(inode);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
diff -puN fs/hfsplus/dir.c~B-unlink-monitor-i_nlink-0 fs/hfsplus/dir.c
--- lxc/fs/hfsplus/dir.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/hfsplus/dir.c 2006-08-01 16:35:15.000000000 -0700
@@ -338,7 +338,7 @@ static int hfsplus_unlink(struct inode *
return res;

if (inode->i_nlink > 0)
- inode->i_nlink--;
+ inode_drop_nlink(inode);
hfsplus_delete_inode(inode);
if (inode->i_ino != cnid && !inode->i_nlink) {
if (!atomic_read(&HFSPLUS_I(inode).opencnt)) {
diff -puN fs/hpfs/namei.c~B-unlink-monitor-i_nlink-0 fs/hpfs/namei.c
--- lxc/fs/hpfs/namei.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/hpfs/namei.c 2006-08-01 16:35:15.000000000 -0700
@@ -434,7 +434,7 @@ again:
unlock_kernel();
return -ENOSPC;
default:
- inode->i_nlink--;
+ inode_drop_nlink(inode);
err = 0;
}
goto out;
@@ -494,7 +494,7 @@ static int hpfs_rmdir(struct inode *dir,
err = -ENOSPC;
break;
default:
- dir->i_nlink--;
+ inode_drop_nlink(dir);
inode->i_nlink = 0;
err = 0;
}
@@ -636,7 +636,7 @@ static int hpfs_rename(struct inode *old
hpfs_i(i)->i_parent_dir = new_dir->i_ino;
if (S_ISDIR(i->i_mode)) {
new_dir->i_nlink++;
- old_dir->i_nlink--;
+ inode_drop_nlink(old_dir);
}
if ((fnode = hpfs_map_fnode(i->i_sb, i->i_ino, &bh))) {
fnode->up = new_dir->i_ino;
diff -puN fs/jffs2/dir.c~B-unlink-monitor-i_nlink-0 fs/jffs2/dir.c
--- lxc/fs/jffs2/dir.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/jffs2/dir.c 2006-08-01 16:35:15.000000000 -0700
@@ -615,7 +615,7 @@ static int jffs2_rmdir (struct inode *di
}
ret = jffs2_unlink(dir_i, dentry);
if (!ret)
- dir_i->i_nlink--;
+ inode_drop_nlink(dir_i);
return ret;
}

@@ -823,7 +823,7 @@ static int jffs2_rename (struct inode *o

if (victim_f) {
/* There was a victim. Kill it off nicely */
- new_dentry->d_inode->i_nlink--;
+ inode_drop_nlink(new_dentry->d_inode);
/* Don't oops if the victim was a dirent pointing to an
inode which didn't exist. */
if (victim_f->inocache) {
@@ -862,7 +862,7 @@ static int jffs2_rename (struct inode *o
}

if (S_ISDIR(old_dentry->d_inode->i_mode))
- old_dir_i->i_nlink--;
+ inode_drop_nlink(old_dir_i);

new_dir_i->i_mtime = new_dir_i->i_ctime = old_dir_i->i_mtime = old_dir_i->i_ctime = ITIME(now);

diff -puN fs/jfs/namei.c~B-unlink-monitor-i_nlink-0 fs/jfs/namei.c
--- lxc/fs/jfs/namei.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/jfs/namei.c 2006-08-01 16:35:15.000000000 -0700
@@ -393,9 +393,8 @@ static int jfs_rmdir(struct inode *dip,
/* update parent directory's link count corresponding
* to ".." entry of the target directory deleted
*/
- dip->i_nlink--;
dip->i_ctime = dip->i_mtime = CURRENT_TIME;
- mark_inode_dirty(dip);
+ inode_dec_link_count(dip);

/*
* OS/2 could have created EA and/or ACL
@@ -515,8 +514,7 @@ static int jfs_unlink(struct inode *dip,
mark_inode_dirty(dip);

/* update target's inode */
- ip->i_nlink--;
- mark_inode_dirty(ip);
+ inode_dec_link_count(ip);

/*
* commit zero link count object
@@ -835,7 +833,7 @@ static int jfs_link(struct dentry *old_d
rc = txCommit(tid, 2, &iplist[0], 0);

if (rc) {
- ip->i_nlink--;
+ ip->i_nlink--; /* never instantiated */
iput(ip);
} else
d_instantiate(dentry, ip);
@@ -1155,9 +1153,9 @@ static int jfs_rename(struct inode *old_
old_ip->i_ino, JFS_RENAME);
if (rc)
goto out4;
- new_ip->i_nlink--;
+ inode_drop_nlink(new_ip);
if (S_ISDIR(new_ip->i_mode)) {
- new_ip->i_nlink--;
+ inode_drop_nlink(new_ip);
if (new_ip->i_nlink) {
mutex_unlock(&JFS_IP(new_ip)->commit_mutex);
if (old_dir != new_dir)
@@ -1223,7 +1221,7 @@ static int jfs_rename(struct inode *old_
goto out4;
}
if (S_ISDIR(old_ip->i_mode)) {
- old_dir->i_nlink--;
+ inode_drop_nlink(old_dir);
if (old_dir != new_dir) {
/*
* Change inode number of parent for moved directory
diff -puN fs/libfs.c~B-unlink-monitor-i_nlink-0 fs/libfs.c
--- lxc/fs/libfs.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/libfs.c 2006-08-01 16:35:15.000000000 -0700
@@ -275,7 +275,7 @@ int simple_unlink(struct inode *dir, str
struct inode *inode = dentry->d_inode;

inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- inode->i_nlink--;
+ inode_drop_nlink(inode);
dput(dentry);
return 0;
}
@@ -285,9 +285,9 @@ int simple_rmdir(struct inode *dir, stru
if (!simple_empty(dentry))
return -ENOTEMPTY;

- dentry->d_inode->i_nlink--;
+ inode_drop_nlink(dentry->d_inode);
simple_unlink(dir, dentry);
- dir->i_nlink--;
+ inode_drop_nlink(dir);
return 0;
}

@@ -303,9 +303,9 @@ int simple_rename(struct inode *old_dir,
if (new_dentry->d_inode) {
simple_unlink(new_dir, new_dentry);
if (they_are_dirs)
- old_dir->i_nlink--;
+ inode_drop_nlink(old_dir);
} else if (they_are_dirs) {
- old_dir->i_nlink--;
+ inode_drop_nlink(old_dir);
new_dir->i_nlink++;
}

diff -puN fs/nfs/dir.c~B-unlink-monitor-i_nlink-0 fs/nfs/dir.c
--- lxc/fs/nfs/dir.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/nfs/dir.c 2006-08-01 16:35:15.000000000 -0700
@@ -842,7 +842,7 @@ static void nfs_dentry_iput(struct dentr
nfs_inode_return_delegation(inode);
if (dentry->d_flags & DCACHE_NFSFS_RENAMED) {
lock_kernel();
- inode->i_nlink--;
+ inode_drop_nlink(inode);
nfs_complete_unlink(dentry);
unlock_kernel();
}
@@ -1395,7 +1395,7 @@ static int nfs_safe_remove(struct dentry
error = NFS_PROTO(dir)->remove(dir, &dentry->d_name);
/* The VFS may want to delete this inode */
if (error == 0)
- inode->i_nlink--;
+ inode_drop_nlink(inode);
nfs_mark_for_revalidate(inode);
nfs_end_data_update(inode);
} else
@@ -1599,7 +1599,7 @@ static int nfs_rename(struct inode *old_
goto out;
}
} else
- new_inode->i_nlink--;
+ inode_drop_nlink(new_inode);

go_ahead:
/*
diff -puN fs/ocfs2/namei.c~B-unlink-monitor-i_nlink-0 fs/ocfs2/namei.c
--- lxc/fs/ocfs2/namei.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/ocfs2/namei.c 2006-08-01 16:35:15.000000000 -0700
@@ -711,7 +711,7 @@ static int ocfs2_link(struct dentry *old
err = ocfs2_journal_dirty(handle, fe_bh);
if (err < 0) {
le16_add_cpu(&fe->i_links_count, -1);
- inode->i_nlink--;
+ inode_drop_nlink(inode);
mlog_errno(err);
goto bail;
}
@@ -721,7 +721,7 @@ static int ocfs2_link(struct dentry *old
parent_fe_bh, de_bh);
if (err) {
le16_add_cpu(&fe->i_links_count, -1);
- inode->i_nlink--;
+ inode_drop_nlink(inode);
mlog_errno(err);
goto bail;
}
diff -puN fs/qnx4/namei.c~B-unlink-monitor-i_nlink-0 fs/qnx4/namei.c
--- lxc/fs/qnx4/namei.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/qnx4/namei.c 2006-08-01 16:35:15.000000000 -0700
@@ -189,8 +189,7 @@ int qnx4_rmdir(struct inode *dir, struct
inode->i_nlink = 0;
mark_inode_dirty(inode);
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME_SEC;
- dir->i_nlink--;
- mark_inode_dirty(dir);
+ inode_dec_link_count(dir);
retval = 0;

end_rmdir:
@@ -234,9 +233,8 @@ int qnx4_unlink(struct inode *dir, struc
mark_buffer_dirty(bh);
dir->i_ctime = dir->i_mtime = CURRENT_TIME_SEC;
mark_inode_dirty(dir);
- inode->i_nlink--;
inode->i_ctime = dir->i_ctime;
- mark_inode_dirty(inode);
+ inode_dec_link_count(inode);
retval = 0;

end_unlink:
diff -puN fs/reiserfs/namei.c~B-unlink-monitor-i_nlink-0 fs/reiserfs/namei.c
--- lxc/fs/reiserfs/namei.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/reiserfs/namei.c 2006-08-01 16:35:15.000000000 -0700
@@ -20,7 +20,7 @@
#include <linux/quotaops.h>

#define INC_DIR_INODE_NLINK(i) if (i->i_nlink != 1) { i->i_nlink++; if (i->i_nlink >= REISERFS_LINK_MAX) i->i_nlink=1; }
-#define DEC_DIR_INODE_NLINK(i) if (i->i_nlink != 1) i->i_nlink--;
+#define DEC_DIR_INODE_NLINK(i) if (i->i_nlink != 1) inode_drop_nlink(i);

// directory item contains array of entry headers. This performs
// binary search through that array
@@ -994,7 +994,7 @@ static int reiserfs_unlink(struct inode
inode->i_nlink = 1;
}

- inode->i_nlink--;
+ inode_drop_nlink(inode);

/*
* we schedule before doing the add_save_link call, save the link
@@ -1475,7 +1475,7 @@ static int reiserfs_rename(struct inode
if (S_ISDIR(new_dentry_inode->i_mode)) {
new_dentry_inode->i_nlink = 0;
} else {
- new_dentry_inode->i_nlink--;
+ inode_drop_nlink(new_dentry_inode);
}
new_dentry_inode->i_ctime = ctime;
savelink = new_dentry_inode->i_nlink;
diff -puN fs/sysv/namei.c~B-unlink-monitor-i_nlink-0 fs/sysv/namei.c
--- lxc/fs/sysv/namei.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/sysv/namei.c 2006-08-01 16:35:15.000000000 -0700
@@ -250,7 +250,7 @@ static int sysv_rename(struct inode * ol
sysv_set_link(new_de, new_page, old_inode);
new_inode->i_ctime = CURRENT_TIME_SEC;
if (dir_de)
- new_inode->i_nlink--;
+ inode_drop_nlink(new_inode);
inode_dec_link_count(new_inode);
} else {
if (dir_de) {
diff -puN fs/udf/namei.c~B-unlink-monitor-i_nlink-0 fs/udf/namei.c
--- lxc/fs/udf/namei.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/udf/namei.c 2006-08-01 16:35:15.000000000 -0700
@@ -878,8 +878,7 @@ static int udf_rmdir(struct inode * dir,
inode->i_nlink);
inode->i_nlink = 0;
inode->i_size = 0;
- mark_inode_dirty(inode);
- dir->i_nlink --;
+ inode_dec_link_count(inode);
inode->i_ctime = dir->i_ctime = dir->i_mtime = current_fs_time(dir->i_sb);
mark_inode_dirty(dir);

@@ -923,8 +922,7 @@ static int udf_unlink(struct inode * dir
goto end_unlink;
dir->i_ctime = dir->i_mtime = current_fs_time(dir->i_sb);
mark_inode_dirty(dir);
- inode->i_nlink--;
- mark_inode_dirty(inode);
+ inode_dec_link_count(inode);
inode->i_ctime = dir->i_ctime;
retval = 0;

@@ -1101,8 +1099,7 @@ out:
return err;

out_no_entry:
- inode->i_nlink--;
- mark_inode_dirty(inode);
+ inode_dec_link_count(inode);
iput(inode);
goto out;
}
@@ -1261,9 +1258,8 @@ static int udf_rename (struct inode * ol

if (new_inode)
{
- new_inode->i_nlink--;
new_inode->i_ctime = current_fs_time(new_inode->i_sb);
- mark_inode_dirty(new_inode);
+ inode_dec_link_count(new_inode);
}
old_dir->i_ctime = old_dir->i_mtime = current_fs_time(old_dir->i_sb);
mark_inode_dirty(old_dir);
@@ -1279,12 +1275,10 @@ static int udf_rename (struct inode * ol
}
else
mark_buffer_dirty_inode(dir_bh, old_inode);
- old_dir->i_nlink --;
- mark_inode_dirty(old_dir);
+ inode_dec_link_count(old_dir);
if (new_inode)
{
- new_inode->i_nlink --;
- mark_inode_dirty(new_inode);
+ inode_dec_link_count(new_inode);
}
else
{
diff -puN fs/ufs/namei.c~B-unlink-monitor-i_nlink-0 fs/ufs/namei.c
--- lxc/fs/ufs/namei.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/ufs/namei.c 2006-08-01 16:35:15.000000000 -0700
@@ -308,7 +308,7 @@ static int ufs_rename(struct inode *old_
ufs_set_link(new_dir, new_de, new_page, old_inode);
new_inode->i_ctime = CURRENT_TIME_SEC;
if (dir_de)
- new_inode->i_nlink--;
+ inode_drop_nlink(new_inode);
inode_dec_link_count(new_inode);
} else {
if (dir_de) {
diff -puN fs/vfat/namei.c~B-unlink-monitor-i_nlink-0 fs/vfat/namei.c
--- lxc/fs/vfat/namei.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/vfat/namei.c 2006-08-01 16:35:15.000000000 -0700
@@ -782,7 +782,7 @@ static int vfat_rmdir(struct inode *dir,
err = fat_remove_entries(dir, &sinfo); /* and releases bh */
if (err)
goto out;
- dir->i_nlink--;
+ inode_drop_nlink(dir);

inode->i_nlink = 0;
inode->i_mtime = inode->i_atime = CURRENT_TIME_SEC;
@@ -930,7 +930,7 @@ static int vfat_rename(struct inode *old
if (err)
goto error_dotdot;
}
- old_dir->i_nlink--;
+ inode_drop_nlink(old_dir);
if (!new_inode)
new_dir->i_nlink++;
}
@@ -947,10 +947,9 @@ static int vfat_rename(struct inode *old
mark_inode_dirty(old_dir);

if (new_inode) {
+ inode_drop_nlink(new_inode);
if (is_dir)
- new_inode->i_nlink -= 2;
- else
- new_inode->i_nlink--;
+ inode_drop_nlink(new_inode);
new_inode->i_ctime = ts;
}
out:
diff -puN include/linux/fs.h~B-unlink-monitor-i_nlink-0 include/linux/fs.h
--- lxc/include/linux/fs.h~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/include/linux/fs.h 2006-08-01 16:35:15.000000000 -0700
@@ -1256,9 +1256,15 @@ static inline void inode_inc_link_count(
mark_inode_dirty(inode);
}

-static inline void inode_dec_link_count(struct inode *inode)
+extern void __inode_set_awaiting_final_iput(struct inode *inode);
+static inline void inode_drop_nlink(struct inode *inode)
{
inode->i_nlink--;
+}
+
+static inline void inode_dec_link_count(struct inode *inode)
+{
+ inode_drop_nlink(inode)
mark_inode_dirty(inode);
}

diff -puN ipc/mqueue.c~B-unlink-monitor-i_nlink-0 ipc/mqueue.c
--- lxc/ipc/mqueue.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/ipc/mqueue.c 2006-08-01 16:35:15.000000000 -0700
@@ -307,7 +307,7 @@ static int mqueue_unlink(struct inode *d

dir->i_ctime = dir->i_mtime = dir->i_atime = CURRENT_TIME;
dir->i_size -= DIRENT_SIZE;
- inode->i_nlink--;
+ inode_drop_nlink(inode);
dput(dentry);
return 0;
}
diff -puN mm/shmem.c~B-unlink-monitor-i_nlink-0 mm/shmem.c
--- lxc/mm/shmem.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/mm/shmem.c 2006-08-01 16:35:15.000000000 -0700
@@ -1760,7 +1760,7 @@ static int shmem_unlink(struct inode *di

dir->i_size -= BOGO_DIRENT_SIZE;
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME;
- inode->i_nlink--;
+ inode_drop_nlink(inode);
dput(dentry); /* Undo the count from "create" - this does all the work */
return 0;
}
@@ -1770,8 +1770,8 @@ static int shmem_rmdir(struct inode *dir
if (!simple_empty(dentry))
return -ENOTEMPTY;

- dentry->d_inode->i_nlink--;
- dir->i_nlink--;
+ inode_drop_nlink(dentry->d_inode);
+ inode_drop_nlink(dir);
return shmem_unlink(dir, dentry);
}

@@ -1792,9 +1792,9 @@ static int shmem_rename(struct inode *ol
if (new_dentry->d_inode) {
(void) shmem_unlink(new_dir, new_dentry);
if (they_are_dirs)
- old_dir->i_nlink--;
+ inode_drop_nlink(old_dir);
} else if (they_are_dirs) {
- old_dir->i_nlink--;
+ inode_drop_nlink(old_dir);
new_dir->i_nlink++;
}

diff -puN fs/binfmt_misc.c~B-unlink-monitor-i_nlink-0 fs/binfmt_misc.c
diff -puN fs/jffs/inode-v23.c~B-unlink-monitor-i_nlink-0 fs/jffs/inode-v23.c
--- lxc/fs/jffs/inode-v23.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/jffs/inode-v23.c 2006-08-01 16:35:15.000000000 -0700
@@ -1052,9 +1052,8 @@ jffs_remove(struct inode *dir, struct de

dir->i_ctime = dir->i_mtime = CURRENT_TIME_SEC;
mark_inode_dirty(dir);
- inode->i_nlink--;
inode->i_ctime = dir->i_ctime;
- mark_inode_dirty(inode);
+ inode_dec_link_count(inode);

d_delete(dentry); /* This also frees the inode */

diff -puN fs/minix/namei.c~B-unlink-monitor-i_nlink-0 fs/minix/namei.c
--- lxc/fs/minix/namei.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/minix/namei.c 2006-08-01 16:35:15.000000000 -0700
@@ -249,7 +249,7 @@ static int minix_rename(struct inode * o
minix_set_link(new_de, new_page, old_inode);
new_inode->i_ctime = CURRENT_TIME_SEC;
if (dir_de)
- new_inode->i_nlink--;
+ inode_drop_nlink(new_inode);
inode_dec_link_count(new_inode);
} else {
if (dir_de) {
diff -puN fs/msdos/namei.c~B-unlink-monitor-i_nlink-0 fs/msdos/namei.c
--- lxc/fs/msdos/namei.c~B-unlink-monitor-i_nlink-0 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/msdos/namei.c 2006-08-01 16:35:15.000000000 -0700
@@ -341,7 +341,7 @@ static int msdos_rmdir(struct inode *dir
err = fat_remove_entries(dir, &sinfo); /* and releases bh */
if (err)
goto out;
- dir->i_nlink--;
+ inode_drop_nlink(dir);

inode->i_nlink = 0;
inode->i_ctime = CURRENT_TIME_SEC;
@@ -542,7 +542,7 @@ static int do_msdos_rename(struct inode
if (err)
goto error_dotdot;
}
- old_dir->i_nlink--;
+ inode_drop_nlink(old_dir);
if (!new_inode)
new_dir->i_nlink++;
}
@@ -559,10 +559,9 @@ static int do_msdos_rename(struct inode
mark_inode_dirty(old_dir);

if (new_inode) {
+ inode_drop_nlink(new_inode);
if (is_dir)
- new_inode->i_nlink -= 2;
- else
- new_inode->i_nlink--;
+ inode_drop_nlink(new_inode);
new_inode->i_ctime = ts;
}
out:
_

2006-08-01 23:53:59

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 20/28] tricky: elevate write count files are open()ed


This is the first really tricky patch in the series. It
elevates the writer count on a mount each time a
non-special file is opened for write.

This is not completely apparent in the patch because the
two if() conditions in may_open() above the
mnt_want_write() call are, combined, equivalent to
special_file().

There is also an elevated count around the vfs_create()
call in open_namei(). The count needs to be kept elevated
all the way into the may_open() call. Otherwise, when the
write is dropped, a ro->rw transisition could occur. This
would lead to having rw access on the newly created file,
while the vfsmount is ro. That is bad.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/file_table.c | 5 ++++-
lxc-dave/fs/namei.c | 22 ++++++++++++++++++----
lxc-dave/ipc/mqueue.c | 3 +++
3 files changed, 25 insertions(+), 5 deletions(-)

diff -puN fs/file_table.c~C-elevate-writers-opens-part1 fs/file_table.c
--- lxc/fs/file_table.c~C-elevate-writers-opens-part1 2006-08-01 16:35:21.000000000 -0700
+++ lxc-dave/fs/file_table.c 2006-08-01 16:35:28.000000000 -0700
@@ -179,8 +179,11 @@ void fastcall __fput(struct file *file)
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
cdev_put(inode->i_cdev);
fops_put(file->f_op);
- if (file->f_mode & FMODE_WRITE)
+ if (file->f_mode & FMODE_WRITE) {
put_write_access(inode);
+ if(!special_file(inode->i_mode))
+ mnt_drop_write(mnt);
+ }
file_kill(file);
file->f_dentry = NULL;
file->f_vfsmnt = NULL;
diff -puN fs/namei.c~C-elevate-writers-opens-part1 fs/namei.c
--- lxc/fs/namei.c~C-elevate-writers-opens-part1 2006-08-01 16:35:27.000000000 -0700
+++ lxc-dave/fs/namei.c 2006-08-01 16:35:28.000000000 -0700
@@ -1539,8 +1539,17 @@ int may_open(struct nameidata *nd, int a
return -EACCES;

flag &= ~O_TRUNC;
- } else if (IS_RDONLY(inode) && (flag & FMODE_WRITE))
- return -EROFS;
+ } else if (flag & FMODE_WRITE) {
+ /*
+ * effectively: !special_file()
+ * balanced by __fput()
+ */
+ error = mnt_want_write(nd->mnt);
+ if (error)
+ return error;
+ if (IS_RDONLY(inode))
+ return -EROFS;
+ }
/*
* An append-only file must be opened in append mode for writing.
*/
@@ -1679,14 +1688,17 @@ do_last:
}

if (IS_ERR(nd->intent.open.file)) {
- mutex_unlock(&dir->d_inode->i_mutex);
error = PTR_ERR(nd->intent.open.file);
- goto exit_dput;
+ goto exit_mutex_unlock;
}

/* Negative dentry, just create the file */
if (!path.dentry->d_inode) {
+ error = mnt_want_write(nd->mnt);
+ if (error)
+ goto exit_mutex_unlock;
error = open_namei_create(nd, &path, flag, mode);
+ mnt_drop_write(nd->mnt);
if (error)
goto exit;
return 0;
@@ -1722,6 +1734,8 @@ ok:
goto exit;
return 0;

+exit_mutex_unlock:
+ mutex_unlock(&dir->d_inode->i_mutex);
exit_dput:
dput_path(&path, nd);
exit:
diff -puN ipc/mqueue.c~C-elevate-writers-opens-part1 ipc/mqueue.c
--- lxc/ipc/mqueue.c~C-elevate-writers-opens-part1 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/ipc/mqueue.c 2006-08-01 16:35:28.000000000 -0700
@@ -685,6 +685,9 @@ asmlinkage long sys_mq_open(const char _
goto out;
filp = do_open(dentry, oflag);
} else {
+ error = mnt_want_write(mqueue_mnt);
+ if (error)
+ goto out;
filp = do_create(mqueue_mnt->mnt_root, dentry,
oflag, mode, u_attr);
}
_

2006-08-01 23:53:59

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 17/28] mount_is_safe(): add comment


This area of code is currently #ifdef'd out, so add a comment
for the time when it is actually used.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namespace.c | 4 ++++
1 files changed, 4 insertions(+)

diff -puN fs/namespace.c~C-elevate-writers-opens-part5 fs/namespace.c
--- lxc/fs/namespace.c~C-elevate-writers-opens-part5 2006-08-01 16:35:19.000000000 -0700
+++ lxc-dave/fs/namespace.c 2006-08-01 16:35:26.000000000 -0700
@@ -692,6 +692,10 @@ static int mount_is_safe(struct nameidat
if (current->uid != nd->dentry->d_inode->i_uid)
return -EPERM;
}
+ /*
+ * We will eventually check for the mnt->writer_count here,
+ * but since the code is not used now, skip it - Dave Hansen
+ */
if (vfs_permission(nd, MAY_WRITE))
return -EPERM;
return 0;
_

2006-08-01 23:54:49

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 05/28] monitor zeroing of i_nlink


Some filesystems, instead of simply decrementing i_nlink, simply zero
it during an unlink operation. We need to catch these in addition
to the decrement operations.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/autofs4/root.c | 4 ++--
lxc-dave/fs/cifs/inode.c | 2 +-
lxc-dave/fs/ext3/namei.c | 2 +-
lxc-dave/fs/fuse/dir.c | 4 ++--
lxc-dave/fs/hfs/dir.c | 2 +-
lxc-dave/fs/hfsplus/dir.c | 4 ++--
lxc-dave/fs/hpfs/namei.c | 4 ++--
lxc-dave/fs/jfs/namei.c | 2 +-
lxc-dave/fs/libfs.c | 1 -
lxc-dave/fs/msdos/namei.c | 4 ++--
lxc-dave/fs/nfs/dir.c | 2 +-
lxc-dave/fs/qnx4/namei.c | 2 +-
lxc-dave/fs/reiserfs/namei.c | 4 ++--
lxc-dave/fs/udf/namei.c | 2 +-
lxc-dave/fs/vfat/namei.c | 4 ++--
lxc-dave/include/linux/fs.h | 2 +-
16 files changed, 22 insertions(+), 23 deletions(-)

diff -puN fs/libfs.c~monitor-clear-i_nlink fs/libfs.c
--- lxc/fs/libfs.c~monitor-clear-i_nlink 2006-08-01 16:35:16.000000000 -0700
+++ lxc-dave/fs/libfs.c 2006-08-01 16:35:17.000000000 -0700
@@ -273,7 +273,6 @@ out:
void __inode_set_awaiting_final_iput(struct inode *inode)
{
}
-EXPORT_SYMBOL_GPL(inode_drop_nlink);

int simple_unlink(struct inode *dir, struct dentry *dentry)
{
diff -puN fs/cifs/inode.c~monitor-clear-i_nlink fs/cifs/inode.c
--- lxc/fs/cifs/inode.c~monitor-clear-i_nlink 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/fs/cifs/inode.c 2006-08-01 16:35:17.000000000 -0700
@@ -819,7 +819,7 @@ int cifs_rmdir(struct inode *inode, stru
if (!rc) {
inode_drop_nlink(inode);
i_size_write(direntry->d_inode,0);
- direntry->d_inode->i_nlink = 0;
+ inode_clear_nlink(direntry->d_inode);
}

cifsInode = CIFS_I(direntry->d_inode);
diff -puN fs/ext3/namei.c~monitor-clear-i_nlink fs/ext3/namei.c
--- lxc/fs/ext3/namei.c~monitor-clear-i_nlink 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/fs/ext3/namei.c 2006-08-01 16:35:17.000000000 -0700
@@ -2044,7 +2044,7 @@ static int ext3_rmdir (struct inode * di
"empty directory has nlink!=2 (%d)",
inode->i_nlink);
inode->i_version++;
- inode->i_nlink = 0;
+ inode_clear_nlink(inode);
/* There's no need to set i_disksize: the fact that i_nlink is
* zero will ensure that the right thing happens during any
* recovery. */
diff -puN fs/autofs4/root.c~monitor-clear-i_nlink fs/autofs4/root.c
--- lxc/fs/autofs4/root.c~monitor-clear-i_nlink 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/fs/autofs4/root.c 2006-08-01 16:35:17.000000000 -0700
@@ -638,7 +638,7 @@ static int autofs4_dir_unlink(struct ino
dput(ino->dentry);

dentry->d_inode->i_size = 0;
- dentry->d_inode->i_nlink = 0;
+ inode_clear_nlink(dentry->d_inode);

dir->i_mtime = CURRENT_TIME;

@@ -673,7 +673,7 @@ static int autofs4_dir_rmdir(struct inod
}
dput(ino->dentry);
dentry->d_inode->i_size = 0;
- dentry->d_inode->i_nlink = 0;
+ inode_clear_nlink(dentry->d_inode);

if (dir->i_nlink)
inode_drop_nlink(dir);
diff -puN fs/fuse/dir.c~monitor-clear-i_nlink fs/fuse/dir.c
--- lxc/fs/fuse/dir.c~monitor-clear-i_nlink 2006-08-01 16:35:11.000000000 -0700
+++ lxc-dave/fs/fuse/dir.c 2006-08-01 16:35:17.000000000 -0700
@@ -477,7 +477,7 @@ static int fuse_unlink(struct inode *dir
/* Set nlink to zero so the inode can be cleared, if
the inode does have more links this will be
discovered at the next lookup/getattr */
- inode->i_nlink = 0;
+ inode_clear_nlink(inode);
fuse_invalidate_attr(inode);
fuse_invalidate_attr(dir);
fuse_invalidate_entry_cache(entry);
@@ -503,7 +503,7 @@ static int fuse_rmdir(struct inode *dir,
err = req->out.h.error;
fuse_put_request(fc, req);
if (!err) {
- entry->d_inode->i_nlink = 0;
+ inode_clear_nlink(entry->d_inode);
fuse_invalidate_attr(dir);
fuse_invalidate_entry_cache(entry);
} else if (err == -EINTR)
diff -puN fs/hfs/dir.c~monitor-clear-i_nlink fs/hfs/dir.c
--- lxc/fs/hfs/dir.c~monitor-clear-i_nlink 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/fs/hfs/dir.c 2006-08-01 16:35:17.000000000 -0700
@@ -273,7 +273,7 @@ static int hfs_rmdir(struct inode *dir,
res = hfs_cat_delete(inode->i_ino, dir, &dentry->d_name);
if (res)
return res;
- inode->i_nlink = 0;
+ inode_clear_nlink(inode);
inode->i_ctime = CURRENT_TIME_SEC;
hfs_delete_inode(inode);
mark_inode_dirty(inode);
diff -puN fs/hfsplus/dir.c~monitor-clear-i_nlink fs/hfsplus/dir.c
--- lxc/fs/hfsplus/dir.c~monitor-clear-i_nlink 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/fs/hfsplus/dir.c 2006-08-01 16:35:17.000000000 -0700
@@ -348,7 +348,7 @@ static int hfsplus_unlink(struct inode *
} else
inode->i_flags |= S_DEAD;
} else
- inode->i_nlink = 0;
+ inode_clear_nlink(inode);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);

@@ -387,7 +387,7 @@ static int hfsplus_rmdir(struct inode *d
res = hfsplus_delete_cat(inode->i_ino, dir, &dentry->d_name);
if (res)
return res;
- inode->i_nlink = 0;
+ inode_clear_nlink(inode);
inode->i_ctime = CURRENT_TIME_SEC;
hfsplus_delete_inode(inode);
mark_inode_dirty(inode);
diff -puN fs/hpfs/namei.c~monitor-clear-i_nlink fs/hpfs/namei.c
--- lxc/fs/hpfs/namei.c~monitor-clear-i_nlink 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/fs/hpfs/namei.c 2006-08-01 16:35:17.000000000 -0700
@@ -495,7 +495,7 @@ static int hpfs_rmdir(struct inode *dir,
break;
default:
inode_drop_nlink(dir);
- inode->i_nlink = 0;
+ inode_clear_nlink(inode);
err = 0;
}
goto out;
@@ -590,7 +590,7 @@ static int hpfs_rename(struct inode *old
int r;
if ((r = hpfs_remove_dirent(old_dir, dno, dep, &qbh, 1)) != 2) {
if ((nde = map_dirent(new_dir, hpfs_i(new_dir)->i_dno, (char *)new_name, new_len, NULL, &qbh1))) {
- new_inode->i_nlink = 0;
+ inode_clear_nlink(new_inode);
copy_de(nde, &de);
memcpy(nde->name, new_name, new_len);
hpfs_mark_4buffers_dirty(&qbh1);
diff -puN fs/jfs/namei.c~monitor-clear-i_nlink fs/jfs/namei.c
--- lxc/fs/jfs/namei.c~monitor-clear-i_nlink 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/fs/jfs/namei.c 2006-08-01 16:35:17.000000000 -0700
@@ -414,7 +414,7 @@ static int jfs_rmdir(struct inode *dip,
JFS_IP(ip)->acl.flag = 0;

/* mark the target directory as deleted */
- ip->i_nlink = 0;
+ inode_clear_nlink(ip);
mark_inode_dirty(ip);

rc = txCommit(tid, 2, &iplist[0], 0);
diff -puN fs/msdos/namei.c~monitor-clear-i_nlink fs/msdos/namei.c
--- lxc/fs/msdos/namei.c~monitor-clear-i_nlink 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/fs/msdos/namei.c 2006-08-01 16:35:17.000000000 -0700
@@ -343,7 +343,7 @@ static int msdos_rmdir(struct inode *dir
goto out;
inode_drop_nlink(dir);

- inode->i_nlink = 0;
+ inode_clear_nlink(inode);
inode->i_ctime = CURRENT_TIME_SEC;
fat_detach(inode);
out:
@@ -425,7 +425,7 @@ static int msdos_unlink(struct inode *di
err = fat_remove_entries(dir, &sinfo); /* and releases bh */
if (err)
goto out;
- inode->i_nlink = 0;
+ inode_clear_nlink(inode);
inode->i_ctime = CURRENT_TIME_SEC;
fat_detach(inode);
out:
diff -puN fs/nfs/dir.c~monitor-clear-i_nlink fs/nfs/dir.c
--- lxc/fs/nfs/dir.c~monitor-clear-i_nlink 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/fs/nfs/dir.c 2006-08-01 16:35:17.000000000 -0700
@@ -1280,7 +1280,7 @@ static int nfs_rmdir(struct inode *dir,
error = NFS_PROTO(dir)->rmdir(dir, &dentry->d_name);
/* Ensure the VFS deletes this inode */
if (error == 0 && dentry->d_inode != NULL)
- dentry->d_inode->i_nlink = 0;
+ inode_clear_nlink(dentry->d_inode);
nfs_end_data_update(dir);
unlock_kernel();

diff -puN fs/ocfs2/namei.c~monitor-clear-i_nlink fs/ocfs2/namei.c
diff -puN fs/qnx4/namei.c~monitor-clear-i_nlink fs/qnx4/namei.c
--- lxc/fs/qnx4/namei.c~monitor-clear-i_nlink 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/fs/qnx4/namei.c 2006-08-01 16:35:17.000000000 -0700
@@ -186,7 +186,7 @@ int qnx4_rmdir(struct inode *dir, struct
memset(de->di_fname, 0, sizeof de->di_fname);
de->di_mode = 0;
mark_buffer_dirty(bh);
- inode->i_nlink = 0;
+ inode_clear_nlink(inode);
mark_inode_dirty(inode);
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME_SEC;
inode_dec_link_count(dir);
diff -puN fs/reiserfs/namei.c~monitor-clear-i_nlink fs/reiserfs/namei.c
--- lxc/fs/reiserfs/namei.c~monitor-clear-i_nlink 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/fs/reiserfs/namei.c 2006-08-01 16:35:17.000000000 -0700
@@ -913,7 +913,7 @@ static int reiserfs_rmdir(struct inode *
reiserfs_warning(inode->i_sb, "%s: empty directory has nlink "
"!= 2 (%d)", __FUNCTION__, inode->i_nlink);

- inode->i_nlink = 0;
+ inode_clear_nlink(inode);
inode->i_ctime = dir->i_ctime = dir->i_mtime = CURRENT_TIME_SEC;
reiserfs_update_sd(&th, inode);

@@ -1473,7 +1473,7 @@ static int reiserfs_rename(struct inode
if (new_dentry_inode) {
// adjust link number of the victim
if (S_ISDIR(new_dentry_inode->i_mode)) {
- new_dentry_inode->i_nlink = 0;
+ inode_clear_nlink(new_dentry_inode);
} else {
inode_drop_nlink(new_dentry_inode);
}
diff -puN fs/udf/namei.c~monitor-clear-i_nlink fs/udf/namei.c
--- lxc/fs/udf/namei.c~monitor-clear-i_nlink 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/fs/udf/namei.c 2006-08-01 16:35:17.000000000 -0700
@@ -876,7 +876,7 @@ static int udf_rmdir(struct inode * dir,
udf_warning(inode->i_sb, "udf_rmdir",
"empty directory has nlink != 2 (%d)",
inode->i_nlink);
- inode->i_nlink = 0;
+ inode_clear_nlink(inode);
inode->i_size = 0;
inode_dec_link_count(inode);
inode->i_ctime = dir->i_ctime = dir->i_mtime = current_fs_time(dir->i_sb);
diff -puN fs/vfat/namei.c~monitor-clear-i_nlink fs/vfat/namei.c
--- lxc/fs/vfat/namei.c~monitor-clear-i_nlink 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/fs/vfat/namei.c 2006-08-01 16:35:17.000000000 -0700
@@ -784,7 +784,7 @@ static int vfat_rmdir(struct inode *dir,
goto out;
inode_drop_nlink(dir);

- inode->i_nlink = 0;
+ inode_clear_nlink(inode);
inode->i_mtime = inode->i_atime = CURRENT_TIME_SEC;
fat_detach(inode);
out:
@@ -808,7 +808,7 @@ static int vfat_unlink(struct inode *dir
err = fat_remove_entries(dir, &sinfo); /* and releases bh */
if (err)
goto out;
- inode->i_nlink = 0;
+ inode_clear_nlink(inode);
inode->i_mtime = inode->i_atime = CURRENT_TIME_SEC;
fat_detach(inode);
out:
diff -puN include/linux/fs.h~monitor-clear-i_nlink include/linux/fs.h
--- lxc/include/linux/fs.h~monitor-clear-i_nlink 2006-08-01 16:35:16.000000000 -0700
+++ lxc-dave/include/linux/fs.h 2006-08-01 16:35:17.000000000 -0700
@@ -1273,7 +1273,7 @@ static inline void inode_drop_nlink(stru

static inline void inode_dec_link_count(struct inode *inode)
{
- inode_drop_nlink(inode)
+ inode_drop_nlink(inode);
mark_inode_dirty(inode);
}

_

2006-08-01 23:55:09

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 18/28] unix_find_other() elevate write count for touch_atime()



Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/net/unix/af_unix.c | 16 ++++++++++++----
1 files changed, 12 insertions(+), 4 deletions(-)

diff -puN net/unix/af_unix.c~C-elevate-writers-opens-part4 net/unix/af_unix.c
--- lxc/net/unix/af_unix.c~C-elevate-writers-opens-part4 2006-08-01 16:35:09.000000000 -0700
+++ lxc-dave/net/unix/af_unix.c 2006-08-01 16:35:27.000000000 -0700
@@ -708,21 +708,27 @@ static struct sock *unix_find_other(stru
err = path_lookup(sunname->sun_path, LOOKUP_FOLLOW, &nd);
if (err)
goto fail;
+
+ err = mnt_want_write(nd.mnt);
+ if (err)
+ goto put_path_fail;
+
err = vfs_permission(&nd, MAY_WRITE);
if (err)
- goto put_fail;
+ goto mnt_drop_write_fail;

err = -ECONNREFUSED;
if (!S_ISSOCK(nd.dentry->d_inode->i_mode))
- goto put_fail;
+ goto mnt_drop_write_fail;
u=unix_find_socket_byinode(nd.dentry->d_inode);
if (!u)
- goto put_fail;
+ goto mnt_drop_write_fail;

if (u->sk_type == type)
touch_atime(nd.mnt, nd.dentry);

path_release(&nd);
+ mnt_drop_write(nd.mnt);

err=-EPROTOTYPE;
if (u->sk_type != type) {
@@ -742,7 +748,9 @@ static struct sock *unix_find_other(stru
}
return u;

-put_fail:
+mnt_drop_write_fail:
+ mnt_drop_write(nd.mnt);
+put_path_fail:
path_release(&nd);
fail:
*error=err;
_

2006-08-01 23:53:59

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 19/28] elevate write count over calls to vfs_rename()


This does create a little helper in the NFS code to
make an if() a little bit less ugly.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namei.c | 4 ++++
lxc-dave/fs/nfsd/vfs.c | 25 ++++++++++++++++++++-----
2 files changed, 24 insertions(+), 5 deletions(-)

diff -puN fs/namei.c~C-elevate-writers-vfs_rename-part1 fs/namei.c
--- lxc/fs/namei.c~C-elevate-writers-vfs_rename-part1 2006-08-01 16:35:25.000000000 -0700
+++ lxc-dave/fs/namei.c 2006-08-01 16:35:27.000000000 -0700
@@ -2557,8 +2557,12 @@ static int do_rename(int olddfd, const c
if (new_dentry == trap)
goto exit5;

+ error = mnt_want_write(oldnd.mnt);
+ if (error)
+ goto exit5;
error = vfs_rename(old_dir->d_inode, old_dentry,
new_dir->d_inode, new_dentry);
+ mnt_drop_write(oldnd.mnt);
exit5:
dput(new_dentry);
exit4:
diff -puN fs/nfsd/vfs.c~C-elevate-writers-vfs_rename-part1 fs/nfsd/vfs.c
--- lxc/fs/nfsd/vfs.c~C-elevate-writers-vfs_rename-part1 2006-08-01 16:35:09.000000000 -0700
+++ lxc-dave/fs/nfsd/vfs.c 2006-08-01 16:35:27.000000000 -0700
@@ -1533,6 +1533,14 @@ out_nfserr:
goto out_unlock;
}

+static inline int svc_msnfs(struct svc_fh *ffhp)
+{
+#ifdef MSNFS
+ return (ffhp->fh_export->ex_flags & NFSEXP_MSNFS);
+#else
+ return 0;
+#endif
+}
/*
* Rename a file
* N.B. After this call _both_ ffhp and tfhp need an fh_put
@@ -1593,20 +1601,27 @@ nfsd_rename(struct svc_rqst *rqstp, stru
if (ndentry == trap)
goto out_dput_new;

-#ifdef MSNFS
- if ((ffhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
+ if (svc_msnfs(ffhp) &&
((atomic_read(&odentry->d_count) > 1)
|| (atomic_read(&ndentry->d_count) > 1))) {
err = -EPERM;
- } else
-#endif
+ goto out_dput_new;
+ }
+
+ err = -EXDEV;
+ if (ffhp->fh_export->ex_mnt != tfhp->fh_export->ex_mnt)
+ goto out_dput_new;
+ err = mnt_want_write(ffhp->fh_export->ex_mnt);
+ if (err)
+ goto out_dput_new;
+
err = vfs_rename(fdir, odentry, tdir, ndentry);
if (!err && EX_ISSYNC(tfhp->fh_export)) {
err = nfsd_sync_dir(tdentry);
if (!err)
err = nfsd_sync_dir(fdentry);
}
-
+ mnt_drop_write(ffhp->fh_export->ex_mnt);
out_dput_new:
dput(ndentry);
out_dput_old:
_

2006-08-01 23:55:09

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 01/28] prepare for write access checks: collapse if()


We're shortly going to be adding a bunch more permission
checks in these functions. That requires adding either a
bunch of new if() conditions, or some gotos. This patch
collapses existing if()s and uses gotos instead to
prepare for the upcoming changes.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namei.c | 93 +++++++++++++++++++++++++++-------------------------
lxc-dave/fs/open.c | 64 ++++++++++++++++++++---------------
2 files changed, 87 insertions(+), 70 deletions(-)

diff -puN fs/namei.c~B-prepwork-collapse-ifs fs/namei.c
--- lxc/fs/namei.c~B-prepwork-collapse-ifs 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/namei.c 2006-08-01 16:35:13.000000000 -0700
@@ -1924,30 +1924,32 @@ asmlinkage long sys_mkdirat(int dfd, con
{
int error = 0;
char * tmp;
+ struct dentry *dentry;
+ struct nameidata nd;

tmp = getname(pathname);
error = PTR_ERR(tmp);
- if (!IS_ERR(tmp)) {
- struct dentry *dentry;
- struct nameidata nd;
+ if (IS_ERR(tmp))
+ goto out_err;

- error = do_path_lookup(dfd, tmp, LOOKUP_PARENT, &nd);
- if (error)
- goto out;
- dentry = lookup_create(&nd, 1);
- error = PTR_ERR(dentry);
- if (!IS_ERR(dentry)) {
- if (!IS_POSIXACL(nd.dentry->d_inode))
- mode &= ~current->fs->umask;
- error = vfs_mkdir(nd.dentry->d_inode, dentry, mode);
- dput(dentry);
- }
- mutex_unlock(&nd.dentry->d_inode->i_mutex);
- path_release(&nd);
-out:
- putname(tmp);
- }
+ error = do_path_lookup(dfd, tmp, LOOKUP_PARENT, &nd);
+ if (error)
+ goto out;
+ dentry = lookup_create(&nd, 1);
+ error = PTR_ERR(dentry);
+ if (IS_ERR(dentry))
+ goto out_unlock;

+ if (!IS_POSIXACL(nd.dentry->d_inode))
+ mode &= ~current->fs->umask;
+ error = vfs_mkdir(nd.dentry->d_inode, dentry, mode);
+ dput(dentry);
+out_unlock:
+ mutex_unlock(&nd.dentry->d_inode->i_mutex);
+ path_release(&nd);
+out:
+ putname(tmp);
+out_err:
return error;
}

@@ -2046,10 +2048,11 @@ static long do_rmdir(int dfd, const char
mutex_lock_nested(&nd.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
dentry = lookup_hash(&nd);
error = PTR_ERR(dentry);
- if (!IS_ERR(dentry)) {
- error = vfs_rmdir(nd.dentry->d_inode, dentry);
- dput(dentry);
- }
+ if (IS_ERR(dentry))
+ goto exit2;
+ error = vfs_rmdir(nd.dentry->d_inode, dentry);
+ dput(dentry);
+exit2:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
exit1:
path_release(&nd);
@@ -2189,30 +2192,33 @@ asmlinkage long sys_symlinkat(const char
int error = 0;
char * from;
char * to;
+ struct dentry *dentry;
+ struct nameidata nd;

from = getname(oldname);
if(IS_ERR(from))
return PTR_ERR(from);
to = getname(newname);
error = PTR_ERR(to);
- if (!IS_ERR(to)) {
- struct dentry *dentry;
- struct nameidata nd;
+ if (IS_ERR(to))
+ goto out_putname;

- error = do_path_lookup(newdfd, to, LOOKUP_PARENT, &nd);
- if (error)
- goto out;
- dentry = lookup_create(&nd, 0);
- error = PTR_ERR(dentry);
- if (!IS_ERR(dentry)) {
- error = vfs_symlink(nd.dentry->d_inode, dentry, from, S_IALLUGO);
- dput(dentry);
- }
- mutex_unlock(&nd.dentry->d_inode->i_mutex);
- path_release(&nd);
+ error = do_path_lookup(newdfd, to, LOOKUP_PARENT, &nd);
+ if (error)
+ goto out;
+ dentry = lookup_create(&nd, 0);
+ error = PTR_ERR(dentry);
+ if (IS_ERR(dentry))
+ goto out_unlock;
+
+ error = vfs_symlink(nd.dentry->d_inode, dentry, from, S_IALLUGO);
+ dput(dentry);
+out_unlock:
+ mutex_unlock(&nd.dentry->d_inode->i_mutex);
+ path_release(&nd);
out:
- putname(to);
- }
+ putname(to);
+out_putname:
putname(from);
return error;
}
@@ -2298,10 +2304,11 @@ asmlinkage long sys_linkat(int olddfd, c
goto out_release;
new_dentry = lookup_create(&nd, 0);
error = PTR_ERR(new_dentry);
- if (!IS_ERR(new_dentry)) {
- error = vfs_link(old_nd.dentry, nd.dentry->d_inode, new_dentry);
- dput(new_dentry);
- }
+ if (IS_ERR(new_dentry))
+ goto out_unlock;
+ error = vfs_link(old_nd.dentry, nd.dentry->d_inode, new_dentry);
+ dput(new_dentry);
+out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
out_release:
path_release(&nd);
diff -puN fs/open.c~B-prepwork-collapse-ifs fs/open.c
--- lxc/fs/open.c~B-prepwork-collapse-ifs 2006-08-01 16:35:12.000000000 -0700
+++ lxc-dave/fs/open.c 2006-08-01 16:35:13.000000000 -0700
@@ -520,15 +520,21 @@ asmlinkage long sys_faccessat(int dfd, c
current->cap_effective = current->cap_permitted;

res = __user_walk_fd(dfd, filename, LOOKUP_FOLLOW|LOOKUP_ACCESS, &nd);
- if (!res) {
- res = vfs_permission(&nd, mode);
- /* SuS v2 requires we report a read only fs too */
- if(!res && (mode & S_IWOTH) && IS_RDONLY(nd.dentry->d_inode)
- && !special_file(nd.dentry->d_inode->i_mode))
- res = -EROFS;
- path_release(&nd);
- }
+ if (res)
+ goto out;
+
+ res = vfs_permission(&nd, mode);
+ /* SuS v2 requires we report a read only fs too */
+ if(res || !(mode & S_IWOTH) ||
+ special_file(nd.dentry->d_inode->i_mode))
+ goto out_path_release;
+
+ if(IS_RDONLY(nd.dentry->d_inode))
+ res = -EROFS;

+out_path_release:
+ path_release(&nd);
+out:
current->fsuid = old_fsuid;
current->fsgid = old_fsgid;
current->cap_effective = old_cap;
@@ -736,10 +742,11 @@ asmlinkage long sys_chown(const char __u
int error;

error = user_path_walk(filename, &nd);
- if (!error) {
- error = chown_common(nd.dentry, user, group);
- path_release(&nd);
- }
+ if (error)
+ goto out;
+ error = chown_common(nd.dentry, user, group);
+ path_release(&nd);
+out:
return error;
}

@@ -755,10 +762,10 @@ asmlinkage long sys_fchownat(int dfd, co

follow = (flag & AT_SYMLINK_NOFOLLOW) ? 0 : LOOKUP_FOLLOW;
error = __user_walk_fd(dfd, filename, follow, &nd);
- if (!error) {
- error = chown_common(nd.dentry, user, group);
- path_release(&nd);
- }
+ if (error)
+ goto out;
+ error = chown_common(nd.dentry, user, group);
+ path_release(&nd);
out:
return error;
}
@@ -769,10 +776,11 @@ asmlinkage long sys_lchown(const char __
int error;

error = user_path_walk_link(filename, &nd);
- if (!error) {
- error = chown_common(nd.dentry, user, group);
- path_release(&nd);
- }
+ if (error)
+ goto out;
+ error = chown_common(nd.dentry, user, group);
+ path_release(&nd);
+out:
return error;
}

@@ -781,15 +789,17 @@ asmlinkage long sys_fchown(unsigned int
{
struct file * file;
int error = -EBADF;
+ struct dentry * dentry;

file = fget(fd);
- if (file) {
- struct dentry * dentry;
- dentry = file->f_dentry;
- audit_inode(NULL, dentry->d_inode);
- error = chown_common(dentry, user, group);
- fput(file);
- }
+ if (!file)
+ goto out;
+
+ dentry = file->f_dentry;
+ audit_inode(NULL, dentry->d_inode);
+ error = chown_common(dentry, user, group);
+ fput(file);
+out:
return error;
}

_

2006-08-01 23:54:04

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 24/28] sys_mknodat(): elevate write count for vfs_mknod/create()


This takes care of all of the direct callers of vfs_mknod().
Since a few of these cases also handle normal file creation
as well, this also covers some calls to vfs_create().

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namei.c | 12 ++++++++++++
lxc-dave/fs/nfsd/vfs.c | 4 ++++
lxc-dave/net/unix/af_unix.c | 4 ++++
3 files changed, 20 insertions(+)

diff -puN fs/namei.c~C-elevate-writers-vfs_mknod-try2 fs/namei.c
--- lxc/fs/namei.c~C-elevate-writers-vfs_mknod-try2 2006-08-01 16:35:28.000000000 -0700
+++ lxc-dave/fs/namei.c 2006-08-01 16:35:31.000000000 -0700
@@ -1892,14 +1892,26 @@ asmlinkage long sys_mknodat(int dfd, con
if (!IS_ERR(dentry)) {
switch (mode & S_IFMT) {
case 0: case S_IFREG:
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ break;
error = vfs_create(nd.dentry->d_inode,dentry,mode,&nd);
+ mnt_drop_write(nd.mnt);
break;
case S_IFCHR: case S_IFBLK:
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ break;
error = vfs_mknod(nd.dentry->d_inode,dentry,mode,
new_decode_dev(dev));
+ mnt_drop_write(nd.mnt);
break;
case S_IFIFO: case S_IFSOCK:
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ break;
error = vfs_mknod(nd.dentry->d_inode,dentry,mode,0);
+ mnt_drop_write(nd.mnt);
break;
case S_IFDIR:
error = -EPERM;
diff -puN fs/nfsd/vfs.c~C-elevate-writers-vfs_mknod-try2 fs/nfsd/vfs.c
--- lxc/fs/nfsd/vfs.c~C-elevate-writers-vfs_mknod-try2 2006-08-01 16:35:27.000000000 -0700
+++ lxc-dave/fs/nfsd/vfs.c 2006-08-01 16:35:31.000000000 -0700
@@ -1155,6 +1155,9 @@ nfsd_create(struct svc_rqst *rqstp, stru
/*
* Get the dir op function pointer.
*/
+ err = mnt_want_write(fhp->fh_export->ex_mnt);
+ if (err)
+ goto out_nfserr;
err = nfserr_perm;
switch (type) {
case S_IFREG:
@@ -1173,6 +1176,7 @@ nfsd_create(struct svc_rqst *rqstp, stru
printk("nfsd: bad file type %o in nfsd_create\n", type);
err = -EINVAL;
}
+ mnt_drop_write(fhp->fh_export->ex_mnt);
if (err < 0)
goto out_nfserr;

diff -puN net/unix/af_unix.c~C-elevate-writers-vfs_mknod-try2 net/unix/af_unix.c
--- lxc/net/unix/af_unix.c~C-elevate-writers-vfs_mknod-try2 2006-08-01 16:35:27.000000000 -0700
+++ lxc-dave/net/unix/af_unix.c 2006-08-01 16:35:31.000000000 -0700
@@ -821,7 +821,11 @@ static int unix_bind(struct socket *sock
*/
mode = S_IFSOCK |
(SOCK_INODE(sock)->i_mode & ~current->fs->umask);
+ err = mnt_want_write(nd.mnt);
+ if (err)
+ goto out_mknod_dput;
err = vfs_mknod(nd.dentry->d_inode, dentry, mode, 0);
+ mnt_drop_write(nd.mnt);
if (err)
goto out_mknod_dput;
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_

2006-08-01 23:56:51

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 21/28] elevate writer count for do_sys_truncate()



Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/open.c | 16 +++++++++++-----
1 files changed, 11 insertions(+), 5 deletions(-)

diff -puN fs/open.c~C-elevate-writers-opens-part2-do_sys_truncate fs/open.c
--- lxc/fs/open.c~C-elevate-writers-opens-part2-do_sys_truncate 2006-08-01 16:35:22.000000000 -0700
+++ lxc-dave/fs/open.c 2006-08-01 16:35:29.000000000 -0700
@@ -244,28 +244,32 @@ static long do_sys_truncate(const char _
if (!S_ISREG(inode->i_mode))
goto dput_and_out;

- error = vfs_permission(&nd, MAY_WRITE);
+ error = mnt_want_write(nd.mnt);
if (error)
goto dput_and_out;

+ error = vfs_permission(&nd, MAY_WRITE);
+ if (error)
+ goto mnt_drop_write_and_out;
+
error = -EROFS;
if (IS_RDONLY(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;

error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;

/*
* Make sure that there are no leases.
*/
error = break_lease(inode, FMODE_WRITE);
if (error)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;

error = get_write_access(inode);
if (error)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;

error = locks_verify_truncate(inode, NULL, length);
if (!error) {
@@ -274,6 +278,8 @@ static long do_sys_truncate(const char _
}
put_write_access(inode);

+mnt_drop_write_and_out:
+ mnt_drop_write(nd.mnt);
dput_and_out:
path_release(&nd);
out:
_

2006-08-01 23:56:51

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 10/28] increment sb writer count when nlink hits zero


When a file is unlinked, there will soon be a write to the
filesystem. Note this, and disallow remounts to r/o during
the time when this write is pending.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/libfs.c | 1 +
1 files changed, 1 insertion(+)

diff -puN fs/libfs.c~C-inc-sb-writer-count-on-dec-nlink-to-zero fs/libfs.c
--- lxc/fs/libfs.c~C-inc-sb-writer-count-on-dec-nlink-to-zero 2006-08-01 16:35:20.000000000 -0700
+++ lxc-dave/fs/libfs.c 2006-08-01 16:35:21.000000000 -0700
@@ -273,6 +273,7 @@ out:
void __inode_set_awaiting_final_iput(struct inode *inode)
{
inode->i_state |= I_AWAITING_FINAL_IPUT;
+ atomic_inc(&inode->i_sb->s_mnt_writers);
}

int simple_unlink(struct inode *dir, struct dentry *dentry)
_

2006-08-01 23:56:50

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 26/28] do_rmdir(): elevate write count


Elevate the write count during the vfs_rmdir() call.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namei.c | 5 +++++
1 files changed, 5 insertions(+)

diff -puN fs/namei.c~C-rmdir1 fs/namei.c
--- lxc/fs/namei.c~C-rmdir1 2006-08-01 16:35:32.000000000 -0700
+++ lxc-dave/fs/namei.c 2006-08-01 16:35:32.000000000 -0700
@@ -2091,7 +2091,12 @@ static long do_rmdir(int dfd, const char
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto exit2;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto exit3;
error = vfs_rmdir(nd.dentry->d_inode, dentry);
+ mnt_drop_write(nd.mnt);
+exit3:
dput(dentry);
exit2:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_

2006-08-01 23:58:09

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 22/28] elevate write count for do_utimes()



Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/open.c | 13 +++++++++----
1 files changed, 9 insertions(+), 4 deletions(-)

diff -puN fs/open.c~C-elevate-writers-opens-part2-do_utimes fs/open.c
--- lxc/fs/open.c~C-elevate-writers-opens-part2-do_utimes 2006-08-01 16:35:29.000000000 -0700
+++ lxc-dave/fs/open.c 2006-08-01 16:35:29.000000000 -0700
@@ -441,16 +441,19 @@ long do_utimes(int dfd, char __user *fil
goto out;
inode = nd.dentry->d_inode;

+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto dput_and_out;
error = -EROFS;
if (IS_RDONLY(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;

/* Don't worry, the checks are done in inode_change_ok() */
newattrs.ia_valid = ATTR_CTIME | ATTR_MTIME | ATTR_ATIME;
if (times) {
error = -EPERM;
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;

newattrs.ia_atime.tv_sec = times[0].tv_sec;
newattrs.ia_atime.tv_nsec = times[0].tv_usec * 1000;
@@ -460,15 +463,17 @@ long do_utimes(int dfd, char __user *fil
} else {
error = -EACCES;
if (IS_IMMUTABLE(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;

if (current->fsuid != inode->i_uid &&
(error = vfs_permission(&nd, MAY_WRITE)) != 0)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
}
mutex_lock(&inode->i_mutex);
error = notify_change(nd.dentry, &newattrs);
mutex_unlock(&inode->i_mutex);
+mnt_drop_write_and_out:
+ mnt_drop_write(nd.mnt);
dput_and_out:
path_release(&nd);
out:
_

2006-08-01 23:58:37

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 27/28] elevate writer count for custom 'struct file'


Some filesystems forego the vfs and may_open() and create their
own 'struct file's. Any of these users which set the write flag
on the file will cause an extra mnt_drop_write() on __fput(),
thus dropping the reference count too low.

These users tend to have artifical in-kernel vfsmounts which
aren't really exposed to userspace and can't be remounted, but
this patch is included for completeness and so that the warnings
don't trip over these cases.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/hugetlbfs/inode.c | 2 ++
lxc-dave/mm/shmem.c | 2 ++
lxc-dave/mm/tiny-shmem.c | 2 ++
lxc-dave/net/socket.c | 3 +++
4 files changed, 9 insertions(+)

diff -puN fs/hugetlbfs/inode.c~C-elevate-writer-count-for-custom-struct_file fs/hugetlbfs/inode.c
--- lxc/fs/hugetlbfs/inode.c~C-elevate-writer-count-for-custom-struct_file 2006-08-01 16:35:07.000000000 -0700
+++ lxc-dave/fs/hugetlbfs/inode.c 2006-08-01 16:35:33.000000000 -0700
@@ -787,6 +787,8 @@ struct file *hugetlb_zero_setup(size_t s
file->f_mapping = inode->i_mapping;
file->f_op = &hugetlbfs_file_operations;
file->f_mode = FMODE_WRITE | FMODE_READ;
+ error = mnt_want_write(hugetlbfs_vfsmount);
+ WARN_ON(error);
return file;

out_inode:
diff -puN mm/shmem.c~C-elevate-writer-count-for-custom-struct_file mm/shmem.c
--- lxc/mm/shmem.c~C-elevate-writer-count-for-custom-struct_file 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/mm/shmem.c 2006-08-01 16:35:33.000000000 -0700
@@ -2322,6 +2322,8 @@ struct file *shmem_file_setup(char *name
file->f_mapping = inode->i_mapping;
file->f_op = &shmem_file_operations;
file->f_mode = FMODE_WRITE | FMODE_READ;
+ error = mnt_want_write(shm_mnt);
+ WARN_ON(error);
return file;

close_file:
diff -puN mm/tiny-shmem.c~C-elevate-writer-count-for-custom-struct_file mm/tiny-shmem.c
--- lxc/mm/tiny-shmem.c~C-elevate-writer-count-for-custom-struct_file 2006-08-01 16:35:07.000000000 -0700
+++ lxc-dave/mm/tiny-shmem.c 2006-08-01 16:35:33.000000000 -0700
@@ -84,6 +84,8 @@ struct file *shmem_file_setup(char *name
file->f_mapping = inode->i_mapping;
file->f_op = &ramfs_file_operations;
file->f_mode = FMODE_WRITE | FMODE_READ;
+ error = mnt_want_write(shm_mnt);
+ WARN_ON(error);

/* notify everyone as to the change of file size */
error = do_truncate(dentry, size, 0, file);
diff -puN net/socket.c~C-elevate-writer-count-for-custom-struct_file net/socket.c
--- lxc/net/socket.c~C-elevate-writer-count-for-custom-struct_file 2006-08-01 16:35:07.000000000 -0700
+++ lxc-dave/net/socket.c 2006-08-01 16:35:33.000000000 -0700
@@ -389,6 +389,7 @@ static int sock_attach_fd(struct socket
{
struct qstr this;
char name[32];
+ int error;

this.len = sprintf(name, "[%lu]", SOCK_INODE(sock)->i_ino);
this.name = name;
@@ -409,6 +410,8 @@ static int sock_attach_fd(struct socket
file->f_flags = O_RDWR;
file->f_pos = 0;
file->private_data = sock;
+ error = mnt_want_write(sock_mnt);
+ WARN_ON(error);

return 0;
}
_

2006-08-01 23:58:38

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 28/28] honor r/w changes at do_remount() time


Originally from: Herbert Poetzl <[email protected]>

This is the core of the read-only bind mount patch set.

Note that this does _not_ add a "ro" option directly to
the bind mount operation. If you require such a mount,
you must first do the bind, then follow it up with a
'mount -o remount,ro' operation.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namespace.c | 24 ++++++++++++++++++++++--
lxc-dave/fs/open.c | 2 +-
2 files changed, 23 insertions(+), 3 deletions(-)

diff -puN fs/namespace.c~C-D8-actually-add-flags fs/namespace.c
--- lxc/fs/namespace.c~C-D8-actually-add-flags 2006-08-01 16:35:26.000000000 -0700
+++ lxc-dave/fs/namespace.c 2006-08-01 16:35:34.000000000 -0700
@@ -391,7 +391,7 @@ static int show_vfsmnt(struct seq_file *
seq_path(m, mnt, mnt->mnt_root, " \t\n\\");
seq_putc(m, ' ');
mangle(m, mnt->mnt_sb->s_type->name);
- seq_puts(m, mnt->mnt_sb->s_flags & MS_RDONLY ? " ro" : " rw");
+ seq_puts(m, __mnt_is_readonly(mnt) ? " ro" : " rw");
for (fs_infop = fs_info; fs_infop->flag; fs_infop++) {
if (mnt->mnt_sb->s_flags & fs_infop->flag)
seq_puts(m, fs_infop->str);
@@ -1014,6 +1014,23 @@ out:
return err;
}

+static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
+{
+ int error = 0;
+ int readonly_request = 0;
+
+ if (ms_flags & MS_RDONLY)
+ readonly_request = 1;
+ if (readonly_request == __mnt_is_readonly(mnt))
+ return 0;
+
+ if (readonly_request)
+ error = mnt_make_readonly(mnt);
+ else
+ __mnt_unmake_readonly(mnt);
+ return error;
+}
+
/*
* change filesystem flags. dir should be a physical root of filesystem.
* If you've mounted a non-root directory somewhere and want to do remount
@@ -1035,7 +1052,10 @@ static int do_remount(struct nameidata *
return -EINVAL;

down_write(&sb->s_umount);
- err = do_remount_sb(sb, flags, data, 0);
+ if (flags & MS_BIND)
+ err = change_mount_flags(nd->mnt, flags);
+ else
+ err = do_remount_sb(sb, flags, data, 0);
if (!(sb->s_flags & MS_RDONLY))
mnt_flags |= MNT_SB_WRITABLE;
if (!err)
diff -puN fs/open.c~C-D8-actually-add-flags fs/open.c
--- lxc/fs/open.c~C-D8-actually-add-flags 2006-08-01 16:35:30.000000000 -0700
+++ lxc-dave/fs/open.c 2006-08-01 16:35:34.000000000 -0700
@@ -546,7 +546,7 @@ asmlinkage long sys_faccessat(int dfd, c
special_file(nd.dentry->d_inode->i_mode))
goto out_path_release;

- if(IS_RDONLY(nd.dentry->d_inode))
+ if(__mnt_is_readonly(nd.mnt) || IS_RDONLY(nd.dentry->d_inode))
res = -EROFS;

out_path_release:
_

2006-08-01 23:58:37

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 23/28] elevate write count for do_sys_utime() and touch_atime()



Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/inode.c | 9 +++++++--
lxc-dave/fs/open.c | 16 +++++++++++-----
2 files changed, 18 insertions(+), 7 deletions(-)

diff -puN fs/inode.c~C-elevate-writers-opens-part2-do_sys_utime fs/inode.c
--- lxc/fs/inode.c~C-elevate-writers-opens-part2-do_sys_utime 2006-08-01 16:35:20.000000000 -0700
+++ lxc-dave/fs/inode.c 2006-08-01 16:35:30.000000000 -0700
@@ -1191,10 +1191,13 @@ void touch_atime(struct vfsmount *mnt, s
if (IS_RDONLY(inode))
return;

+ if (mnt_want_write(mnt))
+ return;
+
if ((inode->i_flags & S_NOATIME) ||
(inode->i_sb->s_flags & MS_NOATIME) ||
((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode)))
- return;
+ goto out;

/*
* We may have a NULL vfsmount when coming from NFSD
@@ -1202,13 +1205,15 @@ void touch_atime(struct vfsmount *mnt, s
if (mnt &&
((mnt->mnt_flags & MNT_NOATIME) ||
((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))))
- return;
+ goto out;

now = current_fs_time(inode->i_sb);
if (!timespec_equal(&inode->i_atime, &now)) {
inode->i_atime = now;
mark_inode_dirty_sync(inode);
}
+out:
+ mnt_drop_write(mnt);
}

EXPORT_SYMBOL(touch_atime);
diff -puN fs/open.c~C-elevate-writers-opens-part2-do_sys_utime fs/open.c
--- lxc/fs/open.c~C-elevate-writers-opens-part2-do_sys_utime 2006-08-01 16:35:29.000000000 -0700
+++ lxc-dave/fs/open.c 2006-08-01 16:35:30.000000000 -0700
@@ -384,16 +384,20 @@ asmlinkage long sys_utime(char __user *
goto out;
inode = nd.dentry->d_inode;

+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto dput_and_out;
+
error = -EROFS;
if (IS_RDONLY(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;

/* Don't worry, the checks are done in inode_change_ok() */
newattrs.ia_valid = ATTR_CTIME | ATTR_MTIME | ATTR_ATIME;
if (times) {
error = -EPERM;
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;

error = get_user(newattrs.ia_atime.tv_sec, &times->actime);
newattrs.ia_atime.tv_nsec = 0;
@@ -401,21 +405,23 @@ asmlinkage long sys_utime(char __user *
error = get_user(newattrs.ia_mtime.tv_sec, &times->modtime);
newattrs.ia_mtime.tv_nsec = 0;
if (error)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;

newattrs.ia_valid |= ATTR_ATIME_SET | ATTR_MTIME_SET;
} else {
error = -EACCES;
if (IS_IMMUTABLE(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;

if (current->fsuid != inode->i_uid &&
(error = vfs_permission(&nd, MAY_WRITE)) != 0)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
}
mutex_lock(&inode->i_mutex);
error = notify_change(nd.dentry, &newattrs);
mutex_unlock(&inode->i_mutex);
+mnt_drop_write_and_out:
+ mnt_drop_write(nd.mnt);
dput_and_out:
path_release(&nd);
out:
_

2006-08-01 23:59:41

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 07/28] Add vfsmount writer count


This allows a vfsmount to keep track of how many instances
of files open for write there are at a given time.

A mount can refuse to allow writers. This can be because
it is a read-only bind mount, or for other functionality
in the future. A mount can also now refuse to make a
transition from r/w to r/o whenever it has currently active
writers.

When a mount gets its first writer, it tells the superblock
about it by incrementing sb->s_mnt_writers. When the last
writer goes away, this count is bumped back down. This makes
the question of whether or not anything is writing to the sb
a much quicker, easier operation than scanning the open files
list.

The WARN_ON()s can go away before this is merged into mainline
provided it has had some time in -mm or equivalent.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namespace.c | 61 +++++++++++++++++++++++++++++++++++++
lxc-dave/fs/super.c | 66 +++++++++++++++++++++++++++++++++++++----
lxc-dave/include/linux/fs.h | 2 +
lxc-dave/include/linux/mount.h | 18 +++++++++++
4 files changed, 142 insertions(+), 5 deletions(-)

diff -puN fs/namespace.c~C-add-vfsmount-writer_count fs/namespace.c
--- lxc/fs/namespace.c~C-add-vfsmount-writer_count 2006-08-01 16:35:18.000000000 -0700
+++ lxc-dave/fs/namespace.c 2006-08-01 16:35:19.000000000 -0700
@@ -58,6 +58,7 @@ struct vfsmount *alloc_vfsmnt(const char
if (mnt) {
memset(mnt, 0, sizeof(struct vfsmount));
atomic_set(&mnt->mnt_count, 1);
+ atomic_set(&mnt->mnt_writers, 0);
INIT_LIST_HEAD(&mnt->mnt_hash);
INIT_LIST_HEAD(&mnt->mnt_child);
INIT_LIST_HEAD(&mnt->mnt_mounts);
@@ -886,6 +887,60 @@ out_unlock:
return err;
}

+int mnt_make_readonly(struct vfsmount *mnt)
+{
+ int ret = 0;
+
+ WARN_ON(__mnt_is_readonly(mnt));
+
+ /*
+ * This flag set is actually redundant with what
+ * happens in do_remount(), but since we do this
+ * under the lock, anyone attempting to get a write
+ * on it after this will fail.
+ */
+ spin_lock(&mnt->mnt_sb->s_mnt_writers_lock);
+ if (!atomic_read(&mnt->mnt_writers))
+ mnt->mnt_flags |= MNT_READONLY;
+ else
+ ret = -EBUSY;
+ spin_unlock(&mnt->mnt_sb->s_mnt_writers_lock);
+ return ret;
+}
+
+int mnt_want_write(struct vfsmount *mnt)
+{
+ int ret = 0;
+repeat:
+ if (atomic_add_unless(&mnt->mnt_writers, 1, 0))
+ return 0;
+
+ spin_lock(&mnt->mnt_sb->s_mnt_writers_lock);
+ if (__mnt_is_readonly(mnt)) {
+ ret = -EROFS;
+ goto out;
+ }
+ if (atomic_add_return(1, &mnt->mnt_writers) != 1) {
+ atomic_dec(&mnt->mnt_writers);
+ spin_unlock(&mnt->mnt_sb->s_mnt_writers_lock);
+ goto repeat;
+ }
+ atomic_inc(&mnt->mnt_sb->s_mnt_writers);
+out:
+ spin_unlock(&mnt->mnt_sb->s_mnt_writers_lock);
+ return ret;
+}
+
+void mnt_drop_write(struct vfsmount *mnt)
+{
+ if (!atomic_dec_and_lock(&mnt->mnt_writers,
+ &mnt->mnt_sb->s_mnt_writers_lock))
+ return;
+
+ atomic_dec(&mnt->mnt_sb->s_mnt_writers);
+ spin_unlock(&mnt->mnt_sb->s_mnt_writers_lock);
+}
+
/*
* recursively change the type of the mountpoint.
*/
@@ -977,6 +1032,8 @@ static int do_remount(struct nameidata *

down_write(&sb->s_umount);
err = do_remount_sb(sb, flags, data, 0);
+ if (!(sb->s_flags & MS_RDONLY))
+ mnt_flags |= MNT_SB_WRITABLE;
if (!err)
nd->mnt->mnt_flags = mnt_flags;
up_write(&sb->s_umount);
@@ -1117,6 +1174,8 @@ int do_add_mount(struct vfsmount *newmnt
if (S_ISLNK(newmnt->mnt_root->d_inode->i_mode))
goto unlock;

+ if (!(newmnt->mnt_sb->s_flags & MS_RDONLY))
+ mnt_flags |= MNT_SB_WRITABLE;
newmnt->mnt_flags = mnt_flags;
if ((err = graft_tree(newmnt, nd)))
goto unlock;
@@ -1408,6 +1467,8 @@ long do_mount(char *dev_name, char *dir_
((char *)data_page)[PAGE_SIZE - 1] = 0;

/* Separate the per-mountpoint flags */
+ if (flags & MS_RDONLY)
+ mnt_flags |= MNT_READONLY;
if (flags & MS_NOSUID)
mnt_flags |= MNT_NOSUID;
if (flags & MS_NODEV)
diff -puN fs/super.c~C-add-vfsmount-writer_count fs/super.c
--- lxc/fs/super.c~C-add-vfsmount-writer_count 2006-08-01 16:35:18.000000000 -0700
+++ lxc-dave/fs/super.c 2006-08-01 16:35:19.000000000 -0700
@@ -93,6 +93,8 @@ static struct super_block *alloc_super(s
s->s_qcop = sb_quotactl_ops;
s->s_op = &default_op;
s->s_time_gran = 1000000000;
+ atomic_set(&s->s_mnt_writers, 0);
+ spin_lock_init(&s->s_mnt_writers_lock);
}
out:
return s;
@@ -527,6 +529,53 @@ static void mark_files_ro(struct super_b
file_list_unlock();
}

+static void __sb_mounts_mod_flag(struct super_block *sb, int set, int flag)
+{
+ struct list_head *p;
+
+ spin_lock(&vfsmount_lock);
+ list_for_each(p, &sb->s_vfsmounts) {
+ struct vfsmount *mnt =
+ list_entry(p, struct vfsmount, mnt_sb_list);
+ if (set)
+ mnt->mnt_flags |= flag;
+ else
+ mnt->mnt_flags &= ~flag;
+ }
+ spin_unlock(&vfsmount_lock);
+}
+
+static void sb_mounts_set_flag(struct super_block *sb, int flag)
+{
+ __sb_mounts_mod_flag(sb, 1, flag);
+}
+static void sb_mounts_clear_flag(struct super_block *sb, int flag)
+{
+ __sb_mounts_mod_flag(sb, 0, flag);
+}
+
+static int sb_remount_ro(struct super_block *sb)
+{
+ return fs_may_remount_ro(sb);
+ spin_lock(&sb->s_mnt_writers_lock);
+ if (atomic_read(&sb->s_mnt_writers) > 0) {
+ spin_unlock(&sb->s_mnt_writers_lock);
+ return -EBUSY;
+ }
+
+ sb_mounts_clear_flag(sb, MNT_SB_WRITABLE);
+ spin_unlock(&sb->s_mnt_writers_lock);
+
+ return 0;
+}
+
+static void sb_remount_rw(struct super_block *sb)
+{
+ spin_lock(&sb->s_mnt_writers_lock);
+ sb_mounts_set_flag(sb, MNT_SB_WRITABLE);
+ spin_unlock(&sb->s_mnt_writers_lock);
+}
+
/**
* do_remount_sb - asks filesystem to change mount options.
* @sb: superblock in question
@@ -538,7 +587,8 @@ static void mark_files_ro(struct super_b
*/
int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
{
- int retval;
+ int retval = 0;
+ int sb_started_ro = (sb->s_flags & MS_RDONLY);

if (!(flags & MS_RDONLY) && bdev_read_only(sb->s_bdev))
return -EACCES;
@@ -549,13 +599,14 @@ int do_remount_sb(struct super_block *sb

/* If we are remounting RDONLY and current sb is read/write,
make sure there are no rw files opened */
- if ((flags & MS_RDONLY) && !(sb->s_flags & MS_RDONLY)) {
+ if ((flags & MS_RDONLY) && !sb_started_ro) {
if (force)
mark_files_ro(sb);
- else if (!fs_may_remount_ro(sb))
- return -EBUSY;
+ else
+ retval = sb_remount_ro(sb);
+ if (retval)
+ return retval;
}
-
if (sb->s_op->remount_fs) {
lock_super(sb);
retval = sb->s_op->remount_fs(sb, &flags, data);
@@ -563,6 +614,9 @@ int do_remount_sb(struct super_block *sb
if (retval)
return retval;
}
+ if (!(flags & MS_RDONLY) && sb_started_ro)
+ sb_remount_rw(sb);
+
sb->s_flags = (sb->s_flags & ~MS_RMT_MASK) | (flags & MS_RMT_MASK);
return 0;
}
@@ -849,6 +903,8 @@ vfs_kern_mount(struct file_system_type *

mnt->mnt_mountpoint = mnt->mnt_root;
mnt->mnt_parent = mnt;
+ if (!(mnt->mnt_sb->s_flags & MS_RDONLY))
+ mnt->mnt_flags |= MNT_SB_WRITABLE;
up_write(&mnt->mnt_sb->s_umount);
free_secdata(secdata);
return mnt;
diff -puN include/linux/fs.h~C-add-vfsmount-writer_count include/linux/fs.h
--- lxc/include/linux/fs.h~C-add-vfsmount-writer_count 2006-08-01 16:35:18.000000000 -0700
+++ lxc-dave/include/linux/fs.h 2006-08-01 16:35:19.000000000 -0700
@@ -963,6 +963,8 @@ struct super_block {
struct list_head s_io; /* parked for writeback */
struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
struct list_head s_vfsmounts;
+ atomic_t s_mnt_writers; /* vfsmounts with active writers */
+ spinlock_t s_mnt_writers_lock; /* taken when mounts change rw state */
struct list_head s_files;

struct block_device *s_bdev;
diff -puN include/linux/mount.h~C-add-vfsmount-writer_count include/linux/mount.h
--- lxc/include/linux/mount.h~C-add-vfsmount-writer_count 2006-08-01 16:35:18.000000000 -0700
+++ lxc-dave/include/linux/mount.h 2006-08-01 16:35:19.000000000 -0700
@@ -27,6 +27,8 @@ struct namespace;
#define MNT_NOEXEC 0x04
#define MNT_NOATIME 0x08
#define MNT_NODIRATIME 0x10
+#define MNT_READONLY 0x20 /* does the user want this to be r/o? */
+#define MNT_SB_WRITABLE 0x40 /* does the SB currently allow writes? */

#define MNT_SHRINKABLE 0x100

@@ -44,6 +46,7 @@ struct vfsmount {
struct list_head mnt_mounts; /* list of children, anchored here */
struct list_head mnt_child; /* and going through their mnt_child */
atomic_t mnt_count;
+ atomic_t mnt_writers;
int mnt_flags;
int mnt_expiry_mark; /* true if marked for expiry */
char *mnt_devname; /* Name of device e.g. /dev/dsk/hda1 */
@@ -64,6 +67,21 @@ static inline struct vfsmount *mntget(st
return mnt;
}

+static inline int __mnt_is_readonly(struct vfsmount *mnt)
+{
+ return (mnt->mnt_flags & MNT_READONLY) ||
+ !(mnt->mnt_flags & MNT_SB_WRITABLE);
+}
+
+static inline void __mnt_unmake_readonly(struct vfsmount *mnt)
+{
+ WARN_ON(!__mnt_is_readonly(mnt));
+ mnt->mnt_flags &= ~MNT_READONLY;
+}
+
+extern int mnt_make_readonly(struct vfsmount *mnt);
+extern int mnt_want_write(struct vfsmount *mnt);
+extern void mnt_drop_write(struct vfsmount *mnt);
extern void mntput_no_expire(struct vfsmount *mnt);
extern void mnt_pin(struct vfsmount *mnt);
extern void mnt_unpin(struct vfsmount *mnt);
_

2006-08-01 23:59:51

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 06/28] reintroduce list of vfsmounts over superblock


We're moving a big chunk of the burden of keeping people from
writing to r/o filesystems from the superblock into the
vfsmount. This requires that we consult the superblock's
vfsmounts when things like remounts occur.

So, introduce a list of vfsmounts hanging off the superblock.
We'll use this in a bit.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namespace.c | 12 ++++++++++++
lxc-dave/fs/super.c | 1 +
lxc-dave/include/linux/fs.h | 1 +
lxc-dave/include/linux/mount.h | 1 +
4 files changed, 15 insertions(+)

diff -puN fs/namespace.c~C-reintroduce-list-of-vfsmounts-over-superblock fs/namespace.c
--- lxc/fs/namespace.c~C-reintroduce-list-of-vfsmounts-over-superblock 2006-08-01 16:35:11.000000000 -0700
+++ lxc-dave/fs/namespace.c 2006-08-01 16:35:18.000000000 -0700
@@ -78,10 +78,18 @@ struct vfsmount *alloc_vfsmnt(const char
return mnt;
}

+void add_mount_to_sb_list(struct vfsmount *mnt, struct super_block *sb)
+{
+ spin_lock(&vfsmount_lock);
+ list_add(&mnt->mnt_sb_list, &sb->s_vfsmounts);
+ spin_unlock(&vfsmount_lock);
+}
+
int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
{
mnt->mnt_sb = sb;
mnt->mnt_root = dget(sb->s_root);
+ add_mount_to_sb_list(mnt, sb);
return 0;
}

@@ -89,6 +97,9 @@ EXPORT_SYMBOL(simple_set_mnt);

void free_vfsmnt(struct vfsmount *mnt)
{
+ spin_lock(&vfsmount_lock);
+ list_del(&mnt->mnt_sb_list);
+ spin_unlock(&vfsmount_lock);
kfree(mnt->mnt_devname);
kmem_cache_free(mnt_cache, mnt);
}
@@ -242,6 +253,7 @@ static struct vfsmount *clone_mnt(struct
mnt->mnt_root = dget(root);
mnt->mnt_mountpoint = mnt->mnt_root;
mnt->mnt_parent = mnt;
+ add_mount_to_sb_list(mnt, sb);

if (flag & CL_SLAVE) {
list_add(&mnt->mnt_slave, &old->mnt_slave_list);
diff -puN fs/super.c~C-reintroduce-list-of-vfsmounts-over-superblock fs/super.c
--- lxc/fs/super.c~C-reintroduce-list-of-vfsmounts-over-superblock 2006-08-01 16:35:11.000000000 -0700
+++ lxc-dave/fs/super.c 2006-08-01 16:35:18.000000000 -0700
@@ -67,6 +67,7 @@ static struct super_block *alloc_super(s
INIT_LIST_HEAD(&s->s_dirty);
INIT_LIST_HEAD(&s->s_io);
INIT_LIST_HEAD(&s->s_files);
+ INIT_LIST_HEAD(&s->s_vfsmounts);
INIT_LIST_HEAD(&s->s_instances);
INIT_HLIST_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);
diff -puN include/linux/fs.h~C-reintroduce-list-of-vfsmounts-over-superblock include/linux/fs.h
--- lxc/include/linux/fs.h~C-reintroduce-list-of-vfsmounts-over-superblock 2006-08-01 16:35:17.000000000 -0700
+++ lxc-dave/include/linux/fs.h 2006-08-01 16:35:18.000000000 -0700
@@ -962,6 +962,7 @@ struct super_block {
struct list_head s_dirty; /* dirty inodes */
struct list_head s_io; /* parked for writeback */
struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
+ struct list_head s_vfsmounts;
struct list_head s_files;

struct block_device *s_bdev;
diff -puN include/linux/mount.h~C-reintroduce-list-of-vfsmounts-over-superblock include/linux/mount.h
--- lxc/include/linux/mount.h~C-reintroduce-list-of-vfsmounts-over-superblock 2006-08-01 16:35:11.000000000 -0700
+++ lxc-dave/include/linux/mount.h 2006-08-01 16:35:18.000000000 -0700
@@ -40,6 +40,7 @@ struct vfsmount {
struct dentry *mnt_mountpoint; /* dentry of mountpoint */
struct dentry *mnt_root; /* root of the mounted tree */
struct super_block *mnt_sb; /* pointer to superblock */
+ struct list_head mnt_sb_list; /* list of all mounts on same sb */
struct list_head mnt_mounts; /* list of children, anchored here */
struct list_head mnt_child; /* and going through their mnt_child */
atomic_t mnt_count;
_

2006-08-02 00:00:25

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 12/28] elevate mnt writers for callers of vfs_mkdir()



Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namei.c | 5 +++++
lxc-dave/fs/nfsd/nfs4recover.c | 4 ++++
2 files changed, 9 insertions(+)

diff -puN fs/namei.c~C-elevate-writers-vfs_mkdir fs/namei.c
--- lxc/fs/namei.c~C-elevate-writers-vfs_mkdir 2006-08-01 16:35:14.000000000 -0700
+++ lxc-dave/fs/namei.c 2006-08-01 16:35:23.000000000 -0700
@@ -1952,7 +1952,12 @@ asmlinkage long sys_mkdirat(int dfd, con

if (!IS_POSIXACL(nd.dentry->d_inode))
mode &= ~current->fs->umask;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
error = vfs_mkdir(nd.dentry->d_inode, dentry, mode);
+ mnt_drop_write(nd.mnt);
+out_dput:
dput(dentry);
out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
diff -puN fs/nfsd/nfs4recover.c~C-elevate-writers-vfs_mkdir fs/nfsd/nfs4recover.c
--- lxc/fs/nfsd/nfs4recover.c~C-elevate-writers-vfs_mkdir 2006-08-01 16:35:10.000000000 -0700
+++ lxc-dave/fs/nfsd/nfs4recover.c 2006-08-01 16:35:23.000000000 -0700
@@ -155,7 +155,11 @@ nfsd4_create_clid_dir(struct nfs4_client
dprintk("NFSD: nfsd4_create_clid_dir: DIRECTORY EXISTS\n");
goto out_put;
}
+ status = mnt_want_write(rec_dir.mnt);
+ if (status)
+ goto out_put;
status = vfs_mkdir(rec_dir.dentry->d_inode, dentry, S_IRWXU);
+ mnt_drop_write(rec_dir.mnt);
out_put:
dput(dentry);
out_unlock:
_

2006-08-02 00:00:38

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 13/28] elevate write count during entire ncp_ioctl()


Some ioctls need write access, but others don't. Make a helper
function to decide when write access is needed, and take it.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/ncpfs/ioctl.c | 55 +++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 54 insertions(+), 1 deletion(-)

diff -puN fs/ncpfs/ioctl.c~C-elevate-writers-file_permission-callers fs/ncpfs/ioctl.c
--- lxc/fs/ncpfs/ioctl.c~C-elevate-writers-file_permission-callers 2006-08-01 16:35:10.000000000 -0700
+++ lxc-dave/fs/ncpfs/ioctl.c 2006-08-01 16:35:23.000000000 -0700
@@ -15,6 +15,7 @@
#include <linux/ioctl.h>
#include <linux/time.h>
#include <linux/mm.h>
+#include <linux/mount.h>
#include <linux/highuid.h>
#include <linux/vmalloc.h>

@@ -182,7 +183,7 @@ ncp_get_charsets(struct ncp_server* serv
}
#endif /* CONFIG_NCPFS_NLS */

-int ncp_ioctl(struct inode *inode, struct file *filp,
+static int __ncp_ioctl(struct inode *inode, struct file *filp,
unsigned int cmd, unsigned long arg)
{
struct ncp_server *server = NCP_SERVER(inode);
@@ -653,3 +654,55 @@ outrel:
/* #endif */
return -EINVAL;
}
+
+static int ncp_ioctl_need_write(unsigned int cmd)
+{
+ switch (cmd) {
+ case NCP_IOC_GET_FS_INFO:
+ case NCP_IOC_GET_FS_INFO_V2:
+ case NCP_IOC_NCPREQUEST:
+ case NCP_IOC_SETDENTRYTTL:
+ case NCP_IOC_SIGN_INIT:
+ case NCP_IOC_LOCKUNLOCK:
+ case NCP_IOC_SET_SIGN_WANTED:
+ return 1;
+ case NCP_IOC_GETOBJECTNAME:
+ case NCP_IOC_SETOBJECTNAME:
+ case NCP_IOC_GETPRIVATEDATA:
+ case NCP_IOC_SETPRIVATEDATA:
+ case NCP_IOC_SETCHARSETS:
+ case NCP_IOC_GETCHARSETS:
+ case NCP_IOC_CONN_LOGGED_IN:
+ case NCP_IOC_GETDENTRYTTL:
+ case NCP_IOC_GETMOUNTUID2:
+ case NCP_IOC_SIGN_WANTED:
+ case NCP_IOC_GETROOT:
+ case NCP_IOC_SETROOT:
+ return 0;
+ default:
+ /* unkown IOCTL command, assume write */
+ WARN_ON(1);
+ }
+ return 1;
+}
+
+int ncp_ioctl(struct inode *inode, struct file *filp,
+ unsigned int cmd, unsigned long arg)
+{
+ int ret;
+
+ if (ncp_ioctl_need_write(cmd)) {
+ /*
+ * inside the ioctl(), any failures which
+ * are because of file_permission() are
+ * -EACCESS, so it seems consistent to keep
+ * that here.
+ */
+ if (mnt_want_write(filp->f_vfsmnt))
+ return -EACCES;
+ }
+ ret = __ncp_ioctl(inode, filp, cmd, arg);
+ if (ncp_ioctl_need_write(cmd))
+ mnt_drop_write(filp->f_vfsmnt);
+ return ret;
+}
_

2006-08-02 00:01:16

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 04/28] OCFS2 is screwy


OCFS2 plays some games with i_nlink. It modifies it a bunch in
its unlink function, but rolls back the changes if an error
occurs. So, we can't just assume that iput_final() will happen
whenever i_nlink hits 0 in ocfs's unlink().

Create a helper function to do the hard work of i_nlink hitting
zero, and use it in ocfs2.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/libfs.c | 6 ++++++
lxc-dave/fs/ocfs2/namei.c | 8 +++++---
lxc-dave/include/linux/fs.h | 9 +++++++++
3 files changed, 20 insertions(+), 3 deletions(-)

diff -puN fs/ocfs2/namei.c~ofcfs-is-screwy fs/ocfs2/namei.c
--- lxc/fs/ocfs2/namei.c~ofcfs-is-screwy 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/fs/ocfs2/namei.c 2006-08-01 16:35:16.000000000 -0700
@@ -909,6 +909,8 @@ static int ocfs2_unlink(struct inode *di
leave:
if (status < 0 && saved_nlink)
inode->i_nlink = saved_nlink;
+ if (inode->i_nlink == 0)
+ __inode_set_awaiting_final_iput(inode);

if (handle)
ocfs2_commit_trans(handle);
@@ -1329,7 +1331,7 @@ static int ocfs2_rename(struct inode *ol
}

if (new_inode) {
- new_inode->i_nlink--;
+ inode_drop_nlink(new_inode);
new_inode->i_ctime = CURRENT_TIME;
}
old_dir->i_ctime = old_dir->i_mtime = CURRENT_TIME;
@@ -1340,9 +1342,9 @@ static int ocfs2_rename(struct inode *ol
PARENT_INO(old_inode_de_bh->b_data) =
cpu_to_le64(OCFS2_I(new_dir)->ip_blkno);
status = ocfs2_journal_dirty(handle, old_inode_de_bh);
- old_dir->i_nlink--;
+ inode_drop_nlink(old_dir);
if (new_inode) {
- new_inode->i_nlink--;
+ inode_drop_nlink(new_inode);
} else {
new_dir->i_nlink++;
mark_inode_dirty(new_dir);
diff -puN fs/libfs.c~ofcfs-is-screwy fs/libfs.c
--- lxc/fs/libfs.c~ofcfs-is-screwy 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/fs/libfs.c 2006-08-01 16:35:16.000000000 -0700
@@ -270,6 +270,11 @@ out:
return ret;
}

+void __inode_set_awaiting_final_iput(struct inode *inode)
+{
+}
+EXPORT_SYMBOL_GPL(inode_drop_nlink);
+
int simple_unlink(struct inode *dir, struct dentry *dentry)
{
struct inode *inode = dentry->d_inode;
@@ -617,6 +622,7 @@ EXPORT_SYMBOL(simple_commit_write);
EXPORT_SYMBOL(simple_dir_inode_operations);
EXPORT_SYMBOL(simple_dir_operations);
EXPORT_SYMBOL(simple_empty);
+EXPORT_SYMBOL(__inode_set_awaiting_final_iput);
EXPORT_SYMBOL(d_alloc_name);
EXPORT_SYMBOL(simple_fill_super);
EXPORT_SYMBOL(simple_getattr);
diff -puN drivers/usb/core/inode.c~ofcfs-is-screwy drivers/usb/core/inode.c
diff -puN include/linux/fs.h~ofcfs-is-screwy include/linux/fs.h
--- lxc/include/linux/fs.h~ofcfs-is-screwy 2006-08-01 16:35:15.000000000 -0700
+++ lxc-dave/include/linux/fs.h 2006-08-01 16:35:16.000000000 -0700
@@ -1257,9 +1257,18 @@ static inline void inode_inc_link_count(
}

extern void __inode_set_awaiting_final_iput(struct inode *inode);
+static inline void inode_clear_nlink(struct inode *inode)
+{
+ if (inode->i_nlink)
+ __inode_set_awaiting_final_iput(inode);
+ inode->i_nlink = 0;
+}
+
static inline void inode_drop_nlink(struct inode *inode)
{
inode->i_nlink--;
+ if (!inode->i_nlink)
+ __inode_set_awaiting_final_iput(inode);
}

static inline void inode_dec_link_count(struct inode *inode)
_

2006-08-02 00:01:16

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 15/28] elevate mount count for extended attributes


This basically audits the callers of xattr_permission(), which
calls permission() and can perform writes to the filesystem.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/nfsd/nfs4proc.c | 7 ++++++-
lxc-dave/fs/xattr.c | 14 ++++++++++++++
2 files changed, 20 insertions(+), 1 deletion(-)

diff -puN fs/nfsd/nfs4proc.c~C-xattr fs/nfsd/nfs4proc.c
--- lxc/fs/nfsd/nfs4proc.c~C-xattr 2006-08-01 16:35:09.000000000 -0700
+++ lxc-dave/fs/nfsd/nfs4proc.c 2006-08-01 16:35:25.000000000 -0700
@@ -604,13 +604,18 @@ nfsd4_setattr(struct svc_rqst *rqstp, st
return status;
}
}
+ status = mnt_want_write(current_fh->fh_export->ex_mnt);
+ if (status)
+ return status;
status = nfs_ok;
if (setattr->sa_acl != NULL)
status = nfsd4_set_nfs4_acl(rqstp, current_fh, setattr->sa_acl);
if (status)
- return status;
+ goto out;
status = nfsd_setattr(rqstp, current_fh, &setattr->sa_iattr,
0, (time_t)0);
+out:
+ mnt_drop_write(current_fh->fh_export->ex_mnt);
return status;
}

diff -puN fs/xattr.c~C-xattr fs/xattr.c
--- lxc/fs/xattr.c~C-xattr 2006-08-01 16:35:09.000000000 -0700
+++ lxc-dave/fs/xattr.c 2006-08-01 16:35:25.000000000 -0700
@@ -12,6 +12,7 @@
#include <linux/smp_lock.h>
#include <linux/file.h>
#include <linux/xattr.h>
+#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/security.h>
#include <linux/syscalls.h>
@@ -210,7 +211,11 @@ sys_setxattr(char __user *path, char __u
error = user_path_walk(path, &nd);
if (error)
return error;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ return error;
error = setxattr(nd.dentry, name, value, size, flags);
+ mnt_drop_write(nd.mnt);
path_release(&nd);
return error;
}
@@ -225,7 +230,11 @@ sys_lsetxattr(char __user *path, char __
error = user_path_walk_link(path, &nd);
if (error)
return error;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ return error;
error = setxattr(nd.dentry, name, value, size, flags);
+ mnt_drop_write(nd.mnt);
path_release(&nd);
return error;
}
@@ -241,9 +250,14 @@ sys_fsetxattr(int fd, char __user *name,
f = fget(fd);
if (!f)
return error;
+ error = mnt_want_write(f->f_vfsmnt);
+ if (error)
+ goto out_fput;
dentry = f->f_dentry;
audit_inode(NULL, dentry->d_inode);
error = setxattr(dentry, name, value, size, flags);
+ mnt_drop_write(f->f_vfsmnt);
+out_fput:
fput(f);
return error;
}
_

2006-08-02 00:01:16

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 16/28] sys_linkat(): elevate write count around vfs_link()



Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namei.c | 5 +++++
1 files changed, 5 insertions(+)

diff -puN fs/namei.c~C-elevate-writers-vfs_link-part1 fs/namei.c
--- lxc/fs/namei.c~C-elevate-writers-vfs_link-part1 2006-08-01 16:35:24.000000000 -0700
+++ lxc-dave/fs/namei.c 2006-08-01 16:35:25.000000000 -0700
@@ -2326,7 +2326,12 @@ asmlinkage long sys_linkat(int olddfd, c
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
goto out_unlock;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
error = vfs_link(old_nd.dentry, nd.dentry->d_inode, new_dentry);
+ mnt_drop_write(nd.mnt);
+out_dput:
dput(new_dentry);
out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_

2006-08-02 00:01:14

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 14/28] sys_symlinkat() elevate write count around vfs_symlink()



Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namei.c | 5 +++++
1 files changed, 5 insertions(+)

diff -puN fs/namei.c~C-elevate-writers-vfs_symlink-part3 fs/namei.c
--- lxc/fs/namei.c~C-elevate-writers-vfs_symlink-part3 2006-08-01 16:35:23.000000000 -0700
+++ lxc-dave/fs/namei.c 2006-08-01 16:35:24.000000000 -0700
@@ -2226,7 +2226,12 @@ asmlinkage long sys_symlinkat(const char
if (IS_ERR(dentry))
goto out_unlock;

+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
error = vfs_symlink(nd.dentry->d_inode, dentry, from, S_IALLUGO);
+ mnt_drop_write(nd.mnt);
+out_dput:
dput(dentry);
out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_

2006-08-02 00:03:08

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 09/28] kill open files traverse on remount ro


Now that we have the sb writer count, we don't need to
go looking at all of the individual open files.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/file_table.c | 25 -------------------------
lxc-dave/fs/super.c | 1 -
lxc-dave/include/linux/fs.h | 2 --
3 files changed, 28 deletions(-)

diff -puN fs/file_table.c~C-kill-open-file-traverse-on-remount-ro fs/file_table.c
--- lxc/fs/file_table.c~C-kill-open-file-traverse-on-remount-ro 2006-08-01 16:35:10.000000000 -0700
+++ lxc-dave/fs/file_table.c 2006-08-01 16:35:21.000000000 -0700
@@ -269,31 +269,6 @@ void file_kill(struct file *file)
}
}

-int fs_may_remount_ro(struct super_block *sb)
-{
- struct list_head *p;
-
- /* Check that no files are currently opened for writing. */
- file_list_lock();
- list_for_each(p, &sb->s_files) {
- struct file *file = list_entry(p, struct file, f_u.fu_list);
- struct inode *inode = file->f_dentry->d_inode;
-
- /* File with pending delete? */
- if (inode->i_nlink == 0)
- goto too_bad;
-
- /* Writeable file? */
- if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
- goto too_bad;
- }
- file_list_unlock();
- return 1; /* Tis' cool bro. */
-too_bad:
- file_list_unlock();
- return 0;
-}
-
void __init files_init(unsigned long mempages)
{
int n;
diff -puN fs/super.c~C-kill-open-file-traverse-on-remount-ro fs/super.c
--- lxc/fs/super.c~C-kill-open-file-traverse-on-remount-ro 2006-08-01 16:35:19.000000000 -0700
+++ lxc-dave/fs/super.c 2006-08-01 16:35:21.000000000 -0700
@@ -556,7 +556,6 @@ static void sb_mounts_clear_flag(struct

static int sb_remount_ro(struct super_block *sb)
{
- return fs_may_remount_ro(sb);
spin_lock(&sb->s_mnt_writers_lock);
if (atomic_read(&sb->s_mnt_writers) > 0) {
spin_unlock(&sb->s_mnt_writers_lock);
diff -puN include/linux/fs.h~C-kill-open-file-traverse-on-remount-ro include/linux/fs.h
--- lxc/include/linux/fs.h~C-kill-open-file-traverse-on-remount-ro 2006-08-01 16:35:20.000000000 -0700
+++ lxc-dave/include/linux/fs.h 2006-08-01 16:35:21.000000000 -0700
@@ -1601,8 +1601,6 @@ extern const struct file_operations read
extern const struct file_operations write_fifo_fops;
extern const struct file_operations rdwr_fifo_fops;

-extern int fs_may_remount_ro(struct super_block *);
-
/*
* return READ, READA, or WRITE
*/
_

2006-08-02 00:03:09

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 08/28] record when sb_writer_count elevated for inode


There are a number of filesystems that do iput()s without first
having messed with i_nlink. In order to keep from accidentally
decrementing the superblock writer count for these, we record
when the count is bumped up, so that we can properly balance
it.

I know the flag name sucks. Anybody have better ideas?

I first tried to do this by catching all of the users an intentions
whenever i_nlink is modified, but all of the filesystems do enough
creative things with it that even if it was all properly fixed now,
new issues with vfsmnt writer count imaglance will probably pop
up in the future. This patch trades that possibility for the chance
that we will miss a i_nlink--, and not bump the sb writer count.

I like the idea screwing up writing out a single inode better than
screwing up a global superblock count imbalance that will affect
all inodes.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/inode.c | 7 ++++++-
lxc-dave/fs/libfs.c | 2 ++
lxc-dave/include/linux/fs.h | 1 +
mm/page_alloc.c | 0
4 files changed, 9 insertions(+), 1 deletion(-)

diff -puN fs/inode.c~C-record-when-sb_writer_count-elevated-for-inode-in-inode fs/inode.c
--- lxc/fs/inode.c~C-record-when-sb_writer_count-elevated-for-inode-in-inode 2006-08-01 16:35:11.000000000 -0700
+++ lxc-dave/fs/inode.c 2006-08-01 16:35:20.000000000 -0700
@@ -1114,12 +1114,17 @@ EXPORT_SYMBOL_GPL(generic_drop_inode);
*/
static inline void iput_final(struct inode *inode)
{
- struct super_operations *op = inode->i_sb->s_op;
+ struct super_block *sb = inode->i_sb;
+ struct super_operations *op = sb->s_op;
void (*drop)(struct inode *) = generic_drop_inode;
+ int must_drop_sb_write = (inode->i_state & I_AWAITING_FINAL_IPUT);

+ inode->i_state &= ~I_AWAITING_FINAL_IPUT;
if (op && op->drop_inode)
drop = op->drop_inode;
drop(inode);
+ if (must_drop_sb_write)
+ atomic_dec(&sb->s_mnt_writers);
}

/**
diff -puN fs/libfs.c~C-record-when-sb_writer_count-elevated-for-inode-in-inode fs/libfs.c
--- lxc/fs/libfs.c~C-record-when-sb_writer_count-elevated-for-inode-in-inode 2006-08-01 16:35:17.000000000 -0700
+++ lxc-dave/fs/libfs.c 2006-08-01 16:35:20.000000000 -0700
@@ -272,6 +272,7 @@ out:

void __inode_set_awaiting_final_iput(struct inode *inode)
{
+ inode->i_state |= I_AWAITING_FINAL_IPUT;
}

int simple_unlink(struct inode *dir, struct dentry *dentry)
@@ -377,6 +378,7 @@ int simple_fill_super(struct super_block
inode = new_inode(s);
if (!inode)
return -ENOMEM;
+ inode->i_state |= I_AWAITING_FINAL_IPUT;
inode->i_mode = S_IFDIR | 0755;
inode->i_uid = inode->i_gid = 0;
inode->i_blocks = 0;
diff -puN include/linux/fs.h~C-record-when-sb_writer_count-elevated-for-inode-in-inode include/linux/fs.h
--- lxc/include/linux/fs.h~C-record-when-sb_writer_count-elevated-for-inode-in-inode 2006-08-01 16:35:19.000000000 -0700
+++ lxc-dave/include/linux/fs.h 2006-08-01 16:35:20.000000000 -0700
@@ -1239,6 +1239,7 @@ struct super_operations {
#define I_CLEAR 32
#define I_NEW 64
#define I_WILL_FREE 128
+#define I_AWAITING_FINAL_IPUT 256

#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

diff -puN mm/page_alloc.c~C-record-when-sb_writer_count-elevated-for-inode-in-inode mm/page_alloc.c
_

2006-08-02 00:02:40

by Dave Hansen

[permalink] [raw]
Subject: [PATCH 11/28] elevate writer count for chown and friends


chown/chmod,etc... don't call permission in the same way
that the normal "open for write" calls do. They still
write to the filesystem, so bump the write count during
these operations.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/open.c | 38 +++++++++++++++++++++++++++++++++-----
1 files changed, 33 insertions(+), 5 deletions(-)

diff -puN fs/open.c~C-elevate-writers-chown-and-friends fs/open.c
--- lxc/fs/open.c~C-elevate-writers-chown-and-friends 2006-08-01 16:35:13.000000000 -0700
+++ lxc-dave/fs/open.c 2006-08-01 16:35:22.000000000 -0700
@@ -644,9 +644,12 @@ asmlinkage long sys_fchmod(unsigned int
err = -EROFS;
if (IS_RDONLY(inode))
goto out_putf;
+ err = mnt_want_write(file->f_vfsmnt);
+ if (err)
+ goto out_putf;
err = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
- goto out_putf;
+ goto out_drop_write;
mutex_lock(&inode->i_mutex);
if (mode == (mode_t) -1)
mode = inode->i_mode;
@@ -655,6 +658,8 @@ asmlinkage long sys_fchmod(unsigned int
err = notify_change(dentry, &newattrs);
mutex_unlock(&inode->i_mutex);

+out_drop_write:
+ mnt_drop_write(file->f_vfsmnt);
out_putf:
fput(file);
out:
@@ -674,13 +679,16 @@ asmlinkage long sys_fchmodat(int dfd, co
goto out;
inode = nd.dentry->d_inode;

+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto dput_and_out;
error = -EROFS;
if (IS_RDONLY(inode))
- goto dput_and_out;
+ goto out_drop_write;

error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
- goto dput_and_out;
+ goto out_drop_write;

mutex_lock(&inode->i_mutex);
if (mode == (mode_t) -1)
@@ -690,6 +698,8 @@ asmlinkage long sys_fchmodat(int dfd, co
error = notify_change(nd.dentry, &newattrs);
mutex_unlock(&inode->i_mutex);

+out_drop_write:
+ mnt_drop_write(nd.mnt);
dput_and_out:
path_release(&nd);
out:
@@ -715,7 +725,7 @@ static int chown_common(struct dentry *
error = -EROFS;
if (IS_RDONLY(inode))
goto out;
- error = -EPERM;
+ error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
goto out;
newattrs.ia_valid = ATTR_CTIME;
@@ -744,7 +754,12 @@ asmlinkage long sys_chown(const char __u
error = user_path_walk(filename, &nd);
if (error)
goto out;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_release;
error = chown_common(nd.dentry, user, group);
+ mnt_drop_write(nd.mnt);
+out_release:
path_release(&nd);
out:
return error;
@@ -764,7 +779,12 @@ asmlinkage long sys_fchownat(int dfd, co
error = __user_walk_fd(dfd, filename, follow, &nd);
if (error)
goto out;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_release;
error = chown_common(nd.dentry, user, group);
+ mnt_drop_write(nd.mnt);
+out_release:
path_release(&nd);
out:
return error;
@@ -778,7 +798,11 @@ asmlinkage long sys_lchown(const char __
error = user_path_walk_link(filename, &nd);
if (error)
goto out;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_release;
error = chown_common(nd.dentry, user, group);
+out_release:
path_release(&nd);
out:
return error;
@@ -794,10 +818,14 @@ asmlinkage long sys_fchown(unsigned int
file = fget(fd);
if (!file)
goto out;
-
+ error = mnt_want_write(file->f_vfsmnt);
+ if (error)
+ goto out_fput;
dentry = file->f_dentry;
audit_inode(NULL, dentry->d_inode);
error = chown_common(dentry, user, group);
+ mnt_drop_write(file->f_vfsmnt);
+out_fput:
fput(file);
out:
return error;
_

2006-08-02 02:14:30

by Mark Fasheh

[permalink] [raw]
Subject: Re: [PATCH 04/28] OCFS2 is screwy

On Tue, Aug 01, 2006 at 04:52:43PM -0700, Dave Hansen wrote:
> OCFS2 plays some games with i_nlink. It modifies it a bunch in
> its unlink function, but rolls back the changes if an error
> occurs. So, we can't just assume that iput_final() will happen
> whenever i_nlink hits 0 in ocfs's unlink().
Huh? Did you read the code? Or is it just easier to call things "screwy" and
start hacking away?

i_nlink only gets rolled back in the case that the file system wasn't able to
actually complete the unlink / orphan operation. The idea is to keep it in
sync with what's actually on disk. So when we call iput() in the unlink
path, disk and struct inode should be accurate.

> Create a helper function to do the hard work of i_nlink hitting
> zero, and use it in ocfs2.
Anyway, it's only done that way because at the time it seemed the cleanest
approach - there's no technical reason why we must roll it back. So a helper
function doesn't seem necessary - you'd just have to look through the small
amount of code and re-order things a bit.

Unfortunately you didn't actually put the contents of
__inode_set_awaiting_final_iput() in this patch, so I'll have to dig through
the others to see what exactly you're up to. In the meantime I'd say this
patch is either unecessary or just plain wrong.
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2006-08-02 03:25:15

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 04/28] OCFS2 is screwy

Please ignore that last one. It didn't correctly handle directories'
with a remaining i_nlink of 2.

--- linux-2.6-patches/fs/ocfs2/namei.c.orig 2006-08-01 20:00:58.000000000 -0700
+++ linux-2.6-patches/fs/ocfs2/namei.c 2006-08-01 20:17:45.000000000 -0700
@@ -744,11 +744,23 @@
return err;
}

+static inline inode_is_unlinkable(struct inode *inode)
+{
+ if (S_ISDIR(inode->i_mode)) {
+ if (inode->i_nlink == 2)
+ return 1;
+ return 0;
+ }
+
+ if (inode->i_nlink == 1)
+ return 1;
+ return 0;
+}
+
static int ocfs2_unlink(struct inode *dir,
struct dentry *dentry)
{
int status;
- unsigned int saved_nlink = 0;
struct inode *inode = dentry->d_inode;
struct ocfs2_super *osb = OCFS2_SB(dir->i_sb);
u64 blkno;
@@ -823,16 +835,6 @@
}
}

- /* There are still a few steps left until we can consider the
- * unlink to have succeeded. Save off nlink here before
- * modification so we can set it back in case we hit an issue
- * before commit. */
- saved_nlink = inode->i_nlink;
- if (S_ISDIR(inode->i_mode))
- inode->i_nlink = 0;
- else
- inode->i_nlink--;
-
status = ocfs2_request_unlink_vote(inode, dentry,
(unsigned int) inode->i_nlink);
if (status < 0) {
@@ -842,7 +844,7 @@
goto leave;
}

- if (!inode->i_nlink) {
+ if (inode_is_unlinkable(inode)) {
status = ocfs2_prepare_orphan_dir(osb, handle, inode,
orphan_name,
&orphan_entry_bh);
@@ -869,7 +871,7 @@

fe = (struct ocfs2_dinode *) fe_bh->b_data;

- if (!inode->i_nlink) {
+ if (inode_is_unlinkable(inode)) {
status = ocfs2_orphan_add(osb, handle, inode, fe, orphan_name,
orphan_entry_bh);
if (status < 0) {
@@ -888,7 +890,9 @@
/* We can set nlink on the dinode now. clear the saved version
* so that it doesn't get set later. */
fe->i_links_count = cpu_to_le16(inode->i_nlink);
- saved_nlink = 0;
+ inode_drop_nlink(inode);
+ if (S_ISDIR(inode->i_mode))
+ inode_drop_nlink(inode);

status = ocfs2_journal_dirty(handle, fe_bh);
if (status < 0) {
@@ -897,19 +901,15 @@
}

if (S_ISDIR(inode->i_mode)) {
- dir->i_nlink--;
status = ocfs2_mark_inode_dirty(handle, dir,
parent_node_bh);
- if (status < 0) {
+ if (status < 0)
mlog_errno(status);
- dir->i_nlink++;
- }
+ else
+ inode_drop_nlink(dir);
}

leave:
- if (status < 0 && saved_nlink)
- inode->i_nlink = saved_nlink;
-
if (handle)
ocfs2_commit_trans(handle);


-- Dave

2006-08-02 03:25:04

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 04/28] OCFS2 is (not) screwy

On Tue, 2006-08-01 at 19:14 -0700, Mark Fasheh wrote:
> On Tue, Aug 01, 2006 at 04:52:43PM -0700, Dave Hansen wrote:
> > OCFS2 plays some games with i_nlink. It modifies it a bunch in
> > its unlink function, but rolls back the changes if an error
> > occurs. So, we can't just assume that iput_final() will happen
> > whenever i_nlink hits 0 in ocfs's unlink().
> Huh? Did you read the code? Or is it just easier to call things "screwy" and
> start hacking away?
>
> i_nlink only gets rolled back in the case that the file system wasn't able to
> actually complete the unlink / orphan operation. The idea is to keep it in
> sync with what's actually on disk. So when we call iput() in the unlink
> path, disk and struct inode should be accurate.

BTW, some gunk appears to have migrated into this patch that should have
been earlier in the series. I'll fix that up.

What do you think about the attached patch? It delays actually touching
i_nlink until the place where saved_nlink used to be zero'd. I assume
that is the point when we're sure that the inode is going to go away.

Also, instead of just clearing i_nlink for the directory case, I just do
two decrements. I did that for a few other filesystems as well. I
guess it can be collapsed to a single operation, but I'm not sure it is
worth the trouble.

Completely and utterly untested, uncompiled patch attached. Please
consider its filename a formal apology for calling your filesystem
screwy. :)

It might also be worth putting the 'double decrement i_nlink if it is a
directory' behavior in libfs.c. It appears to be pretty common logic
around the different filesystems.

Thanks for the thorough review!

-- Dave


Attachments:
ocfs-is-not-screwy.patch (1.97 kB)

2006-08-02 04:34:46

by Mark Fasheh

[permalink] [raw]
Subject: Re: [PATCH 04/28] OCFS2 is screwy

Hi Dave,

On Tue, Aug 01, 2006 at 08:21:46PM -0700, Dave Hansen wrote:
> Please ignore that last one. It didn't correctly handle directories'
> with a remaining i_nlink of 2.
Thanks for following up with this patch - it looks pretty good. One comment
below.


> @@ -888,7 +890,9 @@
> /* We can set nlink on the dinode now. clear the saved version
> * so that it doesn't get set later. */
> fe->i_links_count = cpu_to_le16(inode->i_nlink);
> - saved_nlink = 0;
> + inode_drop_nlink(inode);
> + if (S_ISDIR(inode->i_mode))
> + inode_drop_nlink(inode);
The set of 'i_links_count' on 'fe' should be below the inode_drop_nlink()
calls - otherwise we'll be setting the old nlink value on the disk inode :)

While you're there you can just remove that comment - it's no longer
accurate :)

Thanks again,
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2006-08-03 00:20:58

by Mark Fasheh

[permalink] [raw]
Subject: Re: [PATCH 04/28] OCFS2 is screwy

On Tue, Aug 01, 2006 at 08:21:46PM -0700, Dave Hansen wrote:
> Please ignore that last one. It didn't correctly handle directories'
> with a remaining i_nlink of 2.
Oops, one more minor thing wrong with this patch:


> @@ -823,16 +835,6 @@
> }
> }
>
> - /* There are still a few steps left until we can consider the
> - * unlink to have succeeded. Save off nlink here before
> - * modification so we can set it back in case we hit an issue
> - * before commit. */
> - saved_nlink = inode->i_nlink;
> - if (S_ISDIR(inode->i_mode))
> - inode->i_nlink = 0;
> - else
> - inode->i_nlink--;
> -
> status = ocfs2_request_unlink_vote(inode, dentry,
> (unsigned int) inode->i_nlink);
The network request above needs to send what the new nlink will be set to.
This is so that the other nodes which have an interest in the inode can
determine whether the will need to do orphan processing on their last iput.

Really what they do is just compare against zero and mark the inode with a
flag indicating that it may have been orphaned. So we could just pass the
result of inode_is_unlinkable(), but I'd rather preserve the behavior that
we have today.

The ugly method is:

- status = ocfs2_request_unlink_vote(inode, dentry,
- (unsigned int) inode->i_nlink);
+ status = ocfs2_request_unlink_vote(inode, dentry,
+ S_ISDIR(inode->i_mode) && inode->i_nlink == 2 ? 0 : inode->i_nlink - 1);

Better yet, we could store the result of the evaluation in a temporary
variable and pass that through to the request function.

Thanks,
--Mark

--
Mark Fasheh
Senior Software Developer, Oracle
[email protected]

2006-08-03 14:32:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 01/28] prepare for write access checks: collapse if()

On Tue, Aug 01, 2006 at 04:52:40PM -0700, Dave Hansen wrote:
>
> We're shortly going to be adding a bunch more permission
> checks in these functions. That requires adding either a
> bunch of new if() conditions, or some gotos. This patch
> collapses existing if()s and uses gotos instead to
> prepare for the upcoming changes.

Ok, please send this to akpm for inclusion ASAP so the patch series shrinks.

2006-08-03 14:33:19

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 02/28] r/o bind mount prepwork: move open_namei()'s vfs_create()

On Tue, Aug 01, 2006 at 04:52:41PM -0700, Dave Hansen wrote:
>
> The code around vfs_create() in open_namei() is getting a
> bit too complex. Right now, there is at least the reference
> count on the dentry, and the i_mutex to worry about. Soon,
> we'll also have mnt_writecount.
>
> So, break the vfs_create() call out of open_namei(), and
> into a helper function. This duplicates the call to
> may_open(), but that isn't such a bad thing since the
> arguments (acc_mode and flag) were being heavily massaged
> anyway.
>
> Later in the series, we'll add the mnt_writecount handling
> around this new function call.
>

Ok. Again please send to Andrew ASAP.

2006-08-03 14:35:51

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 03/28] unlink: monitor i_nlink

On Tue, Aug 01, 2006 at 04:52:42PM -0700, Dave Hansen wrote:
>
> When a filesystem decrements i_nlink to zero, it means that a
> write must be performed in order to drop the inode from the
> filesystem.
>
> We're shortly going to have keep filesystems from being remounted
> r/o between the time that this i_nlink decrement and that write
> occurs.
>
> So, add a little helper function to do the decrements. We'll
> tie into it in a bit to note when i_nlink hits zero.

Looks good. I wonder if we could find a better name for it. Maybe just
dec_nlink? And add a simple counterpart

static inline void inc_nlink(struct inode *inode)
{
inode->i_nlink++;
}

for symmetry reasons. After that the patch can go off to Andrew before
the series aswell.

> Should
> we also be checking all of the places where i_nlink is explicitly
> set to 0 in the unlink paths?

They'll definitly need some auditing.

2006-08-03 14:39:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 06/28] reintroduce list of vfsmounts over superblock

On Tue, Aug 01, 2006 at 04:52:44PM -0700, Dave Hansen wrote:
>
> We're moving a big chunk of the burden of keeping people from
> writing to r/o filesystems from the superblock into the
> vfsmount. This requires that we consult the superblock's
> vfsmounts when things like remounts occur.
>
> So, introduce a list of vfsmounts hanging off the superblock.
> We'll use this in a bit.

I don't think we'll need it. We really need to keep is someone writing
to this vfsmount counters in addition to is someone writing to this sb.

In fact there are cases were we want a superblock to be writeable to
without any view into it being writeable, e.g. for journal recovery.

2006-08-03 14:42:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 27/28] elevate writer count for custom 'struct file'

On Tue, Aug 01, 2006 at 04:53:00PM -0700, Dave Hansen wrote:
>
> Some filesystems forego the vfs and may_open() and create their
> own 'struct file's. Any of these users which set the write flag
> on the file will cause an extra mnt_drop_write() on __fput(),
> thus dropping the reference count too low.
>
> These users tend to have artifical in-kernel vfsmounts which
> aren't really exposed to userspace and can't be remounted, but
> this patch is included for completeness and so that the warnings
> don't trip over these cases.

Please add a helper to create these files so all this happens in just
one place. There's a fare bit of code duplication already that this will
cleanup. Another nice cleanup you should push out first before doing
the actual feature :)

2006-08-04 21:01:38

by Dave Hansen

[permalink] [raw]
Subject: [PATCH] clean up OCFS2 nlink handling

On Wed, 2006-08-02 at 17:20 -0700, Mark Fasheh wrote:
> > - saved_nlink = inode->i_nlink;
> > - if (S_ISDIR(inode->i_mode))
> > - inode->i_nlink = 0;
> > - else
> > - inode->i_nlink--;
> > -
> > status = ocfs2_request_unlink_vote(inode, dentry,
> > (unsigned int) inode->i_nlink);
> The network request above needs to send what the new nlink will be set to.
> This is so that the other nodes which have an interest in the inode can
> determine whether the will need to do orphan processing on their last iput.
>
> Really what they do is just compare against zero and mark the inode with a
> flag indicating that it may have been orphaned. So we could just pass the
> result of inode_is_unlinkable(), but I'd rather preserve the behavior that
> we have today.
>
> The ugly method is:
>
> - status = ocfs2_request_unlink_vote(inode, dentry,
> - (unsigned int) inode->i_nlink);
> + status = ocfs2_request_unlink_vote(inode, dentry,
> + S_ISDIR(inode->i_mode) && inode->i_nlink == 2 ? 0 : inode->i_nlink - 1);
>
> Better yet, we could store the result of the evaluation in a temporary
> variable and pass that through to the request function.

OK, I hope this does it.

-- Dave

OCFS2 does some operations on i_nlink, then reverts them if some
of its operations fail to complete. This does not fit in well
with the inode_drop_nlink() logic where we expect i_nlink to stay
at zero once it gets there.

So, delay all of the nlink operations until we're sure that the
operations have completed. Also, introduce a small helper to
check whether an inode has proper "unlinkable" i_nlink counts
no matter whether it is a directory or regular inode.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/ocfs2/namei.c | 50
lxc-dave/fs/ocfs2/namei.c.orig | 2268 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 2294 insertions(+), 24 deletions(-)

diff -puN fs/ocfs2/namei.c~Re-_PATCH_04_28_OCFS2_is_screwy fs/ocfs2/namei.c
--- lxc/fs/ocfs2/namei.c~Re-_PATCH_04_28_OCFS2_is_screwy 2006-08-04 13:53:36.000000000 -0700
+++ lxc-dave/fs/ocfs2/namei.c 2006-08-04 13:59:15.000000000 -0700
@@ -744,11 +744,23 @@ bail:
return err;
}

+static inline inode_is_unlinkable(struct inode *inode)
+{
+ if (S_ISDIR(inode->i_mode)) {
+ if (inode->i_nlink == 2)
+ return 1;
+ return 0;
+ }
+
+ if (inode->i_nlink == 1)
+ return 1;
+ return 0;
+}
+
static int ocfs2_unlink(struct inode *dir,
struct dentry *dentry)
{
int status;
- unsigned int saved_nlink = 0;
struct inode *inode = dentry->d_inode;
struct ocfs2_super *osb = OCFS2_SB(dir->i_sb);
u64 blkno;
@@ -760,6 +772,7 @@ static int ocfs2_unlink(struct inode *di
struct buffer_head *dirent_bh = NULL;
char orphan_name[OCFS2_ORPHAN_NAMELEN + 1];
struct buffer_head *orphan_entry_bh = NULL;
+ unsigned int future_nlink;

mlog_entry("(0x%p, 0x%p, '%.*s')\n", dir, dentry,
dentry->d_name.len, dentry->d_name.name);
@@ -823,18 +836,11 @@ static int ocfs2_unlink(struct inode *di
}
}

- /* There are still a few steps left until we can consider the
- * unlink to have succeeded. Save off nlink here before
- * modification so we can set it back in case we hit an issue
- * before commit. */
- saved_nlink = inode->i_nlink;
- if (S_ISDIR(inode->i_mode))
- inode->i_nlink = 0;
+ if (S_ISDIR(inode->i_mode) && (inode->i_nlink == 2))
+ future_nlink = 0;
else
- inode->i_nlink--;
-
- status = ocfs2_request_unlink_vote(inode, dentry,
- (unsigned int) inode->i_nlink);
+ future_nlink = inode->i_nlink - 1;
+ status = ocfs2_request_unlink_vote(inode, dentry, future_nlink);
if (status < 0) {
/* This vote should succeed under all normal
* circumstances. */
@@ -842,7 +848,7 @@ static int ocfs2_unlink(struct inode *di
goto leave;
}

- if (!inode->i_nlink) {
+ if (inode_is_unlinkable(inode)) {
status = ocfs2_prepare_orphan_dir(osb, handle, inode,
orphan_name,
&orphan_entry_bh);
@@ -869,7 +875,7 @@ static int ocfs2_unlink(struct inode *di

fe = (struct ocfs2_dinode *) fe_bh->b_data;

- if (!inode->i_nlink) {
+ if (inode_is_unlinkable(inode)) {
status = ocfs2_orphan_add(osb, handle, inode, fe, orphan_name,
orphan_entry_bh);
if (status < 0) {
@@ -885,10 +891,10 @@ static int ocfs2_unlink(struct inode *di
goto leave;
}

- /* We can set nlink on the dinode now. clear the saved version
- * so that it doesn't get set later. */
+ if (S_ISDIR(inode->i_mode))
+ inode_drop_nlink(inode);
+ inode_drop_nlink(inode);
fe->i_links_count = cpu_to_le16(inode->i_nlink);
- saved_nlink = 0;

status = ocfs2_journal_dirty(handle, fe_bh);
if (status < 0) {
@@ -897,19 +903,15 @@ static int ocfs2_unlink(struct inode *di
}

if (S_ISDIR(inode->i_mode)) {
- dir->i_nlink--;
status = ocfs2_mark_inode_dirty(handle, dir,
parent_node_bh);
- if (status < 0) {
+ if (status < 0)
mlog_errno(status);
- inc_nlink(dir);
- }
+ else
+ inode_drop_nlink(dir);
}

leave:
- if (status < 0 && saved_nlink)
- inode->i_nlink = saved_nlink;
-
if (handle)
ocfs2_commit_trans(handle);


2006-08-04 21:39:17

by Mark Fasheh

[permalink] [raw]
Subject: Re: [PATCH] clean up OCFS2 nlink handling

Hi Dave,

On Fri, Aug 04, 2006 at 02:01:32PM -0700, Dave Hansen wrote:
> OK, I hope this does it.
I think the last part of your patch is bad :/ At any rate, I think the
attached patch fixes up the last hunk for you. HTH.


From: Dave Hansen <[email protected]>

OCFS2 does some operations on i_nlink, then reverts them if some
of its operations fail to complete. This does not fit in well
with the inode_drop_nlink() logic where we expect i_nlink to stay
at zero once it gets there.

So, delay all of the nlink operations until we're sure that the
operations have completed. Also, introduce a small helper to
check whether an inode has proper "unlinkable" i_nlink counts
no matter whether it is a directory or regular inode.

Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Mark Fasheh <[email protected]>

diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 0673862..666953b 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -744,11 +744,23 @@ bail:
return err;
}

+static inline inode_is_unlinkable(struct inode *inode)
+{
+ if (S_ISDIR(inode->i_mode)) {
+ if (inode->i_nlink == 2)
+ return 1;
+ return 0;
+ }
+
+ if (inode->i_nlink == 1)
+ return 1;
+ return 0;
+}
+
static int ocfs2_unlink(struct inode *dir,
struct dentry *dentry)
{
int status;
- unsigned int saved_nlink = 0;
struct inode *inode = dentry->d_inode;
struct ocfs2_super *osb = OCFS2_SB(dir->i_sb);
u64 blkno;
@@ -760,6 +772,7 @@ static int ocfs2_unlink(struct inode *di
struct buffer_head *dirent_bh = NULL;
char orphan_name[OCFS2_ORPHAN_NAMELEN + 1];
struct buffer_head *orphan_entry_bh = NULL;
+ unsigned int future_nlink;

mlog_entry("(0x%p, 0x%p, '%.*s')\n", dir, dentry,
dentry->d_name.len, dentry->d_name.name);
@@ -823,18 +836,11 @@ static int ocfs2_unlink(struct inode *di
}
}

- /* There are still a few steps left until we can consider the
- * unlink to have succeeded. Save off nlink here before
- * modification so we can set it back in case we hit an issue
- * before commit. */
- saved_nlink = inode->i_nlink;
- if (S_ISDIR(inode->i_mode))
- inode->i_nlink = 0;
+ if (S_ISDIR(inode->i_mode) && (inode->i_nlink == 2))
+ future_nlink = 0;
else
- inode->i_nlink--;
-
- status = ocfs2_request_unlink_vote(inode, dentry,
- (unsigned int) inode->i_nlink);
+ future_nlink = inode->i_nlink - 1;
+ status = ocfs2_request_unlink_vote(inode, dentry, future_nlink);
if (status < 0) {
/* This vote should succeed under all normal
* circumstances. */
@@ -842,7 +848,7 @@ static int ocfs2_unlink(struct inode *di
goto leave;
}

- if (!inode->i_nlink) {
+ if (inode_is_unlinkable(inode)) {
status = ocfs2_prepare_orphan_dir(osb, handle, inode,
orphan_name,
&orphan_entry_bh);
@@ -869,7 +875,7 @@ static int ocfs2_unlink(struct inode *di

fe = (struct ocfs2_dinode *) fe_bh->b_data;

- if (!inode->i_nlink) {
+ if (inode_is_unlinkable(inode)) {
status = ocfs2_orphan_add(osb, handle, inode, fe, orphan_name,
orphan_entry_bh);
if (status < 0) {
@@ -885,10 +891,10 @@ static int ocfs2_unlink(struct inode *di
goto leave;
}

- /* We can set nlink on the dinode now. clear the saved version
- * so that it doesn't get set later. */
+ if (S_ISDIR(inode->i_mode))
+ inode_drop_nlink(inode);
+ inode_drop_nlink(inode);
fe->i_links_count = cpu_to_le16(inode->i_nlink);
- saved_nlink = 0;

status = ocfs2_journal_dirty(handle, fe_bh);
if (status < 0) {
@@ -897,19 +903,16 @@ static int ocfs2_unlink(struct inode *di
}

if (S_ISDIR(inode->i_mode)) {
- dir->i_nlink--;
+ inode_drop_nlink(dir);
status = ocfs2_mark_inode_dirty(handle, dir,
parent_node_bh);
if (status < 0) {
mlog_errno(status);
- dir->i_nlink++;
+ inode_inc_nlink(dir);
}
}

leave:
- if (status < 0 && saved_nlink)
- inode->i_nlink = saved_nlink;
-
if (handle)
ocfs2_commit_trans(handle);

2006-08-04 21:47:15

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 06/28] reintroduce list of vfsmounts over superblock

On Thu, 2006-08-03 at 15:39 +0100, Christoph Hellwig wrote:
> On Tue, Aug 01, 2006 at 04:52:44PM -0700, Dave Hansen wrote:
> >
> > We're moving a big chunk of the burden of keeping people from
> > writing to r/o filesystems from the superblock into the
> > vfsmount. This requires that we consult the superblock's
> > vfsmounts when things like remounts occur.
> >
> > So, introduce a list of vfsmounts hanging off the superblock.
> > We'll use this in a bit.
>
> I don't think we'll need it. We really need to keep is someone writing
> to this vfsmount counters in addition to is someone writing to this sb.

This one was a direct request from Al:

http://article.gmane.org/gmane.linux.kernel/421029
> BTW, it might be worth doing the following:
> * reintroduce the list of vfsmounts over given superblock (protected
> by vfsmount_lock)
...

So, I assume that there are some evil plans in the future for this as
well.

> In fact there are cases were we want a superblock to be writeable to
> without any view into it being writeable, e.g. for journal recovery.

Note that, as it stands now, the vfsmount has a flag which duplicates
the superblock r/o flag. It is perfectly fine to do that journal
recovery by making the superblock r/w, but without telling any of the
mounts about it. All of the write requests will get blocked by the
mount flag, and no one will ever see that the superblock changed state.

-- Dave

2006-08-11 20:32:00

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 27/28] elevate writer count for custom 'struct file'

On Thu, 2006-08-03 at 15:42 +0100, Christoph Hellwig wrote:
> Please add a helper to create these files so all this happens in just
> one place. There's a fare bit of code duplication already that this will
> cleanup. Another nice cleanup you should push out first before doing
> the actual feature :)

I'm looking at this bit now. One thing that is a bit confusing is the
internal kernel distinction between a filp and a plain 'struct file'.

Do you have any ideas how you think these interfaces should look?

-- Dave