This is against 2.6.24-rc1 + recent git.
I've integrated all of the fixes from mm, and included cleanups
in a different order. This also includes some extra fput-time
checking to ensure that we have balanced mount writer counts.
These replace the patches in -mm mostly because the new fixes
require some cleanups that are functionally independent from
the r/o bind mount code itself. These fixes precent the rest
of the set and require some by hand merging.
If you're going to review one and only one patch, 17/27 (the
one for "opend" files) is the most critical.
---
Why do we need r/o bind mounts?
This feature allows a read-only view into a read-write filesystem.
In the process of doing that, it also provides infrastructure for
keeping track of the number of writers to any given mount.
This has a number of uses. It allows chroots to have parts of
filesystems writable. It will be useful for containers in the future
because users may have root inside a container, but should not
be allowed to write to somefilesystems. This also replaces
patches that vserver has had out of the tree for several years.
It allows security enhancement by making sure that parts of
your filesystem are read-only (such as when you don't trust your
FTP server), when you don't want to have entire new filesystems
mounted, or when you want atime selectively updated.
I've been using this script:
http://sr71.net/~dave/linux/robind-test.sh
to test that the feature is working as desired. It takes a
directory and makes a regular bind and a r/o bind mount of it.
It then performs some normal filesystem operations on the
three directories, including ones that are expected to fail,
like creating a file on the r/o mount.
Signed-off-by: Dave Hansen <[email protected]>
open_namei() will, in the future, need to take mount write counts
over its creation and truncation (via may_open()) operations. It
needs to keep these write counts until any potential filp that is
created gets __fput()'d.
This gets complicated in the error handling and becomes very murky
as to how far open_namei() actually got, and whether or not that
mount write count was taken.
Creating the filps inside of open_namei() lets us shift the write
count to be taken and released along with the filp. We can hold
a temporary write count during those creation and truncation
operations, then release them once the write for the filp has
been established.
Any caller who gets a 'struct file' back must consider that filp
instantiated and fput() it normally. The callers no longer
have to worry about ever manually releasing a mnt write count.
Signed-off-by: Dave Hansen <[email protected]>
---
linux-2.6.git-dave/fs/namei.c | 16 ++++++++--------
linux-2.6.git-dave/fs/open.c | 7 +------
linux-2.6.git-dave/include/linux/fs.h | 2 +-
3 files changed, 10 insertions(+), 15 deletions(-)
diff -puN fs/namei.c~make-open_namei-return-a-filp fs/namei.c
--- linux-2.6.git/fs/namei.c~make-open_namei-return-a-filp 2007-11-01 14:46:05.000000000 -0700
+++ linux-2.6.git-dave/fs/namei.c 2007-11-01 14:46:05.000000000 -0700
@@ -1730,8 +1730,8 @@ static inline int sys_open_flags_to_name
* system call. See sys_open_flags_to_namei_flags().
* SMP-safe
*/
-int open_namei(int dfd, const char *pathname, int sys_open_flag,
- int mode, struct nameidata *nd)
+struct file *open_namei(int dfd, const char *pathname, int sys_open_flag,
+ int mode, struct nameidata *nd)
{
int acc_mode, error;
struct path path;
@@ -1757,7 +1757,7 @@ int open_namei(int dfd, const char *path
error = path_lookup_open(dfd, pathname, lookup_flags(flag),
nd, flag);
if (error)
- return error;
+ return ERR_PTR(error);
goto ok;
}
@@ -1766,7 +1766,7 @@ int open_namei(int dfd, const char *path
*/
error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
if (error)
- return error;
+ return ERR_PTR(error);
/*
* We have the parent and last component. First of all, check
@@ -1801,7 +1801,7 @@ do_last:
error = __open_namei_create(nd, &path, flag, mode);
if (error)
goto exit;
- return 0;
+ return nameidata_to_filp(nd, sys_open_flag);
}
/*
@@ -1834,7 +1834,7 @@ ok:
error = may_open(nd, acc_mode, flag);
if (error)
goto exit;
- return 0;
+ return nameidata_to_filp(nd, sys_open_flag);
exit_dput:
dput_path(&path, nd);
@@ -1842,7 +1842,7 @@ exit:
if (!IS_ERR(nd->intent.open.file))
release_open_intent(nd);
path_release(nd);
- return error;
+ return ERR_PTR(error);
do_link:
error = -ELOOP;
@@ -1869,7 +1869,7 @@ do_link:
* with "intent.open".
*/
release_open_intent(nd);
- return error;
+ return ERR_PTR(error);
}
nd->flags &= ~LOOKUP_PARENT;
if (nd->last_type == LAST_BIND)
diff -puN fs/open.c~make-open_namei-return-a-filp fs/open.c
--- linux-2.6.git/fs/open.c~make-open_namei-return-a-filp 2007-11-01 14:46:05.000000000 -0700
+++ linux-2.6.git-dave/fs/open.c 2007-11-01 14:46:05.000000000 -0700
@@ -803,14 +803,9 @@ cleanup_file:
static struct file *do_filp_open(int dfd, const char *filename, int flags,
int mode)
{
- int error;
struct nameidata nd;
- error = open_namei(dfd, filename, flags, mode, &nd);
- if (!error)
- return nameidata_to_filp(&nd, flags);
-
- return ERR_PTR(error);
+ return open_namei(dfd, filename, flags, mode, &nd);
}
struct file *filp_open(const char *filename, int flags, int mode)
diff -puN include/linux/fs.h~make-open_namei-return-a-filp include/linux/fs.h
--- linux-2.6.git/include/linux/fs.h~make-open_namei-return-a-filp 2007-11-01 14:46:05.000000000 -0700
+++ linux-2.6.git-dave/include/linux/fs.h 2007-11-01 14:46:05.000000000 -0700
@@ -1714,7 +1714,7 @@ extern struct file *create_read_pipe(str
extern struct file *create_write_pipe(void);
extern void free_write_pipe(struct file *);
-extern int open_namei(int dfd, const char *, int, int, struct nameidata *);
+extern struct file *open_namei(int dfd, const char *, int, int, struct nameidata *);
extern int may_open(struct nameidata *, int, int);
extern int kernel_read(struct file *, unsigned long, char *, unsigned long);
_
This kills off the almost empty do_filp_open(). The indenting
change in do_sys_open() is because we would have gone over our
80 characters otherwise.
Signed-off-by: Dave Hansen <[email protected]>
---
linux-2.6.git-dave/fs/open.c | 39 ++++++++++++++++++---------------------
1 file changed, 18 insertions(+), 21 deletions(-)
diff -puN fs/open.c~kill-do_filp_open fs/open.c
--- linux-2.6.git/fs/open.c~kill-do_filp_open 2007-11-01 14:46:06.000000000 -0700
+++ linux-2.6.git-dave/fs/open.c 2007-11-01 14:46:06.000000000 -0700
@@ -800,17 +800,10 @@ cleanup_file:
return ERR_PTR(error);
}
-static struct file *do_filp_open(int dfd, const char *filename, int flags,
- int mode)
-{
- struct nameidata nd;
-
- return open_namei(dfd, filename, flags, mode, &nd);
-}
-
struct file *filp_open(const char *filename, int flags, int mode)
{
- return do_filp_open(AT_FDCWD, filename, flags, mode);
+ struct nameidata nd;
+ return open_namei(AT_FDCWD, filename, flags, mode, &nd);
}
EXPORT_SYMBOL(filp_open);
@@ -1009,20 +1002,24 @@ long do_sys_open(int dfd, const char __u
char *tmp = getname(filename);
int fd = PTR_ERR(tmp);
- if (!IS_ERR(tmp)) {
- fd = get_unused_fd_flags(flags);
- if (fd >= 0) {
- struct file *f = do_filp_open(dfd, tmp, flags, mode);
- if (IS_ERR(f)) {
- put_unused_fd(fd);
- fd = PTR_ERR(f);
- } else {
- fsnotify_open(f->f_path.dentry);
- fd_install(fd, f);
- }
+ if (IS_ERR(tmp))
+ goto out;
+
+ fd = get_unused_fd_flags(flags);
+ if (fd >= 0) {
+ struct nameidata nd;
+ struct file *f = open_namei(dfd, tmp, flags, mode, &nd);
+
+ if (IS_ERR(f)) {
+ put_unused_fd(fd);
+ fd = PTR_ERR(f);
+ } else {
+ fsnotify_open(f->f_path.dentry);
+ fd_install(fd, f);
}
- putname(tmp);
}
+ putname(tmp);
+out:
return fd;
}
_
Replace all callers with open_namei() directly, and move the
nameidata stack allocation into open_namei().
Signed-off-by: Dave Hansen <[email protected]>
---
linux-2.6.git-dave/drivers/usb/gadget/file_storage.c | 5 -
linux-2.6.git-dave/fs/exec.c | 2
linux-2.6.git-dave/fs/namei.c | 71 +++++++++----------
linux-2.6.git-dave/fs/nfsctl.c | 5 +
linux-2.6.git-dave/fs/open.c | 6 -
linux-2.6.git-dave/fs/reiserfs/journal.c | 2
linux-2.6.git-dave/include/linux/fs.h | 3
linux-2.6.git-dave/kernel/acct.c | 2
linux-2.6.git-dave/mm/swapfile.c | 4 -
linux-2.6.git-dave/sound/sound_firmware.c | 2
10 files changed, 53 insertions(+), 49 deletions(-)
diff -puN drivers/usb/gadget/file_storage.c~kill-filp_open drivers/usb/gadget/file_storage.c
--- linux-2.6.git/drivers/usb/gadget/file_storage.c~kill-filp_open 2007-11-01 14:46:06.000000000 -0700
+++ linux-2.6.git-dave/drivers/usb/gadget/file_storage.c 2007-11-01 14:46:06.000000000 -0700
@@ -3468,16 +3468,17 @@ static int open_backing_file(struct lun
struct inode *inode = NULL;
loff_t size;
loff_t num_sectors;
+ int mode = O_LARGEFILE;
/* R/W if we can, R/O if we must */
ro = curlun->ro;
if (!ro) {
- filp = filp_open(filename, O_RDWR | O_LARGEFILE, 0);
+ filp = open_namei(AT_FDCWD, filename, O_RDWR | mode, 0);
if (-EROFS == PTR_ERR(filp))
ro = 1;
}
if (ro)
- filp = filp_open(filename, O_RDONLY | O_LARGEFILE, 0);
+ filp = open_namei(AT_FDCWD, filename, O_RDONLY | mode, 0);
if (IS_ERR(filp)) {
LINFO(curlun, "unable to open backing file: %s\n", filename);
return PTR_ERR(filp);
diff -puN fs/exec.c~kill-filp_open fs/exec.c
--- linux-2.6.git/fs/exec.c~kill-filp_open 2007-11-01 14:46:06.000000000 -0700
+++ linux-2.6.git-dave/fs/exec.c 2007-11-01 14:46:06.000000000 -0700
@@ -1763,7 +1763,7 @@ int do_coredump(long signr, int exit_cod
goto fail_unlock;
}
} else
- file = filp_open(corename,
+ file = open_namei(AT_FDCWD, corename,
O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
0600);
if (IS_ERR(file))
diff -puN fs/namei.c~kill-filp_open fs/namei.c
--- linux-2.6.git/fs/namei.c~kill-filp_open 2007-11-01 14:46:06.000000000 -0700
+++ linux-2.6.git-dave/fs/namei.c 2007-11-01 14:46:06.000000000 -0700
@@ -1730,9 +1730,10 @@ static inline int sys_open_flags_to_name
* system call. See sys_open_flags_to_namei_flags().
* SMP-safe
*/
-struct file *open_namei(int dfd, const char *pathname, int sys_open_flag,
- int mode, struct nameidata *nd)
+struct file *open_pathname(int dfd, const char *pathname,
+ int sys_open_flag, int mode)
{
+ struct nameidata nd;
int acc_mode, error;
struct path path;
struct dentry *dir;
@@ -1755,7 +1756,7 @@ struct file *open_namei(int dfd, const c
*/
if (!(flag & O_CREAT)) {
error = path_lookup_open(dfd, pathname, lookup_flags(flag),
- nd, flag);
+ &nd, flag);
if (error)
return ERR_PTR(error);
goto ok;
@@ -1764,7 +1765,7 @@ struct file *open_namei(int dfd, const c
/*
* Create - we need to know the parent.
*/
- error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
+ error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,&nd,flag,mode);
if (error)
return ERR_PTR(error);
@@ -1774,14 +1775,14 @@ struct file *open_namei(int dfd, const c
* will not do.
*/
error = -EISDIR;
- if (nd->last_type != LAST_NORM || nd->last.name[nd->last.len])
+ if (nd.last_type != LAST_NORM || nd.last.name[nd.last.len])
goto exit;
- dir = nd->dentry;
- nd->flags &= ~LOOKUP_PARENT;
+ dir = nd.dentry;
+ nd.flags &= ~LOOKUP_PARENT;
mutex_lock(&dir->d_inode->i_mutex);
- path.dentry = lookup_hash(nd);
- path.mnt = nd->mnt;
+ path.dentry = lookup_hash(&nd);
+ path.mnt = nd.mnt;
do_last:
error = PTR_ERR(path.dentry);
@@ -1790,18 +1791,18 @@ do_last:
goto exit;
}
- if (IS_ERR(nd->intent.open.file)) {
+ if (IS_ERR(nd.intent.open.file)) {
mutex_unlock(&dir->d_inode->i_mutex);
- error = PTR_ERR(nd->intent.open.file);
+ error = PTR_ERR(nd.intent.open.file);
goto exit_dput;
}
/* Negative dentry, just create the file */
if (!path.dentry->d_inode) {
- error = __open_namei_create(nd, &path, flag, mode);
+ error = __open_namei_create(&nd, &path, flag, mode);
if (error)
goto exit;
- return nameidata_to_filp(nd, sys_open_flag);
+ return nameidata_to_filp(&nd, sys_open_flag);
}
/*
@@ -1826,22 +1827,22 @@ do_last:
if (path.dentry->d_inode->i_op && path.dentry->d_inode->i_op->follow_link)
goto do_link;
- path_to_nameidata(&path, nd);
+ path_to_nameidata(&path, &nd);
error = -EISDIR;
if (path.dentry->d_inode && S_ISDIR(path.dentry->d_inode->i_mode))
goto exit;
ok:
- error = may_open(nd, acc_mode, flag);
+ error = may_open(&nd, acc_mode, flag);
if (error)
goto exit;
- return nameidata_to_filp(nd, sys_open_flag);
+ return nameidata_to_filp(&nd, sys_open_flag);
exit_dput:
- dput_path(&path, nd);
+ dput_path(&path, &nd);
exit:
- if (!IS_ERR(nd->intent.open.file))
- release_open_intent(nd);
- path_release(nd);
+ if (!IS_ERR(&nd.intent.open.file))
+ release_open_intent(&nd);
+ path_release(&nd);
return ERR_PTR(error);
do_link:
@@ -1855,42 +1856,42 @@ do_link:
* After that we have the parent and last component, i.e.
* we are in the same situation as after the first path_walk().
* Well, almost - if the last component is normal we get its copy
- * stored in nd->last.name and we will have to putname() it when we
+ * stored in nd.last.name and we will have to putname() it when we
* are done. Procfs-like symlinks just set LAST_BIND.
*/
- nd->flags |= LOOKUP_PARENT;
- error = security_inode_follow_link(path.dentry, nd);
+ nd.flags |= LOOKUP_PARENT;
+ error = security_inode_follow_link(path.dentry, &nd);
if (error)
goto exit_dput;
- error = __do_follow_link(&path, nd);
+ error = __do_follow_link(&path, &nd);
if (error) {
/* Does someone understand code flow here? Or it is only
* me so stupid? Anathema to whoever designed this non-sense
* with "intent.open".
*/
- release_open_intent(nd);
+ release_open_intent(&nd);
return ERR_PTR(error);
}
- nd->flags &= ~LOOKUP_PARENT;
- if (nd->last_type == LAST_BIND)
+ nd.flags &= ~LOOKUP_PARENT;
+ if (nd.last_type == LAST_BIND)
goto ok;
error = -EISDIR;
- if (nd->last_type != LAST_NORM)
+ if (nd.last_type != LAST_NORM)
goto exit;
- if (nd->last.name[nd->last.len]) {
- __putname(nd->last.name);
+ if (nd.last.name[nd.last.len]) {
+ __putname(nd.last.name);
goto exit;
}
error = -ELOOP;
if (count++==32) {
- __putname(nd->last.name);
+ __putname(nd.last.name);
goto exit;
}
- dir = nd->dentry;
+ dir = nd.dentry;
mutex_lock(&dir->d_inode->i_mutex);
- path.dentry = lookup_hash(nd);
- path.mnt = nd->mnt;
- __putname(nd->last.name);
+ path.dentry = lookup_hash(&nd);
+ path.mnt = nd.mnt;
+ __putname(nd.last.name);
goto do_last;
}
diff -puN fs/nfsctl.c~kill-filp_open fs/nfsctl.c
--- linux-2.6.git/fs/nfsctl.c~kill-filp_open 2007-11-01 14:46:06.000000000 -0700
+++ linux-2.6.git-dave/fs/nfsctl.c 2007-11-01 14:46:06.000000000 -0700
@@ -35,6 +35,11 @@ static struct file *do_open(char *name,
if (error)
return ERR_PTR(error);
+ /*
+ * If O_TRUNC is ever passed into may_open()
+ * it will be necessary to take and release
+ * mnt writer count around these calls.
+ */
if (flags == O_RDWR)
error = may_open(&nd,MAY_READ|MAY_WRITE,FMODE_READ|FMODE_WRITE);
else
diff -puN fs/open.c~kill-filp_open fs/open.c
--- linux-2.6.git/fs/open.c~kill-filp_open 2007-11-01 14:46:06.000000000 -0700
+++ linux-2.6.git-dave/fs/open.c 2007-11-01 14:46:06.000000000 -0700
@@ -802,8 +802,7 @@ cleanup_file:
struct file *filp_open(const char *filename, int flags, int mode)
{
- struct nameidata nd;
- return open_namei(AT_FDCWD, filename, flags, mode, &nd);
+ return open_namei(AT_FDCWD, filename, flags, mode);
}
EXPORT_SYMBOL(filp_open);
@@ -1007,8 +1006,7 @@ long do_sys_open(int dfd, const char __u
fd = get_unused_fd_flags(flags);
if (fd >= 0) {
- struct nameidata nd;
- struct file *f = open_namei(dfd, tmp, flags, mode, &nd);
+ struct file *f = open_pathname(dfd, tmp, flags, mode);
if (IS_ERR(f)) {
put_unused_fd(fd);
diff -puN fs/reiserfs/journal.c~kill-filp_open fs/reiserfs/journal.c
--- linux-2.6.git/fs/reiserfs/journal.c~kill-filp_open 2007-11-01 14:46:06.000000000 -0700
+++ linux-2.6.git-dave/fs/reiserfs/journal.c 2007-11-01 14:46:06.000000000 -0700
@@ -2625,7 +2625,7 @@ static int journal_init_dev(struct super
return 0;
}
- journal->j_dev_file = filp_open(jdev_name, 0, 0);
+ journal->j_dev_file = open_namei(AT_FDCWD, jdev_name, 0, 0);
if (!IS_ERR(journal->j_dev_file)) {
struct inode *jdev_inode = journal->j_dev_file->f_mapping->host;
if (!S_ISBLK(jdev_inode->i_mode)) {
diff -puN include/linux/fs.h~kill-filp_open include/linux/fs.h
--- linux-2.6.git/include/linux/fs.h~kill-filp_open 2007-11-01 14:46:06.000000000 -0700
+++ linux-2.6.git-dave/include/linux/fs.h 2007-11-01 14:46:06.000000000 -0700
@@ -1540,7 +1540,6 @@ extern int do_truncate(struct dentry *,
struct file *filp);
extern long do_sys_open(int dfd, const char __user *filename, int flags,
int mode);
-extern struct file *filp_open(const char *, int, int);
extern struct file * dentry_open(struct dentry *, struct vfsmount *, int);
extern int filp_close(struct file *, fl_owner_t id);
extern char * getname(const char __user *);
@@ -1714,7 +1713,7 @@ extern struct file *create_read_pipe(str
extern struct file *create_write_pipe(void);
extern void free_write_pipe(struct file *);
-extern struct file *open_namei(int dfd, const char *, int, int, struct nameidata *);
+extern struct file *open_pathname(int dfd, const char *, int, int);
extern int may_open(struct nameidata *, int, int);
extern int kernel_read(struct file *, unsigned long, char *, unsigned long);
diff -puN kernel/acct.c~kill-filp_open kernel/acct.c
--- linux-2.6.git/kernel/acct.c~kill-filp_open 2007-11-01 14:46:06.000000000 -0700
+++ linux-2.6.git-dave/kernel/acct.c 2007-11-01 14:46:06.000000000 -0700
@@ -208,7 +208,7 @@ static int acct_on(char *name)
int error;
/* Difference from BSD - they don't do O_APPEND */
- file = filp_open(name, O_WRONLY|O_APPEND|O_LARGEFILE, 0);
+ file = open_namei(AT_FDCWD, name, O_WRONLY|O_APPEND|O_LARGEFILE, 0);
if (IS_ERR(file))
return PTR_ERR(file);
diff -puN mm/swapfile.c~kill-filp_open mm/swapfile.c
--- linux-2.6.git/mm/swapfile.c~kill-filp_open 2007-11-01 14:46:06.000000000 -0700
+++ linux-2.6.git-dave/mm/swapfile.c 2007-11-01 14:46:06.000000000 -0700
@@ -1191,7 +1191,7 @@ asmlinkage long sys_swapoff(const char _
if (IS_ERR(pathname))
goto out;
- victim = filp_open(pathname, O_RDWR|O_LARGEFILE, 0);
+ victim = open_namei(AT_FDCWD, pathname, O_RDWR|O_LARGEFILE, 0);
putname(pathname);
err = PTR_ERR(victim);
if (IS_ERR(victim))
@@ -1470,7 +1470,7 @@ asmlinkage long sys_swapon(const char __
name = NULL;
goto bad_swap_2;
}
- swap_file = filp_open(name, O_RDWR|O_LARGEFILE, 0);
+ swap_file = open_namei(AT_FDCWD, name, O_RDWR|O_LARGEFILE, 0);
error = PTR_ERR(swap_file);
if (IS_ERR(swap_file)) {
swap_file = NULL;
diff -puN sound/sound_firmware.c~kill-filp_open sound/sound_firmware.c
--- linux-2.6.git/sound/sound_firmware.c~kill-filp_open 2007-11-01 14:46:06.000000000 -0700
+++ linux-2.6.git-dave/sound/sound_firmware.c 2007-11-01 14:46:06.000000000 -0700
@@ -14,7 +14,7 @@ static int do_mod_firmware_load(const ch
char *dp;
loff_t pos;
- filp = filp_open(fn, 0, 0);
+ filp = open_namei(AT_FDCWD, fn, 0, 0);
if (IS_ERR(filp))
{
printk(KERN_INFO "Unable to load '%s'.\n", fn);
_
open_namei() no longer touches namei's. rename it
to something more appropriate: open_pathname().
Signed-off-by: Dave Hansen <[email protected]>
---
linux-2.6.git-dave/drivers/usb/gadget/file_storage.c | 4 ++--
linux-2.6.git-dave/fs/exec.c | 2 +-
linux-2.6.git-dave/fs/namei.c | 8 ++++----
linux-2.6.git-dave/fs/open.c | 6 ------
linux-2.6.git-dave/fs/reiserfs/journal.c | 2 +-
linux-2.6.git-dave/kernel/acct.c | 2 +-
linux-2.6.git-dave/mm/swapfile.c | 4 ++--
linux-2.6.git-dave/sound/sound_firmware.c | 2 +-
8 files changed, 12 insertions(+), 18 deletions(-)
diff -puN drivers/usb/gadget/file_storage.c~rename-open_namei drivers/usb/gadget/file_storage.c
--- linux-2.6.git/drivers/usb/gadget/file_storage.c~rename-open_namei 2007-11-01 14:46:07.000000000 -0700
+++ linux-2.6.git-dave/drivers/usb/gadget/file_storage.c 2007-11-01 14:46:07.000000000 -0700
@@ -3473,12 +3473,12 @@ static int open_backing_file(struct lun
/* R/W if we can, R/O if we must */
ro = curlun->ro;
if (!ro) {
- filp = open_namei(AT_FDCWD, filename, O_RDWR | mode, 0);
+ filp = open_pathname(AT_FDCWD, filename, O_RDWR | mode, 0);
if (-EROFS == PTR_ERR(filp))
ro = 1;
}
if (ro)
- filp = open_namei(AT_FDCWD, filename, O_RDONLY | mode, 0);
+ filp = open_pathname(AT_FDCWD, filename, O_RDONLY | mode, 0);
if (IS_ERR(filp)) {
LINFO(curlun, "unable to open backing file: %s\n", filename);
return PTR_ERR(filp);
diff -puN fs/exec.c~rename-open_namei fs/exec.c
--- linux-2.6.git/fs/exec.c~rename-open_namei 2007-11-01 14:46:07.000000000 -0700
+++ linux-2.6.git-dave/fs/exec.c 2007-11-01 14:46:07.000000000 -0700
@@ -1763,7 +1763,7 @@ int do_coredump(long signr, int exit_cod
goto fail_unlock;
}
} else
- file = open_namei(AT_FDCWD, corename,
+ file = open_pathname(AT_FDCWD, corename,
O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
0600);
if (IS_ERR(file))
diff -puN fs/namei.c~rename-open_namei fs/namei.c
--- linux-2.6.git/fs/namei.c~rename-open_namei 2007-11-01 14:46:07.000000000 -0700
+++ linux-2.6.git-dave/fs/namei.c 2007-11-01 14:46:07.000000000 -0700
@@ -81,7 +81,7 @@
*/
/* [16-Dec-97 Kevin Buhr] For security reasons, we change some symlink
- * semantics. See the comments in "open_namei" and "do_link" below.
+ * semantics. See the comments in "open_pathname" and "do_link" below.
*
* [10-Sep-98 Alan Modra] Another symlink change.
*/
@@ -571,7 +571,7 @@ out:
if (nd->depth || res || nd->last_type!=LAST_NORM)
return res;
/*
- * If it is an iterative symlinks resolution in open_namei() we
+ * If it is an iterative symlinks resolution in open_pathname() we
* have to copy the last component. And all that crap because of
* bloody create() on broken symlinks. Furrfu...
*/
@@ -1708,7 +1708,7 @@ static int __open_namei_create(struct na
* 01 - read-permission
* 10 - write-permission
* 11 - read-write
- * for the internal routines (ie open_namei()/follow_link() etc)
+ * for the internal routines (ie open_pathname()/follow_link() etc)
* This is more logical, and also allows the 00 "no perm needed"
* to be used for symlinks (where the permissions are checked
* later).
@@ -1722,7 +1722,7 @@ static inline int sys_open_flags_to_name
}
/*
- * open_namei()
+ * open_pathname()
*
* namei for open - this is in fact almost the whole open-routine.
*
diff -puN fs/open.c~rename-open_namei fs/open.c
--- linux-2.6.git/fs/open.c~rename-open_namei 2007-11-01 14:46:07.000000000 -0700
+++ linux-2.6.git-dave/fs/open.c 2007-11-01 14:46:07.000000000 -0700
@@ -800,12 +800,6 @@ cleanup_file:
return ERR_PTR(error);
}
-struct file *filp_open(const char *filename, int flags, int mode)
-{
- return open_namei(AT_FDCWD, filename, flags, mode);
-}
-EXPORT_SYMBOL(filp_open);
-
/**
* lookup_instantiate_filp - instantiates the open intent filp
* @nd: pointer to nameidata
diff -puN fs/reiserfs/journal.c~rename-open_namei fs/reiserfs/journal.c
--- linux-2.6.git/fs/reiserfs/journal.c~rename-open_namei 2007-11-01 14:46:07.000000000 -0700
+++ linux-2.6.git-dave/fs/reiserfs/journal.c 2007-11-01 14:46:07.000000000 -0700
@@ -2625,7 +2625,7 @@ static int journal_init_dev(struct super
return 0;
}
- journal->j_dev_file = open_namei(AT_FDCWD, jdev_name, 0, 0);
+ journal->j_dev_file = open_pathname(AT_FDCWD, jdev_name, 0, 0);
if (!IS_ERR(journal->j_dev_file)) {
struct inode *jdev_inode = journal->j_dev_file->f_mapping->host;
if (!S_ISBLK(jdev_inode->i_mode)) {
diff -puN kernel/acct.c~rename-open_namei kernel/acct.c
--- linux-2.6.git/kernel/acct.c~rename-open_namei 2007-11-01 14:46:07.000000000 -0700
+++ linux-2.6.git-dave/kernel/acct.c 2007-11-01 14:46:07.000000000 -0700
@@ -208,7 +208,7 @@ static int acct_on(char *name)
int error;
/* Difference from BSD - they don't do O_APPEND */
- file = open_namei(AT_FDCWD, name, O_WRONLY|O_APPEND|O_LARGEFILE, 0);
+ file = open_pathname(AT_FDCWD, name, O_WRONLY|O_APPEND|O_LARGEFILE, 0);
if (IS_ERR(file))
return PTR_ERR(file);
diff -puN mm/swapfile.c~rename-open_namei mm/swapfile.c
--- linux-2.6.git/mm/swapfile.c~rename-open_namei 2007-11-01 14:46:07.000000000 -0700
+++ linux-2.6.git-dave/mm/swapfile.c 2007-11-01 14:46:07.000000000 -0700
@@ -1191,7 +1191,7 @@ asmlinkage long sys_swapoff(const char _
if (IS_ERR(pathname))
goto out;
- victim = open_namei(AT_FDCWD, pathname, O_RDWR|O_LARGEFILE, 0);
+ victim = open_pathname(AT_FDCWD, pathname, O_RDWR|O_LARGEFILE, 0);
putname(pathname);
err = PTR_ERR(victim);
if (IS_ERR(victim))
@@ -1470,7 +1470,7 @@ asmlinkage long sys_swapon(const char __
name = NULL;
goto bad_swap_2;
}
- swap_file = open_namei(AT_FDCWD, name, O_RDWR|O_LARGEFILE, 0);
+ swap_file = open_pathname(AT_FDCWD, name, O_RDWR|O_LARGEFILE, 0);
error = PTR_ERR(swap_file);
if (IS_ERR(swap_file)) {
swap_file = NULL;
diff -puN sound/sound_firmware.c~rename-open_namei sound/sound_firmware.c
--- linux-2.6.git/sound/sound_firmware.c~rename-open_namei 2007-11-01 14:46:07.000000000 -0700
+++ linux-2.6.git-dave/sound/sound_firmware.c 2007-11-01 14:46:07.000000000 -0700
@@ -14,7 +14,7 @@ static int do_mod_firmware_load(const ch
char *dp;
loff_t pos;
- filp = open_namei(AT_FDCWD, fn, 0, 0);
+ filp = open_pathname(AT_FDCWD, fn, 0, 0);
if (IS_ERR(filp))
{
printk(KERN_INFO "Unable to load '%s'.\n", fn);
_
Pretty self-explanatory. Fits in with the rest of the series.
Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/namei.c | 5 +++++
linux-2.6.git-dave/fs/nfsd/nfs4recover.c | 5 +++++
2 files changed, 10 insertions(+)
diff -puN fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir 2007-11-01 14:46:09.000000000 -0700
+++ linux-2.6.git-dave/fs/namei.c 2007-11-01 14:46:09.000000000 -0700
@@ -2067,7 +2067,12 @@ asmlinkage long sys_mkdirat(int dfd, con
if (!IS_POSIXACL(nd.dentry->d_inode))
mode &= ~current->fs->umask;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
error = vfs_mkdir(nd.dentry->d_inode, dentry, mode);
+ mnt_drop_write(nd.mnt);
+out_dput:
dput(dentry);
out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
diff -puN fs/nfsd/nfs4recover.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir fs/nfsd/nfs4recover.c
--- linux-2.6.git/fs/nfsd/nfs4recover.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir 2007-11-01 14:46:09.000000000 -0700
+++ linux-2.6.git-dave/fs/nfsd/nfs4recover.c 2007-11-01 14:46:09.000000000 -0700
@@ -41,6 +41,7 @@
#include <linux/nfsd/xdr4.h>
#include <linux/param.h>
#include <linux/file.h>
+#include <linux/mount.h>
#include <linux/namei.h>
#include <asm/uaccess.h>
#include <asm/scatterlist.h>
@@ -154,7 +155,11 @@ nfsd4_create_clid_dir(struct nfs4_client
dprintk("NFSD: nfsd4_create_clid_dir: DIRECTORY EXISTS\n");
goto out_put;
}
+ status = mnt_want_write(rec_dir.mnt);
+ if (status)
+ goto out_put;
status = vfs_mkdir(rec_dir.dentry->d_inode, dentry, S_IRWXU);
+ mnt_drop_write(rec_dir.mnt);
out_put:
dput(dentry);
out_unlock:
_
This patch adds two function mnt_want_write() and mnt_drop_write(). These are
used like a lock pair around and fs operations that might cause a write to the
filesystem.
Before these can become useful, we must first cover each place in the VFS
where writes are performed with a want/drop pair. When that is complete, we
can actually introduce code that will safely check the counts before allowing
r/w<->r/o transitions to occur.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
---
linux-2.6.git-dave/fs/namespace.c | 54 +++++++++++++++++++++++++++++++
linux-2.6.git-dave/include/linux/mount.h | 3 +
2 files changed, 57 insertions(+)
diff -puN fs/namespace.c~r-o-bind-mounts-stub-functions fs/namespace.c
--- linux-2.6.git/fs/namespace.c~r-o-bind-mounts-stub-functions 2007-11-01 14:46:08.000000000 -0700
+++ linux-2.6.git-dave/fs/namespace.c 2007-11-01 14:46:08.000000000 -0700
@@ -77,6 +77,60 @@ struct vfsmount *alloc_vfsmnt(const char
return mnt;
}
+/*
+ * Most r/o checks on a fs are for operations that take
+ * discrete amounts of time, like a write() or unlink().
+ * We must keep track of when those operations start
+ * (for permission checks) and when they end, so that
+ * we can determine when writes are able to occur to
+ * a filesystem.
+ */
+/**
+ * mnt_want_write - get write access to a mount
+ * @mnt: the mount on which to take a write
+ *
+ * This tells the low-level filesystem that a write is
+ * about to be performed to it, and makes sure that
+ * writes are allowed before returning success. When
+ * the write operation is finished, mnt_drop_write()
+ * must be called. This is effectively a refcount.
+ */
+int mnt_want_write(struct vfsmount *mnt)
+{
+ if (__mnt_is_readonly(mnt))
+ return -EROFS;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mnt_want_write);
+
+/**
+ * mnt_drop_write - give up write access to a mount
+ * @mnt: the mount on which to give up write access
+ *
+ * Tells the low-level filesystem that we are done
+ * performing writes to it. Must be matched with
+ * mnt_want_write() call above.
+ */
+void mnt_drop_write(struct vfsmount *mnt)
+{
+}
+EXPORT_SYMBOL_GPL(mnt_drop_write);
+
+/*
+ * __mnt_is_readonly: check whether a mount is read-only
+ * @mnt: the mount to check for its write status
+ *
+ * This shouldn't be used directly ouside of the VFS.
+ * It does not guarantee that the filesystem will stay
+ * r/w, just that it is right *now*. This can not and
+ * should not be used in place of IS_RDONLY(inode).
+ */
+int __mnt_is_readonly(struct vfsmount *mnt)
+{
+ return (mnt->mnt_sb->s_flags & MS_RDONLY);
+}
+EXPORT_SYMBOL_GPL(__mnt_is_readonly);
+
int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
{
mnt->mnt_sb = sb;
diff -puN include/linux/mount.h~r-o-bind-mounts-stub-functions include/linux/mount.h
--- linux-2.6.git/include/linux/mount.h~r-o-bind-mounts-stub-functions 2007-11-01 14:46:08.000000000 -0700
+++ linux-2.6.git-dave/include/linux/mount.h 2007-11-01 14:46:08.000000000 -0700
@@ -70,9 +70,12 @@ static inline struct vfsmount *mntget(st
return mnt;
}
+extern int mnt_want_write(struct vfsmount *mnt);
+extern void mnt_drop_write(struct vfsmount *mnt);
extern void mntput_no_expire(struct vfsmount *mnt);
extern void mnt_pin(struct vfsmount *mnt);
extern void mnt_unpin(struct vfsmount *mnt);
+extern int __mnt_is_readonly(struct vfsmount *mnt);
static inline void mntput(struct vfsmount *mnt)
{
_
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/namei.c | 4 ++++
linux-2.6.git-dave/ipc/mqueue.c | 5 ++++-
2 files changed, 8 insertions(+), 1 deletion(-)
diff -puN fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers 2007-11-01 14:46:10.000000000 -0700
+++ linux-2.6.git-dave/fs/namei.c 2007-11-01 14:46:10.000000000 -0700
@@ -2264,7 +2264,11 @@ static long do_unlinkat(int dfd, const c
inode = dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto exit2;
error = vfs_unlink(nd.dentry->d_inode, dentry);
+ mnt_drop_write(nd.mnt);
exit2:
dput(dentry);
}
diff -puN ipc/mqueue.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers ipc/mqueue.c
--- linux-2.6.git/ipc/mqueue.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers 2007-11-01 14:46:10.000000000 -0700
+++ linux-2.6.git-dave/ipc/mqueue.c 2007-11-01 14:46:10.000000000 -0700
@@ -743,8 +743,11 @@ asmlinkage long sys_mq_unlink(const char
inode = dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
-
+ err = mnt_want_write(mqueue_mnt);
+ if (err)
+ goto out_err;
err = vfs_unlink(dentry->d_parent->d_inode, dentry);
+ mnt_drop_write(mqueue_mnt);
out_err:
dput(dentry);
_
This basically audits the callers of xattr_permission(), which calls
permission() and can perform writes to the filesystem.
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/nfsd/nfs4proc.c | 7 ++++++-
linux-2.6.git-dave/fs/xattr.c | 16 ++++++++++++++--
2 files changed, 20 insertions(+), 3 deletions(-)
diff -puN fs/nfsd/nfs4proc.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes fs/nfsd/nfs4proc.c
--- linux-2.6.git/fs/nfsd/nfs4proc.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes 2007-11-01 14:46:11.000000000 -0700
+++ linux-2.6.git-dave/fs/nfsd/nfs4proc.c 2007-11-01 14:46:11.000000000 -0700
@@ -658,14 +658,19 @@ nfsd4_setattr(struct svc_rqst *rqstp, st
return status;
}
}
+ status = mnt_want_write(cstate->current_fh.fh_export->ex_mnt);
+ if (status)
+ return status;
status = nfs_ok;
if (setattr->sa_acl != NULL)
status = nfsd4_set_nfs4_acl(rqstp, &cstate->current_fh,
setattr->sa_acl);
if (status)
- return status;
+ goto out;
status = nfsd_setattr(rqstp, &cstate->current_fh, &setattr->sa_iattr,
0, (time_t)0);
+out:
+ mnt_drop_write(cstate->current_fh.fh_export->ex_mnt);
return status;
}
diff -puN fs/xattr.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes fs/xattr.c
--- linux-2.6.git/fs/xattr.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes 2007-11-01 14:46:11.000000000 -0700
+++ linux-2.6.git-dave/fs/xattr.c 2007-11-01 14:46:11.000000000 -0700
@@ -11,6 +11,7 @@
#include <linux/slab.h>
#include <linux/file.h>
#include <linux/xattr.h>
+#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/security.h>
#include <linux/syscalls.h>
@@ -32,8 +33,6 @@ xattr_permission(struct inode *inode, co
* filesystem or on an immutable / append-only inode.
*/
if (mask & MAY_WRITE) {
- if (IS_RDONLY(inode))
- return -EROFS;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
return -EPERM;
}
@@ -235,7 +234,11 @@ sys_setxattr(char __user *path, char __u
error = user_path_walk(path, &nd);
if (error)
return error;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ return error;
error = setxattr(nd.dentry, name, value, size, flags);
+ mnt_drop_write(nd.mnt);
path_release(&nd);
return error;
}
@@ -250,7 +253,11 @@ sys_lsetxattr(char __user *path, char __
error = user_path_walk_link(path, &nd);
if (error)
return error;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ return error;
error = setxattr(nd.dentry, name, value, size, flags);
+ mnt_drop_write(nd.mnt);
path_release(&nd);
return error;
}
@@ -266,9 +273,14 @@ sys_fsetxattr(int fd, char __user *name,
f = fget(fd);
if (!f)
return error;
+ error = mnt_want_write(f->f_vfsmnt);
+ if (error)
+ goto out_fput;
dentry = f->f_path.dentry;
audit_inode(NULL, dentry);
error = setxattr(dentry, name, value, size, flags);
+ mnt_drop_write(f->f_vfsmnt);
+out_fput:
fput(f);
return error;
}
_
My end goal here is to make sure all users of may_open()
return filps. This will ensure that we properly release
mount write counts which were taken for the filp in
may_open().
This patch moves the sys_open flags to namei flags
calculation into fs/namei.c. We'll shortly be moving
the nameidata_to_filp() calls into namei.c, and this
gets the sys_open flags to a place where we can get
at them when we need them.
Signed-off-by: Dave Hansen <[email protected]>
---
linux-2.6.git-dave/fs/namei.c | 43 +++++++++++++++++++++++++++++++++---------
linux-2.6.git-dave/fs/open.c | 22 +--------------------
2 files changed, 36 insertions(+), 29 deletions(-)
diff -puN fs/namei.c~do-namei_flags-calculation-inside-open_namei fs/namei.c
--- linux-2.6.git/fs/namei.c~do-namei_flags-calculation-inside-open_namei 2007-11-01 14:46:04.000000000 -0700
+++ linux-2.6.git-dave/fs/namei.c 2007-11-01 14:46:04.000000000 -0700
@@ -1674,7 +1674,12 @@ int may_open(struct nameidata *nd, int a
return 0;
}
-static int open_namei_create(struct nameidata *nd, struct path *path,
+/*
+ * Be careful about ever adding any more callers of this
+ * function. Its flags must be in the namei format, not
+ * what get passed to sys_open().
+ */
+static int __open_namei_create(struct nameidata *nd, struct path *path,
int flag, int mode)
{
int error;
@@ -1693,26 +1698,46 @@ static int open_namei_create(struct name
}
/*
+ * Note that while the flag value (low two bits) for sys_open means:
+ * 00 - read-only
+ * 01 - write-only
+ * 10 - read-write
+ * 11 - special
+ * it is changed into
+ * 00 - no permissions needed
+ * 01 - read-permission
+ * 10 - write-permission
+ * 11 - read-write
+ * for the internal routines (ie open_namei()/follow_link() etc)
+ * This is more logical, and also allows the 00 "no perm needed"
+ * to be used for symlinks (where the permissions are checked
+ * later).
+ *
+*/
+static inline int sys_open_flags_to_namei_flags(int flag)
+{
+ if ((flag+1) & O_ACCMODE)
+ flag++;
+ return flag;
+}
+
+/*
* open_namei()
*
* namei for open - this is in fact almost the whole open-routine.
*
* Note that the low bits of "flag" aren't the same as in the open
- * system call - they are 00 - no permissions needed
- * 01 - read permission needed
- * 10 - write permission needed
- * 11 - read/write permissions needed
- * which is a lot more logical, and also allows the "no perm" needed
- * for symlinks (where the permissions are checked later).
+ * system call. See sys_open_flags_to_namei_flags().
* SMP-safe
*/
-int open_namei(int dfd, const char *pathname, int flag,
+int open_namei(int dfd, const char *pathname, int sys_open_flag,
int mode, struct nameidata *nd)
{
int acc_mode, error;
struct path path;
struct dentry *dir;
int count = 0;
+ int flag = sys_open_flags_to_namei_flags(sys_open_flag);
acc_mode = ACC_MODE(flag);
@@ -1773,7 +1798,7 @@ do_last:
/* Negative dentry, just create the file */
if (!path.dentry->d_inode) {
- error = open_namei_create(nd, &path, flag, mode);
+ error = __open_namei_create(nd, &path, flag, mode);
if (error)
goto exit;
return 0;
diff -puN fs/open.c~do-namei_flags-calculation-inside-open_namei fs/open.c
--- linux-2.6.git/fs/open.c~do-namei_flags-calculation-inside-open_namei 2007-11-01 14:46:04.000000000 -0700
+++ linux-2.6.git-dave/fs/open.c 2007-11-01 14:46:04.000000000 -0700
@@ -800,31 +800,13 @@ cleanup_file:
return ERR_PTR(error);
}
-/*
- * Note that while the flag value (low two bits) for sys_open means:
- * 00 - read-only
- * 01 - write-only
- * 10 - read-write
- * 11 - special
- * it is changed into
- * 00 - no permissions needed
- * 01 - read-permission
- * 10 - write-permission
- * 11 - read-write
- * for the internal routines (ie open_namei()/follow_link() etc). 00 is
- * used by symlinks.
- */
static struct file *do_filp_open(int dfd, const char *filename, int flags,
int mode)
{
- int namei_flags, error;
+ int error;
struct nameidata nd;
- namei_flags = flags;
- if ((namei_flags+1) & O_ACCMODE)
- namei_flags++;
-
- error = open_namei(dfd, filename, namei_flags, mode, &nd);
+ error = open_namei(dfd, filename, flags, mode, &nd);
if (!error)
return nameidata_to_filp(&nd, flags);
_
Elevate the write count during the vfs_rmdir() call.
Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Serge Hallyn <[email protected]>
---
linux-2.6.git-dave/fs/namei.c | 5 +++++
1 file changed, 5 insertions(+)
diff -puN fs/namei.c~r-o-bind-mounts-do_rmdir-elevate-write-count fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-do_rmdir-elevate-write-count 2007-11-01 14:46:09.000000000 -0700
+++ linux-2.6.git-dave/fs/namei.c 2007-11-01 14:46:09.000000000 -0700
@@ -2174,7 +2174,12 @@ static long do_rmdir(int dfd, const char
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto exit2;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto exit3;
error = vfs_rmdir(nd.dentry->d_inode, dentry);
+ mnt_drop_write(nd.mnt);
+exit3:
dput(dentry);
exit2:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_
Now includes fix for oops seen by akpm.
"never let a libc developer write your kernel code" - hch
"nor, apparently, a kernel developer" - akpm
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Valdis Kletnieks <[email protected]>
Cc: Balbir Singh <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/utimes.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff -puN fs/utimes.c~r-o-bind-mounts-elevate-write-count-for-do_utimes fs/utimes.c
--- linux-2.6.git/fs/utimes.c~r-o-bind-mounts-elevate-write-count-for-do_utimes 2007-11-01 14:46:13.000000000 -0700
+++ linux-2.6.git-dave/fs/utimes.c 2007-11-01 14:46:13.000000000 -0700
@@ -2,6 +2,7 @@
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/linkage.h>
+#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/sched.h>
#include <linux/stat.h>
@@ -58,6 +59,7 @@ long do_utimes(int dfd, char __user *fil
struct inode *inode;
struct iattr newattrs;
struct file *f = NULL;
+ struct vfsmount *mnt;
error = -EINVAL;
if (times && (!nsec_valid(times[0].tv_nsec) ||
@@ -78,18 +80,20 @@ long do_utimes(int dfd, char __user *fil
if (!f)
goto out;
dentry = f->f_path.dentry;
+ mnt = f->f_path.mnt;
} else {
error = __user_walk_fd(dfd, filename, (flags & AT_SYMLINK_NOFOLLOW) ? 0 : LOOKUP_FOLLOW, &nd);
if (error)
goto out;
dentry = nd.dentry;
+ mnt = nd.mnt;
}
inode = dentry->d_inode;
- error = -EROFS;
- if (IS_RDONLY(inode))
+ error = mnt_want_write(mnt);
+ if (error)
goto dput_and_out;
/* Don't worry, the checks are done in inode_change_ok() */
@@ -97,7 +101,7 @@ long do_utimes(int dfd, char __user *fil
if (times) {
error = -EPERM;
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
if (times[0].tv_nsec == UTIME_OMIT)
newattrs.ia_valid &= ~ATTR_ATIME;
@@ -117,22 +121,24 @@ long do_utimes(int dfd, char __user *fil
} else {
error = -EACCES;
if (IS_IMMUTABLE(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
if (!is_owner_or_cap(inode)) {
if (f) {
if (!(f->f_mode & FMODE_WRITE))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
} else {
error = vfs_permission(&nd, MAY_WRITE);
if (error)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
}
}
}
mutex_lock(&inode->i_mutex);
error = notify_change(dentry, &newattrs);
mutex_unlock(&inode->i_mutex);
+mnt_drop_write_and_out:
+ mnt_drop_write(mnt);
dput_and_out:
if (f)
fput(f);
_
Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/inode.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff -puN fs/inode.c~r-o-bind-mounts-elevate-write-count-for-file_update_time fs/inode.c
--- linux-2.6.git/fs/inode.c~r-o-bind-mounts-elevate-write-count-for-file_update_time 2007-11-01 14:46:13.000000000 -0700
+++ linux-2.6.git-dave/fs/inode.c 2007-11-01 14:46:13.000000000 -0700
@@ -1263,10 +1263,19 @@ void file_update_time(struct file *file)
struct inode *inode = file->f_path.dentry->d_inode;
struct timespec now;
int sync_it = 0;
+ int err = 0;
if (IS_NOCMTIME(inode))
return;
- if (IS_RDONLY(inode))
+ /*
+ * Ideally, we want to guarantee that 'f_vfsmnt'
+ * is non-NULL here. But, NFS exports need to
+ * be fixed up before we can do that. So, check
+ * it for now. - Dave Hansen
+ */
+ if (file->f_vfsmnt)
+ err = mnt_want_write(file->f_vfsmnt);
+ if (err)
return;
now = current_fs_time(inode->i_sb);
@@ -1282,6 +1291,8 @@ void file_update_time(struct file *file)
if (sync_it)
mark_inode_dirty_sync(inode);
+ if (file->f_vfsmnt)
+ mnt_drop_write(file->f_vfsmnt);
}
EXPORT_SYMBOL(file_update_time);
_
fs/ncpfs/ioctl.c: In function 'ncp_ioctl_need_write':
fs/ncpfs/ioctl.c:852: error: label at end of compound statement
Cc: Dave Hansen <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/ncpfs/ioctl.c | 57 ++++++++++++++++++++++++++++++++++--
1 file changed, 55 insertions(+), 2 deletions(-)
diff -puN fs/ncpfs/ioctl.c~r-o-bind-mounts-elevate-write-count-during-entire-ncp_ioctl fs/ncpfs/ioctl.c
--- linux-2.6.git/fs/ncpfs/ioctl.c~r-o-bind-mounts-elevate-write-count-during-entire-ncp_ioctl 2007-11-01 14:46:11.000000000 -0700
+++ linux-2.6.git-dave/fs/ncpfs/ioctl.c 2007-11-01 14:46:11.000000000 -0700
@@ -14,6 +14,7 @@
#include <linux/ioctl.h>
#include <linux/time.h>
#include <linux/mm.h>
+#include <linux/mount.h>
#include <linux/highuid.h>
#include <linux/smp_lock.h>
#include <linux/vmalloc.h>
@@ -261,7 +262,7 @@ ncp_get_charsets(struct ncp_server* serv
}
#endif /* CONFIG_NCPFS_NLS */
-int ncp_ioctl(struct inode *inode, struct file *filp,
+static int __ncp_ioctl(struct inode *inode, struct file *filp,
unsigned int cmd, unsigned long arg)
{
struct ncp_server *server = NCP_SERVER(inode);
@@ -817,11 +818,63 @@ outrel:
return -EFAULT;
return 0;
}
-
+ default: /* unkown IOCTL command, assume write */
+ ;
}
return -EINVAL;
}
+static int ncp_ioctl_need_write(unsigned int cmd)
+{
+ switch (cmd) {
+ case NCP_IOC_GET_FS_INFO:
+ case NCP_IOC_GET_FS_INFO_V2:
+ case NCP_IOC_NCPREQUEST:
+ case NCP_IOC_SETDENTRYTTL:
+ case NCP_IOC_SIGN_INIT:
+ case NCP_IOC_LOCKUNLOCK:
+ case NCP_IOC_SET_SIGN_WANTED:
+ return 1;
+ case NCP_IOC_GETOBJECTNAME:
+ case NCP_IOC_SETOBJECTNAME:
+ case NCP_IOC_GETPRIVATEDATA:
+ case NCP_IOC_SETPRIVATEDATA:
+ case NCP_IOC_SETCHARSETS:
+ case NCP_IOC_GETCHARSETS:
+ case NCP_IOC_CONN_LOGGED_IN:
+ case NCP_IOC_GETDENTRYTTL:
+ case NCP_IOC_GETMOUNTUID2:
+ case NCP_IOC_SIGN_WANTED:
+ case NCP_IOC_GETROOT:
+ case NCP_IOC_SETROOT:
+ return 0;
+ default:
+ /* unkown IOCTL command, assume write */
+ }
+ return 1;
+}
+
+int ncp_ioctl(struct inode *inode, struct file *filp,
+ unsigned int cmd, unsigned long arg)
+{
+ int ret;
+
+ if (ncp_ioctl_need_write(cmd)) {
+ /*
+ * inside the ioctl(), any failures which
+ * are because of file_permission() are
+ * -EACCESS, so it seems consistent to keep
+ * that here.
+ */
+ if (mnt_want_write(filp->f_vfsmnt))
+ return -EACCES;
+ }
+ ret = __ncp_ioctl(inode, filp, cmd, arg);
+ if (ncp_ioctl_need_write(cmd))
+ mnt_drop_write(filp->f_vfsmnt);
+ return ret;
+}
+
#ifdef CONFIG_COMPAT
long ncp_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
_
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/namei.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff -puN fs/namei.c~r-o-bind-mounts-elevate-write-count-for-link-and-symlink-calls fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-write-count-for-link-and-symlink-calls 2007-11-01 14:46:14.000000000 -0700
+++ linux-2.6.git-dave/fs/namei.c 2007-11-01 14:46:14.000000000 -0700
@@ -2349,7 +2349,12 @@ asmlinkage long sys_symlinkat(const char
if (IS_ERR(dentry))
goto out_unlock;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
error = vfs_symlink(nd.dentry->d_inode, dentry, from, S_IALLUGO);
+ mnt_drop_write(nd.mnt);
+out_dput:
dput(dentry);
out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
@@ -2444,7 +2449,12 @@ asmlinkage long sys_linkat(int olddfd, c
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
goto out_unlock;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
error = vfs_link(old_nd.dentry, nd.dentry->d_inode, new_dentry);
+ mnt_drop_write(nd.mnt);
+out_dput:
dput(new_dentry);
out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/inode.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff -puN fs/inode.c~r-o-bind-mounts-elevate-write-count-for-do_sys_utime-and-touch_atime fs/inode.c
--- linux-2.6.git/fs/inode.c~r-o-bind-mounts-elevate-write-count-for-do_sys_utime-and-touch_atime 2007-11-01 14:46:12.000000000 -0700
+++ linux-2.6.git-dave/fs/inode.c 2007-11-01 14:46:12.000000000 -0700
@@ -1203,22 +1203,23 @@ void touch_atime(struct vfsmount *mnt, s
struct inode *inode = dentry->d_inode;
struct timespec now;
- if (inode->i_flags & S_NOATIME)
+ if (mnt && mnt_want_write(mnt))
return;
+ if (inode->i_flags & S_NOATIME)
+ goto out;
if (IS_NOATIME(inode))
- return;
+ goto out;
if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
- return;
+ goto out;
/*
* We may have a NULL vfsmount when coming from NFSD
*/
if (mnt) {
if (mnt->mnt_flags & MNT_NOATIME)
- return;
+ goto out;
if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
- return;
-
+ goto out;
if (mnt->mnt_flags & MNT_RELATIME) {
/*
* With relative atime, only update atime if the
@@ -1229,16 +1230,19 @@ void touch_atime(struct vfsmount *mnt, s
&inode->i_atime) < 0 &&
timespec_compare(&inode->i_ctime,
&inode->i_atime) < 0)
- return;
+ goto out;
}
}
now = current_fs_time(inode->i_sb);
if (timespec_equal(&inode->i_atime, &now))
- return;
+ goto out;
inode->i_atime = now;
mark_inode_dirty_sync(inode);
+out:
+ if (mnt)
+ mnt_drop_write(mnt);
}
EXPORT_SYMBOL(touch_atime);
_
Some ioctl()s can cause writes to the filesystem. Take these, and make them
use mnt_want/drop_write() instead.
We need to pass the filp one layer deeper in XFS, but somebody _just_ pulled
it out in February because nobody was using it, so I don't feel guilty for
adding it back.
Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/ext2/ioctl.c | 46 ++++++----
linux-2.6.git-dave/fs/ext3/ioctl.c | 100 +++++++++++++++-------
linux-2.6.git-dave/fs/ext4/ioctl.c | 105 +++++++++++++++---------
linux-2.6.git-dave/fs/fat/file.c | 10 +-
linux-2.6.git-dave/fs/hfsplus/ioctl.c | 40 +++++----
linux-2.6.git-dave/fs/jfs/ioctl.c | 33 ++++---
linux-2.6.git-dave/fs/ocfs2/ioctl.c | 11 +-
linux-2.6.git-dave/fs/reiserfs/ioctl.c | 55 ++++++++----
linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_ioctl.c | 15 ++-
linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_iops.c | 7 -
linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_lrw.c | 9 +-
11 files changed, 274 insertions(+), 157 deletions(-)
diff -puN fs/ext2/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls fs/ext2/ioctl.c
--- linux-2.6.git/fs/ext2/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls 2007-11-01 14:46:14.000000000 -0700
+++ linux-2.6.git-dave/fs/ext2/ioctl.c 2007-11-01 14:46:14.000000000 -0700
@@ -12,6 +12,7 @@
#include <linux/time.h>
#include <linux/sched.h>
#include <linux/compat.h>
+#include <linux/mount.h>
#include <linux/smp_lock.h>
#include <asm/current.h>
#include <asm/uaccess.h>
@@ -20,6 +21,7 @@
int ext2_ioctl (struct inode * inode, struct file * filp, unsigned int cmd,
unsigned long arg)
{
+ int ret;
struct ext2_inode_info *ei = EXT2_I(inode);
unsigned int flags;
unsigned short rsv_window_size;
@@ -34,14 +36,19 @@ int ext2_ioctl (struct inode * inode, st
case EXT2_IOC_SETFLAGS: {
unsigned int oldflags;
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if (!is_owner_or_cap(inode))
- return -EACCES;
+ ret = mnt_want_write(filp->f_vfsmnt);
+ if (ret)
+ return ret;
+
+ if (!is_owner_or_cap(inode)) {
+ ret = -EACCES;
+ goto setflags_out;
+ }
- if (get_user(flags, (int __user *) arg))
- return -EFAULT;
+ if (get_user(flags, (int __user *) arg)) {
+ ret = -EFAULT;
+ goto setflags_out;
+ }
if (!S_ISDIR(inode->i_mode))
flags &= ~EXT2_DIRSYNC_FL;
@@ -58,7 +65,8 @@ int ext2_ioctl (struct inode * inode, st
if ((flags ^ oldflags) & (EXT2_APPEND_FL | EXT2_IMMUTABLE_FL)) {
if (!capable(CAP_LINUX_IMMUTABLE)) {
mutex_unlock(&inode->i_mutex);
- return -EPERM;
+ ret = -EPERM;
+ goto setflags_out;
}
}
@@ -70,20 +78,26 @@ int ext2_ioctl (struct inode * inode, st
ext2_set_inode_flags(inode);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
- return 0;
+setflags_out:
+ mnt_drop_write(filp->f_vfsmnt);
+ return ret;
}
case EXT2_IOC_GETVERSION:
return put_user(inode->i_generation, (int __user *) arg);
case EXT2_IOC_SETVERSION:
if (!is_owner_or_cap(inode))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
- if (get_user(inode->i_generation, (int __user *) arg))
- return -EFAULT;
- inode->i_ctime = CURRENT_TIME_SEC;
- mark_inode_dirty(inode);
- return 0;
+ ret = mnt_want_write(filp->f_vfsmnt);
+ if (ret)
+ return ret;
+ if (get_user(inode->i_generation, (int __user *) arg)) {
+ ret = -EFAULT;
+ } else {
+ inode->i_ctime = CURRENT_TIME_SEC;
+ mark_inode_dirty(inode);
+ }
+ mnt_drop_write(filp->f_vfsmnt);
+ return ret;
case EXT2_IOC_GETRSVSZ:
if (test_opt(inode->i_sb, RESERVATION)
&& S_ISREG(inode->i_mode)
diff -puN fs/ext3/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls fs/ext3/ioctl.c
--- linux-2.6.git/fs/ext3/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls 2007-11-01 14:46:14.000000000 -0700
+++ linux-2.6.git-dave/fs/ext3/ioctl.c 2007-11-01 14:46:14.000000000 -0700
@@ -12,6 +12,7 @@
#include <linux/capability.h>
#include <linux/ext3_fs.h>
#include <linux/ext3_jbd.h>
+#include <linux/mount.h>
#include <linux/time.h>
#include <linux/compat.h>
#include <linux/smp_lock.h>
@@ -38,14 +39,19 @@ int ext3_ioctl (struct inode * inode, st
unsigned int oldflags;
unsigned int jflag;
- if (IS_RDONLY(inode))
- return -EROFS;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
- if (!is_owner_or_cap(inode))
- return -EACCES;
+ if (!is_owner_or_cap(inode)) {
+ err = -EACCES;
+ goto flags_out;
+ }
- if (get_user(flags, (int __user *) arg))
- return -EFAULT;
+ if (get_user(flags, (int __user *) arg)) {
+ err = -EFAULT;
+ goto flags_out;
+ }
if (!S_ISDIR(inode->i_mode))
flags &= ~EXT3_DIRSYNC_FL;
@@ -65,7 +71,8 @@ int ext3_ioctl (struct inode * inode, st
if ((flags ^ oldflags) & (EXT3_APPEND_FL | EXT3_IMMUTABLE_FL)) {
if (!capable(CAP_LINUX_IMMUTABLE)) {
mutex_unlock(&inode->i_mutex);
- return -EPERM;
+ err = -EPERM;
+ goto flags_out;
}
}
@@ -76,7 +83,8 @@ int ext3_ioctl (struct inode * inode, st
if ((jflag ^ oldflags) & (EXT3_JOURNAL_DATA_FL)) {
if (!capable(CAP_SYS_RESOURCE)) {
mutex_unlock(&inode->i_mutex);
- return -EPERM;
+ err = -EPERM;
+ goto flags_out;
}
}
@@ -84,7 +92,8 @@ int ext3_ioctl (struct inode * inode, st
handle = ext3_journal_start(inode, 1);
if (IS_ERR(handle)) {
mutex_unlock(&inode->i_mutex);
- return PTR_ERR(handle);
+ err = PTR_ERR(handle);
+ goto flags_out;
}
if (IS_SYNC(inode))
handle->h_sync = 1;
@@ -110,6 +119,8 @@ flags_err:
if ((jflag ^ oldflags) & (EXT3_JOURNAL_DATA_FL))
err = ext3_change_inode_journal_flag(inode, jflag);
mutex_unlock(&inode->i_mutex);
+flags_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
case EXT3_IOC_GETVERSION:
@@ -124,14 +135,18 @@ flags_err:
if (!is_owner_or_cap(inode))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
- if (get_user(generation, (int __user *) arg))
- return -EFAULT;
-
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+ if (get_user(generation, (int __user *) arg)) {
+ err = -EFAULT;
+ goto setversion_out;
+ }
handle = ext3_journal_start(inode, 1);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
+ if (IS_ERR(handle)) {
+ err = PTR_ERR(handle);
+ goto setversion_out;
+ }
err = ext3_reserve_inode_write(handle, inode, &iloc);
if (err == 0) {
inode->i_ctime = CURRENT_TIME_SEC;
@@ -139,6 +154,8 @@ flags_err:
err = ext3_mark_iloc_dirty(handle, inode, &iloc);
}
ext3_journal_stop(handle);
+setversion_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
#ifdef CONFIG_JBD_DEBUG
@@ -174,18 +191,24 @@ flags_err:
}
return -ENOTTY;
case EXT3_IOC_SETRSVSZ: {
+ int err;
if (!test_opt(inode->i_sb, RESERVATION) ||!S_ISREG(inode->i_mode))
return -ENOTTY;
- if (IS_RDONLY(inode))
- return -EROFS;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
- if (!is_owner_or_cap(inode))
- return -EACCES;
+ if (!is_owner_or_cap(inode)) {
+ err = -EACCES;
+ goto setrsvsz_out;
+ }
- if (get_user(rsv_window_size, (int __user *)arg))
- return -EFAULT;
+ if (get_user(rsv_window_size, (int __user *)arg)) {
+ err = -EFAULT;
+ goto setrsvsz_out;
+ }
if (rsv_window_size > EXT3_MAX_RESERVE_BLOCKS)
rsv_window_size = EXT3_MAX_RESERVE_BLOCKS;
@@ -203,7 +226,9 @@ flags_err:
rsv->rsv_goal_size = rsv_window_size;
}
mutex_unlock(&ei->truncate_mutex);
- return 0;
+setrsvsz_out:
+ mnt_drop_write(filp->f_vfsmnt);
+ return err;
}
case EXT3_IOC_GROUP_EXTEND: {
ext3_fsblk_t n_blocks_count;
@@ -213,17 +238,20 @@ flags_err:
if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if (get_user(n_blocks_count, (__u32 __user *)arg))
- return -EFAULT;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+ if (get_user(n_blocks_count, (__u32 __user *)arg)) {
+ err = -EFAULT;
+ goto group_extend_out;
+ }
err = ext3_group_extend(sb, EXT3_SB(sb)->s_es, n_blocks_count);
journal_lock_updates(EXT3_SB(sb)->s_journal);
journal_flush(EXT3_SB(sb)->s_journal);
journal_unlock_updates(EXT3_SB(sb)->s_journal);
-
+group_extend_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
case EXT3_IOC_GROUP_ADD: {
@@ -234,18 +262,22 @@ flags_err:
if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
if (copy_from_user(&input, (struct ext3_new_group_input __user *)arg,
- sizeof(input)))
- return -EFAULT;
+ sizeof(input))) {
+ err = -EFAULT;
+ goto group_add_out;
+ }
err = ext3_group_add(sb, &input);
journal_lock_updates(EXT3_SB(sb)->s_journal);
journal_flush(EXT3_SB(sb)->s_journal);
journal_unlock_updates(EXT3_SB(sb)->s_journal);
-
+group_add_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
diff -puN fs/ext4/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls fs/ext4/ioctl.c
--- linux-2.6.git/fs/ext4/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls 2007-11-01 14:46:14.000000000 -0700
+++ linux-2.6.git-dave/fs/ext4/ioctl.c 2007-11-01 14:46:14.000000000 -0700
@@ -12,6 +12,7 @@
#include <linux/capability.h>
#include <linux/ext4_fs.h>
#include <linux/ext4_jbd2.h>
+#include <linux/mount.h>
#include <linux/time.h>
#include <linux/compat.h>
#include <linux/smp_lock.h>
@@ -38,15 +39,19 @@ int ext4_ioctl (struct inode * inode, st
unsigned int oldflags;
unsigned int jflag;
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if (!is_owner_or_cap(inode))
- return -EACCES;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
- if (get_user(flags, (int __user *) arg))
- return -EFAULT;
+ if (!is_owner_or_cap(inode)) {
+ err = -EACCES;
+ goto flags_out;
+ }
+ if (get_user(flags, (int __user *) arg)) {
+ err = -EFAULT;
+ goto flags_out;
+ }
if (!S_ISDIR(inode->i_mode))
flags &= ~EXT4_DIRSYNC_FL;
@@ -65,7 +70,8 @@ int ext4_ioctl (struct inode * inode, st
if ((flags ^ oldflags) & (EXT4_APPEND_FL | EXT4_IMMUTABLE_FL)) {
if (!capable(CAP_LINUX_IMMUTABLE)) {
mutex_unlock(&inode->i_mutex);
- return -EPERM;
+ err = -EPERM;
+ goto flags_out;
}
}
@@ -76,7 +82,8 @@ int ext4_ioctl (struct inode * inode, st
if ((jflag ^ oldflags) & (EXT4_JOURNAL_DATA_FL)) {
if (!capable(CAP_SYS_RESOURCE)) {
mutex_unlock(&inode->i_mutex);
- return -EPERM;
+ err = -EPERM;
+ goto flags_out;
}
}
@@ -84,7 +91,8 @@ int ext4_ioctl (struct inode * inode, st
handle = ext4_journal_start(inode, 1);
if (IS_ERR(handle)) {
mutex_unlock(&inode->i_mutex);
- return PTR_ERR(handle);
+ err = PTR_ERR(handle);
+ goto flags_out;
}
if (IS_SYNC(inode))
handle->h_sync = 1;
@@ -104,12 +112,14 @@ flags_err:
ext4_journal_stop(handle);
if (err) {
mutex_unlock(&inode->i_mutex);
- return err;
+ goto flags_out;
}
if ((jflag ^ oldflags) & (EXT4_JOURNAL_DATA_FL))
err = ext4_change_inode_journal_flag(inode, jflag);
mutex_unlock(&inode->i_mutex);
+flags_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
case EXT4_IOC_GETVERSION:
@@ -124,14 +134,18 @@ flags_err:
if (!is_owner_or_cap(inode))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
- if (get_user(generation, (int __user *) arg))
- return -EFAULT;
-
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+ if (get_user(generation, (int __user *) arg)) {
+ err = -EFAULT;
+ goto setversion_out;
+ }
handle = ext4_journal_start(inode, 1);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
+ if (IS_ERR(handle)) {
+ err = PTR_ERR(handle);
+ goto setversion_out;
+ }
err = ext4_reserve_inode_write(handle, inode, &iloc);
if (err == 0) {
inode->i_ctime = ext4_current_time(inode);
@@ -139,6 +153,8 @@ flags_err:
err = ext4_mark_iloc_dirty(handle, inode, &iloc);
}
ext4_journal_stop(handle);
+setversion_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
#ifdef CONFIG_JBD2_DEBUG
@@ -174,19 +190,23 @@ flags_err:
}
return -ENOTTY;
case EXT4_IOC_SETRSVSZ: {
+ int err;
if (!test_opt(inode->i_sb, RESERVATION) ||!S_ISREG(inode->i_mode))
return -ENOTTY;
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if (!is_owner_or_cap(inode))
- return -EACCES;
-
- if (get_user(rsv_window_size, (int __user *)arg))
- return -EFAULT;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+ if (!is_owner_or_cap(inode)) {
+ err = -EACCES;
+ goto setrsvsz_out;
+ }
+ if (get_user(rsv_window_size, (int __user *)arg)) {
+ err = -EFAULT;
+ goto setrsvsz_out;
+ }
if (rsv_window_size > EXT4_MAX_RESERVE_BLOCKS)
rsv_window_size = EXT4_MAX_RESERVE_BLOCKS;
@@ -203,7 +223,9 @@ flags_err:
rsv->rsv_goal_size = rsv_window_size;
}
mutex_unlock(&ei->truncate_mutex);
- return 0;
+setrsvsz_out:
+ mnt_drop_write(filp->f_vfsmnt);
+ return err;
}
case EXT4_IOC_GROUP_EXTEND: {
ext4_fsblk_t n_blocks_count;
@@ -213,17 +235,21 @@ flags_err:
if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if (get_user(n_blocks_count, (__u32 __user *)arg))
- return -EFAULT;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+ if (get_user(n_blocks_count, (__u32 __user *)arg)) {
+ err = -EFAULT;
+ goto group_extend_out;
+ }
err = ext4_group_extend(sb, EXT4_SB(sb)->s_es, n_blocks_count);
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
jbd2_journal_flush(EXT4_SB(sb)->s_journal);
jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
+group_extend_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
case EXT4_IOC_GROUP_ADD: {
@@ -234,18 +260,21 @@ flags_err:
if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
if (copy_from_user(&input, (struct ext4_new_group_input __user *)arg,
- sizeof(input)))
- return -EFAULT;
-
+ sizeof(input))) {
+ err = -EFAULT;
+ goto group_add_out;
+ }
err = ext4_group_add(sb, &input);
jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
jbd2_journal_flush(EXT4_SB(sb)->s_journal);
jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
-
+group_add_out:
+ mnt_drop_write(filp->f_vfsmnt);
return err;
}
diff -puN fs/fat/file.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls fs/fat/file.c
--- linux-2.6.git/fs/fat/file.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls 2007-11-01 14:46:14.000000000 -0700
+++ linux-2.6.git-dave/fs/fat/file.c 2007-11-01 14:46:14.000000000 -0700
@@ -8,6 +8,7 @@
#include <linux/capability.h>
#include <linux/module.h>
+#include <linux/mount.h>
#include <linux/time.h>
#include <linux/msdos_fs.h>
#include <linux/smp_lock.h>
@@ -46,10 +47,9 @@ int fat_generic_ioctl(struct inode *inod
mutex_lock(&inode->i_mutex);
- if (IS_RDONLY(inode)) {
- err = -EROFS;
- goto up;
- }
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ goto up_no_drop_write;
/*
* ATTR_VOLUME and ATTR_DIR cannot be changed; this also
@@ -106,6 +106,8 @@ int fat_generic_ioctl(struct inode *inod
MSDOS_I(inode)->i_attrs = attr & ATTR_UNUSED;
mark_inode_dirty(inode);
up:
+ mnt_drop_write(filp->f_vfsmnt);
+ up_no_drop_write:
mutex_unlock(&inode->i_mutex);
return err;
}
diff -puN fs/hfsplus/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls fs/hfsplus/ioctl.c
--- linux-2.6.git/fs/hfsplus/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls 2007-11-01 14:46:14.000000000 -0700
+++ linux-2.6.git-dave/fs/hfsplus/ioctl.c 2007-11-01 14:46:14.000000000 -0700
@@ -14,6 +14,7 @@
#include <linux/capability.h>
#include <linux/fs.h>
+#include <linux/mount.h>
#include <linux/sched.h>
#include <linux/xattr.h>
#include <asm/uaccess.h>
@@ -35,25 +36,32 @@ int hfsplus_ioctl(struct inode *inode, s
flags |= FS_NODUMP_FL; /* EXT2_NODUMP_FL */
return put_user(flags, (int __user *)arg);
case HFSPLUS_IOC_EXT2_SETFLAGS: {
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if (!is_owner_or_cap(inode))
- return -EACCES;
-
- if (get_user(flags, (int __user *)arg))
- return -EFAULT;
-
+ int err = 0;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+
+ if (!is_owner_or_cap(inode)) {
+ err = -EACCES;
+ goto setflags_out;
+ }
+ if (get_user(flags, (int __user *)arg)) {
+ err = -EFAULT;
+ goto setflags_out;
+ }
if (flags & (FS_IMMUTABLE_FL|FS_APPEND_FL) ||
HFSPLUS_I(inode).rootflags & (HFSPLUS_FLG_IMMUTABLE|HFSPLUS_FLG_APPEND)) {
- if (!capable(CAP_LINUX_IMMUTABLE))
- return -EPERM;
+ if (!capable(CAP_LINUX_IMMUTABLE)) {
+ err = -EPERM;
+ goto setflags_out;
+ }
}
/* don't silently ignore unsupported ext2 flags */
- if (flags & ~(FS_IMMUTABLE_FL|FS_APPEND_FL|FS_NODUMP_FL))
- return -EOPNOTSUPP;
-
+ if (flags & ~(FS_IMMUTABLE_FL|FS_APPEND_FL|FS_NODUMP_FL)) {
+ err = -EOPNOTSUPP;
+ goto setflags_out;
+ }
if (flags & FS_IMMUTABLE_FL) { /* EXT2_IMMUTABLE_FL */
inode->i_flags |= S_IMMUTABLE;
HFSPLUS_I(inode).rootflags |= HFSPLUS_FLG_IMMUTABLE;
@@ -75,7 +83,9 @@ int hfsplus_ioctl(struct inode *inode, s
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
- return 0;
+setflags_out:
+ mnt_drop_write(filp->f_vfsmnt);
+ return err;
}
default:
return -ENOTTY;
diff -puN fs/jfs/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls fs/jfs/ioctl.c
--- linux-2.6.git/fs/jfs/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls 2007-11-01 14:46:14.000000000 -0700
+++ linux-2.6.git-dave/fs/jfs/ioctl.c 2007-11-01 14:46:14.000000000 -0700
@@ -8,6 +8,7 @@
#include <linux/fs.h>
#include <linux/ctype.h>
#include <linux/capability.h>
+#include <linux/mount.h>
#include <linux/time.h>
#include <linux/sched.h>
#include <asm/current.h>
@@ -65,16 +66,20 @@ int jfs_ioctl(struct inode * inode, stru
return put_user(flags, (int __user *) arg);
case JFS_IOC_SETFLAGS: {
unsigned int oldflags;
+ int err;
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if (!is_owner_or_cap(inode))
- return -EACCES;
-
- if (get_user(flags, (int __user *) arg))
- return -EFAULT;
-
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+
+ if (!is_owner_or_cap(inode)) {
+ err = -EACCES;
+ goto setflags_out;
+ }
+ if (get_user(flags, (int __user *) arg)) {
+ err = -EFAULT;
+ goto setflags_out;
+ }
flags = jfs_map_ext2(flags, 1);
if (!S_ISDIR(inode->i_mode))
flags &= ~JFS_DIRSYNC_FL;
@@ -89,8 +94,10 @@ int jfs_ioctl(struct inode * inode, stru
if ((oldflags & JFS_IMMUTABLE_FL) ||
((flags ^ oldflags) &
(JFS_APPEND_FL | JFS_IMMUTABLE_FL))) {
- if (!capable(CAP_LINUX_IMMUTABLE))
- return -EPERM;
+ if (!capable(CAP_LINUX_IMMUTABLE)) {
+ err = -EPERM;
+ goto setflags_out;
+ }
}
flags = flags & JFS_FL_USER_MODIFIABLE;
@@ -100,7 +107,9 @@ int jfs_ioctl(struct inode * inode, stru
jfs_set_inode_flags(inode);
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
- return 0;
+setflags_out:
+ mnt_drop_write(filp->f_vfsmnt);
+ return err;
}
default:
return -ENOTTY;
diff -puN fs/ocfs2/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls fs/ocfs2/ioctl.c
--- linux-2.6.git/fs/ocfs2/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls 2007-11-01 14:46:14.000000000 -0700
+++ linux-2.6.git-dave/fs/ocfs2/ioctl.c 2007-11-01 14:46:14.000000000 -0700
@@ -58,10 +58,6 @@ static int ocfs2_set_inode_attr(struct i
goto bail;
}
- status = -EROFS;
- if (IS_RDONLY(inode))
- goto bail_unlock;
-
status = -EACCES;
if (!is_owner_or_cap(inode))
goto bail_unlock;
@@ -130,8 +126,13 @@ int ocfs2_ioctl(struct inode * inode, st
if (get_user(flags, (int __user *) arg))
return -EFAULT;
- return ocfs2_set_inode_attr(inode, flags,
+ status = mnt_want_write(filp->f_vfsmnt);
+ if (status)
+ return status;
+ status = ocfs2_set_inode_attr(inode, flags,
OCFS2_FL_MODIFIABLE);
+ mnt_drop_write(filp->f_vfsmnt);
+ return status;
case OCFS2_IOC_RESVSP:
case OCFS2_IOC_RESVSP64:
case OCFS2_IOC_UNRESVSP:
diff -puN fs/reiserfs/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls fs/reiserfs/ioctl.c
--- linux-2.6.git/fs/reiserfs/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls 2007-11-01 14:46:14.000000000 -0700
+++ linux-2.6.git-dave/fs/reiserfs/ioctl.c 2007-11-01 14:46:14.000000000 -0700
@@ -4,6 +4,7 @@
#include <linux/capability.h>
#include <linux/fs.h>
+#include <linux/mount.h>
#include <linux/reiserfs_fs.h>
#include <linux/time.h>
#include <asm/uaccess.h>
@@ -25,6 +26,7 @@ int reiserfs_ioctl(struct inode *inode,
unsigned long arg)
{
unsigned int flags;
+ int err = 0;
switch (cmd) {
case REISERFS_IOC_UNPACK:
@@ -48,47 +50,60 @@ int reiserfs_ioctl(struct inode *inode,
if (!reiserfs_attrs(inode->i_sb))
return -ENOTTY;
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if (!is_owner_or_cap(inode))
- return -EPERM;
-
- if (get_user(flags, (int __user *)arg))
- return -EFAULT;
-
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+
+ if (!is_owner_or_cap(inode)) {
+ err = -EPERM;
+ goto setflags_out;
+ }
+ if (get_user(flags, (int __user *)arg)) {
+ err = -EFAULT;
+ goto setflags_out;
+ }
if (((flags ^ REISERFS_I(inode)->
i_attrs) & (REISERFS_IMMUTABLE_FL |
REISERFS_APPEND_FL))
- && !capable(CAP_LINUX_IMMUTABLE))
- return -EPERM;
-
+ && !capable(CAP_LINUX_IMMUTABLE)) {
+ err = -EPERM;
+ goto setflags_out;
+ }
if ((flags & REISERFS_NOTAIL_FL) &&
S_ISREG(inode->i_mode)) {
int result;
result = reiserfs_unpack(inode, filp);
- if (result)
- return result;
+ if (result) {
+ err = result;
+ goto setflags_out;
+ }
}
sd_attrs_to_i_attrs(flags, inode);
REISERFS_I(inode)->i_attrs = flags;
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
- return 0;
+setflags_out:
+ mnt_drop_write(filp->f_vfsmnt);
+ return err;
}
case REISERFS_IOC_GETVERSION:
return put_user(inode->i_generation, (int __user *)arg);
case REISERFS_IOC_SETVERSION:
if (!is_owner_or_cap(inode))
return -EPERM;
- if (IS_RDONLY(inode))
- return -EROFS;
- if (get_user(inode->i_generation, (int __user *)arg))
- return -EFAULT;
+ err = mnt_want_write(filp->f_vfsmnt);
+ if (err)
+ return err;
+ if (get_user(inode->i_generation, (int __user *)arg)) {
+ err = -EFAULT;
+ goto setversion_out;
+ }
inode->i_ctime = CURRENT_TIME_SEC;
mark_inode_dirty(inode);
- return 0;
+setversion_out:
+ mnt_drop_write(filp->f_vfsmnt);
+ return err;
default:
return -ENOTTY;
}
diff -puN fs/xfs/linux-2.6/xfs_ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls fs/xfs/linux-2.6/xfs_ioctl.c
--- linux-2.6.git/fs/xfs/linux-2.6/xfs_ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls 2007-11-01 14:46:14.000000000 -0700
+++ linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_ioctl.c 2007-11-01 14:46:14.000000000 -0700
@@ -554,8 +554,6 @@ xfs_attrmulti_attr_set(
char *kbuf;
int error = EFAULT;
- if (IS_RDONLY(inode))
- return -EROFS;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
return EPERM;
if (len > XATTR_SIZE_MAX)
@@ -581,8 +579,6 @@ xfs_attrmulti_attr_remove(
char *name,
__uint32_t flags)
{
- if (IS_RDONLY(inode))
- return -EROFS;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
return EPERM;
return xfs_attr_remove(XFS_I(inode), name, flags);
@@ -592,6 +588,7 @@ STATIC int
xfs_attrmulti_by_handle(
xfs_mount_t *mp,
void __user *arg,
+ struct file *parfilp,
struct inode *parinode)
{
int error;
@@ -646,13 +643,21 @@ xfs_attrmulti_by_handle(
&ops[i].am_length, ops[i].am_flags);
break;
case ATTR_OP_SET:
+ ops[i].am_error = mnt_want_write(parfilp->f_vfsmnt);
+ if (ops[i].am_error)
+ break;
ops[i].am_error = xfs_attrmulti_attr_set(inode,
attr_name, ops[i].am_attrvalue,
ops[i].am_length, ops[i].am_flags);
+ mnt_drop_write(parfilp->f_vfsmnt);
break;
case ATTR_OP_REMOVE:
+ ops[i].am_error = mnt_want_write(parfilp->f_vfsmnt);
+ if (ops[i].am_error)
+ break;
ops[i].am_error = xfs_attrmulti_attr_remove(inode,
attr_name, ops[i].am_flags);
+ mnt_drop_write(parfilp->f_vfsmnt);
break;
default:
ops[i].am_error = EINVAL;
@@ -834,7 +839,7 @@ xfs_ioctl(
return xfs_attrlist_by_handle(mp, arg, inode);
case XFS_IOC_ATTRMULTI_BY_HANDLE:
- return xfs_attrmulti_by_handle(mp, arg, inode);
+ return xfs_attrmulti_by_handle(mp, arg, filp, inode);
case XFS_IOC_SWAPEXT: {
error = xfs_swapext((struct xfs_swapext __user *)arg);
diff -puN fs/xfs/linux-2.6/xfs_iops.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls fs/xfs/linux-2.6/xfs_iops.c
--- linux-2.6.git/fs/xfs/linux-2.6/xfs_iops.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls 2007-11-01 14:46:14.000000000 -0700
+++ linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_iops.c 2007-11-01 14:46:14.000000000 -0700
@@ -140,13 +140,6 @@ xfs_ichgtime_fast(
*/
ASSERT((flags & XFS_ICHGTIME_ACC) == 0);
- /*
- * We're not supposed to change timestamps in readonly-mounted
- * filesystems. Throw it away if anyone asks us.
- */
- if (unlikely(IS_RDONLY(inode)))
- return;
-
if (flags & XFS_ICHGTIME_MOD) {
tvp = &inode->i_mtime;
ip->i_d.di_mtime.t_sec = (__int32_t)tvp->tv_sec;
diff -puN fs/xfs/linux-2.6/xfs_lrw.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls fs/xfs/linux-2.6/xfs_lrw.c
--- linux-2.6.git/fs/xfs/linux-2.6/xfs_lrw.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls 2007-11-01 14:46:14.000000000 -0700
+++ linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_lrw.c 2007-11-01 14:46:14.000000000 -0700
@@ -51,6 +51,7 @@
#include "xfs_vnodeops.h"
#include <linux/capability.h>
+#include <linux/mount.h>
#include <linux/writeback.h>
@@ -690,10 +691,16 @@ start:
if (new_size > xip->i_size)
io->io_new_size = new_size;
- if (likely(!(ioflags & IO_INVIS))) {
+ /*
+ * We're not supposed to change timestamps in readonly-mounted
+ * filesystems. Throw it away if anyone asks us.
+ */
+ if (likely(!(ioflags & IO_INVIS) &&
+ !mnt_want_write(file->f_vfsmnt))) {
file_update_time(file);
xfs_ichgtime_fast(xip, inode,
XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
+ mnt_drop_write(file->f_vfsmnt);
}
/*
_
This is an old patch combined with a couple other ones I've
been working on. It should the issues that Miklos has
spotted.
--
This is the first really tricky patch in the series. It
elevates the writer count on a mount each time a non-special
file is opened for write.
We used to do this in may_open(), but Miklos pointed out
that __dentry_open() is used as well to create filps. This
will cover even those cases, while a call in may_open()
would not have.
There is also an elevated count around the vfs_create() call
in open_namei(). See the comments for more details, but we
need this to fix a 'create, remount, fail r/w open()' race.
Some filesystems forego the use of normal vfs calls to create
struct files. Make sure that these users elevate the mnt
writer count because they will get __fput(), and we need
to make sure they're balanced.
Signed-off-by: Dave Hansen <[email protected]>
---
linux-2.6.git-dave/fs/file_table.c | 16 +++++++-
linux-2.6.git-dave/fs/namei.c | 74 ++++++++++++++++++++++++++++++++-----
linux-2.6.git-dave/fs/open.c | 36 +++++++++++++++++-
linux-2.6.git-dave/ipc/mqueue.c | 3 +
4 files changed, 116 insertions(+), 13 deletions(-)
diff -puN fs/file_table.c~r-o-bind-mounts-elevate-write-count-opend-files fs/file_table.c
--- linux-2.6.git/fs/file_table.c~r-o-bind-mounts-elevate-write-count-opend-files 2007-11-01 14:46:16.000000000 -0700
+++ linux-2.6.git-dave/fs/file_table.c 2007-11-01 14:46:16.000000000 -0700
@@ -193,6 +193,17 @@ int init_file(struct file *file, struct
file->f_mapping = dentry->d_inode->i_mapping;
file->f_mode = mode;
file->f_op = fop;
+
+ /*
+ * These mounts don't really matter in practice
+ * for r/o bind mounts. They aren't userspace-
+ * visible. We do this for consistency, and so
+ * that we can do debugging checks at __fput()e
+ */
+ if ((mode & FMODE_WRITE) && !special_file(dentry->d_inode->i_mode)) {
+ error = mnt_want_write(mnt);
+ WARN_ON(error);
+ }
return error;
}
EXPORT_SYMBOL(init_file);
@@ -230,8 +241,11 @@ void fastcall __fput(struct file *file)
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
cdev_put(inode->i_cdev);
fops_put(file->f_op);
- if (file->f_mode & FMODE_WRITE)
+ if (file->f_mode & FMODE_WRITE) {
put_write_access(inode);
+ if (!special_file(inode->i_mode))
+ mnt_drop_write(mnt);
+ }
put_pid(file->f_owner.pid);
file_kill(file);
file->f_path.dentry = NULL;
diff -puN fs/namei.c~r-o-bind-mounts-elevate-write-count-opend-files fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-write-count-opend-files 2007-11-01 14:46:16.000000000 -0700
+++ linux-2.6.git-dave/fs/namei.c 2007-11-01 14:46:16.000000000 -0700
@@ -1620,12 +1620,12 @@ int may_open(struct nameidata *nd, int a
return -EACCES;
flag &= ~O_TRUNC;
- } else if (IS_RDONLY(inode) && (flag & FMODE_WRITE))
- return -EROFS;
+ }
error = vfs_permission(nd, acc_mode);
if (error)
return error;
+
/*
* An append-only file must be opened in append mode for writing.
*/
@@ -1721,18 +1721,31 @@ static inline int sys_open_flags_to_name
return flag;
}
+static inline int open_will_write_to_fs(int flag, struct inode *inode)
+{
+ /*
+ * We'll never write to the fs underlying
+ * a device file.
+ */
+ if (special_file(inode->i_mode))
+ return 0;
+ return (flag & O_TRUNC);
+}
+
/*
* open_pathname()
*
* namei for open - this is in fact almost the whole open-routine.
*
- * Note that the low bits of "flag" aren't the same as in the open
- * system call. See sys_open_flags_to_namei_flags().
+ * Note that the low bits of the passed in "sys_open_flag"
+ * are not the same as in the local variable "flag". See
+ * sys_open_flags_to_namei_flags() for more details.
* SMP-safe
*/
struct file *open_pathname(int dfd, const char *pathname,
int sys_open_flag, int mode)
{
+ struct file *filp;
struct nameidata nd;
int acc_mode, error;
struct path path;
@@ -1792,17 +1805,30 @@ do_last:
}
if (IS_ERR(nd.intent.open.file)) {
- mutex_unlock(&dir->d_inode->i_mutex);
error = PTR_ERR(nd.intent.open.file);
- goto exit_dput;
+ goto exit_mutex_unlock;
}
/* Negative dentry, just create the file */
if (!path.dentry->d_inode) {
- error = __open_namei_create(&nd, &path, flag, mode);
+ /*
+ * This write is needed to ensure that a
+ * ro->rw transition does not occur between
+ * the time when the file is created and when
+ * a permanent write count is taken through
+ * the 'struct file' in nameidata_to_filp().
+ */
+ error = mnt_want_write(nd.mnt);
if (error)
+ goto exit_mutex_unlock;
+ error = __open_namei_create(&nd, &path, flag, mode);
+ if (error) {
+ mnt_drop_write(nd.mnt);
goto exit;
- return nameidata_to_filp(&nd, sys_open_flag);
+ }
+ filp = nameidata_to_filp(&nd, sys_open_flag);
+ mnt_drop_write(nd.mnt);
+ return filp;
}
/*
@@ -1832,11 +1858,39 @@ do_last:
if (path.dentry->d_inode && S_ISDIR(path.dentry->d_inode->i_mode))
goto exit;
ok:
+ /*
+ * Consider:
+ * 1. may_open() truncates a file
+ * 2. a rw->ro mount transition occurs
+ * 3. nameidata_to_filp() fails due to
+ * the ro mount.
+ * That would be inconsistent, and should
+ * be avoided. Taking this mnt write here
+ * ensures that (2) can not occur.
+ */
+ if (open_will_write_to_fs(flag, nd.dentry->d_inode)) {
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto exit;
+ }
error = may_open(&nd, acc_mode, flag);
- if (error)
+ if (error) {
+ if (open_will_write_to_fs(flag, nd.dentry->d_inode))
+ mnt_drop_write(nd.mnt);
goto exit;
- return nameidata_to_filp(&nd, sys_open_flag);
+ }
+ filp = nameidata_to_filp(&nd, sys_open_flag);
+ /*
+ * It is now safe to drop the mnt write
+ * because the filp has had a write taken
+ * on its behalf.
+ */
+ if (open_will_write_to_fs(flag, nd.dentry->d_inode))
+ mnt_drop_write(nd.mnt);
+ return filp;
+exit_mutex_unlock:
+ mutex_unlock(&dir->d_inode->i_mutex);
exit_dput:
dput_path(&path, &nd);
exit:
diff -puN fs/open.c~r-o-bind-mounts-elevate-write-count-opend-files fs/open.c
--- linux-2.6.git/fs/open.c~r-o-bind-mounts-elevate-write-count-opend-files 2007-11-01 14:46:16.000000000 -0700
+++ linux-2.6.git-dave/fs/open.c 2007-11-01 14:46:16.000000000 -0700
@@ -734,6 +734,35 @@ out:
return error;
}
+/*
+ * You have to be very careful that these write
+ * counts get cleaned up in error cases and
+ * upon __fput(). This should probably never
+ * be called outside of __dentry_open().
+ */
+static inline int __get_file_write_access(struct inode *inode,
+ struct vfsmount *mnt)
+{
+ int error;
+ error = get_write_access(inode);
+ if (error)
+ return error;
+ /*
+ * Do not take mount writer counts on
+ * special files since no writes to
+ * the mount itself will occur.
+ */
+ if (!special_file(inode->i_mode)) {
+ /*
+ * Balanced in __fput()
+ */
+ error = mnt_want_write(mnt);
+ if (error)
+ put_write_access(inode);
+ }
+ return error;
+}
+
static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
int flags, struct file *f,
int (*open)(struct inode *, struct file *))
@@ -746,7 +775,7 @@ static struct file *__dentry_open(struct
FMODE_PREAD | FMODE_PWRITE;
inode = dentry->d_inode;
if (f->f_mode & FMODE_WRITE) {
- error = get_write_access(inode);
+ error = __get_file_write_access(inode, mnt);
if (error)
goto cleanup_file;
}
@@ -788,8 +817,11 @@ static struct file *__dentry_open(struct
cleanup_all:
fops_put(f->f_op);
- if (f->f_mode & FMODE_WRITE)
+ if (f->f_mode & FMODE_WRITE) {
put_write_access(inode);
+ if (!special_file(inode->i_mode))
+ mnt_drop_write(mnt);
+ }
file_kill(f);
f->f_path.dentry = NULL;
f->f_path.mnt = NULL;
diff -puN ipc/mqueue.c~r-o-bind-mounts-elevate-write-count-opend-files ipc/mqueue.c
--- linux-2.6.git/ipc/mqueue.c~r-o-bind-mounts-elevate-write-count-opend-files 2007-11-01 14:46:16.000000000 -0700
+++ linux-2.6.git-dave/ipc/mqueue.c 2007-11-01 14:46:16.000000000 -0700
@@ -682,6 +682,9 @@ asmlinkage long sys_mq_open(const char _
goto out;
filp = do_open(dentry, oflag);
} else {
+ error = mnt_want_write(mqueue_mnt);
+ if (error)
+ goto out;
filp = do_create(mqueue_mnt->mnt_root, dentry,
oflag, mode, u_attr);
}
_
This also uses the little helper in the NFS code to make an if() a little bit
less ugly. We introduced the helper at the beginning of the series.
Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/namei.c | 4 ++++
linux-2.6.git-dave/fs/nfsd/vfs.c | 15 +++++++++++----
2 files changed, 15 insertions(+), 4 deletions(-)
diff -puN fs/namei.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename 2007-11-01 14:46:16.000000000 -0700
+++ linux-2.6.git-dave/fs/namei.c 2007-11-01 14:46:16.000000000 -0700
@@ -2734,8 +2734,12 @@ static int do_rename(int olddfd, const c
if (new_dentry == trap)
goto exit5;
+ error = mnt_want_write(oldnd.mnt);
+ if (error)
+ goto exit5;
error = vfs_rename(old_dir->d_inode, old_dentry,
new_dir->d_inode, new_dentry);
+ mnt_drop_write(oldnd.mnt);
exit5:
dput(new_dentry);
exit4:
diff -puN fs/nfsd/vfs.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename fs/nfsd/vfs.c
--- linux-2.6.git/fs/nfsd/vfs.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename 2007-11-01 14:46:16.000000000 -0700
+++ linux-2.6.git-dave/fs/nfsd/vfs.c 2007-11-01 14:46:16.000000000 -0700
@@ -1668,13 +1668,20 @@ nfsd_rename(struct svc_rqst *rqstp, stru
if (ndentry == trap)
goto out_dput_new;
-#ifdef MSNFS
- if ((ffhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
+ if (svc_msnfs(ffhp) &&
((atomic_read(&odentry->d_count) > 1)
|| (atomic_read(&ndentry->d_count) > 1))) {
host_err = -EPERM;
- } else
-#endif
+ goto out_dput_new;
+ }
+
+ host_err = -EXDEV;
+ if (ffhp->fh_export->ex_mnt != tfhp->fh_export->ex_mnt)
+ goto out_dput_new;
+ host_err = mnt_want_write(ffhp->fh_export->ex_mnt);
+ if (host_err)
+ goto out_dput_new;
+
host_err = vfs_rename(fdir, odentry, tdir, ndentry);
if (!host_err && EX_ISSYNC(tfhp->fh_export)) {
host_err = nfsd_sync_dir(tdentry);
_
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/open.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff -puN fs/open.c~r-o-bind-mounts-elevate-writer-count-for-do_sys_truncate fs/open.c
--- linux-2.6.git/fs/open.c~r-o-bind-mounts-elevate-writer-count-for-do_sys_truncate 2007-11-01 14:46:18.000000000 -0700
+++ linux-2.6.git-dave/fs/open.c 2007-11-01 14:46:18.000000000 -0700
@@ -244,21 +244,21 @@ static long do_sys_truncate(const char _
if (!S_ISREG(inode->i_mode))
goto dput_and_out;
- error = vfs_permission(&nd, MAY_WRITE);
+ error = mnt_want_write(nd.mnt);
if (error)
goto dput_and_out;
- error = -EROFS;
- if (IS_RDONLY(inode))
- goto dput_and_out;
+ error = vfs_permission(&nd, MAY_WRITE);
+ if (error)
+ goto mnt_drop_write_and_out;
error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
error = get_write_access(inode);
if (error)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
/*
* Make sure that there are no leases. get_write_access() protects
@@ -276,6 +276,8 @@ static long do_sys_truncate(const char _
put_write_and_out:
put_write_access(inode);
+mnt_drop_write_and_out:
+ mnt_drop_write(nd.mnt);
dput_and_out:
path_release(&nd);
out:
_
It is OK to let access() go without using a mnt_want/drop_write() pair because
it doesn't actually do writes to the filesystem, and it is inherently racy
anyway. This is a rare case when it is OK to use __mnt_is_readonly()
directly.
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/open.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff -puN fs/open.c~r-o-bind-mounts-make-access-use-mnt-check fs/open.c
--- linux-2.6.git/fs/open.c~r-o-bind-mounts-make-access-use-mnt-check 2007-11-01 14:46:18.000000000 -0700
+++ linux-2.6.git-dave/fs/open.c 2007-11-01 14:46:18.000000000 -0700
@@ -459,8 +459,17 @@ asmlinkage long sys_faccessat(int dfd, c
if(res || !(mode & S_IWOTH) ||
special_file(nd.dentry->d_inode->i_mode))
goto out_path_release;
-
- if(IS_RDONLY(nd.dentry->d_inode))
+ /*
+ * This is a rare case where using __mnt_is_readonly()
+ * is OK without a mnt_want/drop_write() pair. Since
+ * no actual write to the fs is performed here, we do
+ * not need to telegraph to that to anyone.
+ *
+ * By doing this, we accept that this access is
+ * inherently racy and know that the fs may change
+ * state before we even see this result.
+ */
+ if (__mnt_is_readonly(nd.mnt))
res = -EROFS;
out_path_release:
_
chown/chmod,etc... don't call permission in the same way that the normal
"open for write" calls do. They still write to the filesystem, so bump the
write count during these operations.
This conflicts with the current (~2.6.23-rc7) audit git tree in -mm.
wiggle'ing the patch merges it.
Signed-off-by: Dave Hansen <[email protected]>
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/open.c | 39 ++++++++++++++++++++++++++++++---------
1 file changed, 30 insertions(+), 9 deletions(-)
diff -puN fs/open.c~r-o-bind-mounts-elevate-writer-count-for-chown-and-friends fs/open.c
--- linux-2.6.git/fs/open.c~r-o-bind-mounts-elevate-writer-count-for-chown-and-friends 2007-11-01 14:46:17.000000000 -0700
+++ linux-2.6.git-dave/fs/open.c 2007-11-01 14:46:17.000000000 -0700
@@ -571,12 +571,12 @@ asmlinkage long sys_fchmod(unsigned int
audit_inode(NULL, dentry);
- err = -EROFS;
- if (IS_RDONLY(inode))
+ err = mnt_want_write(file->f_vfsmnt);
+ if (err)
goto out_putf;
err = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
- goto out_putf;
+ goto out_drop_write;
mutex_lock(&inode->i_mutex);
if (mode == (mode_t) -1)
mode = inode->i_mode;
@@ -585,6 +585,8 @@ asmlinkage long sys_fchmod(unsigned int
err = notify_change(dentry, &newattrs);
mutex_unlock(&inode->i_mutex);
+out_drop_write:
+ mnt_drop_write(file->f_vfsmnt);
out_putf:
fput(file);
out:
@@ -604,13 +606,13 @@ asmlinkage long sys_fchmodat(int dfd, co
goto out;
inode = nd.dentry->d_inode;
- error = -EROFS;
- if (IS_RDONLY(inode))
+ error = mnt_want_write(nd.mnt);
+ if (error)
goto dput_and_out;
error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
- goto dput_and_out;
+ goto out_drop_write;
mutex_lock(&inode->i_mutex);
if (mode == (mode_t) -1)
@@ -620,6 +622,8 @@ asmlinkage long sys_fchmodat(int dfd, co
error = notify_change(nd.dentry, &newattrs);
mutex_unlock(&inode->i_mutex);
+out_drop_write:
+ mnt_drop_write(nd.mnt);
dput_and_out:
path_release(&nd);
out:
@@ -642,9 +646,6 @@ static int chown_common(struct dentry *
printk(KERN_ERR "chown_common: NULL inode\n");
goto out;
}
- error = -EROFS;
- if (IS_RDONLY(inode))
- goto out;
error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
goto out;
@@ -675,7 +676,12 @@ asmlinkage long sys_chown(const char __u
error = user_path_walk(filename, &nd);
if (error)
goto out;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_release;
error = chown_common(nd.dentry, user, group);
+ mnt_drop_write(nd.mnt);
+out_release:
path_release(&nd);
out:
return error;
@@ -695,7 +701,12 @@ asmlinkage long sys_fchownat(int dfd, co
error = __user_walk_fd(dfd, filename, follow, &nd);
if (error)
goto out;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_release;
error = chown_common(nd.dentry, user, group);
+ mnt_drop_write(nd.mnt);
+out_release:
path_release(&nd);
out:
return error;
@@ -709,7 +720,12 @@ asmlinkage long sys_lchown(const char __
error = user_path_walk_link(filename, &nd);
if (error)
goto out;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_release;
error = chown_common(nd.dentry, user, group);
+ mnt_drop_write(nd.mnt);
+out_release:
path_release(&nd);
out:
return error;
@@ -726,9 +742,14 @@ asmlinkage long sys_fchown(unsigned int
if (!file)
goto out;
+ error = mnt_want_write(file->f_vfsmnt);
+ if (error)
+ goto out_fput;
dentry = file->f_path.dentry;
audit_inode(NULL, dentry);
error = chown_common(dentry, user, group);
+ mnt_drop_write(file->f_vfsmnt);
+out_fput:
fput(file);
out:
return error;
_
This takes care of all of the direct callers of vfs_mknod().
Since a few of these cases also handle normal file creation
as well, this also covers some calls to vfs_create().
So that we don't have to make three mnt_want/drop_write()
calls inside of the switch statement, we move some of its
logic outside of the switch and into a helper function
suggested by Christoph.
This also encapsulates a fix for mknod(S_IFREG) that Miklos
found.
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/namei.c | 43 +++++++++++++++++++++++++---------
linux-2.6.git-dave/fs/nfsd/vfs.c | 4 +++
linux-2.6.git-dave/net/unix/af_unix.c | 4 +++
3 files changed, 40 insertions(+), 11 deletions(-)
diff -puN fs/namei.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create 2007-11-01 14:46:20.000000000 -0700
+++ linux-2.6.git-dave/fs/namei.c 2007-11-01 14:46:20.000000000 -0700
@@ -2022,6 +2022,23 @@ int vfs_mknod(struct inode *dir, struct
return error;
}
+static int may_mknod(mode_t mode)
+{
+ switch (mode & S_IFMT) {
+ case S_IFREG:
+ case S_IFCHR:
+ case S_IFBLK:
+ case S_IFIFO:
+ case S_IFSOCK:
+ case 0: /* zero mode translates to S_IFREG */
+ return 0;
+ case S_IFDIR:
+ return -EPERM;
+ default:
+ return -EINVAL;
+ }
+}
+
asmlinkage long sys_mknodat(int dfd, const char __user *filename, int mode,
unsigned dev)
{
@@ -2040,12 +2057,19 @@ asmlinkage long sys_mknodat(int dfd, con
if (error)
goto out;
dentry = lookup_create(&nd, 0);
- error = PTR_ERR(dentry);
-
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
+ goto out_unlock;
+ }
if (!IS_POSIXACL(nd.dentry->d_inode))
mode &= ~current->fs->umask;
- if (!IS_ERR(dentry)) {
- switch (mode & S_IFMT) {
+ error = may_mknod(mode);
+ if (error)
+ goto out_dput;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
+ switch (mode & S_IFMT) {
case 0: case S_IFREG:
error = vfs_create(nd.dentry->d_inode,dentry,mode,&nd);
break;
@@ -2056,14 +2080,11 @@ asmlinkage long sys_mknodat(int dfd, con
case S_IFIFO: case S_IFSOCK:
error = vfs_mknod(nd.dentry->d_inode,dentry,mode,0);
break;
- case S_IFDIR:
- error = -EPERM;
- break;
- default:
- error = -EINVAL;
- }
- dput(dentry);
}
+ mnt_drop_write(nd.mnt);
+out_dput:
+ dput(dentry);
+out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
path_release(&nd);
out:
diff -puN fs/nfsd/vfs.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create fs/nfsd/vfs.c
--- linux-2.6.git/fs/nfsd/vfs.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create 2007-11-01 14:46:20.000000000 -0700
+++ linux-2.6.git-dave/fs/nfsd/vfs.c 2007-11-01 14:46:20.000000000 -0700
@@ -1242,7 +1242,11 @@ nfsd_create(struct svc_rqst *rqstp, stru
case S_IFBLK:
case S_IFIFO:
case S_IFSOCK:
+ host_err = mnt_want_write(fhp->fh_export->ex_mnt);
+ if (host_err)
+ break;
host_err = vfs_mknod(dirp, dchild, iap->ia_mode, rdev);
+ mnt_drop_write(fhp->fh_export->ex_mnt);
break;
default:
printk("nfsd: bad file type %o in nfsd_create\n", type);
diff -puN net/unix/af_unix.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create net/unix/af_unix.c
--- linux-2.6.git/net/unix/af_unix.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create 2007-11-01 14:46:20.000000000 -0700
+++ linux-2.6.git-dave/net/unix/af_unix.c 2007-11-01 14:46:20.000000000 -0700
@@ -838,7 +838,11 @@ static int unix_bind(struct socket *sock
*/
mode = S_IFSOCK |
(SOCK_INODE(sock)->i_mode & ~current->fs->umask);
+ err = mnt_want_write(nd.mnt);
+ if (err)
+ goto out_mknod_dput;
err = vfs_mknod(nd.dentry->d_inode, dentry, mode, 0);
+ mnt_drop_write(nd.mnt);
if (err)
goto out_mknod_dput;
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_
This is the real meat of the entire series. It actually implements the
tracking of the number of writers to a mount. However, it causes scalability
problems because there can be hundreds of cpus doing open()/close() on files
on the same mnt at the same time. Even an atomic_t in the mnt has massive
scalaing problems because the cacheline gets so terribly contended.
This uses a statically-allocated percpu variable. All operations are local to
a cpu as long that cpu operates on the same mount, and there are no writer
count imbalances. Writer count imbalances happen when a write is taken on one
cpu, and released on another, like when an open/close pair is performed on two
Upon a remount,ro request, all of the data from the percpu variables is
collected (expensive, but very rare) and we determine if there are any
outstanding writers to the mount.
I've written a little benchmark to sit in a loop for a couple of seconds in
several cpus in parallel doing open/write/close loops.
http://sr71.net/~dave/linux/openbench.c
The code in here is a a worst-possible case for this patch. It does opens on
a _pair_ of files in two different mounts in parallel. This should cause my
code to lose its "operate on the same mount" optimization completely. This
worst-case scenario causes a 3% degredation in the benchmark.
I could probably get rid of even this 3%, but it would be more complex than
what I have here, and I think this is getting into acceptable territory. In
practice, I expect writing more than 3 bytes to a file, as well as disk I/O to
mask any effects that this has.
(To get rid of that 3%, we could have an #defined number of mounts in the
percpu variable. So, instead of a CPU getting operate only on percpu data
when it accesses only one mount, it could stay on percpu data when it only
accesses N or fewer mounts.)
Signed-off-by: Dave Hansen <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/namespace.c | 205 ++++++++++++++++++++++++++++---
linux-2.6.git-dave/include/linux/mount.h | 8 +
2 files changed, 198 insertions(+), 15 deletions(-)
diff -puN fs/namespace.c~r-o-bind-mounts-track-number-of-mount-writers fs/namespace.c
--- linux-2.6.git/fs/namespace.c~r-o-bind-mounts-track-number-of-mount-writers 2007-11-01 14:46:20.000000000 -0700
+++ linux-2.6.git-dave/fs/namespace.c 2007-11-01 14:46:20.000000000 -0700
@@ -17,6 +17,7 @@
#include <linux/quotaops.h>
#include <linux/acct.h>
#include <linux/capability.h>
+#include <linux/cpumask.h>
#include <linux/module.h>
#include <linux/sysfs.h>
#include <linux/seq_file.h>
@@ -52,6 +53,8 @@ static inline unsigned long hash(struct
return tmp & hash_mask;
}
+#define MNT_WRITER_UNDERFLOW_LIMIT -(1<<16)
+
struct vfsmount *alloc_vfsmnt(const char *name)
{
struct vfsmount *mnt = kmem_cache_zalloc(mnt_cache, GFP_KERNEL);
@@ -65,6 +68,7 @@ struct vfsmount *alloc_vfsmnt(const char
INIT_LIST_HEAD(&mnt->mnt_share);
INIT_LIST_HEAD(&mnt->mnt_slave_list);
INIT_LIST_HEAD(&mnt->mnt_slave);
+ atomic_set(&mnt->__mnt_writers, 0);
if (name) {
int size = strlen(name) + 1;
char *newname = kmalloc(size, GFP_KERNEL);
@@ -85,6 +89,84 @@ struct vfsmount *alloc_vfsmnt(const char
* we can determine when writes are able to occur to
* a filesystem.
*/
+/*
+ * __mnt_is_readonly: check whether a mount is read-only
+ * @mnt: the mount to check for its write status
+ *
+ * This shouldn't be used directly ouside of the VFS.
+ * It does not guarantee that the filesystem will stay
+ * r/w, just that it is right *now*. This can not and
+ * should not be used in place of IS_RDONLY(inode).
+ * mnt_want/drop_write() will _keep_ the filesystem
+ * r/w.
+ */
+int __mnt_is_readonly(struct vfsmount *mnt)
+{
+ return (mnt->mnt_sb->s_flags & MS_RDONLY);
+}
+EXPORT_SYMBOL_GPL(__mnt_is_readonly);
+
+struct mnt_writer {
+ /*
+ * If holding multiple instances of this lock, they
+ * must be ordered by cpu number.
+ */
+ spinlock_t lock;
+ unsigned long count;
+ struct vfsmount *mnt;
+} ____cacheline_aligned_in_smp;
+static DEFINE_PER_CPU(struct mnt_writer, mnt_writers);
+
+static int __init init_mnt_writers(void)
+{
+ int cpu;
+ for_each_possible_cpu(cpu) {
+ struct mnt_writer *writer = &per_cpu(mnt_writers, cpu);
+ spin_lock_init(&writer->lock);
+ writer->count = 0;
+ }
+ return 0;
+}
+fs_initcall(init_mnt_writers);
+
+static void mnt_unlock_cpus(void)
+{
+ int cpu;
+ struct mnt_writer *cpu_writer;
+
+ for_each_possible_cpu(cpu) {
+ cpu_writer = &per_cpu(mnt_writers, cpu);
+ spin_unlock(&cpu_writer->lock);
+ }
+}
+
+static inline void __clear_mnt_count(struct mnt_writer *cpu_writer)
+{
+ if (!cpu_writer->mnt)
+ return;
+ atomic_add(cpu_writer->count, &cpu_writer->mnt->__mnt_writers);
+ cpu_writer->count = 0;
+}
+ /*
+ * must hold cpu_writer->lock
+ */
+static inline void use_cpu_writer_for_mount(struct mnt_writer *cpu_writer,
+ struct vfsmount *mnt)
+{
+ if (cpu_writer->mnt == mnt)
+ return;
+ __clear_mnt_count(cpu_writer);
+ cpu_writer->mnt = mnt;
+}
+
+/*
+ * Most r/o checks on a fs are for operations that take
+ * discrete amounts of time, like a write() or unlink().
+ * We must keep track of when those operations start
+ * (for permission checks) and when they end, so that
+ * we can determine when writes are able to occur to
+ * a filesystem.
+ */
/**
* mnt_want_write - get write access to a mount
* @mnt: the mount on which to take a write
@@ -97,12 +179,58 @@ struct vfsmount *alloc_vfsmnt(const char
*/
int mnt_want_write(struct vfsmount *mnt)
{
- if (__mnt_is_readonly(mnt))
- return -EROFS;
- return 0;
+ int ret = 0;
+ struct mnt_writer *cpu_writer;
+
+ cpu_writer = &get_cpu_var(mnt_writers);
+ spin_lock(&cpu_writer->lock);
+ if (__mnt_is_readonly(mnt)) {
+ ret = -EROFS;
+ goto out;
+ }
+ use_cpu_writer_for_mount(cpu_writer, mnt);
+ cpu_writer->count++;
+out:
+ spin_unlock(&cpu_writer->lock);
+ put_cpu_var(mnt_writers);
+ return ret;
}
EXPORT_SYMBOL_GPL(mnt_want_write);
+static void lock_and_coalesce_cpu_mnt_writer_counts(void)
+{
+ int cpu;
+ struct mnt_writer *cpu_writer;
+
+ for_each_possible_cpu(cpu) {
+ cpu_writer = &per_cpu(mnt_writers, cpu);
+ spin_lock(&cpu_writer->lock);
+ __clear_mnt_count(cpu_writer);
+ cpu_writer->mnt = NULL;
+ }
+}
+
+/*
+ * These per-cpu write counts are not guaranteed to have
+ * matched increments and decrements on any given cpu.
+ * A file open()ed for write on one cpu and close()d on
+ * another cpu will imbalance this count. Make sure it
+ * does not get too far out of whack.
+ */
+static void handle_write_count_underflow(struct vfsmount *mnt)
+{
+ while (atomic_read(&mnt->__mnt_writers) <
+ MNT_WRITER_UNDERFLOW_LIMIT) {
+ /*
+ * It isn't necessary to hold all of the locks
+ * at the same time, but doing it this way makes
+ * us share a lot more code.
+ */
+ lock_and_coalesce_cpu_mnt_writer_counts();
+ mnt_unlock_cpus();
+ }
+}
+
/**
* mnt_drop_write - give up write access to a mount
* @mnt: the mount on which to give up write access
@@ -113,23 +241,61 @@ EXPORT_SYMBOL_GPL(mnt_want_write);
*/
void mnt_drop_write(struct vfsmount *mnt)
{
+ int must_check_underflow = 0;
+ struct mnt_writer *cpu_writer;
+
+ cpu_writer = &get_cpu_var(mnt_writers);
+ spin_lock(&cpu_writer->lock);
+
+ use_cpu_writer_for_mount(cpu_writer, mnt);
+ if (cpu_writer->count > 0) {
+ cpu_writer->count--;
+ } else {
+ must_check_underflow = 1;
+ atomic_dec(&mnt->__mnt_writers);
+ }
+
+ spin_unlock(&cpu_writer->lock);
+ /*
+ * Logically, we could call this each time,
+ * but the __mnt_writers cacheline tends to
+ * be cold, and makes this expensive.
+ */
+ if (must_check_underflow)
+ handle_write_count_underflow(mnt);
+ /*
+ * This could be done right after the spinlock
+ * is taken because the spinlock keeps us on
+ * the cpu, and disables preemption. However,
+ * putting it here bounds the amount that
+ * __mnt_writers can underflow. Without it,
+ * we could theoretically wrap __mnt_writers.
+ */
+ put_cpu_var(mnt_writers);
}
EXPORT_SYMBOL_GPL(mnt_drop_write);
-/*
- * __mnt_is_readonly: check whether a mount is read-only
- * @mnt: the mount to check for its write status
- *
- * This shouldn't be used directly ouside of the VFS.
- * It does not guarantee that the filesystem will stay
- * r/w, just that it is right *now*. This can not and
- * should not be used in place of IS_RDONLY(inode).
- */
-int __mnt_is_readonly(struct vfsmount *mnt)
+int mnt_make_readonly(struct vfsmount *mnt)
{
- return (mnt->mnt_sb->s_flags & MS_RDONLY);
+ int ret = 0;
+
+ lock_and_coalesce_cpu_mnt_writer_counts();
+ /*
+ * With all the locks held, this value is stable
+ */
+ if (atomic_read(&mnt->__mnt_writers) > 0) {
+ ret = -EBUSY;
+ goto out;
+ }
+ /*
+ * actually set mount's r/o flag here to make
+ * __mnt_is_readonly() true, which keeps anyone
+ * from doing a successful mnt_want_write().
+ */
+out:
+ mnt_unlock_cpus();
+ return ret;
}
-EXPORT_SYMBOL_GPL(__mnt_is_readonly);
int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
{
@@ -325,6 +491,15 @@ static struct vfsmount *clone_mnt(struct
static inline void __mntput(struct vfsmount *mnt)
{
struct super_block *sb = mnt->mnt_sb;
+ lock_and_coalesce_cpu_mnt_writer_counts();
+ mnt_unlock_cpus();
+ /*
+ * This probably indicates that somebody messed
+ * up a mnt_want/drop_write() pair. If this
+ * happens, the filesystem was probably unable
+ * to make r/w->r/o transitions.
+ */
+ WARN_ON(atomic_read(&mnt->__mnt_writers));
dput(mnt->mnt_root);
free_vfsmnt(mnt);
deactivate_super(sb);
diff -puN include/linux/mount.h~r-o-bind-mounts-track-number-of-mount-writers include/linux/mount.h
--- linux-2.6.git/include/linux/mount.h~r-o-bind-mounts-track-number-of-mount-writers 2007-11-01 14:46:20.000000000 -0700
+++ linux-2.6.git-dave/include/linux/mount.h 2007-11-01 14:46:20.000000000 -0700
@@ -14,6 +14,7 @@
#include <linux/types.h>
#include <linux/list.h>
+#include <linux/nodemask.h>
#include <linux/spinlock.h>
#include <asm/atomic.h>
@@ -61,6 +62,13 @@ struct vfsmount {
atomic_t mnt_count;
int mnt_expiry_mark; /* true if marked for expiry */
int mnt_pinned;
+ /*
+ * This value is not stable unless all of the
+ * mnt_writers[] spinlocks are held, and all
+ * mnt_writer[]s on this mount have 0 as
+ * their ->count
+ */
+ atomic_t __mnt_writers;
};
static inline struct vfsmount *mntget(struct vfsmount *mnt)
_
If we depend on the inodes for writeability, we will not catch the r/o mounts
when implemented.
This patches uses __mnt_want_write(). It does not guarantee that the mount
will stay writeable after the check. But, this is OK for one of the checks
because it is just for a printk().
The other two are probably unnecessary and duplicate existing checks in the
VFS. This won't make them better checks than before, but it will make them
detect r/o mounts.
Acked-by: Christoph Hellwig <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/nfs/dir.c | 3 ++-
linux-2.6.git-dave/fs/nfsd/vfs.c | 4 ++--
2 files changed, 4 insertions(+), 3 deletions(-)
diff -puN fs/nfs/dir.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly fs/nfs/dir.c
--- linux-2.6.git/fs/nfs/dir.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly 2007-11-01 14:46:19.000000000 -0700
+++ linux-2.6.git-dave/fs/nfs/dir.c 2007-11-01 14:46:19.000000000 -0700
@@ -949,7 +949,8 @@ static int is_atomic_open(struct inode *
if (nd->flags & LOOKUP_DIRECTORY)
return 0;
/* Are we trying to write to a read only partition? */
- if (IS_RDONLY(dir) && (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE)))
+ if (__mnt_is_readonly(nd->mnt) &&
+ (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE)))
return 0;
return 1;
}
diff -puN fs/nfsd/vfs.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly fs/nfsd/vfs.c
--- linux-2.6.git/fs/nfsd/vfs.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly 2007-11-01 14:46:19.000000000 -0700
+++ linux-2.6.git-dave/fs/nfsd/vfs.c 2007-11-01 14:46:19.000000000 -0700
@@ -1862,7 +1862,7 @@ nfsd_permission(struct svc_rqst *rqstp,
inode->i_mode,
IS_IMMUTABLE(inode)? " immut" : "",
IS_APPEND(inode)? " append" : "",
- IS_RDONLY(inode)? " ro" : "");
+ __mnt_is_readonly(exp->ex_mnt)? " ro" : "");
dprintk(" owner %d/%d user %d/%d\n",
inode->i_uid, inode->i_gid, current->fsuid, current->fsgid);
#endif
@@ -1873,7 +1873,7 @@ nfsd_permission(struct svc_rqst *rqstp,
*/
if (!(acc & MAY_LOCAL_ACCESS))
if (acc & (MAY_WRITE | MAY_SATTR | MAY_TRUNC)) {
- if (exp_rdonly(rqstp, exp) || IS_RDONLY(inode))
+ if (exp_rdonly(rqstp, exp) || __mnt_is_readonly(exp->ex_mnt))
return nfserr_rofs;
if (/* (acc & MAY_WRITE) && */ IS_IMMUTABLE(inode))
return nfserr_perm;
_
Originally from: Herbert Poetzl <[email protected]>
This is the core of the read-only bind mount patch set.
Note that this does _not_ add a "ro" option directly to the bind mount
operation. If you require such a mount, you must first do the bind, then
follow it up with a 'mount -o remount,ro' operation.
Signed-off-by: Dave Hansen <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
linux-2.6.git-dave/fs/namespace.c | 46 ++++++++++++++++++++++++++-----
linux-2.6.git-dave/include/linux/mount.h | 1
2 files changed, 40 insertions(+), 7 deletions(-)
diff -puN fs/namespace.c~r-o-bind-mounts-honor-r-w-changes-at-do_remount-time fs/namespace.c
--- linux-2.6.git/fs/namespace.c~r-o-bind-mounts-honor-r-w-changes-at-do_remount-time 2007-11-01 14:46:22.000000000 -0700
+++ linux-2.6.git-dave/fs/namespace.c 2007-11-01 14:46:22.000000000 -0700
@@ -102,7 +102,11 @@ struct vfsmount *alloc_vfsmnt(const char
*/
int __mnt_is_readonly(struct vfsmount *mnt)
{
- return (mnt->mnt_sb->s_flags & MS_RDONLY);
+ if (mnt->mnt_flags & MNT_READONLY)
+ return 1;
+ if (mnt->mnt_sb->s_flags & MS_RDONLY)
+ return 1;
+ return 0;
}
EXPORT_SYMBOL_GPL(__mnt_is_readonly);
@@ -277,7 +281,7 @@ void mnt_drop_write(struct vfsmount *mnt
}
EXPORT_SYMBOL_GPL(mnt_drop_write);
-int mnt_make_readonly(struct vfsmount *mnt)
+static int mnt_make_readonly(struct vfsmount *mnt)
{
int ret = 0;
@@ -290,15 +294,21 @@ int mnt_make_readonly(struct vfsmount *m
goto out;
}
/*
- * actually set mount's r/o flag here to make
- * __mnt_is_readonly() true, which keeps anyone
- * from doing a successful mnt_want_write().
+ * nobody can do a successful mnt_want_write() with all
+ * of the counts in MNT_DENIED_WRITE and the locks held.
*/
+ if (!ret)
+ mnt->mnt_flags |= MNT_READONLY;
out:
mnt_unlock_cpus();
return ret;
}
+static void __mnt_unmake_readonly(struct vfsmount *mnt)
+{
+ mnt->mnt_flags &= ~MNT_READONLY;
+}
+
int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
{
mnt->mnt_sb = sb;
@@ -607,7 +617,7 @@ static int show_vfsmnt(struct seq_file *
seq_putc(m, '.');
mangle(m, mnt->mnt_sb->s_subtype);
}
- seq_puts(m, mnt->mnt_sb->s_flags & MS_RDONLY ? " ro" : " rw");
+ seq_puts(m, __mnt_is_readonly(mnt) ? " ro" : " rw");
for (fs_infop = fs_info; fs_infop->flag; fs_infop++) {
if (mnt->mnt_sb->s_flags & fs_infop->flag)
seq_puts(m, fs_infop->str);
@@ -1195,6 +1205,23 @@ out:
return err;
}
+static int change_mount_flags(struct vfsmount *mnt, int ms_flags)
+{
+ int error = 0;
+ int readonly_request = 0;
+
+ if (ms_flags & MS_RDONLY)
+ readonly_request = 1;
+ if (readonly_request == __mnt_is_readonly(mnt))
+ return 0;
+
+ if (readonly_request)
+ error = mnt_make_readonly(mnt);
+ else
+ __mnt_unmake_readonly(mnt);
+ return error;
+}
+
/*
* change filesystem flags. dir should be a physical root of filesystem.
* If you've mounted a non-root directory somewhere and want to do remount
@@ -1216,7 +1243,10 @@ static int do_remount(struct nameidata *
return -EINVAL;
down_write(&sb->s_umount);
- err = do_remount_sb(sb, flags, data, 0);
+ if (flags & MS_BIND)
+ err = change_mount_flags(nd->mnt, flags);
+ else
+ err = do_remount_sb(sb, flags, data, 0);
if (!err)
nd->mnt->mnt_flags = mnt_flags;
up_write(&sb->s_umount);
@@ -1660,6 +1690,8 @@ long do_mount(char *dev_name, char *dir_
mnt_flags |= MNT_NODIRATIME;
if (flags & MS_RELATIME)
mnt_flags |= MNT_RELATIME;
+ if (flags & MS_RDONLY)
+ mnt_flags |= MNT_READONLY;
flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT);
diff -puN include/linux/mount.h~r-o-bind-mounts-honor-r-w-changes-at-do_remount-time include/linux/mount.h
--- linux-2.6.git/include/linux/mount.h~r-o-bind-mounts-honor-r-w-changes-at-do_remount-time 2007-11-01 14:46:22.000000000 -0700
+++ linux-2.6.git-dave/include/linux/mount.h 2007-11-01 14:46:22.000000000 -0700
@@ -29,6 +29,7 @@ struct mnt_namespace;
#define MNT_NOATIME 0x08
#define MNT_NODIRATIME 0x10
#define MNT_RELATIME 0x20
+#define MNT_READONLY 0x40 /* does the user want this to be r/o? */
#define MNT_SHRINKABLE 0x100
_
With the r/o bind mount patches, we can have as many
spinlocks nested as there are CPUs on the system.
Lockdep freaks out after 8.
So, create a new lockdep class of locks for the
mnt_writer spinlocks, and initialize each of the
cpu locks to be in a different class.
It should shut up warnings like this, while still
allowing some of the lockdep goodness to remain:
=============================================
[ INFO: possible recursive locking detected ]
2.6.23-rc6 #6
---
linux-2.6.git-dave/fs/namespace.c | 2 ++
1 file changed, 2 insertions(+)
diff -puN fs/namespace.c~r-o-bind-mounts-track-number-of-mount-writers-make-lockdep-happy-with-r-o-bind-mounts fs/namespace.c
--- linux-2.6.git/fs/namespace.c~r-o-bind-mounts-track-number-of-mount-writers-make-lockdep-happy-with-r-o-bind-mounts 2007-11-01 14:46:21.000000000 -0700
+++ linux-2.6.git-dave/fs/namespace.c 2007-11-01 14:46:21.000000000 -0700
@@ -112,6 +112,7 @@ struct mnt_writer {
* must be ordered by cpu number.
*/
spinlock_t lock;
+ struct lock_class_key lock_class; /* compiles out with !lockdep */
unsigned long count;
struct vfsmount *mnt;
} ____cacheline_aligned_in_smp;
@@ -123,6 +124,7 @@ static int __init init_mnt_writers(void)
for_each_possible_cpu(cpu) {
struct mnt_writer *writer = &per_cpu(mnt_writers, cpu);
spin_lock_init(&writer->lock);
+ lockdep_set_class(&writer->lock, &writer->lock_class);
writer->count = 0;
}
return 0;
_
There have been a few oopses caused by 'struct file's with
NULL f_vfsmnts. There was also a set of potentially missed
mnt_want_write()s from dentry_open() calls.
This patch provides a very simple debugging framework to
catch these kinds of bugs. It will WARN_ON() them, but
should stop us from having any oopses or mnt_writer
count imbalances.
I'm quite convinced that this is a good thing because it
found bugs in the stuff I was working on as soon as I
wrote it.
Signed-off-by: Dave Hansen <[email protected]>
---
linux-2.6.git-dave/fs/file_table.c | 21 +++++++++++++++++++--
linux-2.6.git-dave/fs/open.c | 14 +++++++++++++-
linux-2.6.git-dave/include/linux/fs.h | 4 ++++
3 files changed, 36 insertions(+), 3 deletions(-)
diff -puN fs/file_table.c~keep-track-of-mnt_writer-state-of-struct-file fs/file_table.c
--- linux-2.6.git/fs/file_table.c~keep-track-of-mnt_writer-state-of-struct-file 2007-11-01 14:46:22.000000000 -0700
+++ linux-2.6.git-dave/fs/file_table.c 2007-11-01 14:46:22.000000000 -0700
@@ -42,6 +42,12 @@ static inline void file_free_rcu(struct
static inline void file_free(struct file *f)
{
percpu_counter_dec(&nr_files);
+ /*
+ * At this point, either both or neither of these bits
+ * should be set.
+ */
+ WARN_ON(f->f_mnt_write_state == FILE_MNT_WRITE_TAKEN);
+ WARN_ON(f->f_mnt_write_state == FILE_MNT_WRITE_RELEASED);
call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
}
@@ -201,6 +207,7 @@ int init_file(struct file *file, struct
* that we can do debugging checks at __fput()e
*/
if ((mode & FMODE_WRITE) && !special_file(dentry->d_inode->i_mode)) {
+ file->f_mnt_write_state = FILE_MNT_WRITE_TAKEN;
error = mnt_want_write(mnt);
WARN_ON(error);
}
@@ -243,8 +250,18 @@ void fastcall __fput(struct file *file)
fops_put(file->f_op);
if (file->f_mode & FMODE_WRITE) {
put_write_access(inode);
- if (!special_file(inode->i_mode))
- mnt_drop_write(mnt);
+ if (!special_file(inode->i_mode)) {
+ if (file->f_mnt_write_state == FILE_MNT_WRITE_TAKEN) {
+ mnt_drop_write(mnt);
+ file->f_mnt_write_state |=
+ FILE_MNT_WRITE_RELEASED;
+ } else {
+ printk(KERN_WARNING "__fput() of writeable "
+ "file with no "
+ "mnt_want_write()\n");
+ WARN_ON(1);
+ }
+ }
}
put_pid(file->f_owner.pid);
file_kill(file);
diff -puN fs/open.c~keep-track-of-mnt_writer-state-of-struct-file fs/open.c
--- linux-2.6.git/fs/open.c~keep-track-of-mnt_writer-state-of-struct-file 2007-11-01 14:46:22.000000000 -0700
+++ linux-2.6.git-dave/fs/open.c 2007-11-01 14:46:22.000000000 -0700
@@ -810,6 +810,10 @@ static struct file *__dentry_open(struct
error = __get_file_write_access(inode, mnt);
if (error)
goto cleanup_file;
+ if (!special_file(inode->i_mode)) {
+ WARN_ON(f->f_mnt_write_state != 0);
+ f->f_mnt_write_state = FILE_MNT_WRITE_TAKEN;
+ }
}
f->f_mapping = inode->i_mapping;
@@ -851,8 +855,16 @@ cleanup_all:
fops_put(f->f_op);
if (f->f_mode & FMODE_WRITE) {
put_write_access(inode);
- if (!special_file(inode->i_mode))
+ if (!special_file(inode->i_mode)) {
+ /*
+ * We don't consider this a real
+ * mnt_want/drop_write() pair
+ * because it all happenend right
+ * here, so just reset the state.
+ */
+ f->f_mnt_write_state = 0;
mnt_drop_write(mnt);
+ }
}
file_kill(f);
f->f_path.dentry = NULL;
diff -puN include/linux/fs.h~keep-track-of-mnt_writer-state-of-struct-file include/linux/fs.h
--- linux-2.6.git/include/linux/fs.h~keep-track-of-mnt_writer-state-of-struct-file 2007-11-01 14:46:22.000000000 -0700
+++ linux-2.6.git-dave/include/linux/fs.h 2007-11-01 14:46:22.000000000 -0700
@@ -774,6 +774,9 @@ static inline int ra_has_index(struct fi
index < ra->start + ra->size);
}
+#define FILE_MNT_WRITE_TAKEN 1
+#define FILE_MNT_WRITE_RELEASED 2
+
struct file {
/*
* fu_list becomes invalid after file_free is called and queued via
@@ -808,6 +811,7 @@ struct file {
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping;
+ unsigned long f_mnt_write_state;
};
extern spinlock_t files_lock;
#define file_list_lock() spin_lock(&files_lock);
_
On Thu, 01 Nov 2007 16:08:47 -0700
Dave Hansen <[email protected]> wrote:
> Some ioctl()s can cause writes to the filesystem. Take these, and make them
> use mnt_want/drop_write() instead.
>
> We need to pass the filp one layer deeper in XFS, but somebody _just_ pulled
> it out in February because nobody was using it, so I don't feel guilty for
> adding it back.
See, when we combine this patch with Jan's
forbid-user-to-change-file-flags-on-quota-files.patch we silently add bugs
to five filesystems. Lessons:
- never ever ever do `return' from deep in the guts of a function. This
is a *classic* instance of the maintainability risks which this practice
introduces.
- this whole elevate-the-write-count-in-a-zillion-places stuff is quite
fragile.
diff -puN fs/ext2/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files fs/ext2/ioctl.c
--- a/fs/ext2/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files
+++ a/fs/ext2/ioctl.c
@@ -57,7 +57,8 @@ int ext2_ioctl (struct inode * inode, st
/* Is it quota file? Do not allow user to mess with it */
if (IS_NOQUOTA(inode)) {
mutex_unlock(&inode->i_mutex);
- return -EPERM;
+ ret = -EPERM;
+ goto setflags_out;
}
oldflags = ei->i_flags;
diff -puN fs/ext3/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files fs/ext3/ioctl.c
--- a/fs/ext3/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files
+++ a/fs/ext3/ioctl.c
@@ -60,7 +60,8 @@ int ext3_ioctl (struct inode * inode, st
/* Is it quota file? Do not allow user to mess with it */
if (IS_NOQUOTA(inode)) {
mutex_unlock(&inode->i_mutex);
- return -EPERM;
+ err = -EPERM;
+ goto flags_out;
}
oldflags = ei->i_flags;
diff -puN fs/ext4/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files fs/ext4/ioctl.c
--- a/fs/ext4/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files
+++ a/fs/ext4/ioctl.c
@@ -59,7 +59,8 @@ int ext4_ioctl (struct inode * inode, st
/* Is it quota file? Do not allow user to mess with it */
if (IS_NOQUOTA(inode)) {
mutex_unlock(&inode->i_mutex);
- return -EPERM;
+ err = -EPERM;
+ goto flags_out;
}
oldflags = ei->i_flags;
diff -puN fs/jfs/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files fs/jfs/ioctl.c
--- a/fs/jfs/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files
+++ a/fs/jfs/ioctl.c
@@ -85,8 +85,10 @@ int jfs_ioctl(struct inode * inode, stru
flags &= ~JFS_DIRSYNC_FL;
/* Is it quota file? Do not allow user to mess with it */
- if (IS_NOQUOTA(inode))
- return -EPERM;
+ if (IS_NOQUOTA(inode)) {
+ err = -EPERM;
+ goto setflags_out;
+ }
jfs_get_inode_flags(jfs_inode);
oldflags = jfs_inode->mode2;
diff -puN fs/reiserfs/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files fs/reiserfs/ioctl.c
_
On Thu, 01 Nov 2007 16:09:00 -0700
Dave Hansen <[email protected]> wrote:
> With the r/o bind mount patches, we can have as many
> spinlocks nested as there are CPUs on the system.
> Lockdep freaks out after 8.
>
> So, create a new lockdep class of locks for the
> mnt_writer spinlocks, and initialize each of the
> cpu locks to be in a different class.
>
> It should shut up warnings like this, while still
> allowing some of the lockdep goodness to remain:
>
> =============================================
> [ INFO: possible recursive locking detected ]
> 2.6.23-rc6 #6
>
I added your signed-off-by: to this.
I wish you hadn't mucked up all those Subject:s
> On Thu, 01 Nov 2007 16:08:47 -0700
> Dave Hansen <[email protected]> wrote:
>
> > Some ioctl()s can cause writes to the filesystem. Take these, and make them
> > use mnt_want/drop_write() instead.
> >
> > We need to pass the filp one layer deeper in XFS, but somebody _just_ pulled
> > it out in February because nobody was using it, so I don't feel guilty for
> > adding it back.
>
> See, when we combine this patch with Jan's
> forbid-user-to-change-file-flags-on-quota-files.patch we silently add bugs
> to five filesystems. Lessons:
>
> - never ever ever do `return' from deep in the guts of a function. This
> is a *classic* instance of the maintainability risks which this practice
> introduces.
OK, lesson learned ;) Thanks for spotting this. But if there already
was a label like:
err_out:
return err;
And I would have just added a place which does 'goto err_out', then
the same problem could arise... But I agree it's less likely than if we
do just return err;.
> - this whole elevate-the-write-count-in-a-zillion-places stuff is quite
> fragile.
>
> diff -puN fs/ext2/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files fs/ext2/ioctl.c
> --- a/fs/ext2/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files
> +++ a/fs/ext2/ioctl.c
> @@ -57,7 +57,8 @@ int ext2_ioctl (struct inode * inode, st
> /* Is it quota file? Do not allow user to mess with it */
> if (IS_NOQUOTA(inode)) {
> mutex_unlock(&inode->i_mutex);
> - return -EPERM;
> + ret = -EPERM;
> + goto setflags_out;
> }
> oldflags = ei->i_flags;
>
> diff -puN fs/ext3/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files fs/ext3/ioctl.c
> --- a/fs/ext3/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files
> +++ a/fs/ext3/ioctl.c
> @@ -60,7 +60,8 @@ int ext3_ioctl (struct inode * inode, st
> /* Is it quota file? Do not allow user to mess with it */
> if (IS_NOQUOTA(inode)) {
> mutex_unlock(&inode->i_mutex);
> - return -EPERM;
> + err = -EPERM;
> + goto flags_out;
> }
> oldflags = ei->i_flags;
>
> diff -puN fs/ext4/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files fs/ext4/ioctl.c
> --- a/fs/ext4/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files
> +++ a/fs/ext4/ioctl.c
> @@ -59,7 +59,8 @@ int ext4_ioctl (struct inode * inode, st
> /* Is it quota file? Do not allow user to mess with it */
> if (IS_NOQUOTA(inode)) {
> mutex_unlock(&inode->i_mutex);
> - return -EPERM;
> + err = -EPERM;
> + goto flags_out;
> }
> oldflags = ei->i_flags;
>
> diff -puN fs/jfs/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files fs/jfs/ioctl.c
> --- a/fs/jfs/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files
> +++ a/fs/jfs/ioctl.c
> @@ -85,8 +85,10 @@ int jfs_ioctl(struct inode * inode, stru
> flags &= ~JFS_DIRSYNC_FL;
>
> /* Is it quota file? Do not allow user to mess with it */
> - if (IS_NOQUOTA(inode))
> - return -EPERM;
> + if (IS_NOQUOTA(inode)) {
> + err = -EPERM;
> + goto setflags_out;
> + }
> jfs_get_inode_flags(jfs_inode);
> oldflags = jfs_inode->mode2;
>
> diff -puN fs/reiserfs/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls-vs-forbid-user-to-change-file-flags-on-quota-files fs/reiserfs/ioctl.c
> _
Why there's no modification for reiserfs?
Honza
--
Jan Kara <[email protected]>
SuSE CR Labs
On Tue, 6 Nov 2007 10:01:02 +0100 Jan Kara <[email protected]> wrote:
> Why there's no modification for reiserfs?
I fixed that in the base patch while repairing some rejects.
On Thu, Nov 01, 2007 at 04:08:32PM -0700, Dave Hansen wrote:
>
> open_namei() no longer touches namei's. rename it
> to something more appropriate: open_pathname().
The name is quite non-descriptive. What about just leaving it as
filp_open and merging this into the previous patches to avoid
the spurious renaming?
And btw, exported functions that get half-rewritten should always
grow a kerneldoc comment. Then again this particular one really
shouldn't be exported, but that's for another series.
On Thu, 01 Nov 2007 16:08:31 -0700 Dave Hansen <[email protected]> wrote:
> Replace all callers with open_namei() directly, and move the
> nameidata stack allocation into open_namei().
ftp://ftp.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-loop.patch
is using filp_open() and hence doesn't work very well.
A shall revert dm-loop.patch and run away.
On Wed, 2008-01-16 at 00:52 -0800, Andrew Morton wrote:
> On Thu, 01 Nov 2007 16:08:31 -0700 Dave Hansen <[email protected]> wrote:
> > Replace all callers with open_namei() directly, and move the
> > nameidata stack allocation into open_namei().
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-loop.patch
> is using filp_open() and hence doesn't work very well.
>
> A shall revert dm-loop.patch and run away.
This one's pretty easy, thank goodness. Just replace filp_open() with
open_namei():
/filp_open(/open_namei(AT_FDCWD, /
BTW, why do we need this on top of the existing loopback driver? Can
they really share no code?
linux-2.6.git-dave/drivers/md/dm-loop.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff -puN drivers/md/dm-loop.c~fix-dm-loop drivers/md/dm-loop.c
--- linux-2.6.git/drivers/md/dm-loop.c~fix-dm-loop 2008-01-16 09:02:06.000000000 -0800
+++ linux-2.6.git-dave/drivers/md/dm-loop.c 2008-01-16 09:02:06.000000000 -0800
@@ -757,7 +757,7 @@ static int loop_get_file(struct dm_targe
int r = 0;
ti->error = "could not open backing file";
- filp = filp_open(lc->path, flags, 0);
+ filp = open_namei(AT_FDCWD, lc->path, flags, 0);
if (IS_ERR(filp))
return PTR_ERR(filp);
lc->filp = filp;
_
-- Dave
On Wed, Jan 16, 2008 at 09:04:56AM -0800, Dave Hansen wrote:
> > is using filp_open() and hence doesn't work very well.
> >
> > A shall revert dm-loop.patch and run away.
>
> This one's pretty easy, thank goodness. Just replace filp_open() with
> open_namei():
>
> /filp_open(/open_namei(AT_FDCWD, /
But ultimatively I think we should remove this silly renaming from
the patch. It doesn't help the goal and just created churn, so please
rename open_namei back to do_filp_open for now and put filp_open back.
> BTW, why do we need this on top of the existing loopback driver? Can
> they really share no code?
Yeah, given how much pain loop drivers in general are I don't think it's
a good idea to add a second one.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Dave Hansen wrote:
> On Wed, 2008-01-16 at 00:52 -0800, Andrew Morton wrote:
>> On Thu, 01 Nov 2007 16:08:31 -0700 Dave Hansen <[email protected]> wrote:
>>> Replace all callers with open_namei() directly, and move the
>>> nameidata stack allocation into open_namei().
>> ftp://ftp.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-loop.patch
>> is using filp_open() and hence doesn't work very well.
>>
>> A shall revert dm-loop.patch and run away.
>
> This one's pretty easy, thank goodness. Just replace filp_open() with
> open_namei():
>
> /filp_open(/open_namei(AT_FDCWD, /
>
> BTW, why do we need this on top of the existing loopback driver? Can
> they really share no code?
The current dm-loop patch was really written to demonstrate the value of
block-mapping as an alternative to doing file backed I/O via the pagecache.
Personally, I would like to see a single "loopback block devices" driver
in the kernel with loop.c and dm-loop as alternate interfaces to it.
There seems to be some demand for this, e.g.:
http://bugzilla.kernel.org/show_bug.cgi?id=5333
And numerous similar posts to dm-devel (although I think obsoleting may
be going a bit far to start with!).
Regards,
Bryn.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org
iD8DBQFHjjsA6YSQoMYUY94RAv+fAJ90mbqDQ/17nd3j9OAIXXc8Y4SRbQCfR75I
KtDxHXDnYmCmBlw/lOzGmms=
=A+Vb
-----END PGP SIGNATURE-----
On Wed, 2008-01-16 at 17:10 +0000, Christoph Hellwig wrote:
>
> But ultimatively I think we should remove this silly renaming from
> the patch. It doesn't help the goal and just created churn, so please
> rename open_namei back to do_filp_open for now and put filp_open back.
It wasn't really a rename. More that I ended up moving everything
useful the filp_open() did to other functions, so I didn't see the need
to keep around a stub like this:
struct file *filp_open(const char *filename, int flags, int mode)
{
return open_namei(AT_FDCWD, filename, flags, mode);
}
It would be trivial to add this stub back, though.
-- Dave
On Wed, Jan 16, 2008 at 09:41:12AM -0800, Dave Hansen wrote:
> On Wed, 2008-01-16 at 17:10 +0000, Christoph Hellwig wrote:
> >
> > But ultimatively I think we should remove this silly renaming from
> > the patch. It doesn't help the goal and just created churn, so please
> > rename open_namei back to do_filp_open for now and put filp_open back.
>
> It wasn't really a rename. More that I ended up moving everything
> useful the filp_open() did to other functions, so I didn't see the need
> to keep around a stub like this:
>
> struct file *filp_open(const char *filename, int flags, int mode)
> {
> return open_namei(AT_FDCWD, filename, flags, mode);
> }
>
> It would be trivial to add this stub back, though.
Well, it was a name from the caller point of view :) You basically
merged do_filp_open and open_namei. Given that open_namei was an
internal helper only used by do_filp_open it seems quite sensible to
keep the existing do_filp_open user interface and call the merged
function do_filp_open. Especially as filp_open is a quite descriptive
name of what we're doing here - open the pathname and return a file
pointer for it.