2007-10-10 16:36:53

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 1/7] init_file(): only take writes on normal files



---

lxc-dave/fs/file_table.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff -puN fs/file_table.c~init_file-only-take-writes-on-normal-files fs/file_table.c
--- lxc/fs/file_table.c~init_file-only-take-writes-on-normal-files 2007-10-04 13:01:59.000000000 -0700
+++ lxc-dave/fs/file_table.c 2007-10-04 13:03:03.000000000 -0700
@@ -199,7 +199,12 @@ int init_file(struct file *file, struct
file->f_mapping = dentry->d_inode->i_mapping;
file->f_mode = mode;
file->f_op = fop;
- if (mode & FMODE_WRITE) {
+ /*
+ * These mounts don't really matter in practice
+ * for r/o bind mounts. They aren't userspace-
+ * visible. We do this for consistency.
+ */
+ if ((mode & FMODE_WRITE) && !special_file(dentry->d_inode->i_mode)) {
file->f_mnt_write_state = FILE_MNT_WRITE_TAKEN;
error = mnt_want_write(mnt);
WARN_ON(error);
_


2007-10-10 16:35:00

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 3/7] do namei_flags calculation inside open_namei()


My end goal here is to make sure all users of may_open()
return filps. This will ensure that we properly release
mount write counts which were taken for the filp in
may_open().

This patch moves the sys_open flags to namei flags
calculation into fs/namei.c. We'll shortly be moving
the nameidata_to_filp() calls into namei.c, and this
gets the sys_open flags to a place where we can get
at them when we need them.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namei.c | 43 ++++++++++++++++++++++++++++++++++---------
lxc-dave/fs/open.c | 22 ++--------------------
2 files changed, 36 insertions(+), 29 deletions(-)

diff -puN fs/namei.c~do-namei_flags-calculation-inside-open_namei fs/namei.c
--- lxc/fs/namei.c~do-namei_flags-calculation-inside-open_namei 2007-10-04 13:13:00.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-10-04 13:13:00.000000000 -0700
@@ -1672,7 +1672,12 @@ int may_open(struct nameidata *nd, int a
return 0;
}

-static int open_namei_create(struct nameidata *nd, struct path *path,
+/*
+ * Be careful about ever adding any more callers of this
+ * function. Its flags must be in the namei format, not
+ * what get passed to sys_open().
+ */
+static int __open_namei_create(struct nameidata *nd, struct path *path,
int flag, int mode)
{
int error;
@@ -1691,26 +1696,46 @@ static int open_namei_create(struct name
}

/*
+ * Note that while the flag value (low two bits) for sys_open means:
+ * 00 - read-only
+ * 01 - write-only
+ * 10 - read-write
+ * 11 - special
+ * it is changed into
+ * 00 - no permissions needed
+ * 01 - read-permission
+ * 10 - write-permission
+ * 11 - read-write
+ * for the internal routines (ie open_namei()/follow_link() etc)
+ * This is more logical, and also allows the 00 "no perm needed"
+ * to be used for symlinks (where the permissions are checked
+ * later).
+ *
+*/
+static inline int sys_open_flags_to_namei_flags(int flag)
+{
+ if ((flag+1) & O_ACCMODE)
+ flag++;
+ return flag;
+}
+
+/*
* open_namei()
*
* namei for open - this is in fact almost the whole open-routine.
*
* Note that the low bits of "flag" aren't the same as in the open
- * system call - they are 00 - no permissions needed
- * 01 - read permission needed
- * 10 - write permission needed
- * 11 - read/write permissions needed
- * which is a lot more logical, and also allows the "no perm" needed
- * for symlinks (where the permissions are checked later).
+ * system call. See sys_open_flags_to_namei_flags().
* SMP-safe
*/
-int open_namei(int dfd, const char *pathname, int flag,
+int open_namei(int dfd, const char *pathname, int sys_open_flag,
int mode, struct nameidata *nd)
{
int acc_mode, error;
struct path path;
struct dentry *dir;
int count = 0;
+ int flag = sys_open_flags_to_namei_flags(sys_open_flag);

acc_mode = ACC_MODE(flag);

@@ -1770,7 +1795,7 @@ do_last:

/* Negative dentry, just create the file */
if (!path.dentry->d_inode) {
- error = open_namei_create(nd, &path, flag, mode);
+ error = __open_namei_create(nd, &path, flag, mode);
if (error)
goto exit;
return 0;
diff -puN fs/open.c~do-namei_flags-calculation-inside-open_namei fs/open.c
--- lxc/fs/open.c~do-namei_flags-calculation-inside-open_namei 2007-10-04 13:13:00.000000000 -0700
+++ lxc-dave/fs/open.c 2007-10-04 13:13:00.000000000 -0700
@@ -863,31 +863,13 @@ cleanup_file:
return ERR_PTR(error);
}

-/*
- * Note that while the flag value (low two bits) for sys_open means:
- * 00 - read-only
- * 01 - write-only
- * 10 - read-write
- * 11 - special
- * it is changed into
- * 00 - no permissions needed
- * 01 - read-permission
- * 10 - write-permission
- * 11 - read-write
- * for the internal routines (ie open_namei()/follow_link() etc). 00 is
- * used by symlinks.
- */
static struct file *do_filp_open(int dfd, const char *filename, int flags,
int mode)
{
- int namei_flags, error;
+ int error;
struct nameidata nd;

- namei_flags = flags;
- if ((namei_flags+1) & O_ACCMODE)
- namei_flags++;
-
- error = open_namei(dfd, filename, namei_flags, mode, &nd);
+ error = open_namei(dfd, filename, flags, mode, &nd);
if (!error)
return nameidata_to_filp(&nd, flags);

_

2007-10-10 16:35:25

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 4/7] make open_namei() return a filp


If open_namei() succeeds, there is potentially a mnt_want_write()
that needs to get balanced. If the caller doesn't create a
'struct file' and eventually __fput() it, or manually drop the
write count on an error, we have a bug.

Forcing open_namei() to return a filp fixes this. Any caller
getting a 'struct file' back must consider that filp instantiated
and fput() it normally. The callers no longer have to worry about
ever manually releasing a mnt write count.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namei.c | 16 ++++++++--------
lxc-dave/fs/open.c | 7 +------
lxc-dave/include/linux/fs.h | 2 +-
3 files changed, 10 insertions(+), 15 deletions(-)

diff -puN fs/namei.c~make-open_namei-return-a-filp fs/namei.c
--- lxc/fs/namei.c~make-open_namei-return-a-filp 2007-10-03 09:01:45.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-10-03 09:01:45.000000000 -0700
@@ -1728,8 +1728,8 @@ static inline int sys_open_flags_to_name
* system call. See sys_open_flags_to_namei_flags().
* SMP-safe
*/
-int open_namei(int dfd, const char *pathname, int sys_open_flag,
- int mode, struct nameidata *nd)
+struct file *open_namei(int dfd, const char *pathname, int sys_open_flag,
+ int mode, struct nameidata *nd)
{
int acc_mode, error;
struct path path;
@@ -1755,7 +1755,7 @@ int open_namei(int dfd, const char *path
error = path_lookup_open(dfd, pathname, lookup_flags(flag),
nd, flag);
if (error)
- return error;
+ return ERR_PTR(error);
goto ok;
}

@@ -1764,7 +1764,7 @@ int open_namei(int dfd, const char *path
*/
error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
if (error)
- return error;
+ return ERR_PTR(error);

/*
* We have the parent and last component. First of all, check
@@ -1798,7 +1798,7 @@ do_last:
error = __open_namei_create(nd, &path, flag, mode);
if (error)
goto exit;
- return 0;
+ return nameidata_to_filp(nd, sys_open_flag);
}

/*
@@ -1831,7 +1831,7 @@ ok:
error = may_open(nd, acc_mode, flag);
if (error)
goto exit;
- return 0;
+ return nameidata_to_filp(nd, sys_open_flag);

exit_mutex_unlock:
mutex_unlock(&dir->d_inode->i_mutex);
@@ -1841,7 +1841,7 @@ exit:
if (!IS_ERR(nd->intent.open.file))
release_open_intent(nd);
path_release(nd);
- return error;
+ return ERR_PTR(error);

do_link:
error = -ELOOP;
@@ -1868,7 +1868,7 @@ do_link:
* with "intent.open".
*/
release_open_intent(nd);
- return error;
+ return ERR_PTR(error);
}
nd->flags &= ~LOOKUP_PARENT;
if (nd->last_type == LAST_BIND)
diff -puN fs/open.c~make-open_namei-return-a-filp fs/open.c
--- lxc/fs/open.c~make-open_namei-return-a-filp 2007-10-03 09:01:45.000000000 -0700
+++ lxc-dave/fs/open.c 2007-10-03 09:01:45.000000000 -0700
@@ -846,14 +846,9 @@ cleanup_file:
static struct file *do_filp_open(int dfd, const char *filename, int flags,
int mode)
{
- int error;
struct nameidata nd;

- error = open_namei(dfd, filename, flags, mode, &nd);
- if (!error)
- return nameidata_to_filp(&nd, flags);
-
- return ERR_PTR(error);
+ return open_namei(dfd, filename, flags, mode, &nd);
}

struct file *filp_open(const char *filename, int flags, int mode)
diff -puN include/linux/fs.h~make-open_namei-return-a-filp include/linux/fs.h
--- lxc/include/linux/fs.h~make-open_namei-return-a-filp 2007-10-03 09:01:45.000000000 -0700
+++ lxc-dave/include/linux/fs.h 2007-10-03 09:01:45.000000000 -0700
@@ -1721,7 +1721,7 @@ extern struct file *create_read_pipe(str
extern struct file *create_write_pipe(void);
extern void free_write_pipe(struct file *);

-extern int open_namei(int dfd, const char *, int, int, struct nameidata *);
+extern struct file *open_namei(int dfd, const char *, int, int, struct nameidata *);
extern int may_open(struct nameidata *, int, int);

extern int kernel_read(struct file *, unsigned long, char *, unsigned long);
_

2007-10-10 16:35:43

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 6/7] kill filp_open()


Replace all callers with open_namei() directly, and move the
nameidata stack allocation into open_namei().

Does it make sense to still call it open_namei(), even though it
doesn't actually take a nameidata any more?

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/drivers/usb/gadget/file_storage.c | 5 +-
lxc-dave/fs/exec.c | 2
lxc-dave/fs/namei.c | 69 ++++++++++++++---------------
lxc-dave/fs/open.c | 6 --
lxc-dave/fs/reiserfs/journal.c | 2
lxc-dave/include/linux/fs.h | 2
lxc-dave/kernel/acct.c | 2
lxc-dave/mm/swapfile.c | 4 -
lxc-dave/sound/sound_firmware.c | 2
9 files changed, 47 insertions(+), 47 deletions(-)

diff -puN drivers/usb/gadget/file_storage.c~kill-filp_open drivers/usb/gadget/file_storage.c
--- lxc/drivers/usb/gadget/file_storage.c~kill-filp_open 2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/drivers/usb/gadget/file_storage.c 2007-10-03 09:01:47.000000000 -0700
@@ -3468,16 +3468,17 @@ static int open_backing_file(struct lun
struct inode *inode = NULL;
loff_t size;
loff_t num_sectors;
+ int mode = O_LARGEFILE;

/* R/W if we can, R/O if we must */
ro = curlun->ro;
if (!ro) {
- filp = filp_open(filename, O_RDWR | O_LARGEFILE, 0);
+ filp = open_namei(AT_FDCWD, filename, O_RDWR | mode, 0);
if (-EROFS == PTR_ERR(filp))
ro = 1;
}
if (ro)
- filp = filp_open(filename, O_RDONLY | O_LARGEFILE, 0);
+ filp = open_namei(AT_FDCWD, filename, O_RDONLY | mode, 0);
if (IS_ERR(filp)) {
LINFO(curlun, "unable to open backing file: %s\n", filename);
return PTR_ERR(filp);
diff -puN fs/exec.c~kill-filp_open fs/exec.c
--- lxc/fs/exec.c~kill-filp_open 2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/fs/exec.c 2007-10-03 09:01:47.000000000 -0700
@@ -1764,7 +1764,7 @@ int do_coredump(long signr, int exit_cod
goto fail_unlock;
}
} else
- file = filp_open(corename,
+ file = open_namei(AT_FDCWD, corename,
O_CREAT | 2 | O_NOFOLLOW | O_LARGEFILE | flag,
0600);
if (IS_ERR(file))
diff -puN fs/namei.c~kill-filp_open fs/namei.c
--- lxc/fs/namei.c~kill-filp_open 2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-10-03 09:01:47.000000000 -0700
@@ -1729,8 +1729,9 @@ static inline int sys_open_flags_to_name
* SMP-safe
*/
struct file *open_namei(int dfd, const char *pathname, int sys_open_flag,
- int mode, struct nameidata *nd)
+ int mode)
{
+ struct nameidata nd;
int acc_mode, error;
struct path path;
struct dentry *dir;
@@ -1753,7 +1754,7 @@ struct file *open_namei(int dfd, const c
*/
if (!(flag & O_CREAT)) {
error = path_lookup_open(dfd, pathname, lookup_flags(flag),
- nd, flag);
+ &nd, flag);
if (error)
return ERR_PTR(error);
goto ok;
@@ -1762,7 +1763,7 @@ struct file *open_namei(int dfd, const c
/*
* Create - we need to know the parent.
*/
- error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,nd,flag,mode);
+ error = path_lookup_create(dfd,pathname,LOOKUP_PARENT,&nd,flag,mode);
if (error)
return ERR_PTR(error);

@@ -1772,14 +1773,14 @@ struct file *open_namei(int dfd, const c
* will not do.
*/
error = -EISDIR;
- if (nd->last_type != LAST_NORM || nd->last.name[nd->last.len])
+ if (nd.last_type != LAST_NORM || nd.last.name[nd.last.len])
goto exit;

- dir = nd->dentry;
- nd->flags &= ~LOOKUP_PARENT;
+ dir = nd.dentry;
+ nd.flags &= ~LOOKUP_PARENT;
mutex_lock(&dir->d_inode->i_mutex);
- path.dentry = lookup_hash(nd);
- path.mnt = nd->mnt;
+ path.dentry = lookup_hash(&nd);
+ path.mnt = nd.mnt;

do_last:
error = PTR_ERR(path.dentry);
@@ -1788,17 +1789,17 @@ do_last:
goto exit;
}

- if (IS_ERR(nd->intent.open.file)) {
- error = PTR_ERR(nd->intent.open.file);
+ if (IS_ERR(nd.intent.open.file)) {
+ error = PTR_ERR(nd.intent.open.file);
goto exit_mutex_unlock;
}

/* Negative dentry, just create the file */
if (!path.dentry->d_inode) {
- error = __open_namei_create(nd, &path, flag, mode);
+ error = __open_namei_create(&nd, &path, flag, mode);
if (error)
goto exit;
- return nameidata_to_filp(nd, sys_open_flag);
+ return nameidata_to_filp(&nd, sys_open_flag);
}

/*
@@ -1823,24 +1824,24 @@ do_last:
if (path.dentry->d_inode->i_op && path.dentry->d_inode->i_op->follow_link)
goto do_link;

- path_to_nameidata(&path, nd);
+ path_to_nameidata(&path, &nd);
error = -EISDIR;
if (path.dentry->d_inode && S_ISDIR(path.dentry->d_inode->i_mode))
goto exit;
ok:
- error = may_open(nd, acc_mode, flag);
+ error = may_open(&nd, acc_mode, flag);
if (error)
goto exit;
- return nameidata_to_filp(nd, sys_open_flag);
+ return nameidata_to_filp(&nd, sys_open_flag);

exit_mutex_unlock:
mutex_unlock(&dir->d_inode->i_mutex);
exit_dput:
- dput_path(&path, nd);
+ dput_path(&path, &nd);
exit:
- if (!IS_ERR(nd->intent.open.file))
- release_open_intent(nd);
- path_release(nd);
+ if (!IS_ERR(&nd.intent.open.file))
+ release_open_intent(&nd);
+ path_release(&nd);
return ERR_PTR(error);

do_link:
@@ -1854,42 +1855,42 @@ do_link:
* After that we have the parent and last component, i.e.
* we are in the same situation as after the first path_walk().
* Well, almost - if the last component is normal we get its copy
- * stored in nd->last.name and we will have to putname() it when we
+ * stored in nd.last.name and we will have to putname() it when we
* are done. Procfs-like symlinks just set LAST_BIND.
*/
- nd->flags |= LOOKUP_PARENT;
- error = security_inode_follow_link(path.dentry, nd);
+ nd.flags |= LOOKUP_PARENT;
+ error = security_inode_follow_link(path.dentry, &nd);
if (error)
goto exit_dput;
- error = __do_follow_link(&path, nd);
+ error = __do_follow_link(&path, &nd);
if (error) {
/* Does someone understand code flow here? Or it is only
* me so stupid? Anathema to whoever designed this non-sense
* with "intent.open".
*/
- release_open_intent(nd);
+ release_open_intent(&nd);
return ERR_PTR(error);
}
- nd->flags &= ~LOOKUP_PARENT;
- if (nd->last_type == LAST_BIND)
+ nd.flags &= ~LOOKUP_PARENT;
+ if (nd.last_type == LAST_BIND)
goto ok;
error = -EISDIR;
- if (nd->last_type != LAST_NORM)
+ if (nd.last_type != LAST_NORM)
goto exit;
- if (nd->last.name[nd->last.len]) {
- __putname(nd->last.name);
+ if (nd.last.name[nd.last.len]) {
+ __putname(nd.last.name);
goto exit;
}
error = -ELOOP;
if (count++==32) {
- __putname(nd->last.name);
+ __putname(nd.last.name);
goto exit;
}
- dir = nd->dentry;
+ dir = nd.dentry;
mutex_lock(&dir->d_inode->i_mutex);
- path.dentry = lookup_hash(nd);
- path.mnt = nd->mnt;
- __putname(nd->last.name);
+ path.dentry = lookup_hash(&nd);
+ path.mnt = nd.mnt;
+ __putname(nd.last.name);
goto do_last;
}

diff -puN fs/open.c~kill-filp_open fs/open.c
--- lxc/fs/open.c~kill-filp_open 2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/fs/open.c 2007-10-03 09:01:47.000000000 -0700
@@ -845,8 +845,7 @@ cleanup_file:

struct file *filp_open(const char *filename, int flags, int mode)
{
- struct nameidata nd;
- return open_namei(AT_FDCWD, filename, flags, mode, &nd);
+ return open_namei(AT_FDCWD, filename, flags, mode);
}
EXPORT_SYMBOL(filp_open);

@@ -1050,8 +1049,7 @@ long do_sys_open(int dfd, const char __u

fd = get_unused_fd_flags(flags);
if (fd >= 0) {
- struct nameidata nd;
- struct file *f = open_namei(dfd, tmp, flags, mode, &nd);
+ struct file *f = open_namei(dfd, tmp, flags, mode);

if (IS_ERR(f)) {
put_unused_fd(fd);
diff -puN fs/reiserfs/journal.c~kill-filp_open fs/reiserfs/journal.c
--- lxc/fs/reiserfs/journal.c~kill-filp_open 2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/fs/reiserfs/journal.c 2007-10-03 09:01:47.000000000 -0700
@@ -2623,7 +2623,7 @@ static int journal_init_dev(struct super
return 0;
}

- journal->j_dev_file = filp_open(jdev_name, 0, 0);
+ journal->j_dev_file = open_namei(AT_FDCWD, jdev_name, 0, 0);
if (!IS_ERR(journal->j_dev_file)) {
struct inode *jdev_inode = journal->j_dev_file->f_mapping->host;
if (!S_ISBLK(jdev_inode->i_mode)) {
diff -puN include/linux/fs.h~kill-filp_open include/linux/fs.h
--- lxc/include/linux/fs.h~kill-filp_open 2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/include/linux/fs.h 2007-10-03 09:01:47.000000000 -0700
@@ -1721,7 +1721,7 @@ extern struct file *create_read_pipe(str
extern struct file *create_write_pipe(void);
extern void free_write_pipe(struct file *);

-extern struct file *open_namei(int dfd, const char *, int, int, struct nameidata *);
+extern struct file *open_namei(int dfd, const char *, int, int);
extern int may_open(struct nameidata *, int, int);

extern int kernel_read(struct file *, unsigned long, char *, unsigned long);
diff -puN kernel/acct.c~kill-filp_open kernel/acct.c
--- lxc/kernel/acct.c~kill-filp_open 2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/kernel/acct.c 2007-10-03 09:01:47.000000000 -0700
@@ -208,7 +208,7 @@ static int acct_on(char *name)
int error;

/* Difference from BSD - they don't do O_APPEND */
- file = filp_open(name, O_WRONLY|O_APPEND|O_LARGEFILE, 0);
+ file = open_namei(AT_FDCWD, name, O_WRONLY|O_APPEND|O_LARGEFILE, 0);
if (IS_ERR(file))
return PTR_ERR(file);

diff -puN mm/swapfile.c~kill-filp_open mm/swapfile.c
--- lxc/mm/swapfile.c~kill-filp_open 2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/mm/swapfile.c 2007-10-03 09:01:47.000000000 -0700
@@ -1198,7 +1198,7 @@ asmlinkage long sys_swapoff(const char _
if (IS_ERR(pathname))
goto out;

- victim = filp_open(pathname, O_RDWR|O_LARGEFILE, 0);
+ victim = open_namei(AT_FDCWD, pathname, O_RDWR|O_LARGEFILE, 0);
putname(pathname);
err = PTR_ERR(victim);
if (IS_ERR(victim))
@@ -1477,7 +1477,7 @@ asmlinkage long sys_swapon(const char __
name = NULL;
goto bad_swap_2;
}
- swap_file = filp_open(name, O_RDWR|O_LARGEFILE, 0);
+ swap_file = open_namei(AT_FDCWD, name, O_RDWR|O_LARGEFILE, 0);
error = PTR_ERR(swap_file);
if (IS_ERR(swap_file)) {
swap_file = NULL;
diff -puN sound/sound_firmware.c~kill-filp_open sound/sound_firmware.c
--- lxc/sound/sound_firmware.c~kill-filp_open 2007-10-03 09:01:47.000000000 -0700
+++ lxc-dave/sound/sound_firmware.c 2007-10-03 09:01:47.000000000 -0700
@@ -14,7 +14,7 @@ static int do_mod_firmware_load(const ch
char *dp;
loff_t pos;

- filp = filp_open(fn, 0, 0);
+ filp = open_namei(AT_FDCWD, fn, 0, 0);
if (IS_ERR(filp))
{
printk(KERN_INFO "Unable to load '%s'.\n", fn);
_

2007-10-10 16:35:58

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 5/7] kill do_filp_open()


This kills off the almost empty do_filp_open(). The indenting
change in do_sys_open() is because we would have gone over our
80 characters otherwise.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/open.c | 39 ++++++++++++++++++---------------------
1 file changed, 18 insertions(+), 21 deletions(-)

diff -puN fs/open.c~kill-do_filp_open fs/open.c
--- lxc/fs/open.c~kill-do_filp_open 2007-10-04 13:59:44.000000000 -0700
+++ lxc-dave/fs/open.c 2007-10-04 13:59:44.000000000 -0700
@@ -863,17 +863,10 @@ cleanup_file:
return ERR_PTR(error);
}

-static struct file *do_filp_open(int dfd, const char *filename, int flags,
- int mode)
-{
- struct nameidata nd;
-
- return open_namei(dfd, filename, flags, mode, &nd);
-}
-
struct file *filp_open(const char *filename, int flags, int mode)
{
- return do_filp_open(AT_FDCWD, filename, flags, mode);
+ struct nameidata nd;
+ return open_namei(AT_FDCWD, filename, flags, mode, &nd);
}
EXPORT_SYMBOL(filp_open);

@@ -1072,20 +1065,24 @@ long do_sys_open(int dfd, const char __u
char *tmp = getname(filename);
int fd = PTR_ERR(tmp);

- if (!IS_ERR(tmp)) {
- fd = get_unused_fd_flags(flags);
- if (fd >= 0) {
- struct file *f = do_filp_open(dfd, tmp, flags, mode);
- if (IS_ERR(f)) {
- put_unused_fd(fd);
- fd = PTR_ERR(f);
- } else {
- fsnotify_open(f->f_path.dentry);
- fd_install(fd, f);
- }
+ if (IS_ERR(tmp))
+ goto out;
+
+ fd = get_unused_fd_flags(flags);
+ if (fd >= 0) {
+ struct nameidata nd;
+ struct file *f = open_namei(dfd, tmp, flags, mode, &nd);
+
+ if (IS_ERR(f)) {
+ put_unused_fd(fd);
+ fd = PTR_ERR(f);
+ } else {
+ fsnotify_open(f->f_path.dentry);
+ fd_install(fd, f);
}
- putname(tmp);
}
+ putname(tmp);
+out:
return fd;
}

_

2007-10-10 16:36:22

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 7/7] keep track of mnt_writer state of struct file


There have been a few oopses caused by 'struct file's with
NULL f_vfsmnts. There was also a set of potentially missed
mnt_want_write()s from dentry_open() calls.

This patch provides a very simple debugging framework to
catch these kinds of bugs. It will WARN_ON() them, but
should stop us from having any oopses or mnt_writer
count imbalances.


Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/file_table.c | 21 +++++++++++++++++++--
lxc-dave/fs/open.c | 2 ++
lxc-dave/include/linux/fs.h | 4 ++++
3 files changed, 25 insertions(+), 2 deletions(-)

diff -puN fs/file_table.c~keep-track-of-mnt_writer-state-of-struct-file fs/file_table.c
--- lxc/fs/file_table.c~keep-track-of-mnt_writer-state-of-struct-file 2007-10-03 14:40:14.000000000 -0700
+++ lxc-dave/fs/file_table.c 2007-10-03 14:40:14.000000000 -0700
@@ -42,6 +42,12 @@ static inline void file_free_rcu(struct
static inline void file_free(struct file *f)
{
percpu_counter_dec(&nr_files);
+ /*
+ * At this point, either both or neither of these bits
+ * should be set.
+ */
+ WARN_ON(f->f_mnt_write_state == FILE_MNT_WRITE_TAKEN);
+ WARN_ON(f->f_mnt_write_state == FILE_MNT_WRITE_RELEASED);
call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
}

@@ -194,6 +200,7 @@ int init_file(struct file *file, struct
file->f_mode = mode;
file->f_op = fop;
if (mode & FMODE_WRITE) {
+ file->f_mnt_write_state = FILE_MNT_WRITE_TAKEN;
error = mnt_want_write(mnt);
WARN_ON(error);
}
@@ -236,8 +243,18 @@ void fastcall __fput(struct file *file)
fops_put(file->f_op);
if (file->f_mode & FMODE_WRITE) {
put_write_access(inode);
- if (!special_file(inode->i_mode))
- mnt_drop_write(mnt);
+ if (!special_file(inode->i_mode)) {
+ if (file->f_mnt_write_state == FILE_MNT_WRITE_TAKEN) {
+ mnt_drop_write(mnt);
+ file->f_mnt_write_state |=
+ FILE_MNT_WRITE_RELEASED;
+ } else {
+ printk(KERN_WARNING "__fput() of writeable "
+ "file with no "
+ "mnt_want_write()\n");
+ WARN_ON(1);
+ }
+ }
}
put_pid(file->f_owner.pid);
file_kill(file);
diff -puN fs/namei.c~keep-track-of-mnt_writer-state-of-struct-file fs/namei.c
diff -puN include/linux/fs.h~keep-track-of-mnt_writer-state-of-struct-file include/linux/fs.h
--- lxc/include/linux/fs.h~keep-track-of-mnt_writer-state-of-struct-file 2007-10-03 14:40:14.000000000 -0700
+++ lxc-dave/include/linux/fs.h 2007-10-03 14:40:14.000000000 -0700
@@ -779,6 +779,9 @@ static inline int ra_has_index(struct fi
index < ra->start + ra->size);
}

+#define FILE_MNT_WRITE_TAKEN 1
+#define FILE_MNT_WRITE_RELEASED 2
+
struct file {
/*
* fu_list becomes invalid after file_free is called and queued via
@@ -813,6 +816,7 @@ struct file {
spinlock_t f_ep_lock;
#endif /* #ifdef CONFIG_EPOLL */
struct address_space *f_mapping;
+ unsigned long f_mnt_write_state;
};
extern spinlock_t files_lock;
#define file_list_lock() spin_lock(&files_lock);
diff -puN fs/open.c~keep-track-of-mnt_writer-state-of-struct-file fs/open.c
--- lxc/fs/open.c~keep-track-of-mnt_writer-state-of-struct-file 2007-10-03 14:40:14.000000000 -0700
+++ lxc-dave/fs/open.c 2007-10-03 14:42:01.000000000 -0700
@@ -790,6 +790,8 @@ static struct file *__dentry_open(struct
mnt_drop_write(mnt);
goto cleanup_file;
}
+ WARN_ON(f->f_mnt_write_state != 0);
+ f->f_mnt_write_state = FILE_MNT_WRITE_TAKEN;
}

f->f_mapping = inode->i_mapping;
_

2007-10-10 16:36:35

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 2/7] get mount write in __dentry_open()


The first patch fixes an actual bug. I think the
reset will reduce the chance for any future bugs
to creep in.

--

The r/o bind mount patches require matching mnt_want_write()
at filp creation time with a mnt_drop_write() at __fput().

We used to do this in may_open(), but Miklos pointed out
that __dentry_open() is used as well to create filps. We
don't currently do mnt_want_write() for these.

If a filp on a writeable file is created this way, and
destroyed via __fput() we'll get a mount count imbalance.

This patch moves the mount write count acquisition from
may_open() into __dentry_open(), where we should catch
many more of the users.

Signed-off-by: Dave Hansen <[email protected]>
---

lxc-dave/fs/namei.c | 12 ------------
lxc-dave/fs/open.c | 45 ++++++++++++++++++++++++++++++++++++++-------
2 files changed, 38 insertions(+), 19 deletions(-)

diff -puN fs/namei.c~get-write-in-__dentry_open fs/namei.c
--- lxc/fs/namei.c~get-write-in-__dentry_open 2007-10-03 14:44:52.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-10-04 18:02:48.000000000 -0700
@@ -1621,14 +1621,6 @@ int may_open(struct nameidata *nd, int a
return -EACCES;

flag &= ~O_TRUNC;
- } else if (flag & FMODE_WRITE) {
- /*
- * effectively: !special_file()
- * balanced by __fput()
- */
- error = mnt_want_write(nd->mnt);
- if (error)
- return error;
}

error = vfs_permission(nd, acc_mode);
@@ -1778,11 +1770,7 @@ do_last:

/* Negative dentry, just create the file */
if (!path.dentry->d_inode) {
- error = mnt_want_write(nd->mnt);
- if (error)
- goto exit_mutex_unlock;
error = open_namei_create(nd, &path, flag, mode);
- mnt_drop_write(nd->mnt);
if (error)
goto exit;
return 0;
diff -puN fs/open.c~get-write-in-__dentry_open fs/open.c
--- lxc/fs/open.c~get-write-in-__dentry_open 2007-10-03 14:44:52.000000000 -0700
+++ lxc-dave/fs/open.c 2007-10-04 18:02:48.000000000 -0700
@@ -766,22 +766,51 @@ out:
return error;
}

+/*
+ * You have to be very careful that these write
+ * counts get cleaned up in error cases and
+ * upon __fput(). This should probably never
+ * be called outside of __dentry_open().
+ */
+static inline int __get_file_write_access(struct inode *inode,
+ struct vfsmount *mnt)
+{
+ int error;
+ error = get_write_access(inode);
+ if (error)
+ return error;
+ /*
+ * Do not take mount writer counts on
+ * special files since no writes to
+ * the mount itself will occur.
+ */
+ if (special_file(inode->i_mode))
+ return 0;
+
+ /*
+ * Balanced in __fput()
+ */
+ error = mnt_want_write(mnt);
+ if (error)
+ put_write_access(inode);
+ return error;
+}
+
static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
int flags, struct file *f,
int (*open)(struct inode *, struct file *))
{
struct inode *inode;
- int error;
+ int error = 0;

f->f_flags = flags;
f->f_mode = ((flags+1) & O_ACCMODE) | FMODE_LSEEK |
FMODE_PREAD | FMODE_PWRITE;
inode = dentry->d_inode;
- if (f->f_mode & FMODE_WRITE) {
- error = get_write_access(inode);
- if (error)
- goto cleanup_file;
- }
+ if (f->f_mode & FMODE_WRITE)
+ error = __get_file_write_access(inode, mnt);
+ if (error)
+ goto cleanup_file;

f->f_mapping = inode->i_mapping;
f->f_path.dentry = dentry;
@@ -820,8 +849,10 @@ static struct file *__dentry_open(struct

cleanup_all:
fops_put(f->f_op);
- if (f->f_mode & FMODE_WRITE)
+ if (f->f_mode & FMODE_WRITE) {
put_write_access(inode);
+ mnt_drop_write(mnt);
+ }
file_kill(f);
f->f_path.dentry = NULL;
f->f_path.mnt = NULL;
diff -puN fs/file_table.c~get-write-in-__dentry_open fs/file_table.c
_

2007-10-11 15:14:18

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/7] get mount write in __dentry_open()

>
>
> The first patch fixes an actual bug. I think the
> reset will reduce the chance for any future bugs
> to creep in.
>
> --
>
> The r/o bind mount patches require matching mnt_want_write()
> at filp creation time with a mnt_drop_write() at __fput().
>
> We used to do this in may_open(), but Miklos pointed out
> that __dentry_open() is used as well to create filps. We
> don't currently do mnt_want_write() for these.
>
> If a filp on a writeable file is created this way, and
> destroyed via __fput() we'll get a mount count imbalance.
>
> This patch moves the mount write count acquisition from
> may_open() into __dentry_open(), where we should catch
> many more of the users.
>
> Signed-off-by: Dave Hansen <[email protected]>
> ---
>
> lxc-dave/fs/namei.c | 12 ------------
> lxc-dave/fs/open.c | 45 ++++++++++++++++++++++++++++++++++++++-------
> 2 files changed, 38 insertions(+), 19 deletions(-)
>
> diff -puN fs/namei.c~get-write-in-__dentry_open fs/namei.c
> --- lxc/fs/namei.c~get-write-in-__dentry_open 2007-10-03 14:44:52.000000000 -0700
> +++ lxc-dave/fs/namei.c 2007-10-04 18:02:48.000000000 -0700
> @@ -1621,14 +1621,6 @@ int may_open(struct nameidata *nd, int a
> return -EACCES;
>
> flag &= ~O_TRUNC;
> - } else if (flag & FMODE_WRITE) {
> - /*
> - * effectively: !special_file()
> - * balanced by __fput()
> - */
> - error = mnt_want_write(nd->mnt);
> - if (error)
> - return error;
> }

Maybe readonly should still be checked here, so that the order of
error checking doesn't change. If racing with a read-only remount the
order is irrelevant anyway. Something like this?

} else if (flag & FMODE_WRITE && __mnt_is_readonly(nd->mnt)) {
return -EROFS
}

> error = vfs_permission(nd, acc_mode);
> @@ -1778,11 +1770,7 @@ do_last:
>
> /* Negative dentry, just create the file */
> if (!path.dentry->d_inode) {
> - error = mnt_want_write(nd->mnt);
> - if (error)
> - goto exit_mutex_unlock;
> error = open_namei_create(nd, &path, flag, mode);
> - mnt_drop_write(nd->mnt);

This is still needed, isn't it?

And they should be added around do_truncate() as well, since you
remove the protection from may_open().

This one introduces an interesting race between ro-remount and
open(O_TRUNC), where the truncate can succeed but the open fail with
EROFS. Is that a problem?

> if (error)
> goto exit;
> return 0;
> diff -puN fs/open.c~get-write-in-__dentry_open fs/open.c
> --- lxc/fs/open.c~get-write-in-__dentry_open 2007-10-03 14:44:52.000000000 -0700
> +++ lxc-dave/fs/open.c 2007-10-04 18:02:48.000000000 -0700
> @@ -766,22 +766,51 @@ out:
> return error;
> }
>
> +/*
> + * You have to be very careful that these write
> + * counts get cleaned up in error cases and
> + * upon __fput(). This should probably never
> + * be called outside of __dentry_open().
> + */
> +static inline int __get_file_write_access(struct inode *inode,
> + struct vfsmount *mnt)
> +{
> + int error;
> + error = get_write_access(inode);
> + if (error)
> + return error;
> + /*
> + * Do not take mount writer counts on
> + * special files since no writes to
> + * the mount itself will occur.
> + */
> + if (special_file(inode->i_mode))
> + return 0;
> +
> + /*
> + * Balanced in __fput()
> + */
> + error = mnt_want_write(mnt);
> + if (error)
> + put_write_access(inode);
> + return error;
> +}
> +
> static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
> int flags, struct file *f,
> int (*open)(struct inode *, struct file *))
> {
> struct inode *inode;
> - int error;
> + int error = 0;
>
> f->f_flags = flags;
> f->f_mode = ((flags+1) & O_ACCMODE) | FMODE_LSEEK |
> FMODE_PREAD | FMODE_PWRITE;
> inode = dentry->d_inode;
> - if (f->f_mode & FMODE_WRITE) {
> - error = get_write_access(inode);
> - if (error)
> - goto cleanup_file;
> - }
> + if (f->f_mode & FMODE_WRITE)
> + error = __get_file_write_access(inode, mnt);
> + if (error)
> + goto cleanup_file;
>
> f->f_mapping = inode->i_mapping;
> f->f_path.dentry = dentry;
> @@ -820,8 +849,10 @@ static struct file *__dentry_open(struct
>
> cleanup_all:
> fops_put(f->f_op);
> - if (f->f_mode & FMODE_WRITE)
> + if (f->f_mode & FMODE_WRITE) {
> put_write_access(inode);
> + mnt_drop_write(mnt);

Shouldn't this be conditional on !special_file()?

Miklos

2007-10-11 15:57:24

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/7] make open_namei() return a filp

> If open_namei() succeeds, there is potentially a mnt_want_write()
> that needs to get balanced. If the caller doesn't create a
> 'struct file' and eventually __fput() it, or manually drop the
> write count on an error, we have a bug.
>
> Forcing open_namei() to return a filp fixes this. Any caller
> getting a 'struct file' back must consider that filp instantiated
> and fput() it normally. The callers no longer have to worry about
> ever manually releasing a mnt write count.

I think this changelog is out of date, as this problem should have
been dealt with in patch-2/7.

Otherwise I don't object to this change, it looks like a fine cleanup.

Miklos

2007-10-11 18:17:39

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/7] get mount write in __dentry_open()

On Thu, 2007-10-11 at 17:08 +0200, Miklos Szeredi wrote:
> > diff -puN fs/namei.c~get-write-in-__dentry_open fs/namei.c
> > --- lxc/fs/namei.c~get-write-in-__dentry_open 2007-10-03 14:44:52.000000000 -0700
> > +++ lxc-dave/fs/namei.c 2007-10-04 18:02:48.000000000 -0700
> > @@ -1621,14 +1621,6 @@ int may_open(struct nameidata *nd, int a
> > return -EACCES;
> >
> > flag &= ~O_TRUNC;
> > - } else if (flag & FMODE_WRITE) {
> > - /*
> > - * effectively: !special_file()
> > - * balanced by __fput()
> > - */
> > - error = mnt_want_write(nd->mnt);
> > - if (error)
> > - return error;
> > }
>
> Maybe readonly should still be checked here, so that the order of
> error checking doesn't change. If racing with a read-only remount the
> order is irrelevant anyway. Something like this?
>
> } else if (flag & FMODE_WRITE && __mnt_is_readonly(nd->mnt)) {
> return -EROFS
> }

I think that would be a bug if anything actually managed to trip that
code. all of the may_open() calls should have been covered by the
__dentry_open() mnt writer.

> > error = vfs_permission(nd, acc_mode);
> > @@ -1778,11 +1770,7 @@ do_last:
> >
> > /* Negative dentry, just create the file */
> > if (!path.dentry->d_inode) {
> > - error = mnt_want_write(nd->mnt);
> > - if (error)
> > - goto exit_mutex_unlock;
> > error = open_namei_create(nd, &path, flag, mode);
> > - mnt_drop_write(nd->mnt);
>
> This is still needed, isn't it?

Yes, it is. I'll add a big fat comment this time about why we need it.

> And they should be added around do_truncate() as well, since you
> remove the protection from may_open().
>
> This one introduces an interesting race between ro-remount and
> open(O_TRUNC), where the truncate can succeed but the open fail with
> EROFS. Is that a problem?

You're right, this does introduce that race, and it is relatively hard
to fix properly. But, the 'return a filp' patch makes it easy to fix.
I've put a temporary kludge in the updated version of this patch, and
fixed it properly in that later patch.

> > cleanup_all:
> > fops_put(f->f_op);
> > - if (f->f_mode & FMODE_WRITE)
> > + if (f->f_mode & FMODE_WRITE) {
> > put_write_access(inode);
> > + mnt_drop_write(mnt);
>
> Shouldn't this be conditional on !special_file()?

It certainly should.

-- Dave

2007-10-11 18:52:17

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/7] get mount write in __dentry_open()

> On Thu, 2007-10-11 at 17:08 +0200, Miklos Szeredi wrote:
> > > diff -puN fs/namei.c~get-write-in-__dentry_open fs/namei.c
> > > --- lxc/fs/namei.c~get-write-in-__dentry_open 2007-10-03 14:44:52.000000000 -0700
> > > +++ lxc-dave/fs/namei.c 2007-10-04 18:02:48.000000000 -0700
> > > @@ -1621,14 +1621,6 @@ int may_open(struct nameidata *nd, int a
> > > return -EACCES;
> > >
> > > flag &= ~O_TRUNC;
> > > - } else if (flag & FMODE_WRITE) {
> > > - /*
> > > - * effectively: !special_file()
> > > - * balanced by __fput()
> > > - */
> > > - error = mnt_want_write(nd->mnt);
> > > - if (error)
> > > - return error;
> > > }
> >
> > Maybe readonly should still be checked here, so that the order of
> > error checking doesn't change. If racing with a read-only remount the
> > order is irrelevant anyway. Something like this?
> >
> > } else if (flag & FMODE_WRITE && __mnt_is_readonly(nd->mnt)) {
> > return -EROFS
> > }
>
> I think that would be a bug if anything actually managed to trip that
> code. all of the may_open() calls should have been covered by the
> __dentry_open() mnt writer.

AFACIS, __dentry_open() will normally be called later than may_open().
And we don't want it earlier, because ->open() may have side affects,
that could be unsafe if done before permission checking.

> > And they should be added around do_truncate() as well, since you
> > remove the protection from may_open().
> >
> > This one introduces an interesting race between ro-remount and
> > open(O_TRUNC), where the truncate can succeed but the open fail with
> > EROFS. Is that a problem?
>
> You're right, this does introduce that race, and it is relatively hard
> to fix properly. But, the 'return a filp' patch makes it easy to fix.
> I've put a temporary kludge in the updated version of this patch, and
> fixed it properly in that later patch.

If you fix this properly, that should take care of the first problem
as well.

Miklos

2007-10-11 19:24:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/7] get mount write in __dentry_open()

On Thu, 2007-10-11 at 20:31 +0200, Miklos Szeredi wrote:
> > On Thu, 2007-10-11 at 17:08 +0200, Miklos Szeredi wrote:
> > > > diff -puN fs/namei.c~get-write-in-__dentry_open fs/namei.c
> > > > --- lxc/fs/namei.c~get-write-in-__dentry_open 2007-10-03 14:44:52.000000000 -0700
> > > > +++ lxc-dave/fs/namei.c 2007-10-04 18:02:48.000000000 -0700
> > > > @@ -1621,14 +1621,6 @@ int may_open(struct nameidata *nd, int a
> > > > return -EACCES;
> > > >
> > > > flag &= ~O_TRUNC;
> > > > - } else if (flag & FMODE_WRITE) {
> > > > - /*
> > > > - * effectively: !special_file()
> > > > - * balanced by __fput()
> > > > - */
> > > > - error = mnt_want_write(nd->mnt);
> > > > - if (error)
> > > > - return error;
> > > > }
> > >
> > > Maybe readonly should still be checked here, so that the order of
> > > error checking doesn't change. If racing with a read-only remount the
> > > order is irrelevant anyway. Something like this?
> > >
> > > } else if (flag & FMODE_WRITE && __mnt_is_readonly(nd->mnt)) {
> > > return -EROFS
> > > }
> >
> > I think that would be a bug if anything actually managed to trip that
> > code. all of the may_open() calls should have been covered by the
> > __dentry_open() mnt writer.
>
> AFACIS, __dentry_open() will normally be called later than may_open().
> And we don't want it earlier, because ->open() may have side affects,
> that could be unsafe if done before permission checking.

I actually check the mount write count before the ->open() in
__dentry_open(). The truncates are also definitely wrapped in their own
mnt_want_write() calls now.

> > > And they should be added around do_truncate() as well, since you
> > > remove the protection from may_open().
> > >
> > > This one introduces an interesting race between ro-remount and
> > > open(O_TRUNC), where the truncate can succeed but the open fail with
> > > EROFS. Is that a problem?
> >
> > You're right, this does introduce that race, and it is relatively hard
> > to fix properly. But, the 'return a filp' patch makes it easy to fix.
> > I've put a temporary kludge in the updated version of this patch, and
> > fixed it properly in that later patch.
>
> If you fix this properly, that should take care of the first problem
> as well.

Yup. New series coming up.

-- Dave