Up till now overlayfs didn't stack regular file operations. Instead, when
a file was opened on an overlay, the file from one of the underlying layers
would be opened and any file operations performed would directly go to the
underlying file on a real filesystem.
This works well mostly, but various hacks were added to the VFS to work
around issues with this:
- d_path() and friends
- relatime handling
- file locking
- fsnotify
- writecount handling
There are also issues that are unresolved before this patchset:
- ioctl's that need write access but can be performed on a O_RDONLY fd
- ro/rw inconsistency: file on lower layer opened for read-only will
return stale data on read after copy-up and modification
- ro/rw inconsistency for mmap: file on lower layer mapped shared will
contain stale data after copy-up and modification
This patch series reverts the VFS hacks (with the exception of d_path) and
fixes the unresoved issues. We need to keep d_path related hacks, because
memory maps are still not stacked, yet d_path() should keep working on
vma->vm_file->f_path.
No regressions were observed after running various test suites (xfstests,
ltp, unionmount-testsuite, pjd-fstest).
Performance impact of stacking was found to be minimal. Memory use for
open overlay files increases by about 256bytes or 12%.
Git tree is here:
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs-rorw
Thanks,
Miklos
---
Miklos Szeredi (35):
vfs: clean up dedup
vfs: add path_open()
vfs: optionally don't account file in nr_files
ovl: copy up times
ovl: copy up inode flags
Revert "Revert "ovl: get_write_access() in truncate""
ovl: copy up file size as well
ovl: deal with overlay files in ovl_d_real()
ovl: stack file ops
ovl: add helper to return real file
ovl: readd read_iter
ovl: readd write_iter
ovl: readd fsync
ovl: readd mmap
ovl: readd fallocate
ovl: readd lsattr/chattr support
ovl: readd fiemap
ovl: readd O_DIRECT support
ovl: readd reflink/copyfile/dedup support
vfs: don't open real
vfs: add f_op->pre_mmap()
ovl: copy-up on MAP_SHARED
vfs: simplify dentry_open()
Revert "ovl: fix relatime for directories"
Revert "vfs: update ovl inode before relatime check"
Revert "ovl: fix may_write_real() for overlayfs directories"
Revert "ovl: don't allow writing ioctl on lower layer"
Revert "vfs: add flags to d_real()"
Revert "vfs: do get_write_access() on upper layer of overlayfs"
Revert "vfs: make argument of d_real_inode() const"
Revert "vfs: add d_real_inode() helper"
Partially revert "locks: fix file locking on overlayfs"
Revert "fsnotify: support overlayfs"
vfs: simplify d_op->d_real()
ovl: fix documentation of non-standard behavior
Documentation/filesystems/Locking | 4 +-
Documentation/filesystems/overlayfs.txt | 64 +++--
Documentation/filesystems/vfs.txt | 25 +-
fs/file_table.c | 13 +-
fs/inode.c | 46 +---
fs/internal.h | 17 +-
fs/ioctl.c | 1 +
fs/locks.c | 20 +-
fs/namei.c | 2 +-
fs/namespace.c | 66 +----
fs/open.c | 79 +++---
fs/overlayfs/Kconfig | 21 ++
fs/overlayfs/Makefile | 4 +-
fs/overlayfs/dir.c | 5 +
fs/overlayfs/file.c | 466 ++++++++++++++++++++++++++++++++
fs/overlayfs/inode.c | 52 +++-
fs/overlayfs/overlayfs.h | 26 ++
fs/overlayfs/ovl_entry.h | 1 +
fs/overlayfs/super.c | 82 +++---
fs/overlayfs/util.c | 20 ++
fs/read_write.c | 90 +++---
fs/xattr.c | 9 +-
include/linux/dcache.h | 41 +--
include/linux/fs.h | 32 ++-
include/linux/fsnotify.h | 14 +-
include/uapi/linux/fs.h | 1 -
mm/util.c | 5 +
27 files changed, 831 insertions(+), 375 deletions(-)
create mode 100644 fs/overlayfs/file.c
--
2.14.3
In the common case we can just use the real file cached in
file->private_data. There are two exceptions:
1) File has been copied up since open: in this unlikely corner case just
use a throwaway real file for the operation. If ever this becomes a
perfomance problem (very unlikely, since overlayfs has been doing most fine
without correctly handling this case at all), then we can deal with that by
updating the cached real file.
2) File's f_flags have changed since open: no need to reopen the cached
real file, we can just change the flags there as well.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/file.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 60 insertions(+)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index a0b606885c41..409b542ff30c 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -31,6 +31,66 @@ static struct file *ovl_open_realfile(const struct file *file)
return realfile;
}
+#define OVL_SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT)
+
+static int ovl_change_flags(struct file *file, unsigned int flags)
+{
+ struct inode *inode = file_inode(file);
+ int err;
+
+ /* No atime modificaton on underlying */
+ flags |= O_NOATIME;
+
+ /* If some flag changed that cannot be changed then something's amiss */
+ if (WARN_ON((file->f_flags ^ flags) & ~OVL_SETFL_MASK))
+ return -EIO;
+
+ flags &= OVL_SETFL_MASK;
+
+ if (((flags ^ file->f_flags) & O_APPEND) && IS_APPEND(inode))
+ return -EPERM;
+
+ if (flags & O_DIRECT) {
+ if (!file->f_mapping->a_ops ||
+ !file->f_mapping->a_ops->direct_IO)
+ return -EINVAL;
+ }
+
+ if (file->f_op->check_flags) {
+ err = file->f_op->check_flags(flags);
+ if (err)
+ return err;
+ }
+
+ spin_lock(&file->f_lock);
+ file->f_flags = (file->f_flags & ~OVL_SETFL_MASK) | flags;
+ spin_unlock(&file->f_lock);
+
+ return 0;
+}
+
+static int ovl_real_file(const struct file *file, struct fd *real)
+{
+ struct inode *inode = file_inode(file);
+
+ real->flags = 0;
+ real->file = file->private_data;
+
+ /* Has it been copied up since we'd opened it? */
+ if (unlikely(file_inode(real->file) != ovl_inode_real(inode))) {
+ real->flags = FDPUT_FPUT;
+ real->file = ovl_open_realfile(file);
+
+ return PTR_ERR_OR_ZERO(real->file);
+ }
+
+ /* Did the flags change since open? */
+ if (unlikely((file->f_flags ^ real->file->f_flags) & ~O_NOATIME))
+ return ovl_change_flags(real->file, file->f_flags);
+
+ return 0;
+}
+
static int ovl_open(struct inode *inode, struct file *file)
{
struct dentry *dentry = file_dentry(file);
--
2.14.3
This reverts commit f3fbbb079263bd29ae592478de6808db7e708267.
Overlayfs now works correctly without adding hacks to fsnotify.
Signed-off-by: Miklos Szeredi <[email protected]>
---
include/linux/fsnotify.h | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)
diff --git a/include/linux/fsnotify.h b/include/linux/fsnotify.h
index bdaf22582f6e..fd1ce10553bf 100644
--- a/include/linux/fsnotify.h
+++ b/include/linux/fsnotify.h
@@ -30,11 +30,7 @@ static inline int fsnotify_parent(const struct path *path, struct dentry *dentry
static inline int fsnotify_perm(struct file *file, int mask)
{
const struct path *path = &file->f_path;
- /*
- * Do not use file_inode() here or anywhere in this file to get the
- * inode. That would break *notity on overlayfs.
- */
- struct inode *inode = path->dentry->d_inode;
+ struct inode *inode = file_inode(file);
__u32 fsnotify_mask = 0;
int ret;
@@ -178,7 +174,7 @@ static inline void fsnotify_mkdir(struct inode *inode, struct dentry *dentry)
static inline void fsnotify_access(struct file *file)
{
const struct path *path = &file->f_path;
- struct inode *inode = path->dentry->d_inode;
+ struct inode *inode = file_inode(file);
__u32 mask = FS_ACCESS;
if (S_ISDIR(inode->i_mode))
@@ -196,7 +192,7 @@ static inline void fsnotify_access(struct file *file)
static inline void fsnotify_modify(struct file *file)
{
const struct path *path = &file->f_path;
- struct inode *inode = path->dentry->d_inode;
+ struct inode *inode = file_inode(file);
__u32 mask = FS_MODIFY;
if (S_ISDIR(inode->i_mode))
@@ -214,7 +210,7 @@ static inline void fsnotify_modify(struct file *file)
static inline void fsnotify_open(struct file *file)
{
const struct path *path = &file->f_path;
- struct inode *inode = path->dentry->d_inode;
+ struct inode *inode = file_inode(file);
__u32 mask = FS_OPEN;
if (S_ISDIR(inode->i_mode))
@@ -230,7 +226,7 @@ static inline void fsnotify_open(struct file *file)
static inline void fsnotify_close(struct file *file)
{
const struct path *path = &file->f_path;
- struct inode *inode = path->dentry->d_inode;
+ struct inode *inode = file_inode(file);
fmode_t mode = file->f_mode;
__u32 mask = (mode & FMODE_WRITE) ? FS_CLOSE_WRITE : FS_CLOSE_NOWRITE;
--
2.14.3
We can now drop description of the ro/rw inconsistency from the
documentation.
Also clarify, that now fully standard compliant behavior can be enabled
with kernel/module/mount options.
Signed-off-by: Miklos Szeredi <[email protected]>
---
Documentation/filesystems/overlayfs.txt | 64 ++++++++++++++++++++++-----------
1 file changed, 43 insertions(+), 21 deletions(-)
diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
index 961b287ef323..095186080d23 100644
--- a/Documentation/filesystems/overlayfs.txt
+++ b/Documentation/filesystems/overlayfs.txt
@@ -306,27 +306,49 @@ the copied layers will fail the verification of the lower root file handle.
Non-standard behavior
---------------------
-The copy_up operation essentially creates a new, identical file and
-moves it over to the old name. Any open files referring to this inode
-will access the old data.
-
-The new file may be on a different filesystem, so both st_dev and st_ino
-of the real file may change. The values of st_dev and st_ino returned by
-stat(2) on an overlay object are often not the same as the real file
-stat(2) values to prevent the values from changing on copy_up.
-
-Unless "xino" feature is enabled, when overlay layers are not all on the
-same underlying filesystem, the value of st_dev may be different for two
-non-directory objects in the same overlay filesystem and the value of
-st_ino for directory objects may be non persistent and could change even
-while the overlay filesystem is still mounted.
-
-Unless "inode index" feature is enabled, if a file with multiple hard
-links is copied up, then this will "break" the link. Changes will not be
-propagated to other names referring to the same inode.
-
-Unless "redirect_dir" feature is enabled, rename(2) on a lower or merged
-directory will fail with EXDEV.
+Overlayfs can now act as a POSIX compliant filesystem with the following
+features turned on:
+
+1) "redirect_dir"
+
+Enabled with the mount option or module option: "redirect_dir=on" or with
+the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y.
+
+If this feature is disabled, then rename(2) on a lower or merged directory
+will fail with EXDEV ("Invalid cross-device link").
+
+2) "inode index"
+
+Enabled with the mount option or module option "index=on" or with the
+kernel config option CONFIG_OVERLAY_FS_INDEX=y.
+
+If this feature is disabled and a file with multiple hard links is copied
+up, then this will "break" the link. Changes will not be propagated to
+other names referring to the same inode.
+
+3) "xino"
+
+Enabled with the mount option "xino=auto" or "xino=on", with the module
+option "xino_auto=on" or with the kernel config option
+CONFIG_OVERLAY_FS_XINO_AUTO=y. Also implicitly enabled by using the same
+underlying filesystem for all layers making up the overlay.
+
+If this feature is disabled or the underlying filesystem doesn't have
+enough free bits in the inode number, then overlayfs will not be able to
+guarantee that the values of st_ino and st_dev returned by stat(2) and the
+value of d_ino returned by readdir(3) will act like on a normal filesystem.
+E.g. the value of st_dev may be different for two objects in the same
+overlay filesystem and the value of st_ino for directory objects may not be
+persistent and could change even while the overlay filesystem is mounted.
+
+4) "copy_up_shared"
+
+Enabled with the mount option or module option "copy_up_shared=on" or with
+the kernel config option CONFIG_OVERLAY_FS_COPY_UP_SHARED=y.
+
+If this feature is disabled, then a memory mapping created with MAP_SHARED
+might contain stale data if the file has been copied up and modified in the
+meantime.
Changes to underlying filesystems
--
2.14.3
This partially reverts commit c568d68341be7030f5647def68851e469b21ca11.
Overlayfs files will now automatically get the correct locks, no need to
hack overlay support in VFS.
It is a partial revert, because it leaves the locks_inode() calls in place
and defines locks_inode() to file_inode(). We could revert those as well,
but it would be unnecessary code churn and it makes sense to document that
we are getting the inode for locking purposes.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/locks.c | 17 ++++++-----------
fs/overlayfs/super.c | 2 +-
include/linux/fs.h | 13 +------------
include/uapi/linux/fs.h | 1 -
4 files changed, 8 insertions(+), 25 deletions(-)
diff --git a/fs/locks.c b/fs/locks.c
index 8e9431a178a5..55069318b984 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -139,11 +139,6 @@
#define IS_OFDLCK(fl) (fl->fl_flags & FL_OFDLCK)
#define IS_REMOTELCK(fl) (fl->fl_pid <= 0)
-static inline bool is_remote_lock(struct file *filp)
-{
- return likely(!(filp->f_path.dentry->d_sb->s_flags & SB_NOREMOTELOCK));
-}
-
static bool lease_breaking(struct file_lock *fl)
{
return fl->fl_flags & (FL_UNLOCK_PENDING | FL_DOWNGRADE_PENDING);
@@ -1875,7 +1870,7 @@ EXPORT_SYMBOL(generic_setlease);
int
vfs_setlease(struct file *filp, long arg, struct file_lock **lease, void **priv)
{
- if (filp->f_op->setlease && is_remote_lock(filp))
+ if (filp->f_op->setlease)
return filp->f_op->setlease(filp, arg, lease, priv);
else
return generic_setlease(filp, arg, lease, priv);
@@ -2022,7 +2017,7 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
if (error)
goto out_free;
- if (f.file->f_op->flock && is_remote_lock(f.file))
+ if (f.file->f_op->flock)
error = f.file->f_op->flock(f.file,
(can_sleep) ? F_SETLKW : F_SETLK,
lock);
@@ -2048,7 +2043,7 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
*/
int vfs_test_lock(struct file *filp, struct file_lock *fl)
{
- if (filp->f_op->lock && is_remote_lock(filp))
+ if (filp->f_op->lock)
return filp->f_op->lock(filp, F_GETLK, fl);
posix_test_lock(filp, fl);
return 0;
@@ -2191,7 +2186,7 @@ int fcntl_getlk(struct file *filp, unsigned int cmd, struct flock *flock)
*/
int vfs_lock_file(struct file *filp, unsigned int cmd, struct file_lock *fl, struct file_lock *conf)
{
- if (filp->f_op->lock && is_remote_lock(filp))
+ if (filp->f_op->lock)
return filp->f_op->lock(filp, cmd, fl);
else
return posix_lock_file(filp, fl, conf);
@@ -2513,7 +2508,7 @@ locks_remove_flock(struct file *filp, struct file_lock_context *flctx)
if (list_empty(&flctx->flc_flock))
return;
- if (filp->f_op->flock && is_remote_lock(filp))
+ if (filp->f_op->flock)
filp->f_op->flock(filp, F_SETLKW, &fl);
else
flock_lock_inode(inode, &fl);
@@ -2600,7 +2595,7 @@ EXPORT_SYMBOL(posix_unblock_lock);
*/
int vfs_cancel_lock(struct file *filp, struct file_lock *fl)
{
- if (filp->f_op->lock && is_remote_lock(filp))
+ if (filp->f_op->lock)
return filp->f_op->lock(filp, F_CANCELLK, fl);
return 0;
}
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 7779fc610767..f2a83fabe2eb 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -1479,7 +1479,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
sb->s_op = &ovl_super_operations;
sb->s_xattr = ovl_xattr_handlers;
sb->s_fs_info = ofs;
- sb->s_flags |= SB_POSIXACL | SB_NOREMOTELOCK;
+ sb->s_flags |= SB_POSIXACL;
err = -ENOMEM;
root_dentry = d_make_root(ovl_new_inode(sb, S_IFDIR, 0));
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5294697217b7..6db6e83004d6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1050,17 +1050,7 @@ struct file_lock_context {
extern void send_sigio(struct fown_struct *fown, int fd, int band);
-/*
- * Return the inode to use for locking
- *
- * For overlayfs this should be the overlay inode, not the real inode returned
- * by file_inode(). For any other fs file_inode(filp) and locks_inode(filp) are
- * equal.
- */
-static inline struct inode *locks_inode(const struct file *f)
-{
- return f->f_path.dentry->d_inode;
-}
+#define locks_inode(f) file_inode(f)
#ifdef CONFIG_FILE_LOCKING
extern int fcntl_getlk(struct file *, unsigned int, struct flock *);
@@ -1300,7 +1290,6 @@ extern int send_sigurg(struct fown_struct *fown);
/* These sb flags are internal to the kernel */
#define SB_SUBMOUNT (1<<26)
-#define SB_NOREMOTELOCK (1<<27)
#define SB_NOSEC (1<<28)
#define SB_BORN (1<<29)
#define SB_ACTIVE (1<<30)
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index d2a8313fabd7..2840ddcece73 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -134,7 +134,6 @@ struct inodes_stat_t {
/* These sb flags are internal to the kernel */
#define MS_SUBMOUNT (1<<26)
-#define MS_NOREMOTELOCK (1<<27)
#define MS_NOSEC (1<<28)
#define MS_BORN (1<<29)
#define MS_ACTIVE (1<<30)
--
2.14.3
The only remaining user of d_real() is file_dentry().
Simplify code related to ->d_real() by:
- remove dead code from overlayfs implementation
- remove the open_flags argument
- expand d_real() helper into file_dentry()
- remove d_real() helper, since there are no callers and makes no sense to
add new ones
Signed-off-by: Miklos Szeredi <[email protected]>
---
Documentation/filesystems/Locking | 3 +--
Documentation/filesystems/vfs.txt | 22 +++-------------
fs/overlayfs/super.c | 55 +++++++--------------------------------
include/linux/dcache.h | 24 +----------------
include/linux/fs.h | 7 ++++-
5 files changed, 21 insertions(+), 90 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index a4afe96f0112..e1d7e43d302c 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -21,8 +21,7 @@ prototypes:
char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
struct vfsmount *(*d_automount)(struct path *path);
int (*d_manage)(const struct path *, bool);
- struct dentry *(*d_real)(struct dentry *, const struct inode *,
- unsigned int);
+ struct dentry *(*d_real)(struct dentry *, const struct inode *);
locking rules:
rename_lock ->d_lock may block rcu-walk
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index af54d3651ff8..632d5c58fde9 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -990,8 +990,7 @@ struct dentry_operations {
char *(*d_dname)(struct dentry *, char *, int);
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(const struct path *, bool);
- struct dentry *(*d_real)(struct dentry *, const struct inode *,
- unsigned int);
+ struct dentry *(*d_real)(struct dentry *, const struct inode *);
};
d_revalidate: called when the VFS needs to revalidate a dentry. This
@@ -1124,23 +1123,8 @@ struct dentry_operations {
This function is only used if DCACHE_MANAGE_TRANSIT is set on the
dentry being transited from.
- d_real: overlay/union type filesystems implement this method to return one of
- the underlying dentries hidden by the overlay. It is used in three
- different modes:
-
- Called from open it may need to copy-up the file depending on the
- supplied open flags. This mode is selected with a non-zero flags
- argument. In this mode the d_real method can return an error.
-
- Called from file_dentry() it returns the real dentry matching the inode
- argument. The real dentry may be from a lower layer already copied up,
- but still referenced from the file. This mode is selected with a
- non-NULL inode argument. This will always succeed.
-
- With NULL inode and zero flags the topmost real underlying dentry is
- returned. This will always succeed.
-
- This method is never called with both non-NULL inode and non-zero flags.
+ d_real: This is called from file_dentry(). It returns the real dentry
+ matching the inode argument.
Each dentry has a pointer to its parent dentry, as well as a hash list
of child dentries. Child dentries are basically like files in a
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index f2a83fabe2eb..ab1642fcece3 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -80,64 +80,29 @@ static void ovl_dentry_release(struct dentry *dentry)
}
}
-static int ovl_check_append_only(struct inode *inode, int flag)
-{
- /*
- * This test was moot in vfs may_open() because overlay inode does
- * not have the S_APPEND flag, so re-check on real upper inode
- */
- if (IS_APPEND(inode)) {
- if ((flag & O_ACCMODE) != O_RDONLY && !(flag & O_APPEND))
- return -EPERM;
- if (flag & O_TRUNC)
- return -EPERM;
- }
-
- return 0;
-}
-
static struct dentry *ovl_d_real(struct dentry *dentry,
- const struct inode *inode,
- unsigned int open_flags)
+ const struct inode *inode)
{
struct dentry *real;
- int err;
+
+ if (WARN_ON(!inode))
+ return dentry;
/* It's an overlay file */
- if (inode && d_inode(dentry) == inode)
+ if (d_inode(dentry) == inode)
return dentry;
- if (!d_is_reg(dentry)) {
- if (!inode || inode == d_inode(dentry))
- return dentry;
+ if (!d_is_reg(dentry))
goto bug;
- }
- if (open_flags) {
- err = ovl_open_maybe_copy_up(dentry, open_flags);
- if (err)
- return ERR_PTR(err);
- }
-
- real = ovl_dentry_upper(dentry);
- if (real && (!inode || inode == d_inode(real))) {
- if (!inode) {
- err = ovl_check_append_only(d_inode(real), open_flags);
- if (err)
- return ERR_PTR(err);
- }
+ real = ovl_dentry_real(dentry);
+ if (inode == d_inode(real))
return real;
- }
-
- real = ovl_dentry_lower(dentry);
- if (!real)
- goto bug;
/* Handle recursion */
- real = d_real(real, inode, open_flags);
+ if (unlikely(real->d_flags & DCACHE_OP_REAL))
+ return real->d_op->d_real(real, inode);
- if (!inode || inode == d_inode(real))
- return real;
bug:
WARN(1, "ovl_d_real(%pd4, %s:%lu): real dentry not found\n", dentry,
inode ? inode->i_sb->s_id : "NULL", inode ? inode->i_ino : 0);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 129ef05759a9..712cc21c2e6d 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -147,8 +147,7 @@ struct dentry_operations {
char *(*d_dname)(struct dentry *, char *, int);
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(const struct path *, bool);
- struct dentry *(*d_real)(struct dentry *, const struct inode *,
- unsigned int);
+ struct dentry *(*d_real)(struct dentry *, const struct inode *);
} ____cacheline_aligned;
/*
@@ -565,27 +564,6 @@ static inline struct dentry *d_backing_dentry(struct dentry *upper)
return upper;
}
-/**
- * d_real - Return the real dentry
- * @dentry: the dentry to query
- * @inode: inode to select the dentry from multiple layers (can be NULL)
- * @flags: open flags to control copy-up behavior
- *
- * If dentry is on a union/overlay, then return the underlying, real dentry.
- * Otherwise return the dentry itself.
- *
- * See also: Documentation/filesystems/vfs.txt
- */
-static inline struct dentry *d_real(struct dentry *dentry,
- const struct inode *inode,
- unsigned int flags)
-{
- if (unlikely(dentry->d_flags & DCACHE_OP_REAL))
- return dentry->d_op->d_real(dentry, inode, flags);
- else
- return dentry;
-}
-
struct name_snapshot {
const unsigned char *name;
unsigned char inline_name[DNAME_INLINE_LEN];
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6db6e83004d6..77403c19dc69 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1234,7 +1234,12 @@ static inline struct inode *file_inode(const struct file *f)
static inline struct dentry *file_dentry(const struct file *file)
{
- return d_real(file->f_path.dentry, file_inode(file), 0);
+ struct dentry *dentry = file->f_path.dentry;
+
+ if (unlikely(dentry->d_flags & DCACHE_OP_REAL))
+ return dentry->d_op->d_real(dentry, file_inode(file));
+ else
+ return dentry;
}
static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
--
2.14.3
This reverts commit 954c736f865d6c0c68ae4263a2f3502ee7c447a3.
Overlayfs no longer relies on the vfs for checking writability of files.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/namespace.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 9d1374ab6e06..9ec9bf74116b 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -468,9 +468,7 @@ static inline int may_write_real(struct file *file)
/* File refers to upper, writable layer? */
upperdentry = d_real(dentry, NULL, 0, D_REAL_UPPER);
- if (upperdentry &&
- (file_inode(file) == d_inode(upperdentry) ||
- file_inode(file) == d_inode(dentry)))
+ if (upperdentry && file_inode(file) == d_inode(upperdentry))
return 0;
/* Lower layer: can't write to real file, sorry... */
--
2.14.3
This reverts commit 7b1742eb06ead6d02a6cf3c44587088e5392d1aa.
No user of d_real_inode() remains, so it can be removed.
Signed-off-by: Miklos Szeredi <[email protected]>
---
include/linux/dcache.h | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 8ca25ea3b9db..c2dea5e78f54 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -593,10 +593,9 @@ static inline struct dentry *d_real(struct dentry *dentry,
* If dentry is on a union/overlay, then return the underlying, real inode.
* Otherwise return d_inode().
*/
-static inline struct inode *d_real_inode(const struct dentry *dentry)
+static inline struct inode *d_real_inode(struct dentry *dentry)
{
- /* This usage of d_real() results in const dentry */
- return d_backing_inode(d_real((struct dentry *) dentry, NULL, 0));
+ return d_backing_inode(d_real(dentry, NULL, 0));
}
struct name_snapshot {
--
2.14.3
This reverts commit 598e3c8f72f5b77c84d2cb26cfd936ffb3cfdbaa.
Overlayfs no longer relies on the vfs correct atime handling.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/inode.c | 33 ++++++---------------------------
fs/internal.h | 7 -------
fs/namei.c | 2 +-
include/linux/fs.h | 1 +
4 files changed, 8 insertions(+), 35 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 163715de8cb2..4832a2b47f65 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1564,37 +1564,17 @@ sector_t bmap(struct inode *inode, sector_t block)
}
EXPORT_SYMBOL(bmap);
-/*
- * Update times in overlayed inode from underlying real inode
- */
-static void update_ovl_inode_times(struct dentry *dentry, struct inode *inode,
- bool rcu)
-{
- if (!rcu) {
- struct inode *realinode = d_real_inode(dentry);
-
- if (unlikely(inode != realinode) &&
- (!timespec_equal(&inode->i_mtime, &realinode->i_mtime) ||
- !timespec_equal(&inode->i_ctime, &realinode->i_ctime))) {
- inode->i_mtime = realinode->i_mtime;
- inode->i_ctime = realinode->i_ctime;
- }
- }
-}
-
/*
* With relative atime, only update atime if the previous atime is
* earlier than either the ctime or mtime or if at least a day has
* passed since the last atime update.
*/
-static int relatime_need_update(const struct path *path, struct inode *inode,
- struct timespec now, bool rcu)
+static int relatime_need_update(struct vfsmount *mnt, struct inode *inode,
+ struct timespec now)
{
- if (!(path->mnt->mnt_flags & MNT_RELATIME))
+ if (!(mnt->mnt_flags & MNT_RELATIME))
return 1;
-
- update_ovl_inode_times(path->dentry, inode, rcu);
/*
* Is mtime younger than atime? If yes, update atime:
*/
@@ -1665,8 +1645,7 @@ static int update_time(struct inode *inode, struct timespec *time, int flags)
* This function automatically handles read only file systems and media,
* as well as the "noatime" flag and inode specific "noatime" markers.
*/
-bool __atime_needs_update(const struct path *path, struct inode *inode,
- bool rcu)
+bool atime_needs_update(const struct path *path, struct inode *inode)
{
struct vfsmount *mnt = path->mnt;
struct timespec now;
@@ -1692,7 +1671,7 @@ bool __atime_needs_update(const struct path *path, struct inode *inode,
now = current_time(inode);
- if (!relatime_need_update(path, inode, now, rcu))
+ if (!relatime_need_update(mnt, inode, now))
return false;
if (timespec_equal(&inode->i_atime, &now))
@@ -1707,7 +1686,7 @@ void touch_atime(const struct path *path)
struct inode *inode = d_inode(path->dentry);
struct timespec now;
- if (!__atime_needs_update(path, inode, false))
+ if (!atime_needs_update(path, inode))
return;
if (!sb_start_write_trylock(inode->i_sb))
diff --git a/fs/internal.h b/fs/internal.h
index d5108d9c6a2f..f0eaa33890d9 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -127,13 +127,6 @@ extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc);
extern void inode_add_lru(struct inode *inode);
extern int dentry_needs_remove_privs(struct dentry *dentry);
-extern bool __atime_needs_update(const struct path *, struct inode *, bool);
-static inline bool atime_needs_update_rcu(const struct path *path,
- struct inode *inode)
-{
- return __atime_needs_update(path, inode, true);
-}
-
/*
* fs-writeback.c
*/
diff --git a/fs/namei.c b/fs/namei.c
index cafa365eeb70..2b3931bf5fb4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1009,7 +1009,7 @@ const char *get_link(struct nameidata *nd)
if (!(nd->flags & LOOKUP_RCU)) {
touch_atime(&last->link);
cond_resched();
- } else if (atime_needs_update_rcu(&last->link, inode)) {
+ } else if (atime_needs_update(&last->link, inode)) {
if (unlikely(unlazy_walk(nd)))
return ERR_PTR(-ECHILD);
touch_atime(&last->link);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e28a2cf50fa7..5e5dd484937f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2062,6 +2062,7 @@ enum file_time_flags {
S_VERSION = 8,
};
+extern bool atime_needs_update(const struct path *, struct inode *);
extern void touch_atime(const struct path *);
static inline void file_accessed(struct file *file)
{
--
2.14.3
This reverts commit 4d0c5ba2ff79ef9f5188998b29fd28fcb05f3667.
We now get write access on both overlay and underlying layers so this patch
is no longer needed for correct operation.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/locks.c | 3 +--
fs/open.c | 15 ++-------------
2 files changed, 3 insertions(+), 15 deletions(-)
diff --git a/fs/locks.c b/fs/locks.c
index d6ff4beb70ce..8e9431a178a5 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1654,8 +1654,7 @@ check_conflicting_open(const struct dentry *dentry, const long arg, int flags)
if (flags & FL_LAYOUT)
return 0;
- if ((arg == F_RDLCK) &&
- (atomic_read(&d_real_inode(dentry)->i_writecount) > 0))
+ if ((arg == F_RDLCK) && (atomic_read(&inode->i_writecount) > 0))
return -EAGAIN;
if ((arg == F_WRLCK) && ((d_count(dentry) > 1) ||
diff --git a/fs/open.c b/fs/open.c
index 73f3d0f35742..1b2f22d25545 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -68,7 +68,6 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
long vfs_truncate(const struct path *path, loff_t length)
{
struct inode *inode;
- struct dentry *upperdentry;
long error;
inode = path->dentry->d_inode;
@@ -91,17 +90,7 @@ long vfs_truncate(const struct path *path, loff_t length)
if (IS_APPEND(inode))
goto mnt_drop_write_and_out;
- /*
- * If this is an overlayfs then do as if opening the file so we get
- * write access on the upper inode, not on the overlay inode. For
- * non-overlay filesystems d_real() is an identity function.
- */
- upperdentry = d_real(path->dentry, NULL, O_WRONLY);
- error = PTR_ERR(upperdentry);
- if (IS_ERR(upperdentry))
- goto mnt_drop_write_and_out;
-
- error = get_write_access(upperdentry->d_inode);
+ error = get_write_access(inode);
if (error)
goto mnt_drop_write_and_out;
@@ -120,7 +109,7 @@ long vfs_truncate(const struct path *path, loff_t length)
error = do_truncate(path->dentry, length, 0, NULL);
put_write_and_out:
- put_write_access(upperdentry->d_inode);
+ put_write_access(inode);
mnt_drop_write_and_out:
mnt_drop_write(path->mnt);
out:
--
2.14.3
This reverts commit 495e642939114478a5237a7d91661ba93b76f15a.
No user of "flags" argument of d_real() remain.
Signed-off-by: Miklos Szeredi <[email protected]>
---
Documentation/filesystems/Locking | 2 +-
Documentation/filesystems/vfs.txt | 2 +-
fs/open.c | 2 +-
fs/overlayfs/super.c | 4 ++--
include/linux/dcache.h | 11 +++++------
include/linux/fs.h | 2 +-
6 files changed, 11 insertions(+), 12 deletions(-)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 60e76060baff..a4afe96f0112 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -22,7 +22,7 @@ prototypes:
struct vfsmount *(*d_automount)(struct path *path);
int (*d_manage)(const struct path *, bool);
struct dentry *(*d_real)(struct dentry *, const struct inode *,
- unsigned int, unsigned int);
+ unsigned int);
locking rules:
rename_lock ->d_lock may block rcu-walk
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 2bc77ea8aef4..af54d3651ff8 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -991,7 +991,7 @@ struct dentry_operations {
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(const struct path *, bool);
struct dentry *(*d_real)(struct dentry *, const struct inode *,
- unsigned int, unsigned int);
+ unsigned int);
};
d_revalidate: called when the VFS needs to revalidate a dentry. This
diff --git a/fs/open.c b/fs/open.c
index 00b1f18475ba..73f3d0f35742 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -96,7 +96,7 @@ long vfs_truncate(const struct path *path, loff_t length)
* write access on the upper inode, not on the overlay inode. For
* non-overlay filesystems d_real() is an identity function.
*/
- upperdentry = d_real(path->dentry, NULL, O_WRONLY, 0);
+ upperdentry = d_real(path->dentry, NULL, O_WRONLY);
error = PTR_ERR(upperdentry);
if (IS_ERR(upperdentry))
goto mnt_drop_write_and_out;
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 006dc70d7425..7779fc610767 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -98,7 +98,7 @@ static int ovl_check_append_only(struct inode *inode, int flag)
static struct dentry *ovl_d_real(struct dentry *dentry,
const struct inode *inode,
- unsigned int open_flags, unsigned int flags)
+ unsigned int open_flags)
{
struct dentry *real;
int err;
@@ -134,7 +134,7 @@ static struct dentry *ovl_d_real(struct dentry *dentry,
goto bug;
/* Handle recursion */
- real = d_real(real, inode, open_flags, 0);
+ real = d_real(real, inode, open_flags);
if (!inode || inode == d_inode(real))
return real;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 4c7ab11c627a..8ca25ea3b9db 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -148,7 +148,7 @@ struct dentry_operations {
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(const struct path *, bool);
struct dentry *(*d_real)(struct dentry *, const struct inode *,
- unsigned int, unsigned int);
+ unsigned int);
} ____cacheline_aligned;
/*
@@ -569,8 +569,7 @@ static inline struct dentry *d_backing_dentry(struct dentry *upper)
* d_real - Return the real dentry
* @dentry: the dentry to query
* @inode: inode to select the dentry from multiple layers (can be NULL)
- * @open_flags: open flags to control copy-up behavior
- * @flags: flags to control what is returned by this function
+ * @flags: open flags to control copy-up behavior
*
* If dentry is on a union/overlay, then return the underlying, real dentry.
* Otherwise return the dentry itself.
@@ -579,10 +578,10 @@ static inline struct dentry *d_backing_dentry(struct dentry *upper)
*/
static inline struct dentry *d_real(struct dentry *dentry,
const struct inode *inode,
- unsigned int open_flags, unsigned int flags)
+ unsigned int flags)
{
if (unlikely(dentry->d_flags & DCACHE_OP_REAL))
- return dentry->d_op->d_real(dentry, inode, open_flags, flags);
+ return dentry->d_op->d_real(dentry, inode, flags);
else
return dentry;
}
@@ -597,7 +596,7 @@ static inline struct dentry *d_real(struct dentry *dentry,
static inline struct inode *d_real_inode(const struct dentry *dentry)
{
/* This usage of d_real() results in const dentry */
- return d_backing_inode(d_real((struct dentry *) dentry, NULL, 0, 0));
+ return d_backing_inode(d_real((struct dentry *) dentry, NULL, 0));
}
struct name_snapshot {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5e5dd484937f..5294697217b7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1244,7 +1244,7 @@ static inline struct inode *file_inode(const struct file *f)
static inline struct dentry *file_dentry(const struct file *file)
{
- return d_real(file->f_path.dentry, file_inode(file), 0, 0);
+ return d_real(file->f_path.dentry, file_inode(file), 0);
}
static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
--
2.14.3
Implement stacked fiemap().
Need to split inode operations for regular file (which has fiemap) and
special file (which doesn't have fiemap).
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/inode.c | 29 ++++++++++++++++++++++++++++-
1 file changed, 28 insertions(+), 1 deletion(-)
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 3b996639e1ad..b723327501f7 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -455,6 +455,23 @@ int ovl_update_time(struct inode *inode, struct timespec *ts, int flags)
return 0;
}
+static int ovl_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
+ u64 start, u64 len)
+{
+ int err;
+ struct inode *realinode = ovl_inode_real(inode);
+ const struct cred *old_cred;
+
+ if (!realinode->i_op->fiemap)
+ return -EOPNOTSUPP;
+
+ old_cred = ovl_override_creds(inode->i_sb);
+ err = realinode->i_op->fiemap(realinode, fieinfo, start, len);
+ revert_creds(old_cred);
+
+ return err;
+}
+
static const struct inode_operations ovl_file_inode_operations = {
.setattr = ovl_setattr,
.permission = ovl_permission,
@@ -462,6 +479,7 @@ static const struct inode_operations ovl_file_inode_operations = {
.listxattr = ovl_listxattr,
.get_acl = ovl_get_acl,
.update_time = ovl_update_time,
+ .fiemap = ovl_fiemap,
};
static const struct inode_operations ovl_symlink_inode_operations = {
@@ -472,6 +490,15 @@ static const struct inode_operations ovl_symlink_inode_operations = {
.update_time = ovl_update_time,
};
+static const struct inode_operations ovl_special_inode_operations = {
+ .setattr = ovl_setattr,
+ .permission = ovl_permission,
+ .getattr = ovl_getattr,
+ .listxattr = ovl_listxattr,
+ .get_acl = ovl_get_acl,
+ .update_time = ovl_update_time,
+};
+
/*
* It is possible to stack overlayfs instance on top of another
* overlayfs instance as lower layer. We need to annonate the
@@ -555,7 +582,7 @@ static void ovl_fill_inode(struct inode *inode, umode_t mode, dev_t rdev,
break;
default:
- inode->i_op = &ovl_file_inode_operations;
+ inode->i_op = &ovl_special_inode_operations;
init_special_inode(inode, mode, rdev);
break;
}
--
2.14.3
This reverts commit cd91304e7190b4c4802f8e413ab2214b233e0260.
Overlayfs no longer relies on the vfs correct atime handling.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/inode.c | 21 ++++-----------------
fs/overlayfs/super.c | 3 ---
include/linux/dcache.h | 3 ---
3 files changed, 4 insertions(+), 23 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index ef362364d396..163715de8cb2 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1570,24 +1570,11 @@ EXPORT_SYMBOL(bmap);
static void update_ovl_inode_times(struct dentry *dentry, struct inode *inode,
bool rcu)
{
- struct dentry *upperdentry;
+ if (!rcu) {
+ struct inode *realinode = d_real_inode(dentry);
- /*
- * Nothing to do if in rcu or if non-overlayfs
- */
- if (rcu || likely(!(dentry->d_flags & DCACHE_OP_REAL)))
- return;
-
- upperdentry = d_real(dentry, NULL, 0, D_REAL_UPPER);
-
- /*
- * If file is on lower then we can't update atime, so no worries about
- * stale mtime/ctime.
- */
- if (upperdentry) {
- struct inode *realinode = d_inode(upperdentry);
-
- if ((!timespec_equal(&inode->i_mtime, &realinode->i_mtime) ||
+ if (unlikely(inode != realinode) &&
+ (!timespec_equal(&inode->i_mtime, &realinode->i_mtime) ||
!timespec_equal(&inode->i_ctime, &realinode->i_ctime))) {
inode->i_mtime = realinode->i_mtime;
inode->i_ctime = realinode->i_ctime;
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index c3d8c7ea180f..006dc70d7425 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -107,9 +107,6 @@ static struct dentry *ovl_d_real(struct dentry *dentry,
if (inode && d_inode(dentry) == inode)
return dentry;
- if (flags & D_REAL_UPPER)
- return ovl_dentry_upper(dentry);
-
if (!d_is_reg(dentry)) {
if (!inode || inode == d_inode(dentry))
return dentry;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 82a99d366aec..4c7ab11c627a 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -565,9 +565,6 @@ static inline struct dentry *d_backing_dentry(struct dentry *upper)
return upper;
}
-/* d_real() flags */
-#define D_REAL_UPPER 0x2 /* return upper dentry or NULL if non-upper */
-
/**
* d_real - Return the real dentry
* @dentry: the dentry to query
--
2.14.3
This reverts commit a118084432d642eeccb961c7c8cc61525a941fcb.
No user of d_real_inode() remains, so it can be removed.
Signed-off-by: Miklos Szeredi <[email protected]>
---
include/linux/dcache.h | 12 ------------
1 file changed, 12 deletions(-)
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index c2dea5e78f54..129ef05759a9 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -586,18 +586,6 @@ static inline struct dentry *d_real(struct dentry *dentry,
return dentry;
}
-/**
- * d_real_inode - Return the real inode
- * @dentry: The dentry to query
- *
- * If dentry is on a union/overlay, then return the underlying, real inode.
- * Otherwise return d_inode().
- */
-static inline struct inode *d_real_inode(struct dentry *dentry)
-{
- return d_backing_inode(d_real(dentry, NULL, 0));
-}
-
struct name_snapshot {
const unsigned char *name;
unsigned char inline_name[DNAME_INLINE_LEN];
--
2.14.3
A corner case of a corner case is when
- file opened for O_RDONLY
- which is then memory mapped SHARED
- file opened for O_WRONLY
- contents modified
- contents read back though the shared mapping
Unfortunately it looks very difficult to do anything about the established
shared map after the file is copied up.
Instead, when a read-only file is mapped shared, copy up the file before
actually doing the map. This may result in unnecessary copy-ups (but so
may copy-up on open(O_RDWR) for exampe).
We can revisit this later if it turns out to be a performance problem in
real life.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/Kconfig | 21 +++++++++++++++++++++
fs/overlayfs/file.c | 22 ++++++++++++++++++++++
fs/overlayfs/overlayfs.h | 7 +++++++
fs/overlayfs/ovl_entry.h | 1 +
fs/overlayfs/super.c | 22 ++++++++++++++++++++++
5 files changed, 73 insertions(+)
diff --git a/fs/overlayfs/Kconfig b/fs/overlayfs/Kconfig
index 17032631c5cf..991c0a5a0e00 100644
--- a/fs/overlayfs/Kconfig
+++ b/fs/overlayfs/Kconfig
@@ -103,3 +103,24 @@ config OVERLAY_FS_XINO_AUTO
For more information, see Documentation/filesystems/overlayfs.txt
If unsure, say N.
+
+config OVERLAY_FS_COPY_UP_SHARED
+ bool "Overlayfs: copy up when mapping a file shared"
+ default n
+ depends on OVERLAY_FS
+ help
+ If this option is enabled then on mapping a file with MAP_SHARED
+ overlayfs copies up the file in anticipation of it being modified (just
+ like we copy up the file on O_WRONLY and O_RDWR in anticipation of
+ modification). This does not interfere with shared library loading, as
+ that uses MAP_PRIVATE. But there might be use cases out there where
+ this impacts performance and disk usage.
+
+ This just selects the default, the feature can also be enabled or
+ disabled in the running kernel or individually on each overlay mount.
+
+ To get maximally standard compliant behavior, enable this option.
+
+ To get a maximally backward compatible kernel, disable this option.
+
+ If unsure, say N.
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 39b1b73334ad..23638d8ebab5 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -10,6 +10,7 @@
#include <linux/file.h>
#include <linux/mount.h>
#include <linux/xattr.h>
+#include <linux/mman.h>
#include <linux/uio.h>
#include "overlayfs.h"
@@ -245,6 +246,26 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
return ret;
}
+static int ovl_pre_mmap(struct file *file, unsigned long prot,
+ unsigned long flag)
+{
+ int err = 0;
+
+ /*
+ * Take MAP_SHARED as hint about future writes to the file (through
+ * another file descriptor). Caller might not have had such an intent,
+ * but we hope MAP_PRIVATE will be used in most such cases.
+ *
+ * If we don't copy up now and the file is modified, it becomes really
+ * difficult to change the mapping to match that of the file's content
+ * later.
+ */
+ if ((flag & MAP_SHARED) && ovl_copy_up_shared(file_inode(file)->i_sb))
+ err = ovl_copy_up(file_dentry(file));
+
+ return err;
+}
+
static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
{
struct fd real;
@@ -434,6 +455,7 @@ const struct file_operations ovl_file_operations = {
.read_iter = ovl_read_iter,
.write_iter = ovl_write_iter,
.fsync = ovl_fsync,
+ .pre_mmap = ovl_pre_mmap,
.mmap = ovl_mmap,
.fallocate = ovl_fallocate,
.unlocked_ioctl = ovl_ioctl,
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index b29c1688f372..dad54bc8de7d 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -279,6 +279,13 @@ static inline unsigned int ovl_xino_bits(struct super_block *sb)
return ofs->xino_bits;
}
+static inline bool ovl_copy_up_shared(struct super_block *sb)
+{
+ struct ovl_fs *ofs = sb->s_fs_info;
+
+ return !(sb->s_flags & SB_RDONLY) && ofs->config.copy_up_shared;
+}
+
/* namei.c */
int ovl_check_fh_len(struct ovl_fh *fh, int fh_len);
diff --git a/fs/overlayfs/ovl_entry.h b/fs/overlayfs/ovl_entry.h
index 41655a7d6894..3bea47c63fd9 100644
--- a/fs/overlayfs/ovl_entry.h
+++ b/fs/overlayfs/ovl_entry.h
@@ -18,6 +18,7 @@ struct ovl_config {
const char *redirect_mode;
bool index;
bool nfs_export;
+ bool copy_up_shared;
int xino;
};
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index ad6a5baf226b..c3d8c7ea180f 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -56,6 +56,12 @@ module_param_named(xino_auto, ovl_xino_auto_def, bool, 0644);
MODULE_PARM_DESC(ovl_xino_auto_def,
"Auto enable xino feature");
+static bool ovl_copy_up_shared_def =
+ IS_ENABLED(CONFIG_OVERLAY_FS_COPY_UP_SHARED);
+module_param_named(copy_up_shared, ovl_copy_up_shared_def, bool, 0644);
+MODULE_PARM_DESC(ovl_copy_up_shared_def,
+ "Copy up when mapping a file shared");
+
static void ovl_entry_stack_free(struct ovl_entry *oe)
{
unsigned int i;
@@ -380,6 +386,9 @@ static int ovl_show_options(struct seq_file *m, struct dentry *dentry)
"on" : "off");
if (ofs->config.xino != ovl_xino_def())
seq_printf(m, ",xino=%s", ovl_xino_str[ofs->config.xino]);
+ if (ofs->config.copy_up_shared != ovl_copy_up_shared_def)
+ seq_printf(m, ",copy_up_shared=%s",
+ ofs->config.copy_up_shared ? "on" : "off");
return 0;
}
@@ -417,6 +426,8 @@ enum {
OPT_XINO_ON,
OPT_XINO_OFF,
OPT_XINO_AUTO,
+ OPT_COPY_UP_SHARED_ON,
+ OPT_COPY_UP_SHARED_OFF,
OPT_ERR,
};
@@ -433,6 +444,8 @@ static const match_table_t ovl_tokens = {
{OPT_XINO_ON, "xino=on"},
{OPT_XINO_OFF, "xino=off"},
{OPT_XINO_AUTO, "xino=auto"},
+ {OPT_COPY_UP_SHARED_ON, "copy_up_shared=on"},
+ {OPT_COPY_UP_SHARED_OFF, "copy_up_shared=off"},
{OPT_ERR, NULL}
};
@@ -559,6 +572,14 @@ static int ovl_parse_opt(char *opt, struct ovl_config *config)
config->xino = OVL_XINO_AUTO;
break;
+ case OPT_COPY_UP_SHARED_ON:
+ config->copy_up_shared = true;
+ break;
+
+ case OPT_COPY_UP_SHARED_OFF:
+ config->copy_up_shared = false;
+ break;
+
default:
pr_err("overlayfs: unrecognized mount option \"%s\" or missing value\n", p);
return -EINVAL;
@@ -1380,6 +1401,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent)
ofs->config.index = ovl_index_def;
ofs->config.nfs_export = ovl_nfs_export_def;
ofs->config.xino = ovl_xino_def();
+ ofs->config.copy_up_shared = ovl_copy_up_shared_def;
err = ovl_parse_opt((char *) data, &ofs->config);
if (err)
goto out_err;
--
2.14.3
This is needed by overlayfs to be able to copy up a file from a read-only
lower layer to a writable layer when being mapped shared. When copying up,
overlayfs takes VFS locks that would violate locking order when nested
inside mmap_sem.
Add a new f_op->pre_mmap method, which is called before taking mmap_sem.
Signed-off-by: Miklos Szeredi <[email protected]>
---
Documentation/filesystems/Locking | 1 +
Documentation/filesystems/vfs.txt | 3 +++
include/linux/fs.h | 1 +
mm/util.c | 5 +++++
4 files changed, 10 insertions(+)
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 75d2d57e2c44..60e76060baff 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -442,6 +442,7 @@ prototypes:
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+ int (*pre_mmap) (struct file *, unsigned long, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *);
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 5fd325df59e2..2bc77ea8aef4 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -859,6 +859,7 @@ struct file_operations {
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+ int (*pre_mmap) (struct file *, unsigned long, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*mremap)(struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
@@ -906,6 +907,8 @@ otherwise noted.
compat_ioctl: called by the ioctl(2) system call when 32 bit system calls
are used on 64 bit kernels.
+ pre_mmap: called before mmap, without mmap_sem being held yet.
+
mmap: called by the mmap(2) system call
open: called by the VFS when an inode should be opened. When the VFS
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1add10f04b56..e28a2cf50fa7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1716,6 +1716,7 @@ struct file_operations {
__poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+ int (*pre_mmap) (struct file *, unsigned long, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
unsigned long mmap_supported_flags;
int (*open) (struct inode *, struct file *);
diff --git a/mm/util.c b/mm/util.c
index c1250501364f..0a48f9077bad 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -350,6 +350,11 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
ret = security_mmap_file(file, prot, flag);
if (!ret) {
+ if (file && file->f_op->pre_mmap) {
+ ret = file->f_op->pre_mmap(file, prot, flag);
+ if (ret)
+ return ret;
+ }
if (down_write_killable(&mm->mmap_sem))
return -EINTR;
ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
--
2.14.3
Implement FS_IOC_GETFLAGS and FS_IOC_SETFLAGS.
Needs vfs_ioctl() exported to modules.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/internal.h | 1 -
fs/ioctl.c | 1 +
fs/overlayfs/file.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
4 files changed, 62 insertions(+), 1 deletion(-)
diff --git a/fs/internal.h b/fs/internal.h
index 3319bf39e339..d5108d9c6a2f 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -176,7 +176,6 @@ extern const struct dentry_operations ns_dentry_operations;
*/
extern int do_vfs_ioctl(struct file *file, unsigned int fd, unsigned int cmd,
unsigned long arg);
-extern long vfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
/*
* iomap support:
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 5ace7efb0d04..696f4c46a868 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -49,6 +49,7 @@ long vfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
out:
return error;
}
+EXPORT_SYMBOL(vfs_ioctl);
static int ioctl_fibmap(struct file *filp, int __user *p)
{
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 05e3e2f80b89..cc004ff1b05b 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -8,6 +8,7 @@
#include <linux/cred.h>
#include <linux/file.h>
+#include <linux/mount.h>
#include <linux/xattr.h>
#include <linux/uio.h>
#include "overlayfs.h"
@@ -291,6 +292,63 @@ long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
return ret;
}
+static long ovl_real_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ struct fd real;
+ const struct cred *old_cred;
+ long ret;
+
+ ret = ovl_real_file(file, &real);
+ if (ret)
+ return ret;
+
+ old_cred = ovl_override_creds(file_inode(file)->i_sb);
+ ret = vfs_ioctl(real.file, cmd, arg);
+ revert_creds(old_cred);
+
+ fdput(real);
+
+ return ret;
+}
+
+long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+ long ret;
+ struct inode *inode = file_inode(file);
+
+ switch (cmd) {
+ case FS_IOC_GETFLAGS:
+ ret = ovl_real_ioctl(file, cmd, arg);
+ break;
+
+ case FS_IOC_SETFLAGS:
+ if (!inode_owner_or_capable(inode))
+ return -EACCES;
+
+ ret = mnt_want_write_file(file);
+ if (ret)
+ return ret;
+
+ ret = ovl_copy_up(file_dentry(file));
+ if (!ret) {
+ ret = ovl_real_ioctl(file, cmd, arg);
+
+ inode_lock(inode);
+ ovl_copyflags(ovl_inode_real(inode), inode);
+ inode_unlock(inode);
+ }
+
+ mnt_drop_write_file(file);
+ break;
+
+ default:
+ ret = -ENOTTY;
+ }
+
+ return ret;
+}
+
const struct file_operations ovl_file_operations = {
.open = ovl_open,
.release = ovl_release,
@@ -300,4 +358,5 @@ const struct file_operations ovl_file_operations = {
.fsync = ovl_fsync,
.mmap = ovl_mmap,
.fallocate = ovl_fallocate,
+ .unlocked_ioctl = ovl_ioctl,
};
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0b215faa30ae..1add10f04b56 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1621,6 +1621,8 @@ int vfs_mkobj(struct dentry *, umode_t,
int (*f)(struct dentry *, umode_t, void *),
void *);
+extern long vfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
+
/*
* VFS file helper functions.
*/
--
2.14.3
Let overlayfs do its thing when opening a file.
This enables stacking and fixes the corner case when a file is opened for
read, modified through a writable open, and data is read from the read-only
file. After this patch the read-only open will not return stale data even
in this case.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/open.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index d509705d5740..1d1a52908b0f 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -857,13 +857,8 @@ EXPORT_SYMBOL(file_path);
int vfs_open(const struct path *path, struct file *file,
const struct cred *cred)
{
- struct dentry *dentry = d_real(path->dentry, NULL, file->f_flags, 0);
-
- if (IS_ERR(dentry))
- return PTR_ERR(dentry);
-
file->f_path = *path;
- return do_dentry_open(file, d_backing_inode(dentry), NULL, cred);
+ return do_dentry_open(file, d_backing_inode(path->dentry), NULL, cred);
}
struct file *path_open(const struct path *path, int flags, struct inode *inode,
--
2.14.3
Since set of arguments are so similar, handle in a common helper.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/file.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 79 insertions(+)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 9670e160967e..39b1b73334ad 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -352,6 +352,81 @@ long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
return ret;
}
+enum ovl_copyop {
+ OVL_COPY,
+ OVL_CLONE,
+ OVL_DEDUPE,
+};
+
+static ssize_t ovl_copyfile(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ u64 len, unsigned int flags, enum ovl_copyop op)
+{
+ struct inode *inode_out = file_inode(file_out);
+ struct fd real_in, real_out;
+ const struct cred *old_cred;
+ int ret;
+
+ ret = ovl_real_file(file_out, &real_out);
+ if (ret)
+ return ret;
+
+ ret = ovl_real_file(file_in, &real_in);
+ if (ret) {
+ fdput(real_out);
+ return ret;
+ }
+
+ old_cred = ovl_override_creds(file_inode(file_out)->i_sb);
+ switch (op) {
+ case OVL_COPY:
+ ret = vfs_copy_file_range(real_in.file, pos_in,
+ real_out.file, pos_out, len, flags);
+ break;
+
+ case OVL_CLONE:
+ ret = vfs_clone_file_range(real_in.file, pos_in,
+ real_out.file, pos_out, len);
+ break;
+
+ case OVL_DEDUPE:
+ ret = vfs_dedupe_file_range_one(real_in.file, pos_in, len,
+ real_out.file, pos_out);
+ break;
+ }
+ revert_creds(old_cred);
+
+ /* Update size */
+ ovl_copyattr(ovl_inode_real(inode_out), inode_out);
+
+ fdput(real_in);
+ fdput(real_out);
+
+ return ret;
+}
+
+static ssize_t ovl_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
+{
+ return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, flags,
+ OVL_COPY);
+}
+
+static int ovl_clone_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out, u64 len)
+{
+ return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
+ OVL_CLONE);
+}
+
+static ssize_t ovl_dedupe_file_range(struct file *file_in, u64 pos_in, u64 len,
+ struct file *file_out, u64 pos_out)
+{
+ return ovl_copyfile(file_in, pos_in, file_out, pos_out, len, 0,
+ OVL_DEDUPE);
+}
+
const struct file_operations ovl_file_operations = {
.open = ovl_open,
.release = ovl_release,
@@ -362,4 +437,8 @@ const struct file_operations ovl_file_operations = {
.mmap = ovl_mmap,
.fallocate = ovl_fallocate,
.unlocked_ioctl = ovl_ioctl,
+
+ .copy_file_range = ovl_copy_file_range,
+ .clone_file_range = ovl_clone_file_range,
+ .dedupe_file_range = ovl_dedupe_file_range,
};
--
2.14.3
This reverts commit 7c6893e3c9abf6a9676e060a1e35e5caca673d57.
Overlayfs no longer relies on the vfs for checking writability of files.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/internal.h | 2 --
fs/namespace.c | 64 +++-------------------------------------------------------
fs/open.c | 4 ++--
fs/xattr.c | 9 ++++-----
4 files changed, 9 insertions(+), 70 deletions(-)
diff --git a/fs/internal.h b/fs/internal.h
index f0eaa33890d9..d7fdd56bafd3 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -72,10 +72,8 @@ extern void __init mnt_init(void);
extern int __mnt_want_write(struct vfsmount *);
extern int __mnt_want_write_file(struct file *);
-extern int mnt_want_write_file_path(struct file *);
extern void __mnt_drop_write(struct vfsmount *);
extern void __mnt_drop_write_file(struct file *);
-extern void mnt_drop_write_file_path(struct file *);
/*
* fs_struct.c
diff --git a/fs/namespace.c b/fs/namespace.c
index 9ec9bf74116b..e2645ef331f2 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -431,18 +431,13 @@ int __mnt_want_write_file(struct file *file)
}
/**
- * mnt_want_write_file_path - get write access to a file's mount
+ * mnt_want_write_file - get write access to a file's mount
* @file: the file who's mount on which to take a write
*
* This is like mnt_want_write, but it takes a file and can
* do some optimisations if the file is open for write already
- *
- * Called by the vfs for cases when we have an open file at hand, but will do an
- * inode operation on it (important distinction for files opened on overlayfs,
- * since the file operations will come from the real underlying file, while
- * inode operations come from the overlay).
*/
-int mnt_want_write_file_path(struct file *file)
+int mnt_want_write_file(struct file *file)
{
int ret;
@@ -452,53 +447,6 @@ int mnt_want_write_file_path(struct file *file)
sb_end_write(file->f_path.mnt->mnt_sb);
return ret;
}
-
-static inline int may_write_real(struct file *file)
-{
- struct dentry *dentry = file->f_path.dentry;
- struct dentry *upperdentry;
-
- /* Writable file? */
- if (file->f_mode & FMODE_WRITER)
- return 0;
-
- /* Not overlayfs? */
- if (likely(!(dentry->d_flags & DCACHE_OP_REAL)))
- return 0;
-
- /* File refers to upper, writable layer? */
- upperdentry = d_real(dentry, NULL, 0, D_REAL_UPPER);
- if (upperdentry && file_inode(file) == d_inode(upperdentry))
- return 0;
-
- /* Lower layer: can't write to real file, sorry... */
- return -EPERM;
-}
-
-/**
- * mnt_want_write_file - get write access to a file's mount
- * @file: the file who's mount on which to take a write
- *
- * This is like mnt_want_write, but it takes a file and can
- * do some optimisations if the file is open for write already
- *
- * Mostly called by filesystems from their ioctl operation before performing
- * modification. On overlayfs this needs to check if the file is on a read-only
- * lower layer and deny access in that case.
- */
-int mnt_want_write_file(struct file *file)
-{
- int ret;
-
- ret = may_write_real(file);
- if (!ret) {
- sb_start_write(file_inode(file)->i_sb);
- ret = __mnt_want_write_file(file);
- if (ret)
- sb_end_write(file_inode(file)->i_sb);
- }
- return ret;
-}
EXPORT_SYMBOL_GPL(mnt_want_write_file);
/**
@@ -536,15 +484,9 @@ void __mnt_drop_write_file(struct file *file)
__mnt_drop_write(file->f_path.mnt);
}
-void mnt_drop_write_file_path(struct file *file)
-{
- mnt_drop_write(file->f_path.mnt);
-}
-
void mnt_drop_write_file(struct file *file)
{
- __mnt_drop_write(file->f_path.mnt);
- sb_end_write(file_inode(file)->i_sb);
+ mnt_drop_write(file->f_path.mnt);
}
EXPORT_SYMBOL(mnt_drop_write_file);
diff --git a/fs/open.c b/fs/open.c
index 442088ca30f9..00b1f18475ba 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -670,12 +670,12 @@ SYSCALL_DEFINE3(fchown, unsigned int, fd, uid_t, user, gid_t, group)
if (!f.file)
goto out;
- error = mnt_want_write_file_path(f.file);
+ error = mnt_want_write_file(f.file);
if (error)
goto out_fput;
audit_file(f.file);
error = chown_common(&f.file->f_path, user, group);
- mnt_drop_write_file_path(f.file);
+ mnt_drop_write_file(f.file);
out_fput:
fdput(f);
out:
diff --git a/fs/xattr.c b/fs/xattr.c
index 61cd28ba25f3..78eaffbdbee0 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -23,7 +23,6 @@
#include <linux/posix_acl_xattr.h>
#include <linux/uaccess.h>
-#include "internal.h"
static const char *
strcmp_prefix(const char *a, const char *a_prefix)
@@ -503,10 +502,10 @@ SYSCALL_DEFINE5(fsetxattr, int, fd, const char __user *, name,
if (!f.file)
return error;
audit_file(f.file);
- error = mnt_want_write_file_path(f.file);
+ error = mnt_want_write_file(f.file);
if (!error) {
error = setxattr(f.file->f_path.dentry, name, value, size, flags);
- mnt_drop_write_file_path(f.file);
+ mnt_drop_write_file(f.file);
}
fdput(f);
return error;
@@ -735,10 +734,10 @@ SYSCALL_DEFINE2(fremovexattr, int, fd, const char __user *, name)
if (!f.file)
return error;
audit_file(f.file);
- error = mnt_want_write_file_path(f.file);
+ error = mnt_want_write_file(f.file);
if (!error) {
error = removexattr(f.file->f_path.dentry, name);
- mnt_drop_write_file_path(f.file);
+ mnt_drop_write_file(f.file);
}
fdput(f);
return error;
--
2.14.3
dentry_open() can now just call path_open().
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/open.c | 22 ++--------------------
1 file changed, 2 insertions(+), 20 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 1d1a52908b0f..442088ca30f9 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -891,31 +891,13 @@ EXPORT_SYMBOL(path_open);
struct file *dentry_open(const struct path *path, int flags,
const struct cred *cred)
{
- int error;
- struct file *f;
-
validate_creds(cred);
/* We must always pass in a valid mount pointer. */
BUG_ON(!path->mnt);
- f = get_empty_filp();
- if (!IS_ERR(f)) {
- f->f_flags = flags;
- error = vfs_open(path, f, cred);
- if (!error) {
- /* from now on we need fput() to dispose of f */
- error = open_check_o_direct(f);
- if (error) {
- fput(f);
- f = ERR_PTR(error);
- }
- } else {
- put_filp(f);
- f = ERR_PTR(error);
- }
- }
- return f;
+ return path_open(path, flags, d_backing_inode(path->dentry), cred,
+ true);
}
EXPORT_SYMBOL(dentry_open);
--
2.14.3
Implement stacked mmap.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/file.c | 28 ++++++++++++++++++++++++++++
1 file changed, 28 insertions(+)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 4417527667ff..f1cc2fdc62c1 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -241,6 +241,33 @@ static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
return ret;
}
+static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ struct fd real;
+ const struct cred *old_cred;
+ int ret;
+
+ ret = ovl_real_file(file, &real);
+ if (ret)
+ return ret;
+
+ /* transfer ref: */
+ fput(vma->vm_file);
+ vma->vm_file = get_file(real.file);
+ fdput(real);
+
+ if (!vma->vm_file->f_op->mmap)
+ return -ENODEV;
+
+ old_cred = ovl_override_creds(file_inode(file)->i_sb);
+ ret = call_mmap(vma->vm_file, vma);
+ revert_creds(old_cred);
+
+ ovl_file_accessed(file);
+
+ return ret;
+}
+
const struct file_operations ovl_file_operations = {
.open = ovl_open,
.release = ovl_release,
@@ -248,4 +275,5 @@ const struct file_operations ovl_file_operations = {
.read_iter = ovl_read_iter,
.write_iter = ovl_write_iter,
.fsync = ovl_fsync,
+ .mmap = ovl_mmap,
};
--
2.14.3
Implement file operations on a regular overlay file. The underlying file
is opened separately and cached in ->private_data.
It might be worth making an exception for such files when accounting in
nr_file to confirm to userspace expectations. We are only adding a small
overhead (248bytes for the struct file) since the real inode and dentry are
pinned by overlayfs anyway.
This patch doesn't have any effect, since the vfs will use d_real() to find
the real underlying file to open. The patch at the end of the series will
actually enable this functionality.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/Makefile | 4 +--
fs/overlayfs/file.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/overlayfs/inode.c | 1 +
fs/overlayfs/overlayfs.h | 3 ++
4 files changed, 82 insertions(+), 2 deletions(-)
create mode 100644 fs/overlayfs/file.c
diff --git a/fs/overlayfs/Makefile b/fs/overlayfs/Makefile
index 30802347a020..46e1ff8ac056 100644
--- a/fs/overlayfs/Makefile
+++ b/fs/overlayfs/Makefile
@@ -4,5 +4,5 @@
obj-$(CONFIG_OVERLAY_FS) += overlay.o
-overlay-objs := super.o namei.o util.o inode.o dir.o readdir.o copy_up.o \
- export.o
+overlay-objs := super.o namei.o util.o inode.o file.o dir.o readdir.o \
+ copy_up.o export.o
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
new file mode 100644
index 000000000000..a0b606885c41
--- /dev/null
+++ b/fs/overlayfs/file.c
@@ -0,0 +1,76 @@
+/*
+ * Copyright (C) 2017 Red Hat, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include <linux/cred.h>
+#include <linux/file.h>
+#include <linux/xattr.h>
+#include "overlayfs.h"
+
+static struct file *ovl_open_realfile(const struct file *file)
+{
+ struct inode *inode = file_inode(file);
+ struct inode *upperinode = ovl_inode_upper(inode);
+ struct inode *realinode = upperinode ?: ovl_inode_lower(inode);
+ struct file *realfile;
+ const struct cred *old_cred;
+
+ old_cred = ovl_override_creds(inode->i_sb);
+ realfile = path_open(&file->f_path, file->f_flags | O_NOATIME,
+ realinode, current_cred(), false);
+ revert_creds(old_cred);
+
+ pr_debug("open(%p[%pD2/%c], 0%o) -> (%p, 0%o)\n",
+ file, file, upperinode ? 'u' : 'l', file->f_flags,
+ realfile, IS_ERR(realfile) ? 0 : realfile->f_flags);
+
+ return realfile;
+}
+
+static int ovl_open(struct inode *inode, struct file *file)
+{
+ struct dentry *dentry = file_dentry(file);
+ struct file *realfile;
+ int err;
+
+ err = ovl_open_maybe_copy_up(dentry, file->f_flags);
+ if (err)
+ return err;
+
+ /* No longer need these flags, so don't pass them on to underlying fs */
+ file->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);
+
+ realfile = ovl_open_realfile(file);
+ if (IS_ERR(realfile))
+ return PTR_ERR(realfile);
+
+ file->private_data = realfile;
+
+ return 0;
+}
+
+static int ovl_release(struct inode *inode, struct file *file)
+{
+ fput(file->private_data);
+
+ return 0;
+}
+
+static loff_t ovl_llseek(struct file *file, loff_t offset, int whence)
+{
+ struct inode *realinode = ovl_inode_real(file_inode(file));
+
+ return generic_file_llseek_size(file, offset, whence,
+ realinode->i_sb->s_maxbytes,
+ i_size_read(realinode));
+}
+
+const struct file_operations ovl_file_operations = {
+ .open = ovl_open,
+ .release = ovl_release,
+ .llseek = ovl_llseek,
+};
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index e213bb9823ec..3b996639e1ad 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -542,6 +542,7 @@ static void ovl_fill_inode(struct inode *inode, umode_t mode, dev_t rdev,
switch (mode & S_IFMT) {
case S_IFREG:
inode->i_op = &ovl_file_inode_operations;
+ inode->i_fop = &ovl_file_operations;
break;
case S_IFDIR:
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 10f5f3bf9d96..b29c1688f372 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -381,6 +381,9 @@ int ovl_create_real(struct inode *dir, struct dentry *newdentry,
struct dentry *hardlink, bool debug);
int ovl_cleanup(struct inode *dir, struct dentry *dentry);
+/* file.c */
+extern const struct file_operations ovl_file_operations;
+
/* copy_up.c */
int ovl_copy_up(struct dentry *dentry);
int ovl_copy_up_flags(struct dentry *dentry, int flags);
--
2.14.3
Implement stacked reading.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/file.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 409b542ff30c..a19429c5965d 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -9,6 +9,7 @@
#include <linux/cred.h>
#include <linux/file.h>
#include <linux/xattr.h>
+#include <linux/uio.h>
#include "overlayfs.h"
static struct file *ovl_open_realfile(const struct file *file)
@@ -129,8 +130,63 @@ static loff_t ovl_llseek(struct file *file, loff_t offset, int whence)
i_size_read(realinode));
}
+static void ovl_file_accessed(struct file *file)
+{
+ struct inode *inode = file_inode(file);
+
+ if ((file->f_flags & O_NOATIME) || !ovl_inode_upper(inode))
+ return;
+
+ ovl_copytimes(inode);
+ touch_atime(&file->f_path);
+}
+
+static rwf_t ovl_iocb_to_rwf(struct kiocb *iocb)
+{
+ int ifl = iocb->ki_flags;
+ rwf_t flags = 0;
+
+ if (ifl & IOCB_NOWAIT)
+ flags |= RWF_NOWAIT;
+ if (ifl & IOCB_HIPRI)
+ flags |= RWF_HIPRI;
+ if (ifl & IOCB_DSYNC)
+ flags |= RWF_DSYNC;
+ if (ifl & IOCB_SYNC)
+ flags |= RWF_SYNC;
+
+ return flags;
+}
+
+static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+ struct file *file = iocb->ki_filp;
+ struct fd real;
+ const struct cred *old_cred;
+ ssize_t ret;
+
+ if (!iov_iter_count(iter))
+ return 0;
+
+ ret = ovl_real_file(file, &real);
+ if (ret)
+ return ret;
+
+ old_cred = ovl_override_creds(file_inode(file)->i_sb);
+ ret = vfs_iter_read(real.file, iter, &iocb->ki_pos,
+ ovl_iocb_to_rwf(iocb));
+ revert_creds(old_cred);
+
+ ovl_file_accessed(file);
+
+ fdput(real);
+
+ return ret;
+}
+
const struct file_operations ovl_file_operations = {
.open = ovl_open,
.release = ovl_release,
.llseek = ovl_llseek,
+ .read_iter = ovl_read_iter,
};
--
2.14.3
Implement stacked writes.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/file.c | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index a19429c5965d..b98204c1c19c 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -184,9 +184,48 @@ static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
return ret;
}
+static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file_inode(file);
+ struct fd real;
+ const struct cred *old_cred;
+ ssize_t ret;
+
+ if (!iov_iter_count(iter))
+ return 0;
+
+ inode_lock(inode);
+ /* Update mode */
+ ovl_copyattr(ovl_inode_real(inode), inode);
+ ret = file_remove_privs(file);
+ if (ret)
+ goto out_unlock;
+
+ ret = ovl_real_file(file, &real);
+ if (ret)
+ goto out_unlock;
+
+ old_cred = ovl_override_creds(file_inode(file)->i_sb);
+ ret = vfs_iter_write(real.file, iter, &iocb->ki_pos,
+ ovl_iocb_to_rwf(iocb));
+ revert_creds(old_cred);
+
+ /* Update size */
+ ovl_copyattr(ovl_inode_real(inode), inode);
+
+ fdput(real);
+
+out_unlock:
+ inode_unlock(inode);
+
+ return ret;
+}
+
const struct file_operations ovl_file_operations = {
.open = ovl_open,
.release = ovl_release,
.llseek = ovl_llseek,
.read_iter = ovl_read_iter,
+ .write_iter = ovl_write_iter,
};
--
2.14.3
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/file.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index cc004ff1b05b..9670e160967e 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -110,6 +110,9 @@ static int ovl_open(struct inode *inode, struct file *file)
if (IS_ERR(realfile))
return PTR_ERR(realfile);
+ /* For O_DIRECT dentry_open() checks f_mapping->a_ops->direct_IO */
+ file->f_mapping = realfile->f_mapping;
+
file->private_data = realfile;
return 0;
--
2.14.3
Implement stacked fsync().
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/file.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index b98204c1c19c..4417527667ff 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -222,10 +222,30 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
return ret;
}
+static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
+{
+ struct fd real;
+ const struct cred *old_cred;
+ int ret;
+
+ ret = ovl_real_file(file, &real);
+ if (ret)
+ return ret;
+
+ old_cred = ovl_override_creds(file_inode(file)->i_sb);
+ ret = vfs_fsync_range(real.file, start, end, datasync);
+ revert_creds(old_cred);
+
+ fdput(real);
+
+ return ret;
+}
+
const struct file_operations ovl_file_operations = {
.open = ovl_open,
.release = ovl_release,
.llseek = ovl_llseek,
.read_iter = ovl_read_iter,
.write_iter = ovl_write_iter,
+ .fsync = ovl_fsync,
};
--
2.14.3
Implement stacked fallocate.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/file.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index f1cc2fdc62c1..05e3e2f80b89 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -268,6 +268,29 @@ static int ovl_mmap(struct file *file, struct vm_area_struct *vma)
return ret;
}
+long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
+{
+ struct inode *inode = file_inode(file);
+ struct fd real;
+ const struct cred *old_cred;
+ int ret;
+
+ ret = ovl_real_file(file, &real);
+ if (ret)
+ return ret;
+
+ old_cred = ovl_override_creds(file_inode(file)->i_sb);
+ ret = vfs_fallocate(real.file, mode, offset, len);
+ revert_creds(old_cred);
+
+ /* Update size */
+ ovl_copyattr(ovl_inode_real(inode), inode);
+
+ fdput(real);
+
+ return ret;
+}
+
const struct file_operations ovl_file_operations = {
.open = ovl_open,
.release = ovl_release,
@@ -276,4 +299,5 @@ const struct file_operations ovl_file_operations = {
.write_iter = ovl_write_iter,
.fsync = ovl_fsync,
.mmap = ovl_mmap,
+ .fallocate = ovl_fallocate,
};
--
2.14.3
This reverts commit 31c3a7069593b072bd57192b63b62f9a7e994e9a.
Re-add functionality dealing with i_writecount on truncate to overlayfs.
This patch shouldn't have any observable effects, since we just re-assert
the writecout that vfs_truncate() already got for us.
This is in preparation for moving overlay functionality out of the VFS.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/inode.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 33635106c5f7..e213bb9823ec 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -39,10 +39,27 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
if (err)
goto out;
+ if (attr->ia_valid & ATTR_SIZE) {
+ struct inode *realinode = d_inode(ovl_dentry_real(dentry));
+
+ err = -ETXTBSY;
+ if (atomic_read(&realinode->i_writecount) < 0)
+ goto out_drop_write;
+ }
+
err = ovl_copy_up(dentry);
if (!err) {
+ struct inode *winode = NULL;
+
upperdentry = ovl_dentry_upper(dentry);
+ if (attr->ia_valid & ATTR_SIZE) {
+ winode = d_inode(upperdentry);
+ err = get_write_access(winode);
+ if (err)
+ goto out_drop_write;
+ }
+
if (attr->ia_valid & (ATTR_KILL_SUID|ATTR_KILL_SGID))
attr->ia_valid &= ~ATTR_MODE;
@@ -53,7 +70,11 @@ int ovl_setattr(struct dentry *dentry, struct iattr *attr)
if (!err)
ovl_copyattr(upperdentry->d_inode, dentry->d_inode);
inode_unlock(upperdentry->d_inode);
+
+ if (winode)
+ put_write_access(winode);
}
+out_drop_write:
ovl_drop_write(dentry);
out:
return err;
--
2.14.3
On inode creation copy certain inode flags from the underlying real inode
to the overlay inode.
This is in preparation for moving overlay functionality out of the VFS.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/overlayfs.h | 7 +++++++
fs/overlayfs/util.c | 1 +
2 files changed, 8 insertions(+)
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index eef720ef0f07..265cb288417a 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -357,6 +357,13 @@ static inline void ovl_copyattr(struct inode *from, struct inode *to)
to->i_ctime = from->i_ctime;
}
+static inline void ovl_copyflags(struct inode *from, struct inode *to)
+{
+ unsigned int mask = S_SYNC | S_IMMUTABLE | S_APPEND | S_NOATIME;
+
+ inode_set_flags(to, from->i_flags & mask, mask);
+}
+
/* dir.c */
extern const struct inode_operations ovl_dir_inode_operations;
struct dentry *ovl_lookup_temp(struct dentry *workdir);
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 11e62e70733a..586f8d7440bb 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -310,6 +310,7 @@ void ovl_inode_init(struct inode *inode, struct dentry *upperdentry,
OVL_I(inode)->lower = igrab(d_inode(lowerdentry));
ovl_copyattr(realinode, inode);
+ ovl_copyflags(realinode, inode);
if (!inode->i_ino)
inode->i_ino = realinode->i_ino;
}
--
2.14.3
Copy i_size of the underlying inode to the overlay inode in ovl_copyattr().
This is in preparation for stacking I/O operations on overlay files.
This patch shouldn't have any observable effect.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/overlayfs.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index 265cb288417a..10f5f3bf9d96 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -9,6 +9,7 @@
#include <linux/kernel.h>
#include <linux/uuid.h>
+#include <linux/fs.h>
#include "ovl_entry.h"
enum ovl_path_type {
@@ -355,6 +356,7 @@ static inline void ovl_copyattr(struct inode *from, struct inode *to)
to->i_atime = from->i_atime;
to->i_mtime = from->i_mtime;
to->i_ctime = from->i_ctime;
+ i_size_write(to, i_size_read(from));
}
static inline void ovl_copyflags(struct inode *from, struct inode *to)
--
2.14.3
Copy up mtime and ctime to overlay inode after times in real object are
modified. Be careful not to dirty cachelines when not necessary.
This is in preparation for moving overlay functionality out of the VFS.
This patch shouldn't have any observable effect.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/dir.c | 5 +++++
fs/overlayfs/inode.c | 1 +
fs/overlayfs/overlayfs.h | 7 +++++++
fs/overlayfs/util.c | 19 +++++++++++++++++++
4 files changed, 32 insertions(+)
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 839709c7803a..cd0fa2363723 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -507,6 +507,7 @@ static int ovl_create_or_link(struct dentry *dentry, struct inode *inode,
else
err = ovl_create_over_whiteout(dentry, inode, attr,
hardlink);
+ ovl_copytimes_with_parent(dentry);
}
out_revert_creds:
revert_creds(old_cred);
@@ -768,6 +769,7 @@ static int ovl_do_remove(struct dentry *dentry, bool is_dir)
drop_nlink(dentry->d_inode);
}
ovl_nlink_end(dentry, locked);
+ ovl_copytimes_with_parent(dentry);
out_drop_write:
ovl_drop_write(dentry);
out:
@@ -1079,6 +1081,9 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
ovl_dentry_version_inc(new->d_parent, ovl_type_origin(old) ||
(d_inode(new) && ovl_type_origin(new)));
+ ovl_copytimes_with_parent(old);
+ ovl_copytimes_with_parent(new);
+
out_dput:
dput(newdentry);
out_dput_old:
diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
index 6e3815fb006b..33635106c5f7 100644
--- a/fs/overlayfs/inode.c
+++ b/fs/overlayfs/inode.c
@@ -303,6 +303,7 @@ int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name,
err = vfs_removexattr(realdentry, name);
}
revert_creds(old_cred);
+ ovl_copytimes(d_inode(dentry));
out_drop_write:
ovl_drop_write(dentry);
diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
index e0b7de799f6b..eef720ef0f07 100644
--- a/fs/overlayfs/overlayfs.h
+++ b/fs/overlayfs/overlayfs.h
@@ -258,6 +258,13 @@ bool ovl_need_index(struct dentry *dentry);
int ovl_nlink_start(struct dentry *dentry, bool *locked);
void ovl_nlink_end(struct dentry *dentry, bool locked);
int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir);
+void ovl_copytimes(struct inode *inode);
+
+static inline void ovl_copytimes_with_parent(struct dentry *dentry)
+{
+ ovl_copytimes(d_inode(dentry));
+ ovl_copytimes(d_inode(dentry->d_parent));
+}
static inline bool ovl_is_impuredir(struct dentry *dentry)
{
diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
index 6f1078028c66..11e62e70733a 100644
--- a/fs/overlayfs/util.c
+++ b/fs/overlayfs/util.c
@@ -675,3 +675,22 @@ int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir)
pr_err("overlayfs: failed to lock workdir+upperdir\n");
return -EIO;
}
+
+void ovl_copytimes(struct inode *inode)
+{
+ struct inode *upperinode;
+
+ if (!inode)
+ return;
+
+ upperinode = ovl_inode_upper(inode);
+
+ if (!upperinode)
+ return;
+
+ if ((!timespec_equal(&inode->i_mtime, &upperinode->i_mtime) ||
+ !timespec_equal(&inode->i_ctime, &upperinode->i_ctime))) {
+ inode->i_mtime = upperinode->i_mtime;
+ inode->i_ctime = upperinode->i_ctime;
+ }
+}
--
2.14.3
Stacking file operations in overlay will store an extra open file for each
overlay file opened.
The overhead is just that of "struct file" which is about 256bytes, because
overlay already pins an extra dentry and inode when the file is open, which
add up to a much larger overhead.
For fear of breaking working setups, don't start accounting the extra file.
The implementation adds a bool argument to path_open() to control whether
the returned file is to be accounted or not. If the file is not accounted,
f_mode will contain FMODE_NOACCOUNT, so that when freeing the file the
count is not decremented.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/file_table.c | 13 +++++++++----
fs/internal.h | 7 ++++++-
fs/open.c | 10 +++++-----
include/linux/fs.h | 5 ++++-
4 files changed, 24 insertions(+), 11 deletions(-)
diff --git a/fs/file_table.c b/fs/file_table.c
index 7ec0b3e5f05d..60376bfa04cf 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -51,7 +51,8 @@ static void file_free_rcu(struct rcu_head *head)
static inline void file_free(struct file *f)
{
- percpu_counter_dec(&nr_files);
+ if (!(f->f_mode & FMODE_NOACCOUNT))
+ percpu_counter_dec(&nr_files);
call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
}
@@ -100,7 +101,7 @@ int proc_nr_files(struct ctl_table *table, int write,
* done, you will imbalance int the mount's writer count
* and a warning at __fput() time.
*/
-struct file *get_empty_filp(void)
+struct file *__get_empty_filp(bool account)
{
const struct cred *cred = current_cred();
static long old_max;
@@ -110,7 +111,8 @@ struct file *get_empty_filp(void)
/*
* Privileged users can go above max_files
*/
- if (get_nr_files() >= files_stat.max_files && !capable(CAP_SYS_ADMIN)) {
+ if (account &&
+ get_nr_files() >= files_stat.max_files && !capable(CAP_SYS_ADMIN)) {
/*
* percpu_counters are inaccurate. Do an expensive check before
* we go and fail.
@@ -123,7 +125,10 @@ struct file *get_empty_filp(void)
if (unlikely(!f))
return ERR_PTR(-ENOMEM);
- percpu_counter_inc(&nr_files);
+ if (account)
+ percpu_counter_inc(&nr_files);
+ else
+ f->f_mode = FMODE_NOACCOUNT;
f->f_cred = get_cred(cred);
error = security_file_alloc(f);
if (unlikely(error)) {
diff --git a/fs/internal.h b/fs/internal.h
index df262f41a0ef..3319bf39e339 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -85,7 +85,12 @@ extern void chroot_fs_refs(const struct path *, const struct path *);
/*
* file_table.c
*/
-extern struct file *get_empty_filp(void);
+extern struct file *__get_empty_filp(bool account);
+
+static inline struct file *get_empty_filp(void)
+{
+ return __get_empty_filp(true);
+}
/*
* super.c
diff --git a/fs/open.c b/fs/open.c
index 8c315f9ec2f6..d509705d5740 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -700,8 +700,8 @@ static int do_dentry_open(struct file *f,
static const struct file_operations empty_fops = {};
int error;
- f->f_mode = OPEN_FMODE(f->f_flags) | FMODE_LSEEK |
- FMODE_PREAD | FMODE_PWRITE;
+ f->f_mode = (f->f_mode & FMODE_NOACCOUNT) | OPEN_FMODE(f->f_flags) |
+ FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
path_get(&f->f_path);
f->f_inode = inode;
@@ -711,7 +711,7 @@ static int do_dentry_open(struct file *f,
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
if (unlikely(f->f_flags & O_PATH)) {
- f->f_mode = FMODE_PATH;
+ f->f_mode = (f->f_mode & FMODE_NOACCOUNT) | FMODE_PATH;
f->f_op = &empty_fops;
return 0;
}
@@ -867,12 +867,12 @@ int vfs_open(const struct path *path, struct file *file,
}
struct file *path_open(const struct path *path, int flags, struct inode *inode,
- const struct cred *cred)
+ const struct cred *cred, bool account)
{
struct file *file;
int retval;
- file = get_empty_filp();
+ file = __get_empty_filp(account);
if (!IS_ERR(file)) {
file->f_flags = flags;
file->f_path = *path;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c6d7e3b00163..0b215faa30ae 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -152,6 +152,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
/* File is capable of returning -EAGAIN if I/O will block */
#define FMODE_NOWAIT ((__force fmode_t)0x8000000)
+/* File does not contribute to nr_files count */
+#define FMODE_NOACCOUNT ((__force fmode_t)0x10000000)
+
/*
* Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
* that indicates that they should check the contents of the iovec are
@@ -2402,7 +2405,7 @@ extern struct file *file_open_root(struct dentry *, struct vfsmount *,
const char *, int, umode_t);
extern struct file * dentry_open(const struct path *, int, const struct cred *);
extern struct file *path_open(const struct path *, int, struct inode *,
- const struct cred *);
+ const struct cred *, bool);
extern int filp_close(struct file *, fl_owner_t id);
extern struct filename *getname_flags(const char __user *, int, int *);
--
2.14.3
Extract vfs_dedupe_file_range_one() helper to deal with a single dedup
request.
Export this helper to modules, so it can be used by overlayfs to stack the
dedupe method.
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/read_write.c | 90 ++++++++++++++++++++++++++++++------------------------
include/linux/fs.h | 3 ++
2 files changed, 53 insertions(+), 40 deletions(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index f8547b82dfb3..82aa0d32d0cd 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1937,6 +1937,44 @@ int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
}
EXPORT_SYMBOL(vfs_dedupe_file_range_compare);
+ssize_t vfs_dedupe_file_range_one(struct file *src_file, u64 src_pos, u64 len,
+ struct file *dst_file, u64 dst_pos)
+{
+ ssize_t ret;
+
+ ret = mnt_want_write_file(dst_file);
+ if (ret)
+ return ret;
+
+ ret = clone_verify_area(dst_file, dst_pos, len, true);
+ if (ret < 0)
+ goto out_drop_write;
+
+ ret = -EINVAL;
+ if (!(capable(CAP_SYS_ADMIN) || (dst_file->f_mode & FMODE_WRITE)))
+ goto out_drop_write;
+
+ ret = -EXDEV;
+ if (src_file->f_path.mnt != dst_file->f_path.mnt)
+ goto out_drop_write;
+
+ ret = -EISDIR;
+ if (S_ISDIR(file_inode(dst_file)->i_mode))
+ goto out_drop_write;
+
+ ret = -EINVAL;
+ if (!dst_file->f_op->dedupe_file_range)
+ goto out_drop_write;
+
+ ret = dst_file->f_op->dedupe_file_range(src_file, src_pos, len,
+ dst_file, dst_pos);
+out_drop_write:
+ mnt_drop_write_file(dst_file);
+
+ return ret;
+}
+EXPORT_SYMBOL(vfs_dedupe_file_range_one);
+
int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
{
struct file_dedupe_range_info *info;
@@ -1945,10 +1983,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
u64 len;
int i;
int ret;
- bool is_admin = capable(CAP_SYS_ADMIN);
u16 count = same->dest_count;
- struct file *dst_file;
- loff_t dst_off;
ssize_t deduped;
if (!(file->f_mode & FMODE_READ))
@@ -1983,54 +2018,29 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
}
for (i = 0, info = same->info; i < count; i++, info++) {
- struct inode *dst;
struct fd dst_fd = fdget(info->dest_fd);
+ struct file *dst_file = dst_fd.file;
- dst_file = dst_fd.file;
if (!dst_file) {
info->status = -EBADF;
goto next_loop;
}
- dst = file_inode(dst_file);
-
- ret = mnt_want_write_file(dst_file);
- if (ret) {
- info->status = ret;
- goto next_loop;
- }
-
- dst_off = info->dest_offset;
- ret = clone_verify_area(dst_file, dst_off, len, true);
- if (ret < 0) {
- info->status = ret;
- goto next_file;
- }
- ret = 0;
if (info->reserved) {
info->status = -EINVAL;
- } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
- info->status = -EINVAL;
- } else if (file->f_path.mnt != dst_file->f_path.mnt) {
- info->status = -EXDEV;
- } else if (S_ISDIR(dst->i_mode)) {
- info->status = -EISDIR;
- } else if (dst_file->f_op->dedupe_file_range == NULL) {
- info->status = -EINVAL;
- } else {
- deduped = dst_file->f_op->dedupe_file_range(file, off,
- len, dst_file,
- info->dest_offset);
- if (deduped == -EBADE)
- info->status = FILE_DEDUPE_RANGE_DIFFERS;
- else if (deduped < 0)
- info->status = deduped;
- else
- info->bytes_deduped += deduped;
+ goto next_loop;
}
-next_file:
- mnt_drop_write_file(dst_file);
+ deduped = vfs_dedupe_file_range_one(file, off, len,
+ dst_file,
+ info->dest_offset);
+ if (deduped == -EBADE)
+ info->status = FILE_DEDUPE_RANGE_DIFFERS;
+ else if (deduped < 0)
+ info->status = deduped;
+ else
+ info->bytes_deduped += deduped;
+
next_loop:
fdput(dst_fd);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c6baf767619e..f0f87f2beb79 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1809,6 +1809,9 @@ extern int vfs_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
loff_t len, bool *is_same);
extern int vfs_dedupe_file_range(struct file *file,
struct file_dedupe_range *same);
+extern ssize_t vfs_dedupe_file_range_one(struct file *src_file, u64 src_pos,
+ u64 len, struct file *dst_file,
+ u64 dst_pos);
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
--
2.14.3
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/overlayfs/super.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index e8551c97de51..ad6a5baf226b 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -97,6 +97,10 @@ static struct dentry *ovl_d_real(struct dentry *dentry,
struct dentry *real;
int err;
+ /* It's an overlay file */
+ if (inode && d_inode(dentry) == inode)
+ return dentry;
+
if (flags & D_REAL_UPPER)
return ovl_dentry_upper(dentry);
--
2.14.3
Currently opening an overlay file results in:
- the real file on the underlying layer being opened
- f_path being set to the overlay {mount, dentry} pair
This patch adds a new helper that allows the above to be explicitly
performed. I.e. it's the same as dentry_open(), except the underlying
inode to open is given as a separate argument.
This is in preparation for stacking I/O operations on overlay files.
Later, when implicit opening is removed, dentry_open() can be implemented
by just calling path_open().
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/open.c | 27 +++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
2 files changed, 29 insertions(+)
diff --git a/fs/open.c b/fs/open.c
index 7ea118471dce..8c315f9ec2f6 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -866,6 +866,33 @@ int vfs_open(const struct path *path, struct file *file,
return do_dentry_open(file, d_backing_inode(dentry), NULL, cred);
}
+struct file *path_open(const struct path *path, int flags, struct inode *inode,
+ const struct cred *cred)
+{
+ struct file *file;
+ int retval;
+
+ file = get_empty_filp();
+ if (!IS_ERR(file)) {
+ file->f_flags = flags;
+ file->f_path = *path;
+ retval = do_dentry_open(file, inode, NULL, cred);
+ if (retval) {
+ put_filp(file);
+ file = ERR_PTR(retval);
+ } else {
+ retval = open_check_o_direct(file);
+ if (retval) {
+ fput(file);
+ file = ERR_PTR(retval);
+ }
+ }
+ }
+
+ return file;
+}
+EXPORT_SYMBOL(path_open);
+
struct file *dentry_open(const struct path *path, int flags,
const struct cred *cred)
{
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f0f87f2beb79..c6d7e3b00163 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2401,6 +2401,8 @@ extern struct file *filp_open(const char *, int, umode_t);
extern struct file *file_open_root(struct dentry *, struct vfsmount *,
const char *, int, umode_t);
extern struct file * dentry_open(const struct path *, int, const struct cred *);
+extern struct file *path_open(const struct path *, int, struct inode *,
+ const struct cred *);
extern int filp_close(struct file *, fl_owner_t id);
extern struct filename *getname_flags(const char __user *, int, int *);
--
2.14.3
On Thu, Apr 12, 2018 at 05:07:52PM +0200, Miklos Szeredi wrote:
> +ssize_t vfs_dedupe_file_range_one(struct file *src_file, u64 src_pos, u64 len,
> + struct file *dst_file, u64 dst_pos)
Why u64 instead of loff_t?
On Thu, Apr 12, 2018 at 05:07:53PM +0200, Miklos Szeredi wrote:
> Currently opening an overlay file results in:
>
> - the real file on the underlying layer being opened
> - f_path being set to the overlay {mount, dentry} pair
>
> This patch adds a new helper that allows the above to be explicitly
> performed. I.e. it's the same as dentry_open(), except the underlying
> inode to open is given as a separate argument.
>
> This is in preparation for stacking I/O operations on overlay files.
>
> Later, when implicit opening is removed, dentry_open() can be implemented
> by just calling path_open().
>
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> fs/open.c | 27 +++++++++++++++++++++++++++
> include/linux/fs.h | 2 ++
> 2 files changed, 29 insertions(+)
>
> diff --git a/fs/open.c b/fs/open.c
> index 7ea118471dce..8c315f9ec2f6 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -866,6 +866,33 @@ int vfs_open(const struct path *path, struct file *file,
> return do_dentry_open(file, d_backing_inode(dentry), NULL, cred);
> }
>
This really warrants some kernel-doc. Let me try to get this started and
you can fix it:
/**
* path_open() - Open an inode by a particular name.
* @path: The name of the file.
* @flags: The O_ flags used to open this file.
* @inode: The inode to open.
* @cred: The task's credentials used when opening this file.
*
* ...
*
* Context: Process context.
* Return: A pointer to a struct file or an IS_ERR pointer. Cannot return NULL.
*/
> +struct file *path_open(const struct path *path, int flags, struct inode *inode,
> + const struct cred *cred)
> +{
> + struct file *file;
> + int retval;
> +
> + file = get_empty_filp();
> + if (!IS_ERR(file)) {
> + file->f_flags = flags;
> + file->f_path = *path;
> + retval = do_dentry_open(file, inode, NULL, cred);
> + if (retval) {
> + put_filp(file);
> + file = ERR_PTR(retval);
> + } else {
> + retval = open_check_o_direct(file);
> + if (retval) {
> + fput(file);
> + file = ERR_PTR(retval);
> + }
> + }
> + }
> +
> + return file;
I'd find this clearer written like this ...
+ file = get_empty_filp();
+ if (IS_ERR(file))
+ return file;
+
+ file->f_flags = flags;
+ file->f_path = *path;
+ retval = do_dentry_open(file, inode, NULL, cred);
+ if (retval)
+ goto err;
+ retval = open_check_o_direct(file);
+ if (retval)
+ goto err;
+ return file;
+err:
+ put_filp(file);
+ return ERR_PTR(retval);
On Thu, Apr 12, 2018 at 6:25 PM, Matthew Wilcox <[email protected]> wrote:
> On Thu, Apr 12, 2018 at 05:07:52PM +0200, Miklos Szeredi wrote:
>> +ssize_t vfs_dedupe_file_range_one(struct file *src_file, u64 src_pos, u64 len,
>> + struct file *dst_file, u64 dst_pos)
>
> Why u64 instead of loff_t?
Peculiarity of f_op->dedupe_file_range().
I think a cleanup of copyfile/clone/dedupe interfaces would be good.
Perhaps merge them into one? Or just merge copyfile/clone? Even if
not merging it should make sense to make argument types and argument
order the same.
Thanks,
Miklos
On Thu, Apr 12, 2018 at 6:07 PM, Miklos Szeredi <[email protected]> wrote:
> Copy up mtime and ctime to overlay inode after times in real object are
> modified. Be careful not to dirty cachelines when not necessary.
>
> This is in preparation for moving overlay functionality out of the VFS.
>
> This patch shouldn't have any observable effect.
>
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> fs/overlayfs/dir.c | 5 +++++
> fs/overlayfs/inode.c | 1 +
> fs/overlayfs/overlayfs.h | 7 +++++++
> fs/overlayfs/util.c | 19 +++++++++++++++++++
> 4 files changed, 32 insertions(+)
>
> diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
> index 839709c7803a..cd0fa2363723 100644
> --- a/fs/overlayfs/dir.c
> +++ b/fs/overlayfs/dir.c
> @@ -507,6 +507,7 @@ static int ovl_create_or_link(struct dentry *dentry, struct inode *inode,
> else
> err = ovl_create_over_whiteout(dentry, inode, attr,
> hardlink);
> + ovl_copytimes_with_parent(dentry);
> }
> out_revert_creds:
> revert_creds(old_cred);
> @@ -768,6 +769,7 @@ static int ovl_do_remove(struct dentry *dentry, bool is_dir)
> drop_nlink(dentry->d_inode);
> }
> ovl_nlink_end(dentry, locked);
> + ovl_copytimes_with_parent(dentry);
> out_drop_write:
> ovl_drop_write(dentry);
> out:
> @@ -1079,6 +1081,9 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
> ovl_dentry_version_inc(new->d_parent, ovl_type_origin(old) ||
> (d_inode(new) && ovl_type_origin(new)));
>
> + ovl_copytimes_with_parent(old);
> + ovl_copytimes_with_parent(new);
> +
All the ovl_copytimes_with_parent() calls you added can be replaced with a
single call in ovl_dentry_version_inc(). Just need to change its name and
pass it the child instead of the parent dentry.
Thanks,
Amir.
On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
> We can now drop description of the ro/rw inconsistency from the
> documentation.
>
> Also clarify, that now fully standard compliant behavior can be enabled
> with kernel/module/mount options.
>
Very nice!
Is it maybe a good time to tone down this expectation-lowering
phrase from the introduction:
"The result will inevitably fail to look exactly like a normal
filesystem for various technical reasons. The expectation is that
many use cases will be able to ignore these differences."
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> Documentation/filesystems/overlayfs.txt | 64 ++++++++++++++++++++++-----------
> 1 file changed, 43 insertions(+), 21 deletions(-)
>
> diff --git a/Documentation/filesystems/overlayfs.txt b/Documentation/filesystems/overlayfs.txt
> index 961b287ef323..095186080d23 100644
> --- a/Documentation/filesystems/overlayfs.txt
> +++ b/Documentation/filesystems/overlayfs.txt
> @@ -306,27 +306,49 @@ the copied layers will fail the verification of the lower root file handle.
> Non-standard behavior
> ---------------------
>
> -The copy_up operation essentially creates a new, identical file and
> -moves it over to the old name. Any open files referring to this inode
> -will access the old data.
> -
> -The new file may be on a different filesystem, so both st_dev and st_ino
> -of the real file may change. The values of st_dev and st_ino returned by
> -stat(2) on an overlay object are often not the same as the real file
> -stat(2) values to prevent the values from changing on copy_up.
> -
> -Unless "xino" feature is enabled, when overlay layers are not all on the
> -same underlying filesystem, the value of st_dev may be different for two
> -non-directory objects in the same overlay filesystem and the value of
> -st_ino for directory objects may be non persistent and could change even
> -while the overlay filesystem is still mounted.
> -
> -Unless "inode index" feature is enabled, if a file with multiple hard
> -links is copied up, then this will "break" the link. Changes will not be
> -propagated to other names referring to the same inode.
> -
> -Unless "redirect_dir" feature is enabled, rename(2) on a lower or merged
> -directory will fail with EXDEV.
> +Overlayfs can now act as a POSIX compliant filesystem with the following
> +features turned on:
> +
> +1) "redirect_dir"
> +
> +Enabled with the mount option or module option: "redirect_dir=on" or with
> +the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y.
> +
> +If this feature is disabled, then rename(2) on a lower or merged directory
> +will fail with EXDEV ("Invalid cross-device link").
> +
> +2) "inode index"
> +
> +Enabled with the mount option or module option "index=on" or with the
> +kernel config option CONFIG_OVERLAY_FS_INDEX=y.
> +
> +If this feature is disabled and a file with multiple hard links is copied
> +up, then this will "break" the link. Changes will not be propagated to
> +other names referring to the same inode.
> +
> +3) "xino"
> +
> +Enabled with the mount option "xino=auto" or "xino=on", with the module
> +option "xino_auto=on" or with the kernel config option
> +CONFIG_OVERLAY_FS_XINO_AUTO=y. Also implicitly enabled by using the same
> +underlying filesystem for all layers making up the overlay.
> +
> +If this feature is disabled or the underlying filesystem doesn't have
> +enough free bits in the inode number, then overlayfs will not be able to
> +guarantee that the values of st_ino and st_dev returned by stat(2) and the
> +value of d_ino returned by readdir(3) will act like on a normal filesystem.
> +E.g. the value of st_dev may be different for two objects in the same
> +overlay filesystem and the value of st_ino for directory objects may not be
> +persistent and could change even while the overlay filesystem is mounted.
> +
> +4) "copy_up_shared"
> +
> +Enabled with the mount option or module option "copy_up_shared=on" or with
> +the kernel config option CONFIG_OVERLAY_FS_COPY_UP_SHARED=y.
> +
> +If this feature is disabled, then a memory mapping created with MAP_SHARED
> +might contain stale data if the file has been copied up and modified in the
> +meantime.
>
>
> Changes to underlying filesystems
> --
> 2.14.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-unionfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
> Implement stacked reading.
>
I couldn't decipher the meaning of "readd" in the subject of this
and other file ops pacthes??
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> fs/overlayfs/file.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 56 insertions(+)
>
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index 409b542ff30c..a19429c5965d 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -9,6 +9,7 @@
> #include <linux/cred.h>
> #include <linux/file.h>
> #include <linux/xattr.h>
> +#include <linux/uio.h>
> #include "overlayfs.h"
>
> static struct file *ovl_open_realfile(const struct file *file)
> @@ -129,8 +130,63 @@ static loff_t ovl_llseek(struct file *file, loff_t offset, int whence)
> i_size_read(realinode));
> }
>
> +static void ovl_file_accessed(struct file *file)
> +{
> + struct inode *inode = file_inode(file);
> +
> + if ((file->f_flags & O_NOATIME) || !ovl_inode_upper(inode))
> + return;
> +
> + ovl_copytimes(inode);
> + touch_atime(&file->f_path);
> +}
> +
> +static rwf_t ovl_iocb_to_rwf(struct kiocb *iocb)
> +{
> + int ifl = iocb->ki_flags;
> + rwf_t flags = 0;
> +
> + if (ifl & IOCB_NOWAIT)
> + flags |= RWF_NOWAIT;
> + if (ifl & IOCB_HIPRI)
> + flags |= RWF_HIPRI;
> + if (ifl & IOCB_DSYNC)
> + flags |= RWF_DSYNC;
> + if (ifl & IOCB_SYNC)
> + flags |= RWF_SYNC;
> +
> + return flags;
> +}
> +
> +static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *iter)
> +{
> + struct file *file = iocb->ki_filp;
> + struct fd real;
> + const struct cred *old_cred;
> + ssize_t ret;
> +
> + if (!iov_iter_count(iter))
> + return 0;
> +
> + ret = ovl_real_file(file, &real);
> + if (ret)
> + return ret;
> +
> + old_cred = ovl_override_creds(file_inode(file)->i_sb);
> + ret = vfs_iter_read(real.file, iter, &iocb->ki_pos,
> + ovl_iocb_to_rwf(iocb));
> + revert_creds(old_cred);
> +
> + ovl_file_accessed(file);
> +
> + fdput(real);
I find it confusing that the name of ovl_real_file() does not suggest it
may take a reference, so this fdput() looks unbalanced.
All other ovl_XXX_{real,upper,lower} helpers do not take a reference.
Perhaps something along the lines of ovl_file_real_fdget().
Thanks,
Amir.
On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
> This reverts commit cd91304e7190b4c4802f8e413ab2214b233e0260.
>
> Overlayfs no longer relies on the vfs correct atime handling.
>
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> fs/inode.c | 21 ++++-----------------
> fs/overlayfs/super.c | 3 ---
> include/linux/dcache.h | 3 ---
> 3 files changed, 4 insertions(+), 23 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index ef362364d396..163715de8cb2 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1570,24 +1570,11 @@ EXPORT_SYMBOL(bmap);
> static void update_ovl_inode_times(struct dentry *dentry, struct inode *inode,
> bool rcu)
> {
> - struct dentry *upperdentry;
> + if (!rcu) {
> + struct inode *realinode = d_real_inode(dentry);
>
> - /*
> - * Nothing to do if in rcu or if non-overlayfs
> - */
> - if (rcu || likely(!(dentry->d_flags & DCACHE_OP_REAL)))
> - return;
> -
> - upperdentry = d_real(dentry, NULL, 0, D_REAL_UPPER);
> -
> - /*
> - * If file is on lower then we can't update atime, so no worries about
> - * stale mtime/ctime.
> - */
> - if (upperdentry) {
> - struct inode *realinode = d_inode(upperdentry);
> -
> - if ((!timespec_equal(&inode->i_mtime, &realinode->i_mtime) ||
> + if (unlikely(inode != realinode) &&
> + (!timespec_equal(&inode->i_mtime, &realinode->i_mtime) ||
> !timespec_equal(&inode->i_ctime, &realinode->i_ctime))) {
> inode->i_mtime = realinode->i_mtime;
> inode->i_ctime = realinode->i_ctime;
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index c3d8c7ea180f..006dc70d7425 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -107,9 +107,6 @@ static struct dentry *ovl_d_real(struct dentry *dentry,
> if (inode && d_inode(dentry) == inode)
> return dentry;
>
> - if (flags & D_REAL_UPPER)
> - return ovl_dentry_upper(dentry);
> -
> if (!d_is_reg(dentry)) {
> if (!inode || inode == d_inode(dentry))
> return dentry;
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 82a99d366aec..4c7ab11c627a 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -565,9 +565,6 @@ static inline struct dentry *d_backing_dentry(struct dentry *upper)
> return upper;
> }
>
> -/* d_real() flags */
> -#define D_REAL_UPPER 0x2 /* return upper dentry or NULL if non-upper */
> -
Premature removal of constant. Still in use by may_write_real() at this point.
Thanks,
Amir.
On Thu, Apr 12, 2018 at 05:07:55PM +0200, Miklos Szeredi wrote:
> Copy up mtime and ctime to overlay inode after times in real object are
> modified. Be careful not to dirty cachelines when not necessary.
>
> This is in preparation for moving overlay functionality out of the VFS.
>
> This patch shouldn't have any observable effect.
So there are bunch of operations which will change inode ctime. I had
missed this in my metadata only copy up patch series and that would broken
atime updates in some cases.
Vivek
>
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> fs/overlayfs/dir.c | 5 +++++
> fs/overlayfs/inode.c | 1 +
> fs/overlayfs/overlayfs.h | 7 +++++++
> fs/overlayfs/util.c | 19 +++++++++++++++++++
> 4 files changed, 32 insertions(+)
>
> diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
> index 839709c7803a..cd0fa2363723 100644
> --- a/fs/overlayfs/dir.c
> +++ b/fs/overlayfs/dir.c
> @@ -507,6 +507,7 @@ static int ovl_create_or_link(struct dentry *dentry, struct inode *inode,
> else
> err = ovl_create_over_whiteout(dentry, inode, attr,
> hardlink);
> + ovl_copytimes_with_parent(dentry);
> }
> out_revert_creds:
> revert_creds(old_cred);
> @@ -768,6 +769,7 @@ static int ovl_do_remove(struct dentry *dentry, bool is_dir)
> drop_nlink(dentry->d_inode);
> }
> ovl_nlink_end(dentry, locked);
> + ovl_copytimes_with_parent(dentry);
> out_drop_write:
> ovl_drop_write(dentry);
> out:
> @@ -1079,6 +1081,9 @@ static int ovl_rename(struct inode *olddir, struct dentry *old,
> ovl_dentry_version_inc(new->d_parent, ovl_type_origin(old) ||
> (d_inode(new) && ovl_type_origin(new)));
>
> + ovl_copytimes_with_parent(old);
> + ovl_copytimes_with_parent(new);
> +
> out_dput:
> dput(newdentry);
> out_dput_old:
> diff --git a/fs/overlayfs/inode.c b/fs/overlayfs/inode.c
> index 6e3815fb006b..33635106c5f7 100644
> --- a/fs/overlayfs/inode.c
> +++ b/fs/overlayfs/inode.c
> @@ -303,6 +303,7 @@ int ovl_xattr_set(struct dentry *dentry, struct inode *inode, const char *name,
> err = vfs_removexattr(realdentry, name);
> }
> revert_creds(old_cred);
> + ovl_copytimes(d_inode(dentry));
>
> out_drop_write:
> ovl_drop_write(dentry);
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index e0b7de799f6b..eef720ef0f07 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -258,6 +258,13 @@ bool ovl_need_index(struct dentry *dentry);
> int ovl_nlink_start(struct dentry *dentry, bool *locked);
> void ovl_nlink_end(struct dentry *dentry, bool locked);
> int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir);
> +void ovl_copytimes(struct inode *inode);
> +
> +static inline void ovl_copytimes_with_parent(struct dentry *dentry)
> +{
> + ovl_copytimes(d_inode(dentry));
> + ovl_copytimes(d_inode(dentry->d_parent));
> +}
>
> static inline bool ovl_is_impuredir(struct dentry *dentry)
> {
> diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c
> index 6f1078028c66..11e62e70733a 100644
> --- a/fs/overlayfs/util.c
> +++ b/fs/overlayfs/util.c
> @@ -675,3 +675,22 @@ int ovl_lock_rename_workdir(struct dentry *workdir, struct dentry *upperdir)
> pr_err("overlayfs: failed to lock workdir+upperdir\n");
> return -EIO;
> }
> +
> +void ovl_copytimes(struct inode *inode)
> +{
> + struct inode *upperinode;
> +
> + if (!inode)
> + return;
> +
> + upperinode = ovl_inode_upper(inode);
> +
> + if (!upperinode)
> + return;
> +
> + if ((!timespec_equal(&inode->i_mtime, &upperinode->i_mtime) ||
> + !timespec_equal(&inode->i_ctime, &upperinode->i_ctime))) {
> + inode->i_mtime = upperinode->i_mtime;
> + inode->i_ctime = upperinode->i_ctime;
> + }
> +}
> --
> 2.14.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-unionfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
> Implement FS_IOC_GETFLAGS and FS_IOC_SETFLAGS.
>
> Needs vfs_ioctl() exported to modules.
>
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> fs/internal.h | 1 -
> fs/ioctl.c | 1 +
> fs/overlayfs/file.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/fs.h | 2 ++
> 4 files changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/fs/internal.h b/fs/internal.h
> index 3319bf39e339..d5108d9c6a2f 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -176,7 +176,6 @@ extern const struct dentry_operations ns_dentry_operations;
> */
> extern int do_vfs_ioctl(struct file *file, unsigned int fd, unsigned int cmd,
> unsigned long arg);
> -extern long vfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
>
> /*
> * iomap support:
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 5ace7efb0d04..696f4c46a868 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -49,6 +49,7 @@ long vfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> out:
> return error;
> }
> +EXPORT_SYMBOL(vfs_ioctl);
>
> static int ioctl_fibmap(struct file *filp, int __user *p)
> {
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index 05e3e2f80b89..cc004ff1b05b 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -8,6 +8,7 @@
>
> #include <linux/cred.h>
> #include <linux/file.h>
> +#include <linux/mount.h>
> #include <linux/xattr.h>
> #include <linux/uio.h>
> #include "overlayfs.h"
> @@ -291,6 +292,63 @@ long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> return ret;
> }
>
> +static long ovl_real_ioctl(struct file *file, unsigned int cmd,
> + unsigned long arg)
> +{
> + struct fd real;
> + const struct cred *old_cred;
> + long ret;
> +
> + ret = ovl_real_file(file, &real);
> + if (ret)
> + return ret;
> +
> + old_cred = ovl_override_creds(file_inode(file)->i_sb);
> + ret = vfs_ioctl(real.file, cmd, arg);
> + revert_creds(old_cred);
> +
> + fdput(real);
> +
> + return ret;
> +}
> +
> +long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> + long ret;
> + struct inode *inode = file_inode(file);
> +
> + switch (cmd) {
> + case FS_IOC_GETFLAGS:
> + ret = ovl_real_ioctl(file, cmd, arg);
> + break;
> +
> + case FS_IOC_SETFLAGS:
> + if (!inode_owner_or_capable(inode))
> + return -EACCES;
> +
> + ret = mnt_want_write_file(file);
> + if (ret)
> + return ret;
> +
> + ret = ovl_copy_up(file_dentry(file));
> + if (!ret) {
> + ret = ovl_real_ioctl(file, cmd, arg);
> +
> + inode_lock(inode);
> + ovl_copyflags(ovl_inode_real(inode), inode);
> + inode_unlock(inode);
> + }
> +
> + mnt_drop_write_file(file);
> + break;
> +
> + default:
> + ret = -ENOTTY;
I am wondering out loud.
This is a change of behavior that fs specific ioctls cannot be executed
on overlay file - arguably a good change of behavior, but still a change
that applications may got dependent on.
Would it have been better to opt-in for this change by a more generic
config/mount options, for example "consistent_fd" , instead of
"copy_up_shared" and then we can choose whether or not to
pass though unknown ioctls to real file.
I know we removed the want_write_file() protection from VFS, but
still, pass through of ioctls was the legacy behavior. Thoughts?
I don't mind to wait and see if someone shouts.
Thanks,
Amir.
On Thu, Apr 12, 2018 at 05:08:15PM +0200, Miklos Szeredi wrote:
> This reverts commit cd91304e7190b4c4802f8e413ab2214b233e0260.
>
> Overlayfs no longer relies on the vfs correct atime handling.
>
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> fs/inode.c | 21 ++++-----------------
> fs/overlayfs/super.c | 3 ---
> include/linux/dcache.h | 3 ---
> 3 files changed, 4 insertions(+), 23 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index ef362364d396..163715de8cb2 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1570,24 +1570,11 @@ EXPORT_SYMBOL(bmap);
> static void update_ovl_inode_times(struct dentry *dentry, struct inode *inode,
> bool rcu)
> {
> - struct dentry *upperdentry;
> + if (!rcu) {
> + struct inode *realinode = d_real_inode(dentry);
>
> - /*
> - * Nothing to do if in rcu or if non-overlayfs
> - */
> - if (rcu || likely(!(dentry->d_flags & DCACHE_OP_REAL)))
> - return;
> -
> - upperdentry = d_real(dentry, NULL, 0, D_REAL_UPPER);
> -
> - /*
> - * If file is on lower then we can't update atime, so no worries about
> - * stale mtime/ctime.
> - */
> - if (upperdentry) {
> - struct inode *realinode = d_inode(upperdentry);
> -
> - if ((!timespec_equal(&inode->i_mtime, &realinode->i_mtime) ||
> + if (unlikely(inode != realinode) &&
> + (!timespec_equal(&inode->i_mtime, &realinode->i_mtime) ||
> !timespec_equal(&inode->i_ctime, &realinode->i_ctime))) {
> inode->i_mtime = realinode->i_mtime;
> inode->i_ctime = realinode->i_ctime;
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index c3d8c7ea180f..006dc70d7425 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -107,9 +107,6 @@ static struct dentry *ovl_d_real(struct dentry *dentry,
> if (inode && d_inode(dentry) == inode)
> return dentry;
>
> - if (flags & D_REAL_UPPER)
> - return ovl_dentry_upper(dentry);
> -
> if (!d_is_reg(dentry)) {
> if (!inode || inode == d_inode(dentry))
> return dentry;
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index 82a99d366aec..4c7ab11c627a 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -565,9 +565,6 @@ static inline struct dentry *d_backing_dentry(struct dentry *upper)
> return upper;
> }
>
> -/* d_real() flags */
> -#define D_REAL_UPPER 0x2 /* return upper dentry or NULL if non-upper */
Good to see this go away. It was a major headache for metacopy only
patches.
Vivek
> -
> /**
> * d_real - Return the real dentry
> * @dentry: the dentry to query
> --
> 2.14.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-unionfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
> Implement FS_IOC_GETFLAGS and FS_IOC_SETFLAGS.
>
> Needs vfs_ioctl() exported to modules.
>
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> fs/internal.h | 1 -
> fs/ioctl.c | 1 +
> fs/overlayfs/file.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/fs.h | 2 ++
> 4 files changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/fs/internal.h b/fs/internal.h
> index 3319bf39e339..d5108d9c6a2f 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -176,7 +176,6 @@ extern const struct dentry_operations ns_dentry_operations;
> */
> extern int do_vfs_ioctl(struct file *file, unsigned int fd, unsigned int cmd,
> unsigned long arg);
> -extern long vfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
>
> /*
> * iomap support:
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 5ace7efb0d04..696f4c46a868 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -49,6 +49,7 @@ long vfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> out:
> return error;
> }
> +EXPORT_SYMBOL(vfs_ioctl);
>
> static int ioctl_fibmap(struct file *filp, int __user *p)
> {
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index 05e3e2f80b89..cc004ff1b05b 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -8,6 +8,7 @@
>
> #include <linux/cred.h>
> #include <linux/file.h>
> +#include <linux/mount.h>
> #include <linux/xattr.h>
> #include <linux/uio.h>
> #include "overlayfs.h"
> @@ -291,6 +292,63 @@ long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> return ret;
> }
>
> +static long ovl_real_ioctl(struct file *file, unsigned int cmd,
> + unsigned long arg)
> +{
> + struct fd real;
> + const struct cred *old_cred;
> + long ret;
> +
> + ret = ovl_real_file(file, &real);
> + if (ret)
> + return ret;
> +
> + old_cred = ovl_override_creds(file_inode(file)->i_sb);
> + ret = vfs_ioctl(real.file, cmd, arg);
> + revert_creds(old_cred);
> +
> + fdput(real);
> +
> + return ret;
> +}
> +
> +long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> + long ret;
> + struct inode *inode = file_inode(file);
> +
> + switch (cmd) {
> + case FS_IOC_GETFLAGS:
> + ret = ovl_real_ioctl(file, cmd, arg);
> + break;
> +
> + case FS_IOC_SETFLAGS:
> + if (!inode_owner_or_capable(inode))
> + return -EACCES;
> +
> + ret = mnt_want_write_file(file);
> + if (ret)
> + return ret;
> +
> + ret = ovl_copy_up(file_dentry(file));
> + if (!ret) {
> + ret = ovl_real_ioctl(file, cmd, arg);
> +
I got this lockdep splat with overlayfs-rorw and overlay/040, but I don't
see the problem in the patch:
overlay/040 [19:29:01][ 7.414427]
[ 7.415349] ============================================
[ 7.417863] WARNING: possible recursive locking detected
[ 7.419652] 4.16.0-rc7-xfstests-00054-g1b18a246f764-dirty #3233 Not tainted
[ 7.421517] --------------------------------------------
[ 7.422736] chattr/2376 is trying to acquire lock:
[ 7.423843] (sb_writers#10){.+.+}, at: [<000000003170ac81>]
mnt_want_write_file+0x21/0x4a
[ 7.425683]
[ 7.425683] but task is already holding lock:
[ 7.427397] (sb_writers#10){.+.+}, at: [<000000003170ac81>]
mnt_want_write_file+0x21/0x4a
[ 7.430180]
[ 7.430180] other info that might help us debug this:
[ 7.432511] Possible unsafe locking scenario:
[ 7.432511]
[ 7.433860] CPU0
[ 7.434424] ----
[ 7.434987] lock(sb_writers#10);
[ 7.435768] lock(sb_writers#10);
[ 7.436692]
[ 7.436692] *** DEADLOCK ***
[ 7.436692]
[ 7.438460] May be due to missing lock nesting notation
[ 7.438460]
[ 7.440477] 1 lock held by chattr/2376:
[ 7.441876] #0: (sb_writers#10){.+.+}, at: [<000000003170ac81>]
mnt_want_write_file+0x21/0x4a
[ 7.444537]
[ 7.444537] stack backtrace:
[ 7.445881] CPU: 1 PID: 2376 Comm: chattr Not tainted
4.16.0-rc7-xfstests-00054-g1b18a246f764-dirty #3233
[ 7.449594] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 7.453121] Call Trace:
[ 7.453945] dump_stack+0x7c/0xb2
[ 7.454957] validate_chain.isra.24+0x6da/0x8af
[ 7.456341] __lock_acquire+0x5e6/0x67b
[ 7.457643] ? __lock_acquire+0x5e6/0x67b
[ 7.458879] lock_acquire+0x139/0x1dd
[ 7.459988] ? mnt_want_write_file+0x21/0x4a
[ 7.461317] __sb_start_write+0x91/0x163
[ 7.462500] ? mnt_want_write_file+0x21/0x4a
[ 7.463822] mnt_want_write_file+0x21/0x4a
[ 7.465091] xfs_ioc_setxflags+0x70/0xe5
[ 7.466266] xfs_file_ioctl+0x4a7/0xa90
[ 7.467452] ? __lock_acquire+0x5e6/0x67b
[ 7.468681] ? terminate_walk+0x20/0xd9
[ 7.469835] ? __lock_is_held+0x40/0x71
[ 7.471012] vfs_ioctl+0x1e/0x2b
[ 7.472008] ovl_real_ioctl+0x45/0x71
[ 7.473122] ovl_ioctl+0x9a/0xf2
[ 7.474114] vfs_ioctl+0x1e/0x2b
[ 7.475149] do_vfs_ioctl+0x579/0x5f1
[ 7.476341] ? entry_SYSCALL_64_after_hwframe+0x52/0xb7
[ 7.477765] SyS_ioctl+0x52/0x74
[ 7.479041] do_syscall_64+0x76/0x182
[ 7.480419] entry_SYSCALL_64_after_hwframe+0x42/0xb7
[ 7.482536] RIP: 0033:0x7fde8d270dd7
[ 7.483919] RSP: 002b:00007ffde2ae7e98 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[ 7.486862] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fde8d270dd7
[ 7.489554] RDX: 00007ffde2ae7eac RSI: 0000000040086602 RDI: 0000000000000003
[ 7.492227] RBP: 0000000000000010 R08: 0000000000000000 R09: 0000000000000001
[ 7.494922] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[ 7.497642] R13: 00007ffde2ae81a0 R14: 00007ffde2ae8188 R15: 0000000000000000
[ 7.540886] XFS (vdf): Mounting V5 Filesystem
[ 7.563795] XFS (vdf): Ending clean mount
Thanks,
Amir.
On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
> Since set of arguments are so similar, handle in a common helper.
>
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> fs/overlayfs/file.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 79 insertions(+)
>
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index 9670e160967e..39b1b73334ad 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -352,6 +352,81 @@ long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> return ret;
> }
>
> +enum ovl_copyop {
> + OVL_COPY,
> + OVL_CLONE,
> + OVL_DEDUPE,
> +};
> +
> +static ssize_t ovl_copyfile(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + u64 len, unsigned int flags, enum ovl_copyop op)
> +{
> + struct inode *inode_out = file_inode(file_out);
> + struct fd real_in, real_out;
> + const struct cred *old_cred;
> + int ret;
> +
> + ret = ovl_real_file(file_out, &real_out);
> + if (ret)
> + return ret;
> +
> + ret = ovl_real_file(file_in, &real_in);
> + if (ret) {
> + fdput(real_out);
> + return ret;
> + }
> +
> + old_cred = ovl_override_creds(file_inode(file_out)->i_sb);
> + switch (op) {
> + case OVL_COPY:
> + ret = vfs_copy_file_range(real_in.file, pos_in,
> + real_out.file, pos_out, len, flags);
Problem:
vfs_copy_file_range(ovl_lower_file, ovl_upper_file) on non samefs
will get -EXDEV from ovl_copy_file_range(), so will not fall back
to do_splice_direct().
We may be better off checking in_sb != out_sb and returning
-EOPNOTSUPP? not sure.
> + break;
> +
> + case OVL_CLONE:
> + ret = vfs_clone_file_range(real_in.file, pos_in,
> + real_out.file, pos_out, len);
> + break;
> +
> + case OVL_DEDUPE:
> + ret = vfs_dedupe_file_range_one(real_in.file, pos_in, len,
> + real_out.file, pos_out);
Problem:
real_out can be a readonly fd (for is_admin), so we will be deduping
the lower file.
I guess this problem is mitigated in current code by may_write_real().
How can we deal with that sort of "write leak" without patching
mnt_want_write_file()?
Thanks,
Amir.
On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
> This reverts commit a118084432d642eeccb961c7c8cc61525a941fcb.
>
> No user of d_real_inode() remains, so it can be removed.
>
FYI, there is a new user in v4.17-rc1 added by commit
f0a2aa5a2a40 tracing/uprobe: Add support for overlayfs
Seems like this patch got merged without any CC to overlayfs
mailing list nor maintainer?
Not sure yet if overlayfs-rorw patches would allow reverting this
change.
Thanks,
Amir.
On Tue, Apr 17, 2018 at 11:31 PM, Amir Goldstein <[email protected]> wrote:
> On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
>> Since set of arguments are so similar, handle in a common helper.
>>
>> Signed-off-by: Miklos Szeredi <[email protected]>
>> ---
>> fs/overlayfs/file.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 79 insertions(+)
>>
>> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>> index 9670e160967e..39b1b73334ad 100644
>> --- a/fs/overlayfs/file.c
>> +++ b/fs/overlayfs/file.c
>> @@ -352,6 +352,81 @@ long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>> return ret;
>> }
>>
>> +enum ovl_copyop {
>> + OVL_COPY,
>> + OVL_CLONE,
>> + OVL_DEDUPE,
>> +};
>> +
>> +static ssize_t ovl_copyfile(struct file *file_in, loff_t pos_in,
>> + struct file *file_out, loff_t pos_out,
>> + u64 len, unsigned int flags, enum ovl_copyop op)
>> +{
>> + struct inode *inode_out = file_inode(file_out);
>> + struct fd real_in, real_out;
>> + const struct cred *old_cred;
>> + int ret;
>> +
>> + ret = ovl_real_file(file_out, &real_out);
>> + if (ret)
>> + return ret;
>> +
>> + ret = ovl_real_file(file_in, &real_in);
>> + if (ret) {
>> + fdput(real_out);
>> + return ret;
>> + }
>> +
>> + old_cred = ovl_override_creds(file_inode(file_out)->i_sb);
>> + switch (op) {
>> + case OVL_COPY:
>> + ret = vfs_copy_file_range(real_in.file, pos_in,
>> + real_out.file, pos_out, len, flags);
>
> Problem:
> vfs_copy_file_range(ovl_lower_file, ovl_upper_file) on non samefs
> will get -EXDEV from ovl_copy_file_range(), so will not fall back
> to do_splice_direct().
> We may be better off checking in_sb != out_sb and returning
> -EOPNOTSUPP? not sure.
>
>
>> + break;
>> +
>> + case OVL_CLONE:
>> + ret = vfs_clone_file_range(real_in.file, pos_in,
>> + real_out.file, pos_out, len);
>> + break;
>> +
>> + case OVL_DEDUPE:
>> + ret = vfs_dedupe_file_range_one(real_in.file, pos_in, len,
>> + real_out.file, pos_out);
>
> Problem:
> real_out can be a readonly fd (for is_admin), so we will be deduping
> the lower file.
> I guess this problem is mitigated in current code by may_write_real().
>
> How can we deal with that sort of "write leak" without patching
> mnt_want_write_file()?
>
Possible solution:
Add interface file_oprations->permission().
At least in rw_verify_area() and clone_verify_area() it is clear
how this would be used. Instead if calling security_file_permission()
call it via a helper file_permission() like with inode_permission.
My understanding in the VFS permission checks model is too
limited to say if this makes sense.
Thanks,
Amir.
On Wed, Apr 18, 2018 at 10:19 AM, Amir Goldstein <[email protected]> wrote:
> On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
>> This reverts commit a118084432d642eeccb961c7c8cc61525a941fcb.
>>
>> No user of d_real_inode() remains, so it can be removed.
>>
>
> FYI, there is a new user in v4.17-rc1 added by commit
> f0a2aa5a2a40 tracing/uprobe: Add support for overlayfs
>
> Seems like this patch got merged without any CC to overlayfs
> mailing list nor maintainer?
>
> Not sure yet if overlayfs-rorw patches would allow reverting this
> change.
Not trivial, because uprobe is looking at i_mapping to get a list of
current memory maps. We could set i_mapping at overlay inode
initialization time, but we definitely can't *change* i_mapping at
copy up. Which is bound to result in some weird inconsistencies. So
likely we'll need to keep d_real_inode() for the time being.
Thanks,
Miklos
On Wed, 18 Apr 2018 13:42:03 +0200
Miklos Szeredi <[email protected]> wrote:
> On Wed, Apr 18, 2018 at 10:19 AM, Amir Goldstein <[email protected]> wrote:
> > On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
> >> This reverts commit a118084432d642eeccb961c7c8cc61525a941fcb.
> >>
> >> No user of d_real_inode() remains, so it can be removed.
> >>
> >
> > FYI, there is a new user in v4.17-rc1 added by commit
> > f0a2aa5a2a40 tracing/uprobe: Add support for overlayfs
> >
> > Seems like this patch got merged without any CC to overlayfs
> > mailing list nor maintainer?
It appeared to be a small change with lots of reviewers. I didn't think
it was something to notify the overlayfs folks with. But perhaps I was
wrong.
> >
> > Not sure yet if overlayfs-rorw patches would allow reverting this
> > change.
>
> Not trivial, because uprobe is looking at i_mapping to get a list of
> current memory maps. We could set i_mapping at overlay inode
> initialization time, but we definitely can't *change* i_mapping at
> copy up. Which is bound to result in some weird inconsistencies. So
> likely we'll need to keep d_real_inode() for the time being.
I just received this patch:
http://lkml.kernel.org/r/[email protected]
Which removes this code. Can you review it and I'll take it.
-- Steve
On Wed, Apr 18, 2018 at 3:38 PM, Steven Rostedt <[email protected]> wrote:
> On Wed, 18 Apr 2018 13:42:03 +0200
> Miklos Szeredi <[email protected]> wrote:
>
>> On Wed, Apr 18, 2018 at 10:19 AM, Amir Goldstein <[email protected]> wrote:
>> > On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
>> >> This reverts commit a118084432d642eeccb961c7c8cc61525a941fcb.
>> >>
>> >> No user of d_real_inode() remains, so it can be removed.
>> >>
>> >
>> > FYI, there is a new user in v4.17-rc1 added by commit
>> > f0a2aa5a2a40 tracing/uprobe: Add support for overlayfs
>> >
>> > Seems like this patch got merged without any CC to overlayfs
>> > mailing list nor maintainer?
>
> It appeared to be a small change with lots of reviewers. I didn't think
> it was something to notify the overlayfs folks with. But perhaps I was
> wrong.
The patch is correct. The code surrounding it isn't, though.
>
>> >
>> > Not sure yet if overlayfs-rorw patches would allow reverting this
>> > change.
>>
>> Not trivial, because uprobe is looking at i_mapping to get a list of
>> current memory maps. We could set i_mapping at overlay inode
>> initialization time, but we definitely can't *change* i_mapping at
>> copy up. Which is bound to result in some weird inconsistencies. So
>> likely we'll need to keep d_real_inode() for the time being.
>
> I just received this patch:
>
> http://lkml.kernel.org/r/[email protected]
>
> Which removes this code. Can you review it and I'll take it.
It shouldn't remove d_real_inode(), because that part is correct and
fixes a real bug in handling overlayfs files.
I'll review, but apparently I wasn't CC-d on that patch. Weird.
Thanks,
Miklos
On Wed, 18 Apr 2018 15:49:02 +0200
Miklos Szeredi <[email protected]> wrote:
> > I just received this patch:
> >
> > http://lkml.kernel.org/r/[email protected]
> >
> > Which removes this code. Can you review it and I'll take it.
>
> It shouldn't remove d_real_inode(), because that part is correct and
> fixes a real bug in handling overlayfs files.
>
> I'll review, but apparently I wasn't CC-d on that patch. Weird.
Especially since you are on the "Reported-by".
My scripts know to add to the Cc "Reported-by" tags. Not all scripts do
though :-/
-- Steve
On Wed, Apr 18, 2018 at 03:49:02PM +0200, Miklos Szeredi wrote:
> On Wed, Apr 18, 2018 at 3:38 PM, Steven Rostedt <[email protected]> wrote:
> > On Wed, 18 Apr 2018 13:42:03 +0200
> > Miklos Szeredi <[email protected]> wrote:
> >
> >> On Wed, Apr 18, 2018 at 10:19 AM, Amir Goldstein <[email protected]> wrote:
> >> > On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
> >> >> This reverts commit a118084432d642eeccb961c7c8cc61525a941fcb.
> >> >>
> >> >> No user of d_real_inode() remains, so it can be removed.
> >> >>
> >> >
> >> > FYI, there is a new user in v4.17-rc1 added by commit
> >> > f0a2aa5a2a40 tracing/uprobe: Add support for overlayfs
> >> >
> >> > Seems like this patch got merged without any CC to overlayfs
> >> > mailing list nor maintainer?
> >
> > It appeared to be a small change with lots of reviewers. I didn't think
> > it was something to notify the overlayfs folks with. But perhaps I was
> > wrong.
>
> The patch is correct. The code surrounding it isn't, though.
>
> >
> >> >
> >> > Not sure yet if overlayfs-rorw patches would allow reverting this
> >> > change.
> >>
> >> Not trivial, because uprobe is looking at i_mapping to get a list of
> >> current memory maps. We could set i_mapping at overlay inode
> >> initialization time, but we definitely can't *change* i_mapping at
> >> copy up. Which is bound to result in some weird inconsistencies. So
> >> likely we'll need to keep d_real_inode() for the time being.
> >
> > I just received this patch:
> >
> > http://lkml.kernel.org/r/[email protected]
> >
> > Which removes this code. Can you review it and I'll take it.
>
> It shouldn't remove d_real_inode(), because that part is correct and
> fixes a real bug in handling overlayfs files.
I am wondering what does it practically mean for metdata only copy up
patches. Given this is uprobe code, I am assuming its modifying some
executable code dynamically. And for the the case of metadata only
copy up, it will return inode of lower. That probably means, as long
as all running instances of that exeutable are using that inode, things
will work fine.
But if for some reason somebody opens that file for WRITE and triggers
copy up and new instances of same binary will not see the probe taking
affect?
Which is probably very similar to what will happen if a lower executable
is copied up. Having said that, in normal cases there should not be a
need to copy up a binary in normal circumstances.
Am I missing the point completely.
Thanks
Vivek
On Thu, Apr 19, 2018 at 9:54 PM, Vivek Goyal <[email protected]> wrote:
> On Wed, Apr 18, 2018 at 03:49:02PM +0200, Miklos Szeredi wrote:
>> On Wed, Apr 18, 2018 at 3:38 PM, Steven Rostedt <[email protected]> wrote:
>> > On Wed, 18 Apr 2018 13:42:03 +0200
>> > Miklos Szeredi <[email protected]> wrote:
>> >
>> >> On Wed, Apr 18, 2018 at 10:19 AM, Amir Goldstein <[email protected]> wrote:
>> >> > On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
>> >> >> This reverts commit a118084432d642eeccb961c7c8cc61525a941fcb.
>> >> >>
>> >> >> No user of d_real_inode() remains, so it can be removed.
>> >> >>
>> >> >
>> >> > FYI, there is a new user in v4.17-rc1 added by commit
>> >> > f0a2aa5a2a40 tracing/uprobe: Add support for overlayfs
>> >> >
>> >> > Seems like this patch got merged without any CC to overlayfs
>> >> > mailing list nor maintainer?
>> >
>> > It appeared to be a small change with lots of reviewers. I didn't think
>> > it was something to notify the overlayfs folks with. But perhaps I was
>> > wrong.
>>
>> The patch is correct. The code surrounding it isn't, though.
>>
>> >
>> >> >
>> >> > Not sure yet if overlayfs-rorw patches would allow reverting this
>> >> > change.
>> >>
>> >> Not trivial, because uprobe is looking at i_mapping to get a list of
>> >> current memory maps. We could set i_mapping at overlay inode
>> >> initialization time, but we definitely can't *change* i_mapping at
>> >> copy up. Which is bound to result in some weird inconsistencies. So
>> >> likely we'll need to keep d_real_inode() for the time being.
>> >
>> > I just received this patch:
>> >
>> > http://lkml.kernel.org/r/[email protected]
>> >
>> > Which removes this code. Can you review it and I'll take it.
>>
>> It shouldn't remove d_real_inode(), because that part is correct and
>> fixes a real bug in handling overlayfs files.
>
> I am wondering what does it practically mean for metdata only copy up
> patches. Given this is uprobe code, I am assuming its modifying some
> executable code dynamically. And for the the case of metadata only
> copy up, it will return inode of lower. That probably means, as long
> as all running instances of that exeutable are using that inode, things
> will work fine.
>
> But if for some reason somebody opens that file for WRITE and triggers
> copy up and new instances of same binary will not see the probe taking
> affect?
>
> Which is probably very similar to what will happen if a lower executable
> is copied up. Having said that, in normal cases there should not be a
> need to copy up a binary in normal circumstances.
The only thing we need to ensure when uprobes interact with copy-ups
is that the kernel doesn't crash and doesn't leak memory. Other than
that, it's a totally uninteresting corner case and we don't need to
worry about its behavior.
Thanks,
Miklos
On Tue, Apr 17, 2018 at 10:51 PM, Amir Goldstein <[email protected]> wrote:
> On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
>> Implement FS_IOC_GETFLAGS and FS_IOC_SETFLAGS.
>>
...
>> +long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>> +{
>> + long ret;
>> + struct inode *inode = file_inode(file);
>> +
>> + switch (cmd) {
>> + case FS_IOC_GETFLAGS:
>> + ret = ovl_real_ioctl(file, cmd, arg);
>> + break;
>> +
>> + case FS_IOC_SETFLAGS:
>> + if (!inode_owner_or_capable(inode))
>> + return -EACCES;
>> +
>> + ret = mnt_want_write_file(file);
>> + if (ret)
>> + return ret;
>> +
>> + ret = ovl_copy_up(file_dentry(file));
>> + if (!ret) {
>> + ret = ovl_real_ioctl(file, cmd, arg);
>> +
>
> I got this lockdep splat with overlayfs-rorw and overlay/040, but I don't
> see the problem in the patch:
>
Ouch! the problem is not with the patch. The patch is just bring to light
the fact that filesystems do mnt_want_write_file(file) on ioctls such as
FS_IOC_SETFLAGS and if that file happens to be an overlayfs file
then filesystems are getting write access to overlay mount and that was
not their intention. That can be a way to bypass filesystem ro mount
and freeze protection.
I couldn't reproduce ro/freeze protection with xfs and ext4 on upstream
kernel, but did reproduce freeze protection bypass with ext4 and the ro-rw
patches. ext4 also hits a WARN_ON with upstream kernel and with ro-rw:
root@kvm-xfstests:~# mount /vdf
root@kvm-xfstests:~# mkdir -p /vdf/ovl-lower /vdf/ovl-upper /vdf/ovl-work
root@kvm-xfstests:~# touch /vdf/ovl-upper/foo
root@kvm-xfstests:~# mount -t overlay none /mnt -o
lowerdir=/vdf/ovl-lower,upperdir=/vdf/ovl-upper,workdir=/vdf/ovl-work
root@kvm-xfstests:~# fsfreeze -f /vdf
root@kvm-xfstests:~# chattr +i /mnt/foo
root@kvm-xfstests:~# lsattr -l /mnt/foo
/mnt/scratch/foo Immutable, Extents
[ 53.478454] WARNING: CPU: 1 PID: 1415 at
/home/amir/build/src/linux/fs/ext4/ext4_jbd2.c:53
ext4_journal_check_start+0x48/0x82
[ 53.482094] CPU: 1 PID: 1415 Comm: chattr Not tainted
4.17.0-rc1-xfstests-00086-g5a6426c9b720 #3255
[ 53.484927] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 53.487792] RIP: 0010:ext4_journal_check_start+0x48/0x82
[ 53.489392] RSP: 0018:ffffc90000707b18 EFLAGS: 00010246
[ 53.491070] RAX: 00000000ffffffe2 RBX: ffff88007c497000 RCX: 0000000000000000
[ 53.493326] RDX: ffff880079284000 RSI: 000000000000002e RDI: ffffffff8208165c
[ 53.494850] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000001
[ 53.496356] R10: 0000000000000001 R11: ffff880077c49250 R12: 0000000000000000
[ 53.497857] R13: ffff88007ca61180 R14: 000000000000019c R15: 0000000000000000
[ 53.499376] FS: 00007f447dd2d780(0000) GS:ffff88007f400000(0000)
knlGS:0000000000000000
[ 53.501975] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 53.504280] CR2: 00007f447d1ff0e0 CR3: 00000000792fc000 CR4: 00000000000006e0
[ 53.505855] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 53.507814] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 53.511278] Call Trace:
[ 53.511861] __ext4_journal_start_sb+0xe4/0x1a4
[ 53.512860] ? ext4_file_open+0xb6/0x189
[ 53.513705] ext4_file_open+0xb6/0x189
[ 53.514547] ? ext4_release_file+0x9f/0x9f
[ 53.515426] do_dentry_open+0x1af/0x2e6
[ 53.516298] path_open+0x5d/0x7e
[ 53.516998] ovl_open_realfile+0x6b/0xcb
[ 53.517843] ? ovl_pre_mmap+0x4c/0x4c
[ 53.518659] ovl_open+0x51/0x77
[ 53.519343] do_dentry_open+0x1af/0x2e6
[ 53.520167] do_last+0x520/0x5f5
[ 53.520874] path_openat+0x1f1/0x274
[ 53.521648] do_filp_open+0x4d/0xa3
[ 53.522422] ? __alloc_fd+0x2f/0x1b6
[ 53.523193] ? do_sys_open+0x13d/0x1c4
[ 53.523997] do_sys_open+0x13d/0x1c4
[ 53.525056] do_syscall_64+0x5d/0x167
[ 53.526099] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 53.527620] RIP: 0033:0x7f447d1ff4b0
[ 53.528693] RSP: 002b:00007fffaef09438 EFLAGS: 00000246 ORIG_RAX:
0000000000000002
[ 53.530779] RAX: ffffffffffffffda RBX: 00007fffaef0adaf RCX: 00007f447d1ff4b0
[ 53.532495] RDX: 00007fffaef09450 RSI: 0000000000000800 RDI: 00007fffaef0adaf
[ 53.534053] RBP: 00007fffaef0adaf R08: 0000000000000000 R09: 0000000000000001
[ 53.536649] R10: 00007f447d1a2ff0 R11: 0000000000000246 R12: 00007fffaef09528
[ 53.539024] R13: 00007fffaef09740 R14: 00007fffaef09728 R15: 0000000000000000
[ 53.541658] Code: 8b 93 60 07 00 00 b8 fb ff ff ff 48 8b 8a e8 03
00 00 80 e1 02 75 4c f6 43 50 01 b8 e2 ff ff ff 75 41 83 bb 28 03 00
00 04 75 02 <0f> 0b 48 8b 92 30 03 00 00 31 c0 48 85 d2 74 28 48 8b 02
83 e0
[ 53.548190] irq event stamp: 0
[ 53.549431] hardirqs last enabled at (0): [<0000000000000000>]
(null)
[ 53.552361] hardirqs last disabled at (0): [<ffffffff810741a9>]
copy_process.part.9+0x654/0x1afb
[ 53.555814] softirqs last enabled at (0): [<ffffffff810741a9>]
copy_process.part.9+0x654/0x1afb
[ 53.559164] softirqs last disabled at (0): [<0000000000000000>]
(null)
[ 53.562005] ---[ end trace adfe58189c7e6188 ]---
Upstream WARN_ON:
[ 302.631228] WARNING: CPU: 0 PID: 1406 at
/home/amir/build/src/linux/fs/ext4/ext4_jbd2.c:53
ext4_journal_check_start+0x48/0x82
[ 302.635440] CPU: 0 PID: 1406 Comm: chattr Not tainted
4.17.0-rc1-xfstests #3237
[ 302.638200] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[ 302.641172] RIP: 0010:ext4_journal_check_start+0x48/0x82
[ 302.643154] RSP: 0018:ffffc9000076fbd8 EFLAGS: 00010246
[ 302.644466] RAX: 00000000ffffffe2 RBX: ffff88007a77b000 RCX: 0000000000000000
[ 302.646418] RDX: ffff88007c9df000 RSI: 00000000ffffffff RDI: 0000000000000246
[ 302.648130] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000006
[ 302.649764] R10: 0000000000000001 R11: ffffffff82210708 R12: 0000000000000000
[ 302.651719] R13: ffff88007c468180 R14: 000000000000019c R15: 0000000000000000
[ 302.653437] FS: 00007f070a480780(0000) GS:ffff88007f200000(0000)
knlGS:0000000000000000
[ 302.655711] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 302.657923] CR2: 0000564edb963008 CR3: 000000007a676000 CR4: 00000000000006f0
[ 302.660718] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 302.663476] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 302.666179] Call Trace:
[ 302.667071] __ext4_journal_start_sb+0xe4/0x1a4
[ 302.668763] ? ext4_file_open+0xb6/0x189
[ 302.670118] ext4_file_open+0xb6/0x189
[ 302.671528] ? ext4_release_file+0x9f/0x9f
[ 302.673211] do_dentry_open+0x19e/0x2d5
[ 302.674747] ? ovl_inode_init_once+0xe/0xe
[ 302.676398] do_last+0x520/0x5f9
[ 302.677668] path_openat+0x1fa/0x26b
[ 302.679100] do_filp_open+0x4d/0xa3
[ 302.680280] ? __lock_acquire+0x5e6/0x67b
[ 302.681567] ? __alloc_fd+0x1a4/0x1b6
[ 302.683051] ? do_sys_open+0x13c/0x1c1
[ 302.684170] do_sys_open+0x13c/0x1c1
[ 302.685361] do_syscall_64+0x5d/0x167
[ 302.686458] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 302.688029] RIP: 0033:0x7f07099524b0
[ 302.689090] RSP: 002b:00007ffcfefc3178 EFLAGS: 00000246 ORIG_RAX:
0000000000000002
[ 302.691309] RAX: ffffffffffffffda RBX: 00007ffcfefc3daf RCX: 00007f07099524b0
[ 302.693346] RDX: 00007ffcfefc3190 RSI: 0000000000000800 RDI: 00007ffcfefc3daf
[ 302.695389] RBP: 00007ffcfefc3daf R08: 0000000000000000 R09: 0000000000000001
[ 302.697766] R10: 00007f07098f5ff0 R11: 0000000000000246 R12: 00007ffcfefc3268
[ 302.699955] R13: 00007ffcfefc3480 R14: 00007ffcfefc3468 R15: 0000000000000000
[ 302.702521] Code: 8b 93 60 07 00 00 b8 fb ff ff ff 48 8b 8a e8 03
00 00 80 e1 02 75 4c f6 43 50 01 b8 e2 ff ff ff 75 41 83 bb 28 03 00
00 04 75 02 <0f> 0b 48 8b 92 30 03 00 00 31 c0 48 85 d2 74 28 48 8b 02
83 e0
[ 302.706996] irq event stamp: 2376
[ 302.707868] hardirqs last enabled at (2375): [<ffffffff811f8e72>]
prepend_path+0x205/0x449
[ 302.709872] hardirqs last disabled at (2376): [<ffffffff81a0118f>]
error_entry+0x7f/0x100
[ 302.712311] softirqs last enabled at (1060): [<ffffffff81c0033b>]
__do_softirq+0x33b/0x433
[ 302.714754] softirqs last disabled at (1041): [<ffffffff8107c541>]
irq_exit+0x59/0xa8
[ 302.716465] ---[ end trace e891c35ae0c8bbe5 ]---
Is there a reason why the real file can't get the real path?
For current kernels, can you say what else can go wrong when filesystems
call mnt_want_write_file() on an overlay file on ioctl with filesystem
inode and why I couldn't reproduce readonly/freeze bypass?
Thanks,
Amir.
On Sun, Apr 22, 2018 at 11:35 AM, Amir Goldstein <[email protected]> wrote:
> On Tue, Apr 17, 2018 at 10:51 PM, Amir Goldstein <[email protected]> wrote:
>> On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
>>> Implement FS_IOC_GETFLAGS and FS_IOC_SETFLAGS.
>>>
> ...
>>> +long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>>> +{
>>> + long ret;
>>> + struct inode *inode = file_inode(file);
>>> +
>>> + switch (cmd) {
>>> + case FS_IOC_GETFLAGS:
>>> + ret = ovl_real_ioctl(file, cmd, arg);
>>> + break;
>>> +
>>> + case FS_IOC_SETFLAGS:
>>> + if (!inode_owner_or_capable(inode))
>>> + return -EACCES;
>>> +
>>> + ret = mnt_want_write_file(file);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + ret = ovl_copy_up(file_dentry(file));
>>> + if (!ret) {
>>> + ret = ovl_real_ioctl(file, cmd, arg);
>>> +
>>
>> I got this lockdep splat with overlayfs-rorw and overlay/040, but I don't
>> see the problem in the patch:
>>
>
> Ouch! the problem is not with the patch. The patch is just bring to light
> the fact that filesystems do mnt_want_write_file(file) on ioctls such as
> FS_IOC_SETFLAGS and if that file happens to be an overlayfs file
> then filesystems are getting write access to overlay mount and that was
> not their intention. That can be a way to bypass filesystem ro mount
> and freeze protection.
>
> I couldn't reproduce ro/freeze protection with xfs and ext4 on upstream
> kernel, but did reproduce freeze protection bypass with ext4 and the ro-rw
> patches. ext4 also hits a WARN_ON with upstream kernel and with ro-rw:
>
> root@kvm-xfstests:~# mount /vdf
> root@kvm-xfstests:~# mkdir -p /vdf/ovl-lower /vdf/ovl-upper /vdf/ovl-work
> root@kvm-xfstests:~# touch /vdf/ovl-upper/foo
> root@kvm-xfstests:~# mount -t overlay none /mnt -o
> lowerdir=/vdf/ovl-lower,upperdir=/vdf/ovl-upper,workdir=/vdf/ovl-work
> root@kvm-xfstests:~# fsfreeze -f /vdf
> root@kvm-xfstests:~# chattr +i /mnt/foo
> root@kvm-xfstests:~# lsattr -l /mnt/foo
> /mnt/scratch/foo Immutable, Extents
>
[...]
> Upstream WARN_ON:
>
> [ 302.631228] WARNING: CPU: 0 PID: 1406 at
> /home/amir/build/src/linux/fs/ext4/ext4_jbd2.c:53
> ext4_journal_check_start+0x48/0x82
> [ 302.635440] CPU: 0 PID: 1406 Comm: chattr Not tainted
> 4.17.0-rc1-xfstests #3237
> [ 302.638200] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> [ 302.641172] RIP: 0010:ext4_journal_check_start+0x48/0x82
> [ 302.643154] RSP: 0018:ffffc9000076fbd8 EFLAGS: 00010246
> [ 302.644466] RAX: 00000000ffffffe2 RBX: ffff88007a77b000 RCX: 0000000000000000
> [ 302.646418] RDX: ffff88007c9df000 RSI: 00000000ffffffff RDI: 0000000000000246
> [ 302.648130] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000006
> [ 302.649764] R10: 0000000000000001 R11: ffffffff82210708 R12: 0000000000000000
> [ 302.651719] R13: ffff88007c468180 R14: 000000000000019c R15: 0000000000000000
> [ 302.653437] FS: 00007f070a480780(0000) GS:ffff88007f200000(0000)
> knlGS:0000000000000000
> [ 302.655711] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 302.657923] CR2: 0000564edb963008 CR3: 000000007a676000 CR4: 00000000000006f0
> [ 302.660718] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 302.663476] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 302.666179] Call Trace:
> [ 302.667071] __ext4_journal_start_sb+0xe4/0x1a4
> [ 302.668763] ? ext4_file_open+0xb6/0x189
> [ 302.670118] ext4_file_open+0xb6/0x189
> [ 302.671528] ? ext4_release_file+0x9f/0x9f
> [ 302.673211] do_dentry_open+0x19e/0x2d5
> [ 302.674747] ? ovl_inode_init_once+0xe/0xe
> [ 302.676398] do_last+0x520/0x5f9
> [ 302.677668] path_openat+0x1fa/0x26b
> [ 302.679100] do_filp_open+0x4d/0xa3
> [ 302.680280] ? __lock_acquire+0x5e6/0x67b
> [ 302.681567] ? __alloc_fd+0x1a4/0x1b6
> [ 302.683051] ? do_sys_open+0x13c/0x1c1
> [ 302.684170] do_sys_open+0x13c/0x1c1
> [ 302.685361] do_syscall_64+0x5d/0x167
> [ 302.686458] entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [ 302.688029] RIP: 0033:0x7f07099524b0
> [ 302.689090] RSP: 002b:00007ffcfefc3178 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000002
> [ 302.691309] RAX: ffffffffffffffda RBX: 00007ffcfefc3daf RCX: 00007f07099524b0
> [ 302.693346] RDX: 00007ffcfefc3190 RSI: 0000000000000800 RDI: 00007ffcfefc3daf
> [ 302.695389] RBP: 00007ffcfefc3daf R08: 0000000000000000 R09: 0000000000000001
> [ 302.697766] R10: 00007f07098f5ff0 R11: 0000000000000246 R12: 00007ffcfefc3268
> [ 302.699955] R13: 00007ffcfefc3480 R14: 00007ffcfefc3468 R15: 0000000000000000
> [ 302.702521] Code: 8b 93 60 07 00 00 b8 fb ff ff ff 48 8b 8a e8 03
> 00 00 80 e1 02 75 4c f6 43 50 01 b8 e2 ff ff ff 75 41 83 bb 28 03 00
> 00 04 75 02 <0f> 0b 48 8b 92 30 03 00 00 31 c0 48 85 d2 74 28 48 8b 02
> 83 e0
> [ 302.706996] irq event stamp: 2376
> [ 302.707868] hardirqs last enabled at (2375): [<ffffffff811f8e72>]
> prepend_path+0x205/0x449
> [ 302.709872] hardirqs last disabled at (2376): [<ffffffff81a0118f>]
> error_entry+0x7f/0x100
> [ 302.712311] softirqs last enabled at (1060): [<ffffffff81c0033b>]
> __do_softirq+0x33b/0x433
> [ 302.714754] softirqs last disabled at (1041): [<ffffffff8107c541>]
> irq_exit+0x59/0xa8
> [ 302.716465] ---[ end trace e891c35ae0c8bbe5 ]---
>
That's an ext4 bug unrelated to overlayfs.
reproduced by:
# mount /vdf/
# fsfreeze -f /vdf/
# cat /vdf/foo
> Is there a reason why the real file can't get the real path?
> For current kernels, can you say what else can go wrong when filesystems
> call mnt_want_write_file() on an overlay file on ioctl with filesystem
> inode and why I couldn't reproduce readonly/freeze bypass?
>
The reason is that commit 7c6893e3c9ab ("ovl: don't allow writing ioctl
on lower layer") also silently fixed "block writing ioctl on upper layer
of frozen fs".
Thanks,
Amir.
On 4/12/2018 8:38 PM, Miklos Szeredi wrote:
> Implement FS_IOC_GETFLAGS and FS_IOC_SETFLAGS.
>
> Needs vfs_ioctl() exported to modules.
Do you think if it is better to separate out ovl implementation and
export of vfs_ioctl method ?
Probably there are other users which should be using vfs_ioctl method as
well (like ecryptfs) ?
>
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> fs/internal.h | 1 -
> fs/ioctl.c | 1 +
> fs/overlayfs/file.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/fs.h | 2 ++
> 4 files changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/fs/internal.h b/fs/internal.h
> index 3319bf39e339..d5108d9c6a2f 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -176,7 +176,6 @@ extern const struct dentry_operations ns_dentry_operations;
> */
> extern int do_vfs_ioctl(struct file *file, unsigned int fd, unsigned int cmd,
> unsigned long arg);
> -extern long vfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
>
> /*
> * iomap support:
> diff --git a/fs/ioctl.c b/fs/ioctl.c
> index 5ace7efb0d04..696f4c46a868 100644
> --- a/fs/ioctl.c
> +++ b/fs/ioctl.c
> @@ -49,6 +49,7 @@ long vfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> out:
> return error;
> }
> +EXPORT_SYMBOL(vfs_ioctl);
>
> static int ioctl_fibmap(struct file *filp, int __user *p)
> {
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index 05e3e2f80b89..cc004ff1b05b 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -8,6 +8,7 @@
>
> #include <linux/cred.h>
> #include <linux/file.h>
> +#include <linux/mount.h>
> #include <linux/xattr.h>
> #include <linux/uio.h>
> #include "overlayfs.h"
> @@ -291,6 +292,63 @@ long ovl_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> return ret;
> }
>
> +static long ovl_real_ioctl(struct file *file, unsigned int cmd,
> + unsigned long arg)
> +{
> + struct fd real;
> + const struct cred *old_cred;
> + long ret;
> +
> + ret = ovl_real_file(file, &real);
> + if (ret)
> + return ret;
> +
> + old_cred = ovl_override_creds(file_inode(file)->i_sb);
> + ret = vfs_ioctl(real.file, cmd, arg);
> + revert_creds(old_cred);
> +
> + fdput(real);
> +
> + return ret;
> +}
> +
> +long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> + long ret;
> + struct inode *inode = file_inode(file);
> +
> + switch (cmd) {
> + case FS_IOC_GETFLAGS:
> + ret = ovl_real_ioctl(file, cmd, arg);
> + break;
> +
> + case FS_IOC_SETFLAGS:
> + if (!inode_owner_or_capable(inode))
> + return -EACCES;
> +
> + ret = mnt_want_write_file(file);
> + if (ret)
> + return ret;
> +
> + ret = ovl_copy_up(file_dentry(file));
> + if (!ret) {
> + ret = ovl_real_ioctl(file, cmd, arg);
> +
> + inode_lock(inode);
> + ovl_copyflags(ovl_inode_real(inode), inode);
> + inode_unlock(inode);
> + }
> +
> + mnt_drop_write_file(file);
> + break;
> +
> + default:
> + ret = -ENOTTY;
> + }
> +
> + return ret;
> +}
> +
> const struct file_operations ovl_file_operations = {
> .open = ovl_open,
> .release = ovl_release,
> @@ -300,4 +358,5 @@ const struct file_operations ovl_file_operations = {
> .fsync = ovl_fsync,
> .mmap = ovl_mmap,
> .fallocate = ovl_fallocate,
> + .unlocked_ioctl = ovl_ioctl,
What about compat_ioctl ? should that implementation also go in the same
patch ?
> };
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 0b215faa30ae..1add10f04b56 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1621,6 +1621,8 @@ int vfs_mkobj(struct dentry *, umode_t,
> int (*f)(struct dentry *, umode_t, void *),
> void *);
>
> +extern long vfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg);
> +
> /*
> * VFS file helper functions.
> */
>
--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center,
Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a
Linux Foundation Collaborative Project.
On Sun, Apr 22, 2018 at 10:35 AM, Amir Goldstein <[email protected]> wrote:
> On Tue, Apr 17, 2018 at 10:51 PM, Amir Goldstein <[email protected]> wrote:
[snip]
> Is there a reason why the real file can't get the real path?
It could, except for vma->vm_file.
Now, we could have a separate realfile for mmap (with overlay path)
and one for everything else (with real path). Maybe that's the way to
go, to minimize the chance of trouble caused by this irregularity.
> For current kernels, can you say what else can go wrong when filesystems
> call mnt_want_write_file() on an overlay file on ioctl with filesystem
> inode and why I couldn't reproduce readonly/freeze bypass?
mnt_want_write_file() is overlayfs-aware in current kernels.
We could fix it to use file_inode()->i_sb instead of
f_path.dentry->d_sb after reverting the overlay specific hack, and
that would fix the freeze bypass bug and would be correct for all
filesystems. But I wonder how many such issues we have where
discrepancy between f_path.dentry and file_inode() matters.
Thanks,
Miklos
On Mon, Apr 23, 2018 at 12:21 PM, Miklos Szeredi <[email protected]> wrote:
> We could fix it to use file_inode()->i_sb instead of
> f_path.dentry->d_sb after reverting the overlay specific hack, and
> that would fix the freeze bypass bug and would be correct for all
> filesystems. But I wonder how many such issues we have where
> discrepancy between f_path.dentry and file_inode() matters.
OTOH, any such issues should already have surfaced. Just need to be
careful when reverting VFS patches that they don't regress in the face
of f_path containing the overlay path.
So we are probably better to keep using overlay path in real file
universally, just to keep the code simpler.
Thanks,
Miklos
On Thu, Apr 12, 2018 at 05:08:04PM +0200, Miklos Szeredi wrote:
> Implement stacked fsync().
>
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> fs/overlayfs/file.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> index b98204c1c19c..4417527667ff 100644
> --- a/fs/overlayfs/file.c
> +++ b/fs/overlayfs/file.c
> @@ -222,10 +222,30 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
> return ret;
> }
>
> +static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> +{
> + struct fd real;
> + const struct cred *old_cred;
> + int ret;
> +
> + ret = ovl_real_file(file, &real);
> + if (ret)
> + return ret;
> +
> + old_cred = ovl_override_creds(file_inode(file)->i_sb);
> + ret = vfs_fsync_range(real.file, start, end, datasync);
> + revert_creds(old_cred);
Can we avoid calling fsync() on real file if it is not upper. Is it worth
optimizing.
Vivek
> +
> + fdput(real);
> +
> + return ret;
> +}
> +
> const struct file_operations ovl_file_operations = {
> .open = ovl_open,
> .release = ovl_release,
> .llseek = ovl_llseek,
> .read_iter = ovl_read_iter,
> .write_iter = ovl_write_iter,
> + .fsync = ovl_fsync,
> };
> --
> 2.14.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-unionfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Apr 23, 2018 at 3:36 PM, Vivek Goyal <[email protected]> wrote:
> On Thu, Apr 12, 2018 at 05:08:04PM +0200, Miklos Szeredi wrote:
>> Implement stacked fsync().
>>
>> Signed-off-by: Miklos Szeredi <[email protected]>
>> ---
>> fs/overlayfs/file.c | 20 ++++++++++++++++++++
>> 1 file changed, 20 insertions(+)
>>
>> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>> index b98204c1c19c..4417527667ff 100644
>> --- a/fs/overlayfs/file.c
>> +++ b/fs/overlayfs/file.c
>> @@ -222,10 +222,30 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>> return ret;
>> }
>>
>> +static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>> +{
>> + struct fd real;
>> + const struct cred *old_cred;
>> + int ret;
>> +
>> + ret = ovl_real_file(file, &real);
>> + if (ret)
>> + return ret;
>> +
>> + old_cred = ovl_override_creds(file_inode(file)->i_sb);
>> + ret = vfs_fsync_range(real.file, start, end, datasync);
>> + revert_creds(old_cred);
>
> Can we avoid calling fsync() on real file if it is not upper. Is it worth
> optimizing.
Not sure it's worth bothering with. If caller of fsync(2) didn't
worry about cost, then why should we?
Thanks,
Miklos
On Mon, Apr 23, 2018 at 03:39:45PM +0200, Miklos Szeredi wrote:
> On Mon, Apr 23, 2018 at 3:36 PM, Vivek Goyal <[email protected]> wrote:
> > On Thu, Apr 12, 2018 at 05:08:04PM +0200, Miklos Szeredi wrote:
> >> Implement stacked fsync().
> >>
> >> Signed-off-by: Miklos Szeredi <[email protected]>
> >> ---
> >> fs/overlayfs/file.c | 20 ++++++++++++++++++++
> >> 1 file changed, 20 insertions(+)
> >>
> >> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> >> index b98204c1c19c..4417527667ff 100644
> >> --- a/fs/overlayfs/file.c
> >> +++ b/fs/overlayfs/file.c
> >> @@ -222,10 +222,30 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
> >> return ret;
> >> }
> >>
> >> +static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
> >> +{
> >> + struct fd real;
> >> + const struct cred *old_cred;
> >> + int ret;
> >> +
> >> + ret = ovl_real_file(file, &real);
> >> + if (ret)
> >> + return ret;
> >> +
> >> + old_cred = ovl_override_creds(file_inode(file)->i_sb);
> >> + ret = vfs_fsync_range(real.file, start, end, datasync);
> >> + revert_creds(old_cred);
> >
> > Can we avoid calling fsync() on real file if it is not upper. Is it worth
> > optimizing.
>
> Not sure it's worth bothering with. If caller of fsync(2) didn't
> worry about cost, then why should we?
I was thinking more from the point of view of metadata copy up patches.
For a metacopy file, I was thinking if I can just issue fsync() on upper
file and skip it on lower file.
Anyway, I don't have any strong opinion here.
Vivek
On Mon, Apr 23, 2018 at 3:53 PM, Vivek Goyal <[email protected]> wrote:
> On Mon, Apr 23, 2018 at 03:39:45PM +0200, Miklos Szeredi wrote:
>> On Mon, Apr 23, 2018 at 3:36 PM, Vivek Goyal <[email protected]> wrote:
>> > On Thu, Apr 12, 2018 at 05:08:04PM +0200, Miklos Szeredi wrote:
>> >> Implement stacked fsync().
>> >>
>> >> Signed-off-by: Miklos Szeredi <[email protected]>
>> >> ---
>> >> fs/overlayfs/file.c | 20 ++++++++++++++++++++
>> >> 1 file changed, 20 insertions(+)
>> >>
>> >> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>> >> index b98204c1c19c..4417527667ff 100644
>> >> --- a/fs/overlayfs/file.c
>> >> +++ b/fs/overlayfs/file.c
>> >> @@ -222,10 +222,30 @@ static ssize_t ovl_write_iter(struct kiocb *iocb, struct iov_iter *iter)
>> >> return ret;
>> >> }
>> >>
>> >> +static int ovl_fsync(struct file *file, loff_t start, loff_t end, int datasync)
>> >> +{
>> >> + struct fd real;
>> >> + const struct cred *old_cred;
>> >> + int ret;
>> >> +
>> >> + ret = ovl_real_file(file, &real);
>> >> + if (ret)
>> >> + return ret;
>> >> +
>> >> + old_cred = ovl_override_creds(file_inode(file)->i_sb);
>> >> + ret = vfs_fsync_range(real.file, start, end, datasync);
>> >> + revert_creds(old_cred);
>> >
>> > Can we avoid calling fsync() on real file if it is not upper. Is it worth
>> > optimizing.
>>
>> Not sure it's worth bothering with. If caller of fsync(2) didn't
>> worry about cost, then why should we?
>
> I was thinking more from the point of view of metadata copy up patches.
> For a metacopy file, I was thinking if I can just issue fsync() on upper
> file and skip it on lower file.
Ah, in that case I agree with doing fsync only on upper. If there's
a choice in implementing the given functionality (and performance
doesn't matter) then the simplest one should be chosen.
Thanks,
Miklos
On Thu, Apr 12, 2018 at 05:07:58PM +0200, Miklos Szeredi wrote:
> Copy i_size of the underlying inode to the overlay inode in ovl_copyattr().
>
> This is in preparation for stacking I/O operations on overlay files.
>
> This patch shouldn't have any observable effect.
>
> Signed-off-by: Miklos Szeredi <[email protected]>
> ---
> fs/overlayfs/overlayfs.h | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> index 265cb288417a..10f5f3bf9d96 100644
> --- a/fs/overlayfs/overlayfs.h
> +++ b/fs/overlayfs/overlayfs.h
> @@ -9,6 +9,7 @@
>
> #include <linux/kernel.h>
> #include <linux/uuid.h>
> +#include <linux/fs.h>
> #include "ovl_entry.h"
>
> enum ovl_path_type {
> @@ -355,6 +356,7 @@ static inline void ovl_copyattr(struct inode *from, struct inode *to)
> to->i_atime = from->i_atime;
> to->i_mtime = from->i_mtime;
> to->i_ctime = from->i_ctime;
> + i_size_write(to, i_size_read(from));
> }
With this change, is following comment in ovl_getattr() stale now.
/*
* We don't initialize inode->size, which just means that
* inode_newsize_ok() will always check against MAX_LFS_FILESIZE and not
* check for a swapfile (which this won't be anyway).
*/
Thanks
Vivek
Miklos Szeredi:
> This patch series reverts the VFS hacks (with the exception of d_path) and
I totally agree with removing d_real things.
It must be good to the world.
If I understand correctly, this series affects file_inode() too.
So there may exist more commits to revert such as
fea6d2a6 2017-02-14 vfs: Use upper filesystem inode in bprm_fill_uid()
Here is another question.
Does overlayfs support atomic_open?
I remember implementing atomic_open on aufs was rather tricky many years
ago, and I am interested in how overlayfs addresses it.
J. R. Okajima
On Wed, Apr 25, 2018 at 4:49 PM, J. R. Okajima <[email protected]> wrote:
> Miklos Szeredi:
>> This patch series reverts the VFS hacks (with the exception of d_path) and
>
> I totally agree with removing d_real things.
> It must be good to the world.
>
> If I understand correctly, this series affects file_inode() too.
> So there may exist more commits to revert such as
> fea6d2a6 2017-02-14 vfs: Use upper filesystem inode in bprm_fill_uid()
Yep.
>
> Here is another question.
> Does overlayfs support atomic_open?
No. Overlayfs doesn't support network filesystems as modifiable
(upper) layer, and that's where atomic_open makes sense. I think
actually no filesystem that defines atomic_open() can be used as upper
layer. So that's currently not a worry. If it turns out to be a
requirement to use such filesystem as upper, then I'll worry about
atomic_open().
Thanks,
Miklos
On Thu, Apr 12, 2018 at 05:08:00PM +0200, Miklos Szeredi wrote:
[..]
> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> new file mode 100644
> index 000000000000..a0b606885c41
> --- /dev/null
> +++ b/fs/overlayfs/file.c
> @@ -0,0 +1,76 @@
> +/*
> + * Copyright (C) 2017 Red Hat, Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published by
> + * the Free Software Foundation.
> + */
> +
> +#include <linux/cred.h>
> +#include <linux/file.h>
> +#include <linux/xattr.h>
> +#include "overlayfs.h"
> +
> +static struct file *ovl_open_realfile(const struct file *file)
> +{
> + struct inode *inode = file_inode(file);
> + struct inode *upperinode = ovl_inode_upper(inode);
> + struct inode *realinode = upperinode ?: ovl_inode_lower(inode);
> + struct file *realfile;
> + const struct cred *old_cred;
> +
> + old_cred = ovl_override_creds(inode->i_sb);
> + realfile = path_open(&file->f_path, file->f_flags | O_NOATIME,
> + realinode, current_cred(), false);
> + revert_creds(old_cred);
> +
> + pr_debug("open(%p[%pD2/%c], 0%o) -> (%p, 0%o)\n",
> + file, file, upperinode ? 'u' : 'l', file->f_flags,
> + realfile, IS_ERR(realfile) ? 0 : realfile->f_flags);
> +
> + return realfile;
> +}
> +
> +static int ovl_open(struct inode *inode, struct file *file)
> +{
> + struct dentry *dentry = file_dentry(file);
Hi Miklos,
There is one thing I can't wrap my head around, so I better ask.
file_dentry() will call ovl_d_real() and try to find dentry based on
inode installed in f->f_inode. If ovl_d_real() can't find inode dentry
matching the passed in inode, it warns.
Assume, I have a stacked overlay configuration. Let me call top level
overlay layer ovl1 and lower level overlay layer ovl2. Say I open a
file foo.txt. Now ovl_open() in ovl1 decides that realinode is a lower
inode and installs that inode f->f_inode of realfile. (This should be
ovl2 layer inode, let me call it ovl2_inode). Now ovl_open() of ovl2 layer
will be called and it will call file_dentry() and will look for dentry
corresponding to ovl2_inode. I am wondering what if a copy up of foo.txt
was triggered in ovl1 and by the time we called ovl_d_real(dentry,
ovl2_inode), it will start comparing with inode of ovl1_upper and never
find ovl2_inode.
IOW, I am not able to figure out how do we protect agains copy up races
when ovl_open() calls file_dentry().
Thanks
Vivek
On Thu, Apr 26, 2018 at 4:13 PM, Vivek Goyal <[email protected]> wrote:
> On Thu, Apr 12, 2018 at 05:08:00PM +0200, Miklos Szeredi wrote:
>
> [..]
>> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>> new file mode 100644
>> index 000000000000..a0b606885c41
>> --- /dev/null
>> +++ b/fs/overlayfs/file.c
>> @@ -0,0 +1,76 @@
>> +/*
>> + * Copyright (C) 2017 Red Hat, Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#include <linux/cred.h>
>> +#include <linux/file.h>
>> +#include <linux/xattr.h>
>> +#include "overlayfs.h"
>> +
>> +static struct file *ovl_open_realfile(const struct file *file)
>> +{
>> + struct inode *inode = file_inode(file);
>> + struct inode *upperinode = ovl_inode_upper(inode);
>> + struct inode *realinode = upperinode ?: ovl_inode_lower(inode);
>> + struct file *realfile;
>> + const struct cred *old_cred;
>> +
>> + old_cred = ovl_override_creds(inode->i_sb);
>> + realfile = path_open(&file->f_path, file->f_flags | O_NOATIME,
>> + realinode, current_cred(), false);
>> + revert_creds(old_cred);
>> +
>> + pr_debug("open(%p[%pD2/%c], 0%o) -> (%p, 0%o)\n",
>> + file, file, upperinode ? 'u' : 'l', file->f_flags,
>> + realfile, IS_ERR(realfile) ? 0 : realfile->f_flags);
>> +
>> + return realfile;
>> +}
>> +
>> +static int ovl_open(struct inode *inode, struct file *file)
>> +{
>> + struct dentry *dentry = file_dentry(file);
>
> Hi Miklos,
>
> There is one thing I can't wrap my head around, so I better ask.
>
> file_dentry() will call ovl_d_real() and try to find dentry based on
> inode installed in f->f_inode. If ovl_d_real() can't find inode dentry
> matching the passed in inode, it warns.
>
> Assume, I have a stacked overlay configuration. Let me call top level
> overlay layer ovl1 and lower level overlay layer ovl2. Say I open a
> file foo.txt. Now ovl_open() in ovl1 decides that realinode is a lower
> inode and installs that inode f->f_inode of realfile. (This should be
> ovl2 layer inode, let me call it ovl2_inode). Now ovl_open() of ovl2 layer
> will be called and it will call file_dentry() and will look for dentry
> corresponding to ovl2_inode. I am wondering what if a copy up of foo.txt
> was triggered in ovl1 and by the time we called ovl_d_real(dentry,
> ovl2_inode), it will start comparing with inode of ovl1_upper and never
> find ovl2_inode.
Okay, so we've modified ovl_d_real() to allow returning the overlay
dentry itself. This is important: when we fail to match ovl1_upper
with ovl2_inode, well go on to get ovl2_dentry and call d_real()
recursively. That recursive call should match the inode, return it to
outer ovl_d_real(), which again will match the inode and return
without warning.
> IOW, I am not able to figure out how do we protect agains copy up races
> when ovl_open() calls file_dentry().
Racing with a copy up cannot matter, since we'll continue looking for
the inode in the layers and stacks below, regardless of whether we
checked the upper dentry or not.
Does that make it clearer?
Thanks,
Miklos
On Thu, Apr 26, 2018 at 04:43:53PM +0200, Miklos Szeredi wrote:
> On Thu, Apr 26, 2018 at 4:13 PM, Vivek Goyal <[email protected]> wrote:
> > On Thu, Apr 12, 2018 at 05:08:00PM +0200, Miklos Szeredi wrote:
> >
> > [..]
> >> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> >> new file mode 100644
> >> index 000000000000..a0b606885c41
> >> --- /dev/null
> >> +++ b/fs/overlayfs/file.c
> >> @@ -0,0 +1,76 @@
> >> +/*
> >> + * Copyright (C) 2017 Red Hat, Inc.
> >> + *
> >> + * This program is free software; you can redistribute it and/or modify it
> >> + * under the terms of the GNU General Public License version 2 as published by
> >> + * the Free Software Foundation.
> >> + */
> >> +
> >> +#include <linux/cred.h>
> >> +#include <linux/file.h>
> >> +#include <linux/xattr.h>
> >> +#include "overlayfs.h"
> >> +
> >> +static struct file *ovl_open_realfile(const struct file *file)
> >> +{
> >> + struct inode *inode = file_inode(file);
> >> + struct inode *upperinode = ovl_inode_upper(inode);
> >> + struct inode *realinode = upperinode ?: ovl_inode_lower(inode);
> >> + struct file *realfile;
> >> + const struct cred *old_cred;
> >> +
> >> + old_cred = ovl_override_creds(inode->i_sb);
> >> + realfile = path_open(&file->f_path, file->f_flags | O_NOATIME,
> >> + realinode, current_cred(), false);
> >> + revert_creds(old_cred);
> >> +
> >> + pr_debug("open(%p[%pD2/%c], 0%o) -> (%p, 0%o)\n",
> >> + file, file, upperinode ? 'u' : 'l', file->f_flags,
> >> + realfile, IS_ERR(realfile) ? 0 : realfile->f_flags);
> >> +
> >> + return realfile;
> >> +}
> >> +
> >> +static int ovl_open(struct inode *inode, struct file *file)
> >> +{
> >> + struct dentry *dentry = file_dentry(file);
> >
> > Hi Miklos,
> >
> > There is one thing I can't wrap my head around, so I better ask.
> >
> > file_dentry() will call ovl_d_real() and try to find dentry based on
> > inode installed in f->f_inode. If ovl_d_real() can't find inode dentry
> > matching the passed in inode, it warns.
> >
> > Assume, I have a stacked overlay configuration. Let me call top level
> > overlay layer ovl1 and lower level overlay layer ovl2. Say I open a
> > file foo.txt. Now ovl_open() in ovl1 decides that realinode is a lower
> > inode and installs that inode f->f_inode of realfile. (This should be
> > ovl2 layer inode, let me call it ovl2_inode). Now ovl_open() of ovl2 layer
> > will be called and it will call file_dentry() and will look for dentry
> > corresponding to ovl2_inode. I am wondering what if a copy up of foo.txt
> > was triggered in ovl1 and by the time we called ovl_d_real(dentry,
> > ovl2_inode), it will start comparing with inode of ovl1_upper and never
> > find ovl2_inode.
>
> Okay, so we've modified ovl_d_real() to allow returning the overlay
> dentry itself. This is important: when we fail to match ovl1_upper
> with ovl2_inode, well go on to get ovl2_dentry and call d_real()
> recursively. That recursive call should match the inode, return it to
> outer ovl_d_real(), which again will match the inode and return
> without warning.
So current code does following.
ovl_d_real() {
...
...
real = ovl_dentry_real(dentry);
if (inode == d_inode(real))
return real;
/* Handle recursion */
if (unlikely(real->d_flags & DCACHE_OP_REAL))
return real->d_op->d_real(real, inode);
}
If file got copied up in ovl1, then "real" will be ovl1_upper dentry. And
upper is regular fs (only ovl1 lower is overlay), then it should not have
DCACHE_OP_REAL set and that means we will not recurse further and not
find ovl2 dentry matching ovl2_inode and print warning and return
ovl1 dentry.
What am I missing.
Vivek
>
> > IOW, I am not able to figure out how do we protect agains copy up races
> > when ovl_open() calls file_dentry().
>
> Racing with a copy up cannot matter, since we'll continue looking for
> the inode in the layers and stacks below, regardless of whether we
> checked the upper dentry or not.
>
> Does that make it clearer?
>
> Thanks,
> Miklos
On Thu, Apr 26, 2018 at 4:56 PM, Vivek Goyal <[email protected]> wrote:
> On Thu, Apr 26, 2018 at 04:43:53PM +0200, Miklos Szeredi wrote:
>> On Thu, Apr 26, 2018 at 4:13 PM, Vivek Goyal <[email protected]> wrote:
>> > On Thu, Apr 12, 2018 at 05:08:00PM +0200, Miklos Szeredi wrote:
>> >
>> > [..]
>> >> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>> >> new file mode 100644
>> >> index 000000000000..a0b606885c41
>> >> --- /dev/null
>> >> +++ b/fs/overlayfs/file.c
>> >> @@ -0,0 +1,76 @@
>> >> +/*
>> >> + * Copyright (C) 2017 Red Hat, Inc.
>> >> + *
>> >> + * This program is free software; you can redistribute it and/or modify it
>> >> + * under the terms of the GNU General Public License version 2 as published by
>> >> + * the Free Software Foundation.
>> >> + */
>> >> +
>> >> +#include <linux/cred.h>
>> >> +#include <linux/file.h>
>> >> +#include <linux/xattr.h>
>> >> +#include "overlayfs.h"
>> >> +
>> >> +static struct file *ovl_open_realfile(const struct file *file)
>> >> +{
>> >> + struct inode *inode = file_inode(file);
>> >> + struct inode *upperinode = ovl_inode_upper(inode);
>> >> + struct inode *realinode = upperinode ?: ovl_inode_lower(inode);
>> >> + struct file *realfile;
>> >> + const struct cred *old_cred;
>> >> +
>> >> + old_cred = ovl_override_creds(inode->i_sb);
>> >> + realfile = path_open(&file->f_path, file->f_flags | O_NOATIME,
>> >> + realinode, current_cred(), false);
>> >> + revert_creds(old_cred);
>> >> +
>> >> + pr_debug("open(%p[%pD2/%c], 0%o) -> (%p, 0%o)\n",
>> >> + file, file, upperinode ? 'u' : 'l', file->f_flags,
>> >> + realfile, IS_ERR(realfile) ? 0 : realfile->f_flags);
>> >> +
>> >> + return realfile;
>> >> +}
>> >> +
>> >> +static int ovl_open(struct inode *inode, struct file *file)
>> >> +{
>> >> + struct dentry *dentry = file_dentry(file);
>> >
>> > Hi Miklos,
>> >
>> > There is one thing I can't wrap my head around, so I better ask.
>> >
>> > file_dentry() will call ovl_d_real() and try to find dentry based on
>> > inode installed in f->f_inode. If ovl_d_real() can't find inode dentry
>> > matching the passed in inode, it warns.
>> >
>> > Assume, I have a stacked overlay configuration. Let me call top level
>> > overlay layer ovl1 and lower level overlay layer ovl2. Say I open a
>> > file foo.txt. Now ovl_open() in ovl1 decides that realinode is a lower
>> > inode and installs that inode f->f_inode of realfile. (This should be
>> > ovl2 layer inode, let me call it ovl2_inode). Now ovl_open() of ovl2 layer
>> > will be called and it will call file_dentry() and will look for dentry
>> > corresponding to ovl2_inode. I am wondering what if a copy up of foo.txt
>> > was triggered in ovl1 and by the time we called ovl_d_real(dentry,
>> > ovl2_inode), it will start comparing with inode of ovl1_upper and never
>> > find ovl2_inode.
>>
>> Okay, so we've modified ovl_d_real() to allow returning the overlay
>> dentry itself. This is important: when we fail to match ovl1_upper
>> with ovl2_inode, well go on to get ovl2_dentry and call d_real()
>> recursively. That recursive call should match the inode, return it to
>> outer ovl_d_real(), which again will match the inode and return
>> without warning.
>
> So current code does following.
>
> ovl_d_real() {
> ...
> ...
>
> real = ovl_dentry_real(dentry);
> if (inode == d_inode(real))
> return real;
>
> /* Handle recursion */
> if (unlikely(real->d_flags & DCACHE_OP_REAL))
> return real->d_op->d_real(real, inode);
> }
>
> If file got copied up in ovl1, then "real" will be ovl1_upper dentry. And
> upper is regular fs (only ovl1 lower is overlay), then it should not have
> DCACHE_OP_REAL set and that means we will not recurse further and not
> find ovl2 dentry matching ovl2_inode and print warning and return
> ovl1 dentry.
>
> What am I missing.
Ah, that's indeed buggy. The bug is in "[RFC PATCH 34/35] vfs:
simplify d_op->d_real()".
I've already reverted that (due to d_real_inode() acquiring a new
user) and the old code should be good (AFAICS).
Thanks,
Miklos
On Thu, Apr 26, 2018 at 05:01:37PM +0200, Miklos Szeredi wrote:
> On Thu, Apr 26, 2018 at 4:56 PM, Vivek Goyal <[email protected]> wrote:
> > On Thu, Apr 26, 2018 at 04:43:53PM +0200, Miklos Szeredi wrote:
> >> On Thu, Apr 26, 2018 at 4:13 PM, Vivek Goyal <[email protected]> wrote:
> >> > On Thu, Apr 12, 2018 at 05:08:00PM +0200, Miklos Szeredi wrote:
> >> >
> >> > [..]
> >> >> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
> >> >> new file mode 100644
> >> >> index 000000000000..a0b606885c41
> >> >> --- /dev/null
> >> >> +++ b/fs/overlayfs/file.c
> >> >> @@ -0,0 +1,76 @@
> >> >> +/*
> >> >> + * Copyright (C) 2017 Red Hat, Inc.
> >> >> + *
> >> >> + * This program is free software; you can redistribute it and/or modify it
> >> >> + * under the terms of the GNU General Public License version 2 as published by
> >> >> + * the Free Software Foundation.
> >> >> + */
> >> >> +
> >> >> +#include <linux/cred.h>
> >> >> +#include <linux/file.h>
> >> >> +#include <linux/xattr.h>
> >> >> +#include "overlayfs.h"
> >> >> +
> >> >> +static struct file *ovl_open_realfile(const struct file *file)
> >> >> +{
> >> >> + struct inode *inode = file_inode(file);
> >> >> + struct inode *upperinode = ovl_inode_upper(inode);
> >> >> + struct inode *realinode = upperinode ?: ovl_inode_lower(inode);
> >> >> + struct file *realfile;
> >> >> + const struct cred *old_cred;
> >> >> +
> >> >> + old_cred = ovl_override_creds(inode->i_sb);
> >> >> + realfile = path_open(&file->f_path, file->f_flags | O_NOATIME,
> >> >> + realinode, current_cred(), false);
> >> >> + revert_creds(old_cred);
> >> >> +
> >> >> + pr_debug("open(%p[%pD2/%c], 0%o) -> (%p, 0%o)\n",
> >> >> + file, file, upperinode ? 'u' : 'l', file->f_flags,
> >> >> + realfile, IS_ERR(realfile) ? 0 : realfile->f_flags);
> >> >> +
> >> >> + return realfile;
> >> >> +}
> >> >> +
> >> >> +static int ovl_open(struct inode *inode, struct file *file)
> >> >> +{
> >> >> + struct dentry *dentry = file_dentry(file);
> >> >
> >> > Hi Miklos,
> >> >
> >> > There is one thing I can't wrap my head around, so I better ask.
> >> >
> >> > file_dentry() will call ovl_d_real() and try to find dentry based on
> >> > inode installed in f->f_inode. If ovl_d_real() can't find inode dentry
> >> > matching the passed in inode, it warns.
> >> >
> >> > Assume, I have a stacked overlay configuration. Let me call top level
> >> > overlay layer ovl1 and lower level overlay layer ovl2. Say I open a
> >> > file foo.txt. Now ovl_open() in ovl1 decides that realinode is a lower
> >> > inode and installs that inode f->f_inode of realfile. (This should be
> >> > ovl2 layer inode, let me call it ovl2_inode). Now ovl_open() of ovl2 layer
> >> > will be called and it will call file_dentry() and will look for dentry
> >> > corresponding to ovl2_inode. I am wondering what if a copy up of foo.txt
> >> > was triggered in ovl1 and by the time we called ovl_d_real(dentry,
> >> > ovl2_inode), it will start comparing with inode of ovl1_upper and never
> >> > find ovl2_inode.
> >>
> >> Okay, so we've modified ovl_d_real() to allow returning the overlay
> >> dentry itself. This is important: when we fail to match ovl1_upper
> >> with ovl2_inode, well go on to get ovl2_dentry and call d_real()
> >> recursively. That recursive call should match the inode, return it to
> >> outer ovl_d_real(), which again will match the inode and return
> >> without warning.
> >
> > So current code does following.
> >
> > ovl_d_real() {
> > ...
> > ...
> >
> > real = ovl_dentry_real(dentry);
> > if (inode == d_inode(real))
> > return real;
> >
> > /* Handle recursion */
> > if (unlikely(real->d_flags & DCACHE_OP_REAL))
> > return real->d_op->d_real(real, inode);
> > }
> >
> > If file got copied up in ovl1, then "real" will be ovl1_upper dentry. And
> > upper is regular fs (only ovl1 lower is overlay), then it should not have
> > DCACHE_OP_REAL set and that means we will not recurse further and not
> > find ovl2 dentry matching ovl2_inode and print warning and return
> > ovl1 dentry.
> >
> > What am I missing.
>
> Ah, that's indeed buggy. The bug is in "[RFC PATCH 34/35] vfs:
> simplify d_op->d_real()".
>
> I've already reverted that (due to d_real_inode() acquiring a new
> user) and the old code should be good (AFAICS).
Aha, cool. thanks. While I am at it, let me just ask one more stupid
question.
I am wondering while opening the underlying realfile, why do we pass
in the path/dentry of ovl layer (and not underlying real layer).
realfile = path_open(&file->f_path, file->f_flags | O_NOATIME,
realinode, current_cred(), false);
This forces us to do file_dentry() in ovl_open() later to map top level
dentry to underlying dentry.
We know the realinode and should be figure out real dentry. Can't we
construct path from underlying dentry and mount point and use that
to open underlying real file. I am sure there is some reason for doing
this way, just trying to understand it.
Vivek
On Thu, Apr 26, 2018 at 5:13 PM, Vivek Goyal <[email protected]> wrote:
> Aha, cool. thanks. While I am at it, let me just ask one more stupid
> question.
>
> I am wondering while opening the underlying realfile, why do we pass
> in the path/dentry of ovl layer (and not underlying real layer).
>
> realfile = path_open(&file->f_path, file->f_flags | O_NOATIME,
> realinode, current_cred(), false);
>
> This forces us to do file_dentry() in ovl_open() later to map top level
> dentry to underlying dentry.
>
> We know the realinode and should be figure out real dentry. Can't we
> construct path from underlying dentry and mount point and use that
> to open underlying real file. I am sure there is some reason for doing
> this way, just trying to understand it.
The logical thing would be to just use the real path (as returned by
ovl_path_real()).
The reason we don't do that is because mmap stores the real file in
vma->vm_file and vm_file->f_path is used in various places (e.g.
/proc/PID/maps).
We could have a separate realfile for mmap, but that would be
additional complexity and memory use, so I don't think it makes sense.
Thanks,
Miklos
On Tue, Apr 17, 2018 at 10:31 PM, Amir Goldstein <[email protected]> wrote:
> On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
>> Since set of arguments are so similar, handle in a common helper.
>>
>> Signed-off-by: Miklos Szeredi <[email protected]>
>> ---
>> fs/overlayfs/file.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 79 insertions(+)
>>
>> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>> index 9670e160967e..39b1b73334ad 100644
>> --- a/fs/overlayfs/file.c
>> +++ b/fs/overlayfs/file.c
>> @@ -352,6 +352,81 @@ long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>> return ret;
>> }
>>
>> +enum ovl_copyop {
>> + OVL_COPY,
>> + OVL_CLONE,
>> + OVL_DEDUPE,
>> +};
>> +
>> +static ssize_t ovl_copyfile(struct file *file_in, loff_t pos_in,
>> + struct file *file_out, loff_t pos_out,
>> + u64 len, unsigned int flags, enum ovl_copyop op)
>> +{
>> + struct inode *inode_out = file_inode(file_out);
>> + struct fd real_in, real_out;
>> + const struct cred *old_cred;
>> + int ret;
>> +
>> + ret = ovl_real_file(file_out, &real_out);
>> + if (ret)
>> + return ret;
>> +
>> + ret = ovl_real_file(file_in, &real_in);
>> + if (ret) {
>> + fdput(real_out);
>> + return ret;
>> + }
>> +
>> + old_cred = ovl_override_creds(file_inode(file_out)->i_sb);
>> + switch (op) {
>> + case OVL_COPY:
>> + ret = vfs_copy_file_range(real_in.file, pos_in,
>> + real_out.file, pos_out, len, flags);
>
> Problem:
> vfs_copy_file_range(ovl_lower_file, ovl_upper_file) on non samefs
> will get -EXDEV from ovl_copy_file_range(), so will not fall back
> to do_splice_direct().
This is not a regression, right?
> We may be better off checking in_sb != out_sb and returning
> -EOPNOTSUPP? not sure.
I think we should fix vfs_copy_file_range() to fall back to copying if
not on the same fs. Not sure why it doesn't do that now.
>
>
>> + break;
>> +
>> + case OVL_CLONE:
>> + ret = vfs_clone_file_range(real_in.file, pos_in,
>> + real_out.file, pos_out, len);
>> + break;
>> +
>> + case OVL_DEDUPE:
>> + ret = vfs_dedupe_file_range_one(real_in.file, pos_in, len,
>> + real_out.file, pos_out);
>
> Problem:
> real_out can be a readonly fd (for is_admin), so we will be deduping
> the lower file.
Ugh...
> I guess this problem is mitigated in current code by may_write_real().
>
> How can we deal with that sort of "write leak" without patching
> mnt_want_write_file()?
We need to check before calling dedupe on real files that both are on upper.
My problem is what error code to return. Neither EXDEV nor EINVAL
descibe the error adequately. It should be "We could dedupe if we
really wanted to, but it makes no sense to do so"... So now it
returns -EBADE, which means "data was different", but at least that
one should at least be expected by callers.
Thanks,
Miklos
On Thu, May 3, 2018 at 7:04 PM, Miklos Szeredi <[email protected]> wrote:
> On Tue, Apr 17, 2018 at 10:31 PM, Amir Goldstein <[email protected]> wrote:
>> On Thu, Apr 12, 2018 at 6:08 PM, Miklos Szeredi <[email protected]> wrote:
>>> Since set of arguments are so similar, handle in a common helper.
>>>
>>> Signed-off-by: Miklos Szeredi <[email protected]>
>>> ---
>>> fs/overlayfs/file.c | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 79 insertions(+)
>>>
>>> diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
>>> index 9670e160967e..39b1b73334ad 100644
>>> --- a/fs/overlayfs/file.c
>>> +++ b/fs/overlayfs/file.c
>>> @@ -352,6 +352,81 @@ long ovl_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>>> return ret;
>>> }
>>>
>>> +enum ovl_copyop {
>>> + OVL_COPY,
>>> + OVL_CLONE,
>>> + OVL_DEDUPE,
>>> +};
>>> +
>>> +static ssize_t ovl_copyfile(struct file *file_in, loff_t pos_in,
>>> + struct file *file_out, loff_t pos_out,
>>> + u64 len, unsigned int flags, enum ovl_copyop op)
>>> +{
>>> + struct inode *inode_out = file_inode(file_out);
>>> + struct fd real_in, real_out;
>>> + const struct cred *old_cred;
>>> + int ret;
>>> +
>>> + ret = ovl_real_file(file_out, &real_out);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + ret = ovl_real_file(file_in, &real_in);
>>> + if (ret) {
>>> + fdput(real_out);
>>> + return ret;
>>> + }
>>> +
>>> + old_cred = ovl_override_creds(file_inode(file_out)->i_sb);
>>> + switch (op) {
>>> + case OVL_COPY:
>>> + ret = vfs_copy_file_range(real_in.file, pos_in,
>>> + real_out.file, pos_out, len, flags);
>>
>> Problem:
>> vfs_copy_file_range(ovl_lower_file, ovl_upper_file) on non samefs
>> will get -EXDEV from ovl_copy_file_range(), so will not fall back
>> to do_splice_direct().
>
> This is not a regression, right?
Right.
>
>> We may be better off checking in_sb != out_sb and returning
>> -EOPNOTSUPP? not sure.
>
> I think we should fix vfs_copy_file_range() to fall back to copying if
> not on the same fs. Not sure why it doesn't do that now.
>
There seems to be a posting to fix that as we speak...
I seem to recall some flames from hch about a similar change
that NFS folks where trying to push for. Let's see how this one goes.
>>
>>
>>> + break;
>>> +
>>> + case OVL_CLONE:
>>> + ret = vfs_clone_file_range(real_in.file, pos_in,
>>> + real_out.file, pos_out, len);
>>> + break;
>>> +
>>> + case OVL_DEDUPE:
>>> + ret = vfs_dedupe_file_range_one(real_in.file, pos_in, len,
>>> + real_out.file, pos_out);
>>
>> Problem:
>> real_out can be a readonly fd (for is_admin), so we will be deduping
>> the lower file.
>
> Ugh...
>
>> I guess this problem is mitigated in current code by may_write_real().
>>
>> How can we deal with that sort of "write leak" without patching
>> mnt_want_write_file()?
>
> We need to check before calling dedupe on real files that both are on upper.
>
> My problem is what error code to return. Neither EXDEV nor EINVAL
> descibe the error adequately. It should be "We could dedupe if we
> really wanted to, but it makes no sense to do so"... So now it
> returns -EBADE, which means "data was different", but at least that
> one should at least be expected by callers.
>
EPERM dest_fd is immutable
Which exactly what may_write_real() returns today.
Thanks,
Amir.
On Thu, Apr 12, 2018 at 5:07 PM, Miklos Szeredi <[email protected]> wrote:
> Git tree is here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs-rorw
Thanks everyone for your review.
Force pushed new version to the above branch.
Hopefully no comment was missed (I didn't add more reverts yet, those
can wait). I'll do a mailbomb as well next week and start processing
the metacopy patches.
Thanks,
Miklos
---
Miklos Szeredi (38):
vfs: dedpue: return s64
vfs: dedupe: rationalize args
vfs: dedupe: extract helper for a single dedup
vfs: add path_open()
vfs: optionally don't account file in nr_files
vfs: add f_op->pre_mmap()
vfs: export vfs_ioctl() to modules
vfs: export vfs_dedupe_file_range_one() to modules
ovl: copy up times
ovl: copy up inode flags
Revert "Revert "ovl: get_write_access() in truncate""
ovl: copy up file size as well
ovl: deal with overlay files in ovl_d_real()
ovl: stack file ops
ovl: add helper to return real file
ovl: add ovl_read_iter()
ovl: add ovl_write_iter()
ovl: add ovl_fsync()
ovl: add ovl_mmap()
ovl: add ovl_fallocate()
ovl: add lsattr/chattr support
ovl: add ovl_fiemap()
ovl: add O_DIRECT support
ovl: add reflink/copyfile/dedup support
vfs: don't open real
ovl: copy-up on MAP_SHARED
vfs: simplify dentry_open()
Revert "ovl: fix may_write_real() for overlayfs directories"
Revert "ovl: don't allow writing ioctl on lower layer"
vfs: fix freeze protection in mnt_want_write_file() for overlayfs
Revert "ovl: fix relatime for directories"
Revert "vfs: update ovl inode before relatime check"
Revert "vfs: add flags to d_real()"
Revert "vfs: do get_write_access() on upper layer of overlayfs"
Partially revert "locks: fix file locking on overlayfs"
Revert "fsnotify: support overlayfs"
vfs: remove open_flags from d_real()
ovl: fix documentation of non-standard behavior
---
Documentation/filesystems/Locking | 4 +-
Documentation/filesystems/overlayfs.txt | 60 ++--
Documentation/filesystems/vfs.txt | 19 +-
fs/btrfs/ctree.h | 4 +-
fs/btrfs/ioctl.c | 6 +-
fs/file_table.c | 13 +-
fs/inode.c | 46 +--
fs/internal.h | 17 +-
fs/ioctl.c | 1 +
fs/locks.c | 20 +-
fs/namei.c | 2 +-
fs/namespace.c | 69 +----
fs/ocfs2/file.c | 10 +-
fs/open.c | 74 ++---
fs/overlayfs/Kconfig | 21 ++
fs/overlayfs/Makefile | 4 +-
fs/overlayfs/dir.c | 33 ++-
fs/overlayfs/file.c | 506 ++++++++++++++++++++++++++++++++
fs/overlayfs/inode.c | 63 +++-
fs/overlayfs/overlayfs.h | 21 +-
fs/overlayfs/ovl_entry.h | 1 +
fs/overlayfs/super.c | 65 ++--
fs/overlayfs/util.c | 11 +-
fs/read_write.c | 91 +++---
fs/xattr.c | 9 +-
fs/xfs/xfs_file.c | 8 +-
include/linux/dcache.h | 15 +-
include/linux/fs.h | 30 +-
include/linux/fsnotify.h | 14 +-
include/uapi/linux/fs.h | 1 -
mm/util.c | 5 +
31 files changed, 886 insertions(+), 357 deletions(-)
create mode 100644 fs/overlayfs/file.c
On Fri, May 4, 2018 at 6:23 PM, Miklos Szeredi <[email protected]> wrote:
> On Thu, Apr 12, 2018 at 5:07 PM, Miklos Szeredi <[email protected]> wrote:
>
>> Git tree is here:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs-rorw
>
> Thanks everyone for your review.
>
> Force pushed new version to the above branch.
>
> Hopefully no comment was missed (I didn't add more reverts yet, those
> can wait). I'll do a mailbomb as well next week and start processing
> the metacopy patches.
>
Ran -g auto xfstests on overlay over xfs+reflink.
There are no failures and no changes w.r.t skipped tests since v4.17-rc1.
ACK on w.r.t addressing my review comments.
ACK on the freeze protection fix.
Not sure if the fact that we take __mnt_want_write() on overlay and not on
upper_mnt for ovl_real_ioctl() is a problem (i.e. can upper_mnt become
readonly from under overlay? and do we care?).
Thanks,
Amir.
On Sat, May 5, 2018 at 6:37 PM, Amir Goldstein <[email protected]> wrote:
> On Fri, May 4, 2018 at 6:23 PM, Miklos Szeredi <[email protected]> wrote:
>> On Thu, Apr 12, 2018 at 5:07 PM, Miklos Szeredi <[email protected]> wrote:
>>
>>> Git tree is here:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs-rorw
>>
>> Thanks everyone for your review.
>>
>> Force pushed new version to the above branch.
>>
>> Hopefully no comment was missed (I didn't add more reverts yet, those
>> can wait). I'll do a mailbomb as well next week and start processing
>> the metacopy patches.
>>
>
> Ran -g auto xfstests on overlay over xfs+reflink.
> There are no failures and no changes w.r.t skipped tests since v4.17-rc1.
>
> ACK on w.r.t addressing my review comments.
>
> ACK on the freeze protection fix.
> Not sure if the fact that we take __mnt_want_write() on overlay and not on
> upper_mnt for ovl_real_ioctl() is a problem (i.e. can upper_mnt become
> readonly from under overlay? and do we care?).
I don't think we care. The mount part of the real path doesn't play a
very important part because it's private to the overlay and it's never
modified. It would be interesting to dig into the places it is used
(may not be necessary in all cases), but that's for later.
Thanks,
Miklos