2017-09-13 14:05:41

by Miklos Szeredi

[permalink] [raw]
Subject: [GIT PULL] overlayfs update for 4.14

Hi Linus,

Please pull from:

git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs-linus

This fixes d_ino correctness in readdir, which brings overlayfs on par with
normal filesystems regarding inode number semantics, as long as all layers are
on the same filesystem.

There are also some bug fixes, one in particular (random ioctl's shouldn't be
able to modify lower layers) that touches some vfs code, but of course no-op for
non-overlay fs.

Thanks,
Miklos

---
Amir Goldstein (2):
ovl: constant d_ino across copy up
ovl: fix false positive ESTALE on lookup

Miklos Szeredi (7):
ovl: check snprintf return
ovl: fix readdir error value
ovl: constant d_ino for non-merge dirs
ovl: cleanup d_real for negative
vfs: add flags to d_real()
ovl: fix relatime for directories
ovl: don't allow writing ioctl on lower layer

---
Documentation/filesystems/Locking | 2 +-
Documentation/filesystems/vfs.txt | 2 +-
fs/inode.c | 21 ++-
fs/internal.h | 2 +
fs/namespace.c | 64 ++++++-
fs/open.c | 8 +-
fs/overlayfs/dir.c | 11 +-
fs/overlayfs/inode.c | 14 +-
fs/overlayfs/overlayfs.h | 8 +-
fs/overlayfs/readdir.c | 383 ++++++++++++++++++++++++++++++++++----
fs/overlayfs/super.c | 11 +-
fs/overlayfs/util.c | 24 ++-
fs/xattr.c | 9 +-
include/linux/dcache.h | 14 +-
include/linux/fs.h | 2 +-
15 files changed, 496 insertions(+), 79 deletions(-)


2017-09-13 16:25:34

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] overlayfs update for 4.14

On Wed, Sep 13, 2017 at 7:05 AM, Miklos Szeredi <[email protected]> wrote:
>
> There are also some bug fixes, one in particular (random ioctl's shouldn't be
> able to modify lower layers) that touches some vfs code, but of course no-op for
> non-overlay fs.

Argh. I _despise_ those changes to 'd_real()'.

I reall ythink you should have made it a bit in the f_flags argument
instead. The whole "pick the upper file" is _very_ similar to the
file->f_flags bits conceptually - those change how accesses are done
in other ways (eg O_DIRECT etc), and it's entirely possible that some
day maybe you'd even want to expose it to user space as a O_UPPER flag
to open (which would then only succeed if the file is in the upper
overlay, and never open the lower filesystem).

So adding _another_ field that is only for overlayfs and makes
absolutely no sense in any other context is nasty. It's wrong. That's
not a "VFS layer interface" any more. It's just an ugly hack that
makes no sense outside of overlayfs.

Try to make it something that makes conceptual sense, and isn't just
an overlayfs issue. For example, maybe for cachefs the bit could mean
"give me the cached copy without doing validation or looking anything
new up". Try to make it something that actually makes sense from a
higher-level perspective, where overlayfs is just one implementation,
not the driving force. See what I'm saying? THAT is what VFS layer
interfaces should all be about, not about "this one specific
filesystem needs this one specific f*cking ugly hack".

I've pulled it, but I really think you should change this.

Linus

2017-09-14 06:47:02

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [GIT PULL] overlayfs update for 4.14

On Wed, Sep 13, 2017 at 6:25 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Sep 13, 2017 at 7:05 AM, Miklos Szeredi <[email protected]> wrote:
>>
>> There are also some bug fixes, one in particular (random ioctl's shouldn't be
>> able to modify lower layers) that touches some vfs code, but of course no-op for
>> non-overlay fs.
>
> Argh. I _despise_ those changes to 'd_real()'.
>
> I reall ythink you should have made it a bit in the f_flags argument
> instead. The whole "pick the upper file" is _very_ similar to the
> file->f_flags bits conceptually - those change how accesses are done
> in other ways (eg O_DIRECT etc), and it's entirely possible that some
> day maybe you'd even want to expose it to user space as a O_UPPER flag
> to open (which would then only succeed if the file is in the upper
> overlay, and never open the lower filesystem).
>
> So adding _another_ field that is only for overlayfs and makes
> absolutely no sense in any other context is nasty. It's wrong. That's
> not a "VFS layer interface" any more. It's just an ugly hack that
> makes no sense outside of overlayfs.

d_real() is currently is *the* overlayfs-op:

$ git grep -l "\.d_real"
fs/overlayfs/super.c

It might find other uses in the future, but for now that's the fact.

I could have merged D_REAL_UPPER into f_flags, but it would have
polluted the O_ namespace (which is a uapi) with an internal flag.
Look at the MS_ flags mess to see how mixing internal and external
flags is a really bad idea.

Changing internal interfaces is so much easier than userspace
interfaces. So this is not set in stone, and since currently
overlayfs is the only user of the d_real api, it's really trivial to
change around if we find some other use for it.

But if we try to think up some interface (e.g. the O_UPPER that you
suggested) then we'll be stuck with it even if it makes no sense, or
worse, causes problems. So let's just wait until the need for
something comes up, then we can massage the internal interface to our
liking.

Thanks,
Miklos

2017-09-14 20:00:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] overlayfs update for 4.14

On Wed, Sep 13, 2017 at 11:46 PM, Miklos Szeredi <[email protected]> wrote:
>
> d_real() is currently is *the* overlayfs-op:

I know. And it's ugly as #%^! hell.

We don't want to make it uglier.

And honestly, if you think that "it's only for overlayfs, so I can do
anything I damn well want to this p[iece of shit" is valid, then I
want to re-educate you. That is *not* how the VFS layer works.

If you can't bother to try to make it act as a proper VFS thing, then
I don't want to see patches from you to the VFS code.

I'm perfectly happy just reverting those changes again. Because I'm
not making the VFS layer bend over backwards due to some overlayfs
braindamage.

Linus

2017-09-14 20:12:27

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [GIT PULL] overlayfs update for 4.14

On Thu, Sep 14, 2017 at 10:00 PM, Linus Torvalds
<[email protected]> wrote:
> If you can't bother to try to make it act as a proper VFS thing, then
> I don't want to see patches from you to the VFS code.

I was saying that it's a bad idea to mix external and internal flags.
That was the reason the two were separated on that API. I'm open to
arguments about that.

Thanks,
Miklos

2017-09-14 20:24:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] overlayfs update for 4.14

On Thu, Sep 14, 2017 at 1:12 PM, Miklos Szeredi <[email protected]> wrote:
>
> I was saying that it's a bad idea to mix external and internal flags.
> That was the reason the two were separated on that API. I'm open to
> arguments about that.

The thing is, I don't think "upper layer" is any more internal than
"direct IO" is.

And I don't think it's necessarily overlayfs-specific. Other
filesystems have potential upper layers (eg things like caching
layers, where the "upper" layer could be "the cached copy").

And no, I'm, not saying we should expose things directly to user
space. We have lots of operations where we expose some flags to user
space but keep others internal (look at the VM_xyz flags for mmap, for
example - we obviously expose things like read/write/execute to user
space, but we have the MAYREAD/MAYWRITE/MAYEXEC flags in the same word
that are *not* something that user space can just set).

I just don't see any reason why those two "flags" arguments are separate.

Linus

2017-09-15 07:32:36

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [GIT PULL] overlayfs update for 4.14

On Thu, Sep 14, 2017 at 01:24:22PM -0700, Linus Torvalds wrote:

> I just don't see any reason why those two "flags" arguments are separate.

Fine. Here's a patch reverting the new flags and adding O_UPPER.

Thanks,
Miklos
---

From: Miklos Szeredi <[email protected]>
Subject: vfs: d_real: merge flags and open_flags

Remove the second "flags" argument of d_real(). The remaining flags are
now used by:

- open() to pass the open flags, which lets us decide whether the file
needs to be copied up or not

- by the VFS when it needs to check if dentry in question is writable
(i.e. pass O_UPPER, which returns upper dentry or NULL)

Signed-off-by: Miklos Szeredi <[email protected]>
---
Documentation/filesystems/Locking | 2 +-
Documentation/filesystems/vfs.txt | 2 +-
fs/inode.c | 2 +-
fs/namespace.c | 2 +-
fs/open.c | 4 ++--
fs/overlayfs/super.c | 12 ++++++------
include/linux/dcache.h | 17 ++++++++---------
include/linux/fcntl.h | 11 +++++++++++
include/linux/fs.h | 2 +-
9 files changed, 32 insertions(+), 22 deletions(-)

--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -22,7 +22,7 @@ be able to use diff(1).
struct vfsmount *(*d_automount)(struct path *path);
int (*d_manage)(const struct path *, bool);
struct dentry *(*d_real)(struct dentry *, const struct inode *,
- unsigned int, unsigned int);
+ unsigned int);

locking rules:
rename_lock ->d_lock may block rcu-walk
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -988,7 +988,7 @@ struct dentry_operations {
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(const struct path *, bool);
struct dentry *(*d_real)(struct dentry *, const struct inode *,
- unsigned int, unsigned int);
+ unsigned int);
};

d_revalidate: called when the VFS needs to revalidate a dentry. This
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1578,7 +1578,7 @@ static void update_ovl_inode_times(struc
if (rcu || likely(!(dentry->d_flags & DCACHE_OP_REAL)))
return;

- upperdentry = d_real(dentry, NULL, 0, D_REAL_UPPER);
+ upperdentry = d_real(dentry, NULL, O_UPPER);

/*
* If file is on lower then we can't update atime, so no worries about
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -467,7 +467,7 @@ static inline int may_write_real(struct
return 0;

/* File refers to upper, writable layer? */
- upperdentry = d_real(dentry, NULL, 0, D_REAL_UPPER);
+ upperdentry = d_real(dentry, NULL, O_UPPER);
if (upperdentry && file_inode(file) == d_inode(upperdentry))
return 0;

--- a/fs/open.c
+++ b/fs/open.c
@@ -96,7 +96,7 @@ long vfs_truncate(const struct path *pat
* write access on the upper inode, not on the overlay inode. For
* non-overlay filesystems d_real() is an identity function.
*/
- upperdentry = d_real(path->dentry, NULL, O_WRONLY, 0);
+ upperdentry = d_real(path->dentry, NULL, O_WRONLY);
error = PTR_ERR(upperdentry);
if (IS_ERR(upperdentry))
goto mnt_drop_write_and_out;
@@ -857,7 +857,7 @@ EXPORT_SYMBOL(file_path);
int vfs_open(const struct path *path, struct file *file,
const struct cred *cred)
{
- struct dentry *dentry = d_real(path->dentry, NULL, file->f_flags, 0);
+ struct dentry *dentry = d_real(path->dentry, NULL, file->f_flags);

if (IS_ERR(dentry))
return PTR_ERR(dentry);
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -70,12 +70,12 @@ static int ovl_check_append_only(struct

static struct dentry *ovl_d_real(struct dentry *dentry,
const struct inode *inode,
- unsigned int open_flags, unsigned int flags)
+ unsigned int flags)
{
struct dentry *real;
int err;

- if (flags & D_REAL_UPPER)
+ if (flags & O_UPPER)
return ovl_dentry_upper(dentry);

if (!d_is_reg(dentry)) {
@@ -84,8 +84,8 @@ static struct dentry *ovl_d_real(struct
goto bug;
}

- if (open_flags) {
- err = ovl_open_maybe_copy_up(dentry, open_flags);
+ if (flags) {
+ err = ovl_open_maybe_copy_up(dentry, flags);
if (err)
return ERR_PTR(err);
}
@@ -93,7 +93,7 @@ static struct dentry *ovl_d_real(struct
real = ovl_dentry_upper(dentry);
if (real && (!inode || inode == d_inode(real))) {
if (!inode) {
- err = ovl_check_append_only(d_inode(real), open_flags);
+ err = ovl_check_append_only(d_inode(real), flags);
if (err)
return ERR_PTR(err);
}
@@ -105,7 +105,7 @@ static struct dentry *ovl_d_real(struct
goto bug;

/* Handle recursion */
- real = d_real(real, inode, open_flags, 0);
+ real = d_real(real, inode, flags);

if (!inode || inode == d_inode(real))
return real;
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -147,7 +147,7 @@ struct dentry_operations {
struct vfsmount *(*d_automount)(struct path *);
int (*d_manage)(const struct path *, bool);
struct dentry *(*d_real)(struct dentry *, const struct inode *,
- unsigned int, unsigned int);
+ unsigned int);
} ____cacheline_aligned;

/*
@@ -562,27 +562,26 @@ static inline struct dentry *d_backing_d
return upper;
}

-/* d_real() flags */
-#define D_REAL_UPPER 0x2 /* return upper dentry or NULL if non-upper */
-
/**
* d_real - Return the real dentry
* @dentry: the dentry to query
* @inode: inode to select the dentry from multiple layers (can be NULL)
- * @open_flags: open flags to control copy-up behavior
- * @flags: flags to control what is returned by this function
+ * @flags: control copy-up and what is returned by this function
*
* If dentry is on a union/overlay, then return the underlying, real dentry.
* Otherwise return the dentry itself.
*
+ * When opening, check flags to see if file needs to be copied up. If O_UPPER
+ * is given, then return upper dentry or NULL.
+ *
* See also: Documentation/filesystems/vfs.txt
*/
static inline struct dentry *d_real(struct dentry *dentry,
const struct inode *inode,
- unsigned int open_flags, unsigned int flags)
+ unsigned int flags)
{
if (unlikely(dentry->d_flags & DCACHE_OP_REAL))
- return dentry->d_op->d_real(dentry, inode, open_flags, flags);
+ return dentry->d_op->d_real(dentry, inode, flags);
else
return dentry;
}
@@ -597,7 +596,7 @@ static inline struct dentry *d_real(stru
static inline struct inode *d_real_inode(const struct dentry *dentry)
{
/* This usage of d_real() results in const dentry */
- return d_backing_inode(d_real((struct dentry *) dentry, NULL, 0, 0));
+ return d_backing_inode(d_real((struct dentry *) dentry, NULL, 0));
}

struct name_snapshot {
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1235,7 +1235,7 @@ static inline struct inode *file_inode(c

static inline struct dentry *file_dentry(const struct file *file)
{
- return d_real(file->f_path.dentry, file_inode(file), 0, 0);
+ return d_real(file->f_path.dentry, file_inode(file), 0);
}

static inline int locks_lock_file_wait(struct file *filp, struct file_lock *fl)
--- a/include/linux/fcntl.h
+++ b/include/linux/fcntl.h
@@ -10,6 +10,17 @@
FASYNC | O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)

+/*
+ * This is an internal flag currently used by d_real() to return the upper
+ * dentry of an overlayfs.
+ */
+#define O_UPPER 0x80000000
+
+#if O_UPPER & VALID_OPEN_FLAGS
+#error O_UPPER conflicts with some other open flag
+#endif
+
+
#ifndef force_o_largefile
#define force_o_largefile() (BITS_PER_LONG != 32)
#endif

2017-09-15 08:06:29

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [GIT PULL] overlayfs update for 4.14

On Fri, Sep 15, 2017 at 9:32 AM, Miklos Szeredi <[email protected]> wrote:
> On Thu, Sep 14, 2017 at 01:24:22PM -0700, Linus Torvalds wrote:
>
>> I just don't see any reason why those two "flags" arguments are separate.
>
> Fine. Here's a patch reverting the new flags and adding O_UPPER.

And, btw. I also hate all the hacks we need to do in the VFS for the
sake of overlayfs. It may actually end up all being removed and all
the stacking moved to overlayfs; that's something I'd really like to
look at. All of these hacks are there because overlayfs lets the
open file be owned by the underlying filesystem, which is good for
performance and results in simpler code in overlayfs, but may not
actually be worth it.

The whole overlayfs thing started because I was looking at the union
mounts patches and seeing all the horrid vfs impacts and deciding,
that surely some of this can go into a filesystem without major loss
of performance. The first versions had really minimal vfs impact,
but it turned out that that wasn't enough for a bunch of things to
work correctly so d_real() evolved into the beast it is now.

And we still have that issue with an fd opened for read-only and then
one opened for write, resulting in a copy-up, modification, and the
read-only fd still seeing the old data (union mounts had the same
issue, BTW). So we need more hacks or some way to have a shared page
cache that's breakable on copy up, which does not look trivial at all.

Thanks,
Miklos

2017-09-15 18:35:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] overlayfs update for 4.14

On Fri, Sep 15, 2017 at 12:32 AM, Miklos Szeredi <[email protected]> wrote:
>
> Fine. Here's a patch reverting the new flags and adding O_UPPER.

Thanks, this patch looks fine to me, but now with the other discussion
I think I'll leave the d_real() thing alone and see if Al has more
comments about your other approach.

Because it _would_ be even more lovely if we could just get rid of it
entirely, and do all of this internally in ovlfs itself.

> And we still have that issue with an fd opened for read-only and then
> one opened for write, resulting in a copy-up, modification, and the
> read-only fd still seeing the old data (union mounts had the same
> issue, BTW). So we need more hacks or some way to have a shared page
> cache that's breakable on copy up, which does not look trivial at all.

Ugh, no, that's *very* non-trivial.

I can see us doing magical things with the inode->i_mapping being an
ovlfs-private mapping, and then sharing the actual physical pages with
the mapping of the lower filesystem.

So that way, the actual file data would never point directly at the
lower filesystem.

But equally obviously, that would *only* work when the lower
filesystem is never modified directly. There may be other, even worse,
issues with page sharing across mappings. So it might be a completely
broken model, I haven't thought much about it.

Linus

2017-09-17 14:38:19

by J. R. Okajima

[permalink] [raw]
Subject: Re: [GIT PULL] overlayfs update for 4.14

Linus Torvalds:
> On Wed, Sep 13, 2017 at 11:46 PM, Miklos Szeredi <[email protected]> wrote:
> >
> > d_real() is currently is *the* overlayfs-op:
>
> I know. And it's ugly as #%^! hell.
>
> We don't want to make it uglier.
>
> And honestly, if you think that "it's only for overlayfs, so I can do
> anything I damn well want to this p[iece of shit" is valid, then I
> want to re-educate you. That is *not* how the VFS layer works.

Miklos,

I've just read your experimental patch to remove ->d_real(). I believe
it is the way to go. Moreover I'd suggest you to re-consider these
things.

- d_real_inode(), d_backing_inode(), and locks_inode()
I don't think it a good idea for VFS and other modules to care "which
inode I should use" issue. Ideally they should call d_inode() and
file_inode() only as they used to.

- RENAME_WHITEOUT flag for user-space
Why does this flag visible to users? It is a pure internal flag, isn't
it? When and what for do users specify this flag?

Git-grepping the overlayfs related changes in other modules, I'd also
suggest you to re-consider these.

v4.2
4bacc9c 2015-06-19 overlayfs: Make f_path always point to the overlay and f_inode to the underlay

v4.6
54d5ca8 2016-05-10 vfs: add vfs_select_inode() helper
d101a12 2016-03-26 fs: add file_dentry()

v4.7
a118084 2016-05-20 vfs: add d_real_inode() helper

v4.8
c1892c3 2016-08-03 vfs: fix deadlock in file_remove_privs() on overlayfs

v4.9
4d0c5ba 2016-09-16 vfs: do get_write_access() on upper layer of overlayfs
c568d68 2016-09-16 locks: fix file locking on overlayfs
f3fbbb0 2016-09-16 fsnotify: support overlayfs
598e3c8 2016-09-16 vfs: update ovl inode before relatime check

v4.12
78757af 2017-04-20 vfs: ftruncate check IS_APPEND() on real upper inode

v4.13
ad0af71 2017-07-04 vfs: introduce inode 'inuse' lock


J. R. Okajima

2017-09-17 21:27:10

by Dave Chinner

[permalink] [raw]
Subject: Re: [GIT PULL] overlayfs update for 4.14

On Fri, Sep 15, 2017 at 10:06:24AM +0200, Miklos Szeredi wrote:
> or some way to have a shared page
> cache that's breakable on copy up, which does not look trivial at all.

If we get that, it will come through overlay using reflink for
copyup and filesystems doing the page cache sharing of multiply
referenced data blocks. I don't see overlay being involved in this
functionality at all....

Cheers,

Dave.
--
Dave Chinner
[email protected]