Here's an updated version of the overlay filesystem. I'd like to
propose it for inclusion into mainline.
Executive summary:
Overlayfs allows one, usually read-write, directory tree to be
overlaid onto another, read-only directory tree. All modifications
go to the upper, writable layer.
This type of mechanism is most often used for live CDs but there's a
wide variety of other uses.
The implementation differs from other "union filesystem"
implementations in that after a file is opened, all operations go
directly to the underlying, lower or upper, filesystems. This
simplifies the implementation and allows native performance in these
cases.
For more information see the excellent documentation written by Neil
Brown at the end of the series.
Al, can you please review the VFS parts (patches 1-3)?
Git tree is here:
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git overlayfs.v7
Thanks,
Miklos
---
Andy Whitcroft (1):
overlayfs: add statfs support
Miklos Szeredi (4):
vfs: add i_op->open()
vfs: export do_splice_direct() to modules
vfs: introduce clone_private_mount()
overlay filesystem prototype
Neil Brown (1):
overlay: overlay filesystem documentation
---
Documentation/filesystems/overlayfs.txt | 163 +++
fs/Kconfig | 1 +
fs/Makefile | 1 +
fs/namespace.c | 17 +
fs/open.c | 76 +-
fs/overlayfs/Kconfig | 4 +
fs/overlayfs/Makefile | 5 +
fs/overlayfs/overlayfs.c | 2414 +++++++++++++++++++++++++++++++
fs/splice.c | 1 +
include/linux/fs.h | 2 +
include/linux/mount.h | 3 +
11 files changed, 2661 insertions(+), 26 deletions(-)
create mode 100644 Documentation/filesystems/overlayfs.txt
create mode 100644 fs/overlayfs/Kconfig
create mode 100644 fs/overlayfs/Makefile
create mode 100644 fs/overlayfs/overlayfs.c
------------------------------------------------------------------------------
Changes from v6 to v7
- added patches from Felix Fietkau to fix deadlocks on jffs2
- optimized directory removal
- properly clean up after copy-up and other failures
------------------------------------------------------------------------------
Changes from v5 to v6
- optimize directory merging
o use rbtree for weeding out duplicates
o use a cursor for current position within the stream
- instead of f_op->open_other(), implement i_op->open()
- don't share inodes for non-directory dentries - for now. I hope
this can come back once RCU lookup code has settled.
- misc bug fixes
------------------------------------------------------------------------------
Changes from v4 to v5
- fix copying up if fs doesn't support xattrs (Andy Whitcroft)
- clone mounts to be used internally to access the underlying
filesystems
------------------------------------------------------------------------------
Changes from v3 to v4
- export security_inode_permission to allow overlayfs to be modular
(Andy Whitcroft)
- add statfs support (Andy Whitcroft)
- change BUG_ON to WARN_ON
- Revert "vfs: add flag to allow rename to same inode", instead
introduce s_op->is_same_inode()
- overlayfs: fix rename to self
- fix whiteout after rename
------------------------------------------------------------------------------
Changes from v2 to v3
- Minimal remount support. As overlayfs reflects the 'readonly'
mount status in write-access to the upper filesystem, we must
handle remount and either drop or take write access when the ro
status changes. (NeilBrown)
- Use correct seek function for directories. It is incorrect to call
generic_llseek_file on a file from a different filesystem. For
that we must use the seek function that the filesystem defines,
which is called by vfs_llseek. Also, we only want to seek the
realfile when is_real is true. Otherwise we just want to update
our own f_pos pointer, so use generic_llseek_file for
that. (NeilBrown)
- Initialise is_real before use. The previous patch can use
od->is_real before it is properly initialised is llseek is called
before readdir. So factor out the initialisation of is_real and
call it from both readdir and llseek when f_pos is 0. (NeilBrown)
- Rename ovl_fill_cache to ovl_dir_read (NeilBrown)
- Tiny optimisation in open_other handling (NeilBrown)
- Assorted updates to Documentation/filesystems/overlayfs.txt (NeilBrown)
- Make copy-up work for >=4G files, make it killable during copy-up.
Need to fix recovery after a failed/interrupted copy-up.
- Store and reference upper/lower dentries in overlay dentries.
Store and reference upper/lower vfsmounts in overlay superblock.
- Add necessary barriers for setting upper dentry in copyup and for
retrieving upper dentry locklessly.
- Make sure the right file is used for directory fsync() after
copy-up.
- Add locking to ovl_dir_llseek() to prevent concurrent call of
ovl_dir_reset() with ovl_dir_read().
- Get rid of ovl_dentry_iput(). The VFS doesn't provide enough
locking for this function that the contents of ->d_fsdata could be
safely updated.
- After copying up a non-directory unhash the dentry. This way the
lower dentry ref, which is no longer necessary, can go away. This
revealed a use-after-free bug in truncate handling in
fs/namei.c:finish_open().
- Fix if a copy-up happens between the follow_linka the put_link
calls.
- Replace some WARN_ONs with BUG_ON. Some things just _really_
shouldn't happen.
- Extract common code from ovl_unlink and ovl_rmdir to a helper
function.
- After unlink and rmdir unhash the dentry. This will get rid of the
lower and upper dentry references after there are no more users of
the deleted dentry. This is a safe replacement for the removed
->d_iput() functionality.
- Added checks to unlink, rmdir and rename to verify that the
parent-child relationship in the upper filesystem matches that of
the overlay. This is necessary to prevent crash and/or corruption
if the upper filesystem topology is being modified while part of
the overlay.
- Optimize checking whiteout and opaque attributes.
- Optimize copy-up on truncate: don't copy up whole file before
truncating
- Misc bug fixes
------------------------------------------------------------------------------
Changes from v1 to v2
- rename "hybrid union filesystem" to "overlay filesystem" or overlayfs
- added documentation written by Neil
- correct st_dev for directories (reported by Neil)
- use getattr() to get attributes from the underlying filesystems,
this means that now an overlay filesystem itself can be the lower,
read-only layer of another overlay
- listxattr filters out private extended attributes
- get write ref on the upper layer on mount unless the overlay
itself is mounted read-only
- raise capabilities for copy up, dealing with whiteouts and opaque
directories. Now the overlay works for non-root users as well
- "rm -rf" didn't work correctly in all cases if the directory was
copied up between opendir and the first readdir, this is now fixed
(and the directory operations consolidated)
- simplified copy up, this broke optimization for truncate and
open(O_TRUNC) (now file is copied up to be immediately truncated,
will fix)
- st_nlink for merged directories set to 1, this is an "illegal"
value that normal filesystems never have but some use it to
indicate that the number of subdirectories is unknown. Utilities
(find, ...) seem to tolerate this well.
- misc fixes I forgot about
On Tue, Mar 22, 2011 at 8:26 AM, Miklos Szeredi <[email protected]> wrote:
> Here's an updated version of the overlay filesystem. ?I'd like to
> propose it for inclusion into mainline.
So on the whole it looked pretty small and simple. And most of the VFS
level changes looked fine and I just reacted to the odd calling
convention for open (I really think you should aim for ->open to have
the basically same arguments as you made __dentry_open have: 'struct
path', 'struct filp' and 'struct cred').
But I'd want Al's ack on the series. And also hear who uses it and how
it's been tested?
Linus
On 2011-03-22 6:36 PM, Linus Torvalds wrote:
> On Tue, Mar 22, 2011 at 8:26 AM, Miklos Szeredi <[email protected]> wrote:
>> Here's an updated version of the overlay filesystem. I'd like to
>> propose it for inclusion into mainline.
>
> So on the whole it looked pretty small and simple. And most of the VFS
> level changes looked fine and I just reacted to the odd calling
> convention for open (I really think you should aim for ->open to have
> the basically same arguments as you made __dentry_open have: 'struct
> path', 'struct filp' and 'struct cred').
>
> But I'd want Al's ack on the series. And also hear who uses it and how
> it's been tested?
We're using it in OpenWrt (an Embedded Linux distribution) for devices
with tiny amounts of flash for the entire system (e.g. 4 MB).
We're using it to provide a writable on-flash root filesystem with
squashfs for the read-only part and jffs2 for the writable overlay. This
saves some precious flash space compared to using only jffs2, and it
makes it easy for users to reset their device to defaults without having
to reflash.
With a backport of v6 of this series + my fixes that went into v7 this
is working quite well on 2.6.37 and 2.6.38 - I'm using it on a few
wireless access points at home.
- Felix
On Tue, Mar 22, 2011 at 11:22 AM, Felix Fietkau <[email protected]> wrote:
>
> We're using it in OpenWrt (an Embedded Linux distribution) for devices
> with tiny amounts of flash for the entire system (e.g. 4 MB).
> We're using it to provide a writable on-flash root filesystem with
> squashfs for the read-only part and jffs2 for the writable overlay. This
> saves some precious flash space compared to using only jffs2, and it
> makes it easy for users to reset their device to defaults without having
> to reflash.
> With a backport of v6 of this series + my fixes that went into v7 this
> is working quite well on 2.6.37 and 2.6.38 - I'm using it on a few
> wireless access points at home.
Ok - sounds like an eminently sane and reasonable usage scenario.
Al?
Linus
On Tue, Mar 22, 2011 at 10:36:09AM -0700, Linus Torvalds wrote:
> On Tue, Mar 22, 2011 at 8:26 AM, Miklos Szeredi <[email protected]> wrote:
> > Here's an updated version of the overlay filesystem. ?I'd like to
> > propose it for inclusion into mainline.
>
> So on the whole it looked pretty small and simple. And most of the VFS
> level changes looked fine and I just reacted to the odd calling
> convention for open (I really think you should aim for ->open to have
> the basically same arguments as you made __dentry_open have: 'struct
> path', 'struct filp' and 'struct cred').
>
> But I'd want Al's ack on the series. And also hear who uses it and how
> it's been tested?
To be honest, I *really* don't like to touch that before ->atomic_open()
is finished. We are nearly there (the only remaining prereq is to sort out
revalidate vs. mountpoint crossing issues), so with any luck the neighborhood
of open() will be finished in for-next circa -rc3 or so.
As for the rest of it... I'll need to review that in details, but the first
impression from that code is that I don't like the way copyup is done.
Locking analysis would be really nice; AFAICS, it violates locking order
when called from e.g. ->setattr() and its protection against renames is
nowhere near enough. I might be missing something subtle, but...
On Tue, Mar 22, 2011 at 11:39 AM, Al Viro <[email protected]> wrote:
>
> Locking analysis would be really nice; AFAICS, it violates locking order
> when called from e.g. ->setattr() and its protection against renames is
> nowhere near enough. ?I might be missing something subtle, but...
Miklos - have you tried using this with lockdep (together with the
same filesystems mounted natively too)? I'd expect that that should
show any bad lock usage..
Linus
On Tue, Mar 22, 2011 at 1:22 PM, Felix Fietkau <[email protected]> wrote:
> On 2011-03-22 6:36 PM, Linus Torvalds wrote:
>> On Tue, Mar 22, 2011 at 8:26 AM, Miklos Szeredi <[email protected]> wrote:
>>> Here's an updated version of the overlay filesystem. I'd like to
>>> propose it for inclusion into mainline.
>>
>> So on the whole it looked pretty small and simple. And most of the VFS
>> level changes looked fine and I just reacted to the odd calling
>> convention for open (I really think you should aim for ->open to have
>> the basically same arguments as you made __dentry_open have: 'struct
>> path', 'struct filp' and 'struct cred').
>>
>> But I'd want Al's ack on the series. And also hear who uses it and how
>> it's been tested?
> We're using it in OpenWrt (an Embedded Linux distribution) for devices
> with tiny amounts of flash for the entire system (e.g. 4 MB).
> We're using it to provide a writable on-flash root filesystem with
> squashfs for the read-only part and jffs2 for the writable overlay. This
> saves some precious flash space compared to using only jffs2, and it
> makes it easy for users to reset their device to defaults without having
> to reflash.
> With a backport of v6 of this series + my fixes that went into v7 this
> is working quite well on 2.6.37 and 2.6.38 - I'm using it on a few
> wireless access points at home.
>
> - Felix
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
how is this filesystem related to mini_fo and unionfs?
On Tue, 22 Mar 2011, Linus Torvalds wrote:
> On Tue, Mar 22, 2011 at 11:39 AM, Al Viro <[email protected]> wrote:
> >
> > Locking analysis would be really nice; AFAICS, it violates locking order
> > when called from e.g. ->setattr()
Locking order is always:
-> overlayfs locks
-> upper fs locks
-> lower fs locks
So it's really pretty simple and easy to validate.
> > and its protection against renames is
> > nowhere near enough. I might be missing something subtle, but...
Protection is exactly as for userspace callers. AFAICT.
> Miklos - have you tried using this with lockdep (together with the
> same filesystems mounted natively too)? I'd expect that that should
> show any bad lock usage..
Ah, lockdep. I have tried, but there seems to be always something
that triggers it at boot time on my laptop, which makes it useless. I
could find some other machine to test this on, though.
Thanks,
Miklos
On Tue, Mar 22, 2011 at 07:58:17PM +0100, Miklos Szeredi wrote:
> On Tue, 22 Mar 2011, Linus Torvalds wrote:
> > On Tue, Mar 22, 2011 at 11:39 AM, Al Viro <[email protected]> wrote:
> > >
> > > Locking analysis would be really nice; AFAICS, it violates locking order
> > > when called from e.g. ->setattr()
>
> Locking order is always:
>
> -> overlayfs locks
> -> upper fs locks
> -> lower fs locks
>
> So it's really pretty simple and easy to validate.
In which *order* on the upper fs?
> Protection is exactly as for userspace callers. AFAICT.
Pardon? You traverse the chain of ancestors; fine, but who says it stays
anywhere near being relevant as you go?
On Tue, Mar 22, 2011 at 07:58:17PM +0100, Miklos Szeredi wrote:
> > > and its protection against renames is
> > > nowhere near enough. ??I might be missing something subtle, but...
>
> Protection is exactly as for userspace callers. AFAICT.
BTW, what filesystems can act as upper layers and how are you going to
prevent modifications of upper layer in normal way? It is mounted,
after all, or you would be unable to find it when mounting overlayfs.
And it might be mounted in any number of places, not all even visible to
you... I realize that you have it listed as a problem, but do you have
any ideas on how to deal with that?
If you allow NFS as upper layer, you really have a problem; with this
approach you probably want to prevent that very forcibly. Not that
your open() handling would work correctly with NFS, even with no modifications
from other clients or from server...
On Tue, 22 Mar 2011, Al Viro wrote:
> On Tue, Mar 22, 2011 at 07:58:17PM +0100, Miklos Szeredi wrote:
> > On Tue, 22 Mar 2011, Linus Torvalds wrote:
> > > On Tue, Mar 22, 2011 at 11:39 AM, Al Viro <[email protected]> wrote:
> > > >
> > > > Locking analysis would be really nice; AFAICS, it violates locking order
> > > > when called from e.g. ->setattr()
> >
> > Locking order is always:
> >
> > -> overlayfs locks
> > -> upper fs locks
> > -> lower fs locks
> >
> > So it's really pretty simple and easy to validate.
>
> In which *order* on the upper fs?
In copy up it does:
-> lock parent on upper
-> lock child on upper
So a setattr with copy up would go like this:
-> lock child on overlayfs
-> lock parent on upper
->lock child on upper
-> lock child on upper
> > Protection is exactly as for userspace callers. AFAICT.
>
> Pardon? You traverse the chain of ancestors; fine, but who says it stays
> anywhere near being relevant as you go?
Not quite sure I understand.
There are no assumptions about locks in overlayfs keeping anything
relevant in upper/lower fs. Everything is re-checked and re-locked on
the upper layer before proceeding with the rename.
Thanks,
Miklos
On Tue, Mar 22, 2011 at 08:43:17PM +0100, Miklos Szeredi wrote:
> In copy up it does:
>
> -> lock parent on upper
> -> lock child on upper
>
> So a setattr with copy up would go like this:
>
> -> lock child on overlayfs
> -> lock parent on upper
> ->lock child on upper
> -> lock child on upper
>
> > > Protection is exactly as for userspace callers. AFAICT.
> >
> > Pardon? You traverse the chain of ancestors; fine, but who says it stays
> > anywhere near being relevant as you go?
>
> Not quite sure I understand.
>
> There are no assumptions about locks in overlayfs keeping anything
> relevant in upper/lower fs. Everything is re-checked and re-locked on
> the upper layer before proceeding with the rename.
Proceeding with rename is not interesting; proceeding with copyup is.
Who said that by the time we get to copy_up_locked you will still have
dentry (and upper) match lowerpath? Or that ->d_parent on overlay and
on upper will change in sync, for that matter - there are two d_move()
calls involved...
On Tue, 22 Mar 2011, Al Viro wrote:
> On Tue, Mar 22, 2011 at 07:58:17PM +0100, Miklos Szeredi wrote:
>
> > > > and its protection against renames is
> > > > nowhere near enough. ??I might be missing something subtle, but...
> >
> > Protection is exactly as for userspace callers. AFAICT.
>
> BTW, what filesystems can act as upper layers and how are you going to
> prevent modifications of upper layer in normal way? It is mounted,
> after all, or you would be unable to find it when mounting overlayfs.
> And it might be mounted in any number of places, not all even visible to
> you... I realize that you have it listed as a problem, but do you have
> any ideas on how to deal with that?
Yes, I have some patches, but decided that that should be a separate
set, once the basics are ironed out.
Since the locking guarantees are separated on the upper/lower fs from
the overlayfs, allowing modification is not a huge problem. The worst
that can happen is that an attacker who has access to both the overlay
and the upper or lower fs then can "build" an arbitrarily deep
directory tree on the overlayfs. Not a big issue. There won't be
deadlocks or filesystem corruption.
> If you allow NFS as upper layer, you really have a problem; with this
> approach you probably want to prevent that very forcibly. Not that
> your open() handling would work correctly with NFS, even with no modifications
> from other clients or from server...
Upper layer doesn't work on NFS for multiple reasons.
Thanks,
Miklos
On Tue, 22 Mar 2011, Al Viro wrote:
> Proceeding with rename is not interesting; proceeding with copyup is.
>
> Who said that by the time we get to copy_up_locked you will still have
> dentry (and upper) match lowerpath? Or that ->d_parent on overlay and
> on upper will change in sync, for that matter - there are two d_move()
> calls involved...
If rename is involved, than rename itself already did the copy up.
And that's checked before proceeding with the actual copy up. If
there was no rename, then that guarantees that things are in sync, at
least for the duration of the copy up.
Is that not enough?
Thanks,
Miklos
On Tue, Mar 22, 2011 at 09:06:38PM +0100, Miklos Szeredi wrote:
> On Tue, 22 Mar 2011, Al Viro wrote:
> > Proceeding with rename is not interesting; proceeding with copyup is.
> >
> > Who said that by the time we get to copy_up_locked you will still have
> > dentry (and upper) match lowerpath? Or that ->d_parent on overlay and
> > on upper will change in sync, for that matter - there are two d_move()
> > calls involved...
>
> If rename is involved, than rename itself already did the copy up.
> And that's checked before proceeding with the actual copy up. If
> there was no rename, then that guarantees that things are in sync, at
> least for the duration of the copy up.
What do you mean, before? It's not atomic... What happens if e.g.
you get
A: decided to do copy_up_locked
blocked on i_mutex
B: did copy_up
did rename(), complete with d_move()
did unlink() in new place
A: got CPU back, got i_mutex
proceeded to copy lower into new location
set dentry to very odd (upper,lower) pair
On Tue, 22 Mar 2011, Al Viro wrote:
> On Tue, Mar 22, 2011 at 09:06:38PM +0100, Miklos Szeredi wrote:
> > On Tue, 22 Mar 2011, Al Viro wrote:
> > > Proceeding with rename is not interesting; proceeding with copyup is.
> > >
> > > Who said that by the time we get to copy_up_locked you will still have
> > > dentry (and upper) match lowerpath? Or that ->d_parent on overlay and
> > > on upper will change in sync, for that matter - there are two d_move()
> > > calls involved...
> >
> > If rename is involved, than rename itself already did the copy up.
> > And that's checked before proceeding with the actual copy up. If
> > there was no rename, then that guarantees that things are in sync, at
> > least for the duration of the copy up.
>
> What do you mean, before? It's not atomic... What happens if e.g.
> you get
>
> A: decided to do copy_up_locked
> blocked on i_mutex
>
> B: did copy_up
> did rename(), complete with d_move()
> did unlink() in new place
>
> A: got CPU back, got i_mutex
Here it can check if the file was copied up or not. OK, I see the
code doesn't quite get that right.
Patch below would fix it, I think.
Thanks,
Miklos
diff --git a/fs/overlayfs/overlayfs.c b/fs/overlayfs/overlayfs.c
index e7fcbde..0a1137b 100644
--- a/fs/overlayfs/overlayfs.c
+++ b/fs/overlayfs/overlayfs.c
@@ -1269,8 +1269,7 @@ static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
* racing with rename (rename means the copy up was already
* successful).
*/
- if (dentry->d_parent != parent) {
- WARN_ON((ovl_path_type(dentry) == OVL_PATH_LOWER));
+ if (ovl_path_type(dentry) != OVL_PATH_LOWER) {
err = 0;
} else {
err = ovl_copy_up_locked(upperdir, dentry, lowerpath,
On Tue, 22 Mar 2011, Miklos Szeredi wrote:
> On Tue, 22 Mar 2011, Al Viro wrote:
> > On Tue, Mar 22, 2011 at 09:06:38PM +0100, Miklos Szeredi wrote:
> > > On Tue, 22 Mar 2011, Al Viro wrote:
> > > > Proceeding with rename is not interesting; proceeding with copyup is.
> > > >
> > > > Who said that by the time we get to copy_up_locked you will still have
> > > > dentry (and upper) match lowerpath? Or that ->d_parent on overlay and
> > > > on upper will change in sync, for that matter - there are two d_move()
> > > > calls involved...
> > >
> > > If rename is involved, than rename itself already did the copy up.
> > > And that's checked before proceeding with the actual copy up. If
> > > there was no rename, then that guarantees that things are in sync, at
> > > least for the duration of the copy up.
> >
> > What do you mean, before? It's not atomic... What happens if e.g.
> > you get
> >
> > A: decided to do copy_up_locked
> > blocked on i_mutex
> >
> > B: did copy_up
> > did rename(), complete with d_move()
> > did unlink() in new place
> >
> > A: got CPU back, got i_mutex
>
> Here it can check if the file was copied up or not. OK, I see the
> code doesn't quite get that right.
>
> Patch below would fix it, I think.
Scratch that, d_move() on overlayfs would also have to be under
i_mutex of upper. But that should be doable as well. I'll fix that
tomorrow when my brain is a little fresher.
Thanks,
Miklos
On Tuesday 22 March 2011, 19:49:44 Xianghua Xiao wrote:
> On Tue, Mar 22, 2011 at 1:22 PM, Felix Fietkau <[email protected]>
wrote:
> > On 2011-03-22 6:36 PM, Linus Torvalds wrote:
> >> On Tue, Mar 22, 2011 at 8:26 AM, Miklos Szeredi <[email protected]>
wrote:
> >>> Here's an updated version of the overlay filesystem. I'd like to
> >>> propose it for inclusion into mainline.
> >>
> >> So on the whole it looked pretty small and simple. And most of the
> >> VFS level changes looked fine and I just reacted to the odd
> >> calling convention for open (I really think you should aim for
> >> ->open to have the basically same arguments as you made
> >> __dentry_open have: 'struct path', 'struct filp' and 'struct
> >> cred').
> >>
> >> But I'd want Al's ack on the series. And also hear who uses it and
> >> how it's been tested?
> >
> > We're using it in OpenWrt (an Embedded Linux distribution) for
> > devices with tiny amounts of flash for the entire system (e.g. 4
> > MB). We're using it to provide a writable on-flash root filesystem
> > with squashfs for the read-only part and jffs2 for the writable
> > overlay. This saves some precious flash space compared to using
> > only jffs2, and it makes it easy for users to reset their device to
> > defaults without having to reflash.
> > With a backport of v6 of this series + my fixes that went into v7
> > this is working quite well on 2.6.37 and 2.6.38 - I'm using it on a
> > few wireless access points at home.
> >
> > - Felix
> > --
> > To unsubscribe from this list: send the line "unsubscribe
> > linux-kernel" in the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
>
> how is this filesystem related to mini_fo and unionfs?
and aufs2: it's a minimalistic version of a layered filesystem, being
much less capable then its ancestors, but hopefully the right approach
to get the foot in the door.. Unfortunately, it's locking out NFS,
which is a big miss for diskless usage scenarios (where layered
filesystems are also used commonly since ages..).
Probably, it will grow up, Al gets the VFS sorted out from a FS layering
perspective, Valerie Aurora and friends get union mounts operating and
merged, and/or Linus finally merges Junjiro Okajima's aufs2, which is
often used in production environments today, where other layered
filesystem approaches are unusable or simply fail in practice.
Btw: Fortunately, Junjiro, being a Tokio habitant, wasn't injured from
the earth quake nor the tzunami, and let us all pray five minutes for
success and survival of those brave guys, that try to fix the cooling
in the Fukushima atomic plant.
Pete
On Tue, 22 Mar 2011, Miklos Szeredi wrote:
> > What do you mean, before? It's not atomic... What happens if e.g.
> > you get
> >
> > A: decided to do copy_up_locked
> > blocked on i_mutex
> >
> > B: did copy_up
> > did rename(), complete with d_move()
> > did unlink() in new place
> >
> > A: got CPU back, got i_mutex
>
> Here it can check if the file was copied up or not. OK, I see the
> code doesn't quite get that right.
>
> Patch below would fix it, I think.
Patch is correct, after all. More cleanups in that area below, as
well as analysis of rename vs. copy up race.
Thanks for taking a look.
Miklos
diff --git a/fs/overlayfs/overlayfs.c b/fs/overlayfs/overlayfs.c
index e7fcbde..b97a481 100644
--- a/fs/overlayfs/overlayfs.c
+++ b/fs/overlayfs/overlayfs.c
@@ -1166,15 +1166,8 @@ static int ovl_copy_up_locked(struct dentry *upperdir, struct dentry *dentry,
newpath.mnt = ofs->upper_mnt;
newpath.dentry = ovl_upper_create(upperdir, dentry, stat, link);
- if (IS_ERR(newpath.dentry)) {
- err = PTR_ERR(newpath.dentry);
-
- /* Already copied up? */
- if (err == -EEXIST && ovl_path_type(dentry) != OVL_PATH_LOWER)
- return 0;
-
- return err;
- }
+ if (IS_ERR(newpath.dentry))
+ return PTR_ERR(newpath.dentry);
if (S_ISREG(stat->mode)) {
err = ovl_copy_up_data(lowerpath, &newpath, stat->size);
@@ -1218,6 +1211,21 @@ err_remove:
return err;
}
+/*
+ * Copy up a single dentry
+ *
+ * Directory renames only allowed on "pure upper" (already created on
+ * upper filesystem, never copied up). Directories which are on lower or
+ * are merged may not be renamed. For these -EXDEV is returned and
+ * userspace has to deal with it. This means, when copying up a
+ * directory we can rely on it and ancestors being stable.
+ *
+ * Non-directory renames start with copy up of source if necessary. The
+ * actual rename will only proceed once the copy up was successful. Copy
+ * up uses upper parent i_mutex for exclusion. Since rename can change
+ * d_parent it is possible that the copy up will lock the old parent. At
+ * that point the file will have already been copied up anyway.
+ */
static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
struct path *lowerpath, struct kstat *stat)
{
@@ -1264,13 +1272,7 @@ static int ovl_copy_up_one(struct dentry *parent, struct dentry *dentry,
old_cred = override_creds(override_cred);
mutex_lock_nested(&upperdir->d_inode->i_mutex, I_MUTEX_PARENT);
- /*
- * Using upper filesystem locking to protect against copy up
- * racing with rename (rename means the copy up was already
- * successful).
- */
- if (dentry->d_parent != parent) {
- WARN_ON((ovl_path_type(dentry) == OVL_PATH_LOWER));
+ if (ovl_path_type(dentry) != OVL_PATH_LOWER) {
err = 0;
} else {
err = ovl_copy_up_locked(upperdir, dentry, lowerpath,
(removed aufs-users ML since it is restricted to its members only)
"Hans-Peter Jansen":
> and aufs2: it's a minimalistic version of a layered filesystem, being=20
:::
Thanks reminding us about aufs. :-)
As far as I know, overlayfs has some problems (minors?) similar to
UnionMount.
- for users, the inode number may change silently. eg. copy-up.
- hardlinks may break by copy-up.
- read(2) may get an obsoleted filedata (fstat(2) for metadata too).
- fcntl(F_SETLK) may be broken by copy-up.
- unnecessary copy-up may happen, for example mmap(MAP_PRIVATE) after
open(O_RDWR).
And I noticed overlayfs adopts overriding credentials for copy-up.
Although I didn't read about credintials in detail yet, is it safe?
For example, during copy-up other thread in the same process may gain
the higher capabilities unexpectedly? Signal hander in the process too?
I just have read overlayfs once a long time ago, so I may be
misunderstanding.
But I have no objection to merge overlayfs into mainline.
Also I'd like to thank you Pete about mentioning the recent disaster in
Japan.
J. R. Okajima