Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755943AbcJTUqf (ORCPT ); Thu, 20 Oct 2016 16:46:35 -0400 Received: from mx1.redhat.com ([209.132.183.28]:48802 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755795AbcJTUqd (ORCPT ); Thu, 20 Oct 2016 16:46:33 -0400 Date: Thu, 20 Oct 2016 16:46:30 -0400 From: Vivek Goyal To: Miklos Szeredi Cc: linux-unionfs@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Jeremy Eder , David Howells , Ratna Bolla , Gou Rao , Vinod Jayaraman , Al Viro , Dave Chinner Subject: Re: [POC/RFC PATCH] overlayfs: fix data inconsistency at copy up Message-ID: <20161020204630.GA1000@redhat.com> References: <20161012133326.GD31239@veci.piliscsaba.szeredi.hu> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20161012133326.GD31239@veci.piliscsaba.szeredi.hu> User-Agent: Mutt/1.7.0 (2016-08-17) X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.28]); Thu, 20 Oct 2016 20:46:32 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 9671 Lines: 196 On Wed, Oct 12, 2016 at 03:33:26PM +0200, Miklos Szeredi wrote: > This is a proof of concept patch to fix the following. > > /ovl is in overlay mount and /ovl/foo exists on the lower layer only. > > rofd = open("/ovl/foo", O_RDONLY); > rwfd = open("/ovl/foo", O_WRONLY); /* this causes copy up */ > write(rwfd, "bar", 3); > read(rofd, buf, 3); > assert(memcmp(buf, "bar", 3) == 0); > > Similar problem exists with an MAP_SHARED mmap created from rofd. > > While this has only caused few problems (yum/dnf failure is the only one I know > of) and easily worked around in userspace, many see it as a proof that overlayfs > can never be a proper "POSIX" filesystem. > > To quell those worries, here's a simple patch that should address the above. > > The only VFS change is that f_op is initialized from f_path.dentry->d_inode > instead of file_inode(filp) in open. The effect of this is that overlayfs can > intercept open and other file operations, while the file still effectively > belongs to the underlying fs. > > The patch does not give up on the nice properties of overlayfs, like sharing the > page cache with the underlying files. It does cause copy up in one case where > previously there wasn't one and that's the O_RDONLY/MAP_SHARED case. I haven't > done much research into this, but running some tests in chroot didn't trigger > this. > > Comments, testing are welcome. Hi Miklos, This looks like a very interesting idea. In fact once file has been copied up and writen to, and if I do fstat(rofd), it shows the size of copied up file but one can read the contents. So fixing that anomaly would be nice. Hopefully O_RDONLY/MAP_SHARED is not a common case and we get away with this forced copy up penalty. [..] > +static ssize_t ovl_read_iter(struct kiocb *iocb, struct iov_iter *to) > +{ > + struct file *file = iocb->ki_filp; > + bool isupper = OVL_TYPE_UPPER(ovl_path_type(file->f_path.dentry)); > + ssize_t ret = -EINVAL; > + > + if (likely(!isupper)) { > + const struct file_operations *fop = ovl_real_fop(file); > + > + if (likely(fop->read_iter)) > + ret = fop->read_iter(iocb, to); > + } else { > + struct file *upperfile = filp_clone_open(file); > + IIUC, every read of lower file will call filp_clone_open(). Looking at the code of filp_clone_open(), I am concerned about the overhead of this call. Is it significant? Don't want to be paying too much of penalty for read operation on lower files. That would be a common case for containers. BTW, I did a quick testing. Using docker launched a fedora container and called "dnf update" inside that. And later I noticed following on serial console. Thanks Vivek [ 309.075885] ====================================================== [ 309.076841] [ INFO: possible circular locking dependency detected ] [ 309.077818] 4.9.0-rc1+ #197 Not tainted [ 309.078411] ------------------------------------------------------- [ 309.079377] dnf/2468 is trying to acquire lock: [ 309.080082] ([ 309.080324] &type->s_vfs_rename_key #2[ 309.080942] ){+.+.+.} , at: [ 309.081435] [] lock_rename+0x32/0x100 [ 309.082261] [ 309.082261] but task is already holding lock: [ 309.083158] ([ 309.083399] &mm->mmap_sem ){++++++}[ 309.083974] , at: [ 309.084316] [] vm_mmap_pgoff+0x8c/0x100 [ 309.085150] [ 309.085150] which lock already depends on the new lock. [ 309.085150] [ 309.086393] [ 309.086393] the existing dependency chain (in reverse order) is: [ 309.088279] -> #3[ 309.088612] ( &mm->mmap_sem[ 309.089091] ){++++++} [ 309.089470] : [ 309.089735] [ 309.090046] [] lock_acquire+0xf6/0x1f0 [ 309.090884] [ 309.091197] [] __might_fault+0x70/0xa0 [ 309.092047] [ 309.092357] [] filldir+0xb5/0x140 [ 309.093128] [ 309.093434] [] call_filldir+0x65/0x130 [ 309.094273] [ 309.094590] [] ext4_readdir+0x6cf/0x8a0 [ 309.095425] [ 309.095742] [] iterate_dir+0x17b/0x1b0 [ 309.096572] [ 309.096878] [] SyS_getdents+0x9c/0x130 [ 309.097716] [ 309.098026] [] entry_SYSCALL_64_fastpath+0x1f/0xc2 [ 309.099005] -> #2[ 309.099304] ( &type->i_mutex_dir_key[ 309.099888] #3 ){++++++}[ 309.100301] : [ 309.100576] [ 309.100881] [] lock_acquire+0xf6/0x1f0 [ 309.101711] [ 309.102017] [] down_write+0x49/0x80 [ 309.102798] [ 309.103098] [] vfs_rmdir+0x55/0x140 [ 309.103878] [ 309.104179] [] do_rmdir+0x1bd/0x230 [ 309.104958] [ 309.105256] [] SyS_unlinkat+0x22/0x30 [ 309.106063] [ 309.106364] [] entry_SYSCALL_64_fastpath+0x1f/0xc2 [ 309.107345] -> #1[ 309.107661] ( &type->i_mutex_dir_key[ 309.108256] #3 /1[ 309.108597] ){+.+.+.} [ 309.108971] : [ 309.109225] [ 309.109532] [] lock_acquire+0xf6/0x1f0 [ 309.110370] [ 309.110686] [] down_write_nested+0x4f/0x80 [ 309.111606] [ 309.111916] [] lock_rename+0xe1/0x100 [ 309.112752] [ 309.113062] [] SyS_renameat+0x212/0x3f0 [ 309.113918] [ 309.114224] [] entry_SYSCALL_64_fastpath+0x1f/0xc2 [ 309.115218] -> #0[ 309.115525] ( &type->s_vfs_rename_key[ 309.116138] #2 ){+.+.+.}[ 309.116564] : [ 309.116837] [ 309.117146] [] __lock_acquire+0x1110/0x12a0 [ 309.118047] [ 309.118354] [] lock_acquire+0xf6/0x1f0 [ 309.119193] [ 309.119500] [] mutex_lock_nested+0x79/0x3c0 [ 309.120406] [ 309.120723] [] lock_rename+0x32/0x100 [ 309.121564] [ 309.121875] [] ovl_copy_up_one+0xf7/0x6a0 [overlay] [ 309.122866] [ 309.123170] [] ovl_copy_up+0x10b/0x13d [overlay] [ 309.124136] [ 309.124442] [] ovl_mmap+0x55/0x90 [overlay] [ 309.125348] [ 309.125677] [] mmap_region+0x394/0x630 [ 309.126510] [ 309.126829] [] do_mmap+0x446/0x530 [ 309.127616] [ 309.127928] [] vm_mmap_pgoff+0xbd/0x100 [ 309.128783] [ 309.129092] [] SyS_mmap_pgoff+0x1c1/0x290 [ 309.129973] [ 309.130284] [] SyS_mmap+0x1b/0x30 [ 309.131058] [ 309.131368] [] entry_SYSCALL_64_fastpath+0x1f/0xc2 [ 309.132377] [ 309.132377] other info that might help us debug this: [ 309.132377] [ 309.133608] Chain exists of: [ 309.134084] &type->s_vfs_rename_key #2[ 309.134694] --> &type->i_mutex_dir_key[ 309.135324] #3 --> [ 309.135705] &mm->mmap_sem [ 309.136131] [ 309.136131] [ 309.136599] Possible unsafe locking scenario: [ 309.136599] [ 309.137499] CPU0 CPU1 [ 309.138202] ---- ---- [ 309.138905] lock([ 309.139211] &mm->mmap_sem [ 309.139646] ); [ 309.139914] lock([ 309.140602] &type->i_mutex_dir_key #3[ 309.141190] ); [ 309.141472] lock([ 309.142175] &mm->mmap_sem [ 309.142610] ); [ 309.142877] lock([ 309.143186] &type->s_vfs_rename_key #2[ 309.143796] ); [ 309.144084] [ 309.144084] *** DEADLOCK *** [ 309.144084] [ 309.144991] 1 lock held by dnf/2468: [ 309.145543] #0: [ 309.145827] ( &mm->mmap_sem[ 309.146299] ){++++++} , at: [ 309.146782] [] vm_mmap_pgoff+0x8c/0x100 [ 309.147634] [ 309.147634] stack backtrace: [ 309.148310] CPU: 4 PID: 2468 Comm: dnf Not tainted 4.9.0-rc1+ #197 [ 309.149257] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014 [ 309.150717] ffffc900033cf900 ffffffff8144b253 ffffffff828b29f0 ffffffff828e8d50 [ 309.151942] ffffc900033cf940 ffffffff8110f83e 00000000f7ac0000 ffff8801f7ac0860 [ 309.153660] ffff8801f7ac0000 ffff8801f7ac0838 0000000000000001 0000000000000000 [ 309.154877] Call Trace: [ 309.155266] [] dump_stack+0x86/0xc3 [ 309.156062] [] print_circular_bug+0x1be/0x210 [ 309.156991] [] __lock_acquire+0x1110/0x12a0 [ 309.157896] [] ? debug_lockdep_rcu_enabled+0x1d/0x20 [ 309.158912] [] ? avc_has_perm+0x34/0x290 [ 309.159768] [] lock_acquire+0xf6/0x1f0 [ 309.160597] [] ? lock_rename+0x32/0x100 [ 309.161438] [] mutex_lock_nested+0x79/0x3c0 [ 309.162351] [] ? lock_rename+0x32/0x100 [ 309.163202] [] ? selinux_inode_getattr+0x8b/0xb0 [ 309.164162] [] lock_rename+0x32/0x100 [ 309.164981] [] ovl_copy_up_one+0xf7/0x6a0 [overlay] [ 309.165987] [] ? avc_has_perm+0x133/0x290 [ 309.166864] [] ? avc_has_perm+0x34/0x290 [ 309.167723] [] ? __might_sleep+0x4a/0x80 [ 309.168580] [] ovl_copy_up+0x10b/0x13d [overlay] [ 309.169539] [] ovl_mmap+0x55/0x90 [overlay] [ 309.170436] [] mmap_region+0x394/0x630 [ 309.171269] [] do_mmap+0x446/0x530 [ 309.172064] [] vm_mmap_pgoff+0xbd/0x100 [ 309.172908] [] SyS_mmap_pgoff+0x1c1/0x290 [ 309.173777] [] SyS_mmap+0x1b/0x30 [ 309.174539] [] entry_SYSCALL_64_fastpath+0x1f/0xc2