2013-07-16 01:53:16

by Dave Jones

Subject: splice vs execve lockdep trace.

(ignore the taint, it's unrelated; it comes from a broken debug patch in my tree).


[ 696.047396] ======================================================
[ 696.049036] [ INFO: possible circular locking dependency detected ]
[ 696.050689] 3.11.0-rc1+ #53 Tainted: G W
[ 696.052182] -------------------------------------------------------
[ 696.053846] trinity-child2/14017 is trying to acquire lock:
[ 696.055429] (&sig->cred_guard_mutex){+.+.+.}, at: [<ffffffff8122fa55>] proc_pid_attr_write+0xf5/0x140
[ 696.057652] but task is already holding lock:
[ 696.060171] (&pipe->mutex/1){+.+.+.}, at: [<ffffffff811c7096>] pipe_lock+0x26/0x30
[ 696.062097] which lock already depends on the new lock.
[ 696.065723] the existing dependency chain (in reverse order) is:
[ 696.068295] -> #2 (&pipe->mutex/1){+.+.+.}:
[ 696.070650] [<ffffffff810c4bd1>] lock_acquire+0x91/0x1f0
[ 696.072258] [<ffffffff8170677a>] mutex_lock_nested+0x7a/0x410
[ 696.073916] [<ffffffff811c7096>] pipe_lock+0x26/0x30
[ 696.075464] [<ffffffff811f0024>] generic_file_splice_write+0x64/0x170
[ 696.077192] [<ffffffffa001afc0>] xfs_file_splice_write+0xb0/0x230 [xfs]
[ 696.078914] [<ffffffff811f1f5a>] SyS_splice+0x24a/0x7e0
[ 696.080455] [<ffffffff81714694>] tracesys+0xdd/0xe2
[ 696.082000] -> #1 (&(&ip->i_iolock)->mr_lock){++++++}:
[ 696.084428] [<ffffffff810c4bd1>] lock_acquire+0x91/0x1f0
[ 696.086017] [<ffffffff8108bc92>] down_read_nested+0x52/0xa0
[ 696.087643] [<ffffffffa007b310>] xfs_ilock+0x1d0/0x280 [xfs]
[ 696.089305] [<ffffffffa001aa12>] xfs_file_aio_read+0x112/0x3e0 [xfs]
[ 696.091037] [<ffffffff811be2c0>] do_sync_read+0x80/0xb0
[ 696.092571] [<ffffffff811be931>] vfs_read+0xa1/0x170
[ 696.094124] [<ffffffff811c4ae1>] kernel_read+0x41/0x60
[ 696.095693] [<ffffffff811c4d03>] prepare_binprm+0xb3/0x130
[ 696.097307] [<ffffffff811c6b29>] do_execve_common.isra.29+0x599/0x6c0
[ 696.099036] [<ffffffff811c6f36>] SyS_execve+0x36/0x50
[ 696.100577] [<ffffffff81714a79>] stub_execve+0x69/0xa0
[ 696.102157] -> #0 (&sig->cred_guard_mutex){+.+.+.}:
[ 696.104536] [<ffffffff810c40b6>] __lock_acquire+0x1786/0x1af0
[ 696.106189] [<ffffffff810c4bd1>] lock_acquire+0x91/0x1f0
[ 696.107813] [<ffffffff817062b5>] mutex_lock_interruptible_nested+0x75/0x4c0
[ 696.109616] [<ffffffff8122fa55>] proc_pid_attr_write+0xf5/0x140
[ 696.111243] [<ffffffff811bf372>] __kernel_write+0x72/0x150
[ 696.112870] [<ffffffff811f03f9>] write_pipe_buf+0x49/0x70
[ 696.114503] [<ffffffff811efa24>] splice_from_pipe_feed+0x84/0x140
[ 696.116202] [<ffffffff811efdce>] __splice_from_pipe+0x6e/0x90
[ 696.117836] [<ffffffff811f18c1>] splice_from_pipe+0x51/0x70
[ 696.119419] [<ffffffff811f1919>] default_file_splice_write+0x19/0x30
[ 696.121147] [<ffffffff811f1f5a>] SyS_splice+0x24a/0x7e0
[ 696.122719] [<ffffffff81714694>] tracesys+0xdd/0xe2
[ 696.124272] other info that might help us debug this:

[ 696.127604] Chain exists of: &sig->cred_guard_mutex --> &(&ip->i_iolock)->mr_lock --> &pipe->mutex/1

[ 696.131445] Possible unsafe locking scenario:

[ 696.133763]        CPU0                    CPU1
[ 696.135108]        ----                    ----
[ 696.136420]   lock(&pipe->mutex/1);
[ 696.137745]                                lock(&(&ip->i_iolock)->mr_lock);
[ 696.139469]                                lock(&pipe->mutex/1);
[ 696.141030]   lock(&sig->cred_guard_mutex);
[ 696.142409]
*** DEADLOCK ***

[ 696.145237] 2 locks held by trinity-child2/14017:
[ 696.146532] #0: (sb_writers#4){.+.+.+}, at: [<ffffffff811f2441>] SyS_splice+0x731/0x7e0
[ 696.148435] #1: (&pipe->mutex/1){+.+.+.}, at: [<ffffffff811c7096>] pipe_lock+0x26/0x30
[ 696.150357] stack backtrace:
[ 696.152287] CPU: 2 PID: 14017 Comm: trinity-child2 Tainted: G W 3.11.0-rc1+ #53
[ 696.156695] ffffffff825270a0 ffff8801fee2db90 ffffffff8170164e ffffffff824f84c0
[ 696.158392] ffff8801fee2dbd0 ffffffff816fda92 ffff8801fee2dc20 ffff8802331446d8
[ 696.160300] ffff880233143fc0 0000000000000002 0000000000000002 ffff8802331446d8
[ 696.162217] Call Trace:
[ 696.163292] [<ffffffff8170164e>] dump_stack+0x4e/0x82
[ 696.164695] [<ffffffff816fda92>] print_circular_bug+0x200/0x20f
[ 696.166194] [<ffffffff810c40b6>] __lock_acquire+0x1786/0x1af0
[ 696.167677] [<ffffffff810c4bd1>] lock_acquire+0x91/0x1f0
[ 696.169111] [<ffffffff8122fa55>] ? proc_pid_attr_write+0xf5/0x140
[ 696.170633] [<ffffffff8122fa55>] ? proc_pid_attr_write+0xf5/0x140
[ 696.172151] [<ffffffff817062b5>] mutex_lock_interruptible_nested+0x75/0x4c0
[ 696.173765] [<ffffffff8122fa55>] ? proc_pid_attr_write+0xf5/0x140
[ 696.175280] [<ffffffff8122fa55>] ? proc_pid_attr_write+0xf5/0x140
[ 696.176793] [<ffffffff8119e869>] ? alloc_pages_current+0xa9/0x170
[ 696.178301] [<ffffffff8122fa55>] proc_pid_attr_write+0xf5/0x140
[ 696.179799] [<ffffffff811f03b0>] ? splice_direct_to_actor+0x1f0/0x1f0
[ 696.181346] [<ffffffff811bf372>] __kernel_write+0x72/0x150
[ 696.182796] [<ffffffff811f03b0>] ? splice_direct_to_actor+0x1f0/0x1f0
[ 696.184354] [<ffffffff811f03f9>] write_pipe_buf+0x49/0x70
[ 696.185779] [<ffffffff811efa24>] splice_from_pipe_feed+0x84/0x140
[ 696.187293] [<ffffffff811f03b0>] ? splice_direct_to_actor+0x1f0/0x1f0
[ 696.188850] [<ffffffff811efdce>] __splice_from_pipe+0x6e/0x90
[ 696.190325] [<ffffffff811f03b0>] ? splice_direct_to_actor+0x1f0/0x1f0
[ 696.191878] [<ffffffff811f18c1>] splice_from_pipe+0x51/0x70
[ 696.193335] [<ffffffff811f1919>] default_file_splice_write+0x19/0x30
[ 696.194896] [<ffffffff811f1f5a>] SyS_splice+0x24a/0x7e0
[ 696.196312] [<ffffffff81714694>] tracesys+0xdd/0xe2


2013-07-16 02:32:54

by Linus Torvalds

Subject: Re: splice vs execve lockdep trace.

Hmm. I don't have a lot of ideas, I'm just adding lockdep, splice and
FS people (and Oleg, just because) to the cc to see if anybody else
does.

Al, Peter?

So the problematic op *seems* to be the splice into /proc/<pid>/attr/
files, which causes that "pipe -> cred_guard_mutex" locking.

While the *normal* ordering would be the other way around, coming from
execve(), which has the cred_guard_mutex -> VFS locks ordering for
reading the executable headers.

Al, can we break either of those? Do we need to hold on to the cred
mutex that long? We get it fairly early (prepare_bprm_creds) and we
drop it very late (install_exec_creds), which means that it covers a
lot. But that seems pretty basic. The splice into /proc/<pid>/attr/*
seems to be the more annoying one, and maybe we just shouldn't allow
splicing into or from /proc?
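
For reference, the userspace sequence trinity is presumably generating boils
down to something like the sketch below: splice() from a pipe into an open
/proc/<pid>/attr file, which is what lands in proc_pid_attr_write() with the
pipe lock already held. Purely illustrative - the attr file and the label
written are arbitrary choices, not taken from the trace.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int pfd[2], attr;
	const char label[] = "unconfined";	/* arbitrary attribute value */

	if (pipe(pfd) < 0)
		return 1;
	/* stage the data in a pipe buffer */
	if (write(pfd[1], label, strlen(label)) < 0)
		return 1;

	attr = open("/proc/self/attr/current", O_WRONLY);
	if (attr < 0)
		return 1;

	/*
	 * splice() takes pipe->mutex, then the default splice_write path
	 * ends up in proc_pid_attr_write(), which wants cred_guard_mutex:
	 * the "pipe -> cred_guard_mutex" half of the inversion.
	 */
	if (splice(pfd[0], NULL, attr, NULL, strlen(label), 0) < 0)
		perror("splice");

	close(attr);
	close(pfd[0]);
	close(pfd[1]);
	return 0;
}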

Dave, is this new (it doesn't *smell* new to me), or is it just that
trinity is doing new splice things?

Or is the XFS i_iolock required for this thing to happen at all?
Adding Ben Myers to the cc just for luck/completeness.

Linus


2013-07-16 02:39:26

by Dave Jones

Subject: Re: splice vs execve lockdep trace.

On Mon, Jul 15, 2013 at 07:32:51PM -0700, Linus Torvalds wrote:

> So the problematic op *seems* to be the splice into /proc/<pid>/attr/
> files, which causes that "pipe -> cred_guard_mutex" locking.
>
> While the *normal* ordering would be the other way around, coming from
> execve(), which has the cred_guard_mutex -> VFS locks ordering for
> reading the executable headers.
>
> Al, can we break either of those? Do we need to hold on to the cred
> mutex that long? We get it fairly early (prepare_bprm_creds) and we
> drop it very late (install_exec_creds), which means that it covers a
> lot. But that seems pretty basic. The splice into /proc/<pid>/attr/*
> seems to be the more annoying one, and maybe we just shouldn't allow
> splicing into or from /proc?
>
> Dave, is this new (it doesn't *smell* new to me), or is it just that
> trinity is doing new splice things?

I think I've seen this a long time ago from another fuzzer (iknowthis).
I thought that had gotten fixed though. But I may be mixing up a
similar callchain. The recent trinity changes shouldn't have really made
any notable difference here. Interestingly, the 'soft lockups' I was
seeing all the time on that box seem to have gone into hiding.

> Or is the XFS i_iolock required for this thing to happen at all?
> Adding Ben Myers to the cc just for luck/completeness.

It is only happening (so far) on the XFS test box, but I don't have
enough data to say that's definite yet.

Dave

2013-07-16 03:25:17

by Linus Torvalds

Subject: Re: splice vs execve lockdep trace.

On Mon, Jul 15, 2013 at 7:38 PM, Dave Jones <[email protected]> wrote:
>
> The recent trinity changes shouldn't have really made
> any notable difference here.

Hmm. I'm not aware of anything that has changed in this area since
3.10 - neither in execve, xfs or in splice. Not even since 3.9.

But I may certainly have missed something.

> Interestingly, the 'soft lockups' I was
> seeing all the time on that box seem to have gone into hiding.

Honestly, I'm somewhat inclined to blame the whole perf situation, and
saying that we hopefully got that fixed. In between the silly do_div()
buglets and all the indications that the time was spent in nmi
handlers, I'd be willing to just ignore them as false positives
brought on by the whole switch to the perf irq..

> > Or is the XFS i_iolock required for this thing to happen at all?
> > Adding Ben Myers to the cc just for luck/completeness.
>
> It is only happening (so far) on the XFS test box, but I don't have
> enough data to say that's definite yet.

.. so there's been a number of xfs changes, and I don't know the code,
but none of them seem at all relevant to this.

The "pipe -> cred_guard_mutex" lock chain is pretty direct, and can be
clearly attributed to splicing into /proc. Now, whether that is a
*good* idea or not is clearly debatable, and I do think that maybe we
should just not splice to/from proc files, but that doesn't seem to be
new, and I don't think it's necessarily *broken* per se, it's just
that splicing into /proc seems somewhat unnecessary, and various proc
files do end up taking locks that can be "interesting".

At the other end of the spectrum, the "cred_guard_mutex -> FS locks"
thing from execve() is also pretty clear, and probably not fixable or
necessarily something we'd even want to fix.

But the "FS locks -> pipe" part is a bit questionable. Honestly, I'd
be much happier if XFS used generic_file_splice_read/write().

And looking more at that, I'm actually starting to think this is an
XFS locking problem. XFS really should not call back to splice while
holding the inode lock.

But that XFS code doesn't seem new either. Is XFS a new thing for you
to test with?

Ben? Comments? I added the xfs list too now that I'm starting to
possibly blame XFS more actively..

Linus

2013-07-16 03:29:18

by Dave Jones

Subject: Re: splice vs execve lockdep trace.

On Mon, Jul 15, 2013 at 08:25:14PM -0700, Linus Torvalds wrote:


> And looking more at that, I'm actually starting to think this is an
> XFS locking problem. XFS really should not call back to splice while
> holding the inode lock.
>
> But that XFS code doesn't seem new either. Is XFS a new thing for you
> to test with?

I started pounding on it fairly recently and have shaken out a number
of bugs (now fixed) since I started, so relatively new, but on the order of 'months' now.

Dave

2013-07-16 05:31:45

by Al Viro

Subject: Re: splice vs execve lockdep trace.

On Mon, Jul 15, 2013 at 08:25:14PM -0700, Linus Torvalds wrote:

> The "pipe -> cred_guard_mutex" lock chain is pretty direct, and can be
> clearly attributed to splicing into /proc. Now, whether that is a
> *good* idea or not is clearly debatable, and I do think that maybe we
> should just not splice to/from proc files, but that doesn't seem to be
> new, and I don't think it's necessarily *broken* per se, it's just
> that splicing into /proc seems somewhat unnecessary, and various proc
> files do end up taking locks that can be "interesting".

FWIW, we might attack that one - after all, we could have ->splice_write()
for that guy that would grab cred_guard_mutex, then call splice_from_pipe()
with an actor that would map/do security_setprocattr/unmap... That said,
considering what it does on write, it really does *not* want to deal with
partial writes, so...
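
A very rough sketch of that idea, just to make the shape concrete
(hypothetical function names, kmap() instead of the pipe buffer map ops, no
buffer confirmation, and the partial-write problem deliberately ignored -
this is not a proposed patch):

static int proc_attr_splice_actor(struct pipe_inode_info *pipe,
				  struct pipe_buffer *buf,
				  struct splice_desc *sd)
{
	struct file *file = sd->u.file;
	struct task_struct *task = get_proc_task(file_inode(file));
	char *kaddr;
	int ret;

	if (!task)
		return -ESRCH;
	kaddr = kmap(buf->page);	/* simplified mapping of the pipe page */
	ret = security_setprocattr(task,
				   (char *)file->f_path.dentry->d_name.name,
				   kaddr + buf->offset, sd->len);
	kunmap(buf->page);
	put_task_struct(task);
	return ret;	/* bytes consumed on success, negative error otherwise */
}

static ssize_t proc_attr_splice_write(struct pipe_inode_info *pipe,
				      struct file *out, loff_t *ppos,
				      size_t len, unsigned int flags)
{
	struct task_struct *task = get_proc_task(file_inode(out));
	ssize_t ret;

	if (!task)
		return -ESRCH;
	/* cred_guard_mutex first, pipe->mutex second (inside splice_from_pipe) */
	if (mutex_lock_interruptible(&task->signal->cred_guard_mutex)) {
		put_task_struct(task);
		return -ERESTARTSYS;
	}
	ret = splice_from_pipe(pipe, out, ppos, len, flags,
			       proc_attr_splice_actor);
	mutex_unlock(&task->signal->cred_guard_mutex);
	put_task_struct(task);
	return ret;
}

That would at least make the /proc side take cred_guard_mutex before the pipe
lock, the same order the execve side already establishes.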

2013-07-16 06:04:18

by Dave Chinner

Subject: Re: splice vs execve lockdep trace.

On Mon, Jul 15, 2013 at 08:25:14PM -0700, Linus Torvalds wrote:
> On Mon, Jul 15, 2013 at 7:38 PM, Dave Jones <[email protected]> wrote:
> >
> > The recent trinity changes shouldn't have really made
> > any notable difference here.
>
> Hmm. I'm not aware of anything that has changed in this area since
> 3.10 - neither in execve, xfs or in splice. Not even since 3.9.

It's been there for years.....

> The "pipe -> cred_guard_mutex" lock chain is pretty direct, and can be
> clearly attributed to splicing into /proc. Now, whether that is a
> *good* idea or not is clearly debatable, and I do think that maybe we
> should just not splice to/from proc files, but that doesn't seem to be
> new, and I don't think it's necessarily *broken* per se, it's just
> that splicing into /proc seems somewhat unnecessary, and various proc
> files do end up taking locks that can be "interesting".

But this is a new way of triggering the inversion, however....

> At the other end of the spectrum, the "cred_guard_mutex -> FS locks"
> thing from execve() is also pretty clear, and probably not fixable or
> necessarily something we'd even want to fix.
>
> But the "FS locks -> pipe" part is a bit questionable. Honestly, I'd
> be much happier if XFS used generic_file_splice_read/write().
>
> And looking more at that, I'm actually starting to think this is an
> XFS locking problem. XFS really should not call back to splice while
> holding the inode lock.
>
> But that XFS code doesn't seem new either. Is XFS a new thing for you
> to test with?

I posted patches to fix this i_mutex/i_iolock inversion a couple of
years ago (july 2011):

https://lkml.org/lkml/2011/7/18/4

And V2 was posted here and reviewed (aug 2011):

http://xfs.9218.n7.nabble.com/PATCH-0-2-splice-i-mutex-vs-splice-write-deadlock-V2-tt4072.html#none

It didn't get picked up by the VFS tree, so it sat moldering until
somebody else reported it (Nov 2012) and I reposted it again, only
to have it ignored again:

http://oss.sgi.com/archives/xfs/2012-11/msg00671.html

And I recently discussed it again with Al w.r.t. filesystem freeze
problems he was looking at, and I was waiting for that to settle
down before I posted the fixes again....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-16 06:16:08

by Al Viro

Subject: Re: splice vs execve lockdep trace.

On Tue, Jul 16, 2013 at 04:03:51PM +1000, Dave Chinner wrote:

> I posted patches to fix this i_mutex/i_iolock inversion a couple of
> years ago (july 2011):
>
> https://lkml.org/lkml/2011/7/18/4
>
> And V2 was posted here and reviewed (aug 2011):
>
> http://xfs.9218.n7.nabble.com/PATCH-0-2-splice-i-mutex-vs-splice-write-deadlock-V2-tt4072.html#none

Unless I'm misreading the patch, you end up doing file_remove_suid()
without holding i_mutex at all...

2013-07-16 06:41:40

by Dave Chinner

Subject: Re: splice vs execve lockdep trace.

On Tue, Jul 16, 2013 at 07:16:02AM +0100, Al Viro wrote:
> On Tue, Jul 16, 2013 at 04:03:51PM +1000, Dave Chinner wrote:
>
> > I posted patches to fix this i_mutex/i_iolock inversion a couple of
> > years ago (july 2011):
> >
> > https://lkml.org/lkml/2011/7/18/4
> >
> > And V2 was posted here and reviewed (aug 2011):
> >
> > http://xfs.9218.n7.nabble.com/PATCH-0-2-splice-i-mutex-vs-splice-write-deadlock-V2-tt4072.html#none
>
> Unless I'm misreading the patch, you end up doing file_remove_suid()
> without holding i_mutex at all...

We've been calling file_remove_suid() since at least 2010 without
i_mutex held through the direct IO write path....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-16 06:50:43

by Dave Chinner

Subject: Re: splice vs execve lockdep trace.

On Tue, Jul 16, 2013 at 07:16:02AM +0100, Al Viro wrote:
> On Tue, Jul 16, 2013 at 04:03:51PM +1000, Dave Chinner wrote:
>
> > I posted patches to fix this i_mutex/i_iolock inversion a couple of
> > years ago (july 2011):
> >
> > https://lkml.org/lkml/2011/7/18/4
> >
> > And V2 was posted here and reviewed (aug 2011):
> >
> > http://xfs.9218.n7.nabble.com/PATCH-0-2-splice-i-mutex-vs-splice-write-deadlock-V2-tt4072.html#none
>
> Unless I'm misreading the patch, you end up doing file_remove_suid()
> without holding i_mutex at all...

+ xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
+ ret = file_remove_suid(out);

Actually, xfs_rw_ilock() takes the i_mutex due to the exclusive locking being
done, so that's all fine.
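
For reference, the helper in question looks roughly like this (paraphrased
from fs/xfs/xfs_file.c of that era, not quoted verbatim):

static inline void
xfs_rw_ilock(
	struct xfs_inode	*ip,
	int			type)
{
	/* exclusive IO locking nests i_mutex outside the iolock */
	if (type & XFS_IOLOCK_EXCL)
		mutex_lock(&VFS_I(ip)->i_mutex);
	xfs_ilock(ip, type);
}

i.e. exclusive IO locking takes i_mutex and then the iolock, while shared
locking takes only the iolock.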

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-16 13:58:51

by Vince Weaver

Subject: Re: splice vs execve lockdep trace.

On Mon, 15 Jul 2013, Linus Torvalds wrote:

> On Mon, Jul 15, 2013 at 7:38 PM, Dave Jones <[email protected]> wrote:
>
> > Interestingly, the 'soft lockups' I was
> > seeing all the time on that box seem to have gone into hiding.
>
> Honestly, I'm somewhat inclined to blame the whole perf situation, and
> saying that we hopefully got that fixed. In between the silly do_div()
> buglets and all the indications that the time was spent in nmi
> handlers, I'd be willing to just ignore them as false positives
> brought on by the whole switch to the perf irq..

Did the perf soft lockups go away with 3.11-rc1?

734df5ab549ca44f40de0f07af1c8803856dfb18 finally got committed,
and it fixes a major long-standing perf-related NMI soft lockup bug I
found when fuzzing.

Vince

2013-07-16 15:02:55

by Dave Jones

Subject: Re: splice vs execve lockdep trace.

On Tue, Jul 16, 2013 at 09:59:35AM -0400, Vince Weaver wrote:
> On Mon, 15 Jul 2013, Linus Torvalds wrote:
>
> > On Mon, Jul 15, 2013 at 7:38 PM, Dave Jones <[email protected]> wrote:
> >
> > > Interestingly, the 'soft lockups' I was
> > > seeing all the time on that box seem to have gone into hiding.
> >
> > Honestly, I'm somewhat inclined to blame the whole perf situation, and
> > saying that we hopefully got that fixed. In between the silly do_div()
> > buglets and all the indications that the time was spent in nmi
> > handlers, I'd be willing to just ignore them as false positives
> > brought on by the whole switch to the perf irq..
>
> Did the perf soft lockups go away with 3.11-rc1?
>
> 734df5ab549ca44f40de0f07af1c8803856dfb18 finally got committed,
> and it fixes a major long-standing perf-related NMI soft lockup bug I
> found when fuzzing.

That could be it. I never got around to trying that commit standalone
when you first pointed it out.

thanks,

Dave

2013-07-16 19:33:41

by Ben Myers

Subject: Re: splice vs execve lockdep trace.

Hi Dave, Linus,

On Tue, Jul 16, 2013 at 04:03:51PM +1000, Dave Chinner wrote:
> On Mon, Jul 15, 2013 at 08:25:14PM -0700, Linus Torvalds wrote:
> > On Mon, Jul 15, 2013 at 7:38 PM, Dave Jones <[email protected]> wrote:
> > >
> > > The recent trinity changes shouldn't have really made
> > > any notable difference here.
> >
> > Hmm. I'm not aware of anything that has changed in this area since
> > 3.10 - neither in execve, xfs or in splice. Not even since 3.9.
>
> It's been there for years.....
>
> > The "pipe -> cred_guard_mutex" lock chain is pretty direct, and can be
> > clearly attributed to splicing into /proc. Now, whether that is a
> > *good* idea or not is clearly debatable, and I do think that maybe we
> > should just not splice to/from proc files, but that doesn't seem to be
> > new, and I don't think it's necessarily *broken* per se, it's just
> > that splicing into /proc seems somewhat unnecessary, and various proc
> > files do end up taking locks that can be "interesting".
>
> But this is a new way of triggering the inversion, however....
>
> > At the other end of the spectrum, the "cred_guard_mutex -> FS locks"
> > thing from execve() is also pretty clear, and probably not fixable or
> > necessarily something we'd even want to fix.
> >
> > But the "FS locks -> pipe" part is a bit questionable. Honestly, I'd
> > be much happier if XFS used generic_file_splice_read/write().
> >
> > And looking more at that, I'm actually starting to think this is an
> > XFS locking problem. XFS really should not call back to splice while
> > holding the inode lock.

CPU0                              CPU1                              CPU2
----                              ----                              ----
lock(&sig->cred_guard_mutex);
                                  lock(&pipe->mutex/1);
                                                                    lock(&(&ip->io_lock)->mr_lock);
lock(&(&ip->io_lock)->mr_lock);
                                  lock(&sig->cred_guard_mutex);
                                                                    lock(&pipe->mutex/1);

CPU0 is do_execve_common, which called prepare_bprm_creds which takes the
cred_guard_mutex, and it is held across the call to xfs_file_aio_read, which
takes the iolock.

CPU1 is the /proc related codepath, where splice_from_pipe has held the
pipe_lock across the call to __splice_from_pipe, where proc_pid_attr_write is
eventually called and takes the cred_guard_mutex.

CPU2 is xfs_file_splice_write, which takes the iolock and holds it across the
call to generic_file_splice_write, which takes the pipe_mutex.

Yeah, I'll buy that.

> > But that XFS code doesn't seem new either. Is XFS a new thing for you
> > to test with?
>
> I posted patches to fix this i_mutex/i_iolock inversion a couple of
> years ago (july 2011):
>
> https://lkml.org/lkml/2011/7/18/4
>
> And V2 was posted here and reviewed (aug 2011):
>
> http://xfs.9218.n7.nabble.com/PATCH-0-2-splice-i-mutex-vs-splice-write-deadlock-V2-tt4072.html#none
>
> It didn't get picked up by the VFS tree, so it sat moldering until
> somebody else reported it (Nov 2012) and I reposted it again, only
> to have it ignored again:
>
> http://oss.sgi.com/archives/xfs/2012-11/msg00671.html
>
> And I recently discussed it again with Al w.r.t. filesystem freeze
> problems he was looking at, and I was waiting for that to settle
> down before I posted the fixes again....

I agree that fixing this in XFS seems to be the most reasonable plan,
and Dave's approach looks ok to me. Seems like patch 1 should go
through Al's tree, but we'll also need to commit it to the xfs tree
as a prerequisite for patch 2.

Dave, I'm sorry for my part in letting these moulder. I'll be happy to
review them once you have reposted.

Thanks,
Ben

2013-07-16 20:18:11

by Linus Torvalds

Subject: Re: splice vs execve lockdep trace.

On Tue, Jul 16, 2013 at 12:33 PM, Ben Myers <[email protected]> wrote:
>> >
>> > And looking more at that, I'm actually starting to think this is an
>> > XFS locking problem. XFS really should not call back to splice while
>> > holding the inode lock.

.. that was misleading, normally "inode lock" would be i_lock, but
here I meant the XFS-specific i_iolock.

But you clearly picked up on it:

> CPU0                              CPU1                              CPU2
> ----                              ----                              ----
> lock(&sig->cred_guard_mutex);
>                                   lock(&pipe->mutex/1);
>                                                                     lock(&(&ip->io_lock)->mr_lock);
> lock(&(&ip->io_lock)->mr_lock);
>                                   lock(&sig->cred_guard_mutex);
>                                                                     lock(&pipe->mutex/1);

Yup.

> I agree that fixing this in XFS seems to be the most reasonable plan,
> and Dave's approach looks ok to me. Seems like patch 1 should go
> through Al's tree, but we'll also need to commit it to the xfs tree
> prerequisite to patch 2.

Btw, is there some reason why XFS couldn't just use
generic_file_splice_read() directly?

I'm not arguing against Dave's patches, since the splice_write case
would seem to want them regardless, but I'm not even seeing why XFS
couldn't just avoid the locking for the splice_read case entirely. And
the generic file-splice read function does all that read-ahead stuff
and should actually be better for performance. And because it does it
from the page cache, it can avoid the locking issues (because it gets
the splice pipe lock later, just holding on to the page references).

And splice has mmap semantics - the whole point of splice is about
moving pages around, after all - so I *think* your current
"xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);" is actually over-serialization.
The pages will be shared by the pipe buffers anyway, so splicing by
definition doesn't imply full data serialization (because the reading
of the data itself will happen much later).

So the per-page serialization done by readpage() should already be
sufficient, no?

I dunno. Maybe there's something I'm missing. XFS does seem to wrap
all the other generic functions in that lock too, but since mmap() etc
clearly have to be able to get things one page at a time (without any
wrapping at higher layers), I'm just wondering whether splice_read
might not be able to avoid it.

Linus

2013-07-16 20:43:42

by Dave Chinner

Subject: Re: splice vs execve lockdep trace.

On Tue, Jul 16, 2013 at 01:18:06PM -0700, Linus Torvalds wrote:
> On Tue, Jul 16, 2013 at 12:33 PM, Ben Myers <[email protected]> wrote:
> >> >
> >> > And looking more at that, I'm actually starting to think this is an
> >> > XFS locking problem. XFS really should not call back to splice while
> >> > holding the inode lock.
>
> .. that was misleading, normally "inode lock" would be i_lock, but
> here I meant the XFS-specific i_iolock.
>
> But you clearly picked up on it:
>
> > CPU0                              CPU1                              CPU2
> > ----                              ----                              ----
> > lock(&sig->cred_guard_mutex);
> >                                   lock(&pipe->mutex/1);
> >                                                                     lock(&(&ip->io_lock)->mr_lock);
> > lock(&(&ip->io_lock)->mr_lock);
> >                                   lock(&sig->cred_guard_mutex);
> >                                                                     lock(&pipe->mutex/1);
>
> Yup.
>
> > I agree that fixing this in XFS seems to be the most reasonable plan,
> > and Dave's approach looks ok to me. Seems like patch 1 should go
> > through Al's tree, but we'll also need to commit it to the xfs tree
> > prerequisite to patch 2.
>
> Btw, is there some reason why XFS couldn't just use
> generic_file_splice_read() directly?

Yes - IO is serialised based on the ip->i_iolock, not i_mutex. We
don't use i_mutex for many things IO related, and so internal
locking is needed to serialise against stuff like truncate, hole
punching, etc, that are run through non-vfs interfaces.

> And splice has mmap semantics - the whole point of splice is about
> moving pages around, after all - so I *think* your current
> "xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);" is actually over-serialization.

No, that's just taking the ip->i_iolock in shared mode - that's less
serialisation than holding i_mutex as it allows parallel read
operations but still locks out concurrent buffered writes to the
file (i.e. posix atomic write vs read requirements).

> The pages will be shared by the pipe buffers anyway, so splicing by
> definition doesn't imply full data serialization (because the reading
> of the data itself will happen much later).
>
> So the per-page serialization done by readpage() should already be
> sufficient, no?
>
> I dunno. Maybe there's something I'm missing. XFS does seem to wrap
> all the other generic functions in that lock too, but since mmap() etc
> clearly have to be able to get things one page at a time (without any
> wrapping at higher layers), I'm just wondering whether splice_read
> might not be able to avoid it.

Read isn't the problem - it's write that's the deadlock issue...

Cheers,

Dave.
>
> Linus
>

--
Dave Chinner
[email protected]

2013-07-16 21:02:18

by Linus Torvalds

Subject: Re: splice vs execve lockdep trace.

On Tue, Jul 16, 2013 at 1:43 PM, Dave Chinner <[email protected]> wrote:
>
> Yes - IO is serialised based on the ip->i_iolock, not i_mutex. We
> don't use i_mutex for many things IO related, and so internal
> locking is needed to serialise against stuff like truncate, hole
> punching, etc, that are run through non-vfs interfaces.

Umm. But the page IO isn't serialized by i_mutex *either*. You don't
hold it across page faults. In fact you don't even take it at all
across page faults.

That's kind of my point. splice is about the page IO, and it's
serialized purely by the page lock. And then "->readpage()" will get
whatever IO mutex in order to do that right, but think about the case
where things are already in the page cache. There's no reason for any
serialization what-so-ever.

So this isn't about i_mutex. At all.

> Read isn't the problem - it's write that's the deadlock issue...

I agree, and I think your patches are needed, as I said in that email
you replied to. But due to this issue, I was looking at the XFS splice
support, and the read-side splice support seems inefficient and overly
complex. I'm not seeing why it needs that i_iolock.

And no, this really has nothing to do with i_mutex. Go look at
generic_file_splice_read(). There's no i_mutex there at all. It's more
like a series of magic page-faults without the actual page table
actions. Which is kind of the whole point of splice - zero-copy
without bothering with page table setup/teardown.

Now, it's perfectly possible that XFS really needs some odd locking
here, but your reply about i_mutex makes me think that you did it
because you were confused about what it actually wants.

*Every* other local filesystem uses generic_file_splice_read() with
just a single

.splice_read = generic_file_splice_read,

in the file ops initializer. Sure, nfs and ocfs2 wrap things like xfs
does, but they basically do it to revalidate their caches.
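
To make the comparison concrete, the two shapes look roughly like this
(approximate, not verbatim kernel source - the file_operations instance is a
generic stand-in, and the XFS wrapper is paraphrased with the stats/tracing
bits dropped):

/* what most local filesystems do */
const struct file_operations some_fs_file_operations = {
	/* ... */
	.splice_read	= generic_file_splice_read,
	/* ... */
};

/* roughly what the XFS wrapper does instead */
static ssize_t
xfs_file_splice_read(
	struct file		*infilp,
	loff_t			*ppos,
	struct pipe_inode_info	*pipe,
	size_t			count,
	unsigned int		flags)
{
	struct xfs_inode	*ip = XFS_I(infilp->f_mapping->host);
	ssize_t			ret;

	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);	/* shared data-IO lock */
	ret = generic_file_splice_read(infilp, ppos, pipe, count, flags);
	xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
	return ret;
}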

Linus

2013-07-17 04:06:25

by Dave Chinner

Subject: Re: splice vs execve lockdep trace.

On Tue, Jul 16, 2013 at 02:02:12PM -0700, Linus Torvalds wrote:
> On Tue, Jul 16, 2013 at 1:43 PM, Dave Chinner <[email protected]> wrote:
> >
> > Yes - IO is serialised based on the ip->i_iolock, not i_mutex. We
> > don't use i_mutex for many things IO related, and so internal
> > locking is needed to serialise against stuff like truncate, hole
> > punching, etc, that are run through non-vfs interfaces.
>
> Umm. But the page IO isn't serialized by i_mutext *either*. You don't
> hold it across page faults. In fact you don't even take it at all
> across page faults.

Right, and that's one of the biggest problems page based IO has - we
can't serialise it against other IO and other page cache
manipulation functions like hole punching. What happens when a
splice read or mmap page fault races with a hole punch? You get
stale data being left in the page cache because we can't serialise
the page read with the page cache invalidation and underlying extent
removal.

Indeed, why do you think we've been talking about VFS-level IO range
locking for the past year or more, and had a discussion session at
LSF/MM this year on it? i.e. this:

http://lwn.net/Articles/548939/

So forget about this "we don't need no steenkin' IO serialisation"
concept - it's fundamentally broken.

FWIW, hole punching in XFS takes the i_iolock in exclusive
mode, and hence serialises correctly against splice. IOWs, there is
a whole class of splice read data corruption race conditions that
XFS is not susceptible to but....

> *Every* other local filesystem uses generic_file_splice_read() with
> just a single
>
> .splice_read = generic_file_splice_read,

... and so they all are broken in a nasty, subtle way....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-17 04:54:12

by Linus Torvalds

Subject: Re: splice vs execve lockdep trace.

On Tue, Jul 16, 2013 at 9:06 PM, Dave Chinner <[email protected]> wrote:
>
> Right, and that's one of the biggest problems page based IO has - we
> can't serialise it against other IO and other page cache
> manipulation functions like hole punching. What happens when a
> splice read or mmap page fault races with a hole punch? You get
> stale data being left in the page cache because we can't serialise
> the page read with the page cache invalidation and underlying extent
> removal.

But Dave, that's *good*.

You call it "stale data".

I call it "the data was valid at some point".

This is what "splice()" is fundamentally all about.

Think of it this way: even if you are 100% serialized during the
"splice()" operation, what do you think happens afterwards?

Seriously, think it through.

That data is in a kernel buffer - the pipe. The fact that it was
serialized at the time of the original splice() doesn't make _one_
whit of a difference, because after the splice is over, the data still
sits around in that pipe buffer, and you're no longer serializing it.
Somebody else truncating the file or punching a hole in the file DOES
NOT MATTER. It's too late.

In other words, trying to "protect" against that kind of race is stupid.

You're missing the big picture because you're concentrating on the
details. Look beyond what happens inside XFS, and think about the
higher-level meaning of splice() itself.

So the only guarantee splice *should* give is entirely per-page. If
you think it gives any other serialization, you're fundamentally
wrong, because it *cannot*. See?

Linus

2013-07-17 05:51:13

by Dave Chinner

Subject: Re: splice vs execve lockdep trace.

On Tue, Jul 16, 2013 at 09:54:09PM -0700, Linus Torvalds wrote:
> On Tue, Jul 16, 2013 at 9:06 PM, Dave Chinner <[email protected]> wrote:
> >
> > Right, and that's one of the biggest problems page based IO has - we
> > can't serialise it against other IO and other page cache
> > manipulation functions like hole punching. What happens when a
> > splice read or mmap page fault races with a hole punch? You get
> > stale data being left in the page cache because we can't serialise
> > the page read with the page cache invalidation and underlying extent
> > removal.
>
> But Dave, that's *good*.
>
> You call it "stale data".
>
> I call it "the data was valid at some point".

Yes, that's true.

But when I say "stale data" I mean that the data being returned
might not have originally belonged to the underlying file you are
reading.

> This is what "splice()" is fundamentally all about.
>
> Think of it this way: even if you are 100% serialized during the
> "splice()" operation, what do you think happens afterwards?

Same as for read(2) - I don't care. Indeed, the locking is not to
protect what splice does "afterwards". It's for IO serialisation,
not page cache serialisation.

For example, splice read races with hole punch - after the page is
invalidated, but before the blocks are punched out. So we get a
new page, a mapping from the inode, get_blocks() returns valid disk
mapping, we start an IO. it gets blocked in the elevator. Maybe it's
been throttled because the process has used its blkio quota.
Whatever - the IO gets stuck on the dispatch queue.

Meanwhile, the holepunch runs the transaction to free the disk
blocks, and then something else reallocates those blocks - maybe
even the filesystem for new metadata. That then gets dispatched to
disk. We now have two IOs pending for the same disk blocks....

What that means is that what the splice read returns is undefined.
It *might* be the original data that belonged to the file, in which
case we are good to go, but if the write wins the IO race then the
splice read could return someone else's data or even filesystem
metadata - that's what "stale data" means. Any way you look at it,
that's a bad thing....

> Seriously, think it through.

I have ;)

> That data is in a kernel buffer - the pipe. The fact that it was
> serialized at the time of the original splice() doesn't make _one_
> whit of a difference, because after the splice is over, the data still
> sits around in that pipe buffer, and you're no longer serializing it.
> Somebody else truncating the file or punching a hole in the file DOES
> NOT MATTER. It's too late.

Yes, I agree that once we have a page with valid data in it, we
don't need serialisation for splice. Again, I'll stress that the
serialisation is not for splice page cache manipulations but for
the IO that splice may issue to initialise the pages it then
manipulates....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-17 16:03:14

by Linus Torvalds

Subject: Re: splice vs execve lockdep trace.

On Tue, Jul 16, 2013 at 10:51 PM, Dave Chinner <[email protected]> wrote:
>
> But when I say "stale data" I mean that the data being returned
> might not have originally belonged to the underlying file you are
> reading.

We're still talking at cross purposes then.

How the hell do you handle mmap() and page faulting?

Because if you return *that* kind of stale data, than you're horribly
horribly buggy. And you cannot *possibly* blame
generic_file_splice_read() on that.

Linus

2013-07-17 23:40:55

by Ben Myers

Subject: Re: splice vs execve lockdep trace.

Linus,

On Wed, Jul 17, 2013 at 09:03:11AM -0700, Linus Torvalds wrote:
> On Tue, Jul 16, 2013 at 10:51 PM, Dave Chinner <[email protected]> wrote:
> >
> > But when I say "stale data" I mean that the data being returned
> > might not have originally belonged to the underlying file you are
> > reading.
>
> We're still talking at cross purposes then.
>
> How the hell do you handle mmap() and page faulting?

__xfs_get_blocks serializes access to the block map with the i_lock on the
xfs_inode. This appears to be racy with respect to hole punching.

> Because if you return *that* kind of stale data, than you're horribly
> horribly buggy. And you cannot *possibly* blame
> generic_file_splice_read() on that.

Seems to me we'd need to hold the page lock on every page in the hole to
provide exclusion with splice read and mmap faults, then remove the extents,
and finally truncate the pages away. I think at that point the reads could be
done without the iolock. Or, is there a different lock that could do the trick?

Thanks,
Ben

2013-07-18 00:17:38

by Linus Torvalds

Subject: Re: splice vs execve lockdep trace.

On Wed, Jul 17, 2013 at 4:40 PM, Ben Myers <[email protected]> wrote:
>>
>> We're still talking at cross purposes then.
>>
>> How the hell do you handle mmap() and page faulting?
>
> __xfs_get_blocks serializes access to the block map with the i_lock on the
> xfs_inode. This appears to be racy with respect to hole punching.

Would it be possible to just make __xfs_get_blocks get the i_iolock
(non-exclusively)?

Or, alternatively, do it in the readpage() function?

That was what I thought you did anyway. Exactly because of the whole
page faulting issue.

Linus

2013-07-18 03:18:06

by Dave Chinner

Subject: Re: splice vs execve lockdep trace.

On Wed, Jul 17, 2013 at 09:03:11AM -0700, Linus Torvalds wrote:
> On Tue, Jul 16, 2013 at 10:51 PM, Dave Chinner <[email protected]> wrote:
> >
> > But when I say "stale data" I mean that the data being returned
> > might not have originally belonged to the underlying file you are
> > reading.
>
> We're still talking at cross purposes then.
>
> How the hell do you handle mmap() and page faulting?

We cross our fingers and hope. Always have. Races are rare as
historically there have been only a handful of applications that do
the necessary operations to trigger them. However, with holepunch
now a generic fallocate() operation....

> Because if you return *that* kind of stale data, than you're horribly
> horribly buggy. And you cannot *possibly* blame
> generic_file_splice_read() on that.

Right, it's horribly buggy and I'm not blaming
generic_file_splice_read().

I'm saying that the page cache architecture does not provide
mechanisms to avoid the problem. i.e. that we can't synchronise
multi-page operations against a single page operation that only uses
the page lock for serialisation without some form of filesystem
specific locking. And that the i_mutex/i_iolock/mmap_sem inversion
problems essentially prevent us from being able to fix it in a
filesystem specific manner.

We've hacked around this read vs invalidation race condition for
truncate() by putting ordered operations in place to avoid
refaulting after invalidation by read operations. i.e. truncate was
optimised to avoid extra locking, but now the realisation is that
truncate is just a degenerate case of hole punching and that hole
punching cannot make use of the same "beyond EOF" optimisations to
avoid race conditions with other IO.

We (XFS developers) have known about this for years, but we've
always been told when it's been raised that it's "just a wacky XFS
problem". Now that other filesystems are actually implementing the
same functionality that XFS has had since day zero, they are also
seeing the same architectural deficiencies in the generic code. i.e.
they are not actually "whacky XFS problems". That's why we were
talking about a range-locking solution to this problem at LSF/MM
this year - to find a generic solution to the issue...

FWIW, this problem is not just associated with splice reads - it's a
problem for the direct IO code, too. The direct IO layer has lots of
hacky invalidation code that tries to work around the fact that
mmap() page faults cannot be synchronised against direct IO in
progress. Hence it invalidates caches before and after direct IO is
done in the hope that we don't have a page fault that races and
leaves us with out-of-date data being exposed to userspace via mmap.
Indeed, we have a regression test that demonstrates how this often
fails - xfstests:generic/263 uses fsx with direct IO and mmap on the
same file and will fail with data corruption on XFS.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-18 03:42:31

by Dave Chinner

Subject: Re: splice vs execve lockdep trace.

On Wed, Jul 17, 2013 at 05:17:36PM -0700, Linus Torvalds wrote:
> On Wed, Jul 17, 2013 at 4:40 PM, Ben Myers <[email protected]> wrote:
> >>
> >> We're still talking at cross purposes then.
> >>
> >> How the hell do you handle mmap() and page faulting?
> >
> > __xfs_get_blocks serializes access to the block map with the i_lock on the
> > xfs_inode. This appears to be racy with respect to hole punching.
>
> Would it be possible to just make __xfs_get_blocks get the i_iolock
> (non-exclusively)?

No. __xfs_get_blocks() operates on metadata (e.g. extent lists), and
as such is protected by the i_ilock (note: not the i_iolock). i.e.
XFS has a multi-level locking strategy:

i_iolock is provided for *data IO serialisation*,
i_ilock is for *inode metadata serialisation*.

Truncate and hole punching require IO level serialisation rather
than metadata or page cache level serialisation as they have to be
safe against direct IO as well as page cache based IO.

> Or, alternatively, do it in the readpage() function?
>
> That was what I thought you did anyway. Exactly because of the whole
> page faulting issue.

We protect the inode itself with the i_ilock in the page fault path,
but we have no IO level serialisation. Racing faults serialise
access to inode metadata via the i_ilock, but this doesn't
serialise against IO in progress....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-18 21:16:38

by Ben Myers

Subject: Re: splice vs execve lockdep trace.

Dave,

On Thu, Jul 18, 2013 at 01:42:03PM +1000, Dave Chinner wrote:
> On Wed, Jul 17, 2013 at 05:17:36PM -0700, Linus Torvalds wrote:
> > On Wed, Jul 17, 2013 at 4:40 PM, Ben Myers <[email protected]> wrote:
> > >>
> > >> We're still talking at cross purposes then.
> > >>
> > >> How the hell do you handle mmap() and page faulting?
> > >
> > > __xfs_get_blocks serializes access to the block map with the i_lock on the
> > > xfs_inode. This appears to be racy with respect to hole punching.
> >
> > Would it be possible to just make __xfs_get_blocks get the i_iolock
> > (non-exclusively)?
>
> No. __xfs_get_blocks() operates on metadata (e.g. extent lists), and
> as such is protected by the i_ilock (note: not the i_iolock). i.e.
> XFS has a multi-level locking strategy:
>
> i_iolock is provided for *data IO serialisation*,
> i_ilock is for *inode metadata serialisation*.

I think if __xfs_get_blocks has some way of knowing it is the mmap/page fault
path, taking the iolock shared in addition to the ilock (in just that case)
would prevent the mmap from being able to read stale data from disk. You would
see either the data before the punch or you would see the hole.

Actually... I think that is wrong: You'd have to take the iolock across the
read itself (not just the access to the block map) for it to have the desired
effect:

1608 int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
...
1704 page_not_uptodate:
1705 	/*
1706 	 * Umm, take care of errors if the page isn't up-to-date.
1707 	 * Try to re-read it _once_. We do this synchronously,
1708 	 * because there really aren't any performance issues here
1709 	 * and we need to check for errors.
1710 	 */
1711 	ClearPageError(page);
1712 	error = mapping->a_ops->readpage(file, page);
1713 	if (!error) {
1714 		wait_on_page_locked(page);
1715 		if (!PageUptodate(page))
1716 			error = -EIO;
1717 	}
1718 	page_cache_release(page);

Wouldn't you have to hold the iolock until after wait_on_page_locked returns?

Regards,
Ben

2013-07-18 22:21:29

by Ben Myers

Subject: Re: splice vs execve lockdep trace.

Dave,

On Thu, Jul 18, 2013 at 04:16:32PM -0500, Ben Myers wrote:
> On Thu, Jul 18, 2013 at 01:42:03PM +1000, Dave Chinner wrote:
> > On Wed, Jul 17, 2013 at 05:17:36PM -0700, Linus Torvalds wrote:
> > > On Wed, Jul 17, 2013 at 4:40 PM, Ben Myers <[email protected]> wrote:
> > > >>
> > > >> We're still talking at cross purposes then.
> > > >>
> > > >> How the hell do you handle mmap() and page faulting?
> > > >
> > > > __xfs_get_blocks serializes access to the block map with the i_lock on the
> > > > xfs_inode. This appears to be racy with respect to hole punching.
> > >
> > > Would it be possible to just make __xfs_get_blocks get the i_iolock
> > > (non-exclusively)?
> >
> > No. __xfs_get_blocks() operates on metadata (e.g. extent lists), and
> > as such is protected by the i_ilock (note: not the i_iolock). i.e.
> > XFS has a multi-level locking strategy:
> >
> > i_iolock is provided for *data IO serialisation*,
> > i_ilock is for *inode metadata serialisation*.
>
> I think if __xfs_get_blocks has some way of knowing it is the mmap/page fault
> path, taking the iolock shared in addition to the ilock (in just that case)
> would prevent the mmap from being able to read stale data from disk. You would
> see either the data before the punch or you would see the hole.
>
> Actually... I think that is wrong: You'd have to take the iolock across the
> read itself (not just the access to the block map) for it to have the desired
> effect:
>
> 1608 int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> ...
> 1704 page_not_uptodate:
> 1705 /*
> 1706 * Umm, take care of errors if the page isn't up-to-date.
> 1707 * Try to re-read it _once_. We do this synchronously,
> 1708 * because there really aren't any performance issues here
> 1709 * and we need to check for errors.
> 1710 */
> 1711 ClearPageError(page);
> 1712 error = mapping->a_ops->readpage(file, page);
> 1713 if (!error) {
> 1714 wait_on_page_locked(page);
> 1715 if (!PageUptodate(page))
> 1716 error = -EIO;
> 1717 }
> 1718 page_cache_release(page);
>
> Wouldn't you have to hold the iolock until after wait_on_page_locked returns?

Maybe like so (crappy/untested/probably wrong/fodder for ridicule/etc):

Index: xfs/fs/xfs/xfs_aops.c
===================================================================
--- xfs.orig/fs/xfs/xfs_aops.c
+++ xfs/fs/xfs/xfs_aops.c
@@ -1663,6 +1663,24 @@ xfs_vm_readpages(
 	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
 }
 
+void
+xfs_vm_fault_lock(
+	struct inode		*inode)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+
+	xfs_ilock(ip, XFS_IOLOCK_SHARED);
+}
+
+void
+xfs_vm_fault_unlock(
+	struct inode		*inode)
+{
+	struct xfs_inode	*ip = XFS_I(inode);
+
+	xfs_iunlock(ip, XFS_IOLOCK_SHARED);
+}
+
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
 	.readpages		= xfs_vm_readpages,
@@ -1677,4 +1695,6 @@ const struct address_space_operations xf
 	.migratepage		= buffer_migrate_page,
 	.is_partially_uptodate	= block_is_partially_uptodate,
 	.error_remove_page	= generic_error_remove_page,
+	.fault_lock		= xfs_vm_fault_lock,
+	.fault_unlock		= xfs_vm_fault_unlock,
 };
Index: xfs/mm/filemap.c
===================================================================
--- xfs.orig/mm/filemap.c
+++ xfs/mm/filemap.c
@@ -1709,12 +1709,16 @@ page_not_uptodate:
 	 * and we need to check for errors.
 	 */
 	ClearPageError(page);
+	if (mapping->a_ops->fault_lock)
+		mapping->a_ops->fault_lock(inode);
 	error = mapping->a_ops->readpage(file, page);
 	if (!error) {
 		wait_on_page_locked(page);
 		if (!PageUptodate(page))
 			error = -EIO;
 	}
+	if (mapping->a_ops->fault_unlock)
+		mapping->a_ops->fault_unlock(inode);
 	page_cache_release(page);
 
 	if (!error || error == AOP_TRUNCATED_PAGE)
Index: xfs/include/linux/fs.h
===================================================================
--- xfs.orig/include/linux/fs.h
+++ xfs/include/linux/fs.h
@@ -388,6 +388,9 @@ struct address_space_operations {
 	int (*swap_activate)(struct swap_info_struct *sis, struct file *file,
 				sector_t *span);
 	void (*swap_deactivate)(struct file *file);
+
+	void (*fault_lock) (struct inode *inode);
+	void (*fault_unlock) (struct inode *inode);
 };
 
 extern const struct address_space_operations empty_aops;

2013-07-18 22:49:15

by Dave Chinner

Subject: Re: splice vs execve lockdep trace.

On Thu, Jul 18, 2013 at 05:21:17PM -0500, Ben Myers wrote:
> Dave,
>
> On Thu, Jul 18, 2013 at 04:16:32PM -0500, Ben Myers wrote:
> > On Thu, Jul 18, 2013 at 01:42:03PM +1000, Dave Chinner wrote:
> > > On Wed, Jul 17, 2013 at 05:17:36PM -0700, Linus Torvalds wrote:
> > > > On Wed, Jul 17, 2013 at 4:40 PM, Ben Myers <[email protected]> wrote:
> > > > >>
> > > > >> We're still talking at cross purposes then.
> > > > >>
> > > > >> How the hell do you handle mmap() and page faulting?
> > > > >
> > > > > __xfs_get_blocks serializes access to the block map with the i_lock on the
> > > > > xfs_inode. This appears to be racy with respect to hole punching.
> > > >
> > > > Would it be possible to just make __xfs_get_blocks get the i_iolock
> > > > (non-exclusively)?
> > >
> > > No. __xfs_get_blocks() operates on metadata (e.g. extent lists), and
> > > as such is protected by the i_ilock (note: not the i_iolock). i.e.
> > > XFS has a multi-level locking strategy:
> > >
> > > i_iolock is provided for *data IO serialisation*,
> > > i_ilock is for *inode metadata serialisation*.
> >
> > I think if __xfs_get_blocks has some way of knowing it is the mmap/page fault
> > path, taking the iolock shared in addition to the ilock (in just that case)
> > would prevent the mmap from being able to read stale data from disk. You would
> > see either the data before the punch or you would see the hole.
> >
> > Actually... I think that is wrong: You'd have to take the iolock across the
> > read itself (not just the access to the block map) for it to have the desired
> > effect:
> >
> > 1608 int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > ...
> > 1704 page_not_uptodate:
> > 1705 /*
> > 1706 * Umm, take care of errors if the page isn't up-to-date.
> > 1707 * Try to re-read it _once_. We do this synchronously,
> > 1708 * because there really aren't any performance issues here
> > 1709 * and we need to check for errors.
> > 1710 */
> > 1711 ClearPageError(page);
> > 1712 error = mapping->a_ops->readpage(file, page);
> > 1713 if (!error) {
> > 1714 wait_on_page_locked(page);
> > 1715 if (!PageUptodate(page))
> > 1716 error = -EIO;
> > 1717 }
> > 1718 page_cache_release(page);
> >
> > Wouldn't you have to hold the iolock until after wait_on_page_locked returns?
>
> Maybe like so (crappy/untested/probably wrong/fodder for ridicule/etc):

Try running it with lockdep. You'll see pretty quickly why you can't
take the i_iolock or i_mutex in the ->fault path - it is called
with the mmap_sem held.

The lock inversion that can deadlock is that the page fault might be
occurring in the read path that is already holding the
i_mutex/i_iolock....

Cheers,

Dave.

--
Dave Chinner
[email protected]