2013-10-05 00:52:21

by Dave Jones

[permalink] [raw]
Subject: fs/attr.c:notify_change locking warning.

WARNING: CPU: 3 PID: 26128 at fs/attr.c:178 notify_change+0x34d/0x360()
Modules linked in: dlci 8021q garp sctp snd_seq_dummy bridge stp tun fuse rfcomm hidp ipt_ULOG nfc caif_socket caif af_802154 phonet af_rxrpc bnep bluetooth rfkill can_bcm can_raw can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds scsi_transport_iscsi nfnetlink af_key rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm snd_hda_codec_hdmi xfs snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_page_alloc libcrc32c snd_timer crct10dif_pclmul crc32c_intel snd ghash_clmulni_intel e1000e microcode usb_debug serio_raw pcspkr ptp soundcore pps_core shpchp
CPU: 3 PID: 26128 Comm: trinity-child57 Not tainted 3.12.0-rc3+ #93
ffffffff81a3e2ec ffff880071d71bb8 ffffffff8172a763 0000000000000000
ffff880071d71bf0 ffffffff810552cd 0000000000000a00 ffff88023d6e8b90
0000000000008ad0 ffff880071d71c50 ffff8802392882c8 ffff880071d71c00
Call Trace:
[<ffffffff8172a763>] dump_stack+0x4e/0x82
[<ffffffff810552cd>] warn_slowpath_common+0x7d/0xa0
[<ffffffff810553aa>] warn_slowpath_null+0x1a/0x20
[<ffffffff811e388d>] notify_change+0x34d/0x360
[<ffffffff811e18e7>] file_remove_suid+0x87/0xa0
[<ffffffff811e9019>] ? __mnt_drop_write+0x29/0x50
[<ffffffff811e90a2>] ? __mnt_drop_write_file+0x12/0x20
[<ffffffff811e1e6a>] ? file_update_time+0x8a/0xd0
[<ffffffffa01a4e7a>] xfs_file_aio_write_checks+0xea/0xf0 [xfs]
[<ffffffffa01a4f50>] xfs_file_dio_aio_write+0xd0/0x3e0 [xfs]
[<ffffffffa01a5629>] xfs_file_aio_write+0x129/0x130 [xfs]
[<ffffffff811c45dc>] do_sync_readv_writev+0x4c/0x80
[<ffffffff811c5aab>] do_readv_writev+0xbb/0x240
[<ffffffffa01a5500>] ? xfs_file_buffered_aio_write+0x2a0/0x2a0 [xfs]
[<ffffffff811c4500>] ? do_sync_read+0x90/0x90
[<ffffffff81734721>] ? _raw_spin_unlock+0x31/0x60
[<ffffffff810c9a86>] ? trace_hardirqs_on_caller+0x16/0x1e0
[<ffffffff811c5cc8>] vfs_writev+0x38/0x60
[<ffffffff811c5e10>] SyS_writev+0x50/0xd0
[<ffffffff8173d8e4>] tracesys+0xdd/0xe2
---[ end trace 201843ae71ab5a7c ]---


177
178 WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex));
179


2013-10-05 03:19:24

by Dave Chinner

[permalink] [raw]
Subject: Re: fs/attr.c:notify_change locking warning.

On Fri, Oct 04, 2013 at 08:52:10PM -0400, Dave Jones wrote:
> WARNING: CPU: 3 PID: 26128 at fs/attr.c:178 notify_change+0x34d/0x360()
> Modules linked in: dlci 8021q garp sctp snd_seq_dummy bridge stp tun fuse rfcomm hidp ipt_ULOG nfc caif_socket caif af_802154 phonet af_rxrpc bnep bluetooth rfkill can_bcm can_raw can llc2 pppoe pppox ppp_generic slhc irda crc_ccitt rds scsi_transport_iscsi nfnetlink af_key rose x25 atm netrom appletalk ipx p8023 psnap p8022 llc ax25 coretemp hwmon x86_pkg_temp_thermal kvm_intel kvm snd_hda_codec_hdmi xfs snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_page_alloc libcrc32c snd_timer crct10dif_pclmul crc32c_intel snd ghash_clmulni_intel e1000e microcode usb_debug serio_raw pcspkr ptp soundcore pps_core shpchp
> CPU: 3 PID: 26128 Comm: trinity-child57 Not tainted 3.12.0-rc3+ #93
> ffffffff81a3e2ec ffff880071d71bb8 ffffffff8172a763 0000000000000000
> ffff880071d71bf0 ffffffff810552cd 0000000000000a00 ffff88023d6e8b90
> 0000000000008ad0 ffff880071d71c50 ffff8802392882c8 ffff880071d71c00
> Call Trace:
> [<ffffffff8172a763>] dump_stack+0x4e/0x82
> [<ffffffff810552cd>] warn_slowpath_common+0x7d/0xa0
> [<ffffffff810553aa>] warn_slowpath_null+0x1a/0x20
> [<ffffffff811e388d>] notify_change+0x34d/0x360
> [<ffffffff811e18e7>] file_remove_suid+0x87/0xa0
> [<ffffffff811e9019>] ? __mnt_drop_write+0x29/0x50
> [<ffffffff811e90a2>] ? __mnt_drop_write_file+0x12/0x20
> [<ffffffff811e1e6a>] ? file_update_time+0x8a/0xd0
> [<ffffffffa01a4e7a>] xfs_file_aio_write_checks+0xea/0xf0 [xfs]
> [<ffffffffa01a4f50>] xfs_file_dio_aio_write+0xd0/0x3e0 [xfs]
> [<ffffffffa01a5629>] xfs_file_aio_write+0x129/0x130 [xfs]
> [<ffffffff811c45dc>] do_sync_readv_writev+0x4c/0x80
> [<ffffffff811c5aab>] do_readv_writev+0xbb/0x240
> [<ffffffffa01a5500>] ? xfs_file_buffered_aio_write+0x2a0/0x2a0 [xfs]
> [<ffffffff811c4500>] ? do_sync_read+0x90/0x90
> [<ffffffff81734721>] ? _raw_spin_unlock+0x31/0x60
> [<ffffffff810c9a86>] ? trace_hardirqs_on_caller+0x16/0x1e0
> [<ffffffff811c5cc8>] vfs_writev+0x38/0x60
> [<ffffffff811c5e10>] SyS_writev+0x50/0xd0
> [<ffffffff8173d8e4>] tracesys+0xdd/0xe2
> ---[ end trace 201843ae71ab5a7c ]---
>
>
> 177
> 178 WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex));

Yup, we don't hold the i_mutex *at all* through the fast path for
direct IO writes. Having to grab the i_mutex on every IO just for
the extremely unlikely case we need to remove a suid bit on the file
would add a significant serialisation point into the direct Io model
that XFS uses, and is the difference between 50,000 and 2+ million
direct IO IOPS to a single file.

I'm unwilling to sacrifice the concurrency of direct IO writes just
to shut up ths warning, especially as the actual modifications that
are made to remove SUID bits are correctly serialised within XFS
once notify_change() calls ->setattr(). If it really matters, I'll
just open code file_remove_suid() into XFS like ocfs2 does just so
we don't get that warning being emitted by trinity.

FWIW, buffered IO on XFS - the normal case for most operations -
holds the i_mutex over the call to file_remove_suid(), and that's
why this warning is pretty much never seen - direct IO writes to
suid files is very unusual....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-10-15 20:19:10

by Christoph Hellwig

[permalink] [raw]
Subject: Re: fs/attr.c:notify_change locking warning.

On Sat, Oct 05, 2013 at 01:19:18PM +1000, Dave Chinner wrote:
> Yup, we don't hold the i_mutex *at all* through the fast path for
> direct IO writes. Having to grab the i_mutex on every IO just for
> the extremely unlikely case we need to remove a suid bit on the file
> would add a significant serialisation point into the direct Io model
> that XFS uses, and is the difference between 50,000 and 2+ million
> direct IO IOPS to a single file.
>
> I'm unwilling to sacrifice the concurrency of direct IO writes just
> to shut up ths warning, especially as the actual modifications that
> are made to remove SUID bits are correctly serialised within XFS
> once notify_change() calls ->setattr(). If it really matters, I'll
> just open code file_remove_suid() into XFS like ocfs2 does just so
> we don't get that warning being emitted by trinity.

But the i_lock doesn't synchronize against the VFS modifying various
struct inode fields. The right fix is to take i_mutex just in case
we actually need to remove the suid bit. The patch below should fix it,
although I need to write a testcase that actually exercises it first.

Dave (J.): if you have time to try the patch below please go ahead,
if not I'll make sure to write an isolated test ASAP to verify it and
will then submit the change.

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 4c749ab..e879f96 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -590,8 +590,22 @@ restart:
* If we're writing the file then make sure to clear the setuid and
* setgid bits if the process is not being run by root. This keeps
* people from modifying setuid and setgid binaries.
+ *
+ * Note that file_remove_suid must be called with the i_mutex held,
+ * so we have to go through some hoops here to make sure we hold it.
*/
- return file_remove_suid(file);
+ if (!IS_NOSEC(inode) && should_remove_suid(file->f_path.dentry)) {
+ if (*iolock == XFS_IOLOCK_SHARED) {
+ mutex_lock(&inode->i_mutex);
+ error = file_remove_suid(file);
+ mutex_unlock(&inode->i_mutex);
+ } else {
+ error = file_remove_suid(file);
+ }
+
+ }
+
+ return error;
}

/*

2013-10-15 21:36:26

by Dave Chinner

[permalink] [raw]
Subject: Re: fs/attr.c:notify_change locking warning.

On Tue, Oct 15, 2013 at 01:19:05PM -0700, Christoph Hellwig wrote:
> On Sat, Oct 05, 2013 at 01:19:18PM +1000, Dave Chinner wrote:
> > Yup, we don't hold the i_mutex *at all* through the fast path for
> > direct IO writes. Having to grab the i_mutex on every IO just for
> > the extremely unlikely case we need to remove a suid bit on the file
> > would add a significant serialisation point into the direct Io model
> > that XFS uses, and is the difference between 50,000 and 2+ million
> > direct IO IOPS to a single file.
> >
> > I'm unwilling to sacrifice the concurrency of direct IO writes just
> > to shut up ths warning, especially as the actual modifications that
> > are made to remove SUID bits are correctly serialised within XFS
> > once notify_change() calls ->setattr(). If it really matters, I'll
> > just open code file_remove_suid() into XFS like ocfs2 does just so
> > we don't get that warning being emitted by trinity.
>
> But the i_lock doesn't synchronize against the VFS modifying various
> struct inode fields.

Sure, but file_remove_suid() doesn't actually modify any VFS inode
structures until we process the flags and the modifications within
->setattr, which in XFS are all done under the XFS_ILOCK_EXCL via
xfs_setattr_mode(). i.e. both the VFS and XFS inodes S*ID bits are
removed only under XFS_ILOCK_EXCL....

Hence I see no point in adding extra serialisation via the i_mutex
to this path when we can just do something like:

killsuid = should_remove_suid(file->f_path.dentry);
if (killsuid) {
struct iattr newattr;

newattr.ia_valid = ATTR_FORCE | killsuid;
error = xfs_setattr_nonsize(ip, &newattr, 0);
if (error)
return error;
}

and not require the i_mutex at all...

Indeed, this is exactly what do_truncate() does - the check outside
the i_mutex, then calls notify_change() with the i_mutex held. IOWs,
the i_mutex does nothing to serialise concurrent attempts to check
and remove S*ID bits....

> The right fix is to take i_mutex just in case
> we actually need to remove the suid bit. The patch below should fix it,
> although I need to write a testcase that actually exercises it first.
>
> Dave (J.): if you have time to try the patch below please go ahead,
> if not I'll make sure to write an isolated test ASAP to verify it and
> will then submit the change.
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 4c749ab..e879f96 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -590,8 +590,22 @@ restart:
> * If we're writing the file then make sure to clear the setuid and
> * setgid bits if the process is not being run by root. This keeps
> * people from modifying setuid and setgid binaries.
> + *
> + * Note that file_remove_suid must be called with the i_mutex held,
> + * so we have to go through some hoops here to make sure we hold it.
> */
> - return file_remove_suid(file);
> + if (!IS_NOSEC(inode) && should_remove_suid(file->f_path.dentry)) {
> + if (*iolock == XFS_IOLOCK_SHARED) {
> + mutex_lock(&inode->i_mutex);
> + error = file_remove_suid(file);
> + mutex_unlock(&inode->i_mutex);

Lock inversion - i_mutex is always outside i_iolock. i.e. this will
deadlock if someone else calls xfs_rw_ilock(XFS_ILOCK_EXCL) at the
same time because we already hold the i_iolock in shared mode. It's
the same case that this function already handles for the EOF zeroing
relocking.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-10-16 07:05:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: fs/attr.c:notify_change locking warning.

On Wed, Oct 16, 2013 at 08:36:18AM +1100, Dave Chinner wrote:
> Sure, but file_remove_suid() doesn't actually modify any VFS inode
> structures until we process the flags and the modifications within
> ->setattr, which in XFS are all done under the XFS_ILOCK_EXCL via
> xfs_setattr_mode(). i.e. both the VFS and XFS inodes S*ID bits are
> removed only under XFS_ILOCK_EXCL....

It can set S_NOSEC after calling into ->setattr at least.

> Hence I see no point in adding extra serialisation via the i_mutex
> to this path when we can just do something like:
>
> killsuid = should_remove_suid(file->f_path.dentry);
> if (killsuid) {
> struct iattr newattr;
>
> newattr.ia_valid = ATTR_FORCE | killsuid;
> error = xfs_setattr_nonsize(ip, &newattr, 0);
> if (error)
> return error;
> }

We'd still need all the other magic in file_remove_suid, which I don't
actually quite undersdtand fully yet.

2013-10-16 10:26:56

by Dave Chinner

[permalink] [raw]
Subject: Re: fs/attr.c:notify_change locking warning.

On Wed, Oct 16, 2013 at 12:05:28AM -0700, Christoph Hellwig wrote:
> On Wed, Oct 16, 2013 at 08:36:18AM +1100, Dave Chinner wrote:
> > Sure, but file_remove_suid() doesn't actually modify any VFS inode
> > structures until we process the flags and the modifications within
> > ->setattr, which in XFS are all done under the XFS_ILOCK_EXCL via
> > xfs_setattr_mode(). i.e. both the VFS and XFS inodes S*ID bits are
> > removed only under XFS_ILOCK_EXCL....
>
> It can set S_NOSEC after calling into ->setattr at least.
>
> > Hence I see no point in adding extra serialisation via the i_mutex
> > to this path when we can just do something like:
> >
> > killsuid = should_remove_suid(file->f_path.dentry);
> > if (killsuid) {
> > struct iattr newattr;
> >
> > newattr.ia_valid = ATTR_FORCE | killsuid;
> > error = xfs_setattr_nonsize(ip, &newattr, 0);
> > if (error)
> > return error;
> > }
>
> We'd still need all the other magic in file_remove_suid, which I don't
> actually quite undersdtand fully yet.

The killpriv calls? I couldn't find anything that implemented those
security hooks nor any documentation about it, so I'm pretty much
clueless about it. FWIW, ocfs2 doesn't implement them, either....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-10-16 18:12:41

by Christoph Hellwig

[permalink] [raw]
Subject: Re: fs/attr.c:notify_change locking warning.

On Wed, Oct 16, 2013 at 09:26:51PM +1100, Dave Chinner wrote:
> The killpriv calls? I couldn't find anything that implemented those
> security hooks nor any documentation about it, so I'm pretty much
> clueless about it. FWIW, ocfs2 doesn't implement them, either....

The killpriv code ends up doing xattr calls for per-file capabilities
(grep security/commoncap.c for killpriv). Seems like ocfs2 is buggy in
that regard.

I suspect the easiest way to solve it properly in XFS is to simply
retake the iolock exclusive and get the i_mutex as part of it. This
means direct I/O writes to files with the suid bit won't scale, but I
think we can live with that given that it avoids introducing special
cases that impact more code.