2023-11-24 10:28:12

by Christian Brauner

[permalink] [raw]
Subject: [GIT PULL] vfs fixes

Hey Linus,

/* Summary */
This contains the usual miscellaneous fixes:

* Avoid calling back into LSMs from vfs_getattr_nosec() calls.

IMA used to query inode properties accessing raw inode fields without
dedicated helpers. That was finally fixed a few releases ago by
forcing IMA to use vfs_getattr_nosec() helpers.

The goal of the vfs_getattr_nosec() helper is to query for attributes
without calling into the LSM layer which would be quite problematic
because incredibly IMA is called from __fput()...

__fput()
-> ima_file_free()

What it does is to call back into the filesystem to update the file's
IMA xattr. Querying the inode without using vfs_getattr_nosec() meant
that IMA didn't handle stacking filesystems such as overlayfs
correctly. So the switch to vfs_getattr_nosec() is quite correct. But
the switch to vfs_getattr_nosec() revealed another bug when used on
stacking filesystems:

__fput()
-> ima_file_free()
-> vfs_getattr_nosec()
-> i_op->getattr::ovl_getattr()
-> vfs_getattr()
-> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr()
-> security_inode_getattr() # calls back into LSMs

Now, if that __fput() happens from task_work_run() of an exiting task
current->fs and various other pointer could already be NULL. So
anything in the LSM layer relying on that not being NULL would be
quite surprised.

Fix that by passing the information that this is a security request
through to the stacking filesystem by adding a new internal
ATT_GETATTR_NOSEC flag. Now the callchain becomes:

__fput()
-> ima_file_free()
-> vfs_getattr_nosec()
-> i_op->getattr::ovl_getattr()
-> if (AT_GETATTR_NOSEC)
vfs_getattr_nosec()
else
vfs_getattr()
-> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr()

* Fix a bug introduced with the iov_iter rework from last cycle.

This broke /proc/kcore by copying too much and without the correct
offset.

* Add a missing NULL check when allocating the root inode in
autofs_fill_super().

* Fix stable writes for multi-device filesystems (xfs, btrfs etc) and
the block device pseudo filesystem.

Stable writes used to be a superblock flag only, making it a per
filesystem property. Add an additional AS_STABLE_WRITES mapping flag
to allow for fine-grained control.

* Ensure that offset_iterate_dir() returns 0 after reaching the end of a
directory so it adheres to getdents() convention.

/* Testing */
clang: Debian clang version 16.0.6 (16)
gcc: gcc (Debian 13.2.0-5) 13.2.0

All patches are based on v6.7-rc1 and have been sitting in linux-next.
No build failures or warnings were observed. Passes xfstests.

/* Conflicts */
At the time of creating this PR no merge conflicts were reported from
linux-next and no merge conflicts showed up doing a test-merge with
current mainline.

The following changes since commit b85ea95d086471afb4ad062012a4d73cd328fa86:

Linux 6.7-rc1 (2023-11-12 16:19:07 -0800)

are available in the Git repository at:

[email protected]:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.7-rc3.fixes

for you to fetch changes up to 796432efab1e372d404e7a71cc6891a53f105051:

libfs: getdents() should return 0 after reaching EOD (2023-11-20 15:34:22 +0100)

Please consider pulling these changes from the signed vfs-6.7-rc3.fixes tag.

Thanks!
Christian

----------------------------------------------------------------
vfs-6.7-rc3.fixes

----------------------------------------------------------------
Christoph Hellwig (4):
filemap: add a per-mapping stable writes flag
block: update the stable_writes flag in bdev_add
xfs: clean up FS_XFLAG_REALTIME handling in xfs_ioctl_setattr_xflags
xfs: respect the stable writes flag on the RT device

Chuck Lever (1):
libfs: getdents() should return 0 after reaching EOD

Ian Kent (1):
autofs: add: new_inode check in autofs_fill_super()

Omar Sandoval (1):
iov_iter: fix copy_page_to_iter_nofault()

Stefan Berger (1):
fs: Pass AT_GETATTR_NOSEC flag to getattr interface function

block/bdev.c | 2 ++
fs/autofs/inode.c | 56 +++++++++++++++++-----------------------------
fs/ecryptfs/inode.c | 12 ++++++++--
fs/inode.c | 2 ++
fs/libfs.c | 14 +++++++++---
fs/overlayfs/inode.c | 10 ++++-----
fs/overlayfs/overlayfs.h | 8 +++++++
fs/stat.c | 6 ++++-
fs/xfs/xfs_inode.h | 8 +++++++
fs/xfs/xfs_ioctl.c | 30 ++++++++++++++++---------
fs/xfs/xfs_iops.c | 7 ++++++
include/linux/pagemap.h | 17 ++++++++++++++
include/uapi/linux/fcntl.h | 3 +++
lib/iov_iter.c | 2 +-
mm/page-writeback.c | 2 +-
15 files changed, 121 insertions(+), 58 deletions(-)


2023-11-24 18:25:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] vfs fixes

On Fri, 24 Nov 2023 at 02:28, Christian Brauner <[email protected]> wrote:
>
> * Fix a bug introduced with the iov_iter rework from last cycle.
>
> This broke /proc/kcore by copying too much and without the correct
> offset.

Ugh. I think the whole /proc/kcore vmalloc handling is just COMPLETELY broken.

It does this:

/*
* vmalloc uses spinlocks, so we optimistically try to
* read memory. If this fails, fault pages in and try
* again until we are done.
*/
while (true) {
read += vread_iter(iter, src, left);
if (read == tsz)
break;

src += read;
left -= read;

if (fault_in_iov_iter_writeable(iter, left)) {
ret = -EFAULT;
goto out;
}
}


and that is just broken beyond words for two totally independent reasons:

(a) vread_iter() looks like it can fail because of not having a
source, and return 0 (I dunno - it seems to try to avoid that, but it
all looks pretty dodgy)

At that point fault_in_iov_iter_writeable() will try to fault
in the destination, which may work just fine, but if the source was
the problem, you'd have an endless loop.

(b) That "read += X" is completely broken anyway. It should be just a
"=". So that whole loop is crazy broken, and only works for the case
where you get it all in one go. This code is crap.

Now, I think it all works in practice for one simple reason: I doubt
anybody uses this (and it looks like the callees in the while loop try
very hard to always fill the whole area - maybe people noticed the
first bug and tried to work around it that way).

I guess there is at least one test program, but it presumably doesn't
trigger or care about the bugs here.

But I think we can get rid of this all, and just deal with the
KCORE_VMALLOC case exactly the same way we already deal with VMEMMAP
and TEXT: by just doing copy_from_kernel_nofault() into a bounce
buffer, and then doing a regular _copy_to_iter() or whatever.

NOTE! I looked at the code, and threw up in my mouth a little, and
maybe I missed something. Maybe it all works fine. But Omar - since
you found the original problem, may I implore you to test this
attached patch?

I just like how the patch looks:

6 files changed, 1 insertion(+), 368 deletions(-)

and those 350+ deleted lines really looked disgusting to me.

This patch is on top of the pull I did, because obviously the fix in
that pull was correct, I just think we should go further and get rid
of this whole mess entirely.

Linus


Attachments:
patch.diff (11.70 kB)

2023-11-24 18:26:41

by pr-tracker-bot

[permalink] [raw]
Subject: Re: [GIT PULL] vfs fixes

The pull request you sent on Fri, 24 Nov 2023 11:27:28 +0100:

> [email protected]:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.7-rc3.fixes

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/fa2b906f5148883e2d0be8952767469c2e3de274

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

2023-11-24 18:52:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] vfs fixes

On Fri, 24 Nov 2023 at 10:25, Linus Torvalds
<[email protected]> wrote:
>
> I just like how the patch looks:
>
> 6 files changed, 1 insertion(+), 368 deletions(-)
>
> and those 350+ deleted lines really looked disgusting to me.

Gaah. I guess it's the VM_IOREMAP case that is the cause of all this horridness.

So we'd have to know not to mess with IO mappings. Annoying.

But I think my patch could still act as a starting point, except that

case KCORE_VMALLOC:

would have to have some kind of "if this is not a regular vmalloc,
just skip it" logic.

So I guess we can't remove all those lines, but we *could* replace all
the vread_iter() code with a "bool can_I_vread_this()" instead. So the
fallback would still be to just do the bounce buffer copy.

Or something.

Oh well.

Linus

2023-11-24 20:12:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] vfs fixes

On Fri, 24 Nov 2023 at 10:52, Linus Torvalds
<[email protected]> wrote:
>
> Gaah. I guess it's the VM_IOREMAP case that is the cause of all this horridness.
>
> So we'd have to know not to mess with IO mappings. Annoying.

Doing a debian code search, I see a number of programs that do a
"stat()" on the kcore file, to get some notion of "system memory
size". I don't think it's valid, but whatever. We probably shouldn't
change it.

I also see some programs that actually read the ELF notes and sections
for dumping purposes.

But does anybody actually run gdb on that thing or similar? That's the
original model for that file, but it was always more of a gimmick than
anything else.

Because we could just say "read zeroes from KCORE_VMALLOC" and be done
with it that way.

Linus

2023-11-25 13:06:17

by Christian Brauner

[permalink] [raw]
Subject: Re: [GIT PULL] vfs fixes

> Because we could just say "read zeroes from KCORE_VMALLOC" and be done
> with it that way.

Let's try to do that and see what happens. If we get away with it then
great, if not we can think about fixing this.

2023-11-25 13:11:08

by Christian Brauner

[permalink] [raw]
Subject: Re: [GIT PULL] vfs fixes

On Fri, Nov 24, 2023 at 10:25:15AM -0800, Linus Torvalds wrote:
> On Fri, 24 Nov 2023 at 02:28, Christian Brauner <[email protected]> wrote:
> >
> > * Fix a bug introduced with the iov_iter rework from last cycle.
> >
> > This broke /proc/kcore by copying too much and without the correct
> > offset.
>
> Ugh. I think the whole /proc/kcore vmalloc handling is just COMPLETELY broken.

Ugh, I didn't even look at that closely because the fix was obviously
correct for that helper alone. Let's try and just return zeroed memory
like you suggested in your last mail before we bother fixing any of
this.

Long-term plan, it'd be nice to just get distros to start turning
/proc/kcore off. Maybe I underestimate legitimate users but this
requires CAP_SYS_RAW_IO so it really can only be useful to pretty
privileged stuff anyway.

2023-11-25 13:29:02

by Omar Sandoval

[permalink] [raw]
Subject: Re: [GIT PULL] vfs fixes

On Sat, Nov 25, 2023 at 02:10:52PM +0100, Christian Brauner wrote:
> On Fri, Nov 24, 2023 at 10:25:15AM -0800, Linus Torvalds wrote:
> > On Fri, 24 Nov 2023 at 02:28, Christian Brauner <[email protected]> wrote:
> > >
> > > * Fix a bug introduced with the iov_iter rework from last cycle.
> > >
> > > This broke /proc/kcore by copying too much and without the correct
> > > offset.
> >
> > Ugh. I think the whole /proc/kcore vmalloc handling is just COMPLETELY broken.
>
> Ugh, I didn't even look at that closely because the fix was obviously
> correct for that helper alone. Let's try and just return zeroed memory
> like you suggested in your last mail before we bother fixing any of
> this.
>
> Long-term plan, it'd be nice to just get distros to start turning
> /proc/kcore off. Maybe I underestimate legitimate users but this
> requires CAP_SYS_RAW_IO so it really can only be useful to pretty
> privileged stuff anyway.

drgn needs /proc/kcore for debugging the live kernel, which is a very
important use case for lots of our users. And it does in fact read from
KCORE_VMALLOC segments, which is why I found and fixed this bug. I'm
happy to clean up this code, although it's a holiday weekend here so I
won't get to it immediately of course. But please don't rip this out.

Omar

2023-11-25 14:04:57

by Christian Brauner

[permalink] [raw]
Subject: Re: [GIT PULL] vfs fixes

On Sat, Nov 25, 2023 at 05:28:49AM -0800, Omar Sandoval wrote:
> On Sat, Nov 25, 2023 at 02:10:52PM +0100, Christian Brauner wrote:
> > On Fri, Nov 24, 2023 at 10:25:15AM -0800, Linus Torvalds wrote:
> > > On Fri, 24 Nov 2023 at 02:28, Christian Brauner <[email protected]> wrote:
> > > >
> > > > * Fix a bug introduced with the iov_iter rework from last cycle.
> > > >
> > > > This broke /proc/kcore by copying too much and without the correct
> > > > offset.
> > >
> > > Ugh. I think the whole /proc/kcore vmalloc handling is just COMPLETELY broken.
> >
> > Ugh, I didn't even look at that closely because the fix was obviously
> > correct for that helper alone. Let's try and just return zeroed memory
> > like you suggested in your last mail before we bother fixing any of
> > this.
> >
> > Long-term plan, it'd be nice to just get distros to start turning
> > /proc/kcore off. Maybe I underestimate legitimate users but this
> > requires CAP_SYS_RAW_IO so it really can only be useful to pretty
> > privileged stuff anyway.
>
> drgn needs /proc/kcore for debugging the live kernel, which is a very
> important use case for lots of our users. And it does in fact read from
> KCORE_VMALLOC segments, which is why I found and fixed this bug. I'm
> happy to clean up this code, although it's a holiday weekend here so I
> won't get to it immediately of course. But please don't rip this out.

Ugh, I see. I just grepped through the drgn repo. I didn't realize that
you were actively using it and not just testing it. If this is actively
used then we won't break you ofc.

And yeah, if you would fix it that would be great. Given that you're the
main active user who happens to have kernel experience you might even
want to be made responsible for this file in the future. ;)