On Tue 10-10-23 15:17:17, Max Kellermann wrote:
> On Tue, Oct 10, 2023 at 3:11 PM Jan Kara <[email protected]> wrote:
> > Thanks for the updated changelog! But as I'm looking into VFS code isn't
> > this already handled by mode_strip_umask() / vfs_prepare_mode() in
> > fs/namei.c? Because posix_acl_create() doesn't do anything to 'mode' for
> > !IS_POSIXACL() filesystems either. So at least ext2 (where I've checked
> > the mount option handling) does seem to have umask properly applied in all
> > the cases. But I might be missing something...
>
> I'm not sure either. I was hoping the VFS experts could tell something
> about how this API is supposed to be used and whose responsibility it
> is to apply the umask. There used to be some confusion in the code, to
> the point it was missing completely for O_TMPFILE. I'm still somewhat
> confused. Maybe this is a chance to clear this confusion up and then
> document it?
So I've checked some more and the kernel doc comments before
mode_strip_umask() and vfs_prepare_mode() make it pretty obvious - all
paths creating new inodes must be calling vfs_prepare_mode(). As a result
mode_strip_umask() which handles umask stripping for filesystems not
supporting posix ACLs. For filesystems that do support ACLs,
posix_acl_create() must be call and that handles umask stripping. So your
fix should not be needed. CCed some relevant people for confirmation.
> I wish there was one central place to apply the umask, and not spread
> it around two (or more?) different code locations, depending on
> whether there's an ACL. For my taste, that sort of policy is too error
> prone for something as sensitive as umasks. After we already had the
> O_TMPFILE vulnerability (which was only fixed last year, three
> years(!) after I reported it).
I agree having umask stripping in two places is not great but it's
difficult to avoid with how posix ACLs are implemented and intertwined in
various filesystem implementations. At least the current design made it
quite a bit harder to forget to strip the umask.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Oct 11, 2023 at 12:05 PM Jan Kara <[email protected]> wrote:
> So I've checked some more and the kernel doc comments before
> mode_strip_umask() and vfs_prepare_mode() make it pretty obvious - all
> paths creating new inodes must be calling vfs_prepare_mode(). As a result
> mode_strip_umask() which handles umask stripping for filesystems not
> supporting posix ACLs. For filesystems that do support ACLs,
> posix_acl_create() must be call and that handles umask stripping. So your
> fix should not be needed. CCed some relevant people for confirmation.
Thanks, Jan. Do you think the documentation is obvious enough, or
shall I look around and try to improve the documentation? I'm not a FS
expert, so it may be just my fault that it confused me.... I just
analyzed the O_TMPFILE vulnerability four years ago (because it was
reported to me as the maintainer of a userspace software).
Apart from my doubts that this API contract is too error prone, I'm
not quite sure if all filesystems really implement it properly.
For example, overlayfs unconditionally sets SB_POSIXACL, even if the
kernel has no ACL support. Would this ignore the umask? I'm not sure,
overlayfs is a special beast.
Then there's orangefs which allows setting the "acl" mount option (and
thus SB_POSIXACL) even if the kernel has no ACL support. Same for gfs2
and maybe cifs, maybe more, I didn't check them all.
The "mainstream" filesystems like ext4 seem to be implemented
properly, though this is still too fragile for my taste... ext4 has
the SB_POSIXACL code even if there's no kernel ACL support, but it is
not reachable because EXT4_MOUNT_POSIX_ACL cannot be set from
userspace in that case. The code looks suspicious, but is okay in the
end - still not my taste.
I see so much redundant code regarding the "acl" mount option in all
filesystems. I believe the API should be designed in a way that it is
safe-by-default, and shouldn't need very careful considerations in
each and every filesystem, or else all filesystems repeat the same
mistakes until the last one gets fixed.
Max