Subject: [PATCH] fs/coredump: open coredump file in O_WRONLY instead of O_RDWR

This makes it possible to make stricter apparmor profile and don't
allow the program to read any coredump in the system.

Signed-off-by: Vladimir Sementsov-Ogievskiy <[email protected]>
---
fs/coredump.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 5df1e6e1eb2b..8f263a389175 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -646,7 +646,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
} else {
struct mnt_idmap *idmap;
struct inode *inode;
- int open_flags = O_CREAT | O_RDWR | O_NOFOLLOW |
+ int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
O_LARGEFILE | O_EXCL;

if (cprm.limit < binfmt->min_coredump)
--
2.34.1


Subject: Re: [PATCH] fs/coredump: open coredump file in O_WRONLY instead of O_RDWR

Gently ping.

Is there any interest?

On 20.04.23 15:04, Vladimir Sementsov-Ogievskiy wrote:
> This makes it possible to make stricter apparmor profile and don't
> allow the program to read any coredump in the system.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <[email protected]>
> ---
> fs/coredump.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 5df1e6e1eb2b..8f263a389175 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -646,7 +646,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> } else {
> struct mnt_idmap *idmap;
> struct inode *inode;
> - int open_flags = O_CREAT | O_RDWR | O_NOFOLLOW |
> + int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
> O_LARGEFILE | O_EXCL;
>
> if (cprm.limit < binfmt->min_coredump)

--
Best regards,
Vladimir


2023-05-15 18:20:00

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] fs/coredump: open coredump file in O_WRONLY instead of O_RDWR

On Wed, May 10, 2023 at 06:15:41PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> Gently ping.
>
> Is there any interest?

The question that I would've loved to have an answer to was why was it
made O_RDWR and not just O_WRONLY in the first place. Was there a time
when this was meaningful? Because honestly this looks innocent and
straightforward and then it always makes me go and think "Oh, there's
probably a good reason and something super obvious I'm missing.".

Funny enough, this code was originally:

if (open_namei("core",O_CREAT | O_WRONLY | O_TRUNC,0600,&inode,NULL)

and then became:

if (open_namei("core",O_CREAT | 2 | O_TRUNC,0600,&inode,NULL))

in

commit 9cb9f18b5d26 ("[PATCH] Linux-0.99.10 (June 7, 1993)")

Author/applier of said patch Cced (more for the fun of referencing
Linux-0.99.10 than anything else).

So after this commit the flag combination just got copied over and over.
First when coredump handling was moved out of fs/exec.c into the
individual binfmt handlers and then again when it was moved back into
fs/exec.c and then again when it was moved to fs/coredump.c.

So that open-coded 2 added in commit 9cb9f18b5d26 ("[PATCH]
Linux-0.99.10 (June 7, 1993)") survived for 23 years until it was
replaced by Jan in 378c6520e7d2 ("fs/coredump: prevent fsuid=0 dumps
into user-controlled directories").

So no one could be bothered for 23 years to use O_RDWR instead of that
lonely 2 which is kinda funny. :)

In any case, I don't see anything that suggests this would cause issues.
So I'm going to pick this up unless I'm being told I'm being obviously
stupid and this absolutely needs to be O_RDWR...

>
> On 20.04.23 15:04, Vladimir Sementsov-Ogievskiy wrote:
> > This makes it possible to make stricter apparmor profile and don't
> > allow the program to read any coredump in the system.
> >
> > Signed-off-by: Vladimir Sementsov-Ogievskiy <[email protected]>
> > ---
> > fs/coredump.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/coredump.c b/fs/coredump.c
> > index 5df1e6e1eb2b..8f263a389175 100644
> > --- a/fs/coredump.c
> > +++ b/fs/coredump.c
> > @@ -646,7 +646,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
> > } else {
> > struct mnt_idmap *idmap;
> > struct inode *inode;
> > - int open_flags = O_CREAT | O_RDWR | O_NOFOLLOW |
> > + int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
> > O_LARGEFILE | O_EXCL;
> > if (cprm.limit < binfmt->min_coredump)
>
> --
> Best regards,
> Vladimir
>

2023-05-15 19:02:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] fs/coredump: open coredump file in O_WRONLY instead of O_RDWR

On Mon, May 15, 2023 at 10:55 AM Christian Brauner <[email protected]> wrote:
>
> So that open-coded 2 added in commit 9cb9f18b5d26 ("[PATCH]
> Linux-0.99.10 (June 7, 1993)") survived for 23 years until it was
> replaced by Jan in 378c6520e7d2 ("fs/coredump: prevent fsuid=0 dumps
> into user-controlled directories").

Hmm.

I can *not* for the life of me remember anything that far back, and
our mail archives don't go that far back either.

It's strange, because the "O_WRONLY" -> "2" change that changes to a
magic raw number is right next to changing "(unsigned short) 0x10" to
"KERNEL_DS", so we're getting *rid* of a magic raw number there.

Which makes me think it was intentional, but I don't know why it
wouldn't have used O_RDWR instead of "2".

Back then we did *not* have O_EXCL in the core file creation flags, so
I'm wondering if it was some half-arsed thing as in "do not allow
core-files to overwrite non-readable files in-place".

They'd still have to be *writable*, though, so that still seems more
than a bit odd.

I have this *dim* memory of us having had filesystems that required
readability for over-writing existing file data (because we'd do a
read-modify-write for the page cache, kind of like how you can't have
write-only pages on many architectures). But while we didn't have
O_EXCL, we *did* have O_TRUNC, so that should be a non-issue.

I don't see a problem with making it O_WRONLY. Like it was 30 years
ago. But that unexplained "O_WRONLY" -> "2" annoys me. It does feel
like there was some reason for it.

Linus

2023-05-15 19:18:19

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] fs/coredump: open coredump file in O_WRONLY instead of O_RDWR

On Mon, May 15, 2023 at 11:50 AM Linus Torvalds
<[email protected]> wrote:
>
> It's strange, because the "O_WRONLY" -> "2" change that changes to a
> magic raw number is right next to changing "(unsigned short) 0x10" to
> "KERNEL_DS", so we're getting *rid* of a magic raw number there.

Oh, no, never mind. I see what is going on.

Back then, "open_namei()" didn't actually take O_RDWR style flags AT ALL.

The O_RDONLY flags are broken, because you cannot say "open with no
permissions", which we used internally. You have

0 - read-only
1 - write-only
2 - read-write

but the internal code actually wants to match that up with the
read-write permission bits (FMODE_READ etc).

And then we've long had a special value for "open for special
accesses" (format etc), which (naturally) was 3.

So then the open code would do

f->f_flags = flag = flags;
f->f_mode = (flag+1) & O_ACCMODE;
if (f->f_mode)
flag++;

which means that "f_mode" now becomes that FMODE_READ | FMODE_WRITE
mask, and "flag" ends up being a translation from that O_RDWR space
(0/1/2/3) into the FMODE_READ/WRITE space (1/2/3/3, where "special"
required read-write permissions, and 0 was only used for symlinks).

We still have that, although the code looks different.

So back then, "open_namei()" took that FMODE_READ/WRITE flag as an
argument, and the "O_WRONLY" -> "2" change is actually a bugfix and
makes sense. The O_WRONLY thing was wrong, because it was 1, which
actuall ymeant FMODE_READ.

And back then, we didn't *have* FMODE_READ and FMODE_WRITE.

So just writing it as "2" made sense, even if it was horrible. We
added FMODE_WRITE later, but never fixed up those core file writers.

So that 0.99pl10 commit from 1993 is actually correct, and the bug
happened *later*.

I think the real bug may have been in 2.2.4pre4 (February 16, 1999),
when this happened:

- dentry = open_namei(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);
...
+ file = filp_open(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);

without realizing that the "2" in open_namei() should have become a
O_WRONLY for filp_open().

So I think this explains it all.

Very understandable mistake after all.

Linus

Subject: Re: [PATCH] fs/coredump: open coredump file in O_WRONLY instead of O_RDWR

On 15.05.23 22:13, Linus Torvalds wrote:
> On Mon, May 15, 2023 at 11:50 AM Linus Torvalds
> <[email protected]> wrote:
>>
>> It's strange, because the "O_WRONLY" -> "2" change that changes to a
>> magic raw number is right next to changing "(unsigned short) 0x10" to
>> "KERNEL_DS", so we're getting *rid* of a magic raw number there.
>
> Oh, no, never mind. I see what is going on.
>
> Back then, "open_namei()" didn't actually take O_RDWR style flags AT ALL.
>
> The O_RDONLY flags are broken, because you cannot say "open with no
> permissions", which we used internally. You have
>
> 0 - read-only
> 1 - write-only
> 2 - read-write
>
> but the internal code actually wants to match that up with the
> read-write permission bits (FMODE_READ etc).
>
> And then we've long had a special value for "open for special
> accesses" (format etc), which (naturally) was 3.
>
> So then the open code would do
>
> f->f_flags = flag = flags;
> f->f_mode = (flag+1) & O_ACCMODE;
> if (f->f_mode)
> flag++;
>
> which means that "f_mode" now becomes that FMODE_READ | FMODE_WRITE
> mask, and "flag" ends up being a translation from that O_RDWR space
> (0/1/2/3) into the FMODE_READ/WRITE space (1/2/3/3, where "special"
> required read-write permissions, and 0 was only used for symlinks).
>
> We still have that, although the code looks different.
>
> So back then, "open_namei()" took that FMODE_READ/WRITE flag as an
> argument, and the "O_WRONLY" -> "2" change is actually a bugfix and
> makes sense. The O_WRONLY thing was wrong, because it was 1, which
> actuall ymeant FMODE_READ.
>
> And back then, we didn't *have* FMODE_READ and FMODE_WRITE.
>
> So just writing it as "2" made sense, even if it was horrible. We
> added FMODE_WRITE later, but never fixed up those core file writers.
>
> So that 0.99pl10 commit from 1993 is actually correct, and the bug
> happened *later*.
>
> I think the real bug may have been in 2.2.4pre4 (February 16, 1999),
> when this happened:
>
> - dentry = open_namei(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);
> ...
> + file = filp_open(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);
>
> without realizing that the "2" in open_namei() should have become a
> O_WRONLY for filp_open().
>
> So I think this explains it all.
>
> Very understandable mistake after all.
>
> Linus

Wow that's became a detective story, great thanks! [took note to check history myself next time]

--
Best regards,
Vladimir


2023-05-16 13:55:46

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] fs/coredump: open coredump file in O_WRONLY instead of O_RDWR

On Thu, 20 Apr 2023 15:04:09 +0300, Vladimir Sementsov-Ogievskiy wrote:
> This makes it possible to make stricter apparmor profile and don't
> allow the program to read any coredump in the system.
>
>

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/1] fs/coredump: open coredump file in O_WRONLY instead of O_RDWR
https://git.kernel.org/vfs/vfs/c/f84566e710af

2023-05-16 14:21:07

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH] fs/coredump: open coredump file in O_WRONLY instead of O_RDWR

On Tue, May 16, 2023 at 03:46:11PM +0200, Christian Brauner wrote:
> On Thu, 20 Apr 2023 15:04:09 +0300, Vladimir Sementsov-Ogievskiy wrote:
> > This makes it possible to make stricter apparmor profile and don't
> > allow the program to read any coredump in the system.
> >
> >
>
> Applied to the vfs.misc branch of the vfs/vfs.git tree.
> Patches in the vfs.misc branch should appear in linux-next soon.
>
> Please report any outstanding bugs that were missed during review in a
> new review to the original patch series allowing us to drop it.
>
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
> branch: vfs.misc
>
> [1/1] fs/coredump: open coredump file in O_WRONLY instead of O_RDWR
> https://git.kernel.org/vfs/vfs/c/f84566e710af

I updated the patch to include all the details we unearthed about this.
Linus, I added your SOB. It just made sense imho given that you provided
quite some details on this. Let me know if that bothers you. The patch
now is:

From f84566e710af39895e54d8e812cd47e5e53db671 Mon Sep 17 00:00:00 2001
From: Vladimir Sementsov-Ogievskiy <[email protected]>
Date: Thu, 20 Apr 2023 15:04:09 +0300
Subject: coredump: require O_WRONLY instead of O_RDWR

The motivation for this patch has been to enable using a stricter
apparmor profile to prevent programs from reading any coredump in the
system.

However, this became something else. The following details are based on
Christian's and Linus' archeology into the history of the number "2" in
the coredump handling code.

To make sure we're not accidently introducing some subtle behavioral
change into the coredump code we set out on a voyage into the depths of
history.git to figure out why this was O_RDWR in the first place.

Coredump handling was introduced over 30 years ago in commit
ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)").
The original code used O_WRONLY:

open_namei("core",O_CREAT | O_WRONLY | O_TRUNC,0600,&inode,NULL)

However, this changed in 1993 and starting with commit
9cb9f18b5d26 ("[PATCH] Linux-0.99.10 (June 7, 1993)") the coredump code
suddenly used the constant "2":

open_namei("core",O_CREAT | 2 | O_TRUNC,0600,&inode,NULL)

This was curious as in the same commit the kernel switched from
constants to proper defines in other places such as KERNEL_DS and
USER_DS and O_RDWR did already exist.

So why was "2" used? It turns out that open_namei() - an early version
of what later turned into filp_open() - didn't accept O_RDWR.

A semantic quirk of the open() uapi is the definition of the O_RDONLY
flag. It would seem natural to define:

#define O_RDWR (O_RDONLY | O_WRONLY)

but that isn't possible because:

#define O_RDONLY 0

This makes O_RDONLY effectively meaningless when passed to the kernel.
In other words, there has never been a way - until O_PATH at least - to
open a file without any permission; O_RDONLY was always implied on the
uapi side while the kernel does in fact allow opening files without
permissions.

The trouble comes when trying to map the uapi flags onto the
corresponding file mode flags FMODE_{READ,WRITE}. This mapping still
happens today and is causing issues to this day (We ran into this
during additions for openat2() for example.).

So the special value "3" was used to indicate that the file was opened
for special access:

f->f_flags = flag = flags;
f->f_mode = (flag+1) & O_ACCMODE;
if (f->f_mode)
flag++;

This allowed the file mode to be set to FMODE_READ | FMODE_WRITE mapping
the O_{RDONLY,WRONLY,RDWR} flags into the FMODE_{READ,WRITE} flags. The
special access then required read-write permissions and 0 was used to
access symlinks.

But back when ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") added
coredump handling open_namei() took the FMODE_{READ,WRITE} flags as an
argument. So the coredump handling introduced in
ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") was buggy because
O_WRONLY shouldn't have been passed. Since O_WRONLY is 1 but
open_namei() took FMODE_{READ,WRITE} it was passed FMODE_READ on
accident.

So 9cb9f18b5d26 ("[PATCH] Linux-0.99.10 (June 7, 1993)") was a bugfix
for this and the 2 didn't really mean O_RDWR, it meant FMODE_WRITE which
was correct.

The clue is that FMODE_{READ,WRITE} didn't exist yet and thus a raw "2"
value was passed.

Fast forward 5 years when around 2.2.4pre4 (February 16, 1999) this code
was changed to:

- dentry = open_namei(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);
...
+ file = filp_open(corefile,O_CREAT | 2 | O_TRUNC | O_NOFOLLOW, 0600);

At this point the raw "2" should have become O_WRONLY again as
filp_open() didn't take FMODE_{READ,WRITE} but O_{RDONLY,WRONLY,RDWR}.

Another 17 years later, the code was changed again cementing the mistake
and making it almost impossible to detect when commit
378c6520e7d2 ("fs/coredump: prevent fsuid=0 dumps into user-controlled directories")
replaced the raw "2" with O_RDWR.

And now, here we are with this patch that sent us on a quest to answer
the big questions in life such as "Why are coredump files opened with
O_RDWR?" and "Is it safe to just use O_WRONLY?".

So with this commit we're reintroducing O_WRONLY again and bringing this
code back to its original state when it was first introduced in commit
ddc733f452e0 ("[PATCH] Linux-0.97 (August 1, 1992)") over 30 years ago.

Signed-off-by: Vladimir Sementsov-Ogievskiy <[email protected]>
Message-Id: <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
Signed-off-by: Christian Brauner <[email protected]>
---
fs/coredump.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index ece7badf701bc..ead3b05fb8f48 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -646,7 +646,7 @@ void do_coredump(const kernel_siginfo_t *siginfo)
} else {
struct mnt_idmap *idmap;
struct inode *inode;
- int open_flags = O_CREAT | O_RDWR | O_NOFOLLOW |
+ int open_flags = O_CREAT | O_WRONLY | O_NOFOLLOW |
O_LARGEFILE | O_EXCL;

if (cprm.limit < binfmt->min_coredump)
--
cgit