2024-02-21 06:10:48

by WANG Xuerui

[permalink] [raw]
Subject: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

Hi,

Recently, we -- community LoongArch porters -- have noticed a problem
where the Chromium sandbox apparently wants to deny statx [^1] so it
could properly inspect arguments after the sandboxed process later falls
back to fstat. The reasoning behind the change was not clear in the
patch; but we found out it's basically because there's currently not a
"fd-only" version of statx, so that the sandbox has no way to ensure the
path argument is empty without being able to peek into the sandboxed
process's memory. For architectures able to do newfstatat though, the
glibc falls back to newfstatat after getting -ENOSYS for statx, then the
respective SIGSYS handler [^2] takes care of inspecting the path
argument, transforming allowed newfstatat's into fstat instead which is
allowed and has the same type of return value.

But, as loongarch is the first architecture to not have fstat nor
newfstatat, the LoongArch glibc does not attempt falling back at all
when it gets -ENOSYS for statx -- and you see the problem there!

Actually, back when the loongarch port was under review, people were
aware of the same problem with sandboxing clone3 [^3], so clone was
eventually kept. Unfortunately it seemed at that time no one had noticed
statx, so besides restoring fstat/newfstatat to loongarch uapi (and
postponing the problem further), it seems inevitable that we would need
to tackle seccomp deep argument inspection; this is obviously a decision
that shouldn't be taken lightly, so I'm posting this to restart the
conversation to figure out a way forward together. We basically could do
one of below:

- just restore fstat and be done with it;
- add a flag to statx so we can do the equivalent of just fstat(fd,
&out) with statx, and ensuring an error happens if path is not empty in
that case;
- tackle the long-standing problem of seccomp deep argument inspection (!).

Obviously, the simplest solution would be to "just restore fstat". But
again, in my opinion this is not quite a solution but a workaround -- we
have good reasons to keep just statx (mainly because its feature set is
a strict superset of those of fstat/newfstatat), and we're not quite in
a hurry to resolve this. Because one of the prerequisites for a new
Chromium port is "inclusion in Debian stable" -- as the loong64 port
[^4] is progressing and the next Debian release would not happen until
2025, we still have time for a more "elegant" solution.

Alternatively, we could also introduce a new flag for statx, maybe named
AT_STATX_NO_PATH or something like that, so that statx would ignore the
path altogether or error on non-empty paths if called with flags
containing AT_EMPTY_PATH | AT_STATX_NO_PATH. This way a seccomp policy
could allow statx calls only with such flags (that are passed via
register and accessible) and maintain the same level of safety as with
fstat. But there is also disadvantage to this approach: the flag would
be useful only for sandboxes, because in other cases there's no need to
avoid reading from &path. This is also more of a workaround to avoid the
deep argument inspection problem.

Lastly, should we decide to go the hardest way, according to a previous
mail [^5] (about clone3) and the LPC 2019 discussion [^6] [^7], we
probably would try the metadata-annotation-based and piece-by-piece
approach, as it's expected to provide the most benefit and involve less
code changes. The implementation, as I surmise, will involve modifying
the generic syscall entrypoint for early copying of user data, and
corresponding changes to seccomp plumbing so this information is
properly exposed. I don't have a roadmap for non-generic-entry arches
right now, and I also haven't started designing the new seccomp ABI for
that. As a matter of fact, members of the LoongArch community (myself
included) are still fresh to this area of expertise, so a bit more of
your feedback will be appreciated.

Thanks to Miao Wang from AOSC for doing much of the investigation.

[^1]: https://chromium-review.googlesource.com/c/chromium/src/+/2823150
[^2]:
https://chromium.googlesource.com/chromium/src/sandbox/+/c085b51940bd/linux/seccomp-bpf-helpers/sigsys_handlers.cc#355
[^3]:
https://lore.kernel.org/linux-arch/[email protected]/
[^4]: https://wiki.debian.org/Ports/loong64
[^5]: https://lwn.net/ml/linux-kernel/201905301122.88FD40B3@keescook/
[^6]: https://lwn.net/Articles/799557/
[^7]:
https://lpc.events/event/4/contributions/560/attachments/397/640/deep-arg-inspection.pdf

--
WANG "xen0n" Xuerui

Linux/LoongArch mailing list:https://lore.kernel.org/loongarch/



2024-02-21 06:37:25

by Xi Ruoyao

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Wed, 2024-02-21 at 14:09 +0800, WANG Xuerui wrote:

> - just restore fstat and be done with it;
> - add a flag to statx so we can do the equivalent of just fstat(fd,
> &out) with statx, and ensuring an error happens if path is not empty in
> that case;

It's worse than "just restore fstat" considering the performance. Read
this thread:
https://sourceware.org/pipermail/libc-alpha/2023-September/151320.html

> - tackle the long-standing problem of seccomp deep argument inspection (!).

Frankly I'm never a fan of syscall blocklisting. When I develop the
Online Judge system for the programming contest training in Xidian
University I deliberately avoid using seccomp. This thing is very
likely to break innocent programs with some system change innocent as
well (for example Glibc or libstdc++ update).

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2024-02-21 10:33:45

by Xi Ruoyao

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Wed, 2024-02-21 at 14:31 +0800, Xi Ruoyao wrote:
> On Wed, 2024-02-21 at 14:09 +0800, WANG Xuerui wrote:
>
> > - just restore fstat and be done with it;
> > - add a flag to statx so we can do the equivalent of just fstat(fd,
> > &out) with statx, and ensuring an error happens if path is not empty in
> > that case;
>
> It's worse than "just restore fstat" considering the performance.  Read
> this thread:
> https://sourceware.org/pipermail/libc-alpha/2023-September/151320.html

Hmm, but it looks like statx already suffers the same performance issue.
And in this libc-alpha discussion Linus said:

If the user asked for 'fstat()', just give the user 'fstat()'.

So to me we should just add fstat (and use it in Glibc for LoongArch, only
falling back to statx if fstat returns -ENOSYS), or am I missing something?

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2024-02-21 10:49:38

by WANG Xuerui

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?


On 2/21/24 18:31, Xi Ruoyao wrote:
> On Wed, 2024-02-21 at 14:31 +0800, Xi Ruoyao wrote:
>> On Wed, 2024-02-21 at 14:09 +0800, WANG Xuerui wrote:
>>
>>> - just restore fstat and be done with it;
>>> - add a flag to statx so we can do the equivalent of just fstat(fd,
>>> &out) with statx, and ensuring an error happens if path is not empty in
>>> that case;
>> It's worse than "just restore fstat" considering the performance.  Read
>> this thread:
>> https://sourceware.org/pipermail/libc-alpha/2023-September/151320.html
> Hmm, but it looks like statx already suffers the same performance issue.
> And in this libc-alpha discussion Linus said:
>
> If the user asked for 'fstat()', just give the user 'fstat()'.
>
> So to me we should just add fstat (and use it in Glibc for LoongArch, only
> falling back to statx if fstat returns -ENOSYS), or am I missing something?

Or we could add a AT_STATX_NULL_PATH flag and mandate that `path` must
be NULL if this flag is present -- then with simple checks we could have
statx(fd, NULL, AT_STATX_NULL_PATH, STATX_BASIC_STATS, &out) that's both
fstat-like and fast.

--
WANG "xen0n" Xuerui

Linux/LoongArch mailing list: https://lore.kernel.org/loongarch/


2024-02-21 12:03:50

by Xi Ruoyao

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Wed, 2024-02-21 at 18:49 +0800, WANG Xuerui wrote:
>
> On 2/21/24 18:31, Xi Ruoyao wrote:
> > On Wed, 2024-02-21 at 14:31 +0800, Xi Ruoyao wrote:
> > > On Wed, 2024-02-21 at 14:09 +0800, WANG Xuerui wrote:
> > >
> > > > - just restore fstat and be done with it;
> > > > - add a flag to statx so we can do the equivalent of just fstat(fd,
> > > > &out) with statx, and ensuring an error happens if path is not empty in
> > > > that case;
> > > It's worse than "just restore fstat" considering the performance.  Read
> > > this thread:
> > > https://sourceware.org/pipermail/libc-alpha/2023-September/151320.html
> > Hmm, but it looks like statx already suffers the same performance issue.
> > And in this libc-alpha discussion Linus said:
> >
> >     If the user asked for 'fstat()', just give the user 'fstat()'.
> >    
> > So to me we should just add fstat (and use it in Glibc for LoongArch, only
> > falling back to statx if fstat returns -ENOSYS), or am I missing something?
>
> Or we could add a AT_STATX_NULL_PATH flag and mandate that `path` must
> be NULL if this flag is present -- then with simple checks we could have
> statx(fd, NULL, AT_STATX_NULL_PATH, STATX_BASIC_STATS, &out) that's both
> fstat-like and fast.

But then to take the advantage in Glibc fstat() implementation, we'll
need to try AT_STATX_NULL_PATH first, then check for EFAULT (not
ENOSYS!) and fall back to AT_EMPTY_PATH. The EFAULT checking seems
nasty to me...

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2024-02-24 11:52:19

by Huacai Chen

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

Hi, Xuerui,

On Wed, Feb 21, 2024 at 2:10 PM WANG Xuerui <[email protected]> wrote:
>
> Hi,
>
> Recently, we -- community LoongArch porters -- have noticed a problem
> where the Chromium sandbox apparently wants to deny statx [^1] so it
> could properly inspect arguments after the sandboxed process later falls
> back to fstat. The reasoning behind the change was not clear in the
> patch; but we found out it's basically because there's currently not a
> "fd-only" version of statx, so that the sandbox has no way to ensure the
> path argument is empty without being able to peek into the sandboxed
> process's memory. For architectures able to do newfstatat though, the
> glibc falls back to newfstatat after getting -ENOSYS for statx, then the
> respective SIGSYS handler [^2] takes care of inspecting the path
> argument, transforming allowed newfstatat's into fstat instead which is
> allowed and has the same type of return value.
>
> But, as loongarch is the first architecture to not have fstat nor
> newfstatat, the LoongArch glibc does not attempt falling back at all
> when it gets -ENOSYS for statx -- and you see the problem there!
>
> Actually, back when the loongarch port was under review, people were
> aware of the same problem with sandboxing clone3 [^3], so clone was
> eventually kept. Unfortunately it seemed at that time no one had noticed
> statx, so besides restoring fstat/newfstatat to loongarch uapi (and
> postponing the problem further), it seems inevitable that we would need
> to tackle seccomp deep argument inspection; this is obviously a decision
> that shouldn't be taken lightly, so I'm posting this to restart the
> conversation to figure out a way forward together. We basically could do
> one of below:
>
> - just restore fstat and be done with it;
> - add a flag to statx so we can do the equivalent of just fstat(fd,
> &out) with statx, and ensuring an error happens if path is not empty in
> that case;
> - tackle the long-standing problem of seccomp deep argument inspection (!).
From my point of view, I prefer to "restore fstat", because we need to
use the Chrome sandbox everyday (even though it hasn't been upstream
by now). But I also hope "seccomp deep argument inspection" can be
solved in the future.


Huacai

>
> Obviously, the simplest solution would be to "just restore fstat". But
> again, in my opinion this is not quite a solution but a workaround -- we
> have good reasons to keep just statx (mainly because its feature set is
> a strict superset of those of fstat/newfstatat), and we're not quite in
> a hurry to resolve this. Because one of the prerequisites for a new
> Chromium port is "inclusion in Debian stable" -- as the loong64 port
> [^4] is progressing and the next Debian release would not happen until
> 2025, we still have time for a more "elegant" solution.
>
> Alternatively, we could also introduce a new flag for statx, maybe named
> AT_STATX_NO_PATH or something like that, so that statx would ignore the
> path altogether or error on non-empty paths if called with flags
> containing AT_EMPTY_PATH | AT_STATX_NO_PATH. This way a seccomp policy
> could allow statx calls only with such flags (that are passed via
> register and accessible) and maintain the same level of safety as with
> fstat. But there is also disadvantage to this approach: the flag would
> be useful only for sandboxes, because in other cases there's no need to
> avoid reading from &path. This is also more of a workaround to avoid the
> deep argument inspection problem.
>
> Lastly, should we decide to go the hardest way, according to a previous
> mail [^5] (about clone3) and the LPC 2019 discussion [^6] [^7], we
> probably would try the metadata-annotation-based and piece-by-piece
> approach, as it's expected to provide the most benefit and involve less
> code changes. The implementation, as I surmise, will involve modifying
> the generic syscall entrypoint for early copying of user data, and
> corresponding changes to seccomp plumbing so this information is
> properly exposed. I don't have a roadmap for non-generic-entry arches
> right now, and I also haven't started designing the new seccomp ABI for
> that. As a matter of fact, members of the LoongArch community (myself
> included) are still fresh to this area of expertise, so a bit more of
> your feedback will be appreciated.
>
> Thanks to Miao Wang from AOSC for doing much of the investigation.
>
> [^1]: https://chromium-review.googlesource.com/c/chromium/src/+/2823150
> [^2]:
> https://chromium.googlesource.com/chromium/src/sandbox/+/c085b51940bd/linux/seccomp-bpf-helpers/sigsys_handlers.cc#355
> [^3]:
> https://lore.kernel.org/linux-arch/[email protected]/
> [^4]: https://wiki.debian.org/Ports/loong64
> [^5]: https://lwn.net/ml/linux-kernel/201905301122.88FD40B3@keescook/
> [^6]: https://lwn.net/Articles/799557/
> [^7]:
> https://lpc.events/event/4/contributions/560/attachments/397/640/deep-arg-inspection.pdf
>
> --
> WANG "xen0n" Xuerui
>
> Linux/LoongArch mailing list:https://lore.kernel.org/loongarch/
>

2024-02-25 06:51:59

by Icenowy Zheng

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

在 2024-02-24星期六的 19:51 +0800,Huacai Chen写道:
> Hi, Xuerui,
>
> On Wed, Feb 21, 2024 at 2:10 PM WANG Xuerui <[email protected]>
> wrote:
> >
> > Hi,
> >
> > Recently, we -- community LoongArch porters -- have noticed a
> > problem
> > where the Chromium sandbox apparently wants to deny statx [^1] so
> > it
> > could properly inspect arguments after the sandboxed process later
> > falls
> > back to fstat. The reasoning behind the change was not clear in the
> > patch; but we found out it's basically because there's currently
> > not a
> > "fd-only" version of statx, so that the sandbox has no way to
> > ensure the
> > path argument is empty without being able to peek into the
> > sandboxed
> > process's memory. For architectures able to do newfstatat though,
> > the
> > glibc falls back to newfstatat after getting -ENOSYS for statx,
> > then the
> > respective SIGSYS handler [^2] takes care of inspecting the path
> > argument, transforming allowed newfstatat's into fstat instead
> > which is
> > allowed and has the same type of return value.
> >
> > But, as loongarch is the first architecture to not have fstat nor
> > newfstatat, the LoongArch glibc does not attempt falling back at
> > all
> > when it gets -ENOSYS for statx -- and you see the problem there!
> >
> > Actually, back when the loongarch port was under review, people
> > were
> > aware of the same problem with sandboxing clone3 [^3], so clone was
> > eventually kept. Unfortunately it seemed at that time no one had
> > noticed
> > statx, so besides restoring fstat/newfstatat to loongarch uapi (and
> > postponing the problem further), it seems inevitable that we would
> > need
> > to tackle seccomp deep argument inspection; this is obviously a
> > decision
> > that shouldn't be taken lightly, so I'm posting this to restart the
> > conversation to figure out a way forward together. We basically
> > could do
> > one of below:
> >
> > - just restore fstat and be done with it;
> > - add a flag to statx so we can do the equivalent of just fstat(fd,
> > &out) with statx, and ensuring an error happens if path is not
> > empty in
> > that case;
> > - tackle the long-standing problem of seccomp deep argument
> > inspection (!).
> From my point of view, I prefer to "restore fstat", because we need
> to
> use the Chrome sandbox everyday (even though it hasn't been upstream
> by now). But I also hope "seccomp deep argument inspection" can be
> solved in the future.

My idea is this problem needs syscalls to be designed with deep
argument inspection in mind; syscalls before this should be considered
as historical error and get fixed by resotring old syscalls.

>
>
> Huacai
>
> >
> > Obviously, the simplest solution would be to "just restore fstat".
> > But
> > again, in my opinion this is not quite a solution but a workaround
> > -- we
> > have good reasons to keep just statx (mainly because its feature
> > set is
> > a strict superset of those of fstat/newfstatat), and we're not
> > quite in
> > a hurry to resolve this. Because one of the prerequisites for a new
> > Chromium port is "inclusion in Debian stable" -- as the loong64
> > port
> > [^4] is progressing and the next Debian release would not happen
> > until
> > 2025, we still have time for a more "elegant" solution.
> >
> > Alternatively, we could also introduce a new flag for statx, maybe
> > named
> > AT_STATX_NO_PATH or something like that, so that statx would ignore
> > the
> > path altogether or error on non-empty paths if called with flags
> > containing AT_EMPTY_PATH | AT_STATX_NO_PATH. This way a seccomp
> > policy
> > could allow statx calls only with such flags (that are passed via
> > register and accessible) and maintain the same level of safety as
> > with
> > fstat. But there is also disadvantage to this approach: the flag
> > would
> > be useful only for sandboxes, because in other cases there's no
> > need to
> > avoid reading from &path. This is also more of a workaround to
> > avoid the
> > deep argument inspection problem.
> >
> > Lastly, should we decide to go the hardest way, according to a
> > previous
> > mail [^5] (about clone3) and the LPC 2019 discussion [^6] [^7], we
> > probably would try the metadata-annotation-based and piece-by-piece
> > approach, as it's expected to provide the most benefit and involve
> > less
> > code changes. The implementation, as I surmise, will involve
> > modifying
> > the generic syscall entrypoint for early copying of user data, and
> > corresponding changes to seccomp plumbing so this information is
> > properly exposed. I don't have a roadmap for non-generic-entry
> > arches
> > right now, and I also haven't started designing the new seccomp ABI
> > for
> > that. As a matter of fact, members of the LoongArch community
> > (myself
> > included) are still fresh to this area of expertise, so a bit more
> > of
> > your feedback will be appreciated.
> >
> > Thanks to Miao Wang from AOSC for doing much of the investigation.
> >
> > [^1]:
> > https://chromium-review.googlesource.com/c/chromium/src/+/2823150
> > [^2]:
> > https://chromium.googlesource.com/chromium/src/sandbox/+/c085b51940bd/linux/seccomp-bpf-helpers/sigsys_handlers.cc#355
> > [^3]:
> > https://lore.kernel.org/linux-arch/[email protected]/
> > [^4]: https://wiki.debian.org/Ports/loong64
> > [^5]:
> > https://lwn.net/ml/linux-kernel/201905301122.88FD40B3@keescook/
> > [^6]: https://lwn.net/Articles/799557/
> > [^7]:
> > https://lpc.events/event/4/contributions/560/attachments/397/640/deep-arg-inspection.pdf
> >
> > --
> > WANG "xen0n" Xuerui
> >
> > Linux/LoongArch mailing list:https://lore.kernel.org/loongarch/
> >


2024-02-25 07:32:55

by Xi Ruoyao

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote:
> > From my point of view, I prefer to "restore fstat", because we need
> > to
> > use the Chrome sandbox everyday (even though it hasn't been upstream
> > by now). But I also hope "seccomp deep argument inspection" can be
> > solved in the future.
>
> My idea is this problem needs syscalls to be designed with deep
> argument inspection in mind; syscalls before this should be considered
> as historical error and get fixed by resotring old syscalls.

I'd not consider fstat an error as using statx for fstat has a
performance impact (severe for some workflows), and Linus has concluded
"if the user wants fstat, give them fstat" for the performance issue:

https://sourceware.org/pipermail/libc-alpha/2023-September/151365.html

However we only want fstat (actually "newfstat" in fs/stat.c), and it
seems we don't want to resurrect newstat, newlstat, newfstatat, etc. (or
am I missing any benefit - performance or "just pleasing seccomp" - of
them comparing to statx?) so we don't want to just define
__ARCH_WANT_NEW_STAT. So it seems we need to add some new #if to
fs/stat.c and include/uapi/asm-generic/unistd.h.

And no, it's not a design issue of all other syscalls. It's just the
design issue of seccomp. There's no way to design a syscall allowing
seccomp to inspect a 100-character path in its argument unless
refactoring seccomp entirely because we cannot fit a 100-character path
into 8 registers.

As at now people do use PTRACE_PEEKDATA for "deep inspection" (actually
"debugging" the target process) but it obviously makes a very severe
performance impact.

<rant>

Today the entire software industry is saying "do things in a declarative
way" but seccomp is completely the opposite. It's auditing *how* the
sandboxed application is doing things instead of *what* will be done.

I've raised my against to seccomp and/or syscall allowlisting several
times after seeing so many breakages like:

- https://github.com/NetworkConfiguration/dhcpcd/issues/120
- https://gitlab.gnome.org/GNOME/tracker-miners/-/issues/252
- https://blog.pintia.cn/2018/06/27/glibc-segmentation-fault/
- http://web.archive.org/web/20210126121421/http://acm.xidian.edu.cn/discuss/thread.php?tid=148&cid=# (comment 3)

but people just keep telling me "you are wrong, you don't understand
security". Some of them even complain "seccomp is broken" as well but
still keep using it.

</rant>

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2024-02-26 06:04:49

by Icenowy Zheng

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道:
> On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote:
> > > From my point of view, I prefer to "restore fstat", because we
> > > need
> > > to
> > > use the Chrome sandbox everyday (even though it hasn't been
> > > upstream
> > > by now). But I also hope "seccomp deep argument inspection" can
> > > be
> > > solved in the future.
> >
> > My idea is this problem needs syscalls to be designed with deep
> > argument inspection in mind; syscalls before this should be
> > considered
> > as historical error and get fixed by resotring old syscalls.
>
> I'd not consider fstat an error as using statx for fstat has a
> performance impact (severe for some workflows), and Linus has
> concluded

Sorry for clearance, I mean statx is an error in ABI design, not fstat.

> "if the user wants fstat, give them fstat" for the performance issue:
>
> https://sourceware.org/pipermail/libc-alpha/2023-September/151365.html
>
> However we only want fstat (actually "newfstat" in fs/stat.c), and it
> seems we don't want to resurrect newstat, newlstat, newfstatat, etc.
> (or
> am I missing any benefit - performance or "just pleasing seccomp" -
> of
> them comparing to statx?) so we don't want to just define
> __ARCH_WANT_NEW_STAT.  So it seems we need to add some new #if to
> fs/stat.c and include/uapi/asm-generic/unistd.h.
>
> And no, it's not a design issue of all other syscalls.  It's just the
> design issue of seccomp.  There's no way to design a syscall allowing
> seccomp to inspect a 100-character path in its argument unless
> refactoring seccomp entirely because we cannot fit a 100-character
> path
> into 8 registers.

Well my meaning is that syscalls should be designed to be simple to
prevent this kind of circumstance.

>
> As at now people do use PTRACE_PEEKDATA for "deep inspection"
> (actually
> "debugging" the target process) but it obviously makes a very severe
> performance impact.
>
> <rant>
>
> Today the entire software industry is saying "do things in a
> declarative
> way" but seccomp is completely the opposite.  It's auditing *how* the
> sandboxed application is doing things instead of *what* will be done.
>
> I've raised my against to seccomp and/or syscall allowlisting several
> times after seeing so many breakages like:
>
> - https://github.com/NetworkConfiguration/dhcpcd/issues/120
> - https://gitlab.gnome.org/GNOME/tracker-miners/-/issues/252
> - https://blog.pintia.cn/2018/06/27/glibc-segmentation-fault/
> -
> http://web.archive.org/web/20210126121421/http://acm.xidian.edu.cn/discuss/thread.php?tid=148&cid=#
>  (comment 3)
>
> but people just keep telling me "you are wrong, you don't understand
> security".  Some of them even complain "seccomp is broken" as well
> but
> still keep using it.
>
> </rant>
>


2024-02-26 06:58:24

by Arnd Bergmann

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote:
> 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道:
>> On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote:
>> > My idea is this problem needs syscalls to be designed with deep
>> > argument inspection in mind; syscalls before this should be
>> > considered
>> > as historical error and get fixed by resotring old syscalls.
>>
>> I'd not consider fstat an error as using statx for fstat has a
>> performance impact (severe for some workflows), and Linus has
>> concluded
>
> Sorry for clearance, I mean statx is an error in ABI design, not fstat.

The same has been said about seccomp(). ;-)

It's clear that the two don't go well together at the moment.

>> "if the user wants fstat, give them fstat" for the performance issue:
>>
>> https://sourceware.org/pipermail/libc-alpha/2023-September/151365.html
>>
>> However we only want fstat (actually "newfstat" in fs/stat.c), and it
>> seems we don't want to resurrect newstat, newlstat, newfstatat, etc.
>> (or
>> am I missing any benefit - performance or "just pleasing seccomp" -
>> of them comparing to statx?) so we don't want to just define
>> __ARCH_WANT_NEW_STAT.  So it seems we need to add some new #if to
>> fs/stat.c and include/uapi/asm-generic/unistd.h.
>>
>> And no, it's not a design issue of all other syscalls.  It's just the
>> design issue of seccomp.  There's no way to design a syscall allowing
>> seccomp to inspect a 100-character path in its argument unless
>> refactoring seccomp entirely because we cannot fit a 100-character
>> path
>> into 8 registers.
>
> Well my meaning is that syscalls should be designed to be simple to
> prevent this kind of circumstance.

The problem I see with the 'use use fstat' approach is that this
does not work on 32-bit architectures, unless we define a new
fstatat64_time64() syscall, which is one of the things that statx()
was trying to avoid.

Whichever solution we end up with should work on both
loongarch64 and on armv7 at least.

Arnd

2024-02-26 07:09:52

by Xi Ruoyao

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote:
> On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote:
> > 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道:
> > > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote:
> > > > My idea is this problem needs syscalls to be designed with deep
> > > > argument inspection in mind; syscalls before this should be
> > > > considered
> > > > as historical error and get fixed by resotring old syscalls.
> > >
> > > I'd not consider fstat an error as using statx for fstat has a
> > > performance impact (severe for some workflows), and Linus has
> > > concluded
> >
> > Sorry for clearance, I mean statx is an error in ABI design, not fstat.

I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of
"AT_NULL_PATH"/nullptr in the first place?

> The same has been said about seccomp(). ;-)
>
> It's clear that the two don't go well together at the moment.
>
> > > "if the user wants fstat, give them fstat" for the performance issue:
> > >
> > > https://sourceware.org/pipermail/libc-alpha/2023-September/151365.html
> > >
> > > However we only want fstat (actually "newfstat" in fs/stat.c), and it
> > > seems we don't want to resurrect newstat, newlstat, newfstatat, etc.
> > > (or
> > > am I missing any benefit - performance or "just pleasing seccomp" -
> > > of them comparing to statx?) so we don't want to just define
> > > __ARCH_WANT_NEW_STAT.  So it seems we need to add some new #if to
> > > fs/stat.c and include/uapi/asm-generic/unistd.h.
> > >
> > > And no, it's not a design issue of all other syscalls.  It's just the
> > > design issue of seccomp.  There's no way to design a syscall allowing
> > > seccomp to inspect a 100-character path in its argument unless
> > > refactoring seccomp entirely because we cannot fit a 100-character
> > > path
> > > into 8 registers.
> >
> > Well my meaning is that syscalls should be designed to be simple to
> > prevent this kind of circumstance.

But it's not irrational to pass a path to syscall, as long as we still
have the concept of file system (maybe in 2371 or some year we'll use a
128-bit UUID instead of path).

> The problem I see with the 'use use fstat' approach is that this
> does not work on 32-bit architectures, unless we define a new
> fstatat64_time64() syscall, which is one of the things that statx()

"fstat64_time64". Using statx for fstatat should be just fine.

Or maybe we can just introduce a new AT_something to make statx
completely ignore pathname but behave like AT_EMPTY_PATH + "".

> was trying to avoid.

Oops. I thought "newstat" should be using 64-bit time but it seems the
"new" is not what I'd expected... The "new" actually means "newer than
Linux 0.9"! :(

Let's not use "new" in future syscall names...

> Whichever solution we end up with should work on both
> loongarch64 and on armv7 at least.
>
>     Arnd

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2024-02-26 08:27:26

by Christian Brauner

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Mon, Feb 26, 2024 at 02:03:48PM +0800, Icenowy Zheng wrote:
> 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道:
> > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote:
> > > > From my point of view, I prefer to "restore fstat", because we
> > > > need
> > > > to
> > > > use the Chrome sandbox everyday (even though it hasn't been
> > > > upstream
> > > > by now). But I also hope "seccomp deep argument inspection" can
> > > > be
> > > > solved in the future.
> > >
> > > My idea is this problem needs syscalls to be designed with deep
> > > argument inspection in mind; syscalls before this should be
> > > considered
> > > as historical error and get fixed by resotring old syscalls.
> >
> > I'd not consider fstat an error as using statx for fstat has a
> > performance impact (severe for some workflows), and Linus has
> > concluded
>
> Sorry for clearance, I mean statx is an error in ABI design, not fstat.

We will not be limited arbitrarly in system call design by seccomp being
unable to do deep argument inspection. That ship has sailed many years
ago. And it's a bit laughable to disalow pointer arguments and structs
in system calls because seccomp isn't able to inspect them.

2024-02-26 09:53:42

by Arnd Bergmann

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote:
> On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote:
>> On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote:
>> > 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道:
>> > > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote:
>> > > > My idea is this problem needs syscalls to be designed with deep
>> > > > argument inspection in mind; syscalls before this should be
>> > > > considered
>> > > > as historical error and get fixed by resotring old syscalls.
>> > >
>> > > I'd not consider fstat an error as using statx for fstat has a
>> > > performance impact (severe for some workflows), and Linus has
>> > > concluded
>> >
>> > Sorry for clearance, I mean statx is an error in ABI design, not fstat.
>
> I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of
> "AT_NULL_PATH"/nullptr in the first place?

Not sure, but it's hard to change now since the libc
implementation won't easily know whether using the NULL
path is safe on a given kernel. It could check the kernel
version number, but that adds another bit of complexity in
the fast path and doesn't work on old kernels with the
feature backported.

> But it's not irrational to pass a path to syscall, as long as we still
> have the concept of file system (maybe in 2371 or some year we'll use a
> 128-bit UUID instead of path).
>
>> The problem I see with the 'use use fstat' approach is that this
>> does not work on 32-bit architectures, unless we define a new
>> fstatat64_time64() syscall, which is one of the things that statx()
>
> "fstat64_time64". Using statx for fstatat should be just fine.

Right. It does feel wrong to have only an fstat() variant but not
fstatat() if we go there.

> Or maybe we can just introduce a new AT_something to make statx
> completely ignore pathname but behave like AT_EMPTY_PATH + "".

I think this is better than going back to fstat64_time64(), but
it's still not great because

- all the reserved flags on statx() are by definition incompatible
with existing kernels that return -EINVAL for any flag they do
not recognize.

- you still need to convince libc developers to actually use
the flag despite the backwards compatibility problem, either
with a fallback to the current behavior or a version check.

Using the NULL path as a fallback would solve the problem with
seccomp, but it would not make the normal case any faster.

>> was trying to avoid.
>
> Oops. I thought "newstat" should be using 64-bit time but it seems the
> "new" is not what I'd expected... The "new" actually means "newer than
> Linux 0.9"! :(
>
> Let's not use "new" in future syscall names...

Right, we definitely can't ever succeed. On some architectures
we even had "oldstat" and "stat" before "newstat" and "stat64",
and on some architectures we mix them up. E.g. x86_64 has fstat()
and fstatat64() with the same structure but doesn't define
__NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit
timestamps unlike all other 64-bit architectures.

statx() was intended to solve these problems once and for all,
and it appears that we have failed again.

Arnd

2024-02-26 11:58:25

by Xi Ruoyao

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Mon, 2024-02-26 at 10:20 +0100, Arnd Bergmann wrote:

/* snip */

>
> > Or maybe we can just introduce a new AT_something to make statx
> > completely ignore pathname but behave like AT_EMPTY_PATH + "".
>
> I think this is better than going back to fstat64_time64(), but
> it's still not great because
>
> - all the reserved flags on statx() are by definition incompatible
>   with existing kernels that return -EINVAL for any flag they do
>   not recognize.

Oops, we are deeming passing undefined flags in "mask" undefined
behavior but not "flags", thus "wild software" may be relying on EINVAL
for invalid flags... We *might* make this new AT_xxx a bit in mask
instead of flags but it would be very dirty IMO.

> - you still need to convince libc developers to actually use
>   the flag despite the backwards compatibility problem, either
>   with a fallback to the current behavior or a version check.

Let me ping some libc developers then...

> Using the NULL path as a fallback would solve the problem with
> seccomp, but it would not make the normal case any faster.

But "wild software" may be relying on a EFAULT for NULL path too...

/* snip */

> >
> > Oops.  I thought "newstat" should be using 64-bit time but it seems the
> > "new" is not what I'd expected...  The "new" actually means "newer than
> > Linux 0.9"! :(
> >
> > Let's not use "new" in future syscall names...
>
> Right, we definitely can't ever succeed. On some architectures
> we even had "oldstat" and "stat" before "newstat" and "stat64",
> and on some architectures we mix them up. E.g. x86_64 has fstat()
> and fstatat64() with the same structure but doesn't define
> __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit
> timestamps unlike all other 64-bit architectures.
>
> statx() was intended to solve these problems once and for all,
> and it appears that we have failed again.

https://xkcd.com/927/ :(

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2024-02-26 12:58:59

by Christian Brauner

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Mon, Feb 26, 2024 at 07:57:56PM +0800, Xi Ruoyao wrote:
> On Mon, 2024-02-26 at 10:20 +0100, Arnd Bergmann wrote:
>
> /* snip */
>
> >
> > > Or maybe we can just introduce a new AT_something to make statx
> > > completely ignore pathname but behave like AT_EMPTY_PATH + "".

I'm not at all convinced about doing custom semantics for this.

> > I think this is better than going back to fstat64_time64(), but
> > it's still not great because
> >
> > - all the reserved flags on statx() are by definition incompatible
> >   with existing kernels that return -EINVAL for any flag they do
> >   not recognize.
>
> Oops, we are deeming passing undefined flags in "mask" undefined
> behavior but not "flags", thus "wild software" may be relying on EINVAL
> for invalid flags... We *might* make this new AT_xxx a bit in mask
> instead of flags but it would be very dirty IMO.

Uhm, no. AT_* flags have nothing to do in statx()'s mask argument at all.

2024-02-26 13:32:24

by Christian Brauner

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote:
> On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote:
> > On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote:
> >> On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote:
> >> > 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道:
> >> > > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote:
> >> > > > My idea is this problem needs syscalls to be designed with deep
> >> > > > argument inspection in mind; syscalls before this should be
> >> > > > considered
> >> > > > as historical error and get fixed by resotring old syscalls.
> >> > >
> >> > > I'd not consider fstat an error as using statx for fstat has a
> >> > > performance impact (severe for some workflows), and Linus has
> >> > > concluded
> >> >
> >> > Sorry for clearance, I mean statx is an error in ABI design, not fstat.
> >
> > I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of
> > "AT_NULL_PATH"/nullptr in the first place?
>
> Not sure, but it's hard to change now since the libc
> implementation won't easily know whether using the NULL
> path is safe on a given kernel. It could check the kernel
> version number, but that adds another bit of complexity in
> the fast path and doesn't work on old kernels with the
> feature backported.
>
> > But it's not irrational to pass a path to syscall, as long as we still
> > have the concept of file system (maybe in 2371 or some year we'll use a
> > 128-bit UUID instead of path).
> >
> >> The problem I see with the 'use use fstat' approach is that this
> >> does not work on 32-bit architectures, unless we define a new
> >> fstatat64_time64() syscall, which is one of the things that statx()
> >
> > "fstat64_time64". Using statx for fstatat should be just fine.
>
> Right. It does feel wrong to have only an fstat() variant but not
> fstatat() if we go there.
>
> > Or maybe we can just introduce a new AT_something to make statx
> > completely ignore pathname but behave like AT_EMPTY_PATH + "".
>
> I think this is better than going back to fstat64_time64(), but
> it's still not great because
>
> - all the reserved flags on statx() are by definition incompatible
> with existing kernels that return -EINVAL for any flag they do
> not recognize.
>
> - you still need to convince libc developers to actually use
> the flag despite the backwards compatibility problem, either
> with a fallback to the current behavior or a version check.
>
> Using the NULL path as a fallback would solve the problem with
> seccomp, but it would not make the normal case any faster.
>
> >> was trying to avoid.
> >
> > Oops. I thought "newstat" should be using 64-bit time but it seems the
> > "new" is not what I'd expected... The "new" actually means "newer than
> > Linux 0.9"! :(
> >
> > Let's not use "new" in future syscall names...
>
> Right, we definitely can't ever succeed. On some architectures
> we even had "oldstat" and "stat" before "newstat" and "stat64",
> and on some architectures we mix them up. E.g. x86_64 has fstat()
> and fstatat64() with the same structure but doesn't define
> __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit
> timestamps unlike all other 64-bit architectures.
>
> statx() was intended to solve these problems once and for all,
> and it appears that we have failed again.

New apis don't invalidate old apis necessarily. That's just not going to
work in an age where you have containerized workloads.

statx() is just the beginning of this. A container may have aritrary
seccomp profiles that return ENOSYS or even EPERM for whatever reason
for any new api that exists. So not implementing fstat() might already
break container workloads.

Another example: You can't just skip on implementing mount() and only
implement the new mount api for example. Because tools that look for api
simplicity and don't need complex setup will _always_ continue to use
mount() and have a right to do so.

And fwiw, mount() isn't fully inspectable by seccomp since forever. The
list goes on and on.

But let's look at the original mail. Why are they denying statx() and
what's that claim about it not being able to be rewritten to something
safe? Looking at:

intptr_t SIGSYSFstatatHandler(const struct arch_seccomp_data& args,
void* fs_denied_errno) {
if (args.nr == __NR_fstatat_default) {
if (*reinterpret_cast<const char*>(args.args[1]) == '\0' &&
args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) {
return syscall(__NR_fstat_default, static_cast<int>(args.args[0]),
reinterpret_cast<default_stat_struct*>(args.args[2]));
}
return -reinterpret_cast<intptr_t>(fs_denied_errno);
}

What this does it to rewrite fstatat() to fstat() if it was made with
AT_EMPTY_PATH and the path argument was "". That is easily doable for
statx() because it has the exact same AT_EMPTY_PATH semantics that
fstatat() has.

Plus, they can even filter on mask and rewrite that to something that
they think is safe. For example, STATX_BASIC_STATS which is equivalent
to what any fstat() call returns. So it's pretty difficult to understand
what their actual gripe with statx() is.

It can't be that statx() passes a struct because fstatat() and fstat()
do that too. So what exactly is that problem there?

What this tells me without knowing the exact reason is that they thought
"Oh, if we just return ENOSYS then the workload or glibc will just
always be able to fallback to fstat() or fstatat()". Which ultimately is
the exact same thing that containers often assume.

So really, just skipping on various system calls isn't going to work.
You can't just implement new system calls and forget about the rest
unless you know exactly what workloads your architecure will run on.

Please implement fstat() or fstatat() and stop inventing hacks for
statx() to make weird sandboxing rules work, please.

2024-02-26 13:47:47

by Arnd Bergmann

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Mon, Feb 26, 2024, at 14:32, Christian Brauner wrote:
> On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote:
>> On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote:
>
> What this tells me without knowing the exact reason is that they thought
> "Oh, if we just return ENOSYS then the workload or glibc will just
> always be able to fallback to fstat() or fstatat()". Which ultimately is
> the exact same thing that containers often assume.
>
> So really, just skipping on various system calls isn't going to work.
> You can't just implement new system calls and forget about the rest
> unless you know exactly what workloads your architecure will run on.
>
> Please implement fstat() or fstatat() and stop inventing hacks for
> statx() to make weird sandboxing rules work, please.

Do you mean we should add fstat64_time64() for all architectures
then? Would use use the same structure layout as statx for this,
the 64-bit version of the 'struct stat' layout from
include/uapi/asm-generic/stat.h, or something new that solves
the same problems?

I definitely don't want to see a new time32 API added to
mips64 and the 32-bit architectures, so the existing stat64
interface won't work as a statx replacement.

Arnd

2024-02-26 13:56:04

by Christian Brauner

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Mon, Feb 26, 2024 at 02:32:09PM +0100, Christian Brauner wrote:
> On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote:
> > On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote:
> > > On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote:
> > >> On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote:
> > >> > 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道:
> > >> > > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote:
> > >> > > > My idea is this problem needs syscalls to be designed with deep
> > >> > > > argument inspection in mind; syscalls before this should be
> > >> > > > considered
> > >> > > > as historical error and get fixed by resotring old syscalls.
> > >> > >
> > >> > > I'd not consider fstat an error as using statx for fstat has a
> > >> > > performance impact (severe for some workflows), and Linus has
> > >> > > concluded
> > >> >
> > >> > Sorry for clearance, I mean statx is an error in ABI design, not fstat.
> > >
> > > I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of
> > > "AT_NULL_PATH"/nullptr in the first place?
> >
> > Not sure, but it's hard to change now since the libc
> > implementation won't easily know whether using the NULL
> > path is safe on a given kernel. It could check the kernel
> > version number, but that adds another bit of complexity in
> > the fast path and doesn't work on old kernels with the
> > feature backported.
> >
> > > But it's not irrational to pass a path to syscall, as long as we still
> > > have the concept of file system (maybe in 2371 or some year we'll use a
> > > 128-bit UUID instead of path).
> > >
> > >> The problem I see with the 'use use fstat' approach is that this
> > >> does not work on 32-bit architectures, unless we define a new
> > >> fstatat64_time64() syscall, which is one of the things that statx()
> > >
> > > "fstat64_time64". Using statx for fstatat should be just fine.
> >
> > Right. It does feel wrong to have only an fstat() variant but not
> > fstatat() if we go there.
> >
> > > Or maybe we can just introduce a new AT_something to make statx
> > > completely ignore pathname but behave like AT_EMPTY_PATH + "".
> >
> > I think this is better than going back to fstat64_time64(), but
> > it's still not great because
> >
> > - all the reserved flags on statx() are by definition incompatible
> > with existing kernels that return -EINVAL for any flag they do
> > not recognize.
> >
> > - you still need to convince libc developers to actually use
> > the flag despite the backwards compatibility problem, either
> > with a fallback to the current behavior or a version check.
> >
> > Using the NULL path as a fallback would solve the problem with
> > seccomp, but it would not make the normal case any faster.
> >
> > >> was trying to avoid.
> > >
> > > Oops. I thought "newstat" should be using 64-bit time but it seems the
> > > "new" is not what I'd expected... The "new" actually means "newer than
> > > Linux 0.9"! :(
> > >
> > > Let's not use "new" in future syscall names...
> >
> > Right, we definitely can't ever succeed. On some architectures
> > we even had "oldstat" and "stat" before "newstat" and "stat64",
> > and on some architectures we mix them up. E.g. x86_64 has fstat()
> > and fstatat64() with the same structure but doesn't define
> > __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit
> > timestamps unlike all other 64-bit architectures.
> >
> > statx() was intended to solve these problems once and for all,
> > and it appears that we have failed again.
>
> New apis don't invalidate old apis necessarily. That's just not going to
> work in an age where you have containerized workloads.
>
> statx() is just the beginning of this. A container may have aritrary
> seccomp profiles that return ENOSYS or even EPERM for whatever reason
> for any new api that exists. So not implementing fstat() might already
> break container workloads.
>
> Another example: You can't just skip on implementing mount() and only
> implement the new mount api for example. Because tools that look for api
> simplicity and don't need complex setup will _always_ continue to use
> mount() and have a right to do so.
>
> And fwiw, mount() isn't fully inspectable by seccomp since forever. The
> list goes on and on.
>
> But let's look at the original mail. Why are they denying statx() and
> what's that claim about it not being able to be rewritten to something
> safe? Looking at:
>
> intptr_t SIGSYSFstatatHandler(const struct arch_seccomp_data& args,
> void* fs_denied_errno) {
> if (args.nr == __NR_fstatat_default) {
> if (*reinterpret_cast<const char*>(args.args[1]) == '\0' &&
> args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) {
> return syscall(__NR_fstat_default, static_cast<int>(args.args[0]),
> reinterpret_cast<default_stat_struct*>(args.args[2]));
> }
> return -reinterpret_cast<intptr_t>(fs_denied_errno);
> }
>
> What this does it to rewrite fstatat() to fstat() if it was made with
> AT_EMPTY_PATH and the path argument was "". That is easily doable for
> statx() because it has the exact same AT_EMPTY_PATH semantics that
> fstatat() has.
>
> Plus, they can even filter on mask and rewrite that to something that
> they think is safe. For example, STATX_BASIC_STATS which is equivalent
> to what any fstat() call returns. So it's pretty difficult to understand
> what their actual gripe with statx() is.
>
> It can't be that statx() passes a struct because fstatat() and fstat()
> do that too. So what exactly is that problem there?
>
> What this tells me without knowing the exact reason is that they thought
> "Oh, if we just return ENOSYS then the workload or glibc will just
> always be able to fallback to fstat() or fstatat()". Which ultimately is
> the exact same thing that containers often assume.
>
> So really, just skipping on various system calls isn't going to work.
> You can't just implement new system calls and forget about the rest
> unless you know exactly what workloads your architecure will run on.
>
> Please implement fstat() or fstatat() and stop inventing hacks for
> statx() to make weird sandboxing rules work, please.

And fwiw, if they rewrite fstatat() to fstat() then they must be
worrying about another thread racing and putting something in the second
argument (the path argument) after they have inspected the system call
arguments but before the system call continue.

But if that is their worry then a new flag to statx() won't help at all.
Because then they really want to use fstat(). IOW, one would have to add
fstatx() which brings us back to square one. I'm mostly going by that
code snippet of course.

2024-02-26 14:13:42

by WANG Xuerui

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On 2/26/24 21:32, Christian Brauner wrote:
> On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote:
>> On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote:
>>> On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote:
>>>> On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote:
>>>>> 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道:
>>>>>> On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote:
>>>>>>> My idea is this problem needs syscalls to be designed with deep
>>>>>>> argument inspection in mind; syscalls before this should be
>>>>>>> considered
>>>>>>> as historical error and get fixed by resotring old syscalls.
>>>>>>
>>>>>> I'd not consider fstat an error as using statx for fstat has a
>>>>>> performance impact (severe for some workflows), and Linus has
>>>>>> concluded
>>>>>
>>>>> Sorry for clearance, I mean statx is an error in ABI design, not fstat.
>>>
>>> I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of
>>> "AT_NULL_PATH"/nullptr in the first place?
>>
>> Not sure, but it's hard to change now since the libc
>> implementation won't easily know whether using the NULL
>> path is safe on a given kernel. It could check the kernel
>> version number, but that adds another bit of complexity in
>> the fast path and doesn't work on old kernels with the
>> feature backported.
>>
>>> But it's not irrational to pass a path to syscall, as long as we still
>>> have the concept of file system (maybe in 2371 or some year we'll use a
>>> 128-bit UUID instead of path).
>>>
>>>> The problem I see with the 'use use fstat' approach is that this
>>>> does not work on 32-bit architectures, unless we define a new
>>>> fstatat64_time64() syscall, which is one of the things that statx()
>>>
>>> "fstat64_time64". Using statx for fstatat should be just fine.
>>
>> Right. It does feel wrong to have only an fstat() variant but not
>> fstatat() if we go there.
>>
>>> Or maybe we can just introduce a new AT_something to make statx
>>> completely ignore pathname but behave like AT_EMPTY_PATH + "".
>>
>> I think this is better than going back to fstat64_time64(), but
>> it's still not great because
>>
>> - all the reserved flags on statx() are by definition incompatible
>> with existing kernels that return -EINVAL for any flag they do
>> not recognize.
>>
>> - you still need to convince libc developers to actually use
>> the flag despite the backwards compatibility problem, either
>> with a fallback to the current behavior or a version check.
>>
>> Using the NULL path as a fallback would solve the problem with
>> seccomp, but it would not make the normal case any faster.
>>
>>>> was trying to avoid.
>>>
>>> Oops. I thought "newstat" should be using 64-bit time but it seems the
>>> "new" is not what I'd expected... The "new" actually means "newer than
>>> Linux 0.9"! :(
>>>
>>> Let's not use "new" in future syscall names...
>>
>> Right, we definitely can't ever succeed. On some architectures
>> we even had "oldstat" and "stat" before "newstat" and "stat64",
>> and on some architectures we mix them up. E.g. x86_64 has fstat()
>> and fstatat64() with the same structure but doesn't define
>> __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit
>> timestamps unlike all other 64-bit architectures.
>>
>> statx() was intended to solve these problems once and for all,
>> and it appears that we have failed again.
>
> New apis don't invalidate old apis necessarily. That's just not going to
> work in an age where you have containerized workloads.
>
> statx() is just the beginning of this. A container may have aritrary
> seccomp profiles that return ENOSYS or even EPERM for whatever reason
> for any new api that exists. So not implementing fstat() might already
> break container workloads.
>
> Another example: You can't just skip on implementing mount() and only
> implement the new mount api for example. Because tools that look for api
> simplicity and don't need complex setup will _always_ continue to use
> mount() and have a right to do so.
>
> And fwiw, mount() isn't fully inspectable by seccomp since forever. The
> list goes on and on.
>
> But let's look at the original mail. Why are they denying statx() and
> what's that claim about it not being able to be rewritten to something
> safe? Looking at:
>
> intptr_t SIGSYSFstatatHandler(const struct arch_seccomp_data& args,
> void* fs_denied_errno) {
> if (args.nr == __NR_fstatat_default) {
> if (*reinterpret_cast<const char*>(args.args[1]) == '\0' &&
> args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) {
> return syscall(__NR_fstat_default, static_cast<int>(args.args[0]),
> reinterpret_cast<default_stat_struct*>(args.args[2]));
> }
> return -reinterpret_cast<intptr_t>(fs_denied_errno);
> }
>
> What this does it to rewrite fstatat() to fstat() if it was made with
> AT_EMPTY_PATH and the path argument was "". That is easily doable for
> statx() because it has the exact same AT_EMPTY_PATH semantics that
> fstatat() has.
>
> Plus, they can even filter on mask and rewrite that to something that
> they think is safe. For example, STATX_BASIC_STATS which is equivalent
> to what any fstat() call returns. So it's pretty difficult to understand
> what their actual gripe with statx() is.
>
> It can't be that statx() passes a struct because fstatat() and fstat()
> do that too. So what exactly is that problem there?

From our investigation:

For (new)fstatat calls that the sandboxed process may make, this SIGSYS
handler either:

* turns allowed calls (those looking at fd's) into fstat's that only
have one argument (the fd) each, or
* denies the call,

so the sandbox only ever sees fstat calls and no (new)fstatat's, and the
guarantee that only open fds can ever been stat'ed trivially holds.

With statx, however, there's no way of guaranteeing "only look at fd"
semantics without peeking into the path argument, because a non-empty
path makes AT_EMPTY_PATH ineffective, and the flags are not validated
prior to use making it near-impossible to introduce new semantics in a
backwards-compatible manner.

> What this tells me without knowing the exact reason is that they thought
> "Oh, if we just return ENOSYS then the workload or glibc will just
> always be able to fallback to fstat() or fstatat()". Which ultimately is
> the exact same thing that containers often assume.
>
> So really, just skipping on various system calls isn't going to work.
> You can't just implement new system calls and forget about the rest
> unless you know exactly what workloads your architecure will run on.
>
> Please implement fstat() or fstatat() and stop inventing hacks for
> statx() to make weird sandboxing rules work, please.

We have already provided fstat(at) on LoongArch for a while by
unconditionally doing statx and translating the returned structure --
see the [glibc] and [golang] [golang-2] implementations for example --
without fallbacks because we know that the old syscalls don't exist. So
if we effectively back out the kernel-side change (removing fstat) we
must arrange for the various syscall makers to be able to fallback to
fstat too.

(The glibc code seems self-adapting to presence of newfstatat, but Go
could need more care. Though a simple rebuild of libc is enough for the
Chromium sandbox to work without code changes on their side.)

[glibc]: https://sourceware.org/pipermail/libc-alpha/2022-May/138958.html
[golang]: https://go.dev/cl/407694
[golang-2]: https://go.dev/cl/411378

--
WANG "xen0n" Xuerui

Linux/LoongArch mailing list: https://lore.kernel.org/loongarch/


2024-02-26 15:05:42

by Rich Felker

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Mon, Feb 26, 2024 at 01:57:55PM +0100, Christian Brauner wrote:
> On Mon, Feb 26, 2024 at 07:57:56PM +0800, Xi Ruoyao wrote:
> > On Mon, 2024-02-26 at 10:20 +0100, Arnd Bergmann wrote:
> >
> > /* snip */
> >
> > >
> > > > Or maybe we can just introduce a new AT_something to make statx
> > > > completely ignore pathname but behave like AT_EMPTY_PATH + "".
>
> I'm not at all convinced about doing custom semantics for this.

I did not follow the entirety of this thread. I've been aware for a
while that the need to use AT_EMPTY_PATH (on archs that don't have old
syscalls) is a performance problem in addition to being a sandboxing
problem, because the semantics for it were defined to use the string
argument if present, thereby requiring the kernel to perform an
additional string read from user memory.

Unfortunately, I don't see any good fix. Even if we could add
AT_STATX_NO_PATH/AT_STATX_NULL_PATH, libc would not be using it,
because using them would incur EINVAL-then-fallback on kernels that
don't support it.

In regards to the Chromium sandbox, I think Chromium is just wrong
here. Blocking statx is not safe (it also does not work on 32-bit
archs -- it breaks time64 support! and riscv32 doesn't even have
legacy stat either, just like loongarch64), and there is really no
serious security risk from being able to stat arbitrary pathnames.
Maybe it's annoying from a theoretical purity standpoint that you
can't block that, but from a practical standpoint it doesn't really
matter.

I'd like to see a solution to this, but all the possible ones look
bad. And it's all a consequence of poor consideration of how
AT_EMPTY_PATH should work when it was first invented/added. >_<

> > > I think this is better than going back to fstat64_time64(), but
> > > it's still not great because
> > >
> > > - all the reserved flags on statx() are by definition incompatible
> > >   with existing kernels that return -EINVAL for any flag they do
> > >   not recognize.
> >
> > Oops, we are deeming passing undefined flags in "mask" undefined
> > behavior but not "flags", thus "wild software" may be relying on EINVAL
> > for invalid flags... We *might* make this new AT_xxx a bit in mask
> > instead of flags but it would be very dirty IMO.
>
> Uhm, no. AT_* flags have nothing to do in statx()'s mask argument at all.

They definitely cannot go in mask; that's semantically wrong.

Rich

2024-02-26 15:40:52

by Christian Brauner

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Mon, Feb 26, 2024 at 02:46:49PM +0100, Arnd Bergmann wrote:
> On Mon, Feb 26, 2024, at 14:32, Christian Brauner wrote:
> > On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote:
> >> On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote:
> >
> > What this tells me without knowing the exact reason is that they thought
> > "Oh, if we just return ENOSYS then the workload or glibc will just
> > always be able to fallback to fstat() or fstatat()". Which ultimately is
> > the exact same thing that containers often assume.
> >
> > So really, just skipping on various system calls isn't going to work.
> > You can't just implement new system calls and forget about the rest
> > unless you know exactly what workloads your architecure will run on.
> >
> > Please implement fstat() or fstatat() and stop inventing hacks for
> > statx() to make weird sandboxing rules work, please.
>
> Do you mean we should add fstat64_time64() for all architectures
> then? Would use use the same structure layout as statx for this,
> the 64-bit version of the 'struct stat' layout from
> include/uapi/asm-generic/stat.h, or something new that solves
> the same problems?
>
> I definitely don't want to see a new time32 API added to
> mips64 and the 32-bit architectures, so the existing stat64
> interface won't work as a statx replacement.

I don't specifically care but the same way you don't want to see newer
time32 apis added to architectures I don't want to have hacks in our
system calls that aren't even a clear solution to the problem outlined
in this thread.

Short of adding fstatx() the problem isn't solved by a new flag to
statx() as explained in my other mails. But I'm probably missing
something here because I find this notion of "design system calls for
seccomp and the Chromium sandbox" to be an absurd notion and it makes me
a bit impatient.

And fwiw, once mseal() lands seccomp should be a lot easier to get deep
argument inspection.

2024-02-26 15:42:24

by Christian Brauner

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Mon, Feb 26, 2024 at 10:00:05PM +0800, WANG Xuerui wrote:
> On 2/26/24 21:32, Christian Brauner wrote:
> > On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote:
> > > On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote:
> > > > On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote:
> > > > > On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote:
> > > > > > 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道:
> > > > > > > On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote:
> > > > > > > > My idea is this problem needs syscalls to be designed with deep
> > > > > > > > argument inspection in mind; syscalls before this should be
> > > > > > > > considered
> > > > > > > > as historical error and get fixed by resotring old syscalls.
> > > > > > >
> > > > > > > I'd not consider fstat an error as using statx for fstat has a
> > > > > > > performance impact (severe for some workflows), and Linus has
> > > > > > > concluded
> > > > > >
> > > > > > Sorry for clearance, I mean statx is an error in ABI design, not fstat.
> > > >
> > > > I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of
> > > > "AT_NULL_PATH"/nullptr in the first place?
> > >
> > > Not sure, but it's hard to change now since the libc
> > > implementation won't easily know whether using the NULL
> > > path is safe on a given kernel. It could check the kernel
> > > version number, but that adds another bit of complexity in
> > > the fast path and doesn't work on old kernels with the
> > > feature backported.
> > >
> > > > But it's not irrational to pass a path to syscall, as long as we still
> > > > have the concept of file system (maybe in 2371 or some year we'll use a
> > > > 128-bit UUID instead of path).
> > > >
> > > > > The problem I see with the 'use use fstat' approach is that this
> > > > > does not work on 32-bit architectures, unless we define a new
> > > > > fstatat64_time64() syscall, which is one of the things that statx()
> > > >
> > > > "fstat64_time64". Using statx for fstatat should be just fine.
> > >
> > > Right. It does feel wrong to have only an fstat() variant but not
> > > fstatat() if we go there.
> > >
> > > > Or maybe we can just introduce a new AT_something to make statx
> > > > completely ignore pathname but behave like AT_EMPTY_PATH + "".
> > >
> > > I think this is better than going back to fstat64_time64(), but
> > > it's still not great because
> > >
> > > - all the reserved flags on statx() are by definition incompatible
> > > with existing kernels that return -EINVAL for any flag they do
> > > not recognize.
> > >
> > > - you still need to convince libc developers to actually use
> > > the flag despite the backwards compatibility problem, either
> > > with a fallback to the current behavior or a version check.
> > >
> > > Using the NULL path as a fallback would solve the problem with
> > > seccomp, but it would not make the normal case any faster.
> > >
> > > > > was trying to avoid.
> > > >
> > > > Oops. I thought "newstat" should be using 64-bit time but it seems the
> > > > "new" is not what I'd expected... The "new" actually means "newer than
> > > > Linux 0.9"! :(
> > > >
> > > > Let's not use "new" in future syscall names...
> > >
> > > Right, we definitely can't ever succeed. On some architectures
> > > we even had "oldstat" and "stat" before "newstat" and "stat64",
> > > and on some architectures we mix them up. E.g. x86_64 has fstat()
> > > and fstatat64() with the same structure but doesn't define
> > > __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit
> > > timestamps unlike all other 64-bit architectures.
> > >
> > > statx() was intended to solve these problems once and for all,
> > > and it appears that we have failed again.
> >
> > New apis don't invalidate old apis necessarily. That's just not going to
> > work in an age where you have containerized workloads.
> >
> > statx() is just the beginning of this. A container may have aritrary
> > seccomp profiles that return ENOSYS or even EPERM for whatever reason
> > for any new api that exists. So not implementing fstat() might already
> > break container workloads.
> >
> > Another example: You can't just skip on implementing mount() and only
> > implement the new mount api for example. Because tools that look for api
> > simplicity and don't need complex setup will _always_ continue to use
> > mount() and have a right to do so.
> >
> > And fwiw, mount() isn't fully inspectable by seccomp since forever. The
> > list goes on and on.
> >
> > But let's look at the original mail. Why are they denying statx() and
> > what's that claim about it not being able to be rewritten to something
> > safe? Looking at:
> >
> > intptr_t SIGSYSFstatatHandler(const struct arch_seccomp_data& args,
> > void* fs_denied_errno) {
> > if (args.nr == __NR_fstatat_default) {
> > if (*reinterpret_cast<const char*>(args.args[1]) == '\0' &&
> > args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) {
> > return syscall(__NR_fstat_default, static_cast<int>(args.args[0]),
> > reinterpret_cast<default_stat_struct*>(args.args[2]));
> > }
> > return -reinterpret_cast<intptr_t>(fs_denied_errno);
> > }
> >
> > What this does it to rewrite fstatat() to fstat() if it was made with
> > AT_EMPTY_PATH and the path argument was "". That is easily doable for
> > statx() because it has the exact same AT_EMPTY_PATH semantics that
> > fstatat() has.
> >
> > Plus, they can even filter on mask and rewrite that to something that
> > they think is safe. For example, STATX_BASIC_STATS which is equivalent
> > to what any fstat() call returns. So it's pretty difficult to understand
> > what their actual gripe with statx() is.
> >
> > It can't be that statx() passes a struct because fstatat() and fstat()
> > do that too. So what exactly is that problem there?
>
> From our investigation:
>
> For (new)fstatat calls that the sandboxed process may make, this SIGSYS
> handler either:
>
> * turns allowed calls (those looking at fd's) into fstat's that only have
> one argument (the fd) each, or
> * denies the call,

Yes, but look at the filtering that they do:

if (args.nr == __NR_fstatat_default) {
if (*reinterpret_cast<const char*>(args.args[1]) == '\0' &&
args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) {

So if you have a statx() call instead of an fstatat() call this is
trivially:

if (args.nr == __NR_statx) {
if (*reinterpret_cast<const char*>(args.args[1]) == '\0' &&
args.args[2] == static_cast<uint64_t>(AT_EMPTY_PATH)) {

maybe if they care about it also simply check
args.args[3] == STATX_BASIC_STATS.

And then just as with fstatat() rewrite it to fstat().

>
> so the sandbox only ever sees fstat calls and no (new)fstatat's, and the
> guarantee that only open fds can ever been stat'ed trivially holds.
>
> With statx, however, there's no way of guaranteeing "only look at fd"
> semantics without peeking into the path argument, because a non-empty path
> makes AT_EMPTY_PATH ineffective, and the flags are not validated prior to
> use making it near-impossible to introduce new semantics in a
> backwards-compatible manner.

I don't understand. That's exactly the same thing as for fstatat(). My
point is that you can turn statx() into fstat() just like you can turn
fstatat() into fstat(). So if you add fstat()/fstat64() what's left to
do?

> > What this tells me without knowing the exact reason is that they thought
> > "Oh, if we just return ENOSYS then the workload or glibc will just
> > always be able to fallback to fstat() or fstatat()". Which ultimately is
> > the exact same thing that containers often assume.
> >
> > So really, just skipping on various system calls isn't going to work.
> > You can't just implement new system calls and forget about the rest
> > unless you know exactly what workloads your architecure will run on.
> >
> > Please implement fstat() or fstatat() and stop inventing hacks for
> > statx() to make weird sandboxing rules work, please.
>
> We have already provided fstat(at) on LoongArch for a while by
> unconditionally doing statx and translating the returned structure -- see
> the [glibc] and [golang] [golang-2] implementations for example -- without

But you're doing that translation in userspace. I was talking about
adding the fstat()/fstat64() system calls.

2024-02-26 16:50:29

by Xi Ruoyao

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?

On Mon, 2024-02-26 at 16:40 +0100, Christian Brauner wrote:

> > I definitely don't want to see a new time32 API added to
> > mips64 and the 32-bit architectures, so the existing stat64
> > interface won't work as a statx replacement.
>
> I don't specifically care but the same way you don't want to see newer
> time32 apis added to architectures I don't want to have hacks in our
> system calls that aren't even a clear solution to the problem outlined
> in this thread.

So we should have a fstat_whatever64, IMO.

> Short of adding fstatx() the problem isn't solved by a new flag to
> statx() as explained in my other mails. But I'm probably missing
> something here because I find this notion of "design system calls for
> seccomp and the Chromium sandbox" to be an absurd notion and it makes me
> a bit impatient.

I'm sharing the feeling on seccomp and/or (mis)uses of it, but using
statx() or fstatat() for fstat() has a performance impact as they must
inspect path (do a uaccess) and make sure it's an empty string, and
Linus concluded "if the user want fstat, you should give the user fstat"
for this issue:

https://sourceware.org/pipermail/libc-alpha/2023-September/151365.html

If it was just seccomp I'd not comment on this topic at all.

--
Xi Ruoyao <[email protected]>
School of Aerospace Science and Technology, Xidian University

2024-02-26 17:53:51

by WANG Xuerui

[permalink] [raw]
Subject: Re: Chromium sandbox on LoongArch and statx -- seccomp deep argument inspection again?


On 2/26/24 23:35, Christian Brauner wrote:
> On Mon, Feb 26, 2024 at 10:00:05PM +0800, WANG Xuerui wrote:
>> On 2/26/24 21:32, Christian Brauner wrote:
>>> On Mon, Feb 26, 2024 at 10:20:23AM +0100, Arnd Bergmann wrote:
>>>> On Mon, Feb 26, 2024, at 08:09, Xi Ruoyao wrote:
>>>>> On Mon, 2024-02-26 at 07:56 +0100, Arnd Bergmann wrote:
>>>>>> On Mon, Feb 26, 2024, at 07:03, Icenowy Zheng wrote:
>>>>>>> 在 2024-02-25星期日的 15:32 +0800,Xi Ruoyao写道:
>>>>>>>> On Sun, 2024-02-25 at 14:51 +0800, Icenowy Zheng wrote:
>>>>>>>>> My idea is this problem needs syscalls to be designed with deep
>>>>>>>>> argument inspection in mind; syscalls before this should be
>>>>>>>>> considered
>>>>>>>>> as historical error and get fixed by resotring old syscalls.
>>>>>>>> I'd not consider fstat an error as using statx for fstat has a
>>>>>>>> performance impact (severe for some workflows), and Linus has
>>>>>>>> concluded
>>>>>>> Sorry for clearance, I mean statx is an error in ABI design, not fstat.
>>>>> I'm wondering why we decided to use AT_EMPTY_PATH/"" instead of
>>>>> "AT_NULL_PATH"/nullptr in the first place?
>>>> Not sure, but it's hard to change now since the libc
>>>> implementation won't easily know whether using the NULL
>>>> path is safe on a given kernel. It could check the kernel
>>>> version number, but that adds another bit of complexity in
>>>> the fast path and doesn't work on old kernels with the
>>>> feature backported.
>>>>
>>>>> But it's not irrational to pass a path to syscall, as long as we still
>>>>> have the concept of file system (maybe in 2371 or some year we'll use a
>>>>> 128-bit UUID instead of path).
>>>>>
>>>>>> The problem I see with the 'use use fstat' approach is that this
>>>>>> does not work on 32-bit architectures, unless we define a new
>>>>>> fstatat64_time64() syscall, which is one of the things that statx()
>>>>> "fstat64_time64". Using statx for fstatat should be just fine.
>>>> Right. It does feel wrong to have only an fstat() variant but not
>>>> fstatat() if we go there.
>>>>
>>>>> Or maybe we can just introduce a new AT_something to make statx
>>>>> completely ignore pathname but behave like AT_EMPTY_PATH + "".
>>>> I think this is better than going back to fstat64_time64(), but
>>>> it's still not great because
>>>>
>>>> - all the reserved flags on statx() are by definition incompatible
>>>> with existing kernels that return -EINVAL for any flag they do
>>>> not recognize.
>>>>
>>>> - you still need to convince libc developers to actually use
>>>> the flag despite the backwards compatibility problem, either
>>>> with a fallback to the current behavior or a version check.
>>>>
>>>> Using the NULL path as a fallback would solve the problem with
>>>> seccomp, but it would not make the normal case any faster.
>>>>
>>>>>> was trying to avoid.
>>>>> Oops. I thought "newstat" should be using 64-bit time but it seems the
>>>>> "new" is not what I'd expected... The "new" actually means "newer than
>>>>> Linux 0.9"! :(
>>>>>
>>>>> Let's not use "new" in future syscall names...
>>>> Right, we definitely can't ever succeed. On some architectures
>>>> we even had "oldstat" and "stat" before "newstat" and "stat64",
>>>> and on some architectures we mix them up. E.g. x86_64 has fstat()
>>>> and fstatat64() with the same structure but doesn't define
>>>> __NR_newfstat. On mips64, there is a 'newstat' but it has 32-bit
>>>> timestamps unlike all other 64-bit architectures.
>>>>
>>>> statx() was intended to solve these problems once and for all,
>>>> and it appears that we have failed again.
>>> New apis don't invalidate old apis necessarily. That's just not going to
>>> work in an age where you have containerized workloads.
>>>
>>> statx() is just the beginning of this. A container may have aritrary
>>> seccomp profiles that return ENOSYS or even EPERM for whatever reason
>>> for any new api that exists. So not implementing fstat() might already
>>> break container workloads.
>>>
>>> Another example: You can't just skip on implementing mount() and only
>>> implement the new mount api for example. Because tools that look for api
>>> simplicity and don't need complex setup will _always_ continue to use
>>> mount() and have a right to do so.
>>>
>>> And fwiw, mount() isn't fully inspectable by seccomp since forever. The
>>> list goes on and on.
>>>
>>> But let's look at the original mail. Why are they denying statx() and
>>> what's that claim about it not being able to be rewritten to something
>>> safe? Looking at:
>>>
>>> intptr_t SIGSYSFstatatHandler(const struct arch_seccomp_data& args,
>>> void* fs_denied_errno) {
>>> if (args.nr == __NR_fstatat_default) {
>>> if (*reinterpret_cast<const char*>(args.args[1]) == '\0' &&
>>> args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) {
>>> return syscall(__NR_fstat_default, static_cast<int>(args.args[0]),
>>> reinterpret_cast<default_stat_struct*>(args.args[2]));
>>> }
>>> return -reinterpret_cast<intptr_t>(fs_denied_errno);
>>> }
>>>
>>> What this does it to rewrite fstatat() to fstat() if it was made with
>>> AT_EMPTY_PATH and the path argument was "". That is easily doable for
>>> statx() because it has the exact same AT_EMPTY_PATH semantics that
>>> fstatat() has.
>>>
>>> Plus, they can even filter on mask and rewrite that to something that
>>> they think is safe. For example, STATX_BASIC_STATS which is equivalent
>>> to what any fstat() call returns. So it's pretty difficult to understand
>>> what their actual gripe with statx() is.
>>>
>>> It can't be that statx() passes a struct because fstatat() and fstat()
>>> do that too. So what exactly is that problem there?
>> From our investigation:
>>
>> For (new)fstatat calls that the sandboxed process may make, this SIGSYS
>> handler either:
>>
>> * turns allowed calls (those looking at fd's) into fstat's that only have
>> one argument (the fd) each, or
>> * denies the call,
> Yes, but look at the filtering that they do:
>
> if (args.nr == __NR_fstatat_default) {
> if (*reinterpret_cast<const char*>(args.args[1]) == '\0' &&
> args.args[3] == static_cast<uint64_t>(AT_EMPTY_PATH)) {
>
> So if you have a statx() call instead of an fstatat() call this is
> trivially:
>
> if (args.nr == __NR_statx) {
> if (*reinterpret_cast<const char*>(args.args[1]) == '\0' &&
> args.args[2] == static_cast<uint64_t>(AT_EMPTY_PATH)) {
>
> maybe if they care about it also simply check
> args.args[3] == STATX_BASIC_STATS.
>
> And then just as with fstatat() rewrite it to fstat().

But fstat() and fstatat() share the same return value type i.e. struct
stat, different from struct statx. And they are different enough that
their existing seccomp policy can distinguish. In the statx-only case
though, the seccomp policy cannot distinguish "statx actually called
with empty path" from "statx called with AT_EMPTY_PATH but non-empty
path" because in both cases the path would be a non-NULL pointer opaque
to the policy cBPF program.

>> so the sandbox only ever sees fstat calls and no (new)fstatat's, and the
>> guarantee that only open fds can ever been stat'ed trivially holds.
>>
>> With statx, however, there's no way of guaranteeing "only look at fd"
>> semantics without peeking into the path argument, because a non-empty path
>> makes AT_EMPTY_PATH ineffective, and the flags are not validated prior to
>> use making it near-impossible to introduce new semantics in a
>> backwards-compatible manner.
> I don't understand. That's exactly the same thing as for fstatat(). My
> point is that you can turn statx() into fstat() just like you can turn
> fstatat() into fstat(). So if you add fstat()/fstat64() what's left to
> do?
Yes, once fstat is restored it's a matter of transforming every allowed
statx into fstat, then translating struct stat back into struct statx.
What we're seeking is a possible way forward without re-introducing that
though, because we still have some time and don't have to rush.
>>> What this tells me without knowing the exact reason is that they thought
>>> "Oh, if we just return ENOSYS then the workload or glibc will just
>>> always be able to fallback to fstat() or fstatat()". Which ultimately is
>>> the exact same thing that containers often assume.
>>>
>>> So really, just skipping on various system calls isn't going to work.
>>> You can't just implement new system calls and forget about the rest
>>> unless you know exactly what workloads your architecure will run on.
>>>
>>> Please implement fstat() or fstatat() and stop inventing hacks for
>>> statx() to make weird sandboxing rules work, please.
>> We have already provided fstat(at) on LoongArch for a while by
>> unconditionally doing statx and translating the returned structure -- see
>> the [glibc] and [golang] [golang-2] implementations for example -- without
> But you're doing that translation in userspace. I was talking about
> adding the fstat()/fstat64() system calls.
Hmm, yeah. I meant to provide some more context but I later realized
that the sandbox is in charge of rewriting the syscalls from inside the
sandbox, so the userland may not matter in the big picture after all.

--
WANG "xen0n" Xuerui

Linux/LoongArch mailing list: https://lore.kernel.org/loongarch/