2019-07-19 17:34:27

by Sean Christopherson

[permalink] [raw]
Subject: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

The generic vDSO implementation, starting with commit

7ac870747988 ("x86/vdso: Switch to generic vDSO implementation")

breaks seccomp-enabled userspace on 32-bit x86 (i386) kernels. Prior to
the generic implementation, the x86 vDSO used identical code for both
x86_64 and i386 kernels, which worked because it did all calcuations using
structs with naturally sized variables, i.e. didn't use __kernel_timespec.

The generic vDSO does its internal calculations using __kernel_timespec,
which in turn requires the i386 fallback syscall to use the 64-bit
variation, __NR_clock_gettime64.

Using __NR_clock_gettime64 instead of __NR_clock_gettime breaks userspace
applications that use seccomp filtering to block syscalls, as applications
are completely unaware of the newly added of __NR_clock_gettime64, e.g.
sshd gets zapped on syscall(403) when attempting to ssh into the system.

I can fudge around the issue easily enough, but I have no idea how to fix
this properly without duplicating __cvdso_clock_gettime, do_hres, etc...,
especially now that the i386 vDSO exposes __vdso_clock_gettime64().


2019-07-19 17:49:04

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386



> On Jul 19, 2019, at 1:03 PM, Sean Christopherson <[email protected]> wrote:
>
> The generic vDSO implementation, starting with commit
>
> 7ac870747988 ("x86/vdso: Switch to generic vDSO implementation")
>
> breaks seccomp-enabled userspace on 32-bit x86 (i386) kernels. Prior to
> the generic implementation, the x86 vDSO used identical code for both
> x86_64 and i386 kernels, which worked because it did all calcuations using
> structs with naturally sized variables, i.e. didn't use __kernel_timespec.
>
> The generic vDSO does its internal calculations using __kernel_timespec,
> which in turn requires the i386 fallback syscall to use the 64-bit
> variation, __NR_clock_gettime64.

This is basically doomed to break eventually, right?

I’ve occasionally considered adding a concept of “seccomp aliases”. The idea is that, if a filter returns anything other than ALLOW, we re-run it with a different nr that we dig out it a small list of such cases. This would be limited to cases where the new syscall does the same thing with the same arguments.

I want this for restart_syscall: I want to renumber it.

Kees?

2019-07-22 21:19:38

by Kees Cook

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Fri, Jul 19, 2019 at 01:40:13PM -0400, Andy Lutomirski wrote:
> > On Jul 19, 2019, at 1:03 PM, Sean Christopherson <[email protected]> wrote:
> >
> > The generic vDSO implementation, starting with commit
> >
> > 7ac870747988 ("x86/vdso: Switch to generic vDSO implementation")
> >
> > breaks seccomp-enabled userspace on 32-bit x86 (i386) kernels. Prior to
> > the generic implementation, the x86 vDSO used identical code for both
> > x86_64 and i386 kernels, which worked because it did all calcuations using
> > structs with naturally sized variables, i.e. didn't use __kernel_timespec.
> >
> > The generic vDSO does its internal calculations using __kernel_timespec,
> > which in turn requires the i386 fallback syscall to use the 64-bit
> > variation, __NR_clock_gettime64.
>
> This is basically doomed to break eventually, right?

Just so I'm understanding: the vDSO change introduced code to make an
actual syscall on i386, which for most seccomp filters would be rejected?

> I’ve occasionally considered adding a concept of “seccomp aliases”. The idea is that, if a filter returns anything other than ALLOW, we re-run it with a different nr that we dig out it a small list of such cases. This would be limited to cases where the new syscall does the same thing with the same arguments.

Would that help here? The kernel just sees this a direct syscall. I
guess it could whitelist it by checking the return address?

> I want this for restart_syscall: I want to renumber it.

Oh man, don't get me started on restart_syscall. Some architectures make
it invisible to seccomp and others don't. ugh.

--
Kees Cook

2019-07-23 00:31:03

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Mon, 22 Jul 2019, Kees Cook wrote:
> On Fri, Jul 19, 2019 at 01:40:13PM -0400, Andy Lutomirski wrote:
> > > On Jul 19, 2019, at 1:03 PM, Sean Christopherson <[email protected]> wrote:
> > >
> > > The generic vDSO implementation, starting with commit
> > >
> > > 7ac870747988 ("x86/vdso: Switch to generic vDSO implementation")
> > >
> > > breaks seccomp-enabled userspace on 32-bit x86 (i386) kernels. Prior to
> > > the generic implementation, the x86 vDSO used identical code for both
> > > x86_64 and i386 kernels, which worked because it did all calcuations using
> > > structs with naturally sized variables, i.e. didn't use __kernel_timespec.
> > >
> > > The generic vDSO does its internal calculations using __kernel_timespec,
> > > which in turn requires the i386 fallback syscall to use the 64-bit
> > > variation, __NR_clock_gettime64.
> >
> > This is basically doomed to break eventually, right?
>
> Just so I'm understanding: the vDSO change introduced code to make an
> actual syscall on i386, which for most seccomp filters would be rejected?

No. The old x86 specific VDSO implementation had a fallback syscall as
well, i.e. clock_gettime(). On 32bit clock_gettime() uses the y2038
endangered timespec.

So when the VDSO was made generic we changed the internal data structures
to be 2038 safe right away. As a consequence the fallback syscall is not
clock_gettime(), it's clock_gettime64(). which seems to surprise seccomp.

Thanks,

tglx

2019-07-23 00:40:13

by Kees Cook

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Mon, Jul 22, 2019 at 08:31:32PM +0200, Thomas Gleixner wrote:
> On Mon, 22 Jul 2019, Kees Cook wrote:
> > Just so I'm understanding: the vDSO change introduced code to make an
> > actual syscall on i386, which for most seccomp filters would be rejected?
>
> No. The old x86 specific VDSO implementation had a fallback syscall as
> well, i.e. clock_gettime(). On 32bit clock_gettime() uses the y2038
> endangered timespec.
>
> So when the VDSO was made generic we changed the internal data structures
> to be 2038 safe right away. As a consequence the fallback syscall is not
> clock_gettime(), it's clock_gettime64(). which seems to surprise seccomp.

Okay, it's didn't add a syscall, it just changed it. Results are the
same: conservative filters suddenly start breaking due to the different
call. (And now I see why Andy's alias suggestion would help...)

I'm not sure which direction to do with this. It seems like an alias
list is a large hammer for this case, and a "seccomp-bypass when calling
from vDSO" solution seems too fragile?

--
Kees Cook

2019-07-23 02:33:49

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Mon, Jul 22, 2019 at 11:39 AM Kees Cook <[email protected]> wrote:
>
> On Mon, Jul 22, 2019 at 08:31:32PM +0200, Thomas Gleixner wrote:
> > On Mon, 22 Jul 2019, Kees Cook wrote:
> > > Just so I'm understanding: the vDSO change introduced code to make an
> > > actual syscall on i386, which for most seccomp filters would be rejected?
> >
> > No. The old x86 specific VDSO implementation had a fallback syscall as
> > well, i.e. clock_gettime(). On 32bit clock_gettime() uses the y2038
> > endangered timespec.
> >
> > So when the VDSO was made generic we changed the internal data structures
> > to be 2038 safe right away. As a consequence the fallback syscall is not
> > clock_gettime(), it's clock_gettime64(). which seems to surprise seccomp.
>
> Okay, it's didn't add a syscall, it just changed it. Results are the
> same: conservative filters suddenly start breaking due to the different
> call. (And now I see why Andy's alias suggestion would help...)
>
> I'm not sure which direction to do with this. It seems like an alias
> list is a large hammer for this case, and a "seccomp-bypass when calling
> from vDSO" solution seems too fragile?
>

I don't like the seccomp bypass at all. If someone uses seccomp to
disallow all clock_gettime() variants, there shouldn't be a back door
to learn the time.

Here's the restart_syscall() logic that makes me want aliases: we have
different syscall numbers for restart_syscall() on 32-bit and 64-bit.
The logic to decide which one to use is dubious at best. I'd like to
introduce a restart_syscall2() that is identical to restart_syscall()
except that it has the same number on both variants.

--Andy

2019-07-23 10:47:28

by Kees Cook

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Mon, Jul 22, 2019 at 12:17:16PM -0700, Andy Lutomirski wrote:
> On Mon, Jul 22, 2019 at 11:39 AM Kees Cook <[email protected]> wrote:
> >
> > On Mon, Jul 22, 2019 at 08:31:32PM +0200, Thomas Gleixner wrote:
> > > On Mon, 22 Jul 2019, Kees Cook wrote:
> > > > Just so I'm understanding: the vDSO change introduced code to make an
> > > > actual syscall on i386, which for most seccomp filters would be rejected?
> > >
> > > No. The old x86 specific VDSO implementation had a fallback syscall as
> > > well, i.e. clock_gettime(). On 32bit clock_gettime() uses the y2038
> > > endangered timespec.
> > >
> > > So when the VDSO was made generic we changed the internal data structures
> > > to be 2038 safe right away. As a consequence the fallback syscall is not
> > > clock_gettime(), it's clock_gettime64(). which seems to surprise seccomp.
> >
> > Okay, it's didn't add a syscall, it just changed it. Results are the
> > same: conservative filters suddenly start breaking due to the different
> > call. (And now I see why Andy's alias suggestion would help...)
> >
> > I'm not sure which direction to do with this. It seems like an alias
> > list is a large hammer for this case, and a "seccomp-bypass when calling
> > from vDSO" solution seems too fragile?
> >
>
> I don't like the seccomp bypass at all. If someone uses seccomp to
> disallow all clock_gettime() variants, there shouldn't be a back door
> to learn the time.
>
> Here's the restart_syscall() logic that makes me want aliases: we have
> different syscall numbers for restart_syscall() on 32-bit and 64-bit.
> The logic to decide which one to use is dubious at best. I'd like to
> introduce a restart_syscall2() that is identical to restart_syscall()
> except that it has the same number on both variants.

I've built a straw-man for this idea... but I have to say I don't
like it. This can lead to really unexpected behaviors if someone
were to have differing filters for the two syscalls. For example,
let's say someone was doing a paranoid audit of 2038-unsafe clock usage
and marked clock_gettime() with RET_KILL and marked clock_gettime64()
with RET_LOG. This aliasing would make clock_gettime64() trigger with
RET_KILL...

There's no sense of "default" in BPF so we can't check for "and they
weren't expecting it", and we can't check for exact matches since the
filter might be using a balance tree to bucket the allowed syscalls.

Anyway, here it is in case it sparks some more thoughts...


diff --git a/arch/x86/entry/vdso/vdso32/vclock_gettime.c b/arch/x86/entry/vdso/vdso32/vclock_gettime.c
index 9242b28418d5..1e82bd43286c 100644
--- a/arch/x86/entry/vdso/vdso32/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vdso32/vclock_gettime.c
@@ -13,6 +13,7 @@
*/
#undef CONFIG_64BIT
#undef CONFIG_X86_64
+#undef CONFIG_COMPAT
#undef CONFIG_PGTABLE_LEVELS
#undef CONFIG_ILLEGAL_POINTER_VALUE
#undef CONFIG_SPARSEMEM_VMEMMAP
diff --git a/arch/x86/include/asm/seccomp.h b/arch/x86/include/asm/seccomp.h
index 2bd1338de236..5d0da5ee1b61 100644
--- a/arch/x86/include/asm/seccomp.h
+++ b/arch/x86/include/asm/seccomp.h
@@ -6,6 +6,12 @@

#ifdef CONFIG_X86_32
#define __NR_seccomp_sigreturn __NR_sigreturn
+#define seccomp_syscall_alias(arch, syscall) \
+ ({ \
+ (syscall) == __NR_clock_gettime64 ? \
+ __NR_clock_gettime : \
+ -1; \
+ })
#endif

#ifdef CONFIG_COMPAT
@@ -14,6 +20,13 @@
#define __NR_seccomp_write_32 __NR_ia32_write
#define __NR_seccomp_exit_32 __NR_ia32_exit
#define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn
+#define seccomp_syscall_alias(arch, syscall) \
+ ({ \
+ (arch) != AUDIT_ARCH_I386 ? -1 : \
+ (syscall) == __NR_ia32_clock_gettime64 ? \
+ __NR_ia32_clock_gettime : \
+ -1; \
+ })
#endif

#include <asm-generic/seccomp.h>
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 84868d37b35d..06f5ca81d756 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -83,6 +83,14 @@ static inline int seccomp_mode(struct seccomp *s)
#ifdef CONFIG_SECCOMP_FILTER
extern void put_seccomp_filter(struct task_struct *tsk);
extern void get_seccomp_filter(struct task_struct *tsk);
+/*
+ * Allow architectures to provide syscall aliases for special cases
+ * when attempting to make invisible transitions to new syscalls that
+ * have no arguments (e.g. clock_gettime64, restart_syscall).
+ */
+# ifndef seccomp_syscall_alias
+# define seccomp_syscall_alias(arch, syscall) (-1)
+# endif
#else /* CONFIG_SECCOMP_FILTER */
static inline void put_seccomp_filter(struct task_struct *tsk)
{
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index dba52a7db5e8..5153c6155d9a 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -807,6 +807,26 @@ static int __seccomp_filter(int this_syscall, const struct seccomp_data *sd,
data = filter_ret & SECCOMP_RET_DATA;
action = filter_ret & SECCOMP_RET_ACTION_FULL;

+ /* Handle syscall aliases when result is not SECCOMP_RET_ALLOW. */
+ if (unlikely(action != SECCOMP_RET_ALLOW)) {
+ int alias;
+
+ alias = seccomp_syscall_alias(sd->arch, sd->nr);
+ if (unlikely(alias != -1)) {
+ /* Use sd_local for an aliased syscall. */
+ if (sd != &sd_local) {
+ sd_local = *sd;
+ sd = &sd_local;
+ }
+ sd_local.nr = alias;
+
+ /* Run again, with the alias, accepting the results. */
+ filter_ret = seccomp_run_filters(sd, &match);
+ data = filter_ret & SECCOMP_RET_DATA;
+ action = filter_ret & SECCOMP_RET_ACTION_FULL;
+ }
+ }
+
switch (action) {
case SECCOMP_RET_ERRNO:
/* Set low-order bits as an errno, capped at MAX_ERRNO. */
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 6ef7f16c4cf5..778619d145ea 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -3480,6 +3519,55 @@ TEST(seccomp_get_notif_sizes)
EXPECT_EQ(sizes.seccomp_notif_resp, sizeof(struct seccomp_notif_resp));
}

+#if defined(__i386__)
+TEST(seccomp_syscall_alias)
+{
+ struct sock_filter filter[] = {
+ BPF_STMT(BPF_LD|BPF_W|BPF_ABS,
+ offsetof(struct seccomp_data, nr)),
+#ifdef __NR_sigreturn
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_sigreturn, 6, 0),
+#endif
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_read, 5, 0),
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_exit, 4, 0),
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_rt_sigreturn, 3, 0),
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_restart_syscall, 2, 0),
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_write, 1, 0),
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_clock_gettime, 0, 1),
+ BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
+ BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_KILL),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)ARRAY_SIZE(filter),
+ .filter = filter,
+ };
+ unsigned char buffer[128];
+ long ret;
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel refused to install seccomp filter!");
+ }
+
+ /* We allow clock_gettime explicitly. */
+ EXPECT_EQ(0, syscall(__NR_clock_gettime, CLOCK_REALTIME, &buffer)) {
+ TH_LOG("Failed __NR_clock_gettime!?");
+ }
+
+ /* With aliases, clock_gettime64 should be allowed too. */
+ EXPECT_EQ(0, syscall(__NR_clock_gettime64, CLOCK_REALTIME, &buffer)) {
+ TH_LOG("Failed __NR_clock_gettime64!");
+ }
+
+ syscall(__NR_exit, 0);
+}
+#endif
+
/*
* TODO:
* - add microbenchmarks

--
Kees Cook

2019-07-23 10:50:36

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Mon, Jul 22, 2019 at 4:28 PM Kees Cook <[email protected]> wrote:
>
> On Mon, Jul 22, 2019 at 12:17:16PM -0700, Andy Lutomirski wrote:
> > On Mon, Jul 22, 2019 at 11:39 AM Kees Cook <[email protected]> wrote:
> > >
> > > On Mon, Jul 22, 2019 at 08:31:32PM +0200, Thomas Gleixner wrote:
> > > > On Mon, 22 Jul 2019, Kees Cook wrote:
> > > > > Just so I'm understanding: the vDSO change introduced code to make an
> > > > > actual syscall on i386, which for most seccomp filters would be rejected?
> > > >
> > > > No. The old x86 specific VDSO implementation had a fallback syscall as
> > > > well, i.e. clock_gettime(). On 32bit clock_gettime() uses the y2038
> > > > endangered timespec.
> > > >
> > > > So when the VDSO was made generic we changed the internal data structures
> > > > to be 2038 safe right away. As a consequence the fallback syscall is not
> > > > clock_gettime(), it's clock_gettime64(). which seems to surprise seccomp.
> > >
> > > Okay, it's didn't add a syscall, it just changed it. Results are the
> > > same: conservative filters suddenly start breaking due to the different
> > > call. (And now I see why Andy's alias suggestion would help...)
> > >
> > > I'm not sure which direction to do with this. It seems like an alias
> > > list is a large hammer for this case, and a "seccomp-bypass when calling
> > > from vDSO" solution seems too fragile?
> > >
> >
> > I don't like the seccomp bypass at all. If someone uses seccomp to
> > disallow all clock_gettime() variants, there shouldn't be a back door
> > to learn the time.
> >
> > Here's the restart_syscall() logic that makes me want aliases: we have
> > different syscall numbers for restart_syscall() on 32-bit and 64-bit.
> > The logic to decide which one to use is dubious at best. I'd like to
> > introduce a restart_syscall2() that is identical to restart_syscall()
> > except that it has the same number on both variants.
>
> I've built a straw-man for this idea... but I have to say I don't
> like it. This can lead to really unexpected behaviors if someone
> were to have differing filters for the two syscalls. For example,
> let's say someone was doing a paranoid audit of 2038-unsafe clock usage
> and marked clock_gettime() with RET_KILL and marked clock_gettime64()
> with RET_LOG. This aliasing would make clock_gettime64() trigger with
> RET_KILL...

This particular issue is solvable:

> + /* Handle syscall aliases when result is not SECCOMP_RET_ALLOW. */
> + if (unlikely(action != SECCOMP_RET_ALLOW)) {
> + int alias;
> +
> + alias = seccomp_syscall_alias(sd->arch, sd->nr);
> + if (unlikely(alias != -1)) {
> + /* Use sd_local for an aliased syscall. */
> + if (sd != &sd_local) {
> + sd_local = *sd;
> + sd = &sd_local;
> + }
> + sd_local.nr = alias;
> +
> + /* Run again, with the alias, accepting the results. */
> + filter_ret = seccomp_run_filters(sd, &match);
> + data = filter_ret & SECCOMP_RET_DATA;
> + action = filter_ret & SECCOMP_RET_ACTION_FULL;

How about:

new_data = ...;
new_action = ...;
if (new_action == SECCOMP_RET_ALLOW) {
data = new_data;
action = new_action;
}

It might also be nice to allow a filter to say "hey, I want to set
this result and I do *not* want compatibility aliases applied", but
I'm not quite sure how to express that.

I don't love this whole concept, but I also don't have a better idea.


--Andy

2019-07-23 15:36:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Mon, Jul 22, 2019 at 04:47:36PM -0700, Andy Lutomirski wrote:

> I don't love this whole concept, but I also don't have a better idea.

Are we really talking about changing the kernel because BPF is expecting
things? That is, did we just elevate everything BPF can observe to ABI?

/me runs for cover.

2019-07-24 00:03:14

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386



> On Jul 23, 2019, at 2:18 AM, Peter Zijlstra <[email protected]> wrote:
>
>> On Mon, Jul 22, 2019 at 04:47:36PM -0700, Andy Lutomirski wrote:
>>
>> I don't love this whole concept, but I also don't have a better idea.
>
> Are we really talking about changing the kernel because BPF is expecting
> things? That is, did we just elevate everything BPF can observe to ABI?
>

No, this isn’t about internals in the kernel mode sense. It’s about the smallish number of cases where the kernel causes user code to do a specific syscall and the user has a policy that doesn’t allow that syscall. This is visible to user code via seccomp and ptrace.

Yes, it’s obnoxious. Do you have any suggestions?

2019-07-24 00:27:02

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Tue, Jul 23, 2019 at 07:04:46AM -0700, Andy Lutomirski wrote:
>
>
> > On Jul 23, 2019, at 2:18 AM, Peter Zijlstra <[email protected]> wrote:
> >
> >> On Mon, Jul 22, 2019 at 04:47:36PM -0700, Andy Lutomirski wrote:
> >>
> >> I don't love this whole concept, but I also don't have a better idea.
> >
> > Are we really talking about changing the kernel because BPF is expecting
> > things? That is, did we just elevate everything BPF can observe to ABI?
> >
>
> No, this isn’t about internals in the kernel mode sense.

*phew*, I was scared for a wee moment.

> It’s about the smallish number of cases where the kernel causes user
> code to do a specific syscall and the user has a policy that doesn’t
> allow that syscall. This is visible to user code via seccomp and
> ptrace.
>
> Yes, it’s obnoxious. Do you have any suggestions?

Not really; I think I just demonstrated I don't fully understand the
problem space here :/

2019-07-24 02:33:01

by Kees Cook

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Mon, Jul 22, 2019 at 04:47:36PM -0700, Andy Lutomirski wrote:
> On Mon, Jul 22, 2019 at 4:28 PM Kees Cook <[email protected]> wrote:
> > I've built a straw-man for this idea... but I have to say I don't
> > like it. This can lead to really unexpected behaviors if someone
> > were to have differing filters for the two syscalls. For example,
> > let's say someone was doing a paranoid audit of 2038-unsafe clock usage
> > and marked clock_gettime() with RET_KILL and marked clock_gettime64()
> > with RET_LOG. This aliasing would make clock_gettime64() trigger with
> > RET_KILL...
>
> This particular issue is solvable:
>
> > + /* Handle syscall aliases when result is not SECCOMP_RET_ALLOW. */
> > + if (unlikely(action != SECCOMP_RET_ALLOW)) {
> > + int alias;
> > +
> > + alias = seccomp_syscall_alias(sd->arch, sd->nr);
> > + if (unlikely(alias != -1)) {
> > + /* Use sd_local for an aliased syscall. */
> > + if (sd != &sd_local) {
> > + sd_local = *sd;
> > + sd = &sd_local;
> > + }
> > + sd_local.nr = alias;
> > +
> > + /* Run again, with the alias, accepting the results. */
> > + filter_ret = seccomp_run_filters(sd, &match);
> > + data = filter_ret & SECCOMP_RET_DATA;
> > + action = filter_ret & SECCOMP_RET_ACTION_FULL;
>
> How about:
>
> new_data = ...;
> new_action = ...;
> if (new_action == SECCOMP_RET_ALLOW) {
> data = new_data;
> action = new_action;
> }

Spelling it out for myself: this means that if both syscalls have
non-RET_ALLOW results, the original result is kept. But if the alias
is allowed, allow it. That solves my particular example, but I don't
think it's enough. (And it might be just as bad.) What if someone wants
to RET_TRACE clock_gettime64 and other syscalls are RET_ALLOWed. Now
clock_gettime64 cannot be traced since clock_gettime gets RET_ALLOW and
replaces the results.

> It might also be nice to allow a filter to say "hey, I want to set
> this result and I do *not* want compatibility aliases applied", but
> I'm not quite sure how to express that.

Especially since we're working on "old" filters.

> I don't love this whole concept, but I also don't have a better idea.

How about we revert the vDSO change? :P

I keep coming back to using the vDSO return address as an indicator.
Most vDSO calls don't make syscalls, yes? So they're normally
unfilterable by seccomp.

What was the prior vDSO behavior?

--
Kees Cook

2019-07-24 02:34:45

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Tue, 23 Jul 2019, Kees Cook wrote:
> On Mon, Jul 22, 2019 at 04:47:36PM -0700, Andy Lutomirski wrote:
> > I don't love this whole concept, but I also don't have a better idea.
>
> How about we revert the vDSO change? :P

Sigh. Add more special case code to the VDSO again?

> I keep coming back to using the vDSO return address as an indicator.
> Most vDSO calls don't make syscalls, yes? So they're normally
> unfilterable by seccomp.
>
> What was the prior vDSO behavior?

The behaviour is pretty much the same as before:

If the requested clock id is supported by the VDSO and the platform has a
VDSO capable clocksource, everything is handled in user space.

If either of the conditions is false, fall back to a syscall.

The implementation detail changed for 32bit (native and compat):

. The original VDSO used sys_clock_gettime() as fallback, the new one uses
sys_clock_gettime64().

The reason is that we need to support 2038 safe vdso_clock_gettime64() for
32bit which requires to use sys_clock_gettime64() as fallback. So we use
the same fallback for the non 2038 safe vdso_clock_gettime() variant as
well to avoid having different implementations of the fallback code.

And as we have sys_clock_gettime64() exposed for 32bit anyway you need to
deal with that in seccomp independently of the VDSO. It does not make sense
to treat sys_clock_gettime() differently than sys_clock_gettime64(). They
both expose the same information, but the latter is y2038 safe.

So changing vdso back to the original fallback for 32bit (native and
compat) is just a temporary bandaid as seccomp needs to deal with the y2038
safe variant anyway.

Thanks,

tglx


2019-07-24 02:35:51

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Tue, Jul 23, 2019 at 2:55 PM Kees Cook <[email protected]> wrote:
>
> On Mon, Jul 22, 2019 at 04:47:36PM -0700, Andy Lutomirski wrote:
> > On Mon, Jul 22, 2019 at 4:28 PM Kees Cook <[email protected]> wrote:
> > > I've built a straw-man for this idea... but I have to say I don't
> > > like it. This can lead to really unexpected behaviors if someone
> > > were to have differing filters for the two syscalls. For example,
> > > let's say someone was doing a paranoid audit of 2038-unsafe clock usage
> > > and marked clock_gettime() with RET_KILL and marked clock_gettime64()
> > > with RET_LOG. This aliasing would make clock_gettime64() trigger with
> > > RET_KILL...
> >
> > This particular issue is solvable:
> >
> > > + /* Handle syscall aliases when result is not SECCOMP_RET_ALLOW. */
> > > + if (unlikely(action != SECCOMP_RET_ALLOW)) {
> > > + int alias;
> > > +
> > > + alias = seccomp_syscall_alias(sd->arch, sd->nr);
> > > + if (unlikely(alias != -1)) {
> > > + /* Use sd_local for an aliased syscall. */
> > > + if (sd != &sd_local) {
> > > + sd_local = *sd;
> > > + sd = &sd_local;
> > > + }
> > > + sd_local.nr = alias;
> > > +
> > > + /* Run again, with the alias, accepting the results. */
> > > + filter_ret = seccomp_run_filters(sd, &match);
> > > + data = filter_ret & SECCOMP_RET_DATA;
> > > + action = filter_ret & SECCOMP_RET_ACTION_FULL;
> >
> > How about:
> >
> > new_data = ...;
> > new_action = ...;
> > if (new_action == SECCOMP_RET_ALLOW) {
> > data = new_data;
> > action = new_action;
> > }
>
> Spelling it out for myself: this means that if both syscalls have
> non-RET_ALLOW results, the original result is kept. But if the alias
> is allowed, allow it. That solves my particular example, but I don't
> think it's enough. (And it might be just as bad.) What if someone wants
> to RET_TRACE clock_gettime64 and other syscalls are RET_ALLOWed. Now
> clock_gettime64 cannot be traced since clock_gettime gets RET_ALLOW and
> replaces the results.

Fair enough.

>
> > It might also be nice to allow a filter to say "hey, I want to set
> > this result and I do *not* want compatibility aliases applied", but
> > I'm not quite sure how to express that.
>
> Especially since we're working on "old" filters.
>
> > I don't love this whole concept, but I also don't have a better idea.
>
> How about we revert the vDSO change? :P
>
> I keep coming back to using the vDSO return address as an indicator.
> Most vDSO calls don't make syscalls, yes? So they're normally
> unfilterable by seccomp.
>
> What was the prior vDSO behavior?

The prior vDSO behavior was identical except that it was
clock_gettime, not clock_gettime64. It's a little bit more subtle:
the syscall only happens sometimes, mostly depending on hypervisor
details, hardware, and hardware configuration. In the preferred TSC
clock mode, we will not do the syscall for most clock ids.

The emulated vsyscall behavior is that, if the gettimeofday() vsyscall
is issued, we fake a seccomp event and we make a somewhat credible
effort to honor the result. See arch/x86/entry/vsyscall_64.c. In
general, I consider the combination of PR_SET_TSC to PR_TSC_DISABLE +
seccomp blocking timing syscalls to mean that fine-grained timing is
genuinely disabled. So I don't love bypassing seccomp in this type of
context. Also, there are a few ways that programs could fake the
return address as coming from the vDSO, the simplest of which is to
simply jump to that syscall instruction.

This is horrible but: what if we added a new parameter passed when a
seccomp filter is registered that indicates that the kernel should
remap syscalls with numbers above the given parameter. The default
value for legacy users would be clock_gettime64 - 1 :) The idea being
that an old trace-everything filter will sometimes get clock_gettime64
wrong, but a new one can opt in to more accurate behavior. Ditto for
a hypothetical restart_syscall_renumbered(). I admit I still don't
love this either.

2019-07-24 02:35:52

by Kees Cook

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Wed, Jul 24, 2019 at 12:59:03AM +0200, Thomas Gleixner wrote:
> And as we have sys_clock_gettime64() exposed for 32bit anyway you need to
> deal with that in seccomp independently of the VDSO. It does not make sense
> to treat sys_clock_gettime() differently than sys_clock_gettime64(). They
> both expose the same information, but the latter is y2038 safe.

Okay, so combining Andy's ideas on aliasing and "more seccomp flags",
we could declare that clock_gettime64() is not filterable on 32-bit at
all without the magic SECCOMP_IGNORE_ALIASES flag or something. Then we
would alias clock_gettime64 to clock_gettime _before_ the first evaluation
(unless SECCOMP_IGNORE_ALIASES is set)?

(When was clock_gettime64() introduced? Is it too long ago to do this
"you can't filter it without a special flag" change?)

--
Kees Cook

2019-07-24 02:37:09

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Tue, 23 Jul 2019, Kees Cook wrote:

> On Wed, Jul 24, 2019 at 12:59:03AM +0200, Thomas Gleixner wrote:
> > And as we have sys_clock_gettime64() exposed for 32bit anyway you need to
> > deal with that in seccomp independently of the VDSO. It does not make sense
> > to treat sys_clock_gettime() differently than sys_clock_gettime64(). They
> > both expose the same information, but the latter is y2038 safe.
>
> Okay, so combining Andy's ideas on aliasing and "more seccomp flags",
> we could declare that clock_gettime64() is not filterable on 32-bit at
> all without the magic SECCOMP_IGNORE_ALIASES flag or something. Then we
> would alias clock_gettime64 to clock_gettime _before_ the first evaluation
> (unless SECCOMP_IGNORE_ALIASES is set)?
>
> (When was clock_gettime64() introduced? Is it too long ago to do this
> "you can't filter it without a special flag" change?)

clock_gettime64() and the other sys_*time64() syscalls which address the
y2038 issue were added in 5.1

Thanks,

tglx

2019-07-26 19:36:16

by Sean Christopherson

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

+cc Paul

On Wed, Jul 24, 2019 at 01:56:34AM +0200, Thomas Gleixner wrote:
> On Tue, 23 Jul 2019, Kees Cook wrote:
>
> > On Wed, Jul 24, 2019 at 12:59:03AM +0200, Thomas Gleixner wrote:
> > > And as we have sys_clock_gettime64() exposed for 32bit anyway you need to
> > > deal with that in seccomp independently of the VDSO. It does not make sense
> > > to treat sys_clock_gettime() differently than sys_clock_gettime64(). They
> > > both expose the same information, but the latter is y2038 safe.
> >
> > Okay, so combining Andy's ideas on aliasing and "more seccomp flags",
> > we could declare that clock_gettime64() is not filterable on 32-bit at
> > all without the magic SECCOMP_IGNORE_ALIASES flag or something. Then we
> > would alias clock_gettime64 to clock_gettime _before_ the first evaluation
> > (unless SECCOMP_IGNORE_ALIASES is set)?
> >
> > (When was clock_gettime64() introduced? Is it too long ago to do this
> > "you can't filter it without a special flag" change?)
>
> clock_gettime64() and the other sys_*time64() syscalls which address the
> y2038 issue were added in 5.1

Paul Bolle pointed out that this regression showed up in v5.3-rc1, not
v5.2. In Paul's case, systemd-journal is failing.

2019-07-27 17:52:51

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Fri, Jul 26, 2019 at 11:01 AM Sean Christopherson
<[email protected]> wrote:
>
> +cc Paul
>
> On Wed, Jul 24, 2019 at 01:56:34AM +0200, Thomas Gleixner wrote:
> > On Tue, 23 Jul 2019, Kees Cook wrote:
> >
> > > On Wed, Jul 24, 2019 at 12:59:03AM +0200, Thomas Gleixner wrote:
> > > > And as we have sys_clock_gettime64() exposed for 32bit anyway you need to
> > > > deal with that in seccomp independently of the VDSO. It does not make sense
> > > > to treat sys_clock_gettime() differently than sys_clock_gettime64(). They
> > > > both expose the same information, but the latter is y2038 safe.
> > >
> > > Okay, so combining Andy's ideas on aliasing and "more seccomp flags",
> > > we could declare that clock_gettime64() is not filterable on 32-bit at
> > > all without the magic SECCOMP_IGNORE_ALIASES flag or something. Then we
> > > would alias clock_gettime64 to clock_gettime _before_ the first evaluation
> > > (unless SECCOMP_IGNORE_ALIASES is set)?
> > >
> > > (When was clock_gettime64() introduced? Is it too long ago to do this
> > > "you can't filter it without a special flag" change?)
> >
> > clock_gettime64() and the other sys_*time64() syscalls which address the
> > y2038 issue were added in 5.1
>
> Paul Bolle pointed out that this regression showed up in v5.3-rc1, not
> v5.2. In Paul's case, systemd-journal is failing.

I think it's getting quite late to start inventing new seccomp
features to fix this. I think the right solution for 5.3 is to change
the 32-bit vdso fallback to use the old clock_gettime, i.e.
clock_gettime32. This is obviously not an acceptable long-term
solution.

2019-07-27 19:14:21

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Sat, 27 Jul 2019, Andy Lutomirski wrote:

> On Fri, Jul 26, 2019 at 11:01 AM Sean Christopherson
> <[email protected]> wrote:
> >
> > +cc Paul
> >
> > On Wed, Jul 24, 2019 at 01:56:34AM +0200, Thomas Gleixner wrote:
> > > On Tue, 23 Jul 2019, Kees Cook wrote:
> > >
> > > > On Wed, Jul 24, 2019 at 12:59:03AM +0200, Thomas Gleixner wrote:
> > > > > And as we have sys_clock_gettime64() exposed for 32bit anyway you need to
> > > > > deal with that in seccomp independently of the VDSO. It does not make sense
> > > > > to treat sys_clock_gettime() differently than sys_clock_gettime64(). They
> > > > > both expose the same information, but the latter is y2038 safe.
> > > >
> > > > Okay, so combining Andy's ideas on aliasing and "more seccomp flags",
> > > > we could declare that clock_gettime64() is not filterable on 32-bit at
> > > > all without the magic SECCOMP_IGNORE_ALIASES flag or something. Then we
> > > > would alias clock_gettime64 to clock_gettime _before_ the first evaluation
> > > > (unless SECCOMP_IGNORE_ALIASES is set)?
> > > >
> > > > (When was clock_gettime64() introduced? Is it too long ago to do this
> > > > "you can't filter it without a special flag" change?)
> > >
> > > clock_gettime64() and the other sys_*time64() syscalls which address the
> > > y2038 issue were added in 5.1
> >
> > Paul Bolle pointed out that this regression showed up in v5.3-rc1, not
> > v5.2. In Paul's case, systemd-journal is failing.
>
> I think it's getting quite late to start inventing new seccomp
> features to fix this. I think the right solution for 5.3 is to change
> the 32-bit vdso fallback to use the old clock_gettime, i.e.
> clock_gettime32. This is obviously not an acceptable long-term
> solution.

Sigh. I'll have a look....

Thanks,

tglx


2019-07-27 22:00:58

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Sat, 27 Jul 2019, Thomas Gleixner wrote:
> On Sat, 27 Jul 2019, Andy Lutomirski wrote:
> >
> > I think it's getting quite late to start inventing new seccomp
> > features to fix this. I think the right solution for 5.3 is to change
> > the 32-bit vdso fallback to use the old clock_gettime, i.e.
> > clock_gettime32. This is obviously not an acceptable long-term
> > solution.
>
> Sigh. I'll have a look....

Completely untested patch below.

For the record: I have to say that I hate it.

Just to be clear. Once a clever seccomp admin decides to enforce Y2038
compliance by filtering out the legacy syscalls this will force glibc into
the syscall slowpath directly because __vdso_clock_gettime64() is gone.

So this needs a proper secccomp solution soener than later.

The fallback change to the legacy syscall is on purpose conditional on
CONFIG_SECCOMP so those people who care can get access to
__vdso_clock_gettime64() nevertheless.

Thanks,

tglx

8<-----------------
--- a/arch/x86/entry/vdso/vdso32/vdso32.lds.S
+++ b/arch/x86/entry/vdso/vdso32/vdso32.lds.S
@@ -27,7 +27,9 @@ VERSION
__vdso_gettimeofday;
__vdso_time;
__vdso_clock_getres;
+#ifndef CONFIG_SECCOMP
__vdso_clock_gettime64;
+#endif
};

LINUX_2.5 {
--- a/arch/x86/include/asm/vdso/gettimeofday.h
+++ b/arch/x86/include/asm/vdso/gettimeofday.h
@@ -101,6 +101,7 @@ long clock_gettime_fallback(clockid_t _c
{
long ret;

+#ifndef CONFIG_SECCOMP
asm (
"mov %%ebx, %%edx \n"
"mov %[clock], %%ebx \n"
@@ -110,6 +111,36 @@ long clock_gettime_fallback(clockid_t _c
: "0" (__NR_clock_gettime64), [clock] "g" (_clkid), "c" (_ts)
: "edx");

+#else
+ struct old_timespec32 tmpts;
+
+ /*
+ * Using clock_gettime and not clock_gettime64 here is a
+ * temporary workaround to pacify seccomp filters which are
+ * unaware of the Y2038 safe variant.
+ */
+
+ asm (
+ "mov %%ebx, %%edx \n"
+ "mov %[clock], %%ebx \n"
+ "call __kernel_vsyscall \n"
+ "mov %%edx, %%ebx \n"
+ : "=a" (ret), "=m" (tmpts)
+ : "0" (__NR_clock_gettime), [clock] "g" (_clkid), "c" (&tmpts)
+ : "edx");
+
+ /*
+ * The core code will have to convert that back. A smart compiler
+ * should notice and avoid the double conversion. If not, bad luck;
+ * we we are not going to change the core code just to make seccomp
+ * happy.
+ */
+
+ if (!ret) {
+ _ts->tv_sec = tmpts.tv_sec;
+ _ts->tv_nsec = tmpts.tv_nsec;
+ }
+#endif
return ret;
}

@@ -136,6 +167,7 @@ clock_getres_fallback(clockid_t _clkid,
{
long ret;

+#ifndef CONFIG_SECCOMP
asm (
"mov %%ebx, %%edx \n"
"mov %[clock], %%ebx \n"
@@ -144,7 +176,32 @@ clock_getres_fallback(clockid_t _clkid,
: "=a" (ret), "=m" (*_ts)
: "0" (__NR_clock_getres_time64), [clock] "g" (_clkid), "c" (_ts)
: "edx");
+#else
+ struct old_timespec32 tmpts;
+
+ /*
+ * Using clock_getres and not clock_getres_time64 here is a
+ * temporary workaround to pacify seccomp filters which are unaware
+ * of the time64 variants. Technically there is no requirement to
+ * use the 64bit variant here as the resolution is definitely not
+ * affected by Y2038, but the end goal of Y2038 is to utilize the
+ * new 64bit timespec variants for everything.
+ */
+
+ asm (
+ "mov %%ebx, %%edx \n"
+ "mov %[clock], %%ebx \n"
+ "call __kernel_vsyscall \n"
+ "mov %%edx, %%ebx \n"
+ : "=a" (ret), "=m" (tmpts)
+ : "0" (__NR_clock_getres), [clock] "g" (_clkid), "c" (&tmpts)
+ : "edx");

+ if (!ret) {
+ _ts->tv_sec = tmpts.tv_sec;
+ _ts->tv_nsec = tmpts.tv_nsec;
+ }
+#endif
return ret;
}


2019-07-28 00:37:26

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Sat, Jul 27, 2019 at 2:52 PM Thomas Gleixner <[email protected]> wrote:
>
> On Sat, 27 Jul 2019, Thomas Gleixner wrote:
> > On Sat, 27 Jul 2019, Andy Lutomirski wrote:
> > >
> > > I think it's getting quite late to start inventing new seccomp
> > > features to fix this. I think the right solution for 5.3 is to change
> > > the 32-bit vdso fallback to use the old clock_gettime, i.e.
> > > clock_gettime32. This is obviously not an acceptable long-term
> > > solution.
> >
> > Sigh. I'll have a look....
>
> Completely untested patch below.
>
> For the record: I have to say that I hate it.

Me too.

How about something more like the attached. (The attachment obviously
won't compile, since it's incomplete. I'm wondering if you think the
idea is okay. The idea is to have the vdso calls fall back to the
corresponding syscalls rather than internally converting.)

--Andy


Attachments:
vdso.patch (1.75 kB)

2019-07-28 09:59:41

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Sat, Jul 27, 2019 at 7:53 PM Andy Lutomirski <[email protected]> wrote:
>
> On Fri, Jul 26, 2019 at 11:01 AM Sean Christopherson
> <[email protected]> wrote:
> >
> > On Wed, Jul 24, 2019 at 01:56:34AM +0200, Thomas Gleixner wrote:
> > > On Tue, 23 Jul 2019, Kees Cook wrote:
> > >
> > > > On Wed, Jul 24, 2019 at 12:59:03AM +0200, Thomas Gleixner wrote:
> > > > > And as we have sys_clock_gettime64() exposed for 32bit anyway you need to
> > > > > deal with that in seccomp independently of the VDSO. It does not make sense
> > > > > to treat sys_clock_gettime() differently than sys_clock_gettime64(). They
> > > > > both expose the same information, but the latter is y2038 safe.
> > > >
> > > > Okay, so combining Andy's ideas on aliasing and "more seccomp flags",
> > > > we could declare that clock_gettime64() is not filterable on 32-bit at
> > > > all without the magic SECCOMP_IGNORE_ALIASES flag or something. Then we
> > > > would alias clock_gettime64 to clock_gettime _before_ the first evaluation
> > > > (unless SECCOMP_IGNORE_ALIASES is set)?
> > > >
> > > > (When was clock_gettime64() introduced? Is it too long ago to do this
> > > > "you can't filter it without a special flag" change?)
> > >
> > > clock_gettime64() and the other sys_*time64() syscalls which address the
> > > y2038 issue were added in 5.1
> >
> > Paul Bolle pointed out that this regression showed up in v5.3-rc1, not
> > v5.2. In Paul's case, systemd-journal is failing.
>
> I think it's getting quite late to start inventing new seccomp
> features to fix this. I think the right solution for 5.3 is to change
> the 32-bit vdso fallback to use the old clock_gettime, i.e.
> clock_gettime32. This is obviously not an acceptable long-term
> solution.

I think there is something else wrong with the fallback path, it seems
to pass the wrong structure in some cases:

arch/x86/include/asm/vdso/gettimeofday.h vdso32:

static __always_inline
long clock_gettime_fallback(clockid_t _clkid, struct __kernel_timespec *_ts)
{
long ret;

asm (
"mov %%ebx, %%edx \n"
"mov %[clock], %%ebx \n"
"call __kernel_vsyscall \n"
"mov %%edx, %%ebx \n"
: "=a" (ret), "=m" (*_ts)
: "0" (__NR_clock_gettime64), [clock] "g" (_clkid), "c" (_ts)
: "edx");

return ret;
}

arch/x86/include/asm/vdso/gettimeofday.h vdso64:
static __always_inline
long clock_gettime_fallback(clockid_t _clkid, struct __kernel_timespec *_ts)
{
long ret;

asm ("syscall" : "=a" (ret), "=m" (*_ts) :
"0" (__NR_clock_gettime), "D" (_clkid), "S" (_ts) :
"rcx", "r11");

return ret;
}


lib/vdso/gettimeofday.c:
static __maybe_unused int
__cvdso_clock_gettime32(clockid_t clock, struct old_timespec32 *res)
{
struct __kernel_timespec ts;
int ret;

if (res == NULL)
goto fallback;

ret = __cvdso_clock_gettime(clock, &ts);

if (ret == 0) {
res->tv_sec = ts.tv_sec;
res->tv_nsec = ts.tv_nsec;
}

return ret;

fallback:
return clock_gettime_fallback(clock, (struct __kernel_timespec *)res);
}

So we get an 'old_timespec32' pointer from user space, and cast
it to __kernel_timespec in order to pass it to the low-level function
that actually fills in the 64-bit structure.

On a little-endian machine, the first four bytes are actually correct
here, but this is followed by tv_nsec=0 and 8 more bytes that overwrite
whatever comes after the user space 'timespec'. [I missed the
typecast as an indication of a bug during my review, sorry about
that].

I think adding a clock_gettime32_fallback() function that calls
__NR_clock_gettime is both the simplest fix for this bug, and
the least ugly way to handle it in the long run.

We also need to decide what to do about __cvdso_clock_gettime32()
once we add a compile-time option to make all time32 syscalls
to return an error. Returning -ENOSYS from the clock_gettime32()
fallback is probably a good idea, but for consistency the
__vdso_clock_gettime() call should either always return the
same in that configuration, or be left out from the vdso build
endirely.

Arnd

2019-07-28 10:36:39

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Sun, 28 Jul 2019, Arnd Bergmann wrote:
> On Sat, Jul 27, 2019 at 7:53 PM Andy Lutomirski <[email protected]> wrote:
> lib/vdso/gettimeofday.c:
> static __maybe_unused int
> __cvdso_clock_gettime32(clockid_t clock, struct old_timespec32 *res)
> {
> struct __kernel_timespec ts;
> int ret;
>
> if (res == NULL)
> goto fallback;
>
> ret = __cvdso_clock_gettime(clock, &ts);
>
> if (ret == 0) {
> res->tv_sec = ts.tv_sec;
> res->tv_nsec = ts.tv_nsec;
> }
>
> return ret;
>
> fallback:
> return clock_gettime_fallback(clock, (struct __kernel_timespec *)res);
> }
>
> So we get an 'old_timespec32' pointer from user space, and cast
> it to __kernel_timespec in order to pass it to the low-level function
> that actually fills in the 64-bit structure.
>
> On a little-endian machine, the first four bytes are actually correct
> here, but this is followed by tv_nsec=0 and 8 more bytes that overwrite
> whatever comes after the user space 'timespec'. [I missed the
> typecast as an indication of a bug during my review, sorry about
> that].

Which is totally irrelevant because res is NULL and that NULL pointer check
should simply return -EFAULT, which is what the syscall fallback returns
because the pointer is NULL.

But that NULL pointer check is inconsistent anyway:

- 64 bit does not have it and never had

- the vdso is not capable of handling faults properly anyway. If the
pointer is not valid, then it will segfault. So just preventing the
segfault for NULL is silly.

I'm going to just remove it.

Thanks,

tglx

2019-07-28 11:15:32

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Sun, 28 Jul 2019, Arnd Bergmann wrote:
> We also need to decide what to do about __cvdso_clock_gettime32()
> once we add a compile-time option to make all time32 syscalls
> to return an error. Returning -ENOSYS from the clock_gettime32()
> fallback is probably a good idea, but for consistency the
> __vdso_clock_gettime() call should either always return the
> same in that configuration, or be left out from the vdso build
> endirely.

Either way works and it does not make a difference :)

Thanks,

tglx

2019-07-28 15:28:52

by Arnd Bergmann

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Sun, Jul 28, 2019 at 12:30 PM Thomas Gleixner <[email protected]> wrote:
> On Sun, 28 Jul 2019, Arnd Bergmann wrote:
> > On Sat, Jul 27, 2019 at 7:53 PM Andy Lutomirski <[email protected]> wrote:

> Which is totally irrelevant because res is NULL and that NULL pointer check
> should simply return -EFAULT, which is what the syscall fallback returns
> because the pointer is NULL.
>
> But that NULL pointer check is inconsistent anyway:
>
> - 64 bit does not have it and never had
>
> - the vdso is not capable of handling faults properly anyway. If the
> pointer is not valid, then it will segfault. So just preventing the
> segfault for NULL is silly.
>
> I'm going to just remove it.

Ah, you are right, I misread.

Anyway, if we want to keep the traditional behavior and get fewer surprises
for users of seccomp and anything else that might observe clock_gettime
behavior, below is how I'd do it. (not even build tested, x86-only. I'll
send a proper patch if this is where we want to take it).

Arnd

diff --git a/arch/x86/include/asm/vdso/gettimeofday.h
b/arch/x86/include/asm/vdso/gettimeofday.h
index ae91429129a6..f7b6a50d4811 100644
--- a/arch/x86/include/asm/vdso/gettimeofday.h
+++ b/arch/x86/include/asm/vdso/gettimeofday.h
@@ -70,6 +70,13 @@ long clock_gettime_fallback(clockid_t _clkid,
struct __kernel_timespec *_ts)
return ret;
}

+
+static __always_inline
+long clock_gettime32_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
+{
+ return -ENOSYS;
+}
+
static __always_inline
long gettimeofday_fallback(struct __kernel_old_timeval *_tv,
struct timezone *_tz)
@@ -97,7 +104,7 @@ long clock_getres_fallback(clockid_t _clkid, struct
__kernel_timespec *_ts)
#else

static __always_inline
-long clock_gettime_fallback(clockid_t _clkid, struct __kernel_timespec *_ts)
+long clock_gettime_fallback(clockid_t _clkid, struct old_timespec32 *_ts)
{
long ret;

@@ -113,6 +120,23 @@ long clock_gettime_fallback(clockid_t _clkid,
struct __kernel_timespec *_ts)
return ret;
}

+static __always_inline
+long clock_gettime32_fallback(clockid_t _clkid, struct __kernel_timespec *_ts)
+{
+ long ret;
+
+ asm (
+ "mov %%ebx, %%edx \n"
+ "mov %[clock], %%ebx \n"
+ "call __kernel_vsyscall \n"
+ "mov %%edx, %%ebx \n"
+ : "=a" (ret), "=m" (*_ts)
+ : "0" (__NR_clock_gettime), [clock] "g" (_clkid), "c" (_ts)
+ : "edx");
+
+ return ret;
+}
+
static __always_inline
long gettimeofday_fallback(struct __kernel_old_timeval *_tv,
struct timezone *_tz)
diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index 2d1c1f241fd9..3b3dab53d611 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -18,6 +18,7 @@
* clock_mode.
* - gettimeofday_fallback(): fallback for gettimeofday.
* - clock_gettime_fallback(): fallback for clock_gettime.
+ * - clock_gettime_fallback(): fallback for clock_gettime32.
* - clock_getres_fallback(): fallback for clock_getres.
*/
#ifdef ENABLE_COMPAT_VDSO
@@ -51,7 +52,7 @@ static int do_hres(const struct vdso_data *vd, clockid_t clk,
ns = vdso_ts->nsec;
last = vd->cycle_last;
if (unlikely((s64)cycles < 0))
- return clock_gettime_fallback(clk, ts);
+ return -ENOSYS;

ns += vdso_calc_delta(cycles, last, vd->mask, vd->mult);
ns >>= vd->shift;
@@ -81,15 +82,15 @@ static void do_coarse(const struct vdso_data *vd,
clockid_t clk,
} while (unlikely(vdso_read_retry(vd, seq)));
}

-static __maybe_unused int
-__cvdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
+static int
+__cvdso_clock_gettime_common(clockid_t clock, struct __kernel_timespec *ts)
{
const struct vdso_data *vd = __arch_get_vdso_data();
u32 msk;

/* Check for negative values or invalid clocks */
if (unlikely((u32) clock >= MAX_CLOCKS))
- goto fallback;
+ return -EINVAL;

/*
* Convert the clockid to a bitmask and use it to check which
@@ -105,7 +106,15 @@ __cvdso_clock_gettime(clockid_t clock, struct
__kernel_timespec *ts)
return do_hres(&vd[CS_RAW], clock, ts);
}

-fallback:
+ return -EINVAL;
+}
+
+static __maybe_unused int
+__cvdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts)
+{
+ if (!__cvdso_clock_gettime_common(clock, ts))
+ return 0;
+
return clock_gettime_fallback(clock, ts);
}

@@ -113,22 +122,14 @@ static __maybe_unused int
__cvdso_clock_gettime32(clockid_t clock, struct old_timespec32 *res)
{
struct __kernel_timespec ts;
- int ret;
-
- if (res == NULL)
- goto fallback;
-
- ret = __cvdso_clock_gettime(clock, &ts);

- if (ret == 0) {
+ if (!__cvdso_clock_gettime_common(clock, &ts)) {
res->tv_sec = ts.tv_sec;
res->tv_nsec = ts.tv_nsec;
+ return 0;
}

- return ret;
-
-fallback:
- return clock_gettime_fallback(clock, (struct __kernel_timespec *)res);
+ return clock_gettime32_fallback(clock, res);
}

static __maybe_unused int

2019-07-28 18:17:07

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Sun, 28 Jul 2019, Arnd Bergmann wrote:

> On Sun, Jul 28, 2019 at 12:30 PM Thomas Gleixner <[email protected]> wrote:
> > On Sun, 28 Jul 2019, Arnd Bergmann wrote:
> > > On Sat, Jul 27, 2019 at 7:53 PM Andy Lutomirski <[email protected]> wrote:
>
> > Which is totally irrelevant because res is NULL and that NULL pointer check
> > should simply return -EFAULT, which is what the syscall fallback returns
> > because the pointer is NULL.
> >
> > But that NULL pointer check is inconsistent anyway:
> >
> > - 64 bit does not have it and never had
> >
> > - the vdso is not capable of handling faults properly anyway. If the
> > pointer is not valid, then it will segfault. So just preventing the
> > segfault for NULL is silly.
> >
> > I'm going to just remove it.
>
> Ah, you are right, I misread.
>
> Anyway, if we want to keep the traditional behavior and get fewer surprises
> for users of seccomp and anything else that might observe clock_gettime
> behavior, below is how I'd do it. (not even build tested, x86-only. I'll
> send a proper patch if this is where we want to take it).

I posted a series which fixes up the mess 2 hours before you sent this mail :)

2019-07-28 18:18:37

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [5.2 REGRESSION] Generic vDSO breaks seccomp-enabled userspace on i386

On Sun, 28 Jul 2019, Thomas Gleixner wrote:
> On Sun, 28 Jul 2019, Arnd Bergmann wrote:
>
> > On Sun, Jul 28, 2019 at 12:30 PM Thomas Gleixner <[email protected]> wrote:
> > > On Sun, 28 Jul 2019, Arnd Bergmann wrote:
> > > > On Sat, Jul 27, 2019 at 7:53 PM Andy Lutomirski <[email protected]> wrote:
> >
> > > Which is totally irrelevant because res is NULL and that NULL pointer check
> > > should simply return -EFAULT, which is what the syscall fallback returns
> > > because the pointer is NULL.
> > >
> > > But that NULL pointer check is inconsistent anyway:
> > >
> > > - 64 bit does not have it and never had
> > >
> > > - the vdso is not capable of handling faults properly anyway. If the
> > > pointer is not valid, then it will segfault. So just preventing the
> > > segfault for NULL is silly.
> > >
> > > I'm going to just remove it.
> >
> > Ah, you are right, I misread.
> >
> > Anyway, if we want to keep the traditional behavior and get fewer surprises
> > for users of seccomp and anything else that might observe clock_gettime
> > behavior, below is how I'd do it. (not even build tested, x86-only. I'll
> > send a proper patch if this is where we want to take it).
>
> I posted a series which fixes up the mess 2 hours before you sent this mail :)

And stupid me forgot to CC you. I was entirely sure that I did....

https://lkml.kernel.org/r/[email protected]

Sorry

tglx