2014-07-11 03:39:12

by Richard Guy Briggs

[permalink] [raw]
Subject: [PATCH 0/3] [RFC] X32: fix syscall_get_nr while not breaking seccomp BPF

This set reverts commit 8b4b9f2 which broke audit and potentially other users
of syscall_get_nr() which depend on that call as named without being overloaded
by architecture bits. It will satisfy other regular users of syscall_get_nr()
and syscall_get_arch() without changing the seccomp interface to BPF.

A new ARCH definition, AUDIT_ARCH_X86_X32, was added for syscall_get_arch().

Cc: Paul Moore <[email protected]>
Cc: Eric Paris <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Will Drewry <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Signed-off-by: Richard Guy Briggs <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]

Richard Guy Briggs (3):
audit: add AUDIT_ARCH_X86_X32 arch definition
seccomp: give BPF x32 bit when restoring x32 filter
Revert "x86: remove the x32 syscall bitmask from syscall_get_nr()"

arch/x86/include/asm/syscall.h | 8 ++++++--
include/uapi/linux/audit.h | 1 +
kernel/seccomp.c | 6 ++++++
3 files changed, 13 insertions(+), 2 deletions(-)


2014-07-11 03:39:16

by Richard Guy Briggs

[permalink] [raw]
Subject: [PATCH 3/3] [RFC] Revert "x86: remove the x32 syscall bitmask from syscall_get_nr()"

This reverts commit 8b4b9f27e57584f3d90e0bb84cf800ad81cfe3a1.
which broke audit and potentially other users of syscall_get_nr() which depend
on that call as named without being overloaded by architecture bits.

This patch along with
seccomp: give BPF x32 bit when restoring x32 filter
will satisfy other regular users of syscall_get_nr() and syscall_get_arch()
without changing the seccomp interface to BPF.

Cc: Paul Moore <[email protected]>
Cc: Eric Paris <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Will Drewry <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Signed-off-by: Richard Guy Briggs <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/syscall.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index d58b6be..8c1bb2b 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -30,13 +30,13 @@ extern const sys_call_ptr_t sys_call_table[];
*/
static inline int syscall_get_nr(struct task_struct *task, struct pt_regs *regs)
{
- return regs->orig_ax;
+ return regs->orig_ax & __SYSCALL_MASK;
}

static inline void syscall_rollback(struct task_struct *task,
struct pt_regs *regs)
{
- regs->ax = regs->orig_ax;
+ regs->ax = regs->orig_ax & __SYSCALL_MASK;
}

static inline long syscall_get_error(struct task_struct *task,
--
1.7.1

2014-07-11 03:39:14

by Richard Guy Briggs

[permalink] [raw]
Subject: [PATCH 1/3] [RFC] audit: add AUDIT_ARCH_X86_X32 arch definition

Add a definition for 32-bit native system calls under 64-bit x86 architectures.
This is distict from 32-bit emulation under 64-bit x86 architectures.

Cc: Paul Moore <[email protected]>
Cc: Eric Paris <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Will Drewry <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Signed-off-by: Richard Guy Briggs <[email protected]>
---
include/uapi/linux/audit.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index e15d6fc..4f5607f 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -374,6 +374,7 @@ enum {
#define AUDIT_ARCH_SPARC (EM_SPARC)
#define AUDIT_ARCH_SPARC64 (EM_SPARCV9|__AUDIT_ARCH_64BIT)
#define AUDIT_ARCH_X86_64 (EM_X86_64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
+#define AUDIT_ARCH_X86_X32 (EM_X86_64|__AUDIT_ARCH_LE)

#define AUDIT_PERM_EXEC 1
#define AUDIT_PERM_WRITE 2
--
1.7.1

2014-07-11 03:39:40

by Richard Guy Briggs

[permalink] [raw]
Subject: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

Commit
fca460f [email protected] 2012-02-19 07:56:26 -0800
x32: Handle the x32 system call flag

provided a method to multiplex architecture with the syscall number for X32
calls.

Commit
8b4b9f2 [email protected] 2013-02-15 12:21:43 -0500
x86: remove the x32 syscall bitmask from syscall_get_nr()

broke audit and potentially other users of syscall_get_nr() which depend on
that call as named.

Commit
audit: add AUDIT_ARCH_X86_X32 arch definition

is required to provide the new ARCH definition AUDIT_ARCH_X86_X32 for
syscall_get_arch().

This patch along with reverting 8b4b9f2 should satisfy other regular users of
syscall_get_nr() without changing the seccomp interface to BPF.

Cc: Paul Moore <[email protected]>
Cc: Eric Paris <[email protected]>
Cc: Al Viro <[email protected]>
Cc: Will Drewry <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Signed-off-by: Richard Guy Briggs <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/syscall.h | 4 ++++
kernel/seccomp.c | 6 ++++++
2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index d6a756a..d58b6be 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -236,6 +236,10 @@ static inline int syscall_get_arch(void)
return AUDIT_ARCH_I386;
#endif
/* Both x32 and x86_64 are considered "64-bit". */
+#ifdef CONFIG_X86_X32_ABI
+ if (task_pt_regs(current)->orig_ax & __X32_SYSCALL_BIT)
+ return AUDIT_ARCH_X86_X32;
+#endif
return AUDIT_ARCH_X86_64;
}
#endif /* CONFIG_X86_32 */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index b35c215..bc18214 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -73,6 +73,12 @@ static void populate_seccomp_data(struct seccomp_data *sd)

sd->nr = syscall_get_nr(task, regs);
sd->arch = syscall_get_arch();
+#ifdef CONFIG_X86_X32_ABI
+ if (sd->arch == AUDIT_ARCH_X86_X32) {
+ sd->arch = AUDIT_ARCH_X86_64;
+ sd->nr |= __X32_SYSCALL_BIT;
+ }
+#endif
syscall_get_arguments(task, regs, 0, 6, args);
sd->args[0] = args[0];
sd->args[1] = args[1];
--
1.7.1

2014-07-11 04:06:15

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On 07/10/2014 08:38 PM, Richard Guy Briggs wrote:
> Commit
> fca460f [email protected] 2012-02-19 07:56:26 -0800
> x32: Handle the x32 system call flag
>
> provided a method to multiplex architecture with the syscall number for X32
> calls.
>
> Commit
> 8b4b9f2 [email protected] 2013-02-15 12:21:43 -0500
> x86: remove the x32 syscall bitmask from syscall_get_nr()
>
> broke audit and potentially other users of syscall_get_nr() which depend on
> that call as named.
>
> Commit
> audit: add AUDIT_ARCH_X86_X32 arch definition
>
> is required to provide the new ARCH definition AUDIT_ARCH_X86_X32 for
> syscall_get_arch().
>
> This patch along with reverting 8b4b9f2 should satisfy other regular users of
> syscall_get_nr() without changing the seccomp interface to BPF.
>

Incidentally: do seccomp users know that on an x86-64 system you can
recevie system calls from any of the x86 architectures, regardless of
how the program is invoked? (This is unusual, so normally denying those
"alien" calls is the right thing to do.)

-hpa

2014-07-11 16:11:43

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On Thursday, July 10, 2014 09:06:02 PM H. Peter Anvin wrote:
> Incidentally: do seccomp users know that on an x86-64 system you can
> recevie system calls from any of the x86 architectures, regardless of
> how the program is invoked? (This is unusual, so normally denying those
> "alien" calls is the right thing to do.)

I obviously can't speak for all seccomp users, but libseccomp handles this by
checking the seccomp_data->arch value at the start of the filter and killing
(by default) any non-native architectures. If you want, you can change this
default behavior or add support for other architectures (e.g. create a filter
that allows both x86-64 and x32 but disallows x86, or any combination of the
three for that matter).

--
paul moore
security and virtualization @ redhat

2014-07-11 16:14:21

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On 07/11/2014 09:11 AM, Paul Moore wrote:
> On Thursday, July 10, 2014 09:06:02 PM H. Peter Anvin wrote:
>> Incidentally: do seccomp users know that on an x86-64 system you can
>> recevie system calls from any of the x86 architectures, regardless of
>> how the program is invoked? (This is unusual, so normally denying those
>> "alien" calls is the right thing to do.)
>
> I obviously can't speak for all seccomp users, but libseccomp handles this by
> checking the seccomp_data->arch value at the start of the filter and killing
> (by default) any non-native architectures. If you want, you can change this
> default behavior or add support for other architectures (e.g. create a filter
> that allows both x86-64 and x32 but disallows x86, or any combination of the
> three for that matter).
>

OK, that seems reasonable.

-hpa

2014-07-11 16:15:52

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH 1/3] [RFC] audit: add AUDIT_ARCH_X86_X32 arch definition

On Thursday, July 10, 2014 11:38:12 PM Richard Guy Briggs wrote:
> Add a definition for 32-bit native system calls under 64-bit x86
> architectures. This is distict from 32-bit emulation under 64-bit x86
> architectures.
>
> Cc: Paul Moore <[email protected]>
> Cc: Eric Paris <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: Will Drewry <[email protected]>
> Cc: H. Peter Anvin <[email protected]>
> Signed-off-by: Richard Guy Briggs <[email protected]>
> ---
> include/uapi/linux/audit.h | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index e15d6fc..4f5607f 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -374,6 +374,7 @@ enum {
> #define AUDIT_ARCH_SPARC (EM_SPARC)
> #define AUDIT_ARCH_SPARC64 (EM_SPARCV9|__AUDIT_ARCH_64BIT)
> #define AUDIT_ARCH_X86_64 (EM_X86_64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
> +#define AUDIT_ARCH_X86_X32 (EM_X86_64|__AUDIT_ARCH_LE)
>
> #define AUDIT_PERM_EXEC 1
> #define AUDIT_PERM_WRITE 2

While I'm opposed to the other patches in this series (comments to follow), I
think this is a worthwhile addition and arguably should have been done when
x32 was merged.

That said, this change should probably be included in whatever patch first
makes use of this new value as this patch does nothing by itself.

--
paul moore
security and virtualization @ redhat

2014-07-11 16:16:53

by Eric Paris

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On Fri, 2014-07-11 at 12:11 -0400, Paul Moore wrote:
> On Thursday, July 10, 2014 09:06:02 PM H. Peter Anvin wrote:
> > Incidentally: do seccomp users know that on an x86-64 system you can
> > recevie system calls from any of the x86 architectures, regardless of
> > how the program is invoked? (This is unusual, so normally denying those
> > "alien" calls is the right thing to do.)
>
> I obviously can't speak for all seccomp users, but libseccomp handles this by
> checking the seccomp_data->arch value at the start of the filter and killing
> (by default) any non-native architectures. If you want, you can change this
> default behavior or add support for other architectures (e.g. create a filter
> that allows both x86-64 and x32 but disallows x86, or any combination of the
> three for that matter).

Maybe libseccomp does some HORRIFIC contortions under the hood, but the
interface is crap... Since seccomp_data->arch can't distinguish between
X32 and X86_64. If I write a seccomp filter which says

KILL arch != x86_64
KILL init_module
ALLOW everything else

I can still call init_module, I just have to use the X32 variant.

If libseccomp is translating:

KILL arch != x86_64 into:

KILL arch != x86_64
KILL syscall_nr >= 2000

That's just showing how dumb the kernel interface is... Good for you
guys, but the kernel is just being dumb :)

2014-07-11 16:21:23

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On Friday, July 11, 2014 12:16:47 PM Eric Paris wrote:
> On Fri, 2014-07-11 at 12:11 -0400, Paul Moore wrote:
> > On Thursday, July 10, 2014 09:06:02 PM H. Peter Anvin wrote:
> > > Incidentally: do seccomp users know that on an x86-64 system you can
> > > recevie system calls from any of the x86 architectures, regardless of
> > > how the program is invoked? (This is unusual, so normally denying those
> > > "alien" calls is the right thing to do.)
> >
> > I obviously can't speak for all seccomp users, but libseccomp handles this
> > by checking the seccomp_data->arch value at the start of the filter and
> > killing (by default) any non-native architectures. If you want, you can
> > change this default behavior or add support for other architectures (e.g.
> > create a filter that allows both x86-64 and x32 but disallows x86, or any
> > combination of the three for that matter).
>
> Maybe libseccomp does some HORRIFIC contortions under the hood, but the
> interface is crap... Since seccomp_data->arch can't distinguish between
> X32 and X86_64. If I write a seccomp filter which says
>
> KILL arch != x86_64
> KILL init_module
> ALLOW everything else
>
> I can still call init_module, I just have to use the X32 variant.
>
> If libseccomp is translating:
>
> KILL arch != x86_64 into:
>
> KILL arch != x86_64
> KILL syscall_nr >= 2000
>
> That's just showing how dumb the kernel interface is... Good for you
> guys, but the kernel is just being dumb :)

You're not going to hear me ever say that I like how the x32 ABI was done, it
is a real mess from a seccomp filter point of view and we have to do some
nasty stuff in libseccomp to make it all work correctly (see my comments on
the libseccomp-devel list regarding my severe displeasure over x32), but
what's done is done.

I think it's too late to change the x32 seccomp filter ABI.

--
paul moore
security and virtualization @ redhat

2014-07-11 16:23:38

by Eric Paris

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On Fri, 2014-07-11 at 12:21 -0400, Paul Moore wrote:
> On Friday, July 11, 2014 12:16:47 PM Eric Paris wrote:
> > On Fri, 2014-07-11 at 12:11 -0400, Paul Moore wrote:
> > > On Thursday, July 10, 2014 09:06:02 PM H. Peter Anvin wrote:
> > > > Incidentally: do seccomp users know that on an x86-64 system you can
> > > > recevie system calls from any of the x86 architectures, regardless of
> > > > how the program is invoked? (This is unusual, so normally denying those
> > > > "alien" calls is the right thing to do.)
> > >
> > > I obviously can't speak for all seccomp users, but libseccomp handles this
> > > by checking the seccomp_data->arch value at the start of the filter and
> > > killing (by default) any non-native architectures. If you want, you can
> > > change this default behavior or add support for other architectures (e.g.
> > > create a filter that allows both x86-64 and x32 but disallows x86, or any
> > > combination of the three for that matter).
> >
> > Maybe libseccomp does some HORRIFIC contortions under the hood, but the
> > interface is crap... Since seccomp_data->arch can't distinguish between
> > X32 and X86_64. If I write a seccomp filter which says
> >
> > KILL arch != x86_64
> > KILL init_module
> > ALLOW everything else
> >
> > I can still call init_module, I just have to use the X32 variant.
> >
> > If libseccomp is translating:
> >
> > KILL arch != x86_64 into:
> >
> > KILL arch != x86_64
> > KILL syscall_nr >= 2000
> >
> > That's just showing how dumb the kernel interface is... Good for you
> > guys, but the kernel is just being dumb :)
>
> You're not going to hear me ever say that I like how the x32 ABI was done, it
> is a real mess from a seccomp filter point of view and we have to do some
> nasty stuff in libseccomp to make it all work correctly (see my comments on
> the libseccomp-devel list regarding my severe displeasure over x32), but
> what's done is done.
>
> I think it's too late to change the x32 seccomp filter ABI.

So we have a security interface that is damn near impossible to get
right. Perfect.

I think this explains exactly why I support this idea. Make X32 look
like everyone else and put these custom horrific hacks in seccomp if we
are unwilling to 'do it right'

Honestly, how many people are using seccomp on X32 and would be horribly
pissed if we just fixed it?

2014-07-11 16:30:54

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On 07/11/2014 09:23 AM, Eric Paris wrote:
>>
>> You're not going to hear me ever say that I like how the x32 ABI was done, it
>> is a real mess from a seccomp filter point of view and we have to do some
>> nasty stuff in libseccomp to make it all work correctly (see my comments on
>> the libseccomp-devel list regarding my severe displeasure over x32), but
>> what's done is done.
>>
>> I think it's too late to change the x32 seccomp filter ABI.
>
> So we have a security interface that is damn near impossible to get
> right. Perfect.
>
> I think this explains exactly why I support this idea. Make X32 look
> like everyone else and put these custom horrific hacks in seccomp if we
> are unwilling to 'do it right'
>
> Honestly, how many people are using seccomp on X32 and would be horribly
> pissed if we just fixed it?
>

The bigger issue is probably if we will open a problem with the older
kernels.

-hpa

2014-07-11 16:32:10

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On Friday, July 11, 2014 12:23:33 PM Eric Paris wrote:
> On Fri, 2014-07-11 at 12:21 -0400, Paul Moore wrote:
> > On Friday, July 11, 2014 12:16:47 PM Eric Paris wrote:
> > > On Fri, 2014-07-11 at 12:11 -0400, Paul Moore wrote:
> > > > On Thursday, July 10, 2014 09:06:02 PM H. Peter Anvin wrote:
> > > > > Incidentally: do seccomp users know that on an x86-64 system you can
> > > > > recevie system calls from any of the x86 architectures, regardless
> > > > > of
> > > > > how the program is invoked? (This is unusual, so normally denying
> > > > > those
> > > > > "alien" calls is the right thing to do.)
> > > >
> > > > I obviously can't speak for all seccomp users, but libseccomp handles
> > > > this
> > > > by checking the seccomp_data->arch value at the start of the filter
> > > > and
> > > > killing (by default) any non-native architectures. If you want, you
> > > > can
> > > > change this default behavior or add support for other architectures
> > > > (e.g.
> > > > create a filter that allows both x86-64 and x32 but disallows x86, or
> > > > any
> > > > combination of the three for that matter).
> > >
> > > Maybe libseccomp does some HORRIFIC contortions under the hood, but the
> > > interface is crap... Since seccomp_data->arch can't distinguish between
> > > X32 and X86_64. If I write a seccomp filter which says
> > >
> > > KILL arch != x86_64
> > > KILL init_module
> > > ALLOW everything else
> > >
> > > I can still call init_module, I just have to use the X32 variant.
> > >
> > > If libseccomp is translating:
> > >
> > > KILL arch != x86_64 into:
> > >
> > > KILL arch != x86_64
> > > KILL syscall_nr >= 2000
> > >
> > > That's just showing how dumb the kernel interface is... Good for you
> > > guys, but the kernel is just being dumb :)
> >
> > You're not going to hear me ever say that I like how the x32 ABI was done,
> > it is a real mess from a seccomp filter point of view and we have to do
> > some nasty stuff in libseccomp to make it all work correctly (see my
> > comments on the libseccomp-devel list regarding my severe displeasure
> > over x32), but what's done is done.
> >
> > I think it's too late to change the x32 seccomp filter ABI.
>
> So we have a security interface that is damn near impossible to get
> right. Perfect.

What? Having to do two comparisons instead of one is "damn near impossible"?
I think that might be a bit of an overreaction don't you think?

> I think this explains exactly why I support this idea. Make X32 look
> like everyone else ...

You do realize that this patch set makes x32 the odd man out by having
syscall_get_nr() return a different syscall number than what was used to make
the syscall? I don't understand how that makes "x32 look like everyone else".

> ... and put these custom horrific hacks in seccomp if we are unwilling to
> 'do it right'

If you want to add the new x32 audit arch #define, go for it, like I said that
was something that I feel should have been in there from the beginning. As
far as I'm concerned you can even put a hack in kernel/seccomp.c to rewrite
the arch token value if it makes your life easier.

> Honestly, how many people are using seccomp on X32 and would be horribly
> pissed if we just fixed it?

Okay, please stop suggesting we break the x32 kernel/user interface to
workaround a flaw in audit. I get that it sucks for audit, I really do, but
this is audit's problem.

--
paul moore
security and virtualization @ redhat

2014-07-11 16:36:28

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On Thursday, July 10, 2014 11:38:13 PM Richard Guy Briggs wrote:
> Commit
> fca460f [email protected] 2012-02-19 07:56:26 -0800
> x32: Handle the x32 system call flag
>
> provided a method to multiplex architecture with the syscall number for X32
> calls.
>
> Commit
> 8b4b9f2 [email protected] 2013-02-15 12:21:43 -0500
> x86: remove the x32 syscall bitmask from syscall_get_nr()
>
> broke audit and potentially other users of syscall_get_nr() which depend on
> that call as named.

Arguably audit is broken anyway by not correctly treating syscall numbers as
32 bit integers like everyone else.

The commit above, 8b4b9f2, changed syscall_get_nr() so that it returned the
same syscall number that is used by the architecture's ABI; just like every*
other architecture in the kernel.

* Admittedly I didn't check every architecture's implementation, but after a
half dozen I stopped checking as there was a definite trend.

{snip}

> diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
> index d6a756a..d58b6be 100644
> --- a/arch/x86/include/asm/syscall.h
> +++ b/arch/x86/include/asm/syscall.h
> @@ -236,6 +236,10 @@ static inline int syscall_get_arch(void)
> return AUDIT_ARCH_I386;
> #endif
> /* Both x32 and x86_64 are considered "64-bit". */
> +#ifdef CONFIG_X86_X32_ABI
> + if (task_pt_regs(current)->orig_ax & __X32_SYSCALL_BIT)
> + return AUDIT_ARCH_X86_X32;
> +#endif

No. See my comments above and in other parts of this thread.

> return AUDIT_ARCH_X86_64;
> }
> #endif /* CONFIG_X86_32 */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index b35c215..bc18214 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -73,6 +73,12 @@ static void populate_seccomp_data(struct seccomp_data
> *sd)
>
> sd->nr = syscall_get_nr(task, regs);
> sd->arch = syscall_get_arch();
> +#ifdef CONFIG_X86_X32_ABI
> + if (sd->arch == AUDIT_ARCH_X86_X32) {
> + sd->arch = AUDIT_ARCH_X86_64;
> + sd->nr |= __X32_SYSCALL_BIT;
> + }
> +#endif

Once again, I'm not really sure I need to comment further here, but don't
change syscall_get_nr(), it should return the same syscall number as was used
by userspace to initiate the syscall. If you really want to use the new
AUDIT_ARCH_X86_X32 macro/define, go ahead, but make sure you rewrite it to the
x86-64 value here so as to not break compatibility with existing seccomp
filter users.

--
paul moore
security and virtualization @ redhat

2014-07-11 16:45:03

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On 07/11/2014 09:36 AM, Paul Moore wrote:
>
> Arguably audit is broken anyway by not correctly treating syscall numbers as
> 32 bit integers like everyone else.
>

That is really the root cause of the problem. x86 is not the only
architecture with a sparse syscall numbering scheme (in fact the x32
method was based on the MIPS syscall numbering scheme.)

What syscall_get_nr() returns becomes a matter of definition at that point.

-hpa

2014-07-11 18:31:15

by Eric Paris

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On Fri, 2014-07-11 at 12:32 -0400, Paul Moore wrote:
> On Friday, July 11, 2014 12:23:33 PM Eric Paris wrote:
> > On Fri, 2014-07-11 at 12:21 -0400, Paul Moore wrote:
> > > On Friday, July 11, 2014 12:16:47 PM Eric Paris wrote:
> > > > On Fri, 2014-07-11 at 12:11 -0400, Paul Moore wrote:
> > > > > On Thursday, July 10, 2014 09:06:02 PM H. Peter Anvin wrote:
> > > > > > Incidentally: do seccomp users know that on an x86-64 system you can
> > > > > > recevie system calls from any of the x86 architectures, regardless
> > > > > > of
> > > > > > how the program is invoked? (This is unusual, so normally denying
> > > > > > those
> > > > > > "alien" calls is the right thing to do.)
> > > > >
> > > > > I obviously can't speak for all seccomp users, but libseccomp handles
> > > > > this
> > > > > by checking the seccomp_data->arch value at the start of the filter
> > > > > and
> > > > > killing (by default) any non-native architectures. If you want, you
> > > > > can
> > > > > change this default behavior or add support for other architectures
> > > > > (e.g.
> > > > > create a filter that allows both x86-64 and x32 but disallows x86, or
> > > > > any
> > > > > combination of the three for that matter).
> > > >
> > > > Maybe libseccomp does some HORRIFIC contortions under the hood, but the
> > > > interface is crap... Since seccomp_data->arch can't distinguish between
> > > > X32 and X86_64. If I write a seccomp filter which says
> > > >
> > > > KILL arch != x86_64
> > > > KILL init_module
> > > > ALLOW everything else
> > > >
> > > > I can still call init_module, I just have to use the X32 variant.
> > > >
> > > > If libseccomp is translating:
> > > >
> > > > KILL arch != x86_64 into:
> > > >
> > > > KILL arch != x86_64
> > > > KILL syscall_nr >= 2000
> > > >
> > > > That's just showing how dumb the kernel interface is... Good for you
> > > > guys, but the kernel is just being dumb :)
> > >
> > > You're not going to hear me ever say that I like how the x32 ABI was done,
> > > it is a real mess from a seccomp filter point of view and we have to do
> > > some nasty stuff in libseccomp to make it all work correctly (see my
> > > comments on the libseccomp-devel list regarding my severe displeasure
> > > over x32), but what's done is done.
> > >
> > > I think it's too late to change the x32 seccomp filter ABI.
> >
> > So we have a security interface that is damn near impossible to get
> > right. Perfect.
>
> What? Having to do two comparisons instead of one is "damn near impossible"?
> I think that might be a bit of an overreaction don't you think?

Actually no. How can a normal userspace application coder POSSIBLY know
this? Find this thread on an e-mail list, by accident?
>
> > I think this explains exactly why I support this idea. Make X32 look
> > like everyone else ...
>
> You do realize that this patch set makes x32 the odd man out by having
> syscall_get_nr() return a different syscall number than what was used to make
> the syscall? I don't understand how that makes "x32 look like everyone else".

Ok, I buy the __X32_SYSCALL_BIT argument. It can be dealt with in
audit. No problem. We don't need to strip it in syscall_get_nr().
I'll gladly concede that part of the patch series.

But given an x86_64 kernel a seccomp filter writer has to know about X32
and how to write rules to block the X32 ABI. And I stick with my
assessment that x32 + seccomp is darn near impossible for a normal
developer to handle.

Heck, even chromium took months to realize that x32 was a weird beast.
And they got it wrong on their first try. Their original implementation
didn't handle __X32_SYSCALL_BIT quite right. Looking at their code I'm
still not sure it does the right thing. And they are the EXPERTS. They
wrote seccomp!

> > Honestly, how many people are using seccomp on X32 and would be horribly
> > pissed if we just fixed it?
>
> Okay, please stop suggesting we break the x32 kernel/user interface to
> workaround a flaw in audit. I get that it sucks for audit, I really do, but
> this is audit's problem.

No one is asking to break X32 to fix audit. Audit can handle itself. I
don't want anything in the kernel to pretend that X32 is X86_64. It
isn't. It has its own syscall table. Its own syscalls. Its own ABI.
I'm suggesting to fix how seccomp exposes X32 information because it is
a HORRIBLE interface that even the experts have gotten wrong, over and
over and over.

I suggest we accept it as breakage and just return AUDIT_ARCH_X32.
(Leaving the _X32_SYSCALL_BIT exposed as it is today)

But I'd love to hear some thoughts on how that is a bad thing. If no
one is using the x32 seccomp abi, lets fix it. If someone is, lets see
what the fallout from fixing it will be.

-Eric

2014-07-11 19:36:45

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On Friday, July 11, 2014 02:31:06 PM Eric Paris wrote:
> On Fri, 2014-07-11 at 12:32 -0400, Paul Moore wrote:
> > On Friday, July 11, 2014 12:23:33 PM Eric Paris wrote:

...

> > > So we have a security interface that is damn near impossible to get
> > > right. Perfect.
> >
> > What? Having to do two comparisons instead of one is "damn near
> > impossible"? I think that might be a bit of an overreaction don't you
> > think?
>
> Actually no. How can a normal userspace application coder POSSIBLY know
> this? Find this thread on an e-mail list, by accident?

I suppose some clarification of perspective is in order. Personally, I don't
think a "normal" (which brings up the age old question: what is normal
anyway?) userspace developer is going to write their own seccomp BPF filter,
there is just too much of a barrier to entry and it is way too easy to get it
wrong. This is why I started the libseccomp project.

However, I think there *may* be a solution to satisfy us both, see below.

> > > I think this explains exactly why I support this idea. Make X32 look
> > > like everyone else ...
> >
> > You do realize that this patch set makes x32 the odd man out by having
> > syscall_get_nr() return a different syscall number than what was used to
> > make the syscall? I don't understand how that makes "x32 look like
> > everyone else".
>
> Ok, I buy the __X32_SYSCALL_BIT argument. It can be dealt with in
> audit. No problem. We don't need to strip it in syscall_get_nr().
> I'll gladly concede that part of the patch series.
>
> But given an x86_64 kernel a seccomp filter writer has to know about X32
> and how to write rules to block the X32 ABI. And I stick with my
> assessment that x32 + seccomp is darn near impossible for a normal
> developer to handle.

No argument here; like I said, I think seccomp BPF filters are too much for
most folks regardless of the arch.

> Heck, even chromium took months to realize that x32 was a weird beast.
> And they got it wrong on their first try. Their original implementation
> didn't handle __X32_SYSCALL_BIT quite right. Looking at their code I'm
> still not sure it does the right thing. And they are the EXPERTS. They
> wrote seccomp!

I think you're helping prove my point. Or I'm providing yours. Or something.

However, my point about breaking userspace still stands. Assuming the kernel
interface is functional, regardless of warts, I stand fast in that anything
that breaks the kernel/userspace interface is bad.

> > > Honestly, how many people are using seccomp on X32 and would be horribly
> > > pissed if we just fixed it?
> >
> > Okay, please stop suggesting we break the x32 kernel/user interface to
> > workaround a flaw in audit. I get that it sucks for audit, I really do,
> > but this is audit's problem.
>
> No one is asking to break X32 to fix audit. Audit can handle itself. I
> don't want anything in the kernel to pretend that X32 is X86_64. It
> isn't. It has its own syscall table. Its own syscalls. Its own ABI.
> I'm suggesting to fix how seccomp exposes X32 information because it is
> a HORRIBLE interface that even the experts have gotten wrong, over and
> over and over.

See my comments above.

> I suggest we accept it as breakage and just return AUDIT_ARCH_X32.
> (Leaving the _X32_SYSCALL_BIT exposed as it is today)
>
> But I'd love to hear some thoughts on how that is a bad thing. If no
> one is using the x32 seccomp abi, lets fix it. If someone is, lets see
> what the fallout from fixing it will be.

This would break the seccomp filter API and any application that uses it
properly. I'm not excited about the idea of "let's just break the interface
and see who notices philosophy"; that's a very bad practice IMHO.

Anyway, getting back to the idea I mentioned earlier ... as many of you may
know, Kees (added to the CC line) is working on some seccomp filter
improvements which will result in a new seccomp syscall. Perhaps one way
forward is to preserve everything as it is currently with the prctl()
interface, but with the new seccomp() based interface we fixup x32 and use the
new AUDIT_ARCH_X32 token? It might result in a bit of ugliness in some of the
kernel, but I don't think it would be too bad, and I think it would address
both our concerns.

Thoughts?

--
paul moore
security and virtualization @ redhat

2014-07-11 22:48:24

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On Fri, Jul 11, 2014 at 12:36 PM, Paul Moore <[email protected]> wrote:
> Anyway, getting back to the idea I mentioned earlier ... as many of you may
> know, Kees (added to the CC line) is working on some seccomp filter
> improvements which will result in a new seccomp syscall. Perhaps one way
> forward is to preserve everything as it is currently with the prctl()
> interface, but with the new seccomp() based interface we fixup x32 and use the
> new AUDIT_ARCH_X32 token? It might result in a bit of ugliness in some of the
> kernel, but I don't think it would be too bad, and I think it would address
> both our concerns.

Adding AUDIT_ARCH_X32: yes please. (On that note, the comment "/* Both
x32 and x86_64 are considered "64-bit". */" should be changed...)

Just so I understand: currently x86_64 and x32 both present as
AUDIT_ARCH_X86_64. The x32 syscalls are seen as in a different range
(due to the set high bit).

The seccomp used in Chrome, Chrome OS, and vsftpd should all only do
whitelisting by both arch and syscall, so adding AUDIT_ARCH_X32
without setting __X32_SYSCALL_BIT would be totally fine (it would
catch the arch instead of the syscall). This sounds similar to how
libseccomp is doing things, so these should be fine.

The only project I know of doing blacklisting is lxc, and Eric's
example looks a lot like a discussion I saw with lxc and init_module.
:) So it sounds like we can get this right there.

I'd like to avoid carrying a delta on filter logic based on the prctl
vs syscall entry. Can we find any userspace filters being used that a
"correct" fix would break? (If so, then yes, we'll need to do this
proposed "via prctl or via syscall?" change.)

-Kees

--
Kees Cook
Chrome OS Security

2014-07-11 22:52:45

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On Fri, Jul 11, 2014 at 3:48 PM, Kees Cook <[email protected]> wrote:
> On Fri, Jul 11, 2014 at 12:36 PM, Paul Moore <[email protected]> wrote:
>> Anyway, getting back to the idea I mentioned earlier ... as many of you may
>> know, Kees (added to the CC line) is working on some seccomp filter
>> improvements which will result in a new seccomp syscall. Perhaps one way
>> forward is to preserve everything as it is currently with the prctl()
>> interface, but with the new seccomp() based interface we fixup x32 and use the
>> new AUDIT_ARCH_X32 token? It might result in a bit of ugliness in some of the
>> kernel, but I don't think it would be too bad, and I think it would address
>> both our concerns.
>
> Adding AUDIT_ARCH_X32: yes please. (On that note, the comment "/* Both
> x32 and x86_64 are considered "64-bit". */" should be changed...)
>
> Just so I understand: currently x86_64 and x32 both present as
> AUDIT_ARCH_X86_64. The x32 syscalls are seen as in a different range
> (due to the set high bit).
>
> The seccomp used in Chrome, Chrome OS, and vsftpd should all only do
> whitelisting by both arch and syscall, so adding AUDIT_ARCH_X32
> without setting __X32_SYSCALL_BIT would be totally fine (it would
> catch the arch instead of the syscall). This sounds similar to how
> libseccomp is doing things, so these should be fine.

I should clarify: seccomp expects to find whatever is sent as the
syscall nr... as in the __NR_read used like this:

BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
offsetof(struct seccomp_data, nr)),
BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 0, 1),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),

Are there native x32 users yet? What does __NR_read resolve to via the
uapi on a native x32 userspace?

-Kees

> The only project I know of doing blacklisting is lxc, and Eric's
> example looks a lot like a discussion I saw with lxc and init_module.
> :) So it sounds like we can get this right there.
>
> I'd like to avoid carrying a delta on filter logic based on the prctl
> vs syscall entry. Can we find any userspace filters being used that a
> "correct" fix would break? (If so, then yes, we'll need to do this
> proposed "via prctl or via syscall?" change.)
>
> -Kees
>
> --
> Kees Cook
> Chrome OS Security



--
Kees Cook
Chrome OS Security

2014-07-11 22:58:12

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

It includes the X32 bit.

On July 11, 2014 3:52:42 PM PDT, Kees Cook <[email protected]> wrote:
>On Fri, Jul 11, 2014 at 3:48 PM, Kees Cook <[email protected]>
>wrote:
>> On Fri, Jul 11, 2014 at 12:36 PM, Paul Moore <[email protected]>
>wrote:
>>> Anyway, getting back to the idea I mentioned earlier ... as many of
>you may
>>> know, Kees (added to the CC line) is working on some seccomp filter
>>> improvements which will result in a new seccomp syscall. Perhaps
>one way
>>> forward is to preserve everything as it is currently with the
>prctl()
>>> interface, but with the new seccomp() based interface we fixup x32
>and use the
>>> new AUDIT_ARCH_X32 token? It might result in a bit of ugliness in
>some of the
>>> kernel, but I don't think it would be too bad, and I think it would
>address
>>> both our concerns.
>>
>> Adding AUDIT_ARCH_X32: yes please. (On that note, the comment "/*
>Both
>> x32 and x86_64 are considered "64-bit". */" should be changed...)
>>
>> Just so I understand: currently x86_64 and x32 both present as
>> AUDIT_ARCH_X86_64. The x32 syscalls are seen as in a different range
>> (due to the set high bit).
>>
>> The seccomp used in Chrome, Chrome OS, and vsftpd should all only do
>> whitelisting by both arch and syscall, so adding AUDIT_ARCH_X32
>> without setting __X32_SYSCALL_BIT would be totally fine (it would
>> catch the arch instead of the syscall). This sounds similar to how
>> libseccomp is doing things, so these should be fine.
>
>I should clarify: seccomp expects to find whatever is sent as the
>syscall nr... as in the __NR_read used like this:
>
> BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
> offsetof(struct seccomp_data, nr)),
> BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 0, 1),
> BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
> BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
>
>Are there native x32 users yet? What does __NR_read resolve to via the
>uapi on a native x32 userspace?
>
>-Kees
>
>> The only project I know of doing blacklisting is lxc, and Eric's
>> example looks a lot like a discussion I saw with lxc and init_module.
>> :) So it sounds like we can get this right there.
>>
>> I'd like to avoid carrying a delta on filter logic based on the prctl
>> vs syscall entry. Can we find any userspace filters being used that a
>> "correct" fix would break? (If so, then yes, we'll need to do this
>> proposed "via prctl or via syscall?" change.)
>>
>> -Kees
>>
>> --
>> Kees Cook
>> Chrome OS Security

--
Sent from my mobile phone. Please pardon brevity and lack of formatting.

2014-07-11 23:02:48

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On Fri, Jul 11, 2014 at 3:55 PM, H. Peter Anvin <[email protected]> wrote:
> It includes the X32 bit.

If the uapi for __NR_* includes the x32 bit, then that's what seccomp
filters must be seeing. Building seccomp filters is documented to use
the __NR_* values.

-Kees

>
> On July 11, 2014 3:52:42 PM PDT, Kees Cook <[email protected]> wrote:
>>On Fri, Jul 11, 2014 at 3:48 PM, Kees Cook <[email protected]>
>>wrote:
>>> On Fri, Jul 11, 2014 at 12:36 PM, Paul Moore <[email protected]>
>>wrote:
>>>> Anyway, getting back to the idea I mentioned earlier ... as many of
>>you may
>>>> know, Kees (added to the CC line) is working on some seccomp filter
>>>> improvements which will result in a new seccomp syscall. Perhaps
>>one way
>>>> forward is to preserve everything as it is currently with the
>>prctl()
>>>> interface, but with the new seccomp() based interface we fixup x32
>>and use the
>>>> new AUDIT_ARCH_X32 token? It might result in a bit of ugliness in
>>some of the
>>>> kernel, but I don't think it would be too bad, and I think it would
>>address
>>>> both our concerns.
>>>
>>> Adding AUDIT_ARCH_X32: yes please. (On that note, the comment "/*
>>Both
>>> x32 and x86_64 are considered "64-bit". */" should be changed...)
>>>
>>> Just so I understand: currently x86_64 and x32 both present as
>>> AUDIT_ARCH_X86_64. The x32 syscalls are seen as in a different range
>>> (due to the set high bit).
>>>
>>> The seccomp used in Chrome, Chrome OS, and vsftpd should all only do
>>> whitelisting by both arch and syscall, so adding AUDIT_ARCH_X32
>>> without setting __X32_SYSCALL_BIT would be totally fine (it would
>>> catch the arch instead of the syscall). This sounds similar to how
>>> libseccomp is doing things, so these should be fine.
>>
>>I should clarify: seccomp expects to find whatever is sent as the
>>syscall nr... as in the __NR_read used like this:
>>
>> BPF_STMT(BPF_LD+BPF_W+BPF_ABS,
>> offsetof(struct seccomp_data, nr)),
>> BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_read, 0, 1),
>> BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
>> BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),
>>
>>Are there native x32 users yet? What does __NR_read resolve to via the
>>uapi on a native x32 userspace?
>>
>>-Kees
>>
>>> The only project I know of doing blacklisting is lxc, and Eric's
>>> example looks a lot like a discussion I saw with lxc and init_module.
>>> :) So it sounds like we can get this right there.
>>>
>>> I'd like to avoid carrying a delta on filter logic based on the prctl
>>> vs syscall entry. Can we find any userspace filters being used that a
>>> "correct" fix would break? (If so, then yes, we'll need to do this
>>> proposed "via prctl or via syscall?" change.)
>>>
>>> -Kees
>>>
>>> --
>>> Kees Cook
>>> Chrome OS Security
>
> --
> Sent from my mobile phone. Please pardon brevity and lack of formatting.



--
Kees Cook
Chrome OS Security

2014-07-11 23:12:28

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH 2/3] [RFC] seccomp: give BPF x32 bit when restoring x32 filter

On Jul 11, 2014 3:48 PM, "Kees Cook" <[email protected]> wrote:
>
> On Fri, Jul 11, 2014 at 12:36 PM, Paul Moore <[email protected]> wrote:
> > Anyway, getting back to the idea I mentioned earlier ... as many of you may
> > know, Kees (added to the CC line) is working on some seccomp filter
> > improvements which will result in a new seccomp syscall. Perhaps one way
> > forward is to preserve everything as it is currently with the prctl()
> > interface, but with the new seccomp() based interface we fixup x32 and use the
> > new AUDIT_ARCH_X32 token? It might result in a bit of ugliness in some of the
> > kernel, but I don't think it would be too bad, and I think it would address
> > both our concerns.
>
> Adding AUDIT_ARCH_X32: yes please. (On that note, the comment "/* Both
> x32 and x86_64 are considered "64-bit". */" should be changed...)
>

I admit I'm not convinced that the current situation is really wrong.
The "audit arch" uniquely defines a mapping between the actual syscall
and nr and the regs. The natural way to convert between seccomp_data
and the syscall entry and regs is completely correct.

For things like ARM OABI, the situation is different: OABI and EABI
really do work differently wrt the interpretation of the syscall regs.
This isn't the case for x32.

Of course, the audit code screws this up completely, but audit is
disabled for x32. My upcoming seccomp patchset will clean it up a
little. IMO there's no actual audit ABI to preserve for x32, because
that code has never been anything other than terminally fvcked.

--Andy