LinuxLists.cc - Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Wed, May 25, 2011 at 11:01 AM, Kees Cook <[email protected]> wrote:
>
> Can we just go back to the original spec? A lot of people were excited
> about the prctl() API as done in Will's earlier patchset, we don't lose the
> extremely useful "enable_on_exec" feature, and we can get away from all
> this disagreement.

.. and quite frankly, I'm not even convinced about the original simpler spec.

Security is a morass. People come up with cool ideas every day, and
nobody actually uses them - or if they use them, they are just a
maintenance nightmare.

Quite frankly, limiting pathname access by some prefix is "cool", but
it's basically useless.

That's not where security problems are.

Security problems are in the odd corners - ioctl's, /proc files,
random small interfaces that aren't just about file access.

And who would *use* this thing in real life? Nobody. In order to sell
me on a new security interface, give me a real actual use case that is
security-conscious and relevant to real users.

For things like web servers that actually want to limit filename
lookup, we'd be *much* better off with a few new flags to
pathname lookup that say "don't follow symlinks" and "don't follow
'..'". Things like that can actually be beneficial to
security-conscious programming, with very little overhead. Some of
those things currently look up pathnames one component at a time,
because they can't afford to not do so. That's a *much* better model
for the whole "only limit to this subtree" case that was quoted
sometime early in this thread.
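
A minimal Python sketch of the one-component-at-a-time lookup mentioned
above, as user space has to do it without such flags. The helper name
open_beneath is hypothetical and not from this thread; it assumes a
platform where os.open() supports dir_fd and O_NOFOLLOW (Linux does).

```python
import os


def open_beneath(root, relpath):
    """Open relpath under root read-only, refusing '..' and symlinks.

    Each component is opened relative to the previous directory fd, so a
    symlink or '..' can never move the walk outside of root.
    """
    parts = [p for p in relpath.split("/") if p not in ("", ".")]
    if not parts:
        raise ValueError("empty path")
    if ".." in parts:
        raise PermissionError("'..' is not allowed")

    dir_fd = os.open(root, os.O_RDONLY | os.O_DIRECTORY)
    try:
        for name in parts[:-1]:
            # O_NOFOLLOW makes the open fail if this component is a symlink.
            next_fd = os.open(
                name,
                os.O_RDONLY | os.O_DIRECTORY | os.O_NOFOLLOW,
                dir_fd=dir_fd,
            )
            os.close(dir_fd)
            dir_fd = next_fd
        # The final component must not be a symlink either.
        return os.open(parts[-1], os.O_RDONLY | os.O_NOFOLLOW, dir_fd=dir_fd)
    finally:
        os.close(dir_fd)
```

The cost is one open() per path component, which is the overhead that
lookup flags handled inside the kernel would avoid.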

And per-system-call permissions are very dubious. What system calls
don't you want to succeed? That ioctl? You just made it impossible to
do a modern graphical application. Yet the kind of thing where we
would _want_ to help users is in making it easier to sandbox something
like the adobe flash player. But without accelerated direct rendering,
that's not going to fly, is it?

So I'm sorry for throwing cold water on you guys, but the whole "let's
come up with a new security gadget" thing just makes me go "oh no, not
again".

Linus

2011-05-25 19:06:22

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

* Linus Torvalds <[email protected]> wrote:

> And per-system-call permissions are very dubious. What system calls
> don't you want to succeed? That ioctl? You just made it impossible
> to do a modern graphical application. Yet the kind of thing where
> we would _want_ to help users is in making it easier to sandbox
> something like the adobe flash player. But without accelerated
> direct rendering, that's not going to fly, is it?

I was under the impression that Will had a very specific application
in mind which actually works today and uses the inferior version of
seccomp.

Will, mind filling us in on that?

I'd agree that adding any of this without a real serious app making
real use of it would be pointless. I discussed this under the
impression that the app existed :-)

I also got the very distinct impression from the various iterations
that a real usecase existed behind it - all the fixes and
considerations looked very realistic, not designed up for security's
sake.

> So I'm sorry for throwing cold water on you guys, but the whole
> "let's come up with a new security gadget" thing just makes me go
> "oh no, not again".

Fair enough :-)

Thanks,

Ingo

2011-05-25 19:12:47

by Kees Cook

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

Hi Linus,

On Wed, May 25, 2011 at 11:42:44AM -0700, Linus Torvalds wrote:
> And who would *use* this thing in real life? Nobody. In order to sell
> me on a new security interface, give me a real actual use case that is
> security-conscious and relevant to real users.
> [...]
> And per-system-call permissions are very dubious. What system calls
> don't you want to succeed? That ioctl? You just made it impossible to
> do a modern graphical application. Yet the kind of thing where we
> would _want_ to help users is in making it easier to sandbox something
> like the adobe flash player. But without accelerated direct rendering,
> that's not going to fly, is it?

Uhm, what? Chrome would use it. And LXC would. Those were stated very
early on as projects extremely interested in syscall filtering. And that's
just the start, I can easily imagine Apache modules enforcing a very narrow
band of syscalls, or just about anything else that could be in a position
of running potentially malicious code. This could be very far-reaching, IMO.

-Kees

--
Kees Cook
Ubuntu Security Team

2011-05-25 19:54:42

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Wed, May 25, 2011 at 2:06 PM, Ingo Molnar <[email protected]> wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
>> And per-system-call permissions are very dubious. What system calls
>> don't you want to succeed? That ioctl? You just made it impossible
>> to do a modern graphical application. Yet the kind of thing where
>> we would _want_ to help users is in making it easier to sandbox
>> something like the adobe flash player. But without accelerated
>> direct rendering, that's not going to fly, is it?
>
> I was under the impression that Will had a very specific application
> in mind which actually works today and uses the inferior version of
> seccomp.
>
> Will, mind filling us in on that?

With pleasure! I'll be a bit overly verbose to ensure I'm covering my
bases, I hope it's not too tedious.

Support for using system call filtering will be added to the Chromium
browser if it is accepted here. At present, Chromium separates the
processing of untrusted input (html, javascript, images) into
standalone renderer processes. In an effort to reduce the risks
associated with processing the data we put those renderers in a chroot
with a private VFS and PID namespace. This limits the ability for a
compromised renderer to signal() another process outside of the
"sandbox" or access files it shouldn't.

Ideally, the only exposed surface to the renderer would be the IPC
mechanism, memory allocation, etc. That isn't possible today though
[*]. The renderer gets the whole syscall ABI. In many cases, adding
support for (all of the) LSMs to the sandboxing methodology would help
mitigate the exposure. There would be the code paths that handle the
user input prior to calling the LSM hooks, but after that point, the
renderer could be denied, shutdown, etc. Unfortunately, there's no
one-to-one mapping from system calls to LSM hooks (nor do all stock
kernels from distros come with a pre-chosen and configured LSM).

To supply some concreteness, the perf_counter_open() system call comes
to mind. It suffered from a stack-based buffer overflow when
processing the user-supplied arguments, and there was no effective
mechanism, LSM or otherwise, to prevent its access. In my usecase, if
only a whitelist of required system calls was made available to the
Chromium renderer processes, then the addition of a bug like
perf_counter_open()'s to the kernel would not have provided a direct
means to escape the user-level sandboxing and execute arbitrary code
in the kernel.

As I mentioned, if it is possible to expand seccomp to provide a
system call access mechanism (bitmask, whatever), I will expand the
Chromium sandbox to make use of it on every linux distro that ships
with it enabled. In addition, my immediate work focus is on Chromium
OS. I would like to apply system call filtering to every daemon in
the distribution alongside additional security defenses. Also, I am
aware of many server-side uses but can't promise immediate deployment
in the same fashion.

[It's also worth noting that as more browser plugins, like Adobe
Flash, migrate to the Pepper API (chrome,mozilla), they will no longer
need direct hardware access (ioctl()s, fs, etc). All system access
will be brokered via the browser which lets them be sandboxed entirely
-- including with system call filtering, if it is supported by the host platform.]

[*] it is possible to do crazy, on-the-fly syscall rewriting with
seccomp(1) and a trusted thread, but the performance cost is huge, the
portability is nil (pure asm), and the risk of a security bug is high.

> I'd agree that adding any of this without a real serious app making
> real use of it would be pointless. I discussed this under the
> impression that the app existed :-)
>
> I also got the very distinct impression from the various iterations
> that a real usecase existed behind it - all the fixes and
> considerations looked very realistic, not designed up for security's
> sake.
>
>> So I'm sorry for throwing cold water on you guys, but the whole
>> "let's come up with a new security gadget" thing just makes me go
>> "oh no, not again".
>
> Fair enough :-)

I don't want to boil the ocean and certainly am not interested in
reliving the LSM-wars. I want the missing piece of the puzzle when it
comes to reducing exposed kernel code. seccomp.mode=1 is so close,
but its overly restrictive nature has made it implausible for nearly
all real-world uses. A slight expansion to allow a system call
bitmask or simple filters would be sufficient for Chromium OS,
Chromium, qemu, and lxc use, among others.
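
To make the difference concrete, here is a toy Python model of the
bitmask idea next to seccomp mode 1. It is only a sketch: the SYSCALLS
table and the SyscallMask class are made up for illustration, real
syscall numbers are architecture-specific, and nothing here talks to the
kernel. The only fact taken as given is that mode 1 permits just read,
write, exit and sigreturn.

```python
# Toy syscall table (hypothetical ordering; real numbers depend on the arch).
SYSCALLS = ["read", "write", "exit", "sigreturn", "mmap", "futex",
            "ioctl", "perf_counter_open"]

# seccomp mode 1 permits only these four calls.
STRICT_MODE = {"read", "write", "exit", "sigreturn"}


class SyscallMask:
    """One bit per syscall: set means allowed, clear means denied."""

    def __init__(self, allowed):
        self.bits = 0
        for name in allowed:
            self.bits |= 1 << SYSCALLS.index(name)

    def permits(self, name):
        if name not in SYSCALLS:
            return False  # unknown calls are denied by default
        return bool((self.bits >> SYSCALLS.index(name)) & 1)


# A renderer-style whitelist: mode 1 plus the few extra calls it needs.
renderer = SyscallMask(STRICT_MODE | {"mmap", "futex"})
```

With such a mask, renderer.permits("mmap") is true while
renderer.permits("perf_counter_open") is false, so a bug in the latter
would not be reachable from the sandboxed process.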

Thanks for reading and replying!
will

2011-05-25 20:02:36

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Wed, May 25, 2011 at 12:11 PM, Kees Cook <[email protected]> wrote:
>
> Uhm, what? Chrome would use it. And LXC would. Those were stated very
> early on as projects extremely interested in syscall filtering.

.. and I seriously doubt it is workable.

Or at least it needs some actual working proof-of-concept thing.
Exactly because of issues like direct rendering etc, that require some
of the nastier system calls to work at all.

As to your example of apache modules - last I saw, most of those were
written in high-level scripting languages that almost invariably end
up using quite a bit of the system call interfaces. And more
importantly, almost nobody does unportable code.

So hey, I'm willing to be convinced. But I'll need more than people
_saying_ that they'd be interested. Because judging by past
performance, nobody ever uses esoteric cool new features.

Linus

2011-05-25 20:19:43

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

* Linus Torvalds <[email protected]> wrote:

> On Wed, May 25, 2011 at 12:11 PM, Kees Cook <[email protected]> wrote:
> >
> > Uhm, what? Chrome would use it. And LXC would. Those were stated very
> > early on as projects extremely interested in syscall filtering.
>
> .. and I seriously doubt it is workable.
>
> Or at least it needs some actual working proof-of-concept thing.
> Exactly because of issues like direct rendering etc, that require some
> of the nastier system calls to work at all.

Btw., Will's patch in this thread (which i think he tested with real
code) implements an approach which detaches the concept from a rigid
notion of 'syscall filtering' and opens it up for things like
reliable pathname checks, memory object checks, etc. - without having
to change the ABI.

If we go for syscall filtering as per bitmask, then we've pretty much
condemned this to be limited to the syscall boundary alone.

So this sandboxing concept looked flexible enough to me to work
itself up the security concept food chain *embedded in apps*.

<flame>

IMHO the key design mistake of LSM is that it detaches security
policy from applications: you need to be admin to load policies, you
need to be root to use/configure an LSM. Dammit, you need to be root
to add labels to files!

This not only makes the LSM policies distro specific (and needlessly
forked and detached from real security), but also gives the message
that:

'to ensure your security you need to be privileged'

which is the anti-concept of good security IMO.

So why not give unprivileged security policy facilities and let
*Apps* shape their own security models. Yes, they will mess up
initially and will reinvent the wheel. But socially IMO it will work
a *lot* better in the long run: it's not imposed on them
*externally*, it's something they can use and grow gradually. They
will experience the security holes first hand and they will be *able
to do something strategic about them* if we give them the right
facilities.

At least the Chrome browser project appears to be intent on following
such an approach. I consider a more bazaar-like approach more
healthy, and it very much needs kernel help as LSMs are isolated from
apps right now.

The thing is, we cannot possibly make the LSM situation much worse
than it is today: i see *ALL* of the LSMs focused on all the wrong
things!

But yes, i can understand that you are deeply sceptical.

</flame>

Thanks,

Ingo

2011-05-26 01:20:17

by James Morris

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Wed, 25 May 2011, Linus Torvalds wrote:

> And per-system-call permissions are very dubious. What system calls
> don't you want to succeed? That ioctl? You just made it impossible to
> do a modern graphical application. Yet the kind of thing where we
> would _want_ to help users is in making it easier to sandbox something
> like the adobe flash player. But without accelerated direct rendering,
> that's not going to fly, is it?

Going back to the initial idea proposed by Will, where seccomp is simply
extended to filter all syscalls, there is potential benefit in being able
to limit the attack surface of the syscall API.

This is not security mediation in terms of interaction between things
(e.g. "allow A to read B"). It's a _hardening_ feature which prevents a
process from being able to invoke potentially hundreds of syscalls it has
no need for. It would allow us to usefully restrict some well-established
attack modes, e.g. triggering bugs in kernel code via unneeded syscalls.

This is orthogonal to access control schemes (such as SELinux), which are
about mediating security-relevant interactions between objects.

One area of possible use is KVM/Qemu, where processes now contain entire
operating systems, and the attack surface between them is now much broader
e.g. a local unprivileged vulnerability is now effectively a 'remote' full
system compromise.

There has been some discussion of this within the KVM project. Using the
existing seccomp facility is problematic in that it requires significant
reworking of Qemu to a privsep model, which would also then incur a likely
unacceptable context switching overhead. The generalized seccomp filter
as proposed by Will would provide a significant reduction in exposed
syscalls and thus guest->host attack surface.

I've cc'd some KVM folk for more input on how this may or may not meet
their requirements -- Avi/Gleb, there's a background writeup here:
http://lwn.net/Articles/442569/ . We may need a proof of concept and/or
commitment to use this feature for it to be accepted upstream.

- James
--
James Morris
<[email protected]>

2011-05-26 06:08:55

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On 05/26/2011 04:19 AM, James Morris wrote:
> On Wed, 25 May 2011, Linus Torvalds wrote:
>
> > And per-system-call permissions are very dubious. What system calls
> > don't you want to succeed? That ioctl? You just made it impossible to
> > do a modern graphical application. Yet the kind of thing where we
> > would _want_ to help users is in making it easier to sandbox something
> > like the adobe flash player. But without accelerated direct rendering,
> > that's not going to fly, is it?
>
> Going back to the initial idea proposed by Will, where seccomp is simply
> extended to filter all syscalls, there is potential benefit in being able
> to limit the attack surface of the syscall API.
>
> This is not security mediation in terms of interaction between things
> (e.g. "allow A to read B"). It's a _hardening_ feature which prevents a
> process from being able to invoke potentially hundreds of syscalls it has
> no need for. It would allow us to usefully restrict some well-established
> attack modes, e.g. triggering bugs in kernel code via unneeded syscalls.
>
> This is orthogonal to access control schemes (such as SELinux), which are
> about mediating security-relevant interactions between objects.
>
> One area of possible use is KVM/Qemu, where processes now contain entire
> operating systems, and the attack surface between them is now much broader
> e.g. a local unprivileged vulnerability is now effectively a 'remote' full
> system compromise.
>
> There has been some discussion of this within the KVM project. Using the
> existing seccomp facility is problematic in that it requires significant
> reworking of Qemu to a privsep model, which would also then incur a likely
> unacceptable context switching overhead. The generalized seccomp filter
> as proposed by Will would provide a significant reduction in exposed
> syscalls and thus guest->host attack surface.
>
> I've cc'd some KVM folk for more input on how this may or may not meet
> their requirements -- Avi/Gleb, there's a background writeup here:
> http://lwn.net/Articles/442569/ . We may need a proof of concept and/or
> commitment to use this feature for it to be accepted upstream.

Indeed, we were looking at sandboxing as a means to mitigate the "guest
exploits qemu, proceeds to exploit host syscall interface" scenario, and
evolved seccomp looks like the best tradeoff in terms of security gains
vs effort needed.

Eric Paris (copied) prototyped this with his own version of enhanced
seccomp and achieved pretty good results, so a proof of concept will be
quite easy to provide.

Regarding dynamic filtering, the biggest question here is how this will
interact with hotplug, which requires new files to be opened in the
sandboxed process (or SCM_RIGHTed in). Any fd-based filtering will
defeat that, so we'll need some way for a privileged monitor to adjust
filters.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2011-05-26 08:25:40

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

* James Morris <[email protected]> wrote:

> On Wed, 25 May 2011, Linus Torvalds wrote:
>
> > And per-system-call permissions are very dubious. What system
> > calls don't you want to succeed? That ioctl? You just made it
> > impossible to do a modern graphical application. Yet the kind of
> > thing where we would _want_ to help users is in making it easier
> > to sandbox something like the adobe flash player. But without
> > accelerated direct rendering, that's not going to fly, is it?
>
> Going back to the initial idea proposed by Will, where seccomp is
> simply extended to filter all syscalls, there is potential benefit
> in being able to limit the attack surface of the syscall API.

If controlling the system call boundary is found to be useful then
the next logical step is to realize that limiting it to
*only* the syscall boundary is shortsighted.

Also, here's a short reminder of the complexity evolution of this
patch-set, which i've followed since it was first posted in 2009:

bitmask (2009): 6 files changed, 194 insertions(+), 22 deletions(-)
filter engine (2010): 18 files changed, 1100 insertions(+), 21 deletions(-)
event filters (2011): 5 files changed, 82 insertions(+), 16 deletions(-)

Interestingly, the events version is *by far* the most flexible one
in both the short and the long run, and it is also the smallest patch
...

It's a perfect fit and that's not really surprising: system call
boundary hardening is about filtering various key parameters - while
event tracing is about filtering various key parameters as well.

But it goes further than that: SELinux security policies are in
essence primitive event filters as well, on an abstract level - see
below for more details.

And yes, the primitive, coarse, per syscall allow/disallow bitmask v1
version would not be too painful to the core kernel in terms of code
impact and interaction with other code (it does not interact at all)
- but it would still be sadly shortsighted to not explore the event
filters angle, now that we have actual working code.

It would not improve the LSM situation one tiny bit either - the
bitmask design would guarantee that the seccomp approach can never
seriously replace the sucky LSM concepts we have in the kernel today.

> This is not security mediation in terms of interaction between
> things (e.g. "allow A to read B"). It's a _hardening_ feature
> which prevents a process from being able to invoke potentially
> hundreds of syscalls it has no need for. It would allow us to
> usefully restrict some well-established attack modes, e.g.
> triggering bugs in kernel code via unneeded syscalls.

If you think about it then you'll see that this artificial
distinction between 'mediation' and 'hardening' is nonsense!

If we add the appropriate file label field to VFS tracing events
(which would be useful for many instrumentation reasons as well) then
the event filtering variant of Will's patch:

_will be able to do object level security mediation too_

What is at the core of every access control concept, be that DAC,
MAC, RBAC or ACL? Flexible task specific set of access vectors to
file and other labeled objects, which cannot be circumvented by that
task.

How can we implement a user-space file object manager via Will's
event filters approach? It's actually pretty easy:

- a simple object manager wants to know 'who' does something, 'what'
it is trying to access, and then wants to generate an allow/deny
action as a function of the (who,what) parameters:

- The 'who' is a given as the event filters are per
task, so different tasks with different roles can have
different event filters. This is the equivalent of the current
task's security context. [ Event filters installed by the
parent cannot be removed by child tasks (they cannot even read
them - it's transparent). ]

- The most finegrained representation of 'what' are inode
numbers. Because we do not want to generate rules for every
single object we want to group objects and want to define
access rules on different groups. This can be done by defining
an event that emits file labels.

So a simple object manager would simply use file label event
attributes and would define simple rules like:

"(label & tmp_t) || (label & user_home_t)"

to allow access to /tmp and /home files. Filters allow us to define
arbitrary access vectors to objects in essence. The above filters get
passed to the kernel as an ASCII string by the object manager task,
where the filter engine parses it safely and creates atomic
predicates out of it, which can then be executed at the source of any
event.
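
As a rough user-space illustration of that flow (ASCII string in,
compiled predicates out), here is a Python sketch. It is not the kernel
filter engine: the LABEL_BITS values, the etc_t label and the
compile_filter name are assumptions made up for the example, and the
grammar covers only "label & NAME" terms joined by && and ||.

```python
import re

# Hypothetical label bits; the text above only names tmp_t and user_home_t.
LABEL_BITS = {"tmp_t": 1 << 0, "user_home_t": 1 << 1, "etc_t": 1 << 2}

TOKEN = re.compile(r"\s*(\|\||&&|&|\(|\)|[A-Za-z_]\w*)")


def tokenize(text):
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def compile_filter(text):
    """Parse an ASCII filter once and return a predicate over event dicts."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take(expected=None):
        nonlocal pos
        token = peek()
        if token is None or (expected is not None and token != expected):
            raise ValueError(f"expected {expected or 'a token'}, got {token!r}")
        pos += 1
        return token

    def atom():
        if peek() == "(":
            take("(")
            pred = disjunction()
            take(")")
            return pred
        take("label")
        take("&")
        name = take()
        if name not in LABEL_BITS:
            raise ValueError(f"unknown label {name!r}")
        mask = LABEL_BITS[name]
        # Atomic predicate: true if the event's label has this bit set.
        return lambda event: bool(event["label"] & mask)

    def conjunction():
        pred = atom()
        while peek() == "&&":
            take("&&")
            left, right = pred, atom()
            pred = lambda event, l=left, r=right: l(event) and r(event)
        return pred

    def disjunction():
        pred = conjunction()
        while peek() == "||":
            take("||")
            left, right = pred, conjunction()
            pred = lambda event, l=left, r=right: l(event) or r(event)
        return pred

    pred = disjunction()
    if peek() is not None:
        raise ValueError(f"trailing token {peek()!r}")
    return pred


allow = compile_filter("(label & tmp_t) || (label & user_home_t)")
```

Here allow({"label": LABEL_BITS["tmp_t"]}) is true and
allow({"label": LABEL_BITS["etc_t"]}) is false. The string is parsed
once and the resulting closures are what runs per event, which is the
property the paragraph above relies on.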

[ We could even implement a transparent AVC-cache equivalent for
filters, should the complexity and popularity of them increase:
ASCII filters lend themselves very well to hash based caches. ]

Similarly, support for other types of object tagging, network labels,
etc. can be added as well with little pain: they can be added without
any change to the basic ABI! Using events filters here makes it a
very extensible security concept.

It is capable of implementing the functional equivalent of most MAC,
DAC, RBAC and other access control concepts, purely in user-space -
in addition to 'hardening' (which btw. is really access control too,
in disguise).

Obviously it is all layered: it is only allowed to control access to
objects all the other security concepts allow for it to access - i.e.
this is an unprivileged LSM, a per application security layer if you
will, that can further refine security policies.

In terms of security models this event filters approach is
unconditional goodness in my opinion.

> This is orthogonal to access control schemes (such as SELinux),
> which are about mediating security-relevant interactions between
> objects.

It's only 'orthogonal' because IMO you make two fundamental mistakes:

1) You arbitrarily limit SELinux to object based security measures
alone.

Which is not even true btw: SELinux certainly has some hooks it
uses for pragmatic non-object hardening: for example all the
places where we fall back to capabilities are places where
there's a method based restriction not object based restriction.

The KDSKBENT ioctl check for example in
security/selinux/hooks.c::selinux_file_ioctl(), or
selinux_vm_enough_memory(), or the CAP_MAC_ADMIN exception in
selinux_inode_getsecurity() all violate 'pure' DAC concepts but
obviously for pragmatic reasons SELinux is doing these ...

mmap_min_addr is a borderline method restriction feature as well:
it does not really control access to the underlying object (RAM),
but controls one (of many) access methods to it by controlling
virtual memory ...

So SELinux, in a rather hypocritical fashion is already involved
in hardening and in filtering, because obviously any practical
and pragmatic security system *has to*.

2) You arbitrarily limit Will's patch to *not* be able to
implement object based security mechanisms. Why?

Syscall hardening and object based access rules are *deeply*
connected, conceptually they are subsets of one and the same thing: a
good, organic security model controlling different hierarchies of
physical and derived (virtual) resources, which allows flexible
control of both objects *and* methods.

The 'methods' (the syscalls and other functionality) are *also* a
derived resource so it's entirely legitimate to control access to
them. Yes, because they are methods you can also try to use them to
restrict access to underlying objects - this is what AppArmor is
about mostly, and yes i agree that in the general case it's not a
particularly robust method.

And yes, i fully submit that object access control has theoretical
advantages and it should often be the primary measure that gives a
robust, often provable backbone to a secure system.

But you'd be out of your mind to not recognize:

- The utility of controlling access methods (as resources) as well,
both to reduce the attack surface in the implementation of those
methods, and to allow the easy summary control of objects where
there's only a low number of (and often only a single!) access
method.

- The utility of unprivileged security frameworks.

- The utility of stackable security features. (defense in depth,
anyone?)

Will's astonishingly small patch:

event filters (2011): 5 files changed, 82 insertions(+), 16 deletions(-)

Gives us *all three* of those, while also allowing user-space
implemented MAC, DAC, RBAC as well.

> One area of possible use is KVM/Qemu, where processes now contain
> entire operating systems, and the attack surface between them is
> now much broader e.g. a local unprivileged vulnerability is now
> effectively a 'remote' full system compromise.

Note that the main reason why Qemu needs access method hardening is
because it has a dominantly state machine based design which does not
lend itself very well to an object manager security design.

Note that tools/kvm/ would probably like to implement its own object
manager model as well in addition to access method restrictions: by
being virtual hardware it deals with many resources and object
hierarchies that are simply not known to the host OS's LSM.

Unlike Qemu tools/kvm/ has a design that is very fit for MAC
concepts: it uses separate helper threads for separate resources
(this could in many cases even be changed to be separate processes
which only share access to the guest RAM image) - while Qemu is in
most parts a state machine, so in tools/kvm/ we can realistically
have a good object manager and keep an exploit in a networking
interface driver from being able to access disk driver state.

(I've Cc:-ed Pekka for tools/kvm/.)

> There has been some discussion of this within the KVM project.
> Using the existing seccomp facility is problematic in that it
> requires significant reworking of Qemu to a privsep model, which
> would also then incur a likely unacceptable context switching
> overhead. The generalized seccomp filter as proposed by Will would
> provide a significant reduction in exposed syscalls and thus
> guest->host attack surface.

... and the event filter based method would *also* allow MAC to be
defined over physical resources, such as virtual network interfaces,
virtual disk devices, etc.

You are seriously limiting the capabilities of this feature for no
good reason i can recognize.

Thanks,

Ingo

2011-05-26 08:35:15

by Pekka Enberg

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

Hi Ingo,

On Thu, May 26, 2011 at 11:24 AM, Ingo Molnar <[email protected]> wrote:
> Unlike Qemu tools/kvm/ has a design that is very fit for MAC
> concepts: it uses separate helper threads for separate resources
> (this could in many cases even be changed to be separate processes
> which only share access to the guest RAM image) - while Qemu is in
> most parts a state machine, so in tools/kvm/ we can realistically
> have a good object manager and keep an exploit in a networking
> interface driver from being able to access disk driver state.

I haven't really followed this particular discussion nor do I know if
Qemu is a good or bad fit but sure, for tools/kvm Chrome-style
sandboxing makes tons of sense and would be a pretty good fit for how
our device model works now.

Pekka

2011-05-26 08:50:47

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On 05/26/2011 11:24 AM, Ingo Molnar wrote:
> So a simple object manager would simply use file label event
> attributes and would define simple rules like:
>
> "(label & tmp_t) || (label & user_home_t)"

Filtering by label vs. filtering by descriptor would solve qemu's
hotplug issue neatly.

> Note that tools/kvm/ would probably like to implement its own object
> manager model as well in addition to access method restrictions: by
> being virtual hardware it deals with many resources and object
> hierarchies that are simply not known to the host OS's LSM.
>
> Unlike Qemu tools/kvm/ has a design that is very fit for MAC
> concepts: it uses separate helper threads for separate resources
> (this could in many cases even be changed to be separate processes
> which only share access to the guest RAM image) - while Qemu is in
> most parts a state machine, so in tools/kvm/ we can realistically
> have a good object manager and keep an exploit in a networking
> interface driver from being able to access disk driver state.

You mean each thread will have a different security context? I don't
see the point. All threads share all of memory so it would be trivial
for one thread to exploit another and gain all of its privileges.

A multi process model works better but it has significant memory and
performance overhead.

(well the memory overhead is much smaller when using transparent huge
pages, but these only work for anonymous memory).

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2011-05-26 08:57:54

by Pekka Enberg

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

Hi Avi,

On Thu, May 26, 2011 at 11:49 AM, Avi Kivity <[email protected]> wrote:
> You mean each thread will have a different security context? I don't see
> the point. All threads share all of memory so it would be trivial for one
> thread to exploit another and gain all of its privileges.

So how would that happen? I'm assuming that once the security context
has been set up for a thread, you're not able to change it after that.
You'd be able to exploit other threads through shared memory but how
would you gain privileges?

Pekka

2011-05-26 09:31:24

* Avi Kivity <[email protected]> wrote:

> > Note that tools/kvm/ would probably like to implement its own
> > object manager model as well in addition to access method
> > restrictions: by being virtual hardware it deals with many
> > resources and object hierarchies that are simply not known to the
> > host OS's LSM.
> >
> > Unlike Qemu tools/kvm/ has a design that is very fit for MAC
> > concepts: it uses separate helper threads for separate resources
> > (this could in many cases even be changed to be separate
> > processes which only share access to the guest RAM image) - while
> > Qemu is in most parts a state machine, so in tools/kvm/ we can
> > realistically have a good object manager and keep an exploit in a
> > networking interface driver from being able to access disk driver
> > state.
>
> You mean each thread will have a different security context? I
> don't see the point. All threads share all of memory so it would
> be trivial for one thread to exploit another and gain all of its
> privileges.

You are missing the geniality of the tools/kvm/ thread pool! :-)

It could be switched to a worker *process* model rather easily. Guest
RAM and (a limited amount of) global resources would be shared via
mmap(SHARED), but otherwise each worker process would have its own
stack, its own subsystem-specific state, etc.
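
A minimal sketch of that split, assuming an anonymous shared mapping as a
stand-in for guest RAM. The names map_guest_ram() and run_device_worker()
are made up for illustration and are not taken from tools/kvm/:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: anonymous shared mapping standing in for guest RAM. */
static unsigned char *map_guest_ram(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : (unsigned char *)p;
}

/*
 * Hypothetical worker: fork a child that writes one byte to the shared
 * mapping and one byte to *private_state, then exits. After fork() the
 * child works on its own copy-on-write copy of private_state, so only
 * the write to ram is visible to the parent. Returns 0 on success.
 */
static int run_device_worker(unsigned char *ram, unsigned char *private_state)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		ram[0] = 0x42;          /* shared: parent sees this */
		*private_state = 0x42;  /* private: parent does not */
		_exit(0);
	}
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

After run_device_worker(ram, &state) returns in the parent, ram[0] has
changed while state is untouched, which is the isolation property being
discussed: only what is explicitly mapped shared crosses the boundary.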

Exploiting other device domains via the shared guest RAM image is not
possible, we treat guest RAM as untrusted data already.

Devices, like real hardware devices, are functionally pretty
independent from each other, so this security model is rather natural
and makes a lot of sense.

> A multi process model works better but it has significant memory
> and performance overhead.

Not in Linux :-) We context-switch between processes almost as
quickly as we do between threads. With modern tagged TLB hardware
it's even faster.

> (well the memory overhead is much smaller when using transparent
> huge pages, but these only work for anonymous memory).

The biggest amount of RAM is the guest RAM image - but if that is
mmap(SHARED) and mapped using hugepages then the pte overhead from a
process model is largely mitigated.

Once we have a process model then isolation and MAC between devices
becomes a very real possibility: exploit via one network interface
cannot break into a disk interface.

Maybe even the isolation and per device access control of
*same-class* devices from each other is possible: with careful
implementation of the subsystem shared data structures. (which isn't
much really)

Thanks,

Ingo

2011-05-26 09:48:25

* Ingo Molnar <[email protected]> wrote:

> You are missing the geniality of the tools/kvm/ thread pool! :-)
>
> It could be switched to a worker *process* model rather easily.
> Guest RAM and (a limited amount of) global resources would be
> shared via mmap(SHARED), but otherwise each worker process would
> have its own stack, its own subsystem-specific state, etc.

We get VM exit events in the vcpu threads which after minimal
processing pass much of the work to the thread pool. Most of the
virtio work (which could be a source of vulnerability - ringbuffers
are hard) is done in the worker task context.

It would be possible to further increase isolation there by also
passing the IO/MMIO decoding to the worker thread - but i'm not sure
that's truly needed. Most of the risk is where most of the code is -
and the code is in the worker task which interprets on-disk data,
protocols, etc.

So we could not only isolate devices from each other, but we could
also protect the highly capable vcpu fd from exploits in devices -
worker threads generally do not need access to the vcpu fd IIRC.

Thanks,

Ingo

2011-05-26 10:57:33

On 05/26/2011 12:30 PM, Ingo Molnar wrote:
> * Avi Kivity <[email protected]> wrote:
>
> > > Note that tools/kvm/ would probably like to implement its own
> > > object manager model as well in addition to access method
> > > restrictions: by being virtual hardware it deals with many
> > > resources and object hierarchies that are simply not known to the
> > > host OS's LSM.
> > >
> > > Unlike Qemu tools/kvm/ has a design that is very fit for MAC
> > > concepts: it uses separate helper threads for separate resources
> > > (this could in many cases even be changed to be separate
> > > processes which only share access to the guest RAM image) - while
> > > Qemu is in most parts a state machine, so in tools/kvm/ we can
> > > realistically have a good object manager and keep an exploit in a
> > > networking interface driver from being able to access disk driver
> > > state.
> >
> > You mean each thread will have a different security context? I
> > don't see the point. All threads share all of memory so it would
> > be trivial for one thread to exploit another and gain all of its
> > privileges.
>
> You are missing the geniality of the tools/kvm/ thread pool! :-)

I'm sure the thread pool is very general, but the hardware we're
modelling is not.

> It could be switched to a worker *process* model rather easily. Guest
> RAM and (a limited amount of) global resources would be shared via
> mmap(SHARED), but otherwise each worker process would have its own
> stack, its own subsystem-specific state, etc.

Suppose a guest reconfigures a device's MSI page, and suppose that's
handled by the device's process. Now it's not sufficient to update some
global state, you have to go and tell the host kernel about it. With
good privilege separation the device process would not be permitted to
do that; now it has to pass a message to a process that is.

Same thing applies for BARs, reset signals, live migration, etc.

> Exploiting other device domains via the shared guest RAM image is not
> possible, we treat guest RAM as untrusted data already.

Right.

> Devices, like real hardware devices, are functionally pretty
> independent from each other, so this security model is rather natural
> and makes a lot of sense.

When just pushing packets, you are right. However setup/configuration
is hardly clean.

Consider a CD-ROM eject, for example. Now it can't be done by a simple
callback.

> > A multi process model works better but it has significant memory
> > and performance overhead.
>
> Not in Linux :-) We context-switch between processes almost as
> quickly as we do between threads. With modern tagged TLB hardware
> it's even faster.

Once we get PCID in, yes. There's still the message passing overhead,
and unnecessary context switches. In a threaded model you can choose
whether to switch threads or not, in a process model you cannot.

> > (well the memory overhead is much smaller when using transparent
> > huge pages, but these only work for anonymous memory).
>
> The biggest amount of RAM is the guest RAM image - but if that is
> mmap(SHARED) and mapped using hugepages then the pte overhead from a
> process model is largely mitigated.

That doesn't work with memory hotplug.

> Once we have a process model then isolation and MAC between devices
> becomes a very real possibility: exploit via one network interface
> cannot break into a disk interface.

Yes, certainly.

> Maybe even the isolation and per device access control of
> *same-class* devices from each other is possible: with careful
> implementation of the subsystem shared data structures. (which isn't
> much really)

Right, hardly at all in fact. The problem comes from the side-band
issues like reset, interrupts, hotplug, and whatnot.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2011-05-26 11:03:42

On 05/26/2011 12:48 PM, Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
> > You are missing the geniality of the tools/kvm/ thread pool! :-)
> >
> > It could be switched to a worker *process* model rather easily.
> > Guest RAM and (a limited amount of) global resources would be
> > shared via mmap(SHARED), but otherwise each worker process would
> > have its own stack, its own subsystem-specific state, etc.
>
> We get VM exit events in the vcpu threads which after minimal
> processing pass much of the work to the thread pool. Most of the
> virtio work (which could be a source of vulnerability - ringbuffers
> are hard) is done in the worker task context.
>
> It would be possible to further increase isolation there by also
> passing the IO/MMIO decoding to the worker thread - but i'm not sure
> that's truly needed. Most of the risk is where most of the code is -
> and the code is in the worker task which interprets on-disk data,
> protocols, etc.

I've suggested in the past to add an "mmiofd" facility to kvm, similar
to ioeventfd. This is how it would work:

- userspace configures kvm with an mmio range and a pipe
- guest writes to that range write a packet to the pipe describing the write
- guest reads from that range write a packet to the pipe describing the
read, then wait for a reply packet with the result

The advantages would be
- avoid heavyweight exit; kvm can simply wake up a thread on another
core and resume processing
- writes can be pipelined, similar to how PCI writes are posted
- supports process separation

So far no one has posted an implementation but it should be pretty simple.
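
To make the idea above concrete, here is a rough userspace-only sketch of
what the packets on the pipe might look like. Everything in it is an
assumption made for illustration: struct mmio_packet, its field layout, and
the helpers mmio_post_write() and mmio_recv() are hypothetical, and no such
interface exists in kvm.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical wire format for one guest MMIO access. */
struct mmio_packet {
	uint64_t addr;      /* guest physical address of the access */
	uint32_t len;       /* access width in bytes, 1..8 */
	uint8_t  is_write;  /* 1 = write; 0 = read (reader must reply) */
	uint8_t  data[8];   /* payload for writes */
};

static int write_full(int fd, const void *buf, size_t n)
{
	const uint8_t *p = buf;

	while (n > 0) {
		ssize_t w = write(fd, p, n);

		if (w < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		p += w;
		n -= (size_t)w;
	}
	return 0;
}

static int read_full(int fd, void *buf, size_t n)
{
	uint8_t *p = buf;

	while (n > 0) {
		ssize_t r = read(fd, p, n);

		if (r < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (r == 0)
			return -1;  /* writer closed the pipe */
		p += r;
		n -= (size_t)r;
	}
	return 0;
}

/* Producer side: post a write without waiting for the consumer. */
static int mmio_post_write(int fd, uint64_t addr, const void *data, uint32_t len)
{
	struct mmio_packet pkt;

	if (len == 0 || len > sizeof(pkt.data))
		return -1;
	memset(&pkt, 0, sizeof(pkt));  /* also clears struct padding */
	pkt.addr = addr;
	pkt.len = len;
	pkt.is_write = 1;
	memcpy(pkt.data, data, len);
	return write_full(fd, &pkt, sizeof(pkt));
}

/* Consumer side (device process): fetch the next packet in order. */
static int mmio_recv(int fd, struct mmio_packet *pkt)
{
	return read_full(fd, pkt, sizeof(*pkt));
}
```

Several mmio_post_write() calls can be queued before the device process
reads any of them, which is the "posted write" property from the list
above. A read would need a second pipe (or a socketpair) for the reply
packet; that part is left out of the sketch.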

> So we could not only isolate devices from each other, but we could
> also protect the highly capable vcpu fd from exploits in devices -
> worker threads generally do not need access to the vcpu fd IIRC.

Yes.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2011-05-26 11:17:01

* Avi Kivity <[email protected]> wrote:

> > It would be possible to further increase isolation there by also
> > passing the IO/MMIO decoding to the worker thread - but i'm not
> > sure that's truly needed. Most of the risk is where most of the
> > code is - and the code is in the worker task which interprets
> > on-disk data, protocols, etc.
>
> I've suggested in the past to add an "mmiofd" facility to kvm,
> similar to ioeventfd. This is how it would work:
>
> - userspace configures kvm with an mmio range and a pipe
> - guest writes to that range write a packet to the pipe describing the write
> - guest reads from that range write a packet to the pipe describing
> the read, then wait for a reply packet with the result
>
> The advantages would be
> - avoid heavyweight exit; kvm can simply wake up a thread on another
> core and resume processing
> - writes can be pipelined, similar to how PCI writes are posted
> - supports process separation

Yes, that was my exact thought, a per transport channel fd.

> So far no one has posted an implementation but it should be pretty
> simple.

tools/kvm/ could make quick use of it - and it's a performance
optimization mainly IMO, not primarily a security feature.

If you whip up a quick untested prototype for the KVM side we could
look into adding tooling support for it and could test it.

As long as it's provided as an opt-in ioctl(), so that if it fails (on
older kernels) we fall back to the vcpu-fd, it should be relatively
straightforward to support on the tooling side as well AFAICS.

Thanks,

Ingo

2011-05-26 11:39:33

* Avi Kivity <[email protected]> wrote:

> > The biggest amount of RAM is the guest RAM image - but if that is
> > mmap(SHARED) and mapped using hugepages then the pte overhead
> > from a process model is largely mitigated.
>
> That doesn't work with memory hotplug.

Why not, if we do the sensible thing and restrict the size
granularity and alignment of plugged/unplugged memory regions to 2MB?

We can fix guest Linux as well to not be stupid about the sizing of
memory hotplug requests. It does hotplug based on the memory map we
pass to it anyway.

Am i missing something obvious here?
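
The 2MB restriction above boils down to a simple mask check. A small
sketch, assuming 2MB huge pages (the x86 case); HUGEPAGE_SIZE,
hotplug_region_ok() and hotplug_round_up() are illustrative names, not
existing kernel or tools/kvm/ symbols:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HUGEPAGE_SIZE (2ULL << 20)  /* 2MB, assumed huge page size */

/* Accept a hotplug region only if start and size are 2MB multiples. */
static bool hotplug_region_ok(uint64_t start, uint64_t size)
{
	return size != 0 &&
	       (start & (HUGEPAGE_SIZE - 1)) == 0 &&
	       (size & (HUGEPAGE_SIZE - 1)) == 0;
}

/* Round a requested size up to the next 2MB multiple. */
static uint64_t hotplug_round_up(uint64_t size)
{
	return (size + HUGEPAGE_SIZE - 1) & ~(HUGEPAGE_SIZE - 1);
}
```

A 3MB request would be rounded up to 4MB, and a region starting at a
4KB-aligned but not 2MB-aligned address would be rejected.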

> > Maybe even the isolation and per device access control of
> > *same-class* devices from each other is possible: with careful
> > implementation of the subsystem shared data structures. (which
> > isn't much really)
>
> Right, hardly at all in fact. The problem comes from the side-band
> issues like reset, interrupts, hotplug, and whatnot.

Yeah. There are two good aspects here i think:

- The sideband IPC overhead does not matter much, it's a side band.

- Spending the effort to isolate configuration details is worth it:
sideband code is a primary breeding ground for bugs and security
holes.

The main worry to me would be the maintainability difference: does it
result in much more complex code? As always i'm cautiously optimistic
about that: i think once we try it we can find a suitable model ...
It might even turn out to be more readable and more flexible in the
end.

Thanks,

Ingo

2011-05-26 14:37:15

by Colin Walters

On Wed, May 25, 2011 at 4:01 PM, Linus Torvalds
<[email protected]> wrote:
>
> As to your example of apache modules - last I saw, most of those were
> written in high-level scripting languages that almost invariably end
> up using quite a bit of the system call interfaces. And more
> importantly, almost nobody does unportable code.

Well, there's a difference between frameworks and applications. At
least in GNOME e.g. we've been generally good at picking up and
transparently taking advantage of Linux-specific stuff where possible
like splice() in g_file_copy(), and in this cycle I'll probably end up
using signalfd() to fix a long standing race condition.

> So hey, I'm willing to be convinced. But I'll need more than people
> _saying_ that they'd be interested. Because judging by past
> performance, nobody ever uses esoteric cool new features.

I'm curious which features you feel are esoteric and cool but unused?

2011-05-26 15:04:20

On Thu, May 26, 2011 at 7:37 AM, Colin Walters <[email protected]> wrote:
>
> I'm curious which features you feel are esoteric and cool but unused?

Just about anything linux-specific. Ranging from the totally new
concepts (epoll/clone/splice/signalfd) to just simple cleanups and
extensions of reasonably standard stuff (sync_file_range/sendpage).

Sure, there's almost always *somebody* who uses them, but they are
seldom actually worth it.

The one thing that works well is when you expose it as a standard
interface. So futexes are linux-specific, but they are exposed as the
standard pthreads condition variables etc to apps - very few actually
use them as futexes. But because glibc uses them for the pthreads
synchronization, I think they ended up being used inside glibc for
low-level stuff too, so I think futexes ended up being an unqualified
success - much better than the standard interface.

The "it can be used in standard libraries" ends up being a very
powerful thing. It doesn't have to be libc - if something like a glib
or a big graphical interface uses them, they can get very popular. But
if you have to have actual config options (autoconf or similar) to
enable the feature on Linux, along with a compatibility case (because
older kernels don't even support it, so it's not even "linux", it's
"linux newer than xyz"), then very very few applications end up using
it.

And security issues in particular are often *very* subtle. For
example, something like a system call filter sounds like an obviously
safe thing: it can only limit what you do, right?

Except no, not right at all. Imagine that you're limiting a suid
application, and the one operation you limit is "setuid()". Imagine
that the suid application explicitly drops privileges in order to run
safely as the user. Imagine, further, that it doesn't even check the
return value, because it *knows* that if it is root, it will succeed,
and if it isn't root, then it wasn't suid to begin with and doesn't
need to do anything about it.

Unlikely? Hell no. That's standard practice. And if you allow filter
setup that survives fork+exec, you just opened a HUGE security hole.

Fixable? Yes, easily. And I haven't looked at the current patches, but
I would not be AT ALL surprised if they had exactly the above huge
security hole.

My point being that (a) I'm very dubious about new non-standard
features, because historically they seldom get used very widely and
(b) I'm doubly dubious about security things because it turns out it's
damn easy to get it wrong in all kinds of small subtle details.

Linus

2011-05-26 15:28:44

by Colin Walters

On Thu, May 26, 2011 at 11:03 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, May 26, 2011 at 7:37 AM, Colin Walters <[email protected]> wrote:
>>
>> I'm curious which features you feel are esoteric and cool but unused?
>
> Just about anything linux-specific. Ranging from the totally new
> concepts (epoll/clone/splice/signalfd) to just simple cleanups and
> extensions of reasonably standard stuff (sync_file_range/sendpage).

epoll's very widely used via frameworks I'd say; at least the Apache
runtime uses it, libevent does, and apparently the Sun JDK does:
http://www.google.com/codesearch/p?hl=en#ih5hvYJNSIA/src/solaris/classes/sun/nio/ch/EPollPort.java&q=epoll&sa=N&cd=32&ct=rc
And here's an entry on that: http://blogs.oracle.com/alanb/entry/epoll

(Why doesn't glib? It's hard since the priority design was kind of a
mistake: https://bugzilla.gnome.org/show_bug.cgi?id=156048 )

> The "it can be used in standard libraries" ends up being a very
> powerful thing. It doesn't have to be libc - if something like a glib
> or a big graphical interface uses them, they can get very popular.

Right, that's the distinction I was trying to make.

> But
> if you have to have actual config options (autoconf or similar) to
> enable the feature on Linux, along with a compatibility case (because
> older kernels don't even support it, so it's not even "linux", it's
> "linux newer than xyz"), then very very few applications end up using
> it.

From my experience as a framework developer, it hasn't been hard at
all to keep track of new Linux features, we talk about them a lot =)
The fallback code is often obvious, like for splice(), though for
signalfd it's going to be much messier to keep around the legacy
helper thread case.

> Unlikely? Hell no. That's standard practice. And if you allow filter
> setup that survives fork+exec, you just opened a HUGE security hole.

Oh definitely, setuid and process inheritance have been a source of
many problems over the years, and I agree it'd be very dangerous for
the syscall filters to stay open across execve. Of course in practice
glibc secure mode exists to mitigate these things too; it could abort
if one was in place in that case.

But I was more curious about your views on Linux-specific interfaces,
and you answered that; thanks!

2011-05-26 16:33:25

On Thu, May 26, 2011 at 10:03 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, May 26, 2011 at 7:37 AM, Colin Walters <[email protected]> wrote:
>>
>> I'm curious which features you feel are esoteric and cool but unused?
>
> Just about anything linux-specific. Ranging from the totally new
> concepts (epoll/clone/splice/signalfd) to just simple cleanups and
> extensions of reasonably standard stuff (sync_file_range/sendpage).
>
> Sure, there's almost always *somebody* who uses them, but they are
> seldom actually worth it.
>
> The one thing that works well is when you expose it as a standard
> interface. So futexes are linux-specific, but they are exposed as the
> standard pthreads condition variables etc to apps - very few actually
> use them as futexes. But because glibc uses them for the pthreads
> synchronization, I think they ended up being used inside glibc for
> low-level stuff too, so I think futexes ended up being an unqualified
> success - much better than the standard interface.
>
> The "it can be used in standard libraries" ends up being a very
> powerful thing. It doesn't have to be libc - if something like a glib
> or a big graphical interface uses them, they can get very popular. But
> if you have to have actual config options (autoconf or similar) to
> enable the feature on Linux, along with a compatibility case (because
> older kernels don't even support it, so it's not even "linux", it's
> "linux newer than xyz"), then very very few applications end up using
> it.
>
> And security issues in particular are often *very* subtle. For
> example, something like a system call filter sounds like an obviously
> safe thing: it can only limit what you do, right?
>
> Except no, not right at all. Imagine that you're limiting a suid
> application, and the one operation you limit is "setuid()". Imagine
> that the suid application explicitly drops privileges in order to run
> safely as the user. Imagine, further, that it doesn't even check the
> return value, because it *knows* that if it is root, it will succeed,
> and if it isn't root, then it wasn't suid to begin with and doesn't
> need to do anything about it.
>
> Unlikely? Hell no. That's standard practice. And if you allow filter
> setup that survives fork+exec, you just opened a HUGE security hole.
>
> Fixable? Yes, easily. And I haven't looked at the current patches, but
> I would not be AT ALL surprised if they had exactly the above huge
> security hole.

FWIW, none of the patches deal with privilege escalation via setuid
files or file capabilities.

> My point being that (a) I'm very dubious about new non-standard
> features, because historically they seldom get used very widely and
> (b) I'm doubly dubious about security things because it turns out it's
> damn easy to get it wrong in all kinds of small subtle details.

I agree with both points, so I'm being a bit hypocritical, I suspect.

At present, I'm not aware of any platforms that support system call
restriction in a non-platform-specific fashion: mac has seatbelt,
freebsd has things like capsicum, linux has seccomp :) This led me to
the proposal around expanding seccomp since it was already a Linux-ism
for this functionality and, ideally, could be minimal to help limit
the subtle-bug-exposure. However, any form of kernel attack surface
reduction would be great, but I'm unaware of any that integrate with
glibc smoothly (even if some do have nicer programming interfaces).

I've used system call filtering in the past with good effect in server
environments, and I believe the Chromium renderer example is a robust
one for Linux desktops, even without glibc integration. If the
Linux-specific, non-automatic (glibc) interface is a no-go, then I'll
go back to the drawing board. I'm not sure how to avoid something
Linux-specific in general, even if it's just adding syscall hooks to
LSMs, though it could be possible to share interfaces with some other
platform's implementation of a broader security system that includes
kernel exposure minimization (like capsicum) which could be built on
whatever existing substrate is available or a new one, as Ingo proposes.

thanks!
will

2011-05-26 16:54:16

On Thu, May 26, 2011 at 9:33 AM, Will Drewry <[email protected]> wrote:
>
> FWIW, none of the patches deal with privilege escalation via setuid
> files or file capabilities.

That is NOT AT ALL what I'm talking about.

I'm talking about the "setuid()" system call (and all its cousins:
setgid/setreuid etc). And the whole thread has been about filtering
system calls, no?

Do a google code search for setuid.

In good code, it will look something like

  uid = getuid();

  if (setuid(uid)) {
    fprintf(stderr, "Unable to drop privileges\n");
    exit(1);
  }

but I guarantee you that there are cases where people just blindly
drop privileges. google code search found me at least the "heirloom"
source code doing exactly that.
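
To spell out the difference, here is a sketch of both patterns side by
side. The function names and the injected setuid_fn pointer are
hypothetical and exist only so the failure can be simulated without a real
filter; filtered_setuid() stands in for a setuid() that something outside
the program has made fail:

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

typedef int (*setuid_fn)(uid_t);

/* Stand-in for a setuid() call that was denied from outside. */
static int filtered_setuid(uid_t uid)
{
	(void)uid;
	errno = EPERM;
	return -1;
}

/*
 * The blind pattern: the return value is ignored, so the caller is told
 * "privileges dropped" (0) even when nothing was dropped.
 */
static int drop_privs_blind(setuid_fn do_setuid, uid_t uid)
{
	(void)do_setuid(uid);
	return 0;
}

/* The checked pattern: a failed drop is reported and the caller must stop. */
static int drop_privs_checked(setuid_fn do_setuid, uid_t uid)
{
	if (do_setuid(uid) != 0)
		return -1;
	return 0;
}
```

With filtered_setuid() plugged in, drop_privs_blind() still reports
success and the program carries on with whatever privileges it had, while
drop_privs_checked() reports the failure.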

And if you filter system calls, it's entirely possible that you can
attack suid executables through such a vector. Your "limit system
calls for security" security suddenly turned into "avoid the system
call that made things secure"!

See what I'm saying?

Linus

2011-05-26 17:02:47

On Thu, May 26, 2011 at 11:46 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, May 26, 2011 at 9:33 AM, Will Drewry <[email protected]> wrote:
>>
>> FWIW, none of the patches deal with privilege escalation via setuid
>> files or file capabilities.
>
> That is NOT AT ALL what I'm talking about.
>
> I'm talking about the "setuid()" system call (and all its cousins:
> setgid/setreuid etc). And the whole thread has been about filtering
> system calls, no?
>
> Do a google code search for setuid.
>
> In good code, it will look something like
>
>   uid = getuid();
>
>   if (setuid(uid)) {
>     fprintf(stderr, "Unable to drop privileges\n");
>     exit(1);
>   }
>
> but I guarantee you that there are cases where people just blindly
> drop privileges. google code search found me at least the "heirloom"
> source code doing exactly that.
>
> And if you filter system calls, it's entirely possible that you can
> attack suid executables through such a vector. Your "limit system
> calls for security" security suddenly turned into "avoid the system
> call that made things secure"!
>
> See what I'm saying?

Absolutely - that was what I meant :/ The patches do not currently
check creds at creation or again at use, which would lead to
unprivileged filters being used in a privileged context. Right now,
though, if setuid() is not allowed by the seccomp-filter, the process
will be immediately killed with do_exit(SIGKILL) on call -- thus
avoiding a silent failure. I mentioned file capabilities because they
can have setuid-like side effects, too. As long as system call
rejection results in a process death, I *think* it helps with some of
this complexity, but I haven't fully vetted the patches for these
scenarios to be 100% confident.

Sorry I wasn't clear!
will

2011-05-26 17:04:45

On Thu, May 26, 2011 at 12:02 PM, Will Drewry <[email protected]> wrote:
> On Thu, May 26, 2011 at 11:46 AM, Linus Torvalds
> <[email protected]> wrote:
>> On Thu, May 26, 2011 at 9:33 AM, Will Drewry <[email protected]> wrote:
>>>
>>> FWIW, none of the patches deal with privilege escalation via setuid
>>> files or file capabilities.
>>
>> That is NOT AT ALL what I'm talking about.
>>
>> I'm talking about the "setuid()" system call (and all its cousins:
>> setgid/setreuid etc). And the whole thread has been about filtering
>> system calls, no?
>>
>> Do a google code search for setuid.
>>
>> In good code, it will look something like
>>
>>   uid = getuid();
>>
>>   if (setuid(uid)) {
>>     fprintf(stderr, "Unable to drop privileges\n");
>>     exit(1);
>>   }
>>
>> but I guarantee you that there are cases where people just blindly
>> drop privileges. google code search found me at least the "heirloom"
>> source code doing exactly that.
>>
>> And if you filter system calls, it's entirely possible that you can
>> attack suid executables through such a vector. Your "limit system
>> calls for security" security suddenly turned into "avoid the system
>> call that made things secure"!
>>
>> See what I'm saying?
>
> Absolutely - that was what I meant :/  The patches do not currently
> check creds at creation or again at use, which would lead to
> unprivileged filters being used in a privileged context.  Right now,
> though, if setuid() is not allowed by the seccomp-filter, the process
> will be immediately killed with do_exit(SIGKILL) on call -- thus
> avoiding a silent failure. I mentioned file capabilities because they
> can have setuid-like side effects, too.  As long as system call
> rejection results in a process death, I *think* it helps with some of
> this complexity, but I haven't fully vetted the patches for these
> scenarios to be 100% confident.

Bah - by "setuid-like side effects", I meant suid executable-like side
effects. And even outside of those scenarios, I think immediate
process death helps catch coding mistakes that lead to filtering
setuid() calls prior to use.

cheers,
will

2011-05-26 17:07:22

by Steven Rostedt

On Thu, 2011-05-26 at 09:46 -0700, Linus Torvalds wrote:

> And if you filter system calls, it's entirely possible that you can
> attack suid executables through such a vector. Your "limit system
> calls for security" security suddenly turned into "avoid the system
> call that made things secure"!
>
> See what I'm saying?

So you are not complaining about this implementation, but the use of
syscall filtering?

There may be some user that says, "oh I don't want my other apps to be
able to call setuid" thinking it will secure their application even
more. But because that application did the brain dead thing of not
checking the return code of setuid, and it just happened to be running
privileged, it then execs off another application that can root the box.

Because, originally that setuid would have succeeded if the user did
nothing special, but now with this filtering, and the user thinking that
they could limit their app from doing harm, they just opened up a hole
that caused their app to do the exact opposite and give the exec'd app
full root privileges.

Did I get this right?

-- Steve

2011-05-26 17:18:01

On Thu, May 26, 2011 at 10:02 AM, Will Drewry <[email protected]> wrote:
>
> Absolutely - that was what I meant :/  The patches do not currently
> check creds at creation or again at use, which would lead to
> unprivileged filters being used in a privileged context.  Right now,
> though, if setuid() is not allowed by the seccomp-filter, the process
> will be immediately killed with do_exit(SIGKILL) on call -- thus
> avoiding a silent failure.

Umm.

You do realize that there is a reason we don't allow random kill()
system calls to succeed without privileges either?

So no, "we kill it with sigkill" is not safe *either*. It now is
potentially a way to kill privileged processes that you didn't have
permission to kill.

My point is that it all sounds designed for well-behaved processes.
"kill it if it does something bad" sounds like a *wonderful* idea if
you're doing a sandbox.

But it is suddenly potentially deadly if that capability is used by a
malicious user for a process that isn't ready for it.

One option is to just not ever allow execve() from inside a restricted
environment.

Linus

2011-05-26 17:38:23

On Thu, May 26, 2011 at 12:17 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, May 26, 2011 at 10:02 AM, Will Drewry <[email protected]> wrote:
>>
>> Absolutely - that was what I meant :/ The patches do not currently
>> check creds at creation or again at use, which would lead to
>> unprivileged filters being used in a privileged context. Right now,
>> though, if setuid() is not allowed by the seccomp-filter, the process
>> will be immediately killed with do_exit(SIGKILL) on call -- thus
>> avoiding a silent failure.
>
> Umm.
>
> You do realize that there is a reason we don't allow random kill()
> system calls to succeed without privileges either?
>
> So no, "we kill it with sigkill" is not safe *either*. It now is
> potentially a way to kill privileged processes that you didn't have
> permission to kill.
>
> My point is that it all sounds designed for well-behaved processes.
> "kill it if it does something bad" sounds like a *wonderful* idea if
> you're doing a sandbox.

Yeah - we end up in a weird place, because for many suid executables,
the failure would be immediate (at priv drop), but it introduces bugs
that will be less obvious in more complex scenarios.

> But it is suddenly potentially deadly if that capability is used by a
> malicious user for a process that isn't ready for it.
>
> One option is to just not ever allow execve() from inside a restricted
> environment.

That'd certainly be fine with me. Another option could be adding
cred checking (from set to use) or execve-time checking or ..., but
simple works for me. I'm not hung up on the implementation details
specifically if the end result is that the syscalls can be _safely_
whitelisted.

Thanks!

2011-05-26 17:39:54

by Valdis Klētnieks

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Thu, 26 May 2011 12:02:45 CDT, Will Drewry said:

> Absolutely - that was what I meant :/ The patches do not currently
> check creds at creation or again at use, which would lead to
> unprivileged filters being used in a privileged context. Right now,
> though, if setuid() is not allowed by the seccomp-filter, the process
> will be immediately killed with do_exit(SIGKILL) on call -- thus
> avoiding a silent failure.

How do you know you have the bounding set correct?

This has been a long-standing issue for SELinux policy writing - it's usually
easy to get 95% of the bounding box right (you need these rules for shared
libraries, you need these rules to access the user's home directory, you need
these other rules to talk TCP to the net, etc). There's a nice tool that
converts any remaining rejection messages into rules you can add to the policy.

The problem is twofold: (a) that way you can never be sure you got *all* the
rules right and (b) the missing rules are almost always in squirrelly little
error-handling code that gets invoked once in a blue moon. So in this case,
you end up with trying to debug the SIGKILL that happened when the process was
already in trouble for some other reason...

"Wow. Who would have guessed that program only called gettimeofday() in
the error handler for when it was formatting its crash message?"

Exactly.

2011-05-26 18:07:35

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On 05/26/2011 02:38 PM, Ingo Molnar wrote:
> * Avi Kivity<[email protected]> wrote:
>
> > > The biggest amount of RAM is the guest RAM image - but if that is
> > > mmap(SHARED) and mapped using hugepages then the pte overhead
> > > from a process model is largely mitigated.
> >
> > That doesn't work with memory hotplug.
>
> Why not, if we do the sensible thing and restrict the size
> granularity and alignment of plugged/unplugged memory regions to 2MB?

Once forked, you cannot have new shared anonymous memory, can you?

> We can fix guest Linux as well to not be stupid about the sizing of
> memory hotplug requests. It does hotplug based on the memory map we
> pass to it anyway.
>
> Am i missing something obvious here?

Yes, the new mmap() will be only visible in the calling process.

> > > Maybe even the isolation and per device access control of
> > > *same-class* devices from each other is possible: with careful
> > > implementation of the subsystem shared data structures. (which
> > > isn't much really)
> >
> > Right, hardly at all in fact. The problem comes from the side-band
> > issues like reset, interrupts, hotplug, and whatnot.
>
> Yeah. There are two good aspects here i think:
>
> - The sideband IPC overhead does not matter much, it's a side band.
>
> - Spending the effort to isolate configuration details is worth it:
> sideband code is a primary breeding ground for bugs and security
> holes.
>
> The main worry to me would be the maintainability difference: does it
> result in much more complex code? As always i'm cautiously optimistic
> about that: i think once we try it we can find a suitable model ...
> It might even turn out to be more readable and more flexible in the
> end.

I also believe it will be more maintainable, especially if written in a
language that has explicit support for message passing (e.g. Erlang).
This is because it is more similar to how hardware actually works.
However it needs to be designed in, it's not just a matter of switching
a thread to a process.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2011-05-26 18:08:13

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Thu, May 26, 2011 at 12:38 PM, <[email protected]> wrote:
> On Thu, 26 May 2011 12:02:45 CDT, Will Drewry said:
>
>> Absolutely - that was what I meant :/ The patches do not currently
>> check creds at creation or again at use, which would lead to
>> unprivileged filters being used in a privileged context. Right now,
>> though, if setuid() is not allowed by the seccomp-filter, the process
>> will be immediately killed with do_exit(SIGKILL) on call -- thus
>> avoiding a silent failure.
>
> How do you know you have the bounding set correct?
>
> This has been a long-standing issue for SELinux policy writing - it's usually
> easy to get 95% of the bounding box right (you need these rules for shared
> libraries, you need these rules to access the user's home directory, you need
> these other rules to talk TCP to the net, etc). There's a nice tool that
> converts any remaining rejection messages into rules you can add to the policy.
>
> The problem is twofold: (a) that way you can never be sure you got *all* the
> rules right and (b) the missing rules are almost always in squirrelly little
> error-handling code that gets invoked once in a blue moon. So in this case,
> you end up with trying to debug the SIGKILL that happened when the process was
> already in trouble for some other reason...
>
> "Wow. Who would have guessed that program only called gettimeofday() in
> the error handler for when it was formatting its crash message?"
>
> Exactly.

Depending on the need, there is work involved, and there are many ways
to determine your bounding box. It can be very tight -- where you
analyze normal workloads (perf,strace,objdump) and accept the fact
that pathological workloads may result in process death -- or it can
be quite loose and enable most system calls, just not newer ones,
let's say. In practice, you might get bit a few times if you're
overly zealous (I know I have), but it's the difference between
failing open and failing closed. There are some scenarios where you
never, ever want to fail-open even at the cost of process death and
lack of solid insight into a valid failure path.
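
To make the fail-open versus fail-closed distinction concrete, here is a
small user-space model of an allowlist decision. It is only a sketch under
stated assumptions: `filter_decide`, `enum verdict`, and the contents of
`tight_allowlist` are hypothetical names chosen for illustration and are not
taken from the patch series. Anything not explicitly listed is denied, which
is the fail-closed behaviour described above.

```c
#include <stddef.h>
#include <sys/syscall.h>

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical verdicts for this illustration; not from the patches. */
enum verdict { VERDICT_ALLOW, VERDICT_KILL };

/*
 * A deliberately tight bounding box, as might be derived from strace
 * output of a normal workload. The contents are an assumption.
 */
static const long tight_allowlist[] = { SYS_read, SYS_write, SYS_exit };

/* Fail closed: any syscall number missing from the list is fatal. */
static enum verdict filter_decide(const long *allowed, size_t count, long nr)
{
	for (size_t i = 0; i < count; i++) {
		if (allowed[i] == nr)
			return VERDICT_ALLOW;
	}
	return VERDICT_KILL;
}
```

With a table this tight, a call made only from a rarely exercised error
path gets VERDICT_KILL the first time it runs, which is the debugging hazard
raised earlier in the thread. A looser table trades that risk for a larger
attack surface.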

Hope that makes sense and isn't too general,
will

2011-05-26 18:16:21

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

* Avi Kivity <[email protected]> wrote:

> On 05/26/2011 02:38 PM, Ingo Molnar wrote:
> >* Avi Kivity<[email protected]> wrote:
> >
> >> > The biggest amount of RAM is the guest RAM image - but if that is
> >> > mmap(SHARED) and mapped using hugepages then the pte overhead
> >> > from a process model is largely mitigated.
> >>
> >> That doesn't work with memory hotplug.
> >
> > Why not, if we do the sensible thing and restrict the size
> > granularity and alignment of plugged/unplugged memory regions to
> > 2MB?
>
> Once forked, you cannot have new shared anonymous memory, can you?

We can have named shared memory.

Incidentally i suggested this to Pekka just yesterday: i think we
should consider guest RAM images to be named files on the local
filesystem (prefixed with the disk image's name or so, for easy
identification), this will help with debugging and with swapping as
well. (This way guest RAM won't eat up regular anonymous swap space -
it will be swapped to the filesystem.)
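
A rough sketch of the named-file idea, in plain POSIX C. Everything here is
an assumption made for illustration: the helper names `map_named_region` and
`share_after_fork` are hypothetical, and nothing below comes from an actual
kvm tool. It shows only the property under discussion: a region created
*after* fork() can still be reached by another process, because it is found
by name rather than inherited.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map a named file as shared memory, creating and sizing it if needed. */
static void *map_named_region(const char *path, size_t size)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return NULL;
	if (ftruncate(fd, (off_t)size) != 0) {
		close(fd);
		return NULL;
	}
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after close() */
	return p == MAP_FAILED ? NULL : p;
}

/*
 * The child creates the region after the fork and stores 42 in it; the
 * parent then maps the same file by name and returns what it reads.
 * Returns -1 on any failure.
 */
static int share_after_fork(const char *path, size_t size)
{
	pid_t pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		int *region = map_named_region(path, size);
		if (!region)
			_exit(1);
		region[0] = 42;
		_exit(0);
	}

	int status;
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	int *region = map_named_region(path, size);
	if (!region)
		return -1;
	int value = region[0];
	munmap(region, size);
	unlink(path);
	return value;
}
```

For example, `share_after_fork("/tmp/guest-ram-demo.img", 2 * 1024 * 1024)`
would use a 2MB region, matching the granularity suggested earlier. Whether
such a file-backed mapping can get huge pages is a separate question that
this sketch does not address.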

As a sidenote, live migration might also become possible this way: in
theory we could freeze a guest to its RAM image - which can then be
copied (together with the disk image) to another box as files and
restarted there, with some hw configuration state dumped to a
header portion of that RAM image as well. (outside of the RAM area)

Thanks,

Ingo

2011-05-26 18:21:50

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On 05/26/2011 09:15 PM, Ingo Molnar wrote:
> * Avi Kivity<[email protected]> wrote:
>
> > On 05/26/2011 02:38 PM, Ingo Molnar wrote:
> > >* Avi Kivity<[email protected]> wrote:
> > >
> > >> > The biggest amount of RAM is the guest RAM image - but if that is
> > >> > mmap(SHARED) and mapped using hugepages then the pte overhead
> > >> > from a process model is largely mitigated.
> > >>
> > >> That doesn't work with memory hotplug.
> > >
> > > Why not, if we do the sensible thing and restrict the size
> > > granularity and alignment of plugged/unplugged memory regions to
> > > 2MB?
> >
> > Once forked, you cannot have new shared anonymous memory, can you?
>
> We can have named shared memory.

But then the benefit of transparent huge pages goes away.

Of course, if someone is working on extending transparent hugepages, the
problem is solved. I know there is interest in this.

> Incidentally i suggested this to Pekka just yesterday: i think we
> should consider guest RAM images to be named files on the local
> filesystem (prefixed with the disk image's name or so, for easy
> identification), this will help with debugging and with swapping as
> well. (This way guest RAM won't eat up regular anonymous swap space -
> it will be swapped to the filesystem.)

Qemu supports this via -mem-path. The motivation was supporting
hugetlbfs, before THP. I can't say it was useful for debugging (but
then qemu has a built in memory inspector and debugger, and supports
attaching gdb to the guest).

> As a sidenote, live migration might also become possible this way: in
> theory we could freeze a guest to its RAM image - which can then be
> copied (together with the disk image) to another box as files and
> restarted there, with some hw configuration state dumped to a
> header portion of that RAM image as well. (outside of the RAM area)

Live migration involves the guest running in parallel with its memory
being copied over. Even a 1GB guest will take 10s over 1GbE; any
reasonably sized guest will take forever.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2011-05-26 18:20:25

by Peter Zijlstra

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Thu, 2011-05-26 at 20:15 +0200, Ingo Molnar wrote:
> Incidentally i suggested this to Pekka just yesterday: i think we
> should consider guest RAM images to be named files on the local
> filesystem (prefixed with the disk image's name or so, for easy
> identification),

That'll break THP and KSM; both rely on and work with anonymous memory only.

2011-05-26 18:24:27

by Valdis Klētnieks

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Thu, 26 May 2011 13:08:10 CDT, Will Drewry said:

> Depending on the need, there is work involved, and there are many ways
> to determine your bounding box. It can be very tight -- where you
> analyze normal workloads (perf,strace,objdump) and accept the fact
> that pathological workloads may result in process death -- or it can
> be quite loose and enable most system calls, just not newer ones,
> let's say. In practice, you might get bit a few times if you're
> overly zealous (I know I have), but it's the difference between
> failing open and failing closed. There are some scenarios where you
> never, ever want to fail-open even at the cost of process death and
> lack of solid insight into a valid failure path.

> Hope that makes sense and isn't too general,

Oh, I already understood all that. :) I'd have to double-check the actual
patch: does it give a (hopefully rate-limited) printk or other hint about which
syscall caused the issue, to help in making up the list of needed syscalls?

And we probably want a cleaned-up copy of the quoted paragraph in the
documentation for this feature when it hits the streets. People tuning in late
will need guidance on how to use this in their projects.

2011-05-26 18:34:48

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Thu, May 26, 2011 at 10:38 AM, Will Drewry <[email protected]> wrote:
>>
>> One option is to just not ever allow execve() from inside a restricted
>> environment.
>
> That'd certainly be fine with me.

So if it ends up being purely a "internal to the process" thing, then
I'm much happier about it - it not only limits the scope of things
sufficiently that I don't worry too much about security issues, but it
makes it very clear that it's about a process going into "lock-down"
mode on its own.

It also gets rid of all configuration - one of the things that makes
most security frameworks (look at selinux, but also just ACL's etc)
such a crazy rats nest is the whole "set up for other processes". If
it's designed very much to be about just the "self" process (after
initialization etc), then I think that avoids pretty much all the
serious issues.

A lot of server processes could probably use it as a way to say "Hey,
I guarantee that I will only open new files read-only, and will only
write to the socket that was already opened for me by the accept", and
explicitly limit their worker threads that way.

If that is really sufficient for some chrome sandboxing, then hey,
that's all fine.

Sometimes limiting yourself (rather than looking for some bigger
"generic" solution) is the right answer.
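
For reference, the existing seccomp mode 1 already gives a very blunt form
of this self-imposed lock-down. The sketch below uses that existing strict
mode through prctl(), not the filter interface proposed in this thread; the
helper name `run_in_strict_mode` is hypothetical. The child restricts itself
and may then only read(), write(), exit and return from signal handlers;
any other system call gets it killed with SIGKILL.

```c
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that locks itself down with seccomp strict mode, then
 * either behaves (writes "ok" to out_fd and exits) or, if `violate` is
 * non-zero, makes a forbidden system call first.
 * Returns the child's wait status, or -1 on error.
 */
static int run_in_strict_mode(int out_fd, int violate)
{
	pid_t pid = fork();
	if (pid < 0)
		return -1;

	if (pid == 0) {
		if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
			_exit(2);	/* seccomp not available */

		if (violate)
			syscall(SYS_getpid);	/* not allowed: SIGKILL */

		if (write(out_fd, "ok", 2) != 2)
			syscall(SYS_exit, 1);

		/*
		 * The libc _exit() uses exit_group(), which strict mode
		 * does not allow, so call the plain exit syscall directly.
		 */
		syscall(SYS_exit, 0);
	}

	int status;
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return status;
}
```

Because the restriction is applied by the process to itself and this child
never execs, none of the setuid concerns raised earlier in the thread apply
to it. The cost is that the allowed set is fixed and tiny, which is the gap
a filter-based extension would fill.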

Linus

2011-05-26 18:35:23

by David Lang

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Thu, 26 May 2011, Linus Torvalds wrote:
>
> On Thu, May 26, 2011 at 9:33 AM, Will Drewry <[email protected]> wrote:
>>
>> FWIW, none of the patches deal with privilege escalation via setuid
>> files or file capabilities.
>
> That is NOT AT ALL what I'm talking about.
>
> I'm talking about the "setuid()" system call (and all its cousins:
> setgid/setreuid etc). And the whole thread has been about filtering
> system calls, no?
>
> Do a google code search for setuid.
>
> In good code, it will look something like
>
>   uid = getuid();
>
>   if (setuid(uid)) {
>           fprintf(stderr, "Unable to drop privileges\n");
>           exit(1);
>   }
>
> but I guarantee you that there are cases where people just blindly
> drop privileges. google code search found me at least the "heirloom"
> source code doing exactly that.

I believe that sendmail had this exact vulnerability when capabilities were
used to control setuid a couple of years ago.
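
As a self-contained version of the checked privilege drop quoted above, here
is a minimal sketch. The function name `drop_privileges` is hypothetical,
and a real program would normally also reset its supplementary groups with
setgroups() before this point; that step is left out because it needs
privilege. The idea is simply that every step is checked and the result is
verified afterwards, so a filtered or otherwise failing setuid() cannot go
unnoticed.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Switch to the given gid and uid. Returns 0 on success and -1 if any
 * step fails, in which case the caller must not carry on as if the
 * drop had happened.
 */
static int drop_privileges(uid_t uid, gid_t gid)
{
	/* Group first: once the uid is dropped, setgid() may be refused. */
	if (setgid(gid) != 0)
		return -1;
	if (setuid(uid) != 0)
		return -1;

	/* Do not rely on the return codes alone; confirm the result. */
	if (getuid() != uid || geteuid() != uid)
		return -1;
	if (getgid() != gid || getegid() != gid)
		return -1;
	return 0;
}
```

A caller would treat a non-zero return as fatal, for example by printing an
error and calling exit(1), rather than going on to exec anything.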

David Lang

2011-05-26 18:37:25

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

* Avi Kivity <[email protected]> wrote:

> Live migration involves the guest running in parallel with its
> memory being copied over. Even a 1GB guest will take 10s over
> 1GbE; any reasonably sized guest will take forever.

I suspect we are really offtopic here, but an initial rsync, then
stopping the guest, final rsync + restart the guest at the target
would work with minimal interruption.

Thanks,

Ingo

2011-05-26 18:39:13

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

* Peter Zijlstra <[email protected]> wrote:

> On Thu, 2011-05-26 at 20:15 +0200, Ingo Molnar wrote:
>
> > Incidentally i suggested this to Pekka just yesterday: i think we
> > should consider guest RAM images to be named files on the local
> > filesystem (prefixed with the disk image's name or so, for easy
> > identification),
>
> That'll break THP and KSM; both rely on and work with anonymous memory only.

No reason they should be limited to anon only though.

Also, don't we have some sort of anonfs, from which we could get an
fd, which, if mmap()-ed produces regular anonymous shared memory?

That fd could be passed over to other processes, who could then
mmap() the new piece of shared-anon memory as well.

Thanks,

Ingo

2011-05-26 18:43:33

by Casey Schaufler

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On 5/26/2011 10:07 AM, Steven Rostedt wrote:
> On Thu, 2011-05-26 at 09:46 -0700, Linus Torvalds wrote:
>
>> And if you filter system calls, it's entirely possible that you can
>> attack suid executables through such a vector. Your "limit system
>> calls for security" security suddenly turned into "avoid the system
>> call that made things secure"!
>>
>> See what I'm saying?
> So you are not complaining about this implementation, but the use of
> syscall filtering?
>
> There may be some user that says, "oh I don't want my other apps to be
> able to call setuid" thinking it will secure their application even
> more. But because that application did the brain-dead thing of not checking
> the return code of setuid, and it just happened to be running
> privileged, it then execs off another application that can root the box.
>
> Because, originally that setuid would have succeeded if the user did
> nothing special, but now with this filtering, and the user thinking that
> they could limit their app from doing harm, they just opened up a hole
> that caused their app to do the exact opposite and give the exec'd app
> full root privileges.
>
> Did I get this right?

Yes. Some system calls are there so that you can turn off
privilege. There was a major exploit with sendmail when capabilities
were first introduced that brought the potential for this sort of
problem into the public eye. Kernel mechanisms intended to provide
additional security have to be massively careful about the impact
they may have on applications that are currently security aware and
that make use of the existing mechanisms. The ACL mechanism is much
more complicated than it probably ought to be to accommodate chmod(),
and capabilities go way over the top to deal with traditional root
behavior.

> -- Steve

2011-05-26 18:45:49

by Valdis Klētnieks

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Thu, 26 May 2011 20:36:35 +0200, Ingo Molnar said:

> I suspect we are really offtopic here, but an initial rsync, then
> stopping the guest, final rsync + restart the guest at the target
> would work with minimal interruption.

Actually, after you kick off the migrate, you really want to be tracking in
real time what pages get dirtied while you're doing the initial copy, so that
the second rsync doesn't have to walk through the file finding the differences.
But that requires some extra instrumentation.
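
A toy model of that dirty tracking, to show why it removes the re-scan. This
is purely illustrative: `struct dirty_log`, `mark_dirty` and
`copy_dirty_pages` are hypothetical names, the log is limited to 64 pages,
and it says nothing about how any real hypervisor implements this.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One bit per guest page; this toy log covers at most 64 pages. */
struct dirty_log {
	uint64_t bits;
};

/* Called whenever the guest writes to a page during the first copy. */
static void mark_dirty(struct dirty_log *log, unsigned page)
{
	log->bits |= UINT64_C(1) << page;
}

/*
 * Second pass: copy only the pages flagged dirty, then clear the log.
 * Returns the number of pages copied. No comparison of src and dst is
 * needed, which is the point of tracking writes as they happen.
 */
static unsigned copy_dirty_pages(struct dirty_log *log, const uint8_t *src,
				 uint8_t *dst, unsigned page_count,
				 size_t page_size)
{
	unsigned copied = 0;

	for (unsigned page = 0; page < page_count && page < 64; page++) {
		if (log->bits & (UINT64_C(1) << page)) {
			memcpy(dst + page * page_size,
			       src + page * page_size, page_size);
			copied++;
		}
	}
	log->bits = 0;
	return copied;
}
```

The work in the final pass is proportional to the number of dirtied pages
rather than to the size of the whole RAM image.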

2011-05-26 18:47:38

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

* Linus Torvalds <[email protected]> wrote:

> It also gets rid of all configuration - one of the things that
> makes most security frameworks (look at selinux, but also just
> ACL's etc) such a crazy rats nest is the whole "set up for other
> processes". If it's designed very much to be about just the "self"
> process (after initialization etc), then I think that avoids pretty
> much all the serious issues.

That's how the event filters work currently: even when inherited they
get removed when exec-ing a setuid task, so they cannot leak into
privileged context and cannot modify execution there.

Inheritance works when requested, covering only same-credential child
tasks, not privileged successors.

Thanks,

Ingo

2011-05-26 18:49:06

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Thu, May 26, 2011 at 1:33 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, May 26, 2011 at 10:38 AM, Will Drewry <[email protected]> wrote:
>>>
>>> One option is to just not ever allow execve() from inside a restricted
>>> environment.
>>
>> That'd certainly be fine with me.
>
> So if it ends up being purely a "internal to the process" thing, then
> I'm much happier about it - it not only limits the scope of things
> sufficiently that I don't worry too much about security issues, but it
> makes it very clear that it's about a process going into "lock-down"
> mode on its own.
>
> It also gets rid of all configuration - one of the things that makes
> most security frameworks (look at selinux, but also just ACL's etc)
> such a crazy rats nest is the whole "set up for other processes". If
> it's designed very much to be about just the "self" process (after
> initialization etc), then I think that avoids pretty much all the
> serious issues.
>
> A lot of server processes could probably use it as a way to say "Hey,
> I guarantee that I will only open new files read-only, and will only
> write to the socket that was already opened for me by the accept", and
> explicitly limit their worker threads that way.
>
> If that is really sufficient for some chrome sandboxing, then hey,
> that's all fine.

It adds some hoops, but less than exist today.

> Sometimes limiting yourself (rather than looking for some bigger
> "generic" solution) is the right answer.

I will very happily validate usage and repost with a self-limited
patch series. Doing so makes the change much more explicitly an
expansion of seccomp, which keeps things sane.

Thanks!
will

2011-05-26 18:51:12

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

* [email protected] <[email protected]> wrote:

> On Thu, 26 May 2011 20:36:35 +0200, Ingo Molnar said:
>
> > I suspect we are really offtopic here, but an initial rsync, then
> > stopping the guest, final rsync + restart the guest at the target
> > would work with minimal interruption.
>
> Actually, after you kick off the migrate, you really want to be
> tracking in real time what pages get dirtied while you're doing the
> initial copy, so that the second rsync doesn't have to walk through
> the file finding the differences. But that requires some extra
> instrumentation.

Yeah, and that's how socket based live migration works - it's
completely seamless.

But note that the rsync re-scan should not be an issue: both the
source and the target system will obviously have a *lot* more RAM
than the guest RAM image size.

Thanks,

Ingo

2011-05-26 18:54:24

by Steven Rostedt

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Thu, 2011-05-26 at 11:43 -0700, Casey Schaufler wrote:

> > Did I get this right?
>
> Yes.

Thanks for the validation ;)

-- Steve

2011-05-26 18:54:41

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

* Linus Torvalds <[email protected]> wrote:

> And if you filter system calls, it's entirely possible that you can
> attack suid executables through such a vector. Your "limit system
> calls for security" security suddenly turned into "avoid the system
> call that made things secure"!

That should not be possible with Will's event filter based solution
(his last submitted patch), due to this code in fs/exec.c (which is
in your upstream tree as well):

        /*
         * Flush performance counters when crossing a
         * security domain:
         */
        if (!get_dumpable(current->mm))
                perf_event_exit_task(current);

This will drop all filters if a setuid-root (or whatever setuid)
binary is executed from a filtered environment.

Does this cover the case you were thinking of?

Thanks,

Ingo

2011-05-26 19:06:18

by David Lang

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Thu, 26 May 2011, Ingo Molnar wrote:

> * Linus Torvalds <[email protected]> wrote:
>
>> It also gets rid of all configuration - one of the things that
>> makes most security frameworks (look at selinux, but also just
>> ACL's etc) such a crazy rats nest is the whole "set up for other
>> processes". If it's designed very much to be about just the "self"
>> process (after initialization etc), then I think that avoids pretty
>> much all the serious issues.
>
> That's how the event filters work currently: even when inherited they
> get removed when exec-ing a setuid task, so they cannot leak into
> privileged context and cannot modify execution there.
>
> Inheritance works when requested, covering only same-credential child
> tasks, not privileged successors.

This is a very reasonable default, but there should be some way of saying
that you want the restrictions to carry over to the suid task (an "I really
know what I'm doing" switch).

David Lang

2011-05-26 19:09:10

by Eric Paris

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Thu, May 26, 2011 at 3:05 PM, <[email protected]> wrote:
> On Thu, 26 May 2011, Ingo Molnar wrote:
>
>> * Linus Torvalds <[email protected]> wrote:
>>
>>> It also gets rid of all configuration - one of the things that
>>> makes most security frameworks (look at selinux, but also just
>>> ACL's etc) such a crazy rats nest is the whole "set up for other
>>> processes". If it's designed very much to be about just the "self"
>>> process (after initialization etc), then I think that avoids pretty
>>> much all the serious issues.
>>
>> That's how the event filters work currently: even when inherited they
>> get removed when exec-ing a setuid task, so they cannot leak into
>> privileged context and cannot modify execution there.
>>
>> Inheritance works when requested, covering only same-credential child
>> tasks, not privileged successors.
>
> this is a very reasonable default, but there should be some way of saying
> that you want the restrictions to carry over to the suid task (I really know
> what I'm doing switch)

You mean the "I'm a hacker and want to be able to learn about tasks I
shouldn't be able to learn about" switch? No. You either get out of
the way on SUID or refuse to launch SUID apps. Those are the only
reasonable choices.

2011-05-26 19:46:40

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

* [email protected] <[email protected]> wrote:

> On Thu, 26 May 2011, Ingo Molnar wrote:
>
> >* Linus Torvalds <[email protected]> wrote:
> >
> >>It also gets rid of all configuration - one of the things that
> >>makes most security frameworks (look at selinux, but also just
> >>ACL's etc) such a crazy rats nest is the whole "set up for other
> >>processes". If it's designed very much to be about just the "self"
> >>process (after initialization etc), then I think that avoids pretty
> >>much all the serious issues.
> >
> >That's how the event filters work currently: even when inherited they
> >get removed when exec-ing a setuid task, so they cannot leak into
> >privileged context and cannot modify execution there.
> >
> >Inheritance works when requested, covering only same-credential child
> >tasks, not privileged successors.
>
> this is a very reasonable default, but there should be some way of
> saying that you want the restrictions to carry over to the suid
> task (I really know what I'm doing switch)

Unless you mean that root should be able to do it, it's a security
hole both for events and for filters:

- for example we don't want really fine-grained events to be used to
  BTS hw-trace sshd and thus enable an attacker to discover cryptographic
  properties of the private key sshd is using.

- we do not want to *modify* the execution flow of a setuid program:
  that can lead to exploits by pushing the privileged codepath into
  a condition that can never occur on a normal system, and thus
  push it into doing something it was not intended to do.

  Data damage could be done as well: for example if the privileged
  code is logging into a system file then modifying execution can
  damage the log file.

So it's not a good idea in general to allow unprivileged code to
modify the execution of privileged code. In fact it's not a good idea
to allow it to simply *observe* privileged code. It must remain a
black box with very little information leaking outwards.

Thanks,

Ingo

2011-05-26 19:49:51

by David Lang

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Thu, 26 May 2011, Ingo Molnar wrote:

> * [email protected] <[email protected]> wrote:
>
>> On Thu, 26 May 2011, Ingo Molnar wrote:
>>
>>> * Linus Torvalds <[email protected]> wrote:
>>>
>>>> It also gets rid of all configuration - one of the things that
>>>> makes most security frameworks (look at selinux, but also just
>>>> ACL's etc) such a crazy rats nest is the whole "set up for other
>>>> processes". If it's designed very much to be about just the "self"
>>>> process (after initialization etc), then I think that avoids pretty
>>>> much all the serious issues.
>>>
>>> That's how the event filters work currently: even when inherited they
>>> get removed when exec-ing a setuid task, so they cannot leak into
>>> privileged context and cannot modify execution there.
>>>
>>> Inheritance works when requested, covering only same-credential child
>>> tasks, not privileged successors.
>>
>> this is a very reasonable default, but there should be some way of
>> saying that you want the restrictions to carry over to the suid
>> task (I really know what I'm doing switch)
>
> Unless you mean that root should be able to do it, it's a security
> hole both for events and for filters:
>
> - for example we don't want really fine-grained events to be used to
>   BTS hw-trace sshd and thus enable an attacker to discover cryptographic
>   properties of the private key sshd is using.
>
> - we do not want to *modify* the execution flow of a setuid program:
>   that can lead to exploits by pushing the privileged codepath into
>   a condition that can never occur on a normal system, and thus
>   push it into doing something it was not intended to do.
>
>   Data damage could be done as well: for example if the privileged
>   code is logging into a system file then modifying execution can
>   damage the log file.
>
> So it's not a good idea in general to allow unprivileged code to
> modify the execution of privileged code. In fact it's not a good idea
> to allow it to simply *observe* privileged code. It must remain a
> black box with very little information leaking outwards.

I was thinking of the use case of the real sysadmin (i.e. root) wanting to
be able to constrain things. I can see why you would not want to allow
normal users to do this.

David Lang

2011-05-27 00:15:18

by James Morris

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

Btw, if anyone's going to be at Plumbers this year, we have a day set
aside for the security summit:

https://security.wiki.kernel.org/index.php/LinuxSecuritySummit2011

This may make a good discussion topic.

- James
--
James Morris
<[email protected]>

2011-05-29 16:52:02

by Aneesh Kumar K.V

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Wed, 25 May 2011 11:42:44 -0700, Linus Torvalds <[email protected]> wrote:
> On Wed, May 25, 2011 at 11:01 AM, Kees Cook <[email protected]> wrote:
> >
> > Can we just go back to the original spec? A lot of people were excited
> > about the prctl() API as done in Will's earlier patchset, we don't lose the
> > extremely useful "enable_on_exec" feature, and we can get away from all
> > this disagreement.
>
> .. and quite frankly, I'm not even convinced about the original simpler spec.
>
> Security is a morass. People come up with cool ideas every day, and
> nobody actually uses them - or if they use them, they are just a
> maintenance nightmare.
>
> Quite frankly, limiting pathname access by some prefix is "cool", but
> it's basically useless.
>
> That's not where security problems are.
>
> Security problems are in the odd corners - ioctl's, /proc files,
> random small interfaces that aren't just about file access.
>
> And who would *use* this thing in real life? Nobody. In order to sell
> me on a new security interface, give me a real actual use case that is
> security-conscious and relevant to real users.
>
> For things like web servers that actually want to limit filename
> lookup, we'd be *much* better off with a few new flags to
> pathname lookup that say "don't follow symlinks" and "don't follow
> '..'". Things like that can actually be beneficial to
> security-conscious programming, with very little overhead. Some of
> those things currently look up pathnames one component at a time,
> because they can't afford to not do so. That's a *much* better model
> for the whole "only limit to this subtree" case that was quoted
> sometime early in this thread.

The "make sure we don't follow symlinks at all" is a real problem in
VirtFS (http://wiki.qemu.org/Documentation/9psetup) that we are fixing
by adding a forked chrooted process to Qemu. If we are open to a new
open flag O_NOFOLLOW_PATH, which would fail with ELOOP if any
path component is a symbolic link, that would greatly simplify VirtFS.
Would such a new flag to open be acceptable?

-aneesh

2011-05-29 17:03:03

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Sun, May 29, 2011 at 9:51 AM, Aneesh Kumar K.V
<[email protected]> wrote:
>
> The "make sure we don't follow symlinks at all" is a real problem in
> VirtFS (http://wiki.qemu.org/Documentation/9psetup) that we are fixing
> by adding a forked chrooted process to Qemu. If we are open to a new
> open flag O_NOFOLLOW_PATH, which would fail with ELOOP if any
> path component is a symbolic link, that would greatly simplify VirtFS.
> Would such a new flag to open be acceptable?

Such a flag should be something like 3 lines of actual code (and then
the header file changes to actually add the mask itself, which is apt
to be the bulk of the patch just because we have to have different
values for different architectures).

And yes, it is absolutely acceptable. The only questions in my mind are

- why haven't we done this long ago?

- do we have the flag space?

- should we do a O_NOMNT_PATH flag to do the same for mount-points?

Some people worry about being confused by bind mounts etc.

- do we think ".." is worthy of a flag too?

or is that a "user space can damn well check that itself, even if
it would be absolutely trivial to check in the kernel too"?

Whatever. I think the NOFOLLOW_PATH one is pretty much a no-brainer.
It's not like symlink worries are unusual.

Linus

2011-05-29 18:24:31

by Al Viro

Subject: Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering

On Sun, May 29, 2011 at 10:02:06AM -0700, Linus Torvalds wrote:

> And yes, it is absolutely acceptable. The only questions in my mind are
>
> - why haven't we done this long ago?
>
> - do we have the flag space?
>
> - should we do a O_NOMNT_PATH flag to do the same for mount-points?
>
> Some people worry about being confused by bind mounts etc.
>
> - do we think ".." is worthy of a flag too?
>
> or is that a "user space can damn well check that itself, even if
> it would be absolutely trivial to check in the kernel too"?
>
> Whatever. I think the NOFOLLOW_PATH one is pretty much a no-brainer.
> It's not like symlink worries are unusual.

It's not *quite* a no-brainer. Guys, please hold that one off for a while;
we have more massage to do in the area and I *really* want to get atomic
open work finished (== intents gone, revalidation vs mountpoints sanitized,
etc.) before anything else is done to fs/namei.c. OK?

And as for .. - userland can bloody well check that on its own if it cares.
Let's keep it simple, please - we already have things far too complicated
in there for my taste.

2011-06-01 03:10:44

Subject: [PATCH v3 01/13] tracing: split out filter initialization and clean up.

Moves the perf-specific profile event allocation and freeing code into
kernel/perf_event.c where it is called from and two symbols are exported
via ftrace_event.h for instantiating struct event_filters without
requiring a change to the core tracing code.

The change allows globally registered ftrace events to be used in
event_filter structs. perf is the current consumer, but a possible
future consumer is a system call filtering using the secure computing
hooks (and the existing syscalls subsystem events).

Signed-off-by: Will Drewry <[email protected]>
---
include/linux/ftrace_event.h | 9 +++--
kernel/perf_event.c | 7 +++-
kernel/trace/trace_events_filter.c | 60 ++++++++++++++++++++++--------------
3 files changed, 48 insertions(+), 28 deletions(-)

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index 22b32af..fea9d98 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -216,6 +216,12 @@ extern int filter_current_check_discard(struct ring_buffer *buffer,
void *rec,
struct ring_buffer_event *event);

+extern void ftrace_free_filter(struct event_filter *filter);
+extern int ftrace_parse_filter(struct event_filter **filter,
+ int event_id,
+ const char *filter_str);
+extern const char *ftrace_get_filter_string(const struct event_filter *filter);
+
enum {
FILTER_OTHER = 0,
FILTER_STATIC_STRING,
@@ -266,9 +272,6 @@ extern int perf_trace_init(struct perf_event *event);
extern void perf_trace_destroy(struct perf_event *event);
extern int perf_trace_add(struct perf_event *event, int flags);
extern void perf_trace_del(struct perf_event *event, int flags);
-extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
- char *filter_str);
-extern void ftrace_profile_free_filter(struct perf_event *event);
extern void *perf_trace_buf_prepare(int size, unsigned short type,
struct pt_regs *regs, int *rctxp);

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 8e81a98..1da45e7 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5588,7 +5588,8 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)
if (IS_ERR(filter_str))
return PTR_ERR(filter_str);

- ret = ftrace_profile_set_filter(event, event->attr.config, filter_str);
+ ret = ftrace_parse_filter(&event->filter, event->attr.config,
+ filter_str);

kfree(filter_str);
return ret;
@@ -5596,7 +5597,9 @@ static int perf_event_set_filter(struct perf_event *event, void __user *arg)

static void perf_event_free_filter(struct perf_event *event)
{
- ftrace_profile_free_filter(event);
+ struct event_filter *filter = event->filter;
+ event->filter = NULL;
+ ftrace_free_filter(filter);
}

#else
diff --git a/kernel/trace/trace_events_filter.c b/kernel/trace/trace_events_filter.c
index 8008ddc..787b174 100644
--- a/kernel/trace/trace_events_filter.c
+++ b/kernel/trace/trace_events_filter.c
@@ -522,7 +522,7 @@ static void remove_filter_string(struct event_filter *filter)
}

static int replace_filter_string(struct event_filter *filter,
- char *filter_string)
+ const char *filter_string)
{
kfree(filter->filter_string);
filter->filter_string = kstrdup(filter_string, GFP_KERNEL);
@@ -1936,21 +1936,27 @@ out_unlock:
return err;
}

-#ifdef CONFIG_PERF_EVENTS
-
-void ftrace_profile_free_filter(struct perf_event *event)
+/* ftrace_free_filter - frees a parsed filter and its internal structures.
+ *
+ * @filter: pointer to the event_filter to free.
+ */
+void ftrace_free_filter(struct event_filter *filter)
{
- struct event_filter *filter = event->filter;
-
- event->filter = NULL;
- __free_filter(filter);
+ if (filter)
+ __free_filter(filter);
}
+EXPORT_SYMBOL_GPL(ftrace_free_filter);

-int ftrace_profile_set_filter(struct perf_event *event, int event_id,
- char *filter_str)
+/* ftrace_parse_filter - allocates and populates a new event_filter
+ *
+ * @event_id: may be something like syscalls::sys_event_tkill's id.
+ * @filter_str: pointer to the filter string. Ownership IS taken.
+ */
+int ftrace_parse_filter(struct event_filter **filter,
+ int event_id,
+ const char *filter_str)
{
int err;
- struct event_filter *filter;
struct filter_parse_state *ps;
struct ftrace_event_call *call = NULL;

@@ -1966,12 +1972,12 @@ int ftrace_profile_set_filter(struct perf_event *event, int event_id,
goto out_unlock;

err = -EEXIST;
- if (event->filter)
+ if (*filter)
goto out_unlock;

- filter = __alloc_filter();
- if (!filter) {
- err = PTR_ERR(filter);
+ *filter = __alloc_filter();
+ if (IS_ERR_OR_NULL(*filter)) {
+ err = PTR_ERR(*filter);
goto out_unlock;
}

@@ -1980,14 +1986,14 @@ int ftrace_profile_set_filter(struct perf_event *event, int event_id,
if (!ps)
goto free_filter;

- parse_init(ps, filter_ops, filter_str);
+ replace_filter_string(*filter, filter_str);
+
+ parse_init(ps, filter_ops, (*filter)->filter_string);
err = filter_parse(ps);
if (err)
goto free_ps;

- err = replace_preds(call, filter, ps, filter_str, false);
- if (!err)
- event->filter = filter;
+ err = replace_preds(call, *filter, ps, (*filter)->filter_string, false);

free_ps:
filter_opstack_clear(ps);
@@ -1995,14 +2001,22 @@ free_ps:
kfree(ps);

free_filter:
- if (err)
- __free_filter(filter);
+ if (err) {
+ __free_filter(*filter);
+ *filter = NULL;
+ }

out_unlock:
mutex_unlock(&event_mutex);

return err;
}
+EXPORT_SYMBOL_GPL(ftrace_parse_filter);

-#endif /* CONFIG_PERF_EVENTS */
-
+const char *ftrace_get_filter_string(const struct event_filter *filter)
+{
+ if (!filter)
+ return NULL;
+ return filter->filter_string;
+}
+EXPORT_SYMBOL_GPL(ftrace_get_filter_string);
--
1.7.0.4

2011-06-01 03:10:46

Subject: [PATCH v3 02/13] tracing: split out syscall_trace_enter construction

perf appears to be the primary consumer of the CONFIG_FTRACE_SYSCALLS
infrastructure. As such, many of the helpers targeted at perf can be
split into a perf-focused helper and a generic CONFIG_FTRACE_SYSCALLS
consumer interface.

This change splits out syscall_trace_enter construction from
perf_syscall_enter for current into two helpers:
- ftrace_syscall_enter_state
- ftrace_syscall_enter_state_size

And adds another helper for completeness:
- ftrace_syscall_exit_state_size

These helpers allow for shared code between perf ftrace events and
any other consumers of CONFIG_FTRACE_SYSCALLS events. The proposed
seccomp_filter patches use this code.
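
The size arithmetic shared by the new *_state_size helpers can be modelled in user space roughly as follows. This is a sketch only: ALIGN_UP stands in for the kernel's ALIGN() macro, and the header and word sizes are passed in as parameters because sizeof(struct syscall_trace_enter) and sizeof(unsigned long) depend on the kernel build. The name enter_state_size() is made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Round x up to a multiple of a; a must be a power of two. */
#define ALIGN_UP(x, a) (((x) + ((size_t)(a) - 1)) & ~((size_t)(a) - 1))

/*
 * Model of the enter-state size calculation: header plus one word per
 * argument, padded so that the record plus a leading u32 size field
 * ends on a u64 boundary. A negative nb_args means "assume the
 * maximum of 6 arguments".
 */
size_t enter_state_size(size_t header, size_t word, int nb_args)
{
	size_t count = nb_args >= 0 ? (size_t)nb_args : 6;
	size_t size = word * count + header;

	size = ALIGN_UP(size + sizeof(uint32_t), sizeof(uint64_t));
	return size - sizeof(uint32_t);
}
```

Assuming a 16-byte header and 8-byte words, three arguments give 16 + 24 = 40 bytes, which the padding step turns into 44 so that 44 + 4 is a multiple of 8.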

Signed-off-by: Will Drewry <[email protected]>
---
include/trace/syscall.h | 4 ++
kernel/trace/trace_syscalls.c | 96 +++++++++++++++++++++++++++++++++++------
2 files changed, 86 insertions(+), 14 deletions(-)

diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 31966a4..242ae04 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -41,6 +41,10 @@ extern int reg_event_syscall_exit(struct ftrace_event_call *call);
extern void unreg_event_syscall_exit(struct ftrace_event_call *call);
extern int
ftrace_format_syscall(struct ftrace_event_call *call, struct trace_seq *s);
+extern int ftrace_syscall_enter_state(u8 *buf, size_t available,
+ struct trace_entry **entry);
+extern size_t ftrace_syscall_enter_state_size(int nb_args);
+extern size_t ftrace_syscall_exit_state_size(void);
enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
struct trace_event *event);
enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index ee7b5a0..f37f120 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -95,7 +95,7 @@ find_syscall_meta(unsigned long syscall)
return NULL;
}

-static struct syscall_metadata *syscall_nr_to_meta(int nr)
+struct syscall_metadata *syscall_nr_to_meta(int nr)
{
if (!syscalls_metadata || nr >= NR_syscalls || nr < 0)
return NULL;
@@ -498,7 +498,7 @@ static int sys_perf_refcount_exit;
static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
{
struct syscall_metadata *sys_data;
- struct syscall_trace_enter *rec;
+ void *buf;
struct hlist_head *head;
int syscall_nr;
int rctx;
@@ -513,25 +513,22 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
return;

/* get the size after alignment with the u32 buffer size field */
- size = sizeof(unsigned long) * sys_data->nb_args + sizeof(*rec);
- size = ALIGN(size + sizeof(u32), sizeof(u64));
- size -= sizeof(u32);
+ size = ftrace_syscall_enter_state_size(sys_data->nb_args);

if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE,
"perf buffer not large enough"))
return;

- rec = (struct syscall_trace_enter *)perf_trace_buf_prepare(size,
- sys_data->enter_event->event.type, regs, &rctx);
- if (!rec)
+ buf = perf_trace_buf_prepare(size, sys_data->enter_event->event.type,
+ regs, &rctx);
+ if (!buf)
return;

- rec->nr = syscall_nr;
- syscall_get_arguments(current, regs, 0, sys_data->nb_args,
- (unsigned long *)&rec->args);
+ /* The only error conditions in this helper are handled above. */
+ ftrace_syscall_enter_state(buf, size, NULL);

head = this_cpu_ptr(sys_data->enter_event->perf_events);
- perf_trace_buf_submit(rec, size, rctx, 0, 1, regs, head);
+ perf_trace_buf_submit(buf, size, rctx, 0, 1, regs, head);
}

int perf_sysenter_enable(struct ftrace_event_call *call)
@@ -587,8 +584,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
return;

/* We can probably do that at build time */
- size = ALIGN(sizeof(*rec) + sizeof(u32), sizeof(u64));
- size -= sizeof(u32);
+ size = ftrace_syscall_exit_state_size();

/*
* Impossible, but be paranoid with the future
@@ -688,3 +684,75 @@ static int syscall_exit_register(struct ftrace_event_call *event,
}
return 0;
}
+
+/* ftrace_syscall_enter_state_size - returns the state size required.
+ *
+ * @nb_args: number of system call args expected.
+ * a negative value implies the maximum allowed.
+ */
+size_t ftrace_syscall_enter_state_size(int nb_args)
+{
+ /* syscall_get_arguments only supports up to 6 arguments. */
+ int arg_count = (nb_args >= 0 ? nb_args : 6);
+ size_t size = (sizeof(unsigned long) * arg_count) +
+ sizeof(struct syscall_trace_enter);
+ size = ALIGN(size + sizeof(u32), sizeof(u64));
+ size -= sizeof(u32);
+ return size;
+}
+EXPORT_SYMBOL_GPL(ftrace_syscall_enter_state_size);
+
+size_t ftrace_syscall_exit_state_size(void)
+{
+ return ALIGN(sizeof(struct syscall_trace_exit) + sizeof(u32),
+ sizeof(u64)) - sizeof(u32);
+}
+EXPORT_SYMBOL_GPL(ftrace_syscall_exit_state_size);
+
+/* ftrace_syscall_enter_state - build state for filter matching
+ *
+ * @buf: buffer to populate with current task state for matching
+ * @available: size available for use in the buffer.
+ * @entry: optional pointer to the trace_entry member of the state.
+ *
+ * Returns 0 on success and non-zero otherwise.
+ * If @entry is NULL, it will be ignored.
+ */
+int ftrace_syscall_enter_state(u8 *buf, size_t available,
+ struct trace_entry **entry)
+{
+ struct syscall_trace_enter *sys_enter;
+ struct syscall_metadata *sys_data;
+ int size;
+ int syscall_nr;
+ struct pt_regs *regs = task_pt_regs(current);
+
+ syscall_nr = syscall_get_nr(current, regs);
+ if (syscall_nr < 0)
+ return -EINVAL;
+
+ sys_data = syscall_nr_to_meta(syscall_nr);
+ if (!sys_data)
+ return -EINVAL;
+
+ /* Determine the actual size needed. */
+ size = sizeof(unsigned long) * sys_data->nb_args +
+ sizeof(struct syscall_trace_enter);
+ size = ALIGN(size + sizeof(u32), sizeof(u64));
+ size -= sizeof(u32);
+
+ BUG_ON(size > available);
+ sys_enter = (struct syscall_trace_enter *)buf;
+
+ /* Populating the struct trace_sys_enter is left to the caller, but
+ * a pointer is returned to encourage opacity.
+ */
+ if (entry)
+ *entry = &sys_enter->ent;
+
+ sys_enter->nr = syscall_nr;
+ syscall_get_arguments(current, regs, 0, sys_data->nb_args,
+ sys_enter->args);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(ftrace_syscall_enter_state);
--
1.7.0.4

2011-06-01 03:10:51

Subject: [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters

This change adds a new seccomp mode which specifies the allowed system
calls dynamically. When in the new mode (2), all system calls are
checked against process-defined filters - first by system call number,
then by a filter string. If an entry exists for a given system call and
all filter predicates evaluate to true, then the task may proceed.
Otherwise, the task is killed.
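
The two-step check (system call number first, then the filter) can be pictured with a small user-space model. This is a sketch under assumptions, not the patch's code: plain C predicates stand in for ftrace event filters, MODEL_NR_SYSCALLS is an arbitrary small table size, and test_filters() only mimics the intended allow/deny decision. The two sentinel values are the ones the patch defines for SECCOMP_ACTION_DENY and SECCOMP_ACTION_ALLOW.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_SYSCALLS 8		/* arbitrary size for the model */
#define ACTION_DENY  0xffff		/* no entry: the call is refused */
#define ACTION_ALLOW 0xfffe		/* entry exists with no predicate */

/* Stand-in for an ftrace event filter: nonzero means "matches". */
typedef int (*predicate_fn)(const unsigned long *args);

struct filter_table {
	uint16_t syscalls[MODEL_NR_SYSCALLS];	/* sentinel or index */
	const predicate_fn *predicates;		/* second-level array */
	uint16_t count;				/* size of predicates */
};

/* Returns 0 if the call may proceed, -1 if it must be refused. */
int test_filters(const struct filter_table *t, int nr,
		 const unsigned long *args)
{
	uint16_t slot;

	if (nr < 0 || nr >= MODEL_NR_SYSCALLS)
		return -1;
	slot = t->syscalls[nr];
	if (slot == ACTION_ALLOW)
		return 0;
	if (slot == ACTION_DENY || slot >= t->count)
		return -1;
	return t->predicates[slot](args) ? 0 : -1;
}

/* Example predicate, equivalent to the string "fd == 1 || fd == 2". */
int fd_is_stdout_or_stderr(const unsigned long *args)
{
	return args[0] == 1 || args[0] == 2;
}
```

Filling the first-level array with 0xff bytes makes every slot read as the deny sentinel, so a new table refuses everything until entries are added.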

Filter string parsing and evaluation is handled by the ftrace filter
engine. Related patches tweak the perf filter trace and free code,
allowing the calls to be shared. Filters inherit their understanding of
types and arguments for each system call from the CONFIG_FTRACE_SYSCALLS
subsystem which already populates this information in syscall_metadata
associated enter_event (and exit_event) structures. If
CONFIG_FTRACE_SYSCALLS is not compiled in, only filter strings of "1"
will be allowed.

The net result is a process may have its system calls filtered using the
ftrace filter engine's inherent understanding of system calls. The set
of filters is specified through the PR_SET_SECCOMP_FILTER argument in
prctl(). For example, a filterset for a process, like pdftotext, that
should only process read-only input could (roughly) look like:
sprintf(rdonly, "flags == %u", O_RDONLY|O_LARGEFILE);
prctl(PR_SET_SECCOMP_FILTER, __NR_open, rdonly);
prctl(PR_SET_SECCOMP_FILTER, __NR__llseek, "1");
prctl(PR_SET_SECCOMP_FILTER, __NR_brk, "1");
prctl(PR_SET_SECCOMP_FILTER, __NR_close, "1");
prctl(PR_SET_SECCOMP_FILTER, __NR_exit_group, "1");
prctl(PR_SET_SECCOMP_FILTER, __NR_fstat64, "1");
prctl(PR_SET_SECCOMP_FILTER, __NR_mmap2, "1");
prctl(PR_SET_SECCOMP_FILTER, __NR_munmap, "1");
prctl(PR_SET_SECCOMP_FILTER, __NR_read, "1");
prctl(PR_SET_SECCOMP_FILTER, __NR_write, "(fd == 1 || fd == 2)");
prctl(PR_SET_SECCOMP, 2);

Subsequent calls to PR_SET_SECCOMP_FILTER for the same system call will
be &&'d together to ensure that attack surface may only be reduced:
prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");

With the earlier example, the active filter becomes:
"(fd == 1 || fd == 2) && fd != 2"
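
One way to picture the composition step is a string-level AND of the existing filter and the new one. The helper below is a hypothetical user-space sketch, not the code in kernel/seccomp_filter.c: the name and_filters() is made up, and it parenthesizes both sides, so its output differs in form from the string shown above while being logically equivalent.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write "(old) && (add)" into out, or just add
 * when there is no existing filter. Returns 0 on success and -1 if
 * the result does not fit in outlen bytes.
 */
int and_filters(char *out, size_t outlen, const char *old, const char *add)
{
	int n;

	if (!old || !*old)
		n = snprintf(out, outlen, "%s", add);
	else
		n = snprintf(out, outlen, "(%s) && (%s)", old, add);

	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Because each new string can only be ANDed onto what is already there, the set of calls that pass can never grow, which is the property the description relies on.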

The patch also adds PR_CLEAR_SECCOMP_FILTER and PR_GET_SECCOMP_FILTER.
The latter returns the current filter for a system call to userspace:

prctl(PR_GET_SECCOMP_FILTER, __NR_write, buf, bufsize);

while the former clears any filters for a given system call, changing it
back to a default deny:

prctl(PR_CLEAR_SECCOMP_FILTER, __NR_write);

v3: - always block execve calls (as per linus torvalds)
- add __NR_seccomp_execve(_32) to seccomp-supporting arches
- ensure compat tasks can't reach ftrace:syscalls
- dropped new defines for seccomp modes.
- two level array instead of hlists (sugg. by olof johansson)
- added generic Kconfig entry that is not connected.
- dropped internal seccomp.h
- move prctl helpers to seccomp_filter
- killed seccomp_t typedef (as per checkpatch)
v2: - changed to use the existing syscall number ABI.
- prctl changes to minimize parsing in the kernel:
prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
- defined PR_SECCOMP_MODE_STRICT and ..._FILTER
- added flags
- provide a default fail syscall_nr_to_meta in ftrace
- provides fallback for unhooked system calls
- use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
- added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
- moved to a hlist and 4 bit hash of linked lists
- added support to operate without CONFIG_FTRACE_SYSCALLS
- moved Kconfig support next to SECCOMP
- made Kconfig entries dependent on EXPERIMENTAL
- added macros to avoid ifdefs from kernel/fork.c
- added compat task/filter matching
- drop seccomp.h inclusion in sched.h and drop seccomp_t
- added Filtering to "show" output
- added on_exec state dup'ing when enabling after a fast-path accept.

Signed-off-by: Will Drewry <[email protected]>
---
include/linux/prctl.h | 5 +
include/linux/sched.h | 2 +-
include/linux/seccomp.h | 98 ++++++-
include/trace/syscall.h | 7 +
kernel/Makefile | 3 +
kernel/fork.c | 3 +
kernel/seccomp.c | 38 ++-
kernel/seccomp_filter.c | 784 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 13 +-
security/Kconfig | 17 +
10 files changed, 954 insertions(+), 16 deletions(-)
create mode 100644 kernel/seccomp_filter.c

diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..44723ce 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -64,6 +64,11 @@
#define PR_GET_SECCOMP 21
#define PR_SET_SECCOMP 22

+/* Get/set process seccomp filters */
+#define PR_GET_SECCOMP_FILTER 35
+#define PR_SET_SECCOMP_FILTER 36
+#define PR_CLEAR_SECCOMP_FILTER 37
+
/* Get/set the capability bounding set (as per security/commoncap.c) */
#define PR_CAPBSET_READ 23
#define PR_CAPBSET_DROP 24
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 18d63ce..3f0bc8d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1374,7 +1374,7 @@ struct task_struct {
uid_t loginuid;
unsigned int sessionid;
#endif
- seccomp_t seccomp;
+ struct seccomp_struct seccomp;

/* Thread group tracking */
u32 parent_exec_id;
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 167c333..f4434ca 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -1,13 +1,33 @@
#ifndef _LINUX_SECCOMP_H
#define _LINUX_SECCOMP_H

+struct seq_file;

#ifdef CONFIG_SECCOMP

+#include <linux/errno.h>
#include <linux/thread_info.h>
+#include <linux/types.h>
#include <asm/seccomp.h>

-typedef struct { int mode; } seccomp_t;
+struct seccomp_filters;
+/**
+ * struct seccomp_struct - the state of a seccomp'ed process
+ *
+ * @mode:
+ * if this is 1, the process is under standard seccomp rules
+ * is 2, the process is only allowed to make system calls where
+ * associated filters evaluate successfully.
+ * @filters: Metadata for filters if using CONFIG_SECCOMP_FILTER.
+ * filters assignment/use should be RCU-protected and its contents
+ * should never be modified when attached to a seccomp_struct.
+ */
+struct seccomp_struct {
+ uint16_t mode;
+#ifdef CONFIG_SECCOMP_FILTER
+ struct seccomp_filters *filters;
+#endif
+};

extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -16,15 +36,14 @@ static inline void secure_computing(int this_syscall)
__secure_computing(this_syscall);
}

-extern long prctl_get_seccomp(void);
extern long prctl_set_seccomp(unsigned long);
+extern long prctl_get_seccomp(void);

#else /* CONFIG_SECCOMP */

#include <linux/errno.h>

-typedef struct { } seccomp_t;
-
+struct seccomp_struct { };
#define secure_computing(x) do { } while (0)

static inline long prctl_get_seccomp(void)
@@ -32,11 +51,80 @@ static inline long prctl_get_seccomp(void)
return -EINVAL;
}

-static inline long prctl_set_seccomp(unsigned long arg2)
+static inline long prctl_set_seccomp(unsigned long a2)
{
return -EINVAL;
}

#endif /* CONFIG_SECCOMP */

+#ifdef CONFIG_SECCOMP_FILTER
+
+#define inherit_tsk_seccomp(_child, _orig) do { \
+ _child->seccomp.mode = _orig->seccomp.mode; \
+ _child->seccomp.filters = get_seccomp_filters(_orig->seccomp.filters); \
+ } while (0)
+#define put_tsk_seccomp(_tsk) put_seccomp_filters(_tsk->seccomp.filters)
+
+extern int seccomp_show_filters(struct seccomp_filters *filters,
+ struct seq_file *);
+extern long seccomp_set_filter(int, char *);
+extern long seccomp_clear_filter(int);
+extern long seccomp_get_filter(int, char *, unsigned long);
+
+extern long prctl_set_seccomp_filter(unsigned long, char __user *);
+extern long prctl_get_seccomp_filter(unsigned long, char __user *,
+ unsigned long);
+extern long prctl_clear_seccomp_filter(unsigned long);
+
+extern struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *);
+extern void put_seccomp_filters(struct seccomp_filters *);
+
+extern int seccomp_test_filters(int);
+extern void seccomp_filter_log_failure(int);
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+struct seccomp_filters { };
+#define inherit_tsk_seccomp(_child, _orig) do { } while (0)
+#define put_tsk_seccomp(_tsk) do { } while (0)
+
+static inline int seccomp_show_filters(struct seccomp_filters *filters,
+ struct seq_file *m)
+{
+ return -ENOSYS;
+}
+
+static inline long seccomp_set_filter(int syscall_nr, char *filter)
+{
+ return -ENOSYS;
+}
+
+static inline long seccomp_clear_filter(int syscall_nr)
+{
+ return -ENOSYS;
+}
+
+static inline long seccomp_get_filter(int syscall_nr,
+ char *buf, unsigned long available)
+{
+ return -ENOSYS;
+}
+
+static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3)
+{
+ return -ENOSYS;
+}
+
+static inline long prctl_clear_seccomp_filter(unsigned long a2)
+{
+ return -ENOSYS;
+}
+
+static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3,
+ unsigned long a4)
+{
+ return -ENOSYS;
+}
+#endif /* CONFIG_SECCOMP_FILTER */
#endif /* _LINUX_SECCOMP_H */
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 242ae04..e061ad0 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -35,6 +35,8 @@ struct syscall_metadata {
extern unsigned long arch_syscall_addr(int nr);
extern int init_syscall_trace(struct ftrace_event_call *call);

+extern struct syscall_metadata *syscall_nr_to_meta(int);
+
extern int reg_event_syscall_enter(struct ftrace_event_call *call);
extern void unreg_event_syscall_enter(struct ftrace_event_call *call);
extern int reg_event_syscall_exit(struct ftrace_event_call *call);
@@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
struct trace_event *event);
enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
struct trace_event *event);
+#else
+static inline struct syscall_metadata *syscall_nr_to_meta(int nr)
+{
+ return NULL;
+}
#endif

#ifdef CONFIG_PERF_EVENTS
diff --git a/kernel/Makefile b/kernel/Makefile
index 85cbfb3..84e7dfb 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
+ifeq ($(CONFIG_SECCOMP_FILTER),y)
+obj-$(CONFIG_SECCOMP) += seccomp_filter.o
+endif
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TREE_RCU) += rcutree.o
obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index e7548de..6f835e0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+ put_tsk_seccomp(tsk);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
if (err)
goto out;

+ inherit_tsk_seccomp(tsk, orig);
setup_thread_stack(tsk, orig);
clear_user_return_notifier(tsk);
clear_tsk_need_resched(tsk);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..0a942be 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -2,16 +2,20 @@
* linux/kernel/seccomp.c
*
* Copyright 2004-2005 Andrea Arcangeli <[email protected]>
+ * Copyright (C) 2011 The Chromium OS Authors <[email protected]>
*
* This defines a simple but solid secure-computing mode.
*/

#include <linux/seccomp.h>
#include <linux/sched.h>
+#include <linux/slab.h>
#include <linux/compat.h>
+#include <linux/unistd.h>
+#include <linux/ftrace_event.h>

+#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
/* #define SECCOMP_DEBUG 1 */
-#define NR_SECCOMP_MODES 1

/*
* Secure computing mode 1 allows only read/write/exit/sigreturn.
@@ -32,10 +36,9 @@ static int mode1_syscalls_32[] = {

void __secure_computing(int this_syscall)
{
- int mode = current->seccomp.mode;
int * syscall;

- switch (mode) {
+ switch (current->seccomp.mode) {
case 1:
syscall = mode1_syscalls;
#ifdef CONFIG_COMPAT
@@ -47,6 +50,17 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+#ifdef CONFIG_SECCOMP_FILTER
+ case 2:
+ if (this_syscall >= NR_syscalls || this_syscall < 0)
+ break;
+
+ if (!seccomp_test_filters(this_syscall))
+ return;
+
+ seccomp_filter_log_failure(this_syscall);
+ break;
+#endif
default:
BUG();
}
@@ -71,16 +85,22 @@ long prctl_set_seccomp(unsigned long seccomp_mode)
if (unlikely(current->seccomp.mode))
goto out;

- ret = -EINVAL;
- if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
- current->seccomp.mode = seccomp_mode;
- set_thread_flag(TIF_SECCOMP);
+ ret = 0;
+ switch (seccomp_mode) {
+ case 1:
#ifdef TIF_NOTSC
disable_TSC();
#endif
- ret = 0;
+#ifdef CONFIG_SECCOMP_FILTER
+ case 2:
+#endif
+ current->seccomp.mode = seccomp_mode;
+ set_thread_flag(TIF_SECCOMP);
+ break;
+ default:
+ ret = -EINVAL;
}

- out:
+out:
return ret;
}
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..9782f25
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,784 @@
+/* filter engine-based seccomp system call filtering
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2011 The Chromium OS Authors <[email protected]>
+ */
+
+#include <linux/compat.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/ftrace_event.h>
+#include <linux/seccomp.h>
+#include <linux/seq_file.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include <asm/syscall.h>
+#include <trace/syscall.h>
+
+
+#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
+
+#define SECCOMP_FILTER_ALLOW "1"
+#define SECCOMP_ACTION_DENY 0xffff
+#define SECCOMP_ACTION_ALLOW 0xfffe
+
+/**
+ * struct seccomp_filters - container for seccomp filterset
+ *
+ * @syscalls: array of 16-bit indices into @event_filters by syscall_nr
+ * May also be SECCOMP_ACTION_DENY or SECCOMP_ACTION_ALLOW
+ * @event_filters: array of pointers to ftrace event objects
+ * @count: size of @event_filters
+ * @flags: anonymous struct to wrap filters-specific flags
+ * @usage: reference count to simplify use.
+ */
+struct seccomp_filters {
+ uint16_t syscalls[NR_syscalls];
+ struct event_filter **event_filters;
+ uint16_t count;
+ struct {
+ uint32_t compat:1,
+ __reserved:31;
+ } flags;
+ atomic_t usage;
+};
+
+/* Handle ftrace symbol non-existence */
+#ifdef CONFIG_FTRACE_SYSCALLS
+#define create_event_filter(_ef_pptr, _event_type, _str) \
+ ftrace_parse_filter(_ef_pptr, _event_type, _str)
+#define get_filter_string(_ef) ftrace_get_filter_string(_ef)
+#define free_event_filter(_f) ftrace_free_filter(_f)
+
+#else
+
+#define create_event_filter(_ef_pptr, _event_type, _str) (-ENOSYS)
+#define get_filter_string(_ef) (NULL)
+#define free_event_filter(_f) do { } while (0)
+#endif
+
+/**
+ * seccomp_filters_new - allocates a new filters object
+ * @count: count to allocate for the event_filters array
+ *
+ * Returns ERR_PTR on error or an allocated object.
+ */
+static struct seccomp_filters *seccomp_filters_new(uint16_t count)
+{
+ struct seccomp_filters *f;
+
+ if (count >= SECCOMP_ACTION_ALLOW)
+ return ERR_PTR(-EINVAL);
+
+ f = kzalloc(sizeof(struct seccomp_filters), GFP_KERNEL);
+ if (!f)
+ return ERR_PTR(-ENOMEM);
+
+ /* Lazy SECCOMP_ACTION_DENY assignment. */
+ memset(f->syscalls, 0xff, sizeof(f->syscalls));
+ atomic_set(&f->usage, 1);
+
+ f->event_filters = NULL;
+ f->count = count;
+ if (!count)
+ return f;
+
+ f->event_filters = kzalloc(count * sizeof(struct event_filter *),
+ GFP_KERNEL);
+ if (!f->event_filters) {
+ kfree(f);
+ f = ERR_PTR(-ENOMEM);
+ }
+ return f;
+}
+
+/**
+ * seccomp_filters_free - cleans up the filter list and frees the table
+ * @filters: NULL or live object to be completely destructed.
+ */
+static void seccomp_filters_free(struct seccomp_filters *filters)
+{
+ uint16_t count = 0;
+ if (!filters)
+ return;
+ while (count < filters->count) {
+ struct event_filter *f = filters->event_filters[count];
+ free_event_filter(f);
+ count++;
+ }
+ kfree(filters->event_filters);
+ kfree(filters);
+}
+
+static void __put_seccomp_filters(struct seccomp_filters *orig)
+{
+ WARN_ON(atomic_read(&orig->usage));
+ seccomp_filters_free(orig);
+}
+
+#define seccomp_filter_allow(_id) ((_id) == SECCOMP_ACTION_ALLOW)
+#define seccomp_filter_deny(_id) ((_id) == SECCOMP_ACTION_DENY)
+#define seccomp_filter_dynamic(_id) \
+ (!seccomp_filter_allow(_id) && !seccomp_filter_deny(_id))
+static inline uint16_t seccomp_filter_id(const struct seccomp_filters *f,
+ int syscall_nr)
+{
+ if (!f)
+ return SECCOMP_ACTION_DENY;
+ return f->syscalls[syscall_nr];
+}
+
+static inline struct event_filter *seccomp_dynamic_filter(
+ const struct seccomp_filters *filters, uint16_t id)
+{
+ if (!seccomp_filter_dynamic(id))
+ return NULL;
+ return filters->event_filters[id];
+}
+
+static inline void set_seccomp_filter_id(struct seccomp_filters *filters,
+ int syscall_nr, uint16_t id)
+{
+ filters->syscalls[syscall_nr] = id;
+}
+
+static inline void set_seccomp_filter(struct seccomp_filters *filters,
+ int syscall_nr, uint16_t id,
+ struct event_filter *dynamic_filter)
+{
+ filters->syscalls[syscall_nr] = id;
+ if (seccomp_filter_dynamic(id))
+ filters->event_filters[id] = dynamic_filter;
+}
+
+static struct event_filter *alloc_event_filter(int syscall_nr,
+ const char *filter_string)
+{
+ struct syscall_metadata *data;
+ struct event_filter *filter = NULL;
+ int err;
+
+ data = syscall_nr_to_meta(syscall_nr);
+ /* Argument-based filtering only works on ftrace-hooked syscalls. */
+ err = -ENOSYS;
+ if (!data)
+ goto fail;
+ err = create_event_filter(&filter,
+ data->enter_event->event.type,
+ filter_string);
+ if (err)
+ goto fail;
+
+ return filter;
+fail:
+ kfree(filter);
+ return ERR_PTR(err);
+}
+
+/**
+ * seccomp_filters_copy - copies filters from src to dst.
+ *
+ * @dst: seccomp_filters to populate.
+ * @src: table to read from.
+ * @skip: specifies an entry, by system call, to skip.
+ *
+ * Returns non-zero on failure.
+ * Both the source and the destination should have no simultaneous
+ * writers, and dst should be exclusive to the caller.
+ * If @skip is < 0, it is ignored.
+ */
+static int seccomp_filters_copy(struct seccomp_filters *dst,
+ const struct seccomp_filters *src,
+ int skip)
+{
+ int id = 0, ret = 0, nr;
+ memcpy(&dst->flags, &src->flags, sizeof(src->flags));
+ memcpy(dst->syscalls, src->syscalls, sizeof(dst->syscalls));
+ if (!src->count)
+ goto done;
+ for (nr = 0; nr < NR_syscalls; ++nr) {
+ struct event_filter *filter;
+ const char *str;
+ uint16_t src_id = seccomp_filter_id(src, nr);
+ if (nr == skip) {
+ set_seccomp_filter(dst, nr, SECCOMP_ACTION_DENY,
+ NULL);
+ continue;
+ }
+ if (!seccomp_filter_dynamic(src_id))
+ continue;
+ if (id >= dst->count) {
+ ret = -EINVAL;
+ goto done;
+ }
+ str = get_filter_string(seccomp_dynamic_filter(src, src_id));
+ filter = alloc_event_filter(nr, str);
+ if (IS_ERR(filter)) {
+ ret = PTR_ERR(filter);
+ goto done;
+ }
+ set_seccomp_filter(dst, nr, id, filter);
+ id++;
+ }
+
+done:
+ return ret;
+}
+
+/**
+ * seccomp_extend_filter - appends more text to a syscall_nr's filter
+ * @filters: unattached filter object to operate on
+ * @syscall_nr: syscall number to update filters for
+ * @filter_string: string to append to the existing filter
+ *
+ * The new string will be &&'d to the original filter string to ensure that it
+ * always matches the existing predicates or less:
+ * (old_filter) && @filter_string
+ * Returns 0 on success and a negative errno on failure. @filters is
+ * updated in place.
+ */
+static int seccomp_extend_filter(struct seccomp_filters *filters,
+ int syscall_nr, char *filter_string)
+{
+ struct event_filter *filter;
+ uint16_t id = seccomp_filter_id(filters, syscall_nr);
+ char *merged = NULL;
+ int ret = -EINVAL, expected;
+
+ /* No extending with a "1". */
+ if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string))
+ goto out;
+
+ filter = seccomp_dynamic_filter(filters, id);
+ ret = -ENOENT;
+ if (!filter)
+ goto out;
+
+ merged = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
+ ret = -ENOMEM;
+ if (!merged)
+ goto out;
+
+ expected = snprintf(merged, SECCOMP_MAX_FILTER_LENGTH, "(%s) && %s",
+ get_filter_string(filter), filter_string);
+ ret = -E2BIG;
+ if (expected >= SECCOMP_MAX_FILTER_LENGTH || expected < 0)
+ goto out;
+
+ /* Free the old filter */
+ free_event_filter(filter);
+ set_seccomp_filter(filters, syscall_nr, id, NULL);
+
+ /* Replace it */
+ filter = alloc_event_filter(syscall_nr, merged);
+ if (IS_ERR(filter)) {
+ ret = PTR_ERR(filter);
+ goto out;
+ }
+ set_seccomp_filter(filters, syscall_nr, id, filter);
+ ret = 0;
+
+out:
+ kfree(merged);
+ return ret;
+}
+
+/**
+ * seccomp_add_filter - adds a filter for an unfiltered syscall
+ * @filters: filters object to add a filter/action to
+ * @syscall_nr: system call number to add a filter for
+ * @filter_string: the filter string to apply
+ *
+ * Returns 0 on success and non-zero otherwise.
+ */
+static int seccomp_add_filter(struct seccomp_filters *filters, int syscall_nr,
+ char *filter_string)
+{
+ struct event_filter *filter;
+ int ret = 0;
+
+ if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string)) {
+ set_seccomp_filter(filters, syscall_nr,
+ SECCOMP_ACTION_ALLOW, NULL);
+ goto out;
+ }
+
+ filter = alloc_event_filter(syscall_nr, filter_string);
+ if (IS_ERR(filter)) {
+ ret = PTR_ERR(filter);
+ goto out;
+ }
+ /* Always add to the last slot available since additions are
+ * only done one at a time.
+ */
+ set_seccomp_filter(filters, syscall_nr, filters->count - 1, filter);
+out:
+ return ret;
+}
+
+/* Wrap optional ftrace syscall support. Returns 1 on match or 0 otherwise. */
+static int filter_match_current(struct event_filter *event_filter)
+{
+ int err = 0;
+#ifdef CONFIG_FTRACE_SYSCALLS
+ uint8_t syscall_state[64];
+
+ memset(syscall_state, 0, sizeof(syscall_state));
+
+ /* The generic tracing entry can remain zeroed. */
+ err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
+ NULL);
+ if (err)
+ return 0;
+
+ err = filter_match_preds(event_filter, syscall_state);
+#endif
+ return err;
+}
+
+static const char *syscall_nr_to_name(int syscall)
+{
+ const char *syscall_name = "unknown";
+ struct syscall_metadata *data = syscall_nr_to_meta(syscall);
+ if (data)
+ syscall_name = data->name;
+ return syscall_name;
+}
+
+static void filters_set_compat(struct seccomp_filters *filters)
+{
+#ifdef CONFIG_COMPAT
+ if (is_compat_task())
+ filters->flags.compat = 1;
+#endif
+}
+
+static inline int filters_compat_mismatch(struct seccomp_filters *filters)
+{
+ int ret = 0;
+ if (!filters)
+ return 0;
+#ifdef CONFIG_COMPAT
+ if (!!(is_compat_task()) != filters->flags.compat)
+ ret = 1;
+#endif
+ return ret;
+}
+
+static inline int syscall_is_execve(int syscall)
+{
+ int nr = __NR_execve;
+#ifdef CONFIG_COMPAT
+ if (is_compat_task())
+ nr = __NR_seccomp_execve_32;
+#endif
+ return syscall == nr;
+}
+
+#ifndef KSTK_EIP
+#define KSTK_EIP(x) 0L
+#endif
+
+void seccomp_filter_log_failure(int syscall)
+{
+ pr_info("%s[%d]: system call %d (%s) blocked at 0x%lx\n",
+ current->comm, task_pid_nr(current), syscall,
+ syscall_nr_to_name(syscall), KSTK_EIP(current));
+}
+
+/* put_seccomp_filters - decrements the reference count of @orig and may free. */
+void put_seccomp_filters(struct seccomp_filters *orig)
+{
+ if (!orig)
+ return;
+
+ if (atomic_dec_and_test(&orig->usage))
+ __put_seccomp_filters(orig);
+}
+
+/* get_seccomp_filters - increments the reference count of @orig */
+struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *orig)
+{
+ if (!orig)
+ return NULL;
+ atomic_inc(&orig->usage);
+ return orig;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @syscall: number of the system call to test
+ *
+ * The filters are taken from current->seccomp.filters.
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(int syscall)
+{
+ uint16_t id;
+ struct event_filter *filter;
+ struct seccomp_filters *filters;
+ int ret = -EACCES;
+
+ rcu_read_lock();
+ filters = get_seccomp_filters(current->seccomp.filters);
+ rcu_read_unlock();
+
+ if (!filters)
+ goto out;
+
+ if (filters_compat_mismatch(filters)) {
+ pr_info("%s[%d]: seccomp_filter compat() mismatch.\n",
+ current->comm, task_pid_nr(current));
+ goto out;
+ }
+
+ /* execve is never allowed. */
+ if (syscall_is_execve(syscall))
+ goto out;
+
+ ret = 0;
+ id = seccomp_filter_id(filters, syscall);
+ if (seccomp_filter_allow(id))
+ goto out;
+
+ ret = -EACCES;
+ if (!seccomp_filter_dynamic(id))
+ goto out;
+
+ filter = seccomp_dynamic_filter(filters, id);
+ if (filter && filter_match_current(filter))
+ ret = 0;
+out:
+ put_seccomp_filters(filters);
+ return ret;
+}
+
+/**
+ * seccomp_show_filters - prints the current filter state to a seq_file
+ * @filters: properly get()'d filters object
+ * @m: the prepared seq_file to receive the data
+ *
+ * Returns 0 on a successful write.
+ */
+int seccomp_show_filters(struct seccomp_filters *filters, struct seq_file *m)
+{
+ int syscall;
+ seq_printf(m, "Mode: %d\n", current->seccomp.mode);
+ if (!filters)
+ goto out;
+
+ for (syscall = 0; syscall < NR_syscalls; ++syscall) {
+ uint16_t id = seccomp_filter_id(filters, syscall);
+ const char *filter_string = SECCOMP_FILTER_ALLOW;
+ if (seccomp_filter_deny(id))
+ continue;
+ seq_printf(m, "%d (%s): ",
+ syscall,
+ syscall_nr_to_name(syscall));
+ if (seccomp_filter_dynamic(id))
+ filter_string = get_filter_string(
+ seccomp_dynamic_filter(filters, id));
+ seq_printf(m, "%s\n", filter_string);
+ }
+out:
+ return 0;
+}
+EXPORT_SYMBOL_GPL(seccomp_show_filters);
+
+/**
+ * seccomp_get_filter - copies the filter_string into "buf"
+ * @syscall_nr: system call number to look up
+ * @buf: destination buffer
+ * @bufsize: available space in the buffer.
+ *
+ * Context: User context only. This function may sleep on allocation and
+ * operates on current. current must be attempting a system call
+ * when this is called.
+ *
+ * Looks up the filter for the given system call number on current. If found,
+ * the string length of the NUL-terminated buffer is returned and < 0 is
+ * returned on error. The NUL byte is not included in the length.
+ */
+long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize)
+{
+ struct seccomp_filters *filters;
+ struct event_filter *filter;
+ long ret = -EINVAL;
+ uint16_t id;
+
+ if (bufsize > SECCOMP_MAX_FILTER_LENGTH)
+ bufsize = SECCOMP_MAX_FILTER_LENGTH;
+
+ rcu_read_lock();
+ filters = get_seccomp_filters(current->seccomp.filters);
+ rcu_read_unlock();
+
+ if (!filters)
+ goto out;
+
+ ret = -ENOENT;
+ id = seccomp_filter_id(filters, syscall_nr);
+ if (seccomp_filter_deny(id))
+ goto out;
+
+ if (seccomp_filter_allow(id)) {
+ ret = strlcpy(buf, SECCOMP_FILTER_ALLOW, bufsize);
+ goto copied;
+ }
+
+ filter = seccomp_dynamic_filter(filters, id);
+ if (!filter)
+ goto out;
+ ret = strlcpy(buf, get_filter_string(filter), bufsize);
+
+copied:
+ if (ret >= bufsize) {
+ ret = -ENOSPC;
+ goto out;
+ }
+ /* Zero out any remaining buffer, just in case. */
+ memset(buf + ret, 0, bufsize - ret);
+out:
+ put_seccomp_filters(filters);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_get_filter);
+
+/**
+ * seccomp_clear_filter - clears the seccomp filter for a syscall.
+ * @syscall_nr: the system call number to clear filters for.
+ *
+ * Context: User context only. This function may sleep on allocation and
+ * operates on current. current must be attempting a system call
+ * when this is called.
+ *
+ * Returns 0 on success.
+ */
+long seccomp_clear_filter(int syscall_nr)
+{
+ struct seccomp_filters *filters = NULL, *orig_filters;
+ uint16_t id;
+ int ret = -EINVAL;
+
+ rcu_read_lock();
+ orig_filters = get_seccomp_filters(current->seccomp.filters);
+ rcu_read_unlock();
+
+ if (!orig_filters)
+ goto out;
+
+ if (filters_compat_mismatch(orig_filters))
+ goto out;
+
+ id = seccomp_filter_id(orig_filters, syscall_nr);
+ if (seccomp_filter_deny(id))
+ goto out;
+
+ /* Create a new filters object for the task */
+ if (seccomp_filter_dynamic(id))
+ filters = seccomp_filters_new(orig_filters->count - 1);
+ else
+ filters = seccomp_filters_new(orig_filters->count);
+
+ if (IS_ERR(filters)) {
+ ret = PTR_ERR(filters);
+ goto out;
+ }
+
+ /* Copy, but drop the requested entry. */
+ ret = seccomp_filters_copy(filters, orig_filters, syscall_nr);
+ if (ret)
+ goto out;
+ get_seccomp_filters(filters); /* simplify the out: path */
+
+ rcu_assign_pointer(current->seccomp.filters, filters);
+ synchronize_rcu();
+ put_seccomp_filters(orig_filters); /* for the task */
+out:
+ put_seccomp_filters(orig_filters); /* for the get */
+ put_seccomp_filters(filters); /* for the extra get */
+ return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_clear_filter);
+
+/**
+ * seccomp_set_filter - adds/extends a seccomp filter for a syscall.
+ * @syscall_nr: system call number to apply the filter to.
+ * @filter: ftrace filter string to apply.
+ *
+ * Context: User context only. This function may sleep on allocation and
+ * operates on current. current must be attempting a system call
+ * when this is called.
+ *
+ * New filters may be added for system calls when the current task is
+ * not in a secure computing mode (seccomp). Otherwise, existing filters may
+ * be extended.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_set_filter(int syscall_nr, char *filter)
+{
+ struct seccomp_filters *filters = NULL, *orig_filters = NULL;
+ uint16_t id;
+ long ret = -EINVAL;
+ uint16_t filters_needed;
+
+ if (!filter)
+ goto out;
+
+ filter = strstrip(filter);
+ /* Disallow empty strings. */
+ if (filter[0] == 0)
+ goto out;
+
+ rcu_read_lock();
+ orig_filters = get_seccomp_filters(current->seccomp.filters);
+ rcu_read_unlock();
+
+ /* After the first call, compatibility mode is selected permanently. */
+ ret = -EACCES;
+ if (filters_compat_mismatch(orig_filters))
+ goto out;
+
+ filters_needed = orig_filters ? orig_filters->count : 0;
+ id = seccomp_filter_id(orig_filters, syscall_nr);
+ if (seccomp_filter_deny(id)) {
+ /* Don't allow DENYs to be changed when in a seccomp mode */
+ ret = -EACCES;
+ if (current->seccomp.mode)
+ goto out;
+ filters_needed++;
+ }
+
+ filters = seccomp_filters_new(filters_needed);
+ if (IS_ERR(filters)) {
+ ret = PTR_ERR(filters);
+ goto out;
+ }
+
+ filters_set_compat(filters);
+ if (orig_filters) {
+ ret = seccomp_filters_copy(filters, orig_filters, -1);
+ if (ret)
+ goto out;
+ }
+
+ if (seccomp_filter_deny(id))
+ ret = seccomp_add_filter(filters, syscall_nr, filter);
+ else
+ ret = seccomp_extend_filter(filters, syscall_nr, filter);
+ if (ret)
+ goto out;
+ get_seccomp_filters(filters); /* simplify the error paths */
+
+ rcu_assign_pointer(current->seccomp.filters, filters);
+ synchronize_rcu();
+ put_seccomp_filters(orig_filters); /* for the task */
+out:
+ put_seccomp_filters(orig_filters); /* for the get */
+ put_seccomp_filters(filters); /* for get or task, on err */
+ return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_set_filter);
+
+long prctl_set_seccomp_filter(unsigned long syscall_nr,
+ char __user *user_filter)
+{
+ int nr;
+ long ret;
+ char *filter = NULL;
+
+ ret = -EINVAL;
+ if (syscall_nr >= NR_syscalls)
+ goto out;
+
+ ret = -EFAULT;
+ if (!user_filter)
+ goto out;
+
+ filter = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
+ ret = -ENOMEM;
+ if (!filter)
+ goto out;
+
+ ret = -EFAULT;
+ if (strncpy_from_user(filter, user_filter,
+ SECCOMP_MAX_FILTER_LENGTH - 1) < 0)
+ goto out;
+
+ nr = (int) syscall_nr;
+ ret = seccomp_set_filter(nr, filter);
+
+out:
+ kfree(filter);
+ return ret;
+}
+
+long prctl_clear_seccomp_filter(unsigned long syscall_nr)
+{
+ int nr = -1;
+ long ret;
+
+ ret = -EINVAL;
+ if (syscall_nr >= NR_syscalls)
+ goto out;
+
+ nr = (int) syscall_nr;
+ ret = seccomp_clear_filter(nr);
+
+out:
+ return ret;
+}
+
+long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst,
+ unsigned long available)
+{
+ int ret, nr;
+ unsigned long copied;
+ char *buf = NULL;
+ ret = -EINVAL;
+ if (!available)
+ goto out;
+ /* Ignore extra buffer space. */
+ if (available > SECCOMP_MAX_FILTER_LENGTH)
+ available = SECCOMP_MAX_FILTER_LENGTH;
+
+ ret = -EINVAL;
+ if (syscall_nr >= NR_syscalls)
+ goto out;
+ nr = (int) syscall_nr;
+
+ ret = -ENOMEM;
+ buf = kmalloc(available, GFP_KERNEL);
+ if (!buf)
+ goto out;
+
+ ret = seccomp_get_filter(nr, buf, available);
+ if (ret < 0)
+ goto out;
+
+ /* Include the NUL byte in the copy. */
+ copied = copy_to_user(dst, buf, ret + 1);
+ ret = -ENOSPC;
+ if (copied)
+ goto out;
+ ret = 0;
+out:
+ kfree(buf);
+ return ret;
+}
diff --git a/kernel/sys.c b/kernel/sys.c
index af468ed..ed60d06 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1698,13 +1698,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SET_ENDIAN:
error = SET_ENDIAN(me, arg2);
break;
-
case PR_GET_SECCOMP:
error = prctl_get_seccomp();
break;
case PR_SET_SECCOMP:
error = prctl_set_seccomp(arg2);
break;
+ case PR_SET_SECCOMP_FILTER:
+ error = prctl_set_seccomp_filter(arg2,
+ (char __user *) arg3);
+ break;
+ case PR_CLEAR_SECCOMP_FILTER:
+ error = prctl_clear_seccomp_filter(arg2);
+ break;
+ case PR_GET_SECCOMP_FILTER:
+ error = prctl_get_seccomp_filter(arg2,
+ (char __user *) arg3,
+ arg4);
+ break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
break;
diff --git a/security/Kconfig b/security/Kconfig
index 95accd4..c76adf2 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -2,6 +2,10 @@
# Security configuration
#

+# Make seccomp filter Kconfig switch below available
+config HAVE_SECCOMP_FILTER
+ bool
+
menu "Security options"

config KEYS
@@ -82,6 +86,19 @@ config SECURITY_DMESG_RESTRICT

If you are unsure how to answer this question, answer N.

+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ select SECCOMP
+ depends on HAVE_SECCOMP_FILTER && EXPERIMENTAL
+ help
+ This kernel feature expands CONFIG_SECCOMP to allow computing
+ in environments with reduced kernel access dictated by the
+ application itself through prctl calls. If
+ CONFIG_FTRACE_SYSCALLS is available, then system call
+ argument-based filtering predicates may be used.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config SECURITY
bool "Enable different security models"
depends on SYSFS
--
1.7.0.4
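
For readers following the syscalls[] encoding in struct seccomp_filters above, here is a minimal userspace sketch of the same idea. It is not the kernel code: the demo_* names and the 16-entry table size are assumptions made for illustration, and only the two SECCOMP_ACTION_* values and the memset(0xff) trick come from the patch. Each 16-bit slot is either DENY, ALLOW, or an index into a separate array of dynamic filters.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table size for the demo; the patch uses NR_syscalls. */
#define DEMO_NR_SYSCALLS 16

/* Sentinel values as defined in the patch. */
#define SECCOMP_ACTION_DENY  0xffff
#define SECCOMP_ACTION_ALLOW 0xfffe

enum demo_verdict { DEMO_DENY, DEMO_ALLOW, DEMO_DYNAMIC };

struct demo_filters {
	uint16_t syscalls[DEMO_NR_SYSCALLS];
};

/*
 * Filling every byte with 0xff makes each 16-bit slot read 0xffff,
 * so a fresh table denies everything without a per-slot loop.
 */
static void demo_filters_init(struct demo_filters *f)
{
	assert(f != NULL);
	memset(f->syscalls, 0xff, sizeof(f->syscalls));
}

/* Classify a slot: anything that is not a sentinel is a filter index. */
static enum demo_verdict demo_classify(const struct demo_filters *f, int nr)
{
	uint16_t id;

	assert(f != NULL);
	if (nr < 0 || nr >= DEMO_NR_SYSCALLS)
		return DEMO_DENY;	/* out of range: treat as denied */
	id = f->syscalls[nr];
	if (id == SECCOMP_ACTION_ALLOW)
		return DEMO_ALLOW;
	if (id == SECCOMP_ACTION_DENY)
		return DEMO_DENY;
	return DEMO_DYNAMIC;
}
```

The range check in demo_classify() is an addition for the sketch. In the patch, the prctl wrappers reject syscall_nr >= NR_syscalls before the table is indexed.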

2011-06-01 03:12:59

Subject: [PATCH v3 04/13] seccomp_filter: add process state reporting

Adds seccomp and seccomp_filter status reporting to proc.
/proc/<pid>/seccomp_filter provides the current seccomp mode
and the list of allowed or dynamically filtered system calls.

v3: changed to using filters directly.
v2: removed status entry, added seccomp file.
(requested by [email protected])
allowed S_IRUGO reading of entries
(requested by [email protected])
added flags
got rid of the seccomp_t type
dropped seccomp file

Signed-off-by: Will Drewry <[email protected]>
---
fs/proc/base.c | 25 +++++++++++++++++++++++++
1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index dfa5327..01473fe 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -73,6 +73,7 @@
#include <linux/security.h>
#include <linux/ptrace.h>
#include <linux/tracehook.h>
+#include <linux/seccomp.h>
#include <linux/cgroup.h>
#include <linux/cpuset.h>
#include <linux/audit.h>
@@ -579,6 +580,24 @@ static int proc_pid_syscall(struct task_struct *task, char *buffer)
}
#endif /* CONFIG_HAVE_ARCH_TRACEHOOK */

+/*
+ * Print out the current seccomp filter set for the task.
+ */
+#ifdef CONFIG_SECCOMP_FILTER
+int proc_pid_seccomp_filter_show(struct seq_file *m, struct pid_namespace *ns,
+ struct pid *pid, struct task_struct *task)
+{
+ struct seccomp_filters *filters;
+
+ rcu_read_lock();
+ filters = get_seccomp_filters(task->seccomp.filters);
+ rcu_read_unlock();
+ seccomp_show_filters(filters, m);
+ put_seccomp_filters(filters);
+ return 0;
+}
+#endif /* CONFIG_SECCOMP_FILTER */
+
/************************************************************************/
/* Here the fs part begins */
/************************************************************************/
@@ -2838,6 +2857,9 @@ static const struct pid_entry tgid_base_stuff[] = {
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
INF("syscall", S_IRUGO, proc_pid_syscall),
#endif
+#ifdef CONFIG_SECCOMP_FILTER
+ ONE("seccomp_filter", S_IRUGO, proc_pid_seccomp_filter_show),
+#endif
INF("cmdline", S_IRUGO, proc_pid_cmdline),
ONE("stat", S_IRUGO, proc_tgid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
@@ -3180,6 +3202,9 @@ static const struct pid_entry tid_base_stuff[] = {
#ifdef CONFIG_HAVE_ARCH_TRACEHOOK
INF("syscall", S_IRUGO, proc_pid_syscall),
#endif
+#ifdef CONFIG_SECCOMP_FILTER
+ ONE("seccomp_filter", S_IRUGO, proc_pid_seccomp_filter_show),
+#endif
INF("cmdline", S_IRUGO, proc_pid_cmdline),
ONE("stat", S_IRUGO, proc_tid_stat),
ONE("statm", S_IRUGO, proc_pid_statm),
--
1.7.0.4
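
As a rough illustration of the output format that seccomp_show_filters() emits (a "Mode: N" line followed by one "nr (name): filter" line per allowed or filtered syscall), the following userspace sketch parses a single entry line. The demo_* names and the buffer sizes are assumptions for this example and are not part of the series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for one parsed line; sizes are arbitrary for the demo. */
struct demo_entry {
	int nr;			/* system call number */
	char name[64];		/* name as printed inside the parentheses */
	char filter[256];	/* filter string, "1" for an unconditional allow */
};

/*
 * Parse one "<nr> (<name>): <filter>" line.
 * Returns 0 on success and -1 if the line has a different shape
 * (for example the leading "Mode: N" line).
 */
static int demo_parse_entry(const char *line, struct demo_entry *out)
{
	assert(line != NULL && out != NULL);
	memset(out, 0, sizeof(*out));
	if (sscanf(line, "%d (%63[^)]): %255[^\n]",
		   &out->nr, out->name, out->filter) != 3)
		return -1;
	return 0;
}
```

A reader of /proc/<pid>/seccomp_filter would call this once per line and skip the lines for which it returns -1.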

2011-06-01 03:12:43

Subject: [PATCH v3 05/13] seccomp_filter: Document what seccomp_filter is and how it works.

Adds a text file covering what CONFIG_SECCOMP_FILTER is, how it is
implemented presently, and what it may be used for. In addition,
the limitations and caveats of the proposed implementation are
included.

v3: a little more cleanup
v2: moved to prctl/
updated for the v2 syntax.
adds a note about compat behavior

Signed-off-by: Will Drewry <[email protected]>
---
Documentation/prctl/seccomp_filter.txt | 145 ++++++++++++++++++++++++++++++++
1 files changed, 145 insertions(+), 0 deletions(-)
create mode 100644 Documentation/prctl/seccomp_filter.txt

diff --git a/Documentation/prctl/seccomp_filter.txt b/Documentation/prctl/seccomp_filter.txt
new file mode 100644
index 0000000..27ac5af
--- /dev/null
+++ b/Documentation/prctl/seccomp_filter.txt
@@ -0,0 +1,145 @@
+ Seccomp filtering
+ =================
+
+Introduction
+------------
+
+A large number of system calls are exposed to every userland process
+with many of them going unused for the entire lifetime of the process.
+As system calls change and mature, bugs are found and eradicated. A
+certain subset of userland applications benefit by having a reduced set
+of available system calls. The reduced set reduces the total kernel
+surface exposed to the application. System call filtering is meant for
+use with those applications.
+
+The implementation currently leverages both the existing seccomp
+infrastructure and the kernel tracing infrastructure. By centralizing
+hooks for attack surface reduction in seccomp, it is possible to assure
+attention to security that is less relevant in normal ftrace scenarios,
+such as time-of-check, time-of-use attacks. However, ftrace provides a
+rich, human-friendly environment for interfacing with system call
+specific arguments. (As such, this requires FTRACE_SYSCALLS for any
+introspective filtering support.)
+
+
+What it isn't
+-------------
+
+System call filtering isn't a sandbox. It provides a clearly defined
+mechanism for minimizing the exposed kernel surface. Beyond that,
+policy for logical behavior and information flow should be managed with
+a combination of other system hardening techniques and, potentially, an
+LSM of your choosing. Expressive, dynamic filters based on the ftrace
+filter engine provide further options down this path (avoiding
+pathological sizes or selecting which of the multiplexed system calls in
+socketcall() is allowed, for instance) which could be construed,
+incorrectly, as a more complete sandboxing solution.
+
+
+Usage
+-----
+
+An additional seccomp mode is exposed through mode '2'.
+This mode depends on CONFIG_SECCOMP_FILTER. By default, it provides
+only the most trivial filter support: "1" (allow) or cleared. However, if
+CONFIG_FTRACE_SYSCALLS is enabled, the ftrace filter engine may be used
+for more expressive filters.
+
+A collection of filters may be supplied via prctl, and the current set
+of filters is exposed in /proc/<pid>/seccomp_filter.
+
+Interacting with seccomp filters can be done through three new prctl calls
+and one existing one.
+
+PR_SET_SECCOMP:
+ A pre-existing option for enabling strict seccomp mode (1) or
+ filtering seccomp (2).
+
+ Usage:
+ prctl(PR_SET_SECCOMP, 1); /* strict */
+ prctl(PR_SET_SECCOMP, 2); /* filters */
+
+PR_SET_SECCOMP_FILTER:
+ Allows the specification of a new filter for a given system
+ call, by number, and filter string. If CONFIG_FTRACE_SYSCALLS is
+ supported, the filter string may be any valid value for the
+ given system call. If it is not supported, the filter string
+ may only be "1".
+
+ All calls to PR_SET_SECCOMP_FILTER for a given system
+ call will append the supplied string to any existing filters.
+ Filter construction looks as follows:
+ (Nothing) + "fd == 1 || fd == 2" => fd == 1 || fd == 2
+ ... + "fd != 2" => (fd == 1 || fd == 2) && fd != 2
+ ... + "size < 100" =>
+ ((fd == 1 || fd == 2) && fd != 2) && size < 100
+ If there is no filter and the seccomp mode has already
+ transitioned to filtering, additions cannot be made. Filters
+ may only be added that reduce the available kernel surface.
+
+ Usage (per the construction example above):
+ prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd == 1 || fd == 2");
+ prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");
+ prctl(PR_SET_SECCOMP_FILTER, __NR_write, "size < 100");
+
+PR_CLEAR_SECCOMP_FILTER:
+ Removes all filter entries for a given system call number. When
+ called prior to entering seccomp filtering mode, it allows for
+ new filters to be applied to the same system call. After
+ transition, however, it completely drops access to the call.
+
+ Usage:
+ prctl(PR_CLEAR_SECCOMP_FILTER, __NR_open);
+
+PR_GET_SECCOMP_FILTER: Returns the aggregated filter string for a system
+ call into a user-supplied buffer of a given length.
+
+ Usage:
+ prctl(PR_GET_SECCOMP_FILTER, __NR_write, buf,
+ sizeof(buf));
+
+All of the above calls return 0 on success and non-zero on error.
+
+
+Example
+-------
+
+Assume a process would like to cleanly read and write to stdin/out/err
+as well as access its filters after seccomp enforcement begins. This
+may be done as follows:
+
+ prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 0");
+ prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd == 1 || fd == 2");
+ prctl(PR_SET_SECCOMP_FILTER, __NR_exit, "1");
+ prctl(PR_SET_SECCOMP_FILTER, __NR_prctl, "1");
+
+ prctl(PR_SET_SECCOMP, PR_SECCOMP_MODE_FILTER, 0);
+
+ /* Do stuff with fdset . . .*/
+
+ /* Drop read access and keep only write access to fd 1. */
+ prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
+ prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");
+
+ /* Perform any final processing . . . */
+ syscall(__NR_exit, 0);
+
+
+Caveats
+-------
+
+- The filter event subsystem comes from CONFIG_TRACE_EVENTS, and the
+system call events come from CONFIG_FTRACE_SYSCALLS. However, if
+neither is available, a filter string of "1" will be honored, and it may
+be removed using PR_CLEAR_SECCOMP_FILTER. With ftrace filtering,
+calling PR_SET_SECCOMP_FILTER with a filter of "0" would have a similar
+effect but would not be consistent on a kernel without the support.
+
+- Some platforms support a 32-bit userspace with 64-bit kernels. In
+these cases (CONFIG_COMPAT), system call numbers may not match across
+64-bit and 32-bit system calls. When the first PR_SET_SECCOMP_FILTER
+is called, the in-memory filters state is annotated with whether the
+call has been made via the compat interface. All subsequent calls will
+be checked for compat call mismatch. In the long run, it may make sense
+to store compat and non-compat filters separately, but that is not
+supported at present.
--
1.7.0.4
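
The filter construction rule in the documentation above, where each new string is AND-ed onto the parenthesized existing filter, can be sketched in userspace as follows. This mirrors the snprintf() call in seccomp_extend_filter() but is not the kernel function. demo_extend_filter() and its return convention are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Compose "(old) && extra" into dst.
 * Returns 0 on success and -1 if the result would not fit in dstsize
 * bytes including the terminating NUL.
 */
static int demo_extend_filter(char *dst, size_t dstsize,
			      const char *old, const char *extra)
{
	int n;

	assert(dst != NULL && old != NULL && extra != NULL && dstsize > 0);
	n = snprintf(dst, dstsize, "(%s) && %s", old, extra);
	if (n < 0 || (size_t)n >= dstsize)
		return -1;	/* truncated or encoding error */
	return 0;
}
```

Because the old filter is always wrapped in parentheses before the "&&", an added string can only narrow what the filter matches, which is the property the documentation relies on when it says filters may only reduce the available kernel surface.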

2011-06-01 03:10:57

Subject: [PATCH v3 06/13] x86: add HAVE_SECCOMP_FILTER and seccomp_execve

Adds support to the x86 architecture by providing a compatibility
mode wrapper for sys_execve's number and selecting HAVE_SECCOMP_FILTER.

Signed-off-by: Will Drewry <[email protected]>
---
arch/x86/Kconfig | 1 +
arch/x86/include/asm/ia32_unistd.h | 1 +
arch/x86/include/asm/seccomp_64.h | 2 ++
3 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc6c53a..1843d17 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -64,6 +64,7 @@ config X86
select HAVE_TEXT_POKE_SMP
select HAVE_GENERIC_HARDIRQS
select HAVE_SPARSE_IRQ
+ select HAVE_SECCOMP_FILTER
select GENERIC_FIND_FIRST_BIT
select GENERIC_FIND_NEXT_BIT
select GENERIC_IRQ_PROBE
diff --git a/arch/x86/include/asm/ia32_unistd.h b/arch/x86/include/asm/ia32_unistd.h
index 976f6ec..8ed2922 100644
--- a/arch/x86/include/asm/ia32_unistd.h
+++ b/arch/x86/include/asm/ia32_unistd.h
@@ -12,6 +12,7 @@
#define __NR_ia32_exit 1
#define __NR_ia32_read 3
#define __NR_ia32_write 4
+#define __NR_ia32_execve 11
#define __NR_ia32_sigreturn 119
#define __NR_ia32_rt_sigreturn 173

diff --git a/arch/x86/include/asm/seccomp_64.h b/arch/x86/include/asm/seccomp_64.h
index 84ec1bd..85c4219 100644
--- a/arch/x86/include/asm/seccomp_64.h
+++ b/arch/x86/include/asm/seccomp_64.h
@@ -8,10 +8,12 @@
#define __NR_seccomp_write __NR_write
#define __NR_seccomp_exit __NR_exit
#define __NR_seccomp_sigreturn __NR_rt_sigreturn
+#define __NR_seccomp_execve __NR_execve

#define __NR_seccomp_read_32 __NR_ia32_read
#define __NR_seccomp_write_32 __NR_ia32_write
#define __NR_seccomp_exit_32 __NR_ia32_exit
#define __NR_seccomp_sigreturn_32 __NR_ia32_sigreturn
+#define __NR_seccomp_execve_32 __NR_ia32_execve

#endif /* _ASM_X86_SECCOMP_64_H */
--
1.7.0.4
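
To show why a separate compat number for execve is needed, here is a small userspace sketch of the check that syscall_is_execve() performs. The demo_* names are assumptions. The value 11 is the ia32 number added in this patch and 59 is the x86-64 execve number. A real kernel would consult is_compat_task() rather than take a flag.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_EXECVE_64 59	/* x86-64 execve */
#define DEMO_NR_EXECVE_32 11	/* ia32 execve, per __NR_ia32_execve above */

/*
 * Returns true if nr is execve for the calling ABI. The same number
 * means a different call in the other ABI (11 is munmap on x86-64),
 * so the comparison has to be ABI-aware.
 */
static bool demo_syscall_is_execve(int nr, bool compat_task)
{
	int execve_nr = compat_task ? DEMO_NR_EXECVE_32 : DEMO_NR_EXECVE_64;

	return nr == execve_nr;
}
```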

2011-06-01 03:12:29

Subject: [PATCH v3 07/13] arm: select HAVE_SECCOMP_FILTER

Enable support for CONFIG_SECCOMP_FILTER by selecting HAVE_SECCOMP_FILTER by
default.

Signed-off-by: Will Drewry <[email protected]>
---
arch/arm/Kconfig | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 377a7a5..4725fbc 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -16,6 +16,7 @@ config ARM
select HAVE_FTRACE_MCOUNT_RECORD if (!XIP_KERNEL)
select HAVE_DYNAMIC_FTRACE if (!XIP_KERNEL)
select HAVE_FUNCTION_GRAPH_TRACER if (!THUMB2_KERNEL)
+ select HAVE_SECCOMP_FILTER
select HAVE_GENERIC_DMA_COHERENT
select HAVE_KERNEL_GZIP
select HAVE_KERNEL_LZO
--
1.7.0.4

2011-06-01 03:11:01

Subject: [PATCH v3 08/13] microblaze: select HAVE_SECCOMP_FILTER and provide seccomp_execve

Facilitate the use of CONFIG_SECCOMP_FILTER by wrapping compatibility
system call numbering for execve and selecting HAVE_SECCOMP_FILTER.

Signed-off-by: Will Drewry <[email protected]>
---
arch/microblaze/Kconfig | 1 +
arch/microblaze/include/asm/seccomp.h | 2 ++
2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index eccdefe..30ef677 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -1,6 +1,7 @@
config MICROBLAZE
def_bool y
select HAVE_MEMBLOCK
+ select HAVE_SECCOMP_FILTER
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_TRACE_MCOUNT_TEST
select HAVE_FUNCTION_GRAPH_TRACER
diff --git a/arch/microblaze/include/asm/seccomp.h b/arch/microblaze/include/asm/seccomp.h
index 0d91275..0e38eed 100644
--- a/arch/microblaze/include/asm/seccomp.h
+++ b/arch/microblaze/include/asm/seccomp.h
@@ -7,10 +7,12 @@
#define __NR_seccomp_write __NR_write
#define __NR_seccomp_exit __NR_exit
#define __NR_seccomp_sigreturn __NR_sigreturn
+#define __NR_seccomp_execve __NR_execve

#define __NR_seccomp_read_32 __NR_read
#define __NR_seccomp_write_32 __NR_write
#define __NR_seccomp_exit_32 __NR_exit
#define __NR_seccomp_sigreturn_32 __NR_sigreturn
+#define __NR_seccomp_execve_32 __NR_execve

#endif /* _ASM_MICROBLAZE_SECCOMP_H */
--
1.7.0.4

2011-06-01 03:11:57

Subject: [PATCH v3 09/13] mips: select HAVE_SECCOMP_FILTER and provide seccomp_execve

Facilitate the use of CONFIG_SECCOMP_FILTER by wrapping compatibility
system call numbering for execve and selecting HAVE_SECCOMP_FILTER.

Signed-off-by: Will Drewry <[email protected]>
---
arch/mips/Kconfig | 1 +
arch/mips/include/asm/seccomp.h | 3 +++
2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 8e256cc..d376f68 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -10,6 +10,7 @@ config MIPS
select HAVE_ARCH_KGDB
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_TRACE_MCOUNT_TEST
+ select HAVE_SECCOMP_FILTER
select HAVE_DYNAMIC_FTRACE
select HAVE_FTRACE_MCOUNT_RECORD
select HAVE_C_RECORDMCOUNT
diff --git a/arch/mips/include/asm/seccomp.h b/arch/mips/include/asm/seccomp.h
index ae6306e..4014a3a 100644
--- a/arch/mips/include/asm/seccomp.h
+++ b/arch/mips/include/asm/seccomp.h
@@ -6,6 +6,7 @@
#define __NR_seccomp_write __NR_write
#define __NR_seccomp_exit __NR_exit
#define __NR_seccomp_sigreturn __NR_rt_sigreturn
+#define __NR_seccomp_execve __NR_execve

/*
* Kludge alert:
@@ -19,6 +20,7 @@
#define __NR_seccomp_write_32 4004
#define __NR_seccomp_exit_32 4001
#define __NR_seccomp_sigreturn_32 4193 /* rt_sigreturn */
+#define __NR_seccomp_execve_32 4011

#elif defined(CONFIG_MIPS32_N32)

@@ -26,6 +28,7 @@
#define __NR_seccomp_write_32 6001
#define __NR_seccomp_exit_32 6058
#define __NR_seccomp_sigreturn_32 6211 /* rt_sigreturn */
+#define __NR_seccomp_execve_32 6057

#endif /* CONFIG_MIPS32_O32 */

--
1.7.0.4

2011-06-01 03:11:55

Subject: [PATCH v3 10/13] s390: select HAVE_SECCOMP_FILTER and provide seccomp_execve

Facilitate the use of CONFIG_SECCOMP_FILTER by wrapping compatibility
system call numbering for execve and selecting HAVE_SECCOMP_FILTER.

Signed-off-by: Will Drewry <[email protected]>
---
arch/s390/Kconfig | 1 +
arch/s390/include/asm/seccomp.h | 3 ++-
2 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2508a6f..9382198 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -64,6 +64,7 @@ config ARCH_SUPPORTS_DEBUG_PAGEALLOC
config S390
def_bool y
select USE_GENERIC_SMP_HELPERS if SMP
+ select HAVE_SECCOMP_FILTER
select HAVE_SYSCALL_WRAPPERS
select HAVE_FUNCTION_TRACER
select HAVE_FUNCTION_TRACE_MCOUNT_TEST
diff --git a/arch/s390/include/asm/seccomp.h b/arch/s390/include/asm/seccomp.h
index 781a9cf..e5792f5 100644
--- a/arch/s390/include/asm/seccomp.h
+++ b/arch/s390/include/asm/seccomp.h
@@ -7,10 +7,11 @@
#define __NR_seccomp_write __NR_write
#define __NR_seccomp_exit __NR_exit
#define __NR_seccomp_sigreturn __NR_sigreturn
+#define __NR_seccomp_execve __NR_execve

#define __NR_seccomp_read_32 __NR_read
#define __NR_seccomp_write_32 __NR_write
#define __NR_seccomp_exit_32 __NR_exit
-#define __NR_seccomp_sigreturn_32 __NR_sigreturn
+#define __NR_seccomp_execve_32 __NR_execve

#endif /* _ASM_S390_SECCOMP_H */
--
1.7.0.4

2011-06-01 03:11:05

Subject: [PATCH v3 11/13] powerpc: select HAVE_SECCOMP_FILTER and provide seccomp_execve

Facilitate the use of CONFIG_SECCOMP_FILTER by wrapping compatibility
system call numbering for execve and selecting HAVE_SECCOMP_FILTER.

Signed-off-by: Will Drewry <[email protected]>
---
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/seccomp.h | 2 ++
2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8f4d50b..0bd4574 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -137,6 +137,7 @@ config PPC
select HAVE_HW_BREAKPOINT if PERF_EVENTS && PPC_BOOK3S_64
select HAVE_GENERIC_HARDIRQS
select HAVE_SPARSE_IRQ
+ select HAVE_SECCOMP_FILTER
select IRQ_PER_CPU
select GENERIC_IRQ_SHOW
select GENERIC_IRQ_SHOW_LEVEL
diff --git a/arch/powerpc/include/asm/seccomp.h b/arch/powerpc/include/asm/seccomp.h
index 00c1d91..3cb9cc1 100644
--- a/arch/powerpc/include/asm/seccomp.h
+++ b/arch/powerpc/include/asm/seccomp.h
@@ -7,10 +7,12 @@
#define __NR_seccomp_write __NR_write
#define __NR_seccomp_exit __NR_exit
#define __NR_seccomp_sigreturn __NR_rt_sigreturn
+#define __NR_seccomp_execve __NR_execve

#define __NR_seccomp_read_32 __NR_read
#define __NR_seccomp_write_32 __NR_write
#define __NR_seccomp_exit_32 __NR_exit
#define __NR_seccomp_sigreturn_32 __NR_sigreturn
+#define __NR_seccomp_execve_32 __NR_execve

#endif /* _ASM_POWERPC_SECCOMP_H */
--
1.7.0.4

2011-06-01 03:11:34

Subject: [PATCH v3 12/13] sparc: select HAVE_SECCOMP_FILTER and provide seccomp_execve

Facilitate the use of CONFIG_SECCOMP_FILTER by wrapping compatibility
system call numbering for execve and selecting HAVE_SECCOMP_FILTER.

Signed-off-by: Will Drewry <[email protected]>
---
arch/sparc/Kconfig | 2 ++
arch/sparc/include/asm/seccomp.h | 2 ++
2 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e560d10..5249760 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -25,6 +25,7 @@ config SPARC
select HAVE_DMA_ATTRS
select HAVE_DMA_API_DEBUG
select HAVE_ARCH_JUMP_LABEL
+ select HAVE_SECCOMP_FILTER

config SPARC32
def_bool !64BIT
@@ -39,6 +40,7 @@ config SPARC64
select HAVE_KRETPROBES
select HAVE_KPROBES
select HAVE_MEMBLOCK
+ select HAVE_SECCOMP_FILTER
select HAVE_SYSCALL_WRAPPERS
select HAVE_DYNAMIC_FTRACE
select HAVE_FTRACE_MCOUNT_RECORD
diff --git a/arch/sparc/include/asm/seccomp.h b/arch/sparc/include/asm/seccomp.h
index adca1bc..a1dac08 100644
--- a/arch/sparc/include/asm/seccomp.h
+++ b/arch/sparc/include/asm/seccomp.h
@@ -6,10 +6,12 @@
#define __NR_seccomp_write __NR_write
#define __NR_seccomp_exit __NR_exit
#define __NR_seccomp_sigreturn __NR_rt_sigreturn
+#define __NR_seccomp_execve __NR_execve

#define __NR_seccomp_read_32 __NR_read
#define __NR_seccomp_write_32 __NR_write
#define __NR_seccomp_exit_32 __NR_exit
#define __NR_seccomp_sigreturn_32 __NR_sigreturn
+#define __NR_seccomp_execve_32 __NR_execve

#endif /* _ASM_SECCOMP_H */
--
1.7.0.4

2011-06-01 03:11:10

Subject: [PATCH v3 13/13] sh: select HAVE_SECCOMP_FILTER

Add support for CONFIG_SECCOMP_FILTER by selecting HAVE_SECCOMP_FILTER.
---
arch/sh/Kconfig | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 4b89da2..41ea3a7 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -10,6 +10,7 @@ config SUPERH
select HAVE_DMA_API_DEBUG
select HAVE_DMA_ATTRS
select HAVE_IRQ_WORK
+ select HAVE_SECCOMP_FILTER
select HAVE_PERF_EVENTS
select PERF_USE_VMALLOC
select HAVE_KERNEL_GZIP
--
1.7.0.4

2011-06-01 03:37:23

by David Miller

Subject: Re: [PATCH v3 12/13] sparc: select HAVE_SECCOMP_FILTER and provide seccomp_execve

From: Will Drewry <[email protected]>
Date: Tue, 31 May 2011 22:10:44 -0500

> Facilitate the use of CONFIG_SECCOMP_FILTER by wrapping compatibility
> system call numbering for execve and selecting HAVE_SECCOMP_FILTER.
>
> Signed-off-by: Will Drewry <[email protected]>

Acked-by: David S. Miller <[email protected]>

2011-06-01 05:38:00

by Michal Simek

Subject: Re: [PATCH v3 08/13] microblaze: select HAVE_SECCOMP_FILTER and provide seccomp_execve

Will Drewry wrote:
> Facilitate the use of CONFIG_SECCOMP_FILTER by wrapping compatibility
> system call numbering for execve and selecting HAVE_SECCOMP_FILTER.
>
> Signed-off-by: Will Drewry <[email protected]>
> ---
> arch/microblaze/Kconfig | 1 +
> arch/microblaze/include/asm/seccomp.h | 2 ++
> 2 files changed, 3 insertions(+), 0 deletions(-)

Acked-by: Michal Simek <[email protected]>

--
Michal Simek, Ing. (M.Eng)
w: http://www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel 2.6 Microblaze Linux - http://www.monstr.eu/fdt/
Microblaze U-BOOT custodian

2011-06-01 07:00:43

Subject: Re: [PATCH v3 02/13] tracing: split out syscall_trace_enter construction

* Will Drewry <[email protected]> wrote:

> perf appears to be the primary consumer of the CONFIG_FTRACE_SYSCALLS
> infrastructure. As such, many of the helpers targeted at perf can be split
> into a perf-focused helper and a generic CONFIG_FTRACE_SYSCALLS
> consumer interface.
>
> This change splits out syscall_trace_enter construction from
> perf_syscall_enter for current into two helpers:
> - ftrace_syscall_enter_state
> - ftrace_syscall_enter_state_size
>
> And adds another helper for completeness:
> - ftrace_syscall_exit_state_size
>
> These helpers allow for shared code between perf ftrace events and
> any other consumers of CONFIG_FTRACE_SYSCALLS events. The proposed
> seccomp_filter patches use this code.
>
> Signed-off-by: Will Drewry <[email protected]>
> ---
> include/trace/syscall.h | 4 ++
> kernel/trace/trace_syscalls.c | 96 +++++++++++++++++++++++++++++++++++------
> 2 files changed, 86 insertions(+), 14 deletions(-)

So, looking at the diffstat comparison again:

      bitmask (2009):  6 files changed,  194 insertions(+), 22 deletions(-)
filter engine (2010): 18 files changed, 1100 insertions(+), 21 deletions(-)
event filters (2011):  5 files changed,   82 insertions(+), 16 deletions(-)

you went back to the middle solution again which is the worst of them
- why?

If you want this to be a stupid, limited hack then go for the v1
bitmask.

If you agree with my observation that filters allow the clean
user-space implementation of LSM equivalent security solutions (of
which sandboxes are just a *narrow special case*) then please use the
main highlevel abstraction we have defined around them: event
filters.

Now, my observation was not uncontested so let me try to sum up the
rather large discussion that erupted around it, as i see it.

I saw four main counter arguments:

- "Sandboxing is special and should stay separate from LSMs."

I think this is a technically bogus argument, see:

https://lkml.org/lkml/2011/5/26/85

That answer of mine went unchallenged.

- "Events should only be observers."

Even ignoring the question of why on earth it should be a problem
for a willing call-site to use event filtering results sensibly,
this argument misses the plain fact that events are *already*
active participants, see:

http://www.spinics.net/lists/mips/msg41075.html

That answer of mine went unchallenged too.

- "This feature is too simplistic."

That's wrong i think, the feature is highly flexible:

http://www.mail-archive.com/[email protected]/msg51387.html

This reply of mine went unchallenged as well.

- "Is this feature actually useful enough for applications, does it
justify the complexity?"

This is the *only* valid technical counter-argument i saw, and it's
a crucial one that is not fully answered yet. Since i think the feature
is an LSM equivalent i think it's at least as useful as any LSM is.

- [ if i missed any important argument then someone please insert it
here. ]

But what you do here is to use the filter engine directly which is
both a limited hack *and* complex (beyond the linecount it doubles
our ABI exposure, amongst other things), so i find that approach
rather counter-productive, now that i've seen the real thing.

Will this feature be just another example of the LSM status quo
dragging down a newcomer into the mud, until it's just as sucky and
limited as any existing LSMs? That would be a sad outcome!

Thanks,

Ingo

ps. Please start a new discussion thread for the next iteration!
This one is *way* too deep already.

2011-06-01 17:15:21

Subject: Re: [PATCH v3 02/13] tracing: split out syscall_trace_enter construction

On Wed, Jun 1, 2011 at 2:00 AM, Ingo Molnar <[email protected]> wrote:
>
> * Will Drewry <[email protected]> wrote:
>
>> perf appears to be the primary consumer of the CONFIG_FTRACE_SYSCALLS
>> infrastructure.  As such, many of the helpers targeted at perf can be split
>> into a perf-focused helper and a generic CONFIG_FTRACE_SYSCALLS
>> consumer interface.
>>
>> This change splits out syscall_trace_enter construction from
>> perf_syscall_enter for current into two helpers:
>> - ftrace_syscall_enter_state
>> - ftrace_syscall_enter_state_size
>>
>> And adds another helper for completeness:
>> - ftrace_syscall_exit_state_size
>>
>> These helpers allow for shared code between perf ftrace events and
>> any other consumers of CONFIG_FTRACE_SYSCALLS events.  The proposed
>> seccomp_filter patches use this code.
>>
>> Signed-off-by: Will Drewry <[email protected]>
>> ---
>>  include/trace/syscall.h       |    4 ++
>>  kernel/trace/trace_syscalls.c |   96 +++++++++++++++++++++++++++++++++++------
>>  2 files changed, 86 insertions(+), 14 deletions(-)
>
> So, looking at the diffstat comparison again:
>
>       bitmask (2009):  6 files changed,  194 insertions(+), 22 deletions(-)
> filter engine (2010): 18 files changed, 1100 insertions(+), 21 deletions(-)
> event filters (2011):  5 files changed,   82 insertions(+), 16 deletions(-)
>
> you went back to the middle solution again which is the worst of them
> - why?

In short, design for the future and implement now. I'll elaborate a
bit more below.

> If you want this to be a stupid, limited hack then go for the v1
> bitmask.

I only aim for the finest!

(bitmasks were bad for the other consumers of this patch series:
socketcall multiplexing issues and ioctl # filtering).

> If you agree with my observation that filters allow the clean
> user-space implementation of LSM equivalent security solutions (of
> which sandboxes are just a *narrow special case*) then please use the
> main highlevel abstraction we have defined around them: event
> filters.

I agree that LSM-equivalent security solutions can be moved over to an
ftrace based infrastructure. However, LSMs and seccomp have different
semantics. Reducing the kernel attack surface in a
"sandboxing"-sort-of-way requires a default-deny interface that is
resilient to kernel changes (like new system calls) without
immediately degrading robustness. LSMs provide a fail-open mechanism
for taking an active role in kernel-defined pinch points. It is
possible to implement a default-deny LSM, but it requires a "hook" for
every security event and the addition of a security event results in a
hole in the not-so-default-deny infrastructure. ftrace + event
filters are the same.

Based on my observations while exploring the code, it appears that the
LSM security_* calls could easily become active trace events and the
LSM infrastructure moved over to use those as tracepoints or via
event_filters. There will be a need for new predicates for the
various new types (inode *, etc), and so on. However, the
trace_sys_enter/__secure_computing model will still be a special case.
Even if they fed into security event subsystem or something like
that, the absence of filters on a traced process would need to
default-deny as well as when there are no active matches. So while a
brand-new shared ABI may be possible (security_event_open,
active_event_open, ?), there will still be trickiness in making the
behaviors not have implicit side effects and ensure that newly added
system calls, for instance, that lack the macro wrapper don't poke a
hole in the "sandbox" model. There are a lot of options for designing
it though. Like making TIF_SECCOMP mean that any security_* filter
failure or match count of 0 == process death. It's just that
designing this new approach will be incredibly hairy, and we really
lack many of the concrete requirements that would be needed, in my
opinion.
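
The difference between the two semantics can be sketched in a few lines of C. This is only an illustration under stated assumptions: the function names, the verdict enum, and the boolean tables are hypothetical and do not come from the patch series or from the LSM code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical verdicts, named for illustration only. */
enum verdict { VERDICT_DENY = 0, VERDICT_ALLOW = 1 };

/*
 * Default-deny: a call passes only if it has an explicit "allowed"
 * entry. A number outside the table (for example a syscall added
 * after the table was built) is denied.
 */
enum verdict default_deny_check(const bool *allowed, size_t n, size_t nr)
{
	if (nr >= n)
		return VERDICT_DENY;
	return allowed[nr] ? VERDICT_ALLOW : VERDICT_DENY;
}

/*
 * Fail-open: a call is stopped only if a hook exists for it and that
 * hook objects. A number with no hook passes.
 */
enum verdict fail_open_check(const bool *hook_denies, size_t n, size_t nr)
{
	if (nr >= n)
		return VERDICT_ALLOW;
	return hook_denies[nr] ? VERDICT_DENY : VERDICT_ALLOW;
}
```

With a table sized for today's syscalls, a number past the end is denied by the first check and allowed by the second, which is the robustness gap described above.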

> Now, my observation was not uncontested so let me try to sum up the
> rather large discussion that erupted around it, as i see it.
>
> I saw four main counter arguments:
>
>  - "Sandboxing is special and should stay separate from LSMs."
>
>   I think this is a technically bogus argument, see:
>
>         https://lkml.org/lkml/2011/5/26/85
>
>   That answer of mine went unchallenged.

I may have spoken to this above. I dunno.

>  - "Events should only be observers."
>
>   Even ignoring the question of why on earth it should be a problem
>   for a willing call-site to use event filtering results sensibly,
>   this argument misses the plain fact that events are *already*
>   active participants, see:
>
>         http://www.spinics.net/lists/mips/msg41075.html
>
>   That answer of mine went unchallenged too.
>
>  - "This feature is too simplistic."
>
>   That's wrong i think, the feature is highly flexible:
>
>         http://www.mail-archive.com/[email protected]/msg51387.html
>
>   This reply of mine went unchallenged as well.

Well I did only implement a PoC. It couldn't handle attack surface
reduction after-the-fact, nor did I add a GET_FILTER call, etc. The
code was minimal in many ways because the functionality was too.

>  - "Is this feature actually useful enough for applications, does it
>    justify the complexity?"
>
>  This is the *only* valid technical counter-argument i saw, and it's
>  a crucial one that is not fully answered yet. Since i think the feature
>  is an LSM equivalent i think it's at least as useful as any LSM is.
>
>  - [ if i missed any important argument then someone please insert it
>      here. ]
>
> But what you do here is to use the filter engine directly which is
> both a limited hack *and* complex (beyond the linecount it doubles
> our ABI exposure, amongst other things), so i find that approach
> rather counter-productive, now that i've seen the real thing.
>
> Will this feature be just another example of the LSM status quo
> dragging down a newcomer into the mud, until it's just as sucky and
> limited as any existing LSMs? That would be a sad outcome!

I hope not. I believe it will be easy to move the backend of
seccomp_filter over to a per-task ftrace event filter infrastructure
when that comes in the future. But for now, I'm trying to meet the
needs of possible consumers now: chromium, qemu, lxc, and lay
groundwork for a ftrace-future.

If this is a total fail, then perhaps we should have a separate
discussion over how we can tackle a lot of these needs. I was hoping
that we could push some of that off to the LinuxSecuritySummit -- I've
proposed/requested a QA panel on this topic :) But I'd love to not
wait until then for everything.

> ps. Please start a new discussion thread for the next iteration!
>    This one is *way* too deep already.

Sorry - will do!

thanks!
will

2011-06-01 21:23:50

by Kees Cook

Subject: Re: [PATCH v3 05/13] seccomp_filter: Document what seccomp_filter is and how it works.

Hi Will,

Minor typo corrections below...

On Tue, May 31, 2011 at 10:10:37PM -0500, Will Drewry wrote:
> Adds a text file covering what CONFIG_SECCOMP_FILTER is, how it is
> implemented presently, and what it may be used for. In addition,
> the limitations and caveats of the proposed implementation are
> included.
>
> --- /dev/null
> +++ b/Documentation/prctl/seccomp_filter.txt
> @@ -0,0 +1,145 @@
> ...
> +certain subset of userland applications benefit by having a reduce set
                                                               reduced

> +of available system calls. The reduced set reduces the total kernel

Maybe "The resulting set reduces ... " ?

-Kees

--
Kees Cook
Ubuntu Security Team

2011-06-01 23:03:06

Subject: Re: [PATCH v3 05/13] seccomp_filter: Document what seccomp_filter is and how it works.

On Wed, Jun 1, 2011 at 4:23 PM, Kees Cook <[email protected]> wrote:
> Hi Will,
>
> Minor typo corrections below...
>
> On Tue, May 31, 2011 at 10:10:37PM -0500, Will Drewry wrote:
>>  Adds a text file covering what CONFIG_SECCOMP_FILTER is, how it is
>>  implemented presently, and what it may be used for.  In addition,
>>  the limitations and caveats of the proposed implementation are
>>  included.
>>
>> --- /dev/null
>> +++ b/Documentation/prctl/seccomp_filter.txt
>> @@ -0,0 +1,145 @@
>> ...
>> +certain subset of userland applications benefit by having a reduce set
>                                                               reduced
>
>> +of available system calls.  The reduced set reduces the total kernel
>
> Maybe "The resulting set reduces ... " ?

Cool - I'll clean it up in the next cut.

Thanks!
will

2011-06-02 05:27:52

by Paul Mundt

Subject: Re: [PATCH v3 13/13] sh: select HAVE_SECCOMP_FILTER

On Tue, May 31, 2011 at 10:10:45PM -0500, Will Drewry wrote:
> Add support for CONFIG_SECCOMP_FILTER by selecting HAVE_SECCOMP_FILTER.
> ---
> arch/sh/Kconfig | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
Acked-by: Paul Mundt <[email protected]>

2011-06-02 14:29:36

Subject: Re: [PATCH v3 02/13] tracing: split out syscall_trace_enter construction

* Will Drewry <[email protected]> wrote:

> > If you agree with my observation that filters allow the clean
> > user-space implementation of LSM equivalent security solutions
> > (of which sandboxes are just a *narrow special case*) then please
> > use the main highlevel abstraction we have defined around them:
> > event filters.
>
> I agree that LSM-equivalent security solutions can be moved over to
> an ftrace based infrastructure. However, LSMs and seccomp have
> different semantics. Reducing the kernel attack surface in a
> "sandboxing"-sort-of-way requires a default-deny interface that is
> resilient to kernel changes (like new system calls) without
> immediately degrading robustness. [...]

Correct. Because seccomp is the user of those syscall-surface events
it can use them in such a way - i see no problem there: unknown or
not permitted syscalls get denied for seccomp-mode-2 tasks.

> [...] LSMs provide a fail-open mechanism for taking an active role
> in kernel-defined pinch points. It is possible to implement a
> default-deny LSM, but it requires a "hook" for every security event
> and the addition of a security event results in a hole in the
> not-so-default-deny infrastructure. ftrace + event filters are the
> same.

Well, i only suggested that it's LSM-equivalent security
functionality, i did not suggest that you should implement an LSM in
security/. I do not think the LSM modularization is particularly well
fit for seccomp.

> Based on my observations while exploring the code, it appears that
> the LSM security_* calls could easily become active trace events
> and the LSM infrastructure moved over to use those as tracepoints
> or via event_filters. There will be a need for new predicates for
> the various new types (inode *, etc), and so on. However, the
> trace_sys_enter/__secure_computing model will still be a special
> case.

Yes, and that special event will not go away!

I did not suggest to *replace* those events with the security events.
I suggested to *combine* them - or at least have a model that
smoothly extends to those events as well and does not limit itself to
the syscall surface alone.

We'll want to have both.

But by hardcoding to only those events, and creating a
syscall-numbering special ABI, a wall will be raised between this
implementation and any future enhancement to cover other events. My
suggestion would be to use the event filter approach - that way
there's not a wall but an open door towards future extensions ;-)

Thanks,

Ingo

2011-06-02 15:18:23

Subject: Re: [PATCH v3 02/13] tracing: split out syscall_trace_enter construction

On Thu, Jun 2, 2011 at 9:29 AM, Ingo Molnar <[email protected]> wrote:
>
> * Will Drewry <[email protected]> wrote:
>
[...]
>
>> Based on my observations while exploring the code, it appears that
>> the LSM security_* calls could easily become active trace events
>> and the LSM infrastructure moved over to use those as tracepoints
>> or via event_filters.  There will be a need for new predicates for
>> the various new types (inode *, etc), and so on.  However, the
>> trace_sys_enter/__secure_computing model will still be a special
>> case.
>
> Yes, and that special event will not go away!
>
> I did not suggest to *replace* those events with the security events.
> I suggested to *combine* them - or at least have a model that
> smoothly extends to those events as well and does not limit itself to
> the syscall surface alone.
>
> We'll want to have both.
>
> But by hardcoding to only those events, and creating a
> syscall-numbering special ABI, a wall will be raised between this
> implementation and any future enhancement to cover other events. My
> suggestion would be to use the event filter approach - that way
> there's not a wall but an open door towards future extensions ;-)

Yeah, I can definitely see that. We could have the prctl interface
take in the event id, but that introduces a dependency on
CONFIG_PERF_EVENTS in addition (to get the id exported) and means
we'll have much more limited coverage of syscalls until the syscall
wrapping matures.

Could this be resolved in the proposed change by supporting both
mechanisms? Or is that just asking for trouble?

E.g., it could be an extra field:
  prctl(PR_SET_SECCOMP_FILTER, PR_SECCOMP_FILTER_TYPE_EVENT, event_id,
        filter_string);
  prctl(PR_SET_SECCOMP_FILTER, PR_SECCOMP_FILTER_TYPE_SYSCALL,
        __NR_somesyscall, filter_string);
  [and the same for CLEAR_FILTER and GET_FILTER]

or even reserve negative values for event ids and positive for
syscalls (which feels more hackish). Adding event_id support wouldn't
be much more additional code (since it's just a layer of
dereferencing). Since there will likely be syscall-indexed entry
behavior no matter what (like there is for ftrace/perf_sysenter), it
won't necessarily be a large diversion in the future either.
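
As a rough sketch of the sign-based variant: the decoder below treats non-negative values as syscall numbers and negative values as event ids. The struct, enum, and function names are hypothetical, and mapping -N to event id N is an assumption made for illustration only; nothing in the posted patches defines it.

```c
/* Hypothetical target kinds for a single signed prctl argument. */
enum filter_target_type { FILTER_TARGET_SYSCALL, FILTER_TARGET_EVENT };

struct filter_target {
	enum filter_target_type type;
	unsigned long id;
};

/*
 * Decode a signed selector: value >= 0 names a syscall number,
 * value < 0 names event id -value (assumed mapping).
 */
struct filter_target decode_filter_target(long value)
{
	struct filter_target t;

	if (value >= 0) {
		t.type = FILTER_TARGET_SYSCALL;
		t.id = (unsigned long)value;
	} else {
		t.type = FILTER_TARGET_EVENT;
		/* Negate without overflowing when value is the most negative long. */
		t.id = (unsigned long)(-(value + 1)) + 1UL;
	}
	return t;
}
```

Note that an event id of 0 has no encoding in this scheme, whereas the explicit type field above can express it.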

If not, seccomp_filter could depend on both FTRACE_SYSCALLS and
exported PERF_EVENTS (or make "id"s not perf_event specific), then it
could just use the sys_enter event ids. Doing so does have some other
properties that I'm not as fond of, like requiring debugfs to be
compiled in, mounted, and readable by the caller in order to construct
a filterset, so I can still see some benefit for the syscall number
use in some cases (much easier to deploy on a server without debugfs
access, etc). Right now, having both interfaces doesn't really give
us anything, but having the field set aside for future exploration
isn't necessarily a bad thing!

What do you think? Would a change to support both be too crazy/dumb or
just crazy/dumb enough? Or do you see another path that could avoid
isolating any current work from a more fruitful future?

thanks!
will

2011-06-02 17:37:05

by Paul E. McKenney

Subject: Re: [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters

On Tue, May 31, 2011 at 10:10:35PM -0500, Will Drewry wrote:
> This change adds a new seccomp mode which specifies the allowed system
> calls dynamically. When in the new mode (2), all system calls are
> checked against process-defined filters - first by system call number,
> then by a filter string. If an entry exists for a given system call and
> all filter predicates evaluate to true, then the task may proceed.
> Otherwise, the task is killed.

A few questions below -- I can't say that I understand the RCU usage.

Thanx, Paul

> Filter string parsing and evaluation is handled by the ftrace filter
> engine. Related patches tweak the perf filter trace and free paths,
> allowing the calls to be shared. Filters inherit their understanding of
> types and arguments for each system call from the CONFIG_FTRACE_SYSCALLS
> subsystem which already populates this information in syscall_metadata
> associated enter_event (and exit_event) structures. If
> CONFIG_FTRACE_SYSCALLS is not compiled in, only filter strings of "1"
> will be allowed.
>
> The net result is a process may have its system calls filtered using the
> ftrace filter engine's inherent understanding of system calls. The set
> of filters is specified through the PR_SET_SECCOMP_FILTER argument in
> prctl(). For example, a filterset for a process, like pdftotext, that
> should only process read-only input could (roughly) look like:
>   sprintf(rdonly, "flags == %u", O_RDONLY|O_LARGEFILE);
>   prctl(PR_SET_SECCOMP_FILTER, __NR_open, rdonly);
>   prctl(PR_SET_SECCOMP_FILTER, __NR__llseek, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_brk, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_close, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_exit_group, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_fstat64, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_mmap2, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_munmap, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_read, "1");
>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "(fd == 1 || fd == 2)");
>   prctl(PR_SET_SECCOMP, 2);
>
> Subsequent calls to PR_SET_SECCOMP_FILTER for the same system call will
> be &&'d together to ensure that attack surface may only be reduced:
> prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");
>
> With the earlier example, the active filter becomes:
> "(fd == 1 || fd == 2) && fd != 2"
>
> The patch also adds PR_CLEAR_SECCOMP_FILTER and PR_GET_SECCOMP_FILTER.
> The latter returns the current filter for a system call to userspace:
>
> prctl(PR_GET_SECCOMP_FILTER, __NR_write, buf, bufsize);
>
> while the former clears any filters for a given system call, changing it
> back to a default deny:
>
> prctl(PR_CLEAR_SECCOMP_FILTER, __NR_write);
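
The "only ever narrow" behaviour of repeated PR_SET_SECCOMP_FILTER calls can be sketched in userspace C. This is a minimal illustration, assuming a hypothetical helper seccomp_filter_and(); the kernel code that does the real composition is not part of this excerpt, and the extra parentheses here differ from the string shown in the description above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: AND a new predicate onto an existing filter
 * string. An empty or NULL existing string means no filter has been
 * installed yet, so the new predicate is used as-is.
 * Returns 0 on success, -1 if the result would not fit in out.
 */
int seccomp_filter_and(char *out, size_t outsz,
		       const char *existing, const char *extra)
{
	int n;

	if (existing == NULL || existing[0] == '\0')
		n = snprintf(out, outsz, "%s", extra);
	else
		n = snprintf(out, outsz, "(%s) && (%s)", existing, extra);

	return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

Because each call can only add a conjunct, the set of argument values a syscall accepts can shrink but never grow, short of clearing the filter and falling back to deny.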
>
> v3: - always block execve calls (as per linus torvalds)
> - add __NR_seccomp_execve(_32) to seccomp-supporting arches
> - ensure compat tasks can't reach ftrace:syscalls
> - dropped new defines for seccomp modes.
> - two level array instead of hlists (sugg. by olof johansson)
> - added generic Kconfig entry that is not connected.
> - dropped internal seccomp.h
> - move prctl helpers to seccomp_filter
> - killed seccomp_t typedef (as per checkpatch)
> v2: - changed to use the existing syscall number ABI.
> - prctl changes to minimize parsing in the kernel:
> prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
> prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
> prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
> prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
> - defined PR_SECCOMP_MODE_STRICT and ..._FILTER
> - added flags
> - provide a default fail syscall_nr_to_meta in ftrace
> - provides fallback for unhooked system calls
> - use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
> - added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
> - moved to a hlist and 4 bit hash of linked lists
> - added support to operate without CONFIG_FTRACE_SYSCALLS
> - moved Kconfig support next to SECCOMP
> - made Kconfig entries dependent on EXPERIMENTAL
> - added macros to avoid ifdefs from kernel/fork.c
> - added compat task/filter matching
> - drop seccomp.h inclusion in sched.h and drop seccomp_t
> - added Filtering to "show" output
> - added on_exec state dup'ing when enabling after a fast-path accept.
>
> Signed-off-by: Will Drewry <[email protected]>
> ---
> include/linux/prctl.h | 5 +
> include/linux/sched.h | 2 +-
> include/linux/seccomp.h | 98 ++++++-
> include/trace/syscall.h | 7 +
> kernel/Makefile | 3 +
> kernel/fork.c | 3 +
> kernel/seccomp.c | 38 ++-
> kernel/seccomp_filter.c | 784 +++++++++++++++++++++++++++++++++++++++++++++++
> kernel/sys.c | 13 +-
> security/Kconfig | 17 +
> 10 files changed, 954 insertions(+), 16 deletions(-)
> create mode 100644 kernel/seccomp_filter.c
>
> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
> index a3baeb2..44723ce 100644
> --- a/include/linux/prctl.h
> +++ b/include/linux/prctl.h
> @@ -64,6 +64,11 @@
> #define PR_GET_SECCOMP 21
> #define PR_SET_SECCOMP 22
>
> +/* Get/set process seccomp filters */
> +#define PR_GET_SECCOMP_FILTER 35
> +#define PR_SET_SECCOMP_FILTER 36
> +#define PR_CLEAR_SECCOMP_FILTER 37
> +
> /* Get/set the capability bounding set (as per security/commoncap.c) */
> #define PR_CAPBSET_READ 23
> #define PR_CAPBSET_DROP 24
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 18d63ce..3f0bc8d 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1374,7 +1374,7 @@ struct task_struct {
> uid_t loginuid;
> unsigned int sessionid;
> #endif
> - seccomp_t seccomp;
> + struct seccomp_struct seccomp;
>
> /* Thread group tracking */
> u32 parent_exec_id;
> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> index 167c333..f4434ca 100644
> --- a/include/linux/seccomp.h
> +++ b/include/linux/seccomp.h
> @@ -1,13 +1,33 @@
> #ifndef _LINUX_SECCOMP_H
> #define _LINUX_SECCOMP_H
>
> +struct seq_file;
>
> #ifdef CONFIG_SECCOMP
>
> +#include <linux/errno.h>
> #include <linux/thread_info.h>
> +#include <linux/types.h>
> #include <asm/seccomp.h>
>
> -typedef struct { int mode; } seccomp_t;
> +struct seccomp_filters;
> +/**
> + * struct seccomp_struct - the state of a seccomp'ed process
> + *
> + * @mode:
> + * if this is 1, the process is under standard seccomp rules;
> + * if this is 2, the process is only allowed to make system calls where
> + * associated filters evaluate successfully.
> + * @filters: Metadata for filters if using CONFIG_SECCOMP_FILTER.
> + * filters assignment/use should be RCU-protected and its contents
> + * should never be modified when attached to a seccomp_struct.
> + */
> +struct seccomp_struct {
> + uint16_t mode;
> +#ifdef CONFIG_SECCOMP_FILTER
> + struct seccomp_filters *filters;
> +#endif
> +};
>
> extern void __secure_computing(int);
> static inline void secure_computing(int this_syscall)
> @@ -16,15 +36,14 @@ static inline void secure_computing(int this_syscall)
> __secure_computing(this_syscall);
> }
>
> -extern long prctl_get_seccomp(void);
> extern long prctl_set_seccomp(unsigned long);
> +extern long prctl_get_seccomp(void);
>
> #else /* CONFIG_SECCOMP */
>
> #include <linux/errno.h>
>
> -typedef struct { } seccomp_t;
> -
> +struct seccomp_struct { };
> #define secure_computing(x) do { } while (0)
>
> static inline long prctl_get_seccomp(void)
> @@ -32,11 +51,80 @@ static inline long prctl_get_seccomp(void)
> return -EINVAL;
> }
>
> -static inline long prctl_set_seccomp(unsigned long arg2)
> +static inline long prctl_set_seccomp(unsigned long a2)
> {
> return -EINVAL;
> }
>
> #endif /* CONFIG_SECCOMP */
>
> +#ifdef CONFIG_SECCOMP_FILTER
> +
> +#define inherit_tsk_seccomp(_child, _orig) do { \
> + _child->seccomp.mode = _orig->seccomp.mode; \
> + _child->seccomp.filters = get_seccomp_filters(_orig->seccomp.filters); \
> + } while (0)
> +#define put_tsk_seccomp(_tsk) put_seccomp_filters(_tsk->seccomp.filters)
> +
> +extern int seccomp_show_filters(struct seccomp_filters *filters,
> + struct seq_file *);
> +extern long seccomp_set_filter(int, char *);
> +extern long seccomp_clear_filter(int);
> +extern long seccomp_get_filter(int, char *, unsigned long);
> +
> +extern long prctl_set_seccomp_filter(unsigned long, char __user *);
> +extern long prctl_get_seccomp_filter(unsigned long, char __user *,
> + unsigned long);
> +extern long prctl_clear_seccomp_filter(unsigned long);
> +
> +extern struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *);
> +extern void put_seccomp_filters(struct seccomp_filters *);
> +
> +extern int seccomp_test_filters(int);
> +extern void seccomp_filter_log_failure(int);
> +
> +#else /* CONFIG_SECCOMP_FILTER */
> +
> +struct seccomp_filters { };
> +#define inherit_tsk_seccomp(_child, _orig) do { } while (0)
> +#define put_tsk_seccomp(_tsk) do { } while (0)
> +
> +static inline int seccomp_show_filters(struct seccomp_filters *filters,
> + struct seq_file *m)
> +{
> + return -ENOSYS;
> +}
> +
> +static inline long seccomp_set_filter(int syscall_nr, char *filter)
> +{
> + return -ENOSYS;
> +}
> +
> +static inline long seccomp_clear_filter(int syscall_nr)
> +{
> + return -ENOSYS;
> +}
> +
> +static inline long seccomp_get_filter(int syscall_nr,
> + char *buf, unsigned long available)
> +{
> + return -ENOSYS;
> +}
> +
> +static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3)
> +{
> + return -ENOSYS;
> +}
> +
> +static inline long prctl_clear_seccomp_filter(unsigned long a2)
> +{
> + return -ENOSYS;
> +}
> +
> +static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3,
> + unsigned long a4)
> +{
> + return -ENOSYS;
> +}
> +#endif /* CONFIG_SECCOMP_FILTER */
> #endif /* _LINUX_SECCOMP_H */
> diff --git a/include/trace/syscall.h b/include/trace/syscall.h
> index 242ae04..e061ad0 100644
> --- a/include/trace/syscall.h
> +++ b/include/trace/syscall.h
> @@ -35,6 +35,8 @@ struct syscall_metadata {
> extern unsigned long arch_syscall_addr(int nr);
> extern int init_syscall_trace(struct ftrace_event_call *call);
>
> +extern struct syscall_metadata *syscall_nr_to_meta(int);
> +
> extern int reg_event_syscall_enter(struct ftrace_event_call *call);
> extern void unreg_event_syscall_enter(struct ftrace_event_call *call);
> extern int reg_event_syscall_exit(struct ftrace_event_call *call);
> @@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
> struct trace_event *event);
> enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
> struct trace_event *event);
> +#else
> +static inline struct syscall_metadata *syscall_nr_to_meta(int nr)
> +{
> + return NULL;
> +}
> #endif
>
> #ifdef CONFIG_PERF_EVENTS
> diff --git a/kernel/Makefile b/kernel/Makefile
> index 85cbfb3..84e7dfb 100644
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
> obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
> obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
> obj-$(CONFIG_SECCOMP) += seccomp.o
> +ifeq ($(CONFIG_SECCOMP_FILTER),y)
> +obj-$(CONFIG_SECCOMP) += seccomp_filter.o
> +endif
> obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
> obj-$(CONFIG_TREE_RCU) += rcutree.o
> obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
> diff --git a/kernel/fork.c b/kernel/fork.c
> index e7548de..6f835e0 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -34,6 +34,7 @@
> #include <linux/cgroup.h>
> #include <linux/security.h>
> #include <linux/hugetlb.h>
> +#include <linux/seccomp.h>
> #include <linux/swap.h>
> #include <linux/syscalls.h>
> #include <linux/jiffies.h>
> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
> free_thread_info(tsk->stack);
> rt_mutex_debug_task_free(tsk);
> ftrace_graph_exit_task(tsk);
> + put_tsk_seccomp(tsk);
> free_task_struct(tsk);
> }
> EXPORT_SYMBOL(free_task);
> @@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
> if (err)
> goto out;
>
> + inherit_tsk_seccomp(tsk, orig);
> setup_thread_stack(tsk, orig);
> clear_user_return_notifier(tsk);
> clear_tsk_need_resched(tsk);
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 57d4b13..0a942be 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -2,16 +2,20 @@
> * linux/kernel/seccomp.c
> *
> * Copyright 2004-2005 Andrea Arcangeli <[email protected]>
> + * Copyright (C) 2011 The Chromium OS Authors <[email protected]>
> *
> * This defines a simple but solid secure-computing mode.
> */
>
> #include <linux/seccomp.h>
> #include <linux/sched.h>
> +#include <linux/slab.h>
> #include <linux/compat.h>
> +#include <linux/unistd.h>
> +#include <linux/ftrace_event.h>
>
> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
> /* #define SECCOMP_DEBUG 1 */
> -#define NR_SECCOMP_MODES 1
>
> /*
> * Secure computing mode 1 allows only read/write/exit/sigreturn.
> @@ -32,10 +36,9 @@ static int mode1_syscalls_32[] = {
>
> void __secure_computing(int this_syscall)
> {
> - int mode = current->seccomp.mode;
> int * syscall;
>
> - switch (mode) {
> + switch (current->seccomp.mode) {
> case 1:
> syscall = mode1_syscalls;
> #ifdef CONFIG_COMPAT
> @@ -47,6 +50,17 @@ void __secure_computing(int this_syscall)
> return;
> } while (*++syscall);
> break;
> +#ifdef CONFIG_SECCOMP_FILTER
> + case 2:
> + if (this_syscall >= NR_syscalls || this_syscall < 0)
> + break;
> +
> + if (!seccomp_test_filters(this_syscall))
> + return;
> +
> + seccomp_filter_log_failure(this_syscall);
> + break;
> +#endif
> default:
> BUG();
> }
> @@ -71,16 +85,22 @@ long prctl_set_seccomp(unsigned long seccomp_mode)
> if (unlikely(current->seccomp.mode))
> goto out;
>
> - ret = -EINVAL;
> - if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
> - current->seccomp.mode = seccomp_mode;
> - set_thread_flag(TIF_SECCOMP);
> + ret = 0;
> + switch (seccomp_mode) {
> + case 1:
> #ifdef TIF_NOTSC
> disable_TSC();
> #endif
> - ret = 0;
> +#ifdef CONFIG_SECCOMP_FILTER
> + case 2:
> +#endif
> + current->seccomp.mode = seccomp_mode;
> + set_thread_flag(TIF_SECCOMP);
> + break;
> + default:
> + ret = -EINVAL;
> }
>
> - out:
> +out:
> return ret;
> }
> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
> new file mode 100644
> index 0000000..9782f25
> --- /dev/null
> +++ b/kernel/seccomp_filter.c
> @@ -0,0 +1,784 @@
> +/* filter engine-based seccomp system call filtering
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) 2011 The Chromium OS Authors <[email protected]>
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/err.h>
> +#include <linux/errno.h>
> +#include <linux/ftrace_event.h>
> +#include <linux/seccomp.h>
> +#include <linux/seq_file.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +
> +#include <asm/syscall.h>
> +#include <trace/syscall.h>
> +
> +
> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
> +
> +#define SECCOMP_FILTER_ALLOW "1"
> +#define SECCOMP_ACTION_DENY 0xffff
> +#define SECCOMP_ACTION_ALLOW 0xfffe
> +
> +/**
> + * struct seccomp_filters - container for seccomp filterset
> + *
> + * @syscalls: array of 16-bit indices into @event_filters by syscall_nr
> + * May also be SECCOMP_ACTION_DENY or SECCOMP_ACTION_ALLOW
> + * @event_filters: array of pointers to ftrace event objects
> + * @count: size of @event_filters
> + * @flags: anonymous struct to wrap filters-specific flags
> + * @usage: reference count to simplify use.
> + */
> +struct seccomp_filters {
> + uint16_t syscalls[NR_syscalls];
> + struct event_filter **event_filters;
> + uint16_t count;
> + struct {
> + uint32_t compat:1,
> + __reserved:31;
> + } flags;
> + atomic_t usage;
> +};
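
To check my reading of the two-level layout: each syscall number maps
to a 16-bit value that is either one of the two sentinels or an index
into the event_filters array. A small userspace sketch of that lookup
follows. Everything prefixed sketch_/SKETCH_ is hypothetical, strings
stand in for struct event_filter, and the "id >= count" bounds check is
my addition rather than something the patch does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NR_SYSCALLS  8       /* hypothetical; the patch uses NR_syscalls */
#define SKETCH_ACTION_DENY  0xffff  /* same sentinel values as the patch */
#define SKETCH_ACTION_ALLOW 0xfffe

enum sketch_verdict { SKETCH_DENY, SKETCH_ALLOW, SKETCH_DYNAMIC };

struct sketch_filters {
	uint16_t syscalls[SKETCH_NR_SYSCALLS]; /* sentinel or index, per syscall */
	const char **event_filters;            /* stand-in for struct event_filter ** */
	uint16_t count;                        /* number of event_filters entries */
};

static void sketch_init(struct sketch_filters *f,
			const char **filters, uint16_t count)
{
	/* Filling every byte with 0xff makes each 16-bit entry 0xffff (DENY). */
	memset(f->syscalls, 0xff, sizeof(f->syscalls));
	f->event_filters = filters;
	f->count = count;
}

static enum sketch_verdict sketch_classify(const struct sketch_filters *f,
					   int nr)
{
	uint16_t id;

	if (nr < 0 || nr >= SKETCH_NR_SYSCALLS)
		return SKETCH_DENY;
	id = f->syscalls[nr];
	if (id == SKETCH_ACTION_ALLOW)
		return SKETCH_ALLOW;
	/* Extra bounds check, not in the patch: treat a bad index as DENY. */
	if (id == SKETCH_ACTION_DENY || id >= f->count)
		return SKETCH_DENY;
	return SKETCH_DYNAMIC;
}
```

A SKETCH_DYNAMIC result means the caller still has to evaluate
event_filters[id] against the syscall arguments.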
> +
> +/* Handle ftrace symbol non-existence */
> +#ifdef CONFIG_FTRACE_SYSCALLS
> +#define create_event_filter(_ef_pptr, _event_type, _str) \
> + ftrace_parse_filter(_ef_pptr, _event_type, _str)
> +#define get_filter_string(_ef) ftrace_get_filter_string(_ef)
> +#define free_event_filter(_f) ftrace_free_filter(_f)
> +
> +#else
> +
> +#define create_event_filter(_ef_pptr, _event_type, _str) (-ENOSYS)
> +#define get_filter_string(_ef) (NULL)
> +#define free_event_filter(_f) do { } while (0)
> +#endif
> +
> +/**
> + * seccomp_filters_new - allocates a new filters object
> + * @count: count to allocate for the event_filters array
> + *
> + * Returns ERR_PTR on error or an allocated object.
> + */
> +static struct seccomp_filters *seccomp_filters_new(uint16_t count)
> +{
> + struct seccomp_filters *f;
> +
> + if (count >= SECCOMP_ACTION_ALLOW)
> + return ERR_PTR(-EINVAL);
> +
> + f = kzalloc(sizeof(struct seccomp_filters), GFP_KERNEL);
> + if (!f)
> + return ERR_PTR(-ENOMEM);
> +
> + /* Lazy SECCOMP_ACTION_DENY assignment. */
> + memset(f->syscalls, 0xff, sizeof(f->syscalls));
> + atomic_set(&f->usage, 1);
> +
> + f->event_filters = NULL;
> + f->count = count;
> + if (!count)
> + return f;
> +
> + f->event_filters = kzalloc(count * sizeof(struct event_filter *),
> + GFP_KERNEL);
> + if (!f->event_filters) {
> + kfree(f);
> + f = ERR_PTR(-ENOMEM);
> + }
> + return f;
> +}
> +
> +/**
> + * seccomp_filters_free - cleans up the filter list and frees the table
> + * @filters: NULL or live object to be completely destructed.
> + */
> +static void seccomp_filters_free(struct seccomp_filters *filters)
> +{
> + uint16_t count = 0;
> + if (!filters)
> + return;
> + while (count < filters->count) {
> + struct event_filter *f = filters->event_filters[count];
> + free_event_filter(f);
> + count++;
> + }
> + kfree(filters->event_filters);
> + kfree(filters);
> +}
> +
> +static void __put_seccomp_filters(struct seccomp_filters *orig)
> +{
> + WARN_ON(atomic_read(&orig->usage));
> + seccomp_filters_free(orig);
> +}
> +
> +#define seccomp_filter_allow(_id) ((_id) == SECCOMP_ACTION_ALLOW)
> +#define seccomp_filter_deny(_id) ((_id) == SECCOMP_ACTION_DENY)
> +#define seccomp_filter_dynamic(_id) \
> + (!seccomp_filter_allow(_id) && !seccomp_filter_deny(_id))
> +static inline uint16_t seccomp_filter_id(const struct seccomp_filters *f,
> + int syscall_nr)
> +{
> + if (!f)
> + return SECCOMP_ACTION_DENY;
> + return f->syscalls[syscall_nr];
> +}
> +
> +static inline struct event_filter *seccomp_dynamic_filter(
> + const struct seccomp_filters *filters, uint16_t id)
> +{
> + if (!seccomp_filter_dynamic(id))
> + return NULL;
> + return filters->event_filters[id];
> +}
> +
> +static inline void set_seccomp_filter_id(struct seccomp_filters *filters,
> + int syscall_nr, uint16_t id)
> +{
> + filters->syscalls[syscall_nr] = id;
> +}
> +
> +static inline void set_seccomp_filter(struct seccomp_filters *filters,
> + int syscall_nr, uint16_t id,
> + struct event_filter *dynamic_filter)
> +{
> + filters->syscalls[syscall_nr] = id;
> + if (seccomp_filter_dynamic(id))
> + filters->event_filters[id] = dynamic_filter;
> +}
> +
> +static struct event_filter *alloc_event_filter(int syscall_nr,
> + const char *filter_string)
> +{
> + struct syscall_metadata *data;
> + struct event_filter *filter = NULL;
> + int err;
> +
> + data = syscall_nr_to_meta(syscall_nr);
> + /* Argument-based filtering only works on ftrace-hooked syscalls. */
> + err = -ENOSYS;
> + if (!data)
> + goto fail;
> + err = create_event_filter(&filter,
> + data->enter_event->event.type,
> + filter_string);
> + if (err)
> + goto fail;
> +
> + return filter;
> +fail:
> + kfree(filter);
> + return ERR_PTR(err);
> +}
> +
> +/**
> + * seccomp_filters_copy - copies filters from src to dst.
> + *
> + * @dst: seccomp_filters to populate.
> + * @src: table to read from.
> + * @skip: specifies an entry, by system call, to skip.
> + *
> + * Returns non-zero on failure.
> + * Both the source and the destination should have no simultaneous
> + * writers, and dst should be exclusive to the caller.
> + * If @skip is < 0, it is ignored.
> + */
> +static int seccomp_filters_copy(struct seccomp_filters *dst,
> + const struct seccomp_filters *src,
> + int skip)
> +{
> + int id = 0, ret = 0, nr;
> + memcpy(&dst->flags, &src->flags, sizeof(src->flags));
> + memcpy(dst->syscalls, src->syscalls, sizeof(dst->syscalls));
> + if (!src->count)
> + goto done;
> + for (nr = 0; nr < NR_syscalls; ++nr) {
> + struct event_filter *filter;
> + const char *str;
> + uint16_t src_id = seccomp_filter_id(src, nr);
> + if (nr == skip) {
> + set_seccomp_filter(dst, nr, SECCOMP_ACTION_DENY,
> + NULL);
> + continue;
> + }
> + if (!seccomp_filter_dynamic(src_id))
> + continue;
> + if (id >= dst->count) {
> + ret = -EINVAL;
> + goto done;
> + }
> + str = get_filter_string(seccomp_dynamic_filter(src, src_id));
> + filter = alloc_event_filter(nr, str);
> + if (IS_ERR(filter)) {
> + ret = PTR_ERR(filter);
> + goto done;
> + }
> + set_seccomp_filter(dst, nr, id, filter);
> + id++;
> + }
> +
> +done:
> + return ret;
> +}
> +
> +/**
> + * seccomp_extend_filter - appends more text to a syscall_nr's filter
> + * @filters: unattached filter object to operate on
> + * @syscall_nr: syscall number to update filters for
> + * @filter_string: string to append to the existing filter
> + *
> + * The new string will be &&'d to the original filter string to ensure that it
> + * always matches the existing predicates or less:
> + * (old_filter) && @filter_string
> + * Returns 0 on success and a negative errno on failure.
> + */
> +static int seccomp_extend_filter(struct seccomp_filters *filters,
> + int syscall_nr, char *filter_string)
> +{
> + struct event_filter *filter;
> + uint16_t id = seccomp_filter_id(filters, syscall_nr);
> + char *merged = NULL;
> + int ret = -EINVAL, expected;
> +
> + /* No extending with a "1". */
> + if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string))
> + goto out;
> +
> + filter = seccomp_dynamic_filter(filters, id);
> + ret = -ENOENT;
> + if (!filter)
> + goto out;
> +
> + merged = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
> + ret = -ENOMEM;
> + if (!merged)
> + goto out;
> +
> + expected = snprintf(merged, SECCOMP_MAX_FILTER_LENGTH, "(%s) && %s",
> + get_filter_string(filter), filter_string);
> + ret = -E2BIG;
> + if (expected >= SECCOMP_MAX_FILTER_LENGTH || expected < 0)
> + goto out;
> +
> + /* Free the old filter */
> + free_event_filter(filter);
> + set_seccomp_filter(filters, syscall_nr, id, NULL);
> +
> + /* Replace it */
> + filter = alloc_event_filter(syscall_nr, merged);
> + if (IS_ERR(filter)) {
> + ret = PTR_ERR(filter);
> + goto out;
> + }
> + set_seccomp_filter(filters, syscall_nr, id, filter);
> + ret = 0;
> +
> +out:
> + kfree(merged);
> + return ret;
> +}
> +
> +/**
> + * seccomp_add_filter - adds a filter for an unfiltered syscall
> + * @filters: filters object to add a filter/action to
> + * @syscall_nr: system call number to add a filter for
> + * @filter_string: the filter string to apply
> + *
> + * Returns 0 on success and non-zero otherwise.
> + */
> +static int seccomp_add_filter(struct seccomp_filters *filters, int syscall_nr,
> + char *filter_string)
> +{
> + struct event_filter *filter;
> + int ret = 0;
> +
> + if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string)) {
> + set_seccomp_filter(filters, syscall_nr,
> + SECCOMP_ACTION_ALLOW, NULL);
> + goto out;
> + }
> +
> + filter = alloc_event_filter(syscall_nr, filter_string);
> + if (IS_ERR(filter)) {
> + ret = PTR_ERR(filter);
> + goto out;
> + }
> + /* Always add to the last slot available since additions are
> + * only done one at a time.
> + */
> + set_seccomp_filter(filters, syscall_nr, filters->count - 1, filter);
> +out:
> + return ret;
> +}
> +
> +/* Wrap optional ftrace syscall support. Returns 1 on match or 0 otherwise. */
> +static int filter_match_current(struct event_filter *event_filter)
> +{
> + int err = 0;
> +#ifdef CONFIG_FTRACE_SYSCALLS
> + uint8_t syscall_state[64];
> +
> + memset(syscall_state, 0, sizeof(syscall_state));
> +
> + /* The generic tracing entry can remain zeroed. */
> + err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
> + NULL);
> + if (err)
> + return 0;
> +
> + err = filter_match_preds(event_filter, syscall_state);
> +#endif
> + return err;
> +}
> +
> +static const char *syscall_nr_to_name(int syscall)
> +{
> + const char *syscall_name = "unknown";
> + struct syscall_metadata *data = syscall_nr_to_meta(syscall);
> + if (data)
> + syscall_name = data->name;
> + return syscall_name;
> +}
> +
> +static void filters_set_compat(struct seccomp_filters *filters)
> +{
> +#ifdef CONFIG_COMPAT
> + if (is_compat_task())
> + filters->flags.compat = 1;
> +#endif
> +}
> +
> +static inline int filters_compat_mismatch(struct seccomp_filters *filters)
> +{
> + int ret = 0;
> + if (!filters)
> + return 0;
> +#ifdef CONFIG_COMPAT
> + if (!!(is_compat_task()) != filters->flags.compat)
> + ret = 1;
> +#endif
> + return ret;
> +}
> +
> +static inline int syscall_is_execve(int syscall)
> +{
> + int nr = __NR_execve;
> +#ifdef CONFIG_COMPAT
> + if (is_compat_task())
> + nr = __NR_seccomp_execve_32;
> +#endif
> + return syscall == nr;
> +}
> +
> +#ifndef KSTK_EIP
> +#define KSTK_EIP(x) 0L
> +#endif
> +
> +void seccomp_filter_log_failure(int syscall)
> +{
> + pr_info("%s[%d]: system call %d (%s) blocked at 0x%lx\n",
> + current->comm, task_pid_nr(current), syscall,
> + syscall_nr_to_name(syscall), KSTK_EIP(current));
> +}
> +
> +/* put_seccomp_state - decrements the reference count of @orig and may free. */
> +void put_seccomp_filters(struct seccomp_filters *orig)
> +{
> + if (!orig)
> + return;
> +
> + if (atomic_dec_and_test(&orig->usage))
> + __put_seccomp_filters(orig);
> +}
> +
> +/* get_seccomp_state - increments the reference count of @orig */
> +struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *orig)

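Nit: the names in these two comments (put_seccomp_state,
get_seccomp_state) do not match the functions they document
(put_seccomp_filters, get_seccomp_filters).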

> +{
> + if (!orig)
> + return NULL;
> + atomic_inc(&orig->usage);
> + return orig;

This is called in an RCU read-side critical section. What exactly is
RCU protecting? I would expect an rcu_dereference() or one of the
RCU list-traversal primitives somewhere, either here or at the caller.
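
To illustrate the pairing I have in mind, here is a userspace analogue
using C11 atomics. This is only a sketch: the sketch_* names are
hypothetical, the release store and acquire load stand in for
rcu_assign_pointer() and rcu_dereference() respectively (acquire is
stronger than what rcu_dereference() needs), and there is no grace
period handling at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_pub {
	int count;
};

/* Stand-in for current->seccomp.filters. */
static _Atomic(struct sketch_pub *) sketch_published;

/* Writer side: initialize the object fully, then publish the pointer. */
static void sketch_publish(struct sketch_pub *f)
{
	atomic_store_explicit(&sketch_published, f, memory_order_release);
}

/* Reader side: fetch the pointer once, with ordering, before using it. */
static struct sketch_pub *sketch_read(void)
{
	return atomic_load_explicit(&sketch_published, memory_order_acquire);
}
```

The point is that the reader fetches the pointer through a primitive
that orders the load against the writer's initialization, rather than
reading the field as a plain load.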

> +}
> +
> +/**
> + * seccomp_test_filters - tests 'current' against the given syscall
> + * @syscall: number of the system call to test
> + *
> + * Returns 0 on ok and non-zero on error/failure.
> + */
> +int seccomp_test_filters(int syscall)
> +{
> + uint16_t id;
> + struct event_filter *filter;
> + struct seccomp_filters *filters;
> + int ret = -EACCES;
> +
> + rcu_read_lock();
> + filters = get_seccomp_filters(current->seccomp.filters);
> + rcu_read_unlock();
> +
> + if (!filters)
> + goto out;
> +
> + if (filters_compat_mismatch(filters)) {
> + pr_info("%s[%d]: seccomp_filter compat() mismatch.\n",
> + current->comm, task_pid_nr(current));
> + goto out;
> + }
> +
> + /* execve is never allowed. */
> + if (syscall_is_execve(syscall))
> + goto out;
> +
> + ret = 0;
> + id = seccomp_filter_id(filters, syscall);
> + if (seccomp_filter_allow(id))
> + goto out;
> +
> + ret = -EACCES;
> + if (!seccomp_filter_dynamic(id))
> + goto out;
> +
> + filter = seccomp_dynamic_filter(filters, id);
> + if (filter && filter_match_current(filter))
> + ret = 0;
> +out:
> + put_seccomp_filters(filters);
> + return ret;
> +}
> +
> +/**
> + * seccomp_show_filters - prints the current filter state to a seq_file
> + * @filters: properly get()'d filters object
> + * @m: the prepared seq_file to receive the data
> + *
> + * Returns 0 on a successful write.
> + */
> +int seccomp_show_filters(struct seccomp_filters *filters, struct seq_file *m)
> +{
> + int syscall;
> + seq_printf(m, "Mode: %d\n", current->seccomp.mode);
> + if (!filters)
> + goto out;
> +
> + for (syscall = 0; syscall < NR_syscalls; ++syscall) {
> + uint16_t id = seccomp_filter_id(filters, syscall);
> + const char *filter_string = SECCOMP_FILTER_ALLOW;
> + if (seccomp_filter_deny(id))
> + continue;
> + seq_printf(m, "%d (%s): ",
> + syscall,
> + syscall_nr_to_name(syscall));
> + if (seccomp_filter_dynamic(id))
> + filter_string = get_filter_string(
> + seccomp_dynamic_filter(filters, id));
> + seq_printf(m, "%s\n", filter_string);
> + }
> +out:
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(seccomp_show_filters);
> +
> +/**
> + * seccomp_get_filter - copies the filter_string into "buf"
> + * @syscall_nr: system call number to look up
> + * @buf: destination buffer
> + * @bufsize: available space in the buffer.
> + *
> + * Context: User context only. This function may sleep on allocation and
> + * operates on current. current must be attempting a system call
> + * when this is called.
> + *
> + * Looks up the filter for the given system call number on current. If found,
> + * the string length of the NUL-terminated buffer is returned and < 0 is
> + * returned on error. The NUL byte is not included in the length.
> + */
> +long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize)
> +{
> + struct seccomp_filters *filters;
> + struct event_filter *filter;
> + long ret = -EINVAL;
> + uint16_t id;
> +
> + if (bufsize > SECCOMP_MAX_FILTER_LENGTH)
> + bufsize = SECCOMP_MAX_FILTER_LENGTH;
> +
> + rcu_read_lock();
> + filters = get_seccomp_filters(current->seccomp.filters);
> + rcu_read_unlock();
> +
> + if (!filters)
> + goto out;
> +
> + ret = -ENOENT;
> + id = seccomp_filter_id(filters, syscall_nr);
> + if (seccomp_filter_deny(id))
> + goto out;
> +
> + if (seccomp_filter_allow(id)) {
> + ret = strlcpy(buf, SECCOMP_FILTER_ALLOW, bufsize);
> + goto copied;
> + }
> +
> + filter = seccomp_dynamic_filter(filters, id);
> + if (!filter)
> + goto out;
> + ret = strlcpy(buf, get_filter_string(filter), bufsize);
> +
> +copied:
> + if (ret >= bufsize) {
> + ret = -ENOSPC;
> + goto out;
> + }
> + /* Zero out any remaining buffer, just in case. */
> + memset(buf + ret, 0, bufsize - ret);
> +out:
> + put_seccomp_filters(filters);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(seccomp_get_filter);
> +
> +/**
> + * seccomp_clear_filter - clears the seccomp filter for a syscall.
> + * @syscall_nr: the system call number to clear filters for.
> + *
> + * Context: User context only. This function may sleep on allocation and
> + * operates on current. current must be attempting a system call
> + * when this is called.
> + *
> + * Returns 0 on success.
> + */
> +long seccomp_clear_filter(int syscall_nr)
> +{
> + struct seccomp_filters *filters = NULL, *orig_filters;
> + uint16_t id;
> + int ret = -EINVAL;
> +
> + rcu_read_lock();
> + orig_filters = get_seccomp_filters(current->seccomp.filters);
> + rcu_read_unlock();
> +
> + if (!orig_filters)
> + goto out;
> +
> + if (filters_compat_mismatch(orig_filters))
> + goto out;
> +
> + id = seccomp_filter_id(orig_filters, syscall_nr);
> + if (seccomp_filter_deny(id))
> + goto out;
> +
> + /* Create a new filters object for the task */
> + if (seccomp_filter_dynamic(id))
> + filters = seccomp_filters_new(orig_filters->count - 1);
> + else
> + filters = seccomp_filters_new(orig_filters->count);
> +
> + if (IS_ERR(filters)) {
> + ret = PTR_ERR(filters);
> + goto out;
> + }
> +
> + /* Copy, but drop the requested entry. */
> + ret = seccomp_filters_copy(filters, orig_filters, syscall_nr);
> + if (ret)
> + goto out;
> + get_seccomp_filters(filters); /* simplify the out: path */
> +
> + rcu_assign_pointer(current->seccomp.filters, filters);

What prevents two copies of seccomp_clear_filter() from running
concurrently?
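
To make the concern concrete: if two updaters both copy the same
original and both publish, one update is lost and the original is put
twice. An update-side lock is the usual answer. Another option, sketched
below in userspace C11 with hypothetical sketch_* names, is to install
the new pointer only if the published one is still the object that was
copied. I am not claiming the patch needs exactly this; if only current
can ever update its own filters, a comment saying so may be enough.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_fset {
	int generation;
};

/* Stand-in for current->seccomp.filters. */
static _Atomic(struct sketch_fset *) sketch_current;

/*
 * Install next only if the published pointer is still expected, i.e.
 * the object the caller made its copy from. Returns false if another
 * updater got there first; the caller would then drop its copy,
 * re-read the published pointer, and retry.
 */
static bool sketch_install(struct sketch_fset *expected,
			   struct sketch_fset *next)
{
	return atomic_compare_exchange_strong(&sketch_current,
					      &expected, next);
}
```

With this shape a stale updater fails cleanly instead of overwriting a
newer filter set.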

> + synchronize_rcu();
> + put_seccomp_filters(orig_filters); /* for the task */
> +out:
> + put_seccomp_filters(orig_filters); /* for the get */
> + put_seccomp_filters(filters); /* for the extra get */
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(seccomp_clear_filter);
> +
> +/**
> + * seccomp_set_filter - Adds/extends a seccomp filter for a syscall.
> + * @syscall_nr: system call number to apply the filter to.
> + * @filter: ftrace filter string to apply.
> + *
> + * Context: User context only. This function may sleep on allocation and
> + * operates on current. current must be attempting a system call
> + * when this is called.
> + *
> + * New filters may be added for system calls when the current task is
> + * not in a secure computing mode (seccomp). Otherwise, existing filters may
> + * be extended.
> + *
> + * Returns 0 on success or an errno on failure.
> + */
> +long seccomp_set_filter(int syscall_nr, char *filter)
> +{
> + struct seccomp_filters *filters = NULL, *orig_filters = NULL;
> + uint16_t id;
> + long ret = -EINVAL;
> + uint16_t filters_needed;
> +
> + if (!filter)
> + goto out;
> +
> + filter = strstrip(filter);
> + /* Disallow empty strings. */
> + if (filter[0] == 0)
> + goto out;
> +
> + rcu_read_lock();
> + orig_filters = get_seccomp_filters(current->seccomp.filters);
> + rcu_read_unlock();
> +
> + /* After the first call, compatibility mode is selected permanently. */
> + ret = -EACCES;
> + if (filters_compat_mismatch(orig_filters))
> + goto out;
> +
> + filters_needed = orig_filters ? orig_filters->count : 0;
> + id = seccomp_filter_id(orig_filters, syscall_nr);
> + if (seccomp_filter_deny(id)) {
> + /* Don't allow DENYs to be changed when in a seccomp mode */
> + ret = -EACCES;
> + if (current->seccomp.mode)
> + goto out;
> + filters_needed++;
> + }
> +
> + filters = seccomp_filters_new(filters_needed);
> + if (IS_ERR(filters)) {
> + ret = PTR_ERR(filters);
> + goto out;
> + }
> +
> + filters_set_compat(filters);
> + if (orig_filters) {
> + ret = seccomp_filters_copy(filters, orig_filters, -1);
> + if (ret)
> + goto out;
> + }
> +
> + if (seccomp_filter_deny(id))
> + ret = seccomp_add_filter(filters, syscall_nr, filter);
> + else
> + ret = seccomp_extend_filter(filters, syscall_nr, filter);
> + if (ret)
> + goto out;
> + get_seccomp_filters(filters); /* simplify the error paths */
> +
> + rcu_assign_pointer(current->seccomp.filters, filters);

Again, what prevents two copies of seccomp_set_filter() from running
concurrently?

> + synchronize_rcu();
> + put_seccomp_filters(orig_filters); /* for the task */
> +out:
> + put_seccomp_filters(orig_filters); /* for the get */
> + put_seccomp_filters(filters); /* for get or task, on err */
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(seccomp_set_filter);
> +
> +long prctl_set_seccomp_filter(unsigned long syscall_nr,
> + char __user *user_filter)
> +{
> + int nr;
> + long ret;
> + char *filter = NULL;
> +
> + ret = -EINVAL;
> + if (syscall_nr >= NR_syscalls)
> + goto out;
> +
> + ret = -EFAULT;
> + if (!user_filter)
> + goto out;
> +
> + filter = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
> + ret = -ENOMEM;
> + if (!filter)
> + goto out;
> +
> + ret = -EFAULT;
> + if (strncpy_from_user(filter, user_filter,
> + SECCOMP_MAX_FILTER_LENGTH - 1) < 0)
> + goto out;
> +
> + nr = (int) syscall_nr;
> + ret = seccomp_set_filter(nr, filter);
> +
> +out:
> + kfree(filter);
> + return ret;
> +}
> +
> +long prctl_clear_seccomp_filter(unsigned long syscall_nr)
> +{
> + int nr = -1;
> + long ret;
> +
> + ret = -EINVAL;
> + if (syscall_nr >= NR_syscalls)
> + goto out;
> +
> + nr = (int) syscall_nr;
> + ret = seccomp_clear_filter(nr);
> +
> +out:
> + return ret;
> +}
> +
> +long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst,
> + unsigned long available)
> +{
> + int ret, nr;
> + unsigned long copied;
> + char *buf = NULL;
> + ret = -EINVAL;
> + if (!available)
> + goto out;
> + /* Ignore extra buffer space. */
> + if (available > SECCOMP_MAX_FILTER_LENGTH)
> + available = SECCOMP_MAX_FILTER_LENGTH;
> +
> + ret = -EINVAL;
> + if (syscall_nr >= NR_syscalls)
> + goto out;
> + nr = (int) syscall_nr;
> +
> + ret = -ENOMEM;
> + buf = kmalloc(available, GFP_KERNEL);
> + if (!buf)
> + goto out;
> +
> + ret = seccomp_get_filter(nr, buf, available);
> + if (ret < 0)
> + goto out;
> +
> + /* Include the NUL byte in the copy. */
> + copied = copy_to_user(dst, buf, ret + 1);
> + ret = -ENOSPC;
> + if (copied)
> + goto out;
> + ret = 0;
> +out:
> + kfree(buf);
> + return ret;
> +}
> diff --git a/kernel/sys.c b/kernel/sys.c
> index af468ed..ed60d06 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1698,13 +1698,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> case PR_SET_ENDIAN:
> error = SET_ENDIAN(me, arg2);
> break;
> -
> case PR_GET_SECCOMP:
> error = prctl_get_seccomp();
> break;
> case PR_SET_SECCOMP:
> error = prctl_set_seccomp(arg2);
> break;
> + case PR_SET_SECCOMP_FILTER:
> + error = prctl_set_seccomp_filter(arg2,
> + (char __user *) arg3);
> + break;
> + case PR_CLEAR_SECCOMP_FILTER:
> + error = prctl_clear_seccomp_filter(arg2);
> + break;
> + case PR_GET_SECCOMP_FILTER:
> + error = prctl_get_seccomp_filter(arg2,
> + (char __user *) arg3,
> + arg4);
> + break;
> case PR_GET_TSC:
> error = GET_TSC_CTL(arg2);
> break;
> diff --git a/security/Kconfig b/security/Kconfig
> index 95accd4..c76adf2 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -2,6 +2,10 @@
> # Security configuration
> #
>
> +# Make seccomp filter Kconfig switch below available
> +config HAVE_SECCOMP_FILTER
> + bool
> +
> menu "Security options"
>
> config KEYS
> @@ -82,6 +86,19 @@ config SECURITY_DMESG_RESTRICT
>
> If you are unsure how to answer this question, answer N.
>
> +config SECCOMP_FILTER
> + bool "Enable seccomp-based system call filtering"
> + select SECCOMP
> + depends on HAVE_SECCOMP_FILTER && EXPERIMENTAL
> + help
> + This kernel feature expands CONFIG_SECCOMP to allow computing
> + in environments with reduced kernel access dictated by the
> + application itself through prctl calls. If
> + CONFIG_FTRACE_SYSCALLS is available, then system call
> + argument-based filtering predicates may be used.
> +
> + See Documentation/prctl/seccomp_filter.txt for more detail.
> +
> config SECURITY
> bool "Enable different security models"
> depends on SYSFS
> --
> 1.7.0.4
>

2011-06-02 18:14:58

[permalink] [raw]

Subject: Re: [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters

On Thu, Jun 2, 2011 at 12:36 PM, Paul E. McKenney
<[email protected]> wrote:
> On Tue, May 31, 2011 at 10:10:35PM -0500, Will Drewry wrote:
>> This change adds a new seccomp mode which specifies the allowed system
>> calls dynamically.  When in the new mode (2), all system calls are
>> checked against process-defined filters - first by system call number,
>> then by a filter string.  If an entry exists for a given system call and
>> all filter predicates evaluate to true, then the task may proceed.
>> Otherwise, the task is killed.
>
> A few questions below -- I can't say that I understand the RCU usage.
>
>                                                        Thanx, Paul
>
>> Filter string parsing and evaluation is handled by the ftrace filter
>> engine.  Related patches tweak the perf filter trace and free routines,
>> allowing the calls to be shared. Filters inherit their understanding of
>> types and arguments for each system call from the CONFIG_FTRACE_SYSCALLS
>> subsystem which already populates this information in syscall_metadata
>> associated enter_event (and exit_event) structures. If
>> CONFIG_FTRACE_SYSCALLS is not compiled in, only filter strings of "1"
>> will be allowed.
>>
>> The net result is a process may have its system calls filtered using the
>> ftrace filter engine's inherent understanding of system calls.  The set
>> of filters is specified through the PR_SET_SECCOMP_FILTER argument in
>> prctl(). For example, a filterset for a process, like pdftotext, that
>> should only process read-only input could (roughly) look like:
>>   sprintf(rdonly, "flags == %u", O_RDONLY|O_LARGEFILE);
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_open, rdonly);
>>   prctl(PR_SET_SECCOMP_FILTER, __NR__llseek, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_brk, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_close, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_exit_group, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_fstat64, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_mmap2, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_munmap, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_read, "1");
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "(fd == 1 || fd == 2)");
>>   prctl(PR_SET_SECCOMP, 2);
>>
>> Subsequent calls to PR_SET_SECCOMP_FILTER for the same system call will
>> be &&'d together to ensure that attack surface may only be reduced:
>>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");
>>
>> With the earlier example, the active filter becomes:
>>   "(fd == 1 || fd == 2) && fd != 2"
>>
>> The patch also adds PR_CLEAR_SECCOMP_FILTER and PR_GET_SECCOMP_FILTER.
>> The latter returns the current filter for a system call to userspace:
>>
>>   prctl(PR_GET_SECCOMP_FILTER, __NR_write, buf, bufsize);
>>
>> while the former clears any filters for a given system call, changing it
>> back to a default deny:
>>
>>   prctl(PR_CLEAR_SECCOMP_FILTER, __NR_write);
>>
>> v3: - always block execve calls (as per linus torvalds)
>>     - add __NR_seccomp_execve(_32) to seccomp-supporting arches
>>     - ensure compat tasks can't reach ftrace:syscalls
>>     - dropped new defines for seccomp modes.
>>     - two level array instead of hlists (sugg. by olof johansson)
>>     - added generic Kconfig entry that is not connected.
>>     - dropped internal seccomp.h
>>     - move prctl helpers to seccomp_filter
>>     - killed seccomp_t typedef (as per checkpatch)
>> v2: - changed to use the existing syscall number ABI.
>>     - prctl changes to minimize parsing in the kernel:
>>       prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
>>       prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
>>       prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
>>       prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
>>     - defined PR_SECCOMP_MODE_STRICT and ..._FILTER
>>     - added flags
>>     - provide a default fail syscall_nr_to_meta in ftrace
>>     - provides fallback for unhooked system calls
>>     - use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
>>     - added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
>>     - moved to a hlist and 4 bit hash of linked lists
>>     - added support to operate without CONFIG_FTRACE_SYSCALLS
>>     - moved Kconfig support next to SECCOMP
>>     - made Kconfig entries dependent on EXPERIMENTAL
>>     - added macros to avoid ifdefs from kernel/fork.c
>>     - added compat task/filter matching
>>     - drop seccomp.h inclusion in sched.h and drop seccomp_t
>>     - added Filtering to "show" output
>>     - added on_exec state dup'ing when enabling after a fast-path accept.
>>
>> Signed-off-by: Will Drewry <[email protected]>
>> ---
>>  include/linux/prctl.h   |    5 +
>>  include/linux/sched.h   |    2 +-
>>  include/linux/seccomp.h |   98 ++++++-
>>  include/trace/syscall.h |    7 +
>>  kernel/Makefile         |    3 +
>>  kernel/fork.c           |    3 +
>>  kernel/seccomp.c        |   38 ++-
>>  kernel/seccomp_filter.c |  784 +++++++++++++++++++++++++++++++++++++++++++++++
>>  kernel/sys.c            |   13 +-
>>  security/Kconfig        |   17 +
>>  10 files changed, 954 insertions(+), 16 deletions(-)
>>  create mode 100644 kernel/seccomp_filter.c
>>
>> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
>> index a3baeb2..44723ce 100644
>> --- a/include/linux/prctl.h
>> +++ b/include/linux/prctl.h
>> @@ -64,6 +64,11 @@
>>  #define PR_GET_SECCOMP       21
>>  #define PR_SET_SECCOMP       22
>>
>> +/* Get/set process seccomp filters */
>> +#define PR_GET_SECCOMP_FILTER        35
>> +#define PR_SET_SECCOMP_FILTER        36
>> +#define PR_CLEAR_SECCOMP_FILTER      37
>> +
>>  /* Get/set the capability bounding set (as per security/commoncap.c) */
>> ?#define PR_CAPBSET_READ 23
>> ?#define PR_CAPBSET_DROP 24
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 18d63ce..3f0bc8d 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1374,7 +1374,7 @@ struct task_struct {
>>       uid_t loginuid;
>>       unsigned int sessionid;
>>  #endif
>> -     seccomp_t seccomp;
>> +     struct seccomp_struct seccomp;
>>
>>  /* Thread group tracking */
>>       u32 parent_exec_id;
>> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
>> index 167c333..f4434ca 100644
>> --- a/include/linux/seccomp.h
>> +++ b/include/linux/seccomp.h
>> @@ -1,13 +1,33 @@
>> ?#ifndef _LINUX_SECCOMP_H
>> ?#define _LINUX_SECCOMP_H
>>
>> +struct seq_file;
>>
>> ?#ifdef CONFIG_SECCOMP
>>
>> +#include <linux/errno.h>
>> ?#include <linux/thread_info.h>
>> +#include <linux/types.h>
>> ?#include <asm/seccomp.h>
>>
>> -typedef struct { int mode; } seccomp_t;
>> +struct seccomp_filters;
>> +/**
>> + * struct seccomp_struct - the state of a seccomp'ed process
>> + *
>> + * @mode:
>> + *     if this is 1, the process is under standard seccomp rules
>> + *             is 2, the process is only allowed to make system calls where
>> + *                   associated filters evaluate successfully.
>> + * @filters: Metadata for filters if using CONFIG_SECCOMP_FILTER.
>> + *           filters assignment/use should be RCU-protected and its contents
>> + *           should never be modified when attached to a seccomp_struct.
>> + */
>> +struct seccomp_struct {
>> + ? ? uint16_t mode;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> + ? ? struct seccomp_filters *filters;
>> +#endif
>> +};
>>
>> ?extern void __secure_computing(int);
>> ?static inline void secure_computing(int this_syscall)
>> @@ -16,15 +36,14 @@ static inline void secure_computing(int this_syscall)
>> ? ? ? ? ? ? ? __secure_computing(this_syscall);
>> ?}
>>
>> -extern long prctl_get_seccomp(void);
>> ?extern long prctl_set_seccomp(unsigned long);
>> +extern long prctl_get_seccomp(void);
>>
>> ?#else /* CONFIG_SECCOMP */
>>
>> ?#include <linux/errno.h>
>>
>> -typedef struct { } seccomp_t;
>> -
>> +struct seccomp_struct { };
>> ?#define secure_computing(x) do { } while (0)
>>
>> ?static inline long prctl_get_seccomp(void)
>> @@ -32,11 +51,80 @@ static inline long prctl_get_seccomp(void)
>>       return -EINVAL;
>>  }
>>
>> -static inline long prctl_set_seccomp(unsigned long arg2)
>> +static inline long prctl_set_seccomp(unsigned long a2)
>>  {
>>       return -EINVAL;
>>  }
>>
>> ?#endif /* CONFIG_SECCOMP */
>>
>> +#ifdef CONFIG_SECCOMP_FILTER
>> +
>> +#define inherit_tsk_seccomp(_child, _orig) do { \
>> + ? ? _child->seccomp.mode = _orig->seccomp.mode; \
>> + ? ? _child->seccomp.filters = get_seccomp_filters(_orig->seccomp.filters); \
>> + ? ? } while (0)
>> +#define put_tsk_seccomp(_tsk) put_seccomp_filters(_tsk->seccomp.filters)
>> +
>> +extern int seccomp_show_filters(struct seccomp_filters *filters,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct seq_file *);
>> +extern long seccomp_set_filter(int, char *);
>> +extern long seccomp_clear_filter(int);
>> +extern long seccomp_get_filter(int, char *, unsigned long);
>> +
>> +extern long prctl_set_seccomp_filter(unsigned long, char __user *);
>> +extern long prctl_get_seccomp_filter(unsigned long, char __user *,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long);
>> +extern long prctl_clear_seccomp_filter(unsigned long);
>> +
>> +extern struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *);
>> +extern void put_seccomp_filters(struct seccomp_filters *);
>> +
>> +extern int seccomp_test_filters(int);
>> +extern void seccomp_filter_log_failure(int);
>> +
>> +#else ?/* CONFIG_SECCOMP_FILTER */
>> +
>> +struct seccomp_filters { };
>> +#define inherit_tsk_seccomp(_child, _orig) do { } while (0)
>> +#define put_tsk_seccomp(_tsk) do { } while (0)
>> +
>> +static inline int seccomp_show_filters(struct seccomp_filters *filters,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct seq_file *m)
>> +{
>> + ? ? return -ENOSYS;
>> +}
>> +
>> +static inline long seccomp_set_filter(int syscall_nr, char *filter)
>> +{
>> + ? ? return -ENOSYS;
>> +}
>> +
>> +static inline long seccomp_clear_filter(int syscall_nr)
>> +{
>> + ? ? return -ENOSYS;
>> +}
>> +
>> +static inline long seccomp_get_filter(int syscall_nr,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? char *buf, unsigned long available)
>> +{
>> + ? ? return -ENOSYS;
>> +}
>> +
>> +static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3)
>> +{
>> + ? ? return -ENOSYS;
>> +}
>> +
>> +static inline long prctl_clear_seccomp_filter(unsigned long a2)
>> +{
>> + ? ? return -ENOSYS;
>> +}
>> +
>> +static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long a4)
>> +{
>> + ? ? return -ENOSYS;
>> +}
>> +#endif ?/* CONFIG_SECCOMP_FILTER */
>> ?#endif /* _LINUX_SECCOMP_H */
>> diff --git a/include/trace/syscall.h b/include/trace/syscall.h
>> index 242ae04..e061ad0 100644
>> --- a/include/trace/syscall.h
>> +++ b/include/trace/syscall.h
>> @@ -35,6 +35,8 @@ struct syscall_metadata {
>> ?extern unsigned long arch_syscall_addr(int nr);
>> ?extern int init_syscall_trace(struct ftrace_event_call *call);
>>
>> +extern struct syscall_metadata *syscall_nr_to_meta(int);
>> +
>> ?extern int reg_event_syscall_enter(struct ftrace_event_call *call);
>> ?extern void unreg_event_syscall_enter(struct ftrace_event_call *call);
>> ?extern int reg_event_syscall_exit(struct ftrace_event_call *call);
>> @@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct trace_event *event);
>> ?enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct trace_event *event);
>> +#else
>> +static inline struct syscall_metadata *syscall_nr_to_meta(int nr)
>> +{
>> + ? ? return NULL;
>> +}
>> ?#endif
>>
>> ?#ifdef CONFIG_PERF_EVENTS
>> diff --git a/kernel/Makefile b/kernel/Makefile
>> index 85cbfb3..84e7dfb 100644
>> --- a/kernel/Makefile
>> +++ b/kernel/Makefile
>> @@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
>> ?obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
>> ?obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
>> ?obj-$(CONFIG_SECCOMP) += seccomp.o
>> +ifeq ($(CONFIG_SECCOMP_FILTER),y)
>> +obj-$(CONFIG_SECCOMP) += seccomp_filter.o
>> +endif
>> ?obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
>> ?obj-$(CONFIG_TREE_RCU) += rcutree.o
>> ?obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index e7548de..6f835e0 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -34,6 +34,7 @@
>> ?#include <linux/cgroup.h>
>> ?#include <linux/security.h>
>> ?#include <linux/hugetlb.h>
>> +#include <linux/seccomp.h>
>> ?#include <linux/swap.h>
>> ?#include <linux/syscalls.h>
>> ?#include <linux/jiffies.h>
>> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
>> ? ? ? free_thread_info(tsk->stack);
>> ? ? ? rt_mutex_debug_task_free(tsk);
>> ? ? ? ftrace_graph_exit_task(tsk);
>> + ? ? put_tsk_seccomp(tsk);
>> ? ? ? free_task_struct(tsk);
>> ?}
>> ?EXPORT_SYMBOL(free_task);
>> @@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
>> ? ? ? if (err)
>> ? ? ? ? ? ? ? goto out;
>>
>> + ? ? inherit_tsk_seccomp(tsk, orig);
>> ? ? ? setup_thread_stack(tsk, orig);
>> ? ? ? clear_user_return_notifier(tsk);
>> ? ? ? clear_tsk_need_resched(tsk);
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index 57d4b13..0a942be 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -2,16 +2,20 @@
>>   * linux/kernel/seccomp.c
>>   *
>>   * Copyright 2004-2005  Andrea Arcangeli <[email protected]>
>> + * Copyright (C) 2011 The Chromium OS Authors <[email protected]>
>>   *
>>   * This defines a simple but solid secure-computing mode.
>>   */
>>
>> ?#include <linux/seccomp.h>
>> ?#include <linux/sched.h>
>> +#include <linux/slab.h>
>> ?#include <linux/compat.h>
>> +#include <linux/unistd.h>
>> +#include <linux/ftrace_event.h>
>>
>> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
>> ?/* #define SECCOMP_DEBUG 1 */
>> -#define NR_SECCOMP_MODES 1
>>
>> ?/*
>> ? * Secure computing mode 1 allows only read/write/exit/sigreturn.
>> @@ -32,10 +36,9 @@ static int mode1_syscalls_32[] = {
>>
>> ?void __secure_computing(int this_syscall)
>> ?{
>> - ? ? int mode = current->seccomp.mode;
>> ? ? ? int * syscall;
>>
>> - ? ? switch (mode) {
>> + ? ? switch (current->seccomp.mode) {
>> ? ? ? case 1:
>> ? ? ? ? ? ? ? syscall = mode1_syscalls;
>> ?#ifdef CONFIG_COMPAT
>> @@ -47,6 +50,17 @@ void __secure_computing(int this_syscall)
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? return;
>> ? ? ? ? ? ? ? } while (*++syscall);
>> ? ? ? ? ? ? ? break;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> + ? ? case 2:
>> + ? ? ? ? ? ? if (this_syscall >= NR_syscalls || this_syscall < 0)
>> + ? ? ? ? ? ? ? ? ? ? break;
>> +
>> + ? ? ? ? ? ? if (!seccomp_test_filters(this_syscall))
>> + ? ? ? ? ? ? ? ? ? ? return;
>> +
>> + ? ? ? ? ? ? seccomp_filter_log_failure(this_syscall);
>> + ? ? ? ? ? ? break;
>> +#endif
>> ? ? ? default:
>> ? ? ? ? ? ? ? BUG();
>> ? ? ? }
>> @@ -71,16 +85,22 @@ long prctl_set_seccomp(unsigned long seccomp_mode)
>> ? ? ? if (unlikely(current->seccomp.mode))
>> ? ? ? ? ? ? ? goto out;
>>
>> - ? ? ret = -EINVAL;
>> - ? ? if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
>> - ? ? ? ? ? ? current->seccomp.mode = seccomp_mode;
>> - ? ? ? ? ? ? set_thread_flag(TIF_SECCOMP);
>> + ? ? ret = 0;
>> + ? ? switch (seccomp_mode) {
>> + ? ? case 1:
>> ?#ifdef TIF_NOTSC
>> ? ? ? ? ? ? ? disable_TSC();
>> ?#endif
>> - ? ? ? ? ? ? ret = 0;
>> +#ifdef CONFIG_SECCOMP_FILTER
>> + ? ? case 2:
>> +#endif
>> + ? ? ? ? ? ? current->seccomp.mode = seccomp_mode;
>> + ? ? ? ? ? ? set_thread_flag(TIF_SECCOMP);
>> + ? ? ? ? ? ? break;
>> + ? ? default:
>> + ? ? ? ? ? ? ret = -EINVAL;
>> ? ? ? }
>>
>> - out:
>> +out:
>> ? ? ? return ret;
>> ?}
>> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
>> new file mode 100644
>> index 0000000..9782f25
>> --- /dev/null
>> +++ b/kernel/seccomp_filter.c
>> @@ -0,0 +1,784 @@
>> +/* filter engine-based seccomp system call filtering
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, write to the Free Software
>> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
>> + *
>> + * Copyright (C) 2011 The Chromium OS Authors <[email protected]>
>> + */
>> +
>> +#include <linux/compat.h>
>> +#include <linux/err.h>
>> +#include <linux/errno.h>
>> +#include <linux/ftrace_event.h>
>> +#include <linux/seccomp.h>
>> +#include <linux/seq_file.h>
>> +#include <linux/sched.h>
>> +#include <linux/slab.h>
>> +#include <linux/uaccess.h>
>> +
>> +#include <asm/syscall.h>
>> +#include <trace/syscall.h>
>> +
>> +
>> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
>> +
>> +#define SECCOMP_FILTER_ALLOW "1"
>> +#define SECCOMP_ACTION_DENY 0xffff
>> +#define SECCOMP_ACTION_ALLOW 0xfffe
>> +
>> +/**
>> + * struct seccomp_filters - container for seccomp filterset
>> + *
>> + * @syscalls: array of 16-bit indices into @event_filters by syscall_nr
>> + * ? ? ? ? ? ?May also be SECCOMP_ACTION_DENY or SECCOMP_ACTION_ALLOW
>> + * @event_filters: array of pointers to ftrace event objects
>> + * @count: size of @event_filters
>> + * @flags: anonymous struct to wrap filters-specific flags
>> + * @usage: reference count to simplify use.
>> + */
>> +struct seccomp_filters {
>> + ? ? uint16_t syscalls[NR_syscalls];
>> + ? ? struct event_filter **event_filters;
>> + ? ? uint16_t count;
>> + ? ? struct {
>> + ? ? ? ? ? ? uint32_t compat:1,
>> + ? ? ? ? ? ? ? ? ? ? ?__reserved:31;
>> + ? ? } flags;
>> + ? ? atomic_t usage;
>> +};
>> +
>> +/* Handle ftrace symbol non-existence */
>> +#ifdef CONFIG_FTRACE_SYSCALLS
>> +#define create_event_filter(_ef_pptr, _event_type, _str) \
>> + ? ? ftrace_parse_filter(_ef_pptr, _event_type, _str)
>> +#define get_filter_string(_ef) ftrace_get_filter_string(_ef)
>> +#define free_event_filter(_f) ftrace_free_filter(_f)
>> +
>> +#else
>> +
>> +#define create_event_filter(_ef_pptr, _event_type, _str) (-ENOSYS)
>> +#define get_filter_string(_ef) (NULL)
>> +#define free_event_filter(_f) do { } while (0)
>> +#endif
>> +
>> +/**
>> + * seccomp_filters_new - allocates a new filters object
>> + * @count: count to allocate for the event_filters array
>> + *
>> + * Returns ERR_PTR on error or an allocated object.
>> + */
>> +static struct seccomp_filters *seccomp_filters_new(uint16_t count)
>> +{
>> + ? ? struct seccomp_filters *f;
>> +
>> + ? ? if (count >= SECCOMP_ACTION_ALLOW)
>> + ? ? ? ? ? ? return ERR_PTR(-EINVAL);
>> +
>> + ? ? f = kzalloc(sizeof(struct seccomp_filters), GFP_KERNEL);
>> + ? ? if (!f)
>> + ? ? ? ? ? ? return ERR_PTR(-ENOMEM);
>> +
>> + ? ? /* Lazy SECCOMP_ACTION_DENY assignment. */
>> + ? ? memset(f->syscalls, 0xff, sizeof(f->syscalls));
>> + ? ? atomic_set(&f->usage, 1);
>> +
>> + ? ? f->event_filters = NULL;
>> + ? ? f->count = count;
>> + ? ? if (!count)
>> + ? ? ? ? ? ? return f;
>> +
>> + ? ? f->event_filters = kzalloc(count * sizeof(struct event_filter *),
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?GFP_KERNEL);
>> + ? ? if (!f->event_filters) {
>> + ? ? ? ? ? ? kfree(f);
>> + ? ? ? ? ? ? f = ERR_PTR(-ENOMEM);
>> + ? ? }
>> + ? ? return f;
>> +}
>> +
>> +/**
>> + * seccomp_filters_free - cleans up the filter list and frees the table
>> + * @filters: NULL or live object to be completely destructed.
>> + */
>> +static void seccomp_filters_free(struct seccomp_filters *filters)
>> +{
>> + ? ? uint16_t count = 0;
>> + ? ? if (!filters)
>> + ? ? ? ? ? ? return;
>> + ? ? while (count < filters->count) {
>> + ? ? ? ? ? ? struct event_filter *f = filters->event_filters[count];
>> + ? ? ? ? ? ? free_event_filter(f);
>> + ? ? ? ? ? ? count++;
>> + ? ? }
>> + ? ? kfree(filters->event_filters);
>> + ? ? kfree(filters);
>> +}
>> +
>> +static void __put_seccomp_filters(struct seccomp_filters *orig)
>> +{
>> + ? ? WARN_ON(atomic_read(&orig->usage));
>> + ? ? seccomp_filters_free(orig);
>> +}
>> +
>> +#define seccomp_filter_allow(_id) ((_id) == SECCOMP_ACTION_ALLOW)
>> +#define seccomp_filter_deny(_id) ((_id) == SECCOMP_ACTION_DENY)
>> +#define seccomp_filter_dynamic(_id) \
>> + ? ? (!seccomp_filter_allow(_id) && !seccomp_filter_deny(_id))
>> +static inline uint16_t seccomp_filter_id(const struct seccomp_filters *f,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?int syscall_nr)
>> +{
>> + ? ? if (!f)
>> + ? ? ? ? ? ? return SECCOMP_ACTION_DENY;
>> + ? ? return f->syscalls[syscall_nr];
>> +}
>> +
>> +static inline struct event_filter *seccomp_dynamic_filter(
>> + ? ? ? ? ? ? const struct seccomp_filters *filters, uint16_t id)
>> +{
>> + ? ? if (!seccomp_filter_dynamic(id))
>> + ? ? ? ? ? ? return NULL;
>> + ? ? return filters->event_filters[id];
>> +}
>> +
>> +static inline void set_seccomp_filter_id(struct seccomp_filters *filters,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?int syscall_nr, uint16_t id)
>> +{
>> + ? ? filters->syscalls[syscall_nr] = id;
>> +}
>> +
>> +static inline void set_seccomp_filter(struct seccomp_filters *filters,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? int syscall_nr, uint16_t id,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct event_filter *dynamic_filter)
>> +{
>> + ? ? filters->syscalls[syscall_nr] = id;
>> + ? ? if (seccomp_filter_dynamic(id))
>> + ? ? ? ? ? ? filters->event_filters[id] = dynamic_filter;
>> +}
>> +
>> +static struct event_filter *alloc_event_filter(int syscall_nr,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?const char *filter_string)
>> +{
>> + ? ? struct syscall_metadata *data;
>> + ? ? struct event_filter *filter = NULL;
>> + ? ? int err;
>> +
>> + ? ? data = syscall_nr_to_meta(syscall_nr);
>> + ? ? /* Argument-based filtering only works on ftrace-hooked syscalls. */
>> + ? ? err = -ENOSYS;
>> + ? ? if (!data)
>> + ? ? ? ? ? ? goto fail;
>> + ? ? err = create_event_filter(&filter,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? data->enter_event->event.type,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? filter_string);
>> + ? ? if (err)
>> + ? ? ? ? ? ? goto fail;
>> +
>> + ? ? return filter;
>> +fail:
>> + ? ? kfree(filter);
>> + ? ? return ERR_PTR(err);
>> +}
>> +
>> +/**
>> + * seccomp_filters_copy - copies filters from src to dst.
>> + *
>> + * @dst: seccomp_filters to populate.
>> + * @src: table to read from.
>> + * @skip: specifies an entry, by system call, to skip.
>> + *
>> + * Returns non-zero on failure.
>> + * Both the source and the destination should have no simultaneous
>> + * writers, and dst should be exclusive to the caller.
>> + * If @skip is < 0, it is ignored.
>> + */
>> +static int seccomp_filters_copy(struct seccomp_filters *dst,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? const struct seccomp_filters *src,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? int skip)
>> +{
>> + ? ? int id = 0, ret = 0, nr;
>> + ? ? memcpy(&dst->flags, &src->flags, sizeof(src->flags));
>> + ? ? memcpy(dst->syscalls, src->syscalls, sizeof(dst->syscalls));
>> + ? ? if (!src->count)
>> + ? ? ? ? ? ? goto done;
>> + ? ? for (nr = 0; nr < NR_syscalls; ++nr) {
>> + ? ? ? ? ? ? struct event_filter *filter;
>> + ? ? ? ? ? ? const char *str;
>> + ? ? ? ? ? ? uint16_t src_id = seccomp_filter_id(src, nr);
>> + ? ? ? ? ? ? if (nr == skip) {
>> + ? ? ? ? ? ? ? ? ? ? set_seccomp_filter(dst, nr, SECCOMP_ACTION_DENY,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?NULL);
>> + ? ? ? ? ? ? ? ? ? ? continue;
>> + ? ? ? ? ? ? }
>> + ? ? ? ? ? ? if (!seccomp_filter_dynamic(src_id))
>> + ? ? ? ? ? ? ? ? ? ? continue;
>> + ? ? ? ? ? ? if (id >= dst->count) {
>> + ? ? ? ? ? ? ? ? ? ? ret = -EINVAL;
>> + ? ? ? ? ? ? ? ? ? ? goto done;
>> + ? ? ? ? ? ? }
>> + ? ? ? ? ? ? str = get_filter_string(seccomp_dynamic_filter(src, src_id));
>> + ? ? ? ? ? ? filter = alloc_event_filter(nr, str);
>> + ? ? ? ? ? ? if (IS_ERR(filter)) {
>> + ? ? ? ? ? ? ? ? ? ? ret = PTR_ERR(filter);
>> + ? ? ? ? ? ? ? ? ? ? goto done;
>> + ? ? ? ? ? ? }
>> + ? ? ? ? ? ? set_seccomp_filter(dst, nr, id, filter);
>> + ? ? ? ? ? ? id++;
>> + ? ? }
>> +
>> +done:
>> + ? ? return ret;
>> +}
>> +
>> +/**
>> + * seccomp_extend_filter - appends more text to a syscall_nr's filter
>> + * @filters: unattached filter object to operate on
>> + * @syscall_nr: syscall number to update filters for
>> + * @filter_string: string to append to the existing filter
>> + *
>> + * The new string will be &&'d to the original filter string to ensure that it
>> + * always matches the existing predicates or less:
>> + *   (old_filter) && @filter_string
>> + * Returns 0 on success and a negative errno on failure.
>> + */
>> +static int seccomp_extend_filter(struct seccomp_filters *filters,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?int syscall_nr, char *filter_string)
>> +{
>> + ? ? struct event_filter *filter;
>> + ? ? uint16_t id = seccomp_filter_id(filters, syscall_nr);
>> + ? ? char *merged = NULL;
>> + ? ? int ret = -EINVAL, expected;
>> +
>> + ? ? /* No extending with a "1". */
>> + ? ? if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string))
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? filter = seccomp_dynamic_filter(filters, id);
>> + ? ? ret = -ENOENT;
>> + ? ? if (!filter)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? merged = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
>> + ? ? ret = -ENOMEM;
>> + ? ? if (!merged)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? expected = snprintf(merged, SECCOMP_MAX_FILTER_LENGTH, "(%s) && %s",
>> + ? ? ? ? ? ? ? ? ? ? ? ? get_filter_string(filter), filter_string);
>> + ? ? ret = -E2BIG;
>> + ? ? if (expected >= SECCOMP_MAX_FILTER_LENGTH || expected < 0)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? /* Free the old filter */
>> + ? ? free_event_filter(filter);
>> + ? ? set_seccomp_filter(filters, syscall_nr, id, NULL);
>> +
>> + ? ? /* Replace it */
>> + ? ? filter = alloc_event_filter(syscall_nr, merged);
>> + ? ? if (IS_ERR(filter)) {
>> + ? ? ? ? ? ? ret = PTR_ERR(filter);
>> + ? ? ? ? ? ? goto out;
>> + ? ? }
>> + ? ? set_seccomp_filter(filters, syscall_nr, id, filter);
>> + ? ? ret = 0;
>> +
>> +out:
>> + ? ? kfree(merged);
>> + ? ? return ret;
>> +}
>> +
>> +/**
>> + * seccomp_add_filter - adds a filter for an unfiltered syscall
>> + * @filters: filters object to add a filter/action to
>> + * @syscall_nr: system call number to add a filter for
>> + * @filter_string: the filter string to apply
>> + *
>> + * Returns 0 on success and non-zero otherwise.
>> + */
>> +static int seccomp_add_filter(struct seccomp_filters *filters, int syscall_nr,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? char *filter_string)
>> +{
>> + ? ? struct event_filter *filter;
>> + ? ? int ret = 0;
>> +
>> + ? ? if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string)) {
>> + ? ? ? ? ? ? set_seccomp_filter(filters, syscall_nr,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?SECCOMP_ACTION_ALLOW, NULL);
>> + ? ? ? ? ? ? goto out;
>> + ? ? }
>> +
>> + ? ? filter = alloc_event_filter(syscall_nr, filter_string);
>> + ? ? if (IS_ERR(filter)) {
>> + ? ? ? ? ? ? ret = PTR_ERR(filter);
>> + ? ? ? ? ? ? goto out;
>> + ? ? }
>> +     /* Always add to the last slot available since additions are
>> +      * only done one at a time.
>> +      */
>> + ? ? set_seccomp_filter(filters, syscall_nr, filters->count - 1, filter);
>> +out:
>> + ? ? return ret;
>> +}
>> +
>> +/* Wrap optional ftrace syscall support. Returns 1 on match or 0 otherwise. */
>> +static int filter_match_current(struct event_filter *event_filter)
>> +{
>> + ? ? int err = 0;
>> +#ifdef CONFIG_FTRACE_SYSCALLS
>> + ? ? uint8_t syscall_state[64];
>> +
>> + ? ? memset(syscall_state, 0, sizeof(syscall_state));
>> +
>> + ? ? /* The generic tracing entry can remain zeroed. */
>> + ? ? err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?NULL);
>> + ? ? if (err)
>> + ? ? ? ? ? ? return 0;
>> +
>> + ? ? err = filter_match_preds(event_filter, syscall_state);
>> +#endif
>> + ? ? return err;
>> +}
>> +
>> +static const char *syscall_nr_to_name(int syscall)
>> +{
>> + ? ? const char *syscall_name = "unknown";
>> + ? ? struct syscall_metadata *data = syscall_nr_to_meta(syscall);
>> + ? ? if (data)
>> + ? ? ? ? ? ? syscall_name = data->name;
>> + ? ? return syscall_name;
>> +}
>> +
>> +static void filters_set_compat(struct seccomp_filters *filters)
>> +{
>> +#ifdef CONFIG_COMPAT
>> + ? ? if (is_compat_task())
>> + ? ? ? ? ? ? filters->flags.compat = 1;
>> +#endif
>> +}
>> +
>> +static inline int filters_compat_mismatch(struct seccomp_filters *filters)
>> +{
>> +     int ret = 0;
>> +     if (!filters)
>> +             return 0;
>> +#ifdef CONFIG_COMPAT
>> +     if (!!(is_compat_task()) != filters->flags.compat)
>> +             ret = 1;
>> +#endif
>> +     return ret;
>> +}
>> +
>> +static inline int syscall_is_execve(int syscall)
>> +{
>> + ? ? int nr = __NR_execve;
>> +#ifdef CONFIG_COMPAT
>> + ? ? if (is_compat_task())
>> + ? ? ? ? ? ? nr = __NR_seccomp_execve_32;
>> +#endif
>> + ? ? return syscall == nr;
>> +}
>> +
>> +#ifndef KSTK_EIP
>> +#define KSTK_EIP(x) 0L
>> +#endif
>> +
>> +void seccomp_filter_log_failure(int syscall)
>> +{
>> + ? ? pr_info("%s[%d]: system call %d (%s) blocked at 0x%lx\n",
>> + ? ? ? ? ? ? current->comm, task_pid_nr(current), syscall,
>> + ? ? ? ? ? ? syscall_nr_to_name(syscall), KSTK_EIP(current));
>> +}
>> +
>> +/* put_seccomp_state - decrements the reference count of @orig and may free. */
>> +void put_seccomp_filters(struct seccomp_filters *orig)
>> +{
>> + ? ? if (!orig)
>> + ? ? ? ? ? ? return;
>> +
>> + ? ? if (atomic_dec_and_test(&orig->usage))
>> + ? ? ? ? ? ? __put_seccomp_filters(orig);
>> +}
>> +
>> +/* get_seccomp_state - increments the reference count of @orig */
>> +struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *orig)
>
> Nit: the name does not match the comment.

Will fix it here and above. Thanks!

>> +{
>> +     if (!orig)
>> +             return NULL;
>> +     atomic_inc(&orig->usage);
>> +     return orig;
>
> This is called in an RCU read-side critical section.  What exactly is
> RCU protecting?  I would expect an rcu_dereference() or one of the
> RCU list-traversal primitives somewhere, either here or at the caller.

Ah, I spaced on rcu_dereference(). The goal was to make the
assignment and replacement of the seccomp_filters pointer
RCU-protected (in seccomp_state) so there's no concern over it being
partially replaced on platforms where pointer assignments are
non-atomic, for example via /proc/<pid>/seccomp_filters access or a
call via the exported symbols. Object lifetime is managed by reference
counting so that I don't have to worry about extending the RCU
read-side critical section by much or deal with pre-allocations.

I'll add rcu_dereference() to all the get_seccomp_filters() uses where
it makes sense, so that it is called safely. Just to make sure, does
it make sense to continue to RCU-protect that specific pointer?
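
For readers following the lifetime argument, here is a minimal
userspace sketch of the scheme described above: a reader loads the
shared pointer once and pins the object with a reference, and a writer
publishes a fully built replacement. It is an illustration only, not
code from the patch. struct filters, filters_new(), filters_get(),
filters_put(), slot_read() and slot_publish() are hypothetical names,
and C11 atomics stand in for rcu_dereference()/rcu_assign_pointer().
The sketch does not model the RCU grace period, so it is only safe when
reader and writer do not run concurrently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct seccomp_filters. */
struct filters {
	atomic_int usage;	/* reference count */
	int count;
};

static int freed;	/* counts frees so the behaviour can be checked */

/* Allocate an object holding one reference (the slot's reference). */
static struct filters *filters_new(int count)
{
	struct filters *f = malloc(sizeof(*f));

	if (!f)
		return NULL;
	atomic_init(&f->usage, 1);
	f->count = count;
	return f;
}

/* Take a reference; NULL is passed through untouched. */
static struct filters *filters_get(struct filters *f)
{
	if (f)
		atomic_fetch_add(&f->usage, 1);
	return f;
}

/* Drop a reference and free the object when the last one goes away. */
static void filters_put(struct filters *f)
{
	if (f && atomic_fetch_sub(&f->usage, 1) == 1) {
		free(f);
		freed++;
	}
}

/* The shared slot, standing in for current->seccomp.filters. */
static _Atomic(struct filters *) slot;

/* Reader: load the pointer once, then pin the object. */
static struct filters *slot_read(void)
{
	return filters_get(atomic_load_explicit(&slot, memory_order_acquire));
}

/* Writer: publish a fully built object and hand back the old pointer. */
static struct filters *slot_publish(struct filters *f)
{
	return atomic_exchange_explicit(&slot, f, memory_order_acq_rel);
}
```

A writer that swaps in a new object drops the slot's reference to the
old one with filters_put(). In the patch, the synchronize_rcu() call
that precedes that put is what would keep a reader from taking a
reference on an object that is already being freed.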

>> +}
>> +
>> +/**
>> + * seccomp_test_filters - tests 'current' against the given syscall
>> + * @syscall: number of the system call to test
>> + *
>> + * Returns 0 on ok and non-zero on error/failure.
>> + */
>> +int seccomp_test_filters(int syscall)
>> +{
>> + ? ? uint16_t id;
>> + ? ? struct event_filter *filter;
>> + ? ? struct seccomp_filters *filters;
>> + ? ? int ret = -EACCES;
>> +
>> + ? ? rcu_read_lock();
>> + ? ? filters = get_seccomp_filters(current->seccomp.filters);
>> + ? ? rcu_read_unlock();
>> +
>> + ? ? if (!filters)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? if (filters_compat_mismatch(filters)) {
>> + ? ? ? ? ? ? pr_info("%s[%d]: seccomp_filter compat() mismatch.\n",
>> + ? ? ? ? ? ? ? ? ? ? current->comm, task_pid_nr(current));
>> + ? ? ? ? ? ? goto out;
>> + ? ? }
>> +
>> + ? ? /* execve is never allowed. */
>> + ? ? if (syscall_is_execve(syscall))
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? ret = 0;
>> + ? ? id = seccomp_filter_id(filters, syscall);
>> + ? ? if (seccomp_filter_allow(id))
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? ret = -EACCES;
>> + ? ? if (!seccomp_filter_dynamic(id))
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? filter = seccomp_dynamic_filter(filters, id);
>> + ? ? if (filter && filter_match_current(filter))
>> + ? ? ? ? ? ? ret = 0;
>> +out:
>> + ? ? put_seccomp_filters(filters);
>> + ? ? return ret;
>> +}
>> +
>> +/**
>> + * seccomp_show_filters - prints the current filter state to a seq_file
>> + * @filters: properly get()'d filters object
>> + * @m: the prepared seq_file to receive the data
>> + *
>> + * Returns 0 on a successful write.
>> + */
>> +int seccomp_show_filters(struct seccomp_filters *filters, struct seq_file *m)
>> +{
>> + ? ? int syscall;
>> + ? ? seq_printf(m, "Mode: %d\n", current->seccomp.mode);
>> + ? ? if (!filters)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? for (syscall = 0; syscall < NR_syscalls; ++syscall) {
>> + ? ? ? ? ? ? uint16_t id = seccomp_filter_id(filters, syscall);
>> + ? ? ? ? ? ? const char *filter_string = SECCOMP_FILTER_ALLOW;
>> + ? ? ? ? ? ? if (seccomp_filter_deny(id))
>> + ? ? ? ? ? ? ? ? ? ? continue;
>> + ? ? ? ? ? ? seq_printf(m, "%d (%s): ",
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? syscall,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? syscall_nr_to_name(syscall));
>> + ? ? ? ? ? ? if (seccomp_filter_dynamic(id))
>> + ? ? ? ? ? ? ? ? ? ? filter_string = get_filter_string(
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? seccomp_dynamic_filter(filters, id));
>> + ? ? ? ? ? ? seq_printf(m, "%s\n", filter_string);
>> + ? ? }
>> +out:
>> + ? ? return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(seccomp_show_filters);
>> +
>> +/**
>> + * seccomp_get_filter - copies the filter_string into "buf"
>> + * @syscall_nr: system call number to look up
>> + * @buf: destination buffer
>> + * @bufsize: available space in the buffer.
>> + *
>> + * Context: User context only. This function may sleep on allocation and
>> + *          operates on current. current must be attempting a system call
>> + *          when this is called.
>> + *
>> + * Looks up the filter for the given system call number on current.  If found,
>> + * the length of the NUL-terminated filter string is returned, not counting
>> + * the NUL byte. A negative errno is returned on error.
>> + */
>> +long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize)
>> +{
>> + ? ? struct seccomp_filters *filters;
>> + ? ? struct event_filter *filter;
>> + ? ? long ret = -EINVAL;
>> + ? ? uint16_t id;
>> +
>> + ? ? if (bufsize > SECCOMP_MAX_FILTER_LENGTH)
>> + ? ? ? ? ? ? bufsize = SECCOMP_MAX_FILTER_LENGTH;
>> +
>> + ? ? rcu_read_lock();
>> + ? ? filters = get_seccomp_filters(current->seccomp.filters);
>> + ? ? rcu_read_unlock();
>> +
>> + ? ? if (!filters)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? ret = -ENOENT;
>> + ? ? id = seccomp_filter_id(filters, syscall_nr);
>> + ? ? if (seccomp_filter_deny(id))
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? if (seccomp_filter_allow(id)) {
>> + ? ? ? ? ? ? ret = strlcpy(buf, SECCOMP_FILTER_ALLOW, bufsize);
>> + ? ? ? ? ? ? goto copied;
>> + ? ? }
>> +
>> + ? ? filter = seccomp_dynamic_filter(filters, id);
>> + ? ? if (!filter)
>> + ? ? ? ? ? ? goto out;
>> + ? ? ret = strlcpy(buf, get_filter_string(filter), bufsize);
>> +
>> +copied:
>> + ? ? if (ret >= bufsize) {
>> + ? ? ? ? ? ? ret = -ENOSPC;
>> + ? ? ? ? ? ? goto out;
>> + ? ? }
>> + ? ? /* Zero out any remaining buffer, just in case. */
>> + ? ? memset(buf + ret, 0, bufsize - ret);
>> +out:
>> + ? ? put_seccomp_filters(filters);
>> + ? ? return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(seccomp_get_filter);
>> +
>> +/**
>> + * seccomp_clear_filter - clears the seccomp filter for a syscall.
>> + * @syscall_nr: the system call number to clear filters for.
>> + *
>> + * Context: User context only. This function may sleep on allocation and
>> + *          operates on current. current must be attempting a system call
>> + *          when this is called.
>> + *
>> + * Returns 0 on success.
>> + */
>> +long seccomp_clear_filter(int syscall_nr)
>> +{
>> + ? ? struct seccomp_filters *filters = NULL, *orig_filters;
>> + ? ? uint16_t id;
>> + ? ? int ret = -EINVAL;
>> +
>> + ? ? rcu_read_lock();
>> + ? ? orig_filters = get_seccomp_filters(current->seccomp.filters);
>> + ? ? rcu_read_unlock();
>> +
>> + ? ? if (!orig_filters)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? if (filters_compat_mismatch(orig_filters))
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? id = seccomp_filter_id(orig_filters, syscall_nr);
>> + ? ? if (seccomp_filter_deny(id))
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? /* Create a new filters object for the task */
>> + ? ? if (seccomp_filter_dynamic(id))
>> + ? ? ? ? ? ? filters = seccomp_filters_new(orig_filters->count - 1);
>> + ? ? else
>> + ? ? ? ? ? ? filters = seccomp_filters_new(orig_filters->count);
>> +
>> + ? ? if (IS_ERR(filters)) {
>> + ? ? ? ? ? ? ret = PTR_ERR(filters);
>> + ? ? ? ? ? ? goto out;
>> + ? ? }
>> +
>> + ? ? /* Copy, but drop the requested entry. */
>> + ? ? ret = seccomp_filters_copy(filters, orig_filters, syscall_nr);
>> + ? ? if (ret)
>> + ? ? ? ? ? ? goto out;
>> + ? ? get_seccomp_filters(filters); ?/* simplify the out: path */
>> +
>> + ? ? rcu_assign_pointer(current->seccomp.filters, filters);
>
> What prevents two copies of seccomp_clear_filter() from running
> concurrently?

Nothing - the last one wins assignment, but the objects themselves
should be internally consistent across the parallel calls. If that's a
concern, a per-task writer mutex could be used just to ensure that
simultaneous calls to clear and set are performed serially. Would
that make more sense?
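
To make the writer-mutex idea concrete, here is a small userspace
sketch, assuming a copy-then-publish update like the one in the patch.
It is not code from the patch: struct table, active, writer_lock and
table_update() are hypothetical names, and a pthread mutex stands in
for whatever per-task lock would be chosen. Readers and the RCU grace
period are not modeled.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NR_SLOTS 8	/* hypothetical table size */

/* Hypothetical snapshot; never modified once it has been published. */
struct table {
	unsigned char allow[NR_SLOTS];
};

static struct table *active;	/* stands in for current->seccomp.filters */
static pthread_mutex_t writer_lock = PTHREAD_MUTEX_INITIALIZER;

/* Copy the active table, change one entry, and publish the copy. */
static int table_update(int nr, unsigned char value)
{
	struct table *next, *old;

	if (nr < 0 || nr >= NR_SLOTS)
		return -1;

	next = malloc(sizeof(*next));
	if (!next)
		return -1;

	/* The copy and the publish happen under one lock, so a second
	 * writer always starts from the first writer's result. */
	pthread_mutex_lock(&writer_lock);
	old = active;
	if (old)
		memcpy(next, old, sizeof(*next));
	else
		memset(next, 0, sizeof(*next));
	next->allow[nr] = value;
	active = next;
	pthread_mutex_unlock(&writer_lock);

	/* Kernel code would wait for a grace period and drop a reference. */
	free(old);
	return 0;
}
```

Without the lock, two writers could both copy the same original and the
later assignment would discard the earlier writer's change, which is
the "last one wins" behaviour described above.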

>> +     synchronize_rcu();
>> +     put_seccomp_filters(orig_filters);  /* for the task */
>> +out:
>> +     put_seccomp_filters(orig_filters);  /* for the get */
>> +     put_seccomp_filters(filters);  /* for the extra get */
>> +     return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(seccomp_clear_filter);
>> +
>> +/**
>> + * seccomp_set_filter - adds/extends a seccomp filter for a syscall.
>> + * @syscall_nr: system call number to apply the filter to.
>> + * @filter: ftrace filter string to apply.
>> + *
>> + * Context: User context only. This function may sleep on allocation and
>> + *          operates on current. current must be attempting a system call
>> + *          when this is called.
>> + *
>> + * New filters may be added for system calls when the current task is
>> + * not in a secure computing mode (seccomp).  Otherwise, only existing
>> + * filters may be extended.
>> + *
>> + * Returns 0 on success or an errno on failure.
>> + */
>> +long seccomp_set_filter(int syscall_nr, char *filter)
>> +{
>> + ? ? struct seccomp_filters *filters = NULL, *orig_filters = NULL;
>> + ? ? uint16_t id;
>> + ? ? long ret = -EINVAL;
>> + ? ? uint16_t filters_needed;
>> +
>> + ? ? if (!filter)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? filter = strstrip(filter);
>> + ? ? /* Disallow empty strings. */
>> + ? ? if (filter[0] == 0)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? rcu_read_lock();
>> + ? ? orig_filters = get_seccomp_filters(current->seccomp.filters);
>> + ? ? rcu_read_unlock();
>> +
>> + ? ? /* After the first call, compatibility mode is selected permanently. */
>> + ? ? ret = -EACCES;
>> + ? ? if (filters_compat_mismatch(orig_filters))
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? filters_needed = orig_filters ? orig_filters->count : 0;
>> + ? ? id = seccomp_filter_id(orig_filters, syscall_nr);
>> + ? ? if (seccomp_filter_deny(id)) {
>> + ? ? ? ? ? ? /* Don't allow DENYs to be changed when in a seccomp mode */
>> + ? ? ? ? ? ? ret = -EACCES;
>> + ? ? ? ? ? ? if (current->seccomp.mode)
>> + ? ? ? ? ? ? ? ? ? ? goto out;
>> + ? ? ? ? ? ? filters_needed++;
>> + ? ? }
>> +
>> + ? ? filters = seccomp_filters_new(filters_needed);
>> + ? ? if (IS_ERR(filters)) {
>> + ? ? ? ? ? ? ret = PTR_ERR(filters);
>> + ? ? ? ? ? ? goto out;
>> + ? ? }
>> +
>> + ? ? filters_set_compat(filters);
>> + ? ? if (orig_filters) {
>> + ? ? ? ? ? ? ret = seccomp_filters_copy(filters, orig_filters, -1);
>> + ? ? ? ? ? ? if (ret)
>> + ? ? ? ? ? ? ? ? ? ? goto out;
>> + ? ? }
>> +
>> + ? ? if (seccomp_filter_deny(id))
>> + ? ? ? ? ? ? ret = seccomp_add_filter(filters, syscall_nr, filter);
>> + ? ? else
>> + ? ? ? ? ? ? ret = seccomp_extend_filter(filters, syscall_nr, filter);
>> + ? ? if (ret)
>> + ? ? ? ? ? ? goto out;
>> + ? ? get_seccomp_filters(filters); ?/* simplify the error paths */
>> +
>> + ? ? rcu_assign_pointer(current->seccomp.filters, filters);
>
> Again, what prevents two copies of seccomp_set_filter() from running
> concurrently?

Same deal - nothing, but I'd be happy to add a guard if it makes sense.

Thanks!

>> +     synchronize_rcu();
>> +     put_seccomp_filters(orig_filters);  /* for the task */
>> +out:
>> +     put_seccomp_filters(orig_filters);  /* for the get */
>> +     put_seccomp_filters(filters);  /* for get or task, on err */
>> +     return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(seccomp_set_filter);
>> +
>> +long prctl_set_seccomp_filter(unsigned long syscall_nr,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? char __user *user_filter)
>> +{
>> + ? ? int nr;
>> + ? ? long ret;
>> + ? ? char *filter = NULL;
>> +
>> + ? ? ret = -EINVAL;
>> + ? ? if (syscall_nr >= NR_syscalls)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? ret = -EFAULT;
>> + ? ? if (!user_filter)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? filter = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
>> + ? ? ret = -ENOMEM;
>> + ? ? if (!filter)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? ret = -EFAULT;
>> + ? ? if (strncpy_from_user(filter, user_filter,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? SECCOMP_MAX_FILTER_LENGTH - 1) < 0)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? nr = (int) syscall_nr;
>> + ? ? ret = seccomp_set_filter(nr, filter);
>> +
>> +out:
>> + ? ? kfree(filter);
>> + ? ? return ret;
>> +}
>> +
>> +long prctl_clear_seccomp_filter(unsigned long syscall_nr)
>> +{
>> + ? ? int nr = -1;
>> + ? ? long ret;
>> +
>> + ? ? ret = -EINVAL;
>> + ? ? if (syscall_nr >= NR_syscalls)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? nr = (int) syscall_nr;
>> + ? ? ret = seccomp_clear_filter(nr);
>> +
>> +out:
>> + ? ? return ret;
>> +}
>> +
>> +long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long available)
>> +{
>> + ? ? int ret, nr;
>> + ? ? unsigned long copied;
>> + ? ? char *buf = NULL;
>> + ? ? ret = -EINVAL;
>> + ? ? if (!available)
>> + ? ? ? ? ? ? goto out;
>> + ? ? /* Ignore extra buffer space. */
>> + ? ? if (available > SECCOMP_MAX_FILTER_LENGTH)
>> + ? ? ? ? ? ? available = SECCOMP_MAX_FILTER_LENGTH;
>> +
>> + ? ? ret = -EINVAL;
>> + ? ? if (syscall_nr >= NR_syscalls)
>> + ? ? ? ? ? ? goto out;
>> + ? ? nr = (int) syscall_nr;
>> +
>> + ? ? ret = -ENOMEM;
>> + ? ? buf = kmalloc(available, GFP_KERNEL);
>> + ? ? if (!buf)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? ret = seccomp_get_filter(nr, buf, available);
>> + ? ? if (ret < 0)
>> + ? ? ? ? ? ? goto out;
>> +
>> + ? ? /* Include the NUL byte in the copy. */
>> + ? ? copied = copy_to_user(dst, buf, ret + 1);
>> + ? ? ret = -ENOSPC;
>> + ? ? if (copied)
>> + ? ? ? ? ? ? goto out;
>> + ? ? ret = 0;
>> +out:
>> + ? ? kfree(buf);
>> + ? ? return ret;
>> +}
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index af468ed..ed60d06 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -1698,13 +1698,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>> ? ? ? ? ? ? ? case PR_SET_ENDIAN:
>> ? ? ? ? ? ? ? ? ? ? ? error = SET_ENDIAN(me, arg2);
>> ? ? ? ? ? ? ? ? ? ? ? break;
>> -
>> ? ? ? ? ? ? ? case PR_GET_SECCOMP:
>> ? ? ? ? ? ? ? ? ? ? ? error = prctl_get_seccomp();
>> ? ? ? ? ? ? ? ? ? ? ? break;
>> ? ? ? ? ? ? ? case PR_SET_SECCOMP:
>> ? ? ? ? ? ? ? ? ? ? ? error = prctl_set_seccomp(arg2);
>> ? ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? ? ? ? ? case PR_SET_SECCOMP_FILTER:
>> + ? ? ? ? ? ? ? ? ? ? error = prctl_set_seccomp_filter(arg2,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?(char __user *) arg3);
>> + ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? ? ? ? ? case PR_CLEAR_SECCOMP_FILTER:
>> + ? ? ? ? ? ? ? ? ? ? error = prctl_clear_seccomp_filter(arg2);
>> + ? ? ? ? ? ? ? ? ? ? break;
>> + ? ? ? ? ? ? case PR_GET_SECCOMP_FILTER:
>> + ? ? ? ? ? ? ? ? ? ? error = prctl_get_seccomp_filter(arg2,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?(char __user *) arg3,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?arg4);
>> + ? ? ? ? ? ? ? ? ? ? break;
>> ? ? ? ? ? ? ? case PR_GET_TSC:
>> ? ? ? ? ? ? ? ? ? ? ? error = GET_TSC_CTL(arg2);
>> ? ? ? ? ? ? ? ? ? ? ? break;
>> diff --git a/security/Kconfig b/security/Kconfig
>> index 95accd4..c76adf2 100644
>> --- a/security/Kconfig
>> +++ b/security/Kconfig
>> @@ -2,6 +2,10 @@
>> ?# Security configuration
>> ?#
>>
>> +# Make seccomp filter Kconfig switch below available
>> +config HAVE_SECCOMP_FILTER
>> + ? ? ? bool
>> +
>> ?menu "Security options"
>>
>> ?config KEYS
>> @@ -82,6 +86,19 @@ config SECURITY_DMESG_RESTRICT
>>
>> ? ? ? ? If you are unsure how to answer this question, answer N.
>>
>> +config SECCOMP_FILTER
>> +     bool "Enable seccomp-based system call filtering"
>> +     select SECCOMP
>> +     depends on HAVE_SECCOMP_FILTER && EXPERIMENTAL
>> +     help
>> +       This kernel feature expands CONFIG_SECCOMP to allow computing
>> +       in environments with reduced kernel access dictated by the
>> +       application itself through prctl calls.  If
>> +       CONFIG_FTRACE_SYSCALLS is available, then system call
>> +       argument-based filtering predicates may be used.
>> +
>> +       See Documentation/prctl/seccomp_filter.txt for more detail.
>> +
>> ?config SECURITY
>> ? ? ? bool "Enable different security models"
>> ? ? ? depends on SYSFS
>> --
>> 1.7.0.4
>>
>

2011-06-02 19:42:37

by Paul E. McKenney

Subject: Re: [PATCH v3 03/13] seccomp_filters: new mode with configurable syscall filters

On Thu, Jun 02, 2011 at 01:14:54PM -0500, Will Drewry wrote:
> On Thu, Jun 2, 2011 at 12:36 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Tue, May 31, 2011 at 10:10:35PM -0500, Will Drewry wrote:
> >> This change adds a new seccomp mode which specifies the allowed system
> >> calls dynamically.  When in the new mode (2), all system calls are
> >> checked against process-defined filters - first by system call number,
> >> then by a filter string.  If an entry exists for a given system call and
> >> all filter predicates evaluate to true, then the task may proceed.
> >> Otherwise, the task is killed.
> >
> > A few questions below -- I can't say that I understand the RCU usage.
> >
> >                                                         Thanx, Paul
> >
> >> Filter string parsing and evaluation is handled by the ftrace filter
> >> engine.  Related patches tweak the perf filter trace and free code so
> >> that the calls can be shared. Filters inherit their understanding of
> >> types and arguments for each system call from the CONFIG_FTRACE_SYSCALLS
> >> subsystem, which already populates this information in the
> >> syscall_metadata-associated enter_event (and exit_event) structures. If
> >> CONFIG_FTRACE_SYSCALLS is not compiled in, only filter strings of "1"
> >> will be allowed.
> >>
> >> The net result is that a process may have its system calls filtered using
> >> the ftrace filter engine's inherent understanding of system calls.  The set
> >> of filters is specified through the PR_SET_SECCOMP_FILTER argument in
> >> prctl(). For example, a filterset for a process, like pdftotext, that
> >> should only process read-only input could (roughly) look like:
> >>   sprintf(rdonly, "flags == %u", O_RDONLY|O_LARGEFILE);
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_open, rdonly);
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR__llseek, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_brk, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_close, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_exit_group, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_fstat64, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_mmap2, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_munmap, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_read, "1");
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "(fd == 1 || fd == 2)");
> >>   prctl(PR_SET_SECCOMP, 2);
> >>
> >> Subsequent calls to PR_SET_SECCOMP_FILTER for the same system call will
> >> be &&'d together to ensure that attack surface may only be reduced:
> >>   prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd != 2");
> >>
> >> With the earlier example, the active filter becomes:
> >>   "(fd == 1 || fd == 2) && fd != 2"
> >>
> >> The patch also adds PR_CLEAR_SECCOMP_FILTER and PR_GET_SECCOMP_FILTER.
> >> The latter returns the current filter for a system call to userspace:
> >>
> >>   prctl(PR_GET_SECCOMP_FILTER, __NR_write, buf, bufsize);
> >>
> >> while the former clears any filters for a given system call, changing it
> >> back to a default deny:
> >>
> >>   prctl(PR_CLEAR_SECCOMP_FILTER, __NR_write);
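
As a rough illustration of the "&&'d together" behaviour described in
the commit message, the following userspace sketch builds the combined
string from an existing filter and a new predicate. It is an assumption
about how such a composition could look, not the patch's code:
filter_extend() is a hypothetical helper, and the exact string the
kernel builds may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: AND a new predicate onto an existing filter
 * string. Returns 0 on success, or -1 if the result does not fit.
 */
static int filter_extend(char *dst, size_t dstsize,
			 const char *existing, const char *extra)
{
	int n;

	if (!existing || existing[0] == '\0')
		/* Nothing to extend: the new predicate is the filter. */
		n = snprintf(dst, dstsize, "%s", extra);
	else
		/* Parenthesize the old filter so "||" inside it binds first. */
		n = snprintf(dst, dstsize, "(%s) && %s", existing, extra);

	if (n < 0 || (size_t)n >= dstsize)
		return -1;
	return 0;
}
```

Because the old filter is always kept as one operand of the "&&", a
later call can only narrow what is accepted, never widen it.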
> >>
> >> v3: - always block execve calls (as per linus torvalds)
> >>     - add __NR_seccomp_execve(_32) to seccomp-supporting arches
> >>     - ensure compat tasks can't reach ftrace:syscalls
> >>     - dropped new defines for seccomp modes.
> >>     - two level array instead of hlists (sugg. by olof johansson)
> >>     - added generic Kconfig entry that is not connected.
> >>     - dropped internal seccomp.h
> >>     - move prctl helpers to seccomp_filter
> >>     - killed seccomp_t typedef (as per checkpatch)
> >> v2: - changed to use the existing syscall number ABI.
> >>     - prctl changes to minimize parsing in the kernel:
> >>       prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
> >>       prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
> >>       prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
> >>       prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
> >>     - defined PR_SECCOMP_MODE_STRICT and ..._FILTER
> >>     - added flags
> >>     - provide a default fail syscall_nr_to_meta in ftrace
> >>     - provides fallback for unhooked system calls
> >>     - use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
> >>     - added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
> >>     - moved to a hlist and 4 bit hash of linked lists
> >>     - added support to operate without CONFIG_FTRACE_SYSCALLS
> >>     - moved Kconfig support next to SECCOMP
> >>     - made Kconfig entries dependent on EXPERIMENTAL
> >>     - added macros to avoid ifdefs from kernel/fork.c
> >>     - added compat task/filter matching
> >>     - drop seccomp.h inclusion in sched.h and drop seccomp_t
> >>     - added Filtering to "show" output
> >>     - added on_exec state dup'ing when enabling after a fast-path accept.
> >>
> >> Signed-off-by: Will Drewry <[email protected]>
> >> ---
> >>  include/linux/prctl.h   |    5 +
> >>  include/linux/sched.h   |    2 +-
> >>  include/linux/seccomp.h |   98 ++++++-
> >>  include/trace/syscall.h |    7 +
> >>  kernel/Makefile         |    3 +
> >>  kernel/fork.c           |    3 +
> >>  kernel/seccomp.c        |   38 ++-
> >>  kernel/seccomp_filter.c |  784 +++++++++++++++++++++++++++++++++++++++++++++++
> >>  kernel/sys.c            |   13 +-
> >>  security/Kconfig        |   17 +
> >>  10 files changed, 954 insertions(+), 16 deletions(-)
> >>  create mode 100644 kernel/seccomp_filter.c
> >>
> >> diff --git a/include/linux/prctl.h b/include/linux/prctl.h
> >> index a3baeb2..44723ce 100644
> >> --- a/include/linux/prctl.h
> >> +++ b/include/linux/prctl.h
> >> @@ -64,6 +64,11 @@
> >> ?#define PR_GET_SECCOMP ? ? ? 21
> >> ?#define PR_SET_SECCOMP ? ? ? 22
> >>
> >> +/* Get/set process seccomp filters */
> >> +#define PR_GET_SECCOMP_FILTER ? ? ? ?35
> >> +#define PR_SET_SECCOMP_FILTER ? ? ? ?36
> >> +#define PR_CLEAR_SECCOMP_FILTER ? ? ?37
> >> +
> >> ?/* Get/set the capability bounding set (as per security/commoncap.c) */
> >> ?#define PR_CAPBSET_READ 23
> >> ?#define PR_CAPBSET_DROP 24
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index 18d63ce..3f0bc8d 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -1374,7 +1374,7 @@ struct task_struct {
> >> ? ? ? uid_t loginuid;
> >> ? ? ? unsigned int sessionid;
> >> ?#endif
> >> - ? ? seccomp_t seccomp;
> >> + ? ? struct seccomp_struct seccomp;
> >>
> >> ?/* Thread group tracking */
> >> ? ? ? u32 parent_exec_id;
> >> diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
> >> index 167c333..f4434ca 100644
> >> --- a/include/linux/seccomp.h
> >> +++ b/include/linux/seccomp.h
> >> @@ -1,13 +1,33 @@
> >> ?#ifndef _LINUX_SECCOMP_H
> >> ?#define _LINUX_SECCOMP_H
> >>
> >> +struct seq_file;
> >>
> >> ?#ifdef CONFIG_SECCOMP
> >>
> >> +#include <linux/errno.h>
> >> ?#include <linux/thread_info.h>
> >> +#include <linux/types.h>
> >> ?#include <asm/seccomp.h>
> >>
> >> -typedef struct { int mode; } seccomp_t;
> >> +struct seccomp_filters;
> >> +/**
> >> + * struct seccomp_struct - the state of a seccomp'ed process
> >> + *
> >> + * @mode:
> >> + *     if this is 1, the process is under standard seccomp rules
> >> + *             is 2, the process is only allowed to make system calls where
> >> + *                   associated filters evaluate successfully.
> >> + * @filters: Metadata for filters if using CONFIG_SECCOMP_FILTER.
> >> + *           filters assignment/use should be RCU-protected and its contents
> >> + *           should never be modified when attached to a seccomp_struct.
> >> + */
> >> +struct seccomp_struct {
> >> + ? ? uint16_t mode;
> >> +#ifdef CONFIG_SECCOMP_FILTER
> >> + ? ? struct seccomp_filters *filters;
> >> +#endif
> >> +};
> >>
> >> ?extern void __secure_computing(int);
> >> ?static inline void secure_computing(int this_syscall)
> >> @@ -16,15 +36,14 @@ static inline void secure_computing(int this_syscall)
> >> ? ? ? ? ? ? ? __secure_computing(this_syscall);
> >> ?}
> >>
> >> -extern long prctl_get_seccomp(void);
> >> ?extern long prctl_set_seccomp(unsigned long);
> >> +extern long prctl_get_seccomp(void);
> >>
> >> ?#else /* CONFIG_SECCOMP */
> >>
> >> ?#include <linux/errno.h>
> >>
> >> -typedef struct { } seccomp_t;
> >> -
> >> +struct seccomp_struct { };
> >> ?#define secure_computing(x) do { } while (0)
> >>
> >> ?static inline long prctl_get_seccomp(void)
> >> @@ -32,11 +51,80 @@ static inline long prctl_get_seccomp(void)
> >> ? ? ? return -EINVAL;
> >> ?}
> >>
> >> -static inline long prctl_set_seccomp(unsigned long arg2)
> >> +static inline long prctl_set_seccomp(unsigned long a2)
> >> ?{
> >> ? ? ? return -EINVAL;
> >> ?}
> >>
> >> ?#endif /* CONFIG_SECCOMP */
> >>
> >> +#ifdef CONFIG_SECCOMP_FILTER
> >> +
> >> +#define inherit_tsk_seccomp(_child, _orig) do { \
> >> + ? ? _child->seccomp.mode = _orig->seccomp.mode; \
> >> + ? ? _child->seccomp.filters = get_seccomp_filters(_orig->seccomp.filters); \
> >> + ? ? } while (0)
> >> +#define put_tsk_seccomp(_tsk) put_seccomp_filters(_tsk->seccomp.filters)
> >> +
> >> +extern int seccomp_show_filters(struct seccomp_filters *filters,
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct seq_file *);
> >> +extern long seccomp_set_filter(int, char *);
> >> +extern long seccomp_clear_filter(int);
> >> +extern long seccomp_get_filter(int, char *, unsigned long);
> >> +
> >> +extern long prctl_set_seccomp_filter(unsigned long, char __user *);
> >> +extern long prctl_get_seccomp_filter(unsigned long, char __user *,
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long);
> >> +extern long prctl_clear_seccomp_filter(unsigned long);
> >> +
> >> +extern struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *);
> >> +extern void put_seccomp_filters(struct seccomp_filters *);
> >> +
> >> +extern int seccomp_test_filters(int);
> >> +extern void seccomp_filter_log_failure(int);
> >> +
> >> +#else ?/* CONFIG_SECCOMP_FILTER */
> >> +
> >> +struct seccomp_filters { };
> >> +#define inherit_tsk_seccomp(_child, _orig) do { } while (0)
> >> +#define put_tsk_seccomp(_tsk) do { } while (0)
> >> +
> >> +static inline int seccomp_show_filters(struct seccomp_filters *filters,
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct seq_file *m)
> >> +{
> >> + ? ? return -ENOSYS;
> >> +}
> >> +
> >> +static inline long seccomp_set_filter(int syscall_nr, char *filter)
> >> +{
> >> + ? ? return -ENOSYS;
> >> +}
> >> +
> >> +static inline long seccomp_clear_filter(int syscall_nr)
> >> +{
> >> + ? ? return -ENOSYS;
> >> +}
> >> +
> >> +static inline long seccomp_get_filter(int syscall_nr,
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? char *buf, unsigned long available)
> >> +{
> >> + ? ? return -ENOSYS;
> >> +}
> >> +
> >> +static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3)
> >> +{
> >> + ? ? return -ENOSYS;
> >> +}
> >> +
> >> +static inline long prctl_clear_seccomp_filter(unsigned long a2)
> >> +{
> >> + ? ? return -ENOSYS;
> >> +}
> >> +
> >> +static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3,
> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long a4)
> >> +{
> >> + ? ? return -ENOSYS;
> >> +}
> >> +#endif ?/* CONFIG_SECCOMP_FILTER */
> >> ?#endif /* _LINUX_SECCOMP_H */
> >> diff --git a/include/trace/syscall.h b/include/trace/syscall.h
> >> index 242ae04..e061ad0 100644
> >> --- a/include/trace/syscall.h
> >> +++ b/include/trace/syscall.h
> >> @@ -35,6 +35,8 @@ struct syscall_metadata {
> >> ?extern unsigned long arch_syscall_addr(int nr);
> >> ?extern int init_syscall_trace(struct ftrace_event_call *call);
> >>
> >> +extern struct syscall_metadata *syscall_nr_to_meta(int);
> >> +
> >> ?extern int reg_event_syscall_enter(struct ftrace_event_call *call);
> >> ?extern void unreg_event_syscall_enter(struct ftrace_event_call *call);
> >> ?extern int reg_event_syscall_exit(struct ftrace_event_call *call);
> >> @@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct trace_event *event);
> >> ?enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct trace_event *event);
> >> +#else
> >> +static inline struct syscall_metadata *syscall_nr_to_meta(int nr)
> >> +{
> >> + ? ? return NULL;
> >> +}
> >> ?#endif
> >>
> >> ?#ifdef CONFIG_PERF_EVENTS
> >> diff --git a/kernel/Makefile b/kernel/Makefile
> >> index 85cbfb3..84e7dfb 100644
> >> --- a/kernel/Makefile
> >> +++ b/kernel/Makefile
> >> @@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
> >> ?obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
> >> ?obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
> >> ?obj-$(CONFIG_SECCOMP) += seccomp.o
> >> +ifeq ($(CONFIG_SECCOMP_FILTER),y)
> >> +obj-$(CONFIG_SECCOMP) += seccomp_filter.o
> >> +endif
> >> ?obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
> >> ?obj-$(CONFIG_TREE_RCU) += rcutree.o
> >> ?obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index e7548de..6f835e0 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -34,6 +34,7 @@
> >> ?#include <linux/cgroup.h>
> >> ?#include <linux/security.h>
> >> ?#include <linux/hugetlb.h>
> >> +#include <linux/seccomp.h>
> >> ?#include <linux/swap.h>
> >> ?#include <linux/syscalls.h>
> >> ?#include <linux/jiffies.h>
> >> @@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
> >> ? ? ? free_thread_info(tsk->stack);
> >> ? ? ? rt_mutex_debug_task_free(tsk);
> >> ? ? ? ftrace_graph_exit_task(tsk);
> >> + ? ? put_tsk_seccomp(tsk);
> >> ? ? ? free_task_struct(tsk);
> >> ?}
> >> ?EXPORT_SYMBOL(free_task);
> >> @@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
> >> ? ? ? if (err)
> >> ? ? ? ? ? ? ? goto out;
> >>
> >> + ? ? inherit_tsk_seccomp(tsk, orig);
> >> ? ? ? setup_thread_stack(tsk, orig);
> >> ? ? ? clear_user_return_notifier(tsk);
> >> ? ? ? clear_tsk_need_resched(tsk);
> >> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> >> index 57d4b13..0a942be 100644
> >> --- a/kernel/seccomp.c
> >> +++ b/kernel/seccomp.c
> >> @@ -2,16 +2,20 @@
> >>   * linux/kernel/seccomp.c
> >>   *
> >>   * Copyright 2004-2005  Andrea Arcangeli <[email protected]>
> >> + * Copyright (C) 2011 The Chromium OS Authors <[email protected]>
> >>   *
> >>   * This defines a simple but solid secure-computing mode.
> >>   */
> >>
> >>  #include <linux/seccomp.h>
> >>  #include <linux/sched.h>
> >> +#include <linux/slab.h>
> >>  #include <linux/compat.h>
> >> +#include <linux/unistd.h>
> >> +#include <linux/ftrace_event.h>
> >>
> >> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
> >>  /* #define SECCOMP_DEBUG 1 */
> >> -#define NR_SECCOMP_MODES 1
> >>
> >>  /*
> >>   * Secure computing mode 1 allows only read/write/exit/sigreturn.
> >> @@ -32,10 +36,9 @@ static int mode1_syscalls_32[] = {
> >>
> >>  void __secure_computing(int this_syscall)
> >>  {
> >> -	int mode = current->seccomp.mode;
> >>  	int * syscall;
> >>
> >> -	switch (mode) {
> >> +	switch (current->seccomp.mode) {
> >>  	case 1:
> >>  		syscall = mode1_syscalls;
> >>  #ifdef CONFIG_COMPAT
> >> @@ -47,6 +50,17 @@ void __secure_computing(int this_syscall)
> >>  				return;
> >>  		} while (*++syscall);
> >>  		break;
> >> +#ifdef CONFIG_SECCOMP_FILTER
> >> +	case 2:
> >> +		if (this_syscall >= NR_syscalls || this_syscall < 0)
> >> +			break;
> >> +
> >> +		if (!seccomp_test_filters(this_syscall))
> >> +			return;
> >> +
> >> +		seccomp_filter_log_failure(this_syscall);
> >> +		break;
> >> +#endif
> >>  	default:
> >>  		BUG();
> >>  	}
> >> @@ -71,16 +85,22 @@ long prctl_set_seccomp(unsigned long seccomp_mode)
> >>  	if (unlikely(current->seccomp.mode))
> >>  		goto out;
> >>
> >> -	ret = -EINVAL;
> >> -	if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
> >> -		current->seccomp.mode = seccomp_mode;
> >> -		set_thread_flag(TIF_SECCOMP);
> >> +	ret = 0;
> >> +	switch (seccomp_mode) {
> >> +	case 1:
> >>  #ifdef TIF_NOTSC
> >>  		disable_TSC();
> >>  #endif
> >> -		ret = 0;
> >> +#ifdef CONFIG_SECCOMP_FILTER
> >> +	case 2:
> >> +#endif
> >> +		current->seccomp.mode = seccomp_mode;
> >> +		set_thread_flag(TIF_SECCOMP);
> >> +		break;
> >> +	default:
> >> +		ret = -EINVAL;
> >>  	}
> >>
> >> - out:
> >> +out:
> >>  	return ret;
> >>  }
> >> diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
> >> new file mode 100644
> >> index 0000000..9782f25
> >> --- /dev/null
> >> +++ b/kernel/seccomp_filter.c
> >> @@ -0,0 +1,784 @@
> >> +/* filter engine-based seccomp system call filtering
> >> + *
> >> + * This program is free software; you can redistribute it and/or modify
> >> + * it under the terms of the GNU General Public License as published by
> >> + * the Free Software Foundation; either version 2 of the License, or
> >> + * (at your option) any later version.
> >> + *
> >> + * This program is distributed in the hope that it will be useful,
> >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >> + * GNU General Public License for more details.
> >> + *
> >> + * You should have received a copy of the GNU General Public License
> >> + * along with this program; if not, write to the Free Software
> >> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> >> + *
> >> + * Copyright (C) 2011 The Chromium OS Authors <[email protected]>
> >> + */
> >> +
> >> +#include <linux/compat.h>
> >> +#include <linux/err.h>
> >> +#include <linux/errno.h>
> >> +#include <linux/ftrace_event.h>
> >> +#include <linux/seccomp.h>
> >> +#include <linux/seq_file.h>
> >> +#include <linux/sched.h>
> >> +#include <linux/slab.h>
> >> +#include <linux/uaccess.h>
> >> +
> >> +#include <asm/syscall.h>
> >> +#include <trace/syscall.h>
> >> +
> >> +
> >> +#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
> >> +
> >> +#define SECCOMP_FILTER_ALLOW "1"
> >> +#define SECCOMP_ACTION_DENY 0xffff
> >> +#define SECCOMP_ACTION_ALLOW 0xfffe
> >> +
> >> +/**
> >> + * struct seccomp_filters - container for seccomp filterset
> >> + *
> >> + * @syscalls: array of 16-bit indices into @event_filters by syscall_nr
> >> + *            May also be SECCOMP_ACTION_DENY or SECCOMP_ACTION_ALLOW
> >> + * @event_filters: array of pointers to ftrace event objects
> >> + * @count: size of @event_filters
> >> + * @flags: anonymous struct to wrap filters-specific flags
> >> + * @usage: reference count to simplify use.
> >> + */
> >> +struct seccomp_filters {
> >> +	uint16_t syscalls[NR_syscalls];
> >> +	struct event_filter **event_filters;
> >> +	uint16_t count;
> >> +	struct {
> >> +		uint32_t compat:1,
> >> +			 __reserved:31;
> >> +	} flags;
> >> +	atomic_t usage;
> >> +};
> >> +
> >> +/* Handle ftrace symbol non-existence */
> >> +#ifdef CONFIG_FTRACE_SYSCALLS
> >> +#define create_event_filter(_ef_pptr, _event_type, _str) \
> >> +	ftrace_parse_filter(_ef_pptr, _event_type, _str)
> >> +#define get_filter_string(_ef) ftrace_get_filter_string(_ef)
> >> +#define free_event_filter(_f) ftrace_free_filter(_f)
> >> +
> >> +#else
> >> +
> >> +#define create_event_filter(_ef_pptr, _event_type, _str) (-ENOSYS)
> >> +#define get_filter_string(_ef) (NULL)
> >> +#define free_event_filter(_f) do { } while (0)
> >> +#endif
> >> +
> >> +/**
> >> + * seccomp_filters_new - allocates a new filters object
> >> + * @count: count to allocate for the event_filters array
> >> + *
> >> + * Returns ERR_PTR on error or an allocated object.
> >> + */
> >> +static struct seccomp_filters *seccomp_filters_new(uint16_t count)
> >> +{
> >> +	struct seccomp_filters *f;
> >> +
> >> +	if (count >= SECCOMP_ACTION_ALLOW)
> >> +		return ERR_PTR(-EINVAL);
> >> +
> >> +	f = kzalloc(sizeof(struct seccomp_filters), GFP_KERNEL);
> >> +	if (!f)
> >> +		return ERR_PTR(-ENOMEM);
> >> +
> >> +	/* Lazy SECCOMP_ACTION_DENY assignment. */
> >> +	memset(f->syscalls, 0xff, sizeof(f->syscalls));
> >> +	atomic_set(&f->usage, 1);
> >> +
> >> +	f->event_filters = NULL;
> >> +	f->count = count;
> >> +	if (!count)
> >> +		return f;
> >> +
> >> +	f->event_filters = kzalloc(count * sizeof(struct event_filter *),
> >> +				   GFP_KERNEL);
> >> +	if (!f->event_filters) {
> >> +		kfree(f);
> >> +		f = ERR_PTR(-ENOMEM);
> >> +	}
> >> +	return f;
> >> +}
> >> +
> >> +/**
> >> + * seccomp_filters_free - cleans up the filter list and frees the table
> >> + * @filters: NULL or live object to be completely destructed.
> >> + */
> >> +static void seccomp_filters_free(struct seccomp_filters *filters)
> >> +{
> >> +	uint16_t count = 0;
> >> +	if (!filters)
> >> +		return;
> >> +	while (count < filters->count) {
> >> +		struct event_filter *f = filters->event_filters[count];
> >> +		free_event_filter(f);
> >> +		count++;
> >> +	}
> >> +	kfree(filters->event_filters);
> >> +	kfree(filters);
> >> +}
> >> +
> >> +static void __put_seccomp_filters(struct seccomp_filters *orig)
> >> +{
> >> +	WARN_ON(atomic_read(&orig->usage));
> >> +	seccomp_filters_free(orig);
> >> +}
> >> +
> >> +#define seccomp_filter_allow(_id) ((_id) == SECCOMP_ACTION_ALLOW)
> >> +#define seccomp_filter_deny(_id) ((_id) == SECCOMP_ACTION_DENY)
> >> +#define seccomp_filter_dynamic(_id) \
> >> +	(!seccomp_filter_allow(_id) && !seccomp_filter_deny(_id))
> >> +static inline uint16_t seccomp_filter_id(const struct seccomp_filters *f,
> >> +					 int syscall_nr)
> >> +{
> >> +	if (!f)
> >> +		return SECCOMP_ACTION_DENY;
> >> +	return f->syscalls[syscall_nr];
> >> +}
> >> +
> >> +static inline struct event_filter *seccomp_dynamic_filter(
> >> +		const struct seccomp_filters *filters, uint16_t id)
> >> +{
> >> +	if (!seccomp_filter_dynamic(id))
> >> +		return NULL;
> >> +	return filters->event_filters[id];
> >> +}
> >> +
> >> +static inline void set_seccomp_filter_id(struct seccomp_filters *filters,
> >> +					 int syscall_nr, uint16_t id)
> >> +{
> >> +	filters->syscalls[syscall_nr] = id;
> >> +}
> >> +
> >> +static inline void set_seccomp_filter(struct seccomp_filters *filters,
> >> +				      int syscall_nr, uint16_t id,
> >> +				      struct event_filter *dynamic_filter)
> >> +{
> >> +	filters->syscalls[syscall_nr] = id;
> >> +	if (seccomp_filter_dynamic(id))
> >> +		filters->event_filters[id] = dynamic_filter;
> >> +}
> >> +
> >> +static struct event_filter *alloc_event_filter(int syscall_nr,
> >> +					       const char *filter_string)
> >> +{
> >> +	struct syscall_metadata *data;
> >> +	struct event_filter *filter = NULL;
> >> +	int err;
> >> +
> >> +	data = syscall_nr_to_meta(syscall_nr);
> >> +	/* Argument-based filtering only works on ftrace-hooked syscalls. */
> >> +	err = -ENOSYS;
> >> +	if (!data)
> >> +		goto fail;
> >> +	err = create_event_filter(&filter,
> >> +				  data->enter_event->event.type,
> >> +				  filter_string);
> >> +	if (err)
> >> +		goto fail;
> >> +
> >> +	return filter;
> >> +fail:
> >> +	kfree(filter);
> >> +	return ERR_PTR(err);
> >> +}
> >> +
> >> +/**
> >> + * seccomp_filters_copy - copies filters from src to dst.
> >> + *
> >> + * @dst: seccomp_filters to populate.
> >> + * @src: table to read from.
> >> + * @skip: specifies an entry, by system call, to skip.
> >> + *
> >> + * Returns non-zero on failure.
> >> + * Both the source and the destination should have no simultaneous
> >> + * writers, and dst should be exclusive to the caller.
> >> + * If @skip is < 0, it is ignored.
> >> + */
> >> +static int seccomp_filters_copy(struct seccomp_filters *dst,
> >> +				const struct seccomp_filters *src,
> >> +				int skip)
> >> +{
> >> +	int id = 0, ret = 0, nr;
> >> +	memcpy(&dst->flags, &src->flags, sizeof(src->flags));
> >> +	memcpy(dst->syscalls, src->syscalls, sizeof(dst->syscalls));
> >> +	if (!src->count)
> >> +		goto done;
> >> +	for (nr = 0; nr < NR_syscalls; ++nr) {
> >> +		struct event_filter *filter;
> >> +		const char *str;
> >> +		uint16_t src_id = seccomp_filter_id(src, nr);
> >> +		if (nr == skip) {
> >> +			set_seccomp_filter(dst, nr, SECCOMP_ACTION_DENY,
> >> +					   NULL);
> >> +			continue;
> >> +		}
> >> +		if (!seccomp_filter_dynamic(src_id))
> >> +			continue;
> >> +		if (id >= dst->count) {
> >> +			ret = -EINVAL;
> >> +			goto done;
> >> +		}
> >> +		str = get_filter_string(seccomp_dynamic_filter(src, src_id));
> >> +		filter = alloc_event_filter(nr, str);
> >> +		if (IS_ERR(filter)) {
> >> +			ret = PTR_ERR(filter);
> >> +			goto done;
> >> +		}
> >> +		set_seccomp_filter(dst, nr, id, filter);
> >> +		id++;
> >> +	}
> >> +
> >> +done:
> >> +	return ret;
> >> +}
> >> +
> >> +/**
> >> + * seccomp_extend_filter - appends more text to a syscall_nr's filter
> >> + * @filters: unattached filter object to operate on
> >> + * @syscall_nr: syscall number to update filters for
> >> + * @filter_string: string to append to the existing filter
> >> + *
> >> + * The new string will be &&'d to the original filter string to ensure that it
> >> + * always matches the existing predicates or less:
> >> + *   (old_filter) && @filter_string
> >> + * A new seccomp_filters instance is returned on success and a ERR_PTR on
> >> + * failure.
> >> + */
> >> +static int seccomp_extend_filter(struct seccomp_filters *filters,
> >> +				 int syscall_nr, char *filter_string)
> >> +{
> >> +	struct event_filter *filter;
> >> +	uint16_t id = seccomp_filter_id(filters, syscall_nr);
> >> +	char *merged = NULL;
> >> +	int ret = -EINVAL, expected;
> >> +
> >> +	/* No extending with a "1". */
> >> +	if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string))
> >> +		goto out;
> >> +
> >> +	filter = seccomp_dynamic_filter(filters, id);
> >> +	ret = -ENOENT;
> >> +	if (!filter)
> >> +		goto out;
> >> +
> >> +	merged = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
> >> +	ret = -ENOMEM;
> >> +	if (!merged)
> >> +		goto out;
> >> +
> >> +	expected = snprintf(merged, SECCOMP_MAX_FILTER_LENGTH, "(%s) && %s",
> >> +			    get_filter_string(filter), filter_string);
> >> +	ret = -E2BIG;
> >> +	if (expected >= SECCOMP_MAX_FILTER_LENGTH || expected < 0)
> >> +		goto out;
> >> +
> >> +	/* Free the old filter */
> >> +	free_event_filter(filter);
> >> +	set_seccomp_filter(filters, syscall_nr, id, NULL);
> >> +
> >> +	/* Replace it */
> >> +	filter = alloc_event_filter(syscall_nr, merged);
> >> +	if (IS_ERR(filter)) {
> >> +		ret = PTR_ERR(filter);
> >> +		goto out;
> >> +	}
> >> +	set_seccomp_filter(filters, syscall_nr, id, filter);
> >> +	ret = 0;
> >> +
> >> +out:
> >> +	kfree(merged);
> >> +	return ret;
> >> +}
> >> +
> >> +/**
> >> + * seccomp_add_filter - adds a filter for an unfiltered syscall
> >> + * @filters: filters object to add a filter/action to
> >> + * @syscall_nr: system call number to add a filter for
> >> + * @filter_string: the filter string to apply
> >> + *
> >> + * Returns 0 on success and non-zero otherwise.
> >> + */
> >> +static int seccomp_add_filter(struct seccomp_filters *filters, int syscall_nr,
> >> +			      char *filter_string)
> >> +{
> >> +	struct event_filter *filter;
> >> +	int ret = 0;
> >> +
> >> +	if (!strcmp(SECCOMP_FILTER_ALLOW, filter_string)) {
> >> +		set_seccomp_filter(filters, syscall_nr,
> >> +				   SECCOMP_ACTION_ALLOW, NULL);
> >> +		goto out;
> >> +	}
> >> +
> >> +	filter = alloc_event_filter(syscall_nr, filter_string);
> >> +	if (IS_ERR(filter)) {
> >> +		ret = PTR_ERR(filter);
> >> +		goto out;
> >> +	}
> >> +	/* Always add to the last slot available since additions are
> >> +	 * are only done one at a time.
> >> +	 */
> >> +	set_seccomp_filter(filters, syscall_nr, filters->count - 1, filter);
> >> +out:
> >> +	return ret;
> >> +}
> >> +
> >> +/* Wrap optional ftrace syscall support. Returns 1 on match or 0 otherwise. */
> >> +static int filter_match_current(struct event_filter *event_filter)
> >> +{
> >> +	int err = 0;
> >> +#ifdef CONFIG_FTRACE_SYSCALLS
> >> +	uint8_t syscall_state[64];
> >> +
> >> +	memset(syscall_state, 0, sizeof(syscall_state));
> >> +
> >> +	/* The generic tracing entry can remain zeroed. */
> >> +	err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
> >> +					 NULL);
> >> +	if (err)
> >> +		return 0;
> >> +
> >> +	err = filter_match_preds(event_filter, syscall_state);
> >> +#endif
> >> +	return err;
> >> +}
> >> +
> >> +static const char *syscall_nr_to_name(int syscall)
> >> +{
> >> +	const char *syscall_name = "unknown";
> >> +	struct syscall_metadata *data = syscall_nr_to_meta(syscall);
> >> +	if (data)
> >> +		syscall_name = data->name;
> >> +	return syscall_name;
> >> +}
> >> +
> >> +static void filters_set_compat(struct seccomp_filters *filters)
> >> +{
> >> +#ifdef CONFIG_COMPAT
> >> +	if (is_compat_task())
> >> +		filters->flags.compat = 1;
> >> +#endif
> >> +}
> >> +
> >> +static inline int filters_compat_mismatch(struct seccomp_filters *filters)
> >> +{
> >> +	int ret = 0;
> >> +	if (!filters)
> >> +		return 0;
> >> +#ifdef CONFIG_COMPAT
> >> +	if (!!(is_compat_task()) == filters->flags.compat)
> >> +		ret = 1;
> >> +#endif
> >> +	return ret;
> >> +}
> >> +
> >> +static inline int syscall_is_execve(int syscall)
> >> +{
> >> +	int nr = __NR_execve;
> >> +#ifdef CONFIG_COMPAT
> >> +	if (is_compat_task())
> >> +		nr = __NR_seccomp_execve_32;
> >> +#endif
> >> +	return syscall == nr;
> >> +}
> >> +
> >> +#ifndef KSTK_EIP
> >> +#define KSTK_EIP(x) 0L
> >> +#endif
> >> +
> >> +void seccomp_filter_log_failure(int syscall)
> >> +{
> >> +	pr_info("%s[%d]: system call %d (%s) blocked at 0x%lx\n",
> >> +		current->comm, task_pid_nr(current), syscall,
> >> +		syscall_nr_to_name(syscall), KSTK_EIP(current));
> >> +}
> >> +
> >> +/* put_seccomp_state - decrements the reference count of @orig and may free. */
> >> +void put_seccomp_filters(struct seccomp_filters *orig)
> >> +{
> >> +	if (!orig)
> >> +		return;
> >> +
> >> +	if (atomic_dec_and_test(&orig->usage))
> >> +		__put_seccomp_filters(orig);
> >> +}
> >> +
> >> +/* get_seccomp_state - increments the reference count of @orig */
> >> +struct seccomp_filters *get_seccomp_filters(struct seccomp_filters *orig)
> >
> > Nit: the name does not match the comment.
>
> Will fix it here and above. Thanks!
>
> >> +{
> >> +	if (!orig)
> >> +		return NULL;
> >> +	atomic_inc(&orig->usage);
> >> +	return orig;
> >
> > This is called in an RCU read-side critical section.  What exactly is
> > RCU protecting?  I would expect an rcu_dereference() or one of the
> > RCU list-traversal primitives somewhere, either here or at the caller.
>
> Ah, I spaced on rcu_dereference(). The goal was to make the
> assignment and replacement of the seccomp_filters pointer
> RCU-protected (in seccomp_state) so there's no concern over it being
> partially replaced on platforms where pointer assignments are non-atomic
> - such as via /proc/<pid>/seccomp_filters access or a call via the
> exported symbols. Object lifetime is managed by reference counting so
> that I don't have to worry about extending the RCU read-side critical
> section by much or deal with pre-allocations.
>
> I'll add rcu_dereference() to all the get_seccomp_filters() uses where
> it makes sense, so that it is called safely. Just to make sure, does
> it make sense to continue to rcu protect the specific pointer?

It might. The usual other option is to use a lock outside of the element
containing the reference count to protect reference-count manipulation.
If there is some convenient lock, especially if it is already held where
needed, then locking is more straightforward. Otherwise, RCU is usually
a reasonable option.
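
As a rough userspace sketch of the lock-based alternative (pthreads standing
in for a kernel lock; struct owner, struct filters, and the helper names are
made up for illustration and are not part of the patch), the lock lives in
the containing object and covers both the pointer read and the count update:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct seccomp_filters. */
struct filters {
	int usage;	/* protected by owner->lock, so it need not be atomic */
};

/* Hypothetical container; the lock sits outside the counted element. */
struct owner {
	pthread_mutex_t lock;
	struct filters *filters;
};

/* Allocate an object holding one reference (the owner's). */
static struct filters *filters_new(void)
{
	struct filters *f = calloc(1, sizeof(*f));

	if (f)
		f->usage = 1;
	return f;
}

/* Take a reference on whatever the owner currently points at. */
static struct filters *owner_get_filters(struct owner *o)
{
	struct filters *f;

	pthread_mutex_lock(&o->lock);
	f = o->filters;
	if (f)
		f->usage++;
	pthread_mutex_unlock(&o->lock);
	return f;
}

/* Drop a reference; returns 1 if this was the last one and f was freed. */
static int owner_put_filters(struct owner *o, struct filters *f)
{
	int last;

	if (!f)
		return 0;
	pthread_mutex_lock(&o->lock);
	assert(f->usage > 0);
	last = (--f->usage == 0);
	pthread_mutex_unlock(&o->lock);
	if (last)
		free(f);
	return last;
}
```

Because the pointer is only ever read and the count only ever changed with
the lock held, no reader can pick up a pointer whose last reference is being
dropped at the same moment. The cost is that every get and put takes the
lock, which is why RCU can be the better fit on hot paths.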

> >> +}
> >> +
> >> +/**
> >> + * seccomp_test_filters - tests 'current' against the given syscall
> >> + * @state: seccomp_state of current to use.
> >> + * @syscall: number of the system call to test
> >> + *
> >> + * Returns 0 on ok and non-zero on error/failure.
> >> + */
> >> +int seccomp_test_filters(int syscall)
> >> +{
> >> +	uint16_t id;
> >> +	struct event_filter *filter;
> >> +	struct seccomp_filters *filters;
> >> +	int ret = -EACCES;
> >> +
> >> +	rcu_read_lock();
> >> +	filters = get_seccomp_filters(current->seccomp.filters);
> >> +	rcu_read_unlock();
> >> +
> >> +	if (!filters)
> >> +		goto out;
> >> +
> >> +	if (filters_compat_mismatch(filters)) {
> >> +		pr_info("%s[%d]: seccomp_filter compat() mismatch.\n",
> >> +			current->comm, task_pid_nr(current));
> >> +		goto out;
> >> +	}
> >> +
> >> +	/* execve is never allowed. */
> >> +	if (syscall_is_execve(syscall))
> >> +		goto out;
> >> +
> >> +	ret = 0;
> >> +	id = seccomp_filter_id(filters, syscall);
> >> +	if (seccomp_filter_allow(id))
> >> +		goto out;
> >> +
> >> +	ret = -EACCES;
> >> +	if (!seccomp_filter_dynamic(id))
> >> +		goto out;
> >> +
> >> +	filter = seccomp_dynamic_filter(filters, id);
> >> +	if (filter && filter_match_current(filter))
> >> +		ret = 0;
> >> +out:
> >> +	put_seccomp_filters(filters);
> >> +	return ret;
> >> +}
> >> +
> >> +/**
> >> + * seccomp_show_filters - prints the current filter state to a seq_file
> >> + * @filters: properly get()'d filters object
> >> + * @m: the prepared seq_file to receive the data
> >> + *
> >> + * Returns 0 on a successful write.
> >> + */
> >> +int seccomp_show_filters(struct seccomp_filters *filters, struct seq_file *m)
> >> +{
> >> +	int syscall;
> >> +	seq_printf(m, "Mode: %d\n", current->seccomp.mode);
> >> +	if (!filters)
> >> +		goto out;
> >> +
> >> +	for (syscall = 0; syscall < NR_syscalls; ++syscall) {
> >> +		uint16_t id = seccomp_filter_id(filters, syscall);
> >> +		const char *filter_string = SECCOMP_FILTER_ALLOW;
> >> +		if (seccomp_filter_deny(id))
> >> +			continue;
> >> +		seq_printf(m, "%d (%s): ",
> >> +			      syscall,
> >> +			      syscall_nr_to_name(syscall));
> >> +		if (seccomp_filter_dynamic(id))
> >> +			filter_string = get_filter_string(
> >> +					  seccomp_dynamic_filter(filters, id));
> >> +		seq_printf(m, "%s\n", filter_string);
> >> +	}
> >> +out:
> >> +	return 0;
> >> +}
> >> +EXPORT_SYMBOL_GPL(seccomp_show_filters);
> >> +
> >> +/**
> >> + * seccomp_get_filter - copies the filter_string into "buf"
> >> + * @syscall_nr: system call number to look up
> >> + * @buf: destination buffer
> >> + * @bufsize: available space in the buffer.
> >> + *
> >> + * Context: User context only. This function may sleep on allocation and
> >> + *          operates on current. current must be attempting a system call
> >> + *          when this is called.
> >> + *
> >> + * Looks up the filter for the given system call number on current.  If found,
> >> + * the string length of the NUL-terminated buffer is returned and < 0 is
> >> + * returned on error. The NUL byte is not included in the length.
> >> + */
> >> +long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize)
> >> +{
> >> +	struct seccomp_filters *filters;
> >> +	struct event_filter *filter;
> >> +	long ret = -EINVAL;
> >> +	uint16_t id;
> >> +
> >> +	if (bufsize > SECCOMP_MAX_FILTER_LENGTH)
> >> +		bufsize = SECCOMP_MAX_FILTER_LENGTH;
> >> +
> >> +	rcu_read_lock();
> >> +	filters = get_seccomp_filters(current->seccomp.filters);
> >> +	rcu_read_unlock();
> >> +
> >> +	if (!filters)
> >> +		goto out;
> >> +
> >> +	ret = -ENOENT;
> >> +	id = seccomp_filter_id(filters, syscall_nr);
> >> +	if (seccomp_filter_deny(id))
> >> +		goto out;
> >> +
> >> +	if (seccomp_filter_allow(id)) {
> >> +		ret = strlcpy(buf, SECCOMP_FILTER_ALLOW, bufsize);
> >> +		goto copied;
> >> +	}
> >> +
> >> +	filter = seccomp_dynamic_filter(filters, id);
> >> +	if (!filter)
> >> +		goto out;
> >> +	ret = strlcpy(buf, get_filter_string(filter), bufsize);
> >> +
> >> +copied:
> >> +	if (ret >= bufsize) {
> >> +		ret = -ENOSPC;
> >> +		goto out;
> >> +	}
> >> +	/* Zero out any remaining buffer, just in case. */
> >> +	memset(buf + ret, 0, bufsize - ret);
> >> +out:
> >> +	put_seccomp_filters(filters);
> >> +	return ret;
> >> +}
> >> +EXPORT_SYMBOL_GPL(seccomp_get_filter);
> >> +
> >> +/**
> >> + * seccomp_clear_filter: clears the seccomp filter for a syscall.
> >> + * @syscall_nr: the system call number to clear filters for.
> >> + *
> >> + * Context: User context only. This function may sleep on allocation and
> >> + *          operates on current. current must be attempting a system call
> >> + *          when this is called.
> >> + *
> >> + * Returns 0 on success.
> >> + */
> >> +long seccomp_clear_filter(int syscall_nr)
> >> +{
> >> +	struct seccomp_filters *filters = NULL, *orig_filters;
> >> +	uint16_t id;
> >> +	int ret = -EINVAL;
> >> +
> >> +	rcu_read_lock();
> >> +	orig_filters = get_seccomp_filters(current->seccomp.filters);
> >> +	rcu_read_unlock();
> >> +
> >> +	if (!orig_filters)
> >> +		goto out;
> >> +
> >> +	if (filters_compat_mismatch(orig_filters))
> >> +		goto out;
> >> +
> >> +	id = seccomp_filter_id(orig_filters, syscall_nr);
> >> +	if (seccomp_filter_deny(id))
> >> +		goto out;
> >> +
> >> +	/* Create a new filters object for the task */
> >> +	if (seccomp_filter_dynamic(id))
> >> +		filters = seccomp_filters_new(orig_filters->count - 1);
> >> +	else
> >> +		filters = seccomp_filters_new(orig_filters->count);
> >> +
> >> +	if (IS_ERR(filters)) {
> >> +		ret = PTR_ERR(filters);
> >> +		goto out;
> >> +	}
> >> +
> >> +	/* Copy, but drop the requested entry. */
> >> +	ret = seccomp_filters_copy(filters, orig_filters, syscall_nr);
> >> +	if (ret)
> >> +		goto out;
> >> +	get_seccomp_filters(filters);  /* simplify the out: path */
> >> +
> >> +	rcu_assign_pointer(current->seccomp.filters, filters);
> >
> > What prevents two copies of seccomp_clear_filter() from running
> > concurrently?
>
> Nothing - the last one wins assignment, but the objects themselves
> should be internally consistent to the parallel calls. If that's a
> concern, a per-task writer mutex could be used just to ensure
> simultaneous calls to clear and set are performed serially. Would
> that make more sense?

Here is the sequence of events that I am concerned about:

o	CPU 0 sets orig_filters to point to the current filters.

o	CPU 1 sets its local orig_filters to point to the current
	set of filters.

o	Both CPUs allocate new filters and use rcu_assign_pointer()
	to do the update. As you say, the last one wins, but it appears
	to me that the first one leaks memory.

o	Both CPUs free the object referenced by their orig_filters,
	which might or might not result in a double free, depending
	on exactly what happens below. (You might actually be OK,
	I didn't check -- leaking memory was enough for me to call
	attention to this.)

So yes, please use some kind of mutual exclusion. Not sure what you
mean by "per-task mutex", but whatever it is must prevent two different
tasks from acting on the same set of filters at the same time. The
thing that I call "per-task mutex" would -not- do that.
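
To make the shape of the fix concrete, here is a minimal userspace sketch of
a serialized copy-and-replace update. It assumes a single lock shared by
everyone who can update a given filter set; update_lock, replace_filters(),
and the generation field are illustrative names, not anything in the patch,
and the reader side (RCU plus atomic reference counts in the real code) is
deliberately left out:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct seccomp_filters. */
struct filters {
	int usage;
	int generation;	/* bumped on every replacement, for illustration */
};

static int live_objects;	/* allocations not yet freed */
static pthread_mutex_t update_lock = PTHREAD_MUTEX_INITIALIZER;
static struct filters *current_filters;

static struct filters *filters_alloc(int generation)
{
	struct filters *f = calloc(1, sizeof(*f));

	if (!f)
		return NULL;
	f->usage = 1;
	f->generation = generation;
	live_objects++;
	return f;
}

static void filters_put(struct filters *f)
{
	if (!f)
		return;
	assert(f->usage > 0);
	if (--f->usage == 0) {
		free(f);
		live_objects--;
	}
}

/*
 * Read the old pointer, build the replacement, publish it, and drop the
 * old reference, all under one lock.  A second updater cannot sample the
 * same orig in the middle, so nothing is leaked or put twice.
 */
static int replace_filters(void)
{
	struct filters *orig, *new;

	pthread_mutex_lock(&update_lock);
	orig = current_filters;
	new = filters_alloc(orig ? orig->generation + 1 : 0);
	if (!new) {
		pthread_mutex_unlock(&update_lock);
		return -1;
	}
	current_filters = new;
	filters_put(orig);
	pthread_mutex_unlock(&update_lock);
	return 0;
}
```

In the kernel version the publish step would presumably stay
rcu_assign_pointer() and the put of the old object would still have to wait
for a grace period; the point here is only that fetching orig_filters and
replacing it happen inside the same critical section.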

> >> +	synchronize_rcu();
> >> +	put_seccomp_filters(orig_filters);  /* for the task */
> >> +out:
> >> +	put_seccomp_filters(orig_filters);  /* for the get */
> >> +	put_seccomp_filters(filters);  /* for the extra get */
> >> +	return ret;
> >> +}
> >> +EXPORT_SYMBOL_GPL(seccomp_clear_filter);
> >> +
> >> +/**
> >> + * seccomp_set_filter: - Adds/extends a seccomp filter for a syscall.
> >> + * @syscall_nr: system call number to apply the filter to.
> >> + * @filter: ftrace filter string to apply.
> >> + *
> >> + * Context: User context only. This function may sleep on allocation and
> >> + *          operates on current. current must be attempting a system call
> >> + *          when this is called.
> >> + *
> >> + * New filters may be added for system calls when the current task is
> >> + * not in a secure computing mode (seccomp).  Otherwise, existing filters may
> >> + * be extended.
> >> + *
> >> + * Returns 0 on success or an errno on failure.
> >> + */
> >> +long seccomp_set_filter(int syscall_nr, char *filter)
> >> +{
> >> +	struct seccomp_filters *filters = NULL, *orig_filters = NULL;
> >> +	uint16_t id;
> >> +	long ret = -EINVAL;
> >> +	uint16_t filters_needed;
> >> +
> >> +	if (!filter)
> >> +		goto out;
> >> +
> >> +	filter = strstrip(filter);
> >> +	/* Disallow empty strings. */
> >> +	if (filter[0] == 0)
> >> +		goto out;
> >> +
> >> +	rcu_read_lock();
> >> +	orig_filters = get_seccomp_filters(current->seccomp.filters);
> >> +	rcu_read_unlock();
> >> +
> >> +	/* After the first call, compatibility mode is selected permanently. */
> >> +	ret = -EACCES;
> >> +	if (filters_compat_mismatch(orig_filters))
> >> +		goto out;
> >> +
> >> +	filters_needed = orig_filters ? orig_filters->count : 0;
> >> +	id = seccomp_filter_id(orig_filters, syscall_nr);
> >> +	if (seccomp_filter_deny(id)) {
> >> +		/* Don't allow DENYs to be changed when in a seccomp mode */
> >> +		ret = -EACCES;
> >> +		if (current->seccomp.mode)
> >> +			goto out;
> >> +		filters_needed++;
> >> +	}
> >> +
> >> +	filters = seccomp_filters_new(filters_needed);
> >> +	if (IS_ERR(filters)) {
> >> +		ret = PTR_ERR(filters);
> >> +		goto out;
> >> +	}
> >> +
> >> +	filters_set_compat(filters);
> >> +	if (orig_filters) {
> >> +		ret = seccomp_filters_copy(filters, orig_filters, -1);
> >> +		if (ret)
> >> +			goto out;
> >> +	}
> >> +
> >> +	if (seccomp_filter_deny(id))
> >> +		ret = seccomp_add_filter(filters, syscall_nr, filter);
> >> +	else
> >> +		ret = seccomp_extend_filter(filters, syscall_nr, filter);
> >> +	if (ret)
> >> +		goto out;
> >> +	get_seccomp_filters(filters);  /* simplify the error paths */
> >> +
> >> +	rcu_assign_pointer(current->seccomp.filters, filters);
> >
> > Again, what prevents two copies of seccomp_set_filter() from running
> > concurrently?
>
> Same deal - nothing, but I'd be happy to add a guard if it makes sense.
>
> Thanks!
>
> >> +	synchronize_rcu();
> >> +	put_seccomp_filters(orig_filters);  /* for the task */
> >> +out:
> >> +	put_seccomp_filters(orig_filters);  /* for the get */
> >> +	put_seccomp_filters(filters);  /* for get or task, on err */
> >> +	return ret;
> >> +}
> >> +EXPORT_SYMBOL_GPL(seccomp_set_filter);
> >> +
> >> +long prctl_set_seccomp_filter(unsigned long syscall_nr,
> >> +			      char __user *user_filter)
> >> +{
> >> +	int nr;
> >> +	long ret;
> >> +	char *filter = NULL;
> >> +
> >> +	ret = -EINVAL;
> >> +	if (syscall_nr >= NR_syscalls)
> >> +		goto out;
> >> +
> >> +	ret = -EFAULT;
> >> +	if (!user_filter)
> >> +		goto out;
> >> +
> >> +	filter = kzalloc(SECCOMP_MAX_FILTER_LENGTH + 1, GFP_KERNEL);
> >> +	ret = -ENOMEM;
> >> +	if (!filter)
> >> +		goto out;
> >> +
> >> +	ret = -EFAULT;
> >> +	if (strncpy_from_user(filter, user_filter,
> >> +			      SECCOMP_MAX_FILTER_LENGTH - 1) < 0)
> >> +		goto out;
> >> +
> >> +	nr = (int) syscall_nr;
> >> +	ret = seccomp_set_filter(nr, filter);
> >> +
> >> +out:
> >> +	kfree(filter);
> >> +	return ret;
> >> +}
> >> +
> >> +long prctl_clear_seccomp_filter(unsigned long syscall_nr)
> >> +{
> >> +	int nr = -1;
> >> +	long ret;
> >> +
> >> +	ret = -EINVAL;
> >> +	if (syscall_nr >= NR_syscalls)
> >> +		goto out;
> >> +
> >> +	nr = (int) syscall_nr;
> >> +	ret = seccomp_clear_filter(nr);
> >> +
> >> +out:
> >> +	return ret;
> >> +}
> >> +
> >> +long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst,
> >> +			      unsigned long available)
> >> +{
> >> +	int ret, nr;
> >> +	unsigned long copied;
> >> +	char *buf = NULL;
> >> +	ret = -EINVAL;
> >> +	if (!available)
> >> +		goto out;
> >> +	/* Ignore extra buffer space. */
> >> +	if (available > SECCOMP_MAX_FILTER_LENGTH)
> >> +		available = SECCOMP_MAX_FILTER_LENGTH;
> >> +
> >> +	ret = -EINVAL;
> >> +	if (syscall_nr >= NR_syscalls)
> >> +		goto out;
> >> +	nr = (int) syscall_nr;
> >> +
> >> +	ret = -ENOMEM;
> >> +	buf = kmalloc(available, GFP_KERNEL);
> >> +	if (!buf)
> >> +		goto out;
> >> +
> >> +	ret = seccomp_get_filter(nr, buf, available);
> >> +	if (ret < 0)
> >> +		goto out;
> >> +
> >> +	/* Include the NUL byte in the copy. */
> >> +	copied = copy_to_user(dst, buf, ret + 1);
> >> +	ret = -ENOSPC;
> >> +	if (copied)
> >> +		goto out;
> >> +	ret = 0;
> >> +out:
> >> +	kfree(buf);
> >> +	return ret;
> >> +}
> >> diff --git a/kernel/sys.c b/kernel/sys.c
> >> index af468ed..ed60d06 100644
> >> --- a/kernel/sys.c
> >> +++ b/kernel/sys.c
> >> @@ -1698,13 +1698,24 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> >>  		case PR_SET_ENDIAN:
> >>  			error = SET_ENDIAN(me, arg2);
> >>  			break;
> >> -
> >>  		case PR_GET_SECCOMP:
> >>  			error = prctl_get_seccomp();
> >>  			break;
> >>  		case PR_SET_SECCOMP:
> >>  			error = prctl_set_seccomp(arg2);
> >>  			break;
> >> +		case PR_SET_SECCOMP_FILTER:
> >> +			error = prctl_set_seccomp_filter(arg2,
> >> +							 (char __user *) arg3);
> >> +			break;
> >> +		case PR_CLEAR_SECCOMP_FILTER:
> >> +			error = prctl_clear_seccomp_filter(arg2);
> >> +			break;
> >> +		case PR_GET_SECCOMP_FILTER:
> >> +			error = prctl_get_seccomp_filter(arg2,
> >> +							 (char __user *) arg3,
> >> +							 arg4);
> >> +			break;
> >>  		case PR_GET_TSC:
> >>  			error = GET_TSC_CTL(arg2);
> >>  			break;
> >> diff --git a/security/Kconfig b/security/Kconfig
> >> index 95accd4..c76adf2 100644
> >> --- a/security/Kconfig
> >> +++ b/security/Kconfig
> >> @@ -2,6 +2,10 @@
> >>  # Security configuration
> >>  #
> >>
> >> +# Make seccomp filter Kconfig switch below available
> >> +config HAVE_SECCOMP_FILTER
> >> +	bool
> >> +
> >>  menu "Security options"
> >>
> >>  config KEYS
> >> @@ -82,6 +86,19 @@ config SECURITY_DMESG_RESTRICT
> >>
> >>  	  If you are unsure how to answer this question, answer N.
> >>
> >> +config SECCOMP_FILTER
> >> +	bool "Enable seccomp-based system call filtering"
> >> +	select SECCOMP
> >> +	depends on HAVE_SECCOMP_FILTER && EXPERIMENTAL
> >> +	help
> >> +	  This kernel feature expands CONFIG_SECCOMP to allow computing
> >> +	  in environments with reduced kernel access dictated by the
> >> +	  application itself through prctl calls.  If
> >> +	  CONFIG_FTRACE_SYSCALLS is available, then system call
> >> +	  argument-based filtering predicates may be used.
> >> +
> >> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> >> +
> >>  config SECURITY
> >>  	bool "Enable different security models"
> >>  	depends on SYSFS
> >> --
> >> 1.7.0.4
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >> the body of a message to [email protected]
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> Please read the FAQ at  http://www.tux.org/lkml/
> >

2011-06-02 20:28:56