2014-07-25 13:47:52

by David Drysdale

Subject: [RFC PATCHv2 00/11] Adding FreeBSD's Capsicum security framework

The last couple of versions of FreeBSD (9.x/10.x) have included the
Capsicum security framework [1], which allows security-aware
applications to sandbox themselves in a very fine-grained way. For
example, OpenSSH now (>= 6.5) uses Capsicum in its FreeBSD version to
restrict sshd's credentials checking process, to reduce the chances of
credential leakage.

It would be good to have equivalent functionality in Linux, so I've been
working on getting the Capsicum framework running in the kernel, and I'd
appreciate some feedback/opinions on the general approach.

I'm attaching a corresponding draft patchset for reference, but
hopefully this cover email describes the significant features well
enough to save everyone having to look through the code details. (It
does mean this is a long email though -- apologies for that.)


1) Capsicum Capabilities
------------------------

The most significant aspect of Capsicum is associating *rights* with
(some) file descriptors, so that the kernel only allows operations on an
FD if the rights permit it. This allows userspace applications to
sandbox themselves by tightly constraining what's allowed with both
inputs and outputs; for example, tcpdump might restrict itself so it can
only read from the network FD, and only write to stdout.

The kernel thus needs to police the rights checks for these file
descriptors (referred to as 'Capsicum capabilities', which are entirely
unrelated to POSIX.1e capabilities), and the best place to do this is
at the points where a file descriptor from userspace is converted to a
struct file * within the kernel.

[Policing the rights checks anywhere else, for example at the system
call boundary, isn't a good idea because it opens up the possibility
of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
changed (as openat/close/dup2 are allowed in capability mode) between
the 'check' at syscall entry and the 'use' at fget() invocation.]

However, this does lead to quite an invasive change to the kernel --
every invocation of fget() or similar functions (fdget(),
sockfd_lookup(), user_path_at(),...) needs to be annotated with the
rights associated with the specific operations that will be performed on
the struct file. There are ~100 such invocations that need annotation.

My current implementation approach is to use varargs variants of the
fget() functions that include the required rights, varargs-macroed so
that the only impact in a non-Capsicum build is the need to cope with an
ERR_PTR on failure rather than just NULL:

#ifdef CONFIG_SECURITY_CAPSICUM
#define fgetr(fd, ...) _fgetr((fd), __VA_ARGS__, CAP_LIST_END)
/* + Other variants... */
#else
#define fgetr(fd, ...) (fget(fd) ?: ERR_PTR(-EBADF))
/* + Other variants... */
#endif

For example, an existing chunk of code like:

SYSCALL_DEFINE1(fchdir, unsigned int, fd)
{
struct fd f = fdget_raw(fd);
struct inode *inode;
int error = -EBADF;

error = -EBADF;
if (!f.file)
goto out;
...

might become:

SYSCALL_DEFINE1(fchdir, unsigned int, fd)
{
struct fd f = fdgetr_raw(fd, CAP_FCHDIR);
struct inode *inode;
int error = -EBADF;

if (IS_ERR(f.file)) {
error = PTR_ERR(f.file);
goto out;
}
...

In a Capsicum build the fdgetr_raw() function performs rights checks
(and potentially returns a new errno as ERR_PTR(-ENOTCAPABLE)), whereas
in a non-Capsicum build the only change is that fdgetr_raw() returns
ERR_PTR(-EBADF) rather than just NULL.
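To make the two idioms above concrete, here is a minimal userspace sketch (not the patchset's code). It re-creates the kernel's ERR_PTR()/PTR_ERR()/IS_ERR() helpers so they can be compiled outside the kernel, and shows a sentinel-terminated varargs macro in the style of fgetr(). The DEMO_* names, the rights bit values, and the idea of folding the rights into a single mask are assumptions made purely for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errnos. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical rights bits and sentinel (stand-ins for CAP_* / CAP_LIST_END). */
#define DEMO_CAP_READ  (1ULL << 0)
#define DEMO_CAP_WRITE (1ULL << 1)
#define DEMO_CAP_SEEK  (1ULL << 2)
#define DEMO_LIST_END  0ULL

/* Fold a sentinel-terminated list of rights into one mask. */
unsigned long long demo_rights_mask_(unsigned long long first, ...)
{
	unsigned long long mask = 0, r = first;
	va_list ap;

	va_start(ap, first);
	while (r != DEMO_LIST_END) {
		mask |= r;
		r = va_arg(ap, unsigned long long);
	}
	va_end(ap);
	return mask;
}

/* The macro appends the sentinel so that callers never have to. */
#define demo_rights_mask(...) demo_rights_mask_(__VA_ARGS__, DEMO_LIST_END)

/* Mirrors the non-Capsicum fallback: a NULL lookup becomes ERR_PTR(-EBADF). */
void *demo_null_to_err(void *file)
{
	return file ? file : ERR_PTR(-EBADF);
}
```

A caller would write something like demo_rights_mask(DEMO_CAP_READ, DEMO_CAP_SEEK) and test the lookup result with IS_ERR() before using it, which is the same shape as the converted fchdir() snippet above.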


2) Capsicum Capabilities Data Structure
---------------------------------------

Internally, the rights associated with a Capsicum capability FD are
stored in a special struct file wrapper. For a normal file, the rights
check inside fget() falls through, but for a capability wrapper the
rights in the wrapper are checked and (if capable) the underlying
wrapped struct file is returned.
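The wrapper behaviour described above can be modelled in a few lines of userspace C. This is only a sketch under assumptions: struct demo_file, demo_unwrap() and the DEMO_ENOTCAPABLE value are hypothetical stand-ins and do not reflect the actual layout or errno value used in the patchset.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Placeholder value only; the real ENOTCAPABLE number is not assumed here. */
#define DEMO_ENOTCAPABLE 1000

/* Hypothetical model of a file that may be a capability wrapper. */
struct demo_file {
	int is_capability;            /* nonzero for a capability wrapper */
	unsigned long long rights;    /* rights held (wrapper only) */
	struct demo_file *underlying; /* wrapped file (wrapper only) */
};

/*
 * Resolve a file for an operation needing 'required' rights.
 * Returns 0 and stores the usable file in *out, or a negative errno.
 */
int demo_unwrap(struct demo_file *f, unsigned long long required,
		struct demo_file **out)
{
	if (f == NULL)
		return -EBADF;
	if (!f->is_capability) {
		/* Normal file: the rights check falls through. */
		*out = f;
		return 0;
	}
	if ((f->rights & required) != required)
		return -DEMO_ENOTCAPABLE;
	/* Capable: hand back the wrapped file, not the wrapper. */
	*out = f->underlying;
	return 0;
}
```

The point of the model is that callers only ever see the underlying file, so code after the lookup does not need to know whether a capability was involved.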

[This is approximately the implementation that was present in FreeBSD
9.x. For FreeBSD 10.x, the wrapper file was removed and the rights
associated with a file descriptor are now stored in the fdtable. As
that impacts memory use for all processes, whether Capsicum users or
not, I've stuck with the FreeBSD 9.x approach.]


3) Allowing Capability Mode
---------------------------

Capsicum also includes 'capability mode', which locks down the available
syscalls so the rights restrictions can't just be bypassed by opening
new file descriptors. More precisely, capability mode prevents access
to syscalls that access global namespaces, such as the filesystem or the
IP:port space.

The existing seccomp-bpf functionality of the kernel is a good mechanism
for implementing capability mode, but there are a few additional details
that also need to be addressed.

a) The capability mode filter program needs to apply process-wide, not
just to the current thread.

b) In capability mode, new files can still be opened with openat(2) but
only if they are beneath an existing directory file descriptor.

c) In capability mode it should still be possible for a process to send
signals to itself with kill(2)/tgkill(2).

This v2 patchset copes with these as follows:

a) Kees Cook's incoming seccomp(2) patchset covers thread
synchronization of filters.

b) A new prctl(PR_SET_OPENAT_BENEATH) operation implicitly sets the
O_BENEATH flag (see below) for all file-open operations for all
threads of the current process, by adding a new openat_beneath
flag in task_struct.

c) An extension to the seccomp_data structure that includes the current
task's tid and tgid values allows for BPF programs that check a
kill(2)/tgkill(2) argument against the current thread, in a manner
that is robust against fork(2)/clone(2).

The combination of these features with the existing seccomp-bpf
functionality gives the tools needed to implement capability mode.
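As a rough illustration of point (c), the decision such a filter makes can be written as plain C rather than BPF. The structure below is a simplified stand-in for the extended seccomp_data; its field names, the DEMO_SYS_* identifiers, and the choice to require a matching tid for tgkill(2) are all assumptions for this sketch. The key property is that the comparison uses the tgid/tid supplied at filter run time, so nothing process-specific is baked into the program at install time.

```c
#include <assert.h>

/* Hypothetical syscall identifiers; a real filter compares __NR_* values. */
enum demo_syscall { DEMO_SYS_KILL, DEMO_SYS_TGKILL, DEMO_SYS_OTHER };

/* Simplified stand-in for the extended seccomp_data (names assumed). */
struct demo_seccomp_data {
	enum demo_syscall nr;
	unsigned long long args[6];
	unsigned int tgid; /* caller's thread group id at filter run time */
	unsigned int tid;  /* caller's thread id at filter run time */
};

/* Returns 1 if the call only signals the caller itself, 0 otherwise. */
int demo_signal_allowed(const struct demo_seccomp_data *sd)
{
	switch (sd->nr) {
	case DEMO_SYS_KILL:   /* kill(pid, sig) */
		return sd->args[0] == sd->tgid;
	case DEMO_SYS_TGKILL: /* tgkill(tgid, tid, sig) */
		return sd->args[0] == sd->tgid && sd->args[1] == sd->tid;
	default:
		return 0; /* other syscalls are outside this sketch */
	}
}
```

Because the same logic is evaluated against the child's own tgid after fork(2), a child that tries to signal its parent's pid is refused without the filter needing to be rewritten.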


4) New System Calls
-------------------

To allow userspace applications to access the Capsicum capability
functionality, I'm proposing two new system calls: cap_rights_limit(2)
and cap_rights_get(2). I guess these could potentially be implemented
elsewhere (e.g. as fcntl(2) operations?) but the changes seem
significant enough that new syscalls are warranted.

[FreeBSD 10.x actually includes six new syscalls for manipulating the
rights associated with a Capsicum capability -- the capability rights
can police that only specific fcntl(2) or ioctl(2) commands are
allowed, and FreeBSD sets these with distinct syscalls.]


5) New openat(2) O_BENEATH Flag
-------------------------------

For Capsicum capabilities that are directory file descriptors, the
Capsicum framework only allows openat(cap_dfd, path, ...) operations to
work for files that are beneath the specified directory (and even that
only when the directory FD has the CAP_LOOKUP right), rejecting paths
that start with "/" or include "..". The same restriction applies
process-wide for a process in capability mode.

As this seemed like functionality that might be more generally useful,
I've implemented it independently as a new O_BENEATH flag for openat(2).
The Capsicum code then always triggers the use of that flag when the dfd
is a Capsicum capability, or when the prctl(2) command described above
is in play.

[FreeBSD has the openat(2) relative-only behaviour for capability DFDs
and processes in capability mode, but does not include the O_BENEATH
flag.]
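The path restriction described in this section can be approximated in userspace by a purely lexical check, sketched below. demo_beneath_check() is a hypothetical helper and not part of the patchset. Unlike the in-kernel check, which is applied during the path walk and therefore also covers symlink targets, this sketch only inspects the string it is given.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Return 0 if 'path' is relative and has no ".." component,
 * or -EACCES otherwise. Symlinks are NOT resolved here.
 */
int demo_beneath_check(const char *path)
{
	const char *p = path;

	if (*p == '/')
		return -EACCES; /* absolute paths escape the directory */

	while (*p) {
		size_t len = strcspn(p, "/"); /* length of this component */

		if (len == 2 && p[0] == '.' && p[1] == '.')
			return -EACCES; /* ".." could climb out */
		p += len;
		while (*p == '/') /* skip any run of separators */
			p++;
	}
	return 0;
}
```

Note that names which merely contain dots, such as "a..b" or "...", are ordinary components and pass the check; only a component that is exactly ".." is rejected.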


6) Patchset Notes
-----------------

I've appended the draft patchset (against v3.16-rc5) for the
implementation of Capsicum capabilities, in case anyone wants to dive
into the details.

However, I should point out that it might include some code that hasn't
been compiled -- I attempted to change every fget() invocation I could
find, even if it was for a build that I can't perform (but I have built
allyesconfig on x86 & ARM).

Also, I've left a gap in the syscall and prctl(2) command numbering, to
allow this to be merged on top of Kees Cook's seccomp(2) changes.

Regards,

David Drysdale


[1] http://www.cl.cam.ac.uk/research/security/capsicum/papers/2010usenix-security-capsicum-website.pdf
[2] http://www.watson.org/~robert/2007woot/


Changes since v1:
- removed gratuitous LSM hooks [Andy Lutomirski, Paul Moore]
- renamed O_BENEATH_ONLY to O_BENEATH [Christoph Hellwig]
- updated syscall numbers to allow for seccomp(2)
- added prctl(PR_SET_OPENAT_BENEATH) [Paolo Bonzini]
- added tid/tgid info to seccomp_data [Paolo Bonzini]
- update spacing for current checkpatch.pl
- [manpages] describe struct cap_rights [Andy Lutomirski]
- [manpages] clarify nioctl use [Andy Lutomirski]
- [manpages] clarify CAP_FCNTL use [Andy Lutomirski]


David Drysdale (11):
fs: add O_BENEATH flag to openat(2)
selftests: Add test of O_BENEATH & openat(2)
capsicum: rights values and structure definitions
capsicum: implement fgetr() and friends
capsicum: convert callers to use fgetr() etc
capsicum: implement sockfd_lookupr()
capsicum: convert callers to use sockfd_lookupr() etc
capsicum: invoke Capsicum on FD/file conversion
capsicum: add syscalls to limit FD rights
capsicum: prctl(2) to force use of O_BENEATH
seccomp: Add tgid and tid into seccomp_data

Documentation/security/capsicum.txt | 102 +++++++
arch/alpha/include/uapi/asm/fcntl.h | 1 +
arch/alpha/kernel/osf_sys.c | 6 +-
arch/ia64/kernel/perfmon.c | 54 ++--
arch/parisc/hpux/fs.c | 6 +-
arch/parisc/include/uapi/asm/fcntl.h | 1 +
arch/powerpc/kvm/powerpc.c | 4 +-
arch/powerpc/platforms/cell/spu_syscalls.c | 15 +-
arch/powerpc/platforms/cell/spufs/coredump.c | 2 +
arch/sparc/include/uapi/asm/fcntl.h | 1 +
arch/x86/syscalls/syscall_64.tbl | 2 +
drivers/base/dma-buf.c | 6 +-
drivers/block/loop.c | 14 +-
drivers/block/nbd.c | 5 +-
drivers/infiniband/core/ucma.c | 4 +-
drivers/infiniband/core/uverbs_cmd.c | 6 +-
drivers/infiniband/core/uverbs_main.c | 4 +-
drivers/infiniband/hw/usnic/usnic_transport.c | 2 +-
drivers/md/md.c | 8 +-
drivers/scsi/iscsi_tcp.c | 2 +-
drivers/staging/android/sync.c | 2 +-
drivers/staging/lustre/lustre/llite/file.c | 6 +-
drivers/staging/lustre/lustre/lmv/lmv_obd.c | 7 +-
drivers/staging/lustre/lustre/mdc/lproc_mdc.c | 8 +-
drivers/staging/lustre/lustre/mdc/mdc_request.c | 4 +-
drivers/staging/usbip/stub_dev.c | 2 +-
drivers/staging/usbip/vhci_sysfs.c | 2 +-
drivers/vfio/pci/vfio_pci.c | 6 +-
drivers/vfio/pci/vfio_pci_intrs.c | 6 +-
drivers/vfio/vfio.c | 6 +-
drivers/vhost/net.c | 8 +-
drivers/video/fbdev/msm/mdp.c | 4 +-
fs/aio.c | 37 ++-
fs/autofs4/dev-ioctl.c | 16 +-
fs/autofs4/inode.c | 4 +-
fs/btrfs/ioctl.c | 20 +-
fs/btrfs/send.c | 7 +-
fs/cifs/ioctl.c | 6 +-
fs/coda/inode.c | 4 +-
fs/coda/psdev.c | 2 +-
fs/compat.c | 18 +-
fs/compat_ioctl.c | 14 +-
fs/eventfd.c | 17 +-
fs/eventpoll.c | 19 +-
fs/ext4/ioctl.c | 6 +-
fs/fcntl.c | 106 ++++++-
fs/fhandle.c | 6 +-
fs/file.c | 130 ++++++++
fs/fuse/inode.c | 10 +-
fs/ioctl.c | 13 +-
fs/locks.c | 11 +-
fs/namei.c | 310 ++++++++++++++-----
fs/ncpfs/inode.c | 5 +-
fs/notify/dnotify/dnotify.c | 2 +
fs/notify/fanotify/fanotify_user.c | 16 +-
fs/notify/inotify/inotify_user.c | 12 +-
fs/ocfs2/cluster/heartbeat.c | 8 +-
fs/open.c | 46 +--
fs/proc/fd.c | 17 +-
fs/proc/namespaces.c | 6 +-
fs/read_write.c | 113 ++++---
fs/readdir.c | 18 +-
fs/select.c | 11 +-
fs/signalfd.c | 6 +-
fs/splice.c | 34 ++-
fs/stat.c | 10 +-
fs/statfs.c | 8 +-
fs/sync.c | 21 +-
fs/timerfd.c | 40 ++-
fs/utimes.c | 10 +-
fs/xattr.c | 26 +-
fs/xfs/xfs_ioctl.c | 14 +-
include/linux/capsicum.h | 72 +++++
include/linux/file.h | 136 +++++++++
include/linux/namei.h | 10 +
include/linux/net.h | 16 +
include/linux/sched.h | 3 +
include/linux/syscalls.h | 12 +
include/uapi/asm-generic/errno.h | 3 +
include/uapi/asm-generic/fcntl.h | 4 +
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/capsicum.h | 343 +++++++++++++++++++++
include/uapi/linux/prctl.h | 14 +
include/uapi/linux/seccomp.h | 10 +
ipc/mqueue.c | 30 +-
kernel/events/core.c | 14 +-
kernel/module.c | 10 +-
kernel/seccomp.c | 2 +
kernel/sys.c | 33 +-
kernel/sys_ni.c | 4 +
kernel/taskstats.c | 4 +-
kernel/time/posix-clock.c | 27 +-
mm/fadvise.c | 7 +-
mm/internal.h | 19 ++
mm/memcontrol.c | 12 +-
mm/mmap.c | 7 +-
mm/nommu.c | 9 +-
mm/readahead.c | 6 +-
net/9p/trans_fd.c | 10 +-
net/bluetooth/bnep/sock.c | 2 +-
net/bluetooth/cmtp/sock.c | 2 +-
net/bluetooth/hidp/sock.c | 4 +-
net/compat.c | 4 +-
net/l2tp/l2tp_core.c | 11 +-
net/l2tp/l2tp_core.h | 2 +
net/sched/sch_atm.c | 2 +-
net/socket.c | 207 ++++++++++---
net/sunrpc/svcsock.c | 4 +-
security/Kconfig | 15 +
security/Makefile | 2 +-
security/capsicum-rights.c | 201 +++++++++++++
security/capsicum-rights.h | 10 +
security/capsicum.c | 380 ++++++++++++++++++++++++
sound/core/pcm_native.c | 10 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat/.gitignore | 3 +
tools/testing/selftests/openat/Makefile | 24 ++
tools/testing/selftests/openat/openat.c | 146 +++++++++
virt/kvm/eventfd.c | 6 +-
virt/kvm/vfio.c | 12 +-
120 files changed, 2818 insertions(+), 533 deletions(-)
create mode 100644 Documentation/security/capsicum.txt
create mode 100644 include/linux/capsicum.h
create mode 100644 include/uapi/linux/capsicum.h
create mode 100644 security/capsicum-rights.c
create mode 100644 security/capsicum-rights.h
create mode 100644 security/capsicum.c
create mode 100644 tools/testing/selftests/openat/.gitignore
create mode 100644 tools/testing/selftests/openat/Makefile
create mode 100644 tools/testing/selftests/openat/openat.c

--
2.0.0.526.g5318336


2014-07-25 13:47:55

by David Drysdale

Subject: [PATCH 01/11] fs: add O_BENEATH flag to openat(2)

Add a new O_BENEATH flag for openat(2) which restricts the
provided path, rejecting (with -EACCES) paths that are not beneath
the provided dfd. In particular, reject:
- paths that contain .. components
- paths that begin with /
- symlinks that have paths as above.

Signed-off-by: David Drysdale <[email protected]>
---
arch/alpha/include/uapi/asm/fcntl.h | 1 +
arch/parisc/include/uapi/asm/fcntl.h | 1 +
arch/sparc/include/uapi/asm/fcntl.h | 1 +
fs/fcntl.c | 5 +++--
fs/namei.c | 43 ++++++++++++++++++++++++------------
fs/open.c | 4 +++-
include/linux/namei.h | 1 +
include/uapi/asm-generic/fcntl.h | 4 ++++
8 files changed, 43 insertions(+), 17 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/fcntl.h b/arch/alpha/include/uapi/asm/fcntl.h
index 09f49a6b87d1..76a87038d2c1 100644
--- a/arch/alpha/include/uapi/asm/fcntl.h
+++ b/arch/alpha/include/uapi/asm/fcntl.h
@@ -33,6 +33,7 @@

#define O_PATH 040000000
#define __O_TMPFILE 0100000000
+#define O_BENEATH 0200000000 /* no / or .. in openat path */

#define F_GETLK 7
#define F_SETLK 8
diff --git a/arch/parisc/include/uapi/asm/fcntl.h b/arch/parisc/include/uapi/asm/fcntl.h
index 34a46cbc76ed..3adadf72f929 100644
--- a/arch/parisc/include/uapi/asm/fcntl.h
+++ b/arch/parisc/include/uapi/asm/fcntl.h
@@ -21,6 +21,7 @@

#define O_PATH 020000000
#define __O_TMPFILE 040000000
+#define O_BENEATH 0100000000 /* no / or .. in openat path */

#define F_GETLK64 8
#define F_SETLK64 9
diff --git a/arch/sparc/include/uapi/asm/fcntl.h b/arch/sparc/include/uapi/asm/fcntl.h
index 7e8ace5bf760..ea38f0bd6cec 100644
--- a/arch/sparc/include/uapi/asm/fcntl.h
+++ b/arch/sparc/include/uapi/asm/fcntl.h
@@ -36,6 +36,7 @@

#define O_PATH 0x1000000
#define __O_TMPFILE 0x2000000
+#define O_BENEATH 0x4000000 /* no / or .. in openat path */

#define F_GETOWN 5 /* for sockets. */
#define F_SETOWN 6 /* for sockets. */
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 72c82f69b01b..abf82e05d7b3 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -742,14 +742,15 @@ static int __init fcntl_init(void)
* Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
* is defined as O_NONBLOCK on some platforms and not on others.
*/
- BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
+ BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ != HWEIGHT32(
O_RDONLY | O_WRONLY | O_RDWR |
O_CREAT | O_EXCL | O_NOCTTY |
O_TRUNC | O_APPEND | /* O_NONBLOCK | */
__O_SYNC | O_DSYNC | FASYNC |
O_DIRECT | O_LARGEFILE | O_DIRECTORY |
O_NOFOLLOW | O_NOATIME | O_CLOEXEC |
- __FMODE_EXEC | O_PATH | __O_TMPFILE
+ __FMODE_EXEC | O_PATH | __O_TMPFILE |
+ O_BENEATH
));

fasync_cache = kmem_cache_create("fasync_cache",
diff --git a/fs/namei.c b/fs/namei.c
index 985c6f368485..165ebb1209d4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -647,7 +647,7 @@ static __always_inline void set_root(struct nameidata *nd)
get_fs_root(current->fs, &nd->root);
}

-static int link_path_walk(const char *, struct nameidata *);
+static int link_path_walk(const char *, struct nameidata *, unsigned int);

static __always_inline void set_root_rcu(struct nameidata *nd)
{
@@ -820,7 +820,8 @@ static int may_linkat(struct path *link)
}

static __always_inline int
-follow_link(struct path *link, struct nameidata *nd, void **p)
+follow_link(struct path *link, struct nameidata *nd, unsigned int flags,
+ void **p)
{
struct dentry *dentry = link->dentry;
int error;
@@ -867,7 +868,7 @@ follow_link(struct path *link, struct nameidata *nd, void **p)
nd->flags |= LOOKUP_JUMPED;
}
nd->inode = nd->path.dentry->d_inode;
- error = link_path_walk(s, nd);
+ error = link_path_walk(s, nd, flags);
if (unlikely(error))
put_link(nd, link, *p);
}
@@ -1574,7 +1575,8 @@ out_err:
* Without that kind of total limit, nasty chains of consecutive
* symlinks can cause almost arbitrarily long lookups.
*/
-static inline int nested_symlink(struct path *path, struct nameidata *nd)
+static inline int nested_symlink(struct path *path, struct nameidata *nd,
+ unsigned int flags)
{
int res;

@@ -1592,7 +1594,7 @@ static inline int nested_symlink(struct path *path, struct nameidata *nd)
struct path link = *path;
void *cookie;

- res = follow_link(&link, nd, &cookie);
+ res = follow_link(&link, nd, flags, &cookie);
if (res)
break;
res = walk_component(nd, path, LOOKUP_FOLLOW);
@@ -1731,13 +1733,19 @@ static inline unsigned long hash_name(const char *name, unsigned int *hashp)
* Returns 0 and nd will have valid dentry and mnt on success.
* Returns error and drops reference to input namei data on failure.
*/
-static int link_path_walk(const char *name, struct nameidata *nd)
+static int link_path_walk(const char *name, struct nameidata *nd,
+ unsigned int flags)
{
struct path next;
int err;

- while (*name=='/')
+ while (*name == '/') {
+ if (flags & LOOKUP_BENEATH) {
+ err = -EACCES;
+ goto exit;
+ }
name++;
+ }
if (!*name)
return 0;

@@ -1759,6 +1767,10 @@ static int link_path_walk(const char *name, struct nameidata *nd)
if (name[0] == '.') switch (len) {
case 2:
if (name[1] == '.') {
+ if (flags & LOOKUP_BENEATH) {
+ err = -EACCES;
+ goto exit;
+ }
type = LAST_DOTDOT;
nd->flags |= LOOKUP_JUMPED;
}
@@ -1798,7 +1810,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
return err;

if (err) {
- err = nested_symlink(&next, nd);
+ err = nested_symlink(&next, nd, flags);
if (err)
return err;
}
@@ -1807,6 +1819,7 @@ static int link_path_walk(const char *name, struct nameidata *nd)
break;
}
}
+exit:
terminate_walk(nd);
return err;
}
@@ -1845,6 +1858,8 @@ static int path_init(int dfd, const char *name, unsigned int flags,

nd->m_seq = read_seqbegin(&mount_lock);
if (*name=='/') {
+ if (flags & LOOKUP_BENEATH)
+ return -EACCES;
if (flags & LOOKUP_RCU) {
rcu_read_lock();
set_root_rcu(nd);
@@ -1938,7 +1953,7 @@ static int path_lookupat(int dfd, const char *name,
return err;

current->total_link_count = 0;
- err = link_path_walk(name, nd);
+ err = link_path_walk(name, nd, flags);

if (!err && !(flags & LOOKUP_PARENT)) {
err = lookup_last(nd, &path);
@@ -1949,7 +1964,7 @@ static int path_lookupat(int dfd, const char *name,
if (unlikely(err))
break;
nd->flags |= LOOKUP_PARENT;
- err = follow_link(&link, nd, &cookie);
+ err = follow_link(&link, nd, flags, &cookie);
if (err)
break;
err = lookup_last(nd, &path);
@@ -2288,7 +2303,7 @@ path_mountpoint(int dfd, const char *name, struct path *path, unsigned int flags
return err;

current->total_link_count = 0;
- err = link_path_walk(name, &nd);
+ err = link_path_walk(name, &nd, flags);
if (err)
goto out;

@@ -2300,7 +2315,7 @@ path_mountpoint(int dfd, const char *name, struct path *path, unsigned int flags
if (unlikely(err))
break;
nd.flags |= LOOKUP_PARENT;
- err = follow_link(&link, &nd, &cookie);
+ err = follow_link(&link, &nd, flags, &cookie);
if (err)
break;
err = mountpoint_last(&nd, path);
@@ -3186,7 +3201,7 @@ static struct file *path_openat(int dfd, struct filename *pathname,
goto out;

current->total_link_count = 0;
- error = link_path_walk(pathname->name, nd);
+ error = link_path_walk(pathname->name, nd, flags);
if (unlikely(error))
goto out;

@@ -3205,7 +3220,7 @@ static struct file *path_openat(int dfd, struct filename *pathname,
break;
nd->flags |= LOOKUP_PARENT;
nd->flags &= ~(LOOKUP_OPEN|LOOKUP_CREATE|LOOKUP_EXCL);
- error = follow_link(&link, nd, &cookie);
+ error = follow_link(&link, nd, flags, &cookie);
if (unlikely(error))
break;
error = do_last(nd, &path, file, op, &opened, pathname);
diff --git a/fs/open.c b/fs/open.c
index 36662d036237..ee16be3a7291 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -875,7 +875,7 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
* If we have O_PATH in the open flag. Then we
* cannot have anything other than the below set of flags
*/
- flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH;
+ flags &= O_DIRECTORY | O_NOFOLLOW | O_PATH | O_BENEATH;
acc_mode = 0;
} else {
acc_mode = MAY_OPEN | ACC_MODE(flags);
@@ -906,6 +906,8 @@ static inline int build_open_flags(int flags, umode_t mode, struct open_flags *o
lookup_flags |= LOOKUP_DIRECTORY;
if (!(flags & O_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
+ if (flags & O_BENEATH)
+ lookup_flags |= LOOKUP_BENEATH;
op->lookup_flags = lookup_flags;
return 0;
}
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 492de72560fa..bd0615d1143b 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -39,6 +39,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};
#define LOOKUP_FOLLOW 0x0001
#define LOOKUP_DIRECTORY 0x0002
#define LOOKUP_AUTOMOUNT 0x0004
+#define LOOKUP_BENEATH 0x0008

#define LOOKUP_PARENT 0x0010
#define LOOKUP_REVAL 0x0020
diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
index 7543b3e51331..f63aa749a4fb 100644
--- a/include/uapi/asm-generic/fcntl.h
+++ b/include/uapi/asm-generic/fcntl.h
@@ -92,6 +92,10 @@
#define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
#define O_TMPFILE_MASK (__O_TMPFILE | O_DIRECTORY | O_CREAT)

+#ifndef O_BENEATH
+#define O_BENEATH 040000000 /* no / or .. in openat path */
+#endif
+
#ifndef O_NDELAY
#define O_NDELAY O_NONBLOCK
#endif
--
2.0.0.526.g5318336

2014-07-25 13:47:58

by David Drysdale

Subject: [PATCH 02/11] selftests: Add test of O_BENEATH & openat(2)

Add simple tests of openat(2) variations, including examples that
check the new O_BENEATH flag.

Signed-off-by: David Drysdale <[email protected]>
---
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/openat/.gitignore | 3 +
tools/testing/selftests/openat/Makefile | 24 +++++
tools/testing/selftests/openat/openat.c | 149 ++++++++++++++++++++++++++++++
4 files changed, 177 insertions(+)
create mode 100644 tools/testing/selftests/openat/.gitignore
create mode 100644 tools/testing/selftests/openat/Makefile
create mode 100644 tools/testing/selftests/openat/openat.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index e66e710cc595..a8e8c83f7992 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -11,6 +11,7 @@ TARGETS += vm
TARGETS += powerpc
TARGETS += user
TARGETS += sysctl
+TARGETS += openat

all:
for TARGET in $(TARGETS); do \
diff --git a/tools/testing/selftests/openat/.gitignore b/tools/testing/selftests/openat/.gitignore
new file mode 100644
index 000000000000..0a2446e89ad5
--- /dev/null
+++ b/tools/testing/selftests/openat/.gitignore
@@ -0,0 +1,3 @@
+openat
+subdir
+topfile
\ No newline at end of file
diff --git a/tools/testing/selftests/openat/Makefile b/tools/testing/selftests/openat/Makefile
new file mode 100644
index 000000000000..dc28ce943edf
--- /dev/null
+++ b/tools/testing/selftests/openat/Makefile
@@ -0,0 +1,24 @@
+CC = $(CROSS_COMPILE)gcc
+CFLAGS = -Wall
+BINARIES = openat
+DEPS = subdir topfile subdir/bottomfile subdir/symlinkup subdir/symlinkout
+all: $(BINARIES) $(DEPS)
+
+subdir:
+ mkdir -p subdir
+topfile:
+ echo 0123456789 > $@
+subdir/bottomfile: | subdir
+ echo 0123456789 > $@
+subdir/symlinkup:
+ ln -s ../topfile $@
+subdir/symlinkout:
+ ln -s /etc/passwd $@
+%: %.c
+ $(CC) $(CFLAGS) -o $@ $^
+
+run_tests: all
+ ./openat
+
+clean:
+ rm -rf $(BINARIES) $(DEPS)
diff --git a/tools/testing/selftests/openat/openat.c b/tools/testing/selftests/openat/openat.c
new file mode 100644
index 000000000000..146f02566cd8
--- /dev/null
+++ b/tools/testing/selftests/openat/openat.c
@@ -0,0 +1,149 @@
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <errno.h>
+
+#include <linux/fcntl.h>
+
+/* Bypass glibc */
+static int openat_(int dirfd, const char *pathname, int flags)
+{
+ return syscall(__NR_openat, dirfd, pathname, flags);
+}
+
+static int openat_or_die(int dfd, const char *path, int flags)
+{
+ int fd = openat_(dfd, path, flags);
+
+ if (fd < 0) {
+ printf("Failed to openat(%d, '%s'); "
+ "check prerequisites are available\n", dfd, path);
+ exit(1);
+ }
+ return fd;
+}
+
+static int check_openat(int dfd, const char *path, int flags)
+{
+ int rc;
+ int fd;
+ char buffer[4];
+
+ errno = 0;
+ printf("Check success of openat(%d, '%s', %x)... ",
+ dfd, path?:"(null)", flags);
+ fd = openat_(dfd, path, flags);
+ if (fd < 0) {
+ printf("[FAIL]: openat() failed, rc=%d errno=%d (%s)\n",
+ fd, errno, strerror(errno));
+ return 1;
+ }
+ errno = 0;
+ rc = read(fd, buffer, sizeof(buffer));
+ if (rc < 0) {
+ printf("[FAIL]: read() failed, rc=%d errno=%d (%s)\n",
+ rc, errno, strerror(errno));
+ return 1;
+ }
+ close(fd);
+ printf("[OK]\n");
+ return 0;
+}
+
+#define check_openat_fail(dfd, path, flags, errno) \
+ _check_openat_fail(dfd, path, flags, errno, #errno)
+static int _check_openat_fail(int dfd, const char *path, int flags,
+ int expected_errno, const char *errno_str)
+{
+ int rc;
+
+ errno = 0;
+ printf("Check failure of openat(%d, '%s', %x) with %s... ",
+ dfd, path?:"(null)", flags, errno_str);
+ rc = openat_(dfd, path, flags);
+ if (rc >= 0) {
+ printf("[FAIL] (unexpected success from openat(2))\n");
+ close(rc);
+ return 1;
+ }
+ if (errno != expected_errno) {
+ printf("[FAIL] (expected errno %d (%s) not %d (%s))\n",
+ expected_errno, strerror(expected_errno),
+ errno, strerror(errno));
+ return 1;
+ }
+ printf("[OK]\n");
+ return 0;
+}
+
+int main(int argc, char *argv[])
+{
+ int fail = 0;
+ int dot_dfd = openat_or_die(AT_FDCWD, ".", O_RDONLY);
+ int subdir_dfd = openat_or_die(AT_FDCWD, "subdir", O_RDONLY);
+ int file_fd = openat_or_die(AT_FDCWD, "topfile", O_RDONLY);
+
+ /* Sanity check normal behavior */
+ fail |= check_openat(AT_FDCWD, "topfile", O_RDONLY);
+ fail |= check_openat(AT_FDCWD, "subdir/bottomfile", O_RDONLY);
+
+ fail |= check_openat(dot_dfd, "topfile", O_RDONLY);
+ fail |= check_openat(dot_dfd, "subdir/bottomfile", O_RDONLY);
+ fail |= check_openat(dot_dfd, "subdir/../topfile", O_RDONLY);
+
+ fail |= check_openat(subdir_dfd, "../topfile", O_RDONLY);
+ fail |= check_openat(subdir_dfd, "bottomfile", O_RDONLY);
+ fail |= check_openat(subdir_dfd, "../subdir/bottomfile", O_RDONLY);
+ fail |= check_openat(subdir_dfd, "symlinkup", O_RDONLY);
+ fail |= check_openat(subdir_dfd, "symlinkout", O_RDONLY);
+
+ fail |= check_openat(AT_FDCWD, "/etc/passwd", O_RDONLY);
+ fail |= check_openat(dot_dfd, "/etc/passwd", O_RDONLY);
+ fail |= check_openat(subdir_dfd, "/etc/passwd", O_RDONLY);
+
+ fail |= check_openat_fail(AT_FDCWD, "bogus", O_RDONLY, ENOENT);
+ fail |= check_openat_fail(dot_dfd, "bogus", O_RDONLY, ENOENT);
+ fail |= check_openat_fail(999, "bogus", O_RDONLY, EBADF);
+ fail |= check_openat_fail(file_fd, "bogus", O_RDONLY, ENOTDIR);
+
+#ifdef O_BENEATH
+ /* Test out O_BENEATH */
+ fail |= check_openat(AT_FDCWD, "topfile", O_RDONLY|O_BENEATH);
+ fail |= check_openat(AT_FDCWD, "subdir/bottomfile",
+ O_RDONLY|O_BENEATH);
+
+ fail |= check_openat(dot_dfd, "topfile", O_RDONLY|O_BENEATH);
+ fail |= check_openat(dot_dfd, "subdir/bottomfile",
+ O_RDONLY|O_BENEATH);
+ fail |= check_openat(subdir_dfd, "bottomfile", O_RDONLY|O_BENEATH);
+
+ /* Can't open paths with ".." in them */
+ fail |= check_openat_fail(dot_dfd, "subdir/../topfile",
+ O_RDONLY|O_BENEATH, EACCES);
+ fail |= check_openat_fail(subdir_dfd, "../topfile",
+ O_RDONLY|O_BENEATH, EACCES);
+ fail |= check_openat_fail(subdir_dfd, "../subdir/bottomfile",
+ O_RDONLY|O_BENEATH, EACCES);
+
+ /* Can't open paths starting with "/" */
+ fail |= check_openat_fail(AT_FDCWD, "/etc/passwd",
+ O_RDONLY|O_BENEATH, EACCES);
+ fail |= check_openat_fail(dot_dfd, "/etc/passwd",
+ O_RDONLY|O_BENEATH, EACCES);
+ fail |= check_openat_fail(subdir_dfd, "/etc/passwd",
+ O_RDONLY|O_BENEATH, EACCES);
+ /* Can't sneak around constraints with symlinks */
+ fail |= check_openat_fail(subdir_dfd, "symlinkup",
+ O_RDONLY|O_BENEATH, EACCES);
+ fail |= check_openat_fail(subdir_dfd, "symlinkout",
+ O_RDONLY|O_BENEATH, EACCES);
+#else
+ printf("Skipping O_BENEATH tests due to missing #define\n");
+#endif
+
+ return fail ? -1 : 0;
+}
--
2.0.0.526.g5318336

2014-07-25 13:48:22

by David Drysdale

Subject: [PATCH 05/11] capsicum: convert callers to use fgetr() etc

Convert places that use fget()-like functions to use the
equivalent fgetr() variant instead.

Annotate each such call with an indication of what operations will
be performed on the retrieved struct file, to allow future policing
of rights associated with file descriptors.

Also change each call site to cope with an ERR_PTR return from
fgetr() rather than a plain NULL failure from fget().

Signed-off-by: David Drysdale <[email protected]>
---
arch/alpha/kernel/osf_sys.c | 6 +-
arch/ia64/kernel/perfmon.c | 54 +++++++-----
arch/parisc/hpux/fs.c | 6 +-
arch/powerpc/kvm/powerpc.c | 4 +-
arch/powerpc/platforms/cell/spu_syscalls.c | 15 ++--
drivers/base/dma-buf.c | 6 +-
drivers/block/loop.c | 14 +--
drivers/block/nbd.c | 2 +-
drivers/infiniband/core/ucma.c | 4 +-
drivers/infiniband/core/uverbs_cmd.c | 6 +-
drivers/infiniband/core/uverbs_main.c | 4 +-
drivers/infiniband/hw/usnic/usnic_transport.c | 2 +-
drivers/md/md.c | 8 +-
drivers/staging/android/sync.c | 2 +-
drivers/staging/lustre/lustre/llite/file.c | 6 +-
drivers/staging/lustre/lustre/lmv/lmv_obd.c | 7 +-
drivers/staging/lustre/lustre/mdc/lproc_mdc.c | 7 +-
drivers/staging/lustre/lustre/mdc/mdc_request.c | 4 +-
drivers/vfio/pci/vfio_pci.c | 7 +-
drivers/vfio/pci/vfio_pci_intrs.c | 6 +-
drivers/vfio/vfio.c | 6 +-
drivers/vhost/net.c | 6 +-
drivers/video/fbdev/msm/mdp.c | 5 +-
fs/aio.c | 37 +++++++-
fs/autofs4/dev-ioctl.c | 17 ++--
fs/autofs4/inode.c | 4 +-
fs/btrfs/ioctl.c | 21 +++--
fs/btrfs/send.c | 7 +-
fs/cifs/ioctl.c | 6 +-
fs/coda/inode.c | 4 +-
fs/coda/psdev.c | 2 +-
fs/compat.c | 18 ++--
fs/compat_ioctl.c | 14 ++-
fs/eventfd.c | 18 ++--
fs/eventpoll.c | 19 +++--
fs/ext4/ioctl.c | 6 +-
fs/fcntl.c | 101 ++++++++++++++++++++--
fs/fhandle.c | 7 +-
fs/fuse/inode.c | 10 ++-
fs/ioctl.c | 12 ++-
fs/locks.c | 8 +-
fs/notify/fanotify/fanotify_user.c | 16 ++--
fs/notify/inotify/inotify_user.c | 12 +--
fs/ocfs2/cluster/heartbeat.c | 8 +-
fs/open.c | 42 +++++----
fs/proc/namespaces.c | 6 +-
fs/read_write.c | 109 ++++++++++++++----------
fs/readdir.c | 18 ++--
fs/select.c | 11 ++-
fs/signalfd.c | 7 +-
fs/splice.c | 35 +++++---
fs/stat.c | 10 ++-
fs/statfs.c | 9 +-
fs/sync.c | 21 +++--
fs/timerfd.c | 40 +++++++--
fs/utimes.c | 10 ++-
fs/xattr.c | 26 +++---
fs/xfs/xfs_ioctl.c | 14 +--
ipc/mqueue.c | 30 +++----
kernel/events/core.c | 15 ++--
kernel/module.c | 10 ++-
kernel/sys.c | 6 +-
kernel/taskstats.c | 4 +-
kernel/time/posix-clock.c | 27 +++---
mm/fadvise.c | 7 +-
mm/internal.h | 19 +++++
mm/memcontrol.c | 12 +--
mm/mmap.c | 7 +-
mm/nommu.c | 10 ++-
mm/readahead.c | 6 +-
net/9p/trans_fd.c | 10 +--
sound/core/pcm_native.c | 10 ++-
virt/kvm/eventfd.c | 6 +-
virt/kvm/vfio.c | 12 +--
74 files changed, 698 insertions(+), 385 deletions(-)

diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index 1402fcc11c2c..8f2d9597096b 100644
--- a/arch/alpha/kernel/osf_sys.c
+++ b/arch/alpha/kernel/osf_sys.c
@@ -146,7 +146,7 @@ SYSCALL_DEFINE4(osf_getdirentries, unsigned int, fd,
long __user *, basep)
{
int error;
- struct fd arg = fdget(fd);
+ struct fd arg = fdgetr(fd, CAP_READ);
struct osf_dirent_callback buf = {
.ctx.actor = osf_filldir,
.dirent = dirent,
@@ -154,8 +154,8 @@ SYSCALL_DEFINE4(osf_getdirentries, unsigned int, fd,
.count = count
};

- if (!arg.file)
- return -EBADF;
+ if (IS_ERR(arg.file))
+ return PTR_ERR(arg.file);

error = iterate_dir(arg.file, &buf.ctx);
if (error >= 0)
diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index 5845ffea67c3..2c214b4ddea0 100644
--- a/arch/ia64/kernel/perfmon.c
+++ b/arch/ia64/kernel/perfmon.c
@@ -471,6 +471,7 @@ typedef struct {
int cmd_flags;
unsigned int cmd_narg;
size_t cmd_argsize;
+ u64 cmd_right;
int (*cmd_getsize)(void *arg, size_t *sz);
} pfm_cmd_desc_t;

@@ -4620,31 +4621,40 @@ pfm_exit_thread(struct task_struct *task)
/*
* functions MUST be listed in the increasing order of their index (see permfon.h)
*/
-#define PFM_CMD(name, flags, arg_count, arg_type, getsz) { name, #name, flags, arg_count, sizeof(arg_type), getsz }
-#define PFM_CMD_S(name, flags) { name, #name, flags, 0, 0, NULL }
+#define PFM_CMD(name, flags, arg_count, arg_type, right, getsz) \
+ { name, #name, flags, arg_count, sizeof(arg_type), right, getsz }
+#define PFM_CMD_S(name, flags, right) \
+ { name, #name, flags, 0, 0, right, NULL }
#define PFM_CMD_PCLRWS (PFM_CMD_FD|PFM_CMD_ARG_RW|PFM_CMD_STOP)
#define PFM_CMD_PCLRW (PFM_CMD_FD|PFM_CMD_ARG_RW)
-#define PFM_CMD_NONE { NULL, "no-cmd", 0, 0, 0, NULL}
+#define PFM_CMD_NONE { NULL, "no-cmd", 0, 0, 0, 0, NULL}

static pfm_cmd_desc_t pfm_cmd_tab[]={
/* 0 */PFM_CMD_NONE,
-/* 1 */PFM_CMD(pfm_write_pmcs, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_reg_t, NULL),
-/* 2 */PFM_CMD(pfm_write_pmds, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_reg_t, NULL),
-/* 3 */PFM_CMD(pfm_read_pmds, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_reg_t, NULL),
-/* 4 */PFM_CMD_S(pfm_stop, PFM_CMD_PCLRWS),
-/* 5 */PFM_CMD_S(pfm_start, PFM_CMD_PCLRWS),
+/* 1 */PFM_CMD(pfm_write_pmcs, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_reg_t,
+ CAP_WRITE, NULL),
+/* 2 */PFM_CMD(pfm_write_pmds, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_reg_t,
+ CAP_WRITE, NULL),
+/* 3 */PFM_CMD(pfm_read_pmds, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_reg_t,
+ CAP_READ, NULL),
+/* 4 */PFM_CMD_S(pfm_stop, PFM_CMD_PCLRWS, CAP_PERFMON),
+/* 5 */PFM_CMD_S(pfm_start, PFM_CMD_PCLRWS, CAP_PERFMON),
/* 6 */PFM_CMD_NONE,
/* 7 */PFM_CMD_NONE,
-/* 8 */PFM_CMD(pfm_context_create, PFM_CMD_ARG_RW, 1, pfarg_context_t, pfm_ctx_getsize),
+/* 8 */PFM_CMD(pfm_context_create, PFM_CMD_ARG_RW, 1, pfarg_context_t,
+ CAP_PERFMON, pfm_ctx_getsize),
/* 9 */PFM_CMD_NONE,
-/* 10 */PFM_CMD_S(pfm_restart, PFM_CMD_PCLRW),
+/* 10 */PFM_CMD_S(pfm_restart, PFM_CMD_PCLRW, CAP_PERFMON),
/* 11 */PFM_CMD_NONE,
-/* 12 */PFM_CMD(pfm_get_features, PFM_CMD_ARG_RW, 1, pfarg_features_t, NULL),
-/* 13 */PFM_CMD(pfm_debug, 0, 1, unsigned int, NULL),
+/* 12 */PFM_CMD(pfm_get_features, PFM_CMD_ARG_RW, 1, pfarg_features_t,
+ CAP_READ, NULL),
+/* 13 */PFM_CMD(pfm_debug, 0, 1, unsigned int, CAP_PERFMON, NULL),
/* 14 */PFM_CMD_NONE,
-/* 15 */PFM_CMD(pfm_get_pmc_reset, PFM_CMD_ARG_RW, PFM_CMD_ARG_MANY, pfarg_reg_t, NULL),
-/* 16 */PFM_CMD(pfm_context_load, PFM_CMD_PCLRWS, 1, pfarg_load_t, NULL),
-/* 17 */PFM_CMD_S(pfm_context_unload, PFM_CMD_PCLRWS),
+/* 15 */PFM_CMD(pfm_get_pmc_reset, PFM_CMD_ARG_RW, PFM_CMD_ARG_MANY,
+ pfarg_reg_t, CAP_READ, NULL),
+/* 16 */PFM_CMD(pfm_context_load, PFM_CMD_PCLRWS, 1, pfarg_load_t,
+ CAP_READ, NULL),
+/* 17 */PFM_CMD_S(pfm_context_unload, PFM_CMD_PCLRWS, CAP_READ),
/* 18 */PFM_CMD_NONE,
/* 19 */PFM_CMD_NONE,
/* 20 */PFM_CMD_NONE,
@@ -4659,8 +4669,10 @@ static pfm_cmd_desc_t pfm_cmd_tab[]={
/* 29 */PFM_CMD_NONE,
/* 30 */PFM_CMD_NONE,
/* 31 */PFM_CMD_NONE,
-/* 32 */PFM_CMD(pfm_write_ibrs, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_dbreg_t, NULL),
-/* 33 */PFM_CMD(pfm_write_dbrs, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_dbreg_t, NULL)
+/* 32 */PFM_CMD(pfm_write_ibrs, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_dbreg_t,
+ CAP_WRITE, NULL),
+/* 33 */PFM_CMD(pfm_write_dbrs, PFM_CMD_PCLRWS, PFM_CMD_ARG_MANY, pfarg_dbreg_t,
+ CAP_WRITE, NULL)
};
#define PFM_CMD_COUNT (sizeof(pfm_cmd_tab)/sizeof(pfm_cmd_desc_t))

@@ -4866,13 +4878,13 @@ restart_args:

if (unlikely((cmd_flags & PFM_CMD_FD) == 0)) goto skip_fd;

- ret = -EBADF;
-
- f = fdget(fd);
- if (unlikely(f.file == NULL)) {
+ f = fdgetr(fd, pfm_cmd_tab[cmd].cmd_right);
+ if (unlikely(IS_ERR(f.file))) {
DPRINT(("invalid fd %d\n", fd));
+ ret = PTR_ERR(f.file);
goto error_args;
}
+ ret = -EBADF;
if (unlikely(PFM_IS_FILE(f.file) == 0)) {
DPRINT(("fd %d not related to perfmon\n", fd));
goto error_args;
diff --git a/arch/parisc/hpux/fs.c b/arch/parisc/hpux/fs.c
index 2bedafea3d94..8a48b5d4bb15 100644
--- a/arch/parisc/hpux/fs.c
+++ b/arch/parisc/hpux/fs.c
@@ -105,9 +105,9 @@ int hpux_getdents(unsigned int fd, struct hpux_dirent __user *dirent, unsigned i
};
int error;

- arg = fdget(fd);
- if (!arg.file)
- return -EBADF;
+ arg = fdgetr(fd, CAP_READ);
+ if (IS_ERR(arg.file))
+ return PTR_ERR(arg.file);

error = iterate_dir(arg.file, &buf.ctx);
if (error >= 0)
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 61c738ab1283..39aa7a9f573e 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -933,7 +933,7 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
struct kvm_device *dev;

r = -EBADF;
- f = fdget(cap->args[0]);
+ f = fdgetr(cap->args[0], CAP_FSTAT);
- if (!f.file)
+ if (IS_ERR(f.file))
break;

@@ -952,7 +952,7 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
struct kvm_device *dev;

r = -EBADF;
- f = fdget(cap->args[0]);
+ f = fdgetr(cap->args[0], CAP_FSTAT);
- if (!f.file)
+ if (IS_ERR(f.file))
break;

diff --git a/arch/powerpc/platforms/cell/spu_syscalls.c b/arch/powerpc/platforms/cell/spu_syscalls.c
index 5e6e0bad6db6..6c3f10603c0f 100644
--- a/arch/powerpc/platforms/cell/spu_syscalls.c
+++ b/arch/powerpc/platforms/cell/spu_syscalls.c
@@ -77,11 +77,13 @@ SYSCALL_DEFINE4(spu_create, const char __user *, name, unsigned int, flags,
return -ENOSYS;

if (flags & SPU_CREATE_AFFINITY_SPU) {
- struct fd neighbor = fdget(neighbor_fd);
- ret = -EBADF;
- if (neighbor.file) {
+ struct fd neighbor = fdgetr(neighbor_fd, CAP_READ, CAP_WRITE,
+ CAP_MAPEXEC);
+ if (!IS_ERR(neighbor.file)) {
ret = calls->create_thread(name, flags, mode, neighbor.file);
fdput(neighbor);
+ } else {
+ ret = PTR_ERR(neighbor.file);
}
} else
ret = calls->create_thread(name, flags, mode, NULL);
@@ -100,11 +102,12 @@ asmlinkage long sys_spu_run(int fd, __u32 __user *unpc, __u32 __user *ustatus)
if (!calls)
return -ENOSYS;

- ret = -EBADF;
- arg = fdget(fd);
- if (arg.file) {
+ arg = fdgetr(fd, CAP_READ, CAP_WRITE, CAP_MAPEXEC);
+ if (!IS_ERR(arg.file)) {
ret = calls->spu_run(arg.file, unpc, ustatus);
fdput(arg);
+ } else {
+ ret = PTR_ERR(arg.file);
}

spufs_calls_put(calls);
diff --git a/drivers/base/dma-buf.c b/drivers/base/dma-buf.c
index 840c7fa80983..0e1cea1498ec 100644
--- a/drivers/base/dma-buf.c
+++ b/drivers/base/dma-buf.c
@@ -216,10 +216,10 @@ struct dma_buf *dma_buf_get(int fd)
{
struct file *file;

- file = fget(fd);
+ file = fgetr(fd, CAP_MMAP, CAP_READ, CAP_WRITE);

- if (!file)
- return ERR_PTR(-EBADF);
+ if (IS_ERR(file))
+ return (struct dma_buf *)file;

if (!is_dma_buf_file(file)) {
fput(file);
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 6cb1beb47c25..b589b8c4d2e9 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -652,10 +652,11 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
if (!(lo->lo_flags & LO_FLAGS_READ_ONLY))
goto out;

- error = -EBADF;
- file = fget(arg);
- if (!file)
+ file = fgetr(arg, CAP_PWRITE, CAP_PREAD, CAP_FSYNC, CAP_FSTAT);
+ if (IS_ERR(file)) {
+ error = PTR_ERR(file);
goto out;
+ }

inode = file->f_mapping->host;
old_file = lo->lo_backing_file;
@@ -834,10 +835,11 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
/* This is safe, since we have a reference from open(). */
__module_get(THIS_MODULE);

- error = -EBADF;
- file = fget(arg);
- if (!file)
+ file = fgetr(arg, CAP_PWRITE, CAP_PREAD, CAP_FSYNC, CAP_FSTAT);
+ if (IS_ERR(file)) {
+ error = PTR_ERR(file);
goto out;
+ }

error = -EBUSY;
if (lo->lo_state != Lo_unbound)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index fb31b8ee4372..08381e2049b6 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -651,7 +651,7 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
nbd->disconnect = 0; /* we're connected now */
return 0;
}
- return -EINVAL;
+ return err;
}

case NBD_SET_BLKSIZE:
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 56a4b7ca7ee3..b3b0b1aea8aa 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -1406,8 +1406,8 @@ static ssize_t ucma_migrate_id(struct ucma_file *new_file,
return -EFAULT;

/* Get current fd to protect against it being closed */
- f = fdget(cmd.fd);
- if (!f.file)
+ f = fdgetr(cmd.fd, CAP_READ, CAP_WRITE, CAP_POLL_EVENT);
+ if (IS_ERR(f.file))
return -ENOENT;

/* Validate current fd and prevent destruction of id. */
diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index ea6203ee7bcc..06db7ca75e1f 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -719,9 +719,9 @@ ssize_t ib_uverbs_open_xrcd(struct ib_uverbs_file *file,

if (cmd.fd != -1) {
/* search for file descriptor */
- f = fdget(cmd.fd);
- if (!f.file) {
- ret = -EBADF;
+ f = fdgetr(cmd.fd, CAP_FSTAT);
+ if (IS_ERR(f.file)) {
+ ret = PTR_ERR(f.file);
goto err_tree_mutex_unlock;
}

diff --git a/drivers/infiniband/core/uverbs_main.c b/drivers/infiniband/core/uverbs_main.c
index 08219fb3338b..edaf0693ab12 100644
--- a/drivers/infiniband/core/uverbs_main.c
+++ b/drivers/infiniband/core/uverbs_main.c
@@ -566,9 +566,9 @@ struct file *ib_uverbs_alloc_event_file(struct ib_uverbs_file *uverbs_file,
struct ib_uverbs_event_file *ib_uverbs_lookup_comp_file(int fd)
{
struct ib_uverbs_event_file *ev_file = NULL;
- struct fd f = fdget(fd);
+ struct fd f = fdgetr(fd, CAP_LIST_END);

- if (!f.file)
+ if (IS_ERR(f.file))
return NULL;

if (f.file->f_op != &uverbs_event_fops)
diff --git a/drivers/infiniband/hw/usnic/usnic_transport.c b/drivers/infiniband/hw/usnic/usnic_transport.c
index ddef6f77a78c..5e2265792b83 100644
--- a/drivers/infiniband/hw/usnic/usnic_transport.c
+++ b/drivers/infiniband/hw/usnic/usnic_transport.c
@@ -134,7 +134,7 @@ struct socket *usnic_transport_get_socket(int sock_fd)
char buf[25];

/* sockfd_lookup will internally do a fget */
- sock = sockfd_lookup(sock_fd, &err);
+ sock = sockfd_lookupr(sock_fd, &err, CAP_SOCK_SERVER);
if (!sock) {
usnic_err("Unable to lookup socket for fd %d with err %d\n",
sock_fd, err);
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 32fc19c540d4..aa031d9d3d06 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5973,12 +5973,14 @@ static int set_bitmap_file(struct mddev *mddev, int fd)
struct inode *inode;
if (mddev->bitmap)
return -EEXIST; /* cannot add when bitmap is present */
- mddev->bitmap_info.file = fget(fd);
+ mddev->bitmap_info.file = fgetr(fd, CAP_READ);

- if (mddev->bitmap_info.file == NULL) {
+ if (IS_ERR(mddev->bitmap_info.file)) {
+ err = PTR_ERR(mddev->bitmap_info.file);
+ mddev->bitmap_info.file = NULL;
printk(KERN_ERR "%s: error: failed to get bitmap file\n",
mdname(mddev));
- return -EBADF;
+ return err;
}

inode = mddev->bitmap_info.file->f_mapping->host;
diff --git a/drivers/staging/android/sync.c b/drivers/staging/android/sync.c
index 18174f7c871c..cdc4b935de35 100644
--- a/drivers/staging/android/sync.c
+++ b/drivers/staging/android/sync.c
@@ -406,7 +406,7 @@ static void sync_fence_free_pts(struct sync_fence *fence)

struct sync_fence *sync_fence_fdget(int fd)
{
- struct file *file = fget(fd);
+ struct file *file = fgetr(fd, CAP_IOCTL);

- if (file == NULL)
+ if (IS_ERR(file))
return NULL;
diff --git a/drivers/staging/lustre/lustre/llite/file.c b/drivers/staging/lustre/lustre/llite/file.c
index 716e1ee0104f..76ab1a7bc335 100644
--- a/drivers/staging/lustre/lustre/llite/file.c
+++ b/drivers/staging/lustre/lustre/llite/file.c
@@ -2186,9 +2186,9 @@ ll_file_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
if ((file->f_flags & O_ACCMODE) == 0) /* O_RDONLY */
return -EPERM;

- file2 = fget(lsl.sl_fd);
- if (file2 == NULL)
- return -EBADF;
+ file2 = fgetr(lsl.sl_fd, CAP_FSTAT);
+ if (IS_ERR(file2))
+ return PTR_ERR(file2);

rc = -EPERM;
if ((file2->f_flags & O_ACCMODE) != 0) /* O_WRONLY or O_RDWR */
diff --git a/drivers/staging/lustre/lustre/lmv/lmv_obd.c b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
index 4edf8a31221c..3dd88ece2608 100644
--- a/drivers/staging/lustre/lustre/lmv/lmv_obd.c
+++ b/drivers/staging/lustre/lustre/lmv/lmv_obd.c
@@ -880,10 +880,9 @@ static int lmv_hsm_ct_register(struct lmv_obd *lmv, unsigned int cmd, int len,
return -ENOTCONN;

/* at least one registration done, with no failure */
- filp = fget(lk->lk_wfd);
- if (filp == NULL) {
- return -EBADF;
- }
+ filp = fgetr(lk->lk_wfd, CAP_READ);
+ if (IS_ERR(filp))
+ return PTR_ERR(filp);
rc = libcfs_kkuc_group_add(filp, lk->lk_uid, lk->lk_group, lk->lk_data);
if (rc != 0 && filp != NULL)
fput(filp);
diff --git a/drivers/staging/lustre/lustre/mdc/lproc_mdc.c b/drivers/staging/lustre/lustre/mdc/lproc_mdc.c
index 2663480a68c5..c30fd463b5e3 100644
--- a/drivers/staging/lustre/lustre/mdc/lproc_mdc.c
+++ b/drivers/staging/lustre/lustre/mdc/lproc_mdc.c
@@ -130,9 +130,13 @@ static ssize_t mdc_kuc_write(struct file *file, const char *buffer,
if (fd == 0) {
rc = libcfs_kkuc_group_put(KUC_GRP_HSM, lh);
} else {
- struct file *fp = fget(fd);
+ struct file *fp = fgetr(fd, CAP_WRITE);

- rc = libcfs_kkuc_msg_put(fp, lh);
- fput(fp);
+ if (IS_ERR(fp)) {
+ rc = PTR_ERR(fp);
+ } else {
+ rc = libcfs_kkuc_msg_put(fp, lh);
+ fput(fp);
+ }
}
OBD_FREE(lh, len);
diff --git a/drivers/staging/lustre/lustre/mdc/mdc_request.c b/drivers/staging/lustre/lustre/mdc/mdc_request.c
index fca43cf1d671..1cd4b9e89619 100644
--- a/drivers/staging/lustre/lustre/mdc/mdc_request.c
+++ b/drivers/staging/lustre/lustre/mdc/mdc_request.c
@@ -1606,7 +1606,9 @@ static int mdc_ioc_changelog_send(struct obd_device *obd,
cs->cs_obd = obd;
cs->cs_startrec = icc->icc_recno;
/* matching fput in mdc_changelog_send_thread */
- cs->cs_fp = fget(icc->icc_id);
+ cs->cs_fp = fgetr(icc->icc_id, CAP_WRITE);
+ if (IS_ERR(cs->cs_fp))
+ cs->cs_fp = NULL;
cs->cs_flags = icc->icc_flags;

/*
diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 010e0f8b8e4f..b264a0a947bc 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -638,9 +638,10 @@ reset_info_exit:
*/
for (i = 0; i < hdr.count; i++) {
struct vfio_group *group;
- struct fd f = fdget(group_fds[i]);
- if (!f.file) {
- ret = -EBADF;
+ struct fd f = fdgetr(group_fds[i], CAP_FSTAT);
+
+ if (IS_ERR(f.file)) {
+ ret = PTR_ERR(f.file);
break;
}

diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 9dd49c9839ac..4591feea9004 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -149,9 +149,9 @@ static int virqfd_enable(struct vfio_pci_device *vdev,
INIT_WORK(&virqfd->shutdown, virqfd_shutdown);
INIT_WORK(&virqfd->inject, virqfd_inject);

- irqfd = fdget(fd);
- if (!irqfd.file) {
- ret = -EBADF;
+ irqfd = fdgetr(fd, CAP_WRITE);
+ if (IS_ERR(irqfd.file)) {
+ ret = PTR_ERR(irqfd.file);
goto err_fd;
}

diff --git a/drivers/vfio/vfio.c b/drivers/vfio/vfio.c
index f018d8d0f975..02cc422c19ef 100644
--- a/drivers/vfio/vfio.c
+++ b/drivers/vfio/vfio.c
@@ -1036,9 +1036,9 @@ static int vfio_group_set_container(struct vfio_group *group, int container_fd)
if (atomic_read(&group->container_users))
return -EINVAL;

- f = fdget(container_fd);
- if (!f.file)
- return -EBADF;
+ f = fdgetr(container_fd, CAP_LIST_END);
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);

/* Sanity check, is this really our fd? */
if (f.file->f_op != &vfio_fops) {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8dae2f724a35..8f552d2b637e 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -871,11 +871,11 @@ err:

static struct socket *get_tap_socket(int fd)
{
- struct file *file = fget(fd);
+ struct file *file = fgetr(fd, CAP_READ, CAP_WRITE);
struct socket *sock;

- if (!file)
- return ERR_PTR(-EBADF);
+ if (IS_ERR(file))
+ return ERR_PTR(PTR_ERR(file));
sock = tun_get_socket(file);
if (!IS_ERR(sock))
return sock;
diff --git a/drivers/video/fbdev/msm/mdp.c b/drivers/video/fbdev/msm/mdp.c
index 113c7876c855..90c2c7f2c6c4 100644
--- a/drivers/video/fbdev/msm/mdp.c
+++ b/drivers/video/fbdev/msm/mdp.c
@@ -257,8 +257,9 @@ int get_img(struct mdp_img *img, struct fb_info *info,
struct file **filep)
{
int ret = 0;
- struct fd f = fdget(img->memory_id);
- if (f.file == NULL)
+ struct fd f = fdgetr(img->memory_id, CAP_FSTAT);
+
+ if (IS_ERR(f.file))
return -1;

if (MAJOR(file_inode(f.file)->i_rdev) == FB_MAJOR) {
diff --git a/fs/aio.c b/fs/aio.c
index 955947ef3e02..d69f35e49784 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1404,10 +1404,38 @@ rw_common:
return 0;
}

+static struct capsicum_rights *
+aio_opcode_rights(struct capsicum_rights *rights, int opcode)
+{
+ switch (opcode) {
+ case IOCB_CMD_PREAD:
+ case IOCB_CMD_PREADV:
+ cap_rights_init(rights, CAP_PREAD);
+ break;
+
+ case IOCB_CMD_PWRITE:
+ case IOCB_CMD_PWRITEV:
+ cap_rights_init(rights, CAP_PWRITE);
+ break;
+
+ case IOCB_CMD_FSYNC:
+ case IOCB_CMD_FDSYNC:
+ cap_rights_init(rights, CAP_FSYNC);
+ break;
+
+ default:
+ cap_rights_init(rights, CAP_PREAD, CAP_PWRITE, CAP_POLL_EVENT,
+ CAP_FSYNC);
+ break;
+ }
+ return rights;
+}
+
static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
struct iocb *iocb, bool compat)
{
struct kiocb *req;
+ struct capsicum_rights rights;
ssize_t ret;

/* enforce forwards compatibility on users */
@@ -1430,9 +1458,12 @@ static int io_submit_one(struct kioctx *ctx, struct iocb __user *user_iocb,
if (unlikely(!req))
return -EAGAIN;

- req->ki_filp = fget(iocb->aio_fildes);
- if (unlikely(!req->ki_filp)) {
- ret = -EBADF;
+ req->ki_filp = fget_rights(iocb->aio_fildes,
+ aio_opcode_rights(&rights,
+ iocb->aio_lio_opcode));
+ if (unlikely(IS_ERR(req->ki_filp))) {
+ ret = PTR_ERR(req->ki_filp);
+ req->ki_filp = NULL;
goto out_put_req;
}

diff --git a/fs/autofs4/dev-ioctl.c b/fs/autofs4/dev-ioctl.c
index 5b570b6efa28..0d08289a1b79 100644
--- a/fs/autofs4/dev-ioctl.c
+++ b/fs/autofs4/dev-ioctl.c
@@ -371,9 +371,9 @@ static int autofs_dev_ioctl_setpipefd(struct file *fp,
goto out;
}

- pipe = fget(pipefd);
- if (!pipe) {
- err = -EBADF;
+ pipe = fgetr(pipefd, CAP_READ, CAP_WRITE, CAP_FSYNC);
+ if (IS_ERR(pipe)) {
+ err = PTR_ERR(pipe);
goto out;
}
if (autofs_prepare_pipe(pipe) < 0) {
@@ -665,11 +665,16 @@ static int _autofs_dev_ioctl(unsigned int command, struct autofs_dev_ioctl __use
*/
if (cmd != AUTOFS_DEV_IOCTL_OPENMOUNT_CMD &&
cmd != AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD) {
- fp = fget(param->ioctlfd);
- if (!fp) {
+ struct capsicum_rights rights;
+
+ cap_rights_init(&rights, CAP_IOCTL, CAP_FSTAT);
+ rights.nioctls = 1;
+ rights.ioctls = &cmd;
+ fp = fget_rights(param->ioctlfd, &rights);
+ if (IS_ERR(fp)) {
if (cmd == AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD)
goto cont;
- err = -EBADF;
+ err = PTR_ERR(fp);
goto out;
}

diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index 1c55388ae633..5330ef636410 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -305,9 +305,9 @@ int autofs4_fill_super(struct super_block *s, void *data, int silent)
sbi->sub_version = AUTOFS_PROTO_SUBVERSION;

DPRINTK("pipe fd = %d, pgrp = %u", pipefd, pid_nr(sbi->oz_pgrp));
- pipe = fget(pipefd);
+ pipe = fgetr(pipefd, CAP_WRITE, CAP_FSYNC);

- if (!pipe) {
+ if (IS_ERR(pipe)) {
printk("autofs: could not open pipe file descriptor\n");
goto fail_dput;
}
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 47aceb494d1d..2de27f13d0ee 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1666,10 +1666,12 @@ static noinline int btrfs_ioctl_snap_create_transid(struct file *file,
ret = btrfs_mksubvol(&file->f_path, name, namelen,
NULL, transid, readonly, inherit);
} else {
- struct fd src = fdget(fd);
+ struct fd src = fdgetr(fd, CAP_FSTAT);
struct inode *src_inode;
- if (!src.file) {
- ret = -EINVAL;
+ if (IS_ERR(src.file)) {
+ ret = PTR_ERR(src.file);
+ if (ret == -EBADF)
+ ret = -EINVAL;
goto out_drop_write;
}

@@ -3040,9 +3042,10 @@ static long btrfs_ioctl_file_extent_same(struct file *file,

for (i = 0, info = same->info; i < count; i++, info++) {
struct inode *dst;
- struct fd dst_file = fdget(info->fd);
- if (!dst_file.file) {
- info->status = -EBADF;
+ struct fd dst_file = fdgetr(info->fd, CAP_FSTAT);
+
+ if (IS_ERR(dst_file.file)) {
+ info->status = PTR_ERR(dst_file.file);
continue;
}
dst = file_inode(dst_file.file);
@@ -3611,9 +3614,9 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
if (ret)
return ret;

- src_file = fdget(srcfd);
- if (!src_file.file) {
- ret = -EBADF;
+ src_file = fdgetr(srcfd, CAP_FSTAT);
+ if (IS_ERR(src_file.file)) {
+ ret = PTR_ERR(src_file.file);
goto out_drop_write;
}

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index 6528aa662181..9b201c68b30b 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -5573,9 +5573,10 @@ long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_)

sctx->flags = arg->flags;

- sctx->send_filp = fget(arg->send_fd);
- if (!sctx->send_filp) {
- ret = -EBADF;
+ sctx->send_filp = fgetr(arg->send_fd, CAP_PWRITE);
+ if (IS_ERR(sctx->send_filp)) {
+ ret = PTR_ERR(sctx->send_filp);
+ sctx->send_filp = NULL;
goto out;
}

diff --git a/fs/cifs/ioctl.c b/fs/cifs/ioctl.c
index 45cb59bcc791..405eeee73ffb 100644
--- a/fs/cifs/ioctl.c
+++ b/fs/cifs/ioctl.c
@@ -61,9 +61,9 @@ static long cifs_ioctl_clone(unsigned int xid, struct file *dst_file,
return rc;
}

- src_file = fdget(srcfd);
- if (!src_file.file) {
- rc = -EBADF;
+ src_file = fdgetr(srcfd, CAP_PREAD);
+ if (IS_ERR(src_file.file)) {
+ rc = PTR_ERR(src_file.file);
goto out_drop_write;
}

diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index fe3afb2de880..ab5e79cfbc46 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -128,8 +128,8 @@ static int get_device_index(struct coda_mount_data *data)
return -1;
}

- f = fdget(data->fd);
- if (!f.file)
+ f = fdgetr(data->fd, CAP_FSTAT);
+ if (IS_ERR(f.file))
goto Ebadf;
inode = file_inode(f.file);
if (!S_ISCHR(inode->i_mode) || imajor(inode) != CODA_PSDEV_MAJOR) {
diff --git a/fs/coda/psdev.c b/fs/coda/psdev.c
index 5c1e4242368b..5749ecfab46e 100644
--- a/fs/coda/psdev.c
+++ b/fs/coda/psdev.c
@@ -188,7 +188,7 @@ static ssize_t coda_psdev_write(struct file *file, const char __user *buf,
struct coda_open_by_fd_out *outp =
(struct coda_open_by_fd_out *)req->uc_data;
if (!outp->oh.result)
- outp->fh = fget(outp->fd);
+ outp->fh = fgetr(outp->fd, CAP_LIST_END);
}

wake_up(&req->uc_sleep);
diff --git a/fs/compat.c b/fs/compat.c
index 66d3d3c6b4b2..58c3992931c9 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -889,14 +889,14 @@ COMPAT_SYSCALL_DEFINE3(old_readdir, unsigned int, fd,
struct compat_old_linux_dirent __user *, dirent, unsigned int, count)
{
int error;
- struct fd f = fdget(fd);
+ struct fd f = fdgetr(fd, CAP_READ);
struct compat_readdir_callback buf = {
.ctx.actor = compat_fillonedir,
.dirent = dirent
};

- if (!f.file)
- return -EBADF;
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);

error = iterate_dir(f.file, &buf.ctx);
if (buf.result)
@@ -979,9 +979,9 @@ COMPAT_SYSCALL_DEFINE3(getdents, unsigned int, fd,
if (!access_ok(VERIFY_WRITE, dirent, count))
return -EFAULT;

- f = fdget(fd);
- if (!f.file)
- return -EBADF;
+ f = fdgetr(fd, CAP_READ);
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);

error = iterate_dir(f.file, &buf.ctx);
if (error >= 0)
@@ -1064,9 +1064,9 @@ COMPAT_SYSCALL_DEFINE3(getdents64, unsigned int, fd,
if (!access_ok(VERIFY_WRITE, dirent, count))
return -EFAULT;

- f = fdget(fd);
- if (!f.file)
- return -EBADF;
+ f = fdgetr(fd, CAP_READ);
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);

error = iterate_dir(f.file, &buf.ctx);
if (error >= 0)
diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index e82289047272..68f3ab88f00f 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -1542,10 +1542,18 @@ COMPAT_SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd,
compat_ulong_t, arg32)
{
unsigned long arg = arg32;
- struct fd f = fdget(fd);
- int error = -EBADF;
- if (!f.file)
+ struct capsicum_rights rights;
+ struct fd f;
+ int error;
+
+ cap_rights_init(&rights, CAP_IOCTL);
+ rights.nioctls = 1;
+ rights.ioctls = &cmd;
+ f = fdget_rights(fd, &rights);
+ if (IS_ERR(f.file)) {
+ error = PTR_ERR(f.file);
goto out;
+ }

/* RED-PEN how should LSM module know it's handling 32bit? */
error = security_file_ioctl(f.file, cmd, arg);
diff --git a/fs/eventfd.c b/fs/eventfd.c
index d6a88e7812f3..9c5216b59c6e 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -319,16 +319,17 @@ static const struct file_operations eventfd_fops = {
* Returns a pointer to the eventfd file structure in case of success, or the
* following error pointer:
*
- * -EBADF : Invalid @fd file descriptor.
- * -EINVAL : The @fd file descriptor is not an eventfd file.
+ * -EBADF : Invalid @fd file descriptor.
+ * -ENOTCAPABLE : The @fd file descriptor does not have the required rights.
+ * -EINVAL : The @fd file descriptor is not an eventfd file.
*/
struct file *eventfd_fget(int fd)
{
struct file *file;

- file = fget(fd);
- if (!file)
- return ERR_PTR(-EBADF);
+ file = fgetr(fd, CAP_WRITE);
+ if (IS_ERR(file))
+ return file;
if (file->f_op != &eventfd_fops) {
fput(file);
return ERR_PTR(-EINVAL);
@@ -350,9 +351,10 @@ EXPORT_SYMBOL_GPL(eventfd_fget);
struct eventfd_ctx *eventfd_ctx_fdget(int fd)
{
struct eventfd_ctx *ctx;
- struct fd f = fdget(fd);
- if (!f.file)
- return ERR_PTR(-EBADF);
+ struct fd f = fdgetr(fd, CAP_WRITE);
+
+ if (IS_ERR(f.file))
+ return (struct eventfd_ctx *) f.file;
ctx = eventfd_ctx_fileget(f.file);
fdput(f);
return ctx;
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index b10b48c2a7af..0cba830f834b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1836,15 +1836,18 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
copy_from_user(&epds, event, sizeof(struct epoll_event)))
goto error_return;

- error = -EBADF;
- f = fdget(epfd);
- if (!f.file)
+ f = fdgetr(epfd, CAP_EPOLL_CTL);
+ if (IS_ERR(f.file)) {
+ error = PTR_ERR(f.file);
goto error_return;
+ }

/* Get the "struct file *" for the target file */
- tf = fdget(fd);
- if (!tf.file)
+ tf = fdgetr(fd, CAP_POLL_EVENT);
+ if (IS_ERR(tf.file)) {
+ error = PTR_ERR(tf.file);
goto error_fput;
+ }

/* The target file descriptor must support poll */
error = -EPERM;
@@ -1976,9 +1979,9 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
return -EFAULT;

/* Get the "struct file *" for the eventpoll file */
- f = fdget(epfd);
- if (!f.file)
- return -EBADF;
+ f = fdgetr(epfd, CAP_POLL_EVENT);
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);

/*
* We have to check that the file structure underneath the fd
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 0f2252ec274d..a26108969d9b 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -419,9 +419,9 @@ group_extend_out:
return -EFAULT;
me.moved_len = 0;

- donor = fdget(me.donor_fd);
- if (!donor.file)
- return -EBADF;
+ donor = fdgetr(me.donor_fd, CAP_PWRITE, CAP_FSTAT);
+ if (IS_ERR(donor.file))
+ return PTR_ERR(donor.file);

if (!(donor.file->f_mode & FMODE_WRITE)) {
err = -EBADF;
diff --git a/fs/fcntl.c b/fs/fcntl.c
index abf82e05d7b3..322a260e225b 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -355,13 +355,99 @@ static int check_fcntl_cmd(unsigned cmd)
return 0;
}

+static bool fcntl_rights(unsigned int cmd, struct capsicum_rights *rights)
+{
+ switch (cmd) {
+ case F_DUPFD:
+ case F_DUPFD_CLOEXEC:
+ /*
+ * Returning true (=>use wrapped file) implies that no rights
+ * are needed.
+ */
+ cap_rights_init(rights, 0);
+ return true;
+ case F_GETFD:
+ case F_SETFD:
+ cap_rights_init(rights, 0);
+ return false;
+ case F_GETFL:
+ cap_rights_init(rights, CAP_FCNTL);
+ rights->fcntls = CAP_FCNTL_GETFL;
+ return false;
+ case F_SETFL:
+ cap_rights_init(rights, CAP_FCNTL);
+ rights->fcntls = CAP_FCNTL_SETFL;
+ return false;
+ case F_GETOWN:
+ case F_GETOWN_EX:
+ case F_GETOWNER_UIDS:
+ cap_rights_init(rights, CAP_FCNTL);
+ rights->fcntls = CAP_FCNTL_GETOWN;
+ return false;
+ case F_SETOWN:
+ case F_SETOWN_EX:
+ cap_rights_init(rights, CAP_FCNTL);
+ rights->fcntls = CAP_FCNTL_SETOWN;
+ return false;
+ case F_GETLK:
+ case F_SETLK:
+ case F_SETLKW:
+#if BITS_PER_LONG == 32
+ case F_GETLK64:
+ case F_SETLK64:
+ case F_SETLKW64:
+#endif
+ cap_rights_init(rights, CAP_FLOCK);
+ return false;
+ case F_GETSIG:
+ case F_SETSIG:
+ cap_rights_init(rights, CAP_POLL_EVENT, CAP_FSIGNAL);
+ return false;
+ case F_GETLEASE:
+ case F_SETLEASE:
+ cap_rights_init(rights, CAP_FLOCK, CAP_FSIGNAL);
+ return false;
+ case F_NOTIFY:
+ cap_rights_init(rights, CAP_NOTIFY);
+ return false;
+ case F_SETPIPE_SZ:
+ cap_rights_init(rights, CAP_SETSOCKOPT);
+ return false;
+ case F_GETPIPE_SZ:
+ cap_rights_init(rights, CAP_GETSOCKOPT);
+ return false;
+ default:
+ cap_rights_set_all(rights);
+ return false;
+ }
+}
+
+static inline struct fd fcntl_fdget_raw(unsigned int fd, unsigned int cmd,
+ struct capsicum_rights *rights)
+{
+ struct fd f;
+
+ if (fcntl_rights(cmd, rights)) {
+ /* Use the file directly, don't attempt to unwrap */
+ f = fdget_raw(fd);
+ if (f.file == NULL)
+ f.file = ERR_PTR(-EBADF);
+ } else {
+ f = fdget_raw_rights(fd, NULL, rights);
+ }
+ return f;
+}
+
SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
-{
- struct fd f = fdget_raw(fd);
+{
+ struct capsicum_rights rights;
+ struct fd f = fcntl_fdget_raw(fd, cmd, &rights);
long err = -EBADF;

- if (!f.file)
+ if (IS_ERR(f.file)) {
+ err = PTR_ERR(f.file);
goto out;
+ }

if (unlikely(f.file->f_mode & FMODE_PATH)) {
if (!check_fcntl_cmd(cmd))
@@ -381,12 +467,15 @@ out:
#if BITS_PER_LONG == 32
SYSCALL_DEFINE3(fcntl64, unsigned int, fd, unsigned int, cmd,
unsigned long, arg)
-{
- struct fd f = fdget_raw(fd);
+{
+ struct capsicum_rights rights;
+ struct fd f = fcntl_fdget_raw(fd, cmd, &rights);
long err = -EBADF;

- if (!f.file)
+ if (IS_ERR(f.file)) {
+ err = PTR_ERR(f.file);
goto out;
+ }

if (unlikely(f.file->f_mode & FMODE_PATH)) {
if (!check_fcntl_cmd(cmd))
diff --git a/fs/fhandle.c b/fs/fhandle.c
index 999ff5c3cab0..b9f9496838ac 100644
--- a/fs/fhandle.c
+++ b/fs/fhandle.c
@@ -121,9 +121,10 @@ static struct vfsmount *get_vfsmount_from_fd(int fd)
mnt = mntget(fs->pwd.mnt);
spin_unlock(&fs->lock);
} else {
- struct fd f = fdget(fd);
- if (!f.file)
- return ERR_PTR(-EBADF);
+ struct fd f = fdgetr(fd, CAP_LOOKUP);
+
+ if (IS_ERR(f.file))
+ return (struct vfsmount *)f.file;
mnt = mntget(f.file->f_path.mnt);
fdput(f);
}
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 754dcf23de8a..4a49dca49c8f 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -1025,11 +1025,15 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
sb->s_time_gran = 1;
sb->s_export_op = &fuse_export_operations;

- file = fget(d.fd);
- err = -EINVAL;
- if (!file)
+ file = fgetr(d.fd, CAP_READ, CAP_WRITE);
+ if (IS_ERR(file)) {
+ err = PTR_ERR(file);
+ if (err == -EBADF)
+ err = -EINVAL;
goto err;
+ }

+ err = -EINVAL;
if ((file->f_op != &fuse_dev_operations) ||
(file->f_cred->user_ns != &init_user_ns))
goto err_fput;
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 8ac3fad36192..23ba2aac63da 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -604,10 +604,16 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
{
int error;
- struct fd f = fdget(fd);
+ struct capsicum_rights rights;
+ struct fd f;

- if (!f.file)
- return -EBADF;
+ cap_rights_init(&rights, CAP_IOCTL);
+ rights.nioctls = 1;
+ rights.ioctls = &cmd;
+ f = fdget_rights(fd, &rights);
+
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);
error = security_file_ioctl(f.file, cmd, arg);
if (!error)
error = do_vfs_ioctl(f.file, fd, cmd, arg);
diff --git a/fs/locks.c b/fs/locks.c
index 717fbc404e6b..fdad193dc4b4 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1828,19 +1828,21 @@ EXPORT_SYMBOL(flock_lock_file_wait);
*/
SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
{
- struct fd f = fdget(fd);
+ struct fd f = fdgetr(fd, CAP_FLOCK);
struct file_lock *lock;
int can_sleep, unlock;
int error;

- error = -EBADF;
- if (!f.file)
+ if (IS_ERR(f.file)) {
+ error = PTR_ERR(f.file);
goto out;
+ }

can_sleep = !(cmd & LOCK_NB);
cmd &= ~LOCK_NB;
unlock = (cmd == LOCK_UN);

+ error = -EBADF;
if (!unlock && !(cmd & LOCK_MAND) &&
!(f.file->f_mode & (FMODE_READ|FMODE_WRITE)))
goto out_putf;
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 3fdc8a3e1134..b89f51c9182e 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -433,11 +433,12 @@ static int fanotify_find_path(int dfd, const char __user *filename,
dfd, filename, flags);

if (filename == NULL) {
- struct fd f = fdget(dfd);
+ struct fd f = fdgetr(dfd, CAP_FSTAT);

- ret = -EBADF;
- if (!f.file)
+ if (IS_ERR(f.file)) {
+ ret = PTR_ERR(f.file);
goto out;
+ }

ret = -ENOTDIR;
if ((flags & FAN_MARK_ONLYDIR) &&
@@ -457,7 +458,8 @@ static int fanotify_find_path(int dfd, const char __user *filename,
if (flags & FAN_MARK_ONLYDIR)
lookup_flags |= LOOKUP_DIRECTORY;

- ret = user_path_at(dfd, filename, lookup_flags, path);
+ ret = user_path_atr(dfd, filename, lookup_flags, path,
+ CAP_FSTAT, CAP_LOOKUP);
if (ret)
goto out;
}
@@ -822,9 +824,9 @@ SYSCALL_DEFINE5(fanotify_mark, int, fanotify_fd, unsigned int, flags,
#endif
return -EINVAL;

- f = fdget(fanotify_fd);
- if (unlikely(!f.file))
- return -EBADF;
+ f = fdgetr(fanotify_fd, CAP_NOTIFY);
+ if (unlikely(IS_ERR(f.file)))
+ return PTR_ERR(f.file);

/* verify that this is indeed an fanotify instance */
ret = -EINVAL;
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index cc423a30a0c8..18389f0d83f8 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -711,9 +711,9 @@ SYSCALL_DEFINE3(inotify_add_watch, int, fd, const char __user *, pathname,
if (unlikely(!(mask & ALL_INOTIFY_BITS)))
return -EINVAL;

- f = fdget(fd);
- if (unlikely(!f.file))
- return -EBADF;
+ f = fdgetr(fd, CAP_NOTIFY);
+ if (unlikely(IS_ERR(f.file)))
+ return PTR_ERR(f.file);

/* verify that this is indeed an inotify instance */
if (unlikely(f.file->f_op != &inotify_fops)) {
@@ -749,9 +749,9 @@ SYSCALL_DEFINE2(inotify_rm_watch, int, fd, __s32, wd)
struct fd f;
int ret = 0;

- f = fdget(fd);
- if (unlikely(!f.file))
- return -EBADF;
+ f = fdgetr(fd, CAP_NOTIFY);
+ if (unlikely(IS_ERR(f.file)))
+ return PTR_ERR(f.file);

/* verify that this is indeed an inotify instance */
ret = -EINVAL;
diff --git a/fs/ocfs2/cluster/heartbeat.c b/fs/ocfs2/cluster/heartbeat.c
index 73039295d0d1..4e49b06b6d52 100644
--- a/fs/ocfs2/cluster/heartbeat.c
+++ b/fs/ocfs2/cluster/heartbeat.c
@@ -1740,9 +1740,13 @@ static ssize_t o2hb_region_dev_write(struct o2hb_region *reg,
if (fd < 0 || fd >= INT_MAX)
goto out;

- f = fdget(fd);
- if (f.file == NULL)
+ f = fdgetr(fd, CAP_FSTAT);
+ if (IS_ERR(f.file)) {
+ ret = PTR_ERR(f.file);
+ if (ret == -EBADF)
+ ret = -EINVAL;
goto out;
+ }

if (reg->hr_blocks == 0 || reg->hr_start_block == 0 ||
reg->hr_block_bytes == 0)
diff --git a/fs/open.c b/fs/open.c
index ee16be3a7291..111b023a741c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -159,10 +159,11 @@ static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
error = -EINVAL;
if (length < 0)
goto out;
- error = -EBADF;
- f = fdget(fd);
- if (!f.file)
+ f = fdgetr(fd, CAP_FTRUNCATE);
+ if (IS_ERR(f.file)) {
+ error = PTR_ERR(f.file);
goto out;
+ }

/* explicitly opened as large or we are on 64-bit box */
if (f.file->f_flags & O_LARGEFILE)
@@ -302,12 +303,14 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)

SYSCALL_DEFINE4(fallocate, int, fd, int, mode, loff_t, offset, loff_t, len)
{
- struct fd f = fdget(fd);
- int error = -EBADF;
+ struct fd f = fdgetr(fd, CAP_WRITE);
+ int error;

- if (f.file) {
+ if (!IS_ERR(f.file)) {
error = do_fallocate(f.file, mode, offset, len);
fdput(f);
+ } else {
+ error = PTR_ERR(f.file);
}
return error;
}
@@ -348,7 +351,7 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)

old_cred = override_creds(override_cred);
retry:
- res = user_path_at(dfd, filename, lookup_flags, &path);
+ res = user_path_atr(dfd, filename, lookup_flags, &path, CAP_FSTAT);
if (res)
goto out;

@@ -426,13 +429,14 @@ out:

SYSCALL_DEFINE1(fchdir, unsigned int, fd)
{
- struct fd f = fdget_raw(fd);
+ struct fd f = fdgetr_raw(fd, CAP_FCHDIR);
struct inode *inode;
int error = -EBADF;

- error = -EBADF;
- if (!f.file)
+ if (IS_ERR(f.file)) {
+ error = PTR_ERR(f.file);
goto out;
+ }

inode = file_inode(f.file);

@@ -513,13 +517,15 @@ out_unlock:

SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode)
{
- struct fd f = fdget(fd);
+ struct fd f = fdgetr(fd, CAP_FCHMOD);
int err = -EBADF;

- if (f.file) {
+ if (!IS_ERR(f.file)) {
audit_inode(NULL, f.file->f_path.dentry, 0);
err = chmod_common(&f.file->f_path, mode);
fdput(f);
+ } else {
+ err = PTR_ERR(f.file);
}
return err;
}
@@ -530,7 +536,7 @@ SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename, umode_t, mode
int error;
unsigned int lookup_flags = LOOKUP_FOLLOW;
retry:
- error = user_path_at(dfd, filename, lookup_flags, &path);
+ error = user_path_atr(dfd, filename, lookup_flags, &path, CAP_FCHMODAT);
if (!error) {
error = chmod_common(&path, mode);
path_put(&path);
@@ -603,7 +609,7 @@ SYSCALL_DEFINE5(fchownat, int, dfd, const char __user *, filename, uid_t, user,
if (flag & AT_EMPTY_PATH)
lookup_flags |= LOOKUP_EMPTY;
retry:
- error = user_path_at(dfd, filename, lookup_flags, &path);
+ error = user_path_atr(dfd, filename, lookup_flags, &path, CAP_FCHOWNAT);
if (error)
goto out;
error = mnt_want_write(path.mnt);
@@ -634,11 +640,13 @@ SYSCALL_DEFINE3(lchown, const char __user *, filename, uid_t, user, gid_t, group

SYSCALL_DEFINE3(fchown, unsigned int, fd, uid_t, user, gid_t, group)
{
- struct fd f = fdget(fd);
- int error = -EBADF;
+ struct fd f = fdgetr(fd, CAP_FCHOWN);
+ int error;

- if (!f.file)
+ if (IS_ERR(f.file)) {
+ error = PTR_ERR(f.file);
goto out;
+ }

error = mnt_want_write_file(f.file);
if (error)
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index 89026095f2b5..dc29c4d6f050 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -272,9 +272,9 @@ struct file *proc_ns_fget(int fd)
{
struct file *file;

- file = fget(fd);
- if (!file)
- return ERR_PTR(-EBADF);
+ file = fgetr(fd, CAP_SETNS);
+ if (IS_ERR(file))
+ return file;

if (file->f_op != &ns_file_operations)
goto out_invalid;
diff --git a/fs/read_write.c b/fs/read_write.c
index c6e0f20a9f94..ed96b1ad7207 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -275,9 +275,10 @@ static inline void fdput_pos(struct fd f)
SYSCALL_DEFINE3(lseek, unsigned int, fd, off_t, offset, unsigned int, whence)
{
off_t retval;
- struct fd f = fdget_pos(fd);
- if (!f.file)
- return -EBADF;
+ struct fd f = fdgetr_pos(fd, CAP_SEEK);
+
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);

retval = -EINVAL;
if (whence <= SEEK_MAX) {
@@ -303,11 +304,11 @@ SYSCALL_DEFINE5(llseek, unsigned int, fd, unsigned long, offset_high,
unsigned int, whence)
{
int retval;
- struct fd f = fdget_pos(fd);
+ struct fd f = fdgetr_pos(fd, CAP_SEEK);
loff_t offset;

- if (!f.file)
- return -EBADF;
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);

retval = -EINVAL;
if (whence > SEEK_MAX)
@@ -554,15 +555,17 @@ static inline void file_pos_write(struct file *file, loff_t pos)

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
- struct fd f = fdget_pos(fd);
- ssize_t ret = -EBADF;
+ struct fd f = fdgetr_pos(fd, CAP_READ);
+ ssize_t ret;

- if (f.file) {
+ if (!IS_ERR(f.file)) {
loff_t pos = file_pos_read(f.file);
ret = vfs_read(f.file, buf, count, &pos);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
+ } else {
+ ret = PTR_ERR(f.file);
}
return ret;
}
@@ -570,15 +573,17 @@ SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
size_t, count)
{
- struct fd f = fdget_pos(fd);
- ssize_t ret = -EBADF;
+ struct fd f = fdgetr_pos(fd, CAP_WRITE);
+ ssize_t ret;

- if (f.file) {
+ if (!IS_ERR(f.file)) {
loff_t pos = file_pos_read(f.file);
ret = vfs_write(f.file, buf, count, &pos);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
+ } else {
+ ret = PTR_ERR(f.file);
}

return ret;
@@ -593,12 +598,14 @@ SYSCALL_DEFINE4(pread64, unsigned int, fd, char __user *, buf,
if (pos < 0)
return -EINVAL;

- f = fdget(fd);
- if (f.file) {
+ f = fdgetr(fd, CAP_PREAD);
+ if (!IS_ERR(f.file)) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PREAD)
ret = vfs_read(f.file, buf, count, &pos);
fdput(f);
+ } else {
+ ret = PTR_ERR(f.file);
}

return ret;
@@ -613,12 +620,14 @@ SYSCALL_DEFINE4(pwrite64, unsigned int, fd, const char __user *, buf,
if (pos < 0)
return -EINVAL;

- f = fdget(fd);
- if (f.file) {
+ f = fdgetr(fd, CAP_PWRITE);
+ if (!IS_ERR(f.file)) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PWRITE)
ret = vfs_write(f.file, buf, count, &pos);
fdput(f);
+ } else {
+ ret = PTR_ERR(f.file);
}

return ret;
@@ -878,15 +887,17 @@ EXPORT_SYMBOL(vfs_writev);
SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
unsigned long, vlen)
{
- struct fd f = fdget_pos(fd);
- ssize_t ret = -EBADF;
+ struct fd f = fdgetr_pos(fd, CAP_READ);
+ ssize_t ret;

- if (f.file) {
+ if (!IS_ERR(f.file)) {
loff_t pos = file_pos_read(f.file);
ret = vfs_readv(f.file, vec, vlen, &pos);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
+ } else {
+ ret = PTR_ERR(f.file);
}

if (ret > 0)
@@ -898,15 +909,17 @@ SYSCALL_DEFINE3(readv, unsigned long, fd, const struct iovec __user *, vec,
SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,
unsigned long, vlen)
{
- struct fd f = fdget_pos(fd);
- ssize_t ret = -EBADF;
+ struct fd f = fdgetr_pos(fd, CAP_WRITE);
+ ssize_t ret;

- if (f.file) {
+ if (!IS_ERR(f.file)) {
loff_t pos = file_pos_read(f.file);
ret = vfs_writev(f.file, vec, vlen, &pos);
if (ret >= 0)
file_pos_write(f.file, pos);
fdput_pos(f);
+ } else {
+ ret = PTR_ERR(f.file);
}

if (ret > 0)
@@ -926,17 +939,19 @@ SYSCALL_DEFINE5(preadv, unsigned long, fd, const struct iovec __user *, vec,
{
loff_t pos = pos_from_hilo(pos_h, pos_l);
struct fd f;
- ssize_t ret = -EBADF;
+ ssize_t ret;

if (pos < 0)
return -EINVAL;

- f = fdget(fd);
- if (f.file) {
+ f = fdgetr(fd, CAP_PREAD);
+ if (!IS_ERR(f.file)) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PREAD)
ret = vfs_readv(f.file, vec, vlen, &pos);
fdput(f);
+ } else {
+ ret = PTR_ERR(f.file);
}

if (ret > 0)
@@ -950,17 +965,19 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec,
{
loff_t pos = pos_from_hilo(pos_h, pos_l);
struct fd f;
- ssize_t ret = -EBADF;
+ ssize_t ret;

if (pos < 0)
return -EINVAL;

- f = fdget(fd);
- if (f.file) {
+ f = fdgetr(fd, CAP_PWRITE);
+ if (!IS_ERR(f.file)) {
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PWRITE)
ret = vfs_writev(f.file, vec, vlen, &pos);
fdput(f);
+ } else {
+ ret = PTR_ERR(f.file);
}

if (ret > 0)
@@ -1055,12 +1072,12 @@ COMPAT_SYSCALL_DEFINE3(readv, compat_ulong_t, fd,
const struct compat_iovec __user *,vec,
compat_ulong_t, vlen)
{
- struct fd f = fdget_pos(fd);
+ struct fd f = fdgetr_pos(fd, CAP_READ);
ssize_t ret;
loff_t pos;

- if (!f.file)
- return -EBADF;
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);
pos = f.file->f_pos;
ret = compat_readv(f.file, vec, vlen, &pos);
if (ret >= 0)
@@ -1078,9 +1095,9 @@ static long __compat_sys_preadv64(unsigned long fd,

if (pos < 0)
return -EINVAL;
- f = fdget(fd);
- if (!f.file)
- return -EBADF;
+ f = fdgetr(fd, CAP_PREAD);
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PREAD)
ret = compat_readv(f.file, vec, vlen, &pos);
@@ -1132,12 +1149,12 @@ COMPAT_SYSCALL_DEFINE3(writev, compat_ulong_t, fd,
const struct compat_iovec __user *, vec,
compat_ulong_t, vlen)
{
- struct fd f = fdget_pos(fd);
+ struct fd f = fdgetr_pos(fd, CAP_WRITE);
ssize_t ret;
loff_t pos;

- if (!f.file)
- return -EBADF;
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);
pos = f.file->f_pos;
ret = compat_writev(f.file, vec, vlen, &pos);
if (ret >= 0)
@@ -1155,9 +1172,9 @@ static long __compat_sys_pwritev64(unsigned long fd,

if (pos < 0)
return -EINVAL;
- f = fdget(fd);
- if (!f.file)
- return -EBADF;
+ f = fdgetr(fd, CAP_PWRITE);
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);
ret = -ESPIPE;
if (f.file->f_mode & FMODE_PWRITE)
ret = compat_writev(f.file, vec, vlen, &pos);
@@ -1198,9 +1215,11 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
* Get input file, and verify that it is ok..
*/
retval = -EBADF;
- in = fdget(in_fd);
- if (!in.file)
+ in = fdgetr(in_fd, CAP_PREAD);
+ if (IS_ERR(in.file)) {
+ retval = PTR_ERR(in.file);
goto out;
+ }
if (!(in.file->f_mode & FMODE_READ))
goto fput_in;
retval = -ESPIPE;
@@ -1220,9 +1239,11 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
* Get output file, and verify that it is ok..
*/
retval = -EBADF;
- out = fdget(out_fd);
- if (!out.file)
+ out = fdgetr(out_fd, CAP_WRITE);
+ if (IS_ERR(out.file)) {
+ retval = PTR_ERR(out.file);
goto fput_in;
+ }
if (!(out.file->f_mode & FMODE_WRITE))
goto fput_out;
retval = -EINVAL;
diff --git a/fs/readdir.c b/fs/readdir.c
index 33fd92208cb7..284728df319a 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -110,14 +110,14 @@ SYSCALL_DEFINE3(old_readdir, unsigned int, fd,
struct old_linux_dirent __user *, dirent, unsigned int, count)
{
int error;
- struct fd f = fdget(fd);
+ struct fd f = fdgetr(fd, CAP_READ);
struct readdir_callback buf = {
.ctx.actor = fillonedir,
.dirent = dirent
};

- if (!f.file)
- return -EBADF;
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);

error = iterate_dir(f.file, &buf.ctx);
if (buf.result)
@@ -206,9 +206,9 @@ SYSCALL_DEFINE3(getdents, unsigned int, fd,
if (!access_ok(VERIFY_WRITE, dirent, count))
return -EFAULT;

- f = fdget(fd);
- if (!f.file)
- return -EBADF;
+ f = fdgetr(fd, CAP_READ);
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);

error = iterate_dir(f.file, &buf.ctx);
if (error >= 0)
@@ -286,9 +286,9 @@ SYSCALL_DEFINE3(getdents64, unsigned int, fd,
if (!access_ok(VERIFY_WRITE, dirent, count))
return -EFAULT;

- f = fdget(fd);
- if (!f.file)
- return -EBADF;
+ f = fdgetr(fd, CAP_READ);
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);

error = iterate_dir(f.file, &buf.ctx);
if (error >= 0)
diff --git a/fs/select.c b/fs/select.c
index 467bb1cb3ea5..079bb7e9c126 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -449,8 +449,8 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
break;
if (!(bit & all_bits))
continue;
- f = fdget(i);
- if (f.file) {
+ f = fdgetr(i, CAP_POLL_EVENT);
+ if (!IS_ERR(f.file)) {
const struct file_operations *f_op;
f_op = f.file->f_op;
mask = DEFAULT_POLLMASK;
@@ -487,6 +487,9 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
} else if (busy_flag & mask)
can_busy_loop = true;

+ } else if (PTR_ERR(f.file) != -EBADF) {
+ retval = PTR_ERR(f.file);
+ break;
}
}
if (res_in)
@@ -757,9 +760,9 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait,
mask = 0;
fd = pollfd->fd;
if (fd >= 0) {
- struct fd f = fdget(fd);
+ struct fd f = fdgetr(fd, CAP_POLL_EVENT);
mask = POLLNVAL;
- if (f.file) {
+ if (!IS_ERR(f.file)) {
mask = DEFAULT_POLLMASK;
if (f.file->f_op->poll) {
pwait->_key = pollfd->events|POLLERR|POLLHUP;
diff --git a/fs/signalfd.c b/fs/signalfd.c
index 424b7b65321f..f484c2db0c8a 100644
--- a/fs/signalfd.c
+++ b/fs/signalfd.c
@@ -288,9 +288,10 @@ SYSCALL_DEFINE4(signalfd4, int, ufd, sigset_t __user *, user_mask,
if (ufd < 0)
kfree(ctx);
} else {
- struct fd f = fdget(ufd);
- if (!f.file)
- return -EBADF;
+ struct fd f = fdgetr(ufd, CAP_FSIGNAL);
+
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);
ctx = f.file->private_data;
if (f.file->f_op != &signalfd_fops) {
fdput(f);
diff --git a/fs/splice.c b/fs/splice.c
index f5cb9ba84510..a72872e83d0b 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1647,14 +1647,16 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,
return 0;

error = -EBADF;
- f = fdget(fd);
- if (f.file) {
+ f = fdgetr(fd, CAP_WRITE);
+ if (!IS_ERR(f.file)) {
if (f.file->f_mode & FMODE_WRITE)
error = vmsplice_to_pipe(f.file, iov, nr_segs, flags);
else if (f.file->f_mode & FMODE_READ)
error = vmsplice_to_user(f.file, iov, nr_segs, flags);

fdput(f);
+ } else {
+ error = PTR_ERR(f.file);
}

return error;
@@ -1692,19 +1694,23 @@ SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in,
return 0;

error = -EBADF;
- in = fdget(fd_in);
- if (in.file) {
+ in = fdgetr(fd_in, CAP_PREAD);
+ if (!IS_ERR(in.file)) {
if (in.file->f_mode & FMODE_READ) {
- out = fdget(fd_out);
- if (out.file) {
+ out = fdgetr(fd_out, CAP_PWRITE);
+ if (!IS_ERR(out.file)) {
if (out.file->f_mode & FMODE_WRITE)
error = do_splice(in.file, off_in,
out.file, off_out,
len, flags);
fdput(out);
+ } else {
+ error = PTR_ERR(out.file);
}
}
fdput(in);
+ } else {
+ error = PTR_ERR(in.file);
}
return error;
}
@@ -2023,19 +2029,24 @@ SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags)
return 0;

error = -EBADF;
- in = fdget(fdin);
- if (in.file) {
+ in = fdgetr(fdin, CAP_READ);
+ if (!IS_ERR(in.file)) {
if (in.file->f_mode & FMODE_READ) {
- struct fd out = fdget(fdout);
- if (out.file) {
+ struct fd out = fdgetr(fdout, CAP_WRITE);
+
+ if (!IS_ERR(out.file)) {
if (out.file->f_mode & FMODE_WRITE)
error = do_tee(in.file, out.file,
len, flags);
fdput(out);
+ } else {
+ error = PTR_ERR(out.file);
}
}
- fdput(in);
- }
+ fdput(in);
+ } else {
+ error = PTR_ERR(in.file);
+ }

return error;
}
diff --git a/fs/stat.c b/fs/stat.c
index ae0c3cef9927..f40b3530eab4 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -76,12 +76,14 @@ EXPORT_SYMBOL(vfs_getattr);

int vfs_fstat(unsigned int fd, struct kstat *stat)
{
- struct fd f = fdget_raw(fd);
- int error = -EBADF;
+ struct fd f = fdgetr_raw(fd, CAP_FSTAT);
+ int error;

- if (f.file) {
+ if (!IS_ERR(f.file)) {
error = vfs_getattr(&f.file->f_path, stat);
fdput(f);
+ } else {
+ error = PTR_ERR(f.file);
}
return error;
}
@@ -103,7 +105,7 @@ int vfs_fstatat(int dfd, const char __user *filename, struct kstat *stat,
if (flag & AT_EMPTY_PATH)
lookup_flags |= LOOKUP_EMPTY;
retry:
- error = user_path_at(dfd, filename, lookup_flags, &path);
+ error = user_path_atr(dfd, filename, lookup_flags, &path, CAP_FSTAT);
if (error)
goto out;

diff --git a/fs/statfs.c b/fs/statfs.c
index 083dc0ac9140..e2c6be447aa3 100644
--- a/fs/statfs.c
+++ b/fs/statfs.c
@@ -94,11 +94,14 @@ retry:

int fd_statfs(int fd, struct kstatfs *st)
{
- struct fd f = fdget_raw(fd);
- int error = -EBADF;
- if (f.file) {
+ struct fd f = fdgetr_raw(fd, CAP_FSTATFS);
+ int error;
+
+ if (!IS_ERR(f.file)) {
error = vfs_statfs(&f.file->f_path, st);
fdput(f);
+ } else {
+ error = PTR_ERR(f.file);
}
return error;
}
diff --git a/fs/sync.c b/fs/sync.c
index b28d1dd10e8b..663afe812600 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -148,12 +148,12 @@ void emergency_sync(void)
*/
SYSCALL_DEFINE1(syncfs, int, fd)
{
- struct fd f = fdget(fd);
+ struct fd f = fdgetr(fd, CAP_FSYNC);
struct super_block *sb;
int ret;

- if (!f.file)
- return -EBADF;
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);
sb = f.file->f_dentry->d_sb;

down_read(&sb->s_umount);
@@ -199,12 +199,14 @@ EXPORT_SYMBOL(vfs_fsync);

static int do_fsync(unsigned int fd, int datasync)
{
- struct fd f = fdget(fd);
- int ret = -EBADF;
+ struct fd f = fdgetr(fd, CAP_FSYNC);
+ int ret;

- if (f.file) {
+ if (!IS_ERR(f.file)) {
ret = vfs_fsync(f.file, datasync);
fdput(f);
+ } else {
+ ret = PTR_ERR(f.file);
}
return ret;
}
@@ -310,10 +312,11 @@ SYSCALL_DEFINE4(sync_file_range, int, fd, loff_t, offset, loff_t, nbytes,
else
endbyte--; /* inclusive */

- ret = -EBADF;
- f = fdget(fd);
- if (!f.file)
+ f = fdgetr(fd, CAP_FSYNC, CAP_SEEK);
+ if (IS_ERR(f.file)) {
+ ret = PTR_ERR(f.file);
goto out;
+ }

i_mode = file_inode(f.file)->i_mode;
ret = -ESPIPE;
diff --git a/fs/timerfd.c b/fs/timerfd.c
index 0013142c0475..daf04417e2cf 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -291,6 +291,32 @@ static const struct file_operations timerfd_fops = {
.llseek = noop_llseek,
};

+#ifdef CONFIG_SECURITY_CAPSICUM
+#define timerfd_fgetr(f, p, ...) \
+ _timerfd_fgetr((f), (p), __VA_ARGS__, 0ULL)
+static int _timerfd_fgetr(int fd, struct fd *p, ...)
+{
+ struct capsicum_rights rights;
+ struct fd f;
+ va_list ap;
+
+ va_start(ap, p);
+ f = fdget_rights(fd, cap_rights_vinit(&rights, ap));
+ va_end(ap);
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);
+ if (f.file->f_op != &timerfd_fops) {
+ fdput(f);
+ return -EINVAL;
+ }
+ *p = f;
+ return 0;
+}
+
+#else
+
+#define timerfd_fgetr(f, p, ...) \
+ timerfd_fget((f), (p))
static int timerfd_fget(int fd, struct fd *p)
{
struct fd f = fdget(fd);
@@ -304,6 +330,8 @@ static int timerfd_fget(int fd, struct fd *p)
return 0;
}

+#endif
+
SYSCALL_DEFINE2(timerfd_create, int, clockid, int, flags)
{
int ufd;
@@ -359,7 +387,7 @@ static int do_timerfd_settime(int ufd, int flags,
!timespec_valid(&new->it_interval))
return -EINVAL;

- ret = timerfd_fget(ufd, &f);
+ ret = timerfd_fgetr(ufd, &f, CAP_WRITE, (old ? CAP_READ : 0));
if (ret)
return ret;
ctx = f.file->private_data;
@@ -397,8 +425,10 @@ static int do_timerfd_settime(int ufd, int flags,
hrtimer_forward_now(&ctx->t.tmr, ctx->tintv);
}

- old->it_value = ktime_to_timespec(timerfd_get_remaining(ctx));
- old->it_interval = ktime_to_timespec(ctx->tintv);
+ if (old) {
+ old->it_value = ktime_to_timespec(timerfd_get_remaining(ctx));
+ old->it_interval = ktime_to_timespec(ctx->tintv);
+ }

/*
* Re-program the timer to the new value ...
@@ -414,7 +444,7 @@ static int do_timerfd_gettime(int ufd, struct itimerspec *t)
{
struct fd f;
struct timerfd_ctx *ctx;
- int ret = timerfd_fget(ufd, &f);
+ int ret = timerfd_fgetr(ufd, &f, CAP_READ);
if (ret)
return ret;
ctx = f.file->private_data;
@@ -451,7 +481,7 @@ SYSCALL_DEFINE4(timerfd_settime, int, ufd, int, flags,

if (copy_from_user(&new, utmr, sizeof(new)))
return -EFAULT;
- ret = do_timerfd_settime(ufd, flags, &new, &old);
+ ret = do_timerfd_settime(ufd, flags, &new, otmr ? &old : NULL);
if (ret)
return ret;
if (otmr && copy_to_user(otmr, &old, sizeof(old)))
diff --git a/fs/utimes.c b/fs/utimes.c
index aa138d64560a..1d451efd6ae2 100644
--- a/fs/utimes.c
+++ b/fs/utimes.c
@@ -152,10 +152,11 @@ long do_utimes(int dfd, const char __user *filename, struct timespec *times,
if (flags & AT_SYMLINK_NOFOLLOW)
goto out;

- f = fdget(dfd);
- error = -EBADF;
- if (!f.file)
+ f = fdgetr(dfd, CAP_FUTIMES);
+ if (IS_ERR(f.file)) {
+ error = PTR_ERR(f.file);
goto out;
+ }

error = utimes_common(&f.file->f_path, times);
fdput(f);
@@ -166,7 +167,8 @@ long do_utimes(int dfd, const char __user *filename, struct timespec *times,
if (!(flags & AT_SYMLINK_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
retry:
- error = user_path_at(dfd, filename, lookup_flags, &path);
+ error = user_path_atr(dfd, filename, lookup_flags, &path,
+ CAP_FUTIMESAT);
if (error)
goto out;

diff --git a/fs/xattr.c b/fs/xattr.c
index 3377dff18404..3013dc4cbf27 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -415,12 +415,12 @@ retry:
SYSCALL_DEFINE5(fsetxattr, int, fd, const char __user *, name,
const void __user *,value, size_t, size, int, flags)
{
- struct fd f = fdget(fd);
+ struct fd f = fdgetr(fd, CAP_EXTATTR_SET);
struct dentry *dentry;
- int error = -EBADF;
+ int error;

- if (!f.file)
- return error;
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);
dentry = f.file->f_path.dentry;
audit_inode(NULL, dentry, 0);
error = mnt_want_write_file(f.file);
@@ -522,11 +522,11 @@ retry:
SYSCALL_DEFINE4(fgetxattr, int, fd, const char __user *, name,
void __user *, value, size_t, size)
{
- struct fd f = fdget(fd);
+ struct fd f = fdgetr(fd, CAP_EXTATTR_GET);
ssize_t error = -EBADF;

- if (!f.file)
- return error;
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);
audit_inode(NULL, f.file->f_path.dentry, 0);
error = getxattr(f.file->f_path.dentry, name, value, size);
fdput(f);
@@ -611,11 +611,11 @@ retry:

SYSCALL_DEFINE3(flistxattr, int, fd, char __user *, list, size_t, size)
{
- struct fd f = fdget(fd);
+ struct fd f = fdgetr(fd, CAP_EXTATTR_LIST);
ssize_t error = -EBADF;

- if (!f.file)
- return error;
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);
audit_inode(NULL, f.file->f_path.dentry, 0);
error = listxattr(f.file->f_path.dentry, list, size);
fdput(f);
@@ -688,12 +688,12 @@ retry:

SYSCALL_DEFINE2(fremovexattr, int, fd, const char __user *, name)
{
- struct fd f = fdget(fd);
+ struct fd f = fdgetr(fd, CAP_EXTATTR_DELETE);
struct dentry *dentry;
int error = -EBADF;

- if (!f.file)
- return error;
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);
dentry = f.file->f_path.dentry;
audit_inode(NULL, dentry, 0);
error = mnt_want_write_file(f.file);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 8bc1bbce7451..f2f8f199abe6 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -76,9 +76,9 @@ xfs_find_handle(
struct xfs_inode *ip;

if (cmd == XFS_IOC_FD_TO_HANDLE) {
- f = fdget(hreq->fd);
- if (!f.file)
- return -EBADF;
+ f = fdgetr(hreq->fd, CAP_FSTAT);
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);
inode = file_inode(f.file);
} else {
error = user_lpath((const char __user *)hreq->path, &path);
@@ -1443,8 +1443,8 @@ xfs_ioc_swapext(
int error = 0;

/* Pull information for the target fd */
- f = fdget((int)sxp->sx_fdtarget);
- if (!f.file) {
+ f = fdgetr((int)sxp->sx_fdtarget, CAP_READ, CAP_WRITE, CAP_FSTAT);
+ if (IS_ERR(f.file)) {
error = XFS_ERROR(EINVAL);
goto out;
}
@@ -1456,8 +1456,8 @@ xfs_ioc_swapext(
goto out_put_file;
}

- tmp = fdget((int)sxp->sx_fdtmp);
- if (!tmp.file) {
+ tmp = fdgetr((int)sxp->sx_fdtmp, CAP_READ, CAP_WRITE, CAP_FSTAT);
+ if (IS_ERR(tmp.file)) {
error = XFS_ERROR(EINVAL);
goto out_put_file;
}
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 4fcf39af1776..f38639676b00 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -978,9 +978,9 @@ SYSCALL_DEFINE5(mq_timedsend, mqd_t, mqdes, const char __user *, u_msg_ptr,

audit_mq_sendrecv(mqdes, msg_len, msg_prio, timeout ? &ts : NULL);

- f = fdget(mqdes);
- if (unlikely(!f.file)) {
- ret = -EBADF;
+ f = fdgetr(mqdes, CAP_WRITE);
+ if (unlikely(IS_ERR(f.file))) {
+ ret = PTR_ERR(f.file);
goto out;
}

@@ -1094,9 +1094,9 @@ SYSCALL_DEFINE5(mq_timedreceive, mqd_t, mqdes, char __user *, u_msg_ptr,

audit_mq_sendrecv(mqdes, msg_len, 0, timeout ? &ts : NULL);

- f = fdget(mqdes);
- if (unlikely(!f.file)) {
- ret = -EBADF;
+ f = fdgetr(mqdes, CAP_READ);
+ if (unlikely(IS_ERR(f.file))) {
+ ret = PTR_ERR(f.file);
goto out;
}

@@ -1229,9 +1229,9 @@ SYSCALL_DEFINE2(mq_notify, mqd_t, mqdes,
skb_put(nc, NOTIFY_COOKIE_LEN);
/* and attach it to the socket */
retry:
- f = fdget(notification.sigev_signo);
- if (!f.file) {
- ret = -EBADF;
+ f = fdgetr(notification.sigev_signo, CAP_POLL_EVENT);
+ if (IS_ERR(f.file)) {
+ ret = PTR_ERR(f.file);
goto out;
}
sock = netlink_getsockbyfilp(f.file);
@@ -1254,9 +1254,9 @@ retry:
}
}

- f = fdget(mqdes);
- if (!f.file) {
- ret = -EBADF;
+ f = fdgetr(mqdes, CAP_POLL_EVENT);
+ if (IS_ERR(f.file)) {
+ ret = PTR_ERR(f.file);
goto out;
}

@@ -1328,9 +1328,9 @@ SYSCALL_DEFINE3(mq_getsetattr, mqd_t, mqdes,
return -EINVAL;
}

- f = fdget(mqdes);
- if (!f.file) {
- ret = -EBADF;
+ f = fdgetr(mqdes, CAP_POLL_EVENT);
+ if (IS_ERR(f.file)) {
+ ret = PTR_ERR(f.file);
goto out;
}

diff --git a/kernel/events/core.c b/kernel/events/core.c
index a33d9a2bcbd7..b44e4c0d7dcb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -603,11 +603,11 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
{
struct perf_cgroup *cgrp;
struct cgroup_subsys_state *css;
- struct fd f = fdget(fd);
+ struct fd f = fdgetr(fd, CAP_FSTAT);
int ret = 0;

- if (!f.file)
- return -EBADF;
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);

css = css_tryget_online_from_dir(f.file->f_dentry,
&perf_event_cgrp_subsys);
@@ -3636,9 +3636,10 @@ static const struct file_operations perf_fops;

static inline int perf_fget_light(int fd, struct fd *p)
{
- struct fd f = fdget(fd);
- if (!f.file)
- return -EBADF;
+ struct fd f = fdgetr(fd, CAP_WRITE);
+
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);

if (f.file->f_op != &perf_fops) {
fdput(f);
@@ -3689,7 +3690,7 @@ static long perf_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
int ret;
if (arg != -1) {
struct perf_event *output_event;
- struct fd output;
+ struct fd output = { .file = NULL };
ret = perf_fget_light(arg, &output);
if (ret)
return ret;
diff --git a/kernel/module.c b/kernel/module.c
index 81e727cf6df9..f7f781dac164 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -2513,14 +2513,18 @@ static int copy_module_from_user(const void __user *umod, unsigned long len,
/* Sets info->hdr and info->len. */
static int copy_module_from_fd(int fd, struct load_info *info)
{
- struct fd f = fdget(fd);
+ struct fd f = fdgetr(fd, CAP_FEXECVE);
int err;
struct kstat stat;
loff_t pos;
ssize_t bytes = 0;

- if (!f.file)
- return -ENOEXEC;
+ if (IS_ERR(f.file)) {
+ err = PTR_ERR(f.file);
+ if (err == -EBADF)
+ err = -ENOEXEC;
+ return err;
+ }

err = security_kernel_module_from_file(f.file);
if (err)
diff --git a/kernel/sys.c b/kernel/sys.c
index 66a751ebf9d9..8d8ccf6cfb38 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1634,9 +1634,9 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
struct inode *inode;
int err;

- exe = fdget(fd);
- if (!exe.file)
- return -EBADF;
+ exe = fdgetr(fd, CAP_FEXECVE);
+ if (IS_ERR(exe.file))
+ return PTR_ERR(exe.file);

inode = file_inode(exe.file);

diff --git a/kernel/taskstats.c b/kernel/taskstats.c
index 13d2f7cd65db..1ad5fd005334 100644
--- a/kernel/taskstats.c
+++ b/kernel/taskstats.c
@@ -437,8 +437,8 @@ static int cgroupstats_user_cmd(struct sk_buff *skb, struct genl_info *info)
return -EINVAL;

fd = nla_get_u32(info->attrs[CGROUPSTATS_CMD_ATTR_FD]);
- f = fdget(fd);
- if (!f.file)
+ f = fdgetr(fd, CAP_FSTAT);
+ if (IS_ERR(f.file))
return 0;

size = nla_total_size(sizeof(struct cgroupstats));
diff --git a/kernel/time/posix-clock.c b/kernel/time/posix-clock.c
index ce033c7aa2e8..0766473c77ea 100644
--- a/kernel/time/posix-clock.c
+++ b/kernel/time/posix-clock.c
@@ -246,13 +246,18 @@ struct posix_clock_desc {
struct posix_clock *clk;
};

-static int get_clock_desc(const clockid_t id, struct posix_clock_desc *cd)
+static int get_clock_desc(const clockid_t id, struct posix_clock_desc *cd,
+ u64 right)
{
- struct file *fp = fget(CLOCKID_TO_FD(id));
+ struct file *fp = fgetr(CLOCKID_TO_FD(id), right);
int err = -EINVAL;

- if (!fp)
+ if (IS_ERR(fp)) {
+ err = PTR_ERR(fp);
+ if (err == -EBADF)
+ err = -EINVAL;
return err;
+ }

if (fp->f_op->open != posix_clock_open || !fp->private_data)
goto out;
@@ -278,7 +283,7 @@ static int pc_clock_adjtime(clockid_t id, struct timex *tx)
struct posix_clock_desc cd;
int err;

- err = get_clock_desc(id, &cd);
+ err = get_clock_desc(id, &cd, CAP_WRITE);
if (err)
return err;

@@ -302,7 +307,7 @@ static int pc_clock_gettime(clockid_t id, struct timespec *ts)
struct posix_clock_desc cd;
int err;

- err = get_clock_desc(id, &cd);
+ err = get_clock_desc(id, &cd, CAP_READ);
if (err)
return err;

@@ -321,7 +326,7 @@ static int pc_clock_getres(clockid_t id, struct timespec *ts)
struct posix_clock_desc cd;
int err;

- err = get_clock_desc(id, &cd);
+ err = get_clock_desc(id, &cd, CAP_READ);
if (err)
return err;

@@ -340,7 +345,7 @@ static int pc_clock_settime(clockid_t id, const struct timespec *ts)
struct posix_clock_desc cd;
int err;

- err = get_clock_desc(id, &cd);
+ err = get_clock_desc(id, &cd, CAP_WRITE);
if (err)
return err;

@@ -365,7 +370,7 @@ static int pc_timer_create(struct k_itimer *kit)
struct posix_clock_desc cd;
int err;

- err = get_clock_desc(id, &cd);
+ err = get_clock_desc(id, &cd, CAP_WRITE);
if (err)
return err;

@@ -385,7 +390,7 @@ static int pc_timer_delete(struct k_itimer *kit)
struct posix_clock_desc cd;
int err;

- err = get_clock_desc(id, &cd);
+ err = get_clock_desc(id, &cd, CAP_WRITE);
if (err)
return err;

@@ -404,7 +409,7 @@ static void pc_timer_gettime(struct k_itimer *kit, struct itimerspec *ts)
clockid_t id = kit->it_clock;
struct posix_clock_desc cd;

- if (get_clock_desc(id, &cd))
+ if (get_clock_desc(id, &cd, CAP_READ))
return;

if (cd.clk->ops.timer_gettime)
@@ -420,7 +425,7 @@ static int pc_timer_settime(struct k_itimer *kit, int flags,
struct posix_clock_desc cd;
int err;

- err = get_clock_desc(id, &cd);
+ err = get_clock_desc(id, &cd, CAP_WRITE);
if (err)
return err;

diff --git a/mm/fadvise.c b/mm/fadvise.c
index 3bcfd81db45e..69d51a43dc56 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -27,7 +27,7 @@
*/
SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
{
- struct fd f = fdget(fd);
+ struct fd f;
struct address_space *mapping;
struct backing_dev_info *bdi;
loff_t endbyte; /* inclusive */
@@ -36,8 +36,9 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
unsigned long nrpages;
int ret = 0;

- if (!f.file)
- return -EBADF;
+ f = fdgetr(fd, CAP_LIST_END);
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);

if (S_ISFIFO(file_inode(f.file)->i_mode)) {
ret = -ESPIPE;
diff --git a/mm/internal.h b/mm/internal.h
index 7f22a11fcc66..07f18aa9ccfb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -13,6 +13,8 @@

#include <linux/fs.h>
#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/capsicum.h>

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
@@ -91,6 +93,23 @@ static inline void get_page_foll(struct page *page)
}
}

+static inline struct capsicum_rights *
+mmap_rights(struct capsicum_rights *rights,
+ unsigned long prot,
+ unsigned long flags)
+{
+#ifdef CONFIG_SECURITY_CAPSICUM
+ cap_rights_init(rights, CAP_MMAP);
+ if (prot & PROT_READ)
+ cap_rights_set(rights, CAP_MMAP_R);
+ if ((flags & MAP_SHARED) && (prot & PROT_WRITE))
+ cap_rights_set(rights, CAP_MMAP_W);
+ if (prot & PROT_EXEC)
+ cap_rights_set(rights, CAP_MMAP_X);
+#endif
+ return rights;
+}
+
extern unsigned long highest_memmap_pfn;

/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a2c7bcb0e6eb..d09576bcbf57 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5868,9 +5868,9 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
init_waitqueue_func_entry(&event->wait, memcg_event_wake);
INIT_WORK(&event->remove, memcg_event_remove);

- efile = fdget(efd);
- if (!efile.file) {
- ret = -EBADF;
+ efile = fdgetr(efd, CAP_WRITE);
+ if (IS_ERR(efile.file)) {
+ ret = PTR_ERR(efile.file);
goto out_kfree;
}

@@ -5880,9 +5880,9 @@ static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
goto out_put_efile;
}

- cfile = fdget(cfd);
- if (!cfile.file) {
- ret = -EBADF;
+ cfile = fdgetr(cfd, CAP_READ);
+ if (IS_ERR(cfile.file)) {
+ ret = PTR_ERR(cfile.file);
goto out_put_eventfd;
}

diff --git a/mm/mmap.c b/mm/mmap.c
index 129b847d30cc..e2e3cf45feec 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1381,10 +1381,13 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
unsigned long retval = -EBADF;

if (!(flags & MAP_ANONYMOUS)) {
+ struct capsicum_rights rights;
audit_mmap_fd(fd, flags);
- file = fget(fd);
- if (!file)
+ file = fget_rights(fd, mmap_rights(&rights, prot, flags));
+ if (IS_ERR(file)) {
+ retval = PTR_ERR(file);
goto out;
+ }
if (is_file_hugepages(file))
len = ALIGN(len, huge_page_size(hstate_file(file)));
retval = -EINVAL;
diff --git a/mm/nommu.c b/mm/nommu.c
index 4a852f6c5709..3b1be7a25b39 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1496,13 +1496,17 @@ SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
unsigned long, fd, unsigned long, pgoff)
{
struct file *file = NULL;
- unsigned long retval = -EBADF;
+ unsigned long retval;

audit_mmap_fd(fd, flags);
if (!(flags & MAP_ANONYMOUS)) {
- file = fget(fd);
- if (!file)
+ struct capsicum_rights rights;
+
+ file = fget_rights(fd, mmap_rights(&rights, prot, flags));
+ if (IS_ERR(file)) {
+ retval = PTR_ERR(file);
goto out;
+ }
}

flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
diff --git a/mm/readahead.c b/mm/readahead.c
index 0ca36a7770b1..781125653dcf 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -566,8 +566,8 @@ SYSCALL_DEFINE3(readahead, int, fd, loff_t, offset, size_t, count)
struct fd f;

ret = -EBADF;
- f = fdget(fd);
- if (f.file) {
+ f = fdgetr(fd, CAP_PREAD);
+ if (!IS_ERR(f.file)) {
if (f.file->f_mode & FMODE_READ) {
struct address_space *mapping = f.file->f_mapping;
pgoff_t start = offset >> PAGE_CACHE_SHIFT;
@@ -576,6 +576,8 @@ SYSCALL_DEFINE3(readahead, int, fd, loff_t, offset, size_t, count)
ret = do_readahead(mapping, f.file, start, len);
}
fdput(f);
+ } else {
+ ret = PTR_ERR(f.file);
}
return ret;
}
diff --git a/net/9p/trans_fd.c b/net/9p/trans_fd.c
index 80d08f6664cb..6d0866ac873d 100644
--- a/net/9p/trans_fd.c
+++ b/net/9p/trans_fd.c
@@ -789,12 +789,12 @@ static int p9_fd_open(struct p9_client *client, int rfd, int wfd)
if (!ts)
return -ENOMEM;

- ts->rd = fget(rfd);
- ts->wr = fget(wfd);
- if (!ts->rd || !ts->wr) {
- if (ts->rd)
+ ts->rd = fgetr(rfd, CAP_READ, CAP_POLL_EVENT);
+ ts->wr = fgetr(wfd, CAP_WRITE, CAP_POLL_EVENT);
+ if (IS_ERR(ts->rd) || IS_ERR(ts->wr)) {
+ if (!IS_ERR(ts->rd))
fput(ts->rd);
- if (ts->wr)
+ if (!IS_ERR(ts->wr))
fput(ts->wr);
kfree(ts);
return -EIO;
diff --git a/sound/core/pcm_native.c b/sound/core/pcm_native.c
index b653ab001fba..8bd3eb38f260 100644
--- a/sound/core/pcm_native.c
+++ b/sound/core/pcm_native.c
@@ -1611,10 +1611,14 @@ static int snd_pcm_link(struct snd_pcm_substream *substream, int fd)
struct snd_pcm_file *pcm_file;
struct snd_pcm_substream *substream1;
struct snd_pcm_group *group;
- struct fd f = fdget(fd);
+ struct fd f = fdgetr(fd, CAP_LIST_END);

- if (!f.file)
- return -EBADFD;
+ if (IS_ERR(f.file)) {
+ res = PTR_ERR(f.file);
+ if (res == -EBADF)
+ return -EBADFD;
+ return res;
+ }
if (!is_pcm_file(f.file)) {
res = -EBADFD;
goto _badf;
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 20c3af7692c5..f1e38c413731 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -311,9 +311,9 @@ kvm_irqfd_assign(struct kvm *kvm, struct kvm_irqfd *args)
INIT_WORK(&irqfd->inject, irqfd_inject);
INIT_WORK(&irqfd->shutdown, irqfd_shutdown);

- f = fdget(args->fd);
- if (!f.file) {
- ret = -EBADF;
+ f = fdgetr(args->fd, CAP_WRITE);
+ if (IS_ERR(f.file)) {
+ ret = PTR_ERR(f.file);
goto out;
}

diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
index ba1a93f935c7..1f427fafa03b 100644
--- a/virt/kvm/vfio.c
+++ b/virt/kvm/vfio.c
@@ -124,9 +124,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
if (get_user(fd, argp))
return -EFAULT;

- f = fdget(fd);
- if (!f.file)
- return -EBADF;
+ f = fdgetr(fd, CAP_FSTAT);
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);

vfio_group = kvm_vfio_group_get_external_user(f.file);
fdput(f);
@@ -164,9 +164,9 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
if (get_user(fd, argp))
return -EFAULT;

- f = fdget(fd);
- if (!f.file)
- return -EBADF;
+ f = fdgetr(fd, CAP_FSTAT);
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);

vfio_group = kvm_vfio_group_get_external_user(f.file);
fdput(f);
--
2.0.0.526.g5318336
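
Nearly every hunk in the patch above makes the same conversion: a lookup that used to return NULL on failure now returns an error pointer, so the caller can tell "no such descriptor" apart from "descriptor lacks the required rights". The following is a minimal user-space sketch of that pattern, not the kernel implementation. The ERR_PTR/PTR_ERR/IS_ERR helpers mimic the kernel's linux/err.h; the names fgetr_sketch, clock_lookup_sketch, struct file_sketch and the numeric value of ENOTCAPABLE_SKETCH are assumptions made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* The top MAX_ERRNO addresses are reserved for encoded errnos. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Placeholder errno for a rights failure; the value is an assumption. */
#define ENOTCAPABLE_SKETCH 134

/* Hypothetical stand-in for an open file with a rights bitmask. */
struct file_sketch {
	int fd;
	uint64_t rights;
};

static struct file_sketch sketch_table[] = {
	{ 3, 0x1 },	/* fd 3: "read" only */
	{ 4, 0x3 },	/* fd 4: "read" and "write" */
};

/* Mock of a rights-checking lookup: never returns NULL. */
static struct file_sketch *fgetr_sketch(int fd, uint64_t need)
{
	size_t i;

	for (i = 0; i < sizeof(sketch_table) / sizeof(sketch_table[0]); i++) {
		if (sketch_table[i].fd != fd)
			continue;
		if ((sketch_table[i].rights & need) != need)
			return ERR_PTR(-ENOTCAPABLE_SKETCH);
		return &sketch_table[i];
	}
	return ERR_PTR(-EBADF);
}

/*
 * A caller that historically reported -EINVAL for a bad descriptor keeps
 * doing so, while any other error (such as a rights failure) is passed
 * through unchanged. This mirrors the get_clock_desc() hunk.
 */
static int clock_lookup_sketch(int fd, uint64_t need)
{
	struct file_sketch *fp = fgetr_sketch(fd, need);

	if (IS_ERR(fp)) {
		int err = (int)PTR_ERR(fp);

		if (err == -EBADF)
			err = -EINVAL;
		return err;
	}
	return 0;
}
```

Remapping only -EBADF keeps the user-visible errno unchanged for existing callers, which appears to be why several hunks (module loading, posix clocks, snd_pcm_link) special-case it.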

2014-07-25 13:48:31

by David Drysdale

Subject: [PATCH 2/6] capsicum.7: describe Capsicum capability framework

Signed-off-by: David Drysdale <[email protected]>
---
man7/capsicum.7 | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 97 insertions(+)
create mode 100644 man7/capsicum.7

diff --git a/man7/capsicum.7 b/man7/capsicum.7
new file mode 100644
index 000000000000..e736060bb5bc
--- /dev/null
+++ b/man7/capsicum.7
@@ -0,0 +1,97 @@
+.\"
+.\" Copyright (c) 2014 Google, Inc.
+.\" Copyright (c) 2011, 2013 Robert N. M. Watson
+.\" Copyright (c) 2011 Jonathan Anderson
+.\" All rights reserved.
+.\"
+.\" %%%LICENSE_START(BSD_2_CLAUSE)
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\" %%%LICENSE_END
+.\"
+.TH CAPSICUM 7 2014-05-07 "Linux" "Linux Programmer's Manual"
+.SH NAME
+capsicum \- lightweight OS capability and sandbox framework
+.SH SYNOPSIS
+.B #include <sys/capsicum.h>
+.SH DESCRIPTION
+Capsicum is a lightweight OS capability and sandbox framework implementing a hybrid
+capability system model.
+Capsicum can be used for application and library compartmentalisation, the
+decomposition of larger bodies of software into isolated (sandboxed)
+components in order to implement security policies and limit the impact of
+software vulnerabilities.
+.PP
+Capsicum provides three core kernel mechanisms,
+.IR "Capsicum capabilities",
+.I "capability mode"
+and
+.IR "process descriptors",
+each described below.
+
+.SS Capsicum Capabilities
+A
+.I Capsicum capability
+is a file descriptor that has been limited so that only
+certain operations can be performed on it.
+For example, a file descriptor returned by
+.BR open (2)
+may be refined using
+.BR cap_rights_limit (2)
+so that only
+.BR read (2)
+and
+.BR write (2)
+can be called on it, but not
+.BR fchmod (2).
+The complete list of the capability rights can be found in the
+.BR rights (7)
+manual page.
+
+.SS Capability Mode
+Capsicum capability mode is a process mode, entered by invoking
+.BR cap_enter (3),
+in which access to global OS namespaces (such as the file system and PID
+namespaces) is restricted; only explicitly delegated rights, referenced by
+memory mappings or file descriptors, may be used.
+Once set, the flag is inherited by future child processes, and may not be
+cleared.
+
+.SS Process Descriptors
+.I Process descriptors
+are file descriptors representing processes, allowing parent processes to manage
+child processes without requiring access to the PID namespace, and are described in
+greater detail in
+.BR procdesc (7).
+.SH VERSIONS
+Capsicum support is available in the kernel since version 3.???.
+.SH SEE ALSO
+.BR cap_enter (3),
+.BR cap_getmode (3),
+.BR cap_rights_get (2),
+.BR cap_rights_limit (2),
+.BR pdfork (2),
+.BR pdgetpid (2),
+.BR pdkill (2),
+.BR pdwait4 (2),
+.BR procdesc (7),
+.BR rights (7)
+
--
2.0.0.526.g5318336

2014-07-25 13:48:39

by David Drysdale

[permalink] [raw]
Subject: [PATCH 6/6] prctl.2: describe PR_SET_OPENAT_BENEATH/PR_GET_OPENAT_BENEATH

---
man2/prctl.2 | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)

diff --git a/man2/prctl.2 b/man2/prctl.2
index 119989183ed3..f5f71af249f2 100644
--- a/man2/prctl.2
+++ b/man2/prctl.2
@@ -295,6 +295,41 @@ A value of 1 indicates
.BR execve (2)
will operate in the privilege-restricting mode described above.
.TP
+.BR PR_SET_OPENAT_BENEATH " (since Linux 3.??)"
+Set the calling process's
+.I openat_beneath
+bit to the value in
+.IR arg2 .
+With
+.I openat_beneath
+set to 1, all
+.BR openat (2)
+and
+.BR open (2)
+operations act as though the
+.B O_BENEATH
+flag is set.
+Once set, this bit cannot be unset.
+The setting of this bit is inherited by children created by
+.BR fork (2)
+and
+.BR clone (2),
+and preserved across
+.BR execve (2).
+.TP
+.BR PR_GET_OPENAT_BENEATH " (since Linux 3.??)"
+Return (as the function result) the value of the
+.I openat_beneath
+bit for the current process.
+A value of 0 indicates the regular behavior.
+A value of 1 indicates that
+.BR openat (2)
+and
+.BR open (2)
+will operate in the implicit
+.B O_BENEATH
+mode described above.
+.TP
.BR PR_SET_PDEATHSIG " (since Linux 2.1.57)"
Set the parent process death signal
of the calling process to \fIarg2\fP (either a signal value
--
2.0.0.526.g5318336

2014-07-25 13:48:41

by David Drysdale

[permalink] [raw]
Subject: [PATCH 4/6] cap_rights_limit.2: limit FD rights for Capsicum

Signed-off-by: David Drysdale <[email protected]>
---
man2/cap_rights_limit.2 | 241 ++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 241 insertions(+)
create mode 100644 man2/cap_rights_limit.2

diff --git a/man2/cap_rights_limit.2 b/man2/cap_rights_limit.2
new file mode 100644
index 000000000000..450bbdc6f86d
--- /dev/null
+++ b/man2/cap_rights_limit.2
@@ -0,0 +1,241 @@
+.\"
+.\" Copyright (c) 2008-2010 Robert N. M. Watson
+.\" Copyright (c) 2012-2013 The FreeBSD Foundation
+.\" Copyright (c) 2013-2014 Google, Inc.
+.\" All rights reserved.
+.\"
+.\" %%%LICENSE_START(BSD_2_CLAUSE)
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\" %%%LICENSE_END
+.\"
+.TH CAP_RIGHTS_LIMIT 2 2014-05-07 "Linux" "Linux Programmer's Manual"
+.SH NAME
+cap_rights_limit \- limit Capsicum capability rights
+.SH SYNOPSIS
+.nf
+.B #include <sys/capsicum.h>
+.sp
+.BI "int cap_rights_limit(int " fd ", const struct cap_rights *" rights ,
+.BI " unsigned int " fcntls ,
+.BI " int " nioctls ", unsigned int *" ioctls );
+.SH DESCRIPTION
+When a file descriptor is created by a function such as
+.BR accept (2),
+.BR accept4 (2),
+.BR creat (2),
+.BR epoll_create (2),
+.BR eventfd (2),
+.BR mq_open (2),
+.BR open (2),
+.BR openat (2),
+.BR pdfork (2),
+.BR pipe (2),
+.BR pipe2 (2),
+.BR signalfd (2),
+.BR socket (2),
+.BR socketpair (2)
+or
+.BR timerfd_create (2),
+it implicitly has all Capsicum capability rights.
+Those rights can be reduced (but never expanded) by using the
+.BR cap_rights_limit ()
+system call.
+Once Capsicum capability rights are reduced, operations on the file descriptor
+.I fd
+will be limited to those permitted by the remainder of the arguments.
+.PP
+The
+.I rights
+argument describes the primary rights for the file descriptor, as a
+.I cap_rights
+structure:
+.in +4n
+.nf
+
+#define CAP_RIGHTS_VERSION_00 0
+#define CAP_RIGHTS_VERSION_01 1
+#define CAP_RIGHTS_VERSION_02 2
+#define CAP_RIGHTS_VERSION_03 3
+#define CAP_RIGHTS_VERSION CAP_RIGHTS_VERSION_00
+struct cap_rights {
+ __u64 cr_rights[CAP_RIGHTS_VERSION + 2];
+};
+.fi
+.in
+.PP
+The contents of the
+.I cr_rights
+array are encoded as follows.
+.IP \(bu
+The top 2 bits of
+.I cr_rights[0]
+hold the array size minus 2, allowing for an array size between 2 and 5 inclusive.
+.IP \(bu
+The top 2 bits of the other
+.I cr_rights
+entries are zero.
+.IP \(bu
+The following 5 bits of each array entry indicate its position in the array,
+from 0b00001 for entry [0], 0b00010 for entry [1], up to 0b10000 for entry [4].
+.IP \(bu
+The remaining 57 bits of each array entry identify a particular primary
+right.
+.PP
+This encoding allows for future expansion in the number of distinct rights;
+for example, a
+.I cap_rights
+structure holding the
+.B CAP_READ
+and
+.B CAP_BINDAT
+rights would have contents of
+.in +4n
+.nf
+0x0200000000000001 0x0400000000001000
+.fi
+.in
+as a version 0 array, but would be encoded in a larger version 1 array as
+.in +4n
+.nf
+0x4200000000000001 0x0400000000001000 0x0800000000000000.
+.fi
+.in
+.PP
+User programs should prepare the contents of the
+.I cap_rights
+structure with the
+.BR cap_rights_init (3)
+family of functions.
+The complete list of primary rights can be found in the
+.BR rights (7)
+manual page.
+.PP
+If a file descriptor is granted the
+.B CAP_FCNTL
+primary capability right, some specific
+.BR fcntl (2)
+commands can be selectively disallowed with the
+.I fcntls
+argument. The following flags may be specified in the
+.I fcntls
+argument for this:
+.TP
+.B CAP_FCNTL_GETFL
+Permit
+.B F_GETFL
+command.
+.TP
+.B CAP_FCNTL_SETFL
+Permit
+.B F_SETFL
+command.
+.TP
+.B CAP_FCNTL_GETOWN
+Permit
+.B F_GETOWN
+command.
+.TP
+.B CAP_FCNTL_SETOWN
+Permit
+.B F_SETOWN
+command.
+.PP
+A value of
+.B CAP_FCNTL_ALL
+for the
+.I fcntls
+argument leaves the set of allowed
+.BR fcntl (2)
+commands unchanged.
+.PP
+However, note that
+.BR fcntl (2)
+commands that are analogous to other system calls
+(such as
+.B F_DUPFD
+or
+.BR F_GETPIPE_SZ )
+are not controlled by the
+.B CAP_FCNTL
+right and so do not have corresponding values in the
+.I fcntls
+argument.
+.PP
+If a file descriptor is granted the
+.B CAP_IOCTL
+capability right, the list of allowed
+.BR ioctl (2)
+commands can be selectively reduced (but never expanded) using the
+.I nioctls
+and
+.I ioctls
+arguments.
+The
+.I ioctls
+argument is an array of
+.BR ioctl (2)
+command values and the
+.I nioctls
+argument specifies the number of elements in the array.
+.PP
+If the
+.I nioctls
+argument is -1 or 0, the
+.I ioctls
+argument is ignored, and either all
+.BR ioctl (2)
+operations or no
+.BR ioctl (2)
+operations (respectively) will be allowed.
+.PP
+Capsicum capability rights assigned to a file descriptor can be obtained with the
+.BR cap_rights_get (2)
+system call.
+.SH RETURN VALUE
+.BR cap_rights_limit ()
+returns zero on success. On error, -1 is returned and
+.I errno
+is set appropriately.
+.SH ERRORS
+.TP
+.B EBADF
+.I fd
+isn't a valid open file descriptor.
+.TP
+.B EINVAL
+An invalid set of rights has been requested in
+.IR rights .
+.TP
+.B ENOMEM
+Out of memory.
+.TP
+.B ENOTCAPABLE
+The arguments contain capability rights not present for the given file descriptor (the
+set of Capsicum capability rights can only be reduced, never expanded).
+.SH VERSIONS
+Capsicum support was added to the kernel in version 3.???.
+.SH SEE ALSO
+.BR cap_enter (3),
+.BR cap_rights_get (2),
+.BR cap_rights_init (3),
+.BR capsicum (7),
+.BR rights (7)
--
2.0.0.526.g5318336
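
The cr_rights encoding described in cap_rights_limit.2 above can be checked against the page's own worked example. The sketch below rebuilds that layout in plain user-space C under stated assumptions: the sketch_* helper names are hypothetical (real programs would use the cap_rights_init(3) family), and the low-bit values for CAP_READ (0x1 in entry 0) and CAP_BINDAT (0x1000 in entry 1) are inferred from the hexadecimal values quoted in the page rather than taken from a header.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 0..56 hold rights, bits 57..61 the entry position, bits 62..63 the size. */
#define SKETCH_RIGHT_BITS 57
#define SKETCH_MAX_WORDS  5

/* A right names its array entry (position bit) plus one low bit. */
#define SKETCH_RIGHT(idx, bit) \
	((UINT64_C(1) << (SKETCH_RIGHT_BITS + (idx))) | (uint64_t)(bit))

/* Values inferred from the worked example in the man page (assumption). */
#define SKETCH_CAP_READ   SKETCH_RIGHT(0, UINT64_C(0x1))
#define SKETCH_CAP_BINDAT SKETCH_RIGHT(1, UINT64_C(0x1000))

/* Prepare an empty rights array of nwords entries (2 to 5 inclusive). */
static int sketch_rights_init(uint64_t *words, int nwords)
{
	int i;

	if (nwords < 2 || nwords > SKETCH_MAX_WORDS)
		return -1;
	for (i = 0; i < nwords; i++)
		words[i] = UINT64_C(1) << (SKETCH_RIGHT_BITS + i);
	/* Top 2 bits of entry 0 hold the array size minus 2. */
	words[0] |= (uint64_t)(nwords - 2) << 62;
	return 0;
}

/* Recover which array entry a right belongs to from its position bits. */
static int sketch_word_index(uint64_t right)
{
	uint64_t pos = (right >> SKETCH_RIGHT_BITS) & 0x1f;
	int i;

	for (i = 0; i < SKETCH_MAX_WORDS; i++)
		if (pos == (UINT64_C(1) << i))
			return i;
	return -1;
}

/* Add one right to the array; fails if the array is too small for it. */
static int sketch_rights_set(uint64_t *words, int nwords, uint64_t right)
{
	int idx = sketch_word_index(right);

	if (idx < 0 || idx >= nwords)
		return -1;
	words[idx] |= right & ((UINT64_C(1) << SKETCH_RIGHT_BITS) - 1);
	return 0;
}

/* Read the array size back out of entry 0. */
static int sketch_array_size(const uint64_t *words)
{
	return (int)(words[0] >> 62) + 2;
}
```

Because every entry carries its own position bits, a right constant identifies its array slot without a separate index argument, which is what lets the array grow in later versions without renumbering existing rights.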

2014-07-25 13:48:38

by David Drysdale

[permalink] [raw]
Subject: [PATCH 5/6] cap_rights_get.2: retrieve Capsicum fd rights

Signed-off-by: David Drysdale <[email protected]>
---
man2/cap_rights_get.2 | 134 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 134 insertions(+)
create mode 100644 man2/cap_rights_get.2

diff --git a/man2/cap_rights_get.2 b/man2/cap_rights_get.2
new file mode 100644
index 000000000000..d6aa5b88c6af
--- /dev/null
+++ b/man2/cap_rights_get.2
@@ -0,0 +1,134 @@
+.\"
+.\" Copyright (c) 2008-2010 Robert N. M. Watson
+.\" Copyright (c) 2012-2013 The FreeBSD Foundation
+.\" Copyright (c) 2013-2014 Google, Inc.
+.\" All rights reserved.
+.\"
+.\" %%%LICENSE_START(BSD_2_CLAUSE)
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\" %%%LICENSE_END
+.\"
+.TH CAP_RIGHTS_GET 2 2014-05-07 "Linux" "Linux Programmer's Manual"
+.SH NAME
+cap_rights_get \- retrieve Capsicum capability rights
+.SH SYNOPSIS
+.nf
+.B #include <sys/capsicum.h>
+.sp
+.BI "int cap_rights_get(int " fd ", struct cap_rights *" rights ,
+.BI " unsigned int *" fcntls ,
+.BI " int *" nioctls ", unsigned int *" ioctls );
+.SH DESCRIPTION
+Obtain the current Capsicum capability rights for a file descriptor.
+.PP
+The function will fill the
+.I rights
+argument (if non-NULL) with the primary capability rights of the
+.I fd
+descriptor. The
+.I rights
+field is a
+.I cap_rights
+structure, as described in
+.BR cap_rights_limit (2),
+and can be examined with the
+.BR cap_rights_is_set (3)
+family of functions. The complete list of primary rights can be found in the
+.BR rights (7)
+manual page.
+.PP
+If the
+.I fcntls
+argument is non-NULL, it will be filled in with a bitmask of allowed
+.BR fcntl (2)
+commands; see
+.BR cap_rights_limit (2)
+for values. If the file descriptor does not have the
+.B CAP_FCNTL
+primary right, the returned
+.I fcntls
+value will be zero.
+.PP
+If the
+.I nioctls
+argument is non-NULL, it will be filled in with the number of allowed
+.BR ioctl (2)
+commands, or with the value CAP_IOCTLS_ALL to indicate that all
+.BR ioctl (2)
+commands are allowed. If the file descriptor does not have the
+.B CAP_IOCTL
+primary right, the returned
+.I nioctls
+value will be zero.
+.PP
+If the
+.I ioctls
+argument is non-NULL, the caller should specify the size of the
+provided buffer as the initial value of the
+.I nioctls
+argument (as a count of the number of
+.BR ioctl (2)
+command values the buffer can hold).
+On successful completion of the system call, the
+.I ioctls
+buffer is filled with the
+.BR ioctl (2)
+command values, up to a maximum of the initial value of
+.IR nioctls .
+.PP
+If all
+.BR ioctl (2)
+commands are allowed (the
+.B CAP_IOCTL
+primary capability right is assigned to the file descriptor and the
+set of allowed
+.BR ioctl (2)
+commands was never limited for this file descriptor), the
+system call will not modify the buffer pointed to by the
+.I ioctls
+argument.
+.PP
+Capsicum capability rights assigned to a file descriptor can be reduced with the
+.BR cap_rights_limit (2)
+system call.
+.SH RETURN VALUE
+.BR cap_rights_get ()
+returns zero on success. On error, -1 is returned and
+.I errno
+is set appropriately.
+.SH ERRORS
+.TP
+.B EBADF
+.I fd
+isn't a valid open file descriptor.
+.TP
+.B EFAULT
+Invalid pointer argument.
+.SH VERSIONS
+Capsicum support was added to the kernel in version 3.???.
+.SH SEE ALSO
+.BR cap_enter (3),
+.BR cap_rights_limit (2),
+.BR cap_rights_init (3),
+.BR capsicum (7),
+.BR rights (7)
+
--
2.0.0.526.g5318336
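
The in/out use of nioctls described in cap_rights_get.2 above is easy to misread, so here is a small mock of one reading of that text: the caller passes the buffer capacity in *nioctls, the call stores the total number of allowed commands there, and at most "capacity" entries are copied. This is a user-space stand-in, not the system call. The function sketch_get_ioctls, struct sketch_fd_state and the value -1 for SKETCH_IOCTLS_ALL are assumptions for illustration (the -1 follows the "nioctls is -1" wording in cap_rights_limit.2).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sentinel meaning "all ioctl commands allowed". */
#define SKETCH_IOCTLS_ALL (-1)

/* Hypothetical per-descriptor state that the kernel would hold. */
struct sketch_fd_state {
	int nallowed;			/* SKETCH_IOCTLS_ALL or a count */
	unsigned int allowed[8];	/* allowed ioctl command values */
};

/*
 * Mock of the ioctl-list part of the retrieval call.
 * On entry *nioctls is the capacity of the ioctls buffer; on return it
 * is the number of allowed commands (or SKETCH_IOCTLS_ALL).
 */
static int sketch_get_ioctls(const struct sketch_fd_state *st,
			     int *nioctls, unsigned int *ioctls)
{
	int capacity, i;

	if (nioctls == NULL)
		return 0;
	capacity = *nioctls;
	*nioctls = st->nallowed;

	/* With no limit in place the caller's buffer is left untouched. */
	if (st->nallowed == SKETCH_IOCTLS_ALL || ioctls == NULL)
		return 0;

	for (i = 0; i < st->nallowed && i < capacity; i++)
		ioctls[i] = st->allowed[i];
	return 0;
}
```

Under this reading, a caller that gets back a count larger than the capacity it supplied knows the list was truncated and can retry with a bigger buffer.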

2014-07-25 13:49:34

by David Drysdale

[permalink] [raw]
Subject: [PATCH 3/6] rights.7: Describe Capsicum primary rights

Signed-off-by: David Drysdale <[email protected]>
---
man7/rights.7 | 525 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 525 insertions(+)
create mode 100644 man7/rights.7

diff --git a/man7/rights.7 b/man7/rights.7
new file mode 100644
index 000000000000..33bb8e48d12f
--- /dev/null
+++ b/man7/rights.7
@@ -0,0 +1,525 @@
+.\"
+.\" Copyright (c) 2014 Google, Inc.
+.\" Copyright (c) 2012-2013 The FreeBSD Foundation
+.\" Copyright (c) 2008-2010 Robert N. M. Watson
+.\" All rights reserved.
+.\"
+.\" This software was developed at the University of Cambridge Computer
+.\" Laboratory with support from a grant from Google, Inc.
+.\"
+.\" %%%LICENSE_START(BSD_2_CLAUSE)
+.\" Portions of this documentation were written by Pawel Jakub Dawidek
+.\" under sponsorship from the FreeBSD Foundation.
+.\"
+.\" Redistribution and use in source and binary forms, with or without
+.\" modification, are permitted provided that the following conditions
+.\" are met:
+.\" 1. Redistributions of source code must retain the above copyright
+.\" notice, this list of conditions and the following disclaimer.
+.\" 2. Redistributions in binary form must reproduce the above copyright
+.\" notice, this list of conditions and the following disclaimer in the
+.\" documentation and/or other materials provided with the distribution.
+.\"
+.\" THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+.\" ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+.\" SUCH DAMAGE.
+.\" %%%LICENSE_END
+.\"
+.TH RIGHTS 7 2014-05-07 "Linux" "Linux Programmer's Manual"
+.SH NAME
+rights \- Capsicum capability rights for file descriptors
+.SH SYNOPSIS
+.B #include <linux/capsicum.h>
+.SH DESCRIPTION
+When a file descriptor is created by a function such as
+.BR accept (2),
+.BR accept4 (2),
+.BR creat (2),
+.BR epoll_create (2),
+.BR eventfd (2),
+.BR mq_open (2),
+.BR open (2),
+.BR openat (2),
+.BR pdfork (2),
+.BR pipe (2),
+.BR pipe2 (2),
+.BR signalfd (2),
+.BR socket (2),
+.BR socketpair (2)
+or
+.BR timerfd_create (2),
+it implicitly has all Capsicum capability rights.
+Those rights can be reduced (but never expanded) by using the
+.BR cap_rights_limit (2)
+system call.
+Once capability rights are reduced, operations on the file descriptor will be
+limited to those permitted by the associated rights.
+.PP
+The list of primary capability rights is provided below. In addition,
+.BR ioctl (2)
+and
+.BR fcntl (2)
+can be restricted to allow only specific commands.
+.PP
+The
+.I "struct cap_rights"
+type is used to store a list of primary capability rights; the
+.BR cap_rights_init (3)
+family of functions should be used to manage the structure.
+.SH RIGHTS
+The following rights may be specified in a rights mask:
+.TP
+.B CAP_ACCEPT
+Permit
+.BR accept (2)
+and
+.BR accept4 (2).
+.TP
+.B CAP_BIND
+Permit
+.BR bind (2).
+Note that sockets can also become bound implicitly as a result of
+.BR connect (2)
+or
+.BR send (2),
+and that socket options set with
+.BR setsockopt (2)
+may also affect binding behavior.
+.TP
+.B CAP_CONNECT
+Permit
+.BR connect (2);
+also required for
+.BR sendto (2)
+with a non-NULL destination address.
+.TP
+.B CAP_CREATE
+Permit
+.BR openat (2)
+with the
+.B O_CREAT
+flag.
+.TP
+.B CAP_EVENT
+Permit
+.BR select (2),
+.BR poll (2),
+and
+.BR epoll (7)
+to be used in monitoring the file descriptor for events.
+.TP
+.B CAP_EXTATTR_DELETE
+Permit
+.BR fremovexattr (2).
+.TP
+.B CAP_EXTATTR_GET
+Permit
+.BR fgetxattr (2).
+.TP
+.B CAP_EXTATTR_LIST
+Permit
+.BR flistxattr (2).
+.TP
+.B CAP_EXTATTR_SET
+Permit
+.BR fsetxattr (2).
+.TP
+.B CAP_FCHDIR
+Permit
+.BR fchdir (2).
+.TP
+.B CAP_FCHMOD
+Permit
+.BR fchmod (2)
+and
+.BR fchmodat (2)
+if the
+.B CAP_LOOKUP
+right is also present.
+.TP
+.B CAP_FCHMODAT
+An alias to
+.B CAP_FCHMOD
+and
+.BR CAP_LOOKUP .
+.TP
+.B CAP_FCHOWN
+Permit
+.BR fchown (2)
+and
+.BR fchownat (2)
+if the
+.B CAP_LOOKUP
+right is also present.
+.TP
+.B CAP_FCHOWNAT
+An alias to
+.B CAP_FCHOWN
+and
+.BR CAP_LOOKUP .
+.TP
+.B CAP_FCNTL
+Permit
+.BR fcntl (2).
+Note that only the
+.BR F_GETFL ,
+.BR F_SETFL ,
+.BR F_GETOWN ,
+.BR F_SETOWN ,
+.B F_GETOWN_EX
+and
+.B F_SETOWN_EX
+commands require this capability right.
+Also note that the list of permitted commands can be further limited with the
+.BR cap_rights_limit (2)
+system call.
+.TP
+.B CAP_FEXECVE
+Permit
+.BR execveat (2)
+and
+.BR openat (2)
+with the
+.B O_EXEC
+flag;
+.B CAP_READ
+is also required.
+.TP
+.B CAP_FLOCK
+Permit
+.BR flock (2)
+and
+.BR fcntl (2)
+(with
+.BR F_GETLK ,
+.BR F_SETLK
+or
+.B F_SETLKW
+command).
+.TP
+.B CAP_FSTAT
+Permit
+.BR fstat (2).
+.TP
+.B CAP_FSTATFS
+Permit
+.BR fstatfs (2).
+.TP
+.B CAP_FSYNC
+Permit
+.BR fsync (2)
+and
+.BR openat (2)
+with the
+.B O_SYNC
+flag.
+.TP
+.B CAP_FTRUNCATE
+Permit
+.BR ftruncate (2)
+and
+.BR openat (2)
+with the
+.B O_TRUNC
+flag.
+.TP
+.B CAP_FUTIMES
+Permit
+.BR futimesat (2)
+if the
+.B CAP_LOOKUP
+right is also present.
+.TP
+.B CAP_FUTIMESAT
+An alias to
+.B CAP_FUTIMES
+and
+.BR CAP_LOOKUP .
+.TP
+.B CAP_GETPEERNAME
+Permit
+.BR getpeername (2).
+.TP
+.B CAP_GETSOCKNAME
+Permit
+.BR getsockname (2).
+.TP
+.B CAP_GETSOCKOPT
+Permit
+.BR getsockopt (2).
+.TP
+.B CAP_IOCTL
+Permit
+.BR ioctl (2).
+Be aware that this system call has enormous scope, including potentially
+global scope for some objects.
+The list of permitted ioctl commands can be further limited with the
+.BR cap_rights_limit (2)
+system call.
+.TP
+.B CAP_LINKAT
+Permit
+.BR linkat (2)
+and
+.BR renameat (2)
+on the destination directory descriptor.
+This right includes the
+.B CAP_LOOKUP
+right.
+.TP
+.B CAP_LISTEN
+Permit
+.BR listen (2);
+not much use (generally) without
+.BR CAP_BIND .
+.TP
+.B CAP_LOOKUP
+Permit the file descriptor to be used as a starting directory for calls such as
+.BR linkat (2),
+.BR openat (2),
+and
+.BR unlinkat (2).
+.TP
+.B CAP_MKDIRAT
+Permit
+.BR mkdirat (2).
+This right includes the
+.B CAP_LOOKUP
+right.
+.TP
+.B CAP_MKFIFOAT
+Permit
+.BR mkfifoat (2).
+This right includes the
+.B CAP_LOOKUP
+right.
+.TP
+.B CAP_MKNODAT
+Permit
+.BR mknodat (2).
+This right includes the
+.B CAP_LOOKUP
+right.
+.TP
+.B CAP_MMAP
+Permit
+.BR mmap (2)
+with the
+.B PROT_NONE
+protection.
+.TP
+.B CAP_MMAP_R
+Permit
+.BR mmap (2)
+with the
+.B PROT_READ
+protection.
+This right includes the
+.B CAP_READ
+and
+.B CAP_SEEK
+rights.
+.TP
+.B CAP_MMAP_RW
+An alias to
+.B CAP_MMAP_R
+and
+.BR CAP_MMAP_W .
+.TP
+.B CAP_MMAP_RWX
+An alias to
+.BR CAP_MMAP_R ,
+.B CAP_MMAP_W
+and
+.BR CAP_MMAP_X .
+.TP
+.B CAP_MMAP_RX
+An alias to
+.B CAP_MMAP_R
+and
+.BR CAP_MMAP_X .
+.TP
+.B CAP_MMAP_W
+Permit
+.BR mmap (2)
+with the
+.B PROT_WRITE
+protection.
+This right includes the
+.B CAP_WRITE
+and
+.B CAP_SEEK
+rights.
+.TP
+.B CAP_MMAP_WX
+An alias to
+.B CAP_MMAP_W
+and
+.BR CAP_MMAP_X .
+.TP
+.B CAP_MMAP_X
+Permit
+.BR mmap (2)
+with the
+.B PROT_EXEC
+protection.
+This right includes the
+.B CAP_SEEK
+right.
+.TP
+.B CAP_PDGETPID
+Permit
+.BR pdgetpid (2).
+.TP
+.B CAP_PDKILL
+Permit
+.BR pdkill (2).
+.TP
+.B CAP_PDWAIT
+Permit
+.BR pdwait4 (2).
+.TP
+.B CAP_PEELOFF
+Permit
+.BR sctp_peeloff (3).
+.TP
+.B CAP_PREAD
+An alias to
+.B CAP_READ
+and
+.BR CAP_SEEK .
+.TP
+.B CAP_PWRITE
+An alias to
+.B CAP_SEEK
+and
+.BR CAP_WRITE .
+.TP
+.B CAP_READ
+Permit
+.BR openat (2)
+with the
+.BR O_RDONLY " flag,"
+.BR read (2),
+.BR readv (2),
+.BR recv (2),
+.BR recvfrom (2),
+.BR recvmsg (2),
+.BR pread (2)
+(
+.B CAP_SEEK
+is also required),
+.BR preadv (2)
+(
+.B CAP_SEEK
+is also required) and related system calls.
+.TP
+.B CAP_RECV
+An alias to
+.BR CAP_READ .
+.TP
+.B CAP_RENAMEAT
+Permit
+.BR renameat (2).
+This right is required on the source directory descriptor.
+This right includes the
+.B CAP_LOOKUP
+right.
+.TP
+.B CAP_SEEK
+Permit operations that seek on the file descriptor, such as
+.BR lseek (2),
+but also required for I/O system calls that can read or write at any position
+in the file, such as
+.BR pread (2)
+and
+.BR pwrite (2).
+.TP
+.B CAP_SEND
+An alias to
+.BR CAP_WRITE .
+.TP
+.B CAP_SETSOCKOPT
+Permit
+.BR setsockopt (2);
+this controls various aspects of socket behavior and may affect binding,
+connecting, and other behaviors with global scope.
+.TP
+.B CAP_SHUTDOWN
+Permit explicit
+.BR shutdown (2);
+closing the socket will also generally shut down any connections on it.
+.TP
+.B CAP_SYMLINKAT
+Permit
+.BR symlinkat (2).
+This right includes the
+.B CAP_LOOKUP
+right.
+.TP
+.B CAP_UNLINKAT
+Permit
+.BR unlinkat (2)
+and
+.BR renameat (2).
+This right is only required for
+.BR renameat (2)
+on the destination directory descriptor if the destination object already
+exists and will be removed by the rename.
+This right includes the
+.B CAP_LOOKUP
+right.
+.TP
+.B CAP_WRITE
+Allow
+.BR openat (2)
+with
+.B O_WRONLY
+and
+.B O_APPEND
+flags set,
+.BR send (2),
+.BR sendmsg (2),
+.BR sendto (2),
+.BR write (2),
+.BR writev (2),
+.BR pwrite (2),
+.BR pwritev (2)
+and related system calls.
+For
+.BR sendto (2)
+with a non-NULL connection address,
+.B CAP_CONNECT
+is also required.
+For
+.BR openat (2)
+with the
+.B O_WRONLY
+flag, but without the
+.B O_APPEND
+flag,
+.B CAP_SEEK
+is also required.
+For
+.BR pwrite (2)
+and
+.BR pwritev (2),
+.B CAP_SEEK
+is also required.
+.SH VERSIONS
+Capsicum support was originally added to the kernel in version 3.???.
+.SH SEE ALSO
+.BR cap_enter (3),
+.BR cap_fcntls_limit (3),
+.BR cap_ioctls_limit (3),
+.BR cap_rights_limit (2),
+.BR cap_rights_limit (3),
+.BR capsicum (7)
--
2.0.0.526.g5318336

2014-07-25 13:49:54

by David Drysdale

Subject: [PATCH 1/6] open.2: describe O_BENEATH flag

Signed-off-by: David Drysdale <[email protected]>
---
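Reviewer note (not part of the patch): the lexical half of the O_BENEATH
rule documented here can be sketched in userspace as below. This is only an
illustration under stated assumptions: path_is_lexically_beneath() is a
hypothetical helper, and it does not model the symbolic-link checks, which
need the kernel's path walk.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch: returns true if `path` passes
 * the purely lexical part of the O_BENEATH rule (relative path, no ".."
 * component). Symbolic links are not examined here.
 */
bool path_is_lexically_beneath(const char *path)
{
	const char *p = path;

	if (*p == '/')		/* absolute paths are always rejected */
		return false;

	while (*p != '\0') {
		size_t len = strcspn(p, "/");	/* length of this component */

		if (len == 2 && p[0] == '.' && p[1] == '.')
			return false;	/* any ".." component is rejected */

		p += len;
		while (*p == '/')	/* skip one or more separators */
			p++;
	}
	return true;
}
```

Note that "subdir/../filename" is rejected even though it never leaves the
starting directory, matching the documented behaviour, while a component
such as "..data" is accepted because it is not exactly "..".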
man2/open.2 | 33 +++++++++++++++++++++++++++++++--
1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/man2/open.2 b/man2/open.2
index 475d9e405908..c3d080fb3bea 100644
--- a/man2/open.2
+++ b/man2/open.2
@@ -705,7 +705,7 @@ in a fully formed state (using
as described above).
.RE
.IP
-.B O_TMPFILE
+.B O_TMPFILE " (since Linux 3.??)"
requires support by the underlying filesystem;
only a subset of Linux filesystems provide that support.
In the initial implementation, support was provided in
@@ -715,6 +715,31 @@ XFS support was added
.\" commit ab29743117f9f4c22ac44c13c1647fb24fb2bafe
in Linux 3.15.
.TP
+.B O_BENEATH
+Ensure that the
+.I pathname
+is beneath the current working directory (for
+.BR open (2))
+or the
+.I dirfd
+(for
+.BR openat (2)).
+If the
+.I pathname
+is absolute or contains a path component of "..", the
+.BR open ()
+fails with the error
+.BR EACCES .
+This occurs even if the ".." path component would not actually
+escape the original directory; for example, a
+.I pathname
+of "subdir/../filename" would be rejected.
+Path components that are symbolic links to absolute paths, or that are
+relative paths containing a ".." component, will also cause the
+.BR open ()
+operation to fail with the error
+.BR EACCES .
+.TP
.B O_TRUNC
If the file already exists and is a regular file and the access mode allows
writing (i.e., is
@@ -791,7 +816,11 @@ The requested access to the file is not allowed, or search permission
is denied for one of the directories in the path prefix of
.IR pathname ,
or the file did not exist yet and write access to the parent directory
-is not allowed.
+is not allowed, or the
+.B O_BENEATH
+flag was specified and the
+.I pathname
+was not beneath the relevant directory.
(See also
.BR path_resolution (7).)
.TP
--
2.0.0.526.g5318336

2014-07-25 13:48:18

by David Drysdale

Subject: [PATCH 06/11] capsicum: implement sockfd_lookupr()

Add variants of sockfd_lookup() and related functions where the caller
indicates the operations that will be performed on the socket.

If CONFIG_SECURITY_CAPSICUM is defined, these variants use the
fgetr()-style functions to retrieve the struct file from the file
descriptor.

If CONFIG_SECURITY_CAPSICUM is not defined, these variants use the
normal fget() functions.

Signed-off-by: David Drysdale <[email protected]>
---
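Reviewer note (not part of the patch): the sockfd_lookupr() macro relies on a
variadic call with a 0ULL terminator appended by the preprocessor. A minimal
userspace sketch of that calling convention follows. The DEMO_RIGHT_* values
and demo_collect_rights() are hypothetical, and I am assuming that rights are
passed as 64-bit values; cap_rights_vinit() itself is not reproduced here.

```c
#include <stdarg.h>
#include <stdint.h>

/* Hypothetical right bits for illustration; not the values used by Capsicum. */
#define DEMO_RIGHT_READ		0x1ULL
#define DEMO_RIGHT_WRITE	0x2ULL
#define DEMO_RIGHT_CONNECT	0x4ULL

/* Collect rights until the 0ULL terminator appended by the macro below. */
uint64_t demo_collect_rights_(int fd, ...)
{
	uint64_t mask = 0;
	unsigned long long right;
	va_list ap;

	(void)fd;	/* a real lookup would use the descriptor; unused here */
	va_start(ap, fd);
	while ((right = va_arg(ap, unsigned long long)) != 0ULL)
		mask |= right;
	va_end(ap);
	return mask;
}

/* Callers list the rights they need; the terminator is added for them. */
#define demo_collect_rights(fd, ...) \
	demo_collect_rights_((fd), __VA_ARGS__, 0ULL)
```

One consequence of this design is that every right passed by a caller must
already have type unsigned long long, since va_arg() cannot check or convert
argument types.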
include/linux/net.h | 16 +++++++
net/socket.c | 120 ++++++++++++++++++++++++++++++++++++++++++++--------
2 files changed, 119 insertions(+), 17 deletions(-)

diff --git a/include/linux/net.h b/include/linux/net.h
index 17d83393afcc..05429ce3b730 100644
--- a/include/linux/net.h
+++ b/include/linux/net.h
@@ -24,6 +24,7 @@
#include <linux/fcntl.h> /* For O_CLOEXEC and O_NONBLOCK */
#include <linux/kmemcheck.h>
#include <linux/rcupdate.h>
+#include <linux/capsicum.h>
#include <linux/jump_label.h>
#include <uapi/linux/net.h>

@@ -222,6 +223,21 @@ struct socket *sock_from_file(struct file *file, int *err);
#define sockfd_put(sock) fput(sock->file)
int net_ratelimit(void);

+#ifdef CONFIG_SECURITY_CAPSICUM
+struct socket *sockfd_lookup_rights(int fd, int *err,
+ struct capsicum_rights *rights);
+struct socket *_sockfd_lookupr(int fd, int *err, ...);
+#define sockfd_lookupr(fd, err, ...) \
+ _sockfd_lookupr((fd), (err), __VA_ARGS__, 0ULL)
+#else
+static inline struct socket *
+sockfd_lookup_rights(int fd, int *err, struct capsicum_rights *rights)
+{
+ return sockfd_lookup(fd, err);
+}
+#define sockfd_lookupr(fd, err, ...) sockfd_lookup((fd), (err))
+#endif
+
#define net_ratelimited_function(function, ...) \
do { \
if (net_ratelimit()) \
diff --git a/net/socket.c b/net/socket.c
index abf56b2a14f9..cc2e59576b3c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -96,6 +96,7 @@
#include <net/compat.h>
#include <net/wext.h>
#include <net/cls_cgroup.h>
+#include <net/sctp/sctp.h>

#include <net/sock.h>
#include <linux/netfilter.h>
@@ -418,6 +419,108 @@ struct socket *sock_from_file(struct file *file, int *err)
}
EXPORT_SYMBOL(sock_from_file);

+static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
+{
+ struct fd f = fdget(fd);
+ struct socket *sock;
+
+ *err = -EBADF;
+ if (f.file) {
+ sock = sock_from_file(f.file, err);
+ if (likely(sock)) {
+ *fput_needed = f.flags;
+ return sock;
+ }
+ fdput(f);
+ }
+ return NULL;
+}
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+struct socket *sockfd_lookup_rights(int fd, int *err,
+ struct capsicum_rights *rights)
+{
+ struct file *file;
+ struct socket *sock;
+
+ file = fget_rights(fd, rights);
+ if (IS_ERR(file)) {
+ *err = PTR_ERR(file);
+ return NULL;
+ }
+
+ sock = sock_from_file(file, err);
+ if (!sock)
+ fput(file);
+ return sock;
+}
+EXPORT_SYMBOL(sockfd_lookup_rights);
+
+static struct socket *
+sockfd_lookup_light_rights(int fd, int *err, int *fput_needed,
+ const struct capsicum_rights **actual_rights,
+ const struct capsicum_rights *required_rights)
+{
+ struct fd f = fdget_raw_rights(fd, actual_rights, required_rights);
+ struct socket *sock;
+
+ *err = -EBADF;
+ if (!IS_ERR(f.file)) {
+ sock = sock_from_file(f.file, err);
+ if (likely(sock)) {
+ *fput_needed = f.flags;
+ return sock;
+ }
+ fdput(f);
+ } else {
+ *err = PTR_ERR(f.file);
+ }
+ return NULL;
+}
+
+struct socket *_sockfd_lookupr(int fd, int *err, ...)
+{
+ struct capsicum_rights rights;
+ struct socket *sock;
+ va_list ap;
+
+ va_start(ap, err);
+ sock = sockfd_lookup_rights(fd, err, cap_rights_vinit(&rights, ap));
+ va_end(ap);
+ return sock;
+}
+EXPORT_SYMBOL(_sockfd_lookupr);
+
+struct socket *_sockfd_lookupr_light(int fd, int *err, int *fput_needed, ...)
+{
+ struct capsicum_rights rights;
+ struct socket *sock;
+ va_list ap;
+
+ va_start(ap, fput_needed);
+ sock = sockfd_lookup_light_rights(fd, err, fput_needed,
+ NULL, cap_rights_vinit(&rights, ap));
+ va_end(ap);
+ return sock;
+}
+#define sockfd_lookupr_light(fd, err, fpn, ...) \
+ _sockfd_lookupr_light((fd), (err), (fpn), __VA_ARGS__, 0ULL)
+
+#else
+
+static inline struct socket *
+sockfd_lookup_light_rights(int fd, int *err, int *fput_needed,
+ const struct capsicum_rights **actual_rights,
+ const struct capsicum_rights *required_rights)
+{
+ return sockfd_lookup_light(fd, err, fput_needed);
+}
+
+#define sockfd_lookupr_light(f, e, p, ...) \
+ sockfd_lookup_light((f), (e), (p))
+
+#endif
+
/**
* sockfd_lookup - Go from a file number to its socket slot
* @fd: file handle
@@ -449,23 +552,6 @@ struct socket *sockfd_lookup(int fd, int *err)
}
EXPORT_SYMBOL(sockfd_lookup);

-static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
-{
- struct fd f = fdget(fd);
- struct socket *sock;
-
- *err = -EBADF;
- if (f.file) {
- sock = sock_from_file(f.file, err);
- if (likely(sock)) {
- *fput_needed = f.flags;
- return sock;
- }
- fdput(f);
- }
- return NULL;
-}
-
#define XATTR_SOCKPROTONAME_SUFFIX "sockprotoname"
#define XATTR_NAME_SOCKPROTONAME (XATTR_SYSTEM_PREFIX XATTR_SOCKPROTONAME_SUFFIX)
#define XATTR_NAME_SOCKPROTONAME_LEN (sizeof(XATTR_NAME_SOCKPROTONAME)-1)
--
2.0.0.526.g5318336

2014-07-25 13:50:27

by David Drysdale

Subject: [PATCH 11/11] seccomp: Add tgid and tid into seccomp_data

Add the current thread and thread group IDs into the data
available for seccomp-bpf programs to work on. This allows
installation of filters that police syscalls based on thread
or process ID, e.g. tgkill(2)/kill(2)/prctl(2).

Signed-off-by: David Drysdale <[email protected]>
---
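Reviewer note (not part of the patch): since filters are expected to test the
length of the data available, it may help to spell out where the new fields
land. The sketch below mirrors the extended layout in a local struct
(demo_seccomp_data is a hypothetical name, using <stdint.h> types in place of
__u32/__u64) and shows the length test in plain C rather than in BPF.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the extended layout; the struct name is hypothetical. */
struct demo_seccomp_data {
	int nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
	uint32_t tgid;
	uint32_t tid;
};

/* True if a buffer of data_len bytes is long enough to hold tgid and tid. */
int demo_has_tid_fields(size_t data_len)
{
	return data_len >=
	       offsetof(struct demo_seccomp_data, tid) + sizeof(uint32_t);
}
```

With a 4-byte int, the original structure ends at byte 64, so the new fields
sit at offsets 64 and 68 and the extended structure is 72 bytes long.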
include/uapi/linux/seccomp.h | 10 ++++++++++
kernel/seccomp.c | 2 ++
2 files changed, 12 insertions(+)

diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index ac2dc9f72973..b88370d6f6ca 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -36,12 +36,22 @@
* @instruction_pointer: at the time of the system call.
* @args: up to 6 system call arguments always stored as 64-bit values
* regardless of the architecture.
+ * @tgid: thread group ID of the thread executing the BPF program.
+ * @tid: thread ID of the thread executing the BPF program.
+ * The SECCOMP_DATA_TID_PRESENT macro indicates the presence of the
+ * tgid and tid fields; user programs may use this macro to conditionally
+ * compile code against older versions of the kernel. Note also that
+ * BPF programs should cope with the absence of these fields by testing
+ * the length of data available.
*/
struct seccomp_data {
int nr;
__u32 arch;
__u64 instruction_pointer;
__u64 args[6];
+ __u32 tgid;
+ __u32 tid;
};
+#define SECCOMP_DATA_TID_PRESENT 1

#endif /* _UAPI_LINUX_SECCOMP_H */
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 301bbc24739c..dd5146f15d6d 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -80,6 +80,8 @@ static void populate_seccomp_data(struct seccomp_data *sd)
sd->args[4] = args[4];
sd->args[5] = args[5];
sd->instruction_pointer = KSTK_EIP(task);
+ sd->tgid = task_tgid_vnr(current);
+ sd->tid = task_pid_vnr(current);
}

/**
--
2.0.0.526.g5318336

2014-07-25 13:50:49

by David Drysdale

Subject: [PATCH 10/11] capsicum: prctl(2) to force use of O_BENEATH

Add a per-task flag that indicates all openat(2) operations
should implicitly have the O_BENEATH flag set.

Add a prctl(2) command to set this flag (irrevocably), with an
option to synchronize the flag across all tasks in the thread
group.

Signed-off-by: David Drysdale <[email protected]>
---
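Reviewer note (not part of the patch): the argument rules for
PR_SET_OPENAT_BENEATH can be summarised by the small userspace model below.
demo_check_set_openat_beneath() is a hypothetical function that only mirrors
the validation in the kernel/sys.c hunk; it does not call prctl(2).

```c
#include <errno.h>

/* Value copied from the include/uapi/linux/prctl.h hunk of this patch. */
#define PR_SET_OPENAT_BENEATH_TSYNC	0x01

/* Model of the PR_SET_OPENAT_BENEATH argument checks: 0 or -EINVAL. */
int demo_check_set_openat_beneath(unsigned long arg2, unsigned long arg3,
				  unsigned long arg4, unsigned long arg5)
{
	/* arg2 must be exactly 1; arg4 and arg5 are reserved and must be 0 */
	if (arg2 != 1 || arg4 || arg5)
		return -EINVAL;
	/* arg3 may carry only the TSYNC bit */
	if ((arg3 & ~(unsigned long)PR_SET_OPENAT_BENEATH_TSYNC) != 0)
		return -EINVAL;
	return 0;
}
```

On a kernel carrying this patch, the corresponding call would presumably be
prctl(PR_SET_OPENAT_BENEATH, 1, PR_SET_OPENAT_BENEATH_TSYNC, 0, 0).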
fs/namei.c | 3 +++
include/linux/sched.h | 3 +++
include/uapi/linux/prctl.h | 14 ++++++++++++++
kernel/sys.c | 28 ++++++++++++++++++++++++++++
4 files changed, 48 insertions(+)

diff --git a/fs/namei.c b/fs/namei.c
index fe03a7dd7537..5d3b440869df 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1849,6 +1849,9 @@ static int path_init(int dfd, const char *name, unsigned int *flags,
nd->flags = (*flags) | LOOKUP_PARENT | LOOKUP_JUMPED;
nd->depth = 0;

+ if (current->openat_beneath)
+ *flags |= LOOKUP_BENEATH;
+
if ((*flags) & LOOKUP_ROOT) {
struct dentry *root = nd->root.dentry;
struct inode *inode = root->d_inode;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 306f4f0c987a..8d2943879b7b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1303,6 +1303,9 @@ struct task_struct {
/* Used for emulating ABI behavior of previous Linux versions */
unsigned int personality;

+ /* Indicate that openat(2) operations implicitly have O_BENEATH */
+ unsigned openat_beneath:1;
+
unsigned in_execve:1; /* Tell the LSMs that the process is doing an
* execve */
unsigned in_iowait:1;
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 58afc04c107e..b34fb2bbdaf8 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -152,4 +152,18 @@
#define PR_SET_THP_DISABLE 41
#define PR_GET_THP_DISABLE 42

+/*
+ * If openat_beneath is set for a task, then all openat(2) operations will
+ * implicitly have the O_BENEATH flag set for them. Once set, this flag cannot
+ * be cleared.
+ */
+#define PR_SET_OPENAT_BENEATH 44
+#define PR_GET_OPENAT_BENEATH 45
+
+/*
+ * Indicate that the openat_beneath flag should be synchronized across all
+ * threads in the process.
+ */
+#define PR_SET_OPENAT_BENEATH_TSYNC 0x01
+
#endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 8d8ccf6cfb38..cf2530c41982 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1821,6 +1821,23 @@ out:
return error;
}

+static int prctl_set_openat_beneath(struct task_struct *me, unsigned long flags)
+{
+ me->openat_beneath = 1;
+ if (flags & PR_SET_OPENAT_BENEATH_TSYNC) {
+ struct task_struct *thread, *caller;
+ unsigned long tflags;
+
+ write_lock_irqsave(&tasklist_lock, tflags);
+ thread = caller = me;
+ while_each_thread(caller, thread) {
+ thread->openat_beneath = 1;
+ }
+ write_unlock_irqrestore(&tasklist_lock, tflags);
+ }
+ return 0;
+}
+
#ifdef CONFIG_CHECKPOINT_RESTORE
static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr)
{
@@ -1996,6 +2013,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;
return current->no_new_privs ? 1 : 0;
+ case PR_SET_OPENAT_BENEATH:
+ if (arg2 != 1 || arg4 || arg5)
+ return -EINVAL;
+ if ((arg3 & ~(PR_SET_OPENAT_BENEATH_TSYNC)) != 0)
+ return -EINVAL;
+ error = prctl_set_openat_beneath(me, arg3);
+ break;
+ case PR_GET_OPENAT_BENEATH:
+ if (arg2 || arg3 || arg4 || arg5)
+ return -EINVAL;
+ return me->openat_beneath;
case PR_GET_THP_DISABLE:
if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;
--
2.0.0.526.g5318336

2014-07-25 13:51:13

by David Drysdale

Subject: [PATCH 08/11] capsicum: invoke Capsicum on FD/file conversion

Add calls to Capsicum to intercept FD/file conversion in the following
situations:
- Any place where a file descriptor is converted to a struct file.
- Any place where a new file descriptor that is derived from an
existing file descriptor is installed into the fdtable.

For the first of these, Capsicum checks for a Capsicum capability
wrapper file and unwraps it to the underlying file, provided the
appropriate rights are available (otherwise, ERR_PTR(-ENOTCAPABLE)
is returned).

For the second, Capsicum checks whether the original file has
restricted rights associated with it. If it does, it is replaced
with a Capsicum capability wrapper file before installation into
the fdtable. This propagates the rights from the original FD to
the new FD, and particularly affects accept(2) and openat(2).

For openat(2) rights propagation in particular, the rights
associated with the dfd need to make their way through the code
in fs/namei.c to allow this.

The path walking code in fs/namei.c is also modified to enable
the O_BENEATH flag if the dfd is a Capsicum capability.

Signed-off-by: David Drysdale <[email protected]>
---
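Reviewer note (not part of the patch): the two interception points described
above can be modelled in a few lines of userspace C. This is a sketch under
stated assumptions: rights are reduced to a single 64-bit mask, and all
demo_* names, the DEMO_RIGHT_* bits and the DEMO_ENOTCAPABLE value are
placeholders rather than anything defined by the patch.

```c
#include <stdint.h>

/* Placeholder right bits and error number; not the patch's definitions. */
#define DEMO_RIGHT_READ		0x1ULL
#define DEMO_RIGHT_WRITE	0x2ULL
#define DEMO_RIGHT_LOOKUP	0x4ULL
#define DEMO_ENOTCAPABLE	1000

/* A descriptor reduced to the set of rights it carries. */
struct demo_fd {
	uint64_t rights;
};

/* FD -> file conversion: succeeds only if every required right is held. */
int demo_fd_lookup(const struct demo_fd *fd, uint64_t required)
{
	if ((fd->rights & required) != required)
		return -DEMO_ENOTCAPABLE;
	return 0;
}

/* Installing a derived FD (accept/openat): the parent's rights carry over. */
struct demo_fd demo_fd_derive(const struct demo_fd *parent)
{
	struct demo_fd child = { .rights = parent->rights };

	return child;
}
```

The point of the second function is that a file opened relative to a
restricted directory descriptor cannot be used for operations that the
directory descriptor itself did not permit.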
arch/powerpc/platforms/cell/spufs/coredump.c | 2 +
fs/file.c | 7 +-
fs/locks.c | 3 +
fs/namei.c | 223 +++++++++++++++++++------
fs/notify/dnotify/dnotify.c | 2 +
fs/proc/fd.c | 18 +-
include/linux/capsicum.h | 22 +++
include/uapi/asm-generic/errno.h | 3 +
net/socket.c | 10 +-
security/Makefile | 2 +-
security/capsicum.c | 238 +++++++++++++++++++++++++++
11 files changed, 468 insertions(+), 62 deletions(-)
create mode 100644 security/capsicum.c

diff --git a/arch/powerpc/platforms/cell/spufs/coredump.c b/arch/powerpc/platforms/cell/spufs/coredump.c
index be6212ddbf06..9dd63b501514 100644
--- a/arch/powerpc/platforms/cell/spufs/coredump.c
+++ b/arch/powerpc/platforms/cell/spufs/coredump.c
@@ -29,6 +29,7 @@
#include <linux/syscalls.h>
#include <linux/coredump.h>
#include <linux/binfmts.h>
+#include <linux/capsicum.h>

#include <asm/uaccess.h>

@@ -101,6 +102,7 @@ static struct spu_context *coredump_next_context(int *fd)
return NULL;
*fd = n - 1;
file = fcheck(*fd);
+ file = capsicum_file_lookup(file, NULL, NULL);
return SPUFS_I(file_inode(file))->i_ctx;
}

diff --git a/fs/file.c b/fs/file.c
index ae53219d720b..e9befafcc158 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -14,6 +14,7 @@
#include <linux/time.h>
#include <linux/sched.h>
#include <linux/security.h>
+#include <linux/capsicum.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/file.h>
@@ -720,8 +721,8 @@ unsigned long __fdget_pos(unsigned int fd)

#ifdef CONFIG_SECURITY_CAPSICUM
/*
- * We might want to change the return value of fget() and friends. This
- * function is called with the intended return value, and fget() will /actually/
+ * Capsicum might want to change the return value of fget() and friends. This
+ * function is called with the intended return value, and fget() will actually
* return whatever is returned from here. We adjust the reference counter if
* necessary.
*/
@@ -736,7 +737,7 @@ static struct file *unwrap_file(struct file *orig,
return ERR_PTR(-EBADF);
if (IS_ERR(orig))
return orig;
- f = orig; /* TODO: change the value of f here */
+ f = capsicum_file_lookup(orig, required_rights, actual_rights);
if (f != orig && update_refcnt) {
/* We're not returning the original, and the calling code
* has already incremented the refcount on it, we need to
diff --git a/fs/locks.c b/fs/locks.c
index fdad193dc4b4..81e57ea0bdb5 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -121,6 +121,7 @@
#include <linux/init.h>
#include <linux/module.h>
#include <linux/security.h>
+#include <linux/capsicum.h>
#include <linux/slab.h>
#include <linux/syscalls.h>
#include <linux/time.h>
@@ -2133,6 +2134,7 @@ again:
*/
spin_lock(&current->files->file_lock);
f = fcheck(fd);
+ f = capsicum_file_lookup(f, NULL, NULL);
spin_unlock(&current->files->file_lock);
if (!error && f != filp && flock.l_type != F_UNLCK) {
flock.l_type = F_UNLCK;
@@ -2267,6 +2269,7 @@ again:
*/
spin_lock(&current->files->file_lock);
f = fcheck(fd);
+ f = capsicum_file_lookup(f, NULL, NULL);
spin_unlock(&current->files->file_lock);
if (!error && f != filp && flock.l_type != F_UNLCK) {
flock.l_type = F_UNLCK;
diff --git a/fs/namei.c b/fs/namei.c
index 548e351fade1..fe03a7dd7537 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -34,6 +34,7 @@
#include <linux/device_cgroup.h>
#include <linux/fs_struct.h>
#include <linux/posix_acl.h>
+#include <linux/capsicum.h>
#include <asm/uaccess.h>

#include "internal.h"
@@ -1751,7 +1752,7 @@ static int link_path_walk(const char *name, struct nameidata *nd,
{
struct path next;
int err;
-
+
while (*name == '/') {
if (flags & LOOKUP_BENEATH) {
err = -EACCES;
@@ -1837,15 +1838,18 @@ exit:
return err;
}

-static int path_init(int dfd, const char *name, unsigned int flags,
- struct nameidata *nd, struct file **fp)
+static int path_init(int dfd, const char *name, unsigned int *flags,
+ struct nameidata *nd, struct file **fp,
+ const struct capsicum_rights **dfd_rights,
+ const struct capsicum_rights *rights)
{
int retval = 0;

nd->last_type = LAST_ROOT; /* if there are only slashes... */
- nd->flags = flags | LOOKUP_JUMPED;
+ nd->flags = (*flags) | LOOKUP_PARENT | LOOKUP_JUMPED;
nd->depth = 0;
- if (flags & LOOKUP_ROOT) {
+
+ if ((*flags) & LOOKUP_ROOT) {
struct dentry *root = nd->root.dentry;
struct inode *inode = root->d_inode;
if (*name) {
@@ -1857,7 +1861,7 @@ static int path_init(int dfd, const char *name, unsigned int flags,
}
nd->path = nd->root;
nd->inode = inode;
- if (flags & LOOKUP_RCU) {
+ if ((*flags) & LOOKUP_RCU) {
rcu_read_lock();
nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
nd->m_seq = read_seqbegin(&mount_lock);
@@ -1871,9 +1875,11 @@ static int path_init(int dfd, const char *name, unsigned int flags,

nd->m_seq = read_seqbegin(&mount_lock);
if (*name=='/') {
- if (flags & LOOKUP_BENEATH)
+ if ((*flags) & LOOKUP_BENEATH)
return -EACCES;
- if (flags & LOOKUP_RCU) {
+ if (dfd_rights)
+ *dfd_rights = NULL;
+ if ((*flags) & LOOKUP_RCU) {
rcu_read_lock();
set_root_rcu(nd);
} else {
@@ -1882,7 +1888,9 @@ static int path_init(int dfd, const char *name, unsigned int flags,
}
nd->path = nd->root;
} else if (dfd == AT_FDCWD) {
- if (flags & LOOKUP_RCU) {
+ if (dfd_rights)
+ *dfd_rights = NULL;
+ if ((*flags) & LOOKUP_RCU) {
struct fs_struct *fs = current->fs;
unsigned seq;

@@ -1898,11 +1906,13 @@ static int path_init(int dfd, const char *name, unsigned int flags,
}
} else {
/* Caller must check execute permissions on the starting path component */
- struct fd f = fdget_raw(dfd);
+ struct fd f = fdget_raw_rights(dfd, dfd_rights, rights);
struct dentry *dentry;

- if (!f.file)
- return -EBADF;
+ if (IS_ERR(f.file))
+ return PTR_ERR(f.file);
+ if (!cap_rights_is_all(*dfd_rights))
+ *flags |= LOOKUP_BENEATH;

dentry = f.file->f_path.dentry;

@@ -1914,7 +1924,7 @@ static int path_init(int dfd, const char *name, unsigned int flags,
}

nd->path = f.file->f_path;
- if (flags & LOOKUP_RCU) {
+ if ((*flags) & LOOKUP_RCU) {
if (f.flags & FDPUT_FPUT)
*fp = f.file;
nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
@@ -1939,9 +1949,12 @@ static inline int lookup_last(struct nameidata *nd, struct path *path)
}

/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
-static int path_lookupat(int dfd, const char *name,
- unsigned int flags, struct nameidata *nd)
+static int path_lookupat(int dfd,
+ const char *name, unsigned int flags,
+ struct nameidata *nd,
+ const struct capsicum_rights *rights)
{
+ const struct capsicum_rights *dfd_rights;
struct file *base = NULL;
struct path path;
int err;
@@ -1960,7 +1973,7 @@ static int path_lookupat(int dfd, const char *name,
* be handled by restarting a traditional ref-walk (which will always
* be able to complete).
*/
- err = path_init(dfd, name, flags | LOOKUP_PARENT, nd, &base);
+ err = path_init(dfd, name, &flags, nd, &base, &dfd_rights, rights);

if (unlikely(err))
return err;
@@ -2005,27 +2018,32 @@ static int path_lookupat(int dfd, const char *name,
return err;
}

-static int filename_lookup(int dfd, struct filename *name,
- unsigned int flags, struct nameidata *nd)
+static int filename_lookup(int dfd,
+ struct filename *name, unsigned int flags,
+ struct nameidata *nd,
+ const struct capsicum_rights *rights)
{
- int retval = path_lookupat(dfd, name->name, flags | LOOKUP_RCU, nd);
+ int retval = path_lookupat(dfd, name->name, flags | LOOKUP_RCU, nd,
+ rights);
if (unlikely(retval == -ECHILD))
- retval = path_lookupat(dfd, name->name, flags, nd);
+ retval = path_lookupat(dfd, name->name, flags, nd, rights);
if (unlikely(retval == -ESTALE))
- retval = path_lookupat(dfd, name->name,
- flags | LOOKUP_REVAL, nd);
+ retval = path_lookupat(dfd, name->name, flags | LOOKUP_REVAL,
+ nd, rights);

if (likely(!retval))
audit_inode(name, nd->path.dentry, flags & LOOKUP_PARENT);
return retval;
}

-static int do_path_lookup(int dfd, const char *name,
- unsigned int flags, struct nameidata *nd)
+static int do_path_lookup(int dfd,
+ const char *name, unsigned int flags,
+ struct nameidata *nd,
+ const struct capsicum_rights *rights)
{
struct filename filename = { .name = name };

- return filename_lookup(dfd, &filename, flags, nd);
+ return filename_lookup(dfd, &filename, flags, nd, rights);
}

/* does lookup, returns the object with parent locked */
@@ -2033,7 +2051,9 @@ struct dentry *kern_path_locked(const char *name, struct path *path)
{
struct nameidata nd;
struct dentry *d;
- int err = do_path_lookup(AT_FDCWD, name, LOOKUP_PARENT, &nd);
+ int err;
+
+ err = do_path_lookup(AT_FDCWD, name, LOOKUP_PARENT, &nd, NULL);
if (err)
return ERR_PTR(err);
if (nd.last_type != LAST_NORM) {
@@ -2054,7 +2074,9 @@ struct dentry *kern_path_locked(const char *name, struct path *path)
int kern_path(const char *name, unsigned int flags, struct path *path)
{
struct nameidata nd;
- int res = do_path_lookup(AT_FDCWD, name, flags, &nd);
+ int res;
+
+ res = do_path_lookup(AT_FDCWD, name, flags, &nd, NULL);
if (!res)
*path = nd.path;
return res;
@@ -2079,7 +2101,7 @@ int vfs_path_lookup(struct dentry *dentry, struct vfsmount *mnt,
nd.root.mnt = mnt;
BUG_ON(flags & LOOKUP_PARENT);
/* the first argument of do_path_lookup() is ignored with LOOKUP_ROOT */
- err = do_path_lookup(AT_FDCWD, name, flags | LOOKUP_ROOT, &nd);
+ err = do_path_lookup(AT_FDCWD, name, flags | LOOKUP_ROOT, &nd, NULL);
if (!err)
*path = nd.path;
return err;
@@ -2162,8 +2184,7 @@ static int user_path_at_empty_rights(int dfd,
if (!IS_ERR(tmp)) {

BUG_ON(flags & LOOKUP_PARENT);
-
- err = filename_lookup(dfd, tmp, flags, &nd);
+ err = filename_lookup(dfd, tmp, flags, &nd, rights);
putname(tmp);
if (!err)
*path = nd.path;
@@ -2213,7 +2234,7 @@ int _user_path_atr(int dfd,
*/
static struct filename *
user_path_parent(int dfd, const char __user *path, struct nameidata *nd,
- unsigned int flags)
+ unsigned int flags, const struct capsicum_rights *rights)
{
struct filename *s = getname(path);
int error;
@@ -2224,7 +2245,7 @@ user_path_parent(int dfd, const char __user *path, struct nameidata *nd,
if (IS_ERR(s))
return s;

- error = filename_lookup(dfd, s, flags | LOOKUP_PARENT, nd);
+ error = filename_lookup(dfd, s, flags | LOOKUP_PARENT, nd, rights);
if (error) {
putname(s);
return ERR_PTR(error);
@@ -2340,9 +2361,11 @@ path_mountpoint(int dfd, const char *name, struct path *path, unsigned int flags
{
struct file *base = NULL;
struct nameidata nd;
+ const struct capsicum_rights *dfd_rights;
int err;

- err = path_init(dfd, name, flags | LOOKUP_PARENT, &nd, &base);
+ err = path_init(dfd, name, &flags, &nd, &base,
+ &dfd_rights, &lookup_rights);
if (unlikely(err))
return err;

@@ -3167,8 +3190,10 @@ static int do_tmpfile(int dfd, struct filename *pathname,
static const struct qstr name = QSTR_INIT("/", 1);
struct dentry *dentry, *child;
struct inode *dir;
- int error = path_lookupat(dfd, pathname->name,
- flags | LOOKUP_DIRECTORY, nd);
+ int error;
+
+ error = path_lookupat(dfd, pathname->name, flags | LOOKUP_DIRECTORY, nd,
+ &lookup_rights);
if (unlikely(error))
return error;
error = mnt_want_write(nd->path.mnt);
@@ -3220,15 +3245,42 @@ out:
return error;
}

+static void openat_primary_rights(struct capsicum_rights *rights,
+ unsigned int flags)
+{
+ switch (flags & O_ACCMODE) {
+ case O_RDONLY:
+ cap_rights_set(rights, CAP_READ);
+ break;
+ case O_RDWR:
+ cap_rights_set(rights, CAP_READ);
+ /* FALLTHRU */
+ case O_WRONLY:
+ cap_rights_set(rights, CAP_WRITE);
+ if (!(flags & (O_APPEND | O_TRUNC)))
+ cap_rights_set(rights, CAP_SEEK);
+ break;
+ }
+ if (flags & O_CREAT)
+ cap_rights_set(rights, CAP_CREATE);
+ if (flags & O_TRUNC)
+ cap_rights_set(rights, CAP_FTRUNCATE);
+ if (flags & (O_DSYNC|FASYNC))
+ cap_rights_set(rights, CAP_FSYNC);
+}
+
static struct file *path_openat(int dfd, struct filename *pathname,
struct nameidata *nd, const struct open_flags *op, int flags)
{
+ struct capsicum_rights rights;
+ const struct capsicum_rights *dfd_rights;
struct file *base = NULL;
struct file *file;
struct path path;
int opened = 0;
int error;

+ cap_rights_init(&rights, CAP_LOOKUP);
file = get_empty_filp();
if (IS_ERR(file))
return file;
@@ -3240,7 +3292,9 @@ static struct file *path_openat(int dfd, struct filename *pathname,
goto out;
}

- error = path_init(dfd, pathname->name, flags | LOOKUP_PARENT, nd, &base);
+ openat_primary_rights(&rights, file->f_flags);
+ error = path_init(dfd, pathname->name, &flags, nd, &base,
+ &dfd_rights, &rights);
if (unlikely(error))
goto out;

@@ -3270,6 +3324,17 @@ static struct file *path_openat(int dfd, struct filename *pathname,
error = do_last(nd, &path, file, op, &opened, pathname);
put_link(nd, &link, cookie);
}
+ if (!error) {
+ struct file *install_file;
+
+ install_file = capsicum_file_install(dfd_rights, file);
+ if (IS_ERR(install_file)) {
+ error = PTR_ERR(install_file);
+ goto out;
+ } else {
+ file = install_file;
+ }
+ }
out:
if (nd->root.mnt && !(nd->flags & LOOKUP_ROOT))
path_put(&nd->root);
@@ -3328,8 +3393,12 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
return file;
}

-struct dentry *kern_path_create(int dfd, const char *pathname,
- struct path *path, unsigned int lookup_flags)
+static struct dentry *
+kern_path_create_rights(int dfd,
+ const char *pathname,
+ struct path *path,
+ unsigned int lookup_flags,
+ const struct capsicum_rights *rights)
{
struct dentry *dentry = ERR_PTR(-EEXIST);
struct nameidata nd;
@@ -3343,7 +3412,8 @@ struct dentry *kern_path_create(int dfd, const char *pathname,
*/
lookup_flags &= LOOKUP_REVAL;

- error = do_path_lookup(dfd, pathname, LOOKUP_PARENT|lookup_flags, &nd);
+ error = do_path_lookup(dfd, pathname, LOOKUP_PARENT|lookup_flags, &nd,
+ rights);
if (error)
return ERR_PTR(error);

@@ -3397,6 +3467,13 @@ out:
path_put(&nd.path);
return dentry;
}
+
+struct dentry *kern_path_create(int dfd, const char *pathname,
+ struct path *path, unsigned int lookup_flags)
+{
+ return kern_path_create_rights(dfd, pathname, path, lookup_flags,
+ &lookup_rights);
+}
EXPORT_SYMBOL(kern_path_create);

void done_path_create(struct path *path, struct dentry *dentry)
@@ -3408,17 +3485,29 @@ void done_path_create(struct path *path, struct dentry *dentry)
}
EXPORT_SYMBOL(done_path_create);

-struct dentry *user_path_create(int dfd, const char __user *pathname,
- struct path *path, unsigned int lookup_flags)
+static struct dentry *
+user_path_create_rights(int dfd,
+ const char __user *pathname,
+ struct path *path,
+ unsigned int lookup_flags,
+ const struct capsicum_rights *rights)
{
struct filename *tmp = getname(pathname);
struct dentry *res;
if (IS_ERR(tmp))
return ERR_CAST(tmp);
- res = kern_path_create(dfd, tmp->name, path, lookup_flags);
+ res = kern_path_create_rights(dfd, tmp->name, path, lookup_flags,
+ rights);
putname(tmp);
return res;
}
+
+struct dentry *user_path_create(int dfd, const char __user *pathname,
+ struct path *path, unsigned int lookup_flags)
+{
+ return user_path_create_rights(dfd, pathname, path, lookup_flags,
+ &lookup_rights);
+}
EXPORT_SYMBOL(user_path_create);

int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
@@ -3469,16 +3558,28 @@ static int may_mknod(umode_t mode)
SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
unsigned, dev)
{
+ struct capsicum_rights rights;
struct dentry *dentry;
struct path path;
int error;
unsigned int lookup_flags = 0;

+ cap_rights_init(&rights, CAP_LOOKUP);
error = may_mknod(mode);
if (error)
return error;
+
+ switch (mode & S_IFMT) {
+ case S_IFCHR: case S_IFBLK:
+ cap_rights_set(&rights, CAP_MKNODAT);
+ break;
+ case S_IFIFO:
+ cap_rights_set(&rights, CAP_MKFIFOAT);
+ break;
+ }
retry:
- dentry = user_path_create(dfd, filename, &path, lookup_flags);
+ dentry = user_path_create_rights(dfd, filename, &path, lookup_flags,
+ &rights);
if (IS_ERR(dentry))
return PTR_ERR(dentry);

@@ -3545,9 +3646,13 @@ SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
struct path path;
int error;
unsigned int lookup_flags = LOOKUP_DIRECTORY;
+ struct capsicum_rights rights;
+
+ cap_rights_init(&rights, CAP_LOOKUP, CAP_MKDIRAT);

retry:
- dentry = user_path_create(dfd, pathname, &path, lookup_flags);
+ dentry = user_path_create_rights(dfd, pathname, &path, lookup_flags,
+ &rights);
if (IS_ERR(dentry))
return PTR_ERR(dentry);

@@ -3638,9 +3743,11 @@ static long do_rmdir(int dfd, const char __user *pathname)
struct filename *name;
struct dentry *dentry;
struct nameidata nd;
+ struct capsicum_rights rights;
unsigned int lookup_flags = 0;
+ cap_rights_init(&rights, CAP_UNLINKAT);
retry:
- name = user_path_parent(dfd, pathname, &nd, lookup_flags);
+ name = user_path_parent(dfd, pathname, &nd, lookup_flags, &rights);
if (IS_ERR(name))
return PTR_ERR(name);

@@ -3765,8 +3872,11 @@ static long do_unlinkat(int dfd, const char __user *pathname)
struct inode *inode = NULL;
struct inode *delegated_inode = NULL;
unsigned int lookup_flags = 0;
+ struct capsicum_rights rights;
+
+ cap_rights_init(&rights, CAP_UNLINKAT);
retry:
- name = user_path_parent(dfd, pathname, &nd, lookup_flags);
+ name = user_path_parent(dfd, pathname, &nd, lookup_flags, &rights);
if (IS_ERR(name))
return PTR_ERR(name);

@@ -3872,12 +3982,15 @@ SYSCALL_DEFINE3(symlinkat, const char __user *, oldname,
struct dentry *dentry;
struct path path;
unsigned int lookup_flags = 0;
+ struct capsicum_rights rights;

from = getname(oldname);
if (IS_ERR(from))
return PTR_ERR(from);
+ cap_rights_init(&rights, CAP_SYMLINKAT);
retry:
- dentry = user_path_create(newdfd, newname, &path, lookup_flags);
+ dentry = user_path_create_rights(newdfd, newname, &path, lookup_flags,
+ &rights);
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto out_putname;
@@ -3988,6 +4101,7 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
struct dentry *new_dentry;
struct path old_path, new_path;
struct inode *delegated_inode = NULL;
+ struct capsicum_rights rights;
int how = 0;
int error;

@@ -4006,13 +4120,14 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,

if (flags & AT_SYMLINK_FOLLOW)
how |= LOOKUP_FOLLOW;
+ cap_rights_init(&rights, CAP_LINKAT);
retry:
error = user_path_at(olddfd, oldname, how, &old_path);
if (error)
return error;

- new_dentry = user_path_create(newdfd, newname, &new_path,
- (how & LOOKUP_REVAL));
+ new_dentry = user_path_create_rights(newdfd, newname, &new_path,
+ (how & LOOKUP_REVAL), &rights);
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
goto out;
@@ -4243,6 +4358,8 @@ SYSCALL_DEFINE5(renameat2, int, olddfd, const char __user *, oldname,
struct inode *delegated_inode = NULL;
struct filename *from;
struct filename *to;
+ struct capsicum_rights old_rights;
+ struct capsicum_rights new_rights;
unsigned int lookup_flags = 0;
bool should_retry = false;
int error;
@@ -4253,14 +4370,18 @@ SYSCALL_DEFINE5(renameat2, int, olddfd, const char __user *, oldname,
if ((flags & RENAME_NOREPLACE) && (flags & RENAME_EXCHANGE))
return -EINVAL;

+ cap_rights_init(&old_rights, CAP_RENAMEAT);
+ cap_rights_init(&new_rights, CAP_LINKAT);
retry:
- from = user_path_parent(olddfd, oldname, &oldnd, lookup_flags);
+ from = user_path_parent(olddfd, oldname, &oldnd, lookup_flags,
+ &old_rights);
if (IS_ERR(from)) {
error = PTR_ERR(from);
goto exit;
}

- to = user_path_parent(newdfd, newname, &newnd, lookup_flags);
+ to = user_path_parent(newdfd, newname, &newnd, lookup_flags,
+ &new_rights);
if (IS_ERR(to)) {
error = PTR_ERR(to);
goto exit1;
diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index abc8cbcfe90e..fcb228bb7dd6 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -25,6 +25,7 @@
#include <linux/slab.h>
#include <linux/fdtable.h>
#include <linux/fsnotify_backend.h>
+#include <linux/capsicum.h>

int dir_notify_enable __read_mostly = 1;

@@ -327,6 +328,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)

rcu_read_lock();
f = fcheck(fd);
+ f = capsicum_file_lookup(f, NULL, NULL);
rcu_read_unlock();

/* if (f != filp) means that we lost a race and another task/thread
diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 0788d093f5d8..05c367417cb0 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -6,6 +6,7 @@
#include <linux/namei.h>
#include <linux/pid.h>
#include <linux/security.h>
+#include <linux/capsicum.h>
#include <linux/file.h>
#include <linux/seq_file.h>

@@ -20,6 +21,7 @@ static int seq_show(struct seq_file *m, void *v)
struct files_struct *files = NULL;
int f_flags = 0, ret = -ENOENT;
struct file *file = NULL;
+ struct file *underlying = NULL;
struct task_struct *task;

task = get_proc_task(m->private);
@@ -36,12 +38,13 @@ static int seq_show(struct seq_file *m, void *v)
file = fcheck_files(files, fd);
if (file) {
struct fdtable *fdt = files_fdtable(files);
-
- f_flags = file->f_flags;
+ underlying = capsicum_file_lookup(file, NULL, NULL);
+ f_flags = underlying->f_flags;
if (close_on_exec(fd, fdt))
f_flags |= O_CLOEXEC;

get_file(file);
+ get_file(underlying);
ret = 0;
}
spin_unlock(&files->file_lock);
@@ -50,10 +53,11 @@ static int seq_show(struct seq_file *m, void *v)

if (!ret) {
seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\n",
- (long long)file->f_pos, f_flags,
- real_mount(file->f_path.mnt)->mnt_id);
+ (long long)underlying->f_pos, f_flags,
+ real_mount(underlying->f_path.mnt)->mnt_id);
if (file->f_op->show_fdinfo)
ret = file->f_op->show_fdinfo(m, file);
+ fput(underlying);
fput(file);
}

@@ -95,7 +99,10 @@ static int tid_fd_revalidate(struct dentry *dentry, unsigned int flags)
rcu_read_lock();
file = fcheck_files(files, fd);
if (file) {
- unsigned f_mode = file->f_mode;
+ unsigned f_mode;
+
+ file = capsicum_file_lookup(file, NULL, NULL);
+ f_mode = file->f_mode;

rcu_read_unlock();
put_files_struct(files);
@@ -158,6 +165,7 @@ static int proc_fd_link(struct dentry *dentry, struct path *path)
spin_lock(&files->file_lock);
fd_file = fcheck_files(files, fd);
if (fd_file) {
+ fd_file = capsicum_file_lookup(fd_file, NULL, NULL);
*path = fd_file->f_path;
path_get(&fd_file->f_path);
ret = 0;
diff --git a/include/linux/capsicum.h b/include/linux/capsicum.h
index 74f79756097a..24d74dcd5a99 100644
--- a/include/linux/capsicum.h
+++ b/include/linux/capsicum.h
@@ -16,6 +16,13 @@ struct capsicum_rights {
#define CAP_LIST_END 0ULL

#ifdef CONFIG_SECURITY_CAPSICUM
+/* FD->struct file interception functions */
+struct file *capsicum_file_lookup(struct file *file,
+ const struct capsicum_rights *required_rights,
+ const struct capsicum_rights **actual_rights);
+struct file *capsicum_file_install(const struct capsicum_rights *base_rights,
+ struct file *file);
+
/* Rights manipulation functions */
#define cap_rights_init(rights, ...) \
_cap_rights_init((rights), __VA_ARGS__, CAP_LIST_END)
@@ -32,6 +39,21 @@ bool cap_rights_is_all(const struct capsicum_rights *rights);

#else

+static inline struct file *
+capsicum_file_lookup(struct file *file,
+ const struct capsicum_rights *required_rights,
+ const struct capsicum_rights **actual_rights)
+{
+ return file;
+}
+
+static inline struct file *
+capsicum_file_install(const struct capsicum_rights *base_rights,
+ struct file *file)
+{
+ return file;
+}
+
#define cap_rights_init(rights, ...) _cap_rights_noop(rights)
#define cap_rights_set(rights, ...) _cap_rights_noop(rights)
#define cap_rights_set_all(rights) _cap_rights_noop(rights)
diff --git a/include/uapi/asm-generic/errno.h b/include/uapi/asm-generic/errno.h
index 1e1ea6e6e7a5..550570ed7b9f 100644
--- a/include/uapi/asm-generic/errno.h
+++ b/include/uapi/asm-generic/errno.h
@@ -110,4 +110,7 @@

#define EHWPOISON 133 /* Memory page has hardware error */

+#define ECAPMODE 134 /* Not permitted in capability mode */
+#define ENOTCAPABLE 135 /* Capability FD rights insufficient */
+
#endif
diff --git a/net/socket.c b/net/socket.c
index 2240c2e52927..41322c3c2a4a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -1671,6 +1671,7 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,
{
struct socket *sock, *newsock;
struct file *newfile;
+ struct file *installfile;
int err, len, newfd, fput_needed;
struct sockaddr_storage address;
struct capsicum_rights rights;
@@ -1738,7 +1739,12 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,

/* File flags are not inherited via accept() unlike another OSes. */

- fd_install(newfd, newfile);
+ installfile = capsicum_file_install(listen_rights, newfile);
+ if (IS_ERR(installfile)) {
+ err = PTR_ERR(installfile);
+ goto out_fd;
+ }
+ fd_install(newfd, installfile);
err = newfd;

out_put:
@@ -2117,7 +2123,7 @@ static int ___sys_sendmsg(struct socket *sock_noaddr, struct socket *sock_addr,
}
sock = (msg_sys->msg_name ? sock_addr : sock_noaddr);
if (!sock)
- return -EBADF;
+ return -ENOTCAPABLE;

if (msg_sys->msg_iovlen > UIO_FASTIOV) {
err = -EMSGSIZE;
diff --git a/security/Makefile b/security/Makefile
index c5e1363ae136..e46d014a74b3 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -14,7 +14,7 @@ obj-y += commoncap.o
obj-$(CONFIG_MMU) += min_addr.o

# Object file lists
-obj-$(CONFIG_SECURITY) += security.o capability.o capsicum-rights.o
+obj-$(CONFIG_SECURITY) += security.o capability.o capsicum.o capsicum-rights.o
obj-$(CONFIG_SECURITYFS) += inode.o
obj-$(CONFIG_SECURITY_SELINUX) += selinux/
obj-$(CONFIG_SECURITY_SMACK) += smack/
diff --git a/security/capsicum.c b/security/capsicum.c
new file mode 100644
index 000000000000..4a004829b9c8
--- /dev/null
+++ b/security/capsicum.c
@@ -0,0 +1,238 @@
+/*
+ * Main implementation of Capsicum, a capability framework for UNIX.
+ *
+ * Copyright (C) 2012-2013 The Chromium OS Authors
+ * <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2, as
+ * published by the Free Software Foundation.
+ *
+ * See Documentation/security/capsicum.txt for information on Capsicum.
+ */
+
+#include <linux/anon_inodes.h>
+#include <linux/fs.h>
+#include <linux/fdtable.h>
+#include <linux/file.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+#include <linux/security.h>
+#include <linux/syscalls.h>
+#include <linux/capsicum.h>
+
+#include "capsicum-rights.h"
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+/*
+ * Capsicum capability structure, holding the associated rights and underlying
+ * real file. Capabilities are not stacked, i.e. underlying always points to a
+ * normal file not another Capsicum capability. Accessed via file->private_data.
+ */
+struct capsicum_capability {
+ struct capsicum_rights rights;
+ struct file *underlying;
+};
+
+static void capsicum_panic_not_unwrapped(void);
+static int capsicum_release(struct inode *i, struct file *capf);
+static int capsicum_show_fdinfo(struct seq_file *m, struct file *capf);
+
+#define panic_ptr ((void *)&capsicum_panic_not_unwrapped)
+static const struct file_operations capsicum_file_ops = {
+ .owner = NULL,
+ .llseek = panic_ptr,
+ .read = panic_ptr,
+ .write = panic_ptr,
+ .aio_read = panic_ptr,
+ .aio_write = panic_ptr,
+ .iterate = panic_ptr,
+ .poll = panic_ptr,
+ .unlocked_ioctl = panic_ptr,
+ .compat_ioctl = panic_ptr,
+ .mmap = panic_ptr,
+ .open = panic_ptr,
+ .flush = NULL, /* This is called on close if implemented. */
+ .release = capsicum_release, /* This is the only one we want. */
+ .fsync = panic_ptr,
+ .aio_fsync = panic_ptr,
+ .fasync = panic_ptr,
+ .lock = panic_ptr,
+ .sendpage = panic_ptr,
+ .get_unmapped_area = panic_ptr,
+ .check_flags = panic_ptr,
+ .flock = panic_ptr,
+ .splice_write = panic_ptr,
+ .splice_read = panic_ptr,
+ .setlease = panic_ptr,
+ .fallocate = panic_ptr,
+ .show_fdinfo = capsicum_show_fdinfo
+};
+
+static inline bool capsicum_is_cap(const struct file *file)
+{
+ return file->f_op == &capsicum_file_ops;
+}
+
+static struct capsicum_rights all_rights = {
+ .primary = {.cr_rights = {CAP_ALL0, CAP_ALL1} },
+ .fcntls = CAP_FCNTL_ALL,
+ .nioctls = -1,
+ .ioctls = NULL
+};
+
+static struct file *capsicum_cap_alloc(const struct capsicum_rights *rights,
+ bool take_ioctls)
+{
+ int err;
+ struct file *capf;
+ /* memory to be freed on error exit: */
+ struct capsicum_capability *cap = NULL;
+ unsigned int *ioctls = (take_ioctls ? rights->ioctls : NULL);
+
+ BUG_ON((rights->nioctls > 0) != (rights->ioctls != NULL));
+
+ cap = kmalloc(sizeof(*cap), GFP_KERNEL);
+ if (!cap) {
+ err = -ENOMEM;
+ goto out_err;
+ }
+ cap->underlying = NULL;
+ cap->rights = *rights;
+ if (!take_ioctls && rights->nioctls > 0) {
+ cap->rights.ioctls = kmemdup(rights->ioctls,
+ rights->nioctls * sizeof(unsigned int),
+ GFP_KERNEL);
+ if (!cap->rights.ioctls) {
+ err = -ENOMEM;
+ goto out_err;
+ }
+ ioctls = cap->rights.ioctls;
+ }
+
+ capf = anon_inode_getfile("[capability]", &capsicum_file_ops, cap, 0);
+ if (IS_ERR(capf)) {
+ err = PTR_ERR(capf);
+ goto out_err;
+ }
+ return capf;
+
+out_err:
+ kfree(ioctls);
+ kfree(cap);
+ return ERR_PTR(err);
+}
+
+/*
+ * File operations functions.
+ */
+
+/*
+ * When we release a Capsicum capability, release our reference to the
+ * underlying (wrapped) file as well.
+ */
+static int capsicum_release(struct inode *i, struct file *capf)
+{
+ struct capsicum_capability *cap;
+
+ if (!capsicum_is_cap(capf))
+ return -EINVAL;
+
+ cap = capf->private_data;
+ BUG_ON(!cap);
+ if (cap->underlying)
+ fput(cap->underlying);
+ cap->underlying = NULL;
+ kfree(cap->rights.ioctls);
+ kfree(cap);
+ return 0;
+}
+
+static int capsicum_show_fdinfo(struct seq_file *m, struct file *capf)
+{
+ int i;
+ struct capsicum_capability *cap;
+
+ if (!capsicum_is_cap(capf))
+ return -EINVAL;
+
+ cap = capf->private_data;
+ BUG_ON(!cap);
+ seq_puts(m, "rights:");
+ for (i = 0; i < (CAP_RIGHTS_VERSION + 2); i++)
+ seq_printf(m, "\t%#016llx", cap->rights.primary.cr_rights[i]);
+ seq_puts(m, "\n");
+ seq_printf(m, " fcntls: %#08x\n", cap->rights.fcntls);
+ if (cap->rights.nioctls > 0) {
+ seq_puts(m, " ioctls:");
+ for (i = 0; i < cap->rights.nioctls; i++)
+ seq_printf(m, "\t%#08x", cap->rights.ioctls[i]);
+ seq_puts(m, "\n");
+ }
+ return 0;
+}
+
+static void capsicum_panic_not_unwrapped(void)
+{
+ /*
+ * General Capsicum file operations should never be called, because the
+ * relevant file should always be unwrapped and the underlying real file
+ * used instead.
+ */
+ panic("Called a file_operations member on a Capsicum wrapper");
+}
+
+/*
+ * We are looking up a file by its file descriptor. If it is a Capsicum
+ * capability, and has the required rights, we unwrap it and return the
+ * underlying file.
+ */
+struct file *capsicum_file_lookup(struct file *file,
+ const struct capsicum_rights *required_rights,
+ const struct capsicum_rights **actual_rights)
+{
+ struct capsicum_capability *cap;
+
+ /* See if the file in question is a Capsicum capability. */
+ if (!capsicum_is_cap(file)) {
+ if (actual_rights)
+ *actual_rights = &all_rights;
+ return file;
+ }
+ cap = file->private_data;
+ if (required_rights &&
+ !cap_rights_contains(&cap->rights, required_rights)) {
+ return ERR_PTR(-ENOTCAPABLE);
+ }
+ if (actual_rights)
+ *actual_rights = &cap->rights;
+ return cap->underlying;
+}
+EXPORT_SYMBOL(capsicum_file_lookup);
+
+struct file *capsicum_file_install(const struct capsicum_rights *base_rights,
+ struct file *file)
+{
+ struct file *capf;
+ struct capsicum_capability *cap;
+
+ if (!base_rights || cap_rights_is_all(base_rights))
+ return file;
+
+ capf = capsicum_cap_alloc(base_rights, false);
+ if (IS_ERR(capf))
+ return capf;
+
+ if (!atomic_long_inc_not_zero(&file->f_count)) {
+ fput(capf);
+ return ERR_PTR(-EBADF);
+ }
+ cap = capf->private_data;
+ cap->underlying = file;
+ return capf;
+}
+EXPORT_SYMBOL(capsicum_file_install);
+
+#endif
--
2.0.0.526.g5318336
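
The wrap/unwrap behaviour of capsicum_file_install() and
capsicum_file_lookup() in security/capsicum.c above can be modelled in
userspace roughly as follows. This is only a sketch under stated
assumptions: the demo_* names, the single-word rights mask, the plain int
refcount and the err out-parameter (standing in for ERR_PTR()) are all
hypothetical; the value 135 is the ENOTCAPABLE number from the errno.h
hunk in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical right bits and "all rights" mask, for illustration only. */
#define DEMO_CAP_READ     (1ULL << 0)
#define DEMO_CAP_WRITE    (1ULL << 1)
#define DEMO_RIGHTS_ALL   (~0ULL)
#define DEMO_ENOTCAPABLE  135 /* same number as ENOTCAPABLE in the patch */

/* Userspace stand-in for struct file plus struct capsicum_capability. */
struct demo_file {
	bool is_cap;                  /* models f_op == &capsicum_file_ops */
	unsigned long long rights;    /* only meaningful for a wrapper */
	struct demo_file *underlying; /* only meaningful for a wrapper */
	int refcount;
};

/*
 * Unwrap on lookup: a normal file is returned unchanged; a wrapper yields
 * its underlying file only if it holds every required right.
 */
static struct demo_file *demo_file_lookup(struct demo_file *file,
					  unsigned long long required,
					  int *err)
{
	*err = 0;
	if (!file->is_cap)
		return file;
	if ((file->rights & required) != required) {
		*err = -DEMO_ENOTCAPABLE;
		return NULL;
	}
	return file->underlying;
}

/*
 * Wrap on install: with all rights there is nothing to restrict, so the
 * file is returned as-is; otherwise a wrapper is allocated that holds its
 * own reference to the underlying file.
 */
static struct demo_file *demo_file_install(struct demo_file *file,
					   unsigned long long rights)
{
	struct demo_file *capf;

	if (rights == DEMO_RIGHTS_ALL)
		return file;
	capf = calloc(1, sizeof(*capf));
	if (!capf)
		return NULL;
	capf->is_cap = true;
	capf->rights = rights;
	capf->underlying = file;
	capf->refcount = 1;
	file->refcount++;
	return capf;
}

/* Releasing a wrapper drops its reference to the underlying file. */
static void demo_file_release(struct demo_file *capf)
{
	assert(capf->is_cap);
	capf->underlying->refcount--;
	free(capf);
}
```

In this model a wrapper never points at another wrapper, matching the
"capabilities are not stacked" comment in the patch, provided callers
only ever pass an unwrapped file to demo_file_install().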

2014-07-25 13:51:19

by David Drysdale

Subject: [PATCH 07/11] capsicum: convert callers to use sockfd_lookupr() etc

Convert callers of the sockfd_lookup() family of functions to use the
equivalent sockfd_lookupr() variants instead.

Annotate each such call with the CAP_* rights needed for the operations
that will be performed on the retrieved socket, so that the rights
associated with file descriptors can be policed in future.
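
As a rough illustration of this annotation style, the sketch below
models in userspace how a variadic rights list terminated by
CAP_LIST_END (0ULL in include/linux/capsicum.h) might be folded into a
rights set and checked. The demo_* names, the DEMO_CAP_* bit values and
the single-word mask are assumptions made for the example; the real
struct capsicum_rights and CAP_* encodings are different.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical right bits; the real CAP_* values are encoded differently. */
#define DEMO_CAP_READ     (1ULL << 0)
#define DEMO_CAP_WRITE    (1ULL << 1)
#define DEMO_CAP_SHUTDOWN (1ULL << 2)
#define DEMO_CAP_LIST_END 0ULL

struct demo_rights {
	unsigned long long mask;
};

/* Fold a DEMO_CAP_LIST_END-terminated argument list into a rights set. */
static struct demo_rights *_demo_rights_init(struct demo_rights *r, ...)
{
	va_list ap;
	unsigned long long right;

	assert(r != NULL);
	r->mask = 0;
	va_start(ap, r);
	while ((right = va_arg(ap, unsigned long long)) != DEMO_CAP_LIST_END)
		r->mask |= right;
	va_end(ap);
	return r;
}

/* The wrapper macro appends the terminator so call sites cannot forget it. */
#define demo_rights_init(r, ...) \
	_demo_rights_init((r), __VA_ARGS__, DEMO_CAP_LIST_END)

/* True if every right in 'need' is also present in 'have'. */
static bool demo_rights_contains(const struct demo_rights *have,
				 const struct demo_rights *need)
{
	return (have->mask & need->mask) == need->mask;
}
```

Because the list ends at the first zero value, a call site can pass a
conditional right as its final argument, in the way the sendto() hunk
below passes "addr ? CAP_CONNECT : 0ULL".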

Signed-off-by: David Drysdale <[email protected]>
---
drivers/block/nbd.c | 3 +-
drivers/scsi/iscsi_tcp.c | 2 +-
drivers/staging/usbip/stub_dev.c | 2 +-
drivers/staging/usbip/vhci_sysfs.c | 2 +-
drivers/vhost/net.c | 2 +-
fs/ncpfs/inode.c | 5 +-
net/bluetooth/bnep/sock.c | 2 +-
net/bluetooth/cmtp/sock.c | 2 +-
net/bluetooth/hidp/sock.c | 4 +-
net/compat.c | 4 +-
net/l2tp/l2tp_core.c | 11 ++--
net/l2tp/l2tp_core.h | 2 +
net/sched/sch_atm.c | 2 +-
net/socket.c | 119 +++++++++++++++++++++++--------------
net/sunrpc/svcsock.c | 4 +-
15 files changed, 100 insertions(+), 66 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 08381e2049b6..b5344c8cbb14 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -643,7 +643,8 @@ static int __nbd_ioctl(struct block_device *bdev, struct nbd_device *nbd,
int err;
if (nbd->sock)
return -EBUSY;
- sock = sockfd_lookup(arg, &err);
+ sock = sockfd_lookupr(arg, &err,
+ CAP_READ, CAP_WRITE, CAP_SHUTDOWN);
if (sock) {
nbd->sock = sock;
if (max_part > 0)
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index a669f2d11c31..f112bbd32278 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -652,7 +652,7 @@ iscsi_sw_tcp_conn_bind(struct iscsi_cls_session *cls_session,
int err;

/* lookup for existing socket */
- sock = sockfd_lookup((int)transport_eph, &err);
+ sock = sockfd_lookupr((int)transport_eph, &err, CAP_SOCK_SERVER);
if (!sock) {
iscsi_conn_printk(KERN_ERR, conn,
"sockfd_lookup failed %d\n", err);
diff --git a/drivers/staging/usbip/stub_dev.c b/drivers/staging/usbip/stub_dev.c
index 51d0c7188738..9654d9f871c9 100644
--- a/drivers/staging/usbip/stub_dev.c
+++ b/drivers/staging/usbip/stub_dev.c
@@ -109,7 +109,7 @@ static ssize_t store_sockfd(struct device *dev, struct device_attribute *attr,
goto err;
}

- socket = sockfd_lookup(sockfd, &err);
+ socket = sockfd_lookupr(sockfd, &err, CAP_LIST_END);
if (!socket)
goto err;

diff --git a/drivers/staging/usbip/vhci_sysfs.c b/drivers/staging/usbip/vhci_sysfs.c
index 211f43f67ea2..efe9d7625433 100644
--- a/drivers/staging/usbip/vhci_sysfs.c
+++ b/drivers/staging/usbip/vhci_sysfs.c
@@ -195,7 +195,7 @@ static ssize_t store_attach(struct device *dev, struct device_attribute *attr,
return -EINVAL;

/* Extract socket from fd. */
- socket = sockfd_lookup(sockfd, &err);
+ socket = sockfd_lookupr(sockfd, &err, CAP_LIST_END);
if (!socket)
return -EINVAL;

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8f552d2b637e..2d670e409972 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -843,7 +843,7 @@ static struct socket *get_raw_socket(int fd)
char buf[MAX_ADDR_LEN];
} uaddr;
int uaddr_len = sizeof uaddr, r;
- struct socket *sock = sockfd_lookup(fd, &r);
+ struct socket *sock = sockfd_lookupr(fd, &r, CAP_READ, CAP_WRITE);

if (!sock)
return ERR_PTR(-ENOTSOCK);
diff --git a/fs/ncpfs/inode.c b/fs/ncpfs/inode.c
index e31e589369a4..580024e60d20 100644
--- a/fs/ncpfs/inode.c
+++ b/fs/ncpfs/inode.c
@@ -539,7 +539,7 @@ static int ncp_fill_super(struct super_block *sb, void *raw_data, int silent)
if (!uid_valid(data.mounted_uid) || !uid_valid(data.uid) ||
!gid_valid(data.gid))
goto out;
- sock = sockfd_lookup(data.ncp_fd, &error);
+ sock = sockfd_lookupr(data.ncp_fd, &error, CAP_WRITE, CAP_FSTAT);
if (!sock)
goto out;

@@ -567,7 +567,8 @@ static int ncp_fill_super(struct super_block *sb, void *raw_data, int silent)
server->ncp_sock = sock;

if (data.info_fd != -1) {
- struct socket *info_sock = sockfd_lookup(data.info_fd, &error);
+ struct socket *info_sock = sockfd_lookupr(data.info_fd, &error,
+ CAP_WRITE, CAP_FSTAT);
if (!info_sock)
goto out_bdi;
server->info_sock = info_sock;
diff --git a/net/bluetooth/bnep/sock.c b/net/bluetooth/bnep/sock.c
index 5f051290daba..1a69b6b05d2e 100644
--- a/net/bluetooth/bnep/sock.c
+++ b/net/bluetooth/bnep/sock.c
@@ -69,7 +69,7 @@ static int bnep_sock_ioctl(struct socket *sock, unsigned int cmd, unsigned long
if (copy_from_user(&ca, argp, sizeof(ca)))
return -EFAULT;

- nsock = sockfd_lookup(ca.sock, &err);
+ nsock = sockfd_lookupr(ca.sock, &err, CAP_READ, CAP_WRITE);
if (!nsock)
return err;

diff --git a/net/bluetooth/cmtp/sock.c b/net/bluetooth/cmtp/sock.c
index d82787d417bd..4033b771e6ca 100644
--- a/net/bluetooth/cmtp/sock.c
+++ b/net/bluetooth/cmtp/sock.c
@@ -83,7 +83,7 @@ static int cmtp_sock_ioctl(struct socket *sock, unsigned int cmd, unsigned long
if (copy_from_user(&ca, argp, sizeof(ca)))
return -EFAULT;

- nsock = sockfd_lookup(ca.sock, &err);
+ nsock = sockfd_lookupr(ca.sock, &err, CAP_READ, CAP_WRITE);
if (!nsock)
return err;

diff --git a/net/bluetooth/hidp/sock.c b/net/bluetooth/hidp/sock.c
index cb3fdde1968a..85afd39595f3 100644
--- a/net/bluetooth/hidp/sock.c
+++ b/net/bluetooth/hidp/sock.c
@@ -67,11 +67,11 @@ static int hidp_sock_ioctl(struct socket *sock, unsigned int cmd, unsigned long
if (copy_from_user(&ca, argp, sizeof(ca)))
return -EFAULT;

- csock = sockfd_lookup(ca.ctrl_sock, &err);
+ csock = sockfd_lookupr(ca.ctrl_sock, &err, CAP_READ, CAP_WRITE);
if (!csock)
return err;

- isock = sockfd_lookup(ca.intr_sock, &err);
+ isock = sockfd_lookupr(ca.intr_sock, &err, CAP_READ, CAP_WRITE);
if (!isock) {
sockfd_put(csock);
return err;
diff --git a/net/compat.c b/net/compat.c
index 9a76eaf63184..06655190173e 100644
--- a/net/compat.c
+++ b/net/compat.c
@@ -388,7 +388,7 @@ COMPAT_SYSCALL_DEFINE5(setsockopt, int, fd, int, level, int, optname,
char __user *, optval, unsigned int, optlen)
{
int err;
- struct socket *sock = sockfd_lookup(fd, &err);
+ struct socket *sock = sockfd_lookupr(fd, &err, CAP_SETSOCKOPT);

if (sock) {
err = security_socket_setsockopt(sock, level, optname);
@@ -508,7 +508,7 @@ COMPAT_SYSCALL_DEFINE5(getsockopt, int, fd, int, level, int, optname,
char __user *, optval, int __user *, optlen)
{
int err;
- struct socket *sock = sockfd_lookup(fd, &err);
+ struct socket *sock = sockfd_lookupr(fd, &err, CAP_GETSOCKOPT);

if (sock) {
err = security_socket_getsockopt(sock, level, optname);
diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
index bea259043205..03fd2c626cef 100644
--- a/net/l2tp/l2tp_core.c
+++ b/net/l2tp/l2tp_core.c
@@ -175,7 +175,8 @@ l2tp_session_id_hash_2(struct l2tp_net *pn, u32 session_id)
* owned by userspace. A struct sock returned from this function must be
* released using l2tp_tunnel_sock_put once you're done with it.
*/
-static struct sock *l2tp_tunnel_sock_lookup(struct l2tp_tunnel *tunnel)
+static struct sock *l2tp_tunnel_sock_lookup(struct l2tp_tunnel *tunnel,
+ struct capsicum_rights *rights)
{
int err = 0;
struct socket *sock = NULL;
@@ -189,7 +190,7 @@ static struct sock *l2tp_tunnel_sock_lookup(struct l2tp_tunnel *tunnel)
* of closing it. Look the socket up using the fd to ensure
* consistency.
*/
- sock = sockfd_lookup(tunnel->fd, &err);
+ sock = sockfd_lookup_rights(tunnel->fd, &err, rights);
if (sock)
sk = sock->sk;
} else {
@@ -1314,9 +1315,11 @@ static void l2tp_tunnel_del_work(struct work_struct *work)
struct l2tp_tunnel *tunnel = NULL;
struct socket *sock = NULL;
struct sock *sk = NULL;
+ struct capsicum_rights rights;

tunnel = container_of(work, struct l2tp_tunnel, del_work);
- sk = l2tp_tunnel_sock_lookup(tunnel);
+ sk = l2tp_tunnel_sock_lookup(tunnel,
+ cap_rights_init(&rights, CAP_SHUTDOWN));
if (!sk)
return;

@@ -1522,7 +1525,7 @@ int l2tp_tunnel_create(struct net *net, int fd, int version, u32 tunnel_id, u32
if (err < 0)
goto err;
} else {
- sock = sockfd_lookup(fd, &err);
+ sock = sockfd_lookupr(fd, &err, CAP_READ, CAP_WRITE);
if (!sock) {
pr_err("tunl %u: sockfd_lookup(fd=%d) returned %d\n",
tunnel_id, fd, err);
diff --git a/net/l2tp/l2tp_core.h b/net/l2tp/l2tp_core.h
index 68aa9ffd4ae4..4082366d7b74 100644
--- a/net/l2tp/l2tp_core.h
+++ b/net/l2tp/l2tp_core.h
@@ -11,6 +11,8 @@
#ifndef _L2TP_CORE_H_
#define _L2TP_CORE_H_

+#include <linux/capsicum.h>
+
/* Just some random numbers */
#define L2TP_TUNNEL_MAGIC 0x42114DDA
#define L2TP_SESSION_MAGIC 0x0C04EB7D
diff --git a/net/sched/sch_atm.c b/net/sched/sch_atm.c
index 8449b337f9e3..8131efa6d164 100644
--- a/net/sched/sch_atm.c
+++ b/net/sched/sch_atm.c
@@ -238,7 +238,7 @@ static int atm_tc_change(struct Qdisc *sch, u32 classid, u32 parent,
}
pr_debug("atm_tc_change: type %d, payload %d, hdr_len %d\n",
opt->nla_type, nla_len(opt), hdr_len);
- sock = sockfd_lookup(fd, &error);
+ sock = sockfd_lookupr(fd, &error, CAP_GETSOCKNAME);
if (!sock)
return error; /* f_count++ */
pr_debug("atm_tc_change: f_count %ld\n", file_count(sock->file));
diff --git a/net/socket.c b/net/socket.c
index cc2e59576b3c..2240c2e52927 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -419,23 +419,6 @@ struct socket *sock_from_file(struct file *file, int *err)
}
EXPORT_SYMBOL(sock_from_file);

-static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
-{
- struct fd f = fdget(fd);
- struct socket *sock;
-
- *err = -EBADF;
- if (f.file) {
- sock = sock_from_file(f.file, err);
- if (likely(sock)) {
- *fput_needed = f.flags;
- return sock;
- }
- fdput(f);
- }
- return NULL;
-}
-
#ifdef CONFIG_SECURITY_CAPSICUM
struct socket *sockfd_lookup_rights(int fd, int *err,
struct capsicum_rights *rights)
@@ -508,6 +491,23 @@ struct socket *_sockfd_lookupr_light(int fd, int *err, int *fput_needed, ...)

#else

+static struct socket *sockfd_lookup_light(int fd, int *err, int *fput_needed)
+{
+ struct fd f = fdget(fd);
+ struct socket *sock;
+
+ *err = -EBADF;
+ if (f.file) {
+ sock = sock_from_file(f.file, err);
+ if (likely(sock)) {
+ *fput_needed = f.flags;
+ return sock;
+ }
+ fdput(f);
+ }
+ return NULL;
+}
+
static inline struct socket *
sockfd_lookup_light_rights(int fd, int *err, int *fput_needed,
const struct capsicum_rights **actual_rights,
@@ -1610,7 +1610,7 @@ SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen)
struct sockaddr_storage address;
int err, fput_needed;

- sock = sockfd_lookup_light(fd, &err, &fput_needed);
+ sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_BIND);
if (sock) {
err = move_addr_to_kernel(umyaddr, addrlen, &address);
if (err >= 0) {
@@ -1639,7 +1639,7 @@ SYSCALL_DEFINE2(listen, int, fd, int, backlog)
int err, fput_needed;
int somaxconn;

- sock = sockfd_lookup_light(fd, &err, &fput_needed);
+ sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_LISTEN);
if (sock) {
somaxconn = sock_net(sock->sk)->core.sysctl_somaxconn;
if ((unsigned int)backlog > somaxconn)
@@ -1673,6 +1673,8 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,
struct file *newfile;
int err, len, newfd, fput_needed;
struct sockaddr_storage address;
+ struct capsicum_rights rights;
+ const struct capsicum_rights *listen_rights = NULL;

if (flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
return -EINVAL;
@@ -1680,7 +1682,9 @@ SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,
if (SOCK_NONBLOCK != O_NONBLOCK && (flags & SOCK_NONBLOCK))
flags = (flags & ~SOCK_NONBLOCK) | O_NONBLOCK;

- sock = sockfd_lookup_light(fd, &err, &fput_needed);
+ sock = sockfd_lookup_light_rights(fd, &err, &fput_needed,
+ &listen_rights,
+ cap_rights_init(&rights, CAP_ACCEPT));
if (!sock)
goto out;

@@ -1772,7 +1776,7 @@ SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr,
struct sockaddr_storage address;
int err, fput_needed;

- sock = sockfd_lookup_light(fd, &err, &fput_needed);
+ sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_CONNECT);
if (!sock)
goto out;
err = move_addr_to_kernel(uservaddr, addrlen, &address);
@@ -1804,7 +1808,7 @@ SYSCALL_DEFINE3(getsockname, int, fd, struct sockaddr __user *, usockaddr,
struct sockaddr_storage address;
int len, err, fput_needed;

- sock = sockfd_lookup_light(fd, &err, &fput_needed);
+ sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_GETSOCKNAME);
if (!sock)
goto out;

@@ -1835,7 +1839,7 @@ SYSCALL_DEFINE3(getpeername, int, fd, struct sockaddr __user *, usockaddr,
struct sockaddr_storage address;
int len, err, fput_needed;

- sock = sockfd_lookup_light(fd, &err, &fput_needed);
+ sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_GETPEERNAME);
if (sock != NULL) {
err = security_socket_getpeername(sock);
if (err) {
@@ -1873,7 +1877,8 @@ SYSCALL_DEFINE6(sendto, int, fd, void __user *, buff, size_t, len,

if (len > INT_MAX)
len = INT_MAX;
- sock = sockfd_lookup_light(fd, &err, &fput_needed);
+ sock = sockfd_lookupr_light(fd, &err, &fput_needed,
+ CAP_WRITE, addr ? CAP_CONNECT : 0ULL);
if (!sock)
goto out;

@@ -1932,7 +1937,7 @@ SYSCALL_DEFINE6(recvfrom, int, fd, void __user *, ubuf, size_t, size,

if (size > INT_MAX)
size = INT_MAX;
- sock = sockfd_lookup_light(fd, &err, &fput_needed);
+ sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_READ);
if (!sock)
goto out;

@@ -1986,7 +1991,7 @@ SYSCALL_DEFINE5(setsockopt, int, fd, int, level, int, optname,
if (optlen < 0)
return -EINVAL;

- sock = sockfd_lookup_light(fd, &err, &fput_needed);
+ sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_SETSOCKOPT);
if (sock != NULL) {
err = security_socket_setsockopt(sock, level, optname);
if (err)
@@ -2017,7 +2022,10 @@ SYSCALL_DEFINE5(getsockopt, int, fd, int, level, int, optname,
int err, fput_needed;
struct socket *sock;

- sock = sockfd_lookup_light(fd, &err, &fput_needed);
+ sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_GETSOCKOPT,
+ (level == SOL_SCTP &&
+ optname == SCTP_SOCKOPT_PEELOFF)
+ ? CAP_PEELOFF : 0ULL);
if (sock != NULL) {
err = security_socket_getsockopt(sock, level, optname);
if (err)
@@ -2046,7 +2054,7 @@ SYSCALL_DEFINE2(shutdown, int, fd, int, how)
int err, fput_needed;
struct socket *sock;

- sock = sockfd_lookup_light(fd, &err, &fput_needed);
+ sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_SHUTDOWN);
if (sock != NULL) {
err = security_socket_shutdown(sock, how);
if (!err)
@@ -2082,10 +2090,12 @@ static int copy_msghdr_from_user(struct msghdr *kmsg,
return 0;
}

-static int ___sys_sendmsg(struct socket *sock, struct msghdr __user *msg,
- struct msghdr *msg_sys, unsigned int flags,
- struct used_address *used_address)
+static int ___sys_sendmsg(struct socket *sock_noaddr, struct socket *sock_addr,
+ struct msghdr __user *msg,
+ struct msghdr *msg_sys, unsigned int flags,
+ struct used_address *used_address)
{
+ struct socket *sock;
struct compat_msghdr __user *msg_compat =
(struct compat_msghdr __user *)msg;
struct sockaddr_storage address;
@@ -2105,6 +2115,9 @@ static int ___sys_sendmsg(struct socket *sock, struct msghdr __user *msg,
if (err)
return err;
}
+ sock = (msg_sys->msg_name ? sock_addr : sock_noaddr);
+ if (!sock)
+ return -EBADF;

if (msg_sys->msg_iovlen > UIO_FASTIOV) {
err = -EMSGSIZE;
@@ -2204,15 +2217,22 @@ long __sys_sendmsg(int fd, struct msghdr __user *msg, unsigned flags)
{
int fput_needed, err;
struct msghdr msg_sys;
- struct socket *sock;
-
- sock = sockfd_lookup_light(fd, &err, &fput_needed);
- if (!sock)
+ struct socket *sock_addr;
+ struct socket *sock_noaddr;
+
+ sock_addr = sockfd_lookupr_light(fd, &err, &fput_needed,
+ CAP_WRITE, CAP_CONNECT);
+ sock_noaddr = sock_addr;
+ if (!sock_noaddr)
+ sock_noaddr = sockfd_lookupr_light(fd, &err, &fput_needed,
+ CAP_WRITE);
+ if (!sock_noaddr)
goto out;

- err = ___sys_sendmsg(sock, msg, &msg_sys, flags, NULL);
+ err = ___sys_sendmsg(sock_noaddr, sock_addr, msg, &msg_sys, flags,
+ NULL);

- fput_light(sock->file, fput_needed);
+ fput_light(sock_noaddr->file, fput_needed);
out:
return err;
}
@@ -2232,7 +2252,8 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
unsigned int flags)
{
int fput_needed, err, datagrams;
- struct socket *sock;
+ struct socket *sock_addr;
+ struct socket *sock_noaddr;
struct mmsghdr __user *entry;
struct compat_mmsghdr __user *compat_entry;
struct msghdr msg_sys;
@@ -2243,8 +2264,13 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,

datagrams = 0;

- sock = sockfd_lookup_light(fd, &err, &fput_needed);
- if (!sock)
+ sock_addr = sockfd_lookupr_light(fd, &err, &fput_needed,
+ CAP_WRITE, CAP_CONNECT);
+ sock_noaddr = sock_addr;
+ if (!sock_noaddr)
+ sock_noaddr = sockfd_lookupr_light(fd, &err, &fput_needed,
+ CAP_WRITE);
+ if (!sock_noaddr)
return err;

used_address.name_len = UINT_MAX;
@@ -2254,14 +2280,15 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,

while (datagrams < vlen) {
if (MSG_CMSG_COMPAT & flags) {
- err = ___sys_sendmsg(sock, (struct msghdr __user *)compat_entry,
- &msg_sys, flags, &used_address);
+ err = ___sys_sendmsg(sock_noaddr, sock_addr,
+ (struct msghdr __user *)compat_entry,
+ &msg_sys, flags, &used_address);
if (err < 0)
break;
err = __put_user(err, &compat_entry->msg_len);
++compat_entry;
} else {
- err = ___sys_sendmsg(sock,
+ err = ___sys_sendmsg(sock_noaddr, sock_addr,
(struct msghdr __user *)entry,
&msg_sys, flags, &used_address);
if (err < 0)
@@ -2275,7 +2302,7 @@ int __sys_sendmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,
++datagrams;
}

- fput_light(sock->file, fput_needed);
+ fput_light(sock_noaddr->file, fput_needed);

/* We only return an error if no datagrams were able to be sent */
if (datagrams != 0)
@@ -2394,7 +2421,7 @@ long __sys_recvmsg(int fd, struct msghdr __user *msg, unsigned flags)
struct msghdr msg_sys;
struct socket *sock;

- sock = sockfd_lookup_light(fd, &err, &fput_needed);
+ sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_READ);
if (!sock)
goto out;

@@ -2434,7 +2461,7 @@ int __sys_recvmmsg(int fd, struct mmsghdr __user *mmsg, unsigned int vlen,

datagrams = 0;

- sock = sockfd_lookup_light(fd, &err, &fput_needed);
+ sock = sockfd_lookupr_light(fd, &err, &fput_needed, CAP_READ);
if (!sock)
return err;

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index b507cd327d9b..3d535e881e7b 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1413,7 +1413,7 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
bool svc_alien_sock(struct net *net, int fd)
{
int err;
- struct socket *sock = sockfd_lookup(fd, &err);
+ struct socket *sock = sockfd_lookupr(fd, &err, CAP_LIST_END);
bool ret = false;

if (!sock)
@@ -1441,7 +1441,7 @@ int svc_addsock(struct svc_serv *serv, const int fd, char *name_return,
const size_t len)
{
int err = 0;
- struct socket *so = sockfd_lookup(fd, &err);
+ struct socket *so = sockfd_lookupr(fd, &err, CAP_LISTEN);
struct svc_sock *svsk = NULL;
struct sockaddr_storage addr;
struct sockaddr *sin = (struct sockaddr *)&addr;
--
2.0.0.526.g5318336
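
Reviewer note on the sendmsg()/sendmmsg() hunks above: the FD is looked up
first with CAP_WRITE plus CAP_CONNECT and, if that fails, again with CAP_WRITE
alone; ___sys_sendmsg() then rejects a message that carries a destination
address unless the stronger lookup succeeded. A minimal userspace sketch of
that decision follows. The SK_* names, the bit positions and the numeric value
of SK_ENOTCAPABLE are placeholders invented for illustration, and the error
returned by a failed lookup is assumed to be -ENOTCAPABLE (as described in
Documentation/security/capsicum.txt); this is not the kernel code itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder rights bits and errno value (assumptions, not the real ABI). */
#define SK_CAP_WRITE	(UINT64_C(1) << 1)
#define SK_CAP_CONNECT	(UINT64_C(1) << 31)
#define SK_ENOTCAPABLE	93

/* True if every bit in 'needed' is also set in 'held'. */
static bool sk_rights_ok(uint64_t held, uint64_t needed)
{
	return (held & needed) == needed;
}

/*
 * Model of the check spread across __sys_sendmsg() and ___sys_sendmsg():
 * returns 0 if the send may proceed, or a negative errno-style value.
 */
static int sk_sendmsg_check(uint64_t held, bool has_dest_addr)
{
	/* First lookup: CAP_WRITE and CAP_CONNECT together. */
	bool addr_ok = sk_rights_ok(held, SK_CAP_WRITE | SK_CAP_CONNECT);
	/* Fallback lookup: CAP_WRITE alone. */
	bool noaddr_ok = addr_ok || sk_rights_ok(held, SK_CAP_WRITE);

	if (!noaddr_ok)
		return -SK_ENOTCAPABLE;	/* assumed error from the lookup */
	if (has_dest_addr && !addr_ok)
		return -EBADF;		/* msg_name set but no CAP_CONNECT */
	return 0;
}
```

For example, a capability holding only the write right can send on a connected
socket (sk_sendmsg_check(SK_CAP_WRITE, false) returns 0) but is refused when
msg_name is supplied.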

2014-07-25 13:51:11

by David Drysdale

Subject: [PATCH 09/11] capsicum: add syscalls to limit FD rights

Add the cap_rights_get(2) and cap_rights_limit(2) syscalls to
allow retrieval and modification of the rights associated with
a file descriptor.

When a normal file descriptor has its rights restricted in any
way, it becomes a Capsicum capability file descriptor. This is
a wrapper struct file that is installed in the fdtable in place
of the original file. From this point on, when the FD is converted
to a struct file by fget() (or equivalent), the wrapper is checked
for the appropriate rights and the wrapped inner normal file is
returned.

When a Capsicum capability file descriptor has its rights restricted
further (they cannot be expanded), a new wrapper is created with
the restricted rights, also wrapping the same inner normal file.
In other words, the .underlying field in a struct capsicum_capability
is always a normal file, never another Capsicum capability file.

These syscalls specify the different components of the compound
rights structure separately, allowing components to be unspecified
for no change.

Note that in FreeBSD 10.x the function of this pair of syscalls
is implemented as 3 distinct pairs of syscalls, one pair for each
component of the compound rights (primary/fcntl/ioctl).

Signed-off-by: David Drysdale <[email protected]>
---
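Illustration of the wrapping rule described above: limiting a plain file
creates a wrapper around it, and limiting an existing capability creates a new
wrapper around the same inner file, with any attempt to widen rights rejected.
The sketch below is a simplified userspace model only. The sketch_* names, the
single 64-bit rights mask and the numeric value of SKETCH_ENOTCAPABLE are
assumptions made for illustration; the real code uses struct
capsicum_capability, compound rights and the fdtable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ENOTCAPABLE 93	/* placeholder value (assumption) */

struct sketch_file {
	bool is_cap;			/* true for a capability wrapper */
	uint64_t rights;		/* only meaningful if is_cap */
	struct sketch_file *underlying;	/* only meaningful if is_cap */
};

/* True if 'big' includes every right present in 'little'. */
static bool sketch_contains(uint64_t big, uint64_t little)
{
	return (big & little) == little;
}

/*
 * Fill in 'wrapper' so that it restricts 'cur' to 'new_rights'.
 * Returns 0 on success or -SKETCH_ENOTCAPABLE if rights would widen.
 */
static int sketch_limit(struct sketch_file *cur, uint64_t new_rights,
			struct sketch_file *wrapper)
{
	struct sketch_file *base = cur;

	if (cur->is_cap) {
		/* Reject attempts to widen existing rights. */
		if (!sketch_contains(cur->rights, new_rights))
			return -SKETCH_ENOTCAPABLE;
		/* Wrap the inner file, never the old wrapper. */
		base = cur->underlying;
	}
	assert(!base->is_cap);	/* invariant: wrappers do not nest */
	wrapper->is_cap = true;
	wrapper->rights = new_rights;
	wrapper->underlying = base;
	return 0;
}
```

Limiting twice in this model leaves both wrappers pointing at the same plain
file, which mirrors the statement that .underlying is always a normal file.
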
arch/x86/syscalls/syscall_64.tbl | 2 +
include/linux/syscalls.h | 12 ++++
kernel/sys_ni.c | 4 ++
security/capsicum.c | 146 +++++++++++++++++++++++++++++++++++++++
4 files changed, 164 insertions(+)

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index ec255a1646d2..d980d2b8bfad 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,8 @@
314 common sched_setattr sys_sched_setattr
315 common sched_getattr sys_sched_getattr
316 common renameat2 sys_renameat2
+318 common cap_rights_limit sys_cap_rights_limit
+319 common cap_rights_get sys_cap_rights_get

#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index b0881a0ed322..d754f0846037 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -65,6 +65,7 @@ struct old_linux_dirent;
struct perf_event_attr;
struct file_handle;
struct sigaltstack;
+struct cap_rights;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -866,4 +867,15 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
unsigned long idx1, unsigned long idx2);
asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
+asmlinkage long sys_cap_rights_limit(unsigned int orig_fd,
+ const struct cap_rights __user *new_rights,
+ unsigned int fcntls,
+ int nioctls,
+ unsigned int __user *ioctls);
+asmlinkage long sys_cap_rights_get(unsigned int fd,
+ struct cap_rights __user *rightsp,
+ unsigned int __user *fcntls,
+ int __user *nioctls,
+ unsigned int __user *ioctls);
+
#endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 36441b51b5df..ef634cb1bc6c 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -213,3 +213,7 @@ cond_syscall(compat_sys_open_by_handle_at);

/* compare kernel pointers */
cond_syscall(sys_kcmp);
+
+/* capsicum object capabilities */
+cond_syscall(sys_cap_rights_get);
+cond_syscall(sys_cap_rights_limit);
diff --git a/security/capsicum.c b/security/capsicum.c
index 4a004829b9c8..42c14bab4b64 100644
--- a/security/capsicum.c
+++ b/security/capsicum.c
@@ -125,6 +125,152 @@ out_err:
return ERR_PTR(err);
}

+/* Takes ownership of rights->ioctls */
+static int capsicum_rights_limit(unsigned int fd,
+ struct capsicum_rights *rights)
+{
+ int rc = -EBADF;
+ struct capsicum_capability *cap;
+ struct file *capf = NULL;
+ struct file *file; /* current file for fd */
+ struct file *underlying; /* base file for capability */
+ struct files_struct *files = current->files;
+ struct fdtable *fdt;
+
+ /* Allocate capability before taking files->file_lock */
+ capf = capsicum_cap_alloc(rights, true);
+ rights->ioctls = NULL; /* capsicum_cap_alloc took ownership */
+ if (IS_ERR(capf))
+ return PTR_ERR(capf);
+ cap = capf->private_data;
+
+ spin_lock(&files->file_lock);
+ fdt = files_fdtable(files);
+ if (fd >= fdt->max_fds)
+ goto out_err;
+ file = fdt->fd[fd];
+ if (!file)
+ goto out_err;
+
+ /* If we're limiting an existing Capsicum capability object, ensure
+ * we wrap its underlying normal file. */
+ if (capsicum_is_cap(file)) {
+ struct capsicum_capability *old_cap = file->private_data;
+ /* Reject attempts to widen existing rights */
+ if (!cap_rights_contains(&old_cap->rights, &cap->rights)) {
+ rc = -ENOTCAPABLE;
+ goto out_err;
+ }
+ underlying = old_cap->underlying;
+ } else {
+ underlying = file;
+ }
+ if (!atomic_long_inc_not_zero(&underlying->f_count)) {
+ rc = -EBADF;
+ goto out_err;
+ }
+ cap->underlying = underlying;
+
+ fput(file);
+ rcu_assign_pointer(fdt->fd[fd], capf);
+ spin_unlock(&files->file_lock);
+ return 0;
+out_err:
+ spin_unlock(&files->file_lock);
+ fput(capf);
+ return rc;
+}
+
+SYSCALL_DEFINE5(cap_rights_limit,
+ unsigned int, fd,
+ const struct cap_rights __user *, new_rights,
+ unsigned int, new_fcntls,
+ int, nioctls,
+ unsigned int __user *, new_ioctls)
+{
+ struct capsicum_rights rights;
+
+ if (!new_rights)
+ return -EFAULT;
+ if (nioctls < 0 && nioctls != -1)
+ return -EINVAL;
+ if (copy_from_user(&rights.primary, new_rights,
+ sizeof(struct cap_rights)))
+ return -EFAULT;
+ rights.fcntls = new_fcntls;
+ rights.nioctls = nioctls;
+ if (rights.nioctls > 0) {
+ size_t size;
+
+ if (!new_ioctls)
+ return -EINVAL;
+ size = rights.nioctls * sizeof(unsigned int);
+ rights.ioctls = kmalloc(size, GFP_KERNEL);
+ if (!rights.ioctls)
+ return -ENOMEM;
+ if (copy_from_user(rights.ioctls, new_ioctls, size)) {
+ kfree(rights.ioctls);
+ return -EFAULT;
+ }
+ } else {
+ rights.ioctls = NULL;
+ }
+ if (cap_rights_regularize(&rights))
+ return -ENOTCAPABLE;
+
+ return capsicum_rights_limit(fd, &rights);
+}
+
+SYSCALL_DEFINE5(cap_rights_get,
+ unsigned int, fd,
+ struct cap_rights __user *, rightsp,
+ unsigned int __user *, fcntls,
+ int __user *, nioctls,
+ unsigned int __user *, ioctls)
+{
+ int result = -EFAULT;
+ struct file *file;
+ struct capsicum_rights *rights = &all_rights;
+ int ioctls_to_copy = -1;
+
+ file = fget_raw(fd);
+ if (file == NULL)
+ return -EBADF;
+ if (capsicum_is_cap(file)) {
+ struct capsicum_capability *cap = file->private_data;
+
+ rights = &cap->rights;
+ }
+
+ if (rightsp) {
+ if (copy_to_user(rightsp, &rights->primary,
+ sizeof(struct cap_rights)))
+ goto out;
+ }
+ if (fcntls) {
+ if (put_user(rights->fcntls, fcntls))
+ goto out;
+ }
+ if (nioctls) {
+ int n;
+
+ if (get_user(n, nioctls))
+ goto out;
+ if (put_user(rights->nioctls, nioctls))
+ goto out;
+ ioctls_to_copy = min(rights->nioctls, n);
+ }
+ if (ioctls && ioctls_to_copy > 0) {
+ if (copy_to_user(ioctls, rights->ioctls,
+ ioctls_to_copy * sizeof(unsigned int)))
+ goto out;
+ }
+ result = 0;
+out:
+ fput(file);
+ return result;
+}
+
/*
* File operations functions.
*/
--
2.0.0.526.g5318336
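
Reviewer note on cap_rights_get() above: the nioctls argument is used in both
directions. On entry it gives the capacity of the caller's ioctls buffer; on
return it holds the number of ioctls in the capability (or -1 for "all"), and
at most min(capacity, count) entries are copied. The following userspace model
sketches just that part. The function name and its return value (the number of
entries copied) are invented for illustration; the syscall itself returns 0 on
success.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Model of the ioctl-list handling in cap_rights_get().
 *  have/nhave: list held by the capability (nhave == -1 means all allowed)
 *  out:        caller's buffer, may be NULL
 *  nioctls:    in: capacity of 'out'; out: number of ioctls held
 * Returns the number of entries copied into 'out'.
 */
static int sketch_get_ioctls(const unsigned int *have, int nhave,
			     unsigned int *out, int *nioctls)
{
	int capacity;
	int to_copy;

	assert(nioctls != NULL);
	capacity = *nioctls;
	to_copy = nhave < capacity ? nhave : capacity;

	/* Always report the full count, even if the buffer is too small. */
	*nioctls = nhave;
	if (out && to_copy > 0) {
		memcpy(out, have, (size_t)to_copy * sizeof(*out));
		return to_copy;
	}
	return 0;
}
```

A caller can therefore detect truncation by comparing the returned *nioctls
with the capacity it passed in, and retry with a larger buffer.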

2014-07-25 13:53:56

by David Drysdale

Subject: [PATCH 03/11] capsicum: rights values and structure definitions

Define (in include/uapi/linux/capsicum.h) values for primary
rights associated with Capsicum capability file descriptors.

Also define the structure that primary rights reside in (struct
cap_rights), and the complete compound rights structure (struct
capsicum_rights).

- Primary rights describe the main operations that can be
performed on a file.
- Secondary rights allow for specific fcntl() and ioctl()
operations to be policed.

Add functions to manipulate these rights structures.

This change is adapted from the FreeBSD 10.x implementation of
Capsicum, with the aim of preserving compatibility between the
two implementations as closely as possible.

Signed-off-by: David Drysdale <[email protected]>
---
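Illustration of the primary rights encoding defined in this patch: each right
carries a one-hot array-index field in bits 57-61 alongside its permission bit,
so a right value identifies which cr_rights[] element it belongs to. The
CAPRIGHT/CAPIDXBIT macros and the CAP_* values below are copied from
include/uapi/linux/capsicum.h in this patch; the sketch_* helpers are
hypothetical and only approximate what the rights manipulation functions might
do.

```c
#include <assert.h>
#include <stdint.h>

/* Copied from include/uapi/linux/capsicum.h in this patch. */
#define CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))
#define CAPIDXBIT(right)	((int)(((right) >> 57) & 0x1F))
#define CAP_READ		CAPRIGHT(0, 0x0000000000000001ULL)
#define CAP_WRITE		CAPRIGHT(0, 0x0000000000000002ULL)
#define CAP_IOCTL		CAPRIGHT(1, 0x0000000000000080ULL)

/* Hypothetical helper: one-hot index field to array index, -1 if invalid. */
static int sketch_right_to_index(uint64_t right)
{
	int idxbit = CAPIDXBIT(right);
	int i;

	for (i = 0; i < 5; i++)
		if (idxbit == (1 << i))
			return i;
	return -1;	/* no index bit, or rights from different indices */
}

/* Hypothetical helper: add 'right' to a two-element rights array. */
static int sketch_set(uint64_t cr_rights[2], uint64_t right)
{
	int i = sketch_right_to_index(right);

	if (i < 0 || i >= 2)
		return -1;
	cr_rights[i] |= right;
	return 0;
}

/* Hypothetical helper: nonzero if 'right' is fully present. */
static int sketch_is_set(const uint64_t cr_rights[2], uint64_t right)
{
	int i = sketch_right_to_index(right);

	if (i < 0 || i >= 2)
		return 0;
	return (cr_rights[i] & right) == right;
}
```

Composite rights built from a single index (for example CAP_READ | CAP_WRITE)
keep a one-hot index field, whereas OR-ing rights from index 0 and index 1
sets two index bits and is rejected by sketch_right_to_index().
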
Documentation/security/capsicum.txt | 80 +++++++++
include/linux/capsicum.h | 50 ++++++
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/capsicum.h | 343 ++++++++++++++++++++++++++++++++++++
security/Kconfig | 15 ++
security/Makefile | 2 +-
security/capsicum-rights.c | 205 +++++++++++++++++++++
security/capsicum-rights.h | 10 ++
8 files changed, 705 insertions(+), 1 deletion(-)
create mode 100644 Documentation/security/capsicum.txt
create mode 100644 include/linux/capsicum.h
create mode 100644 include/uapi/linux/capsicum.h
create mode 100644 security/capsicum-rights.c
create mode 100644 security/capsicum-rights.h

diff --git a/Documentation/security/capsicum.txt b/Documentation/security/capsicum.txt
new file mode 100644
index 000000000000..73636cc5134f
--- /dev/null
+++ b/Documentation/security/capsicum.txt
@@ -0,0 +1,80 @@
+Capsicum Object Capabilities
+============================
+
+Capsicum is a lightweight OS capability and sandbox framework, which allows
+security-aware userspace applications to sandbox parts of their own code in a
+highly granular way, reducing the attack surface in the event of subversion.
+
+Originally developed at the University of Cambridge Computer Laboratory, and
+initially implemented in FreeBSD 9.x, Capsicum extends the POSIX API, providing
+several new OS primitives to support object-capability security on UNIX-like
+operating systems.
+
+Note that Capsicum capability file descriptors are radically different to the
+POSIX.1e capabilities that are already available in Linux:
+ - POSIX.1e capabilities subdivide the root user's authority into different
+ areas of functionality.
+ - Capsicum capabilities restrict individual file descriptors so that
+ only operations permitted by that particular FD's rights are allowed.
+
+
+Overview
+--------
+
+Capability-based security is a security model where objects can only be
+accessed via capabilities, which are unforgeable tokens of authority that only
+give rights to perform certain operations.
+
+Capsicum is a pragmatic blend of capability-based security with standard
+UNIX/POSIX system semantics. A Capsicum capability is a file descriptor that
+has an associated rights bitmask, and the kernel polices operations using that
+file descriptor, failing operations with insufficient rights.
+
+
+Capability Data Structure
+-------------------------
+
+Internally, a capability is a particular kind of struct file that wraps an
+underlying normal file. The private data for the wrapper indicates the
+wrapped file, and holds the rights information for the capability.
+
+
+FD to File Conversion
+---------------------
+
+The primary policing of Capsicum capabilities occurs when a user-provided file
+descriptor is converted to a struct file object, normally using one of the
+fgetr() family of functions.
+
+All such operations in the kernel are annotated with information about the
+operations that are going to be performed on the retrieved struct file. For
+example, a file that is retrieved for a read operation has its fgetr() call
+annotated with CAP_READ, indicating that any capability FD that reaches this
+point needs to include the CAP_READ right to progress further. If the
+appropriate right is not available, -ENOTCAPABLE is returned.
+
+This change is the most significant change to the kernel, as it affects all
+FD-to-file conversions. However, for a non-Capsicum build of the kernel the
+impact is minimal as the additional rights parameters to fgetr*() are macroed
+out.
+
+
+Path Traversal
+--------------
+
+Capsicum does allow new files to be accessed beneath a directory for which the
+application has a suitable capability FD (one including the CAP_LOOKUP right),
+using the openat(2) system call. To prevent escape from the directory, path
+traversals are policed for "/" and ".." components by implicitly setting the
+O_BENEATH flag for file-open operations.
+
+
+New System Calls
+----------------
+
+Capsicum implements the following new system calls:
+ - cap_rights_limit: restrict the rights associated with a file descriptor,
+ thus turning it into a capability FD; internally this is implemented by
+ wrapping the original struct file with a capability file (security/capsicum.c)
+ - cap_rights_get: return the rights associated with a capability FD
+ (security/capsicum.c)
diff --git a/include/linux/capsicum.h b/include/linux/capsicum.h
new file mode 100644
index 000000000000..74f79756097a
--- /dev/null
+++ b/include/linux/capsicum.h
@@ -0,0 +1,50 @@
+#ifndef _LINUX_CAPSICUM_H
+#define _LINUX_CAPSICUM_H
+
+#include <stdarg.h>
+#include <uapi/linux/capsicum.h>
+
+struct file;
+/* Complete rights structure (primary and subrights). */
+struct capsicum_rights {
+ struct cap_rights primary;
+ unsigned int fcntls; /* Only valid if CAP_FCNTL set in primary. */
+ int nioctls; /* -1=>all; only valid if CAP_IOCTL set in primary */
+ unsigned int *ioctls;
+};
+
+#define CAP_LIST_END 0ULL
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+/* Rights manipulation functions */
+#define cap_rights_init(rights, ...) \
+ _cap_rights_init((rights), __VA_ARGS__, CAP_LIST_END)
+#define cap_rights_set(rights, ...) \
+ _cap_rights_set((rights), __VA_ARGS__, CAP_LIST_END)
+struct capsicum_rights *_cap_rights_init(struct capsicum_rights *rights, ...);
+struct capsicum_rights *_cap_rights_set(struct capsicum_rights *rights, ...);
+struct capsicum_rights *cap_rights_vinit(struct capsicum_rights *rights,
+ va_list ap);
+struct capsicum_rights *cap_rights_vset(struct capsicum_rights *rights,
+ va_list ap);
+struct capsicum_rights *cap_rights_set_all(struct capsicum_rights *rights);
+bool cap_rights_is_all(const struct capsicum_rights *rights);
+
+#else
+
+#define cap_rights_init(rights, ...) _cap_rights_noop(rights)
+#define cap_rights_set(rights, ...) _cap_rights_noop(rights)
+#define cap_rights_set_all(rights) _cap_rights_noop(rights)
+static inline struct capsicum_rights *
+_cap_rights_noop(struct capsicum_rights *rights)
+{
+ return rights;
+}
+static inline bool cap_rights_is_all(const struct capsicum_rights *rights)
+{
+ return true;
+}
+
+#endif
+
+#endif /* _LINUX_CAPSICUM_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 24e9033f8b3f..99e5d0fef529 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -73,6 +73,7 @@ header-y += btrfs.h
header-y += can.h
header-y += capability.h
header-y += capi.h
+header-y += capsicum.h
header-y += cciss_defs.h
header-y += cciss_ioctl.h
header-y += cdrom.h
diff --git a/include/uapi/linux/capsicum.h b/include/uapi/linux/capsicum.h
new file mode 100644
index 000000000000..a39ac86fa183
--- /dev/null
+++ b/include/uapi/linux/capsicum.h
@@ -0,0 +1,343 @@
+#ifndef _UAPI_LINUX_CAPSICUM_H
+#define _UAPI_LINUX_CAPSICUM_H
+
+/*-
+ * Copyright (c) 2008-2010 Robert N. M. Watson
+ * Copyright (c) 2012 FreeBSD Foundation
+ * All rights reserved.
+ *
+ * This software was developed at the University of Cambridge Computer
+ * Laboratory with support from a grant from Google, Inc.
+ *
+ * Portions of this software were developed by Pawel Jakub Dawidek under
+ * sponsorship from the FreeBSD Foundation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+/*
+ * Definitions for Capsicum capabilities facility.
+ */
+#include <linux/types.h>
+
+/*
+ * The top two bits in the first element of the cr_rights[] array contain
+ * total number of elements in the array - 2. This means if those two bits are
+ * equal to 0, we have 2 array elements.
+ * The top two bits in all remaining array elements should be 0.
+ * The next five bits contain array index. Only one bit is used and bit position
+ * in this five-bits range defines array index. This means there can be at most
+ * five array elements.
+ */
+#define CAP_RIGHTS_VERSION_00 0
+/*
+#define CAP_RIGHTS_VERSION_01 1
+#define CAP_RIGHTS_VERSION_02 2
+#define CAP_RIGHTS_VERSION_03 3
+*/
+#define CAP_RIGHTS_VERSION CAP_RIGHTS_VERSION_00
+
+/* Primary rights */
+struct cap_rights {
+ __u64 cr_rights[CAP_RIGHTS_VERSION + 2];
+};
+
+#define CAPRIGHT(idx, bit) ((1ULL << (57 + (idx))) | (bit))
+
+/*
+ * Possible rights on capabilities.
+ *
+ * Notes:
+ * Some system calls don't require a capability in order to perform an
+ * operation on an fd. These include: close, dup, dup2.
+ *
+ * sendfile is authorized using CAP_READ on the file and CAP_WRITE on the
+ * socket.
+ *
+ * mmap() and aio*() system calls will need special attention as they may
+ * involve reads or writes depending a great deal on context.
+ */
+
+/* INDEX 0 */
+
+/*
+ * General file I/O.
+ */
+/* Allows for openat(O_RDONLY), read(2), readv(2). */
+#define CAP_READ CAPRIGHT(0, 0x0000000000000001ULL)
+/* Allows for openat(O_WRONLY | O_APPEND), write(2), writev(2). */
+#define CAP_WRITE CAPRIGHT(0, 0x0000000000000002ULL)
+/* Allows for lseek(fd, 0, SEEK_CUR). */
+#define CAP_SEEK_TELL CAPRIGHT(0, 0x0000000000000004ULL)
+/* Allows for lseek(2). */
+#define CAP_SEEK (CAP_SEEK_TELL | 0x0000000000000008ULL)
+/* Allows for aio_read(2), pread(2), preadv(2). */
+#define CAP_PREAD (CAP_SEEK | CAP_READ)
+/*
+ * Allows for aio_write(2), openat(O_WRONLY) (without O_APPEND), pwrite(2),
+ * pwritev(2).
+ */
+#define CAP_PWRITE (CAP_SEEK | CAP_WRITE)
+/* Allows for mmap(PROT_NONE). */
+#define CAP_MMAP CAPRIGHT(0, 0x0000000000000010ULL)
+/* Allows for mmap(PROT_READ). */
+#define CAP_MMAP_R (CAP_MMAP | CAP_SEEK | CAP_READ)
+/* Allows for mmap(PROT_WRITE). */
+#define CAP_MMAP_W (CAP_MMAP | CAP_SEEK | CAP_WRITE)
+/* Allows for mmap(PROT_EXEC). */
+#define CAP_MMAP_X (CAP_MMAP | CAP_SEEK | 0x0000000000000020ULL)
+/* Allows for mmap(PROT_READ | PROT_WRITE). */
+#define CAP_MMAP_RW (CAP_MMAP_R | CAP_MMAP_W)
+/* Allows for mmap(PROT_READ | PROT_EXEC). */
+#define CAP_MMAP_RX (CAP_MMAP_R | CAP_MMAP_X)
+/* Allows for mmap(PROT_WRITE | PROT_EXEC). */
+#define CAP_MMAP_WX (CAP_MMAP_W | CAP_MMAP_X)
+/* Allows for mmap(PROT_READ | PROT_WRITE | PROT_EXEC). */
+#define CAP_MMAP_RWX (CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
+/* Allows for openat(O_CREAT). */
+#define CAP_CREATE CAPRIGHT(0, 0x0000000000000040ULL)
+/* Allows for openat(O_EXEC) and fexecve(2) in turn. */
+#define CAP_FEXECVE CAPRIGHT(0, 0x0000000000000080ULL)
+/* Allows for openat(O_SYNC), openat(O_FSYNC), fsync(2), aio_fsync(2). */
+#define CAP_FSYNC CAPRIGHT(0, 0x0000000000000100ULL)
+/* Allows for openat(O_TRUNC), ftruncate(2). */
+#define CAP_FTRUNCATE CAPRIGHT(0, 0x0000000000000200ULL)
+
+/* Lookups - used to constrain *at() calls. */
+#define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL)
+
+/* VFS methods. */
+/* Allows for fchdir(2). */
+#define CAP_FCHDIR CAPRIGHT(0, 0x0000000000000800ULL)
+/* Allows for fchflags(2). */
+#define CAP_FCHFLAGS CAPRIGHT(0, 0x0000000000001000ULL)
+/* Allows for fchflags(2) and chflagsat(2). */
+#define CAP_CHFLAGSAT (CAP_FCHFLAGS | CAP_LOOKUP)
+/* Allows for fchmod(2). */
+#define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL)
+/* Allows for fchmod(2) and fchmodat(2). */
+#define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP)
+/* Allows for fchown(2). */
+#define CAP_FCHOWN CAPRIGHT(0, 0x0000000000004000ULL)
+/* Allows for fchown(2) and fchownat(2). */
+#define CAP_FCHOWNAT (CAP_FCHOWN | CAP_LOOKUP)
+/* Allows for fcntl(2). */
+#define CAP_FCNTL CAPRIGHT(0, 0x0000000000008000ULL)
+/*
+ * Allows for flock(2), openat(O_SHLOCK), openat(O_EXLOCK),
+ * fcntl(F_SETLK_REMOTE), fcntl(F_SETLKW), fcntl(F_SETLK), fcntl(F_GETLK).
+ */
+#define CAP_FLOCK CAPRIGHT(0, 0x0000000000010000ULL)
+/* Allows for fpathconf(2). */
+#define CAP_FPATHCONF CAPRIGHT(0, 0x0000000000020000ULL)
+/* Allows for UFS background-fsck operations. */
+#define CAP_FSCK CAPRIGHT(0, 0x0000000000040000ULL)
+/* Allows for fstat(2). */
+#define CAP_FSTAT CAPRIGHT(0, 0x0000000000080000ULL)
+/* Allows for fstat(2), fstatat(2) and faccessat(2). */
+#define CAP_FSTATAT (CAP_FSTAT | CAP_LOOKUP)
+/* Allows for fstatfs(2). */
+#define CAP_FSTATFS CAPRIGHT(0, 0x0000000000100000ULL)
+/* Allows for futimes(2). */
+#define CAP_FUTIMES CAPRIGHT(0, 0x0000000000200000ULL)
+/* Allows for futimes(2) and futimesat(2). */
+#define CAP_FUTIMESAT (CAP_FUTIMES | CAP_LOOKUP)
+/* Allows for linkat(2) and renameat(2) (destination directory descriptor). */
+#define CAP_LINKAT (CAP_LOOKUP | 0x0000000000400000ULL)
+/* Allows for mkdirat(2). */
+#define CAP_MKDIRAT (CAP_LOOKUP | 0x0000000000800000ULL)
+/* Allows for mkfifoat(2). */
+#define CAP_MKFIFOAT (CAP_LOOKUP | 0x0000000001000000ULL)
+/* Allows for mknodat(2). */
+#define CAP_MKNODAT (CAP_LOOKUP | 0x0000000002000000ULL)
+/* Allows for renameat(2). */
+#define CAP_RENAMEAT (CAP_LOOKUP | 0x0000000004000000ULL)
+/* Allows for symlinkat(2). */
+#define CAP_SYMLINKAT (CAP_LOOKUP | 0x0000000008000000ULL)
+/*
+ * Allows for unlinkat(2) and renameat(2) if destination object exists and
+ * will be removed.
+ */
+#define CAP_UNLINKAT (CAP_LOOKUP | 0x0000000010000000ULL)
+
+/* Socket operations. */
+/* Allows for accept(2) and accept4(2). */
+#define CAP_ACCEPT CAPRIGHT(0, 0x0000000020000000ULL)
+/* Allows for bind(2). */
+#define CAP_BIND CAPRIGHT(0, 0x0000000040000000ULL)
+/* Allows for connect(2). */
+#define CAP_CONNECT CAPRIGHT(0, 0x0000000080000000ULL)
+/* Allows for getpeername(2). */
+#define CAP_GETPEERNAME CAPRIGHT(0, 0x0000000100000000ULL)
+/* Allows for getsockname(2). */
+#define CAP_GETSOCKNAME CAPRIGHT(0, 0x0000000200000000ULL)
+/* Allows for getsockopt(2). */
+#define CAP_GETSOCKOPT CAPRIGHT(0, 0x0000000400000000ULL)
+/* Allows for listen(2). */
+#define CAP_LISTEN CAPRIGHT(0, 0x0000000800000000ULL)
+/* Allows for sctp_peeloff(2). */
+#define CAP_PEELOFF CAPRIGHT(0, 0x0000001000000000ULL)
+#define CAP_RECV CAP_READ
+#define CAP_SEND CAP_WRITE
+/* Allows for setsockopt(2). */
+#define CAP_SETSOCKOPT CAPRIGHT(0, 0x0000002000000000ULL)
+/* Allows for shutdown(2). */
+#define CAP_SHUTDOWN CAPRIGHT(0, 0x0000004000000000ULL)
+
+/* Allows for bindat(2) on a directory descriptor. */
+#define CAP_BINDAT (CAP_LOOKUP | 0x0000008000000000ULL)
+/* Allows for connectat(2) on a directory descriptor. */
+#define CAP_CONNECTAT (CAP_LOOKUP | 0x0000010000000000ULL)
+
+#define CAP_SOCK_CLIENT \
+ (CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
+ CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
+#define CAP_SOCK_SERVER \
+ (CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
+ CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
+ CAP_SETSOCKOPT | CAP_SHUTDOWN)
+
+/* All used bits for index 0. */
+#define CAP_ALL0 CAPRIGHT(0, 0x0000007FFFFFFFFFULL)
+
+/* Available bits for index 0. */
+#define CAP_UNUSED0_40 CAPRIGHT(0, 0x0000008000000000ULL)
+/* ... */
+#define CAP_UNUSED0_57 CAPRIGHT(0, 0x0100000000000000ULL)
+
+/* INDEX 1 */
+
+/* Mandatory Access Control. */
+/* Allows for mac_get_fd(3). */
+#define CAP_MAC_GET CAPRIGHT(1, 0x0000000000000001ULL)
+/* Allows for mac_set_fd(3). */
+#define CAP_MAC_SET CAPRIGHT(1, 0x0000000000000002ULL)
+
+/* Methods on semaphores. */
+#define CAP_SEM_GETVALUE CAPRIGHT(1, 0x0000000000000004ULL)
+#define CAP_SEM_POST CAPRIGHT(1, 0x0000000000000008ULL)
+#define CAP_SEM_WAIT CAPRIGHT(1, 0x0000000000000010ULL)
+
+/* Allows select(2) and poll(2) on descriptor. */
+#define CAP_EVENT CAPRIGHT(1, 0x0000000000000020ULL)
+/* Allows for kevent(2) on kqueue descriptor with eventlist != NULL. */
+#define CAP_KQUEUE_EVENT CAPRIGHT(1, 0x0000000000000040ULL)
+
+/* Strange and powerful rights that should not be given lightly. */
+/* Allows for ioctl(2). */
+#define CAP_IOCTL CAPRIGHT(1, 0x0000000000000080ULL)
+#define CAP_TTYHOOK CAPRIGHT(1, 0x0000000000000100ULL)
+
+/* Process management via process descriptors. */
+/* Allows for pdgetpid(2). */
+#define CAP_PDGETPID CAPRIGHT(1, 0x0000000000000200ULL)
+/* Allows for pdwait4(2). */
+#define CAP_PDWAIT CAPRIGHT(1, 0x0000000000000400ULL)
+/* Allows for pdkill(2). */
+#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL)
+
+/* Extended attributes. */
+/* Allows for extattr_delete_fd(2). */
+#define CAP_EXTATTR_DELETE CAPRIGHT(1, 0x0000000000001000ULL)
+/* Allows for extattr_get_fd(2). */
+#define CAP_EXTATTR_GET CAPRIGHT(1, 0x0000000000002000ULL)
+/* Allows for extattr_list_fd(2). */
+#define CAP_EXTATTR_LIST CAPRIGHT(1, 0x0000000000004000ULL)
+/* Allows for extattr_set_fd(2). */
+#define CAP_EXTATTR_SET CAPRIGHT(1, 0x0000000000008000ULL)
+
+/* Access Control Lists. */
+/* Allows for acl_valid_fd_np(3). */
+#define CAP_ACL_CHECK CAPRIGHT(1, 0x0000000000010000ULL)
+/* Allows for acl_delete_fd_np(3). */
+#define CAP_ACL_DELETE CAPRIGHT(1, 0x0000000000020000ULL)
+/* Allows for acl_get_fd(3) and acl_get_fd_np(3). */
+#define CAP_ACL_GET CAPRIGHT(1, 0x0000000000040000ULL)
+/* Allows for acl_set_fd(3) and acl_set_fd_np(3). */
+#define CAP_ACL_SET CAPRIGHT(1, 0x0000000000080000ULL)
+
+/* Allows for kevent(2) on kqueue descriptor with changelist != NULL. */
+#define CAP_KQUEUE_CHANGE CAPRIGHT(1, 0x0000000000100000ULL)
+
+#define CAP_KQUEUE (CAP_KQUEUE_EVENT | CAP_KQUEUE_CHANGE)
+
+/* Modify signalfd signal mask. */
+#define CAP_FSIGNAL CAPRIGHT(1, 0x0000000000200000ULL)
+
+/* Modify epollfd set of FDs/events */
+#define CAP_EPOLL_CTL CAPRIGHT(1, 0x0000000000400000ULL)
+
+/* Modify things monitored by inotify/fanotify FD */
+#define CAP_NOTIFY CAPRIGHT(1, 0x0000000000800000ULL)
+
+/* Allow entry to a namespace associated with a file descriptor */
+#define CAP_SETNS CAPRIGHT(1, 0x0000000001000000ULL)
+
+/* Allow performance monitoring operations */
+#define CAP_PERFMON CAPRIGHT(1, 0x0000000002000000ULL)
+
+/* All used bits for index 1. */
+#define CAP_ALL1 CAPRIGHT(1, 0x0000000003FFFFFFULL)
+
+/* Available bits for index 1. */
+#define CAP_UNUSED1_27 CAPRIGHT(1, 0x0000000004000000ULL)
+/* ... */
+#define CAP_UNUSED1_57 CAPRIGHT(1, 0x0100000000000000ULL)
+
+/* Backward compatibility. */
+#define CAP_POLL_EVENT CAP_EVENT
+
+#define CAP_SET_ALL(rights) do { \
+ (rights)->cr_rights[0] = \
+ ((__u64)CAP_RIGHTS_VERSION << 62) | CAP_ALL0; \
+ (rights)->cr_rights[1] = CAP_ALL1; \
+} while (0)
+
+#define CAP_SET_NONE(rights) do { \
+ (rights)->cr_rights[0] = \
+ ((__u64)CAP_RIGHTS_VERSION << 62) | CAPRIGHT(0, 0ULL); \
+ (rights)->cr_rights[1] = CAPRIGHT(1, 0ULL); \
+} while (0)
+
+#define CAP_IS_ALL(rights) \
+ (((rights)->cr_rights[0] == \
+ (((__u64)CAP_RIGHTS_VERSION << 62) | CAP_ALL0)) && \
+ ((rights)->cr_rights[1] == CAP_ALL1))
+
+#define CAPRVER(right) ((int)((right) >> 62))
+#define CAPVER(rights) CAPRVER((rights)->cr_rights[0])
+#define CAPARSIZE(rights) (CAPVER(rights) + 2)
+#define CAPIDXBIT(right) ((int)(((right) >> 57) & 0x1F))
+
+/*
+ * Allowed fcntl(2) commands.
+ */
+#define CAP_FCNTL_GETFL (1 << F_GETFL)
+#define CAP_FCNTL_SETFL (1 << F_SETFL)
+#define CAP_FCNTL_GETOWN (1 << F_GETOWN)
+#define CAP_FCNTL_SETOWN (1 << F_SETOWN)
+#define CAP_FCNTL_ALL (CAP_FCNTL_GETFL | CAP_FCNTL_SETFL | \
+ CAP_FCNTL_GETOWN | CAP_FCNTL_SETOWN)
+
+#define CAP_IOCTLS_ALL SSIZE_MAX
+
+#endif /* _UAPI_LINUX_CAPSICUM_H */
diff --git a/security/Kconfig b/security/Kconfig
index beb86b500adf..006020864612 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -117,6 +117,21 @@ config LSM_MMAP_MIN_ADDR
this low address space will need the permission specific to the
systems running LSM.

+config SECURITY_CAPSICUM
+ bool "Capsicum capabilities"
+ default y
+ depends on SECURITY
+ depends on SECURITY_PATH
+ depends on SECCOMP
+ help
+ Enable the Capsicum capability framework, which implements security
+ primitives that support fine-grained capabilities on file
+ descriptors; see Documentation/security/capsicum.txt for more
+ details.
+
+ If you are unsure as to whether this is required, answer N.
+
+
source security/selinux/Kconfig
source security/smack/Kconfig
source security/tomoyo/Kconfig
diff --git a/security/Makefile b/security/Makefile
index 05f1c934d74b..c5e1363ae136 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -14,7 +14,7 @@ obj-y += commoncap.o
obj-$(CONFIG_MMU) += min_addr.o

# Object file lists
-obj-$(CONFIG_SECURITY) += security.o capability.o
+obj-$(CONFIG_SECURITY) += security.o capability.o capsicum-rights.o
obj-$(CONFIG_SECURITYFS) += inode.o
obj-$(CONFIG_SECURITY_SELINUX) += selinux/
obj-$(CONFIG_SECURITY_SMACK) += smack/
diff --git a/security/capsicum-rights.c b/security/capsicum-rights.c
new file mode 100644
index 000000000000..5ce18a684848
--- /dev/null
+++ b/security/capsicum-rights.c
@@ -0,0 +1,205 @@
+/*-
+ * Copyright (c) 2013 FreeBSD Foundation
+ * All rights reserved.
+ *
+ * This software was developed by Pawel Jakub Dawidek under sponsorship from
+ * the FreeBSD Foundation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <stdarg.h>
+#include <linux/capsicum.h>
+#include <linux/slab.h>
+#include <linux/fcntl.h>
+#include <linux/bug.h>
+
+#include "capsicum-rights.h"
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+#define CAPARSIZE_MIN (CAP_RIGHTS_VERSION_00 + 2)
+#define CAPARSIZE_MAX (CAP_RIGHTS_VERSION + 2)
+
+/*
+ * -1 indicates an invalid index value, otherwise log2(v), i.e.:
+ * 0x001 -> 0, 0x002 -> 1, 0x004 -> 2, 0x008 -> 3, 0x010 -> 4, rest -> -1
+ */
+static const int bit2idx[] = {
+ -1, 0, 1, -1, 2, -1, -1, -1, 3, -1, -1, -1, -1, -1, -1, -1,
+ 4, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
+};
+
+static inline int right_to_index(__u64 right)
+{
+ return bit2idx[CAPIDXBIT(right)];
+}
+
+static inline bool has_right(const struct capsicum_rights *rights, u64 right)
+{
+ int idx = right_to_index(right);
+
+ return (rights->primary.cr_rights[idx] & right) == right;
+}
+
+struct capsicum_rights *
+cap_rights_vset(struct capsicum_rights *rights, va_list ap)
+{
+ u64 right;
+ int i, n;
+
+ n = CAPARSIZE(&rights->primary);
+ BUG_ON(n < CAPARSIZE_MIN || n > CAPARSIZE_MAX);
+
+ while (true) {
+ right = va_arg(ap, u64);
+ if (right == 0)
+ break;
+ BUG_ON(CAPRVER(right) != 0);
+ i = right_to_index(right);
+ BUG_ON(i < 0 || i >= n);
+ BUG_ON(CAPIDXBIT(rights->primary.cr_rights[i]) !=
+ CAPIDXBIT(right));
+ rights->primary.cr_rights[i] |= right;
+ }
+ return rights;
+}
+EXPORT_SYMBOL(cap_rights_vset);
+
+struct capsicum_rights *
+cap_rights_vinit(struct capsicum_rights *rights, va_list ap)
+{
+ CAP_SET_NONE(&rights->primary);
+ rights->nioctls = 0;
+ rights->ioctls = NULL;
+ rights->fcntls = 0;
+ cap_rights_vset(rights, ap);
+ return rights;
+}
+EXPORT_SYMBOL(cap_rights_vinit);
+
+bool cap_rights_regularize(struct capsicum_rights *rights)
+{
+ bool changed = false;
+
+ if (!has_right(rights, CAP_FCNTL) && rights->fcntls != 0x00) {
+ changed = true;
+ rights->fcntls = 0x00;
+ }
+ if (!has_right(rights, CAP_IOCTL) && (rights->nioctls != 0)) {
+ changed = true;
+ kfree(rights->ioctls);
+ rights->nioctls = 0;
+ rights->ioctls = NULL;
+ }
+ return changed;
+}
+
+struct capsicum_rights *_cap_rights_init(struct capsicum_rights *rights, ...)
+{
+ va_list ap;
+
+ va_start(ap, rights);
+ cap_rights_vinit(rights, ap);
+ va_end(ap);
+ return rights;
+}
+EXPORT_SYMBOL(_cap_rights_init);
+
+struct capsicum_rights *_cap_rights_set(struct capsicum_rights *rights, ...)
+{
+ va_list ap;
+
+ va_start(ap, rights);
+ cap_rights_vset(rights, ap);
+ va_end(ap);
+ return rights;
+}
+EXPORT_SYMBOL(_cap_rights_set);
+
+struct capsicum_rights *cap_rights_set_all(struct capsicum_rights *rights)
+{
+ CAP_SET_ALL(&rights->primary);
+ rights->nioctls = -1;
+ rights->ioctls = NULL;
+ rights->fcntls = CAP_FCNTL_ALL;
+ return rights;
+}
+EXPORT_SYMBOL(cap_rights_set_all);
+
+static bool cap_rights_ioctls_contains(const struct capsicum_rights *big,
+ const struct capsicum_rights *little)
+{
+ int i, j;
+
+ if (big->nioctls == -1)
+ return true;
+ if (big->nioctls < little->nioctls)
+ return false;
+ for (i = 0; i < little->nioctls; i++) {
+ for (j = 0; j < big->nioctls; j++) {
+ if (little->ioctls[i] == big->ioctls[j])
+ break;
+ }
+ if (j == big->nioctls)
+ return false;
+ }
+ return true;
+}
+
+static bool cap_rights_primary_contains(const struct cap_rights *big,
+ const struct cap_rights *little)
+{
+ unsigned int i, n;
+
+ BUG_ON(CAPVER(big) != CAP_RIGHTS_VERSION_00);
+ BUG_ON(CAPVER(little) != CAP_RIGHTS_VERSION_00);
+
+ n = CAPARSIZE(big);
+ BUG_ON(n < CAPARSIZE_MIN || n > CAPARSIZE_MAX);
+
+ for (i = 0; i < n; i++) {
+ if ((big->cr_rights[i] & little->cr_rights[i]) !=
+ little->cr_rights[i]) {
+ return false;
+ }
+ }
+ return true;
+}
+
+bool cap_rights_contains(const struct capsicum_rights *big,
+ const struct capsicum_rights *little)
+{
+ return cap_rights_primary_contains(&big->primary,
+ &little->primary) &&
+ ((big->fcntls & little->fcntls) == little->fcntls) &&
+ cap_rights_ioctls_contains(big, little);
+}
+
+bool cap_rights_is_all(const struct capsicum_rights *rights)
+{
+ return CAP_IS_ALL(&rights->primary) &&
+ rights->fcntls == CAP_FCNTL_ALL &&
+ rights->nioctls == -1;
+}
+EXPORT_SYMBOL(cap_rights_is_all);
+
+#endif /* CONFIG_SECURITY_CAPSICUM */
diff --git a/security/capsicum-rights.h b/security/capsicum-rights.h
new file mode 100644
index 000000000000..b7143e3d65b7
--- /dev/null
+++ b/security/capsicum-rights.h
@@ -0,0 +1,10 @@
+#ifndef _CAPSICUM_RIGHTS_H
+#define _CAPSICUM_RIGHTS_H
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+bool cap_rights_regularize(struct capsicum_rights *rights);
+bool cap_rights_contains(const struct capsicum_rights *big,
+ const struct capsicum_rights *little);
+#endif
+
+#endif /* _CAPSICUM_RIGHTS_H */
--
2.0.0.526.g5318336
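
The rights encoding used by capsicum-rights.c above can be illustrated with a
small userspace sketch. It assumes the layout implied by CAPRVER() and
CAPIDXBIT(): a version in bits 62-63 of word 0, a one-hot array index in bits
57-61, and the rights themselves in the low bits. The CAPRIGHT() definition is
not shown in this part of the patch, so SKETCH_RIGHT() below is an assumption
inferred from those accessors, and every sketch_* / SKETCH_* name and right
value is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bits 62-63 version (word 0 only), bits 57-61 one-hot
 * array index, bits 0-56 the individual rights. */
#define SKETCH_VERSION		0ULL
#define SKETCH_NWORDS		2	/* version + 2, as CAPARSIZE() computes */
#define SKETCH_RIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))
#define SKETCH_IDXBIT(right)	((int)(((right) >> 57) & 0x1F))

/* Hypothetical rights; these are NOT the real CAP_* values. */
#define SKETCH_READ	SKETCH_RIGHT(0, 0x1ULL)
#define SKETCH_WRITE	SKETCH_RIGHT(0, 0x2ULL)
#define SKETCH_NOTIFY	SKETCH_RIGHT(1, 0x1ULL)

struct sketch_rights {
	uint64_t word[SKETCH_NWORDS];
};

/* Map the one-hot index field to an array index, or -1 if malformed. */
static inline int sketch_right_to_index(uint64_t right)
{
	switch (SKETCH_IDXBIT(right)) {
	case 0x01:
		return 0;
	case 0x02:
		return 1;
	default:
		return -1;
	}
}

/* Empty rights still carry the version and each word's index bit. */
static inline void sketch_init(struct sketch_rights *r)
{
	r->word[0] = (SKETCH_VERSION << 62) | SKETCH_RIGHT(0, 0ULL);
	r->word[1] = SKETCH_RIGHT(1, 0ULL);
}

/* Add one right; the assert stands in for the BUG_ON() checks. */
static inline void sketch_set(struct sketch_rights *r, uint64_t right)
{
	int i = sketch_right_to_index(right);

	assert(i >= 0 && i < SKETCH_NWORDS);
	r->word[i] |= right;
}

static inline bool sketch_has(const struct sketch_rights *r, uint64_t right)
{
	int i = sketch_right_to_index(right);

	return i >= 0 && (r->word[i] & right) == right;
}

/* True if every bit set in 'little' is also set in 'big'. */
static inline bool sketch_contains(const struct sketch_rights *big,
				   const struct sketch_rights *little)
{
	for (int i = 0; i < SKETCH_NWORDS; i++) {
		if ((big->word[i] & little->word[i]) != little->word[i])
			return false;
	}
	return true;
}
```

Because each right carries its own index bit, a value that ORs together rights
from two different words maps to -1 and is rejected, which is presumably why
the patch routes every right through the bit2idx[] table before using it.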

2014-07-25 13:56:20

by David Drysdale

Subject: [PATCH 04/11] capsicum: implement fgetr() and friends

Add variants of fget() and related functions where the caller
indicates the operations that will be performed on the file.

If CONFIG_SECURITY_CAPSICUM is defined, these variants build a
struct capsicum_rights instance holding the rights associated
with the file operations; this will allow a future hook to check
whether a rights-restricted file has those specific rights
available.

If CONFIG_SECURITY_CAPSICUM is not defined, these variants expand
to the underlying fget() function, with one difference: failures
are returned as an ERR_PTR value rather than just NULL.

Signed-off-by: David Drysdale <[email protected]>
---
fs/file.c | 136 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/namei.c | 50 ++++++++++++++++--
fs/read_write.c | 5 --
include/linux/file.h | 139 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/namei.h | 9 ++++
5 files changed, 331 insertions(+), 8 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 66923fe3176e..ae53219d720b 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -13,6 +13,7 @@
#include <linux/mmzone.h>
#include <linux/time.h>
#include <linux/sched.h>
+#include <linux/security.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/file.h>
@@ -717,6 +718,141 @@ unsigned long __fdget_pos(unsigned int fd)
return v;
}

+#ifdef CONFIG_SECURITY_CAPSICUM
+/*
+ * We might want to change the return value of fget() and friends. This
+ * function is called with the intended return value, and fget() will /actually/
+ * return whatever is returned from here. We adjust the reference counter if
+ * necessary.
+ */
+static struct file *unwrap_file(struct file *orig,
+ const struct capsicum_rights *required_rights,
+ const struct capsicum_rights **actual_rights,
+ bool update_refcnt)
+{
+ struct file *f;
+
+ if (orig == NULL)
+ return ERR_PTR(-EBADF);
+ if (IS_ERR(orig))
+ return orig;
+ f = orig; /* TODO: change the value of f here */
+ if (f != orig && update_refcnt) {
+ /* We're not returning the original, and the calling code
+ * has already incremented the refcount on it, so we need to
+ * release that reference and obtain a reference to the new
+ * return value, if any.
+ */
+ if (!IS_ERR(f) && !atomic_long_inc_not_zero(&f->f_count))
+ f = ERR_PTR(-EBADF);
+ atomic_long_dec(&orig->f_count);
+ }
+
+ return f;
+}
+
+struct file *fget_rights(unsigned int fd, const struct capsicum_rights *rights)
+{
+ return unwrap_file(fget(fd), rights, NULL, true);
+}
+EXPORT_SYMBOL(fget_rights);
+
+struct file *fget_raw_rights(unsigned int fd,
+ const struct capsicum_rights *rights)
+{
+ return unwrap_file(fget_raw(fd), rights, NULL, true);
+}
+EXPORT_SYMBOL(fget_raw_rights);
+
+struct fd fdget_rights(unsigned int fd, const struct capsicum_rights *rights)
+{
+ struct fd f = fdget(fd);
+
+ f.file = unwrap_file(f.file, rights, NULL, (f.flags & FDPUT_FPUT));
+ return f;
+}
+EXPORT_SYMBOL(fdget_rights);
+
+struct fd fdget_raw_rights(unsigned int fd,
+ const struct capsicum_rights **actual_rights,
+ const struct capsicum_rights *rights)
+{
+ struct fd f = fdget_raw(fd);
+
+ f.file = unwrap_file(f.file, rights, actual_rights,
+ (f.flags & FDPUT_FPUT));
+ return f;
+}
+EXPORT_SYMBOL(fdget_raw_rights);
+
+struct file *_fgetr(unsigned int fd, ...)
+{
+ struct capsicum_rights rights;
+ struct file *f;
+ va_list ap;
+
+ va_start(ap, fd);
+ f = fget_rights(fd, cap_rights_vinit(&rights, ap));
+ va_end(ap);
+ return f;
+}
+EXPORT_SYMBOL(_fgetr);
+
+struct file *_fgetr_raw(unsigned int fd, ...)
+{
+ struct capsicum_rights rights;
+ struct file *f;
+ va_list ap;
+
+ va_start(ap, fd);
+ f = fget_raw_rights(fd, cap_rights_vinit(&rights, ap));
+ va_end(ap);
+ return f;
+}
+EXPORT_SYMBOL(_fgetr_raw);
+
+struct fd _fdgetr(unsigned int fd, ...)
+{
+ struct fd f;
+ struct capsicum_rights rights;
+ va_list ap;
+
+ va_start(ap, fd);
+ f = fdget_rights(fd, cap_rights_vinit(&rights, ap));
+ va_end(ap);
+ return f;
+}
+EXPORT_SYMBOL(_fdgetr);
+
+struct fd _fdgetr_raw(unsigned int fd, ...)
+{
+ struct fd f;
+ struct capsicum_rights rights;
+ va_list ap;
+
+ va_start(ap, fd);
+ f = fdget_raw_rights(fd, NULL, cap_rights_vinit(&rights, ap));
+ va_end(ap);
+ return f;
+}
+EXPORT_SYMBOL(_fdgetr_raw);
+
+struct fd _fdgetr_pos(unsigned int fd, ...)
+{
+ struct fd f;
+ struct capsicum_rights rights;
+ va_list ap;
+
+ f = __to_fd(__fdget_pos(fd));
+ va_start(ap, fd);
+ f.file = unwrap_file(f.file, cap_rights_vinit(&rights, ap), NULL,
+ (f.flags & FDPUT_FPUT));
+ va_end(ap);
+ return f;
+}
+EXPORT_SYMBOL(_fdgetr_pos);
+#endif
+
/*
* We only lock f_pos if we have threads or if the file might be
* shared with another process. In both cases we'll have an elevated
diff --git a/fs/namei.c b/fs/namei.c
index 165ebb1209d4..548e351fade1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -647,6 +647,19 @@ static __always_inline void set_root(struct nameidata *nd)
get_fs_root(current->fs, &nd->root);
}

+/*
+ * Retrieval of files against a directory file descriptor requires
+ * CAP_LOOKUP. As this is common in this file, set up the required rights once
+ * and for all.
+ */
+static struct capsicum_rights lookup_rights;
+static int __init init_lookup_rights(void)
+{
+ cap_rights_init(&lookup_rights, CAP_LOOKUP);
+ return 0;
+}
+fs_initcall(init_lookup_rights);
+
static int link_path_walk(const char *, struct nameidata *, unsigned int);

static __always_inline void set_root_rcu(struct nameidata *nd)
@@ -2136,8 +2149,12 @@ struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
}
EXPORT_SYMBOL(lookup_one_len);

-int user_path_at_empty(int dfd, const char __user *name, unsigned flags,
- struct path *path, int *empty)
+static int user_path_at_empty_rights(int dfd,
+ const char __user *name,
+ unsigned flags,
+ struct path *path,
+ int *empty,
+ const struct capsicum_rights *rights)
{
struct nameidata nd;
struct filename *tmp = getname_flags(name, flags, empty);
@@ -2154,13 +2171,40 @@ int user_path_at_empty(int dfd, const char __user *name, unsigned flags,
return err;
}

+int user_path_at_empty(int dfd, const char __user *name, unsigned flags,
+ struct path *path, int *empty)
+{
+ return user_path_at_empty_rights(dfd, name, flags, path, empty,
+ &lookup_rights);
+}
+
int user_path_at(int dfd, const char __user *name, unsigned flags,
struct path *path)
{
- return user_path_at_empty(dfd, name, flags, path, NULL);
+ return user_path_at_empty_rights(dfd, name, flags, path, NULL,
+ &lookup_rights);
}
EXPORT_SYMBOL(user_path_at);

+#ifdef CONFIG_SECURITY_CAPSICUM
+int _user_path_atr(int dfd,
+ const char __user *name,
+ unsigned flags,
+ struct path *path,
+ ...)
+{
+ struct capsicum_rights rights;
+ int rc;
+ va_list ap;
+
+ va_start(ap, path);
+ rc = user_path_at_empty_rights(dfd, name, flags, path, NULL,
+ cap_rights_vinit(&rights, ap));
+ va_end(ap);
+ return rc;
+}
+#endif
+
/*
* NB: most callers don't do anything directly with the reference to the
* to struct filename, but the nd->last pointer points into the name string
diff --git a/fs/read_write.c b/fs/read_write.c
index 009d8542a889..c6e0f20a9f94 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -265,11 +265,6 @@ loff_t vfs_llseek(struct file *file, loff_t offset, int whence)
}
EXPORT_SYMBOL(vfs_llseek);

-static inline struct fd fdget_pos(int fd)
-{
- return __to_fd(__fdget_pos(fd));
-}
-
static inline void fdput_pos(struct fd f)
{
if (f.flags & FDPUT_POS_UNLOCK)
diff --git a/include/linux/file.h b/include/linux/file.h
index 4d69123377a2..8bf7e13365b5 100644
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -8,6 +8,8 @@
#include <linux/compiler.h>
#include <linux/types.h>
#include <linux/posix_types.h>
+#include <linux/err.h>
+#include <linux/capsicum.h>

struct file;

@@ -39,6 +41,21 @@ static inline void fdput(struct fd fd)
fput(fd.file);
}

+/*
+ * The base functions for converting a file descriptor to a struct file are:
+ * - fget() always increments refcount, doesn't work on O_PATH files.
+ * - fget_raw() always increments refcount, and does work on O_PATH files.
+ * - fdget() only increments refcount if needed, doesn't work on O_PATH files.
+ * - fdget_raw() only increments refcount if needed, works on O_PATH files.
+ * - fdget_pos() as fdget(), but also locks the file position lock (for
+ * operations that POSIX requires to be atomic w.r.t file position).
+ * These functions return NULL on failure, and return the actual entry in the
+ * fdtable (which may be a wrapper if the file is a Capsicum capability).
+ *
+ * These functions should normally only be used when a file is being
+ * transferred (e.g. dup(2)) or manipulated as-is; normal users should stick
+ * to the fgetr() variants below.
+ */
extern struct file *fget(unsigned int fd);
extern struct file *fget_raw(unsigned int fd);
extern unsigned long __fdget(unsigned int fd);
@@ -60,6 +77,128 @@ static inline struct fd fdget_raw(unsigned int fd)
return __to_fd(__fdget_raw(fd));
}

+static inline struct fd fdget_pos(unsigned int fd)
+{
+ return __to_fd(__fdget_pos(fd));
+}
+
+#ifdef CONFIG_SECURITY_CAPSICUM
+/*
+ * The full unwrapping variant functions are:
+ * - fget_rights()
+ * - fget_raw_rights()
+ * - fdget_rights()
+ * - fdget_raw_rights()
+ * These versions have the same behavior as the equivalent base functions, but:
+ * - They also take a struct capsicum_rights argument describing the details
+ * of the operations to be performed on the file.
+ * - They remove any Capsicum capability wrapper for the file, returning the
+ * normal underlying file.
+ * - They return an ERR_PTR on failure (typically with either -EBADF for an
+ * unrecognized FD, or -ENOTCAPABLE for a Capsicum capability FD that does
+ * not have the requisite rights).
+ *
+ * The fdget_raw_rights() function also optionally returns the actual Capsicum
+ * rights associated with the file descriptor; the caller should only access
+ * this structure while it holds a reference to the file.
+ *
+ * These functions should normally only be used:
+ * - when the operation being performed on the file requires more detailed
+ * specification (in particular: the ioctl(2) or fcntl(2) command invoked)
+ * - (for fdget_raw_rights()) when a new file descriptor will be created from
+ * this file descriptor, and so should potentially inherit its rights (if
+ * it is a Capsicum capability file descriptor).
+ * Otherwise users should stick to the simpler fgetr() variants below.
+ */
+extern struct file *fget_rights(unsigned int fd,
+ const struct capsicum_rights *rights);
+extern struct file *fget_raw_rights(unsigned int fd,
+ const struct capsicum_rights *rights);
+extern struct fd fdget_rights(unsigned int fd,
+ const struct capsicum_rights *rights);
+extern struct fd fdget_raw_rights(unsigned int fd,
+ const struct capsicum_rights **actual_rights,
+ const struct capsicum_rights *rights);
+
+/*
+ * The simple unwrapping variant functions are:
+ * - fgetr()
+ * - fgetr_raw()
+ * - fdgetr()
+ * - fdgetr_raw()
+ * - fdgetr_pos()
+ * These versions have the same behavior as the equivalent base functions, but:
+ * - They also take variable arguments indicating the operations to be
+ * performed on the file.
+ * - They remove any Capsicum capability wrapper for the file, returning the
+ * normal underlying file.
+ * - They return an ERR_PTR on failure (typically with either -EBADF for an
+ * unrecognized FD, or -ENOTCAPABLE for a Capsicum capability FD that does
+ * not have the requisite rights).
+ *
+ * These functions should normally be used for FD->file conversion.
+ */
+#define fgetr(fd, ...) _fgetr((fd), __VA_ARGS__, CAP_LIST_END)
+#define fgetr_raw(fd, ...) _fgetr_raw((fd), __VA_ARGS__, CAP_LIST_END)
+#define fdgetr(fd, ...) _fdgetr((fd), __VA_ARGS__, CAP_LIST_END)
+#define fdgetr_raw(fd, ...) _fdgetr_raw((fd), __VA_ARGS__, CAP_LIST_END)
+#define fdgetr_pos(fd, ...) _fdgetr_pos((fd), __VA_ARGS__, CAP_LIST_END)
+extern struct file *_fgetr(unsigned int fd, ...);
+extern struct file *_fgetr_raw(unsigned int fd, ...);
+extern struct fd _fdgetr(unsigned int fd, ...);
+extern struct fd _fdgetr_raw(unsigned int fd, ...);
+extern struct fd _fdgetr_pos(unsigned int fd, ...);
+
+#else
+/*
+ * In a non-Capsicum build, all rights-checking fget() variants fall back to the
+ * normal versions (but still return errors as ERR_PTR values not just NULL).
+ */
+static inline struct file *fget_rights(unsigned int fd,
+ const struct capsicum_rights *rights)
+{
+ return fget(fd) ?: ERR_PTR(-EBADF);
+}
+static inline struct file *fget_raw_rights(unsigned int fd,
+ const struct capsicum_rights *rights)
+{
+ return fget_raw(fd) ?: ERR_PTR(-EBADF);
+}
+static inline struct fd fdget_rights(unsigned int fd,
+ const struct capsicum_rights *rights)
+{
+ struct fd f = fdget(fd);
+
+ if (f.file == NULL)
+ f.file = ERR_PTR(-EBADF);
+ return f;
+}
+static inline struct fd
+fdget_raw_rights(unsigned int fd,
+ const struct capsicum_rights **actual_rights,
+ const struct capsicum_rights *rights)
+{
+ struct fd f = fdget_raw(fd);
+
+ if (f.file == NULL)
+ f.file = ERR_PTR(-EBADF);
+ return f;
+}
+
+#define fgetr(fd, ...) (fget(fd) ?: ERR_PTR(-EBADF))
+#define fgetr_raw(fd, ...) (fget_raw(fd) ?: ERR_PTR(-EBADF))
+#define fdgetr(fd, ...) fdget_rights((fd), NULL)
+#define fdgetr_raw(fd, ...) fdget_raw_rights((fd), NULL, NULL)
+static inline struct fd fdgetr_pos(int fd, ...)
+{
+ struct fd f = fdget_pos(fd);
+
+ if (f.file == NULL)
+ f.file = ERR_PTR(-EBADF);
+ return f;
+}
+#endif
+
extern int f_dupfd(unsigned int from, struct file *file, unsigned flags);
extern int replace_fd(unsigned fd, struct file *file, unsigned flags);
extern void set_close_on_exec(unsigned int fd, int flag);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index bd0615d1143b..3466f35d7e5d 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -59,6 +59,15 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT, LAST_BIND};

extern int user_path_at(int, const char __user *, unsigned, struct path *);
extern int user_path_at_empty(int, const char __user *, unsigned, struct path *, int *empty);
+#ifdef CONFIG_SECURITY_CAPSICUM
+extern int _user_path_atr(int, const char __user *, unsigned,
+ struct path *, ...);
+#define user_path_atr(f, n, x, p, ...) \
+ _user_path_atr((f), (n), (x), (p), __VA_ARGS__, 0ULL)
+#else
+#define user_path_atr(f, n, x, p, ...) \
+ user_path_at((f), (n), (x), (p))
+#endif

#define user_path(name, path) user_path_at(AT_FDCWD, name, LOOKUP_FOLLOW, path)
#define user_lpath(name, path) user_path_at(AT_FDCWD, name, 0, path)
--
2.0.0.526.g5318336
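
The change in failure convention described in the patch above (NULL from
fget(), an ERR_PTR value from the fgetr() variants) can be sketched in
userspace as follows. The sketch_* helpers mimic the kernel's ERR_PTR(),
PTR_ERR() and IS_ERR() under the assumption that the top 4095 pointer values
are reserved for error codes; the file table and all sketch_* names are
hypothetical stand-ins, not part of the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Recover the errno value from an encoded pointer. */
static inline long sketch_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer lies in the range reserved for error codes. */
static inline bool sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Hypothetical stand-ins for struct file and the fdtable. */
struct sketch_file {
	int id;
};

#define SKETCH_NFDS 4
static struct sketch_file *sketch_fdtable[SKETCH_NFDS];

/* Base lookup: NULL on failure, like fget(). */
static inline struct sketch_file *sketch_fget(unsigned int fd)
{
	return fd < SKETCH_NFDS ? sketch_fdtable[fd] : NULL;
}

/* Rights-aware lookup: never NULL; failure is an encoded errno. */
static inline struct sketch_file *sketch_fgetr(unsigned int fd)
{
	struct sketch_file *f = sketch_fget(fd);

	assert(!sketch_is_err(f));	/* the base lookup only fails with NULL */
	return f ? f : sketch_err_ptr(-EBADF);
}
```

The practical consequence for callers is that a converted call site has to
replace its "if (!file)" test with an IS_ERR() test; a leftover NULL check
would let an encoded error through as if it were a valid file pointer.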

2014-07-25 14:02:52

by Paolo Bonzini

Subject: Re: [PATCH 10/11] capsicum: prctl(2) to force use of O_BENEATH

Il 25/07/2014 15:47, David Drysdale ha scritto:
> @@ -1996,6 +2013,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> if (arg2 || arg3 || arg4 || arg5)
> return -EINVAL;
> return current->no_new_privs ? 1 : 0;
> + case PR_SET_OPENAT_BENEATH:
> + if (arg2 != 1 || arg4 || arg5)
> + return -EINVAL;
> + if ((arg3 & ~(PR_SET_OPENAT_BENEATH_TSYNC)) != 0)
> + return -EINVAL;
> + error = prctl_set_openat_beneath(me, arg3);
> + break;
> + case PR_GET_OPENAT_BENEATH:
> + if (arg2 || arg3 || arg4 || arg5)
> + return -EINVAL;
> + return me->openat_beneath;
> case PR_GET_THP_DISABLE:
> if (arg2 || arg3 || arg4 || arg5)
> return -EINVAL;
>

Why are you always forbidding a change of prctl from 1 to 0? It should
be safe if current->no_new_privs is clear.

Do new threads inherit from the parent?

Also, I wonder if you need something like this check:

/*
* Installing a seccomp filter requires that the task has
* CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
* This avoids scenarios where unprivileged tasks can affect the
* behavior of privileged children.
*/
if (!current->no_new_privs &&
security_capable_noaudit(current_cred(), current_user_ns(),
CAP_SYS_ADMIN) != 0)
return -EACCES;

Paolo

2014-07-25 15:59:55

by Andy Lutomirski

Subject: Re: [PATCH 11/11] seccomp: Add tgid and tid into seccomp_data

On Jul 25, 2014 6:48 AM, "David Drysdale" <[email protected]> wrote:
>
> Add the current thread and thread group IDs into the data
> available for seccomp-bpf programs to work on. This allows
> installation of filters that police syscalls based on thread
> or process ID, e.g. tgkill(2)/kill(2)/prctl(2).
>
> Signed-off-by: David Drysdale <[email protected]>
> ---
> include/uapi/linux/seccomp.h | 10 ++++++++++
> kernel/seccomp.c | 2 ++
> 2 files changed, 12 insertions(+)
>
> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
> index ac2dc9f72973..b88370d6f6ca 100644
> --- a/include/uapi/linux/seccomp.h
> +++ b/include/uapi/linux/seccomp.h
> @@ -36,12 +36,22 @@
> * @instruction_pointer: at the time of the system call.
> * @args: up to 6 system call arguments always stored as 64-bit values
> * regardless of the architecture.
> + * @tgid: thread group ID of the thread executing the BPF program.
> + * @tid: thread ID of the thread executing the BPF program.
> + * The SECCOMP_DATA_TID_PRESENT macro indicates the presence of the
> + * tgid and tid fields; user programs may use this macro to conditionally
> + * compile code against older versions of the kernel. Note also that
> + * BPF programs should cope with the absence of these fields by testing
> + * the length of data available.
> */
> struct seccomp_data {
> int nr;
> __u32 arch;
> __u64 instruction_pointer;
> __u64 args[6];
> + __u32 tgid;
> + __u32 tid;
> };
> +#define SECCOMP_DATA_TID_PRESENT 1
>
> #endif /* _UAPI_LINUX_SECCOMP_H */
> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
> index 301bbc24739c..dd5146f15d6d 100644
> --- a/kernel/seccomp.c
> +++ b/kernel/seccomp.c
> @@ -80,6 +80,8 @@ static void populate_seccomp_data(struct seccomp_data *sd)
> sd->args[4] = args[4];
> sd->args[5] = args[5];
> sd->instruction_pointer = KSTK_EIP(task);
> + sd->tgid = task_tgid_vnr(current);
> + sd->tid = task_pid_vnr(current);
> }

This is, IMO, problematic. These should probably be relative to the
filter creator, not the filtered task. This will also hurt
performance.

What's the use case? Can it be better achieved with a new eBPF function?

--Andy

>
> /**
> --
> 2.0.0.526.g5318336
>

2014-07-25 16:00:58

by Andy Lutomirski

Subject: Re: [PATCH 10/11] capsicum: prctl(2) to force use of O_BENEATH

On Jul 25, 2014 7:02 AM, "Paolo Bonzini" <[email protected]> wrote:
>
> Il 25/07/2014 15:47, David Drysdale ha scritto:
> > @@ -1996,6 +2013,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > if (arg2 || arg3 || arg4 || arg5)
> > return -EINVAL;
> > return current->no_new_privs ? 1 : 0;
> > + case PR_SET_OPENAT_BENEATH:
> > + if (arg2 != 1 || arg4 || arg5)
> > + return -EINVAL;
> > + if ((arg3 & ~(PR_SET_OPENAT_BENEATH_TSYNC)) != 0)
> > + return -EINVAL;
> > + error = prctl_set_openat_beneath(me, arg3);
> > + break;
> > + case PR_GET_OPENAT_BENEATH:
> > + if (arg2 || arg3 || arg4 || arg5)
> > + return -EINVAL;
> > + return me->openat_beneath;
> > case PR_GET_THP_DISABLE:
> > if (arg2 || arg3 || arg4 || arg5)
> > return -EINVAL;
> >
>
> Why are you always forbidding a change of prctl from 1 to 0? It should
> be safe if current->no_new_privs is clear.

I don't immediately see why you're forbidding unsetting it at all.
If you need it to be sticky, then use seccomp or Capsicum to make it
sticky.

Also, the implementation is dangerously racy -- if anyone pokes at
adjacent bitfields without the lock, they can get corrupted. Try
basing on Kees' seccomp tree or security-next and using the new atomic
flags field.
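
[Editorial sketch, not part of the original message.] The difference between
a one-bit bitfield and an atomic flags word can be illustrated in userspace
with C11 atomics. Two adjacent one-bit fields share a storage unit, so a plain
store to one of them is a read-modify-write of both; an atomic OR on a shared
word does not have that problem. The struct, enum and sketch_* names below are
hypothetical, and the kernel would use its own set_bit()/test_bit() helpers
rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit numbers within the shared flags word. */
enum {
	SKETCH_NO_NEW_PRIVS = 0,
	SKETCH_OPENAT_BENEATH = 1,
};

/* Hypothetical stand-in for the per-task state. */
struct sketch_task {
	atomic_ulong flags;
};

/* Set one flag without disturbing concurrent updates to the others. */
static inline void sketch_set_flag(struct sketch_task *t, int bit)
{
	assert(bit >= 0 && bit < 32);
	atomic_fetch_or(&t->flags, 1UL << bit);
}

static inline bool sketch_test_flag(struct sketch_task *t, int bit)
{
	return (atomic_load(&t->flags) >> bit) & 1UL;
}
```

With this layout no lock is needed just to keep neighbouring flags intact,
because each update is a single atomic operation on the whole word.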


--Andy

>
> Do new threads inherit from the parent?
>
> Also, I wonder if you need something like this check:
>
> /*
> * Installing a seccomp filter requires that the task has
> * CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
> * This avoids scenarios where unprivileged tasks can affect the
> * behavior of privileged children.
> */
> if (!current->no_new_privs &&
> security_capable_noaudit(current_cred(), current_user_ns(),
> CAP_SYS_ADMIN) != 0)
> return -EACCES;
>
> Paolo

2014-07-25 17:10:40

by Kees Cook

Subject: Re: [PATCH 11/11] seccomp: Add tgid and tid into seccomp_data

On Fri, Jul 25, 2014 at 8:59 AM, Andy Lutomirski <[email protected]> wrote:
> On Jul 25, 2014 6:48 AM, "David Drysdale" <[email protected]> wrote:
>>
>> Add the current thread and thread group IDs into the data
>> available for seccomp-bpf programs to work on. This allows
>> installation of filters that police syscalls based on thread
>> or process ID, e.g. tgkill(2)/kill(2)/prctl(2).
>>
>> Signed-off-by: David Drysdale <[email protected]>
>> ---
>> include/uapi/linux/seccomp.h | 10 ++++++++++
>> kernel/seccomp.c | 2 ++
>> 2 files changed, 12 insertions(+)
>>
>> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
>> index ac2dc9f72973..b88370d6f6ca 100644
>> --- a/include/uapi/linux/seccomp.h
>> +++ b/include/uapi/linux/seccomp.h
>> @@ -36,12 +36,22 @@
>> * @instruction_pointer: at the time of the system call.
>> * @args: up to 6 system call arguments always stored as 64-bit values
>> * regardless of the architecture.
>> + * @tgid: thread group ID of the thread executing the BPF program.
>> + * @tid: thread ID of the thread executing the BPF program.
>> + * The SECCOMP_DATA_TID_PRESENT macro indicates the presence of the
>> + * tgid and tid fields; user programs may use this macro to conditionally
>> + * compile code against older versions of the kernel. Note also that
>> + * BPF programs should cope with the absence of these fields by testing
>> + * the length of data available.
>> */
>> struct seccomp_data {
>> int nr;
>> __u32 arch;
>> __u64 instruction_pointer;
>> __u64 args[6];
>> + __u32 tgid;
>> + __u32 tid;
>> };
>> +#define SECCOMP_DATA_TID_PRESENT 1
>>
>> #endif /* _UAPI_LINUX_SECCOMP_H */
>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>> index 301bbc24739c..dd5146f15d6d 100644
>> --- a/kernel/seccomp.c
>> +++ b/kernel/seccomp.c
>> @@ -80,6 +80,8 @@ static void populate_seccomp_data(struct seccomp_data *sd)
>> sd->args[4] = args[4];
>> sd->args[5] = args[5];
>> sd->instruction_pointer = KSTK_EIP(task);
>> + sd->tgid = task_tgid_vnr(current);
>> + sd->tid = task_pid_vnr(current);
>> }
>
> This is, IMO, problematic. These should probably be relative to the
> filter creator, not the filtered task. This will also hurt
> performance.

Yeah, we can't change the seccomp_data structure without a lot of
care, and tgid/tid really should be encoded in the filter. However, it
is tricky in the forking case.
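
[Editorial sketch, not part of the original message.] Encoding the ID in the
filter could look roughly like the classic-BPF program built below, which
allows kill(2) only when its first argument equals a thread group ID baked in
when the filter is built. It uses the BPF_STMT()/BPF_JUMP() macros from
<linux/filter.h> and the SECCOMP_RET_* values from <linux/seccomp.h>; the
sketch_* names are hypothetical. It is deliberately incomplete: it assumes a
little-endian machine (it reads only the low 32 bits of args[0]) and omits the
architecture check that a real filter needs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

#define SKETCH_FILTER_LEN 6

/*
 * Fill 'out' (at least SKETCH_FILTER_LEN entries) with a filter that lets
 * kill(2) through only when its target equals self_tgid.  Returns the
 * number of instructions written.
 *
 * ASSUMPTIONS: little-endian layout (low word of args[0] at offset 0 of
 * the field); no seccomp_data.arch check, which a real filter must add.
 */
static inline size_t sketch_build_self_kill_filter(struct sock_filter *out,
						    unsigned int self_tgid)
{
	size_t n = 0;

	assert(out != NULL);

	/* [0] A = syscall number */
	out[n++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		(unsigned int)offsetof(struct seccomp_data, nr));
	/* [1] not kill(2)? jump forward 3 instructions to the ALLOW at [5] */
	out[n++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
		(unsigned int)__NR_kill, 0, 3);
	/* [2] A = low 32 bits of args[0], the target pid */
	out[n++] = (struct sock_filter)BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
		(unsigned int)offsetof(struct seccomp_data, args));
	/* [3] target == self_tgid? skip the ERRNO return */
	out[n++] = (struct sock_filter)BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K,
		self_tgid, 1, 0);
	/* [4] any other target fails with EPERM */
	out[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
		SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA));
	/* [5] everything else is allowed */
	out[n++] = (struct sock_filter)BPF_STMT(BPF_RET | BPF_K,
		SECCOMP_RET_ALLOW);

	return n;
}
```

The forking problem shows up directly here: a child inherits the filter with
the parent's ID still embedded as the constant in instruction [3], so "self"
in the child would still mean the parent.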

>
> What's the use case? Can it be better achieved with a new eBPF function?

Julien had been wanting something like this too (though he'd suggested
it via prctl): limit the signal functions to "self" only. I wonder if
adding a prctl like done for O_BENEATH could work for signal sending?

-Kees

--
Kees Cook
Chrome OS Security

2014-07-25 17:18:29

by Andy Lutomirski

Subject: Re: [PATCH 11/11] seccomp: Add tgid and tid into seccomp_data

[cc: Eric Biederman]

On Fri, Jul 25, 2014 at 10:10 AM, Kees Cook <[email protected]> wrote:
> On Fri, Jul 25, 2014 at 8:59 AM, Andy Lutomirski <[email protected]> wrote:
>> On Jul 25, 2014 6:48 AM, "David Drysdale" <[email protected]> wrote:
>>>
>>> Add the current thread and thread group IDs into the data
>>> available for seccomp-bpf programs to work on. This allows
>>> installation of filters that police syscalls based on thread
>>> or process ID, e.g. tgkill(2)/kill(2)/prctl(2).
>>>
>>> Signed-off-by: David Drysdale <[email protected]>
>>> ---
>>> include/uapi/linux/seccomp.h | 10 ++++++++++
>>> kernel/seccomp.c | 2 ++
>>> 2 files changed, 12 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
>>> index ac2dc9f72973..b88370d6f6ca 100644
>>> --- a/include/uapi/linux/seccomp.h
>>> +++ b/include/uapi/linux/seccomp.h
>>> @@ -36,12 +36,22 @@
>>> * @instruction_pointer: at the time of the system call.
>>> * @args: up to 6 system call arguments always stored as 64-bit values
>>> * regardless of the architecture.
>>> + * @tgid: thread group ID of the thread executing the BPF program.
>>> + * @tid: thread ID of the thread executing the BPF program.
>>> + * The SECCOMP_DATA_TID_PRESENT macro indicates the presence of the
>>> + * tgid and tid fields; user programs may use this macro to conditionally
>>> + * compile code against older versions of the kernel. Note also that
>>> + * BPF programs should cope with the absence of these fields by testing
>>> + * the length of data available.
>>> */
>>> struct seccomp_data {
>>> int nr;
>>> __u32 arch;
>>> __u64 instruction_pointer;
>>> __u64 args[6];
>>> + __u32 tgid;
>>> + __u32 tid;
>>> };
>>> +#define SECCOMP_DATA_TID_PRESENT 1
>>>
>>> #endif /* _UAPI_LINUX_SECCOMP_H */
>>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>>> index 301bbc24739c..dd5146f15d6d 100644
>>> --- a/kernel/seccomp.c
>>> +++ b/kernel/seccomp.c
>>> @@ -80,6 +80,8 @@ static void populate_seccomp_data(struct seccomp_data *sd)
>>> sd->args[4] = args[4];
>>> sd->args[5] = args[5];
>>> sd->instruction_pointer = KSTK_EIP(task);
>>> + sd->tgid = task_tgid_vnr(current);
>>> + sd->tid = task_pid_vnr(current);
>>> }
>>
>> This is, IMO, problematic. These should probably be relative to the
>> filter creator, not the filtered task. This will also hurt
>> performance.
>
> Yeah, we can't change the seccomp_data structure without a lot of
> care, and tgid/tid really should be encoded in the filter. However, it
> is tricky in the forking case.
>
>>
>> What's the use case? Can it be better achieved with a new eBPF function?
>
> Julien had been wanting something like this too (though he'd suggested
> it via prctl): limit the signal functions to "self" only. I wonder if
> adding a prctl like done for O_BENEATH could work for signal sending?
>


Can we do one better and add a flag to prevent any non-self pid
lookups? This might actually be easy on top of the pid namespace work
(e.g. we could change the way that find_task_by_vpid works).

It's far from just being signals. There's access_process_vm, ptrace,
all the signal functions, clock_gettime (see CPUCLOCK_PID -- yes, this
is ridiculous), and probably some others that I've forgotten about or
never noticed in the first place.

--Andy

> -Kees
>
> --
> Kees Cook
> Chrome OS Security



--
Andy Lutomirski
AMA Capital Management, LLC

2014-07-25 17:38:21

by Kees Cook

Subject: Re: [PATCH 11/11] seccomp: Add tgid and tid into seccomp_data

On Fri, Jul 25, 2014 at 10:18 AM, Andy Lutomirski <[email protected]> wrote:
> [cc: Eric Biederman]
>
> On Fri, Jul 25, 2014 at 10:10 AM, Kees Cook <[email protected]> wrote:
>> On Fri, Jul 25, 2014 at 8:59 AM, Andy Lutomirski <[email protected]> wrote:
>>> On Jul 25, 2014 6:48 AM, "David Drysdale" <[email protected]> wrote:
>>>>
>>>> Add the current thread and thread group IDs into the data
>>>> available for seccomp-bpf programs to work on. This allows
>>>> installation of filters that police syscalls based on thread
>>>> or process ID, e.g. tgkill(2)/kill(2)/prctl(2).
>>>>
>>>> Signed-off-by: David Drysdale <[email protected]>
>>>> ---
>>>> include/uapi/linux/seccomp.h | 10 ++++++++++
>>>> kernel/seccomp.c | 2 ++
>>>> 2 files changed, 12 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
>>>> index ac2dc9f72973..b88370d6f6ca 100644
>>>> --- a/include/uapi/linux/seccomp.h
>>>> +++ b/include/uapi/linux/seccomp.h
>>>> @@ -36,12 +36,22 @@
>>>> * @instruction_pointer: at the time of the system call.
>>>> * @args: up to 6 system call arguments always stored as 64-bit values
>>>> * regardless of the architecture.
>>>> + * @tgid: thread group ID of the thread executing the BPF program.
>>>> + * @tid: thread ID of the thread executing the BPF program.
>>>> + * The SECCOMP_DATA_TID_PRESENT macro indicates the presence of the
>>>> + * tgid and tid fields; user programs may use this macro to conditionally
>>>> + * compile code against older versions of the kernel. Note also that
>>>> + * BPF programs should cope with the absence of these fields by testing
>>>> + * the length of data available.
>>>> */
>>>> struct seccomp_data {
>>>> int nr;
>>>> __u32 arch;
>>>> __u64 instruction_pointer;
>>>> __u64 args[6];
>>>> + __u32 tgid;
>>>> + __u32 tid;
>>>> };
>>>> +#define SECCOMP_DATA_TID_PRESENT 1
>>>>
>>>> #endif /* _UAPI_LINUX_SECCOMP_H */
>>>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>>>> index 301bbc24739c..dd5146f15d6d 100644
>>>> --- a/kernel/seccomp.c
>>>> +++ b/kernel/seccomp.c
>>>> @@ -80,6 +80,8 @@ static void populate_seccomp_data(struct seccomp_data *sd)
>>>> sd->args[4] = args[4];
>>>> sd->args[5] = args[5];
>>>> sd->instruction_pointer = KSTK_EIP(task);
>>>> + sd->tgid = task_tgid_vnr(current);
>>>> + sd->tid = task_pid_vnr(current);
>>>> }
>>>
>>> This is, IMO, problematic. These should probably be relative to the
>>> filter creator, not the filtered task. This will also hurt
>>> performance.
>>
>> Yeah, we can't change the seccomp_data structure without a lot of
>> care, and tgid/tid really should be encoded in the filter. However, it
>> is tricky in the forking case.
>>
>>>
>>> What's the use case? Can it be better achieved with a new eBPF function?
>>
>> Julien had been wanting something like this too (though he'd suggested
>> it via prctl): limit the signal functions to "self" only. I wonder if
>> adding a prctl like done for O_BENEATH could work for signal sending?
>>
>
>
> Can we do one better and add a flag to prevent any non-self pid
> lookups? This might actually be easy on top of the pid namespace work
> (e.g. we could change the way that find_task_by_vpid works).

Ooh, that would be extremely interesting, yes. Kind of an extreme form
of pid namespace without actually being a namespace.

> It's far from just being signals. There's access_process_vm, ptrace,
> all the signal functions, clock_gettime (see CPUCLOCK_PID -- yes, this
> is ridiculous), and probably some others that I've forgotten about or
> never noticed in the first place.

Yeah, that would be very interesting.

-Kees

--
Kees Cook
Chrome OS Security

2014-07-25 18:24:55

by Julien Tinnes

Subject: Re: [PATCH 11/11] seccomp: Add tgid and tid into seccomp_data

On Fri, Jul 25, 2014 at 10:38 AM, Kees Cook <[email protected]> wrote:
> On Fri, Jul 25, 2014 at 10:18 AM, Andy Lutomirski <[email protected]> wrote:
>> [cc: Eric Biederman]
>>
>> On Fri, Jul 25, 2014 at 10:10 AM, Kees Cook <[email protected]> wrote:

>>> Julien had been wanting something like this too (though he'd suggested
>>> it via prctl): limit the signal functions to "self" only. I wonder if
>>> adding a prctl like done for O_BENEATH could work for signal sending?
>>>
>>
>>
>> Can we do one better and add a flag to prevent any non-self pid
>> lookups? This might actually be easy on top of the pid namespace work
>> (e.g. we could change the way that find_task_by_vpid works).
>
> Ooh, that would be extremely interesting, yes. Kind of an extreme form
> of pid namespace without actually being a namespace.
>
>> It's far from just being signals. There's access_process_vm, ptrace,
>> all the signal functions, clock_gettime (see CPUCLOCK_PID -- yes, this
>> is ridiculous), and probably some others that I've forgotten about or
>> never noticed in the first place.
>
> Yeah, that would be very interesting.

Yes, this would be incredibly useful.

1. For Chromium [1], I dislike relying on seccomp purely for
"access-control" (to other processes or files), because it's really
hard to think of everything (things like CPUCLOCK_PID bite, see
https://crbug.com/374479).
So we have a first layer of sandboxing (using PID + NET namespaces and
chroot) for "access-control" and a second layer for kernel attack
surface reduction and a few other things using seccomp-bpf.

The first layer isn't currently very good; it's heavyweight and
complex (you need an init(1) per namespace and that init cannot be
multi-purposed as a useful process because pid = 1 can never receive
signals). One PID namespace per process isn't something that scales
well. (Also before USER_NS it required a setuid root program).

2. Even with a safe pure seccomp-bpf sandbox that prevents sending
signals to other processes, ptrace() et al., and that restricts
clock_gettime(2) properly, things quickly become very tedious because
as far as the kernel is concerned, the process under this BPF program
can still pass ptrace_may_access() to other processes. This means for
instance that no matter what you do, a model where open() is allowed
can't work if /proc is available. We need a mode that says
"ptrace_may_access()" will never pass.

So yes, I really would like:
- a prctl that says: "I'm dropping privileges and I now can't interact
with other thread groups (via signals, ptrace, etc.)".
- Something to drop access to the file system. It could be an
unprivileged way to chroot() to an empty directory (unprivileged
namespaces work for that, except if you're already in a chroot).
This is a little tricky without allowing chroot escapes, so I suspect
we would want to express it in terms of a mount namespace, or something
else, rather than chroot.

Then we have the primitives we need to build sandboxes in a simple
way and we can add seccomp-bpf on top to do things such as open()
hooking (via SECCOMP_RET_TRAP) and to restrict the kernel attack
surface.

Julien

[1] https://code.google.com/p/chromium/wiki/LinuxSandboxing

2014-07-25 18:33:19

by Andy Lutomirski

Subject: Re: [PATCH 11/11] seccomp: Add tgid and tid into seccomp_data

On Fri, Jul 25, 2014 at 11:22 AM, Julien Tinnes <[email protected]> wrote:
> On Fri, Jul 25, 2014 at 10:38 AM, Kees Cook <[email protected]> wrote:
>>
>> On Fri, Jul 25, 2014 at 10:18 AM, Andy Lutomirski <[email protected]>
>> wrote:
>> > [cc: Eric Biederman]
>> >
>> > On Fri, Jul 25, 2014 at 10:10 AM, Kees Cook <[email protected]>
>> > wrote:
>>
>> >> Julien had been wanting something like this too (though he'd suggested
>> >> it via prctl): limit the signal functions to "self" only. I wonder if
>> >> adding a prctl like done for O_BENEATH could work for signal sending?
>> >>
>> >
>> >
>> > Can we do one better and add a flag to prevent any non-self pid
>> > lookups? This might actually be easy on top of the pid namespace work
>> > (e.g. we could change the way that find_task_by_vpid works).
>>
>> Ooh, that would be extremely interesting, yes. Kind of an extreme form
>> of pid namespace without actually being a namespace.
>>
>> > It's far from just being signals. There's access_process_vm, ptrace,
>> > all the signal functions, clock_gettime (see CPUCLOCK_PID -- yes, this
>> > is ridiculous), and probably some others that I've forgotten about or
>> > never noticed in the first place.
>>
>> Yeah, that would be very interesting.
>
>
> Yes, this would be incredibly useful.
>
> 1. For Chromium [1], I dislike relying on seccomp purely for
> "access-control" (to other processes or files). Because it's really hard to
> think about everything (things like CPUCLOCK_PID bite, see
> https://crbug.com/374479).

Not public :(

> So we have a first layer of sandboxing (using PID + NET namespaces and
> chroot) for "access-control" and a second layer for kernel attack surface
> reduction and a few other things using seccomp-bpf.
>
> The first layer isn't currently very good; it's heavyweight and complex (you
> need an init(1) per namespace and that init cannot be multi-purposed as a
> useful process because pid = 1 can never receive signals). One PID namespace
> per process isn't something that scales well. (Also before USER_NS it
> required a setuid root program).
>
> 2. Even with a safe pure seccomp-bpf sandbox that prevents sending signals
> to other process / ptrace() et al and that restrict clock_gettime(2)
> properly, things become quickly very tedious because as far as the kernel is
> concerned, the process under this BPF program can still pass
> ptrace_may_access() to other processes. This means for instance that no
> matter what you do, a model where open() is allowed can't work if /proc is
> available. We need a mode that says "ptrace_may_access()" will never pass.
>
> So yes, I really would like:
> - a prctl that says: "I'm dropping privileges and I now can't interact with
> other thread groups (via signals, ptrace, etc..)".
> - Something to drop access to the file system. It could be an unprivileged
> way to chroot() to an empty directory (unprivileged namespaces work for
> that, - except if you're already in a chroot -). This is a little tricky
> without allowing chroot escapes, so I suspect we would want to express it in
> terms of mount namespace, or something else, rather than chroot.

Capsicum will give you this.

See the other thread for a more concrete proposal. prctl is getting
out of hand.

--Andy

2014-07-26 21:08:19

by Eric W. Biederman

Subject: Re: [RFC PATCHv2 00/11] Adding FreeBSD's Capsicum security framework

David Drysdale <[email protected]> writes:

> The last couple of versions of FreeBSD (9.x/10.x) have included the
> Capsicum security framework [1], which allows security-aware
> applications to sandbox themselves in a very fine-grained way. For
> example, OpenSSH now (>= 6.5) uses Capsicum in its FreeBSD version to
> restrict sshd's credentials checking process, to reduce the chances of
> credential leakage.
>
> It would be good to have equivalent functionality in Linux, so I've been
> working on getting the Capsicum framework running in the kernel, and I'd
> appreciate some feedback/opinions on the general approach.
>
> I'm attaching a corresponding draft patchset for reference, but
> hopefully this cover email can cover the significant features to save
> everyone having to look through the code details. (It does mean this is
> a long email though -- apologies for that.)
>
>
> 1) Capsicum Capabilities
> ------------------------
>
> The most significant aspect of Capsicum is associating *rights* with
> (some) file descriptors, so that the kernel only allows operations on an
> FD if the rights permit it. This allows userspace applications to
> sandbox themselves by tightly constraining what's allowed with both
> input and outputs; for example, tcpdump might restrict itself so it can
> only read from the network FD, and only write to stdout.
>
> The kernel thus needs to police the rights checks for these file
> descriptors (referred to as 'Capsicum capabilities', completely
> different than POSIX.1e capabilities), and the best place to do this is
> at the points where a file descriptor from userspace is converted to a
> struct file * within the kernel.
>
> [Policing the rights checks anywhere else, for example at the system
> call boundary, isn't a good idea because it opens up the possibility
> of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
> changed (as openat/close/dup2 are allowed in capability mode) between
> the 'check' at syscall entry and the 'use' at fget() invocation.]
>
> However, this does lead to quite an invasive change to the kernel --
> every invocation of fget() or similar functions (fdget(),
> sockfd_lookup(), user_path_at(),...) needs to be annotated with the
> rights associated with the specific operations that will be performed on
> the struct file. There are ~100 such invocations that need
> annotation.

And it is silly. Roughly you just need a locking version of
fcntl(F_SETFL).

That is, make the restriction in the struct file, not in the fd-to-file
lookup.

Files in unix have been capabilities for more than 20 years. That is
what file descriptor passing over unix domain sockets is all about. We
don't need an additional ``capability rights'' layer on top of file
descriptors.

In fact internal to linux with FMODE_READ and friends we already have
restrictions on which methods are allowed on linux file descriptors.
So this whole entire abstraction layer you are adding seems just plain
broken.
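
To make the "locking version of fcntl(F_SETFL)" idea concrete, here is
a minimal userspace sketch, assuming the restriction is a bitmask kept
with the open file and that it may only ever shrink. Everything named
sk_* is hypothetical and invented for this illustration; it is not
kernel code, and -EPERM is an arbitrary choice of error for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rights bits, named for this sketch only. */
#define SK_RIGHT_READ  (1u << 0)
#define SK_RIGHT_WRITE (1u << 1)
#define SK_RIGHT_SEEK  (1u << 2)

/* Stand-in for per-open-file state; NOT the kernel's struct file. */
struct sk_file {
	uint32_t rights;	/* operations still permitted */
};

/*
 * Narrow the rights held on f. Asking for any bit that is not already
 * held fails and leaves f untouched, so rights can only be dropped,
 * never regained.
 */
int sk_rights_limit(struct sk_file *f, uint32_t new_rights)
{
	assert(f != NULL);
	if (new_rights & ~f->rights)
		return -EPERM;
	f->rights = new_rights;
	return 0;
}

/* Return 0 if every bit in needed is still held, -EPERM otherwise. */
int sk_rights_check(const struct sk_file *f, uint32_t needed)
{
	assert(f != NULL);
	return (f->rights & needed) == needed ? 0 : -EPERM;
}
```

A file opened read/write can be narrowed to read-only with
sk_rights_limit(&f, SK_RIGHT_READ); a later attempt to ask for
SK_RIGHT_WRITE again is refused.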

Going further, one huge thing your work and the capsicum work in general
is missing is an implementation of revoke. With a little care a good
implementation of bits reporting and controlling which methods are
available on which file descriptors should be a good start on
a revoke implementation for linux as well.

> 2) Capsicum Capabilities Data Structure
> ---------------------------------------
>
> Internally, the rights associated with a Capsicum capability FD are
> stored in a special struct file wrapper. For a normal file, the rights
> check inside fget() falls through, but for a capability wrapper the
> rights in the wrapper are checked and (if capable) the underlying
> wrapped struct file is returned.
>
> [This is approximately the implementation that was present in FreeBSD
> 9.x. For FreeBSD 10.x, the wrapper file was removed and the rights
> associated with a file descriptor are now stored in the fdtable. As
> that impacts memory use for all processes, whether Capsicum users or
> not, I've stuck with the FreeBSD 9.x approach.]

I have already mentioned that this is an insane choice of semantics
right? Adding an extra layer on top of a data structure that is a
perfectly good restrictor of rights.

If you can't add the restriction on struct file itself I would argue
that the semantics of capsicum are fundamentally broken.

I cannot imagine why in the world you would want an extra layer of
indirection, complication, and maintenance.

> 3) Allowing Capability Mode
> ---------------------------
>
> Capsicum also includes 'capability mode', which locks down the available
> syscalls so the rights restrictions can't just be bypassed by opening
> new file descriptors. More precisely, capability mode prevents access
> to syscalls that access global namespaces, such as the filesystem or the
> IP:port space.
>
> The existing seccomp-bpf functionality of the kernel is a good mechanism
> for implementing capability mode, but there are a few additional details
> that also need to be addressed.
>
> a) The capability mode filter program needs to apply process-wide, not
> just to the current thread.
>
> b) In capability mode, new files can still be opened with openat(2) but
> only if they are beneath an existing directory file descriptor.

Which raises the question: is that worth it?

> c) In capability mode it should still be possible for a process to send
> signals to itself with kill(2)/tgkill(2).

Again, is it worth it?

I would think you would want capability mode to default to the minimum
set of system calls you could get away with (to keep kernel code
auditing to a minimum) and only add things if the performance gain of
using the syscall exceeds the pain.

If you look at a kernel like sel4, it succeeds with an object
capability model with many fewer system calls than you are proposing
to export. Roughly just read, write and close.

Consider the fact that if you really want a kernel layer you can
completely trust and rely on, someone needs to write a formal proof of
that layer. Short of that, someone certainly needs to audit the kernel
code very closely, so simplicity of semantics and simplicity of
implementation are very important.

> This v2 patchset copes with these as follows:
>
> a) Kees Cook's incoming seccomp(2) patchset covers thread
> synchronization of filters.
>
> b) A new prctl(PR_SET_OPENAT_BENEATH) operation implicitly sets the
> O_BENEATH flag (see below) for all file-open operations for all
> threads of the current process, by adding a new openat_beneath
> flag in task_struct.
>
> c) An extension to the seccomp_data structure that includes the current
> task's tid and tgid values allows for BPF programs that check a
> kill(2)/tgkill(2) argument against the current thread, in a manner
> that is robust against fork(2)/clone(2).
>
> The combination of these features with the existing seccomp-bpf
> functionality gives the tools needed to implement capability mode.
>
>
> 4) New System Calls
> -------------------
>
> To allow userspace applications to access the Capsicum capability
> functionality, I'm proposing two new system calls: cap_rights_limit(2)
> and cap_rights_get(2). I guess these could potentially be implemented
> elsewhere (e.g. as fcntl(2) operations?) but the changes seem
> significant enough that new syscalls are warranted.
>
> [FreeBSD 10.x actually includes six new syscalls for manipulating the
> rights associated with a Capsicum capability -- the capability rights
> can police that only specific fcntl(2) or ioctl(2) commands are
> allowed, and FreeBSD sets these with distinct syscalls.]

ioctls? In a sandbox? Ick.

> 5) New openat(2) O_BENEATH Flag
> -------------------------------
>
> For Capsicum capabilities that are directory file descriptors, the
> Capsicum framework only allows openat(cap_dfd, path, ...) operations to
> work for files that are beneath the specified directory (and even that
> only when the directory FD has the CAP_LOOKUP right), rejecting paths
> that start with "/" or include "..". The same restriction applies
> process-wide for a process in capability mode.
>
> As this seemed like functionality that might be more generally useful,
> I've implemented it independently as a new O_BENEATH flag for openat(2).
> The Capsicum code then always triggers the use of that flag when the dfd
> is a Capsicum capability, or when the prctl(2) command described above
> is in play.
>
> [FreeBSD has the openat(2) relative-only behaviour for capability DFDs
> and processes in capability mode, but does not include the O_BENEATH
> flag.]

If you are going to allow open I would think the simple solution here is
to just create a mount namespace that only has available the files you
would like to export to the process, and force all of the file
descriptors into that mount namespace.

Eric

2014-07-27 12:08:35

by David Drysdale

Subject: Re: [PATCH 10/11] capsicum: prctl(2) to force use of O_BENEATH

On Fri, Jul 25, 2014 at 5:00 PM, Andy Lutomirski <[email protected]> wrote:
>
> On Jul 25, 2014 7:02 AM, "Paolo Bonzini" <[email protected]> wrote:
> >
> > Il 25/07/2014 15:47, David Drysdale ha scritto:
> > > @@ -1996,6 +2013,17 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> > > if (arg2 || arg3 || arg4 || arg5)
> > > return -EINVAL;
> > > return current->no_new_privs ? 1 : 0;
> > > + case PR_SET_OPENAT_BENEATH:
> > > + if (arg2 != 1 || arg4 || arg5)
> > > + return -EINVAL;
> > > + if ((arg3 & ~(PR_SET_OPENAT_BENEATH_TSYNC)) != 0)
> > > + return -EINVAL;
> > > + error = prctl_set_openat_beneath(me, arg3);
> > > + break;
> > > + case PR_GET_OPENAT_BENEATH:
> > > + if (arg2 || arg3 || arg4 || arg5)
> > > + return -EINVAL;
> > > + return me->openat_beneath;
> > > case PR_GET_THP_DISABLE:
> > > if (arg2 || arg3 || arg4 || arg5)
> > > return -EINVAL;
> > >
> >
> > Why are you always forbidding a change of prctl from 1 to 0? It should
> > be safe if current->no_new_privs is clear.
>
> I don't immediately see why you're forbidding unsetting it at all.
> If you need it to be sticky, then use seccomp or Capsicum to make it
> sticky.

Good point, that would make the function more generic -- needing to
latch is specific to Capsicum's use of it.

>
> Also, the implementation is dangerously racy -- if anyone pokes at
> adjacent bitfields without the lock, they can get corrupted. Try
> basing on Kees' seccomp tree or security-next and using the new atomic
> flags field.

Ah yes, sorry -- I hadn't yet shifted the implementation to line up with
the work you and Kees have put into the seccomp stuff.

>
>
> --Andy
>
> >
> > Do new threads inherit from the parent?
> >
> > Also, I wonder if you need something like this check:
> >
> > /*
> > * Installing a seccomp filter requires that the task has
> > * CAP_SYS_ADMIN in its namespace or be running with no_new_privs.
> > * This avoids scenarios where unprivileged tasks can affect the
> > * behavior of privileged children.
> > */
> > if (!current->no_new_privs &&
> > security_capable_noaudit(current_cred(), current_user_ns(),
> > CAP_SYS_ADMIN) != 0)
> > return -EACCES;
> >
> > Paolo

Yes, new threads inherit the flag from the parent so the
NNP||CAP_SYS_ADMIN check is probably needed.
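
For illustration, the gating described here could look roughly like the
userspace model below. It assumes the flag becomes freely settable and
clearable (per Andy's comment) and that the NNP||CAP_SYS_ADMIN check
applies to any change of the flag; both are assumptions of this sketch,
not settled behaviour. The sk_* names are hypothetical and the struct
is a stand-in, not task_struct.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task state for this sketch; NOT task_struct. */
struct sk_task {
	bool no_new_privs;	/* models current->no_new_privs */
	bool cap_sys_admin;	/* models holding CAP_SYS_ADMIN */
	bool openat_beneath;	/* the flag under discussion */
};

/*
 * Change the openat_beneath flag. Because children inherit it, an
 * unprivileged task without no_new_privs is refused, mirroring the
 * check seccomp applies before installing a filter.
 */
int sk_set_openat_beneath(struct sk_task *t, bool enable)
{
	assert(t != NULL);
	if (!t->no_new_privs && !t->cap_sys_admin)
		return -EACCES;
	t->openat_beneath = enable;
	return 0;
}
```

In this model a plain unprivileged task gets -EACCES, while the same
task succeeds once no_new_privs is set.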

2014-07-27 12:09:24

by David Drysdale

Subject: Re: [PATCH 11/11] seccomp: Add tgid and tid into seccomp_data

On Fri, Jul 25, 2014 at 6:18 PM, Andy Lutomirski <[email protected]> wrote:
> [cc: Eric Biederman]
>
> On Fri, Jul 25, 2014 at 10:10 AM, Kees Cook <[email protected]> wrote:
>> On Fri, Jul 25, 2014 at 8:59 AM, Andy Lutomirski <[email protected]> wrote:
>>> On Jul 25, 2014 6:48 AM, "David Drysdale" <[email protected]> wrote:
>>>>
>>>> Add the current thread and thread group IDs into the data
>>>> available for seccomp-bpf programs to work on. This allows
>>>> installation of filters that police syscalls based on thread
>>>> or process ID, e.g. tgkill(2)/kill(2)/prctl(2).
>>>>
>>>> Signed-off-by: David Drysdale <[email protected]>
>>>> ---
>>>> include/uapi/linux/seccomp.h | 10 ++++++++++
>>>> kernel/seccomp.c | 2 ++
>>>> 2 files changed, 12 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
>>>> index ac2dc9f72973..b88370d6f6ca 100644
>>>> --- a/include/uapi/linux/seccomp.h
>>>> +++ b/include/uapi/linux/seccomp.h
>>>> @@ -36,12 +36,22 @@
>>>> * @instruction_pointer: at the time of the system call.
>>>> * @args: up to 6 system call arguments always stored as 64-bit values
>>>> * regardless of the architecture.
>>>> + * @tgid: thread group ID of the thread executing the BPF program.
>>>> + * @tid: thread ID of the thread executing the BPF program.
>>>> + * The SECCOMP_DATA_TID_PRESENT macro indicates the presence of the
>>>> + * tgid and tid fields; user programs may use this macro to conditionally
>>>> + * compile code against older versions of the kernel. Note also that
>>>> + * BPF programs should cope with the absence of these fields by testing
>>>> + * the length of data available.
>>>> */
>>>> struct seccomp_data {
>>>> int nr;
>>>> __u32 arch;
>>>> __u64 instruction_pointer;
>>>> __u64 args[6];
>>>> + __u32 tgid;
>>>> + __u32 tid;
>>>> };
>>>> +#define SECCOMP_DATA_TID_PRESENT 1
>>>>
>>>> #endif /* _UAPI_LINUX_SECCOMP_H */
>>>> diff --git a/kernel/seccomp.c b/kernel/seccomp.c
>>>> index 301bbc24739c..dd5146f15d6d 100644
>>>> --- a/kernel/seccomp.c
>>>> +++ b/kernel/seccomp.c
>>>> @@ -80,6 +80,8 @@ static void populate_seccomp_data(struct seccomp_data *sd)
>>>> sd->args[4] = args[4];
>>>> sd->args[5] = args[5];
>>>> sd->instruction_pointer = KSTK_EIP(task);
>>>> + sd->tgid = task_tgid_vnr(current);
>>>> + sd->tid = task_pid_vnr(current);
>>>> }
>>>
>>> This is, IMO, problematic. These should probably be relative to the
>>> filter creator, not the filtered task. This will also hurt
>>> performance.
>>
>> Yeah, we can't change the seccomp_data structure without a lot of
>> care, and tgid/tid really should be encoded in the filter. However, it
>> is tricky in the forking case.
>>
>>>
>>> What's the use case? Can it be better achieved with a new eBPF function?

The specific use case is to be able to write a filter that allows kill(2)
or tgkill(2) to self, where the filter still works after forking. Capsicum
capability mode in general locks down system calls that access PIDs
(as they're a global namespace), but allows kill(self) as a pragmatic
compromise to make it easier to migrate applications to use Capsicum.
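
As a sketch of the decision such a filter would make, here it is
written as plain C rather than BPF so it can be read and run in
userspace. The struct copies the layout proposed in this patch but is
renamed sk_seccomp_data so it cannot be mistaken for the real uapi
struct; the syscall numbers are the x86_64 values and serve only as
placeholders. All sk_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as proposed in this patch; renamed, NOT the uapi struct. */
struct sk_seccomp_data {
	int nr;
	uint32_t arch;
	uint64_t instruction_pointer;
	uint64_t args[6];
	uint32_t tgid;	/* proposed: thread group ID of the caller */
	uint32_t tid;	/* proposed: thread ID of the caller */
};

/* Placeholder syscall numbers (x86_64 values) for the sketch. */
#define SK_NR_KILL   62
#define SK_NR_TGKILL 234

/*
 * Allow kill(pid, sig) only when pid is the caller's own tgid, and
 * tgkill(tgid, tid, sig) only when both IDs are the caller's own.
 * The IDs are read from the data filled in at syscall time, so the
 * same filter gives the right answer in a forked child.
 */
bool sk_allow_signal_to_self(const struct sk_seccomp_data *sd)
{
	assert(sd != NULL);
	if (sd->nr == SK_NR_KILL)
		return (uint32_t)sd->args[0] == sd->tgid;
	if (sd->nr == SK_NR_TGKILL)
		return (uint32_t)sd->args[0] == sd->tgid &&
		       (uint32_t)sd->args[1] == sd->tid;
	return false;
}
```

A filter with the parent's PID baked in as a constant would keep
matching the parent after fork; comparing against the per-call tgid
and tid is what avoids that.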

>> Julien had been wanting something like this too (though he'd suggested
>> it via prctl): limit the signal functions to "self" only. I wonder if
>> adding a prctl like done for O_BENEATH could work for signal sending?
>>
>
>
> Can we do one better and add a flag to prevent any non-self pid
> lookups? This might actually be easy on top of the pid namespace work
> (e.g. we could change the way that find_task_by_vpid works).

That sounds like a good idea, as long as it's possible for
non-CAP_SYS_ADMIN processes to do....

> It's far from just being signals. There's access_process_vm, ptrace,
> all the signal functions, clock_gettime (see CPUCLOCK_PID -- yes, this
> is ridiculous), and probably some others that I've forgotten about or
> never noticed in the first place.

For the Capsicum case in particular, most of these are restricted
by the capability mode filter anyhow (although I need to fix it for
CPUCLOCK_PID -- thanks for pointing that out); the kill(2) case
was a special case to make migrations easier. But a more general
mechanism seems sensible.


> --Andy
>
>> -Kees
>>
>> --
>> Kees Cook
>> Chrome OS Security
>
>
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC

2014-07-27 12:10:34

by David Drysdale

Subject: Re: [PATCH 11/11] seccomp: Add tgid and tid into seccomp_data

On Fri, Jul 25, 2014 at 7:32 PM, Andy Lutomirski <[email protected]> wrote:
> On Fri, Jul 25, 2014 at 11:22 AM, Julien Tinnes <[email protected]> wrote:
>> On Fri, Jul 25, 2014 at 10:38 AM, Kees Cook <[email protected]> wrote:
>>>
>>> On Fri, Jul 25, 2014 at 10:18 AM, Andy Lutomirski <[email protected]>
>>> wrote:
>>> > [cc: Eric Biederman]
>>> >
>>> > On Fri, Jul 25, 2014 at 10:10 AM, Kees Cook <[email protected]>
>>> > wrote:
>>>
>>> >> Julien had been wanting something like this too (though he'd suggested
>>> >> it via prctl): limit the signal functions to "self" only. I wonder if
>>> >> adding a prctl like done for O_BENEATH could work for signal sending?
>>> >>
>>> >
>>> >
>>> > Can we do one better and add a flag to prevent any non-self pid
>>> > lookups? This might actually be easy on top of the pid namespace work
>>> > (e.g. we could change the way that find_task_by_vpid works).
>>>
>>> Ooh, that would be extremely interesting, yes. Kind of an extreme form
>>> of pid namespace without actually being a namespace.
>>>
>>> > It's far from just being signals. There's access_process_vm, ptrace,
>>> > all the signal functions, clock_gettime (see CPUCLOCK_PID -- yes, this
>>> > is ridiculous), and probably some others that I've forgotten about or
>>> > never noticed in the first place.
>>>
>>> Yeah, that would be very interesting.
>>
>>
>> Yes, this would be incredibly useful.
>>
>> 1. For Chromium [1], I dislike relying on seccomp purely for
>> "access-control" (to other processes or files). Because it's really hard to
>> think about everything (things like CPUCLOCK_PID bite, see
>> https://crbug.com/374479).
>
> Not public :(
>
>> So we have a first layer of sandboxing (using PID + NET namespaces and
>> chroot) for "access-control" and a second layer for kernel attack surface
>> reduction and a few other things using seccomp-bpf.
>>
>> The first layer isn't currently very good; it's heavyweight and complex (you
>> need an init(1) per namespace and that init cannot be multi-purposed as a
>> useful process because pid = 1 can never receive signals). One PID namespace
>> per process isn't something that scales well. (Also before USER_NS it
>> required a setuid root program).
>>
>> 2. Even with a safe pure seccomp-bpf sandbox that prevents sending signals
>> to other process / ptrace() et al and that restrict clock_gettime(2)
>> properly, things become quickly very tedious because as far as the kernel is
>> concerned, the process under this BPF program can still pass
>> ptrace_may_access() to other processes. This means for instance that no
>> matter what you do, a model where open() is allowed can't work if /proc is
>> available. We need a mode that says "ptrace_may_access()" will never pass.
>>
>> So yes, I really would like:
>> - a prctl that says: "I'm dropping privileges and I now can't interact with
>> other thread groups (via signals, ptrace, etc..)".
>> - Something to drop access to the file system. It could be an unprivileged
>> way to chroot() to an empty directory (unprivileged namespaces work for
>> that, - except if you're already in a chroot -). This is a little tricky
>> without allowing chroot escapes, so I suspect we would want to express it in
>> terms of mount namespace, or something else, rather than chroot.
>
> Capsicum will give you this.

Yep, that's the idea. As long as there aren't any open DFDs for "/proc" on
entry to capability mode, there shouldn't be a way to access it later -- but it
is still possible to openat(2) new files (relative to a pre-opened DFD).
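
For illustration, a minimal sketch of what "relative to a pre-opened DFD" looks like from userspace. It uses only plain openat(2) via Python's os.open(..., dir_fd=...); the function name and the file name are made up for the example. Note that plain openat(2) on its own does not reject ".." or absolute paths; that part is what capability mode / O_BENEATH would add.

```python
import os
import tempfile


def read_via_dfd() -> bytes:
    """Create a file, then re-open it relative to a directory FD (openat)."""
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "data.txt"), "wb") as f:
            f.write(b"hello")

        # Directory FD opened up front, as it would be before sandboxing.
        dfd = os.open(root, os.O_RDONLY | os.O_DIRECTORY)
        try:
            # Equivalent to openat(dfd, "data.txt", O_RDONLY).
            fd = os.open("data.txt", os.O_RDONLY, dir_fd=dfd)
            try:
                return os.read(fd, 16)
            finally:
                os.close(fd)
        finally:
            os.close(dfd)
```

The only handle on the filesystem used after setup is dfd, so a process that never opened a DFD for "/proc" has nothing to hand to openat(2) to reach it.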

> See the other thread for a more concrete proposal. prctl is getting
> out of hand.
>
> --Andy

2014-07-28 12:31:27

by Paolo Bonzini

Subject: Re: [RFC PATCHv2 00/11] Adding FreeBSD's Capsicum security framework

Il 26/07/2014 23:04, Eric W. Biederman ha scritto:
>> The most significant aspect of Capsicum is associating *rights* with
>> (some) file descriptors, so that the kernel only allows operations on an
>> FD if the rights permit it. This allows userspace applications to
>> sandbox themselves by tightly constraining what's allowed with both
>> input and outputs; for example, tcpdump might restrict itself so it can
>> only read from the network FD, and only write to stdout.
>>
>> The kernel thus needs to police the rights checks for these file
>> descriptors (referred to as 'Capsicum capabilities', completely
>> different than POSIX.1e capabilities), and the best place to do this is
>> at the points where a file descriptor from userspace is converted to a
>> struct file * within the kernel.
>>
>> [Policing the rights checks anywhere else, for example at the system
>> call boundary, isn't a good idea because it opens up the possibility
>> of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
>> changed (as openat/close/dup2 are allowed in capability mode) between
>> the 'check' at syscall entry and the 'use' at fget() invocation.]
>>
>> However, this does lead to quite an invasive change to the kernel --
>> every invocation of fget() or similar functions (fdget(),
>> sockfd_lookup(), user_path_at(),...) needs to be annotated with the
>> rights associated with the specific operations that will be performed on
>> the struct file. There are ~100 such invocations that need
>> annotation.
>
> And it is silly. Roughly you just need a locking version of
> fcntl(F_SETFL).
>
> That is make the restriction in the struct file not in the fd to file
> lookup.

No, they have to be in the file descriptor. The same file descriptor
can be dup'ed and passed with different capabilities to different processes.

Say you pass an eventfd to a process with SCM_RIGHTS, and you want to
only allow the process to write to it.

>> 4) New System Calls
>> -------------------
>>
>> To allow userspace applications to access the Capsicum capability
>> functionality, I'm proposing two new system calls: cap_rights_limit(2)
>> and cap_rights_get(2). I guess these could potentially be implemented
>> elsewhere (e.g. as fcntl(2) operations?) but the changes seem
>> significant enough that new syscalls are warranted.
>>
>> [FreeBSD 10.x actually includes six new syscalls for manipulating the
>> rights associated with a Capsicum capability -- the capability rights
>> can police that only specific fcntl(2) or ioctl(2) commands are
>> allowed, and FreeBSD sets these with distinct syscalls.]
>
> ioctls? In a sandbox? Ick.

KVM? X11? Both of them use loads of ioctls. I'm less sure of the
benefit of picking which fcntls to allow.

Paolo

2014-07-28 16:04:44

by David Drysdale

Subject: Re: [RFC PATCHv2 00/11] Adding FreeBSD's Capsicum security framework

On Sat, Jul 26, 2014 at 10:04 PM, Eric W. Biederman
<[email protected]> wrote:
> David Drysdale <[email protected]> writes:
>
>> The last couple of versions of FreeBSD (9.x/10.x) have included the
>> Capsicum security framework [1], which allows security-aware
>> applications to sandbox themselves in a very fine-grained way. For
>> example, OpenSSH now (>= 6.5) uses Capsicum in its FreeBSD version to
>> restrict sshd's credentials checking process, to reduce the chances of
>> credential leakage.
>>
>> It would be good to have equivalent functionality in Linux, so I've been
>> working on getting the Capsicum framework running in the kernel, and I'd
>> appreciate some feedback/opinions on the general approach.
>>
>> I'm attaching a corresponding draft patchset for reference, but
>> hopefully this cover email can cover the significant features to save
>> everyone having to look through the code details. (It does mean this is
>> a long email though -- apologies for that.)
>>
>>
>> 1) Capsicum Capabilities
>> ------------------------
>>
>> The most significant aspect of Capsicum is associating *rights* with
>> (some) file descriptors, so that the kernel only allows operations on an
>> FD if the rights permit it. This allows userspace applications to
>> sandbox themselves by tightly constraining what's allowed with both
>> input and outputs; for example, tcpdump might restrict itself so it can
>> only read from the network FD, and only write to stdout.
>>
>> The kernel thus needs to police the rights checks for these file
>> descriptors (referred to as 'Capsicum capabilities', completely
>> different than POSIX.1e capabilities), and the best place to do this is
>> at the points where a file descriptor from userspace is converted to a
>> struct file * within the kernel.
>>
>> [Policing the rights checks anywhere else, for example at the system
>> call boundary, isn't a good idea because it opens up the possibility
>> of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
>> changed (as openat/close/dup2 are allowed in capability mode) between
>> the 'check' at syscall entry and the 'use' at fget() invocation.]
>>
>> However, this does lead to quite an invasive change to the kernel --
>> every invocation of fget() or similar functions (fdget(),
>> sockfd_lookup(), user_path_at(),...) needs to be annotated with the
>> rights associated with the specific operations that will be performed on
>> the struct file. There are ~100 such invocations that need
>> annotation.
>
> And it is silly. Roughly you just need a locking version of
> fcntl(F_SETFL).
>
> That is make the restriction in the struct file not in the fd to file
> lookup.

The file status flags are far too coarse -- for example, O_RDONLY
doesn't prevent fchmod(2).

Also, because they're associated with the struct file there is no way
to pass a file descriptor with only a subset of rights across a UNIX
socket (how do I send a read-only FD corresponding to my
O_RDWR file?).
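
The fchmod(2) point is easy to check from userspace. A small sketch (the function name is just for illustration) that opens a file O_RDONLY and then changes its mode through that descriptor:

```python
import os
import stat
import tempfile


def fchmod_via_readonly_fd() -> int:
    """Change a file's mode through an O_RDONLY descriptor; return the new mode."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        os.chmod(path, 0o644)
        ro = os.open(path, os.O_RDONLY)
        try:
            # Permitted for the file's owner: fchmod(2) looks at ownership,
            # not at the access mode the descriptor was opened with.
            os.fchmod(ro, 0o600)
        finally:
            os.close(ro)
        return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.unlink(path)
```

Run as the file's owner, the mode change goes through even though the descriptor was never writable.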

> Files in unix have been capabilities for more than 20 years. That is
> what file descriptor passing in unix domain sockets are all about. We
> don't need an additional ``capability rights'' layer on top of file
> descriptors.

True, the ability to pass FDs across UNIX sockets is one of the
key things that makes them analogous to object-capabilities and
suitable as the substratum for Capsicum. But the coarseness of the
existing rights, and the lack of coherent policing of those rights,
means that an (opt-in) extension like Capsicum is useful.

> In fact internal to linux with FMODE_READ and friends we already have
> restrictions on which methods are allowed on linux file descriptors.
> So this whole entire abstraction layer you are adding seems just plain
> broken.
>
> Going farther one huge thing your work and the capsicum work in general
> is missing is an implementation of revoke. With a little care a good
> implementation of bits reporting and controlling which methods are
> available on which file descriptors should be a good start on
> a revoke implementation for linux as well.

Yeah, revocation is interesting. There has been some discussion of
it for Capsicum a while ago:
https://lists.cam.ac.uk/pipermail/cl-capsicum-discuss/2011-January/msg00002.html
but I don't think a firm conclusion was reached. As I understand it,
there's also a theoretical argument that revocation can be constructed
on top of the capability system, by handing out capabilities to proxy
objects that can then change their behaviour so they no longer pass
operations through.

>> 2) Capsicum Capabilities Data Structure
>> ---------------------------------------
>>
>> Internally, the rights associated with a Capsicum capability FD are
>> stored in a special struct file wrapper. For a normal file, the rights
>> check inside fget() falls through, but for a capability wrapper the
>> rights in the wrapper are checked and (if capable) the underlying
>> wrapped struct file is returned.
>>
>> [This is approximately the implementation that was present in FreeBSD
>> 9.x. For FreeBSD 10.x, the wrapper file was removed and the rights
>> associated with a file descriptor are now stored in the fdtable. As
>> that impacts memory use for all processes, whether Capsicum users or
>> not, I've stuck with the FreeBSD 9.x approach.]
>
> I have already mentioned that this is an insane choice of semantics
> right? Adding an extra layer on top of a data structure that is a
> perfectly good restrictor of rights.
>
> If you can't add the restriction on struct file itself I would argue
> that the semantics of capsicum are fundamentally broken.
>
> I can not imagine why in the world you would want an extra layer of
> indirection, complication, and maintenance.

See above.

>> 3) Allowing Capability Mode
>> ---------------------------
>>
>> Capsicum also includes 'capability mode', which locks down the available
>> syscalls so the rights restrictions can't just be bypassed by opening
>> new file descriptors. More precisely, capability mode prevents access
>> to syscalls that access global namespaces, such as the filesystem or the
>> IP:port space.
>>
>> The existing seccomp-bpf functionality of the kernel is a good mechanism
>> for implementing capability mode, but there are a few additional details
>> that also need to be addressed.
>>
>> a) The capability mode filter program needs to apply process-wide, not
>> just to the current thread.
>>
>> b) In capability mode, new files can still be opened with openat(2) but
>> only if they are beneath an existing directory file descriptor.
>
> Which raises the question is that worth it?

From a practical point of view, still allowing (directory-relative) filesystem
access means it's much easier for an application to adapt to Capsicum.

A simple example is something like unzip/tar -xf, where locking it down
is conceptually as simple as:
- apply a CAP_READ restriction to the FD for the input file
- apply a CAP_WRITE+CAP_LOOKUP restriction to the DFD for the
output directory
- enter capability mode.
Then the worst a malicious input file can do is to write files under the
output directory -- which it could do anyway.

Similarly, migrating an application that writes temporary files out to
some tempdir is much more straightforward if openat(dfd,...) is still
allowed.
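
The relative-only rule from the cover letter (reject paths that start with "/" or include "..") is what stops a hostile archive member name from leaving the output directory. A rough userspace model of just that string-level rule, as a sketch: the function name is invented for illustration, and a real check would presumably also need to consider symlinks, which this ignores.

```python
def violates_beneath(path: str) -> bool:
    """Return True if path would be rejected by the rule as described:
    absolute paths and paths with a '..' component are refused."""
    if path.startswith("/"):
        return True
    # Compare whole components so that names like "file..txt" stay allowed.
    return ".." in path.split("/")


def safe_members(names):
    """Keep only archive member names that stay beneath the output directory."""
    return [n for n in names if not violates_beneath(n)]
```

With this model, safe_members(["a.txt", "../../etc/passwd", "/etc/shadow"]) keeps only "a.txt".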

>> c) In capability mode it should still be possible for a process to send
>> signals to itself with kill(2)/tgkill(2).
>
> Again is it worth it?
>
> I would think you would want capability mode to default to the minimum
> set of system calls you could get away with (to keep kernel code
> auditing to a minimum) and only add things if the performance gain of
> using the syscall exceeds the pain.

The set of syscalls allowed in capability mode is based more on the
principle of restricting access to global namespaces, to prevent (in
particular) the confused deputy problem. Allowing kill(self) is a pragmatic
compromise to make it easier to migrate existing applications to use
Capsicum.

> If you look at a kernel like sel4 it succeeds with an object
> capability model with many fewer system calls than you are proposing
> to export. Roughly just read, write and close.
>
> Consider the fact if you really want a kernel layer you can completely
> trust and rely on someone needs to write a formal proof of that layer.
> Short of that someone certainly needs to audit the kernel code very
> closely so simplicity of semantics and simplicity of implementation are
> very important.

seL4 may have a much simpler (and more formally-correct) model, but
the aim of Capsicum is to get some of the benefits of that well-analysed
model without the need for a (massive) migration effort -- and in a way
that allows co-existence of migrated and unmigrated code.

>> This v2 patchset copes with these as follows:
>>
>> a) Kees Cook's incoming seccomp(2) patchset covers thread
>> synchronization of filters.
>>
>> b) A new prctl(PR_SET_OPENAT_BENEATH) operation implicitly sets the
>> O_BENEATH flag (see below) for all file-open operations for all
>> threads of the current process, by adding a new openat_beneath
>> flag in task_struct.
>>
>> c) An extension to the seccomp_data structure that includes the current
>> task's tid and tgid values allows for BPF programs that check a
>> kill(2)/tgkill(2) argument against the current thread, in a manner
>> that is robust against fork(2)/clone(2).
>>
>> The combination of these features with the existing seccomp-bpf
>> functionality gives the tools needed to implement capability mode.
>>
>>
>> 4) New System Calls
>> -------------------
>>
>> To allow userspace applications to access the Capsicum capability
>> functionality, I'm proposing two new system calls: cap_rights_limit(2)
>> and cap_rights_get(2). I guess these could potentially be implemented
>> elsewhere (e.g. as fcntl(2) operations?) but the changes seem
>> significant enough that new syscalls are warranted.
>>
>> [FreeBSD 10.x actually includes six new syscalls for manipulating the
>> rights associated with a Capsicum capability -- the capability rights
>> can police that only specific fcntl(2) or ioctl(2) commands are
>> allowed, and FreeBSD sets these with distinct syscalls.]
>
> ioctls? In a sandbox? Ick.
>
>> 5) New openat(2) O_BENEATH Flag
>> -------------------------------
>>
>> For Capsicum capabilities that are directory file descriptors, the
>> Capsicum framework only allows openat(cap_dfd, path, ...) operations to
>> work for files that are beneath the specified directory (and even that
>> only when the directory FD has the CAP_LOOKUP right), rejecting paths
>> that start with "/" or include "..". The same restriction applies
>> process-wide for a process in capability mode.
>>
>> As this seemed like functionality that might be more generally useful,
>> I've implemented it independently as a new O_BENEATH flag for openat(2).
>> The Capsicum code then always triggers the use of that flag when the dfd
>> is a Capsicum capability, or when the prctl(2) command described above
>> is in play.
>>
>> [FreeBSD has the openat(2) relative-only behaviour for capability DFDs
>> and processes in capability mode, but does not include the O_BENEATH
>> flag.]
>
> If you are going to allow open I would think the simple solution here is
> to just create a mount namespace that only has available the files you
> would like to export to the process, and force all of the file
> descriptors into that mount namespace.
>
> Eric

I think that's a lot harder for application writers to do -- they would need
to have CAP_SYS_ADMIN to set up the namespace & mounts before
dropping privileges. And the net result would again be much less
fine-grained (e.g. for the unzip example above, the specific capability
rights prevent reading of any files that already exist in the output
directory).

2014-07-28 21:17:09

by Eric W. Biederman

Subject: Re: [RFC PATCHv2 00/11] Adding FreeBSD's Capsicum security framework

David Drysdale <[email protected]> writes:

> On Sat, Jul 26, 2014 at 10:04 PM, Eric W. Biederman
> <[email protected]> wrote:
>> David Drysdale <[email protected]> writes:

>>> 1) Capsicum Capabilities
>>> ------------------------
>>>
>>> The most significant aspect of Capsicum is associating *rights* with
>>> (some) file descriptors, so that the kernel only allows operations on an
>>> FD if the rights permit it. This allows userspace applications to
>>> sandbox themselves by tightly constraining what's allowed with both
>>> input and outputs; for example, tcpdump might restrict itself so it can
>>> only read from the network FD, and only write to stdout.
>>>
>>> The kernel thus needs to police the rights checks for these file
>>> descriptors (referred to as 'Capsicum capabilities', completely
>>> different than POSIX.1e capabilities), and the best place to do this is
>>> at the points where a file descriptor from userspace is converted to a
>>> struct file * within the kernel.
>>>
>>> [Policing the rights checks anywhere else, for example at the system
>>> call boundary, isn't a good idea because it opens up the possibility
>>> of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
>>> changed (as openat/close/dup2 are allowed in capability mode) between
>>> the 'check' at syscall entry and the 'use' at fget() invocation.]
>>>
>>> However, this does lead to quite an invasive change to the kernel --
>>> every invocation of fget() or similar functions (fdget(),
>>> sockfd_lookup(), user_path_at(),...) needs to be annotated with the
>>> rights associated with the specific operations that will be performed on
>>> the struct file. There are ~100 such invocations that need
>>> annotation.
>>
>> And it is silly. Roughly you just need a locking version of
>> fcntl(F_SETFL).
>>
>> That is make the restriction in the struct file not in the fd to file
>> lookup.
>
> The file status flags are far too coarse -- for example, O_RDONLY
> doesn't prevent fchmod(2).
>
> Also, because they're associated with the struct file there is no way
> to pass a file descriptor with only a subset of rights across a UNIX
> socket (how do I send a read-only FD corresponding to my
> O_RDWR file?).

You can't. Unix domain sockets pass struct file references.
That is, passing a file descriptor over a unix domain socket is the
equivalent of dup().

You absolutely must make your struct file read-only before passing it,
otherwise the receiver will have a read-write instance.

This notion that a shared structure should have different semantics
depending on who is looking at it, sounds like a maintenance nightmare
to me.
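
The dup()-like behaviour of SCM_RIGHTS can be seen from userspace. A minimal sketch, assuming Python 3.9+ for socket.send_fds()/socket.recv_fds() and passing the descriptor within a single process for simplicity; the function name is made up for the example:

```python
import fcntl
import os
import socket
import tempfile


def pass_fd_over_socketpair():
    """Send an O_RDWR descriptor via SCM_RIGHTS and inspect the received copy.

    Returns (offset seen through the original fd after reading via the
    received fd, access mode of the received fd).
    """
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with a, b, tempfile.TemporaryFile() as f:
        f.write(b"0123456789")
        f.flush()
        fd = f.fileno()
        os.lseek(fd, 0, os.SEEK_SET)

        socket.send_fds(a, [b"x"], [fd])
        _, fds, _, _ = socket.recv_fds(b, 1, 1)
        received = fds[0]
        try:
            # Reading through the received fd moves the shared file offset.
            os.read(received, 4)
            shared_offset = os.lseek(fd, 0, os.SEEK_CUR)
            accmode = fcntl.fcntl(received, fcntl.F_GETFL) & os.O_ACCMODE
        finally:
            os.close(received)
        return shared_offset, accmode
```

Both descriptors refer to the same open file: the offset is shared, and the received copy is still read-write.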

>> Files in unix have been capabilities for more than 20 years. That is
>> what file descriptor passing in unix domain sockets are all about. We
>> don't need an additional ``capability rights'' layer on top of file
>> descriptors.
>
> True, the ability to pass FDs across UNIX sockets is one of the
> key things that makes them analogous to object-capabilities and
> suitable as the substratum for Capsicum. But the coarseness of the
> existing rights, and the lack of coherent policing of those rights,
> means that an (opt-in) extension like Capsicum is useful.

Finer grained restrictions are perfectly sensible.

I don't see how it makes sense to place those restrictions in
struct fdtable instead of in struct file.

I see two sensible implementations:
- Add a seccomp bpf filter to struct file.
- Add permission bits to struct file.

The bpf filter would allow for a simple code extension that would
allow any arbitrary policy to be applied.

Adding permission bits to struct file for gating access to
file_operations and inode_operations would result in the fastest
possible implementation and something very easy to audit and
understand. But there are cases like ioctl or read/write on IB control
files where finer grained permissions might be desirable.

For a lot of operations we already have bits like FMODE_READ and
FMODE_CAN_READ on struct file for reasons such as performance
so that only a single cache line needs to be touched.

It might be that it would be worth having both. Something cheap and
generally accessible and something fast.

>> Going farther one huge thing your work and the capsicum work in general
>> is missing is an implementation of revoke. With a little care a good
>> implementation of bits reporting and controlling which methods are
>> available on which file descriptors should be a good start on
>> a revoke implementation for linux as well.
>
> Yeah, revocation is interesting. There has been some discussion of
> it for Capsicum a while ago:
> https://lists.cam.ac.uk/pipermail/cl-capsicum-discuss/2011-January/msg00002.html
> but I don't think a firm conclusion was reached. As I understand it,
> there's also a theoretical argument that revocation can be constructed
> on top of the capability system, by handing out capabilities to proxy
> objects that can then change their behaviour so they no longer pass
> operations through.

I believe the capsicum approach is so far from allowing general proxy
objects that, that is not a compelling case.

>>> 3) Allowing Capability Mode
>>> ---------------------------
>>>
>>> Capsicum also includes 'capability mode', which locks down the available
>>> syscalls so the rights restrictions can't just be bypassed by opening
>>> new file descriptors. More precisely, capability mode prevents access
>>> to syscalls that access global namespaces, such as the filesystem or the
>>> IP:port space.
>>>
>>> The existing seccomp-bpf functionality of the kernel is a good mechanism
>>> for implementing capability mode, but there are a few additional details
>>> that also need to be addressed.
>>>
>>> a) The capability mode filter program needs to apply process-wide, not
>>> just to the current thread.
>>>
>>> b) In capability mode, new files can still be opened with openat(2) but
>>> only if they are beneath an existing directory file descriptor.
>>
>> Which raises the question is that worth it?
>
> From a practical point of view, still allowing (directory-relative) filesystem
> access means it's much easier for an application to adapt to Capsicum.
>
> A simple example is something like unzip/tar -xf, where locking it down
> is conceptually as simple as:
> - apply a CAP_READ restriction to the FD for the input file
> - apply a CAP_WRITE+CAP_LOOKUP restriction to the DFD for the
> output directory
> - enter capability mode.
> Then the worst a malicious input file can do is to write files under the
> output directory -- which it could do anyway.
>
> Similarly, migrating an application that writes temporary files out to
> some tempdir is much more straightforward if openat(dfd,...) is still
> allowed.


>>> c) In capability mode it should still be possible for a process to send
>>> signals to itself with kill(2)/tgkill(2).


> seL4 may have a much simpler (and more formally-correct) model, but
> the aim of Capsicum is to get some of the benefits of that well-analysed
> model without the need for a (massive) migration effort -- and in a way
> that allows co-existence of migrated and unmigrated code.

Alright. About the same target as seccomp then. Aiming at something
that is simple to adopt sounds like a reasonable goal.

>>> 5) New openat(2) O_BENEATH Flag
>>> -------------------------------
>>>
>>> For Capsicum capabilities that are directory file descriptors, the
>>> Capsicum framework only allows openat(cap_dfd, path, ...) operations to
>>> work for files that are beneath the specified directory (and even that
>>> only when the directory FD has the CAP_LOOKUP right), rejecting paths
>>> that start with "/" or include "..". The same restriction applies
>>> process-wide for a process in capability mode.
>>>
>>> As this seemed like functionality that might be more generally useful,
>>> I've implemented it independently as a new O_BENEATH flag for openat(2).
>>> The Capsicum code then always triggers the use of that flag when the dfd
>>> is a Capsicum capability, or when the prctl(2) command described above
>>> is in play.
>>>
>>> [FreeBSD has the openat(2) relative-only behaviour for capability DFDs
>>> and processes in capability mode, but does not include the O_BENEATH
>>> flag.]
>>
>> If you are going to allow open I would think the simple solution here is
>> to just create a mount namespace that only has available the files you
>> would like to export to the process, and force all of the file
>> descriptors into that mount namespace.

> I think that's a lot harder for application writers to do -- they would need
> to have CAP_SYS_ADMIN to set up the namespace & mounts before
> dropping privileges.

They only need CAP_SYS_ADMIN in their local user namespace, which anyone
is allowed to create.

> And the net result would again be much less
> fine-grained (e.g. for the unzip example above, the specific capability
> rights prevent reading of any files that already exist in the output
> directory).

Nope. What you can implement today if you want fine grained limitations
like this is to create a mount namespace with exactly the subdirectory
tree you want to allow access to and to return a file descriptor that
points into that mount namespace. (When complete the only user of that
mount namespace would be your file descriptor).

In fact that solution is sufficiently performant and simple that even
if you came up with a better user space interface for it that is how we
would want to implement it.

And frankly that already exists today so I think a fairly large burden
of proof needs to be met to suggest that we need to add additional
functionality to the kernel.

Furthermore I think it makes sense for application authors to use
it, as otherwise they will have to wait a year or so for the debates to be
finished and your new functionality to be merged.

Additionally unless there is a process wide restriction to relative
paths I can trivially escape your relative path implementation by simply
doing open(.) and getting a struct file without any of those
restrictions.

Eric

2014-07-28 21:22:00

by Eric W. Biederman

Subject: Re: [PATCH 11/11] seccomp: Add tgid and tid into seccomp_data

Andy Lutomirski <[email protected]> writes:

> [cc: Eric Biederman]
>

> Can we do one better and add a flag to prevent any non-self pid
> lookups? This might actually be easy on top of the pid namespace work
> (e.g. we could change the way that find_task_by_vpid works).
>
> It's far from just being signals. There's access_process_vm, ptrace,
> all the signal functions, clock_gettime (see CPUCLOCK_PID -- yes, this
> is ridiculous), and probably some others that I've forgotten about or
> never noticed in the first place.

So here is the practical question.

Are these processes that only can send signals to their thread group
allowed to call fork()?


If fork is allowed and all pid lookups are restricted to their own
thread group then wait, waitpid, and all of the rest of the wait family
will never return the pids of their children, and zombies will
accumulate. Aka the semantics are fundamentally broken.
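
To make the dependency concrete, a small sketch of the normal fork/wait pattern (the function name is illustrative only). The parent reaps the child by naming a pid that is outside its own thread group, which is the lookup that would stop working:

```python
import os


def spawn_and_reap(exit_code: int = 7):
    """Fork a child that exits at once, then reap it by pid.

    Returns (child pid from fork, pid reported by waitpid, child exit code).
    """
    pid = os.fork()
    if pid == 0:
        # Child: leave immediately without running the parent's cleanup.
        os._exit(exit_code)

    # Parent: waitpid() identifies the zombie by the child's pid.
    reaped_pid, status = os.waitpid(pid, 0)
    code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
    return pid, reaped_pid, code
```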

If fork is not allowed pid namespaces already solve this problem.

Eric

2014-07-29 08:44:15

by Paolo Bonzini

Subject: Re: [RFC PATCHv2 00/11] Adding FreeBSD's Capsicum security framework

Il 28/07/2014 23:13, Eric W. Biederman ha scritto:
> This notion that a shared structure should have different semantics
> depending on who is looking at it, sounds like a maintenance nightmare
> to me.

Isn't that already the case with seccomp BPF filters? You can have the parent
process set a BPF filter that forbids read on that file descriptor or
write on that file descriptor.

Effectively, this patchset provides the functionality to "hotplug" BPF
filters on a running process as more file descriptors are passed via
SCM_RIGHTS. Except it doesn't use BPF filters, and instead uses a
separate set of discrete capabilities.

> I see two sensible implementations:
> - Add a seccomp bpf filter to struct file.

I proposed something like this, but it has some additional
implementation complications. See the v1 thread.

But it does have to be in a file descriptor rather than a struct file.
It's part of the model that two processes can have different views of
the file descriptor (again my toy example is that of an eventfd that an
unprivileged child process can only write to).

> Additionally unless there is a process wide restriction to relative
> paths I can trivially escape your relative path implementation by simply
> doing open(.) and getting a struct file without any of those
> restrictions.

Yes, there is a prctl for that.

Paolo

2014-07-29 10:58:49

by David Drysdale

Subject: Re: [RFC PATCHv2 00/11] Adding FreeBSD's Capsicum security framework

On Mon, Jul 28, 2014 at 10:13 PM, Eric W. Biederman
<[email protected]> wrote:
> David Drysdale <[email protected]> writes:
>
>> On Sat, Jul 26, 2014 at 10:04 PM, Eric W. Biederman
>> <[email protected]> wrote:
>>> David Drysdale <[email protected]> writes:
>
>>>> 1) Capsicum Capabilities
>>>> ------------------------
>>>>
>>>> The most significant aspect of Capsicum is associating *rights* with
>>>> (some) file descriptors, so that the kernel only allows operations on an
>>>> FD if the rights permit it. This allows userspace applications to
>>>> sandbox themselves by tightly constraining what's allowed with both
>>>> input and outputs; for example, tcpdump might restrict itself so it can
>>>> only read from the network FD, and only write to stdout.
>>>>
>>>> The kernel thus needs to police the rights checks for these file
>>>> descriptors (referred to as 'Capsicum capabilities', completely
>>>> different than POSIX.1e capabilities), and the best place to do this is
>>>> at the points where a file descriptor from userspace is converted to a
>>>> struct file * within the kernel.
>>>>
>>>> [Policing the rights checks anywhere else, for example at the system
>>>> call boundary, isn't a good idea because it opens up the possibility
>>>> of time-of-check/time-of-use (TOCTOU) attacks [2] where FDs are
>>>> changed (as openat/close/dup2 are allowed in capability mode) between
>>>> the 'check' at syscall entry and the 'use' at fget() invocation.]
>>>>
>>>> However, this does lead to quite an invasive change to the kernel --
>>>> every invocation of fget() or similar functions (fdget(),
>>>> sockfd_lookup(), user_path_at(),...) needs to be annotated with the
>>>> rights associated with the specific operations that will be performed on
>>>> the struct file. There are ~100 such invocations that need
>>>> annotation.
>>>
>>> And it is silly. Roughly you just need a locking version of
>>> fcntl(F_SETFL).
>>>
>>> That is make the restriction in the struct file not in the fd to file
>>> lookup.
>>
>> The file status flags are far too coarse -- for example, O_RDONLY
>> doesn't prevent fchmod(2).
>>
>> Also, because they're associated with the struct file there is no way
>> to pass a file descriptor with only a subset of rights across a UNIX
>> socket (how do I send a read-only FD corresponding to my
>> O_RDWR file?).
>
> You can't. Unix domain sockets pass struct file references.
> That is, passing a file descriptor over a unix domain socket is the
> equivalent of dup().

That's my point -- if the rights are associated with the struct file
there is no way of subsetting them. By associating rights with
the file descriptor, you can pass around different views of the
same underlying object with different rights.

> You absolutely must make your struct file read-only before passing it,
> otherwise the receiver will have a read-write instance.

Even if the struct file is read-only, the recipient will still be
able to do lots of things (e.g. fchmod as mentioned previously).

> This notion that a shared structure should have different semantics
> depending on who is looking at it, sounds like a maintenance nightmare
> to me.

I don't think it's that bad -- it's not completely different semantics,
just a strict subset of allowed operations. Also, having a distinct
ENOTCAPABLE errno value makes it easy to spot mismatches
of required & provided rights.
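
As a toy model only (this is not the patchset's code, and the class and right names below are invented for illustration), the "strict subset" semantics can be sketched like this: each descriptor carries its own rights over a shared object, rights can be narrowed but never widened, and a mismatch raises a distinct error standing in for ENOTCAPABLE.

```python
from enum import Flag, auto


class Right(Flag):
    """Hypothetical rights, loosely named after the CAP_* examples."""
    READ = auto()
    WRITE = auto()
    LOOKUP = auto()
    FCHMOD = auto()


class NotCapable(Exception):
    """Stands in for the ENOTCAPABLE errno."""


class Descriptor:
    """A per-descriptor view: shared underlying object plus its own rights."""

    def __init__(self, obj, rights: Right):
        self.obj = obj
        self.rights = rights

    def limit(self, rights: Right) -> "Descriptor":
        """Return a view of the same object with a subset of the rights."""
        if rights & ~self.rights:
            raise NotCapable("cannot add rights that are not already held")
        return Descriptor(self.obj, rights)

    def check(self, required: Right):
        """Return the underlying object if all required rights are held."""
        if (required & self.rights) != required:
            raise NotCapable(f"need {required}, have {self.rights}")
        return self.obj
```

In this model limit() returns a new view rather than changing the descriptor in place; that is a modelling choice to show two views of one object side by side, for example a full-rights descriptor kept by the sender and a READ-only one handed to the recipient.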

>>> Files in unix have been capabilities for more than 20 years. That is
>>> what file descriptor passing in unix domain sockets are all about. We
>>> don't need an additional ``capability rights'' layer on top of file
>>> descriptors.
>>
>> True, the ability to pass FDs across UNIX sockets is one of the
>> key things that makes them analogous to object-capabilities and
>> suitable as the substratum for Capsicum. But the coarseness of the
>> existing rights, and the lack of coherent policing of those rights,
>> means that an (opt-in) extension like Capsicum is useful.
>
> Finer grained restrictions are perfectly sensible.
>
> I don't see how it makes sense to place those restrictions in
> struct fdtable instead of in struct file.
>
> I see two sensible implementations:
> - Add a seccomp bpf filter to struct file.
> - Add permission bits to struct file.
>
> The bpf filter would allow for a simple code extension that would
> allow any arbitrary policy to be applied.

Paolo Bonzini suggested this on the previous iteration, but
I think the associated complications are too overwhelming.
Have a look at: https://lkml.org/lkml/2014/7/7/165

> Adding permission bits to struct file for gating access to
> file_operations and inode_operations would result in the fastest
> possible implementation and something very easy to audit and
> understand. But there are cases, like ioctl or read/write on IB control
> files, where finer grained permissions might be desirable.
>
> For a lot of operations we already have bits like FMODE_READ and
> FMODE_CAN_READ on struct file for reasons such as performance
> so that only a single cache line needs to be touched.
>
> It might be that it would be worth having both. Something cheap and
> generally accessible and something fast.
>
>>> Going farther one huge thing your work and the capsicum work in general
>>> is missing is an implementation of revoke. With a little care a good
>>> implementation of bits reporting and controlling which methods are
>>> available on which file descriptors should be a good start on
>>> a revoke implementation for linux as well.
>>
>> Yeah, revocation is interesting. There has been some discussion of
>> it for Capsicum a while ago:
>> https://lists.cam.ac.uk/pipermail/cl-capsicum-discuss/2011-January/msg00002.html
>> but I don't think a firm conclusion was reached. As I understand it,
>> there's also a theoretical argument that revocation can be constructed
>> on top of the capability system, by handing out capabilities to proxy
>> objects that can then change their behaviour so they no longer pass
>> operations through.
>
> I believe the capsicum approach is so far from allowing general proxy
> objects that this is not a compelling case.

Yeah, it is probably more of a theoretical argument than a practical
one. But then, the requirement for revocation might fall into the same
bucket -- more desirable in theory than in practice, particularly for the
sort of application self-sandboxing and compartmentalization that
Capsicum is particularly good for.

>>>> 3) Allowing Capability Mode
>>>> ---------------------------
>>>>
>>>> Capsicum also includes 'capability mode', which locks down the available
>>>> syscalls so the rights restrictions can't just be bypassed by opening
>>>> new file descriptors. More precisely, capability mode prevents access
>>>> to syscalls that access global namespaces, such as the filesystem or the
>>>> IP:port space.
>>>>
>>>> The existing seccomp-bpf functionality of the kernel is a good mechanism
>>>> for implementing capability mode, but there are a few additional details
>>>> that also need to be addressed.
>>>>
>>>> a) The capability mode filter program needs to apply process-wide, not
>>>> just to the current thread.
>>>>
>>>> b) In capability mode, new files can still be opened with openat(2) but
>>>> only if they are beneath an existing directory file descriptor.
>>>
>>> Which raises the question is that worth it?
>>
>> From a practical point of view, still allowing (directory-relative) filesystem
>> access means it's much easier for an application to adapt to Capsicum.
>>
>> A simple example is something like unzip/tar -xf, where locking it down
>> is conceptually as simple as:
>> - apply a CAP_READ restriction to the FD for the input file
>> - apply a CAP_WRITE+CAP_LOOKUP restriction to the DFD for the
>> output directory
>> - enter capability mode.
>> Then the worst a malicious input file can do is to write files under the
>> output directory -- which it could do anyway.
>>
>> Similarly, migrating an application that writes temporary files out to
>> some tempdir is much more straightforward if openat(dfd,...) is still
>> allowed.
>
>
>>>> c) In capability mode it should still be possible for a process to send
>>>> signals to itself with kill(2)/tgkill(2).
>
>
>> seL4 may have a much simpler (and more formally-correct) model, but
>> the aim of Capsicum is to get some of the benefits of that well-analysed
>> model without the need for a (massive) migration effort -- and in a way
>> that allows co-existence of migrated and unmigrated code.
>
> Alright. About the same target as seccomp then. Aiming at something
> that is simple to adopt sounds like a reasonable goal.
>
>>>> 5) New openat(2) O_BENEATH Flag
>>>> -------------------------------
>>>>
>>>> For Capsicum capabilities that are directory file descriptors, the
>>>> Capsicum framework only allows openat(cap_dfd, path, ...) operations to
>>>> work for files that are beneath the specified directory (and even that
>>>> only when the directory FD has the CAP_LOOKUP right), rejecting paths
>>>> that start with "/" or include "..". The same restriction applies
>>>> process-wide for a process in capability mode.
>>>>
>>>> As this seemed like functionality that might be more generally useful,
>>>> I've implemented it independently as a new O_BENEATH flag for openat(2).
>>>> The Capsicum code then always triggers the use of that flag when the dfd
>>>> is a Capsicum capability, or when the prctl(2) command described above
>>>> is in play.
>>>>
>>>> [FreeBSD has the openat(2) relative-only behaviour for capability DFDs
>>>> and processes in capability mode, but does not include the O_BENEATH
>>>> flag.]
>>>
>>> If you are going to allow open I would think the simple solution here is
>>> to just create a mount namespace that only has available the files you
>>> would like to export to the process, and force all of the file
>>> descriptors into that mount namespace.
>
>> I think that's a lot harder for application writers to do -- they would need
>> to have CAP_SYS_ADMIN to set up the namespace & mounts before
>> dropping privileges.
>
> They only need CAP_SYS_ADMIN in their local user namespace, which anyone
> is allowed to create.
>
>> And the net result would again be much less
>> fine-grained (e.g. for the unzip example above, the specific capability
>> rights prevent reading of any files that already exist in the output
>> directory).
>
> Nope. What you can implement today if you want fine grained limitations
> like this is to create a mount namespace with exactly the subdirectory
> tree you want to allow access to and to return a file descriptor that
> points into that mount namespace. (When complete the only user of that
> mount namespace would be your file descriptor).

How does that solve the particular example I mentioned? The DFD
within the mount namespace will still allow any operation on any file
that's already in the subdirectory -- or am I misunderstanding
something?

> In fact that solution is sufficiently performant and simple that even
> if you came up with a better user space interface for it that is how we
> would want to implement it.
>
> And frankly that already exists today so I think a fairly large burden
> of proof needs to be met to suggest that we need to add additional
> functionality to the kernel.
>
> Furthermore I think it makes sense for application authors to use
> as otherwise they will have to wait a year or so for the debates to be
> finished and your new functionality to be merged.
>
> Additionally unless there is a process wide restriction to relative
> paths I can trivially escape your relative path implementation by simply
> doing open(.) and getting a struct file without any of those
> restrictions.
>
> Eric

Capability mode would impose a process wide restriction to relative
paths -- see sections 3) and 5) of the original mail.
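
As a rough userspace illustration of the rule from section 5 (reject paths
that start with "/" or include ".."), a purely lexical check might look
like the following. This is only a sketch with a hypothetical helper name;
the O_BENEATH flag is applied by the kernel rather than by string
inspection in the application, and a lexical check on its own says nothing
about symlinks.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if 'path' is relative and has no ".." component. */
bool path_is_lexically_beneath(const char *path)
{
    assert(path != NULL);

    if (path[0] == '/')
        return false;               /* absolute path */

    const char *p = path;
    while (*p != '\0') {
        size_t len = strcspn(p, "/");   /* length of this component */

        if (len == 2 && p[0] == '.' && p[1] == '.')
            return false;           /* ".." component */
        p += len;
        while (*p == '/')           /* skip one or more separators */
            p++;
    }
    return true;
}
```

For example, "sub/dir/file" passes, while "/etc/passwd", ".." and
"sub/../../x" are rejected; a name such as "a..b" is not a ".." component
and passes.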

2014-07-30 04:06:07

by Andy Lutomirski

Subject: Re: [PATCH 11/11] seccomp: Add tgid and tid into seccomp_data

On Mon, Jul 28, 2014 at 2:18 PM, Eric W. Biederman
<[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> [cc: Eric Biederman]
>>
>
>> Can we do one better and add a flag to prevent any non-self pid
>> lookups? This might actually be easy on top of the pid namespace work
>> (e.g. we could change the way that find_task_by_vpid works).
>>
>> It's far from just being signals. There's access_process_vm, ptrace,
>> all the signal functions, clock_gettime (see CPUCLOCK_PID -- yes, this
>> is ridiculous), and probably some others that I've forgotten about or
>> never noticed in the first place.
>
> So here is the practical question.
>
> Are these processes that only can send signals to their thread group
> allowed to call fork()?
>
>
> If fork is allowed and all pid lookups are restricted to their own
> thread group then wait, waitpid, and all of the rest of the wait family
> will never return the pids of their children, and zombies will
> accumulate. Aka the semantics are fundamentally broken.

Good point.

I can imagine at least three ways that fork() could continue working, though:

1. Allow lookups of immediate children, too. (I don't love this one.)
2. Allow non-self pids to be translated in but not out. This way
P_ALL will continue working.
3. Have the kernel treat any PID-restricted process as though it were NOCLDWAIT.

I think I like #3. Thoughts?

>
> If fork is not allowed pid namespaces already solve this problem.

PID namespaces are fairly heavyweight. Julien pointed out that using
PID namespaces requires a bunch of dummy PID 1 processes.

--Andy

2014-07-30 04:12:10

by Eric W. Biederman

Subject: Re: [PATCH 11/11] seccomp: Add tgid and tid into seccomp_data

Andy Lutomirski <[email protected]> writes:

> On Mon, Jul 28, 2014 at 2:18 PM, Eric W. Biederman
> <[email protected]> wrote:
>> Andy Lutomirski <[email protected]> writes:
>>
>>> [cc: Eric Biederman]
>>>
>>
>>> Can we do one better and add a flag to prevent any non-self pid
>>> lookups? This might actually be easy on top of the pid namespace work
>>> (e.g. we could change the way that find_task_by_vpid works).
>>>
>>> It's far from just being signals. There's access_process_vm, ptrace,
>>> all the signal functions, clock_gettime (see CPUCLOCK_PID -- yes, this
>>> is ridiculous), and probably some others that I've forgotten about or
>>> never noticed in the first place.
>>
>> So here is the practical question.
>>
>> Are these processes that only can send signals to their thread group
>> allowed to call fork()?
>>
>>
>> If fork is allowed and all pid lookups are restricted to their own
>> thread group then wait, waitpid, and all of the rest of the wait family
>> will never return the pids of their children, and zombies will
>> accumulate. Aka the semantics are fundamentally broken.
>
> Good point.
>
> I can imagine at least three ways that fork() could continue working, though:
>
> 1. Allow lookups of immediate children, too. (I don't love this one.)
> 2. Allow non-self pids to be translated in but not out. This way
> P_ALL will continue working.
> 3. Have the kernel treat any PID-restricted process as though it were NOCLDWAIT.
>
> I think I like #3. Thoughts?
>
>>
>> If fork is not allowed pid namespaces already solve this problem.
>
> PID namespaces are fairly heavyweight. Julien pointed out that using
> PID namespaces requires a bunch of dummy PID 1 processes.

Only if you can't tolerate init exiting. The reasoning with respect to
signals and signals being ignored was wrong. And if you only have one
process you care about and no children to worry about neither the
difference in signal handling nor the world dies when init exits applies.

Therefore given what I have read described pid namespaces are a trivial
solution to this problem space.

Eric

2014-07-30 04:35:24

by Andy Lutomirski

Subject: Re: [PATCH 11/11] seccomp: Add tgid and tid into seccomp_data

On Tue, Jul 29, 2014 at 9:08 PM, Eric W. Biederman
<[email protected]> wrote:
> Andy Lutomirski <[email protected]> writes:
>
>> On Mon, Jul 28, 2014 at 2:18 PM, Eric W. Biederman
>> <[email protected]> wrote:
>>> Andy Lutomirski <[email protected]> writes:
>>>
>>>> [cc: Eric Biederman]
>>>>
>>>
>>>> Can we do one better and add a flag to prevent any non-self pid
>>>> lookups? This might actually be easy on top of the pid namespace work
>>>> (e.g. we could change the way that find_task_by_vpid works).
>>>>
>>>> It's far from just being signals. There's access_process_vm, ptrace,
>>>> all the signal functions, clock_gettime (see CPUCLOCK_PID -- yes, this
>>>> is ridiculous), and probably some others that I've forgotten about or
>>>> never noticed in the first place.
>>>
>>> So here is the practical question.
>>>
>>> Are these processes that only can send signals to their thread group
>>> allowed to call fork()?
>>>
>>>
>>> If fork is allowed and all pid lookups are restricted to their own
>>> thread group then wait, waitpid, and all of the rest of the wait family
>>> will never return the pids of their children, and zombies will
>>> accumulate. Aka the semantics are fundamentally broken.
>>
>> Good point.
>>
>> I can imagine at least three ways that fork() could continue working, though:
>>
>> 1. Allow lookups of immediate children, too. (I don't love this one.)
>> 2. Allow non-self pids to be translated in but not out. This way
>> P_ALL will continue working.
>> 3. Have the kernel treat any PID-restricted process as though it were NOCLDWAIT.
>>
>> I think I like #3. Thoughts?
>>
>>>
>>> If fork is not allowed pid namespaces already solve this problem.
>>
>> PID namespaces are fairly heavyweight. Julien pointed out that using
>> PID namespaces requires a bunch of dummy PID 1 processes.
>
> Only if you can't tolerate init exiting. The reasoning with respect to
> signals and signals being ignored was wrong. And if you only have one
> process you care about and no children to worry about neither the
> difference in signal handling nor the world dies when init exits applies.

Can you elaborate? It seems entirely plausible to me that there are
programs that won't work right as PID 1 without considerable
adaptation.

>
> Therefore given what I have read described pid namespaces are a trivial
> solution to this problem space.

pid namespaces also won't work in the context of Capsicum unless you
want every single Capsicum process to be its own pid namespace. Also,
pid namespaces don't offer any way to protect children from parents.

--Andy

2014-07-30 06:25:53

by Eric W. Biederman

Subject: Re: [RFC PATCHv2 00/11] Adding FreeBSD's Capsicum security framework


I have cut this down to just focus on O_BENEATH openat case.

David Drysdale <[email protected]> writes:

> On Mon, Jul 28, 2014 at 10:13 PM, Eric W. Biederman
> <[email protected]> wrote:

>> Nope. What you can implement today if you want fine grained limitations
>> like this is to create a mount namespace with exactly the subdirectory
>> tree you want to allow access to and to return a file descriptor that
>> points into that mount namespace. (When complete the only user of that
>> mount namespace would be your file descriptor).
>
> How does that solve the particular example I mentioned? The DFD
> within the mount namespace will still allow any operation on any file
> that's already in the subdirectory -- or am I misunderstanding
> something?

The goal was to bound the DFD to the directory and all of its
subdirectories such that openat(dfd, "../../..") would open
the dfd, and that further opens of other directories would also not
allow you to escape.

Since the mount namespace only contains the chosen directory and its
subdirectories that works easily and trivially.

So while you can indeed perform any file operation on that dfd who
cares because none of those operations can get you anywhere you aren't
supposed to be.

My point was that you can be as granular as you would like by binding a dfd
to a mount namespace instead of binding a process to a mount namespace,
and the code already exists and is being maintained.

So while things are not packaged in the form that has been requested it
looks to me as if the functionality for directories already exists
within the Linux kernel.

Eric

2014-07-30 14:51:41

by Andy Lutomirski

Subject: Re: [RFC PATCHv2 00/11] Adding FreeBSD's Capsicum security framework

On Jul 29, 2014 11:25 PM, "Eric W. Biederman" <[email protected]> wrote:
>
>
> I have cut this down to just focus on O_BENEATH openat case.
>
> David Drysdale <[email protected]> writes:
>
> > On Mon, Jul 28, 2014 at 10:13 PM, Eric W. Biederman
> > <[email protected]> wrote:
>
> >> Nope. What you can implement today if you want fine grained limitations
> >> like this is to create a mount namespace with exactly the subdirectory
> >> tree you want to allow access to and to return a file descriptor that
> >> points into that mount namespace. (When complete the only user of that
> >> mount namespace would be your file descriptor).
> >
> > How does that solve the particular example I mentioned? The DFD
> > within the mount namespace will still allow any operation on any file
> > that's already in the subdirectory -- or am I misunderstanding
> > something?
>
> The goal was to bound the DFD to the directory and all of its
> subdirectories such that openat(dfd, "../../..") would open
> the dfd, and that further opens of other directories would also not
> allow you to escape.
>
> Since the mount namespace only contains the chosen directory and its
> subdirectories that works easily and trivially.
>
> So while you can indeed perform any file operation on that dfd who
> cares because none of those operations can get you anywhere you aren't
> supposed to be.
>
> My point was that you can be as granular as you would like by binding a dfd
> to a mount namespace instead of binding a process to a mount namespace,
> and the code already exists and is being maintained.
>
> So while things are not packaged in the form that has been requested it
> looks to me as if the functionality for directories already exists
> within the Linux kernel.

I think this would be amazingly expensive -- every constrained fd
would need to carry an entire mount namespace with it. That namespace
might need to have shared recursive mounts under it. And dfds created
for subdirectories would need yet another mount namespace. And all
these mount namespaces would probably need user namespaces to go along
with them.

It would also have odd semantics. If you have a dfd pointing to /foo,
and /foo/link is a symlink to "../bar", then looking up "link"
relative to /foo should fail; it should not try to resolve /foo/bar.

IOW I think this is impractical.

>
> Eric

2014-07-30 14:52:57

by Andy Lutomirski

Subject: Re: [PATCH 11/11] seccomp: Add tgid and tid into seccomp_data

On Jul 29, 2014 10:57 PM, "Eric W. Biederman" <[email protected]> wrote:
>
> Andy Lutomirski <[email protected]> writes:
>
> > On Tue, Jul 29, 2014 at 9:08 PM, Eric W. Biederman
> > <[email protected]> wrote:
> >> Andy Lutomirski <[email protected]> writes:
> >>
> >>> On Mon, Jul 28, 2014 at 2:18 PM, Eric W. Biederman
> >>> <[email protected]> wrote:
> >>>> Andy Lutomirski <[email protected]> writes:
> >>>>
> >>>>> [cc: Eric Biederman]
> >>>>>
> >>>>
> >>>>> Can we do one better and add a flag to prevent any non-self pid
> >>>>> lookups? This might actually be easy on top of the pid namespace work
> >>>>> (e.g. we could change the way that find_task_by_vpid works).
> >>>>>
> >>>>> It's far from just being signals. There's access_process_vm, ptrace,
> >>>>> all the signal functions, clock_gettime (see CPUCLOCK_PID -- yes, this
> >>>>> is ridiculous), and probably some others that I've forgotten about or
> >>>>> never noticed in the first place.
> >>>>
> >>>> So here is the practical question.
> >>>>
> >>>> Are these processes that only can send signals to their thread group
> >>>> allowed to call fork()?
> >>>>
> >>>>
> >>>> If fork is allowed and all pid lookups are restricted to their own
> >>>> thread group then wait, waitpid, and all of the rest of the wait family
> >>>> will never return the pids of their children, and zombies will
> >>>> accumulate. Aka the semantics are fundamentally broken.
> >>>
> >>> Good point.
> >>>
> >>> I can imagine at least three ways that fork() could continue working, though:
> >>>
> >>> 1. Allow lookups of immediate children, too. (I don't love this one.)
> >>> 2. Allow non-self pids to be translated in but not out. This way
> >>> P_ALL will continue working.
> >>> 3. Have the kernel treat any PID-restricted process as though it were NOCLDWAIT.
> >>>
> >>> I think I like #3. Thoughts?
> >>>
> >>>>
> >>>> If fork is not allowed pid namespaces already solve this problem.
> >>>
> >>> PID namespaces are fairly heavyweight. Julien pointed out that using
> >>> PID namespaces requires a bunch of dummy PID 1 processes.
> >>
> >> Only if you can't tolerate init exiting. The reasoning with respect to
> >> signals and signals being ignored was wrong. And if you only have one
> >> process you care about and no children to worry about neither the
> >> difference in signal handling nor the world dies when init exits applies.
> >
> > Can you elaborate? It seems entirely plausible to me that there are
> > programs that won't work right as PID 1 without considerable
> > adaptation.
>
> The only funny things about pid 1 of a pid namespace are:
> - children can't send signals to pid 1 unless a signal handler has
> been established.
> - All children die when the parent dies.
> - Grand children become zombies of the parent when the children die.
> - The pid is 1.
>
> That is, almost everything is the same and it takes almost no adaptation
> (really) to run as the initial pid in a pid namespace.
>
> Not being able to receive signals (which is the argument I read against
> them) is bogus. You just have to set your signal handler to something
> besides SIG_DFL.
>
> So I have my question: What is the use case people are trying to solve
> by filtering signals and pid lookups? If children are not part of the
> goal a pid namespace will work just fine.
>
> >> Therefore given what I have read described pid namespaces are a trivial
> >> solution to this problem space.
> >
> > pid namespaces also won't work in the context of Capsicum unless you
> > want every single Capsicum process to be its own pid namespace.
>
> For a tightly bound process I don't see why each process could not be
> its own pid namespace.

Two main reasons: You can't put yourself in a pid namespace, so you
need to fork into your sandbox, and you can't prevent yourself from
seeing your children (although, as noted, my approach has issues here,
too, but I think this is more easily solved outside the context of
namespaces).
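
To make the first point concrete: unshare(CLONE_NEWPID) leaves the caller
in its old pid namespace and only its subsequent children land in the new
one. A minimal sketch, assuming unprivileged user namespaces are enabled
(the function name is hypothetical, and the raw syscall is used only so
the example does not depend on _GNU_SOURCE):

```c
#include <assert.h>
#include <linux/sched.h>    /* CLONE_NEWUSER, CLONE_NEWPID */
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the first child forked after unshare() sees itself as
 * pid 1 while the caller's own pid is unchanged, 0 if not, and -1 if
 * namespaces are unavailable here. Runs in a helper process so the
 * caller's namespaces are left untouched. */
int first_child_is_pid1_after_unshare(void)
{
    pid_t helper = fork();

    if (helper < 0)
        return -1;
    if (helper == 0) {
        pid_t before = getpid();

        if (syscall(SYS_unshare, CLONE_NEWUSER | CLONE_NEWPID) != 0)
            _exit(2);                       /* not permitted here */
        if (getpid() != before)
            _exit(1);                       /* caller itself moved? */

        pid_t child = fork();               /* first process in new ns */
        if (child < 0)
            _exit(2);
        if (child == 0)
            _exit(getpid() == 1 ? 0 : 1);

        int st;
        if (waitpid(child, &st, 0) < 0 || !WIFEXITED(st))
            _exit(2);
        _exit(WEXITSTATUS(st));
    }
    assert(helper > 0);

    int status;
    if (waitpid(helper, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:  return 1;
    case 2:  return -1;
    default: return 0;
    }
}
```

So a program that wants to sandbox itself this way has to restructure
around an extra fork, with the sandboxed work done in the child.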

>
> > Also,
> > pid namespaces don't offer any way to protect children from parents.
>
> And my presumption was that there were not any children because the
> semantics suggested so far do not properly support children.
>

I'd like to try to fix that.

Another approach: let waiting for zombies that are immediate children
be an exception.

--Andy

> Eric