Changes from v3:
* Ported on 2.6.25-rc3-mm1;
* Re-splitted into smaller pieces;
* Added more comments to tricky places.
This controller allows to tune the devices accessibility by tasks,
i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
access to IDE devices and completely hide SCSI disks.
Tasks still can call mknod to create device files, regardless of
whether the particular device is visible or accessible, but they
may not be able to open it later.
This one hides under CONFIG_CGROUP_DEVS option.
To play with it - run a standard procedure:
# mount -t container none /cont/devs -o devices
# mkdir /cont/devs/0
# echo -n $$ > /cont/devs/0/tasks
and tune device permissions.
The only configuration file called devices.permissions accepts
strings like '[cb] <major>:(<minor>|*) [r-][w-]' to provide read,
write or read-write access to a particular device. Asterisk as the
minor means "all devices with a given major".
This will be described in Documentation/controllers/devices.txt
file in more details.
Here are some historical notes.
The third version was here:
http://openvz.org/pipermail/devel/2008-February/010832.html
Changes from v2:
* Fixed problems pointed out by Sukadev with permissions
revoke. Now we have to perform kobject re-lookup on
each char device open, just like for block ones, so I
think this is OK.
The second version was here:
http://openvz.org/pipermail/devel/2008-January/010160.html
Changes from v1:
* Added the block devices support :) It turned out to
be a bit simpler than the char one (or I missed
something significant);
* Now we can enable/disable not just individual devices,
but the whole major with all its minors (see the TODO
list beyond as well);
* Added the ability to restrict the read/write permissions
to devices, not just visible/invisible state.
The first version was here:
http://openvz.org/pipermail/devel/2007-September/007647.html
Thanks,
Pavel
The number of chains in the kobj_map structure is hard-coded
with a magic constant of 255. Make a named one instead.
Signed-off-by: Pavel Emelyanov <[email protected]>
---
drivers/base/map.c | 21 ++++++++++++---------
1 files changed, 12 insertions(+), 9 deletions(-)
diff --git a/drivers/base/map.c b/drivers/base/map.c
index e87017f..7de565f 100644
--- a/drivers/base/map.c
+++ b/drivers/base/map.c
@@ -16,6 +16,8 @@
#include <linux/kobject.h>
#include <linux/kobj_map.h>
+#define KOBJ_MAP_PROBES 255
+
struct kobj_map {
struct probe {
struct probe *next;
@@ -25,7 +27,7 @@ struct kobj_map {
kobj_probe_t *get;
int (*lock)(dev_t, void *);
void *data;
- } *probes[255];
+ } *probes[KOBJ_MAP_PROBES];
struct mutex *lock;
};
@@ -38,8 +40,8 @@ int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
unsigned i;
struct probe *p;
- if (n > 255)
- n = 255;
+ if (n > KOBJ_MAP_PROBES)
+ n = KOBJ_MAP_PROBES;
p = kmalloc(sizeof(struct probe) * n, GFP_KERNEL);
@@ -56,7 +58,7 @@ int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
}
mutex_lock(domain->lock);
for (i = 0, p -= n; i < n; i++, p++, index++) {
- struct probe **s = &domain->probes[index % 255];
+ struct probe **s = &domain->probes[index % KOBJ_MAP_PROBES];
while (*s && (*s)->range < range)
s = &(*s)->next;
p->next = *s;
@@ -73,13 +75,14 @@ void kobj_unmap(struct kobj_map *domain, dev_t dev, unsigned long range)
unsigned i;
struct probe *found = NULL;
- if (n > 255)
- n = 255;
+ if (n > KOBJ_MAP_PROBES)
+ n = KOBJ_MAP_PROBES;
mutex_lock(domain->lock);
for (i = 0; i < n; i++, index++) {
struct probe **s;
- for (s = &domain->probes[index % 255]; *s; s = &(*s)->next) {
+ for (s = &domain->probes[index % KOBJ_MAP_PROBES];
+ *s; s = &(*s)->next) {
struct probe *p = *s;
if (p->dev == dev && p->range == range) {
*s = p->next;
@@ -101,7 +104,7 @@ struct kobject *kobj_lookup(struct kobj_map *domain, dev_t dev, int *index)
retry:
mutex_lock(domain->lock);
- for (p = domain->probes[MAJOR(dev) % 255]; p; p = p->next) {
+ for (p = domain->probes[MAJOR(dev) % KOBJ_MAP_PROBES]; p; p = p->next) {
struct kobject *(*probe)(dev_t, int *, void *);
struct module *owner;
void *data;
@@ -148,7 +151,7 @@ struct kobj_map *kobj_map_init(kobj_probe_t *base_probe, struct mutex *lock)
base->dev = 1;
base->range = ~0;
base->get = base_probe;
- for (i = 0; i < 255; i++)
+ for (i = 0; i < KOBJ_MAP_PROBES; i++)
p->probes[i] = base;
p->lock = lock;
return p;
--
1.5.3.4
The kobj_lookup() searches the kobject by its dev_t. Since this
mapping is going to be restricted with permissions, make the
kobj_lookup() return the mapping's permissions.
Currently the mode returned is unused.
Signed-off-by: Pavel Emelyanov <[email protected]>
---
block/genhd.c | 4 ++--
drivers/base/map.c | 8 ++++++--
fs/block_dev.c | 3 ++-
fs/char_dev.c | 4 +++-
include/linux/genhd.h | 2 +-
include/linux/kobj_map.h | 2 +-
6 files changed, 15 insertions(+), 8 deletions(-)
diff --git a/block/genhd.c b/block/genhd.c
index 9a7a903..1d1d0f2 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -210,12 +210,12 @@ void unlink_gendisk(struct gendisk *disk)
* This function gets the structure containing partitioning
* information for the given device @dev.
*/
-struct gendisk *get_gendisk(dev_t devt, int *part)
+struct gendisk *get_gendisk(dev_t devt, mode_t *mode, int *part)
{
struct kobject *kobj;
struct device *dev;
- kobj = kobj_lookup(bdev_map, devt, part);
+ kobj = kobj_lookup(bdev_map, devt, mode, part);
if (kobj == NULL)
return NULL;
diff --git a/drivers/base/map.c b/drivers/base/map.c
index c7e28a1..285a2d2 100644
--- a/drivers/base/map.c
+++ b/drivers/base/map.c
@@ -117,7 +117,8 @@ void kobj_unmap(struct kobj_map *domain, dev_t dev, unsigned long range)
kfree(found);
}
-struct kobject *kobj_lookup(struct kobj_map *domain, dev_t dev, int *index)
+struct kobject *kobj_lookup(struct kobj_map *domain, dev_t dev, mode_t *mode,
+ int *index)
{
struct kobject *kobj;
struct probe *p;
@@ -149,8 +150,11 @@ retry:
kobj = probe(dev, index, data);
/* Currently ->owner protects _only_ ->probe() itself. */
module_put(owner);
- if (kobj)
+ if (kobj) {
+ if (mode)
+ *mode = p->mode;
return kobj;
+ }
goto retry;
}
mutex_unlock(domain->lock);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index a0e9596..00dda91 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -934,11 +934,12 @@ static int do_open(struct block_device *bdev, struct file *file, int for_part)
struct module *owner = NULL;
struct gendisk *disk;
int ret = -ENXIO;
+ mode_t mode;
int part;
file->f_mapping = bdev->bd_inode->i_mapping;
lock_kernel();
- disk = get_gendisk(bdev->bd_dev, &part);
+ disk = get_gendisk(bdev->bd_dev, &mode, &part);
if (!disk) {
unlock_kernel();
bdput(bdev);
diff --git a/fs/char_dev.c b/fs/char_dev.c
index 68e510b..dceb579 100644
--- a/fs/char_dev.c
+++ b/fs/char_dev.c
@@ -367,8 +367,10 @@ static int chrdev_open(struct inode *inode, struct file *filp)
if (!p) {
struct kobject *kobj;
int idx;
+ mode_t mode;
+
spin_unlock(&cdev_lock);
- kobj = kobj_lookup(cdev_map, inode->i_rdev, &idx);
+ kobj = kobj_lookup(cdev_map, inode->i_rdev, &mode, &idx);
if (!kobj)
return -ENXIO;
new = container_of(kobj, struct cdev, kobj);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 09a3b18..e09df44 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -371,7 +371,7 @@ extern int get_blkdev_list(char *, int);
extern void add_disk(struct gendisk *disk);
extern void del_gendisk(struct gendisk *gp);
extern void unlink_gendisk(struct gendisk *gp);
-extern struct gendisk *get_gendisk(dev_t dev, int *part);
+extern struct gendisk *get_gendisk(dev_t dev, mode_t *mode, int *part);
extern void set_device_ro(struct block_device *bdev, int flag);
extern void set_disk_ro(struct gendisk *disk, int flag);
diff --git a/include/linux/kobj_map.h b/include/linux/kobj_map.h
index bafe178..867f307 100644
--- a/include/linux/kobj_map.h
+++ b/include/linux/kobj_map.h
@@ -8,7 +8,7 @@ struct kobj_map;
int kobj_map(struct kobj_map *, dev_t, unsigned long, struct module *,
kobj_probe_t *, int (*)(dev_t, void *), void *);
void kobj_unmap(struct kobj_map *, dev_t, unsigned long);
-struct kobject *kobj_lookup(struct kobj_map *, dev_t, int *);
+struct kobject *kobj_lookup(struct kobj_map *, dev_t, mode_t *, int *);
struct kobj_map *kobj_map_init(kobj_probe_t *, struct mutex *);
#endif
--
1.5.3.4
The current implementation first converts kobj to device and
then checks for kobj not to be NULL. This is OK, since the
converting macros is a container_of one, but this does not
look very nice.
Besides, I'll have to add some code _before_ the kobj_lookup()
and thus will have to move the call to kobj_lookup lower.
So do this job here to make the further patching cleaner.
Signed-off-by: Pavel Emelyanov <[email protected]>
---
block/genhd.c | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/block/genhd.c b/block/genhd.c
index e8cf05a..9a7a903 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -212,10 +212,15 @@ void unlink_gendisk(struct gendisk *disk)
*/
struct gendisk *get_gendisk(dev_t devt, int *part)
{
- struct kobject *kobj = kobj_lookup(bdev_map, devt, part);
- struct device *dev = kobj_to_dev(kobj);
+ struct kobject *kobj;
+ struct device *dev;
+
+ kobj = kobj_lookup(bdev_map, devt, part);
+ if (kobj == NULL)
+ return NULL;
- return kobj ? dev_to_disk(dev) : NULL;
+ dev = kobj_to_dev(kobj);
+ return dev_to_disk(dev);
}
/*
--
1.5.3.4
The idea of the set is to make the dev_t to struct kobj map
be per-cgroup, and provide a permissions for this mapping.
This mode mask will operate with FMODE_xxx modes.
To achieve this, I add the mode_t mode on the struct probe,
which sets this mapping, and move the locked part of the
kobj_map() into a separate function, which also accepts the
mode to be set on the probe.
By default all the modes are provided for a map.
Signed-off-by: Pavel Emelyanov <[email protected]>
---
drivers/base/map.c | 31 ++++++++++++++++++++++++++-----
1 files changed, 26 insertions(+), 5 deletions(-)
diff --git a/drivers/base/map.c b/drivers/base/map.c
index 7de565f..c7e28a1 100644
--- a/drivers/base/map.c
+++ b/drivers/base/map.c
@@ -15,6 +15,7 @@
#include <linux/kdev_t.h>
#include <linux/kobject.h>
#include <linux/kobj_map.h>
+#include <linux/fs.h>
#define KOBJ_MAP_PROBES 255
@@ -22,6 +23,7 @@ struct kobj_map {
struct probe {
struct probe *next;
dev_t dev;
+ mode_t mode;
unsigned long range;
struct module *owner;
kobj_probe_t *get;
@@ -31,9 +33,9 @@ struct kobj_map {
struct mutex *lock;
};
-int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
- struct module *module, kobj_probe_t *probe,
- int (*lock)(dev_t, void *), void *data)
+static int __kobj_map(struct kobj_map *domain, dev_t dev, mode_t mode,
+ unsigned long range, struct module *module,
+ kobj_probe_t *probe, int (*lock)(dev_t, void *), void *data)
{
unsigned n = MAJOR(dev + range - 1) - MAJOR(dev) + 1;
unsigned index = MAJOR(dev);
@@ -55,8 +57,15 @@ int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
p->dev = dev;
p->range = range;
p->data = data;
+ /*
+ * When opening a device we want to (dis)allow only
+ * read or write. But sys_open() can provide more
+ * modes like lseek or pread. So, these FMODE-s are
+ * always 'on'.
+ */
+ p->mode = mode | FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
}
- mutex_lock(domain->lock);
+
for (i = 0, p -= n; i < n; i++, p++, index++) {
struct probe **s = &domain->probes[index % KOBJ_MAP_PROBES];
while (*s && (*s)->range < range)
@@ -64,10 +73,22 @@ int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
p->next = *s;
*s = p;
}
- mutex_unlock(domain->lock);
return 0;
}
+int kobj_map(struct kobj_map *domain, dev_t dev, unsigned long range,
+ struct module *module, kobj_probe_t *probe,
+ int (*lock)(dev_t, void *), void *data)
+{
+ int err;
+
+ mutex_lock(domain->lock);
+ err = __kobj_map(domain, dev, FMODE_READ | FMODE_WRITE, range,
+ module, probe, lock, data);
+ mutex_unlock(domain->lock);
+ return err;
+}
+
void kobj_unmap(struct kobj_map *domain, dev_t dev, unsigned long range)
{
unsigned n = MAJOR(dev + range - 1) - MAJOR(dev) + 1;
--
1.5.3.4
The character devices currently have a global map for
devices and its own locks and callbacks for devices
probing. So here are the functions that provide the way
to play with character devices maps.
They are cdev_map_init/cdev_map_fini to create/destroy
one, cdev_add_to_map/cdev_del_from_map to map/unmap devices,
and cdev_iterate_map for uniformity.
Signed-off-by: Pavel Emelyanov <[email protected]>
---
fs/char_dev.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/cdev.h | 7 +++++
2 files changed, 70 insertions(+), 0 deletions(-)
diff --git a/fs/char_dev.c b/fs/char_dev.c
index d042446..7a4c1bd 100644
--- a/fs/char_dev.c
+++ b/fs/char_dev.c
@@ -476,6 +476,69 @@ int cdev_add(struct cdev *p, dev_t dev, unsigned count)
return kobj_map(cdev_map, dev, count, NULL, exact_match, exact_lock, p);
}
+#ifdef CONFIG_CGROUP_DEVS
+int cdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode)
+{
+ int tmp;
+ struct kobject *k;
+ struct cdev *c;
+
+ k = kobj_lookup(cdev_map, dev, NULL, &tmp);
+ if (k == NULL)
+ return -ENODEV;
+
+ c = container_of(k, struct cdev, kobj);
+ tmp = kobj_remap(map, dev, mode, all ? MINORMASK : 1, NULL,
+ exact_match, exact_lock, c);
+ if (tmp < 0) {
+ cdev_put(c);
+ return tmp;
+ }
+
+ return 0;
+}
+
+int cdev_del_from_map(struct kobj_map *map, dev_t dev, int all)
+{
+ int tmp;
+ struct kobject *k;
+ struct cdev *c;
+
+ k = kobj_lookup(map, dev, NULL, &tmp);
+ if (k == NULL)
+ return -ENODEV;
+
+ c = container_of(k, struct cdev, kobj);
+ kobj_unmap(map, dev, all ? MINORMASK : 1);
+
+ /*
+ * one put for the kobj_lookup above and one for
+ * the kobj_lookup in cdev_add_to_map
+ */
+ cdev_put(c);
+ cdev_put(c);
+ return 0;
+}
+
+void cdev_iterate_map(struct kobj_map *map,
+ int (*fn)(dev_t, int, mode_t, void *), void *x)
+{
+ kobj_map_iterate(map, fn, x);
+}
+
+static struct kobject *base_probe(dev_t dev, int *part, void *data);
+
+struct kobj_map *cdev_map_init(void)
+{
+ return kobj_map_init(base_probe, &chrdevs_lock);
+}
+
+void cdev_map_fini(struct kobj_map *map)
+{
+ kobj_map_fini(map);
+}
+#endif
+
static void cdev_unmap(dev_t dev, unsigned count)
{
kobj_unmap(cdev_map, dev, count);
diff --git a/include/linux/cdev.h b/include/linux/cdev.h
index 1e29b13..14d377b 100644
--- a/include/linux/cdev.h
+++ b/include/linux/cdev.h
@@ -33,5 +33,12 @@ void cd_forget(struct inode *);
extern struct backing_dev_info directly_mappable_cdev_bdi;
+struct kobj_map;
+int cdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode);
+int cdev_del_from_map(struct kobj_map *map, dev_t dev, int all);
+struct kobj_map *cdev_map_init(void);
+void cdev_map_fini(struct kobj_map *map);
+void cdev_iterate_map(struct kobj_map *,
+ int (*fn)(dev_t, int, mode_t, void *), void *);
#endif
#endif
--
1.5.3.4
Finally, here's the control group, which makes full use of the
interfaces, declared in the previous patches.
Signed-off-by: Pavel Emelyanov <[email protected]>
---
Documentation/controllers/devices.txt | 61 +++++++
fs/Makefile | 2 +
fs/devscontrol.c | 314 +++++++++++++++++++++++++++++++++
include/linux/cgroup_subsys.h | 6 +
include/linux/devscontrol.h | 14 ++
init/Kconfig | 13 ++
6 files changed, 410 insertions(+), 0 deletions(-)
create mode 100644 Documentation/controllers/devices.txt
create mode 100644 fs/devscontrol.c
diff --git a/Documentation/controllers/devices.txt b/Documentation/controllers/devices.txt
new file mode 100644
index 0000000..630586c
--- /dev/null
+++ b/Documentation/controllers/devices.txt
@@ -0,0 +1,61 @@
+
+ Devices visibility controller
+
+This controller allows to tune the devices accessibility by tasks,
+i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
+access to IDE devices and completely hide SCSI disks.
+
+Tasks still can call mknod to create device files, regardless of
+whether the particular device is visible or accessible, but they
+may not be able to open it later.
+
+This one hides under CONFIG_CGROUP_DEVS option.
+
+
+Configuring
+
+The controller provides a single file to configure itself -- the
+devices.permissions one. To change the accessibility level for some
+device write the following string into it:
+
+[cb] <major>:(<minor>|*) [r-][w-]
+ ^ ^ ^
+ | | |
+ | | +--- access rights (1)
+ | |
+ | +-- device major and minor numbers (2)
+ |
+ +-- device type (character / block)
+
+1) The access rights set to '--' remove the device from the group's
+access list, so that it will not even be shown in this file later.
+
+2) Setting the minor to '*' grants access to all the minors for
+particular major.
+
+When reading from it, one may see something like
+
+ c 1:5 rw
+ b 8:* r-
+
+Security issues, concerning who may grant access to what are governed
+at the cgroup infrastructure level.
+
+
+Examples:
+
+1. Grant full access to /dev/null
+ # echo 'c 1:3 rw' > /cgroups/<id>/devices.permissions
+
+2. Grant the read-only access to /dev/sda and partitions
+ # echo 'b 8:* r-' > ...
+
+3. Change the /dev/null access to write-only
+ # echo 'c 1:3 -w' > ...
+
+4. Revoke access to /dev/sda
+ # echo 'b 8:* --' > ...
+
+
+ Written by Pavel Emelyanov <[email protected]>
+
diff --git a/fs/Makefile b/fs/Makefile
index 7996220..5ad03be 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -64,6 +64,8 @@ obj-y += devpts/
obj-$(CONFIG_PROFILING) += dcookies.o
obj-$(CONFIG_DLM) += dlm/
+
+obj-$(CONFIG_CGROUP_DEVS) += devscontrol.o
# Do not add any filesystems before this line
obj-$(CONFIG_REISERFS_FS) += reiserfs/
diff --git a/fs/devscontrol.c b/fs/devscontrol.c
new file mode 100644
index 0000000..4a664cf
--- /dev/null
+++ b/fs/devscontrol.c
@@ -0,0 +1,314 @@
+/*
+ * devscontrol.c - Device Controller
+ *
+ * Copyright 2008 OpenVZ Parallels Inc
+ * Author: Pavel Emelyanov <xemul at openvz dot org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/cdev.h>
+#include <linux/err.h>
+#include <linux/devscontrol.h>
+#include <linux/uaccess.h>
+#include <linux/fs.h>
+#include <linux/genhd.h>
+
+struct devs_cgroup {
+ /*
+ * The subsys state to build into cgroups infrastructure
+ */
+ struct cgroup_subsys_state css;
+
+ /*
+ * The maps of character and block devices. They provide a
+ * map from dev_t-s to struct cdev/gendisk. See fs/char_dev.c
+ * and block/genhd.c to find out how the ->open() callbacks
+ * work when opening a device.
+ *
+ * Each group will have its own maps, and at the open()
+ * time code will lookup in this map to get the device
+ * and permissions by its dev_t.
+ */
+ struct kobj_map *cdev_map;
+ struct kobj_map *bdev_map;
+};
+
+static inline
+struct devs_cgroup *css_to_devs(struct cgroup_subsys_state *css)
+{
+ return container_of(css, struct devs_cgroup, css);
+}
+
+static inline
+struct devs_cgroup *cgroup_to_devs(struct cgroup *cont)
+{
+ return css_to_devs(cgroup_subsys_state(cont, devs_subsys_id));
+}
+
+struct kobj_map *task_cdev_map(struct task_struct *tsk)
+{
+ struct cgroup_subsys_state *css;
+
+ css = task_subsys_state(tsk, devs_subsys_id);
+ if (css->cgroup->parent == NULL)
+ return NULL;
+ else
+ return css_to_devs(css)->cdev_map;
+}
+
+struct kobj_map *task_bdev_map(struct task_struct *tsk)
+{
+ struct cgroup_subsys_state *css;
+
+ css = task_subsys_state(tsk, devs_subsys_id);
+ if (css->cgroup->parent == NULL)
+ return NULL;
+ else
+ return css_to_devs(css)->bdev_map;
+}
+
+static struct cgroup_subsys_state *
+devs_create(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ struct devs_cgroup *devs;
+
+ devs = kzalloc(sizeof(struct devs_cgroup), GFP_KERNEL);
+ if (devs == NULL)
+ goto out;
+
+ devs->cdev_map = cdev_map_init();
+ if (devs->cdev_map == NULL)
+ goto out_free;
+
+ devs->bdev_map = bdev_map_init();
+ if (devs->bdev_map == NULL)
+ goto out_free_cdev;
+
+ return &devs->css;
+
+out_free_cdev:
+ cdev_map_fini(devs->cdev_map);
+out_free:
+ kfree(devs);
+out:
+ return ERR_PTR(-ENOMEM);
+}
+
+static void devs_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ struct devs_cgroup *devs;
+
+ devs = cgroup_to_devs(cont);
+ bdev_map_fini(devs->bdev_map);
+ cdev_map_fini(devs->cdev_map);
+ kfree(devs);
+}
+
+/*
+ * The devices.permissions file read/write functionality
+ *
+ * The following two routines parse and print the strings like
+ * [cb] <major>:(<minor>|*) [r-][w-]
+ */
+
+static int decode_perms_str(char *buf, int *chrdev, dev_t *dev,
+ int *all, mode_t *mode)
+{
+ unsigned int major, minor;
+ char *end;
+ mode_t tmp;
+
+ if (buf[0] == 'c')
+ *chrdev = 1;
+ else if (buf[0] == 'b')
+ *chrdev = 0;
+ else
+ return -EINVAL;
+
+ if (buf[1] != ' ')
+ return -EINVAL;
+
+ major = simple_strtoul(buf + 2, &end, 10);
+ if (*end != ':')
+ return -EINVAL;
+
+ if (end[1] == '*') {
+ if (end[2] != ' ')
+ return -EINVAL;
+
+ *all = 1;
+ minor = 0;
+ end += 2;
+ } else {
+ minor = simple_strtoul(end + 1, &end, 10);
+ if (*end != ' ')
+ return -EINVAL;
+
+ *all = 0;
+ }
+
+ tmp = 0;
+
+ if (end[1] == 'r')
+ tmp |= FMODE_READ;
+ else if (end[1] != '-')
+ return -EINVAL;
+ if (end[2] == 'w')
+ tmp |= FMODE_WRITE;
+ else if (end[2] != '-')
+ return -EINVAL;
+
+ *dev = MKDEV(major, minor);
+ *mode = tmp;
+ return 0;
+}
+
+static int encode_perms_str(char *buf, int len, int chrdev, dev_t dev,
+ int all, mode_t mode)
+{
+ int ret;
+
+ ret = snprintf(buf, len, "%c %d:", chrdev ? 'c' : 'b', MAJOR(dev));
+ if (all)
+ ret += snprintf(buf + ret, len - ret, "*");
+ else
+ ret += snprintf(buf + ret, len - ret, "%d", MINOR(dev));
+
+ ret += snprintf(buf + ret, len - ret, " %c%c\n",
+ (mode & FMODE_READ) ? 'r' : '-',
+ (mode & FMODE_WRITE) ? 'w' : '-');
+
+ return ret;
+}
+
+static ssize_t devs_write(struct cgroup *cont, struct cftype *cft,
+ struct file *f, const char __user *ubuf,
+ size_t nbytes, loff_t *pos)
+{
+ int err, all, chrdev;
+ dev_t dev;
+ char buf[64];
+ struct devs_cgroup *devs;
+ mode_t mode;
+
+ if (copy_from_user(buf, ubuf, sizeof(buf)))
+ return -EFAULT;
+
+ buf[sizeof(buf) - 1] = 0;
+ err = decode_perms_str(buf, &chrdev, &dev, &all, &mode);
+ if (err < 0)
+ return err;
+
+ devs = cgroup_to_devs(cont);
+
+ /*
+ * No locking here is required - all that we need
+ * is provided inside the kobject mapping code
+ */
+
+ if (mode == 0) {
+ if (chrdev)
+ err = cdev_del_from_map(devs->cdev_map, dev, all);
+ else
+ err = bdev_del_from_map(devs->bdev_map, dev, all);
+
+ if (err < 0)
+ return err;
+
+ css_put(&devs->css);
+ } else {
+ if (chrdev)
+ err = cdev_add_to_map(devs->cdev_map, dev, all, mode);
+ else
+ err = bdev_add_to_map(devs->bdev_map, dev, all, mode);
+
+ if (err < 0)
+ return err;
+
+ css_get(&devs->css);
+ }
+
+ return nbytes;
+}
+
+struct devs_dump_arg {
+ char *buf;
+ int pos;
+ int chrdev;
+};
+
+static int devs_dump_one(dev_t dev, int range, mode_t mode, void *x)
+{
+ struct devs_dump_arg *arg = x;
+ char tmp[64];
+ int len;
+
+ len = encode_perms_str(tmp, sizeof(tmp), arg->chrdev, dev,
+ range != 1, mode);
+
+ if (arg->pos >= PAGE_SIZE - len)
+ return 1;
+
+ memcpy(arg->buf + arg->pos, tmp, len);
+ arg->pos += len;
+ return 0;
+}
+
+static ssize_t devs_read(struct cgroup *cont, struct cftype *cft,
+ struct file *f, char __user *ubuf, size_t nbytes, loff_t *pos)
+{
+ struct devs_dump_arg arg;
+ struct devs_cgroup *devs;
+ ssize_t ret;
+
+ arg.buf = (char *)__get_free_page(GFP_KERNEL);
+ if (arg.buf == NULL)
+ return -ENOMEM;
+
+ devs = cgroup_to_devs(cont);
+ arg.pos = 0;
+
+ arg.chrdev = 1;
+ cdev_iterate_map(devs->cdev_map, devs_dump_one, &arg);
+
+ arg.chrdev = 0;
+ bdev_iterate_map(devs->bdev_map, devs_dump_one, &arg);
+
+ ret = simple_read_from_buffer(ubuf, nbytes, pos,
+ arg.buf, arg.pos);
+
+ free_page((unsigned long)arg.buf);
+ return ret;
+}
+
+static struct cftype devs_files[] = {
+ {
+ .name = "permissions",
+ .write = devs_write,
+ .read = devs_read,
+ },
+};
+
+static int devs_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ return cgroup_add_files(cont, ss,
+ devs_files, ARRAY_SIZE(devs_files));
+}
+
+struct cgroup_subsys devs_subsys = {
+ .name = "devices",
+ .subsys_id = devs_subsys_id,
+ .create = devs_create,
+ .destroy = devs_destroy,
+ .populate = devs_populate,
+};
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 1ddebfc..212c735 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -42,3 +42,9 @@ SUBSYS(mem_cgroup)
#endif
/* */
+
+#ifdef CONFIG_CGROUP_DEVS
+SUBSYS(devs)
+#endif
+
+/* */
diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
index 04c168b..ae5f801 100644
--- a/include/linux/devscontrol.h
+++ b/include/linux/devscontrol.h
@@ -1,5 +1,18 @@
#ifndef __DEVS_CONTROL_H__
#define __DEVS_CONTROL_H__
+struct kobj_map;
+struct task_struct;
+
+/*
+ * task_[cb]dev_map - get a map from task. Both calls may return
+ * NULL, to indicate, that task doesn't belong to any group and
+ * that the global map is to be used.
+ */
+
+#ifdef CONFIG_CGROUP_DEVS
+struct kobj_map *task_cdev_map(struct task_struct *);
+struct kobj_map *task_bdev_map(struct task_struct *);
+#else
static inline struct kobj_map *task_cdev_map(struct task_struct *tsk)
{
return NULL;
@@ -10,3 +23,4 @@ static inline struct kobj_map *task_bdev_map(struct task_struct *tsk)
return NULL;
}
#endif
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index 77b2ac8..1b7e671 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -289,6 +289,19 @@ config CGROUP_DEBUG
Say N if unsure
+config CGROUP_DEVS
+ bool "Devices cgroup subsystem"
+ depends on CGROUPS
+ help
+ Controlls the access rights to devices, i.e. you may hide
+ some of them from tasks, so that they will not be able
+ to open them, or you may grant a read-only access to some
+ of them.
+
+ See Documentation/controllers/devices.txt for details.
+
+ This is harmless to say N here, so do it if unsure.
+
config CGROUP_NS
bool "Namespace cgroup subsystem"
depends on CGROUPS
--
1.5.3.4
This includes the following functions:
1. kobj_remap - this one will change the mapping's permissions
or add a new one if required.
2. kobj_map_iterate will walk the map, calling the callback
on each element.
3. kobj_map_fini is a rollback for kobj_map_finid - cleans all
the mappings.
Signed-off-by: Pavel Emelyanov <[email protected]>
---
drivers/base/map.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/kobj_map.h | 6 +++
2 files changed, 90 insertions(+), 0 deletions(-)
diff --git a/drivers/base/map.c b/drivers/base/map.c
index 285a2d2..8e2e1c2 100644
--- a/drivers/base/map.c
+++ b/drivers/base/map.c
@@ -181,3 +181,87 @@ struct kobj_map *kobj_map_init(kobj_probe_t *base_probe, struct mutex *lock)
p->lock = lock;
return p;
}
+
+#ifdef CONFIG_CGROUP_DEVS
+int kobj_remap(struct kobj_map *domain, dev_t dev, mode_t mode,
+ unsigned long range, struct module *module,
+ kobj_probe_t *probe, int (*lock)(dev_t, void *), void *data)
+{
+ unsigned n = MAJOR(dev + range - 1) - MAJOR(dev) + 1;
+ unsigned index = MAJOR(dev);
+ unsigned i;
+ int err = -ESRCH;
+
+ if (n > KOBJ_MAP_PROBES)
+ n = KOBJ_MAP_PROBES;
+
+ mutex_lock(domain->lock);
+ for (i = 0; i < n; i++, index++) {
+ struct probe **s;
+ for (s = &domain->probes[index % KOBJ_MAP_PROBES];
+ *s; s = &(*s)->next) {
+ struct probe *p = *s;
+ if (p->dev == dev) {
+ p->mode = mode | FMODE_LSEEK |
+ FMODE_PREAD | FMODE_PWRITE;
+ err = 0;
+ break;
+ }
+ }
+ }
+
+ if (err)
+ err = __kobj_map(domain, dev, mode, range, module,
+ probe, lock, data);
+ mutex_unlock(domain->lock);
+ return err;
+}
+
+void kobj_map_iterate(struct kobj_map *domain,
+ int (*fn)(dev_t, int, mode_t, void *), void *arg)
+{
+ int i;
+ struct probe *p;
+ dev_t skip = MKDEV(0, 0);
+
+ mutex_lock(domain->lock);
+ for (i = 0; i < KOBJ_MAP_PROBES; i++) {
+ p = domain->probes[i];
+ while (p != NULL) {
+ if (p->dev == skip)
+ goto next;
+ /*
+ * this 'device' is special - see the kobj_map_init why
+ */
+ if (p->dev == 1)
+ goto next;
+
+ skip = p->dev;
+ if (fn(p->dev, p->range, p->mode, arg))
+ goto done;
+next:
+ p = p->next;
+ }
+ }
+done:
+ mutex_unlock(domain->lock);
+}
+
+void kobj_map_fini(struct kobj_map *map)
+{
+ int i;
+ struct probe *p, *next;
+
+ for (i = 0; i < KOBJ_MAP_PROBES; i++) {
+ p = map->probes[i];
+ while (p->next != NULL) {
+ next = p->next;
+ kfree(p);
+ p = next;
+ }
+ }
+
+ kfree(p);
+ kfree(map);
+}
+#endif
diff --git a/include/linux/kobj_map.h b/include/linux/kobj_map.h
index 867f307..35a670d 100644
--- a/include/linux/kobj_map.h
+++ b/include/linux/kobj_map.h
@@ -11,4 +11,10 @@ void kobj_unmap(struct kobj_map *, dev_t, unsigned long);
struct kobject *kobj_lookup(struct kobj_map *, dev_t, mode_t *, int *);
struct kobj_map *kobj_map_init(kobj_probe_t *, struct mutex *);
+void kobj_map_iterate(struct kobj_map *, int (*fn)(dev_t, int, mode_t, void *),
+ void *);
+int kobj_remap(struct kobj_map *, dev_t, mode_t, unsigned long, struct module *,
+ kobj_probe_t *, int (*)(dev_t, void *), void *);
+void kobj_map_fini(struct kobj_map *);
+
#endif
--
1.5.3.4
Now check the requesting permissions against the granted
(with the dev_t-to-kobj map) ones.
The tricky place is chrdev_open - it caches the struct cdev
on inode and thus, we have to perform lookup each time
if we are in a restricted mapping.
The task_cdev_map and task_bdev_map provide the map which
the current task is in, but now they just return NULL, which
means, that the task is not in any.
Signed-off-by: Pavel Emelyanov <[email protected]>
---
block/genhd.c | 8 +++++++-
fs/block_dev.c | 8 ++++++++
fs/char_dev.c | 18 ++++++++++++++++--
include/linux/devscontrol.h | 12 ++++++++++++
4 files changed, 43 insertions(+), 3 deletions(-)
create mode 100644 include/linux/devscontrol.h
diff --git a/block/genhd.c b/block/genhd.c
index 1d1d0f2..a619158 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -8,6 +8,7 @@
#include <linux/kdev_t.h>
#include <linux/kernel.h>
#include <linux/blkdev.h>
+#include <linux/devscontrol.h>
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/seq_file.h>
@@ -212,10 +213,15 @@ void unlink_gendisk(struct gendisk *disk)
*/
struct gendisk *get_gendisk(dev_t devt, mode_t *mode, int *part)
{
+ struct kobj_map *map;
struct kobject *kobj;
struct device *dev;
- kobj = kobj_lookup(bdev_map, devt, mode, part);
+ map = task_bdev_map(current);
+ if (map == NULL)
+ map = bdev_map;
+
+ kobj = kobj_lookup(map, devt, mode, part);
if (kobj == NULL)
return NULL;
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 00dda91..34dc607 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -945,6 +945,14 @@ static int do_open(struct block_device *bdev, struct file *file, int for_part)
bdput(bdev);
return ret;
}
+
+ if ((file->f_mode & mode) != file->f_mode) {
+ unlock_kernel();
+ bdput(bdev);
+ put_disk(disk);
+ return -EACCES;
+ }
+
owner = disk->fops->owner;
mutex_lock_nested(&bdev->bd_mutex, for_part);
diff --git a/fs/char_dev.c b/fs/char_dev.c
index dceb579..d042446 100644
--- a/fs/char_dev.c
+++ b/fs/char_dev.c
@@ -22,6 +22,8 @@
#include <linux/mutex.h>
#include <linux/backing-dev.h>
+#include <linux/devscontrol.h>
+
#ifdef CONFIG_KMOD
#include <linux/kmod.h>
#endif
@@ -361,19 +363,31 @@ static int chrdev_open(struct inode *inode, struct file *filp)
struct cdev *p;
struct cdev *new = NULL;
int ret = 0;
+ struct kobj_map *map;
+
+ map = task_cdev_map(current);
+ if (map == NULL)
+ map = cdev_map;
spin_lock(&cdev_lock);
p = inode->i_cdev;
- if (!p) {
+ if (!p || map != cdev_map) {
struct kobject *kobj;
int idx;
mode_t mode;
spin_unlock(&cdev_lock);
- kobj = kobj_lookup(cdev_map, inode->i_rdev, &mode, &idx);
+ kobj = kobj_lookup(map, inode->i_rdev, &mode, &idx);
if (!kobj)
return -ENXIO;
new = container_of(kobj, struct cdev, kobj);
+ BUG_ON(p != NULL && p != new);
+
+ if ((filp->f_mode & mode) != filp->f_mode) {
+ cdev_put(new);
+ return -EACCES;
+ }
+
spin_lock(&cdev_lock);
p = inode->i_cdev;
if (!p) {
diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
new file mode 100644
index 0000000..04c168b
--- /dev/null
+++ b/include/linux/devscontrol.h
@@ -0,0 +1,12 @@
+#ifndef __DEVS_CONTROL_H__
+#define __DEVS_CONTROL_H__
+static inline struct kobj_map *task_cdev_map(struct task_struct *tsk)
+{
+ return NULL;
+}
+
+static inline struct kobj_map *task_bdev_map(struct task_struct *tsk)
+{
+ return NULL;
+}
+#endif
--
1.5.3.4
This routines were already done for char devices, these ones
are for block devices. Everything is the same, but the locks,
probe and match callbacks and kobj-to-device conversions.
Signed-off-by: Pavel Emelyanov <[email protected]>
---
block/genhd.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/genhd.h | 8 ++++++
2 files changed, 75 insertions(+), 0 deletions(-)
diff --git a/block/genhd.c b/block/genhd.c
index a619158..f407f62 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -204,6 +204,73 @@ void unlink_gendisk(struct gendisk *disk)
disk->minors);
}
+#ifdef CONFIG_CGROUP_DEVS
+int bdev_add_to_map(struct kobj_map *map, dev_t dev, int all, mode_t mode)
+{
+ int tmp;
+ struct kobject *kobj;
+ struct device *d;
+ struct gendisk *disk;
+
+ kobj = kobj_lookup(bdev_map, dev, NULL, &tmp);
+ if (kobj == NULL)
+ return -ENODEV;
+
+ d = kobj_to_dev(kobj);
+ disk = dev_to_disk(d);
+ tmp = kobj_remap(map, dev, mode, all ? MINORBITS : 1, NULL,
+ exact_match, exact_lock, disk);
+ if (tmp < 0) {
+ put_disk(disk);
+ return tmp;
+ }
+
+ return 0;
+}
+
+int bdev_del_from_map(struct kobj_map *map, dev_t dev, int all)
+{
+ int tmp;
+ struct kobject *kobj;
+ struct device *d;
+ struct gendisk *disk;
+
+ kobj = kobj_lookup(map, dev, NULL, &tmp);
+ if (kobj == NULL)
+ return -ENODEV;
+
+ d = kobj_to_dev(kobj);
+ disk = dev_to_disk(d);
+ kobj_unmap(map, dev, all ? MINORBITS : 1);
+
+ /*
+ * one put for the kobj_lookup above and one for
+ * the kobj_lookup in bdev_add_to_map
+ */
+ put_disk(disk);
+ put_disk(disk);
+ return 0;
+}
+
+void bdev_iterate_map(struct kobj_map *map,
+ int (*fn)(dev_t, int, mode_t, void *), void *x)
+{
+ kobj_map_iterate(map, fn, x);
+}
+
+static struct kobject *base_probe(dev_t devt, int *part, void *data);
+
+struct kobj_map *bdev_map_init(void)
+{
+ return kobj_map_init(base_probe, &block_class_lock);
+}
+
+void bdev_map_fini(struct kobj_map *map)
+{
+ kobj_map_fini(map);
+}
+#endif
+
/**
* get_gendisk - get partitioning information for a given device
* @dev: device to get partitioning information for
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index e09df44..c5483fc 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -373,6 +373,14 @@ extern void del_gendisk(struct gendisk *gp);
extern void unlink_gendisk(struct gendisk *gp);
extern struct gendisk *get_gendisk(dev_t dev, mode_t *mode, int *part);
+struct kobj_map;
+extern int bdev_add_to_map(struct kobj_map *, dev_t dev, int all, mode_t mode);
+extern int bdev_del_from_map(struct kobj_map *map, dev_t dev, int all);
+extern void bdev_iterate_map(struct kobj_map *map,
+ int (*fn)(dev_t, int, mode_t, void *), void *x);
+extern struct kobj_map *bdev_map_init(void);
+extern void bdev_map_fini(struct kobj_map *map);
+
extern void set_device_ro(struct block_device *bdev, int flag);
extern void set_disk_ro(struct gendisk *disk, int flag);
--
1.5.3.4
On Wed, 05 Mar 2008 20:37:40 +0300
Pavel Emelyanov <[email protected]> wrote:
> diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
> new file mode 100644
> index 0000000..04c168b
> --- /dev/null
> +++ b/include/linux/devscontrol.h
> @@ -0,0 +1,12 @@
> +#ifndef __DEVS_CONTROL_H__
> +#define __DEVS_CONTROL_H__
> +static inline struct kobj_map *task_cdev_map(struct task_struct *tsk)
> +{
> + return NULL;
> +}
> +
> +static inline struct kobj_map *task_bdev_map(struct task_struct *tsk)
> +{
> + return NULL;
> +}
> +#endif
This doesn't include sufficient headers to be compileable.
I'm sure there are lots of headers like this. But we regularly need
to fix them.
On Wed, Mar 05, 2008 at 08:23:35PM +0300, Pavel Emelyanov wrote:
> Changes from v3:
> * Ported on 2.6.25-rc3-mm1;
> * Re-splitted into smaller pieces;
> * Added more comments to tricky places.
>
> This controller allows to tune the devices accessibility by tasks,
> i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
> access to IDE devices and completely hide SCSI disks.
From within the kernel itself? The kernel should not be keeping track
of the mode of devices, that's what the filesystem holding /dev is for.
Those modes change all the time depending on the device plugged in, and
the user using the "console". Why should the kernel need to worry about
any of this?
> Tasks still can call mknod to create device files, regardless of
> whether the particular device is visible or accessible, but they
> may not be able to open it later.
>
> This one hides under CONFIG_CGROUP_DEVS option.
>
> To play with it - run a standard procedure:
>
> # mount -t container none /cont/devs -o devices
> # mkdir /cont/devs/0
> # echo -n $$ > /cont/devs/0/tasks
What is /cont/ for?
> and tune device permissions.
How is this done?
Why would the kernel care about this stuff?
confused,
greg k-h
On Wed, Mar 05, 2008 at 08:47:26PM +0300, Pavel Emelyanov wrote:
> Finally, here's the control group, which makes full use of the
> interfaces, declared in the previous patches.
>
> Signed-off-by: Pavel Emelyanov <[email protected]>
>
> ---
> Documentation/controllers/devices.txt | 61 +++++++
The information here should be in Documentation/ABI/ right?
I still don't see how changing these permissions within the kernel is
going to affect anything outside of it, nor why you want to be tracking
this within the kernel itself.
What's wrong with just having different /dev trees for the different
containers? That would not require any kernel changes, and you don't
have to write a tool to be sending these odd "mode settings" into the
kernel.
thanks,
greg k-h
Quoting Greg KH ([email protected]):
> On Wed, Mar 05, 2008 at 08:23:35PM +0300, Pavel Emelyanov wrote:
> > Changes from v3:
> > * Ported on 2.6.25-rc3-mm1;
> > * Re-splitted into smaller pieces;
> > * Added more comments to tricky places.
> >
> > This controller allows to tune the devices accessibility by tasks,
> > i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
> > access to IDE devices and completely hide SCSI disks.
>
> From within the kernel itself? The kernel should not be keeping track
> of the mode of devices, that's what the filesystem holding /dev is for.
> Those modes change all the time depending on the device plugged in, and
> the user using the "console". Why should the kernel need to worry about
> any of this?
These are distinct from the permissions on device files. No matter what
the permissions on the device files, a task in a devcg cgroup which
isn't allowed write to chardev 4:64 will not be able to write to
/dev/ttyS0.
The purpose is to prevent a root task from granting itself access to
certain devices. Without this, the only option currently is to take
CAP_MKNOD out of the capability bounding set for a container and make
sure that /dev is set up right (and enforce nodev for mounts). In
itself that doesn't sound so bad and it was my preference at first, but
the argument is that things like udev should be able to run in a
container, and will object about not being able to create devices.
> > Tasks still can call mknod to create device files, regardless of
> > whether the particular device is visible or accessible, but they
> > may not be able to open it later.
> >
> > This one hides under CONFIG_CGROUP_DEVS option.
> >
> > To play with it - run a standard procedure:
> >
> > # mount -t container none /cont/devs -o devices
> > # mkdir /cont/devs/0
> > # echo -n $$ > /cont/devs/0/tasks
>
> What is /cont/ for?
cgroups used to be called containers, so 'cont' is presumably shorthand
for container.
> > and tune device permissions.
>
> How is this done?
>
> Why would the kernel care about this stuff?
Because there is no way for userspace to restrict a root process in a
container from accessing whatever devices it wants.
> confused,
>
> greg k-h
thanks,
-serge
On Wed, Mar 05, 2008 at 09:15:25PM -0600, Serge E. Hallyn wrote:
> Quoting Greg KH ([email protected]):
> > On Wed, Mar 05, 2008 at 08:23:35PM +0300, Pavel Emelyanov wrote:
> > > Changes from v3:
> > > * Ported on 2.6.25-rc3-mm1;
> > > * Re-splitted into smaller pieces;
> > > * Added more comments to tricky places.
> > >
> > > This controller allows to tune the devices accessibility by tasks,
> > > i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
> > > access to IDE devices and completely hide SCSI disks.
> >
> > From within the kernel itself? The kernel should not be keeping track
> > of the mode of devices, that's what the filesystem holding /dev is for.
> > Those modes change all the time depending on the device plugged in, and
> > the user using the "console". Why should the kernel need to worry about
> > any of this?
>
> These are distinct from the permissions on device files. No matter what
> the permissions on the device files, a task in a devcg cgroup which
> isn't allowed write to chardev 4:64 will not be able to write to
> /dev/ttyS0.
Then why not do that from userspace with a different /dev, or with a
LSM?
> The purpose is to prevent a root task from granting itself access to
> certain devices. Without this, the only option currently is to take
> CAP_MKNOD out of the capability bounding set for a container and make
> sure that /dev is set up right (and enforce nodev for mounts). In
> itself that doesn't sound so bad and it was my preference at first,
that would be my preference as well.
> but the argument is that things like udev should be able to run in a
> container, and will object about not being able to create devices.
No reason you can't modify udev to do something like this. At the
worse, just disable udev warning messages, that is pretty trivial to do.
This really makes it seem like this kernel change is not needed at all.
> > > Tasks still can call mknod to create device files, regardless of
> > > whether the particular device is visible or accessible, but they
> > > may not be able to open it later.
> > >
> > > This one hides under CONFIG_CGROUP_DEVS option.
> > >
> > > To play with it - run a standard procedure:
> > >
> > > # mount -t container none /cont/devs -o devices
> > > # mkdir /cont/devs/0
> > > # echo -n $$ > /cont/devs/0/tasks
> >
> > What is /cont/ for?
>
> cgroups used to be called containers, so 'cont' is presumably shorthand
> for container.
>
> > > and tune device permissions.
> >
> > How is this done?
> >
> > Why would the kernel care about this stuff?
>
> Because there is no way for userspace to restrict a root process in a
> container from accessing whatever devices it wants.
LSM does this quite well today, why reinvent the wheel and put wierd
hooks in other parts of the kernel that do not need them at all?
thanks,
greg k-h
Greg KH wrote:
> On Wed, Mar 05, 2008 at 09:15:25PM -0600, Serge E. Hallyn wrote:
>> Quoting Greg KH ([email protected]):
>>> On Wed, Mar 05, 2008 at 08:23:35PM +0300, Pavel Emelyanov wrote:
>>>> Changes from v3:
>>>> * Ported on 2.6.25-rc3-mm1;
>>>> * Re-splitted into smaller pieces;
>>>> * Added more comments to tricky places.
>>>>
>>>> This controller allows to tune the devices accessibility by tasks,
>>>> i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
>>>> access to IDE devices and completely hide SCSI disks.
>>> From within the kernel itself? The kernel should not be keeping track
>>> of the mode of devices, that's what the filesystem holding /dev is for.
>>> Those modes change all the time depending on the device plugged in, and
>>> the user using the "console". Why should the kernel need to worry about
>>> any of this?
>> These are distinct from the permissions on device files. No matter what
>> the permissions on the device files, a task in a devcg cgroup which
>> isn't allowed write to chardev 4:64 will not be able to write to
>> /dev/ttyS0.
>
> Then why not do that from userspace with a different /dev, or with a
> LSM?
Different dev is not suitable, since task may still call mknod to
create device it needs and use it. This is not about comfortable
use, this is about security.
LSM approach was proposed, but that required some API to configure
the permissions. This API done via control groups, so there were no
difference between this approach and that. Except for this one doesn't
create one more level of filtering at the top of kobject lookup and
thus is simpler and faster.
>> The purpose is to prevent a root task from granting itself access to
>> certain devices. Without this, the only option currently is to take
>> CAP_MKNOD out of the capability bounding set for a container and make
>> sure that /dev is set up right (and enforce nodev for mounts). In
>> itself that doesn't sound so bad and it was my preference at first,
>
> that would be my preference as well.
That approach relies on a proper user space setup inside a container,
but this creates security holes, since container user may ignore all
these "requirements".
>> but the argument is that things like udev should be able to run in a
>> container, and will object about not being able to create devices.
>
> No reason you can't modify udev to do something like this. At the
Sure we _can_ modify udev, but the problem is that users of virtualisation
solutions often (very often) use old software (e.g. set up some out-dated
distribution inside a container), so trick with modified udev simply won't
work in many cases.
> worse, just disable udev warning messages, that is pretty trivial to do.
>
> This really makes it seem like this kernel change is not needed at all.
>
>>>> Tasks still can call mknod to create device files, regardless of
>>>> whether the particular device is visible or accessible, but they
>>>> may not be able to open it later.
>>>>
>>>> This one hides under CONFIG_CGROUP_DEVS option.
>>>>
>>>> To play with it - run a standard procedure:
>>>>
>>>> # mount -t container none /cont/devs -o devices
>>>> # mkdir /cont/devs/0
>>>> # echo -n $$ > /cont/devs/0/tasks
>>> What is /cont/ for?
>> cgroups used to be called containers, so 'cont' is presumably shorthand
>> for container.
>>
>>>> and tune device permissions.
>>> How is this done?
>>>
>>> Why would the kernel care about this stuff?
>> Because there is no way for userspace to restrict a root process in a
>> container from accessing whatever devices it wants.
>
> LSM does this quite well today, why reinvent the wheel and put wierd
> hooks in other parts of the kernel that do not need them at all?
These are not hooks actually. I just made kobj_map-s per-group.
> thanks,
>
> greg k-h
>
Thanks,
Pavel
Andrew Morton wrote:
> On Wed, 05 Mar 2008 20:37:40 +0300
> Pavel Emelyanov <[email protected]> wrote:
>
>> diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
>> new file mode 100644
>> index 0000000..04c168b
>> --- /dev/null
>> +++ b/include/linux/devscontrol.h
>> @@ -0,0 +1,12 @@
>> +#ifndef __DEVS_CONTROL_H__
>> +#define __DEVS_CONTROL_H__
>> +static inline struct kobj_map *task_cdev_map(struct task_struct *tsk)
>> +{
>> + return NULL;
>> +}
>> +
>> +static inline struct kobj_map *task_bdev_map(struct task_struct *tsk)
>> +{
>> + return NULL;
>> +}
>> +#endif
>
> This doesn't include sufficient headers to be compileable.
Indeed :( The patch #9 actually declares two needed struct-s, so the final
compilation is clean, but this set is still bad from the git-bisect point
of view, sorry.
Shall I provide a devscontrol-make-use-of-permissions-returned-by-kobj_lookup-fix.patch
for this one? :)
> I'm sure there are lots of headers like this. But we regularly need
> to fix them.
>
On Thu, Mar 06, 2008 at 11:36:22AM +0300, Pavel Emelyanov wrote:
> Greg KH wrote:
> > On Wed, Mar 05, 2008 at 09:15:25PM -0600, Serge E. Hallyn wrote:
> >> Quoting Greg KH ([email protected]):
> >>> On Wed, Mar 05, 2008 at 08:23:35PM +0300, Pavel Emelyanov wrote:
> >>>> Changes from v3:
> >>>> * Ported on 2.6.25-rc3-mm1;
> >>>> * Re-splitted into smaller pieces;
> >>>> * Added more comments to tricky places.
> >>>>
> >>>> This controller allows to tune the devices accessibility by tasks,
> >>>> i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
> >>>> access to IDE devices and completely hide SCSI disks.
> >>> From within the kernel itself? The kernel should not be keeping track
> >>> of the mode of devices, that's what the filesystem holding /dev is for.
> >>> Those modes change all the time depending on the device plugged in, and
> >>> the user using the "console". Why should the kernel need to worry about
> >>> any of this?
> >> These are distinct from the permissions on device files. No matter what
> >> the permissions on the device files, a task in a devcg cgroup which
> >> isn't allowed write to chardev 4:64 will not be able to write to
> >> /dev/ttyS0.
> >
> > Then why not do that from userspace with a different /dev, or with a
> > LSM?
>
> Different dev is not suitable, since task may still call mknod to
> create device it needs and use it. This is not about comfortable
> use, this is about security.
LSM can control mknod quite well. And it is _all_ about security, don't
try to add "security" to some other portion of the kernel that it not
expecting it :)
> LSM approach was proposed, but that required some API to configure
> the permissions.
Much like the API you already created to configure the permissions...
> This API done via control groups, so there were no
> difference between this approach and that. Except for this one doesn't
> create one more level of filtering at the top of kobject lookup and
> thus is simpler and faster.
How can it be "faster"? There is no speed problems here.
Please use LSM for this, that is what it is explicitly there for.
> >> The purpose is to prevent a root task from granting itself access to
> >> certain devices. Without this, the only option currently is to take
> >> CAP_MKNOD out of the capability bounding set for a container and make
> >> sure that /dev is set up right (and enforce nodev for mounts). In
> >> itself that doesn't sound so bad and it was my preference at first,
> >
> > that would be my preference as well.
>
> That approach relies on a proper user space setup inside a container,
> but this creates security holes, since container user may ignore all
> these "requirements".
Ok then, use LSM to enforce those requirements, again, that is what it
is there for.
> >> but the argument is that things like udev should be able to run in a
> >> container, and will object about not being able to create devices.
> >
> > No reason you can't modify udev to do something like this. At the
>
> Sure we _can_ modify udev, but the problem is that users of virtualisation
> solutions often (very often) use old software (e.g. set up some out-dated
> distribution inside a container), so trick with modified udev simply won't
> work in many cases.
So the kernel's job is to enforce policy that an old userspace can not
seem to get straight? That bit of logic doesn't go over so well...
Andrew, please drop these patches from your tree, this all should be
done with a simple LSM interface, which would be much smaller and much
less intrusive than this patchset.
thanks,
greg k-h
On Thu 2008-03-06 11:36:22, Pavel Emelyanov wrote:
> Greg KH wrote:
> > On Wed, Mar 05, 2008 at 09:15:25PM -0600, Serge E. Hallyn wrote:
> >> Quoting Greg KH ([email protected]):
> >>> On Wed, Mar 05, 2008 at 08:23:35PM +0300, Pavel Emelyanov wrote:
> >>>> Changes from v3:
> >>>> * Ported on 2.6.25-rc3-mm1;
> >>>> * Re-splitted into smaller pieces;
> >>>> * Added more comments to tricky places.
> >>>>
> >>>> This controller allows to tune the devices accessibility by tasks,
> >>>> i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
> >>>> access to IDE devices and completely hide SCSI disks.
> >>> From within the kernel itself? The kernel should not be keeping track
> >>> of the mode of devices, that's what the filesystem holding /dev is for.
> >>> Those modes change all the time depending on the device plugged in, and
> >>> the user using the "console". Why should the kernel need to worry about
> >>> any of this?
> >> These are distinct from the permissions on device files. No matter what
> >> the permissions on the device files, a task in a devcg cgroup which
> >> isn't allowed write to chardev 4:64 will not be able to write to
> >> /dev/ttyS0.
> >
> > Then why not do that from userspace with a different /dev, or with a
> > LSM?
>
> Different dev is not suitable, since task may still call mknod to
> create device it needs and use it. This is not about comfortable
> use, this is about security.
And you may still take out mknod capability...
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Pavel Machek wrote:
> On Thu 2008-03-06 11:36:22, Pavel Emelyanov wrote:
>> Greg KH wrote:
>>> On Wed, Mar 05, 2008 at 09:15:25PM -0600, Serge E. Hallyn wrote:
>>>> Quoting Greg KH ([email protected]):
>>>>> On Wed, Mar 05, 2008 at 08:23:35PM +0300, Pavel Emelyanov wrote:
>>>>>> Changes from v3:
>>>>>> * Ported on 2.6.25-rc3-mm1;
>>>>>> * Re-splitted into smaller pieces;
>>>>>> * Added more comments to tricky places.
>>>>>>
>>>>>> This controller allows to tune the devices accessibility by tasks,
>>>>>> i.e. grant full access for /dev/null, /dev/zero etc, grant read-only
>>>>>> access to IDE devices and completely hide SCSI disks.
>>>>> From within the kernel itself? The kernel should not be keeping track
>>>>> of the mode of devices, that's what the filesystem holding /dev is for.
>>>>> Those modes change all the time depending on the device plugged in, and
>>>>> the user using the "console". Why should the kernel need to worry about
>>>>> any of this?
>>>> These are distinct from the permissions on device files. No matter what
>>>> the permissions on the device files, a task in a devcg cgroup which
>>>> isn't allowed write to chardev 4:64 will not be able to write to
>>>> /dev/ttyS0.
>>> Then why not do that from userspace with a different /dev, or with a
>>> LSM?
>> Different dev is not suitable, since task may still call mknod to
>> create device it needs and use it. This is not about comfortable
>> use, this is about security.
>
> And you may still take out mknod capability...
Sure we can. In that case we can pre-create only alowed devices for a container
and be happy, but this is not a flexible way of doing things. The main problem is
that udev will stop working in such a container. Patching one is not an option
either, for the reasons I have described before.
Many virtualization functionality we're submitting can be achieved via proper
userpace modifications, but we want to be able to run old software (from old
and even historic distributions) inside containers, so we have to make this
at the kernel level.
Thanks,
Pavel
Andrew Morton wrote:
> On Wed, 05 Mar 2008 20:37:40 +0300
> Pavel Emelyanov <[email protected]> wrote:
>
>> diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
>> new file mode 100644
>> index 0000000..04c168b
>> --- /dev/null
>> +++ b/include/linux/devscontrol.h
>> @@ -0,0 +1,12 @@
>> +#ifndef __DEVS_CONTROL_H__
>> +#define __DEVS_CONTROL_H__
>> +static inline struct kobj_map *task_cdev_map(struct task_struct *tsk)
>> +{
>> + return NULL;
>> +}
>> +
>> +static inline struct kobj_map *task_bdev_map(struct task_struct *tsk)
>> +{
>> + return NULL;
>> +}
>> +#endif
>
> This doesn't include sufficient headers to be compileable.
>
> I'm sure there are lots of headers like this. But we regularly need
> to fix them.
>
Not sure, whether this is still relevant after Greg's comments, but that's
the -fix patch for this one. (It will cause a conflict with the 9th patch.)
diff --git a/include/linux/devscontrol.h b/include/linux/devscontrol.h
index 04c168b..52cac64 100644
--- a/include/linux/devscontrol.h
+++ b/include/linux/devscontrol.h
@@ -1,5 +1,8 @@
#ifndef __DEVS_CONTROL_H__
#define __DEVS_CONTROL_H__
+struct kobj_map;
+struct task_struct;
+
static inline struct kobj_map *task_cdev_map(struct task_struct *tsk)
{
return NULL;
On Fri, 07 Mar 2008 12:22:01 +0300 Pavel Emelyanov <[email protected]> wrote:
> > This doesn't include sufficient headers to be compileable.
> >
> > I'm sure there are lots of headers like this. But we regularly need
> > to fix them.
> >
>
> Not sure, whether this is still relevant after Greg's comments, but that's
> the -fix patch for this one. (It will cause a conflict with the 9th patch.)
Well. Where do we stand with this? afaict the state of play is:
Greg: do it in udev
Pavel: but people want to run old distros in containers
Realistically, when is the mainline kernel likely to have sufficient
container functionality which is sufficiently well-tested for people to
actually be able to do that? And how much longer will it take for that
kernel.org functionality to propagate out into non-bleeding-edge distros?
Altogether we're looking at one to three years, aren't we? By then,
perhaps a lot of "old" distros are already udev-based.
otoh, my experience upgrading old kernels to new udev has not been a
good one (ie: it didn't work).
Andrew Morton wrote:
> On Fri, 07 Mar 2008 12:22:01 +0300 Pavel Emelyanov <[email protected]> wrote:
>
>>> This doesn't include sufficient headers to be compileable.
>>>
>>> I'm sure there are lots of headers like this. But we regularly need
>>> to fix them.
>>>
>> Not sure, whether this is still relevant after Greg's comments, but that's
>> the -fix patch for this one. (It will cause a conflict with the 9th patch.)
>
> Well. Where do we stand with this? afaict the state of play is:
>
> Greg: do it in udev
> Pavel: but people want to run old distros in containers
Actually no.
Greg: Use LSM for this
Pavel: My approach just makes maps per-group, while LSM will
bring a new level of filtering/lookup on device open path
> Realistically, when is the mainline kernel likely to have sufficient
> container functionality which is sufficiently well-tested for people to
> actually be able to do that? And how much longer will it take for that
> kernel.org functionality to propagate out into non-bleeding-edge distros?
The fact is that we have users of OpenVZ and even Virtuozzo, that still use
redhat-9 as in containers. So even if this is ready in 5 years, there will
always be someone who sets the outdated (by that time) fedora-core-8 and find
out, that his udev refuses to work.
> Altogether we're looking at one to three years, aren't we? By then,
> perhaps a lot of "old" distros are already udev-based.
>
> otoh, my experience upgrading old kernels to new udev has not been a
> good one (ie: it didn't work).
Agree, but we're talking about making old udev working with new kernel.
>
On Fri, Mar 07, 2008 at 12:52:40PM +0300, Pavel Emelyanov wrote:
> Andrew Morton wrote:
> > On Fri, 07 Mar 2008 12:22:01 +0300 Pavel Emelyanov <[email protected]> wrote:
> >
> >>> This doesn't include sufficient headers to be compileable.
> >>>
> >>> I'm sure there are lots of headers like this. But we regularly need
> >>> to fix them.
> >>>
> >> Not sure, whether this is still relevant after Greg's comments, but that's
> >> the -fix patch for this one. (It will cause a conflict with the 9th patch.)
> >
> > Well. Where do we stand with this? afaict the state of play is:
> >
> > Greg: do it in udev
> > Pavel: but people want to run old distros in containers
>
> Actually no.
>
> Greg: Use LSM for this
Yes, that is my recommendation.
> Pavel: My approach just makes maps per-group, while LSM will
> bring a new level of filtering/lookup on device open path
Huh? You are still doing that same "filtering/lookup" by modifying the
maps code. The result should be exactly the same.
Why do you not want to use the LSM interface? That is exactly what it
is there for, don't go creating new hooks into the kernel for the exact
same functionality.
Opening a dev node is not on any "fast path" that you need to be
concerned about a few extra calls within the kernel.
And, I think in the end your patch would be much smaller and easier to
understand and review and maintain overall.
> > Realistically, when is the mainline kernel likely to have sufficient
> > container functionality which is sufficiently well-tested for people to
> > actually be able to do that? And how much longer will it take for that
> > kernel.org functionality to propagate out into non-bleeding-edge distros?
>
> The fact is that we have users of OpenVZ and even Virtuozzo, that still use
> redhat-9 as in containers. So even if this is ready in 5 years, there will
> always be someone who sets the outdated (by that time) fedora-core-8 and find
> out, that his udev refuses to work.
That's fine, use the LSM interface, no need to change userspace at all.
Although I think your requirement of using new kernels on very old
distros is going to cause you more problems than you realize in the
end...
thanks,
greg k-h
Greg KH wrote:
> On Fri, Mar 07, 2008 at 12:52:40PM +0300, Pavel Emelyanov wrote:
>> Andrew Morton wrote:
>>> On Fri, 07 Mar 2008 12:22:01 +0300 Pavel Emelyanov <[email protected]> wrote:
>>>
>>>>> This doesn't include sufficient headers to be compileable.
>>>>>
>>>>> I'm sure there are lots of headers like this. But we regularly need
>>>>> to fix them.
>>>>>
>>>> Not sure, whether this is still relevant after Greg's comments, but that's
>>>> the -fix patch for this one. (It will cause a conflict with the 9th patch.)
>>> Well. Where do we stand with this? afaict the state of play is:
>>>
>>> Greg: do it in udev
>>> Pavel: but people want to run old distros in containers
>> Actually no.
>>
>> Greg: Use LSM for this
>
> Yes, that is my recommendation.
>
>> Pavel: My approach just makes maps per-group, while LSM will
>> bring a new level of filtering/lookup on device open path
>
> Huh? You are still doing that same "filtering/lookup" by modifying the
> maps code. The result should be exactly the same.
No - this lookup was there before to get struct kobject from the dev_t,
I just make it look up in another map.
> Why do you not want to use the LSM interface? That is exactly what it
> is there for, don't go creating new hooks into the kernel for the exact
> same functionality.
_No_new_hooks_ - just the map is get from task, not from a static variable.
I have basically three objections against LSM.
1. LSM is not stackable, so loading this tiny module with devices
access rights will block other security modules;
2. Turning CONFIG_SECURITY on immediately causes all the other hooks
to get called. This affects performance on critical paths, like
process creation/destruction, network flow and so on. This impact
is small, but noticeable;
3. With LSM turned on we'll have to "virtualize" it, i.e. make its
work safe in a container. I don't presume to judge how much work
will have to be done in this area, so the result patch would be
even larger and maybe will duplicate functionality, which is currently
in cgroups. OTOH, cgroups already provide the ways to correctly
delegate proper rights to containers.
> Opening a dev node is not on any "fast path" that you need to be
> concerned about a few extra calls within the kernel.
>
> And, I think in the end your patch would be much smaller and easier to
> understand and review and maintain overall.
Hardly - the largest part of my patch is cgroup manipulations. The part
that makes the char and block layers switch to new map ac check the
permissions is 10-20 lines of new code.
But with LSM I will still need this API.
>>> Realistically, when is the mainline kernel likely to have sufficient
>>> container functionality which is sufficiently well-tested for people to
>>> actually be able to do that? And how much longer will it take for that
>>> kernel.org functionality to propagate out into non-bleeding-edge distros?
>> The fact is that we have users of OpenVZ and even Virtuozzo, that still use
>> redhat-9 as in containers. So even if this is ready in 5 years, there will
>> always be someone who sets the outdated (by that time) fedora-core-8 and find
>> out, that his udev refuses to work.
>
> That's fine, use the LSM interface, no need to change userspace at all.
>
> Although I think your requirement of using new kernels on very old
> distros is going to cause you more problems than you realize in the
> end...
>
> thanks,
>
> greg k-h
>
On Fri, Mar 07, 2008 at 07:38:51PM +0300, Pavel Emelyanov wrote:
> Greg KH wrote:
> > On Fri, Mar 07, 2008 at 12:52:40PM +0300, Pavel Emelyanov wrote:
> >> Andrew Morton wrote:
> >>> On Fri, 07 Mar 2008 12:22:01 +0300 Pavel Emelyanov <[email protected]> wrote:
> >>>
> >>>>> This doesn't include sufficient headers to be compileable.
> >>>>>
> >>>>> I'm sure there are lots of headers like this. But we regularly need
> >>>>> to fix them.
> >>>>>
> >>>> Not sure, whether this is still relevant after Greg's comments, but that's
> >>>> the -fix patch for this one. (It will cause a conflict with the 9th patch.)
> >>> Well. Where do we stand with this? afaict the state of play is:
> >>>
> >>> Greg: do it in udev
> >>> Pavel: but people want to run old distros in containers
> >> Actually no.
> >>
> >> Greg: Use LSM for this
> >
> > Yes, that is my recommendation.
> >
> >> Pavel: My approach just makes maps per-group, while LSM will
> >> bring a new level of filtering/lookup on device open path
> >
> > Huh? You are still doing that same "filtering/lookup" by modifying the
> > maps code. The result should be exactly the same.
>
> No - this lookup was there before to get struct kobject from the dev_t,
> I just make it look up in another map.
>
> > Why do you not want to use the LSM interface? That is exactly what it
> > is there for, don't go creating new hooks into the kernel for the exact
> > same functionality.
>
> _No_new_hooks_ - just the map is get from task, not from a static variable.
The new "hook" is that we now have ways of setting different maps, and
you added a whole infrastructure to set these permissions from
userspace, as well as track them within the kernel. That's a security
policy that you are putting into place within the kernel, just like LSM
is supposed to be used for.
> I have basically three objections against LSM.
>
> 1. LSM is not stackable, so loading this tiny module with devices
> access rights will block other security modules;
Do you really want to run other LSMs within a containerd kernel? Is
that a requirement? It would seem to run counter to the main goal of
containers to me.
> 2. Turning CONFIG_SECURITY on immediately causes all the other hooks
> to get called. This affects performance on critical paths, like
> process creation/destruction, network flow and so on. This impact
> is small, but noticeable;
The last time this was brought up, it was not noticable, except for some
network paths. And even then, the number was lost in the noise from
what I saw. I think with a containered machine, you have bigger things
to be worried about :)
> 3. With LSM turned on we'll have to "virtualize" it, i.e. make its
> work safe in a container. I don't presume to judge how much work
> will have to be done in this area, so the result patch would be
> even larger and maybe will duplicate functionality, which is currently
> in cgroups. OTOH, cgroups already provide the ways to correctly
> delegate proper rights to containers.
No, your lsm would be your "virtualize" policy. I don't think you would
have to do any additional work here, but could be wrong. Would like to
see the code to prove it.
> > Opening a dev node is not on any "fast path" that you need to be
> > concerned about a few extra calls within the kernel.
> >
> > And, I think in the end your patch would be much smaller and easier to
> > understand and review and maintain overall.
>
> Hardly - the largest part of my patch is cgroup manipulations. The part
> that makes the char and block layers switch to new map ac check the
> permissions is 10-20 lines of new code.
>
> But with LSM I will still need this API.
Yes, but your LSM hooks will be smaller than the code modifications to
the map logic :)
Again, I object to this as you are driving a new security policy
infrastructure into the device node logic where it does not belong as we
already have this functionality in the LSM interface today. Please use
that one instead and don't clutter up the kernel with "one-off" security
changes like this one.
Please try the LSM interface and see what happens. If, after you have
created a patch, you still have objections, please post it for review
and I will be glad to revisit my opinion at that time.
thanks,
greg k-h
On Fri, Mar 07, 2008 at 09:01:04AM -0800, Greg KH wrote:
> Again, I object to this as you are driving a new security policy
> infrastructure into the device node logic where it does not belong as we
> already have this functionality in the LSM interface today. Please use
> that one instead and don't clutter up the kernel with "one-off" security
> changes like this one.
>
> Please try the LSM interface and see what happens. If, after you have
> created a patch, you still have objections, please post it for review
> and I will be glad to revisit my opinion at that time.
Not that per-container mappings on that level made any kind of sense in
the first place...
Quoting Greg KH ([email protected]):
> On Fri, Mar 07, 2008 at 07:38:51PM +0300, Pavel Emelyanov wrote:
> > Greg KH wrote:
> > > On Fri, Mar 07, 2008 at 12:52:40PM +0300, Pavel Emelyanov wrote:
> > >> Andrew Morton wrote:
> > >>> On Fri, 07 Mar 2008 12:22:01 +0300 Pavel Emelyanov <[email protected]> wrote:
> > >>>
> > >>>>> This doesn't include sufficient headers to be compileable.
> > >>>>>
> > >>>>> I'm sure there are lots of headers like this. But we regularly need
> > >>>>> to fix them.
> > >>>>>
> > >>>> Not sure, whether this is still relevant after Greg's comments, but that's
> > >>>> the -fix patch for this one. (It will cause a conflict with the 9th patch.)
> > >>> Well. Where do we stand with this? afaict the state of play is:
> > >>>
> > >>> Greg: do it in udev
> > >>> Pavel: but people want to run old distros in containers
> > >> Actually no.
> > >>
> > >> Greg: Use LSM for this
> > >
> > > Yes, that is my recommendation.
> > >
> > >> Pavel: My approach just makes maps per-group, while LSM will
> > >> bring a new level of filtering/lookup on device open path
> > >
> > > Huh? You are still doing that same "filtering/lookup" by modifying the
> > > maps code. The result should be exactly the same.
> >
> > No - this lookup was there before to get struct kobject from the dev_t,
> > I just make it look up in another map.
> >
> > > Why do you not want to use the LSM interface? That is exactly what it
> > > is there for, don't go creating new hooks into the kernel for the exact
> > > same functionality.
> >
> > _No_new_hooks_ - just the map is get from task, not from a static variable.
>
> The new "hook" is that we now have ways of setting different maps, and
> you added a whole infrastructure to set these permissions from
> userspace, as well as track them within the kernel. That's a security
> policy that you are putting into place within the kernel, just like LSM
> is supposed to be used for.
>
> > I have basically three objections against LSM.
> >
> > 1. LSM is not stackable, so loading this tiny module with devices
> > access rights will block other security modules;
>
> Do you really want to run other LSMs within a containerd kernel? Is
> that a requirement? It would seem to run counter to the main goal of
> containers to me.
Until user namespaces are complete, selinux seems the only good solution
to offer isolation.
> > 2. Turning CONFIG_SECURITY on immediately causes all the other hooks
> > to get called. This affects performance on critical paths, like
> > process creation/destruction, network flow and so on. This impact
> > is small, but noticeable;
>
> The last time this was brought up, it was not noticable, except for some
> network paths. And even then, the number was lost in the noise from
> what I saw. I think with a containered machine, you have bigger things
> to be worried about :)
>
> > 3. With LSM turned on we'll have to "virtualize" it, i.e. make its
> > work safe in a container. I don't presume to judge how much work
> > will have to be done in this area, so the result patch would be
> > even larger and maybe will duplicate functionality, which is currently
> > in cgroups. OTOH, cgroups already provide the ways to correctly
> > delegate proper rights to containers.
>
> No, your lsm would be your "virtualize" policy. I don't think you would
> have to do any additional work here, but could be wrong. Would like to
> see the code to prove it.
>
> > > Opening a dev node is not on any "fast path" that you need to be
> > > concerned about a few extra calls within the kernel.
> > >
> > > And, I think in the end your patch would be much smaller and easier to
> > > understand and review and maintain overall.
> >
> > Hardly - the largest part of my patch is cgroup manipulations. The part
> > that makes the char and block layers switch to new map ac check the
> > permissions is 10-20 lines of new code.
> >
> > But with LSM I will still need this API.
>
> Yes, but your LSM hooks will be smaller than the code modifications to
> the map logic :)
>
> Again, I object to this as you are driving a new security policy
> infrastructure into the device node logic where it does not belong as we
> already have this functionality in the LSM interface today. Please use
> that one instead and don't clutter up the kernel with "one-off" security
> changes like this one.
>
> Please try the LSM interface and see what happens. If, after you have
https://lists.linux-foundation.org/pipermail/containers/2007-November/008589.html
This was just a proof of concept for discussion.
Our conclusion (including my own) was that it would be nicer to have the
controls not be shoed in using lsm but rather at the level Pavel was
doing.
> created a patch, you still have objections, please post it for review
> and I will be glad to revisit my opinion at that time.
>
> thanks,
>
> greg k-h
--- "Serge E. Hallyn" <[email protected]> wrote:
> ...
>
> Until user namespaces are complete, selinux seems the only good solution
> to offer isolation.
Smack does it better and cheaper. (Unless you define good==selinux)
(insert smiley)
Casey Schaufler
[email protected]
On Fri, Mar 07, 2008 at 11:35:42AM -0600, Serge E. Hallyn wrote:
> > Do you really want to run other LSMs within a containerd kernel? Is
> > that a requirement? It would seem to run counter to the main goal of
> > containers to me.
>
> Until user namespaces are complete, selinux seems the only good solution
> to offer isolation.
Great, use that instead :)
> > > 2. Turning CONFIG_SECURITY on immediately causes all the other hooks
> > > to get called. This affects performance on critical paths, like
> > > process creation/destruction, network flow and so on. This impact
> > > is small, but noticeable;
> >
> > The last time this was brought up, it was not noticable, except for some
> > network paths. And even then, the number was lost in the noise from
> > what I saw. I think with a containered machine, you have bigger things
> > to be worried about :)
> >
> > > 3. With LSM turned on we'll have to "virtualize" it, i.e. make its
> > > work safe in a container. I don't presume to judge how much work
> > > will have to be done in this area, so the result patch would be
> > > even larger and maybe will duplicate functionality, which is currently
> > > in cgroups. OTOH, cgroups already provide the ways to correctly
> > > delegate proper rights to containers.
> >
> > No, your lsm would be your "virtualize" policy. I don't think you would
> > have to do any additional work here, but could be wrong. Would like to
> > see the code to prove it.
> >
> > > > Opening a dev node is not on any "fast path" that you need to be
> > > > concerned about a few extra calls within the kernel.
> > > >
> > > > And, I think in the end your patch would be much smaller and easier to
> > > > understand and review and maintain overall.
> > >
> > > Hardly - the largest part of my patch is cgroup manipulations. The part
> > > that makes the char and block layers switch to new map ac check the
> > > permissions is 10-20 lines of new code.
> > >
> > > But with LSM I will still need this API.
> >
> > Yes, but your LSM hooks will be smaller than the code modifications to
> > the map logic :)
> >
> > Again, I object to this as you are driving a new security policy
> > infrastructure into the device node logic where it does not belong as we
> > already have this functionality in the LSM interface today. Please use
> > that one instead and don't clutter up the kernel with "one-off" security
> > changes like this one.
> >
> > Please try the LSM interface and see what happens. If, after you have
>
> https://lists.linux-foundation.org/pipermail/containers/2007-November/008589.html
>
> This was just a proof of concept for discussion.
I don't see the LSM patch in that posting, only a Makefile change that
adds it to the build.
> Our conclusion (including my own) was that it would be nicer to have the
> controls not be shoed in using lsm but rather at the level Pavel was
> doing.
Why? What is the difference? I don't see that discussion anywhere in
that post.
Why add add-hock security stuff to the kernel all over the place and not
use the LSM interface that we already have? If LSM doesn't provide the
exact right hooks needed, we can always change it (and no, we should not
be adding kobj_maps hooks to LSM).
thanks,
greg k-h
Quoting Casey Schaufler ([email protected]):
>
> --- "Serge E. Hallyn" <[email protected]> wrote:
>
> > ...
> >
> > Until user namespaces are complete, selinux seems the only good solution
> > to offer isolation.
>
> Smack does it better and cheaper. (Unless you define good==selinux)
> (insert smiley)
Ah, thanks - I hadn't looked into it, but yes IIUC smack should
definately work. I'll have to give that a shot.
(A basic selinux policy module to isolate a container was pretty simple,
but providing finer-grained intra-container access seems to take some
changes to the base refpolicy. I've been waiting a few weeks to find
time to work on that.)
thanks,
-serge
Quoting Greg KH ([email protected]):
> On Fri, Mar 07, 2008 at 11:35:42AM -0600, Serge E. Hallyn wrote:
> > > Do you really want to run other LSMs within a containerd kernel? Is
> > > that a requirement? It would seem to run counter to the main goal of
> > > containers to me.
> >
> > Until user namespaces are complete, selinux seems the only good solution
> > to offer isolation.
>
> Great, use that instead :)
That can't work as is since you can't specify major:minor in policy.
So all we could do again is simply refuse all mknod, which we can
already do with per-process capability bounding sets.
> > > > 2. Turning CONFIG_SECURITY on immediately causes all the other hooks
> > > > to get called. This affects performance on critical paths, like
> > > > process creation/destruction, network flow and so on. This impact
> > > > is small, but noticeable;
> > >
> > > The last time this was brought up, it was not noticable, except for some
> > > network paths. And even then, the number was lost in the noise from
> > > what I saw. I think with a containered machine, you have bigger things
> > > to be worried about :)
> > >
> > > > 3. With LSM turned on we'll have to "virtualize" it, i.e. make its
> > > > work safe in a container. I don't presume to judge how much work
> > > > will have to be done in this area, so the result patch would be
> > > > even larger and maybe will duplicate functionality, which is currently
> > > > in cgroups. OTOH, cgroups already provide the ways to correctly
> > > > delegate proper rights to containers.
> > >
> > > No, your lsm would be your "virtualize" policy. I don't think you would
> > > have to do any additional work here, but could be wrong. Would like to
> > > see the code to prove it.
> > >
> > > > > Opening a dev node is not on any "fast path" that you need to be
> > > > > concerned about a few extra calls within the kernel.
> > > > >
> > > > > And, I think in the end your patch would be much smaller and easier to
> > > > > understand and review and maintain overall.
> > > >
> > > > Hardly - the largest part of my patch is cgroup manipulations. The part
> > > > that makes the char and block layers switch to new map ac check the
> > > > permissions is 10-20 lines of new code.
> > > >
> > > > But with LSM I will still need this API.
> > >
> > > Yes, but your LSM hooks will be smaller than the code modifications to
> > > the map logic :)
> > >
> > > Again, I object to this as you are driving a new security policy
> > > infrastructure into the device node logic where it does not belong as we
> > > already have this functionality in the LSM interface today. Please use
> > > that one instead and don't clutter up the kernel with "one-off" security
> > > changes like this one.
> > >
> > > Please try the LSM interface and see what happens. If, after you have
> >
> > https://lists.linux-foundation.org/pipermail/containers/2007-November/008589.html
> >
> > This was just a proof of concept for discussion.
>
> I don't see the LSM patch in that posting, only a Makefile change that
> adds it to the build.
Aw, crap! That's right, I remember noticing a few days later that it
wasn't in my git index any more.
An older version (by about a month) of the patch with the LSM coded
straight into the cgroup file is appended below.
> > Our conclusion (including my own) was that it would be nicer to have the
> > controls not be shoed in using lsm but rather at the level Pavel was
> > doing.
>
> Why? What is the difference? I don't see that discussion anywhere in
> that post.
That happened in a later thread in december:
https://lists.linux-foundation.org/pipermail/containers/2007-December/009337.html
> Why add add-hock security stuff to the kernel all over the place and not
> use the LSM interface that we already have? If LSM doesn't provide the
> exact right hooks needed, we can always change it (and no, we should not
> be adding kobj_maps hooks to LSM).
>
> thanks,
>
> greg k-h
>From 4266131c40b629e3b04c0d9d01569a95fa967e3e Mon Sep 17 00:00:00 2001
From: Serge E. Hallyn <[email protected]>
Date: Thu, 11 Oct 2007 15:27:48 -0400
Subject: [PATCH 1/1] cgroups: implement device whitelist cgroup+lsm
Implement a cgroup using the LSM interface to enforce open and mknod
on device files. Not a line of this code is expected to be used in a
final version, this is just a proof of concept.
No stacking is implemented, so to test this you must have
CGROUPS=y
SECURITY=y
but all other LSMs =n (no capabilities, no selinux, no rootplug).
This implements a simple device access whitelist. A whitelist entry
has 4 fields. 'type' is a (all), c (char), or b (block). 'all' means it
applies to all types, all major numbers, and all minor numbers. Major and
minor are obvious. Access is a composition of r (read), w (write), and
m (mknod).
The root devcgroup starts with rwm to 'all'. A child devcg gets a copy
of the parent. Admins can then add and remove devices to the whitelist.
Once CAP_HOST_ADMIN is introduced it will be needed to add entries as
well or remove entries from another cgroup, though just CAP_SYS_ADMIN
will suffice to remove entries for your own group.
An entry is added by doing "echo <type> <maj> <min> <access>" > devcg.allow,
for instance:
echo b 7 0 mrw > /cgroups/1/devcg.allow
An entry is removed by doing likewise into devcg.deny. Since this is a
pure whitelist, not acls, you can only remove entries which exist in the
whitelist. You must explicitly
echo a 0 0 mrw > /cgroups/1/devcg.deny
to remove the "allow all" entry which is automatically inherited from
the root cgroup.
While composing this with the ns_cgroup may seem logical, it may not
be the right thing to do. Note that each newly created devcg gets
a copy of the parent whitelist. So if you had done
mount -t cgroup -o ns,devcg none /cgroups
then once a process in /cgroup/1 had done an unshare(CLONE_NEWNS)
it would be under /cgroup/1/node_<pid>
if an admin did
echo b 7 0 m > /cgroups/1/devcg.deny
then the entry would still be in the whitelist for /cgroups/1/node_<pid>.
Something to discuss if we get that far before nixing this whole idea.
Signed-off-by: Serge E. Hallyn <[email protected]>
---
include/linux/cgroup_subsys.h | 6 +
init/Kconfig | 7 +
kernel/Makefile | 1 +
kernel/dev_cgroup.c | 554 +++++++++++++++++++++++++++++++++++++++++
4 files changed, 568 insertions(+), 0 deletions(-)
create mode 100644 kernel/dev_cgroup.c
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index d822977..cf55cb2 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -36,3 +36,9 @@ SUBSYS(mem_cgroup)
#endif
/* */
+
+#ifdef CONFIG_CGROUP_DEV
+SUBSYS(devcg)
+#endif
+
+/* */
diff --git a/init/Kconfig b/init/Kconfig
index 6bb603a..0b3b684 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -319,6 +319,13 @@ config CPUSETS
Say N if unsure.
+config CGROUP_DEV
+ bool "Device controller for cgroups"
+ depends on CGROUPS && SECURITY && EXPERIMENTAL
+ help
+ Provides a cgroup implementing whitelists for devices which
+ a process in the cgroup can mknod or open.
+
config FAIR_GROUP_SCHED
bool "Fair group CPU scheduler"
default y
diff --git a/kernel/Makefile b/kernel/Makefile
index 76f782f..6ded46d 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_CGROUP_DEBUG) += cgroup_debug.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpu_acct.o
obj-$(CONFIG_CGROUP_NS) += ns_cgroup.o
+obj-$(CONFIG_CGROUP_DEV) += dev_cgroup.o
obj-$(CONFIG_IKCONFIG) += configs.o
obj-$(CONFIG_STOP_MACHINE) += stop_machine.o
obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
diff --git a/kernel/dev_cgroup.c b/kernel/dev_cgroup.c
new file mode 100644
index 0000000..87c8fb4
--- /dev/null
+++ b/kernel/dev_cgroup.c
@@ -0,0 +1,554 @@
+/*
+ * dev_cgroup.c - device cgroup subsystem
+ *
+ * Copyright 2007 IBM Corp
+ */
+
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/security.h>
+
+#include <asm/uaccess.h>
+
+#define ACC_MKNOD 1
+#define ACC_READ 2
+#define ACC_WRITE 4
+
+#define DEV_BLOCK 1
+#define DEV_CHAR 2
+#define DEV_ALL 4 /* this represents all devices */
+
+/*
+ * whitelist locking rules:
+ * cgroup_lock() cannot be taken under cgroup->lock.
+ * cgroup->lock can be taken with or without cgroup_lock().
+ *
+ * modifications always require cgroup_lock
+ * modifications to a list which is visible require the
+ * cgroup->lock *and* cgroup_lock()
+ * walking the list requires cgroup->lock or cgroup_lock().
+ *
+ * reasoning: dev_whitelist_copy() needs to kmalloc, so needs
+ * a mutex, which the cgroup_lock() is. Since modifying
+ * a visible list requires both locks, either lock can be
+ * taken for walking the list. Since the wh->spinlock is taken
+ * for modifying a public-accessible list, the spinlock is
+ * sufficient for just walking the list.
+ */
+
+struct dev_whitelist_item {
+ u32 major, minor;
+ short type;
+ short access;
+ struct list_head list;
+};
+
+struct dev_cgroup {
+ struct cgroup_subsys_state css;
+ struct list_head whitelist;
+ spinlock_t lock;
+};
+
+struct cgroup_subsys devcg_subsys;
+
+static inline struct dev_cgroup *cgroup_to_devcg(
+ struct cgroup *cgroup)
+{
+ return container_of(cgroup_subsys_state(cgroup, devcg_subsys_id),
+ struct dev_cgroup, css);
+}
+
+/*
+ * Once 64-bit caps and CAP_HOST_ADMIN exist, we will be
+ * requiring (CAP_HOST_ADMIN|CAP_MKNOD) to create a device
+ * not in the whitelist, * (CAP_HOST_ADMIN|CAP_SYS_ADMIN)
+ * to edit the whitelist,
+ */
+static int devcg_can_attach(struct cgroup_subsys *ss,
+ struct cgroup *new_cgroup, struct task_struct *task)
+{
+ struct cgroup *orig;
+
+ if (current != task) {
+ if (!cgroup_is_descendant(new_cgroup))
+ return -EPERM;
+ }
+
+ if (atomic_read(&new_cgroup->count) != 0)
+ return -EPERM;
+
+ orig = task_cgroup(task, devcg_subsys_id);
+ if (orig && orig != new_cgroup->parent)
+ return -EPERM;
+
+ return 0;
+}
+
+/*
+ * called under cgroup_lock()
+ */
+int dev_whitelist_copy(struct list_head *dest, struct list_head *orig)
+{
+ struct dev_whitelist_item *wh, *tmp, *new;
+
+ list_for_each_entry(wh, orig, list) {
+ new = kmalloc(sizeof(*wh), GFP_KERNEL);
+ if (!new)
+ goto free_and_exit;
+ new->major = wh->major;
+ new->minor = wh->minor;
+ new->type = wh->type;
+ new->access = wh->access;
+ list_add_tail(&new->list, dest);
+ }
+
+ return 0;
+
+free_and_exit:
+ list_for_each_entry_safe(wh, tmp, dest, list) {
+ list_del(&wh->list);
+ kfree(wh);
+ }
+ return -ENOMEM;
+}
+
+/* Stupid prototype - don't bother combining existing entries */
+/*
+ * called under cgroup_lock()
+ * since the list is visible to other tasks, we need the spinlock also
+ */
+void dev_whitelist_add(struct dev_cgroup *dev_cgroup, struct dev_whitelist_item *wh)
+{
+ spin_lock(&dev_cgroup->lock);
+ list_add_tail(&wh->list, &dev_cgroup->whitelist);
+ spin_unlock(&dev_cgroup->lock);
+}
+
+/*
+ * called under cgroup_lock()
+ * since the list is visible to other tasks, we need the spinlock also
+ */
+void dev_whitelist_rm(struct dev_cgroup *dev_cgroup, struct dev_whitelist_item *wh)
+{
+ struct dev_whitelist_item *walk, *tmp;
+
+ spin_lock(&dev_cgroup->lock);
+ list_for_each_entry_safe(walk, tmp, &dev_cgroup->whitelist, list) {
+ if (walk->type & DEV_ALL) {
+ list_del(&walk->list);
+ kfree(walk);
+ continue;
+ }
+ if (walk->type != wh->type)
+ continue;
+ if (walk->major != wh->major || walk->minor != wh->minor)
+ continue;
+ walk->access &= ~wh->access;
+ if (!walk->access) {
+ list_del(&walk->list);
+ kfree(walk);
+ }
+ }
+ spin_unlock(&dev_cgroup->lock);
+}
+
+/*
+ * Rules: you can only create a cgroup if
+ * 1. you are capable(CAP_SYS_ADMIN)
+ * 2. the target cgroup is a descendant of your own cgroup
+ *
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *devcg_create(struct cgroup_subsys *ss,
+ struct cgroup *cgroup)
+{
+ struct dev_cgroup *dev_cgroup, *parent_dev_cgroup;
+ struct cgroup *parent_cgroup;
+ int ret;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+ if (!cgroup_is_descendant(cgroup))
+ return ERR_PTR(-EPERM);
+
+ dev_cgroup = kzalloc(sizeof(*dev_cgroup), GFP_KERNEL);
+ if (!dev_cgroup)
+ return ERR_PTR(-ENOMEM);
+ INIT_LIST_HEAD(&dev_cgroup->whitelist);
+ parent_cgroup = cgroup->parent;
+
+ if (parent_cgroup == NULL) {
+ struct dev_whitelist_item *wh;
+ wh = kmalloc(sizeof(*wh), GFP_KERNEL);
+ wh->minor = wh->major = 0;
+ wh->type = DEV_ALL;
+ wh->access = ACC_MKNOD | ACC_READ | ACC_WRITE;
+ list_add(&wh->list, &dev_cgroup->whitelist);
+ } else {
+ parent_dev_cgroup = cgroup_to_devcg(parent_cgroup);
+ ret = dev_whitelist_copy(&dev_cgroup->whitelist,
+ &parent_dev_cgroup->whitelist);
+ if (ret) {
+ kfree(dev_cgroup);
+ return ERR_PTR(ret);
+ }
+ }
+
+ spin_lock_init(&dev_cgroup->lock);
+ return &dev_cgroup->css;
+}
+
+static void devcg_destroy(struct cgroup_subsys *ss,
+ struct cgroup *cgroup)
+{
+ struct dev_cgroup *dev_cgroup;
+ struct dev_whitelist_item *wh, *tmp;
+
+ dev_cgroup = cgroup_to_devcg(cgroup);
+ list_for_each_entry_safe(wh, tmp, &dev_cgroup->whitelist, list) {
+ list_del(&wh->list);
+ kfree(wh);
+ }
+ kfree(dev_cgroup);
+}
+
+#define DEVCG_ALLOW 1
+#define DEVCG_DENY 2
+
+void set_access(char *acc, short access)
+{
+ int idx = 0;
+ memset(acc, 0, 4);
+ if (access & ACC_READ)
+ acc[idx++] = 'r';
+ if (access & ACC_WRITE)
+ acc[idx++] = 'w';
+ if (access & ACC_MKNOD)
+ acc[idx++] = 'm';
+}
+
+char type_to_char(short type)
+{
+ if (type == DEV_ALL)
+ return 'a';
+ if (type == DEV_CHAR)
+ return 'c';
+ if (type == DEV_BLOCK)
+ return 'b';
+ return 'X';
+}
+
+char *print_whitelist(struct dev_cgroup *devcgroup, int *len)
+{
+ char *buf, *s, acc[4];
+ struct dev_whitelist_item *wh;
+ int ret;
+ int count = 0;
+
+ buf = kmalloc(4096, GFP_KERNEL);
+ if (!buf)
+ return ERR_PTR(-ENOMEM);
+ s = buf;
+ *s = '\0';
+ *len = 0;
+
+ spin_lock(&devcgroup->lock);
+ list_for_each_entry(wh, &devcgroup->whitelist, list) {
+ set_access(acc, wh->access);
+ printk(KERN_NOTICE "%s (count%d): whtype %hd maj %u min %u acc %hd\n", __FUNCTION__,
+ count, wh->type, wh->major, wh->minor, wh->access);
+ ret = snprintf(s, 4095-(s-buf), "%c %u %u %s\n",
+ type_to_char(wh->type), wh->major, wh->minor, acc);
+ if (s+ret >= buf+4095) {
+ kfree(buf);
+ buf = ERR_PTR(-ENOMEM);
+ break;
+ }
+ s += ret;
+ *len += ret;
+ count++;
+ }
+ spin_unlock(&devcgroup->lock);
+
+ return buf;
+}
+
+static ssize_t devcg_access_read(struct cgroup *cgroup,
+ struct cftype *cft, struct file *file,
+ char __user *userbuf, size_t nbytes, loff_t *ppos)
+{
+ struct dev_cgroup *devcgrp = cgroup_to_devcg(cgroup);
+ int filetype = cft->private;
+ char *buffer;
+ int len, retval;
+
+ if (filetype != DEVCG_ALLOW)
+ return -EINVAL;
+ buffer = print_whitelist(devcgrp, &len);
+ if (IS_ERR(buffer))
+ return PTR_ERR(buffer);
+
+ retval = simple_read_from_buffer(userbuf, nbytes, ppos, buffer, len);
+ kfree(buffer);
+ return retval;
+}
+
+static inline short convert_access(char *acc)
+{
+ short access = 0;
+
+ while (*acc) {
+ switch (*acc) {
+ case 'r':
+ case 'R': access |= ACC_READ; break;
+ case 'w':
+ case 'W': access |= ACC_WRITE; break;
+ case 'm':
+ case 'M': access |= ACC_MKNOD; break;
+ case '\n': break;
+ default:
+ return -EINVAL;
+ }
+ acc++;
+ }
+
+ return access;
+}
+
+static inline short convert_type(char intype)
+{
+ short type = 0;
+ switch(intype) {
+ case 'a': type = DEV_ALL; break;
+ case 'c': type = DEV_CHAR; break;
+ case 'b': type = DEV_BLOCK; break;
+ default: type = -EACCES; break;
+ }
+ return type;
+}
+
+/*
+ * Current rules:
+ * CAP_SYS_ADMIN needed for all writes.
+ * when we have CAP_HOST_ADMIN, the rules will become:
+ * if (!writetoself)
+ * require capable(CAP_HOST_ADMIN | CAP_SYS_ADMIN);
+ * if (is_allow)
+ * require capable(CAP_HOST_ADMIN | CAP_SYS_ADMIN);
+ * require capable(CAP_SYS_ADMIN);
+ */
+static int have_write_permission(int is_allow, int writetoself)
+{
+ if (!capable(CAP_SYS_ADMIN))
+ return 0;
+ return 1;
+}
+
+static ssize_t devcg_access_write(struct cgroup *cgroup, struct cftype *cft,
+ struct file *file, const char __user *userbuf,
+ size_t nbytes, loff_t *ppos)
+{
+ struct cgroup *cur_cgroup;
+ struct dev_cgroup *devcgrp, *cur_devcgroup;
+ int filetype = cft->private;
+ char *buffer, acc[4];
+ int retval = 0;
+ int nitems;
+ char type;
+ struct dev_whitelist_item *wh;
+
+ devcgrp = cgroup_to_devcg(cgroup);
+ cur_cgroup = task_cgroup(current, devcg_subsys.subsys_id);
+ cur_devcgroup = cgroup_to_devcg(cur_cgroup);
+
+ if (!have_write_permission(filetype==DEVCG_ALLOW, cur_devcgroup==devcgrp))
+ return -EPERM;
+
+ buffer = kmalloc(nbytes+1, GFP_KERNEL);
+ if (!buffer)
+ return -ENOMEM;
+
+ wh = kmalloc(sizeof(*wh), GFP_KERNEL);
+ if (!wh) {
+ kfree(buffer);
+ return -ENOMEM;
+ }
+
+ if (copy_from_user(buffer, userbuf, nbytes)) {
+ retval = -EFAULT;
+ goto out1;
+ }
+ buffer[nbytes] = 0; /* nul-terminate */
+
+ cgroup_lock();
+ if (cgroup_is_removed(cgroup)) {
+ retval = -ENODEV;
+ goto out2;
+ }
+
+ memset(wh, 0, sizeof(*wh));
+ memset(acc, 0, 4);
+ nitems = sscanf(buffer, "%c %u %u %3s", &type, &wh->major, &wh->minor, acc);
+ retval = -EINVAL;
+ if (nitems != 4)
+ goto out2;
+ wh->type = convert_type(type);
+ if (wh->type < 0)
+ goto out2;
+ wh->access = convert_access(acc);
+ if (wh->access < 0)
+ goto out2;
+ retval = 0;
+ switch (filetype) {
+ case DEVCG_ALLOW:
+ printk(KERN_NOTICE "%s: add whtype %hd maj %u min %u acc %hd\n", __FUNCTION__,
+ wh->type, wh->major, wh->minor, wh->access);
+ dev_whitelist_add(devcgrp, wh);
+ break;
+ case DEVCG_DENY:
+ dev_whitelist_rm(devcgrp, wh);
+ break;
+ default:
+ retval = -EINVAL;
+ goto out2;
+ }
+
+ if (retval == 0)
+ retval = nbytes;
+
+out2:
+ cgroup_unlock();
+out1:
+ kfree(buffer);
+ return retval;
+}
+
+static struct cftype dev_cgroup_files[] = {
+ {
+ .name = "allow",
+ .read = devcg_access_read,
+ .write = devcg_access_write,
+ .private = DEVCG_ALLOW,
+ },
+ {
+ .name = "deny",
+ .write = devcg_access_write,
+ .private = DEVCG_DENY,
+ },
+};
+
+static int devcg_populate(struct cgroup_subsys *ss,
+ struct cgroup *cont)
+{
+ return cgroup_add_files(cont, ss, dev_cgroup_files,
+ ARRAY_SIZE(dev_cgroup_files));
+}
+
+struct cgroup_subsys devcg_subsys = {
+ .name = "devcg",
+ .can_attach = devcg_can_attach,
+ .create = devcg_create,
+ .destroy = devcg_destroy,
+ .populate = devcg_populate,
+ .subsys_id = devcg_subsys_id,
+};
+
+static int devcgroup_inode_permission(struct inode *inode, int mask,
+ struct nameidata *nd)
+{
+ struct cgroup *cgroup;
+ struct dev_cgroup *dev_cgroup;
+ struct dev_whitelist_item *wh;
+
+ dev_t device = inode->i_rdev;
+ if (!device)
+ return 0;
+ if (!S_ISBLK(inode->i_mode) && !S_ISCHR(inode->i_mode))
+ return 0;
+ cgroup = task_cgroup(current, devcg_subsys.subsys_id);
+ dev_cgroup = cgroup_to_devcg(cgroup);
+ if (!dev_cgroup)
+ return 0;
+
+ spin_lock(&dev_cgroup->lock);
+ /* if capable(CAP_HOST_ADMIN) return 0; */
+ list_for_each_entry(wh, &dev_cgroup->whitelist, list) {
+ if (wh->type & DEV_ALL)
+ goto acc_check;
+ if (S_ISBLK(inode->i_mode) && !(wh->type & DEV_BLOCK))
+ continue;
+ if (S_ISCHR(inode->i_mode) && !(wh->type & DEV_CHAR))
+ continue;
+ if (wh->major != imajor(inode) || wh->minor != iminor(inode))
+ continue;
+acc_check:
+ if ((mask & MAY_WRITE) && !(wh->access & ACC_WRITE))
+ continue;
+ if ((mask & MAY_READ) && !(wh->access & ACC_READ))
+ continue;
+ spin_unlock(&dev_cgroup->lock);
+ return 0;
+ }
+ spin_unlock(&dev_cgroup->lock);
+
+ printk(KERN_NOTICE "%s: %d denied %d access to %s (%lu)\n", __FUNCTION__,
+ current->pid, mask, nd ? nd->dentry->d_name.name : "null",
+ inode->i_ino);
+ return -EPERM;
+}
+
+static int devcgroup_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
+{
+ struct cgroup *cgroup;
+ struct dev_cgroup *dev_cgroup;
+ struct dev_whitelist_item *wh;
+
+ /* if capable(CAP_HOST_ADMIN) return 0; */
+ cgroup = task_cgroup(current, devcg_subsys.subsys_id);
+ dev_cgroup = cgroup_to_devcg(cgroup);
+ if (!dev_cgroup)
+ return 0;
+
+ spin_lock(&dev_cgroup->lock);
+ list_for_each_entry(wh, &dev_cgroup->whitelist, list) {
+ if (wh->type & DEV_ALL)
+ goto ok;
+ if (S_ISBLK(mode) && !(wh->type & DEV_BLOCK))
+ continue;
+ if (S_ISCHR(mode) && !(wh->type & DEV_CHAR))
+ continue;
+ if (wh->major != MAJOR(dev) || wh->minor != MINOR(dev))
+ continue;
+ if (wh->access & ACC_MKNOD)
+ goto ok;
+ }
+ spin_unlock(&dev_cgroup->lock);
+
+ printk(KERN_NOTICE "%s: %d denied %d access to (%d %d)\n", __FUNCTION__,
+ current->pid, mode, MAJOR(dev), MINOR(dev));
+ return -EPERM;
+
+ok:
+ spin_unlock(&dev_cgroup->lock);
+ return 0;
+}
+
+static struct security_operations devcgroup_security_ops = {
+ .inode_mknod = devcgroup_inode_mknod,
+ .inode_permission = devcgroup_inode_permission,
+};
+
+static int __init dev_cgroup_security_init (void)
+{
+ /* register ourselves with the security framework */
+ if (register_security (&devcgroup_security_ops)) {
+ printk (KERN_INFO "Failure registering device cgroup lsm\n");
+ return -EINVAL;
+ }
+ printk (KERN_INFO "Device cgroup LSM initialized\n");
+ return 0;
+}
+
+security_initcall (dev_cgroup_security_init);
--
1.5.1.1.GIT
On Fri, 2008-03-07 at 12:30 -0600, Serge E. Hallyn wrote:
> Quoting Casey Schaufler ([email protected]):
> >
> > --- "Serge E. Hallyn" <[email protected]> wrote:
> >
> > > ...
> > >
> > > Until user namespaces are complete, selinux seems the only good solution
> > > to offer isolation.
> >
> > Smack does it better and cheaper. (Unless you define good==selinux)
> > (insert smiley)
>
> Ah, thanks - I hadn't looked into it, but yes IIUC smack should
> definately work. I'll have to give that a shot.
Not if you want to confine uid 0. smack doesn't control capabilities,
even the ones used to override it.
So you'd have to at least configure your per-process bset and file caps
rather carefully. And even then you have to watch out for things with
CAP_MAC* or CAP_SETPCAP.
> (A basic selinux policy module to isolate a container was pretty simple,
> but providing finer-grained intra-container access seems to take some
> changes to the base refpolicy. I've been waiting a few weeks to find
> time to work on that.)
--
Stephen Smalley
National Security Agency
--- Stephen Smalley <[email protected]> wrote:
>
> On Fri, 2008-03-07 at 12:30 -0600, Serge E. Hallyn wrote:
> > Quoting Casey Schaufler ([email protected]):
> > >
> > > --- "Serge E. Hallyn" <[email protected]> wrote:
> > >
> > > > ...
> > > >
> > > > Until user namespaces are complete, selinux seems the only good
> solution
> > > > to offer isolation.
> > >
> > > Smack does it better and cheaper. (Unless you define good==selinux)
> > > (insert smiley)
> >
> > Ah, thanks - I hadn't looked into it, but yes IIUC smack should
> > definately work. I'll have to give that a shot.
>
> Not if you want to confine uid 0. smack doesn't control capabilities,
> even the ones used to override it.
>
> So you'd have to at least configure your per-process bset and file caps
> rather carefully. And even then you have to watch out for things with
> CAP_MAC* or CAP_SETPCAP.
Shrug. As if getting 800,000 lines of policy definition
for a thousand applications completely correct is going to be easier.
Casey Schaufler
[email protected]
Quoting Casey Schaufler ([email protected]):
>
> --- Stephen Smalley <[email protected]> wrote:
>
> >
> > On Fri, 2008-03-07 at 12:30 -0600, Serge E. Hallyn wrote:
> > > Quoting Casey Schaufler ([email protected]):
> > > >
> > > > --- "Serge E. Hallyn" <[email protected]> wrote:
> > > >
> > > > > ...
> > > > >
> > > > > Until user namespaces are complete, selinux seems the only good
> > solution
> > > > > to offer isolation.
> > > >
> > > > Smack does it better and cheaper. (Unless you define good==selinux)
> > > > (insert smiley)
> > >
> > > Ah, thanks - I hadn't looked into it, but yes IIUC smack should
> > > definately work. I'll have to give that a shot.
> >
> > Not if you want to confine uid 0. smack doesn't control capabilities,
> > even the ones used to override it.
> >
> > So you'd have to at least configure your per-process bset and file caps
> > rather carefully. And even then you have to watch out for things with
> > CAP_MAC* or CAP_SETPCAP.
>
> Shrug. As if getting 800,000 lines of policy definition
> for a thousand applications completely correct is going to be easier.
Folks, as I get time I will try with both :)
I suspect the CAP_MAC_ADMIN will mean containers won't be able to do any
policy updates without an update to smack to do a CAP_NS_OVERRIDE type
of thing. For SELinux I've got my hopes on the userspace policy daemon,
but first (my next step atm) I need to get the namespace thing going
(vserver1.root_t etc).
-serge
On Fri, Mar 07, 2008 at 12:50:52PM -0600, Serge E. Hallyn wrote:
> Quoting Greg KH ([email protected]):
> > On Fri, Mar 07, 2008 at 11:35:42AM -0600, Serge E. Hallyn wrote:
> > > > Do you really want to run other LSMs within a containerd kernel? Is
> > > > that a requirement? It would seem to run counter to the main goal of
> > > > containers to me.
> > >
> > > Until user namespaces are complete, selinux seems the only good solution
> > > to offer isolation.
> >
> > Great, use that instead :)
>
> That can't work as is since you can't specify major:minor in policy.
Your LSM can not, or the LSM interface does not allow this to happen?
> So all we could do again is simply refuse all mknod, which we can
> already do with per-process capability bounding sets.
I thought we passed that info down to the LSM module, can't you do your
selection at that point in time?
And then, just mediate open() like always, right?
thanks,
greg k-h
Quoting Greg KH ([email protected]):
> On Fri, Mar 07, 2008 at 12:50:52PM -0600, Serge E. Hallyn wrote:
> > Quoting Greg KH ([email protected]):
> > > On Fri, Mar 07, 2008 at 11:35:42AM -0600, Serge E. Hallyn wrote:
> > > > > Do you really want to run other LSMs within a containerd kernel? Is
> > > > > that a requirement? It would seem to run counter to the main goal of
> > > > > containers to me.
> > > >
> > > > Until user namespaces are complete, selinux seems the only good solution
> > > > to offer isolation.
> > >
> > > Great, use that instead :)
> >
> > That can't work as is since you can't specify major:minor in policy.
>
> Your LSM can not, or the LSM interface does not allow this to happen?
No my lsm in fact does, you just can't do it with selinux policy at the
moment. I was still responding to your "just use selinux" :)
> > So all we could do again is simply refuse all mknod, which we can
> > already do with per-process capability bounding sets.
>
> I thought we passed that info down to the LSM module, can't you do your
> selection at that point in time?
>
> And then, just mediate open() like always, right?
Yup, the patch I included inline does that.
An LSM can address the problem. It just felt like more of a
patch-over-the-real-problem kind of solution.
thanks,
-serge
On Sat, Mar 08, 2008 at 03:47:57PM -0600, Serge E. Hallyn wrote:
> Quoting Greg KH ([email protected]):
> > On Fri, Mar 07, 2008 at 12:50:52PM -0600, Serge E. Hallyn wrote:
> > > Quoting Greg KH ([email protected]):
> > > > On Fri, Mar 07, 2008 at 11:35:42AM -0600, Serge E. Hallyn wrote:
> > > > > > Do you really want to run other LSMs within a containerd kernel? Is
> > > > > > that a requirement? It would seem to run counter to the main goal of
> > > > > > containers to me.
> > > > >
> > > > > Until user namespaces are complete, selinux seems the only good solution
> > > > > to offer isolation.
> > > >
> > > > Great, use that instead :)
> > >
> > > That can't work as is since you can't specify major:minor in policy.
> >
> > Your LSM can not, or the LSM interface does not allow this to happen?
>
> No my lsm in fact does, you just can't do it with selinux policy at the
> moment. I was still responding to your "just use selinux" :)
I never said "use selinux", do you think I am crazy? :)
Just use your own lsm, that's all I recommended.
> > > So all we could do again is simply refuse all mknod, which we can
> > > already do with per-process capability bounding sets.
> >
> > I thought we passed that info down to the LSM module, can't you do your
> > selection at that point in time?
> >
> > And then, just mediate open() like always, right?
>
> Yup, the patch I included inline does that.
Great. But don't put that other file in the core kernel, put it in
security/ please.
> An LSM can address the problem. It just felt like more of a
> patch-over-the-real-problem kind of solution.
I disagree, it sounds exactly like what LSM is for.
thanks,
greg k-h
Quoting Greg KH ([email protected]):
> On Sat, Mar 08, 2008 at 03:47:57PM -0600, Serge E. Hallyn wrote:
> > Quoting Greg KH ([email protected]):
> > > On Fri, Mar 07, 2008 at 12:50:52PM -0600, Serge E. Hallyn wrote:
> > > > Quoting Greg KH ([email protected]):
> > > > > On Fri, Mar 07, 2008 at 11:35:42AM -0600, Serge E. Hallyn wrote:
> > > > > > > Do you really want to run other LSMs within a containerd kernel? Is
> > > > > > > that a requirement? It would seem to run counter to the main goal of
> > > > > > > containers to me.
> > > > > >
> > > > > > Until user namespaces are complete, selinux seems the only good solution
> > > > > > to offer isolation.
> > > > >
> > > > > Great, use that instead :)
> > > >
> > > > That can't work as is since you can't specify major:minor in policy.
> > >
> > > Your LSM can not, or the LSM interface does not allow this to happen?
> >
> > No my lsm in fact does, you just can't do it with selinux policy at the
> > moment. I was still responding to your "just use selinux" :)
>
> I never said "use selinux", do you think I am crazy? :)
>
> Just use your own lsm, that's all I recommended.
>
> > > > So all we could do again is simply refuse all mknod, which we can
> > > > already do with per-process capability bounding sets.
> > >
> > > I thought we passed that info down to the LSM module, can't you do your
> > > selection at that point in time?
> > >
> > > And then, just mediate open() like always, right?
> >
> > Yup, the patch I included inline does that.
>
> Great. But don't put that other file in the core kernel, put it in
> security/ please.
>
> > An LSM can address the problem. It just felt like more of a
> > patch-over-the-real-problem kind of solution.
>
> I disagree, it sounds exactly like what LSM is for.
Ok, I went ahead and recreated the two files I had lost by not
git-adding them. I suspect if we were to use this in place of Pavel's
patch, we'd want to switch the API over to what he was using? I think
Pavel and Paul Menage had fine-tuned his somewhat... If Pavel doesn't
gag, maybe we could just use his cgroup code minus the kobject code
plus the two LSM hooks?
-serge
>From ff9f9a190d8664275fb86c7d5b2153fd79b990d7 Mon Sep 17 00:00:00 2001
From: Serge E. Hallyn <[email protected]>
Date: Mon, 10 Mar 2008 13:01:33 -0700
Subject: [PATCH 1/1] cgroups: implement device whitelist cgroup+lsm
Implement a cgroup using the LSM interface to enforce open and mknod
on device files. Not a line of this code is expected to be used in a
final version, this is just a proof of concept to explore whether we
can or should use an LSM for this until device namespaces are really
needed.
This implements a simple device access whitelist. A whitelist entry
has 4 fields. 'type' is a (all), c (char), or b (block). 'all' means it
applies to all types, all major numbers, and all minor numbers. Major and
minor are obvious. Access is a composition of r (read), w (write), and
m (mknod).
The root devcgroup starts with rwm to 'all'. A child devcg gets a copy
of the parent. Admins can then add and remove devices to the whitelist.
Once CAP_HOST_ADMIN is introduced it will be needed to add entries as
well or remove entries from another cgroup, though just CAP_SYS_ADMIN
will suffice to remove entries for your own group.
An entry is added by doing "echo <type> <maj> <min> <access>" > devcg.allow,
for instance:
echo b 7 0 mrw > /cgroups/1/devcg.allow
An entry is removed by doing likewise into devcg.deny. Since this is a
pure whitelist, not acls, you can only remove entries which exist in the
whitelist. You must explicitly
echo a 0 0 mrw > /cgroups/1/devcg.deny
to remove the "allow all" entry which is automatically inherited from
the root cgroup.
While composing this with the ns_cgroup may seem logical, it may not
be the right thing to do. Note that each newly created devcg gets
a copy of the parent whitelist. So if you had done
mount -t cgroup -o ns,devcg none /cgroups
then once a process in /cgroup/1 had done an unshare(CLONE_NEWNS)
it would be under /cgroup/1/node_<pid>
if an admin did
echo b 7 0 m > /cgroups/1/devcg.deny
then the entry would still be in the whitelist for /cgroups/1/node_<pid>.
Something to discuss if we get that far before nixing this whole idea.
CAP_NS_OVERRIDE is defined as a capability needed to tweak device
access.
CONFIG_COMMONCAP is defined whenever security/commoncap.c should
be compiled, so that the decision of whether to show the option
for FILE_CAPABILITIES can be a bit cleaner.
Signed-off-by: Serge E. Hallyn <[email protected]>
---
include/linux/capability.h | 11 +-
include/linux/cgroup_subsys.h | 6 +
include/linux/devcg.h | 53 ++++++
init/Kconfig | 7 +
kernel/Makefile | 1 +
kernel/dev_cgroup.c | 386 +++++++++++++++++++++++++++++++++++++++++
security/Kconfig | 8 +-
security/Makefile | 12 +-
security/dev_cgroup.c | 132 ++++++++++++++
9 files changed, 606 insertions(+), 10 deletions(-)
create mode 100644 include/linux/devcg.h
create mode 100644 kernel/dev_cgroup.c
create mode 100644 security/dev_cgroup.c
diff --git a/include/linux/capability.h b/include/linux/capability.h
index eaab759..f8ecba1 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -333,7 +333,16 @@ typedef struct kernel_cap_struct {
#define CAP_MAC_ADMIN 33
-#define CAP_LAST_CAP CAP_MAC_ADMIN
+/* Allow acting on resources in another namespace. In particular:
+ * 1. when combined with CAP_MKNOD and dev_cgroup is enabled,
+ * allow creation of devices not in the device whitelist.
+ * 2. whencombined with CAP_SYS_ADMIN and dev_cgroup is enabled,
+ * allow editing device cgroup whitelist
+ */
+
+#define CAP_NS_OVERRIDE 34
+
+#define CAP_LAST_CAP CAP_NS_OVERRIDE
#define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index 1ddebfc..01e8034 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -42,3 +42,9 @@ SUBSYS(mem_cgroup)
#endif
/* */
+
+#ifdef CONFIG_CGROUP_DEV
+SUBSYS(devcg)
+#endif
+
+/* */
diff --git a/include/linux/devcg.h b/include/linux/devcg.h
new file mode 100644
index 0000000..764a734
--- /dev/null
+++ b/include/linux/devcg.h
@@ -0,0 +1,53 @@
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/security.h>
+
+#include <asm/uaccess.h>
+
+#define ACC_MKNOD 1
+#define ACC_READ 2
+#define ACC_WRITE 4
+
+#define DEV_BLOCK 1
+#define DEV_CHAR 2
+#define DEV_ALL 4 /* this represents all devices */
+
+/*
+ * whitelist locking rules:
+ * cgroup_lock() cannot be taken under cgroup->lock.
+ * cgroup->lock can be taken with or without cgroup_lock().
+ *
+ * modifications always require cgroup_lock
+ * modifications to a list which is visible require the
+ * cgroup->lock *and* cgroup_lock()
+ * walking the list requires cgroup->lock or cgroup_lock().
+ *
+ * reasoning: dev_whitelist_copy() needs to kmalloc, so needs
+ * a mutex, which the cgroup_lock() is. Since modifying
+ * a visible list requires both locks, either lock can be
+ * taken for walking the list. Since the wh->spinlock is taken
+ * for modifying a public-accessible list, the spinlock is
+ * sufficient for just walking the list.
+ */
+
+struct dev_whitelist_item {
+ u32 major, minor;
+ short type;
+ short access;
+ struct list_head list;
+};
+
+struct dev_cgroup {
+ struct cgroup_subsys_state css;
+ struct list_head whitelist;
+ spinlock_t lock;
+};
+
+static inline struct dev_cgroup *cgroup_to_devcg(
+ struct cgroup *cgroup)
+{
+ return container_of(cgroup_subsys_state(cgroup, devcg_subsys_id),
+ struct dev_cgroup, css);
+}
diff --git a/init/Kconfig b/init/Kconfig
index 77b2ac8..db302fa 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -298,6 +298,13 @@ config CGROUP_NS
for instance virtual servers and checkpoint/restart
jobs.
+config CGROUP_DEV
+ bool "Device controller for cgroups"
+ depends on CGROUPS && SECURITY && EXPERIMENTAL
+ help
+ Provides a cgroup implementing whitelists for devices which
+ a process in the cgroup can mknod or open.
+
config CPUSETS
bool "Cpuset support"
depends on SMP && CGROUPS
diff --git a/kernel/Makefile b/kernel/Makefile
index 4336031..2e93d27 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -42,6 +42,7 @@ obj-$(CONFIG_CGROUPS) += cgroup.o
obj-$(CONFIG_CGROUP_DEBUG) += cgroup_debug.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_CGROUP_NS) += ns_cgroup.o
+obj-$(CONFIG_CGROUP_DEV) += dev_cgroup.o
obj-$(CONFIG_UTS_NS) += utsname.o
obj-$(CONFIG_USER_NS) += user_namespace.o
obj-$(CONFIG_PID_NS) += pid_namespace.o
diff --git a/kernel/dev_cgroup.c b/kernel/dev_cgroup.c
new file mode 100644
index 0000000..a0f4371
--- /dev/null
+++ b/kernel/dev_cgroup.c
@@ -0,0 +1,386 @@
+/*
+ * dev_cgroup.c - device cgroup subsystem
+ *
+ * Copyright 2007 IBM Corp
+ */
+
+#include <linux/devcg.h>
+
+static int devcg_can_attach(struct cgroup_subsys *ss,
+ struct cgroup *new_cgroup, struct task_struct *task)
+{
+ struct cgroup *orig;
+
+ if (current != task) {
+ if (!cgroup_is_descendant(new_cgroup))
+ return -EPERM;
+ }
+
+ if (atomic_read(&new_cgroup->count) != 0)
+ return -EPERM;
+
+ orig = task_cgroup(task, devcg_subsys_id);
+ if (orig && orig != new_cgroup->parent)
+ return -EPERM;
+
+ return 0;
+}
+
+/*
+ * called under cgroup_lock()
+ */
+int dev_whitelist_copy(struct list_head *dest, struct list_head *orig)
+{
+ struct dev_whitelist_item *wh, *tmp, *new;
+
+ list_for_each_entry(wh, orig, list) {
+ new = kmalloc(sizeof(*wh), GFP_KERNEL);
+ if (!new)
+ goto free_and_exit;
+ new->major = wh->major;
+ new->minor = wh->minor;
+ new->type = wh->type;
+ new->access = wh->access;
+ list_add_tail(&new->list, dest);
+ }
+
+ return 0;
+
+free_and_exit:
+ list_for_each_entry_safe(wh, tmp, dest, list) {
+ list_del(&wh->list);
+ kfree(wh);
+ }
+ return -ENOMEM;
+}
+
+/* Stupid prototype - don't bother combining existing entries */
+/*
+ * called under cgroup_lock()
+ * since the list is visible to other tasks, we need the spinlock also
+ */
+void dev_whitelist_add(struct dev_cgroup *dev_cgroup,
+ struct dev_whitelist_item *wh)
+{
+ spin_lock(&dev_cgroup->lock);
+ list_add_tail(&wh->list, &dev_cgroup->whitelist);
+ spin_unlock(&dev_cgroup->lock);
+}
+
+/*
+ * called under cgroup_lock()
+ * since the list is visible to other tasks, we need the spinlock also
+ */
+void dev_whitelist_rm(struct dev_cgroup *dev_cgroup,
+ struct dev_whitelist_item *wh)
+{
+ struct dev_whitelist_item *walk, *tmp;
+
+ spin_lock(&dev_cgroup->lock);
+ list_for_each_entry_safe(walk, tmp, &dev_cgroup->whitelist, list) {
+ if (walk->type & DEV_ALL) {
+ list_del(&walk->list);
+ kfree(walk);
+ continue;
+ }
+ if (walk->type != wh->type)
+ continue;
+ if (walk->major != wh->major || walk->minor != wh->minor)
+ continue;
+ walk->access &= ~wh->access;
+ if (!walk->access) {
+ list_del(&walk->list);
+ kfree(walk);
+ }
+ }
+ spin_unlock(&dev_cgroup->lock);
+}
+
+/*
+ * Rules: you can only create a cgroup if
+ * 1. you are capable(CAP_SYS_ADMIN|CAP_NS_OVERRIDE)
+ * 2. the target cgroup is a descendant of your own cgroup
+ *
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *devcg_create(struct cgroup_subsys *ss,
+ struct cgroup *cgroup)
+{
+ struct dev_cgroup *dev_cgroup, *parent_dev_cgroup;
+ struct cgroup *parent_cgroup;
+ int ret;
+
+ if (!capable(CAP_SYS_ADMIN) || !capable(CAP_NS_OVERRIDE))
+ return ERR_PTR(-EPERM);
+ if (!cgroup_is_descendant(cgroup))
+ return ERR_PTR(-EPERM);
+
+ dev_cgroup = kzalloc(sizeof(*dev_cgroup), GFP_KERNEL);
+ if (!dev_cgroup)
+ return ERR_PTR(-ENOMEM);
+ INIT_LIST_HEAD(&dev_cgroup->whitelist);
+ parent_cgroup = cgroup->parent;
+
+ if (parent_cgroup == NULL) {
+ struct dev_whitelist_item *wh;
+ wh = kmalloc(sizeof(*wh), GFP_KERNEL);
+ wh->minor = wh->major = 0;
+ wh->type = DEV_ALL;
+ wh->access = ACC_MKNOD | ACC_READ | ACC_WRITE;
+ list_add(&wh->list, &dev_cgroup->whitelist);
+ } else {
+ parent_dev_cgroup = cgroup_to_devcg(parent_cgroup);
+ ret = dev_whitelist_copy(&dev_cgroup->whitelist,
+ &parent_dev_cgroup->whitelist);
+ if (ret) {
+ kfree(dev_cgroup);
+ return ERR_PTR(ret);
+ }
+ }
+
+ spin_lock_init(&dev_cgroup->lock);
+ return &dev_cgroup->css;
+}
+
+static void devcg_destroy(struct cgroup_subsys *ss,
+ struct cgroup *cgroup)
+{
+ struct dev_cgroup *dev_cgroup;
+ struct dev_whitelist_item *wh, *tmp;
+
+ dev_cgroup = cgroup_to_devcg(cgroup);
+ list_for_each_entry_safe(wh, tmp, &dev_cgroup->whitelist, list) {
+ list_del(&wh->list);
+ kfree(wh);
+ }
+ kfree(dev_cgroup);
+}
+
+#define DEVCG_ALLOW 1
+#define DEVCG_DENY 2
+
+void set_access(char *acc, short access)
+{
+ int idx = 0;
+ memset(acc, 0, 4);
+ if (access & ACC_READ)
+ acc[idx++] = 'r';
+ if (access & ACC_WRITE)
+ acc[idx++] = 'w';
+ if (access & ACC_MKNOD)
+ acc[idx++] = 'm';
+}
+
+char type_to_char(short type)
+{
+ if (type == DEV_ALL)
+ return 'a';
+ if (type == DEV_CHAR)
+ return 'c';
+ if (type == DEV_BLOCK)
+ return 'b';
+ return 'X';
+}
+
+char *print_whitelist(struct dev_cgroup *devcgroup, int *len)
+{
+ char *buf, *s, acc[4];
+ struct dev_whitelist_item *wh;
+ int ret;
+ int count = 0;
+
+ buf = kmalloc(4096, GFP_KERNEL);
+ if (!buf)
+ return ERR_PTR(-ENOMEM);
+ s = buf;
+ *s = '\0';
+ *len = 0;
+
+ spin_lock(&devcgroup->lock);
+ list_for_each_entry(wh, &devcgroup->whitelist, list) {
+ set_access(acc, wh->access);
+ printk(KERN_NOTICE
+ "%s (count%d): whtype %hd maj %u min %u acc %hd\n",
+ __FUNCTION__, count, wh->type, wh->major, wh->minor,
+ wh->access);
+ ret = snprintf(s, 4095-(s-buf), "%c %u %u %s\n",
+ type_to_char(wh->type), wh->major, wh->minor, acc);
+ if (s+ret >= buf+4095) {
+ kfree(buf);
+ buf = ERR_PTR(-ENOMEM);
+ break;
+ }
+ s += ret;
+ *len += ret;
+ count++;
+ }
+ spin_unlock(&devcgroup->lock);
+
+ return buf;
+}
+
+static ssize_t devcg_access_read(struct cgroup *cgroup,
+ struct cftype *cft, struct file *file,
+ char __user *userbuf, size_t nbytes, loff_t *ppos)
+{
+ struct dev_cgroup *devcgrp = cgroup_to_devcg(cgroup);
+ int filetype = cft->private;
+ char *buffer;
+ int len, retval;
+
+ if (filetype != DEVCG_ALLOW)
+ return -EINVAL;
+ buffer = print_whitelist(devcgrp, &len);
+ if (IS_ERR(buffer))
+ return PTR_ERR(buffer);
+
+ retval = simple_read_from_buffer(userbuf, nbytes, ppos, buffer, len);
+ kfree(buffer);
+ return retval;
+}
+
+static inline short convert_access(char *acc)
+{
+ short access = 0;
+
+ while (*acc) {
+ switch (*acc) {
+ case 'r':
+ case 'R': access |= ACC_READ; break;
+ case 'w':
+ case 'W': access |= ACC_WRITE; break;
+ case 'm':
+ case 'M': access |= ACC_MKNOD; break;
+ case '\n': break;
+ default:
+ return -EINVAL;
+ }
+ acc++;
+ }
+
+ return access;
+}
+
+static inline short convert_type(char intype)
+{
+ short type = 0;
+ switch (intype) {
+ case 'a': type = DEV_ALL; break;
+ case 'c': type = DEV_CHAR; break;
+ case 'b': type = DEV_BLOCK; break;
+ default: type = -EACCES; break;
+ }
+ return type;
+}
+
+static ssize_t devcg_access_write(struct cgroup *cgroup, struct cftype *cft,
+ struct file *file, const char __user *userbuf,
+ size_t nbytes, loff_t *ppos)
+{
+ struct cgroup *cur_cgroup;
+ struct dev_cgroup *devcgrp, *cur_devcgroup;
+ int filetype = cft->private;
+ char *buffer, acc[4];
+ int retval = 0;
+ int nitems;
+ char type;
+ struct dev_whitelist_item *wh;
+
+ if (!capable(CAP_SYS_ADMIN) || !capable(CAP_NS_OVERRIDE))
+ return -EPERM;
+
+ devcgrp = cgroup_to_devcg(cgroup);
+ cur_cgroup = task_cgroup(current, devcg_subsys.subsys_id);
+ cur_devcgroup = cgroup_to_devcg(cur_cgroup);
+
+ buffer = kmalloc(nbytes+1, GFP_KERNEL);
+ if (!buffer)
+ return -ENOMEM;
+
+ wh = kmalloc(sizeof(*wh), GFP_KERNEL);
+ if (!wh) {
+ kfree(buffer);
+ return -ENOMEM;
+ }
+
+ if (copy_from_user(buffer, userbuf, nbytes)) {
+ retval = -EFAULT;
+ goto out1;
+ }
+ buffer[nbytes] = 0; /* nul-terminate */
+
+ cgroup_lock();
+ if (cgroup_is_removed(cgroup)) {
+ retval = -ENODEV;
+ goto out2;
+ }
+
+ memset(wh, 0, sizeof(*wh));
+ memset(acc, 0, 4);
+ nitems = sscanf(buffer, "%c %u %u %3s", &type, &wh->major, &wh->minor,
+ acc);
+ retval = -EINVAL;
+ if (nitems != 4)
+ goto out2;
+ wh->type = convert_type(type);
+ if (wh->type < 0)
+ goto out2;
+ wh->access = convert_access(acc);
+ if (wh->access < 0)
+ goto out2;
+ retval = 0;
+ switch (filetype) {
+ case DEVCG_ALLOW:
+ printk(KERN_NOTICE
+ "%s: add whtype %hd maj %u min %u acc %hd\n",
+ __FUNCTION__, wh->type, wh->major, wh->minor,
+ wh->access);
+ dev_whitelist_add(devcgrp, wh);
+ break;
+ case DEVCG_DENY:
+ dev_whitelist_rm(devcgrp, wh);
+ break;
+ default:
+ retval = -EINVAL;
+ goto out2;
+ }
+
+ if (retval == 0)
+ retval = nbytes;
+
+out2:
+ cgroup_unlock();
+out1:
+ kfree(buffer);
+ return retval;
+}
+
+static struct cftype dev_cgroup_files[] = {
+ {
+ .name = "allow",
+ .read = devcg_access_read,
+ .write = devcg_access_write,
+ .private = DEVCG_ALLOW,
+ },
+ {
+ .name = "deny",
+ .write = devcg_access_write,
+ .private = DEVCG_DENY,
+ },
+};
+
+static int devcg_populate(struct cgroup_subsys *ss,
+ struct cgroup *cont)
+{
+ return cgroup_add_files(cont, ss, dev_cgroup_files,
+ ARRAY_SIZE(dev_cgroup_files));
+}
+
+struct cgroup_subsys devcg_subsys = {
+ .name = "devcg",
+ .can_attach = devcg_can_attach,
+ .create = devcg_create,
+ .destroy = devcg_destroy,
+ .populate = devcg_populate,
+ .subsys_id = devcg_subsys_id,
+};
diff --git a/security/Kconfig b/security/Kconfig
index 5dfc206..8082edc 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -75,15 +75,19 @@ config SECURITY_NETWORK_XFRM
config SECURITY_CAPABILITIES
bool "Default Linux Capabilities"
- depends on SECURITY
+ depends on SECURITY && !CGROUP_DEV
default y
help
This enables the "default" Linux capabilities functionality.
If you are unsure how to answer this question, answer Y.
+config COMMONCAP
+ bool
+ default !SECURITY || SECURITY_CAPABILITIES || SECURITY_ROOTPLUG || SECURITY_SMACK || CGROUP_DEV
+
config SECURITY_FILE_CAPABILITIES
bool "File POSIX Capabilities (EXPERIMENTAL)"
- depends on (SECURITY=n || SECURITY_CAPABILITIES!=n) && EXPERIMENTAL
+ depends on COMMONCAP && EXPERIMENTAL
default n
help
This enables filesystem capabilities, allowing you to give
diff --git a/security/Makefile b/security/Makefile
index 9e8b025..6093003 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -6,15 +6,13 @@ obj-$(CONFIG_KEYS) += keys/
subdir-$(CONFIG_SECURITY_SELINUX) += selinux
subdir-$(CONFIG_SECURITY_SMACK) += smack
-# if we don't select a security model, use the default capabilities
-ifneq ($(CONFIG_SECURITY),y)
-obj-y += commoncap.o
-endif
+obj-$(CONFIG_COMMONCAP) += commoncap.o
# Object file lists
obj-$(CONFIG_SECURITY) += security.o dummy.o inode.o
# Must precede capability.o in order to stack properly.
obj-$(CONFIG_SECURITY_SELINUX) += selinux/built-in.o
-obj-$(CONFIG_SECURITY_SMACK) += commoncap.o smack/built-in.o
-obj-$(CONFIG_SECURITY_CAPABILITIES) += commoncap.o capability.o
-obj-$(CONFIG_SECURITY_ROOTPLUG) += commoncap.o root_plug.o
+obj-$(CONFIG_SECURITY_SMACK) += smack/built-in.o
+obj-$(CONFIG_SECURITY_CAPABILITIES) += capability.o
+obj-$(CONFIG_SECURITY_ROOTPLUG) += root_plug.o
+obj-$(CONFIG_CGROUP_DEV) += dev_cgroup.o
diff --git a/security/dev_cgroup.c b/security/dev_cgroup.c
new file mode 100644
index 0000000..2719cdb
--- /dev/null
+++ b/security/dev_cgroup.c
@@ -0,0 +1,132 @@
+/*
+ * LSM portion of the device cgroup subsystem.
+ *
+ * Copyright 2007 IBM Corp
+ */
+
+#include <linux/devcg.h>
+
+extern struct cgroup_subsys devcg_subsys;
+
+static int devcgroup_inode_permission(struct inode *inode, int mask,
+ struct nameidata *nd)
+{
+ struct cgroup *cgroup;
+ struct dev_cgroup *dev_cgroup;
+ struct dev_whitelist_item *wh;
+
+ dev_t device = inode->i_rdev;
+ if (!device)
+ return 0;
+ if (!S_ISBLK(inode->i_mode) && !S_ISCHR(inode->i_mode))
+ return 0;
+ cgroup = task_cgroup(current, devcg_subsys.subsys_id);
+ dev_cgroup = cgroup_to_devcg(cgroup);
+ if (!dev_cgroup)
+ return 0;
+
+ spin_lock(&dev_cgroup->lock);
+ list_for_each_entry(wh, &dev_cgroup->whitelist, list) {
+ if (wh->type & DEV_ALL)
+ goto acc_check;
+ if (S_ISBLK(inode->i_mode) && !(wh->type & DEV_BLOCK))
+ continue;
+ if (S_ISCHR(inode->i_mode) && !(wh->type & DEV_CHAR))
+ continue;
+ if (wh->major != imajor(inode) || wh->minor != iminor(inode))
+ continue;
+acc_check:
+ if ((mask & MAY_WRITE) && !(wh->access & ACC_WRITE))
+ continue;
+ if ((mask & MAY_READ) && !(wh->access & ACC_READ))
+ continue;
+ spin_unlock(&dev_cgroup->lock);
+ return 0;
+ }
+ spin_unlock(&dev_cgroup->lock);
+
+ return -EPERM;
+}
+
+static int devcgroup_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
+{
+ struct cgroup *cgroup;
+ struct dev_cgroup *dev_cgroup;
+ struct dev_whitelist_item *wh;
+
+ cgroup = task_cgroup(current, devcg_subsys.subsys_id);
+ dev_cgroup = cgroup_to_devcg(cgroup);
+ if (!dev_cgroup)
+ return 0;
+
+ spin_lock(&dev_cgroup->lock);
+ list_for_each_entry(wh, &dev_cgroup->whitelist, list) {
+ if (wh->type & DEV_ALL)
+ goto ok;
+ if (S_ISBLK(mode) && !(wh->type & DEV_BLOCK))
+ continue;
+ if (S_ISCHR(mode) && !(wh->type & DEV_CHAR))
+ continue;
+ if (wh->major != MAJOR(dev) || wh->minor != MINOR(dev))
+ continue;
+ if (wh->access & ACC_MKNOD)
+ goto ok;
+ }
+ spin_unlock(&dev_cgroup->lock);
+
+ printk(KERN_NOTICE "%s: %d denied %d access to (%d %d)\n", __FUNCTION__,
+ current->pid, mode, MAJOR(dev), MINOR(dev));
+ return -EPERM;
+
+ok:
+ spin_unlock(&dev_cgroup->lock);
+ return 0;
+}
+
+static struct security_operations devcgroup_security_ops = {
+ .inode_mknod = devcgroup_inode_mknod,
+ .inode_permission = devcgroup_inode_permission,
+
+ .ptrace = cap_ptrace,
+ .capget = cap_capget,
+ .capset_check = cap_capset_check,
+ .capset_set = cap_capset_set,
+ .capable = cap_capable,
+ .settime = cap_settime,
+ .netlink_send = cap_netlink_send,
+ .netlink_recv = cap_netlink_recv,
+
+ .bprm_apply_creds = cap_bprm_apply_creds,
+ .bprm_set_security = cap_bprm_set_security,
+ .bprm_secureexec = cap_bprm_secureexec,
+
+ .inode_setxattr = cap_inode_setxattr,
+ .inode_removexattr = cap_inode_removexattr,
+ .inode_need_killpriv = cap_inode_need_killpriv,
+ .inode_killpriv = cap_inode_killpriv,
+
+ .task_kill = cap_task_kill,
+ .task_setscheduler = cap_task_setscheduler,
+ .task_setioprio = cap_task_setioprio,
+ .task_setnice = cap_task_setnice,
+ .task_post_setuid = cap_task_post_setuid,
+ .task_prctl = cap_task_prctl,
+ .task_reparent_to_init = cap_task_reparent_to_init,
+
+ .syslog = cap_syslog,
+
+ .vm_enough_memory = cap_vm_enough_memory,
+};
+
+static int __init dev_cgroup_security_init (void)
+{
+ /* register ourselves with the security framework */
+ if (register_security (&devcgroup_security_ops)) {
+ printk (KERN_INFO "Failure registering device cgroup lsm\n");
+ return -EINVAL;
+ }
+ printk (KERN_INFO "Device cgroup LSM initialized\n");
+ return 0;
+}
+
+security_initcall (dev_cgroup_security_init);
--
1.5.1
[snip]
>> I have basically three objections against LSM.
>>
>> 1. LSM is not stackable, so loading this tiny module with devices
>> access rights will block other security modules;
>
> Do you really want to run other LSMs within a containerd kernel? Is
> that a requirement? It would seem to run counter to the main goal of
> containers to me.
_I_ do not want to run LSM with containers, but this was a requirement
from some people when we decided which approach to chose - LSM of mine.
Nevertheless, this is not very important, indeed.
>> 2. Turning CONFIG_SECURITY on immediately causes all the other hooks
>> to get called. This affects performance on critical paths, like
>> process creation/destruction, network flow and so on. This impact
>> is small, but noticeable;
>
> The last time this was brought up, it was not noticable, except for some
> network paths. And even then, the number was lost in the noise from
> what I saw. I think with a containered machine, you have bigger things
> to be worried about :)
The namespace-based virtualization implies no performance overhead for
a container, actually. This is not xen or vmware where performance
loss is OK.
Besides, I've measured some things - the lat_syscall test for open from
lmbench test suite and the nptl perf test. Here are the results:
sec nosec
open 3.0980s 3.0709s
nptl 2.7746s 2.7710s
So we have 0.88% loss in open and ~0.15% with nptl. I know, this is not that
much, but it is noticeable. Besides, this is only two tests, digging deeper
may reveal more.
Let alone the fact that simply turning the CONFIG_SECURITY to 'y' puts +8Kb
to the vmlinux...
I think, I finally agree with you and Al Viro, that the kobj mapper is
not the right place to put the filtering, but taking the above numbers
into account, can we put the "hooks" into the #else /* CONFIG_SECURITY */
versions of security_inode_permission/security_file_permission/etc?
>> 3. With LSM turned on we'll have to "virtualize" it, i.e. make its
>> work safe in a container. I don't presume to judge how much work
>> will have to be done in this area, so the result patch would be
>> even larger and maybe will duplicate functionality, which is currently
>> in cgroups. OTOH, cgroups already provide the ways to correctly
>> delegate proper rights to containers.
>
> No, your lsm would be your "virtualize" policy. I don't think you would
> have to do any additional work here, but could be wrong. Would like to
> see the code to prove it.
>
>>> Opening a dev node is not on any "fast path" that you need to be
>>> concerned about a few extra calls within the kernel.
>>>
>>> And, I think in the end your patch would be much smaller and easier to
>>> understand and review and maintain overall.
>> Hardly - the largest part of my patch is cgroup manipulations. The part
>> that makes the char and block layers switch to new map ac check the
>> permissions is 10-20 lines of new code.
>>
>> But with LSM I will still need this API.
>
> Yes, but your LSM hooks will be smaller than the code modifications to
> the map logic :)
>
> Again, I object to this as you are driving a new security policy
> infrastructure into the device node logic where it does not belong as we
> already have this functionality in the LSM interface today. Please use
> that one instead and don't clutter up the kernel with "one-off" security
> changes like this one.
>
> Please try the LSM interface and see what happens. If, after you have
> created a patch, you still have objections, please post it for review
> and I will be glad to revisit my opinion at that time.
>
> thanks,
>
> greg k-h
>
On Tue, Mar 11, 2008 at 12:57:55PM +0300, Pavel Emelyanov wrote:
>
> Besides, I've measured some things - the lat_syscall test for open from
> lmbench test suite and the nptl perf test. Here are the results:
>
> sec nosec
> open 3.0980s 3.0709s
> nptl 2.7746s 2.7710s
>
> So we have 0.88% loss in open and ~0.15% with nptl. I know, this is not that
> much, but it is noticeable. Besides, this is only two tests, digging deeper
> may reveal more.
I think that is in the noise of sampling if you run that test many more
times.
> Let alone the fact that simply turning the CONFIG_SECURITY to 'y' puts +8Kb
> to the vmlinux...
>
> I think, I finally agree with you and Al Viro, that the kobj mapper is
> not the right place to put the filtering, but taking the above numbers
> into account, can we put the "hooks" into the #else /* CONFIG_SECURITY */
> versions of security_inode_permission/security_file_permission/etc?
Ask the security module interface maintainers about this, not me :)
good luck,
greg k-h
Greg KH wrote:
> On Tue, Mar 11, 2008 at 12:57:55PM +0300, Pavel Emelyanov wrote:
>> Besides, I've measured some things - the lat_syscall test for open from
>> lmbench test suite and the nptl perf test. Here are the results:
>>
>> sec nosec
>> open 3.0980s 3.0709s
>> nptl 2.7746s 2.7710s
>>
>> So we have 0.88% loss in open and ~0.15% with nptl. I know, this is not that
>> much, but it is noticeable. Besides, this is only two tests, digging deeper
>> may reveal more.
>
> I think that is in the noise of sampling if you run that test many more
> times.
These numbers are average values of 20 runs of each test. I didn't
provide the measurement accuracy, but the abs(open.sec - open.nosec)
is greater than it.
>> Let alone the fact that simply turning the CONFIG_SECURITY to 'y' puts +8Kb
>> to the vmlinux...
>>
>> I think, I finally agree with you and Al Viro, that the kobj mapper is
>> not the right place to put the filtering, but taking the above numbers
>> into account, can we put the "hooks" into the #else /* CONFIG_SECURITY */
>> versions of security_inode_permission/security_file_permission/etc?
>
> Ask the security module interface maintainers about this, not me :)
OK :) Thanks for your time, Greg.
So, Serge, since you already have a LSM-based version, maybe you can
change it with the proposed "fix" and send it to LSM maintainers for
review?
> good luck,
>
> greg k-h
>
Quoting Pavel Emelyanov ([email protected]):
> Greg KH wrote:
> > On Tue, Mar 11, 2008 at 12:57:55PM +0300, Pavel Emelyanov wrote:
> >> Besides, I've measured some things - the lat_syscall test for open from
> >> lmbench test suite and the nptl perf test. Here are the results:
> >>
> >> sec nosec
> >> open 3.0980s 3.0709s
> >> nptl 2.7746s 2.7710s
> >>
> >> So we have 0.88% loss in open and ~0.15% with nptl. I know, this is not that
> >> much, but it is noticeable. Besides, this is only two tests, digging deeper
> >> may reveal more.
> >
> > I think that is in the noise of sampling if you run that test many more
> > times.
>
> These numbers are average values of 20 runs of each test. I didn't
> provide the measurement accuracy, but the abs(open.sec - open.nosec)
> is greater than it.
>
> >> Let alone the fact that simply turning the CONFIG_SECURITY to 'y' puts +8Kb
> >> to the vmlinux...
> >>
> >> I think, I finally agree with you and Al Viro, that the kobj mapper is
> >> not the right place to put the filtering, but taking the above numbers
> >> into account, can we put the "hooks" into the #else /* CONFIG_SECURITY */
> >> versions of security_inode_permission/security_file_permission/etc?
> >
> > Ask the security module interface maintainers about this, not me :)
>
> OK :) Thanks for your time, Greg.
>
> So, Serge, since you already have a LSM-based version, maybe you can
> change it with the proposed "fix" and send it to LSM maintainers for
> review?
To take the point of view of someone who neither wants containers nor
LSM but just a fast box,
you're asking me to introduce LSM hooks for the !SECURITY case? :)
I can give it a shot, but I expect some complaints. Now at least the
_mknod hook shouldn't be a hotpath, and I suppose I can add yet
an #ifdef inside the !SECURITY version of security_inode_permission().
I still expect some complaints though. I'll send something soon.
thanks,
-serge
> > good luck,
> >
> > greg k-h
> >
On Wed, 2008-03-12 at 08:09 -0500, Serge E. Hallyn wrote:
> Quoting Pavel Emelyanov ([email protected]):
> > Greg KH wrote:
> > > On Tue, Mar 11, 2008 at 12:57:55PM +0300, Pavel Emelyanov wrote:
> > >> Besides, I've measured some things - the lat_syscall test for open from
> > >> lmbench test suite and the nptl perf test. Here are the results:
> > >>
> > >> sec nosec
> > >> open 3.0980s 3.0709s
> > >> nptl 2.7746s 2.7710s
> > >>
> > >> So we have 0.88% loss in open and ~0.15% with nptl. I know, this is not that
> > >> much, but it is noticeable. Besides, this is only two tests, digging deeper
> > >> may reveal more.
> > >
> > > I think that is in the noise of sampling if you run that test many more
> > > times.
> >
> > These numbers are average values of 20 runs of each test. I didn't
> > provide the measurement accuracy, but the abs(open.sec - open.nosec)
> > is greater than it.
> >
> > >> Let alone the fact that simply turning the CONFIG_SECURITY to 'y' puts +8Kb
> > >> to the vmlinux...
> > >>
> > >> I think, I finally agree with you and Al Viro, that the kobj mapper is
> > >> not the right place to put the filtering, but taking the above numbers
> > >> into account, can we put the "hooks" into the #else /* CONFIG_SECURITY */
> > >> versions of security_inode_permission/security_file_permission/etc?
> > >
> > > Ask the security module interface maintainers about this, not me :)
> >
> > OK :) Thanks for your time, Greg.
> >
> > So, Serge, since you already have a LSM-based version, maybe you can
> > change it with the proposed "fix" and send it to LSM maintainers for
> > review?
>
> To take the point of view of someone who neither wants containers nor
> LSM but just a fast box,
>
> you're asking me to introduce LSM hooks for the !SECURITY case? :)
>
> I can give it a shot, but I expect some complaints. Now at least the
> _mknod hook shouldn't be a hotpath, and I suppose I can add yet
> an #ifdef inside the !SECURITY version of security_inode_permission().
> I still expect some complaints though. I'll send something soon.
Not sure I'm following the plot here, but please don't do anything that
will prohibit the use of containers/namespaces with security modules
like SELinux/Smack. Yes, that's a legitimate use case, and there will
be people who will want to do that - they serve different but
complementary purposes (containers are _not_ a substitute for MAC). We
don't want them to be exclusive of one another.
--
Stephen Smalley
National Security Agency
On Wed, 2008-03-12 at 09:18 -0400, Stephen Smalley wrote:
> On Wed, 2008-03-12 at 08:09 -0500, Serge E. Hallyn wrote:
> > Quoting Pavel Emelyanov ([email protected]):
> > > Greg KH wrote:
> > > > On Tue, Mar 11, 2008 at 12:57:55PM +0300, Pavel Emelyanov wrote:
> > > >> Besides, I've measured some things - the lat_syscall test for open from
> > > >> lmbench test suite and the nptl perf test. Here are the results:
> > > >>
> > > >> sec nosec
> > > >> open 3.0980s 3.0709s
> > > >> nptl 2.7746s 2.7710s
> > > >>
> > > >> So we have 0.88% loss in open and ~0.15% with nptl. I know, this is not that
> > > >> much, but it is noticeable. Besides, this is only two tests, digging deeper
> > > >> may reveal more.
> > > >
> > > > I think that is in the noise of sampling if you run that test many more
> > > > times.
> > >
> > > These numbers are average values of 20 runs of each test. I didn't
> > > provide the measurement accuracy, but the abs(open.sec - open.nosec)
> > > is greater than it.
> > >
> > > >> Let alone the fact that simply turning the CONFIG_SECURITY to 'y' puts +8Kb
> > > >> to the vmlinux...
> > > >>
> > > >> I think, I finally agree with you and Al Viro, that the kobj mapper is
> > > >> not the right place to put the filtering, but taking the above numbers
> > > >> into account, can we put the "hooks" into the #else /* CONFIG_SECURITY */
> > > >> versions of security_inode_permission/security_file_permission/etc?
> > > >
> > > > Ask the security module interface maintainers about this, not me :)
> > >
> > > OK :) Thanks for your time, Greg.
> > >
> > > So, Serge, since you already have a LSM-based version, maybe you can
> > > change it with the proposed "fix" and send it to LSM maintainers for
> > > review?
> >
> > To take the point of view of someone who neither wants containers nor
> > LSM but just a fast box,
> >
> > you're asking me to introduce LSM hooks for the !SECURITY case? :)
> >
> > I can give it a shot, but I expect some complaints. Now at least the
> > _mknod hook shouldn't be a hotpath, and I suppose I can add yet
> > an #ifdef inside the !SECURITY version of security_inode_permission().
> > I still expect some complaints though. I'll send something soon.
>
> Not sure I'm following the plot here, but please don't do anything that
> will prohibit the use of containers/namespaces with security modules
> like SELinux/Smack. Yes, that's a legitimate use case, and there will
> be people who will want to do that - they serve different but
> complementary purposes (containers are _not_ a substitute for MAC). We
> don't want them to be exclusive of one another.
Also, note that "real" device labeling and access control (i.e. bind a
label to a device in the kernel irrespective of what device node is used
to access it so that a process that can create any device nodes at all
can't effectively bypass all device access controls just by creating an
arbitrary node to any device in a type accessible to it) is already
called out on our kernel todo list for SELinux, and contributions there
would be welcome.
--
Stephen Smalley
National Security Agency
Serge E. Hallyn wrote:
> Quoting Pavel Emelyanov ([email protected]):
>> Greg KH wrote:
>>> On Tue, Mar 11, 2008 at 12:57:55PM +0300, Pavel Emelyanov wrote:
>>>> Besides, I've measured some things - the lat_syscall test for open from
>>>> lmbench test suite and the nptl perf test. Here are the results:
>>>>
>>>> sec nosec
>>>> open 3.0980s 3.0709s
>>>> nptl 2.7746s 2.7710s
>>>>
>>>> So we have 0.88% loss in open and ~0.15% with nptl. I know, this is not that
>>>> much, but it is noticeable. Besides, this is only two tests, digging deeper
>>>> may reveal more.
>>> I think that is in the noise of sampling if you run that test many more
>>> times.
>> These numbers are average values of 20 runs of each test. I didn't
>> provide the measurement accuracy, but the abs(open.sec - open.nosec)
>> is greater than it.
>>
>>>> Let alone the fact that simply turning the CONFIG_SECURITY to 'y' puts +8Kb
>>>> to the vmlinux...
>>>>
>>>> I think, I finally agree with you and Al Viro, that the kobj mapper is
>>>> not the right place to put the filtering, but taking the above numbers
>>>> into account, can we put the "hooks" into the #else /* CONFIG_SECURITY */
>>>> versions of security_inode_permission/security_file_permission/etc?
>>> Ask the security module interface maintainers about this, not me :)
>> OK :) Thanks for your time, Greg.
>>
>> So, Serge, since you already have a LSM-based version, maybe you can
>> change it with the proposed "fix" and send it to LSM maintainers for
>> review?
>
> To take the point of view of someone who neither wants containers nor
> LSM but just a fast box,
>
> you're asking me to introduce LSM hooks for the !SECURITY case? :)
No exactly. Look at security_ptrace, security_task_kill or
security_vm_enough_memory for !SECURITY cases. I wanted to see similar
for device access list not to replace selinux with this small "security
module" and not to carry this huge LSM for our modest purposes.
> I can give it a shot, but I expect some complaints. Now at least the
> _mknod hook shouldn't be a hotpath, and I suppose I can add yet
> an #ifdef inside the !SECURITY version of security_inode_permission().
> I still expect some complaints though. I'll send something soon.
>
> thanks,
> -serge
>
>>> good luck,
>>>
>>> greg k-h
>>>
>
Quoting Stephen Smalley ([email protected]):
>
> On Wed, 2008-03-12 at 08:09 -0500, Serge E. Hallyn wrote:
> > Quoting Pavel Emelyanov ([email protected]):
> > > Greg KH wrote:
> > > > On Tue, Mar 11, 2008 at 12:57:55PM +0300, Pavel Emelyanov wrote:
> > > >> Besides, I've measured some things - the lat_syscall test for open from
> > > >> lmbench test suite and the nptl perf test. Here are the results:
> > > >>
> > > >> sec nosec
> > > >> open 3.0980s 3.0709s
> > > >> nptl 2.7746s 2.7710s
> > > >>
> > > >> So we have 0.88% loss in open and ~0.15% with nptl. I know, this is not that
> > > >> much, but it is noticeable. Besides, this is only two tests, digging deeper
> > > >> may reveal more.
> > > >
> > > > I think that is in the noise of sampling if you run that test many more
> > > > times.
> > >
> > > These numbers are average values of 20 runs of each test. I didn't
> > > provide the measurement accuracy, but the abs(open.sec - open.nosec)
> > > is greater than it.
> > >
> > > >> Let alone the fact that simply turning the CONFIG_SECURITY to 'y' puts +8Kb
> > > >> to the vmlinux...
> > > >>
> > > >> I think, I finally agree with you and Al Viro, that the kobj mapper is
> > > >> not the right place to put the filtering, but taking the above numbers
> > > >> into account, can we put the "hooks" into the #else /* CONFIG_SECURITY */
> > > >> versions of security_inode_permission/security_file_permission/etc?
> > > >
> > > > Ask the security module interface maintainers about this, not me :)
> > >
> > > OK :) Thanks for your time, Greg.
> > >
> > > So, Serge, since you already have a LSM-based version, maybe you can
> > > change it with the proposed "fix" and send it to LSM maintainers for
> > > review?
> >
> > To take the point of view of someone who neither wants containers nor
> > LSM but just a fast box,
> >
> > you're asking me to introduce LSM hooks for the !SECURITY case? :)
> >
> > I can give it a shot, but I expect some complaints. Now at least the
> > _mknod hook shouldn't be a hotpath, and I suppose I can add yet
> > an #ifdef inside the !SECURITY version of security_inode_permission().
> > I still expect some complaints though. I'll send something soon.
>
> Not sure I'm following the plot here, but please don't do anything that
> will prohibit the use of containers/namespaces with security modules
> like SELinux/Smack.
Absolutely not. There would be a redundant explicit hook in the
!SECURITY case, but in the SECURITY=y case the regular hooks would be
used by the device controller.
I definately consider SELinux/Smack+containers an important case.
> Yes, that's a legitimate use case, and there will
> be people who will want to do that - they serve different but
> complementary purposes (containers are _not_ a substitute for MAC). We
> don't want them to be exclusive of one another.
Agreed.
-serge
Quoting Stephen Smalley ([email protected]):
>
> On Wed, 2008-03-12 at 09:18 -0400, Stephen Smalley wrote:
> > On Wed, 2008-03-12 at 08:09 -0500, Serge E. Hallyn wrote:
> > > Quoting Pavel Emelyanov ([email protected]):
> > > > Greg KH wrote:
> > > > > On Tue, Mar 11, 2008 at 12:57:55PM +0300, Pavel Emelyanov wrote:
> > > > >> Besides, I've measured some things - the lat_syscall test for open from
> > > > >> lmbench test suite and the nptl perf test. Here are the results:
> > > > >>
> > > > >> sec nosec
> > > > >> open 3.0980s 3.0709s
> > > > >> nptl 2.7746s 2.7710s
> > > > >>
> > > > >> So we have 0.88% loss in open and ~0.15% with nptl. I know, this is not that
> > > > >> much, but it is noticeable. Besides, this is only two tests, digging deeper
> > > > >> may reveal more.
> > > > >
> > > > > I think that is in the noise of sampling if you run that test many more
> > > > > times.
> > > >
> > > > These numbers are average values of 20 runs of each test. I didn't
> > > > provide the measurement accuracy, but the abs(open.sec - open.nosec)
> > > > is greater than it.
> > > >
> > > > >> Let alone the fact that simply turning the CONFIG_SECURITY to 'y' puts +8Kb
> > > > >> to the vmlinux...
> > > > >>
> > > > >> I think, I finally agree with you and Al Viro, that the kobj mapper is
> > > > >> not the right place to put the filtering, but taking the above numbers
> > > > >> into account, can we put the "hooks" into the #else /* CONFIG_SECURITY */
> > > > >> versions of security_inode_permission/security_file_permission/etc?
> > > > >
> > > > > Ask the security module interface maintainers about this, not me :)
> > > >
> > > > OK :) Thanks for your time, Greg.
> > > >
> > > > So, Serge, since you already have a LSM-based version, maybe you can
> > > > change it with the proposed "fix" and send it to LSM maintainers for
> > > > review?
> > >
> > > To take the point of view of someone who neither wants containers nor
> > > LSM but just a fast box,
> > >
> > > you're asking me to introduce LSM hooks for the !SECURITY case? :)
> > >
> > > I can give it a shot, but I expect some complaints. Now at least the
> > > _mknod hook shouldn't be a hotpath, and I suppose I can add yet
> > > an #ifdef inside the !SECURITY version of security_inode_permission().
> > > I still expect some complaints though. I'll send something soon.
> >
> > Not sure I'm following the plot here, but please don't do anything that
> > will prohibit the use of containers/namespaces with security modules
> > like SELinux/Smack. Yes, that's a legitimate use case, and there will
> > be people who will want to do that - they serve different but
> > complementary purposes (containers are _not_ a substitute for MAC). We
> > don't want them to be exclusive of one another.
>
> Also, note that "real" device labeling and access control (i.e. bind a
> label to a device in the kernel irrespective of what device node is used
> to access it so that a process that can create any device nodes at all
> can't effectively bypass all device access controls just by creating an
> arbitrary node to any device in a type accessible to it) is already
> called out on our kernel todo list for SELinux, and contributions there
> would be welcome.
I'll take a look at the todo list (James' I assume). The dev_cgroup lsm
will have to come first, I'll see about doing the SELinux version as
well.
thanks,
-serge
--- Stephen Smalley <[email protected]> wrote:
>
> ...
>
> Not sure I'm following the plot here, but please don't do anything that
> will prohibit the use of containers/namespaces with security modules
> like SELinux/Smack. Yes, that's a legitimate use case, and there will
> be people who will want to do that - they serve different but
> complementary purposes (containers are _not_ a substitute for MAC). We
> don't want them to be exclusive of one another.
I agree that we ought to be able to (dare I say it?) stack containers
and Smack. I have come around 180 degrees regarding the value of
module stacking and am now convinced that a general mechanism for
it would be a Good Thing. Both SELinux and Smack already provide
for stacking capabilities, and I've been asked by another project to
provide for stacking their module. The alternative to general stacking
looks more and more like each LSM providing for the modules it is
willing to stack with, and that could get painful pretty quickly.
Or, tell me why I'm wrong. I promise to listen nicely. (smiley)
Casey Schaufler
[email protected]