LinuxLists.cc - [RFC PATCH 0/8] v3 contained usermode helper execution

2015-02-05 02:34:08

[permalink] [raw]

Subject: [RFC PATCH 0/8] v3 contained usermode helper execution

There haven't been any comments about the previous series not being
an acceptable approach. Perhaps people were away, didn't notice or
didn't have time.

So here's another chance to speak up.

In summary it's assumed that, since the usermode helper uses the
root init namespace for process creation, using the init namespace
of a container is eqivalent and sufficient when execution within a
container is needed.

Thinking further about callers I believe there are cases that won't
be handled properly so I've tried to work out what the current use
cases are and added patches that demonstrate simple minded usage.
I'm not sure at all that this is sufficient so I need feedback.

I've changed the execution to pin the calling task for the duration
of the call as recommended by Jeff Layton but other than that not a
lot has changed in the call back code.

It's also not clear if the request key infrastructure will continue
to use a usermode callback so we'll need to wait on that.

---

Ian Kent (8):
nsproxy - refactor setns()
kmod - rename call_usermodehelper() flags parameter
kmod - teach call_usermodehelper() to use a namespace
KEYS - rename call_usermodehelper_keys() flags parameter
KEYS: exec request-key within the requesting task's init namespace
nfsd - use namespace if not executing in init namespace
nfs - cache_lib use namespace if not executing in init namespace
nfs - objlayout use namespace if not executing in init namespace

fs/nfs/cache_lib.c | 6 ++
fs/nfs/objlayout/objlayout.c | 7 ++
fs/nfsd/netns.h | 2 +
fs/nfsd/nfs4recover.c | 48 ++++++++++-----
include/linux/kmod.h | 20 ++++++
include/linux/nsproxy.h | 1
kernel/kmod.c | 131 ++++++++++++++++++++++++++++++++++++++----
kernel/nsproxy.c | 21 ++++---
security/keys/request_key.c | 64 +++++++++++++++++----
9 files changed, 252 insertions(+), 48 deletions(-)

--
Ian

2015-02-05 02:34:15

[permalink] [raw]

Subject: [RFC PATCH 1/8] nsproxy - refactor setns()

For usermode helpers to execute within a namspace a slightly different
entry point to setns() that takes a namspace inode is needed.

Signed-off-by: Ian Kent <[email protected]>
Cc: Benjamin Coddington <[email protected]>
Cc: Al Viro <[email protected]>
Cc: J. Bruce Fields <[email protected]>
Cc: David Howells <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Jeff Layton <[email protected]>
---
include/linux/nsproxy.h | 1 +
kernel/nsproxy.c | 21 ++++++++++++++-------
2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 35fa08f..c75bf12 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -62,6 +62,7 @@ extern struct nsproxy init_nsproxy;
*
*/

+int setns_inode(struct inode *inode, int nstype);
int copy_namespaces(unsigned long flags, struct task_struct *tsk);
void exit_task_namespaces(struct task_struct *tsk);
void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index 49746c8..27cc544 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -218,20 +218,15 @@ void exit_task_namespaces(struct task_struct *p)
switch_task_namespaces(p, NULL);
}

-SYSCALL_DEFINE2(setns, int, fd, int, nstype)
+int setns_inode(struct inode *inode, int nstype)
{
struct task_struct *tsk = current;
struct nsproxy *new_nsproxy;
- struct file *file;
struct ns_common *ns;
int err;

- file = proc_ns_fget(fd);
- if (IS_ERR(file))
- return PTR_ERR(file);
-
err = -EINVAL;
- ns = get_proc_ns(file_inode(file));
+ ns = get_proc_ns(inode);
if (nstype && (ns->ops->type != nstype))
goto out;

@@ -248,6 +243,18 @@ SYSCALL_DEFINE2(setns, int, fd, int, nstype)
}
switch_task_namespaces(tsk, new_nsproxy);
out:
+ return err;
+}
+
+SYSCALL_DEFINE2(setns, int, fd, int, nstype)
+{
+ struct file *file;
+ int err;
+
+ file = proc_ns_fget(fd);
+ if (IS_ERR(file))
+ return PTR_ERR(file);
+ err = setns_inode(file_inode(file), nstype);
fput(file);
return err;
}

2015-02-05 02:34:20

[permalink] [raw]

Subject: [RFC PATCH 2/8] kmod - rename call_usermodehelper() flags parameter

The wait parameter of call_usermodehelper() is not quite a parameter
that describes the wait behaviour alone and will later be used to
request exec within a namespace.

So change its name to flags.

Signed-off-by: Ian Kent <[email protected]>
Cc: Benjamin Coddington <[email protected]>
Cc: Al Viro <[email protected]>
Cc: J. Bruce Fields <[email protected]>
Cc: David Howells <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Jeff Layton <[email protected]>
---
include/linux/kmod.h | 4 ++--
kernel/kmod.c | 16 ++++++++--------
2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/linux/kmod.h b/include/linux/kmod.h
index 0555cc6..15bdeed 100644
--- a/include/linux/kmod.h
+++ b/include/linux/kmod.h
@@ -67,7 +67,7 @@ struct subprocess_info {
};

extern int
-call_usermodehelper(char *path, char **argv, char **envp, int wait);
+call_usermodehelper(char *path, char **argv, char **envp, int flags);

extern struct subprocess_info *
call_usermodehelper_setup(char *path, char **argv, char **envp, gfp_t gfp_mask,
@@ -75,7 +75,7 @@ call_usermodehelper_setup(char *path, char **argv, char **envp, gfp_t gfp_mask,
void (*cleanup)(struct subprocess_info *), void *data);

extern int
-call_usermodehelper_exec(struct subprocess_info *info, int wait);
+call_usermodehelper_exec(struct subprocess_info *info, int flags);

extern struct ctl_table usermodehelper_table[];

diff --git a/kernel/kmod.c b/kernel/kmod.c
index 2777f40..14c0188 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -534,7 +534,7 @@ EXPORT_SYMBOL(call_usermodehelper_setup);
* asynchronously if wait is not set, and runs as a child of keventd.
* (ie. it runs with full root capabilities).
*/
-int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait)
+int call_usermodehelper_exec(struct subprocess_info *sub_info, int flags)
{
DECLARE_COMPLETION_ONSTACK(done);
int retval = 0;
@@ -553,14 +553,14 @@ int call_usermodehelper_exec(struct subprocess_info *sub_info, int wait)
* This makes it possible to use umh_complete to free
* the data structure in case of UMH_NO_WAIT.
*/
- sub_info->complete = (wait == UMH_NO_WAIT) ? NULL : &done;
- sub_info->wait = wait;
+ sub_info->complete = (flags == UMH_NO_WAIT) ? NULL : &done;
+ sub_info->wait = flags;

queue_work(khelper_wq, &sub_info->work);
- if (wait == UMH_NO_WAIT) /* task has freed sub_info */
+ if (flags == UMH_NO_WAIT) /* task has freed sub_info */
goto unlock;

- if (wait & UMH_KILLABLE) {
+ if (flags & UMH_KILLABLE) {
retval = wait_for_completion_killable(&done);
if (!retval)
goto wait_done;
@@ -595,17 +595,17 @@ EXPORT_SYMBOL(call_usermodehelper_exec);
* This function is the equivalent to use call_usermodehelper_setup() and
* call_usermodehelper_exec().
*/
-int call_usermodehelper(char *path, char **argv, char **envp, int wait)
+int call_usermodehelper(char *path, char **argv, char **envp, int flags)
{
struct subprocess_info *info;
- gfp_t gfp_mask = (wait == UMH_NO_WAIT) ? GFP_ATOMIC : GFP_KERNEL;
+ gfp_t gfp_mask = (flags == UMH_NO_WAIT) ? GFP_ATOMIC : GFP_KERNEL;

info = call_usermodehelper_setup(path, argv, envp, gfp_mask,
NULL, NULL, NULL);
if (info == NULL)
return -ENOMEM;

- return call_usermodehelper_exec(info, wait);
+ return call_usermodehelper_exec(info, flags);
}
EXPORT_SYMBOL(call_usermodehelper);

2015-02-05 02:34:27

[permalink] [raw]

Subject: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

The call_usermodehelper() function executes all binaries in the
global "init" root context. This doesn't allow a binary to be run
within a namespace (eg. the namespace of a container).

Both containerized NFS client and NFS server need the ability to
execute a binary in a container's context. To do this use the init
process of the callers environment is used to setup the namespaces
in the same way the root init process is used otherwise.

Signed-off-by: Ian Kent <[email protected]>
Cc: Benjamin Coddington <[email protected]>
Cc: Al Viro <[email protected]>
Cc: J. Bruce Fields <[email protected]>
Cc: David Howells <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Jeff Layton <[email protected]>
---
include/linux/kmod.h | 16 +++++++
kernel/kmod.c | 115 +++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 128 insertions(+), 3 deletions(-)

diff --git a/include/linux/kmod.h b/include/linux/kmod.h
index 15bdeed..b0f1b3c 100644
--- a/include/linux/kmod.h
+++ b/include/linux/kmod.h
@@ -52,6 +52,7 @@ struct file;
#define UMH_WAIT_EXEC 1 /* wait for the exec, but not the process */
#define UMH_WAIT_PROC 2 /* wait for the process to complete */
#define UMH_KILLABLE 4 /* wait for EXEC/PROC killable */
+#define UMH_USE_NS 8 /* exec using caller's init namespace */

struct subprocess_info {
struct work_struct work;
@@ -69,6 +70,21 @@ struct subprocess_info {
extern int
call_usermodehelper(char *path, char **argv, char **envp, int flags);

+#if !defined(CONFIG_PROC_FS) || !defined(CONFIG_NAMESPACES)
+inline struct task_struct *umh_get_init_task(void)
+{
+ return ERR_PTR(-ENOTSUP);
+}
+
+inline int umh_enter_ns(struct task_struct *tsk, struct cred *new)
+{
+ return -ENOTSUP;
+}
+#else
+struct task_struct *umh_get_init_pid(void);
+int umh_enter_ns(struct task_struct *tsk, struct cred *new);
+#endif
+
extern struct subprocess_info *
call_usermodehelper_setup(char *path, char **argv, char **envp, gfp_t gfp_mask,
int (*init)(struct subprocess_info *info, struct cred *new),
diff --git a/kernel/kmod.c b/kernel/kmod.c
index 14c0188..4c649d6 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -582,6 +582,98 @@ unlock:
}
EXPORT_SYMBOL(call_usermodehelper_exec);

+#if defined(CONFIG_PROC_FS) && defined(CONFIG_NAMESPACES)
+#define NS_PATH_MAX 35
+#define NS_PATH_FMT "%lu/ns/%s"
+
+/* Note namespace name order is significant */
+static const char *ns_names[] = { "user", "ipc", "uts", "net", "pid", "mnt", NULL };
+
+struct task_struct *umh_get_init_pid(void)
+{
+ struct task_struct *tsk;
+
+ rcu_read_lock();
+ tsk = find_task_by_vpid(1);
+ if (tsk)
+ get_task_struct(tsk);
+ rcu_read_unlock();
+ if (!tsk)
+ return ERR_PTR(-ESRCH);
+
+ return tsk;
+}
+EXPORT_SYMBOL(umh_get_init_pid);
+
+int umh_enter_ns(struct task_struct *tsk, struct cred *new)
+{
+ char path[NS_PATH_MAX];
+ struct vfsmount *mnt;
+ const char *name;
+ pid_t pid;
+ int err = 0;
+
+ pid = task_pid_nr(tsk);
+
+ /*
+ * The user mode thread runner runs in the root init namespace
+ * so it will see all system pids.
+ */
+ mnt = task_active_pid_ns(current)->proc_mnt;
+
+ for (name = ns_names[0]; *name; name++) {
+ struct file *this;
+ int len;
+
+ len = snprintf(path,
+ NS_PATH_MAX, NS_PATH_FMT,
+ (unsigned long) pid, name);
+ if (len >= NS_PATH_MAX) {
+ err = -ENAMETOOLONG;
+ break;
+ }
+
+ this = file_open_root(mnt->mnt_root, mnt, path, O_RDONLY);
+ if (unlikely(IS_ERR(this))) {
+ err = PTR_ERR(this);
+ break;
+ }
+
+ err = setns_inode(file_inode(this), 0);
+ fput(this);
+ if (err)
+ break;
+ }
+
+ return err;
+}
+EXPORT_SYMBOL(umh_enter_ns);
+
+static int umh_set_ns(struct subprocess_info *info, struct cred *new)
+{
+ struct task_struct *tsk = info->data;
+
+ return umh_enter_ns(tsk, new);
+}
+
+static void umh_free_ns(struct subprocess_info *info)
+{
+ struct task_struct *tsk = info->data;
+
+ if (tsk)
+ put_task_struct(tsk);
+}
+#else
+static int umh_set_ns(struct subprocess_info *info, struct cred *new)
+{
+ return 0;
+}
+
+static void umh_free_ns(struct subprocess_info *info)
+{
+}
+#endif
+
/**
* call_usermodehelper() - prepare and start a usermode application
* @path: path to usermode executable
@@ -599,11 +691,28 @@ int call_usermodehelper(char *path, char **argv, char **envp, int flags)
{
struct subprocess_info *info;
gfp_t gfp_mask = (flags == UMH_NO_WAIT) ? GFP_ATOMIC : GFP_KERNEL;
+ unsigned int use_ns = flags & UMH_USE_NS;
+ struct task_struct *tsk = NULL;
+
+ if (use_ns) {
+ tsk = umh_get_init_pid();
+ if (IS_ERR(tsk))
+ return PTR_ERR(tsk);
+ }

- info = call_usermodehelper_setup(path, argv, envp, gfp_mask,
- NULL, NULL, NULL);
- if (info == NULL)
+ if (!tsk)
+ info = call_usermodehelper_setup(path, argv, envp,
+ gfp_mask, NULL, NULL, NULL);
+ else {
+ info = call_usermodehelper_setup(path, argv, envp, gfp_mask,
+ umh_set_ns, umh_free_ns,
+ tsk);
+ }
+ if (info == NULL) {
+ if (tsk)
+ put_task_struct(tsk);
return -ENOMEM;
+ }

return call_usermodehelper_exec(info, flags);
}

2015-02-05 02:34:34

[permalink] [raw]

Subject: [RFC PATCH 4/8] KEYS - rename call_usermodehelper_keys() flags parameter

The wait parameter of call_usermodehelper_keys() will later be used to
request exec within a namespace.

So change its name to flags.

Signed-off-by: Ian Kent <[email protected]>
Cc: Benjamin Coddington <[email protected]>
Cc: Al Viro <[email protected]>
Cc: J. Bruce Fields <[email protected]>
Cc: David Howells <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Jeff Layton <[email protected]>
---
security/keys/request_key.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/security/keys/request_key.c b/security/keys/request_key.c
index 0c7aea4..9e79bbf 100644
--- a/security/keys/request_key.c
+++ b/security/keys/request_key.c
@@ -73,7 +73,7 @@ static void umh_keys_cleanup(struct subprocess_info *info)
* Call a usermode helper with a specific session keyring.
*/
static int call_usermodehelper_keys(char *path, char **argv, char **envp,
- struct key *session_keyring, int wait)
+ struct key *session_keyring, int flags)
{
struct subprocess_info *info;

@@ -84,7 +84,7 @@ static int call_usermodehelper_keys(char *path, char **argv, char **envp,
return -ENOMEM;

key_get(session_keyring);
- return call_usermodehelper_exec(info, wait);
+ return call_usermodehelper_exec(info, flags);
}

/*

2015-02-05 02:34:42

[permalink] [raw]

Subject: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

Containerized request key helper callbacks need the ability to execute
a binary in a container's context. To do this calling an in kernel
equivalent of setns(2) should be sufficient since the user mode helper
execution kernel thread ultimately calls do_execve().

Signed-off-by: Ian Kent <[email protected]>
Cc: Benjamin Coddington <[email protected]>
Cc: Al Viro <[email protected]>
Cc: J. Bruce Fields <[email protected]>
Cc: David Howells <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Jeff Layton <[email protected]>
---
security/keys/request_key.c | 60 +++++++++++++++++++++++++++++++++++++------
1 file changed, 52 insertions(+), 8 deletions(-)

diff --git a/security/keys/request_key.c b/security/keys/request_key.c
index 9e79bbf..59282aa 100644
--- a/security/keys/request_key.c
+++ b/security/keys/request_key.c
@@ -17,6 +17,9 @@
#include <linux/err.h>
#include <linux/keyctl.h>
#include <linux/slab.h>
+#include <linux/fs.h>
+#include <linux/nsproxy.h>
+#include <net/net_namespace.h>
#include "internal.h"

#define key_negative_timeout 60 /* default timeout on a negative key's existence */
@@ -46,6 +49,11 @@ void complete_request_key(struct key_construction *cons, int error)
}
EXPORT_SYMBOL(complete_request_key);

+struct request_key_info {
+ struct key *keyring;
+ struct task_struct *tsk;
+};
+
/*
* Initialise a usermode helper that is going to have a specific session
* keyring.
@@ -55,9 +63,18 @@ EXPORT_SYMBOL(complete_request_key);
*/
static int umh_keys_init(struct subprocess_info *info, struct cred *cred)
{
- struct key *keyring = info->data;
+ struct request_key_info *rki = info->data;
+ struct task_struct *tsk = rki->tsk;
+
+ if (tsk) {
+ int err;
+
+ err = umh_enter_ns(tsk, cred);
+ if (err)
+ return err;
+ }

- return install_session_keyring_to_cred(cred, keyring);
+ return install_session_keyring_to_cred(cred, rki->keyring);
}

/*
@@ -65,8 +82,11 @@ static int umh_keys_init(struct subprocess_info *info, struct cred *cred)
*/
static void umh_keys_cleanup(struct subprocess_info *info)
{
- struct key *keyring = info->data;
- key_put(keyring);
+ struct request_key_info *rki = info->data;
+ if (rki->tsk)
+ put_task_struct(rki->tsk);
+ key_put(rki->keyring);
+ kfree(rki);
}

/*
@@ -76,12 +96,32 @@ static int call_usermodehelper_keys(char *path, char **argv, char **envp,
struct key *session_keyring, int flags)
{
struct subprocess_info *info;
+ struct request_key_info *rki;
+ unsigned int use_ns = flags & UMH_USE_NS;
+ struct task_struct *tsk = NULL;
+
+ rki = kmalloc(sizeof(*rki), GFP_KERNEL);
+ if (!rki)
+ return -ENOMEM;
+
+ if (use_ns) {
+ tsk = umh_get_init_pid();
+ if (IS_ERR(tsk))
+ return PTR_ERR(tsk);
+ }
+
+ rki->keyring = session_keyring;
+ rki->tsk = tsk;

info = call_usermodehelper_setup(path, argv, envp, GFP_KERNEL,
umh_keys_init, umh_keys_cleanup,
- session_keyring);
- if (!info)
+ rki);
+ if (!info) {
+ if (tsk)
+ put_task_struct(rki->tsk);
+ kfree(rki);
return -ENOMEM;
+ }

key_get(session_keyring);
return call_usermodehelper_exec(info, flags);
@@ -102,10 +142,15 @@ static int call_sbin_request_key(struct key_construction *cons,
char *argv[9], *envp[3], uid_str[12], gid_str[12];
char key_str[12], keyring_str[3][12];
char desc[20];
+ int flags = UMH_WAIT_PROC;
int ret, i;

kenter("{%d},{%d},%s", key->serial, authkey->serial, op);

+ /* If running within a container use the container namespace */
+ if (current->nsproxy->net_ns != &init_net)
+ flags |= UMH_USE_NS;
+
ret = install_user_keyrings();
if (ret < 0)
goto error_alloc;
@@ -172,8 +217,7 @@ static int call_sbin_request_key(struct key_construction *cons,
argv[i] = NULL;

/* do it */
- ret = call_usermodehelper_keys(argv[0], argv, envp, keyring,
- UMH_WAIT_PROC);
+ ret = call_usermodehelper_keys(argv[0], argv, envp, keyring, flags);
kdebug("usermode -> 0x%x", ret);
if (ret >= 0) {
/* ret is the exit/wait code */

2015-02-05 02:34:47

[permalink] [raw]

Subject: [RFC PATCH 6/8] nfsd - use namespace if not executing in init namespace

If nfsd is running within a container the client tracking operations
should run within the container also.

Signed-off-by: Ian Kent <[email protected]>
Cc: Benjamin Coddington <[email protected]>
Cc: Al Viro <[email protected]>
Cc: J. Bruce Fields <[email protected]>
Cc: David Howells <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Jeff Layton <[email protected]>
---
fs/nfsd/netns.h | 2 ++
fs/nfsd/nfs4recover.c | 48 ++++++++++++++++++++++++++++++++----------------
2 files changed, 34 insertions(+), 16 deletions(-)

diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
index ea6749a..c168196 100644
--- a/fs/nfsd/netns.h
+++ b/fs/nfsd/netns.h
@@ -112,6 +112,8 @@ struct nfsd_net {
u32 clientid_counter;

struct svc_serv *nfsd_serv;
+
+ int umh_flags;
};

/* Simple check to find out if a given net was properly initialized */
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index cc6a760..b962856 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -1184,7 +1184,9 @@ nfsd4_cltrack_grace_start(time_t grace_start)
}

static int
-nfsd4_umh_cltrack_upcall(char *cmd, char *arg, char *env0, char *env1)
+nfsd4_umh_cltrack_upcall(char *cmd,
+ char *arg, char *env0, char *env1,
+ int flags)
{
char *envp[3];
char *argv[4];
@@ -1209,7 +1211,7 @@ nfsd4_umh_cltrack_upcall(char *cmd, char *arg, char *env0, char *env1)
argv[2] = arg;
argv[3] = NULL;

- ret = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
+ ret = call_usermodehelper(argv[0], argv, envp, flags);
/*
* Disable the upcall mechanism if we're getting an ENOENT or EACCES
* error. The admin can re-enable it on the fly by using sysfs
@@ -1252,14 +1254,13 @@ nfsd4_umh_cltrack_init(struct net *net)
struct nfsd_net *nn = net_generic(net, nfsd_net_id);
char *grace_start = nfsd4_cltrack_grace_start(nn->boot_time);

- /* XXX: The usermode helper s not working in container yet. */
- if (net != &init_net) {
- WARN(1, KERN_ERR "NFSD: attempt to initialize umh client "
- "tracking in a container!\n");
- return -EINVAL;
- }
+ nn->umh_flags = UMH_WAIT_PROC;
+ if (net != &init_net)
+ nn->umh_flags |= UMH_USE_NS;

- ret = nfsd4_umh_cltrack_upcall("init", NULL, grace_start, NULL);
+ ret = nfsd4_umh_cltrack_upcall("init",
+ NULL, grace_start, NULL,
+ nn->umh_flags);
kfree(grace_start);
return ret;
}
@@ -1285,6 +1286,7 @@ nfsd4_umh_cltrack_create(struct nfs4_client *clp)
{
char *hexid, *has_session, *grace_start;
struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
+ int ret;

/*
* With v4.0 clients, there's little difference in outcome between a
@@ -1312,7 +1314,10 @@ nfsd4_umh_cltrack_create(struct nfs4_client *clp)
grace_start = nfsd4_cltrack_grace_start(nn->boot_time);

nfsd4_cltrack_upcall_lock(clp);
- if (!nfsd4_umh_cltrack_upcall("create", hexid, has_session, grace_start))
+ ret = nfsd4_umh_cltrack_upcall("create",
+ hexid, has_session, grace_start,
+ nn->umh_flags);
+ if (!ret)
set_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags);
nfsd4_cltrack_upcall_unlock(clp);

@@ -1324,7 +1329,9 @@ nfsd4_umh_cltrack_create(struct nfs4_client *clp)
static void
nfsd4_umh_cltrack_remove(struct nfs4_client *clp)
{
+ struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
char *hexid;
+ int ret;

if (!test_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags))
return;
@@ -1336,9 +1343,13 @@ nfsd4_umh_cltrack_remove(struct nfs4_client *clp)
}

nfsd4_cltrack_upcall_lock(clp);
- if (test_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags) &&
- nfsd4_umh_cltrack_upcall("remove", hexid, NULL, NULL) == 0)
- clear_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags);
+ if (test_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags)) {
+ ret = nfsd4_umh_cltrack_upcall("remove",
+ hexid, NULL, NULL,
+ nn->umh_flags);
+ if (ret == 0)
+ clear_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags);
+ }
nfsd4_cltrack_upcall_unlock(clp);

kfree(hexid);
@@ -1347,8 +1358,9 @@ nfsd4_umh_cltrack_remove(struct nfs4_client *clp)
static int
nfsd4_umh_cltrack_check(struct nfs4_client *clp)
{
- int ret;
+ struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
char *hexid, *has_session, *legacy;
+ int ret;

if (test_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags))
return 0;
@@ -1366,7 +1378,9 @@ nfsd4_umh_cltrack_check(struct nfs4_client *clp)
if (test_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags)) {
ret = 0;
} else {
- ret = nfsd4_umh_cltrack_upcall("check", hexid, has_session, legacy);
+ ret = nfsd4_umh_cltrack_upcall("check", hexid,
+ has_session, legacy,
+ mm->umh_flags);
if (ret == 0)
set_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags);
}
@@ -1386,7 +1400,9 @@ nfsd4_umh_cltrack_grace_done(struct nfsd_net *nn)

sprintf(timestr, "%ld", nn->boot_time);
legacy = nfsd4_cltrack_legacy_topdir();
- nfsd4_umh_cltrack_upcall("gracedone", timestr, legacy, NULL);
+ nfsd4_umh_cltrack_upcall("gracedone",
+ timestr, legacy, NULL,
+ nn->umh_flags);
kfree(legacy);
}

2015-02-05 02:34:54

[permalink] [raw]

Subject: [RFC PATCH 7/8] nfs - cache_lib use namespace if not executing in init namespace

If the caller is running within a container then execute the usermode
helper callback within the init namespace of the container.

Signed-off-by: Ian Kent <[email protected]>
Cc: Benjamin Coddington <[email protected]>
Cc: Al Viro <[email protected]>
Cc: J. Bruce Fields <[email protected]>
Cc: David Howells <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Jeff Layton <[email protected]>
---
fs/nfs/cache_lib.c | 6 +++++-
fs/nfsd/nfs4recover.c | 2 +-
2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/cache_lib.c b/fs/nfs/cache_lib.c
index 5f7b053..1f7f6c1 100644
--- a/fs/nfs/cache_lib.c
+++ b/fs/nfs/cache_lib.c
@@ -44,11 +44,15 @@ int nfs_cache_upcall(struct cache_detail *cd, char *entry_name)
entry_name,
NULL
};
+ int umh_flags = UMH_WAIT_EXEC;
int ret = -EACCES;

+ if (cd->net != &init_net)
+ umh_flags |= UMH_USE_NS;
+
if (nfs_cache_getent_prog[0] == '\0')
goto out;
- ret = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
+ ret = call_usermodehelper(argv[0], argv, envp, umh_flags);
/*
* Disable the upcall mechanism if we're getting an ENOENT or
* EACCES error. The admin can re-enable it on the fly by using
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index b962856..0896ca7 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -1380,7 +1380,7 @@ nfsd4_umh_cltrack_check(struct nfs4_client *clp)
} else {
ret = nfsd4_umh_cltrack_upcall("check", hexid,
has_session, legacy,
- mm->umh_flags);
+ nn->umh_flags);
if (ret == 0)
set_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags);
}

2015-02-05 02:35:00

[permalink] [raw]

Subject: [RFC PATCH 8/8] nfs - objlayout use namespace if not executing in init namespace

If the caller is running within a container then execute the usermode
helper callback within the init namespace of the container.

Signed-off-by: Ian Kent <[email protected]>
Cc: Benjamin Coddington <[email protected]>
Cc: Al Viro <[email protected]>
Cc: J. Bruce Fields <[email protected]>
Cc: David Howells <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Jeff Layton <[email protected]>
---
fs/nfs/objlayout/objlayout.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/objlayout/objlayout.c b/fs/nfs/objlayout/objlayout.c
index 919efd4..dd5b9e8 100644
--- a/fs/nfs/objlayout/objlayout.c
+++ b/fs/nfs/objlayout/objlayout.c
@@ -600,8 +600,13 @@ static int __objlayout_upcall(struct __auto_login *login)
NULL
};
char *argv[8];
+ int umh_flags = UMH_WAIT_PROC;
int ret;

+ /* If running within a container use the container namespace */
+ if (current->nsproxy->net_ns != &init_net)
+ umh_flags |= UMH_USE_NS;
+
if (unlikely(!osd_login_prog[0])) {
dprintk("%s: osd_login_prog is disabled\n", __func__);
return -EACCES;
@@ -620,7 +625,7 @@ static int __objlayout_upcall(struct __auto_login *login)
argv[6] = login->systemid_hex;
argv[7] = NULL;

- ret = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
+ ret = call_usermodehelper(argv[0], argv, envp, umh_flags);
/*
* Disable the upcall mechanism if we're getting an ENOENT or
* EACCES error. The admin can re-enable it on the fly by using

2015-02-05 15:01:20

by David Howells

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] kmod - rename call_usermodehelper() flags parameter

Ian Kent <[email protected]> wrote:

> -call_usermodehelper(char *path, char **argv, char **envp, int wait);
> +call_usermodehelper(char *path, char **argv, char **envp, int flags);

Can we make flags unsigned whilst we're at it? Other than that:

Acked-by: David Howells <[email protected]>

2015-02-05 15:14:39

by David Howells

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

Ian Kent <[email protected]> wrote:

> +#include <linux/fs.h>

Is that actually needed?

> + rki = kmalloc(sizeof(*rki), GFP_KERNEL);
> + if (!rki)
> + return -ENOMEM;
> +
> + if (use_ns) {
> + tsk = umh_get_init_pid();
> + if (IS_ERR(tsk))
> + return PTR_ERR(tsk);

Memory leak of rki.

> + /* If running within a container use the container namespace */
> + if (current->nsproxy->net_ns != &init_net)

Is that a viable check? Is it possible to have a container that shares
networking details?

David

2015-02-05 15:24:45

by David Howells

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

Ian Kent <[email protected]> wrote:

> To do this use the init process of the callers environment is used to setup
> the namespaces in the same way the root init process is used otherwise.

This probably doesn't need both "use" and "used".

> +struct task_struct *umh_get_init_pid(void)
> +int umh_enter_ns(struct task_struct *tsk, struct cred *new)

These are exported, so should probably have doc comments.

> + this = file_open_root(mnt->mnt_root, mnt, path, O_RDONLY);

You probably should explain in the patch description or in a comment in the
function why we're opening the ns files rather than just going directly to
current->nsproxy.

David

2015-02-06 00:01:18

[permalink] [raw]

Subject: Re: [RFC PATCH 2/8] kmod - rename call_usermodehelper() flags parameter

On Thu, 2015-02-05 at 15:01 +0000, David Howells wrote:
> Ian Kent <[email protected]> wrote:
>
> > -call_usermodehelper(char *path, char **argv, char **envp, int wait);
> > +call_usermodehelper(char *path, char **argv, char **envp, int flags);
>
> Can we make flags unsigned whilst we're at it? Other than that:

Sure, thanks for your comments, here and elsewhere, I'll get onto fixing
them.

>
> Acked-by: David Howells <[email protected]>

2015-02-06 01:47:44

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Thu, 2015-02-05 at 15:14 +0000, David Howells wrote:
>
> > + /* If running within a container use the container namespace */
> > + if (current->nsproxy->net_ns != &init_net)
>
> Is that a viable check? Is it possible to have a container that shares
> networking details?

That's up for discussion.

I thought about it and concluded that the check is probably not
sufficient for any of the cases.

I left it like that because I'm not sure exactly what the use cases are,
hoping it promote discussion and here we are.

I also think the current container environments don't share net
namespace with the root init net namspace, necessarily, because thy are
containers, ;)

TBH I haven't looked at the user space container creation code but I
expect it could be done that way if it was needed, so the answer is yes
and no, ;)

The questions then are do we need to check anything else, and what
environment should the callback use in the different cases, and what
other cases might break if we change it?

For example, should the fs namespace also be checked for all of these
cases, since we're executing a callback, or is whatever that's set to in
the container always what's required for locating the executable.

Ian

2015-02-06 12:09:06

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

On Thu, 05 Feb 2015 10:34:11 +0800
Ian Kent <[email protected]> wrote:

> The call_usermodehelper() function executes all binaries in the
> global "init" root context. This doesn't allow a binary to be run
> within a namespace (eg. the namespace of a container).
>
> Both containerized NFS client and NFS server need the ability to
> execute a binary in a container's context. To do this use the init
> process of the callers environment is used to setup the namespaces
> in the same way the root init process is used otherwise.
>
> Signed-off-by: Ian Kent <[email protected]>
> Cc: Benjamin Coddington <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: J. Bruce Fields <[email protected]>
> Cc: David Howells <[email protected]>
> Cc: Trond Myklebust <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Eric W. Biederman <[email protected]>
> Cc: Jeff Layton <[email protected]>
> ---
> include/linux/kmod.h | 16 +++++++
> kernel/kmod.c | 115 +++++++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 128 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/kmod.h b/include/linux/kmod.h
> index 15bdeed..b0f1b3c 100644
> --- a/include/linux/kmod.h
> +++ b/include/linux/kmod.h
> @@ -52,6 +52,7 @@ struct file;
> #define UMH_WAIT_EXEC 1 /* wait for the exec, but not the process */
> #define UMH_WAIT_PROC 2 /* wait for the process to complete */
> #define UMH_KILLABLE 4 /* wait for EXEC/PROC killable */
> +#define UMH_USE_NS 8 /* exec using caller's init namespace */
>
> struct subprocess_info {
> struct work_struct work;
> @@ -69,6 +70,21 @@ struct subprocess_info {
> extern int
> call_usermodehelper(char *path, char **argv, char **envp, int flags);
>
> +#if !defined(CONFIG_PROC_FS) || !defined(CONFIG_NAMESPACES)
> +inline struct task_struct *umh_get_init_task(void)
> +{
> + return ERR_PTR(-ENOTSUP);
> +}
> +
> +inline int umh_enter_ns(struct task_struct *tsk, struct cred *new)
> +{
> + return -ENOTSUP;
> +}
> +#else
> +struct task_struct *umh_get_init_pid(void);
> +int umh_enter_ns(struct task_struct *tsk, struct cred *new);
> +#endif
> +
> extern struct subprocess_info *
> call_usermodehelper_setup(char *path, char **argv, char **envp, gfp_t gfp_mask,
> int (*init)(struct subprocess_info *info, struct cred *new),
> diff --git a/kernel/kmod.c b/kernel/kmod.c
> index 14c0188..4c649d6 100644
> --- a/kernel/kmod.c
> +++ b/kernel/kmod.c
> @@ -582,6 +582,98 @@ unlock:
> }
> EXPORT_SYMBOL(call_usermodehelper_exec);
>
> +#if defined(CONFIG_PROC_FS) && defined(CONFIG_NAMESPACES)
> +#define NS_PATH_MAX 35
> +#define NS_PATH_FMT "%lu/ns/%s"
> +
> +/* Note namespace name order is significant */
> +static const char *ns_names[] = { "user", "ipc", "uts", "net", "pid", "mnt", NULL };
> +
> +struct task_struct *umh_get_init_pid(void)

nit: we're not getting a pid here but a task_struct pointer. Maybe this
should be called umh_get_init_task?

> +{
> + struct task_struct *tsk;
> +
> + rcu_read_lock();
> + tsk = find_task_by_vpid(1);
> + if (tsk)
> + get_task_struct(tsk);
> + rcu_read_unlock();

I'm not terribly familiar with the task_struct lifetime rules...

I assume that you can be assured that tsk won't go away while you hold
the rcu_read_lock, but is doing a get_task_struct while holding it
sufficient to pin it after you drop the lock?

IOW, could the refcount on the task_struct do a 0->1 transition here and
end up being freed anyway after you've grabbed a reference?

> + if (!tsk)
> + return ERR_PTR(-ESRCH);
> +
> + return tsk;
> +}
> +EXPORT_SYMBOL(umh_get_init_pid);
> +
> +int umh_enter_ns(struct task_struct *tsk, struct cred *new)
> +{
> + char path[NS_PATH_MAX];
> + struct vfsmount *mnt;
> + const char *name;
> + pid_t pid;
> + int err = 0;
> +
> + pid = task_pid_nr(tsk);
> +
> + /*
> + * The user mode thread runner runs in the root init namespace
> + * so it will see all system pids.
> + */
> + mnt = task_active_pid_ns(current)->proc_mnt;
> +
> + for (name = ns_names[0]; *name; name++) {
> + struct file *this;
> + int len;
> +
> + len = snprintf(path,
> + NS_PATH_MAX, NS_PATH_FMT,
> + (unsigned long) pid, name);
> + if (len >= NS_PATH_MAX) {
> + err = -ENAMETOOLONG;
> + break;
> + }
> +
> + this = file_open_root(mnt->mnt_root, mnt, path, O_RDONLY);
> + if (unlikely(IS_ERR(this))) {
> + err = PTR_ERR(this);
> + break;
> + }
> +
> + err = setns_inode(file_inode(this), 0);
> + fput(this);
> + if (err)
> + break;
> + }
> +
> + return err;
> +}
> +EXPORT_SYMBOL(umh_enter_ns);
> +
> +static int umh_set_ns(struct subprocess_info *info, struct cred *new)
> +{
> + struct task_struct *tsk = info->data;
> +
> + return umh_enter_ns(tsk, new);
> +}
> +
> +static void umh_free_ns(struct subprocess_info *info)
> +{
> + struct task_struct *tsk = info->data;
> +
> + if (tsk)
> + put_task_struct(tsk);
> +}
> +#else
> +static int umh_set_ns(struct subprocess_info *info, struct cred *new)
> +{
> + return 0;
> +}
> +
> +static void umh_free_ns(struct subprocess_info *info)
> +{
> +}
> +#endif
> +
> /**
> * call_usermodehelper() - prepare and start a usermode application
> * @path: path to usermode executable
> @@ -599,11 +691,28 @@ int call_usermodehelper(char *path, char **argv, char **envp, int flags)
> {
> struct subprocess_info *info;
> gfp_t gfp_mask = (flags == UMH_NO_WAIT) ? GFP_ATOMIC : GFP_KERNEL;
> + unsigned int use_ns = flags & UMH_USE_NS;
> + struct task_struct *tsk = NULL;
> +
> + if (use_ns) {
> + tsk = umh_get_init_pid();
> + if (IS_ERR(tsk))
> + return PTR_ERR(tsk);
> + }
>
> - info = call_usermodehelper_setup(path, argv, envp, gfp_mask,
> - NULL, NULL, NULL);
> - if (info == NULL)
> + if (!tsk)
> + info = call_usermodehelper_setup(path, argv, envp,
> + gfp_mask, NULL, NULL, NULL);
> + else {
> + info = call_usermodehelper_setup(path, argv, envp, gfp_mask,
> + umh_set_ns, umh_free_ns,
> + tsk);
> + }
> + if (info == NULL) {
> + if (tsk)
> + put_task_struct(tsk);
> return -ENOMEM;
> + }
>
> return call_usermodehelper_exec(info, flags);
> }
>

--
Jeff Layton <[email protected]>

2015-02-08 03:07:48

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

On Fri, 2015-02-06 at 07:08 -0500, Jeff Layton wrote:
> On Thu, 05 Feb 2015 10:34:11 +0800
> Ian Kent <[email protected]> wrote:
>
> > The call_usermodehelper() function executes all binaries in the
> > global "init" root context. This doesn't allow a binary to be run
> > within a namespace (eg. the namespace of a container).
> >
> > Both containerized NFS client and NFS server need the ability to
> > execute a binary in a container's context. To do this use the init
> > process of the callers environment is used to setup the namespaces
> > in the same way the root init process is used otherwise.
> >
> > Signed-off-by: Ian Kent <[email protected]>
> > Cc: Benjamin Coddington <[email protected]>
> > Cc: Al Viro <[email protected]>
> > Cc: J. Bruce Fields <[email protected]>
> > Cc: David Howells <[email protected]>
> > Cc: Trond Myklebust <[email protected]>
> > Cc: Oleg Nesterov <[email protected]>
> > Cc: Eric W. Biederman <[email protected]>
> > Cc: Jeff Layton <[email protected]>
> > ---
> > include/linux/kmod.h | 16 +++++++
> > kernel/kmod.c | 115 +++++++++++++++++++++++++++++++++++++++++++++++++-
> > 2 files changed, 128 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/kmod.h b/include/linux/kmod.h
> > index 15bdeed..b0f1b3c 100644
> > --- a/include/linux/kmod.h
> > +++ b/include/linux/kmod.h
> > @@ -52,6 +52,7 @@ struct file;
> > #define UMH_WAIT_EXEC 1 /* wait for the exec, but not the process */
> > #define UMH_WAIT_PROC 2 /* wait for the process to complete */
> > #define UMH_KILLABLE 4 /* wait for EXEC/PROC killable */
> > +#define UMH_USE_NS 8 /* exec using caller's init namespace */
> >
> > struct subprocess_info {
> > struct work_struct work;
> > @@ -69,6 +70,21 @@ struct subprocess_info {
> > extern int
> > call_usermodehelper(char *path, char **argv, char **envp, int flags);
> >
> > +#if !defined(CONFIG_PROC_FS) || !defined(CONFIG_NAMESPACES)
> > +inline struct task_struct *umh_get_init_task(void)
> > +{
> > + return ERR_PTR(-ENOTSUP);
> > +}
> > +
> > +inline int umh_enter_ns(struct task_struct *tsk, struct cred *new)
> > +{
> > + return -ENOTSUP;
> > +}
> > +#else
> > +struct task_struct *umh_get_init_pid(void);
> > +int umh_enter_ns(struct task_struct *tsk, struct cred *new);
> > +#endif
> > +
> > extern struct subprocess_info *
> > call_usermodehelper_setup(char *path, char **argv, char **envp, gfp_t gfp_mask,
> > int (*init)(struct subprocess_info *info, struct cred *new),
> > diff --git a/kernel/kmod.c b/kernel/kmod.c
> > index 14c0188..4c649d6 100644
> > --- a/kernel/kmod.c
> > +++ b/kernel/kmod.c
> > @@ -582,6 +582,98 @@ unlock:
> > }
> > EXPORT_SYMBOL(call_usermodehelper_exec);
> >
> > +#if defined(CONFIG_PROC_FS) && defined(CONFIG_NAMESPACES)
> > +#define NS_PATH_MAX 35
> > +#define NS_PATH_FMT "%lu/ns/%s"
> > +
> > +/* Note namespace name order is significant */
> > +static const char *ns_names[] = { "user", "ipc", "uts", "net", "pid", "mnt", NULL };
> > +
> > +struct task_struct *umh_get_init_pid(void)
>
> nit: we're not getting a pid here but a task_struct pointer. Maybe this
> should be called umh_get_init_task?

Ha, yep.

>
> > +{
> > + struct task_struct *tsk;
> > +
> > + rcu_read_lock();
> > + tsk = find_task_by_vpid(1);
> > + if (tsk)
> > + get_task_struct(tsk);
> > + rcu_read_unlock();
>
> I'm not terribly familiar with the task_struct lifetime rules...
>
> I assume that you can be assured that tsk won't go away while you hold
> the rcu_read_lock, but is doing a get_task_struct while holding it
> sufficient to pin it after you drop the lock?
>
> IOW, could the refcount on the task_struct do a 0->1 transition here and
> end up being freed anyway after you've grabbed a reference?

Good point, I thought getting a reference under he read lock would be
enough but maybe I need more checks as I do with dentrys. I'll check
that.

Ian

2015-02-08 15:22:16

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

On Sun, 08 Feb 2015 11:07:32 +0800
Ian Kent <[email protected]> wrote:

> On Fri, 2015-02-06 at 07:08 -0500, Jeff Layton wrote:
> > On Thu, 05 Feb 2015 10:34:11 +0800
> > Ian Kent <[email protected]> wrote:
> >
> > > The call_usermodehelper() function executes all binaries in the
> > > global "init" root context. This doesn't allow a binary to be run
> > > within a namespace (eg. the namespace of a container).
> > >
> > > Both containerized NFS client and NFS server need the ability to
> > > execute a binary in a container's context. To do this use the init
> > > process of the callers environment is used to setup the namespaces
> > > in the same way the root init process is used otherwise.
> > >
> > > Signed-off-by: Ian Kent <[email protected]>
> > > Cc: Benjamin Coddington <[email protected]>
> > > Cc: Al Viro <[email protected]>
> > > Cc: J. Bruce Fields <[email protected]>
> > > Cc: David Howells <[email protected]>
> > > Cc: Trond Myklebust <[email protected]>
> > > Cc: Oleg Nesterov <[email protected]>
> > > Cc: Eric W. Biederman <[email protected]>
> > > Cc: Jeff Layton <[email protected]>
> > > ---
> > > include/linux/kmod.h | 16 +++++++
> > > kernel/kmod.c | 115 +++++++++++++++++++++++++++++++++++++++++++++++++-
> > > 2 files changed, 128 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/kmod.h b/include/linux/kmod.h
> > > index 15bdeed..b0f1b3c 100644
> > > --- a/include/linux/kmod.h
> > > +++ b/include/linux/kmod.h
> > > @@ -52,6 +52,7 @@ struct file;
> > > #define UMH_WAIT_EXEC 1 /* wait for the exec, but not the process */
> > > #define UMH_WAIT_PROC 2 /* wait for the process to complete */
> > > #define UMH_KILLABLE 4 /* wait for EXEC/PROC killable */
> > > +#define UMH_USE_NS 8 /* exec using caller's init namespace */
> > >
> > > struct subprocess_info {
> > > struct work_struct work;
> > > @@ -69,6 +70,21 @@ struct subprocess_info {
> > > extern int
> > > call_usermodehelper(char *path, char **argv, char **envp, int flags);
> > >
> > > +#if !defined(CONFIG_PROC_FS) || !defined(CONFIG_NAMESPACES)
> > > +inline struct task_struct *umh_get_init_task(void)
> > > +{
> > > + return ERR_PTR(-ENOTSUP);
> > > +}
> > > +
> > > +inline int umh_enter_ns(struct task_struct *tsk, struct cred *new)
> > > +{
> > > + return -ENOTSUP;
> > > +}
> > > +#else
> > > +struct task_struct *umh_get_init_pid(void);
> > > +int umh_enter_ns(struct task_struct *tsk, struct cred *new);
> > > +#endif
> > > +
> > > extern struct subprocess_info *
> > > call_usermodehelper_setup(char *path, char **argv, char **envp, gfp_t gfp_mask,
> > > int (*init)(struct subprocess_info *info, struct cred *new),
> > > diff --git a/kernel/kmod.c b/kernel/kmod.c
> > > index 14c0188..4c649d6 100644
> > > --- a/kernel/kmod.c
> > > +++ b/kernel/kmod.c
> > > @@ -582,6 +582,98 @@ unlock:
> > > }
> > > EXPORT_SYMBOL(call_usermodehelper_exec);
> > >
> > > +#if defined(CONFIG_PROC_FS) && defined(CONFIG_NAMESPACES)
> > > +#define NS_PATH_MAX 35
> > > +#define NS_PATH_FMT "%lu/ns/%s"
> > > +
> > > +/* Note namespace name order is significant */
> > > +static const char *ns_names[] = { "user", "ipc", "uts", "net", "pid", "mnt", NULL };
> > > +
> > > +struct task_struct *umh_get_init_pid(void)
> >
> > nit: we're not getting a pid here but a task_struct pointer. Maybe this
> > should be called umh_get_init_task?
>
> Ha, yep.
>
> >
> > > +{
> > > + struct task_struct *tsk;
> > > +
> > > + rcu_read_lock();
> > > + tsk = find_task_by_vpid(1);
> > > + if (tsk)
> > > + get_task_struct(tsk);
> > > + rcu_read_unlock();
> >
> > I'm not terribly familiar with the task_struct lifetime rules...
> >
> > I assume that you can be assured that tsk won't go away while you hold
> > the rcu_read_lock, but is doing a get_task_struct while holding it
> > sufficient to pin it after you drop the lock?
> >
> > IOW, could the refcount on the task_struct do a 0->1 transition here and
> > end up being freed anyway after you've grabbed a reference?
>
> Good point, I thought getting a reference under he read lock would be
> enough but maybe I need more checks as I do with dentrys. I'll check
> that.
>

It looks like the rcu_read_lock is mostly there to protect the pid_hash
actually, and get_pid_task seems to do something very similar here. So,
I think you're probably fine to do what you're doing in this patch.

That said, the "What is struct pid?" comments in include/linux/pid.h
are interesting. I wonder if my comments on your original patch were
actually unfounded. If you hold a reference to a pid_t, that might be
enough to ensure that it doesn't get reused, but I'm not sure at that
point if it could end up being detached from the task.

I suspect that pinning the actual task like you're doing here is
probably the right thing to do, but I'd certainly value input from
someone who understands the task/pid interaction better than I do.

--
Jeff Layton <[email protected]>

2015-02-08 18:13:29

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

On 02/08, Ian Kent wrote:
>
> On Fri, 2015-02-06 at 07:08 -0500, Jeff Layton wrote:
> > On Thu, 05 Feb 2015 10:34:11 +0800
> > Ian Kent <[email protected]> wrote:
> >
> > > +{
> > > + struct task_struct *tsk;
> > > +
> > > + rcu_read_lock();
> > > + tsk = find_task_by_vpid(1);
> > > + if (tsk)
> > > + get_task_struct(tsk);
> > > + rcu_read_unlock();
> >
> > I'm not terribly familiar with the task_struct lifetime rules...
> >
> > I assume that you can be assured that tsk won't go away while you hold
> > the rcu_read_lock, but is doing a get_task_struct while holding it
> > sufficient to pin it after you drop the lock?
> >
> > IOW, could the refcount on the task_struct do a 0->1 transition here and
> > end up being freed anyway after you've grabbed a reference?
>
> Good point, I thought getting a reference under he read lock would be
> enough but maybe I need more checks as I do with dentrys. I'll check
> that.

This is fine. If find_task_by_vpid() succeeds then delayed_put_task_struct()
can't be called until rcu_read_unlock() at least, so this task_struct has
a reference.

But I can't understand why do you need this helper... I guess I need to read
the whole series first. find_task_by_vpid(1) can never fail, but it can be
zombie... At seems we only need this task_struct for task_pid_nr(tsk) in
umh_enter_ns(tsk) ? Confused, but please ignore.

Oleg.

2015-02-08 19:01:40

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

On 02/05, Ian Kent wrote:
>
> +int umh_enter_ns(struct task_struct *tsk, struct cred *new)
> +{
> + char path[NS_PATH_MAX];
> + struct vfsmount *mnt;
> + const char *name;
> + pid_t pid;
> + int err = 0;
> +
> + pid = task_pid_nr(tsk);
> +
> + /*
> + * The user mode thread runner runs in the root init namespace
> + * so it will see all system pids.
> + */
> + mnt = task_active_pid_ns(current)->proc_mnt;
> +
> + for (name = ns_names[0]; *name; name++) {
> + struct file *this;
> + int len;
> +
> + len = snprintf(path,
> + NS_PATH_MAX, NS_PATH_FMT,
> + (unsigned long) pid, name);
> + if (len >= NS_PATH_MAX) {
> + err = -ENAMETOOLONG;
> + break;
> + }
> +
> + this = file_open_root(mnt->mnt_root, mnt, path, O_RDONLY);
> + if (unlikely(IS_ERR(this))) {
> + err = PTR_ERR(this);
> + break;
> + }
> +
> + err = setns_inode(file_inode(this), 0);
> + fput(this);
> + if (err)
> + break;
> + }
> +
> + return err;
> +}

Yes, I need to actually read this series and setns paths, but at first glance
there must be a simpler method to call ops->install's and switch_task_namespaces.

Sorry if this was already discussed before, but to me it looks a bit strange
to abuse /proc/ files for this. And again, iiuc file_open_root() can fail if
tsk has already exited (init can be multithreaded).

Oleg.

2015-02-09 01:43:41

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

On Sun, 2015-02-08 at 20:00 +0100, Oleg Nesterov wrote:
> On 02/05, Ian Kent wrote:
> >
> > +int umh_enter_ns(struct task_struct *tsk, struct cred *new)
> > +{
> > + char path[NS_PATH_MAX];
> > + struct vfsmount *mnt;
> > + const char *name;
> > + pid_t pid;
> > + int err = 0;
> > +
> > + pid = task_pid_nr(tsk);
> > +
> > + /*
> > + * The user mode thread runner runs in the root init namespace
> > + * so it will see all system pids.
> > + */
> > + mnt = task_active_pid_ns(current)->proc_mnt;
> > +
> > + for (name = ns_names[0]; *name; name++) {
> > + struct file *this;
> > + int len;
> > +
> > + len = snprintf(path,
> > + NS_PATH_MAX, NS_PATH_FMT,
> > + (unsigned long) pid, name);
> > + if (len >= NS_PATH_MAX) {
> > + err = -ENAMETOOLONG;
> > + break;
> > + }
> > +
> > + this = file_open_root(mnt->mnt_root, mnt, path, O_RDONLY);
> > + if (unlikely(IS_ERR(this))) {
> > + err = PTR_ERR(this);
> > + break;
> > + }
> > +
> > + err = setns_inode(file_inode(this), 0);
> > + fput(this);
> > + if (err)
> > + break;
> > + }
> > +
> > + return err;
> > +}
>
> Yes, I need to actually read this series and setns paths, but at first glance
> there must be a simpler method to call ops->install's and switch_task_namespaces.

Yes, the namespaces implementation does seem a bit strange in this
respect. I mentioned that concern the first time I posted these. But I'm
still not that clear on the big picture of how namespace are meant to
work.

It's not just access to ops->install() that's the problem.

For each of the individual namespaces we open a file handle, to get
access to ops->install() for that namespace, install it, drop "all" the
namespaces then replace them with the new set that essentially has one
namespace changed.

>
> Sorry if this was already discussed before, but to me it looks a bit strange
> to abuse /proc/ files for this. And again, iiuc file_open_root() can fail if
> tsk has already exited (init can be multithreaded).

Not sure that the failure is a problem though as long as it's handled
since, if the init process of the container is gone (or will be gone
once were done), so is the container and the caller.

The use of proc is largely because we can't use the callers environment
to setup the process as the caller could manipulate it to subvert the
system. When not executing in a container the thread runner runs under
root init so nothing needs to be done but in a container we want to use
the init process of the container so the container's namespaces are
used. There is probably a better way to do it, suggestions welcome!

Ian

2015-02-09 17:04:39

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

On 02/09, Ian Kent wrote:
>
> On Sun, 2015-02-08 at 20:00 +0100, Oleg Nesterov wrote:
> > > +
> > > + this = file_open_root(mnt->mnt_root, mnt, path, O_RDONLY);
> > > + if (unlikely(IS_ERR(this))) {
> > > + err = PTR_ERR(this);
> > > + break;
> > > + }
> > > +
> > > + err = setns_inode(file_inode(this), 0);
> > > + fput(this);
> > > + if (err)
> > > + break;
> > > + }
> > > +
> > > + return err;
> > > +}
> >
> > Yes, I need to actually read this series and setns paths, but at first glance
> > there must be a simpler method to call ops->install's and switch_task_namespaces.
>
> Yes, the namespaces implementation does seem a bit strange in this
> respect. I mentioned that concern the first time I posted these. But I'm
> still not that clear on the big picture of how namespace are meant to
> work.
>
> It's not just access to ops->install() that's the problem.
>
> For each of the individual namespaces we open a file handle, to get
> access to ops->install() for that namespace, install it, drop "all" the
> namespaces then replace them with the new set that essentially has one
> namespace changed.

I understand. but I still can't understand why we can't implement something
like
enter_ns(struct nsproxy *p)
{
new_nsproxy = create_new_namespaces(...);

p->mnt_ns->ns->ops->install(new_nsproxy, ...);
p->pid_ns_for_children->ns->ops->install(new_nsproxy, ...);
...

switch_task_namespaces(new_nsproxy);
}

Why we should abuse fs/proc ?

See also below.

> > Sorry if this was already discussed before, but to me it looks a bit strange
> > to abuse /proc/ files for this. And again, iiuc file_open_root() can fail if
> > tsk has already exited (init can be multithreaded).
>
> Not sure that the failure is a problem though as long as it's handled
> since, if the init process of the container is gone (or will be gone
> once were done), so is the container and the caller.

Not really. Individual thread can exit while the whole "init" process can be alive.
In particular the main thread can exit and become a zombie, so find_task_by_vpid(1)
can't work in general.

You can probably use task_active_pid_ns()-child_reaper, but again I do not think
you should pass "task_struct *" to enter_ns().

And. Whatever we do, ops->install() or setns_inode() can't solve the problem with
pid_ns. You need the additional clone() to "activate" it. pidns_install() does not
actually change task_active_pid_ns().

> but in a container we want to use
> the init process of the container

Yes sure, I understand. But see above.

Oleg.

2015-02-10 00:08:41

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

On Mon, 2015-02-09 at 17:03 +0100, Oleg Nesterov wrote:
> On 02/09, Ian Kent wrote:
> >
> > On Sun, 2015-02-08 at 20:00 +0100, Oleg Nesterov wrote:
> > > > +
> > > > + this = file_open_root(mnt->mnt_root, mnt, path, O_RDONLY);
> > > > + if (unlikely(IS_ERR(this))) {
> > > > + err = PTR_ERR(this);
> > > > + break;
> > > > + }
> > > > +
> > > > + err = setns_inode(file_inode(this), 0);
> > > > + fput(this);
> > > > + if (err)
> > > > + break;
> > > > + }
> > > > +
> > > > + return err;
> > > > +}
> > >
> > > Yes, I need to actually read this series and setns paths, but at first glance
> > > there must be a simpler method to call ops->install's and switch_task_namespaces.
> >
> > Yes, the namespaces implementation does seem a bit strange in this
> > respect. I mentioned that concern the first time I posted these. But I'm
> > still not that clear on the big picture of how namespace are meant to
> > work.
> >
> > It's not just access to ops->install() that's the problem.
> >
> > For each of the individual namespaces we open a file handle, to get
> > access to ops->install() for that namespace, install it, drop "all" the
> > namespaces then replace them with the new set that essentially has one
> > namespace changed.
>
> I understand. but I still can't understand why we can't implement something
> like
> enter_ns(struct nsproxy *p)
> {
> new_nsproxy = create_new_namespaces(...);
>
> p->mnt_ns->ns->ops->install(new_nsproxy, ...);
> p->pid_ns_for_children->ns->ops->install(new_nsproxy, ...);
> ...
>
> switch_task_namespaces(new_nsproxy);
> }
>
> Why we should abuse fs/proc ?

That sounds like a much better approach.
Your saying just take a reference to the nsproxy from the located
process and use it instead, right?
Working out if there's a difference with what you from the open is
challenging (I already tried), I'll have another go at it.

>
> See also below.
>
> > > Sorry if this was already discussed before, but to me it looks a bit strange
> > > to abuse /proc/ files for this. And again, iiuc file_open_root() can fail if
> > > tsk has already exited (init can be multithreaded).
> >
> > Not sure that the failure is a problem though as long as it's handled
> > since, if the init process of the container is gone (or will be gone
> > once were done), so is the container and the caller.
>
> Not really. Individual thread can exit while the whole "init" process can be alive.
> In particular the main thread can exit and become a zombie, so find_task_by_vpid(1)
> can't work in general.
>
> You can probably use task_active_pid_ns()-child_reaper, but again I do not think
> you should pass "task_struct *" to enter_ns().

OK, I need to check this, btw, thanks for the comments.

>
>
> And. Whatever we do, ops->install() or setns_inode() can't solve the problem with
> pid_ns. You need the additional clone() to "activate" it. pidns_install() does not
> actually change task_active_pid_ns().

Right, but all this is done in preparation for the following do_execve()
call. Isn't that enough or am I missing something?

Ian

2015-02-10 16:56:25

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

On 02/10, Ian Kent wrote:
>
> On Mon, 2015-02-09 at 17:03 +0100, Oleg Nesterov wrote:
> >
> > I understand. but I still can't understand why we can't implement something
> > like
> > enter_ns(struct nsproxy *p)
> > {
> > new_nsproxy = create_new_namespaces(...);
> >
> > p->mnt_ns->ns->ops->install(new_nsproxy, ...);
> > p->pid_ns_for_children->ns->ops->install(new_nsproxy, ...);
> > ...
> >
> > switch_task_namespaces(new_nsproxy);
> > }
> >
> > Why we should abuse fs/proc ?
>
> That sounds like a much better approach.
> Your saying just take a reference to the nsproxy from the located
> process and use it instead, right?

Yes,

> Working out if there's a difference with what you from the open is
> challenging (I already tried), I'll have another go at it.

I thinks there should not be any difference, but please re-check ;)

> > And. Whatever we do, ops->install() or setns_inode() can't solve the problem with
> > pid_ns. You need the additional clone() to "activate" it. pidns_install() does not
> > actually change task_active_pid_ns().
>
> Right, but all this is done in preparation for the following do_execve()
> call. Isn't that enough or am I missing something?

Yes, but do_execve() doesn't (and shouldn't) change task_active_pid_ns(). Note
the ->pid_ns_for_children's name. It is only used by copy_process()->alloc_pid().

task_active_pid_ns() uses task_pid() and we obviously can't change it.

I am wondering if we can do something like

kernel_thread_in_ns(struct nsproxy *ns, ...)
{
struct nsproxy *saved_ns = current->nsproxy;
pid_t pid;

task_lock(current);
current->nsproxy = ns;
task_unlock(current);

pid = kernel_thread(...);

task_lock(current);
current->nsproxy = saved_ns;
task_unlock(current);

return pid;
}

used by __call_usermodehelper/wait_for_helper, instead of "enter_ns" from
sub_info->init()...

Oleg.

2015-02-11 00:40:42

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

On Tue, 2015-02-10 at 17:55 +0100, Oleg Nesterov wrote:
> On 02/10, Ian Kent wrote:
> >
> > On Mon, 2015-02-09 at 17:03 +0100, Oleg Nesterov wrote:
> > >
> > > I understand. but I still can't understand why we can't implement something
> > > like
> > > enter_ns(struct nsproxy *p)
> > > {
> > > new_nsproxy = create_new_namespaces(...);
> > >
> > > p->mnt_ns->ns->ops->install(new_nsproxy, ...);
> > > p->pid_ns_for_children->ns->ops->install(new_nsproxy, ...);
> > > ...
> > >
> > > switch_task_namespaces(new_nsproxy);
> > > }
> > >
> > > Why we should abuse fs/proc ?
> >
> > That sounds like a much better approach.
> > Your saying just take a reference to the nsproxy from the located
> > process and use it instead, right?
>
> Yes,
>
> > Working out if there's a difference with what you from the open is
> > challenging (I already tried), I'll have another go at it.
>
> I thinks there should not be any difference, but please re-check ;)
>
> > > And. Whatever we do, ops->install() or setns_inode() can't solve the problem with
> > > pid_ns. You need the additional clone() to "activate" it. pidns_install() does not
> > > actually change task_active_pid_ns().
> >
> > Right, but all this is done in preparation for the following do_execve()
> > call. Isn't that enough or am I missing something?
>
> Yes, but do_execve() doesn't (and shouldn't) change task_active_pid_ns(). Note
> the ->pid_ns_for_children's name. It is only used by copy_process()->alloc_pid().

Right, I vaguely recall seeing something like this earlier in the fork
procedure, I get it. I'll need to have a another look at it.

>
> task_active_pid_ns() uses task_pid() and we obviously can't change it.
>
> I am wondering if we can do something like
>
> kernel_thread_in_ns(struct nsproxy *ns, ...)
> {
> struct nsproxy *saved_ns = current->nsproxy;
> pid_t pid;
>
> task_lock(current);
> current->nsproxy = ns;
> task_unlock(current);
>
> pid = kernel_thread(...);
>
> task_lock(current);
> current->nsproxy = saved_ns;
> task_unlock(current);
>
> return pid;
> }
>
> used by __call_usermodehelper/wait_for_helper, instead of "enter_ns" from
> sub_info->init()...

Again, thanks for the suggestions.
You've given me a few things think about and check out.

Ian

2015-02-16 06:17:17

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

On Tue, 2015-02-10 at 17:55 +0100, Oleg Nesterov wrote:
> On 02/10, Ian Kent wrote:
> >
> > On Mon, 2015-02-09 at 17:03 +0100, Oleg Nesterov wrote:
> > >
> > > I understand. but I still can't understand why we can't implement something
> > > like
> > > enter_ns(struct nsproxy *p)
> > > {
> > > new_nsproxy = create_new_namespaces(...);
> > >
> > > p->mnt_ns->ns->ops->install(new_nsproxy, ...);
> > > p->pid_ns_for_children->ns->ops->install(new_nsproxy, ...);
> > > ...
> > >
> > > switch_task_namespaces(new_nsproxy);
> > > }
> > >
> > > Why we should abuse fs/proc ?
> >
> > That sounds like a much better approach.
> > Your saying just take a reference to the nsproxy from the located
> > process and use it instead, right?
>
> Yes,

I'm still not sure if this can be done (at least without surgery to the
namespace implementation) and I think I've been here before which is
what lead to the file_open_root() approach.

The difficulty is the second parameter to the install() call above, the
struct ns_common. In setns() it's obtained from the procfs file inode
and the file open is what's used to obtain each namespace type (in the
form of a struct ns_common) from a process context different from
current, current being the thread runner process.

I might still be able to work out a (viable) way to obtain the
appropriate ns_common struct in each case without a file open but it's
hard to see how.

Ian

2015-02-16 17:14:53

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

On 02/16, Ian Kent wrote:
>
> On Tue, 2015-02-10 at 17:55 +0100, Oleg Nesterov wrote:
> > On 02/10, Ian Kent wrote:
> > >
> > > On Mon, 2015-02-09 at 17:03 +0100, Oleg Nesterov wrote:
> > > >
> > > > I understand. but I still can't understand why we can't implement something
> > > > like
> > > > enter_ns(struct nsproxy *p)
> > > > {
> > > > new_nsproxy = create_new_namespaces(...);
> > > >
> > > > p->mnt_ns->ns->ops->install(new_nsproxy, ...);
> > > > p->pid_ns_for_children->ns->ops->install(new_nsproxy, ...);
> > > > ...
> > > >
> > > > switch_task_namespaces(new_nsproxy);
> > > > }
> > > >
> > > > Why we should abuse fs/proc ?
> > >
> > > That sounds like a much better approach.
> > > Your saying just take a reference to the nsproxy from the located
> > > process and use it instead, right?
> >
> > Yes,
>
> I'm still not sure if this can be done (at least without surgery to the
> namespace implementation) and I think I've been here before which is
> what lead to the file_open_root() approach.
>
> The difficulty is the second parameter to the install() call above, the
> struct ns_common. In setns() it's obtained from the procfs file inode
> and the file open is what's used to obtain each namespace type (in the
> form of a struct ns_common) from a process context different from
> current, current being the thread runner process.
>
> I might still be able to work out a (viable) way to obtain the
> appropriate ns_common struct in each case without a file open but it's
> hard to see how.

Not sure I understand... Every "namespace" pointer "struct nsproxy" has
the "struct ns_common ns" you need? So you can do (for example)

p->mnt_ns->ns->ops->install(new_nsproxy, &p->mnt_ns->ns);

or I missed something? (userns differs, you need cred->user_ns, of course).

Perhaps I missed something, but this doesn't look like a problem...

The real problem is that , let me repeat, is that pidns_install() does not
and can't change active_pid_ns. So I think that kernel_thread_in_ns() probably
make more sense.

Oleg.

2015-02-16 18:26:07

by Oleg Nesterov

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

On 02/16, Oleg Nesterov wrote:
>
> On 02/16, Ian Kent wrote:
> >
> > On Tue, 2015-02-10 at 17:55 +0100, Oleg Nesterov wrote:
> > > On 02/10, Ian Kent wrote:
> > > >
> > > > On Mon, 2015-02-09 at 17:03 +0100, Oleg Nesterov wrote:
> > > > >
> > > > > I understand. but I still can't understand why we can't implement something
> > > > > like
> > > > > enter_ns(struct nsproxy *p)
> > > > > {
> > > > > new_nsproxy = create_new_namespaces(...);
> > > > >
> > > > > p->mnt_ns->ns->ops->install(new_nsproxy, ...);
> > > > > p->pid_ns_for_children->ns->ops->install(new_nsproxy, ...);
> > > > > ...
> > > > >
> > > > > switch_task_namespaces(new_nsproxy);
> > > > > }
> > > > >
> > > > > Why we should abuse fs/proc ?
> > > >
> > > > That sounds like a much better approach.
> > > > Your saying just take a reference to the nsproxy from the located
> > > > process and use it instead, right?
> > >
> > > Yes,
> >
> > I'm still not sure if this can be done (at least without surgery to the
> > namespace implementation) and I think I've been here before which is
> > what lead to the file_open_root() approach.
> >
> > The difficulty is the second parameter to the install() call above, the
> > struct ns_common. In setns() it's obtained from the procfs file inode
> > and the file open is what's used to obtain each namespace type (in the
> > form of a struct ns_common) from a process context different from
> > current, current being the thread runner process.
> >
> > I might still be able to work out a (viable) way to obtain the
> > appropriate ns_common struct in each case without a file open but it's
> > hard to see how.
>
> Not sure I understand... Every "namespace" pointer "struct nsproxy" has
> the "struct ns_common ns" you need? So you can do (for example)
>
> p->mnt_ns->ns->ops->install(new_nsproxy, &p->mnt_ns->ns);
>
> or I missed something? (userns differs, you need cred->user_ns, of course).
>
>
> Perhaps I missed something, but this doesn't look like a problem...

And in case this wasn't clear we obviously need to pass nsproxy/cred to
sub_info->init().

There is nothing magical in /proc or I am totally confused. Look at ns_get_path(),
it (to simplify) does

inode->i_private = ns_ops->get(task);

and every ->get() method returns &ns->ns.

But again, again:

> The real problem is that , let me repeat, is that pidns_install() does not
> and can't change active_pid_ns. So I think that kernel_thread_in_ns() probably
> make more sense.

Yes...

Oleg.

2015-02-18 01:42:48

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

On Mon, 2015-02-16 at 18:13 +0100, Oleg Nesterov wrote:
> On 02/16, Ian Kent wrote:
> >
> > On Tue, 2015-02-10 at 17:55 +0100, Oleg Nesterov wrote:
> > > On 02/10, Ian Kent wrote:
> > > >
> > > > On Mon, 2015-02-09 at 17:03 +0100, Oleg Nesterov wrote:
> > > > >
> > > > > I understand. but I still can't understand why we can't implement something
> > > > > like
> > > > > enter_ns(struct nsproxy *p)
> > > > > {
> > > > > new_nsproxy = create_new_namespaces(...);
> > > > >
> > > > > p->mnt_ns->ns->ops->install(new_nsproxy, ...);
> > > > > p->pid_ns_for_children->ns->ops->install(new_nsproxy, ...);
> > > > > ...
> > > > >
> > > > > switch_task_namespaces(new_nsproxy);
> > > > > }
> > > > >
> > > > > Why we should abuse fs/proc ?
> > > >
> > > > That sounds like a much better approach.
> > > > Your saying just take a reference to the nsproxy from the located
> > > > process and use it instead, right?
> > >
> > > Yes,
> >
> > I'm still not sure if this can be done (at least without surgery to the
> > namespace implementation) and I think I've been here before which is
> > what lead to the file_open_root() approach.
> >
> > The difficulty is the second parameter to the install() call above, the
> > struct ns_common. In setns() it's obtained from the procfs file inode
> > and the file open is what's used to obtain each namespace type (in the
> > form of a struct ns_common) from a process context different from
> > current, current being the thread runner process.
> >
> > I might still be able to work out a (viable) way to obtain the
> > appropriate ns_common struct in each case without a file open but it's
> > hard to see how.
>
> Not sure I understand... Every "namespace" pointer "struct nsproxy" has
> the "struct ns_common ns" you need? So you can do (for example)
>
> p->mnt_ns->ns->ops->install(new_nsproxy, &p->mnt_ns->ns);
>
> or I missed something? (userns differs, you need cred->user_ns, of course).

I didn't see that when I looked so I missed it, thanks for pointing it
out.

>
>
> Perhaps I missed something, but this doesn't look like a problem...
>
> The real problem is that , let me repeat, is that pidns_install() does not
> and can't change active_pid_ns. So I think that kernel_thread_in_ns() probably
> make more sense.

Right, I didn't miss that when you mentioned it.

Changing the execution order using your kernel_thread_in_ns() suggestion
is clearly what needs to be done.

Ian

2015-02-18 02:10:06

[permalink] [raw]

Subject: Re: [RFC PATCH 3/8] kmod - teach call_usermodehelper() to use a namespace

On Mon, 2015-02-16 at 19:24 +0100, Oleg Nesterov wrote:
> On 02/16, Oleg Nesterov wrote:
> >
> > On 02/16, Ian Kent wrote:
> > >
> > > On Tue, 2015-02-10 at 17:55 +0100, Oleg Nesterov wrote:
> > > > On 02/10, Ian Kent wrote:
> > > > >
> > > > > On Mon, 2015-02-09 at 17:03 +0100, Oleg Nesterov wrote:
> > > > > >
> > > > > > I understand. but I still can't understand why we can't implement something
> > > > > > like
> > > > > > enter_ns(struct nsproxy *p)
> > > > > > {
> > > > > > new_nsproxy = create_new_namespaces(...);
> > > > > >
> > > > > > p->mnt_ns->ns->ops->install(new_nsproxy, ...);
> > > > > > p->pid_ns_for_children->ns->ops->install(new_nsproxy, ...);
> > > > > > ...
> > > > > >
> > > > > > switch_task_namespaces(new_nsproxy);
> > > > > > }
> > > > > >
> > > > > > Why we should abuse fs/proc ?
> > > > >
> > > > > That sounds like a much better approach.
> > > > > Your saying just take a reference to the nsproxy from the located
> > > > > process and use it instead, right?
> > > >
> > > > Yes,
> > >
> > > I'm still not sure if this can be done (at least without surgery to the
> > > namespace implementation) and I think I've been here before which is
> > > what lead to the file_open_root() approach.
> > >
> > > The difficulty is the second parameter to the install() call above, the
> > > struct ns_common. In setns() it's obtained from the procfs file inode
> > > and the file open is what's used to obtain each namespace type (in the
> > > form of a struct ns_common) from a process context different from
> > > current, current being the thread runner process.
> > >
> > > I might still be able to work out a (viable) way to obtain the
> > > appropriate ns_common struct in each case without a file open but it's
> > > hard to see how.
> >
> > Not sure I understand... Every "namespace" pointer "struct nsproxy" has
> > the "struct ns_common ns" you need? So you can do (for example)
> >
> > p->mnt_ns->ns->ops->install(new_nsproxy, &p->mnt_ns->ns);
> >
> > or I missed something? (userns differs, you need cred->user_ns, of course).
> >
> >
> > Perhaps I missed something, but this doesn't look like a problem...
>
> And in case this wasn't clear we obviously need to pass nsproxy/cred to
> sub_info->init().
>
> There is nothing magical in /proc or I am totally confused. Look at ns_get_path(),
> it (to simplify) does
>
> inode->i_private = ns_ops->get(task);
>
> and every ->get() method returns &ns->ns.

Yep, my confusion was with which nsproxy is the needed nsproxy and your
right, there isn't anything special about setting the inode nsproxy.

It looks like your comments expose the file open as a really complicated
way of getting back the nsproxy(->ns) that we (will) already have from
the located task. I should look a little further to confirm that but, in
hindsight, it seems obvious now.

And I think that's what your trying so hard to get through to me, ;)

>
>
> But again, again:
>
> > The real problem is that , let me repeat, is that pidns_install() does not
> > and can't change active_pid_ns. So I think that kernel_thread_in_ns() probably
> > make more sense.
>
> Yes...

Yes, fortunately I got that when you first mentioned it.
Your point here was easy to verify and understand.

Thanks for your persistence.
Ian

2015-02-18 17:06:24

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Fri, Feb 06, 2015 at 09:47:25AM +0800, Ian Kent wrote:
> On Thu, 2015-02-05 at 15:14 +0000, David Howells wrote:
> >
> > > + /* If running within a container use the container namespace */
> > > + if (current->nsproxy->net_ns != &init_net)
> >
> > Is that a viable check? Is it possible to have a container that shares
> > networking details?
>
> That's up for discussion.
>
> I thought about it and concluded that the check is probably not
> sufficient for any of the cases.
>
> I left it like that because I'm not sure exactly what the use cases are,
> hoping it promote discussion and here we are.
>
> I also think the current container environments don't share net
> namespace with the root init net namspace, necessarily, because thy are
> containers, ;)
>
> TBH I haven't looked at the user space container creation code but I
> expect it could be done that way if it was needed, so the answer is yes
> and no, ;)
>
> The questions then are do we need to check anything else, and what
> environment should the callback use in the different cases, and what
> other cases might break if we change it?
>
> For example, should the fs namespace also be checked for all of these
> cases, since we're executing a callback, or is whatever that's set to in
> the container always what's required for locating the executable.

What would be the disadvantage of setting UMH_USE_NS unconditionally
here?

--b.

2015-02-18 17:31:34

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Wed, Feb 18, 2015 at 12:06:20PM -0500, J. Bruce Fields wrote:
> On Fri, Feb 06, 2015 at 09:47:25AM +0800, Ian Kent wrote:
> > On Thu, 2015-02-05 at 15:14 +0000, David Howells wrote:
> > >
> > > > + /* If running within a container use the container namespace */
> > > > + if (current->nsproxy->net_ns != &init_net)
> > >
> > > Is that a viable check? Is it possible to have a container that shares
> > > networking details?
> >
> > That's up for discussion.
> >
> > I thought about it and concluded that the check is probably not
> > sufficient for any of the cases.
> >
> > I left it like that because I'm not sure exactly what the use cases are,
> > hoping it promote discussion and here we are.
> >
> > I also think the current container environments don't share net
> > namespace with the root init net namspace, necessarily, because thy are
> > containers, ;)
> >
> > TBH I haven't looked at the user space container creation code but I
> > expect it could be done that way if it was needed, so the answer is yes
> > and no, ;)
> >
> > The questions then are do we need to check anything else, and what
> > environment should the callback use in the different cases, and what
> > other cases might break if we change it?
> >
> > For example, should the fs namespace also be checked for all of these
> > cases, since we're executing a callback, or is whatever that's set to in
> > the container always what's required for locating the executable.
>
> What would be the disadvantage of setting UMH_USE_NS unconditionally
> here?

In the nfs idmapping case, the mapping is per-nfs_client.

Can nfs_idmap_new be the one that calls umh_get_init_task, with the
corresponding put done in nfs_idmap_delete, or is there some reason that
doesn't work?

--b.

2015-02-18 17:37:38

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [RFC PATCH 6/8] nfsd - use namespace if not executing in init namespace

On Thu, Feb 05, 2015 at 10:34:31AM +0800, Ian Kent wrote:
> If nfsd is running within a container the client tracking operations
> should run within the container also.
>
> Signed-off-by: Ian Kent <[email protected]>
> Cc: Benjamin Coddington <[email protected]>
> Cc: Al Viro <[email protected]>
> Cc: J. Bruce Fields <[email protected]>
> Cc: David Howells <[email protected]>
> Cc: Trond Myklebust <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Eric W. Biederman <[email protected]>
> Cc: Jeff Layton <[email protected]>
> ---
> fs/nfsd/netns.h | 2 ++
> fs/nfsd/nfs4recover.c | 48 ++++++++++++++++++++++++++++++++----------------
> 2 files changed, 34 insertions(+), 16 deletions(-)
>
> diff --git a/fs/nfsd/netns.h b/fs/nfsd/netns.h
> index ea6749a..c168196 100644
> --- a/fs/nfsd/netns.h
> +++ b/fs/nfsd/netns.h
> @@ -112,6 +112,8 @@ struct nfsd_net {
> u32 clientid_counter;
>
> struct svc_serv *nfsd_serv;
> +
> + int umh_flags;
> };
>
> /* Simple check to find out if a given net was properly initialized */
> diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
> index cc6a760..b962856 100644
> --- a/fs/nfsd/nfs4recover.c
> +++ b/fs/nfsd/nfs4recover.c
> @@ -1184,7 +1184,9 @@ nfsd4_cltrack_grace_start(time_t grace_start)
> }
>
> static int
> -nfsd4_umh_cltrack_upcall(char *cmd, char *arg, char *env0, char *env1)
> +nfsd4_umh_cltrack_upcall(char *cmd,
> + char *arg, char *env0, char *env1,
> + int flags)
> {
> char *envp[3];
> char *argv[4];
> @@ -1209,7 +1211,7 @@ nfsd4_umh_cltrack_upcall(char *cmd, char *arg, char *env0, char *env1)
> argv[2] = arg;
> argv[3] = NULL;
>
> - ret = call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
> + ret = call_usermodehelper(argv[0], argv, envp, flags);
> /*
> * Disable the upcall mechanism if we're getting an ENOENT or EACCES
> * error. The admin can re-enable it on the fly by using sysfs
> @@ -1252,14 +1254,13 @@ nfsd4_umh_cltrack_init(struct net *net)
> struct nfsd_net *nn = net_generic(net, nfsd_net_id);
> char *grace_start = nfsd4_cltrack_grace_start(nn->boot_time);
>
> - /* XXX: The usermode helper s not working in container yet. */
> - if (net != &init_net) {
> - WARN(1, KERN_ERR "NFSD: attempt to initialize umh client "
> - "tracking in a container!\n");
> - return -EINVAL;
> - }
> + nn->umh_flags = UMH_WAIT_PROC;
> + if (net != &init_net)
> + nn->umh_flags |= UMH_USE_NS;
>
> - ret = nfsd4_umh_cltrack_upcall("init", NULL, grace_start, NULL);
> + ret = nfsd4_umh_cltrack_upcall("init",
> + NULL, grace_start, NULL,
> + nn->umh_flags);

This code is always run from a kernel thread, so I assume this isn't
actually going to do anything different.

--b.

> kfree(grace_start);
> return ret;
> }
> @@ -1285,6 +1286,7 @@ nfsd4_umh_cltrack_create(struct nfs4_client *clp)
> {
> char *hexid, *has_session, *grace_start;
> struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
> + int ret;
>
> /*
> * With v4.0 clients, there's little difference in outcome between a
> @@ -1312,7 +1314,10 @@ nfsd4_umh_cltrack_create(struct nfs4_client *clp)
> grace_start = nfsd4_cltrack_grace_start(nn->boot_time);
>
> nfsd4_cltrack_upcall_lock(clp);
> - if (!nfsd4_umh_cltrack_upcall("create", hexid, has_session, grace_start))
> + ret = nfsd4_umh_cltrack_upcall("create",
> + hexid, has_session, grace_start,
> + nn->umh_flags);
> + if (!ret)
> set_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags);
> nfsd4_cltrack_upcall_unlock(clp);
>
> @@ -1324,7 +1329,9 @@ nfsd4_umh_cltrack_create(struct nfs4_client *clp)
> static void
> nfsd4_umh_cltrack_remove(struct nfs4_client *clp)
> {
> + struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
> char *hexid;
> + int ret;
>
> if (!test_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags))
> return;
> @@ -1336,9 +1343,13 @@ nfsd4_umh_cltrack_remove(struct nfs4_client *clp)
> }
>
> nfsd4_cltrack_upcall_lock(clp);
> - if (test_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags) &&
> - nfsd4_umh_cltrack_upcall("remove", hexid, NULL, NULL) == 0)
> - clear_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags);
> + if (test_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags)) {
> + ret = nfsd4_umh_cltrack_upcall("remove",
> + hexid, NULL, NULL,
> + nn->umh_flags);
> + if (ret == 0)
> + clear_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags);
> + }
> nfsd4_cltrack_upcall_unlock(clp);
>
> kfree(hexid);
> @@ -1347,8 +1358,9 @@ nfsd4_umh_cltrack_remove(struct nfs4_client *clp)
> static int
> nfsd4_umh_cltrack_check(struct nfs4_client *clp)
> {
> - int ret;
> + struct nfsd_net *nn = net_generic(clp->net, nfsd_net_id);
> char *hexid, *has_session, *legacy;
> + int ret;
>
> if (test_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags))
> return 0;
> @@ -1366,7 +1378,9 @@ nfsd4_umh_cltrack_check(struct nfs4_client *clp)
> if (test_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags)) {
> ret = 0;
> } else {
> - ret = nfsd4_umh_cltrack_upcall("check", hexid, has_session, legacy);
> + ret = nfsd4_umh_cltrack_upcall("check", hexid,
> + has_session, legacy,
> + mm->umh_flags);
> if (ret == 0)
> set_bit(NFSD4_CLIENT_STABLE, &clp->cl_flags);
> }
> @@ -1386,7 +1400,9 @@ nfsd4_umh_cltrack_grace_done(struct nfsd_net *nn)
>
> sprintf(timestr, "%ld", nn->boot_time);
> legacy = nfsd4_cltrack_legacy_topdir();
> - nfsd4_umh_cltrack_upcall("gracedone", timestr, legacy, NULL);
> + nfsd4_umh_cltrack_upcall("gracedone",
> + timestr, legacy, NULL,
> + nn->umh_flags);
> kfree(legacy);
> }
>

2015-02-18 20:59:11

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Wed, Feb 18, 2015 at 12:31:32PM -0500, J. Bruce Fields wrote:
> On Wed, Feb 18, 2015 at 12:06:20PM -0500, J. Bruce Fields wrote:
> > On Fri, Feb 06, 2015 at 09:47:25AM +0800, Ian Kent wrote:
> > > On Thu, 2015-02-05 at 15:14 +0000, David Howells wrote:
> > > >
> > > > > + /* If running within a container use the container namespace */
> > > > > + if (current->nsproxy->net_ns != &init_net)
> > > >
> > > > Is that a viable check? Is it possible to have a container that shares
> > > > networking details?
> > >
> > > That's up for discussion.
> > >
> > > I thought about it and concluded that the check is probably not
> > > sufficient for any of the cases.
> > >
> > > I left it like that because I'm not sure exactly what the use cases are,
> > > hoping it promote discussion and here we are.
> > >
> > > I also think the current container environments don't share net
> > > namespace with the root init net namspace, necessarily, because thy are
> > > containers, ;)
> > >
> > > TBH I haven't looked at the user space container creation code but I
> > > expect it could be done that way if it was needed, so the answer is yes
> > > and no, ;)
> > >
> > > The questions then are do we need to check anything else, and what
> > > environment should the callback use in the different cases, and what
> > > other cases might break if we change it?
> > >
> > > For example, should the fs namespace also be checked for all of these
> > > cases, since we're executing a callback, or is whatever that's set to in
> > > the container always what's required for locating the executable.
> >
> > What would be the disadvantage of setting UMH_USE_NS unconditionally
> > here?
>
> In the nfs idmapping case, the mapping is per-nfs_client.
>
> Can nfs_idmap_new be the one that calls umh_get_init_task, with the
> corresponding put done in nfs_idmap_delete, or is there some reason that
> doesn't work?

It's confusing sorting out possible use cases, but I think both of these
are reasonable:

- mount an nfs filesystem from the host, then spawn containers
that all use that filesystem.
- mount an nfs filesystem from within a container.

Your approach might work for the second, but in the first case we'll end
up with idmappers from multiple containers all trying to do the
idmapping for the shared filesystem.

--b.

2015-02-19 00:39:20

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Wed, 2015-02-18 at 15:59 -0500, J. Bruce Fields wrote:
> On Wed, Feb 18, 2015 at 12:31:32PM -0500, J. Bruce Fields wrote:
> > On Wed, Feb 18, 2015 at 12:06:20PM -0500, J. Bruce Fields wrote:
> > > On Fri, Feb 06, 2015 at 09:47:25AM +0800, Ian Kent wrote:
> > > > On Thu, 2015-02-05 at 15:14 +0000, David Howells wrote:
> > > > >
> > > > > > + /* If running within a container use the container namespace */
> > > > > > + if (current->nsproxy->net_ns != &init_net)
> > > > >
> > > > > Is that a viable check? Is it possible to have a container that shares
> > > > > networking details?
> > > >
> > > > That's up for discussion.
> > > >
> > > > I thought about it and concluded that the check is probably not
> > > > sufficient for any of the cases.
> > > >
> > > > I left it like that because I'm not sure exactly what the use cases are,
> > > > hoping it promote discussion and here we are.
> > > >
> > > > I also think the current container environments don't share net
> > > > namespace with the root init net namspace, necessarily, because thy are
> > > > containers, ;)
> > > >
> > > > TBH I haven't looked at the user space container creation code but I
> > > > expect it could be done that way if it was needed, so the answer is yes
> > > > and no, ;)
> > > >
> > > > The questions then are do we need to check anything else, and what
> > > > environment should the callback use in the different cases, and what
> > > > other cases might break if we change it?
> > > >
> > > > For example, should the fs namespace also be checked for all of these
> > > > cases, since we're executing a callback, or is whatever that's set to in
> > > > the container always what's required for locating the executable.
> > >
> > > What would be the disadvantage of setting UMH_USE_NS unconditionally
> > > here?
> >
> > In the nfs idmapping case, the mapping is per-nfs_client.
> >
> > Can nfs_idmap_new be the one that calls umh_get_init_task, with the
> > corresponding put done in nfs_idmap_delete, or is there some reason that
> > doesn't work?
>
> It's confusing sorting out possible use cases, but I think both of these
> are reasonable:
>
> - mount an nfs filesystem from the host, then spawn containers
> that all use that filesystem.
> - mount an nfs filesystem from within a container.
>
> Your approach might work for the second, but in the first case we'll end
> up with idmappers from multiple containers all trying to do the
> idmapping for the shared filesystem.

These patches are examples for context.

Working out whether to run in a namespace or not was always going to be
difficult, specifically for the case you point out. Maybe we can make
use of some other information, namespace information in the super block
perhaps, or something else, or perhaps we will need to add some
information for this, not sure yet. We'll need to work together on that.

TBH, I'm not that focused on the use cases because the base
implementation is still undergoing significant change although I believe
the use of a flag to request namespace execution is a good approach.
That probably won't change.

Ian

2015-02-19 01:31:19

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Thu, Feb 19, 2015 at 08:39:01AM +0800, Ian Kent wrote:
> On Wed, 2015-02-18 at 15:59 -0500, J. Bruce Fields wrote:
> > On Wed, Feb 18, 2015 at 12:31:32PM -0500, J. Bruce Fields wrote:
> > > On Wed, Feb 18, 2015 at 12:06:20PM -0500, J. Bruce Fields wrote:
> > > > On Fri, Feb 06, 2015 at 09:47:25AM +0800, Ian Kent wrote:
> > > > > On Thu, 2015-02-05 at 15:14 +0000, David Howells wrote:
> > > > > >
> > > > > > > + /* If running within a container use the container namespace */
> > > > > > > + if (current->nsproxy->net_ns != &init_net)
> > > > > >
> > > > > > Is that a viable check? Is it possible to have a container that shares
> > > > > > networking details?
> > > > >
> > > > > That's up for discussion.
> > > > >
> > > > > I thought about it and concluded that the check is probably not
> > > > > sufficient for any of the cases.
> > > > >
> > > > > I left it like that because I'm not sure exactly what the use cases are,
> > > > > hoping it promote discussion and here we are.
> > > > >
> > > > > I also think the current container environments don't share net
> > > > > namespace with the root init net namspace, necessarily, because thy are
> > > > > containers, ;)
> > > > >
> > > > > TBH I haven't looked at the user space container creation code but I
> > > > > expect it could be done that way if it was needed, so the answer is yes
> > > > > and no, ;)
> > > > >
> > > > > The questions then are do we need to check anything else, and what
> > > > > environment should the callback use in the different cases, and what
> > > > > other cases might break if we change it?
> > > > >
> > > > > For example, should the fs namespace also be checked for all of these
> > > > > cases, since we're executing a callback, or is whatever that's set to in
> > > > > the container always what's required for locating the executable.
> > > >
> > > > What would be the disadvantage of setting UMH_USE_NS unconditionally
> > > > here?
> > >
> > > In the nfs idmapping case, the mapping is per-nfs_client.
> > >
> > > Can nfs_idmap_new be the one that calls umh_get_init_task, with the
> > > corresponding put done in nfs_idmap_delete, or is there some reason that
> > > doesn't work?
> >
> > It's confusing sorting out possible use cases, but I think both of these
> > are reasonable:
> >
> > - mount an nfs filesystem from the host, then spawn containers
> > that all use that filesystem.
> > - mount an nfs filesystem from within a container.
> >
> > Your approach might work for the second, but in the first case we'll end
> > up with idmappers from multiple containers all trying to do the
> > idmapping for the shared filesystem.
>
> These patches are examples for context.
>
> Working out whether to run in a namespace or not was always going to be
> difficult, specifically for the case you point out. Maybe we can make
> use of some other information, namespace information in the super block
> perhaps, or something else, or perhaps we will need to add some
> information for this, not sure yet. We'll need to work together on that.
>
> TBH, I'm not that focused on the use cases because the base
> implementation is still undergoing significant change although I believe
> the use of a flag to request namespace execution is a good approach.
> That probably won't change.

The flag requests that we use the container of the currently executing
task. In neither the nfs idmapper nor the nfsd state-recovery case is
that the correct choice.

--b.

2015-02-19 03:19:18

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Wed, 2015-02-18 at 20:31 -0500, J. Bruce Fields wrote:
> On Thu, Feb 19, 2015 at 08:39:01AM +0800, Ian Kent wrote:
> > On Wed, 2015-02-18 at 15:59 -0500, J. Bruce Fields wrote:
> > > On Wed, Feb 18, 2015 at 12:31:32PM -0500, J. Bruce Fields wrote:
> > > > On Wed, Feb 18, 2015 at 12:06:20PM -0500, J. Bruce Fields wrote:
> > > > > On Fri, Feb 06, 2015 at 09:47:25AM +0800, Ian Kent wrote:
> > > > > > On Thu, 2015-02-05 at 15:14 +0000, David Howells wrote:
> > > > > > >
> > > > > > > > + /* If running within a container use the container namespace */
> > > > > > > > + if (current->nsproxy->net_ns != &init_net)
> > > > > > >
> > > > > > > Is that a viable check? Is it possible to have a container that shares
> > > > > > > networking details?
> > > > > >
> > > > > > That's up for discussion.
> > > > > >
> > > > > > I thought about it and concluded that the check is probably not
> > > > > > sufficient for any of the cases.
> > > > > >
> > > > > > I left it like that because I'm not sure exactly what the use cases are,
> > > > > > hoping it promote discussion and here we are.
> > > > > >
> > > > > > I also think the current container environments don't share net
> > > > > > namespace with the root init net namspace, necessarily, because thy are
> > > > > > containers, ;)
> > > > > >
> > > > > > TBH I haven't looked at the user space container creation code but I
> > > > > > expect it could be done that way if it was needed, so the answer is yes
> > > > > > and no, ;)
> > > > > >
> > > > > > The questions then are do we need to check anything else, and what
> > > > > > environment should the callback use in the different cases, and what
> > > > > > other cases might break if we change it?
> > > > > >
> > > > > > For example, should the fs namespace also be checked for all of these
> > > > > > cases, since we're executing a callback, or is whatever that's set to in
> > > > > > the container always what's required for locating the executable.
> > > > >
> > > > > What would be the disadvantage of setting UMH_USE_NS unconditionally
> > > > > here?
> > > >
> > > > In the nfs idmapping case, the mapping is per-nfs_client.
> > > >
> > > > Can nfs_idmap_new be the one that calls umh_get_init_task, with the
> > > > corresponding put done in nfs_idmap_delete, or is there some reason that
> > > > doesn't work?
> > >
> > > It's confusing sorting out possible use cases, but I think both of these
> > > are reasonable:
> > >
> > > - mount an nfs filesystem from the host, then spawn containers
> > > that all use that filesystem.
> > > - mount an nfs filesystem from within a container.
> > >
> > > Your approach might work for the second, but in the first case we'll end
> > > up with idmappers from multiple containers all trying to do the
> > > idmapping for the shared filesystem.
> >
> > These patches are examples for context.
> >
> > Working out whether to run in a namespace or not was always going to be
> > difficult, specifically for the case you point out. Maybe we can make
> > use of some other information, namespace information in the super block
> > perhaps, or something else, or perhaps we will need to add some
> > information for this, not sure yet. We'll need to work together on that.
> >
> > TBH, I'm not that focused on the use cases because the base
> > implementation is still undergoing significant change although I believe
> > the use of a flag to request namespace execution is a good approach.
> > That probably won't change.
>
> The flag requests that we use the container of the currently executing
> task. In neither the nfs idmapper nor the nfsd state-recovery case is
> that the correct choice.

OK, I'll drop those patches then.

Ian

2015-02-20 09:33:43

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Wed, 2015-02-18 at 20:31 -0500, J. Bruce Fields wrote:
> On Thu, Feb 19, 2015 at 08:39:01AM +0800, Ian Kent wrote:
> > On Wed, 2015-02-18 at 15:59 -0500, J. Bruce Fields wrote:
> > > On Wed, Feb 18, 2015 at 12:31:32PM -0500, J. Bruce Fields wrote:
> > > > On Wed, Feb 18, 2015 at 12:06:20PM -0500, J. Bruce Fields wrote:
> > > > > On Fri, Feb 06, 2015 at 09:47:25AM +0800, Ian Kent wrote:
> > > > > > On Thu, 2015-02-05 at 15:14 +0000, David Howells wrote:
> > > > > > >
> > > > > > > > + /* If running within a container use the container namespace */
> > > > > > > > + if (current->nsproxy->net_ns != &init_net)
> > > > > > >
> > > > > > > Is that a viable check? Is it possible to have a container that shares
> > > > > > > networking details?
> > > > > >
> > > > > > That's up for discussion.
> > > > > >
> > > > > > I thought about it and concluded that the check is probably not
> > > > > > sufficient for any of the cases.
> > > > > >
> > > > > > I left it like that because I'm not sure exactly what the use cases are,
> > > > > > hoping it promote discussion and here we are.
> > > > > >
> > > > > > I also think the current container environments don't share net
> > > > > > namespace with the root init net namspace, necessarily, because thy are
> > > > > > containers, ;)
> > > > > >
> > > > > > TBH I haven't looked at the user space container creation code but I
> > > > > > expect it could be done that way if it was needed, so the answer is yes
> > > > > > and no, ;)
> > > > > >
> > > > > > The questions then are do we need to check anything else, and what
> > > > > > environment should the callback use in the different cases, and what
> > > > > > other cases might break if we change it?
> > > > > >
> > > > > > For example, should the fs namespace also be checked for all of these
> > > > > > cases, since we're executing a callback, or is whatever that's set to in
> > > > > > the container always what's required for locating the executable.
> > > > >
> > > > > What would be the disadvantage of setting UMH_USE_NS unconditionally
> > > > > here?
> > > >
> > > > In the nfs idmapping case, the mapping is per-nfs_client.
> > > >
> > > > Can nfs_idmap_new be the one that calls umh_get_init_task, with the
> > > > corresponding put done in nfs_idmap_delete, or is there some reason that
> > > > doesn't work?
> > >
> > > It's confusing sorting out possible use cases, but I think both of these
> > > are reasonable:
> > >
> > > - mount an nfs filesystem from the host, then spawn containers
> > > that all use that filesystem.
> > > - mount an nfs filesystem from within a container.
> > >
> > > Your approach might work for the second, but in the first case we'll end
> > > up with idmappers from multiple containers all trying to do the
> > > idmapping for the shared filesystem.
> >
> > These patches are examples for context.
> >
> > Working out whether to run in a namespace or not was always going to be
> > difficult, specifically for the case you point out. Maybe we can make
> > use of some other information, namespace information in the super block
> > perhaps, or something else, or perhaps we will need to add some
> > information for this, not sure yet. We'll need to work together on that.
> >
> > TBH, I'm not that focused on the use cases because the base
> > implementation is still undergoing significant change although I believe
> > the use of a flag to request namespace execution is a good approach.
> > That probably won't change.
>
> The flag requests that we use the container of the currently executing
> task. In neither the nfs idmapper nor the nfsd state-recovery case is
> that the correct choice.

There's a bit more to do but I've done the changes arising from Oleg's
comments so I've turned my attention to the recent discussion here.

Please understand that I added the example usage patches largely to get
this sort of comment because I was pretty sure there would be more to it
than what was in the patches.

I still don't fully understand the requirements so lets talk about them
and to start with the former case above.

I think your right that something needs to be saved in order to be able
to make a decision at access time in a container. That might be as
simple as working out whether the nn->umh_flags needs UMH_USE_NS at a
different time, say at mount time. Taking a reference to something like
the net namespace probably isn't necessary.

But I don't understand the relationship between the net namespace (eg.
is there a dependency on rpc requests and what the helper needs to do),
the user in a container, and the idmapper operation. Can you help with
that please.

The case of nfsd state-recovery might be similar but you'll need to help
me out a bit with that too.

Ian

2015-02-20 17:26:01

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:
> On Wed, 2015-02-18 at 20:31 -0500, J. Bruce Fields wrote:
> > On Thu, Feb 19, 2015 at 08:39:01AM +0800, Ian Kent wrote:
> > > On Wed, 2015-02-18 at 15:59 -0500, J. Bruce Fields wrote:
> > > > On Wed, Feb 18, 2015 at 12:31:32PM -0500, J. Bruce Fields wrote:
> > > > > On Wed, Feb 18, 2015 at 12:06:20PM -0500, J. Bruce Fields wrote:
> > > > > > On Fri, Feb 06, 2015 at 09:47:25AM +0800, Ian Kent wrote:
> > > > > > > On Thu, 2015-02-05 at 15:14 +0000, David Howells wrote:
> > > > > > > >
> > > > > > > > > + /* If running within a container use the container namespace */
> > > > > > > > > + if (current->nsproxy->net_ns != &init_net)
> > > > > > > >
> > > > > > > > Is that a viable check? Is it possible to have a container that shares
> > > > > > > > networking details?
> > > > > > >
> > > > > > > That's up for discussion.
> > > > > > >
> > > > > > > I thought about it and concluded that the check is probably not
> > > > > > > sufficient for any of the cases.
> > > > > > >
> > > > > > > I left it like that because I'm not sure exactly what the use cases are,
> > > > > > > hoping it promote discussion and here we are.
> > > > > > >
> > > > > > > I also think the current container environments don't share net
> > > > > > > namespace with the root init net namspace, necessarily, because thy are
> > > > > > > containers, ;)
> > > > > > >
> > > > > > > TBH I haven't looked at the user space container creation code but I
> > > > > > > expect it could be done that way if it was needed, so the answer is yes
> > > > > > > and no, ;)
> > > > > > >
> > > > > > > The questions then are do we need to check anything else, and what
> > > > > > > environment should the callback use in the different cases, and what
> > > > > > > other cases might break if we change it?
> > > > > > >
> > > > > > > For example, should the fs namespace also be checked for all of these
> > > > > > > cases, since we're executing a callback, or is whatever that's set to in
> > > > > > > the container always what's required for locating the executable.
> > > > > >
> > > > > > What would be the disadvantage of setting UMH_USE_NS unconditionally
> > > > > > here?
> > > > >
> > > > > In the nfs idmapping case, the mapping is per-nfs_client.
> > > > >
> > > > > Can nfs_idmap_new be the one that calls umh_get_init_task, with the
> > > > > corresponding put done in nfs_idmap_delete, or is there some reason that
> > > > > doesn't work?
> > > >
> > > > It's confusing sorting out possible use cases, but I think both of these
> > > > are reasonable:
> > > >
> > > > - mount an nfs filesystem from the host, then spawn containers
> > > > that all use that filesystem.
> > > > - mount an nfs filesystem from within a container.
> > > >
> > > > Your approach might work for the second, but in the first case we'll end
> > > > up with idmappers from multiple containers all trying to do the
> > > > idmapping for the shared filesystem.
> > >
> > > These patches are examples for context.
> > >
> > > Working out whether to run in a namespace or not was always going to be
> > > difficult, specifically for the case you point out. Maybe we can make
> > > use of some other information, namespace information in the super block
> > > perhaps, or something else, or perhaps we will need to add some
> > > information for this, not sure yet. We'll need to work together on that.
> > >
> > > TBH, I'm not that focused on the use cases because the base
> > > implementation is still undergoing significant change although I believe
> > > the use of a flag to request namespace execution is a good approach.
> > > That probably won't change.
> >
> > The flag requests that we use the container of the currently executing
> > task. In neither the nfs idmapper nor the nfsd state-recovery case is
> > that the correct choice.
>
> There's a bit more to do but I've done the changes arising from Oleg's
> comments so I've turned my attention to the recent discussion here.

Happy to take a look at the revised series and take a shot at the nfs
cases if it would help.

> Please understand that I added the example usage patches largely to get
> this sort of comment because I was pretty sure there would be more to it
> than what was in the patches.

Understood.

> I still don't fully understand the requirements so lets talk about them
> and to start with the former case above.
>
> I think your right that something needs to be saved in order to be able
> to make a decision at access time in a container. That might be as
> simple as working out whether the nn->umh_flags needs UMH_USE_NS at a
> different time, say at mount time. Taking a reference to something like
> the net namespace probably isn't necessary.
>
> But I don't understand the relationship between the net namespace (eg.
> is there a dependency on rpc requests and what the helper needs to do),
> the user in a container, and the idmapper operation. Can you help with
> that please.

Take the case where you mount an nfs filesystem and then spawn
containers that use it.

The on-the-wire nfs protocol uses names like user@domain to represent
file owners. The kernel of course needs an integer it can stick in
i_uid. So the upcall here is to ask userspace how to map the
on-the-wire name to a local uid.

The result is stored in inodes that will be shared between all the
containers. So the mapping has to be consistent across all of them.

So, if some user in a container does a stat() on this filesystem, we
can't do the mapping in that user's container. We have to use some
context that's the same for every user of the filesystem.

So, as I said, maybe:

- call umh_get_init_task from nfs_idmap_new, store the result in
a new idmap->idmap_init_task
- pass idmap_init_task to some request_key variant somehow in
nfs_idmap_request_key.
- clean up in nfs_idmap_delete

nfs_idmap_new is called at mount time, so you'll be getting the init
task of the container that the mount was originally done in. Sounds
like the least surprising result to me from the user's point of view.

I don't know if the task is actually the right thing to take the
reference on, or if it's OK to hold that reference for the life of the
filesystem. I think the important point is that the request_key upcalls
need to all happen consistently in one context for a given filesystem,
and that context should probably be derived from the mount's somehow.

> The case of nfsd state-recovery might be similar but you'll need to help
> me out a bit with that too.

Each network namespace can have its own virtual nfs server. Servers can
be started and stopped independently per network namespace. We decide
which server should handle an incoming rpc by looking at the network
namespace associated with the socket that it arrived over.

A server is started by the rpc.nfsd command writing a value into a magic
file somewhere. Part of the code that handles that writes ends up
calling nfsd4_umh_cltrack_create, which is our chance to capture
anything from the rpc.nfsd process. So grabbing the right init task
there might be the thing to do?

The actual call_usermode_helper calls happen later as part of processing
rpc's. At that point you're running in a kernel thread and I assume
umh_get_init_task is always just going to return the global init.

I think you'd need to add an nfsd4_umh_track_ops.exit to do the cleanup
on nfsd shutdown.

Anyway, the important point again is that the process calling
usermode_helper call doesn't provide useful context. The upcalls need
to happen consistently in one context for a given virtual nfs server,
and that context should probably be derived from rpc.nfsd's somehow.

--b.

2015-02-20 18:10:43

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

"J. Bruce Fields" <[email protected]> writes:

> On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:

>> The case of nfsd state-recovery might be similar but you'll need to help
>> me out a bit with that too.
>
> Each network namespace can have its own virtual nfs server. Servers can
> be started and stopped independently per network namespace. We decide
> which server should handle an incoming rpc by looking at the network
> namespace associated with the socket that it arrived over.
>
> A server is started by the rpc.nfsd command writing a value into a magic
> file somewhere.

nit. Unless I am completely turned around that file is on the nfsd
filesystem, that lives in fs/nfsd/nfs.c.

So I bevelive this really is a case of figuring out what we want the
semantics to be for mount and propogating the information down from
mount to where we call the user mode helpers.

Eric

2015-02-20 18:58:21

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Fri, 20 Feb 2015 12:07:15 -0600
[email protected] (Eric W. Biederman) wrote:

> "J. Bruce Fields" <[email protected]> writes:
>
> > On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:
>
> >> The case of nfsd state-recovery might be similar but you'll need to help
> >> me out a bit with that too.
> >
> > Each network namespace can have its own virtual nfs server. Servers can
> > be started and stopped independently per network namespace. We decide
> > which server should handle an incoming rpc by looking at the network
> > namespace associated with the socket that it arrived over.
> >
> > A server is started by the rpc.nfsd command writing a value into a magic
> > file somewhere.
>
> nit. Unless I am completely turned around that file is on the nfsd
> filesystem, that lives in fs/nfsd/nfs.c.
>

Correct.

> So I bevelive this really is a case of figuring out what we want the
> semantics to be for mount and propogating the information down from
> mount to where we call the user mode helpers.
>

Hmmm. I'm a little confused here. Are you saying that the namespace for
nfsd's upcalls umh ought to be derived from the process that did the
initial mount of /proc/fs/nfsd ?

--
Jeff Layton <[email protected]>

2015-02-20 19:05:49

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Fri, Feb 20, 2015 at 12:07:15PM -0600, Eric W. Biederman wrote:
> "J. Bruce Fields" <[email protected]> writes:
>
> > On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:
>
> >> The case of nfsd state-recovery might be similar but you'll need to help
> >> me out a bit with that too.
> >
> > Each network namespace can have its own virtual nfs server. Servers can
> > be started and stopped independently per network namespace. We decide
> > which server should handle an incoming rpc by looking at the network
> > namespace associated with the socket that it arrived over.
> >
> > A server is started by the rpc.nfsd command writing a value into a magic
> > file somewhere.
>
> nit. Unless I am completely turned around that file is on the nfsd
> filesystem, that lives in fs/nfsd/nfs.c.
>
> So I bevelive this really is a case of figuring out what we want the
> semantics to be for mount and propogating the information down from
> mount to where we call the user mode helpers.

Oops, I agree. So when I said:

The upcalls need to happen consistently in one context for a
given virtual nfs server, and that context should probably be
derived from rpc.nfsd's somehow.

Instead of "rpc.nfsd's", I think I should have said "the mounter of
the nfsd filesystem".

Which is already how we choose a net namespace: nfsd_mount and
nfsd_fill_super store the current net namespace in s_fs_info. (And then
grep for "netns" to see the places where that's used.)

--b.

2015-02-21 03:59:16

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Fri, 2015-02-20 at 14:05 -0500, J. Bruce Fields wrote:
> On Fri, Feb 20, 2015 at 12:07:15PM -0600, Eric W. Biederman wrote:
> > "J. Bruce Fields" <[email protected]> writes:
> >
> > > On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:
> >
> > >> The case of nfsd state-recovery might be similar but you'll need to help
> > >> me out a bit with that too.
> > >
> > > Each network namespace can have its own virtual nfs server. Servers can
> > > be started and stopped independently per network namespace. We decide
> > > which server should handle an incoming rpc by looking at the network
> > > namespace associated with the socket that it arrived over.
> > >
> > > A server is started by the rpc.nfsd command writing a value into a magic
> > > file somewhere.
> >
> > nit. Unless I am completely turned around that file is on the nfsd
> > filesystem, that lives in fs/nfsd/nfs.c.
> >
> > So I bevelive this really is a case of figuring out what we want the
> > semantics to be for mount and propogating the information down from
> > mount to where we call the user mode helpers.
>
> Oops, I agree. So when I said:
>
> The upcalls need to happen consistently in one context for a
> given virtual nfs server, and that context should probably be
> derived from rpc.nfsd's somehow.
>
> Instead of "rpc.nfsd's", I think I should have said "the mounter of
> the nfsd filesystem".
>
> Which is already how we choose a net namespace: nfsd_mount and
> nfsd_fill_super store the current net namespace in s_fs_info. (And then
> grep for "netns" to see the places where that's used.)

This is going to be mostly a restatement of what's already been said,
partly for me to refer back to later and partly to clarify and confirm
what I need to do, so prepare to be bored.

As a result of Oleg's recommendations and comments, the next version of
the series will take a reference to an nsproxy and a user namespace
(from the init process of the calling task, while it's still a child of
that task), it won't carry around task structs. There are still a couple
of questions with this so it's not quite there yet.

We'll have to wait and see if what I've done is enough to remedy Oleg's
concerns too. LOL, and then there's the question of how much I'll need
to do to get it to actually work.

The other difference is obtaining the context (now nsproxy and user
namspace) has been taken entirely within the usermode helper. I think
that's a good thing from the calling process isolation requirement. That
may need to change again based on the discussion here.

Now we're starting to look at actual usage it's worth keeping in mind
that how to execute within required namespaces has to be sound before we
tackle use cases that have requirements over this fundamental
functionality.

There are a couple of things to think about.

One thing that's needed is how to work out if the UMH_USE_NS is needed
and another is how to provide provide persistent usage of particular
namespaces across containers. The later probably will relate to the
origin of the file system (which looks like it will be identified at
mount time).

The first case is when the mount originates in the root init namespace
and most of the time (if not all the time) the UMH_USE_NS doesn't need
to be set and the helper should run in the root init namspace. That
should work for mount propagation as well with mounts bound into a
container.

Is this also true for automounted mounts at mount point crossing? Or
perhaps I should ask, should automounted NFS mounts inherit the property
from their parent mount?

The second case is when the mount originates in another namespace,
possibly a container. TBH I haven't thought too much about mounts that
originate from namespaces created by unshare(1) or other source yet. I'm
hoping that will just work once this is done, ;)

The last time I tried binding NFS mounts from one container into another
it didn't work, but if we assume that will work at some point then, as
Bruce points out, we need to provide the ability to record the
namespaces to be used for subsequent "in namespace" execution while
maintaining caller isolation (ie. derived from the callers init
process).

I've been aware of the need for persistence for a while now and I've
been thinking about how to do it but I don't have a clear plan quite
yet. Bruce, having noticed this, has described details about the
environment I have to work with so that's a start. I need the thoughts
of others on this too.

As a result I'm not sure yet if this persistence can be integrated into
the current implementation or if additional calls will be needed to set
and clear the namespace information while maintaining the needed
isolation.

As Bruce says, perhaps the namespace information should be saved as
properties of a mount or perhaps it should be a list keyed by some
handle, the handle being the saved property. I'm not sure yet but the
later might be unnecessary complication and overhead. The cleanup of the
namespace information upon summary termination of processes could be a
bit difficult, but perhaps it will be as simple as making it a function
of freeing of the object it's stored in (in the cases we have so far
that would be the mount).

So, yes, I've still got a fair way to go yet, ;)

Ian

2015-02-23 14:52:40

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Sat, Feb 21, 2015 at 11:58:58AM +0800, Ian Kent wrote:
> On Fri, 2015-02-20 at 14:05 -0500, J. Bruce Fields wrote:
> > On Fri, Feb 20, 2015 at 12:07:15PM -0600, Eric W. Biederman wrote:
> > > "J. Bruce Fields" <[email protected]> writes:
> > >
> > > > On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:
> > >
> > > >> The case of nfsd state-recovery might be similar but you'll need to help
> > > >> me out a bit with that too.
> > > >
> > > > Each network namespace can have its own virtual nfs server. Servers can
> > > > be started and stopped independently per network namespace. We decide
> > > > which server should handle an incoming rpc by looking at the network
> > > > namespace associated with the socket that it arrived over.
> > > >
> > > > A server is started by the rpc.nfsd command writing a value into a magic
> > > > file somewhere.
> > >
> > > nit. Unless I am completely turned around that file is on the nfsd
> > > filesystem, that lives in fs/nfsd/nfs.c.
> > >
> > > So I bevelive this really is a case of figuring out what we want the
> > > semantics to be for mount and propogating the information down from
> > > mount to where we call the user mode helpers.
> >
> > Oops, I agree. So when I said:
> >
> > The upcalls need to happen consistently in one context for a
> > given virtual nfs server, and that context should probably be
> > derived from rpc.nfsd's somehow.
> >
> > Instead of "rpc.nfsd's", I think I should have said "the mounter of
> > the nfsd filesystem".
> >
> > Which is already how we choose a net namespace: nfsd_mount and
> > nfsd_fill_super store the current net namespace in s_fs_info. (And then
> > grep for "netns" to see the places where that's used.)
>
> This is going to be mostly a restatement of what's already been said,
> partly for me to refer back to later and partly to clarify and confirm
> what I need to do, so prepare to be bored.
>
> As a result of Oleg's recommendations and comments, the next version of
> the series will take a reference to an nsproxy and a user namespace
> (from the init process of the calling task, while it's still a child of
> that task), it won't carry around task structs. There are still a couple
> of questions with this so it's not quite there yet.
>
> We'll have to wait and see if what I've done is enough to remedy Oleg's
> concerns too. LOL, and then there's the question of how much I'll need
> to do to get it to actually work.
>
> The other difference is obtaining the context (now nsproxy and user
> namspace) has been taken entirely within the usermode helper. I think
> that's a good thing from the calling process isolation requirement. That
> may need to change again based on the discussion here.
>
> Now we're starting to look at actual usage it's worth keeping in mind
> that how to execute within required namespaces has to be sound before we
> tackle use cases that have requirements over this fundamental
> functionality.
>
> There are a couple of things to think about.
>
> One thing that's needed is how to work out if the UMH_USE_NS is needed
> and another is how to provide provide persistent usage of particular
> namespaces across containers. The later probably will relate to the
> origin of the file system (which looks like it will be identified at
> mount time).
>
> The first case is when the mount originates in the root init namespace
> and most of the time (if not all the time) the UMH_USE_NS doesn't need
> to be set and the helper should run in the root init namspace.

The helper always runs in the original mount's container. Sometimes
that container is the init container, yes, but I don't see what value
there is in setting a flag in that one case.

> That
> should work for mount propagation as well with mounts bound into a
> container.
>
> Is this also true for automounted mounts at mount point crossing? Or
> perhaps I should ask, should automounted NFS mounts inherit the property
> from their parent mount?

Yes. If we run separate helpers in each container, then the superblocks
should also be separate (so that one container can't poison cached
values used by another). So the containers would all end up with
entirely separate superblocks for the submounts.

That seems inefficient at least, and I don't think it's what an admin
would expect as the default behavior.

> The second case is when the mount originates in another namespace,
> possibly a container. TBH I haven't thought too much about mounts that
> originate from namespaces created by unshare(1) or other source yet. I'm
> hoping that will just work once this is done, ;)

So, one container mounts and spawns a "subcontainer" which continues to
use that filesystem? Yes, I think helpers should continue to run in the
container of the original mount, I don't see any tricky exception here.

> The last time I tried binding NFS mounts from one container into another
> it didn't work,

I'm not sure what you mean by "binding NFS mounts from one container
into another". What exactly didn't work?

--b.

> but if we assume that will work at some point then, as
> Bruce points out, we need to provide the ability to record the
> namespaces to be used for subsequent "in namespace" execution while
> maintaining caller isolation (ie. derived from the callers init
> process).
>
> I've been aware of the need for persistence for a while now and I've
> been thinking about how to do it but I don't have a clear plan quite
> yet. Bruce, having noticed this, has described details about the
> environment I have to work with so that's a start. I need the thoughts
> of others on this too.
>
> As a result I'm not sure yet if this persistence can be integrated into
> the current implementation or if additional calls will be needed to set
> and clear the namespace information while maintaining the needed
> isolation.
>
> As Bruce says, perhaps the namespace information should be saved as
> properties of a mount or perhaps it should be a list keyed by some
> handle, the handle being the saved property. I'm not sure yet but the
> later might be unnecessary complication and overhead. The cleanup of the
> namespace information upon summary termination of processes could be a
> bit difficult, but perhaps it will be as simple as making it a function
> of freeing of the object it's stored in (in the cases we have so far
> that would be the mount).
>
> So, yes, I've still got a fair way to go yet, ;)
>
> Ian

2015-02-24 00:50:45

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Mon, 2015-02-23 at 09:52 -0500, J. Bruce Fields wrote:
> On Sat, Feb 21, 2015 at 11:58:58AM +0800, Ian Kent wrote:
> > On Fri, 2015-02-20 at 14:05 -0500, J. Bruce Fields wrote:
> > > On Fri, Feb 20, 2015 at 12:07:15PM -0600, Eric W. Biederman wrote:
> > > > "J. Bruce Fields" <[email protected]> writes:
> > > >
> > > > > On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:
> > > >
> > > > >> The case of nfsd state-recovery might be similar but you'll need to help
> > > > >> me out a bit with that too.
> > > > >
> > > > > Each network namespace can have its own virtual nfs server. Servers can
> > > > > be started and stopped independently per network namespace. We decide
> > > > > which server should handle an incoming rpc by looking at the network
> > > > > namespace associated with the socket that it arrived over.
> > > > >
> > > > > A server is started by the rpc.nfsd command writing a value into a magic
> > > > > file somewhere.
> > > >
> > > > nit. Unless I am completely turned around that file is on the nfsd
> > > > filesystem, that lives in fs/nfsd/nfs.c.
> > > >
> > > > So I bevelive this really is a case of figuring out what we want the
> > > > semantics to be for mount and propogating the information down from
> > > > mount to where we call the user mode helpers.
> > >
> > > Oops, I agree. So when I said:
> > >
> > > The upcalls need to happen consistently in one context for a
> > > given virtual nfs server, and that context should probably be
> > > derived from rpc.nfsd's somehow.
> > >
> > > Instead of "rpc.nfsd's", I think I should have said "the mounter of
> > > the nfsd filesystem".
> > >
> > > Which is already how we choose a net namespace: nfsd_mount and
> > > nfsd_fill_super store the current net namespace in s_fs_info. (And then
> > > grep for "netns" to see the places where that's used.)
> >
> > This is going to be mostly a restatement of what's already been said,
> > partly for me to refer back to later and partly to clarify and confirm
> > what I need to do, so prepare to be bored.
> >
> > As a result of Oleg's recommendations and comments, the next version of
> > the series will take a reference to an nsproxy and a user namespace
> > (from the init process of the calling task, while it's still a child of
> > that task), it won't carry around task structs. There are still a couple
> > of questions with this so it's not quite there yet.
> >
> > We'll have to wait and see if what I've done is enough to remedy Oleg's
> > concerns too. LOL, and then there's the question of how much I'll need
> > to do to get it to actually work.
> >
> > The other difference is obtaining the context (now nsproxy and user
> > namspace) has been taken entirely within the usermode helper. I think
> > that's a good thing from the calling process isolation requirement. That
> > may need to change again based on the discussion here.
> >
> > Now we're starting to look at actual usage it's worth keeping in mind
> > that how to execute within required namespaces has to be sound before we
> > tackle use cases that have requirements over this fundamental
> > functionality.
> >
> > There are a couple of things to think about.
> >
> > One thing that's needed is how to work out if the UMH_USE_NS is needed
> > and another is how to provide provide persistent usage of particular
> > namespaces across containers. The later probably will relate to the
> > origin of the file system (which looks like it will be identified at
> > mount time).
> >
> > The first case is when the mount originates in the root init namespace
> > and most of the time (if not all the time) the UMH_USE_NS doesn't need
> > to be set and the helper should run in the root init namspace.
>
> The helper always runs in the original mount's container. Sometimes
> that container is the init container, yes, but I don't see what value
> there is in setting a flag in that one case.

Yep. that's pretty much what I meant.

>
> > That
> > should work for mount propagation as well with mounts bound into a
> > container.
> >
> > Is this also true for automounted mounts at mount point crossing? Or
> > perhaps I should ask, should automounted NFS mounts inherit the property
> > from their parent mount?
>
> Yes. If we run separate helpers in each container, then the superblocks
> should also be separate (so that one container can't poison cached
> values used by another). So the containers would all end up with
> entirely separate superblocks for the submounts.

That's almost what I was thinking.

The question relates to a mount for which the namespace proxy would have
been set at mount time in a container and then bound into another
container (in Docker by using the "--volumes-from <name>"). I believe
the namespace information from the original mount should always be used
when calling a usermode helper. This might not be a sensible question
now but I think it needs to be considered.

>
> That seems inefficient at least, and I don't think it's what an admin
> would expect as the default behavior.

LOL, but the best way to manage this is to set the namespace information
at mount time (as Eric mentioned long ago) and use that everywhere. It's
consistent and it provides a way for a process with appropriate
privilege to specify the namespace information.

>
> > The second case is when the mount originates in another namespace,
> > possibly a container. TBH I haven't thought too much about mounts that
> > originate from namespaces created by unshare(1) or other source yet. I'm
> > hoping that will just work once this is done, ;)
>
> So, one container mounts and spawns a "subcontainer" which continues to
> use that filesystem? Yes, I think helpers should continue to run in the
> container of the original mount, I don't see any tricky exception here.

That's what I think should happen too.

>
> > The last time I tried binding NFS mounts from one container into another
> > it didn't work,
>
> I'm not sure what you mean by "binding NFS mounts from one container
> into another". What exactly didn't work?

It's the volumes-from Docker option I'm thinking of.
I'm not sure now if my statement is accurate. I'll need to test it
again. I thought I had but what didn't work with the volumes-from might
have been autofs not NFS mounts.

Anyway, I'm going to need to provide a way for clients to say "calculate
the namespace information and give me an identifier so it can be used
everywhere for this mount" which amounts to maintaining a list of the
namespace objects.

I'm not sure yet if I should undo some of what I've done recently or
leave it for users who need a straight "execute me now within the
current namespace".

>
> --b.
>
> > but if we assume that will work at some point then, as
> > Bruce points out, we need to provide the ability to record the
> > namespaces to be used for subsequent "in namespace" execution while
> > maintaining caller isolation (ie. derived from the callers init
> > process).
> >
> > I've been aware of the need for persistence for a while now and I've
> > been thinking about how to do it but I don't have a clear plan quite
> > yet. Bruce, having noticed this, has described details about the
> > environment I have to work with so that's a start. I need the thoughts
> > of others on this too.
> >
> > As a result I'm not sure yet if this persistence can be integrated into
> > the current implementation or if additional calls will be needed to set
> > and clear the namespace information while maintaining the needed
> > isolation.
> >
> > As Bruce says, perhaps the namespace information should be saved as
> > properties of a mount or perhaps it should be a list keyed by some
> > handle, the handle being the saved property. I'm not sure yet but the
> > later might be unnecessary complication and overhead. The cleanup of the
> > namespace information upon summary termination of processes could be a
> > bit difficult, but perhaps it will be as simple as making it a function
> > of freeing of the object it's stored in (in the cases we have so far
> > that would be the mount).
> >
> > So, yes, I've still got a fair way to go yet, ;)
> >
> > Ian

2015-02-24 01:22:17

by Benjamin Coddington

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Tue, 24 Feb 2015, Ian Kent wrote:

> On Mon, 2015-02-23 at 09:52 -0500, J. Bruce Fields wrote:
> > On Sat, Feb 21, 2015 at 11:58:58AM +0800, Ian Kent wrote:
> > > On Fri, 2015-02-20 at 14:05 -0500, J. Bruce Fields wrote:
> > > > On Fri, Feb 20, 2015 at 12:07:15PM -0600, Eric W. Biederman wrote:
> > > > > "J. Bruce Fields" <[email protected]> writes:
> > > > >
> > > > > > On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:
> > > > >
> > > > > >> The case of nfsd state-recovery might be similar but you'll need to help
> > > > > >> me out a bit with that too.
> > > > > >
> > > > > > Each network namespace can have its own virtual nfs server. Servers can
> > > > > > be started and stopped independently per network namespace. We decide
> > > > > > which server should handle an incoming rpc by looking at the network
> > > > > > namespace associated with the socket that it arrived over.
> > > > > >
> > > > > > A server is started by the rpc.nfsd command writing a value into a magic
> > > > > > file somewhere.
> > > > >
> > > > > nit. Unless I am completely turned around that file is on the nfsd
> > > > > filesystem, that lives in fs/nfsd/nfs.c.
> > > > >
> > > > > So I bevelive this really is a case of figuring out what we want the
> > > > > semantics to be for mount and propogating the information down from
> > > > > mount to where we call the user mode helpers.
> > > >
> > > > Oops, I agree. So when I said:
> > > >
> > > > The upcalls need to happen consistently in one context for a
> > > > given virtual nfs server, and that context should probably be
> > > > derived from rpc.nfsd's somehow.
> > > >
> > > > Instead of "rpc.nfsd's", I think I should have said "the mounter of
> > > > the nfsd filesystem".
> > > >
> > > > Which is already how we choose a net namespace: nfsd_mount and
> > > > nfsd_fill_super store the current net namespace in s_fs_info. (And then
> > > > grep for "netns" to see the places where that's used.)
> > >
> > > This is going to be mostly a restatement of what's already been said,
> > > partly for me to refer back to later and partly to clarify and confirm
> > > what I need to do, so prepare to be bored.
> > >
> > > As a result of Oleg's recommendations and comments, the next version of
> > > the series will take a reference to an nsproxy and a user namespace
> > > (from the init process of the calling task, while it's still a child of
> > > that task), it won't carry around task structs. There are still a couple
> > > of questions with this so it's not quite there yet.
> > >
> > > We'll have to wait and see if what I've done is enough to remedy Oleg's
> > > concerns too. LOL, and then there's the question of how much I'll need
> > > to do to get it to actually work.
> > >
> > > The other difference is obtaining the context (now nsproxy and user
> > > namspace) has been taken entirely within the usermode helper. I think
> > > that's a good thing from the calling process isolation requirement. That
> > > may need to change again based on the discussion here.
> > >
> > > Now we're starting to look at actual usage it's worth keeping in mind
> > > that how to execute within required namespaces has to be sound before we
> > > tackle use cases that have requirements over this fundamental
> > > functionality.
> > >
> > > There are a couple of things to think about.
> > >
> > > One thing that's needed is how to work out if the UMH_USE_NS is needed
> > > and another is how to provide provide persistent usage of particular
> > > namespaces across containers. The later probably will relate to the
> > > origin of the file system (which looks like it will be identified at
> > > mount time).
> > >
> > > The first case is when the mount originates in the root init namespace
> > > and most of the time (if not all the time) the UMH_USE_NS doesn't need
> > > to be set and the helper should run in the root init namspace.
> >
> > The helper always runs in the original mount's container. Sometimes
> > that container is the init container, yes, but I don't see what value
> > there is in setting a flag in that one case.
>
> Yep. that's pretty much what I meant.
>
> >
> > > That
> > > should work for mount propagation as well with mounts bound into a
> > > container.
> > >
> > > Is this also true for automounted mounts at mount point crossing? Or
> > > perhaps I should ask, should automounted NFS mounts inherit the property
> > > from their parent mount?
> >
> > Yes. If we run separate helpers in each container, then the superblocks
> > should also be separate (so that one container can't poison cached
> > values used by another). So the containers would all end up with
> > entirely separate superblocks for the submounts.
>
> That's almost what I was thinking.
>
> The question relates to a mount for which the namespace proxy would have
> been set at mount time in a container and then bound into another
> container (in Docker by using the "--volumes-from <name>"). I believe
> the namespace information from the original mount should always be used
> when calling a usermode helper. This might not be a sensible question
> now but I think it needs to be considered.
>
> >
> > That seems inefficient at least, and I don't think it's what an admin
> > would expect as the default behavior.
>
> LOL, but the best way to manage this is to set the namespace information
> at mount time (as Eric mentioned long ago) and use that everywhere. It's
> consistent and it provides a way for a process with appropriate
> privilege to specify the namespace information.
>
> >
> > > The second case is when the mount originates in another namespace,
> > > possibly a container. TBH I haven't thought too much about mounts that
> > > originate from namespaces created by unshare(1) or other source yet. I'm
> > > hoping that will just work once this is done, ;)
> >
> > So, one container mounts and spawns a "subcontainer" which continues to
> > use that filesystem? Yes, I think helpers should continue to run in the
> > container of the original mount, I don't see any tricky exception here.
>
> That's what I think should happen too.
>
> >
> > > The last time I tried binding NFS mounts from one container into another
> > > it didn't work,
> >
> > I'm not sure what you mean by "binding NFS mounts from one container
> > into another". What exactly didn't work?
>
> It's the volumes-from Docker option I'm thinking of.
> I'm not sure now if my statement is accurate. I'll need to test it
> again. I thought I had but what didn't work with the volumes-from might
> have been autofs not NFS mounts.
>
> Anyway, I'm going to need to provide a way for clients to say "calculate
> the namespace information and give me an identifier so it can be used
> everywhere for this mount" which amounts to maintaining a list of the
> namespace objects.

That sounds a lot closer to some of the work I've been doing to see if I can
come up with a way to solve the "where's the namespace I need?" problem.

I agree with Greg's very early comments that the easiest way to determine
which namespace context a process should use is to keep it as a copy of
the task -- and the place that copy should be done is fork(). The
problem was where to keep that information and how to make it reusable.

I've been hacking out a keyrings-based "key-agent" service that is basically
a special type of key (like a keyring). A key_agent type roughly
corresponds to a particular type of upcall user, such as the idmapper. A
key_agent_type is registered, and that registration ties a particular
key_type to that key_agent. When a process calls request_key() for that
key_type instead of using the helper to execute /sbin/request-key the
process' keyrings are searched for a key_agent. If a key_agent isn't found,
the key_agent provider is then asked to provide an existing one based on
some rules (is there an existing key_agent running in a different namespace
that we might want to use for this purpose -- for example: is there there
one already running in the namespace where the mount occurred). If so, it
is linked to the calling process' keyrings and then used for the upcall. If
not, then the calling process itself is forked/execve-ed into a new
persistent key_agent that is installed on the calling process' keyrings just
like a key, and with the same lifetime and GC expectations of a key.

A key_agent is a user-space process waiting for a realtime signal to process a
particular key and provide the requested key information that can be
installed back onto the calling process' keyrings.

Basically, this approach allows a particular user of a keyrings-based upcall
to specify their own rules about how to provide a namespace context for a
calling process. It does, however, require extra work to create a specific
key_agent_type for each individual key_type that might want to use this
mechanism.

I've been waiting to have a bit more of a proof-of-concept before bringing
this approach into the discussion. However, it looks like it may be
important to allow particular users of the upcall their own rules about
which namespace contexts they might want to use. This approach could
provide that flexibility.

Ben

> I'm not sure yet if I should undo some of what I've done recently or
> leave it for users who need a straight "execute me now within the
> current namespace".
>
> >
> > --b.
> >
> > > but if we assume that will work at some point then, as
> > > Bruce points out, we need to provide the ability to record the
> > > namespaces to be used for subsequent "in namespace" execution while
> > > maintaining caller isolation (ie. derived from the callers init
> > > process).
> > >
> > > I've been aware of the need for persistence for a while now and I've
> > > been thinking about how to do it but I don't have a clear plan quite
> > > yet. Bruce, having noticed this, has described details about the
> > > environment I have to work with so that's a start. I need the thoughts
> > > of others on this too.
> > >
> > > As a result I'm not sure yet if this persistence can be integrated into
> > > the current implementation or if additional calls will be needed to set
> > > and clear the namespace information while maintaining the needed
> > > isolation.
> > >
> > > As Bruce says, perhaps the namespace information should be saved as
> > > properties of a mount or perhaps it should be a list keyed by some
> > > handle, the handle being the saved property. I'm not sure yet but the
> > > later might be unnecessary complication and overhead. The cleanup of the
> > > namespace information upon summary termination of processes could be a
> > > bit difficult, but perhaps it will be as simple as making it a function
> > > of freeing of the object it's stored in (in the cases we have so far
> > > that would be the mount).
> > >
> > > So, yes, I've still got a fair way to go yet, ;)
> > >
> > > Ian
>
>
>

2015-02-24 08:02:08

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Mon, 2015-02-23 at 17:22 -0800, Benjamin Coddington wrote:
> On Tue, 24 Feb 2015, Ian Kent wrote:
>
> > On Mon, 2015-02-23 at 09:52 -0500, J. Bruce Fields wrote:
> > > On Sat, Feb 21, 2015 at 11:58:58AM +0800, Ian Kent wrote:
> > > > On Fri, 2015-02-20 at 14:05 -0500, J. Bruce Fields wrote:
> > > > > On Fri, Feb 20, 2015 at 12:07:15PM -0600, Eric W. Biederman wrote:
> > > > > > "J. Bruce Fields" <[email protected]> writes:
> > > > > >
> > > > > > > On Fri, Feb 20, 2015 at 05:33:25PM +0800, Ian Kent wrote:
> > > > > >
> > > > > > >> The case of nfsd state-recovery might be similar but you'll need to help
> > > > > > >> me out a bit with that too.
> > > > > > >
> > > > > > > Each network namespace can have its own virtual nfs server. Servers can
> > > > > > > be started and stopped independently per network namespace. We decide
> > > > > > > which server should handle an incoming rpc by looking at the network
> > > > > > > namespace associated with the socket that it arrived over.
> > > > > > >
> > > > > > > A server is started by the rpc.nfsd command writing a value into a magic
> > > > > > > file somewhere.
> > > > > >
> > > > > > nit. Unless I am completely turned around that file is on the nfsd
> > > > > > filesystem, that lives in fs/nfsd/nfs.c.
> > > > > >
> > > > > > So I bevelive this really is a case of figuring out what we want the
> > > > > > semantics to be for mount and propogating the information down from
> > > > > > mount to where we call the user mode helpers.
> > > > >
> > > > > Oops, I agree. So when I said:
> > > > >
> > > > > The upcalls need to happen consistently in one context for a
> > > > > given virtual nfs server, and that context should probably be
> > > > > derived from rpc.nfsd's somehow.
> > > > >
> > > > > Instead of "rpc.nfsd's", I think I should have said "the mounter of
> > > > > the nfsd filesystem".
> > > > >
> > > > > Which is already how we choose a net namespace: nfsd_mount and
> > > > > nfsd_fill_super store the current net namespace in s_fs_info. (And then
> > > > > grep for "netns" to see the places where that's used.)
> > > >
> > > > This is going to be mostly a restatement of what's already been said,
> > > > partly for me to refer back to later and partly to clarify and confirm
> > > > what I need to do, so prepare to be bored.
> > > >
> > > > As a result of Oleg's recommendations and comments, the next version of
> > > > the series will take a reference to an nsproxy and a user namespace
> > > > (from the init process of the calling task, while it's still a child of
> > > > that task), it won't carry around task structs. There are still a couple
> > > > of questions with this so it's not quite there yet.
> > > >
> > > > We'll have to wait and see if what I've done is enough to remedy Oleg's
> > > > concerns too. LOL, and then there's the question of how much I'll need
> > > > to do to get it to actually work.
> > > >
> > > > The other difference is obtaining the context (now nsproxy and user
> > > > namspace) has been taken entirely within the usermode helper. I think
> > > > that's a good thing from the calling process isolation requirement. That
> > > > may need to change again based on the discussion here.
> > > >
> > > > Now we're starting to look at actual usage it's worth keeping in mind
> > > > that how to execute within required namespaces has to be sound before we
> > > > tackle use cases that have requirements over this fundamental
> > > > functionality.
> > > >
> > > > There are a couple of things to think about.
> > > >
> > > > One thing that's needed is how to work out if the UMH_USE_NS is needed
> > > > and another is how to provide provide persistent usage of particular
> > > > namespaces across containers. The later probably will relate to the
> > > > origin of the file system (which looks like it will be identified at
> > > > mount time).
> > > >
> > > > The first case is when the mount originates in the root init namespace
> > > > and most of the time (if not all the time) the UMH_USE_NS doesn't need
> > > > to be set and the helper should run in the root init namspace.
> > >
> > > The helper always runs in the original mount's container. Sometimes
> > > that container is the init container, yes, but I don't see what value
> > > there is in setting a flag in that one case.
> >
> > Yep. that's pretty much what I meant.
> >
> > >
> > > > That
> > > > should work for mount propagation as well with mounts bound into a
> > > > container.
> > > >
> > > > Is this also true for automounted mounts at mount point crossing? Or
> > > > perhaps I should ask, should automounted NFS mounts inherit the property
> > > > from their parent mount?
> > >
> > > Yes. If we run separate helpers in each container, then the superblocks
> > > should also be separate (so that one container can't poison cached
> > > values used by another). So the containers would all end up with
> > > entirely separate superblocks for the submounts.
> >
> > That's almost what I was thinking.
> >
> > The question relates to a mount for which the namespace proxy would have
> > been set at mount time in a container and then bound into another
> > container (in Docker by using the "--volumes-from <name>"). I believe
> > the namespace information from the original mount should always be used
> > when calling a usermode helper. This might not be a sensible question
> > now but I think it needs to be considered.
> >
> > >
> > > That seems inefficient at least, and I don't think it's what an admin
> > > would expect as the default behavior.
> >
> > LOL, but the best way to manage this is to set the namespace information
> > at mount time (as Eric mentioned long ago) and use that everywhere. It's
> > consistent and it provides a way for a process with appropriate
> > privilege to specify the namespace information.
> >
> > >
> > > > The second case is when the mount originates in another namespace,
> > > > possibly a container. TBH I haven't thought too much about mounts that
> > > > originate from namespaces created by unshare(1) or other source yet. I'm
> > > > hoping that will just work once this is done, ;)
> > >
> > > So, one container mounts and spawns a "subcontainer" which continues to
> > > use that filesystem? Yes, I think helpers should continue to run in the
> > > container of the original mount, I don't see any tricky exception here.
> >
> > That's what I think should happen too.
> >
> > >
> > > > The last time I tried binding NFS mounts from one container into another
> > > > it didn't work,
> > >
> > > I'm not sure what you mean by "binding NFS mounts from one container
> > > into another". What exactly didn't work?
> >
> > It's the volumes-from Docker option I'm thinking of.
> > I'm not sure now if my statement is accurate. I'll need to test it
> > again. I thought I had but what didn't work with the volumes-from might
> > have been autofs not NFS mounts.
> >
> > Anyway, I'm going to need to provide a way for clients to say "calculate
> > the namespace information and give me an identifier so it can be used
> > everywhere for this mount" which amounts to maintaining a list of the
> > namespace objects.
>
> That sounds a lot closer to some of the work I've been doing to see if I can
> come up with a way to solve the "where's the namespace I need?" problem.
>
> I agree with Greg's very early comments that the easiest way to determine
> which namespace context a process should use is to keep it as a copy of
> the task -- and the place that copy should be done is fork(). The
> problem was where to keep that information and how to make it reusable.
>
> I've been hacking out a keyrings-based "key-agent" service that is basically
> a special type of key (like a keyring). A key_agent type roughly
> corresponds to a particular type of upcall user, such as the idmapper. A
> key_agent_type is registered, and that registration ties a particular
> key_type to that key_agent. When a process calls request_key() for that
> key_type instead of using the helper to execute /sbin/request-key the
> process' keyrings are searched for a key_agent. If a key_agent isn't found,
> the key_agent provider is then asked to provide an existing one based on
> some rules (is there an existing key_agent running in a different namespace
> that we might want to use for this purpose -- for example: is there there
> one already running in the namespace where the mount occurred). If so, it
> is linked to the calling process' keyrings and then used for the upcall. If
> not, then the calling process itself is forked/execve-ed into a new
> persistent key_agent that is installed on the calling process' keyrings just
> like a key, and with the same lifetime and GC expectations of a key.
>
> A key_agent is a user-space process waiting for a realtime signal to process a
> particular key and provide the requested key information that can be
> installed back onto the calling process' keyrings.
>
> Basically, this approach allows a particular user of a keyrings-based upcall
> to specify their own rules about how to provide a namespace context for a
> calling process. It does, however, require extra work to create a specific
> key_agent_type for each individual key_type that might want to use this
> mechanism.
>
> I've been waiting to have a bit more of a proof-of-concept before bringing
> this approach into the discussion. However, it looks like it may be
> important to allow particular users of the upcall their own rules about
> which namespace contexts they might want to use. This approach could
> provide that flexibility.

I was wondering if what you've been doing would help.
This does sound interesting, perhaps I should wait a little before doing
much more in case it can be generalized a little and used here too.

It's likely the current limited implementation I have will also be
useful for upcalls that need a straight "just execute me in the caller
namespace", so it's probably worth continuing it for that case.

>
> Ben
>
>
>
> > I'm not sure yet if I should undo some of what I've done recently or
> > leave it for users who need a straight "execute me now within the
> > current namespace".
> >
> > >
> > > --b.
> > >
> > > > but if we assume that will work at some point then, as
> > > > Bruce points out, we need to provide the ability to record the
> > > > namespaces to be used for subsequent "in namespace" execution while
> > > > maintaining caller isolation (ie. derived from the callers init
> > > > process).
> > > >
> > > > I've been aware of the need for persistence for a while now and I've
> > > > been thinking about how to do it but I don't have a clear plan quite
> > > > yet. Bruce, having noticed this, has described details about the
> > > > environment I have to work with so that's a start. I need the thoughts
> > > > of others on this too.
> > > >
> > > > As a result I'm not sure yet if this persistence can be integrated into
> > > > the current implementation or if additional calls will be needed to set
> > > > and clear the namespace information while maintaining the needed
> > > > isolation.
> > > >
> > > > As Bruce says, perhaps the namespace information should be saved as
> > > > properties of a mount or perhaps it should be a list keyed by some
> > > > handle, the handle being the saved property. I'm not sure yet but the
> > > > later might be unnecessary complication and overhead. The cleanup of the
> > > > namespace information upon summary termination of processes could be a
> > > > bit difficult, but perhaps it will be as simple as making it a function
> > > > of freeing of the object it's stored in (in the cases we have so far
> > > > that would be the mount).
> > > >
> > > > So, yes, I've still got a fair way to go yet, ;)
> > > >
> > > > Ian
> >
> >
> >

2015-02-24 15:33:31

by J. Bruce Fields

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Mon, Feb 23, 2015 at 05:22:12PM -0800, Benjamin Coddington wrote:
> That sounds a lot closer to some of the work I've been doing to see if I can
> come up with a way to solve the "where's the namespace I need?" problem.
>
> I agree with Greg's very early comments that the easiest way to determine
> which namespace context a process should use is to keep it as a copy of
> the task -- and the place that copy should be done is fork().

So you're suggesting that the key_agent could be that copy? But:

> ... If not, then the calling process itself is forked/execve-ed into a
> new persistent key_agent that is installed on the calling process'
> keyrings just like a key, and with the same lifetime and GC
> expectations of a key.
>
> A key_agent is a user-space process...

If the key_agent can die before it's needed, then we have to keep around
some other context information to allow regenerating a new one. So what
is that piece of information? Aren't we back where we started?

--b.

2015-02-25 00:41:14

by Benjamin Coddington

[permalink] [raw]

Subject: Re: [RFC PATCH 5/8] KEYS: exec request-key within the requesting task's init namespace

On Tue, 24 Feb 2015, J. Bruce Fields wrote:

> On Mon, Feb 23, 2015 at 05:22:12PM -0800, Benjamin Coddington wrote:
> > That sounds a lot closer to some of the work I've been doing to see if I can
> > come up with a way to solve the "where's the namespace I need?" problem.
> >
> > I agree with Greg's very early comments that the easiest way to determine
> > which namespace context a process should use is to keep it as a copy of
> > the task -- and the place that copy should be done is fork().
>
> So you're suggesting that the key_agent could be that copy? But:
>
> > ... If not, then the calling process itself is forked/execve-ed into a
> > new persistent key_agent that is installed on the calling process'
> > keyrings just like a key, and with the same lifetime and GC
> > expectations of a key.
> >
> > A key_agent is a user-space process...
>
> If the key_agent can die before it's needed, then we have to keep around
> some other context information to allow regenerating a new one. So what
> is that piece of information? Aren't we back where we started?

Yes. It would seem to make sense then to keep a copy of task which would be
used to create the user-space key_agent. If that's the standard behavior,
it doubles the number of tasks for the average use case. The average use
case (I anticipate) would be to just always fork the caller if a key_agent
is not found.

It would probably make sense provide the key_agent_type the option of
binding a key_agent reference task to another structure -- like a mount.
And, if so, that reservied reference task would be used to create/re-create the
user-space key_agent. The trouble there is how to have the caller
communicate the object bound to the task.. a reference to the object could
be guessed, which would give any caller access to a that reserved task..

In order to keep further verbose speculation to a minimum, I'll say: good
point.. there may be a way to authorize the usage of a reference task by
means of the possession of an authkey. I'll keep hacking at this.

Ben