2006-03-06 23:53:23

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 0/6] support separate namespaces for sysv

This is more followup on this thread:

http://marc.theaimsgroup.com/?l=linux-kernel&m=114108773827517&w=2

The following series makes it possible to isolate different
sets of sysv structures (msg, sem, and shm) on one kernel.

The initial patch makes it much more feasible to access these
disparite sets from the traditional sysctl interfaces.


2006-03-06 23:54:12

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 6/6] sysvshm: containerize sysctls


Note that this will effectively remove the system-wide limits
on sysv shm resources that the sysctl mechanism currently
provides and move it to a per-container limit. This is
currently by design, and may be later addressed when we have
proper large-scale resource controls in the kernel.

Signed-off-by: Dave Hansen <[email protected]>
---

desc.txt | 0
work-dave/ipc/shm.c | 33 +++++++++++++++++++++++----------
work-dave/ipc/util.h | 4 ++++
work-dave/kernel/sysctl.c | 12 ++++++------
4 files changed, 33 insertions(+), 16 deletions(-)

diff -puN ipc/shm.c~sysvshm-containerize-sysctls ipc/shm.c
--- work/ipc/shm.c~sysvshm-containerize-sysctls 2006-03-06 15:42:00.000000000 -0800
+++ work-dave/ipc/shm.c 2006-03-06 15:42:00.000000000 -0800
@@ -55,12 +55,12 @@ static void shm_close (struct vm_area_st
static int sysvipc_shm_proc_show(struct seq_file *s, void *it);
#endif

-size_t shm_ctlmax = SHMMAX;
-size_t shm_ctlall = SHMALL;
-int shm_ctlmni = SHMMNI;
-
void __init shm_init (struct ipc_shm_context *context)
{
+ context->ctlmax = SHMMAX;
+ context->ctlall = SHMALL;
+ context->ctlmni = SHMMNI;
+
ipc_init_ids(&context->ids, 1);
ipc_init_proc_interface("sysvipc/shm",
" key shmid perms size cpid lpid nattch uid gid cuid cgid atime dtime ctime\n",
@@ -85,7 +85,7 @@ struct shmid_kernel *shm_rmid(struct ipc
static inline
int shm_addid(struct ipc_shm_context *context, struct shmid_kernel *shp)
{
- return ipc_addid(&context->ids, &shp->shm_perm, shm_ctlmni);
+ return ipc_addid(&context->ids, &shp->shm_perm, context->ctlmni);
}


@@ -208,10 +208,10 @@ static int newseg (struct ipc_shm_contex
char name[13];
int id;

- if (size < SHMMIN || size > shm_ctlmax)
+ if (size < SHMMIN || size > context->ctlmax)
return -EINVAL;

- if (context->tot + numpages >= shm_ctlall)
+ if (context->tot + numpages >= context->ctlall)
return -ENOSPC;

shp = ipc_rcu_alloc(sizeof(*shp));
@@ -463,9 +463,9 @@ asmlinkage long sys_shmctl (int shmid, i
return err;

memset(&shminfo,0,sizeof(shminfo));
- shminfo.shmmni = shminfo.shmseg = shm_ctlmni;
- shminfo.shmmax = shm_ctlmax;
- shminfo.shmall = shm_ctlall;
+ shminfo.shmmni = shminfo.shmseg = context->ctlmni;
+ shminfo.shmmax = context->ctlmax;
+ shminfo.shmall = context->ctlall;

shminfo.shmmin = SHMMIN;
if(copy_shminfo_to_user (buf, &shminfo, version))
@@ -935,3 +935,16 @@ static int sysvipc_shm_proc_show(struct
shp->shm_ctim);
}
#endif
+
+void *shm_ctlmax_helper(void)
+{
+ return &current->ipc_context->shm.ctlmax;
+}
+void *shm_ctlall_helper(void)
+{
+ return &current->ipc_context->shm.ctlall;
+}
+void *shm_ctlmni_helper(void)
+{
+ return &current->ipc_context->shm.ctlmni;
+}
diff -puN ipc/util.h~sysvshm-containerize-sysctls ipc/util.h
--- work/ipc/util.h~sysvshm-containerize-sysctls 2006-03-06 15:42:00.000000000 -0800
+++ work-dave/ipc/util.h 2006-03-06 15:42:00.000000000 -0800
@@ -47,6 +47,10 @@ struct ipc_sem_context {
struct ipc_shm_context {
struct ipc_ids ids;

+ size_t ctlmax;
+ size_t ctlall;
+ int ctlmni;
+
int tot; /* total number of shared memory pages */
};

diff -puN kernel/sysctl.c~sysvshm-containerize-sysctls kernel/sysctl.c
--- work/kernel/sysctl.c~sysvshm-containerize-sysctls 2006-03-06 15:42:00.000000000 -0800
+++ work-dave/kernel/sysctl.c 2006-03-06 15:42:00.000000000 -0800
@@ -92,12 +92,12 @@ extern char modprobe_path[];
extern int sg_big_buff;
#endif
#ifdef CONFIG_SYSVIPC
-extern size_t shm_ctlmax;
-extern size_t shm_ctlall;
+extern void *shm_ctlmax_helper(void);
+extern void *shm_ctlall_helper(void);
+extern void *shm_ctlmni_helper(void);
extern void *msg_ctlmnb_helper(void);
extern void *msg_ctlmni_helper(void);
extern void *msg_ctlmax_helper(void);
-extern int shm_ctlmni;
extern int sem_ctls[];
#endif

@@ -428,7 +428,7 @@ static ctl_table kern_table[] = {
{
.ctl_name = KERN_SHMMAX,
.procname = "shmmax",
- .data = &shm_ctlmax,
+ .data_access = &shm_ctlmax_helper,
.maxlen = sizeof (size_t),
.mode = 0644,
.proc_handler = &proc_doulongvec_minmax,
@@ -436,7 +436,7 @@ static ctl_table kern_table[] = {
{
.ctl_name = KERN_SHMALL,
.procname = "shmall",
- .data = &shm_ctlall,
+ .data_access = &shm_ctlall_helper,
.maxlen = sizeof (size_t),
.mode = 0644,
.proc_handler = &proc_doulongvec_minmax,
@@ -444,7 +444,7 @@ static ctl_table kern_table[] = {
{
.ctl_name = KERN_SHMMNI,
.procname = "shmmni",
- .data = &shm_ctlmni,
+ .data_access = &shm_ctlmni_helper,
.maxlen = sizeof (int),
.mode = 0644,
.proc_handler = &proc_dointvec,
diff -puN desc.txt~sysvshm-containerize-sysctls desc.txt
_

2006-03-06 23:53:22

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 2/6] sysvmsg: containerize


Take all of the global sysv msg context and put it inside of
a special structure. This allows future work to give different
containers completely different views into the sysv space.

I'm still using the "_context" nomenclature here. The consensus
seems to be to use something like "namespace". Please speak up
if you have any other suggestions. I'll fix it up before
submitting upstream.

Signed-off-by: Dave Hansen <[email protected]>
---

work-dave/desc.txt | 4 +
work-dave/include/linux/init_task.h | 1
work-dave/include/linux/ipc.h | 4 +
work-dave/include/linux/sched.h | 1
work-dave/ipc/msg.c | 138 ++++++++++++++++++------------------
work-dave/ipc/util.c | 8 +-
work-dave/ipc/util.h | 14 +++
7 files changed, 99 insertions(+), 71 deletions(-)

diff -puN include/linux/init_task.h~sysvmsg-container include/linux/init_task.h
--- work/include/linux/init_task.h~sysvmsg-container 2006-03-06 15:41:56.000000000 -0800
+++ work-dave/include/linux/init_task.h 2006-03-06 15:41:56.000000000 -0800
@@ -2,6 +2,7 @@
#define _LINUX__INIT_TASK_H

#include <linux/file.h>
+#include <linux/ipc.h>
#include <linux/rcupdate.h>

#define INIT_FDTABLE \
diff -puN include/linux/ipc.h~sysvmsg-container include/linux/ipc.h
--- work/include/linux/ipc.h~sysvmsg-container 2006-03-06 15:41:56.000000000 -0800
+++ work-dave/include/linux/ipc.h 2006-03-06 15:41:56.000000000 -0800
@@ -68,6 +68,10 @@ struct kern_ipc_perm
void *security;
};

+struct ipc_context;
+extern struct ipc_context *ipc_context;
+
+extern struct ipc_msg_context *init_ipc_context(void);
#endif /* __KERNEL__ */

#endif /* _LINUX_IPC_H */
diff -puN include/linux/sched.h~sysvmsg-container include/linux/sched.h
--- work/include/linux/sched.h~sysvmsg-container 2006-03-06 15:41:56.000000000 -0800
+++ work-dave/include/linux/sched.h 2006-03-06 15:41:56.000000000 -0800
@@ -793,6 +793,7 @@ struct task_struct {
int link_count, total_link_count;
/* ipc stuff */
struct sysv_sem sysvsem;
+ struct ipc_context *ipc_context;
/* CPU-specific state of this task */
struct thread_struct thread;
/* filesystem information */
diff -puN ipc/msg.c~sysvmsg-container ipc/msg.c
--- work/ipc/msg.c~sysvmsg-container 2006-03-06 15:41:56.000000000 -0800
+++ work-dave/ipc/msg.c 2006-03-06 15:41:56.000000000 -0800
@@ -60,35 +60,30 @@ struct msg_sender {
#define SEARCH_NOTEQUAL 3
#define SEARCH_LESSEQUAL 4

-static atomic_t msg_bytes = ATOMIC_INIT(0);
-static atomic_t msg_hdrs = ATOMIC_INIT(0);
+#define msg_lock(ctx, id) ((struct msg_queue*)ipc_lock(&ctx->ids,id))
+#define msg_unlock(ctx, msq) ipc_unlock(&(msq)->q_perm)
+#define msg_rmid(ctx, id) ((struct msg_queue*)ipc_rmid(&ctx->ids,id))
+#define msg_checkid(ctx, msq, msgid) \
+ ipc_checkid(&ctx->ids,&msq->q_perm,msgid)
+#define msg_buildid(ctx, id, seq) \
+ ipc_buildid(&ctx->ids, id, seq)

-static struct ipc_ids msg_ids;
-
-#define msg_lock(id) ((struct msg_queue*)ipc_lock(&msg_ids,id))
-#define msg_unlock(msq) ipc_unlock(&(msq)->q_perm)
-#define msg_rmid(id) ((struct msg_queue*)ipc_rmid(&msg_ids,id))
-#define msg_checkid(msq, msgid) \
- ipc_checkid(&msg_ids,&msq->q_perm,msgid)
-#define msg_buildid(id, seq) \
- ipc_buildid(&msg_ids, id, seq)
-
-static void freeque (struct msg_queue *msq, int id);
-static int newque (key_t key, int msgflg);
+static void freeque (struct ipc_msg_context *, struct msg_queue *msq, int id);
+static int newque (struct ipc_msg_context *context, key_t key, int id);
#ifdef CONFIG_PROC_FS
static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
#endif

-void __init msg_init (void)
+void __init msg_init (struct ipc_msg_context *context)
{
- ipc_init_ids(&msg_ids,msg_ctlmni);
+ ipc_init_ids(&context->ids,msg_ctlmni);
ipc_init_proc_interface("sysvipc/msg",
" key msqid perms cbytes qnum lspid lrpid uid gid cuid cgid stime rtime ctime\n",
- &msg_ids,
+ &context->ids,
sysvipc_msg_proc_show);
}

-static int newque (key_t key, int msgflg)
+static int newque (struct ipc_msg_context *context, key_t key, int msgflg)
{
int id;
int retval;
@@ -108,14 +103,14 @@ static int newque (key_t key, int msgflg
return retval;
}

- id = ipc_addid(&msg_ids, &msq->q_perm, msg_ctlmni);
+ id = ipc_addid(&context->ids, &msq->q_perm, msg_ctlmni);
if(id == -1) {
security_msg_queue_free(msq);
ipc_rcu_putref(msq);
return -ENOSPC;
}

- msq->q_id = msg_buildid(id,msq->q_perm.seq);
+ msq->q_id = msg_buildid(context,id,msq->q_perm.seq);
msq->q_stime = msq->q_rtime = 0;
msq->q_ctime = get_seconds();
msq->q_cbytes = msq->q_qnum = 0;
@@ -124,7 +119,7 @@ static int newque (key_t key, int msgflg
INIT_LIST_HEAD(&msq->q_messages);
INIT_LIST_HEAD(&msq->q_receivers);
INIT_LIST_HEAD(&msq->q_senders);
- msg_unlock(msq);
+ msg_unlock(context, msq);

return msq->q_id;
}
@@ -182,23 +177,24 @@ static void expunge_all(struct msg_queue
* msg_ids.sem and the spinlock for this message queue is hold
* before freeque() is called. msg_ids.sem remains locked on exit.
*/
-static void freeque (struct msg_queue *msq, int id)
+static void freeque (struct ipc_msg_context *context,
+ struct msg_queue *msq, int id)
{
struct list_head *tmp;

expunge_all(msq,-EIDRM);
ss_wakeup(&msq->q_senders,1);
- msq = msg_rmid(id);
- msg_unlock(msq);
+ msq = msg_rmid(context, id);
+ msg_unlock(context, msq);

tmp = msq->q_messages.next;
while(tmp != &msq->q_messages) {
struct msg_msg* msg = list_entry(tmp,struct msg_msg,m_list);
tmp = tmp->next;
- atomic_dec(&msg_hdrs);
+ atomic_dec(&context->hdrs);
free_msg(msg);
}
- atomic_sub(msq->q_cbytes, &msg_bytes);
+ atomic_sub(msq->q_cbytes, &context->bytes);
security_msg_queue_free(msq);
ipc_rcu_putref(msq);
}
@@ -207,32 +203,34 @@ asmlinkage long sys_msgget (key_t key, i
{
int id, ret = -EPERM;
struct msg_queue *msq;
-
- down(&msg_ids.sem);
+ struct ipc_msg_context *context = &current->ipc_context->msg;
+
+ down(&context->ids.sem);
if (key == IPC_PRIVATE)
- ret = newque(key, msgflg);
- else if ((id = ipc_findkey(&msg_ids, key)) == -1) { /* key not used */
+ ret = newque(context, key, msgflg);
+ else if ((id = ipc_findkey(&context->ids, key)) == -1) {
+ /* key not used */
if (!(msgflg & IPC_CREAT))
ret = -ENOENT;
else
- ret = newque(key, msgflg);
+ ret = newque(context, key, msgflg);
} else if (msgflg & IPC_CREAT && msgflg & IPC_EXCL) {
ret = -EEXIST;
} else {
- msq = msg_lock(id);
+ msq = msg_lock(context, id);
if(msq==NULL)
BUG();
if (ipcperms(&msq->q_perm, msgflg))
ret = -EACCES;
else {
- int qid = msg_buildid(id, msq->q_perm.seq);
+ int qid = msg_buildid(context, id, msq->q_perm.seq);
ret = security_msg_queue_associate(msq, msgflg);
if (!ret)
ret = qid;
}
- msg_unlock(msq);
+ msg_unlock(context, msq);
}
- up(&msg_ids.sem);
+ up(&context->ids.sem);
return ret;
}

@@ -333,6 +331,7 @@ asmlinkage long sys_msgctl (int msqid, i
struct msg_queue *msq;
struct msq_setbuf setbuf;
struct kern_ipc_perm *ipcp;
+ struct ipc_msg_context *context = &current->ipc_context->msg;

if (msqid < 0 || cmd < 0)
return -EINVAL;
@@ -362,18 +361,18 @@ asmlinkage long sys_msgctl (int msqid, i
msginfo.msgmnb = msg_ctlmnb;
msginfo.msgssz = MSGSSZ;
msginfo.msgseg = MSGSEG;
- down(&msg_ids.sem);
+ down(&context->ids.sem);
if (cmd == MSG_INFO) {
- msginfo.msgpool = msg_ids.in_use;
- msginfo.msgmap = atomic_read(&msg_hdrs);
- msginfo.msgtql = atomic_read(&msg_bytes);
+ msginfo.msgpool = context->ids.in_use;
+ msginfo.msgmap = atomic_read(&context->hdrs);
+ msginfo.msgtql = atomic_read(&context->bytes);
} else {
msginfo.msgmap = MSGMAP;
msginfo.msgpool = MSGPOOL;
msginfo.msgtql = MSGTQL;
}
- max_id = msg_ids.max_id;
- up(&msg_ids.sem);
+ max_id = context->ids.max_id;
+ up(&context->ids.sem);
if (copy_to_user (buf, &msginfo, sizeof(struct msginfo)))
return -EFAULT;
return (max_id < 0) ? 0: max_id;
@@ -385,20 +384,21 @@ asmlinkage long sys_msgctl (int msqid, i
int success_return;
if (!buf)
return -EFAULT;
- if(cmd == MSG_STAT && msqid >= msg_ids.entries->size)
+ if(cmd == MSG_STAT && msqid >= context->ids.entries->size)
return -EINVAL;

memset(&tbuf,0,sizeof(tbuf));

- msq = msg_lock(msqid);
+ msq = msg_lock(context, msqid);
if (msq == NULL)
return -EINVAL;

if(cmd == MSG_STAT) {
- success_return = msg_buildid(msqid, msq->q_perm.seq);
+ success_return =
+ msg_buildid(context, msqid, msq->q_perm.seq);
} else {
err = -EIDRM;
- if (msg_checkid(msq,msqid))
+ if (msg_checkid(context,msq,msqid))
goto out_unlock;
success_return = 0;
}
@@ -419,7 +419,7 @@ asmlinkage long sys_msgctl (int msqid, i
tbuf.msg_qbytes = msq->q_qbytes;
tbuf.msg_lspid = msq->q_lspid;
tbuf.msg_lrpid = msq->q_lrpid;
- msg_unlock(msq);
+ msg_unlock(context, msq);
if (copy_msqid_to_user(buf, &tbuf, version))
return -EFAULT;
return success_return;
@@ -438,14 +438,14 @@ asmlinkage long sys_msgctl (int msqid, i
return -EINVAL;
}

- down(&msg_ids.sem);
- msq = msg_lock(msqid);
+ down(&context->ids.sem);
+ msq = msg_lock(context, msqid);
err=-EINVAL;
if (msq == NULL)
goto out_up;

err = -EIDRM;
- if (msg_checkid(msq,msqid))
+ if (msg_checkid(context,msq,msqid))
goto out_unlock_up;
ipcp = &msq->q_perm;
err = -EPERM;
@@ -480,22 +480,22 @@ asmlinkage long sys_msgctl (int msqid, i
* due to a larger queue size.
*/
ss_wakeup(&msq->q_senders,0);
- msg_unlock(msq);
+ msg_unlock(context, msq);
break;
}
case IPC_RMID:
- freeque (msq, msqid);
+ freeque (context, msq, msqid);
break;
}
err = 0;
out_up:
- up(&msg_ids.sem);
+ up(&context->ids.sem);
return err;
out_unlock_up:
- msg_unlock(msq);
+ msg_unlock(context, msq);
goto out_up;
out_unlock:
- msg_unlock(msq);
+ msg_unlock(context, msq);
return err;
}

@@ -558,7 +558,8 @@ asmlinkage long sys_msgsnd (int msqid, s
struct msg_msg *msg;
long mtype;
int err;
-
+ struct ipc_msg_context *context = &current->ipc_context->msg;
+
if (msgsz > msg_ctlmax || (long) msgsz < 0 || msqid < 0)
return -EINVAL;
if (get_user(mtype, &msgp->mtype))
@@ -573,13 +574,13 @@ asmlinkage long sys_msgsnd (int msqid, s
msg->m_type = mtype;
msg->m_ts = msgsz;

- msq = msg_lock(msqid);
+ msq = msg_lock(context, msqid);
err=-EINVAL;
if(msq==NULL)
goto out_free;

err= -EIDRM;
- if (msg_checkid(msq,msqid))
+ if (msg_checkid(context,msq,msqid))
goto out_unlock_free;

for (;;) {
@@ -605,7 +606,7 @@ asmlinkage long sys_msgsnd (int msqid, s
}
ss_add(msq, &s);
ipc_rcu_getref(msq);
- msg_unlock(msq);
+ msg_unlock(context, msq);
schedule();

ipc_lock_by_ptr(&msq->q_perm);
@@ -630,15 +631,15 @@ asmlinkage long sys_msgsnd (int msqid, s
list_add_tail(&msg->m_list,&msq->q_messages);
msq->q_cbytes += msgsz;
msq->q_qnum++;
- atomic_add(msgsz,&msg_bytes);
- atomic_inc(&msg_hdrs);
+ atomic_add(msgsz,&context->bytes);
+ atomic_inc(&context->hdrs);
}

err = 0;
msg = NULL;

out_unlock_free:
- msg_unlock(msq);
+ msg_unlock(context, msq);
out_free:
if(msg!=NULL)
free_msg(msg);
@@ -670,17 +671,18 @@ asmlinkage long sys_msgrcv (int msqid, s
struct msg_queue *msq;
struct msg_msg *msg;
int mode;
+ struct ipc_msg_context *context = &current->ipc_context->msg;

if (msqid < 0 || (long) msgsz < 0)
return -EINVAL;
mode = convert_mode(&msgtyp,msgflg);

- msq = msg_lock(msqid);
+ msq = msg_lock(context, msqid);
if(msq==NULL)
return -EINVAL;

msg = ERR_PTR(-EIDRM);
- if (msg_checkid(msq,msqid))
+ if (msg_checkid(context,msq,msqid))
goto out_unlock;

for (;;) {
@@ -720,10 +722,10 @@ asmlinkage long sys_msgrcv (int msqid, s
msq->q_rtime = get_seconds();
msq->q_lrpid = current->tgid;
msq->q_cbytes -= msg->m_ts;
- atomic_sub(msg->m_ts,&msg_bytes);
- atomic_dec(&msg_hdrs);
+ atomic_sub(msg->m_ts,&context->bytes);
+ atomic_dec(&context->hdrs);
ss_wakeup(&msq->q_senders,0);
- msg_unlock(msq);
+ msg_unlock(context, msq);
break;
}
/* No message waiting. Wait for a message */
@@ -741,7 +743,7 @@ asmlinkage long sys_msgrcv (int msqid, s
msr_d.r_maxsize = msgsz;
msr_d.r_msg = ERR_PTR(-EAGAIN);
current->state = TASK_INTERRUPTIBLE;
- msg_unlock(msq);
+ msg_unlock(context, msq);

schedule();

@@ -794,7 +796,7 @@ asmlinkage long sys_msgrcv (int msqid, s
if (signal_pending(current)) {
msg = ERR_PTR(-ERESTARTNOHAND);
out_unlock:
- msg_unlock(msq);
+ msg_unlock(context, msq);
break;
}
}
diff -puN ipc/util.c~sysvmsg-container ipc/util.c
--- work/ipc/util.c~sysvmsg-container 2006-03-06 15:41:56.000000000 -0800
+++ work-dave/ipc/util.c 2006-03-06 15:41:56.000000000 -0800
@@ -45,11 +45,15 @@ struct ipc_proc_iface {
* The various system5 IPC resources (semaphores, messages and shared
* memory are initialised
*/
-
+
static int __init ipc_init(void)
{
+ init_task.ipc_context = kzalloc(sizeof(*ipc_context), GFP_KERNEL);
+ if (!init_task.ipc_context)
+ return -ENOMEM;
+
sem_init();
- msg_init();
+ msg_init(&init_task.ipc_context->msg);
shm_init();
return 0;
}
diff -puN ipc/util.h~sysvmsg-container ipc/util.h
--- work/ipc/util.h~sysvmsg-container 2006-03-06 15:41:56.000000000 -0800
+++ work-dave/ipc/util.h 2006-03-06 15:41:56.000000000 -0800
@@ -12,7 +12,7 @@
#define SEQ_MULTIPLIER (IPCMNI)

void sem_init (void);
-void msg_init (void);
+void msg_init (struct ipc_msg_context *context);
void shm_init (void);

struct ipc_id_ary {
@@ -30,6 +30,18 @@ struct ipc_ids {
struct ipc_id_ary* entries;
};

+struct ipc_msg_context {
+ atomic_t bytes;
+ atomic_t hdrs;
+
+ struct ipc_ids ids;
+ struct kref count;
+};
+
+struct ipc_context {
+ struct ipc_msg_context msg;
+};
+
struct seq_file;
void __init ipc_init_ids(struct ipc_ids* ids, int size);
#ifdef CONFIG_PROC_FS
diff -puN kernel/sysctl.c~sysvmsg-container kernel/sysctl.c
diff -puN /dev/null desc.txt
--- /dev/null 2005-03-30 22:36:15.000000000 -0800
+++ work-dave/desc.txt 2006-03-06 15:41:56.000000000 -0800
@@ -0,0 +1,4 @@
+DESC
+support namespaces for sysv
+EDESC
+
_

2006-03-06 23:53:01

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 3/6] sysvmsg: containerize sysctls


Use the new sysctl data accessor function pointers to put
the sysvmsg sysctl variables inside of the context structure.

Note that this will effectively remove the system-wide limits
on sysv msg resources that the sysctl mechanism currently
provides and move it to a per-container limit. This is
currently by design, and may be later addressed when we have
proper large-scale resource controls in the kernel.

Signed-off-by: Dave Hansen <[email protected]>
---

work-dave/ipc/msg.c | 38 +++++++++++++++++++++++++-------------
work-dave/ipc/util.h | 4 ++++
work-dave/kernel/sysctl.c | 12 ++++++------
3 files changed, 35 insertions(+), 19 deletions(-)

diff -puN ipc/msg.c~sysvmsg-containerize-sysctls ipc/msg.c
--- work/ipc/msg.c~sysvmsg-containerize-sysctls 2006-03-06 15:41:57.000000000 -0800
+++ work-dave/ipc/msg.c 2006-03-06 15:41:57.000000000 -0800
@@ -32,11 +32,6 @@
#include <asm/uaccess.h>
#include "util.h"

-/* sysctl: */
-int msg_ctlmax = MSGMAX;
-int msg_ctlmnb = MSGMNB;
-int msg_ctlmni = MSGMNI;
-
/* one msg_receiver structure for each sleeping receiver */
struct msg_receiver {
struct list_head r_list;
@@ -76,7 +71,11 @@ static int sysvipc_msg_proc_show(struct

void __init msg_init (struct ipc_msg_context *context)
{
- ipc_init_ids(&context->ids,msg_ctlmni);
+ context->ctlmax = MSGMAX;
+ context->ctlmnb = MSGMNB;
+ context->ctlmni = MSGMNI;
+
+ ipc_init_ids(&context->ids,context->ctlmni);
ipc_init_proc_interface("sysvipc/msg",
" key msqid perms cbytes qnum lspid lrpid uid gid cuid cgid stime rtime ctime\n",
&context->ids,
@@ -103,7 +102,7 @@ static int newque (struct ipc_msg_contex
return retval;
}

- id = ipc_addid(&context->ids, &msq->q_perm, msg_ctlmni);
+ id = ipc_addid(&context->ids, &msq->q_perm, context->ctlmni);
if(id == -1) {
security_msg_queue_free(msq);
ipc_rcu_putref(msq);
@@ -114,7 +113,7 @@ static int newque (struct ipc_msg_contex
msq->q_stime = msq->q_rtime = 0;
msq->q_ctime = get_seconds();
msq->q_cbytes = msq->q_qnum = 0;
- msq->q_qbytes = msg_ctlmnb;
+ msq->q_qbytes = context->ctlmnb;
msq->q_lspid = msq->q_lrpid = 0;
INIT_LIST_HEAD(&msq->q_messages);
INIT_LIST_HEAD(&msq->q_receivers);
@@ -356,9 +355,9 @@ asmlinkage long sys_msgctl (int msqid, i
return err;

memset(&msginfo,0,sizeof(msginfo));
- msginfo.msgmni = msg_ctlmni;
- msginfo.msgmax = msg_ctlmax;
- msginfo.msgmnb = msg_ctlmnb;
+ msginfo.msgmni = context->ctlmni;
+ msginfo.msgmax = context->ctlmax;
+ msginfo.msgmnb = context->ctlmnb;
msginfo.msgssz = MSGSSZ;
msginfo.msgseg = MSGSEG;
down(&context->ids.sem);
@@ -462,7 +461,7 @@ asmlinkage long sys_msgctl (int msqid, i
case IPC_SET:
{
err = -EPERM;
- if (setbuf.qbytes > msg_ctlmnb && !capable(CAP_SYS_RESOURCE))
+ if (setbuf.qbytes > context->ctlmnb && !capable(CAP_SYS_RESOURCE))
goto out_unlock_up;

msq->q_qbytes = setbuf.qbytes;
@@ -560,7 +559,7 @@ asmlinkage long sys_msgsnd (int msqid, s
int err;
struct ipc_msg_context *context = &current->ipc_context->msg;

- if (msgsz > msg_ctlmax || (long) msgsz < 0 || msqid < 0)
+ if (msgsz > context->ctlmax || (long) msgsz < 0 || msqid < 0)
return -EINVAL;
if (get_user(mtype, &msgp->mtype))
return -EFAULT;
@@ -835,3 +834,16 @@ static int sysvipc_msg_proc_show(struct
msq->q_ctime);
}
#endif
+
+void *msg_ctlmnb_helper(void)
+{
+ return &current->ipc_context->msg.ctlmnb;
+}
+void *msg_ctlmni_helper(void)
+{
+ return &current->ipc_context->msg.ctlmni;
+}
+void *msg_ctlmax_helper(void)
+{
+ return &current->ipc_context->msg.ctlmax;
+}
diff -puN ipc/util.h~sysvmsg-containerize-sysctls ipc/util.h
--- work/ipc/util.h~sysvmsg-containerize-sysctls 2006-03-06 15:41:57.000000000 -0800
+++ work-dave/ipc/util.h 2006-03-06 15:41:57.000000000 -0800
@@ -34,6 +34,10 @@ struct ipc_msg_context {
atomic_t bytes;
atomic_t hdrs;

+ int ctlmax;
+ int ctlmnb;
+ int ctlmni;
+
struct ipc_ids ids;
struct kref count;
};
diff -puN kernel/sysctl.c~sysvmsg-containerize-sysctls kernel/sysctl.c
--- work/kernel/sysctl.c~sysvmsg-containerize-sysctls 2006-03-06 15:41:57.000000000 -0800
+++ work-dave/kernel/sysctl.c 2006-03-06 15:41:57.000000000 -0800
@@ -94,10 +94,10 @@ extern int sg_big_buff;
#ifdef CONFIG_SYSVIPC
extern size_t shm_ctlmax;
extern size_t shm_ctlall;
+extern void *msg_ctlmnb_helper(void);
+extern void *msg_ctlmni_helper(void);
+extern void *msg_ctlmax_helper(void);
extern int shm_ctlmni;
-extern int msg_ctlmax;
-extern int msg_ctlmnb;
-extern int msg_ctlmni;
extern int sem_ctls[];
#endif

@@ -452,7 +452,7 @@ static ctl_table kern_table[] = {
{
.ctl_name = KERN_MSGMAX,
.procname = "msgmax",
- .data = &msg_ctlmax,
+ .data_access = &msg_ctlmax_helper,
.maxlen = sizeof (int),
.mode = 0644,
.proc_handler = &proc_dointvec,
@@ -460,7 +460,7 @@ static ctl_table kern_table[] = {
{
.ctl_name = KERN_MSGMNI,
.procname = "msgmni",
- .data = &msg_ctlmni,
+ .data_access = &msg_ctlmni_helper,
.maxlen = sizeof (int),
.mode = 0644,
.proc_handler = &proc_dointvec,
@@ -468,7 +468,7 @@ static ctl_table kern_table[] = {
{
.ctl_name = KERN_MSGMNB,
.procname = "msgmnb",
- .data = &msg_ctlmnb,
+ .data_access = &msg_ctlmnb_helper,
.maxlen = sizeof (int),
.mode = 0644,
.proc_handler = &proc_dointvec,
_

2006-03-06 23:52:59

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 1/6] prepare sysctls for containers


Right now, sysctls can only deal with global variables. This
patch makes them a _little_ more flexible by allowing there to
be an accessor function to get at the variable being changed,
instead of it being global.

This allows the sysctls to be backed by variables that are,
for instance, dynamically allocated and not available at
compile-time.

This also provides a very simple mechanism to take things that
are currently global and containerize them.

Signed-off-by: Dave Hansen <[email protected]>
---

work-dave/include/linux/sysctl.h | 8 ++++
work-dave/kernel/sysctl.c | 65 ++++++++++++++++++++++++++-------------
2 files changed, 52 insertions(+), 21 deletions(-)

diff -puN include/linux/sysctl.h~sysctls-for-containers include/linux/sysctl.h
--- work/include/linux/sysctl.h~sysctls-for-containers 2006-03-06 15:41:55.000000000 -0800
+++ work-dave/include/linux/sysctl.h 2006-03-06 15:41:55.000000000 -0800
@@ -872,6 +872,7 @@ extern void sysctl_init(void);

typedef struct ctl_table ctl_table;

+typedef void *ctl_data_access (void);
typedef int ctl_handler (ctl_table *table, int __user *name, int nlen,
void __user *oldval, size_t __user *oldlenp,
void __user *newval, size_t newlen,
@@ -957,6 +958,13 @@ struct ctl_table
int ctl_name; /* Binary ID */
const char *procname; /* Text ID for /proc/sys, or zero */
void *data;
+ ctl_data_access *data_access; /* set this to a function if you
+ * don't have a static place to point
+ * ->data at compile-time. This
+ * function will be called to dynamically
+ * figure out a ->data pointer. Do not
+ * set this and ->data at once.
+ */
int maxlen;
mode_t mode;
ctl_table *child;
diff -puN kernel/sysctl.c~sysctls-for-containers kernel/sysctl.c
--- work/kernel/sysctl.c~sysctls-for-containers 2006-03-06 15:41:55.000000000 -0800
+++ work-dave/kernel/sysctl.c 2006-03-06 15:41:55.000000000 -0800
@@ -1197,6 +1197,24 @@ repeat:
return -ENOTDIR;
}

+void *sysctl_table_data(ctl_table *table)
+{
+ void *data;
+
+ if (table->data && table->data_access) {
+ printk(KERN_WARNING
+ "sysctl: data and accessor function set for: '%s'\n",
+ table->procname);
+ table->data = NULL;
+ }
+
+ data = table->data;
+ if (!data && table->data_access)
+ data = table->data_access();
+
+ return data;
+}
+
/* Perform the actual read/write of a sysctl table entry. */
int do_sysctl_strategy (ctl_table *table,
int __user *name, int nlen,
@@ -1205,6 +1223,7 @@ int do_sysctl_strategy (ctl_table *table
{
int op = 0, rc;
size_t len;
+ void *data;

if (oldval)
op |= 004;
@@ -1224,14 +1243,15 @@ int do_sysctl_strategy (ctl_table *table

/* If there is no strategy routine, or if the strategy returns
* zero, proceed with automatic r/w */
- if (table->data && table->maxlen) {
+ data = sysctl_table_data(table);
+ if (data && table->maxlen) {
if (oldval && oldlenp) {
if (get_user(len, oldlenp))
return -EFAULT;
if (len) {
if (len > table->maxlen)
len = table->maxlen;
- if(copy_to_user(oldval, table->data, len))
+ if(copy_to_user(oldval, data, len))
return -EFAULT;
if(put_user(len, oldlenp))
return -EFAULT;
@@ -1241,7 +1261,7 @@ int do_sysctl_strategy (ctl_table *table
len = newlen;
if (len > table->maxlen)
len = table->maxlen;
- if(copy_from_user(table->data, newval, len))
+ if(copy_from_user(data, newval, len))
return -EFAULT;
}
}
@@ -1539,7 +1559,7 @@ int proc_dostring(ctl_table *table, int
char __user *p;
char c;

- if (!table->data || !table->maxlen || !*lenp ||
+ if (!sysctl_table_data(table) || !table->maxlen || !*lenp ||
(*ppos && !write)) {
*lenp = 0;
return 0;
@@ -1557,18 +1577,18 @@ int proc_dostring(ctl_table *table, int
}
if (len >= table->maxlen)
len = table->maxlen-1;
- if(copy_from_user(table->data, buffer, len))
+ if(copy_from_user(sysctl_table_data(table), buffer, len))
return -EFAULT;
- ((char *) table->data)[len] = 0;
+ ((char *) sysctl_table_data(table))[len] = 0;
*ppos += *lenp;
} else {
- len = strlen(table->data);
+ len = strlen(sysctl_table_data(table));
if (len > table->maxlen)
len = table->maxlen;
if (len > *lenp)
len = *lenp;
if (len)
- if(copy_to_user(buffer, table->data, len))
+ if(copy_to_user(buffer, sysctl_table_data(table), len))
return -EFAULT;
if (len < *lenp) {
if(put_user('\n', ((char __user *) buffer) + len))
@@ -1636,13 +1656,13 @@ static int do_proc_dointvec(ctl_table *t
char buf[TMPBUFLEN], *p;
char __user *s = buffer;

- if (!table->data || !table->maxlen || !*lenp ||
+ if (!sysctl_table_data(table) || !table->maxlen || !*lenp ||
(*ppos && !write)) {
*lenp = 0;
return 0;
}

- i = (int *) table->data;
+ i = (int *) sysctl_table_data(table);
vleft = table->maxlen / sizeof(*i);
left = *lenp;

@@ -1878,13 +1898,13 @@ static int do_proc_doulongvec_minmax(ctl
char buf[TMPBUFLEN], *p;
char __user *s = buffer;

- if (!table->data || !table->maxlen || !*lenp ||
+ if (!sysctl_table_data(table) || !table->maxlen || !*lenp ||
(*ppos && !write)) {
*lenp = 0;
return 0;
}

- i = (unsigned long *) table->data;
+ i = (unsigned long *) sysctl_table_data(table);
min = (unsigned long *) table->extra1;
max = (unsigned long *) table->extra2;
vleft = table->maxlen / sizeof(unsigned long);
@@ -2230,7 +2250,9 @@ int sysctl_string(ctl_table *table, int
void __user *oldval, size_t __user *oldlenp,
void __user *newval, size_t newlen, void **context)
{
- if (!table->data || !table->maxlen)
+ char *data = sysctl_table_data(table);
+
+ if (!data || !table->maxlen)
return -ENOTDIR;

if (oldval && oldlenp) {
@@ -2238,7 +2260,7 @@ int sysctl_string(ctl_table *table, int
if (get_user(bufsize, oldlenp))
return -EFAULT;
if (bufsize) {
- size_t len = strlen(table->data), copied;
+ size_t len = strlen(data), copied;

/* This shouldn't trigger for a well-formed sysctl */
if (len > table->maxlen)
@@ -2247,7 +2269,7 @@ int sysctl_string(ctl_table *table, int
/* Copy up to a max of bufsize-1 bytes of the string */
copied = (len >= bufsize) ? bufsize - 1 : len;

- if (copy_to_user(oldval, table->data, copied) ||
+ if (copy_to_user(oldval, data, copied) ||
put_user(0, (char __user *)(oldval + copied)))
return -EFAULT;
if (put_user(len, oldlenp))
@@ -2258,11 +2280,11 @@ int sysctl_string(ctl_table *table, int
size_t len = newlen;
if (len > table->maxlen)
len = table->maxlen;
- if(copy_from_user(table->data, newval, len))
+ if(copy_from_user(data, newval, len))
return -EFAULT;
if (len == table->maxlen)
len--;
- ((char *) table->data)[len] = 0;
+ data[len] = 0;
}
return 1;
}
@@ -2320,7 +2342,7 @@ int sysctl_jiffies(ctl_table *table, int
if (olen!=sizeof(int))
return -EINVAL;
}
- if (put_user(*(int *)(table->data)/HZ, (int __user *)oldval) ||
+ if (put_user(*(int *)(sysctl_table_data(table))/HZ, (int __user *)oldval) ||
(oldlenp && put_user(sizeof(int),oldlenp)))
return -EFAULT;
}
@@ -2330,7 +2352,7 @@ int sysctl_jiffies(ctl_table *table, int
return -EINVAL;
if (get_user(new, (int __user *)newval))
return -EFAULT;
- *(int *)(table->data) = new*HZ;
+ *(int *)(sysctl_table_data(table)) = new*HZ;
}
return 1;
}
@@ -2340,6 +2362,7 @@ int sysctl_ms_jiffies(ctl_table *table,
void __user *oldval, size_t __user *oldlenp,
void __user *newval, size_t newlen, void **context)
{
+ int *jiffies_ptr = sysctl_table_data(table);
if (oldval) {
size_t olen;
if (oldlenp) {
@@ -2348,7 +2371,7 @@ int sysctl_ms_jiffies(ctl_table *table,
if (olen!=sizeof(int))
return -EINVAL;
}
- if (put_user(jiffies_to_msecs(*(int *)(table->data)), (int __user *)oldval) ||
+ if (put_user(jiffies_to_msecs(*jiffies_ptr), (int __user *)oldval) ||
(oldlenp && put_user(sizeof(int),oldlenp)))
return -EFAULT;
}
@@ -2358,7 +2381,7 @@ int sysctl_ms_jiffies(ctl_table *table,
return -EINVAL;
if (get_user(new, (int __user *)newval))
return -EFAULT;
- *(int *)(table->data) = msecs_to_jiffies(new);
+ *jiffies_ptr = msecs_to_jiffies(new);
}
return 1;
}
_

2006-03-06 23:54:44

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 5/6] sysvshm: containerize


Same thing as sems and msgs.

Take all of the global sysv shm context and put it inside of
a special structure. This allows future work to give different
containers completely different views into the sysv space.

Signed-off-by: Dave Hansen <[email protected]>
---

work-dave/ipc/shm.c | 197 +++++++++++++++++++++++++++------------------------
work-dave/ipc/util.c | 3
work-dave/ipc/util.h | 9 ++
3 files changed, 118 insertions(+), 91 deletions(-)

diff -puN include/linux/sched.h~sysvshm-container include/linux/sched.h
diff -puN ipc/shm.c~sysvshm-container ipc/shm.c
--- work/ipc/shm.c~sysvshm-container 2006-03-06 15:41:59.000000000 -0800
+++ work-dave/ipc/shm.c 2006-03-06 15:41:59.000000000 -0800
@@ -38,15 +38,17 @@
static struct file_operations shm_file_operations;
static struct vm_operations_struct shm_vm_ops;

-static struct ipc_ids shm_ids;
+#define shm_lock(context, id) \
+ ((struct shmid_kernel*)ipc_lock(&context->ids,id))
+#define shm_unlock(context, shp) \
+ ipc_unlock(&(shp)->shm_perm)
+#define shm_get(context, id)\
+ ((struct shmid_kernel*)ipc_get(&context->ids,id))
+#define shm_buildid(context, id, seq) \
+ ipc_buildid(&context->ids, id, seq)

-#define shm_lock(id) ((struct shmid_kernel*)ipc_lock(&shm_ids,id))
-#define shm_unlock(shp) ipc_unlock(&(shp)->shm_perm)
-#define shm_get(id) ((struct shmid_kernel*)ipc_get(&shm_ids,id))
-#define shm_buildid(id, seq) \
- ipc_buildid(&shm_ids, id, seq)
-
-static int newseg (key_t key, int shmflg, size_t size);
+static int newseg (struct ipc_shm_context *context,
+ key_t key, int shmflg, size_t size);
static void shm_open (struct vm_area_struct *shmd);
static void shm_close (struct vm_area_struct *shmd);
#ifdef CONFIG_PROC_FS
@@ -57,51 +59,53 @@ size_t shm_ctlmax = SHMMAX;
size_t shm_ctlall = SHMALL;
int shm_ctlmni = SHMMNI;

-static int shm_tot; /* total number of shared memory pages */
-
-void __init shm_init (void)
+void __init shm_init (struct ipc_shm_context *context)
{
- ipc_init_ids(&shm_ids, 1);
+ ipc_init_ids(&context->ids, 1);
ipc_init_proc_interface("sysvipc/shm",
" key shmid perms size cpid lpid nattch uid gid cuid cgid atime dtime ctime\n",
- &shm_ids,
+ &context->ids,
sysvipc_shm_proc_show);
}

-static inline int shm_checkid(struct shmid_kernel *s, int id)
+static inline int shm_checkid(struct ipc_shm_context *context,
+ struct shmid_kernel *s, int id)
{
- if (ipc_checkid(&shm_ids,&s->shm_perm,id))
+ if (ipc_checkid(&context->ids,&s->shm_perm,id))
return -EIDRM;
return 0;
}

-static inline struct shmid_kernel *shm_rmid(int id)
+static inline
+struct shmid_kernel *shm_rmid(struct ipc_shm_context *context, int id)
{
- return (struct shmid_kernel *)ipc_rmid(&shm_ids,id);
+ return (struct shmid_kernel *)ipc_rmid(&context->ids,id);
}

-static inline int shm_addid(struct shmid_kernel *shp)
+static inline
+int shm_addid(struct ipc_shm_context *context, struct shmid_kernel *shp)
{
- return ipc_addid(&shm_ids, &shp->shm_perm, shm_ctlmni);
+ return ipc_addid(&context->ids, &shp->shm_perm, shm_ctlmni);
}



-static inline void shm_inc (int id) {
+static inline void shm_inc (struct ipc_shm_context *context, int id) {
struct shmid_kernel *shp;

- if(!(shp = shm_lock(id)))
+ if(!(shp = shm_lock(context, id)))
BUG();
shp->shm_atim = get_seconds();
shp->shm_lprid = current->tgid;
shp->shm_nattch++;
- shm_unlock(shp);
+ shm_unlock(context, shp);
}

/* This is called by fork, once for every shm attach. */
static void shm_open (struct vm_area_struct *shmd)
{
- shm_inc (shmd->vm_file->f_dentry->d_inode->i_ino);
+ struct ipc_shm_context *context = shmd->vm_private_data;
+ shm_inc (context, shmd->vm_file->f_dentry->d_inode->i_ino);
}

/*
@@ -109,14 +113,15 @@ static void shm_open (struct vm_area_str
*
* @shp: struct to free
*
- * It has to be called with shp and shm_ids.sem locked,
+ * It has to be called with shp and context->ids.sem locked,
* but returns with shp unlocked and freed.
*/
-static void shm_destroy (struct shmid_kernel *shp)
+static void shm_destroy (struct ipc_shm_context *context,
+ struct shmid_kernel *shp)
{
- shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
- shm_rmid (shp->id);
- shm_unlock(shp);
+ context->tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ shm_rmid (context, shp->id);
+ shm_unlock(context, shp);
if (!is_file_hugepages(shp->shm_file))
shmem_lock(shp->shm_file, 0, shp->mlock_user);
else
@@ -138,30 +143,39 @@ static void shm_close (struct vm_area_st
struct file * file = shmd->vm_file;
int id = file->f_dentry->d_inode->i_ino;
struct shmid_kernel *shp;
+ struct ipc_shm_context *context = shmd->vm_private_data;

- down (&shm_ids.sem);
+ down (&context->ids.sem);
/* remove from the list of attaches of the shm segment */
- if(!(shp = shm_lock(id)))
+ if(!(shp = shm_lock(context, id)))
BUG();
shp->shm_lprid = current->tgid;
shp->shm_dtim = get_seconds();
shp->shm_nattch--;
if(shp->shm_nattch == 0 &&
shp->shm_perm.mode & SHM_DEST)
- shm_destroy (shp);
+ shm_destroy (context, shp);
else
- shm_unlock(shp);
- up (&shm_ids.sem);
+ shm_unlock(context, shp);
+ up (&context->ids.sem);
}

static int shm_mmap(struct file * file, struct vm_area_struct * vma)
{
int ret;
+ struct ipc_shm_context *context = current->ipc_shm_context;

ret = shmem_mmap(file, vma);
if (ret == 0) {
vma->vm_ops = &shm_vm_ops;
- shm_inc(file->f_dentry->d_inode->i_ino);
+ /*
+ * This implies that the shm context can not change
+ * while a task is running because this mmap's
+ * context would change underneath it.
+ * -- You many only change shm context at exec()
+ */
+ vma->vm_private_data = context;
+ shm_inc(context, file->f_dentry->d_inode->i_ino);
}

return ret;
@@ -184,7 +198,8 @@ static struct vm_operations_struct shm_v
#endif
};

-static int newseg (key_t key, int shmflg, size_t size)
+static int newseg (struct ipc_shm_context *context,
+ key_t key, int shmflg, size_t size)
{
int error;
struct shmid_kernel *shp;
@@ -196,7 +211,7 @@ static int newseg (key_t key, int shmflg
if (size < SHMMIN || size > shm_ctlmax)
return -EINVAL;

- if (shm_tot + numpages >= shm_ctlall)
+ if (context->tot + numpages >= shm_ctlall)
return -ENOSPC;

shp = ipc_rcu_alloc(sizeof(*shp));
@@ -235,7 +250,7 @@ static int newseg (key_t key, int shmflg
goto no_file;

error = -ENOSPC;
- id = shm_addid(shp);
+ id = shm_addid(context, shp);
if(id == -1)
goto no_id;

@@ -245,7 +260,7 @@ static int newseg (key_t key, int shmflg
shp->shm_ctim = get_seconds();
shp->shm_segsz = size;
shp->shm_nattch = 0;
- shp->id = shm_buildid(id,shp->shm_perm.seq);
+ shp->id = shm_buildid(context, id,shp->shm_perm.seq);
shp->shm_file = file;
file->f_dentry->d_inode->i_ino = shp->id;

@@ -253,8 +268,8 @@ static int newseg (key_t key, int shmflg
if (!(shmflg & SHM_HUGETLB))
file->f_op = &shm_file_operations;

- shm_tot += numpages;
- shm_unlock(shp);
+ context->tot += numpages;
+ shm_unlock(context, shp);
return shp->id;

no_id:
@@ -269,19 +284,20 @@ asmlinkage long sys_shmget (key_t key, s
{
struct shmid_kernel *shp;
int err, id = 0;
+ struct ipc_shm_context *context = current->ipc_shm_context;

- down(&shm_ids.sem);
+ down(&context->ids.sem);
if (key == IPC_PRIVATE) {
- err = newseg(key, shmflg, size);
- } else if ((id = ipc_findkey(&shm_ids, key)) == -1) {
+ err = newseg(context, key, shmflg, size);
+ } else if ((id = ipc_findkey(&context->ids, key)) == -1) {
if (!(shmflg & IPC_CREAT))
err = -ENOENT;
else
- err = newseg(key, shmflg, size);
+ err = newseg(context, key, shmflg, size);
} else if ((shmflg & IPC_CREAT) && (shmflg & IPC_EXCL)) {
err = -EEXIST;
} else {
- shp = shm_lock(id);
+ shp = shm_lock(context, id);
if(shp==NULL)
BUG();
if (shp->shm_segsz < size)
@@ -289,14 +305,14 @@ asmlinkage long sys_shmget (key_t key, s
else if (ipcperms(&shp->shm_perm, shmflg))
err = -EACCES;
else {
- int shmid = shm_buildid(id, shp->shm_perm.seq);
+ int shmid = shm_buildid(context, id, shp->shm_perm.seq);
err = security_shm_associate(shp, shmflg);
if (!err)
err = shmid;
}
- shm_unlock(shp);
+ shm_unlock(context, shp);
}
- up(&shm_ids.sem);
+ up(&context->ids.sem);

return err;
}
@@ -392,18 +408,19 @@ static inline unsigned long copy_shminfo
}
}

-static void shm_get_stat(unsigned long *rss, unsigned long *swp)
+static void shm_get_stat(struct ipc_shm_context *context,
+ unsigned long *rss, unsigned long *swp)
{
int i;

*rss = 0;
*swp = 0;

- for (i = 0; i <= shm_ids.max_id; i++) {
+ for (i = 0; i <= context->ids.max_id; i++) {
struct shmid_kernel *shp;
struct inode *inode;

- shp = shm_get(i);
+ shp = shm_get(context, i);
if(!shp)
continue;

@@ -427,6 +444,7 @@ asmlinkage long sys_shmctl (int shmid, i
struct shm_setbuf setbuf;
struct shmid_kernel *shp;
int err, version;
+ struct ipc_shm_context *context = current->ipc_shm_context;

if (cmd < 0 || shmid < 0) {
err = -EINVAL;
@@ -453,7 +471,7 @@ asmlinkage long sys_shmctl (int shmid, i
if(copy_shminfo_to_user (buf, &shminfo, version))
return -EFAULT;
/* reading a integer is always atomic */
- err= shm_ids.max_id;
+ err= context->ids.max_id;
if(err<0)
err = 0;
goto out;
@@ -467,14 +485,14 @@ asmlinkage long sys_shmctl (int shmid, i
return err;

memset(&shm_info,0,sizeof(shm_info));
- down(&shm_ids.sem);
- shm_info.used_ids = shm_ids.in_use;
- shm_get_stat (&shm_info.shm_rss, &shm_info.shm_swp);
- shm_info.shm_tot = shm_tot;
+ down(&context->ids.sem);
+ shm_info.used_ids = context->ids.in_use;
+ shm_get_stat (context, &shm_info.shm_rss, &shm_info.shm_swp);
+ shm_info.shm_tot = context->tot;
shm_info.swap_attempts = 0;
shm_info.swap_successes = 0;
- err = shm_ids.max_id;
- up(&shm_ids.sem);
+ err = context->ids.max_id;
+ up(&context->ids.sem);
if(copy_to_user (buf, &shm_info, sizeof(shm_info))) {
err = -EFAULT;
goto out;
@@ -489,17 +507,17 @@ asmlinkage long sys_shmctl (int shmid, i
struct shmid64_ds tbuf;
int result;
memset(&tbuf, 0, sizeof(tbuf));
- shp = shm_lock(shmid);
+ shp = shm_lock(context, shmid);
if(shp==NULL) {
err = -EINVAL;
goto out;
} else if(cmd==SHM_STAT) {
err = -EINVAL;
- if (shmid > shm_ids.max_id)
+ if (shmid > context->ids.max_id)
goto out_unlock;
- result = shm_buildid(shmid, shp->shm_perm.seq);
+ result = shm_buildid(context, shmid, shp->shm_perm.seq);
} else {
- err = shm_checkid(shp,shmid);
+ err = shm_checkid(context,shp,shmid);
if(err)
goto out_unlock;
result = 0;
@@ -521,7 +539,7 @@ asmlinkage long sys_shmctl (int shmid, i
tbuf.shm_nattch = shp->shm_nattch;
else
tbuf.shm_nattch = file_count(shp->shm_file) - 1;
- shm_unlock(shp);
+ shm_unlock(context, shp);
if(copy_shmid_to_user (buf, &tbuf, version))
err = -EFAULT;
else
@@ -531,12 +549,12 @@ asmlinkage long sys_shmctl (int shmid, i
case SHM_LOCK:
case SHM_UNLOCK:
{
- shp = shm_lock(shmid);
+ shp = shm_lock(context, shmid);
if(shp==NULL) {
err = -EINVAL;
goto out;
}
- err = shm_checkid(shp,shmid);
+ err = shm_checkid(context,shp,shmid);
if(err)
goto out_unlock;

@@ -568,7 +586,7 @@ asmlinkage long sys_shmctl (int shmid, i
shp->shm_perm.mode &= ~SHM_LOCKED;
shp->mlock_user = NULL;
}
- shm_unlock(shp);
+ shm_unlock(context, shp);
goto out;
}
case IPC_RMID:
@@ -583,12 +601,12 @@ asmlinkage long sys_shmctl (int shmid, i
* Instead we set a destroyed flag, and then blow
* the name away when the usage hits zero.
*/
- down(&shm_ids.sem);
- shp = shm_lock(shmid);
+ down(&context->ids.sem);
+ shp = shm_lock(context, shmid);
err = -EINVAL;
if (shp == NULL)
goto out_up;
- err = shm_checkid(shp, shmid);
+ err = shm_checkid(context, shp, shmid);
if(err)
goto out_unlock_up;

@@ -607,10 +625,10 @@ asmlinkage long sys_shmctl (int shmid, i
shp->shm_perm.mode |= SHM_DEST;
/* Do not find it any more */
shp->shm_perm.key = IPC_PRIVATE;
- shm_unlock(shp);
+ shm_unlock(context, shp);
} else
- shm_destroy (shp);
- up(&shm_ids.sem);
+ shm_destroy (context, shp);
+ up(&context->ids.sem);
goto out;
}

@@ -622,12 +640,12 @@ asmlinkage long sys_shmctl (int shmid, i
}
if ((err = audit_ipc_perms(0, setbuf.uid, setbuf.gid, setbuf.mode)))
return err;
- down(&shm_ids.sem);
- shp = shm_lock(shmid);
+ down(&context->ids.sem);
+ shp = shm_lock(context, shmid);
err=-EINVAL;
if(shp==NULL)
goto out_up;
- err = shm_checkid(shp,shmid);
+ err = shm_checkid(context,shp,shmid);
if(err)
goto out_unlock_up;
err=-EPERM;
@@ -656,12 +674,12 @@ asmlinkage long sys_shmctl (int shmid, i

err = 0;
out_unlock_up:
- shm_unlock(shp);
+ shm_unlock(context, shp);
out_up:
- up(&shm_ids.sem);
+ up(&context->ids.sem);
goto out;
out_unlock:
- shm_unlock(shp);
+ shm_unlock(context, shp);
out:
return err;
}
@@ -675,6 +693,7 @@ out:
*/
long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr)
{
+ struct ipc_shm_context *context = current->ipc_shm_context;
struct shmid_kernel *shp;
unsigned long addr;
unsigned long size;
@@ -725,32 +744,32 @@ long do_shmat(int shmid, char __user *sh
* We cannot rely on the fs check since SYSV IPC does have an
* additional creator id...
*/
- shp = shm_lock(shmid);
+ shp = shm_lock(context, shmid);
if(shp == NULL) {
err = -EINVAL;
goto out;
}
- err = shm_checkid(shp,shmid);
+ err = shm_checkid(context,shp,shmid);
if (err) {
- shm_unlock(shp);
+ shm_unlock(context, shp);
goto out;
}
if (ipcperms(&shp->shm_perm, acc_mode)) {
- shm_unlock(shp);
+ shm_unlock(context, shp);
err = -EACCES;
goto out;
}

err = security_shm_shmat(shp, shmaddr, shmflg);
if (err) {
- shm_unlock(shp);
+ shm_unlock(context, shp);
return err;
}

file = shp->shm_file;
size = i_size_read(file->f_dentry->d_inode);
shp->shm_nattch++;
- shm_unlock(shp);
+ shm_unlock(context, shp);

down_write(&current->mm->mmap_sem);
if (addr && !(shmflg & SHM_REMAP)) {
@@ -771,16 +790,16 @@ long do_shmat(int shmid, char __user *sh
invalid:
up_write(&current->mm->mmap_sem);

- down (&shm_ids.sem);
- if(!(shp = shm_lock(shmid)))
+ down (&context->ids.sem);
+ if(!(shp = shm_lock(context, shmid)))
BUG();
shp->shm_nattch--;
if(shp->shm_nattch == 0 &&
shp->shm_perm.mode & SHM_DEST)
- shm_destroy (shp);
+ shm_destroy (context, shp);
else
- shm_unlock(shp);
- up (&shm_ids.sem);
+ shm_unlock(context, shp);
+ up (&context->ids.sem);

*raddr = (unsigned long) user_addr;
err = 0;
diff -puN ipc/util.c~sysvshm-container ipc/util.c
--- work/ipc/util.c~sysvshm-container 2006-03-06 15:41:59.000000000 -0800
+++ work-dave/ipc/util.c 2006-03-06 15:41:59.000000000 -0800
@@ -46,6 +46,7 @@ struct ipc_proc_iface {
* memory are initialised
*/

+struct ipc_shm_context *ipc_shm_context;
static int __init ipc_init(void)
{
init_task.ipc_context = kzalloc(sizeof(*ipc_context), GFP_KERNEL);
@@ -54,7 +55,7 @@ static int __init ipc_init(void)

sem_init(&init_task.ipc_context->sem);
msg_init(&init_task.ipc_context->msg);
- shm_init();
+ shm_init(&init_task.ipc_context->shm);
return 0;
}
__initcall(ipc_init);
diff -puN ipc/util.h~sysvshm-container ipc/util.h
--- work/ipc/util.h~sysvshm-container 2006-03-06 15:41:59.000000000 -0800
+++ work-dave/ipc/util.h 2006-03-06 15:41:59.000000000 -0800
@@ -44,14 +44,21 @@ struct ipc_sem_context {
int nr_used;
};

+struct ipc_shm_context {
+ struct ipc_ids ids;
+
+ int tot; /* total number of shared memory pages */
+};
+
struct ipc_context {
struct ipc_msg_context msg;
struct ipc_sem_context sem;
+ struct ipc_shm_context shm;
};

void sem_init (struct ipc_sem_context *context);
void msg_init (struct ipc_msg_context *context);
-void shm_init (void);
+void shm_init (struct ipc_shm_context *context);

struct seq_file;
void __init ipc_init_ids(struct ipc_ids* ids, int size);
_

2006-03-06 23:53:22

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 4/6] sysvsem: containerize


Same thing as the msgs.

Take all of the global sysv sem context and put it inside of
a special structure. This allows future work to give different
containers completely different views into the sysv space.

Signed-off-by: Dave Hansen <[email protected]>
---

work-dave/ipc/sem.c | 194 ++++++++++++++++++++++++++-------------------------
work-dave/ipc/util.c | 2
work-dave/ipc/util.h | 15 ++-
3 files changed, 114 insertions(+), 97 deletions(-)

diff -puN ipc/sem.c~sysvsem-container ipc/sem.c
--- work/ipc/sem.c~sysvsem-container 2006-03-06 15:41:58.000000000 -0800
+++ work-dave/ipc/sem.c 2006-03-06 15:41:58.000000000 -0800
@@ -79,17 +79,19 @@
#include "util.h"


-#define sem_lock(id) ((struct sem_array*)ipc_lock(&sem_ids,id))
-#define sem_unlock(sma) ipc_unlock(&(sma)->sem_perm)
-#define sem_rmid(id) ((struct sem_array*)ipc_rmid(&sem_ids,id))
-#define sem_checkid(sma, semid) \
- ipc_checkid(&sem_ids,&sma->sem_perm,semid)
-#define sem_buildid(id, seq) \
- ipc_buildid(&sem_ids, id, seq)
-static struct ipc_ids sem_ids;
+#define sem_lock(context, id) \
+ ((struct sem_array*)ipc_lock(&context->ids,id))
+#define sem_unlock(context, sma) \
+ ipc_unlock(&(sma)->sem_perm)
+#define sem_rmid(context, id) \
+ ((struct sem_array*)ipc_rmid(&context->ids,id))
+#define sem_checkid(context, sma, semid) \
+ ipc_checkid(&context->ids,&sma->sem_perm,semid)
+#define sem_buildid(context, id, seq) \
+ ipc_buildid(&context->ids, id, seq)

-static int newary (key_t, int, int);
-static void freeary (struct sem_array *sma, int id);
+static int newary (struct ipc_sem_context *, key_t, int, int);
+static void freeary (struct ipc_sem_context *, struct sem_array *, int);
#ifdef CONFIG_PROC_FS
static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
#endif
@@ -112,15 +114,12 @@ int sem_ctls[4] = {SEMMSL, SEMMNS, SEMOP
#define sc_semopm (sem_ctls[2])
#define sc_semmni (sem_ctls[3])

-static int used_sems;
-
-void __init sem_init (void)
+void __init sem_init (struct ipc_sem_context *context)
{
- used_sems = 0;
- ipc_init_ids(&sem_ids,sc_semmni);
+ ipc_init_ids(&context->ids,sc_semmni);
ipc_init_proc_interface("sysvipc/sem",
" key semid perms nsems uid gid cuid cgid otime ctime\n",
- &sem_ids,
+ &context->ids,
sysvipc_sem_proc_show);
}

@@ -158,7 +157,8 @@ void __init sem_init (void)
*/
#define IN_WAKEUP 1

-static int newary (key_t key, int nsems, int semflg)
+static int newary (struct ipc_sem_context *context,
+ key_t key, int nsems, int semflg)
{
int id;
int retval;
@@ -167,7 +167,7 @@ static int newary (key_t key, int nsems,

if (!nsems)
return -EINVAL;
- if (used_sems + nsems > sc_semmns)
+ if (context->nr_used + nsems > sc_semmns)
return -ENOSPC;

size = sizeof (*sma) + nsems * sizeof (struct sem);
@@ -187,22 +187,22 @@ static int newary (key_t key, int nsems,
return retval;
}

- id = ipc_addid(&sem_ids, &sma->sem_perm, sc_semmni);
+ id = ipc_addid(&context->ids, &sma->sem_perm, sc_semmni);
if(id == -1) {
security_sem_free(sma);
ipc_rcu_putref(sma);
return -ENOSPC;
}
- used_sems += nsems;
+ context->nr_used += nsems;

- sma->sem_id = sem_buildid(id, sma->sem_perm.seq);
+ sma->sem_id = sem_buildid(context, id, sma->sem_perm.seq);
sma->sem_base = (struct sem *) &sma[1];
/* sma->sem_pending = NULL; */
sma->sem_pending_last = &sma->sem_pending;
/* sma->undo = NULL; */
sma->sem_nsems = nsems;
sma->sem_ctime = get_seconds();
- sem_unlock(sma);
+ sem_unlock(context, sma);

return sma->sem_id;
}
@@ -211,22 +211,23 @@ asmlinkage long sys_semget (key_t key, i
{
int id, err = -EINVAL;
struct sem_array *sma;
+ struct ipc_sem_context *context = &current->ipc_context->sem;

if (nsems < 0 || nsems > sc_semmsl)
return -EINVAL;
- down(&sem_ids.sem);
+ down(&context->ids.sem);

if (key == IPC_PRIVATE) {
- err = newary(key, nsems, semflg);
- } else if ((id = ipc_findkey(&sem_ids, key)) == -1) { /* key not used */
+ err = newary(context, key, nsems, semflg);
+ } else if ((id = ipc_findkey(&context->ids, key)) == -1) { /* key not used */
if (!(semflg & IPC_CREAT))
err = -ENOENT;
else
- err = newary(key, nsems, semflg);
+ err = newary(context, key, nsems, semflg);
} else if (semflg & IPC_CREAT && semflg & IPC_EXCL) {
err = -EEXIST;
} else {
- sma = sem_lock(id);
+ sma = sem_lock(context, id);
if(sma==NULL)
BUG();
if (nsems > sma->sem_nsems)
@@ -234,15 +235,15 @@ asmlinkage long sys_semget (key_t key, i
else if (ipcperms(&sma->sem_perm, semflg))
err = -EACCES;
else {
- int semid = sem_buildid(id, sma->sem_perm.seq);
+ int semid = sem_buildid(context, id, sma->sem_perm.seq);
err = security_sem_associate(sma, semflg);
if (!err)
err = semid;
}
- sem_unlock(sma);
+ sem_unlock(context, sma);
}

- up(&sem_ids.sem);
+ up(&context->ids.sem);
return err;
}

@@ -437,11 +438,12 @@ static int count_semzcnt (struct sem_arr
return semzcnt;
}

-/* Free a semaphore set. freeary() is called with sem_ids.sem down and
- * the spinlock for this semaphore set hold. sem_ids.sem remains locked
+/* Free a semaphore set. freeary() is called with context->ids.sem down and
+ * the spinlock for this semaphore set hold. context->ids.sem remains locked
* on exit.
*/
-static void freeary (struct sem_array *sma, int id)
+static void freeary (struct ipc_sem_context *context,
+ struct sem_array *sma, int id)
{
struct sem_undo *un;
struct sem_queue *q;
@@ -469,10 +471,10 @@ static void freeary (struct sem_array *s
}

/* Remove the semaphore set from the ID array*/
- sma = sem_rmid(id);
- sem_unlock(sma);
+ sma = sem_rmid(context, id);
+ sem_unlock(context, sma);

- used_sems -= sma->sem_nsems;
+ context->nr_used -= sma->sem_nsems;
size = sizeof (*sma) + sma->sem_nsems * sizeof (struct sem);
security_sem_free(sma);
ipc_rcu_putref(sma);
@@ -500,7 +502,8 @@ static unsigned long copy_semid_to_user(
}
}

-static int semctl_nolock(int semid, int semnum, int cmd, int version, union semun arg)
+static int semctl_nolock(struct ipc_sem_context *context,
+ int semid, int semnum, int cmd, int version, union semun arg)
{
int err = -EINVAL;
struct sem_array *sma;
@@ -525,16 +528,16 @@ static int semctl_nolock(int semid, int
seminfo.semmnu = SEMMNU;
seminfo.semmap = SEMMAP;
seminfo.semume = SEMUME;
- down(&sem_ids.sem);
+ down(&context->ids.sem);
if (cmd == SEM_INFO) {
- seminfo.semusz = sem_ids.in_use;
- seminfo.semaem = used_sems;
+ seminfo.semusz = context->ids.in_use;
+ seminfo.semaem = context->nr_used;
} else {
seminfo.semusz = SEMUSZ;
seminfo.semaem = SEMAEM;
}
- max_id = sem_ids.max_id;
- up(&sem_ids.sem);
+ max_id = context->ids.max_id;
+ up(&context->ids.sem);
if (copy_to_user (arg.__buf, &seminfo, sizeof(struct seminfo)))
return -EFAULT;
return (max_id < 0) ? 0: max_id;
@@ -544,12 +547,12 @@ static int semctl_nolock(int semid, int
struct semid64_ds tbuf;
int id;

- if(semid >= sem_ids.entries->size)
+ if(semid >= context->ids.entries->size)
return -EINVAL;

memset(&tbuf,0,sizeof(tbuf));

- sma = sem_lock(semid);
+ sma = sem_lock(context, semid);
if(sma == NULL)
return -EINVAL;

@@ -561,13 +564,13 @@ static int semctl_nolock(int semid, int
if (err)
goto out_unlock;

- id = sem_buildid(semid, sma->sem_perm.seq);
+ id = sem_buildid(context, semid, sma->sem_perm.seq);

kernel_to_ipc64_perm(&sma->sem_perm, &tbuf.sem_perm);
tbuf.sem_otime = sma->sem_otime;
tbuf.sem_ctime = sma->sem_ctime;
tbuf.sem_nsems = sma->sem_nsems;
- sem_unlock(sma);
+ sem_unlock(context, sma);
if (copy_semid_to_user (arg.buf, &tbuf, version))
return -EFAULT;
return id;
@@ -577,11 +580,13 @@ static int semctl_nolock(int semid, int
}
return err;
out_unlock:
- sem_unlock(sma);
+ sem_unlock(context, sma);
return err;
}

-static int semctl_main(int semid, int semnum, int cmd, int version, union semun arg)
+static int semctl_main(struct ipc_sem_context *context,
+ int semid, int semnum, int cmd,
+ int version, union semun arg)
{
struct sem_array *sma;
struct sem* curr;
@@ -590,14 +595,14 @@ static int semctl_main(int semid, int se
ushort* sem_io = fast_sem_io;
int nsems;

- sma = sem_lock(semid);
+ sma = sem_lock(context, semid);
if(sma==NULL)
return -EINVAL;

nsems = sma->sem_nsems;

err=-EIDRM;
- if (sem_checkid(sma,semid))
+ if (sem_checkid(context, sma,semid))
goto out_unlock;

err = -EACCES;
@@ -617,20 +622,20 @@ static int semctl_main(int semid, int se

if(nsems > SEMMSL_FAST) {
ipc_rcu_getref(sma);
- sem_unlock(sma);
+ sem_unlock(context, sma);

sem_io = ipc_alloc(sizeof(ushort)*nsems);
if(sem_io == NULL) {
ipc_lock_by_ptr(&sma->sem_perm);
ipc_rcu_putref(sma);
- sem_unlock(sma);
+ sem_unlock(context, sma);
return -ENOMEM;
}

ipc_lock_by_ptr(&sma->sem_perm);
ipc_rcu_putref(sma);
if (sma->sem_perm.deleted) {
- sem_unlock(sma);
+ sem_unlock(context, sma);
err = -EIDRM;
goto out_free;
}
@@ -638,7 +643,7 @@ static int semctl_main(int semid, int se

for (i = 0; i < sma->sem_nsems; i++)
sem_io[i] = sma->sem_base[i].semval;
- sem_unlock(sma);
+ sem_unlock(context, sma);
err = 0;
if(copy_to_user(array, sem_io, nsems*sizeof(ushort)))
err = -EFAULT;
@@ -650,14 +655,14 @@ static int semctl_main(int semid, int se
struct sem_undo *un;

ipc_rcu_getref(sma);
- sem_unlock(sma);
+ sem_unlock(context, sma);

if(nsems > SEMMSL_FAST) {
sem_io = ipc_alloc(sizeof(ushort)*nsems);
if(sem_io == NULL) {
ipc_lock_by_ptr(&sma->sem_perm);
ipc_rcu_putref(sma);
- sem_unlock(sma);
+ sem_unlock(context, sma);
return -ENOMEM;
}
}
@@ -665,7 +670,7 @@ static int semctl_main(int semid, int se
if (copy_from_user (sem_io, arg.array, nsems*sizeof(ushort))) {
ipc_lock_by_ptr(&sma->sem_perm);
ipc_rcu_putref(sma);
- sem_unlock(sma);
+ sem_unlock(context, sma);
err = -EFAULT;
goto out_free;
}
@@ -674,7 +679,7 @@ static int semctl_main(int semid, int se
if (sem_io[i] > SEMVMX) {
ipc_lock_by_ptr(&sma->sem_perm);
ipc_rcu_putref(sma);
- sem_unlock(sma);
+ sem_unlock(context, sma);
err = -ERANGE;
goto out_free;
}
@@ -682,7 +687,7 @@ static int semctl_main(int semid, int se
ipc_lock_by_ptr(&sma->sem_perm);
ipc_rcu_putref(sma);
if (sma->sem_perm.deleted) {
- sem_unlock(sma);
+ sem_unlock(context, sma);
err = -EIDRM;
goto out_free;
}
@@ -706,7 +711,7 @@ static int semctl_main(int semid, int se
tbuf.sem_otime = sma->sem_otime;
tbuf.sem_ctime = sma->sem_ctime;
tbuf.sem_nsems = sma->sem_nsems;
- sem_unlock(sma);
+ sem_unlock(context, sma);
if (copy_semid_to_user (arg.buf, &tbuf, version))
return -EFAULT;
return 0;
@@ -752,7 +757,7 @@ static int semctl_main(int semid, int se
}
}
out_unlock:
- sem_unlock(sma);
+ sem_unlock(context, sma);
out_free:
if(sem_io != fast_sem_io)
ipc_free(sem_io, sizeof(ushort)*nsems);
@@ -799,7 +804,9 @@ static inline unsigned long copy_semid_f
}
}

-static int semctl_down(int semid, int semnum, int cmd, int version, union semun arg)
+static int semctl_down(struct ipc_sem_context *context,
+ int semid, int semnum, int cmd,
+ int version, union semun arg)
{
struct sem_array *sma;
int err;
@@ -812,11 +819,11 @@ static int semctl_down(int semid, int se
if ((err = audit_ipc_perms(0, setbuf.uid, setbuf.gid, setbuf.mode)))
return err;
}
- sma = sem_lock(semid);
+ sma = sem_lock(context, semid);
if(sma==NULL)
return -EINVAL;

- if (sem_checkid(sma,semid)) {
+ if (sem_checkid(context, sma,semid)) {
err=-EIDRM;
goto out_unlock;
}
@@ -834,7 +841,7 @@ static int semctl_down(int semid, int se

switch(cmd){
case IPC_RMID:
- freeary(sma, semid);
+ freeary(context, sma, semid);
err = 0;
break;
case IPC_SET:
@@ -843,18 +850,18 @@ static int semctl_down(int semid, int se
ipcp->mode = (ipcp->mode & ~S_IRWXUGO)
| (setbuf.mode & S_IRWXUGO);
sma->sem_ctime = get_seconds();
- sem_unlock(sma);
+ sem_unlock(context, sma);
err = 0;
break;
default:
- sem_unlock(sma);
+ sem_unlock(context, sma);
err = -EINVAL;
break;
}
return err;

out_unlock:
- sem_unlock(sma);
+ sem_unlock(context, sma);
return err;
}

@@ -862,6 +869,7 @@ asmlinkage long sys_semctl (int semid, i
{
int err = -EINVAL;
int version;
+ struct ipc_sem_context *context = &current->ipc_context->sem;

if (semid < 0)
return -EINVAL;
@@ -872,7 +880,7 @@ asmlinkage long sys_semctl (int semid, i
case IPC_INFO:
case SEM_INFO:
case SEM_STAT:
- err = semctl_nolock(semid,semnum,cmd,version,arg);
+ err = semctl_nolock(context, semid,semnum,cmd,version,arg);
return err;
case GETALL:
case GETVAL:
@@ -882,13 +890,13 @@ asmlinkage long sys_semctl (int semid, i
case IPC_STAT:
case SETVAL:
case SETALL:
- err = semctl_main(semid,semnum,cmd,version,arg);
+ err = semctl_main(context, semid,semnum,cmd,version,arg);
return err;
case IPC_RMID:
case IPC_SET:
- down(&sem_ids.sem);
- err = semctl_down(semid,semnum,cmd,version,arg);
- up(&sem_ids.sem);
+ down(&context->ids.sem);
+ err = semctl_down(context, semid,semnum,cmd,version,arg);
+ up(&context->ids.sem);
return err;
default:
return -EINVAL;
@@ -976,7 +984,7 @@ static struct sem_undo *lookup_undo(stru
return un;
}

-static struct sem_undo *find_undo(int semid)
+static struct sem_undo *find_undo(struct ipc_sem_context *context, int semid)
{
struct sem_array *sma;
struct sem_undo_list *ulp;
@@ -995,24 +1003,24 @@ static struct sem_undo *find_undo(int se
goto out;

/* no undo structure around - allocate one. */
- sma = sem_lock(semid);
+ sma = sem_lock(context, semid);
un = ERR_PTR(-EINVAL);
if(sma==NULL)
goto out;
un = ERR_PTR(-EIDRM);
- if (sem_checkid(sma,semid)) {
- sem_unlock(sma);
+ if (sem_checkid(context, sma,semid)) {
+ sem_unlock(context, sma);
goto out;
}
nsems = sma->sem_nsems;
ipc_rcu_getref(sma);
- sem_unlock(sma);
+ sem_unlock(context, sma);

new = (struct sem_undo *) kmalloc(sizeof(struct sem_undo) + sizeof(short)*nsems, GFP_KERNEL);
if (!new) {
ipc_lock_by_ptr(&sma->sem_perm);
ipc_rcu_putref(sma);
- sem_unlock(sma);
+ sem_unlock(context, sma);
return ERR_PTR(-ENOMEM);
}
memset(new, 0, sizeof(struct sem_undo) + sizeof(short)*nsems);
@@ -1026,13 +1034,13 @@ static struct sem_undo *find_undo(int se
kfree(new);
ipc_lock_by_ptr(&sma->sem_perm);
ipc_rcu_putref(sma);
- sem_unlock(sma);
+ sem_unlock(context, sma);
goto out;
}
ipc_lock_by_ptr(&sma->sem_perm);
ipc_rcu_putref(sma);
if (sma->sem_perm.deleted) {
- sem_unlock(sma);
+ sem_unlock(context, sma);
unlock_semundo();
kfree(new);
un = ERR_PTR(-EIDRM);
@@ -1042,7 +1050,7 @@ static struct sem_undo *find_undo(int se
ulp->proc_list = new;
new->id_next = sma->undo;
sma->undo = new;
- sem_unlock(sma);
+ sem_unlock(context, sma);
un = new;
unlock_semundo();
out:
@@ -1052,6 +1060,7 @@ out:
asmlinkage long sys_semtimedop(int semid, struct sembuf __user *tsops,
unsigned nsops, const struct timespec __user *timeout)
{
+ struct ipc_sem_context *context = &current->ipc_context->sem;
int error = -EINVAL;
struct sem_array *sma;
struct sembuf fast_sops[SEMOPM_FAST];
@@ -1099,7 +1108,7 @@ asmlinkage long sys_semtimedop(int semid

retry_undos:
if (undos) {
- un = find_undo(semid);
+ un = find_undo(context, semid);
if (IS_ERR(un)) {
error = PTR_ERR(un);
goto out_free;
@@ -1107,12 +1116,12 @@ retry_undos:
} else
un = NULL;

- sma = sem_lock(semid);
+ sma = sem_lock(context, semid);
error=-EINVAL;
if(sma==NULL)
goto out_free;
error = -EIDRM;
- if (sem_checkid(sma,semid))
+ if (sem_checkid(context, sma,semid))
goto out_unlock_free;
/*
* semid identifies are not unique - find_undo may have
@@ -1120,7 +1129,7 @@ retry_undos:
* and now a new array with received the same id. Check and retry.
*/
if (un && un->semid == -1) {
- sem_unlock(sma);
+ sem_unlock(context, sma);
goto retry_undos;
}
error = -EFBIG;
@@ -1161,7 +1170,7 @@ retry_undos:
queue.status = -EINTR;
queue.sleeper = current;
current->state = TASK_INTERRUPTIBLE;
- sem_unlock(sma);
+ sem_unlock(context, sma);

if (timeout)
jiffies_left = schedule_timeout(jiffies_left);
@@ -1180,7 +1189,7 @@ retry_undos:
goto out_free;
}

- sma = sem_lock(semid);
+ sma = sem_lock(context, semid);
if(sma==NULL) {
if(queue.prev != NULL)
BUG();
@@ -1205,7 +1214,7 @@ retry_undos:
goto out_unlock_free;

out_unlock_free:
- sem_unlock(sma);
+ sem_unlock(context, sma);
out_free:
if(sops != fast_sops)
kfree(sops);
@@ -1258,6 +1267,7 @@ void exit_sem(struct task_struct *tsk)
{
struct sem_undo_list *undo_list;
struct sem_undo *u, **up;
+ struct ipc_sem_context *context = &tsk->ipc_context->sem;

undo_list = tsk->sysvsem.undo_list;
if (!undo_list)
@@ -1279,14 +1289,14 @@ void exit_sem(struct task_struct *tsk)

if(semid == -1)
continue;
- sma = sem_lock(semid);
+ sma = sem_lock(context, semid);
if (sma == NULL)
continue;

if (u->semid == -1)
goto next_entry;

- BUG_ON(sem_checkid(sma,u->semid));
+ BUG_ON(sem_checkid(context, sma,u->semid));

/* remove u from the sma->undo list */
for (unp = &sma->undo; (un = *unp); unp = &un->id_next) {
@@ -1327,7 +1337,7 @@ found:
/* maybe some queued-up processes were waiting for this */
update_queue(sma);
next_entry:
- sem_unlock(sma);
+ sem_unlock(context, sma);
}
kfree(undo_list);
}
diff -puN ipc/util.c~sysvsem-container ipc/util.c
--- work/ipc/util.c~sysvsem-container 2006-03-06 15:41:58.000000000 -0800
+++ work-dave/ipc/util.c 2006-03-06 15:41:58.000000000 -0800
@@ -52,7 +52,7 @@ static int __init ipc_init(void)
if (!init_task.ipc_context)
return -ENOMEM;

- sem_init();
+ sem_init(&init_task.ipc_context->sem);
msg_init(&init_task.ipc_context->msg);
shm_init();
return 0;
diff -puN ipc/util.h~sysvsem-container ipc/util.h
--- work/ipc/util.h~sysvsem-container 2006-03-06 15:41:58.000000000 -0800
+++ work-dave/ipc/util.h 2006-03-06 15:41:58.000000000 -0800
@@ -11,10 +11,6 @@
#define USHRT_MAX 0xffff
#define SEQ_MULTIPLIER (IPCMNI)

-void sem_init (void);
-void msg_init (struct ipc_msg_context *context);
-void shm_init (void);
-
struct ipc_id_ary {
int size;
struct kern_ipc_perm *p[0];
@@ -42,10 +38,21 @@ struct ipc_msg_context {
struct kref count;
};

+struct ipc_sem_context {
+ struct ipc_ids ids;
+
+ int nr_used;
+};
+
struct ipc_context {
struct ipc_msg_context msg;
+ struct ipc_sem_context sem;
};

+void sem_init (struct ipc_sem_context *context);
+void msg_init (struct ipc_msg_context *context);
+void shm_init (void);
+
struct seq_file;
void __init ipc_init_ids(struct ipc_ids* ids, int size);
#ifdef CONFIG_PROC_FS
_

2006-03-07 00:50:05

by Herbert Poetzl

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] prepare sysctls for containers

On Mon, Mar 06, 2006 at 03:52:49PM -0800, Dave Hansen wrote:
>
> Right now, sysctls can only deal with global variables. This
> patch makes them a _little_ more flexible by allowing there to
> be an accessor function to get at the variable being changed,
> instead of it being global.
>
> This allows the sysctls to be backed by variables that are,
> for instance, dynamically allocated and not available at
> compile-time.
>
> This also provides a very simple mechanism to take things that
> are currently global and containerize them.

hmm, why do you call the sysctl_table_data() over and
over again? what's the purpose? what about sideeffects?

> Signed-off-by: Dave Hansen <[email protected]>
> ---
>
> work-dave/include/linux/sysctl.h | 8 ++++
> work-dave/kernel/sysctl.c | 65 ++++++++++++++++++++++++++-------------
> 2 files changed, 52 insertions(+), 21 deletions(-)
>
> diff -puN include/linux/sysctl.h~sysctls-for-containers include/linux/sysctl.h
> --- work/include/linux/sysctl.h~sysctls-for-containers 2006-03-06 15:41:55.000000000 -0800
> +++ work-dave/include/linux/sysctl.h 2006-03-06 15:41:55.000000000 -0800
> @@ -872,6 +872,7 @@ extern void sysctl_init(void);
>
> typedef struct ctl_table ctl_table;
>
> +typedef void *ctl_data_access (void);
> typedef int ctl_handler (ctl_table *table, int __user *name, int nlen,
> void __user *oldval, size_t __user *oldlenp,
> void __user *newval, size_t newlen,
> @@ -957,6 +958,13 @@ struct ctl_table
> int ctl_name; /* Binary ID */
> const char *procname; /* Text ID for /proc/sys, or zero */
> void *data;
> + ctl_data_access *data_access; /* set this to a function if you
> + * don't have a static place to point
> + * ->data at compile-time. This
> + * function will be called to dynamically
> + * figure out a ->data pointer. Do not
> + * set this and ->data at once.
> + */
> int maxlen;
> mode_t mode;
> ctl_table *child;
> diff -puN kernel/sysctl.c~sysctls-for-containers kernel/sysctl.c
> --- work/kernel/sysctl.c~sysctls-for-containers 2006-03-06 15:41:55.000000000 -0800
> +++ work-dave/kernel/sysctl.c 2006-03-06 15:41:55.000000000 -0800
> @@ -1197,6 +1197,24 @@ repeat:
> return -ENOTDIR;
> }
>

I'd expect that to be inline, and to vanish when
containers are disabled ...

> +void *sysctl_table_data(ctl_table *table)
> +{
> + void *data;
> +
> + if (table->data && table->data_access) {
> + printk(KERN_WARNING
> + "sysctl: data and accessor function set for: '%s'\n",
> + table->procname);
> + table->data = NULL;

why is ->data and ->data_access evil?

> + }
> +
> + data = table->data;
> + if (!data && table->data_access)
> + data = table->data_access();
> +
> + return data;
> +}
> +
> /* Perform the actual read/write of a sysctl table entry. */
> int do_sysctl_strategy (ctl_table *table,
> int __user *name, int nlen,
> @@ -1205,6 +1223,7 @@ int do_sysctl_strategy (ctl_table *table
> {
> int op = 0, rc;
> size_t len;
> + void *data;
>
> if (oldval)
> op |= 004;
> @@ -1224,14 +1243,15 @@ int do_sysctl_strategy (ctl_table *table
>
> /* If there is no strategy routine, or if the strategy returns
> * zero, proceed with automatic r/w */
> - if (table->data && table->maxlen) {
> + data = sysctl_table_data(table);
> + if (data && table->maxlen) {
> if (oldval && oldlenp) {
> if (get_user(len, oldlenp))
> return -EFAULT;
> if (len) {
> if (len > table->maxlen)
> len = table->maxlen;
> - if(copy_to_user(oldval, table->data, len))
> + if(copy_to_user(oldval, data, len))
> return -EFAULT;
> if(put_user(len, oldlenp))
> return -EFAULT;
> @@ -1241,7 +1261,7 @@ int do_sysctl_strategy (ctl_table *table
> len = newlen;
> if (len > table->maxlen)
> len = table->maxlen;
> - if(copy_from_user(table->data, newval, len))
> + if(copy_from_user(data, newval, len))
> return -EFAULT;
> }
> }
> @@ -1539,7 +1559,7 @@ int proc_dostring(ctl_table *table, int
> char __user *p;
> char c;
>
> - if (!table->data || !table->maxlen || !*lenp ||
> + if (!sysctl_table_data(table) || !table->maxlen || !*lenp ||
> (*ppos && !write)) {
> *lenp = 0;
> return 0;
> @@ -1557,18 +1577,18 @@ int proc_dostring(ctl_table *table, int
> }
> if (len >= table->maxlen)
> len = table->maxlen-1;
> - if(copy_from_user(table->data, buffer, len))
> + if(copy_from_user(sysctl_table_data(table), buffer, len))
> return -EFAULT;
> - ((char *) table->data)[len] = 0;
> + ((char *) sysctl_table_data(table))[len] = 0;
> *ppos += *lenp;
> } else {
> - len = strlen(table->data);
> + len = strlen(sysctl_table_data(table));
> if (len > table->maxlen)
> len = table->maxlen;
> if (len > *lenp)
> len = *lenp;
> if (len)
> - if(copy_to_user(buffer, table->data, len))
> + if(copy_to_user(buffer, sysctl_table_data(table), len))
> return -EFAULT;
> if (len < *lenp) {
> if(put_user('\n', ((char __user *) buffer) + len))
> @@ -1636,13 +1656,13 @@ static int do_proc_dointvec(ctl_table *t
> char buf[TMPBUFLEN], *p;
> char __user *s = buffer;
>
> - if (!table->data || !table->maxlen || !*lenp ||
> + if (!sysctl_table_data(table) || !table->maxlen || !*lenp ||
> (*ppos && !write)) {
> *lenp = 0;
> return 0;
> }
>
> - i = (int *) table->data;
> + i = (int *) sysctl_table_data(table);
> vleft = table->maxlen / sizeof(*i);
> left = *lenp;
>
> @@ -1878,13 +1898,13 @@ static int do_proc_doulongvec_minmax(ctl
> char buf[TMPBUFLEN], *p;
> char __user *s = buffer;
>
> - if (!table->data || !table->maxlen || !*lenp ||
> + if (!sysctl_table_data(table) || !table->maxlen || !*lenp ||
> (*ppos && !write)) {
> *lenp = 0;
> return 0;
> }
>
> - i = (unsigned long *) table->data;
> + i = (unsigned long *) sysctl_table_data(table);
> min = (unsigned long *) table->extra1;
> max = (unsigned long *) table->extra2;
> vleft = table->maxlen / sizeof(unsigned long);
> @@ -2230,7 +2250,9 @@ int sysctl_string(ctl_table *table, int
> void __user *oldval, size_t __user *oldlenp,
> void __user *newval, size_t newlen, void **context)
> {
> - if (!table->data || !table->maxlen)
> + char *data = sysctl_table_data(table);
> +
> + if (!data || !table->maxlen)
> return -ENOTDIR;
>
> if (oldval && oldlenp) {
> @@ -2238,7 +2260,7 @@ int sysctl_string(ctl_table *table, int
> if (get_user(bufsize, oldlenp))
> return -EFAULT;
> if (bufsize) {
> - size_t len = strlen(table->data), copied;
> + size_t len = strlen(data), copied;
>
> /* This shouldn't trigger for a well-formed sysctl */
> if (len > table->maxlen)
> @@ -2247,7 +2269,7 @@ int sysctl_string(ctl_table *table, int
> /* Copy up to a max of bufsize-1 bytes of the string */
> copied = (len >= bufsize) ? bufsize - 1 : len;
>
> - if (copy_to_user(oldval, table->data, copied) ||
> + if (copy_to_user(oldval, data, copied) ||
> put_user(0, (char __user *)(oldval + copied)))
> return -EFAULT;
> if (put_user(len, oldlenp))
> @@ -2258,11 +2280,11 @@ int sysctl_string(ctl_table *table, int
> size_t len = newlen;
> if (len > table->maxlen)
> len = table->maxlen;
> - if(copy_from_user(table->data, newval, len))
> + if(copy_from_user(data, newval, len))
> return -EFAULT;
> if (len == table->maxlen)
> len--;
> - ((char *) table->data)[len] = 0;
> + data[len] = 0;
> }
> return 1;
> }
> @@ -2320,7 +2342,7 @@ int sysctl_jiffies(ctl_table *table, int
> if (olen!=sizeof(int))
> return -EINVAL;
> }
> - if (put_user(*(int *)(table->data)/HZ, (int __user *)oldval) ||
> + if (put_user(*(int *)(sysctl_table_data(table))/HZ, (int __user *)oldval) ||
> (oldlenp && put_user(sizeof(int),oldlenp)))
> return -EFAULT;
> }
> @@ -2330,7 +2352,7 @@ int sysctl_jiffies(ctl_table *table, int
> return -EINVAL;
> if (get_user(new, (int __user *)newval))
> return -EFAULT;
> - *(int *)(table->data) = new*HZ;
> + *(int *)(sysctl_table_data(table)) = new*HZ;
> }
> return 1;
> }
> @@ -2340,6 +2362,7 @@ int sysctl_ms_jiffies(ctl_table *table,
> void __user *oldval, size_t __user *oldlenp,
> void __user *newval, size_t newlen, void **context)
> {
> + int *jiffies_ptr = sysctl_table_data(table);
> if (oldval) {
> size_t olen;
> if (oldlenp) {
> @@ -2348,7 +2371,7 @@ int sysctl_ms_jiffies(ctl_table *table,
> if (olen!=sizeof(int))
> return -EINVAL;
> }
> - if (put_user(jiffies_to_msecs(*(int *)(table->data)), (int __user *)oldval) ||
> + if (put_user(jiffies_to_msecs(*jiffies_ptr), (int __user *)oldval) ||
> (oldlenp && put_user(sizeof(int),oldlenp)))
> return -EFAULT;
> }
> @@ -2358,7 +2381,7 @@ int sysctl_ms_jiffies(ctl_table *table,
> return -EINVAL;
> if (get_user(new, (int __user *)newval))
> return -EFAULT;
> - *(int *)(table->data) = msecs_to_jiffies(new);
> + *jiffies_ptr = msecs_to_jiffies(new);
> }
> return 1;
> }
> _

wouldn't some get/set helper make more sense?
i.e. some virtualizer and devirtualizer functions?

best,
Herbert

2006-03-07 00:57:25

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] prepare sysctls for containers

* Dave Hansen ([email protected]) wrote:
> /* If there is no strategy routine, or if the strategy returns
> * zero, proceed with automatic r/w */
> - if (table->data && table->maxlen) {
> + data = sysctl_table_data(table);
> + if (data && table->maxlen) {
> if (oldval && oldlenp) {
> if (get_user(len, oldlenp))
> return -EFAULT;
> if (len) {
> if (len > table->maxlen)
> len = table->maxlen;
> - if(copy_to_user(oldval, table->data, len))
> + if(copy_to_user(oldval, data, len))
> return -EFAULT;
> if(put_user(len, oldlenp))
> return -EFAULT;
> @@ -1241,7 +1261,7 @@ int do_sysctl_strategy (ctl_table *table
> len = newlen;
> if (len > table->maxlen)
> len = table->maxlen;
> - if(copy_from_user(table->data, newval, len))
> + if(copy_from_user(data, newval, len))
> return -EFAULT;

Interesting idea. One piece that's missing is strategy for controlling
creation the new context (assuming the data_access() will always evaluate
into a context sensitive piece of data). Otherwise a user can get out
of the limits imposed by sysadmin (since they may have placed themselves
in a context which differs from admin).

thanks,
-chris

2006-03-07 01:24:44

by Al Viro

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] prepare sysctls for containers

On Mon, Mar 06, 2006 at 03:52:49PM -0800, Dave Hansen wrote:
>
> Right now, sysctls can only deal with global variables. This
> patch makes them a _little_ more flexible by allowing there to
> be an accessor function to get at the variable being changed,
> instead of it being global.
>
> This allows the sysctls to be backed by variables that are,
> for instance, dynamically allocated and not available at
> compile-time.
>
> This also provides a very simple mechanism to take things that
> are currently global and containerize them.
>

This is disgusting. Please, don't pile more and more complexity into
sysctl_table - it's already choke-full of it and needs to be simplified,
not to grow more crap.

NAK.

2006-03-07 01:53:29

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/6] sysvmsg: containerize

* Dave Hansen ([email protected]) wrote:
> --- work/include/linux/init_task.h~sysvmsg-container 2006-03-06 15:41:56.000000000 -0800
> +++ work-dave/include/linux/init_task.h 2006-03-06 15:41:56.000000000 -0800
> @@ -2,6 +2,7 @@
> #define _LINUX__INIT_TASK_H
>
> #include <linux/file.h>
> +#include <linux/ipc.h>
> #include <linux/rcupdate.h>
>
> #define INIT_FDTABLE \

missing INIT_IPC_CONTEXT type of macro?

> diff -puN include/linux/ipc.h~sysvmsg-container include/linux/ipc.h
> --- work/include/linux/ipc.h~sysvmsg-container 2006-03-06 15:41:56.000000000 -0800
> +++ work-dave/include/linux/ipc.h 2006-03-06 15:41:56.000000000 -0800
> @@ -68,6 +68,10 @@ struct kern_ipc_perm
> void *security;
> };
>
> +struct ipc_context;
> +extern struct ipc_context *ipc_context;
> +
> +extern struct ipc_msg_context *init_ipc_context(void);
> #endif /* __KERNEL__ */
>
> #endif /* _LINUX_IPC_H */
> diff -puN include/linux/sched.h~sysvmsg-container include/linux/sched.h
> --- work/include/linux/sched.h~sysvmsg-container 2006-03-06 15:41:56.000000000 -0800
> +++ work-dave/include/linux/sched.h 2006-03-06 15:41:56.000000000 -0800
> @@ -793,6 +793,7 @@ struct task_struct {
> int link_count, total_link_count;
> /* ipc stuff */
> struct sysv_sem sysvsem;
> + struct ipc_context *ipc_context;

how does this propagate on clone (presently it's just side-effect of
dup_task_struct starting from init_task, with no dynamic lifetime),
how is it meant to be changed?

> /* CPU-specific state of this task */
> struct thread_struct thread;
> /* filesystem information */
> diff -puN ipc/msg.c~sysvmsg-container ipc/msg.c
> --- work/ipc/msg.c~sysvmsg-container 2006-03-06 15:41:56.000000000 -0800
> +++ work-dave/ipc/msg.c 2006-03-06 15:41:56.000000000 -0800
> @@ -60,35 +60,30 @@ struct msg_sender {
> #define SEARCH_NOTEQUAL 3
> #define SEARCH_LESSEQUAL 4
>
> -static atomic_t msg_bytes = ATOMIC_INIT(0);
> -static atomic_t msg_hdrs = ATOMIC_INIT(0);
> +#define msg_lock(ctx, id) ((struct msg_queue*)ipc_lock(&ctx->ids,id))
> +#define msg_unlock(ctx, msq) ipc_unlock(&(msq)->q_perm)
> +#define msg_rmid(ctx, id) ((struct msg_queue*)ipc_rmid(&ctx->ids,id))
> +#define msg_checkid(ctx, msq, msgid) \
> + ipc_checkid(&ctx->ids,&msq->q_perm,msgid)
> +#define msg_buildid(ctx, id, seq) \
> + ipc_buildid(&ctx->ids, id, seq)
>
> -static struct ipc_ids msg_ids;
> -
> -#define msg_lock(id) ((struct msg_queue*)ipc_lock(&msg_ids,id))
> -#define msg_unlock(msq) ipc_unlock(&(msq)->q_perm)
> -#define msg_rmid(id) ((struct msg_queue*)ipc_rmid(&msg_ids,id))
> -#define msg_checkid(msq, msgid) \
> - ipc_checkid(&msg_ids,&msq->q_perm,msgid)
> -#define msg_buildid(id, seq) \
> - ipc_buildid(&msg_ids, id, seq)
> -
> -static void freeque (struct msg_queue *msq, int id);
> -static int newque (key_t key, int msgflg);
> +static void freeque (struct ipc_msg_context *, struct msg_queue *msq, int id);
> +static int newque (struct ipc_msg_context *context, key_t key, int id);
> #ifdef CONFIG_PROC_FS
> static int sysvipc_msg_proc_show(struct seq_file *s, void *it);
> #endif
>
> -void __init msg_init (void)
> +void __init msg_init (struct ipc_msg_context *context)
> {
> - ipc_init_ids(&msg_ids,msg_ctlmni);
> + ipc_init_ids(&context->ids,msg_ctlmni);
> ipc_init_proc_interface("sysvipc/msg",
> " key msqid perms cbytes qnum lspid lrpid uid gid cuid cgid stime rtime ctime\n",
> - &msg_ids,
> + &context->ids,
> sysvipc_msg_proc_show);

Does that mean /proc interface only gets init_task context?
Along those lines, I think now ipcs -a is incomplete from admin
perspective. Suppose that's a feature from the container/vserver
POV.

> --- work/ipc/util.h~sysvmsg-container 2006-03-06 15:41:56.000000000 -0800
> +++ work-dave/ipc/util.h 2006-03-06 15:41:56.000000000 -0800
> @@ -12,7 +12,7 @@
> #define SEQ_MULTIPLIER (IPCMNI)
>
> void sem_init (void);
> -void msg_init (void);
> +void msg_init (struct ipc_msg_context *context);
> void shm_init (void);
>
> struct ipc_id_ary {
> @@ -30,6 +30,18 @@ struct ipc_ids {
> struct ipc_id_ary* entries;
> };
>
> +struct ipc_msg_context {
> + atomic_t bytes;
> + atomic_t hdrs;
> +
> + struct ipc_ids ids;
> + struct kref count;
> +};
> +
> +struct ipc_context {
> + struct ipc_msg_context msg;
> +};

This looks a little suspect. ref count embedded in embedded object.
My first thought was, sem and shared memory would introduce same, and
now we'd have 3 independent refounts for top level object. However,
it's not used, and doesn't appear to be mirrored in the sem and shared
mem contexts. Perhaps it is just a leftover?

thanks,
-chris

2006-03-07 01:56:45

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] prepare sysctls for containers

On Tue, 2006-03-07 at 01:24 +0000, Al Viro wrote:
> This is disgusting. Please, don't pile more and more complexity into
> sysctl_table - it's already choke-full of it and needs to be simplified,
> not to grow more crap.

I don't completely disagree. It certainly isn't the most elegant
approach I've ever seen.

Any ideas on ways we could simplify it? I was thinking that we could
get rid of the .data member and allow access only via the mechanism I
just introduced. It would be pretty easy to make some macros to
generate "simple" access functions for the existing global variables.

-- Dave

2006-03-07 01:57:42

by Al Viro

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] prepare sysctls for containers

On Mon, Mar 06, 2006 at 05:55:48PM -0800, Dave Hansen wrote:
> On Tue, 2006-03-07 at 01:24 +0000, Al Viro wrote:
> > This is disgusting. Please, don't pile more and more complexity into
> > sysctl_table - it's already choke-full of it and needs to be simplified,
> > not to grow more crap.
>
> I don't completely disagree. It certainly isn't the most elegant
> approach I've ever seen.
>
> Any ideas on ways we could simplify it? I was thinking that we could
> get rid of the .data member and allow access only via the mechanism I
> just introduced. It would be pretty easy to make some macros to
> generate "simple" access functions for the existing global variables.

I'll resurrect the sysctl-cleanups tree and drop it on kernel.org tonight.

2006-03-07 02:01:17

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] prepare sysctls for containers

On Tue, 2006-03-07 at 01:50 +0100, Herbert Poetzl wrote:
> On Mon, Mar 06, 2006 at 03:52:49PM -0800, Dave Hansen wrote:
> >
> > Right now, sysctls can only deal with global variables. This
> > patch makes them a _little_ more flexible by allowing there to
> > be an accessor function to get at the variable being changed,
> > instead of it being global.
> >
> > This allows the sysctls to be backed by variables that are,
> > for instance, dynamically allocated and not available at
> > compile-time.
> >
> > This also provides a very simple mechanism to take things that
> > are currently global and containerize them.
>
> hmm, why do you call the sysctl_table_data() over and
> over again? what's the purpose?

Letting me be lazy and code with s/// :)

For the current application, it doesn't really matter. But, I can
imagine that other users could be a bit more costly.

> what about sideeffects?

Require that there aren't any. ;)

It might be necessary to have something effectively implementing put and
get, but this certainly doesn't need it yet.

> > Signed-off-by: Dave Hansen <[email protected]>
> > ---
> >
> > work-dave/include/linux/sysctl.h | 8 ++++
> > work-dave/kernel/sysctl.c | 65 ++++++++++++++++++++++++++-------------
> > 2 files changed, 52 insertions(+), 21 deletions(-)
> >
> > diff -puN include/linux/sysctl.h~sysctls-for-containers include/linux/sysctl.h
> > --- work/include/linux/sysctl.h~sysctls-for-containers 2006-03-06 15:41:55.000000000 -0800
> > +++ work-dave/include/linux/sysctl.h 2006-03-06 15:41:55.000000000 -0800
> > @@ -872,6 +872,7 @@ extern void sysctl_init(void);
> >
> > typedef struct ctl_table ctl_table;
> >
> > +typedef void *ctl_data_access (void);
> > typedef int ctl_handler (ctl_table *table, int __user *name, int nlen,
> > void __user *oldval, size_t __user *oldlenp,
> > void __user *newval, size_t newlen,
> > @@ -957,6 +958,13 @@ struct ctl_table
> > int ctl_name; /* Binary ID */
> > const char *procname; /* Text ID for /proc/sys, or zero */
> > void *data;
> > + ctl_data_access *data_access; /* set this to a function if you
> > + * don't have a static place to point
> > + * ->data at compile-time. This
> > + * function will be called to dynamically
> > + * figure out a ->data pointer. Do not
> > + * set this and ->data at once.
> > + */
> > int maxlen;
> > mode_t mode;
> > ctl_table *child;
> > diff -puN kernel/sysctl.c~sysctls-for-containers kernel/sysctl.c
> > --- work/kernel/sysctl.c~sysctls-for-containers 2006-03-06 15:41:55.000000000 -0800
> > +++ work-dave/kernel/sysctl.c 2006-03-06 15:41:55.000000000 -0800
> > @@ -1197,6 +1197,24 @@ repeat:
> > return -ENOTDIR;
> > }
> >
>
> I'd expect that to be inline, and to vanish when
> containers are disabled ...

Unless we want non-container code to be able to use it. I guess we
could restrict it to containers only, though.

> > +void *sysctl_table_data(ctl_table *table)
> > +{
> > + void *data;
> > +
> > + if (table->data && table->data_access) {
> > + printk(KERN_WARNING
> > + "sysctl: data and accessor function set for: '%s'\n",
> > + table->procname);
> > + table->data = NULL;
>
> why is ->data and ->data_access evil?

As it stands, which one do you use? What if they aren't consistent? Do
you use both? Just one? Which first? Easiest to just say that it
isn't allowed.

> wouldn't some get/set helper make more sense?
> i.e. some virtualizer and devirtualizer functions?

I'm not quite sure I know what you mean. Can you elaborate some more?

-- Dave

2006-03-07 02:05:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] prepare sysctls for containers

On Mon, 2006-03-06 at 17:01 -0800, Chris Wright wrote:
> Interesting idea. One piece that's missing is strategy for controlling
> creation the new context (assuming the data_access() will always evaluate
> into a context sensitive piece of data). Otherwise a user can get out
> of the limits imposed by sysadmin (since they may have placed themselves
> in a context which differs from admin).

Yup, that is missing for now. We couldn't agree on quite which
implementation we want for basic containers/vservers/vpses. So, for
now, making it useful is left as an exercise to the reader. :)

BTW, the current code _is_ potentially context sensitive because
"current" provides much of the context that we will ever need.

-- Dave

2006-03-07 02:09:41

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/6] sysvmsg: containerize

On Mon, 2006-03-06 at 17:57 -0800, Chris Wright wrote:
> * Dave Hansen ([email protected]) wrote:
> > --- work/include/linux/init_task.h~sysvmsg-container 2006-03-06 15:41:56.000000000 -0800
> > +++ work-dave/include/linux/init_task.h 2006-03-06 15:41:56.000000000 -0800
> > @@ -2,6 +2,7 @@
> > #define _LINUX__INIT_TASK_H
> >
> > #include <linux/file.h>
> > +#include <linux/ipc.h>
> > #include <linux/rcupdate.h>
> >
> > #define INIT_FDTABLE \
>
> missing INIT_IPC_CONTEXT type of macro?

Yeah, I was doing something like that at first. But, I decided to
dynamically allocate it, much like it will have to be done when we have
_real_ namespaces for it. That means that we can't actually initialize
it at init_task.h time.

> > diff -puN include/linux/sched.h~sysvmsg-container include/linux/sched.h
> > --- work/include/linux/sched.h~sysvmsg-container 2006-03-06 15:41:56.000000000 -0800
> > +++ work-dave/include/linux/sched.h 2006-03-06 15:41:56.000000000 -0800
> > @@ -793,6 +793,7 @@ struct task_struct {
> > int link_count, total_link_count;
> > /* ipc stuff */
> > struct sysv_sem sysvsem;
> > + struct ipc_context *ipc_context;
>
> how does this propagate on clone (presently it's just side-effect of
> dup_task_struct starting from init_task, with no dynamic lifetime),
> how is it meant to be changed?

Some interface later down the road. You caught me a bit early in the
cycle.

> > -void __init msg_init (void)
> > +void __init msg_init (struct ipc_msg_context *context)
> > {
> > - ipc_init_ids(&msg_ids,msg_ctlmni);
> > + ipc_init_ids(&context->ids,msg_ctlmni);
> > ipc_init_proc_interface("sysvipc/msg",
> > " key msqid perms cbytes qnum lspid lrpid uid gid cuid cgid stime rtime ctime\n",
> > - &msg_ids,
> > + &context->ids,
> > sysvipc_msg_proc_show);
>
> Does that mean /proc interface only gets init_task context?
> Along those lines, I think now ipcs -a is incomplete from admin
> perspective. Suppose that's a feature from the container/vserver
> POV.

It will get context from the current task, which means the current
container. We haven't quite decided how these things will be (or if
they need to be) aggregated on a a system-wide basis.

> > --- work/ipc/util.h~sysvmsg-container 2006-03-06 15:41:56.000000000 -0800
> > +++ work-dave/ipc/util.h 2006-03-06 15:41:56.000000000 -0800
> > @@ -12,7 +12,7 @@
> > #define SEQ_MULTIPLIER (IPCMNI)
> >
> > void sem_init (void);
> > -void msg_init (void);
> > +void msg_init (struct ipc_msg_context *context);
> > void shm_init (void);
> >
> > struct ipc_id_ary {
> > @@ -30,6 +30,18 @@ struct ipc_ids {
> > struct ipc_id_ary* entries;
> > };
> >
> > +struct ipc_msg_context {
> > + atomic_t bytes;
> > + atomic_t hdrs;
> > +
> > + struct ipc_ids ids;
> > + struct kref count;
> > +};
> > +
> > +struct ipc_context {
> > + struct ipc_msg_context msg;
> > +};
>
> This looks a little suspect. ref count embedded in embedded object.
> My first thought was, sem and shared memory would introduce same, and
> now we'd have 3 independent refounts for top level object. However,
> it's not used, and doesn't appear to be mirrored in the sem and shared
> mem contexts. Perhaps it is just a leftover?

Yeah, I have the feeling that it used to be an atomic_t, and got
converted over to a kref a bit needlessly. I'll investigate more when I
spin the patches again.

-- Dave

2006-03-07 02:13:52

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] prepare sysctls for containers

* Dave Hansen ([email protected]) wrote:
> Yup, that is missing for now. We couldn't agree on quite which
> implementation we want for basic containers/vservers/vpses. So, for
> now, making it useful is left as an exercise to the reader. :)

;-)

> BTW, the current code _is_ potentially context sensitive because
> "current" provides much of the context that we will ever need.

Right. More a question of sysadmin lowering limits which aren't seen
in any context but sysadmin context. But from your comments above and
in other mail, I understand the requirements aren't fully baked yet.

thanks,
-chris

2006-03-07 02:30:29

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/6] sysvmsg: containerize

* Dave Hansen ([email protected]) wrote:
> On Mon, 2006-03-06 at 17:57 -0800, Chris Wright wrote:
> > * Dave Hansen ([email protected]) wrote:
> > > -void __init msg_init (void)
> > > +void __init msg_init (struct ipc_msg_context *context)
> > > {
> > > - ipc_init_ids(&msg_ids,msg_ctlmni);
> > > + ipc_init_ids(&context->ids,msg_ctlmni);
> > > ipc_init_proc_interface("sysvipc/msg",
> > > " key msqid perms cbytes qnum lspid lrpid uid gid cuid cgid stime rtime ctime\n",
> > > - &msg_ids,
> > > + &context->ids,
> > > sysvipc_msg_proc_show);
> >
> > Does that mean /proc interface only gets init_task context?
> > Along those lines, I think now ipcs -a is incomplete from admin
> > perspective. Suppose that's a feature from the container/vserver
> > POV.
>
> It will get context from the current task, which means the current
> container. We haven't quite decided how these things will be (or if
> they need to be) aggregated on a a system-wide basis.

The /proc interface is registering with &context->ids of init_task. So,
all other contexts using that interface will be looking at the wrong
info, AFAICT.

As you can tell my concerns are in resource consumption. If a user can
create contexts which it can hide from sysadmin, and they aren't subject
to sysadmin mandated resource limits, it's effectively a leak, esp. since
these resources don't die with exit(2).

thanks,
-chris

2006-03-07 02:40:04

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/6] sysvsem: containerize

* Dave Hansen ([email protected]) wrote:
> @@ -112,15 +114,12 @@ int sem_ctls[4] = {SEMMSL, SEMMNS, SEMOP
> #define sc_semopm (sem_ctls[2])
> #define sc_semmni (sem_ctls[3])

Minor one, but didn't see an attempt to handle these sem_ctls global sysctls.

thanks,
-chris

2006-03-07 02:45:08

by Herbert Poetzl

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] prepare sysctls for containers

On Mon, Mar 06, 2006 at 06:00:21PM -0800, Dave Hansen wrote:
> On Tue, 2006-03-07 at 01:50 +0100, Herbert Poetzl wrote:
> > On Mon, Mar 06, 2006 at 03:52:49PM -0800, Dave Hansen wrote:
> > >
> > > Right now, sysctls can only deal with global variables. This
> > > patch makes them a _little_ more flexible by allowing there to
> > > be an accessor function to get at the variable being changed,
> > > instead of it being global.
> > >
> > > This allows the sysctls to be backed by variables that are,
> > > for instance, dynamically allocated and not available at
> > > compile-time.
> > >
> > > This also provides a very simple mechanism to take things that
> > > are currently global and containerize them.
> >
> > hmm, why do you call the sysctl_table_data() over and
> > over again? what's the purpose?
>
> Letting me be lazy and code with s/// :)
>
> For the current application, it doesn't really matter. But, I can
> imagine that other users could be a bit more costly.
>
> > what about sideeffects?
>
> Require that there aren't any. ;)
>
> It might be necessary to have something effectively implementing put and
> get, but this certainly doesn't need it yet.
>
> > > Signed-off-by: Dave Hansen <[email protected]>
> > > ---
> > >
> > > work-dave/include/linux/sysctl.h | 8 ++++
> > > work-dave/kernel/sysctl.c | 65 ++++++++++++++++++++++++++-------------
> > > 2 files changed, 52 insertions(+), 21 deletions(-)
> > >
> > > diff -puN include/linux/sysctl.h~sysctls-for-containers include/linux/sysctl.h
> > > --- work/include/linux/sysctl.h~sysctls-for-containers 2006-03-06 15:41:55.000000000 -0800
> > > +++ work-dave/include/linux/sysctl.h 2006-03-06 15:41:55.000000000 -0800
> > > @@ -872,6 +872,7 @@ extern void sysctl_init(void);
> > >
> > > typedef struct ctl_table ctl_table;
> > >
> > > +typedef void *ctl_data_access (void);
> > > typedef int ctl_handler (ctl_table *table, int __user *name, int nlen,
> > > void __user *oldval, size_t __user *oldlenp,
> > > void __user *newval, size_t newlen,
> > > @@ -957,6 +958,13 @@ struct ctl_table
> > > int ctl_name; /* Binary ID */
> > > const char *procname; /* Text ID for /proc/sys, or zero */
> > > void *data;
> > > + ctl_data_access *data_access; /* set this to a function if you
> > > + * don't have a static place to point
> > > + * ->data at compile-time. This
> > > + * function will be called to dynamically
> > > + * figure out a ->data pointer. Do not
> > > + * set this and ->data at once.
> > > + */
> > > int maxlen;
> > > mode_t mode;
> > > ctl_table *child;
> > > diff -puN kernel/sysctl.c~sysctls-for-containers kernel/sysctl.c
> > > --- work/kernel/sysctl.c~sysctls-for-containers 2006-03-06 15:41:55.000000000 -0800
> > > +++ work-dave/kernel/sysctl.c 2006-03-06 15:41:55.000000000 -0800
> > > @@ -1197,6 +1197,24 @@ repeat:
> > > return -ENOTDIR;
> > > }
> > >
> >
> > I'd expect that to be inline, and to vanish when
> > containers are disabled ...
>
> Unless we want non-container code to be able to use it. I guess we
> could restrict it to containers only, though.

uhm, what about kernel compiled _without_ container
support?

> > > +void *sysctl_table_data(ctl_table *table)
> > > +{
> > > + void *data;
> > > +
> > > + if (table->data && table->data_access) {
> > > + printk(KERN_WARNING
> > > + "sysctl: data and accessor function set for: '%s'\n",
> > > + table->procname);
> > > + table->data = NULL;
> >
> > why is ->data and ->data_access evil?
>
> As it stands, which one do you use? What if they aren't consistent? Do
> you use both? Just one? Which first? Easiest to just say that it
> isn't allowed.

wouldn't it make sense to have some 'default' in
->data and let ->data_access() override it when
there is something to override?

> > wouldn't some get/set helper make more sense?
> > i.e. some virtualizer and devirtualizer functions?
>
> I'm not quite sure I know what you mean. Can you elaborate some more?

sure, basically we have two cases which interact
with userspace: the read and the write case.

for the read case, we want something which gives
us the 'virtualized' view, which might often be
the same as the host gets, where for the write
case it is usually not that simple, as we might
not want a context to write to 'world' stuff

so, having two different functions here, or one
which gets passed the direction, might be much
simpler to adjust in many cases than adding more
and more structures ... but YMMV

best,
Herbert

> -- Dave

2006-03-07 03:02:28

by Sam Vilain

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] prepare sysctls for containers

Dave Hansen wrote:

>>Interesting idea. One piece that's missing is strategy for controlling
>>creation the new context (assuming the data_access() will always evaluate
>>into a context sensitive piece of data). Otherwise a user can get out
>>of the limits imposed by sysadmin (since they may have placed themselves
>>in a context which differs from admin).
>>
>>
>Yup, that is missing for now. We couldn't agree on quite which
>implementation we want for basic containers/vservers/vpses. So, for
>now, making it useful is left as an exercise to the reader. :)
>
>BTW, the current code _is_ potentially context sensitive because
>"current" provides much of the context that we will ever need.
>
>

I have an RFC quality submission ready, extracted from Herbert's work.
I'll prepare and forward it to the ML now for reference.

Sam.

2006-03-07 05:09:54

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/6] sysvsem: containerize

On Mon, 2006-03-06 at 18:44 -0800, Chris Wright wrote:
> * Dave Hansen ([email protected]) wrote:
> > @@ -112,15 +114,12 @@ int sem_ctls[4] = {SEMMSL, SEMMNS, SEMOP
> > #define sc_semopm (sem_ctls[2])
> > #define sc_semmni (sem_ctls[3])
>
> Minor one, but didn't see an attempt to handle these sem_ctls global sysctls.

Nope, haven't gotten to those yet. Thanks for pointing them out.

-- Dave

2006-03-19 14:52:15

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] prepare sysctls for containers

Al Viro <[email protected]> writes:

> On Mon, Mar 06, 2006 at 05:55:48PM -0800, Dave Hansen wrote:
>> On Tue, 2006-03-07 at 01:24 +0000, Al Viro wrote:
>> > This is disgusting. Please, don't pile more and more complexity into
>> > sysctl_table - it's already choke-full of it and needs to be simplified,
>> > not to grow more crap.
>>
>> I don't completely disagree. It certainly isn't the most elegant
>> approach I've ever seen.
>>
>> Any ideas on ways we could simplify it? I was thinking that we could
>> get rid of the .data member and allow access only via the mechanism I
>> just introduced. It would be pretty easy to make some macros to
>> generate "simple" access functions for the existing global variables.
>
> I'll resurrect the sysctl-cleanups tree and drop it on kernel.org tonight.

Has that happened yet?

I just looked and I didn't see anything up there.

Eric

2006-03-19 15:30:54

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] prepare sysctls for containers

Dave Hansen <[email protected]> writes:

> Right now, sysctls can only deal with global variables. This
> patch makes them a _little_ more flexible by allowing there to
> be an accessor function to get at the variable being changed,
> instead of it being global.
>
> This allows the sysctls to be backed by variables that are,
> for instance, dynamically allocated and not available at
> compile-time.
>
> This also provides a very simple mechanism to take things that
> are currently global and containerize them.

Sorry for taking so long to look at this I just spotted this
series of patches.

The parameters that describe the variable are:
data, maxlen, extra1, and extra2.

I am concerned that this does not provide a capability for
anything except data to vary at runtime.

Also so that we can do a good job and report process local resources
in /proc this data_access should take a task_struct parameter.

data_access should also be passed the the ctl_table in case
we can make an generic data accessor.

That would allow us to put offsetof is data and then
simply add currrent->ipc_context to get the address of
the variable. Which should keep the number of accessor functions
down.

Which give a signature something like:

struct ctl_data_info {
void *data;
int maxlen;
void *extra1;
void *extra2;
};

void ctl_data_access(struct ctl_table *tbl, sruct task_struct *task,
struct ctl_data_info *info);


> Signed-off-by: Dave Hansen <[email protected]>
> ---
>
> work-dave/include/linux/sysctl.h | 8 ++++
> work-dave/kernel/sysctl.c | 65 ++++++++++++++++++++++++++-------------
> 2 files changed, 52 insertions(+), 21 deletions(-)
>
> diff -puN include/linux/sysctl.h~sysctls-for-containers include/linux/sysctl.h
> --- work/include/linux/sysctl.h~sysctls-for-containers 2006-03-06
> 15:41:55.000000000 -0800
> +++ work-dave/include/linux/sysctl.h 2006-03-06 15:41:55.000000000 -0800
> @@ -872,6 +872,7 @@ extern void sysctl_init(void);
>
> typedef struct ctl_table ctl_table;
>
> +typedef void *ctl_data_access (void);
> typedef int ctl_handler (ctl_table *table, int __user *name, int nlen,
> void __user *oldval, size_t __user *oldlenp,
> void __user *newval, size_t newlen,
> @@ -957,6 +958,13 @@ struct ctl_table
> int ctl_name; /* Binary ID */
> const char *procname; /* Text ID for /proc/sys, or zero */
> void *data;
> + ctl_data_access *data_access; /* set this to a function if you
> + * don't have a static place to point
> + * ->data at compile-time. This
> + * function will be called to dynamically
> + * figure out a ->data pointer. Do not
> + * set this and ->data at once.
> + */
> int maxlen;
> mode_t mode;
> ctl_table *child;
> diff -puN kernel/sysctl.c~sysctls-for-containers kernel/sysctl.c
> --- work/kernel/sysctl.c~sysctls-for-containers 2006-03-06 15:41:55.000000000
> -0800
> +++ work-dave/kernel/sysctl.c 2006-03-06 15:41:55.000000000 -0800
> @@ -1197,6 +1197,24 @@ repeat:
> return -ENOTDIR;
> }
>
> +void *sysctl_table_data(ctl_table *table)
> +{
> + void *data;
> +
> + if (table->data && table->data_access) {
> + printk(KERN_WARNING
> + "sysctl: data and accessor function set for: '%s'\n",
> + table->procname);
> + table->data = NULL;
> + }
> +
> + data = table->data;
> + if (!data && table->data_access)
> + data = table->data_access();
> +
> + return data;
> +}

I think we should always call data_access if it is populated.

For the rest it looks like it is getting there.

Eric

2006-03-19 15:37:45

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/6] sysvmsg: containerize

Chris Wright <[email protected]> writes:

> * Dave Hansen ([email protected]) wrote:
>> On Mon, 2006-03-06 at 17:57 -0800, Chris Wright wrote:
>> > * Dave Hansen ([email protected]) wrote:
>> > > -void __init msg_init (void)
>> > > +void __init msg_init (struct ipc_msg_context *context)
>> > > {
>> > > - ipc_init_ids(&msg_ids,msg_ctlmni);
>> > > + ipc_init_ids(&context->ids,msg_ctlmni);
>> > > ipc_init_proc_interface("sysvipc/msg",
>> > > " key msqid perms cbytes qnum lspid lrpid uid gid cuid cgid stime rtime
> ctime\n",
>> > > - &msg_ids,
>> > > + &context->ids,
>> > > sysvipc_msg_proc_show);
>> >
>> > Does that mean /proc interface only gets init_task context?
>> > Along those lines, I think now ipcs -a is incomplete from admin
>> > perspective. Suppose that's a feature from the container/vserver
>> > POV.
>>
>> It will get context from the current task, which means the current
>> container. We haven't quite decided how these things will be (or if
>> they need to be) aggregated on a a system-wide basis.
>
> The /proc interface is registering with &context->ids of init_task. So,
> all other contexts using that interface will be looking at the wrong
> info, AFAICT.

We need to make this per process in /proc to get it right.
So /proc/sysvipc becomes a symlink to /proc/<pid>/sysvipc.

> As you can tell my concerns are in resource consumption. If a user can
> create contexts which it can hide from sysadmin, and they aren't subject
> to sysadmin mandated resource limits, it's effectively a leak, esp. since
> these resources don't die with exit(2).

I haven't spotted it yet in Dave's series but this is something that should
happen when all of the tasks using the ipc_context in this case exit.

Eric

2006-03-19 18:04:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/6] prepare sysctls for containers

Herbert Poetzl <[email protected]> writes:
> sure, basically we have two cases which interact
> with userspace: the read and the write case.
>
> for the read case, we want something which gives
> us the 'virtualized' view, which might often be
> the same as the host gets, where for the write
> case it is usually not that simple, as we might
> not want a context to write to 'world' stuff
>
> so, having two different functions here, or one
> which gets passed the direction, might be much
> simpler to adjust in many cases than adding more
> and more structures ... but YMMV

I'm not convinced either way yet but my gut feel is that
the use case for get/put functions is incomplete permission
checks on various sysctls.

If that is the fixing the permission checks looks like the
right thing to do.

Eric

2006-03-20 19:34:13

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/6] sysvmsg: containerize

* Eric W. Biederman ([email protected]) wrote:
> Chris Wright <[email protected]> writes:
> > The /proc interface is registering with &context->ids of init_task. So,
> > all other contexts using that interface will be looking at the wrong
> > info, AFAICT.
>
> We need to make this per process in /proc to get it right.
> So /proc/sysvipc becomes a symlink to /proc/<pid>/sysvipc.

That, and any considerations for context access to protect against
reading /proc/pid/sysvipc/* (assuming you can share pid namespace,
while not sharing sysvipc context).

> > As you can tell my concerns are in resource consumption. If a user can
> > create contexts which it can hide from sysadmin, and they aren't subject
> > to sysadmin mandated resource limits, it's effectively a leak, esp. since
> > these resources don't die with exit(2).
>
> I haven't spotted it yet in Dave's series but this is something that should
> happen when all of the tasks using the ipc_context in this case exit.

Making it look like an 'init 0' from the P.O.V. of that ipc_context, WFM.
Seems the context needs to have context limits so any privileged process
in the context is still subject to the top-level administered limits.

thanks,
-chris

2006-03-20 21:31:30

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/6] sysvmsg: containerize

Chris Wright <[email protected]> writes:

> * Eric W. Biederman ([email protected]) wrote:
>> Chris Wright <[email protected]> writes:
>> > The /proc interface is registering with &context->ids of init_task. So,
>> > all other contexts using that interface will be looking at the wrong
>> > info, AFAICT.
>>
>> We need to make this per process in /proc to get it right.
>> So /proc/sysvipc becomes a symlink to /proc/<pid>/sysvipc.
>
> That, and any considerations for context access to protect against
> reading /proc/pid/sysvipc/* (assuming you can share pid namespace,
> while not sharing sysvipc context).

Yes. Although reading is fairly harmless activity.

>> > As you can tell my concerns are in resource consumption. If a user can
>> > create contexts which it can hide from sysadmin, and they aren't subject
>> > to sysadmin mandated resource limits, it's effectively a leak, esp. since
>> > these resources don't die with exit(2).
>>
>> I haven't spotted it yet in Dave's series but this is something that should
>> happen when all of the tasks using the ipc_context in this case exit.
>
> Making it look like an 'init 0' from the P.O.V. of that ipc_context, WFM.
> Seems the context needs to have context limits so any privileged process
> in the context is still subject to the top-level administered limits.

Agreed. Getting the limit checking correct without imposing measurable
overhead is tricky. My gut feel is that the limits should remain global
until we can agree on how to implement more fine grained limits.


Eric

2006-03-20 21:50:48

by Chris Wright

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/6] sysvmsg: containerize

* Eric W. Biederman ([email protected]) wrote:
> Agreed. Getting the limit checking correct without imposing measurable
> overhead is tricky. My gut feel is that the limits should remain global
> until we can agree on how to implement more fine grained limits.

Sounds reasonable.