2022-02-14 19:24:56

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v4 0/2] ipc: Store mq and ipc sysctls in the ipc namespace

This patchset changes the implementation of mq and ipc sysctls. It moves the
sysctls inside the ipc namespace. This will allow us to manage these sysctls
inside the user namespace.

--

Alexey Gladkov (2):
ipc: Store mqueue sysctls in the ipc namespace
ipc: Store ipc sysctls in the ipc namespace

include/linux/ipc_namespace.h | 37 ++++++-
ipc/ipc_sysctl.c | 189 ++++++++++++++++++++++------------
ipc/mq_sysctl.c | 121 ++++++++++++----------
ipc/mqueue.c | 10 +-
ipc/namespace.c | 10 ++
5 files changed, 235 insertions(+), 132 deletions(-)

--
2.33.0


2022-03-24 02:08:24

by Eric W. Biederman

[permalink] [raw]
Subject: [GIT PULL] ipc: Bind to the ipc namespace at open time.


Linus,

Please pull the per-namespace-ipc-sysctls-for-v5.18 tag from the git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git per-namespace-ipc-sysctls-for-v5.18
HEAD: 1f5c135ee509e89e0cc274333a65f73c62cb16e5 ipc: Store ipc sysctls in the ipc namespace


The per ipc namespace sysctls have been imperfect since they were
implemented. Instead of binding to the ipc namespace of the opener of
the file the code bound to the ipc namespace of the writer of the
file.

This short series of changes addresses that old deficiency in the
code.

Alexey Gladkov (2):
ipc: Store mqueue sysctls in the ipc namespace
ipc: Store ipc sysctls in the ipc namespace

include/linux/ipc_namespace.h | 37 ++++++++-
ipc/ipc_sysctl.c | 189 +++++++++++++++++++++++++++---------------
ipc/mq_sysctl.c | 121 +++++++++++++++------------
ipc/mqueue.c | 10 +--
ipc/namespace.c | 10 +++
5 files changed, 235 insertions(+), 132 deletions(-)

Signed-off-by: "Eric W. Biederman" <[email protected]>


Eric

2022-03-25 16:55:14

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ipc: Bind to the ipc namespace at open time.

On Wed, Mar 23, 2022 at 1:24 PM Eric W. Biederman <[email protected]> wrote:
>
> Please pull the per-namespace-ipc-sysctls-for-v5.18 tag from the git tree:

Ugh.

I pulled this. Then I stared at it for a long time.

And then I decided that this is too ugly to live.

I'm sorry. I think Alexey has probably spent a fair amount of time on
it, but I really think the sysctl code needs to be cleaned up way more
than this.

The old code was horribly hacky too, but that setup_ipc_sysctls() (and
setup_mq_sysctls()) thing that copies the whole sysctls table, and
then walks it entry by entry to modify it, is just too ugly for words.

And then it hides things in .extra1, and because it does that it can't
use the normal "extra1 and extra2 contains the limits" so then at
write() time it copies it into a local one AGAIN only to set the
limits back so that it can call the normal routines again.

Not ok.

Yes, yes, the old code did some similar things - to set the 'data'
pointer. That was disgusting too. Don't get me wrong - the existing
code was nasty too. But this took nasty code, and doubled down on it.

I really think this is a fundamental problem, and needs a more
fundamental fix than adding more and more of these kinds of nasty
hacks.

And yes, that fundamental fix is almost certainly to pass in 'struct
cred *' to all those sysctl helper functions.

Then, when you do the actual 'sysctl()' system calls, you pass in
'current_cred()".

And the /proc users would pass in file->f_cred.

And yes, that patch might be quite messy, because we have quite a lot
of those random .proc_handler users.

But *most* of them by far (at least in random code) use the standard
proc_dointvec_minmax() and friends, and won't even notice.

And then the ones that are about namespace issues will have to
continue to do the nasty "make a copy and update the data pointer",
but *MAYBE* we could also introduce the notion of an "offset" to those
proc_dointvec_minmax() things to help them out (and at least avoid the
"make a copy" thing).

Anyway, I really think we must not make that sysctl code even uglier
than it is today, and I think we need to move towards a model that
actually makes sense. And that "pass in the right cred" is the only
sensible model I can see.

I haven't tried to create such a patch, and maybe Alexey already tried
to do something like that and it turned out to be too ugly for words
and that's why these nasty patches happened.

But at least for now, I can't with a good conscience pull this.

Sorry,
Linus

2022-03-25 18:27:04

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [GIT PULL] ipc: Bind to the ipc namespace at open time.

Linus Torvalds <[email protected]> writes:

> On Wed, Mar 23, 2022 at 1:24 PM Eric W. Biederman <[email protected]> wrote:
>>
>> Please pull the per-namespace-ipc-sysctls-for-v5.18 tag from the git tree:
>
> Ugh.
>
> I pulled this. Then I stared at it for a long time.
>
> And then I decided that this is too ugly to live.
>
> I'm sorry. I think Alexey has probably spent a fair amount of time on
> it, but I really think the sysctl code needs to be cleaned up way more
> than this.
>
> The old code was horribly hacky too, but that setup_ipc_sysctls() (and
> setup_mq_sysctls()) thing that copies the whole sysctls table, and
> then walks it entry by entry to modify it, is just too ugly for words.

I am not convinced when it comes to the .data pointers it is too
ugly to live. The only realistic alternative is doing something
with the offset to adjust the pointers to the per namespace values.

Fundamentally the copy is needed because in the general case (which is
present in the networking stack) we only show a subset of the sysctls in
anything other than the initial namespace. That is necessary as some
sysctls have not been shown to be safe for unprivileged users to write.

So there does need to be a separate array of ctl_table entries per
namespace, and the data pointers in those entries need to be adjusted to
the per namespace values.

> And then it hides things in .extra1, and because it does that it can't
> use the normal "extra1 and extra2 contains the limits" so then at
> write() time it copies it into a local one AGAIN only to set the
> limits back so that it can call the normal routines again.
>
> Not ok.

I agree. That is a bit ugly, and I missed it when I was looking
at the code.

The mq_sysctl.c changes look reasonable to me. There are no strange
games being played with .extra1 or .extra2. All I see happening
in mq_sysctl.c is boiler plate to make things work.

Looking at ipc_sysctl.c I agree playing games with .extra1 and .extra2
is ugly and error prone. Worse I noticed when reading it through that
proc_ipc_sem_dointvec passes sem_check_semmni current->nsproxy->ipc_ns
instead of passing the computed ns value. A bug but not a regression.

I went through the code and we don't need to play games with .extra1
to get the namespace value all we need to call container_of with
the .data member. That takes a little extra boiler plate especially
for the checkpoint_restore case.

The checkpoint_restore sysctls arguably need to be done differently so
that the permissions can be checked at open time instead of at write
time. That would eliminate the need to play games with foobar.

>
> Anyway, I really think we must not make that sysctl code even uglier
> than it is today, and I think we need to move towards a model that
> actually makes sense. And that "pass in the right cred" is the only
> sensible model I can see.

I don't see how passing in struct cred does anything helpful. The
namespace can not be found in struct cred.

Now this set of changes is a setup to allow implementing .set_ownership
and .permission methods so that we can allow the namespace root to write
to sysctls it is safe for the namespace root to be able to write to.
Even there I don't see how passing in a struct cred would help.

Given that there are no regressions I don't see it even being helpful
to not merge this code.

I did look and there are some cleanup pretty significant cleanup
opportunities, by using container_of on .data, which avoids the need
to stuff a namespace parameter in .extra1.

There are also significant cleanup opportunities in implementing a
.permission method that will allow the checkpoint_restart sysctls
to perform all of their permission checks at open time, and not
need any other special code.

But all of these cleanups I see are moving forward from the current
point so I don't see why we would not want to merge the code as is
and then get the tested versions of my cleanups below.



ipc/ipc_sysctl.c | 81 +++++++++++++++++++++++++-------------------------------
1 file changed, 36 insertions(+), 45 deletions(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 15210ac47e9e..e2209d48db04 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -19,16 +19,11 @@
static int proc_ipc_dointvec_minmax_orphans(struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
{
- struct ipc_namespace *ns = table->extra1;
- struct ctl_table ipc_table;
+ struct ipc_namespace *ns =
+ container_of(table->data, struct ipc_namespace, shm_rmid_forced);
int err;

- memcpy(&ipc_table, table, sizeof(ipc_table));
-
- ipc_table.extra1 = SYSCTL_ZERO;
- ipc_table.extra2 = SYSCTL_ONE;
-
- err = proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos);
+ err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);

if (err < 0)
return err;
@@ -55,20 +50,15 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int write,
static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
void *buffer, size_t *lenp, loff_t *ppos)
{
- struct ipc_namespace *ns = table->extra1;
- struct ctl_table ipc_table;
+ struct ipc_namespace *ns =
+ container_of(table->data, struct ipc_namespace, sem_ctls);
int ret, semmni;

- memcpy(&ipc_table, table, sizeof(ipc_table));
-
- ipc_table.extra1 = NULL;
- ipc_table.extra2 = NULL;
-
semmni = ns->sem_ctls[3];
ret = proc_dointvec(table, write, buffer, lenp, ppos);

if (!ret)
- ret = sem_check_semmni(current->nsproxy->ipc_ns);
+ ret = sem_check_semmni(ns);

/*
* Reset the semmni value if an error happens.
@@ -78,24 +68,6 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
return ret;
}

-#ifdef CONFIG_CHECKPOINT_RESTORE
-static int proc_ipc_dointvec_minmax_checkpoint_restore(struct ctl_table *table,
- int write, void *buffer, size_t *lenp, loff_t *ppos)
-{
- struct ipc_namespace *ns = table->extra1;
- struct ctl_table ipc_table;
-
- if (write && !checkpoint_restore_ns_capable(ns->user_ns))
- return -EPERM;
-
- memcpy(&ipc_table, table, sizeof(ipc_table));
-
- ipc_table.extra1 = SYSCTL_ZERO;
- ipc_table.extra2 = SYSCTL_INT_MAX;
-
- return proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos);
-}
-#endif

int ipc_mni = IPCMNI;
int ipc_mni_shift = IPCMNI_SHIFT;
@@ -131,6 +103,8 @@ static struct ctl_table ipc_sysctls[] = {
.maxlen = sizeof(init_ipc_ns.shm_rmid_forced),
.mode = 0644,
.proc_handler = proc_ipc_dointvec_minmax_orphans,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_ONE,
},
{
.procname = "msgmax",
@@ -180,22 +154,28 @@ static struct ctl_table ipc_sysctls[] = {
.procname = "sem_next_id",
.data = &init_ipc_ns.ids[IPC_SEM_IDS].next_id,
.maxlen = sizeof(init_ipc_ns.ids[IPC_SEM_IDS].next_id),
- .mode = 0666,
- .proc_handler = proc_ipc_dointvec_minmax_checkpoint_restore,
+ .mode = 0444,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_INT_MAX,
},
{
.procname = "msg_next_id",
.data = &init_ipc_ns.ids[IPC_MSG_IDS].next_id,
.maxlen = sizeof(init_ipc_ns.ids[IPC_MSG_IDS].next_id),
- .mode = 0666,
- .proc_handler = proc_ipc_dointvec_minmax_checkpoint_restore,
+ .mode = 0444,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_INT_MAX,
},
{
.procname = "shm_next_id",
.data = &init_ipc_ns.ids[IPC_SHM_IDS].next_id,
.maxlen = sizeof(init_ipc_ns.ids[IPC_SHM_IDS].next_id),
- .mode = 0666,
- .proc_handler = proc_ipc_dointvec_minmax_checkpoint_restore,
+ .mode = 0444,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_INT_MAX,
},
#endif
{}
@@ -211,8 +191,24 @@ static int set_is_seen(struct ctl_table_set *set)
return &current->nsproxy->ipc_ns->ipc_set == set;
}

+static int ipc_permissions(struct ctl_table_header *head, struct ctl_table *table)
+{
+ struct ipc_namespace *ns = current->nsproxy->ipc_ns;
+ int mode = table->mode;
+
+#ifdef CONFIG_CHECKPOINT_RESTORE
+ if (((table->data == &ns->ids[IPC_SEM_IDS].next_id) ||
+ (table->data == &ns->ids[IPC_MSG_IDS].next_id) ||
+ (table->data == &ns->ids[IPC_SHM_IDS].next_id)) &&
+ checkpoint_restore_ns_capable(ns->user_ns))
+ mode = 0666;
+#endif
+ return mode;
+}
+
static struct ctl_table_root set_root = {
.lookup = set_lookup,
+ .permissions = ipc_permissions,
};

bool setup_ipc_sysctls(struct ipc_namespace *ns)
@@ -237,7 +233,6 @@ bool setup_ipc_sysctls(struct ipc_namespace *ns)

} else if (tbl[i].data == &init_ipc_ns.shm_rmid_forced) {
tbl[i].data = &ns->shm_rmid_forced;
- tbl[i].extra1 = ns;

} else if (tbl[i].data == &init_ipc_ns.msg_ctlmax) {
tbl[i].data = &ns->msg_ctlmax;
@@ -250,19 +245,15 @@ bool setup_ipc_sysctls(struct ipc_namespace *ns)

} else if (tbl[i].data == &init_ipc_ns.sem_ctls) {
tbl[i].data = &ns->sem_ctls;
- tbl[i].extra1 = ns;
#ifdef CONFIG_CHECKPOINT_RESTORE
} else if (tbl[i].data == &init_ipc_ns.ids[IPC_SEM_IDS].next_id) {
tbl[i].data = &ns->ids[IPC_SEM_IDS].next_id;
- tbl[i].extra1 = ns;

} else if (tbl[i].data == &init_ipc_ns.ids[IPC_MSG_IDS].next_id) {
tbl[i].data = &ns->ids[IPC_MSG_IDS].next_id;
- tbl[i].extra1 = ns;

} else if (tbl[i].data == &init_ipc_ns.ids[IPC_SHM_IDS].next_id) {
tbl[i].data = &ns->ids[IPC_SHM_IDS].next_id;
- tbl[i].extra1 = ns;
#endif
} else {
tbl[i].data = NULL;



Eric

2022-03-25 19:23:50

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [GIT PULL] ipc: Bind to the ipc namespace at open time.

On Thu, Mar 24, 2022 at 11:12:21AM -0700, Linus Torvalds wrote:
> On Wed, Mar 23, 2022 at 1:24 PM Eric W. Biederman <[email protected]> wrote:
> >
> > Please pull the per-namespace-ipc-sysctls-for-v5.18 tag from the git tree:
>
> Ugh.
>
> I pulled this. Then I stared at it for a long time.
>
> And then I decided that this is too ugly to live.
>
> I'm sorry. I think Alexey has probably spent a fair amount of time on
> it, but I really think the sysctl code needs to be cleaned up way more
> than this.

Apparently it's my fault that the purpose of these changes is not clear. I
did this refactoring not for the sake of refactoring as such.

In my original patch [1], I was trying to fix the situation where the user
cannot change the /proc/sys/fs/mqueue/* options inside rootless container.

But then I split the changes into refactoring which fixes the hack and
permission changes which I wanted to discuss and propose later.

[1] https://lore.kernel.org/lkml/0f0408bb7fbf3187966a9bf19a98642a5d9669fe.1641225760.git.legion@kernel.org/

> The old code was horribly hacky too, but that setup_ipc_sysctls() (and
> setup_mq_sysctls()) thing that copies the whole sysctls table, and
> then walks it entry by entry to modify it, is just too ugly for words.
>
> And then it hides things in .extra1, and because it does that it can't
> use the normal "extra1 and extra2 contains the limits" so then at
> write() time it copies it into a local one AGAIN only to set the
> limits back so that it can call the normal routines again.
>
> Not ok.
>
> Yes, yes, the old code did some similar things - to set the 'data'
> pointer. That was disgusting too. Don't get me wrong - the existing
> code was nasty too. But this took nasty code, and doubled down on it.
>
> I really think this is a fundamental problem, and needs a more
> fundamental fix than adding more and more of these kinds of nasty
> hacks.
>
> And yes, that fundamental fix is almost certainly to pass in 'struct
> cred *' to all those sysctl helper functions.
>
> Then, when you do the actual 'sysctl()' system calls, you pass in
> 'current_cred()".
>
> And the /proc users would pass in file->f_cred.
>
> And yes, that patch might be quite messy, because we have quite a lot
> of those random .proc_handler users.
>
> But *most* of them by far (at least in random code) use the standard
> proc_dointvec_minmax() and friends, and won't even notice.
>
> And then the ones that are about namespace issues will have to
> continue to do the nasty "make a copy and update the data pointer",
> but *MAYBE* we could also introduce the notion of an "offset" to those
> proc_dointvec_minmax() things to help them out (and at least avoid the
> "make a copy" thing).
>
> Anyway, I really think we must not make that sysctl code even uglier
> than it is today, and I think we need to move towards a model that
> actually makes sense. And that "pass in the right cred" is the only
> sensible model I can see.
>
> I haven't tried to create such a patch, and maybe Alexey already tried
> to do something like that and it turned out to be too ugly for words
> and that's why these nasty patches happened.
>
> But at least for now, I can't with a good conscience pull this.
>
> Sorry,
> Linus

OK. I'll try to come up with some other solution.

--
Rgrds, legion

2022-03-25 19:43:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ipc: Bind to the ipc namespace at open time.

On Thu, Mar 24, 2022 at 2:49 PM Eric W. Biederman <[email protected]> wrote:
>
> I don't see how passing in struct cred does anything helpful. The
> namespace can not be found in struct cred.

Duh. For some reason I thought we had it there, but yeah, that only
contains the user-namespace.

The rest is in tsk->nsproxy.

So it would have to be squirrelled away in some other way.

> But all of these cleanups I see are moving forward from the current
> point so I don't see why we would not want to merge the code as is
> and then get the tested versions of my cleanups below.

What part of "that code is too ugly to live" did not end up registering.

Maybe it can be cleaned up. But it had better be cleaned up before I pull it.

Linus

2022-04-22 19:54:18

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v1 0/4] ipc: Remove extra1 field abuse to pass ipc namespace

These patches are made on top of branch:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git per-namespace-ipc-sysctls-for-v5.18

After discussion, I took into account Eric Biederman's fixes. With these
changes, the hacks of passing namespace through .extra1 are not required.

With the current design of sysctl dynamic memory allocation is
necessary.

Yes, Linus, these changes are not the refactoring you were talking
about, but I plan to try to do such a refactoring in the my next
patchset.

--

Alexey Gladkov (4):
ipc: Remove extra1 field abuse to pass ipc namespace
ipc: Use proper ipc namespace
ipc: Check permissions for checkpoint_restart sysctls at open time
ipc: Remove extra braces

ipc/ipc_sysctl.c | 108 +++++++++++++++++++++--------------------------
1 file changed, 49 insertions(+), 59 deletions(-)

--
2.33.2

2022-04-22 20:48:52

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v1 3/4] ipc: Check permissions for checkpoint_restart sysctls at open time

As Eric Biederman pointed out, it is possible not to use a custom
proc_handler and check permissions for every write, but to use a
.permission handler. That will allow the checkpoint_restart sysctls to
perform all of their permission checks at open time, and not need any
other special code.

Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Alexey Gladkov <[email protected]>
---
ipc/ipc_sysctl.c | 54 ++++++++++++++++++++++++++----------------------
1 file changed, 29 insertions(+), 25 deletions(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index ff99d0305a5b..5a58598d48c8 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -68,25 +68,6 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
return ret;
}

-#ifdef CONFIG_CHECKPOINT_RESTORE
-static int proc_ipc_dointvec_minmax_checkpoint_restore(struct ctl_table *table,
- int write, void *buffer, size_t *lenp, loff_t *ppos)
-{
- struct ipc_namespace *ns = current->nsproxy->ipc_ns;
- struct ctl_table ipc_table;
-
- if (write && !checkpoint_restore_ns_capable(ns->user_ns))
- return -EPERM;
-
- memcpy(&ipc_table, table, sizeof(ipc_table));
-
- ipc_table.extra1 = SYSCTL_ZERO;
- ipc_table.extra2 = SYSCTL_INT_MAX;
-
- return proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos);
-}
-#endif
-
int ipc_mni = IPCMNI;
int ipc_mni_shift = IPCMNI_SHIFT;
int ipc_min_cycle = RADIX_TREE_MAP_SIZE;
@@ -172,22 +153,28 @@ static struct ctl_table ipc_sysctls[] = {
.procname = "sem_next_id",
.data = &init_ipc_ns.ids[IPC_SEM_IDS].next_id,
.maxlen = sizeof(init_ipc_ns.ids[IPC_SEM_IDS].next_id),
- .mode = 0666,
- .proc_handler = proc_ipc_dointvec_minmax_checkpoint_restore,
+ .mode = 0444,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_INT_MAX,
},
{
.procname = "msg_next_id",
.data = &init_ipc_ns.ids[IPC_MSG_IDS].next_id,
.maxlen = sizeof(init_ipc_ns.ids[IPC_MSG_IDS].next_id),
- .mode = 0666,
- .proc_handler = proc_ipc_dointvec_minmax_checkpoint_restore,
+ .mode = 0444,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_INT_MAX,
},
{
.procname = "shm_next_id",
.data = &init_ipc_ns.ids[IPC_SHM_IDS].next_id,
.maxlen = sizeof(init_ipc_ns.ids[IPC_SHM_IDS].next_id),
- .mode = 0666,
- .proc_handler = proc_ipc_dointvec_minmax_checkpoint_restore,
+ .mode = 0444,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_INT_MAX,
},
#endif
{}
@@ -203,8 +190,25 @@ static int set_is_seen(struct ctl_table_set *set)
return &current->nsproxy->ipc_ns->ipc_set == set;
}

+static int ipc_permissions(struct ctl_table_header *head, struct ctl_table *table)
+{
+ int mode = table->mode;
+
+#ifdef CONFIG_CHECKPOINT_RESTORE
+ struct ipc_namespace *ns = current->nsproxy->ipc_ns;
+
+ if (((table->data == &ns->ids[IPC_SEM_IDS].next_id) ||
+ (table->data == &ns->ids[IPC_MSG_IDS].next_id) ||
+ (table->data == &ns->ids[IPC_SHM_IDS].next_id)) &&
+ checkpoint_restore_ns_capable(ns->user_ns))
+ mode = 0666;
+#endif
+ return mode;
+}
+
static struct ctl_table_root set_root = {
.lookup = set_lookup,
+ .permissions = ipc_permissions,
};

bool setup_ipc_sysctls(struct ipc_namespace *ns)
--
2.33.2

2022-04-22 22:17:53

by Alexey Gladkov

[permalink] [raw]
Subject: [PATCH v1 4/4] ipc: Remove extra braces

Fix coding style. In the previous commit, I added braces because,
in addition to changing .data, .extra1 also changed. Now this is not
needed.

Signed-off-by: Alexey Gladkov <[email protected]>
---
ipc/ipc_sysctl.c | 25 ++++++++++++-------------
1 file changed, 12 insertions(+), 13 deletions(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 5a58598d48c8..ef313ecfb53a 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -222,42 +222,41 @@ bool setup_ipc_sysctls(struct ipc_namespace *ns)
int i;

for (i = 0; i < ARRAY_SIZE(ipc_sysctls); i++) {
- if (tbl[i].data == &init_ipc_ns.shm_ctlmax) {
+ if (tbl[i].data == &init_ipc_ns.shm_ctlmax)
tbl[i].data = &ns->shm_ctlmax;

- } else if (tbl[i].data == &init_ipc_ns.shm_ctlall) {
+ else if (tbl[i].data == &init_ipc_ns.shm_ctlall)
tbl[i].data = &ns->shm_ctlall;

- } else if (tbl[i].data == &init_ipc_ns.shm_ctlmni) {
+ else if (tbl[i].data == &init_ipc_ns.shm_ctlmni)
tbl[i].data = &ns->shm_ctlmni;

- } else if (tbl[i].data == &init_ipc_ns.shm_rmid_forced) {
+ else if (tbl[i].data == &init_ipc_ns.shm_rmid_forced)
tbl[i].data = &ns->shm_rmid_forced;

- } else if (tbl[i].data == &init_ipc_ns.msg_ctlmax) {
+ else if (tbl[i].data == &init_ipc_ns.msg_ctlmax)
tbl[i].data = &ns->msg_ctlmax;

- } else if (tbl[i].data == &init_ipc_ns.msg_ctlmni) {
+ else if (tbl[i].data == &init_ipc_ns.msg_ctlmni)
tbl[i].data = &ns->msg_ctlmni;

- } else if (tbl[i].data == &init_ipc_ns.msg_ctlmnb) {
+ else if (tbl[i].data == &init_ipc_ns.msg_ctlmnb)
tbl[i].data = &ns->msg_ctlmnb;

- } else if (tbl[i].data == &init_ipc_ns.sem_ctls) {
+ else if (tbl[i].data == &init_ipc_ns.sem_ctls)
tbl[i].data = &ns->sem_ctls;
#ifdef CONFIG_CHECKPOINT_RESTORE
- } else if (tbl[i].data == &init_ipc_ns.ids[IPC_SEM_IDS].next_id) {
+ else if (tbl[i].data == &init_ipc_ns.ids[IPC_SEM_IDS].next_id)
tbl[i].data = &ns->ids[IPC_SEM_IDS].next_id;

- } else if (tbl[i].data == &init_ipc_ns.ids[IPC_MSG_IDS].next_id) {
+ else if (tbl[i].data == &init_ipc_ns.ids[IPC_MSG_IDS].next_id)
tbl[i].data = &ns->ids[IPC_MSG_IDS].next_id;

- } else if (tbl[i].data == &init_ipc_ns.ids[IPC_SHM_IDS].next_id) {
+ else if (tbl[i].data == &init_ipc_ns.ids[IPC_SHM_IDS].next_id)
tbl[i].data = &ns->ids[IPC_SHM_IDS].next_id;
#endif
- } else {
+ else
tbl[i].data = NULL;
- }
}

ns->ipc_sysctls = __register_sysctl_table(&ns->ipc_set, "kernel", tbl);
--
2.33.2

2022-04-22 22:59:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH v1 0/4] ipc: Remove extra1 field abuse to pass ipc namespace

On Fri, Apr 22, 2022 at 5:53 AM Alexey Gladkov <[email protected]> wrote:
>
> Yes, Linus, these changes are not the refactoring you were talking
> about, but I plan to try to do such a refactoring in the my next
> patchset.

Heh. Ok, I'm not saying these patches are pretty, and looking up the
namespace thing is a bit subtle, but it's certainly prettier than the
existing odd "create a new ctl_table entry because of field abuse".

So this certainly looks like an improvement from a quick look.

Linus

2022-05-04 10:14:34

by Philip Rhoades

[permalink] [raw]
Subject: Re: [PATCH v1 0/4] ipc: Remove extra1 field abuse to pass ipc namespace

People,


On 2022-04-23 06:44, Linus Torvalds wrote:
> On Fri, Apr 22, 2022 at 5:53 AM Alexey Gladkov <[email protected]>
> wrote:
>>
>> Yes, Linus, these changes are not the refactoring you were talking
>> about, but I plan to try to do such a refactoring in the my next
>> patchset.
>
> Heh. Ok, I'm not saying these patches are pretty, and looking up the
> namespace thing is a bit subtle, but it's certainly prettier than the
> existing odd "create a new ctl_table entry because of field abuse".
>
> So this certainly looks like an improvement from a quick look.
>
> Linus


Thanks for that little glimpse into what happens with kernel development
- interesting!

Phil.
--
Philip Rhoades

PO Box 896
Cowra NSW 2794
Australia
E-mail: [email protected]

2022-06-01 19:29:38

by Alexey Gladkov

[permalink] [raw]
Subject: [RFC PATCH 0/4] API extension for handling sysctl

On Fri, Apr 22, 2022 at 01:44:50PM -0700, Linus Torvalds wrote:
> On Fri, Apr 22, 2022 at 5:53 AM Alexey Gladkov <[email protected]> wrote:
> >
> > Yes, Linus, these changes are not the refactoring you were talking
> > about, but I plan to try to do such a refactoring in the my next
> > patchset.
>
> Heh. Ok, I'm not saying these patches are pretty, and looking up the
> namespace thing is a bit subtle, but it's certainly prettier than the
> existing odd "create a new ctl_table entry because of field abuse".

As I promised, here is one of the possible options for how to get rid of dynamic
memory allocation.

We can slightly extend the API and thus be able to save data at the time the
file is opened. This will not only eliminate the need to allocate memory, but
also provide access to file struct and f_cred.

I made an RFC because I'm not sure that I did the permissions check for
ipc_sysctl. I also did not change all the places where this API can be applied
to make the patch smaller. As in the case of /proc/sys/kernel/printk where
CAP_SYS_ADMIN is checked[1] for the current process at the time of write.

I made a patchset on top of:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-next

Because there are my previous changes.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/printk/sysctl.c#n17

--

Alexey Gladkov (4):
sysctl: API extension for handling sysctl
sysctl: ipc: Do not use dynamic memory
sysctl: userns: Do not use dynamic memory
sysctl: mqueue: Do not use dynamic memory

fs/proc/proc_sysctl.c | 71 ++++++++--
include/linux/ipc_namespace.h | 35 -----
include/linux/sysctl.h | 20 ++-
include/linux/user_namespace.h | 6 -
ipc/ipc_sysctl.c | 236 +++++++++++++++++----------------
ipc/mq_sysctl.c | 138 ++++++++++---------
ipc/mqueue.c | 5 -
ipc/namespace.c | 10 --
kernel/ucount.c | 116 +++++++---------
kernel/user_namespace.c | 10 +-
10 files changed, 323 insertions(+), 324 deletions(-)

--
2.33.3


2022-06-01 21:09:53

by Alexey Gladkov

[permalink] [raw]
Subject: [RFC PATCH 1/4] sysctl: API extension for handling sysctl

This adds additional optional functions for handling open, read, and
write operations that can be customized for each sysctl file. It also
creates ctl_context that persists from opening to closing the file in
the /proc/sys.

The context allows us to store dynamic information at the time the file
is opened. This eliminates the need to duplicate ctl_table in order to
dynamically change .data, .extra1 or .extra2.

This API extends the existing one and does not require any changes to
already existing sysctl handlers.

Signed-off-by: Alexey Gladkov <[email protected]>
---
fs/proc/proc_sysctl.c | 71 +++++++++++++++++++++++++++++++++++-------
include/linux/sysctl.h | 20 ++++++++++--
2 files changed, 77 insertions(+), 14 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 7d9cfc730bd4..d3d43e738f01 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -560,6 +560,7 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
struct inode *inode = file_inode(iocb->ki_filp);
struct ctl_table_header *head = grab_header(inode);
struct ctl_table *table = PROC_I(inode)->sysctl_entry;
+ struct ctl_fops *fops = table->ctl_fops;
size_t count = iov_iter_count(iter);
char *kbuf;
ssize_t error;
@@ -577,7 +578,7 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,

/* if that can happen at all, it should be -EINVAL, not -EISDIR */
error = -EINVAL;
- if (!table->proc_handler)
+ if (!table->proc_handler && !fops)
goto out;

/* don't even try if the size is too large */
@@ -600,8 +601,20 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
if (error)
goto out_free_buf;

- /* careful: calling conventions are nasty here */
- error = table->proc_handler(table, write, kbuf, &count, &iocb->ki_pos);
+ if (fops) {
+ struct ctl_context *ctx = iocb->ki_filp->private_data;
+
+ if (write && fops->write)
+ error = fops->write(ctx, iocb->ki_filp, kbuf, &count, &iocb->ki_pos);
+ else if (!write && fops->read)
+ error = fops->read(ctx, iocb->ki_filp, kbuf, &count, &iocb->ki_pos);
+ else
+ error = -EINVAL;
+ } else {
+ /* careful: calling conventions are nasty here */
+ error = table->proc_handler(table, write, kbuf, &count, &iocb->ki_pos);
+ }
+
if (error)
goto out_free_buf;

@@ -634,17 +647,50 @@ static int proc_sys_open(struct inode *inode, struct file *filp)
{
struct ctl_table_header *head = grab_header(inode);
struct ctl_table *table = PROC_I(inode)->sysctl_entry;
+ struct ctl_context *ctx;
+ int ret = 0;

/* sysctl was unregistered */
if (IS_ERR(head))
return PTR_ERR(head);

- if (table->poll)
- filp->private_data = proc_sys_poll_event(table->poll);
+ ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+ if (!ctx)
+ return -ENOMEM;
+
+ ctx->table = table;
+ filp->private_data = ctx;
+
+ if (table->ctl_fops && table->ctl_fops->open)
+ ret = table->ctl_fops->open(ctx, inode, filp);
+
+ if (!ret && table->poll)
+ ctx->poll_event = proc_sys_poll_event(table->poll);

sysctl_head_finish(head);

- return 0;
+ return ret;
+}
+
+static int proc_sys_release(struct inode *inode, struct file *filp)
+{
+ struct ctl_table_header *head = grab_header(inode);
+ struct ctl_table *table = PROC_I(inode)->sysctl_entry;
+ struct ctl_context *ctx = filp->private_data;
+ int ret = 0;
+
+ if (IS_ERR(head))
+ return PTR_ERR(head);
+
+ if (table->ctl_fops && table->ctl_fops->release)
+ ret = table->ctl_fops->release(ctx, inode, filp);
+
+ sysctl_head_finish(head);
+
+ kfree(ctx);
+ filp->private_data = NULL;
+
+ return ret;
}

static __poll_t proc_sys_poll(struct file *filp, poll_table *wait)
@@ -653,23 +699,23 @@ static __poll_t proc_sys_poll(struct file *filp, poll_table *wait)
struct ctl_table_header *head = grab_header(inode);
struct ctl_table *table = PROC_I(inode)->sysctl_entry;
__poll_t ret = DEFAULT_POLLMASK;
- unsigned long event;
+ struct ctl_context *ctx;

/* sysctl was unregistered */
if (IS_ERR(head))
return EPOLLERR | EPOLLHUP;

- if (!table->proc_handler)
+ if (!table->proc_handler && !table->ctl_fops)
goto out;

if (!table->poll)
goto out;

- event = (unsigned long)filp->private_data;
+ ctx = filp->private_data;
poll_wait(filp, &table->poll->wait, wait);

- if (event != atomic_read(&table->poll->event)) {
- filp->private_data = proc_sys_poll_event(table->poll);
+ if (ctx->poll_event != atomic_read(&table->poll->event)) {
+ ctx->poll_event = proc_sys_poll_event(table->poll);
ret = EPOLLIN | EPOLLRDNORM | EPOLLERR | EPOLLPRI;
}

@@ -866,6 +912,7 @@ static int proc_sys_getattr(struct user_namespace *mnt_userns,

static const struct file_operations proc_sys_file_operations = {
.open = proc_sys_open,
+ .release = proc_sys_release,
.poll = proc_sys_poll,
.read_iter = proc_sys_read,
.write_iter = proc_sys_write,
@@ -1153,7 +1200,7 @@ static int sysctl_check_table(const char *path, struct ctl_table *table)
else
err |= sysctl_check_table_array(path, table);
}
- if (!table->proc_handler)
+ if (!table->proc_handler && !table->ctl_fops)
err |= sysctl_err(path, table, "No proc_handler");

if ((table->mode & (S_IRUGO|S_IWUGO)) != table->mode)
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 6353d6db69b2..ca5657c9fcb2 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -116,9 +116,9 @@ struct ctl_table_poll {
wait_queue_head_t wait;
};

-static inline void *proc_sys_poll_event(struct ctl_table_poll *poll)
+static inline unsigned long proc_sys_poll_event(struct ctl_table_poll *poll)
{
- return (void *)(unsigned long)atomic_read(&poll->event);
+ return (unsigned long)atomic_read(&poll->event);
}

#define __CTL_TABLE_POLL_INITIALIZER(name) { \
@@ -128,6 +128,21 @@ static inline void *proc_sys_poll_event(struct ctl_table_poll *poll)
#define DEFINE_CTL_TABLE_POLL(name) \
struct ctl_table_poll name = __CTL_TABLE_POLL_INITIALIZER(name)

+struct ctl_context {
+ struct ctl_table *table;
+ unsigned long poll_event;
+ void *ctl_data;
+};
+
+struct inode;
+
+struct ctl_fops {
+ int (*open) (struct ctl_context *, struct inode *, struct file *);
+ int (*release) (struct ctl_context *, struct inode *, struct file *);
+ ssize_t (*read) (struct ctl_context *, struct file *, char *, size_t *, loff_t *);
+ ssize_t (*write) (struct ctl_context *, struct file *, char *, size_t *, loff_t *);
+};
+
/* A sysctl table is an array of struct ctl_table: */
struct ctl_table {
const char *procname; /* Text ID for /proc/sys, or zero */
@@ -139,6 +154,7 @@ struct ctl_table {
struct ctl_table_poll *poll;
void *extra1;
void *extra2;
+ struct ctl_fops *ctl_fops;
} __randomize_layout;

struct ctl_node {
--
2.33.3


2022-06-01 21:13:23

by Alexey Gladkov

[permalink] [raw]
Subject: [RFC PATCH 4/4] sysctl: mqueue: Do not use dynamic memory

Dynamic memory allocation is needed to modify .data and specify the
per namespace parameter. The new sysctl API is allowed to get rid of
the need for such modification.

Signed-off-by: Alexey Gladkov <[email protected]>
---
include/linux/ipc_namespace.h | 17 -----
ipc/mq_sysctl.c | 138 +++++++++++++++++++---------------
ipc/mqueue.c | 5 --
ipc/namespace.c | 6 --
4 files changed, 79 insertions(+), 87 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index 51c2c247c447..d20753093a2c 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -174,21 +174,4 @@ static inline void put_ipc_ns(struct ipc_namespace *ns)
}
#endif

-#ifdef CONFIG_POSIX_MQUEUE_SYSCTL
-
-void retire_mq_sysctls(struct ipc_namespace *ns);
-bool setup_mq_sysctls(struct ipc_namespace *ns);
-
-#else /* CONFIG_POSIX_MQUEUE_SYSCTL */
-
-static inline void retire_mq_sysctls(struct ipc_namespace *ns)
-{
-}
-
-static inline bool setup_mq_sysctls(struct ipc_namespace *ns)
-{
- return true;
-}
-
-#endif /* CONFIG_POSIX_MQUEUE_SYSCTL */
#endif
diff --git a/ipc/mq_sysctl.c b/ipc/mq_sysctl.c
index fbf6a8b93a26..08ff7dfb721c 100644
--- a/ipc/mq_sysctl.c
+++ b/ipc/mq_sysctl.c
@@ -13,6 +13,45 @@
#include <linux/capability.h>
#include <linux/slab.h>

+static inline void *data_from_ns(struct ctl_context *ctx, struct ctl_table *table);
+
+static int mq_sys_open(struct ctl_context *ctx, struct inode *inode, struct file *file)
+{
+ ctx->ctl_data = current->nsproxy->ipc_ns;
+ return 0;
+}
+
+static ssize_t mq_sys_read(struct ctl_context *ctx, struct file *file,
+ char *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table table = *ctx->table;
+ table.data = data_from_ns(ctx, ctx->table);
+ return table.proc_handler(&table, 0, buffer, lenp, ppos);
+}
+
+static ssize_t mq_sys_write(struct ctl_context *ctx, struct file *file,
+ char *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table table = *ctx->table;
+ table.data = data_from_ns(ctx, ctx->table);
+ return table.proc_handler(&table, 1, buffer, lenp, ppos);
+}
+
+static struct ctl_fops mq_sys_fops = {
+ .open = mq_sys_open,
+ .read = mq_sys_read,
+ .write = mq_sys_write,
+};
+
+enum {
+ MQ_SYSCTL_QUEUES_MAX,
+ MQ_SYSCTL_MSG_MAX,
+ MQ_SYSCTL_MSGSIZE_MAX,
+ MQ_SYSCTL_MSG_DEFAULT,
+ MQ_SYSCTL_MSGSIZE_DEFAULT,
+ MQ_SYSCTL_COUNTS
+};
+
static int msg_max_limit_min = MIN_MSGMAX;
static int msg_max_limit_max = HARD_MSGMAX;

@@ -20,14 +59,15 @@ static int msg_maxsize_limit_min = MIN_MSGSIZEMAX;
static int msg_maxsize_limit_max = HARD_MSGSIZEMAX;

static struct ctl_table mq_sysctls[] = {
- {
+ [MQ_SYSCTL_QUEUES_MAX] = {
.procname = "queues_max",
.data = &init_ipc_ns.mq_queues_max,
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = proc_dointvec,
+ .ctl_fops = &mq_sys_fops,
},
- {
+ [MQ_SYSCTL_MSG_MAX] = {
.procname = "msg_max",
.data = &init_ipc_ns.mq_msg_max,
.maxlen = sizeof(int),
@@ -35,8 +75,9 @@ static struct ctl_table mq_sysctls[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = &msg_max_limit_min,
.extra2 = &msg_max_limit_max,
+ .ctl_fops = &mq_sys_fops,
},
- {
+ [MQ_SYSCTL_MSGSIZE_MAX] = {
.procname = "msgsize_max",
.data = &init_ipc_ns.mq_msgsize_max,
.maxlen = sizeof(int),
@@ -44,8 +85,9 @@ static struct ctl_table mq_sysctls[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = &msg_maxsize_limit_min,
.extra2 = &msg_maxsize_limit_max,
+ .ctl_fops = &mq_sys_fops,
},
- {
+ [MQ_SYSCTL_MSG_DEFAULT] = {
.procname = "msg_default",
.data = &init_ipc_ns.mq_msg_default,
.maxlen = sizeof(int),
@@ -53,8 +95,9 @@ static struct ctl_table mq_sysctls[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = &msg_max_limit_min,
.extra2 = &msg_max_limit_max,
+ .ctl_fops = &mq_sys_fops,
},
- {
+ [MQ_SYSCTL_MSGSIZE_DEFAULT] = {
.procname = "msgsize_default",
.data = &init_ipc_ns.mq_msgsize_default,
.maxlen = sizeof(int),
@@ -62,70 +105,47 @@ static struct ctl_table mq_sysctls[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = &msg_maxsize_limit_min,
.extra2 = &msg_maxsize_limit_max,
+ .ctl_fops = &mq_sys_fops,
},
{}
};

-static struct ctl_table_set *set_lookup(struct ctl_table_root *root)
+static inline void *data_from_ns(struct ctl_context *ctx, struct ctl_table *table)
{
- return &current->nsproxy->ipc_ns->mq_set;
+ struct ipc_namespace *ns = ctx->ctl_data;
+
+ switch (ctx->table - mq_sysctls) {
+ case MQ_SYSCTL_QUEUES_MAX: return &ns->mq_queues_max;
+ case MQ_SYSCTL_MSG_MAX: return &ns->mq_msg_max;
+ case MQ_SYSCTL_MSGSIZE_MAX: return &ns->mq_msgsize_max;
+ case MQ_SYSCTL_MSG_DEFAULT: return &ns->mq_msg_default;
+ case MQ_SYSCTL_MSGSIZE_DEFAULT: return &ns->mq_msgsize_default;
+ }
+ return NULL;
}

-static int set_is_seen(struct ctl_table_set *set)
-{
- return &current->nsproxy->ipc_ns->mq_set == set;
-}
+static struct ctl_table mq_sysctl_dir[] = {
+ {
+ .procname = "mqueue",
+ .mode = 0555,
+ .child = mq_sysctls,
+ },
+ {}
+};

-static struct ctl_table_root set_root = {
- .lookup = set_lookup,
+static struct ctl_table mq_sysctl_root[] = {
+ {
+ .procname = "fs",
+ .mode = 0555,
+ .child = mq_sysctl_dir,
+ },
+ {}
};

-bool setup_mq_sysctls(struct ipc_namespace *ns)
+static int __init mq_sysctl_init(void)
{
- struct ctl_table *tbl;
-
- setup_sysctl_set(&ns->mq_set, &set_root, set_is_seen);
-
- tbl = kmemdup(mq_sysctls, sizeof(mq_sysctls), GFP_KERNEL);
- if (tbl) {
- int i;
-
- for (i = 0; i < ARRAY_SIZE(mq_sysctls); i++) {
- if (tbl[i].data == &init_ipc_ns.mq_queues_max)
- tbl[i].data = &ns->mq_queues_max;
-
- else if (tbl[i].data == &init_ipc_ns.mq_msg_max)
- tbl[i].data = &ns->mq_msg_max;
-
- else if (tbl[i].data == &init_ipc_ns.mq_msgsize_max)
- tbl[i].data = &ns->mq_msgsize_max;
-
- else if (tbl[i].data == &init_ipc_ns.mq_msg_default)
- tbl[i].data = &ns->mq_msg_default;
-
- else if (tbl[i].data == &init_ipc_ns.mq_msgsize_default)
- tbl[i].data = &ns->mq_msgsize_default;
- else
- tbl[i].data = NULL;
- }
-
- ns->mq_sysctls = __register_sysctl_table(&ns->mq_set, "fs/mqueue", tbl);
- }
- if (!ns->mq_sysctls) {
- kfree(tbl);
- retire_sysctl_set(&ns->mq_set);
- return false;
- }
-
- return true;
+ register_sysctl_table(mq_sysctl_root);
+ return 0;
}

-void retire_mq_sysctls(struct ipc_namespace *ns)
-{
- struct ctl_table *tbl;
-
- tbl = ns->mq_sysctls->ctl_table_arg;
- unregister_sysctl_table(ns->mq_sysctls);
- retire_sysctl_set(&ns->mq_set);
- kfree(tbl);
-}
+device_initcall(mq_sysctl_init);
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index c0f24cc9f619..ffb79a24d70b 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -1711,11 +1711,6 @@ static int __init init_mqueue_fs(void)
if (mqueue_inode_cachep == NULL)
return -ENOMEM;

- if (!setup_mq_sysctls(&init_ipc_ns)) {
- pr_warn("sysctl registration failed\n");
- return -ENOMEM;
- }
-
error = register_filesystem(&mqueue_fs_type);
if (error)
goto out_sysctl;
diff --git a/ipc/namespace.c b/ipc/namespace.c
index f760243ca685..ae83f0f2651b 100644
--- a/ipc/namespace.c
+++ b/ipc/namespace.c
@@ -59,10 +59,6 @@ static struct ipc_namespace *create_ipc_ns(struct user_namespace *user_ns,
if (err)
goto fail_put;

- err = -ENOMEM;
- if (!setup_mq_sysctls(ns))
- goto fail_put;
-
sem_init_ns(ns);
msg_init_ns(ns);
shm_init_ns(ns);
@@ -129,8 +125,6 @@ static void free_ipc_ns(struct ipc_namespace *ns)
msg_exit_ns(ns);
shm_exit_ns(ns);

- retire_mq_sysctls(ns);
-
dec_ipc_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
--
2.33.3


2022-06-01 21:21:46

by Alexey Gladkov

[permalink] [raw]
Subject: [RFC PATCH 3/4] sysctl: userns: Do not use dynamic memory

Dynamic memory allocation is needed to modify .data and specify the
per namespace parameter. The new sysctl API is allowed to get rid of
the need for such modification.

Signed-off-by: Alexey Gladkov <[email protected]>
---
include/linux/user_namespace.h | 6 --
kernel/ucount.c | 116 +++++++++++++--------------------
kernel/user_namespace.c | 10 +--
3 files changed, 46 insertions(+), 86 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 45f09bec02c4..7b134516e5cb 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -95,10 +95,6 @@ struct user_namespace {
struct key *persistent_keyring_register;
#endif
struct work_struct work;
-#ifdef CONFIG_SYSCTL
- struct ctl_table_set set;
- struct ctl_table_header *sysctls;
-#endif
struct ucounts *ucounts;
long ucount_max[UCOUNT_COUNTS];
long rlimit_max[UCOUNT_RLIMIT_COUNTS];
@@ -116,8 +112,6 @@ struct ucounts {
extern struct user_namespace init_user_ns;
extern struct ucounts init_ucounts;

-bool setup_userns_sysctls(struct user_namespace *ns);
-void retire_userns_sysctls(struct user_namespace *ns);
struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
struct ucounts *alloc_ucounts(struct user_namespace *ns, kuid_t uid);
diff --git a/kernel/ucount.c b/kernel/ucount.c
index ee8e57fd6f90..4a5072671847 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -7,6 +7,7 @@
#include <linux/hash.h>
#include <linux/kmemleak.h>
#include <linux/user_namespace.h>
+#include <linux/fs.h>

struct ucounts init_ucounts = {
.ns = &init_user_ns,
@@ -26,38 +27,20 @@ static DEFINE_SPINLOCK(ucounts_lock);


#ifdef CONFIG_SYSCTL
-static struct ctl_table_set *
-set_lookup(struct ctl_table_root *root)
-{
- return &current_user_ns()->set;
-}
-
-static int set_is_seen(struct ctl_table_set *set)
-{
- return &current_user_ns()->set == set;
-}
-
-static int set_permissions(struct ctl_table_header *head,
- struct ctl_table *table)
-{
- struct user_namespace *user_ns =
- container_of(head->set, struct user_namespace, set);
- int mode;
-
- /* Allow users with CAP_SYS_RESOURCE unrestrained access */
- if (ns_capable(user_ns, CAP_SYS_RESOURCE))
- mode = (table->mode & S_IRWXU) >> 6;
- else
- /* Allow all others at most read-only access */
- mode = table->mode & S_IROTH;
- return (mode << 6) | (mode << 3) | mode;
-}
-
-static struct ctl_table_root set_root = {
- .lookup = set_lookup,
- .permissions = set_permissions,
+static int user_sys_open(struct ctl_context *ctx, struct inode *inode,
+ struct file *file);
+static ssize_t user_sys_read(struct ctl_context *ctx, struct file *file,
+ char *buffer, size_t *lenp, loff_t *ppos);
+static ssize_t user_sys_write(struct ctl_context *ctx, struct file *file,
+ char *buffer, size_t *lenp, loff_t *ppos);
+
+static struct ctl_fops user_sys_fops = {
+ .open = user_sys_open,
+ .read = user_sys_read,
+ .write = user_sys_write,
};

+static long ue_dummy = 0;
static long ue_zero = 0;
static long ue_int_max = INT_MAX;

@@ -66,9 +49,11 @@ static long ue_int_max = INT_MAX;
.procname = name, \
.maxlen = sizeof(long), \
.mode = 0644, \
+ .data = &ue_dummy, \
.proc_handler = proc_doulongvec_minmax, \
.extra1 = &ue_zero, \
.extra2 = &ue_int_max, \
+ .ctl_fops = &user_sys_fops, \
}
static struct ctl_table user_table[] = {
UCOUNT_ENTRY("max_user_namespaces"),
@@ -89,44 +74,43 @@ static struct ctl_table user_table[] = {
#endif
{ }
};
-#endif /* CONFIG_SYSCTL */

-bool setup_userns_sysctls(struct user_namespace *ns)
+static int user_sys_open(struct ctl_context *ctx, struct inode *inode, struct file *file)
{
-#ifdef CONFIG_SYSCTL
- struct ctl_table *tbl;
-
- BUILD_BUG_ON(ARRAY_SIZE(user_table) != UCOUNT_COUNTS + 1);
- setup_sysctl_set(&ns->set, &set_root, set_is_seen);
- tbl = kmemdup(user_table, sizeof(user_table), GFP_KERNEL);
- if (tbl) {
- int i;
- for (i = 0; i < UCOUNT_COUNTS; i++) {
- tbl[i].data = &ns->ucount_max[i];
- }
- ns->sysctls = __register_sysctl_table(&ns->set, "user", tbl);
- }
- if (!ns->sysctls) {
- kfree(tbl);
- retire_sysctl_set(&ns->set);
- return false;
- }
-#endif
- return true;
+ /* Allow users with CAP_SYS_RESOURCE unrestrained access */
+ if ((file->f_mode & FMODE_WRITE) &&
+ !ns_capable(file->f_cred->user_ns, CAP_SYS_RESOURCE))
+ return -EPERM;
+ return 0;
}

-void retire_userns_sysctls(struct user_namespace *ns)
+static ssize_t user_sys_read(struct ctl_context *ctx, struct file *file,
+ char *buffer, size_t *lenp, loff_t *ppos)
{
-#ifdef CONFIG_SYSCTL
- struct ctl_table *tbl;
+ struct ctl_table table = *ctx->table;
+ table.data = &file->f_cred->user_ns->ucount_max[ctx->table - user_table];
+ return table.proc_handler(&table, 0, buffer, lenp, ppos);
+}

- tbl = ns->sysctls->ctl_table_arg;
- unregister_sysctl_table(ns->sysctls);
- retire_sysctl_set(&ns->set);
- kfree(tbl);
-#endif
+static ssize_t user_sys_write(struct ctl_context *ctx, struct file *file,
+ char *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct ctl_table table = *ctx->table;
+ table.data = &file->f_cred->user_ns->ucount_max[ctx->table - user_table];
+ return table.proc_handler(&table, 1, buffer, lenp, ppos);
}

+static struct ctl_table user_root_table[] = {
+ {
+ .procname = "user",
+ .mode = 0555,
+ .child = user_table,
+ },
+ {}
+};
+
+#endif /* CONFIG_SYSCTL */
+
static struct ucounts *find_ucounts(struct user_namespace *ns, kuid_t uid, struct hlist_head *hashent)
{
struct ucounts *ucounts;
@@ -357,17 +341,7 @@ bool is_rlimit_overlimit(struct ucounts *ucounts, enum rlimit_type type, unsigne
static __init int user_namespace_sysctl_init(void)
{
#ifdef CONFIG_SYSCTL
- static struct ctl_table_header *user_header;
- static struct ctl_table empty[1];
- /*
- * It is necessary to register the user directory in the
- * default set so that registrations in the child sets work
- * properly.
- */
- user_header = register_sysctl("user", empty);
- kmemleak_ignore(user_header);
- BUG_ON(!user_header);
- BUG_ON(!setup_userns_sysctls(&init_user_ns));
+ register_sysctl_table(user_root_table);
#endif
hlist_add_ucounts(&init_ucounts);
inc_rlimit_ucounts(&init_ucounts, UCOUNT_RLIMIT_NPROC, 1);
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 981bb2d10d83..c0e707bc9a31 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -149,17 +149,10 @@ int create_user_ns(struct cred *new)
INIT_LIST_HEAD(&ns->keyring_name_list);
init_rwsem(&ns->keyring_sem);
#endif
- ret = -ENOMEM;
- if (!setup_userns_sysctls(ns))
- goto fail_keyring;

set_cred_user_ns(new, ns);
return 0;
-fail_keyring:
-#ifdef CONFIG_PERSISTENT_KEYRINGS
- key_put(ns->persistent_keyring_register);
-#endif
- ns_free_inum(&ns->ns);
+
fail_free:
kmem_cache_free(user_ns_cachep, ns);
fail_dec:
@@ -208,7 +201,6 @@ static void free_user_ns(struct work_struct *work)
kfree(ns->projid_map.forward);
kfree(ns->projid_map.reverse);
}
- retire_userns_sysctls(ns);
key_free_user_ns(ns);
ns_free_inum(&ns->ns);
kmem_cache_free(user_ns_cachep, ns);
--
2.33.3


2022-06-01 21:34:25

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] sysctl: API extension for handling sysctl

On Wed, Jun 01, 2022 at 03:20:29PM +0200, Alexey Gladkov wrote:
> +struct ctl_fops {
> + int (*open) (struct ctl_context *, struct inode *, struct file *);
> + int (*release) (struct ctl_context *, struct inode *, struct file *);
> + ssize_t (*read) (struct ctl_context *, struct file *, char *, size_t *, loff_t *);
> + ssize_t (*write) (struct ctl_context *, struct file *, char *, size_t *, loff_t *);
> +};

Why not pass the iocb in ->read and ->write? We're still regretting not
doing that with file_operations.

2022-06-01 21:34:44

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] sysctl: API extension for handling sysctl

On Wed, Jun 01, 2022 at 12:23:06PM -0700, Linus Torvalds wrote:
> On Wed, Jun 1, 2022 at 12:19 PM Matthew Wilcox <[email protected]> wrote:
> >
> > Why not pass the iocb in ->read and ->write? We're still regretting not
> > doing that with file_operations.
>
> No, all the actual "io" is done by the caller.
>
> There is no way in hell I want the sysctl callbacks to actually
> possibly do user space accesses etc.
>
> They get a kernel buffer that has already been set up. There is no
> iocb or iovec left for them.

I wasn't suggesting the iovec. Just the iocb, instead of passing in the
ki_filp and the ki_pos.

> (That also means that they can take whatever locks they need,
> including spinlocks, because there's not going to be any random user
> accesses or complex pipe buffer lookups or whatever).
>
> Linus

2022-06-01 21:38:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] sysctl: API extension for handling sysctl

On Wed, Jun 1, 2022 at 12:19 PM Matthew Wilcox <[email protected]> wrote:
>
> Why not pass the iocb in ->read and ->write? We're still regretting not
> doing that with file_operations.

No, all the actual "io" is done by the caller.

There is no way in hell I want the sysctl callbacks to actually
possibly do user space accesses etc.

They get a kernel buffer that has already been set up. There is no
iocb or iovec left for them.

(That also means that they can take whatever locks they need,
including spinlocks, because there's not going to be any random user
accesses or complex pipe buffer lookups or whatever).

Linus

2022-06-01 21:40:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] sysctl: API extension for handling sysctl

On Wed, Jun 1, 2022 at 12:25 PM Matthew Wilcox <[email protected]> wrote:
>
> I wasn't suggesting the iovec. Just the iocb, instead of passing in the
> ki_filp and the ki_pos.

I guess that would work, but would it actually help anything?

I think it's a lot more obvious when it just looks mostly like a
"read/write()" system call (ok, pread/pwrite since you get the pos,
but still).

There is nothing that could possibly be relevant in the kiocb. All
those fields are about odd async or atomicity issues (or "report
completion" issues) that simply cannot possibly be valid for a sysctl
value.

And if they are, that sysctl value is doing something really really
really wrong. So we absolutely don't want to encourage that.

This is not a "do IO" interface. This is a "read or write value" interface.

Linus

2022-06-01 21:40:29

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [RFC PATCH 1/4] sysctl: API extension for handling sysctl

On Wed, Jun 01, 2022 at 08:19:14PM +0100, Matthew Wilcox wrote:
> On Wed, Jun 01, 2022 at 03:20:29PM +0200, Alexey Gladkov wrote:
> > +struct ctl_fops {
> > + int (*open) (struct ctl_context *, struct inode *, struct file *);
> > + int (*release) (struct ctl_context *, struct inode *, struct file *);
> > + ssize_t (*read) (struct ctl_context *, struct file *, char *, size_t *, loff_t *);
> > + ssize_t (*write) (struct ctl_context *, struct file *, char *, size_t *, loff_t *);
> > +};
>
> Why not pass the iocb in ->read and ->write? We're still regretting not
> doing that with file_operations.

Because buf and len can be modified in BPF_CGROUP_RUN_PROG_SYSCTL.
We need to pass the result of this hook to read/write callback.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/proc_sysctl.c#n605
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/bpf/cgroup.c#n1441

--
Rgrds, legion


2022-06-09 17:12:09

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC PATCH 0/4] API extension for handling sysctl

On Wed, Jun 01, 2022 at 03:20:28PM +0200, Alexey Gladkov wrote:
> On Fri, Apr 22, 2022 at 01:44:50PM -0700, Linus Torvalds wrote:
> > On Fri, Apr 22, 2022 at 5:53 AM Alexey Gladkov <[email protected]> wrote:
> > >
> > > Yes, Linus, these changes are not the refactoring you were talking
> > > about, but I plan to try to do such a refactoring in the my next
> > > patchset.
> >
> > Heh. Ok, I'm not saying these patches are pretty, and looking up the
> > namespace thing is a bit subtle, but it's certainly prettier than the
> > existing odd "create a new ctl_table entry because of field abuse".
>
> As I promised, here is one of the possible options for how to get rid of dynamic
> memory allocation.
>
> We can slightly extend the API and thus be able to save data at the time the
> file is opened. This will not only eliminate the need to allocate memory, but
> also provide access to file struct and f_cred.
>
> I made an RFC because I'm not sure that I did the permissions check for
> ipc_sysctl. I also did not change all the places where this API can be applied
> to make the patch smaller. As in the case of /proc/sys/kernel/printk where
> CAP_SYS_ADMIN is checked[1] for the current process at the time of write.

Thanks for all this, can you also add respective selftests extensions
for this on lib/test_sysctl.c and tools/testing/selftests/sysctl/sysctl.sh ?

Luis