2021-06-29 00:37:08

by Eric W. Biederman

[permalink] [raw]
Subject: [GIT PULL] ucounts: Count rlimits in each user namespace


Linus,

Please pull the for-linus branch from the git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus

HEAD: 5e6b8a50a7cec5686ee2c4bda1d49899c79a7eae cred: add missing return error code when set_cred_ucounts() failed

This is the work mainly by Alexey Gladkov to limit rlimits to the
rlimits of the user that created a user namespace, and to allow users to
have stricter limits on the resources created within a user namespace.

There is a more detailed changelog from Alexey in the merge commit
9b624988221b ("ucounts: Count rlimits in each user namespace")

---

Alexey Gladkov (9):
Increase size of ucounts to atomic_long_t
Add a reference to ucounts for each cred
Use atomic_t for ucounts reference counting
Reimplement RLIMIT_NPROC on top of ucounts
Reimplement RLIMIT_MSGQUEUE on top of ucounts
Reimplement RLIMIT_SIGPENDING on top of ucounts
Reimplement RLIMIT_MEMLOCK on top of ucounts
kselftests: Add test to check for rlimit changes in different user namespaces
ucounts: Set ucount_max to the largest positive value the type can hold

Eric W. Biederman (2):
ucounts: Count rlimits in each user namespace
ucounts: Silence warning in dec_rlimit_ucounts

Yang Yingliang (1):
cred: add missing return error code when set_cred_ucounts() failed

fs/exec.c | 6 +-
fs/hugetlbfs/inode.c | 16 +-
fs/proc/array.c | 2 +-
include/linux/cred.h | 4 +
include/linux/hugetlb.h | 4 +-
include/linux/mm.h | 4 +-
include/linux/sched/user.h | 7 -
include/linux/shmem_fs.h | 2 +-
include/linux/signal_types.h | 4 +-
include/linux/user_namespace.h | 31 +++-
ipc/mqueue.c | 40 ++---
ipc/shm.c | 26 ++--
kernel/cred.c | 51 ++++++-
kernel/exit.c | 2 +-
kernel/fork.c | 18 ++-
kernel/signal.c | 25 ++--
kernel/sys.c | 14 +-
kernel/ucount.c | 116 +++++++++++----
kernel/user.c | 3 -
kernel/user_namespace.c | 9 +-
mm/memfd.c | 4 +-
mm/mlock.c | 22 ++-
mm/mmap.c | 4 +-
mm/shmem.c | 10 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/rlimits/.gitignore | 2 +
tools/testing/selftests/rlimits/Makefile | 6 +
tools/testing/selftests/rlimits/config | 1 +
.../testing/selftests/rlimits/rlimits-per-userns.c | 161 +++++++++++++++++++++
29 files changed, 468 insertions(+), 127 deletions(-)



2021-06-29 03:50:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ucounts: Count rlimits in each user namespace

On Mon, Jun 28, 2021 at 3:35 PM Eric W. Biederman <[email protected]> wrote:
>
> This is the work mainly by Alexey Gladkov to limit rlimits to the
> rlimits of the user that created a user namespace, and to allow users to
> have stricter limits on the resources created within a user namespace.

I guess all the performance issues got sorted, since I haven't seen
any reports from the test robots.

I do end up with two questions, mainly because of looking at the
result of the conflict resolution.

In particular, in __sigqueue_alloc(), two oddities..

Why the "sigpending < LONG_MAX" test in that

if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
task_rlimit(t, RLIMIT_SIGPENDING))) {

thing?

And why test for "ucounts" being non-NULL in

if (ucounts && dec_rlimit_ucounts(ucounts,
UCOUNT_RLIMIT_SIGPENDING, 1))
put_ucounts(ucounts);

when afaik both of those should be happy with a NULL 'ucounts' pointer
(if it was NULL, we certainly already used it for the reverse
operations for get_ucounts() and inc_rlimit_ucounts()..)

Hmm?

And somebody should verify that I didn't screw anything up in my merge
resolution. It all looked very straightforward, but mistakes happen..

Linus

2021-06-29 03:53:59

by pr-tracker-bot

[permalink] [raw]
Subject: Re: [GIT PULL] ucounts: Count rlimits in each user namespace

The pull request you sent on Mon, 28 Jun 2021 17:35:22 -0500:

> git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/c54b245d011855ea91c5beff07f1db74143ce614

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

2021-06-29 15:42:10

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [GIT PULL] ucounts: Count rlimits in each user namespace

Linus Torvalds <[email protected]> writes:

> On Mon, Jun 28, 2021 at 3:35 PM Eric W. Biederman <[email protected]> wrote:
>>
>> This is the work mainly by Alexey Gladkov to limit rlimits to the
>> rlimits of the user that created a user namespace, and to allow users to
>> have stricter limits on the resources created within a user namespace.
>
> I guess all the performance issues got sorted, since I haven't seen
> any reports from the test robots.

Yes. The structure was made to not change anything unnecessarily
(such as the ordering in sigqueue_alloc) and the performances
differences went away.

With the code in linux-next the entire cycle I think that is a reliable
result. There are probably some things we could do to further optimize
things but we did not need them to avoid regressions.

> I do end up with two questions, mainly because of looking at the
> result of the conflict resolution.
>
> In particular, in __sigqueue_alloc(), two oddities..
>
> Why the "sigpending < LONG_MAX" test in that
>
> if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
> task_rlimit(t, RLIMIT_SIGPENDING))) {
>
> thing?

> And why test for "ucounts" being non-NULL in
>
> if (ucounts && dec_rlimit_ucounts(ucounts,
> UCOUNT_RLIMIT_SIGPENDING, 1))
> put_ucounts(ucounts);
>
> when afaik both of those should be happy with a NULL 'ucounts' pointer
> (if it was NULL, we certainly already used it for the reverse
> operations for get_ucounts() and inc_rlimit_ucounts()..)
>
> Hmm?

Yes. I suspect that those tests are left over from a previous version
of the change. Alex do you remember why those tests are there?

> And somebody should verify that I didn't screw anything up in my merge
> resolution. It all looked very straightforward, but mistakes happen..

Just reading through the resolution looks correct.

Eric

2021-06-29 15:54:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [GIT PULL] ucounts: Count rlimits in each user namespace

Linus Torvalds <[email protected]> writes:

> Why the "sigpending < LONG_MAX" test in that
>
> if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
> task_rlimit(t, RLIMIT_SIGPENDING))) {
> thing?

On second look that sigpending < LONG_MAX check is necessary. When
inc_rlimit_ucounts detects a problem it returns LONG_MAX.

So the check for LONG_MAX is the condensed check to see if there is a
problem in any other levels of the ucount hierarchy.

Eric

2021-06-29 17:03:07

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ucounts: Count rlimits in each user namespace

On Tue, Jun 29, 2021 at 8:52 AM Eric W. Biederman <[email protected]> wrote:
>
> Linus Torvalds <[email protected]> writes:
>
> > Why the "sigpending < LONG_MAX" test in that
> >
> > if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
> > task_rlimit(t, RLIMIT_SIGPENDING))) {
> > thing?
>
> On second look that sigpending < LONG_MAX check is necessary. When
> inc_rlimit_ucounts detects a problem it returns LONG_MAX.

I saw that, but _without_ that test you'd be left with just that

sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)

and if task_rlimit() is LONG_MAX, then that means "no limits", so it is all ok.

IOW, afaik even _if_ sigpending ends up being LONG_MAX, the
conditional does the right thing without that "sigpending < LONG_MAX"
test.

Linus

2021-06-29 17:04:11

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [GIT PULL] ucounts: Count rlimits in each user namespace

Linus Torvalds <[email protected]> writes:

> On Tue, Jun 29, 2021 at 8:52 AM Eric W. Biederman <[email protected]> wrote:
>>
>> Linus Torvalds <[email protected]> writes:
>>
>> > Why the "sigpending < LONG_MAX" test in that
>> >
>> > if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
>> > task_rlimit(t, RLIMIT_SIGPENDING))) {
>> > thing?
>>
>> On second look that sigpending < LONG_MAX check is necessary. When
>> inc_rlimit_ucounts detects a problem it returns LONG_MAX.
>
> I saw that, but _without_ that test you'd be left with just that
>
> sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)
>
> and if task_rlimit() is LONG_MAX, then that means "no limits", so it is all ok.

It means no limits locally. The creator of your user namespace might
have had a limit which you are also bound by.

The other possibility is that inc_rlimits_ucounts caused a sigpending
counter to overflow. In which case we need to fail and run
dec_rlimit_ucounts to keep the counter from staying overflowed.

So I don't see a clever way to avoid the sigpending < LONG_MAX test.

Eric

2021-06-29 17:12:36

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [GIT PULL] ucounts: Count rlimits in each user namespace

[email protected] (Eric W. Biederman) writes:

> Linus Torvalds <[email protected]> writes:
>
>> On Tue, Jun 29, 2021 at 8:52 AM Eric W. Biederman <[email protected]> wrote:
>>>
>>> Linus Torvalds <[email protected]> writes:
>>>
>>> > Why the "sigpending < LONG_MAX" test in that
>>> >
>>> > if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
>>> > task_rlimit(t, RLIMIT_SIGPENDING))) {
>>> > thing?
>>>
>>> On second look that sigpending < LONG_MAX check is necessary. When
>>> inc_rlimit_ucounts detects a problem it returns LONG_MAX.
>>
>> I saw that, but _without_ that test you'd be left with just that
>>
>> sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)
>>
>> and if task_rlimit() is LONG_MAX, then that means "no limits", so it is all ok.
>
> It means no limits locally. The creator of your user namespace might
> have had a limit which you are also bound by.
>
> The other possibility is that inc_rlimits_ucounts caused a sigpending
> counter to overflow. In which case we need to fail and run
> dec_rlimit_ucounts to keep the counter from staying overflowed.
>
> So I don't see a clever way to avoid the sigpending < LONG_MAX test.

Hmm. I take that back. There is a simple clever way to satisfy all of
the tests.

- sigpending < LONG_MAX && sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)
+ sigpending < task_rlimit(t, RLIMIT_SIGPENDING)

That would just need a small comment to explain the subtleties.

Eric

2021-06-29 17:21:22

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [GIT PULL] ucounts: Count rlimits in each user namespace

On Mon, Jun 28, 2021 at 08:47:12PM -0700, Linus Torvalds wrote:
> On Mon, Jun 28, 2021 at 3:35 PM Eric W. Biederman <[email protected]> wrote:
> >
> > This is the work mainly by Alexey Gladkov to limit rlimits to the
> > rlimits of the user that created a user namespace, and to allow users to
> > have stricter limits on the resources created within a user namespace.
>
> I guess all the performance issues got sorted, since I haven't seen
> any reports from the test robots.
>
> I do end up with two questions, mainly because of looking at the
> result of the conflict resolution.
>
> In particular, in __sigqueue_alloc(), two oddities..
>
> Why the "sigpending < LONG_MAX" test in that
>
> if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
> task_rlimit(t, RLIMIT_SIGPENDING))) {
>
> thing?

inc_rlimit_ucounts() returns long and uses LONG_MAX as an overflow flag.
At the same time, we have increased the size of sigpending from int to
long.

> And why test for "ucounts" being non-NULL in
>
> if (ucounts && dec_rlimit_ucounts(ucounts,
> UCOUNT_RLIMIT_SIGPENDING, 1))
> put_ucounts(ucounts);
>
> when afaik both of those should be happy with a NULL 'ucounts' pointer
> (if it was NULL, we certainly already used it for the reverse
> operations for get_ucounts() and inc_rlimit_ucounts()..)

The get_ucount() can theoretically return NULL. It increments the
reference counter and if it overflows, the function will return NULL.

> Hmm?
>
> And somebody should verify that I didn't screw anything up in my merge
> resolution. It all looked very straightforward, but mistakes happen..

--
Rgrds, legion

2021-06-29 19:25:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [GIT PULL] ucounts: Count rlimits in each user namespace

On Tue, Jun 29, 2021 at 10:18 AM Alexey Gladkov <[email protected]> wrote:
>
>
> > And why test for "ucounts" being non-NULL in
> >
> > if (ucounts && dec_rlimit_ucounts(ucounts,
> > UCOUNT_RLIMIT_SIGPENDING, 1))
> > put_ucounts(ucounts);
> >
> > when afaik both of those should be happy with a NULL 'ucounts' pointer
> > (if it was NULL, we certainly already used it for the reverse
> > operations for get_ucounts() and inc_rlimit_ucounts()..)
>
> The get_ucount() can theoretically return NULL. It increments the
> reference counter and if it overflows, the function will return NULL.

.. but my point is that dec_rlimit_ucounts() and put_ucounts() should
be fine with whatever get_ucounts() returned. No

It looks like put_ucounts() is unhappy with a NULL ucounts argument,
but I think _that_ is what should get fixed.

I think that conceptually we should have two clear alternatives:

(a) either "get_ucounts()" returning NULL should be an error, and we
would have returned long before

or

(b) a NULL uncounts is usable, and a sequence like
put_ucounts(get_ucounts()) should just always work.

And honestly, a lot of the other ucounts funcrtions seem to take that
(b) approach. Example in that very function:

ucounts = task_ucounts(t);
sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);

which at no point tested for NULL or returned an error.

(And that also implies that the comment in dec_rlimit_ucounts() about
"Silence compiler warning" should just go away, because it's not just
a compiler warning, it's a required initialization).

Linus

2021-06-29 21:06:20

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [GIT PULL] ucounts: Count rlimits in each user namespace

On Tue, Jun 29, 2021 at 11:07:11AM -0700, Linus Torvalds wrote:
> On Tue, Jun 29, 2021 at 10:18 AM Alexey Gladkov <[email protected]> wrote:
> >
> >
> > > And why test for "ucounts" being non-NULL in
> > >
> > > if (ucounts && dec_rlimit_ucounts(ucounts,
> > > UCOUNT_RLIMIT_SIGPENDING, 1))
> > > put_ucounts(ucounts);
> > >
> > > when afaik both of those should be happy with a NULL 'ucounts' pointer
> > > (if it was NULL, we certainly already used it for the reverse
> > > operations for get_ucounts() and inc_rlimit_ucounts()..)
> >
> > The get_ucount() can theoretically return NULL. It increments the
> > reference counter and if it overflows, the function will return NULL.
>
> .. but my point is that dec_rlimit_ucounts() and put_ucounts() should
> be fine with whatever get_ucounts() returned. No
>
> It looks like put_ucounts() is unhappy with a NULL ucounts argument,
> but I think _that_ is what should get fixed.
>
> I think that conceptually we should have two clear alternatives:
>
> (a) either "get_ucounts()" returning NULL should be an error, and we
> would have returned long before

get_ucounts() in the __sigqueue_alloc() performs the get_uid() function
but does not ignore the counter overflow. Basically get_uid() can fail in
same way as get_ucounts(), but we just ignore it.

> or
>
> (b) a NULL uncounts is usable, and a sequence like
> put_ucounts(get_ucounts()) should just always work.
>
> And honestly, a lot of the other ucounts funcrtions seem to take that
> (b) approach. Example in that very function:
>
> ucounts = task_ucounts(t);
> sigpending = inc_rlimit_ucounts(ucounts, UCOUNT_RLIMIT_SIGPENDING, 1);
>
> which at no point tested for NULL or returned an error.

Waaaait. task_ucounts() is a different thing. This function only gets a
field from the task structure without any reference counting. But the
get_ucounts() is more like get_user_ns() or get_uid(), but does not ignore
counter overflow.

Earlier I tried to use refcount_t which never returns errors [1]. We
talked and you said that ignoring counter overflow errors is bad
design for this case.

> (And that also implies that the comment in dec_rlimit_ucounts() about
> "Silence compiler warning" should just go away, because it's not just
> a compiler warning, it's a required initialization).
>
> Linus

[1] https://lore.kernel.org/lkml/CAHk-%3dwjYOCgM%[email protected]/

--
Rgrds, legion

2021-07-01 16:46:26

by Alexey Gladkov

[permalink] [raw]
Subject: Re: [GIT PULL] ucounts: Count rlimits in each user namespace

On Tue, Jun 29, 2021 at 12:09:01PM -0500, Eric W. Biederman wrote:
> [email protected] (Eric W. Biederman) writes:
>
> > Linus Torvalds <[email protected]> writes:
> >
> >> On Tue, Jun 29, 2021 at 8:52 AM Eric W. Biederman <[email protected]> wrote:
> >>>
> >>> Linus Torvalds <[email protected]> writes:
> >>>
> >>> > Why the "sigpending < LONG_MAX" test in that
> >>> >
> >>> > if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
> >>> > task_rlimit(t, RLIMIT_SIGPENDING))) {
> >>> > thing?
> >>>
> >>> On second look that sigpending < LONG_MAX check is necessary. When
> >>> inc_rlimit_ucounts detects a problem it returns LONG_MAX.
> >>
> >> I saw that, but _without_ that test you'd be left with just that
> >>
> >> sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)
> >>
> >> and if task_rlimit() is LONG_MAX, then that means "no limits", so it is all ok.
> >
> > It means no limits locally. The creator of your user namespace might
> > have had a limit which you are also bound by.
> >
> > The other possibility is that inc_rlimits_ucounts caused a sigpending
> > counter to overflow. In which case we need to fail and run
> > dec_rlimit_ucounts to keep the counter from staying overflowed.
> >
> > So I don't see a clever way to avoid the sigpending < LONG_MAX test.
>
> Hmm. I take that back. There is a simple clever way to satisfy all of
> the tests.
>
> - sigpending < LONG_MAX && sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)
> + sigpending < task_rlimit(t, RLIMIT_SIGPENDING)
>
> That would just need a small comment to explain the subtleties.

Is it because user.sigpending was atomic_t before this patch ?

--
Rgrds, legion

2021-07-01 20:07:26

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [GIT PULL] ucounts: Count rlimits in each user namespace

Alexey Gladkov <[email protected]> writes:

> On Tue, Jun 29, 2021 at 12:09:01PM -0500, Eric W. Biederman wrote:
>> [email protected] (Eric W. Biederman) writes:
>>
>> > Linus Torvalds <[email protected]> writes:
>> >
>> >> On Tue, Jun 29, 2021 at 8:52 AM Eric W. Biederman <[email protected]> wrote:
>> >>>
>> >>> Linus Torvalds <[email protected]> writes:
>> >>>
>> >>> > Why the "sigpending < LONG_MAX" test in that
>> >>> >
>> >>> > if (override_rlimit || (sigpending < LONG_MAX && sigpending <=
>> >>> > task_rlimit(t, RLIMIT_SIGPENDING))) {
>> >>> > thing?
>> >>>
>> >>> On second look that sigpending < LONG_MAX check is necessary. When
>> >>> inc_rlimit_ucounts detects a problem it returns LONG_MAX.
>> >>
>> >> I saw that, but _without_ that test you'd be left with just that
>> >>
>> >> sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)
>> >>
>> >> and if task_rlimit() is LONG_MAX, then that means "no limits", so it is all ok.
>> >
>> > It means no limits locally. The creator of your user namespace might
>> > have had a limit which you are also bound by.
>> >
>> > The other possibility is that inc_rlimits_ucounts caused a sigpending
>> > counter to overflow. In which case we need to fail and run
>> > dec_rlimit_ucounts to keep the counter from staying overflowed.
>> >
>> > So I don't see a clever way to avoid the sigpending < LONG_MAX test.
>>
>> Hmm. I take that back. There is a simple clever way to satisfy all of
>> the tests.
>>
>> - sigpending < LONG_MAX && sigpending <= task_rlimit(t, RLIMIT_SIGPENDING)
>> + sigpending < task_rlimit(t, RLIMIT_SIGPENDING)
>>
>> That would just need a small comment to explain the subtleties.
>
> Is it because user.sigpending was atomic_t before this patch ?

Apologies I was wrong.

The replacement of "<=" with "<" is correct for the case where
"task_rlimit(t, RLIMIT_SIGPENDING) == LONG_MAX".

Unfortunately off by one for all other values of
"task_rlimit(t, RLIMIT_SIGPENDING)".

It completely breaks things for the case where RLIMIT_SIGPENDING == 1,
where no signals are allowed to be queued. Today allowing 1 queued
signal with a single task and a sender that does not send a second
signal until the first is consumed will work reliably.

That was just a brain fart on my part.

Eric