When a user send a signal to any another processes it forces the kernel
to allocate memory for 'struct sigqueue' objects. The number of signals
is limited by RLIMIT_SIGPENDING resource limit, but even the default
settings allow each user to consume up to several megabytes of memory.
Moreover, an untrusted admin inside container can increase the limit or
create new fake users and force them to sent signals.
It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.
Signed-off-by: Vasily Averin <[email protected]>
---
kernel/signal.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/signal.c b/kernel/signal.c
index a3229ad..8921c4a 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -4663,7 +4663,7 @@ void __init signals_init(void)
{
siginfo_buildtime_checks();
- sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC);
+ sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC | SLAB_ACCOUNT);
}
#ifdef CONFIG_KGDB_KDB
--
1.8.3.1
Vasily Averin <[email protected]> writes:
> When a user send a signal to any another processes it forces the kernel
> to allocate memory for 'struct sigqueue' objects. The number of signals
> is limited by RLIMIT_SIGPENDING resource limit, but even the default
> settings allow each user to consume up to several megabytes of memory.
> Moreover, an untrusted admin inside container can increase the limit or
> create new fake users and force them to sent signals.
Not any more. Currently the number of sigqueue objects is limited
by the rlimit of the creator of the user namespace of the container.
> It makes sense to account for these allocations to restrict the host's
> memory consumption from inside the memcg-limited container.
Does it? Why? The given justification appears to have bit-rotted
since -rc1.
I know a lot of these things only really need a limit just to catch a
program that starts malfunctioning. If that is indeed the case
reasonable per-resource limits are probably better than some great big
group limit that can be exhausted with any single resource in the group.
Is there a reason I am not aware of that where it makes sense to group
all of the resources together and only count the number of bytes
consumed?
Eric
> Signed-off-by: Vasily Averin <[email protected]>
> ---
> kernel/signal.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index a3229ad..8921c4a 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -4663,7 +4663,7 @@ void __init signals_init(void)
> {
> siginfo_buildtime_checks();
>
> - sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC);
> + sigqueue_cachep = KMEM_CACHE(sigqueue, SLAB_PANIC | SLAB_ACCOUNT);
> }
>
> #ifdef CONFIG_KGDB_KDB
On 7/19/21 8:32 PM, Eric W. Biederman wrote:
> Vasily Averin <[email protected]> writes:
>
>> When a user send a signal to any another processes it forces the kernel
>> to allocate memory for 'struct sigqueue' objects. The number of signals
>> is limited by RLIMIT_SIGPENDING resource limit, but even the default
>> settings allow each user to consume up to several megabytes of memory.
>> Moreover, an untrusted admin inside container can increase the limit or
>> create new fake users and force them to sent signals.
>
> Not any more. Currently the number of sigqueue objects is limited
> by the rlimit of the creator of the user namespace of the container.
>
>> It makes sense to account for these allocations to restrict the host's
>> memory consumption from inside the memcg-limited container.
>
> Does it? Why? The given justification appears to have bit-rotted
> since -rc1.
Could you please explain what was changed in rc1?
From my POV accounting is required to help OOM-killer to select proper target.
> I know a lot of these things only really need a limit just to catch a
> program that starts malfunctioning. If that is indeed the case
> reasonable per-resource limits are probably better than some great big
> group limit that can be exhausted with any single resource in the group.
>
> Is there a reason I am not aware of that where it makes sense to group
> all of the resources together and only count the number of bytes
> consumed?
Any new limits:
a) should be set properly depending on huge number of incoming parameters.
b) should properly notify about hits
c) should be updated properly after b)
d) do a)-c) automatically if possible
In past OpenVz had own accounting subsystem, user beancounters (UBC).
It accounted and limited 20+ resources per-container: numfiles, file locks,
signals, netfilter rules, socket buffers and so on.
I assume you want to do something similar, so let me share our experience.
We had a lot of problems with UBC:
- it's quite hard to set up the limit.
Why it's good to consume N entities of some resource but it's bad to consume N+1 ones?
per-process? per-user? per-thread? per-task? per-namespace? if nested? per-container? per-host?
To answer the questions host admin should have additional knowledge and skills.
- Ok, we have set all limits. Some application hits it and fails.
It's quite hard to understand that application hits the limit, and failed due to this reason.
From users point of view, if some application does not work (stable enough)
inside container => containers are guilty.
- It's quite hard to understand that failed application just want to increase limit X up to N entities.
As result both host admins and container users was unhappy.
So after years of such fights we decided just to limit accounted memory instead.
Anyway, OOM-killer must know who consumed memory to select proper target.
Thank you,
vasily Averin
On Tue, Jul 20, 2021 at 1:35 AM Vasily Averin <[email protected]> wrote:
>
> On 7/19/21 8:32 PM, Eric W. Biederman wrote:
> > Vasily Averin <[email protected]> writes:
> >
> >> When a user send a signal to any another processes it forces the kernel
> >> to allocate memory for 'struct sigqueue' objects. The number of signals
> >> is limited by RLIMIT_SIGPENDING resource limit, but even the default
> >> settings allow each user to consume up to several megabytes of memory.
> >> Moreover, an untrusted admin inside container can increase the limit or
> >> create new fake users and force them to sent signals.
> >
> > Not any more. Currently the number of sigqueue objects is limited
> > by the rlimit of the creator of the user namespace of the container.
> >
> >> It makes sense to account for these allocations to restrict the host's
> >> memory consumption from inside the memcg-limited container.
> >
> > Does it? Why? The given justification appears to have bit-rotted
> > since -rc1.
>
> Could you please explain what was changed in rc1?
> From my POV accounting is required to help OOM-killer to select proper target.
>
> > I know a lot of these things only really need a limit just to catch a
> > program that starts malfunctioning. If that is indeed the case
> > reasonable per-resource limits are probably better than some great big
> > group limit that can be exhausted with any single resource in the group.
> >
> > Is there a reason I am not aware of that where it makes sense to group
> > all of the resources together and only count the number of bytes
> > consumed?
>
> Any new limits:
> a) should be set properly depending on huge number of incoming parameters.
> b) should properly notify about hits
> c) should be updated properly after b)
> d) do a)-c) automatically if possible
>
> In past OpenVz had own accounting subsystem, user beancounters (UBC).
> It accounted and limited 20+ resources per-container: numfiles, file locks,
> signals, netfilter rules, socket buffers and so on.
> I assume you want to do something similar, so let me share our experience.
>
> We had a lot of problems with UBC:
> - it's quite hard to set up the limit.
> Why it's good to consume N entities of some resource but it's bad to consume N+1 ones?
> per-process? per-user? per-thread? per-task? per-namespace? if nested? per-container? per-host?
> To answer the questions host admin should have additional knowledge and skills.
>
> - Ok, we have set all limits. Some application hits it and fails.
> It's quite hard to understand that application hits the limit, and failed due to this reason.
> From users point of view, if some application does not work (stable enough)
> inside container => containers are guilty.
>
> - It's quite hard to understand that failed application just want to increase limit X up to N entities.
>
> As result both host admins and container users was unhappy.
> So after years of such fights we decided just to limit accounted memory instead.
>
> Anyway, OOM-killer must know who consumed memory to select proper target.
>
Just to support Vasily's point further, for systems running multiple
workloads, it is much more preferred to be able to set one limit for
each workload than to set many different limits.
One concrete example is described in commit ac7b79fd190b ("inotify,
memcg: account inotify instances to kmemcg"). The inotify instances
which can be limited through fs sysctl inotify/max_user_instances and
be further partitioned to users through per-user namespace specific
sysctl but there is no sensible way to set a limit and partition it on
a system that runs different workloads.
Vasily Averin <[email protected]> writes:
> On 7/19/21 8:32 PM, Eric W. Biederman wrote:
>> Vasily Averin <[email protected]> writes:
>>
>>> When a user send a signal to any another processes it forces the kernel
>>> to allocate memory for 'struct sigqueue' objects. The number of signals
>>> is limited by RLIMIT_SIGPENDING resource limit, but even the default
>>> settings allow each user to consume up to several megabytes of memory.
>>> Moreover, an untrusted admin inside container can increase the limit or
>>> create new fake users and force them to sent signals.
>>
>> Not any more. Currently the number of sigqueue objects is limited
>> by the rlimit of the creator of the user namespace of the container.
>>
>>> It makes sense to account for these allocations to restrict the host's
>>> memory consumption from inside the memcg-limited container.
>>
>> Does it? Why? The given justification appears to have bit-rotted
>> since -rc1.
>
> Could you please explain what was changed in rc1?
> From my POV accounting is required to help OOM-killer to select proper target.
You can no longer escape the rlimit of the creator of the user
namespace, by creating multiple users inside of the user namespace. The
users (in the user namespace) will have their individual rlimits but
are jointly by the limit of the creator of the user namespace at the
time the user namespace was created.
>> I know a lot of these things only really need a limit just to catch a
>> program that starts malfunctioning. If that is indeed the case
>> reasonable per-resource limits are probably better than some great big
>> group limit that can be exhausted with any single resource in the group.
>>
>> Is there a reason I am not aware of that where it makes sense to group
>> all of the resources together and only count the number of bytes
>> consumed?
>
> Any new limits:
> a) should be set properly depending on huge number of incoming parameters.
> b) should properly notify about hits
> c) should be updated properly after b)
> d) do a)-c) automatically if possible
>
> In past OpenVz had own accounting subsystem, user beancounters (UBC).
> It accounted and limited 20+ resources per-container: numfiles, file locks,
> signals, netfilter rules, socket buffers and so on.
> I assume you want to do something similar, so let me share our experience.
>
> We had a lot of problems with UBC:
> - it's quite hard to set up the limit.
> Why it's good to consume N entities of some resource but it's bad to consume N+1 ones?
> per-process? per-user? per-thread? per-task? per-namespace? if nested? per-container? per-host?
> To answer the questions host admin should have additional knowledge and skills.
>
> - Ok, we have set all limits. Some application hits it and fails.
> It's quite hard to understand that application hits the limit, and failed due to this reason.
> From users point of view, if some application does not work (stable enough)
> inside container => containers are guilty.
>
> - It's quite hard to understand that failed application just want to increase limit X up to N entities.
>
> As result both host admins and container users was unhappy.
> So after years of such fights we decided just to limit accounted
> memory instead.
Which is a perfectly fine justification. However the justification
presented in the change log was that there is no existing limit, and
that is what is factually wrong.
Different kinds of limits serve different purposes. An accounting of how
many instance of a resource has been used serve the purpose of detecting
when an application has gone completely out of spec.
Limits like you are implementing here are much better for just sharing and
managing system resources.
> Anyway, OOM-killer must know who consumed memory to select proper
> target.
The limits I am talking about are not useful to the OOM killer as
usually the amount of resources that indicates something has gone wrong
are too small to be a system-wide problem.
So please just correct the justification in the commit message and I
will be happy.
Eric
On Mon, Jul 19, 2021 at 3:46 AM Vasily Averin <[email protected]> wrote:
>
> When a user send a signal to any another processes it forces the kernel
> to allocate memory for 'struct sigqueue' objects. The number of signals
> is limited by RLIMIT_SIGPENDING resource limit, but even the default
> settings allow each user to consume up to several megabytes of memory.
> Moreover, an untrusted admin inside container can increase the limit or
> create new fake users and force them to sent signals.
>
> It makes sense to account for these allocations to restrict the host's
> memory consumption from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <[email protected]>
It seems like there is an agreement on this patch with the updated
commit message. In next version you can add:
Reviewed-by: Shakeel Butt <[email protected]>