2018-06-15 14:33:35

by Ivan Zahariev

Subject: Re: Cgroups "pids" controller does not update "pids.current" count immediately

On 14.6.2018 at 18:06, Tejun Heo wrote:
> On Thu, Jun 14, 2018 at 02:56:00PM +0300, Ivan Zahariev wrote:
>> I posted a kernel bug about this a month ago but it did not receive
>> any attention: https://bugzilla.kernel.org/show_bug.cgi?id=199713
>>
>> Here is a copy of the bug report and I hope that this is the correct
>> place to discuss this:
> Well, for now at least, that's the expected behavior. It's not
> supposed to be able to account all changes immediately (the kernel
> doesn't free a lot of things immediately for performance and other
> reasons). The intended use is setting up a reasonable upper bound
> with some buffer space.

If that's by design, it's a bit disappointing, and the docs should at
least mention it.

The standard RLIMIT_NPROC does not suffer from such accounting
discrepancies at any time. The "memory" cgroups controller also does not
suffer from any discrepancies -- it accounts memory usage in real time
without any lag on process start or exit. The "tasks" file list is also
always up-to-date.

Is it really not technically possible to make "pids.current" do its
accounting as precisely as RLIMIT_NPROC does? We were hoping to replace
RLIMIT_NPROC with the "pids" controller.

--Ivan


2018-06-15 15:42:27

by Tejun Heo

Subject: Re: Cgroups "pids" controller does not update "pids.current" count immediately

Hello,

On Fri, Jun 15, 2018 at 05:26:04PM +0300, Ivan Zahariev wrote:
> The standard RLIMIT_NPROC does not suffer from such accounting
> discrepancies at any time.

RLIMIT_NPROC uses a dedicated atomic counter which is updated when the
process is reaped; however, that doesn't actually coincide with the
pid being freed. The base pid ref is put at that point, but there can
be other refs, and even after those are gone the pid has to go through
an RCU grace period before it is actually freed.

They seem equivalent but serve somewhat different purposes.
RLIMIT_NPROC is primarily about limiting what the user can do and
doesn't guarantee that that actually matches resource (pid here)
consumption. The pids controller's primary role is limiting pid
consumption - ie. no matter what happens, the cgroup must not be able
to take more than the specified number from the available pool, which
has to account for lazy release, draining refs and so on.
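
To illustrate, here is a rough, single-threaded userspace sketch of the
charging scheme (simplified; not the actual kernel code): every level
of the hierarchy is charged, and if any level would go over its limit,
the partial charges are reverted and the fork fails. The counters can
only drop when pids actually return to the pool, which is why they lag
behind process exit.

  #include <stdio.h>

  struct pids_group {
      long counter;              /* pids charged to this level */
      long limit;                /* pids.max */
      struct pids_group *parent;
  };

  /* Charge one pid to @g and all its ancestors, or revert and fail. */
  static int pids_try_charge(struct pids_group *g)
  {
      struct pids_group *p, *q;

      for (p = g; p; p = p->parent)
          if (++p->counter > p->limit)
              goto revert;
      return 0;

  revert:
      for (q = g; q != p; q = q->parent)
          q->counter--;          /* undo the partial charges */
      p->counter--;
      return -1;                 /* the fork fails here */
  }

  int main(void)
  {
      struct pids_group root  = { 0, 100, NULL };
      struct pids_group child = { 0, 3, &root };
      int ok = 0;

      for (int i = 0; i < 5; i++)
          if (!pids_try_charge(&child))
              ok++;

      /* prints: charged 3 of 5 (child 3, root 3) */
      printf("charged %d of 5 (child %ld, root %ld)\n",
             ok, child.counter, root.counter);
      return 0;
  }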

> The "memory" cgroups controller also does
> not suffer from any discrepancies -- it accounts memory usage in
> real time without any lag on process start or exit. The "tasks" file
> list is also always up-to-date.

The memory controller does the same thing, actually way more
extensively. It's just less noticeable because people generally don't
try to control things at the individual page level.

> Is it really not technically possible to make "pids.current" do its
> accounting as precisely as RLIMIT_NPROC does? We were hoping to
> replace RLIMIT_NPROC with the "pids" controller.

It is of course possible but at a cost. The cost (getting rid of lazy
release optimizations) is just not justifiable for most cases.

Thanks.

--
tejun

2018-06-15 16:10:26

by Ivan Zahariev

Subject: Re: Cgroups "pids" controller does not update "pids.current" count immediately

Hi,

Thank you for the quick and insightful reply. I have one suggestion below:

On 15.6.2018 at 18:41, Tejun Heo wrote:
> On Fri, Jun 15, 2018 at 05:26:04PM +0300, Ivan Zahariev wrote:
>> The standard RLIMIT_NPROC does not suffer from such accounting
>> discrepancies at any time.
> They seem equivalent but serve somewhat different purposes.
> RLIMIT_NPROC is primarily about limiting what the user can do and
> doesn't guarantee that that actually matches resource (pid here)
> consumption.
>
>> Is it really not technically possible to make "pids.current" do its
>> accounting as precisely as RLIMIT_NPROC does? We were hoping to
>> replace RLIMIT_NPROC with the "pids" controller.
> It is of course possible but at a cost. The cost (getting rid of lazy
> release optimizations) is just not justifiable for most cases.

I understand all the concerns and design decisions. However, having
RLIMIT_NPROC support combined with the "cgroups" hierarchy would be
very handy.

Would it make sense to introduce "nproc.current" and "nproc.max"
metrics which work in the same atomic, real-time way as RLIMIT_NPROC?
Or to put this in a new "nproc" controller?

--
Ivan

2018-06-15 16:19:20

by Tejun Heo

Subject: Re: Cgroups "pids" controller does not update "pids.current" count immediately

Hello,

On Fri, Jun 15, 2018 at 07:07:27PM +0300, Ivan Zahariev wrote:
> I understand all the concerns and design decisions. However, having
> RLIMIT_NPROC support combined with the "cgroups" hierarchy would be
> very handy.
>
> Would it make sense to introduce "nproc.current" and "nproc.max"
> metrics which work in the same atomic, real-time way as RLIMIT_NPROC?
> Or to put this in a new "nproc" controller?

I'm skeptical for two reasons.

1. That doesn't sound much like a resource control problem but more of
a policy enforcement problem.

2. It's difficult to see why such policies would need to be that
strict. Where is the requirement coming from?

Thanks.

--
tejun

2018-06-15 17:40:51

by Ivan Zahariev

Subject: Re: Cgroups "pids" controller does not update "pids.current" count immediately

Hello,

On 15.6.2018 at 19:16, Tejun Heo wrote:
> On Fri, Jun 15, 2018 at 07:07:27PM +0300, Ivan Zahariev wrote:
>> I understand all the concerns and design decisions. However, having
>> RLIMIT_NPROC support combined with the "cgroups" hierarchy would be
>> very handy.
>>
>> Would it make sense to introduce "nproc.current" and "nproc.max"
>> metrics which work in the same atomic, real-time way as RLIMIT_NPROC?
>> Or to put this in a new "nproc" controller?
> I'm skeptical for two reasons.
>
> 1. That doesn't sound much like a resource control problem but more of
> a policy enforcement problem.
>
> 2. It's difficult to see why such policies would need to be that
> strict. Where is the requirement coming from?
>

The lazy pids accounting, combined with modern fast CPUs, makes the
"pids.current" metric practically unusable for resource limiting in our
case. In a test where we repeatedly started and ended a single process
very quickly, we saw "pids.current" go as high as 185 (while the
correct value at any moment is either 0 or 1). If we want a "cgroup" to
be able to spawn at most 50 processes, we have to set "pids.max" to
some much higher value like 300, in order to compensate for the pids
uncharge lag (and how big that lag is depends on the speed of the CPU
and how busy the system is).
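
For reference, here is roughly the reproducer we used (a simplified
sketch; the cgroup v1 mount point and the "test" cgroup name are just
examples, and the program is assumed to already run inside that
cgroup):

  /* Fork and immediately reap a child in a tight loop, recording the
   * highest "pids.current" value seen. With immediate accounting it
   * could never exceed 2 (this process plus one child). */
  #include <stdio.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static long read_pids_current(void)
  {
      FILE *f = fopen("/sys/fs/cgroup/pids/test/pids.current", "r");
      long val = -1;

      if (f) {
          if (fscanf(f, "%ld", &val) != 1)
              val = -1;
          fclose(f);
      }
      return val;
  }

  int main(void)
  {
      long cur, max_seen = 0;

      for (int i = 0; i < 100000; i++) {
          pid_t pid = fork();

          if (pid == 0)
              _exit(0);          /* child exits right away */
          if (pid < 0)
              continue;          /* fork failed, just retry */
          waitpid(pid, NULL, 0); /* parent reaps it immediately */

          cur = read_pids_current();
          if (cur > max_seen)
              max_seen = cur;
      }
      printf("max pids.current seen: %ld\n", max_seen);
      return 0;
  }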

Our use case is a shared web hosting service. Our customers start a
CGI process for each PHP web request, so process starts and exits
happen at a very high rate. We don't want customers to be able to
launch too many CGI processes (an NPROC limit), because this exhausts
the web & database servers and can also exhaust Linux kernel resources
(like the total "open files" per user). Furthermore, some users are
malicious and launch fork-bombs and other resource-exhaustion attacks.

You may be right that we are enforcing a policy rather than doing
resource control. This has worked for us for 15+ years now. The
motivation is that a global RLIMIT_NPROC easily lets us limit all
system and Linux kernel resources "per customer" ("cgroups" lets us
limit only certain system resources). Additionally, not all user-space
daemons allow a granular "per user" limit or proper grouping (for
example, MySQL has only users and no support for "per customer"
groups). Now we want to have different "cgroups" hierarchies for each
customer (SSH, CGI, Crond), each with its own RLIMIT_NPROC, and a total
RLIMIT_NPROC for the parent "per customer" cgroup.

Excuse me for the lengthy post :-)

--
Ivan



2018-06-15 19:08:13

by Tejun Heo

Subject: Re: Cgroups "pids" controller does not update "pids.current" count immediately

Hello, Ivan.

On Fri, Jun 15, 2018 at 08:40:02PM +0300, Ivan Zahariev wrote:
> The lazy pids accounting, combined with modern fast CPUs, makes the
> "pids.current" metric practically unusable for resource limiting in
> our case. In a test where we repeatedly started and ended a single
> process very quickly, we saw "pids.current" go as high as 185 (while
> the correct value at any moment is either 0 or 1). If we want a
> "cgroup" to be able to spawn at most 50 processes, we have to set
> "pids.max" to some much higher value like 300, in order to compensate
> for the pids uncharge lag (and how big that lag is depends on the
> speed of the CPU and how busy the system is).

Yeah, that actually makes a lot of sense. We can't keep everything
synchronous for obvious performance reasons, but we definitely can wait
for an RCU grace period before failing. Forking might become a bit
slower while pids are draining, but it shouldn't fail, and that
shouldn't incur any performance overhead in normal conditions when pids
aren't constrained.
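
In toy form, the idea is something like this (a userspace sketch, not
actual kernel code; drain_deferred() stands in for waiting out the RCU
grace period and applying the pending uncharges):

  #include <stdio.h>

  static long count;             /* pids.current */
  static long deferred;          /* uncharges waiting on a grace period */
  static const long limit = 50;  /* pids.max */

  /* Stand-in for synchronizing with RCU and applying lazy uncharges. */
  static void drain_deferred(void)
  {
      count -= deferred;
      deferred = 0;
  }

  static int try_charge(void)
  {
      if (count + 1 > limit) {
          /* Don't fail right away: let the lazily released pids
           * drain, then check again. Forks near the limit get
           * slower, but no longer fail just because uncharging
           * lags behind. */
          drain_deferred();
          if (count + 1 > limit)
              return -1;
      }
      count++;
      return 0;
  }

  int main(void)
  {
      int failed = 0;

      /* 200 quick fork/exit cycles against a limit of 50; every exit
       * only queues its uncharge, mimicking the lazy release. */
      for (int i = 0; i < 200; i++) {
          if (try_charge())
              failed++;
          else
              deferred++;        /* process exited; uncharge is lazy */
      }
      printf("failed forks: %d\n", failed); /* prints 0 */
      return 0;
  }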

Thanks.

--
tejun

2018-06-15 19:38:50

by Ivan Zahariev

Subject: Re: Cgroups "pids" controller does not update "pids.current" count immediately

Hello,


On 15.6.2018 at 22:07, Tejun Heo wrote:
> On Fri, Jun 15, 2018 at 08:40:02PM +0300, Ivan Zahariev wrote:
>> The lazy pids accounting, combined with modern fast CPUs, makes the
>> "pids.current" metric practically unusable for resource limiting in
>> our case. In a test where we repeatedly started and ended a single
>> process very quickly, we saw "pids.current" go as high as 185 (while
>> the correct value at any moment is either 0 or 1). If we want a
>> "cgroup" to be able to spawn at most 50 processes, we have to set
>> "pids.max" to some much higher value like 300, in order to compensate
>> for the pids uncharge lag (and how big that lag is depends on the
>> speed of the CPU and how busy the system is).
> Yeah, that actually makes a lot of sense. We can't keep everything
> synchronous for obvious performance reasons, but we definitely can
> wait for an RCU grace period before failing. Forking might become a
> bit slower while pids are draining, but it shouldn't fail, and that
> shouldn't incur any performance overhead in normal conditions when
> pids aren't constrained.

I lack the expertise to comment on this. As a system administrator, I
can only point out that machines with 80+ CPU cores are quite common
nowadays. I don't know how the RCU grace period scales with an
increasing number of CPUs.

If you develop a patch for this, we can try it in production and give
you feedback. Just send me an email.

Thank you for your time and attention!

--
Ivan