2013-04-01 18:43:25

by Tim Hockin

Subject: Re: [PATCH 00/10] cgroups: Task counter subsystem v8

A year later - whatever happened with this? I want it more than ever
for Google's use.

On Tue, Jan 31, 2012 at 7:37 PM, Frederic Weisbecker <[email protected]> wrote:
> Hi,
>
> Changes in this version:
>
> - Split 32/64 bits version of res_counter_write_u64() [1/10]
> Courtesy of Kirill A. Shutemov
>
> - Added Kirill's ack [8/10]
>
> - Added selftests [9/10], [10/10]
>
> Please consider for merging. At least two users want this feature:
> https://lkml.org/lkml/2011/12/13/309
> https://lkml.org/lkml/2011/12/13/364
>
> More general details provided in the last version posting:
> https://lkml.org/lkml/2012/1/13/230
>
> Thanks!
>
>
> Frederic Weisbecker (9):
> cgroups: add res_counter_write_u64() API
> cgroups: new resource counter inheritance API
> cgroups: ability to stop res charge propagation on bounded ancestor
> res_counter: allow charge failure pointer to be null
> cgroups: pull up res counter charge failure interpretation to caller
> cgroups: allow subsystems to cancel a fork
> cgroups: Add a task counter subsystem
> selftests: Enter each directories before executing selftests
> selftests: Add a new task counter selftest
>
> Kirill A. Shutemov (1):
> cgroups: add res counter common ancestor searching
>
> Documentation/cgroups/resource_counter.txt | 20 ++-
> Documentation/cgroups/task_counter.txt | 153 +++++++++++
> include/linux/cgroup.h | 20 +-
> include/linux/cgroup_subsys.h | 5 +
> include/linux/res_counter.h | 27 ++-
> init/Kconfig | 9 +
> kernel/Makefile | 1 +
> kernel/cgroup.c | 23 ++-
> kernel/cgroup_freezer.c | 6 +-
> kernel/cgroup_task_counter.c | 272 ++++++++++++++++++++
> kernel/exit.c | 2 +-
> kernel/fork.c | 7 +-
> kernel/res_counter.c | 103 +++++++-
> tools/testing/selftests/Makefile | 2 +-
> tools/testing/selftests/run_tests | 6 +-
> tools/testing/selftests/task_counter/Makefile | 8 +
> tools/testing/selftests/task_counter/fork.c | 40 +++
> tools/testing/selftests/task_counter/forkbomb.c | 40 +++
> tools/testing/selftests/task_counter/multithread.c | 68 +++++
> tools/testing/selftests/task_counter/run_test | 198 ++++++++++++++
> .../selftests/task_counter/spread_thread_group.c | 82 ++++++
> 21 files changed, 1056 insertions(+), 36 deletions(-)
> create mode 100644 Documentation/cgroups/task_counter.txt
> create mode 100644 kernel/cgroup_task_counter.c
> create mode 100644 tools/testing/selftests/task_counter/Makefile
> create mode 100644 tools/testing/selftests/task_counter/fork.c
> create mode 100644 tools/testing/selftests/task_counter/forkbomb.c
> create mode 100644 tools/testing/selftests/task_counter/multithread.c
> create mode 100755 tools/testing/selftests/task_counter/run_test
> create mode 100644 tools/testing/selftests/task_counter/spread_thread_group.c
>
> --
> 1.7.5.4
>


2013-04-01 18:46:25

by Tejun Heo

Subject: Re: [PATCH 00/10] cgroups: Task counter subsystem v8

On Mon, Apr 01, 2013 at 11:43:03AM -0700, Tim Hockin wrote:
> A year later - whatever happened with this? I want it more than ever
> for Google's use.

I think the conclusion was "use kmemcg instead".

Thanks.

--
tejun

2013-04-01 20:09:32

by Tim Hockin

Subject: Re: [PATCH 00/10] cgroups: Task counter subsystem v8

On Mon, Apr 1, 2013 at 11:46 AM, Tejun Heo <[email protected]> wrote:
> On Mon, Apr 01, 2013 at 11:43:03AM -0700, Tim Hockin wrote:
>> A year later - whatever happened with this? I want it more than ever
>> for Google's use.
>
> I think the conclusion was "use kmemcg instead".

Pardon my ignorance, but... what? Use kernel memory limits as a proxy
for process/thread counts? That sounds terrible - I hope I am
misunderstanding? This task counter patch had several properties that
mapped very well to what we want.

Is it dead in the water?

2013-04-01 20:29:56

by Tejun Heo

Subject: Re: [PATCH 00/10] cgroups: Task counter subsystem v8

On Mon, Apr 01, 2013 at 01:09:09PM -0700, Tim Hockin wrote:
> Pardon my ignorance, but... what? Use kernel memory limits as a proxy
> for process/thread counts? That sounds terrible - I hope I am

Well, the argument was that process / thread counts were a poor and
unnecessary proxy for a kernel memory consumption limit. IIRC, Johannes
put it as (I'm paraphrasing) "you can't go to Fry's and buy 4k threads'
worth of components".

> misunderstanding? This task counter patch had several properties that
> mapped very well to what we want.
>
> Is it dead in the water?

After some discussion, Frederic agreed that at least his use case can
be served well by kmemcg, maybe even better - IIRC it was the container
fork-bomb scenario - so you'll have to make the case for why kmemcg
isn't a suitable solution for your use case if you wanna revive this.
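
(To be concrete, the suggested containment is just a cgroupfs write.
A minimal sketch in C - the mount point, group name and value are
assumptions, and IIRC the kmem limit has to be set before tasks are
attached:)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Sketch: cap kernel memory for one container.  task_structs,
 * kernel stacks etc. are charged to this limit, so a fork bomb
 * runs out of budget instead of eating the global pid space.
 */
static int set_kmem_limit(const char *group, const char *bytes)
{
        char path[256];
        int fd, ret = 0;

        snprintf(path, sizeof(path),
                 "/sys/fs/cgroup/memory/%s/memory.kmem.limit_in_bytes",
                 group);
        fd = open(path, O_WRONLY);
        if (fd < 0)
                return -1;
        if (write(fd, bytes, strlen(bytes)) < 0)
                ret = -1;
        close(fd);
        return ret;
}

int main(void)
{
        return set_kmem_limit("job0", "268435456") ? 1 : 0;  /* 256M */
}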

Thanks.

--
tejun

2013-04-01 21:02:29

by Tim Hockin

Subject: Re: [PATCH 00/10] cgroups: Task counter subsystem v8

On Mon, Apr 1, 2013 at 1:29 PM, Tejun Heo <[email protected]> wrote:
> On Mon, Apr 01, 2013 at 01:09:09PM -0700, Tim Hockin wrote:
>> Pardon my ignorance, but... what? Use kernel memory limits as a proxy
>> for process/thread counts? That sounds terrible - I hope I am
>
> Well, the argument was that process / thread counts were a poor and
> unnecessary proxy for a kernel memory consumption limit. IIRC, Johannes
> put it as (I'm paraphrasing) "you can't go to Fry's and buy 4k threads'
> worth of components".
>
>> misunderstanding? This task counter patch had several properties that
>> mapped very well to what we want.
>>
>> Is it dead in the water?
>
> After some discussion, Frederic agreed that at least his use case can
> be served well by kmemcg, maybe even better - IIRC it was the container
> fork-bomb scenario - so you'll have to make the case for why kmemcg
> isn't a suitable solution for your use case if you wanna revive this.

We run dozens of jobs from dozens of users on a single machine. We
regularly experience users who leak threads, running into the tens of
thousands. We are unable to raise the PID_MAX significantly due to
some bad, but really thoroughly baked-in decisions that were made a
long time ago. What we experience on a daily basis is users
complaining about getting a "pthread_create(): resource unavailable"
error because someone on the machine has leaked threads.
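
(The failure mode is trivial to reproduce - a minimal sketch, built
with -pthread; which limit you trip first depends on the machine:)

#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void *idler(void *arg)
{
        pause();        /* park the thread forever */
        return NULL;
}

int main(void)
{
        pthread_t tid;
        long n = 0;
        int err;

        /* Create threads until the kernel refuses. */
        while ((err = pthread_create(&tid, NULL, idler, NULL)) == 0)
                n++;

        /*
         * The error conflates several causes - RLIMIT_NPROC, global
         * pid exhaustion, someone else's leak - which is exactly the
         * isolation problem described above.
         */
        printf("created %ld threads, then: %s\n", n, strerror(err));
        return 0;
}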

Today we use RLIMIT_NPROC to lock most users down to a smaller max.
But this is a per-user setting, not a per-container setting, and users
do not control where their jobs land. Scheduling decisions often put
multiple thread-heavy but non-leaking jobs from one user onto the same
machine, which again causes problems. Further, it does not help for
some of our use cases where a logical job can run as multiple UIDs for
different processes within it.
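
(To illustrate why the per-user knob doesn't fit - the limit rides on
the real UID, machine-wide, not on any job boundary; the value below
is made up:)

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit rl;

        if (getrlimit(RLIMIT_NPROC, &rl))
                return 1;
        printf("nproc soft=%ld hard=%ld\n",
               (long)rl.rlim_cur, (long)rl.rlim_max);

        /*
         * RLIMIT_NPROC counts every process owned by the real UID
         * across the whole machine: two well-behaved, thread-heavy
         * jobs from one user share this budget, and a job that runs
         * as several UIDs escapes it entirely.
         */
        rl.rlim_cur = 4096;     /* made-up example value */
        return setrlimit(RLIMIT_NPROC, &rl) ? 1 : 0;
}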

From the end-user point of view this is an isolation leak which is
totally non-deterministic for them. They cannot know what to plan
for. Getting cgroup-level control of this limit is important for a
saner SLA for our users.

In addition, the behavior of locking out new tasks seems like a
nice way to simplify and clean up end-of-life work for the
administrative system. Admittedly, we can mostly work around this
with freezer instead.
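
(Roughly the sketch below: freeze the group so nothing can fork
mid-teardown, signal everything, then thaw so the signals are
delivered. The mount point and group name are assumptions:)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write "FROZEN" or "THAWED" to a v1 freezer group's state file. */
static int freezer_set(const char *group, const char *state)
{
        char path[256];
        int fd, ret = 0;

        snprintf(path, sizeof(path),
                 "/sys/fs/cgroup/freezer/%s/freezer.state", group);
        fd = open(path, O_WRONLY);
        if (fd < 0)
                return -1;
        if (write(fd, state, strlen(state)) < 0)
                ret = -1;
        close(fd);
        return ret;
}

int main(void)
{
        if (freezer_set("job0", "FROZEN"))
                return 1;
        /* ... SIGKILL every pid listed in the group ... */
        return freezer_set("job0", "THAWED") ? 1 : 0;
}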

What I really don't understand is why there's so much pushback. We have this
nicely structured cgroup system. Each cgroup controller's code is
pretty well partitioned - why would we not want more complete
functionality built around it? We accept device drivers for the most
random, useless crap on the assertion that "if you don't need it,
don't compile it in". I can think of a half dozen more really useful,
cool things we can do with cgroups, but I know the pushback will be
tremendous, and I just don't grok why.

Tim

2013-04-01 22:03:19

by Tejun Heo

Subject: Re: [PATCH 00/10] cgroups: Task counter subsystem v8

Hello, Tim.

On Mon, Apr 01, 2013 at 02:02:06PM -0700, Tim Hockin wrote:
> We run dozens of jobs from dozens of users on a single machine. We
> regularly experience users who leak threads, running into the tens of
> thousands. We are unable to raise the PID_MAX significantly due to
> some bad, but really thoroughly baked-in decisions that were made a
> long time ago. What we experience on a daily basis is users

Ummmm.... so that's why you guys can't use kernel memory limit? :(

> complaining about getting a "pthread_create(): resource unavailable"
> error because someone on the machine has leaked threads.
...
> What I really don't understand is why there's so much pushback. We have this
> nicely structured cgroup system. Each cgroup controller's code is
> pretty well partitioned - why would we not want more complete
> functionality built around it? We accept device drivers for the most
> random, useless crap on the assertion that "if you don't need it,
> don't compile it in". I can think of a half dozen more really useful,
> cool things we can do with cgroups, but I know the pushback will be
> tremendous, and I just don't grok why.

In general, because it adds to maintenance overhead. e.g. We've been
trying to make all cgroups follow consistent nesting rules. We're now
almost there with a couple of controllers left. This one would have been
another thing to do, which is fine if it's necessary but if it isn't
we're just adding work for no good reason.

More importantly, because cgroup is already plagued with so many bad
design decisions. Some are core design decisions - e.g. not being
able to actually identify a resource outside the context of a task.
Others were added on by each controller going off and doing whatever
it wanted without thinking about how the whole thing would come
together afterwards - e.g. double accounting between cpu and cpuacct,
completely illogical and unusable hierarchy implementations in
anything other than the cpu controllers (they're getting better), and
so on. Right now it's in a state where not much about it is coherent.
Sure, every controller and feature supports the uses its makers
intended, but when collected together it's just a mess, which is one
of the common complaints against cgroup.

So, no free-for-all, please.

Thanks.

--
tejun

2013-04-01 22:21:13

by Tim Hockin

Subject: Re: [PATCH 00/10] cgroups: Task counter subsystem v8

On Mon, Apr 1, 2013 at 3:03 PM, Tejun Heo <[email protected]> wrote:
> Hello, Tim.
>
> On Mon, Apr 01, 2013 at 02:02:06PM -0700, Tim Hockin wrote:
>> We run dozens of jobs from dozens of users on a single machine. We
>> regularly experience users who leak threads, running into the tens of
>> thousands. We are unable to raise the PID_MAX significantly due to
>> some bad, but really thoroughly baked-in decisions that were made a
>> long time ago. What we experience on a daily basis is users
>
> Ummmm.... so that's why you guys can't use kernel memory limit? :(

Because it is completely non-obvious how to map between the two in a
way that is safe across kernel versions and not likely to blow up in
our faces. It's a hack, in other words.

>> complaining about getting a "pthread_create(): resource unavailable"
>> error because someone on the machine has leaked threads.
> ...
>> What I really don't understand is why there's so much pushback. We have this
>> nicely structured cgroup system. Each cgroup controller's code is
>> pretty well partitioned - why would we not want more complete
>> functionality built around it? We accept device drivers for the most
>> random, useless crap on the assertion that "if you don't need it,
>> don't compile it in". I can think of a half dozen more really useful,
>> cool things we can do with cgroups, but I know the pushback will be
>> tremendous, and I just don't grok why.
>
> In general, because it adds to maintenance overhead. e.g. We've been
> trying to make all cgroups follow consistent nesting rules. We're now
> almost there with a couple of controllers left. This one would have been
> another thing to do, which is fine if it's necessary but if it isn't
> we're just adding work for no good reason.
>
> More importantly, because cgroup is already plagued with so many bad
> design decisions. Some are core design decisions - e.g. not being
> able to actually identify a resource outside the context of a task.
> Others were added on by each controller going off and doing whatever
> it wanted without thinking about how the whole thing would come
> together afterwards - e.g. double accounting between cpu and cpuacct,
> completely illogical and unusable hierarchy implementations in
> anything other than the cpu controllers (they're getting better), and
> so on. Right now it's in a state where not much about it is coherent.
> Sure, every controller and feature supports the uses its makers
> intended, but when collected together it's just a mess, which is one
> of the common complaints against cgroup.
>
> So, no free-for-all, please.
>
> Thanks.
>
> --
> tejun

2013-04-01 22:35:08

by Tejun Heo

Subject: Re: [PATCH 00/10] cgroups: Task counter subsystem v8

Hey,

On Mon, Apr 01, 2013 at 03:20:47PM -0700, Tim Hockin wrote:
> > Ummmm.... so that's why you guys can't use kernel memory limit? :(
>
> Because it is completely non-obvious how to map between the two in a
> way that is safe across kernel versions and not likely to blow up in
> our faces. It's a hack, in other words.

Now we're repeating the argument Frederic and Johannes had, so I'd
suggest going back to that thread and reading the discussion; if you
still think using kmemcg is a bad idea, please explain why that is so.
On the specific point you just raised, the scale tilted toward
thread/process count being a hackier and less reliable representation
of kernel memory resources than the other way around, at least back then.
If you think you can tilt it the other way, please feel free to try.

Thanks.

--
tejun

2013-04-01 22:58:10

by Tim Hockin

Subject: Re: [PATCH 00/10] cgroups: Task counter subsystem v8

On Mon, Apr 1, 2013 at 3:35 PM, Tejun Heo <[email protected]> wrote:
> Hey,
>
> On Mon, Apr 01, 2013 at 03:20:47PM -0700, Tim Hockin wrote:
>> > Ummmm.... so that's why you guys can't use kernel memory limit? :(
>>
>> Because it is completely non-obvious how to map between the two in a
>> way that is safe across kernel versions and not likely to blow up in
>> our faces. It's a hack, in other words.
>
> Now we're repeating the argument Frederic and Johannes had, so I'd
> suggest going back to that thread and reading the discussion; if you
> still think using kmemcg is a bad idea, please explain why that is so.
> On the specific point you just raised, the scale tilted toward
> thread/process count being a hackier and less reliable representation
> of kernel memory resources than the other way around, at least back then.

I am not limited by kernel memory; I am limited by PIDs, and I need to
be able to manage them. memory.kmem.usage_in_bytes seems to be far
too noisy to be useful for this purpose. It may work fine for "just
stop a fork bomb" but not for any sort of finer-grained control.
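
(What we actually want to watch is a task count. Today the closest
thing is counting lines in the group's v1 "tasks" file, which is
inherently racy - a sketch, with an assumed cgroup layout:)

#include <stdio.h>

/* Count the TIDs listed in a v1 cgroup "tasks" file, one per line. */
static long count_tasks(const char *path)
{
        FILE *f = fopen(path, "r");
        long tid, n = 0;

        if (!f)
                return -1;
        while (fscanf(f, "%ld", &tid) == 1)
                n++;
        fclose(f);
        return n;
}

int main(void)
{
        printf("%ld\n", count_tasks("/sys/fs/cgroup/cpu/job0/tasks"));
        return 0;
}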

> If you think you can tilt it the other way, please feel free to try.

Just because others caved doesn't make it less of a hack. And I will
cave, too, because I don't have time to bang my head against a wall,
especially when I can see the remnants of other people who have tried.

We'll work around it, or we'll hack around it, or we'll carry this
patch in our own tree and just grumble about ridiculous hacks every
time we have to forward port it.

I was just hoping that things had worked themselves out in the last year.

2013-04-01 23:18:31

by Tejun Heo

Subject: Re: [PATCH 00/10] cgroups: Task counter subsystem v8

Hello,

On Mon, Apr 01, 2013 at 03:57:46PM -0700, Tim Hockin wrote:
> I am not limited by kernel memory; I am limited by PIDs, and I need to
> be able to manage them. memory.kmem.usage_in_bytes seems to be far
> too noisy to be useful for this purpose. It may work fine for "just
> stop a fork bomb" but not for any sort of finer-grained control.

So, why are you limited by PIDs other than the arcane / weird
limitation that you have wherever that limitation is?

> > If you think you can tilt it the other way, please feel free to try.
>
> Just because others caved doesn't make it less of a hack. And I will
> cave, too, because I don't have time to bang my head against a wall,
> especially when I can see the remnants of other people who have tried.
>
> We'll work around it, or we'll hack around it, or we'll carry this
> patch in our own tree and just grumble about ridiculous hacks every
> time we have to forward port it.
>
> I was just hoping that things had worked themselves out in the last year.

It's kinda weird getting this response, as I don't think it has been
particularly wall-like. The arguments were pretty sound from what I
recall, and Frederic's use case was actually better covered by kmemcg,
so where's the said wall? And I asked you why your use case is
different, and the only reason you gave me is some arbitrary PID
limitation on whatever thing you're using, which you gotta agree is a
pretty hard sell. So, if you think you have a valid case, please just
explain it. Why go passive-aggressive on it? If you don't have a
valid case for pushing it, yes, you'll have to hack around it - carry
the patches in your tree, whatever, or better, fix the weird PID
problem.

--
tejun

2013-04-02 00:08:13

by Tim Hockin

Subject: Re: [PATCH 00/10] cgroups: Task counter subsystem v8

On Mon, Apr 1, 2013 at 4:18 PM, Tejun Heo <[email protected]> wrote:
> Hello,
>
> On Mon, Apr 01, 2013 at 03:57:46PM -0700, Tim Hockin wrote:
>> I am not limited by kernel memory; I am limited by PIDs, and I need to
>> be able to manage them. memory.kmem.usage_in_bytes seems to be far
>> too noisy to be useful for this purpose. It may work fine for "just
>> stop a fork bomb" but not for any sort of finer-grained control.
>
> So, why are you limited by PIDs other than the arcane / weird
> limitation that you have wherever that limitation is?

Does anyone anywhere actually set PID_MAX > 64K? As far as I can
tell, distros default it to 32K or 64K because there's a lot of stuff
out there that assumes this to be true. This is the problem we have -
deep down in the bowels of code that is taking literally years to
overhaul, we have identified a bad assumption that PIDs are always 5
characters long. I can't fix it any faster. That said, we also
identified other software that makes similar assumptions, though it
is less critical to us.
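
(A hypothetical illustration of the class of bug - a buffer sized for
five-digit PIDs, which quietly breaks once pid_max goes past 99999:)

#include <stdio.h>
#include <sys/types.h>

int main(void)
{
        char buf[6];            /* "99999" plus NUL - the baked-in assumption */
        pid_t pid = 100000;     /* only reachable with pid_max raised */
        int n;

        /*
         * snprintf truncates and reports the length it wanted; the
         * sprintf variants of this bug overflow the buffer instead.
         */
        n = snprintf(buf, sizeof(buf), "%d", (int)pid);
        printf("needed %d chars, kept \"%s\"\n", n, buf);
        return 0;
}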

>> > If you think you can tilt it the other way, please feel free to try.
>>
>> Just because others caved doesn't make it less of a hack. And I will
>> cave, too, because I don't have time to bang my head against a wall,
>> especially when I can see the remnants of other people who have tried.
>>
>> We'll work around it, or we'll hack around it, or we'll carry this
>> patch in our own tree and just grumble about ridiculous hacks every
>> time we have to forward port it.
>>
>> I was just hoping that things had worked themselves out in the last year.
>
> It's kinda weird getting this response, as I don't think it has been
> particularly wall-like. The arguments were pretty sound from what I
> recall, and Frederic's use case was actually better covered by kmemcg,
> so where's the said wall? And I asked you why your use case is
> different, and the only reason you gave me is some arbitrary PID
> limitation on whatever thing you're using, which you gotta agree is a
> pretty hard sell. So, if you think you have a valid case, please just
> explain it. Why go passive-aggressive on it? If you don't have a
> valid case for pushing it, yes, you'll have to hack around it - carry
> the patches in your tree, whatever, or better, fix the weird PID
> problem.

Sorry, Tejun - you're being very reasonable; I was not. The history of
this patch is what makes me frustrated. It seems like such an obvious
thing to support that it blows my mind that people argue against it.

You know our environment. Users can use their memory budgets however
they like - kernel or userspace. We have good accounting, but we are
PID limited. We've even implemented some hacks of our own to make
that hurt less because the previously-mentioned assumptions are just
NOT going away any time soon. I literally have user bugs every week
on this. Hopefully the hacks we have put in place will make the users
stop hurting. But we're left with some residual problems, some of
which are because the only limits we can apply are per-user rather
than per-container.

From our POV building the cluster, cgroups are strictly superior to
most other control interfaces because they work at the same
granularity that we do. I want more things to support cgroup control.
This particular one was double-tasty because the ability to set the
limit to 0 would actually solve a different problem we have in
teardown. But as I said, we can mostly work around that.

So I am frustrated because I don't think my use case will convince you
(at the root of it, it is a problem of our own making, but it LONG
predates me), despite my belief that it is obviously a good feature.
I find myself hoping that someone else comes along and says "me too"
rather than using a totally different hack for this.

Oh well. Thanks for the update. Off to do our own thing again.


> --
> tejun