2012-05-03 09:13:56

by David Rientjes

Subject: Re: linux-next: Tree for Apr 27 (uml + mm/memcontrol.c)

On Sat, 28 Apr 2012, Aneesh Kumar K.V wrote:

> My first version was to do it as a separate controller
>
> http://thread.gmane.org/gmane.linux.kernel.mm/73826
>
> But the feedback I received was to do it as a part of memcg extension,
> because what the controller is limiting is memory albeit a different
> type. AFAIU there is also this goal of avoiding controller proliferation.
>

Maybe Kame can speak up if he feels strongly about this, but I really
think it should be its own controller in its own file (which would
obviously make this discussion irrelevant since mm/hugetlbcg.c would be
dependent on your own config symbol). I don't feel like this is the same
as kmem since it's not a global resource like hugetlb pages are.

Hugetlb pages can either be allocated statically on the command line at
boot or dynamically via sysfs and they are globally available to whoever
mmaps them through hugetlbfs. I see a real benefit from being able to
limit the number of hugepages in the global pool to a set of tasks so they
can't overuse what has been statically or dynamically allocated. And that
ability should be available, in my opinion, without having to enable
memcg, the page_cgroup metadata overhead that comes along with it, and the
performance impact in using it. I also think it would be wise to separate
it out into its own file at the source level so things like this don't
arise in the future.
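
To make that lifecycle concrete, here is a minimal sketch; the sysfs path
is the real one for 2MB pages, but the /mnt/huge mount point and the pool
size are illustrative, and error handling is trimmed:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE_SIZE	(2UL << 20)	/* one 2MB huge page */

int main(void)
{
	/* Dynamic allocation: grow the global pool to 16 huge pages,
	 * the same thing "echo 16 > .../nr_hugepages" does. */
	int fd = open("/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages",
		      O_WRONLY);
	if (fd < 0 || write(fd, "16", 2) != 2) {
		perror("nr_hugepages");
		return 1;
	}
	close(fd);

	/* Any task that can reach the hugetlbfs mount can now take pages
	 * from that global pool; this mmap is the step a hugetlb
	 * controller would limit per group of tasks. */
	fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	void *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(p, 0, HPAGE_SIZE);	/* fault the huge page in */
	munmap(p, HPAGE_SIZE);
	close(fd);
	return 0;
}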

What do you think? Kame?


2012-05-03 10:30:40

by Hiroyuki Kamezawa

Subject: Re: linux-next: Tree for Apr 27 (uml + mm/memcontrol.c)

On Thu, May 3, 2012 at 6:13 PM, David Rientjes <[email protected]> wrote:
> On Sat, 28 Apr 2012, Aneesh Kumar K.V wrote:
>
>> My first version was to do it as a separate controller
>>
>> http://thread.gmane.org/gmane.linux.kernel.mm/73826
>>
>> But the feedback I received was to do it as a part of memcg extension,
>> because what the controller is limiting is memory albeit a different
>> type. AFAIU there is also this goal of avoiding controller proliferation.
>>
>
> Maybe Kame can speak up if he feels strongly about this, but I really
> think it should be its own controller in its own file (which would
> obviously make this discussion irrelevant since mm/hugetlbcg.c would be
> dependent on your own config symbol). I don't feel like this is the same
> as kmem since it's not a global resource like hugetlb pages are.
>
> Hugetlb pages can either be allocated statically on the command line at
> boot or dynamically via sysfs and they are globally available to whoever
> mmaps them through hugetlbfs. I see a real benefit from being able to
> limit the number of hugepages in the global pool to a set of tasks so they
> can't overuse what has been statically or dynamically allocated. And that
> ability should be available, in my opinion, without having to enable
> memcg, the page_cgroup metadata overhead that comes along with it, and the
> performance impact in using it. I also think it would be wise to separate
> it out into its own file at the source level so things like this don't
> arise in the future.
>
> What do you think? Kame?

I think hugetlb should be handled under memcg.

1. I think Hugetlb is memory.

2. The characteristics of hugetlb usage you pointed out come from the
"current" implementation.
Yes, it's now unreclaimable and has to be allocated by the admin's hand.
But considering recent improvements (memory defrag, CMA), it can become
less hard to use as the implementation is updated, and on-demand
allocation can be allowed.

3. If overhead is the problem and it's better to disable memcg, please
show numbers with HPC apps. Judging from Bull's presentation at the
Collaboration Summit this April, I didn't think memcg has very bad
overhead.

4. I guess a user who uses hugetlbfs will use usual memory at the same time.
Having two hierarchies, one for memory and one for hugetlb, will confuse him.

Thanks,
-Kame

2012-05-03 13:54:13

by Aneesh Kumar K.V

Subject: Re: linux-next: Tree for Apr 27 (uml + mm/memcontrol.c)

David Rientjes <[email protected]> writes:

>> My first version was to do it as a separate controller
>>
>> http://thread.gmane.org/gmane.linux.kernel.mm/73826
>>
>> But the feedback I received was to do it as a part of memcg extension,
>> because what the controller is limiting is memory albeit a different
>> type. AFAIU there is also this goal of avoiding controller proliferation.
>>
>
> Maybe Kame can speak up if he feels strongly about this, but I really
> think it should be its own controller in its own file (which would
> obviously make this discussion irrelevant since mm/hugetlbcg.c would be
> dependent on your own config symbol). I don't feel like this is the same
> as kmem since it's not a global resource like hugetlb pages are.

> Hugetlb pages can either be allocated statically on the command line at
> boot or dynamically via sysfs and they are globally available to whoever
> mmaps them through hugetlbfs. I see a real benefit from being able to
> limit the number of hugepages in the global pool to a set of tasks so they
> can't overuse what has been statically or dynamically allocated. And that
> ability should be available, in my opinion, without having to enable
> memcg, the page_cgroup metadata overhead that comes along with it, and the
> performance impact in using it. I also think it would be wise to separate
> it out into its own file at the source level so things like this don't
> arise in the future.

All the use cases I came across requested limiting both memory and
hugetlb pages; they want to limit the usage of both. So for the use cases
I am looking at, memcg will already be enabled.

-aneesh

2012-05-03 20:39:41

by David Rientjes

Subject: Re: linux-next: Tree for Apr 27 (uml + mm/memcontrol.c)

On Thu, 3 May 2012, Aneesh Kumar K.V wrote:

> All the use cases I came across requested for limiting both memory
> and hugetlb pages. They want to limit the usage of both.

And as cgroups moves to a single hierarchy for simplification, this isn't
hard to do by co-mounting both controllers.
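
Something like this (a sketch using mount(2); the mount point is
illustrative and the "hugetlb" controller name is the proposed one, so
treat both as assumptions):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Co-mount both controllers on one hierarchy so that a single
	 * tree of cgroup directories carries both the memory and the
	 * hugetlb limits.  Shell equivalent:
	 *   mount -t cgroup -o memory,hugetlb none /cgroup
	 */
	if (mount("none", "/cgroup", "cgroup", 0, "memory,hugetlb")) {
		perror("mount cgroup");
		return 1;
	}
	return 0;
}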

2012-05-03 20:56:53

by David Rientjes

Subject: Re: linux-next: Tree for Apr 27 (uml + mm/memcontrol.c)

On Thu, 3 May 2012, Hiroyuki Kamezawa wrote:

> I think hugetlb should be handled under memcg.
>
> 1. I think Hugetlb is memory.
>

Agreed, but hugetlb control is done in a very different way than regular
memory in terms of implementation and preallocation. Just because it's
called "memory controller" doesn't mean it must control all types of
memory; hugetlb has always been considered a separate type of VM that
diverges quite radically from the VM implementation. Forcing users into
an all-or-nothing approach is a lousy solution when it's simpler, cleaner,
more extensible, and doesn't lose any functionality when separated.

> 2. The characteristics of hugetlb usage you pointed out come from the
> "current" implementation.
> Yes, it's now unreclaimable and has to be allocated by the admin's hand.
> But considering recent improvements (memory defrag, CMA), it can become
> less hard to use as the implementation is updated, and on-demand
> allocation can be allowed.
>

You're describing transparent hugepages which are already supported by
memcg specifically because they are transparent. I haven't seen any
proposals on how to change hugetlb when it comes to preallocation and
mmapping the memory, because it would break the API with userspace.
Userspace packages like hugeadm are actually used in a wide variety of
places.

[ I would love to see hugetlb be deprecated entirely and move in a
direction where transparent hugepages can make that happen, but we're
not there yet because we're missing key functionality such as pagecache
support. ]

> 3. If overhead is the problem and it's better to disable memcg, please
> show numbers with HPC apps. Judging from Bull's presentation at the
> Collaboration Summit this April, I didn't think memcg has very bad
> overhead.
>

Is this a claim that memory-intensive workloads will have the exact same
performance with and without memcg enabled? That would be quite an
amazing feat, I agree, since tracking user pages would have absolutely
zero cost. Please clarify your answer here, and whether memcg is not
expected to cause even the slightest performance degradation on any
workload; I want to make sure I'm understanding it correctly. I'll follow
up after that.

Even if there's only the slightest performance degradation, that is exactly
what users of hugetlb are concerned with. They use hugetlb for
performance and it would be a shame for it to regress because you have to
enable memcg.

> 4. I guess a user who uses hugetlbfs will use usual memory at the same time.
> Having two hierarchies, one for memory and one for hugetlb, will confuse him.
>

Cgroups is moving to a single hierarchy for simplification; this isn't the
only example of where the current split is suboptimal, and it would be
disappointing to solidify hugetlb control as part of memcg because of a
current limitation that will be addressed by generic cgroups development.

Folks, once these things are merged they become an API that can't easily
be shifted around and separated out later. The decision now is either to
join hugetlb control with memcg forever, when they act in very different
ways, or to separate them so they can be used and configured individually.

2012-05-03 21:57:43

by David Rientjes

Subject: Re: linux-next: Tree for Apr 27 (uml + mm/memcontrol.c)

On Thu, 3 May 2012, David Rientjes wrote:

> Is this a claim that memory-intensive workloads will have the exact same
> performance with and without memcg enabled?

I've just run specjbb2005 three times on my system both with and without
cgroup_disable=memory on the command line and it is consistently 1% faster
without memcg. If I add -XX:+UseLargePages to the JVM command line to use
hugepages, the difference is even larger. So why must I incur this performance
degradation if I simply want to control who may mmap hugepages out of the
global pool?

The functionality to control this is pretty important if I want to ensure
that applications can't encroach on the preallocated hugepages of a
higher-priority application, for business reasons.

2012-05-03 23:17:15

by Hiroyuki Kamezawa

Subject: Re: linux-next: Tree for Apr 27 (uml + mm/memcontrol.c)

On Fri, May 4, 2012 at 5:56 AM, David Rientjes <[email protected]> wrote:
> On Thu, 3 May 2012, Hiroyuki Kamezawa wrote:
>
>> I think hugetlb should be handled under memcg.

>> 2. The characteristics of hugetlb usage you pointed out come from the
>> "current" implementation.
>> Yes, it's now unreclaimable and has to be allocated by the admin's hand.
>> But considering recent improvements (memory defrag, CMA), it can become
>> less hard to use as the implementation is updated, and on-demand
>> allocation can be allowed.
>>
>
> You're describing transparent hugepages which are already supported by
> memcg specifically because they are transparent.

THP just handles hugepages whose size is equal to the PMD size. Hugetlb
is something more than that; it supports various sizes.

> I haven't seen any
> proposals on how to change hugetlb when it comes to preallocation and
> mmapping the memory, because it would break the API with userspace.
> Userspace packages like hugeadm are actually used in a wide variety of
> places.
>

I just said that if users don't need to set a sysctl, it's more useful. I got
similar complaints from users about the IPC max params ;) I answered: set
them to unlimited...


>> 3. If overhead is the problem and it's better to disable memcg, please
>> show numbers with HPC apps. Judging from Bull's presentation at the
>> Collaboration Summit this April, I didn't think memcg has very bad
>> overhead.
>>
>
> Is this a claim that memory-intensive workloads will have the exact same
> performance with and without memcg enabled?

I wrote that I haven't gotten any report that memcg is too slow and needs
to be fixed.

I think, in general, once memory is allocated, an application will run
faster if it never frees memory, so a good application frees memory in
batches when it can. Because memcg just adds overhead to memory allocation
and unmapping, the runtime overhead tends to be small.
My target number when I started to work on memcg was 2-3% overhead.

> Even if there's only the slightest performance degradation, that is exactly
> what users of hugetlb are concerned with. They use hugetlb for
> performance and it would be a shame for it to regress because you have to
> enable memcg.
>

I think such people don't apply any limits at all... no kind of
virtualization or resource control has zero overhead.

>> 4. I guess a user who uses hugetlbfs will use usual memory at the same time.
>> Having two hierarchies, one for memory and one for hugetlb, will confuse him.
>>
>
> Cgroups is moving to a single hierarchy for simplification, this isn't the
> only example of where this is currently suboptimal and it would be
> disappointing to solidify hugetlb control as part of memcg because of this
> current limitation that will be addressed by generic cgroups development.
>
> Folks, once these things are merged they become an API that can't easily
> be shifted around and separated out later. The decision now is either to
> join hugetlb control with memcg forever, when they act in very different
> ways, or to separate them so they can be used and configured individually.

What do other people think? Tejun?

Thanks,
-Kame

2012-05-03 23:21:34

by Hiroyuki Kamezawa

Subject: Re: linux-next: Tree for Apr 27 (uml + mm/memcontrol.c)

On Fri, May 4, 2012 at 6:57 AM, David Rientjes <[email protected]> wrote:
> On Thu, 3 May 2012, David Rientjes wrote:
>
>> Is this a claim that memory-intensive workloads will have the exact same
>> performance with and without memcg enabled?
>
> I've just run specjbb2005 three times on my system both with and without
> cgroup_disable=memory on the command line and it is consistently 1% faster
> without memcg.
Hm, OK. Where does that overhead come from? Do you have perf output?
I'll need to check what is bad.

> If I add -XX:+UseLargePages to the JVM command line to use
> hugepages, the difference is even larger. So why must I incur this performance
> degradation if I simply want to control who may mmap hugepages out of the
> global pool?

Is that a common use case? If a user wants to do some resource control,
he will usually want to limit normal memory, too. That kind of excess
flexibility makes cgroups complicated and hard to use.

Thanks,
-Kame

2012-05-03 23:33:45

by Hiroyuki Kamezawa

Subject: Re: linux-next: Tree for Apr 27 (uml + mm/memcontrol.c)

On Fri, May 4, 2012 at 8:21 AM, Hiroyuki Kamezawa
<[email protected]> wrote:
> On Fri, May 4, 2012 at 6:57 AM, David Rientjes <[email protected]> wrote:
>> On Thu, 3 May 2012, David Rientjes wrote:
>> If I add -XX:+UseLargePages to the JVM command line to use
>> hugepages, the difference is even larger.

Ah, sorry, I don't understand this. Why does the performance difference
get larger if the usage of anon memory decreases? I guess the overheads
are only added to anon page faults and file cache handling; if you use
hugepages, the anon memory overheads should disappear.

Thanks,
-Kame

2012-05-04 17:24:27

by Tejun Heo

Subject: Re: linux-next: Tree for Apr 27 (uml + mm/memcontrol.c)

Hello,

(cc'ing Johannes and Michal, hi guys)

On Fri, May 04, 2012 at 08:17:11AM +0900, Hiroyuki Kamezawa wrote:
> > Cgroups is moving to a single hierarchy for simplification; this isn't the
> > only example of where the current split is suboptimal, and it would be
> > disappointing to solidify hugetlb control as part of memcg because of a
> > current limitation that will be addressed by generic cgroups development.
> >
> > Folks, once these things are merged they become an API that can't easily
> > be shifted around and separated out later. The decision now is either to
> > join hugetlb control with memcg forever, when they act in very different
> > ways, or to separate them so they can be used and configured individually.
>
> What do other people think? Tejun?

I don't know. hugetlbfs already is this franken thing which is
separate from the usual memory management. It needing cgroup type
resource limitation feels a bit weird to me. Isn't this supposed to
be used in more-or-less tightly controlled setups? The whole thing
needs to have its memory cut out from boot after all.

If someone really has to add cgroup support to hugetlbfs, I'm more
inclined to say let them play in their own corner unless incorporating
it into memcg makes it inherently better.

That said, I really don't know that much about mm. Johannes, Michal,
what do you guys think?

Thanks.

--
tejun

2012-05-04 18:29:17

by Aneesh Kumar K.V

Subject: Re: linux-next: Tree for Apr 27 (uml + mm/memcontrol.c)

David Rientjes <[email protected]> writes:

>> Is this a claim that memory-intensive workloads will have the exact same
>> performance with and without memcg enabled?

> I've just run specjbb2005 three times on my system both with and without
> cgroup_disable=memory on the command line and it is consistently 1% faster
> without memcg. If I add -XX:+UseLargePages to the JVM command line to use
> hugepages, the difference is even larger. So why must I incur this performance
> degradation if I simply want to control who may mmap hugepages out of the
> global pool?

Even if we end up having a separate controller for hugetlb, we would need
some bits of memcg, like tracking the page cgroup and moving the page cgroup
on page offline. We would also be duplicating some amount of the framework
for supporting cgroup removal etc., because all that code deals with
struct page (actually compound pages).

-aneesh

2012-05-07 14:01:10

by Michal Hocko

Subject: Re: linux-next: Tree for Apr 27 (uml + mm/memcontrol.c)

On Fri 04-05-12 10:24:20, Tejun Heo wrote:
> Hello,
>
> (cc'ing Johannes and Michal, hi guys)
>
> On Fri, May 04, 2012 at 08:17:11AM +0900, Hiroyuki Kamezawa wrote:
> > > Cgroups is moving to a single hierarchy for simplification; this isn't the
> > > only example of where the current split is suboptimal, and it would be
> > > disappointing to solidify hugetlb control as part of memcg because of a
> > > current limitation that will be addressed by generic cgroups development.
> > >
> > > Folks, once these things are merged they become an API that can't easily
> > > be shifted around and separated out later. The decision now is either to
> > > join hugetlb control with memcg forever, when they act in very different
> > > ways, or to separate them so they can be used and configured individually.
> >
> > What do other people think? Tejun?
>
> I don't know. hugetlbfs already is this franken thing which is
> separate from the usual memory management. It needing cgroup type
> resource limitation feels a bit weird to me. Isn't this supposed to
> be used in more-or-less tightly controlled setups? The whole thing
> needs to have its memory cut out from boot after all.
>
> If someone really has to add cgroup support to hugetlbfs, I'm more
> inclined to say let them play in their own corner unless incorporating
> it into memcg makes it inherently better.

I would agree with you, but my impression from the previous (hugetlb)
implementation was that it is much harder to implement charge moving
if we do not use page_cgroup.
Also, the range tracking is rather ugly and clumsy.

> That said, I really don't know that much about mm. Johannes, Michal,
> what do you guys think?
>
> Thanks.
>
> --
> tejun

--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic

2012-05-07 17:08:49

by Tejun Heo

Subject: Re: linux-next: Tree for Apr 27 (uml + mm/memcontrol.c)

Hello,

On Mon, May 07, 2012 at 04:01:04PM +0200, Michal Hocko wrote:
> > If someone really has to add cgroup support to hugetlbfs, I'm more
> > inclined to say let them play in their own corner unless incorporating
> > it into memcg makes it inherently better.
>
> I would agree with you, but my impression from the previous (hugetlb)
> implementation was that it is much harder to implement charge moving
> if we do not use page_cgroup.
> Also, the range tracking is rather ugly and clumsy.

Understood. I haven't looked at the code, so my opinion was based on
the assumption that the whole thing is completely separate (in design
and implementation) from memcg, as hugetlbfs is from the usual mm. If
it's better or easier implemented together with memcg, I have no
objection to making it part of memcg.

Thanks.

--
tejun

2012-05-08 10:48:50

by Michal Hocko

Subject: Re: linux-next: Tree for Apr 27 (uml + mm/memcontrol.c)

On Mon 07-05-12 10:08:40, Tejun Heo wrote:
> Hello,
>
> On Mon, May 07, 2012 at 04:01:04PM +0200, Michal Hocko wrote:
> > > If someone really has to add cgroup support to hugetlbfs, I'm more
> > > inclined to say let them play in their own corner unless incorporating
> > > it into memcg makes it inherently better.
> >
> > I would agree with you, but my impression from the previous (hugetlb)
> > implementation was that it is much harder to implement charge moving
> > if we do not use page_cgroup.
> > Also, the range tracking is rather ugly and clumsy.
>
> Understood. I haven't looked at the code, so my opinion was based on
> the assumption that the whole thing is completely separate (in design
> and implementation) from memcg, as hugetlbfs is from the usual mm. If
> it's better or easier implemented together with memcg, I have no
> objection to making it part of memcg.

I think we could still consider the possibility of using page_cgroup for
tracking without the rest of the memcg infrastructure (charging) in place.
It sounds like the memory overhead would be too big (at least for now) for
the relatively few hugetlb pages in use, but it would reduce the
performance hit if a user is interested only in the hugetlb limits
(mentioned by David in the other email).
On the other hand, we are on the way to getting rid of page_cgroup and
pushing the missing parts into struct page. Then we could accomplish the
hugetlb-only use case with the cgroup_disable=memory hugetlbaccount=1
kernel parameters (yes, still not very nice...).
That being said, I think that going the memcg way is simpler and that the
!memcg && hugetlb use case is still possible (somehow).

Or does anybody have a different idea?
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic