2007-05-04 01:38:27

by Paul Jackson

[permalink] [raw]
Subject: Re: + per-cpuset-hugetlb-accounting-and-administration.patch added to -mm tree

Adding Christoph Lameter <[email protected]> to the cc list, as he knows
more about hugetlb pages than I do.

This patch strikes me as a bit odd.

Granted, it's solving what could be a touchy problem with a fairly
simple solution, which is usually a Good Thing(tm).

However, the idea that different tasks would see different values for
the following fields in /proc/meminfo:

HugePages_Total: 0
HugePages_Free: 0

strikes me as odd, and risky. I would have thought that usually, all
tasks in the system should see the same values in the files in /proc
(as opposed to the files in particular task subdirectories /proc/<pid>.)

This patch strikes me as a bit of a hack, good for compatibility, but
hiding a booby trap that will bite some user code in the long run.

But I'm not enough of an expert to know what the right tradeoffs are
in this matter.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401


2007-05-04 04:49:20

by Ken Chen

[permalink] [raw]
Subject: Re: + per-cpuset-hugetlb-accounting-and-administration.patch added to -mm tree

On 5/3/07, Paul Jackson <[email protected]> wrote:
> Adding Christoph Lameter <[email protected]> to the cc list, as he knows
> more about hugetlb pages than I do.
>
> This patch strikes me as a bit odd.
>
> Granted, it's solving what could be a touchy problem with a fairly
> simple solution, which is usually a Good Thing(tm).
>
> However, the idea that different tasks would see different values for
> the following fields in /proc/meminfo:
>
> HugePages_Total: 0
> HugePages_Free: 0
>
> strikes me as odd, and risky. I would have thought that usually, all
> tasks in the system should see the same values in the files in /proc
> (as opposed to the files in particular task subdirectories /proc/<pid>.)
>
> This patch strikes me as a bit of a hack, good for compatibility, but
> hiding a booby trap that will bite some user code in the long run.
>
> But I'm not enough of an expert to know what the right tradeoffs are
> in this matter.

Would annotating the Hugepages_* field with name of cpuset help? I
orginally thought that since cpuset's mems are hirearchical in memory
assignment, it is fairly straightforward to understand what's going
on: parent cpuset stats include its and all of its children. For
example, if root cpuset has two sub job1 and job2 cpusets, each has 20
and 30 htlb pages, when query at each level, we have:

[root@box]# echo $$ > /dev/cpuset/tasks
[root@box]# grep HugePages_Total /proc/meminfo
HugePages_Total: 50

[root@box]# echo $$ > /dev/cpuset/job1/tasks
[root@box]# grep HugePages_Total /proc/meminfo
HugePages_Total: 20

[root@box]# echo $$ > /dev/cpuset/job2/tasks
[root@box]# grep HugePages_Total /proc/meminfo
HugePages_Total: 30

If this is odd, do you have any suggestions for alternative?

- Ken

2007-05-04 05:03:46

by Andrew Morton

[permalink] [raw]
Subject: Re: + per-cpuset-hugetlb-accounting-and-administration.patch added to -mm tree

On Thu, 3 May 2007 21:49:12 -0700 "Ken Chen" <[email protected]> wrote:

> On 5/3/07, Paul Jackson <[email protected]> wrote:
> > Adding Christoph Lameter <[email protected]> to the cc list, as he knows
> > more about hugetlb pages than I do.
> >
> > This patch strikes me as a bit odd.
> >
> > Granted, it's solving what could be a touchy problem with a fairly
> > simple solution, which is usually a Good Thing(tm).
> >
> > However, the idea that different tasks would see different values for
> > the following fields in /proc/meminfo:
> >
> > HugePages_Total: 0
> > HugePages_Free: 0
> >
> > strikes me as odd, and risky. I would have thought that usually, all
> > tasks in the system should see the same values in the files in /proc
> > (as opposed to the files in particular task subdirectories /proc/<pid>.)
> >
> > This patch strikes me as a bit of a hack, good for compatibility, but
> > hiding a booby trap that will bite some user code in the long run.
> >
> > But I'm not enough of an expert to know what the right tradeoffs are
> > in this matter.
>
> Would annotating the Hugepages_* field with name of cpuset help?

There are existing programs which parse /proc/meminfo. If we're going to
do any of this then it would need to be via new fields.

I don't think we should be altering the meaning of the HugePages fields
like this. One can imagine scenarios in which such a change would cause
existing userspace scripts to fail. Plus it's Just Weird to use
/proc/meminfo in this manner.

> I
> orginally thought that since cpuset's mems are hirearchical in memory
> assignment, it is fairly straightforward to understand what's going
> on: parent cpuset stats include its and all of its children. For
> example, if root cpuset has two sub job1 and job2 cpusets, each has 20
> and 30 htlb pages, when query at each level, we have:
>
> [root@box]# echo $$ > /dev/cpuset/tasks
> [root@box]# grep HugePages_Total /proc/meminfo
> HugePages_Total: 50
>
> [root@box]# echo $$ > /dev/cpuset/job1/tasks
> [root@box]# grep HugePages_Total /proc/meminfo
> HugePages_Total: 20
>
> [root@box]# echo $$ > /dev/cpuset/job2/tasks
> [root@box]# grep HugePages_Total /proc/meminfo
> HugePages_Total: 30
>
> If this is odd, do you have any suggestions for alternative?

If it's per-cpuset information then shouldn't it be presented in
/dev/cpuset/something?

2007-05-04 05:12:10

by Paul Jackson

[permalink] [raw]
Subject: Re: + per-cpuset-hugetlb-accounting-and-administration.patch added to -mm tree

Ken wrote:
> If this is odd, do you have any suggestions for alternative?

No, I don't. Sorry.

It's a touchy problem, and I'm not enough of an expert to know what the
right tradeoffs are in this matter.

I agree with your point that if you realize what's going on, namely
that what cpuset the task reading meminfo is in affects the HugePages
values that are read, then one can use the interface easily enough.

... how about:

1) don't change the existing HugePages_* values - keep them
system-wide, and

2) adding two new values, by such names as:

Current_Cpuset_HugePages_Total: 0
Current_Cpuset_HugePages_Free: 0

That's certainly an uglier proposal than yours ;). But at least it
seems clearer, and doesn't make incompatible changes to what's there.

It does require user level code change to actually benefit from the
new values, whereas your patch sort of sneaks them in, on the assumption
that the majority of reads of these values would really prefer getting
the cpuset relative totals instead.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2007-05-04 05:35:53

by David Rientjes

[permalink] [raw]
Subject: Re: + per-cpuset-hugetlb-accounting-and-administration.patch added to -mm tree

On Thu, 3 May 2007, Paul Jackson wrote:

> 2) adding two new values, by such names as:
>
> Current_Cpuset_HugePages_Total: 0
> Current_Cpuset_HugePages_Free: 0
>

This information is already exported to userspace through sysfs. Simply
grab the N-mems allowed to your task from /proc/pid/status, cat
/sys/devices/system/node/nodeN/meminfo for each N, and add.

2007-05-04 05:58:40

by Paul Jackson

[permalink] [raw]
Subject: Re: + per-cpuset-hugetlb-accounting-and-administration.patch added to -mm tree

Andrew wrote:
> If it's per-cpuset information then shouldn't it be presented in
> /dev/cpuset/something?

Yeah - if huge pages were mainline future, rather than the more
controversial sideline they are now, then it would make more sense
to put in these stats in each cpuset.

Note, Ken, that if we did that, the calculation of these new Total and
Free stats would be a little different than your new code. Instead of
looping over the memory nodes in the current tasks mems_allowed mask,
we would loop over the memory nodes allowed in the cpuset being queried
(the cpuset whose 'hugepages_total' or 'hugepages_free' special
file we were reading, not the current tasks cpuset.)

But I'm reluctant to entertain such cpuset additions until I see more
of where my colleague Christoph is going in related work.

Clearly as can be seen on one of his posts on the parallel lkml thread:

Re: + pretend-cpuset-has-some-form-of-hugetlb-page-reservation.patch added to -mm tree

earlier today, Christoph is no great fan of the current implementation
of huge pages.

And clearly as memory continues to get bigger, we will be putting more
stress on these page size related mechanisms.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2007-05-04 06:12:37

by Paul Jackson

[permalink] [raw]
Subject: Re: + per-cpuset-hugetlb-accounting-and-administration.patch added to -mm tree

David wrote:
> This information is already exported to userspace through sysfs. Simply
> grab the N-mems allowed to your task from /proc/pid/status, cat
> /sys/devices/system/node/nodeN/meminfo for each N, and add.

Good point.

I don't see how this present patch, to change /proc/meminfo,
can be justified, given this.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2007-05-04 06:14:12

by Ken Chen

[permalink] [raw]
Subject: Re: + per-cpuset-hugetlb-accounting-and-administration.patch added to -mm tree

On 5/3/07, Paul Jackson <[email protected]> wrote:
> Note, Ken, that if we did that, the calculation of these new Total and
> Free stats would be a little different than your new code. Instead of
> looping over the memory nodes in the current tasks mems_allowed mask,
> we would loop over the memory nodes allowed in the cpuset being queried
> (the cpuset whose 'hugepages_total' or 'hugepages_free' special
> file we were reading, not the current tasks cpuset.)

This is even more controversial and messy. akpm already dropped the
patch and expressed that he doesn't like it. And I won't go down
another messy path. I will let this idea RIP.

2007-05-04 06:46:19

by Bill Irwin

[permalink] [raw]
Subject: Re: + per-cpuset-hugetlb-accounting-and-administration.patch added to -mm tree

On Thu, May 03, 2007 at 06:38:21PM -0700, Paul Jackson wrote:
> Adding Christoph Lameter <[email protected]> to the cc list, as he knows
> more about hugetlb pages than I do.
> This patch strikes me as a bit odd.
> Granted, it's solving what could be a touchy problem with a fairly
> simple solution, which is usually a Good Thing(tm).
> However, the idea that different tasks would see different values for
> the following fields in /proc/meminfo:
> HugePages_Total: 0
> HugePages_Free: 0
> strikes me as odd, and risky. I would have thought that usually, all
> tasks in the system should see the same values in the files in /proc
> (as opposed to the files in particular task subdirectories /proc/<pid>.)
> This patch strikes me as a bit of a hack, good for compatibility, but
> hiding a booby trap that will bite some user code in the long run.
> But I'm not enough of an expert to know what the right tradeoffs are
> in this matter.

The semantics of the global /proc/meminfo should not change; a separate
per-cpuset reporting mechanism should really be used.


-- wli

2007-05-04 07:39:13

by Paul Jackson

[permalink] [raw]
Subject: Re: + per-cpuset-hugetlb-accounting-and-administration.patch added to -mm tree

> I will let this idea RIP.

Agreed.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401