Hi,
Just noticed the ugly SGI /proc/*/numa_maps code got merged. I argued
several times against it and I very deliberately didn't include
a similar facility when I wrote the NUMA policy code because it's a bad
idea.
- it's a lot of ugly code.
- it's basically only a debugging hack right now
- it presents lots of kernel internal information and mempolicy
internals (like how many people have a page mapped) etc.
to userland that shouldn't be exposed to this.
- the format is very complicated and the chance of bug free
userland parsers of this is near zero.
- there is no demonstrated application that needs it
(there was a theoretical usecase where it might be needed,
but there were better solutions proposed for this)
Can the patch please be removed?
Thanks,
-Andi
Andi Kleen <[email protected]> wrote:
>
> Just noticed the ugly SGI /proc/*/numa_maps code got merged.
Been in -mm for over two months.
> I argued several times against it
OK, I either didn't notice or forgot to make a note of that.
> and I very deliberately didn't include
> a similar facility when I wrote the NUMA policy code because it's a bad
> idea.
>
>
> - it's a lot of ugly code.
> - it's basically only a debugging hack right now
> - it presents lots of kernel internal information and mempolicy
> internals (like how many people have a page mapped) etc.
> to userland that shouldn't be exposed to this.
> - the format is very complicated and the chance of bug free
> userland parsers of this is near zero.
> - there is no demonstrated application that needs it
> (there was a theoretical usecase where it might be needed,
> but there were better solutions proposed for this)
>
>
> Can the patch please be removed?
OK by me. I queued a revert patch.
On Sat, 10 Sep 2005, Andrew Morton wrote:
> Andi Kleen <[email protected]> wrote:
> >
> > Just noticed the ugly SGI /proc/*/numa_maps code got merged.
Well, it's ugly because you said that the fixes to make it less ugly were
"useless". I can still submit those fixes that make numa_maps a part of
smaps and that clean up the way policies are displayed.
> Been in -mm for over two months.
>
> > I argued several times against it
Nope, you argued against changing memory policies via /proc. The numa_maps
patch displays on which nodes a vma's memory resides.
> > and I very deliberately didn't include
> > a similar facility when I wrote the NUMA policy code because it's a bad
> > idea.
That's the idea of changing memory policies, right? Not displaying memory
usage across nodes?
> > - it presents lots of kernel internal information and mempolicy
> > internals (like how many people have a page mapped) etc.
> > to userland that shouldn't be exposed to this.
Very important information. Somewhat similar information is already
available via smaps.
> > - the format is very complicated and the chance of bug free
> > userland parsers of this is near zero.
What is complicated about the format? And the basic use is to see where
the memory for an application was placed. I have code here that parses the format.
> > - there is no demonstrated application that needs it
> > (there was a theoretical usecase where it might be needed,
> > but there were better solutions proposed for this)
Could you be more specific? The application is to figure out how memory is
placed. Just to cat /proc/<pid>/numa_maps. Seems to be a favorite with
some people.
> > Can the patch please be removed?
>
> OK by me. I queued a revert patch.
No idea why that would be necessary.
Christoph Lameter <[email protected]> wrote:
>
> On Sat, 10 Sep 2005, Andrew Morton wrote:
>
> > Andi Kleen <[email protected]> wrote:
> > >
> > > Just noticed the ugly SGI /proc/*/numa_maps code got merged.
>
> Well, it's ugly because you said that the fixes to make it less ugly were
> "useless". I can still submit those fixes that make numa_maps a part of
> smaps and that clean up the way policies are displayed.
It would be useful to see these.
> > > - it presents lots of kernel internal information and mempolicy
> > > internals (like how many people have a page mapped) etc.
> > > to userland that shouldn't be exposed to this.
>
> Very important information.
>
Important to whom? Kernel developers or userspace developers? If the
latter, what use do they actually make of it? Shouldn't it be documented?
> > > - there is no demonstrated application that needs it
> > > (there was a theoretical usecase where it might be needed,
> > > but there were better solutions proposed for this)
>
> Could you be more specific? The application is to figure out how memory is
> placed. Just to cat /proc/<pid>/numa_maps. Seems to be a favorite with
> some people.
If it's useful to application developers then fine. If it's only useful to
kernel developers then the argument is weakened. However there's still
quite a lot of development going on in this area, so there's still some
argument for having the monitoring ability in the mainline tree.
On Sat, 10 Sep 2005, Andrew Morton wrote:
> > Well, it's ugly because you said that the fixes to make it less ugly were
> > "useless". I can still submit those fixes that make numa_maps a part of
> > smaps and that clean up the way policies are displayed.
>
> It would be useful to see these.
URLs (these are not up to date and in particular the conversion
functions are much simpler in recent versions thanks to some help by
Paul Jackson since then)
http://www.uwsg.iu.edu/hypermail/linux/kernel/0507.3/1662.html
http://www.uwsg.iu.edu/hypermail/linux/kernel/0507.3/1663.html
http://www.uwsg.iu.edu/hypermail/linux/kernel/0507.3/1665.html
> > > > - it presents lots of kernel internal information and mempolicy
> > > > internals (like how many people have a page mapped) etc.
> > > > to userland that shouldn't be exposed to this.
> > Very important information.
> Important to whom? Kernel developers or userspace developers? If the
> latter, what use do they actually make of it? Shouldn't it be documented?
Both. System administrators would like to know on which node an
application has allocated memory. They would also like to change the way
applications allocate memory while they are running, but Andi has
philosophical concerns about that and will not even discuss methods to fix
the design issues in order to make that possible. I have a couple here
ready to pull out if an opportunity arises.
> > Could you be more specific? The application is to figure out how memory is
> > placed. Just to cat /proc/<pid>/numa_maps. Seems to be a favorite with
> > some people.
> If it's useful to application developers then fine. If it's only useful to
> kernel developers then the argument is weakened. However there's still
> quite a lot of development going on in this area, so there's still some
> argument for having the monitoring ability in the mainline tree.
I still have a hard time seeing how people can accept the line of
reasoning that says:
Users are not allowed to know on which nodes the operating system
allocated resources for a process and are also not allowed to see the
memory policies in effect for the memory areas
Then the application developers have to guess the effect that the memory
policies have on memory allocation. For memory alloc debugging the poor
app guys must today simply imagine what the operating system is doing.
They can see the amount of total memory allocated on a node via other proc
entries and then guess based on that which application has taken it. Then
they modify their apps and do another run.
My thinking today is that I'd rather leave /proc/<pid>/numa_stats
instead of using smaps because the smaps format is a bit verbose and
will make it difficult to see the allocation distribution. If we use smaps
then we probably need some tool to parse and present information.
numa_stats is directly usable.
I have a new series of patches here that does a gradual thing with
the policy layer:
1. Clean up policy layer to properly use node macros instead of bitmaps.
Some comments to explain certain limitations of the policy layer.
2. Clean up policy layer by doing do_xx and sys_xx separation
[optional but this separates the dynamic bitmaps in user space from
the static node maps in kernel space which I find very helpful]
3. Add mpol_to_str to policy layer and make numa_stats use mpol_to_str.
4. Solve the potential access issue when set_mempolicy is
updating task->mempolicy while numa_stats are being displayed by taking
a writelock on mmap_sem in set_mempolicy. This is in harmony with vma
mempolicy updates that also take a lock on mmap_sem and that are already
safe to access since numa_stats always takes an mmap_sem readlock. The
patch is essentially inserting two lines.
Then I still have these evil intentions of making it possible to
dynamically change memory policies from the outside. The minimum
that we all need is to at least be able to see what's going on.
Of course we would be happier if we were also allowed to change
policies to control memory allocation. The argument that the layer is not
able to handle this is of course true, since attempts to fix the issues
have been blocked.
Christoph Lameter <[email protected]> wrote:
>
> On Sat, 10 Sep 2005, Andrew Morton wrote:
>
> > > Well, it's ugly because you said that the fixes to make it less ugly were
> > > "useless". I can still submit those fixes that make numa_maps a part of
> > > smaps and that clean up the way policies are displayed.
> >
> > It would be useful to see these.
>
> URLs (these are not up to date and in particular the conversion
> functions are much simpler in recent versions thanks to some help by
> Paul Jackson since then)
>
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0507.3/1662.html
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0507.3/1663.html
> http://www.uwsg.iu.edu/hypermail/linux/kernel/0507.3/1665.html
>
That doesn't look like a great improvement. The code you have there now
is quite straightforward.
>
> > > > > - it presents lots of kernel internal information and mempolicy
> > > > > internals (like how many people have a page mapped) etc.
> > > > > to userland that shouldn't be exposed to this.
> > > Very important information.
> > Important to whom? Kernel developers or userspace developers? If the
> > latter, what use do they actually make of it? Shouldn't it be documented?
>
> Both. System administrators would like to know on which node an
> application has allocated memory. They would also like to change the way
> applications allocate memory while they are running, but Andi has
> philosophical concerns about that and will not even discuss methods to fix
> the design issues in order to make that possible. I have a couple here
> ready to pull out if an opportunity arises.
Well I can understand concerns about externally fiddling with a process's
allocation policies, but that's a separate issue from the one at hand.
I'd expect that the ability to know "on which node an application has
allocated memory" would be more valuable to a developer than to a sysadmin.
Can you provide an example usage?
> I still have a hard time seeing how people can accept the line of
> reasoning that says:
>
> Users are not allowed to know on which nodes the operating system
> allocated resources for a process
This would only be useful if the process wasn't explicitly setting memory
policies, yes? It's "hey, where did all my memory end up?".
> and are also not allowed to see the
> memory policies in effect for the memory areas
And this is the case where the process _has_ set some memory allocation
policies.
Certainly I can see value in that. How can a developer test his
code without any form of runtime feedback?
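For illustration, a minimal userland sketch of that situation (assumptions:
the libnuma <numaif.h> syscall wrappers, link with -lnuma, and a purely
illustrative four-node mask): a process asks for interleaved placement on
one mapping, and without something like numa_maps it then has no per-vma
way to confirm where the pages really went.

/*
 * Hedged sketch only: a process that sets an explicit policy on one
 * mapping, the case described above.  Assumes <numaif.h> from libnuma
 * (link with -lnuma) and a machine with at least four nodes; the 0xf
 * node mask is illustrative, not a recommendation.
 */
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	size_t sz = 16 * 1024 * 1024;
	unsigned long nodes = 0xf;	/* interleave over nodes 0-3 */
	char *p;

	p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Ask for round-robin placement on this vma before touching it. */
	if (mbind(p, sz, MPOL_INTERLEAVE, &nodes, 5, 0) < 0) {
		perror("mbind");
		return 1;
	}
	memset(p, 0, sz);		/* fault the pages in */

	/* The only per-vma view of where these pages actually landed
	 * would be /proc/<pid>/numa_maps; sleep so it can be inspected. */
	printf("pid %d: buffer at %p, policy MPOL_INTERLEAVE\n",
	       (int)getpid(), p);
	sleep(60);
	return 0;
}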
> Then the application developers have to guess the effect that the memory
> policies have on memory allocation. For memory alloc debugging the poor
> app guys must today simply imagine what the operating system is doing.
> They can see the amount of total memory allocated on a node via other proc
> entries and then guess based on that which application has taken it. Then
> they modify their apps and do another run.
Agree.
What does Andi mean by "there was a theoretical usecase where it might be
needed, but there were better solutions proposed for this"? The
application developer's problem here seems very real to me. Maybe the
"theoretical usecase" was something different?
> My thinking today is that I'd rather leave /proc/<pid>/numa_stats
> instead of using smaps because the smaps format is a bit verbose and
> will make it difficult to see the allocation distribution.
We don't want /proc/pid/smaps to have different formats on NUMA versus !NUMA.
> I have a new series of patches here that does a gradual thing with
> the policy layer:
That's off-topic.
Andi, I don't understand the objection. If I was developing the memory
policy tuning component of some honking number-crunching app I'd sure as
heck want to have some way of seeing the results of my efforts. How do I
do that without numa_maps?
And the content doesn't look too bad:
2000000000000000 default MaxRef=43 Pages=11 Mapped=11 N0=4 N1=3 N2=2 N3=2
2000000000038000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2
2000000000040000 default MaxRef=1 Pages=1 Mapped=1 Anon=1 N0=1
2000000000058000 default MaxRef=43 Pages=61 Mapped=61 N0=14 N1=15 N2=16 N3=16
2000000000268000 default MaxRef=1 Pages=2 Mapped=2 Anon=2 N0=2
It's easy to parse and it is extensible. It needs documenting though.
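As a rough check on the "easy to parse" claim, here is a hedged userland
sketch that reads lines like the ones quoted above and sums the per-node
counts. It assumes the one-vma-per-line "<start> <policy> key=value ..."
layout suggested by the sample output; it is not an authoritative format
description.

/*
 * Rough parser sketch for numa_maps output of the form quoted above.
 * Assumption: one vma per line, "<start> <policy>" followed by key=value
 * tokens, with per-node counts spelled N<node>=<pages>.
 */
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[1024];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%s/numa_maps",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		unsigned long long start, total = 0;
		char policy[64];
		char *tok, *save;

		if (sscanf(line, "%llx %63s", &start, policy) != 2)
			continue;

		/* Skip the address and policy tokens, then sum N<node>= counts. */
		strtok_r(line, " \n", &save);
		strtok_r(NULL, " \n", &save);
		while ((tok = strtok_r(NULL, " \n", &save)) != NULL) {
			unsigned int node;
			unsigned long pages;

			if (sscanf(tok, "N%u=%lu", &node, &pages) == 2)
				total += pages;
		}
		printf("%016llx %-10s %llu pages\n", start, policy, total);
	}
	fclose(f);
	return 0;
}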
On Sunday 11 September 2005 08:51, Andrew Morton wrote:
> Certainly I can see value in that. How can a developer test his
> code without any form of runtime feedback?
There are already several ways to do that: first the counters output
by numastat (local_node, other_node, interleave_hit etc.), which tell you
exactly how the allocation strategy ended up. And a process can find out
on which node a specific page resides using get_mempolicy().
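A minimal sketch of that second mechanism (assuming the libnuma <numaif.h>
wrapper, link with -lnuma): the process itself asks which node backs one of
its own pages via MPOL_F_NODE | MPOL_F_ADDR.

/*
 * Hedged example of the get_mempolicy() route mentioned above: the
 * process asks the kernel which node backs one of its own pages.
 * Assumes the <numaif.h> wrapper from libnuma (link with -lnuma).
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	size_t sz = 4 * 1024 * 1024;
	char *buf = malloc(sz);
	int node = -1;

	if (!buf)
		return 1;
	memset(buf, 0, sz);	/* touch the pages so they get allocated */

	/* MPOL_F_NODE | MPOL_F_ADDR: return the node of the page at buf. */
	if (get_mempolicy(&node, NULL, 0, buf, MPOL_F_NODE | MPOL_F_ADDR) < 0) {
		perror("get_mempolicy");
		return 1;
	}
	printf("first page of buf sits on node %d\n", node);
	free(buf);
	return 0;
}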
If you really want to know what's going on you can use performance counters
of the machine to tell you the amount of cross node traffic
(e.g. see numamon in the numactl source tree as an example)
I don't think the /proc file gives the programmers any additional
information. Externally you shouldn't know about the individual
addresses anyway.
All it does is to open the flood gates of external mempolicy management, which
is wrong.
> It's easy to parse and it is extensible. It needs documenting though.
Extensible yes, but I have my doubts about "easy to parse". User processes
will very likely get it wrong, like they traditionally have with anything
more complicated in /proc. /proc/*/maps has been a similar disaster too.
-Andi
On Sun, 11 Sep 2005, Andi Kleen wrote:
> On Sunday 11 September 2005 08:51, Andrew Morton wrote:
>
> > Certainly I can see value in that. How can a developer test his
> > code without any form of runtime feedback?
>
> There are already several ways to do that: first the counters output
> by numastat (local_node, other_node, interleave_hit etc.), which tell you
> exactly how the allocation strategy ended up. And a process can find out
> on which node a specific page resides using get_mempolicy().
get_mempolicy only works from inside the thread that is executing. This
means all applications would need some specialized library to dump the
memory allocation information and you would need some external means of
triggering the dump of node information.
The other method, via numastat, gives aggregate values that are not useful at
all if multiple processes are running. Yes, on a bare machine with one
process running one could determine the memory allocated by searching
through all nodes in order to guess where the memory was allocated (try
that on a box with hundreds of nodes!).
However, one still does not know which memory section (vma) is allocated
on which nodes. And this may be important since critical data may need to
be interleaved whereas other massive amounts of data may be organized in a
different way.
> If you really want to know what's going on you can use performance counters
> of the machine to tell you the amount of cross node traffic
> (e.g. see numamon in the numactl source tree as an example)
Cross-node traffic is not the same as memory use. Performance counters are
a debugging feature; they are complex to manage and only available in a
limited way. It's also not trivial to figure out a way to trigger these
counters in order to get the data.
> All it does is to open the flood gates of external mempolicy management, which
> is wrong.
It does not open those doors since the modification of memory policies is
still not possible (IMHO this is one element of a series of problematic
"features" in the policy layer).
External memory policy management is a necessary feature for system
administration, batch process scheduling as well as for testing and
debugging a system.
On Mon, Sep 19, 2005 at 10:11:20AM -0700, Christoph Lameter wrote:
> However, one still does not know which memory section (vma) is allocated
> on which nodes. And this may be important since critical data may need to
Maybe. Well sure, things could be maybe important. Or maybe not.
Doesn't seem like a particularly strong case to add a lot of ugly
code though.
> External memory policy management is a necessary feature for system
> administration, batch process scheduling as well as for testing and
> debugging a system.
I'm not convinced of this at all. Most of these things proposed so far
can be done much simpler with 90% of the functionality (e.g. just swapoff
per process for migration), and I haven't seen a clear rationale except
for lots of maybes that the missing 10% are worth all the complexity
you seem to plan to add.
-Andi
On Mon, 19 Sep 2005, Andi Kleen wrote:
> On Mon, Sep 19, 2005 at 10:11:20AM -0700, Christoph Lameter wrote:
> > However, one still does not know which memory section (vma) is allocated
> > on which nodes. And this may be important since critical data may need to
>
> Maybe. Well sure, things could be maybe important. Or maybe not.
> Doesn't seem like a particularly strong case to add a lot of ugly
> code though.
We gradually need to fix the deficiencies of the policy layer. Calling
fixes "ugly code" and refusing to discuss solutions does not help anyone.
> > External memory policy management is a necessary feature for system
> > administration, batch process scheduling as well as for testing and
> > debugging a system.
>
> I'm not convinced of this at all. Most of these things proposed so far
> can be done much simpler with 90% of the functionality (e.g. just swapoff
> per process for migration), and I haven't seen a clear rationale except
> for lots of maybes that the missing 10% are worth all the complexity
> you seem to plan to add.
Have you ever had the challenge to work with large HPC applications on a
large NUMA system? Which things? Many HPC apps do not use swap space
at all, and we likely won't be using swap for page migration (see Marcelo's
work on a migration cache). All I have heard is you imagining complex
solutions ("performance counters" etc.) to things that would be simple if
the policy layer were up to the task.
On Monday 19 September 2005 23:32, Christoph Lameter wrote:
> On Mon, 19 Sep 2005, Andi Kleen wrote:
> > On Mon, Sep 19, 2005 at 10:11:20AM -0700, Christoph Lameter wrote:
> > > However, one still does not know which memory section (vma) is
> > > allocated on which nodes. And this may be important since critical data
> > > may need to
> >
> > Maybe. Well sure, things could be maybe important. Or maybe not.
> > Doesn't seem like a particularly strong case to add a lot of ugly
> > code though.
>
> We gradually need to fix the deficiencies of the policy layer. Calling
> fixes "ugly code" and refusing to discuss solutions does not help anyone.
I'm happy to discuss solutions given a clear use case what you want
to do, why you want to do it etc.
> > > External memory policy management is a necessary feature for system
> > > administration, batch process scheduling as well as for testing and
> > > debugging a system.
> >
> > I'm not convinced of this at all. Most of these things proposed so far
> > can be done much simpler with 90% of the functionality (e.g. just swapoff
> > per process for migration), and I haven't seen a clear rationale except
> > for lots of maybes that the missing 10% are worth all the complexity
> > you seem to plan to add.
>
> Have you ever had the challenge to work with large HPC applications on a
> large NUMA system?
Ah - my code is better because my credentials are better. Maybe better than
maybe... I did only some tuning on large systems, but I spent quite some time
tuning NUMA code on small NUMA systems. Also I did spend a bit of time
looking at some of the tools offered by other Unixes and it left the clear
impression that they were far too complex and shouldn't be emulated in Linux.
The existing NUMA API was designed by keeping things relatively
simple (in fact my experience so far has been that most users only
want the most simple of its policies - even the moderately
fancy stuff in there seems to be rarely used), so I think the bar
for more NUMA policy should be set extremely high and everything
should come with extremely good rationales.
> Which things? Many HPC apps do not use swap space
> at all, and we likely won't be using swap for page migration
Yes, that was a lot of quite complicated code that seemed to me
quite like overkill for its job.
Regarding swap: surely you know swapping doesn't necessarily
write to disk but first puts stuff into the swap cache (we even
talked about that in Ottawa). So the plan would be to first
implement swapoff_process() that writes out to disk. And then,
if someone really comes up with a clear case where this doesn't work
for them, this can be extended to migrate directly out of the swap
cache, skipping the IO step.
-Andi
Andi Kleen <[email protected]> wrote:
>
> On Monday 19 September 2005 23:32, Christoph Lameter wrote:
> > On Mon, 19 Sep 2005, Andi Kleen wrote:
> > > On Mon, Sep 19, 2005 at 10:11:20AM -0700, Christoph Lameter wrote:
> > > > However, one still does not know which memory section (vma) is
> > > > allocated on which nodes. And this may be important since critical data
> > > > may need to
> > >
> > > Maybe. Well sure, things could be maybe important. Or maybe not.
> > > Doesn't seem like a particularly strong case to add a lot of ugly
> > > code though.
> >
> > We gradually need to fix the deficiencies of the policy layer. Calling
> > fixes "ugly code" and refusing to discuss solutions does not help anyone.
>
> I'm happy to discuss solutions given a clear use case what you want
> to do, why you want to do it etc.
Yes. A clear explanation of the requirements and use cases based on
real-world experience from real-world users and/or application developers.
I asked Christoph for that last week. The answers were, iirc, a bit
half-baked, but believable.
> > Have you ever had the challenge to work with large HPC applications on a
> > large NUMA system?
>
> Ah - my code is better because my credentials are better.
No fair. I've never worked on big HPC systems and any feedback from the
field which Christoph can provide is really important in helping us
understand what features the kernel needs to offer. I would expect that
SGI engineering have a better understanding of HPC users' needs than pretty
much anyone else in the world.
It's a shame that SGI engineering aren't better at communicating those
needs to wee little kernel developers. And we need to get better at this
because, as you say, external policy control is going to be a ton harder to
swallow than /proc/pid/numa_maps.
On Mon, 19 Sep 2005, Andrew Morton wrote:
> No fair. I've never worked on big HPC systems and any feedback from the
> field which Christoph can provide is really important in helping us
> understand what features the kernel needs to offer. I would expect that
> SGI engineering have a better understanding of HPC users' needs than pretty
> much anyone else in the world.
We would be glad to be more clear on these issues. It's difficult
to see how that can happen when the discussion is blocked on
principle. I have discussed various scenarios in long discussion
threads with Andi. The answer "My code is simple and cannot be changed"
won't get us anywhere.
> It's a shame that SGI engineering aren't better at communicating those
> needs to wee little kernel developers. And we need to get better at this
I wish I knew how we can improve the communication. I have discussed this
for two months now and tried to respond to every objection and concern
that came up.
> because, as you say, external policy control is going to be a ton harder to
> swallow than /proc/pid/numa_maps.
Could we just do the numa_maps in the right way now and do a code cleanup
for 2.6.14 (node_maps, user interface / system interface separation)? We can
then think longer term about how the memory of a process can be managed.
Christoph wrote:
> I wish I knew how we can improve the communication. I have discussed this
> for two months now and tried to respond to every objection and concern
> that came up.
The success of communication must be measured by what is heard,
not what is said.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401