Hi Everyone,
On NUMA systems, we only have one page cache for each file. For
programs/shared libraries, remote access has a longer delay than
local access.
So, is it possible to implement a per-node page cache for
programs/libraries?
We can do it like this:
1.) Add a new system call that marks specific files as
NUMA-aware, such as:
set_numa_aware("/usr/lib/libc.so", enable);
After the system call, the page cache of libc.so carries the
flag NUMA_ENABLED.
2.) When a new process sets up the MMU page tables for
libc.so, the kernel checks
whether NUMA_ENABLED is set. If it is, the kernel hands out a
page bound to the process's NUMA node.
This way, we can eliminate remote accesses for
programs/shared libraries.
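To make the intended usage concrete, here is a minimal sketch. Note that
set_numa_aware() is only the interface proposed above; it does not exist
in any kernel, so the stub below simply fails:

/*
 * Illustration of the proposed interface only: set_numa_aware()
 * is not a real system call, so this stub always fails.
 */
#include <errno.h>
#include <stdio.h>

static int set_numa_aware(const char *path, int enable)
{
	(void)path;
	(void)enable;
	errno = ENOSYS;		/* nothing implements this today */
	return -1;
}

int main(void)
{
	if (set_numa_aware("/usr/lib/libc.so", 1) != 0)
		perror("set_numa_aware (proposed, not implemented)");
	return 0;
}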
Is this proposal ok? Or do you have a better idea?
Thanks
Huang Shijie
On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> On NUMA systems, we only have one page cache for each file. For
> programs/shared libraries, remote access has a longer delay than
> local access.
>
> So, is it possible to implement a per-node page cache for
> programs/libraries?
At this point, we have no way to support text replication within a
process. So what you're suggesting (if implemented) would work for
processes which limit themselves to a single node. That is, if you
have a system with CPUs 0-3 on node 0 and CPUs 4-7 on node 1, a process
which only works on node 0 or only works on node 1 will get text on the
appropriate node.
If there's a process which runs on both nodes 0 and 1, there's no support
for per-node PGDs. So it will get a mix of pages from nodes 0 and 1,
and that doesn't necessarily seem like a big win. I haven't yet dived
into how hard it would be to make mm->pgd a per-node allocation.
I have been thinking about this a bit; one of our internal performance
teams flagged the potential performance win to me a few months ago.
I don't have a concrete design for text replication yet; there have been
various attempts over the years, but none were particularly compelling.
By the way, the degree of performance win varies between different CPUs,
but it's measurable on all the systems we've tested on (from three
different vendors).
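As a side note, it is easy to observe where a process's text actually
landed: move_pages(2) with a NULL node array just reports the node each
page currently lives on. A minimal sketch using libnuma's numaif.h
(link with -lnuma) that prints the node holding this program's own
text page:

#include <numaif.h>	/* move_pages(); link with -lnuma */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	uintptr_t mask = ~((uintptr_t)sysconf(_SC_PAGESIZE) - 1);
	/* Probe the page holding main() itself, i.e. this program's text. */
	void *page = (void *)((uintptr_t)main & mask);
	int status = -1;

	/* A NULL node array means "report where the page is now". */
	if (move_pages(0 /* self */, 1, &page, NULL, &status, 0) == 0)
		printf("text page %p is on node %d\n", page, status);
	else
		perror("move_pages");
	return 0;
}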
On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> Hi Everyone,
>
> On NUMA systems, we only have one page cache for each file. For
> programs/shared libraries, remote access has a longer delay than
> local access.
>
> So, is it possible to implement a per-node page cache for
> programs/libraries?
What do you mean, per-node page cache? Multiple pages for the same
area of file? That'd be bloody awful on coherency...
On Wed, Sep 01, 2021 at 04:55:01AM +0000, Al Viro wrote:
> On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> > Hi Everyone,
> >
> > On NUMA systems, we only have one page cache for each file. For
> > programs/shared libraries, remote access has a longer delay than
> > local access.
> >
> > So, is it possible to implement a per-node page cache for
> > programs/libraries?
>
> What do you mean, per-node page cache? Multiple pages for the same
> area of file? That'd be bloody awful on coherency...
Yes, a per-NUMA-node page cache.
We can limit it to files for programs/shared libraries, which are mostly
read-only and do not need coherency.
Thanks
Huang Shijie
On Wed, Sep 01, 2021 at 04:25:01AM +0100, Matthew Wilcox wrote:
> On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> > On NUMA systems, we only have one page cache for each file. For
> > programs/shared libraries, remote access has a longer delay than
> > local access.
> >
> > So, is it possible to implement a per-node page cache for
> > programs/libraries?
>
> At this point, we have no way to support text replication within a
> process. So what you're suggesting (if implemented) would work for
I created a glibc patch which can do text replication within a process.
I will send it to the glibc maintainers later.
(It seems glibc does not use patches to maintain its code.)
> processes which limit themselves to a single node. That is, if you
> have a system with CPUs 0-3 on node 0 and CPUs 4-7 on node 1, a process
> which only works on node 0 or only works on node 1 will get text on the
> appropriate node.
>
> If there's a process which runs on both nodes 0 and 1, there's no support
> for per-node PGDs. So it will get a mix of pages from nodes 0 and 1,
I think we do not need per-node PGDs.
One PGD per process is okay with me.
> and that doesn't necessarily seem like a big win. I haven't yet dived
> into how hard it would be to make mm->pgd a per-node allocation.
>
> I have been thinking about this a bit; one of our internal performance
> teams flagged the potential performance win to me a few months ago.
> I don't have a concrete design for text replication yet; there have been
> various attempts over the years, but none were particularly compelling.
>
> By the way, the degree of performance win varies between different CPUs,
> but it's measurable on all the systems we've tested on (from three
> different vendors).
Thank you for sharing this.
Thanks
Huang Shijie
On Wed, Sep 1, 2021 at 11:09 AM Shijie Huang
<[email protected]> wrote:
>
> Hi Everyone,
>
> On NUMA systems, we only have one page cache for each file. For
> programs/shared libraries, remote access has a longer delay than
> local access.
>
> So, is it possible to implement a per-node page cache for
> programs/libraries?
As far as I know, this is a very interesting topic; we do have some
"solutions" for this.
MIPS kernel supports kernel TEXT replication:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/mips/sgi-ip27/Kconfig
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/mips/sgi-ip27/ip27-klnuma.c
config REPLICATE_KTEXT
	bool "Kernel text replication support"
	depends on SGI_IP27
	select MAPPED_KERNEL
	help
	  Say Y here to enable replicating the kernel text across multiple
	  nodes in a NUMA cluster. This trades memory for speed.
For x86, RedHawk Linux (https://www.concurrent-rt.com/solutions/linux/)
supports kernel text replication.
Here are some benchmarks:
https://www.concurrent-rt.com/wp-content/uploads/2016/11/kernel-page-replication.pdf
For userspace, dplace from SGI can help replicate text:
https://www.spec.org/cpu2006/flags/SGI-platform.html
-r bl: specifies that text should be replicated on the NUMA node or
nodes where the process is running.
'b' indicates that binary (a.out) text should be replicated;
'l' indicates that library text should be replicated.
But all of the above, except the MIPS ktext replication, are out of tree.
Please count me in if you have any solution or any pending patch.
I am interested in this topic.
>
>
> We can do it like this:
>
> 1.) Add a new system call that marks specific files as
> NUMA-aware, such as:
>
> set_numa_aware("/usr/lib/libc.so", enable);
>
> After the system call, the page cache of libc.so carries the
> flag NUMA_ENABLED.
>
>
> 2.) When a new process sets up the MMU page tables for
> libc.so, the kernel checks
>
> whether NUMA_ENABLED is set. If it is, the kernel hands out a
> page bound to the process's NUMA node.
>
> This way, we can eliminate remote accesses for
> programs/shared libraries.
>
>
> Is this proposal ok? Or do you have a better idea?
>
>
> Thanks
>
> Huang Shijie
Thanks
barry
On Wed, Sep 01, 2021 at 02:25:34PM +0000, Huang Shijie wrote:
> On Wed, Sep 01, 2021 at 01:30:45PM +0000, Huang Shijie wrote:
> > On Wed, Sep 01, 2021 at 04:25:01AM +0100, Matthew Wilcox wrote:
> > > On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> > > > On NUMA systems, we only have one page cache for each file. For
> > > > programs/shared libraries, remote access has a longer delay than
> > > > local access.
> > > >
> > > > So, is it possible to implement a per-node page cache for
> > > > programs/libraries?
> > >
> > > At this point, we have no way to support text replication within a
> > > process. So what you're suggesting (if implemented) would work for
> >
> > I created a glibc patch which can do the text replication within a process.
> The "text replication" means the shared libraries, not program itself.
Is it really worthwhile to do only the shared libraries?
On Tue, Aug 31, 2021 at 9:57 PM Al Viro <[email protected]> wrote:
>
> What do you mean, per-node page cache? Multiple pages for the same
> area of file? That'd be bloody awful on coherency...
You absolutely don't want to actually duplicate it in the cache.
But what you could do, if you wanted to, would be to catch the
situation where you have lots of expensive NUMA accesses either using
our VM infrastructure or performance counters, and when the mapping is
a MAP_PRIVATE you just do a COW fault on them.
Honestly, I suspect it only makes sense when you have already bound
your process to one particular NUMA node.
Sounds entirely doable, and has absolutely nothing to do with the page
cache. It would literally just be an "over-eager COW fault triggered
by NUMA access counters".
Linus
On Wed, Sep 1, 2021 at 10:24 AM Linus Torvalds
<[email protected]> wrote:
>
> But what you could do, if you wanted to, would be to catch the
> situation where you have lots of expensive NUMA accesses either using
> our VM infrastructure or performance counters, and when the mapping is
> a MAP_PRIVATE you just do a COW fault on them.
>
> Sounds entirely doable, and has absolutely nothing to do with the page
> cache. It would literally just be an "over-eager COW fault triggered
> by NUMA access counters".
Note how it would work perfectly fine for anonymous mappings too. Just
to reinforce the point that this has nothing to do with any page cache
issues.
Of course, if you want to actually then *share* pages within a node
(rather than replicate them for each process), that gets more
exciting.
But I suspect that this is mainly only useful for long-running big
processes (not least due to that node binding thing), so I question
the need for that kind of excitement.
Linus
On Wed, Sep 01, 2021 at 01:30:45PM +0000, Huang Shijie wrote:
> On Wed, Sep 01, 2021 at 04:25:01AM +0100, Matthew Wilcox wrote:
> > On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> > > On NUMA systems, we only have one page cache for each file. For
> > > programs/shared libraries, remote access has a longer delay than
> > > local access.
> > >
> > > So, is it possible to implement a per-node page cache for
> > > programs/libraries?
> >
> > At this point, we have no way to support text replication within a
> > process. So what you're suggesting (if implemented) would work for
>
> I created a glibc patch which can do the text replication within a process.
The "text replication" means the shared libraries, not program itself.
Thanks
Huang Shijie
On Thu, Sep 2, 2021 at 5:31 AM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, Sep 1, 2021 at 10:24 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > But what you could do, if you wanted to, would be to catch the
> > situation where you have lots of expensive NUMA accesses either using
> > our VM infrastructure or performance counters, and when the mapping is
> > a MAP_PRIVATE you just do a COW fault on them.
> >
> > Sounds entirely doable, and has absolutely nothing to do with the page
> > cache. It would literally just be an "over-eager COW fault triggered
> > by NUMA access counters".
>
> Note how it would work perfectly fine for anonymous mappings too. Just
> to reinforce the point that this has nothing to do with any page cache
> issues.
>
> Of course, if you want to actually then *share* pages within a node
> (rather than replicate them for each process), that gets more
> exciting.
>
> But I suspect that this is mainly only useful for long-running big
> processes (not least due to that node binding thing), so I question
> the need for that kind of excitement.
In Linux server scenarios, it is quite common to have long-running big
processes constantly running on one machine, for example web servers,
databases, etc. This kind of process can span a couple of NUMA nodes,
using all the CPUs in a server to achieve maximum throughput.
SGI/HPE has a numatool with command "dplace" to help deploy processes
with replicated text in either libraries or binary (a.out) [1]:
dplace [-e] [-c cpu_numbers] [-s skip_count] [-n process_name] \
[-x skip_mask] [-r [l|b|t]] [-o log_file] [-v 1|2] \
command [command-args]
The dplace command accepts the following options:
...
-r: Specifies that text should be replicated on the node or nodes
where the application is running. In some cases, replication will
improve performance by reducing the need to make off-node memory
references for code. The replication option applies to all programs
placed by the dplace command. See the dplace man page for additional
information on text replication. The replication options are a string
of one or more of the following characters:
l - Replicate library text
b - Replicate binary (a.out) text
t - Thread round-robin option
On the other hand, it would also be interesting to investigate whether
kernel text replication can help improve performance. MIPS does have
REPLICATE_KTEXT support in the kernel:
config REPLICATE_KTEXT
	bool "Kernel text replication support"
	depends on SGI_IP27
	select MAPPED_KERNEL
	help
	  Say Y here to enable replicating the kernel text across multiple
	  nodes in a NUMA cluster. This trades memory for speed.
Not quite sure how it would benefit x86 and arm64, though it seems
concurrent-rt has a solution and benchmark data for RedHawk Linux [2].
[1] http://www.nacad.ufrj.br/online/sgi/007-5646-002/sgi_html/ch05.html
[2] https://www.concurrent-rt.com/wp-content/uploads/2016/11/kernel-page-replication.pdf
>
> Linus
Thanks
Barry
On Thu, Sep 2, 2021 at 12:00 PM Matthew Wilcox <[email protected]> wrote:
>
> On Wed, Sep 01, 2021 at 02:25:34PM +0000, Huang Shijie wrote:
> > On Wed, Sep 01, 2021 at 01:30:45PM +0000, Huang Shijie wrote:
> > > On Wed, Sep 01, 2021 at 04:25:01AM +0100, Matthew Wilcox wrote:
> > > > On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> > > > > On NUMA systems, we only have one page cache for each file. For
> > > > > programs/shared libraries, remote access has a longer delay than
> > > > > local access.
> > > > >
> > > > > So, is it possible to implement a per-node page cache for
> > > > > programs/libraries?
> > > >
> > > > At this point, we have no way to support text replication within a
> > > > process. So what you're suggesting (if implemented) would work for
> > >
> > > I created a glibc patch which can do the text replication within a process.
> > The "text replication" means the shared libraries, not program itself.
>
> Thinking about it some more, if you're ok with it only being shared
> libraries, you can do this:
>
> for i in `seq 0 3`; do \
> cp --reflink=always /lib/x86_64-linux-gnu/libc.so.6 \
> /lib/x86_64-linux-gnu/libc.so.6.numa$i; \
> done
>
> Reflinked files don't share page cache, so you can do this all in
> userspace with no kernel changes.
Not quite sure I catch your point. In case we are running MySQL on a
machine with 128 cores (4 NUMA nodes, 32 cores per node), how will the
reflink help the single MySQL process leverage its local libc copy?
Thanks
Barry
On Wed, Sep 01, 2021 at 02:25:34PM +0000, Huang Shijie wrote:
> On Wed, Sep 01, 2021 at 01:30:45PM +0000, Huang Shijie wrote:
> > On Wed, Sep 01, 2021 at 04:25:01AM +0100, Matthew Wilcox wrote:
> > > On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> > > > On NUMA systems, we only have one page cache for each file. For
> > > > programs/shared libraries, remote access has a longer delay than
> > > > local access.
> > > >
> > > > So, is it possible to implement a per-node page cache for
> > > > programs/libraries?
> > >
> > > At this point, we have no way to support text replication within a
> > > process. So what you're suggesting (if implemented) would work for
> >
> > I created a glibc patch which can do the text replication within a process.
> The "text replication" means the shared libraries, not program itself.
Thinking about it some more, if you're ok with it only being shared
libraries, you can do this:
for i in `seq 0 3`; do \
cp --reflink=always /lib/x86_64-linux-gnu/libc.so.6 \
/lib/x86_64-linux-gnu/libc.so.6.numa$i; \
done
Reflinked files don't share page cache, so you can do this all in
userspace with no kernel changes.
On Wed, Sep 1, 2021 at 5:15 PM Barry Song <[email protected]> wrote:
>
> In case we are running MySQL on a machine with 128 cores
> (4 NUMA nodes, 32 cores per node), how will the reflink help the single
> MySQL process leverage its local libc copy?
That's a fundamentally harder problem anyway, and for the foreseeable
future you should expect the answer to that to be "Not a way in hell".
Because it's not about "local libc copies" at that point any more,
it's about "a single process only has a single page table".
So a single process will have a particular virtual address mapped to
*one* physical page. And no, it doesn't matter how many threads you
have. What makes them threads - not processes - is that they share the
same VM image.
So the only way you will have local NUMA copies is if you
(a) run multiple processes
(b) bind each process to a particular NUMA node
(c) do something special to then have per-node mappings
That "(c)" is what is up for discussion, whether it be with various
user mode hacks, or the "NUMA COW" thing, or whatever.
But (a) and (b) are basically required.
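For what it's worth, (b) is already easy to do from user space (it is
essentially what numactl --cpunodebind/--preferred does). A minimal
launcher sketch using libnuma (link with -lnuma); both the CPU binding
and the memory policy survive the exec:

#include <numa.h>	/* link with -lnuma */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
		return 1;
	}
	if (numa_available() < 0) {
		fprintf(stderr, "NUMA is not available here\n");
		return 1;
	}
	/* (b): run on node 0's CPUs and prefer node 0 for allocations. */
	if (numa_run_on_node(0) != 0)
		perror("numa_run_on_node");
	numa_set_preferred(0);

	execvp(argv[1], argv + 1);	/* run the real workload node-bound */
	perror("execvp");
	return 1;
}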
Linus
Hi Linus,
On Wed, Sep 01, 2021 at 10:29:01AM -0700, Linus Torvalds wrote:
> On Wed, Sep 1, 2021 at 10:24 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > But what you could do, if you wanted to, would be to catch the
> > situation where you have lots of expensive NUMA accesses either using
> > our VM infrastructure or performance counters, and when the mapping is
> > a MAP_PRIVATE you just do a COW fault on them.
> >
> > Sounds entirely doable, and has absolutely nothing to do with the page
> > cache. It would literally just be an "over-eager COW fault triggered
> > by NUMA access counters".
Yes. You are right, we can use COW. :)
Actually we have _TWO_ levels at which to optimize NUMA remote access:
1.) the page cache, which is independent of the process.
2.) the process address space (page tables).
For 2.), we can use the over-eager COW:
2.1) I have finished a user patch for glibc which uses "over-eager COW" to do the text
replication on NUMA.
2.2) Also a kernel patch uses the "over-eager COW" to do the replication for
the program itself on NUMA. (We may split that off into another topic.)
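For reference, the core of such a user-space "over-eager COW" boils down
to something like the sketch below. replicate_text() is just a name made
up here; a real glibc/ld.so patch also has to find the text segments
(e.g. via dl_iterate_phdr()) and cope with W^X policies and concurrently
running threads:

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Force private, node-local copies of a MAP_PRIVATE text range
 * [addr, addr + len) (addr must be page-aligned): make it writable,
 * dirty every page so the kernel breaks COW and allocates fresh pages
 * under the caller's current NUMA policy, then restore PROT_EXEC.
 */
static int replicate_text(void *addr, size_t len)
{
	size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
	volatile uint8_t *p = addr;
	size_t off;

	if (mprotect(addr, len, PROT_READ | PROT_WRITE) != 0)
		return -1;
	for (off = 0; off < len; off += pagesize)
		p[off] = p[off];	/* the write triggers the COW */
	return mprotect(addr, len, PROT_READ | PROT_EXEC);
}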
>
> Note how it would work perfectly fine for anonymous mappings too. Just
> to reinforce the point that this has nothing to do with any page cache
> issues.
>
> Of course, if you want to actually then *share* pages within a node
> (rather than replicate them for each process), that gets more
> exciting.
Do we really need to change the page cache?
The 2.1) above produces one private copy of the shared library pages for each
process, e.g. for libc.so.
Even on the same NUMA node 0, we may run two instances of the same process,
so it now produces two copies of libc.so.
If we run 5 instances on NUMA node 0, it will produce five copies of libc.so.
But if we have a per-node page cache for libc.so, we can do it like this:
(1) Disable the "over-eager COW" in the process.
(2) Hand the per-node page cache's pages out to the different processes on the
_SAME_ NUMA node, so all the processes on the same NUMA node share the same pages.
(3) Processes on other NUMA nodes use the pages belonging to their own node.
This way, we can save many pages and still get local access speed on NUMA.
Thanks
Huang Shijie
On Thu, Sep 02, 2021 at 10:56:20AM +1200, Barry Song wrote:
> Not quite sure how it will benefit X86 and ARM64 though it seems concurrent-rt
> has some solution and benchmark data in RedHawk Linux[2].
>
> [1] http://www.nacad.ufrj.br/online/sgi/007-5646-002/sgi_html/ch05.html
> [2] https://www.concurrent-rt.com/wp-content/uploads/2016/11/kernel-page-replication.pdf
Thanks for sharing this.
Thanks
Huang Shijie
On Thu, Sep 02, 2021 at 12:58:02AM +0100, Matthew Wilcox wrote:
> On Wed, Sep 01, 2021 at 02:25:34PM +0000, Huang Shijie wrote:
> > On Wed, Sep 01, 2021 at 01:30:45PM +0000, Huang Shijie wrote:
> > > On Wed, Sep 01, 2021 at 04:25:01AM +0100, Matthew Wilcox wrote:
> > > > On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
> > > > > On NUMA systems, we only have one page cache for each file. For
> > > > > programs/shared libraries, remote access has a longer delay than
> > > > > local access.
> > > > >
> > > > > So, is it possible to implement a per-node page cache for
> > > > > programs/libraries?
> > > >
> > > > At this point, we have no way to support text replication within a
> > > > process. So what you're suggesting (if implemented) would work for
> > >
> > > I created a glibc patch which can do the text replication within a process.
> > The "text replication" means the shared libraries, not program itself.
>
> Thinking about it some more, if you're ok with it only being shared
> libraries, you can do this:
>
> for i in `seq 0 3`; do \
> cp --reflink=always /lib/x86_64-linux-gnu/libc.so.6 \
> /lib/x86_64-linux-gnu/libc.so.6.numa$i; \
> done
>
> Reflinked files don't share page cache, so you can do this all in
> userspace with no kernel changes.
This is not graceful enough :)
And customers may not accept it.
For the shared libraries, it is better to change glibc/ld.so.
For the program itself, it is better to change the Linux kernel.
Thanks
Huang Shijie
Excerpts from Matthew Wilcox's message of September 1, 2021 1:25 pm:
> On Wed, Sep 01, 2021 at 11:07:41AM +0800, Shijie Huang wrote:
>> On NUMA systems, we only have one page cache for each file. For
>> programs/shared libraries, remote access has a longer delay than
>> local access.
>>
>> So, is it possible to implement a per-node page cache for
>> programs/libraries?
>
> At this point, we have no way to support text replication within a
> process. So what you're suggesting (if implemented) would work for
> processes which limit themselves to a single node. That is, if you
> have a system with CPUs 0-3 on node 0 and CPUs 4-7 on node 1, a process
> which only works on node 0 or only works on node 1 will get text on the
> appropriate node.
>
> If there's a process which runs on both nodes 0 and 1, there's no support
> for per-node PGDs. So it will get a mix of pages from nodes 0 and 1,
> and that doesn't necessarily seem like a big win. I haven't yet dived
> into how hard it would be to make mm->pgd a per-node allocation.
>
> I have been thinking about this a bit; one of our internal performance
> teams flagged the potential performance win to me a few months ago.
> I don't have a concrete design for text replication yet; there have been
> various attempts over the years, but none were particularly compelling.
What was not compelling about it?
https://lists.openwall.net/linux-kernel/2007/07/27/112
What are the other attempts?
Thanks,
Nick
On Thu, Sep 02, 2021 at 01:25:36PM +1000, Nicholas Piggin wrote:
> > I have been thinking about this a bit; one of our internal performance
> > teams flagged the potential performance win to me a few months ago.
> > I don't have a concrete design for text replication yet; there have been
> > various attempts over the years, but none were particularly compelling.
>
> What was not compelling about it?
It wasn't merged, so clearly it wasn't compelling enough?
> https://lists.openwall.net/linux-kernel/2007/07/27/112
>
> What are the other attempts?
I found one from Dave Hansen in 2003:
https://lwn.net/Articles/45082/
I think somebody else may have posted a different one, but I don't
remember now.
Excerpts from Matthew Wilcox's message of September 2, 2021 8:17 pm:
> On Thu, Sep 02, 2021 at 01:25:36PM +1000, Nicholas Piggin wrote:
>> > I have been thinking about this a bit; one of our internal performance
>> > teams flagged the potential performance win to me a few months ago.
>> > I don't have a concrete design for text replication yet; there have been
>> > various attempts over the years, but none were particularly compelling.
>>
>> What was not compelling about it?
>
> It wasn't merged, so clearly it wasn't compelling enough?
Ha ha. It sounded like you had some reasons you didn't find it
particularly compelling :P
>
>> https://lists.openwall.net/linux-kernel/2007/07/27/112
>>
>> What are the other attempts?
>
> I found one from Dave Hansen in 2003:
>
> https://lwn.net/Articles/45082/
>
Huh interesting. I'd be surprised if I didn't see it go by at the time.
Thanks,
Nick
On Fri, Sep 03, 2021 at 05:10:31PM +1000, Nicholas Piggin wrote:
> Excerpts from Matthew Wilcox's message of September 2, 2021 8:17 pm:
> > On Thu, Sep 02, 2021 at 01:25:36PM +1000, Nicholas Piggin wrote:
> >> > I have been thinking about this a bit; one of our internal performance
> >> > teams flagged the potential performance win to me a few months ago.
> >> > I don't have a concrete design for text replication yet; there have been
> >> > various attempts over the years, but none were particularly compelling.
> >>
> >> What was not compelling about it?
> >
> > It wasn't merged, so clearly it wasn't compelling enough?
>
> Ha ha. It sounded like you had some reasons you didn't find it
> particularly compelling :P
I haven't studied it in detail, but it seems to me that your patch (from
2007!) chooses whether to store pages or pcache_desc pointers in i_pages.
Was there a reason you chose to do it that way instead of having per-node
i_mapping pointers? (And which way would you choose to do it now, given
the infrastructure we have now?)
On Fri, Sep 3, 2021 at 12:02 PM Matthew Wilcox <[email protected]> wrote:
>
> Was there a reason you chose to do it that way instead of having per-node
> i_mapping pointers?
You can't have per-node i_mapping pointers without huge coherence issues.
If you don't care about coherence, that's fine - but that has to be a
user-space decision (ie "I will just replicate this file").
You can't just have the kernel decide "I'll map this set of pages on
this node, and that other set of pages on that other node", in case
there's MAP_SHARED things going on.
Anyway, I think very fundamentally this is one of those things where
99.9% of all people don't care, and DO NOT WANT the complexity.
And the 0.1% that _does_ care really could and should do this in user
space, because they know they care.
Asking the kernel to do complex things in critical core functions for
something that is very very rare and irrelevant to most people, and
that can and should just be done in user space for the people who care
is the wrong approach.
Because the question here really should be "is this truly important,
and does this need kernel help because user space simply cannot do it
itself".
And the answer is a fairly simple "no".
Linus
Excerpts from Matthew Wilcox's message of September 4, 2021 5:01 am:
> On Fri, Sep 03, 2021 at 05:10:31PM +1000, Nicholas Piggin wrote:
>> Excerpts from Matthew Wilcox's message of September 2, 2021 8:17 pm:
>> > On Thu, Sep 02, 2021 at 01:25:36PM +1000, Nicholas Piggin wrote:
>> >> > I have been thinking about this a bit; one of our internal performance
>> >> > teams flagged the potential performance win to me a few months ago.
>> >> > I don't have a concrete design for text replication yet; there have been
>> >> > various attempts over the years, but none were particularly compelling.
>> >>
>> >> What was not compelling about it?
>> >
>> > It wasn't merged, so clearly it wasn't compelling enough?
>>
>> Ha ha. It sounded like you had some reasons you didn't find it
>> particularly compelling :P
>
> I haven't studied it in detail, but it seems to me that your patch (from
> 2007!) chooses whether to store pages or pcache_desc pointers in i_pages.
> Was there a reason you chose to do it that way instead of having per-node
> i_mapping pointers?
What Linus said. The patch was obviously mechanism only, and more
heuristics would be needed (in that case you could have per-inode
hints or whatever).
> (And which way would you choose to do it now, given
> the infrastructure we have now?)
I'm not aware of anything new that would change it fundamentally.
Thanks,
Nick
Hi Linus,
On Fri, Sep 03, 2021 at 12:08:03PM -0700, Linus Torvalds wrote:
> On Fri, Sep 3, 2021 at 12:02 PM Matthew Wilcox <[email protected]> wrote:
> >
> > Was there a reason you chose to do it that way instead of having per-node
> > i_mapping pointers?
>
> You can't have per-node i_mapping pointers without huge coherence issues.
>
> If you don't care about coherence, that's fine - but that has to be a
> user-space decision (ie "I will just replicate this file").
>
> You can't just have the kernel decide "I'll map this set of pages on
> this node, and that other set of pages on that other node", in case
> there's MAP_SHARED things going on.
>
> Anyway, I think very fundamentally this is one of those things where
> 99.9% of all people don't care, and DO NOT WANT the complexity.
>
> And the 0.1% that _does_ care really could and should do this in user
> space, because they know they care.
>
> Asking the kernel to do complex things in critical core functions for
> something that is very very rare and irrelevant to most people, and
> that can and should just be done in user space for the people who care
> is the wrong approach.
>
> Because the question here really should be "is this truly important,
> and does this need kernel help because user space simply cannot do it
> itself".
>
> And the answer is a fairly simple "no".
Okay.
Thanks for confirming this.
Thanks
Huang Shijie