2019-10-17 13:27:09

by Dave Hansen

Subject: [RFC] Memory Tiering

The memory hierarchy is getting more complicated and the kernel is
playing an increasing role in managing the different tiers. A few
different groups of folks described "migration" optimizations they were
doing in this area at LSF/MM earlier this year. One of the questions
folks asked was why autonuma wasn't being used.

At Intel, the primary new tier that we're looking at is persistent
memory (PMEM). We'd like to be able to use "persistent memory"
*without* using its persistence properties, treating it as slightly
slower DRAM. Keith Busch has some patches to use NUMA migration to
automatically migrate DRAM->PMEM instead of discarding it near the end
of the reclaim process. Huang Ying has some patches which use a
modified autonuma to migrate frequently-used data *back* from PMEM->DRAM.

We've tried to do this all generically so that it is not tied to
persistent memory and can be applied to any memory types in lots of
topologies.

We've been running this code in various forms for the past few months,
comparing it to pure DRAM and hardware-based caching. The initial
results are encouraging and we thought others might want to take a look
at the code or run their own experiments. We're expecting to post the
individual patches soon. But, until then, the code is available here:

https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git

and is tagged with "tiering-0.2", aka. d8e31e81b1dca9.

Note that internally folks have been calling this "hmem" which is
terribly easy to confuse with the existing hmm. There are still some
"hmem"'s in the tree, but I don't expect them to live much longer.


2019-10-18 10:14:54

by David Hildenbrand

Subject: Re: [RFC] Memory Tiering

On 16.10.19 22:05, Dave Hansen wrote:
> The memory hierarchy is getting more complicated and the kernel is
> playing an increasing role in managing the different tiers. A few
> different groups of folks described "migration" optimizations they were
> doing in this area at LSF/MM earlier this year. One of the questions
> folks asked was why autonuma wasn't being used.
>
> At Intel, the primary new tier that we're looking at is persistent
> memory (PMEM). We'd like to be able to use "persistent memory"
> *without* using its persistence properties, treating it as slightly
> slower DRAM. Keith Busch has some patches to use NUMA migration to
> automatically migrate DRAM->PMEM instead of discarding it near the end
> of the reclaim process. Huang Ying has some patches which use a
> modified autonuma to migrate frequently-used data *back* from PMEM->DRAM.

Very interesting topic. I heard similar demand from HPC folks
(especially involving other memory types ("tiers")). There, I think you
often want to let the application manage that. But of course, for many
applications automatic management might already be beneficial.

Am I correct that you are using PMEM in this area along with ZONE_DEVICE
and not by giving PMEM to the buddy (add_memory())?

>
> We've tried to do this all generically so that it is not tied to
> persistent memory and can be applied to any memory types in lots of
> topologies.
>
> We've been running this code in various forms for the past few months,
> comparing it to pure DRAM and hardware-based caching. The initial
> results are encouraging and we thought others might want to take a look
> at the code or run their own experiments. We're expecting to post the
> individual patches soon. But, until then, the code is available here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git
>
> and is tagged with "tiering-0.2", aka. d8e31e81b1dca9.
>
> Note that internally folks have been calling this "hmem" which is
> terribly easy to confuse with the existing hmm. There are still some
> "hmem"'s in the tree, but I don't expect them to live much longer.
>


--

Thanks,

David / dhildenb

2019-10-18 16:19:24

by Dave Hansen

Subject: Re: [RFC] Memory Tiering

On 10/17/19 1:07 AM, David Hildenbrand wrote:
> Very interesting topic. I heard similar demand from HPC folks
> (especially involving other memory types ("tiers")). There, I think
> you often want to let the application manage that. But of course, for
> many applications an automatic management might already be
> beneficial.
>
> Am I correct that you are using PMEM in this area along with
> ZONE_DEVICE and not by giving PMEM to the buddy (add_memory())?

The PMEM starts out as ZONE_DEVICE, but we unbind it from its original
driver and bind it to this stub of a "driver": drivers/dax/kmem.c which
uses add_memory() on it.

There's some nice tooling inside the daxctl component of ndctl to do all
the sysfs magic to make this happen.

2019-10-18 21:22:34

by Verma, Vishal L

Subject: Re: [RFC] Memory Tiering


On Thu, 2019-10-17 at 07:17 -0700, Dave Hansen wrote:
> On 10/17/19 1:07 AM, David Hildenbrand wrote:
> > Very interesting topic. I heard similar demand from HPC folks
> > (especially involving other memory types ("tiers")). There, I think
> > you often want to let the application manage that. But of course, for
> > many applications an automatic management might already be
> > beneficial.
> >
> > Am I correct that you are using PMEM in this area along with
> > ZONE_DEVICE and not by giving PMEM to the buddy (add_memory())?
>
> The PMEM starts out as ZONE_DEVICE, but we unbind it from its original
> driver and bind it to this stub of a "driver": drivers/dax/kmem.c which
> uses add_memory() on it.
>
> There's some nice tooling inside the daxctl component of ndctl to do all
> the sysfs magic to make this happen.
>
Here is more info about the daxctl command in question:

https://pmem.io/ndctl/daxctl-reconfigure-device.html
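For reference, the reconfiguration described there looks roughly like this (the device name `dax0.0` is just an example; exact paths and options may vary by ndctl version and kernel):

```shell
# Rebind a device-dax instance to the kmem driver and hotplug it as
# system RAM -- this is the "sysfs magic" that daxctl automates:
daxctl reconfigure-device --mode=system-ram dax0.0

# Approximately equivalent manual sysfs steps:
echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind
echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id
```

After this, the PMEM shows up as an ordinary (CPU-less) NUMA node managed by the buddy allocator.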

2019-10-18 21:49:47

by David Hildenbrand

Subject: Re: [RFC] Memory Tiering

On 17.10.19 19:07, Verma, Vishal L wrote:
>
> On Thu, 2019-10-17 at 07:17 -0700, Dave Hansen wrote:
>> On 10/17/19 1:07 AM, David Hildenbrand wrote:
>>> Very interesting topic. I heard similar demand from HPC folks
>>> (especially involving other memory types ("tiers")). There, I think
>>> you often want to let the application manage that. But of course, for
>>> many applications an automatic management might already be
>>> beneficial.
>>>
>>> Am I correct that you are using PMEM in this area along with
>>> ZONE_DEVICE and not by giving PMEM to the buddy (add_memory())?
>>
>> The PMEM starts out as ZONE_DEVICE, but we unbind it from its original
>> driver and bind it to this stub of a "driver": drivers/dax/kmem.c which
>> uses add_memory() on it.
>>
>> There's some nice tooling inside the daxctl component of ndctl to do all
>> the sysfs magic to make this happen.
>>
> Here is more info about the daxctl command in question:
>
> https://pmem.io/ndctl/daxctl-reconfigure-device.html
>

Thanks, yeah, I saw the patches back then (I thought they were by Pavel,
but they were actually by you :) ) to add the memory to the buddy (via
add_memory()).

Will explore some more, thanks!

--

Thanks,

David / dhildenb

2019-10-24 16:46:53

by Jonathan Adams

Subject: Re: [RFC] Memory Tiering

On Wed, Oct 16, 2019 at 1:05 PM Dave Hansen <[email protected]> wrote:
>
> The memory hierarchy is getting more complicated and the kernel is
> playing an increasing role in managing the different tiers. A few
> different groups of folks described "migration" optimizations they were
> doing in this area at LSF/MM earlier this year. One of the questions
> folks asked was why autonuma wasn't being used.
>
> At Intel, the primary new tier that we're looking at is persistent
> memory (PMEM). We'd like to be able to use "persistent memory"
> *without* using its persistence properties, treating it as slightly
> slower DRAM. Keith Busch has some patches to use NUMA migration to
> automatically migrate DRAM->PMEM instead of discarding it near the end
> of the reclaim process. Huang Ying has some patches which use a
> modified autonuma to migrate frequently-used data *back* from PMEM->DRAM.
>
> We've tried to do this all generically so that it is not tied to
> persistent memory and can be applied to any memory types in lots of
> topologies.
>
> We've been running this code in various forms for the past few months,
> comparing it to pure DRAM and hardware-based caching. The initial
> results are encouraging and we thought others might want to take a look
> at the code or run their own experiments. We're expecting to post the
> individual patches soon. But, until then, the code is available here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git
>
> and is tagged with "tiering-0.2", aka. d8e31e81b1dca9.

Hi Dave,

Thanks for sharing this git link and information on your approach.
This is interesting, and lines up somewhat with the approach Google has
been investigating. As we discussed at LSF/MM[1] and Linux
Plumbers[2], we're working on an approach which integrates with our
proactive reclaim work, with a similar attitude to PMEM (use it as
"slightly slower" DRAM, ignoring its persistence). The prototype we
have has a similar basic structure to what you're doing here and Yang
Shi's patchset from March[3] (separate NUMA nodes for PMEM), but
relies on a fair amount of kernel changes to control allocations from
the NUMA nodes, and uses a similar "is_far" NUMA flag to Yang Shi's
approach.

We're working on redesigning to reduce the scope of kernel changes and
to remove the "is_far" special handling; we still haven't refined
down to a final approach, but one basic part we want to keep from the
prototype is proactively pushing PMEM data back to DRAM when we've
noticed it's in use. If we look at a two-socket system:

A: DRAM & CPU node for socket 0
B: PMEM node for socket 0
C: DRAM & CPU node for socket 1
D: PMEM node for socket 1

instead of the unidirectional approach your patches go for:

A is marked as "in reclaim, push pages to" B
C is marked as "in reclaim, push pages to" D
B & D have no markings

we would have a bidirectional attachment:

A is marked "move cold pages to" B
B is marked "move hot pages to" A
C is marked "move cold pages to" D
D is marked "move hot pages to" C

By using autonuma for moving PMEM pages back to DRAM, you avoid
needing the B->A & D->C links, at the cost of migrating the pages
back synchronously at pagefault time (assuming my understanding of how
autonuma works is accurate).

Our approach still lets you have multiple levels of hierarchy for a
given socket (you could imagine an "E" node with the same relation to
"B" as "B" has to "A"), but doesn't make it easy to represent (say) an
"E" which was equally close to all sockets (which I could imagine for
something like remote memory on GenZ or what-have-you), since there
wouldn't be a single back link; there would need to be something like
your autonuma support to achieve that.
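For what it's worth, the difference between the two link schemes can be put in a toy model (node names A-D as above; `demote_to`/`promote_to` are hypothetical labels for illustration, not actual kernel fields):

```python
# Toy model of the two link schemes for a two-socket DRAM+PMEM system.
# A/C are the DRAM+CPU nodes, B/D the corresponding PMEM nodes.

# Unidirectional (reclaim-time demotion only): B and D have no back
# link, so promotion has to come from somewhere else (e.g. autonuma).
unidirectional = {
    "A": {"demote_to": "B", "promote_to": None},
    "B": {"demote_to": None, "promote_to": None},
    "C": {"demote_to": "D", "promote_to": None},
    "D": {"demote_to": None, "promote_to": None},
}

# Bidirectional: each PMEM node also knows where its hot pages go.
bidirectional = {
    "A": {"demote_to": "B", "promote_to": None},
    "B": {"demote_to": None, "promote_to": "A"},
    "C": {"demote_to": "D", "promote_to": None},
    "D": {"demote_to": None, "promote_to": "C"},
}

def promotion_target(links, node):
    """Where a hot page on `node` would be pushed, or None."""
    return links[node]["promote_to"]

print(promotion_target(unidirectional, "B"))  # None: needs autonuma
print(promotion_target(bidirectional, "B"))   # A
```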

Does that make sense?

Thanks,
- Jonathan

[1] Shakeel's talk, I can't find a link at the moment. The basic
kstaled/kreclaimd approach we built upon is talked about in
https://blog.acolyer.org/2019/05/22/sw-far-memory/ and the linked
ASPLOS paper
[2] https://linuxplumbersconf.org/event/4/contributions/561/; slides
at https://linuxplumbersconf.org/event/4/contributions/561/attachments/363/596/Persistent_Memory_as_Memory.pdf
[3] https://lkml.org/lkml/2019/3/23/10

2019-10-25 14:27:10

by Yang Shi

Subject: Re: [RFC] Memory Tiering

On Wed, Oct 23, 2019 at 4:12 PM Jonathan Adams <[email protected]> wrote:
>
> On Wed, Oct 16, 2019 at 1:05 PM Dave Hansen <[email protected]> wrote:
> >
> > The memory hierarchy is getting more complicated and the kernel is
> > playing an increasing role in managing the different tiers. A few
> > different groups of folks described "migration" optimizations they were
> > doing in this area at LSF/MM earlier this year. One of the questions
> > folks asked was why autonuma wasn't being used.
> >
> > At Intel, the primary new tier that we're looking at is persistent
> > memory (PMEM). We'd like to be able to use "persistent memory"
> > *without* using its persistence properties, treating it as slightly
> > slower DRAM. Keith Busch has some patches to use NUMA migration to
> > automatically migrate DRAM->PMEM instead of discarding it near the end
> > of the reclaim process. Huang Ying has some patches which use a
> > modified autonuma to migrate frequently-used data *back* from PMEM->DRAM.
> >
> > We've tried to do this all generically so that it is not tied to
> > persistent memory and can be applied to any memory types in lots of
> > topologies.
> >
> > We've been running this code in various forms for the past few months,
> > comparing it to pure DRAM and hardware-based caching. The initial
> > results are encouraging and we thought others might want to take a look
> > at the code or run their own experiments. We're expecting to post the
> > individual patches soon. But, until then, the code is available here:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git
> >
> > and is tagged with "tiering-0.2", aka. d8e31e81b1dca9.
>
> Hi Dave,
>
> Thanks for sharing this git link and information on your approach.
> This is interesting, and lines up somewhat with the approach Google has
> been investigating. As we discussed at LSF/MM[1] and Linux
> Plumbers[2], we're working on an approach which integrates with our
> proactive reclaim work, with a similar attitude to PMEM (use it as
> "slightly slower" DRAM, ignoring its persistence). The prototype we
> have has a similar basic structure to what you're doing here and Yang
> Shi's patchset from March[3] (separate NUMA nodes for PMEM), but
> relies on a fair amount of kernel changes to control allocations from
> the NUMA nodes, and uses a similar "is_far" NUMA flag to Yang Shi's
> approach.
>
> We're working on redesigning to reduce the scope of kernel changes and
> to remove the "is_far" special handling; we still haven't refined
> down to a final approach, but one basic part we want to keep from the
> prototype is proactively pushing PMEM data back to DRAM when we've
> noticed it's in use. If we look at a two-socket system:
>
> A: DRAM & CPU node for socket 0
> B: PMEM node for socket 0
> C: DRAM & CPU node for socket 1
> D: PMEM node for socket 1
>
> instead of the unidirectional approach your patches go for:
>
> A is marked as "in reclaim, push pages to" B
> C is marked as "in reclaim, push pages to" D
> B & D have no markings
>
> we would have a bidirectional attachment:
>
> A is marked "move cold pages to" B
> B is marked "move hot pages to" A
> C is marked "move cold pages to" D
> D is marked "move hot pages to" C
>
> By using autonuma for moving PMEM pages back to DRAM, you avoid
> needing the B->A & D->C links, at the cost of migrating the pages
> back synchronously at pagefault time (assuming my understanding of how
> autonuma works is accurate).
>
> Our approach still lets you have multiple levels of hierarchy for a
> given socket (you could imagine an "E" node with the same relation to
> "B" as "B" has to "A"), but doesn't make it easy to represent (say) an
> "E" which was equally close to all sockets (which I could imagine for
> something like remote memory on GenZ or what-have-you), since there
> wouldn't be a single back link; there would need to be something like
> your autonuma support to achieve that.

I don't quite get why you want to achieve this, or what use case drives
it. With this approach, pages can only be promoted from B to A or from
D to C, is that correct? If A accesses the pages on D, the page would
have to be migrated twice: D -> C -> A.

NUMA balancing, on the other hand, migrates the page to the node of the
CPU that did the access; it uses a trick (requiring two accesses) to
make sure the association is stable. So it does just one migration.
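To make the two-hop concern concrete, here is a toy sketch (hypothetical helper names; the real code works on struct page/PTE state, not strings):

```python
# Hot-page promotion when each PMEM node only has a back link to its
# own socket's DRAM node, vs. NUMA-balancing-style migration straight
# to the accessing CPU's node.
PROMOTE_TO = {"B": "A", "D": "C"}   # PMEM -> local DRAM back links

def promote_via_back_links(page_node, accessor_node):
    """Follow the back link first; cross-socket placement needs a
    second hop."""
    hops = []
    node = PROMOTE_TO.get(page_node)    # e.g. D -> C
    if node is not None:
        hops.append(node)
    if node != accessor_node:           # still remote from the accessor
        hops.append(accessor_node)      # e.g. C -> A
    return hops

def promote_via_numa_balancing(page_node, accessor_node):
    """Autonuma moves the page directly to the accessing CPU's node."""
    return [accessor_node]

# A CPU on node A touches a page sitting on node D:
print(promote_via_back_links("D", "A"))      # ['C', 'A']: 2 migrations
print(promote_via_numa_balancing("D", "A"))  # ['A']: 1 migration
```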

>
> Does that make sense?
>
> Thanks,
> - Jonathan
>
> [1] Shakeel's talk, I can't find a link at the moment. The basic
> kstaled/kreclaimd approach we built upon is talked about in
> https://blog.acolyer.org/2019/05/22/sw-far-memory/ and the linked
> ASPLOS paper
> [2] https://linuxplumbersconf.org/event/4/contributions/561/; slides
> at https://linuxplumbersconf.org/event/4/contributions/561/attachments/363/596/Persistent_Memory_as_Memory.pdf
> [3] https://lkml.org/lkml/2019/3/23/10
>

2019-10-25 18:11:04

by Dave Hansen

Subject: Re: [RFC] Memory Tiering

On 10/23/19 4:11 PM, Jonathan Adams wrote:
> we would have a bidirectional attachment:
>
> A is marked "move cold pages to" B
> B is marked "move hot pages to" A
> C is marked "move cold pages to" D
> D is marked "move hot pages to" C
>
> By using autonuma for moving PMEM pages back to DRAM, you avoid
> needing the B->A & D->C links, at the cost of migrating the pages
> back synchronously at pagefault time (assuming my understanding of how
> autonuma works is accurate).
>
> Our approach still lets you have multiple levels of hierarchy for a
> given socket (you could imagine an "E" node with the same relation to
> "B" as "B" has to "A"), but doesn't make it easy to represent (say) an
> "E" which was equally close to all sockets (which I could imagine for
> something like remote memory on GenZ or what-have-you), since there
> wouldn't be a single back link; there would need to be something like
> your autonuma support to achieve that.
>
> Does that make sense?

Yes, it does. We've actually tried a few other approaches separate from
autonuma-based ones for promotion. For some of those, we have a
promotion path which is separate from the demotion path.

That said, I took a quick look to see what the autonuma behavior was and
couldn't find anything obvious. Ying, when moving a slow page due to
autonuma, do we move it close to the CPU that did the access, or do we
promote it to the DRAM close to the slow memory where it is now?

2019-10-25 19:23:37

by Huang, Ying

Subject: Re: [RFC] Memory Tiering

Dave Hansen <[email protected]> writes:

> On 10/23/19 4:11 PM, Jonathan Adams wrote:
>> we would have a bidirectional attachment:
>>
>> A is marked "move cold pages to" B
>> B is marked "move hot pages to" A
>> C is marked "move cold pages to" D
>> D is marked "move hot pages to" C
>>
>> By using autonuma for moving PMEM pages back to DRAM, you avoid
>> needing the B->A & D->C links, at the cost of migrating the pages
>> back synchronously at pagefault time (assuming my understanding of how
>> autonuma works is accurate).
>>
>> Our approach still lets you have multiple levels of hierarchy for a
>> given socket (you could imagine an "E" node with the same relation to
>> "B" as "B" has to "A"), but doesn't make it easy to represent (say) an
>> "E" which was equally close to all sockets (which I could imagine for
>> something like remote memory on GenZ or what-have-you), since there
>> wouldn't be a single back link; there would need to be something like
>> your autonuma support to achieve that.
>>
>> Does that make sense?
>
> Yes, it does. We've actually tried a few other approaches separate from
> autonuma-based ones for promotion. For some of those, we have a
> promotion path which is separate from the demotion path.
>
> That said, I took a quick look to see what the autonuma behavior was and
> couldn't find anything obvious. Ying, when moving a slow page due to
> autonuma, do we move it close to the CPU that did the access, or do we
> promote it to the DRAM close to the slow memory where it is now?

Currently in autonuma, the slow page will be moved to the node of the
CPU that did the access. So I think Jonathan's requirement is already
covered.

Best Regards,
Huang, Ying

2019-10-25 19:23:49

by Huang, Ying

Subject: Re: [RFC] Memory Tiering

Jonathan Adams <[email protected]> writes:

> On Wed, Oct 16, 2019 at 1:05 PM Dave Hansen <[email protected]> wrote:
>>
>> The memory hierarchy is getting more complicated and the kernel is
>> playing an increasing role in managing the different tiers. A few
>> different groups of folks described "migration" optimizations they were
>> doing in this area at LSF/MM earlier this year. One of the questions
>> folks asked was why autonuma wasn't being used.
>>
>> At Intel, the primary new tier that we're looking at is persistent
>> memory (PMEM). We'd like to be able to use "persistent memory"
>> *without* using its persistence properties, treating it as slightly
>> slower DRAM. Keith Busch has some patches to use NUMA migration to
>> automatically migrate DRAM->PMEM instead of discarding it near the end
>> of the reclaim process. Huang Ying has some patches which use a
>> modified autonuma to migrate frequently-used data *back* from PMEM->DRAM.
>>
>> We've tried to do this all generically so that it is not tied to
>> persistent memory and can be applied to any memory types in lots of
>> topologies.
>>
>> We've been running this code in various forms for the past few months,
>> comparing it to pure DRAM and hardware-based caching. The initial
>> results are encouraging and we thought others might want to take a look
>> at the code or run their own experiments. We're expecting to post the
>> individual patches soon. But, until then, the code is available here:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git
>>
>> and is tagged with "tiering-0.2", aka. d8e31e81b1dca9.
>
> Hi Dave,
>
> Thanks for sharing this git link and information on your approach.
> This is interesting, and lines up somewhat with the approach Google has
> been investigating. As we discussed at LSF/MM[1] and Linux
> Plumbers[2], we're working on an approach which integrates with our
> proactive reclaim work, with a similar attitude to PMEM (use it as
> "slightly slower" DRAM, ignoring its persistence). The prototype we
> have has a similar basic structure to what you're doing here and Yang
> Shi's patchset from March[3] (separate NUMA nodes for PMEM), but
> relies on a fair amount of kernel changes to control allocations from
> the NUMA nodes, and uses a similar "is_far" NUMA flag to Yang Shi's
> approach.
>
> We're working on redesigning to reduce the scope of kernel changes and
> to remove the "is_far" special handling; we still haven't refined
> down to a final approach, but one basic part we want to keep from the
> prototype is proactively pushing PMEM data back to DRAM when we've
> noticed it's in use. If we look at a two-socket system:
>
> A: DRAM & CPU node for socket 0
> B: PMEM node for socket 0
> C: DRAM & CPU node for socket 1
> D: PMEM node for socket 1
>
> instead of the unidirectional approach your patches go for:
>
> A is marked as "in reclaim, push pages to" B
> C is marked as "in reclaim, push pages to" D
> B & D have no markings
>
> we would have a bidirectional attachment:
>
> A is marked "move cold pages to" B
> B is marked "move hot pages to" A
> C is marked "move cold pages to" D
> D is marked "move hot pages to" C
>
> By using autonuma for moving PMEM pages back to DRAM, you avoid
> needing the B->A & D->C links, at the cost of migrating the pages
> back synchronously at pagefault time (assuming my understanding of how

Yes, it's synchronous. But it avoids the most time-consuming parts,
such as direct reclaim and page-lock acquisition, so the latency will
not be uncontrollable.

> autonuma works is accurate).
>
> Our approach still lets you have multiple levels of hierarchy for a
> given socket (you could imagine an "E" node with the same relation to
> "B" as "B" has to "A"), but doesn't make it easy to represent (say) an
> "E" which was equally close to all sockets (which I could imagine for
> something like remote memory on GenZ or what-have-you), since there
> wouldn't be a single back link; there would need to be something like
> your autonuma support to achieve that.

If hot pages in PMEM are identified by checking the accessed ("A") bit
in the PTE, there is no information about which CPU is accessing the
pages. One way to mitigate this is to use the original AutoNUMA. For
example, the hot pages may be migrated:

B -> A: via PMEM hot page promotion
A -> C: via original AutoNUMA
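A sketch of that two-stage flow, labeling which mechanism handles each hop (hypothetical names, for illustration only):

```python
# Stage 1: A-bit scanning only says "this PMEM page is hot", so promote
# it to the same-socket DRAM node first (B -> A).  Stage 2: if the page
# then turns out to be accessed from the other socket, plain autonuma
# can move it cross-socket (A -> C).
LOCAL_DRAM = {"B": "A", "D": "C"}   # PMEM node -> same-socket DRAM node

def place_hot_page(pmem_node, accessor_node):
    steps = []
    dram = LOCAL_DRAM[pmem_node]
    steps.append(("pmem-promotion", pmem_node, dram))
    if dram != accessor_node:
        steps.append(("autonuma", dram, accessor_node))
    return steps

print(place_hot_page("B", "C"))
# [('pmem-promotion', 'B', 'A'), ('autonuma', 'A', 'C')]
```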

> Does that make sense?

Yes. Definitely.

Best Regards,
Huang, Ying

> Thanks,
> - Jonathan
>
> [1] Shakeel's talk, I can't find a link at the moment. The basic
> kstaled/kreclaimd approach we built upon is talked about in
> https://blog.acolyer.org/2019/05/22/sw-far-memory/ and the linked
> ASPLOS paper
> [2] https://linuxplumbersconf.org/event/4/contributions/561/; slides
> at
> https://linuxplumbersconf.org/event/4/contributions/561/attachments/363/596/Persistent_Memory_as_Memory.pdf
> [3] https://lkml.org/lkml/2019/3/23/10