2019-02-21 23:12:39

by Adam Manzanares

Subject: [LSF/MM TOPIC] Page Cache Flexibility for NVM

Hello,

I would like to attend the LSF/MM Summit 2019. I'm interested in
several MM topics mentioned below, as well as Zoned Block Devices and
any IO determinism topics that come up in the storage track.

I have been working on a caching layer, hmmap (heterogeneous memory
map) [1], for emerging NVM; it is close in spirit to the page cache.
The key difference is that both the backend device and the caching
layer of hmmap are pluggable. In addition, hmmap supports DAX and
write protection, which I believe are key features for emerging NVMs
that may have write/read asymmetry as well as write endurance
constraints. Lastly, we can leverage hardware, such as a DMA engine,
when moving pages between the cache and the backing device, while
still allowing direct access if the device is capable.

I am proposing that, as an alternative to using NVMs as a NUMA node,
we expose the NVM through the page cache (or a viable alternative)
and have userspace applications mmap the NVM and hand out memory with
their favorite userspace memory allocator.
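
As a rough illustration of the intended usage model (the device path
below is hypothetical):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
    	size_t len = 1UL << 30;			/* 1 GiB of NVM */
    	int fd = open("/dev/hmmap0", O_RDWR);	/* hypothetical node */

    	if (fd < 0) {
    		perror("open");
    		return 1;
    	}

    	/* The application explicitly maps the NVM... */
    	void *nvm = mmap(NULL, len, PROT_READ | PROT_WRITE,
    			 MAP_SHARED, fd, 0);
    	if (nvm == MAP_FAILED) {
    		perror("mmap");
    		return 1;
    	}

    	/* ...and hands the region to its favorite userspace memory
    	 * allocator, e.g. an arena carved out of the mapping. */

    	munmap(nvm, len);
    	close(fd);
    	return 0;
    }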

This would restrict the NVMs to only those applications that are well
aware of the performance implications of accessing NVM. I believe
that all of this work could be solved with the NUMA node approach,
but the two approaches seem to be blurring together.

The main points I would like to discuss are:

* Is the page cache model a viable alternative to NVM as a NUMA node?
* Can we add more flexibility to the page cache?
* Should we force separation of NVM through an explicit mmap?

I believe this discussion could be merged with the "NUMA, memory
hierarchy and device memory", "Use NVDIMM as NUMA node and NUMA API",
or "memory reclaim with NUMA balancing" topics.

Here are some preliminary performance numbers for hmmap (still in
development):

All numbers were collected on a 4GiB hmmap device with a 128MiB
cache. For the mmap tests I used cgroups to limit the page cache
usage to 128MiB. All results are an average of 10 runs. W (write) and
R (read) access the entire device, with each thread confined to its
own region of the address space. RR (random read) reads the device
randomly 8 bytes at a time and is limited to 8MiB of total data
accessed.
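
For reference, the RR pattern is roughly the following (a simplified,
single-threaded sketch of the benchmark loop, not the actual hmmap
test code):

    #include <stdint.h>
    #include <stdlib.h>

    /* Random 8-byte reads over the whole device, capped at 8MiB of
     * data accessed in total. */
    static uint64_t random_read(const uint8_t *map, size_t dev_size)
    {
    	const size_t limit = 8UL << 20;	/* 8MiB cap */
    	uint64_t sum = 0;

    	for (size_t touched = 0; touched < limit; touched += 8) {
    		size_t off = ((size_t)rand() % (dev_size / 8)) * 8;
    		sum += *(const uint64_t *)(map + off);
    	}
    	return sum;	/* keep reads from being optimized away */
    }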

hmmap brd vs. mmap of brd:

                   hmmap                  mmap
Threads      W      R     RR        W      R     RR
      1   7.21   5.39   5.04     6.80   5.63   5.23
      2   5.19   3.87   3.74     4.66   3.33   3.20
      4   3.65   2.95   3.07     3.53   2.26   2.18
      8   4.52   3.43   3.59     4.30   1.98   1.88
     16   5.00   3.85   3.98     4.92   2.00   1.99



Memory backend test (DAX capable):

                   hmmap              hmmap-dax         hmmap-wrprotect
Threads      W      R     RR       W      R     RR       W      R     RR
      1   6.29   4.94   4.37    2.54   1.36   0.16    7.12   2.13   0.73
      2   4.62   3.63   3.57    1.41   0.69   0.08    5.06   1.14   0.41
      4   3.45   2.97   3.11    0.77   0.36   0.04    3.66   0.63   0.25
      8   4.10   3.53   3.71    0.44   0.19   0.02    4.03   0.35   0.17
     16   4.60   3.98   4.04    0.34   0.16   0.02    4.52   0.27   0.14


Thanks,
Adam





2019-02-21 23:15:36

by Dave Hansen

Subject: Re: [LSF/MM TOPIC] Page Cache Flexibility for NVM

On 2/21/19 3:11 PM, Adam Manzanares wrote:
> I am proposing that, as an alternative to using NVMs as a NUMA node,
> we expose the NVM through the page cache (or a viable alternative)
> and have userspace applications mmap the NVM and hand out memory with
> their favorite userspace memory allocator.

Are you proposing that the kernel manage this memory (it's managed in
the buddy lists, for instance) or that something else manage the memory,
like we do for device-dax or HMM?

2019-02-21 23:19:03

by Adam Manzanares

Subject: Re: [LSF/MM TOPIC] Page Cache Flexibility for NVM

On Thu, 2019-02-21 at 15:14 -0800, Dave Hansen wrote:
> On 2/21/19 3:11 PM, Adam Manzanares wrote:
> > I am proposing that, as an alternative to using NVMs as a NUMA
> > node, we expose the NVM through the page cache (or a viable
> > alternative) and have userspace applications mmap the NVM and hand
> > out memory with their favorite userspace memory allocator.
>
> Are you proposing that the kernel manage this memory (it's managed in
> the buddy lists, for instance) or that something else manage the
> memory, like we do for device-dax or HMM?
>

I am proposing we use a device-dax- or HMM-like model.

2019-02-21 23:51:42

by Adam Manzanares

Subject: Re: [LSF/MM TOPIC] Page Cache Flexibility for NVM

Forgot the link.

[1] https://github.com/westerndigitalcorporation/hmmap

Take care,
Adam


On Thu, 2019-02-21 at 15:11 -0800, Adam Manzanares wrote:
> Hello,
>
> I would like to attend the LSF/MM Summit 2019. [...]

2019-02-22 00:28:39

by Jerome Glisse

Subject: Re: [LSF/MM TOPIC] Page Cache Flexibility for NVM

On Thu, Feb 21, 2019 at 11:11:51PM +0000, Adam Manzanares wrote:
> [...]
>
> I am proposing that, as an alternative to using NVMs as a NUMA node,
> we expose the NVM through the page cache (or a viable alternative)
> and have userspace applications mmap the NVM and hand out memory with
> their favorite userspace memory allocator.
>
> [...]
>
> I believe this discussion could be merged with the "NUMA, memory
> hierarchy and device memory", "Use NVDIMM as NUMA node and NUMA API",
> or "memory reclaim with NUMA balancing" topics.

What about cache coherency and atomics? If the device blocks are
exposed through PCIe then there is no cache coherency or atomic
support, and thus a direct mmap will not have the expected memory
model, which would break a program's expectations of an mmap.

This is also one of the reasons I do not see a way forward with NUMA
and device memory. It can depart too much from ordinary memory to be
dropped in like that for unaware applications.

In any case, yes, this kind of memory falls into the device memory
topic I wish to discuss during LSF/MM.

Cheers,
Jérôme

2019-02-22 01:15:59

by Adam Manzanares

Subject: Re: [LSF/MM TOPIC] Page Cache Flexibility for NVM

On Thu, 2019-02-21 at 19:27 -0500, Jerome Glisse wrote:
> On Thu, Feb 21, 2019 at 11:11:51PM +0000, Adam Manzanares wrote:
> > [...]
>
> What about cache coherency and atomics? If the device blocks are
> exposed through PCIe then there is no cache coherency or atomic
> support, and thus a direct mmap will not have the expected memory
> model, which would break a program's expectations of an mmap.

For the PCIe cache coherency case, I would envision mapping the
memory read-only into the process address space. Once a write occurs,
I would then remap the faulting PCIe memory to a page in the proposed
caching mechanism.
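
Roughly, I picture something like the following in the write-fault
path (a sketch only; hmmap_cache_alloc() and hmmap_copy_from_device()
are hypothetical helpers, not actual hmmap code):

    #include <linux/mm.h>

    struct hmmap_dev;	/* per-device state, details omitted */

    /* PCIe memory is initially mapped read-only; the first write
     * faults, and the handler replaces the mapping with a coherent
     * cache page. */
    static vm_fault_t hmmap_pcie_mkwrite(struct vm_fault *vmf)
    {
    	struct hmmap_dev *dev = vmf->vma->vm_private_data;
    	struct page *cache_page;

    	/* Allocate a page from the hmmap cache. */
    	cache_page = hmmap_cache_alloc(dev);
    	if (!cache_page)
    		return VM_FAULT_OOM;

    	/* Copy the PCIe-resident data into the cache page. */
    	hmmap_copy_from_device(dev, vmf->pgoff, cache_page);

    	/* Map the cache page writable in place of the read-only
    	 * PCIe mapping. */
    	return vmf_insert_page(vmf->vma, vmf->address, cache_page);
    }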

I have to think more about what this means for atomic operations.

>
> This is also one of the reasons I do not see a way forward with NUMA
> and device memory. It can depart too much from ordinary memory to be
> dropped in like that for unaware applications.

I have similar concerns, which is why I am trying to restrict the
device memory to aware applications.

>
> In any case, yes, this kind of memory falls into the device memory
> topic I wish to discuss during LSF/MM.
>
> Cheers,
> Jérôme
>