Hello!
We have some interest in hardware on devices that is cache-coherent
with main memory, and in migrating memory between host memory and
device memory. We believe that we might not be the only ones looking
ahead to hardware like this, so please see below for a draft of some
approaches that we have been thinking of.
Thoughts?
Thanx, Paul
------------------------------------------------------------------------
COHERENT ON-DEVICE MEMORY: ACCESS AND MIGRATION
Ben Herrenschmidt
(As told to Paul E. McKenney)
Special-purpose hardware is becoming more prevalent, and some of this
hardware allows for tight interaction with CPU-based processing.
For example, IBM's coherent accelerator processor interface
(CAPI) will allow this sort of device to be constructed,
and it is likely that GPGPUs will need similar capabilities.
(See http://www-304.ibm.com/webapp/set2/sas/f/capi/home.html for a
high-level description of CAPI.) Let's call these cache-coherent
accelerator devices (CCAD for short, which should at least
motivate someone to come up with something better).
This document covers devices with the following properties:
1. The device is cache-coherent, in other words, the device's
memory has all the characteristics of system memory from
the viewpoint of CPUs and other devices accessing it.
2. The device provides local memory that it has high-bandwidth
low-latency access to, but the device can also access
normal system memory.
3. The device shares system page tables, so that it can
transparently access userspace virtual memory, regardless
of whether this virtual memory maps to normal system
memory or to memory local to the device.
Although such a device will provide CPUs with cache-coherent
access to on-device memory, the resulting memory latency is
expected to be higher than that of the normal memory that is
tightly coupled to the CPUs. Nevertheless, data that is accessed
frequently by the device but only occasionally by the CPUs should
be stored in the device's memory.
On the other hand, data that is accessed rarely by the device but
frequently by the CPUs should be stored in normal system memory.
Of course, some workloads will have predictable access patterns
that allow data to be optimally placed up front. However, other
workloads will have less-predictable access patterns, and these
workloads can benefit from automatic migration of data between
device memory and system memory as access patterns change.
Furthermore, some devices will provide special hardware that
collects access statistics that can be used to determine whether
or not a given page of memory should be migrated, and if so,
to where.
The purpose of this document is to explore how this access
and migration can be provided for within the Linux kernel.
REQUIREMENTS
1. It should be possible to remove a given CCAD device
from service, for example, to reset it, to download
updated firmware, or to change its functionality.
This results in the following additional requirements:
a. It should be possible to migrate all data away
from the device's memory at any time.
b. Normal memory allocation should avoid using the
device's memory, as this would interfere
with the needed migration. It may nevertheless
be desirable to use the device's memory
if system memory is exhausted; however, in some
cases, even this "emergency" use is best avoided.
In fact, a good solution will provide some means
of preventing even this use in those cases where
the device's memory must be evacuated so that the
device can be taken offline.
2. Memory can be either explicitly or implicitly allocated
from the CCAD device's memory. (Both usermode and kernel
allocation are required.)
Please note that implicit allocation will need to be
avoided in a number of use cases. The reason for this
is that random kernel allocations might be pinned into
memory, which could conflict with requirement (1) above,
and might furthermore fragment the device's memory.
3. The device's memory is treated like normal system
memory by the Linux kernel, for example, each page has a
"struct page" associated with it. (In contrast, the
traditional approach has used special-purpose OS mechanisms
to manage the device's memory, and this memory was treated
as MMIO space by the kernel.)
4. The system's normal tuning mechanism may be used to
tune allocation locality, migration, and so on, as
required to match performance and functional requirements.
POTENTIAL IDEAS
It is only reasonable to ask whether CCAD devices can simply
use the HMM patch that has recently been proposed to allow
migration between system and device memory via page faults.
Although this works well for devices whose local MMU can contain
mappings different from those of the system MMU, the HMM patch
still works with MMIO space that gets special treatment.
The HMM patch does not (yet) provide the full transparency that
would allow the device memory to be treated in the same way as
system memory. Something more is therefore required, for example,
one or more of the following:
1. Model the CCAD device's memory as a memory-only NUMA node
with a very large distance metric. This allows use of
the existing mechanisms for choosing where to satisfy
explicit allocations and where to target migrations.
(See the userspace sketch following this list.)
2. Cover the memory with a CMA to prevent non-migratable
pinned data from being placed in the CCAD device's memory.
It would also permit the driver to perform dedicated
physically contiguous allocations as needed.
3. Add a new ZONE_EXTERNAL zone for all CCAD-like devices.
Note that this would likely require support for
discontiguous zones in order to support large NUMA
systems, in which each node has a single block of the
overall physical address space. In such systems, the
physical address ranges of normal system memory would
be interleaved with those of device memory.
This would also require some sort of
migration infrastructure to be added, as autonuma would
not apply. However, this approach has the advantage
of preventing allocations in these regions, at least
unless those allocations have been explicitly flagged
to go there.
4. Your idea here!
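
To make idea 1 concrete, here is a minimal userspace sketch. It
assumes only that the device memory has come up as a memory-only
NUMA node (node 1 is an arbitrary assumption) and that libnuma is
installed; nothing in it is CCAD-specific, which is exactly the
point -- explicit placement would need no new userspace API.

/*
 * Illustrative sketch only: if the CCAD device's memory were exposed
 * as a memory-only NUMA node (assumed here to be node 1), existing
 * libnuma interfaces would already give userspace explicit placement.
 *
 * Build with:  cc -o ccad-alloc ccad-alloc.c -lnuma
 */
#include <stdio.h>
#include <string.h>
#include <numa.h>

#define DEVICE_NODE 1	/* assumption: device memory shows up as node 1 */

int main(void)
{
	size_t size = 16 * 1024 * 1024;
	char *buf;

	if (numa_available() < 0) {
		fprintf(stderr, "NUMA not available\n");
		return 1;
	}
	if (numa_max_node() < DEVICE_NODE) {
		fprintf(stderr, "no node %d on this system\n", DEVICE_NODE);
		return 1;
	}

	/* Explicitly place a buffer on the (hypothetical) device node. */
	buf = numa_alloc_onnode(size, DEVICE_NODE);
	if (!buf) {
		fprintf(stderr, "allocation on node %d failed\n",
			DEVICE_NODE);
		return 1;
	}

	memset(buf, 0, size);	/* touch the pages so they are instantiated */
	/* ... hand "buf" to the device here ... */

	numa_free(buf, size);
	return 0;
}

The hard part is therefore not explicit placement, but keeping
implicit allocations out, which is what the remaining ideas address.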
AUTONUMA
The Linux kernel's autonuma facility supports migrating both
memory and processes to promote NUMA memory locality. It was
accepted into 3.13 and is available in RHEL 7.0 and SLES 12.
It is enabled by the Kconfig variable CONFIG_NUMA_BALANCING.
This approach uses a kernel thread "knuma_scand" that periodically
marks pages inaccessible. The page-fault handler notes any
mismatches between the NUMA node that the process is running on
and the NUMA node on which the page resides.
http://lwn.net/Articles/488709/
https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf
It will be necessary to set up the CCAD device's memory as
a very distant NUMA node, and the architecture-specific
__numa_distance() function can be used for this purpose.
There is a RECLAIM_DISTANCE macro that can be set by the
architecture to prevent reclaiming from nodes that are too
far away. Some experimentation would be required to determine
the combination of values for the various distance macros.
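
As a sanity check while experimenting with those distance values, the
kernel already exports the resulting node-distance matrix via sysfs,
so a trivial program such as the sketch below (which assumes nothing
beyond the standard sysfs layout) can confirm that the CCAD device's
node really does end up looking "very distant":

/*
 * Minimal sketch: print the SLIT-style distances that the kernel
 * reports for each online node, as exposed in sysfs.  Useful for
 * checking that the CCAD device's node ends up suitably distant
 * once the architecture code is in place.
 */
#include <stdio.h>

int main(void)
{
	char path[64];
	char line[256];
	int node;

	for (node = 0; node < 16; node++) {	/* arbitrary upper bound */
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/distance", node);
		f = fopen(path, "r");
		if (!f)
			continue;	/* node not present */
		if (fgets(line, sizeof(line), f))
			printf("node%d distances: %s", node, line);
		fclose(f);
	}
	return 0;
}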
This approach needs some way to pull in data from the hardware
on access patterns. Aneesh Kk Veetil is prototyping an approach
based on Power 8 hardware counters. This data will need to be
plugged into the migration algorithm, which is currently based
on collecting information from page faults.
Finally, the contiguous memory allocator (CMA, see
http://lwn.net/Articles/486301/) is needed in order to prevent
the kernel from placing non-migratable allocations in the CCAD
device's memory. This would need to be of type MIGRATE_CMA to
ensure that all memory taken from that range be migratable.
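
A rough sketch of that CMA reservation is shown below. The ccad_base
and ccad_size parameters are placeholders for however the platform
discovers the device's physical range, and the
cma_declare_contiguous() prototype is approximately the one in
kernels of this era; the exact CMA interfaces have moved around
between versions, so treat this as illustrative rather than
definitive.

/*
 * Rough sketch only: cover the CCAD device's memory with a CMA area
 * at early boot so that the page allocator will place only movable
 * (MIGRATE_CMA) pages there.  ccad_base and ccad_size are
 * placeholders, and the cma_declare_contiguous() prototype shown is
 * approximately the one from kernels of this era -- check your tree
 * before copying.
 */
#include <linux/init.h>
#include <linux/cma.h>
#include <linux/printk.h>

static struct cma *ccad_cma;

static int __init ccad_reserve_device_memory(phys_addr_t ccad_base,
					      phys_addr_t ccad_size)
{
	int ret;

	/*
	 * fixed == true: the area must be exactly at ccad_base, since
	 * that is where the device's memory physically lives.
	 */
	ret = cma_declare_contiguous(ccad_base, ccad_size, 0, 0, 0,
				     true, &ccad_cma);
	if (ret)
		pr_err("ccad: CMA reservation at %pa failed (%d)\n",
		       &ccad_base, ret);
	return ret;
}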
The result would be that the kernel would allocate only migratable
pages within the CCAD device's memory, and even then only if
memory was otherwise exhausted. Normal CONFIG_NUMA_BALANCING
migration could be brought to bear, possibly enhanced with
information from hardware counters. One remaining issue is that
there is no way to absolutely prevent random kernel subsystems
from allocating the CCAD device's memory, which could cause
failures should the device need to reset itself, in which case
the memory would be temporarily inaccessible -- which could be
a fatal surprise to that kernel subsystem.
MEMORY ZONE
One way to avoid the problem of random kernel subsystems using
the CAPI device's memory is to create a new memory zone for
this purpose. This would add something like ZONE_DEVMEM to the
current set that includes ZONE_DMA, ZONE_NORMAL, and ZONE_MOVABLE.
Currently, there is a maximum of four zones, so this limit must
either be increased or kernels built with ZONE_DEVMEM must avoid
having more than one of ZONE_DMA, ZONE_DMA32, and ZONE_HIGHMEM.
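
For concreteness, the proposed addition would amount to something
like the following fragment of the enum zone_type definition from
include/linux/mmzone.h. ZONE_DEVMEM and CONFIG_ZONE_DEVMEM are
hypothetical names from this proposal; the other members are
abbreviated from the existing definition.

/*
 * Hypothetical sketch of the enum zone_type extension being proposed.
 * ZONE_DEVMEM and CONFIG_ZONE_DEVMEM do not exist today; the other
 * members are abbreviated from include/linux/mmzone.h of this era.
 */
enum zone_type {
#ifdef CONFIG_ZONE_DMA
	ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
	ZONE_DMA32,
#endif
	ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
	ZONE_HIGHMEM,
#endif
	ZONE_MOVABLE,
#ifdef CONFIG_ZONE_DEVMEM
	ZONE_DEVMEM,	/* cache-coherent device (CCAD) memory */
#endif
	__MAX_NR_ZONES
};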
This approach requires that migration be implemented separately,
as CONFIG_NUMA_BALANCING will not help here (unless I am
missing something). One advantage of this situation is that
hardware locality measurements could be incorporated from the
beginning. Another advantage is that random kernel subsystems
and user programs would not get CAPI device memory unless they
explicitly requested it.
Code would be needed at boot time to place the CAPI device
memory into ZONE_DEVMEM, perhaps involving changes to
mem_init() and paging_init().
In addition, an appropriate GFP_DEVMEM would be needed, along
with code in various paths to handle it appropriately.
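
A hypothetical sketch of that GFP plumbing follows. The bit value
and the GFP_DEVMEM composition are invented purely for illustration,
mirroring the way __GFP_DMA selects ZONE_DMA today; a real patch
would also need to extend GFP_ZONEMASK, the gfp-to-zone lookup, and
the fallback zonelists.

/*
 * Hypothetical sketch: a new zone modifier that steers an explicit
 * allocation into ZONE_DEVMEM, analogous to __GFP_DMA selecting
 * ZONE_DMA.  The bit value is made up for illustration only.
 */
#include <linux/gfp.h>

#define ___GFP_DEVMEM	0x8000000u		/* made-up bit */
#define __GFP_DEVMEM	((__force gfp_t)___GFP_DEVMEM)
#define GFP_DEVMEM	(GFP_USER | __GFP_MOVABLE | __GFP_DEVMEM)

/* Driver-side usage: explicitly ask for a page of on-device memory. */
static struct page *ccad_alloc_device_page(void)
{
	return alloc_pages(GFP_DEVMEM, 0);
}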
Also, because large NUMA systems will sometimes interleave the
addresses of blocks of physical memory and device memory,
support for discontiguous interleaved zones will be required.
On Tue, Apr 21, 2015 at 02:44:45PM -0700, Paul E. McKenney wrote:
> Hello!
>
> We have some interest in hardware on devices that is cache-coherent
> with main memory, and in migrating memory between host memory and
> device memory. We believe that we might not be the only ones looking
> ahead to hardware like this, so please see below for a draft of some
> approaches that we have been thinking of.
>
> Thoughts?
I have posted a patchset for doing just that several times; I am sure
Ben saw it. Search for HMM. I am about to repost it in the next couple
of weeks.
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> COHERENT ON-DEVICE MEMORY: ACCESS AND MIGRATION
> Ben Herrenschmidt
> (As told to Paul E. McKenney)
>
> Special-purpose hardware becoming more prevalent, and some of this
> hardware allows for tight interaction with CPU-based processing.
> For example, IBM's coherent accelerator processor interface
> (CAPI) will allow this sort of device to be constructed,
> and it is likely that GPGPUs will need similar capabilities.
> (See http://www-304.ibm.com/webapp/set2/sas/f/capi/home.html for a
> high-level description of CAPI.) Let's call these cache-coherent
> accelerator devices (CCAD for short, which should at least
> motivate someone to come up with something better).
>
> This document covers devices with the following properties:
>
> 1. The device is cache-coherent, in other words, the device's
> memory has all the characteristics of system memory from
> the viewpoint of CPUs and other devices accessing it.
>
> 2. The device provides local memory that it has high-bandwidth
> low-latency access to, but the device can also access
> normal system memory.
>
> 3. The device shares system page tables, so that it can
> transparently access userspace virtual memory, regardless
> of whether this virtual memory maps to normal system
> memory or to memory local to the device.
>
> Although such a device will provide CPU's with cache-coherent
> access to on-device memory, the resulting memory latency is
> expected to be slower than the normal memory that is tightly
> coupled to the CPUs. Nevertheless, data that is only occasionally
> accessed by CPUs should be stored in the device's memory.
> On the other hand, data that is accessed rarely by the device but
> frequently by the CPUs should be stored in normal system memory.
>
> Of course, some workloads will have predictable access patterns
> that allow data to be optimally placed up front. However, other
> workloads will have less-predictable access patterns, and these
> workloads can benefit from automatic migration of data between
> device memory and system memory as access patterns change.
> Furthermore, some devices will provide special hardware that
> collects access statistics that can be used to determine whether
> or not a given page of memory should be migrated, and if so,
> to where.
>
> The purpose of this document is to explore how this access
> and migration can be provided for within the Linux kernel.
All of the above are exactly the prerequisites for hardware that wants
to use HMM.
>
> REQUIREMENTS
>
> 1. It should be possible to remove a given CCAD device
> from service, for example, to reset it, to download
> updated firmware, or to change its functionality.
> This results in the following additional requirements:
>
> a. It should be possible to migrate all data away
> from the device's memory at any time.
>
> b. Normal memory allocation should avoid using the
> device's memory, as this would interfere
> with the needed migration. It may nevertheless
> be desirable to use the device's memory
> if system memory is exhausted, however, in some
> cases, even this "emergency" use is best avoided.
> In fact, a good solution will provide some means
> for avoiding this for those cases where it is
> necessary to evacuate memory when offlining the
> device.
>
> 2. Memory can be either explicitly or implicitly allocated
> from the CCAD device's memory. (Both usermode and kernel
> allocation required.)
>
> Please note that implicit allocation will need to be
> avoided in a number of use cases. The reason for this
> is that random kernel allocations might be pinned into
> memory, which could conflict with requirement (1) above,
> and might furthermore fragment the device's memory.
>
> 3. The device's memory is treated like normal system
> memory by the Linux kernel, for example, each page has a
> "struct page" associate with it. (In contrast, the
> traditional approach has used special-purpose OS mechanisms
> to manage the device's memory, and this memory was treated
> as MMIO space by the kernel.)
>
> 4. The system's normal tuning mechanism may be used to
> tune allocation locality, migration, and so on, as
> required to match performance and functional requirements.
OK, here you diverge substantially from the HMM design. HMM is intended
for platforms where the device memory is not necessarily (and is unlikely
to be) visible to the CPU (x86, IOMMU, and PCI BAR size are the keywords
here). For this reason, HMM has no intention of exposing the device
memory as memory usable by the CPU, and thus no intention of creating
struct pages for it.
That being said, commenting on your idea, I would say that normal memory
allocation should never use the device memory unless the allocation
happens due to a device page fault and the device driver requests it.
Moreover, even if you go down the "let's add a struct page for this
remote memory" road, it will not work with file-backed pages in the DAX
case.
>
>
> POTENTIAL IDEAS
>
> It is only reasonable to ask whether CCAD devices can simply
> use the HMM patch that has recently been proposed to allow
> migration between system and device memory via page faults.
> Although this works well for devices whose local MMU can contain
> mappings different from that of the system MMU, the HMM patch
> is still working with MMIO space that gets special treatment.
> The HMM patch does not (yet) provide the full transparency that
> would allow the device memory to be treated in the same way as
> system memory. Something more is therefore required, for example,
> one or more of the following:
>
> 1. Model the CCAD device's memory as a memory-only NUMA node
> with a very large distance metric. This allows use of
> the existing mechanisms for choosing where to satisfy
> explicit allocations and where to target migrations.
>
> 2. Cover the memory with a CMA to prevent non-migratable
> pinned data from being placed in the CCAD device's memory.
> It would also permit the driver to perform dedicated
> physically contiguous allocations as needed.
>
> 3. Add a new ZONE_EXTERNAL zone for all CCAD-like devices.
> Note that this would likely require support for
> discontinuous zones in order to support large NUMA
> systems, in which each node has a single block of the
> overall physical address space. In such systems, the
> physical address ranges of normal system memory would
> be interleaved with those of device memory.
>
> This would also require some sort of
> migration infrastructure to be added, as autonuma would
> not apply. However, this approach has the advantage
> of preventing allocations in these regions, at least
> unless those allocations have been explicitly flagged
> to go there.
>
> 4. Your idea here!
Well, AUTONUMA is interesting if you collect information from the device
on which memory the device is accessing the most. But even then I am not
convinced that collecting hints from userspace isn't more efficient.
Often the userspace library/program that leverages the GPU knows better
what the memory access pattern will be and can make better decisions.
In any case, I think you definitely need the new special zone to block
any kernel allocation from using the device memory. Device memory should
only be used on request from the process/device driver. I also think
this does not block doing something like AUTONUMA on top, probably
with slight modifications to the autonuma code to become aware of this
new kind of node.
>
>
> AUTONUMA
>
> The Linux kernel's autonuma facility supports migrating both
> memory and processes to promote NUMA memory locality. It was
> accepted into 3.13 and is available in RHEL 7.0 and SLES 12.
> It is enabled by the Kconfig variable CONFIG_NUMA_BALANCING.
>
> This approach uses a kernel thread "knuma_scand" that periodically
> marks pages inaccessible. The page-fault handler notes any
> mismatches between the NUMA node that the process is running on
> and the NUMA node on which the page resides.
>
> http://lwn.net/Articles/488709/
> https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf
>
> It will be necessary to set up the CCAD device's memory as
> a very distant NUMA node, and the architecture-specific
> __numa_distance() function can be used for this purpose.
> There is a RECLAIM_DISTANCE macro that can be set by the
> architecture to prevent reclaiming from nodes that are too
> far away. Some experimentation would be required to determine
> the combination of values for the various distance macros.
>
> This approach needs some way to pull in data from the hardware
> on access patterns. Aneesh Kk Veetil is prototyping an approach
> based on Power 8 hardware counters. This data will need to be
> plugged into the migration algorithm, which is currently based
> on collecting information from page faults.
>
> Finally, the contiguous memory allocator (CMA, see
> http://lwn.net/Articles/486301/) is needed in order to prevent
> the kernel from placing non-migratable allocations in the CCAD
> device's memory. This would need to be of type MIGRATE_CMA to
> ensure that all memory taken from that range be migratable.
>
> The result would be that the kernel would allocate only migratable
> pages within the CCAD device's memory, and even then only if
> memory was otherwise exhausted. Normal CONFIG_NUMA_BALANCING
> migration could be brought to bear, possibly enhanced with
> information from hardware counters. One remaining issue is that
> there is no way to absolutely prevent random kernel subsystems
> from allocating the CCAD device's memory, which could cause
> failures should the device need to reset itself, in which case
> the memory would be temporarily inaccessible -- which could be
> a fatal surprise to that kernel subsystem.
>
> MEMORY ZONE
>
> One way to avoid the problem of random kernel subsystems using
> the CAPI device's memory is to create a new memory zone for
> this purpose. This would add something like ZONE_DEVMEM to the
> current set that includes ZONE_DMA, ZONE_NORMAL, and ZONE_MOVABLE.
> Currently, there are a maximum of four zones, so this limit must
> either be increased or kernels built with ZONE_DEVMEM must avoid
> having more than one of ZONE_DMA, ZONE_DMA32, and ZONE_HIGHMEM.
>
> This approach requires that migration be implemented on the side,
> as the CONFIG_NUMA_BALANCING will not help here (unless I am
> missing something). One advantage of this situation is that
> hardware locality measurements could be incorporated from the
> beginning. Another advantage is that random kernel subsystems
> and user programs would not get CAPI device memory unless they
> explicitly requested it.
>
> Code would be needed at boot time to place the CAPI device
> memory into ZONE_DEVMEM, perhaps involving changes to
> mem_init() and paging_init().
>
> In addition, an appropriate GFP_DEVMEM would be needed, along
> with code in various paths to handle it appropriately.
>
> Also, because large NUMA systems will sometimes interleave the
> addresses of blocks of physical memory and device memory,
> support for discontiguous interleaved zones will be required.
Zones and NUMA nodes should be orthogonal in my mind, even if most of
the different zones (DMA, DMA32, NORMAL) always end up being on the same
node. Zones are really the outcome of some "old" hardware restrictions
(the brave old 32-bit world). So the zone code most likely requires some
work to face the reality of today's world. While existing zones need to
keep their definitions based on physical address, the zone code should
not care about that, effectively allowing zones that span several
different chunks of the physical address range.
I also believe that persistent memory might have the same kind of
requirement, so you might be able to piggyback on any work they might
have to do, or at least work I believe they need to do.
But I have not looked into all that code much, and I might just be
dreaming about how the world should be; some subtle detail is likely
escaping me.
Cheers,
Jérôme
On Tue, 21 Apr 2015, Paul E. McKenney wrote:
> Thoughts?
Use DAX for memory instead of the other approaches? That way it is
explicitly clear what information is put on the CAPI device.
> Although such a device will provide CPU's with cache-coherent
Maybe call this a coprocessor like IBM does? It is like a processor
after all in terms of its participation in cache coherence.
> access to on-device memory, the resulting memory latency is
> expected to be slower than the normal memory that is tightly
> coupled to the CPUs. Nevertheless, data that is only occasionally
> accessed by CPUs should be stored in the device's memory.
> On the other hand, data that is accessed rarely by the device but
> frequently by the CPUs should be stored in normal system memory.
I would expect many devices to not have *normal memory* at all (those
that simply process some data or otherwise interface with external
hardware, like for example a NIC). Other devices like GPUs have local
memory, but what is in GPU memory is very specific, and general OS
structures should not be allocated there.
What I mostly would like to see is that these devices will have the
ability to participate in the cpu cache coherency scheme. I.e. they
will have l1/l2/l3 caches that will allow fast data exchange between the
coprocessor and the regular processors in the system.
>
> a. It should be possible to migrate all data away
> from the device's memory at any time.
That would be device specific, and only a special device driver for that
device could save the state of the device (if that is even necessary; it
would not be for something like a NIC).
> b. Normal memory allocation should avoid using the
> device's memory, as this would interfere
> with the needed migration. It may nevertheless
> be desirable to use the device's memory
> if system memory is exhausted, however, in some
> cases, even this "emergency" use is best avoided.
> In fact, a good solution will provide some means
> for avoiding this for those cases where it is
> necessary to evacuate memory when offlining the
> device.
Ok that seems to mean that none of the approaches suggested later would
be useful.
> 3. The device's memory is treated like normal system
> memory by the Linux kernel, for example, each page has a
> "struct page" associate with it. (In contrast, the
> traditional approach has used special-purpose OS mechanisms
> to manage the device's memory, and this memory was treated
> as MMIO space by the kernel.)
Why do we need a struct page? If so, then maybe equip DAX with a struct
page so that the contents of the device memory can be controlled via a
filesystem? (Maybe one custom to the needs of the device.)
On Tue, Apr 21, 2015 at 06:49:29PM -0500, Christoph Lameter wrote:
> On Tue, 21 Apr 2015, Paul E. McKenney wrote:
>
> > Thoughts?
>
> Use DAX for memory instead of the other approaches? That way it is
> explicitly clear what information is put on the CAPI device.
>
Memory on this device should not be considered as something special
(even if it is). More below.
[...]
>
> > 3. The device's memory is treated like normal system
> > memory by the Linux kernel, for example, each page has a
> > "struct page" associate with it. (In contrast, the
> > traditional approach has used special-purpose OS mechanisms
> > to manage the device's memory, and this memory was treated
> > as MMIO space by the kernel.)
>
> Why do we need a struct page? If so then maybe equip DAX with a struct
> page so that the contents of the device memory can be controlled via a
> filesystem? (may be custom to the needs of the device).
So, the big use case here: let's say you have an application that
relies on a scientific library that does matrix computation. Your
application simply uses malloc and gives pointers to this scientific
library. Now let's say the good folks working on this scientific library
want to leverage the GPU. They could do it by allocating GPU memory
through a GPU-specific API and copying data in and out. For matrices
that can be easy enough, but it is still inefficient. What you really
want is the GPU directly accessing this malloc'ed chunk of memory,
eventually migrating it to device memory while performing the
computation and migrating it back to system memory once done. Which
means that you do not want some kind of filesystem or anything like
that.
By allowing transparent migration, you allow the library to just start
using the GPU with the application being none the wiser. Moreover, when
you start playing with data sets that use more advanced design patterns
(lists, trees, vectors, a mix of all of the above), you do not want to
have to duplicate the data structures for the GPU address space and for
the regular CPU address space (which you would need to do in the case of
a filesystem solution).
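To make that model concrete, the calling code would look like the
sketch below. sci_matrix_multiply() is a made-up library entry point
(the naive CPU implementation is only there so the sketch is
complete); the point is that the application never touches a
GPU-specific allocator.

#include <stdlib.h>

/*
 * Hypothetical scientific-library entry point; a real library might
 * offload this to the GPU behind the caller's back.  The naive CPU
 * implementation is only a stand-in so the sketch is complete.
 */
static void sci_matrix_multiply(const double *a, const double *b,
				double *c, int n)
{
	int i, j, k;

	for (i = 0; i < n; i++)
		for (j = 0; j < n; j++) {
			double sum = 0.0;

			for (k = 0; k < n; k++)
				sum += a[i * n + k] * b[k * n + j];
			c[i * n + j] = sum;
		}
}

int main(void)
{
	int n = 1024;
	double *a = malloc((size_t)n * n * sizeof(*a));
	double *b = malloc((size_t)n * n * sizeof(*b));
	double *c = malloc((size_t)n * n * sizeof(*c));

	if (!a || !b || !c)
		return 1;

	/* ... fill a and b ... */

	/*
	 * The library is free to migrate these plain malloc()ed pages
	 * to device memory, run the computation there, and let them
	 * migrate back when the CPU touches the result -- no
	 * GPU-specific allocation API in sight.
	 */
	sci_matrix_multiply(a, b, c, n);

	free(a);
	free(b);
	free(c);
	return 0;
}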
So the cornerstone of HMM and Paul's requirements are the same: we want
to be able to move normal anonymous memory, as well as regular
file-backed pages, to device memory for some period of time, while at
the same time allowing the usual memory management to keep going as if
nothing were different.
Paul is working on a platform that is more advanced than the one HMM
tries to address, and I believe the x86 platform will not have
functionality such as CAPI; at least it is not part of any x86 roadmap
I know about.
Cheers,
Jérôme
On Tue, 2015-04-21 at 19:46 -0400, Jerome Glisse wrote:
> On Tue, Apr 21, 2015 at 02:44:45PM -0700, Paul E. McKenney wrote:
> > Hello!
> >
> > We have some interest in hardware on devices that is cache-coherent
> > with main memory, and in migrating memory between host memory and
> > device memory. We believe that we might not be the only ones looking
> > ahead to hardware like this, so please see below for a draft of some
> > approaches that we have been thinking of.
> >
> > Thoughts?
>
> I have posted several time a patchset just for doing that, i am sure
> Ben did see it. Search for HMM. I am about to repost it in next couple
> weeks.
Actually, no. :-) This is not at all HMM's realm.
HMM deals with non-cachable (MMIO) device memory that isn't represented
by struct page and separate MMUs that allow pages to be selectively
unmapped from CPU vs. device.
This proposal is about a very different type of device where the device
memory is fully cachable from a CPU standpoint, and thus can be
represented by struct page, and the device has an MMU that is completely
shared with the CPU, ie, the device operates within a given context of
the system and if a page is marked read-only or inaccessible, this will
be true on both the CPU and the device.
Note: IBM is also interested in HMM for devices that don't meet the
above criteria, such as some GPUs or NICs, but that is something *else*.
Cheers,
Ben.
On Tue, 2015-04-21 at 18:49 -0500, Christoph Lameter wrote:
> On Tue, 21 Apr 2015, Paul E. McKenney wrote:
>
> > Thoughts?
>
> Use DAX for memory instead of the other approaches? That way it is
> explicitly clear what information is put on the CAPI device.
Care to elaborate on what DAX is?
> > Although such a device will provide CPU's with cache-coherent
>
> Maybe call this coprocessor like IBM does? It is like a processor after
> all in terms of its participation in cache coherent?
It is, yes, in a way, though the actual implementation could be anything
from a NIC to a GPU or a crypto accelerator or whatever you can think
of.
The device memory is fully cachable from the CPU standpoint and the
device *completely* shares the MMU with the CPU (operates within a
normal linux mm context).
> > access to on-device memory, the resulting memory latency is
> > expected to be slower than the normal memory that is tightly
> > coupled to the CPUs. Nevertheless, data that is only occasionally
> > accessed by CPUs should be stored in the device's memory.
> > On the other hand, data that is accessed rarely by the device but
> > frequently by the CPUs should be stored in normal system memory.
>
> I would expect many devices to not have *normal memory* at all (those
> that simply process some data or otherwise interface with external
> hardware like f.e. a NIC). Other devices like GPUs have local memory but
> what is in GPU memory is very specific and general OS structures should
> not be allocated there.
That isn't entirely true. Take GPUs as an example: they can have
*large* amounts of local memory, and you want to migrate the working set
(not just control structures) over.
So you can basically malloc() something on the host and hand it over to
the coprocessor, which churns on it; the bus interface/MMU on the device
"detects" that a given page or set of pages is heavily pounded on by the
GPU and sends an interrupt to the host via a sideband channel to request
its migration to the device.
Since the device memory is fully cachable and coherent, it can simply be
represented with struct pages like normal system memory and we can use
the existing migration mechanism.
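To illustrate: once the device memory is a NUMA node backed by struct
pages, even today's userspace can drive that migration through the
existing move_pages(2) syscall, as in the sketch below (the device
node number is an assumption). The driver-triggered path described
above would go through the same kernel migration machinery, just
initiated from the interrupt handler instead of from userspace.

/*
 * Sketch of migrating an existing malloc()ed buffer's pages to the
 * (assumed) device NUMA node using the existing move_pages(2)
 * syscall.  DEVICE_NODE == 1 is an assumption.
 *
 * Build with:  cc -o ccad-migrate ccad-migrate.c -lnuma
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <numaif.h>

#define DEVICE_NODE	1

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned long npages = 64;
	char *buf = malloc(npages * page_size);
	void **pages;
	int *nodes, *status;
	unsigned long i;

	if (!buf)
		return 1;
	memset(buf, 0, npages * page_size);	/* fault the pages in */

	pages = calloc(npages, sizeof(*pages));
	nodes = calloc(npages, sizeof(*nodes));
	status = calloc(npages, sizeof(*status));
	if (!pages || !nodes || !status)
		return 1;

	for (i = 0; i < npages; i++) {
		pages[i] = buf + i * page_size;
		nodes[i] = DEVICE_NODE;
	}

	/* Ask the kernel to migrate these pages to the device node. */
	if (move_pages(0, npages, pages, nodes, status, MPOL_MF_MOVE) < 0) {
		perror("move_pages");
		return 1;
	}
	for (i = 0; i < npages; i++)
		if (status[i] < 0)
			fprintf(stderr, "page %lu not migrated (%d)\n",
				i, status[i]);
	return 0;
}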
> What I mostly would like to see is that these devices will have the
> ability to participate in the cpu cache coherency scheme. I.e. they
> will have l1/l2/l3 caches that will allow fast data exchange between the
> coprocessor and the regular processors in the system.
Yes they can in theory.
> >
> > a. It should be possible to migrate all data away
> > from the device's memory at any time.
>
> That would be device specific and only a special device driver for that
> device could save the state of the device (if that is necessary. It would
> not be for something like a NIC).
Yes and no. If the memory is fully given to the system as struct pages,
we can have random kernel allocations on it which means we can't evict
it.
The ideas here are to try to mitigate that, i.e., keep the benefit of
struct page and limit the problem of unrelated allocations hitting the
device.
> > b. Normal memory allocation should avoid using the
> > device's memory, as this would interfere
> > with the needed migration. It may nevertheless
> > be desirable to use the device's memory
> > if system memory is exhausted, however, in some
> > cases, even this "emergency" use is best avoided.
> > In fact, a good solution will provide some means
> > for avoiding this for those cases where it is
> > necessary to evacuate memory when offlining the
> > device.
>
> Ok that seems to mean that none of the approaches suggested later would
> be useful.
Why? A far-away NUMA node covered with a CMA would probably do the
trick; a ZONE would definitely do the trick...
> > 3. The device's memory is treated like normal system
> > memory by the Linux kernel, for example, each page has a
> > "struct page" associate with it. (In contrast, the
> > traditional approach has used special-purpose OS mechanisms
> > to manage the device's memory, and this memory was treated
> > as MMIO space by the kernel.)
>
> Why do we need a struct page? If so then maybe equip DAX with a struct
> page so that the contents of the device memory can be controlled via a
> filesystem? (may be custom to the needs of the device).
What is DAX?
struct page means we can transparently migrate anonymous memory,
among other things.
Cheers,
Ben.
On Tue, 21 Apr 2015, Jerome Glisse wrote:
> Memory on this device should not be considered as something special
> (even if it is). More below.
Uhh?
> So big use case here, let say you have an application that rely on a
> scientific library that do matrix computation. Your application simply
> use malloc and give pointer to this scientific library. Now let say
> the good folks working on this scientific library wants to leverage
> the GPU, they could do it by allocating GPU memory through GPU specific
> API and copy data in and out. For matrix that can be easy enough, but
> still inefficient. What you really want is the GPU directly accessing
> this malloced chunk of memory, eventualy migrating it to device memory
> while performing the computation and migrating it back to system memory
> once done. Which means that you do not want some kind of filesystem or
> anything like that.
With a filesystem the migration can be controlled by the application. It
can copy stuff whenever it wants to. Having the OS do that behind my back
is not something that feels safe and secure.
> By allowing transparent migration you allow library to just start using
> the GPU without the application being non the wiser about that. More
> over when you start playing with data set that use more advance design
> pattern (list, tree, vector, a mix of all the above) you do not want
> to have to duplicate the list for the GPU address space and for the
> regular CPU address space (which you would need to do in case of a
> filesystem solution).
There is no need for duplication if both address spaces use the same
addresses. For example, DAX would allow you to mmap arbitrary portions of
the GPU's memory into a process address space. Since this is cache
coherent, both the processor cache and the coprocessor cache would be
able to hold cachelines from the device or from main memory.
> So the corner stone of HMM and Paul requirement are the same, we want
> to be able to move normal anonymous memory as well as regular file
> backed page to device memory for some period of time while at the same
> time allowing the usual memory management to keep going as if nothing
> was different.
This still sounds pretty wild and involves major changes to core OS
mechanisms with little reason that I can see. There are already
mechanisms in place that do what you want.
> Paul is working on a platform that is more advance that the one HMM try
> to address and i believe the x86 platform will not have functionality
> such a CAPI, at least it is not part of any roadmap i know about for
> x86.
We will be one of the first users of Paul's Platform. Please do not do
crazy stuff but give us a sane solution where we can control the
hardware. No strange VM hooks that automatically move stuff back and forth
please. If you do this we will have to disable them anyways because they
would interfere with our needs to have the code not be disturbed by random
OS noise. We need detailed control as to when and how we move data.
On Wed, Apr 22, 2015 at 10:42:52AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2015-04-21 at 18:49 -0500, Christoph Lameter wrote:
> > On Tue, 21 Apr 2015, Paul E. McKenney wrote:
> >
> > > Thoughts?
> >
> > Use DAX for memory instead of the other approaches? That way it is
> > explicitly clear what information is put on the CAPI device.
>
> Care to elaborate on what DAX is ?
I would like to know as well. My first attempt to Google got me nothing
but Star Trek. Is DAX the persistent-memory topic covered here?
https://lwn.net/Articles/591779/
https://lwn.net/Articles/610174/
Ben will correct me if I am wrong, but I do not believe that we are
looking for persistent memory in this case.
Thanx, Paul
On Tue, 2015-04-21 at 19:50 -0500, Christoph Lameter wrote:
> With a filesystem the migration can be controlled by the application. It
> can copy stuff whenever it wants to.Having the OS do that behind my back
> is not something that feels safe and secure.
But this is not something the user wants. The filesystem model is
completely the wrong model for us.
This is fundamentally the same model as memory migrating between NUMA
nodes except that one of these is a co-processor with its local memory.
You want to malloc() some stuff or get a pointer provided by an app to
your library and be able to farm that job out to the co-processor. No
filesystem in the picture here.
> > By allowing transparent migration you allow library to just start using
> > the GPU without the application being non the wiser about that. More
> > over when you start playing with data set that use more advance design
> > pattern (list, tree, vector, a mix of all the above) you do not want
> > to have to duplicate the list for the GPU address space and for the
> > regular CPU address space (which you would need to do in case of a
> > filesystem solution).
>
> There is no need for duplication if both address spaces use the same
> addresses. F.e. DAX would allow you to mmap arbitrary portions of memory
> of the GPU into a process space. Since this is cache coherent both
> processor cache and coprocessor cache would be able to hold cachelines
> from the device or from main memory.
But it won't give you transparent migration which is what this is *all*
about.
> > So the corner stone of HMM and Paul requirement are the same, we want
> > to be able to move normal anonymous memory as well as regular file
> > backed page to device memory for some period of time while at the same
> > time allowing the usual memory management to keep going as if nothing
> > was different.
>
> This still sounds pretty wild and is doing major changes to core OS
> mechanisms with little reason from that I can see. There are already
> mechanisms in place that do what you want.
What "major" changes ? HMM has some changes yes, what we propose is
about using existing mechanisms with possibly *few* changes, but we are
trying to get that discussion going.
> > Paul is working on a platform that is more advance that the one HMM try
> > to address and i believe the x86 platform will not have functionality
> > such a CAPI, at least it is not part of any roadmap i know about for
> > x86.
>
> We will be one of the first users of Paul's Platform. Please do not do
> crazy stuff but give us a sane solution where we can control the
> hardware. No strange VM hooks that automatically move stuff back and forth
> please. If you do this we will have to disable them anyways because they
> would interfere with our needs to have the code not be disturbed by random
> OS noise. We need detailed control as to when and how we move data.
There is strictly nothing *sane* about requiring the workload to be put
into files that have to be explicitly moved around. This is utterly
backward. We aren't talking about CAPI-based flash storage here; we are
talking about a coprocessor that can be buried under a library,
accelerating existing APIs, which are going to take existing pointers,
be they mmap'ed files, anonymous memory, or whatever else the
application chooses to use.
This is the model that GPU *users* have been pushing for over and over
again, that some NIC vendors want as well (with HMM initially) etc...
Ben.
On Tue, 2015-04-21 at 17:57 -0700, Paul E. McKenney wrote:
> On Wed, Apr 22, 2015 at 10:42:52AM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2015-04-21 at 18:49 -0500, Christoph Lameter wrote:
> > > On Tue, 21 Apr 2015, Paul E. McKenney wrote:
> > >
> > > > Thoughts?
> > >
> > > Use DAX for memory instead of the other approaches? That way it is
> > > explicitly clear what information is put on the CAPI device.
> >
> > Care to elaborate on what DAX is ?
>
> I would like to know as well. My first attempt to Google got me nothing
> but Star Trek. Is DAX the persistent-memory topic covered here?
>
> https://lwn.net/Articles/591779/
> https://lwn.net/Articles/610174/
>
> Ben will correct me if I am wrong, but I do not believe that we are
> looking for persistent memory in this case.
Right, it doesn't look at all like what we want.
Ben.
On Tue, Apr 21, 2015 at 07:46:07PM -0400, Jerome Glisse wrote:
> On Tue, Apr 21, 2015 at 02:44:45PM -0700, Paul E. McKenney wrote:
> > Hello!
> >
> > We have some interest in hardware on devices that is cache-coherent
> > with main memory, and in migrating memory between host memory and
> > device memory. We believe that we might not be the only ones looking
> > ahead to hardware like this, so please see below for a draft of some
> > approaches that we have been thinking of.
> >
> > Thoughts?
>
> I have posted several time a patchset just for doing that, i am sure
> Ben did see it. Search for HMM. I am about to repost it in next couple
> weeks.
Looking forward to seeing it! As you note below, we are not trying
to replace HMM, but rather to build upon it.
> > ------------------------------------------------------------------------
> >
> > COHERENT ON-DEVICE MEMORY: ACCESS AND MIGRATION
> > Ben Herrenschmidt
> > (As told to Paul E. McKenney)
> >
> > Special-purpose hardware becoming more prevalent, and some of this
> > hardware allows for tight interaction with CPU-based processing.
> > For example, IBM's coherent accelerator processor interface
> > (CAPI) will allow this sort of device to be constructed,
> > and it is likely that GPGPUs will need similar capabilities.
> > (See http://www-304.ibm.com/webapp/set2/sas/f/capi/home.html for a
> > high-level description of CAPI.) Let's call these cache-coherent
> > accelerator devices (CCAD for short, which should at least
> > motivate someone to come up with something better).
> >
> > This document covers devices with the following properties:
> >
> > 1. The device is cache-coherent, in other words, the device's
> > memory has all the characteristics of system memory from
> > the viewpoint of CPUs and other devices accessing it.
> >
> > 2. The device provides local memory that it has high-bandwidth
> > low-latency access to, but the device can also access
> > normal system memory.
> >
> > 3. The device shares system page tables, so that it can
> > transparently access userspace virtual memory, regardless
> > of whether this virtual memory maps to normal system
> > memory or to memory local to the device.
> >
> > Although such a device will provide CPU's with cache-coherent
> > access to on-device memory, the resulting memory latency is
> > expected to be slower than the normal memory that is tightly
> > coupled to the CPUs. Nevertheless, data that is only occasionally
> > accessed by CPUs should be stored in the device's memory.
> > On the other hand, data that is accessed rarely by the device but
> > frequently by the CPUs should be stored in normal system memory.
> >
> > Of course, some workloads will have predictable access patterns
> > that allow data to be optimally placed up front. However, other
> > workloads will have less-predictable access patterns, and these
> > workloads can benefit from automatic migration of data between
> > device memory and system memory as access patterns change.
> > Furthermore, some devices will provide special hardware that
> > collects access statistics that can be used to determine whether
> > or not a given page of memory should be migrated, and if so,
> > to where.
> >
> > The purpose of this document is to explore how this access
> > and migration can be provided for within the Linux kernel.
>
> All of the above is the exact requisit for hardware that want to use
> HMM.
>
> >
> > REQUIREMENTS
> >
> > 1. It should be possible to remove a given CCAD device
> > from service, for example, to reset it, to download
> > updated firmware, or to change its functionality.
> > This results in the following additional requirements:
> >
> > a. It should be possible to migrate all data away
> > from the device's memory at any time.
> >
> > b. Normal memory allocation should avoid using the
> > device's memory, as this would interfere
> > with the needed migration. It may nevertheless
> > be desirable to use the device's memory
> > if system memory is exhausted, however, in some
> > cases, even this "emergency" use is best avoided.
> > In fact, a good solution will provide some means
> > for avoiding this for those cases where it is
> > necessary to evacuate memory when offlining the
> > device.
> >
> > 2. Memory can be either explicitly or implicitly allocated
> > from the CCAD device's memory. (Both usermode and kernel
> > allocation required.)
> >
> > Please note that implicit allocation will need to be
> > avoided in a number of use cases. The reason for this
> > is that random kernel allocations might be pinned into
> > memory, which could conflict with requirement (1) above,
> > and might furthermore fragment the device's memory.
> >
> > 3. The device's memory is treated like normal system
> > memory by the Linux kernel, for example, each page has a
> > "struct page" associate with it. (In contrast, the
> > traditional approach has used special-purpose OS mechanisms
> > to manage the device's memory, and this memory was treated
> > as MMIO space by the kernel.)
> >
> > 4. The system's normal tuning mechanism may be used to
> > tune allocation locality, migration, and so on, as
> > required to match performance and functional requirements.
>
> Ok here you diverge substantially from HMM design, HMM is intended for
> platform where the device memory is not necessarily (and unlikely) to
> be visible by the CPU (x86 IOMMU PCI bar size are all the keywords here).
Yep! ;-)
> For this reason in HMM there is no intention to expose the device memory
> as some memory useable by the CPU and thus no intention to create struct
> page for it.
>
> That being said commenting on your idea i would say that normal memory
> allocation should never use the device memory unless the allocation
> happens due to a device page fault and the device driver request it.
For many use cases, agreed. Perhaps even for all use cases.
> Moreover even if you go down the lets add a struct page for this remote
> memory, it will not work with file backed page in the DAX case.
At first glance, I agree that DAX's filesystems do not seem to be
helping here. Christoph might have other thoughts.
> > POTENTIAL IDEAS
> >
> > It is only reasonable to ask whether CCAD devices can simply
> > use the HMM patch that has recently been proposed to allow
> > migration between system and device memory via page faults.
> > Although this works well for devices whose local MMU can contain
> > mappings different from that of the system MMU, the HMM patch
> > is still working with MMIO space that gets special treatment.
> > The HMM patch does not (yet) provide the full transparency that
> > would allow the device memory to be treated in the same way as
> > system memory. Something more is therefore required, for example,
> > one or more of the following:
> >
> > 1. Model the CCAD device's memory as a memory-only NUMA node
> > with a very large distance metric. This allows use of
> > the existing mechanisms for choosing where to satisfy
> > explicit allocations and where to target migrations.
> >
> > 2. Cover the memory with a CMA to prevent non-migratable
> > pinned data from being placed in the CCAD device's memory.
> > It would also permit the driver to perform dedicated
> > physically contiguous allocations as needed.
> >
> > 3. Add a new ZONE_EXTERNAL zone for all CCAD-like devices.
> > Note that this would likely require support for
> > discontinuous zones in order to support large NUMA
> > systems, in which each node has a single block of the
> > overall physical address space. In such systems, the
> > physical address ranges of normal system memory would
> > be interleaved with those of device memory.
> >
> > This would also require some sort of
> > migration infrastructure to be added, as autonuma would
> > not apply. However, this approach has the advantage
> > of preventing allocations in these regions, at least
> > unless those allocations have been explicitly flagged
> > to go there.
> >
> > 4. Your idea here!
>
> Well AUTONUMA is interesting if you collect informations from the device
> on what memory the device is accessing the most. But even then i am not
> convince that actually collecting hint from userspace isn't more efficient.
>
> Often the userspace library/program that leverage the GPU knows better
> what will be the memory access pattern and can make better decissions.
The argument over which of hardware measurements and usermode hints
should prevail has been going on for the better part of two decades,
in various contexts. ;-)
> In any case i think you definitly need the new special zone to block any
> kernel allocation from using the device memory. Device memory should
> solely be use on request from the process/device driver. I also think
> this is does not block doing something like AUTONUMA on top, probably
> with slight modification to the autonuma code to become aware of this
> new kind of node.
Agreed, there are important use cases where you don't want random
allocations in device memory, for example, cases where you might
need to remove or reset the device at runtime.
> > AUTONUMA
> >
> > The Linux kernel's autonuma facility supports migrating both
> > memory and processes to promote NUMA memory locality. It was
> > accepted into 3.13 and is available in RHEL 7.0 and SLES 12.
> > It is enabled by the Kconfig variable CONFIG_NUMA_BALANCING.
> >
> > This approach uses a kernel thread "knuma_scand" that periodically
> > marks pages inaccessible. The page-fault handler notes any
> > mismatches between the NUMA node that the process is running on
> > and the NUMA node on which the page resides.
> >
> > http://lwn.net/Articles/488709/
> > https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf
> >
> > It will be necessary to set up the CCAD device's memory as
> > a very distant NUMA node, and the architecture-specific
> > __numa_distance() function can be used for this purpose.
> > There is a RECLAIM_DISTANCE macro that can be set by the
> > architecture to prevent reclaiming from nodes that are too
> > far away. Some experimentation would be required to determine
> > the combination of values for the various distance macros.
> >
> > This approach needs some way to pull in data from the hardware
> > on access patterns. Aneesh Kk Veetil is prototyping an approach
> > based on Power 8 hardware counters. This data will need to be
> > plugged into the migration algorithm, which is currently based
> > on collecting information from page faults.
> >
> > Finally, the contiguous memory allocator (CMA, see
> > http://lwn.net/Articles/486301/) is needed in order to prevent
> > the kernel from placing non-migratable allocations in the CCAD
> > device's memory. This would need to be of type MIGRATE_CMA to
> > ensure that all memory taken from that range be migratable.
> >
> > The result would be that the kernel would allocate only migratable
> > pages within the CCAD device's memory, and even then only if
> > memory was otherwise exhausted. Normal CONFIG_NUMA_BALANCING
> > migration could be brought to bear, possibly enhanced with
> > information from hardware counters. One remaining issue is that
> > there is no way to absolutely prevent random kernel subsystems
> > from allocating the CCAD device's memory, which could cause
> > failures should the device need to reset itself, in which case
> > the memory would be temporarily inaccessible -- which could be
> > a fatal surprise to that kernel subsystem.
> >
> > MEMORY ZONE
> >
> > One way to avoid the problem of random kernel subsystems using
> > the CAPI device's memory is to create a new memory zone for
> > this purpose. This would add something like ZONE_DEVMEM to the
> > current set that includes ZONE_DMA, ZONE_NORMAL, and ZONE_MOVABLE.
> > Currently, there are a maximum of four zones, so this limit must
> > either be increased or kernels built with ZONE_DEVMEM must avoid
> > having more than one of ZONE_DMA, ZONE_DMA32, and ZONE_HIGHMEM.
> >
> > This approach requires that migration be implemented on the side,
> > as the CONFIG_NUMA_BALANCING will not help here (unless I am
> > missing something). One advantage of this situation is that
> > hardware locality measurements could be incorporated from the
> > beginning. Another advantage is that random kernel subsystems
> > and user programs would not get CAPI device memory unless they
> > explicitly requested it.
> >
> > Code would be needed at boot time to place the CAPI device
> > memory into ZONE_DEVMEM, perhaps involving changes to
> > mem_init() and paging_init().
> >
> > In addition, an appropriate GFP_DEVMEM would be needed, along
> > with code in various paths to handle it appropriately.
> >
> > Also, because large NUMA systems will sometimes interleave the
> > addresses of blocks of physical memory and device memory,
> > support for discontiguous interleaved zones will be required.
>
>
> Zone and numa node should be orthogonal in my mind, even if most of the
> different zone (DMA, DMA32, NORMAL) always endup being on the same node.
> Zone is really the outcome of some "old" hardware restriction (32bits
> brave old world). So zone most likely require some work to face reality
> of today world. While existing zone need to keep their definition base
> on physical address, the zone code should not care about that, effectively
> allowing zone that have several different chunk of physical address range.
> I also believe that persistant memory might have same kind of requirement
> so you might be able to piggy back on any work they might have to do, or
> at least work i believe they need to do.
>
> But i have not look into all that code much and i might just be dreaming
> about how the world should be and some subtle details is likely escaping
> me.
I believe that some substantial changes to zones would be required. So it
might be that some other approach would be better. But until we come
up with an alternative, I am thinking along the lines of changes to zones.
Thanx, Paul
On Tue, Apr 21, 2015 at 07:50:02PM -0500, Christoph Lameter wrote:
> On Tue, 21 Apr 2015, Jerome Glisse wrote:
[ . . . ]
> > Paul is working on a platform that is more advance that the one HMM try
> > to address and i believe the x86 platform will not have functionality
> > such a CAPI, at least it is not part of any roadmap i know about for
> > x86.
>
> We will be one of the first users of Paul's Platform. Please do not do
> crazy stuff but give us a sane solution where we can control the
> hardware. No strange VM hooks that automatically move stuff back and forth
> please. If you do this we will have to disable them anyways because they
> would interfere with our needs to have the code not be disturbed by random
> OS noise. We need detailed control as to when and how we move data.
I completely agree that some critically important use cases, such as
yours, will absolutely require that the application explicitly choose
memory placement and have the memory stay there.
Requirement 2 was supposed to be getting at this by saying "explicitly
or implicitly allocated", with the "explicitly" calling out your use
case. How should I reword this to better bring this out?
Thanx, Paul
On Wed, Apr 22, 2015 at 11:01:26AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2015-04-21 at 19:50 -0500, Christoph Lameter wrote:
>
> > With a filesystem the migration can be controlled by the application. It
> > can copy stuff whenever it wants to.Having the OS do that behind my back
> > is not something that feels safe and secure.
>
> But this is not something the user wants. The filesystem model is
> completely the wrong model for us.
>
> This is fundamentally the same model as memory migrating between NUMA
> nodes except that one of these is a co-processor with its local memory.
>
> You want to malloc() some stuff or get a pointer provided by an app to
> your library and be able to farm that job out to the co-processor. No
> filesystem in the picture here.
I updated the document based on feedback thus far, and a big "thank you"
to everyone! Diffs below, followed by the full document.
Thanx, Paul
------------------------------------------------------------------------
diff --git a/DeviceMem.txt b/DeviceMem.txt
index e2d65d585f03..cdedf2ee96e9 100644
--- a/DeviceMem.txt
+++ b/DeviceMem.txt
@@ -48,6 +48,25 @@
The purpose of this document is to explore how this access
and migration can be provided for within the Linux kernel.
+
+USE CASES
+
+ o GPGPU matrix operations, from Jerome Glisse.
+ https://lkml.org/lkml/2015/4/21/898
+
+ Suppose that you have an application that uses a
+ scientific library to do matrix computations, and that
+ this application simply calls malloc() and give the
+ resulting pointer to the library function. If the GPGPU
+ has coherent access to system memory (and vice versa),
+ it would help performance and application compatibility
+ to be able to transparently migrate the malloc()ed
+ memory to and from the GPGPU's memory without requiring
+ changes to the application.
+
+ o (More here for CAPI.)
+
+
REQUIREMENTS
1. It should be possible to remove a given CCAD device
@@ -132,6 +151,9 @@ POTENTIAL IDEAS
4. Your idea here!
+The following sections cover AutoNUMA, use of memory zones, and DAX.
+
+
AUTONUMA
The Linux kernel's autonuma facility supports migrating both
@@ -178,6 +200,10 @@ AUTONUMA
the memory would be temporarily inaccessible -- which could be
a fatal surprise to that kernel subsystem.
+ Jerome Glisse suggests that usermode hints are quite important,
+ and perhaps should replace any AutoNUMA measurements.
+
+
MEMORY ZONE
One way to avoid the problem of random kernel subsystems using
@@ -206,3 +232,26 @@ MEMORY ZONE
Also, because large NUMA systems will sometimes interleave the
addresses of blocks of physical memory and device memory,
support for discontiguous interleaved zones will be required.
+
+
+DAX
+
+ DAX is a mechanism for providing direct-memory access to
+ high-speed non-volatile (AKA "persistent") memory. Good
+ introductions to DAX may be found in the following LWN
+ articles:
+
+ https://lwn.net/Articles/591779/
+ https://lwn.net/Articles/610174/
+
+ DAX provides filesystem-level access to persistent memory.
+ One important CCAD use case is allowing a legacy application
+ to pass memory from malloc() to a CCAD device, and having
+ the allocated memory migrate as needed. DAX does not seem to
+ support this use case.
+
+
+ACKNOWLEDGMENTS
+
+ Updates to this document include feedback from Christoph Lameter
+ and Jerome Glisse.
------------------------------------------------------------------------
COHERENT ON-DEVICE MEMORY: ACCESS AND MIGRATION
Ben Herrenschmidt
(As told to Paul E. McKenney)
Special-purpose hardware is becoming more prevalent, and some of this
hardware allows for tight interaction with CPU-based processing.
For example, IBM's coherent accelerator processor interface
(CAPI) will allow this sort of device to be constructed,
and it is likely that GPGPUs will need similar capabilities.
(See http://www-304.ibm.com/webapp/set2/sas/f/capi/home.html for a
high-level description of CAPI.) Let's call these cache-coherent
accelerator devices (CCAD for short, which should at least
motivate someone to come up with something better).
This document covers devices with the following properties:
1. The device is cache-coherent, in other words, the device's
memory has all the characteristics of system memory from
the viewpoint of CPUs and other devices accessing it.
2. The device provides local memory that it has high-bandwidth
low-latency access to, but the device can also access
normal system memory.
3. The device shares system page tables, so that it can
transparently access userspace virtual memory, regardless
of whether this virtual memory maps to normal system
memory or to memory local to the device.
Although such a device will provide CPUs with cache-coherent
access to on-device memory, the resulting memory latency is
expected to be higher than that of the normal memory that is tightly
coupled to the CPUs. Nevertheless, data that is only occasionally
accessed by CPUs should be stored in the device's memory.
On the other hand, data that is accessed rarely by the device but
frequently by the CPUs should be stored in normal system memory.
Of course, some workloads will have predictable access patterns
that allow data to be optimally placed up front. However, other
workloads will have less-predictable access patterns, and these
workloads can benefit from automatic migration of data between
device memory and system memory as access patterns change.
Furthermore, some devices will provide special hardware that
collects access statistics that can be used to determine whether
or not a given page of memory should be migrated, and if so,
to where.
The purpose of this document is to explore how this access
and migration can be provided for within the Linux kernel.
USE CASES
o GPGPU matrix operations, from Jerome Glisse.
https://lkml.org/lkml/2015/4/21/898
Suppose that you have an application that uses a
scientific library to do matrix computations, and that
this application simply calls malloc() and gives the
resulting pointer to the library function. If the GPGPU
has coherent access to system memory (and vice versa),
it would help performance and application compatibility
to be able to transparently migrate the malloc()ed
memory to and from the GPGPU's memory without requiring
changes to the application, as sketched just after this
list.
o (More here for CAPI.)
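
For concreteness, here is a minimal sketch of the GPGPU use
case above.  lib_matrix_multiply() is a made-up stand-in for
whatever routine an accelerated library might export (shown
here with a plain CPU implementation); the point is that the
application only ever calls malloc() and the library function:

#include <stdlib.h>

/*
 * Stand-in for an accelerated library routine.  A GPGPU-enabled
 * build of the library could operate on the very same malloc()ed
 * buffers, migrating them to device memory and back, without the
 * caller changing at all.
 */
static void lib_matrix_multiply(const double *a, const double *b,
                                double *c, size_t n)
{
        for (size_t i = 0; i < n; i++)
                for (size_t j = 0; j < n; j++) {
                        double sum = 0.0;

                        for (size_t k = 0; k < n; k++)
                                sum += a[i * n + k] * b[k * n + j];
                        c[i * n + j] = sum;
                }
}

int main(void)
{
        size_t n = 256;
        double *a = calloc(n * n, sizeof(*a));
        double *b = calloc(n * n, sizeof(*b));
        double *c = calloc(n * n, sizeof(*c));

        if (!a || !b || !c)
                return 1;
        /* ... fill a and b ... */
        lib_matrix_multiply(a, b, c, n);  /* may run on the GPGPU */
        free(a);
        free(b);
        free(c);
        return 0;
}
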
REQUIREMENTS
1. It should be possible to remove a given CCAD device
from service, for example, to reset it, to download
updated firmware, or to change its functionality.
This results in the following additional requirements:
a. It should be possible to migrate all data away
from the device's memory at any time.
b. Normal memory allocation should avoid using the
device's memory, as this would interfere
with the needed migration. It may nevertheless
be desirable to use the device's memory
if system memory is exhausted; however, in some
cases, even this "emergency" use is best avoided.
In fact, a good solution will provide some means
for avoiding this for those cases where it is
necessary to evacuate memory when offlining the
device.
2. Memory can be either explicitly or implicitly allocated
from the CCAD device's memory. (Both usermode and kernel
allocation required.)
Please note that implicit allocation will need to be
avoided in a number of use cases. The reason for this
is that random kernel allocations might be pinned into
memory, which could conflict with requirement (1) above,
and might furthermore fragment the device's memory.
3. The device's memory is treated like normal system
memory by the Linux kernel, for example, each page has a
"struct page" associate with it. (In contrast, the
traditional approach has used special-purpose OS mechanisms
to manage the device's memory, and this memory was treated
as MMIO space by the kernel.)
4. The system's normal tuning mechanism may be used to
tune allocation locality, migration, and so on, as
required to match performance and functional requirements.
POTENTIAL IDEAS
It is only reasonable to ask whether CCAD devices can simply
use the HMM patch that has recently been proposed to allow
migration between system and device memory via page faults.
Although this works well for devices whose local MMU can contain
mappings different from that of the system MMU, the HMM patch
is still working with MMIO space that gets special treatment.
The HMM patch does not (yet) provide the full transparency that
would allow the device memory to be treated in the same way as
system memory. Something more is therefore required, for example,
one or more of the following:
1. Model the CCAD device's memory as a memory-only NUMA node
with a very large distance metric. This allows use of
the existing mechanisms for choosing where to satisfy
explicit allocations and where to target migrations.
2. Cover the memory with a CMA to prevent non-migratable
pinned data from being placed in the CCAD device's memory.
It would also permit the driver to perform dedicated
physically contiguous allocations as needed.
3. Add a new ZONE_EXTERNAL zone for all CCAD-like devices.
Note that this would likely require support for
discontinuous zones in order to support large NUMA
systems, in which each node has a single block of the
overall physical address space. In such systems, the
physical address ranges of normal system memory would
be interleaved with those of device memory.
This would also require some sort of
migration infrastructure to be added, as autonuma would
not apply. However, this approach has the advantage
of preventing allocations in these regions, at least
unless those allocations have been explicitly flagged
to go there.
4. Your idea here!
The following sections cover AutoNUMA, use of memory zones, and DAX.
AUTONUMA
The Linux kernel's autonuma facility supports migrating both
memory and processes to promote NUMA memory locality. It was
accepted into 3.13 and is available in RHEL 7.0 and SLES 12.
It is enabled by the Kconfig variable CONFIG_NUMA_BALANCING.
This approach uses a kernel thread "knuma_scand" that periodically
marks pages inaccessible. The page-fault handler notes any
mismatches between the NUMA node that the process is running on
and the NUMA node on which the page resides.
http://lwn.net/Articles/488709/
https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf
It will be necessary to set up the CCAD device's memory as
a very distant NUMA node, and the architecture-specific
__numa_distance() function can be used for this purpose.
There is a RECLAIM_DISTANCE macro that can be set by the
architecture to prevent reclaiming from nodes that are too
far away. Some experimentation would be required to determine
the combination of values for the various distance macros.
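
As an illustration only, the intent is roughly the following.
The node number, the distance value, and the
ccad_arch_set_distance() helper are all hypothetical; a real
implementation would hook into whatever backs node_distance()
on the architecture in question.

#include <linux/init.h>
#include <linux/nodemask.h>

#define CCAD_NID        4       /* hypothetical memory-only device node */
#define CCAD_DISTANCE   160     /* "very far" vs. LOCAL_DISTANCE (10) */

/*
 * Hypothetical boot-time hook: record a very large distance between
 * the CCAD node and every other node so that default allocations,
 * reclaim, and NUMA balancing stay away from device memory.
 * ccad_arch_set_distance() does not exist; it stands in for the
 * architecture's NUMA-distance-table setup.
 */
static void __init ccad_setup_node_distance(void)
{
        int nid;

        for_each_online_node(nid) {
                if (nid == CCAD_NID)
                        continue;
                ccad_arch_set_distance(nid, CCAD_NID, CCAD_DISTANCE);
                ccad_arch_set_distance(CCAD_NID, nid, CCAD_DISTANCE);
        }
}
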
This approach needs some way to pull in data from the hardware
on access patterns. Aneesh Kk Veetil is prototyping an approach
based on Power 8 hardware counters. This data will need to be
plugged into the migration algorithm, which is currently based
on collecting information from page faults.
Finally, the contiguous memory allocator (CMA, see
http://lwn.net/Articles/486301/) is needed in order to prevent
the kernel from placing non-migratable allocations in the CCAD
device's memory. This would need to be of type MIGRATE_CMA to
ensure that all memory taken from that range be migratable.
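
A rough sketch of the boot-time reservation follows; devmem_base
and devmem_size are hypothetical values that the CCAD driver or
firmware would supply, and cma_declare_contiguous() is shown with
its circa-4.0 signature:

#include <linux/cma.h>
#include <linux/init.h>
#include <linux/printk.h>

static struct cma *ccad_cma;

/*
 * Cover the device's physical range with a CMA area so that only
 * movable (MIGRATE_CMA) allocations can land there.
 */
static int __init ccad_cma_reserve(phys_addr_t devmem_base,
                                   phys_addr_t devmem_size)
{
        int ret;

        ret = cma_declare_contiguous(devmem_base, devmem_size,
                                     0,         /* no upper limit      */
                                     0, 0,      /* default alignment   */
                                     true,      /* fixed: exactly here */
                                     &ccad_cma);
        if (ret)
                pr_err("ccad: CMA reservation failed (%d)\n", ret);
        return ret;
}
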
The result would be that the kernel would allocate only migratable
pages within the CCAD device's memory, and even then only if
memory was otherwise exhausted. Normal CONFIG_NUMA_BALANCING
migration could be brought to bear, possibly enhanced with
information from hardware counters. One remaining issue is that
there is no way to absolutely prevent random kernel subsystems
from allocating the CCAD device's memory, which could cause
failures should the device need to reset itself, in which case
the memory would be temporarily inaccessible -- which could be
a fatal surprise to that kernel subsystem.
Jerome Glisse suggests that usermode hints are quite important,
and perhaps should replace any AutoNUMA measurements.
MEMORY ZONE
One way to avoid the problem of random kernel subsystems using
the CAPI device's memory is to create a new memory zone for
this purpose. This would add something like ZONE_DEVMEM to the
current set that includes ZONE_DMA, ZONE_NORMAL, and ZONE_MOVABLE.
Currently, there are a maximum of four zones, so this limit must
either be increased or kernels built with ZONE_DEVMEM must avoid
having more than one of ZONE_DMA, ZONE_DMA32, and ZONE_HIGHMEM.
This approach requires that migration be implemented on the side,
as the CONFIG_NUMA_BALANCING will not help here (unless I am
missing something). One advantage of this situation is that
hardware locality measurements could be incorporated from the
beginning. Another advantage is that random kernel subsystems
and user programs would not get CAPI device memory unless they
explicitly requested it.
Code would be needed at boot time to place the CAPI device
memory into ZONE_DEVMEM, perhaps involving changes to
mem_init() and paging_init().
In addition, an appropriate GFP_DEVMEM would be needed, along
with code in various paths to handle it appropriately.
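
Neither ZONE_DEVMEM nor GFP_DEVMEM exists today, so the following
is purely a sketch of the shape such an interface might take; the
flag value and the zone wiring are invented for illustration:

/* include/linux/mmzone.h (hypothetical addition, existing zones condensed) */
enum zone_type {
        ZONE_DMA,
        ZONE_DMA32,
        ZONE_NORMAL,
        ZONE_MOVABLE,
        ZONE_DEVMEM,            /* CCAD device memory */
        __MAX_NR_ZONES
};

/* include/linux/gfp.h (hypothetical addition) */
#define __GFP_DEVMEM    ((__force gfp_t)0x8000000u)     /* target ZONE_DEVMEM */
#define GFP_DEVMEM      (GFP_HIGHUSER_MOVABLE | __GFP_DEVMEM)

/* A driver explicitly requesting a page of device memory: */
static struct page *ccad_alloc_devmem_page(void)
{
        return alloc_pages(GFP_DEVMEM, 0);
}
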
Also, because large NUMA systems will sometimes interleave the
addresses of blocks of physical memory and device memory,
support for discontiguous interleaved zones will be required.
DAX
DAX is a mechanism for providing direct, page-cache-bypassing
access to high-speed non-volatile (AKA "persistent") memory. Good
introductions to DAX may be found in the following LWN
articles:
https://lwn.net/Articles/591779/
https://lwn.net/Articles/610174/
DAX provides filesystem-level access to persistent memory.
One important CCAD use case is allowing a legacy application
to pass memory from malloc() to a CCAD device, and having
the allocated memory migrate as needed. DAX does not seem to
support this use case.
ACKNOWLEDGMENTS
Updates to this document include feedback from Christoph Lameter
and Jerome Glisse.
On Tue, 21 Apr 2015, Paul E. McKenney wrote:
> Ben will correct me if I am wrong, but I do not believe that we are
> looking for persistent memory in this case.
DAX is a way of mapping special memory into user space. Persistence is one
possible use case. It's like the XIP that you IBMers know from z/OS
or the 390 mainframe stuff.
It's been widely discussed at memory management meetings. A bit surprised
that this is not a well-known thing.
On Wed, 22 Apr 2015, Benjamin Herrenschmidt wrote:
> Right, it doesn't look at all like what we want.
It's definitely a way to map memory that is outside of the kernel-managed
pool into a user-space process. For that matter, any device driver could be
doing this as well. The point is that we already have a plethora of features
to do this. Putting new requirements on the already
warped-and-screwed-up-beyond-all-hope zombie of a page allocator that we
have today is not the way to do this. In particular, what I have heard
repeatedly is that we do not want kernel structures allocated there, but
then we still want to use this because we want malloc support in
libraries. The memory has different performance characteristics (for
starters; there may be lots of other issues depending on the device), so we
just add a NUMA "node" with extremely high distance.
There are hooks in glibc where you can replace the memory
management of the apps if you want that.
On Wed, 22 Apr 2015, Paul E. McKenney wrote:
> I completely agree that some critically important use cases, such as
> yours, will absolutely require that the application explicitly choose
> memory placement and have the memory stay there.
Most of what you are trying to do here is already there and has been done.
GPU memory is accessible. NICs work, etc., etc. All without CAPI. What
exactly are the benefits of CAPI? Is it driver simplification? Reduction of
overhead? If so, then the measures proposed are a bit radical and
may result in just the opposite.
For my use cases the advantage of CAPI lies in the reduction of latency
for coprocessor communication. I hope that CAPI will allow fast
cache-to-cache transactions between a coprocessor and the main one,
improving the ability to exchange data rapidly between an application
and some piece of hardware (NIC, GPU, custom hardware, etc.).
Fundamentally this is currently a design issue, since CAPI is running on
top of PCI-E, and PCI-E transactions establish a minimum latency that
cannot be avoided. So it's hard to see how CAPI can improve the situation.
The new things about CAPI are the cache-to-cache transactions and
participation in cache coherency at the cacheline level. That is a
different approach from the device-memory-oriented PCI transactions.
Perhaps even CAPI over PCI-E can improve the situation there (maybe the
transactions are lower latency than going to device memory), and hopefully
CAPI will not forever be bound to PCI-E and thus at some point shake off
the shackles of a bus designed by a competitor.
On Wed, Apr 22, 2015 at 10:25:37AM -0500, Christoph Lameter wrote:
> On Wed, 22 Apr 2015, Benjamin Herrenschmidt wrote:
>
> > Right, it doesn't look at all like what we want.
>
> Its definitely a way to map memory that is outside of the kernel managed
> pool into a user space process. For that matter any device driver could be
> doing this as well. The point is that we already have pletora of features
> to do this. Putting new requirements on the already
> warped-and-screwed-up-beyond-all-hope zombie of a page allocator that we
> have today is not the way to do this. In particular what I have head
> repeatedly is that we do not want kernel structures alllocated there but
> then we still want to use this because we want malloc support in
> libraries. The memory has different performance characteristics (for
> starters there may be lots of other isssues depending on the device) so we
> just add a NUMA "node" with estremely high distance.
>
> There are hooks in glibc where you can replace the memory
> management of the apps if you want that.
Glibc hooks will not work. This is about having the same address space on
the CPU and the GPU/accelerator while allowing the backing memory to be
regular system memory or device memory, all in a manner that is transparent
to userspace programs and libraries.
You also have to think about things like mmap()ed files. Say you have a
big file on disk and you want to crunch numbers from its data; you do
not want to copy it. Instead, you want to do the usual mmap() and just
have the device driver migrate pages to device memory (how the device
driver makes that decision is a different problem, and it can be entirely
left to the userspace application, or there can be a heuristic, or both).
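
Roughly, from the application's side it is no more than the following
(crunch_numbers() is a made-up library entry point):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

extern void crunch_numbers(const double *data, size_t n);	/* hypothetical */

/*
 * Map a big data file and hand the mapping straight to the library; the
 * device driver would then be free to migrate the hot pages to device
 * memory while the accelerator works on them, and back afterward.
 */
int process_file(const char *path)
{
        struct stat st;
        double *data;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return -1;
        if (fstat(fd, &st) < 0) {
                close(fd);
                return -1;
        }
        data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) {
                close(fd);
                return -1;
        }
        crunch_numbers(data, st.st_size / sizeof(*data));
        munmap(data, st.st_size);
        close(fd);
        return 0;
}
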
Glibc hooks do not work with shared memory either, and again this is
a use case we care about. You really have to think in terms of letting
today's applications start using those accelerators without the
applications even knowing about it.
So you would not know beforehand what will end up being used by the
GPU/accelerator and thus would need to be allocated from special memory.
We do not want today's model of using GPUs; we want to provide tomorrow's
infrastructure for using GPUs in a transparent way.
I understand that the application you care about wants to be clever
and can make better decisions, and we intend to support that, but this
does not need to be at the expense of all the other applications.
Like I said numerous times, the decision to migrate memory is a device
driver decision, and how the device driver makes that decision can
be entirely controlled by userspace through a proper device driver API.
The NUMA idea is interesting for applications that do not know about
this and do not need to know. It would allow heuristics inside
the kernel, under the control of the device driver, that could be
disabled by applications that know better.
Bottom line: we want today's anonymous, shared, or file-mapped memory
to stay the only kinds of memory that exist, and we want to choose the
backing store of each of those kinds for better placement depending
on how the memory is used (again, this can be under the total control of
the application). But we do not want to introduce a third kind of
disjoint memory to userspace; that is today's situation, and we want
to move forward to tomorrow's solution.
Cheers,
Jérôme
On Wed, Apr 22, 2015 at 11:16:49AM -0500, Christoph Lameter wrote:
> On Wed, 22 Apr 2015, Paul E. McKenney wrote:
>
> > I completely agree that some critically important use cases, such as
> > yours, will absolutely require that the application explicitly choose
> > memory placement and have the memory stay there.
>
>
>
> Most of what you are trying to do here is already there and has been done.
> GPU memory is accessible. NICs work etc etc. All without CAPI. What
> exactly are the benefits of CAPI? Is driver simplification? Reduction of
> overhead? If so then the measures proposed are a bit radical and
> may result in just the opposite.
>
No, what Paul is trying to do, and what I am trying to do with HMM, does
not exist. This is sharing an address space between the CPU and the
GPU/accelerator while leveraging the GPU's local memory transparently at
the same time. In today's world, GPUs have a different address space, and
complex data structures like lists or trees need to be replicated across
the different address spaces. You might not care about this, but for a lot
of applications it is a show stopper, and the outcome is that using a GPU
is too complex because of it.
Now if you have the exact same address space, then the structures you have
on the CPU are viewed in exactly the same way on the GPU, and you can start
porting libraries to leverage the GPU without having to change a single
line of code inside many, many applications. It is also a lot easier to
debug things, as you do not have to struggle with two distinct address
spaces.
Finally, transparently leveraging the local GPU memory is the only way to
reach the full potential of the GPU. GPUs are all about bandwidth, and GPU
local memory has bandwidth far greater than any system memory I know
about. Here again, the win comes if you can transparently leverage this
memory without the application ever needing to know about such subtleties.
But again, let me stress that applications that want to be in control will
stay in control. If you want to make the decision yourself about where
things should end up, then nothing in what we are proposing will preclude
you from doing that. Please just think about other people's applications,
not just yours; there are a lot of other things in the world, and they do
not want to be as close to the metal as you want to be. We just want to
accommodate the largest number of use cases.
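
(And nothing prevents an application that knows better from pinning its
buffers explicitly today, for example with the existing libnuma/mbind()
interface; the node number below is obviously system-specific and the
helper is just a sketch:)

#include <numaif.h>     /* mbind(); link with -lnuma */
#include <stdlib.h>

/*
 * Bind a buffer to host node 0 so that no migration heuristic will move
 * it elsewhere.  Node numbers are system-specific, and size should be a
 * multiple of the page size so the whole buffer is covered.
 */
static void *alloc_pinned_to_host(size_t size)
{
        unsigned long nodemask = 1UL << 0;      /* host node 0 */
        void *buf = aligned_alloc(4096, size);

        if (buf && mbind(buf, size, MPOL_BIND, &nodemask,
                         sizeof(nodemask) * 8, 0) != 0) {
                free(buf);
                return NULL;
        }
        return buf;
}
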
Cheers,
Jérôme
On Wed, 22 Apr 2015, Jerome Glisse wrote:
> Glibc hooks will not work, this is about having same address space on
> CPU and GPU/accelerator while allowing backing memory to be regular
> system memory or device memory all this in a transparent manner to
> userspace program and library.
If you control the address space used by malloc and provide your own
implementation then I do not see why this would not work.
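
For what it is worth, the kind of interposition being described might
look something like the following sketch.  ccad_pool_alloc() and
ccad_pool_free() are hypothetical driver-provided calls, and a real
interposer would also need to cover calloc(), realloc(),
posix_memalign(), and friends:

#include <stddef.h>

extern void *ccad_pool_alloc(size_t size);      /* hypothetical */
extern void ccad_pool_free(void *ptr);          /* hypothetical */

/*
 * Built as a shared object and injected with LD_PRELOAD, this routes an
 * unmodified application's heap allocations to a device-managed pool.
 */
void *malloc(size_t size)
{
        return ccad_pool_alloc(size);
}

void free(void *ptr)
{
        ccad_pool_free(ptr);
}
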
> You also have to think at things like mmaped file, let say you have a
> big file on disk and you want to crunch number from its data, you do
> not want to copy it, instead you want to to the usual mmap and just
> have device driver do migration to device memory (how device driver
> make the decision is a different problem and this can be entirely
> leave to the userspace application or their can be heuristic or both).
If the data is on disk then you cannot access it. If it's in the page cache
or in the device then you can mmap it. Not sure how you could avoid a copy
unless the device can read directly from disk via another controller.
> Glibc hooks do not work with share memory either and again this is
> a usecase we care about. You really have to think of let's have today
> applications start using those accelerators without the application
> even knowing about it.
Applications always have to be reworked. This does not look like a
high-performance solution, but some sort of emulation for legacy code? HPC
codes are mostly written to the hardware, and they will be modified as
needed to get the maximum performance that the hardware will permit.
> So you would not know before hand what will end up being use by the
> GPU/accelerator and would need to be allocated from special memory.
> We do not want today model of using GPU, we want to provide tomorrow
> infrastructure for using GPU in a transparent way.
Urm... Then provide hardware that actually gives you a performance
benefit instead of proposing some weird software solution that
makes old software work? Transparency with the randomly varying latencies
that you propose will kill the performance of MPI jobs as well as make the
system unusable for financial applications. This seems to be wrong all
around.
> I understand that the application you care about wants to be clever
> and can make better decission and we intend to support that, but this
> does not need to be at the expense of all the others applications.
> Like i said numerous time the decission to migrate memory is a device
> driver decission and how the device driver make that decission can
> be entirely control by userspace through proper device driver API.
What applications would be using this? HPC probably not, given the
sensitivity to random latencies. Hadoop-style stuff?
> Bottom line is we want today anonymous, share or file mapped memory
> to stay the only kind of memory that exist and we want to choose the
> backing store of each of those kind for better placement depending
> on how memory is use (again which can be in the total control of
> the application). But we do not want to introduce a third kind of
> disjoint memory to userspace, this is today situation and we want
> to move forward to tomorrow solution.
Frankly, I do not see any benefit here, nor a use case, and I wonder who
would adopt this. The future requires higher performance and not more
band-aids.
On Wed, 22 Apr 2015, Jerome Glisse wrote:
> Now if you have the exact same address space then structure you have on
> the CPU are exactly view in the same way on the GPU and you can start
> porting library to leverage GPU without having to change a single line of
> code inside many many many applications. It is also lot easier to debug
> things as you do not have to strungly with two distinct address space.
Right. That already works. Note, however, that GPU programming is a bit
different. Saying that the same code runs on the GPU is a strong
simplification. Any effective GPU code still requires a lot of knowledge to
make it work in a highly performant way.
The two distinct address spaces can be controlled already via a number of
mechanisms and there are ways from either side to access the other one.
This includes mmapping areas from the other side.
If you really want this then you should even be able to write a shared
library that does this.
> Finaly, leveraging transparently the local GPU memory is the only way to
> reach the full potential of the GPU. GPU are all about bandwidth and GPU
> local memory have bandwidth far greater than any system memory i know
> about. Here again if you can transparently leverage this memory without
> the application ever needing to know about such subtlety.
Well if you do this transparently then the GPU may not have access to its
data when it needs it. You are adding demand paging to the GPUs? The
performance would suffer significantly. AFAICT GPUs are not designed to
work like that and would not have optimal performance with such an
approach.
> But again let me stress that application that want to be in control will
> stay in control. If you want to make the decission yourself about where
> things should end up then nothing in all we are proposing will preclude
> you from doing that. Please just think about others people application,
> not just yours, they are a lot of others thing in the world and they do
> not want to be as close to the metal as you want to be. We just want to
> accomodate the largest number of use case.
What I think you want to do is to automate something that should not be
automated and cannot be automated for performance reasons. Anyone
wanting performance (and that is the prime reason to use a GPU) would
switch this off because the latencies are otherwise not controllable and
may impact performance severely. There are typically multiple
parallel strands of execution that must execute with similar performance
in order to allow a data exchange at defined intervals. That is no longer
possible if you add the variances that come with the "transparency" here.
On Wed, Apr 22, 2015 at 01:17:58PM -0500, Christoph Lameter wrote:
> On Wed, 22 Apr 2015, Jerome Glisse wrote:
>
> > Now if you have the exact same address space then structure you have on
> > the CPU are exactly view in the same way on the GPU and you can start
> > porting library to leverage GPU without having to change a single line of
> > code inside many many many applications. It is also lot easier to debug
> > things as you do not have to strungly with two distinct address space.
>
> Right. That already works. Note however that GPU programming is a bit
> different. Saying that the same code runs on the GPU is strong
> simplification. Any effective GPU code still requires a lot of knowlege to
> make it work in a high performant way.
>
> The two distinct address spaces can be controlled already via a number of
> mechanisms and there are ways from either side to access the other one.
> This includes mmapping areas from the other side.
I believe that the two of you are talking about two distinct but closely
related use cases. Christoph wants full performance, and is willing to
put quite a bit of development effort into getting the last little bit.
Jerome is looking to get most of the performance, but where modifications
are limited to substituting a different library.
> If you really want this then you should even be able to write a shared
> library that does this.
From what I can see, this is indeed Jerome's goal, but he needs to be
able to do this without having to go through the program and work out
which malloc() calls should work as before and which should allocate
from device memory.
> > Finaly, leveraging transparently the local GPU memory is the only way to
> > reach the full potential of the GPU. GPU are all about bandwidth and GPU
> > local memory have bandwidth far greater than any system memory i know
> > about. Here again if you can transparently leverage this memory without
> > the application ever needing to know about such subtlety.
>
> Well if you do this transparently then the GPU may not have access to its
> data when it needs it. You are adding demand paging to the GPUs? The
> performance would suffer significantly. AFAICT GPUs are not designed to
> work like that and would not have optimal performance with such an
> approach.
Agreed, the use case that Jerome is thinking of differs from yours.
You would not (and should not) tolerate things like page faults because
it would destroy your worst-case response times. I believe that Jerome
is more interested in throughput with minimal change to existing code.
> > But again let me stress that application that want to be in control will
> > stay in control. If you want to make the decission yourself about where
> > things should end up then nothing in all we are proposing will preclude
> > you from doing that. Please just think about others people application,
> > not just yours, they are a lot of others thing in the world and they do
> > not want to be as close to the metal as you want to be. We just want to
> > accomodate the largest number of use case.
>
> What I think you want to do is to automatize something that should not be
> automatized and cannot be automatized for performance reasons. Anyone
> wanting performance (and that is the prime reason to use a GPU) would
> switch this off because the latencies are otherwise not controllable and
> those may impact performance severely. There are typically multiple
> parallel strands of executing that must execute with similar performance
> in order to allow a data exchange at defined intervals. That is no longer
> possible if you add variances that come with the "transparency" here.
Let's suppose that you and Jerome were using GPGPU hardware that had
32,768 hardware threads. You would want very close to 100% of the full
throughput out of the hardware with pretty much zero unnecessary latency.
In contrast, Jerome might be OK with (say) 20,000 threads worth of
throughput with the occasional latency hiccup.
And yes, support for both use cases is needed.
Thanx, Paul
On Wed, Apr 22, 2015 at 12:14:50PM -0500, Christoph Lameter wrote:
> On Wed, 22 Apr 2015, Jerome Glisse wrote:
>
> > Glibc hooks will not work, this is about having same address space on
> > CPU and GPU/accelerator while allowing backing memory to be regular
> > system memory or device memory all this in a transparent manner to
> > userspace program and library.
>
> If you control the address space used by malloc and provide your own
> implementation then I do not see why this would not work.
mmap()ed files, shared memory, anonymous memory allocated outside the
control of the library that wants to use the GPU. I keep repeating myself;
I dunno what words are wrong.
>
> > You also have to think at things like mmaped file, let say you have a
> > big file on disk and you want to crunch number from its data, you do
> > not want to copy it, instead you want to to the usual mmap and just
> > have device driver do migration to device memory (how device driver
> > make the decision is a different problem and this can be entirely
> > leave to the userspace application or their can be heuristic or both).
>
> If the data is on disk then you cannot access it. If its in the page cache
> or in the device then you can mmap it. Not sure how you could avoid a copy
> unless the device can direct read from disk via another controller.
>
Page cache pages are allocated by the kernel; how do you propose we map
them to the device transparently without touching a single line of
kernel code?
Moreover, yes, there are disks where you can directly map each disk page to
the device without ever allocating a page and copying the data (some SSDs
on PCIe devices allow that).
> > Glibc hooks do not work with share memory either and again this is
> > a usecase we care about. You really have to think of let's have today
> > applications start using those accelerators without the application
> > even knowing about it.
>
> Applications always have to be reworked. This does not look like a high
> performance solution but some sort way of emulation for legacy code? HPC
> codes are mostly written to the hardware and they will be modified as
> needed to use maximum performance that the hardware will permit.
No, applications do not need to be rewritten, and that is the point I am
trying to get across and that you keep denying. Many applications use a
library to perform scientific computation; this is very common, and you
only need to port the library. In today's world, if you want to leverage
the GPU you have to copy all the data the application submits to the
library. Only the people writing the library would need to know about
efficient algorithms for the GPU, and the application can be left alone,
ignoring all the gory details.
Now, with the solution we are proposing, there will be no copy; the
malloc()ed memory of the application will be accessible to the GPU
transparently. This is not the case today. Today you need to use a
specialized allocator if you want to use the same kind of address space.
We want to move away from that model. What is it you do not understand
here?
>
> > So you would not know before hand what will end up being use by the
> > GPU/accelerator and would need to be allocated from special memory.
> > We do not want today model of using GPU, we want to provide tomorrow
> > infrastructure for using GPU in a transparent way.
>
> Urm... Then provide hardware that actually givse you a performance
> benefit instead of proposing some weird software solution that
> makes old software work? Transparency with the random varying latencies
> that you propose will kill performance of MPI jobs as well as make the
> system unusable for financial applications. This seems be wrong all
> around.
I have repeated numerous times that what is proposed here will not impede
your precious low-latency workloads in any way; on the contrary, it will
benefit you. Isn't it easier to debug an application where you do not need
a different interpretation of a pointer value depending on whether an
object is allocated for the GPU or for the CPU? Don't you think that
avoiding different address spaces is a benefit?
That you will not benefit from automatic memory migration is a given; I
have repeatedly acknowledged that point, but you just seem to ignore it. I
have also repeatedly said that what we propose will in no way forbid total
control by applications that want such control. So yes, you will not
benefit from NUMA migration, but you are not alone, and thousands of other
applications will benefit from it. Please stop seeing the world through
the only use case you know and care about.
>
> > I understand that the application you care about wants to be clever
> > and can make better decission and we intend to support that, but this
> > does not need to be at the expense of all the others applications.
> > Like i said numerous time the decission to migrate memory is a device
> > driver decission and how the device driver make that decission can
> > be entirely control by userspace through proper device driver API.
>
> What application would be using this? HPC probably not given the
> sensitivity to random latencies. Hadoop style stuff?
Again, think of any application that links against some library that can
benefit from the GPU, like https://www.gnu.org/software/gsl/ and countless
others. There is a whole world of applications that do not run on HPC
systems and that can benefit from this. Even a standard office suite, or
your mail client searching for a string inside your mail database, could
benefit.
It is a matter of enabling those applications to transparently use the
GPU in a way that does not require each of their programmers to deal with
separate address spaces or with the details of each GPU to know when to
migrate memory or not. Like I said, for those, proper heuristics will give
good results, and, again, your application can stay in total control
if it believes it will make better decisions.
>
> > Bottom line is we want today anonymous, share or file mapped memory
> > to stay the only kind of memory that exist and we want to choose the
> > backing store of each of those kind for better placement depending
> > on how memory is use (again which can be in the total control of
> > the application). But we do not want to introduce a third kind of
> > disjoint memory to userspace, this is today situation and we want
> > to move forward to tomorrow solution.
>
> Frankly, I do not see any benefit here, nor a use case and I wonder who
> would adopt this. The future requires higher performance and not more band
> aid.
Well, all I can tell you is that if you go to any conference where there
are people doing GPGPU, they will almost all tell you they would love a
unified address space. Why in hell do you think the OpenCL 2.0
specification makes that a cornerstone, with different levels of support,
the lowest level being what we have today using special allocators?
There is a whole industry out there spending billions of dollars on what
you call a band-aid. Don't you think they have a market for it?
Cheers,
Jérôme
On Wed, 2015-04-22 at 10:25 -0500, Christoph Lameter wrote:
> On Wed, 22 Apr 2015, Benjamin Herrenschmidt wrote:
>
> > Right, it doesn't look at all like what we want.
>
> Its definitely a way to map memory that is outside of the kernel managed
> pool into a user space process. For that matter any device driver could be
> doing this as well. The point is that we already have pletora of features
> to do this. Putting new requirements on the already
> warped-and-screwed-up-beyond-all-hope zombie of a page allocator that we
> have today is not the way to do this. In particular what I have head
> repeatedly is that we do not want kernel structures alllocated there but
> then we still want to use this because we want malloc support in
> libraries. The memory has different performance characteristics (for
> starters there may be lots of other isssues depending on the device) so we
> just add a NUMA "node" with estremely high distance.
>
> There are hooks in glibc where you can replace the memory
> management of the apps if you want that.
We don't control the app. Let's say we are doing a plugin for libfoo
which accelerates "foo" using GPUs.
Now some other app we have no control over uses libfoo. So pointers
already allocated/mapped, possibly a long time ago, will hit libfoo (or
the plugin) and we need GPUs to churn on the data.
The point I'm making is that you are arguing against a usage model which
has been repeatedly asked for by large numbers of customers (after all,
that's also why HMM exists).
We should focus on how to make this happen rather than trying to shovel a
*different* model, one that removes transparency from the equation, into
users' faces.
Ben.
On Wed, 2015-04-22 at 11:16 -0500, Christoph Lameter wrote:
> On Wed, 22 Apr 2015, Paul E. McKenney wrote:
>
> > I completely agree that some critically important use cases, such as
> > yours, will absolutely require that the application explicitly choose
> > memory placement and have the memory stay there.
>
>
>
> Most of what you are trying to do here is already there and has been done.
> GPU memory is accessible. NICs work etc etc. All without CAPI. What
> exactly are the benefits of CAPI? Is driver simplification? Reduction of
> overhead? If so then the measures proposed are a bit radical and
> may result in just the opposite.
They are, via MMIO space. The big differences here are that via CAPI the
memory can be fully cacheable and thus have the same characteristics as
normal memory from the processor's point of view, and that the device
shares the MMU with the host.
Practically what that means is that the device memory *is* just some
normal system memory with a larger distance. The NUMA model is an
excellent representation of it.
> For my use cases the advantage of CAPI lies in the reduction of latency
> for coprocessor communication. I hope that CAPI will allow fast cache to
> cache transactions between a coprocessor and the main one. This is
> improving the ability to exchange data rapidly between a application code
> and some piece of hardware (NIC, GPU, custom hardware etc etc)
>
> Fundamentally this is currently an design issue since CAPI is running on
> top of PCI-E and PCI-E transactions establish a minimum latency that
> cannot be avoided. So its hard to see how CAPI can improve the situation.
It's on top of the lower layers of PCIe, yes; I don't know the exact
latency numbers. It does enable the device to own cache lines, though,
and vice versa.
> The new thing about CAPI are the cache to cache transactions and
> participation in cache coherency at the cacheline level. That is a
> different approach than the device memory oriented PCI transcactions.
> Perhaps even CAPI over PCI-E can improve the situation there (maybe the
> transactions are lower latency than going to device memory) and hopefully
> CAPI will not forever be bound to PCI-E and thus at some point shake off
> the shackles of a bus designed by a competitor.
Ben.
On Wed, 2015-04-22 at 12:14 -0500, Christoph Lameter wrote:
>
> > Bottom line is we want today anonymous, share or file mapped memory
> > to stay the only kind of memory that exist and we want to choose the
> > backing store of each of those kind for better placement depending
> > on how memory is use (again which can be in the total control of
> > the application). But we do not want to introduce a third kind of
> > disjoint memory to userspace, this is today situation and we want
> > to move forward to tomorrow solution.
>
> Frankly, I do not see any benefit here, nor a use case and I wonder who
> would adopt this. The future requires higher performance and not more band
> aid.
You may not, but we have a large number of customers who do.
In fact I'm quite surprised; what we want to achieve is the most natural
way from an application perspective.
You have something in memory, whether you got it via malloc, mmap'ing a file,
shmem with some other application, ... and you want to work on it with the
co-processor that is residing in your address space. Even better, pass a pointer
to it to some library you don't control which might itself want to use the
coprocessor ....
What you propose simply cannot provide that natural usage model with any
efficiency.
It might not be *your* model based on *your* application but that doesn't mean
it's not there, and isn't relevant.
Ben.
On Wed, 2015-04-22 at 13:17 -0500, Christoph Lameter wrote:
>
> > But again let me stress that application that want to be in control will
> > stay in control. If you want to make the decission yourself about where
> > things should end up then nothing in all we are proposing will preclude
> > you from doing that. Please just think about others people application,
> > not just yours, they are a lot of others thing in the world and they do
> > not want to be as close to the metal as you want to be. We just want to
> > accomodate the largest number of use case.
>
> What I think you want to do is to automatize something that should not be
> automatized and cannot be automatized for performance reasons.
You don't know that.
> Anyone
> wanting performance (and that is the prime reason to use a GPU) would
> switch this off because the latencies are otherwise not controllable and
> those may impact performance severely. There are typically multiple
> parallel strands of executing that must execute with similar performance
> in order to allow a data exchange at defined intervals. That is no longer
> possible if you add variances that come with the "transparency" here.
Stop trying to apply your unique usage model to the entire world :-)
Ben.
On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
> > Anyone
> > wanting performance (and that is the prime reason to use a GPU) would
> > switch this off because the latencies are otherwise not controllable and
> > those may impact performance severely. There are typically multiple
> > parallel strands of executing that must execute with similar performance
> > in order to allow a data exchange at defined intervals. That is no longer
> > possible if you add variances that come with the "transparency" here.
>
> Stop trying to apply your unique usage model to the entire world :-)
Many of the HPC apps that the world is using are severely impacted by what
you are proposing. It's the industry's usage model, not mine. That is why I
was asking about the use case. It does not seem to fit the industry you are
targeting. This is also the basic design principle that got GPUs to work
as fast as they do today. Introducing random memory latencies there will
kill much of the benefit of GPUs there too.
On Wed, 22 Apr 2015, Paul E. McKenney wrote:
> Agreed, the use case that Jerome is thinking of differs from yours.
> You would not (and should not) tolerate things like page faults because
> it would destroy your worst-case response times. I believe that Jerome
> is more interested in throughput with minimal change to existing code.
As far as I know, Jerome is talking about HPC loads and high-performance
GPU processing. This is the same use case.
> Let's suppose that you and Jerome were using GPGPU hardware that had
> 32,768 hardware threads. You would want very close to 100% of the full
> throughput out of the hardware with pretty much zero unnecessary latency.
> In contrast, Jerome might be OK with (say) 20,000 threads worth of
> throughput with the occasional latency hiccup.
>
> And yes, support for both use cases is needed.
What you are proposing for High Performance Computing reduces the
performance these guys are trying to get. You cannot sell someone a
Volkswagen if he needs a Ferrari.
On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
> > There are hooks in glibc where you can replace the memory
> > management of the apps if you want that.
>
> We don't control the app. Let's say we are doing a plugin for libfoo
> which accelerates "foo" using GPUs.
There are numerous examples of malloc implementations that can be used for
apps without modifying the app.
>
> Now some other app we have no control on uses libfoo. So pointers
> already allocated/mapped, possibly a long time ago, will hit libfoo (or
> the plugin) and we need GPUs to churn on the data.
If the GPU needed to suspend one of its computation threads to wait on
a mapping to be established on demand, then it looks like the
performance of the parallel threads on the GPU would be significantly
compromised. You would want to do the transfer explicitly in some fashion
that meshes with the concurrent calculation on the GPU. You do not want
stalls while GPU number crunching is ongoing.
> The point I'm making is you are arguing against a usage model which has
> been repeatedly asked for by large amounts of customer (after all that's
> also why HMM exists).
I am still not clear on what the use case for this would be. Who is asking
for this?
On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
> They are via MMIO space. The big differences here are that via CAPI the
> memory can be fully cachable and thus have the same characteristics as
> normal memory from the processor point of view, and the device shares
> the MMU with the host.
>
> Practically what that means is that the device memory *is* just some
> normal system memory with a larger distance. The NUMA model is an
> excellent representation of it.
I sure wish you would be working on using these features to increase
performance and the speed of communication to devices.
Device memory is inherently different from main memory (otherwise the
device would be using main memory) and thus not really NUMA. NUMA at least
assumes that the basic characteristics of memory are the same while just
the access speeds vary. GPU memory has very different performance
characteristics and the various assumptions on memory that the kernel
makes for the regular processors may not hold anymore.
> > For my use cases the advantage of CAPI lies in the reduction of latency
> > for coprocessor communication. I hope that CAPI will allow fast cache to
> > cache transactions between a coprocessor and the main one. This is
> > improving the ability to exchange data rapidly between a application code
> > and some piece of hardware (NIC, GPU, custom hardware etc etc)
> >
> > Fundamentally this is currently an design issue since CAPI is running on
> > top of PCI-E and PCI-E transactions establish a minimum latency that
> > cannot be avoided. So its hard to see how CAPI can improve the situation.
>
> It's on top of the lower layers of PCIe yes, I don't know the exact
> latency numbers. It does enable the device to own cache lines though and
> vice versa.
Could you come up with a way to allow faster device communication through
improving on the PCI-E cacheline handoff via CAPI? That would be something
useful that I expected from it. If the processor can transfer some word
faster into a CAPI device or get status faster then that is a valuable
thing.
On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
> In fact I'm quite surprised, what we want to achieve is the most natural
> way from an application perspective.
Well, the most natural thing would be if the beast would just do what I
tell it in plain English. But then I would not have my job anymore.
> You have something in memory, whether you got it via malloc, mmap'ing a file,
> shmem with some other application, ... and you want to work on it with the
> co-processor that is residing in your address space. Even better, pass a pointer
> to it to some library you don't control which might itself want to use the
> coprocessor ....
Yes, that works already. What's new about this? This seems to have been
solved on the Intel platform, for example.
> What you propose can simply not provide that natural usage model with any
> efficiency.
There is no efficiency anymore if the OS can create random events in a
computational stream that is highly optimized for data exchange between
multiple threads at defined time intervals. If transparency or the natural
usage model can avoid this, then OK, but what I see proposed here is some
behind-the-scenes model that may severely degrade performance. And this
does seem to go way beyond CAPI. At least, so far I have thought about
this as a method for cache coherency at the cache-line level and as a
way to simplify the coordination of page tables and TLBs across multiple
divergent architectures.
I think these two things need to be separated. The shift-the-memory-back-
and-forth approach should be separate, and if someone wants to use it
then it should also work on other platforms like ARM and Intel.
CAPI needs to be implemented as a way to potentially improve the existing
communication paths between devices and the main processor. For example,
the existing InfiniBand MMU synchronization issues and RDMA registration
problems could be addressed with this. The existing mechanisms for GPU
communication could become much cleaner and easier to handle. This is all
good, but independent of any "transparent" memory implementation.
> It might not be *your* model based on *your* application but that doesn't mean
> it's not there, and isn't relevant.
Sadly this is the way that an entire industry does its thing.
On 2015-04-23 10:25, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
>
>> They are via MMIO space. The big differences here are that via CAPI the
>> memory can be fully cachable and thus have the same characteristics as
>> normal memory from the processor point of view, and the device shares
>> the MMU with the host.
>>
>> Practically what that means is that the device memory *is* just some
>> normal system memory with a larger distance. The NUMA model is an
>> excellent representation of it.
>
> I sure wish you would be working on using these features to increase
> performance and the speed of communication to devices.
>
> Device memory is inherently different from main memory (otherwise the
> device would be using main memory) and thus not really NUMA. NUMA at least
> assumes that the basic characteristics of memory are the same while just
> the access speeds vary. GPU memory has very different performance
> characteristics and the various assumptions on memory that the kernel
> makes for the regular processors may not hold anymore.
>
You are restricting your definition of NUMA to what the industry
constrains it to mean. Based solely on the academic definition of a
NUMA system, this _is_ NUMA. In fact, based on the academic definition,
all modern systems could be considered to be NUMA systems, with each
level of cache representing a memory-only node.
Looking at this whole conversation, all I see is two different views on
how to present the asymmetric multiprocessing arrangements that have
become commonplace in today's systems to userspace. Your model favors
performance, while CAPI favors simplicity for userspace.
On Thu, Apr 23, 2015 at 09:10:13AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
>
> > > Anyone
> > > wanting performance (and that is the prime reason to use a GPU) would
> > > switch this off because the latencies are otherwise not controllable and
> > > those may impact performance severely. There are typically multiple
> > > parallel strands of executing that must execute with similar performance
> > > in order to allow a data exchange at defined intervals. That is no longer
> > > possible if you add variances that come with the "transparency" here.
> >
> > Stop trying to apply your unique usage model to the entire world :-)
>
> Much of the HPC apps that the world is using is severely impacted by what
> you are proposing. Its the industries usage model not mine. That is why I
> was asking about the use case. Does not seem to fit the industry you are
> targeting. This is also the basic design principle that got GPUs to work
> as fast as they do today. Introducing random memory latencies there will
> kill much of the benefit of GPUs there too.
We obviously have different experience, and I fear yours is restricted to
a specific, uncommon application. You care about latency; all my previous
experience (I developed applications for HPC platforms in the past) is
that latency is not the issue, throughput is. For instance, I developed
on an HPC system where the data was coming from magnetic tape; latency
there was several minutes before the data started streaming (yes, a robot
arm had to pick the tape and load it into one of the available readers).
All the people I interacted with across various fields (physics, biology,
data mining) were not worried a bit about latency. They could not have
cared less about latency, actually. What they cared about was overall
throughput and ease of use.
You need to stop thinking HPC == low latency. Low latency is only useful
in time critical application such as the high frequency trading you seem
to care about. For people working on physics, biology, data mining, CAD,
... they do care more about throughput than latency. I strongly believe
here that this cover a far greater number of users of HPC than yours
(maybe not in term of money power ... alas).
On the GPU front I have a lot of experience, more than 15 years
working on open source drivers for them. I would like to think that I
have a clue or two about how they work. So when I say latency is not
the primary concern in most cases, I do mean it. GPUs are about having
many threads in flight and hiding memory latency behind them. If you
have 1000 "cores" on a GPU and 5000 threads in flight, then chances
are good that, no matter the memory latency, on each clock cycle you
will still have 1000 threads ready to compute something.
I am not saying latency never matters; it all depends on the kind of
application that is running, how much data it needs to consume, and
how many threads the hardware can keep in flight at the same time.
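As a rough illustration of that arithmetic, here is a toy model (a
sketch only; the cycle counts are made-up assumptions, and this is in
no way a GPU simulator):

    #include <stdio.h>

    int main(void)
    {
            double lanes   = 1000;  /* execution "cores" */
            double threads = 5000;  /* threads in flight */
            double work    = 100;   /* compute cycles per access (assumed) */
            double stall   = 400;   /* stall cycles per access (assumed) */

            /* expected number of threads ready to issue on a given cycle */
            double ready = threads * work / (work + stall);
            double util  = ready >= lanes ? 100.0 : 100.0 * ready / lanes;

            printf("ready threads: %.0f, lane utilization: %.0f%%\n",
                   ready, util);
            return 0;
    }

With these numbers, 1000 of the 5000 threads are ready on an average
cycle, so the 1000 lanes stay fully busy even though every memory
access stalls for 400 cycles.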
So yes, autonuma-style solutions are worth investigating; as a matter
of fact, even today drivers use heuristics (taking into account hints
provided by userspace) to decide what to put into video memory. For
many applications the driver stack will be able to provide good hints
on what to migrate, but you still need to think about multiple
processes, and so you need to share resources. Sharing resources among
processes is the kernel's role; it always has been.
Now, for your use case, you know beforehand how many processes there
are going to be, so you can partition the resources accordingly and
make better-tailored decisions about where things should reside. But
again, this is not the common case. No HPC site I know of can predict
the number of processes or partition resources for them. Programs that
run on those systems are updated frequently, and you need to share
resources with others. For all those people, and for people just
working on a workstation, the autonuma solution is most likely the
best. It might not lead to 100% saturation of the GPU, but it will be
good enough to make a difference.
The NUMA code we have today for the CPU case exists because it does
make a difference, but you keep trying to restrict GPU users to one
specific workload. Go talk to people doing physics, biology, data
mining, or CAD: most of them do not care about latency. They have no
hard deadlines to meet with their computations. They just want things
to compute as fast as possible and programming to be as easy as it
can get.
Jérôme
On 04/21/2015 08:50 PM, Christoph Lameter wrote:
> On Tue, 21 Apr 2015, Jerome Glisse wrote:
>> So big use case here, let say you have an application that rely on a
>> scientific library that do matrix computation. Your application simply
>> use malloc and give pointer to this scientific library. Now let say
>> the good folks working on this scientific library wants to leverage
>> the GPU, they could do it by allocating GPU memory through GPU specific
>> API and copy data in and out. For matrix that can be easy enough, but
>> still inefficient. What you really want is the GPU directly accessing
>> this malloced chunk of memory, eventualy migrating it to device memory
>> while performing the computation and migrating it back to system memory
>> once done. Which means that you do not want some kind of filesystem or
>> anything like that.
>
> With a filesystem the migration can be controlled by the application.
Which is absolutely the wrong thing to do when using the "GPU"
(or whatever co-processor it is) transparently from libraries,
without the applications having to know about it.
Your use case is legitimate, but so is this other case.
On Thu, Apr 23, 2015 at 09:38:15AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
[...]
> > You have something in memory, whether you got it via malloc, mmap'ing a file,
> > shmem with some other application, ... and you want to work on it with the
> > co-processor that is residing in your address space. Even better, pass a pointer
> > to it to some library you don't control which might itself want to use the
> > coprocessor ....
>
> Yes that works already. Whats new about this? This seems to have been
> solved on the Intel platform f.e.
No, this has not been solved properly. Today's solution is to do an
explicit copy, again and again, and when complex data structures are
involved (lists, trees, ...) this is extremely tedious and hard to
debug. So today's solutions often restrict themselves to easy things
like matrix multiplication. But if you provide a unified address
space, then you make things a lot easier for a lot more use cases.
That's a fact, and again OpenCL 2.0, which is an industry standard, is
proof that a unified address space is one of the most important
features requested by GPGPU users. You might not care, but the rest of
the world does.
>
> > What you propose can simply not provide that natural usage model with any
> > efficiency.
>
> There is no effiecency anymore if the OS can create random events in a
> computational stream that is highly optimized for data exchange of
> multiple threads at defined time intervals. If transparency or the natural
> usage model can avoid this then ok but what I see here proposed is some
> behind-the-scenes model that may severely degrate performance. And this
> does seem to go way beyond CAPI. At leasdt the way I so far thought about
> this as a method for cache coherency at the cache line level and about a
> way to simplify the coordination of page tables and TLBs across multiple
> divergent architectures.
Again, you restrict yourself to your own use case. Many HPC workloads
do not have stringent time constraints or synchronization points.
>
> I think these two things need to be separated. The shift-the-memory-back-
> and-forth approach should be separate and if someone wants to use the
> thing then it should also work on other platforms like ARM and Intel.
What IBM does with their platform is their choice; they cannot force
ARM or Intel or AMD to do the same. Each of those might have a
different view of what their most important target is. For instance, I
highly doubt that ARM cares about any of this.
>
> CAPI needs to be implemented as a way to potentially improve the existing
> communication paths between devices and the main processor. F.e the
> existing Infiniband MMU synchronization issues and RDMA registration
> problems could be addressed with this. The existing mechanisms for GPU
> communication could become much cleaner and easier to handle. This is all
> good but independant of any "transparent" memory implementation.
No, a transparent memory implementation is a prerequisite for
leveraging cache coherency. If an address in a given process does not
mean the same thing on the device as it does on the CPU, then cache
coherency becomes a lot harder, because you need to track several
addresses for the same physical backing storage. An N (virtual) to
1 (physical) mapping is hard.
A shared address, on the other hand, makes it a lot easier to have
cache coherency distributed across device and CPU, because they will
all agree on what physical memory is backing each address of a given
process. 1 (virtual) to 1 (physical) is easier.
> > It might not be *your* model based on *your* application but that doesn't mean
> > it's not there, and isn't relevant.
>
> Sadly this is the way that an entire industry does its thing.
Again, no, you are wrong: the HPC industry is not only about latency.
Only time-critical applications care about latency; everyone else
cares about throughput, with applications that can run for days,
weeks, or months before producing any usable/meaningful results. Many
of them do not care one bit about latency because they perform
independent computations.
Take a company rendering a movie, for instance: they want to render
millions of frames as fast as possible, but each frame can be rendered
independently. The only shared data is the input geometry, textures,
and lighting, and those are constant; the rendering of one frame does
not depend on the rendering of the previous one (leaving
post-processing such as motion blur aside).
The same applies if you do some data mining. You might want to find
all occurrences of a specific sequence in a large data pool. You can
slice the data pool, run an independent job per slice, and only
aggregate the results of the jobs at the end (or as they finish).
I will not go on and on about all the things that do not care about
latency; I am just trying to open your eyes to the world that exists
out there.
Cheers,
Jérôme
On Thu, Apr 23, 2015 at 09:20:55AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
>
> > > There are hooks in glibc where you can replace the memory
> > > management of the apps if you want that.
> >
> > We don't control the app. Let's say we are doing a plugin for libfoo
> > which accelerates "foo" using GPUs.
>
> There are numerous examples of malloc implementation that can be used for
> apps without modifying the app.
What about shared memory passed between processes? Or an mmap()ed
file? Or a library that is loaded through dlopen() and thus has no way
to control any allocation that happened before it became active?
> >
> > Now some other app we have no control on uses libfoo. So pointers
> > already allocated/mapped, possibly a long time ago, will hit libfoo (or
> > the plugin) and we need GPUs to churn on the data.
>
> IF the GPU would need to suspend one of its computation thread to wait on
> a mapping to be established on demand or so then it looks like the
> performance of the parallel threads on a GPU will be significantly
> compromised. You would want to do the transfer explicitly in some fashion
> that meshes with the concurrent calculation in the GPU. You do not want
> stalls while GPU number crunching is ongoing.
You do not understand how GPUs work. A GPU has a pool of threads, and
it always tries to keep that pool as big as possible, so that when one
group of threads is waiting on a memory access, other threads are
ready to perform some operation. GPUs are about hiding memory latency;
that is what they are good at. But they only achieve that when they
have more threads in flight than compute units. The whole thread
scheduling is done by hardware and is barely controlled by the device
driver.
So no, having the GPU wait for a page fault is not as dramatic as you
think. If you use GPUs as they are intended to be used, you might
never even notice the page faults, and still reach close to the
theoretical throughput of the GPU.
>
> > The point I'm making is you are arguing against a usage model which has
> > been repeatedly asked for by large amounts of customer (after all that's
> > also why HMM exists).
>
> I am still not clear what is the use case for this would be. Who is asking
> for this?
Everyone but you? OpenCL 2.0 specifically requests it, with several
levels of support for a transparent address space. The lowest level is
the one implemented today, in which the application needs to use a
special memory allocator.
The most advanced level implies integration with the kernel, in which
any memory (mmap()ed file, shared memory, or anonymous memory) can be
used by the GPU and does not need to come from a special allocator.
Everyone in the industry is moving toward the most advanced level.
That is the raison d'être of HMM: to provide this functionality on
hardware platforms that do not have things such as CAPI, which today
means x86/ARM.
So the use case is every application using OpenCL or CUDA, which means
pretty much everyone doing GPGPU. I don't know how you can't see that.
A shared address space is so much easier. Believe it or not, most
coders do not have deep knowledge of how things work, and if you can
remove the complexity of different memory allocations and different
address spaces from them, they will be happy.
Cheers,
Jérôme
On 04/22/2015 01:14 PM, Christoph Lameter wrote:
> On Wed, 22 Apr 2015, Jerome Glisse wrote:
>
>> Glibc hooks will not work, this is about having same address space on
>> CPU and GPU/accelerator while allowing backing memory to be regular
>> system memory or device memory all this in a transparent manner to
>> userspace program and library.
>
> If you control the address space used by malloc and provide your own
> implementation then I do not see why this would not work.
Your program does not know how many other programs it is
sharing the co-processor / GPU device with, which means
it does not know how much of the co-processor or GPU
memory will be available for it at any point in time.
Well, in your specific case your program might know,
but in the typical case it will not.
This means the OS will have to manage the resource.
On Thu, Apr 23, 2015 at 09:38:15AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
[ . . . ]
> > It might not be *your* model based on *your* application but that doesn't mean
> > it's not there, and isn't relevant.
>
> Sadly this is the way that an entire industry does its thing.
I must confess that I got lost in the pronouns.
If by "this is the way" and "entire industry" you mean hand-tuning
for the former and the specific industry you are in for the latter,
I am with you. And again, we are not going to do anything that would
prevent hand-tuning. For example, it will be possible to completely
disable any migration operations that might contribute to OS jitter.
And I have added a requirement that this migration mechanism not
contribute to OS jitter unless it is enabled. Does that help?
If by "entire industry" you mean everyone who might want to use hardware
acceleration, for example, including mechanical computer-aided design,
I am skeptical.
Thanx, Paul
On Thu, Apr 23, 2015 at 09:20:55AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
>
> > > There are hooks in glibc where you can replace the memory
> > > management of the apps if you want that.
> >
> > We don't control the app. Let's say we are doing a plugin for libfoo
> > which accelerates "foo" using GPUs.
>
> There are numerous examples of malloc implementation that can be used for
> apps without modifying the app.
Except that the app might be mapping a file or operating on a big
array in bss instead of (or as well as) using malloc()ed memory.
> > Now some other app we have no control on uses libfoo. So pointers
> > already allocated/mapped, possibly a long time ago, will hit libfoo (or
> > the plugin) and we need GPUs to churn on the data.
>
> IF the GPU would need to suspend one of its computation thread to wait on
> a mapping to be established on demand or so then it looks like the
> performance of the parallel threads on a GPU will be significantly
> compromised. You would want to do the transfer explicitly in some fashion
> that meshes with the concurrent calculation in the GPU. You do not want
> stalls while GPU number crunching is ongoing.
Yep. But for throughput-oriented applications, as long as stalls don't
happen very often, this can be OK.
> > The point I'm making is you are arguing against a usage model which has
> > been repeatedly asked for by large amounts of customer (after all that's
> > also why HMM exists).
>
> I am still not clear what is the use case for this would be. Who is asking
> for this?
Ben and I are. I have added a use case, which I will send out shortly
with the next version.
Thanx, Paul
On Thu, Apr 23, 2015 at 09:12:38AM -0500, Christoph Lameter wrote:
> On Wed, 22 Apr 2015, Paul E. McKenney wrote:
>
> > Agreed, the use case that Jerome is thinking of differs from yours.
> > You would not (and should not) tolerate things like page faults because
> > it would destroy your worst-case response times. I believe that Jerome
> > is more interested in throughput with minimal change to existing code.
>
> As far as I know Jerome is talkeing about HPC loads and high performance
> GPU processing. This is the same use case.
The difference is sensitivity to latency. You have latency-sensitive
HPC workloads, and Jerome is talking about HPC workloads that need
high throughput, but are insensitive to latency.
> > Let's suppose that you and Jerome were using GPGPU hardware that had
> > 32,768 hardware threads. You would want very close to 100% of the full
> > throughput out of the hardware with pretty much zero unnecessary latency.
> > In contrast, Jerome might be OK with (say) 20,000 threads worth of
> > throughput with the occasional latency hiccup.
> >
> > And yes, support for both use cases is needed.
>
> What you are proposing for High Performacne Computing is reducing the
> performance these guys trying to get. You cannot sell someone a Volkswagen
> if he needs the Ferrari.
You do need the low-latency Ferrari. But others are best served by a
high-throughput freight train.
Thanx, Paul
And another update, again diffs followed by the full document. The
diffs are against the version at https://lkml.org/lkml/2015/4/22/235.
Thanx, Paul
------------------------------------------------------------------------
diff --git a/DeviceMem.txt b/DeviceMem.txt
index cdedf2ee96e9..15d0a8b5d360 100644
--- a/DeviceMem.txt
+++ b/DeviceMem.txt
@@ -51,6 +51,38 @@
USE CASES
+ o Multiple transformations without requiring multiple
+ memory transfers for throughput-oriented applications.
+ For example, suppose the device supports both compression
+ and encryption algorithms, but that significant CPU
+ work is required to generate the data to be compressed
+ and encrypted. Suppose also that the application uses
+ a library to do the compression and encryption, and
+ that this application needs to run correctly, without
+ rebuilding, on systems with the device and also on systems
+ without the device. In addition, the application operates
+ on data mapped from files, data in normal data/bss memory,
+ and data in heap memory from malloc().
+
+ In this case, it would be beneficial to have the memory
+ automatically migrate to and from device memory.
+ Note that the device-specific library functions could
+ reasonably initiate the migration before starting their
+ work, but could not know whether or not to migrate the
+ data back upon completion.
+
+ o A special-purpose globally hand-optimized application
+ wishes to use the device, from Christoph Lameter.
+
+ In this case, the application will get the absolute
+ best performance by manually controlling allocation
+ and migration decisions. This use case is probably
+ not helped much by this proposal.
+
+ However, an application including a special-purpose
+ hand-optimized core and less-intense ancillary processing
+ could well benefit.
+
o GPGPU matrix operations, from Jerome Glisse.
https://lkml.org/lkml/2015/4/21/898
@@ -109,6 +141,11 @@ REQUIREMENTS
tune allocation locality, migration, and so on, as
required to match performance and functional requirements.
+ 5. It must be possible to configure a system containing
+ a CCAD device so that it does no migration, as will be
+ required for low-latency applications that are sensitive
+ to OS jitter.
+
POTENTIAL IDEAS
------------------------------------------------------------------------
COHERENT ON-DEVICE MEMORY: ACCESS AND MIGRATION
Ben Herrenschmidt
(As told to Paul E. McKenney)
Special-purpose hardware is becoming more prevalent, and some of this
hardware allows for tight interaction with CPU-based processing.
For example, IBM's coherent accelerator processor interface
(CAPI) will allow this sort of device to be constructed,
and it is likely that GPGPUs will need similar capabilities.
(See http://www-304.ibm.com/webapp/set2/sas/f/capi/home.html for a
high-level description of CAPI.) Let's call these cache-coherent
accelerator devices (CCAD for short, which should at least
motivate someone to come up with something better).
This document covers devices with the following properties:
1. The device is cache-coherent, in other words, the device's
memory has all the characteristics of system memory from
the viewpoint of CPUs and other devices accessing it.
2. The device provides local memory that it has high-bandwidth
low-latency access to, but the device can also access
normal system memory.
3. The device shares system page tables, so that it can
transparently access userspace virtual memory, regardless
of whether this virtual memory maps to normal system
memory or to memory local to the device.
Although such a device will provide CPUs with cache-coherent
access to on-device memory, the resulting memory latency is
expected to be slower than the normal memory that is tightly
coupled to the CPUs. Nevertheless, data that is only occasionally
accessed by CPUs should be stored in the device's memory.
On the other hand, data that is accessed rarely by the device but
frequently by the CPUs should be stored in normal system memory.
Of course, some workloads will have predictable access patterns
that allow data to be optimally placed up front. However, other
workloads will have less-predictable access patterns, and these
workloads can benefit from automatic migration of data between
device memory and system memory as access patterns change.
Furthermore, some devices will provide special hardware that
collects access statistics that can be used to determine whether
or not a given page of memory should be migrated, and if so,
to where.
The purpose of this document is to explore how this access
and migration can be provided for within the Linux kernel.
USE CASES
o Multiple transformations without requiring multiple
memory transfers for throughput-oriented applications.
For example, suppose the device supports both compression
and encryption algorithms, but that significant CPU
work is required to generate the data to be compressed
and encrypted. Suppose also that the application uses
a library to do the compression and encryption, and
that this application needs to run correctly, without
rebuilding, on systems with the device and also on systems
without the device. In addition, the application operates
on data mapped from files, data in normal data/bss memory,
and data in heap memory from malloc().
In this case, it would be beneficial to have the memory
automatically migrate to and from device memory.
Note that the device-specific library functions could
reasonably initiate the migration before starting their
work, but could not know whether or not to migrate the
data back upon completion.
o A special-purpose globally hand-optimized application
wishes to use the device, from Christoph Lameter.
In this case, the application will get the absolute
best performance by manually controlling allocation
and migration decisions. This use case is probably
not helped much by this proposal.
However, an application including a special-purpose
hand-optimized core and less-intense ancillary processing
could well benefit.
o GPGPU matrix operations, from Jerome Glisse.
https://lkml.org/lkml/2015/4/21/898
Suppose that you have an application that uses a
scientific library to do matrix computations, and that
this application simply calls malloc() and give the
resulting pointer to the library function. If the GPGPU
has coherent access to system memory (and vice versa),
it would help performance and application compatibility
to be able to transparently migrate the malloc()ed
memory to and from the GPGPU's memory without requiring
changes to the application.
o (More here for CAPI.)
REQUIREMENTS
1. It should be possible to remove a given CCAD device
from service, for example, to reset it, to download
updated firmware, or to change its functionality.
This results in the following additional requirements:
a. It should be possible to migrate all data away
from the device's memory at any time.
b. Normal memory allocation should avoid using the
device's memory, as this would interfere
with the needed migration. It may nevertheless
be desirable to use the device's memory
if system memory is exhausted, however, in some
cases, even this "emergency" use is best avoided.
In fact, a good solution will provide some means
for avoiding this for those cases where it is
necessary to evacuate memory when offlining the
device.
2. Memory can be either explicitly or implicitly allocated
from the CCAD device's memory. (Both usermode and kernel
allocation required.)
Please note that implicit allocation will need to be
avoided in a number of use cases. The reason for this
is that random kernel allocations might be pinned into
memory, which could conflict with requirement (1) above,
and might furthermore fragment the device's memory.
3. The device's memory is treated like normal system
memory by the Linux kernel, for example, each page has a
"struct page" associate with it. (In contrast, the
traditional approach has used special-purpose OS mechanisms
to manage the device's memory, and this memory was treated
as MMIO space by the kernel.)
4. The system's normal tuning mechanism may be used to
tune allocation locality, migration, and so on, as
required to match performance and functional requirements.
5. It must be possible to configure a system containing
a CCAD device so that it does no migration, as will be
required for low-latency applications that are sensitive
to OS jitter.
POTENTIAL IDEAS
It is only reasonable to ask whether CCAD devices can simply
use the HMM patch that has recently been proposed to allow
migration between system and device memory via page faults.
Although this works well for devices whose local MMU can contain
mappings different from that of the system MMU, the HMM patch
is still working with MMIO space that gets special treatment.
The HMM patch does not (yet) provide the full transparency that
would allow the device memory to be treated in the same way as
system memory. Something more is therefore required, for example,
one or more of the following:
1. Model the CCAD device's memory as a memory-only NUMA node
with a very large distance metric. This allows use of
the existing mechanisms for choosing where to satisfy
explicit allocations and where to target migrations.
2. Cover the memory with a CMA to prevent non-migratable
pinned data from being placed in the CCAD device's memory.
It would also permit the driver to perform dedicated
physically contiguous allocations as needed.
3. Add a new ZONE_EXTERNAL zone for all CCAD-like devices.
Note that this would likely require support for
discontinuous zones in order to support large NUMA
systems, in which each node has a single block of the
overall physical address space. In such systems, the
physical address ranges of normal system memory would
be interleaved with those of device memory.
This would also require some sort of
migration infrastructure to be added, as autonuma would
not apply. However, this approach has the advantage
of preventing allocations in these regions, at least
unless those allocations have been explicitly flagged
to go there.
4. Your idea here!
The following sections cover AutoNUMA, use of memory zones, and DAX.
AUTONUMA
The Linux kernel's autonuma facility supports migrating both
memory and processes to promote NUMA memory locality. It was
accepted into 3.13 and is available in RHEL 7.0 and SLES 12.
It is enabled by the Kconfig variable CONFIG_NUMA_BALANCING.
This approach uses a kernel thread "knuma_scand" that periodically
marks pages inaccessible. The page-fault handler notes any
mismatches between the NUMA node that the process is running on
and the NUMA node on which the page resides.
http://lwn.net/Articles/488709/
https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf
It will be necessary to set up the CCAD device's memory as
a very distant NUMA node, and the architecture-specific
__numa_distance() function can be used for this purpose.
There is a RECLAIM_DISTANCE macro that can be set by the
architecture to prevent reclaiming from nodes that are too
far away. Some experimentation would be required to determine
the combination of values for the various distance macros.
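For concreteness, the distance setup might look something like the
following sketch. This is hypothetical and not from any existing
patch: the value 160 and the node_is_ccad() helper are invented, and
the real hook would be whatever distance function the architecture
provides.

    #define CCAD_DISTANCE   160     /* well above RECLAIM_DISTANCE (30) */

    int __node_distance(int from, int to)
    {
            /* treat any path to or from the CCAD node as very remote */
            if (node_is_ccad(from) || node_is_ccad(to))
                    return CCAD_DISTANCE;
            return numa_distance_table[from][to];   /* assumed arch table */
    }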
This approach needs some way to pull in data from the hardware
on access patterns. Aneesh Kk Veetil is prototyping an approach
based on Power 8 hardware counters. This data will need to be
plugged into the migration algorithm, which is currently based
on collecting information from page faults.
Finally, the contiguous memory allocator (CMA, see
http://lwn.net/Articles/486301/) is needed in order to prevent
the kernel from placing non-migratable allocations in the CCAD
device's memory. This would need to be of type MIGRATE_CMA to
ensure that all memory taken from that range be migratable.
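A sketch of reserving the device range as a CMA area at boot follows.
The ccad_base and ccad_size variables are assumed to describe the
device's memory, and the exact cma_declare_contiguous() arguments vary
with kernel version.

    static struct cma *ccad_cma;

    static int __init ccad_reserve_cma(void)
    {
            /* a CMA area hands out only MIGRATE_CMA (movable) pages */
            return cma_declare_contiguous(ccad_base, ccad_size,
                                          0,    /* no address limit */
                                          0, 0, /* alignment, order_per_bit */
                                          true, /* fixed placement */
                                          &ccad_cma);
    }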
The result would be that the kernel would allocate only migratable
pages within the CCAD device's memory, and even then only if
memory was otherwise exhausted. Normal CONFIG_NUMA_BALANCING
migration could be brought to bear, possibly enhanced with
information from hardware counters. One remaining issue is that
there is no way to absolutely prevent random kernel subsystems
from allocating the CCAD device's memory, which could cause
failures should the device need to reset itself, in which case
the memory would be temporarily inaccessible -- which could be
a fatal surprise to that kernel subsystem.
Jerome Glisse suggests that usermode hints are quite important,
and perhaps should replace any AutoNUMA measurements.
MEMORY ZONE
One way to avoid the problem of random kernel subsystems using
the CAPI device's memory is to create a new memory zone for
this purpose. This would add something like ZONE_DEVMEM to the
current set that includes ZONE_DMA, ZONE_NORMAL, and ZONE_MOVABLE.
Currently, there are a maximum of four zones, so this limit must
either be increased or kernels built with ZONE_DEVMEM must avoid
having more than one of ZONE_DMA, ZONE_DMA32, and ZONE_HIGHMEM.
This approach requires that migration be implemented on the side,
as the CONFIG_NUMA_BALANCING will not help here (unless I am
missing something). One advantage of this situation is that
hardware locality measurements could be incorporated from the
beginning. Another advantage is that random kernel subsystems
and user programs would not get CAPI device memory unless they
explicitly requested it.
Code would be needed at boot time to place the CAPI device
memory into ZONE_DEVMEM, perhaps involving changes to
mem_init() and paging_init().
In addition, an appropriate GFP_DEVMEM would be needed, along
with code in various paths to handle it appropriately.
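As a purely hypothetical sketch (neither ZONE_DEVMEM nor GFP_DEVMEM
exists today, and ccad_nid is an assumed variable holding the device's
NUMA node), explicit allocation from the new zone might look like:

    struct page *page;

    /* ask for the device's memory first, fall back to system memory */
    page = alloc_pages_node(ccad_nid, GFP_DEVMEM | __GFP_MOVABLE, 0);
    if (!page)
            page = alloc_pages_node(numa_node_id(),
                                    GFP_HIGHUSER_MOVABLE, 0);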
Also, because large NUMA systems will sometimes interleave the
addresses of blocks of physical memory and device memory,
support for discontiguous interleaved zones will be required.
DAX
DAX is a mechanism for providing direct-memory access to
high-speed non-volatile (AKA "persistent") memory. Good
introductions to DAX may be found in the following LWN
articles:
https://lwn.net/Articles/591779/
https://lwn.net/Articles/610174/
DAX provides filesystem-level access to persistent memory.
One important CCAD use case is allowing a legacy application
to pass memory from malloc() to a CCAD device, and having
the allocated memory migrate as needed. DAX does not seem to
support this use case.
ACKNOWLEDGMENTS
Updates to this document include feedback from Christoph Lameter
and Jerome Glisse.
On Thu, 2015-04-23 at 09:10 -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
>
> > > Anyone
> > > wanting performance (and that is the prime reason to use a GPU) would
> > > switch this off because the latencies are otherwise not controllable and
> > > those may impact performance severely. There are typically multiple
> > > parallel strands of executing that must execute with similar performance
> > > in order to allow a data exchange at defined intervals. That is no longer
> > > possible if you add variances that come with the "transparency" here.
> >
> > Stop trying to apply your unique usage model to the entire world :-)
>
> Much of the HPC apps that the world is using is severely impacted by what
> you are proposing. Its the industries usage model not mine. That is why I
> was asking about the use case. Does not seem to fit the industry you are
> targeting. This is also the basic design principle that got GPUs to work
> as fast as they do today. Introducing random memory latencies there will
> kill much of the benefit of GPUs there too.
How would it be impacted? You can still do dedicated allocations and
so on if you want to. I think Jerome gave a pretty good explanation of
the need for the usage model we are proposing; it is also coming from
the industry ...
Ben.
On Thu, 2015-04-23 at 11:25 -0400, Austin S Hemmelgarn wrote:
> Looking at this whole conversation, all I see is two different views on
> how to present the asymmetric multiprocessing arrangements that have
> become commonplace in today's systems to userspace. Your model favors
> performance, while CAPI favors simplicity for userspace.
I would say it differently... when you say "CAPI favors...", it's not
CAPI, it's the usage model we are proposing as an option for CAPI and
other similar technologies (there's at least one other I can't quite
talk about yet), basically anything that has the characteristics
defined in the document Paul posted. CAPI is just one such example.
On the other hand, CAPI can also perfectly well be used as Christoph
describes. The ability to transparently handle and migrate memory does
not exclude the ability for an application to explicitly decide where
to allocate its memory and explicitly move the data around. Both
options will be provided.
Before the thread degraded into a debate on usage model, this was an
attempt at discussing the technical details of what would be the best
approach to implement the "transparent" model in Linux. I'd like to go back
to it if possible ...
Cheers,
Ben.
On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> > As far as I know Jerome is talkeing about HPC loads and high performance
> > GPU processing. This is the same use case.
>
> The difference is sensitivity to latency. You have latency-sensitive
> HPC workloads, and Jerome is talking about HPC workloads that need
> high throughput, but are insensitive to latency.
Those are correlated.
> > What you are proposing for High Performacne Computing is reducing the
> > performance these guys trying to get. You cannot sell someone a Volkswagen
> > if he needs the Ferrari.
>
> You do need the low-latency Ferrari. But others are best served by a
> high-throughput freight train.
The problem is that they want to run 2000 trains at the same time,
and they all must arrive at the destination before they can be sent
on their next trip. 1999 trains will be sitting idle because they
need to wait for the one train that was delayed. This reduces the
throughput. People really would like all 2000 trains to arrive on
schedule so that they get more performance.
On Thu, 23 Apr 2015, Jerome Glisse wrote:
> The numa code we have today for CPU case exist because it does make
> a difference but you keep trying to restrict GPU user to a workload
> that is specific. Go talk to people doing physic, biology, data
> mining, CAD most of them do not care about latency. They have not
> hard deadline to meet with their computation. They just want things
> to compute as fast as possible and programming to be as easy as it
> can get.
I started working on the latency issues a long time ago because the
performance of those labs was restricted by OS processing. A noted
problem was the SLAB allocator's scanning of its objects every 2
seconds, which caused pretty significant performance regressions due
to the delays it introduced into individual computation threads.
On Thu, 23 Apr 2015, Austin S Hemmelgarn wrote:
> Looking at this whole conversation, all I see is two different views on how to
> present the asymmetric multiprocessing arrangements that have become
> commonplace in today's systems to userspace. Your model favors performance,
> while CAPI favors simplicity for userspace.
Oww. No performance, just simplicity? Really?
The simplification of memory registration for Infiniband and the like
is certainly useful, and I hope to see contributions on that going
into the kernel.
On Thu, 23 Apr 2015, Paul E. McKenney wrote:
>
> DAX
>
> DAX is a mechanism for providing direct-memory access to
> high-speed non-volatile (AKA "persistent") memory. Good
> introductions to DAX may be found in the following LWN
> articles:
DAX is a mechanism to access memory not managed by the kernel and is
the successor to XIP. It just happens to be needed for persistent
memory. Fundamentally, any driver can provide an mmapped interface to
allow access to a device's memory.
On Fri, Apr 24, 2015 at 09:01:47AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
>
> > > As far as I know Jerome is talkeing about HPC loads and high performance
> > > GPU processing. This is the same use case.
> >
> > The difference is sensitivity to latency. You have latency-sensitive
> > HPC workloads, and Jerome is talking about HPC workloads that need
> > high throughput, but are insensitive to latency.
>
> Those are correlated.
In some cases, yes. But are you -really- claiming that -all- HPC
workloads are highly sensitive to latency? That would be quite a claim!
> > > What you are proposing for High Performacne Computing is reducing the
> > > performance these guys trying to get. You cannot sell someone a Volkswagen
> > > if he needs the Ferrari.
> >
> > You do need the low-latency Ferrari. But others are best served by a
> > high-throughput freight train.
>
> The problem is that they want to run 2000 trains at the same time
> and they all must arrive at the destination before they can be send on
> their next trip. 1999 trains will be sitting idle because they need
> to wait of the one train that was delayed. This reduces the troughput.
> People really would like all 2000 trains to arrive on schedule so that
> they get more performance.
Yes, there is some portion of the market that needs both high throughput
and highly predictable latencies. You are claiming that the -entire- HPC
market has this sort of requirement? Again, this would be quite a claim!
Thanx, Paul
On Thu, 23 Apr 2015, Jerome Glisse wrote:
> No this not have been solve properly. Today solution is doing an explicit
> copy and again and again when complex data struct are involve (list, tree,
> ...) this is extremly tedious and hard to debug. So today solution often
> restrict themself to easy thing like matrix multiplication. But if you
> provide a unified address space then you make things a lot easiers for a
> lot more usecase. That's a fact, and again OpenCL 2.0 which is an industry
> standard is a proof that unified address space is one of the most important
> feature requested by user of GPGPU. You might not care but the rest of the
> world does.
You could use page tables on the kernel side to transfer data on
demand from the GPU. And you can use a device driver to establish
mappings to the GPU's memory.
There is no copy needed with these approaches.
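For concreteness, the traditional driver-mmap form of such a mapping
is roughly the following sketch (ccad_mem_phys is an assumed physical
base address of the device's memory window, and the surrounding
char-device boilerplate and error handling are omitted):

    static int ccad_mmap(struct file *filp, struct vm_area_struct *vma)
    {
            unsigned long size = vma->vm_end - vma->vm_start;

            /* map a window of the device's memory into the process */
            return remap_pfn_range(vma, vma->vm_start,
                                   ccad_mem_phys >> PAGE_SHIFT,
                                   size, vma->vm_page_prot);
    }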
> > I think these two things need to be separated. The shift-the-memory-back-
> > and-forth approach should be separate and if someone wants to use the
> > thing then it should also work on other platforms like ARM and Intel.
>
> What IBM does with there platform is there choice, they can not force ARM
> or Intel or AMD to do the same. Each of those might have different view
> on what is their most important target. For instance i highly doubt ARM
> cares about any of this.
Well, but the kernel code submitted should allow for easy use on other
platforms. That is, Intel processors should be able to implement the
"transparent" memory by establishing device mappings to PCI-E space
and/or by transferring data from the GPU and signaling the GPU to
establish such a mapping.
> Only time critical application care about latency, everyone else cares
> about throughput, where the applications can runs for days, weeks, months
> before producing any useable/meaningfull results. Many of which do not
> care a tiny bit about latency because they can perform independant
> computation.
Computationally intensive high-performance applications care about
random latency introduced into computational threads, because it
delays the data exchange and thus slows everything down. And that is
the typical case for a GPU.
> Take a company rendering a movie for instance, they want to render the
> millions of frame as fast as possible but each frame can be rendered
> independently, they only share data is the input geometry, textures and
> lighting but this are constant, the rendering of one frame does not
> depend on the rendering of the previous (leaving post processing like
> motion blur aside).
The rendering would be done by the GPU, and this will involve
concurrent, rapid access to data. Performance is certainly impacted if
the GPU cannot use its own RAM, which is designed for the proper
feeding of its processing. And if you add a paging layer and shuffle
stuff around underneath, then this will be very bad.
At minimum you need to shovel blocks of data into the GPU to allow it
to operate undisturbed on the data for a while and do its job.
> Same apply if you do some data mining. You want might want to find all
> occurence of a specific sequence in a large data pool. You can slice
> your data pool and have an independant job per slice and only aggregate
> the result of each jobs at the end (or as they finish).
This sounds more like a case for a general purpose processor. If it is a
special device then it will typically also have special memory to allow
fast searches.
On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> If by "entire industry" you mean everyone who might want to use hardware
> acceleration, for example, including mechanical computer-aided design,
> I am skeptical.
The industry designs GPUs with super-fast special RAM, and
accelerators with special RAM designed to do fast searches, and you
think you can demand-page that stuff in from the main processor?
On Fri, Apr 24, 2015 at 09:30:40AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
>
> > If by "entire industry" you mean everyone who might want to use hardware
> > acceleration, for example, including mechanical computer-aided design,
> > I am skeptical.
>
> The industry designs GPUs with super fast special ram and accellerators
> with special ram designed to do fast searches and you think you can demand page
> that stuff in from the main processor?
The demand paging is indeed a drawback for the option of using autonuma
to handle the migration. And again, this is not intended to replace the
careful hand-tuning that is required to get the last drop of performance
out of the system. It is instead intended to handle the cases where
the application needs substantially more performance than the CPUs alone
can deliver, but where the cost of full-fledged hand-tuning cannot be
justified.
You seem to believe that this latter category is the empty set, which
I must confess does greatly surprise me.
Thanx, Paul
On Fri, Apr 24, 2015 at 09:12:07AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
>
> >
> > DAX
> >
> > DAX is a mechanism for providing direct-memory access to
> > high-speed non-volatile (AKA "persistent") memory. Good
> > introductions to DAX may be found in the following LWN
> > articles:
>
> DAX is a mechanism to access memory not managed by the kernel and is the
> successor to XIP. It just happens to be needed for persistent memory.
> Fundamentally any driver can provide an MMAPPed interface to allow access
> to a devices memory.
I will take another look, but others in this thread have called out
difficulties with DAX's filesystem nature.
Thanx, Paul
On Fri, Apr 24, 2015 at 09:29:12AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Jerome Glisse wrote:
>
> > No this not have been solve properly. Today solution is doing an explicit
> > copy and again and again when complex data struct are involve (list, tree,
> > ...) this is extremly tedious and hard to debug. So today solution often
> > restrict themself to easy thing like matrix multiplication. But if you
> > provide a unified address space then you make things a lot easiers for a
> > lot more usecase. That's a fact, and again OpenCL 2.0 which is an industry
> > standard is a proof that unified address space is one of the most important
> > feature requested by user of GPGPU. You might not care but the rest of the
> > world does.
>
> You could use page tables on the kernel side to transfer data on demand
> from the GPU. And you can use a device driver to establish mappings to the
> GPUs memory.
>
> There is no copy needed with these approaches.
So you are telling me to do get_user_page()? If so, are you aware that
this pins memory? So what happens when the GPU wants to access a range
of 32GB of memory? Do I pin everything?
I am not talking only about transfers from the GPU to system memory; I
am talking about applications that do something like:

    dataset = mmap(NULL, 32UL << 30, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* ... application fills dataset, knowing nothing about GPUs ... */
    handle = dlopen("libsuperlibrary.so", RTLD_NOW);
    dosomething = dlsym(handle, "dosomething");
    dosomething(dataset);

The application here has no clue about the GPU, and we do not want to
change that; this is a valid use case and countless users ask for it.
How can the library give the GPU access to the dataset? Does it have
to call get_user_page() on every single page, effectively pinning the
memory? Should it allocate GPU memory through a special API and
memcpy()?
What HMM does is allow the process page table to be shared with the
GPU, so the GPU can transparently access the dataset (no pinning
whatsoever). Will there be page faults? There can be, and if so the
assumption is that you have more threads that do not fault than
threads that do, so the GPU stays saturated (i.e., all its units are
fed with something to do) while the page faults are resolved. For some
workloads, yes, you will see the penalty of a page fault, i.e., you
will have a group of threads that finishes late, but the thing you
seem to fail to get is that all the other GPU threads can make
progress and finish even before the page fault is resolved. It all
depends on the application. Moreover, if you have several
applications, the GPU can switch to a different application and make
progress on it too.
Overall, the throughput of the GPU will stay close to its theoretical
maximum if there are enough other threads that can make progress, and
this is very common.
>
> > > I think these two things need to be separated. The shift-the-memory-back-
> > > and-forth approach should be separate and if someone wants to use the
> > > thing then it should also work on other platforms like ARM and Intel.
> >
> > What IBM does with there platform is there choice, they can not force ARM
> > or Intel or AMD to do the same. Each of those might have different view
> > on what is their most important target. For instance i highly doubt ARM
> > cares about any of this.
>
> Well but the kernel code submitted should allow for easy use on other
> platform. I.e. Intel processors should be able to implement the
> "transparent" memory by establishing device mappings to PCI-E space
> and/or transferring data from the GPU and signaling the GPU to establish
> such a mapping.
HMM does that; it only requires the GPU to have a certain set of
features, and the only requirement on the platform is to offer a bus
that allows cache-coherent system memory access, such as PCIe.
But IBM here wants to go further and provide a more advanced solution,
so their needs are specific to their platform, and we cannot know
whether AMD, ARM, or Intel will want to go down the same road; they do
not seem to be interested. Does that mean we should not support IBM?
I think that would be wrong.
>
> > Only time critical application care about latency, everyone else cares
> > about throughput, where the applications can runs for days, weeks, months
> > before producing any useable/meaningfull results. Many of which do not
> > care a tiny bit about latency because they can perform independant
> > computation.
>
> Computationally intensive high performance application care about
> random latency introduced to computational threads because that is
> delaying the data exchange and thus slows everything down. And that is the
> typical case of a GPUI.
You assume that all HPC applications have heavy data exchange; I gave
you examples of applications where there is zero data exchange between
threads whatsoever. Those use cases exist and we want to support them
too.
Yes, for threads that do exchange data, page faults stall jobs, but
again we are talking about HPC where several _different_ applications
run in parallel and share resources, so while a page fault can block
part of one application, other applications can still make progress,
since the GPU can switch to working on them.
Moreover, the expectation is that page faults will remain rare events,
as a proper application should make sure that the dataset it is
working on is hot in memory.
>
> > Take a company rendering a movie for instance, they want to render the
> > millions of frame as fast as possible but each frame can be rendered
> > independently, they only share data is the input geometry, textures and
> > lighting but this are constant, the rendering of one frame does not
> > depend on the rendering of the previous (leaving post processing like
> > motion blur aside).
>
> The rendering would be done by the GPU and this will involve concurrency
> rapidly accessing data. Performance is certainly impacted if the GPU
> cannot use its own RAM designed for the proper feeding of its processing.
> And if you add a paging layer and swivel stuff below then this will be
> very bad.
>
> At minimum you need to shovel blocks of data into the GPU to allow it to
> operate undisturbed for a while on the data and do its job.
You completely misunderstand the design of what we are trying to
achieve; we are not trying to have a kernel thread that constantly
moves data around. For the autonuma case, you start by mapping system
memory to the GPU and the GPU starts working on it; after a while the
GPU reports statistics, autonuma kicks in and migrates memory to GPU
memory transparently, without interrupting the GPU, so the GPU keeps
running. While the job might start out limited by the bus bandwidth,
it will finish using the full device-memory bandwidth.
Now, this is only the autonuma case, and we never intended it to be
the only factor; on the contrary, the primary factor is the decision
made by the device driver. A device driver that gets information from
userspace can migrate the memory even before the job starts on the
GPU, and in that case autonuma will never touch your data at all. A
rough sketch of the two paths follows.
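(Purely illustrative: every ccad_*() helper below is an invented name;
only the shape of the flow matters.)

    /* Path 1: driver-directed, based on a userspace hint. */
    if (hint && hint->prefers_devmem)
            ccad_migrate_to_device(dev, hint->addr, hint->len);

    ccad_launch_job(dev, job);      /* otherwise start on system memory */

    /* Path 2: autonuma-style, driven by hardware access counters. */
    while (!ccad_job_done(dev, job)) {
            ccad_read_access_stats(dev, &stats);
            ccad_migrate_hot_pages(dev, &stats); /* transparent to the GPU */
            msleep(scan_interval_ms);
    }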
>
> > Same apply if you do some data mining. You want might want to find all
> > occurence of a specific sequence in a large data pool. You can slice
> > your data pool and have an independant job per slice and only aggregate
> > the result of each jobs at the end (or as they finish).
>
> This sounds more like a case for a general purpose processor. If it is a
> special device then it will typically also have special memory to allow
> fast searches.
No, this kind of thing can be fast on a GPU; with a GPU you easily
have 500x more cores than CPU cores, so you can slice the dataset even
further and have each GPU core perform the search. Note that I am not
only thinking of a simple memcmp() here; it can be something more
complex, like searching for a pattern that allows variations and
requires a whole program to decide whether a chunk falls under the
variation rules or not.
Cheers,
Jérôme
On Fri, Apr 24, 2015 at 07:57:38AM -0700, Paul E. McKenney wrote:
> On Fri, Apr 24, 2015 at 09:12:07AM -0500, Christoph Lameter wrote:
> > On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> >
> > >
> > > DAX
> > >
> > > DAX is a mechanism for providing direct-memory access to
> > > high-speed non-volatile (AKA "persistent") memory. Good
> > > introductions to DAX may be found in the following LWN
> > > articles:
> >
> > DAX is a mechanism to access memory not managed by the kernel and is the
> > successor to XIP. It just happens to be needed for persistent memory.
> > Fundamentally any driver can provide an MMAPPed interface to allow access
> > to a devices memory.
>
> I will take another look, but others in this thread have called out
> difficulties with DAX's filesystem nature.
Do not waste your time on that; it is not what we want. Christoph here
is more than stubborn and fails to see the world.
Cheers,
Jérôme
On Fri, 24 Apr 2015, Paul E. McKenney wrote:
> can deliver, but where the cost of full-fledge hand tuning cannot be
> justified.
>
> You seem to believe that this latter category is the empty set, which
> I must confess does greatly surprise me.
If compromises are already being made, then why would you want to
modify the kernel for this? Some userspace coding and device drivers
should be sufficient.
On Fri, 24 Apr 2015, Paul E. McKenney wrote:
> > DAX is a mechanism to access memory not managed by the kernel and is the
> > successor to XIP. It just happens to be needed for persistent memory.
> > Fundamentally any driver can provide an MMAPPed interface to allow access
> > to a devices memory.
>
> I will take another look, but others in this thread have called out
> difficulties with DAX's filesystem nature.
Right, so you do not need the filesystem structure. Simply writing a
device driver that mmaps data as needed from the coprocessor will also
do the trick.
On 04/24/2015 10:01 AM, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
>
>>> As far as I know Jerome is talkeing about HPC loads and high performance
>>> GPU processing. This is the same use case.
>>
>> The difference is sensitivity to latency. You have latency-sensitive
>> HPC workloads, and Jerome is talking about HPC workloads that need
>> high throughput, but are insensitive to latency.
>
> Those are correlated.
>
>>> What you are proposing for High Performacne Computing is reducing the
>>> performance these guys trying to get. You cannot sell someone a Volkswagen
>>> if he needs the Ferrari.
>>
>> You do need the low-latency Ferrari. But others are best served by a
>> high-throughput freight train.
>
> The problem is that they want to run 2000 trains at the same time
> and they all must arrive at the destination before they can be send on
> their next trip. 1999 trains will be sitting idle because they need
> to wait of the one train that was delayed. This reduces the troughput.
> People really would like all 2000 trains to arrive on schedule so that
> they get more performance.
So you run 4000 or even 6000 trains, and have some subset of them
run at full steam, while others are waiting on memory accesses.
In reality the overcommit factor is likely much smaller, because
the GPU threads run and block on memory in smaller, more manageable
numbers, say a few dozen at a time.
--
All rights reversed
On Fri, Apr 24, 2015 at 09:30:40AM -0500, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
>
> > If by "entire industry" you mean everyone who might want to use hardware
> > acceleration, for example, including mechanical computer-aided design,
> > I am skeptical.
>
> The industry designs GPUs with super fast special ram and accellerators
> with special ram designed to do fast searches and you think you can demand page
> that stuff in from the main processor?
>
Why do you think AMD and NVidia are adding page-fault support to their
GPUs in the first place? They are not doing this on a whim; they have
thought carefully about it.
Are you saying you know better than the two biggest GPU designers on
the planet?
And who do you think is pushing for such things in the kernel? Do you
think we are working on this on a whim? Because we woke up one day and
thought that it would be cool and that it should be done this way?
Yes, if all your GPU does is page fault, it will be disastrous, but is
that the usual thing we see on CPUs? No! Are people complaining about
the numerous page faults that happen over a day? No, the vast majority
of users are completely oblivious to page faults. This is how it works
on CPUs, and yes, it can work for GPUs too. What happens on a CPU?
Well, the CPU can switch to working on a different thread or a
different application altogether. The same thing will happen on the
GPU. If you have enough jobs, your GPU will be busy and you will never
worry about page faults, because overall your GPU will deliver the
same kind of throughput as if there were no page faults. They can very
well be buried in the overall noise if the ratio of runnable threads
to page-faulting threads is high enough. That is the case most of the
time on the CPU, so why would the same assumption not work on the GPU?
Note that I am not dismissing the low-latency folks; I know they
exist, I know they hate page faults, and in no way will what we
propose make things worse for them. They will be able to keep the same
kind of control they cherish, but that does not mean you should go on
a holy crusade to pretend that other people's workloads do not exist.
They do exist. The page fault is not evil; it has proved useful to the
whole computer industry on the CPU side.
To be sure you are not misinterpreting what we propose: in no way are
we saying that we are going to migrate things on page fault for
everyone. We are saying that, first, the device driver decides where
things need to be (system memory or local memory); the device driver
can get hints/requests from userspace for this (as drivers do today).
So no change whatsoever here; people who hand-tune things will keep
being able to do so.
Now we want to add the case where the device driver does not get any
kind of directive or hint from userspace. The autonuma approach simply
collects information from the GPU about what is accessed often and
then migrates it transparently (yes, this can happen without
interrupting the GPU). So you are migrating from memory that has
16GB/s or 32GB/s of bandwidth to device memory that has 500GB/s.
This is a valid use case; there are many people out there who do not
want to learn about hand-tuning their applications for the GPU, but
who could nonetheless benefit from it.
Cheers,
Jérôme
On Fri, 24 Apr 2015, Jerome Glisse wrote:
> On Fri, Apr 24, 2015 at 09:29:12AM -0500, Christoph Lameter wrote:
> > On Thu, 23 Apr 2015, Jerome Glisse wrote:
> >
> > > No, this has not been solved properly. Today's solution is doing an explicit
> > > copy, again and again, and when complex data structures are involved (lists,
> > > trees, ...) this is extremely tedious and hard to debug. So today's solutions
> > > often restrict themselves to easy things like matrix multiplication. But if you
> > > provide a unified address space then you make things a lot easier for a
> > > lot more use cases. That's a fact, and again OpenCL 2.0, which is an industry
> > > standard, is proof that a unified address space is one of the most important
> > > features requested by users of GPGPU. You might not care, but the rest of the
> > > world does.
> >
> > You could use page tables on the kernel side to transfer data on demand
> > from the GPU. And you can use a device driver to establish mappings to the
> > GPUs memory.
> >
> > There is no copy needed with these approaches.
>
> So you are telling me to do get_user_page()? If so, are you aware that this pins
> memory? So what happens when the GPU wants to access a range of 32GB of
> memory? Do I pin everything?
Use either a device driver to create PTEs pointing to the data or do
something similar to what DAX does. Pinning can be avoided if you use
mmu_notifiers. Those will give you a callback before the OS removes the
data, and thus you can operate without pinning.
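For concreteness, the shape of that mmu_notifier approach (as of the
time of this thread), as a minimal sketch where struct ccad_ctx,
mn_to_ccad() and ccad_unmap_range() are hypothetical driver pieces:

#include <linux/mmu_notifier.h>

struct ccad_ctx {
	struct mmu_notifier mn;
	/* ... device mapping state ... */
};

#define mn_to_ccad(notifier)	container_of(notifier, struct ccad_ctx, mn)

static void ccad_unmap_range(struct ccad_ctx *ctx,
			     unsigned long start, unsigned long end);

static void ccad_invalidate_range_start(struct mmu_notifier *mn,
					struct mm_struct *mm,
					unsigned long start,
					unsigned long end)
{
	/* The core mm is about to unmap, change, or migrate [start, end).
	 * Tear down the device's mappings of that range and wait for
	 * in-flight device accesses to drain, so no pinning is needed. */
	ccad_unmap_range(mn_to_ccad(mn), start, end);
}

static const struct mmu_notifier_ops ccad_mn_ops = {
	.invalidate_range_start = ccad_invalidate_range_start,
};

static int ccad_register_mm(struct ccad_ctx *ctx, struct mm_struct *mm)
{
	ctx->mn.ops = &ccad_mn_ops;
	return mmu_notifier_register(&ctx->mn, mm);
}

The notifier keeps the device's view of the address space in sync, but
it does not by itself migrate pages into device memory, which is the
part discussed below.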
> Overall the throughput of the GPU will stay close to its theoretical maximum
> if you have enough other threads that can progress, and this is very common.
GPUs operate on groups of threads, not single ones. If you stall,
then there will be a stall of a whole group of them. We are dealing with
accelerators here that are different for performance reasons. They are
not to be treated like regular processors, nor does their memory operate
like host memory.
> But IBM here wants to go further and to provide a more advanced solution,
> so their needs are specific to their platform, and we cannot know if AMD,
> ARM or Intel will want to go down the same road; they do not seem to be
> interested. Does that mean we should not support IBM? I think it would be
> wrong.
What exactly is the more advanced version's benefit? What are the features
that the other platforms do not provide?
> > This sounds more like a case for a general purpose processor. If it is a
> > special device then it will typically also have special memory to allow
> > fast searches.
>
> No this kind of thing can be fast on a GPU, with GPU you easily have x500
> more cores than CPU cores, so you can slice the dataset even more and have
> each of the GPU core perform the search. Note that i am not only thinking
> of stupid memcmp here it can be something more complex like searching a
> pattern that allow variation and that require a whole program to decide if
> a chunk falls under the variation rules or not.
Then you have the problem of fast memory access and you are proposing to
complicate that access path on the GPU.
On 04/24/2015 11:49 AM, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Paul E. McKenney wrote:
>
>> can deliver, but where the cost of full-fledged hand tuning cannot be
>> justified.
>>
>> You seem to believe that this latter category is the empty set, which
>> I must confess does greatly surprise me.
>
> If there are already compromises being made, then why would you want to
> modify the kernel for this? Some user space coding and device drivers
> should be sufficient.
You assume only one program at a time would get to use the GPU
for accelerated computations, and the GPU would get dedicated
to that program.
That will not be the case when you have libraries using the GPU
for computations. There could be dozens of programs in the system
using that library, with no knowledge of how many GPU resources
are used by the other programs.
There is a very clear cut case for having the OS manage the
GPU resources transparently, just like it does for all the
other resources in the system.
--
All rights reversed
On 04/24/2015 10:30 AM, Christoph Lameter wrote:
> On Thu, 23 Apr 2015, Paul E. McKenney wrote:
>
>> If by "entire industry" you mean everyone who might want to use hardware
>> acceleration, for example, including mechanical computer-aided design,
>> I am skeptical.
>
> The industry designs GPUs with super fast special ram and accellerators
> with special ram designed to do fast searches and you think you can demand page
> that stuff in from the main processor?
DRAM access latencies are a few hundred CPU cycles, but somehow
CPUs can still do computations at a fast speed, and we do not
require gigabytes of L2-cache-speed memory in the system.
It turns out the vast majority of programs have working sets,
and data access patterns where prefetching works satisfactorily.
With GPU calculations done transparently by libraries, and
largely hidden from programs, why would this be any different?
--
All rights reversed
On Fri, Apr 24, 2015 at 11:03:52AM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
>
> > On Fri, Apr 24, 2015 at 09:29:12AM -0500, Christoph Lameter wrote:
> > > On Thu, 23 Apr 2015, Jerome Glisse wrote:
> > >
> > > > No this not have been solve properly. Today solution is doing an explicit
> > > > copy and again and again when complex data struct are involve (list, tree,
> > > > ...) this is extremly tedious and hard to debug. So today solution often
> > > > restrict themself to easy thing like matrix multiplication. But if you
> > > > provide a unified address space then you make things a lot easiers for a
> > > > lot more usecase. That's a fact, and again OpenCL 2.0 which is an industry
> > > > standard is a proof that unified address space is one of the most important
> > > > feature requested by user of GPGPU. You might not care but the rest of the
> > > > world does.
> > >
> > > You could use page tables on the kernel side to transfer data on demand
> > > from the GPU. And you can use a device driver to establish mappings to the
> > > GPUs memory.
> > >
> > > There is no copy needed with these approaches.
> >
> > So you are telling me to do get_user_page() ? If so you aware that this pins
> > memory ? So what happens when the GPU wants to access a range of 32GB of
> > memory ? I pin everything ?
>
> Use either a device driver to create PTEs pointing to the data or do
> something similar like what DAX does. Pinning can be avoided if you use
> mmu_notifiers. Those will give you a callback before the OS removes the
> data and thus you can operate without pinning.
So you are actually telling me to do what I am doing inside the HMM patchset?
Because what you seem to say here is exactly what the HMM patchset does.
So you are acknowledging that we need work inside the kernel?
That being said, Paul has the chance to have a more advanced platform where
what I am doing would actually be underusing the capabilities of the platform.
So he needs a different solution.
>
> > Overall the throughput of the GPU will stay close to its theoritical maximum
> > if you have enough other thread that can progress and this is very common.
>
> GPUs operate on groups of threads not single ones. If you stall
> then there will be a stall of a whole group of them. We are dealing with
> accellerators here that are different for performance reasons. They are
> not to be treated like regular processor, nor is memory like
> operating like host mmemory.
Again, I know how GPUs work; they work on groups of threads, and I am well
aware that the group size is often 32 or 64 threads. But they keep in the
hardware a large pool of thread groups, something like 2^11 or 2^12 thread
groups in flight for 2^4 or 2^5 units capable of working on thread groups
(in thread count this is 2^15/2^16 threads in flight for 2^9/2^10 cores).
So, again, as on the CPU, we do not expect the whole 2^11/2^12 groups of
threads to hit a page fault, and I am saying that as long as only a small
number of groups hit one, let us say 2^3 groups (2^8/2^9 threads), then you
still have a large number of thread groups that can make progress without
being impacted whatsoever.
And you can bet that GPU designers are also improving this by allowing them
to swap out faulting thread groups and swap in runnable ones, so the overall
2^16 threads in flight might be a lot bigger in future hardware, giving even
more chances to hide page faults.
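To put rough numbers on that argument (using the illustrative figures
above, not any particular piece of hardware):

#include <stdio.h>

int main(void)
{
	long groups_in_flight = 1 << 11;  /* resident thread groups        */
	long group_size       = 64;       /* threads per group (warp/wave) */
	long compute_units    = 1 << 5;   /* units that execute groups     */
	long faulting_groups  = 1 << 3;   /* groups stalled on page faults */

	printf("threads in flight:  %ld\n", groups_in_flight * group_size);
	printf("runnable groups/CU: %ld\n",
	       (groups_in_flight - faulting_groups) / compute_units);
	printf("groups stalled:     %.2f%%\n",
	       100.0 * faulting_groups / groups_in_flight);
	return 0;
}

With those numbers each compute unit still has over 60 runnable groups to
pick from, and well under 1% of the groups are waiting on a fault.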
A GPU can operate on host memory, and you can still saturate the GPU with
host memory as long as the workloads you are running are not
bandwidth-starved. I know this is unlikely for a GPU, but again, think of
several _different_ applications; some of those applications might already
have their dataset in GPU memory and thus can run alongside slower threads
that are limited by the system memory bandwidth. But still, you can saturate
your GPU that way.
>
> > But IBM here want to go further and to provide a more advance solution,
> > so their need are specific to there platform and we can not know if AMD,
> > ARM or Intel will want to go down the same road, they do not seem to be
> > interested. Does it means we should not support IBM ? I think it would be
> > wrong.
>
> What exactly is the more advanced version's benefit? What are the features
> that the other platforms do not provide?
Transparent access to device memory from the CPU: you can map any of the GPU
memory on the CPU side and have full cache coherency, including proper
atomic memory operations. CAPI is not some mumbo-jumbo marketing name; there
is real hardware behind it.
On x86 you have to take into account the PCI BAR size, and you also have to
take into account that PCIe transactions are really bad when it comes to
sharing memory with the CPU. CAPI really improves things here.
So on x86, even if you could map all the GPU memory, it would still be a bad
solution, and things like atomic memory operations might not even work
properly.
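To make "proper atomic memory operations" concrete: with full coherency,
once device memory is mapped into the process, plain CPU atomics work on
it unchanged. The stand-in buffer below would simply be replaced by
memory obtained from the device driver's mmap:

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* Stand-in for a word living in mapped device memory. */
	_Atomic unsigned long *counter = calloc(1, sizeof(*counter));

	/* Ordinary atomic read-modify-write; on a coherent device the
	 * GPU could increment the same word concurrently and both sides
	 * would observe a consistent value.  Over a non-coherent PCIe
	 * mapping this property is not guaranteed. */
	atomic_fetch_add(counter, 1);
	printf("counter = %lu\n", atomic_load(counter));

	free(counter);
	return 0;
}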
>
> > > This sounds more like a case for a general purpose processor. If it is a
> > > special device then it will typically also have special memory to allow
> > > fast searches.
> >
> > No this kind of thing can be fast on a GPU, with GPU you easily have x500
> > more cores than CPU cores, so you can slice the dataset even more and have
> > each of the GPU core perform the search. Note that i am not only thinking
> > of stupid memcmp here it can be something more complex like searching a
> > pattern that allow variation and that require a whole program to decide if
> > a chunk falls under the variation rules or not.
>
> Then you have the problem of fast memory access and you are proposing to
> complicate that access path on the GPU.
No, I am proposing to have a solution where people doing that kind of
workload can leverage the GPU; yes, it will not be as fast as people
hand-tuning and rewriting their application for the GPU, but it will still
be faster by a significant factor than only using the CPU.
Moreover, I am saying that this can happen without even touching a single
line of code of many, many applications, because many of them rely on
libraries, and those are the only ones that would need to know about the GPU.
Finally, I am saying that having a unified address space between the GPU and
the CPU is a primordial prerequisite for this to happen in a transparent
fashion, and thus the DAX solution is nonsense and does not provide
transparent address space sharing. The DAX solution is not even something
new; this is how today's stack works. No need for DAX: userspace just mmaps
the device driver file, and that is how it accesses the GPU-accessible memory
(which in most cases is just system memory mapped through the device file to
the user application).
Cheers,
Jérôme
On Fri, 24 Apr 2015, Jerome Glisse wrote:
> > What exactly is the more advanced version's benefit? What are the features
> > that the other platforms do not provide?
>
> Transparent access to device memory from the CPU, you can map any of the GPU
> memory inside the CPU and have the whole cache coherency including proper
> atomic memory operation. CAPI is not some mumbo jumbo marketing name there
> is real hardware behind it.
Got the hardware here, but I am getting pretty sobered given what I heard
here. The IBM mumbo-jumbo marketing comes down to "not much" now.
> On x86 you have to take into account the PCI bar size, you also have to take
> into account that PCIE transaction are really bad when it comes to sharing
> memory with CPU. CAPI really improve things here.
Ok that would be interesting for the general device driver case. Can you
show a real performance benefit here of CAPI transactions vs. PCI-E
transactions?
> So on x86 even if you could map all the GPU memory it would still be a bad
> solution and thing like atomic memory operation might not even work properly.
That is solvable and doable in many other ways if needed. Actually I'd
prefer a Xeon Phi in that case, because then we also have the same
instruction set. Making locks work right with different instruction sets
and different coherency schemes... Ewww.
> > Then you have the problem of fast memory access and you are proposing to
> > complicate that access path on the GPU.
>
> No, i am proposing to have a solution where people doing such kind of work
> load can leverage the GPU, yes it will not be as fast as people hand tuning
> and rewritting their application for the GPU but it will still be faster
> by a significant factor than only using the CPU.
Well, the general-purpose processors are also gaining more floating-point
capability, which increases the pressure on accelerators to become more
specialized.
> Moreover i am saying that this can happen without even touching a single
> line of code of many many applications, because many of them rely on library
> and those are the only one that would need to know about GPU.
Yea. We have heard this numerous times in parallel computing and it never
really worked right.
> Finaly i am saying that having a unified address space btw the GPU and CPU
> is a primordial prerequisite for this to happen in a transparent fashion
> and thus DAX solution is non-sense and does not provide transparent address
> space sharing. DAX solution is not even something new, this is how today
> stack is working, no need for DAX, userspace just mmap the device driver
> file and that's how they access the GPU accessible memory (which in most
> case is just system memory mapped through the device file to the user
> application).
Right, this is how things work, and you could improve on that. Stay with the
scheme. Why would that not work, if you map things the same way in both
environments, given that both the accelerator and the host processor can
access each other's memory?
On Fri, Apr 24, 2015 at 11:58:39AM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
>
> > > What exactly is the more advanced version's benefit? What are the features
> > > that the other platforms do not provide?
> >
> > Transparent access to device memory from the CPU, you can map any of the GPU
> > memory inside the CPU and have the whole cache coherency including proper
> > atomic memory operation. CAPI is not some mumbo jumbo marketing name there
> > is real hardware behind it.
>
> Got the hardware here but I am getting pretty sobered given what I heard
> here. The IBM mumbo jumpo marketing comes down to "not much" now.
>
> > On x86 you have to take into account the PCI bar size, you also have to take
> > into account that PCIE transaction are really bad when it comes to sharing
> > memory with CPU. CAPI really improve things here.
>
> Ok that would be interesting for the general device driver case. Can you
> show a real performance benefit here of CAPI transactions vs. PCI-E
> transactions?
I am sure IBM will show benchmarks here when they have everything in place.
I am not working on CAPI personally; I just went through some of its
specification.
> > So on x86 even if you could map all the GPU memory it would still be a bad
> > solution and thing like atomic memory operation might not even work properly.
>
> That is solvable and doable in many other ways if needed. Actually I'd
> prefer a Xeon Phi in that case because then we also have the same
> instruction set. Having locks work right with different instruction sets
> and different coherency schemes. Ewww...
>
Well, then go the Xeon Phi way, and let the people who want to provide a
different, simpler (from the programmer's point of view) solution work on it.
>
> > > Then you have the problem of fast memory access and you are proposing to
> > > complicate that access path on the GPU.
> >
> > No, i am proposing to have a solution where people doing such kind of work
> > load can leverage the GPU, yes it will not be as fast as people hand tuning
> > and rewritting their application for the GPU but it will still be faster
> > by a significant factor than only using the CPU.
>
> Well the general purpose processors also also gaining more floating point
> capabilities which increases the pressure on accellerators to become more
> specialized.
>
> > Moreover i am saying that this can happen without even touching a single
> > line of code of many many applications, because many of them rely on library
> > and those are the only one that would need to know about GPU.
>
> Yea. We have heard this numerous times in parallel computing and it never
> really worked right.
Because you had a split userspace: a pointer value was not pointing to the
same thing on the GPU as on the CPU, so porting a library or application is
hard and troublesome. AMD is already working on porting general applications
and libraries to leverage the brave new world of a shared address space
(LibreOffice, GIMP, ...). Other people keep pressuring for the same address
space; again, this is the cornerstone of OpenCL 2.0.
I cannot predict whether it will work this time, whether all meaningful and
useful libraries will start leveraging the GPU. All I am trying to do is
solve the split address space problem. A problem that you seem to ignore
completely because you are happy with the way things are. Other people are
not happy.
>
> > Finaly i am saying that having a unified address space btw the GPU and CPU
> > is a primordial prerequisite for this to happen in a transparent fashion
> > and thus DAX solution is non-sense and does not provide transparent address
> > space sharing. DAX solution is not even something new, this is how today
> > stack is working, no need for DAX, userspace just mmap the device driver
> > file and that's how they access the GPU accessible memory (which in most
> > case is just system memory mapped through the device file to the user
> > application).
>
> Right this is how things work and you could improve on that. Stay with the
> scheme. Why would that not work if you map things the same way in both
> environments if both accellerator and host processor can acceess each
> others memory?
Again and again: a shared address space, having a pointer mean the same
thing for the GPU as it means for the CPU, i.e., having a random pointer
point to the same memory whether it is accessed by the GPU or the CPU,
while also keeping the properties of the backing memory. It can be shared
memory from another process, a file mmapped from disk, or simply anonymous
memory, and thus we have no control whatsoever over how such memory is
allocated.
Then you add transparent migration (transparent in the sense that we can
handle CPU page faults on migrated memory) and you will see that you need
to modify the kernel to become aware of this and to provide common code
to deal with all of this.
Cheers,
Jérôme
On 04/23/2015 07:22 PM, Jerome Glisse wrote:
> On Thu, Apr 23, 2015 at 09:20:55AM -0500, Christoph Lameter wrote:
>> On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:
>>
>>>> There are hooks in glibc where you can replace the memory
>>>> management of the apps if you want that.
>>>
>>> We don't control the app. Let's say we are doing a plugin for libfoo
>>> which accelerates "foo" using GPUs.
>>
>> There are numerous examples of malloc implementation that can be used for
>> apps without modifying the app.
>
> What about shared memory passed between processes? Or mmapped files? Or
> a library that is loaded through dlopen and thus has no way to
> control any allocation that happened before it became active?
>
>>>
>>> Now some other app we have no control on uses libfoo. So pointers
>>> already allocated/mapped, possibly a long time ago, will hit libfoo (or
>>> the plugin) and we need GPUs to churn on the data.
>>
>> IF the GPU would need to suspend one of its computation thread to wait on
>> a mapping to be established on demand or so then it looks like the
>> performance of the parallel threads on a GPU will be significantly
>> compromised. You would want to do the transfer explicitly in some fashion
>> that meshes with the concurrent calculation in the GPU. You do not want
>> stalls while GPU number crunching is ongoing.
>
> You do not understand how GPUs work. A GPU has a pool of threads, and it
> always tries to have the pool as big as possible, so that when a group of
> threads is waiting for some memory access, there are other threads ready
> to perform some operation. GPUs are about hiding memory latency; that is
> what they are good at. But they only achieve that when they have more
> threads in flight than compute units. The whole thread scheduling is done
> by hardware and is barely controlled by the device driver.
>
> So no, having the GPU wait for a page fault is not as dramatic as you
> think. If you use GPUs as they are intended to be used, you might even
> never notice the page faults and reach close to the theoretical throughput
> of the GPU nonetheless.
>
>
>>
>>> The point I'm making is you are arguing against a usage model which has
>>> been repeatedly asked for by large numbers of customers (after all, that's
>>> also why HMM exists).
>>
>> I am still not clear on what the use case for this would be. Who is asking
>> for this?
>
> Everyone but you? OpenCL 2.0 specifically requests it and has several levels
> of support for a transparent address space. The lowest one is the one
> implemented today, in which applications need to use a special memory
> allocator.
>
> The most advanced one implies integration with the kernel, in which any
> memory (mmapped file, shared memory or anonymous memory) can be used by
> the GPU and does not need to come from a special allocator.
>
> Everyone in the industry is moving toward the most advanced one. That
> is the raison d'être of HMM, to provide this functionality on hardware
> platforms that do not have things such as CAPI. Which is x86/ARM.
>
> So the use case is all applications using OpenCL or CUDA. So pretty much
> everyone doing GPGPU wants this. I don't know how you can't see that.
> A shared address space is so much easier. Believe it or not, most coders
> do not have deep knowledge of how things work, and if you can remove
> the complexity of different memory allocations and different address
> spaces from them, they will be happy.
>
> Cheers,
> Jérôme
I second what Jerome said, and add that one of the key features of HSA
is the ptr-is-a-ptr scheme, where the applications do *not* need to
handle different address spaces. Instead, all the memory is seen as a
unified address space.
See slide 6 on the following presentation:
http://www.slideshare.net/hsafoundation/hsa-overview
Thanks,
Oded
On Fri, 24 Apr 2015, Jerome Glisse wrote:
> > Right this is how things work and you could improve on that. Stay with the
> > scheme. Why would that not work if you map things the same way in both
> > environments if both accellerator and host processor can acceess each
> > others memory?
>
> Again and again share address space, having a pointer means the same thing
> for the GPU than it means for the CPU ie having a random pointer point to
> the same memory whether it is accessed by the GPU or the CPU. While also
> keeping the property of the backing memory. It can be share memory from
> other process, a file mmaped from disk or simply anonymous memory and
> thus we have no control whatsoever on how such memory is allocated.
Still no answer as to why that is not possible with the current scheme?
You keep on talking about pointers, and I keep on responding that this is a
matter of making the address space compatible on both sides.
> Then you had transparent migration (transparent in the sense that we can
> handle CPU page fault on migrated memory) and you will see that you need
> to modify the kernel to become aware of this and provide a common code
> to deal with all this.
If the GPU works like a CPU (which I keep hearing) then you should also be
able to run a Linux kernel on it and make it a regular NUMA node. Hey, why
don't we make the host CPU a GPU (hello Xeon Phi).
On Fri, Apr 24, 2015 at 01:56:45PM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
>
> > > Right this is how things work and you could improve on that. Stay with the
> > > scheme. Why would that not work if you map things the same way in both
> > > environments if both accellerator and host processor can acceess each
> > > others memory?
> >
> > Again and again share address space, having a pointer means the same thing
> > for the GPU than it means for the CPU ie having a random pointer point to
> > the same memory whether it is accessed by the GPU or the CPU. While also
> > keeping the property of the backing memory. It can be share memory from
> > other process, a file mmaped from disk or simply anonymous memory and
> > thus we have no control whatsoever on how such memory is allocated.
>
> Still no answer as to why is that not possible with the current scheme?
> You keep on talking about pointers and I keep on responding that this is a
> matter of making the address space compatible on both sides.
So if we do that in a naive way, how can we migrate a chunk of memory to
video memory while still properly handling the case where the CPU tries to
access that same memory while it is migrated to the GPU memory?
Without modifying a single line of mm code, the only way to do this is to
either unmap the range being migrated from the CPU page table or to mprotect
it in some way. In both cases the CPU access will trigger some kind of fault.
This is not the behavior we want. What we want is the same address space
while being able to migrate system memory to device memory (who makes that
decision should not be part of this discussion) while still gracefully
handling any CPU access.
This means that if the CPU accesses it, we want to migrate the memory back
to system memory. To achieve this there is no way around adding a couple of
ifs inside the mm page-fault code path. Now, do you want each driver to add
its own if branches, or do you want a common infrastructure to do just that?
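To make the "couple of ifs" concrete, here is the rough shape such a
common hook could take; everything below (is_device_entry(),
device_entry_to_owner(), the migrate_back() method) is made up for this
sketch and is not the actual HMM code:

/* Called from the page-fault path when a PTE is a special entry left
 * behind for a page that was migrated to device memory. */
static int handle_device_entry(struct vm_area_struct *vma,
			       unsigned long address, pte_t pte)
{
	struct device_memory_owner *owner;

	if (!is_device_entry(pte))
		return VM_FAULT_FALLBACK;	/* not ours, normal path */

	owner = device_entry_to_owner(pte);

	/* Copy the page back to system memory, fix up the CPU page
	 * table, and tell the device to use the system copy again. */
	return owner->ops->migrate_back(owner, vma, address);
}

Each driver would then supply a migrate_back() implementation instead of
duplicating the fault-path logic.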
As I keep saying, the solution you propose is what we have today: today we
have a fake shared address space through the trick of remapping system
memory at the same address inside the GPU address space, and also enforcing
the use of a special memory allocator that goes behind the back of the mm
code.
But this is limited to only using system memory; you cannot use video memory
transparently through such a scheme. One trick used today is to copy memory
to device memory and, ignoring CPU access, pretend it cannot happen, and as
such the GPU and CPU can diverge in what they see for the same address.
We want to avoid tricks like this that just lead to weird and unexpected
behavior.
As you pointed out, not using GPU memory is a waste, and we want to be able
to use it. Now, Paul has more sophisticated hardware that offers
opportunities to do things in a more transparent and efficient way.
>
> > Then you had transparent migration (transparent in the sense that we can
> > handle CPU page fault on migrated memory) and you will see that you need
> > to modify the kernel to become aware of this and provide a common code
> > to deal with all this.
>
> If the GPU works like a CPU (which I keep hearing) then you should also be
> able to run a linu8x kernel on it and make it a regular NUMA node. Hey why
> dont we make the host cpu a GPU (hello Xeon Phi).
I am not saying it works like a CPU; I am saying it should face the same
kind of pattern when it comes to page faults, i.e., page faults are not the
end of the world for the GPU, and you should not assume that all GPU threads
will wait for a page fault, because this is not the common case on the CPU.
Yes, we prefer it when page faults never happen; so does the CPU.
No, you cannot run the Linux kernel on the GPU unless you are willing to
allow the kernel to run on a heterogeneous architecture with different
instruction sets, not even going into the problematics of ring level/system
level. We might one day go down that road, but I see no compelling reason
today.
Cheers,
Jérôme
On Fri, 24 Apr 2015, Jerome Glisse wrote:
> > Still no answer as to why is that not possible with the current scheme?
> > You keep on talking about pointers and I keep on responding that this is a
> > matter of making the address space compatible on both sides.
>
> So if do that in a naive way, how can we migrate a chunk of memory to video
> memory while still handling properly the case where CPU try to access that
> same memory while it is migrated to the GPU memory.
Well, that is the same issue that the migration code, which I submitted
to the kernel a long time ago, is handling.
> Without modifying a single line of mm code, the only way to do this is to
> either unmap from the cpu page table the range being migrated or to mprotect
> it in some way. In both case the cpu access will trigger some kind of fault.
Yes that is how Linux migration works. If you can fix that then how about
improving page migration in Linux between NUMA nodes first?
> This is not the behavior we want. What we want is same address space while
> being able to migrate system memory to device memory (who make that decision
> should not be part of that discussion) while still gracefully handling any
> CPU access.
Well then there could be a situation where you have concurrent write
access. How do you reconcile that then? Somehow you need to stall one or
the other until the transaction is complete.
> This means if CPU access it we want to migrate memory back to system memory.
> To achieve this there is no way around adding couple of if inside the mm
> page fault code path. Now do you want each driver to add its own if branch
> or do you want a common infrastructure to do just that ?
If you can improve the page migration in general then we certainly would
love that. Having faultless migration is certainly a good thing for a lot of
functionality that depends on page migration.
> As i keep saying the solution you propose is what we have today, today we
> have fake share address space through the trick of remapping system memory
> at same address inside the GPU address space and also enforcing the use of
> a special memory allocator that goes behind the back of mm code.
Hmmm... I'd like to know more details about that.
> As you pointed out, not using GPU memory is a waste and we want to be able
> to use it. Now Paul have more sofisticated hardware that offer oportunities
> to do thing in a more transparent and efficient way.
Does this also work between NUMA nodes in a Power8 system?
On Fri, Apr 24, 2015 at 03:00:18PM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
>
> > > Still no answer as to why is that not possible with the current scheme?
> > > You keep on talking about pointers and I keep on responding that this is a
> > > matter of making the address space compatible on both sides.
> >
> > So if do that in a naive way, how can we migrate a chunk of memory to video
> > memory while still handling properly the case where CPU try to access that
> > same memory while it is migrated to the GPU memory.
>
> Well that the same issue that the migration code is handling which I
> submitted a long time ago to the kernel.
Yes, so you had to modify the kernel for that! So do we, and no, page
migration as it exists is not sufficient and does not cover all the use
cases we have.
>
> > Without modifying a single line of mm code, the only way to do this is to
> > either unmap from the cpu page table the range being migrated or to mprotect
> > it in some way. In both case the cpu access will trigger some kind of fault.
>
> Yes that is how Linux migration works. If you can fix that then how about
> improving page migration in Linux between NUMA nodes first?
In my case I cannot use the page migration because there is nowhere to hook
in to explain how to migrate things back and forth with a device. The
page-migration code is all on the CPU and enjoys the benefit of being able
to do things atomically; I do not have that luxury.
Moreover, the core mm code assumes that a CPU migration PTE entry is a
short-lived state. In the case of migration to device memory we are talking
about time spans of several minutes. So obviously the page migration is not
what we want; we want something similar but with different properties. That
is exactly what my HMM patchset provides.
What Paul wants to do, however, should be able to leverage the page
migration that does exist. But again, he has a far more advanced platform.
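For reference, the page migration that does exist is already usable from
userspace between NUMA nodes; on a platform where device memory shows up
as just another node, something as simple as move_pages(2) could in
principle drive placement (node numbers here are arbitrary):

#include <numaif.h>		/* move_pages(); link with -lnuma */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
	/* One page of ordinary anonymous memory, touched so it exists. */
	void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	void *pages[1] = { page };
	int nodes[1] = { 1 };	/* arbitrary target node */
	int status[1];

	*(char *)page = 0;

	/* Ask the kernel to migrate the page; the per-page result (node
	 * number or negative errno) comes back in status[]. */
	if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE))
		perror("move_pages");
	else
		printf("page is now on node %d\n", status[0]);
	return 0;
}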
>
> > This is not the behavior we want. What we want is same address space while
> > being able to migrate system memory to device memory (who make that decision
> > should not be part of that discussion) while still gracefully handling any
> > CPU access.
>
> Well then there could be a situation where you have concurrent write
> access. How do you reconcile that then? Somehow you need to stall one or
> the other until the transaction is complete.
No, it is exactly like threads on a CPU: if you have two threads that write
to the same address without any kind of synchronization between them, you
cannot predict what the end result will be. The same will happen here:
either the GPU write goes last or the CPU one does. Anyway, this is not the
use case we have in mind. We are thinking about concurrent access to the
same page but in a non-conflicting way. Any conflicting access is a software
bug, as it is in the case of CPU threads.
>
> > This means if CPU access it we want to migrate memory back to system memory.
> > To achieve this there is no way around adding couple of if inside the mm
> > page fault code path. Now do you want each driver to add its own if branch
> > or do you want a common infrastructure to do just that ?
>
> If you can improve the page migration in general then we certainly would
> love that. Having faultless migration is certain a good thing for a lot of
> functionality that depends on page migration.
The faultless migration I am talking about is only on the GPU side, but
this is just an extra feature where you keep something mapped read-only
while migrating it to device memory and update the GPU page table once
done. So the GPU will keep accessing system memory without interruption;
this assumes read-only access. Otherwise you need a faulting migration,
though you can cooperate with the thread scheduler to schedule other
threads while the migration is ongoing.
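A sketch of that read-only trick, with all of the dev_* helpers being
hypothetical device-driver pieces (this also assumes nothing writes the
page during the copy, as stated above):

#include <linux/highmem.h>

static int ccad_migrate_page_readonly(struct ccad_dev *dev,
				      struct page *sys_page,
				      unsigned long dev_vaddr)
{
	struct page *dev_page = dev_alloc_page(dev);

	/* 1. Downgrade the device mapping to read-only, so the device
	 *    can keep reading the system copy while we migrate. */
	dev_write_protect(dev, dev_vaddr);

	/* 2. Copy the data while no writes can occur. */
	copy_highpage(dev_page, sys_page);

	/* 3. Point the device page table at the new copy, read-write
	 *    again, and only then retire the system page. */
	dev_remap(dev, dev_vaddr, dev_page);
	return 0;
}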
>
> > As i keep saying the solution you propose is what we have today, today we
> > have fake share address space through the trick of remapping system memory
> > at same address inside the GPU address space and also enforcing the use of
> > a special memory allocator that goes behind the back of mm code.
>
> Hmmm... I'd like to know more details about that.
Well, there is no open-source OpenCL 2.0 stack for discrete GPUs. But the
idea is that you need a special allocator because the GPU driver needs to
know about all the possible pages that might be used, i.e., there is no
page fault, so all objects need to be mapped and thus all pages are pinned
down. Well, this is a little more complex, as the special allocator keeps
track of each allocation, creating an object for each of them and trying to
only pin objects that are used by the current shader.
Anyway, the bottom line is that it needs a special allocator; you cannot
directly use mmapped files, shared memory, or anonymous memory allocated
outside the special allocator. It requires pinning memory. It cannot migrate
memory to device memory. We want to fix all that.
>
> > As you pointed out, not using GPU memory is a waste and we want to be able
> > to use it. Now Paul have more sofisticated hardware that offer oportunities
> > to do thing in a more transparent and efficient way.
>
> Does this also work between NUMA nodes in a Power8 system?
My guess is that it just improves the device's exchanges with the CPU,
trying to make device memory accesses cost about as much as remote CPU
memory accesses would. I do not think it improves the regular NUMA nodes.
Cheers,
Jérôme
On Fri, 2015-04-24 at 11:58 -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
>
> > > What exactly is the more advanced version's benefit? What are the features
> > > that the other platforms do not provide?
> >
> > Transparent access to device memory from the CPU, you can map any of the GPU
> > memory inside the CPU and have the whole cache coherency including proper
> > atomic memory operation. CAPI is not some mumbo jumbo marketing name there
> > is real hardware behind it.
>
> Got the hardware here but I am getting pretty sobered given what I heard
> here. The IBM mumbo jumpo marketing comes down to "not much" now.
Ugh... first, nothing we propose precludes using it with explicit memory
management the way you want. So I don't know why you have a problem
here. We are trying to cover a *different* usage model than yours,
obviously. But they aren't exclusive.
Secondly, none of what we are discussing here is supported by *existing*
hardware, so whatever you have is not concerned. There is no CAPI-based
coprocessor today that provides cachable memory to the system (though
CAPI as a technology supports it), and no GPU doing that either *yet*.
Today's CAPI adapters can own host cache lines but don't expose large
swaths of cachable local memory.
Finally, this discussion is not even specifically about CAPI or its
performance. It's about the *general* case of a coherent coprocessor
sharing the MMU, whether it's using CAPI or whatever other technology
that allows that sort of thing, which we may or may not be able to
mention at this point.
CAPI is just an example because architecturally it allows that too.
Ben.
On 04/21/2015 05:44 PM, Paul E. McKenney wrote:
> AUTONUMA
>
> The Linux kernel's autonuma facility supports migrating both
> memory and processes to promote NUMA memory locality. It was
> accepted into 3.13 and is available in RHEL 7.0 and SLES 12.
> It is enabled by the Kconfig variable CONFIG_NUMA_BALANCING.
>
> This approach uses a kernel thread "knuma_scand" that periodically
> marks pages inaccessible. The page-fault handler notes any
> mismatches between the NUMA node that the process is running on
> and the NUMA node on which the page resides.
Minor nit: marking pages inaccessible is done from task_work
nowadays; there is no longer a kernel thread.
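For readers less familiar with the mechanism, a greatly simplified
sketch of the flow being described (the real code lives in mm/ and the
scheduler; the helper names below are illustrative only, and a
statistics-driven CCAD migrator would have a similar shape):

#include <linux/mm.h>
#include <linux/sched.h>

/* Run periodically via task_work in the context of the task itself. */
static void numa_scan_work(struct task_struct *p)
{
	/* Make a window of the task's PTEs inaccessible (PROT_NONE
	 * style) so that the next touch takes a minor hinting fault. */
	mark_window_prot_none(p->mm, next_scan_window(p));
}

/* Called from the page-fault path for such a hinting fault. */
static void numa_hinting_fault(struct page *page, int faulting_node)
{
	/* If the task keeps touching the page from a node other than
	 * the one the page lives on, migrate the page (or eventually
	 * the task) to close the distance. */
	if (page_to_nid(page) != faulting_node)
		queue_page_migration(page, faulting_node);
}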
> The result would be that the kernel would allocate only migratable
> pages within the CCAD device's memory, and even then only if
> memory was otherwise exhausted.
Does it make sense to allocate the device's page tables in memory
belonging to the device?
Is this a necessary thing with some devices? Jerome's HMM comes
to mind...
--
All rights reversed
On Fri, 2015-04-24 at 22:32 -0400, Rik van Riel wrote:
> > The result would be that the kernel would allocate only migratable
> > pages within the CCAD device's memory, and even then only if
> > memory was otherwise exhausted.
>
> Does it make sense to allocate the device's page tables in memory
> belonging to the device?
>
> Is this a necessary thing with some devices? Jerome's HMM comes
> to mind...
In our case, the device's MMU shares the host page tables (which is why
we can't use HMM, i.e., we can't have a page with different permissions on
the CPU vs. the device, which HMM does).
However, the device has a pretty fast path to system memory, so the best
thing we can do is pin the workload to the same chip the device is
connected to, so that those page tables aren't too far away.
Cheers,
Ben.
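A minimal userspace sketch of that pinning, assuming (purely for the
example) that the device hangs off NUMA node 0; a real setup would
discover the node from the device rather than hard-coding it:

#include <numa.h>		/* libnuma; link with -lnuma */
#include <stdio.h>

int main(void)
{
	int node = 0;	/* assumed: the node the CCAD device is attached to */

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support on this system\n");
		return 1;
	}

	/* Run on that node's CPUs and prefer its memory, so the shared
	 * page tables and the working set stay close to the device. */
	numa_run_on_node(node);
	numa_set_preferred(node);

	/* ... start the actual workload from here ... */
	return 0;
}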
On Fri, Apr 24, 2015 at 11:09:36AM -0400, Jerome Glisse wrote:
> On Fri, Apr 24, 2015 at 07:57:38AM -0700, Paul E. McKenney wrote:
> > On Fri, Apr 24, 2015 at 09:12:07AM -0500, Christoph Lameter wrote:
> > > On Thu, 23 Apr 2015, Paul E. McKenney wrote:
> > >
> > > >
> > > > DAX
> > > >
> > > > DAX is a mechanism for providing direct-memory access to
> > > > high-speed non-volatile (AKA "persistent") memory. Good
> > > > introductions to DAX may be found in the following LWN
> > > > articles:
> > >
> > > DAX is a mechanism to access memory not managed by the kernel and is the
> > > successor to XIP. It just happens to be needed for persistent memory.
> > > Fundamentally any driver can provide an MMAPPed interface to allow access
> > > to a devices memory.
> >
> > I will take another look, but others in this thread have called out
> > difficulties with DAX's filesystem nature.
>
> Do not waste your time on that; this is not what we want. Christoph here
> is more than stubborn and fails to see the world.
Well, we do need to make sure that we are correctly representing DAX's
capabilities. It is a hot topic, and others will probably also suggest
that it be used. That said, at the moment, I don't see how it would help,
given the need to migrate memory. Perhaps Boaz Harrosh's patch set to
allow struct pages to be associated with this sort of memory might help?
But from what I can see, a fair amount of other functionality would still
be required either way.
I am updating the DAX section a bit, but I don't claim that it is complete.
Thanx, Paul
On Fri, Apr 24, 2015 at 03:00:18PM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
>
> > > Still no answer as to why is that not possible with the current scheme?
> > > You keep on talking about pointers and I keep on responding that this is a
> > > matter of making the address space compatible on both sides.
> >
> > So if do that in a naive way, how can we migrate a chunk of memory to video
> > memory while still handling properly the case where CPU try to access that
> > same memory while it is migrated to the GPU memory.
>
> Well that the same issue that the migration code is handling which I
> submitted a long time ago to the kernel.
Would you have a URL or other pointer to this code?
> > Without modifying a single line of mm code, the only way to do this is to
> > either unmap from the cpu page table the range being migrated or to mprotect
> > it in some way. In both case the cpu access will trigger some kind of fault.
>
> Yes that is how Linux migration works. If you can fix that then how about
> improving page migration in Linux between NUMA nodes first?
In principle, that also would be a good thing. But why do that first?
> > This is not the behavior we want. What we want is same address space while
> > being able to migrate system memory to device memory (who make that decision
> > should not be part of that discussion) while still gracefully handling any
> > CPU access.
>
> Well then there could be a situation where you have concurrent write
> access. How do you reconcile that then? Somehow you need to stall one or
> the other until the transaction is complete.
Or have store buffers on one or both sides.
> > This means if CPU access it we want to migrate memory back to system memory.
> > To achieve this there is no way around adding couple of if inside the mm
> > page fault code path. Now do you want each driver to add its own if branch
> > or do you want a common infrastructure to do just that ?
>
> If you can improve the page migration in general then we certainly would
> love that. Having faultless migration is certain a good thing for a lot of
> functionality that depends on page migration.
We do have to start somewhere, though. If we insist on perfection for
all situations before we agree to make a change, we won't be making very
many changes, now will we?
> > As i keep saying the solution you propose is what we have today, today we
> > have fake share address space through the trick of remapping system memory
> > at same address inside the GPU address space and also enforcing the use of
> > a special memory allocator that goes behind the back of mm code.
>
> Hmmm... I'd like to know more details about that.
As I understand it, the trick (if you can call it that) is having the
device have the same memory-mapping capabilities as the CPUs.
> > As you pointed out, not using GPU memory is a waste and we want to be able
> > to use it. Now Paul have more sofisticated hardware that offer oportunities
> > to do thing in a more transparent and efficient way.
>
> Does this also work between NUMA nodes in a Power8 system?
Heh! At the rate we are going with this discussion, Power8 will be
obsolete before we have this in. ;-)
Thanx, Paul
On Fri, Apr 24, 2015 at 10:49:28AM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Paul E. McKenney wrote:
>
> > can deliver, but where the cost of full-fledge hand tuning cannot be
> > justified.
> >
> > You seem to believe that this latter category is the empty set, which
> > I must confess does greatly surprise me.
>
> If there are already compromises are being made then why would you want to
> modify the kernel for this? Some user space coding and device drivers
> should be sufficient.
The goal is to gain substantial performance improvement without any
user-space changes.
Thanx, Paul
On Sat, Apr 25, 2015 at 01:32:39PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2015-04-24 at 22:32 -0400, Rik van Riel wrote:
> > > The result would be that the kernel would allocate only
> > migratable
> > > pages within the CCAD device's memory, and even then only if
> > > memory was otherwise exhausted.
> >
> > Does it make sense to allocate the device's page tables in memory
> > belonging to the device?
> >
> > Is this a necessary thing with some devices? Jerome's HMM comes
> > to mind...
>
> In our case, the device's MMU shares the host page tables (which is why
> we can't use HMM, ie we can't have a page with different permissions on
> CPU vs. device which HMM does).
>
> However the device has a pretty fast path to system memory, the best
> thing we can do is pin the workload to the same chip the device is
> connected to so those page tables arent' too far away.
And another update, diffs then full document. Among other things, this
version explicitly calls out the goal of gaining substantial performance
without changing user applications, which should hopefully help.
Thanx, Paul
------------------------------------------------------------------------
diff --git a/DeviceMem.txt b/DeviceMem.txt
index 15d0a8b5d360..3de70c4b9922 100644
--- a/DeviceMem.txt
+++ b/DeviceMem.txt
@@ -40,10 +40,13 @@
workloads will have less-predictable access patterns, and these
workloads can benefit from automatic migration of data between
device memory and system memory as access patterns change.
- Furthermore, some devices will provide special hardware that
- collects access statistics that can be used to determine whether
- or not a given page of memory should be migrated, and if so,
- to where.
+ In this latter case, the goal is not optimal performance,
+ but rather a significant increase in performance compared to
+ what the CPUs alone can provide without needing to recompile
+ any of the applications making up the workload. Furthermore,
+ some devices will provide special hardware that collects access
+ statistics that can be used to determine whether or not a given
+ page of memory should be migrated, and if so, to where.
The purpose of this document is to explore how this access
and migration can be provided for within the Linux kernel.
@@ -146,6 +149,32 @@ REQUIREMENTS
required for low-latency applications that are sensitive
to OS jitter.
+ 6. It must be possible to cause an application to use a
+ CCAD device simply by switching dynamically linked
+ libraries, but without recompiling that application.
+ This implies the following requirements:
+
+ a. Address spaces must be synchronized for a given
+ application on the CPUs and the CCAD. In other
+ words, a given virtual address must access the same
+ physical memory from the CCAD device and from
+ the CPUs.
+
+ b. Code running on the CCAD device must be able to
+ access the running application's memory,
+ regardless of how that memory was allocated,
+ including statically allocated at compile time.
+
+ c. Use of the CCAD device must not interfere with
+ memory allocations that are never used by the
+ CCAD device. For example, if a CCAD device
+ has 16GB of memory, that should not prevent an
+ application using that device from allocating
+ more than 16GB of memory. For another example,
+ memory that is never accessed by a given CCAD
+ device should preferably remain outside of that
+ CCAD device's memory.
+
POTENTIAL IDEAS
@@ -178,12 +207,11 @@ POTENTIAL IDEAS
physical address ranges of normal system memory would
be interleaved with those of device memory.
- This would also require some sort of
- migration infrastructure to be added, as autonuma would
- not apply. However, this approach has the advantage
- of preventing allocations in these regions, at least
- unless those allocations have been explicitly flagged
- to go there.
+ This would also require some sort of migration
+ infrastructure to be added, as autonuma would not apply.
+ However, this approach has the advantage of preventing
+ allocations in these regions, at least unless those
+ allocations have been explicitly flagged to go there.
4. Your idea here!
@@ -274,21 +302,30 @@ MEMORY ZONE
DAX
DAX is a mechanism for providing direct-memory access to
- high-speed non-volatile (AKA "persistent") memory. Good
- introductions to DAX may be found in the following LWN
- articles:
+ special memory, for example, to high-speed non-volatile (AKA
+ "persistent") memory. A number of current use cases for DAX
+ put filesystems on top of DAX. Good introductions to DAX may
+ be found in the following LWN articles:
https://lwn.net/Articles/591779/
https://lwn.net/Articles/610174/
+ https://lwn.net/Articles/640113/
+
+ DAX is now in mainline, see for example fs/dax.c.
+
+ One important CCAD use case allows an unmodified legacy
+ application to pass some memory to a CCAD device, no matter how
+ this memory was allocated, while leaving other memory in system
+ the same way. The intent is to use migration to move the memory
+ the same way. The intent to use migration to move the memory
+ as required. DAX does not seem to help much with this use case.
- DAX provides filesystem-level access to persistent memory.
- One important CCAD use case is allowing a legacy application
- to pass memory from malloc() to a CCAD device, and having
- the allocated memory migrate as needed. DAX does not seem to
- support this use case.
+ There has been some discussion of associating struct page
+ structures, which might (or might not) make DAX a better fit
+ for CCAD.
ACKNOWLEDGMENTS
- Updates to this document include feedback from Christoph Lameter
- and Jerome Glisse.
+ Updates to this document include feedback from Christoph Lameter,
+ Jerome Glisse, Rik van Riel, Austin S Hemmelgarn, and Oded Gabbay.
------------------------------------------------------------------------
COHERENT ON-DEVICE MEMORY: ACCESS AND MIGRATION
Ben Herrenschmidt
(As told to Paul E. McKenney)
Special-purpose hardware becoming more prevalent, and some of this
hardware allows for tight interaction with CPU-based processing.
For example, IBM's coherent accelerator processor interface
(CAPI) will allow this sort of device to be constructed,
and it is likely that GPGPUs will need similar capabilities.
(See http://www-304.ibm.com/webapp/set2/sas/f/capi/home.html for a
high-level description of CAPI.) Let's call these cache-coherent
accelerator devices (CCAD for short, which should at least
motivate someone to come up with something better).
This document covers devices with the following properties:
1. The device is cache-coherent, in other words, the device's
memory has all the characteristics of system memory from
the viewpoint of CPUs and other devices accessing it.
2. The device provides local memory that it has high-bandwidth
low-latency access to, but the device can also access
normal system memory.
3. The device shares system page tables, so that it can
transparently access userspace virtual memory, regardless
of whether this virtual memory maps to normal system
memory or to memory local to the device.
Although such a device will provide CPU's with cache-coherent
access to on-device memory, the resulting memory latency is
expected to be slower than the normal memory that is tightly
coupled to the CPUs. Nevertheless, data that is only occasionally
accessed by CPUs should be stored in the device's memory.
On the other hand, data that is accessed rarely by the device but
frequently by the CPUs should be stored in normal system memory.
Of course, some workloads will have predictable access patterns
that allow data to be optimally placed up front. However, other
workloads will have less-predictable access patterns, and these
workloads can benefit from automatic migration of data between
device memory and system memory as access patterns change.
In this latter case, the goal is not optimal performance,
but rather a significant increase in performance compared to
what the CPUs alone can provide without needing to recompile
any of the applications making up the workload. Furthermore,
some devices will provide special hardware that collects access
statistics that can be used to determine whether or not a given
page of memory should be migrated, and if so, to where.
The purpose of this document is to explore how this access
and migration can be provided for within the Linux kernel.
USE CASES
o Multiple transformations without requiring multiple
memory transfers for throughput-oriented applications.
For example, suppose the device supports both compression
and encryption algorithms, but that significant CPU
work is required to generate the data to be compressed
and encrypted. Suppose also that the application uses
a library to do the compression and encryption, and
that this application needs to run correctly, without
rebuilding, on systems with the device and also on systems
without the device. In addition, the application operates
on data mapped from files, data in normal data/bss memory,
and data in heap memory from malloc().
In this case, it would be beneficial to have the memory
automatically migrate to and from device memory.
Note that the device-specific library functions could
reasonably initiate the migration before starting their
work, but could not know whether or not to migrate the
data back upon completion.
o A special-purpose globally hand-optimized application
wishes to use the device, from Christoph Lameter.
In this case, the application will get the absolute
best performance by manually controlling allocation
and migration decisions. This use case is probably
not helped much by this proposal.
However, an application including a special-purpose
hand-optimized core and less-intense ancillary processing
could well benefit.
o GPGPU matrix operations, from Jerome Glisse.
https://lkml.org/lkml/2015/4/21/898
Suppose that you have an application that uses a
scientific library to do matrix computations, and that
this application simply calls malloc() and gives the
resulting pointer to the library function. If the GPGPU
has coherent access to system memory (and vice versa),
it would help performance and application compatibility
to be able to transparently migrate the malloc()ed
memory to and from the GPGPU's memory without requiring
changes to the application. (A sketch of this pattern appears
just after this list.)
o (More here for CAPI.)
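To make the GPGPU matrix use case concrete, here is a minimal userspace
sketch. It is illustrative only: dgemm_offload() stands in for whatever
device-aware routine the math library exports, and nothing else is
assumed beyond standard C. The point is that the application only ever
calls malloc(); any migration to and from device memory is the library's
and the kernel's business.

    #include <stdlib.h>

    /* Hypothetical device-aware library routine; the application does
     * not know or care whether the library migrates these buffers to
     * device memory before doing the computation. */
    extern void dgemm_offload(const double *a, const double *b,
                              double *c, int n);

    int main(void)
    {
            int n = 4096;
            double *a = malloc(n * n * sizeof(*a));
            double *b = malloc(n * n * sizeof(*b));
            double *c = malloc(n * n * sizeof(*c));

            if (!a || !b || !c)
                    return 1;
            /* ... fill a and b on the CPU ... */
            dgemm_offload(a, b, c, n);  /* may transparently migrate */
            /* ... read c on the CPU; pages migrate back on demand ... */
            free(a); free(b); free(c);
            return 0;
    }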
REQUIREMENTS
1. It should be possible to remove a given CCAD device
from service, for example, to reset it, to download
updated firmware, or to change its functionality.
This results in the following additional requirements:
a. It should be possible to migrate all data away
from the device's memory at any time.
b. Normal memory allocation should avoid using the
device's memory, as this would interfere
with the needed migration. It may nevertheless
be desirable to use the device's memory
if system memory is exhausted, however, in some
cases, even this "emergency" use is best avoided.
In fact, a good solution will provide some means
for avoiding this for those cases where it is
necessary to evacuate memory when offlining the
device.
2. Memory can be either explicitly or implicitly allocated
from the CCAD device's memory. (Both usermode and kernel
allocation required.)
Please note that implicit allocation will need to be
avoided in a number of use cases. The reason for this
is that random kernel allocations might be pinned into
memory, which could conflict with requirement (1) above,
and might furthermore fragment the device's memory.
3. The device's memory is treated like normal system
memory by the Linux kernel, for example, each page has a
"struct page" associate with it. (In contrast, the
traditional approach has used special-purpose OS mechanisms
to manage the device's memory, and this memory was treated
as MMIO space by the kernel.)
4. The system's normal tuning mechanism may be used to
tune allocation locality, migration, and so on, as
required to match performance and functional requirements.
5. It must be possible to configure a system containing
a CCAD device so that it does no migration, as will be
required for low-latency applications that are sensitive
to OS jitter.
6. It must be possible to cause an application to use a
CCAD device simply by switching dynamically linked
libraries, but without recompiling that application.
This implies the following requirements:
a. Address spaces must be synchronized for a given
application on the CPUs and the CCAD. In other
words, a given virtual address must access the same
physical memory from the CCAD device and from
the CPUs.
b. Code running on the CCAD device must be able to
access the running application's memory,
regardless of how that memory was allocated,
including statically allocated at compile time.
c. Use of the CCAD device must not interfere with
memory allocations that are never used by the
CCAD device. For example, if a CCAD device
has 16GB of memory, that should not prevent an
application using that device from allocating
more than 16GB of memory. For another example,
memory that is never accessed by a given CCAD
device should preferably remain outside of that
CCAD device's memory.
POTENTIAL IDEAS
It is only reasonable to ask whether CCAD devices can simply
use the HMM patch that has recently been proposed to allow
migration between system and device memory via page faults.
Although this works well for devices whose local MMU can contain
mappings different from that of the system MMU, the HMM patch
is still working with MMIO space that gets special treatment.
The HMM patch does not (yet) provide the full transparency that
would allow the device memory to be treated in the same way as
system memory. Something more is therefore required, for example,
one or more of the following:
1. Model the CCAD device's memory as a memory-only NUMA node
with a very large distance metric. This allows use of
the existing mechanisms for choosing where to satisfy
explicit allocations and where to target migrations.
2. Cover the memory with a CMA to prevent non-migratable
pinned data from being placed in the CCAD device's memory.
It would also permit the driver to perform dedicated
physically contiguous allocations as needed.
3. Add a new ZONE_EXTERNAL zone for all CCAD-like devices.
Note that this would likely require support for
discontinuous zones in order to support large NUMA
systems, in which each node has a single block of the
overall physical address space. In such systems, the
physical address ranges of normal system memory would
be interleaved with those of device memory.
This would also require some sort of migration
infrastructure to be added, as autonuma would not apply.
However, this approach has the advantage of preventing
allocations in these regions, at least unless those
allocations have been explicitly flagged to go there.
4. Your idea here!
The following sections cover AutoNUMA, use of memory zones, and DAX.
AUTONUMA
The Linux kernel's autonuma facility supports migrating both
memory and processes to promote NUMA memory locality. It was
accepted into 3.13 and is available in RHEL 7.0 and SLES 12.
It is enabled by the Kconfig variable CONFIG_NUMA_BALANCING.
This approach uses a kernel thread "knuma_scand" that periodically
marks pages inaccessible. The page-fault handler notes any
mismatches between the NUMA node that the process is running on
and the NUMA node on which the page resides.
http://lwn.net/Articles/488709/
https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf
It will be necessary to set up the CCAD device's memory as
a very distant NUMA node, and the architecture-specific
__numa_distance() function can be used for this purpose.
There is a RECLAIM_DISTANCE macro that can be set by the
architecture to prevent reclaiming from nodes that are too
far away. Some experimentation would be required to determine
the combination of values for the various distance macros.
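As a rough illustration (and nothing more), the architecture hook might
special-case CCAD nodes along the following lines. CCAD_NODE_DISTANCE,
node_is_ccad(), and arch_distance_table[] are made-up names, and the
actual hook and its name differ between architectures:

    /* Sketch: report a very large NUMA distance for CCAD nodes so that
     * the scheduler and reclaim treat them as maximally remote. */
    #define CCAD_NODE_DISTANCE 160  /* assumed value, > RECLAIM_DISTANCE */

    int __numa_distance(int from, int to)
    {
            if (node_is_ccad(from) || node_is_ccad(to))
                    return CCAD_NODE_DISTANCE;
            return arch_distance_table[from][to];  /* existing arch data */
    }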
This approach needs some way to pull in data from the hardware
on access patterns. Aneesh Kk Veetil is prototyping an approach
based on Power 8 hardware counters. This data will need to be
plugged into the migration algorithm, which is currently based
on collecting information from page faults.
Finally, the contiguous memory allocator (CMA, see
http://lwn.net/Articles/486301/) is needed in order to prevent
the kernel from placing non-migratable allocations in the CCAD
device's memory. This would need to be of type MIGRATE_CMA to
ensure that all memory taken from that range be migratable.
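A hedged sketch of what that boot-time reservation might look like: the
cma_declare_contiguous() and cma_alloc() interfaces do exist, but their
exact signatures vary across kernel versions, and ccad_base, ccad_size,
and ccad_cma are assumed to come from the device driver.

    static struct cma *ccad_cma;

    /* Cover the device's physical range with a CMA area so that only
     * movable (MIGRATE_CMA) allocations can land there, while letting
     * the driver later grab large contiguous chunks via cma_alloc(). */
    static int __init ccad_reserve(phys_addr_t ccad_base, phys_addr_t ccad_size)
    {
            return cma_declare_contiguous(ccad_base, ccad_size,
                                          0,     /* no upper limit */
                                          0, 0,  /* default alignment */
                                          true,  /* fixed base address */
                                          &ccad_cma);
    }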
The result would be that the kernel would allocate only migratable
pages within the CCAD device's memory, and even then only if
memory was otherwise exhausted. Normal CONFIG_NUMA_BALANCING
migration could be brought to bear, possibly enhanced with
information from hardware counters. One remaining issue is that
there is no way to absolutely prevent random kernel subsystems
from allocating the CCAD device's memory, which could cause
failures should the device need to reset itself, in which case
the memory would be temporarily inaccessible -- which could be
a fatal surprise to that kernel subsystem.
Jerome Glisse suggests that usermode hints are quite important,
and perhaps should replace any AutoNUMA measurements.
MEMORY ZONE
One way to avoid the problem of random kernel subsystems using
the CAPI device's memory is to create a new memory zone for
this purpose. This would add something like ZONE_DEVMEM to the
current set that includes ZONE_DMA, ZONE_NORMAL, and ZONE_MOVABLE.
Currently, there is a maximum of four zones, so this limit must
either be increased or kernels built with ZONE_DEVMEM must avoid
having more than one of ZONE_DMA, ZONE_DMA32, and ZONE_HIGHMEM.
This approach requires that migration be implemented on the side,
as the CONFIG_NUMA_BALANCING will not help here (unless I am
missing something). One advantage of this situation is that
hardware locality measurements could be incorporated from the
beginning. Another advantage is that random kernel subsystems
and user programs would not get CAPI device memory unless they
explicitly requested it.
Code would be needed at boot time to place the CAPI device
memory into ZONE_DEVMEM, perhaps involving changes to
mem_init() and paging_init().
In addition, an appropriate GFP_DEVMEM would be needed, along
with code in various paths to handle it appropriately.
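Purely to illustrate the shape of such a change (none of these symbols
exist today, and the details would certainly differ), the new zone and
the explicit-allocation path might look something like:

    /* Sketch only: a new zone type for CCAD memory, mirroring the list
     * of zones mentioned above, plus a driver-side explicit allocation
     * using the proposed GFP_DEVMEM-style flag. */
    enum zone_type {
            ZONE_DMA,
            ZONE_NORMAL,
            ZONE_MOVABLE,
            ZONE_DEVMEM,            /* proposed: CCAD device memory */
            __MAX_NR_ZONES
    };

    static struct page *ccad_alloc_page(int ccad_nid)
    {
            /* __GFP_DEVMEM is the hypothetical flag discussed above. */
            return alloc_pages_node(ccad_nid,
                                    GFP_HIGHUSER_MOVABLE | __GFP_DEVMEM, 0);
    }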
Also, because large NUMA systems will sometimes interleave the
addresses of blocks of physical memory and device memory,
support for discontiguous interleaved zones will be required.
DAX
DAX is a mechanism for providing direct-memory access to
special memory, for example, to high-speed non-volatile (AKA
"persistent") memory. A number of current use cases for DAX
put filesystems on top of DAX. Good introductions to DAX may
be found in the following LWN articles:
https://lwn.net/Articles/591779/
https://lwn.net/Articles/610174/
https://lwn.net/Articles/640113/
DAX is now in mainline, see for example fs/dax.c.
One important CCAD use case allows an unmodified legacy
application to pass some memory to a CCAD device, no matter how
this memory was allocated, while leaving other memory in system
memory, even if this other memory was allocated in exactly
the same way. The intent is to use migration to move the memory
as required. DAX does not seem to help much with this use case.
There has been some discussion of associating struct page
structures with DAX-mapped memory, which might (or might not)
make DAX a better fit for CCAD.
ACKNOWLEDGMENTS
Updates to this document include feedback from Christoph Lameter,
Jerome Glisse, Rik van Riel, Austin S Hemmelgarn, and Oded Gabbay.
On Sat, 25 Apr 2015, Paul E. McKenney wrote:
> Would you have a URL or other pointer to this code?
linux/mm/migrate.c
> > > Without modifying a single line of mm code, the only way to do this is to
> > > either unmap from the cpu page table the range being migrated or to mprotect
> > > it in some way. In both case the cpu access will trigger some kind of fault.
> >
> > Yes that is how Linux migration works. If you can fix that then how about
> > improving page migration in Linux between NUMA nodes first?
>
> In principle, that also would be a good thing. But why do that first?
Because it would benefit a lot of functionality that today relies on page
migration to have a faster more reliable way of moving pages around.
> > > This is not the behavior we want. What we want is same address space while
> > > being able to migrate system memory to device memory (who make that decision
> > > should not be part of that discussion) while still gracefully handling any
> > > CPU access.
> >
> > Well then there could be a situation where you have concurrent write
> > access. How do you reconcile that then? Somehow you need to stall one or
> > the other until the transaction is complete.
>
> Or have store buffers on one or both sides.
Well if those store buffers end up with divergent contents then you have
the problem of not being able to decide which version should survive. But
from Jerome's response I deduce that this is avoided by only allow
read-only access during migration. That is actually similar to what page
migration does.
> > > This means if CPU access it we want to migrate memory back to system memory.
> > > To achieve this there is no way around adding couple of if inside the mm
> > > page fault code path. Now do you want each driver to add its own if branch
> > > or do you want a common infrastructure to do just that ?
> >
> > If you can improve the page migration in general then we certainly would
> > love that. Having faultless migration is certain a good thing for a lot of
> > functionality that depends on page migration.
>
> We do have to start somewhere, though. If we insist on perfection for
> all situations before we agree to make a change, we won't be making very
> many changes, now will we?
Improvements to the general code would be preferred instead of
having specialized solutions for a particular hardware alone. If the
general code can then handle the special coprocessor situation then we
avoid a lot of code development.
> As I understand it, the trick (if you can call it that) is having the
> device have the same memory-mapping capabilities as the CPUs.
Well yes that works with read-only mappings. Maybe we can special case
that in the page migration code? We do not need migration entries if
access is read-only actually.
On Mon, Apr 27, 2015 at 10:08:29AM -0500, Christoph Lameter wrote:
> On Sat, 25 Apr 2015, Paul E. McKenney wrote:
>
> > Would you have a URL or other pointer to this code?
>
> linux/mm/migrate.c
>
> > > > Without modifying a single line of mm code, the only way to do this is to
> > > > either unmap from the cpu page table the range being migrated or to mprotect
> > > > it in some way. In both case the cpu access will trigger some kind of fault.
> > >
> > > Yes that is how Linux migration works. If you can fix that then how about
> > > improving page migration in Linux between NUMA nodes first?
> >
> > In principle, that also would be a good thing. But why do that first?
>
> Because it would benefit a lot of functionality that today relies on page
> migration to have a faster more reliable way of moving pages around.
I do not think in the CAPI case there is any way to improve on the
current low-level page migration. I am talking about:
- write protect & tlb flush
- copy
- update page table tlb flush
The upper level that has the logic for the migration would however need
some change: like Paul said, some kind of new metric, and also a new way
to gather statistics from the device instead of from the CPU. I think the
device can provide better information than the current logic, where pages
are unmapped and the kernel looks at which CPU faults on the page first.
Also a way to feed hints provided by userspace through the device driver
into the NUMA decision process.
So I do not think that anything in this work would benefit any other
workload than the one Paul is interested in. Still, I am sure Paul wants
to build on top of existing infrastructure.
>
> > > > This is not the behavior we want. What we want is same address space while
> > > > being able to migrate system memory to device memory (who make that decision
> > > > should not be part of that discussion) while still gracefully handling any
> > > > CPU access.
> > >
> > > Well then there could be a situation where you have concurrent write
> > > access. How do you reconcile that then? Somehow you need to stall one or
> > > the other until the transaction is complete.
> >
> > Or have store buffers on one or both sides.
>
> Well if those store buffers end up with divergent contents then you have
> the problem of not being able to decide which version should survive. But
> from Jerome's response I deduce that this is avoided by only allow
> read-only access during migration. That is actually similar to what page
> migration does.
Yes, as said above, no change to the logic there; we do not want divergent
content at all. The thing is, autonuma is a better fit for Paul because his
platform, being more advanced, can allocate struct page for the device
memory, while in my case it would be pointless as the memory is not CPU
accessible. This is why the HMM patchset does not build on top of autonuma
and current page migration but still uses the same kind of logic.
>
> > > > This means if CPU access it we want to migrate memory back to system memory.
> > > > To achieve this there is no way around adding couple of if inside the mm
> > > > page fault code path. Now do you want each driver to add its own if branch
> > > > or do you want a common infrastructure to do just that ?
> > >
> > > If you can improve the page migration in general then we certainly would
> > > love that. Having faultless migration is certain a good thing for a lot of
> > > functionality that depends on page migration.
> >
> > We do have to start somewhere, though. If we insist on perfection for
> > all situations before we agree to make a change, we won't be making very
> > many changes, now will we?
>
> Improvements to the general code would be preferred instead of
> having specialized solutions for a particular hardware alone. If the
> general code can then handle the special coprocessor situation then we
> avoid a lot of code development.
I think Paul only big change would be the memory ZONE changes. Having a
way to add the device memory as struct page while blocking the kernel
allocation from using this memory. Beside that i think the autonuma changes
he would need would really be specific to his usecase but would still
reuse all of the low level logic.
>
> > As I understand it, the trick (if you can call it that) is having the
> > device have the same memory-mapping capabilities as the CPUs.
>
> Well yes that works with read-only mappings. Maybe we can special case
> that in the page migration code? We do not need migration entries if
> access is read-only actually.
Duplicating read-only memory on the device is really an optimization that
is not critical to the whole. The common use case remains the migration of
read/write memory to device memory when that memory is mostly or only
accessed by the device.
Cheers,
Jérôme
On Mon, Apr 27, 2015 at 10:08:29AM -0500, Christoph Lameter wrote:
> On Sat, 25 Apr 2015, Paul E. McKenney wrote:
>
> > Would you have a URL or other pointer to this code?
>
> linux/mm/migrate.c
Ah, I thought you were calling out something not yet in mainline.
> > > > Without modifying a single line of mm code, the only way to do this is to
> > > > either unmap from the cpu page table the range being migrated or to mprotect
> > > > it in some way. In both case the cpu access will trigger some kind of fault.
> > >
> > > Yes that is how Linux migration works. If you can fix that then how about
> > > improving page migration in Linux between NUMA nodes first?
> >
> > In principle, that also would be a good thing. But why do that first?
>
> Because it would benefit a lot of functionality that today relies on page
> migration to have a faster more reliable way of moving pages around.
I would instead look on this as a way to try out use of hardware migration
hints, which could lead to hardware vendors providing similar hints for
node-to-node migrations. At that time, the benefits could be provided
to all the functionality relying on such migrations.
> > > > This is not the behavior we want. What we want is same address space while
> > > > being able to migrate system memory to device memory (who make that decision
> > > > should not be part of that discussion) while still gracefully handling any
> > > > CPU access.
> > >
> > > Well then there could be a situation where you have concurrent write
> > > access. How do you reconcile that then? Somehow you need to stall one or
> > > the other until the transaction is complete.
> >
> > Or have store buffers on one or both sides.
>
> Well if those store buffers end up with divergent contents then you have
> the problem of not being able to decide which version should survive. But
> from Jerome's response I deduce that this is avoided by only allow
> read-only access during migration. That is actually similar to what page
> migration does.
Fair enough.
> > > > This means if CPU access it we want to migrate memory back to system memory.
> > > > To achieve this there is no way around adding couple of if inside the mm
> > > > page fault code path. Now do you want each driver to add its own if branch
> > > > or do you want a common infrastructure to do just that ?
> > >
> > > If you can improve the page migration in general then we certainly would
> > > love that. Having faultless migration is certain a good thing for a lot of
> > > functionality that depends on page migration.
> >
> > We do have to start somewhere, though. If we insist on perfection for
> > all situations before we agree to make a change, we won't be making very
> > many changes, now will we?
>
> Improvements to the general code would be preferred instead of
> having specialized solutions for a particular hardware alone. If the
> general code can then handle the special coprocessor situation then we
> avoid a lot of code development.
All else being equal, I agree that generality is preferred. But here,
as is often the case, all else is not necessarily equal.
> > As I understand it, the trick (if you can call it that) is having the
> > device have the same memory-mapping capabilities as the CPUs.
>
> Well yes that works with read-only mappings. Maybe we can special case
> that in the page migration code? We do not need migration entries if
> access is read-only actually.
So you are talking about the situation only during the migration itself,
then? If there is no migration in progress, then of course there is
no problem with concurrent writes because the cache-coherence protocol
takes care of things. During migration of a given page, I agree that
marking that page read-only on both sides makes sense.
And I agree that latency-sensitive applications might not tolerate
the page being read-only, and thus would want to avoid migration.
Such applications would of course instead rely on placing the memory.
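For what it is worth, that sort of explicit placement could be done with
the existing libnuma interface once the device memory shows up as a NUMA
node -- a hedged sketch, with ccad_node standing in for whatever node
number the device ends up with:

    #include <numa.h>

    /* Allocate 'bytes' directly on the CCAD node; it stays there unless
     * the application asks otherwise.  Free with numa_free(ptr, bytes). */
    static void *alloc_on_ccad(size_t bytes, int ccad_node)
    {
            return numa_alloc_onnode(bytes, ccad_node);
    }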
Thanx, Paul
On Mon, 27 Apr 2015, Jerome Glisse wrote:
> > Improvements to the general code would be preferred instead of
> > having specialized solutions for a particular hardware alone. If the
> > general code can then handle the special coprocessor situation then we
> > avoid a lot of code development.
>
> I think Paul only big change would be the memory ZONE changes. Having a
> way to add the device memory as struct page while blocking the kernel
> allocation from using this memory. Beside that i think the autonuma changes
> he would need would really be specific to his usecase but would still
> reuse all of the low level logic.
Well lets avoid that. Access to device memory comparable to what the
drivers do today by establishing page table mappings or a generalization
of DAX approaches would be the most straightforward way of implementing it
and would build based on existing functionality. Page migration currently
does not work with driver mappings or DAX because there is no struct page
that would allow the lockdown of the page. That may require either
continued work on the DAX with page structs approach or new developments
in the page migration logic comparable to the get_user_page() alternative
of simply creating a scatter gather table to just submit a couple of
memory ranges to the I/O subsystem thereby avoiding page structs.
On 04/27/2015 12:17 PM, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Jerome Glisse wrote:
>
>>> Improvements to the general code would be preferred instead of
>>> having specialized solutions for a particular hardware alone. If the
>>> general code can then handle the special coprocessor situation then we
>>> avoid a lot of code development.
>>
>> I think Paul only big change would be the memory ZONE changes. Having a
>> way to add the device memory as struct page while blocking the kernel
>> allocation from using this memory. Beside that i think the autonuma changes
>> he would need would really be specific to his usecase but would still
>> reuse all of the low level logic.
>
> Well lets avoid that.
Why would we want to avoid the sane approach that makes this thing
work with the fewest required changes to core code?
Just because your workload is different from the workload they are
trying to enable?
--
All rights reversed
On Mon, 27 Apr 2015, Paul E. McKenney wrote:
> I would instead look on this as a way to try out use of hardware migration
> hints, which could lead to hardware vendors providing similar hints for
> node-to-node migrations. At that time, the benefits could be provided
> all the functionality relying on such migrations.
Ok that sounds good. These "hints" could allow for the optimization of the
page migration logic.
> > Well yes that works with read-only mappings. Maybe we can special case
> > that in the page migration code? We do not need migration entries if
> > access is read-only actually.
>
> So you are talking about the situation only during the migration itself,
> then? If there is no migration in progress, then of course there is
> no problem with concurrent writes because the cache-coherence protocol
> takes care of things. During migration of a given page, I agree that
> marking that page read-only on both sides makes sense.
This is sort of what happens in the current migration scheme. In the page
tables the regular entries are replaced by migration ptes and the page is
therefore inaccessible. Any access is then trapped until the page
contents have been moved to the new location. Then the migration pte is
replaced by a real pte again that allows full access to the page. At that
point the processes that have been put to sleep because they attempted an
access to that page are woken up.
The current scheme may be improved on by allowing read access to the page
while migration is in progress. If we changed the migration entries to
allow read access then the readers would not have to be put to sleep. Only
writers would have to be put to sleep until the migration is complete.
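For reference, a heavily simplified sketch of the sequence described
above -- the real code lives in mm/migrate.c, and the helpers below are
approximations of it rather than the exact kernel calls:

    /* Simplified: migrate one mapped page from 'old' to 'new'. */
    static void migrate_one_page(struct page *old, struct page *new)
    {
            /* 1. Replace every pte mapping 'old' with a migration entry;
             *    any CPU access now faults and the task sleeps. */
            try_to_unmap(old, TTU_MIGRATION);

            /* 2. Copy contents and page state while nothing can write. */
            migrate_page_copy(new, old);

            /* 3. Put real ptes back, now pointing at 'new', and wake up
             *    anyone who faulted during the copy. */
            remove_migration_ptes(old, new);
    }

The read-access idea above would relax step 1 so that only writers are
made to fault and sleep.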
> And I agree that latency-sensitive applications might not tolerate
> the page being read-only, and thus would want to avoid migration.
> Such applications would of course instead rely on placing the memory.
Thats why we have the ability to switch off these automatism and that is
why we are trying to keep the OS away from certain processors.
But this is not the only concern here. The other thing is to make this fit
into existing functionaly as cleanly as possible. So I think we would be
looking at gradual improvements in the page migration logic as well as
in the support for mapping external memory via driver mmap calls, DAX
and/or RDMA subsystem functionality. Those two areas of functionality need
to work together better in order to provide a solution for your use cases.
On Mon, Apr 27, 2015 at 11:17:43AM -0500, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Jerome Glisse wrote:
>
> > > Improvements to the general code would be preferred instead of
> > > having specialized solutions for a particular hardware alone. If the
> > > general code can then handle the special coprocessor situation then we
> > > avoid a lot of code development.
> >
> > I think Paul only big change would be the memory ZONE changes. Having a
> > way to add the device memory as struct page while blocking the kernel
> > allocation from using this memory. Beside that i think the autonuma changes
> > he would need would really be specific to his usecase but would still
> > reuse all of the low level logic.
>
> Well lets avoid that. Access to device memory comparable to what the
> drivers do today by establishing page table mappings or a generalization
> of DAX approaches would be the most straightforward way of implementing it
> and would build based on existing functionality. Page migration currently
> does not work with driver mappings or DAX because there is no struct page
> that would allow the lockdown of the page. That may require either
> continued work on the DAX with page structs approach or new developments
> in the page migration logic comparable to the get_user_page() alternative
> of simply creating a scatter gather table to just submit a couple of
> memory ranges to the I/O subsystem thereby avoiding page structs.
What you refuse to see is that DAX is geared toward filesystem and as such
rely on special mapping. There is a reason why dax.c is in fs/ and not mm/
and i keep pointing out we do not want our mecanism to be perceive as fs
from userspace point of view. We want to be below the fs, at the mm level
where we could really do thing transparently no matter what kind of memory
we are talking about (anonymous, file mapped, share).
The fact is that DAX is about persistent storage, but the people who
develop the persistent storage think it would be nice to expose it as some
kind of special memory. I am all for the direct mapping of this kind of
memory, but it is still used as a backing store for a filesystem.
While in our case we are talking about "usual" _volatile_ memory that
should not be used or exposed as a filesystem.
I can't understand why you are so hellbent on the DAX paradigm, but it
does not suit us in any way. We are not a filesystem, we are regular
memory; our realm is mm/ not fs/
Cheers,
Jérôme
On Mon, 27 Apr 2015, Rik van Riel wrote:
> Why would we want to avoid the sane approach that makes this thing
> work with the fewest required changes to core code?
Because new ZONEs are a pretty invasive change to the memory management and
because there are other ways to handle references to device specific
memory.
On Mon, 27 Apr 2015, Jerome Glisse wrote:
> > Well lets avoid that. Access to device memory comparable to what the
> > drivers do today by establishing page table mappings or a generalization
> > of DAX approaches would be the most straightforward way of implementing it
> > and would build based on existing functionality. Page migration currently
> > does not work with driver mappings or DAX because there is no struct page
> > that would allow the lockdown of the page. That may require either
> > continued work on the DAX with page structs approach or new developments
> > in the page migration logic comparable to the get_user_page() alternative
> > of simply creating a scatter gather table to just submit a couple of
> > memory ranges to the I/O subsystem thereby avoiding page structs.
>
> What you refuse to see is that DAX is geared toward filesystem and as such
> rely on special mapping. There is a reason why dax.c is in fs/ and not mm/
> and i keep pointing out we do not want our mecanism to be perceive as fs
> from userspace point of view. We want to be below the fs, at the mm level
> where we could really do thing transparently no matter what kind of memory
> we are talking about (anonymous, file mapped, share).
Ok that is why I mentioned the device memory mappings that are currently
used for this purpose. You could generalize the DAX approach (which I
understand as providing rw mappings to memory outside of the memory
managed by the kernel and not as a fs specific thing).
We can drop the DAX name and just talk about mapping to external memory if
that confuses the issue.
On Mon, Apr 27, 2015 at 11:51:51AM -0500, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Jerome Glisse wrote:
>
> > > Well lets avoid that. Access to device memory comparable to what the
> > > drivers do today by establishing page table mappings or a generalization
> > > of DAX approaches would be the most straightforward way of implementing it
> > > and would build based on existing functionality. Page migration currently
> > > does not work with driver mappings or DAX because there is no struct page
> > > that would allow the lockdown of the page. That may require either
> > > continued work on the DAX with page structs approach or new developments
> > > in the page migration logic comparable to the get_user_page() alternative
> > > of simply creating a scatter gather table to just submit a couple of
> > > memory ranges to the I/O subsystem thereby avoiding page structs.
> >
> > What you refuse to see is that DAX is geared toward filesystem and as such
> > rely on special mapping. There is a reason why dax.c is in fs/ and not mm/
> > and i keep pointing out we do not want our mecanism to be perceive as fs
> > from userspace point of view. We want to be below the fs, at the mm level
> > where we could really do thing transparently no matter what kind of memory
> > we are talking about (anonymous, file mapped, share).
>
> Ok that is why I mentioned the device memory mappings that are currently
> used for this purpose. You could generalize the DAX approach (which I
> understand as providing rw mappings to memory outside of the memory
> managed by the kernel and not as a fs specific thing).
>
> We can drop the DAX name and just talk about mapping to external memory if
> that confuses the issue.
DAX is for direct access block layer (X is for the cool name factor)
there is zero code inside DAX that would be usefull to us. Because it
is all about filesystem and short circuiting the pagecache. So DAX is
_not_ about providing rw mappings to non regular memory, it is about
allowing to directly map _filesystem backing storage_ into a process.
Moreover DAX is not about managing that persistent memory, all the
management is done inside the fs (ext4, xfs, ...) in the same way as
for non persistent memory. While in our case we want to manage the
memory as a runtime resources that is allocated to process the same
way regular system memory is managed.
So current DAX code have nothing of value for our usecase nor what we
propose will have anyvalue for DAX. Unless they decide to go down the
struct page road for persistent memory (which from last discussion i
heard was not there plan, i am pretty sure they entirely dismissed
that idea for now).
My point is that this is 2 differents non overlapping problems, and
thus mandate 2 differents solution.
Cheers,
Jérôme
On Mon, 27 Apr 2015, Jerome Glisse wrote:
> > We can drop the DAX name and just talk about mapping to external memory if
> > that confuses the issue.
>
> DAX is for direct access block layer (X is for the cool name factor)
> there is zero code inside DAX that would be usefull to us. Because it
> is all about filesystem and short circuiting the pagecache. So DAX is
> _not_ about providing rw mappings to non regular memory, it is about
> allowing to directly map _filesystem backing storage_ into a process.
Its about directly mapping memory outside of regular kernel
management via a block device into user space. That you can put a
filesystem on top is one possible use case. You can provide a block
device to map the memory of the coprocessor and then configure the memory
space to have the same layout on the coprocessor as well as the linux
process.
> Moreover DAX is not about managing that persistent memory, all the
> management is done inside the fs (ext4, xfs, ...) in the same way as
> for non persistent memory. While in our case we want to manage the
> memory as a runtime resources that is allocated to process the same
> way regular system memory is managed.
I repeatedly said that. So you would have a block device that would be
used to mmap portions of the special memory into a process.
> So current DAX code have nothing of value for our usecase nor what we
> propose will have anyvalue for DAX. Unless they decide to go down the
> struct page road for persistent memory (which from last discussion i
> heard was not there plan, i am pretty sure they entirely dismissed
> that idea for now).
DAX is about directly accessing memory. It is made for the purpose of
serving as a block device for a filesystem right now but it can easily be
used as a way to map any external memory into a processes space using the
abstraction of a block device. But then you can do that with any device
driver using VM_PFNMAP or VM_MIXEDMAP. Maybe we better use that term
instead. Guess I have repeated myself 6 times or so now? I am stopping
with this one.
> My point is that this is 2 differents non overlapping problems, and
> thus mandate 2 differents solution.
Well confusion abounds since so much other stuff has been attached to DAX
devices.
Lets drop the DAX term and use VM_PFNMAP or VM_MIXEDMAP instead. MIXEDMAP
is the mechanism that DAX relies on in the VM.
On 04/27/2015 03:26 PM, Christoph Lameter wrote:
> DAX is about directly accessing memory. It is made for the purpose of
> serving as a block device for a filesystem right now but it can easily be
> used as a way to map any external memory into a processes space using the
> abstraction of a block device. But then you can do that with any device
> driver using VM_PFNMAP or VM_MIXEDMAP. Maybe we better use that term
> instead. Guess I have repeated myself 6 times or so now? I am stopping
> with this one.
Yeah, please stop.
If after 6 times you have still not grasped that having the
application manage which memory goes onto the device and
which goes in RAM is the exact opposite of the use model
that Paul and Jerome are trying to enable (transparent moving
around of memory, by eg. GPU calculation libraries), you are
clearly not paying enough attention.
--
All rights reversed
On Mon, Apr 27, 2015 at 02:26:04PM -0500, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Jerome Glisse wrote:
>
> > > We can drop the DAX name and just talk about mapping to external memory if
> > > that confuses the issue.
> >
> > DAX is for direct access block layer (X is for the cool name factor)
> > there is zero code inside DAX that would be usefull to us. Because it
> > is all about filesystem and short circuiting the pagecache. So DAX is
> > _not_ about providing rw mappings to non regular memory, it is about
> > allowing to directly map _filesystem backing storage_ into a process.
>
> Its about directly mapping memory outside of regular kernel
> management via a block device into user space. That you can put a
> filesystem on top is one possible use case. You can provide a block
> device to map the memory of the coprocessor and then configure the memory
> space to have the same layout on the coprocessor as well as the linux
> process.
A _block device_ is not what we want; the block device API does not match
anything remotely useful for our use case. Most of the block device API
deals with disks and scheduling I/O on them, none of which is interesting
to us. So we would need to carefully create various no-op functions and
insert ourselves as some kind of fake block device, while also making sure
no userspace could actually use us as a regular block device. So we would
be pretending to be something we are not.
>
> > Moreover DAX is not about managing that persistent memory, all the
> > management is done inside the fs (ext4, xfs, ...) in the same way as
> > for non persistent memory. While in our case we want to manage the
> > memory as a runtime resources that is allocated to process the same
> > way regular system memory is managed.
>
> I repeatedly said that. So you would have a block device that would be
> used to mmap portions of the special memory into a process.
>
> > So current DAX code have nothing of value for our usecase nor what we
> > propose will have anyvalue for DAX. Unless they decide to go down the
> > struct page road for persistent memory (which from last discussion i
> > heard was not there plan, i am pretty sure they entirely dismissed
> > that idea for now).
>
> DAX is about directly accessing memory. It is made for the purpose of
> serving as a block device for a filesystem right now but it can easily be
> used as a way to map any external memory into a processes space using the
> abstraction of a block device. But then you can do that with any device
> driver using VM_PFNMAP or VM_MIXEDMAP. Maybe we better use that term
> instead. Guess I have repeated myself 6 times or so now? I am stopping
> with this one.
>
> > My point is that this is 2 differents non overlapping problems, and
> > thus mandate 2 differents solution.
>
> Well confusion abounds since so much other stuff has ben attached to DAX
> devices.
>
> Lets drop the DAX term and use VM_PFNMAP or VM_MIXEDMAP instead. MIXEDMAP
> is the mechanism that DAX relies on in the VM.
Which would require fare more changes than you seem to think. First using
MIXED|PFNMAP means we loose any kind of memory accounting and forget about
memcg too. Seconds it means we would need to set those flags on all vma,
which kind of point out that something must be wrong here. You will also
need to have vm_ops for all those vma (including for anonymous private vma
which sounds like it will break quite few place that test for that). Then
you have to think about vma that already have vm_ops but you would need
to override it to handle case where its device memory and then forward
other case to the existing vm_ops, extra layering, extra complexity.
All in all, this leads me to believe that any such approach would be
vastly more complex, involve changing many places, and try to shoehorn
something into the block device model that is clearly not a block device.
Paul's solution or mine are far smaller. I think Paul can even get away
without adding/changing a ZONE by putting the device pages onto a different
list that is not used by the kernel memory allocator. Only a few code
places would need a new if() (when freeing a page and when initializing
the device memory struct pages; the lru code could be kept intact here).
I think at this point there is nothing more to discuss here. It is pretty
clear to me that any solution using block device/MIXEDMAP would be far
more complex and far more intrusive. I do not mind being prove wrong but
i will certainly not waste my time trying to implement such solution.
Btw, as a data point, if you ignore my patches to mmu_notifier (which are
mostly about passing down more context information to the callback),
I touch fewer than 50 lines of common mm code. Everything else is helpers
that are only used by the device driver.
Cheers,
Jérôme
On Mon, 2015-04-27 at 11:48 -0500, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Rik van Riel wrote:
>
> > Why would we want to avoid the sane approach that makes this thing
> > work with the fewest required changes to core code?
>
> Becaus new ZONEs are a pretty invasive change to the memory management and
> because there are other ways to handle references to device specific
> memory.
ZONEs is just one option we put on the table.
I think we can mostly agree on the fundamentals that a good model of
such a co-processor is a NUMA node, possibly with a higher distance
than other nodes (but even that can be debated).
That gives us a lot of the basics we need such as struct page, ability
to use existing migration infrastructure, and is actually a reasonable
representation at high level as well.
The question is how do we additionally get the random stuff we don't
care about out of the way. The large distance will not help that much
under memory pressure for example.
Covering the entire device memory with a CMA goes a long way toward that
goal. It will avoid your ordinary kernel allocations.
It also provides just what we need to be able to do large contiguous
"explicit" allocations for use by workloads that don't want the
transparent migration and by the driver for the device which might also
need such special allocations for its own internal management data
structures.
We still have the risk of pages in the CMA being pinned by something
like gup however, that's where the ZONE idea comes in, to ensure the
various kernel allocators will *never* allocate in that zone unless
explicitly specified, but that could possibly be implemented differently.
Maybe a concept of "exclusive" NUMA node, where allocations never
fallback to that node unless explicitly asked to go there.
Of course that would have an impact on memory pressure calculations,
nothing comes completely for free, but at this stage, this is the goal
of this thread, ie, to swap ideas around and see what's most likely to
work in the long run before we even start implementing something.
Cheers,
Ben.
On Mon, 27 Apr 2015, Jerome Glisse wrote:
> > is the mechanism that DAX relies on in the VM.
>
> Which would require fare more changes than you seem to think. First using
> MIXED|PFNMAP means we loose any kind of memory accounting and forget about
> memcg too. Seconds it means we would need to set those flags on all vma,
> which kind of point out that something must be wrong here. You will also
> need to have vm_ops for all those vma (including for anonymous private vma
> which sounds like it will break quite few place that test for that). Then
> you have to think about vma that already have vm_ops but you would need
> to override it to handle case where its device memory and then forward
> other case to the existing vm_ops, extra layering, extra complexity.
These vmas would only be used for those section of memory that use
memory in the coprocessor. Special memory accounting etc can be done at
the device driver layer. Multiple processes would be able to use different
GPU contexts (or devices) which provides proper isolations.
memcg is about accounting for regular memory and this is not regular
memory. It looks like one would need a lot of special casing in
the VM if one wanted to handle f.e. GPU memory as regular memory under
Linux.
> I think at this point there is nothing more to discuss here. It is pretty
> clear to me that any solution using block device/MIXEDMAP would be far
> more complex and far more intrusive. I do not mind being prove wrong but
> i will certainly not waste my time trying to implement such solution.
The device driver method is the current solution used by the GPUs and
that would be the natural starting point for development. And they do not
currently add code to the core vm. I think we first need to figure out if
we cannot do what you want through that method.
On Tue, Apr 28, 2015 at 09:18:55AM -0500, Christoph Lameter wrote:
> On Mon, 27 Apr 2015, Jerome Glisse wrote:
>
> > > is the mechanism that DAX relies on in the VM.
> >
> > Which would require fare more changes than you seem to think. First using
> > MIXED|PFNMAP means we loose any kind of memory accounting and forget about
> > memcg too. Seconds it means we would need to set those flags on all vma,
> > which kind of point out that something must be wrong here. You will also
> > need to have vm_ops for all those vma (including for anonymous private vma
> > which sounds like it will break quite few place that test for that). Then
> > you have to think about vma that already have vm_ops but you would need
> > to override it to handle case where its device memory and then forward
> > other case to the existing vm_ops, extra layering, extra complexity.
>
> These vmas would only be used for those section of memory that use
> memory in the coprocessor. Special memory accounting etc can be done at
> the device driver layer. Multiple processes would be able to use different
> GPU contexts (or devices) which provides proper isolations.
>
> memcg is about accouting for regular memory and this is not regular
> memory. It ooks like one would need a lot of special casing in
> the VM if one wanted to handle f.e. GPU memory as regular memory under
> Linux.
Well, I showed this does not need many changes; refer to:
http://lwn.net/Articles/597289/
More specifically:
http://thread.gmane.org/gmane.linux.kernel.mm/116584
The idea here is that even if device memory is a special kind of memory, we
still want to account it properly against the process, i.e., an anonymous
page that is in device memory would still be accounted as a regular
anonymous page for memcg (the same applies to file-backed pages). With
that, existing memcg keeps working as intended and process memory use is
properly accounted.
This does not prevent the device driver from performing its own accounting
of device memory and from allowing or blocking migration for a given
process. At this point we do not think it is meaningful to move such
accounting to a common layer.
Bottom line is, we want to keep existing memcg accounting intact and we
want to reflect remote memory as regular memory. Note that the memcg
changes would be even smaller now that Johannes has cleaned up and
simplified memcg. I have not rebased that part of HMM yet.
>
> > I think at this point there is nothing more to discuss here. It is pretty
> > clear to me that any solution using block device/MIXEDMAP would be far
> > more complex and far more intrusive. I do not mind being prove wrong but
> > i will certainly not waste my time trying to implement such solution.
>
> The device driver method is the current solution used by the GPUS and
> that would be the natural starting point for development. And they do not
> currently add code to the core vm. I think we first need to figure out if
> we cannot do what you want through that method.
We do need a different solution; I have been working on that for the last
two years for a reason.
Requirement: _no_ special allocator in userspace, so that all kinds of
memory (anonymous, shared, file backed) can be used and migrated to device
memory in a transparent fashion for the application.
No special allocator implies no special vma and thus no special vm_ops. So
we need either to hook into a few places inside the mm code, with minor
changes to deal with the special CPU pte entries of migrated memory (on
page fault, fork, write back).
For all those places it is just about adding:
if(new_special_pte)
new_helper_function()
The other solution would have been to introduce yet another vm_ops that
would supersede the existing vm_ops. This works for page faults but
requires more changes for fork, and major changes for write back. Hence
the first solution was favored.
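To make that concrete, a hedged sketch of the kind of helper such a hook
would call from the fault path; is_ccad_swap_entry() and
ccad_migrate_back() are hypothetical names for the proposed common
infrastructure, not existing kernel symbols:

    /* Called from the pte fault path when a non-present entry turns
     * out to mark memory that currently resides on the device. */
    static int ccad_pte_fault(struct mm_struct *mm,
                              struct vm_area_struct *vma,
                              unsigned long address, pte_t entry)
    {
            if (!is_ccad_swap_entry(entry))
                    return 0;       /* not ours, take the normal path */

            /* CPU touched device-resident memory: migrate the page back
             * to system memory, then let the faulting access retry. */
            return ccad_migrate_back(mm, vma, address);
    }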
I explored many different paths before going down the road I am on, and
all you are doing is hand-waving some ideas without even considering any
of the objections I formulated. I explained why your ideas cannot work or
would require excessive and more complex changes than the solution we are
proposing.
Cheers,
Jérôme
Sorry for reviving oldish thread...
On 04/28/2015 01:54 AM, Benjamin Herrenschmidt wrote:
> On Mon, 2015-04-27 at 11:48 -0500, Christoph Lameter wrote:
>> On Mon, 27 Apr 2015, Rik van Riel wrote:
>>
>>> Why would we want to avoid the sane approach that makes this thing
>>> work with the fewest required changes to core code?
>>
>> Becaus new ZONEs are a pretty invasive change to the memory management and
>> because there are other ways to handle references to device specific
>> memory.
>
> ZONEs is just one option we put on the table.
>
> I think we can mostly agree on the fundamentals that a good model of
> such a co-processor is a NUMA node, possibly with a higher distance
> than other nodes (but even that can be debated).
>
> That gives us a lot of the basics we need such as struct page, ability
> to use existing migration infrastructure, and is actually a reasonably
> representation at high level as well.
>
> The question is how do we additionally get the random stuff we don't
> care about out of the way. The large distance will not help that much
> under memory pressure for example.
>
> Covering the entire device memory with a CMA goes a long way toward that
> goal. It will avoid your ordinary kernel allocations.
I think ZONE_MOVABLE should be sufficient for this. CMA is basically for
marking parts of zones as MOVABLE-only. You shouldn't need that for the
whole zone. Although it might happen that CMA will be a special zone one
day.
> It also provides just what we need to be able to do large contiguous
> "explicit" allocations for use by workloads that don't want the
> transparent migration and by the driver for the device which might also
> need such special allocations for its own internal management data
> structures.
Plain zone compaction + reclaim should work as well in a ZONE_MOVABLE
zone. CMA allocations might IIRC additionally migrate across zones, e.g.
from the device to system memory (unlike plain compaction), which might
be what you want, or not.
> We still have the risk of pages in the CMA being pinned by something
> like gup however, that's where the ZONE idea comes in, to ensure the
> various kernel allocators will *never* allocate in that zone unless
> explicitly specified, but that could possibly implemented differently.
Kernel allocations should ignore the ZONE_MOVABLE zone as they are not
typically movable. Then it depends on how much control you want for
userspace allocations.
> Maybe a concept of "exclusive" NUMA node, where allocations never
> fallback to that node unless explicitly asked to go there.
I guess that could be doable on the zonelist level, where the device
memory node/zone wouldn't be part of the "normal" zonelists, so memory
pressure calculations should be also fine. But sure there will be some
corner cases :)
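A hedged sketch of that zonelist-level idea: node_is_ccad() is a
hypothetical predicate and add_node_to_zonelist() stands in for the
existing zonelist-building helpers, which are more involved than this.

    /* When building the fallback list for an ordinary node, never add
     * an "exclusive" CCAD node; allocations reach such a node only when
     * it is requested explicitly. */
    static void build_fallback_list(pg_data_t *pgdat)
    {
            int node;

            for_each_online_node(node) {
                    if (node_is_ccad(node) && node != pgdat->node_id)
                            continue;
                    add_node_to_zonelist(pgdat, node);
            }
    }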
> Of course that would have an impact on memory pressure calculations,
> nothign comes completely for free, but at this stage, this is the goal
> of this thread, ie, to swap ideas around and see what's most likely to
> work in the long run before we even start implementing something.
>
> Cheers,
> Ben.
>
>
On Wed, 2015-05-13 at 16:10 +0200, Vlastimil Babka wrote:
> Sorry for reviving oldish thread...
Well, that's actually appreciated since this is constructive discussion
of the kind I was hoping to trigger initially :-) I'll look at
ZONE_MOVABLE, I wasn't aware of its existence.
Don't we still have the problem that ZONEs must be somewhat contiguous
chunks ? Ie, my "CAPI memory" will be interleaved in the physical
address space somewhat.. This is due to the address space on some of
those systems where you'll basically have something along the lines of:
[ node 0 mem ] [ node 0 CAPI dev ] .... [ node 1 mem] [ node 1 CAPI dev] ...
> On 04/28/2015 01:54 AM, Benjamin Herrenschmidt wrote:
> > On Mon, 2015-04-27 at 11:48 -0500, Christoph Lameter wrote:
> >> On Mon, 27 Apr 2015, Rik van Riel wrote:
> >>
> >>> Why would we want to avoid the sane approach that makes this thing
> >>> work with the fewest required changes to core code?
> >>
> >> Becaus new ZONEs are a pretty invasive change to the memory management and
> >> because there are other ways to handle references to device specific
> >> memory.
> >
> > ZONEs is just one option we put on the table.
> >
> > I think we can mostly agree on the fundamentals that a good model of
> > such a co-processor is a NUMA node, possibly with a higher distance
> > than other nodes (but even that can be debated).
> >
> > That gives us a lot of the basics we need such as struct page, ability
> > to use existing migration infrastructure, and is actually a reasonably
> > representation at high level as well.
> >
> > The question is how do we additionally get the random stuff we don't
> > care about out of the way. The large distance will not help that much
> > under memory pressure for example.
> >
> > Covering the entire device memory with a CMA goes a long way toward that
> > goal. It will avoid your ordinary kernel allocations.
>
> I think ZONE_MOVABLE should be sufficient for this. CMA is basically for
> marking parts of zones as MOVABLE-only. You shouldn't need that for the
> whole zone. Although it might happen that CMA will be a special zone one
> day.
>
> > It also provides just what we need to be able to do large contiguous
> > "explicit" allocations for use by workloads that don't want the
> > transparent migration and by the driver for the device which might also
> > need such special allocations for its own internal management data
> > structures.
>
> Plain zone compaction + reclaim should work as well in a ZONE_MOVABLE
> zone. CMA allocations might IIRC additionally migrate across zones, e.g.
> from the device to system memory (unlike plain compaction), which might
> be what you want, or not.
>
> > We still have the risk of pages in the CMA being pinned by something
> > like gup however, that's where the ZONE idea comes in, to ensure the
> > various kernel allocators will *never* allocate in that zone unless
> > explicitly specified, but that could possibly implemented differently.
>
> Kernel allocations should ignore the ZONE_MOVABLE zone as they are not
> typically movable. Then it depends on how much control you want for
> userspace allocations.
>
> > Maybe a concept of "exclusive" NUMA node, where allocations never
> > fallback to that node unless explicitly asked to go there.
>
> I guess that could be doable on the zonelist level, where the device
> memory node/zone wouldn't be part of the "normal" zonelists, so memory
> pressure calculations should be also fine. But sure there will be some
> corner cases :)
>
> > Of course that would have an impact on memory pressure calculations,
> > nothign comes completely for free, but at this stage, this is the goal
> > of this thread, ie, to swap ideas around and see what's most likely to
> > work in the long run before we even start implementing something.
> >
> > Cheers,
> > Ben.
> >
> >
On 05/14/2015 01:38 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2015-05-13 at 16:10 +0200, Vlastimil Babka wrote:
>> Sorry for reviving oldish thread...
>
> Well, that's actually appreciated since this is constructive discussion
> of the kind I was hoping to trigger initially :-) I'll look at
I hoped so :)
> ZONE_MOVABLE, I wasn't aware of its existence.
>
> Don't we still have the problem that ZONEs must be somewhat contiguous
> chunks ? Ie, my "CAPI memory" will be interleaved in the physical
> address space somewhat.. This is due to the address space on some of
> those systems where you'll basically have something along the lines of:
>
> [ node 0 mem ] [ node 0 CAPI dev ] .... [ node 1 mem] [ node 1 CAPI dev] ...
Oh, I see. The VM code should cope with that, but some operations would
be inefficiently looping over the holes in the CAPI zone by 2MB
pageblock per iteration. This would include compaction scanning, which
would suck if you need those large contiguous allocations as you said.
Interleaving works better if it's done with a smaller granularity.
But I guess you could just represent the CAPI as multiple NUMA nodes,
each with single ZONE_MOVABLE zone. Especially if "node 0 CAPI dev" and
"node 1 CAPI dev" differs in other characteristics than just using a
different range of PFNs... otherwise what's the point of this split anyway?
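From the driver side, such a split might use the existing memory-hotplug
entry point -- a hedged sketch, noting that the exact add_memory()
signature differs between kernel versions and that ccad_nid, ccad_base,
and ccad_size are assumed device properties. The resulting memory blocks
could then be onlined as movable (ZONE_MOVABLE) from userspace via
/sys/devices/system/memory/.

    /* Hot-add one CCAD device's memory as its own CPU-less NUMA node. */
    static int ccad_hotadd(int ccad_nid, u64 ccad_base, u64 ccad_size)
    {
            return add_memory(ccad_nid, ccad_base, ccad_size);
    }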
On Thu, 2015-05-14 at 09:39 +0200, Vlastimil Babka wrote:
> On 05/14/2015 01:38 AM, Benjamin Herrenschmidt wrote:
> > On Wed, 2015-05-13 at 16:10 +0200, Vlastimil Babka wrote:
> >> Sorry for reviving oldish thread...
> >
> > Well, that's actually appreciated since this is constructive discussion
> > of the kind I was hoping to trigger initially :-) I'll look at
>
> I hoped so :)
>
> > ZONE_MOVABLE, I wasn't aware of its existence.
> >
> > Don't we still have the problem that ZONEs must be somewhat contiguous
> > chunks ? Ie, my "CAPI memory" will be interleaved in the physical
> > address space somewhat.. This is due to the address space on some of
> > those systems where you'll basically have something along the lines of:
> >
> > [ node 0 mem ] [ node 0 CAPI dev ] .... [ node 1 mem] [ node 1 CAPI dev] ...
>
> Oh, I see. The VM code should cope with that, but some operations would
> be inefficiently looping over the holes in the CAPI zone by 2MB
> pageblock per iteration. This would include compaction scanning, which
> would suck if you need those large contiguous allocations as you said.
> Interleaving works better if it's done with a smaller granularity.
>
> But I guess you could just represent the CAPI as multiple NUMA nodes,
> each with single ZONE_MOVABLE zone. Especially if "node 0 CAPI dev" and
> "node 1 CAPI dev" differs in other characteristics than just using a
> different range of PFNs... otherwise what's the point of this split anyway?
Correct, I think we want the CAPI devs to look like CPU-less NUMA nodes
anyway. This is the right way to target an allocation at one of them and
it conveys the distance properly, so it makes sense.
I'll add the ZONE_MOVABLE to the list of things to investigate on our
side, thanks for the pointer !
Cheers,
Ben.