2016-03-16 17:11:35

by Olu Ogunbowale

Subject: Mirroring process address space on device

In a nutshell:

Export the memory management functions, unmapped_area() &
unmapped_area_topdown(), as GPL symbols; this allows the kernel to
better support process address space mirroring on both CPU and device
for out-of-tree drivers by allowing the use of vm_unmapped_area() in a
driver's file operation get_unmapped_area().

This is required by drivers that want to control or limit the process VMA
range into which shared-virtual-memory (SVM) buffers are mapped during
an mmap() call, in order to ensure that the SVM VMA does not collide
with any pre-existing VMAs used by non-buffer regions on the device;
SVM buffers must have identical VMAs on both CPU and device.

Exporting these functions is particularly useful for graphics devices,
as SVM support is required by the OpenCL & HSA specifications, and also
for 64-bit CPUs, where the usable device SVM address range is, or may
be, a subset of the full 64-bit range of the CPU. Exporting also avoids
the need to duplicate the VMA search code in such drivers.

Why do this:

The OpenCL API & Heterogeneous System Architecture (HSA) specifications
require mirroring a process address space on both the CPU and GPU, so-
called shared-virtual-memory (SVM) support, wherein the same virtual
address is used to address the same content on both the CPU and GPU.

There are different levels of support, from coarse to fine-grained, with
slightly different semantics (1: coarse-grained buffer SVM, 2:
fine-grained buffer SVM & 3: fine-grained system SVM); furthermore,
support for the highest level, fine-grained system SVM, is optional, and
this fact is central to this requirement, as explained below.

For hardware & drivers implementing support for SVM up to the second
level only, i.e. the fine-grained buffer SVM level, this mirroring is
effectively at a buffer allocation level and therefore excludes the need
for any heterogeneous memory management (HMM)-like functionality, which
is required to support SVM up to the highest level, i.e. fine-grained
system SVM (see http://lwn.net/Articles/597289/ for details). In this
case, drivers would benefit from being able to specify/control the SVM
VMA range during an mmap() call, especially if the device SVM VMA range
is a subset of the full 32-bit/64-bit CPU (process/mmap) range.

As the kernel already provides a char driver
file->f_op->get_unmapped_area() entry point for this, the backend of
such a call would require a constrained search for an unmapped address
range using vm_unmapped_area(), which currently calls into either
unmapped_area() or unmapped_area_topdown(), neither of which is
currently an exported symbol. Therefore, exporting these symbols allows
the kernel to better support this type of process address space
mirroring, and it also avoids duplicating the VMA search code in these
drivers.

As always, comments are welcome and many thanks in advance for
consideration.

Olu Ogunbowale (1):
mm: Export symbols unmapped_area() & unmapped_area_topdown()

mm/mmap.c | 4 ++++
1 file changed, 4 insertions(+)

--
2.7.1


2016-03-16 17:11:46

by Olu Ogunbowale

Subject: [PATCH] mm: Export symbols unmapped_area() & unmapped_area_topdown()

From: Olujide Ogunbowale <[email protected]>

Export the memory management functions, unmapped_area() &
unmapped_area_topdown(), as GPL symbols; this allows the kernel to
better support process address space mirroring on both CPU and device
for out-of-tree drivers by allowing the use of vm_unmapped_area() in a
driver's file operation get_unmapped_area().

This is required by drivers that want to control or limit the process VMA
range into which shared-virtual-memory (SVM) buffers are mapped during
an mmap() call, in order to ensure that the SVM VMA does not collide
with any pre-existing VMAs used by non-buffer regions on the device;
SVM buffers must have identical VMAs on both CPU and device.

Exporting these functions is particularly useful for graphics devices,
as SVM support is required by the OpenCL & HSA specifications, and also
for 64-bit CPUs, where the usable device SVM address range is, or may
be, a subset of the full 64-bit range of the CPU. Exporting also avoids
the need to duplicate the VMA search code in such drivers.

Signed-off-by: Olu Ogunbowale <[email protected]>
---
mm/mmap.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/mmap.c b/mm/mmap.c
index 76d1ec2..c08b518 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1804,6 +1804,8 @@ found:
return gap_start;
}

+EXPORT_SYMBOL_GPL(unmapped_area);
+
unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
{
struct mm_struct *mm = current->mm;
@@ -1902,6 +1904,8 @@ found_highest:
return gap_end;
}

+EXPORT_SYMBOL_GPL(unmapped_area_topdown);
+
/* Get an address range which is currently unmapped.
* For shmat() with addr=0.
*
--
1.7.9.5

2016-03-16 20:37:09

by Christoph Hellwig

Subject: Re: [PATCH] mm: Export symbols unmapped_area() & unmapped_area_topdown()

On Wed, Mar 16, 2016 at 05:10:34PM +0000, Olu Ogunbowale wrote:
> From: Olujide Ogunbowale <[email protected]>
>
> Export the memory management functions, unmapped_area() &
> unmapped_area_topdown(), as GPL symbols; this allows the kernel to
> better support process address space mirroring on both CPU and device
> for out-of-tree drivers by allowing the use of vm_unmapped_area() in a
> driver's file operation get_unmapped_area().

No new exports without in-tree drivers. How about you get started
on getting your drivers into the tree first?

2016-03-16 21:00:48

by Rik van Riel

Subject: Re: [PATCH] mm: Export symbols unmapped_area() & unmapped_area_topdown()

On Wed, 2016-03-16 at 13:36 -0700, Christoph Hellwig wrote:
> On Wed, Mar 16, 2016 at 05:10:34PM +0000, Olu Ogunbowale wrote:
> >
> > From: Olujide Ogunbowale <[email protected]>
> >
> > Export the memory management functions, unmapped_area() &
> > unmapped_area_topdown(), as GPL symbols; this allows the kernel to
> > better support process address space mirroring on both CPU and
> > device
> > for out-of-tree drivers by allowing the use of vm_unmapped_area()
> > in a
> > driver's file operation get_unmapped_area().
> No new exports without in-tree drivers.  How about you get started
> to get your drives into the tree first?

The drivers appear to require the HMM framework though,
which people are also reluctant to merge without the
drivers.

How do we get past this chicken & egg situation?

--
All Rights Reversed.



2016-03-17 07:25:07

by Ingo Molnar

Subject: Re: [PATCH] mm: Export symbols unmapped_area() & unmapped_area_topdown()


* Rik van Riel <[email protected]> wrote:

> On Wed, 2016-03-16 at 13:36 -0700, Christoph Hellwig wrote:
> > On Wed, Mar 16, 2016 at 05:10:34PM +0000, Olu Ogunbowale wrote:
> > >
> > > From: Olujide Ogunbowale <[email protected]>
> > >
> > > Export the memory management functions, unmapped_area() &
> > > unmapped_area_topdown(), as GPL symbols; this allows the kernel to
> > > better support process address space mirroring on both CPU and
> > > device
> > > for out-of-tree drivers by allowing the use of vm_unmapped_area()
> > > in a
> > > driver's file operation get_unmapped_area().
> > No new exports without in-tree drivers. How about you get started
> > to get your drives into the tree first?
>
> The drivers appear to require the HMM framework though,
> which people are also reluctant to merge without the
> drivers.
>
> How do we get past this chicken & egg situation?

Submit the export together with the drivers for review and Cc: VM folks - it all
looks pretty small on the VM side.

Thanks,

Ingo

2016-03-17 14:37:31

by Jerome Glisse

Subject: Re: [PATCH] mm: Export symbols unmapped_area() & unmapped_area_topdown()

On Wed, Mar 16, 2016 at 05:10:34PM +0000, Olu Ogunbowale wrote:
> From: Olujide Ogunbowale <[email protected]>
>
> Export the memory management functions, unmapped_area() &
> unmapped_area_topdown(), as GPL symbols; this allows the kernel to
> better support process address space mirroring on both CPU and device
> for out-of-tree drivers by allowing the use of vm_unmapped_area() in a
> driver's file operation get_unmapped_area().
>
> This is required by drivers that want to control or limit a process VMA
> range into which shared-virtual-memory (SVM) buffers are mapped during
> an mmap() call in order to ensure that said SVM VMA does not collide
> with any pre-existing VMAs used by non-buffer regions on the device
> because SVM buffers must have identical VMAs on both CPU and device.
>
> Exporting these functions is particularly useful for graphics devices as
> SVM support is required by the OpenCL & HSA specifications and also SVM
> support for 64-bit CPUs where the useable device SVM address range
> is/maybe a subset of the full 64-bit range of the CPU. Exporting also
> avoids the need to duplicate the VMA search code in such drivers.

What other drivers do for non-buffer regions is have the userspace side
of the device driver mmap the device driver file and use the VMA range
you get from that for those non-buffer regions. On CPU access you can
either choose to fault or to return a dummy page. With that trick there
is no need to change the kernel.

Note that I do not see how you can solve the issue of your GPU having
fewer bits than the CPU. For instance, let's assume that you have 46 bits
for the GPU while the CPU has 48 bits. Now an application starts and does
a bunch of allocations that end up above (1 << 46); then the same
application loads your driver and starts using some API that allows it to
transparently use the previously allocated memory -> fails.

Unless you are in a scheme where all allocations must go through some
special allocator, but I thought this was not the case for HSA. I know
the lower levels of OpenCL allow that.

Cheers,
Jérôme

2016-03-17 15:39:28

by Oded Gabbay

Subject: Re: [PATCH] mm: Export symbols unmapped_area() & unmapped_area_topdown()

On Thu, Mar 17, 2016 at 4:37 PM, Jerome Glisse <[email protected]> wrote:
> On Wed, Mar 16, 2016 at 05:10:34PM +0000, Olu Ogunbowale wrote:
>> From: Olujide Ogunbowale <[email protected]>
>>
>> Export the memory management functions, unmapped_area() &
>> unmapped_area_topdown(), as GPL symbols; this allows the kernel to
>> better support process address space mirroring on both CPU and device
>> for out-of-tree drivers by allowing the use of vm_unmapped_area() in a
>> driver's file operation get_unmapped_area().
>>
>> This is required by drivers that want to control or limit a process VMA
>> range into which shared-virtual-memory (SVM) buffers are mapped during
>> an mmap() call in order to ensure that said SVM VMA does not collide
>> with any pre-existing VMAs used by non-buffer regions on the device
>> because SVM buffers must have identical VMAs on both CPU and device.
>>
>> Exporting these functions is particularly useful for graphics devices as
>> SVM support is required by the OpenCL & HSA specifications and also SVM
>> support for 64-bit CPUs where the useable device SVM address range
>> is/maybe a subset of the full 64-bit range of the CPU. Exporting also
>> avoids the need to duplicate the VMA search code in such drivers.
>
> What other driver do for non-buffer region is have the userspace side
> of the device driver mmap the device driver file and use vma range you
> get from that for those non-buffer region. On cpu access you can either
> chose to fault or to return a dummy page. With that trick no need to
> change kernel.
>
> Note that i do not see how you can solve the issue of your GPU having
> less bits then the cpu. For instance, lets assume that you have 46bits
> for the GPU while the CPU have 48bits. Now an application start and do
> bunch of allocation that end up above (1 << 46), then same application
> load your driver and start using some API that allow to transparently
> use previously allocated memory -> fails.
>
> Unless you are in scheme were all allocation must go through some
> special allocator but i thought this was not the case for HSA. I know
> lower level of OpenCL allows that.
>
> Cheers,
> Jérôme

In amdkfd (the AMD HSA kernel driver), for APUs, where the CPU and GPU
sit on the same die, we don't need this, as the GPU cores use the AMD
IOMMU (v2) to access system memory; i.e. we don't need to use VRAM (GPU
memory) at all and we don't need to mirror address spaces.

For dGPUs, it's a different story. On GPUs with only a 40-bit memory
space, for example GCN 1.0 and 1.1, I would assume going through a
special allocator is a must, while memory addresses below the 40-bit
limit will need to be reserved for HSA. Note that amdkfd doesn't support
dGPUs at this time.

Thanks,
Oded

2016-03-17 15:46:50

by Olu Ogunbowale

Subject: Re: [PATCH] mm: Export symbols unmapped_area() & unmapped_area_topdown()

On Thu, Mar 17, 2016 at 03:37:16PM +0100, Jerome Glisse wrote:
> What other driver do for non-buffer region is have the userspace side
> of the device driver mmap the device driver file and use vma range you
> get from that for those non-buffer region. On cpu access you can either
> chose to fault or to return a dummy page. With that trick no need to
> change kernel.

Yes, this approach works for some designs; however, arbitrary VMA ranges
for non-buffer regions are not a feature of all mobile GPU designs, for
performance, power, and area (PPA) reasons.

> Note that i do not see how you can solve the issue of your GPU having
> less bits then the cpu. For instance, lets assume that you have 46bits
> for the GPU while the CPU have 48bits. Now an application start and do
> bunch of allocation that end up above (1 << 46), then same application
> load your driver and start using some API that allow to transparently
> use previously allocated memory -> fails.

Yes, you are correct; however, for mobile SoCs, current top-end
specifications have 4GB/8GB of installed RAM, so the usable SVM range is
bounded above by this, giving a fixed base, hence the need for driver
control of the VMA range.

> Unless you are in scheme were all allocation must go through some
> special allocator but i thought this was not the case for HSA. I know
> lower level of OpenCL allows that.

Subsets of both specifications allow for restricted implementations
AFAIK; these proposed changes are for HSA and OpenCL up to phase 2,
where all SVM allocations go via a special user-mode allocator.

Regards,
Olu

2016-03-17 16:40:47

by Olu Ogunbowale

Subject: Re: [PATCH] mm: Export symbols unmapped_area() & unmapped_area_topdown()

On Wed, Mar 16, 2016 at 05:00:41PM -0400, Rik van Riel wrote:
>
> The drivers appear to require the HMM framework though,
> which people are also reluctant to merge without the
> drivers.
>
> How do we get past this chicken & egg situation?

I would like to point out that support for HSA varies from
one vendor/design to another. For some devices/drivers
(e.g. the AMD APU/HSA kernel driver), no form of address
space mirroring is required (thanks to the AMD IOMMU v2)
AFAIK; others require address space mirroring and so need
the kernel HMM framework, but only because they support
the full HSA/OpenCL SVM specification; and some do not
require HMM at all because they implement only a subset
of the specification.

These exports enable the latter approach, which does
not require the kernel HMM framework in order to
support process address space mirroring.

Regards,
Olu

2016-03-17 17:03:59

by Jerome Glisse

Subject: Re: [PATCH] mm: Export symbols unmapped_area() & unmapped_area_topdown()

On Thu, Mar 17, 2016 at 03:46:35PM +0000, Olu Ogunbowale wrote:
> On Thu, Mar 17, 2016 at 03:37:16PM +0100, Jerome Glisse wrote:
> > What other driver do for non-buffer region is have the userspace side
> > of the device driver mmap the device driver file and use vma range you
> > get from that for those non-buffer region. On cpu access you can either
> > chose to fault or to return a dummy page. With that trick no need to
> > change kernel.
>
> Yes, this approach works for some designs however arbitrary VMA ranges
> for non-buffer regions is not a feature of all mobile gpu designs for
> performance, power, and area (PPA) reasons.

Well, the trick still works: if the driver is loaded early during
userspace program initialization, then you force mmap to a specific
range inside the driver's userspace code. If the driver is loaded later
and the program is already using those ranges, then you can register a
notifier to track them; if they get released by the program, you can
have the userspace driver force creation of new reserved VMAs again.


>
> > Note that i do not see how you can solve the issue of your GPU having
> > less bits then the cpu. For instance, lets assume that you have 46bits
> > for the GPU while the CPU have 48bits. Now an application start and do
> > bunch of allocation that end up above (1 << 46), then same application
> > load your driver and start using some API that allow to transparently
> > use previously allocated memory -> fails.
>
> Yes, you are correct however for mobile SoC(s) though current top-end
> specifications have 4GB/8GB of installed ram so the usable SVM range is
> upper bound by this giving a fixed base hence the need for driver control
> of VMA range.

Well, controlling the range into which VMAs can be allocated is not
something that you should do lightly (things like address space layout
randomization would be impacted). And no, the SVM range is not bounded
above by the amount of memory but by the physical bus size: if it is
48 bits, nothing forbids putting all the program memory above 8GB and
nothing below. We are talking virtual addresses here. By the way, I
think most 64-bit ARMs are 40 bits, and it seems a shame for the GPU
not to go as high as the CPU.

Cheers,
Jérôme

2016-03-17 17:42:34

by Olu Ogunbowale

Subject: Re: [PATCH] mm: Export symbols unmapped_area() & unmapped_area_topdown()

On Thu, Mar 17, 2016 at 06:03:50PM +0100, Jerome Glisse wrote:
> Well trick still works, if driver is loaded early during userspace program
> initialization then you force mmap to specific range inside the driver
> userspace code. If driver is loaded after and program is already using those
> range then you can register a notifier to track when those range. If they
> get release by the program you can have the userspace driver force creation
> of new reserve vma again.

I should have been clearer in my response: this applies only because
we are in a scheme where all allocations must go through a special
allocator, because the VMA base/range is reserved for SVM.

> Well controling range into which VMA can be allocated is not something that
> you should do lightly (thing like address space randomization would be
> impacted). And no the SVM range is not upper bound by the amount of memory
> but by the physical bus size if it is 48bits nothing forbid to put all the
> program memory above 8GB and nothing below. We are talking virtual address
> here. By the way i think most 64 bit ARM are 40 bits and it seems a shame
> for GPU to not go as high as the CPU.

Same as above. By the way, we support a minimum of 40 bits but can be
paired with CPUs with more bits; there is no problem if the device's
bits are equal to or greater than the CPU's.