2023-06-30 16:52:02

by Thomas Hellström

Subject: [RFC PATCH] Documentation/gpu: Draft VM_BIND locking document

Add the first version of the VM_BIND locking document which is
intended to be part of the xe driver upstreaming agreement.

The document describes and discusses the locking used during exec
functions, eviction and for userptr gvmas. The intention is to use the
same nomenclature as drm-vm-bind-async.rst but, to keep naming a
little shorter, to use gvm and gvma instead of the gpu_vm and gpu_vma
used in that document, with the intention to also modify that
document accordingly.

Signed-off-by: Thomas Hellström <[email protected]>
---
Documentation/gpu/drm-vm-bind-locking.rst | 339 ++++++++++++++++++++++
1 file changed, 339 insertions(+)
create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst

diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst
new file mode 100644
index 000000000000..f5d1a40a2906
--- /dev/null
+++ b/Documentation/gpu/drm-vm-bind-locking.rst
@@ -0,0 +1,339 @@
+===============
+VM_BIND locking
+===============
+
+This document attempts to describe what's needed to get VM_BIND locking right,
+including the userptr mmu_notifier locking. It also discusses some
+optimizations that eliminate the need to loop through all userptr mappings
+and external / shared object mappings, which the simplest implementation
+requires, and some implications for faulting gvms.
+
+Nomenclature
+============
+
+* ``Context``: GPU execution context.
+* ``gvm``: Abstraction of a GPU address space with meta-data. Typically
+ one per client (DRM file-private), or one per context.
+* ``gvma``: Abstraction of a GPU address range within a gvm with
+ associated meta-data. The backing storage of a gvma can either be
+ a gem buffer object or anonymous pages mapped also into the CPU
+ address space for the process.
+* ``userptr gvma or just userptr``: A gvma, the backing store of
+ which is anonymous pages as described above.
+* ``revalidating``: Revalidating a gvma means making the latest version
+ of the backing store resident and making sure the gvma's
+ page-table entries point to that backing store.
+* ``dma_fence``: A struct dma_fence that is similar to a struct completion
+ and which tracks GPU activity. When the GPU activity is finished,
+ the dma_fence signals.
+* ``dma_resv``: A struct dma_resv (AKA reservation object) that is used
+ to track GPU activity in the form of multiple dma_fences on a
+ gvm or a gem buffer object. The dma_resv contains an array / list
+ of dma_fences and a lock that needs to be held when adding
+ additional dma_fences to the dma_resv. The lock is of a type that
+ allows deadlock-safe locking of multiple dma_resvs in arbitrary
+ order; a sketch of this locking pattern follows this list.
+* ``exec function``: An exec function is a function that revalidates all
+ affected gvmas, submits a GPU command batch and registers the
+ dma_fence representing the GPU command's activity with all affected
+ dma_resvs. For completeness, although not covered by this document,
+ it's worth mentioning that an exec function may also be the
+ revalidation worker that is used by some drivers in compute /
+ long-running mode.
+* ``local object``: A GEM object which is local to a gvm. Local GEM
+ objects share the gvm's dma_resv.
+* ``shared object``: AKA external object: A GEM object which may be shared
+ by multiple gvms and whose backing storage may be shared with
+ other drivers.
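+
+As an illustration of the last point under ``dma_resv``, here is a
+minimal sketch of deadlock-safe locking of the dma_resvs of two
+objects in arbitrary order using the ww_mutex-based dma_resv API. The
+objects ``a`` and ``b`` and the bail-out error handling are
+illustrative assumptions only; a complete implementation would retry
+until no ``-EDEADLK`` remains:
+
+.. code-block:: C
+
+        struct ww_acquire_ctx ctx;
+        int err;
+
+        ww_acquire_init(&ctx, &reservation_ww_class);
+
+        err = dma_resv_lock(a->resv, &ctx);
+        if (err)
+                goto out_fini;
+
+        err = dma_resv_lock(b->resv, &ctx);
+        if (err == -EDEADLK) {
+                // Back off: drop the held lock, sleep on the
+                // contended one, then take the remaining lock again.
+                dma_resv_unlock(a->resv);
+                dma_resv_lock_slow(b->resv, &ctx);
+                err = dma_resv_lock(a->resv, &ctx);
+                if (err) {
+                        dma_resv_unlock(b->resv);
+                        goto out_fini;
+                }
+        }
+
+        ww_acquire_done(&ctx);
+
+        // ... revalidate, submit and add dma_fences here ...
+
+        dma_resv_unlock(a->resv);
+        dma_resv_unlock(b->resv);
+out_fini:
+        ww_acquire_fini(&ctx);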
+
+
+Introducing the locks
+=====================
+
+One of the benefits of VM_BIND is that local GEM objects share the gvm's
+dma_resv object and hence the dma_resv lock. So even with a huge
+number of local GEM objects, only one lock is needed to make the exec
+sequence atomic.
+
+The following locks and locking orders are used:
+
+* The ``gvm->lock`` (optionally an rwsem). Protects how the gvm is
+ partitioned into gvmas, protects the gvm's list of external objects,
+ and can also with some simplification protect the gvm's list of
+ userptr gvmas. With the CPU mm analogy this would correspond to the
+ mmap_lock.
+* The ``userptr_seqlock``. This is really the seqlock-like
+ functionality of the MMU interval notifier and is per userptr gvma.
+ It is taken in read mode for each userptr gvma on the gvm's userptr
+ list during exec, and in write mode during mmu notifier invalidation.
+* The ``gvm->resv`` lock. Protects the gvm's list of gvmas needing
+ rebinding, and also the residency of all the gvm's local GEM objects.
+* The ``gvm->userptr_notifier_lock``. This is an rwsem that is taken in read
+ mode during exec and write mode during an mmu notifier invalidation. In
+ the absence of a separate page-table lock, this lock can serve
+ together with the gvm's dma_resv lock as a page-table lock. More on
+ this below. The userptr notifier lock is per gvm.
+* The ``gvm->page_table_lock``. Protects the gvm's page-table updates. For
+ simplicity the gvm's dma_resv lock can be reused as page-table lock.
+
+There are certain optimizations described below that require
+additional locks. More on that later.
+
+Using these locks, an exec function for a gvm with only local objects
+becomes, in pseudo-code:
+
+.. code-block:: C
+
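+ // gvm->resv doubles as the dma_resv of all the gvm's local objects,
+ // so this single lock makes the whole revalidate-and-submit
+ // sequence atomic.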
+ dma_resv_lock(&gvm->resv);
+
+ for_each_gvma_on_revalidate_list(gvm, &gvma) {
+ revalidate_gvma(&gvma);
+ remove_from_revalidate_list(&gvma);
+ }
+
+ add_dependencies(&gpu_job, &gvm->resv);
+ job_dma_fence = gpu_submit(&gpu_job);
+
+ add_dma_fence(job_dma_fence, &gvm->resv);
+ dma_resv_unlock(&gvm->resv);
+
+Eviction of one of these local objects will then be something like the
+following:
+
+.. code-block:: C
+
+ obj = get_object_from_lru();
+
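+ // For a local object, obj->resv == gvm->resv, so this lock also
+ // protects the gvm's revalidation list.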
+ dma_resv_lock(obj->resv);
+ for_each_gvma_of_obj(obj, &gvma)
+         put_gvma_on_revalidate_list(&gvma);
+
+ add_dependencies(&eviction_job, &obj->resv);
+ job_dma_fence = gpu_submit(&eviction_job);
+ add_dma_fence(job_dma_fence, &obj->resv);
+
+ dma_resv_unlock(&obj->resv);
+ put_object(obj);
+
+Note that since the object is local to the gvm, it will share the gvm's
+``dma_resv`` lock so that ``obj->resv == gvm->resv``. Invalidated gvmas are put
+on the gvm's revalidation list, which is protected by ``gvm->resv``, which
+is always locked while evicting, due to the above equality.
+
+Note that a gvma does not need to be unbound before eviction for
+VM_BIND gvms. Since the eviction blit or copy will wait for GPU idle,
+any attempt by the GPU to access the freed memory through the gvma
+will be preceded by a new exec function that revalidates the gvma, so
+stale page-table entries are never accessed.
+
+Introducing external (or shared) buffer objects
+===============================================
+
+Since shared buffer objects may be shared by multiple gvms, they
+can't share their reservation object with a single gvm, but instead
+have a reservation object of their own. The shared objects bound to a
+gvm using one or many gvmas are therefore typically put on a per-gvm
+list which is protected by the gvm lock. One could in theory protect
+it also with the ``gvm->resv``, but since the list of dma_resvs to
+take is typically built before the ``gvm->resv`` is locked, due to a
+limitation in the current locking helpers, that is typically not
+done. Also see below for userptr gvmas.
+
+At eviction time we now need to invalidate *all* gvmas of a shared
+object, but we can no longer be certain that we hold the gvm's
+dma_resv of all the object's gvmas. We can only be certain that we
+hold the object's private dma_resv. We could trylock the dma_resvs of
+the affected gvms, but that might be unnecessarily complex. If we
+have a ww_acquire context at hand at eviction time we could also take
+sleeping locks on those dma_resvs, but that could cause expensive
+rollbacks. One option is to just mark the invalidated gvmas with a
+bool that is inspected on the next exec function, when both the gvm's
+dma_resv and the object's dma_resv are held; the invalidated gvmas
+can then be moved to the gvm's list of invalidated gvmas. That bool
+would then, although being per-gvma, formally be protected by the
+object's dma_resv.
+
+The exec function would then look something like the following:
+
+.. code-block:: C
+
+ read_lock(&gvm->lock);
+
+ dma_resv_lock(&gvm->resv);
+
+ // Shared object list is protected by the gvm->lock.
+ for_each_shared_obj(gvm, &obj) {
+ dma_resv_lock(&obj->resv);
+ move_marked_gvmas_to_revalidate_gvma_list(obj, &gvm);
+ }
+
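+ // The revalidation list now holds both local gvmas and the external
+ // gvmas that were marked at eviction time.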
+ for_each_gvma_to_revalidate(gvm, &gvma) {
+ revalidate_gvma(&gvma);
+ remove_from_revalidate_list(&gvma);
+ }
+
+ add_dependencies(&gpu_job, &gvm->resv);
+ job_dma_fence = gpu_submit(&gpu_job);
+
+ add_dma_fence(job_dma_fence, &gvm->resv);
+ for_each_shared_obj(gvm, &obj)
+ add_dma_fence(job_dma_fence, &obj->resv);
+ dma_resv_unlock_all_resv_locks();
+
+ read_unlock(&gvm->lock);
+
+And the corresponding shared-object aware eviction would look like:
+
+.. code-block:: C
+
+ obj = get_object_from_lru();
+
+ dma_resv_lock(obj->resv);
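+ // Only local objects share gvm->resv; for external objects we can
+ // only mark the gvma here and defer the list move to the next exec.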
+ for_each_gvma_of_obj(obj, &gvma)
+         if (object_is_vm_local(obj))
+                 put_gvma_on_revalidate_list(&gvma, &gvm);
+         else
+                 mark_gvma_for_revalidation(&gvma);
+
+ add_dependencies(&eviction_job, &obj->resv);
+ job_dma_fence = gpu_submit(&eviction_job);
+ add_dma_fence(job_dma_fence, &obj->resv);
+
+ dma_resv_unlock(&obj->resv);
+ put_object(obj);
+
+Yet another option is to put the gvmas to be invalidated on a separate
+gvm list protected by a lower-level lock that can be taken both at
+eviction time and at transfer-to-revalidate-list time. The details are
+not in this document, but this is, for reference, implemented in the
+Intel xe driver.
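+
+A minimal sketch of that last option, assuming a hypothetical
+lower-level spinlock ``gvm->evict_lock`` and list ``gvm->evicted_list``:
+
+.. code-block:: C
+
+        // At eviction time, with only the object's dma_resv held:
+        spin_lock(&gvm->evict_lock);
+        list_move_tail(&gvma->evict_link, &gvm->evicted_list);
+        spin_unlock(&gvm->evict_lock);
+
+        // At exec time, with the gvm's dma_resv held, transfer the
+        // evicted gvmas to the revalidation list:
+        spin_lock(&gvm->evict_lock);
+        list_splice_tail_init(&gvm->evicted_list, &gvm->revalidate_list);
+        spin_unlock(&gvm->evict_lock);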
+
+Introducing userptr gvmas
+=========================
+
+A userptr gvma is a gvma that, instead of mapping a buffer object to a
+GPU virtual address range, directly maps a CPU mm range of anonymous
+pages or file page-cache pages.
+A very simple approach would be to just pin the pages using
+pin_user_pages() at bind time and unpin them at unbind time, but this
+creates a Denial-Of-Service vector since a single user-space process
+would be able to pin down all of system memory, which is not
+desirable. (For special use-cases and with proper accounting pinning might
+still be a desirable feature, though). What we need to do in the general case is
+to obtain a reference to the desired pages, make sure we are notified
+using a MMU notifier just before the CPU mm unmaps the pages, dirty
+them if they are not mapped read-only to the GPU, and then drop the reference.
+When we are notified by the MMU notifier that the CPU mm is about to
+drop the pages, we need to stop GPU access to the pages by unmapping
+them from the GPU page tables, and make sure that before the next
+time the GPU tries to access whatever is now present in the CPU mm
+range, we repeat the process of obtaining new page references.
+Note that when the core mm decides to launder pages, we get such
+an unmap MMU notification and can mark the pages dirty again before the
+next GPU access. We also get similar MMU notifications for NUMA accounting
+which the GPU driver doesn't really need to care about, but so far
+it's proven difficult to exclude certain notifications.
+
+Using a MMU notifier for device DMA (and other methods) is described in
+`this document
+<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_.
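+
+As a hedged sketch, registering such an interval notifier at userptr
+gvma bind time might look like the following. The range members
+``gvma->userptr_start`` and ``gvma->userptr_length`` are illustrative
+assumptions, and ``gvma_userptr_invalidate()`` is the invalidation
+notifier sketched further below:
+
+.. code-block:: C
+
+        static const struct mmu_interval_notifier_ops gvma_userptr_ops = {
+                .invalidate = gvma_userptr_invalidate,
+        };
+
+        err = mmu_interval_notifier_insert(&gvma->userptr_interval,
+                                           current->mm,
+                                           gvma->userptr_start,
+                                           gvma->userptr_length,
+                                           &gvma_userptr_ops);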
+
+Now the method of obtaining struct page references using
+get_user_pages() unfortunately can't be used under a dma_resv lock
+since that would violate the locking order of the dma_resv lock vs the
+mmap_lock that is grabbed when resolving a CPU pagefault. This means the gvm's
+list of userptr gvmas needs to be protected by an outer lock, and this
+is the first time we strictly need the gvm->lock. While it was
+previously used also to protect the list of the gvm's shared objects,
+we could in theory have used the gvm->resv for that.
+
+The MMU interval seqlock for a userptr gvma is used in the following
+way:
+
+.. code-block:: C
+
+ down_read(&gvm->lock);
+
+ retry:
+
+ // Note: mmu_interval_read_begin() blocks until there is no
+ // invalidation notifier running anymore.
+ seq = mmu_interval_read_begin(&gvma->userptr_interval);
+ if (seq != gvma->saved_seq) {
+ obtain_new_page_pointers(&gvma);
+ dma_resv_lock(&gvm->resv);
+ put_gvma_on_revalidate_list(&gvma, &gvm);
+ dma_resv_unlock(&gvm->resv);
+ gvma->saved_seq = seq;
+ }
+
+ // The usual revalidation goes here.
+
+ // Final userptr sequence validation may not happen before the
+ // submission dma_fence is added to the gvm's resv, from the POV
+ // of the MMU invalidation notifier. Hence the
+ // userptr_notifier_lock that will make them appear atomic.
+
+ add_dependencies(&gpu_job, &gvm->resv);
+ down_read(&gvm->userptr_notifier_lock);
+ if (mmu_interval_read_retry(&gvma->userptr_interval, gvma->saved_seq)) {
+ up_read(&gvm->userptr_notifier_lock);
+ goto retry;
+ }
+
+ job_dma_fence = gpu_submit(&gpu_job);
+
+ add_dma_fence(job_dma_fence, &gvm->resv);
+
+ for_each_shared_obj(gvm, &obj)
+ add_dma_fence(job_dma_fence, &obj->resv);
+
+ dma_resv_unlock_all_resv_locks();
+ up_read(&gvm->userptr_notifier_lock);
+ up_read(&gvm->lock);
+
+The code between ``mmu_interval_read_begin()`` and the
+``mmu_interval_read_retry()`` marks the read-side critical section of
+what we call the ``userptr_seqlock``. In reality the gvm's userptr
+gvma list is looped through, and the check is done for *all* of its
+userptr gvmas, although we only show a single one here.
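+
+Schematically, with an illustrative iterator name, that final check
+over all userptr gvmas would be:
+
+.. code-block:: C
+
+        down_read(&gvm->userptr_notifier_lock);
+        for_each_userptr_gvma(gvm, &gvma)
+                if (mmu_interval_read_retry(&gvma->userptr_interval,
+                                            gvma->saved_seq)) {
+                        up_read(&gvm->userptr_notifier_lock);
+                        goto retry;
+                }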
+
+The userptr gvma MMU invalidation notifier might be called from
+reclaim context and, again to avoid locking order violations, we can't
+take any dma_resv lock nor the gvm->lock from within it.
+
+.. code-block:: C
+
+ bool gvma_userptr_invalidate(userptr_interval, cur_seq)
+ {
+ // Make sure the exec function either sees the new sequence
+ // and backs off or we wait for the dma-fence:
+
+ down_write(&gvm->userptr_notifier_lock);
+ mmu_interval_set_seq(userptr_interval, cur_seq);
+ up_write(&gvm->userptr_notifier_lock);
+
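+ // Wait for all fences in the gvm's dma_resv: with the BOOKKEEP
+ // usage this includes a job fence a concurrent exec may just have
+ // published after its final sequence check.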
+ dma_resv_wait_timeout(&gvm->resv, DMA_RESV_USAGE_BOOKKEEP,
+ false, MAX_SCHEDULE_TIMEOUT);
+ return true;
+ }
+
+When this invalidation notifier returns, the GPU can no longer be
+accessing the old pages of the userptr gvma and needs to redo the page-binding
+before a new GPU submission can succeed.
+
+Optimizing gvma iteration
+-------------------------
+
+Iterating through all of a gvm's userptr gvmas to check their
+validity on each exec function may be very costly. There is a scheme
+to avoid this and only iterate through the userptr gvmas that
+actually saw an invalidation notifier call since the last exec.
+
+TODO: describe that scheme here. It's implemented in the xe driver.
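+
+Until that description lands, here is a minimal sketch of one
+possible such scheme. It is an illustrative assumption and may differ
+from the xe implementation; the spinlock ``gvm->userptr_inval_lock``
+(which must be safe to take from the notifier) and the list
+``gvm->userptr_inval_list`` are hypothetical names:
+
+.. code-block:: C
+
+        // In the invalidation notifier, with the
+        // userptr_notifier_lock held in write mode:
+        spin_lock(&gvm->userptr_inval_lock);
+        list_move_tail(&gvma->userptr_inval_link, &gvm->userptr_inval_list);
+        spin_unlock(&gvm->userptr_inval_lock);
+
+        // In the exec function, only the gvmas invalidated since the
+        // last exec need their pages re-obtained:
+        LIST_HEAD(invalidated);
+
+        spin_lock(&gvm->userptr_inval_lock);
+        list_splice_init(&gvm->userptr_inval_list, &invalidated);
+        spin_unlock(&gvm->userptr_inval_lock);
+
+        // Re-obtain page pointers for the gvmas on the local list, as
+        // in the userptr exec sequence above.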
+
+Locking for page-table updates at bind- and unbind time
+=======================================================
+
+TODO.
+
+Recoverable page-fault implications
+===================================
+
+TODO.
--
2.40.1



2023-08-04 22:33:39

by Rodrigo Vivi

Subject: Re: [Intel-xe] [RFC PATCH] Documentation/gpu: Draft VM_BIND locking document

On Fri, Jun 30, 2023 at 06:44:52PM +0200, Thomas Hellström wrote:
> Add the first version of the VM_BIND locking document which is
> intended to be part of the xe driver upstreaming agreement.
>
> The document describes and discusses the locking used during exec
> functions, eviction and for userptr gvmas. The intention is to use the
> same nomenclature as drm-vm-bind-async.rst but, to keep naming a
> little shorter, to use gvm and gvma instead of the gpu_vm and gpu_vma
> used in that document, with the intention to also modify that
> document accordingly.

I preferred the gpu_vm and gpu_vma as written in the async doc.
Much easier to read imho.

>
> Signed-off-by: Thomas Hellström <[email protected]>
> ---
> Documentation/gpu/drm-vm-bind-locking.rst | 339 ++++++++++++++++++++++
> 1 file changed, 339 insertions(+)
> create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst
>
> diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst
> new file mode 100644
> index 000000000000..f5d1a40a2906
> --- /dev/null
> +++ b/Documentation/gpu/drm-vm-bind-locking.rst
> @@ -0,0 +1,339 @@
> +===============
> +VM_BIND locking
> +===============
> +
> +This document attempts to describe what's needed to get VM_BIND locking right,
> +including the userptr mmu_notifier locking and it will also discuss some
> +optimizations to get rid of the looping through of all userptr mappings and
> +external / shared object mappings that is needed in the simplest
> +implementation. It will also discuss some implications for faulting gvms.
> +
> +Nomenclature
> +============
> +
> +* ``Context``: GPU execution context.
> +* ``gvm``: Abstraction of a GPU address space with meta-data. Typically
> + one per client (DRM file-private), or one per context.
> +* ``gvma``: Abstraction of a GPU address range within a gvma with

within a gpu_vm you meant?

> + associated meta-data. The backing storage of a gvma can either be
> + a gem buffer object or anonymous pages mapped also into the CPU
> + address space for the process.
> +* ``userptr gvma or just userptr``: A gvma, the backing store of
> + which is anonymous pages as described above.
> +* ``revalidating``: Revalidating a gvma means making the latest version
> + of the backing store resident and making sure the gvma's
> + page-table entries point to that backing store.
> +* ``dma_fence``: A struct dma_fence that is similar to a struct completion
> + and which tracks GPU activity. When the GPU activity is finished,
> + the dma_fence signals.
> +* ``dma_resv``: A struct dma_resv (AKA reservation object) that is used
> + to track GPU activity in the form of multiple dma_fences on a
> + gvm or a gem buffer object. The dma_resv contains an array / list
> + of dma_fences and a lock that needs to be held when adding
> + additional dma_fences to the dma_resv. The lock is of a type that
> + allows deadlock-safe locking of multiple dma_resvs in arbitrary order.
> +* ``exec function``: An exec function is a function that revalidates all
> + affected gvmas, submits a GPU command batch and registers the
> + dma_fence representing the GPU command's activity with all affected
> + dma_resvs. For completeness, although not covered by this document,
> + it's worth mentioning that an exec function may also be the
> + revalidation worker that is used by some drivers in compute /
> + long-running mode.
> +* ``local object``: A GEM object which is local to a gvm. Shared gem
> + objects also share the gvm's dma_resv.
> +* ``shared object``: AKA external object: A GEM object which may be shared
> + by multiple gvms and whose backing storage may be shared with
> + other drivers.
> +
> +
> +Introducing the locks
> +=====================
> +
> +One of the benefits of VM_BIND is that local GEM objects share the gvm's
> +dma_resv object and hence the dma_resv lock. So even with a huge
> +number of local GEM objects, only one lock is needed to make the exec
> +sequence atomic.
> +
> +The following locks and locking orders are used:
> +
> +* The ``gvm->lock`` (optionally an rwsem). Protects how the gvm is
> + partitioned into gvmas, protects the gvm's list of external objects,
> + and can also with some simplification protect the gvm's list of
> + userptr gvmas. With the CPU mm analogy this would correspond to the
> + mmap_lock.
> +* The ``userptr_seqlock``. This lock is taken in read mode for each
> + userptr gvma on the gvm's userptr list, and in write mode during mmu
> + notifier invalidation.

is this something that exists within the mmu_notifier or a new lock
when handling the notifier?

> +* The ``gvm->resv`` lock. Protects the gvm's list of gvmas needing
> + rebinding, and also the residency of all the gvm's local GEM object.
> +* The ``gvm->userptr_notifier_lock``. This is an rwsem that is taken in read
> + mode during exec and write mode during a mmu notifier invalidation. In
> + the absence of a separate page-table lock, this lock can serve
> + together with the gvm's dma_resv lock as a page-table lock. More on
> + this below. The userptr notifier lock is per gvm.

is the userptr_seqlock also per gpu_vm?
And what's the difference between this and the other?

> +* The ``gvm->page_table_lock``. Protects the gvm's page-table updates. For
> + simplicity the gvm's dma_resv lock can be reused as page-table lock.
> +
> +There are certain optimizations described below that require
> +additional locks. More on that later.
> +
> +.. code-block:: C
> +
> + dma_resv_lock(&gvm->resv);
> +
> + for_each_gvma_on_revalidate_list(gvm, &gvma) {
> + revalidate_gvma(&gvma);
> + remove_from_revalidate_list(&gvma);
> + }
> +
> + add_dependencies(&gpu_job, &gvm->resv);
> + job_dma_fence = gpu_submit(&gpu_job));
> +
> + add_dma_fence(job_dma_fence, &gvm->resv);
> + dma_resv_unlock(&gvm->resv);
> +
> +Eviction of one of these local objects will then be something like the
> +following:
> +
> +.. code-block:: C
> +
> + obj = get_object_from_lru();
> +
> + dma_resv_lock(obj->resv);
> + for_each_gvma_of_obj(obj, &gvma);
> + put_gvma_on_revalidate_list(&gvma);
> +
> + add_dependencies(&eviction_job, &obj->resv);
> + job_dma_fence = gpu_submit(&eviction_job);
> + add_dma_fence(&obj->resv, job_dma_fence);
> +
> + dma_resv_unlock(&obj->resv);
> + put_object(obj);
> +
> +Note that since the object is local to the gvm, it will share the gvm's
> +``dma_resv`` lock so that ``obj->resv == gvm->resv``. Invalidated gvmas are put
> +on the gvm's revalidation list, which is protected by ``gvm->resv``, which
> +is always locked while evicting, due to the above equality.
> +
> +Does the gvma need to be unbound before eviction? For VM_BIND gvms
> +the answer is no. Since the eviction blit or copy will wait for GPU
> +idle, any attempt by the GPU to access freed memory through the
> +gvma will be preceded by a new exec function, which will
> +make sure the gvma is revalidated, that is not an issue.

The question opening the phrase made me think this was an open issue,
but it's more like an answer to a common question. Should we rephrase
that as an affirmative note?

> +
> +Introducing external (or shared) buffer objects
> +===============================================
> +
> +Since shared buffer objects may be shared by multiple gvm's they
> +can't share their reservation object with a single gvm, but will rather
> +have a reservation object of their own. The shared objects bound to a
> +gvm using one or many
> +gvmas are therefore typically put on a per-gvm list which is
> +protected by the gvm lock. One could in theory protect it also with
> +the ``gvm->resv``, but since the list of dma_resvs to take is typically
> +built before the ``gvm->resv`` is locked due to a limitation in
> +the current locking helpers, that is typically not done. Also see
> +below for userptr gvmas.
> +
> +At eviction time we now need to invalidate *all* gvmas of a shared
> +object, but we can no longer be certain that we hold the gvm's
> +dma_resv of all the object's gvmas. We can only be certain that we
> +hold the object's private dma_resv. We can trylock the dma_resvs for
> +the affected gvm's but that might be unnecessarily complex. If we
> +have a ww_acquire context at hand at eviction time we can also perform
> +sleeping locks of those dma_resvs but that could cause expensive
> +rollbacks. One option is to just mark the invalidated gvmas with a bool
> +which is inspected on the next exec function, when the gvm's
> +dma_resv and the object's dma_resv is held, and the invalidated
> +gvmas could then be put on the gvm's list of invalidated
> +gvmas. That bool would then, although being per-gvma formally be
> +protected by the object's dma_resv.
> +
> +The exec function would then look something like the following:
> +
> +.. code-block:: C
> +
> + read_lock(&gvm->lock);
> +
> + dma_resv_lock(&gvm->resv);
> +
> + // Shared object list is protected by the gvm->lock.
> + for_each_shared_obj(gvm, &obj) {
> + dma_resv_lock(&obj->resv);
> + move_marked_gvmas_to_revalidate_gvma_list(obj, &gvm);
> + }
> +
> + for_each_gvma_to_revalidate(gvm, &gvma) {
> + revalidate_gvma(&gvma);
> + remove_from_revalidate_list(&gvma);
> + }
> +
> + add_dependencies(&gpu_job, &gvm->resv);
> + job_dma_fence = gpu_submit(&gpu_job));
> +
> + add_dma_fence(job_dma_fence, &gvm->resv);
> + for_each_shared_obj(gvm, &obj)
> + add_dma_fence(job_dma_fence, &obj->resv);
> + dma_resv_unlock_all_resv_locks();
> +
> + read_unlock(&gvm->lock);
> +
> +And the corresponding shared-object aware eviction would look like:
> +
> +.. code-block:: C
> +
> + obj = get_object_from_lru();
> +
> + dma_resv_lock(obj->resv);
> + for_each_gvma_of_obj(obj, &gvma);
> + if (object_is_vm_local(obj))
> + put_gvma_on_revalidate_list(&gvma, &gvm);
> + else
> + mark_gvma_for_revalidation(&gvma);
> +
> + add_dependencies(&eviction_job, &obj->resv);
> + job_dma_fence = gpu_submit(&eviction_job);
> + add_dma_fence(&obj->resv, job_dma_fence);
> +
> + dma_resv_unlock(&obj->resv);
> + put_object(obj);
> +
> +Yet another option is to put the gvmas to be invalidated on a separate
> +gvm list protected by a lower level lock that can be taken both at eviction
> +time and at transfer-to-revalidate list time. The details are not in
> +this document, but this for reference implemented in the Intel xe
> +driver.

is this part of what we need to rethink for the suspend/resume evictions?

> +
> +Introducing userptr gvmas
> +=========================
> +
> +A userptr gvma is a gvma that, instead of mapping a buffer object to a
> +GPU virtual address range, directly maps a CPU mm range of anonymous-
> +or file page-cache pages.
> +A very simple approach would be to just pin the pages using
> +pin_user_pages() at bind time and unpin them at unbind time, but this
> +creates a Denial-Of-Service vector since a single user-space process
> +would be able to pin down all of system memory, which is not
> +desirable. (For special use-cases and with proper accounting pinning might
> +still be a desirable feature, though). What we need to do in the general case is
> +to obtain a reference to the desired pages, make sure we are notified
> +using a MMU notifier just before the CPU mm unmaps the pages, dirty
> +them if they are not mapped read-only to the GPU, and then drop the reference.
> +When we are notified by the MMU notifier that CPU mm is about to drop the
> +pages, we need to stop GPU access to the pages,
> +GPU page-table and make sure that before the next time the GPU tries to access
> +whatever is now present in the CPU mm range, we unmap the old pages
> +from the GPU page tables and repeat the process of obtaining new page
> +references. Note that when the core mm decides to laundry pages, we get such
> +an unmap MMU notification and can mark the pages dirty again before the
> +next GPU access. We also get similar MMU notifications for NUMA accounting
> +which the GPU driver doesn't really need to care about, but so far
> +it's proven difficult to exclude certain notifications.
> +
> +Using a MMU notifier for device DMA (and other methods) is described in
> +`this document
> +<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_.
> +
> +Now the method of obtaining struct page references using
> +get_user_pages() unfortunately can't be used under a dma_resv lock
> +since that would violate the locking order of the dma_resv lock vs the
> +mmap_lock that is grabbed when resolving a CPU pagefault. This means the gvm's
> +list of userptr gvmas needs to be protected by an outer lock, and this
> +is the first time we strictly need the gvm->lock. While it was
> +previously used also to protect the list of the gvm's shared objects,
> +we could in theory have used the gvm->resv for that.
> +
> +The MMU interval seqlock for a userptr gvma is used in the following
> +way:
> +
> +.. code-block:: C
> +
> + down_read(&gvm->lock);
> +
> + retry:
> +
> + // Note: mmu_interval_read_begin() blocks until there is no
> + // invalidation notifier running anymore.
> + seq = mmu_interval_read_begin(&gvma->userptr_interval);
> + if (seq != gvma->saved_seq) {
> + obtain_new_page_pointers(&gvma);
> + dma_resv_lock(&gvm->resv);
> + put_gvma_on_revalidate_list(&gvma, &gvm);
> + dma_resv_unlock(&gvm->resv);
> + gvma->saved_seq = seq;
> + }
> +
> + // The usual revalidation goes here.
> +
> + // Final userptr sequence validation may not happen before the
> + // submission dma_fence is added to the gvm's resv, from the POW
> + // of the MMU invalidation notifier. Hence the
> + // userptr_notifier_lock that will make them appear atomic.
> +
> + add_dependencies(&gpu_job, &gvm->resv);
> + down_read(&gvm->userptr_notifier_lock);
> + if (mmu_interval_read_retry(&gvma->userptr_interval, gvma->saved_seq)) {
> + up_read(&gvm->userptr_notifier_lock);
> + goto retry;
> + }
> +
> + job_dma_fence = gpu_submit(&gpu_job));
> +
> + add_dma_fence(job_dma_fence, &gvm->resv);
> +
> + for_each_shared_obj(gvm, &obj)
> + add_dma_fence(job_dma_fence, &obj->resv);
> +
> + dma_resv_unlock_all_resv_locks();
> + up_read(&gvm->userptr_notifier_lock);
> + up_read(&gvm->lock);
> +
> +The code between ``mmu_interval_read_begin()`` and the
> +``mmu_interval_read_retry()`` marks the read side critical section of
> +what we call the ``userptr_seqlock``. In reality the gvm's userptr
> +gvma list is looped through, and the check is done for *all* of its
> +userptr gvmas, although we only show a single one here.
> +
> +The userptr gvma MMU invalidation notifier might be called from
> +reclaim context and, again to avoid locking order violations, we can't
> +take any dma_resv lock nor the gvm->lock from within it.
> +
> +.. code-block:: C
> +
> + bool gvma_userptr_invalidate(userptr_interval, cur_seq)
> + {
> + // Make sure the exec function either sees the new sequence
> + // and backs off or we wait for the dma-fence:
> +
> + down_write(&gvm->userptr_notifier_lock);
> + mmu_interval_set_seq(userptr_interval, cur_seq);
> + up_write(&gvm->userptr_notifier_lock);
> +
> + dma_resv_wait_timeout(&gvm->resv, DMA_RESV_USAGE_BOOKKEEP,
> + false, MAX_SCHEDULE_TIMEOUT);
> + return true;
> + }
> +
> +When this invalidation notifier returns, the GPU can no longer be
> +accessing the old pages of the userptr gvma and needs to redo the page-binding
> +before a new GPU submission can succeed.
> +
> +Optimizing gvma iteration
> +-------------------------
> +
> +Iterating through all of a gvm's userptr gvmas to check the validity
> +on each exec function may be very costly. There is a scheme to avoid
> +this and only iterate through the userptr gvmas that actually saw an
> +invalidation notifier call since the last exec. T
> +
> +TODO: describe that scheme here. It's implemented in the xe driver.
> +
> +Locking for page-table updates at bind- and unbind time
> +=======================================================
> +
> +TODO.
> +
> +Recoverable page-fault implications
> +===================================
> +
> +TODO.
> --
> 2.40.1
>

2023-08-11 14:02:22

by Thomas Hellström

Subject: Re: [Intel-xe] [RFC PATCH] Documentation/gpu: Draft VM_BIND locking document

Thanks for reviewing, Rodrigo!


On 8/4/23 22:15, Rodrigo Vivi wrote:

> On Fri, Jun 30, 2023 at 06:44:52PM +0200, Thomas Hellström wrote:
>> Add the first version of the VM_BIND locking document which is
>> intended to be part of the xe driver upstreaming agreement.
>>
>> The document describes and discusses the locking used during exec
>> functions, eviction and for userptr gvmas. The intention is to use the
>> same nomenclature as drm-vm-bind-async.rst but, to keep naming a
>> little shorter, to use gvm and gvma instead of the gpu_vm and gpu_vma
>> used in that document, with the intention to also modify that
>> document accordingly.
> I preferred the gpu_vm and gpu_vma as written in the async doc.
> Much easier to read imho.

OK. I'll keep that naming then.


>
>> Signed-off-by: Thomas Hellström <[email protected]>
>> ---
>> Documentation/gpu/drm-vm-bind-locking.rst | 339 ++++++++++++++++++++++
>> 1 file changed, 339 insertions(+)
>> create mode 100644 Documentation/gpu/drm-vm-bind-locking.rst
>>
>> diff --git a/Documentation/gpu/drm-vm-bind-locking.rst b/Documentation/gpu/drm-vm-bind-locking.rst
>> new file mode 100644
>> index 000000000000..f5d1a40a2906
>> --- /dev/null
>> +++ b/Documentation/gpu/drm-vm-bind-locking.rst
>> @@ -0,0 +1,339 @@
>> +===============
>> +VM_BIND locking
>> +===============
>> +
>> +This document attempts to describe what's needed to get VM_BIND locking right,
>> +including the userptr mmu_notifier locking and it will also discuss some
>> +optimizations to get rid of the looping through of all userptr mappings and
>> +external / shared object mappings that is needed in the simplest
>> +implementation. It will also discuss some implications for faulting gvms.
>> +
>> +Nomenclature
>> +============
>> +
>> +* ``Context``: GPU execution context.
>> +* ``gvm``: Abstraction of a GPU address space with meta-data. Typically
>> + one per client (DRM file-private), or one per context.
>> +* ``gvma``: Abstraction of a GPU address range within a gvma with
> within a gpu_vm you meant?
Yes.
>
>> + associated meta-data. The backing storage of a gvma can either be
>> + a gem buffer object or anonymous pages mapped also into the CPU
>> + address space for the process.
>> +* ``userptr gvma or just userptr``: A gvma, the backing store of
>> + which is anonymous pages as described above.
>> +* ``revalidating``: Revalidating a gvma means making the latest version
>> + of the backing store resident and making sure the gvma's
>> + page-table entries point to that backing store.
>> +* ``dma_fence``: A struct dma_fence that is similar to a struct completion
>> + and which tracks GPU activity. When the GPU activity is finished,
>> + the dma_fence signals.
>> +* ``dma_resv``: A struct dma_resv (AKA reservation object) that is used
>> + to track GPU activity in the form of multiple dma_fences on a
>> + gvm or a gem buffer object. The dma_resv contains an array / list
>> + of dma_fences and a lock that needs to be held when adding
>> + additional dma_fences to the dma_resv. The lock is of a type that
>> + allows deadlock-safe locking of multiple dma_resvs in arbitrary order.
>> +* ``exec function``: An exec function is a function that revalidates all
>> + affected gvmas, submits a GPU command batch and registers the
>> + dma_fence representing the GPU command's activity with all affected
>> + dma_resvs. For completeness, although not covered by this document,
>> + it's worth mentioning that an exec function may also be the
>> + revalidation worker that is used by some drivers in compute /
>> + long-running mode.
>> +* ``local object``: A GEM object which is local to a gvm. Shared gem
>> + objects also share the gvm's dma_resv.
>> +* ``shared object``: AKA external object: A GEM object which may be shared
>> + by multiple gvms and whose backing storage may be shared with
>> + other drivers.
>> +
>> +
>> +Introducing the locks
>> +=====================
>> +
>> +One of the benefits of VM_BIND is that local GEM objects share the gvm's
>> +dma_resv object and hence the dma_resv lock. So even with a huge
>> +number of local GEM objects, only one lock is needed to make the exec
>> +sequence atomic.
>> +
>> +The following locks and locking orders are used:
>> +
>> +* The ``gvm->lock`` (optionally an rwsem). Protects how the gvm is
>> + partitioned into gvmas, protects the gvm's list of external objects,
>> + and can also with some simplification protect the gvm's list of
>> + userptr gvmas. With the CPU mm analogy this would correspond to the
>> + mmap_lock.
>> +* The ``userptr_seqlock``. This lock is taken in read mode for each
>> + userptr gvma on the gvm's userptr list, and in write mode during mmu
>> + notifier invalidation.
> is this something that exists within the mmu_notifier or a new lock
> when handling the notifier?

No, it's the MMU interval notifier functionality that acts like a
seqlock: the read side waits until there are no writers (invalidators)
before entering the critical section, and reruns the read-side
critical section if a writer sneaked in before it left that section.
I'll add a pointer to that documentation.


>
>> +* The ``gvm->resv`` lock. Protects the gvm's list of gvmas needing
>> + rebinding, and also the residency of all the gvm's local GEM object.
>> +* The ``gvm->userptr_notifier_lock``. This is an rwsem that is taken in read
>> + mode during exec and write mode during a mmu notifier invalidation. In
>> + the absence of a separate page-table lock, this lock can serve
>> + together with the gvm's dma_resv lock as a page-table lock. More on
>> + this below. The userptr notifier lock is per gvm.
> is the userptr_seqlock also per gpu_vm?
> And what's the difference between this and the other?
This lock ensures atomicity when checking the seqlock sequence and
submitting a batch job / publishing a dma-fence, so that either the
read-side reruns or the write-side waits for the published fence. This
is also more generically described in the MMU notifier documentation.
Will update and clarify. The userptr_seqlock is per userptr vma.

>
>> +* The ``gvm->page_table_lock``. Protects the gvm's page-table updates. For
>> + simplicity the gvm's dma_resv lock can be reused as page-table lock.
>> +
>> +There are certain optimizations described below that require
>> +additional locks. More on that later.
>> +
>> +.. code-block:: C
>> +
>> + dma_resv_lock(&gvm->resv);
>> +
>> + for_each_gvma_on_revalidate_list(gvm, &gvma) {
>> + revalidate_gvma(&gvma);
>> + remove_from_revalidate_list(&gvma);
>> + }
>> +
>> + add_dependencies(&gpu_job, &gvm->resv);
>> + job_dma_fence = gpu_submit(&gpu_job));
>> +
>> + add_dma_fence(job_dma_fence, &gvm->resv);
>> + dma_resv_unlock(&gvm->resv);
>> +
>> +Eviction of one of these local objects will then be something like the
>> +following:
>> +
>> +.. code-block:: C
>> +
>> + obj = get_object_from_lru();
>> +
>> + dma_resv_lock(obj->resv);
>> + for_each_gvma_of_obj(obj, &gvma);
>> + put_gvma_on_revalidate_list(&gvma);
>> +
>> + add_dependencies(&eviction_job, &obj->resv);
>> + job_dma_fence = gpu_submit(&eviction_job);
>> + add_dma_fence(&obj->resv, job_dma_fence);
>> +
>> + dma_resv_unlock(&obj->resv);
>> + put_object(obj);
>> +
>> +Note that since the object is local to the gvm, it will share the gvm's
>> +``dma_resv`` lock so that ``obj->resv == gvm->resv``. Invalidated gvmas are put
>> +on the gvm's revalidation list, which is protected by ``gvm->resv``, which
>> +is always locked while evicting, due to the above equality.
>> +
>> +Does the gvma need to be unbound before eviction? For VM_BIND gvms
>> +the answer is no. Since the eviction blit or copy will wait for GPU
>> +idle, any attempt by the GPU to access freed memory through the
>> +gvma will be preceded by a new exec function, which will
>> +make sure the gvma is revalidated, that is not an issue.
> The question opening the phrase made me think this was an open issue,
> but it's more like an answer to a common question. Should we rephrase
> that as an affirmative note?
Sure.
>
>> +
>> +Introducing external (or shared) buffer objects
>> +===============================================
>> +
>> +Since shared buffer objects may be shared by multiple gvm's they
>> +can't share their reservation object with a single gvm, but will rather
>> +have a reservation object of their own. The shared objects bound to a
>> +gvm using one or many
>> +gvmas are therefore typically put on a per-gvm list which is
>> +protected by the gvm lock. One could in theory protect it also with
>> +the ``gvm->resv``, but since the list of dma_resvs to take is typically
>> +built before the ``gvm->resv`` is locked due to a limitation in
>> +the current locking helpers, that is typically not done. Also see
>> +below for userptr gvmas.
>> +
>> +At eviction time we now need to invalidate *all* gvmas of a shared
>> +object, but we can no longer be certain that we hold the gvm's
>> +dma_resv of all the object's gvmas. We can only be certain that we
>> +hold the object's private dma_resv. We can trylock the dma_resvs for
>> +the affected gvm's but that might be unnecessarily complex. If we
>> +have a ww_acquire context at hand at eviction time we can also perform
>> +sleeping locks of those dma_resvs but that could cause expensive
>> +rollbacks. One option is to just mark the invalidated gvmas with a bool
>> +which is inspected on the next exec function, when the gvm's
>> +dma_resv and the object's dma_resv is held, and the invalidated
>> +gvmas could then be put on the gvm's list of invalidated
>> +gvmas. That bool would then, although being per-gvma formally be
>> +protected by the object's dma_resv.
>> +
>> +The exec function would then look something like the following:
>> +
>> +.. code-block:: C
>> +
>> + read_lock(&gvm->lock);
>> +
>> + dma_resv_lock(&gvm->resv);
>> +
>> + // Shared object list is protected by the gvm->lock.
>> + for_each_shared_obj(gvm, &obj) {
>> + dma_resv_lock(&obj->resv);
>> + move_marked_gvmas_to_revalidate_gvma_list(obj, &gvm);
>> + }
>> +
>> + for_each_gvma_to_revalidate(gvm, &gvma) {
>> + revalidate_gvma(&gvma);
>> + remove_from_revalidate_list(&gvma);
>> + }
>> +
>> + add_dependencies(&gpu_job, &gvm->resv);
>> + job_dma_fence = gpu_submit(&gpu_job));
>> +
>> + add_dma_fence(job_dma_fence, &gvm->resv);
>> + for_each_shared_obj(gvm, &obj)
>> + add_dma_fence(job_dma_fence, &obj->resv);
>> + dma_resv_unlock_all_resv_locks();
>> +
>> + read_unlock(&gvm->lock);
>> +
>> +And the corresponding shared-object aware eviction would look like:
>> +
>> +.. code-block:: C
>> +
>> + obj = get_object_from_lru();
>> +
>> + dma_resv_lock(obj->resv);
>> + for_each_gvma_of_obj(obj, &gvma);
>> + if (object_is_vm_local(obj))
>> + put_gvma_on_revalidate_list(&gvma, &gvm);
>> + else
>> + mark_gvma_for_revalidation(&gvma);
>> +
>> + add_dependencies(&eviction_job, &obj->resv);
>> + job_dma_fence = gpu_submit(&eviction_job);
>> + add_dma_fence(&obj->resv, job_dma_fence);
>> +
>> + dma_resv_unlock(&obj->resv);
>> + put_object(obj);
>> +
>> +Yet another option is to put the gvmas to be invalidated on a separate
>> +gvm list protected by a lower level lock that can be taken both at eviction
>> +time and at transfer-to-revalidate list time. The details are not in
>> +this document, but this for reference implemented in the Intel xe
>> +driver.
> is this part of what we need to rethink for the suspend/resume evictions?

No, it's not related.


>
>> +
>> +Introducing userptr gvmas
>> +=========================
>> +
>> +A userptr gvma is a gvma that, instead of mapping a buffer object to a
>> +GPU virtual address range, directly maps a CPU mm range of anonymous-
>> +or file page-cache pages.
>> +A very simple approach would be to just pin the pages using
>> +pin_user_pages() at bind time and unpin them at unbind time, but this
>> +creates a Denial-Of-Service vector since a single user-space process
>> +would be able to pin down all of system memory, which is not
>> +desirable. (For special use-cases and with proper accounting pinning might
>> +still be a desirable feature, though). What we need to do in the general case is
>> +to obtain a reference to the desired pages, make sure we are notified
>> +using a MMU notifier just before the CPU mm unmaps the pages, dirty
>> +them if they are not mapped read-only to the GPU, and then drop the reference.
>> +When we are notified by the MMU notifier that CPU mm is about to drop the
>> +pages, we need to stop GPU access to the pages,
>> +GPU page-table and make sure that before the next time the GPU tries to access
>> +whatever is now present in the CPU mm range, we unmap the old pages
>> +from the GPU page tables and repeat the process of obtaining new page
>> +references. Note that when the core mm decides to laundry pages, we get such
>> +an unmap MMU notification and can mark the pages dirty again before the
>> +next GPU access. We also get similar MMU notifications for NUMA accounting
>> +which the GPU driver doesn't really need to care about, but so far
>> +it's proven difficult to exclude certain notifications.
>> +
>> +Using a MMU notifier for device DMA (and other methods) is described in
>> +`this document
>> +<https://docs.kernel.org/core-api/pin_user_pages.html#case-3-mmu-notifier-registration-with-or-without-page-faulting-hardware>`_.
>> +
>> +Now the method of obtaining struct page references using
>> +get_user_pages() unfortunately can't be used under a dma_resv lock
>> +since that would violate the locking order of the dma_resv lock vs the
>> +mmap_lock that is grabbed when resolving a CPU pagefault. This means the gvm's
>> +list of userptr gvmas needs to be protected by an outer lock, and this
>> +is the first time we strictly need the gvm->lock. While it was
>> +previously used also to protect the list of the gvm's shared objects,
>> +we could in theory have used the gvm->resv for that.
>> +
>> +The MMU interval seqlock for a userptr gvma is used in the following
>> +way:
>> +
>> +.. code-block:: C
>> +
>> + down_read(&gvm->lock);
>> +
>> + retry:
>> +
>> + // Note: mmu_interval_read_begin() blocks until there is no
>> + // invalidation notifier running anymore.
>> + seq = mmu_interval_read_begin(&gvma->userptr_interval);
>> + if (seq != gvma->saved_seq) {
>> + obtain_new_page_pointers(&gvma);
>> + dma_resv_lock(&gvm->resv);
>> + put_gvma_on_revalidate_list(&gvma, &gvm);
>> + dma_resv_unlock(&gvm->resv);
>> + gvma->saved_seq = seq;
>> + }
>> +
>> + // The usual revalidation goes here.
>> +
>> + // Final userptr sequence validation may not happen before the
>> + // submission dma_fence is added to the gvm's resv, from the POW
>> + // of the MMU invalidation notifier. Hence the
>> + // userptr_notifier_lock that will make them appear atomic.
>> +
>> + add_dependencies(&gpu_job, &gvm->resv);
>> + down_read(&gvm->userptr_notifier_lock);
>> + if (mmu_interval_read_retry(&gvma->userptr_interval, gvma->saved_seq)) {
>> + up_read(&gvm->userptr_notifier_lock);
>> + goto retry;
>> + }
>> +
>> + job_dma_fence = gpu_submit(&gpu_job));
>> +
>> + add_dma_fence(job_dma_fence, &gvm->resv);
>> +
>> + for_each_shared_obj(gvm, &obj)
>> + add_dma_fence(job_dma_fence, &obj->resv);
>> +
>> + dma_resv_unlock_all_resv_locks();
>> + up_read(&gvm->userptr_notifier_lock);
>> + up_read(&gvm->lock);
>> +
>> +The code between ``mmu_interval_read_begin()`` and the
>> +``mmu_interval_read_retry()`` marks the read side critical section of
>> +what we call the ``userptr_seqlock``. In reality the gvm's userptr
>> +gvma list is looped through, and the check is done for *all* of its
>> +userptr gvmas, although we only show a single one here.
>> +
>> +The userptr gvma MMU invalidation notifier might be called from
>> +reclaim context and, again to avoid locking order violations, we can't
>> +take any dma_resv lock nor the gvm->lock from within it.
>> +
>> +.. code-block:: C
>> +
>> + bool gvma_userptr_invalidate(userptr_interval, cur_seq)
>> + {
>> + // Make sure the exec function either sees the new sequence
>> + // and backs off or we wait for the dma-fence:
>> +
>> + down_write(&gvm->userptr_notifier_lock);
>> + mmu_interval_set_seq(userptr_interval, cur_seq);
>> + up_write(&gvm->userptr_notifier_lock);
>> +
>> + dma_resv_wait_timeout(&gvm->resv, DMA_RESV_USAGE_BOOKKEEP,
>> + false, MAX_SCHEDULE_TIMEOUT);
>> + return true;
>> + }
>> +
>> +When this invalidation notifier returns, the GPU can no longer be
>> +accessing the old pages of the userptr gvma and needs to redo the page-binding
>> +before a new GPU submission can succeed.
>> +
>> +Optimizing gvma iteration
>> +-------------------------
>> +
>> +Iterating through all of a gvm's userptr gvmas to check the validity
>> +on each exec function may be very costly. There is a scheme to avoid
>> +this and only iterate through the userptr gvmas that actually saw an
>> +invalidation notifier call since the last exec. T
>> +
>> +TODO: describe that scheme here. It's implemented in the xe driver.
>> +
>> +Locking for page-table updates at bind- and unbind time
>> +=======================================================
>> +
>> +TODO.
>> +
>> +Recoverable page-fault implications
>> +===================================
>> +
>> +TODO.
>> --
>> 2.40.1
>>