2022-11-09 19:31:05

by Janusz Krzysztofik

[permalink] [raw]
Subject: [PATCH 0/3] Fix timeout handling when retiring requests

Fixes for issues discovered via code review while working on
https://gitlab.freedesktop.org/drm/intel/issues/7349.

Janusz Krzysztofik (3):
drm/i915: Fix timeout handling when retiring requests
drm/i915: Fix unintended submission flush after retire times out
drm/i915: Fix 0 return value from DMA fence wait on i915 requests

drivers/gpu/drm/i915/gt/intel_gt_requests.c | 19 +++++++++++++++----
drivers/gpu/drm/i915/i915_request.c | 2 +-
2 files changed, 16 insertions(+), 5 deletions(-)

--
2.25.1



2022-11-09 20:03:24

by Janusz Krzysztofik

[permalink] [raw]
Subject: [PATCH 3/3] drm/i915: Fix 0 return value from DMA fence wait on i915 requests

According to the docs, dma_fence_wait_timeout() returns "0 if the wait
timed out," ... "Other error values may be returned on custom
implementations." While it is not quite clear if a custom implementation
is allowed to return "other error" instead of 0, it is rather clear that 0
return value should always mean that the wait timed out before the fence
got signaled.

i915 implementation of dma_fence_ops.wait() used with request fences,
which is a transparent wrapper around i915_request_wait_timeout(), returns
-ETIME if the wait has timed out -- that may be considered as acceptable.
However, it can return 0 in a rare case when the fence has been found
signaled right after no more wait time was left, and that's not compatible
with expectations of dma-fence and its users.

Since other users of i915_request_wait_timeout() may interpret 0 return
value as success, don't touch it, update the i915_fence_wait() wrapper
instead. Return 1 instead of 0, but keep -ETIME in case of timeout since
some i915 users of dma_fence_wait_timeout() may expect it.

Signed-off-by: Janusz Krzysztofik <[email protected]>
---
drivers/gpu/drm/i915/i915_request.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_request.c b/drivers/gpu/drm/i915/i915_request.c
index f949a9495758a..451456ab1ddef 100644
--- a/drivers/gpu/drm/i915/i915_request.c
+++ b/drivers/gpu/drm/i915/i915_request.c
@@ -102,7 +102,7 @@ static signed long i915_fence_wait(struct dma_fence *fence,
{
return i915_request_wait_timeout(to_request(fence),
interruptible | I915_WAIT_PRIORITY,
- timeout);
+ timeout) ?: 1;
}

struct kmem_cache *i915_request_slab_cache(void)
--
2.25.1


2022-11-09 20:06:07

by Janusz Krzysztofik

[permalink] [raw]
Subject: [PATCH 2/3] drm/i915: Fix unintended submission flush after retire times out

If wait on request DMA fence times out while we are retiring requests,
-ETIME is stored as remaining time. Then, flush_submission() called
thereafter proceeds with its work instead of returning immediately due to
the value of timeout passed to it not equal 0. That's probably not what
was intended.

Fix it by replacing -ETIME value of the argument with 0.

Fixes: 09137e945437 ("drm/i915/gem: Unpin idle contexts from kswapd reclaim")
Signed-off-by: Janusz Krzysztofik <[email protected]>
---
drivers/gpu/drm/i915/gt/intel_gt_requests.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
index 6c3b8ac3055c3..309d5937d6910 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
@@ -204,7 +204,7 @@ out_active: spin_lock(&timelines->lock);
list_for_each_entry_safe(tl, tn, &free, link)
__intel_timeline_free(&tl->kref);

- if (flush_submission(gt, timeout)) /* Wait, there's more! */
+ if (flush_submission(gt, timeout > 0)) /* Wait, there's more! */
active_count++;

if (remaining_timeout)
--
2.25.1


2022-11-09 20:19:10

by Janusz Krzysztofik

[permalink] [raw]
Subject: [PATCH 1/3] drm/i915: Fix timeout handling when retiring requests

I believe that intel_gt_retire_requests_timeout() should return either
-ETIME if all time designated by timeout argument has been consumed while
waiting for fences being signaled, or remaining time if there are requests
still not retired, or 0 otherwise. In the latter case, remaining time
should be passed back via remaining_timeout argument.

Remaining time is updated with return value of each consecutive call to
dma_fence_wait_timeout(). If an error code is returned instead of
remaining time, a few potentially unexpected side effects occur:
- we no longer wait for consecutive timelines' last request fences being
signaled before we try to retire requests from those timelines -- while
expected in case of -ETIME, that's probably not intended in case of
other errors that dma_fence_wait_timeout() can return,
- the error code (a negative value) is passed back as remaining time and
if no more requests happen to be left pending despite the error, a user
may pass that value forward as a remaining timeout -- that can
potentially trigger a WARN or BUG,
- potentially unexpected error code is returned to user when a
non-critical error that probably shouldn't stop the user from retrying
occurs while active requests are still pending.
Moreover, should dma_fence_wait_timeout() ever return 0 (which should mean
timeout expiration) while we are processing requests and there are still
pending requests when we are about to return, that 0 value is returned to
user like if all requests were successfully retired.

Ignore error codes from dma_fence_wait_timeout() other than -ETIME and
don't overwrite remaining time with those error codes. Also, convert 0
value returned by dma_fence_wait_timeout() to -ETIME.

Fixes: f33a8a51602c ("drm/i915: Merge wait_for_timelines with retire_request")
Signed-off-by: Janusz Krzysztofik <[email protected]>
Cc: [email protected] # v5.5+
---
drivers/gpu/drm/i915/gt/intel_gt_requests.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_requests.c b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
index edb881d756309..6c3b8ac3055c3 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_requests.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_requests.c
@@ -156,11 +156,22 @@ long intel_gt_retire_requests_timeout(struct intel_gt *gt, long timeout,

fence = i915_active_fence_get(&tl->last_request);
if (fence) {
+ signed long time_left;
+
mutex_unlock(&tl->mutex);

- timeout = dma_fence_wait_timeout(fence,
- true,
- timeout);
+ time_left = dma_fence_wait_timeout(fence,
+ true,
+ timeout);
+ /*
+ * 0 or -ETIME: timeout expired
+ * other errors: ignore, assume no time consumed
+ */
+ if (time_left == -ETIME || time_left == 0)
+ timeout = -ETIME;
+ else if (time_left > 0)
+ timeout = time_left;
+
dma_fence_put(fence);

/* Retirement is best effort */
--
2.25.1


2022-11-10 11:43:01

by Janusz Krzysztofik

[permalink] [raw]
Subject: Re: [Intel-gfx] [PATCH 0/3] Fix timeout handling when retiring requests

On Wednesday, 9 November 2022 20:09:34 CET Janusz Krzysztofik wrote:
> Fixes for issues discovered via code review while working on
> https://gitlab.freedesktop.org/drm/intel/issues/7349.
>
> Janusz Krzysztofik (3):
> drm/i915: Fix timeout handling when retiring requests
> drm/i915: Fix unintended submission flush after retire times out
> drm/i915: Fix 0 return value from DMA fence wait on i915 requests

Based on some comments received, I'm going to rework this series and resubmit.
Please ignore this one.

Thanks,
Janusz

>
> drivers/gpu/drm/i915/gt/intel_gt_requests.c | 19 +++++++++++++++----
> drivers/gpu/drm/i915/i915_request.c | 2 +-
> 2 files changed, 16 insertions(+), 5 deletions(-)
>
>