2015-11-12 20:36:36

by Jens Axboe

Subject: __i915_spin_request() sucks

Hi,

So a few months ago I got an XPS13 laptop, the one with the high res
screen. GUI performance was never really that great, I attributed it to
coming from a more powerful laptop, and the i915 driving a lot of
pixels. But yesterday I browsed from my wife's macbook, and was blown
away. Wow, scrolling in chrome SUCKS on the xps13. Not just scrolling,
basically anything in chrome. Molasses. So I got sick of it, fired up a
quick perf record, did a bunch of stuff in chrome. No super smoking
guns, but one thing did stick out - the path leading to
__i915_spin_request().

So today, I figured I'd try just killing that spin. If it fails, we'll
punt to normal completions, so easy change. And wow, MASSIVE difference.
I can now scroll in chrome and not rage! It's like the laptop is 10x
faster now.

Ran git blame, and found:

commit 2def4ad99befa25775dd2f714fdd4d92faec6e34
Author: Chris Wilson <[email protected]>
Date: Tue Apr 7 16:20:41 2015 +0100

drm/i915: Optimistically spin for the request completion

and read the commit message. Doesn't sound that impressive. Especially
not for something that screws up interactive performance by a LOT.

What's the deal? Revert?

--
Jens Axboe


2015-11-12 20:40:38

by Jens Axboe

Subject: Re: __i915_spin_request() sucks

On 11/12/2015 01:36 PM, Jens Axboe wrote:
> Hi,
>
> So a few months ago I got an XPS13 laptop, the one with the high res
> screen. GUI performance was never really that great, I attributed it to
> coming from a more powerful laptop, and the i915 driving a lot of
> pixels. But yesterday I browsed from my wife's macbook, and was blown
> away. Wow, scrolling in chrome SUCKS on the xps13. Not just scrolling,
> basically anything in chrome. Molasses. So I got sick of it, fired up a
> quick perf record, did a bunch of stuff in chrome. No super smoking
> guns, but one thing did stick out - the path leading to
> __i915_spin_request().
>
> So today, I figured I'd try just killing that spin. If it fails, we'll
> punt to normal completions, so easy change. And wow, MASSIVE difference.
> I can now scroll in chrome and not rage! It's like the laptop is 10x
> faster now.
>
> Ran git blame, and found:
>
> commit 2def4ad99befa25775dd2f714fdd4d92faec6e34
> Author: Chris Wilson <[email protected]>
> Date: Tue Apr 7 16:20:41 2015 +0100
>
> drm/i915: Optimistically spin for the request completion
>
> and read the commit message. Doesn't sound that impressive. Especially
> not for something that screws up interactive performance by a LOT.
>
> What's the deal? Revert?

BTW, this:

"Limit the spinning to a single jiffie (~1us) at most"

is totally wrong. I have HZ=100 on my laptop. That's 10ms. 10ms! Even if
I had HZ=1000, that'd still be 1ms of spinning. That's seriously screwed
up, guys.
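For reference, the arithmetic here can be checked with a tiny userspace sketch (plain C, not kernel code; `jiffy_us` is a made-up helper for illustration only):

```c
#include <assert.h>

/* Illustrative helper (not a kernel API): the length of one jiffy in
 * microseconds for a given HZ. At HZ=100 a "single jiffie" budget is
 * 10,000us (10ms); even at HZ=1000 it is still 1,000us (1ms) - three
 * orders of magnitude above the ~1us the commit message claims. */
static unsigned long jiffy_us(unsigned long hz)
{
        return 1000000UL / hz;
}
```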

--
Jens Axboe

2015-11-12 22:19:20

by Chris Wilson

Subject: Re: __i915_spin_request() sucks

On Thu, Nov 12, 2015 at 01:40:33PM -0700, Jens Axboe wrote:
> On 11/12/2015 01:36 PM, Jens Axboe wrote:
> >Hi,
> >
> >So a few months ago I got an XPS13 laptop, the one with the high res
> >screen. GUI performance was never really that great, I attributed it to
> >coming from a more powerful laptop, and the i915 driving a lot of
> >pixels. But yesterday I browsed from my wife's macbook, and was blown
> >away. Wow, scrolling in chrome SUCKS on the xps13. Not just scrolling,
> >basically anything in chrome. Molasses. So I got sick of it, fired up a
> >quick perf record, did a bunch of stuff in chrome. No super smoking
> >guns, but one thing did stick out - the path leading to
> >__i915_spin_request().

Normally the smoking gun is pointing at the messenger.

> >So today, I figured I'd try just killing that spin. If it fails, we'll
> >punt to normal completions, so easy change. And wow, MASSIVE difference.
> >I can now scroll in chrome and not rage! It's like the laptop is 10x
> >faster now.
> >
> >Ran git blame, and found:
> >
> >commit 2def4ad99befa25775dd2f714fdd4d92faec6e34
> >Author: Chris Wilson <[email protected]>
> >Date: Tue Apr 7 16:20:41 2015 +0100
> >
> > drm/i915: Optimistically spin for the request completion
> >
> >and read the commit message. Doesn't sound that impressive. Especially
> >not for something that screws up interactive performance by a LOT.
> >
> >What's the deal? Revert?

The tests that it improved the most were the latency sensitive tests and
since my Broadwell xps13 behaves itself, I'd like to understand how it
culminates in an interactivity loss.

1. Maybe it is the uninterruptible nature of the polling, making X's
SIGIO jerky:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 19e8f5442cf8..8099c2a9f88e 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1146,7 +1146,7 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
return test_bit(ring->id, &dev_priv->gpu_error.missed_irq_rings);
}

-static int __i915_spin_request(struct drm_i915_gem_request *req)
+static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
{
unsigned long timeout;

@@ -1161,6 +1161,9 @@ static int __i915_spin_request(struct drm_i915_gem_request *req)
if (time_after_eq(jiffies, timeout))
break;

+ if (signal_pending_state(state, current))
+ break;
+
cpu_relax_lowlatency();
}
if (i915_gem_request_completed(req, false))
@@ -1197,6 +1200,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
struct drm_i915_private *dev_priv = dev->dev_private;
const bool irq_test_in_progress =
ACCESS_ONCE(dev_priv->gpu_error.test_irq_rings) & intel_ring_flag(ring);
+ int state = interruptible ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE;
DEFINE_WAIT(wait);
unsigned long timeout_expire;
s64 before, now;
@@ -1221,7 +1225,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
before = ktime_get_raw_ns();

/* Optimistic spin for the next jiffie before touching IRQs */
- ret = __i915_spin_request(req);
+ ret = __i915_spin_request(req, state);
if (ret == 0)
goto out;

@@ -1233,8 +1237,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
for (;;) {
struct timer_list timer;

- prepare_to_wait(&ring->irq_queue, &wait,
- interruptible ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE);
+ prepare_to_wait(&ring->irq_queue, &wait, state);

/* We need to check whether any gpu reset happened in between
* the caller grabbing the seqno and now ... */
@@ -1252,7 +1255,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
break;
}

- if (interruptible && signal_pending(current)) {
+ if (signal_pending_state(state, current)) {
ret = -ERESTARTSYS;
break;
}

2. Or maybe it is increased mutex contention:

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index a275b0478200..1e52a7444e0c 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2973,9 +2973,11 @@ void __i915_add_request(struct drm_i915_gem_request *req,
__i915_add_request(req, NULL, false)
int __i915_wait_request(struct drm_i915_gem_request *req,
unsigned reset_counter,
- bool interruptible,
+ unsigned flags,
s64 *timeout,
struct intel_rps_client *rps);
+#define WAIT_INTERRUPTIBLE 0x1
+#define WAIT_UNLOCKED 0x2
int __must_check i915_wait_request(struct drm_i915_gem_request *req);
int i915_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
int __must_check
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 8099c2a9f88e..ce17d42f1c62 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1191,7 +1191,7 @@ static int __i915_spin_request(struct drm_i915_gem_request *req, int state)
*/
int __i915_wait_request(struct drm_i915_gem_request *req,
unsigned reset_counter,
- bool interruptible,
+ unsigned flags,
s64 *timeout,
struct intel_rps_client *rps)
{
@@ -1200,7 +1200,7 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
struct drm_i915_private *dev_priv = dev->dev_private;
const bool irq_test_in_progress =
ACCESS_ONCE(dev_priv->gpu_error.test_irq_rings) & intel_ring_flag(ring);
- int state = interruptible ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE;
+ int state = flags & WAIT_INTERRUPTIBLE ? TASK_INTERRUPTIBLE : TASK_UNINTERRUPTIBLE;
DEFINE_WAIT(wait);
unsigned long timeout_expire;
s64 before, now;
@@ -1225,9 +1225,11 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
before = ktime_get_raw_ns();

/* Optimistic spin for the next jiffie before touching IRQs */
- ret = __i915_spin_request(req, state);
- if (ret == 0)
- goto out;
+ if (flags & WAIT_UNLOCKED) {
+ ret = __i915_spin_request(req, state);
+ if (ret == 0)
+ goto out;
+ }

if (!irq_test_in_progress && WARN_ON(!ring->irq_get(ring))) {
ret = -ENODEV;
@@ -1244,7 +1246,8 @@ int __i915_wait_request(struct drm_i915_gem_request *req,
if (reset_counter != atomic_read(&dev_priv->gpu_error.reset_counter)) {
/* ... but upgrade the -EAGAIN to an -EIO if the gpu
* is truely gone. */
- ret = i915_gem_check_wedge(&dev_priv->gpu_error, interruptible);
+ ret = i915_gem_check_wedge(&dev_priv->gpu_error,
+ flags & WAIT_INTERRUPTIBLE);
if (ret == 0)
ret = -EAGAIN;
break;
@@ -1532,7 +1535,7 @@ i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,

mutex_unlock(&dev->struct_mutex);
for (i = 0; ret == 0 && i < n; i++)
- ret = __i915_wait_request(requests[i], reset_counter, true,
+ ret = __i915_wait_request(requests[i], reset_counter, 0x3,
NULL, rps);
mutex_lock(&dev->struct_mutex);

@@ -3067,7 +3070,7 @@ i915_gem_wait_ioctl(struct drm_device *dev, void *data, struct drm_file *file)

for (i = 0; i < n; i++) {
if (ret == 0)
- ret = __i915_wait_request(req[i], reset_counter, true,
+ ret = __i915_wait_request(req[i], reset_counter, 0x3,
args->timeout_ns > 0 ? &args->timeout_ns : NULL,
file->driver_priv);
i915_gem_request_unreference__unlocked(req[i]);
@@ -4043,7 +4046,7 @@ i915_gem_ring_throttle(struct drm_device *dev, struct drm_file *file)
if (target == NULL)
return 0;

- ret = __i915_wait_request(target, reset_counter, true, NULL, NULL);
+ ret = __i915_wait_request(target, reset_counter, 0x3, NULL, NULL);
if (ret == 0)
queue_delayed_work(dev_priv->wq, &dev_priv->mm.retire_work, 0);


Or maybe it is an indirect effect, such as power balancing between the
CPU and GPU, or just thermal throttling, or it may be the task being
penalised for consuming its timeslice (for which any completion polling
seems susceptible).

> BTW, this:
>
> "Limit the spinning to a single jiffie (~1us) at most"
>
> is totally wrong. I have HZ=100 on my laptop. That's 10ms. 10ms!
> Even if I had HZ=1000, that'd still be 1ms of spinning. That's
> seriously screwed up, guys.

That's over and above the termination condition for blk_poll().
-Chris

--
Chris Wilson, Intel Open Source Technology Centre

2015-11-12 22:52:08

by Jens Axboe

Subject: Re: __i915_spin_request() sucks

On 11/12/2015 03:19 PM, Chris Wilson wrote:
>>> So today, I figured I'd try just killing that spin. If it fails, we'll
>>> punt to normal completions, so easy change. And wow, MASSIVE difference.
>>> I can now scroll in chrome and not rage! It's like the laptop is 10x
>>> faster now.
>>>
>>> Ran git blame, and found:
>>>
>>> commit 2def4ad99befa25775dd2f714fdd4d92faec6e34
>>> Author: Chris Wilson <[email protected]>
>>> Date: Tue Apr 7 16:20:41 2015 +0100
>>>
>>> drm/i915: Optimistically spin for the request completion
>>>
>>> and read the commit message. Doesn't sound that impressive. Especially
>>> not for something that screws up interactive performance by a LOT.
>>>
>>> What's the deal? Revert?
>
> The tests that it improved the most were the latency sensitive tests and
> since my Broadwell xps13 behaves itself, I'd like to understand how it
> culminates in an interactivity loss.
>
> 1. Maybe it is the uninterruptible nature of the polling, making X's
> SIGIO jerky:

This one still feels bad.

> 2. Or maybe it is increased mutex contention:

And so does this one... I had to manually apply hunks 2-3, and after
doing seat-of-the-pants testing for both variants, I confirmed with perf
that we're still seeing a ton of time in __i915_wait_request() for both
of them.

> Or maybe it is an indirect effect, such as power balancing between the
> CPU and GPU, or just thermal throttling, or it may be the task being
> penalised for consuming its timeslice (for which any completion polling
> seems susceptible).

Look, polls in the 1-10ms range are just insane. Either you botched the
commit message and really meant "~1ms at most" and in which case I'd
suspect you of smoking something good, or you hacked it up wrong and
used jiffies when you really wanted to be using some other time check
that really did give you 1us.

I'll take an IRQ over 10 msecs of busy looping on my laptop, thanks.

>> "Limit the spinning to a single jiffie (~1us) at most"
>>
>> is totally wrong. I have HZ=100 on my laptop. That's 10ms. 10ms!
>> Even if I had HZ=1000, that'd still be 1ms of spinning. That's
>> seriously screwed up, guys.
>
> That's over and above the termination condition for blk_poll().

?! And this is related, how? Comparing apples and oranges. One is a test
opt-in feature for experimentation, the other is unconditionally enabled
for everyone. I believe the commit even says so. See the difference?
Would I use busy loop spinning waiting for rotating storage completions,
which are in the 1-10ms range? No, with the reason being that the
potential wins for spins are in the usec range.

--
Jens Axboe

2015-11-12 22:59:21

by Jens Axboe

Subject: Re: __i915_spin_request() sucks

On 11/12/2015 03:52 PM, Jens Axboe wrote:
> On 11/12/2015 03:19 PM, Chris Wilson wrote:
>>>> So today, I figured I'd try just killing that spin. If it fails, we'll
>>>> punt to normal completions, so easy change. And wow, MASSIVE
>>>> difference.
>>>> I can now scroll in chrome and not rage! It's like the laptop is 10x
>>>> faster now.
>>>>
>>>> Ran git blame, and found:
>>>>
>>>> commit 2def4ad99befa25775dd2f714fdd4d92faec6e34
>>>> Author: Chris Wilson <[email protected]>
>>>> Date: Tue Apr 7 16:20:41 2015 +0100
>>>>
>>>> drm/i915: Optimistically spin for the request completion
>>>>
>>>> and read the commit message. Doesn't sound that impressive. Especially
>>>> not for something that screws up interactive performance by a LOT.
>>>>
>>>> What's the deal? Revert?
>>
>> The tests that it improved the most were the latency sensitive tests and
>> since my Broadwell xps13 behaves itself, I'd like to understand how it
>> culminates in an interactivity loss.
>>
>> 1. Maybe it is the uninterruptible nature of the polling, making X's
>> SIGIO jerky:
>
> This one still feels bad.
>
>> 2. Or maybe it is increased mutex contention:
>
> And so does this one... I had to manually apply hunks 2-3, and after
> doing seat-of-the-pants testing for both variants, I confirmed with perf
> that we're still seeing a ton of time in __i915_wait_request() for both
> of them.

I don't see how #2 could make any difference, you're passing in 0x3 hard
coded for most call sites, so we poll. The ones that don't, pass a bool
(?!).

I should note that with the basic patch of just never spinning, I don't
see __i915_wait_request() in the profiles. At all.

--
Jens Axboe

2015-11-13 09:16:18

by Chris Wilson

Subject: Re: __i915_spin_request() sucks

On Thu, Nov 12, 2015 at 03:52:02PM -0700, Jens Axboe wrote:
> On 11/12/2015 03:19 PM, Chris Wilson wrote:
> >>>So today, I figured I'd try just killing that spin. If it fails, we'll
> >>>punt to normal completions, so easy change. And wow, MASSIVE difference.
> >>>I can now scroll in chrome and not rage! It's like the laptop is 10x
> >>>faster now.
> >>>
> >>>Ran git blame, and found:
> >>>
> >>>commit 2def4ad99befa25775dd2f714fdd4d92faec6e34
> >>>Author: Chris Wilson <[email protected]>
> >>>Date: Tue Apr 7 16:20:41 2015 +0100
> >>>
> >>> drm/i915: Optimistically spin for the request completion
> >>>
> >>>and read the commit message. Doesn't sound that impressive. Especially
> >>>not for something that screws up interactive performance by a LOT.
> >>>
> >>>What's the deal? Revert?
> >
> >The tests that it improved the most were the latency sensitive tests and
> >since my Broadwell xps13 behaves itself, I'd like to understand how it
> >culminates in an interactivity loss.
> >
> >1. Maybe it is the uninterruptible nature of the polling, making X's
> >SIGIO jerky:
>
> This one still feels bad.
>
> >2. Or maybe it is increased mutex contention:
>
> And so does this one... I had to manually apply hunks 2-3, and after
> doing seat-of-the-pants testing for both variants, I confirmed with
> perf that we're still seeing a ton of time in __i915_wait_request()
> for both of them.
>
> >Or maybe it is an indirect effect, such as power balancing between the
> >CPU and GPU, or just thermal throttling, or it may be the task being
> >penalised for consuming its timeslice (for which any completion polling
> >seems susceptible).
>
> Look, polls in the 1-10ms range are just insane. Either you botched
> the commit message and really meant "~1ms at most" and in which case
> I'd suspect you of smoking something good, or you hacked it up wrong
> and used jiffies when you really wanted to be using some other time
> check that really did give you 1us.

What other time check? I was under the impression that setting up an
hrtimer was expensive and that jiffies was readily available.

> I'll take an IRQ over 10 msecs of busy looping on my laptop, thanks.
>
> >>"Limit the spinning to a single jiffie (~1us) at most"
> >>
> >>is totally wrong. I have HZ=100 on my laptop. That's 10ms. 10ms!
> >>Even if I had HZ=1000, that'd still be 1ms of spinning. That's
> >>seriously screwed up, guys.
> >
> >That's over and above the termination condition for blk_poll().
>
> ?! And this is related, how? Comparing apples and oranges. One is a
> test opt-in feature for experimentation, the other is
> unconditionally enabled for everyone. I believe the commit even says
> so. See the difference? Would I use busy loop spinning waiting for
> rotating storage completions, which are in the 1-10ms range? No,
> with the reason being that the potential wins for spins are in the
> usec range.

Equally I expect the service interval for a batch to be around 2-20us.
There are many workloads that execute 30-50k requests/s, and you can
appreciate that they are sensitive to the latency in setting up an irq
and receiving it - just as equally leaving that irq enabled saturates a
CPU with simply executing the irq handler. So what mechanism do you use
to guard against either a very long queue depth or an abnormal request
causing msec+ spins?
-Chris

--
Chris Wilson, Intel Open Source Technology Centre

2015-11-13 15:12:10

by Jens Axboe

Subject: Re: __i915_spin_request() sucks

On 11/13/2015 02:15 AM, Chris Wilson wrote:
> On Thu, Nov 12, 2015 at 03:52:02PM -0700, Jens Axboe wrote:
>> On 11/12/2015 03:19 PM, Chris Wilson wrote:
>>>>> So today, I figured I'd try just killing that spin. If it fails, we'll
>>>>> punt to normal completions, so easy change. And wow, MASSIVE difference.
>>>>> I can now scroll in chrome and not rage! It's like the laptop is 10x
>>>>> faster now.
>>>>>
>>>>> Ran git blame, and found:
>>>>>
>>>>> commit 2def4ad99befa25775dd2f714fdd4d92faec6e34
>>>>> Author: Chris Wilson <[email protected]>
>>>>> Date: Tue Apr 7 16:20:41 2015 +0100
>>>>>
>>>>> drm/i915: Optimistically spin for the request completion
>>>>>
>>>>> and read the commit message. Doesn't sound that impressive. Especially
>>>>> not for something that screws up interactive performance by a LOT.
>>>>>
>>>>> What's the deal? Revert?
>>>
>>> The tests that it improved the most were the latency sensitive tests and
>>> since my Broadwell xps13 behaves itself, I'd like to understand how it
>>> culminates in an interactivity loss.
>>>
>>> 1. Maybe it is the uninterruptible nature of the polling, making X's
>>> SIGIO jerky:
>>
>> This one still feels bad.
>>
>>> 2. Or maybe it is increased mutex contention:
>>
>> And so does this one... I had to manually apply hunks 2-3, and after
>> doing seat-of-the-pants testing for both variants, I confirmed with
>> perf that we're still seeing a ton of time in __i915_wait_request()
>> for both of them.
>>
>>> Or maybe it is an indirect effect, such as power balancing between the
>>> CPU and GPU, or just thermal throttling, or it may be the task being
>>> penalised for consuming its timeslice (for which any completion polling
>>> seems susceptible).
>>
>> Look, polls in the 1-10ms range are just insane. Either you botched
>> the commit message and really meant "~1ms at most" and in which case
>> I'd suspect you of smoking something good, or you hacked it up wrong
>> and used jiffies when you really wanted to be using some other time
>> check that really did give you 1us.
>
> What other time check? I was under the impression of setting up a
> hrtimer was expensive and jiffies was available.

Looping for 10ms is a lot more expensive :-). jiffies is always there,
but it's WAY too coarse to be used for something like this.

You could use ktime_get(), there's a lot of helpers for checking
time_after, adding msecs, etc. Something like the below, not tested here
yet.

>> I'll take an IRQ over 10 msecs of busy looping on my laptop, thanks.
>>
>>>> "Limit the spinning to a single jiffie (~1us) at most"
>>>>
>>>> is totally wrong. I have HZ=100 on my laptop. That's 10ms. 10ms!
>>>> Even if I had HZ=1000, that'd still be 1ms of spinning. That's
>>>> seriously screwed up, guys.
>>>
>>> That's over and above the termination condition for blk_poll().
>>
>> ?! And this is related, how? Comparing apples and oranges. One is a
>> test opt-in feature for experimentation, the other is
>> unconditionally enabled for everyone. I believe the commit even says
>> so. See the difference? Would I use busy loop spinning waiting for
>> rotating storage completions, which are in the 1-10ms range? No,
>> with the reason being that the potential wins for spins are in the
>> usec range.
>
> Equally I expect the service interval for a batch to be around 2-20us.
> There are many workloads that execute 30-50k requests/s, and you can
> appreciate that they are sensitive to the latency in setting up an irq
> and receiving it - just as equally leaving that irq enabled saturates a
> CPU with simply executing the irq handler. So what mechanism do you use
> to guard against either a very long queue depth or an abnormal request
> causing msec+ spins?

Not disputing that polling can work, but it needs to be a bit more
clever. Do you know which requests are fast and which ones are not?
Could you track it? Should we make this a module option?

20usec is too long to poll. If we look at the wins of polling, we're
talking anywhere from 1-2 usec to maybe 5 usec, depending on different
factors. So spinning between 1-3 usec should be a hard limit on most
platforms. And it's somewhat of a policy decision, since it does involve
throwing CPU at the problem. There's a crossover point where below it's
always a win, but that needs a lot more work than just optimistic
spinning for everything. You also do need a check for whether the task
has been woken up, that's also missing.

As for interrupt mitigation, I'd consider that a separate problem. It's
a lot simpler than the app induced polling that i915 is doing here. So
if overloading a core with IRQs is an issue, I'd solve that differently
similarly to NAPI or blk-iopoll (not to be confused with blk_poll() that
you referenced and is app induced polling).

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 5cf4a1998273..658514e899b1 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1148,17 +1148,19 @@ static bool missed_irq(struct drm_i915_private *dev_priv,

static int __i915_spin_request(struct drm_i915_gem_request *req)
{
- unsigned long timeout;
+ ktime_t start, end;

if (i915_gem_request_get_ring(req)->irq_refcount)
return -EBUSY;

- timeout = jiffies + 1;
+ start = ktime_get();
+ end.tv64 = start.tv64;
+ ktime_add_us(end, 1);
while (!need_resched()) {
if (i915_gem_request_completed(req, true))
return 0;

- if (time_after_eq(jiffies, timeout))
+ if (ktime_after(start, end))
break;

cpu_relax_lowlatency();

--
Jens Axboe

2015-11-13 15:36:47

by Jens Axboe

Subject: Re: __i915_spin_request() sucks

Previous patch was obvious pre-coffee crap, this should work for using
ktime to spin max 1usec.

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 5cf4a1998273..21192e55c33c 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1148,17 +1148,17 @@ static bool missed_irq(struct drm_i915_private *dev_priv,

static int __i915_spin_request(struct drm_i915_gem_request *req)
{
- unsigned long timeout;
+ ktime_t timeout;

if (i915_gem_request_get_ring(req)->irq_refcount)
return -EBUSY;

- timeout = jiffies + 1;
+ timeout = ktime_add_us(ktime_get(), 1);
while (!need_resched()) {
if (i915_gem_request_completed(req, true))
return 0;

- if (time_after_eq(jiffies, timeout))
+ if (ktime_after(ktime_get(), timeout))
break;

cpu_relax_lowlatency();

--
Jens Axboe

2015-11-13 16:13:50

by Mike Galbraith

Subject: Re: __i915_spin_request() sucks

On Fri, 2015-11-13 at 08:36 -0700, Jens Axboe wrote:
> Previous patch was obvious pre-coffee crap, this should work for using
> ktime to spin max 1usec.

1us seems a tad low. I doubt the little wooden gears and pulleys of my
core2 Toshiba Satellite lappy can get one loop ground out in a usec :)

>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 5cf4a1998273..21192e55c33c 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1148,17 +1148,18 @@ static bool missed_irq(struct drm_i915_private *dev_priv,
>
> static int __i915_spin_request(struct drm_i915_gem_request *req)
> {
> - unsigned long timeout;
> + ktime_t timeout;
>
> if (i915_gem_request_get_ring(req)->irq_refcount)
> return -EBUSY;
>
> - timeout = jiffies + 1;
> + timeout = ktime_add_us(ktime_get(), 1);
> while (!need_resched()) {
> if (i915_gem_request_completed(req, true))
> return 0;
>
> - if (time_after_eq(jiffies, timeout))
> + if (ktime_after(ktime_get(), timeout))
> break;
>
> cpu_relax_lowlatency();
>

2015-11-13 16:22:57

by Jens Axboe

Subject: Re: __i915_spin_request() sucks

On 11/13/2015 09:13 AM, Mike Galbraith wrote:
> On Fri, 2015-11-13 at 08:36 -0700, Jens Axboe wrote:
>> Previous patch was obvious pre-coffee crap, this should work for using
>> ktime to spin max 1usec.
>
> 1us seems a tad low. I doubt the little wooden gears and pulleys of my
> core2 Toshiba Satellite lappy can get one loop ground out in a usec :)

Maybe it is, it's based off the original intent of the function, though.
See the original commit referenced.

--
Jens Axboe

2015-11-13 22:12:29

by Chris Wilson

Subject: Re: __i915_spin_request() sucks

On Fri, Nov 13, 2015 at 09:22:52AM -0700, Jens Axboe wrote:
> On 11/13/2015 09:13 AM, Mike Galbraith wrote:
> >On Fri, 2015-11-13 at 08:36 -0700, Jens Axboe wrote:
> >>Previous patch was obvious pre-coffee crap, this should work for using
> >>ktime to spin max 1usec.
> >
> >1us seems a tad low. I doubt the little wooden gears and pulleys of my
> >core2 Toshiba Satellite lappy can get one loop ground out in a usec :)
>
> Maybe it is, it's based off the original intent of the function,
> though. See the original commit referenced.

I've been looking at numbers from one laptop and I can set the timeout
at 2us before we see a steep decline in what is more or less synchronous
request handling (which affects a variety of rendering workloads).

Looking around, other busy loops seem to use local_clock() (i.e. rdtscll
with a fair wind). Is that worth using here?
-Chris

--
Chris Wilson, Intel Open Source Technology Centre

2015-11-13 22:16:36

by Jens Axboe

Subject: Re: __i915_spin_request() sucks

On 11/13/2015 03:12 PM, Chris Wilson wrote:
> On Fri, Nov 13, 2015 at 09:22:52AM -0700, Jens Axboe wrote:
>> On 11/13/2015 09:13 AM, Mike Galbraith wrote:
>>> On Fri, 2015-11-13 at 08:36 -0700, Jens Axboe wrote:
>>>> Previous patch was obvious pre-coffee crap, this should work for using
>>>> ktime to spin max 1usec.
>>>
>>> 1us seems a tad low. I doubt the little wooden gears and pulleys of my
>>> core2 Toshiba Satellite lappy can get one loop ground out in a usec :)
>>
>> Maybe it is, it's based off the original intent of the function,
>> though. See the original commit referenced.
>
> I've been looking at numbers from one laptop and I can set the timeout
> at 2us before we see a steep decline in what is more or less synchronous
> request handling (which affects a variety of rendering workloads).

Alright, at least that's a vast improvement from 10ms. If you send me
something tested, I can try it here.

> Looking around, other busy loops seem to use local_clock() (i.e. rdstcll
> with a fair wind). Is that worth using here?

Honestly, don't think it matters too much for this case. You'd have to
disable preempt to use local_clock(), fwiw. It is a faster variant
though, but the RT people might hate you for 2us preempt disables :-)

--
Jens Axboe