2011-03-17 09:48:47

by Knut Petersen

[permalink] [raw]
Subject: [BUG][2.6.38] IRQ Lock Inversion / i915 fails

Hi everybody!

Well, I updated my test partition to opensuse 11.4.

The opensuse standard kernel is unusable for me, booting hangs
most of the time, the same is true for shutdown. If booting succeeds the
system will fail after a short time of usage.

So I installed kernel 2.6.38. That helped a lot ... for a few minutes.

A possible irq lock inversion is detected during boot, after a short time
X fails, restarting the X server does not work, a reboot is necessary.

Attached find lspci output, kernel config and kernel debug messages.

cu,
Knut


Attachments:
kernelerrlog-2.6.38 (15.93 kB)
config-2.6.38 (64.12 kB)
lspci (25.00 B)
Download all attachments

2011-03-17 09:47:46

by Knut Petersen

[permalink] [raw]
Subject: Re: [BUG][2.6.38] IRQ Lock Inversion / i915 fails

Hi everyody!

lscpi output was missing ...

cu,
Knut


Attachments:
lspci (2.18 kB)

2011-03-17 16:56:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [BUG][2.6.38] IRQ Lock Inversion / i915 fails

On Thu, Mar 17, 2011 at 2:40 AM, Knut Petersen
<[email protected]> wrote:
>
> Well, I updated my test partition to opensuse 11.4.
>
> The opensuse standard kernel is unusable for me, booting hangs
> most of the time, the same is true for shutdown. If booting succeeds the
> system will fail after a short time of usage.
>
> So I installed kernel 2.6.38. ?That helped a lot ... for a few minutes.
>
> A possible irq lock inversion is detected during boot, after a short time
> X fails, restarting the X server does not work, a reboot is necessary.

Ok, so the lock inversion seems to be due to the sound/drivers/aloop.c
file, where the function "loopback_pos_update()" gets called from
within a softirq context. And it takes a lock (cable->lock) that is
also taken unprotected by loopback_trigger(). So that's liable to
deadlock as per lockdep. Jaroslav? Takashi?

The X problem seems to be something unrelated. You have those "GPU
hung" messages, along with i2c/EDID problems. But the actual oops is
at the very beginning of intel_release_load_detect_pipe(), here:

0: 55 push %ebp
1: 89 e5 mov %esp,%ebp
3: 57 push %edi
4: 89 cf mov %ecx,%edi
6: 56 push %esi
7: 53 push %ebx
8: 89 c3 mov %eax,%ebx
a: 83 ec 0c sub $0xc,%esp
d: 8b 00 mov (%eax),%eax
f: 8b 73 20 mov 0x20(%ebx),%esi
12: 80 7b 30 00 cmpb $0x0,0x30(%ebx)
16: 8b 4b 28 mov 0x28(%ebx),%ecx
19: 89 45 f0 mov %eax,-0x10(%ebp)
1c:* 8b 86 e0 01 00 00 mov 0x1e0(%esi),%eax <-- trapping
instruction
22: 89 45 ec mov %eax,-0x14(%ebp)
25: 74 2d je 0x54
27: c7 43 20 00 00 00 00 movl $0x0,0x20(%ebx)
2e: 89 f0 mov %esi,%eax

where %esi is NULL. I think that is the "crtc->helper_private" load,
and crtc is NULL.

That code does look broken. The very same function explicitly sets
crtc to NULL, so clearly it _can_ be NULL. That said, this is all old
code. I suspect the thing that made it start trigger may be commit
f5afcd3dd0dc ("drm/i915/crt: Check for a analog monitor in case of
DVI-I"), which is the only real change to the crt_detect logic I can
see.

Jesse? Chris? Ideas?

Linus

2011-03-17 17:15:14

by Takashi Iwai

[permalink] [raw]
Subject: Re: [BUG][2.6.38] IRQ Lock Inversion / i915 fails

At Thu, 17 Mar 2011 09:55:47 -0700,
Linus Torvalds wrote:
>
> On Thu, Mar 17, 2011 at 2:40 AM, Knut Petersen
> <[email protected]> wrote:
> >
> > Well, I updated my test partition to opensuse 11.4.
> >
> > The opensuse standard kernel is unusable for me, booting hangs
> > most of the time, the same is true for shutdown. If booting succeeds the
> > system will fail after a short time of usage.
> >
> > So I installed kernel 2.6.38.  That helped a lot ... for a few minutes.
> >
> > A possible irq lock inversion is detected during boot, after a short time
> > X fails, restarting the X server does not work, a reboot is necessary.
>
> Ok, so the lock inversion seems to be due to the sound/drivers/aloop.c
> file, where the function "loopback_pos_update()" gets called from
> within a softirq context. And it takes a lock (cable->lock) that is
> also taken unprotected by loopback_trigger(). So that's liable to
> deadlock as per lockdep. Jaroslav? Takashi?

The trigger callback should be called always in irq-disabled context,
so this should be OK. But loopback_pos_update() is called in the
timer callback, and this can be the issue.

Knut, how about the patch below?


thanks,

Takashi

---
diff --git a/sound/drivers/aloop.c b/sound/drivers/aloop.c
index 12b44b0..a0da775 100644
--- a/sound/drivers/aloop.c
+++ b/sound/drivers/aloop.c
@@ -482,8 +482,9 @@ static unsigned int loopback_pos_update(struct loopback_cable *cable)
cable->streams[SNDRV_PCM_STREAM_CAPTURE];
unsigned long delta_play = 0, delta_capt = 0;
unsigned int running;
+ unsigned long flags;

- spin_lock(&cable->lock);
+ spin_lock_irqsave(&cable->lock, flags);
running = cable->running ^ cable->pause;
if (running & (1 << SNDRV_PCM_STREAM_PLAYBACK)) {
delta_play = jiffies - dpcm_play->last_jiffies;
@@ -495,10 +496,8 @@ static unsigned int loopback_pos_update(struct loopback_cable *cable)
dpcm_capt->last_jiffies += delta_capt;
}

- if (delta_play == 0 && delta_capt == 0) {
- spin_unlock(&cable->lock);
- return running;
- }
+ if (delta_play == 0 && delta_capt == 0)
+ goto unlock;

if (delta_play > delta_capt) {
loopback_bytepos_update(dpcm_play, delta_play - delta_capt,
@@ -510,14 +509,14 @@ static unsigned int loopback_pos_update(struct loopback_cable *cable)
delta_capt = delta_play;
}

- if (delta_play == 0 && delta_capt == 0) {
- spin_unlock(&cable->lock);
- return running;
- }
+ if (delta_play == 0 && delta_capt == 0)
+ goto unlock;
+
/* note delta_capt == delta_play at this moment */
loopback_bytepos_update(dpcm_capt, delta_capt, BYTEPOS_UPDATE_COPY);
loopback_bytepos_update(dpcm_play, delta_play, BYTEPOS_UPDATE_POSONLY);
- spin_unlock(&cable->lock);
+ unlock:
+ spin_unlock_irqrestore(&cable->lock, flags);
return running;
}

2011-03-17 17:35:58

by Linus Torvalds

[permalink] [raw]
Subject: Re: [BUG][2.6.38] IRQ Lock Inversion / i915 fails

On Thu, Mar 17, 2011 at 10:15 AM, Takashi Iwai <[email protected]> wrote:
>
> The trigger callback should be called always in irq-disabled context,
> so this should be OK.

Oh, ok. I missed the snd_pcm_action_lock_irq() thing disabling
interrupts in that call sequence..

>?But loopback_pos_update() is called in the
> timer callback, and this can be the issue.

Hmm. If the timer callback is the only other case doing that, then
that should be ok.

In fact, now that I look at that lockdep thing, I'm confused. Where
does the hard irq come in at all for that lock? It seems to come from
self_group.lock, but I don't see why/how they nest.

That said, the sound locking is odd, I'm sure you see it. But a
commentary about how this came about in the changelog would be good.

Linus

2011-03-17 20:05:29

by Steven Rostedt

[permalink] [raw]
Subject: Re: [BUG][2.6.38] IRQ Lock Inversion / i915 fails

On Thu, Mar 17, 2011 at 10:33:04AM -0700, Linus Torvalds wrote:
> On Thu, Mar 17, 2011 at 10:15 AM, Takashi Iwai <[email protected]> wrote:
> >
> > The trigger callback should be called always in irq-disabled context,
> > so this should be OK.
>
> Oh, ok. I missed the snd_pcm_action_lock_irq() thing disabling
> interrupts in that call sequence..
>
> >?But loopback_pos_update() is called in the
> > timer callback, and this can be the issue.
>
> Hmm. If the timer callback is the only other case doing that, then
> that should be ok.
>
> In fact, now that I look at that lockdep thing, I'm confused. Where
> does the hard irq come in at all for that lock? It seems to come from
> self_group.lock, but I don't see why/how they nest.

Here's the issue. It's quite subtle.


CPU0 CPU1
---- ----
spin_lock(cable->lock);
spin_lock(group->lock);
spin_lock(cable->lock);
<blocked>
<interrupt>
spin_lock(group->lock);
<deadlock>

If any lock is taken while holding a lock that can be used in interrupt
context, then that lock must also be protected from interrupts as well,
even if that lock has nothing to do with interrupts.

Lockdep reported that the cable->lock was held while holding the
substream->self_group->lock, and it looks like that substream->self_group->lock
can also be taken in interrupt context.

-- Steve

>
> That said, the sound locking is odd, I'm sure you see it. But a
> commentary about how this came about in the changelog would be good.

2011-03-17 23:13:13

by Knut Petersen

[permalink] [raw]
Subject: Re: [BUG][2.6.38] IRQ Lock Inversion / i915 fails

Thanks a lot. That patch below does indeed
solve the irq lock inversion problem.

cu,
Knut

>
> The trigger callback should be called always in irq-disabled context,
> so this should be OK. But loopback_pos_update() is called in the
> timer callback, and this can be the issue.
>
> Knut, how about the patch below?
>
>
> thanks,
>
> Takashi
>
> ---
> diff --git a/sound/drivers/aloop.c b/sound/drivers/aloop.c
> index 12b44b0..a0da775 100644
> --- a/sound/drivers/aloop.c
> +++ b/sound/drivers/aloop.c
> @@ -482,8 +482,9 @@ static unsigned int loopback_pos_update(struct loopback_cable *cable)
> cable->streams[SNDRV_PCM_STREAM_CAPTURE];
> unsigned long delta_play = 0, delta_capt = 0;
> unsigned int running;
> + unsigned long flags;
>
> - spin_lock(&cable->lock);
> + spin_lock_irqsave(&cable->lock, flags);
> running = cable->running ^ cable->pause;
> if (running & (1 << SNDRV_PCM_STREAM_PLAYBACK)) {
> delta_play = jiffies - dpcm_play->last_jiffies;
> @@ -495,10 +496,8 @@ static unsigned int loopback_pos_update(struct loopback_cable *cable)
> dpcm_capt->last_jiffies += delta_capt;
> }
>
> - if (delta_play == 0 && delta_capt == 0) {
> - spin_unlock(&cable->lock);
> - return running;
> - }
> + if (delta_play == 0 && delta_capt == 0)
> + goto unlock;
>
> if (delta_play > delta_capt) {
> loopback_bytepos_update(dpcm_play, delta_play - delta_capt,
> @@ -510,14 +509,14 @@ static unsigned int loopback_pos_update(struct loopback_cable *cable)
> delta_capt = delta_play;
> }
>
> - if (delta_play == 0 && delta_capt == 0) {
> - spin_unlock(&cable->lock);
> - return running;
> - }
> + if (delta_play == 0 && delta_capt == 0)
> + goto unlock;
> +
> /* note delta_capt == delta_play at this moment */
> loopback_bytepos_update(dpcm_capt, delta_capt, BYTEPOS_UPDATE_COPY);
> loopback_bytepos_update(dpcm_play, delta_play, BYTEPOS_UPDATE_POSONLY);
> - spin_unlock(&cable->lock);
> + unlock:
> + spin_unlock_irqrestore(&cable->lock, flags);
> return running;
> }
>
>

2011-03-17 23:34:03

by Knut Petersen

[permalink] [raw]
Subject: Re: [BUG][2.6.38] IRQ Lock Inversion / i915 fails

Am 17.03.2011 17:55, schrieb Linus Torvalds:
>
> Ok, so the lock inversion seems to be due to the sound/drivers/aloop.c
> file, where the function "loopback_pos_update()" gets called from
> within a softirq context. And it takes a lock (cable->lock) that is
> also taken unprotected by loopback_trigger(). So that's liable to
> deadlock as per lockdep. Jaroslav? Takashi?
Takashi Iwai already gave a solution for the irq inversion problem.

> The X problem seems to be something unrelated. You have those "GPU
> hung" messages, along with i2c/EDID problems. But the actual oops is
> at the very beginning of intel_release_load_detect_pipe(), here:
>
> 0: 55 push %ebp
> 1: 89 e5 mov %esp,%ebp
> 3: 57 push %edi
> 4: 89 cf mov %ecx,%edi
> 6: 56 push %esi
> 7: 53 push %ebx
> 8: 89 c3 mov %eax,%ebx
> a: 83 ec 0c sub $0xc,%esp
> d: 8b 00 mov (%eax),%eax
> f: 8b 73 20 mov 0x20(%ebx),%esi
> 12: 80 7b 30 00 cmpb $0x0,0x30(%ebx)
> 16: 8b 4b 28 mov 0x28(%ebx),%ecx
> 19: 89 45 f0 mov %eax,-0x10(%ebp)
> 1c:* 8b 86 e0 01 00 00 mov 0x1e0(%esi),%eax <-- trapping
> instruction
> 22: 89 45 ec mov %eax,-0x14(%ebp)
> 25: 74 2d je 0x54
> 27: c7 43 20 00 00 00 00 movl $0x0,0x20(%ebx)
> 2e: 89 f0 mov %esi,%eax
>
> where %esi is NULL. I think that is the "crtc->helper_private" load,
> and crtc is NULL.
>
> That code does look broken. The very same function explicitly sets
> crtc to NULL, so clearly it _can_ be NULL. That said, this is all old
> code. I suspect the thing that made it start trigger may be commit
> f5afcd3dd0dc ("drm/i915/crt: Check for a analog monitor in case of
> DVI-I"), which is the only real change to the crt_detect logic I can
> see.
>
Well, I think there are two i915 problems that are independent.

The i2c/edid thing might be related to the f5a...commit, but reverting
that commit alone does not help. i2c/edid error messages might be before
or after the "gpu hung" problem, Xorg seems to be able to cope with that.
But after a gpu lockup it's time to reboot.

If nobody has a better idea I'll try to bisect ... tomorrow.

cu,
Knut

2011-03-18 06:55:47

by Takashi Iwai

[permalink] [raw]
Subject: Re: [BUG][2.6.38] IRQ Lock Inversion / i915 fails

At Fri, 18 Mar 2011 00:12:58 +0100,
Knut Petersen wrote:
>
> Thanks a lot. That patch below does indeed
> solve the irq lock inversion problem.

Thanks for testing. I applied the patch now.


Takashi

>
> cu,
> Knut
>
> >
> > The trigger callback should be called always in irq-disabled context,
> > so this should be OK. But loopback_pos_update() is called in the
> > timer callback, and this can be the issue.
> >
> > Knut, how about the patch below?
> >
> >
> > thanks,
> >
> > Takashi
> >
> > ---
> > diff --git a/sound/drivers/aloop.c b/sound/drivers/aloop.c
> > index 12b44b0..a0da775 100644
> > --- a/sound/drivers/aloop.c
> > +++ b/sound/drivers/aloop.c
> > @@ -482,8 +482,9 @@ static unsigned int loopback_pos_update(struct loopback_cable *cable)
> > cable->streams[SNDRV_PCM_STREAM_CAPTURE];
> > unsigned long delta_play = 0, delta_capt = 0;
> > unsigned int running;
> > + unsigned long flags;
> >
> > - spin_lock(&cable->lock);
> > + spin_lock_irqsave(&cable->lock, flags);
> > running = cable->running ^ cable->pause;
> > if (running & (1 << SNDRV_PCM_STREAM_PLAYBACK)) {
> > delta_play = jiffies - dpcm_play->last_jiffies;
> > @@ -495,10 +496,8 @@ static unsigned int loopback_pos_update(struct loopback_cable *cable)
> > dpcm_capt->last_jiffies += delta_capt;
> > }
> >
> > - if (delta_play == 0 && delta_capt == 0) {
> > - spin_unlock(&cable->lock);
> > - return running;
> > - }
> > + if (delta_play == 0 && delta_capt == 0)
> > + goto unlock;
> >
> > if (delta_play > delta_capt) {
> > loopback_bytepos_update(dpcm_play, delta_play - delta_capt,
> > @@ -510,14 +509,14 @@ static unsigned int loopback_pos_update(struct loopback_cable *cable)
> > delta_capt = delta_play;
> > }
> >
> > - if (delta_play == 0 && delta_capt == 0) {
> > - spin_unlock(&cable->lock);
> > - return running;
> > - }
> > + if (delta_play == 0 && delta_capt == 0)
> > + goto unlock;
> > +
> > /* note delta_capt == delta_play at this moment */
> > loopback_bytepos_update(dpcm_capt, delta_capt, BYTEPOS_UPDATE_COPY);
> > loopback_bytepos_update(dpcm_play, delta_play, BYTEPOS_UPDATE_POSONLY);
> > - spin_unlock(&cable->lock);
> > + unlock:
> > + spin_unlock_irqrestore(&cable->lock, flags);
> > return running;
> > }
> >
> >
>

2011-03-18 16:09:27

by Knut Petersen

[permalink] [raw]
Subject: Re: [BUG][2.6.38] IRQ Lock Inversion / i915 fails

Hi,

Bisecting does not work, I'm unable to find a "good" kernel for the
X.Org Server 1.9.3 used by openSuSE 11.4.

2.6.38 combined with X.Org 1.8.0 (openSuSE 11.3) seems to be ok.

The question is: does X.Org 1.9.3 expose an old kernel bug, or
is the X server that screws up the i915 hardware and kernel driver?

cu,
Knut