2018-11-08 17:58:49

by Pavel Machek

Subject: v4.20-rc1: list_del corruption on thinkpad x220

Hi!

My machine locked hard (thinkpad x220). After reboot, I found this in
syslog:

Sounds like memory corruption..? Does not sound easy to debug.

...otoh, it still looks like an address, so maybe it is "just" a race in
GPU drivers?
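
On "it still looks like an address": a minimal sketch that sorts the pointers from the log in this message into address-space regions. It assumes the default, non-randomized x86_64 4-level layout documented for kernels of this era (with CONFIG_RANDOMIZE_MEMORY the bases would differ). The range table and the classify() helper are illustrative names, not something from this thread.

```python
# Assumed default x86_64 4-level layout without CONFIG_RANDOMIZE_MEMORY.
# These ranges are an assumption for illustration; check the memory map
# documentation of the exact kernel before relying on them.
REGIONS = [
    ("direct map of physical memory", 0xffff880000000000, 0xffffc7ffffffffff),
    ("vmalloc/ioremap space", 0xffffc90000000000, 0xffffe8ffffffffff),
]


def classify(addr: int) -> str:
    """Return the name of the assumed region that contains addr."""
    for name, low, high in REGIONS:
        if low <= addr <= high:
            return name
    return "unknown"


# Values copied from the oops in this message.
pointers = {
    "expected prev->next": 0xffff8801742b8178,
    "found prev->next": 0xffffc9000192fec8,
    "RSP": 0xffffc9000196be78,
}

for label, addr in pointers.items():
    print(f"{label:20s} {addr:#018x}  {classify(addr)}")
```

Under those assumptions the expected pointer lies in the direct map, while the value actually found lies in the same region as the stack pointer, so it is at least a plausible kernel address rather than random garbage.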

Any ideas?
Pavel

Nov 8 18:35:01 duo CRON[28511]: (root) CMD (command -v debian-sa1 >
/dev/null && debian-sa1 1 1)
Nov 8 18:42:57 duo kernel: list_del corruption. prev->next should be
ffff8801742b8178, but was ffffc9000192fec8
Nov 8 18:42:57 duo kernel: ------------[ cut here ]------------
Nov 8 18:42:57 duo kernel: kernel BUG at
/data/fast/l/k/lib/list_debug.c:53!
Nov 8 18:42:57 duo kernel: invalid opcode: 0000 [#1] SMP PTI
Nov 8 18:42:57 duo kernel: CPU: 2 PID: 1082 Comm: i915/signal:1 Not
tainted 4.20.0-rc1+ #3
Nov 8 18:42:57 duo kernel: Hardware name: LENOVO 42872WU/42872WU,
BIOS 8DET74WW (1.44 ) 03/13/2018
Nov 8 18:42:57 duo kernel: RIP:
0010:__list_del_entry_valid+0x8e/0x90
Nov 8 18:42:57 duo kernel: Code: 66 88 d1 ff 0f 0b 48 89 fe 31 c0
48 c7 c7 90 74 5e 85 e8 53 88 d1 ff 0f 0b 48 89 fe 31 c0 48 c7 c7 c8
74 5e 85 e8 40 88 d1 ff <0f> 0b 55 48 89 d0 48 8b 52 08 48 89 e5 48
39 f2 75 19 48 8b 32 48
Nov 8 18:42:57 duo kernel: RSP: 0000:ffffc9000196be78 EFLAGS:
00210086
Nov 8 18:42:57 duo kernel: RAX: 0000000000000054 RBX:
ffff8801742b8178 RCX: 0000000000000000
Nov 8 18:42:57 duo kernel: RDX: 0000000000000000 RSI:
ffff88019e2a53d8 RDI: ffff88019e2a53d8
Nov 8 18:42:57 duo kernel: RBP: ffffc9000196be78 R08:
ffff880196e2cd10 R09: 0000000000000000
Nov 8 18:42:57 duo kernel: R10: 00000000e7684eb9 R11:
3863656632393101 R12: ffffc9000196bec8
Nov 8 18:42:57 duo kernel: R13: ffff88019707e000 R14:
ffff8801742b8080 R15: ffffc9000192fdd0
Nov 8 18:42:57 duo kernel: FS: 0000000000000000(0000)
GS:ffff88019e280000(0000) knlGS:0000000000000000
Nov 8 18:42:57 duo kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Nov 8 18:42:57 duo kernel: CR2: 00000000ed2bf000 CR3:
000000000581e001 CR4: 00000000000606a0
Nov 8 18:42:57 duo kernel: Call Trace:
Nov 8 18:42:57 duo kernel: intel_breadcrumbs_signaler+0x162/0x330
Nov 8 18:42:57 duo kernel: kthread+0x116/0x150
Nov 8 18:42:57 duo kernel: ? intel_engine_wakeup+0x40/0x40
Nov 8 18:42:57 duo kernel: ? kthread_park+0x90/0x90
Nov 8 18:42:57 duo kernel: ret_from_fork+0x35/0x40
Nov 8 18:42:57 duo kernel: Modules linked in:
Nov 8 18:42:57 duo kernel: ---[ end trace 2f8da183a56f80f6 ]---
Nov 8 18:42:57 duo kernel: RIP:
0010:__list_del_entry_valid+0x8e/0x90
Nov 8 18:42:57 duo kernel: Code: 66 88 d1 ff 0f 0b 48 89 fe 31 c0
48 c7 c7 90 74 5e 85 e8 53 88 d1 ff 0f 0b 48 89 fe 31 c0 48 c7 c7 c8
74 5e 85 e8 40 88 d1 ff <0f> 0b 55 48 89 d0 48 8b 52 08 48 89 e5 48
39 f2 75 19 48 8b 32 48
Nov 8 18:42:57 duo kernel: RSP: 0000:ffffc9000196be78 EFLAGS:
00210086
Nov 8 18:42:57 duo kernel: RAX: 0000000000000054 RBX:
ffff8801742b8178 RCX: 0000000000000000
Nov 8 18:42:57 duo kernel: RDX: 0000000000000000 RSI:
ffff88019e2a53d8 RDI: ffff88019e2a53d8
Nov 8 18:42:57 duo kernel: RBP: ffffc9000196be78 R08:
ffff880196e2cd10 R09: 0000000000000000
Nov 8 18:42:57 duo kernel: R10: 00000000e7684eb9 R11:
3863656632393101 R12: ffffc9000196bec8
Nov 8 18:42:57 duo kernel: R13: ffff88019707e000 R14:
ffff8801742b8080 R15: ffffc9000192fdd0
Nov 8 18:42:57 duo kernel: FS: 0000000000000000(0000)
GS:ffff88019e280000(0000) knlGS:0000000000000000
Nov 8 18:42:57 duo kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Nov 8 18:42:57 duo kernel: CR2: 00000000ed2bf000 CR3:
000000000581e001 CR4: 00000000000606a0
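
For readers unfamiliar with the message at the top of this log: it is what a debug-checked list deletion prints when the previous node no longer points back at the entry being removed. The following is a user-space model of that kind of check, not the kernel source; Node, list_add, list_del_checked and ListCorruption are names made up for the sketch.

```python
class ListCorruption(Exception):
    """Raised where the kernel's debug check would hit BUG()."""


class Node:
    """Node of a circular doubly linked list; a fresh node links to itself."""

    def __init__(self, name):
        self.name = name
        self.prev = self
        self.next = self


def list_add(new, head):
    """Insert new right after head."""
    nxt = head.next
    new.prev, new.next = head, nxt
    nxt.prev = new
    head.next = new


def list_del_checked(entry):
    """Unlink entry, but first verify that both neighbours still point at it."""
    prev, nxt = entry.prev, entry.next
    if prev.next is not entry:
        raise ListCorruption(
            f"list_del corruption. prev->next should be {entry.name}, "
            f"but was {prev.next.name}"
        )
    if nxt.prev is not entry:
        raise ListCorruption(
            f"list_del corruption. next->prev should be {entry.name}, "
            f"but was {nxt.prev.name}"
        )
    nxt.prev = prev
    prev.next = nxt
    entry.prev = entry.next = None


# Build head -> b -> a -> head, then mimic a stray write to b.next.
head, a, b = Node("head"), Node("a"), Node("b")
list_add(a, head)
list_add(b, head)
b.next = Node("stray")

try:
    list_del_checked(a)
except ListCorruption as err:
    print(err)
```

In this model the stray write stands in for whatever modified the neighbouring node; a race between two paths touching the same list and a genuine memory corruption would both trip the same check.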

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-11-21 11:43:03

by Joonas Lahtinen

Subject: Re: v4.20-rc1: list_del corruption on thinkpad x220

+ Chris

Quoting Pavel Machek (2018-11-08 19:58:03)
> Hi!
>
> My machine locked hard (thinkpad x220). After reboot, I found this in
> syslog:
>
> Sounds like memory corruption..? Does not sound easy to debug.

Were you doing something GPU intense when you experienced the hard hang?

And if so, have you been able to hit the issue more than once? At this
point it doesn't look like anything we've hit previously, so would be
great to have some more insight into how we could reproduce.

There's one similar for nouveau in Bugzilla, but it seems like a genuine
memory corruption (1 bit flipped):

https://bugs.freedesktop.org/show_bug.cgi?id=84880
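
One quick way to test the single-bit-flip idea against the original report is to XOR the expected and found pointers and count the differing bits. This is only a sketch; differing_bits() is a helper named here for illustration, and the two values are the ones from the list_del message in the first log.

```python
def differing_bits(expected: int, found: int) -> int:
    """Number of bit positions in which the two values differ."""
    return bin(expected ^ found).count("1")


# Pointers from "prev->next should be ..., but was ..." in the first report.
expected = 0xffff8801742b8178
found = 0xffffc9000192fec8

count = differing_bits(expected, found)
print(f"xor = {expected ^ found:#018x}, {count} bit(s) differ")
print("looks like a single-bit flip" if count == 1 else "not a single-bit flip")
```

For this report many bits differ, so it does not have the shape of the one-bit corruption in the nouveau bug.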

Any extra information would be of use :)

Regards, Joonas

PS. Could you open a bug to Bugzilla, it'll help to collect the
information in one consolidated place:

https://01.org/linuxgraphics/documentation/how-report-bugs


2018-11-21 12:00:31

by Pavel Machek

Subject: Re: v4.20-rc1: list_del corruption on thinkpad x220

Hi!

> > My machine locked hard (thinkpad x220). After reboot, I found this in
> > syslog:
> >
> > Sounds like memory corruption..? Does not sound easy to debug.
>
> Were you doing something GPU intense when you experienced the hard hang?
>
> And if so, have you been able to hit the issue more than once? At this
> point it doesn't look like anything we've hit previously, so would be
> great to have some more insight into how we could reproduce.

I have seen another crash since then, but I don't think it counts as
"easily reproducible".

I may have been running flightgear at that point. That's fairly GPU intensive.

> There's one similar for nouveau in Bugzilla, but it seems like a genuine
> memory corruption (1 bit flipped):
>
> https://bugs.freedesktop.org/show_bug.cgi?id=84880
>
> Any extra information would be of use :)
>
> Regards, Joonas
>
> PS. Could you open a bug to Bugzilla, it'll help to collect the
> information in one consolidated place:
>
> https://01.org/linuxgraphics/documentation/how-report-bugs

I prefer email... certainly for bugs that can't be reproduced.

Best regards,
Pavel


--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-11-24 08:11:01

by Joonas Lahtinen

Subject: Re: v4.20-rc1: list_del corruption on thinkpad x220

Quoting Pavel Machek (2018-11-21 13:54:49)
> Hi!
>
> > > My machine locked hard (thinkpad x220). After reboot, I found this in
> > > syslog:
> > >
> > > Sounds like memory corruption..? Does not sound easy to debug.
> >
> > Were you doing something GPU intense when you experienced the hard hang?
> >
> > And if so, have you been able to hit the issue more than once? At this
> > point it doesn't look like anything we've hit previously, so would be
> > great to have some more insight into how we could reproduce.
>
> I have seen another crash since then, but I don't think it counts as
> "easily reproducible".
>
> I may have been running flightgear at that point. That's fairly GPU intensive.
>
> > There's one similar for nouveau in Bugzilla, but it seems like a genuine
> > memory corruption (1 bit flipped):
> >
> > https://bugs.freedesktop.org/show_bug.cgi?id=84880
> >
> > Any extra information would be of use :)
> >
> > Regards, Joonas
> >
> > PS. Could you open a bug to Bugzilla, it'll help to collect the
> > information in one consolidated place:
> >
> > https://01.org/linuxgraphics/documentation/how-report-bugs
>
> I prefer email... certainly for bugs that can't be reproduced.

By adding it to the Bugzilla it may be recognized by somebody else
who is experiencing a similar issue. Internet points are not deducted
for submitting bugs in good faith, even if they get closed as NOTABUG.

It sounds like you've hit the same signature twice, so it may very well
be reproducible. Does flightgear have some demo mode where you could
leave it running a heavy scene overnight?

Were you running 4.19 kernel previously, distro one or vanilla? A full
dmesg from a boot would be appreciated (from kernel where you didn't
experience issues, and from one where you do).

We actually have a well defined process and personnel to look into the
Bugzilla entries, so it'd still be helpful to have this logged to
Bugzilla.

Regards, Joonas


2018-11-24 15:26:31

by Pavel Machek

Subject: Re: v4.20-rc1: list_del corruption on thinkpad x220

Hi!

> > > There's one similar for nouveau in Bugzilla, but it seems like a genuine
> > > memory corruption (1 bit flipped):
> > >
> > > https://bugs.freedesktop.org/show_bug.cgi?id=84880
> > >
> > > Any extra information would be of use :)
> > >
> > > Regards, Joonas
> > >
> > > PS. Could you open a bug to Bugzilla, it'll help to collect the
> > > information in one consolidated place:
> > >
> > > https://01.org/linuxgraphics/documentation/how-report-bugs
> >
> > I prefer email... certainly for bugs that can't be reproduced.
>
> By adding it to the Bugzilla it may be recognized by somebody else
> who is experiencing a similar issue. Internet points are not deducted
> for submitting bugs in good faith, even if they get closed as
> NOTABUG.

Feel free to copy from email to bugzilla :-).

> It sounds like you've hit the same signature twice, so it may very well
> be reproducible. Does flightgear have some demo mode where you could
> leave it running a heavy scene overnight?

I'm not sure if it was same signature twice. I had two lockups, but
IIRC only investigated one.

Not really a demo mode. I can put plane on autopilot, but eventually
gas runs out. (And I guess window needs to be visible for test to be
effective.) I tried today, but it did not crash.

Do you have something else I could run to do the testing?

> Were you running 4.19 kernel previously, distro one or vanilla? A full
> dmesg from a boot would be appreciated (from kernel where you didn't
> experience issues, and from one where you do).

Recent kernels I'm running are self-compiled.

> We actually have a well defined process and personnel to look into the
> Bugzilla entries, so it'd still be helpful to have this logged to
> Bugzilla.

If I can reproduce it, it makes sense to create bugzilla entry.

Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-12-08 11:14:54

by Pavel Machek

Subject: Re: v4.20-rc1: list_del corruption on thinkpad x220, graphics related?

Hi!

> > > > There's one similar for nouveau in Bugzilla, but it seems like a genuine
> > > > memory corruption (1 bit flipped):
> > > >
> > > > https://bugs.freedesktop.org/show_bug.cgi?id=84880
> > > >
> > > > Any extra information would be of use :)
> > > >
> > > > Regards, Joonas
> > > >
> > > > PS. Could you open a bug to Bugzilla, it'll help to collect the
> > > > information in one consolidated place:
> > > >
> > > > https://01.org/linuxgraphics/documentation/how-report-bugs
> > >
> > > I prefer email... certainly for bugs that can't be reproduced.
> >
> > By adding it to the Bugzilla it may be recognized by somebody else
> > who is experiencing a similar issue. Internet points are not deducted
> > for submitting bugs in good faith, even if they get closed as
> > NOTABUG.

Well, your documentation suggests you'll deduct my internet points:

Before filing the bug, please try to reproduce your issue with the
latest kernel. Use the latest drm-tip branch from
http://cgit.freedesktop.org/drm-tip and build as instructed on our
Build Guide.

:-)

> Feel free to copy from email to bugzilla :-).

Hmm, so it seems it happened again today:

Dec 8 11:45:01 duo CRON[29325]: (root) CMD (command -v debian-sa1 >
/dev/null && debian-sa1 1 1)
Dec 8 11:46:42 duo
org.mate.panel.applet.MateWeatherAppletFactory[3983]:
(mateweather-applet-2:4242): GLib-CRITICAL **: Source ID 14603 was not
found when attempting to remove it
Dec 8 11:54:59 duo kernel: list_del corruption. prev->next should be
ffff88019283ea28, but was ffff8801411a1c68
Dec 8 11:54:59 duo kernel: ------------[ cut here ]------------
Dec 8 11:54:59 duo kernel: kernel BUG at
/data/fast/l/k/lib/list_debug.c:53!
Dec 8 11:54:59 duo kernel: invalid opcode: 0000 [#1] SMP PTI
Dec 8 11:54:59 duo kernel: CPU: 1 PID: 3428 Comm: Xorg Not tainted
4.20.0-rc1+ #4
Dec 8 11:54:59 duo kernel: Hardware name: LENOVO 42872WU/42872WU,
BIOS 8DET74WW (1.44 ) 03/13/2018
Dec 8 11:54:59 duo kernel: RIP:
0010:__list_del_entry_valid+0x8e/0x90
Dec 8 11:54:59 duo kernel: Code: 16 88 d1 ff 0f 0b 48 89 fe 31 c0 48
c7 c7 08 75 5e 85 e8 03 88 d1 ff 0f 0b 48 89 fe 31 c0 48 c7 c7 40 75
5e 85 e8 f0
87 d1 ff <0f> 0b 55 48 89 d0 48 8b 52 08 48 89 e5 48 39 f2 75 19 48
8b 32 48
Dec 8 11:54:59 duo kernel: RSP: 0000:ffffc90000223ac0 EFLAGS:
00213282
Dec 8 11:54:59 duo kernel: RAX: 0000000000000054 RBX:
ffff880115a07c40 RCX: 0000000000000000
Dec 8 11:54:59 duo kernel: RDX: 0000000000000000 RSI:
ffff88019e2653d8 RDI: ffff88019e2653d8
Dec 8 11:54:59 duo kernel: RBP: ffffc90000223ac0 R08:
ffff880193a2ad10 R09: 0000000000000000
Dec 8 11:54:59 duo kernel: R10: 00000000008e9088 R11:
2e6e6f6974707501 R12: ffff8801960cb240
Dec 8 11:54:59 duo kernel: R13: ffff88019283e900 R14:
ffff880115a07ec0 R15: ffff88019283ea28
Dec 8 11:54:59 duo kernel: FS: 0000000000000000(0000)
GS:ffff88019e240000(0063) knlGS:00000000f79c4880
Dec 8 11:54:59 duo kernel: CS: 0010 DS: 002b ES: 002b CR0:
0000000080050033
Dec 8 11:54:59 duo kernel: CR2: 00000000086b0df8 CR3:
00000001939f6004 CR4: 00000000000606a0
Dec 8 11:54:59 duo kernel: Call Trace:
Dec 8 11:54:59 duo kernel: i915_vma_move_to_active+0x1c3/0x510
Dec 8 11:54:59 duo kernel: ? i915_request_await_object+0xf4/0x280
Dec 8 11:54:59 duo kernel: i915_gem_do_execbuffer+0xe2f/0x10a0
Dec 8 11:54:59 duo kernel: ? find_held_lock+0x39/0xb0
Dec 8 11:54:59 duo kernel: ? kvmalloc_node+0x26/0x70
Dec 8 11:54:59 duo kernel: i915_gem_execbuffer2_ioctl+0x1b4/0x360
Dec 8 11:54:59 duo kernel: ? i915_gem_execbuffer_ioctl+0x290/0x290
Dec 8 11:54:59 duo kernel: drm_ioctl_kernel+0xaa/0xf0
Dec 8 11:54:59 duo kernel: drm_ioctl+0x323/0x3d0
Dec 8 11:54:59 duo kernel: ? i915_gem_execbuffer_ioctl+0x290/0x290
Dec 8 11:54:59 duo kernel: ? posix_ktime_get_ts+0xc/0x10
Dec 8 11:54:59 duo kernel: i915_compat_ioctl+0x37/0x40
Dec 8 11:54:59 duo kernel: __ia32_compat_sys_ioctl+0x429/0xe90
Dec 8 11:54:59 duo kernel: ? put_old_timespec32+0x9/0x10
Dec 8 11:54:59 duo kernel: ?
__ia32_compat_sys_clock_gettime+0x67/0x90
Dec 8 11:54:59 duo kernel: do_int80_syscall_32+0x50/0x100
Dec 8 11:54:59 duo kernel: entry_INT80_compat+0x7d/0x82
Dec 8 11:54:59 duo kernel: RIP: 0023:0xf7fd5c42
Dec 8 11:54:59 duo kernel: Code: 65 8b 15 04 00 00 00 8b 0e 8b 0c
ca 83 f9 ff 75 0c 89 04 24 89 f0 e8 b3 fe ff ff eb 05 8b 46 04 01 c8
83 c4 14 5b 5e c3 cd 80 <c3> 8d b6 00 00 00 00 8d bc 27 00 00 00 00
8b 1c 24 c3 8d b6 00 00
Dec 8 11:54:59 duo kernel: RSP: 002b:00000000fff1a014 EFLAGS:
00203292 ORIG_RAX: 0000000000000036
Dec 8 11:54:59 duo kernel: RAX: ffffffffffffffda RBX:
000000000000000a RCX: 0000000040406469
Dec 8 11:54:59 duo kernel: RDX: 00000000fff1a0bc RSI:
0000000000000000 RDI: 0000000040406469
Dec 8 11:54:59 duo kernel: RBP: 000000000000000a R08:
0000000000000000 R09: 0000000000000000
Dec 8 11:54:59 duo kernel: R10: 0000000000000000 R11:
0000000000000000 R12: 0000000000000000
Dec 8 11:54:59 duo kernel: R13: 0000000000000000 R14:
0000000000000000 R15: 0000000000000000
Dec 8 11:54:59 duo kernel: Modules linked in:
Dec 8 11:54:59 duo kernel: ---[ end trace 0c1e74ccc719c763 ]---
Dec 8 11:54:59 duo kernel: RIP:
0010:__list_del_entry_valid+0x8e/0x90
Dec 8 11:54:59 duo kernel: Code: 16 88 d1 ff 0f 0b 48 89 fe 31 c0
48 c7 c7 08 75 5e 85 e8 03 88 d1 ff 0f 0b 48 89 fe 31 c0 48 c7 c7 40
75 5e 85 e8 f0 87 d1 ff <0f> 0b 55 48 89 d0 48 8b 52 08 48 89 e5 48
39 f2 75 19 48 8b 32 48
Dec 8 11:54:59 duo kernel: RSP: 0000:ffffc90000223ac0 EFLAGS:
00213282
Dec 8 11:54:59 duo kernel: RAX: 0000000000000054 RBX:
ffff880115a07c40 RCX: 0000000000000000
Dec 8 11:54:59 duo kernel: RDX: 0000000000000000 RSI:
ffff88019e2653d8 RDI: ffff88019e2653d8
Dec 8 11:54:59 duo kernel: RBP: ffffc90000223ac0 R08:
ffff880193a2ad10 R09: 0000000000000000
Dec 8 11:54:59 duo kernel: R10: 00000000008e9088 R11:
2e6e6f6974707501 R12: ffff8801960cb240
Dec 8 11:54:59 duo kernel: R13: ffff88019283e900 R14:
ffff880115a07ec0 R15: ffff88019283ea28
Dec 8 11:54:59 duo kernel: FS: 0000000000000000(0000)
GS:ffff88019e240000(0063) knlGS:00000000f79c4880
Dec 8 11:54:59 duo kernel: CS: 0010 DS: 002b ES: 002b CR0:
0000000080050033
Dec 8 11:54:59 duo kernel: CR2: 00000000086b0df8 CR3:
00000001939f6004 CR4: 00000000000606a0
Dec 8 11:54:59 duo org.mate.panel.applet.WnckletFactory[3983]:
wnck-applet: Fatal IO error 11 (Resource temporarily unavailable) on
X server :0.
Dec 8 11:54:59 duo
org.mate.panel.applet.MateWeatherAppletFactory[3983]:
mateweather-applet-2: Fatal IO error 11 (Resource temporarily
unavailable) on X server :0.
Dec 8 11:55:00 duo
org.mate.panel.applet.CommandAppletFactory[3983]: command-applet:
Fatal IO error 11 (Resource temporarily unavailable) on X server :0.
Dec 8 11:55:00 duo
org.mate.panel.applet.NotificationAreaAppletFactory[3983]:
notification-area-applet: Fatal IO error 11 (Resource temporarily
unavailable) on X server :0.
Dec 8 11:55:00 duo org.mate.panel.applet.ClockAppletFactory[3983]:
clock-applet: Fatal IO error 11 (Resource temporarily unavailable)
on X server :0.
Dec 8 11:55:01 duo CRON[30056]: (root) CMD (command -v debian-sa1 >
/dev/null && debian-sa1 1 1)
Dec 8 11:55:02 duo
org.mate.panel.applet.InhibitAppletFactory[3983]:
mate-inhibit-applet: Fatal IO error 11 (Resource temporarily
unavailable) on X server :0.
Dec 8 11:55:09 duo org.a11y.atspi.Registry[4114]: XIO: fatal IO
error 11 (Resource temporarily unavailable) on X server ":0"

Do you see high chance of this being DRM/Intel issue?
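
One piece of evidence in the log above can be checked mechanically. The user-space register dump shows ORIG_RAX 0x36 (54, which is ioctl on i386) and RCX 0x40406469, and in the 32-bit int 0x80 convention that register carries the second syscall argument, here the ioctl request. The sketch below decodes it using the generic Linux ioctl bit layout as found on x86; decode_ioctl() and the constant names are illustrative, and the i915 command number is quoted from memory of the UAPI header, so treat it as an assumption to verify.

```python
# Generic ioctl encoding on x86: nr in bits 0-7, type in bits 8-15,
# size in bits 16-29, direction in bits 30-31.
DRM_IOCTL_TYPE = "d"              # DRM ioctls use type 'd'
DRM_COMMAND_BASE = 0x40           # driver-private commands start here
I915_GEM_EXECBUFFER2 = 0x29       # assumed value from the i915 UAPI header


def decode_ioctl(cmd: int) -> dict:
    """Split an ioctl request number into its encoded fields."""
    return {
        "dir": (cmd >> 30) & 0x3,
        "size": (cmd >> 16) & 0x3FFF,
        "type": chr((cmd >> 8) & 0xFF),
        "nr": cmd & 0xFF,
    }


fields = decode_ioctl(0x40406469)  # RCX from the user-space register dump
print(fields)
if fields["type"] == DRM_IOCTL_TYPE and fields["nr"] >= DRM_COMMAND_BASE:
    driver_cmd = fields["nr"] - DRM_COMMAND_BASE
    print(f"DRM driver-private command {driver_cmd:#x}")
    print("matches EXECBUFFER2:", driver_cmd == I915_GEM_EXECBUFFER2)
```

The decoded request agrees with i915_gem_execbuffer2_ioctl in the call trace, so the crashing path was entered through the i915 execbuffer ioctl. That alone does not show which code corrupted the list.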

> > It sounds like you've hit the same signature twice, so it may very well
> > be reproducible. Does flightgear have some demo mode where you could
> > leave it running a heavy scene overnight?
>
> I'm not sure if it was same signature twice. I had two lockups, but
> IIRC only investigated one.

So it is twice now.
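
To compare the two crashes a bit more carefully than by eye, the wrapped syslog text can be flattened and a few fields pulled out. This is a rough sketch; the regular expressions and the signature() helper are made up for this comparison, and the two excerpts are shortened copies of the logs in this thread.

```python
import re

CORRUPTION_RE = re.compile(
    r"list_del corruption\. prev->next should be ([0-9a-f]{16}), "
    r"but was ([0-9a-f]{16})"
)
RIP_RE = re.compile(r"RIP: 0010:(\S+)")   # kernel-mode RIP only
COMM_RE = re.compile(r"Comm: (\S+)")


def signature(log_text: str) -> dict:
    """Extract faulting RIP, task name and the two pointers from an oops."""
    flat = " ".join(log_text.split())      # undo mail line wrapping
    corruption = CORRUPTION_RE.search(flat)
    rip = RIP_RE.search(flat)
    comm = COMM_RE.search(flat)
    return {
        "rip": rip.group(1) if rip else None,
        "comm": comm.group(1) if comm else None,
        "pointers": corruption.groups() if corruption else None,
    }


NOV_8 = """
Nov 8 18:42:57 duo kernel: list_del corruption. prev->next should be
ffff8801742b8178, but
was ffffc9000192fec8
Nov 8 18:42:57 duo kernel: CPU: 2 PID: 1082 Comm: i915/signal:1 Not
tainted 4.20.0-rc1+ #3
Nov 8 18:42:57 duo kernel: RIP:
0010:__list_del_entry_valid+0x8e/0x90
"""

DEC_8 = """
Dec 8 11:54:59 duo kernel: list_del corruption. prev->next should be
ffff88019283ea28, but was ffff8801411a1c68
Dec 8 11:54:59 duo kernel: CPU: 1 PID: 3428 Comm: Xorg Not tainted
4.20.0-rc1+ #4
Dec 8 11:54:59 duo kernel: RIP:
0010:__list_del_entry_valid+0x8e/0x90
"""

first, second = signature(NOV_8), signature(DEC_8)
print(first)
print(second)
print("same faulting RIP:", first["rip"] == second["rip"])
print("same task:", first["comm"] == second["comm"])
```

On these excerpts the same check fires at the same RIP, but in different tasks (the i915 signaler thread versus Xorg), and the call traces in the full logs differ as well.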

> Not really a demo mode. I can put plane on autopilot, but eventually
> gas runs out. (And I guess window needs to be visible for test to be
> effective.) I tried today, but it did not crash.
>
> Do you have something else I could run to do the testing?

This time I was not really running anything graphics heavy, except for
chromium playing a youtube video.

Best regards,
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2018-12-08 11:25:44

by Pavel Machek

Subject: Re: v4.20-rc1: list_del corruption on thinkpad x220, graphics related?

On Sat 2018-12-08 12:13:46, Pavel Machek wrote:
> Hi!
>
> > > > > There's one similar for nouveau in Bugzilla, but it seems like a genuine
> > > > > memory corruption (1 bit flipped):
> > > > >
> > > > > https://bugs.freedesktop.org/show_bug.cgi?id=84880
> > > > >
> > > > > Any extra information would be of use :)
> > > > >
> > > > > Regards, Joonas
> > > > >
> > > > > PS. Could you open a bug to Bugzilla, it'll help to collect the
> > > > > information in one consolidated place:
> > > > >
> > > > > https://01.org/linuxgraphics/documentation/how-report-bugs
> > > >
> > > > I prefer email... certainly for bugs that can't be reproduced.
> > >
> > > By adding it to the Bugzilla it may be recognized by somebody else
> > > who is experiencing a similar issue. Internet points are not deducted
> > > for submitting bugs in good faith, even if they get closed as
> > > NOTABUG.
>
> Well, your documentation suggests you'll deduct my internet points:
>
> Before filing the bug, please try to reproduce your issue with the
> latest kernel. Use the latest drm-tip branch from
> http://cgit.freedesktop.org/drm-tip and build as instructed on our
> Build Guide.
>
> :-)

I'd prefer not to run drm-tip. I'll update to 4.20-rc5+ and see if
it re-appears (but it takes a long time to reproduce :-().

If you think it is useful, I can try to update my machine to
linux-next.

Best regards,
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Attachments:
(No filename) (1.56 kB)
signature.asc (188.00 B)
Digital signature

2018-12-09 11:20:07

by Pavel Machek

Subject: v4.20-rc5+ on x220: Resetting chip for hang on rcs0


Hi!

Another day, another problem... but this one is different from the
previous hang, as machine survives.

Chromium was running with youtube video playing.

[31850.666274] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[31850.666277] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[31850.666279] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[31850.666282] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[31850.666285] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[31850.666394] i915 0000:00:02.0: Resetting chip for hang on rcs0
[31850.668474] WARNING: CPU: 0 PID: 13675 at /data/fast/l/k/include/linux/dma-fence.h:503 i915_request_skip+0x71/0x80
[31850.668478] Modules linked in:
[31850.668484] CPU: 0 PID: 13675 Comm: kworker/0:3 Not tainted 4.20.0-rc5+ #5
[31850.668487] Hardware name: LENOVO 42872WU/42872WU, BIOS 8DET74WW (1.44 ) 03/13/2018

Dmesg and /sys/class/drm/card0/error are attached.
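
For anyone who wants to keep a copy of the same dump, here is a minimal
Python sketch. The source path is the one printed in the log above; the
helper name save_gpu_error and the destination file name are made up for
illustration. Reading the file may need root, and sysfs content does not
survive a reboot, so it has to be copied while the machine is still up.

```python
from pathlib import Path


def save_gpu_error(src: str = "/sys/class/drm/card0/error",
                   dst: str = "gpu-error.txt") -> int:
    """Copy the GPU crash dump from sysfs to a regular file.

    Returns the number of bytes copied. Both default paths are
    assumptions: src is taken from the kernel log, dst is arbitrary.
    """
    data = Path(src).read_bytes()   # read the whole dump into memory
    Path(dst).write_bytes(data)     # write it to a file that persists
    return len(data)


if __name__ == "__main__":
    try:
        size = save_gpu_error()
        print(f"saved {size} bytes to gpu-error.txt")
    except OSError as exc:
        # Missing file (no i915, no hang recorded) or insufficient rights.
        print(f"could not read crash dump: {exc}")
```

The resulting file is what the [drm] messages ask to have attached to a
bug report.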

Best regards,
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Attachments:
(No filename) (0.00 B)
signature.asc (188.00 B)
Digital signature

2018-12-10 08:30:51

by Joonas Lahtinen

Subject: Re: v4.20-rc1: list_del corruption on thinkpad x220, graphics related?

On Sat, 2018-12-08 at 12:24 +0100, Pavel Machek wrote:
> On Sat 2018-12-08 12:13:46, Pavel Machek wrote:
> > Hi!
> >
> > > > > > There's one similar for nouveau in Bugzilla, but it seems like a genuine
> > > > > > memory corruption (1 bit flipped):
> > > > > >
> > > > > > https://bugs.freedesktop.org/show_bug.cgi?id=84880
> > > > > >
> > > > > > Any extra information would be of use :)
> > > > > >
> > > > > > Regards, Joonas
> > > > > >
> > > > > > PS. Could you open a bug to Bugzilla, it'll help to collect the
> > > > > > information in one consolidated place:
> > > > > >
> > > > > > https://01.org/linuxgraphics/documentation/how-report-bugs
> > > > >
> > > > > I prefer email... certainly for bugs that can't be reproduced.
> > > >
> > > > By adding it to the Bugzilla it may be recognized by somebody else
> > > > who is experiencing a similar issue. Internet points are not deducted
> > > > for submitting bugs in good faith, even if they get closed as
> > > > NOTABUG.
> >
> > Well, your documentation suggests you'll deduct my internet points:
> >
> > Before filing the bug, please try to reproduce your issue with the
> > latest kernel. Use the latest drm-tip branch from
> > http://cgit.freedesktop.org/drm-tip and build as instructed on our
> > Build Guide.
> >
> > :-)
>
> I'd prefer not to run drm-tip. I'll update to 4.20-rc5+ and see if
> it re-appears (but it takes a long time to reproduce :-().

Whether or not we can reproduce the issue with drm-tip is a very useful
data point for us. If we cannot reproduce it, it'll be possible to bisect
which commit fixed it, and backport that. On the other hand, if it's
still reproducible, we know we're not spending time on something we
already fixed, and the priority gets a bump.

> If you think it is useful, I can try to update my machine to
> linux-next.

linux-next is closer to drm-tip, so it's better. Do you have some
specific reason for not wanting to run drm-tip (but linux-next is still
ok)?

Regards, Joonas

>
> Best regards,
> Pavel
>
--
Joonas Lahtinen
Open Source Graphics Center
Intel Corporation


2018-12-10 08:31:44

by Joonas Lahtinen

Subject: Re: v4.20-rc5+ on x220: Resetting chip for hang on rcs0

On Sun, 2018-12-09 at 12:18 +0100, Pavel Machek wrote:
> Hi!
>
> Another day, another problem... but this one is different from the
> previous hang, as machine survives.

Please, file a bug. It says so even in the splat...

Regards, Joonas

>
> Chromium was running with youtube video playing.
>
> [31850.666274] [drm] GPU hangs can indicate a bug anywhere in the
> entire gfx stack, including userspace.
> [31850.666277] [drm] Please file a _new_ bug report on
> bugs.freedesktop.org against DRI -> DRM/Intel
> [31850.666279] [drm] drm/i915 developers can then reassign to the
> right component if it's not a kernel issue.
> [31850.666282] [drm] The gpu crash dump is required to analyze gpu
> hangs, so please always attach it.
> [31850.666285] [drm] GPU crash dump saved to
> /sys/class/drm/card0/error
> [31850.666394] i915 0000:00:02.0: Resetting chip for hang on rcs0
> [31850.668474] WARNING: CPU: 0 PID: 13675 at
> /data/fast/l/k/include/linux/dma-fence.h:503
> i915_request_skip+0x71/0x80
> [31850.668478] Modules linked in:
> [31850.668484] CPU: 0 PID: 13675 Comm: kworker/0:3 Not tainted
> 4.20.0-rc5+ #5
> [31850.668487] Hardware name: LENOVO 42872WU/42872WU, BIOS 8DET74WW
> (1.44 ) 03/13/2018
>
> Dmesg and /sys/class/drm/card0/error are attached.
>
> Best regards,
> Pavel
--
Joonas Lahtinen
Open Source Graphics Center
Intel Corporation


2018-12-12 18:30:39

by Pavel Machek

Subject: 4.20.0-rc6-next-20181210, v4.20-rc1: list_del corruption on thinkpad x220, graphics related?

Hi!

> > > > > > > There's one similar for nouveau in Bugzilla, but it seems like a genuine
> > > > > > > memory corruption (1 bit flipped):
> > > > > > >
> > > > > > > https://bugs.freedesktop.org/show_bug.cgi?id=84880
> > > > > > >
> > > > > > > Any extra information would be of use :)
> > > > > > >
> > > > > > > Regards, Joonas
> > > > > > >
> > > > > > > PS. Could you open a bug to Bugzilla, it'll help to collect the
> > > > > > > information in one consolidated place:
> > > > > > >
> > > > > > > https://01.org/linuxgraphics/documentation/how-report-bugs
> > > > > >
> > > > > > I prefer email... certainly for bugs that can't be reproduced.
> > > > >
> > > > > By adding it to the Bugzilla it may be recognized by somebody else
> > > > > who is experiencing a similar issue. Internet points are not deducted
> > > > > for submitting bugs in good faith, even if they get closed as
> > > > > NOTABUG.
> > >
> > > Well, your documentation suggests you'll deduct my internet points:
> > >
> > > Before filing the bug, please try to reproduce your issue with the
> > > latest kernel. Use the latest drm-tip branch from
> > > http://cgit.freedesktop.org/drm-tip and build as instructed on our
> > > Build Guide.
> > >
> > > :-)
> >
> > I'd prefer not to run drm-tip. I'll update to 4.20-rc5+ and see if
> > it re-appears (but it takes a long time to reproduce :-().
>
> If we can or can not reproduce the issue with drm-tip, is a very useful
> datapoint for us. If we can not reproduce, it'll be possible to bisect
> which commit fixed it, and backport that. On the other hand, if it's
> still reproducible, we know we're not spending time on something we
> already fixed, and the priority gets a bump.

bisect ... is not practical on something that takes 2 days to reproduce.
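
To put rough numbers on that: git bisect needs about log2(N) test rounds
for N candidate commits, and here each round means running the machine
for days. A minimal sketch of the arithmetic follows; the commit count
is a made-up placeholder, not the real size of any range, and two days
per round is taken from the reproduction time mentioned above.

```python
def bisect_rounds(candidate_commits: int) -> int:
    """Approximate number of test rounds a bisection needs.

    Each round halves the candidate range, so the count is
    ceil(log2(N)), computed here with integer arithmetic.
    """
    if candidate_commits < 1:
        raise ValueError("need at least one candidate commit")
    return (candidate_commits - 1).bit_length()


def bisect_days(candidate_commits: int, days_per_round: float = 2.0) -> float:
    """Total wall-clock days if every round takes days_per_round."""
    return bisect_rounds(candidate_commits) * days_per_round


if __name__ == "__main__":
    hypothetical_commits = 10_000  # assumption, for illustration only
    rounds = bisect_rounds(hypothetical_commits)
    days = bisect_days(hypothetical_commits)
    print(f"{rounds} rounds, about {days:.0f} days")
```

With an intermittent bug this is a lower bound: a crash proves a commit
bad, but two quiet days never fully prove it good, so a wrong "good"
verdict can send the bisection down the wrong half.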

> > If you think it is useful, I can try to update my machine to
> > linux-next.
>
> linux-next is closer to drm-tip, so it's better. Do you have some
> specific reason for not wanting to run drm-tip (but linux-next is still
> ok)?

I already have build/update scripts for -next, and I trust -next not
to store screenshots of my desktop in my master boot record :-).

Anyway, it does happen with -next. This time, chromiums were running,
and the crash happened roughly a minute after I exited flightgear. It
can be seen in the logs.

Oh and I might want to mention -- machine was rather deep in swap this
time, as in "mouse jumping when starting fgfs" and "could feel the
chromium being swapped back in". I might have had this situation
before, and just powercycled the machine "because it is so deep in
swap that it will not recover".

top says:

top - 19:18:24 up 2 days, 8:03, 2 users, load average: 3.02, 3.45, 3.21
Tasks: 141 total, 1 running, 86 sleeping, 0 stopped, 2 zombie
%Cpu(s): 18.8 us, 7.6 sy, 3.0 ni, 68.4 id, 1.3 wa, 0.0 hi, 0.9 si, 0.0 st
KiB Mem: 5967968 total, 663244 used, 5304724 free, 48876 buffers
KiB Swap: 1681428 total, 170904 used, 1510524 free. 446280 cached Mem

...but of course that memory is free now that everything has died.

Any ideas? Should I go back to v4.19 to see if it happens there, too?


Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Attachments:
(No filename) (0.00 B)
signature.asc (188.00 B)
Digital signature

2018-12-13 08:31:35

by Joonas Lahtinen

Subject: Re: 4.20.0-rc6-next-20181210, v4.20-rc1: list_del corruption on thinkpad x220, graphics related?

Quoting Pavel Machek (2018-12-12 20:29:02)
> Hi!
>
> > > > > > > > There's one similar for nouveau in Bugzilla, but it seems like a genuine
> > > > > > > > memory corruption (1 bit flipped):
> > > > > > > >
> > > > > > > > https://bugs.freedesktop.org/show_bug.cgi?id=84880
> > > > > > > >
> > > > > > > > Any extra information would be of use :)
> > > > > > > >
> > > > > > > > Regards, Joonas
> > > > > > > >
> > > > > > > > PS. Could you open a bug to Bugzilla, it'll help to collect the
> > > > > > > > information in one consolidated place:
> > > > > > > >
> > > > > > > > https://01.org/linuxgraphics/documentation/how-report-bugs
> > > > > > >
> > > > > > > I prefer email... certainly for bugs that can't be reproduced.
> > > > > >
> > > > > > By adding it to the Bugzilla it may be recognized by somebody else
> > > > > > who is experiencing a similar issue. Internet points are not deducted
> > > > > > for submitting bugs in good faith, even if they get closed as
> > > > > > NOTABUG.
> > > >
> > > > Well, your documentation suggests you'll deduct my internet points:
> > > >
> > > > Before filing the bug, please try to reproduce your issue with the
> > > > latest kernel. Use the latest drm-tip branch from
> > > > http://cgit.freedesktop.org/drm-tip and build as instructed on our
> > > > Build Guide.
> > > >
> > > > :-)
> > >
> > > I'd prefer not to run drm-tip. I'll update to 4.20-rc5+ and see if
> > > it re-appears (but it takes a long time to reproduce :-().
> >
> > If we can or can not reproduce the issue with drm-tip, is a very useful
> > datapoint for us. If we can not reproduce, it'll be possible to bisect
> > which commit fixed it, and backport that. On the other hand, if it's
> > still reproducible, we know we're not spending time on something we
> > already fixed, and the priority gets a bump.
>
> bisect ... is not practical on something that takes 2 days to reproduce.
>
> > > If you think it is useful, I can try to update my machine to
> > > linux-next.
> >
> > linux-next is closer to drm-tip, so it's better. Do you have some
> > specific reason for not wanting to run drm-tip (but linux-next is still
> > ok)?
>
> I already have build/update scripts for -next, and I trust -next not
> to store screenshots of my desktop in my master boot record :-).
>
> Anyway, it does happen with -next. This time, chromiums were running,
> and crash happened minute? after I exited flightgear. It can be seen
> in the logs.
>
> Oh and I might want to mention -- machine was rather deep in swap this
> time, as in "mouse jumping when starting fgfs" and "could feel the
> chromium being swapped back in". I might have had this situation
> before, and just powercycled the machine "because it is so deep in
> swap that it will not recover".
>
> top says:
>
> top - 19:18:24 up 2 days, 8:03, 2 users, load average: 3.02, 3.45,
> 3.21
> Tasks: 141 total, 1 running, 86 sleeping, 0 stopped, 2 zombie
> %Cpu(s): 18.8 us, 7.6 sy, 3.0 ni, 68.4 id, 1.3 wa, 0.0 hi, 0.9
> si, 0.0 st
> KiB Mem: 5967968 total, 663244 used, 5304724 free, 48876
> buffers
> KiB Swap: 1681428 total, 170904 used, 1510524 free. 446280
> cached Mem
>
> ....but of course that memory is free once everything died.
>
> Any ideas? Should I go back to v4.19 to see if it happens there, too?

linux-next includes very much the same code as drm-tip. Nobody
magically reviews the code any further when it is fed into linux-next
than it was reviewed for inclusion into drm-tip. So thinking linux-next
would be in some way safer is an illusion.

It sounds like having memory pressure expedites the corruption, which
should make it easier to reproduce and thus fix.
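
One way to create memory pressure on purpose is to allocate a large
buffer and touch every page so it becomes resident. The sketch below
does only that; the function name, the sizes and the 4096-byte page size
are assumptions for illustration, and whether this kind of pressure
actually triggers the corruption faster has not been tested.

```python
def allocate_and_touch(total_mib: int, chunk_mib: int = 16,
                       page_size: int = 4096) -> list:
    """Allocate roughly total_mib MiB and write one byte per page.

    Writing to each page forces the kernel to back it with real
    memory. The caller must keep the returned list alive for as long
    as the pressure is wanted.
    """
    chunks = []
    remaining = total_mib
    while remaining > 0:
        size_mib = min(chunk_mib, remaining)
        buf = bytearray(size_mib * 1024 * 1024)
        for offset in range(0, len(buf), page_size):
            buf[offset] = 1  # touch the page so it becomes resident
        chunks.append(buf)
        remaining -= size_mib
    return chunks


if __name__ == "__main__":
    # Small default so the example is harmless; raise it towards the
    # size of RAM (with care) to push the machine into swap.
    held = allocate_and_touch(total_mib=64)
    print(f"holding {sum(len(c) for c in held) // (1024 * 1024)} MiB")
```

Run alongside the usual graphics workload, this only approximates the
"deep in swap" state described in the report.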

So please try to reproduce this on drm-tip AND open a bug in Bugzilla.
If you are unwilling to do that, it is very difficult to help you more.

Regards, Joonas

>
>
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2018-12-27 21:52:18

by Pavel Machek

Subject: [regression from v4.19] Re: 4.20.0-rc6-next-20181210, v4.20-rc1: list_del corruption on thinkpad x220, graphics related?

Hi!

> > > > If you think it is useful, I can try to update my machine to
> > > > linux-next.
> > >
> > > linux-next is closer to drm-tip, so it's better. Do you have some
> > > specific reason for not wanting to run drm-tip (but linux-next is still
> > > ok)?
> >
> > I already have build/update scripts for -next, and I trust -next not
> > to store screenshots of my desktop in my master boot record :-).
> >
> > Anyway, it does happen with -next. This time, chromiums were running,
> > and crash happened minute? after I exited flightgear. It can be seen
> > in the logs.
> >
> > Oh and I might want to mention -- machine was rather deep in swap this
> > time, as in "mouse jumping when starting fgfs" and "could feel the
> > chromium being swapped back in". I might have had this situation
> > before, and just powercycled the machine "because it is so deep in
> > swap that it will not recover".
> >
> > top says:
> >
> > top - 19:18:24 up 2 days, 8:03, 2 users, load average: 3.02, 3.45,
> > 3.21
> > Tasks: 141 total, 1 running, 86 sleeping, 0 stopped, 2 zombie
> > %Cpu(s): 18.8 us, 7.6 sy, 3.0 ni, 68.4 id, 1.3 wa, 0.0 hi, 0.9
> > si, 0.0 st
> > KiB Mem: 5967968 total, 663244 used, 5304724 free, 48876
> > buffers
> > KiB Swap: 1681428 total, 170904 used, 1510524 free. 446280
> > cached Mem
> >
> > ....but of course that memory is free once everything died.
> >
> > Any ideas? Should I go back to v4.19 to see if it happens there, too?
>
> linux-next includes very much the same code as drm-tip. There's nobody
> magically reviewing the code more than it is reviewed for inclusion into
> drm-tip, when it is fed into linux-next. So thinking linux-next would be
> some way safer is an illusion.
>
> It sounds like having memory pressure expedites the corruption, which
> should make it easier to reproduce and thus fix.
>
> So if you could please try drm-tip reproducing AND open a bug in Bugzilla.
> If you are unwilling to do that, it is very difficult to help you
> more.

Website says I have to read and agree to two different pieces of
legalese, and I'd need to keep track of yet another password... so
you can "communicate" with me.

But you can already communicate with me, over email.

I verified v4.19 is stable -- it worked ok for way more than the two
days it usually takes to crash.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Attachments:
(No filename) (2.55 kB)
signature.asc (188.00 B)
Digital signature

2019-01-02 10:50:24

by Joonas Lahtinen

Subject: Re: [regression from v4.19] Re: 4.20.0-rc6-next-20181210, v4.20-rc1: list_del corruption on thinkpad x220, graphics related?

Quoting Pavel Machek (2018-12-27 10:34:39)
> Hi!
>
> > > > > If you think it is useful, I can try to update my machine to
> > > > > linux-next.
> > > >
> > > > linux-next is closer to drm-tip, so it's better. Do you have some
> > > > specific reason for not wanting to run drm-tip (but linux-next is still
> > > > ok)?
> > >
> > > I already have build/update scripts for -next, and I trust -next not
> > > to store screenshots of my desktop in my master boot record :-).
> > >
> > > Anyway, it does happen with -next. This time, chromiums were running,
> > > and crash happened minute? after I exited flightgear. It can be seen
> > > in the logs.
> > >
> > > Oh and I might want to mention -- machine was rather deep in swap this
> > > time, as in "mouse jumping when starting fgfs" and "could feel the
> > > chromium being swapped back in". I might have had this situation
> > > before, and just powercycled the machine "because it is so deep in
> > > swap that it will not recover".
> > >
> > > top says:
> > >
> > > top - 19:18:24 up 2 days, 8:03, 2 users, load average: 3.02, 3.45,
> > > 3.21
> > > Tasks: 141 total, 1 running, 86 sleeping, 0 stopped, 2 zombie
> > > %Cpu(s): 18.8 us, 7.6 sy, 3.0 ni, 68.4 id, 1.3 wa, 0.0 hi, 0.9
> > > si, 0.0 st
> > > KiB Mem: 5967968 total, 663244 used, 5304724 free, 48876
> > > buffers
> > > KiB Swap: 1681428 total, 170904 used, 1510524 free. 446280
> > > cached Mem
> > >
> > > ....but of course that memory is free once everything died.
> > >
> > > Any ideas? Should I go back to v4.19 to see if it happens there, too?
> >
> > linux-next includes very much the same code as drm-tip. There's nobody
> > magically reviewing the code more than it is reviewed for inclusion into
> > drm-tip, when it is fed into linux-next. So thinking linux-next would be
> > some way safer is an illusion.
> >
> > It sounds like having memory pressure expedites the corruption, which
> > should make it easier to reproduce and thus fix.
> >
> > So if you could please try drm-tip reproducing AND open a bug in Bugzilla.
> > If you are unwilling to do that, it is very difficult to help you
> > more.
>
> Website says I have to read and agree to two different pieces of
> legalesee, and I'd need to keep track of yet another password... so
> you can "communicate" with me.
>
> But you can already communicate with me, over email.

I've listed all the reasons why our bug handling process is what it is.

If registering with Bugzilla is too much of an effort for you, then I
won't be able to help you further on this.

Regards, Joonas

> I verified v4.19 is stable -- it worked ok for way more than two days
> it usually takes to crash.
>
> Pavel
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2019-01-03 00:38:52

by Pavel Machek

Subject: Re: [regression from v4.19] Re: 4.20.0-rc6-next-20181210, v4.20-rc1: list_del corruption on thinkpad x220, graphics related?

Hi!

> > > So if you could please try drm-tip reproducing AND open a bug in Bugzilla.
> > > If you are unwilling to do that, it is very difficult to help you
> > > more.
> >
> > Website says I have to read and agree to two different pieces of
> > legalesee, and I'd need to keep track of yet another password... so
> > you can "communicate" with me.
> >
> > But you can already communicate with me, over email.
>
> I've listed all the reasons why our bug handling process is what it is.
>
> If registering to the Bugzilla is too much of an effort for you, then I
> won't be able to help you further on this.

Actually, I did register at the bugzilla. The only useful help there
was the suggestion that CONFIG_DRM_I915_DEBUG_GEM might help.
Unfortunately, that option seems to make the kernel panic(), which
makes it impossible to get anything useful.

https://bugs.freedesktop.org/show_bug.cgi?id=109175

Best regards,

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Attachments:
(No filename) (1.04 kB)
signature.asc (188.00 B)
Digital signature