2012-11-15 16:09:15

by Zdenek Kabelac

[permalink] [raw]
Subject: Acpi deadlocks with 3.7.0-rc4

Hello


I've already seen twice this oops after resuming my Lenovo T61 in docking station.

Since for some reason currently the serial line doesn't work correctly after
resume
(while I'm pretty sure it used to work in past) here is at least hand-written oops
message from mobile camera picture.

From the trace it seem os_wait semaphore is accessed twice.
Unsure which device is behind it - but it seem docking station is need to hit
this issue.


kernel 3.7.0-rc4

Pid: pm-suspend

RIP: acpi_ns_lookup + 0xa1/0x5b9

Call Trace:

? acpi_os_wait_semaphore + 0x136/0x149
acpi_ns_get_mode + 0x96/0x102
? __lock_is_held +0x5f/0x90
acpi_ns_evaluate +0x47/0x2de
? _raw_spin_lock_irqsave
? acpi_ut_evaluate_object
? sub_preempt_count
? pnpacpi_can_wakeup
acpi_rs_get_method_data
? acpi_os_signal_semaphore
acpi_walk_resources
? acpi_ut_release_mutex
pnpacpi_build_resource_template
? acpi_bus_get_device
pnpacpi_set_resources
? pnp_device_shutdown
pnp_start_dev
pnp_bus_resume
dpm_run_callback
device_resume
dpm_resume
dpm_resume_end
? acpi_suspend_begin_old
suspend_devices_and_enter
pm_suspend
state_store
kobj_attr_store
sysfs_write_file
vgs_write
sys_write
system_call_fastpath

Zdenek


PS: jpg on request


2012-11-28 16:01:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: Acpi deadlocks with 3.7.0-rc4

Adding more people (and the acpi list) to this report.

I'm seeing *very* few changes to the core suspend/resume path in 3.7,
and while there are some acpia updates, they seem to be pretty mild
too.

I think the acpi_os_wait_semaphore thing is a red herring - that's
just stale on the stack.

Do you have the register state from the oops? Or at least the "Code:"
line? It would be nice to see exactly where the oops happens, and I
cannot line up your "acpi_ns_lookup + 0xa1/0x5b9" with any code due
to different compilers (and configurations etc).

Linus


On Thu, Nov 15, 2012 at 8:09 AM, Zdenek Kabelac <[email protected]> wrote:
> Hello
>
>
> I've already seen twice this oops after resuming my Lenovo T61 in docking
> station.
>
> Since for some reason currently the serial line doesn't work correctly after
> resume
> (while I'm pretty sure it used to work in past) here is at least
> hand-written oops
> message from mobile camera picture.
>
> From the trace it seem os_wait semaphore is accessed twice.
> Unsure which device is behind it - but it seem docking station is need to
> hit this issue.
>
>
> kernel 3.7.0-rc4
>
> Pid: pm-suspend
>
> RIP: acpi_ns_lookup + 0xa1/0x5b9
>
> Call Trace:
>
> ? acpi_os_wait_semaphore + 0x136/0x149
> acpi_ns_get_mode + 0x96/0x102
> ? __lock_is_held +0x5f/0x90
> acpi_ns_evaluate +0x47/0x2de
> ? _raw_spin_lock_irqsave
> ? acpi_ut_evaluate_object
> ? sub_preempt_count
> ? pnpacpi_can_wakeup
> acpi_rs_get_method_data
> ? acpi_os_signal_semaphore
> acpi_walk_resources
> ? acpi_ut_release_mutex
> pnpacpi_build_resource_template
> ? acpi_bus_get_device
> pnpacpi_set_resources
> ? pnp_device_shutdown
> pnp_start_dev
> pnp_bus_resume
> dpm_run_callback
> device_resume
> dpm_resume
> dpm_resume_end
> ? acpi_suspend_begin_old
> suspend_devices_and_enter
> pm_suspend
> state_store
> kobj_attr_store
> sysfs_write_file
> vgs_write
> sys_write
> system_call_fastpath
>
> Zdenek
>
>
> PS: jpg on request
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2012-11-28 16:21:39

by Zdenek Kabelac

[permalink] [raw]
Subject: Re: Acpi deadlocks with 3.7.0-rc4

Dne 28.11.2012 17:01, Linus Torvalds napsal(a):
> Adding more people (and the acpi list) to this report.
>
> I'm seeing *very* few changes to the core suspend/resume path in 3.7,
> and while there are some acpia updates, they seem to be pretty mild
> too.
>
> I think the acpi_os_wait_semaphore thing is a red herring - that's
> just stale on the stack.
>
> Do you have the register state from the oops? Or at least the "Code:"
> line? It would be nice to see exactly where the oops happens, and I
> cannot line up your "acpi_ns_lookup + 0xa1/0x5b9" with any code due
> to different compilers (and configurations etc).
>
> Linus
>


I've opened https://bugzilla.kernel.org/show_bug.cgi?id=51071
and attached picture there which is all I have.

I'll try to decode exact code line.


In all cases - I've played even with 3.4 kernel - which also does not survive
multiple resumes in docking station - though it just leaves black screen - so
this oops is rather 'progress' and it also could be false path.

It's probably not a regression from 3.6 - since this problem was there for
much longer - but now it has just become much more visible.

As I usually do not have reason to make multiple suspend/resumes in dock I've
not noticed it until now when I've tried to capture this ACPI trace.

Zdenek

2012-11-28 17:02:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: Acpi deadlocks with 3.7.0-rc4

On Wed, Nov 28, 2012 at 8:21 AM, Zdenek Kabelac <[email protected]> wrote:
>
> I've opened https://bugzilla.kernel.org/show_bug.cgi?id=51071
> and attached picture there which is all I have.
>
> I'll try to decode exact code line.

Uhhuh. It's missing much of the relevant parts of the code line, in
particular the actual oopsing instruction. But what is there decodes
to

41 b8 10 00 00 00 mov $0x10,%r8d
48 c7 c1 88 52 64 81 mov $0xffffffff81645288,%rcx
31 c0 xor %eax,%eax
48 c7 c2 98 52 64 81 mov $0xffffffff81645298,%rdx
bf 00 04 00 0. mov $0x0.00400,%edi

.. oops in here ..

74 33 je 0x50
48 89 df mov %rbx,%rdi
e8 4d c9 00 00 callq ? <offset 0xc972>
48 89 d9 mov %rbx,%rcx
48 c7 c2 0a .. .. .. mov $0x......0a,%rdx

which isn't really very obvious. Do you have that kernel around (or at
least the same compiler and configuration)? Doing a

objdump --disassemble drivers/acpi/acpica/nsaccess.o

might help pinpoint where that is..

> It's probably not a regression from 3.6 - since this problem was there for
> much longer - but now it has just become much more visible.

Ok.

Linus

2012-11-28 17:27:58

by Zdenek Kabelac

[permalink] [raw]
Subject: Re: Acpi deadlocks with 3.7.0-rc4

Dne 28.11.2012 18:02, Linus Torvalds napsal(a):
> On Wed, Nov 28, 2012 at 8:21 AM, Zdenek Kabelac <[email protected]> wrote:
>>
>> I've opened https://bugzilla.kernel.org/show_bug.cgi?id=51071
>> and attached picture there which is all I have.
>>
>> I'll try to decode exact code line.
>
> Uhhuh. It's missing much of the relevant parts of the code line, in
> particular the actual oopsing instruction. But what is there decodes
> to
>
> 41 b8 10 00 00 00 mov $0x10,%r8d
> 48 c7 c1 88 52 64 81 mov $0xffffffff81645288,%rcx
> 31 c0 xor %eax,%eax
> 48 c7 c2 98 52 64 81 mov $0xffffffff81645298,%rdx
> bf 00 04 00 0. mov $0x0.00400,%edi
>
> .. oops in here ..
>
> 74 33 je 0x50
> 48 89 df mov %rbx,%rdi
> e8 4d c9 00 00 callq ? <offset 0xc972>
> 48 89 d9 mov %rbx,%rcx
> 48 c7 c2 0a .. .. .. mov $0x......0a,%rdx
>
> which isn't really very obvious. Do you have that kernel around (or at
> least the same compiler and configuration)? Doing a
>
> objdump --disassemble drivers/acpi/acpica/nsaccess.o

I've attached bigger disasfun script output to BZ 51071.
https://bugzilla.kernel.org/show_bug.cgi?id=51071#c1


if (ACPI_GET_DESCRIPTOR_TYPE(prefix_node) !=
00000000000000a1 <acpi_ns_lookup+0xa1> cmpb $0xf,0x8(%rbx)
00000000000000a5 <acpi_ns_lookup+0xa5> je 0da <acpi_ns_lookup+0xda>

seems to be going out of bounds.

Zdenek

2012-11-28 18:31:12

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: Acpi deadlocks with 3.7.0-rc4

On Wednesday, November 28, 2012 05:21:24 PM Zdenek Kabelac wrote:
> Dne 28.11.2012 17:01, Linus Torvalds napsal(a):
> > Adding more people (and the acpi list) to this report.
> >
> > I'm seeing *very* few changes to the core suspend/resume path in 3.7,
> > and while there are some acpia updates, they seem to be pretty mild
> > too.
> >
> > I think the acpi_os_wait_semaphore thing is a red herring - that's
> > just stale on the stack.
> >
> > Do you have the register state from the oops? Or at least the "Code:"
> > line? It would be nice to see exactly where the oops happens, and I
> > cannot line up your "acpi_ns_lookup + 0xa1/0x5b9" with any code due
> > to different compilers (and configurations etc).
> >
> > Linus
> >
>
>
> I've opened https://bugzilla.kernel.org/show_bug.cgi?id=51071
> and attached picture there which is all I have.
>
> I'll try to decode exact code line.
>
>
> In all cases - I've played even with 3.4 kernel - which also does not survive
> multiple resumes in docking station - though it just leaves black screen - so
> this oops is rather 'progress' and it also could be false path.
>
> It's probably not a regression from 3.6 - since this problem was there for
> much longer - but now it has just become much more visible.
>
> As I usually do not have reason to make multiple suspend/resumes in dock I've
> not noticed it until now when I've tried to capture this ACPI trace.

Well, if ACPI PNP devices are involved, and it looks like they are, that
definitely is not a new problem.

We're now in the process of reworking this area quite a bit and we'll do our
best to address this issue too in the process.

Thanks,
Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2012-11-28 19:07:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: Acpi deadlocks with 3.7.0-rc4

On Wed, Nov 28, 2012 at 9:27 AM, Zdenek Kabelac <[email protected]> wrote:
>
> I've attached bigger disasfun script output to BZ 51071.
> https://bugzilla.kernel.org/show_bug.cgi?id=51071#c1
>
>
> if (ACPI_GET_DESCRIPTOR_TYPE(prefix_node) !=
> 00000000000000a1 <acpi_ns_lookup+0xa1> cmpb $0xf,0x8(%rbx)
> 00000000000000a5 <acpi_ns_lookup+0xa5> je 0da <acpi_ns_lookup+0xda>
>
> seems to be going out of bounds.

The whole "prefix_node" pointer is bogus. It seems to have the value 0x1000.

I wonder how that happened. It's loaded from 'scope_info->scope.node',
and it *should* be a valid pointer.

Can you add a print-out of

scope_info->common.descriptor_type

and check that it is ACPI_DESC_TYPE_STATE_WSCOPE (== 8). If it is not,
return early.

Or just something like the attatched, which just uses the root node
(and warns once) if it's not a valid WSCOPE thing.

Linus


Attachments:
patch.diff (638.00 B)

2012-11-28 20:00:26

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: Acpi deadlocks with 3.7.0-rc4

On Wednesday, November 28, 2012 11:07:32 AM Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 9:27 AM, Zdenek Kabelac <[email protected]> wrote:
> >
> > I've attached bigger disasfun script output to BZ 51071.
> > https://bugzilla.kernel.org/show_bug.cgi?id=51071#c1
> >
> >
> > if (ACPI_GET_DESCRIPTOR_TYPE(prefix_node) !=
> > 00000000000000a1 <acpi_ns_lookup+0xa1> cmpb $0xf,0x8(%rbx)
> > 00000000000000a5 <acpi_ns_lookup+0xa5> je 0da <acpi_ns_lookup+0xda>
> >
> > seems to be going out of bounds.
>
> The whole "prefix_node" pointer is bogus. It seems to have the value 0x1000.
>
> I wonder how that happened. It's loaded from 'scope_info->scope.node',
> and it *should* be a valid pointer.

Well, suppose that pnpacpi_build_resource_template() passes a handle that's
not a valid pointer to acpi_walk_resources(). What happens then is that
it is passed directly to acpi_rs_get_method_data() and from there to
acpi_ut_evaluate_object() - without validation (acpi_rs_get_method_data()
even has a comment about the parameters validity guaranteed by the caller,
heh, heh). Then it becomes the prefix_node and is written into
info->prefix_node. acpi_ns_evaluate() takes that and passes it to
acpi_ns_get_node() along with info->pathname that is just the name of the
method to evaluate, which is a valid string, so the "if (!pathname)" block in
acpi_ns_get_node() is not executed and we get scope_info.scope.node = prefix_node,
which is our bad pointer. A pointer to that scope_info is passed to
acpi_ns_lookup() and we get the above.

So the code in pnpacpi_build_resource_template() is at fault by passing a
wrong pointer to acpi_walk_resources(). And the pointer is wrong probably
because the struct acpi_device pointed to by dev->data in there has been
removed during a previous suspend or resume (I'm not sure which one does
that), but the PNP layer has no idea about that. And that bug has been there
for quite a while (like forever?).

Thanks,
Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2012-11-28 20:26:55

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: Acpi deadlocks with 3.7.0-rc4

On Wednesday, November 28, 2012 06:27:50 PM Zdenek Kabelac wrote:
> Dne 28.11.2012 18:02, Linus Torvalds napsal(a):
> > On Wed, Nov 28, 2012 at 8:21 AM, Zdenek Kabelac <[email protected]> wrote:
> >>
> >> I've opened https://bugzilla.kernel.org/show_bug.cgi?id=51071
> >> and attached picture there which is all I have.

I wonder if you can try to apply the patch below and see if that makes any
difference?

Rafael


---
drivers/pnp/pnpacpi/rsparser.c | 8 ++++++++
1 file changed, 8 insertions(+)

Index: linux/drivers/pnp/pnpacpi/rsparser.c
===================================================================
--- linux.orig/drivers/pnp/pnpacpi/rsparser.c
+++ linux/drivers/pnp/pnpacpi/rsparser.c
@@ -610,6 +610,14 @@ int pnpacpi_build_resource_template(stru
struct acpi_resource *resource;
int res_cnt = 0;
acpi_status status;
+ int ret;
+
+ /* Sanity check. */
+ ret = acpi_bus_get_device(handle, &acpi_dev);
+ if (ret) {
+ dev_err(&dev->dev, "ACPI node is invalid in %s\n", __func__);
+ return ret;
+ }

status = acpi_walk_resources(handle, METHOD_NAME__CRS,
pnpacpi_count_resources, &res_cnt);



--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2012-11-29 09:04:06

by Zdenek Kabelac

[permalink] [raw]
Subject: Re: Acpi deadlocks with 3.7.0-rc4

Dne 28.11.2012 21:31, Rafael J. Wysocki napsal(a):
> On Wednesday, November 28, 2012 06:27:50 PM Zdenek Kabelac wrote:
>> Dne 28.11.2012 18:02, Linus Torvalds napsal(a):
>>> On Wed, Nov 28, 2012 at 8:21 AM, Zdenek Kabelac <[email protected]> wrote:
>>>>
>>>> I've opened https://bugzilla.kernel.org/show_bug.cgi?id=51071
>>>> and attached picture there which is all I have.
>
> I wonder if you can try to apply the patch below and see if that makes any
> difference?
>

Yep - extended BZ with 2 new pictures - it's now crashing earlier within
the call of acpi_bus_get_device().

Zdenek

2012-11-29 10:04:53

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: Acpi deadlocks with 3.7.0-rc4

On Thursday, November 29, 2012 10:03:53 AM Zdenek Kabelac wrote:
> Dne 28.11.2012 21:31, Rafael J. Wysocki napsal(a):
> > On Wednesday, November 28, 2012 06:27:50 PM Zdenek Kabelac wrote:
> >> Dne 28.11.2012 18:02, Linus Torvalds napsal(a):
> >>> On Wed, Nov 28, 2012 at 8:21 AM, Zdenek Kabelac <[email protected]> wrote:
> >>>>
> >>>> I've opened https://bugzilla.kernel.org/show_bug.cgi?id=51071
> >>>> and attached picture there which is all I have.
> >
> > I wonder if you can try to apply the patch below and see if that makes any
> > difference?
> >
>
> Yep - extended BZ with 2 new pictures - it's now crashing earlier within
> the call of acpi_bus_get_device().

Pretty much as expected.

So this is the problem I was thinking it was: the previous suspend-resume
cycle unregistered the struct acpi_device pointed to by dev->data in
pnpacpi_build_resource_template(), so the handle pointer in there is now
invalid.

Thanks,
Rafael


--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2012-11-29 10:13:18

by Zdenek Kabelac

[permalink] [raw]
Subject: Re: Acpi deadlocks with 3.7.0-rc4

Dne 28.11.2012 20:07, Linus Torvalds napsal(a):
> whole "prefix_node" pointer is bogus. It seems to have the value 0x1000.

Tested also this patch with this result:

https://bugzilla.kernel.org/show_bug.cgi?id=51071#c8

So while it's made it pass suspend/resume, it's not really usable
as docking then.

Zdenek

2012-11-29 10:54:50

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: Acpi deadlocks with 3.7.0-rc4

On Thursday, November 29, 2012 11:13:10 AM Zdenek Kabelac wrote:
> Dne 28.11.2012 20:07, Linus Torvalds napsal(a):
> > whole "prefix_node" pointer is bogus. It seems to have the value 0x1000.
>
> Tested also this patch with this result:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=51071#c8
>
> So while it's made it pass suspend/resume, it's not really usable
> as docking then.

This just makes acpi_ns_lookup() always return acpi_gbl_root_node
for things looked up by acpi_ns_get_node() as far as I can say.

Hmm.

If my theory correct, the patch below should catch the bug. Can you please
test it?

Rafael


---
drivers/acpi/scan.c | 1 +
drivers/pnp/pnpacpi/rsparser.c | 3 +++
2 files changed, 4 insertions(+)

Index: linux/drivers/acpi/scan.c
===================================================================
--- linux.orig/drivers/acpi/scan.c
+++ linux/drivers/acpi/scan.c
@@ -704,6 +704,7 @@ static void acpi_device_unregister(struc
mutex_unlock(&acpi_device_lock);

acpi_detach_data(device->handle, acpi_bus_data_handler);
+ device->handle = ERR_PTR(-ENODEV);

acpi_device_remove_files(device);
device_unregister(&device->dev);
Index: linux/drivers/pnp/pnpacpi/rsparser.c
===================================================================
--- linux.orig/drivers/pnp/pnpacpi/rsparser.c
+++ linux/drivers/pnp/pnpacpi/rsparser.c
@@ -611,6 +611,9 @@ int pnpacpi_build_resource_template(stru
int res_cnt = 0;
acpi_status status;

+ if (WARN_ON_ONCE(IS_ERR(handle)))
+ return PTR_ERR(handle);
+
status = acpi_walk_resources(handle, METHOD_NAME__CRS,
pnpacpi_count_resources, &res_cnt);
if (ACPI_FAILURE(status)) {



--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

2012-11-29 12:26:59

by Zdenek Kabelac

[permalink] [raw]
Subject: Re: Acpi deadlocks with 3.7.0-rc4

Dne 29.11.2012 11:59, Rafael J. Wysocki napsal(a):
> On Thursday, November 29, 2012 11:13:10 AM Zdenek Kabelac wrote:
>> Dne 28.11.2012 20:07, Linus Torvalds napsal(a):
>>> whole "prefix_node" pointer is bogus. It seems to have the value 0x1000.
>>
>> Tested also this patch with this result:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=51071#c8
>>
>> So while it's made it pass suspend/resume, it's not really usable
>> as docking then.
>
> This just makes acpi_ns_lookup() always return acpi_gbl_root_node
> for things looked up by acpi_ns_get_node() as far as I can say.
>
> Hmm.
>
> If my theory correct, the patch below should catch the bug. Can you please
> test it?
>

Ok now crashing right after 'undock' button press:

https://bugzilla.kernel.org/show_bug.cgi?id=51071#c10


Zdenek

2012-11-29 16:55:15

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: Acpi deadlocks with 3.7.0-rc4

On Thursday, November 29, 2012 01:26:48 PM Zdenek Kabelac wrote:
> Dne 29.11.2012 11:59, Rafael J. Wysocki napsal(a):
> > On Thursday, November 29, 2012 11:13:10 AM Zdenek Kabelac wrote:
> >> Dne 28.11.2012 20:07, Linus Torvalds napsal(a):
> >>> whole "prefix_node" pointer is bogus. It seems to have the value 0x1000.
> >>
> >> Tested also this patch with this result:
> >>
> >> https://bugzilla.kernel.org/show_bug.cgi?id=51071#c8
> >>
> >> So while it's made it pass suspend/resume, it's not really usable
> >> as docking then.
> >
> > This just makes acpi_ns_lookup() always return acpi_gbl_root_node
> > for things looked up by acpi_ns_get_node() as far as I can say.
> >
> > Hmm.
> >
> > If my theory correct, the patch below should catch the bug. Can you please
> > test it?
> >
>
> Ok now crashing right after 'undock' button press:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=51071#c10

And that's because I made a mistake in the patch. We're currently testing
the appended one and it's showing that the added WARN_ON_ONCE() actually
triggers, so the theory appears to be correct.

I think we can debug this further in the Bugzilla.

Thanks,
Rafael


---
drivers/acpi/scan.c | 2 ++
drivers/pnp/pnpacpi/rsparser.c | 3 +++
2 files changed, 5 insertions(+)

Index: linux/drivers/acpi/scan.c
===================================================================
--- linux.orig/drivers/acpi/scan.c
+++ linux/drivers/acpi/scan.c
@@ -707,6 +707,8 @@ static void acpi_device_unregister(struc

acpi_device_remove_files(device);
device_unregister(&device->dev);
+
+ device->handle = ERR_PTR(-ENODEV);
}

/* --------------------------------------------------------------------------
Index: linux/drivers/pnp/pnpacpi/rsparser.c
===================================================================
--- linux.orig/drivers/pnp/pnpacpi/rsparser.c
+++ linux/drivers/pnp/pnpacpi/rsparser.c
@@ -611,6 +611,9 @@ int pnpacpi_build_resource_template(stru
int res_cnt = 0;
acpi_status status;

+ if (WARN_ON_ONCE(IS_ERR(handle)))
+ return PTR_ERR(handle);
+
status = acpi_walk_resources(handle, METHOD_NAME__CRS,
pnpacpi_count_resources, &res_cnt);
if (ACPI_FAILURE(status)) {

--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.