2021-07-24 15:59:58

by Jens Axboe

[permalink] [raw]
Subject: 5.14-rc failure to resume

Hi,

I ran into this when doing the last bit of testing on pending changes
for this release on the laptop. Outside of running testing on these
changes, I always build and boot current -git and my changes on my
laptop as well.

5.14-rc1 + changes works fine, current -git and changes fail to resume
every single time. I just get a black screen. Tip of tree before merging
fixes is:

commit 704f4cba43d4ed31ef4beb422313f1263d87bc55 (origin/master, origin/HEAD, master)
Merge: 05daae0fb033 0077a5008272
Author: Linus Torvalds <[email protected]>
Date: Fri Jul 23 11:30:12 2021 -0700

Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client

Since bisection takes forever on the laptop (gen7 x1 carbon), I
opportunistically reverted some of the most recent git pulls:

- ec6badfbe1cde0eb2bec4a0b8f6e738171156b5b (acpi changes)
- 1d597682d3e669ec7021aa33d088ed3d136a5149 (driver-core changes)
- 74738c556db6c7f780a8b98340937e55b72c896a (usb changes)
- e7562a00c1f54116f5a058e7e3ddd500188f60b2 (sound changes)
- 8baef6386baaefb776bdd09b5c7630cf057c51c6 (drm changes)

as they could potentially be involved, but even with all of those
reverted it still won't resume.

Sending this out in case someone has already reported this and I just
couldn't find it. If this is a new/unknown issues, I'll go ahead and
bisect it.

--
Jens Axboe


2021-07-24 17:11:56

by Jens Axboe

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume

On 7/24/21 9:57 AM, Jens Axboe wrote:
> Hi,
>
> I ran into this when doing the last bit of testing on pending changes
> for this release on the laptop. Outside of running testing on these
> changes, I always build and boot current -git and my changes on my
> laptop as well.
>
> 5.14-rc1 + changes works fine, current -git and changes fail to resume
> every single time. I just get a black screen. Tip of tree before merging
> fixes is:

Typo, 5.14-rc2 + changes. Hence the problematic change is sometime between
-rc2 and current -git.

Running a bisect now but expecting it to take all day, will report
what it uncovers.

--
Jens Axboe

2021-07-24 17:59:39

by Jens Axboe

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume

On 7/24/21 9:57 AM, Jens Axboe wrote:
> Hi,
>
> I ran into this when doing the last bit of testing on pending changes
> for this release on the laptop. Outside of running testing on these
> changes, I always build and boot current -git and my changes on my
> laptop as well.
>
> 5.14-rc1 + changes works fine, current -git and changes fail to resume
> every single time. I just get a black screen. Tip of tree before merging
> fixes is:
>
> commit 704f4cba43d4ed31ef4beb422313f1263d87bc55 (origin/master, origin/HEAD, master)
> Merge: 05daae0fb033 0077a5008272
> Author: Linus Torvalds <[email protected]>
> Date: Fri Jul 23 11:30:12 2021 -0700
>
> Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
>
> Since bisection takes forever on the laptop (gen7 x1 carbon), I
> opportunistically reverted some of the most recent git pulls:
>
> - ec6badfbe1cde0eb2bec4a0b8f6e738171156b5b (acpi changes)
> - 1d597682d3e669ec7021aa33d088ed3d136a5149 (driver-core changes)
> - 74738c556db6c7f780a8b98340937e55b72c896a (usb changes)
> - e7562a00c1f54116f5a058e7e3ddd500188f60b2 (sound changes)
> - 8baef6386baaefb776bdd09b5c7630cf057c51c6 (drm changes)
>
> as they could potentially be involved, but even with all of those
> reverted it still won't resume.
>
> Sending this out in case someone has already reported this and I just
> couldn't find it. If this is a new/unknown issues, I'll go ahead and
> bisect it.

Ran a bisect, and it pinpoints:

71f6428332844f38c7cb10461d9f29e9c9b983a0 is the first bad commit
commit 71f6428332844f38c7cb10461d9f29e9c9b983a0
Author: Andy Shevchenko <[email protected]>
Date: Mon Jul 12 21:21:21 2021 +0300

ACPI: utils: Fix reference counting in for_each_acpi_dev_match()

which seems odd, as it worked for me with the acpi changes reverted. It
could be that it _sometimes_ works with that commit, not sure. Adding
relevant folks to the CC.

I'm going to revert this on top of current master and run with that
and see if it does 10 successful resumes.

--
Jens Axboe

2021-07-24 19:45:55

by Jens Axboe

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume

On 7/24/21 11:56 AM, Jens Axboe wrote:
> On 7/24/21 9:57 AM, Jens Axboe wrote:
>> Hi,
>>
>> I ran into this when doing the last bit of testing on pending changes
>> for this release on the laptop. Outside of running testing on these
>> changes, I always build and boot current -git and my changes on my
>> laptop as well.
>>
>> 5.14-rc1 + changes works fine, current -git and changes fail to resume
>> every single time. I just get a black screen. Tip of tree before merging
>> fixes is:
>>
>> commit 704f4cba43d4ed31ef4beb422313f1263d87bc55 (origin/master, origin/HEAD, master)
>> Merge: 05daae0fb033 0077a5008272
>> Author: Linus Torvalds <[email protected]>
>> Date: Fri Jul 23 11:30:12 2021 -0700
>>
>> Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
>>
>> Since bisection takes forever on the laptop (gen7 x1 carbon), I
>> opportunistically reverted some of the most recent git pulls:
>>
>> - ec6badfbe1cde0eb2bec4a0b8f6e738171156b5b (acpi changes)
>> - 1d597682d3e669ec7021aa33d088ed3d136a5149 (driver-core changes)
>> - 74738c556db6c7f780a8b98340937e55b72c896a (usb changes)
>> - e7562a00c1f54116f5a058e7e3ddd500188f60b2 (sound changes)
>> - 8baef6386baaefb776bdd09b5c7630cf057c51c6 (drm changes)
>>
>> as they could potentially be involved, but even with all of those
>> reverted it still won't resume.
>>
>> Sending this out in case someone has already reported this and I just
>> couldn't find it. If this is a new/unknown issues, I'll go ahead and
>> bisect it.
>
> Ran a bisect, and it pinpoints:
>
> 71f6428332844f38c7cb10461d9f29e9c9b983a0 is the first bad commit
> commit 71f6428332844f38c7cb10461d9f29e9c9b983a0
> Author: Andy Shevchenko <[email protected]>
> Date: Mon Jul 12 21:21:21 2021 +0300
>
> ACPI: utils: Fix reference counting in for_each_acpi_dev_match()
>
> which seems odd, as it worked for me with the acpi changes reverted. It
> could be that it _sometimes_ works with that commit, not sure. Adding
> relevant folks to the CC.
>
> I'm going to revert this on top of current master and run with that
> and see if it does 10 successful resumes.

This does appear to be the culprit. With it reverted on top of current
master (and with the block and io_uring changes pulled in too), the
kernel survives many resumes without issue.

--
Jens Axboe

2021-07-24 19:54:37

by Andy Shevchenko

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume

On Sat, Jul 24, 2021 at 8:56 PM Jens Axboe <[email protected]> wrote:
> On 7/24/21 9:57 AM, Jens Axboe wrote:

> > I ran into this when doing the last bit of testing on pending changes
> > for this release on the laptop. Outside of running testing on these
> > changes, I always build and boot current -git and my changes on my
> > laptop as well.
> >
> > 5.14-rc1 + changes works fine, current -git and changes fail to resume
> > every single time. I just get a black screen. Tip of tree before merging
> > fixes is:
> >
> > commit 704f4cba43d4ed31ef4beb422313f1263d87bc55 (origin/master, origin/HEAD, master)
> > Merge: 05daae0fb033 0077a5008272
> > Author: Linus Torvalds <[email protected]>
> > Date: Fri Jul 23 11:30:12 2021 -0700
> >
> > Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
> >
> > Since bisection takes forever on the laptop (gen7 x1 carbon), I
> > opportunistically reverted some of the most recent git pulls:
> >
> > - ec6badfbe1cde0eb2bec4a0b8f6e738171156b5b (acpi changes)
> > - 1d597682d3e669ec7021aa33d088ed3d136a5149 (driver-core changes)
> > - 74738c556db6c7f780a8b98340937e55b72c896a (usb changes)
> > - e7562a00c1f54116f5a058e7e3ddd500188f60b2 (sound changes)
> > - 8baef6386baaefb776bdd09b5c7630cf057c51c6 (drm changes)
> >
> > as they could potentially be involved, but even with all of those
> > reverted it still won't resume.
> >
> > Sending this out in case someone has already reported this and I just
> > couldn't find it. If this is a new/unknown issues, I'll go ahead and
> > bisect it.
>
> Ran a bisect, and it pinpoints:

Thanks for the report!

> 71f6428332844f38c7cb10461d9f29e9c9b983a0 is the first bad commit
> commit 71f6428332844f38c7cb10461d9f29e9c9b983a0
> Author: Andy Shevchenko <[email protected]>
> Date: Mon Jul 12 21:21:21 2021 +0300
>
> ACPI: utils: Fix reference counting in for_each_acpi_dev_match()
>
> which seems odd, as it worked for me with the acpi changes reverted. It
> could be that it _sometimes_ works with that commit, not sure. Adding
> relevant folks to the CC.
>
> I'm going to revert this on top of current master and run with that
> and see if it does 10 successful resumes.

This commit touches two parts (and API) EFI for Apple devices (seems
not your case) and CIO2 bridge (Camera device on Intel Sky Lake and
Kaby Lake machines). The EFI code runs at boot time AFAIU and CIO2
code runs at device's ->probe() time. I'm a bit puzzled as to why it
affects resume parts... Daniel, any ideas?

--
With Best Regards,
Andy Shevchenko

2021-07-24 20:08:22

by Andy Shevchenko

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume

On Sat, Jul 24, 2021 at 10:43 PM Jens Axboe <[email protected]> wrote:
> On 7/24/21 11:56 AM, Jens Axboe wrote:
> > On 7/24/21 9:57 AM, Jens Axboe wrote:
> >> Hi,
> >>
> >> I ran into this when doing the last bit of testing on pending changes
> >> for this release on the laptop. Outside of running testing on these
> >> changes, I always build and boot current -git and my changes on my
> >> laptop as well.
> >>
> >> 5.14-rc1 + changes works fine, current -git and changes fail to resume
> >> every single time. I just get a black screen. Tip of tree before merging
> >> fixes is:
> >>
> >> commit 704f4cba43d4ed31ef4beb422313f1263d87bc55 (origin/master, origin/HEAD, master)
> >> Merge: 05daae0fb033 0077a5008272
> >> Author: Linus Torvalds <[email protected]>
> >> Date: Fri Jul 23 11:30:12 2021 -0700
> >>
> >> Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
> >>
> >> Since bisection takes forever on the laptop (gen7 x1 carbon), I
> >> opportunistically reverted some of the most recent git pulls:
> >>
> >> - ec6badfbe1cde0eb2bec4a0b8f6e738171156b5b (acpi changes)
> >> - 1d597682d3e669ec7021aa33d088ed3d136a5149 (driver-core changes)
> >> - 74738c556db6c7f780a8b98340937e55b72c896a (usb changes)
> >> - e7562a00c1f54116f5a058e7e3ddd500188f60b2 (sound changes)
> >> - 8baef6386baaefb776bdd09b5c7630cf057c51c6 (drm changes)
> >>
> >> as they could potentially be involved, but even with all of those
> >> reverted it still won't resume.
> >>
> >> Sending this out in case someone has already reported this and I just
> >> couldn't find it. If this is a new/unknown issues, I'll go ahead and
> >> bisect it.
> >
> > Ran a bisect, and it pinpoints:
> >
> > 71f6428332844f38c7cb10461d9f29e9c9b983a0 is the first bad commit
> > commit 71f6428332844f38c7cb10461d9f29e9c9b983a0
> > Author: Andy Shevchenko <[email protected]>
> > Date: Mon Jul 12 21:21:21 2021 +0300
> >
> > ACPI: utils: Fix reference counting in for_each_acpi_dev_match()
> >
> > which seems odd, as it worked for me with the acpi changes reverted. It
> > could be that it _sometimes_ works with that commit, not sure. Adding
> > relevant folks to the CC.
> >
> > I'm going to revert this on top of current master and run with that
> > and see if it does 10 successful resumes.
>
> This does appear to be the culprit. With it reverted on top of current
> master (and with the block and io_uring changes pulled in too), the
> kernel survives many resumes without issue.

If I read correctly it seems at one of the local rebase I might have
missed something.

Can you replace sensor->adev with adev here

https://elixir.bootlin.com/linux/v5.14-rc2/source/drivers/media/pci/intel/ipu3/cio2-bridge.c#L231

and retest?

--
With Best Regards,
Andy Shevchenko

2021-07-24 20:25:32

by Daniel Scally

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume



On 24/07/2021 20:49, Andy Shevchenko wrote:
> On Sat, Jul 24, 2021 at 8:56 PM Jens Axboe <[email protected]> wrote:
>> On 7/24/21 9:57 AM, Jens Axboe wrote:
>
>>> I ran into this when doing the last bit of testing on pending changes
>>> for this release on the laptop. Outside of running testing on these
>>> changes, I always build and boot current -git and my changes on my
>>> laptop as well.
>>>
>>> 5.14-rc1 + changes works fine, current -git and changes fail to resume
>>> every single time. I just get a black screen. Tip of tree before merging
>>> fixes is:
>>>
>>> commit 704f4cba43d4ed31ef4beb422313f1263d87bc55 (origin/master, origin/HEAD, master)
>>> Merge: 05daae0fb033 0077a5008272
>>> Author: Linus Torvalds <[email protected]>
>>> Date: Fri Jul 23 11:30:12 2021 -0700
>>>
>>> Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
>>>
>>> Since bisection takes forever on the laptop (gen7 x1 carbon), I
>>> opportunistically reverted some of the most recent git pulls:
>>>
>>> - ec6badfbe1cde0eb2bec4a0b8f6e738171156b5b (acpi changes)
>>> - 1d597682d3e669ec7021aa33d088ed3d136a5149 (driver-core changes)
>>> - 74738c556db6c7f780a8b98340937e55b72c896a (usb changes)
>>> - e7562a00c1f54116f5a058e7e3ddd500188f60b2 (sound changes)
>>> - 8baef6386baaefb776bdd09b5c7630cf057c51c6 (drm changes)
>>>
>>> as they could potentially be involved, but even with all of those
>>> reverted it still won't resume.
>>>
>>> Sending this out in case someone has already reported this and I just
>>> couldn't find it. If this is a new/unknown issues, I'll go ahead and
>>> bisect it.
>>
>> Ran a bisect, and it pinpoints:
>
> Thanks for the report!
>
>> 71f6428332844f38c7cb10461d9f29e9c9b983a0 is the first bad commit
>> commit 71f6428332844f38c7cb10461d9f29e9c9b983a0
>> Author: Andy Shevchenko <[email protected]>
>> Date: Mon Jul 12 21:21:21 2021 +0300
>>
>> ACPI: utils: Fix reference counting in for_each_acpi_dev_match()
>>
>> which seems odd, as it worked for me with the acpi changes reverted. It
>> could be that it _sometimes_ works with that commit, not sure. Adding
>> relevant folks to the CC.
>>
>> I'm going to revert this on top of current master and run with that
>> and see if it does 10 successful resumes.
>
> This commit touches two parts (and API) EFI for Apple devices (seems
> not your case) and CIO2 bridge (Camera device on Intel Sky Lake and
> Kaby Lake machines). The EFI code runs at boot time AFAIU and CIO2
> code runs at device's ->probe() time. I'm a bit puzzled as to why it
> affects resume parts... Daniel, any ideas?

Not off the top of my head, puzzles me too. I'm building to test now.

2021-07-24 20:28:28

by Jens Axboe

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume

On 7/24/21 2:05 PM, Andy Shevchenko wrote:
> On Sat, Jul 24, 2021 at 10:43 PM Jens Axboe <[email protected]> wrote:
>> On 7/24/21 11:56 AM, Jens Axboe wrote:
>>> On 7/24/21 9:57 AM, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> I ran into this when doing the last bit of testing on pending changes
>>>> for this release on the laptop. Outside of running testing on these
>>>> changes, I always build and boot current -git and my changes on my
>>>> laptop as well.
>>>>
>>>> 5.14-rc1 + changes works fine, current -git and changes fail to resume
>>>> every single time. I just get a black screen. Tip of tree before merging
>>>> fixes is:
>>>>
>>>> commit 704f4cba43d4ed31ef4beb422313f1263d87bc55 (origin/master, origin/HEAD, master)
>>>> Merge: 05daae0fb033 0077a5008272
>>>> Author: Linus Torvalds <[email protected]>
>>>> Date: Fri Jul 23 11:30:12 2021 -0700
>>>>
>>>> Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
>>>>
>>>> Since bisection takes forever on the laptop (gen7 x1 carbon), I
>>>> opportunistically reverted some of the most recent git pulls:
>>>>
>>>> - ec6badfbe1cde0eb2bec4a0b8f6e738171156b5b (acpi changes)
>>>> - 1d597682d3e669ec7021aa33d088ed3d136a5149 (driver-core changes)
>>>> - 74738c556db6c7f780a8b98340937e55b72c896a (usb changes)
>>>> - e7562a00c1f54116f5a058e7e3ddd500188f60b2 (sound changes)
>>>> - 8baef6386baaefb776bdd09b5c7630cf057c51c6 (drm changes)
>>>>
>>>> as they could potentially be involved, but even with all of those
>>>> reverted it still won't resume.
>>>>
>>>> Sending this out in case someone has already reported this and I just
>>>> couldn't find it. If this is a new/unknown issues, I'll go ahead and
>>>> bisect it.
>>>
>>> Ran a bisect, and it pinpoints:
>>>
>>> 71f6428332844f38c7cb10461d9f29e9c9b983a0 is the first bad commit
>>> commit 71f6428332844f38c7cb10461d9f29e9c9b983a0
>>> Author: Andy Shevchenko <[email protected]>
>>> Date: Mon Jul 12 21:21:21 2021 +0300
>>>
>>> ACPI: utils: Fix reference counting in for_each_acpi_dev_match()
>>>
>>> which seems odd, as it worked for me with the acpi changes reverted. It
>>> could be that it _sometimes_ works with that commit, not sure. Adding
>>> relevant folks to the CC.
>>>
>>> I'm going to revert this on top of current master and run with that
>>> and see if it does 10 successful resumes.
>>
>> This does appear to be the culprit. With it reverted on top of current
>> master (and with the block and io_uring changes pulled in too), the
>> kernel survives many resumes without issue.
>
> If I read correctly it seems at one of the local rebase I might have
> missed something.
>
> Can you replace sensor->adev with adev here
>
> https://elixir.bootlin.com/linux/v5.14-rc2/source/drivers/media/pci/intel/ipu3/cio2-bridge.c#L231
>
> and retest?

Tried that, doesn't work unfortunately. Hung on first resume, just like
current -git with that single line edit.

--
Jens Axboe

2021-07-24 20:30:39

by Jens Axboe

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume

On 7/24/21 1:49 PM, Andy Shevchenko wrote:
> On Sat, Jul 24, 2021 at 8:56 PM Jens Axboe <[email protected]> wrote:
>> On 7/24/21 9:57 AM, Jens Axboe wrote:
>
>>> I ran into this when doing the last bit of testing on pending changes
>>> for this release on the laptop. Outside of running testing on these
>>> changes, I always build and boot current -git and my changes on my
>>> laptop as well.
>>>
>>> 5.14-rc1 + changes works fine, current -git and changes fail to resume
>>> every single time. I just get a black screen. Tip of tree before merging
>>> fixes is:
>>>
>>> commit 704f4cba43d4ed31ef4beb422313f1263d87bc55 (origin/master, origin/HEAD, master)
>>> Merge: 05daae0fb033 0077a5008272
>>> Author: Linus Torvalds <[email protected]>
>>> Date: Fri Jul 23 11:30:12 2021 -0700
>>>
>>> Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
>>>
>>> Since bisection takes forever on the laptop (gen7 x1 carbon), I
>>> opportunistically reverted some of the most recent git pulls:
>>>
>>> - ec6badfbe1cde0eb2bec4a0b8f6e738171156b5b (acpi changes)
>>> - 1d597682d3e669ec7021aa33d088ed3d136a5149 (driver-core changes)
>>> - 74738c556db6c7f780a8b98340937e55b72c896a (usb changes)
>>> - e7562a00c1f54116f5a058e7e3ddd500188f60b2 (sound changes)
>>> - 8baef6386baaefb776bdd09b5c7630cf057c51c6 (drm changes)
>>>
>>> as they could potentially be involved, but even with all of those
>>> reverted it still won't resume.
>>>
>>> Sending this out in case someone has already reported this and I just
>>> couldn't find it. If this is a new/unknown issues, I'll go ahead and
>>> bisect it.
>>
>> Ran a bisect, and it pinpoints:
>
> Thanks for the report!
>
>> 71f6428332844f38c7cb10461d9f29e9c9b983a0 is the first bad commit
>> commit 71f6428332844f38c7cb10461d9f29e9c9b983a0
>> Author: Andy Shevchenko <[email protected]>
>> Date: Mon Jul 12 21:21:21 2021 +0300
>>
>> ACPI: utils: Fix reference counting in for_each_acpi_dev_match()
>>
>> which seems odd, as it worked for me with the acpi changes reverted. It
>> could be that it _sometimes_ works with that commit, not sure. Adding
>> relevant folks to the CC.
>>
>> I'm going to revert this on top of current master and run with that
>> and see if it does 10 successful resumes.
>
> This commit touches two parts (and API) EFI for Apple devices (seems
> not your case) and CIO2 bridge (Camera device on Intel Sky Lake and
> Kaby Lake machines). The EFI code runs at boot time AFAIU and CIO2
> code runs at device's ->probe() time. I'm a bit puzzled as to why it
> affects resume parts... Daniel, any ideas?

That does match my camera type at least, but apart from that, no ideas.
Does seem to be very reliable, as per the one-liner test I just made.
Unfortunately I don't have a way to log a crash, since it's just a
laptop...

Rebuilding with the revert again, should be solid (I'll retest, since
you never know with issues like this).

--
Jens Axboe

2021-07-24 20:36:12

by Jens Axboe

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume

On 7/24/21 2:28 PM, Jens Axboe wrote:
> On 7/24/21 1:49 PM, Andy Shevchenko wrote:
>> On Sat, Jul 24, 2021 at 8:56 PM Jens Axboe <[email protected]> wrote:
>>> On 7/24/21 9:57 AM, Jens Axboe wrote:
>>
>>>> I ran into this when doing the last bit of testing on pending
>>>> changes for this release on the laptop. Outside of running testing
>>>> on these changes, I always build and boot current -git and my
>>>> changes on my laptop as well.
>>>>
>>>> 5.14-rc1 + changes works fine, current -git and changes fail to
>>>> resume every single time. I just get a black screen. Tip of tree
>>>> before merging fixes is:
>>>>
>>>> commit 704f4cba43d4ed31ef4beb422313f1263d87bc55 (origin/master, origin/HEAD, master)
>>>> Merge: 05daae0fb033 0077a5008272
>>>> Author: Linus Torvalds <[email protected]>
>>>> Date: Fri Jul 23 11:30:12 2021 -0700
>>>>
>>>> Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
>>>>
>>>> Since bisection takes forever on the laptop (gen7 x1 carbon), I
>>>> opportunistically reverted some of the most recent git pulls:
>>>>
>>>> - ec6badfbe1cde0eb2bec4a0b8f6e738171156b5b (acpi changes)
>>>> - 1d597682d3e669ec7021aa33d088ed3d136a5149 (driver-core changes)
>>>> - 74738c556db6c7f780a8b98340937e55b72c896a (usb changes)
>>>> - e7562a00c1f54116f5a058e7e3ddd500188f60b2 (sound changes)
>>>> - 8baef6386baaefb776bdd09b5c7630cf057c51c6 (drm changes)
>>>>
>>>> as they could potentially be involved, but even with all of those
>>>> reverted it still won't resume.
>>>>
>>>> Sending this out in case someone has already reported this and I just
>>>> couldn't find it. If this is a new/unknown issues, I'll go ahead and
>>>> bisect it.
>>>
>>> Ran a bisect, and it pinpoints:
>>
>> Thanks for the report!
>>
>>> 71f6428332844f38c7cb10461d9f29e9c9b983a0 is the first bad commit
>>> commit 71f6428332844f38c7cb10461d9f29e9c9b983a0
>>> Author: Andy Shevchenko <[email protected]>
>>> Date: Mon Jul 12 21:21:21 2021 +0300
>>>
>>> ACPI: utils: Fix reference counting in for_each_acpi_dev_match()
>>>
>>> which seems odd, as it worked for me with the acpi changes reverted.
>>> It could be that it _sometimes_ works with that commit, not sure.
>>> Adding relevant folks to the CC.
>>>
>>> I'm going to revert this on top of current master and run with that
>>> and see if it does 10 successful resumes.
>>
>> This commit touches two parts (and API) EFI for Apple devices (seems
>> not your case) and CIO2 bridge (Camera device on Intel Sky Lake and
>> Kaby Lake machines). The EFI code runs at boot time AFAIU and CIO2
>> code runs at device's ->probe() time. I'm a bit puzzled as to why it
>> affects resume parts... Daniel, any ideas?
>
> That does match my camera type at least, but apart from that, no
> ideas. Does seem to be very reliable, as per the one-liner test I just
> made. Unfortunately I don't have a way to log a crash, since it's just
> a laptop...
>
> Rebuilding with the revert again, should be solid (I'll retest, since
> you never know with issues like this).

Works fine again with the revert. It really does seem to be that commit
that is problematic.

--
Jens Axboe

2021-07-24 20:50:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume

On Sat, Jul 24, 2021 at 12:44 PM Jens Axboe <[email protected]> wrote:
>
> This does appear to be the culprit. With it reverted on top of current
> master (and with the block and io_uring changes pulled in too), the
> kernel survives many resumes without issue.

That commit seems fundamentally buggy.

It makes "acpi_dev_get_next_match_dev()" always do

acpi_dev_put(adev);

to put the previous device, but "adev" is perfectly valid as NULL, and
acpi_dev_get_next_match_dev() even tests for it:

struct device *start = adev ? &adev->dev : NULL;

so it can - and will - do

acpi_dev_put(NULL);

which does

put_device(&adev->dev);

and passes in an invalid pointer to put_device().

And yes, that adev very much can be NULL, with drivers/acpi/utils.c
even passing it in explicitly:

struct acpi_device *
acpi_dev_get_first_match_dev(const char *hid, const char *uid, s64 hrv)
{
return acpi_dev_get_next_match_dev(NULL, hid, uid, hrv);
}
EXPORT_SYMBOL(acpi_dev_get_first_match_dev);

Am I missing something? How does that code work at all for anybody?

I probably _am_ missing something.

Linus

2021-07-24 21:04:02

by Andy Shevchenko

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume

On Sat, Jul 24, 2021 at 11:48 PM Linus Torvalds
<[email protected]> wrote:
>
> On Sat, Jul 24, 2021 at 12:44 PM Jens Axboe <[email protected]> wrote:
> >
> > This does appear to be the culprit. With it reverted on top of current
> > master (and with the block and io_uring changes pulled in too), the
> > kernel survives many resumes without issue.
>
> That commit seems fundamentally buggy.
>
> It makes "acpi_dev_get_next_match_dev()" always do
>
> acpi_dev_put(adev);
>
> to put the previous device, but "adev" is perfectly valid as NULL, and
> acpi_dev_get_next_match_dev() even tests for it:
>
> struct device *start = adev ? &adev->dev : NULL;
>
> so it can - and will - do
>
> acpi_dev_put(NULL);
>
> which does
>
> put_device(&adev->dev);
>
> and passes in an invalid pointer to put_device().
>
> And yes, that adev very much can be NULL, with drivers/acpi/utils.c
> even passing it in explicitly:
>
> struct acpi_device *
> acpi_dev_get_first_match_dev(const char *hid, const char *uid, s64 hrv)
> {
> return acpi_dev_get_next_match_dev(NULL, hid, uid, hrv);
> }
> EXPORT_SYMBOL(acpi_dev_get_first_match_dev);
>
> Am I missing something? How does that code work at all for anybody?
>
> I probably _am_ missing something.

Yeah, the two changes (acpi_dev_put() and this fix) were in separate
submissions during different times. So, what we have here:
1) this fix misses that one line to be changed and after adding it
2) reveals UAF, or i.o.w. NULL dereference which is a bug in acpi_dev_put().

acpi_dev_put() has to be

if (adev)
put_device(&adev->dev);


--
With Best Regards,
Andy Shevchenko

2021-07-24 21:17:11

by Andy Shevchenko

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume

On Sat, Jul 24, 2021 at 11:26 PM Jens Axboe <[email protected]> wrote:
>
> On 7/24/21 2:05 PM, Andy Shevchenko wrote:
> > On Sat, Jul 24, 2021 at 10:43 PM Jens Axboe <[email protected]> wrote:
> >> On 7/24/21 11:56 AM, Jens Axboe wrote:
> >>> On 7/24/21 9:57 AM, Jens Axboe wrote:
> >>>> Hi,
> >>>>
> >>>> I ran into this when doing the last bit of testing on pending changes
> >>>> for this release on the laptop. Outside of running testing on these
> >>>> changes, I always build and boot current -git and my changes on my
> >>>> laptop as well.
> >>>>
> >>>> 5.14-rc1 + changes works fine, current -git and changes fail to resume
> >>>> every single time. I just get a black screen. Tip of tree before merging
> >>>> fixes is:
> >>>>
> >>>> commit 704f4cba43d4ed31ef4beb422313f1263d87bc55 (origin/master, origin/HEAD, master)
> >>>> Merge: 05daae0fb033 0077a5008272
> >>>> Author: Linus Torvalds <[email protected]>
> >>>> Date: Fri Jul 23 11:30:12 2021 -0700
> >>>>
> >>>> Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
> >>>>
> >>>> Since bisection takes forever on the laptop (gen7 x1 carbon), I
> >>>> opportunistically reverted some of the most recent git pulls:
> >>>>
> >>>> - ec6badfbe1cde0eb2bec4a0b8f6e738171156b5b (acpi changes)
> >>>> - 1d597682d3e669ec7021aa33d088ed3d136a5149 (driver-core changes)
> >>>> - 74738c556db6c7f780a8b98340937e55b72c896a (usb changes)
> >>>> - e7562a00c1f54116f5a058e7e3ddd500188f60b2 (sound changes)
> >>>> - 8baef6386baaefb776bdd09b5c7630cf057c51c6 (drm changes)
> >>>>
> >>>> as they could potentially be involved, but even with all of those
> >>>> reverted it still won't resume.
> >>>>
> >>>> Sending this out in case someone has already reported this and I just
> >>>> couldn't find it. If this is a new/unknown issues, I'll go ahead and
> >>>> bisect it.
> >>>
> >>> Ran a bisect, and it pinpoints:
> >>>
> >>> 71f6428332844f38c7cb10461d9f29e9c9b983a0 is the first bad commit
> >>> commit 71f6428332844f38c7cb10461d9f29e9c9b983a0
> >>> Author: Andy Shevchenko <[email protected]>
> >>> Date: Mon Jul 12 21:21:21 2021 +0300
> >>>
> >>> ACPI: utils: Fix reference counting in for_each_acpi_dev_match()
> >>>
> >>> which seems odd, as it worked for me with the acpi changes reverted. It
> >>> could be that it _sometimes_ works with that commit, not sure. Adding
> >>> relevant folks to the CC.
> >>>
> >>> I'm going to revert this on top of current master and run with that
> >>> and see if it does 10 successful resumes.
> >>
> >> This does appear to be the culprit. With it reverted on top of current
> >> master (and with the block and io_uring changes pulled in too), the
> >> kernel survives many resumes without issue.
> >
> > If I read correctly it seems at one of the local rebase I might have
> > missed something.
> >
> > Can you replace sensor->adev with adev here
> >
> > https://elixir.bootlin.com/linux/v5.14-rc2/source/drivers/media/pci/intel/ipu3/cio2-bridge.c#L231
> >
> > and retest?
>
> Tried that, doesn't work unfortunately. Hung on first resume, just like
> current -git with that single line edit.

So, does it work along with the acpi_dev_put() converted to be

if (adev)
put_device(&adev->dev);

?

--
With Best Regards,
Andy Shevchenko

2021-07-24 21:25:40

by Daniel Scally

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume



On 24/07/2021 22:15, Andy Shevchenko wrote:
> On Sat, Jul 24, 2021 at 11:26 PM Jens Axboe <[email protected]> wrote:
>>
>> On 7/24/21 2:05 PM, Andy Shevchenko wrote:
>>> On Sat, Jul 24, 2021 at 10:43 PM Jens Axboe <[email protected]> wrote:
>>>> On 7/24/21 11:56 AM, Jens Axboe wrote:
>>>>> On 7/24/21 9:57 AM, Jens Axboe wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I ran into this when doing the last bit of testing on pending changes
>>>>>> for this release on the laptop. Outside of running testing on these
>>>>>> changes, I always build and boot current -git and my changes on my
>>>>>> laptop as well.
>>>>>>
>>>>>> 5.14-rc1 + changes works fine, current -git and changes fail to resume
>>>>>> every single time. I just get a black screen. Tip of tree before merging
>>>>>> fixes is:
>>>>>>
>>>>>> commit 704f4cba43d4ed31ef4beb422313f1263d87bc55 (origin/master, origin/HEAD, master)
>>>>>> Merge: 05daae0fb033 0077a5008272
>>>>>> Author: Linus Torvalds <[email protected]>
>>>>>> Date: Fri Jul 23 11:30:12 2021 -0700
>>>>>>
>>>>>> Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
>>>>>>
>>>>>> Since bisection takes forever on the laptop (gen7 x1 carbon), I
>>>>>> opportunistically reverted some of the most recent git pulls:
>>>>>>
>>>>>> - ec6badfbe1cde0eb2bec4a0b8f6e738171156b5b (acpi changes)
>>>>>> - 1d597682d3e669ec7021aa33d088ed3d136a5149 (driver-core changes)
>>>>>> - 74738c556db6c7f780a8b98340937e55b72c896a (usb changes)
>>>>>> - e7562a00c1f54116f5a058e7e3ddd500188f60b2 (sound changes)
>>>>>> - 8baef6386baaefb776bdd09b5c7630cf057c51c6 (drm changes)
>>>>>>
>>>>>> as they could potentially be involved, but even with all of those
>>>>>> reverted it still won't resume.
>>>>>>
>>>>>> Sending this out in case someone has already reported this and I just
>>>>>> couldn't find it. If this is a new/unknown issues, I'll go ahead and
>>>>>> bisect it.
>>>>>
>>>>> Ran a bisect, and it pinpoints:
>>>>>
>>>>> 71f6428332844f38c7cb10461d9f29e9c9b983a0 is the first bad commit
>>>>> commit 71f6428332844f38c7cb10461d9f29e9c9b983a0
>>>>> Author: Andy Shevchenko <[email protected]>
>>>>> Date: Mon Jul 12 21:21:21 2021 +0300
>>>>>
>>>>> ACPI: utils: Fix reference counting in for_each_acpi_dev_match()
>>>>>
>>>>> which seems odd, as it worked for me with the acpi changes reverted. It
>>>>> could be that it _sometimes_ works with that commit, not sure. Adding
>>>>> relevant folks to the CC.
>>>>>
>>>>> I'm going to revert this on top of current master and run with that
>>>>> and see if it does 10 successful resumes.
>>>>
>>>> This does appear to be the culprit. With it reverted on top of current
>>>> master (and with the block and io_uring changes pulled in too), the
>>>> kernel survives many resumes without issue.
>>>
>>> If I read correctly it seems at one of the local rebase I might have
>>> missed something.
>>>
>>> Can you replace sensor->adev with adev here
>>>
>>> https://elixir.bootlin.com/linux/v5.14-rc2/source/drivers/media/pci/intel/ipu3/cio2-bridge.c#L231
>>>
>>> and retest?
>>
>> Tried that, doesn't work unfortunately. Hung on first resume, just like
>> current -git with that single line edit.
>
> So, does it work along with the acpi_dev_put() converted to be
>
> if (adev)
> put_device(&adev->dev);
>
> ?

For me; yes. I replicated the issue, and adding the if (adev) guard and
the previous change you mentioned clears it again.

2021-07-24 21:37:38

by Jens Axboe

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume

On 7/24/21 3:15 PM, Andy Shevchenko wrote:
> On Sat, Jul 24, 2021 at 11:26 PM Jens Axboe <[email protected]> wrote:
>>
>> On 7/24/21 2:05 PM, Andy Shevchenko wrote:
>>> On Sat, Jul 24, 2021 at 10:43 PM Jens Axboe <[email protected]> wrote:
>>>> On 7/24/21 11:56 AM, Jens Axboe wrote:
>>>>> On 7/24/21 9:57 AM, Jens Axboe wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I ran into this when doing the last bit of testing on pending changes
>>>>>> for this release on the laptop. Outside of running testing on these
>>>>>> changes, I always build and boot current -git and my changes on my
>>>>>> laptop as well.
>>>>>>
>>>>>> 5.14-rc1 + changes works fine, current -git and changes fail to resume
>>>>>> every single time. I just get a black screen. Tip of tree before merging
>>>>>> fixes is:
>>>>>>
>>>>>> commit 704f4cba43d4ed31ef4beb422313f1263d87bc55 (origin/master, origin/HEAD, master)
>>>>>> Merge: 05daae0fb033 0077a5008272
>>>>>> Author: Linus Torvalds <[email protected]>
>>>>>> Date: Fri Jul 23 11:30:12 2021 -0700
>>>>>>
>>>>>> Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
>>>>>>
>>>>>> Since bisection takes forever on the laptop (gen7 x1 carbon), I
>>>>>> opportunistically reverted some of the most recent git pulls:
>>>>>>
>>>>>> - ec6badfbe1cde0eb2bec4a0b8f6e738171156b5b (acpi changes)
>>>>>> - 1d597682d3e669ec7021aa33d088ed3d136a5149 (driver-core changes)
>>>>>> - 74738c556db6c7f780a8b98340937e55b72c896a (usb changes)
>>>>>> - e7562a00c1f54116f5a058e7e3ddd500188f60b2 (sound changes)
>>>>>> - 8baef6386baaefb776bdd09b5c7630cf057c51c6 (drm changes)
>>>>>>
>>>>>> as they could potentially be involved, but even with all of those
>>>>>> reverted it still won't resume.
>>>>>>
>>>>>> Sending this out in case someone has already reported this and I just
>>>>>> couldn't find it. If this is a new/unknown issues, I'll go ahead and
>>>>>> bisect it.
>>>>>
>>>>> Ran a bisect, and it pinpoints:
>>>>>
>>>>> 71f6428332844f38c7cb10461d9f29e9c9b983a0 is the first bad commit
>>>>> commit 71f6428332844f38c7cb10461d9f29e9c9b983a0
>>>>> Author: Andy Shevchenko <[email protected]>
>>>>> Date: Mon Jul 12 21:21:21 2021 +0300
>>>>>
>>>>> ACPI: utils: Fix reference counting in for_each_acpi_dev_match()
>>>>>
>>>>> which seems odd, as it worked for me with the acpi changes reverted. It
>>>>> could be that it _sometimes_ works with that commit, not sure. Adding
>>>>> relevant folks to the CC.
>>>>>
>>>>> I'm going to revert this on top of current master and run with that
>>>>> and see if it does 10 successful resumes.
>>>>
>>>> This does appear to be the culprit. With it reverted on top of current
>>>> master (and with the block and io_uring changes pulled in too), the
>>>> kernel survives many resumes without issue.
>>>
>>> If I read correctly it seems at one of the local rebase I might have
>>> missed something.
>>>
>>> Can you replace sensor->adev with adev here
>>>
>>> https://elixir.bootlin.com/linux/v5.14-rc2/source/drivers/media/pci/intel/ipu3/cio2-bridge.c#L231
>>>
>>> and retest?
>>
>> Tried that, doesn't work unfortunately. Hung on first resume, just like
>> current -git with that single line edit.
>
> So, does it work along with the acpi_dev_put() converted to be
>
> if (adev)
> put_device(&adev->dev);
>
> ?

Building, will test when it's done.

--
Jens Axboe

2021-07-24 21:42:18

by Jens Axboe

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume

On 7/24/21 3:35 PM, Jens Axboe wrote:
> On 7/24/21 3:15 PM, Andy Shevchenko wrote:
>> On Sat, Jul 24, 2021 at 11:26 PM Jens Axboe <[email protected]> wrote:
>>>
>>> On 7/24/21 2:05 PM, Andy Shevchenko wrote:
>>>> On Sat, Jul 24, 2021 at 10:43 PM Jens Axboe <[email protected]> wrote:
>>>>> On 7/24/21 11:56 AM, Jens Axboe wrote:
>>>>>> On 7/24/21 9:57 AM, Jens Axboe wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I ran into this when doing the last bit of testing on pending changes
>>>>>>> for this release on the laptop. Outside of running testing on these
>>>>>>> changes, I always build and boot current -git and my changes on my
>>>>>>> laptop as well.
>>>>>>>
>>>>>>> 5.14-rc1 + changes works fine, current -git and changes fail to resume
>>>>>>> every single time. I just get a black screen. Tip of tree before merging
>>>>>>> fixes is:
>>>>>>>
>>>>>>> commit 704f4cba43d4ed31ef4beb422313f1263d87bc55 (origin/master, origin/HEAD, master)
>>>>>>> Merge: 05daae0fb033 0077a5008272
>>>>>>> Author: Linus Torvalds <[email protected]>
>>>>>>> Date: Fri Jul 23 11:30:12 2021 -0700
>>>>>>>
>>>>>>> Merge tag 'ceph-for-5.14-rc3' of git://github.com/ceph/ceph-client
>>>>>>>
>>>>>>> Since bisection takes forever on the laptop (gen7 x1 carbon), I
>>>>>>> opportunistically reverted some of the most recent git pulls:
>>>>>>>
>>>>>>> - ec6badfbe1cde0eb2bec4a0b8f6e738171156b5b (acpi changes)
>>>>>>> - 1d597682d3e669ec7021aa33d088ed3d136a5149 (driver-core changes)
>>>>>>> - 74738c556db6c7f780a8b98340937e55b72c896a (usb changes)
>>>>>>> - e7562a00c1f54116f5a058e7e3ddd500188f60b2 (sound changes)
>>>>>>> - 8baef6386baaefb776bdd09b5c7630cf057c51c6 (drm changes)
>>>>>>>
>>>>>>> as they could potentially be involved, but even with all of those
>>>>>>> reverted it still won't resume.
>>>>>>>
>>>>>>> Sending this out in case someone has already reported this and I just
>>>>>>> couldn't find it. If this is a new/unknown issues, I'll go ahead and
>>>>>>> bisect it.
>>>>>>
>>>>>> Ran a bisect, and it pinpoints:
>>>>>>
>>>>>> 71f6428332844f38c7cb10461d9f29e9c9b983a0 is the first bad commit
>>>>>> commit 71f6428332844f38c7cb10461d9f29e9c9b983a0
>>>>>> Author: Andy Shevchenko <[email protected]>
>>>>>> Date: Mon Jul 12 21:21:21 2021 +0300
>>>>>>
>>>>>> ACPI: utils: Fix reference counting in for_each_acpi_dev_match()
>>>>>>
>>>>>> which seems odd, as it worked for me with the acpi changes reverted. It
>>>>>> could be that it _sometimes_ works with that commit, not sure. Adding
>>>>>> relevant folks to the CC.
>>>>>>
>>>>>> I'm going to revert this on top of current master and run with that
>>>>>> and see if it does 10 successful resumes.
>>>>>
>>>>> This does appear to be the culprit. With it reverted on top of current
>>>>> master (and with the block and io_uring changes pulled in too), the
>>>>> kernel survives many resumes without issue.
>>>>
>>>> If I read correctly it seems at one of the local rebase I might have
>>>> missed something.
>>>>
>>>> Can you replace sensor->adev with adev here
>>>>
>>>> https://elixir.bootlin.com/linux/v5.14-rc2/source/drivers/media/pci/intel/ipu3/cio2-bridge.c#L231
>>>>
>>>> and retest?
>>>
>>> Tried that, doesn't work unfortunately. Hung on first resume, just like
>>> current -git with that single line edit.
>>
>> So, does it work along with the acpi_dev_put() converted to be
>>
>> if (adev)
>> put_device(&adev->dev);
>>
>> ?
>
> Building, will test when it's done.

Yep, works with acpi_dev_put() checking for != NULL before doing the
put.

--
Jens Axboe

2021-07-24 22:39:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: 5.14-rc failure to resume

On Sat, Jul 24, 2021 at 2:39 PM Jens Axboe <[email protected]> wrote:
>
> Yep, works with acpi_dev_put() checking for != NULL before doing the
> put.

Ok, pushed the oneliner fix so that we get rid of this issue asap.

Linus