2023-12-29 12:08:52

by Borislav Petkov

Subject: i2c-designware: NULL ptr at RIP: 0010:regmap_read+0x12/0x70

Hi,

we're seeing the below splat in our testing of linux-next.

Disassembling Code: gives

17: 90 nop
18: f3 0f 1e fa endbr64
1c: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
21: 55 push %rbp
22: 48 89 e5 mov %rsp,%rbp
25: 41 55 push %r13
27: 41 54 push %r12
29: 53 push %rbx
2a:* 8b 87 fc 01 00 00 mov 0x1fc(%rdi),%eax <-- trapping instruction

which is

regmap_read:
endbr64
1: call __fentry__
.section __mcount_loc, "a",@progbits
.quad 1b
.previous
pushq %rbp #
movq %rsp, %rbp #,
pushq %r13 #
pushq %r12 #
pushq %rbx #
# drivers/base/regmap/regmap.c:2826: if (!IS_ALIGNED(reg, map->reg_stride))
movl 508(%rdi), %eax # map_12(D)->reg_stride, tmp107
^^^^^^^^

i.e., that @map argument is 0.
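
As a rough illustration of why the faulting address equals the field offset: with a NULL base pointer, the address of a member is just its offset within the struct. The sketch below uses a hypothetical stand-in struct (`fake_regmap`, not the real `struct regmap` layout) padded so that `reg_stride` sits at 0x1fc (508), matching the disassembly and CR2 above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct regmap: only the member offset matters. */
struct fake_regmap {
	unsigned char pad[0x1fc];	/* whatever precedes reg_stride (assumed) */
	int reg_stride;			/* lands at offset 0x1fc == 508 */
};

/*
 * Address the CPU would touch for map->reg_stride, computed arithmetically
 * so that nothing is actually dereferenced.
 */
static uintptr_t reg_stride_address(const struct fake_regmap *map)
{
	return (uintptr_t)map + offsetof(struct fake_regmap, reg_stride);
}
```

Calling `reg_stride_address(NULL)` yields 0x1fc on common platforms where a null pointer converts to integer 0, which is the same value the oops reports in CR2.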

Looking at the call stack, I see

2409205acd3c ("i2c: designware: fix __i2c_dw_disable() in case master is holding SCL low")

which does that dev->map deref in __i2c_dw_disable() but maybe ->map is
invalid by then...?

Just a stab in the dark anyway...

Thx.

[ 6.245173] i2c_designware AMDI0010:00: Unknown Synopsys component type: 0xffffffff
[ 6.252683] BUG: kernel NULL pointer dereference, address: 00000000000001fc
[ 6.256551] #PF: supervisor read access in kernel mode
[ 6.256551] #PF: error_code(0x0000) - not-present page
[ 6.256551] PGD 0
[ 6.256551] Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 6.256551] CPU: 32 PID: 211 Comm: kworker/32:0 Not tainted 6.7.0-rc6-next-20231222-1703820640818 #1
[ 6.256551] Workqueue: pm pm_runtime_work
[ 6.256551] RIP: 0010:regmap_read+0x12/0x70
[ 6.256551] Code: 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 <8b> 87 fc 01 00 00 83 e8 01 85 f0 75 42 48 89 fb 41 89 f4 49 89 d5
[ 6.256551] RSP: 0018:ff7fa5c740bcbc98 EFLAGS: 00010246
[ 6.256551] RAX: 0000000000000000 RBX: ff38ff5c159f1028 RCX: 0000000000000008
[ 6.256551] RDX: ff7fa5c740bcbcc4 RSI: 0000000000000034 RDI: 0000000000000000
[ 6.256551] RBP: ff7fa5c740bcbcb0 R08: ff38ff5c02ceb8b0 R09: ff38ff5c002a4500
[ 6.256551] R10: 0000000000000003 R11: 0000000000000003 R12: ff38ff5c159f1028
[ 6.256551] R13: 0000000000000000 R14: 0000000000000000 R15: ff38ff5c159ed8f4
[ 6.256551] FS: 0000000000000000(0000) GS:ff38ff6b0d200000(0000) knlGS:0000000000000000
[ 6.256551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6.256551] CR2: 00000000000001fc CR3: 000000007403c001 CR4: 0000000000771ef0
[ 6.256551] PKRU: 55555554
[ 6.256551] Call Trace:
[ 6.256551] <TASK>
[ 6.256551] ? show_regs+0x6d/0x80
[ 6.256551] ? __die+0x29/0x70
[ 6.256551] ? page_fault_oops+0x153/0x4a0
[ 6.256551] ? do_user_addr_fault+0x30f/0x6c0
[ 6.256551] ? exc_page_fault+0x7c/0x190
[ 6.256551] ? asm_exc_page_fault+0x2b/0x30
[ 6.256551] ? regmap_read+0x12/0x70
[ 6.256551] ? update_load_avg+0x82/0x7d0
[ 6.256551] __i2c_dw_disable+0x38/0x180
[ 6.256551] i2c_dw_disable+0x3f/0xb0
[ 6.256551] i2c_dw_runtime_suspend+0x33/0x50
[ 6.256551] ? __pfx_pm_generic_runtime_suspend+0x10/0x10
[ 6.256551] pm_generic_runtime_suspend+0x2f/0x40
[ 6.256551] __rpm_callback+0x48/0x120
[ 6.256551] ? __pfx_pm_generic_runtime_suspend+0x10/0x10
[ 6.256551] rpm_callback+0x66/0x70
[ 6.256551] ? __pfx_pm_generic_runtime_suspend+0x10/0x10
[ 6.256551] rpm_suspend+0x166/0x700
[ 6.256551] ? srso_alias_return_thunk+0x5/0xfbef5
[ 6.256551] ? __schedule+0x3df/0x1720
[ 6.256551] pm_runtime_work+0xb2/0xd0
[ 6.256551] process_one_work+0x178/0x350
[ 6.256551] worker_thread+0x2f5/0x420
[ 6.256551] ? __pfx_worker_thread+0x10/0x10
[ 6.256551] kthread+0xf5/0x130
[ 6.256551] ? __pfx_kthread+0x10/0x10
[ 6.256551] ret_from_fork+0x3d/0x60
[ 6.256551] ? __pfx_kthread+0x10/0x10
[ 6.256551] ret_from_fork_asm+0x1a/0x30
[ 6.256551] </TASK>
[ 6.256551] Modules linked in:
[ 6.256551] CR2: 00000000000001fc
[ 6.256551] ---[ end trace 0000000000000000 ]---
[ 6.256551] RIP: 0010:regmap_read+0x12/0x70
[ 6.256551] Code: 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 <8b> 87 fc 01 00 00 83 e8 01 85 f0 75 42 48 89 fb 41 89 f4 49 89 d5
[ 6.256551] RSP: 0018:ff7fa5c740bcbc98 EFLAGS: 00010246
[ 6.256551] RAX: 0000000000000000 RBX: ff38ff5c159f1028 RCX: 0000000000000008
[ 6.256551] RDX: ff7fa5c740bcbcc4 RSI: 0000000000000034 RDI: 0000000000000000
[ 6.256551] RBP: ff7fa5c740bcbcb0 R08: ff38ff5c02ceb8b0 R09: ff38ff5c002a4500
[ 6.256551] R10: 0000000000000003 R11: 0000000000000003 R12: ff38ff5c159f1028
[ 6.256551] R13: 0000000000000000 R14: 0000000000000000 R15: ff38ff5c159ed8f4
[ 6.256551] FS: 0000000000000000(0000) GS:ff38ff6b0d200000(0000) knlGS:0000000000000000
[ 6.256551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6.256551] CR2: 00000000000001fc CR3: 000000007403c001 CR4: 0000000000771ef0
[ 6.256551] PKRU: 55555554
[ 6.256551] note: kworker/32:0[211] exited with irqs disabled


--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


2024-01-02 13:42:56

by Jarkko Nikula

Subject: Re: i2c-designware: NULL ptr at RIP: 0010:regmap_read+0x12/0x70

Hi

On 12/29/23 14:08, Borislav Petkov wrote:
> Looking at the call stack, I see
>
> 2409205acd3c ("i2c: designware: fix __i2c_dw_disable() in case master is holding SCL low")
>
> which does that dev->map deref in __i2c_dw_disable() but maybe ->map is
> invalid by then...?
>
> Just a stab in the dark anyway...
>
Do you run the same tests on vanilla? I.e., do you see this on v6.7-rc8?

I'm curious to know whether this is an already existing issue or a
regression caused by the recent cleanup patches to
drivers/i2c/busses/i2c-designware-* that linux-next carries on top of
v6.7-rc8.

2024-01-03 15:25:33

by Jarkko Nikula

Subject: Re: i2c-designware: NULL ptr at RIP: 0010:regmap_read+0x12/0x70

On 1/2/24 17:47, V, Narasimhan wrote:
> [AMD Official Use Only - General]
>
>
> No, we don't see this issue on linus' tree or on linux-next in the till
> the previous week
>
Thanks, this indeed shows it's a regression coming from Andy's recent
patchset. Notes and questions below:

> [ 6.245173] i2c_designware AMDI0010:00: Unknown Synopsys component type: 0xffffffff

This made me scratch my head, since driver probing fails with -ENODEV in
this case and I could not trigger any runtime PM activity in such a
case. Perhaps this is timing specific and just happens to trigger on
your system.

Out of curiosity, do you see this same "i2c_designware AMDI0010:00:
Unknown Synopsys component type: 0xffffffff" error on vanilla, or is it
also a regression in linux-next?

> [ 6.252683] BUG: kernel NULL pointer dereference, address: 00000000000001fc
> [ 6.256551] #PF: supervisor read access in kernel mode
> [ 6.256551] #PF: error_code(0x0000) - not-present page
> [ 6.256551] PGD 0
> [ 6.256551] Oops: 0000 [#1] PREEMPT SMP NOPTI
> [ 6.256551] CPU: 32 PID: 211 Comm: kworker/32:0 Not tainted 6.7.0-rc6-next-20231222-1703820640818 #1
> [ 6.256551] Workqueue: pm pm_runtime_work
> [ 6.256551] RIP: 0010:regmap_read+0x12/0x70
> [ 6.256551] Code: 00 00 00 00 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 <8b> 87 fc 01 00 00 83 e8 01 85 f0 75 42 48 89 fb 41 89 f4 49 89 d5
> [ 6.256551] RSP: 0018:ff7fa5c740bcbc98 EFLAGS: 00010246
> [ 6.256551] RAX: 0000000000000000 RBX: ff38ff5c159f1028 RCX: 0000000000000008
> [ 6.256551] RDX: ff7fa5c740bcbcc4 RSI: 0000000000000034 RDI: 0000000000000000
> [ 6.256551] RBP: ff7fa5c740bcbcb0 R08: ff38ff5c02ceb8b0 R09: ff38ff5c002a4500
> [ 6.256551] R10: 0000000000000003 R11: 0000000000000003 R12: ff38ff5c159f1028
> [ 6.256551] R13: 0000000000000000 R14: 0000000000000000 R15: ff38ff5c159ed8f4
> [ 6.256551] FS: 0000000000000000(0000) GS:ff38ff6b0d200000(0000) knlGS:0000000000000000
> [ 6.256551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 6.256551] CR2: 00000000000001fc CR3: 000000007403c001 CR4: 0000000000771ef0
> [ 6.256551] PKRU: 55555554
> [ 6.256551] Call Trace:
> [ 6.256551] <TASK>
> [ 6.256551] ? show_regs+0x6d/0x80
> [ 6.256551] ? __die+0x29/0x70
> [ 6.256551] ? page_fault_oops+0x153/0x4a0
> [ 6.256551] ? do_user_addr_fault+0x30f/0x6c0
> [ 6.256551] ? exc_page_fault+0x7c/0x190
> [ 6.256551] ? asm_exc_page_fault+0x2b/0x30
> [ 6.256551] ? regmap_read+0x12/0x70
> [ 6.256551] ? update_load_avg+0x82/0x7d0
> [ 6.256551] __i2c_dw_disable+0x38/0x180
> [ 6.256551] i2c_dw_disable+0x3f/0xb0
> [ 6.256551] i2c_dw_runtime_suspend+0x33/0x50

I think this Oops is caused by the first commit in the patchset:

bd466a892612 ("i2c: designware: Fix PM calls order in dw_i2c_plat_probe()")

Do you see the issue if you test at that commit?

Before that commit, when the i2c_dw_probe() path failed we explicitly
disabled runtime PM before returning, but now we let the managed calls
do it. Perhaps there is a time window in which runtime suspend runs in
parallel while the driver core is executing post-probe code?

dw_i2c_plat_probe
  i2c_dw_probe
    i2c_dw_probe_master
      i2c_dw_init_regmap
        -> failure and thus dev->map is not set

i2c_dw_runtime_suspend
  i2c_dw_disable
    __i2c_dw_disable
      regmap_read(dev->map, ...)
        -> Oops because dev->map is NULL
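
The two call chains above can be modelled in plain userspace C. This is only a sketch of the suspected sequence under stated assumptions: every name in it (`dw_dev_model`, `model_init_regmap`, `model_runtime_suspend`) is hypothetical, it does not reproduce the driver code, and the NULL guard is shown only to make the failure mode visible, not as a proposed fix.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the regmap and the driver's device struct. */
struct regmap_model {
	int reg_stride;
};

struct dw_dev_model {
	struct regmap_model *map;	/* stays NULL if regmap init fails */
};

/* Models the probe path: on failure dev->map is never assigned. */
static int model_init_regmap(struct dw_dev_model *dev, struct regmap_model *hw)
{
	if (!hw)
		return -ENODEV;		/* e.g. unknown component type */
	dev->map = hw;
	return 0;
}

/*
 * Models the runtime suspend path. Without the check below, the read of
 * dev->map->reg_stride would dereference NULL, as in the Oops.
 */
static int model_runtime_suspend(struct dw_dev_model *dev)
{
	if (!dev->map)
		return -ENODEV;		/* illustrative guard only */
	(void)dev->map->reg_stride;	/* stands in for regmap_read() */
	return 0;
}
```

If the model's probe fails and the suspend callback still runs afterwards, the guard returns -ENODEV instead of faulting; once `map` has been set, the same suspend call returns 0.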

The other PM-related commit in the patchset is 2347b8dc0d2e ("i2c:
designware: Consolidate PM ops"), but I don't think that is the reason.

2024-01-04 13:40:56

by Jarkko Nikula

Subject: Re: i2c-designware: NULL ptr at RIP: 0010:regmap_read+0x12/0x70

On 1/4/24 08:35, V, Narasimhan wrote:
>> [    6.245173] i2c_designware AMDI0010:00: Unknown Synopsys component type: 0xffffffff
>
> This made me scratching my head since driver probing will fail in this
> case with -ENODEV and I could not trigger runtime PM activity in such
> case but perhaps this is timing specific which happens to happen in your
> case.
>
> Out of curiosity do you see this same "i2c_designware AMDI0010:00:
> Unknown Synopsys component type: 0xffffffff" error on Vanilla or is it
> also regression in linux-next?
>
> This does not happen on Vanilla, only on linux-next.
>
This is even stranger. The controller is in reset, but I fail to see
from Andy's patches why. Do you have a chance to test at these commits?

bd466a892612 ("i2c: designware: Fix PM calls order in dw_i2c_plat_probe()")
c012fde343d2 ("i2c: designware: Fix reset call order in dw_i2c_plat_probe()")

and maybe the last one
4bff054b64e1 ("i2c: designware: Fix spelling and other issues in the
comments")

I'm trying to narrow down whether the regression comes from the first
two patches and, if not, then test the last one.

Andy is out of office. If we can narrow the regression down to the first
two patches, we can perhaps revert just them; otherwise we need to drop
the whole set.

2024-01-06 16:09:40

by Andy Shevchenko

Subject: Re: i2c-designware: NULL ptr at RIP: 0010:regmap_read+0x12/0x70

On Thu, Jan 04, 2024 at 03:40:44PM +0200, Jarkko Nikula wrote:
> On 1/4/24 08:35, V, Narasimhan wrote:
> > > [    6.245173] i2c_designware AMDI0010:00: Unknown Synopsys component type: 0xffffffff
> >
> > This made me scratching my head since driver probing will fail in this
> > case with -ENODEV and I could not trigger runtime PM activity in such
> > case but perhaps this is timing specific which happens to happen in your
> > case.
> >
> > Out of curiosity do you see this same "i2c_designware AMDI0010:00:
> > Unknown Synopsys component type: 0xffffffff" error on Vanilla or is it
> > also regression in linux-next?
> >
> > This does not happen on Vanilla, only on linux-next.
> >
> This is even more strange. Controller is in reset but I'm blind to see from
> Andy's patches why. Do you have change to test at these commits?
>
> bd466a892612 ("i2c: designware: Fix PM calls order in dw_i2c_plat_probe()")
> c012fde343d2 ("i2c: designware: Fix reset call order in dw_i2c_plat_probe()"
>
> and maybe the last one
> 4bff054b64e1 ("i2c: designware: Fix spelling and other issues in the
> comments")
>
> I'm trying to narrow does the regression come from first two patches and if
> not, then test the last one.
>
> Andy is out of office and if we can narrow the regression to first two
> patches we perhaps can revert just them and otherwise need to drop the whole
> set.

Since I saw this email...

First of all, it's easy to just go patch by patch and see if it helps.
Or simply bisect among the 24 commits (4 iterations only).

Second, it seems that we are using autosuspend but we don't prevent
the device from being runtime suspended during ->probe(). So, a
workaround can be to take a reference count that prevents PM from
suspending the device.

->remove(), for instance, uses RPM get/put calls.
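
To illustrate the idea of holding a reference across ->probe(), here is a minimal userspace model, assuming a simple usage counter that blocks suspend while it is non-zero. All names in it (`rpm_model`, `model_get_noresume`, `model_put_noidle`, `model_try_suspend`) are hypothetical; in the kernel the corresponding helpers would be pm_runtime_get_noresume() and pm_runtime_put_noidle(), and this sketch does not claim to reproduce their exact behaviour.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of a device's runtime PM state. */
struct rpm_model {
	int usage_count;	/* references that keep the device active */
	int suspended;		/* 1 once a suspend has gone through */
};

/* Take a reference without resuming (models pm_runtime_get_noresume()). */
static void model_get_noresume(struct rpm_model *d)
{
	d->usage_count++;
}

/* Drop the reference without an idle check (models pm_runtime_put_noidle()). */
static void model_put_noidle(struct rpm_model *d)
{
	if (d->usage_count > 0)
		d->usage_count--;
}

/* A suspend request is refused while anybody holds a reference. */
static int model_try_suspend(struct rpm_model *d)
{
	if (d->usage_count > 0)
		return -EAGAIN;
	d->suspended = 1;
	return 0;
}
```

In this model, a suspend attempt made between `model_get_noresume()` and `model_put_noidle()` returns -EAGAIN, so a suspend callback could not run while probe is still in progress.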

--
With Best Regards,
Andy Shevchenko



2024-01-09 10:12:25

by Jarkko Nikula

Subject: Re: i2c-designware: NULL ptr at RIP: 0010:regmap_read+0x12/0x70

Hi

On 1/9/24 09:56, V, Narasimhan wrote:
> * Looks like the issue is with this below commit:
> * i2c: designware: Fix lock probe call order in dw_i2c_plat_probe()
>
Hmm... This makes me even more confused, since your device AMDI0010
should not even use the access semaphore.

So does linux-next work if you run the commit before it, or if you
revert these three patches? (Commit 2f571a725434 ("i2c: designware: Fix
lock probe call order in dw_i2c_plat_probe()") cannot be reverted
without also reverting the two related commits after it.)

git show f9b51f600217b38f46ea39d6aa445e594bf3eb30 |patch -p1 -R
git show b8034c7d28a988be82efbf4d65faa847334811f7 |patch -p1 -R
git show 2f571a72543463ef07dc3ac61e7b703b9ad997f9 |patch -p1 -R

2024-01-10 22:56:47

by Kim Phillips

Subject: Re: i2c-designware: NULL ptr at RIP: 0010:regmap_read+0x12/0x70

Hi,

On 1/9/24 4:11 AM, Jarkko Nikula wrote:
> On 1/9/24 09:56, V, Narasimhan wrote:
>>   * Looks like the issue is with this below commit:
>>   * i2c: designware: Fix lock probe call order in dw_i2c_plat_probe()
>>
> Hmm... This makes me even more confused since your device AMDI0010 should not even use the access semaphore.
>
> So linux-next works if you run a commit before it or revert these three patches? (commit 2f571a725434 ("i2c: designware: Fix lock probe call order in dw_i2c_plat_probe()") doesn't revert without reverting two other related commits after it)
>
> git show f9b51f600217b38f46ea39d6aa445e594bf3eb30 |patch -p1 -R
> git show b8034c7d28a988be82efbf4d65faa847334811f7 |patch -p1 -R
> git show 2f571a72543463ef07dc3ac61e7b703b9ad997f9 |patch -p1 -R

Narasimhan is right, if I check out, build and boot this commit:

2f571a725434 i2c: designware: Fix lock probe call order in dw_i2c_plat_probe()

I get the same stacktrace on the serial console.

If I try the previous commit (174a0c565cea "efi/loongarch: Directly position the loaded image file"),
the system boots fine.

The same thing happens with the three reversions above:
next-20240110 gets the stacktrace, but with the three
reversions, it doesn't.

Is your parallel post probe runtime suspending time window
theory no longer applicable? These AMD EPYC systems have a
lot more cores than their client equivalents, and AMD power
management code has had a lot of improvements lately.

Thanks,

Kim

2024-01-11 13:02:03

by Jarkko Nikula

Subject: Re: i2c-designware: NULL ptr at RIP: 0010:regmap_read+0x12/0x70

Hi

On 1/11/24 00:56, Kim Phillips wrote:
> Hi,
>
> On 1/9/24 4:11 AM, Jarkko Nikula wrote:
>> On 1/9/24 09:56, V, Narasimhan wrote:
>>>   * Looks like the issue is with this below commit:
>>>   * i2c: designware: Fix lock probe call order in dw_i2c_plat_probe()
>>>
>> Hmm... This makes me even more confused since your device AMDI0010
>> should not even use the access semaphore.
>>
>> So linux-next works if you run a commit before it or revert these
>> three patches? (commit 2f571a725434 ("i2c: designware: Fix lock probe
>> call order in dw_i2c_plat_probe()") doesn't revert without reverting
>> two other related commits after it)
>>
>> git show f9b51f600217b38f46ea39d6aa445e594bf3eb30 |patch -p1 -R
>> git show b8034c7d28a988be82efbf4d65faa847334811f7 |patch -p1 -R
>> git show 2f571a72543463ef07dc3ac61e7b703b9ad997f9 |patch -p1 -R
>
> Narasimhan is right, if I check out, build and boot this commit:
>
>       2f571a725434 i2c: designware: Fix lock probe call order in
> dw_i2c_plat_probe()
>
> I get the same stacktrace on the serial console.
>
> If I try the previous commit (174a0c565cea "efi/loongarch: Directly
> position the loaded image file"),
> the system boots fine.
>
> The same thing happens with the three reversions above:
> next-20240110 gets the stacktrace, but with the three
> reversions, it doesn't.
>
Thanks, I just sent a fix reverting those commits.

> Is your parallel post probe runtime suspending time window
> theory no longer applicable?  These AMD EPYC systems have a
> lot more cores than their client equivalents, and AMD power
> management code has had a lot of improvements lately.
>
It's still a mystery to me, but I'll let Andy figure it out during the
next development cycle if he wants to :-)