LinuxLists.cc - Re: [EXT] Re: [RFC PATCH v3 0/8] Coresight for Kernel panic and watchdog reset

2023-09-19 13:24:35

Subject: Re: [EXT] Re: [RFC PATCH v3 0/8] Coresight for Kernel panic and watchdog reset

On 19/09/2023 12:39, Linu Cherian wrote:
> Hi James,
>
>> -----Original Message-----
>> From: James Clark <[email protected]>
>> Sent: Friday, September 15, 2023 7:20 PM
>> To: Linu Cherian <[email protected]>; [email protected];
>> [email protected]; [email protected]
>> Cc: [email protected]; [email protected]; linux-
>> [email protected]; [email protected];
>> [email protected]; [email protected];
>> [email protected]; Sunil Kovvuri Goutham
>> <[email protected]>; George Cherian <[email protected]>
>> Subject: [EXT] Re: [RFC PATCH v3 0/8] Coresight for Kernel panic and
>> watchdog reset
>>
>> External Email
>>
>> ----------------------------------------------------------------------
>>
>>
>> On 04/09/2023 06:05, Linu Cherian wrote:
>>> This RFC v3 patch series is rebased on v6.5-rc7 and is dependent on
>>> the below two patches.
>> [...]
>>>
>>> Steps for reading trace data captured in previous boot
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> 1. cd /sys/bus/coresight/devices/tmc_etrXX/
>>>
>>> 2. Change to special mode called, read_prevboot.
>>>
>>> #echo 1 > read_prevboot
>>>
>>> 3. Dump trace buffer data to a file,
>>>
>>> #dd if=/dev/tmc_etrXX of=~/cstrace.bin
>>
>> Hi Linu,
>>
>> I left this comment on V2, but I tested it again and get the same result.
>> Instead of linking it I'll just re-paste it here:
>>
>> I made a reserved region, but when I run this command I get "Unable to
>> handle kernel paging request at virtual address 001f1921ed10ffae".
>>
>> Is there an extra step involved if there was no trace captured from a previous
>> panic? I thought I'd just be able to read out uninitialised data. Or is it the
>> uninitialised metadata that's causing this issue?
>>
>> Also that's without KASAN or lockdep turned on. If I have a kernel with either
>> of those things I get a different warning for each one. I expect the lockdep
>> one would happen even in the working scenario though?
>
> Somehow I missed this comment on V2.
>
> I retried the above steps on my board and I do not see issues either with KASAN OR lockdep enabled configs.
> Please see logs below.
>
> a. Lockdep enabled config
> ~# cd /sys/bus/coresight/devices/tmc_etr0
> tmc_etr0# echo 1 > read_prevboot
> tmc_etr0# dd if=/dev/tmc_etr0 of=~/cstrace.bin
> 12324+1 records in
> 12324+1 records out
> 6310032 bytes (6.3 MB, 6.0 MiB) copied, 0.122883 s, 51.3 MB/s
>
> # zcat /proc/config.gz | grep LOCKDEP
> CONFIG_LOCKDEP_SUPPORT=y
> CONFIG_LOCKDEP=y
> CONFIG_LOCKDEP_BITS=15
> CONFIG_LOCKDEP_CHAINS_BITS=16
> CONFIG_LOCKDEP_STACK_TRACE_BITS=19
> CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
> CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
> # CONFIG_DEBUG_LOCKDEP is not set
>
> b. KASAN enabled config
> # cd /sys/bus/coresight/devices/tmc_etr0/
> tmc_etr0# ls
> buf_mode_preferred connections power trigger_cntr
> buf_modes_available enable_sink read_prevboot uevent
> buffer_size mgmt subsystem waiting_for_supplier
> tmc_etr0# echo 1 > read_prevboot
> tmc_etr0# dd if=/dev/tmc_etr0 of=~/cstrace.bin
> 12324+1 records in
> 12324+1 records out
> 6310032 bytes (6.3 MB, 6.0 MiB) copied, 0.0940671 s, 67.1 MB/s
>
> ~# zcat /proc/config.gz | grep -i kasan
> CONFIG_KASAN_SHADOW_OFFSET=0xdfff800000000000
> CONFIG_HAVE_ARCH_KASAN=y
> CONFIG_HAVE_ARCH_KASAN_SW_TAGS=y
> CONFIG_HAVE_ARCH_KASAN_HW_TAGS=y
> CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
> CONFIG_CC_HAS_KASAN_GENERIC=y
> CONFIG_CC_HAS_KASAN_SW_TAGS=y
> CONFIG_KASAN=y
> CONFIG_KASAN_GENERIC=y
> # CONFIG_KASAN_SW_TAGS is not set
> # CONFIG_KASAN_HW_TAGS is not set
> CONFIG_KASAN_OUTLINE=y
> # CONFIG_KASAN_INLINE is not set
> CONFIG_KASAN_STACK=y
> CONFIG_KASAN_VMALLOC=y
> # CONFIG_KASAN_MODULE_TEST is not set
>
>
> But then I am able to trigger kernel crash with bad metadata(corrupted rwp and rrp) with below stack trace.
>
> [ 107.442991] __arch_copy_to_user+0x180/0x240
> [ 107.447254] vfs_read+0xc8/0x2a8
> [ 107.450476] ksys_read+0x74/0x110
> [ 107.453783] __arm64_sys_read+0x24/0x38
> [ 107.457611] invoke_syscall.constprop.0+0x58/0xf8
> [ 107.462309] do_el0_svc+0x6c/0x158
> [ 107.465704] el0_svc+0x54/0x1c0
> [ 107.468839] el0t_64_sync_handler+0x100/0x130
> [ 107.473188] el0t_64_sync+0x190/0x198
> [ 107.476843] Code: d503201f d503201f d503201f d503201f (a8c12027)
>
> Does your stack trace looks similar ? Then its very likely due to bad metadata.
> If not, kindly please share yours.
>
> For example, if we have bad values for rwp and rrp, offset can get messed up resulting in above crash.
> Will add more validation checks while setting up the prevboot buffer, so as to avoid processing with bogus metadata values
> in the next patch version.
>
> Thanks James for trying this out.
>
>

I think it must be bad metadata because I didn't try it with a previous
crash saved yet. I suppose we do need some kind of validation then if
it's possible for bad metadata to cause a crash.

I will try after filling in the metadata and see if that was the issue.
>
>>
>>>
>>> 4. Reset back to normal mode
>>>
>>> #echo 0 > read_prevboot
>>>

2023-09-27 08:05:42

by Linu Cherian

[permalink] [raw]

Subject: RE: [EXT] Re: [RFC PATCH v3 0/8] Coresight for Kernel panic and watchdog reset

Hi James,

> -----Original Message-----
> From: James Clark <[email protected]>
> Sent: Tuesday, September 19, 2023 6:41 PM
> To: Linu Cherian <[email protected]>; [email protected];
> [email protected]; [email protected]
> Cc: [email protected]; [email protected]; linux-
> [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; Sunil Kovvuri Goutham
> <[email protected]>; George Cherian <[email protected]>
> Subject: Re: [EXT] Re: [RFC PATCH v3 0/8] Coresight for Kernel panic and
> watchdog reset
>
>
>
> On 19/09/2023 12:39, Linu Cherian wrote:
> > Hi James,
> >
> >> -----Original Message-----
> >> From: James Clark <[email protected]>
> >> Sent: Friday, September 15, 2023 7:20 PM
> >> To: Linu Cherian <[email protected]>; [email protected];
> >> [email protected]; [email protected]
> >> Cc: [email protected]; [email protected];
> >> linux- [email protected]; [email protected];
> >> [email protected]; [email protected];
> >> [email protected]; Sunil Kovvuri Goutham
> >> <[email protected]>; George Cherian <[email protected]>
> >> Subject: [EXT] Re: [RFC PATCH v3 0/8] Coresight for Kernel panic and
> >> watchdog reset
> >>
> >> External Email
> >>
> >> ---------------------------------------------------------------------
> >> -
> >>
> >>
> >> On 04/09/2023 06:05, Linu Cherian wrote:
> >>> This RFC v3 patch series is rebased on v6.5-rc7 and is dependent on
> >>> the below two patches.
> >> [...]
> >>>
> >>> Steps for reading trace data captured in previous boot
> >>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>> 1. cd /sys/bus/coresight/devices/tmc_etrXX/
> >>>
> >>> 2. Change to special mode called, read_prevboot.
> >>>
> >>> #echo 1 > read_prevboot
> >>>
> >>> 3. Dump trace buffer data to a file,
> >>>
> >>> #dd if=/dev/tmc_etrXX of=~/cstrace.bin
> >>
> >> Hi Linu,
> >>
> >> I left this comment on V2, but I tested it again and get the same result.
> >> Instead of linking it I'll just re-paste it here:
> >>
> >> I made a reserved region, but when I run this command I get "Unable
> >> to handle kernel paging request at virtual address 001f1921ed10ffae".
> >>
> >> Is there an extra step involved if there was no trace captured from a
> >> previous panic? I thought I'd just be able to read out uninitialised
> >> data. Or is it the uninitialised metadata that's causing this issue?
> >>
> >> Also that's without KASAN or lockdep turned on. If I have a kernel
> >> with either of those things I get a different warning for each one. I
> >> expect the lockdep one would happen even in the working scenario
> though?
> >
> > Somehow I missed this comment on V2.
> >
> > I retried the above steps on my board and I do not see issues either with
> KASAN OR lockdep enabled configs.
> > Please see logs below.
> >
> > a. Lockdep enabled config
> > ~# cd /sys/bus/coresight/devices/tmc_etr0
> > tmc_etr0# echo 1 > read_prevboot
> > tmc_etr0# dd if=/dev/tmc_etr0 of=~/cstrace.bin
> > 12324+1 records in
> > 12324+1 records out
> > 6310032 bytes (6.3 MB, 6.0 MiB) copied, 0.122883 s, 51.3 MB/s
> >
> > # zcat /proc/config.gz | grep LOCKDEP
> > CONFIG_LOCKDEP_SUPPORT=y
> > CONFIG_LOCKDEP=y
> > CONFIG_LOCKDEP_BITS=15
> > CONFIG_LOCKDEP_CHAINS_BITS=16
> > CONFIG_LOCKDEP_STACK_TRACE_BITS=19
> > CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
> > CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
> > # CONFIG_DEBUG_LOCKDEP is not set
> >
> > b. KASAN enabled config
> > # cd /sys/bus/coresight/devices/tmc_etr0/
> > tmc_etr0# ls
> > buf_mode_preferred connections power trigger_cntr
> > buf_modes_available enable_sink read_prevboot uevent
> > buffer_size mgmt subsystem waiting_for_supplier
> > tmc_etr0# echo 1 > read_prevboot
> > tmc_etr0# dd if=/dev/tmc_etr0 of=~/cstrace.bin
> > 12324+1 records in
> > 12324+1 records out
> > 6310032 bytes (6.3 MB, 6.0 MiB) copied, 0.0940671 s, 67.1 MB/s
> >
> > ~# zcat /proc/config.gz | grep -i kasan
> > CONFIG_KASAN_SHADOW_OFFSET=0xdfff800000000000
> > CONFIG_HAVE_ARCH_KASAN=y
> > CONFIG_HAVE_ARCH_KASAN_SW_TAGS=y
> > CONFIG_HAVE_ARCH_KASAN_HW_TAGS=y
> > CONFIG_HAVE_ARCH_KASAN_VMALLOC=y
> > CONFIG_CC_HAS_KASAN_GENERIC=y
> > CONFIG_CC_HAS_KASAN_SW_TAGS=y
> > CONFIG_KASAN=y
> > CONFIG_KASAN_GENERIC=y
> > # CONFIG_KASAN_SW_TAGS is not set
> > # CONFIG_KASAN_HW_TAGS is not set
> > CONFIG_KASAN_OUTLINE=y
> > # CONFIG_KASAN_INLINE is not set
> > CONFIG_KASAN_STACK=y
> > CONFIG_KASAN_VMALLOC=y
> > # CONFIG_KASAN_MODULE_TEST is not set
> >
> >
> > But then I am able to trigger kernel crash with bad metadata(corrupted rwp
> and rrp) with below stack trace.
> >
> > [ 107.442991] __arch_copy_to_user+0x180/0x240 [ 107.447254]
> > vfs_read+0xc8/0x2a8 [ 107.450476] ksys_read+0x74/0x110 [
> > 107.453783] __arm64_sys_read+0x24/0x38 [ 107.457611]
> > invoke_syscall.constprop.0+0x58/0xf8
> > [ 107.462309] do_el0_svc+0x6c/0x158
> > [ 107.465704] el0_svc+0x54/0x1c0
> > [ 107.468839] el0t_64_sync_handler+0x100/0x130 [ 107.473188]
> > el0t_64_sync+0x190/0x198 [ 107.476843] Code: d503201f d503201f
> > d503201f d503201f (a8c12027)
> >
> > Does your stack trace looks similar ? Then its very likely due to bad
> metadata.
> > If not, kindly please share yours.
> >
> > For example, if we have bad values for rwp and rrp, offset can get messed
> up resulting in above crash.
> > Will add more validation checks while setting up the prevboot buffer,
> > so as to avoid processing with bogus metadata values in the next patch
> version.
> >
> > Thanks James for trying this out.
> >
> >
>
> I think it must be bad metadata because I didn't try it with a previous crash
> saved yet. I suppose we do need some kind of validation then if it's possible
> for bad metadata to cause a crash.
>
> I will try after filling in the metadata and see if that was the issue.

Found a regression in this series. Below is the fix.

@@ -1840,7 +1853,7 @@ static int tmc_panic_sync_etr(struct coresight_device *csdev)

/* Sync registers from hardware to metadata region */
tmc->size = csdev_access_relaxed_read32(csa, TMC_RSZ);
- tmc->sts = csdev_access_relaxed_read32(csa, TMC_RSZ);
+ tmc->sts = csdev_access_relaxed_read32(csa, TMC_STS);

> >
> >>
> >>>
> >>> 4. Reset back to normal mode
> >>>
> >>> #echo 0 > read_prevboot
> >>>