2023-05-08 16:04:29

by Mukesh Ojha

[permalink] [raw]
Subject: RFC design of device coredump collection on panic in Pstore

Hi All,

We are exploring a way where device driver(registered/interested and optional could be
callback) can dump their data(consist of list of buffers or hardware registers) for post
mortam debugging of a crash via dumping the content to pstore region. Some of the example
data could be like clock dump/regulators/ etc.

These stuff should be already part of entire RAM dump but in some cases it's however not
feasible to capture the entire content of RAM, so was thinking if pstore region can be
used to capture above information either in the form of elf or binary and how pstore
can get this in human readable format can be discussed further.

Also, existing devcoredump driver does not cover panic usecases so we thought of something
like below RFC design where,

1. Device_coredump allocates some configurable contigous memory that can be controlled
via CONFIG or bootargs and later registers for panic notifiers.
2. Notifier gets added.
3. Pstore adds device_coredump as its front-end via dumper registration similar
to kmsg being dump today.
4. Successful registration of dumper.
5. A device driver(A-Z) can register their buffer to be dumped as part of panic.
6. buffer gets added to the dump list.
7. Panic occurs.
8. iterate over registered drivers and copy their dump list to its own memory and if
it crosses device core dump memory log an error stop iterating.
9. Similar to devcore_dump() inline with kmsg_dump()
10.Copy the content to pstore region and this could be elf or raw binary that can be
discussed.


Device coredump memory(1) size could be passed from pstore and should be same as size
of devcoredump frontend size given in DT or some other way.

Let me know your concern and view on this.


pstore device_coredump deviceA..Z panic
┼ │
│ │ panic_notifier(1) │
│ ├───────────────────┬─────────────────►│
│ │ notifier added │ (2) │
│ │◄──────────────────┼──────────────────┤
pstore can give │ (3) │ │ │
its region for │ dumper_registration │ │ │
dump ├─────────────────────►│ │ │
│ (4) │ │ │
│◄─────────────────────┤ │ │
│ register_dumper │ │ │
│ │ │ │
│ │ │ │
│ │ │ init │
│ │ ───┼── │
│ │ (5) │ │
│ │devcoredump_register(dev,buf)); │
│ │◄──────────────────┐ │
│ │ (6) │ │
│ ├──────────────────►│ │
│ │device/buf gets add│d │
│ │ to the list │ │panic
│ │ │ ──┼───
│ │ │ │
│ │(7) panic_notifier │ call │
│ │◄──────────────────┼──────────────────┤
│ │ prepare coredump │for the deviceA-Z │
│ ├──────────────────►├────┐ │
│ │ (8) │ │ │
│ │ │ │ │
│ │ │ │ │
│ │ │◄───┘ │
dumper will │ │ (9) │dev_coredump()
write all the device dump to │◄──────────────────┼──────────────────┤
pstore ┌◄───┬─────────────────┤ │ │
region │ │ (10) │ │ │
│ │ │ │ │
│ │ │ │ │
│◄───┘ │ │ │
│ │ │ │
│ │



Regards,
Mukesh

--
2.7.4


2023-05-15 07:33:19

by Pavan Kondeti

[permalink] [raw]
Subject: Re: RFC design of device coredump collection on panic in Pstore

On Mon, May 08, 2023 at 09:21:00PM +0530, Mukesh Ojha wrote:
> Hi All,
>
> We are exploring a way where device driver(registered/interested and optional could be
> callback) can dump their data(consist of list of buffers or hardware registers) for post
> mortam debugging of a crash via dumping the content to pstore region. Some of the example
> data could be like clock dump/regulators/ etc.
>
> These stuff should be already part of entire RAM dump but in some cases it's however not
> feasible to capture the entire content of RAM, so was thinking if pstore region can be
> used to capture above information either in the form of elf or binary and how pstore
> can get this in human readable format can be discussed further.
>
> Also, existing devcoredump driver does not cover panic usecases so we thought of something
> like below RFC design where,
>
> 1. Device_coredump allocates some configurable contigous memory that can be controlled
> via CONFIG or bootargs and later registers for panic notifiers.
> 2. Notifier gets added.
> 3. Pstore adds device_coredump as its front-end via dumper registration similar
> to kmsg being dump today.
> 4. Successful registration of dumper.
> 5. A device driver(A-Z) can register their buffer to be dumped as part of panic.
> 6. buffer gets added to the dump list.
> 7. Panic occurs.
> 8. iterate over registered drivers and copy their dump list to its own memory and if
> it crosses device core dump memory log an error stop iterating.
> 9. Similar to devcore_dump() inline with kmsg_dump()
> 10.Copy the content to pstore region and this could be elf or raw binary that can be
> discussed.
>
>
> Device coredump memory(1) size could be passed from pstore and should be same as size
> of devcoredump frontend size given in DT or some other way.
>
> Let me know your concern and view on this.
>
>
> pstore device_coredump deviceA..Z panic
> ┼ │
> │ │ panic_notifier(1) │
> │ ├───────────────────┬─────────────────►│
> │ │ notifier added │ (2) │
> │ │◄──────────────────┼──────────────────┤
> pstore can give │ (3) │ │ │
> its region for │ dumper_registration │ │ │
> dump ├─────────────────────►│ │ │
> │ (4) │ │ │
> │◄─────────────────────┤ │ │
> │ register_dumper │ │ │
> │ │ │ │
> │ │ │ │
> │ │ │ init │
> │ │ ───┼── │
> │ │ (5) │ │
> │ │devcoredump_register(dev,buf)); │
> │ │◄──────────────────┐ │
> │ │ (6) │ │
> │ ├──────────────────►│ │
> │ │device/buf gets add│d │
> │ │ to the list │ │panic
> │ │ │ ──┼───
> │ │ │ │
> │ │(7) panic_notifier │ call │
> │ │◄──────────────────┼──────────────────┤
> │ │ prepare coredump │for the deviceA-Z │
> │ ├──────────────────►├────┐ │
> │ │ (8) │ │ │
> │ │ │ │ │
> │ │ │ │ │
> │ │ │◄───┘ │
> dumper will │ │ (9) │dev_coredump()
> write all the device dump to │◄──────────────────┼──────────────────┤
> pstore ┌◄───┬─────────────────┤ │ │
> region │ │ (10) │ │ │
> │ │ │ │ │
> │ │ │ │ │
> │◄───┘ │ │ │
> │ │ │ │
> │ │
>
>

I really like the idea of providing an interface for device drivers to
dump specific data for offline use. Currently in MSM/QCOM downstream
kernel, we do register for panic notifiers and dump relavant data for
later parsing/analysis. In some cases, this message gets printed to console
so that it would also available in pstore / extracted dmesg. However
having separate record/buffer for individual drivers would help.

I believe dev_coredump() is just an analogy here as it is meant for
taking a dump of the device when the driver thinks its device/firmware
is hung/crashed etc, AFAICT. IIUC, What we are looking for here a different
interface/semantics i.e querying devices to dump any information when
the *system* is crashed/paniced. May be something like dev_panicdump()
and a separate method in device_driver.

Also, are there any use cases for this dev_panicdump() outside the
pstore()? For ex: it would still be helpful to gather all the relavant
data of a device and put it a separate buffer. The list of such buffers
can be parsed offline via standard ramdump analysis.

Thanks,
Pavan

2023-05-16 21:16:44

by Kees Cook

[permalink] [raw]
Subject: Re: RFC design of device coredump collection on panic in Pstore

On Mon, May 08, 2023 at 09:21:00PM +0530, Mukesh Ojha wrote:
> 1. Device_coredump allocates some configurable contigous memory that can be controlled
> via CONFIG or bootargs and later registers for panic notifiers.
> 2. Notifier gets added.
> 3. Pstore adds device_coredump as its front-end via dumper registration similar
> to kmsg being dump today.
> 4. Successful registration of dumper.
> 5. A device driver(A-Z) can register their buffer to be dumped as part of panic.
> 6. buffer gets added to the dump list.
> 7. Panic occurs.
> 8. iterate over registered drivers and copy their dump list to its own memory and if
> it crosses device core dump memory log an error stop iterating.
> 9. Similar to devcore_dump() inline with kmsg_dump()
> 10.Copy the content to pstore region and this could be elf or raw binary that can be
> discussed.

Yeah, I think having something like the panic dump handlers for
driver-specific data sounds like a fine idea (as long as we have actual
users of such an interface).

I think it would be easy to add another front-end to pstore (via the
enums, etc), and expose it via a new template filename in the pstorefs.

--
Kees Cook