Hello,
Last week at the Embedded Linux Conference in Seattle we had an
"unconference session", i.e. a free discussion about a topic. The
topic I proposed was "Hot-Pluggable Hardware with Device Tree
Overlays Runtime Loading and Unloading (yes, at runtime)". As suggested
by Saravana, here is a brief summary of the discussion.
15 people were present:
Luca Ceresoli (Bootlin)
Thomas Petazzoni (Bootlin)
Alexandre Belloni (Bootlin)
Maxime Chevallier (Bootlin)
Krzysztof Kozlowski (Linaro)
Bartosz Golaszewski (Linaro)
Doug Anderson (Google)
Chen-Yu Tsai (Google)
Matt Coster (Imagination Technologies)
Martino Facchin (Arduino)
(5 more, I don't know the names)
The topic is how to support in Linux, by loading and unloading device
tree overlays at runtime, any hardware add-on that:
- can be plugged into and unplugged from a base board at runtime,
without notice
- adds hardware on non-discoverable busses
- provides a way to detect the add-on model that gets attached.
Cold-plug and discoverable busses (e.g. USB) are out of scope.
We described 2 use cases we are working on at Bootlin.
One use case is for the LAN966x, a classic SoC that can however be
started in "endpoint mode", i.e. with the CPU cores deactivated and a
PCI endpoint that allows an external CPU to access all the peripherals
over PCIe. In practice the whole SoC would be used as a peripheral chip
providing lots of devices for another SoC where the OS runs. This use
case has been described by Rob Herring and Lizhi Hou at LPC 2023 [4][5].
The other use case, which was discussed in more detail, is for an
industrial product under development by a Bootlin customer, which is a
regular, self-standing embedded Linux system with a connector for
attaching an add-on with additional peripherals. The add-on
peripherals are on I2C, MIPI DSI and potentially other non-discoverable
busses (there are also peripherals on natively hot-pluggable busses
such as USB and Ethernet, but by their nature they don't need special
work).
For both use cases (and perhaps others we are unaware of) runtime
loading/unloading of DT overlays appears to be the best-fitting
technique, except that it is not yet ready for real usage.
For it to work, we highlighted 3 main areas in need of work in the
Linux kernel:
1. how to describe the connector and the add-ons in device tree
(bindings etc) -- only relevant for the 2nd use case
2. implementation of DT overlays for adding/removing the add-on
peripherals
3. fixing issues with various subsystems and drivers that don't react
well to device removal
* Topic 1: DT description *
I mentioned the DT structure I proposed in [0], which decouples the bus
segments and thus supports both different add-on models and different
base boards with different SoCs around the same connector definition
(think of the Beaglebone family). No objection was raised to this
approach.
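To illustrate the decoupling idea only (this is not the binding
proposed in [0]), here is a minimal Python model. All names in it
(BASE_BOARDS, ADDONS, resolve, the bus labels and paths) are
hypothetical. The add-on description refers only to connector-local
bus names, and each base board maps those names to its own
controllers, so any add-on can be combined with any base board.

```python
# Toy model: hypothetical names throughout, not a real DT binding.

# Each base board maps connector-local bus names to its own controllers.
BASE_BOARDS = {
    "board-a": {"conn-i2c": "/soc/i2c@1000", "conn-dsi": "/soc/dsi@2000"},
    "board-b": {"conn-i2c": "/bus/i2c@40", "conn-dsi": "/bus/dsi@80"},
}

# Each add-on lists its devices against connector-local names only.
ADDONS = {
    "addon-x": [("conn-i2c", "eeprom@50"), ("conn-dsi", "panel@0")],
    "addon-y": [("conn-i2c", "sensor@48")],
}


def resolve(board, addon):
    """Return the full device paths an add-on gets on a given base board."""
    buses = BASE_BOARDS[board]
    return [f"{buses[bus]}/{dev}" for bus, dev in ADDONS[addon]]
```

Adding a new base board only requires a new entry in BASE_BOARDS; the
add-on descriptions stay untouched, and vice versa.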
Some mentioned the recently posted patches for Mikrobus support on the
Beagle Play [1], which I was unaware of. I later checked the e-mail
thread: the connector description appears similar to our proposal, but
there is a big difference. In the Beagle Play proposal the add-on is
not described via DT but rather via a greybus manifest, and the
connector driver has code to parse it and populate the various devices
mentioned in the manifest.
* Topic 2: Implementation of the connector and overlay (un)loading *
The proposed idea is to have a connector driver that reacts to plug
events in two stages.
- Stage 1: load a "small" overlay common to all add-on models which
describes enough to get the add-on model ID, e.g. from an EEPROM on
the add-on itself.
- Stage 2: after getting the model ID, load the model-specific overlay
that describes everything else.
Stage 1 could be unnecessary if the model can be detected without
loading any add-on device drivers, e.g. if it is defined by pulling
some GPIOs on the connector.
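The two-stage flow can be sketched as a small userspace simulation.
This is only a model of the control flow, not driver code: the
Connector class, the overlay names and the read_model_id callback are
all hypothetical. In a real connector driver the "load" and "unload"
steps would presumably go through the kernel overlay API (such as
of_overlay_fdt_apply() and of_overlay_remove()).

```python
# Simulation of the two-stage flow; every name here is hypothetical.

class Connector:
    def __init__(self, model_overlays, read_model_id):
        self.model_overlays = model_overlays  # model ID -> overlay name
        self.read_model_id = read_model_id    # e.g. reads the add-on EEPROM
        self.loaded = []                      # overlays applied, in order

    def on_plug(self):
        # Stage 1: common overlay, just enough to identify the add-on.
        self.loaded.append("addon-common")
        model = self.read_model_id()
        overlay = self.model_overlays.get(model)
        if overlay is None:
            # Unknown model: undo stage 1 and leave the connector empty.
            self.on_unplug()
            return False
        # Stage 2: model-specific overlay describing everything else.
        self.loaded.append(overlay)
        return True

    def on_unplug(self):
        # Remove overlays in reverse order of application.
        while self.loaded:
            self.loaded.pop()
```

If the model is detected from GPIOs instead, read_model_id would not
need the stage 1 overlay at all and that step could be dropped.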
Overlay (un)loading is well known for triggering several issues. The
largest one (in terms of lines of code involved) is the memory leaks
or use-after-free [6] of nodes, and especially properties, that happen
when an overlay is removed.
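As a toy illustration of why removal is hard, assume the overlay hands
out its property buffers directly without tracking who holds them. On
removal it can then either free them (any holder is left with garbage)
or keep them forever (a leak). This is a Python model with
hypothetical names, not kernel code, and "freed" memory is only
simulated by overwriting the buffer.

```python
# Toy model of the lifetime problem; hypothetical names, not kernel code.

class Overlay:
    def __init__(self, props):
        self.props = {k: bytearray(v) for k, v in props.items()}
        self.leaked = []

    def get_prop(self, name):
        # Hands out the buffer itself, with no record of who holds it.
        return self.props[name]

    def remove(self, free):
        for buf in self.props.values():
            if free:
                # Stand-in for freed memory: holders now see garbage.
                buf[:] = b"\xde" * len(buf)
            else:
                # Safe for holders, but the memory is never reclaimed.
                self.leaked.append(buf)
        self.props = {}
```

A consumer that called get_prop() before remove(free=True) still holds
the buffer, but its content is no longer the property value.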
* Topic 3: fixing drivers/subsystems not handling removal correctly *
Bartosz raised the concern that many subsystems crash or hang or are
otherwise buggy when a device is removed (I think the quote was "are
you guys going to fix them all?") -- a sound concern indeed.
We plan to address issues as they appear on the busses we use, which is
already a significant amount of work and is in progress here. The
others (e.g. SPI) can be addressed by whoever needs to hotplug them
in the future. It's worth mentioning that Bartosz gave a BoF
[2] and talk [3] on the following day, both with useful information for
those needing to make a subsystem safe against removals.
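One generic pattern for surviving removal, sketched here in Python and
not necessarily what was presented in [2] or [3], is to let users hold
a handle that outlives the hardware device. On removal the handle is
detached, and later calls fail with ENODEV instead of touching a
device that is gone. The Handle class and its methods are
hypothetical.

```python
import errno
import threading


class Handle:
    """Object kept by users; outlives the hardware device behind it."""

    def __init__(self, device):
        self._lock = threading.Lock()
        self._device = device

    def read(self):
        with self._lock:
            if self._device is None:
                raise OSError(errno.ENODEV, "device removed")
            return self._device.read()

    def detach(self):
        # Called on removal; later calls fail cleanly instead of crashing.
        with self._lock:
            self._device = None
```

The lock ensures that detach() cannot complete while a read() is still
using the device.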
* Status *
In the end there are 3 main areas in need of work: DT description, DT
overlay implementation, and fixing drivers and subsystems that don't
handle removal correctly.
Bootlin is actively working on all of these topics and has already sent
a few patches to fix some issues that were found [7][8][9]. More is in
progress and will be sent when ready.
That's all. For those present, please feel free to add any relevant
details I have missed.
[0] https://lore.kernel.org/all/20240403213327.36d731ec@booty/
[1] https://lore.kernel.org/all/[email protected]/
[2] https://sched.co/1aBGK
[3] https://sched.co/1aBGf
[4] https://www.youtube.com/watch?v=MVGElnZW7BQ
[5] https://lpc.events/event/17/contributions/1421/
[6] https://elinux.org/Frank%27s_Evolving_Overlay_Thoughts#issues_and_what_needs_to_be_completed_--_Not_an_exhaustive_list
[7] https://lore.kernel.org/all/[email protected]
[8] https://lore.kernel.org/all/[email protected]
[9] https://lore.kernel.org/all/[email protected]
Best regards,
Luca
--
Luca Ceresoli, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com
Hello,
On Fri, 26 Apr 2024 11:51:41 +0200
Luca Ceresoli <[email protected]> wrote:
[...]
> We described 2 use cases we are working on at Bootlin.
>
> One use case is for the LAN966x, a classic SoC that can be however be
> started in "endpoint mode", i.e. with the CPU cores deactivated and a
> PCI endpoint that allows an external CPU to access all the peripherals
> over PCIe. In practice the whole SoC would be used as a peripheral chip
> providing lots of devices for another SoC where the OS runs. This use
> case has been described by Rob Herring and Lizhi Hou at LPC 2023 [4][5].
>
> The other use case, which was discussed in more detail, is for an
> industrial product under development by a Bootlin customer, which is a
> regular, self-standing embedded Linux system with a connector allowing
> to connect an add-on with additional peripherals. The add-on
> peripherals are on I2C, MIPI DSI and potentially other non-discoverable
> busses (there are also peripherals on natively hot-pluggable busses
> such as USB and Ethernet, but by their nature they don't need special
> work).
>
> For both use cases (and perhaps others we are unaware of) runtime
> loading/unloading DT overlays appears as the most fitting technique.
> Except it is not yet ready for real usage.
>
> For it to work, we highlighted 3 main areas in need of work in the
> Linux kernel:
>
> 1. how to describe the connector and the add-ons in device tree
> (bindings etc) -- only relevant for the 2nd use case
> 2. implementation of DT overlays for adding/removing the add-on
> peripherals
> 3. fixing issues with various subsystems and drivers that don't react
> well on device removal
Quick update: I just sent a series with a proposal covering items 1 and
2:
https://lore.kernel.org/all/[email protected]/
Luca