Hello,
We have seen an issue where some ConnectX-3 devices are not initialized
when the mlx4 drivers are part of the initrd.
The basic scenario looks as follows:
* A machine has multiple ConnectX-3 devices, which can also be VFs. The system
  uses an initrd driven by dracut+systemd. The initrd is built as no-hostonly
  (think of a VM image) and includes the mlx4 drivers.
* The machine boots. The initrd invokes udevd to start inserting device
  drivers until the root disk is available.
* The udev daemon inserts the mlx4_core driver, which asynchronously requests
  a load of mlx4_en. This is done by calling request_module_nowait() from
  mlx4_request_modules(). The kernel spawns a modprobe userspace task to
  handle this request.
* The modprobe task finds the mlx4_en module and asks the kernel to load it.
  The module loader runs the init function of the module, which starts
  iterating over mlx4_core devices and initializing their eth support.
* The root disk becomes available in the meantime and the initrd logic starts
  the switch root process.
* Systemd stops running services and then sends SIGTERM to "unmanaged" tasks
  on the system to terminate them too. This includes the modprobe task.
* Initialization of mlx4_en is interrupted in the middle of its init function.
  The module remains inserted but only some eth devices are initialized and
  operational.
The modprobe task uses the default SIGTERM handling and so this signal becomes
fatal. Specifically, it causes the create_singlethread_workqueue() call in
mlx4_en_add() to error out. The workqueue requires a rescuer thread and a wait
on the new thread fails because a fatal signal is pending.
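
As a rough illustration of a helper being terminated part-way through its
work, here is a minimal user-space sketch in Python. It is only an analogy:
the forked worker stands in for the modprobe task, the device names and the
run_worker() helper are made up for the example, and the in-kernel failure
path (a wait failing inside create_singlethread_workqueue()) is not modeled.
The worker "initializes" one device at a time and the parent sends SIGTERM
after a chosen number of them.

```python
import os
import signal


def run_worker(devices, sigterm_after):
    """Fork a worker that 'initializes' devices one by one and send it
    SIGTERM once `sigterm_after` devices are done (None: never).
    Returns (initialized_devices, wait_status). Hypothetical helper."""
    done_r, done_w = os.pipe()  # worker -> parent: one message per device
    ack_r, ack_w = os.pipe()    # parent -> worker: permission to continue

    pid = os.fork()
    if pid == 0:
        # Worker: stand-in for the modprobe task (an assumption of this sketch).
        try:
            os.close(done_r)
            os.close(ack_w)
            signal.signal(signal.SIGTERM, signal.SIG_DFL)  # default: fatal
            for dev in devices:
                os.write(done_w, (dev + "\n").encode())
                os.read(ack_r, 1)  # block until the parent lets us go on
        finally:
            os._exit(0)

    os.close(done_w)
    os.close(ack_r)
    initialized = []
    while True:
        data = os.read(done_r, 256)
        if not data:
            break  # worker exited and closed its end of the pipe
        initialized.append(data.decode().strip())
        if len(initialized) == sigterm_after:
            os.kill(pid, signal.SIGTERM)  # the "switch root" cleanup
            break
        os.write(ack_w, b"x")
    _, status = os.waitpid(pid, 0)
    os.close(done_r)
    os.close(ack_w)
    return initialized, status
```

With four devices and sigterm_after=2, run_worker() reports only the first
two devices as initialized and a wait status showing that the worker was
killed by SIGTERM. With sigterm_after=None the worker finishes all of them
and exits normally.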
As mentioned, this can result in only some of the devices being initialized.
The modprobe task could also plausibly fail in some other obscure way, since
the root is switched out from under it. It is a task that runs completely
asynchronously, outside any systemd control.
Has anyone else seen this issue before?
Note that some parts of the problem are not fully clear to me yet. In
particular, systemd also sends SIGSTOP before and SIGCONT after the mentioned
SIGTERM signal, which can actually in some cases prevent the kernel from
treating SIGTERM immediately as a fatal signal. I'm waiting for an additional
test machine to analyze this part further.
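
The user-space side of that ordering can be checked with a small Python
sketch. It assumes nothing about systemd internals beyond the SIGSTOP,
SIGTERM, SIGCONT sequence described above; the stop_term_cont() helper and
its settle delay are hypothetical. It shows that a SIGTERM sent to a stopped,
idle process stays pending and only becomes fatal once SIGCONT arrives.
Whether the same holds for a modprobe task that is executing a module init
function in kernel context is the open question here, and the sketch does not
model that case.

```python
import os
import signal
import time


def stop_term_cont(settle=0.2):
    """Send SIGSTOP, SIGTERM, SIGCONT to an idle child process.
    Returns (stopped, alive_while_stopped, killed_by_sigterm)."""
    ready_r, ready_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: an idle task with the default (fatal) SIGTERM disposition.
        try:
            signal.signal(signal.SIGTERM, signal.SIG_DFL)
            os.write(ready_w, b"r")
            while True:
                time.sleep(1)
        finally:
            os._exit(1)

    os.close(ready_w)
    os.read(ready_r, 1)  # wait until the child has set its disposition
    os.close(ready_r)

    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)  # returns once the child stops
    stopped = os.WIFSTOPPED(status)

    os.kill(pid, signal.SIGTERM)
    time.sleep(settle)  # give the signal time to act, if it were going to
    reaped, _ = os.waitpid(pid, os.WNOHANG)
    alive_while_stopped = (reaped == 0)  # SIGTERM is still only pending

    os.kill(pid, signal.SIGCONT)
    _, status = os.waitpid(pid, 0)  # now the pending SIGTERM is delivered
    killed_by_sigterm = (os.WIFSIGNALED(status)
                         and os.WTERMSIG(status) == signal.SIGTERM)
    return stopped, alive_while_stopped, killed_by_sigterm
```

On Linux all three returned values are expected to be true for such an idle
child.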
One idea for addressing this issue is to model the mlx4 drivers using an
auxiliary bus, similar to the conversion already done for mlx5. This leaves
all module loads to udevd, which integrates better with the systemd
processing, so a load of mlx4_en doesn't get interrupted.
My incomplete patches implementing this idea are available at:
https://github.com/petrpavlu/linux/commits/bsc1187236-wip-v1
The rework turned out to be not exactly straightforward and would need more
effort.
I realize mlx4 is only used for ConnectX-3 and older hardware. I wonder then
whether this kind of rework would be suitable and worth proceeding with, or
whether some simpler way to address the described issue would be preferred.
Thank you,
Petr
On Thu, Dec 15, 2022 at 10:51:15AM +0100, Petr Pavlu wrote:
> Hello,
>
> We have seen an issue when some of ConnectX-3 devices are not initialized
> when mlx4 drivers are a part of initrd.
<...>
> * Systemd stops running services and then sends SIGTERM to "unmanaged" tasks
> on the system to terminate them too. This includes the modprobe task.
> * Initialization of mlx4_en is interrupted in the middle of its init function.
And why do you think that this systemd behaviour is the correct one?
> The module remains inserted but only some eth devices are initialized and
> operational.
<...>
> One idea how to address this issue is to model the mlx4 drivers using an
> auxiliary bus, similar to how the same conversion was already done in mlx5.
> This leaves all module loads to udevd which better integrates with the systemd
> processing and a load of mlx4_en doesn't get interrupted.
>
> My incomplete patches implementing this idea are available at:
> https://github.com/petrpavlu/linux/commits/bsc1187236-wip-v1
>
> The rework turned out to be not exactly straightforward and would need more
> effort.
Right, I didn't see any ROI of converting mlx4 to aux bus.
>
> I realize mlx4 is only used for ConnectX-3 and older hardware. I wonder then
> if this kind of rework would be suitable and something to proceed with, or if
> some simpler idea how to address the described issue would be better and
> preferred.
Will it help if you move mlx4_en to rootfs?
Thanks
>
> Thank you,
> Petr
>
On 12/18/22 10:53, Leon Romanovsky wrote:
> On Thu, Dec 15, 2022 at 10:51:15AM +0100, Petr Pavlu wrote:
>> Hello,
>>
>> We have seen an issue when some of ConnectX-3 devices are not initialized
>> when mlx4 drivers are a part of initrd.
>
> <...>
>
>> * Systemd stops running services and then sends SIGTERM to "unmanaged" tasks
>> on the system to terminate them too. This includes the modprobe task.
>> * Initialization of mlx4_en is interrupted in the middle of its init function.
>
> And why do you think that this systemd behaviour is the correct one?
My view is that this is an issue between the kernel and initrd/systemd.
Switching the root is a delicate operation and both parts need to carefully
cooperate for it to work correctly.
I think it is generally sensible that systemd tries to terminate any remaining
processes started from the initrd. They would have trouble when the root is
switched out from under them anyway, unless they are specifically prepared for
it. Systemd only skips terminating kthreads and allows root storage daemons to
be excluded. A modprobe helper could be excluded from termination too, but the
problem with the root switch remains.
It looks to me that a good approach is to complete all running module loads
before switching the root and continue with any further loads after the
operation is done. Leaving module loads to udevd ensures this, hence the idea
to use an auxiliary bus.
>> The module remains inserted but only some eth devices are initialized and
>> operational.
>
> <...>
>
>> One idea how to address this issue is to model the mlx4 drivers using an
>> auxiliary bus, similar to how the same conversion was already done in mlx5.
>> This leaves all module loads to udevd which better integrates with the systemd
>> processing and a load of mlx4_en doesn't get interrupted.
>>
>> My incomplete patches implementing this idea are available at:
>> https://github.com/petrpavlu/linux/commits/bsc1187236-wip-v1
>>
>> The rework turned out to be not exactly straightforward and would need more
>> effort.
>
> Right, I didn't see any ROI of converting mlx4 to aux bus.
I see. If you and the other maintainers are not immediately opposed to this
conversion idea, I could try to resolve the remaining problems in my port and
see how it turns out.
>> I realize mlx4 is only used for ConnectX-3 and older hardware. I wonder then
>> if this kind of rework would be suitable and something to proceed with, or if
>> some simpler idea how to address the described issue would be better and
>> preferred.
>
> Will it help if you move mlx4_en to rootfs?
Yes, if the mlx4 drivers are not in the initrd but only on the rootfs, then
this issue is not present. The problem is that VM image templates typically
have their initrd generated as no-hostonly and so include all drivers. Some
images might also require networking to be available already in the initrd
for instance initialization, and so must include these drivers.
Thanks,
Petr
On Mon, Jan 02, 2023 at 11:33:15AM +0100, Petr Pavlu wrote:
> On 12/18/22 10:53, Leon Romanovsky wrote:
> > On Thu, Dec 15, 2022 at 10:51:15AM +0100, Petr Pavlu wrote:
> >> Hello,
> >>
> >> We have seen an issue when some of ConnectX-3 devices are not initialized
> >> when mlx4 drivers are a part of initrd.
> >
> > <...>
> >
> >> * Systemd stops running services and then sends SIGTERM to "unmanaged" tasks
> >> on the system to terminate them too. This includes the modprobe task.
> >> * Initialization of mlx4_en is interrupted in the middle of its init function.
> >
> > And why do you think that this systemd behaviour is the correct one?
>
> My view is that this is an issue between the kernel and initrd/systemd.
> Switching the root is a delicate operation and both parts need to carefully
> cooperate for it to work correctly.
>
> I think it is generally sensible that systemd tries to terminate any remaining
> processes started from the initrd. They would have troubles when the root is
> switched under their hands anyway, unless they are specifically prepared for
> it. Systemd only skips terminating kthreads and allows to exclude root storage
> daemons. A modprobe helper could be excluded from being terminated too but the
> problem with the root switch remains.
>
> It looks to me that a good approach is to complete all running module loads
> before switching the root and continue with any further loads after the
> operation is done. Leaving module loads to udevd assures this, hence the idea
> to use an auxiliary bus.
I'm not sure about that. Everything above is user-space trouble that is
invited once systemd does the root switch. Anyway, if you want to do the
aux bus conversion for mlx4, go for it.
Feel free to send me patches off-list and I will add them to our
regression, but be aware that you are stepping into a minefield
here.
Thanks