2016-10-05 18:00:26

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC] fs: add userspace critical mounts event support

On Tue, Sep 13, 2016 at 09:38:17PM -0500, Rob Landley wrote:
> On 09/02/2016 07:20 PM, Luis R. Rodriguez wrote:
> > kernel_read_file_from_path() can try to read a file from
> > the system's filesystem. This is typically done for firmware
> > for instance, which lives in /lib/firmware. One issue with
> > this is that the kernel cannot know for sure when the real
> > final /lib/firmare/ is ready, and even if you use initramfs
> > drivers are currently initialized *first* prior to the initramfs
> > kicking off.
>
> Why?

do_initcalls() is called prior to prepare_namespace(), other than that
we have no strict rules over where the real rootfs should be, and since
we have pivot_root() its up to userspace to decide when/how the real
rootfs goes. This and the fact that its also up to userspace to design
what files to place in initramfs of further rootfs -- only userspace
will know for sure when all firmware for all drivers is really ready.

> > During init we run through all init calls first
> > (do_initcalls()) and finally the initramfs is processed via
> > prepare_namespace():
>
> What's the downside of moving initramfs cpio extraction earlier in the boot?

That would help users of initrafms, some folks seem to not want to use
initramfs, one of such users are that of the large firmwares for remote-proc
(Documentation/remoteproc.txt), we're talking about over 200 MiB for some
firmware for example.

> I did some shuffling around of those code to make initmpfs work, does
> anybody know why initramfs extraction _before_ we initialize drivers
> would be a bad thing?

No, but it seems sensible to me, if its done before do_initcalls()
that should resolve the race for initramfs users but -- so long
as the drivers that need firmware early are dumped into initramfs.
We have no assurances/warnings for this, but we can add such things
if we want them. This would not resolve the race for non-initramfs
users / pivot_root() changes.

Luis


2016-10-05 18:08:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] fs: add userspace critical mounts event support

On Wed, Oct 5, 2016 at 11:00 AM, Luis R. Rodriguez <[email protected]> wrote:
> On Tue, Sep 13, 2016 at 09:38:17PM -0500, Rob Landley wrote:
>
>> I did some shuffling around of those code to make initmpfs work, does
>> anybody know why initramfs extraction _before_ we initialize drivers
>> would be a bad thing?
>
> No, but it seems sensible to me, if its done before do_initcalls()
> that should resolve the race for initramfs users

initramfs should already be set up before drivers are. Exactly what is
it that has trouble right now?

The gating issue for initramfs is that technically the filesystem
setup needs to be done, which means that it currently ends up being
populated _fairly_ late in the initcall series, but certainly before
drivers. But since initramfs really only needs very limited filesystem
functionality, I assume Rob had few problems with just moving it
earlier.

Still, what kind of ordering issues did people have? What is it that
needs to load files even before driver init? Some crazy subsystem?

Linus

2016-10-05 19:46:44

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC] fs: add userspace critical mounts event support

On Wed, Oct 05, 2016 at 11:08:06AM -0700, Linus Torvalds wrote:
> On Wed, Oct 5, 2016 at 11:00 AM, Luis R. Rodriguez <[email protected]> wrote:
> > On Tue, Sep 13, 2016 at 09:38:17PM -0500, Rob Landley wrote:
> >
> >> I did some shuffling around of those code to make initmpfs work, does
> >> anybody know why initramfs extraction _before_ we initialize drivers
> >> would be a bad thing?
> >
> > No, but it seems sensible to me, if its done before do_initcalls()
> > that should resolve the race for initramfs users
>
> initramfs should already be set up before drivers are.

Actually you are right, the issue would only be for old initrd, for initramfs
we populate that via rootfs_initcall(populate_rootfs), so as long as drivers
in question use an init level beyond rootfs's we're good there.

> Exactly what is it that has trouble right now?

It would seem then that the only current stated race possible should
then be non-initramfs users. One example if very large firmware for
remote-proc, whereby an initramfs is just not practical or desirable.

> The gating issue for initramfs is that technically the filesystem
> setup needs to be done, which means that it currently ends up being
> populated _fairly_ late in the initcall series, but certainly before
> drivers. But since initramfs really only needs very limited filesystem
> functionality, I assume Rob had few problems with just moving it
> earlier.
>
> Still, what kind of ordering issues did people have? What is it that
> needs to load files even before driver init? Some crazy subsystem?

No, I think this is just about non-initramfs users now, if we disregard
old initrd users. Bjorn, Marc, correct me if I'm wrong, as I think its
so far you both who have seemed to run into race issues and have then
ended up trying to look for hacks to address this race or considered using
the usermode helper (which we're trying to minimize users for). Daniel
seems to note a lot of video drivers use firmware on probe as well so
there's a potential issue for those users if they don't use initramfs.

Luis

2016-11-08 22:47:42

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC] fs: add userspace critical mounts event support

On Wed, Oct 05, 2016 at 09:46:33PM +0200, Luis R. Rodriguez wrote:
> On Wed, Oct 05, 2016 at 11:08:06AM -0700, Linus Torvalds wrote:
> > On Wed, Oct 5, 2016 at 11:00 AM, Luis R. Rodriguez <[email protected]> wrote:
> > > On Tue, Sep 13, 2016 at 09:38:17PM -0500, Rob Landley wrote:
> > >
> > >> I did some shuffling around of those code to make initmpfs work, does
> > >> anybody know why initramfs extraction _before_ we initialize drivers
> > >> would be a bad thing?
> > >
> > > No, but it seems sensible to me, if its done before do_initcalls()
> > > that should resolve the race for initramfs users
> >
> > initramfs should already be set up before drivers are.
>
> Actually you are right, the issue would only be for old initrd, for initramfs
> we populate that via rootfs_initcall(populate_rootfs), so as long as drivers
> in question use an init level beyond rootfs's we're good there.
>
> > Exactly what is it that has trouble right now?
>
> It would seem then that the only current stated race possible should
> then be non-initramfs users.

Or:

a) initramfs users that include a driver but do not include the firmware
into initramfs

b) driver is built-in but firmware is not in initrafms (Johannes reports
this causes driver failure on intel wireless for instance, and I guess
you need to reload)

> One example if very large firmware for
> remote-proc, whereby an initramfs is just not practical or desirable.

This issue still stands. At Plumbers Johannes Berg did indicate to me
he had a simple elegant solution in mind. He suggested that since the
usermode helper was available, he had added support to be able to
differentiate async firmware request calls form sync requests, and that
userspace should not return an error *iff* the request made was async and
it can determine we're initramfs. The semantics issue is the same though,
is there a generic way to determine we're initramfs ? What if we move
multiple levels? Anyway -- provided we could figure this out, userspace
would simply yield and wait until the real rootfs is met. Upon pivot_root()
the assumption is all previous udev events pending would be re-triggered
and finally udev could finally confirm/deny if the firmware was present.
This would *also* allow you to stuff your firmware whever, however big
it was. This however relied on the userspace firmware loading support,
it turns out that (I think because of an incorrect negative backlash
back in the day over blaming this over booting issues due to the timeout
whereas the real issue was the kmod timeout was affecting our long
standing serial init()/probe()) the systemd userspace firmware laoding
support was removed from systemd udev in 2014 by Kay via commit
be2ea723b1d023b3d ("udev: remove userspace firmware loading support").

Systemd might *still* be able to provide a solution here, however I will
note Johannes was asking for *all* async firmware requests to always
rely on the kernel syfs UMH fallback -- this suggestion is against the
direction we've been taking to eventually compartamentalize the kernel UMH
code, so whatever we decide to do, lets please take a breather and seriously
address this properly *with* systemd folks.

A side race discussed at Plumbers worth mentioning given its related to the
UMH was inspired by Jiri's talk on the abuse of the freezer for kthreads --
and his suggestion to use freeze_super(). Currently the UMH lock is used
for the UMH but as I have noted in Daniel Wagner's recent patches to
give some love to this code and further compartamentalize the UMH --
the UMH lock was originally added to help avoid drivers use the firmware
API on resume, given the races. The firmware cache solution implemented by
Ming Lei years ago helped address this, whereby devm helpers are used
based on the requested firmware and prior to suspend we cache all required
firmware for drivers so that upon resume calls would work without the
effective race present. This mitigated the actual races / issues with
drivers, but they must not use the firmware API on suspend/resume. Since
this solution *kills* all pending UMH caller on suspend obviously this
means on suspend using request_firmware*() API and expecting it to work
is brutally dumb as we will eventually kill any pending requests. This
is a long winded way to say that if you rely on the UMH for firmware
you must figure out your own proactive firmware cache solution and
must definitely not request firmware on suspend. Two things then:

1) I've been brainstorming with Daniel how to use freeze_super() to
replace the now invalid UMH lock -- its purpose only helps races
on boot, for the fallback case to the UMH. But note most distributions
disable forcing it always on, so these days we *only* rely on the UMH
as a fallback if the driver explicitly requested it

2) Drivers relying on the UMH very likely have a broken cache solution
if they are doing this on suspend

Whatever the outcome of this discussion is -- Johannes seemed to *want*
to further use the UMH by default on *all* async alls... even if the
driver did not explicitly requested it -- I'm concerned about this given
all the above and the existing flip/flop on systemd for it. Whatever
we try to dream up here, please consider all the above as well.

> > The gating issue for initramfs is that technically the filesystem
> > setup needs to be done, which means that it currently ends up being
> > populated _fairly_ late in the initcall series, but certainly before
> > drivers. But since initramfs really only needs very limited filesystem
> > functionality, I assume Rob had few problems with just moving it
> > earlier.
> >
> > Still, what kind of ordering issues did people have? What is it that
> > needs to load files even before driver init? Some crazy subsystem?
>
> No, I think this is just about non-initramfs users now,

And as Johannes pointed two above to cases.

> if we disregard
> old initrd users. Bjorn, Marc, correct me if I'm wrong, as I think its
> so far you both who have seemed to run into race issues and have then
> ended up trying to look for hacks to address this race or considered using
> the usermode helper (which we're trying to minimize users for). Daniel
> seems to note a lot of video drivers use firmware on probe as well so
> there's a potential issue for those users if they don't use initramfs.

Luis

2016-11-09 09:13:30

by Daniel Wagner

[permalink] [raw]
Subject: Re: [RFC] fs: add userspace critical mounts event support

[CC: added Harald]

On 11/08/2016 11:47 PM, Luis R. Rodriguez wrote:
> On Wed, Oct 05, 2016 at 09:46:33PM +0200, Luis R. Rodriguez wrote:
>> On Wed, Oct 05, 2016 at 11:08:06AM -0700, Linus Torvalds wrote:
>>> On Wed, Oct 5, 2016 at 11:00 AM, Luis R. Rodriguez <[email protected]> wrote:
>>>> On Tue, Sep 13, 2016 at 09:38:17PM -0500, Rob Landley wrote:
>>>>
>>>>> I did some shuffling around of those code to make initmpfs work, does
>>>>> anybody know why initramfs extraction _before_ we initialize drivers
>>>>> would be a bad thing?
>>>>
>>>> No, but it seems sensible to me, if its done before do_initcalls()
>>>> that should resolve the race for initramfs users
>>>
>>> initramfs should already be set up before drivers are.
>>
>> Actually you are right, the issue would only be for old initrd, for initramfs
>> we populate that via rootfs_initcall(populate_rootfs), so as long as drivers
>> in question use an init level beyond rootfs's we're good there.
>>
>>> Exactly what is it that has trouble right now?
>>
>> It would seem then that the only current stated race possible should
>> then be non-initramfs users.
>
> Or:
>
> a) initramfs users that include a driver but do not include the firmware
> into initramfs
>
> b) driver is built-in but firmware is not in initrafms (Johannes reports
> this causes driver failure on intel wireless for instance, and I guess
> you need to reload)
>
>> One example if very large firmware for
>> remote-proc, whereby an initramfs is just not practical or desirable.
>
> This issue still stands. At Plumbers Johannes Berg did indicate to me
> he had a simple elegant solution in mind. He suggested that since the
> usermode helper was available, he had added support to be able to
> differentiate async firmware request calls form sync requests, and that
> userspace should not return an error *iff* the request made was async and
> it can determine we're initramfs. The semantics issue is the same though,
> is there a generic way to determine we're initramfs ? What if we move
> multiple levels? Anyway -- provided we could figure this out, userspace
> would simply yield and wait until the real rootfs is met. Upon pivot_root()
> the assumption is all previous udev events pending would be re-triggered
> and finally udev could finally confirm/deny if the firmware was present.
> This would *also* allow you to stuff your firmware whever, however big
> it was. This however relied on the userspace firmware loading support,
> it turns out that (I think because of an incorrect negative backlash
> back in the day over blaming this over booting issues due to the timeout
> whereas the real issue was the kmod timeout was affecting our long
> standing serial init()/probe()) the systemd userspace firmware laoding
> support was removed from systemd udev in 2014 by Kay via commit
> be2ea723b1d023b3d ("udev: remove userspace firmware loading support").
>
> Systemd might *still* be able to provide a solution here, however I will
> note Johannes was asking for *all* async firmware requests to always
> rely on the kernel syfs UMH fallback -- this suggestion is against the
> direction we've been taking to eventually compartamentalize the kernel UMH
> code, so whatever we decide to do, lets please take a breather and seriously
> address this properly *with* systemd folks.
>
> A side race discussed at Plumbers worth mentioning given its related to the
> UMH was inspired by Jiri's talk on the abuse of the freezer for kthreads --
> and his suggestion to use freeze_super(). Currently the UMH lock is used
> for the UMH but as I have noted in Daniel Wagner's recent patches to
> give some love to this code and further compartamentalize the UMH --
> the UMH lock was originally added to help avoid drivers use the firmware
> API on resume, given the races. The firmware cache solution implemented by
> Ming Lei years ago helped address this, whereby devm helpers are used
> based on the requested firmware and prior to suspend we cache all required
> firmware for drivers so that upon resume calls would work without the
> effective race present. This mitigated the actual races / issues with
> drivers, but they must not use the firmware API on suspend/resume. Since
> this solution *kills* all pending UMH caller on suspend obviously this
> means on suspend using request_firmware*() API and expecting it to work
> is brutally dumb as we will eventually kill any pending requests. This
> is a long winded way to say that if you rely on the UMH for firmware
> you must figure out your own proactive firmware cache solution and
> must definitely not request firmware on suspend. Two things then:
>
> 1) I've been brainstorming with Daniel how to use freeze_super() to
> replace the now invalid UMH lock -- its purpose only helps races
> on boot, for the fallback case to the UMH. But note most distributions
> disable forcing it always on, so these days we *only* rely on the UMH
> as a fallback if the driver explicitly requested it
>
> 2) Drivers relying on the UMH very likely have a broken cache solution
> if they are doing this on suspend
>
> Whatever the outcome of this discussion is -- Johannes seemed to *want*
> to further use the UMH by default on *all* async alls... even if the
> driver did not explicitly requested it -- I'm concerned about this given
> all the above and the existing flip/flop on systemd for it. Whatever
> we try to dream up here, please consider all the above as well.

As Harald pointed out over a beer yesterday evening, there is at least
one more reason why UMH isn't obsolete. The ordering of the firmware
loading might be of important. Say you want to greet the user with a
splash screen really early on, the graphic card firmware should be
loaded first. Also the automotive world has this fancy requirement that
rear camera must be on the screen within 2 seconds. So controlling the
firmware loading order is of importance (e.g. also do not overcommit the
I/O bandwith not so important firmwares). A user space helper is able to
prioritize the request accordingly the use case.

>>> The gating issue for initramfs is that technically the filesystem
>>> setup needs to be done, which means that it currently ends up being
>>> populated _fairly_ late in the initcall series, but certainly before
>>> drivers. But since initramfs really only needs very limited filesystem
>>> functionality, I assume Rob had few problems with just moving it
>>> earlier.
>>>
>>> Still, what kind of ordering issues did people have? What is it that
>>> needs to load files even before driver init? Some crazy subsystem?
>>
>> No, I think this is just about non-initramfs users now,
>
> And as Johannes pointed two above to cases.
>
>> if we disregard
>> old initrd users. Bjorn, Marc, correct me if I'm wrong, as I think its
>> so far you both who have seemed to run into race issues and have then
>> ended up trying to look for hacks to address this race or considered using
>> the usermode helper (which we're trying to minimize users for). Daniel
>> seems to note a lot of video drivers use firmware on probe as well so
>> there's a potential issue for those users if they don't use initramfs.

Daniel

2016-11-09 23:41:18

by Luis Chamberlain

[permalink] [raw]
Subject: Re: [RFC] fs: add userspace critical mounts event support

On Tue, Nov 8, 2016 at 2:47 PM, Luis R. Rodriguez <[email protected]> wrote:
> Whatever the outcome of this discussion is -- Johannes seemed to *want*
> to further use the UMH by default on *all* async alls... even if the
> driver did not explicitly requested it -- I'm concerned about this given
> all the above and the existing flip/flop on systemd for it. Whatever
> we try to dream up here, please consider all the above as well.

One addition to this: the current API does not always require the UMH
firmware fallback, for most distributions that do not enable
CONFIG_FW_LOADER_USER_HELPER_FALLBACK but do enable
CONFIG_FW_LOADER_USER_HELPER we only require the UMH firmware fallback
*iff* the driver explicitly requests it. For kernels with
CONFIG_FW_LOADER_USER_HELPER_FALLBACK enabled we *always* use the UMH
fallback. By fallback note that this means its used only if the first
direct filesystem request failed. For further details on complexities
of the UMH refer to two ongoing threads [0] [1] about it.

Johannes, you seemed to note you added some uevent classifier for
async requests, I checked and it seems this was with commit
e9045f9178f3e ("firmware class: export nowait to userspace") was this
the change you were referring to ? Even with all these complexities
annotated, do you still believe we need the UMH always for all async
calls ?

[0] https://lkml.kernel.org/r/[email protected]
[1] https://lkml.kernel.org/r/[email protected]

Luis

2016-11-15 09:30:32

by Johannes Berg

[permalink] [raw]
Subject: Re: [RFC] fs: add userspace critical mounts event support

On Tue, 2016-11-08 at 23:47 +0100, Luis R. Rodriguez wrote:

> This issue still stands. At Plumbers Johannes Berg did indicate to me
> he had a simple elegant solution in mind. He suggested that since the
> usermode helper was available, he had added support to be able to
> differentiate async firmware request calls form sync requests,

For reference:

commit e9045f9178f3e3445a3a5b85206f8681b3869562
Author: Johannes Berg <[email protected]>
Date:   Mon Mar 29 17:57:20 2010 +0200

    firmware class: export nowait to userspace

> it can determine we're initramfs. The semantics issue is the same
> though, is there a generic way to determine we're initramfs ? What if
> we move multiple levels? Anyway -- provided we could figure this out,
> userspace would simply yield and wait until the real rootfs is met.

One way or another we have to have this kind of information somewhere.
I don't actually know how/where though.

> Upon pivot_root() the assumption is all previous udev events pending
> would be re-triggered and finally udev could finally confirm/deny if
> the firmware was present.

The retriggering is already the case, as far as I know, if only to load
modules that weren't part of initramfs.

> note Johannes was asking for *all* async firmware requests to always
> rely on the kernel syfs UMH fallback -- this suggestion is against
> the direction we've been taking to eventually compartamentalize the
> kernel UMH code, so whatever we decide to do, lets please take a
> breather and seriously address this properly *with* systemd folks.

I was saying that because that's the only way you can actually rely on
this functionality as a system integrator. If drivers have to opt in or
can opt out then you'll always end up chasing the drivers around.

My argument basically goes like this:

First, given good drivers (i.e. using request_firmware_nowait())
putting firmware even for a built-in driver into initramfs or not
should be a system integrator decision. If they don't need the device
that early, it should be possible for them to delay it. Or, perhaps, if
the firmware is too big, etc. I'm sure we can all come up with more
examples of why you'd want to do it one way or another.

Second, all of this can be solved in other ways by adding logic to the
kernel, like the rejected proposal to add a "rootfs available" bit
somewhere, that would cause async requests to behave similarly within
the kernel (don't return "not found" until they time out or this bit is
set, and retry loading when the bit gets set)

Third, having this in place can be more friendly to users who play with
kernel compilation, modules, etc. This is a fringe group in some ways,
but it is (was?) actually a relatively common complaint that drivers
built into the kernel wouldn't work - we'd always have to direct users
to do magic steps like rebuilding initramfs with the right options etc.

johannes