Multiple devices may be waiting for firmware with the same name.
In that case we will make them all use the same struct firmware_buf.
When wake up happens make sure it's propagated to all of them.
Signed-off-by: Jakub Kicinski <[email protected]>
---
drivers/base/firmware_class.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c
index ac350c518e0c..c23b58e64b33 100644
--- a/drivers/base/firmware_class.c
+++ b/drivers/base/firmware_class.c
@@ -148,7 +148,7 @@ static void __fw_state_set(struct fw_state *fw_st,
WRITE_ONCE(fw_st->status, status);
if (status == FW_STATUS_DONE || status == FW_STATUS_ABORTED)
- swake_up(&fw_st->wq);
+ swake_up_all(&fw_st->wq);
}
#define fw_state_start(fw_st) \
--
2.11.0
Thank you for your patch!
On Fri, Jun 23, 2017 at 04:37:02PM -0700, Jakub Kicinski wrote:
> Multiple devices may be waiting for firmware with the same name.
This is due to a hidden and not-well understood feature of the firmware API. I
can trace commit logs loosely documenting this as an intended feature by Ming
Lei since commit 1f2b79599ee8f5f ("firmware loader: always let firmware_buf own
the pages buffer") so at least it would seems this is intended functionality.
Unfortunately this feature also has quite a big of bugs which will need to be
addressed after you patch. I'll address first your patch, and then explain the
rest of the issues lingering.
To expedite things I'll re-submit your patch with a different commit log
describing this mechanism a bit better otherwise this fix will would be hard to
understand. Understanding the impact is also key as we want this to be
evaluated for for stable as well! I'd prefer your patch to go in after the
pending stable signal fixes as well.
Proposed alternative commit log:
******
Subject: [PATCH] firmware: fix batched requests - wake all waiters
The firmware cache mechanism serves two purposes, the secondary purpose is
not well documented nor understood. This fixes a regression with the secondary
purpose of the firmware cache mechanism: batched requests.
The firmware cache is used for:
1) Addressing races with file lookups during the suspend/resume cycle
by keeping firmware in memory during the cycle
2) Batched requests for the same file rely only on work from the first file
lookup, which keeps the firmware in memory until the last release_firmware()
is called
Batched requests *only* take effect if secondary requests come in prior to the
first user calling release_firmware(). The devres name used for the internal
firmware cache is used as a hint other pending requests are ongoing, the
firmware buffer data is kept in memory until the last user of the buffer
calls release_firmware(), therefore serializing requests and delaying the
release until all requests are done.
Batched requests wait for a wakup or signal (we only accept SIGKILL now) so we
can rely on the first file fetch to write to the pending secondary requests.
Commit 5b029624948d ("firmware: do not use fw_lock for fw_state protection")
ported the firmware API to use swait, and in doing so failed to convert
complete_all() to swake_up_all() -- it used swake_up(), loosing the ability
for *some* batched requests to take effect.
Without this fix it has been reported plugging in two Intel 6260 Wifi cards
on a system will end up enumerating the two devices only 50% of the time
[0]. The ported swake_up() should have actually two devices, however,
*if more than two cards are used* the swake_up() would not suffice. This
change is only part of the required fixes for batched requests. Subsequent
fixes will follow.
This particular change should fix the cases where more than three requests
with the same firmware name is used, otherwise batched requests will wait for
MAX_SCHEDULE_TIMEOUT and just timeout eventually.
[0] https://bugzilla.kernel.org/show_bug.cgi?id=195477
Fixes: 5b029624948d ("firmware: do not use fw_lock for fw_state protection")
CC: <[email protected]> [4.10+]
******
This was merged on v4.10 and complete_all() was used before older kernels,
so older kernels should not be affected by this particular regression.
> In that case we will make them all use the same struct firmware_buf.
> When wake up happens make sure it's propagated to all of them.
>
> Signed-off-by: Jakub Kicinski <[email protected]>
There's a slew of bugs lurking here though!
As noted the reported Intel driver issues still need other fixes, one was the
fw_state_done() on the direct filesystem lookup mechanism [1], and that may be
a regression since direct filesystem loading was added, and even secondary
requests would seem to just wait forever (MAX_SCHEDULE_TIMEOUT); the combination
of both fixes should fix your reported issue.
Do you intend on submitting those changes as well ? There's still *other* bugs
with this feature though... Knowing if you will follow up with further fixes
will be appreciated.
After this patch things we need then:
0) addressing the remainder of the delta from kernel.org bug 195477 [1]
1) addressing error paths on the 1st request to wake up waiters
2) documenting this hidden feature
3) a test case for this feature
This feature takes effect effect when fw_lookup_and_allocate_buf() returns 1,
when __fw_lookup_buf() finds the firmware requested on the firmware cache list
already. This was designed to only take effect if release_firmware() was not
called before the secondary lookups, as otherwise kref_put() would be called with
the respective freeing of the buffer used for waiting and data.
If the 1st request did not free the buf with kref_put() and __fw_free_buf(),
that means its up to the batched requests to address the release. This should
be OK today if the 2nd request was successful, but on failure we have nothing
freeing the old buf currently. This fix is lower priority due to how rare it
could be, but given we currently always fail even if we were successful on
load on direct fs lookup this issue should have been more common on systems
with more than 2 cards then.
Yet another stable fix.
Also consider the case of the 1st request is still being processed, and batched
requests are in queue. As noted since fw_state_wait() is used we'll wait for
MAX_SCHEDULE_TIMEOUT, but if a failure on the 1st request happens there are
a slew of cases where we do not issue a wake up! So we're missing some sprinkled
fw_state_aborted() on error paths. Because of these issues a failed request
with batched requets pending will just wait for MAX_SCHEDULE_TIMEOUT.
Yet another stable fix.
[1] https://bugzilla.kernel.org/attachment.cgi?id=256493
Luis
> ---
> drivers/base/firmware_class.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c
> index ac350c518e0c..c23b58e64b33 100644
> --- a/drivers/base/firmware_class.c
> +++ b/drivers/base/firmware_class.c
> @@ -148,7 +148,7 @@ static void __fw_state_set(struct fw_state *fw_st,
> WRITE_ONCE(fw_st->status, status);
>
> if (status == FW_STATUS_DONE || status == FW_STATUS_ABORTED)
> - swake_up(&fw_st->wq);
> + swake_up_all(&fw_st->wq);
> }
>
> #define fw_state_start(fw_st) \
From: Jakub Kicinski <[email protected]>
The firmware cache mechanism serves two purposes, the secondary purpose is
not well documented nor understood. This fixes a regression with the secondary
purpose of the firmware cache mechanism: batched requests.
The firmware cache is used for:
1) Addressing races with file lookups during the suspend/resume cycle
by keeping firmware in memory during the cycle
2) Batched requests for the same file rely only on work from the first file
lookup, which keeps the firmware in memory until the last release_firmware()
is called
Batched requests *only* take effect if secondary requests come in prior to the
first user calling release_firmware(). The devres name used for the internal
firmware cache is used as a hint other pending requests are ongoing, the
firmware buffer data is kept in memory until the last user of the buffer
calls release_firmware(), therefore serializing requests and delaying the
release until all requests are done.
Batched requests wait for a wakup or signal (we only accept SIGKILL now) so we
can rely on the first file fetch to write to the pending secondary requests.
Commit 5b029624948d ("firmware: do not use fw_lock for fw_state protection")
ported the firmware API to use swait, and in doing so failed to convert
complete_all() to swake_up_all() -- it used swake_up(), loosing the ability
for *some* batched requests to take effect.
Without this fix it has been reported plugging in two Intel 6260 Wifi cards
on a system will end up enumerating the two devices only 50% of the time
[0]. The ported swake_up() should have actually two devices, however,
*if more than two cards are used* the swake_up() would not suffice. This
change is only part of the required fixes for batched requests. Subsequent
fixes will follow.
This particular change should fix the cases where more than three requests
with the same firmware name is used, otherwise batched requests will wait for
MAX_SCHEDULE_TIMEOUT and just timeout eventually.
[0] https://bugzilla.kernel.org/show_bug.cgi?id=195477
Fixes: 5b029624948d ("firmware: do not use fw_lock for fw_state protection")
CC: <[email protected]> [4.10+]
Cc: Ming Lei <[email protected]>
Signed-off-by: Jakub Kicinski <[email protected]>
[mcgrof: expanded on impact on commit log]
Signed-off-by: Luis R. Rodriguez <[email protected]>
---
Greg, I think it would make sense to queue this in after the signal stable
fixes [1].
[1] https://lkml.kernel.org/r/[email protected]
drivers/base/firmware_class.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/base/firmware_class.c b/drivers/base/firmware_class.c
index b9f907eedbf7..686381a621a0 100644
--- a/drivers/base/firmware_class.c
+++ b/drivers/base/firmware_class.c
@@ -148,7 +148,7 @@ static void __fw_state_set(struct fw_state *fw_st,
WRITE_ONCE(fw_st->status, status);
if (status == FW_STATUS_DONE || status == FW_STATUS_ABORTED)
- swake_up(&fw_st->wq);
+ swake_up_all(&fw_st->wq);
}
#define fw_state_start(fw_st) \
--
2.11.0
On Fri, Jun 23, 2017 at 4:37 PM, Jakub Kicinski
<[email protected]> wrote:
> - swake_up(&fw_st->wq);
> + swake_up_all(&fw_st->wq);
Why is that code using the braindamaed "swait" model in the first place?
That's the real problem here - swait() is a very specialized
interface, and it does not make sense to use it here.
Among all the simplifications it has is exactly the fact that it wakes
up only one thing, because it is *so* specialized.
But the *only* reason for swait is extreme memory issues and some very
special realtime issues, where it saves a couple of bytes in
structures that need close packing, and doesn't even use normal
spinlocks, so it saves a couple of cycles at wakeup/sleep because it
doesn't do a good job in general.
The "avoid normal spinlocks" is because it is meant for code that is
*so* special that it needs the magical low-level raw spinlocks.
I really have *no* idea why the firmware code uses that idiotic
special wait-queue. It has no reason to do so, except this comment
from the commit that added it:
"We use also swait instead of wait because don't need all the additional
features wait provides."
which is bogus, since it clearly just got the waiting wrong exactly
*because* swait is pretty damn bad and specialized.
I think the two valid users are RCU (which needed it for RT), and kvm
(which also needed it for similar issues - it needs to be
non-preemptible).
I don't see any similar reason for the firmware loading, and all it
did was use an odd interface that resulted in this bug.
Why is the firmware code being so damn odd on purpose?
Linus
On Mon, Jun 26, 2017 at 02:44:17PM -0700, Linus Torvalds wrote:
> On Fri, Jun 23, 2017 at 4:37 PM, Jakub Kicinski
> <[email protected]> wrote:
> > - swake_up(&fw_st->wq);
> > + swake_up_all(&fw_st->wq);
>
> Why is that code using the braindamaed "swait" model in the first place?
The conversion was done via commit 5b029624948d ("firmware: do not use fw_lock
for fw_state protection"). This commit overlooked this "batched request"
feature.
> That's the real problem here - swait() is a very specialized
> interface, and it does not make sense to use it here.
>
> Among all the simplifications it has is exactly the fact that it wakes
> up only one thing, because it is *so* specialized.
Not sure I follow, it can wake up all items in queue with swake_up_all(), no?
At first I *thought* we don't add further items with __prepare_to_swait() but
actually that if (list_empty(&wait->task_list)) check just checks if the
item was already added to a wait list. So swait can queue more waiters, no?
> But the *only* reason for swait is extreme memory issues and some very
> special realtime issues, where it saves a couple of bytes in
> structures that need close packing, and doesn't even use normal
> spinlocks, so it saves a couple of cycles at wakeup/sleep because it
> doesn't do a good job in general.
I see.
> The "avoid normal spinlocks" is because it is meant for code that is
> *so* special that it needs the magical low-level raw spinlocks.
>
> I really have *no* idea why the firmware code uses that idiotic
> special wait-queue. It has no reason to do so, except this comment
> from the commit that added it:
>
> "We use also swait instead of wait because don't need all the additional
> features wait provides."
>
> which is bogus, since it clearly just got the waiting wrong exactly
> *because* swait is pretty damn bad and specialized.
If indeed it cannot queue and wake all then surely this is buggered!
> I think the two valid users are RCU (which needed it for RT), and kvm
> (which also needed it for similar issues - it needs to be
> non-preemptible).
Got it.
> I don't see any similar reason for the firmware loading, and all it
> did was use an odd interface that resulted in this bug.
>From my review when this was suggested it seemed to cover all grounds
and an API that was more light weight seemed sensible. The swait documentation
seems to note "One would recommend using this wait queue where possible."
If its non suitable for multiple waits then indeed this needs fixing so
this doesn't happen again.
> Why is the firmware code being so damn odd on purpose?
I've been asking myself that since I started reviewing the firmware code :) But
in this case I believe the commit overlooked batched requests functionality and
to use swake_up_all().
Luis
On Mon, Jun 26, 2017 at 11:20:36PM +0200, Luis R. Rodriguez wrote:
> On Fri, Jun 23, 2017 at 04:37:02PM -0700, Jakub Kicinski wrote:
> There's a slew of bugs lurking here though!
>
> As noted the reported Intel driver issues still need other fixes, one was the
> fw_state_done() on the direct filesystem lookup mechanism [1], and that may be
> a regression since direct filesystem loading was added, and even secondary
> requests would seem to just wait forever (MAX_SCHEDULE_TIMEOUT); the combination
> of both fixes should fix your reported issue.
Actually fw_state_done() is already called on success on direct filesystem loading,
so that should be fine. The bug report proposed change only adds a fw_state_aborted()
in case of a failure on direct fs lookups. That in turn, needs consideration of the
fallback mechanism, ie, only in case of *real* final falure should fw_state_aborte()
be issued. Distros that have enabled the fallback mechanism (seems like andoid now)
have no other option but to wait for the fallback mechanism or timeout to complete.
Its one reason why the firmwared thing is a good thing to review which has the
best-effort mode and final-mode [0].
[0] https://github.com/teg/firmwared.git
Luis
On Mon, Jun 26, 2017 at 4:30 PM, Luis R. Rodriguez <[email protected]> wrote:
>
> On Mon, Jun 26, 2017 at 02:44:17PM -0700, Linus Torvalds wrote:
> >
> > Among all the simplifications it has is exactly the fact that it wakes
> > up only one thing, because it is *so* specialized.
>
> Not sure I follow, it can wake up all items in queue with swake_up_all(), no?
You can, yes.
But it's like using assembly language to build a compiler. Sure, it's
possible, but it's the wrong thing to do.
The thing should just use regular wait/wakeup, which has sane
*default* behavior that people are used to.
> If indeed it cannot queue and wake all then surely this is buggered!
It's not that it cannot, but that it's much more limited than all our
normal waiting support (and it's *meant* for much more limited
situations).
For example, our regular waiting code can handle not just "wake up
one" and "wake up all", it can handle "wake up <n> exclusive waiters,
and whoever isn't exclusive".
And by default, normal wait queues just do the right thing, so you
don't have to specify the exact details of the behavior.
Sure, the firmware loader code doesn't _need_ that, but there also
isn't one of the users who really needs the very limited interface
that has odd semantics and odd default behavior that will trip you up.
The swait interface is so special and so undocumented, that I really
didn't expect anybody to even know about it unless they had very
specific needs, much less use it.
Linus
On Mon, Jun 26, 2017 at 04:43:10PM -0700, Linus Torvalds wrote:
> The swait interface is so special and so undocumented, that I really
> didn't expect anybody to even know about it unless they had very
> specific needs, much less use it.
If swait is really not designed and intended to be used for cases that do not
require all the bells and whistles of wait, and is just very special-case, a
nice big warning about it seems appropriate on swait.h, instead of the
welcoming open armed, "One would recommend using this wait queue where
possible".
>From d751201aebf7ca8acb765284a9017a711ddbe791 Mon Sep 17 00:00:00 2001
From: "Luis R. Rodriguez" <[email protected]>
Date: Mon, 26 Jun 2017 17:06:10 -0700
Subject: [PATCH] swait: annotate swait's special use
Before kernel hipsters start thinking swait is the cool thing to do.
Signed-off-by: Luis R. Rodriguez <[email protected]>
---
include/linux/swait.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/include/linux/swait.h b/include/linux/swait.h
index 4a4e180d0a35..14fcf23cece4 100644
--- a/include/linux/swait.h
+++ b/include/linux/swait.h
@@ -29,7 +29,10 @@
*
* As a side effect of this; the data structures are slimmer.
*
- * One would recommend using this wait queue where possible.
+ * NOTE: swait is for cases of extreme memory considerations and some very
+ * special realtime issues, where it saves a couple of bytes in structures that
+ * need close packing. As such its very special-use. Consider using regular
+ * waits queues from wait.h instead *first*.
*/
struct task_struct;
--
2.11.0
On Mon, 26 Jun 2017 23:20:36 +0200, Luis R. Rodriguez wrote:
> > In that case we will make them all use the same struct firmware_buf.
> > When wake up happens make sure it's propagated to all of them.
> >
> > Signed-off-by: Jakub Kicinski <[email protected]>
>
> There's a slew of bugs lurking here though!
>
> As noted the reported Intel driver issues still need other fixes, one was the
> fw_state_done() on the direct filesystem lookup mechanism [1], and that may be
> a regression since direct filesystem loading was added, and even secondary
> requests would seem to just wait forever (MAX_SCHEDULE_TIMEOUT); the combination
> of both fixes should fix your reported issue.
>
> Do you intend on submitting those changes as well ? There's still *other* bugs
> with this feature though... Knowing if you will follow up with further fixes
> will be appreciated.
No, I don't have any more fixes in my tree right now :) What I'm
looking towards implementing is actually a ability for NICs to load
default FW but then enable users to load different FW on their request.
The problem is that advanced NICs are quite programmable [1] and
depending on use case one may want to load different firmware files.
It's slightly close to the FPGA use case, only with FPGA people don't
expect much plug and play, and with NICs the default mode after boot
must be "look as much as a standard NIC as possible". Then loading
"advanced"/hand crafted firmware can turn more interesting features on.
The FW loading we have now in drivers/net/ethernet/netronome/nfp is
requesting default FW and returning -EPROBE_DEFER if not found. Now I
need to find a way to allow users to "push" whatever advanced FW they
have into the NIC after/during boot.
Current firmware subsystem doesn't seem to cater to this use case to
well. I have to look at the FPGA-related code. The three main
problems to solve are:
- how to stay bound and retry the direct default FW load until rootfs
is mounted (equivalent to when -EPROBE_DEFER would give up);
- how to expose permanent FW loading sysfs interface which won't
disappear after the first -1/1 is written to .../loading;
- how to make sure different cards, which request the same file name
can be served different default firmwares...
Thanks for the improved commit message!
[1] HW links:
https://www.hotchips.org/wp-content/uploads/hc_archives/hc25/HC25.60-Networking-epub/HC25.27.620-22nm-Flow-Proc-Stark-Netronome.pdf
https://www.netronome.com/media/pdfs/NFP_Programming_Model_h6vxM7Y.pdf
http://open-nfp.org/resources/
On Mon, Jun 26, 2017 at 07:10:09PM -0700, Jakub Kicinski wrote:
> On Mon, 26 Jun 2017 23:20:36 +0200, Luis R. Rodriguez wrote:
> > > In that case we will make them all use the same struct firmware_buf.
> > > When wake up happens make sure it's propagated to all of them.
> > >
> > > Signed-off-by: Jakub Kicinski <[email protected]>
> >
> > There's a slew of bugs lurking here though!
> >
> > As noted the reported Intel driver issues still need other fixes, one was the
> > fw_state_done() on the direct filesystem lookup mechanism [1], and that may be
> > a regression since direct filesystem loading was added, and even secondary
> > requests would seem to just wait forever (MAX_SCHEDULE_TIMEOUT); the combination
> > of both fixes should fix your reported issue.
> >
> > Do you intend on submitting those changes as well ? There's still *other* bugs
> > with this feature though... Knowing if you will follow up with further fixes
> > will be appreciated.
>
> No, I don't have any more fixes in my tree right now :)
Ok I can take on the other bits.
> What I'm
> looking towards implementing is actually a ability for NICs to load
> default FW but then enable users to load different FW on their request.
request_firmware_direct() loads optional firmware but this is a sync call. We
don't currently have a similar API for async, we would have gotten this with
the driver data API I wrote, but am now looking forward to Greg advising how to
implement this. But it seems you need more actually, comments below.
> The problem is that advanced NICs are quite programmable [1] and
> depending on use case one may want to load different firmware files.
Right, so in the 802.11 world some devices might use different firmware for
different modes of operation, STA, AP, Mesh, but this is all very protocol
specific, so userspace could tickle the kernel about a mode.
Do your use cases have protocol definitions which can be exposed in userspace?
Or are these just fw variants with different bells and whistles? How man
different use cases are we talking about?
> It's slightly close to the FPGA use case, only with FPGA people don't
> expect much plug and play, and with NICs the default mode after boot
> must be "look as much as a standard NIC as possible". Then loading
> "advanced"/hand crafted firmware can turn more interesting features on.
Makes sense.
> The FW loading we have now in drivers/net/ethernet/netronome/nfp is
> requesting default FW and returning -EPROBE_DEFER if not found.
Oh I see -- right now nfp_nsp_init() is the path that will call the firmware
load via request_firmware() on nfp_net_fw_find(), and if this fails it fails to
find firmware it still returns 0, and the nfp_net_pci_probe() does the
-EPROBE_DEFER handling.
Ugh. This is super hacky, and I realize -EPROBE_DEFER is used for these hacks
folks should stop doing this, specially for this use case given we thought
about it and I believe we have a solution now.
Tom Gundersen and Daniel Wagner worked on a userspace solution to help with
this, it works with two simple modes: best-effort and final-mode. The idea is
the firmwared daemon will be kicked into final-mode once userspace knows the
real rootfs is ready, and this in turn can be used to signal a final
notification that the optional or required firmware is *definitely* not there.
Arend was going to start toying with it, so it would be good to wait for his
feedback.
> Now I
> need to find a way to allow users to "push" whatever advanced FW they
> have into the NIC after/during boot.
Be careful how you do this as you'll have to support it in the driver forever
if you use something like sysfs I think, otherwise you will break some
userspace. However if you use debugfs I think its understood that's loose API.
I'd recommend instead to first see if you can get a mapping of the modes as
specific knobs / tunables through the networking stack, if so then those can
be used as triggers. If not, consider the *features* that are exposed by
the different firmwares and consider their need as triggers for a reload.
How many other devices do the same you do? In what modes?
> Current firmware subsystem doesn't seem to cater to this use case to
> well.
Its a matter of asking and talking. I've provided references of things to
try to address the hacky -EPROBE_DEFER. It does however require a userspace
daemon used, so it does require use of the uevent fallback mechanism.
> I have to look at the FPGA-related code.
Not sure how that would help. Is it huge firmware?
> The three main
> problems to solve are:
> - how to stay bound and retry the direct default FW load until rootfs
> is mounted (equivalent to when -EPROBE_DEFER would give up);
I've thrown a bone for that.
> - how to expose permanent FW loading sysfs interface which won't
> disappear after the first -1/1 is written to .../loading;
The lib/test_firmware.c driver has an example sysfs know a driver could use
on its own to load firmware. This is not as dynamic as you'd want, so I had
implemented an alternative interface which lets you customize hooks in userspace
first and then you just have a sync or async trigger for the test driver
data. It would seem this will not go upstream but you can look at it as an
example of what could be done:
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20170605-driver-data
https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/commit/?h=20170605-driver-data&id=3696afe8d4aba5606dc8f3c562aeae1687f3b53e
But take the warning above about using sysfs serious, you don't want to break
userspace for users, and you want to see if you can first work towards something
more generic with the networking folks.
> - how to make sure different cards, which request the same file name
> can be served different default firmwares...
I believe your patch + the error path fix will handle this now, no?
Luis
>
> Thanks for the improved commit message!
>
> [1] HW links:
> https://www.hotchips.org/wp-content/uploads/hc_archives/hc25/HC25.60-Networking-epub/HC25.27.620-22nm-Flow-Proc-Stark-Netronome.pdf
> https://www.netronome.com/media/pdfs/NFP_Programming_Model_h6vxM7Y.pdf
> http://open-nfp.org/resources/
>
--
Luis Rodriguez, SUSE LINUX GmbH
Maxfeldstrasse 5; D-90409 Nuernberg
On Mon, Jun 26, 2017 at 7:10 PM, Jakub Kicinski
<[email protected]> wrote:
> On Mon, 26 Jun 2017 23:20:36 +0200, Luis R. Rodriguez wrote:
[..]
> - how to stay bound and retry the direct default FW load until rootfs
> is mounted (equivalent to when -EPROBE_DEFER would give up);
If you constrain this problem to only await the mounting of a root
file system you miss the various cases where rootfs is later pivoted
or the firmware isn't stored in the root file system (e.g. every
Android device out there).
Regards,
Bjorn
On Tue, Jun 27, 2017 at 10:48:25AM -0700, Bjorn Andersson wrote:
> On Mon, Jun 26, 2017 at 7:10 PM, Jakub Kicinski
> <[email protected]> wrote:
> > On Mon, 26 Jun 2017 23:20:36 +0200, Luis R. Rodriguez wrote:
> [..]
> > - how to stay bound and retry the direct default FW load until rootfs
> > is mounted (equivalent to when -EPROBE_DEFER would give up);
>
> If you constrain this problem to only await the mounting of a root
> file system you miss the various cases where rootfs is later pivoted
I do believe the firmwared case can work with pivot root, I can't see
why not. In fact it was a case considered from what I can tell
> or the firmware isn't stored in the root file system (e.g. every
> Android device out there).
This was also considered as part of the design. I had particular mentioned
to Tom and Daniel the case of an NVRAM sitting somewhere custom and this
needed to be tossed laster somehow thorugh some custom mechanism.
Let's consider a crazy case where the uevent gets triggered, and userspace goes
and signals Elon Musk somehow to transmit the needed firmware from Mars through
a serial satellite link to earth, and somehow someday the device is finally
ready to upload firmware from userspace. Once Elon's firmware lands home, we
know all needed firmware has arrived so anything missing we can acknowledge now
as missing, so we upload what we can and kick firmward into final-mode to tell
the kernel we know we're really ready and any pending things will have to be
given up.
This would prove the custom fallback crap was also never needed.
I think perhaps one enhancement consideration here may be having the option
for best-effort mode and final-mode be per file, but that's all daemon
specific code.
Luis
On Tue 27 Jun 11:03 PDT 2017, Luis R. Rodriguez wrote:
[..]
> Let's consider a crazy case where the uevent gets triggered, and userspace goes
> and signals Elon Musk somehow to transmit the needed firmware from Mars through
> a serial satellite link to earth, and somehow someday the device is finally
> ready to upload firmware from userspace. Once Elon's firmware lands home, we
> know all needed firmware has arrived so anything missing we can acknowledge now
> as missing, so we upload what we can and kick firmward into final-mode to tell
> the kernel we know we're really ready and any pending things will have to be
> given up.
>
> This would prove the custom fallback crap was also never needed.
>
Are you saying that each kernel driver should be written so that it will
either do direct loading or use firmwared?
>From the previous discussion Linus made it very clear that he expect
this driver to be compiled as a kernel module and Elon to issue a
modprobe once he lands. But IMHO, Elon's decision of using modprobe vs
firmwared should be his to make and not mine as a kernel developer.
Regards,
Bjorn
On Tue, Jun 27, 2017 at 11:59:15AM -0700, Bjorn Andersson wrote:
> On Tue 27 Jun 11:03 PDT 2017, Luis R. Rodriguez wrote:
> [..]
> > Let's consider a crazy case where the uevent gets triggered, and userspace goes
> > and signals Elon Musk somehow to transmit the needed firmware from Mars through
> > a serial satellite link to earth, and somehow someday the device is finally
> > ready to upload firmware from userspace. Once Elon's firmware lands home, we
> > know all needed firmware has arrived so anything missing we can acknowledge now
> > as missing, so we upload what we can and kick firmward into final-mode to tell
> > the kernel we know we're really ready and any pending things will have to be
> > given up.
> >
> > This would prove the custom fallback crap was also never needed.
> >
>
> Are you saying that each kernel driver should be written so that it will
> either do direct loading or use firmwared?
Hell No! You can fork firmwared or use whatever the hell bin-foo you want.
Even if its proprietary and glued with evil rainbow unicorns on it. The dual
mode, best-effort mode and final-mode devices it implemented are key to what
you want to mimic as an example to achieve the goal in question.
Please give that thought / architecture solution a spin and let me know if
it suffices for your needs.
Luis
On Tue 27 Jun 12:08 PDT 2017, Luis R. Rodriguez wrote:
> On Tue, Jun 27, 2017 at 11:59:15AM -0700, Bjorn Andersson wrote:
> > On Tue 27 Jun 11:03 PDT 2017, Luis R. Rodriguez wrote:
> > [..]
> > > Let's consider a crazy case where the uevent gets triggered, and userspace goes
> > > and signals Elon Musk somehow to transmit the needed firmware from Mars through
> > > a serial satellite link to earth, and somehow someday the device is finally
> > > ready to upload firmware from userspace. Once Elon's firmware lands home, we
> > > know all needed firmware has arrived so anything missing we can acknowledge now
> > > as missing, so we upload what we can and kick firmward into final-mode to tell
> > > the kernel we know we're really ready and any pending things will have to be
> > > given up.
> > >
> > > This would prove the custom fallback crap was also never needed.
> > >
> >
> > Are you saying that each kernel driver should be written so that it will
> > either do direct loading or use firmwared?
>
> Hell No! You can fork firmwared or use whatever the hell bin-foo you want.
> Even if its proprietary and glued with evil rainbow unicorns on it. The dual
> mode, best-effort mode and final-mode devices it implemented are key to what
> you want to mimic as an example to achieve the goal in question.
>
I'm sorry but your language is totally inappropriate and the reason why
I tend to stay away from firmware-related discussions.
> Please give that thought / architecture solution a spin and let me know if
> it suffices for your needs.
>
Which solution do you refer to here?
But as I said, in my view, the decision of making the kernel depend on a
user space firmware loading mechanism or direct loading should be that
of the system designer - not the kernel.
Regards,
Bjorn
On Tue, Jun 27, 2017 at 12:52 PM, Bjorn Andersson
<[email protected]> wrote:
> On Tue 27 Jun 12:08 PDT 2017, Luis R. Rodriguez wrote:
>> On Tue, Jun 27, 2017 at 11:59:15AM -0700, Bjorn Andersson wrote:
>> > Are you saying that each kernel driver should be written so that it will
>> > either do direct loading or use firmwared?
>>
>> Hell No! You can fork firmwared or use whatever the hell bin-foo you want.
>> Even if its proprietary and glued with evil rainbow unicorns on it. The dual
>> mode, best-effort mode and final-mode devices it implemented are key to what
>> you want to mimic as an example to achieve the goal in question.
>>
>
> I'm sorry but your language is totally inappropriate and the reason why
> I tend to stay away from firmware-related discussions.
I'm sorry if I offended you, the goal here was to use an exaggerated
example of what anyone could use to draw in firmware into the kernel.
>> Please give that thought / architecture solution a spin and let me know if
>> it suffices for your needs.
>>
>
> Which solution do you refer to here?
The model of a best-effort and final-mode.
> But as I said, in my view, the decision of making the kernel depend on a
> user space firmware loading mechanism or direct loading should be that
> of the system designer - not the kernel.
You get that freedom today. The fallback mechanism *allows* files to
be fetched in whatever way possible, by issuing a uevent, its up to
userspace to figure out how to gather that and toss it back into the
kernel using the sysfs interface. The firemward daemon is nothing but
an example *new* daemon which uses a model to address the rootfs /
pivot root dilema, as only userspace can know when userspace *is
ready*.
So its not clear to me yet what your grudge with using firmwared with
this new model is exactly. Are you saying you want to *require* the
fallback mechanism from the start, so just skipping the direct FS
lookup ? That would be a new feature request, and we can certainly
consider it, but I'll need Greg to clue me in first on how he'd prefer
an API evolution for new features.
Luis
On Tue, 27 Jun 2017 18:39:42 +0200, Luis R. Rodriguez wrote:
> On Mon, Jun 26, 2017 at 07:10:09PM -0700, Jakub Kicinski wrote:
> > On Mon, 26 Jun 2017 23:20:36 +0200, Luis R. Rodriguez wrote:
> > > > In that case we will make them all use the same struct firmware_buf.
> > > > When wake up happens make sure it's propagated to all of them.
> > > >
> > > > Signed-off-by: Jakub Kicinski <[email protected]>
> > >
> > > There's a slew of bugs lurking here though!
> > >
> > > As noted the reported Intel driver issues still need other fixes, one was the
> > > fw_state_done() on the direct filesystem lookup mechanism [1], and that may be
> > > a regression since direct filesystem loading was added, and even secondary
> > > requests would seem to just wait forever (MAX_SCHEDULE_TIMEOUT); the combination
> > > of both fixes should fix your reported issue.
> > >
> > > Do you intend on submitting those changes as well ? There's still *other* bugs
> > > with this feature though... Knowing if you will follow up with further fixes
> > > will be appreciated.
> >
> > No, I don't have any more fixes in my tree right now :)
>
> Ok I can take on the other bits.
>
> > What I'm
> > looking towards implementing is actually a ability for NICs to load
> > default FW but then enable users to load different FW on their request.
>
> request_firmware_direct() loads optional firmware but this is a sync call. We
> don't currently have a similar API for async, we would have gotten this with
> the driver data API I wrote, but am now looking forward to Greg advising how to
> implement this. But it seems you need more actually, comments below.
>
> > The problem is that advanced NICs are quite programmable [1] and
> > depending on use case one may want to load different firmware files.
>
> Right, so in the 802.11 world some devices might use different firmware for
> different modes of operation, STA, AP, Mesh, but this is all very protocol
> specific, so userspace could tickle the kernel about a mode.
>
> Do your use cases have protocol definitions which can be exposed in userspace?
> Or are these just fw variants with different bells and whistles? How man
> different use cases are we talking about?
Right now we have three modes that come from Netronome itself, a "basic
NIC" one, and two advanced for TC flower/Open vSwitch acceleration and
for eBPF offload. I was hoping some enumeration scheme could work here,
but I really can't come up with one. People can download the SDK and
write a FW with their own offloads, bells and whistles, I feel like they
should be able to load that with the upstream kernel and minimal
effort :(
> > It's slightly close to the FPGA use case, only with FPGA people don't
> > expect much plug and play, and with NICs the default mode after boot
> > must be "look as much as a standard NIC as possible". Then loading
> > "advanced"/hand crafted firmware can turn more interesting features on.
>
> Makes sense.
>
> > The FW loading we have now in drivers/net/ethernet/netronome/nfp is
> > requesting default FW and returning -EPROBE_DEFER if not found.
>
> Oh I see -- right now nfp_nsp_init() is the path that will call the firmware
> load via request_firmware() on nfp_net_fw_find(), and if this fails it fails to
> find firmware it still returns 0, and the nfp_net_pci_probe() does the
> -EPROBE_DEFER handling.
>
> Ugh. This is super hacky, and I realize -EPROBE_DEFER is used for these hacks
> folks should stop doing this, specially for this use case given we thought
> about it and I believe we have a solution now.
>
> Tom Gundersen and Daniel Wagner worked on a userspace solution to help with
> this, it works with two simple modes: best-effort and final-mode. The idea is
> the firmwared daemon will be kicked into final-mode once userspace knows the
> real rootfs is ready, and this in turn can be used to signal a final
> notification that the optional or required firmware is *definitely* not there.
>
> Arend was going to start toying with it, so it would be good to wait for his
> feedback.
Haha, yes the -EPROBE_DEFER is definitely a hack. I was trying to do
the right thing, but then I found out that systemd dropped the support
for FW uevents and major distributions don't even build the fallback in.
To be honest waiting for rootfs to be available is lower on my list of
priorities, but it's definitely nice to have. I also don't care about
supporting more complex rootfs setups, simply trying whatever comes
after initramfs covers 99.9% use cases. 0.1% can load the FW manually/
rebind the driver IMHO.
> > Now I
> > need to find a way to allow users to "push" whatever advanced FW they
> > have into the NIC after/during boot.
>
> Be careful how you do this as you'll have to support it in the driver forever
> if you use something like sysfs I think, otherwise you will break some
> userspace. However if you use debugfs I think its understood that's loose API.
Unfortunately the netdev community does not like debugfs. I would
prefer to extend the firmware subsystem if possible and use the
existing sysfs interface, just in a new "mode".
> I'd recommend instead to first see if you can get a mapping of the modes as
> specific knobs / tunables through the networking stack, if so then those can
> be used as triggers. If not, consider the *features* that are exposed by
> the different firmwares and consider their need as triggers for a reload.
> How many other devices do the same you do? In what modes?
There are a number of NPU-like products. Also the ability to parse
packets in arbitrary ways and look things up in TCAMs is expanding in
simpler hardware. It's hard to express parsing graph as a set of
features, especially that nowadays crafting custom protocols seems to
be in vogue.
> > Current firmware subsystem doesn't seem to cater to this use case to
> > well.
>
> Its a matter of asking and talking. I've provided references of things to
> try to address the hacky -EPROBE_DEFER. It does however require a userspace
> daemon used, so it does require use of the uevent fallback mechanism.
Do you know how systemd developers feel about the issue (CCed)? Given
that it seems to dominate in data center OSes now I'm slightly worried
having to push Big Linux Vendors to package some seemingly
embedded-centric software just to make advanced NICs run :(
> > I have to look at the FPGA-related code.
>
> Not sure how that would help. Is it huge firmware?
Up to a few megabytes. You are right, the major feature of FPGA API
doesn't interest us...
> > The three main
> > problems to solve are:
> > - how to stay bound and retry the direct default FW load until rootfs
> > is mounted (equivalent to when -EPROBE_DEFER would give up);
>
> I've thrown a bone for that.
Thanks.
> > - how to expose permanent FW loading sysfs interface which won't
> > disappear after the first -1/1 is written to .../loading;
>
> The lib/test_firmware.c driver has an example sysfs know a driver could use
> on its own to load firmware. This is not as dynamic as you'd want, so I had
> implemented an alternative interface which lets you customize hooks in userspace
> first and then you just have a sync or async trigger for the test driver
> data. It would seem this will not go upstream but you can look at it as an
> example of what could be done:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20170605-driver-data
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/commit/?h=20170605-driver-data&id=3696afe8d4aba5606dc8f3c562aeae1687f3b53e
>
> But take the warning above about using sysfs serious, you don't want to break
> userspace for users, and you want to see if you can first work towards something
> more generic with the networking folks.
Thanks for the pointers!
> > - how to make sure different cards, which request the same file name
> > can be served different default firmwares...
>
> I believe your patch + the error path fix will handle this now, no?
I'm not sure. I think it would work if I set FW_OPT_NOCACHE, though.
I need to test that.
Thanks a lot for the comments!
On Tue, Jun 27, 2017 at 02:25:53PM -0700, Jakub Kicinski wrote:
> On Tue, 27 Jun 2017 18:39:42 +0200, Luis R. Rodriguez wrote:
> >
> > > The problem is that advanced NICs are quite programmable [1] and
> > > depending on use case one may want to load different firmware files.
> >
> > Right, so in the 802.11 world some devices might use different firmware for
> > different modes of operation, STA, AP, Mesh, but this is all very protocol
> > specific, so userspace could tickle the kernel about a mode.
> >
> > Do your use cases have protocol definitions which can be exposed in userspace?
> > Or are these just fw variants with different bells and whistles? How man
> > different use cases are we talking about?
>
> Right now we have three modes that come from Netronome itself, a "basic
> NIC" one, and two advanced for TC flower/Open vSwitch acceleration and
> for eBPF offload. I was hoping some enumeration scheme could work here,
> but I really can't come up with one.
How about just supporting 3 firmware names, with the first two being optional,
but if found one of those two is found it would use that one. Then only if
both of these are not present would a default be looked for and used?
In terms of interface, a simple symlink / renaming scheme would suffice to
support this. No custom hooks at all.
> To be honest waiting for rootfs to be available is lower on my list of
> priorities, but it's definitely nice to have. I also don't care about
> supporting more complex rootfs setups, simply trying whatever comes
> after initramfs covers 99.9% use cases. 0.1% can load the FW manually/
> rebind the driver IMHO.
Right, the only issue with firmwared is you'd expect folks would have
it deployed, and that's not the case today, but when you do control the
ecosystem you can certainly use it as an option. Let me know if you
do try it out.
> > Be careful how you do this as you'll have to support it in the driver forever
> > if you use something like sysfs I think, otherwise you will break some
> > userspace. However if you use debugfs I think its understood that's loose API.
>
> Unfortunately the netdev community does not like debugfs. I would
> prefer to extend the firmware subsystem if possible and use the
> existing sysfs interface, just in a new "mode".
I don't think this is required, you can simply use different filenames as
noted above.
> > > Current firmware subsystem doesn't seem to cater to this use case to
> > > well.
> >
> > Its a matter of asking and talking. I've provided references of things to
> > try to address the hacky -EPROBE_DEFER. It does however require a userspace
> > daemon used, so it does require use of the uevent fallback mechanism.
>
> Do you know how systemd developers feel about the issue (CCed)? Given
> that it seems to dominate in data center OSes now I'm slightly worried
> having to push Big Linux Vendors to package some seemingly
> embedded-centric software just to make advanced NICs run :(
firmwared was written by a systemd developer :)
I think it was first packaged into systemd, and then it was split out to
help those who want it external.
> > > - how to make sure different cards, which request the same file name
> > > can be served different default firmwares...
> >
> > I believe your patch + the error path fix will handle this now, no?
>
> I'm not sure. I think it would work if I set FW_OPT_NOCACHE, though.
> I need to test that.
Why do you need FW_OPT_NOCACHE?
Luis
On Wed, 28 Jun 2017 00:24:19 +0200, Luis R. Rodriguez wrote:
> On Tue, Jun 27, 2017 at 02:25:53PM -0700, Jakub Kicinski wrote:
> > On Tue, 27 Jun 2017 18:39:42 +0200, Luis R. Rodriguez wrote:
> > >
> > > > The problem is that advanced NICs are quite programmable [1] and
> > > > depending on use case one may want to load different firmware files.
> > >
> > > Right, so in the 802.11 world some devices might use different firmware for
> > > different modes of operation, STA, AP, Mesh, but this is all very protocol
> > > specific, so userspace could tickle the kernel about a mode.
> > >
> > > Do your use cases have protocol definitions which can be exposed in userspace?
> > > Or are these just fw variants with different bells and whistles? How man
> > > different use cases are we talking about?
> >
> > Right now we have three modes that come from Netronome itself, a "basic
> > NIC" one, and two advanced for TC flower/Open vSwitch acceleration and
> > for eBPF offload. I was hoping some enumeration scheme could work here,
> > but I really can't come up with one.
>
> How about just supporting 3 firmware names, with the first two being optional,
> but if found one of those two is found it would use that one. Then only if
> both of these are not present would a default be looked for and used?
>
> In terms of interface, a simple symlink / renaming scheme would suffice to
> support this. No custom hooks at all.
That's what we do today as a stop-gap solution :) (BTW mkinitrd
doesn't seem to like symlinks so I use hard links.)
The issue is if you have two identical cards, how to load different FW
on them? We would have to encode the serial number into the firmware
name or the pci_name() at least. The nice thing about this solution is
that it requires almost zero code on the kernel size... hm...
On Tue, Jun 27, 2017 at 3:39 PM, Jakub Kicinski
<[email protected]> wrote:
> On Wed, 28 Jun 2017 00:24:19 +0200, Luis R. Rodriguez wrote:
>> On Tue, Jun 27, 2017 at 02:25:53PM -0700, Jakub Kicinski wrote:
>> > On Tue, 27 Jun 2017 18:39:42 +0200, Luis R. Rodriguez wrote:
>> > >
>> > > > The problem is that advanced NICs are quite programmable [1] and
>> > > > depending on use case one may want to load different firmware files.
>> > >
>> > > Right, so in the 802.11 world some devices might use different firmware for
>> > > different modes of operation, STA, AP, Mesh, but this is all very protocol
>> > > specific, so userspace could tickle the kernel about a mode.
>> > >
>> > > Do your use cases have protocol definitions which can be exposed in userspace?
>> > > Or are these just fw variants with different bells and whistles? How man
>> > > different use cases are we talking about?
>> >
>> > Right now we have three modes that come from Netronome itself, a "basic
>> > NIC" one, and two advanced for TC flower/Open vSwitch acceleration and
>> > for eBPF offload. I was hoping some enumeration scheme could work here,
>> > but I really can't come up with one.
>>
>> How about just supporting 3 firmware names, with the first two being optional,
>> but if found one of those two is found it would use that one. Then only if
>> both of these are not present would a default be looked for and used?
>>
>> In terms of interface, a simple symlink / renaming scheme would suffice to
>> support this. No custom hooks at all.
>
> That's what we do today as a stop-gap solution :) (BTW mkinitrd
> doesn't seem to like symlinks so I use hard links.)
I see only 1 request_firmware() request on your driver and its not optional?
What I was suggesting was have 3 separate requests. Two optional files
which you hunt for first, and if either of them are present you use
them. If both are missing you go with the default.
> The issue is if you have two identical cards, how to load different FW
> on them? We would have to encode the serial number into the firmware
> name or the pci_name() at least. The nice thing about this solution is
> that it requires almost zero code on the kernel size... hm...
Yeah, that could work -- ie, for each of the 3 mode firmware you
actually *first* look for the pci_name() derivative first, if any of
them is found use that. Otherwise you then move on to the non
pci_name() varaints with the same logic. The last one would be the
default Fisher-Price® one.
Anyway -- I still think *long term* this is best ironed out by
identifying the modes and seeing if perhaps networking folks might
come to some agreement on these as different modes, then your driver
can have respective hooks for the different modes based on some user
specified mode flip or requirement.
If this is not possible... a sysfs know interface is always possible.
Luis
On Wed, 28.06.17 00:24, Luis R. Rodriguez ([email protected]) wrote:
> > Do you know how systemd developers feel about the issue (CCed)? Given
> > that it seems to dominate in data center OSes now I'm slightly worried
> > having to push Big Linux Vendors to package some seemingly
> > embedded-centric software just to make advanced NICs run :(
>
> firmwared was written by a systemd developer :)
No it wasn't. I don't know what firmwared is really. Sorry.
> I think it was first packaged into systemd, and then it was split out to
> help those who want it external.
Certainly not. I'd sure know about that. ;-)
Lennart
--
Lennart Poettering, Red Hat
On Tue, 27 Jun 2017, Luis R. Rodriguez wrote:
>diff --git a/include/linux/swait.h b/include/linux/swait.h
>index 4a4e180d0a35..14fcf23cece4 100644
>--- a/include/linux/swait.h
>+++ b/include/linux/swait.h
>@@ -29,7 +29,10 @@
> *
> * As a side effect of this; the data structures are slimmer.
> *
>- * One would recommend using this wait queue where possible.
So I think this was added due to the smaller footprint and fewer
cycles that swait has compared to the traditional (bulkier)
waitqueues. While probably not worth it, I guess we could offer
super-simple waitqueues (sswait? :-) which do not have the rt caveats
and uses a regular spinlock. The wakeup_all() call would not drop
the lock upon every wakeup as we are stripping the waitqueue not
for determinism, but for overhead. To mitigate this, we might
also want to use wake_q for reduced hold q->lock hold times.
But I don't think its worth yet another wait interface.
Alternatively, it crossed my mind we could also have wakeup_all()
use in the regular waitqueues, but I'd have to audit all the
current users to make sure we could actually do this.
Thanks,
Davidlohr
On Wed, Jun 28, 2017 at 06:45:14AM -0700, Davidlohr Bueso wrote:
> On Tue, 27 Jun 2017, Luis R. Rodriguez wrote:
>
> > diff --git a/include/linux/swait.h b/include/linux/swait.h
> > index 4a4e180d0a35..14fcf23cece4 100644
> > --- a/include/linux/swait.h
> > +++ b/include/linux/swait.h
> > @@ -29,7 +29,10 @@
> > *
> > * As a side effect of this; the data structures are slimmer.
> > *
> > - * One would recommend using this wait queue where possible.
>
> So I think this was added due to the smaller footprint and fewer
> cycles that swait has compared to the traditional (bulkier)
> waitqueues. While probably not worth it, I guess we could offer
> super-simple waitqueues (sswait? :-) which do not have the rt caveats
> and uses a regular spinlock. The wakeup_all() call would not drop
> the lock upon every wakeup as we are stripping the waitqueue not
> for determinism, but for overhead. To mitigate this, we might
> also want to use wake_q for reduced hold q->lock hold times.
>
> But I don't think its worth yet another wait interface.
> Alternatively, it crossed my mind we could also have wakeup_all()
> use in the regular waitqueues, but I'd have to audit all the
> current users to make sure we could actually do this.
But this open-welcoming invite for swait then, should it go?
Luis
On Wed, Jun 28, 2017 at 12:06 AM, Lennart Poettering
<[email protected]> wrote:
> On Wed, 28.06.17 00:24, Luis R. Rodriguez ([email protected]) wrote:
>
>> > Do you know how systemd developers feel about the issue (CCed)? Given
>> > that it seems to dominate in data center OSes now I'm slightly worried
>> > having to push Big Linux Vendors to package some seemingly
>> > embedded-centric software just to make advanced NICs run :(
>>
>> firmwared was written by a systemd developer :)
>
> No it wasn't. I don't know what firmwared is really. Sorry.
Is Tom Gundersen not a systemd developer?
>> I think it was first packaged into systemd, and then it was split out to
>> help those who want it external.
>
> Certainly not. I'd sure know about that. ;-)
Sorry I may have confused 'intended to be at first'. Tom and Daniel
can elaborate.
Luis
On Wed, 28.06.17 09:06, Luis R. Rodriguez ([email protected]) wrote:
> On Wed, Jun 28, 2017 at 12:06 AM, Lennart Poettering
> <[email protected]> wrote:
> > On Wed, 28.06.17 00:24, Luis R. Rodriguez ([email protected]) wrote:
> >
> >> > Do you know how systemd developers feel about the issue (CCed)? Given
> >> > that it seems to dominate in data center OSes now I'm slightly worried
> >> > having to push Big Linux Vendors to package some seemingly
> >> > embedded-centric software just to make advanced NICs run :(
> >>
> >> firmwared was written by a systemd developer :)
> >
> > No it wasn't. I don't know what firmwared is really. Sorry.
>
> Is Tom Gundersen not a systemd developer?
Not really anymore, and "firmwared" is an effort independent of
systemd, never was part of it, and while I heard Tom was working on
this I was not aware of the project's naming or anything else...
Lennart
--
Lennart Poettering, Red Hat
On Wed, Jun 28, 2017 at 06:21:09PM +0200, Lennart Poettering wrote:
> On Wed, 28.06.17 09:06, Luis R. Rodriguez ([email protected]) wrote:
>
> > On Wed, Jun 28, 2017 at 12:06 AM, Lennart Poettering
> > <[email protected]> wrote:
> > > On Wed, 28.06.17 00:24, Luis R. Rodriguez ([email protected]) wrote:
> > >
> > >> > Do you know how systemd developers feel about the issue (CCed)? Given
> > >> > that it seems to dominate in data center OSes now I'm slightly worried
> > >> > having to push Big Linux Vendors to package some seemingly
> > >> > embedded-centric software just to make advanced NICs run :(
> > >>
> > >> firmwared was written by a systemd developer :)
> > >
> > > No it wasn't. I don't know what firmwared is really. Sorry.
> >
> > Is Tom Gundersen not a systemd developer?
>
> Not really anymore, and "firmwared" is an effort independent of
> systemd, never was part of it, and while I heard Tom was working on
> this I was not aware of the project's naming or anything else...
Alright, thanks for the clarifications and sorry for the confusion!
In that case firmwared remains *just* an architecture example of an alternative
to the problem of looking for firmware through a *fallback mechanism* and
addressing "is my real rootfs mounted yet" problem some folks have struggled to
resolve, "are we sure we're ready to look for all firmware?".
Lennart, if you have a better architectural suggestion let us know.
I realize that the firmware fallback mechanism was ripped out of systemd long
ago, specifically as of systemd commit be2ea723b1d0 (“udev: remove userspace
firmware loading support”) as of v217 on August, 2014. This means most Linux
distributions today are not using or taking advantage of the firmware fallback
mechanism provided by kobject uevents.
This is specially exacerbated due to the fact that most distributions today
disable CONFIG_FW_LOADER_USER_HELPER_FALLBACK which *means* only the custom
fallback mechanism (no uevents are issued) can be used on those distributions,
and we actually want to *avoid* having more drivers use that mechanism. Only
2 drivers remain upstream now which explicitly require the custom fallback
mechanism. We don't to add any more.
This leaves distributions that want a fallback mechanism today only with the
option to enable CONFIG_FW_LOADER_USER_HELPER_FALLBACK and rely on uevents, and
firmwared was an architectural example of how to address the rootfs problem.
Android is enabling CONFIG_FW_LOADER_USER_HELPER_FALLBACK these days it seems.
If it helps the fallback mechanism is now documented here:
https://www.kernel.org/doc/html/latest/driver-api/firmware/fallback-mechanisms.html
The concept of firmwared was simple: it had best-effort mode and final-mode.
It relies on CONFIG_FW_LOADER_USER_HELPER_FALLBACK=y and relies on uevents.
You boot with it on best-effort mode where firmware is hunted for in a best
effort way, but it does not fail a load through the kernel's sysfs interface
used for the fallback mechanism. Then once userspace knows we have reached the
real rootfs (since only it knows when this happens) it kicks firmwared into
final-mode, which in turn can now iterate over pending firmware and send a "not
found" with certainty.
So the focus for now is ironing out something that we know works *very well*
for the CONFIG_FW_LOADER_USER_HELPER_FALLBACK=y folks.
It'd be great if we had a solution that could work for
CONFIG_FW_LOADER_USER_HELPER_FALLBACK=n folks but its unclear if that's
possible, so it may be best to only revisit this if and when we know for sure
CONFIG_FW_LOADER_USER_HELPER_FALLBACK=y and the rootfs issue is properly ironed
out with the uevents fallback mechanism.
Luis
Hi Luis!
On Wed, 28 Jun 2017, Luis R. Rodriguez wrote:
>On Wed, Jun 28, 2017 at 06:45:14AM -0700, Davidlohr Bueso wrote:
>> On Tue, 27 Jun 2017, Luis R. Rodriguez wrote:
>>
>> > diff --git a/include/linux/swait.h b/include/linux/swait.h
>> > index 4a4e180d0a35..14fcf23cece4 100644
>> > --- a/include/linux/swait.h
>> > +++ b/include/linux/swait.h
>> > @@ -29,7 +29,10 @@
>> > *
>> > * As a side effect of this; the data structures are slimmer.
>> > *
>> > - * One would recommend using this wait queue where possible.
>>
>> So I think this was added due to the smaller footprint and fewer
>> cycles that swait has compared to the traditional (bulkier)
>> waitqueues. While probably not worth it, I guess we could offer
>> super-simple waitqueues (sswait? :-) which do not have the rt caveats
>> and uses a regular spinlock. The wakeup_all() call would not drop
>> the lock upon every wakeup as we are stripping the waitqueue not
>> for determinism, but for overhead. To mitigate this, we might
>> also want to use wake_q for reduced hold q->lock hold times.
>>
>> But I don't think its worth yet another wait interface.
>> Alternatively, it crossed my mind we could also have wakeup_all()
>> use in the regular waitqueues, but I'd have to audit all the
>> current users to make sure we could actually do this.
>
>But this open-welcoming invite for swait then, should it go?
I have nothing against removing it.
Thanks,
Davidlohr
On Mon, Jun 26, 2017 at 02:23:12PM -0700, Luis R. Rodriguez wrote:
> From: Jakub Kicinski <[email protected]>
>
> The firmware cache mechanism serves two purposes, the secondary purpose is
> not well documented nor understood. This fixes a regression with the secondary
> purpose of the firmware cache mechanism: batched requests.
>
> The firmware cache is used for:
>
> 1) Addressing races with file lookups during the suspend/resume cycle
> by keeping firmware in memory during the cycle
>
> 2) Batched requests for the same file rely only on work from the first file
> lookup, which keeps the firmware in memory until the last release_firmware()
> is called
>
> Batched requests *only* take effect if secondary requests come in prior to the
> first user calling release_firmware(). The devres name used for the internal
> firmware cache is used as a hint other pending requests are ongoing, the
> firmware buffer data is kept in memory until the last user of the buffer
> calls release_firmware(), therefore serializing requests and delaying the
> release until all requests are done.
>
> Batched requests wait for a wakup or signal (we only accept SIGKILL now) so we
> can rely on the first file fetch to write to the pending secondary requests.
> Commit 5b029624948d ("firmware: do not use fw_lock for fw_state protection")
> ported the firmware API to use swait, and in doing so failed to convert
> complete_all() to swake_up_all() -- it used swake_up(), loosing the ability
> for *some* batched requests to take effect.
>
> Without this fix it has been reported plugging in two Intel 6260 Wifi cards
> on a system will end up enumerating the two devices only 50% of the time
> [0]. The ported swake_up() should have actually two devices, however,
> *if more than two cards are used* the swake_up() would not suffice. This
> change is only part of the required fixes for batched requests. Subsequent
> fixes will follow.
>
> This particular change should fix the cases where more than three requests
> with the same firmware name is used, otherwise batched requests will wait for
> MAX_SCHEDULE_TIMEOUT and just timeout eventually.
>
> [0] https://bugzilla.kernel.org/show_bug.cgi?id=195477
>
> Fixes: 5b029624948d ("firmware: do not use fw_lock for fw_state protection")
> CC: <[email protected]> [4.10+]
> Cc: Ming Lei <[email protected]>
> Signed-off-by: Jakub Kicinski <[email protected]>
> [mcgrof: expanded on impact on commit log]
> Signed-off-by: Luis R. Rodriguez <[email protected]>
> ---
>
> Greg, I think it would make sense to queue this in after the signal stable
> fixes [1].
As I just dropped them, can you redo this based on Linus's tree now?
thanks,
greg k-h
On Thu, Jun 29, 2017 at 05:16:41PM +0200, Greg KH wrote:
> On Mon, Jun 26, 2017 at 02:23:12PM -0700, Luis R. Rodriguez wrote:
> > From: Jakub Kicinski <[email protected]>
> >
> > The firmware cache mechanism serves two purposes, the secondary purpose is
> > not well documented nor understood. This fixes a regression with the secondary
> > purpose of the firmware cache mechanism: batched requests.
> >
> > The firmware cache is used for:
> >
> > 1) Addressing races with file lookups during the suspend/resume cycle
> > by keeping firmware in memory during the cycle
> >
> > 2) Batched requests for the same file rely only on work from the first file
> > lookup, which keeps the firmware in memory until the last release_firmware()
> > is called
> >
> > Batched requests *only* take effect if secondary requests come in prior to the
> > first user calling release_firmware(). The devres name used for the internal
> > firmware cache is used as a hint other pending requests are ongoing, the
> > firmware buffer data is kept in memory until the last user of the buffer
> > calls release_firmware(), therefore serializing requests and delaying the
> > release until all requests are done.
> >
> > Batched requests wait for a wakup or signal (we only accept SIGKILL now) so we
> > can rely on the first file fetch to write to the pending secondary requests.
> > Commit 5b029624948d ("firmware: do not use fw_lock for fw_state protection")
> > ported the firmware API to use swait, and in doing so failed to convert
> > complete_all() to swake_up_all() -- it used swake_up(), loosing the ability
> > for *some* batched requests to take effect.
> >
> > Without this fix it has been reported plugging in two Intel 6260 Wifi cards
> > on a system will end up enumerating the two devices only 50% of the time
> > [0]. The ported swake_up() should have actually two devices, however,
> > *if more than two cards are used* the swake_up() would not suffice. This
> > change is only part of the required fixes for batched requests. Subsequent
> > fixes will follow.
> >
> > This particular change should fix the cases where more than three requests
> > with the same firmware name is used, otherwise batched requests will wait for
> > MAX_SCHEDULE_TIMEOUT and just timeout eventually.
> >
> > [0] https://bugzilla.kernel.org/show_bug.cgi?id=195477
> >
> > Fixes: 5b029624948d ("firmware: do not use fw_lock for fw_state protection")
> > CC: <[email protected]> [4.10+]
> > Cc: Ming Lei <[email protected]>
> > Signed-off-by: Jakub Kicinski <[email protected]>
> > [mcgrof: expanded on impact on commit log]
> > Signed-off-by: Luis R. Rodriguez <[email protected]>
> > ---
> >
> > Greg, I think it would make sense to queue this in after the signal stable
> > fixes [1].
>
> As I just dropped them, can you redo this based on Linus's tree now?
Oh nevermind, it does apply to that tree now. Wait, what am I supposed
to do here?
confused,
greg k-h
On Thu, Jun 29, 2017 at 05:17:15PM +0200, Greg KH wrote:
> On Thu, Jun 29, 2017 at 05:16:41PM +0200, Greg KH wrote:
> > On Mon, Jun 26, 2017 at 02:23:12PM -0700, Luis R. Rodriguez wrote:
> > > From: Jakub Kicinski <[email protected]>
> > >
> > > The firmware cache mechanism serves two purposes, the secondary purpose is
> > > not well documented nor understood. This fixes a regression with the secondary
> > > purpose of the firmware cache mechanism: batched requests.
> > >
> > > The firmware cache is used for:
> > >
> > > 1) Addressing races with file lookups during the suspend/resume cycle
> > > by keeping firmware in memory during the cycle
> > >
> > > 2) Batched requests for the same file rely only on work from the first file
> > > lookup, which keeps the firmware in memory until the last release_firmware()
> > > is called
> > >
> > > Batched requests *only* take effect if secondary requests come in prior to the
> > > first user calling release_firmware(). The devres name used for the internal
> > > firmware cache is used as a hint other pending requests are ongoing, the
> > > firmware buffer data is kept in memory until the last user of the buffer
> > > calls release_firmware(), therefore serializing requests and delaying the
> > > release until all requests are done.
> > >
> > > Batched requests wait for a wakup or signal (we only accept SIGKILL now) so we
> > > can rely on the first file fetch to write to the pending secondary requests.
> > > Commit 5b029624948d ("firmware: do not use fw_lock for fw_state protection")
> > > ported the firmware API to use swait, and in doing so failed to convert
> > > complete_all() to swake_up_all() -- it used swake_up(), loosing the ability
> > > for *some* batched requests to take effect.
> > >
> > > Without this fix it has been reported plugging in two Intel 6260 Wifi cards
> > > on a system will end up enumerating the two devices only 50% of the time
> > > [0]. The ported swake_up() should have actually two devices, however,
> > > *if more than two cards are used* the swake_up() would not suffice. This
> > > change is only part of the required fixes for batched requests. Subsequent
> > > fixes will follow.
> > >
> > > This particular change should fix the cases where more than three requests
> > > with the same firmware name is used, otherwise batched requests will wait for
> > > MAX_SCHEDULE_TIMEOUT and just timeout eventually.
> > >
> > > [0] https://bugzilla.kernel.org/show_bug.cgi?id=195477
> > >
> > > Fixes: 5b029624948d ("firmware: do not use fw_lock for fw_state protection")
> > > CC: <[email protected]> [4.10+]
> > > Cc: Ming Lei <[email protected]>
> > > Signed-off-by: Jakub Kicinski <[email protected]>
> > > [mcgrof: expanded on impact on commit log]
> > > Signed-off-by: Luis R. Rodriguez <[email protected]>
> > > ---
> > >
> > > Greg, I think it would make sense to queue this in after the signal stable
> > > fixes [1].
> >
> > As I just dropped them, can you redo this based on Linus's tree now?
>
> Oh nevermind, it does apply to that tree now. Wait, what am I supposed
> to do here?
>
> confused,
Yea.. and moving way from swait actually means this should be applied first,
just that this patch becomes a port back away from swait. The signal series of
fixes are patches which should be applied even before swait.
Since this can be confusing I'm just going to fold this patch into the other larger
series. That should make order clear.
Luis
On Tue, 27 Jun 2017, Luis R. Rodriguez wrote:
> * As a side effect of this; the data structures are slimmer.
> *
>- * One would recommend using this wait queue where possible.
>+ * NOTE: swait is for cases of extreme memory considerations and some very
>+ * special realtime issues, where it saves a couple of bytes in structures that
>+ * need close packing. As such its very special-use. Consider using regular
>+ * waits queues from wait.h instead *first*.
How about the following?
diff --git a/include/linux/swait.h b/include/linux/swait.h
index 4a4e180d0a35..f72f274f2a5f 100644
--- a/include/linux/swait.h
+++ b/include/linux/swait.h
@@ -9,13 +9,16 @@
/*
* Simple wait queues
*
- * While these are very similar to the other/complex wait queues (wait.h) the
- * most important difference is that the simple waitqueue allows for
- * deterministic behaviour -- IOW it has strictly bounded IRQ and lock hold
- * times.
+ * While these are very similar to regular wait queues (wait.h) the most
+ * important difference is that the simple waitqueue allows for deterministic
+ * behaviour -- IOW it has strictly bounded IRQ and lock hold times.
*
- * In order to make this so, we had to drop a fair number of features of the
- * other waitqueue code; notably:
+ * Mainly, this is accomplished by two things. Firstly not allowing swake_up_all
+ * from IRQ disabled, and dropping the lock upon every wakeup, giving a higher
+ * priority task a chance to run.
+ *
+ * Secondly, we had to drop a fair number of features of the other waitqueue
+ * code; notably:
*
* - mixing INTERRUPTIBLE and UNINTERRUPTIBLE sleeps on the same waitqueue;
* all wakeups are TASK_NORMAL in order to avoid O(n) lookups for the right
@@ -24,12 +27,14 @@
* - the exclusive mode; because this requires preserving the list order
* and this is hard.
*
- * - custom wake functions; because you cannot give any guarantees about
- * random code.
- *
- * As a side effect of this; the data structures are slimmer.
+ * - custom wake callback functions; because you cannot give any guarantees
+ * about random code. This also allows swait to be used in RT, such that
+ * raw spinlock can be used for the swait queue head.
*
- * One would recommend using this wait queue where possible.
+ * As a side effect of these; the data structures are slimmer albeit more ad-hoc.
+ * For all the above, note that simple wait queues should _only_ be used under
+ * very specific realtime constraints -- it is best to stick with the regular
+ * wait queues in most cases.
*/
struct task_struct;
On Thu, Jun 29, 2017 at 12:08 PM, Davidlohr Bueso <[email protected]> wrote:
> On Tue, 27 Jun 2017, Luis R. Rodriguez wrote:
>>
>> * As a side effect of this; the data structures are slimmer.
>> *
>> - * One would recommend using this wait queue where possible.
>> + * NOTE: swait is for cases of extreme memory considerations and some
>> very
>> + * special realtime issues, where it saves a couple of bytes in
>> structures that
>> + * need close packing. As such its very special-use. Consider using
>> regular
>> + * waits queues from wait.h instead *first*.
>
>
> How about the following?
And do we want to make it even more *special-use* by wrapping it with
#ifndef MODULE as suggested by Matthew Wilcox ?
Luis
On 06/28/2017 06:06 PM, Luis R. Rodriguez wrote:
> On Wed, Jun 28, 2017 at 12:06 AM, Lennart Poettering
> <[email protected]> wrote:
>> On Wed, 28.06.17 00:24, Luis R. Rodriguez ([email protected]) wrote:
>>> I think it was first packaged into systemd, and then it was split out to
>>> help those who want it external.
>>
>> Certainly not. I'd sure know about that. ;-)
>
> Sorry I may have confused 'intended to be at first'. Tom and Daniel
> can elaborate.
Tom helped out building the daemon as standalone project. There is no
real reason that it needs to be part of the systemd code base.
Thanks,
Daniel
On Thu, 29 Jun 2017, Luis R. Rodriguez wrote:
>And do we want to make it even more *special-use* by wrapping it with
>#ifndef MODULE as suggested by Matthew Wilcox ?
Not if we want to add any sort of module that tests swait vs wait.
Thanks,
Davidlohr
On Tue, Jun 27, 2017 at 01:30:30AM +0200, Luis R. Rodriguez wrote:
> > But the *only* reason for swait is extreme memory issues and some very
> > special realtime issues, where it saves a couple of bytes in
> > structures that need close packing, and doesn't even use normal
> > spinlocks, so it saves a couple of cycles at wakeup/sleep because it
> > doesn't do a good job in general.
Yes, its about real-time. But it shouldn't be _that_ special purpose.
So swait should be capable of most things people want from a waitqueue.
Things it explicitly does not do are things like:
- custom wake functions
- exclusive mode
- mixing different sleep types on a queue (which I would recommend
against in any case).
But only very few people need those.
> > The "avoid normal spinlocks" is because it is meant for code that is
> > *so* special that it needs the magical low-level raw spinlocks.
> > I think the two valid users are RCU (which needed it for RT), and kvm
> > (which also needed it for similar issues - it needs to be
> > non-preemptible).
Right, so we need the raw_spinlock in order to run from non-preemptible
code on RT. And we then also need bounded runtimes on stuff.
The only function which is affected by that is swake_up_all(), that does
an unbounded list iteration and is therefore required to be called from
the schedulable context (so we can drop the lock and gain bounds on the
preemption latency).
But aside of all that, it should provide 'everything' people want from a
bog standard regular waitqueue. So I don't see why people shouldn't use
it more.
In any case, I'm not seeing why you call it idiotic.
On Wed, Jul 5, 2017 at 9:18 AM, Peter Zijlstra <[email protected]> wrote:
>
> So swait should be capable of most things people want from a waitqueue.
But what's the point?
Regular wait-queues work fine. The advantages of swait aren't huge
even if you fix the crap it does now.
And the disadvantages of "another interface to do the same thing" are
big indeed.
Keep it specialized.
> Right, so we need the raw_spinlock in order to run from non-preemptible
> code on RT. And we then also need bounded runtimes on stuff.
>
> The only function which is affected by that is swake_up_all(),
No.
The fact is, "swake_up()" needs to do that "wake up all" for this
interface to be even *REMOTELY* acceptable for general use.
Seriously. If the regular "swake_up()" doesn't do what people expect
from a wakeup, then it damn well should not be used, and not be taught
to people. It's a very specialized interface for only two users, and
right now it looks like one of those two users shouldn't have used it
in the first place.
> In any case, I'm not seeing why you call it idiotic.
Have you read the problems?
There were originally three users:
- one of them is firmware loading, and it was actively *buggy* due to
using hat shit interface, and it's going away
- one of them is kvm, and for kvm that interface was shit.
- the final one is RCU, and even there it is very questionable.
There is no question: swait() is a mistake and should not be used. And
you're apparently still in denial about how completely broken it is to
have "swake_up()" have different semantics from "swake_up_all()".
Linus
On Wed, Jul 05, 2017 at 09:33:38AM -0700, Linus Torvalds wrote:
> On Wed, Jul 5, 2017 at 9:18 AM, Peter Zijlstra <[email protected]> wrote:
> > In any case, I'm not seeing why you call it idiotic.
>
> Have you read the problems?
>
> There were originally three users:
>
> - one of them is firmware loading, and it was actively *buggy* due to
> using hat shit interface, and it's going away
To be fair the issue with the firmware loading was due to a regression on the
port from using swait/completion to swait. swake_up() was used instead of
swake_up_all(). Before the port complate_all() was used instead of just
complete().
So I would personally not add the firmware API to the list of reasons why swait
would be crap.
The more *general* semantics issue you pointed out with swake_up() though seem
more reasonable to be attacking the swait API and those are best followed on
the other thread [0].
I should also point out that there are *other* issues with the firmware API on
this whole wait stuff, but that have been present since even before when we
were using the regular wait / completion API -- a wake was never issued upon an
error in some situations on some kernel builds. I have a fix for that now along
with a test case for it which I'll post soon. After all these fixes the code
works as expected with either the swait API or the good 'ol wait/completion
API. But given swait is still out by you as only specialized I did already post
patches to revert the firmware API to switch back to the wait/completion API.
[0] https://lkml.kernel.org/r/CA+55aFykNULx-b6M6FmUYdK2cn-OJKKfjaPwLN5xZGK+bioGaA@mail.gmail.com
Luis